Power a Github Notification Bot for Issue Reviewers with Graph Based NLP

September 6, 2016 · 16 min read

Without question, Github is the biggest code sharing platform on the planet. With more than 14 millions users and 35 million repositories, the insights you can discover by analyzing the data available through its API are surprising and revealing.

I’ve been very passionate about this data for a long time, as much as I am about Neo4j. Importing this data into a graph database can reveal interesting new information.

Two years ago I created a Github Gist with some interesting queries on that dataset. I showed many more queries at several conferences since.

Recently, I came across mention-bot from Facebook, which analyses the git blame of the files of a Pull Request in order to notify (via @mention) potential reviewers.

It uses two heuristics :

If a line was deleted or modified, the person that last touched that line is likely going to care about this pull request.
If a person last touched many lines in the file where the change was made, they will want to be notified.

I got inspired by the approach taken by Facebook. Because I’ve been running a continuous importer of all Github events into Neo4j for a long time, I have a lot of GitHub data to work with.

I was curious about the opportunity of using our Natural Language Processing plugin for Neo4j recently created by my colleague, friend and Chief Scientist of GraphAware Dr. Alessandro Negro, to apply it to the text descriptions of Github Pull Requests and Issues.

Our approach

We will import the last 100 Pull Requests of our favorite Github repository, Neo4j with their title, body, author and files.

We will then process the above text through an NLP pipeline storing the processing results in the graph which, thanks to the nature of graphs, will allow us to relate already existing code file nodes to text entities.

Then after some scoring queries, we will create a new GitHub Issue and find potential reviewers based on:

The tags extracted from the new issue’s text using NLP
The similarity between such tags and the tags extracted from previous pull requests
The files affected by those previous Pull Requests
The activity frequency of the users on the files found with the above

Importing the data

Thanks to the amazing work of Michael Hunger and the Neo4j community, the APOC procedures library offers an easy way to import data from public API’s directly via Cypher.

Let’s go!

Creating unique constraints

CREATE CONSTRAINT ON (user:User) ASSERT user.id IS UNIQUE
CREATE CONSTRAINT ON (repository:Repository) ASSERT repository.id IS UNIQUE
CREATE CONSTRAINT ON (file:File) ASSERT file.path IS UNIQUE
CREATE CONSTRAINT ON (pullrequest:PullRequest) ASSERT pullrequest.id IS UNIQUE

Importing the last 100 pull requests

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Frequency of contribution

The idea is to create a direct relationship, between a user and a repository, with a property representing the frequency of contributons:

MATCH (r:Repository {full_name:"neo4j/neo4j"})<-[:PR_TO_REPO]-(pr)<-[:OPENED_PR]-(user)
WITH r, user, count(*) as freq
MERGE (user)-[c:CONTRIBUTED_TO]->(r) SET c.freq = freq

Once the last statement has finished, you can already visualize a graph of contributors to the Neo4j repository

Importing Pull Request files

The final statement will import the files of those Pull Requests and create the repository hierarchy. Replace the XXX in the api url with your GitHub client id and secret if you have one :

MATCH (pr:PullRequest)
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls/" + pr.number + "/files?client_id=XXX&client_secret=XXX")
YIELD value as file
WITH pr, file as fileInfo, split(file.filename,"/") AS elts
UNWIND range(0,size(elts)-1) AS i
MERGE (prev:File {path: "neo4j/neo4j_" + reduce(s="",x IN range(0,i-2) | s+ elts[x]+"/") + CASE i WHEN 0 THEN "" ELSE elts[i-1] END}) 
SET prev.name = CASE i WHEN 0 THEN "." ELSE reduce(s="",x IN range(0,i-2) | s+ elts[x]+"/") + elts[i-1] END 
MERGE (file:File {path: "neo4j/neo4j_" + reduce(s="",x IN range(0,i-1) | s+ elts[x]+"/") + elts[i]})
SET file.name = reduce(s="",x IN range(0,i-1) | s+ elts[x]+"/") + elts[i]
MERGE (file)-[:PARENT]->(prev)
WITH "neo4j/neo4j_" + reduce(s="",x IN range(0,size(elts)-2) | s+ elts[x]+"/") + elts[-1] AS la, pr, fileInfo
MATCH (f:File {path:la})
SET f:FileRecord
MERGE (f)<-[r:TOUCHED_FILE]-(pr) SET r.changes = fileInfo.changes

The last part of the query is more about iterating over a text split on a file path in order to recreate the file tree structure in the graph:

Processing the NLP pipeline

Using the Natural Language Processing plugin is just about calling a procedure on the given text and creating a relationship between the Pull Request and the newly created AnnotatedText:

MATCH (n:PullRequest)
CALL ga.nlp.annotate({text: n.title + " " + n.body, id: id(n), nlpDepth:0})
YIELD result
MERGE (n)-[:ANNOTATED_TEXT]->(result)

The extraction is automatically persisted in the graph and relates sentences and tags to Pull Requests, for example:

By just browsing your database, you can identifty which tags (big pink nodes) are related to a pull request.

Scoring tags on pull requests

Here we will take a simple scoring approach- for each pull request we will find the extracted tags and create a score where:

score = tag occurences in this pr / (total occurences of the tag * number of tags for this PR)

We take how many times the tag occurs in this particular pull request and we divide this number by the mulitplication of :

a) how many times this tag appears for pull requests b) how many tags are tagging this particular pull request

The corresponding Cypher query :

MATCH (pr:PullRequest)-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
// no tags yet
WHERE size((pr)<-[:TAGS_PR]-()) = 0
WITH pr, count(distinct(tag)) as total
MATCH (pr)-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WITH pr, tag, count(*) as freq, total
ORDER BY freq DESC
WITH pr, tag, toFloat((freq * 1.0f)  / (size((tag)<-[:HAS_TAG]-()) * total)) as score, total
MERGE (tag)-[r:TAGS_PR]->(pr) SET r.score = score
RETURN tag.value, score

When the query is completed, an interesting question to ask is to take a file and find its corresponding tags based on the pull requests, NLP tags and the score:

WITH "neo4j/neo4j_community/kernel/src/main/java/org/neo4j/kernel/impl/api/KernelTransactions.java" as path
MATCH (f:FileRecord {path: path})<-[:TOUCHED_FILE]-(pr)<-[r:TAGS_PR]-(tag)
WITH tag, collect(r) as occurences
RETURN tag.value, size(occurences) + reduce(x=0, o IN occurences | x + o.score) as score
ORDER BY score DESC

Result :

╒══════════════════════════════╤══════════════════╕
│tag.value                     │score             │
╞══════════════════════════════╪══════════════════╡
│transaction                   │4.0154783125371365│
├──────────────────────────────┼──────────────────┤
│commit                        │3.0050662878787877│
├──────────────────────────────┼──────────────────┤
│can                           │3.0026903651903654│
├──────────────────────────────┼──────────────────┤
│make                          │3.0020750988142293│
├──────────────────────────────┼──────────────────┤
│external                      │2.028030303030303 │
├──────────────────────────────┼──────────────────┤
│guard                         │2.0244444444444443│
├──────────────────────────────┼──────────────────┤
│execution                     │2.0194444444444444│
├──────────────────────────────┼──────────────────┤
│guarantee                     │2.018308080808081 │
├──────────────────────────────┼──────────────────┤
│logical                       │2.016919191919192 │
├──────────────────────────────┼──────────────────┤
│note                          │2.0112794612794613│
├──────────────────────────────┼──────────────────┤
│during                        │2.00915404040404  │
├──────────────────────────────┼──────────────────┤
│right                         │2.008459595959596 │
├──────────────────────────────┼──────────────────┤
│slave                         │2.0075396825396825│
├──────────────────────────────┼──────────────────┤
│kernel                        │2.0063492063492063│
├──────────────────────────────┼──────────────────┤
│get                           │2.0061026936026938│
├──────────────────────────────┼──────────────────┤
│instance                      │2.0056277056277056│
├──────────────────────────────┼──────────────────┤
│what                          │2.004513888888889 │
├──────────────────────────────┼──────────────────┤
│through                       │2.004166666666667 │
├──────────────────────────────┼──────────────────┤

Without having seen a single line of code in this file, we already know that this file is about transaction, guard execution, kernel and guarantee. These terms quite precisely represent the purpose of the file.

A visual comparison is really surprising :

Actually, with the help of Neo4j and NLP, we made the code talk to us.

Creating a new issue

The next step will represent a user creating a new issue and we will process it using the NLP pipeline. We will take a real issue about escaping output of the constraints procedure reported on the Neo4j repository:

MATCH (repo:Repository {full_name:"neo4j/neo4j"})
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/issues/7634")
YIELD value as issue
CREATE (i:Issue {number: issue.number, title: issue.title, body: issue.body})
MERGE (i)-[:ISSUE_ON_REPO]->(repo)

Now we process the NLP pipeline:

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

which will produce the same extraction format as for pull requests:

Finding potential reviewers

Starting from the issue, we will incrementally run Cypher queries adding more complexity at each step:

Note here that we are not interested in tags having less than 5 characters

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Result :

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

From the found Tags, you can find the related previous Pull Requests as well as their Authors:

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Result:

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

The next step is to combine the occurences of the tag in the issue with the score property on the TAGS_PR relationship :

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Result :

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Now, we will aggregate the tags by user and return the top potential users :

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Result :

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Finally, let’s combine the frequency score of the user together with the current score :

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Result :

MATCH (r:Repository {full_name:"neo4j/neo4j"})<-[:PR_TO_REPO]-(pr)<-[:OPENED_PR]-(user)
WITH r, user, count(*) as freq
MERGE (user)-[c:CONTRIBUTED_TO]->(r) SET c.freq = freq

If you look at the Commits referencing the Issue on GitHub and view the Pull Request, the Author of the Pull Request validates our suggested results.

The next step will be to connect to the GitHub API with your bot account and mention the three first users of the list of results. An immediate improvement can be to cut off results depending on their score in comparison to the standard deviation of the overall scores.

Going further by suggesting pull requests to reviewers

The goal described in the beginning of the blog post is now achieved, people will be mentioned when they are good candidates to review an issue.

In order to help them, we might want to suggest to them pull requests that they can potentially refer to, fix and confirm the new issue.

The Natural Language Processing plugin offers a search procedure call to which you can provide a text to search for (here, the text of the issue) and it returns documents (nodes) similar to the given text along with a relevance score.

The internal query of the NLP search procedure looks like this :

MATCH (r:Repository {full_name:"neo4j/neo4j"})<-[:PR_TO_REPO]-(pr)<-[:OPENED_PR]-(user)
WITH r, user, count(*) as freq
MERGE (user)-[c:CONTRIBUTED_TO]->(r) SET c.freq = freq

Let’s use it and see the results :

MATCH (r:Repository {full_name:"neo4j/neo4j"})<-[:PR_TO_REPO]-(pr)<-[:OPENED_PR]-(user)
WITH r, user, count(*) as freq
MERGE (user)-[c:CONTRIBUTED_TO]->(r) SET c.freq = freq

Result :

MATCH (r:Repository {full_name:"neo4j/neo4j"})<-[:PR_TO_REPO]-(pr)<-[:OPENED_PR]-(user)
WITH r, user, count(*) as freq
MERGE (user)-[c:CONTRIBUTED_TO]->(r) SET c.freq = freq

Conclusion

A graph database like Neo4j offers you the flexibiltiy to store data in its natural shape and view it from multiple angles in order to gain new insights. Combined with NLP we proved today that the only limit to the possibilities is your imagination.

For the sake of the blog post, we removed some steps that we used in our continuous integration, like :

Fetching comments of issues and pull requests
Text parsing for detecting commit SHA references in order to relate pull requests closing issues, where you can then combine the text from the issue and the pull request with different scores
Issue and Pull request labels which are valuable extra tags

You would have also noticed that the NLP extraction yields tags such as have, would and the like which bring little value to the language we use on Github. This is why we are currently tuning the pipeline in order to detect and ignore such tags and are working on the capability to define your own stopwords to be excluded during the NLP pipeline process.

The NLP plugin is going to be open-sourced under GPL in the future and we would like to make sure it is production ready with private beta testers. If you’re interested to know more or see its usage in action, please get in touch.

If you’re attending GraphConnect in San Francisco in October this year, or in London next year, make sure to stop by our booth!

Meet the authors

Christophe Willemsen

Technology & Infrastructure

Christophe Willemsen brings almost 15 years of experience as a Communication Systems Engineer for the Belgian Navy to the table. Highly skilled in software engineering, he is one of the primary authors of Hume and an active participant in the Neo4j community, dedicating his efforts towards driving the development of cutting-edge technologies in the field.

Power a Github Notification Bot for Issue Reviewers with Graph Based NLP

Our approach

Importing the data

Creating unique constraints

Importing the last 100 pull requests

Frequency of contribution

Importing Pull Request files

Processing the NLP pipeline

Scoring tags on pull requests

Creating a new issue

Finding potential reviewers

Find related tags to the issue and their number of occurences

Going further by suggesting pull requests to reviewers

Conclusion