Power a Github Notification Bot for Issue Reviewers with Graph Based NLP

· 13 min read

Without question, Github is the biggest code sharing platform on the planet. With more than 14 millions users and 35 million repositories, the insights you can discover by analyzing the data available through its API are surprising and revealing.

I’ve been very passionate about this data for a long time, as much as I am about Neo4j. Importing this data into a graph database can reveal interesting new information.

Two years ago I created a Github Gist with some interesting queries on that dataset. I showed many more queries at several conferences since.

Recently, I came across mention-bot from Facebook, which analyses the git blame of the files of a Pull Request in order to notify (via @mention) potential reviewers.

It uses two heuristics :

  • If a line was deleted or modified, the person that last touched that line is likely going to care about this pull request.

  • If a person last touched many lines in the file where the change was made, they will want to be notified.

I got inspired by the approach taken by Facebook. Because I’ve been running a continuous importer of all Github events into Neo4j for a long time, I have a lot of GitHub data to work with.

I was curious about the opportunity of using our Natural Language Processing plugin for Neo4j recently created by my colleague, friend and Chief Scientist of GraphAware Dr. Alessandro Negro, to apply it to the text descriptions of Github Pull Requests and Issues.

Our approach

We will import the last 100 Pull Requests of our favorite Github repository, Neo4j with their title, body, author and files.

We will then process the above text through an NLP pipeline storing the processing results in the graph which, thanks to the nature of graphs, will allow us to relate already existing code file nodes to text entities.

Then after some scoring queries, we will create a new GitHub Issue and find potential reviewers based on:

  • The tags extracted from the new issue’s text using NLP
  • The similarity between such tags and the tags extracted from previous pull requests
  • The files affected by those previous Pull Requests
  • The activity frequency of the users on the files found with the above

Importing the data

Thanks to the amazing work of Michael Hunger and the Neo4j community, the APOC procedures library offers an easy way to import data from public API’s directly via Cypher.

Let’s go!

Creating unique constraints

CREATE CONSTRAINT ON (user:User) ASSERT user.id IS UNIQUE
CREATE CONSTRAINT ON (repository:Repository) ASSERT repository.id IS UNIQUE
CREATE CONSTRAINT ON (file:File) ASSERT file.path IS UNIQUE
CREATE CONSTRAINT ON (pullrequest:PullRequest) ASSERT pullrequest.id IS UNIQUE

Importing the last 100 pull requests

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at,
pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Frequency of contribution

The idea is to create a direct relationship, between a user and a repository, with a property representing the frequency of contributons:

MATCH (r:Repository {full_name:"neo4j/neo4j"})<-[:PR_TO_REPO]-(pr)<-[:OPENED_PR]-(user)
WITH r, user, count(*) as freq
MERGE (user)-[c:CONTRIBUTED_TO]->(r) SET c.freq = freq

Once the last statement has finished, you can already visualize a graph of contributors to the Neo4j repository

Importing Pull Request files

The final statement will import the files of those Pull Requests and create the repository hierarchy. Replace the XXX in the api url with your GitHub client id and secret if you have one :

MATCH (pr:PullRequest)
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls/" + pr.number + "/files?client_id=XXX&client_secret=XXX")
YIELD value as file
WITH pr, file as fileInfo, split(file.filename,"/") AS elts
UNWIND range(0,size(elts)-1) AS i
MERGE (prev:File {path: "neo4j/neo4j_" + reduce(s="",x IN range(0,i-2) | s+ elts[x]+"/") + CASE i WHEN 0 THEN "" ELSE elts[i-1] END}) 
SET prev.name = CASE i WHEN 0 THEN "." ELSE reduce(s="",x IN range(0,i-2) | s+ elts[x]+"/") + elts[i-1] END 
MERGE (file:File {path: "neo4j/neo4j_" + reduce(s="",x IN range(0,i-1) | s+ elts[x]+"/") + elts[i]})
SET file.name = reduce(s="",x IN range(0,i-1) | s+ elts[x]+"/") + elts[i]
MERGE (file)-[:PARENT]->(prev)
WITH "neo4j/neo4j_" + reduce(s="",x IN range(0,size(elts)-2) | s+ elts[x]+"/") + elts[-1] AS la, pr, fileInfo
MATCH (f:File {path:la})
SET f:FileRecord
MERGE (f)<-[r:TOUCHED_FILE]-(pr) SET r.changes = fileInfo.changes

The last part of the query is more about iterating over a text split on a file path in order to recreate the file tree structure in the graph:

Processing the NLP pipeline

Using the Natural Language Processing plugin is just about calling a procedure on the given text and creating a relationship between the Pull Request and the newly created AnnotatedText:

MATCH (n:PullRequest)
CALL ga.nlp.annotate({text: n.title + " " + n.body, id: id(n), nlpDepth:0})
YIELD result
MERGE (n)-[:ANNOTATED_TEXT]->(result)

The extraction is automatically persisted in the graph and relates sentences and tags to Pull Requests, for example:

By just browsing your database, you can identifty which tags (big pink nodes) are related to a pull request.

Scoring tags on pull requests

Here we will take a simple scoring approach- for each pull request we will find the extracted tags and create a score where:

score = tag occurences in this pr / (total occurences of the tag * number of tags for this PR)

We take how many times the tag occurs in this particular pull request and we divide this number by the mulitplication of :

a) how many times this tag appears for pull requests b) how many tags are tagging this particular pull request

The corresponding Cypher query :

MATCH (pr:PullRequest)-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
// no tags yet
WHERE size((pr)<-[:TAGS_PR]-()) = 0
WITH pr, count(distinct(tag)) as total
MATCH (pr)-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WITH pr, tag, count(*) as freq, total
ORDER BY freq DESC
WITH pr, tag, toFloat((freq * 1.0f)  / (size((tag)<-[:HAS_TAG]-()) * total)) as score, total
MERGE (tag)-[r:TAGS_PR]->(pr) SET r.score = score
RETURN tag.value, score

When the query is completed, an interesting question to ask is to take a file and find its corresponding tags based on the pull requests, NLP tags and the score:

WITH "neo4j/neo4j_community/kernel/src/main/java/org/neo4j/kernel/impl/api/KernelTransactions.java" as path
MATCH (f:FileRecord {path: path})<-[:TOUCHED_FILE]-(pr)<-[r:TAGS_PR]-(tag)
WITH tag, collect(r) as occurences
RETURN tag.value, size(occurences) + reduce(x=0, o IN occurences | x + o.score) as score
ORDER BY score DESC

Result :

╒══════════════════════════════╤══════════════════╕
│tag.value                     │score             │
╞══════════════════════════════╪══════════════════╡
│transaction                   │4.0154783125371365│
├──────────────────────────────┼──────────────────┤
│commit                        │3.0050662878787877│
├──────────────────────────────┼──────────────────┤
│can                           │3.0026903651903654│
├──────────────────────────────┼──────────────────┤
│make                          │3.0020750988142293│
├──────────────────────────────┼──────────────────┤
│external                      │2.028030303030303 │
├──────────────────────────────┼──────────────────┤
│guard                         │2.0244444444444443│
├──────────────────────────────┼──────────────────┤
│execution                     │2.0194444444444444│
├──────────────────────────────┼──────────────────┤
│guarantee                     │2.018308080808081 │
├──────────────────────────────┼──────────────────┤
│logical                       │2.016919191919192 │
├──────────────────────────────┼──────────────────┤
│note                          │2.0112794612794613│
├──────────────────────────────┼──────────────────┤
│during                        │2.00915404040404  │
├──────────────────────────────┼──────────────────┤
│right                         │2.008459595959596 │
├──────────────────────────────┼──────────────────┤
│slave                         │2.0075396825396825│
├──────────────────────────────┼──────────────────┤
│kernel                        │2.0063492063492063│
├──────────────────────────────┼──────────────────┤
│get                           │2.0061026936026938│
├──────────────────────────────┼──────────────────┤
│instance                      │2.0056277056277056│
├──────────────────────────────┼──────────────────┤
│what                          │2.004513888888889 │
├──────────────────────────────┼──────────────────┤
│through                       │2.004166666666667 │
├──────────────────────────────┼──────────────────┤

Without having seen a single line of code in this file, we already know that this file is about transaction, guard execution, kernel and guarantee. These terms quite precisely represent the purpose of the file.

A visual comparison is really surprising :

Actually, with the help of Neo4j and NLP, we made the code talk to us.

Creating a new issue

The next step will represent a user creating a new issue and we will process it using the NLP pipeline. We will take a real issue about escaping output of the constraints procedure reported on the Neo4j repository:

MATCH (repo:Repository {full_name:"neo4j/neo4j"})
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/issues/7634")
YIELD value as issue
CREATE (i:Issue {number: issue.number, title: issue.title, body: issue.body})
MERGE (i)-[:ISSUE_ON_REPO]->(repo)

Now we process the NLP pipeline:

MATCH (i:Issue {number: 7634})
CALL ga.nlp.annotate({text: i.title + " " + i.body, id: id(i), nlpDepth:0})
YIELD result
MERGE (i)-[:ANNOTATED_TEXT]->(result)

which will produce the same extraction format as for pull requests:

Finding potential reviewers

Starting from the issue, we will incrementally run Cypher queries adding more complexity at each step:

Note here that we are not interested in tags having less than 5 characters

MATCH (i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
RETURN tag.value as tag, count(*) as occurences
ORDER BY occurences DESC

Result :

╒══════════╤══════════╕
│tag       │occurences│
╞══════════╪══════════╡
│UNIQUE    │3         │
├──────────┼──────────┤
│label     │3         │
├──────────┼──────────┤
│property  │2         │
├──────────┼──────────┤
│neo4j     │2         │
├──────────┼──────────┤
│CONSTRAINT│2         │
├──────────┼──────────┤
│constraint│2         │
├──────────┼──────────┤
│assert    │2         │
├──────────┼──────────┤
│create    │2         │
├──────────┼──────────┤

[...] result truncated

From the found Tags, you can find the related previous Pull Requests as well as their Authors:

MATCH (i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
WITH tag, count(*) as occurences
MATCH (tag)-[r:TAGS_PR]->(pull)<-[:OPENED_PR]-(author)
RETURN tag.value as tag, author.login, count(*) as freq
ORDER BY freq DESC

Result:

╒══════════╤═══════════════╤════╕
│tag       │author.login   │freq│
╞══════════╪═══════════════╪════╡
│index     │MishaDemianenko│5   │
├──────────┼───────────────┼────┤
│which     │tinwelint      │5   │
├──────────┼───────────────┼────┤
│where     │tinwelint      │4   │
├──────────┼───────────────┼────┤
│should    │systay         │3   │
├──────────┼───────────────┼────┤
│property  │MishaDemianenko│3   │
├──────────┼───────────────┼────┤
│match     │pontusmelke    │3   │
├──────────┼───────────────┼────┤
│label     │Mats-SX        │2   │
├──────────┼───────────────┼────┤
│neo4j     │lutovich       │2   │
├──────────┼───────────────┼────┤
│should    │Mats-SX        │2   │
├──────────┼───────────────┼────┤
│property  │Mats-SX        │2   │
├──────────┼───────────────┼────┤
│which     │glennsarti     │2   │
├──────────┼───────────────┼────┤
│where     │pontusmelke    │2   │
├──────────┼───────────────┼────┤
│should    │lutovich       │2   │
├──────────┼───────────────┼────┤
│index     │tinwelint      │2   │
├──────────┼───────────────┼────┤
│label     │MishaDemianenko│2   │
[...] result truncated

The next step is to combine the occurences of the tag in the issue with the score property on the TAGS_PR relationship :

MATCH (i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
WITH tag, count(*) as occurences
MATCH (tag)-[r:TAGS_PR]->(pull)<-[:OPENED_PR]-(author)
RETURN tag.value as tag, author.login, occurences * sum(r.score) as score
ORDER BY score DESC

Result :

╒══════════╤═══════════════╤═════════════════════╕
│tag       │author.login   │score                │
╞══════════╪═══════════════╪═════════════════════╡
│neo4j     │pe4cey         │0.125                │
├──────────┼───────────────┼─────────────────────┤
│version   │pe4cey         │0.125                │
├──────────┼───────────────┼─────────────────────┤
│UNIQUE    │pontusmelke    │0.10714285714285714  │
├──────────┼───────────────┼─────────────────────┤
│property  │Mats-SX        │0.08571428571428572  │
├──────────┼───────────────┼─────────────────────┤
│assert    │pontusmelke    │0.07142857142857142  │
├──────────┼───────────────┼─────────────────────┤
│CONSTRAINT│pontusmelke    │0.07142857142857142  │
├──────────┼───────────────┼─────────────────────┤
│create    │Mats-SX        │0.05                 │
├──────────┼───────────────┼─────────────────────┤
│around    │chrisvest      │0.05                 │
├──────────┼───────────────┼─────────────────────┤
│label     │MishaDemianenko│0.03953823953823954  │
├──────────┼───────────────┼─────────────────────┤
│property  │MishaDemianenko│0.038644688644688646 │
├──────────┼───────────────┼─────────────────────┤
│constraint│MishaDemianenko│0.037037037037037035 │
├──────────┼───────────────┼─────────────────────┤
│label     │Mats-SX        │0.03638127577007034  │
├──────────┼───────────────┼─────────────────────┤
│constraint│pontusmelke    │0.03571428571428571  │
├──────────┼───────────────┼─────────────────────┤
[...] result truncated

Now, we will aggregate the tags by user and return the top potential users :

MATCH (i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
WITH tag, count(*) as occurences
MATCH (tag)-[r:TAGS_PR]->(pull)<-[:OPENED_PR]-(author)
WITH tag, author.login as user, occurences * sum(r.score) as score
RETURN user, sum(score) as score

Result :

╒═══════════════╤════════════════════╕
│user           │score               │
╞═══════════════╪════════════════════╡
│pontusmelke    │0.39697590421274626 │
├───────────────┼────────────────────┤
│Mats-SX        │0.19041057979937437 │
├───────────────┼────────────────────┤
│MishaDemianenko│0.18433177933177933 │
├───────────────┼────────────────────┤
│systay         │0.03407509157509157 │
├───────────────┼────────────────────┤
│chrisvest      │0.09658185529709445 │
├───────────────┼────────────────────┤
│glennsarti     │0.028600823045267492│
├───────────────┼────────────────────┤
│davidegrohmann │0.005917159763313609│
├───────────────┼────────────────────┤
│jakewins       │0.003108003108003108│
├───────────────┼────────────────────┤
│thobe          │0.02110805860805861 │
├───────────────┼────────────────────┤
│tinwelint      │0.10783658701133339 │
├───────────────┼────────────────────┤
[...] result truncated

Finally, let’s combine the frequency score of the user together with the current score :

MATCH (repo:Repository {full_name:"neo4j/neo4j"})<-[:ISSUE_ON_REPO]-(i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
WITH repo, tag, count(*) as occurences
MATCH (tag)-[r:TAGS_PR]->(pull)<-[:OPENED_PR]-(author)
WITH repo, tag, author as user, occurences * sum(r.score) as score
MATCH (user)-[r:CONTRIBUTED_TO]->(repo)
RETURN user.login, sum(score) * ( r.freq * 1.0f / size((repo)<-[:CONTRIBUTED_TO]-()) ) as score
ORDER BY score DESC LIMIT 10

Result :

╒═══════════════╤═════════════════════╕
│user.login     │score                │
╞═══════════════╪═════════════════════╡
│pontusmelke    │0.21502861478190424  │
├───────────────┼─────────────────────┤
│MishaDemianenko│0.09984638047138046  │
├───────────────┼─────────────────────┤
│Mats-SX        │0.07933774158307266  │
├───────────────┼─────────────────────┤
│chrisvest      │0.048290927648547234 │
├───────────────┼─────────────────────┤
│tinwelint      │0.04043872012925002  │
├───────────────┼─────────────────────┤
│lutovich       │0.021550774328552104 │
├───────────────┼─────────────────────┤
│pe4cey         │0.020833333333333332 │
├───────────────┼─────────────────────┤
│systay         │0.005679181929181929 │
├───────────────┼─────────────────────┤
│glennsarti     │0.002383401920438957 │
├───────────────┼─────────────────────┤
│digitalstain   │0.0015988874182151494│
└───────────────┴─────────────────────┘

If you look at the Commits referencing the Issue on GitHub and view the Pull Request, the Author of the Pull Request validates our suggested results.

The next step will be to connect to the GitHub API with your bot account and mention the three first users of the list of results. An immediate improvement can be to cut off results depending on their score in comparison to the standard deviation of the overall scores.

Going further by suggesting pull requests to reviewers

The goal described in the beginning of the blog post is now achieved, people will be mentioned when they are good candidates to review an issue.

In order to help them, we might want to suggest to them pull requests that they can potentially refer to, fix and confirm the new issue.

The Natural Language Processing plugin offers a search procedure call to which you can provide a text to search for (here, the text of the issue) and it returns documents (nodes) similar to the given text along with a relevance score.

The internal query of the NLP search procedure looks like this :

MATCH (doc:AnnotatedText)
WITH count(doc) as documentsCount
MATCH (t:Tag)
WHERE t.value IN {searchTokens}
WITH t, documentsCount, {querySize} as queryTagsCount
MATCH (t)<-[:HAS_TAG]-(:Sentence)<-[]-(document:AnnotatedText)
WITH t, count(distinct document) as documentsCountForTag, documentsCount, queryTagsCount
MATCH (t)<-[ht:HAS_TAG]-(sentence:Sentence)<-[]-(at:AnnotatedText)
WITH DISTINCT at, t.value as value, sum(ht.tf)*(1 + log((1.0f*documentsCount)/(documentsCountForTag + 1)))* (1.0f/at.numTerms^0.5f) as sum, queryTagsCount
RETURN at, (1.0f*size(collect(value))/queryTagsCount)*(sum(sum)) as score

Let’s use it and see the results :

MATCH (n:Issue {number: 7634})
CALL ga.nlp.search(n.title + " " + n.body) YIELD result, score
MATCH (result)<-[:ANNOTATED_TEXT]-(pull:PullRequest)
RETURN pull.title, score
ORDER BY score DESC LIMIT 10

Result :

╒═════════════════════════════════════════════════╤═════════════════════╕
│pull.title                                       │score                │
╞═════════════════════════════════════════════════╪═════════════════════╡
│Escape when listing constraints containing `:`   │7.645201712255016    │
├─────────────────────────────────────────────────┼─────────────────────┤
│Revamped parallelism of parts of batch importer  │1.3149300137481459   │
├─────────────────────────────────────────────────┼─────────────────────┤
│Apply predicates in VarExpand, not in filter     │0.5192339447267563   │
├─────────────────────────────────────────────────┼─────────────────────┤
│Returns index partitions in sorted order         │0.4926431770732637   │
├─────────────────────────────────────────────────┼─────────────────────┤
│3.0 using label scan store for uniqueness constra│0.40473910166978916  │
│ints (Prototype) (queue all updates)             │                     │
├─────────────────────────────────────────────────┼─────────────────────┤
│Open schema and label indexes in read only mode i│0.38958225594380896  │
│n case database started in read only mode        │                     │
├─────────────────────────────────────────────────┼─────────────────────┤
│CsvInput parses data in parallel                 │0.375735879917604    │
├─────────────────────────────────────────────────┼─────────────────────┤
│Fix theoretical data-race in SequenceLock        │0.3730853451635107   │
├─────────────────────────────────────────────────┼─────────────────────┤
│Complete Procedure documentation                 │0.351140524268717    │
├─────────────────────────────────────────────────┼─────────────────────┤
[...] result truncated

Conclusion

A graph database like Neo4j offers you the flexibiltiy to store data in its natural shape and view it from multiple angles in order to gain new insights. Combined with NLP we proved today that the only limit to the possibilities is your imagination.

For the sake of the blog post, we removed some steps that we used in our continuous integration, like :

  • Fetching comments of issues and pull requests
  • Text parsing for detecting commit SHA references in order to relate pull requests closing issues, where you can then combine the text from the issue and the pull request with different scores
  • Issue and Pull request labels which are valuable extra tags

You would have also noticed that the NLP extraction yields tags such as have, would and the like which bring little value to the language we use on Github. This is why we are currently tuning the pipeline in order to detect and ignore such tags and are working on the capability to define your own stopwords to be excluded during the NLP pipeline process.

The NLP plugin is going to be open-sourced under GPL in the future and we would like to make sure it is production ready with private beta testers. If you’re interested to know more or see its usage in action, please get in touch.

If you’re attending GraphConnect in San Francisco in October this year, or in London next year, make sure to stop by our booth!

Christophe Willemsen

Technology & Infrastructure | Neo4j certification

Christophe Willemsen brings almost 15 years of experience as a Communication Systems Engineer for the Belgian Navy to the table. Highly skilled in software engineering, he is one of the primary authors of Hume and an active participant in the Neo4j community, dedicating his efforts towards driving the development of cutting-edge technologies in the field.