Efficient counting of relationships in Neo4j was the cornerstone of my Master Thesis and the reason the very first GraphAware Framework module called the Relationship Count Module was born. The improvements in Neo4j 2.1 around dense nodes and the addition of getDegree(…) methods on the Node interface made me eager to do some benchmarking around relationship counts again.
The improvements in Neo4j 2.1, indeed, make the RelCount module slightly less useful. We’ve decided, however, not to make it obsolete, since it is a useful reference implementation of a GraphAware Framework Transaction-Driven Runtime Module. Instead, we ported the module to Neo4j 2.1 and reused all the performance (and functional) tests that had been already developed to learn a few things about Neo4j 2.1, which we would like to share in this short blog post.
Relationship Counting Performance
Prior to Neo4j 2.1, counting a node’s relationships was an operation with O(n) complexity, where n is the number of relationships. This was because all of the node’s relationships had to be visited in order for them to be counted. Neo4j 2.1 added a few getDegree(…) methods to the Node interface, which allows callers to ask for the degree of a node, optionally limited to a specific relationships type and/or direction.
We ran a test in which we asked for the degree of nodes with varying degree in three different scenarios: nodes on disk (no caches involved), nodes in low-level cache (memory-mapped files), and nodes in high-level cache (on the heap). The results are plotted in the following figure.
As we can see, the method runs in constant time (O(1)) when data is in high level cache, which is a massive improvement over O(n). For the other two scenarios, the curve is closer to linear time complexity. This means that Neo4j users will see some performance improvements, but also that the Relationship Count Module module might not be as obsolete as it first seemed. Please refer to its documentation for more benchmarks.
Degree of Nodes with Loops
The second learning point is related to functional rather than performance aspect of the newly added API methods. In graph theory, a loop (i.e., a relationship of a node with itself) counts as 2 towards the total degree of the node. However, in Neo4j 2.1, getDegree() and getDegree(RelationshipType) will only count loops as 1. This is a “gotcha” that could have been guessed by looking at the Javadoc:
/** * Returns the number of relationships connected to this node regardless of * direction or type. This operation is always O(1). * @return the number of relationships connected to this node. */ public int getDegree();
A loop seems to be regarded by Neo4j as a single relationship “connected to this node”. Luckily, the workaround isn’t too hard. If you need a true degree and there are loops present in your graph, just sum up the outdegree and indegree, which still is an O(1) operation. Have a look at the following test:
@TestpublicvoidloopsAndDegrees(){GraphDatabaseServicedatabase=newTestGraphDatabaseFactory().newImpermanentDatabasetry(Transactiontx=database.beginTx()){Nodenode=database.createNodenode.createRelationshipTo(node,DynamicRelationshipType.withName("TEST"tx.success}try(Transactiontx=database.beginTx()){Nodenode=database.getNodeById(0assertEquals(1,node.getDegree(INCOMINGassertEquals(1,node.getDegree(OUTGOINGassertEquals(1,node.getDegree());//not really expected!assertEquals(2,node.getDegree(OUTGOING)+node.getDegree(INCOMING//use this insteadtx.success}}
Conclusion
Neo4j 2.1 improves relationship counting performance for data in high-level cache. GraphAware Relationship Count Module is still useful, however, for data in low-level cache and on disk, and for scenarios in which relationships need to be counted based on a property value (potentially on top of relationship type and/or direction). To find out a node’s degree in Neo4j 2.1 in the presence of loops, sum incoming and outgoing relationships rather than using getDegree() directly.