Towards OLAP in Graph Databases (MSc. Thesis)
· 2 min read
I have just finished a year-long MSc. program in Computing at Imperial College London. My thesis was called GraphAware: Towards Online Analytical Processing in Graph Databases, which you can freely download. It’s not an easy, cover-to-cover read, but there might be some interesting parts, even if you don’t go through all the (over 100) pages.
First of all, the overview of Neo4j architecture in Section 2.4 could be helpful for trying to understand how Neo4j works and, in particular, how it stores its data. Section 5.1 is a good introduction to the GraphAware Framework’s architecture for people who would like to dig deeper into it or, for instance, write their own GraphAware modules. Finally, section 6.2 shows the considerations one should take into account when testing performance of Neo4j, especially with respect to cache configurations.
Abstract
Graph databases are becoming increasingly popular as an alternative to relational databases for managing complex, densely-connected, semi-structured data. Whilst primarily optimised for online transactional processing, graph databases would greatly benefit from online analytical processing capabilities. Since relational databases were introduced over four decades ago, they have acquired online analytical processing facilities; this is not the case with graph databases, which have only drawn mainstream attention in the past few years.
In this project, we study the problem of online analytical processing in graph databases that use the property graph data model, which is a graph with properties attached to both vertices and edges. We use vertex degree analysis as a simple example problem, create a formal definition of vertex degree in a property graph, and develop a theoretical vertex degree cache with constant space and read time complexity, enabled by a cache compaction operation and a property change frequency heuristic.
We then apply the theory to Neo4j, an open-source property graph database, by developing a Relationship Count Module, which implements the theoretical vertex degree caching. We also design and implement a framework, called GraphAware, which provides supporting functionality for the module and serves as a platform for additional development, particularly of modules that store and maintain graph metadata.
Finally, we show that for certain use cases, for example those in which vertices have relatively high degrees and edges are created in separate transactions, vertex degree analysis can be performed several orders of magnitude faster, whilst sacrificing less than 20% of the write throughput, when using GraphAware Framework with the Relationship Count Module.
By demonstrating the extent of possible performance improvements, exposing the true complexity of a seemingly simple problem, and providing a starting point for future analysis and module development, we take an important step towards online analytical processing in graph databases.