A common question when planning and designing your Neo4j Graph Database is how to handle “flagged” entities. This could include users that are active, blog posts that are published, news articles that have been read, etc.
Introduction
In the SQL world, you would typically create a a boolean|tinyint
column; in Neo4j, the same can be achieved in the following two ways:
- A flagged indexed property
- A dedicated label
Having faced this design dilemma a number of times, we would like to share our experience with the two presented possibilities and some Cypher query optimizations that will help you take a full advantage of a the graph database.
Throughout the blog post, we’ll use the following example scenario:
- We have
User
nodes - User
FOLLOWS
other users - Each user writes multiple blog posts stored as
BlogPost
nodes - Some of the blog posts are
drafted
, others arepublished
(active
)
Setting up
If we want to get a running test database for this use case, we can generate a small graph using Graphgen and import it into our local database. We use the following pattern:
(user:#User *10) (post:#BlogPost *200) (post2:#BlogPost:ActivePost {active:{randomElement:["true"]}} *200) (user)-[:WRITTEN *1..n]->(post) (user)-[:WRITTEN *1..n]->(post2) (user)-[:FOLLOWS *n..n]->(user)
Naturally, we will create a unique constraint on the User _id
property :
CREATE CONSTRAINT ON (user:User) ASSERT user._id IS UNIQUE;
Retrieving active blog posts
We now want to retrieve all the BlogPost
nodes that are active.
Using labels
PROFILE MATCH (post:ActivePost) RETURN count(post);
will result in the following execution plan :
neo4j-sh (?)$ PROFILE MATCH (post:ActivePost) RETURN count(post); +-------------+ | count(post) | +-------------+ | 200 | +-------------+ 1 row ColumnFilter | +EagerAggregation | +NodeByLabel +------------------+------+--------+-------------+--------------------------+ | Operator | Rows | DbHits | Identifiers | Other | +------------------+------+--------+-------------+--------------------------+ | ColumnFilter | 1 | 0 | | keep columns count(post) | | EagerAggregation | 1 | 0 | | | | NodeByLabel | 200 | 201 | post, post | :ActivePost | +------------------+------+--------+-------------+--------------------------+ Total database accesses: 201
As we can see there are no more database accesses than the number of ActivePost
nodes in the database, thanks to the label indexing.
Now, let’s compare it with the use of an indexed active property on the nodes.
Using an indexed flagged property
CREATE INDEX ON :BlogPost(active);
PROFILE MATCH (post:BlogPost) WHERE post.active = true RETURN count(post);
neo4j-sh (?)$ PROFILE MATCH (post:BlogPost) WHERE post.active = 'true' RETURN count(post); +-------------+ | count(post) | +-------------+ | 200 | +-------------+ 1 row ColumnFilter | +EagerAggregation | +SchemaIndex +------------------+------+--------+-------------+------------------------------------+ | Operator | Rows | DbHits | Identifiers | Other | +------------------+------+--------+-------------+------------------------------------+ | ColumnFilter | 1 | 0 | | keep columns count(post) | | EagerAggregation | 1 | 0 | | | | SchemaIndex | 200 | 201 | post, post | { AUTOSTRING0}; :BlogPost(active) | +------------------+------+--------+-------------+------------------------------------+ Total database accesses: 201
As we can see, there is no difference between matching on a dedicated label and using indexed properties.
But why this blog post then?
The difference will be apparent when the queried nodes are not in the beginning of the pattern.
Retrieving active blog posts written by a user
Let’s say we want to retrieve all active posts for a specific user :
Using labels
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'}) WITH user MATCH (user)-[:WRITTEN]->(p:ActivePost) RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'}) > WITH user > MATCH (user)-[:WRITTEN]->(p:ActivePost) > RETURN count(p); +----------+ | count(p) | +----------+ | 21 | +----------+ 1 row ColumnFilter | +EagerAggregation | +Filter | +SimplePatternMatcher | +SchemaIndex +----------------------+------+--------+----------------------+-----------------------------+ | Operator | Rows | DbHits | Identifiers | Other | +----------------------+------+--------+----------------------+-----------------------------+ | ColumnFilter | 1 | 0 | | keep columns count(p) | | EagerAggregation | 1 | 0 | | | | Filter | 21 | 21 | | hasLabel(p:ActivePost(2)) | | SimplePatternMatcher | 21 | 35 | user, p, UNNAMED85 | | | SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) | +----------------------+------+--------+----------------------+-----------------------------+ Total database accesses: 58
Using an indexed property
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'}) WITH user MATCH (user)-[:WRITTEN]->(p:BlogPost) WHERE p.active = 'true' RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'}) > WITH user > MATCH (user)-[:WRITTEN]->(p:BlogPost) > WHERE p.active = 'true' > RETURN count(p); +----------+ | count(p) | +----------+ | 21 | +----------+ 1 row ColumnFilter | +EagerAggregation | +Filter | +SimplePatternMatcher | +SchemaIndex +----------------------+------+--------+----------------------+------------------------------------------------------------------------+ | Operator | Rows | DbHits | Identifiers | Other | +----------------------+------+--------+----------------------+------------------------------------------------------------------------+ | ColumnFilter | 1 | 0 | | keep columns count(p) | | EagerAggregation | 1 | 0 | | | | Filter | 21 | 63 | | (hasLabel(p:BlogPost(1)) AND Property(p,active(8)) == { AUTOSTRING1}) | | SimplePatternMatcher | 21 | 105 | user, p, UNNAMED85 | | | SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) | +----------------------+------+--------+----------------------+------------------------------------------------------------------------+ Total database accesses: 170
As we can see, the indexed property is not used by Cypher and it needs to filter on all blog posts. Using a dedicated label is thus more performant.
Going further with some tips and tricks
We will not stop here! We would like to share some tips that will help you optimize your Cypher queries.
Always use dedicated labels for positives
You may want to ask: why not adding a Draft
label on the BlogPost
nodes for non-active blog posts?
The reason is that this would force an additional negation filter when retrieving active blog posts, and negation in Cypher is costly :
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'}) WITH user MATCH (user)-[:WRITTEN]->(p:BlogPost) WHERE NOT p :Draft RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'}) > WITH user > MATCH (user)-[:WRITTEN]->(p:BlogPost) > WHERE NOT p :Draft > RETURN count(p); +----------+ | count(p) | +----------+ | 21 | +----------+ 1 row ColumnFilter | +EagerAggregation | +Filter | +SimplePatternMatcher | +SchemaIndex +----------------------+------+--------+----------------------+---------------------------------------------------------+ | Operator | Rows | DbHits | Identifiers | Other | +----------------------+------+--------+----------------------+---------------------------------------------------------+ | ColumnFilter | 1 | 0 | | keep columns count(p) | | EagerAggregation | 1 | 0 | | | | Filter | 21 | 42 | | (hasLabel(p:BlogPost(1)) AND NOT(hasLabel(p:Draft(3)))) | | SimplePatternMatcher | 21 | 70 | user, p, UNNAMED85 | | | SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) | +----------------------+------+--------+----------------------+---------------------------------------------------------+ Total database accesses: 114
Avoid the need to match on multiple labels
Some designs we have encountered were making too much use of dedicated labels, forcing queries to match on multiple labels to fetch the desired set of nodes.
Matching on more than one label will make Cypher apply an additional hasLabel
filter :
PROFILE MATCH (post:BlogPost:ActivePost) RETURN count(post);
neo4j-sh (?)$ PROFILE MATCH (post:BlogPost:ActivePost) RETURN count(post); +-------------+ | count(post) | +-------------+ | 200 | +-------------+ 1 row ColumnFilter | +EagerAggregation | +Filter | +NodeByLabel +------------------+------+--------+-------------+------------------------------+ | Operator | Rows | DbHits | Identifiers | Other | +------------------+------+--------+-------------+------------------------------+ | ColumnFilter | 1 | 0 | | keep columns count(post) | | EagerAggregation | 1 | 0 | | | | Filter | 200 | 400 | | hasLabel(post:ActivePost(2)) | | NodeByLabel | 400 | 401 | post, post | :BlogPost | +------------------+------+--------+-------------+------------------------------+ Total database accesses: 801
We can see here that the number of database accesses is growing and you can imagine that it will be costly when we have millions of blog posts or entities in the database.
Use well named relationship types to avoid some use of labels
When looking up patterns, specifying the label of the end node will force Cypher to filter on labels.
Dedicated and clear relationship types will help you traverse the graph easily and make full use of the powerful graph model, gaining performance by not having to match on labels.
Let’s refactor our little model by adding a PUBLISHED
relationship from User
to ActivePost
nodes and a DRAFTED
relationship to the others.
MATCH (n:ActivePost) WITH n MATCH (n)<-[:WRITTEN]-(u) MERGE (u)-[:PUBLISHED]->(n);
MATCH (n:BlogPost) WHERE NOT n :ActivePost WITH n MATCH (n)<-[:WRITTEN]-(u) MERGE (u)-[:DRAFTED]->(n);
The very advantage of Neo4j is that nodes contain references to the relationships they connected to them. This means that once we have a starting point in the graph, following relationships from nodes to nodes costs almost nothing :
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'}) WITH user MATCH (user)-[:PUBLISHED]->(p) RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'}) > WITH user > MATCH (user)-[:PUBLISHED]->(p) > RETURN count(p); +----------+ | count(p) | +----------+ | 21 | +----------+ 1 row ColumnFilter | +EagerAggregation | +SimplePatternMatcher | +SchemaIndex +----------------------+------+--------+----------------------+-----------------------------+ | Operator | Rows | DbHits | Identifiers | Other | +----------------------+------+--------+----------------------+-----------------------------+ | ColumnFilter | 1 | 0 | | keep columns count(p) | | EagerAggregation | 1 | 0 | | | | SimplePatternMatcher | 21 | 0 | user, p, UNNAMED85 | | | SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) | +----------------------+------+--------+----------------------+-----------------------------+ Total database accesses: 2
As we can see, we did not define any labels because we know that the :PUBLISHED
relationship will point to blog posts that are active and retrieving thousands of blog posts will just cost us nothing more than using the schema index for our user and a few traversals from the user to blog post nodes by following their relationships.
Conclusion
A good design of your graph database model will help you perform very fast queries and provide a good user experience. Testing your queries and the usage of the query profiler can help you discover bottlenecks in your schema and should be done regularly during the development process, before going to production with a large data set.
Note: This blog post is a follow up of my StackOverflow answer on the subject