A common question when planning and designing your Neo4j Graph Database is how to handle “flagged” entities. This could include users that are active, blog posts that are published, news articles that have been read, etc.
Introduction
In the SQL world, you would typically create a a boolean|tinyint column; in Neo4j, the same can be achieved in the following two ways:
- A flagged indexed property
- A dedicated label
Having faced this design dilemma a number of times, we would like to share our experience with the two presented possibilities and some Cypher query optimizations that will help you take a full advantage of a the graph database.
Throughout the blog post, we’ll use the following example scenario:
- We have
Usernodes - User
FOLLOWSother users - Each user writes multiple blog posts stored as
BlogPostnodes - Some of the blog posts are
drafted, others arepublished(active)
Setting up
If we want to get a running test database for this use case, we can generate a small graph using Graphgen and import it into our local database. We use the following pattern:
(user:#User *10)
(post:#BlogPost *200)
(post2:#BlogPost:ActivePost {active:{randomElement:["true"]}} *200)
(user)-[:WRITTEN *1..n]->(post)
(user)-[:WRITTEN *1..n]->(post2)
(user)-[:FOLLOWS *n..n]->(user)Naturally, we will create a unique constraint on the User _id property :
CREATE CONSTRAINT ON (user:User) ASSERT user._id IS UNIQUE;Retrieving active blog posts
We now want to retrieve all the BlogPost nodes that are active.
Using labels
PROFILE MATCH (post:ActivePost) RETURN count(post);will result in the following execution plan :
neo4j-sh (?)$ PROFILE MATCH (post:ActivePost) RETURN count(post);
+-------------+
| count(post) |
+-------------+
| 200 |
+-------------+
1 row
ColumnFilter
|
+EagerAggregation
|
+NodeByLabel
+------------------+------+--------+-------------+--------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+--------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(post) |
| EagerAggregation | 1 | 0 | | |
| NodeByLabel | 200 | 201 | post, post | :ActivePost |
+------------------+------+--------+-------------+--------------------------+
Total database accesses: 201As we can see there are no more database accesses than the number of ActivePost nodes in the database, thanks to the label indexing.
Now, let’s compare it with the use of an indexed active property on the nodes.
Using an indexed flagged property
CREATE INDEX ON :BlogPost(active);PROFILE MATCH (post:BlogPost) WHERE post.active = true RETURN count(post);neo4j-sh (?)$ PROFILE MATCH (post:BlogPost) WHERE post.active = 'true' RETURN count(post);
+-------------+
| count(post) |
+-------------+
| 200 |
+-------------+
1 row
ColumnFilter
|
+EagerAggregation
|
+SchemaIndex
+------------------+------+--------+-------------+------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(post) |
| EagerAggregation | 1 | 0 | | |
| SchemaIndex | 200 | 201 | post, post | { AUTOSTRING0}; :BlogPost(active) |
+------------------+------+--------+-------------+------------------------------------+
Total database accesses: 201As we can see, there is no difference between matching on a dedicated label and using indexed properties.
But why this blog post then?
The difference will be apparent when the queried nodes are not in the beginning of the pattern.
Retrieving active blog posts written by a user
Let’s say we want to retrieve all active posts for a specific user :
Using labels
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:WRITTEN]->(p:ActivePost)
RETURN count(p);neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:WRITTEN]->(p:ActivePost)
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+SchemaIndex
+----------------------+------+--------+----------------------+-----------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+----------------------+-----------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(p) |
| EagerAggregation | 1 | 0 | | |
| Filter | 21 | 21 | | hasLabel(p:ActivePost(2)) |
| SimplePatternMatcher | 21 | 35 | user, p, UNNAMED85 | |
| SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+-----------------------------+
Total database accesses: 58Using an indexed property
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:WRITTEN]->(p:BlogPost)
WHERE p.active = 'true'
RETURN count(p);neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:WRITTEN]->(p:BlogPost)
> WHERE p.active = 'true'
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+SchemaIndex
+----------------------+------+--------+----------------------+------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+----------------------+------------------------------------------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(p) |
| EagerAggregation | 1 | 0 | | |
| Filter | 21 | 63 | | (hasLabel(p:BlogPost(1)) AND Property(p,active(8)) == { AUTOSTRING1}) |
| SimplePatternMatcher | 21 | 105 | user, p, UNNAMED85 | |
| SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+------------------------------------------------------------------------+
Total database accesses: 170As we can see, the indexed property is not used by Cypher and it needs to filter on all blog posts. Using a dedicated label is thus more performant.
Going further with some tips and tricks
We will not stop here! We would like to share some tips that will help you optimize your Cypher queries.
Always use dedicated labels for positives
You may want to ask: why not adding a Draft label on the BlogPost nodes for non-active blog posts?
The reason is that this would force an additional negation filter when retrieving active blog posts, and negation in Cypher is costly :
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:WRITTEN]->(p:BlogPost)
WHERE NOT p :Draft
RETURN count(p);neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:WRITTEN]->(p:BlogPost)
> WHERE NOT p :Draft
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+SchemaIndex
+----------------------+------+--------+----------------------+---------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+----------------------+---------------------------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(p) |
| EagerAggregation | 1 | 0 | | |
| Filter | 21 | 42 | | (hasLabel(p:BlogPost(1)) AND NOT(hasLabel(p:Draft(3)))) |
| SimplePatternMatcher | 21 | 70 | user, p, UNNAMED85 | |
| SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+---------------------------------------------------------+
Total database accesses: 114Avoid the need to match on multiple labels
Some designs we have encountered were making too much use of dedicated labels, forcing queries to match on multiple labels to fetch the desired set of nodes.
Matching on more than one label will make Cypher apply an additional hasLabel filter :
PROFILE MATCH (post:BlogPost:ActivePost) RETURN count(post);neo4j-sh (?)$ PROFILE MATCH (post:BlogPost:ActivePost) RETURN count(post);
+-------------+
| count(post) |
+-------------+
| 200 |
+-------------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+NodeByLabel
+------------------+------+--------+-------------+------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(post) |
| EagerAggregation | 1 | 0 | | |
| Filter | 200 | 400 | | hasLabel(post:ActivePost(2)) |
| NodeByLabel | 400 | 401 | post, post | :BlogPost |
+------------------+------+--------+-------------+------------------------------+
Total database accesses: 801We can see here that the number of database accesses is growing and you can imagine that it will be costly when we have millions of blog posts or entities in the database.
Use well named relationship types to avoid some use of labels
When looking up patterns, specifying the label of the end node will force Cypher to filter on labels.
Dedicated and clear relationship types will help you traverse the graph easily and make full use of the powerful graph model, gaining performance by not having to match on labels.
Let’s refactor our little model by adding a PUBLISHED relationship from User to ActivePost nodes and a DRAFTED relationship to the others.
MATCH (n:ActivePost)
WITH n
MATCH (n)<-[:WRITTEN]-(u)
MERGE (u)-[:PUBLISHED]->(n);MATCH (n:BlogPost)
WHERE NOT n :ActivePost
WITH n
MATCH (n)<-[:WRITTEN]-(u)
MERGE (u)-[:DRAFTED]->(n);The very advantage of Neo4j is that nodes contain references to the relationships they connected to them. This means that once we have a starting point in the graph, following relationships from nodes to nodes costs almost nothing :
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:PUBLISHED]->(p)
RETURN count(p);neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:PUBLISHED]->(p)
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+SimplePatternMatcher
|
+SchemaIndex
+----------------------+------+--------+----------------------+-----------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+----------------------+-----------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(p) |
| EagerAggregation | 1 | 0 | | |
| SimplePatternMatcher | 21 | 0 | user, p, UNNAMED85 | |
| SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+-----------------------------+
Total database accesses: 2As we can see, we did not define any labels because we know that the :PUBLISHED relationship will point to blog posts that are active and retrieving thousands of blog posts will just cost us nothing more than using the schema index for our user and a few traversals from the user to blog post nodes by following their relationships.
Conclusion
A good design of your graph database model will help you perform very fast queries and provide a good user experience. Testing your queries and the usage of the query profiler can help you discover bottlenecks in your schema and should be done regularly during the development process, before going to production with a large data set.
Note: This blog post is a follow up of my StackOverflow answer on the subject
