GraphAware Blog

Find out what's new in the Neo4j world

Modelling Data in Neo4j: Labels vs. Indexed Properties

Neo4j Modelling Cypher Intermediate 16 Jan 2015 by Christophe Willemsen

A common question when planning and designing your Neo4j Graph Database is how to handle “flagged” entities. This could include users that are active, blog posts that are published, news articles that have been read, etc.

Introduction

In the SQL world, you would typically create a a boolean|tinyint column; in Neo4j, the same can be achieved in the following two ways:

  • A flagged indexed property
  • A dedicated label

Having faced this design dilemma a number of times, we would like to share our experience with the two presented possibilities and some Cypher query optimizations that will help you take a full advantage of a the graph database.

Throughout the blog post, we’ll use the following example scenario:

  • We have User nodes
  • User FOLLOWS other users
  • Each user writes multiple blog posts stored as BlogPost nodes
  • Some of the blog posts are drafted, others are published (active)

Setting up

If we want to get a running test database for this use case, we can generate a small graph using Graphgen and import it into our local database. We use the following pattern:

(user:#User *10)
(post:#BlogPost *200)
(post2:#BlogPost:ActivePost {active:{randomElement:["true"]}} *200)
(user)-[:WRITTEN *1..n]->(post)
(user)-[:WRITTEN *1..n]->(post2)
(user)-[:FOLLOWS *n..n]->(user)

Naturally, we will create a unique constraint on the User _id property :

CREATE CONSTRAINT ON (user:User) ASSERT user._id IS UNIQUE;

Retrieving active blog posts

We now want to retrieve all the BlogPost nodes that are active.

Using labels

PROFILE MATCH (post:ActivePost) RETURN count(post);

will result in the following execution plan :

neo4j-sh (?)$ PROFILE MATCH (post:ActivePost) RETURN count(post);
+-------------+
| count(post) |
+-------------+
| 200         |
+-------------+
1 row

ColumnFilter
  |
  +EagerAggregation
    |
    +NodeByLabel

+------------------+------+--------+-------------+--------------------------+
|         Operator | Rows | DbHits | Identifiers |                    Other |
+------------------+------+--------+-------------+--------------------------+
|     ColumnFilter |    1 |      0 |             | keep columns count(post) |
| EagerAggregation |    1 |      0 |             |                          |
|      NodeByLabel |  200 |    201 |  post, post |              :ActivePost |
+------------------+------+--------+-------------+--------------------------+

Total database accesses: 201

As we can see there are no more database accesses than the number of ActivePost nodes in the database, thanks to the label indexing.

Now, let’s compare it with the use of an indexed active property on the nodes.

Using an indexed flagged property

CREATE INDEX ON :BlogPost(active);
PROFILE MATCH (post:BlogPost) WHERE post.active = true RETURN count(post);
neo4j-sh (?)$ PROFILE MATCH (post:BlogPost) WHERE post.active = 'true' RETURN count(post);
+-------------+
| count(post) |
+-------------+
| 200         |
+-------------+
1 row

ColumnFilter
  |
  +EagerAggregation
    |
    +SchemaIndex

+------------------+------+--------+-------------+------------------------------------+
|         Operator | Rows | DbHits | Identifiers |                              Other |
+------------------+------+--------+-------------+------------------------------------+
|     ColumnFilter |    1 |      0 |             |           keep columns count(post) |
| EagerAggregation |    1 |      0 |             |                                    |
|      SchemaIndex |  200 |    201 |  post, post | {  AUTOSTRING0}; :BlogPost(active) |
+------------------+------+--------+-------------+------------------------------------+

Total database accesses: 201

As we can see, there is no difference between matching on a dedicated label and using indexed properties.

But why this blog post then?

The difference will be apparent when the queried nodes are not in the beginning of the pattern.

Retrieving active blog posts written by a user

Let’s say we want to retrieve all active posts for a specific user :

Using labels

PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:WRITTEN]->(p:ActivePost)
RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:WRITTEN]->(p:ActivePost)
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21       |
+----------+
1 row

ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +SchemaIndex

+----------------------+------+--------+----------------------+-----------------------------+
|             Operator | Rows | DbHits |          Identifiers |                       Other |
+----------------------+------+--------+----------------------+-----------------------------+
|         ColumnFilter |    1 |      0 |                      |       keep columns count(p) |
|     EagerAggregation |    1 |      0 |                      |                             |
|               Filter |   21 |     21 |                      |   hasLabel(p:ActivePost(2)) |
| SimplePatternMatcher |   21 |     35 | user, p,   UNNAMED85 |                             |
|          SchemaIndex |    1 |      2 |           user, user | {  AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+-----------------------------+

Total database accesses: 58

Using an indexed property

PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:WRITTEN]->(p:BlogPost)
WHERE p.active = 'true'
RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:WRITTEN]->(p:BlogPost)
> WHERE p.active = 'true'
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21       |
+----------+
1 row

ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +SchemaIndex

+----------------------+------+--------+----------------------+------------------------------------------------------------------------+
|             Operator | Rows | DbHits |          Identifiers |                                                                  Other |
+----------------------+------+--------+----------------------+------------------------------------------------------------------------+
|         ColumnFilter |    1 |      0 |                      |                                                  keep columns count(p) |
|     EagerAggregation |    1 |      0 |                      |                                                                        |
|               Filter |   21 |     63 |                      | (hasLabel(p:BlogPost(1)) AND Property(p,active(8)) == {  AUTOSTRING1}) |
| SimplePatternMatcher |   21 |    105 | user, p,   UNNAMED85 |                                                                        |
|          SchemaIndex |    1 |      2 |           user, user |                                            {  AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+------------------------------------------------------------------------+

Total database accesses: 170

As we can see, the indexed property is not used by Cypher and it needs to filter on all blog posts. Using a dedicated label is thus more performant.

Going further with some tips and tricks

We will not stop here! We would like to share some tips that will help you optimize your Cypher queries.

Always use dedicated labels for positives

You may want to ask: why not adding a Draft label on the BlogPost nodes for non-active blog posts?

The reason is that this would force an additional negation filter when retrieving active blog posts, and negation in Cypher is costly :

PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:WRITTEN]->(p:BlogPost)
WHERE NOT p :Draft
RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:WRITTEN]->(p:BlogPost)
> WHERE NOT p :Draft
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21       |
+----------+
1 row

ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +SchemaIndex

+----------------------+------+--------+----------------------+---------------------------------------------------------+
|             Operator | Rows | DbHits |          Identifiers |                                                   Other |
+----------------------+------+--------+----------------------+---------------------------------------------------------+
|         ColumnFilter |    1 |      0 |                      |                                   keep columns count(p) |
|     EagerAggregation |    1 |      0 |                      |                                                         |
|               Filter |   21 |     42 |                      | (hasLabel(p:BlogPost(1)) AND NOT(hasLabel(p:Draft(3)))) |
| SimplePatternMatcher |   21 |     70 | user, p,   UNNAMED85 |                                                         |
|          SchemaIndex |    1 |      2 |           user, user |                             {  AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+---------------------------------------------------------+

Total database accesses: 114

Avoid the need to match on multiple labels

Some designs we have encountered were making too much use of dedicated labels, forcing queries to match on multiple labels to fetch the desired set of nodes.

Matching on more than one label will make Cypher apply an additional hasLabel filter :

PROFILE MATCH (post:BlogPost:ActivePost) RETURN count(post);
neo4j-sh (?)$ PROFILE MATCH (post:BlogPost:ActivePost) RETURN count(post);
+-------------+
| count(post) |
+-------------+
| 200         |
+-------------+
1 row

ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +NodeByLabel

+------------------+------+--------+-------------+------------------------------+
|         Operator | Rows | DbHits | Identifiers |                        Other |
+------------------+------+--------+-------------+------------------------------+
|     ColumnFilter |    1 |      0 |             |     keep columns count(post) |
| EagerAggregation |    1 |      0 |             |                              |
|           Filter |  200 |    400 |             | hasLabel(post:ActivePost(2)) |
|      NodeByLabel |  400 |    401 |  post, post |                    :BlogPost |
+------------------+------+--------+-------------+------------------------------+

Total database accesses: 801

We can see here that the number of database accesses is growing and you can imagine that it will be costly when we have millions of blog posts or entities in the database.

Use well named relationship types to avoid some use of labels

When looking up patterns, specifying the label of the end node will force Cypher to filter on labels.

Dedicated and clear relationship types will help you traverse the graph easily and make full use of the powerful graph model, gaining performance by not having to match on labels.

Let’s refactor our little model by adding a PUBLISHED relationship from User to ActivePost nodes and a DRAFTED relationship to the others.

MATCH (n:ActivePost)
WITH n
MATCH (n)<-[:WRITTEN]-(u)
MERGE (u)-[:PUBLISHED]->(n);
MATCH (n:BlogPost)
WHERE NOT n :ActivePost
WITH n
MATCH (n)<-[:WRITTEN]-(u)
MERGE (u)-[:DRAFTED]->(n);

The very advantage of Neo4j is that nodes contain references to the relationships they connected to them. This means that once we have a starting point in the graph, following relationships from nodes to nodes costs almost nothing :

PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:PUBLISHED]->(p)
RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:PUBLISHED]->(p)
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21       |
+----------+
1 row

ColumnFilter
  |
  +EagerAggregation
    |
    +SimplePatternMatcher
      |
      +SchemaIndex

+----------------------+------+--------+----------------------+-----------------------------+
|             Operator | Rows | DbHits |          Identifiers |                       Other |
+----------------------+------+--------+----------------------+-----------------------------+
|         ColumnFilter |    1 |      0 |                      |       keep columns count(p) |
|     EagerAggregation |    1 |      0 |                      |                             |
| SimplePatternMatcher |   21 |      0 | user, p,   UNNAMED85 |                             |
|          SchemaIndex |    1 |      2 |           user, user | {  AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+-----------------------------+

Total database accesses: 2

As we can see, we did not define any labels because we know that the :PUBLISHED relationship will point to blog posts that are active and retrieving thousands of blog posts will just cost us nothing more than using the schema index for our user and a few traversals from the user to blog post nodes by following their relationships.

Conclusion

A good design of your graph database model will help you perform very fast queries and provide a good user experience. Testing your queries and the usage of the query profiler can help you discover bottlenecks in your schema and should be done regularly during the development process, before going to production with a large data set.

Note: This blog post is a follow up of my StackOverflow answer on the subject

Share this blog post:

comments powered by Disqus