GraphAware Blog

What’s New in Neo4j Databridge [April 2017]

Neo4j ETL Databridge 27 Apr 2017 by Vince Bickers

Since our first post a few months back, Neo4j-Databridge has seen a number of improvements and enhancements. In this post, we’ll take a quick tour of the latest features.

Streaming Endpoint

Although Databridge is primarily designed for bulk data import, which requires Neo4j to be offline, we recently added the capability to import data into a running Neo4j instance.

This was prompted by a specific request from a user who pointed out that in many cases people want to do a fast bulk-load of an initial large dataset with the database offline, and then subsequently apply small incremental updates to that data with the database running. This seemed like a great idea, so we added the streaming endpoint to enable this feature.

The streaming endpoint uses Neo4j’s Bolt binary protocol, and the good news is that you don’t need to change any of your existing import configuration to use it. Simply pass the -s option to the import command, and it will automatically use the streaming endpoint.

Example: use the -s option to import the hawkeye dataset into a running instance of Neo4j.

bin/databridge import -s hawkeye

The streaming endpoint connects to Neo4j using the following defaults:

neo4j.url=bolt://localhost
neo4j.username=neo4j
neo4j.password=password

You can override these defaults by creating a file custom.properties in the Databridge config folder and setting the values as appropriate for your particular Neo4j installation.
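As a minimal sketch, the override file could be created as shown here. The three property keys are the ones listed above; the location of the config folder (`config`), the host name, and the credentials are placeholders chosen for illustration, so adjust them to your own installation.

```shell
# Assumption: CONFIG_DIR is the Databridge config folder; adjust to your install.
CONFIG_DIR="${CONFIG_DIR:-config}"
mkdir -p "$CONFIG_DIR"

# Override the streaming endpoint defaults.
# The host, user name, and password below are placeholders, not real values.
cat > "$CONFIG_DIR/custom.properties" <<'EOF'
neo4j.url=bolt://graph.example.internal
neo4j.username=importer
neo4j.password=change-me
EOF
```

Only the keys you want to change should need to appear in the file, but that is an assumption worth verifying against the project documentation.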

Please note that despite using the Bolt protocol, the streaming endpoint will take quite a bit longer to run than the offline endpoint for large datasets, so it isn’t really intended to replace bulk import. For small incremental updates, however, this should not be a problem.

Updates from the streaming endpoint are batched, with the transaction commit size currently set to 1000, and the plan is to make the commit size user-configurable in the near future.
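To get a feel for what a commit size of 1000 means for an incremental update, the number of transactions can be estimated with a little arithmetic. This is only a back-of-the-envelope sketch: the `commit_count` helper is hypothetical, and it assumes that a final, partially filled batch is also committed as its own transaction.

```shell
# Estimate how many commits a streamed update needs.
# Assumption: every full or partial batch of rows ends in exactly one commit.
commit_count() {
  rows=$1
  batch=${2:-1000}   # commit size currently used by the streaming endpoint
  echo $(( (rows + batch - 1) / batch ))   # integer ceiling of rows / batch
}

commit_count 25500   # prints 26
```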

Specifying the Output Database Folder

By default, Neo4j-Databridge creates a new graph.db database in the same folder as the import task. We’ve now added the ability for you to define the output path to the database explicitly. To do this, use the -o option to specify the output folder path to the import command.

Example: use the -o option to import the hawkeye dataset into a user-specified database.

bin/databridge import -o /databases/common hawkeye

In the example above, the hawkeye dataset will be imported into /databases/common/graph.db, instead of the default location hawkeye/graph.db.

Among other things, this new feature allows you to import different datasets into the same physical database.

Example: use the -o option to allow the hawkeye and epsilon datasets to co-exist in the same Neo4j database.

bin/databridge import -o /databases/common hawkeye
bin/databridge import -o /databases/common epsilon
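When several datasets share one database, the same command can be driven from a small loop. The sketch below is one way to do that; the `import_all` function and the `DATABRIDGE` and `TARGET` variables are illustrative names, not part of Databridge. If the launcher script is not found, the function prints the commands instead of running them, which makes it easy to check what would happen first.

```shell
# Assumption: DATABRIDGE points at the launcher script used in the examples above.
DATABRIDGE="${DATABRIDGE:-bin/databridge}"
TARGET="/databases/common"

# Import every named task into the same target database, stopping on failure.
import_all() {
  for task in "$@"; do
    if [ -x "$DATABRIDGE" ]; then
      "$DATABRIDGE" import -o "$TARGET" "$task" || return 1
    else
      # Launcher not found: print the command instead of running it.
      echo "$DATABRIDGE import -o $TARGET $task"
    fi
  done
}

import_all hawkeye epsilon
```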

Simpler Commands

The eagle-eyed among you will have spotted that the above examples use the import command, while in our first blog post, our examples all used the run command, which was invoked with a variety of different option flags. The original run command still exists, but we’ve added some additional commands to make life a bit simpler.

All the new commands also now support a -l option, to limit the number of rows imported. This can be very useful when testing a new import task for example. The new commands are:

import: runs the specified import task
usage: import [-cdsq] [-o target] [-l limit] <import-task>
c: allow multiple copies of this import to co-exist in the target database
d: delete any existing dataset prior to running this import
s: stream data into a running instance of Neo4j
q: run the import task in the background, logging output to import.log instead of the console
o target: use the specified target database for this import
l limit: the maximum number of rows to process from each resource during the import

test: performs a dry run of the specified import task, but does not create a database
usage: test [-l limit] <import-task>
l limit: the maximum number of rows to process from each resource during the dry run

profile: profiles the resources for an import task. Databridge uses a profiler at the initial phase of every import. The profiler examines the various data resources that will be loaded during the import and generates tuning information for the actual import phase.
usage: profile [-l limit] <import-task>
l limit: the maximum number of rows to profile from each resource

The profiler displays the statistics that will be used to tune the import. For nodes, these statistics include the average key length akl of the unique identifiers for each node type, as well as an upper bound max on the number of nodes of each type.

For relationships, the statistics include an upper bound on the number of edges of each type. (The max values are upper bounds because the profiler doesn’t attempt to detect possible duplicates.)

Profile statistics are displayed in JSON format:

{
    nodes: [
        { 'Orbit': {'max':11, 'akl':10.545455} }
        { 'Satellite': {'max':11, 'akl':8.909091} }
        { 'SpaceProgram': {'max':11, 'akl':9.818182} }
        { 'Location': {'max':11, 'akl':4.818182} }
    ],
    edges: [
        { 'LOCATION': {'max':11} }
        { 'ORBIT': {'max':11} }
        { 'LAUNCHED': {'max':11} }
        { 'LIVE': {'max':11} }
    ]
}
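Because the output shown above uses single quotes and unquoted keys, a strict JSON parser may reject it. If you just want the upper bounds at a glance, a line-oriented filter is enough. The following is a rough sketch that assumes one entry per line, formatted as in the sample; `profile_max` is a hypothetical helper, not a Databridge command.

```shell
# Print "<type> <max>" for every profile entry read from standard input.
# Assumption: entries look like the sample above, one per line.
profile_max() {
  sed -n "s/.*'\([A-Za-z_]*\)': *{'max':\([0-9]*\).*/\1 \2/p"
}

# Two entries copied from the sample output.
profile_max <<'EOF'
{ 'Orbit': {'max':11, 'akl':10.545455} }
{ 'LAUNCHED': {'max':11} }
EOF
```

Assuming the profile command writes its statistics to standard output, its output could be piped straight into the filter.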

Deleting and Copying Individual Datasets

In order to support the new streaming endpoint as well as the ability to host multiple import datasets in the same database, Databridge only creates a brand new database the first time you run an import task.

If you run the same import task multiple times with the same datasets, Databridge will not create any new nodes or relationships in the graph during the second and subsequent imports.

If you want to force Databridge to clear down any previous data and re-import it, you can use the -d option, which will delete the existing dataset first.

Example: use the -d option to delete an existing dataset prior to re-importing it.

bin/databridge import hawkeye
bin/databridge import -d hawkeye

On the other hand, if you want to create a copy of an existing dataset, you can use the -c option instead.

Example: use the -c option to create a copy of a previously imported dataset.

bin/databridge import hawkeye
bin/databridge import -c hawkeye

Deleting All the Things

If you need to delete everything in the graph database and start again with a completely clean slate, you can use the purge command:

bin/databridge purge hawkeye

Note that if you have imported multiple datasets into the same physical database, you should purge each of them individually, specifying the database path each time:

bin/databridge purge -o /databases/common hawkeye
bin/databridge purge -o /databases/common epsilon

Conclusion

Well, that about wraps up this quick survey of what’s new in Databridge from GraphAware. If you’re interested in finding out more, please take a look at the project wiki, and in particular the Tutorials section.

If you believe Databridge would be useful for your project or organisation and are interested in trying it out, please contact me directly at [email protected] or drop an email to [email protected] and one of the GraphAware team members will get in touch.
