Performance of traversals in Neo4j
Cet article est disponible en français sous le titre « Performance des traversées dans Neo4j ».
The project I'm currently working on (transparency of the supply chain (in French), food traceability through a graph-oriented approach (in French) or GraphConnect - Tracing the World Food from Farm to Fork) is developed around the Neo4j graph database (N.B.: ekino is a partner of Neo Technology), and we recently had the opportunity to migrate the application from version 2.1 to Neo4j 2.3. The application performs a number of graph traversals on hundreds of thousands of nodes and relationships, for a web application and not for offline analysis. The performance of these traversals is extremely important.
As it happens, the 2.2 and 2.3 releases of Neo4j were focused on performance (especially for Cypher, the query language of Neo4j), and we expected a direct benefit. We'll see what the actual results are, and analyse the reasons of the changes.
Surprise! The update shows increased response times in our performance tests, up to a factor of 2. A first comparison with a profiler shows an explosion of the cost of operations of the code API of Neo4j handling nodes and relationships, as well as an increase of the memory allocated by a request (garbage: the objects created while processing the request and collectable at the end of the request).
I choose to create a simple example demonstrating the problem and push it on GitHub, to then discuss it on SlackHQ Neo4j.
It's available as an unmanaged extension for Neo4j, that creates its own test data and exposes an HTTP service which performs a traversal similar to what our application does, using several core APIs:
The data is organized as a tree, from a root node of type
A, connected to nodes of type
B (through a
HAS_B relationship), connected to nodes of type
A (through a
HAS_A relationship), etc. on 10 levels of
B and with 4 children per parent (i.e. 1,398,101 nodes and 1,398,100 relationships). Each node also has a boolean property with a value assigned randomly.
The traverser goes through the nodes in breadth-first mode, and is configured with a
PathExpander using the label of the current node to choose the type of the next relationship to follow and an
Evaluator also using the label of the node to only select
B nodes and count those which property is
true. It starts the traversal from the central node (identified with the
Root label). The service finally returns the counter in the HTTP response, to avoid Dead Code Elimination by the Just In Time compiler (JIT) of the JVM.
The test data is large enough that a single traversal is measured in seconds, making any noise from the measure itself negligible.
Comparison of Neo4j 2.2 versus 2.3
The extension is compiled and deployed in two instances of Neo4j 2.2 (because it's similar in that regard to 2.1) and 2.3, with the same configuration of the disk cache
dbms.pagecache.memory=500m (the generated data set occupies about 400 MB on disk) and heap parameters for the JVM set to respectively
-Xms2g -Xmx4g and
-Xms512m -Xmx512m (Neo4j 2.3 needs less memory as it removed the in-memory object cache).
The test is a measure of the response time for 100 traversal requests executed sequentially in a single thread, after “warming up” the JVM with 100 traversal requests and discarding the results, to let the JIT compilation happen and reach a stable state.
It's executed on my 15-inch, mid-2014 Mac Book Pro with an Intel Core i7 2.2 GHz processor using 4 cores, and 16 GB of RAM, running under OS X 10.11.4 (El Capitan).
Without reaching a factor of 2 as previously observed, the results show response times degraded by more than 50%, and I have a test case to share.
I then discuss the benchmark through Slack and GitHub, especially with Michael Hunger (@mesirii, Developer Relations at Neo Technology), who gets different results with a fork of my project where he created a unit test using my code with an embedded Neo4j database: his comparisons show that 2.3 is faster than 2.2.
However, there are several differences between the benchmarks at this point:
<li> his results show that the size of the heap has an influence, contrary to what I had observed </li> <br /> <li> his traversals are parallel instead of sequential, which means we don't measure the same kind of load </li> <br /> <li> by using an embedded database, he uses Neo4j Community Edition whereas I'm using Neo4j Enterprise Edition, which has a High-Performance (Object) Cache <a href="http://neo4j.com/docs/2.2.8/configuration-caches.html#high-performance-cache">until version 2.2</a> </li> <br /> <li> finally, he changed the traversal mode from <a href="https://en.wikipedia.org/wiki/Breadth-first_search">breadth-first</a> to <a href="https://en.wikipedia.org/wiki/Depth-first_search">depth-first</a> </li> <br />
I modify his code slightly again to go back to a sequential execution and leave more time to the warmup and JIT, but I still get results comparable to his.
We discuss the last point on Slack: the depth-first mode allocates less memory as it's a simple recursion, whereas the breadth-first mode maintains the state of all the traversed branches in memory. As we use the breadth-first mode in the application (the real one, not the example) at that time, for several optimisations, I had kept that mode in the example.
To get a better insight, I add support for the depth-first mode in the example (using a query parameter), and run a new test series using the two modes and several heap sizes.
That's when things get interesting and help me reason about the problem:
<li> Neo4j 2.3 is much less sensitive to the heap size </li> <br /> <li> The depth-first mode is much faster than breadth-first </li> <br /> <li> With “enough” memory, Neo4j Enterprise 2.2 remains faster than 2.3 </li> <br />
Depth-first is faster than breadth-first
Our real traversers in the application manage themselves reaching a single child node through multiple paths from the root, because we keep the best “distance” to the root node for some key nodes. Using this metric, it's possible to stop the traversal beyond a known node if the current distance is worse than the recorded one, and we have deactivated the uniqueness management in Neo4j (to avoid duplicating the work) and used the breadth-first mode, as it guarantees than a node will always be reached by the shortest path first.
It seemed logical, but in hindsight it was an error not to have compared the two modes to validate the hypothesis. The benchmark in the example shows the result is not that obvious.
In practice, it actually made me switch back to the depth-first mode as it ended up being faster despite the probability of duplicate work (given 2 paths leading to the same node from the root, there's a 50% chance of taking the shortest path first): eliminating the state management and its associated memory usage gives better results.
With enough memory, Neo4j 2.2 seems faster
What's using that memory that will make 2.2 faster than 2.3? The cache, of course! More precisely, the “object cache”, which got removed in 2.3 and keeps in-memory all the data it can for nodes and relationships (the High-Performance Cache of the Enterprise Edition adapts to the available memory). It eliminates disk reads.
But what's the connection with my test scenario? My data set (both in the example and the application) represents an interesting volume for the traversal, but still fits in RAM.
The comparison was biased from the start
It's the problem with so-called “synthetic” benchmarks: you measure a unique scenario, in perfect conditions, and you really have to pay attention to what you're effectively measuring. In practice, with my 2.2 instance, I was not measuring the speed of a raw traversal, but of a traversal with all the data in cache - a perfect cache, as it were.
What would have happened the day when the application has enough data that it won't fit in RAM anymore? Either you add memory (vertical scalability, but you can only go so far, and there are real negative effects, almost exponential, on the pause time of the garbage collections), or the response times degrade suddenly and look like the RAM-limited scenario in the table.
The cache was removed in Neo4j 2.3 to improve the scalability, not necessarily the performance: on a “small” data set, the effect is negative, but there is a gain on the predictability of performances which depend much less on the total size of the data set.
Let's adjust the performance comparison
To compare the raw traversal in both versions, we need a way to clear the cache between each request on the 2.2 instance. The High-Performance Cache of the Enterprise Edition exposes an MBean (JMX) for that.
I add a new service to the example using that MBean, and run the tests again, but only for the best case scenario (4 GB heap, depth-first mode) where Neo4j 2.2 was faster.
The results with an empty cache are much worse, and it's hard to believe it's only due to the overhead of the cache itself. Besides eliminating the object cache, Neo4j 2.3 has obviously had performance improvements. ?
2.3 is much faster than 2.2 as soon as the data is not in cache
It also explains the differences with the benchmark by Michael Hunger: the cache implementations available in Neo4j Community Edition are slower than the High-Performance Cache, and don't compensate the intrinsically lower performance of 2.2. As a corollary, there is no obvious difference in the performance of traversals in 2.3 between the Community Edition and Enterprise Edition: the main differences remain at the operations level (backup, high availability), as well as the Cypher optimizer.
A few more figures
I ran a single request to get both its allocations and the actual heap usage, using Yourkit. The Neo4j instance is still configured with a 4 GB heap, and the heap is done in depth-first mode.
Of course, the heap is the same for both scenarios using 2.2, as the cache is filled with the same data after the request, but the allocations of the request are much higher when the cache starts empty, as Neo4j will read the data on disk and cache it. Version 2.3, without the object cache, has a really small memory footprint, and the allocations of the request are much lower than in the two other cases. The response time is worse than when using the cache in 2.2, despite the reduced allocations, because it's limited by the disk reads whereas 2.2 only reads memory.
Neo4j 2.3 has sacrificed part of the native performance of repeated operations for data fitting in the cache, to better address the general use case. Yes, there are negative effects on some scenarios, but the overall result is positive:
<li> better predictability of the performances </li> <br /> <li> less heap pressure on the JVM, so less garbage collection and less pauses </li> <br />
The other takeaway is that beyond measuring the performance, you have to understand where it comes from, even if it's made difficult by using third-party code which we don't know as well as our own. But anyway, it's always better to measure than to guess.
Finally, in a future article, I'll come back to the changes we made in our code to reduce the real degradation of the response times seen by the users, and I'll widen the comparison to Neo4j 3.0, which has been released since.
Updated on July 20th, 2016: the follow-up article is available under “Optimization strategies for traversals in Neo4j”.