Performance of traversals in Neo4j


The pitfalls of benchmarking


This article is also available in French under the title « Performance des traversées dans Neo4j ».

The project I’m currently working on (transparency of the supply chain (in French), food traceability through a graph-oriented approach (in French) or GraphConnect – Tracing the World Food from Farm to Fork) is built around the Neo4j graph database (N.B.: ekino is a partner of Neo Technology), and we recently had the opportunity to migrate the application from version 2.1 to Neo4j 2.3. The application performs a number of graph traversals over hundreds of thousands of nodes and relationships, for a web application and not for offline analysis, so the performance of these traversals is extremely important.

As it happens, the 2.2 and 2.3 releases of Neo4j focused on performance (especially for Cypher, Neo4j’s query language), and we expected a direct benefit. We’ll see what the actual results are, and analyse the reasons for the changes.

Wrong direction?

Surprise! The update shows increased response times in our performance tests, by up to a factor of 2. A first comparison with a profiler shows an explosion in the cost of the Neo4j core API operations handling nodes and relationships, as well as an increase in the memory allocated per request (garbage: the objects created while processing the request and collectable at its end).

I decide to create a simple example demonstrating the problem and push it to GitHub, then discuss it on the Neo4j Slack.

The example

It’s available as an unmanaged extension for Neo4j, that creates its own test data and exposes an HTTP service which performs a traversal similar to what our application does, using several core APIs: Node::hasLabel, Node::getProperty, Node::getRelationships.

The data is organized as a tree, from a root node of type A, connected to nodes of type B (through a HAS_B relationship), connected to nodes of type A (through a HAS_A relationship), etc. on 10 levels of A/B and with 4 children per parent (i.e. 1,398,101 nodes and 1,398,100 relationships). Each node also has a boolean property with a value assigned randomly.
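The node count can be sanity-checked from the structure alone: with 4 children per parent, the root plus 10 levels below it form a complete 4-ary tree, and a tree always has one fewer relationship than nodes. A quick check (plain Java, no Neo4j involved):

```java
// Sanity check of the data set size: a complete 4-ary tree, root plus
// 10 levels, i.e. (4^11 - 1) / 3 nodes and (nodes - 1) relationships.
public class TreeSize {
    // Total nodes of a complete tree with `levels` levels below the root
    static long nodeCount(int levels, int children) {
        long nodes = 0;
        long atLevel = 1; // the single root node
        for (int depth = 0; depth <= levels; depth++) {
            nodes += atLevel;
            atLevel *= children;
        }
        return nodes;
    }

    public static void main(String[] args) {
        long nodes = nodeCount(10, 4);
        System.out.println(nodes + " nodes, " + (nodes - 1) + " relationships");
        // → 1398101 nodes, 1398100 relationships
    }
}
```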

Portion of the example graph

The traverser goes through the nodes in breadth-first mode. It is configured with a PathExpander that uses the label of the current node to choose the type of the next relationship to follow, and an Evaluator that also uses the label to select only B nodes and count those whose property is true. The traversal starts from the central node (identified by the Root label). The service finally returns the counter in the HTTP response, to avoid Dead Code Elimination by the Just-In-Time (JIT) compiler of the JVM.
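Condensed, such a traverser boils down to the following sketch (Neo4j 2.x core API, so it needs the Neo4j jars to compile; the label and relationship names match the example data set, while the boolean property name "flag" and the method name are assumptions for illustration):

```java
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.traversal.*;

// Sketch of the example's traverser (Neo4j 2.x core API). `db` is the
// GraphDatabaseService, `root` the node carrying the Root label.
public class CountingTraversal {
    static final Label A = DynamicLabel.label("A");
    static final Label B = DynamicLabel.label("B");
    static final RelationshipType HAS_A = DynamicRelationshipType.withName("HAS_A");
    static final RelationshipType HAS_B = DynamicRelationshipType.withName("HAS_B");

    static long countTrueBs(GraphDatabaseService db, Node root) {
        TraversalDescription description = db.traversalDescription()
                .breadthFirst()
                .expand(new PathExpander<Object>() {
                    @Override
                    public Iterable<Relationship> expand(Path path, BranchState<Object> state) {
                        // The label of the current node selects the next relationship type
                        Node node = path.endNode();
                        return node.getRelationships(Direction.OUTGOING,
                                node.hasLabel(A) ? HAS_B : HAS_A);
                    }

                    @Override
                    public PathExpander<Object> reverse() {
                        throw new UnsupportedOperationException();
                    }
                })
                .evaluator(new Evaluator() {
                    @Override
                    public Evaluation evaluate(Path path) {
                        // Only B nodes are returned; the traversal always continues
                        return path.endNode().hasLabel(B)
                                ? Evaluation.INCLUDE_AND_CONTINUE
                                : Evaluation.EXCLUDE_AND_CONTINUE;
                    }
                });

        long count = 0;
        try (Transaction tx = db.beginTx()) {
            for (Path path : description.traverse(root)) {
                // Count B nodes whose boolean property is true
                if (Boolean.TRUE.equals(path.endNode().getProperty("flag", Boolean.FALSE))) {
                    count++;
                }
            }
            tx.success();
        }
        return count;
    }
}
```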

The test data is large enough that a single traversal is measured in seconds, making any noise from the measurement itself negligible.

Comparison of Neo4j 2.2 versus 2.3

The extension is compiled and deployed in two instances of Neo4j 2.2 (because it’s similar in that regard to 2.1) and 2.3, with the same configuration of the disk cache dbms.pagecache.memory=500m (the generated data set occupies about 400 MB on disk) and heap parameters for the JVM set to respectively -Xms2g -Xmx4g and -Xms512m -Xmx512m (Neo4j 2.3 needs less memory as it removed the in-memory object cache).
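For reference, in the 2.x tarball layout these settings live in two files (file names and wrapper keys as in the stock 2.x distribution; shown here for the 2.2 instance, the 2.3 one differing only by its heap values):

```properties
# conf/neo4j.properties (identical on both instances): disk cache,
# slightly above the ~400 MB the generated store occupies on disk
dbms.pagecache.memory=500m

# conf/neo4j-wrapper.conf on the 2.2 instance (values in MB): -Xms2g -Xmx4g
wrapper.java.initmemory=2048
wrapper.java.maxmemory=4096
```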

The test is a measure of the response time for 100 traversal requests executed sequentially in a single thread, after “warming up” the JVM with 100 traversal requests and discarding the results, to let the JIT compilation happen and reach a stable state.
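The measurement loop can be sketched as a minimal harness (the Runnable stands in for one HTTP traversal request, and the nearest-rank quantile is an assumption about how the percentiles below were computed):

```java
import java.util.Arrays;

// Minimal benchmark harness: discard warmup runs so the JIT reaches a
// stable state, then record the timings of the measured runs.
public class Bench {
    static long[] measureMillis(Runnable request, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            request.run(); // warmup, results discarded
        }
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            request.run();
            times[i] = (System.nanoTime() - start) / 1_000_000;
        }
        Arrays.sort(times);
        return times;
    }

    // Nearest-rank quantile over the sorted timings
    static long quantile(long[] sorted, double q) {
        return sorted[(int) Math.round(q * (sorted.length - 1))];
    }

    public static void main(String[] args) {
        long[] t = measureMillis(() -> { /* one traversal request here */ }, 100, 100);
        System.out.printf("median=%dms p90=%dms p95=%dms max=%dms%n",
                quantile(t, 0.5), quantile(t, 0.9), quantile(t, 0.95), t[t.length - 1]);
    }
}
```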

It’s executed on my 15-inch, mid-2014 MacBook Pro with a 4-core 2.2 GHz Intel Core i7 processor and 16 GB of RAM, running OS X 10.11.4 (El Capitan).

Without reaching the factor of 2 observed previously, the results show response times degraded by more than 50%, and now I have a test case to share.

Comparison of the response times in milliseconds between 2.2 and 2.3
Version         Neo4j Enterprise 2.2.7   Neo4j Enterprise 2.3.2
Mean                              1914                     3109
50% (median)                      1890                     3088
90%                               2169                     3358
95%                               2254                     3470
Maximum                           2322                     3707
Histogram of the 2.2 and 2.3 response times, higher for 2.3

The times for 2.3 are higher than for 2.2 with our benchmark, in all the observed quantiles

The discussion

I then discuss the benchmark through Slack and GitHub, especially with Michael Hunger (@mesirii, Developer Relations at Neo Technology), who gets different results with a fork of my project where he created a unit test using my code with an embedded Neo4j database: his comparisons show that 2.3 is faster than 2.2.

However, there are several differences between the benchmarks at this point:

  • his results show that the size of the heap has an influence, contrary to what I had observed
  • his traversals are parallel instead of sequential, which means we don’t measure the same kind of load
  • by using an embedded database, he uses Neo4j Community Edition whereas I’m using Neo4j Enterprise Edition, which has a High-Performance (Object) Cache until version 2.2
  • finally, he changed the traversal mode from breadth-first to depth-first

I modify his code slightly again to go back to a sequential execution and allow more time for the warmup and the JIT, but I still get results comparable to his.

We discuss the last point on Slack: the depth-first mode allocates less memory, as it’s a simple recursion, whereas the breadth-first mode maintains the state of all the traversed branches in memory. Since at that time we used the breadth-first mode in the application (the real one, not the example) for several optimisations, I had kept that mode in the example.
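The memory difference can be illustrated outside Neo4j: on a complete 4-ary tree of depth 10 (the shape of the example data set), a breadth-first frontier has to hold a whole level at its widest point, while a depth-first stack only holds one branch plus its pending siblings. A small sketch, counting the peak frontier size of each mode:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustration (plain Java, no Neo4j): peak frontier size of breadth-first
// versus depth-first on a complete 4-ary tree of depth 10.
public class FrontierSize {
    static long bfsPeak(int depth, int children) {
        Deque<Integer> queue = new ArrayDeque<>(); // FIFO, stores node depths
        queue.add(0);
        long peak = 0;
        while (!queue.isEmpty()) {
            peak = Math.max(peak, queue.size());
            int d = queue.poll();
            if (d < depth) {
                for (int i = 0; i < children; i++) queue.add(d + 1);
            }
        }
        return peak;
    }

    static long dfsPeak(int depth, int children) {
        Deque<Integer> stack = new ArrayDeque<>(); // LIFO, stores node depths
        stack.push(0);
        long peak = 0;
        while (!stack.isEmpty()) {
            peak = Math.max(peak, stack.size());
            int d = stack.pop();
            if (d < depth) {
                for (int i = 0; i < children; i++) stack.push(d + 1);
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("BFS peak frontier: " + bfsPeak(10, 4)); // 1048576 (a whole level)
        System.out.println("DFS peak frontier: " + dfsPeak(10, 4)); // 31 (one branch + siblings)
    }
}
```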

To get a better insight, I add support for the depth-first mode in the example (using a query parameter), and run a new test series using the two modes and several heap sizes.

That’s when things get interesting and help me reason about the problem:

  1. Neo4j 2.3 is much less sensitive to the heap size
  2. The depth-first mode is much faster than breadth-first
  3. With “enough” memory, Neo4j Enterprise 2.2 remains faster than 2.3
Comparison of the response times between the 2 versions, as a function of the traversal mode and the heap
Version            Neo4j Enterprise 2.2.7                Neo4j Enterprise 2.3.2
Traversal      Breadth-First      Depth-First        Breadth-First      Depth-First
Heap (GB)       1     2     4     1     2     4       1     2     4     1     2     4
Mean         6616  3441  2045  2229  2209  1276    2771  2500  2465  1544  1543  1540
50%          6308  3281  2022  2228  2208  1277    2748  2487  2443  1549  1517  1547
90%          8520  3823  2296  2406  2415  1391    2981  2672  2674  1679  1690  1693
95%          8820  5268  2353  2454  2488  1424    3145  2773  2709  1711  1727  1716
Maximum     10591  5807  2856  2485  2795  1513    3722  3321  2936  1882  1771  1757

Histogram of the response times by version, traversal mode and heap

Lessons learned

Depth-first is faster than breadth-first

Our real traversers in the application themselves manage reaching a single child node through multiple paths from the root, because we keep the best “distance” to the root node for some key nodes. Using this metric, it’s possible to stop the traversal beyond a known node if the current distance is worse than the recorded one. We had therefore deactivated the uniqueness management in Neo4j (to avoid duplicating the work) and used the breadth-first mode, as it guarantees that a node will always be reached by the shortest path first.

It seemed logical, but in hindsight it was an error not to have compared the two modes to validate the hypothesis. The benchmark in the example shows the result is not that obvious.

In practice, it actually made me switch back to the depth-first mode as it ended up being faster despite the probability of duplicate work (given 2 paths leading to the same node from the root, there’s a 50% chance of taking the shortest path first): eliminating the state management and its associated memory usage gives better results.

With enough memory, Neo4j 2.2 seems faster

What’s using that memory to make 2.2 faster than 2.3? The cache, of course! More precisely, the “object cache”, removed in 2.3, which keeps in memory all the data it can for nodes and relationships (the High-Performance Cache of the Enterprise Edition adapts to the available memory). It eliminates disk reads.

But what’s the connection with my test scenario? My data set (both in the example and the application) represents an interesting volume for the traversal, but still fits in RAM.

The comparison was biased from the start

It’s the problem with so-called “synthetic” benchmarks: you measure a unique scenario, in perfect conditions, and you really have to pay attention to what you’re effectively measuring. In practice, with my 2.2 instance, I was not measuring the speed of a raw traversal, but of a traversal with all the data in cache – a perfect cache, as it were.

What would have happened the day the application had enough data that it no longer fit in RAM? Either you add memory (vertical scalability, but you can only go so far, and there are real, almost exponential, negative effects on garbage collection pause times), or the response times degrade suddenly and start looking like the RAM-limited scenario in the table.

The cache was removed in Neo4j 2.3 to improve the scalability, not necessarily the performance: on a “small” data set, the effect is negative, but there is a gain on the predictability of performances which depend much less on the total size of the data set.

Let’s adjust the performance comparison

To compare the raw traversal in both versions, we need a way to clear the cache between each request on the 2.2 instance. The High-Performance Cache of the Enterprise Edition exposes an MBean (JMX) for that.
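Such a cache-clearing service essentially reduces to a JMX lookup from inside the extension. The "org.neo4j" JMX domain is real, but the exact bean names vary between versions, so this sketch simply looks for any Neo4j MBean exposing a no-argument clear() operation (check the beans your instance actually exposes, e.g. with jconsole):

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanOperationInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch of the cache-clearing service; runs inside the Neo4j JVM, where
// the caches are registered on the platform MBean server.
public final class CacheClearer {
    // Returns the number of caches cleared (0 outside a Neo4j Enterprise JVM)
    static int clearCaches() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            int cleared = 0;
            for (ObjectName bean : server.queryNames(new ObjectName("org.neo4j:*"), null)) {
                for (MBeanOperationInfo op : server.getMBeanInfo(bean).getOperations()) {
                    if ("clear".equals(op.getName()) && op.getSignature().length == 0) {
                        server.invoke(bean, "clear", new Object[0], new String[0]);
                        cleared++;
                    }
                }
            }
            return cleared;
        } catch (JMException e) {
            throw new IllegalStateException("JMX lookup failed", e);
        }
    }
}
```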

I add a new service to the example using that MBean, and run the tests again, but only for the best case scenario (4 GB heap, depth-first mode) where Neo4j 2.2 was faster.

The results with an empty cache are much worse, and it’s hard to believe it’s only due to the overhead of the cache itself. Besides eliminating the object cache, Neo4j 2.3 has obviously had performance improvements. 👍

Comparison of the response times between the 2 versions, depending on the cache
Version       Neo4j Enterprise 2.2.7    Neo4j Enterprise 2.3.2
Scenario     Full cache   Empty cache                  No cache
Mean               1276          2938                      1540
50%                1277          2889                      1547
90%                1391          3178                      1693
95%                1424          3238                      1716
Maximum            1513          3642                      1757
Histogram of the response times for the 3 scenarios, showing the performance advantage of 2.3 over a cache-less 2.2

2.3 is much faster than 2.2 as soon as the data is not in cache

It also explains the differences with the benchmark by Michael Hunger: the cache implementations available in Neo4j Community Edition are slower than the High-Performance Cache, and don’t compensate for the intrinsically lower performance of 2.2. As a corollary, there is no obvious difference in traversal performance in 2.3 between the Community and Enterprise Editions: the main differences remain at the operational level (backup, high availability), as well as in the Cypher optimizer.

A few more figures

I ran a single request to get both its allocations and the actual heap usage, using YourKit. The Neo4j instance is still configured with a 4 GB heap, and the traversal is done in depth-first mode.

Of course, the heap is the same for both scenarios using 2.2, as the cache is filled with the same data after the request, but the allocations of the request are much higher when the cache starts empty, as Neo4j will read the data on disk and cache it. Version 2.3, without the object cache, has a really small memory footprint, and the allocations of the request are much lower than in the two other cases. The response time is worse than when using the cache in 2.2, despite the reduced allocations, because it’s limited by the disk reads whereas 2.2 only reads memory.

Comparison of the memory usage between the 2 versions, depending on the cache
Version                           Neo4j Enterprise 2.2.7    Neo4j Enterprise 2.3.2
Scenario                         Full cache   Empty cache                  No cache
Used heap (MB)                          372           374                        24
Allocations of the request (MB)         144           332                        76
Position of the 3 scenarios by static and dynamic memory usage

Version 2.3 has a definite advantage on both static and dynamic memory usage

Conclusion

Neo4j 2.3 has sacrificed part of the native performance of repeated operations for data fitting in the cache, to better address the general use case. Yes, there are negative effects on some scenarios, but the overall result is positive:

  • better predictability of the performances
  • less heap pressure on the JVM, so less garbage collection and fewer pauses

The other takeaway is that beyond measuring performance, you have to understand where it comes from, even when that’s made difficult by third-party code we don’t know as well as our own. In any case, it’s always better to measure than to guess.

Finally, in a future article, I’ll come back to the changes we made in our code to reduce the real degradation of the response times seen by the users, and I’ll widen the comparison to Neo4j 3.0, which has been released since.

Updated on July 20th, 2016: the follow-up article is available under “Optimization strategies for traversals in Neo4j”.