Optimization strategies for traversals in Neo4j

Work less, get more (performance)

This article is also available in French under the title « Stratégies d’optimisation des traversées dans Neo4j ».

This article is a follow-up of “Performance of traversals in Neo4j”.

The story could have ended after my analysis of the evolution of traversal times, but I still had a real problem: users were observing degraded response times, and it was difficult to argue that it was for the greater good and for better scalability in the future.

Going back to the initial problem, the cost of operations in the Neo4j core API (Node::hasLabel, for example) exploded compared to the version using a cache, and now represents a significant part of the execution time. What can be done to alleviate that cost?

Preliminary analysis

The task is probably bound by disk reads, since we’re in a database. The best optimization would be to remove parts of the work that aren’t actually needed, rather than trying to make them run faster.

What is our traversal doing exactly? It goes through each node in the tree: the Evaluator component checks whether the node is of type B and whether its property value is true, and always continues the traversal; the PathExpander component then checks whether the node is of type A or B to fetch the appropriate outgoing relationships. That means between 2 and 3 calls to Node::hasLabel per node, whether it’s of type A or B. There’s also one call to Node::getRelationships per node to expand the traversal, and one call to Node::getProperty per node of type B. It would be difficult to do less functionally, and there’s unfortunately no code to cut.

Flowchart of the traversal per node
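In code, the traversal described above looks roughly like the following sketch; the label, relationship type and property names (A, B, A_TO_B, B_TO_A, "value") are placeholders for the real model, which is not detailed here.

    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Label;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Path;
    import org.neo4j.graphdb.PathExpander;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.traversal.BranchState;
    import org.neo4j.graphdb.traversal.Evaluation;
    import org.neo4j.graphdb.traversal.Traverser;

    public class TreeTraversal {
        enum Labels implements Label { A, B }
        enum Types implements RelationshipType { A_TO_B, B_TO_A }

        static Traverser traverse(GraphDatabaseService db, Node root) {
            return db.traversalDescription()
                .depthFirst()
                // 1 or 2 Node::hasLabel calls per node to pick the outgoing relationships
                .expand(new PathExpander<Object>() {
                    @Override
                    public Iterable<Relationship> expand(Path path, BranchState<Object> state) {
                        Node node = path.endNode();
                        return node.hasLabel(Labels.A)
                            ? node.getRelationships(Direction.OUTGOING, Types.A_TO_B)
                            : node.getRelationships(Direction.OUTGOING, Types.B_TO_A);
                    }

                    @Override
                    public PathExpander<Object> reverse() {
                        throw new UnsupportedOperationException();
                    }
                })
                // 1 Node::hasLabel call per node, plus 1 Node::getProperty call for B nodes
                .evaluator(path -> {
                    Node node = path.endNode();
                    boolean matches = node.hasLabel(Labels.B)
                        && Boolean.TRUE.equals(node.getProperty("value", false));
                    return matches ? Evaluation.INCLUDE_AND_CONTINUE
                                   : Evaluation.EXCLUDE_AND_CONTINUE;
                })
                .traverse(root);
        }
    }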

Let’s use a profiler to find out the cost distribution of the calls to Node::hasLabel during a traversal. I’m exploring incrementally, using the various modes of YourKit.

First, in CPU sampling mode, which only has a small overhead. Node::hasLabel (or more precisely, its implementation NodeProxy::hasLabel) is measured as contributing 19% of the traversal, with the cost of its own calls distributed as follows:

  • 16% of OperationsFacade::labelGetForName, to get the label index from its name
  • 84% of OperationsFacade::nodeHasLabel, which itself calls a single method, which calls another single one, etc., with decreasing relative cost (each method has its own cost, added to the cost of what it calls). The descent ends with methods such as NodeStore::loadRecord or MuninnReadPageCursor::next, which do look like disk reads (load, read page), so it’s consistent with what we expected.

Call tree sampled by YourKit

Let’s then use the CPU tracing mode, which is more invasive but gives us the number of calls and a finer measure of method durations. Node::hasLabel now represents 28% of the traversal (observation changes the measurement), and its cost is distributed as follows:

  • 7% of OperationsFacade::labelGetForName
  • 73% of OperationsFacade::nodeHasLabel
  • 19% of its own code
  • 1% elsewhere

It’s still data reads, but we now know that there are 3 million calls to Node::hasLabel, which matches the expected 2 to 3 calls per node.

Call tree instrumented by YourKit, without capturing the memory allocations

Finally, by enabling both the CPU tracing mode and the capture of memory allocations, we get a view combining the CPU cost with the memory flow, at the cost of higher bias and a longer execution time. Node::hasLabel now contributes 42% to the traversal:

  • 2% of OperationsFacade::labelGetForName
  • 92% of OperationsFacade::nodeHasLabel
  • 6% of its own code

Call tree instrumented by YourKit, when capturing the memory allocations

Obviously, reads still constitute most of the cost. Memory allocations for the request are measured at 76 MB, distributed as follows:

  • 615,155 instances of long[], i.e. 14 MB
  • 307,591 InlineNodeLabels and as many Bits, i.e. 7 and 9 MB respectively
  • 139,822 int[], i.e. 3 MB
  • 139,818 NodeProxy, NodeProxy$2 and TraversalBranchImpl, i.e. 3, 3 and 5 MB respectively
  • 139,808 RelationshipType[], RelationshipConversion and RelationshipProxy, i.e. 3, 4 and 4 MB respectively
  • etc.

Memory allocations captured by YourKit

100% of the long[] allocated during the request are actually caused by Node::hasLabel! If we can limit the number of calls to the method, there is potential for improved allocations.

Potential optimizations

Labels

With 2 to 3 calls to Node::hasLabel per node, 1 to Node::getRelationships and 0 or 1 to Node::getProperty, we can’t do much for a single request. However, inspired by an article by Max de Marzi (@maxdemarzi, Support Engineer at Neo Technology) about speeding up traversals, we can explore the possibilities if we cache data across all requests. In the real application, we have around 10 calls to Node::hasLabel per node anyway, so there would be more opportunity to amortize the cost of a cache local to the request.

I’ll try to replace disk reads and the related allocations with another type of allocation, and see if there’s a benefit.

The simplest approach would be to cache with a composite key based on the node and the label, but that involves a lot of objects just to get a simple boolean, and we might do better. By looking at the implementation of Node::hasLabel (subject to change, though), we can observe a simple iteration over the labels until the requested one is found: it doesn’t use a more sophisticated structure (inverted index, hash map, etc.). Therefore, if a node has n labels, the search for a label will statistically iterate over n/2 labels before stopping, and 2 calls to Node::hasLabel will iterate over the equivalent of n labels, which is what Node::getLabels does. Of course, Node::getLabels adds an overhead since it creates the collection of labels instead of returning a simple boolean, but if Node::hasLabel is called more than twice per node on average, using Node::getLabels to get all the labels once and for all and store them in a structure optimized for lookup (a Set, for example) could improve the situation.

Actually, a Set is fine, but why not go even further? What’s a Label? Its API is really simple:
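The org.neo4j.graphdb.Label interface boils down to this (Javadoc omitted):

    public interface Label {
        String name();
    }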

It looks just like an Enum, and that’s actually a known implementation pattern:
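For instance, with the illustrative labels already used in the traversal sketch above:

    public enum Labels implements Label {
        A, B
    }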

As a Set, we could specifically use an EnumSet, which is more efficient as it’s based on the ordinal value of the enum. Continuing in that direction, a label present in the enum is equivalent to an index (its ordinal value), which can be encoded using bit masking: the n-th bit represents the presence of label n on the node, whether its value is 0 or 1. That’s exactly what I’m going to do: since we only have 3 labels, we can do bit masking on a byte and cache the labels of a node (or more precisely, of its Neo4j identifier, a long) in that byte.

Since we’re already using fastutil in the application to get primitive collections (which use less memory than their object counterparts), I’ll use a Long2ByteOpenHashMap for the cache.
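A minimal sketch of such a cache could look like this (the LabelCache name and the Labels enum are the illustrative ones from above; the sentinel value assumes fewer than 8 labels, and every label on a node is assumed to be present in the enum):

    import it.unimi.dsi.fastutil.longs.Long2ByteMap;
    import it.unimi.dsi.fastutil.longs.Long2ByteOpenHashMap;
    import org.neo4j.graphdb.Label;
    import org.neo4j.graphdb.Node;

    public class LabelCache {
        // -1 (all bits set) cannot be a valid mask as long as there are fewer than 8 labels
        private static final byte ABSENT = -1;
        private final Long2ByteMap bitsByNodeId = new Long2ByteOpenHashMap();

        public LabelCache() {
            bitsByNodeId.defaultReturnValue(ABSENT);
        }

        public boolean hasLabel(Node node, Labels label) {
            byte bits = bitsByNodeId.get(node.getId());
            if (bits == ABSENT) {
                // A single Node::getLabels call replaces the repeated Node::hasLabel calls
                bits = 0;
                for (Label l : node.getLabels()) {
                    bits |= 1 << Labels.valueOf(l.name()).ordinal();
                }
                bitsByNodeId.put(node.getId(), bits);
            }
            return (bits & (1 << label.ordinal())) != 0;
        }
    }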

Properties

As properties can be of a number of types (boolean, number, string, or an array of one of these types), we can only cache the read properties in a Map<String, Object>.

However, I can use a trick based on the fact that there’s only one property per node in the example, and use Collections::singletonMap to cache the first property, only falling back to a HashMap once there are two; it’s much more compact in memory.
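A rough sketch of that per-node property cache (names are illustrative, error handling omitted):

    import it.unimi.dsi.fastutil.longs.Long2ObjectMap;
    import it.unimi.dsi.fastutil.longs.Long2ObjectOpenHashMap;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.neo4j.graphdb.Node;

    public class PropertyCache {
        private final Long2ObjectMap<Map<String, Object>> propertiesByNodeId =
            new Long2ObjectOpenHashMap<>();

        public Object getProperty(Node node, String key) {
            Map<String, Object> cached = propertiesByNodeId.get(node.getId());
            if (cached == null) {
                // First property read for this node: a singleton map is enough
                Object value = node.getProperty(key, null);
                propertiesByNodeId.put(node.getId(), Collections.singletonMap(key, value));
                return value;
            }
            if (cached.containsKey(key)) {
                return cached.get(key);
            }
            // Second distinct property: switch to a (mutable) HashMap
            Map<String, Object> mutable = new HashMap<>(cached);
            Object value = node.getProperty(key, null);
            mutable.put(key, value);
            propertiesByNodeId.put(node.getId(), mutable);
            return value;
        }
    }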

Relationships

Caching relationships is much more complex: the simplest solution would be to cache only the identifiers of the relationships and reload them on the fly. It’s not so easy to implement, and the potential gains are limited. I decided not to implement it, especially as it’s not a scenario that can be optimized in the application anyway, where the same relationship is hardly ever traversed more than once.

Measuring the effects

All this is implemented in a new branch of the example, using facades to allow independent activation of the caches (via query parameters).
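For example, a node facade can consult the label cache when it’s enabled and fall back to the core API otherwise; a simplified sketch, with illustrative names:

    import org.neo4j.graphdb.Node;

    public class CachedNode {
        private final Node delegate;
        private final LabelCache labelCache; // null when the cache is disabled

        public CachedNode(Node delegate, LabelCache labelCache) {
            this.delegate = delegate;
            this.labelCache = labelCache;
        }

        public boolean hasLabel(Labels label) {
            return labelCache != null
                ? labelCache.hasLabel(delegate, label)
                : delegate.hasLabel(label);
        }
    }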

All measurements are done on a Neo4j Enterprise 2.3.2 instance with a 4 GB heap, and a traversal in depth-first mode. Neo4j runs on my 15-inch, mid-2014 MacBook Pro with a 2.2 GHz 4-core Intel Core i7 processor and 16 GB of RAM, running OS X 10.11.4 (El Capitan), with a 64-bit Oracle JVM 1.8u92.

The label cache gives a higher gain than the property cache, and combining the two improves the response times by around 30%.

Comparison of the response times in milliseconds depending on the caches
Scenario                          Reference   Label cache   Property cache   Both caches
Mean                                   1540          1209             1387          1067
50%                                    1547          1207             1384          1053
90%                                    1693          1229             1427          1100
95%                                    1716          1241             1432          1192
Maximum                                1757          1284             1447          1248
Used heap (MB)                           24            42               40            58
Allocations of the request (MB)          76            46               73            43

Histogram of the response times for the 4 scenarios combining the caches, showing the best improvement with both active

The effect is largely positive, with a limited increase in used memory. The allocation of the permanent caches during the first request (18 MB for the label cache, 16 MB for the property cache) is more than compensated by the reduced allocations in every subsequent request: the label cache has the biggest impact by eliminating 30 MB of allocations (about 40%), whereas the property cache only eliminates 3 MB. The memory pressure is reduced, there are fewer garbage collections, fewer pauses, etc.

Position of the 4 scenarios by static and dynamic memory usage

How about not modifying code?

When you’re using a platform, another strategy is possible: take advantage of the low-level optimizations of that platform, when available. Since we migrated to version 2.3, Neo4j 3.0 has been released. It’s a new major version, with both functional and technical evolutions. Independently of our application’s upgrade plan to 3.0, let’s have a look at what it brings, using the example.

Actually, we’re based on a second platform: the JVM. I neglected that part when we migrated from version 2.1 to 2.3: the garbage collector (GC) configured by default in Neo4j changed from Concurrent Mark Sweep (CMS) to Garbage First (G1), and we also used the migration to upgrade from JDK 7 to JDK 8. However, even if G1 is supposed to supersede CMS, a lot of users are experiencing mixed results (like Elasticsearch or its users, for example), and it has seen lots of fixes up to recent versions of the JDK. We have a synthetic benchmark, not representative of a production situation, but I’ll use it anyway to explore how it’s affected by the GC algorithm.
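For reference, switching back to CMS only takes one JVM flag passed through the Neo4j configuration; a sketch of what that looks like, to be checked against the configuration files of your installation (in 2.3, it replaces the default G1 flag):

    # Neo4j 2.3, conf/neo4j-wrapper.conf
    wrapper.java.additional=-XX:+UseConcMarkSweepGC

    # Neo4j 3.0, conf/neo4j-wrapper.conf
    dbms.jvm.additional=-XX:+UseConcMarkSweepGC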

As during the comparison between versions 2.1 and 2.3, the same jar containing the unmanaged extension is deployed on different versions of Neo4j Enterprise with the same configuration: the heap is still at 4 GB, and the traversal is done in depth-first mode, with no active cache.

If we focus on the cases using G1, a degradation could be observed with version 3.0.0-RC1 (Release Candidate 1), but the final version actually shows a slight improvement in response times over 2.3.2. A spectacular gain can be seen starting with version 3.0.2, where response times are reduced by around 25%. Version 3.0.3 even manages to score a little better.

However, when comparing G1 and CMS, CMS gives better response times in almost all cases, with a huge difference in 2.3.2, where response times are 25% lower. Since the results converge in 3.0.2 and 3.0.3, we can imagine that the improved performance comes partly from reduced allocations, which would decrease the weight of the garbage collector in the equation.

Comparison of the response times between the versions of Neo4j
Version    2.3.2        3.0.0 RC1    3.0.0        3.0.1        3.0.2        3.0.3
GC         G1     CMS   G1     CMS   G1     CMS   G1     CMS   G1     CMS   G1     CMS
Mean       1438   1100  1521   1461  1367   1281  1406   1252   980   1027  1067    993
50%        1395   1098  1517   1429  1365   1282  1386   1251   980    992  1067    992
90%        1596   1131  1548   1616  1384   1294  1498   1264   990   1125  1075   1006
95%        1619   1140  1556   1654  1391   1298  1560   1268   994   1134  1078   1010
Maximum    1694   1168  1637   1735  1409   1323  1630   1305  1006   1163  1120   1049

Histogram of the response times, by version and GC, showing a continuous improvement

I haven’t compared the versions with a profiler (yet), but obviously the engineers at Neo Technology did a good job, even on minor 3.0.x versions.

Conclusion

Based on these results, I applied a slightly different label cache to the real application:

  • it uses a larger type than byte (as we have more than 8 labels)
  • the cache is local to the HTTP request, and not permanent as in the example

Making the cache local means we don’t have to manage its invalidation, its memory footprint, or the related pauses. The result is positive nonetheless because, for a single node, we can get around 10 label calls (to classify it, find which relationships to expand, filter it, etc.). The effect is especially significant for allocations. With that single optimization, response times have been reduced by 23%, bringing them back to an acceptable level, even if they remain higher than the “perfect” measurement with Neo4j 2.2.

Finally, the reduced allocations and the shorter GC pauses give us more stable response times, but the GC algorithm itself should not be neglected either. However, a single benchmark is not representative enough for that choice, and G1 and CMS should really be compared with real traffic on the application before deciding.