Investigating Neo4j Indexes
Published Dec 23, 2020 • Last reviewed Dec 23, 2020
An investigation into the trade-offs of a Neo4j database index.
Over the past year, I've been working on a product recommendation system built on Neo4j that held product information for over 100,000 items. Databases with upwards of 5 million nodes and 30 million relationships are running in production, so this was by no means pushing Neo4j to its limit, but it was large enough that performance suffered if no optimizations were made.
Database Indexes
One such optimization was to create an index for the database. A database index is a data structure that speeds up lookups on the data it covers. A good simplification is a phone book. When you're looking for a particular entry, you won't read through every listing, as there might be hundreds of thousands of them. Instead, the entries are sorted by name and the book comes with tabs, so you can jump straight to the right section.
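To make the phone book analogy concrete, here is a minimal Python sketch (not Neo4j code) that compares reading every entry with a binary search over sorted entries, counting how many entries each approach inspects. The function names `linear_lookup` and `sorted_lookup` are made up for this illustration.

```python
def linear_lookup(entries, target):
    """Check every entry in turn, like reading the phone book front to back."""
    inspected = 0
    for position, entry in enumerate(entries):
        inspected += 1
        if entry == target:
            return position, inspected
    return None, inspected


def sorted_lookup(sorted_entries, target):
    """Binary search over sorted entries, like jumping to the right tab."""
    inspected = 0
    low, high = 0, len(sorted_entries) - 1
    while low <= high:
        middle = (low + high) // 2
        inspected += 1
        if sorted_entries[middle] == target:
            return middle, inspected
        if sorted_entries[middle] < target:
            low = middle + 1
        else:
            high = middle - 1
    return None, inspected


entries = list(range(1, 1_000_001))  # already sorted: 1..1,000,000
print(linear_lookup(entries, 900_400))
print(sorted_lookup(entries, 900_400))
```

For the value 900,400 the full scan inspects 900,400 entries, while the binary search never needs more than 20 for a list of this size.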
Process
I first created two droplets on DigitalOcean and configured Neo4j identically on both. I then created a property index in one database (Alpha) and left the other one as is (Beta).
CREATE INDEX value FOR (number:Number) ON (number.value)
I then created 1,000,000 nodes in both databases.
FOREACH(i in RANGE(1, 1000000) | CREATE (:Number {value: i}))
I didn't actually run the Cypher above as a single statement; I had to do it in a few steps.
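The post does not say which steps were used. One plausible approach, assumed here, is to split the 1,000,000 values into inclusive ranges and run one smaller `CREATE` statement per range. The sketch below only generates the ranges and prints the statement with its parameters; the helper name `batch_ranges` and the batch size of 100,000 are arbitrary choices for illustration.

```python
def batch_ranges(total, batch_size):
    """Yield inclusive (start, end) ranges that cover 1..total exactly once."""
    for start in range(1, total + 1, batch_size):
        yield start, min(start + batch_size - 1, total)


# Cypher's range() is inclusive at both ends, matching the ranges above.
CREATE_BATCH = "UNWIND range($start, $end) AS i CREATE (:Number {value: i})"

batches = list(batch_ranges(total=1_000_000, batch_size=100_000))
for start, end in batches:
    # Each pair would be sent as the parameters of one CREATE_BATCH statement.
    print(CREATE_BATCH, {"start": start, "end": end})
```

Keeping each statement small bounds the size of any single transaction, which is the usual reason for batching large writes.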
From there, I randomly generated 5 unique numbers between 1 and 1,000,000 and fetched the matching node from both databases. Here are the results:
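Drawing the test values could look like the following sketch. It assumes uniform sampling without replacement; the function name `pick_test_values` and the seed are hypothetical, and the lookup is the same exact-match query used later in the post, written with a parameter instead of a literal.

```python
import random


def pick_test_values(count=5, low=1, high=1_000_000, seed=None):
    """Return `count` distinct integers drawn uniformly from low..high inclusive."""
    rng = random.Random(seed)
    return rng.sample(range(low, high + 1), count)


LOOKUP = "MATCH (n:Number {value: $value}) RETURN n"

for value in pick_test_values(seed=2020):
    # The same value would be looked up in both Alpha and Beta.
    print(LOOKUP, {"value": value})
```

Passing a seed makes the same five values reproducible across both databases and across reruns.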
Results
Finding a single :Number node by value in a database with 1,000,000 nodes. Alpha is the database that has an index on the value property of the :Number node. Beta is the database that has no index configured.
| Test # | Alpha (ms) | Beta (ms) | % Difference |
|---|---|---|---|
| 1 | 423.2053 | 5257.1049 | 42.55 |
| 2 | 422.947901 | 5254.6187 | 42.55 |
| 3 | 411.369301 | 5255.7346 | 42.74 |
| 4 | 429.008599 | 5240.1161 | 42.43 |
| 5 | 424.7967 | 5240.403301 | 42.50 |
I ran the same tests with 800,000 nodes and with 600,000 nodes.
800,000 Nodes:
| Test # | Alpha (ms) | Beta (ms) | % Difference |
|---|---|---|---|
| 1 | 470.078101 | 2091.8318 | 31.65 |
| 2 | 433.553101 | 915.1855 | 17.85 |
| 3 | 435.0441 | 2115.005 | 32.94 |
| 4 | 465.709401 | 2089.146301 | 31.77 |
| 5 | 458.296 | 2087.101899 | 32.00 |
600,000 Nodes:
| Test # | Alpha (ms) | Beta (ms) | % Difference |
|---|---|---|---|
| 1 | 480.539601 | 2965.8342 | 36.06 |
| 2 | 476.063701 | 2855.8858 | 35.71 |
| 3 | 476.549 | 2923.386399 | 35.98 |
| 4 | 473.4447 | 2515.1195 | 34.16 |
| 5 | 476.3136 | 2962.481301 | 36.15 |
Analysis
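In a database with 1 million nodes, the index cut the lookup time from roughly 5,250 ms to roughly 420 ms, which makes the query about 12 times faster (about 92% less time). The % Difference column understates this: its values match the absolute difference divided by twice the sum of the two timings, not the reduction in query time. I want to look at the trade-off behind the improved performance and at how an index improves the query so drastically.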
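As a sanity check on the 1,000,000-node table, the short Python sketch below turns each pair of timings into a speedup factor and a percentage reduction in query time. It also includes a formula that reproduces the published % Difference column; that formula is inferred from the numbers, not stated in the post, and all function names are made up for this example.

```python
# (Alpha ms, Beta ms, published % Difference) from the 1,000,000-node table.
RUNS = [
    (423.2053, 5257.1049, 42.55),
    (422.947901, 5254.6187, 42.55),
    (411.369301, 5255.7346, 42.74),
    (429.008599, 5240.1161, 42.43),
    (424.7967, 5240.403301, 42.50),
]


def speedup(indexed_ms, unindexed_ms):
    """How many times faster the indexed lookup is."""
    return unindexed_ms / indexed_ms


def time_reduction_pct(indexed_ms, unindexed_ms):
    """Percentage of the unindexed query time saved by the index."""
    return 100 * (unindexed_ms - indexed_ms) / unindexed_ms


def published_metric(indexed_ms, unindexed_ms):
    """Inferred formula behind the % Difference column (an assumption)."""
    return 100 * abs(unindexed_ms - indexed_ms) / (2 * (indexed_ms + unindexed_ms))


for alpha, beta, published in RUNS:
    print(f"{speedup(alpha, beta):.1f}x faster, "
          f"{time_reduction_pct(alpha, beta):.1f}% less time, "
          f"published {published}, recomputed {published_metric(alpha, beta):.2f}")
```

The speedup factor is the easier number to reason about here, because it compares the two timings directly instead of normalizing by their sum.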
We can get a better understanding of where such performance increases come from by prepending our Cypher query with PROFILE.
Alpha (with index):
neo4j@neo4j> PROFILE MATCH (n:Number {value: 900400}) RETURN n;
+---------------------------+
| n |
+---------------------------+
| (:Number {value: 900400}) |
+---------------------------+
+----------------------------------------------------------------------------------------------------------+
| Plan | Statement | Version | Planner | Runtime | Time | DbHits | Rows | Memory (Bytes) |
+----------------------------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 4.2" | "COST" | "INTERPRETED" | 55 | 4 | 1 | 0 |
+----------------------------------------------------------------------------------------------------------+
+-----------------------+------------------------------------------+----------------+------+---------+------------------------+
| Operator | Details | Estimated Rows | Rows | DB Hits | Page Cache Hits/Misses |
+-----------------------+------------------------------------------+----------------+------+---------+------------------------+
| +ProduceResults@neo4j | n | 1 | 1 | 2 | 0/0 |
| | +------------------------------------------+----------------+------+---------+------------------------+
| +NodeIndexSeek@neo4j | n:Number(value) WHERE value = $autoint_0 | 1 | 1 | 2 | 0/0 |
+-----------------------+------------------------------------------+----------------+------+---------+------------------------+
1 row available after 40 ms, consumed after another 15 ms
Beta (without index):
neo4j@neo4j> PROFILE MATCH (n:Number {value: 900400}) RETURN n;
+---------------------------+
| n |
+---------------------------+
| (:Number {value: 900400}) |
+---------------------------+
+-----------------------------------------------------------------------------------------------------------+
| Plan | Statement | Version | Planner | Runtime | Time | DbHits | Rows | Memory (Bytes) |
+-----------------------------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 4.2" | "COST" | "INTERPRETED" | 754 | 2000003 | 1 | 0 |
+-----------------------------------------------------------------------------------------------------------+
+------------------------+----------------------+----------------+---------+---------+------------------------+
| Operator | Details | Estimated Rows | Rows | DB Hits | Page Cache Hits/Misses |
+------------------------+----------------------+----------------+---------+---------+------------------------+
| +ProduceResults@neo4j | n | 100000 | 1 | 2 | 0/0 |
| | +----------------------+----------------+---------+---------+------------------------+
| +Filter@neo4j | n.value = $autoint_0 | 100000 | 1 | 1000000 | 0/0 |
| | +----------------------+----------------+---------+---------+------------------------+
| +NodeByLabelScan@neo4j | n:Number | 1000000 | 1000000 | 1000001 | 0/0 |
+------------------------+----------------------+----------------+---------+---------+------------------------+
1 row available after 38 ms, consumed after another 716 ms
Without an index, the same exact-match query costs 2,000,003 DB hits instead of 4. The planner has no choice but to scan every :Number node (NodeByLabelScan) and check the value property of each one (Filter).
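The DB hit counts in the two plans can be written as a small model. This is a sketch extrapolated from the single PROFILE run on each database shown above, so the counts for other node counts are an assumption, not a measurement; the function name `label_scan_db_hits` is made up for this example.

```python
def label_scan_db_hits(node_count):
    """DB hits for the unindexed plan, modelled on the Beta PROFILE output."""
    scan = node_count + 1   # NodeByLabelScan: 1,000,001 hits for 1,000,000 nodes
    filtering = node_count  # Filter: one property check per scanned node
    produce = 2             # ProduceResults for the single matching row
    return scan + filtering + produce


# NodeIndexSeek (2) plus ProduceResults (2) in the Alpha PROFILE output.
INDEX_SEEK_DB_HITS = 2 + 2

for n in (600_000, 800_000, 1_000_000):
    print(n, label_scan_db_hits(n), INDEX_SEEK_DB_HITS)
```

Under this model the unindexed cost grows by two DB hits for every extra :Number node, while the index seek stays at 4.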
Here are some resources if you're interested in interpreting execution plans and Cypher queries further:
A database index produces such performance gains by changing how the cost of a lookup grows. Instead of scaling linearly with the number of nodes, the lookup scales logarithmically, because it walks a sorted tree structure (a B-tree in Neo4j 4.x) rather than scanning every node.
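To get a feel for how slowly a logarithm grows, the sketch below counts how many levels a search tree needs to cover a given number of entries. A fan-out of 2 corresponds to a plain binary search; the fan-out of 100 is an arbitrary illustrative value for a wider tree, not a Neo4j figure, and `tree_height` is a made-up helper.

```python
def tree_height(entry_count, fan_out):
    """Smallest number of levels whose capacity fan_out**levels covers entry_count."""
    levels, capacity = 0, 1
    while capacity < entry_count:
        capacity *= fan_out
        levels += 1
    return levels


for n in (600_000, 800_000, 1_000_000):
    # Compare with a full scan, which touches all n entries.
    print(n, tree_height(n, 2), tree_height(n, 100))
```

Integer multiplication is used instead of `math.log` so that exact powers, such as 100 cubed for 1,000,000, are not thrown off by floating-point rounding.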
But how much space does this extra data structure take? What's the trade-off of adding a database index? Surely the extra data structure takes up more storage!
The Trade-off
The gains in speed come at the cost of storage. I went into each droplet and looked at how much disk space Neo4j's schema index files were using:
| | Alpha | Beta |
|---|---|---|
| Sum (K) | 196968 | 336 |
| Sum (M) | 192.3515625 | 0.328125 |
An index for one million nodes takes ~190 MB more storage. This scales linearly with the number of nodes indexed.
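The arithmetic behind that figure can be checked with a few lines of Python. The sketch assumes the Sum (K) row is in units of 1,024 bytes, which matches the table's own conversion from K to M; the per-node figure is a derived average, not something measured directly.

```python
KIB_PER_MIB = 1024
BYTES_PER_KIB = 1024

alpha_kib = 196_968  # schema index files on Alpha (with index)
beta_kib = 336       # schema index files on Beta (no index)
indexed_nodes = 1_000_000

index_kib = alpha_kib - beta_kib
index_mib = index_kib / KIB_PER_MIB
bytes_per_node = index_kib * BYTES_PER_KIB / indexed_nodes

print(f"{index_mib:.2f} MiB extra, about {bytes_per_node:.0f} bytes per indexed node")
```

This works out to 192.02 MiB of extra index files, or about 201 bytes per indexed node on average.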
Summary
In a database with one million nodes, a database index made lookups on the indexed property about 12 times faster (roughly 420 ms instead of 5,250 ms). The gain was smaller at 600,000 and 800,000 nodes, where the indexed lookup was roughly 4 to 6 times faster in most runs. These speed improvements come at the cost of storage.
Last reviewed on February 24, 2026