Investigating Neo4j Indexes

Published Dec 23, 2020 • Last reviewed Dec 23, 2020

An investigation into the trade-offs of a Neo4j database index.

Over the past year, I've been working on a product recommendation system built on Neo4j that held product information for over 100,000 items. Databases with upwards of 5 million nodes and 30 million relationships run in production, so this was by no means pushing Neo4j to its limit, but it was a scale at which performance problems appeared if no optimizations were made.

Database Indexes

One such optimization was to create an index for the database. A database index is a data structure that speeds up lookups on the data it covers. A good simplification is a phone book. When you're looking for a particular entry, you likely won't look through every listing, as there might be hundreds of thousands of entries. Instead, phone books sort the entries by name and come with tabs, so you can jump straight to the right section.
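The phone book idea can be sketched in a few lines of Python. This is a toy illustration, not how Neo4j stores its indexes: the `entries` list and the `scan_lookup` and `index_lookup` helpers are made up for the example, and the "index" is just a sorted list searched with the standard `bisect` module.

```python
from bisect import bisect_left

# Toy data: (value, node_id) pairs in arbitrary order, standing in for nodes.
entries = [(value, f"node-{value}") for value in (42, 7, 19, 3, 88, 61)]


def scan_lookup(entries, target):
    """Check every entry in turn until the target is found (no index)."""
    checked = 0
    for value, node_id in entries:
        checked += 1
        if value == target:
            return node_id, checked
    return None, checked


# The "index": the same pairs kept sorted by value, built once up front.
index = sorted(entries)
index_keys = [value for value, _ in index]


def index_lookup(target):
    """Binary-search the sorted keys instead of scanning everything."""
    pos = bisect_left(index_keys, target)
    if pos < len(index_keys) and index_keys[pos] == target:
        return index[pos][1]
    return None
```

The sorted copy is the extra data structure: it costs space and has to be maintained, but a lookup no longer needs to touch every entry.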

Process

I first created two droplets on DigitalOcean and configured Neo4j the same way on both. I then created a property index in one database (Alpha) and left the other one as is (Beta).

CREATE INDEX value FOR (number:Number) ON (number.value)

I then created 1,000,000 nodes in both databases.

FOREACH(i in RANGE(1, 1000000) | CREATE (:Number {value: i}))

I didn't actually use the above Cypher as written; I had to create the nodes in a few steps.
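The exact steps aren't recorded here. One way to split the work, purely as an illustration, is to generate one Cypher statement per batch of values. The `batch_statements` helper and the batch size of 100,000 are assumptions made for this sketch, not the settings I used.

```python
def batch_statements(total=1_000_000, batch_size=100_000):
    """Yield one Cypher CREATE statement per batch of consecutive values.

    batch_size is a hypothetical choice; tune it to what the server handles.
    """
    for start in range(1, total + 1, batch_size):
        end = min(start + batch_size - 1, total)
        yield f"UNWIND range({start}, {end}) AS i CREATE (:Number {{value: i}})"


# Ten statements covering 1..1,000,000 with no gaps or overlaps.
statements = list(batch_statements())
```

Each statement can then be run on its own, for example pasted into cypher-shell one at a time.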

From there, I randomly generated 5 unique numbers between 1 and 1,000,000 and ran a query for each one in both databases. Here are the results:
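Picking the test values can be sketched as follows. The seed and variable names are assumptions for the example (the values it produces are not the ones I tested with); the query text follows the same shape as the lookup profiled later in this post.

```python
import random

# Fixed seed (an arbitrary choice) so the sketch is repeatable.
rng = random.Random(2020)

# sample() draws without replacement, so the five values are unique.
targets = rng.sample(range(1, 1_000_001), 5)

# One exact-match lookup per target value.
queries = [f"MATCH (n:Number {{value: {t}}}) RETURN n" for t in targets]
```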

Results

Finding a single :Number node by value in a database with 1,000,000 nodes. Alpha is the database that has an index on the value property of the :Number node. Beta is the database that has no index configured.

Test #	Alpha (ms)	Beta (ms)	% Difference
1	423.2053	5257.1049	42.55
2	422.947901	5254.6187	42.55
3	411.369301	5255.7346	42.74
4	429.008599	5240.1161	42.43
5	424.7967	5240.403301	42.50

I ran the same tests for 800,000 nodes and 600,000 nodes.

800,000 Nodes:

Test #	Alpha (ms)	Beta (ms)	% Difference
1	470.078101	2091.8318	31.65
2	433.553101	915.1855	17.85
3	435.0441	2115.005	32.94
4	465.709401	2089.146301	31.77
5	458.296	2087.101899	32.00

600,000 Nodes:

Test #	Alpha (ms)	Beta (ms)	% Difference
1	480.539601	2965.8342	36.06
2	476.063701	2855.8858	35.71
3	476.549	2923.386399	35.98
4	473.4447	2515.1195	34.16
5	476.3136	2962.481301	36.15

Analysis

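In a database with 1 million nodes, the indexed lookup took about 0.42 s against about 5.25 s without the index, which is roughly 12 times faster (about 92% less time per query). The ~42% in the % Difference column matches (Beta - Alpha) / (Alpha + Beta) / 2, so it understates the gap. I want to look at how an index improves the query so drastically, and at what that performance costs.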
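As a cross-check on the 1,000,000-node table, the reported timings can be turned into a speedup factor and a percentage reduction in query time. The variable names are mine, and the last expression is simply a formula that happens to reproduce the % Difference column for test 1; treat it as an observation about the numbers rather than a standard metric.

```python
from statistics import mean

# Timings in ms copied from the 1,000,000-node table above.
alpha = [423.2053, 422.947901, 411.369301, 429.008599, 424.7967]
beta = [5257.1049, 5254.6187, 5255.7346, 5240.1161, 5240.403301]

alpha_mean = mean(alpha)
beta_mean = mean(beta)

# How many times faster the indexed database answered, on average.
speedup = beta_mean / alpha_mean

# Share of the unindexed query time that the index saved.
reduction_pct = (1 - alpha_mean / beta_mean) * 100

# Formula that reproduces the table's % Difference value for test 1.
table_pct = (beta[0] - alpha[0]) / (alpha[0] + beta[0]) / 2 * 100
```

With these inputs, `speedup` lands between 12 and 13 and `reduction_pct` is close to 92.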

We can get a better understanding of where such performance increases come from by prepending our Cypher query with PROFILE.

Alpha (with index):

neo4j@neo4j> PROFILE MATCH (n:Number {value: 900400}) RETURN n;
+---------------------------+
| n                         |
+---------------------------+
| (:Number {value: 900400}) |
+---------------------------+

+----------------------------------------------------------------------------------------------------------+
| Plan      | Statement   | Version      | Planner | Runtime       | Time | DbHits | Rows | Memory (Bytes) |
+----------------------------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 4.2" | "COST"  | "INTERPRETED" | 55   | 4      | 1    | 0              |
+----------------------------------------------------------------------------------------------------------+


+-----------------------+------------------------------------------+----------------+------+---------+------------------------+
| Operator              | Details                                  | Estimated Rows | Rows | DB Hits | Page Cache Hits/Misses |
+-----------------------+------------------------------------------+----------------+------+---------+------------------------+
| +ProduceResults@neo4j | n                                        |              1 |    1 |       2 |                    0/0 |
| |                     +------------------------------------------+----------------+------+---------+------------------------+
| +NodeIndexSeek@neo4j  | n:Number(value) WHERE value = $autoint_0 |              1 |    1 |       2 |                    0/0 |
+-----------------------+------------------------------------------+----------------+------+---------+------------------------+

1 row available after 40 ms, consumed after another 15 ms

Beta (without index):

neo4j@neo4j> PROFILE MATCH (n:Number {value: 900400}) RETURN n;
+---------------------------+
| n                         |
+---------------------------+
| (:Number {value: 900400}) |
+---------------------------+

+-----------------------------------------------------------------------------------------------------------+
| Plan      | Statement   | Version      | Planner | Runtime       | Time | DbHits  | Rows | Memory (Bytes) |
+-----------------------------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 4.2" | "COST"  | "INTERPRETED" | 754  | 2000003 | 1    | 0              |
+-----------------------------------------------------------------------------------------------------------+


+------------------------+----------------------+----------------+---------+---------+------------------------+
| Operator               | Details              | Estimated Rows | Rows    | DB Hits | Page Cache Hits/Misses |
+------------------------+----------------------+----------------+---------+---------+------------------------+
| +ProduceResults@neo4j  | n                    |         100000 |       1 |       2 |                    0/0 |
| |                      +----------------------+----------------+---------+---------+------------------------+
| +Filter@neo4j          | n.value = $autoint_0 |         100000 |       1 | 1000000 |                    0/0 |
| |                      +----------------------+----------------+---------+---------+------------------------+
| +NodeByLabelScan@neo4j | n:Number             |        1000000 | 1000000 | 1000001 |                    0/0 |
+------------------------+----------------------+----------------+---------+---------+------------------------+

1 row available after 38 ms, consumed after another 716 ms

Without an index, the same exact-match query costs far more database hits: 2,000,003 against 4. The planner has no choice but to scan every :Number node (NodeByLabelScan) and then filter each one on value.
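The totals in the two summary rows are just the per-operator DB hits added up, which a short sketch makes explicit. The dictionaries below are copied from the PROFILE output above; the names are mine.

```python
# DB hits per operator, copied from the two PROFILE outputs above.
alpha_hits = {"ProduceResults": 2, "NodeIndexSeek": 2}
beta_hits = {"ProduceResults": 2, "Filter": 1_000_000, "NodeByLabelScan": 1_000_001}

alpha_total = sum(alpha_hits.values())  # matches DbHits = 4
beta_total = sum(beta_hits.values())    # matches DbHits = 2000003

# How many times more store accesses the unindexed query needed.
ratio = beta_total / alpha_total
```

Nearly all of Beta's hits come from the label scan and the filter, each touching about one million nodes.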

Here are resources if you're interested in further interpreting execution plans and Cypher queries:

A database index produces such performance gains by reducing the lookup from a problem that scales linearly with the number of nodes to one that scales logarithmically. It does so by building and searching a tree structure (a B-tree in Neo4j 4.x, rather than a plain binary tree).
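To get a feel for linear versus logarithmic scaling, here is a rough count of the worst-case steps each approach needs. The `levels_needed` helper is hypothetical, and the fan-out of 100 is an arbitrary illustrative value, not a figure taken from Neo4j.

```python
def levels_needed(n, fanout):
    """Smallest number of tree levels h such that fanout**h >= n."""
    levels, capacity = 0, 1
    while capacity < n:
        capacity *= fanout
        levels += 1
    return levels


for n in (600_000, 800_000, 1_000_000):
    scan_steps = n                        # a full scan may touch every node
    binary_steps = levels_needed(n, 2)    # halving the search space each step
    wide_steps = levels_needed(n, 100)    # a wide tree, fan-out assumed to be 100
    print(n, scan_steps, binary_steps, wide_steps)
```

Going from 600,000 to 1,000,000 entries adds 400,000 steps to the scan but does not change the number of tree levels in this sketch, which is why the indexed timings barely move between data sizes.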

But how much space does this extra data structure take? What's the trade-off of adding a database index? Surely the extra data structure takes up more storage!

The Trade-off

The gains in speed come at the cost of storage. I went into each droplet and took a look at how much disk space was being used by Neo4j's schema index files:

	Alpha	Beta
Sum (K)	196968	336
Sum (M)	192.3515625	0.328125

An index for one million nodes takes ~190 MB more storage. I'd expect this to scale roughly linearly with the number of nodes indexed.
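The storage numbers can be worked through the same way. This sketch assumes the "Sum (K)" row is in units of 1,024 bytes (which is consistent with the table dividing by 1,024 to get "Sum (M)") and that the cost really is linear in the node count; `projected_extra_mib` is a hypothetical helper for extrapolating under that assumption.

```python
KIB_PER_MIB = 1024

# Sizes copied from the table above, assumed to be in units of 1,024 bytes.
alpha_kib = 196_968
beta_kib = 336
indexed_nodes = 1_000_000

# Extra space attributable to the index.
extra_kib = alpha_kib - beta_kib
extra_mib = extra_kib / KIB_PER_MIB

# Average index cost per node in bytes.
bytes_per_node = extra_kib * 1024 / indexed_nodes


def projected_extra_mib(nodes):
    """Extrapolate index size, assuming it grows linearly with node count."""
    return bytes_per_node * nodes / (1024 * 1024)
```

That works out to roughly 200 bytes of index per node for this data set; other property types or index settings could give a very different figure.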

Summary

In a database with one million nodes, an index cut the time to find a node by the indexed property from about 5.25 s to about 0.42 s, roughly 12 times faster. The gains were smaller at 800,000 and 600,000 nodes (roughly 4 to 6 times faster). These speed improvements come at the cost of disk space.

Last reviewed on February 20, 2026