Bonus 2 — Graph Databases: Neo4j, Memgraph

“Relational DB join 2-3 tables fast, 5-10 tables painful, 10+ unusable. Graph DB: traverse 100 hops in milliseconds. Khi nào cần?”

Tags: database neo4j memgraph graph-db cypher Thời lượng: 5-6 ngày Prerequisites: Tuan-02-Schema-Design-Normalization Liên quan: Case-Design-Data-AI-RAG (knowledge graph), Tuan-15-Vector-DB-AI (graph + vector)

1. When Graph DB

1.1 Use cases real-world

Social network — friends-of-friends, recommendations, “people you may know”
Knowledge graph — entity relationships (Google KG, Wikidata)
Fraud detection — connected accounts, money flow tracing
Recommendation — collaborative filtering (“users who liked X also liked Y”)
Network/IT — dependency graph, impact analysis
Identity / access — permissions traversal, role inheritance
Supply chain — multi-hop dependencies
Bioinformatics — protein interactions, gene networks
Cybersecurity — attack graphs

1.2 Decision: graph or not?

Pattern: >3 levels of relationships needed in query.

Quick test: if you’re writing recursive CTE or 5+ joins in SQL → consider graph DB.

flowchart TD
    A[Bài toán] --> B{Cần traverse relationships?}
    B -->|Không| RDB[Relational]
    B -->|Yes| C{Số hops mong đợi?}
    C -->|1-2| C1[Relational JOIN OK]
    C -->|3-5| C2[Relational recursive CTE OK<br/>nhưng performance suffer]
    C -->|>5 or variable| G[Graph DB]
    C -->|Path/shortest path queries| G

    style G fill:#c8e6c9
    style RDB fill:#c8e6c9

2. Property Graph Model

2.1 Concept

graph LR
    Alice[Alice<br/>:User<br/>age=30<br/>country=VN] -->|FRIENDS_WITH<br/>since=2020<br/>strength=0.8| Bob[Bob<br/>:User<br/>age=28]
    Alice -->|LIKES<br/>rating=5<br/>date=2024-05-10| Movie1[Inception<br/>:Movie<br/>year=2010<br/>genre=SciFi]
    Bob -->|LIKES<br/>rating=4| Movie1
    Bob -->|LIKES| Movie2[Matrix<br/>:Movie]
    Bob -->|WORKS_AT| Company[Acme Inc<br/>:Company]

Elements:

Nodes — entities, can have multiple labels + properties
Relationships — typed, directed, can have properties
Schema flexible — no upfront schema required, can add at any time

2.2 vs Relational equivalent

Relational requires:

users table
movies table
companies table
friendships junction table (user_id, friend_id, since, strength)
likes junction table (user_id, movie_id, rating)
employments junction table (user_id, company_id)

For “movies friends of friends like that Alice hasn’t seen”:

SELECT DISTINCT m.* FROM movies m
JOIN likes l1 ON l1.movie_id = m.id
JOIN friendships f1 ON f1.friend_id = l1.user_id
JOIN friendships f2 ON f2.user_id = f1.user_id
WHERE f2.friend_id = (SELECT id FROM users WHERE name = 'Alice')
  AND NOT EXISTS (SELECT 1 FROM likes WHERE movie_id = m.id AND user_id = ...);

Cypher:

MATCH (alice:User {name: 'Alice'})-[:FRIENDS_WITH*2]->(fof:User)-[:LIKES]->(movie:Movie)
WHERE NOT (alice)-[:LIKES]->(movie)
RETURN DISTINCT movie.title, count(*) AS friend_likes
ORDER BY friend_likes DESC LIMIT 10;

Cleaner. And much faster as relationships grow.

3. Cypher Query Language

3.1 CRUD basics

// Create node
CREATE (alice:User {name: 'Alice', age: 30})
 
// Create with multiple labels
CREATE (a:User:Employee {name: 'Alice'})
 
// Create relationship
MATCH (a:User {name: 'Alice'}), (b:User {name: 'Bob'})
CREATE (a)-[r:FRIENDS_WITH {since: 2020}]->(b)
RETURN r
 
// MERGE (upsert)
MERGE (u:User {email: '[email protected]'})
ON CREATE SET u.name = 'Alice', u.created = timestamp()
ON MATCH SET u.last_seen = timestamp()
 
// Update
MATCH (u:User {name: 'Alice'}) SET u.age = 31
MATCH (u:User {name: 'Alice'}) SET u += {city: 'HCMC', verified: true}  // merge map
 
// Delete
MATCH (u:User {name: 'Alice'}) DELETE u  // fails if has relationships
MATCH (u:User {name: 'Alice'}) DETACH DELETE u  // delete + relationships

3.2 Pattern matching

// Simple
MATCH (u:User {name: 'Alice'}) RETURN u
 
// Property filter
MATCH (u:User) WHERE u.age > 18 AND u.country = 'VN' RETURN u.name
 
// Relationship pattern
MATCH (a:User)-[r:FRIENDS_WITH]->(b:User) WHERE a.name = 'Alice' RETURN b.name
 
// Undirected
MATCH (a:User)-[:FRIENDS_WITH]-(b:User) WHERE a.name = 'Alice' RETURN b
// Friends regardless of direction
 
// Optional match (LEFT JOIN)
MATCH (u:User {name: 'Alice'})
OPTIONAL MATCH (u)-[:LIKES]->(m:Movie)
RETURN u, m

3.3 Path operations

// Variable-length path
MATCH (start:User {name: 'Alice'})-[:FRIENDS_WITH*1..6]-(end:User)
RETURN DISTINCT end.name
// Within 6 degrees of separation
 
// Exact length
MATCH (a:User)-[:FRIENDS_WITH*3]-(b:User) WHERE a.name = 'Alice'
RETURN b
// Exactly 3 hops away
 
// Shortest path
MATCH p = shortestPath((a:User {name: 'Alice'})-[:KNOWS*]-(b:User {name: 'Bob'}))
RETURN p, length(p)
 
// All shortest paths
MATCH p = allShortestPaths((a:User {name: 'Alice'})-[:KNOWS*..6]-(b:User))
WHERE b.country = 'VN'
RETURN p LIMIT 5

3.4 Aggregations

// Count
MATCH (u:User)-[:FRIENDS_WITH]->(f) WHERE u.name = 'Alice'
RETURN count(f) AS friend_count
 
// Group
MATCH (u:User)-[:LIKES]->(m:Movie)
RETURN m.genre, count(*) AS likes_count
ORDER BY likes_count DESC
LIMIT 10
 
// Collect
MATCH (u:User {name: 'Alice'})-[:FRIENDS_WITH]->(f)
RETURN u.name, collect(f.name) AS friends

3.5 Subqueries (Cypher 25 + Neo4j 5)

// CALL { ... } subquery
MATCH (u:User)
CALL {
    WITH u
    MATCH (u)-[:LIKES]->(m:Movie)
    RETURN count(m) AS movies_liked
}
RETURN u.name, movies_liked

3.6 EXISTS / NOT EXISTS

MATCH (u:User)
WHERE EXISTS { (u)-[:LIKES]->(:Movie {genre: 'Horror'}) }
RETURN u
 
MATCH (u:User {name: 'Alice'})-[:FRIENDS_WITH]->(f)
WHERE NOT EXISTS { (f)-[:BLOCKED]->(u) }
RETURN f

4. Indexes & Constraints

4.1 Indexes

// Single property
CREATE INDEX user_name FOR (u:User) ON (u.name)
 
// Composite (Neo4j 4.4+)
CREATE INDEX user_country_age FOR (u:User) ON (u.country, u.age)
 
// Relationship property (Neo4j 5+)
CREATE INDEX FOR ()-[r:FRIENDS_WITH]-() ON (r.since)
 
// Full-text (Lucene-backed)
CREATE FULLTEXT INDEX movie_search FOR (m:Movie) ON EACH [m.title, m.description]
// Query
CALL db.index.fulltext.queryNodes("movie_search", "matrix +sci-fi") YIELD node, score
RETURN node.title, score
 
// Vector index (Neo4j 5.11+)
CREATE VECTOR INDEX movie_embedding FOR (m:Movie) ON (m.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}}
// Query
CALL db.index.vector.queryNodes('movie_embedding', 10, $query_embedding)
YIELD node, score
RETURN node.title, score

4.2 Constraints

// Unique
CREATE CONSTRAINT user_email_unique FOR (u:User) REQUIRE u.email IS UNIQUE
 
// Existence (Neo4j 5+)
CREATE CONSTRAINT user_name_exists FOR (u:User) REQUIRE u.name IS NOT NULL
 
// Node key (PK equivalent)
CREATE CONSTRAINT user_email_key FOR (u:User) REQUIRE u.email IS NODE KEY
 
// Type constraint (Neo4j 5.9+)
CREATE CONSTRAINT user_age_int FOR (u:User) REQUIRE u.age IS :: INTEGER

4.3 Index hints

MATCH (u:User {country: 'VN'})
USING INDEX u:User(country)
RETURN u

Rarely needed; planner usually picks well.

5. Performance Characteristics

5.1 Index-free adjacency

Each node has direct pointers to its neighbors → O(1) traversal.

graph LR
    NodeA[Node A<br/>pointers to: B, C, D] --> NodeB
    NodeA --> NodeC
    NodeA --> NodeD

vs Relational: JOIN requires index lookup per row. With 10M rows, 5-hop traversal = 5 × index_lookup × match_count → exponential.

5.2 Local query optimization

Graph DB explores from start node outward. Good for:

Bounded traversal (depth-limited)
Path queries (start → end)
Neighborhood queries (within N hops)

Bad for:

Global aggregates (count all nodes)
Full-graph scans
Operations that fit relational well

5.3 Benchmark direction

Friends-of-friends-of-friends (3 hops) on 1M users, 50 avg friends:

	Relational	Graph
Query time	seconds-minutes	milliseconds
Memory	huge intermediate sets	local exploration
Scaling	gets worse with depth	constant per hop

6. Graph Algorithms — GDS Library

Neo4j Graph Data Science library (GDS) — algorithm catalog.

6.1 Setup

// Project graph into memory
CALL gds.graph.project(
    'my-graph',
    'User',
    'FRIENDS_WITH',
    {relationshipProperties: 'strength'}
)

6.2 PageRank

CALL gds.pageRank.stream('my-graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS user, score
ORDER BY score DESC LIMIT 10

Use case: find influential users.

6.3 Community detection (Louvain)

CALL gds.louvain.stream('my-graph')
YIELD nodeId, communityId
RETURN communityId, count(*) AS size
ORDER BY size DESC

Use case: cluster users into communities.

6.4 Shortest path (Dijkstra)

CALL gds.shortestPath.dijkstra.stream('my-graph', {
    sourceNode: source,
    targetNode: target,
    relationshipWeightProperty: 'strength'
}) YIELD path

6.5 Centrality

Betweenness centrality — nodes that bridge communities
Closeness centrality — well-connected nodes
Degree centrality — most connections

6.6 Similarity

Node similarity via Jaccard, Cosine, Overlap.

6.7 Embedding (graph embedding)

GDS 2.x supports:

Node2Vec
FastRP
GraphSAGE

CALL gds.fastRP.write('my-graph', {
    embeddingDimension: 128,
    writeProperty: 'embedding'
})

Generated embedding → use for ML downstream.

7. Neo4j Architecture

7.1 Editions

Neo4j Community Edition — Free, OSS (GPLv3), single-node only
Neo4j Enterprise Edition — Commercial, sharding, RBAC, hot backup
AuraDB — Managed cloud (Free tier + paid)

Sharding (multi-DB / Fabric) only in Enterprise. Big gotcha for OSS users with >1TB.

7.2 Storage

Native graph storage. Property store + node store + relationship store + label store + index store.

Bolt protocol for clients (similar to TCP+JSON-ish framing).

7.3 ACID transactions

Full ACID, including multi-statement transactions.

BEGIN
MATCH (a:Account {id: 1}) SET a.balance = a.balance - 100
MATCH (b:Account {id: 2}) SET b.balance = b.balance + 100
COMMIT

7.4 Causal Cluster (Enterprise)

graph TB
    Core1[Core 1 - leader] --> Replica1[Read Replica 1]
    Core1 -.Raft.-> Core2[Core 2]
    Core1 -.Raft.-> Core3[Core 3]
    Core1 --> Replica2[Read Replica 2]

Cores: Raft consensus, write quorum
Read replicas: async, scale reads

7.5 Backup & restore

Enterprise: online backup with neo4j-admin backup. Community: shut down + copy files.

8. Memgraph (Alternative)

Cypher-compatible, in-memory, Rust+C++. Optimized for streaming + real-time analytics.

8.1 Pros over Neo4j

10-100x faster on some workloads (in-mem)
Streaming (consume Kafka → graph)
OSS Apache-style license (more permissive)
Multi-tenancy primitives

8.2 Cons

Smaller ecosystem
Less mature tooling
Smaller community

8.3 Use case

Real-time fraud detection, streaming graph analytics, dynamic networks.

9. Comparison 2024-2026

	Neo4j	Memgraph	Amazon Neptune	TigerGraph	DGraph
Type	Native graph	In-mem native	Multi-model	Distributed	Distributed
Language	Cypher	Cypher	Gremlin + openCypher	GSQL	GraphQL+-
OSS	Community GPLv3	Apache 2.0	No	Community	Yes
HA	Enterprise only	Yes	Yes (managed)	Yes	Yes
Sharding	Enterprise only	Yes	Yes	Yes	Yes
Vector	5.11+	Yes	Limited	No	No
Vendor	Neo4j Inc	Memgraph	AWS	TigerGraph	Dgraph Labs
Cloud	AuraDB	Cloud	Neptune	Cloud	Cloud

Pattern 2024-2026: Neo4j dominant for learning + most workloads. Memgraph rising for real-time. Neptune if AWS-only + integration with other AWS services.

10. Modeling Tips

10.1 Direction matters

// Semantic direction
(:User)-[:WROTE]->(:Post)         // good
(:User)-[:WROTE]-(:Post)          // ambiguous
 
// Bidirectional with semantic clarity
(:User)-[:FRIENDS_WITH]->(:User)   // implicit reciprocal? Or directional?

Conventions:

Directional for asymmetric (FOLLOWS, OWNS, WROTE)
Single direction with implicit reciprocal for symmetric (FRIENDS_WITH)
Or store both directions if needed for traversal speed

10.2 Super-node problem

One node with millions of relationships → bad locality, slow traversal.

graph LR
    User[User Alice<br/>10M followers] --> F1[F1]
    User --> F2[F2]
    User --> F3[...10M...]
    User --> Fn[Fn]

Mitigations:

Add intermediate nodes (Group between User and Members)
Property indexes on relationship props for filtered traversal
Separate “hot” vs “cold” relationships

10.3 Relationship vs node

When to make X a node vs property:

Property when X is value (color = “red”)
Node when X has own relationships (Movie has many actors, director, etc.)

10.4 Labels for type, not flags

// Good
(:User:Verified)   // type combo
(:Movie)
 
// Bad - state as label
(:User:Online)     // changes frequently → use property instead

10.5 Property design

Atomic values (no nested objects, except in newer Cypher with maps)
Avoid lists when relationships fit better
Index frequently-filtered properties

11. Common Patterns

11.1 Recommendation

// "Movies friends like that I haven't seen"
MATCH (me:User {name: 'Alice'})-[:FRIENDS_WITH]->(friend)-[:LIKES]->(movie:Movie)
WHERE NOT (me)-[:LIKES]->(movie)
RETURN movie.title, count(*) AS friend_likes
ORDER BY friend_likes DESC LIMIT 10

11.2 Collaborative filtering

// "Users similar to me, what do they like that I don't?"
MATCH (me:User {name: 'Alice'})-[:LIKES]->(m:Movie)<-[:LIKES]-(other:User)
WITH me, other, count(m) AS overlap
WHERE overlap > 5
MATCH (other)-[:LIKES]->(recommended:Movie)
WHERE NOT (me)-[:LIKES]->(recommended)
RETURN recommended.title, count(*) AS strength
ORDER BY strength DESC LIMIT 10

11.3 Fraud rings

// Find accounts sharing characteristics in suspicious patterns
MATCH (a:Account)-[:USES]->(card:Card)<-[:USES]-(b:Account)
WHERE a <> b
MATCH (a)-[:USES]->(addr:Address)<-[:USES]-(b)
RETURN a, b, card.number, addr.line1
// Two accounts sharing card AND address - suspicious

11.4 Knowledge graph for RAG

Combine vector + graph:

// Find documents semantically similar AND topically related
CALL db.index.vector.queryNodes('doc_embedding', 50, $query_emb) YIELD node AS doc, score
MATCH (doc)-[:ABOUT]->(topic:Topic)-[:RELATED_TO]->(related_topic:Topic)<-[:ABOUT]-(other_doc:Document)
RETURN DISTINCT other_doc, score
ORDER BY score DESC LIMIT 10

11.5 Authorization graph

// Can user X access resource Y?
MATCH path = (u:User {id: $user_id})-[:MEMBER_OF*0..]->(g:Group)-[:HAS_ROLE]->(r:Role)-[:CAN_ACCESS]->(res:Resource {id: $resource_id})
RETURN length(path) AS distance, path

Far more powerful than relational RBAC for inheritance.

12. Operations

12.1 Backup

Community: stop + file copy. Enterprise: neo4j-admin backup online.

12.2 Monitoring

CALL dbms.queryJmx("org.neo4j:*")
// JMX metrics
 
// Slow queries
CALL dbms.listQueries() YIELD queryId, query, elapsedTimeMillis
WHERE elapsedTimeMillis > 1000
RETURN *

Integrate with Prometheus + Grafana.

12.3 Cypher EXPLAIN / PROFILE

EXPLAIN MATCH ... RETURN ...   // plan only
PROFILE MATCH ... RETURN ...   // actual run with stats

Look for “NodeByLabelScan” (full table scan equivalent) → add index.

12.4 Tuning

# neo4j.conf
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=10G

Page cache should hold working set of nodes + relationships.

13. When NOT to use Graph DB

Simple CRUD app — relational sufficient
Aggregation-heavy (sum, count over whole graph) — column store better
Mostly tabular data with occasional joins — relational
Throughput >100K transactions/sec — most graph DBs less optimized for OLTP write rate
Team unfamiliar with graph thinking — learning curve

Don’t use graph as primary store for general business data. Use it as specialized store alongside Postgres/etc.

14. Anti-patterns

Pattern	Why bad
Modeling everything as graph	Aggregations slow, harder than SQL
Super-nodes with 10M+ rels	Slow traversal
Multi-hop without depth limit	Query times out
Using labels for state (Online/Offline)	Frequent updates fragment graph
Sharding too early (Community → Enterprise migration)	Plan upfront
Ignoring index hints in PROFILE	Hidden full scans
Relationship without direction (always undirected)	Loses semantics
Using as primary OLTP DB	Sharding/HA limited in OSS Community

15. Lab

Day 1: Setup

docker run -d --name neo4j -p 7474:7474 -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/lab \
    neo4j:5

Browser at http://localhost:7474.

Day 2: Cypher basics

Create LinkedIn-like graph: users, companies, jobs, skills. Practice CRUD.

Day 3: Traversal

2-hop connections (people you may know)
Shortest path between 2 users
Companies hiring for skills you have

Day 4: GDS

Run PageRank on user influence. Louvain for community detection.

Day 5: Vector index

Add embedding to articles. Hybrid query: semantic + graph.

Day 6: Performance

EXPLAIN/PROFILE 5 queries. Add indexes. Compare.

Day 7: Real scenario

Build fraud detection: transactions, accounts, IPs. Find rings.

16. Self-check

Khi nào pick graph DB? 3 triệu chứng?
Index-free adjacency — vì sao quan trọng?
Property graph vs RDF — khác biệt?
Variable-length path syntax Cypher?
Super-node problem + mitigation?
PageRank dùng cho gì?
Neo4j Community vs Enterprise — limit lớn nhất?
Memgraph khác Neo4j ở đâu (mạnh hơn / yếu hơn)?
Cypher MERGE vs CREATE — khác biệt?
Khi nào KHÔNG dùng graph DB?

17. Tiếp theo

Tuan-Bonus-Embedded-DBs

Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-Bonus-Graph-DB-Neo4j