Bonus 2 — Graph Databases: Neo4j, Memgraph
“Relational DB join 2-3 tables fast, 5-10 tables painful, 10+ unusable. Graph DB: traverse 100 hops in milliseconds. Khi nào cần?”
Tags: database neo4j memgraph graph-db cypher Thời lượng: 5-6 ngày Prerequisites: Tuan-02-Schema-Design-Normalization Liên quan: Case-Design-Data-AI-RAG (knowledge graph), Tuan-15-Vector-DB-AI (graph + vector)
1. When Graph DB
1.1 Use cases real-world
- Social network — friends-of-friends, recommendations, “people you may know”
- Knowledge graph — entity relationships (Google KG, Wikidata)
- Fraud detection — connected accounts, money flow tracing
- Recommendation — collaborative filtering (“users who liked X also liked Y”)
- Network/IT — dependency graph, impact analysis
- Identity / access — permissions traversal, role inheritance
- Supply chain — multi-hop dependencies
- Bioinformatics — protein interactions, gene networks
- Cybersecurity — attack graphs
1.2 Decision: graph or not?
Pattern: >3 levels of relationships needed in query.
Quick test: if you’re writing recursive CTE or 5+ joins in SQL → consider graph DB.
flowchart TD A[Bài toán] --> B{Cần traverse relationships?} B -->|Không| RDB[Relational] B -->|Yes| C{Số hops mong đợi?} C -->|1-2| C1[Relational JOIN OK] C -->|3-5| C2[Relational recursive CTE OK<br/>nhưng performance suffer] C -->|>5 or variable| G[Graph DB] C -->|Path/shortest path queries| G style G fill:#c8e6c9 style RDB fill:#c8e6c9
2. Property Graph Model
2.1 Concept
graph LR Alice[Alice<br/>:User<br/>age=30<br/>country=VN] -->|FRIENDS_WITH<br/>since=2020<br/>strength=0.8| Bob[Bob<br/>:User<br/>age=28] Alice -->|LIKES<br/>rating=5<br/>date=2024-05-10| Movie1[Inception<br/>:Movie<br/>year=2010<br/>genre=SciFi] Bob -->|LIKES<br/>rating=4| Movie1 Bob -->|LIKES| Movie2[Matrix<br/>:Movie] Bob -->|WORKS_AT| Company[Acme Inc<br/>:Company]
Elements:
- Nodes — entities, can have multiple labels + properties
- Relationships — typed, directed, can have properties
- Schema flexible — no upfront schema required, can add at any time
2.2 vs Relational equivalent
Relational requires:
userstablemoviestablecompaniestablefriendshipsjunction table (user_id, friend_id, since, strength)likesjunction table (user_id, movie_id, rating)employmentsjunction table (user_id, company_id)
For “movies friends of friends like that Alice hasn’t seen”:
SELECT DISTINCT m.* FROM movies m
JOIN likes l1 ON l1.movie_id = m.id
JOIN friendships f1 ON f1.friend_id = l1.user_id
JOIN friendships f2 ON f2.user_id = f1.user_id
WHERE f2.friend_id = (SELECT id FROM users WHERE name = 'Alice')
AND NOT EXISTS (SELECT 1 FROM likes WHERE movie_id = m.id AND user_id = ...);Cypher:
MATCH (alice:User {name: 'Alice'})-[:FRIENDS_WITH*2]->(fof:User)-[:LIKES]->(movie:Movie)
WHERE NOT (alice)-[:LIKES]->(movie)
RETURN DISTINCT movie.title, count(*) AS friend_likes
ORDER BY friend_likes DESC LIMIT 10;Cleaner. And much faster as relationships grow.
3. Cypher Query Language
3.1 CRUD basics
// Create node
CREATE (alice:User {name: 'Alice', age: 30})
// Create with multiple labels
CREATE (a:User:Employee {name: 'Alice'})
// Create relationship
MATCH (a:User {name: 'Alice'}), (b:User {name: 'Bob'})
CREATE (a)-[r:FRIENDS_WITH {since: 2020}]->(b)
RETURN r
// MERGE (upsert)
MERGE (u:User {email: '[email protected]'})
ON CREATE SET u.name = 'Alice', u.created = timestamp()
ON MATCH SET u.last_seen = timestamp()
// Update
MATCH (u:User {name: 'Alice'}) SET u.age = 31
MATCH (u:User {name: 'Alice'}) SET u += {city: 'HCMC', verified: true} // merge map
// Delete
MATCH (u:User {name: 'Alice'}) DELETE u // fails if has relationships
MATCH (u:User {name: 'Alice'}) DETACH DELETE u // delete + relationships3.2 Pattern matching
// Simple
MATCH (u:User {name: 'Alice'}) RETURN u
// Property filter
MATCH (u:User) WHERE u.age > 18 AND u.country = 'VN' RETURN u.name
// Relationship pattern
MATCH (a:User)-[r:FRIENDS_WITH]->(b:User) WHERE a.name = 'Alice' RETURN b.name
// Undirected
MATCH (a:User)-[:FRIENDS_WITH]-(b:User) WHERE a.name = 'Alice' RETURN b
// Friends regardless of direction
// Optional match (LEFT JOIN)
MATCH (u:User {name: 'Alice'})
OPTIONAL MATCH (u)-[:LIKES]->(m:Movie)
RETURN u, m3.3 Path operations
// Variable-length path
MATCH (start:User {name: 'Alice'})-[:FRIENDS_WITH*1..6]-(end:User)
RETURN DISTINCT end.name
// Within 6 degrees of separation
// Exact length
MATCH (a:User)-[:FRIENDS_WITH*3]-(b:User) WHERE a.name = 'Alice'
RETURN b
// Exactly 3 hops away
// Shortest path
MATCH p = shortestPath((a:User {name: 'Alice'})-[:KNOWS*]-(b:User {name: 'Bob'}))
RETURN p, length(p)
// All shortest paths
MATCH p = allShortestPaths((a:User {name: 'Alice'})-[:KNOWS*..6]-(b:User))
WHERE b.country = 'VN'
RETURN p LIMIT 53.4 Aggregations
// Count
MATCH (u:User)-[:FRIENDS_WITH]->(f) WHERE u.name = 'Alice'
RETURN count(f) AS friend_count
// Group
MATCH (u:User)-[:LIKES]->(m:Movie)
RETURN m.genre, count(*) AS likes_count
ORDER BY likes_count DESC
LIMIT 10
// Collect
MATCH (u:User {name: 'Alice'})-[:FRIENDS_WITH]->(f)
RETURN u.name, collect(f.name) AS friends3.5 Subqueries (Cypher 25 + Neo4j 5)
// CALL { ... } subquery
MATCH (u:User)
CALL {
WITH u
MATCH (u)-[:LIKES]->(m:Movie)
RETURN count(m) AS movies_liked
}
RETURN u.name, movies_liked3.6 EXISTS / NOT EXISTS
MATCH (u:User)
WHERE EXISTS { (u)-[:LIKES]->(:Movie {genre: 'Horror'}) }
RETURN u
MATCH (u:User {name: 'Alice'})-[:FRIENDS_WITH]->(f)
WHERE NOT EXISTS { (f)-[:BLOCKED]->(u) }
RETURN f4. Indexes & Constraints
4.1 Indexes
// Single property
CREATE INDEX user_name FOR (u:User) ON (u.name)
// Composite (Neo4j 4.4+)
CREATE INDEX user_country_age FOR (u:User) ON (u.country, u.age)
// Relationship property (Neo4j 5+)
CREATE INDEX FOR ()-[r:FRIENDS_WITH]-() ON (r.since)
// Full-text (Lucene-backed)
CREATE FULLTEXT INDEX movie_search FOR (m:Movie) ON EACH [m.title, m.description]
// Query
CALL db.index.fulltext.queryNodes("movie_search", "matrix +sci-fi") YIELD node, score
RETURN node.title, score
// Vector index (Neo4j 5.11+)
CREATE VECTOR INDEX movie_embedding FOR (m:Movie) ON (m.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}}
// Query
CALL db.index.vector.queryNodes('movie_embedding', 10, $query_embedding)
YIELD node, score
RETURN node.title, score4.2 Constraints
// Unique
CREATE CONSTRAINT user_email_unique FOR (u:User) REQUIRE u.email IS UNIQUE
// Existence (Neo4j 5+)
CREATE CONSTRAINT user_name_exists FOR (u:User) REQUIRE u.name IS NOT NULL
// Node key (PK equivalent)
CREATE CONSTRAINT user_email_key FOR (u:User) REQUIRE u.email IS NODE KEY
// Type constraint (Neo4j 5.9+)
CREATE CONSTRAINT user_age_int FOR (u:User) REQUIRE u.age IS :: INTEGER4.3 Index hints
MATCH (u:User {country: 'VN'})
USING INDEX u:User(country)
RETURN uRarely needed; planner usually picks well.
5. Performance Characteristics
5.1 Index-free adjacency
Each node has direct pointers to its neighbors → O(1) traversal.
graph LR NodeA[Node A<br/>pointers to: B, C, D] --> NodeB NodeA --> NodeC NodeA --> NodeD
vs Relational: JOIN requires index lookup per row. With 10M rows, 5-hop traversal = 5 × index_lookup × match_count → exponential.
5.2 Local query optimization
Graph DB explores from start node outward. Good for:
- Bounded traversal (depth-limited)
- Path queries (start → end)
- Neighborhood queries (within N hops)
Bad for:
- Global aggregates (count all nodes)
- Full-graph scans
- Operations that fit relational well
5.3 Benchmark direction
Friends-of-friends-of-friends (3 hops) on 1M users, 50 avg friends:
| Relational | Graph | |
|---|---|---|
| Query time | seconds-minutes | milliseconds |
| Memory | huge intermediate sets | local exploration |
| Scaling | gets worse with depth | constant per hop |
6. Graph Algorithms — GDS Library
Neo4j Graph Data Science library (GDS) — algorithm catalog.
6.1 Setup
// Project graph into memory
CALL gds.graph.project(
'my-graph',
'User',
'FRIENDS_WITH',
{relationshipProperties: 'strength'}
)6.2 PageRank
CALL gds.pageRank.stream('my-graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS user, score
ORDER BY score DESC LIMIT 10Use case: find influential users.
6.3 Community detection (Louvain)
CALL gds.louvain.stream('my-graph')
YIELD nodeId, communityId
RETURN communityId, count(*) AS size
ORDER BY size DESCUse case: cluster users into communities.
6.4 Shortest path (Dijkstra)
CALL gds.shortestPath.dijkstra.stream('my-graph', {
sourceNode: source,
targetNode: target,
relationshipWeightProperty: 'strength'
}) YIELD path6.5 Centrality
- Betweenness centrality — nodes that bridge communities
- Closeness centrality — well-connected nodes
- Degree centrality — most connections
6.6 Similarity
Node similarity via Jaccard, Cosine, Overlap.
6.7 Embedding (graph embedding)
GDS 2.x supports:
- Node2Vec
- FastRP
- GraphSAGE
CALL gds.fastRP.write('my-graph', {
embeddingDimension: 128,
writeProperty: 'embedding'
})Generated embedding → use for ML downstream.
7. Neo4j Architecture
7.1 Editions
- Neo4j Community Edition — Free, OSS (GPLv3), single-node only
- Neo4j Enterprise Edition — Commercial, sharding, RBAC, hot backup
- AuraDB — Managed cloud (Free tier + paid)
Sharding (multi-DB / Fabric) only in Enterprise. Big gotcha for OSS users with >1TB.
7.2 Storage
Native graph storage. Property store + node store + relationship store + label store + index store.
Bolt protocol for clients (similar to TCP+JSON-ish framing).
7.3 ACID transactions
Full ACID, including multi-statement transactions.
BEGIN
MATCH (a:Account {id: 1}) SET a.balance = a.balance - 100
MATCH (b:Account {id: 2}) SET b.balance = b.balance + 100
COMMIT7.4 Causal Cluster (Enterprise)
graph TB Core1[Core 1 - leader] --> Replica1[Read Replica 1] Core1 -.Raft.-> Core2[Core 2] Core1 -.Raft.-> Core3[Core 3] Core1 --> Replica2[Read Replica 2]
- Cores: Raft consensus, write quorum
- Read replicas: async, scale reads
7.5 Backup & restore
Enterprise: online backup with neo4j-admin backup. Community: shut down + copy files.
8. Memgraph (Alternative)
Cypher-compatible, in-memory, Rust+C++. Optimized for streaming + real-time analytics.
8.1 Pros over Neo4j
- 10-100x faster on some workloads (in-mem)
- Streaming (consume Kafka → graph)
- OSS Apache-style license (more permissive)
- Multi-tenancy primitives
8.2 Cons
- Smaller ecosystem
- Less mature tooling
- Smaller community
8.3 Use case
Real-time fraud detection, streaming graph analytics, dynamic networks.
9. Comparison 2024-2026
| Neo4j | Memgraph | Amazon Neptune | TigerGraph | DGraph | |
|---|---|---|---|---|---|
| Type | Native graph | In-mem native | Multi-model | Distributed | Distributed |
| Language | Cypher | Cypher | Gremlin + openCypher | GSQL | GraphQL+- |
| OSS | Community GPLv3 | Apache 2.0 | No | Community | Yes |
| HA | Enterprise only | Yes | Yes (managed) | Yes | Yes |
| Sharding | Enterprise only | Yes | Yes | Yes | Yes |
| Vector | 5.11+ | Yes | Limited | No | No |
| Vendor | Neo4j Inc | Memgraph | AWS | TigerGraph | Dgraph Labs |
| Cloud | AuraDB | Cloud | Neptune | Cloud | Cloud |
Pattern 2024-2026: Neo4j dominant for learning + most workloads. Memgraph rising for real-time. Neptune if AWS-only + integration with other AWS services.
10. Modeling Tips
10.1 Direction matters
// Semantic direction
(:User)-[:WROTE]->(:Post) // good
(:User)-[:WROTE]-(:Post) // ambiguous
// Bidirectional with semantic clarity
(:User)-[:FRIENDS_WITH]->(:User) // implicit reciprocal? Or directional?Conventions:
- Directional for asymmetric (FOLLOWS, OWNS, WROTE)
- Single direction with implicit reciprocal for symmetric (FRIENDS_WITH)
- Or store both directions if needed for traversal speed
10.2 Super-node problem
One node with millions of relationships → bad locality, slow traversal.
graph LR User[User Alice<br/>10M followers] --> F1[F1] User --> F2[F2] User --> F3[...10M...] User --> Fn[Fn]
Mitigations:
- Add intermediate nodes (
Groupbetween User and Members) - Property indexes on relationship props for filtered traversal
- Separate “hot” vs “cold” relationships
10.3 Relationship vs node
When to make X a node vs property:
- Property when X is value (color = “red”)
- Node when X has own relationships (Movie has many actors, director, etc.)
10.4 Labels for type, not flags
// Good
(:User:Verified) // type combo
(:Movie)
// Bad - state as label
(:User:Online) // changes frequently → use property instead10.5 Property design
- Atomic values (no nested objects, except in newer Cypher with maps)
- Avoid lists when relationships fit better
- Index frequently-filtered properties
11. Common Patterns
11.1 Recommendation
// "Movies friends like that I haven't seen"
MATCH (me:User {name: 'Alice'})-[:FRIENDS_WITH]->(friend)-[:LIKES]->(movie:Movie)
WHERE NOT (me)-[:LIKES]->(movie)
RETURN movie.title, count(*) AS friend_likes
ORDER BY friend_likes DESC LIMIT 1011.2 Collaborative filtering
// "Users similar to me, what do they like that I don't?"
MATCH (me:User {name: 'Alice'})-[:LIKES]->(m:Movie)<-[:LIKES]-(other:User)
WITH me, other, count(m) AS overlap
WHERE overlap > 5
MATCH (other)-[:LIKES]->(recommended:Movie)
WHERE NOT (me)-[:LIKES]->(recommended)
RETURN recommended.title, count(*) AS strength
ORDER BY strength DESC LIMIT 1011.3 Fraud rings
// Find accounts sharing characteristics in suspicious patterns
MATCH (a:Account)-[:USES]->(card:Card)<-[:USES]-(b:Account)
WHERE a <> b
MATCH (a)-[:USES]->(addr:Address)<-[:USES]-(b)
RETURN a, b, card.number, addr.line1
// Two accounts sharing card AND address - suspicious11.4 Knowledge graph for RAG
Combine vector + graph:
// Find documents semantically similar AND topically related
CALL db.index.vector.queryNodes('doc_embedding', 50, $query_emb) YIELD node AS doc, score
MATCH (doc)-[:ABOUT]->(topic:Topic)-[:RELATED_TO]->(related_topic:Topic)<-[:ABOUT]-(other_doc:Document)
RETURN DISTINCT other_doc, score
ORDER BY score DESC LIMIT 1011.5 Authorization graph
// Can user X access resource Y?
MATCH path = (u:User {id: $user_id})-[:MEMBER_OF*0..]->(g:Group)-[:HAS_ROLE]->(r:Role)-[:CAN_ACCESS]->(res:Resource {id: $resource_id})
RETURN length(path) AS distance, pathFar more powerful than relational RBAC for inheritance.
12. Operations
12.1 Backup
Community: stop + file copy.
Enterprise: neo4j-admin backup online.
12.2 Monitoring
CALL dbms.queryJmx("org.neo4j:*")
// JMX metrics
// Slow queries
CALL dbms.listQueries() YIELD queryId, query, elapsedTimeMillis
WHERE elapsedTimeMillis > 1000
RETURN *Integrate with Prometheus + Grafana.
12.3 Cypher EXPLAIN / PROFILE
EXPLAIN MATCH ... RETURN ... // plan only
PROFILE MATCH ... RETURN ... // actual run with statsLook for “NodeByLabelScan” (full table scan equivalent) → add index.
12.4 Tuning
# neo4j.conf
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=10GPage cache should hold working set of nodes + relationships.
13. When NOT to use Graph DB
- Simple CRUD app — relational sufficient
- Aggregation-heavy (sum, count over whole graph) — column store better
- Mostly tabular data with occasional joins — relational
- Throughput >100K transactions/sec — most graph DBs less optimized for OLTP write rate
- Team unfamiliar with graph thinking — learning curve
Don’t use graph as primary store for general business data. Use it as specialized store alongside Postgres/etc.
14. Anti-patterns
| Pattern | Why bad |
|---|---|
| Modeling everything as graph | Aggregations slow, harder than SQL |
| Super-nodes with 10M+ rels | Slow traversal |
| Multi-hop without depth limit | Query times out |
| Using labels for state (Online/Offline) | Frequent updates fragment graph |
| Sharding too early (Community → Enterprise migration) | Plan upfront |
| Ignoring index hints in PROFILE | Hidden full scans |
| Relationship without direction (always undirected) | Loses semantics |
| Using as primary OLTP DB | Sharding/HA limited in OSS Community |
15. Lab
Day 1: Setup
docker run -d --name neo4j -p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/lab \
neo4j:5Browser at http://localhost:7474.
Day 2: Cypher basics
Create LinkedIn-like graph: users, companies, jobs, skills. Practice CRUD.
Day 3: Traversal
- 2-hop connections (people you may know)
- Shortest path between 2 users
- Companies hiring for skills you have
Day 4: GDS
Run PageRank on user influence. Louvain for community detection.
Day 5: Vector index
Add embedding to articles. Hybrid query: semantic + graph.
Day 6: Performance
EXPLAIN/PROFILE 5 queries. Add indexes. Compare.
Day 7: Real scenario
Build fraud detection: transactions, accounts, IPs. Find rings.
16. Self-check
- Khi nào pick graph DB? 3 triệu chứng?
- Index-free adjacency — vì sao quan trọng?
- Property graph vs RDF — khác biệt?
- Variable-length path syntax Cypher?
- Super-node problem + mitigation?
- PageRank dùng cho gì?
- Neo4j Community vs Enterprise — limit lớn nhất?
- Memgraph khác Neo4j ở đâu (mạnh hơn / yếu hơn)?
- Cypher MERGE vs CREATE — khác biệt?
- Khi nào KHÔNG dùng graph DB?
17. Tiếp theo
Cập nhật: 2026-05-16