Week 12: MongoDB & Document Databases
“MongoDB is not ‘Postgres without a schema’. It is a different paradigm: document-centric rather than row-centric. Understand the difference and you use it correctly. Miss it and you build a product that eventually has to be rewritten.”
Tags: database mongodb document-db nosql aggregation
Duration: 7 days (4-6 h/day)
Prerequisites: Tuan-02-Schema-Design-Normalization (to compare mindsets)
Related: Case-Design-Data-AI-RAG (document store)
1. Context & Why
1.1 MongoDB in 2024-2026
timeline
    title MongoDB Evolution
    2009 : v1.0 - schemaless hype
    2015 : v3.0 - WiredTiger storage engine
    2018 : v4.0 - multi-document transactions
    2022 : v6.0 - queryable encryption (preview)
    2023 : v7.0 - queryable encryption GA, approximate search via Atlas Vector Search
    2024 : v8.0 - block-level storage, time-series enhancements, range queries for queryable encryption
    2025-26 : v9 expected
License: SSPL (since 2018), which is not an OSI-approved open-source license. AWS DocumentDB offers a MongoDB-compatible API. The managed Atlas service is the dominant way to run MongoDB.
1.2 Document model strengths
- Polymorphic data — different docs in same collection can have different fields
- Nested data — embed related data, fewer trips
- Schema evolution easy — add fields without migration
- JSON-native — fits modern web stack
1.3 Document model weaknesses
- Cross-document joins awkward ($lookup slow)
- Eventual consistency at scale
- No referential integrity by default
- Large document issues (16MB max)
1.4 Weekly goals
- Document model thinking — embedding vs referencing
- Aggregation pipeline — MongoDB’s “SQL”
- Indexes: single, compound, multikey, text, geo, hashed
- Sharding strategy
- Change streams (CDC)
- Transaction limits
- Atlas Vector Search (2024 feature for AI)
- Anti-patterns: unbounded array, deep nesting
1.5 References
- MongoDB Documentation — https://www.mongodb.com/docs/
- MongoDB University — free courses
- The Little MongoDB Book — Karl Seguin
- Designing Data-Intensive Applications Ch. 2 (Document model)
- Kyle Banker — MongoDB in Action (older but solid)
2. Data Model
2.1 Document basic
{
"_id": ObjectId("65e2f1234567890abcdef000"),
"email": "[email protected]",
"name": "Alice",
"addresses": [
{"city": "SF", "country": "US"},
{"city": "NY", "country": "US"}
],
"preferences": {
"theme": "dark",
"notifications": true
},
"tags": ["premium", "early-adopter"],
"created_at": ISODate("2026-05-16T10:00:00Z")
}
Document = JSON-like (stored internally as BSON, a binary JSON encoding with more types).
2.2 Schema design — embedding vs referencing
Embed when:
- 1-to-1 or 1-to-few
- Always accessed together
- Sub-doc doesn’t change independently
- No need to query sub-doc independently
// User with embedded addresses (1-to-few)
{
"_id": "user1",
"name": "Alice",
"addresses": [{"city": "SF"}, {"city": "NY"}]
}
Reference when:
- 1-to-many (unbounded)
- Many-to-many
- Independent lifecycle
- Need to query separately
// User collection
{"_id": "user1", "name": "Alice"}
// Orders collection
{"_id": "order1", "user_id": "user1", "total": 100}
{"_id": "order2", "user_id": "user1", "total": 200}
2.3 Rule of thumb
| Cardinality | Pattern |
|---|---|
| 1-to-1 | Embed |
| 1-to-few (<100) | Embed array |
| 1-to-many (100-1000) | Reference + maybe denormalize summary |
| 1-to-many (>1000) | Reference always |
| Many-to-many | Reference both sides |
2.4 Anti-pattern: Unbounded array
// BAD: array grows forever
{
"_id": "user1",
"events": [event1, event2, ..., event_millions] // ❌
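}
The document eventually hits the 16MB limit. Long before that, updates slow down and memory is wasted, because every read and write has to handle the whole array.
→ Move the items to a separate collection and reference the parent.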
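To get a feel for how quickly an embedded array can reach the limit, here is a rough back-of-the-envelope sketch. The 200-byte average event size and the helper name maxEmbeddedEvents are assumptions for illustration only; real BSON overhead depends on field names and types.

```javascript
// Rough estimate of how many embedded events fit before a document hits the 16MB BSON limit.
// avgEventBytes is an assumed average; measure your own documents for a real number.
const MAX_DOC_BYTES = 16 * 1024 * 1024;

function maxEmbeddedEvents(avgEventBytes, baseDocBytes = 0) {
  return Math.floor((MAX_DOC_BYTES - baseDocBytes) / avgEventBytes);
}

const capacity = maxEmbeddedEvents(200);
console.log(capacity); // 83886
```

At an assumed 1,000 events per day, such a document would fill up in under three months, and it becomes expensive to update well before that.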
2.5 Anti-pattern: Deep nesting
{
"_id": "order1",
"items": [{
"product": {
"vendor": {
"address": {
"country": { // 4+ levels deep
...
}
}
}
}
}]
}
Hard to query, hard to update specific fields.
→ Flatten or reference.
3. CRUD Operations
3.1 Insert
db.users.insertOne({
email: "[email protected]",
name: "Alice",
created_at: new Date()
});
db.users.insertMany([
{email: "[email protected]", name: "Bob"},
{email: "[email protected]", name: "Carol"}
], {ordered: false}); // continue on error
3.2 Find
// All
db.users.find({});
// Equality
db.users.find({email: "[email protected]"});
// Operators
db.users.find({age: {$gt: 18, $lt: 65}});
db.users.find({tags: {$in: ["premium", "vip"]}});
db.users.find({"addresses.city": "SF"}); // nested field
// Logical
db.users.find({$or: [{age: {$lt: 18}}, {age: {$gt: 65}}]});
// Projection
db.users.find({}, {name: 1, email: 1, _id: 0});
// Sort, limit, skip
db.users.find().sort({created_at: -1}).skip(20).limit(10); // skip is applied before limit
3.3 Update
// Update one
db.users.updateOne(
{_id: ObjectId("...")},
{$set: {name: "Alice Smith"}, $inc: {login_count: 1}}
);
// Update many
db.users.updateMany(
{age: {$gte: 18}},
{$set: {status: "adult"}}
);
// Array operations
db.users.updateOne(
{_id: "user1"},
{$push: {tags: "new-tag"}}
);
db.users.updateOne(
{_id: "user1"},
{$pull: {tags: "old-tag"}}
);
db.users.updateOne(
{_id: "user1"},
{$addToSet: {tags: "unique"}} // only if not exists
);
// Positional update (first array match)
db.users.updateOne(
{"addresses.city": "SF"},
{$set: {"addresses.$.country": "USA"}}
);
// All matching
db.users.updateMany(
{},
{$set: {"addresses.$[].country": "USA"}} // all elements
);
// Filtered
db.users.updateMany(
{},
{$set: {"addresses.$[el].country": "USA"}},
{arrayFilters: [{"el.city": "SF"}]}
);
3.4 Upsert
db.users.updateOne(
{email: "[email protected]"},
{$set: {name: "Alice"}},
{upsert: true}
);
3.5 Delete
db.users.deleteOne({_id: ObjectId("...")});
db.users.deleteMany({status: "inactive"});
4. Aggregation Pipeline
4.1 Concept — MongoDB’s SQL
Pipeline of stages, each transforms documents.
db.orders.aggregate([
{$match: {status: "completed"}}, // WHERE
{$group: { // GROUP BY
_id: "$user_id",
total: {$sum: "$amount"},
count: {$sum: 1}
}},
{$sort: {total: -1}}, // ORDER BY
{$limit: 10} // LIMIT
]);
4.2 Common stages
| Stage | Purpose |
|---|---|
| $match | Filter docs (use early!) |
| $project | Reshape (select columns) |
| $group | GROUP BY |
| $sort | Sort |
| $limit, $skip | Pagination |
| $unwind | Explode array to multiple docs |
| $lookup | Join (left outer) |
| $addFields | Add computed fields |
| $facet | Multiple pipelines in one |
| $bucket | Histogram |
| $out, $merge | Materialize result |
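To make the stage-by-stage idea concrete, the sketch below re-implements the pipeline from 4.1 ($match, $group, $sort, $limit) in plain JavaScript over an in-memory array. The sample orders and the function name topSpenders are made up for illustration; this mirrors the semantics only and says nothing about how the server executes a pipeline.

```javascript
// Plain-JavaScript equivalent of: $match -> $group -> $sort -> $limit.
// The orders array is made-up sample data.
const orders = [
  {user_id: "u1", status: "completed", amount: 100},
  {user_id: "u2", status: "completed", amount: 50},
  {user_id: "u1", status: "completed", amount: 200},
  {user_id: "u3", status: "pending", amount: 999},
];

function topSpenders(docs, n) {
  const groups = new Map();
  for (const doc of docs) {
    if (doc.status !== "completed") continue;            // $match
    const g = groups.get(doc.user_id) ?? {_id: doc.user_id, total: 0, count: 0};
    g.total += doc.amount;                                // total: {$sum: "$amount"}
    g.count += 1;                                         // count: {$sum: 1}
    groups.set(doc.user_id, g);                           // $group by user_id
  }
  return [...groups.values()]
    .sort((a, b) => b.total - a.total)                    // $sort: {total: -1}
    .slice(0, n);                                         // $limit
}

const result = topSpenders(orders, 10);
console.log(result);
```

Each step consumes the output of the previous one, which is why placing the filter first shrinks the work done by every later stage.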
4.3 $lookup — join
db.orders.aggregate([
{$lookup: {
from: "users",
localField: "user_id",
foreignField: "_id",
as: "user"
}},
{$unwind: "$user"}, // since lookup gives array
{$project: {total: 1, "user.name": 1, "user.email": 1}}
]);
⚠️ $lookup is slow on large collections unless the foreignField is indexed in the joined collection. MongoDB philosophy: avoid joins, embed when possible.
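The following sketch mimics the output shape of $lookup in plain JavaScript, as a left outer join where each local document gets an array of matches. The sample data and the lookup helper are hypothetical; the Map stands in for an index on the foreign field, which is what keeps each probe cheap.

```javascript
// Left outer join in plain JavaScript, mimicking $lookup's output shape.
// Sample data is made up; the Map plays the role of an index on users._id.
const users = [{_id: "user1", name: "Alice"}, {_id: "user2", name: "Bob"}];
const orders = [
  {_id: "order1", user_id: "user1", total: 100},
  {_id: "order2", user_id: "user9", total: 200}, // no matching user
];

function lookup(localDocs, foreignDocs, localField, foreignField, as) {
  const index = new Map();
  for (const f of foreignDocs) {
    const key = f[foreignField];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(f);
  }
  // Every local doc is kept; unmatched docs get an empty array (left outer join).
  return localDocs.map((d) => ({...d, [as]: index.get(d[localField]) ?? []}));
}

const joined = lookup(orders, users, "user_id", "_id", "user");
console.log(JSON.stringify(joined, null, 2));
```

Note that a following {$unwind: "$user"} drops documents whose array is empty unless preserveNullAndEmptyArrays: true is set, which effectively turns the join into an inner join.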
4.4 $unwind — array explode
// Doc with tags: ["rust", "db"]
db.posts.aggregate([
{$unwind: "$tags"},
{$group: {_id: "$tags", count: {$sum: 1}}}
]);
// Result: [{_id: "rust", count: X}, {_id: "db", count: Y}]
4.5 $facet — multi-aggregate
db.orders.aggregate([
{$facet: {
"byStatus": [{$group: {_id: "$status", count: {$sum: 1}}}],
"totalRevenue": [{$group: {_id: null, sum: {$sum: "$amount"}}}],
"top10Users": [{$group: {_id: "$user_id", total: {$sum: "$amount"}}}, {$sort: {total: -1}}, {$limit: 10}]
}}
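]);
4.6 Performance: $match early
// BAD: the only filter runs after $group, so every document gets grouped first
// (a $match on a computed field such as count is like SQL HAVING and has to stay after $group)
[{$group: ...}, {$match: {count: {$gt: 100}}}]
// GOOD: filter on stored fields before $group; a leading $match can use an index
[{$match: {date: {$gte: ISODate("2026-01-01")}}}, {$group: ...}]
The MongoDB optimizer reorders some stages, but not always. Be explicit.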
4.7 Aggregation expressions
{$project: {
  fullName: {$concat: ["$first", " ", "$last"]},
  age: {$subtract: [{$year: "$$NOW"}, {$year: "$birthDate"}]}, // approximate: compares years only, ignores month and day
  discount: {$cond: {if: {$gte: ["$total", 100]}, then: 10, else: 0}}
}}
5. Indexes
5.1 Types
// Single field
db.users.createIndex({email: 1});
// Compound
db.users.createIndex({country: 1, age: -1});
// Multikey (array)
db.users.createIndex({tags: 1});
// Text (FTS)
db.articles.createIndex({title: "text", body: "text"});
// Geo 2dsphere
db.places.createIndex({location: "2dsphere"});
// Hashed (for sharding)
db.events.createIndex({user_id: "hashed"});
// Unique
db.users.createIndex({email: 1}, {unique: true});
// Partial (sparse extended)
db.users.createIndex(
{email: 1},
{unique: true, partialFilterExpression: {deleted: {$ne: true}}}
);
// TTL
db.sessions.createIndex({expires_at: 1}, {expireAfterSeconds: 0});
// Wildcard
db.products.createIndex({"$**": 1}); // index every field (sparingly)
5.2 ESR rule (same as Postgres)
Equality, Sort, Range: order the fields of a compound index so that equality-matched fields come first, then sort fields, then range-filtered fields.
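As a small illustration of the ordering, the sketch below assembles an index key pattern from a description of a query. buildEsrIndex and its input shape are hypothetical helpers invented for this example, not part of any MongoDB API; the field names are also assumptions.

```javascript
// Build a compound index key pattern in Equality -> Sort -> Range order.
// buildEsrIndex and its input shape are hypothetical helpers, not a MongoDB API.
function buildEsrIndex({equality = [], sort = {}, range = []}) {
  const spec = {};
  for (const field of equality) spec[field] = 1;
  for (const [field, dir] of Object.entries(sort)) {
    if (!(field in spec)) spec[field] = dir;
  }
  for (const field of range) {
    if (!(field in spec)) spec[field] = 1;
  }
  return spec;
}

// Query shape: find({status: "completed", amount: {$gt: 100}}).sort({created_at: -1})
const spec = buildEsrIndex({
  equality: ["status"],
  sort: {created_at: -1},
  range: ["amount"],
});
console.log(spec); // { status: 1, created_at: -1, amount: 1 }
```

The resulting object could be passed to db.orders.createIndex(spec). Putting the range field last lets the index satisfy the sort without an in-memory SORT stage.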
5.3 Index intersection
MongoDB can intersect indexes but often less efficient than compound. Prefer compound.
5.4 Explain
db.users.find({email: "[email protected]"}).explain("executionStats");
// stage: IXSCAN, COLLSCAN, FETCH, SORT, etc
// nReturned, totalDocsExamined (docsExamined per stage), executionTimeMillis
docsExamined >> nReturned → bad index or no index.
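A quick way to apply this check is to compute the examined-to-returned ratio from the explain output. The sketch below reads totalDocsExamined and nReturned from executionStats; the sample object is hand-written, and the threshold of 10 is an arbitrary assumption, not an official guideline.

```javascript
// Summarise an explain("executionStats") result.
// The sample object is hand-written; only the fields used here are included.
function scanEfficiency(explainResult) {
  const stats = explainResult.executionStats;
  const examined = stats.totalDocsExamined;
  const returned = stats.nReturned;
  const ratio = returned === 0 ? examined : examined / returned;
  // The threshold of 10 is an arbitrary assumption for this sketch.
  return {examined, returned, ratio, suspicious: ratio > 10};
}

const sample = {executionStats: {nReturned: 1, totalDocsExamined: 50000, executionTimeMillis: 42}};
const summary = scanEfficiency(sample);
console.log(summary);
```

A ratio close to 1 means the index narrows the scan to roughly the documents that are returned; a ratio in the thousands usually points to a COLLSCAN or a poorly matching index.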
5.5 Index size
db.users.totalIndexSize();
db.users.stats().indexSizes;
6. Transactions
6.1 Multi-document transactions (4.0+)
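// Node.js driver, inside an async function; client is a connected MongoClient
const session = client.startSession();
try {
  session.startTransaction();
  const accounts = client.db("lab").collection("accounts"); // "lab" is the database used in the lab section
  await accounts.updateOne({_id: 1}, {$inc: {balance: -100}}, {session});
  await accounts.updateOne({_id: 2}, {$inc: {balance: 100}}, {session});
  await session.commitTransaction();
} catch (e) {
  await session.abortTransaction();
  throw e; // surface the failure to the caller instead of swallowing it
} finally {
  await session.endSession();
}
6.2 Limitations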
- Max 60 seconds runtime by default
- Can lock collections
- Performance cost ~5-15%
- Sharded transactions slower
Pattern: design schema so multi-doc transactions rarely needed (embed when atomicity required).
7. Replication
7.1 Replica set
graph TB
    P[Primary]
    S1[Secondary 1]
    S2[Secondary 2]
    A[Arbiter optional]
    P -.replicates.-> S1
    P -.replicates.-> S2
    S1 -.election votes.-> P
    S2 -.election votes.-> P
    A -.election votes.-> P
    Client[Client] --> P
- 3+ members (odd for elections)
- Primary handles all writes
- Secondaries replicate via oplog (rolling log)
- Auto-failover on primary loss (~10s)
7.2 Read preference
- primary - default, strong consistency
- primaryPreferred - primary if up, secondary otherwise
- secondary - only secondary (eventually consistent)
- secondaryPreferred - secondary if available, primary otherwise
- nearest - member with the lowest network latency
db.users.find().readPref("secondary");
7.3 Write concern
db.users.insertOne(doc, {writeConcern: {w: "majority", j: true, wtimeout: 5000}});
- w: 1 - primary only
- w: "majority" - majority of replicas
- w: N - wait for N replicas
- j: true - wait for journal commit
- wtimeout - stop waiting after this many milliseconds (the write itself is not rolled back)
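The arithmetic behind "majority" also explains the advice to use an odd number of members. The sketch below is a simplified calculation over voting members only; it ignores the subtleties that arbiters introduce, and the function name is just for illustration.

```javascript
// Number of members that form a majority, given the count of voting members.
// Simplified: ignores how arbiters affect majority write acknowledgement.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

for (const n of [3, 4, 5]) {
  console.log(`${n} voting members -> majority ${majority(n)}, tolerates ${n - majority(n)} down`);
}
```

Going from 3 to 4 members raises the majority from 2 to 3 without increasing the number of failures the set can tolerate, so the fourth member adds cost but no extra fault tolerance.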
8. Sharding
8.1 Architecture
graph TB
    Client --> Mongos[mongos router]
    Mongos --> Config[Config servers<br/>replica set]
    Mongos --> S1[Shard 1<br/>replica set]
    Mongos --> S2[Shard 2<br/>replica set]
    Mongos --> S3[Shard 3<br/>replica set]
- mongos - query router, stateless
- Config servers - metadata
- Shards - replica sets, each owns a range of chunks
8.2 Shard key
sh.enableSharding("appdb");
sh.shardCollection("appdb.orders", {user_id: 1}); // range-based
sh.shardCollection("appdb.events", {_id: "hashed"}); // hashed
Choosing shard key:
- High cardinality — many unique values
- Low frequency — no hot value
- Monotonic NOT good — increasing values target one shard
- Match query pattern — most queries should include shard key
Pattern: {tenant_id: 1, _id: 1} — composite, scales across shards but groups tenant data.
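The effect of a monotonically increasing key can be seen in a toy simulation. The sketch below routes 1,000 new, increasing ids to three shards, once by range and once by hash. The chunk boundaries and the multiplicative hash are stand-ins chosen for this example; MongoDB's real chunk ranges and hashed-index function are different.

```javascript
// Compare where new writes land under range vs hashed sharding for a monotonically increasing key.
// The chunk boundaries and the hash function are stand-ins; MongoDB's real hashed index uses a different hash.
const SHARDS = 3;

function rangeShard(id) {
  // Assumed chunk ranges: [0,1000) -> shard 0, [1000,2000) -> shard 1, [2000,+inf) -> shard 2
  return Math.min(Math.floor(id / 1000), SHARDS - 1);
}

function hashedShard(id) {
  const h = (id * 2654435761) >>> 0; // multiplicative hash, truncated to 32 bits
  return h % SHARDS;
}

function distribution(ids, pick) {
  const counts = new Array(SHARDS).fill(0);
  for (const id of ids) counts[pick(id)] += 1;
  return counts;
}

const newIds = Array.from({length: 1000}, (_, i) => 2000 + i); // fresh inserts with increasing ids
const byRange = distribution(newIds, rangeShard);
const byHash = distribution(newIds, hashedShard);
console.log("range :", byRange); // all 1000 writes land on the last shard
console.log("hashed:", byHash);  // writes spread across all three shards
```

The trade-off is that hashing scatters adjacent key values, so range queries on the shard key fan out to every shard.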
8.3 Chunk migration
MongoDB auto-balances chunks across shards. Background, generally invisible.
8.4 Anti-patterns
| Pattern | Why bad |
|---|---|
| Shard key never in query | Every query → all shards |
| Monotonically increasing key | All writes → last shard |
| Low cardinality | Few possible chunks |
| Mutable shard key value (immutable before 4.2) | Changing it can move the document to another shard |
9. Change Streams
CDC for MongoDB.
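// Node.js driver; db is a Db handle (in mongosh, watch() returns a cursor rather than an event emitter)
const orders = db.collection("orders");
let resumeToken;
const changeStream = orders.watch();
changeStream.on("change", (change) => {
  console.log(change);
  // {operationType, fullDocument, documentKey, updateDescription}
  resumeToken = change._id; // remember the position of the last processed event
});
// Resume after disconnect
const resumedStream = orders.watch([], {resumeAfter: resumeToken});
Use cases: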
- Real-time updates to UI
- Sync to other systems (Elasticsearch, cache)
- Audit
- Trigger workflows
Requires a replica set or sharded cluster (change streams read from the oplog).
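For the "sync to other systems" use case, a consumer typically folds each event into its own copy of the data. The sketch below applies hand-written sample events to an in-memory Map. applyChange is a hypothetical helper, and it only handles top-level fields: dotted paths in updatedFields for nested updates are not interpreted.

```javascript
// Apply change stream events to an in-memory cache keyed by _id.
// Assumption: only top-level fields are updated; dotted paths for nested fields are not handled.
function applyChange(cache, change) {
  const id = change.documentKey._id;
  switch (change.operationType) {
    case "insert":
    case "replace":
      cache.set(id, change.fullDocument);
      break;
    case "update": {
      const current = {...(cache.get(id) ?? {_id: id})};
      Object.assign(current, change.updateDescription.updatedFields);
      for (const field of change.updateDescription.removedFields) delete current[field];
      cache.set(id, current);
      break;
    }
    case "delete":
      cache.delete(id);
      break;
  }
  return change._id; // resume token to persist once the event has been processed
}

// Hand-written sample events following the change event shape.
const events = [
  {_id: {_data: "t1"}, operationType: "insert", documentKey: {_id: "order1"},
   fullDocument: {_id: "order1", status: "new", note: "x"}},
  {_id: {_data: "t2"}, operationType: "update", documentKey: {_id: "order1"},
   updateDescription: {updatedFields: {status: "paid"}, removedFields: ["note"]}},
  {_id: {_data: "t3"}, operationType: "insert", documentKey: {_id: "order2"},
   fullDocument: {_id: "order2", status: "new"}},
  {_id: {_data: "t4"}, operationType: "delete", documentKey: {_id: "order2"}},
];

const cache = new Map();
let resumeToken;
for (const event of events) resumeToken = applyChange(cache, event);
console.log(cache, resumeToken);
```

Persisting the returned token only after the event has been applied gives at-least-once processing: after a crash the consumer resumes from the last applied event and may see one event twice.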
10. Atlas Vector Search (2024)
MongoDB Atlas added vector search for AI/RAG workloads.
// Create vector index
db.documents.createSearchIndex({
name: "vector_index",
type: "vectorSearch",
definition: {
fields: [{
type: "vector",
path: "embedding",
numDimensions: 1536,
similarity: "cosine"
}]
}
});
// Query
db.documents.aggregate([{
$vectorSearch: {
index: "vector_index",
path: "embedding",
queryVector: [0.1, 0.2, ...],
numCandidates: 100,
limit: 10
}
}]);
Atlas only. Self-hosted MongoDB needs a separate vector DB. Compared in more detail in Tuan-15-Vector-DB-AI.
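To show what similarity: "cosine" measures, the sketch below runs an exact, brute-force cosine search over a few made-up 3-dimensional embeddings. It is only a conceptual reference: $vectorSearch uses an approximate index and considers numCandidates candidates rather than scanning every document, and the score it reports may be scaled differently from the raw cosine value.

```javascript
// Exact (brute-force) cosine-similarity search over made-up 3-dimensional embeddings.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(docs, queryVector, k) {
  return docs
    .map((d) => ({_id: d._id, score: cosine(d.embedding, queryVector)}))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const docs = [
  {_id: "a", embedding: [1, 0, 0]},
  {_id: "b", embedding: [0, 1, 0]},
  {_id: "c", embedding: [1, 1, 0]},
];
const hits = topK(docs, [1, 0, 0], 2);
console.log(hits); // "a" first (same direction), then "c" (45 degrees away)
```

Brute force costs one comparison per document, which is why a dedicated vector index matters once the collection grows.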
11. Patterns
11.1 Bucketing time-series data
{
"sensor_id": "s1",
"hour": ISODate("2026-05-16T10:00:00Z"),
"measurements": [
{"ts": ISODate("...10:00:30Z"), "value": 22.5},
{"ts": ISODate("...10:01:00Z"), "value": 22.6},
... // ~120 per hour
]
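}
Compared with one document per measurement, this cuts the number of documents and index entries by roughly 100x (about 120 measurements per bucket), which is where most of the storage saving comes from.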
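The bucketing step itself is simple to sketch. The function below groups raw readings into one bucket per sensor per hour, using the field names from the bucket document above. The readings are made-up sample data and bucketByHour is a hypothetical helper.

```javascript
// Group raw measurements into one bucket document per sensor per hour.
// Field names follow the bucket document above; the readings are made-up sample data.
function bucketByHour(readings) {
  const buckets = new Map();
  for (const r of readings) {
    const hour = new Date(r.ts);
    hour.setUTCMinutes(0, 0, 0); // truncate to the start of the hour (UTC)
    const key = `${r.sensor_id}|${hour.toISOString()}`;
    if (!buckets.has(key)) {
      buckets.set(key, {sensor_id: r.sensor_id, hour, measurements: []});
    }
    buckets.get(key).measurements.push({ts: r.ts, value: r.value});
  }
  return [...buckets.values()];
}

const readings = [
  {sensor_id: "s1", ts: new Date("2026-05-16T10:00:30Z"), value: 22.5},
  {sensor_id: "s1", ts: new Date("2026-05-16T10:01:00Z"), value: 22.6},
  {sensor_id: "s1", ts: new Date("2026-05-16T11:00:00Z"), value: 23.0},
];
const buckets = bucketByHour(readings);
console.log(buckets.length); // 2
```

In a live system the same effect is usually achieved incrementally, for example with updateOne({sensor_id, hour}, {$push: {measurements: m}}, {upsert: true}), so that each bucket stays bounded by the number of readings per hour.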
MongoDB 5.0+ native time-series collections:
db.createCollection("metrics", {
timeseries: {
timeField: "timestamp",
metaField: "metadata",
granularity: "seconds"
}
});
Auto-buckets internally. Easier API.
11.2 Polymorphism
{"_id": "1", "type": "user", "name": "Alice"}
{"_id": "2", "type": "company", "name": "Acme", "employees": 100}
Same collection, different shapes. Filter by type.
11.3 Schema validation
MongoDB doesn't enforce a schema by default, but you can add JSON Schema validation:
db.createCollection("users", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["email", "name"],
properties: {
email: {bsonType: "string", pattern: "^.+@.+$"},
age: {bsonType: "int", minimum: 0, maximum: 150}
}
}
},
validationLevel: "strict",
validationAction: "error" // or "warn"
});
→ Add structure where it matters.
12. When NOT to use MongoDB
- Complex relational data with many joins → Postgres
- Strict consistency needs → Postgres or distributed SQL
- Heavy analytics → ClickHouse or warehouse
- Tight budget → Postgres usually cheaper at smaller scale
- Geo-distributed strong consistency → Spanner / DSQL
Modern Postgres with JSONB and GIN indexes covers a large share of typical MongoDB use cases (the notes' rough estimate: 80%). Default to Postgres unless MongoDB offers a clear win.
13. Anti-patterns
| Pattern | Why bad | Fix |
|---|---|---|
| Unbounded array | Doc size grows, slow | Reference collection |
| Deep nesting >3-4 levels | Hard query/update | Flatten or reference |
| No indexes (early stage) | Slow when data grows | Index early |
| $lookup on huge collections | Slow joins | Denormalize/embed |
| Same collection for very different docs | Hard to index/query | Separate collections |
| Treating _id as mutable | _id cannot be changed after insert; changing it means delete + reinsert | Choose an immutable _id |
| Use as replacement for Redis | Memory pressure, slower | Redis for cache |
| Use as replacement for OLAP | Aggregation slow on large | ClickHouse |
| Schema-free for years | Tech debt | Add validation |
14. Lab
14.1 Day 1: Basic CRUD
docker run -d -p 27017:27017 mongo:7
mongosh
use lab
db.users.insertOne(...)
# Practice CRUD
14.2 Day 2: Aggregation
Use sample data. Build 5 complex aggregations: count by group, top N, faceted, lookup, unwind.
14.3 Day 3: Indexes
Create collection 1M docs. Run queries. Add indexes. EXPLAIN. Compare.
14.4 Day 4: Replica set
3-node replica set via docker. Failover testing.
14.5 Day 5: Sharded cluster
Setup mongos + config + 2 shards. Shard a collection.
14.6 Day 6: Change streams
Watch collection. Trigger script that updates docs. Receive events.
14.7 Day 7: Schema validation
Add $jsonSchema validator. Try insert valid + invalid. Test migration.
15. Self-check
- Embed vs reference: 4 rules?
- Document model strengths + weaknesses?
- Aggregation pipeline: 5 most-used stages?
- $match early vs late: why does it matter?
- Index types in MongoDB?
- Shard key: 3 good properties?
- What do change streams require?
- Time-series collections (5.0+): why are they better than doc-per-measurement?
- MongoDB transaction limits?
- When should you not use MongoDB?
16. Next
Phase 3 complete. Next lesson: Tuan-13-Search-Engines-ES (search workloads).
Week 12 complete. Document well or use rows. Updated: 2026-05-16