Week 12: MongoDB & Document Databases

“MongoDB is not ‘Postgres without a schema’. It is a different paradigm: document-centric rather than row-centric. Understand the difference and you use it correctly. Miss it and you build a product that eventually has to be rewritten.”

Tags: database mongodb document-db nosql aggregation
Duration: 7 days (4-6h/day)
Prerequisites: Tuan-02-Schema-Design-Normalization (to compare mindsets)
Related: Case-Design-Data-AI-RAG (document store)

1. Context & Why

1.1 MongoDB trong 2024-2026

timeline
    title MongoDB Evolution
    2009 : v1.0 - schemaless hype
    2015 : v3.0 - WiredTiger storage engine
    2018 : v4.0 - multi-document transactions
    2022 : v6.0 - Queryable Encryption (preview)
    2023 : v7.0 - Queryable Encryption GA, Atlas Vector Search for approximate nearest-neighbor search
    2024 : v8.0 - time-series block processing, range queries for Queryable Encryption, general performance gains
    2025-26 : v9 expected

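License: SSPL since 2018, which is not an OSI-approved open-source license. Amazon DocumentDB exposes a MongoDB-compatible API but is a separate implementation. Atlas is the dominant managed offering.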

1.2 Document model strengths

Polymorphic data — different docs in same collection can have different fields
Nested data — embed related data, fewer trips
Schema evolution easy — add fields without migration
JSON-native — fits modern web stack

1.3 Document model weaknesses

Cross-document joins awkward ($lookup slow)
Reads from secondaries are only eventually consistent
No referential integrity by default
Large document issues (16MB max)

1.4 Goals for the week

Document model thinking — embedding vs referencing
Aggregation pipeline — MongoDB’s “SQL”
Indexes: single, compound, multikey, text, geo, hashed
Sharding strategy
Change streams (CDC)
Transactions limits
Atlas Vector Search (2024 feature for AI)
Anti-patterns: unbounded array, deep nesting

1.5 References

MongoDB Documentation — https://www.mongodb.com/docs/
MongoDB University — free courses
The Little MongoDB Book — Karl Seguin
Designing Data-Intensive Applications Ch. 2 (Document model)
Kyle Banker — MongoDB in Action (older but solid)

2. Data Model

2.1 Document basic

{
    "_id": ObjectId("65e2f1234567890abcdef000"),
    "email": "alice@example.com",
    "name": "Alice",
    "addresses": [
        {"city": "SF", "country": "US"},
        {"city": "NY", "country": "US"}
    ],
    "preferences": {
        "theme": "dark",
        "notifications": true
    },
    "tags": ["premium", "early-adopter"],
    "created_at": ISODate("2026-05-16T10:00:00Z")
}

Document = JSON-like (BSON internally — binary JSON with more types).

2.2 Schema design — embedding vs referencing

Embed when:

1-to-1 or 1-to-few
Always accessed together
Sub-doc doesn’t change independently
No need to query sub-doc independently

// User with embedded addresses (1-to-few)
{
    "_id": "user1",
    "name": "Alice",
    "addresses": [{"city": "SF"}, {"city": "NY"}]
}

Reference when:

1-to-many (unbounded)
Many-to-many
Independent lifecycle
Need to query separately

// User collection
{"_id": "user1", "name": "Alice"}
 
// Orders collection
{"_id": "order1", "user_id": "user1", "total": 100}
{"_id": "order2", "user_id": "user1", "total": 200}

2.3 Rule of thumb

Cardinality	Pattern
1-to-1	Embed
1-to-few (<100)	Embed array
1-to-many (100-1000)	Reference + maybe denormalize summary
1-to-many (>1000)	Reference always
Many-to-many	Reference both sides
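
The table above can be read as a small decision function. A minimal sketch in plain JavaScript: the name `relationPattern` and its arguments are hypothetical, and the thresholds are simply the ones from the table, not hard limits.

```javascript
// Hypothetical helper mirroring the rule-of-thumb table; thresholds come from the table.
function relationPattern(kind, expectedMany = 0) {
    if (kind === "one-to-one") return "embed";
    if (kind === "many-to-many") return "reference both sides";
    // one-to-many: decide by how many children a single parent is expected to hold
    if (expectedMany < 100) return "embed array";
    if (expectedMany <= 1000) return "reference + maybe denormalize summary";
    return "reference always";
}

console.log(relationPattern("one-to-many", 3));      // a user's addresses
console.log(relationPattern("one-to-many", 50000));  // a user's events
```

Treat the numbers as a starting point; access patterns (read together or separately) matter as much as the count.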

2.4 Anti-pattern: Unbounded array

// BAD: array grows forever
{
    "_id": "user1",
    "events": [event1, event2, ..., event_millions]  // ❌
}

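The document eventually hits the 16MB limit. Well before that, every update rewrites an ever-larger document and every read pulls the whole array into memory.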

→ Move to separate collection with references.
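
To get a feel for how quickly an embedded array reaches the limit, here is a back-of-the-envelope estimate in plain JavaScript. It is a rough sketch: the helper names and the 200-byte average event size are assumptions, and it ignores BSON overhead and the rest of the document.

```javascript
// Rough estimate only: ignores BSON overhead for field names, array keys and other fields.
const MAX_BSON_BYTES = 16 * 1024 * 1024;  // 16 MiB document limit

function maxArrayElements(avgElementBytes) {
    return Math.floor(MAX_BSON_BYTES / avgElementBytes);
}

function daysUntilLimit(avgElementBytes, elementsPerDay) {
    return Math.floor(maxArrayElements(avgElementBytes) / elementsPerDay);
}

console.log(maxArrayElements(200));      // 83886 events of ~200 bytes
console.log(daysUntilLimit(200, 1000));  // 83 days at 1000 events/day
```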

2.5 Anti-pattern: Deep nesting

{
    "_id": "order1",
    "items": [{
        "product": {
            "vendor": {
                "address": {
                    "country": {  // 4+ levels deep
                        ...
                    }
                }
            }
        }
    }]
}

Hard to query, hard to update specific fields.

→ Flatten or reference.
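
One way to see the cost of nesting is to list the dot-notation paths a query or `$set` would have to spell out. A small sketch in plain JavaScript; `dotPaths` is a hypothetical helper and the `code: "US"` leaf is made up for illustration.

```javascript
// Lists every leaf as a dot-notation path, the form MongoDB queries and $set use.
function dotPaths(value, prefix = "") {
    if (value === null || typeof value !== "object") {
        return [prefix];
    }
    return Object.entries(value).flatMap(([key, child]) =>
        dotPaths(child, prefix ? `${prefix}.${key}` : key)
    );
}

const order = {
    _id: "order1",
    items: [{product: {vendor: {address: {country: {code: "US"}}}}}]  // hypothetical leaf
};
console.log(dotPaths(order));
// [ '_id', 'items.0.product.vendor.address.country.code' ]
```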

3. CRUD Operations

3.1 Insert

db.users.insertOne({
    email: "alice@example.com",
    name: "Alice",
    created_at: new Date()
});
 
db.users.insertMany([
    {email: "bob@example.com", name: "Bob"},
    {email: "carol@example.com", name: "Carol"}
], {ordered: false});  // continue on error

3.2 Find

// All
db.users.find({});
 
// Equality
db.users.find({email: "alice@example.com"});
 
// Operators
db.users.find({age: {$gt: 18, $lt: 65}});
db.users.find({tags: {$in: ["premium", "vip"]}});
db.users.find({"addresses.city": "SF"});  // nested field
 
// Logical
db.users.find({$or: [{age: {$lt: 18}}, {age: {$gt: 65}}]});
 
// Projection
db.users.find({}, {name: 1, email: 1, _id: 0});
 
// Sort, limit, skip
db.users.find().sort({created_at: -1}).skip(20).limit(10);  // page 3 of 10: skip always runs before limit

3.3 Update

// Update one
db.users.updateOne(
    {_id: ObjectId("...")},
    {$set: {name: "Alice Smith"}, $inc: {login_count: 1}}
);
 
// Update many
db.users.updateMany(
    {age: {$gte: 18}},
    {$set: {status: "adult"}}
);
 
// Array operations
db.users.updateOne(
    {_id: "user1"},
    {$push: {tags: "new-tag"}}
);
db.users.updateOne(
    {_id: "user1"},
    {$pull: {tags: "old-tag"}}
);
db.users.updateOne(
    {_id: "user1"},
    {$addToSet: {tags: "unique"}}  // only if not exists
);
 
// Positional update (first array match)
db.users.updateOne(
    {"addresses.city": "SF"},
    {$set: {"addresses.$.country": "USA"}}
);
 
// All matching
db.users.updateMany(
    {},
    {$set: {"addresses.$[].country": "USA"}}  // all elements
);
 
// Filtered
db.users.updateMany(
    {},
    {$set: {"addresses.$[el].country": "USA"}},
    {arrayFilters: [{"el.city": "SF"}]}
);
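
The three array forms (`$`, `$[]`, `$[el]` with `arrayFilters`) are easy to mix up. The sketch below imitates their effect on an in-memory array in plain JavaScript. It is not MongoDB code, the helper names are hypothetical, and it only models which elements get touched.

```javascript
// In-memory imitation of the three array update forms; not MongoDB code.
const setFirst = (arr, match, patch) => {            // like "addresses.$"
    const i = arr.findIndex(match);
    return arr.map((el, idx) => (idx === i ? {...el, ...patch} : el));
};
const setAll = (arr, patch) =>                       // like "addresses.$[]"
    arr.map((el) => ({...el, ...patch}));
const setFiltered = (arr, filter, patch) =>          // like "addresses.$[el]" + arrayFilters
    arr.map((el) => (filter(el) ? {...el, ...patch} : el));

const addresses = [{city: "SF"}, {city: "NY"}, {city: "SF"}];
const isSF = (a) => a.city === "SF";
console.log(setFirst(addresses, isSF, {country: "USA"}));     // only index 0 changes
console.log(setAll(addresses, {country: "USA"}));             // all three change
console.log(setFiltered(addresses, isSF, {country: "USA"}));  // indexes 0 and 2 change
```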

3.4 Upsert

db.users.updateOne(
    {email: "alice@example.com"},
    {$set: {name: "Alice"}},
    {upsert: true}
);

3.5 Delete

db.users.deleteOne({_id: ObjectId("...")});
db.users.deleteMany({status: "inactive"});

4. Aggregation Pipeline

4.1 Concept — MongoDB’s SQL

Pipeline of stages, each transforms documents.

db.orders.aggregate([
    {$match: {status: "completed"}},                    // WHERE
    {$group: {                                           // GROUP BY
        _id: "$user_id",
        total: {$sum: "$amount"},
        count: {$sum: 1}
    }},
    {$sort: {total: -1}},                                // ORDER BY
    {$limit: 10}                                         // LIMIT
]);
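
For readers coming from application code, the same pipeline can be sketched over an in-memory array in plain JavaScript. The sample orders and the name `topSpenders` are made up; the point is only how each stage maps to a familiar operation.

```javascript
// Plain-JavaScript equivalent of $match -> $group -> $sort -> $limit on an in-memory array.
const orders = [
    {user_id: "u1", amount: 50, status: "completed"},
    {user_id: "u2", amount: 70, status: "completed"},
    {user_id: "u1", amount: 30, status: "completed"},
    {user_id: "u3", amount: 99, status: "pending"}
];

function topSpenders(docs, limit) {
    const groups = new Map();
    for (const doc of docs.filter((d) => d.status === "completed")) {   // $match
        const g = groups.get(doc.user_id) ?? {_id: doc.user_id, total: 0, count: 0};
        g.total += doc.amount;                                          // $sum: "$amount"
        g.count += 1;                                                   // $sum: 1
        groups.set(doc.user_id, g);                                     // $group
    }
    return [...groups.values()]
        .sort((a, b) => b.total - a.total)                              // $sort: {total: -1}
        .slice(0, limit);                                               // $limit
}

console.log(topSpenders(orders, 10));
// [ { _id: 'u1', total: 80, count: 2 }, { _id: 'u2', total: 70, count: 1 } ]
```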

4.2 Common stages

Stage	Purpose
`$match`	Filter docs (use early!)
`$project`	Reshape (select columns)
`$group`	GROUP BY
`$sort`	Sort
`$limit`, `$skip`	Pagination
`$unwind`	Explode array to multiple docs
`$lookup`	Join (left outer)
`$addFields`	Add computed fields
`$facet`	Multiple pipelines in one
`$bucket`	Histogram
`$out`, `$merge`	Materialize result

4.3 $lookup — join

db.orders.aggregate([
    {$lookup: {
        from: "users",
        localField: "user_id",
        foreignField: "_id",
        as: "user"
    }},
    {$unwind: "$user"},  // since lookup gives array
    {$project: {total: 1, "user.name": 1, "user.email": 1}}
]);

⚠️ $lookup slow on large collections without proper indexes. MongoDB philosophy: avoid joins, embed when possible.
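
Why the index on the foreign field matters can be shown with a toy left outer join in plain JavaScript. This is a sketch, not how the server executes `$lookup`: a `Map` stands in for the index, and the counters and data sizes are made up. Note that `_id` is always indexed, so the slow case arises when `foreignField` is some other, unindexed field.

```javascript
// Left outer join in memory, counting how many user records are inspected.
function lookupScan(orders, users) {          // no index: scan all users for every order
    let inspected = 0;
    const out = orders.map((o) => ({
        ...o,
        user: users.filter((u) => { inspected++; return u._id === o.user_id; })
    }));
    return {out, inspected};
}

function lookupIndexed(orders, users) {       // "index": one Map lookup per order
    const byId = new Map(users.map((u) => [u._id, u]));
    let inspected = 0;
    const out = orders.map((o) => {
        inspected++;
        const u = byId.get(o.user_id);
        return {...o, user: u ? [u] : []};    // $lookup always yields an array
    });
    return {out, inspected};
}

const users = Array.from({length: 1000}, (_, i) => ({_id: `u${i}`}));
const orders = Array.from({length: 100}, (_, i) => ({_id: i, user_id: `u${i % 10}`}));
console.log(lookupScan(orders, users).inspected);     // 100000
console.log(lookupIndexed(orders, users).inspected);  // 100
```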

4.4 $unwind — array explode

// Doc with tags: ["rust", "db"]
db.posts.aggregate([
    {$unwind: "$tags"},
    {$group: {_id: "$tags", count: {$sum: 1}}}
]);
// Result: [{_id: "rust", count: X}, {_id: "db", count: Y}]

4.5 $facet: multiple pipelines in one stage

db.orders.aggregate([
    {$facet: {
        "byStatus": [{$group: {_id: "$status", count: {$sum: 1}}}],
        "totalRevenue": [{$group: {_id: null, sum: {$sum: "$amount"}}}],
        "top10Users": [{$group: {_id: "$user_id", total: {$sum: "$amount"}}}, {$sort: {total: -1}}, {$limit: 10}]
    }}
]);

4.6 Performance: $match early

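// LATE by necessity: count only exists after $group (like SQL HAVING)
[{$group: ...}, {$match: {count: {$gt: 100}}}]
 
// GOOD: filter on stored fields before $group, so fewer docs are grouped and an index can be used
[{$match: {date: {$gte: ISODate("2026-01-01")}}}, {$group: ...}]

The optimizer moves some $match stages earlier, but not in every case. Put $match first yourself whenever the filter uses fields that exist before grouping.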

4.7 Aggregation expressions

{$project: {
    fullName: {$concat: ["$first", " ", "$last"]},
    age: {$subtract: [{$year: "$$NOW"}, {$year: "$birthDate"}]},  // year difference only: ignores month and day
    discount: {$cond: {if: {$gte: ["$total", 100]}, then: 10, else: 0}}
}}

5. Indexes

5.1 Types

// Single field
db.users.createIndex({email: 1});
 
// Compound
db.users.createIndex({country: 1, age: -1});
 
// Multikey (array)
db.users.createIndex({tags: 1});
 
// Text (FTS)
db.articles.createIndex({title: "text", body: "text"});
 
// Geo 2dsphere
db.places.createIndex({location: "2dsphere"});
 
// Hashed (for sharding)
db.events.createIndex({user_id: "hashed"});
 
// Unique
db.users.createIndex({email: 1}, {unique: true});
 
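// Partial (a more flexible alternative to sparse)
// partialFilterExpression does not accept $ne, so this assumes live docs store deleted: false
db.users.createIndex(
    {email: 1},
    {unique: true, partialFilterExpression: {deleted: false}}
);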
 
// TTL
db.sessions.createIndex({expires_at: 1}, {expireAfterSeconds: 0});
 
// Wildcard
db.products.createIndex({"$**": 1});  // index every field (sparingly)

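5.2 ESR rule (same idea as Postgres B-tree column ordering)

Equality, Sort, Range: in a compound index, put fields matched by equality first, then fields used for sorting, then fields filtered by range.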
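
The ordering can be sketched as a tiny helper that builds the index specification from a query shape. This is plain JavaScript; `esrIndex` and the field names `status`, `created_at` and `amount` are hypothetical.

```javascript
// Builds a compound index spec in Equality -> Sort -> Range order.
function esrIndex({equality = [], sort = {}, range = []}) {
    const spec = {};
    for (const field of equality) spec[field] = 1;
    for (const [field, dir] of Object.entries(sort)) spec[field] = dir;
    for (const field of range) if (!(field in spec)) spec[field] = 1;
    return spec;
}

// Query shape: find({status: "completed", amount: {$gt: 100}}).sort({created_at: -1})
const spec = esrIndex({equality: ["status"], sort: {created_at: -1}, range: ["amount"]});
console.log(spec);  // { status: 1, created_at: -1, amount: 1 }
```

The resulting object is what would be passed to `createIndex`, for example `db.orders.createIndex(spec)`.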

5.3 Index intersection

MongoDB can intersect indexes but often less efficient than compound. Prefer compound.

5.4 Explain

db.users.find({email: "alice@example.com"}).explain("executionStats");
// stage: IXSCAN, COLLSCAN, FETCH, SORT, etc
// nReturned, totalKeysExamined, totalDocsExamined, executionTimeMillis

totalDocsExamined >> nReturned → bad index or no index.
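
The check can be automated. The top-level `executionStats` section reports the examined-document count as `totalDocsExamined`. The sketch below is plain JavaScript run against a hand-written sample, not a live server; `scanEfficiency` and the threshold of 10 are assumptions.

```javascript
// Hypothetical helper: summarizes the executionStats section of an explain() result.
function scanEfficiency(explainResult) {
    const stats = explainResult.executionStats;
    const returned = Math.max(stats.nReturned, 1);  // avoid division by zero
    const ratio = stats.totalDocsExamined / returned;
    return {ratio, suspicious: ratio > 10};         // the threshold 10 is an arbitrary assumption
}

// Hand-written sample shaped like explain("executionStats") output
const sample = {
    executionStats: {nReturned: 5, totalDocsExamined: 50000, totalKeysExamined: 0, executionTimeMillis: 42}
};
console.log(scanEfficiency(sample));  // { ratio: 10000, suspicious: true }
```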

5.5 Index size

db.users.totalIndexSize();
db.users.stats().indexSizes;

6. Transactions

6.1 Multi-document transactions (4.0+)

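// Node.js driver, inside an async function; assumes `client` is a connected MongoClient
// and that the database is named "bank" (a placeholder).
const session = client.startSession();
const accounts = client.db("bank").collection("accounts");
try {
    session.startTransaction();
    // Every operation must pass the session to take part in the transaction
    await accounts.updateOne({_id: 1}, {$inc: {balance: -100}}, {session});
    await accounts.updateOne({_id: 2}, {$inc: {balance: 100}}, {session});
    await session.commitTransaction();
} catch (e) {
    await session.abortTransaction();
    throw e;  // surface the failure instead of swallowing it
} finally {
    await session.endSession();
}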

6.2 Limitations

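Max 60 seconds runtime by default (transactionLifetimeLimitSeconds)
Writes hold document locks until commit, so concurrent writes to the same documents fail with write conflicts
Performance cost of roughly 5-15%, depending on workload
Transactions that span shards (4.2+) are slower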

Pattern: design schema so multi-doc transactions rarely needed (embed when atomicity required).

7. Replication

7.1 Replica set

graph TB
    P[Primary]
    S1[Secondary 1]
    S2[Secondary 2]
    A[Arbiter optional]

    P -.replicates.-> S1
    P -.replicates.-> S2
    S1 -.election votes.-> P
    S2 -.election votes.-> P
    A -.election votes.-> P

    Client[Client] --> P

3+ members (odd for elections)
Primary handles all writes
Secondaries replicate via oplog (rolling log)
Auto-failover on primary loss (~10s)
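
The "odd number of members" advice follows from majority arithmetic, worked through below in plain JavaScript (`electionMath` is a hypothetical name). A fourth member raises the majority but not the number of failures the set can survive.

```javascript
// Votes needed to elect a primary, and how many members can fail while a majority remains.
function electionMath(votingMembers) {
    const majority = Math.floor(votingMembers / 2) + 1;
    return {majority, faultTolerance: votingMembers - majority};
}

for (const n of [3, 4, 5]) {
    console.log(n, electionMath(n));
}
// 3 { majority: 2, faultTolerance: 1 }
// 4 { majority: 3, faultTolerance: 1 }
// 5 { majority: 3, faultTolerance: 2 }
```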

7.2 Read preference

primary: default, strong consistency
primaryPreferred: primary if up, secondary otherwise
secondary: only secondaries (eventually consistent)
secondaryPreferred: secondary if available, primary otherwise
nearest: lowest-latency member, primary or secondary

db.users.find().readPref("secondary");

7.3 Write concern

db.users.insertOne(doc, {writeConcern: {w: "majority", j: true, wtimeout: 5000}});

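w: 1 → acknowledged by the primary only
w: "majority" → acknowledged by a majority of voting, data-bearing members
w: N → acknowledged by N members
j: true → wait until the write is in the on-disk journal
wtimeout → stop waiting after this many milliseconds (the write is not rolled back)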
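
A simplified model of the `w` setting in plain JavaScript: given how many members have acknowledged a write, is the write concern satisfied? `satisfiesWriteConcern` is a hypothetical helper, and it ignores journaling, arbiters and timeouts.

```javascript
// Simplified model: does a write with `ackCount` acknowledgements satisfy write concern `w`?
function satisfiesWriteConcern(w, ackCount, votingMembers) {
    const needed = w === "majority" ? Math.floor(votingMembers / 2) + 1 : w;
    return ackCount >= needed;
}

console.log(satisfiesWriteConcern(1, 1, 3));           // true: the primary alone is enough
console.log(satisfiesWriteConcern("majority", 1, 3));  // false: needs 2 of 3
console.log(satisfiesWriteConcern("majority", 2, 3));  // true
```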

8. Sharding

8.1 Architecture

graph TB
    Client --> Mongos[mongos router]

    Mongos --> Config[Config servers<br/>replica set]
    Mongos --> S1[Shard 1<br/>replica set]
    Mongos --> S2[Shard 2<br/>replica set]
    Mongos --> S3[Shard 3<br/>replica set]

mongos — query router, stateless
Config servers — metadata
Shards — replica sets, each owns chunk range

8.2 Shard key

sh.enableSharding("appdb");
sh.shardCollection("appdb.orders", {user_id: 1});  // range-based
sh.shardCollection("appdb.events", {_id: "hashed"});  // hashed

Choosing shard key:

High cardinality: many unique values
Low frequency: no single hot value
Not monotonic: ever-increasing values (timestamps, ObjectId) send all inserts to one shard under range sharding
Matches query patterns: most queries should include the shard key

Pattern: {tenant_id: 1, _id: 1} — composite, scales across shards but groups tenant data.
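
The monotonic-key problem can be illustrated with a toy router in plain JavaScript. Everything here is an assumption made for the sketch: three shards, fixed range boundaries, and FNV-1a standing in for MongoDB's actual hash function. Real clusters split and migrate chunks, which this model ignores.

```javascript
// Toy model: range sharding uses fixed key ranges, hashed sharding routes by a hash of the key.
const SHARDS = 3;

function fnv1a(text) {                       // stand-in hash, not MongoDB's real one
    let h = 0x811c9dc5;
    for (let i = 0; i < text.length; i++) {
        h ^= text.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;  // 32-bit multiply, kept unsigned
    }
    return h;
}

// Range boundaries fixed when keys 0..2999 existed: [0,1000) [1000,2000) [2000, infinity)
const rangeShard = (key) => Math.min(Math.floor(key / 1000), SHARDS - 1);
const hashedShard = (key) => fnv1a(String(key)) % SHARDS;

function distribute(route, keys) {
    const counts = new Array(SHARDS).fill(0);
    for (const k of keys) counts[route(k)]++;
    return counts;
}

// New inserts with a monotonically increasing key (a counter or timestamp)
const newKeys = Array.from({length: 3000}, (_, i) => 3000 + i);
console.log(distribute(rangeShard, newKeys));   // [ 0, 0, 3000 ]: every insert hits the last shard
console.log(distribute(hashedShard, newKeys));  // inserts land on all three shards
```

The trade-off is that hashing scatters neighboring keys, so range queries on the shard key have to visit every shard.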

8.3 Chunk migration

MongoDB's balancer migrates chunks across shards in the background. It is usually unobtrusive, though migrations add load and can be restricted to a balancing window.

8.4 Anti-patterns

Pattern	Why bad
Shard key never in query	Every query → all shards
Monotonically increasing key	All writes → last shard
Low cardinality	Few possible chunks
Shard key on a value that changes (pre-4.2)	Shard key values could not be updated at all before 4.2

9. Change Streams

CDC for MongoDB.

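// Node.js driver; assumes `db` is a Db handle from a connected MongoClient
const orders = db.collection("orders");
let resumeToken;
const changeStream = orders.watch();
changeStream.on("change", (change) => {
    console.log(change);
    // {operationType, fullDocument, documentKey, updateDescription}
    // fullDocument is absent on updates unless {fullDocument: "updateLookup"} is set
    resumeToken = change._id;  // persist this to resume later
});
 
// Resume after disconnect
const resumed = orders.watch([], {resumeAfter: resumeToken});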

Use cases:

Real-time updates to UI
Sync to other systems (Elasticsearch, cache)
Audit
Trigger workflows

Requires a replica set or sharded cluster (change streams read the oplog).
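
For the "sync to cache" use case, the consumer's job is to apply each event to its own copy of the data. A minimal sketch in plain JavaScript, run against hand-written events rather than a live stream: `applyChange` is hypothetical, and it only handles top-level fields in `updatedFields` (real events can carry dotted paths for nested fields).

```javascript
// Applies one change event to an in-memory cache (a Map keyed by _id).
function applyChange(cache, change) {
    const id = change.documentKey._id;
    switch (change.operationType) {
        case "insert":
        case "replace":
            cache.set(id, change.fullDocument);
            break;
        case "update": {
            const current = {...(cache.get(id) ?? {_id: id})};
            Object.assign(current, change.updateDescription.updatedFields);
            for (const field of change.updateDescription.removedFields) delete current[field];
            cache.set(id, current);
            break;
        }
        case "delete":
            cache.delete(id);
            break;
    }
    return cache;
}

// Hand-written events shaped like change stream output
const cache = new Map();
applyChange(cache, {operationType: "insert", documentKey: {_id: "order1"},
    fullDocument: {_id: "order1", total: 100, note: "x"}});
applyChange(cache, {operationType: "update", documentKey: {_id: "order1"},
    updateDescription: {updatedFields: {total: 120}, removedFields: ["note"]}});
console.log(cache.get("order1"));  // { _id: 'order1', total: 120 }
```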

10. Atlas Vector Search (2024)

MongoDB Atlas added vector search for AI/RAG workloads.

// Create vector index
db.documents.createSearchIndex({
    name: "vector_index",
    type: "vectorSearch",
    definition: {
        fields: [{
            type: "vector",
            path: "embedding",
            numDimensions: 1536,
            similarity: "cosine"
        }]
    }
});
 
// Query
db.documents.aggregate([{
    $vectorSearch: {
        index: "vector_index",
        path: "embedding",
        queryVector: [0.1, 0.2, ...],
        numCandidates: 100,
        limit: 10
    }
}]);

Atlas only. Self-hosted MongoDB needs a separate vector DB. Compared in more detail in Tuan-15-Vector-DB-AI.
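
What the index approximates is a nearest-neighbor ranking by similarity. The exact, brute-force version is short enough to sketch in plain JavaScript; the two-dimensional vectors and the names `cosine` and `topK` are made up for illustration, and real embeddings have hundreds or thousands of dimensions.

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Exact (brute-force) top-k; an ANN index trades some accuracy to avoid scanning every document.
function topK(docs, queryVector, k) {
    return docs
        .map((d) => ({_id: d._id, score: cosine(d.embedding, queryVector)}))
        .sort((x, y) => y.score - x.score)
        .slice(0, k);
}

const docs = [
    {_id: "a", embedding: [1, 0]},
    {_id: "b", embedding: [0, 1]},
    {_id: "c", embedding: [1, 1]}
];
console.log(topK(docs, [1, 0], 2).map((d) => d._id));  // [ 'a', 'c' ]
```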

11. Patterns

11.1 Bucketing time-series data

{
    "sensor_id": "s1",
    "hour": ISODate("2026-05-16T10:00:00Z"),
    "measurements": [
        {"ts": ISODate("...10:00:30Z"), "value": 22.5},
        {"ts": ISODate("...10:01:00Z"), "value": 22.6},
        ...  // ~120 per hour
    ]
}

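Compared with one document per measurement, this stores about 120x fewer documents and index entries per sensor-hour. The actual storage saving depends on the data.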
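
How an application might build such buckets before writing them can be sketched in plain JavaScript. `bucketByHour` is a hypothetical helper and the readings are sample data. In practice the same grouping is often done on write with an upsert and `$push`.

```javascript
// Groups raw measurements into one bucket document per sensor per hour.
function bucketByHour(measurements) {
    const buckets = new Map();
    for (const m of measurements) {
        const hour = new Date(m.ts);
        hour.setUTCMinutes(0, 0, 0);   // truncate to the start of the hour
        const key = `${m.sensor_id}|${hour.toISOString()}`;
        if (!buckets.has(key)) {
            buckets.set(key, {sensor_id: m.sensor_id, hour, measurements: []});
        }
        buckets.get(key).measurements.push({ts: m.ts, value: m.value});
    }
    return [...buckets.values()];
}

const raw = [
    {sensor_id: "s1", ts: new Date("2026-05-16T10:00:30Z"), value: 22.5},
    {sensor_id: "s1", ts: new Date("2026-05-16T10:01:00Z"), value: 22.6},
    {sensor_id: "s1", ts: new Date("2026-05-16T11:00:00Z"), value: 22.9}
];
console.log(bucketByHour(raw).map((b) => [b.hour.toISOString(), b.measurements.length]));
// [ [ '2026-05-16T10:00:00.000Z', 2 ], [ '2026-05-16T11:00:00.000Z', 1 ] ]
```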

MongoDB 5.0+ native time-series collections:

db.createCollection("metrics", {
    timeseries: {
        timeField: "timestamp",
        metaField: "metadata",
        granularity: "seconds"
    }
});

Auto-buckets internally. Easier API.

11.2 Polymorphism

{"_id": "1", "type": "user", "name": "Alice"}
{"_id": "2", "type": "company", "name": "Acme", "employees": 100}

Same collection, different shapes. Filter by type.

11.3 Schema validation

MongoDB does not enforce a schema by default, but a collection can carry a JSON Schema validator:

db.createCollection("users", {
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: ["email", "name"],
            properties: {
                email: {bsonType: "string", pattern: "^.+@.+$"},
                age: {bsonType: "int", minimum: 0, maximum: 150}
            }
        }
    },
    validationLevel: "strict",
    validationAction: "error"  // or "warn"
});

→ Add structure where it matters.
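
To predict which documents the validator above would reject (useful when testing a migration), the same rules can be mirrored in plain JavaScript. This is a hand-rolled sketch: `validateUser` is hypothetical, and the server-side `$jsonSchema` check remains the source of truth.

```javascript
// Hand-rolled check mirroring the validator's rules; MongoDB runs the real check server-side.
function validateUser(doc) {
    const errors = [];
    for (const field of ["email", "name"]) {
        if (!(field in doc)) errors.push(`missing required field: ${field}`);
    }
    if ("email" in doc && !(typeof doc.email === "string" && /^.+@.+$/.test(doc.email))) {
        errors.push("email must be a string matching ^.+@.+$");
    }
    if ("age" in doc && !(Number.isInteger(doc.age) && doc.age >= 0 && doc.age <= 150)) {
        errors.push("age must be an integer between 0 and 150");
    }
    return errors;
}

console.log(validateUser({email: "alice@example.com", name: "Alice", age: 30}));  // []
console.log(validateUser({email: "not-an-email", age: 200}));                     // three errors
```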

12. When NOT to use MongoDB

Complex relational data with many joins → Postgres
Strict consistency needs → Postgres or distributed SQL
Heavy analytics → ClickHouse or warehouse
Tight budget → Postgres usually cheaper at smaller scale
Geo-distributed strong consistency → Spanner / DSQL

Modern Postgres with JSONB and GIN indexes covers roughly 80% of MongoDB use cases (a rule-of-thumb estimate). Default to Postgres unless MongoDB offers a clear win.

13. Anti-patterns

Pattern	Why bad	Fix
Unbounded array	Doc size grows, slow	Reference collection
Deep nesting >3-4 levels	Hard query/update	Flatten or reference
No indexes (early stage)	Slow when data grows	Index early
$lookup on huge collections	Slow joins	Denormalize/embed
Same collection for very different docs	Hard to index/query	Separate collections
Business value used as _id	_id is immutable; changing it means delete + reinsert	Use an immutable _id (e.g. ObjectId)
Use as replacement for Redis	Memory pressure, slower	Redis for cache
Use as replacement for OLAP	Aggregation slow on large	ClickHouse
Schema-free for years	Tech debt	Add validation

14. Lab

14.1 Day 1: Basic CRUD

docker run -d -p 27017:27017 mongo:7
mongosh
use lab
db.users.insertOne(...)
# Practice CRUD

14.2 Day 2: Aggregation

Use sample data. Build 5 complex aggregations: count by group, top N, faceted, lookup, unwind.

14.3 Day 3: Indexes

Create collection 1M docs. Run queries. Add indexes. EXPLAIN. Compare.

14.4 Day 4: Replica set

3-node replica set via docker. Failover testing.

14.5 Day 5: Sharded cluster

Setup mongos + config + 2 shards. Shard a collection.

14.6 Day 6: Change streams

Watch collection. Trigger script that updates docs. Receive events.

14.7 Day 7: Schema validation

Add $jsonSchema validator. Try insert valid + invalid. Test migration.

15. Self-check

Embed vs reference: what are the 4 rules?
Document model: strengths and weaknesses?
Aggregation pipeline: the 5 most-used stages?
$match early vs late: why does it matter?
Which index types does MongoDB offer?
Shard key: 3 properties of a good one?
What do change streams require?
Time-series collections (5.0+): why are they better than doc-per-measurement?
What are MongoDB's transaction limits?
When should you not use MongoDB?

16. Next

Phase 3 complete. Next lesson: Tuan-13-Search-Engines-ES (search workloads).

Week 12 complete. Document well or use rows. Updated: 2026-05-16
