Week 12: MongoDB & Document Databases

“MongoDB is not ‘Postgres without a schema’. It is a different paradigm: document-centric rather than row-centric. Understand the difference and you use it correctly. Miss it and you build a product that eventually has to be rewritten.”

Tags: database mongodb document-db nosql aggregation
Duration: 7 days (4-6h/day)
Prerequisites: Tuan-02-Schema-Design-Normalization (to compare the two mindsets)
Related: Case-Design-Data-AI-RAG (document store)


1. Context & Why

1.1 MongoDB trong 2024-2026

timeline
    title MongoDB Evolution
    2009 : v1.0 - schemaless hype
    2015 : v3.0 - WiredTiger storage engine
    2018 : v4.0 - multi-document transactions
    2022 : v6.0 - queryable encryption
    2023 : v7.0 - Queryable Encryption GA for equality queries, approximate nearest-neighbor search via Atlas Vector Search
    2024 : v8.0 - performance improvements, time-series block processing, Queryable Encryption range queries
    2025-26 : v9 expected

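License: SSPL (since 2018), which is not an OSI-approved open-source license. Amazon DocumentDB exposes a MongoDB-compatible API, but it is a separate engine without full feature parity. Atlas is the dominant managed offering.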

1.2 Document model strengths

  • Polymorphic data — different docs in same collection can have different fields
  • Nested data — embed related data, fewer trips
  • Schema evolution easy — add fields without migration
  • JSON-native — fits modern web stack

1.3 Document model weaknesses

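  • Cross-document joins are awkward ($lookup is slow without supporting indexes)
  • Reads from secondaries are eventually consistent (reads from the primary are not)
  • No referential integrity enforced by the database (no foreign keys)
  • Hard limit of 16MB per document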

1.4 Goals for the week

  • Document model thinking — embedding vs referencing
  • Aggregation pipeline — MongoDB’s “SQL”
  • Indexes: single, compound, multikey, text, geo, hashed
  • Sharding strategy
  • Change streams (CDC)
  • Transactions limits
  • Atlas Vector Search (2024 feature for AI)
  • Anti-patterns: unbounded array, deep nesting

1.5 References

  • MongoDB Documentation: https://www.mongodb.com/docs/
  • MongoDB University — free courses
  • The Little MongoDB Book — Karl Seguin
  • Designing Data-Intensive Applications Ch. 2 (Document model)
  • Kyle Banker — MongoDB in Action (older but solid)

2. Data Model

2.1 Document basic

{
    "_id": ObjectId("65e2f1234567890abcdef000"),
    "email": "[email protected]",
    "name": "Alice",
    "addresses": [
        {"city": "SF", "country": "US"},
        {"city": "NY", "country": "US"}
    ],
    "preferences": {
        "theme": "dark",
        "notifications": true
    },
    "tags": ["premium", "early-adopter"],
    "created_at": ISODate("2026-05-16T10:00:00Z")
}

A document is JSON-like. Internally it is stored as BSON, a binary encoding of JSON with extra types such as ObjectId, Date, and Decimal128.

2.2 Schema design — embedding vs referencing

Embed when:

  • 1-to-1 or 1-to-few
  • Always accessed together
  • Sub-doc doesn’t change independently
  • No need to query sub-doc independently
// User with embedded addresses (1-to-few)
{
    "_id": "user1",
    "name": "Alice",
    "addresses": [{"city": "SF"}, {"city": "NY"}]
}

Reference when:

  • 1-to-many (unbounded)
  • Many-to-many
  • Independent lifecycle
  • Need to query separately
// User collection
{"_id": "user1", "name": "Alice"}
 
// Orders collection
{"_id": "order1", "user_id": "user1", "total": 100}
{"_id": "order2", "user_id": "user1", "total": 200}

2.3 Rule of thumb

| Cardinality | Pattern |
|---|---|
| 1-to-1 | Embed |
| 1-to-few (<100) | Embed array |
| 1-to-many (100-1000) | Reference + maybe denormalize summary |
| 1-to-many (>1000) | Reference always |
| Many-to-many | Reference both sides |
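
The rule of thumb above can be read as a small decision function. This is a minimal sketch: the name `choosePattern` and its parameters are hypothetical, and the thresholds are copied from the table as guidelines, not hard limits.

```javascript
// Hypothetical helper: maps an estimated relationship size to the
// pattern suggested by the rule-of-thumb table.
function choosePattern({ manyToMany = false, maxChildren = 1 } = {}) {
    if (manyToMany) return "reference both sides";
    if (maxChildren <= 1) return "embed";
    if (maxChildren < 100) return "embed array";
    if (maxChildren <= 1000) return "reference + maybe denormalize summary";
    return "reference always";
}

console.log(choosePattern({ maxChildren: 3 }));      // e.g. addresses per user
console.log(choosePattern({ maxChildren: 50000 }));  // e.g. orders per user
console.log(choosePattern({ manyToMany: true }));    // e.g. students and courses
```

The useful input is the worst-case number of children, not the average: one outlier user with millions of orders is what breaks an embedded array.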

2.4 Anti-pattern: Unbounded array

// BAD: array grows forever
{
    "_id": "user1",
    "events": [event1, event2, ..., event_millions]  // ❌
}

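The document eventually hits the 16MB limit. Before that, every update gets slower as the document grows, and each read loads the whole array into memory even when only a few elements are needed.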

→ Move to separate collection with references.
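
To get a feel for how quickly an embedded array can approach the limit, here is a rough estimate. It is only a sketch: it uses the UTF-8 length of the JSON form as a stand-in for BSON size (the two differ), and `sampleEvent` is made-up data.

```javascript
const MAX_BSON_BYTES = 16 * 1024 * 1024; // MongoDB's per-document limit

// Rough proxy: UTF-8 length of the JSON form. Real BSON size differs,
// so treat the result as an order-of-magnitude estimate only.
function approxBytes(value) {
    return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// How many copies of `sampleEvent` fit before the document nears the limit?
function eventsUntilLimit(sampleEvent) {
    return Math.floor(MAX_BSON_BYTES / approxBytes(sampleEvent));
}

const sampleEvent = { type: "click", page: "/home", ts: "2026-05-16T10:00:00Z" };
console.log(approxBytes(sampleEvent), "bytes per event (approx.)");
console.log(eventsUntilLimit(sampleEvent), "events before ~16MB");
```

Update cost and memory use become a problem long before the hard limit, so the estimate is an upper bound rather than a target.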

2.5 Anti-pattern: Deep nesting

{
    "_id": "order1",
    "items": [{
        "product": {
            "vendor": {
                "address": {
                    "country": {  // 4+ levels deep
                        ...
                    }
                }
            }
        }
    }]
}

Hard to query, hard to update specific fields.

→ Flatten or reference.


3. CRUD Operations

3.1 Insert

db.users.insertOne({
    email: "[email protected]",
    name: "Alice",
    created_at: new Date()
});
 
db.users.insertMany([
    {email: "[email protected]", name: "Bob"},
    {email: "[email protected]", name: "Carol"}
], {ordered: false});  // continue on error

3.2 Find

// All
db.users.find({});
 
// Equality
db.users.find({email: "[email protected]"});
 
// Operators
db.users.find({age: {$gt: 18, $lt: 65}});
db.users.find({tags: {$in: ["premium", "vip"]}});
db.users.find({"addresses.city": "SF"});  // nested field
 
// Logical
db.users.find({$or: [{age: {$lt: 18}}, {age: {$gt: 65}}]});
 
// Projection
db.users.find({}, {name: 1, email: 1, _id: 0});
 
// Sort, skip, limit (skip is applied before limit regardless of call order)
db.users.find().sort({created_at: -1}).skip(20).limit(10);

3.3 Update

// Update one
db.users.updateOne(
    {_id: ObjectId("...")},
    {$set: {name: "Alice Smith"}, $inc: {login_count: 1}}
);
 
// Update many
db.users.updateMany(
    {age: {$gte: 18}},
    {$set: {status: "adult"}}
);
 
// Array operations
db.users.updateOne(
    {_id: "user1"},
    {$push: {tags: "new-tag"}}
);
db.users.updateOne(
    {_id: "user1"},
    {$pull: {tags: "old-tag"}}
);
db.users.updateOne(
    {_id: "user1"},
    {$addToSet: {tags: "unique"}}  // only if not exists
);
 
// Positional update (first array match)
db.users.updateOne(
    {"addresses.city": "SF"},
    {$set: {"addresses.$.country": "USA"}}
);
 
// All matching
db.users.updateMany(
    {},
    {$set: {"addresses.$[].country": "USA"}}  // all elements
);
 
// Filtered
db.users.updateMany(
    {},
    {$set: {"addresses.$[el].country": "USA"}},
    {arrayFilters: [{"el.city": "SF"}]}
);
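
The three array forms (`$`, `$[]`, `$[el]`) are easy to mix up. The following in-memory imitation shows which elements each one touches. The helper names are hypothetical and the code only mimics the semantics on plain objects; it is not a driver or server API.

```javascript
// In-memory imitation of the three array update forms on one document.
const makeUser = () => ({
    _id: "user1",
    addresses: [
        { city: "SF", country: "US" },
        { city: "NY", country: "US" },
        { city: "SF", country: "US" },
    ],
});

// "addresses.$": only the FIRST element matching the query filter.
function setFirstMatch(doc, matches, patch) {
    const i = doc.addresses.findIndex(matches);
    if (i !== -1) Object.assign(doc.addresses[i], patch);
    return doc;
}

// "addresses.$[]": every element.
function setAll(doc, patch) {
    doc.addresses.forEach((a) => Object.assign(a, patch));
    return doc;
}

// "addresses.$[el]" + arrayFilters: every element matching the filter.
function setFiltered(doc, matches, patch) {
    doc.addresses.filter(matches).forEach((a) => Object.assign(a, patch));
    return doc;
}

const isSF = (a) => a.city === "SF";
const countries = (doc) => doc.addresses.map((a) => a.country);

console.log(countries(setFirstMatch(makeUser(), isSF, { country: "USA" })));
console.log(countries(setAll(makeUser(), { country: "USA" })));
console.log(countries(setFiltered(makeUser(), isSF, { country: "USA" })));
```

With two SF addresses in the array, the positional `$` form changes only the first one, which is the usual source of surprises.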

3.4 Upsert

db.users.updateOne(
    {email: "[email protected]"},
    {$set: {name: "Alice"}},
    {upsert: true}
);

3.5 Delete

db.users.deleteOne({_id: ObjectId("...")});
db.users.deleteMany({status: "inactive"});

4. Aggregation Pipeline

4.1 Concept — MongoDB’s SQL

Pipeline of stages, each transforms documents.

db.orders.aggregate([
    {$match: {status: "completed"}},                    // WHERE
    {$group: {                                           // GROUP BY
        _id: "$user_id",
        total: {$sum: "$amount"},
        count: {$sum: 1}
    }},
    {$sort: {total: -1}},                                // ORDER BY
    {$limit: 10}                                         // LIMIT
]);
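
To make the stage-by-stage model concrete, the same pipeline can be imitated on a plain array. This is a sketch for intuition only: the sample orders are made up, and `topSpenders` is a hypothetical name, not a MongoDB API.

```javascript
// Sample data; field names follow the pipeline above, values are made up.
const orders = [
    { user_id: "u1", amount: 100, status: "completed" },
    { user_id: "u2", amount: 50, status: "completed" },
    { user_id: "u1", amount: 200, status: "completed" },
    { user_id: "u3", amount: 999, status: "cancelled" },
];

function topSpenders(docs, limit = 10) {
    // $match: keep completed orders only
    const matched = docs.filter((d) => d.status === "completed");

    // $group: one accumulator per user_id, with a $sum of amount and a count
    const groups = new Map();
    for (const d of matched) {
        const g = groups.get(d.user_id) ?? { _id: d.user_id, total: 0, count: 0 };
        g.total += d.amount;
        g.count += 1;
        groups.set(d.user_id, g);
    }

    // $sort by total descending, then $limit
    return [...groups.values()].sort((a, b) => b.total - a.total).slice(0, limit);
}

console.log(topSpenders(orders));
```

Each stage consumes the output of the previous one, which is why dropping documents in the first stage reduces the work of every later stage.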

4.2 Common stages

| Stage | Purpose |
|---|---|
| $match | Filter docs (use early!) |
| $project | Reshape (select columns) |
| $group | GROUP BY |
| $sort | Sort |
| $limit, $skip | Pagination |
| $unwind | Explode array to multiple docs |
| $lookup | Join (left outer) |
| $addFields | Add computed fields |
| $facet | Multiple pipelines in one |
| $bucket | Histogram |
| $out, $merge | Materialize result |

4.3 $lookup — join

db.orders.aggregate([
    {$lookup: {
        from: "users",
        localField: "user_id",
        foreignField: "_id",
        as: "user"
    }},
    {$unwind: "$user"},  // since lookup gives array
    {$project: {total: 1, "user.name": 1, "user.email": 1}}
]);

⚠️ $lookup is slow on large collections unless the foreignField in the "from" collection is indexed. MongoDB philosophy: avoid joins, embed when possible.
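
The join semantics are worth seeing on plain data. In this sketch a `Map` plays the role of an index on the foreign field, and the follow-up `$unwind` step is imitated as well. The sample documents, including the order with no matching user, are made up.

```javascript
const users = [
    { _id: "user1", name: "Alice" },
    { _id: "user2", name: "Bob" },
];
const orders = [
    { _id: "order1", user_id: "user1", total: 100 },
    { _id: "order2", user_id: "user1", total: 200 },
    { _id: "order3", user_id: "ghost", total: 5 }, // no matching user
];

// Stand-in for an index on the foreign field: one lookup per order
// instead of scanning every user for every order.
const usersById = new Map(users.map((u) => [u._id, u]));

// $lookup semantics: each order gets an array of matches (possibly empty).
const joined = orders.map((o) => {
    const match = usersById.get(o.user_id);
    return { ...o, user: match ? [match] : [] };
});

// $unwind without preserveNullAndEmptyArrays drops orders with no match.
const unwound = joined
    .filter((o) => o.user.length > 0)
    .map((o) => ({ ...o, user: o.user[0] }));

console.log(joined.length, unwound.length);
```

Note the side effect: $lookup alone is a left outer join, but following it with a plain $unwind silently turns it into an inner join.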

4.4 $unwind — array explode

// Doc with tags: ["rust", "db"]
db.posts.aggregate([
    {$unwind: "$tags"},
    {$group: {_id: "$tags", count: {$sum: 1}}}
]);
// Result: [{_id: "rust", count: X}, {_id: "db", count: Y}]
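
The same tag count can be imitated in plain JavaScript to show what $unwind produces before $group runs. The sample posts are made up for illustration.

```javascript
const posts = [
    { _id: 1, tags: ["rust", "db"] },
    { _id: 2, tags: ["db"] },
    { _id: 3, tags: [] }, // $unwind drops docs with empty arrays by default
];

// $unwind: one output doc per array element
const unwound = posts.flatMap((p) => p.tags.map((tag) => ({ ...p, tags: tag })));

// $group by tag with a running count
const counts = {};
for (const doc of unwound) {
    counts[doc.tags] = (counts[doc.tags] ?? 0) + 1;
}

console.log(unwound.length);
console.log(counts);
```

Because $unwind multiplies documents, it is usually worth filtering with $match before it rather than after.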

4.5 $facet — multi-aggregate

db.orders.aggregate([
    {$facet: {
        "byStatus": [{$group: {_id: "$status", count: {$sum: 1}}}],
        "totalRevenue": [{$group: {_id: null, sum: {$sum: "$amount"}}}],
        "top10Users": [{$group: {_id: "$user_id", total: {$sum: "$amount"}}}, {$sort: {total: -1}}, {$limit: 10}]
    }}
]);

4.6 Performance: $match early

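// Necessary: a filter on a computed value (like SQL HAVING) can only run after $group
[{$group: ...}, {$match: {count: {$gt: 100}}}]
 
// GOOD: filter on stored fields before $group, so fewer docs are grouped and an index can be used
[{$match: {date: {$gte: ISODate("2026-01-01")}}}, {$group: ...}]

The optimizer can move some $match stages earlier, but not in every case. Put filters on stored fields first explicitly.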

4.7 Aggregation expressions

{$project: {
    fullName: {$concat: ["$first", " ", "$last"]},
    age: {$subtract: [{$year: "$$NOW"}, {$year: "$birthDate"}]},  // year difference only, ignores month and day
    discount: {$cond: {if: {$gte: ["$total", 100]}, then: 10, else: 0}}
}}

5. Indexes

5.1 Types

// Single field
db.users.createIndex({email: 1});
 
// Compound
db.users.createIndex({country: 1, age: -1});
 
// Multikey (array)
db.users.createIndex({tags: 1});
 
// Text (FTS)
db.articles.createIndex({title: "text", body: "text"});
 
// Geo 2dsphere
db.places.createIndex({location: "2dsphere"});
 
// Hashed (for sharding)
db.events.createIndex({user_id: "hashed"});
 
// Unique
db.users.createIndex({email: 1}, {unique: true});
 
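// Partial (more flexible than sparse): index only docs matching the filter
// Note: partialFilterExpression does not support $ne, so store deleted: false explicitly
db.users.createIndex(
    {email: 1},
    {unique: true, partialFilterExpression: {deleted: false}}
);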
 
// TTL
db.sessions.createIndex({expires_at: 1}, {expireAfterSeconds: 0});
 
// Wildcard
db.products.createIndex({"$**": 1});  // index every field (sparingly)

5.2 ESR rule (same as Postgres)

Equality, Sort, Range: order the fields of a compound index so that equality-matched fields come first, then sort fields, then range-filtered fields.
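
A small helper makes the ordering mechanical. This is a sketch under assumptions: `buildEsrIndex` is a hypothetical name, and the query shape in the comment is an invented example.

```javascript
// Hypothetical helper: builds a compound index spec in ESR order.
// `equality` and `range` are field-name arrays; `sort` maps field -> 1 | -1.
function buildEsrIndex({ equality = [], sort = {}, range = [] }) {
    const spec = {};
    for (const f of equality) spec[f] = 1;                       // E: exact-match fields first
    for (const [f, dir] of Object.entries(sort)) spec[f] = dir;  // S: sort fields next
    for (const f of range) if (!(f in spec)) spec[f] = 1;        // R: range fields last
    return spec;
}

// Query shape: find({country: "US", age: {$gt: 18}}).sort({created_at: -1})
const spec = buildEsrIndex({
    equality: ["country"],
    sort: { created_at: -1 },
    range: ["age"],
});
console.log(spec); // pass to db.users.createIndex(spec)
```

Placing the range field last lets the index satisfy the sort without an in-memory SORT stage, which is the main point of the rule.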

5.3 Index intersection

MongoDB can intersect indexes but often less efficient than compound. Prefer compound.

5.4 Explain

db.users.find({email: "[email protected]"}).explain("executionStats");
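// Stages to look for: IXSCAN (index scan), COLLSCAN (full collection scan), FETCH, SORT, ...
// Key fields: nReturned, totalKeysExamined, totalDocsExamined, executionTimeMillis

totalDocsExamined >> nReturned → missing or poorly matched index.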
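
The check can be automated. The `executionStats` section of explain output reports the counters as `nReturned`, `totalKeysExamined`, and `totalDocsExamined`. The sketch below reads those fields from plain objects; the helper name `assessExplain`, the threshold of 10, and the sample numbers are assumptions for illustration.

```javascript
// Flags queries that read far more documents than they return.
// The threshold of 10 is an arbitrary assumption for illustration.
function assessExplain(executionStats, maxRatio = 10) {
    const { nReturned, totalDocsExamined, totalKeysExamined } = executionStats;
    const ratio = totalDocsExamined / Math.max(nReturned, 1);
    return {
        ratio,
        likelyCollectionScan: totalKeysExamined === 0 && totalDocsExamined > 0,
        needsBetterIndex: ratio > maxRatio,
    };
}

// Shapes mirror the executionStats section of explain output; numbers are made up.
const unindexed = { nReturned: 1, totalKeysExamined: 0, totalDocsExamined: 1000000 };
const indexed = { nReturned: 1, totalKeysExamined: 1, totalDocsExamined: 1 };

console.log(assessExplain(unindexed));
console.log(assessExplain(indexed));
```

A ratio close to 1 means the index is selective for the query; a large ratio means most examined documents were thrown away.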

5.5 Index size

db.users.totalIndexSize();
db.users.stats().indexSizes;

6. Transactions

6.1 Multi-document transactions (4.0+)

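// Node.js driver; assumes `client` is a connected MongoClient and `db = client.db(...)`
const accounts = db.collection("accounts");
const session = client.startSession();
try {
    session.startTransaction();
    // Pass the session to every operation so it joins the transaction
    await accounts.updateOne({_id: 1}, {$inc: {balance: -100}}, {session});
    await accounts.updateOne({_id: 2}, {$inc: {balance: 100}}, {session});
    await session.commitTransaction();
} catch (e) {
    await session.abortTransaction();
    throw e;  // surface the failure instead of swallowing it
} finally {
    await session.endSession();
}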

6.2 Limitations

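  • Max 60 seconds runtime by default (transactionLifetimeLimitSeconds)
  • Holds locks while open; conflicting concurrent writes fail with a write conflict and must be retried
  • Performance overhead (the ~5-15% figure is a rough estimate; measure on your own workload)
  • Transactions that span shards are slower because they need a cross-shard commit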

Pattern: design schema so multi-doc transactions rarely needed (embed when atomicity required).


7. Replication

7.1 Replica set

graph TB
    P[Primary]
    S1[Secondary 1]
    S2[Secondary 2]
    A[Arbiter optional]

    P -.replicates.-> S1
    P -.replicates.-> S2
    S1 -.election votes.-> P
    S2 -.election votes.-> P
    A -.election votes.-> P

    Client[Client] --> P
  • 3+ members (odd for elections)
  • Primary handles all writes
  • Secondaries replicate via oplog (rolling log)
  • Auto-failover on primary loss (~10s)
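
The "odd number of members" advice follows from the majority arithmetic used in elections. A minimal sketch (the function name `electionMath` is hypothetical):

```javascript
// Votes needed to elect a primary, and how many members can be lost
// while an election can still succeed.
function electionMath(votingMembers) {
    const majority = Math.floor(votingMembers / 2) + 1;
    return { votingMembers, majority, faultTolerance: votingMembers - majority };
}

for (const n of [3, 4, 5]) {
    console.log(electionMath(n));
}
```

Going from 3 to 4 members raises the required majority without raising the number of failures the set can survive, so the fourth member adds cost but no availability.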

7.2 Read preference

  • primary — default, strong consistency
  • primaryPreferred — primary if up, secondary otherwise
  • secondary — only secondary (eventually consistent)
  • secondaryPreferred
  • nearest
db.users.find().readPref("secondary");

7.3 Write concern

db.users.insertOne(doc, {writeConcern: {w: "majority", j: true, wtimeout: 5000}});
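  • w: 1: acknowledged by the primary only
  • w: "majority": acknowledged by a majority of replica set members
  • w: N: acknowledged by N members (the primary counts as one)
  • j: true: wait until the write is committed to the on-disk journal
  • wtimeout: milliseconds to wait for the write concern before returning an error (the write itself is not rolled back)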

8. Sharding

8.1 Architecture

graph TB
    Client --> Mongos[mongos router]

    Mongos --> Config[Config servers<br/>replica set]
    Mongos --> S1[Shard 1<br/>replica set]
    Mongos --> S2[Shard 2<br/>replica set]
    Mongos --> S3[Shard 3<br/>replica set]
  • mongos — query router, stateless
  • Config servers — metadata
  • Shards — replica sets, each owns chunk range

8.2 Shard key

sh.enableSharding("appdb");
sh.shardCollection("appdb.orders", {user_id: 1});  // range-based
sh.shardCollection("appdb.events", {_id: "hashed"});  // hashed

Choosing shard key:

  • High cardinality — many unique values
  • Low frequency — no hot value
  • Monotonic NOT good — increasing values target one shard
  • Match query pattern — most queries should include shard key

Pattern: {tenant_id: 1, _id: 1} — composite, scales across shards but groups tenant data.
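
The effect of a monotonic key is easy to simulate. In this sketch the chunk boundaries are fixed and invented, and FNV-1a is used as a stand-in hash (MongoDB's hashed indexes use their own hash function), so the numbers only illustrate the shape of the problem.

```javascript
const SHARDS = 3;

// Range sharding with fixed chunk boundaries (assumed for the sketch):
// shard 0 owns keys < 1000, shard 1 owns keys < 2000, shard 2 owns the rest.
function rangeShard(key) {
    if (key < 1000) return 0;
    if (key < 2000) return 1;
    return 2;
}

// FNV-1a as a stand-in hash; MongoDB's hashed index uses its own function.
function fnv1a(str) {
    let h = 0x811c9dc5;
    for (let i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h;
}
const hashedShard = (key) => fnv1a(String(key)) % SHARDS;

function distribute(keys, pick) {
    const counts = new Array(SHARDS).fill(0);
    for (const k of keys) counts[pick(k)] += 1;
    return counts;
}

// New inserts with a monotonically increasing key (e.g. a counter).
const newKeys = Array.from({ length: 300 }, (_, i) => 5000 + i);

console.log("range :", distribute(newKeys, rangeShard));  // all writes land on one shard
console.log("hashed:", distribute(newKeys, hashedShard)); // writes spread across shards
```

The trade-off is that hashing destroys key order, so range queries on a hashed shard key have to visit every shard.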

8.3 Chunk migration

MongoDB auto-balances chunks across shards. Background, generally invisible.

8.4 Anti-patterns

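| Pattern | Why bad |
|---|---|
| Shard key never in query | Every query → all shards |
| Monotonically increasing key | All writes → last shard |
| Low cardinality | Few possible chunks |
| Shard key whose value changes | Immutable before 4.2; in later versions an update can move the document between shards, which is costly |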

9. Change Streams

CDC for MongoDB.

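// Node.js driver; assumes `db = client.db(...)`
let resumeToken = null;
const changeStream = db.collection("orders").watch();
changeStream.on("change", (change) => {
    console.log(change);
    // {operationType, fullDocument, documentKey, updateDescription}
    resumeToken = change._id;  // remember the last event processed
});
 
// Resume after disconnect, starting just after the saved token
db.collection("orders").watch([], {resumeAfter: resumeToken});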

Use cases:

  • Real-time updates to UI
  • Sync to other systems (Elasticsearch, cache)
  • Audit
  • Trigger workflows

Requires a replica set or sharded cluster, because change streams are built on the oplog. A standalone server cannot provide them.
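
The resume logic can be rehearsed without a cluster. The sketch below simulates a stream over an in-memory event log. `watchSim` is a hypothetical stand-in for `watch([], {resumeAfter})`, and the string tokens replace the opaque resume tokens a real stream returns in each event's `_id`.

```javascript
// Simulated change events; in a real stream each event's _id is an opaque resume token.
const events = [
    { _id: "t1", operationType: "insert", documentKey: { _id: "order1" } },
    { _id: "t2", operationType: "update", documentKey: { _id: "order1" } },
    { _id: "t3", operationType: "insert", documentKey: { _id: "order2" } },
];

// Imitates watch([], {resumeAfter}): yields only events after the given token.
function* watchSim(log, resumeAfter = null) {
    const start = resumeAfter === null ? 0 : log.findIndex((e) => e._id === resumeAfter) + 1;
    for (let i = start; i < log.length; i++) yield log[i];
}

let savedToken = null; // would live in durable storage in practice
const handled = [];

// First run: handle two events, then "disconnect".
for (const change of watchSim(events)) {
    handled.push(change._id);
    savedToken = change._id; // persist only after the event is fully handled
    if (handled.length === 2) break;
}

// Second run: resume from the saved token, with no duplicates and no gaps.
for (const change of watchSim(events, savedToken)) {
    handled.push(change._id);
    savedToken = change._id;
}

console.log(handled);
```

Saving the token after handling the event gives at-least-once delivery on a crash, so downstream consumers should be idempotent.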


10. Atlas Vector Search (2024)

MongoDB Atlas added vector search for AI/RAG workloads.

// Create vector index
db.documents.createSearchIndex({
    name: "vector_index",
    type: "vectorSearch",
    definition: {
        fields: [{
            type: "vector",
            path: "embedding",
            numDimensions: 1536,
            similarity: "cosine"
        }]
    }
});
 
// Query
db.documents.aggregate([{
    $vectorSearch: {
        index: "vector_index",
        path: "embedding",
        queryVector: [0.1, 0.2, ...],
        numCandidates: 100,
        limit: 10
    }
}]);

Atlas only. A self-hosted MongoDB deployment needs a separate vector database. A fuller comparison follows in Tuan-15-Vector-DB-AI.


11. Patterns

11.1 Bucketing time-series data

{
    "sensor_id": "s1",
    "hour": ISODate("2026-05-16T10:00:00Z"),
    "measurements": [
        {"ts": ISODate("...10:00:30Z"), "value": 22.5},
        {"ts": ISODate("...10:01:00Z"), "value": 22.6},
        ...  // ~120 per hour
    ]
}

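Compared with one document per measurement, bucketing cuts the number of documents and index entries by roughly the bucket size (about 120x here). The storage saving is smaller than that and depends on the data, so measure it rather than assuming a fixed factor.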
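
The bucketing step itself is a simple group-by on sensor and hour. A minimal sketch, assuming readings arrive as plain objects with an ISO timestamp string; `bucketByHour` and the sample values are made up for illustration.

```javascript
// Groups raw readings into one bucket document per sensor per hour,
// matching the bucket shape shown above. Sample values are made up.
function bucketByHour(readings) {
    const buckets = new Map();
    for (const { sensor_id, ts, value } of readings) {
        const hour = new Date(ts);
        hour.setUTCMinutes(0, 0, 0); // truncate to the start of the hour
        const key = `${sensor_id}|${hour.toISOString()}`;
        if (!buckets.has(key)) {
            buckets.set(key, { sensor_id, hour, measurements: [] });
        }
        buckets.get(key).measurements.push({ ts: new Date(ts), value });
    }
    return [...buckets.values()];
}

const readings = [
    { sensor_id: "s1", ts: "2026-05-16T10:00:30Z", value: 22.5 },
    { sensor_id: "s1", ts: "2026-05-16T10:01:00Z", value: 22.6 },
    { sensor_id: "s1", ts: "2026-05-16T11:00:00Z", value: 22.9 },
];

const docs = bucketByHour(readings);
console.log(docs.length, "bucket documents for", readings.length, "readings");
```

In a live system the same effect is usually achieved with an upsert that pushes each reading into the bucket for its hour, which keeps the array bounded by the bucket window.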

MongoDB 5.0+ native time-series collections:

db.createCollection("metrics", {
    timeseries: {
        timeField: "timestamp",
        metaField: "metadata",
        granularity: "seconds"
    }
});

Auto-buckets internally. Easier API.

11.2 Polymorphism

{"_id": "1", "type": "user", "name": "Alice"}
{"_id": "2", "type": "company", "name": "Acme", "employees": 100}

Same collection, different shapes. Filter by type.

11.3 Schema validation

MongoDB does not enforce a schema by default, but you can add JSON Schema validation per collection:

db.createCollection("users", {
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: ["email", "name"],
            properties: {
                email: {bsonType: "string", pattern: "^.+@.+$"},
                age: {bsonType: "int", minimum: 0, maximum: 150}
            }
        }
    },
    validationLevel: "strict",
    validationAction: "error"  // or "warn"
});

→ Add structure where it matters.


12. When NOT to use MongoDB

  • Complex relational data with many joins → Postgres
  • Strict consistency needs → Postgres or distributed SQL
  • Heavy analytics → ClickHouse or warehouse
  • Tight budget → Postgres usually cheaper at smaller scale
  • Geo-distributed strong consistency → Spanner / DSQL

Modern Postgres has JSONB + GIN index that handles 80% of MongoDB use cases. Default to Postgres unless clear win for MongoDB.


13. Anti-patterns

| Pattern | Why bad | Fix |
|---|---|---|
| Unbounded array | Doc size grows, slow | Reference collection |
| Deep nesting >3-4 levels | Hard query/update | Flatten or reference |
| No indexes (early stage) | Slow when data grows | Index early |
| $lookup on huge collections | Slow joins | Denormalize/embed |
| Same collection for very different docs | Hard to index/query | Separate collections |
| Treating _id as mutable | _id cannot be changed after insert; changing it means delete and reinsert | Choose an immutable _id |
| Use as replacement for Redis | Memory pressure, slower | Redis for cache |
| Use as replacement for OLAP | Aggregation slow on large data | ClickHouse |
| Schema-free for years | Tech debt | Add validation |

14. Lab

14.1 Day 1: Basic CRUD

docker run -d -p 27017:27017 mongo:7
mongosh
use lab
db.users.insertOne(...)
# Practice CRUD

14.2 Day 2: Aggregation

Use sample data. Build 5 complex aggregations: count by group, top N, faceted, lookup, unwind.

14.3 Day 3: Indexes

Create a collection with 1M docs. Run queries, add indexes, run explain() before and after, and compare.

14.4 Day 4: Replica set

3-node replica set via docker. Failover testing.

14.5 Day 5: Sharded cluster

Setup mongos + config + 2 shards. Shard a collection.

14.6 Day 6: Change streams

Watch collection. Trigger script that updates docs. Receive events.

14.7 Day 7: Schema validation

Add $jsonSchema validator. Try insert valid + invalid. Test migration.


15. Self-check

  1. Embed vs reference: what are the 4 rules?
  2. Document model: strengths and weaknesses?
  3. Aggregation pipeline: the 5 most-used stages?
  4. $match early vs late: why does it matter?
  5. Which index types does MongoDB offer?
  6. Shard key: 3 properties of a good one?
  7. What do change streams require?
  8. Time-series collections (5.0+): why are they better than doc-per-measurement?
  9. What are the limits of MongoDB transactions?
  10. When should you not use MongoDB?

16. Next

Phase 3 complete. Next lesson: Tuan-13-Search-Engines-ES (search workloads).


Week 12 complete. Document well or use rows. Updated: 2026-05-16