Tuần 07: Database Sharding & Replication

“Một database duy nhất giống một thư viện chỉ có một chi nhánh — khi 10 triệu người cùng muốn mượn sách, hàng dài xếp ra ngoài đường. Sharding là chia kho sách theo chủ đề, Replication là đặt bản sao tại mỗi thành phố.”

Tags: system-design database sharding replication alex-xu Student: Hieu (Backend Dev → Architect) Prerequisite: Tuan-01-Scale-From-Zero-To-Millions · Tuan-02-Back-of-the-envelope Liên quan: Tuan-06-Cache-Strategy · Tuan-08-Message-Queue · Tuan-10-Consistent-Hashing · Tuan-15-Data-Security-Encryption

1. Context & Why

Analogy đời thường — Thư viện Quốc gia

Hieu, tưởng tượng em là giám đốc Thư viện Quốc gia Việt Nam. Ban đầu chỉ có một toà nhà ở Hà Nội với 1 triệu cuốn sách. Mọi thứ hoạt động tốt.

Nhưng khi dân số tăng, nhu cầu tăng:

50 triệu người muốn mượn sách → hàng đợi dài vô tận (QPS quá tải)
100 triệu cuốn sách → không đủ kệ chứa (storage quá tải)
Toà nhà bị cháy → mất toàn bộ sách (single point of failure)

Em cần hai chiến lược:

Replication (Sao chép — Nhân bản):

Tạo bản sao toàn bộ thư viện tại Đà Nẵng, TP.HCM, Cần Thơ
Người dân đọc sách tại chi nhánh gần nhất → giảm tải cho Hà Nội
Nếu chi nhánh Hà Nội cháy → chi nhánh khác vẫn phục vụ → high availability
Nhưng khi thêm sách mới, phải đồng bộ sang tất cả chi nhánh → replication lag

Sharding (Phân mảnh — Chia kho):

Chia kho sách theo chủ đề: Khoa học ở Hà Nội, Văn học ở Đà Nẵng, Lịch sử ở TP.HCM
Mỗi kho chỉ chứa một phần dữ liệu → không kho nào bị quá tải
Nhưng nếu ai muốn tìm “sách khoa học VỀ lịch sử” → phải hỏi hai kho → cross-shard query
Nếu chủ đề “Công nghệ” hot hơn hẳn → kho đó vẫn bị quá tải → hotspot problem

Tại sao Alex Xu đặt nó ở giai đoạn này?

Vì sau khi hiểu Load Balancer (phân tải ở tầng application) và Cache (giảm tải read), câu hỏi tiếp theo luôn là: “Database sẽ chết trước khi application chết.”

Theo thống kê từ các hệ thống production lớn:

80% bottleneck của hệ thống nằm ở database layer
Database là stateful — không thể scale đơn giản bằng cách thêm instance như stateless app server
Mất database = mất business (mất data khách hàng, mất giao dịch)

Key insight: Scaling database là bài toán khó nhất trong system design vì data có gravity — nó nặng, khó di chuyển, và phải luôn consistent.

2. Deep Dive — Các khái niệm cốt lõi

2.1 Replication — Nhân bản dữ liệu

2.1.1 Master-Slave Replication (Primary-Replica)

Mô hình phổ biến nhất. Một node Master (Primary) nhận tất cả write, rồi đồng bộ sang các node Slave (Replica).

Đặc điểm	Chi tiết
Write	Chỉ vào Master
Read	Từ Master hoặc Slave
Ưu điểm	Đơn giản, read scale tốt, dễ implement
Nhược điểm	Master là SPOF, replication lag
Use case	Read-heavy system (90%+ read), blog, e-commerce catalog

Cách hoạt động (PostgreSQL Streaming Replication):

Client gửi INSERT/UPDATE/DELETE → Master
Master ghi vào WAL (Write-Ahead Log) — đây là transaction log
WAL sender process gửi WAL records qua TCP tới Replica
Replica’s WAL receiver nhận và replay WAL records
Data trên Replica giờ giống Master (với một khoảng lag nhỏ)

2.1.2 Master-Master Replication (Multi-Primary)

Cả hai node đều nhận write. Phức tạp hơn nhiều.

Đặc điểm	Chi tiết
Write	Vào bất kỳ node nào
Ưu điểm	Không có write SPOF, write scale (lý thuyết)
Nhược điểm	Conflict resolution cực kỳ phức tạp, data inconsistency risk
Use case	Multi-region deployment (mỗi region một master), CRDTs

Conflict scenarios:

User A update name='Hieu' trên Master 1
User B update name='Hiếu' trên Master 2
Cùng thời điểm → conflict → ai thắng?

Conflict resolution strategies:

Last-Write-Wins (LWW) — timestamp lớn hơn thắng (đơn giản nhưng mất data)
Application-level resolution — app quyết định merge logic
CRDTs (Conflict-free Replicated Data Types) — data structure tự merge (dùng trong collaborative editing)

Lời khuyên thực tế: Tránh Master-Master trừ khi bắt buộc (multi-region write). Complexity cost rất cao. Hầu hết hệ thống dùng Master-Slave + failover là đủ.

2.1.3 Synchronous vs Asynchronous vs Semi-Synchronous Replication

Kiểu	Cách hoạt động	Ưu điểm	Nhược điểm
Synchronous (Đồng bộ)	Master đợi tất cả replica xác nhận trước khi commit	Zero data loss	Latency cao, 1 replica chậm → cả system chậm
Asynchronous (Bất đồng bộ)	Master commit ngay, replica nhận sau	Latency thấp, master không bị block	Có thể mất data nếu master chết trước khi replica nhận
Semi-Synchronous (Bán đồng bộ)	Master đợi ít nhất 1 replica xác nhận	Cân bằng: ít nhất 1 bản sao an toàn	Vẫn bị block bởi 1 replica

PostgreSQL configuration:

-- Synchronous replication: Master đợi replica xác nhận
-- postgresql.conf trên Master
synchronous_commit = on           -- default, đợi local WAL flush
synchronous_standby_names = 'replica1'  -- đợi replica1 xác nhận
 
-- Semi-synchronous: đợi ít nhất 1 trong N replica
synchronous_standby_names = 'ANY 1 (replica1, replica2, replica3)'
 
-- Asynchronous: không đợi replica
synchronous_commit = off  -- hoặc đơn giản không set synchronous_standby_names

Production recommendation: Semi-synchronous (ANY 1) cho hầu hết hệ thống. Đảm bảo ít nhất 1 replica có data, mà không quá chậm.

2.1.4 Replication Lag — Kẻ thù thầm lặng

Replication lag (độ trễ sao chép) là khoảng thời gian từ khi Master commit đến khi Replica có data đó.

Scenario	Lag thường thấy	Hậu quả
Cùng data center, load thấp	< 1 ms	Không đáng kể
Cùng data center, load cao	10–100 ms	User có thể đọc stale data
Cross-region (US–Asia)	100–300 ms	Đọc data cũ vài giây
Replica quá tải / network issue	1–60 giây	Nghiêm trọng — user thấy data cũ
Burst write (batch import)	Phút → giờ	Replica gần như “đóng băng”

Hậu quả thực tế của replication lag:

Hieu, tưởng tượng em đặt hàng trên Shopee:

Em bấm “Đặt hàng” → Write vào Master ✅
Em refresh trang để xem trạng thái → App đọc từ Replica (chưa có data!) ❌
Em nghĩ đơn hàng bị mất → bấm đặt hàng lần nữa → duplicate order 💀

Giải pháp:

Giải pháp	Cách hoạt động	Trade-off
Read-your-writes consistency	Sau khi write, đọc từ Master trong N giây	Master chịu thêm read load
Monotonic reads	User luôn đọc từ cùng 1 replica (sticky session)	Không tận dụng hết replica
Causal consistency	Đánh version cho data, replica chỉ serve nếu có version đủ mới	Implementation phức tạp
Synchronous replication	Đợi replica xác nhận	Latency tăng

2.1.5 Read Replicas — Scale Read

Khi hệ thống read-heavy (>80% read), thêm read replica là cách scale đơn giản nhất:

Số Replica	Read capacity (ước lượng)	Ghi chú
1 Master	10K QPS (read + write)	Baseline
1 Master + 1 Replica	10K write + 10K read = 20K total	2x read
1 Master + 4 Replica	10K write + 40K read = 50K total	5x read
1 Master + 9 Replica	10K write + 90K read = 100K total	10x read

Diminishing returns: Mỗi replica thêm cũng tăng replication lag và management overhead. Thực tế hiếm khi quá 5 read replicas. Nếu cần nhiều hơn → xem xét caching layer hoặc sharding.

2.1.6 Failover — Khi Master chết

Automatic Failover (Tự động chuyển đổi):

Bước	Hành động	Thời gian
1	Monitoring phát hiện Master không phản hồi	5–30 giây (health check interval)
2	Kiểm tra xác nhận Master thực sự chết (avoid false positive)	5–15 giây
3	Chọn Replica có data mới nhất làm Master mới (election)	1–5 giây
4	Cập nhật DNS/VIP/proxy để trỏ sang Master mới	1–30 giây
5	Các replica còn lại trỏ sang Master mới	5–15 giây
Tổng	Downtime	15–90 giây

Manual Failover (Chuyển đổi thủ công):

Ưu điểm	Nhược điểm
DBA kiểm soát hoàn toàn	Downtime dài (phút → giờ, tuỳ DBA có online không)
Tránh false positive trigger	Không phù hợp 24/7 operation
An toàn hơn cho critical data	Cần on-call rotation

Split Brain — Ác mộng của Automatic Failover:

Khi network partition xảy ra:

Monitoring tưởng Master chết (thực ra chỉ bị mất kết nối)
Promote Replica thành Master mới
Giờ có 2 Masters cùng nhận write → data divergence → split brain 🧠💀

Giải pháp:

Fencing (STONITH — Shoot The Other Node In The Head): Tắt Master cũ bằng force (power off qua IPMI/cloud API) trước khi promote Master mới
Quorum-based (Patroni, etcd): Cần majority vote (>50% nodes) để quyết định ai là Master
Lease-based: Master phải renew lease liên tục; nếu không renew → tự step down

2.2 Sharding — Phân mảnh dữ liệu

Khi Replication không đủ (write bottleneck, storage quá lớn cho 1 node), em cần Sharding (còn gọi là Horizontal Partitioning — phân vùng ngang).

Nguyên lý: Chia data thành nhiều phần (shards/partitions), mỗi phần nằm trên một database server riêng.

2.2.1 Sharding Strategies

Hash-based Sharding (Phân mảnh theo hash)

shard_id = hash(shard_key) % number_of_shards

Đặc điểm	Chi tiết
Ưu điểm	Phân bố đều data, đơn giản implement
Nhược điểm	Resharding nightmare — thêm/bớt shard → phải migrate gần như toàn bộ data
Cải tiến	Dùng Consistent Hashing → chỉ migrate ~1/N data → Tuan-10-Consistent-Hashing
Use case	User data, session data

Ví dụ: 4 shards, shard key = user_id

user_id = 12345 → hash(12345) % 4 = 1 → Shard 1
user_id = 67890 → hash(67890) % 4 = 2 → Shard 2
user_id = 11111 → hash(11111) % 4 = 3 → Shard 3

Range-based Sharding (Phân mảnh theo khoảng)

Shard 1: user_id 1 – 1,000,000
Shard 2: user_id 1,000,001 – 2,000,000
Shard 3: user_id 2,000,001 – 3,000,000

Đặc điểm	Chi tiết
Ưu điểm	Range queries hiệu quả (tìm user_id từ X đến Y chỉ cần 1 shard)
Nhược điểm	Hotspot — user mới (id lớn) tập trung vào shard cuối
Use case	Time-series data (shard theo tháng/năm), log data

Directory-based Sharding (Phân mảnh theo bảng tra cứu)

Dùng một lookup table (bảng ánh xạ) để biết data nằm ở shard nào.

Lookup Table:
  user_id 12345 → Shard 2
  user_id 67890 → Shard 1
  user_id 11111 → Shard 4

Đặc điểm	Chi tiết
Ưu điểm	Linh hoạt nhất — có thể move data giữa shards bất kỳ lúc nào
Nhược điểm	Lookup table là SPOF và bottleneck, thêm 1 network hop
Use case	Hệ thống cần resharding thường xuyên, multi-tenant SaaS

Geo-based Sharding (Phân mảnh theo địa lý)

Shard VN: Users ở Việt Nam → DB server tại Singapore
Shard US: Users ở Mỹ → DB server tại US-East
Shard EU: Users ở Châu Âu → DB server tại Frankfurt

Đặc điểm	Chi tiết
Ưu điểm	Latency cực thấp (data gần user), tuân thủ GDPR (data residency)
Nhược điểm	Cross-region queries chậm, user di chuyển → phải migrate shard
Use case	Global apps (Uber, Netflix), hệ thống cần tuân thủ data residency

2.2.2 Shard Key Selection — Quyết định sống còn

Chọn shard key sai → hệ thống chết. Đây là quyết định quan trọng nhất khi sharding.

Tiêu chí chọn shard key tốt:

Tiêu chí	Giải thích	Ví dụ tốt	Ví dụ xấu
High cardinality	Nhiều giá trị unique → phân bố đều	`user_id` (triệu giá trị)	`country` (< 200 giá trị)
Even distribution	Data phân bố đều giữa các shard	`user_id` (random)	`created_date` (burst vào ngày hot)
Query isolation	Hầu hết query chỉ cần 1 shard	Shard theo `user_id`, query theo user	Shard theo `user_id`, query theo `product_id`
Stable	Shard key không thay đổi	`user_id`	`email` (user có thể đổi)

Case study — Shard key sai cho e-commerce:

Shard key	Vấn đề
`order_date`	Black Friday → 1 shard chịu 90% traffic (hotspot)
`product_category`	”Điện thoại” hot hơn “Sách cổ” 1000 lần
`user_id`	✅ Tốt nếu query chủ yếu theo user. Nhưng query “top selling products” → cross-shard
`order_id` (hash)	✅ Phân bố đều, nhưng query “tất cả order của user X” → cross-shard

Rule of thumb: Shard key nên là entity chính trong hầu hết queries. Cho e-commerce lấy user làm trung tâm → user_id. Cho analytics platform → tenant_id.

2.2.3 Cross-Shard Queries — Nỗi đau của Sharding

Khi query cần data từ nhiều shards:

-- Shard theo user_id, nhưng cần query theo product:
SELECT COUNT(*) FROM orders WHERE product_id = 'IPHONE15';
-- Phải fan-out query tới TẤT CẢ shards, rồi aggregate kết quả!

Chi phí cross-shard query:

Metric	Single shard	10 shards (fan-out)	Ghi chú
Latency	5 ms	50–200 ms	Chậm nhất = shard chậm nhất
Network	1 round trip	10 round trips	10x network overhead
Complexity	Simple SQL	Coordinator + scatter-gather	Cần middleware/proxy
Consistency	ACID transaction	Distributed transaction (2PC) hoặc eventual consistency	Rất phức tạp

Giải pháp giảm cross-shard queries:

Denormalization: Copy data cần thiết vào mỗi shard (ví dụ: copy product_name vào orders table)
Global tables: Bảng nhỏ, ít thay đổi (countries, categories) → copy toàn bộ vào mỗi shard
Secondary index service: Dùng Elasticsearch cho cross-shard search
CQRS pattern: Tách read model (denormalized, cross-shard) khỏi write model (sharded)

2.2.4 Resharding — Khi cần thêm/bớt shard

Khi nào cần reshard?

Một shard đầy storage (>80% disk)
Một shard quá tải QPS (hotspot)
Cần thêm shard vì data growth
Cần gộp shard vì giảm data (cost optimization)

Resharding strategies:

Strategy	Cách hoạt động	Downtime	Complexity
Stop-the-world	Dừng write → migrate → resume	Giờ → ngày	Thấp
Double-write	Write cả shard cũ + mới trong transition period	Zero	Cao
Ghost table (gh-ost)	Copy data background → atomic switchover	Gần zero	Trung bình
Consistent hashing	Chỉ migrate ~1/N data khi thêm node	Gần zero	Thấp (nếu đã dùng từ đầu)

Production tip: Dùng Vitess (YouTube’s sharding solution) hoặc Citus (PostgreSQL extension) để tự động hoá resharding. Đừng tự build.

2.2.5 Hotspot Problem — Shard nóng

Celebrity problem (hay Thundering herd trên 1 shard):

Justin Bieber post ảnh → tất cả fan query shard chứa data Bieber → shard đó chết
Black Friday flash sale → tất cả order đổ vào shard chứa product hot

Giải pháp:

Giải pháp	Chi tiết
Salting	Thêm random suffix vào shard key: `user_123_salt_7` → phân tán vào nhiều shard, nhưng read phải query N shards rồi merge
Dedicated shard	Celebrity users có shard riêng (VIP treatment)
Caching layer	Hot data nằm hoàn toàn trong Redis → không hit DB
Rate limiting	Limit read QPS per entity

2.3 Sharding Middleware & Tools

Vitess (YouTube → CNCF project)

Đặc điểm	Chi tiết
Database	MySQL
Tính năng	Transparent sharding, online resharding, connection pooling, query routing
Ai dùng	YouTube, Slack, GitHub, Square
Ưu điểm	Battle-tested tại scale cực lớn, Kubernetes-native
Nhược điểm	Learning curve cao, MySQL-only

Citus (PostgreSQL extension)

Đặc điểm	Chi tiết
Database	PostgreSQL
Tính năng	Distributed tables, reference tables, co-located joins
Ai dùng	Microsoft (Azure), Algolia, Heap Analytics
Ưu điểm	Transparent — app vẫn nói chuyện PostgreSQL bình thường
Nhược điểm	Một số PostgreSQL features không support trong distributed mode

ProxySQL (MySQL Proxy)

Đặc điểm	Chi tiết
Tính năng	Query routing (read/write split), connection pooling, query caching, failover
Ưu điểm	Không cần thay đổi app code, config-driven
Nhược điểm	Thêm 1 hop latency, không phải sharding solution (chỉ routing)

PgBouncer (PostgreSQL Connection Pooler)

Đặc điểm	Chi tiết
Vai trò	Connection pooling — giảm số connection thực tới PostgreSQL
Vì sao cần	PostgreSQL fork 1 process per connection (~10MB RAM/connection). 1000 connections = 10GB RAM chỉ cho connection overhead
Mode	Transaction pooling (recommended): connection được trả lại pool sau mỗi transaction
Benchmark	Không có PgBouncer: 200 connections max. Có PgBouncer: 10,000+ connections với chỉ 50 backend connections

; pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
 
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction          ; transaction pooling — recommended
max_client_conn = 10000          ; max connections từ app
default_pool_size = 50           ; connections thực tới PostgreSQL
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 3
server_lifetime = 3600
server_idle_timeout = 600
log_connections = 1
log_disconnections = 1

2.4 SQL vs NoSQL — Ma trận lựa chọn

Tiêu chí	SQL (PostgreSQL, MySQL)	NoSQL (MongoDB, Cassandra, DynamoDB)
Schema	Fixed schema, migrations	Flexible schema, schema-on-read
Joins	Powerful JOIN support	Thường không có hoặc hạn chế
Transactions	ACID transactions	Thường BASE (eventual consistency)
Scaling	Vertical + Sharding (phức tạp)	Horizontal scaling built-in
Query flexibility	SQL — cực kỳ linh hoạt	Limited query patterns
Consistency	Strong consistency mặc định	Tunable consistency
Best for	Complex relationships, financial data, reporting	High-throughput, flexible schema, time-series, key-value

Khi nào chọn SQL?

Cần ACID transactions (payment, banking)
Data có nhiều relationships (e-commerce: users → orders → products → reviews)
Cần complex queries, aggregations, reporting
Schema tương đối ổn định

Khi nào chọn NoSQL?

Write-heavy, massive scale (>100K writes/s)
Schema thay đổi thường xuyên (startup iteration nhanh)
Data là key-value, document, hoặc time-series
Cần horizontal scaling dễ dàng (DynamoDB, Cassandra auto-shard)

Thực tế: Hầu hết hệ thống lớn dùng cả hai (polyglot persistence). PostgreSQL cho core business data, Redis cho cache, Elasticsearch cho search, Cassandra cho time-series/logs.

2.5 ACID vs BASE

Property	ACID	BASE
	Atomicity — all or nothing	Basically Available — system luôn phản hồi
	Consistency — data luôn valid	Soft state — state có thể thay đổi theo thời gian
	Isolation — transactions không ảnh hưởng nhau	Eventual consistency — cuối cùng sẽ consistent
	Durability — data persist sau commit
Trade-off	Consistency > Availability	Availability > Consistency
Latency	Cao hơn (phải coordinate)	Thấp hơn (không đợi)
Scaling	Khó scale write	Dễ scale write
Ví dụ	PostgreSQL, MySQL (InnoDB)	Cassandra, DynamoDB, MongoDB

2.6 CAP Theorem — Practical Implications

CAP: Trong distributed system, chỉ có thể chọn 2 trong 3: Consistency, Availability, Partition tolerance.

Nhưng trong thực tế, P (Partition Tolerance) là bắt buộc trong distributed system (network luôn có thể fail). Vậy lựa chọn thực sự là:

Khi network partition xảy ra…	CP (Consistency + Partition Tolerance)	AP (Availability + Partition Tolerance)
Hành vi	Từ chối request (trả error) thay vì trả data cũ	Tiếp tục phục vụ, nhưng có thể trả data cũ
Ví dụ	Bank transfer, inventory count	Social media feed, DNS
DB	PostgreSQL (synchronous replication), HBase, Spanner, etcd	Cassandra, DynamoDB, CouchDB
User thấy	”Service unavailable”	Data hơi cũ nhưng vẫn dùng được

Lưu ý quan trọng về PostgreSQL: PostgreSQL không cố định là CP — phụ thuộc cấu hình:

Asynchronous streaming replication (default): Master commit không đợi replica → khi master fail + replica chưa nhận WAL → mất data đã commit → AP-leaning (sacrifice consistency for availability/latency)

Synchronous replication (synchronous_commit=on + synchronous_standby_names): Master đợi replica fsync → CP (zero data loss khi failover, nhưng nếu replica down → master block writes)

Tham chiếu: Tuan-Bonus-Consensus-Raft-Paxos về quorum-based consensus và PostgreSQL docs

Practical insight: Phần lớn hệ thống không cần CP 100%. Ngay cả banking cũng dùng eventual consistency cho một số flows (balance display có thể lag vài giây, nhưng transfer phải strong consistent).

PACELC Theorem (mở rộng CAP): Nếu có Partition → chọn A hay C? Else (không partition) → chọn Latency hay Consistency?

Ví dụ: DynamoDB = PA/EL (khi partition → available; bình thường → low latency). PostgreSQL synchronous replication = PC/EC (khi partition → consistent; bình thường → consistent, nhưng latency cao hơn).

3. Estimation — Khi nào cần Sharding?

3.1 Ngưỡng cần xem xét Sharding

Metric	Ngưỡng “cần suy nghĩ”	Ngưỡng “phải shard”	Giải thích
Data size	> 500 GB	> 2 TB	Single PostgreSQL handle tốt tới ~1-2TB, sau đó vacuum/backup chậm
Write QPS	> 5,000/s	> 15,000/s	Single PostgreSQL write limit ~10-20K QPS
Read QPS	> 20,000/s	> 50,000/s	Thêm read replica trước khi shard
Table rows	> 500M rows	> 2B rows	Index maintenance chậm, queries degrade
Single query latency	> 100ms (p99)	> 500ms (p99)	Dù đã optimize index

3.2 Ước lượng cho E-commerce platform

Assumptions:

Thông số	Giá trị
DAU	10M
Orders/user/day	0.5
Avg order size (DB row)	2 KB
Products viewed/user/day	20
Avg product row	5 KB
Retention	5 năm
Read:Write ratio	20:1

Write QPS:

W r i t e QP S_{a vg} = \frac{10 M \times 0.5}{86 , 400} \approx 58 w r i t es / s

W r i t e QP S_{p e ak} = 58 \times 5 = 290 w r i t es / s

W r i t e QP S_{f l a s h s a l e} = 58 \times 50 = 2, 900 w r i t es / s

Nhận xét: 2,900 write QPS peak — single PostgreSQL vẫn handle được. Chưa cần shard cho write.

Read QPS:

R e a d QP S_{a vg} = \frac{10 M \times 20}{86 , 400} \approx 2, 315 re a d s / s

R e a d QP S_{p e ak} = 2, 315 \times 5 = 11, 575 re a d s / s

Nhận xét: 11.5K read QPS peak — cần 2-3 read replicas (mỗi replica ~5K QPS). Chưa cần shard cho read.

Storage:

O r d ers / d a y = 10 M \times 0.5 = 5 M or d ers / d a y

O r d er s t or a g e / d a y = 5 M \times 2 K B = 10 GB / d a y

O r d er s t or a g e (5 n \overset{a}{˘} m) = 10 GB \times 365 \times 5 = 18.25 TB

Alert: 18.25 TB order data — cần sharding cho storage! Single PostgreSQL không nên chứa quá 2TB cho performance tốt.

Số shard cần:

S ha r d s_{minim u m} = ⌈ \frac{18.25 TB}{1 TB p er s ha r d} ⌉ = 19 s ha r d s

S ha r d s_{reco mm e n d e d} = 19 \times 1.5 (h e a d roo m) \approx 32 s ha r d s

Tip: Chọn số shard là lũy thừa của 2 (32 = 2^5) để hash distribution đều và resharding dễ hơn.

3.3 Replication Lag Impact Estimation

Scenario: E-commerce platform, replication lag = 500ms

O r d ers a ff ec t e d / h o u r = W r i t e QPS \times l a g_w in d o w \times re a d \- a f t er \- w r i t e \- p ro babi l i t y

= 290 \times 0.5 \times 0.3 = 43.5 or d ers / h o u r ma y see s t a l e d a t a

= 43.5 \times 24 = 1, 044 or d ers / d a y a ff ec t e d

Với 2.5M orders/ngày, 1,044 orders bị affected = 0.04%. Có thể chấp nhận được nếu UX handle tốt (loading state). Nhưng nếu lag tăng lên 5 giây → 10,440 orders/day → cần fix.

3.4 Storage per Shard Estimation

St or a g e p er s ha r d = \frac{T o t a l s t or a g e}{N u mb er o f s ha r d s} = \frac{18.25 TB}{32} \approx 570 GB / s ha r d

G ro wt h p er s ha r d / ye a r = \frac{10 GB / d a y \times 365}{32} \approx 114 GB / s ha r d / ye a r

T im e u n t i l s ha r d re a c h es 1 TB = \frac{1 , 000 GB - 570 GB}{114 GB / ye a r} \approx 3.8 n \overset{a}{˘} m

Capacity planning: Với 32 shards, mỗi shard bắt đầu ở 570GB và có ~3.8 năm headroom trước khi cần reshard. Đặt alert ở 80% = 800GB.

4. Security First — Bảo mật Database

4.1 SQL Injection Prevention — Phòng chống tiêm SQL

SQL Injection vẫn là OWASP Top 10 và là nguyên nhân #1 data breach cho database.

Attack ví dụ:

-- User input: ' OR '1'='1'; DROP TABLE users; --
-- Query bị inject:
SELECT * FROM users WHERE username = '' OR '1'='1'; DROP TABLE users; --'

Phòng chống (nhiều lớp):

Layer	Giải pháp	Chi tiết
Code	Parameterized queries / Prepared statements	KHÔNG BAO GIỜ nối string cho SQL
Code	ORM (SQLAlchemy, Sequelize, Prisma)	ORM tự parameterize
Network	WAF (Web Application Firewall)	Detect SQL patterns trong request
DB	Least privilege	App user chỉ có SELECT/INSERT/UPDATE, KHÔNG có DROP/ALTER
DB	Stored procedures	Limit SQL operations app có thể chạy
Monitoring	Query anomaly detection	Alert khi thấy unusual query patterns

# ❌ VULNERABLE — String concatenation
query = f"SELECT * FROM users WHERE id = {user_input}"
 
# ✅ SAFE — Parameterized query
cursor.execute("SELECT * FROM users WHERE id = %s", (user_input,))
 
# ✅ SAFE — SQLAlchemy ORM
user = session.query(User).filter(User.id == user_input).first()

// ❌ VULNERABLE
const query = `SELECT * FROM users WHERE id = ${userId}`;
 
// ✅ SAFE — Parameterized query (pg library)
const result = await pool.query('SELECT * FROM users WHERE id = $1', [userId]);
 
// ✅ SAFE — Prisma ORM
const user = await prisma.user.findUnique({ where: { id: userId } });

4.2 Database Encryption at Rest — Mã hoá dữ liệu lưu trữ

TDE (Transparent Data Encryption): Mã hoá toàn bộ data trên disk mà application không cần thay đổi.

Cấp độ	Công cụ	Mã hoá gì	Ưu/Nhược
OS-level	LUKS (Linux), FileVault (macOS), BitLocker (Windows)	Toàn bộ disk	Đơn giản, nhưng ai có access OS = có data
Database-level	PostgreSQL TDE (pg_tde), MySQL TDE	Tablespace/table level	Granular hơn, performance overhead 3-5%
Column-level	pgcrypto (PostgreSQL), Application-level encryption	Cột cụ thể (SSN, credit card)	An toàn nhất, nhưng không thể index/search encrypted column

-- PostgreSQL: Column-level encryption với pgcrypto
CREATE EXTENSION IF NOT EXISTS pgcrypto;
 
-- Encrypt sensitive data
INSERT INTO users (name, ssn_encrypted)
VALUES ('Hieu', pgp_sym_encrypt('123-45-6789', 'my-secret-key'));
 
-- Decrypt
SELECT name, pgp_sym_decrypt(ssn_encrypted, 'my-secret-key') AS ssn
FROM users WHERE id = 1;

Production: Encryption key KHÔNG được lưu trong database hay code. Dùng KMS (AWS KMS, HashiCorp Vault) → Tuan-15-Data-Security-Encryption.

4.3 Network Segmentation cho Database

┌─────────────────────────────────────────────┐
│                Internet                      │
└───────────────┬─────────────────────────────┘
                │
┌───────────────▼─────────────────────────────┐
│          DMZ (Public Subnet)                 │
│   ┌─────────────────┐                       │
│   │  Load Balancer   │                       │
│   └────────┬────────┘                       │
└────────────┼────────────────────────────────┘
             │ Port 443 only
┌────────────▼────────────────────────────────┐
│       App Subnet (Private)                   │
│   ┌──────────┐  ┌──────────┐                │
│   │  App 1   │  │  App 2   │                │
│   └────┬─────┘  └────┬─────┘                │
└────────┼──────────────┼─────────────────────┘
         │ Port 5432    │ Port 5432 only
┌────────▼──────────────▼─────────────────────┐
│        DB Subnet (Most Restricted)           │
│   ┌──────────┐  ┌──────────┐                │
│   │ Master   │──│ Replica  │                │
│   └──────────┘  └──────────┘                │
│   - NO internet access                      │
│   - ONLY app subnet can connect             │
│   - Security Group: allow 5432 from app SG  │
│   - NACLs: deny all except app subnet CIDR  │
└─────────────────────────────────────────────┘

Security Group rules (AWS ví dụ):

Rule	Type	Port	Source	Mục đích
DB SG Inbound	PostgreSQL	5432	App Security Group	Chỉ app mới connect được DB
DB SG Inbound	SSH	22	Bastion SG	DBA access qua bastion host
DB SG Outbound	All	All	DB SG	Replication giữa DB nodes
DB SG Outbound	HTTPS	443	KMS endpoint	Lấy encryption keys

4.4 Audit Logging

-- PostgreSQL: Bật audit logging
-- postgresql.conf
-- log_statement = 'all'           -- Log tất cả SQL (dev/staging)
-- log_statement = 'ddl'           -- Log DDL only (production minimum)
-- log_statement = 'mod'           -- Log DDL + DML (recommended production)
 
-- pgAudit extension — chi tiết hơn built-in logging
CREATE EXTENSION pgaudit;
 
-- Log tất cả READ và WRITE operations
ALTER SYSTEM SET pgaudit.log = 'read, write';
 
-- Log role-specific: chỉ audit role 'app_user'
ALTER SYSTEM SET pgaudit.role = 'app_user';
 
-- Kiểm tra audit log
SELECT * FROM pg_catalog.pg_audit_log
WHERE timestamp > now() - interval '1 hour'
ORDER BY timestamp DESC;

4.5 Principle of Least Privilege — Quyền tối thiểu

-- ❌ WRONG: App dùng superuser
-- App kết nối bằng user 'postgres' với full quyền
 
-- ✅ RIGHT: Tạo users với quyền tối thiểu
 
-- Read-only user cho reporting/analytics
CREATE ROLE readonly_user LOGIN PASSWORD 'strong_password_here';
GRANT CONNECT ON DATABASE mydb TO readonly_user;
GRANT USAGE ON SCHEMA public TO readonly_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO readonly_user;
 
-- App user: read + write, KHÔNG có DDL
CREATE ROLE app_user LOGIN PASSWORD 'another_strong_password';
GRANT CONNECT ON DATABASE mydb TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO app_user;
-- KHÔNG GRANT: CREATE, DROP, ALTER, TRUNCATE
 
-- Migration user: chỉ dùng cho DB migration, có DDL
CREATE ROLE migration_user LOGIN PASSWORD 'migration_strong_password';
GRANT CONNECT ON DATABASE mydb TO migration_user;
GRANT ALL PRIVILEGES ON SCHEMA public TO migration_user;
-- Chỉ dùng trong CI/CD pipeline, KHÔNG dùng ở app runtime
 
-- Admin/DBA: full access, nhưng qua bastion host only
CREATE ROLE dba_user LOGIN PASSWORD 'dba_strong_password' SUPERUSER;

4.6 Backup Encryption

# pg_dump với encryption
pg_dump -h master-db -U backup_user mydb | \
  gpg --symmetric --cipher-algo AES256 \
  --passphrase-file /etc/backup/encryption.key | \
  aws s3 cp - s3://my-backup-bucket/mydb-$(date +%Y%m%d).sql.gpg \
  --sse aws:kms \
  --sse-kms-key-id alias/backup-key
 
# Decrypt khi cần restore
aws s3 cp s3://my-backup-bucket/mydb-20260318.sql.gpg - | \
  gpg --decrypt --passphrase-file /etc/backup/encryption.key | \
  psql -h restore-db -U dba_user mydb

Backup rule of 3-2-1: 3 bản backup, trên 2 loại media khác nhau, 1 bản offsite (khác region/cloud).

5. DevOps/Ops-Light — Triển khai & Vận hành

5.1 PostgreSQL Streaming Replication Setup

Master configuration (postgresql.conf):

# WAL settings
wal_level = replica                   # Bắt buộc cho replication
max_wal_senders = 5                   # Tối đa 5 replica connections
wal_keep_size = 1GB                   # Giữ 1GB WAL cho replica bị disconnect tạm
max_replication_slots = 5             # Replication slots — đảm bảo WAL không bị xoá khi replica offline
 
# Synchronous replication (optional)
synchronous_standby_names = 'ANY 1 (replica1, replica2)'
synchronous_commit = on
 
# Logging
log_replication_commands = on

Master pg_hba.conf (authentication):

# TYPE   DATABASE   USER            ADDRESS              METHOD
host     replication replication_user 10.0.2.0/24         scram-sha-256

Tạo replication user:

CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'secure_replication_password';

Replica setup:

# Tạo base backup từ Master
pg_basebackup -h master-db -U replication_user -D /var/lib/postgresql/16/main \
  -P -Xs -R
 
# -P: show progress
# -Xs: stream WAL during backup
# -R: tự tạo standby.signal và primary_conninfo trong postgresql.auto.conf

Replica postgresql.conf:

hot_standby = on                      # Cho phép read queries trên replica
primary_conninfo = 'host=master-db port=5432 user=replication_user password=secure_replication_password application_name=replica1'
primary_slot_name = 'replica1_slot'   # Dùng replication slot

5.2 pg_stat_replication Monitoring

-- Kiểm tra replication status trên Master
SELECT
    client_addr,
    application_name,
    state,
    sync_state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS replication_lag_mb
FROM pg_stat_replication;
 
-- Kiểm tra lag theo thời gian trên Replica
SELECT
    now() - pg_last_xact_replay_timestamp() AS replication_lag_time;

Column	Ý nghĩa
`sent_lsn`	WAL đã gửi từ Master
`write_lsn`	WAL đã write vào disk trên Replica
`flush_lsn`	WAL đã flush (fsync) trên Replica
`replay_lsn`	WAL đã replay (apply) trên Replica
`sent - replay`	Replication lag thực tế

5.3 Prometheus postgres_exporter

# docker-compose.monitoring.yml
services:
  postgres-exporter-master:
    image: prometheuscommunity/postgres-exporter:latest
    environment:
      DATA_SOURCE_NAME: "postgresql://monitor_user:password@master-db:5432/mydb?sslmode=require"
    ports:
      - "9187:9187"
    restart: unless-stopped
 
  postgres-exporter-replica:
    image: prometheuscommunity/postgres-exporter:latest
    environment:
      DATA_SOURCE_NAME: "postgresql://monitor_user:password@replica-db:5432/mydb?sslmode=require"
    ports:
      - "9188:9187"
    restart: unless-stopped

Prometheus alerts cho replication:

# prometheus-alerts-db.yml
groups:
  - name: postgresql_replication
    rules:
      - alert: ReplicationLagHigh
        expr: pg_replication_lag_seconds > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag > 5s on {{ $labels.instance }}"
          description: "Current lag: {{ $value }}s. Check replica load and network."
 
      - alert: ReplicationLagCritical
        expr: pg_replication_lag_seconds > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag > 30s on {{ $labels.instance }}"
          description: "CRITICAL: Lag {{ $value }}s. Users may see very stale data."
 
      - alert: ReplicationDown
        expr: pg_replication_is_replica == 1 AND pg_up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Replica {{ $labels.instance }} is DOWN"
 
      - alert: ReplicationSlotInactive
        expr: pg_replication_slots_active == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication slot inactive — WAL accumulating on master"
 
      - alert: ConnectionPoolExhausted
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Connection usage > 85% on {{ $labels.instance }}"
 
      - alert: StoragePerShardHigh
        expr: pg_database_size_bytes / 1073741824 > 800  # > 800GB
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database size > 800GB on {{ $labels.instance }}. Consider resharding."

5.4 Automated Failover with Patroni

Patroni = PostgreSQL HA solution sử dụng etcd/ZooKeeper/Consul cho leader election.

# patroni.yml — Cấu hình cho mỗi PostgreSQL node
scope: my-cluster
name: node1
 
restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008
 
etcd3:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379
 
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB — replica phải trong 1MB lag để được promote
    synchronous_mode: true
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        max_worker_processes: 8
        wal_level: replica
        max_wal_senders: 5
        max_replication_slots: 5
        hot_standby: "on"
        wal_log_hints: "on"         # Cần cho pg_rewind
        synchronous_commit: "on"
 
  initdb:
    - encoding: UTF8
    - data-checksums             # Detect corruption
 
postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    replication:
      username: replication_user
      password: secure_replication_password
    superuser:
      username: postgres
      password: postgres_password

Patroni failover flow:

Bước	Hành động	Thời gian
1	Patroni phát hiện Master không respond	`ttl` = 30s
2	etcd confirm Master’s lease expired	~5s
3	Patroni chọn Replica với least lag	~2s
4	pg_rewind sync lại data nếu cần	1-10s
5	Promote Replica → Master mới	~2s
6	Các Replica khác follow Master mới	~5s
Tổng	Automatic failover	~30-60s

5.5 Backup Strategy

Strategy	Cách	Tốc độ backup	Tốc độ restore	PITR	Dung lượng
pg_dump	Logical backup (SQL statements)	Chậm (giờ cho TB-level)	Chậm	Không	Nhỏ (compressed)
pg_basebackup	Physical backup (copy data files)	Nhanh	Nhanh	Có (với WAL)	Lớn (full copy)
WAL archiving	Continuous, incremental	Real-time	Nhanh (restore base + replay WAL)	Có — chính xác tới giây	Nhỏ (chỉ changes)

Production recommendation: pg_basebackup (weekly full) + WAL archiving (continuous)

# WAL archiving configuration (postgresql.conf)
# archive_mode = on
# archive_command = 'aws s3 cp %p s3://wal-archive/%f --sse aws:kms'
# archive_timeout = 60  # archive WAL mỗi 60s dù chưa đầy segment
 
# Weekly full backup (cron job)
# 0 2 * * 0 pg_basebackup -h localhost -U backup_user -D /backup/base/$(date +\%Y\%m\%d) -Ft -z -P
 
# Point-in-Time Recovery (PITR) — restore tới 1 thời điểm cụ thể
# Scenario: Ai đó chạy DROP TABLE vào 14:30:00
# 1. Restore base backup
# 2. Replay WAL tới 14:29:59
# recovery_target_time = '2026-03-18 14:29:59'
# restore_command = 'aws s3 cp s3://wal-archive/%f %p'

6. Code Examples

6.1 Docker-compose: PostgreSQL Master-Slave

# docker-compose.yml
# PostgreSQL Master-Replica setup for local development/testing
version: '3.8'
 
services:
  pg-master:
    image: postgres:16-alpine
    container_name: pg-master
    environment:
      POSTGRES_DB: mydb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres_password
      POSTGRES_INITDB_ARGS: "--data-checksums"
    ports:
      - "5432:5432"
    volumes:
      - pg_master_data:/var/lib/postgresql/data
      - ./init-master.sh:/docker-entrypoint-initdb.d/init-master.sh
    command: >
      postgres
        -c wal_level=replica
        -c max_wal_senders=5
        -c max_replication_slots=5
        -c hot_standby=on
        -c log_replication_commands=on
        -c wal_keep_size=256MB
    networks:
      - db-network
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5
 
  pg-replica:
    image: postgres:16-alpine
    container_name: pg-replica
    environment:
      PGUSER: replication_user
      PGPASSWORD: replication_password
    ports:
      - "5433:5432"
    volumes:
      - pg_replica_data:/var/lib/postgresql/data
      - ./init-replica.sh:/init-replica.sh
    entrypoint: /init-replica.sh
    depends_on:
      pg-master:
        condition: service_healthy
    networks:
      - db-network
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5
 
  pgbouncer:
    image: edoburu/pgbouncer:latest
    container_name: pgbouncer
    environment:
      DATABASE_URL: "postgresql://postgres:postgres_password@pg-master:5432/mydb"
      POOL_MODE: transaction
      MAX_CLIENT_CONN: 1000
      DEFAULT_POOL_SIZE: 25
    ports:
      - "6432:6432"
    depends_on:
      - pg-master
    networks:
      - db-network
 
volumes:
  pg_master_data:
  pg_replica_data:
 
networks:
  db-network:
    driver: bridge

init-master.sh:

#!/bin/bash
set -e
 
# Tạo replication user
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'replication_password';
 
    -- App users with least privilege
    CREATE ROLE app_readonly LOGIN PASSWORD 'readonly_password';
    GRANT CONNECT ON DATABASE mydb TO app_readonly;
    GRANT USAGE ON SCHEMA public TO app_readonly;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_readonly;
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO app_readonly;
 
    CREATE ROLE app_readwrite LOGIN PASSWORD 'readwrite_password';
    GRANT CONNECT ON DATABASE mydb TO app_readwrite;
    GRANT USAGE ON SCHEMA public TO app_readwrite;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_readwrite;
    GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO app_readwrite;
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_readwrite;
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT USAGE, SELECT ON SEQUENCES TO app_readwrite;
 
    -- Create replication slot
    SELECT pg_create_physical_replication_slot('replica1_slot');
 
    -- Sample table
    CREATE TABLE orders (
        id BIGSERIAL PRIMARY KEY,
        user_id BIGINT NOT NULL,
        product_id BIGINT NOT NULL,
        amount DECIMAL(10,2) NOT NULL,
        status VARCHAR(20) DEFAULT 'pending',
        created_at TIMESTAMPTZ DEFAULT NOW()
    );
 
    CREATE INDEX idx_orders_user_id ON orders(user_id);
    CREATE INDEX idx_orders_created_at ON orders(created_at);
EOSQL
 
# Thêm replication entry vào pg_hba.conf
echo "host replication replication_user 0.0.0.0/0 scram-sha-256" >> "$PGDATA/pg_hba.conf"

init-replica.sh:

#!/bin/bash
set -e
 
# Đợi Master sẵn sàng
until pg_isready -h pg-master -p 5432 -U replication_user; do
  echo "Waiting for master to be ready..."
  sleep 2
done
 
# Nếu data directory đã có data, bỏ qua
if [ -f "$PGDATA/PG_VERSION" ]; then
  echo "Data directory already initialized. Starting replica..."
  exec postgres \
    -c hot_standby=on
else
  echo "Initializing replica from master..."
 
  # Xoá data directory nếu empty nhưng tồn tại
  rm -rf "$PGDATA"/*
 
  # Tạo base backup
  pg_basebackup \
    -h pg-master \
    -p 5432 \
    -U replication_user \
    -D "$PGDATA" \
    -Fp -Xs -P -R \
    -S replica1_slot
 
  # Cấu hình replica
  cat >> "$PGDATA/postgresql.auto.conf" <<EOF
primary_conninfo = 'host=pg-master port=5432 user=replication_user password=replication_password application_name=replica1'
primary_slot_name = 'replica1_slot'
EOF
 
  # Đảm bảo standby.signal tồn tại
  touch "$PGDATA/standby.signal"
 
  echo "Replica initialized. Starting..."
  exec postgres \
    -c hot_standby=on
fi

6.2 Python: Connection with Read Replica Routing

"""
Database connection manager with automatic read/write routing.
- Write queries → Master
- Read queries → Replica (with fallback to Master)
- Connection pooling via PgBouncer or psycopg2 pool
"""
 
import psycopg2
from psycopg2 import pool
from contextlib import contextmanager
from functools import wraps
from typing import Optional
import logging
import time
import hashlib
 
logger = logging.getLogger(__name__)
 
 
class DatabaseConfig:
    """Database configuration — KHÔNG hardcode credentials trong code!"""
    def __init__(
        self,
        host: str,
        port: int,
        dbname: str,
        user: str,
        password: str,
        sslmode: str = "require",
    ):
        self.host = host
        self.port = port
        self.dbname = dbname
        self.user = user
        self.password = password
        self.sslmode = sslmode
 
    @property
    def dsn(self) -> str:
        return (
            f"host={self.host} port={self.port} dbname={self.dbname} "
            f"user={self.user} password={self.password} sslmode={self.sslmode}"
        )
 
 
class ReplicaAwareConnectionPool:
    """
    Connection pool hỗ trợ read/write routing.
 
    - Tất cả write (INSERT, UPDATE, DELETE, DDL) → Master
    - Tất cả read (SELECT) → Replica (round-robin nếu nhiều replica)
    - Fallback: nếu replica chết → đọc từ Master
    - Read-your-writes: sau khi write, đọc từ Master trong N giây
    """
 
    def __init__(
        self,
        master_config: DatabaseConfig,
        replica_configs: list[DatabaseConfig],
        master_pool_size: int = 10,
        replica_pool_size: int = 20,
        read_after_write_window_seconds: float = 2.0,
    ):
        self.master_pool = pool.ThreadedConnectionPool(
            minconn=2,
            maxconn=master_pool_size,
            dsn=master_config.dsn,
        )
 
        self.replica_pools = []
        for config in replica_configs:
            try:
                replica_pool = pool.ThreadedConnectionPool(
                    minconn=2,
                    maxconn=replica_pool_size,
                    dsn=config.dsn,
                )
                self.replica_pools.append(replica_pool)
                logger.info(f"Connected to replica: {config.host}:{config.port}")
            except Exception as e:
                logger.warning(f"Failed to connect replica {config.host}: {e}")
 
        self._replica_index = 0
        self._read_after_write_window = read_after_write_window_seconds
        self._last_write_times: dict[str, float] = {}  # session_id → timestamp
 
    def _get_replica_pool(self) -> Optional[pool.ThreadedConnectionPool]:
        """Round-robin replica selection."""
        if not self.replica_pools:
            return None
        selected = self.replica_pools[self._replica_index % len(self.replica_pools)]
        self._replica_index += 1
        return selected
 
    def _should_read_from_master(self, session_id: Optional[str] = None) -> bool:
        """Read-your-writes consistency: sau write gần đây, đọc từ Master."""
        if session_id and session_id in self._last_write_times:
            elapsed = time.time() - self._last_write_times[session_id]
            if elapsed < self._read_after_write_window:
                logger.debug(
                    f"Read-your-writes: routing to master "
                    f"({elapsed:.1f}s since last write)"
                )
                return True
        return False
 
    def _record_write(self, session_id: Optional[str] = None):
        """Ghi nhận thời điểm write cho read-your-writes consistency."""
        if session_id:
            self._last_write_times[session_id] = time.time()
 
    @contextmanager
    def get_read_connection(self, session_id: Optional[str] = None):
        """Lấy connection cho read query."""
        use_master = self._should_read_from_master(session_id)
 
        if use_master or not self.replica_pools:
            conn = self.master_pool.getconn()
            source = "master"
        else:
            replica_pool = self._get_replica_pool()
            try:
                conn = replica_pool.getconn()
                source = "replica"
            except Exception:
                logger.warning("Replica unavailable, falling back to master")
                conn = self.master_pool.getconn()
                source = "master"
 
        try:
            conn.set_session(readonly=True, autocommit=True)
            logger.debug(f"Read connection from {source}")
            yield conn
        finally:
            if source == "master":
                self.master_pool.putconn(conn)
            else:
                replica_pool.putconn(conn)
 
    @contextmanager
    def get_write_connection(self, session_id: Optional[str] = None):
        """Lấy connection cho write query."""
        conn = self.master_pool.getconn()
        try:
            conn.set_session(readonly=False, autocommit=False)
            yield conn
            conn.commit()
            self._record_write(session_id)
            logger.debug("Write committed to master")
        except Exception:
            conn.rollback()
            logger.error("Write rolled back")
            raise
        finally:
            self.master_pool.putconn(conn)
 
    def close(self):
        """Cleanup all connection pools."""
        self.master_pool.closeall()
        for rp in self.replica_pools:
            rp.closeall()
 
 
# === Sử dụng ===
 
def create_pool_from_env():
    """Tạo pool từ environment variables — KHÔNG hardcode credentials."""
    import os
 
    master = DatabaseConfig(
        host=os.environ["DB_MASTER_HOST"],
        port=int(os.environ.get("DB_MASTER_PORT", 5432)),
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_WRITE_USER"],
        password=os.environ["DB_WRITE_PASSWORD"],
    )
 
    replica_hosts = os.environ.get("DB_REPLICA_HOSTS", "").split(",")
    replicas = [
        DatabaseConfig(
            host=host.strip(),
            port=int(os.environ.get("DB_REPLICA_PORT", 5432)),
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_READ_USER"],
            password=os.environ["DB_READ_PASSWORD"],
        )
        for host in replica_hosts
        if host.strip()
    ]
 
    return ReplicaAwareConnectionPool(
        master_config=master,
        replica_configs=replicas,
    )
 
 
# Example usage
if __name__ == "__main__":
    import os
    os.environ.update({
        "DB_MASTER_HOST": "localhost",
        "DB_MASTER_PORT": "5432",
        "DB_NAME": "mydb",
        "DB_WRITE_USER": "app_readwrite",
        "DB_WRITE_PASSWORD": "readwrite_password",
        "DB_READ_USER": "app_readonly",
        "DB_READ_PASSWORD": "readonly_password",
        "DB_REPLICA_HOSTS": "localhost",
        "DB_REPLICA_PORT": "5433",
    })
 
    db = create_pool_from_env()
    session = "user_12345_session"
 
    # Write — always goes to master
    with db.get_write_connection(session_id=session) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (user_id, product_id, amount) VALUES (%s, %s, %s)",
                (12345, 67890, 99.99),
            )
 
    # Read immediately after write — goes to master (read-your-writes)
    with db.get_read_connection(session_id=session) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM orders WHERE user_id = %s ORDER BY created_at DESC LIMIT 1", (12345,))
            print(cur.fetchone())
 
    # Read after 3 seconds — goes to replica
    time.sleep(3)
    with db.get_read_connection(session_id=session) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM orders")
            print(f"Total orders: {cur.fetchone()[0]}")
 
    db.close()

6.3 Node.js: Connection with Read Replica Routing

/**
 * Database connection manager with read/write routing for Node.js.
 * Uses pg (node-postgres) library.
 *
 * Write → Master
 * Read → Replica (round-robin) with read-your-writes consistency
 */
 
const { Pool } = require('pg');
 
class ReplicaAwarePool {
  /**
   * @param {object} masterConfig - pg Pool config for master
   * @param {object[]} replicaConfigs - pg Pool configs for replicas
   * @param {number} readAfterWriteWindowMs - ms to route reads to master after write
   */
  constructor(masterConfig, replicaConfigs = [], readAfterWriteWindowMs = 2000) {
    this.masterPool = new Pool({
      ...masterConfig,
      max: 20,
      idleTimeoutMillis: 30000,
      connectionTimeoutMillis: 5000,
    });
 
    this.replicaPools = replicaConfigs.map(
      (config) =>
        new Pool({
          ...config,
          max: 30,
          idleTimeoutMillis: 30000,
          connectionTimeoutMillis: 5000,
        })
    );
 
    this.replicaIndex = 0;
    this.readAfterWriteWindowMs = readAfterWriteWindowMs;
    this.lastWriteTimes = new Map(); // sessionId → timestamp
 
    // Log pool errors
    this.masterPool.on('error', (err) => {
      console.error('Master pool error:', err.message);
    });
    this.replicaPools.forEach((pool, i) => {
      pool.on('error', (err) => {
        console.error(`Replica ${i} pool error:`, err.message);
      });
    });
  }
 
  /**
   * Execute a READ query (SELECT).
   * Routes to replica unless read-your-writes window is active.
   */
  async read(sql, params = [], sessionId = null) {
    const usemaster = this._shouldReadFromMaster(sessionId);
    const pool = usemaster ? this.masterPool : this._getReplicaPool();
    const source = usemaster ? 'master' : 'replica';
 
    try {
      const result = await pool.query(sql, params);
      return result;
    } catch (err) {
      if (source === 'replica') {
        console.warn('Replica query failed, falling back to master:', err.message);
        return this.masterPool.query(sql, params);
      }
      throw err;
    }
  }
 
  /**
   * Execute a WRITE query (INSERT/UPDATE/DELETE).
   * Always goes to master.
   */
  async write(sql, params = [], sessionId = null) {
    const result = await this.masterPool.query(sql, params);
    if (sessionId) {
      this.lastWriteTimes.set(sessionId, Date.now());
    }
    return result;
  }
 
  /**
   * Execute a transaction (multiple writes).
   */
  async transaction(queries, sessionId = null) {
    const client = await this.masterPool.connect();
    try {
      await client.query('BEGIN');
      const results = [];
      for (const { sql, params } of queries) {
        results.push(await client.query(sql, params || []));
      }
      await client.query('COMMIT');
      if (sessionId) {
        this.lastWriteTimes.set(sessionId, Date.now());
      }
      return results;
    } catch (err) {
      await client.query('ROLLBACK');
      throw err;
    } finally {
      client.release();
    }
  }
 
  _shouldReadFromMaster(sessionId) {
    if (!sessionId || this.replicaPools.length === 0) return true;
    const lastWrite = this.lastWriteTimes.get(sessionId);
    if (!lastWrite) return false;
    return Date.now() - lastWrite < this.readAfterWriteWindowMs;
  }
 
  _getReplicaPool() {
    if (this.replicaPools.length === 0) return this.masterPool;
    const pool = this.replicaPools[this.replicaIndex % this.replicaPools.length];
    this.replicaIndex++;
    return pool;
  }
 
  async close() {
    await this.masterPool.end();
    await Promise.all(this.replicaPools.map((p) => p.end()));
  }
}
 
// === Usage ===
 
const db = new ReplicaAwarePool(
  {
    host: process.env.DB_MASTER_HOST || 'localhost',
    port: parseInt(process.env.DB_MASTER_PORT || '5432'),
    database: process.env.DB_NAME || 'mydb',
    user: process.env.DB_WRITE_USER || 'app_readwrite',
    password: process.env.DB_WRITE_PASSWORD || 'readwrite_password',
    ssl: process.env.NODE_ENV === 'production' ? { rejectUnauthorized: true } : false,
  },
  [
    {
      host: process.env.DB_REPLICA_HOST || 'localhost',
      port: parseInt(process.env.DB_REPLICA_PORT || '5433'),
      database: process.env.DB_NAME || 'mydb',
      user: process.env.DB_READ_USER || 'app_readonly',
      password: process.env.DB_READ_PASSWORD || 'readonly_password',
      ssl: process.env.NODE_ENV === 'production' ? { rejectUnauthorized: true } : false,
    },
  ]
);
 
// Express middleware example
function attachDb(req, res, next) {
  req.db = db;
  req.sessionId = req.headers['x-session-id'] || req.sessionID;
  next();
}
 
// Route examples
async function createOrder(req, res) {
  const { userId, productId, amount } = req.body;
  const result = await req.db.write(
    'INSERT INTO orders (user_id, product_id, amount) VALUES ($1, $2, $3) RETURNING *',
    [userId, productId, amount],
    req.sessionId
  );
  res.json(result.rows[0]);
}
 
async function getOrders(req, res) {
  const { userId } = req.params;
  // Will route to master if user just placed an order (read-your-writes)
  const result = await req.db.read(
    'SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20',
    [userId],
    req.sessionId
  );
  res.json(result.rows);
}
 
module.exports = { ReplicaAwarePool, db, attachDb, createOrder, getOrders };

6.4 Shard Routing Middleware (Python)

"""
Shard routing middleware — Hash-based sharding example.
Routes queries to the correct shard based on shard key (user_id).
"""
 
import hashlib
from typing import Optional
from contextlib import contextmanager
import psycopg2
from psycopg2 import pool
 
 
class ShardConfig:
    def __init__(self, shard_id: int, host: str, port: int, dbname: str,
                 user: str, password: str):
        self.shard_id = shard_id
        self.host = host
        self.port = port
        self.dbname = dbname
        self.user = user
        self.password = password
 
 
class ShardRouter:
    """
    Hash-based shard router.
 
    Shard selection: shard_id = hash(shard_key) % num_shards
 
    Dùng consistent hashing trong production (xem Tuan-10-Consistent-Hashing).
    Ở đây dùng simple modulo cho dễ hiểu.
    """
 
    def __init__(self, shard_configs: list[ShardConfig], pool_size: int = 10):
        self.num_shards = len(shard_configs)
        self.pools: dict[int, pool.ThreadedConnectionPool] = {}
 
        for config in shard_configs:
            self.pools[config.shard_id] = pool.ThreadedConnectionPool(
                minconn=2,
                maxconn=pool_size,
                host=config.host,
                port=config.port,
                dbname=config.dbname,
                user=config.user,
                password=config.password,
            )
 
    def _get_shard_id(self, shard_key: int) -> int:
        """Hash-based shard selection."""
        # Dùng MD5 hash cho distribution đều hơn simple modulo
        hash_value = int(hashlib.md5(str(shard_key).encode()).hexdigest(), 16)
        return hash_value % self.num_shards
 
    @contextmanager
    def connection(self, shard_key: int):
        """Lấy connection tới đúng shard cho shard_key."""
        shard_id = self._get_shard_id(shard_key)
        conn = self.pools[shard_id].getconn()
        try:
            yield conn
        finally:
            self.pools[shard_id].putconn(conn)
 
    def execute_on_shard(self, shard_key: int, sql: str, params: tuple = ()):
        """Execute query trên 1 shard cụ thể."""
        with self.connection(shard_key) as conn:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                conn.commit()
                if cur.description:  # SELECT query
                    return cur.fetchall()
                return cur.rowcount
 
    def execute_on_all_shards(self, sql: str, params: tuple = ()):
        """
        Fan-out query tới TẤT CẢ shards (cross-shard query).
        CẢNH BÁO: Chậm! Chỉ dùng cho aggregation/analytics.
        """
        results = []
        for shard_id, shard_pool in self.pools.items():
            conn = shard_pool.getconn()
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    if cur.description:
                        rows = cur.fetchall()
                        results.extend(rows)
            finally:
                shard_pool.putconn(conn)
        return results
 
    def scatter_gather_count(self, sql: str, params: tuple = ()):
        """
        Scatter-gather pattern: gửi COUNT query tới tất cả shards, aggregate kết quả.
        Ví dụ: SELECT COUNT(*) FROM orders WHERE status = 'completed'
        """
        total = 0
        for shard_id, shard_pool in self.pools.items():
            conn = shard_pool.getconn()
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    row = cur.fetchone()
                    if row:
                        total += row[0]
            finally:
                shard_pool.putconn(conn)
        return total
 
    def close(self):
        for shard_pool in self.pools.values():
            shard_pool.closeall()
 
 
# === Usage ===
 
if __name__ == "__main__":
    # 4 shards trên 4 DB servers
    configs = [
        ShardConfig(0, "shard0-db", 5432, "mydb_shard0", "app_user", "password"),
        ShardConfig(1, "shard1-db", 5432, "mydb_shard1", "app_user", "password"),
        ShardConfig(2, "shard2-db", 5432, "mydb_shard2", "app_user", "password"),
        ShardConfig(3, "shard3-db", 5432, "mydb_shard3", "app_user", "password"),
    ]
 
    router = ShardRouter(configs)
 
    # Insert order — routed to correct shard based on user_id
    user_id = 12345
    router.execute_on_shard(
        shard_key=user_id,
        sql="INSERT INTO orders (user_id, product_id, amount) VALUES (%s, %s, %s)",
        params=(user_id, 67890, 99.99),
    )
    print(f"Order inserted into shard {router._get_shard_id(user_id)}")
 
    # Read user's orders — single shard
    orders = router.execute_on_shard(
        shard_key=user_id,
        sql="SELECT * FROM orders WHERE user_id = %s",
        params=(user_id,),
    )
    print(f"User {user_id} has {len(orders)} orders")
 
    # Cross-shard query — fan-out to ALL shards (slow!)
    total_orders = router.scatter_gather_count(
        sql="SELECT COUNT(*) FROM orders WHERE status = %s",
        params=("completed",),
    )
    print(f"Total completed orders across all shards: {total_orders}")
 
    router.close()

7. System Design Diagrams

7.1 Replication Topology

flowchart TD
    subgraph "Application Layer"
        APP1[App Server 1]
        APP2[App Server 2]
        APP3[App Server 3]
    end

    subgraph "Connection Pooling"
        PGB[PgBouncer<br/>Max 10K connections → 50 backend]
    end

    subgraph "Write Path"
        APP1 -->|"INSERT/UPDATE/DELETE"| PGB
        APP2 -->|"INSERT/UPDATE/DELETE"| PGB
        PGB -->|"Write"| MASTER[(PostgreSQL Master<br/>Read + Write)]
    end

    subgraph "Read Path"
        APP1 -->|"SELECT"| R1[(Replica 1<br/>Read Only)]
        APP2 -->|"SELECT"| R2[(Replica 2<br/>Read Only)]
        APP3 -->|"SELECT"| R3[(Replica 3<br/>Read Only)]
    end

    subgraph "Replication"
        MASTER -->|"WAL Stream<br/>(async)"| R1
        MASTER -->|"WAL Stream<br/>(sync)"| R2
        MASTER -->|"WAL Stream<br/>(async)"| R3
    end

    subgraph "Failover Management"
        PATRONI[Patroni + etcd<br/>Leader Election]
        PATRONI -.->|"monitors"| MASTER
        PATRONI -.->|"promotes if<br/>master fails"| R2
    end

    subgraph "Monitoring"
        PROM[Prometheus<br/>postgres_exporter]
        GRAF[Grafana Dashboard]
        PROM -.->|"scrape metrics"| MASTER
        PROM -.->|"scrape metrics"| R1
        PROM -.->|"scrape metrics"| R2
        PROM --> GRAF
    end

    style MASTER fill:#ff6b6b,stroke:#333,stroke-width:2px,color:#fff
    style R1 fill:#4ecdc4,stroke:#333,stroke-width:2px
    style R2 fill:#4ecdc4,stroke:#333,stroke-width:2px
    style R3 fill:#4ecdc4,stroke:#333,stroke-width:2px
    style PGB fill:#f9a825,stroke:#333,stroke-width:2px
    style PATRONI fill:#a29bfe,stroke:#333,stroke-width:2px

7.2 Sharded Architecture

flowchart TD
    subgraph "Client Layer"
        C1[Client Request<br/>user_id=12345]
        C2[Client Request<br/>user_id=67890]
        C3[Client Request<br/>Analytics Query]
    end

    subgraph "Application Layer"
        LB[Load Balancer]
        APP[App Server]
    end

    subgraph "Shard Routing Layer"
        SR[Shard Router / Middleware<br/>hash&#40;user_id&#41; % N]
    end

    C1 --> LB
    C2 --> LB
    C3 --> LB
    LB --> APP
    APP --> SR

    subgraph "Shard 0 (user_id hash → 0)"
        S0M[(Master 0)]
        S0R[(Replica 0)]
        S0M -->|"replication"| S0R
    end

    subgraph "Shard 1 (user_id hash → 1)"
        S1M[(Master 1)]
        S1R[(Replica 1)]
        S1M -->|"replication"| S1R
    end

    subgraph "Shard 2 (user_id hash → 2)"
        S2M[(Master 2)]
        S2R[(Replica 2)]
        S2M -->|"replication"| S2R
    end

    subgraph "Shard 3 (user_id hash → 3)"
        S3M[(Master 3)]
        S3R[(Replica 3)]
        S3M -->|"replication"| S3R
    end

    SR -->|"user 12345 → shard 1"| S1M
    SR -->|"user 67890 → shard 2"| S2M

    subgraph "Cross-shard Query (Scatter-Gather)"
        AGG[Aggregator]
        SR -->|"analytics query<br/>fan-out"| AGG
        AGG --> S0R
        AGG --> S1R
        AGG --> S2R
        AGG --> S3R
    end

    subgraph "Global Services"
        ES[Elasticsearch<br/>Cross-shard Search]
        CACHE[Redis Cluster<br/>Hot Data Cache]
    end

    APP -->|"search queries"| ES
    APP -->|"cache read"| CACHE

    style SR fill:#f9a825,stroke:#333,stroke-width:2px
    style AGG fill:#fd79a8,stroke:#333,stroke-width:2px
    style S0M fill:#ff6b6b,stroke:#333,stroke-width:2px,color:#fff
    style S1M fill:#ff6b6b,stroke:#333,stroke-width:2px,color:#fff
    style S2M fill:#ff6b6b,stroke:#333,stroke-width:2px,color:#fff
    style S3M fill:#ff6b6b,stroke:#333,stroke-width:2px,color:#fff
    style S0R fill:#4ecdc4,stroke:#333,stroke-width:2px
    style S1R fill:#4ecdc4,stroke:#333,stroke-width:2px
    style S2R fill:#4ecdc4,stroke:#333,stroke-width:2px
    style S3R fill:#4ecdc4,stroke:#333,stroke-width:2px

7.3 Failover Sequence

sequenceDiagram
    participant App as Application
    participant P as Patroni
    participant E as etcd Cluster
    participant M as Master (dies)
    participant R1 as Replica 1 (promoted)
    participant R2 as Replica 2

    Note over M: Master crashes!

    P->>M: Health check (every 10s)
    M--xP: No response

    P->>M: Retry health check
    M--xP: No response (TTL=30s expired)

    P->>E: Report: Master is dead
    E->>E: Remove Master's leader key

    P->>E: Replica 1 requests leader lock<br/>(least replication lag)
    E->>P: Leader lock granted to Replica 1

    P->>R1: pg_ctl promote
    R1->>R1: Promote to Master

    P->>R2: Repoint to new Master (R1)
    R2->>R1: Start replication from new Master

    P->>App: Update endpoint (DNS/VIP)
    App->>R1: New writes go to Replica 1 (now Master)

    Note over App,R2: Failover complete (~30-60s)

8. Aha Moments & Pitfalls

Aha Moments

#1 — Premature sharding is the root of all evil: Sharding thêm ENORMOUSLY complexity (cross-shard queries, distributed transactions, resharding). Cố gắng TRÁNH shard bằng mọi cách: vertical scaling (bigger machine), read replicas, caching, table partitioning, archiving old data. Chỉ shard khi thực sự cần.

#2 — Replication lag không phải bug, nó là feature: Async replication lag là trade-off có chủ đích giữa write latency và consistency. Hãy design application để tolerant với lag thay vì cố loại bỏ hoàn toàn.

#3 — Shard key quyết định TOÀN BỘ query patterns: Chọn shard key xong là “đóng đinh” cách data được access. Thay đổi shard key = migrate toàn bộ data = downtime hoặc cực kỳ tốn resource. Suy nghĩ kỹ access patterns TRƯỚC khi chọn.

#4 — Connection pooling là optimization đầu tiên: Trước khi nghĩ tới sharding hay read replicas, hãy chắc chắn đã dùng PgBouncer/ProxySQL. Nhiều database “quá tải” thực ra chỉ bị cạn connection vì mỗi connection tốn 10MB RAM.

#5 — CAP theorem trong thực tế: Không phải “chọn 2 trong 3” cứng nhắc. Hệ thống thực tế tunable — có thể chọn consistency cho payment flow và availability cho product listing, TRONG CÙNG MỘT HỆ THỐNG.

Pitfalls

Pitfall 1: Premature Sharding

Sai: “Hệ thống mới có 100K users, data 50GB, nhưng tương lai sẽ lớn → shard ngay từ đầu cho chắc.” Đúng: 50GB là tiny cho PostgreSQL. Dùng table partitioning (PARTITION BY RANGE trên created_at) + read replicas trước. Khi nào thực sự cần (>1-2TB, >10K write QPS) mới shard. “Premature optimization is the root of all evil” — Donald Knuth.

Pitfall 2: Wrong Shard Key

Sai: Shard e-commerce theo order_date → Black Friday 1 shard chịu 90% load. Đúng: Shard theo user_id cho user-centric queries, hoặc tenant_id cho multi-tenant. Key phải có high cardinality và even distribution.

Pitfall 3: Replication Lag Reads

Sai: User tạo post, refresh trang, không thấy post → nghĩ bị bug → tạo lại → duplicate. Đúng: Implement read-your-writes consistency — sau khi write, đọc từ Master trong N giây. Hoặc trả về data vừa write trong response (optimistic UI).

Pitfall 4: Split Brain

Sai: Automatic failover promote Replica thành Master mới, nhưng Master cũ vẫn sống (chỉ bị network partition) → 2 Masters nhận write → data diverge → disaster. Đúng: Luôn dùng fencing (STONITH) — force kill Master cũ trước khi promote Master mới. Dùng quorum-based systems (Patroni + etcd 3 nodes).

Pitfall 5: Ignoring Connection Pooling

Sai: App có 500 instances, mỗi instance mở 20 connections = 10,000 connections trực tiếp tới PostgreSQL → PostgreSQL fork 10,000 processes → OOM crash. Đúng: Dùng PgBouncer (transaction pooling mode). 10,000 app connections → 50 backend connections. PostgreSQL chỉ cần handle 50 connections.

Pitfall 6: Backup nhưng không test restore

Sai: “Em có backup hàng ngày rồi, yên tâm.” Đúng: Backup chưa restore thành công thì chưa phải backup. Phải test restore định kỳ (ít nhất monthly). Nhiều team phát hiện backup bị corrupt khi cần restore lúc disaster → game over.

9. Internal Links

Topic	Link	Liên quan
Scaling fundamentals	Tuan-01-Scale-From-Zero-To-Millions	Single server → distributed
Estimation	Tuan-02-Back-of-the-envelope	Tính khi nào cần shard
Networking	Tuan-03-Networking-DNS-CDN	Cross-region replication latency
Load Balancer	Tuan-05-Load-Balancer	Route traffic tới app layer
Cache	Tuan-06-Cache-Strategy	Giảm read load trước khi shard
Message Queue	Tuan-08-Message-Queue	Buffer write spikes, async processing
Rate Limiter	Tuan-09-Rate-Limiter	Protect DB from overload
Consistent Hashing	Tuan-10-Consistent-Hashing	Hash-based sharding cải tiến
Monitoring	Tuan-13-Monitoring-Observability	Monitor replication lag, shard health
Security & Encryption	Tuan-15-Data-Security-Encryption	DB encryption, KMS, audit

Tham khảo

Alex Xu, System Design Interview — Chapter 1: Scale from Zero to Millions (Database Replication & Sharding)
Martin Kleppmann, Designing Data-Intensive Applications — Chapter 5: Replication, Chapter 6: Partitioning
PostgreSQL Documentation — Streaming Replication
Patroni — PostgreSQL HA
Vitess — YouTube’s Database Clustering
Citus — Distributed PostgreSQL
PgBouncer — Lightweight connection pooler
Tuan-02-Back-of-the-envelope — Estimation framework
Tuan-06-Cache-Strategy — Tại sao cache trước, shard sau
Tuan-10-Consistent-Hashing — Cải tiến hash-based sharding

Tuần tới: Tuan-08-Message-Queue — Decouple services và buffer write spikes

lthieu's notes

Explorer

Tuan-07-Database-Sharding-Replication

Tuần 07: Database Sharding & Replication

1. Context & Why

Analogy đời thường — Thư viện Quốc gia

Tại sao Alex Xu đặt nó ở giai đoạn này?

2. Deep Dive — Các khái niệm cốt lõi

2.1 Replication — Nhân bản dữ liệu

2.1.1 Master-Slave Replication (Primary-Replica)

2.1.2 Master-Master Replication (Multi-Primary)

2.1.3 Synchronous vs Asynchronous vs Semi-Synchronous Replication

2.1.4 Replication Lag — Kẻ thù thầm lặng

2.1.5 Read Replicas — Scale Read

2.1.6 Failover — Khi Master chết

2.2 Sharding — Phân mảnh dữ liệu

2.2.1 Sharding Strategies

Hash-based Sharding (Phân mảnh theo hash)

Range-based Sharding (Phân mảnh theo khoảng)

Directory-based Sharding (Phân mảnh theo bảng tra cứu)

Geo-based Sharding (Phân mảnh theo địa lý)

2.2.2 Shard Key Selection — Quyết định sống còn

2.2.3 Cross-Shard Queries — Nỗi đau của Sharding

2.2.4 Resharding — Khi cần thêm/bớt shard

2.2.5 Hotspot Problem — Shard nóng

2.3 Sharding Middleware & Tools

Vitess (YouTube → CNCF project)

Citus (PostgreSQL extension)

ProxySQL (MySQL Proxy)

PgBouncer (PostgreSQL Connection Pooler)

2.4 SQL vs NoSQL — Ma trận lựa chọn

2.5 ACID vs BASE

2.6 CAP Theorem — Practical Implications

3. Estimation — Khi nào cần Sharding?

3.1 Ngưỡng cần xem xét Sharding

3.2 Ước lượng cho E-commerce platform

3.3 Replication Lag Impact Estimation

3.4 Storage per Shard Estimation

4. Security First — Bảo mật Database

4.1 SQL Injection Prevention — Phòng chống tiêm SQL

4.2 Database Encryption at Rest — Mã hoá dữ liệu lưu trữ

4.3 Network Segmentation cho Database

4.4 Audit Logging

4.5 Principle of Least Privilege — Quyền tối thiểu

4.6 Backup Encryption

5. DevOps/Ops-Light — Triển khai & Vận hành

5.1 PostgreSQL Streaming Replication Setup

5.2 pg_stat_replication Monitoring

5.3 Prometheus postgres_exporter

5.4 Automated Failover with Patroni

5.5 Backup Strategy

6. Code Examples

6.1 Docker-compose: PostgreSQL Master-Slave

6.2 Python: Connection with Read Replica Routing

6.3 Node.js: Connection with Read Replica Routing

6.4 Shard Routing Middleware (Python)

7. System Design Diagrams

7.1 Replication Topology

7.2 Sharded Architecture

7.3 Failover Sequence

8. Aha Moments & Pitfalls

Aha Moments

Pitfalls

Pitfall 1: Premature Sharding

Pitfall 2: Wrong Shard Key

Pitfall 3: Replication Lag Reads

Pitfall 4: Split Brain

Pitfall 5: Ignoring Connection Pooling

Pitfall 6: Backup nhưng không test restore

9. Internal Links

Tham khảo

Graph View

Table of Contents

Backlinks