Tuần 20: Design Distributed Key-Value Store (Capstone)

“Nếu em hiểu cách xây một distributed key-value store từ đầu đến cuối, em hiểu 80% cốt lõi của distributed systems. Mọi thứ khác chỉ là biến thể.”

Tags: system-design distributed-systems key-value-store capstone alex-xu Student: Hieu Prerequisite: Tuan-01-Scale-From-Zero-To-Millions · Tuan-02-Back-of-the-envelope · Tuan-07-Database-Sharding-Replication · Tuan-10-Consistent-Hashing Liên quan: Tuan-03-Networking-DNS-CDN · Tuan-05-Load-Balancer · Tuan-06-Cache-Strategy · Tuan-08-Message-Queue · Tuan-09-Rate-Limiter · Tuan-11-Microservices-Pattern · Tuan-13-Monitoring-Observability · Tuan-14-AuthN-AuthZ-Security · Tuan-15-Data-Security-Encryption

Tại sao đây là Capstone Week?

Hieu, tuần này là tuần tổng hợp — nơi mọi thứ em đã học từ Tuần 01 đến Tuần 19 hội tụ lại trong một bài toán duy nhất. Một distributed key-value store (KV store) đụng chạm đến:

Khái niệm	Tuần đã học	Vai trò trong KV Store
Scaling	Tuan-01-Scale-From-Zero-To-Millions	Horizontal scaling qua partitioning
Estimation	Tuan-02-Back-of-the-envelope	Tính storage per node, throughput
Networking	Tuan-03-Networking-DNS-CDN	Cross-DC replication latency
API Design	Tuan-04-API-Design-REST-gRPC	put/get API, gRPC inter-node
Load Balancer	Tuan-05-Load-Balancer	Request routing to coordinator
Cache	Tuan-06-Cache-Strategy	Memtable chính là in-memory cache
DB Sharding & Replication	Tuan-07-Database-Sharding-Replication	Consistent hashing + replication factor
Message Queue	Tuan-08-Message-Queue	Hinted handoff queue
Rate Limiter	Tuan-09-Rate-Limiter	Protect nodes from overload
Consistent Hashing	Tuan-10-Consistent-Hashing	Data partitioning core
Monitoring	Tuan-13-Monitoring-Observability	Cluster health, compaction lag
Security	Tuan-14-AuthN-AuthZ-Security · Tuan-15-Data-Security-Encryption	Encryption at rest, inter-node TLS

Các hệ thống KV store nổi tiếng trong thực tế: Amazon DynamoDB, Apache Cassandra, Riak, Voldemort, etcd (CP), Redis Cluster.

Step 1 — Understand the Problem & Establish Design Scope

1.1 Functional Requirements (Yêu cầu chức năng)

Operation	Mô tả	Ví dụ
`put(key, value)`	Lưu cặp key-value. Nếu key đã tồn tại → ghi đè	`put("user:1001", "{name: Hieu}")`
`get(key)`	Trả về value tương ứng với key	`get("user:1001")` → `"{name: Hieu}"`
`delete(key)`	Xoá key (thực tế: đánh tombstone)	`delete("user:1001")`

Constraints:

Key: string, tối đa 256 bytes
Value: blob, tối đa 1 MB (thường < 10 KB)
Không hỗ trợ range query (khác biệt với sorted KV store)

1.2 Non-Functional Requirements (Yêu cầu phi chức năng)

Yêu cầu	Mục tiêu	Giải thích
High Availability (Tính sẵn sàng cao)	99.99% uptime	Hệ thống vẫn hoạt động kể cả khi một số node chết
High Scalability (Khả năng mở rộng)	Hàng triệu key, hàng trăm node	Tự động scale khi data tăng
Tunable Consistency (Tính nhất quán có thể điều chỉnh)	Strong → Eventual	Client chọn mức consistency phù hợp use case
Automatic Scaling	Thêm/bớt node không downtime	Consistent hashing giúp rebalance tự động
Low Latency	p99 < 10ms cho read/write	Cần in-memory structures
Fault Tolerance (Chịu lỗi)	Survive node/rack/DC failure	Replication + failure detection

1.3 Capacity Estimation (Ước lượng năng lực)

Assumptions:

Thông số	Giá trị	Giải thích
Total keys	1 Billion (1B)	Hệ thống quy mô lớn
Average value size	10 KB	JSON document trung bình
Read:Write ratio	10:1	Read-heavy
DAU (applications)	10,000 services	Microservices gọi vào KV store
Reads/service/day	1,000,000	1M reads mỗi service
Writes/service/day	100,000	100K writes mỗi service
Replication factor (N)	3	Mỗi key lưu trên 3 node
Number of nodes	100	Cluster ban đầu

QPS Calculation:

R e a d QP S_{a vg} = \frac{10 , 000 \times 1 , 000 , 000}{86 , 400} \approx 115, 741 re a d s / s \approx 116 K re a d s / s

W r i t e QP S_{a vg} = \frac{10 , 000 \times 100 , 000}{86 , 400} \approx 11, 574 w r i t es / s \approx 12 K w r i t es / s

R e a d QP S_{p e ak} = 116 K \times 3 = 348 K re a d s / s

W r i t e QP S_{p e ak} = 12 K \times 3 = 36 K w r i t es / s

Per-node QPS (100 nodes, đều tải nhờ consistent hashing):

R e a d QP S_{p er n o d e} = \frac{348 K}{100} = 3, 480 re a d s / s

W r i t e QP S_{p er n o d e} = \frac{36 K}{100} = 360 w r i t es / s

Nhận xét: Mỗi node chỉ handle ~3,800 QPS tổng. Một node Cassandra trung bình handle 10K–50K ops/s → dư sức. Nhưng phải tính cả replication write amplification.

Write Amplification do Replication:

Mỗi write từ client phải được persist tại N=3 replicas. Tổng writes trong cluster:

T o t a l c l u s t er w r i t es / s = Cl i e n t w r i t es / s \times N = 36, 000 \times 3 = 108, 000 w r i t es / s

Phân bổ đều trên 100 nodes:

A c t u a l w r i t e QP S_{p er n o d e} = \frac{108 , 000}{100} = 1, 080 w r i t es / s

Cách hiểu trực quan: Mỗi node đảm nhiệm 2 vai trò cho các write khác nhau — (1) coordinator cho ~360 write/s mà nó nhận trực tiếp từ client, (2) replica cho ~720 write/s được forward từ coordinator khác (vì với RF=3, mỗi key có 3 replicas, mỗi node nằm trong tập replica của ~2/100 phần key space của các coordinator khác). Tổng vẫn là 1,080 writes/s/node.

Với LSM-tree compaction, disk write amplification thêm 5-10x → thực tế disk I/O ~5K-10K writes/s/node. Đây là lý do KV store cần NVMe SSD.

Storage Estimation:

R a w d a t a = 1 B k eys \times 10 K B = 10 TB

Wi t h re pl i c a t i o n (N = 3) = 10 TB \times 3 = 30 TB

St or a g e p er n o d e = \frac{30 TB}{100 n o d es} = 300 GB / n o d e

Overhead thực tế (index, bloom filter, commit log, compaction temp space):

A c t u a l s t or a g e p er n o d e = 300 GB \times 2.5 = 750 GB / n o d e

Rule of thumb: Nhân raw storage với 2x–3x cho overhead. Cassandra khuyến nghị giữ disk usage < 50% để compaction có đủ chỗ.

R eco mm e n d e d d i s k p er n o d e = 750 GB \times 2 = 1.5 TB / n o d e

Bandwidth Estimation:

R e a d ban d w i d t h_{t o t a l} = 348 K \times 10 K B = 3.48 GB / s = 27.8 G b p s

W r i t e ban d w i d t h_{t o t a l} = 36 K \times 10 K B \times 3 (re pl i c a t i o n) = 1.08 GB / s = 8.6 G b p s

B an d w i d t h p er n o d e \approx \frac{27.8 + 8.6}{100} \approx 364 M b p s / n o d e

Mỗi node cần NIC tối thiểu 1 Gbps, khuyến nghị 10 Gbps cho headroom.

Replication Network Overhead:

I n t er - n o d e re pl i c a t i o n t r a ff i c = W r i t e QP S_{t o t a l} \times a vg_s i ze \times (N - 1)

= 36 K \times 10 K B \times 2 = 720 MB / s = 5.76 G b p s

Tóm tắt Estimation:

Metric	Value
Read QPS (peak)	~348K/s
Write QPS (peak)	~36K/s
Total raw data	10 TB
Total with replication (N=3)	30 TB
Storage per node (with overhead)	~750 GB
Recommended disk per node	1.5 TB
Bandwidth per node	~364 Mbps
Cluster size	100 nodes

Step 2 — High-Level Design

2.1 CAP Theorem Deep Dive

CAP Theorem là gì?

CAP Theorem (Eric Brewer, 2000) phát biểu rằng trong một hệ thống phân tán, ta chỉ có thể đảm bảo tối đa 2 trong 3 tính chất:

Tính chất	Tiếng Việt	Ý nghĩa
Consistency (Tính nhất quán)	Nhất quán	Mọi node đều trả về cùng data tại cùng thời điểm
Availability (Tính sẵn sàng)	Sẵn sàng	Mọi request đều nhận được response (không bị reject)
Partition Tolerance (Chịu phân vùng)	Chịu lỗi mạng	Hệ thống vẫn hoạt động khi mạng giữa các node bị đứt

Quan trọng: Trong thực tế, P luôn phải có vì network partition là không thể tránh khỏi. Vậy lựa chọn thực sự là giữa CP và AP.

CP vs AP — Lựa chọn thực tế

flowchart LR
    subgraph "CAP Theorem"
        C["🔒 Consistency"]
        A["✅ Availability"]
        P["🌐 Partition Tolerance"]
    end

    subgraph "CP Systems"
        CP1["etcd / ZooKeeper"]
        CP2["HBase"]
        CP3["MongoDB (default)"]
        CP4["Google Spanner"]
    end

    subgraph "AP Systems"
        AP1["Cassandra"]
        AP2["DynamoDB"]
        AP3["Riak"]
        AP4["CouchDB"]
    end

    C --- CP1
    C --- CP2
    C --- CP3
    C --- CP4
    A --- AP1
    A --- AP2
    A --- AP3
    A --- AP4

    style C fill:#e53935,color:#fff
    style A fill:#43a047,color:#fff
    style P fill:#1e88e5,color:#fff

Đặc điểm	CP System	AP System
Khi network partition	Reject writes để giữ consistency	Accept writes dù data có thể inconsistent
Use case	Banking, inventory count, config mgmt	Shopping cart, social feed, session store
Trade-off	Downtime khi partition	Stale/conflicting data khi partition
Recovery	Tự động khi partition heal	Cần conflict resolution
Ví dụ thực tế	etcd (Kubernetes config), ZooKeeper	DynamoDB (Amazon shopping cart), Cassandra

Tại sao KV Store thường chọn AP?

Amazon DynamoDB được sinh ra từ bài toán: “Giỏ hàng của khách không bao giờ được mất, kể cả khi data center sập.” Mất availability = mất tiền. Inconsistency tạm thời thì có thể resolve sau.

Aha Moment: Trong thực tế, AP system không có nghĩa là “không có consistency”. Mà là consistency được nới lỏng (eventual consistency) và có thể điều chỉnh (tunable consistency). Đây chính là sức mạnh của hệ thống KV store hiện đại.

2.2 High-Level Architecture

flowchart TB
    Client["Client Application"]
    LB["Load Balancer"]

    Client -->|"put(k,v) / get(k)"| LB

    subgraph Cluster["KV Store Cluster (Consistent Hash Ring)"]
        direction TB
        Coord["Coordinator Node<br/>(any node can be coordinator)"]

        subgraph Ring["Consistent Hash Ring"]
            N1["Node A<br/>Token: 0-85"]
            N2["Node B<br/>Token: 86-170"]
            N3["Node C<br/>Token: 171-255"]
            N4["Node D<br/>Token: 256-340"]
            N5["Node E<br/>Token: 341-425"]
            N6["Node F<br/>Token: 426-511"]
        end

        Coord -->|"Route by hash(key)"| N1
        Coord -->|"Replicate"| N2
        Coord -->|"Replicate"| N3
    end

    LB --> Coord

    subgraph NodeInternal["Inside Each Node"]
        CL["Commit Log<br/>(append-only, WAL)"]
        MT["Memtable<br/>(in-memory sorted)"]
        BF["Bloom Filter"]
        SS["SSTables on disk<br/>(sorted, immutable)"]
    end

    N1 --- NodeInternal

    style Coord fill:#ff9800,color:#000
    style N1 fill:#4caf50,color:#fff
    style N2 fill:#4caf50,color:#fff
    style N3 fill:#4caf50,color:#fff
    style MT fill:#ffeb3b,color:#000
    style SS fill:#90a4ae,color:#000

Giải thích luồng:

Client gọi put("user:1001", value) hoặc get("user:1001")
Load Balancer route request tới bất kỳ node nào trong cluster
Node nhận request trở thành Coordinator cho request đó
Coordinator dùng consistent hashing để xác định key thuộc node nào → route tới node chủ (primary) và N-1 replicas
Coordinator chờ W (write) hoặc R (read) acknowledgements rồi trả response cho client

Key insight: Không có “master node” — bất kỳ node nào cũng có thể là coordinator. Đây là kiến trúc peer-to-peer (masterless), khác với master-slave.

2.3 Data Partitioning — Consistent Hashing

Chi tiết thuật toán: Tuan-10-Consistent-Hashing

flowchart LR
    subgraph "Consistent Hash Ring"
        direction TB
        R["Hash Ring<br/>0 ────── 2^128"]

        A["Node A<br/>pos: 30, 120, 250"]
        B["Node B<br/>pos: 70, 180, 310"]
        C["Node C<br/>pos: 100, 220, 380"]
    end

    K1["key1 → hash: 45<br/>→ Node B (70)"]
    K2["key2 → hash: 130<br/>→ Node C (220... no, 180)<br/>→ Node B (180)"]
    K3["key3 → hash: 260<br/>→ Node B (310)"]

    style A fill:#e53935,color:#fff
    style B fill:#1e88e5,color:#fff
    style C fill:#43a047,color:#fff

Tại sao dùng Virtual Nodes (Vnodes)?

Mỗi physical node sở hữu nhiều vị trí trên hash ring (virtual nodes). Lợi ích:

Phân bổ đều tải: Tránh tình trạng 1 node nhận quá nhiều key
Thêm/bớt node mượt mà: Chỉ cần di chuyển một phần nhỏ data
Heterogeneous hardware: Node mạnh → nhiều vnodes hơn

Vn o d es p er n o d e = \frac{T o t a l v n o d es}{N u mb er o f n o d es}

Cassandra mặc định: 256 vnodes/node. Với 100 nodes → 25,600 vnodes trên ring.

2.4 Data Replication

Mỗi key được replicate lên N node liên tiếp trên ring (theo chiều kim đồng hồ).

flowchart LR
    subgraph "Replication with N=3"
        direction LR
        NA["Node A<br/>(Primary)"]
        NB["Node B<br/>(Replica 1)"]
        NC["Node C<br/>(Replica 2)"]
        ND["Node D"]
        NE["Node E"]
    end

    K["key: user:1001<br/>hash → Node A"]

    K --> NA
    NA -->|"Replicate"| NB
    NA -->|"Replicate"| NC

    style NA fill:#e53935,color:#fff,stroke-width:3px
    style NB fill:#ff7043,color:#fff
    style NC fill:#ff7043,color:#fff
    style ND fill:#bdbdbd,color:#000
    style NE fill:#bdbdbd,color:#000

Rack-aware replication: Đảm bảo N replicas nằm trên khác rack (hoặc khác data center) để survive rack/DC failure.

P (d a t a l oss) = P (a ll N re pl i c a s f ai l s im u lt an eo u s l y)

Với N=3, mỗi node có availability 99.9%:

P (d a t a l oss) = (1 - 0.999)^{3} = 1 0^{- 9} = 0.0000001%

Với replication factor N=3, khả năng mất data là 1 trong 1 tỷ. Đó là lý do N=3 là chuẩn industry.

Step 3 — Design Deep Dive

3.1 Consistency Models (Mô hình nhất quán)

Model	Tiếng Việt	Mô tả	Ví dụ
Strong Consistency	Nhất quán mạnh	Sau khi write thành công, mọi read đều thấy giá trị mới nhất	Banking balance
Eventual Consistency	Nhất quán cuối cùng	Sau khi write, read có thể thấy giá trị cũ trong một khoảng thời gian. Cuối cùng tất cả replicas sẽ converge	Social media likes count
Causal Consistency	Nhất quán nhân quả	Nếu operation A xảy ra trước B (A causes B), thì mọi node đều thấy A trước B. Các operations không liên quan có thể thấy thứ tự khác	Chat messages trong cùng thread

3.2 Quorum Consensus — Tunable Consistency

Đây là cơ chế cho phép điều chỉnh mức consistency dựa trên 3 tham số:

Tham số	Ý nghĩa
N	Replication factor — số replicas cho mỗi key
W	Write quorum — số replicas phải confirm write trước khi trả success
R	Read quorum — số replicas phải respond trước khi trả data

Quy tắc vàng:

W + R > N ⟹ St ro n g C o n s i s t e n cy

Vì nếu $W + R > N$ , tập hợp nodes đã write và tập hợp nodes đang read chắc chắn có giao (overlap). Ít nhất 1 node trong read quorum có data mới nhất.

Các cấu hình phổ biến

Config	N	W	R	W+R	Consistency	Đặc điểm
Strong	3	2	2	4 > 3	Strong	Balanced read/write latency
Strong (write-heavy)	3	1	3	4 > 3	Strong	Fast write, slow read
Strong (read-heavy)	3	3	1	4 > 3	Strong	Slow write, fast read
Eventual	3	1	1	2 < 3	Eventual	Fastest, lowest consistency
ONE quorum	3	1	1	2 < 3	Eventual	DynamoDB default cho non-critical

flowchart TD
    subgraph "Write with W=2, N=3"
        Client1["Client: put(k, v)"]
        Coord1["Coordinator"]
        R1["Replica 1 ✅"]
        R2["Replica 2 ✅"]
        R3["Replica 3 ⏳ (slow)"]

        Client1 --> Coord1
        Coord1 --> R1
        Coord1 --> R2
        Coord1 --> R3
        R1 -->|"ACK"| Coord1
        R2 -->|"ACK"| Coord1
        Coord1 -->|"Success<br/>(W=2 reached)"| Client1
    end

    subgraph "Read with R=2, N=3"
        Client2["Client: get(k)"]
        Coord2["Coordinator"]
        R4["Replica 1: v2 ✅"]
        R5["Replica 2: v2 ✅"]
        R6["Replica 3: v1 (stale) ⏳"]

        Client2 --> Coord2
        Coord2 --> R4
        Coord2 --> R5
        Coord2 --> R6
        R4 -->|"v2"| Coord2
        R5 -->|"v2"| Coord2
        Coord2 -->|"Return v2<br/>(latest by timestamp)"| Client2
    end

    style R1 fill:#4caf50,color:#fff
    style R2 fill:#4caf50,color:#fff
    style R3 fill:#ff9800,color:#000
    style R4 fill:#4caf50,color:#fff
    style R5 fill:#4caf50,color:#fff
    style R6 fill:#ff9800,color:#000

Latency trade-off:

W r i t e l a t e n cy \propto W (phải chờ W nodes ACK)

R e a d l a t e n cy \propto R (phải chờ R nodes respond)

Khi tăng W → write chậm hơn nhưng read nhanh hơn (vì R có thể nhỏ). Và ngược lại.

3.3 Inconsistency Resolution — Xử lý xung đột

Vấn đề: Concurrent Writes

Khi 2 client cùng write vào 1 key trên 2 node khác nhau (do network partition hoặc concurrent access):

Timeline:
  Client A: put("cart", [iPhone])     → Node 1 lúc t=1
  Client B: put("cart", [MacBook])    → Node 2 lúc t=2

Node 1 có [iPhone], Node 2 có [MacBook]. Đâu là giá trị đúng?

Giải pháp 1: Last-Write-Wins (LWW)

Mỗi write đi kèm timestamp. Khi conflict, giá trị có timestamp lớn hơn thắng.

reso l v e (v_{1}, t_{1}, v_{2}, t_{2}) = {v_{1} v_{2} if t_{1} > t_{2} if t_{2} > t_{1}

Ưu điểm: Đơn giản, dễ implement. Nhược điểm: Mất data! Trong ví dụ trên, hoặc iPhone hoặc MacBook sẽ bị mất. Cassandra dùng LWW làm default.

Giải pháp 2: Vector Clocks (Đồng hồ vector)

Vector clock là một danh sách các cặp (node, counter) gắn với mỗi value. Nó cho phép phát hiện:

Causally related writes: Nếu $V C (A) \leq V C (B)$ → B xảy ra sau A → B thắng
Concurrent writes: Nếu $V C (A) \neq \leq V C (B)$ và $V C (B) \neq \leq V C (A)$ → conflict → cần client resolve

Ví dụ chi tiết:

Bước 1: Client A write lên Node Sx
  cart = [iPhone]
  VC = {Sx: 1}

Bước 2: Client A write tiếp lên Node Sx
  cart = [iPhone, AirPods]
  VC = {Sx: 2}

Bước 3: Client B read từ Sx, rồi write lên Node Sy
  cart = [iPhone, AirPods, MacBook]
  VC = {Sx: 2, Sy: 1}

Bước 4: Client C read từ Sx (chưa thấy update của B), write lên Node Sz
  cart = [iPhone, AirPods, iPad]
  VC = {Sx: 2, Sz: 1}

Bước 5: Client D read cả 2 versions:
  Version 1: [iPhone, AirPods, MacBook]  VC = {Sx: 2, Sy: 1}
  Version 2: [iPhone, AirPods, iPad]     VC = {Sx: 2, Sz: 1}

  → VC1 và VC2 là CONCURRENT (không ai "sau" ai)
  → Client D phải merge: [iPhone, AirPods, MacBook, iPad]
  → Write merged value: VC = {Sx: 2, Sy: 1, Sz: 1}

So sánh VC:

V C (A) \leq V C (B) ⟺ \forall n o d e i : A [i] \leq B [i]

Nếu ${S x : 2, S y : 1}$ vs ${S x : 2, S z : 1}$ : Sy=1 vs Sy=0 (B thua), Sz=0 vs Sz=1 (A thua) → Concurrent → conflict.

Nhược điểm Vector Clock: Vector clock phình to theo số nodes đã tham gia write. Giải pháp: truncate entries cũ nhất khi vector vượt quá threshold (ví dụ: giữ tối đa 10 entries).

3.4 Handling Failures — Xử lý lỗi

3.4.1 Failure Detection — Gossip Protocol

Trong cluster 100 nodes, làm sao biết node nào chết?

Naive approach: Mỗi node ping tất cả node khác → $O (N^{2})$ messages → không scale.

Gossip Protocol (Epidemic Protocol):

Mỗi node duy trì một membership list (danh sách node + heartbeat counter)
Định kỳ (mỗi 1s), mỗi node chọn random một node khác và gửi membership list
Node nhận sẽ merge 2 list: giữ heartbeat counter cao hơn cho mỗi entry
Nếu heartbeat của 1 node không tăng sau T seconds → đánh dấu suspected failure
Nếu sau thêm T’ seconds vẫn không tăng → đánh dấu confirmed failure

sequenceDiagram
    participant A as Node A
    participant B as Node B
    participant C as Node C
    participant D as Node D

    Note over A,D: Gossip Round 1 (t=0s)

    A->>B: Membership: {A:10, B:8, C:9, D:7}
    B->>B: Merge: {A:10, B:9, C:9, D:7}<br/>(B tăng counter của chính mình)

    Note over A,D: Gossip Round 2 (t=1s)

    B->>C: Membership: {A:10, B:9, C:9, D:7}
    C->>C: Merge: {A:10, B:9, C:10, D:7}

    Note over A,D: Node D dies at t=2s

    Note over A,D: Gossip Round 3-5 (t=2-4s)

    C->>A: {A:10, B:9, C:10, D:7}
    A->>A: D's heartbeat still 7...<br/>T_fail = 5s, not reached yet

    Note over A,D: Gossip Round 8 (t=7s)

    A->>A: D's heartbeat = 7 for 5s<br/>→ SUSPECT D is down!

    A->>B: {A:15, B:12, C:14, D:7💀}
    B->>C: Propagate D's failure

Phi Accrual Failure Detector (Cassandra dùng cái này):

Thay vì binary “alive/dead”, dùng phi value (giá trị phi) — xác suất node đã chết dựa trên lịch sử heartbeat:

ϕ = - lo g_{10} (P_{l a t er} (t_{n o w} - t_{l a s t}))

Trong đó $P_{l a t er}$ là xác suất heartbeat đến muộn hơn khoảng thời gian hiện tại, dựa trên distribution lịch sử.

$ϕ < 1$ : Rất có thể node còn sống
$ϕ > 8$ : Gần như chắc chắn node đã chết (Cassandra default threshold)

Ưu điểm so với fixed timeout: Tự adapt theo network conditions. Mạng chậm → threshold tự nới rộng.

3.4.2 Temporary Failures — Sloppy Quorum & Hinted Handoff

Khi một node trong quorum tạm thời không thể trả lời (chưa chết hẳn, chỉ chậm hoặc đang restart):

Strict Quorum: Reject write → giảm availability.

Sloppy Quorum: Thay vì chờ node chết, gửi write cho node kế tiếp trên ring (node “thay thế”).

flowchart LR
    subgraph "Normal Operation (N=3)"
        W1["Write key X"]
        NA1["Node A ✅"]
        NB1["Node B ✅"]
        NC1["Node C ✅"]
        W1 --> NA1
        W1 --> NB1
        W1 --> NC1
    end

    subgraph "Node B is down → Sloppy Quorum"
        W2["Write key X"]
        NA2["Node A ✅"]
        NB2["Node B ❌ (down)"]
        NC2["Node C ✅"]
        ND2["Node D ✅<br/>(temporary holder)"]
        W2 --> NA2
        W2 -.->|"Failed"| NB2
        W2 --> NC2
        W2 -->|"Hinted Handoff"| ND2
    end

    subgraph "Node B recovers"
        ND3["Node D"]
        NB3["Node B ✅ (back!)"]
        ND3 -->|"Transfer hinted data<br/>back to B"| NB3
    end

    style NB2 fill:#e53935,color:#fff
    style ND2 fill:#ff9800,color:#000
    style NB3 fill:#4caf50,color:#fff

Hinted Handoff:

Node D nhận data tạm và lưu kèm hint (metadata: “data này thuộc Node B”)
Node D định kỳ check xem Node B đã recover chưa
Khi Node B recover → Node D gửi data trả lại → xoá hint

Giới hạn: Nếu Node B chết quá lâu, hints trên Node D tích tụ → disk đầy. Cassandra giới hạn hint storage (default: 10GB hoặc 3 giờ).

3.4.3 Permanent Failures — Anti-Entropy with Merkle Trees

Khi một node chết vĩnh viễn và được thay thế bằng node mới, hoặc replicas bị drift (data khác nhau do hinted handoff không đầy đủ):

Vấn đề: Làm sao so sánh data giữa 2 replicas một cách hiệu quả? So sánh từng key → quá chậm.

Giải pháp: Merkle Tree (Hash Tree)

Merkle tree là cây mà:

Leaf nodes: hash của từng data block (hoặc range of keys)
Internal nodes: hash của con trái + con phải
Root: hash đại diện cho toàn bộ data

flowchart TD
    subgraph "Replica 1 Merkle Tree"
        R1["Root: abc123"]
        L1["Left: def456"]
        R1R["Right: ghi789"]
        LL1["Keys 1-25<br/>hash: aaa"]
        LR1["Keys 26-50<br/>hash: bbb"]
        RL1["Keys 51-75<br/>hash: ccc"]
        RR1["Keys 76-100<br/>hash: ddd"]

        R1 --> L1
        R1 --> R1R
        L1 --> LL1
        L1 --> LR1
        R1R --> RL1
        R1R --> RR1
    end

    subgraph "Replica 2 Merkle Tree"
        R2["Root: abc123... wait, xyz999!"]
        L2["Left: def456"]
        R2R["Right: DIFFERENT!"]
        LL2["Keys 1-25<br/>hash: aaa ✅"]
        LR2["Keys 26-50<br/>hash: bbb ✅"]
        RL2["Keys 51-75<br/>hash: DIFFERS ❌"]
        RR2["Keys 76-100<br/>hash: ddd ✅"]

        R2 --> L2
        R2 --> R2R
        L2 --> LL2
        L2 --> LR2
        R2R --> RL2
        R2R --> RR2
    end

    style R1 fill:#4caf50,color:#fff
    style R2 fill:#e53935,color:#fff
    style RL2 fill:#e53935,color:#fff
    style R2R fill:#e53935,color:#fff

Quy trình Anti-Entropy:

So sánh root hash → Nếu giống → 2 replicas đồng bộ, xong!
Nếu root khác → so sánh children: Left giống, Right khác
Drill down Right subtree: Keys 51-75 khác, Keys 76-100 giống
Chỉ cần sync Keys 51-75 → giảm lượng data transfer cực lớn

D a t a t o co m p a re (b r u t e f orce) = N_{k eys} \times a vg_v a l u e_s i ze

D a t a t o co m p a re (M er k l e t ree) \approx O (lo g N) \times ha s h_s i ze

Với 1 triệu keys, brute force so sánh ~10 GB. Merkle tree so sánh ~20 hashes (mỗi cái 32 bytes) = 640 bytes. Hiệu quả gấp hàng triệu lần.

3.4.4 Data Center Outage — Cross-DC Replication

Khi cả một data center sập (thiên tai, mất điện toàn bộ):

flowchart TB
    Client["Client"]

    subgraph DC1["Data Center 1 (US-East)"]
        N1A["Node A1"]
        N1B["Node B1"]
        N1C["Node C1"]
    end

    subgraph DC2["Data Center 2 (US-West)"]
        N2A["Node A2"]
        N2B["Node B2"]
        N2C["Node C2"]
    end

    subgraph DC3["Data Center 3 (EU-West)"]
        N3A["Node A3"]
        N3B["Node B3"]
        N3C["Node C3"]
    end

    Client --> DC1
    Client -.->|"Failover"| DC2

    N1A <-->|"Async Replication<br/>~100-150ms"| N2A
    N1A <-->|"Async Replication<br/>~80-120ms"| N3A
    N2A <-->|"Async Replication"| N3A

    style DC1 fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style DC2 fill:#e3f2fd,stroke:#1e88e5,stroke-width:2px
    style DC3 fill:#fce4ec,stroke:#e53935,stroke-width:2px

Cấu hình Cassandra multi-DC (ví dụ):

NetworkTopologyStrategy với replication factor: {DC1: 3, DC2: 3, DC3: 3}
Tổng 9 replicas cho mỗi key → data cực kỳ durable
Write consistency: LOCAL_QUORUM (chỉ cần quorum trong local DC) → low latency
Read consistency: LOCAL_QUORUM (đọc từ local DC)
Cross-DC replication là async → không ảnh hưởng write latency

C ross - D C re pl i c a t i o n l a g \approx RT T_{in t er - D C} + p rocess in g

\approx 100 - 200 m s (U S co a s t t o co a s t)

3.5 Write Path — Luồng ghi chi tiết

flowchart TD
    C["Client: put(key, value)"]
    Coord["Coordinator Node"]
    CL["1. Commit Log<br/>(append-only on disk)<br/>→ Durability guarantee"]
    MT["2. Memtable<br/>(in-memory sorted structure)<br/>→ Red-Black tree / Skip list"]
    FL["Memtable full?"]
    SS["3. Flush to SSTable<br/>(Sorted String Table)<br/>→ Immutable file on disk"]
    CP["4. Compaction<br/>(merge multiple SSTables)"]

    C --> Coord
    Coord -->|"Route to responsible nodes"| CL
    CL --> MT
    MT --> FL
    FL -->|"Yes (size > threshold)"| SS
    FL -->|"No"| DONE["ACK to coordinator"]
    SS --> CP
    CP --> DONE2["Background process"]

    style CL fill:#ff9800,color:#000
    style MT fill:#ffeb3b,color:#000
    style SS fill:#78909c,color:#fff
    style CP fill:#5c6bc0,color:#fff

Chi tiết từng bước:

Step 1 — Commit Log (WAL — Write-Ahead Log):

Ghi vào file append-only trên disk (sequential write → cực nhanh)
Mục đích: Nếu node crash trước khi memtable flush xuống disk → recover từ commit log
Tương tự WAL trong PostgreSQL → Tuan-07-Database-Sharding-Replication

Step 2 — Memtable:

Cấu trúc dữ liệu in-memory, sorted by key (Red-Black tree, Skip list, hoặc B-tree)
Write vào memtable = write vào memory → cực nhanh (microseconds)
Đây chính là lý do write latency của KV store rất thấp

Step 3 — Flush to SSTable:

Khi memtable đạt threshold (ví dụ: 64MB) → flush xuống disk thành SSTable
SSTable (Sorted String Table): file immutable, data sorted by key
Mỗi SSTable đi kèm: index file + bloom filter

Step 4 — Compaction:

Theo thời gian, nhiều SSTables tích tụ → cần merge (compaction)
Compaction: merge nhiều SSTables thành 1, loại bỏ deleted keys (tombstones) và old versions

Toàn bộ kiến trúc này gọi là LSM Tree (Log-Structured Merge Tree). Đây là nền tảng storage engine của Cassandra, RocksDB, LevelDB, HBase.

3.6 Read Path — Luồng đọc chi tiết

flowchart TD
    C["Client: get(key)"]
    Coord["Coordinator Node"]
    MT2["1. Check Memtable<br/>(in-memory)"]
    Found1{"Found?"}
    BF["2. Check Bloom Filter<br/>for each SSTable"]
    BFR{"Bloom Filter<br/>says 'maybe'?"}
    IDX["3. Check SSTable Index<br/>(sparse index)"]
    SS2["4. Read from SSTable<br/>(disk read)"]
    Merge["5. Merge results<br/>(latest timestamp wins)"]
    Return["Return to client"]

    C --> Coord
    Coord --> MT2
    MT2 --> Found1
    Found1 -->|"Yes"| Return
    Found1 -->|"No"| BF
    BF --> BFR
    BFR -->|"No (definitely not here)"| NextSS["Try next SSTable"]
    BFR -->|"Yes (might be here)"| IDX
    IDX --> SS2
    SS2 --> Merge
    NextSS --> BF
    Merge --> Return

    style MT2 fill:#ffeb3b,color:#000
    style BF fill:#ab47bc,color:#fff
    style SS2 fill:#78909c,color:#fff

Bloom Filter — Tại sao cần?

Bloom filter là cấu trúc dữ liệu xác suất:

“Key không có” → 100% chắc chắn không có (no false negatives)
“Key có thể có” → có thể sai (false positives, tỷ lệ ~1%)
Kích thước: vài KB cho hàng triệu keys

FPR = (1 - e^{- \frac{kn}{m}})^{k}

Trong đó: $k$ = số hash functions, $n$ = số elements, $m$ = số bits.

Không có bloom filter: phải check mọi SSTable trên disk cho mỗi read (có thể hàng chục file). Với bloom filter: skip ngay các SSTables chắc chắn không chứa key → giảm disk I/O dramatically.

Read Repair: Khi coordinator nhận response từ R replicas và phát hiện data không đồng nhất → gửi bản mới nhất cho replicas có data cũ → self-healing.

3.7 Compaction Strategies

Compaction là quá trình merge SSTables, loại bỏ tombstones và duplicate keys. Có 2 chiến lược chính:

Size-Tiered Compaction (STCS)

Level 0: [SST1 4MB] [SST2 4MB] [SST3 4MB] [SST4 4MB]
         ↓ merge khi có đủ 4 SSTables cùng size
Level 1: [SST5 16MB] [SST6 16MB] [SST7 16MB] [SST8 16MB]
         ↓ merge
Level 2: [SST9 64MB] ...

Ưu	Nhược
Write throughput cao	Space amplification cao (cần 2x disk tạm thời khi compact)
Tốt cho write-heavy workload	Read chậm hơn (nhiều SSTables chồng chéo)

Leveled Compaction (LCS)

Level 0: [SST1] [SST2] (memtable flushes, có thể overlap)
Level 1: [a-d] [e-h] [i-m] [n-r] [s-z] (non-overlapping, mỗi SSTable ~160MB)
Level 2: [a-b] [c-d] [e-f] ... (10x larger, non-overlapping)

Ưu	Nhược
Read hiệu quả (mỗi key chỉ ở 1 SSTable per level)	Write amplification cao (mỗi write có thể trigger cascade compaction)
Space amplification thấp (~10%)	Write throughput thấp hơn STCS

Cassandra default: STCS cho write-heavy, LCS cho read-heavy tables.

W r i t e A m pl i f i c a t i o n_{STCS} \approx O (lo g_{T} N) (T = s i ze r a t i o)

W r i t e A m pl i f i c a t i o n_{L CS} \approx O (\frac{s i z e _{l e v e l_{i + 1}}}{s i z e _{l e v e l_{i}}}) = O (10) p er l e v e l

3.8 Gossip Protocol for Cluster Membership

flowchart TD
    subgraph "Gossip Protocol — Cluster State"
        N1["Node 1<br/>State: {<br/>  N1: gen5/hb100,<br/>  N2: gen3/hb98,<br/>  N3: gen2/hb99,<br/>  N4: gen4/hb95<br/>}"]

        N2["Node 2<br/>State: {<br/>  N1: gen5/hb97,<br/>  N2: gen3/hb101,<br/>  N3: gen2/hb99,<br/>  N4: gen4/hb90<br/>}"]

        N3["Node 3"]
        N4["Node 4"]

        N1 -->|"1. Random pick:<br/>gossip to N2"| N2
        N2 -->|"2. Merge states:<br/>keep max heartbeat"| N2
        N2 -->|"3. Send merged<br/>state back"| N1
        N1 -->|"4. Merge again"| N1

        N2 -.->|"Next round:<br/>random pick N3"| N3
        N3 -.->|"Next round:<br/>random pick N4"| N4
    end

    style N1 fill:#42a5f5,color:#fff
    style N2 fill:#66bb6a,color:#fff
    style N3 fill:#ffa726,color:#000
    style N4 fill:#ef5350,color:#fff

Convergence time — bao lâu để toàn cluster biết 1 node chết?

T_{co n v er g e n ce} \approx O (lo g N) \times g oss i p_in t er v a l

Với 100 nodes, gossip interval = 1s:

T_{co n v er g e n ce} \approx lo g_{2} (100) \times 1 s \approx 7 s

Mất khoảng 7 giây để 100% cluster biết node X chết. Nhưng detector threshold (phi accrual) có thể cần thêm 5-10s nữa → tổng ~15-20s.

Thông tin gossip truyền tải:

Membership list: node nào đang sống/chết
Token ownership: node nào sở hữu range nào trên hash ring
Schema version: cluster schema có đồng nhất không
Load information: giúp routing thông minh hơn

3.9 Tổng hợp — Complete Architecture

flowchart TB
    Client["Client App"]
    LB["Load Balancer / Client-side routing"]

    Client --> LB

    subgraph Cluster["Distributed KV Store Cluster"]
        direction TB

        subgraph GossipLayer["Gossip Layer (Cluster Membership)"]
            G1["Gossip Protocol"]
            G2["Phi Accrual Failure Detector"]
            G3["Token Ring Manager"]
        end

        subgraph CoordLayer["Coordinator Layer"]
            C1["Request Router<br/>(Consistent Hash)"]
            C2["Quorum Manager<br/>(W/R/N config)"]
            C3["Conflict Resolver<br/>(Vector Clock / LWW)"]
        end

        subgraph StorageLayer["Storage Engine (per node)"]
            CL["Commit Log<br/>(WAL)"]
            MT["Memtable<br/>(in-memory)"]
            BF["Bloom Filters"]
            SST["SSTables<br/>(on disk)"]
            COMP["Compaction<br/>Engine"]
            MK["Merkle Tree<br/>(anti-entropy)"]
        end

        subgraph RepairLayer["Repair & Sync"]
            HH["Hinted Handoff<br/>Queue"]
            RR["Read Repair"]
            AE["Anti-Entropy<br/>Repair"]
        end
    end

    LB --> C1
    C1 --> C2
    C2 --> CL
    CL --> MT
    MT --> SST
    BF --> SST
    COMP --> SST
    G1 --> G2
    G2 --> C1
    G3 --> C1
    C3 --> C2
    HH --> StorageLayer
    RR --> StorageLayer
    AE --> MK

    style GossipLayer fill:#e8eaf6,stroke:#3f51b5
    style CoordLayer fill:#fff3e0,stroke:#ff9800
    style StorageLayer fill:#e8f5e9,stroke:#4caf50
    style RepairLayer fill:#fce4ec,stroke:#e53935

Step 4 — Wrap Up

Tổng kết các trade-offs

Design Decision	Option A	Option B	KV Store thường chọn
Consistency vs Availability	CP (reject when partition)	AP (accept, resolve later)	AP (tunable)
Conflict resolution	LWW (simple, data loss)	Vector Clock (complex, no loss)	LWW (Cassandra) hoặc VC (Riak, DynamoDB)
Compaction	Size-tiered (write-optimized)	Leveled (read-optimized)	Depends on workload
Failure detection	Fixed timeout	Phi accrual	Phi accrual (adaptive)
Quorum	Strict (may block)	Sloppy (always available)	Sloppy cho availability
Replication	Sync (strong, slow)	Async (fast, eventual)	Async + tunable quorum

Tóm tắt components và vai trò

Component	Giải quyết vấn đề gì
Consistent hashing	Data partitioning, automatic rebalancing
Replication (N replicas)	Durability, availability
Quorum (W + R > N)	Tunable consistency
Vector clocks	Conflict detection for concurrent writes
Gossip protocol	Decentralized failure detection, cluster membership
Merkle tree	Efficient data synchronization between replicas
Sloppy quorum + hinted handoff	Availability during temporary failures
LSM tree (commit log + memtable + SSTable)	High write throughput
Bloom filter	Fast read path (avoid unnecessary disk reads)
Compaction	Reclaim space, remove tombstones, merge data

So sánh với các hệ thống thực tế

Feature	Cassandra	DynamoDB	etcd	Redis Cluster
CAP	AP	AP	CP	AP/CP
Consistency	Tunable	Tunable	Strong (Raft)	Eventual
Partitioning	Consistent hash (Murmur3)	Consistent hash	Raft groups	Hash slots (16384)
Replication	Configurable N	3 (fixed)	Raft log	Async master-slave
Conflict resolution	LWW	Vector clocks	Raft consensus	LWW
Storage engine	LSM tree	B-tree (internal)	BoltDB (B+tree)	In-memory + RDB/AOF
Use case	Time-series, IoT, messaging	Shopping cart, session, gaming	Config, service discovery	Cache, session, leaderboard

5. Security — Bảo mật cho Distributed KV Store

5.1 Encryption at Rest (Mã hoá dữ liệu trên disk)

Mỗi node lưu data trên disk (SSTables, commit log) → cần mã hoá.

Layer	Phương pháp	Ưu/Nhược
Disk-level (dm-crypt, LUKS)	Encrypt toàn bộ partition	Transparent, nhưng ai có access vào OS đều đọc được
File-level (SSTable encryption)	Mỗi SSTable encrypted riêng	Granular control, nhưng thêm CPU overhead
Field-level	Encrypt từng value trước khi write	Client-side, KV store không thấy plaintext, nhưng không thể search/filter

E n cry pt i o n o v er h e a d p er w r i t e = T_{A ES - 256 - GCM} \times v a l u e_s i ze

\approx 0.5 μ s / K B \times 10 K B = 5 μ s p er w r i t e

So với write latency ~1ms (disk flush), encryption overhead chỉ thêm 0.5%. Negligible.

Key Management cho N replicas:

T o t a l e n cry pt i o n k eys = N o d es \times Key_ro t a t i o n_p er_ye a r \times R e t e n t i o n_ye a rs

= 100 \times 4 \times 5 = 2, 000 k eys

Cần KMS (AWS KMS, HashiCorp Vault) để quản lý. Xem Tuan-15-Data-Security-Encryption.

5.2 Inter-Node TLS (Mã hoá giao tiếp giữa các node)

Gossip traffic: mang membership info, token ranges → nếu bị intercept, attacker biết topology
Replication traffic: mang actual data → nếu bị intercept, data leak
Mandatory: mTLS (mutual TLS) giữa mọi node trong cluster

# cassandra.yaml — inter-node encryption
server_encryption_options:
  internode_encryption: all          # encrypt tất cả traffic giữa nodes
  keystore: /etc/cassandra/keystore.jks
  keystore_password: ${KEYSTORE_PASS}
  truststore: /etc/cassandra/truststore.jks
  truststore_password: ${TRUSTSTORE_PASS}
  protocol: TLSv1.3
  cipher_suites:
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
  require_client_auth: true          # mTLS: node phải chứng minh identity

5.3 Client Authentication & Access Control

Cơ chế	Mô tả
Username/Password	Basic auth, Cassandra hỗ trợ sẵn
Certificate-based (mTLS)	Client cũng phải có certificate → mạnh hơn
RBAC (Role-Based Access Control)	Phân quyền: read-only, read-write, admin
Keyspace-level ACL	Giới hạn quyền theo keyspace (namespace)

-- Cassandra CQL: Tạo role và phân quyền
CREATE ROLE app_reader WITH PASSWORD = 'secure_pass' AND LOGIN = true;
CREATE ROLE app_writer WITH PASSWORD = 'secure_pass' AND LOGIN = true;
 
GRANT SELECT ON KEYSPACE user_data TO app_reader;
GRANT SELECT, MODIFY ON KEYSPACE user_data TO app_writer;
 
-- Audit: xem ai đang có quyền gì
LIST ALL PERMISSIONS OF app_writer;

5.4 Data Protection with Multiple Replicas

Rủi ro: Data được replicate trên N nodes → attack surface tăng N lần. Nếu 1 node bị compromise, attacker có full data.

Mitigation:

Mỗi node encrypt at rest với unique key (không share key giữa nodes)
Rack-aware placement: replicas trên khác rack → compromise 1 rack chưa đủ
Network segmentation: KV store cluster trong private VLAN, chỉ application layer truy cập được

6. DevOps — Vận hành Cassandra Cluster

6.1 Monitoring với nodetool

# Kiểm tra trạng thái cluster
nodetool status
# Output: UN (Up Normal), DN (Down Normal), UJ (Up Joining), etc.
 
# Xem token ownership của mỗi node
nodetool ring
 
# Kiểm tra compaction đang chạy
nodetool compactionstats
 
# Xem thông tin chi tiết 1 table
nodetool tablestats keyspace_name.table_name
 
# Xem gossip info
nodetool gossipinfo
 
# Kiểm tra thread pool usage
nodetool tpstats
 
# Trigger repair thủ công (anti-entropy)
nodetool repair keyspace_name --full

6.2 Repair Operations

Tại sao cần repair?

Hinted handoff có thể miss data (hint expired, node down quá lâu)
Read repair chỉ fix data được đọc, data không ai đọc vẫn inconsistent
Anti-entropy repair là cách duy nhất đảm bảo 100% data consistency

# Full repair: so sánh toàn bộ data giữa replicas
# WARNING: resource-intensive, chạy trong maintenance window
nodetool repair -full my_keyspace
 
# Incremental repair: chỉ repair data mới từ lần repair trước
# Nhanh hơn, chạy thường xuyên hơn (daily)
nodetool repair my_keyspace
 
# Sub-range repair: repair 1 token range cụ thể
nodetool repair -st 0 -et 1000000 my_keyspace

Rule of thumb: Chạy repair ít nhất 1 lần mỗi gc_grace_seconds (default 10 ngày). Nếu không → tombstones bị xoá trước khi propagate → zombie data (data đã xoá sống lại).

6.3 Compaction Monitoring

# prometheus-alerts.yml — Cassandra compaction alerts
groups:
  - name: cassandra_compaction
    rules:
      - alert: CompactionPending
        expr: cassandra_table_pending_compactions > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: {{ $value }} pending compactions"
          description: "Compaction falling behind. May cause read latency increase."
 
      - alert: CompactionThroughputLow
        expr: rate(cassandra_table_bytes_compacted_total[5m]) < 10485760  # < 10MB/s
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Compaction throughput below 10MB/s on {{ $labels.instance }}"
 
      - alert: TombstoneAccumulation
        expr: cassandra_table_tombstone_scanned_histogram{quantile="0.99"} > 1000
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Too many tombstones scanned per read on {{ $labels.instance }}"

6.4 Prometheus JMX Metrics

Cassandra expose metrics qua JMX → dùng jmx_exporter để Prometheus scrape.

# jmx_exporter_config.yml
---
rules:
  # Read/Write latency per table
  - pattern: org.apache.cassandra.metrics<type=Table, keyspace=(\w+), scope=(\w+), name=(Read|Write)Latency><>(Mean|99thPercentile)
    name: cassandra_table_${3}_latency_${4}
    labels:
      keyspace: "$1"
      table: "$2"
 
  # Pending compactions
  - pattern: org.apache.cassandra.metrics<type=Table, keyspace=(\w+), scope=(\w+), name=PendingCompactions><>Value
    name: cassandra_table_pending_compactions
    labels:
      keyspace: "$1"
      table: "$2"
 
  # SSTable count per table
  - pattern: org.apache.cassandra.metrics<type=Table, keyspace=(\w+), scope=(\w+), name=LiveSSTableCount><>Value
    name: cassandra_table_live_sstable_count
    labels:
      keyspace: "$1"
      table: "$2"
 
  # Bloom filter false positive rate
  - pattern: org.apache.cassandra.metrics<type=Table, keyspace=(\w+), scope=(\w+), name=BloomFilterFalseRatio><>Value
    name: cassandra_table_bloom_filter_fp_ratio
    labels:
      keyspace: "$1"
      table: "$2"
 
  # Gossip active/pending
  - pattern: org.apache.cassandra.metrics<type=ThreadPools, path=internal, scope=Gossip.*
    name: cassandra_threadpool_gossip_$1

6.5 Grafana Dashboard Essentials

Panel	PromQL Query	Threshold
Cluster Status	`count(cassandra_node_status == 1)`	Alert if < expected nodes
Read Latency p99	`cassandra_table_Read_latency_99thPercentile`	< 10ms
Write Latency p99	`cassandra_table_Write_latency_99thPercentile`	< 5ms
Pending Compactions	`cassandra_table_pending_compactions`	Warning > 20, Critical > 100
SSTable Count	`cassandra_table_live_sstable_count`	Warning > 30 per table
Bloom Filter FP Rate	`cassandra_table_bloom_filter_fp_ratio`	Warning > 0.01 (1%)
Disk Usage %	`node_filesystem_avail_bytes / node_filesystem_size_bytes`	Warning < 40%, Critical < 20%
Gossip Health	`cassandra_threadpool_gossip_pending`	Warning > 0 for > 5m

6.6 Capacity Planning

"""
Capacity Planning Calculator for Distributed KV Store
Dùng để dự đoán khi nào cần thêm nodes
"""
 
def plan_capacity(
    current_data_tb: float,
    growth_rate_pct_monthly: float,
    replication_factor: int,
    disk_per_node_tb: float,
    max_disk_usage_pct: float = 0.50,  # Cassandra best practice
    current_nodes: int = 100,
    months_ahead: int = 12,
) -> dict:
    """Dự đoán capacity needs trong N tháng tới."""
    results = []
    data = current_data_tb
 
    for month in range(1, months_ahead + 1):
        data *= (1 + growth_rate_pct_monthly / 100)
        total_with_replication = data * replication_factor
        # Overhead: index, bloom filter, compaction temp space
        total_with_overhead = total_with_replication * 2.5
        usable_per_node = disk_per_node_tb * max_disk_usage_pct
        nodes_needed = int(total_with_overhead / usable_per_node) + 1
 
        results.append({
            "month": month,
            "raw_data_tb": round(data, 2),
            "total_storage_tb": round(total_with_overhead, 2),
            "nodes_needed": nodes_needed,
            "nodes_to_add": max(0, nodes_needed - current_nodes),
        })
 
    return results
 
# Ví dụ: data 10TB, tăng 10%/tháng, N=3, disk 1.5TB/node
plan = plan_capacity(
    current_data_tb=10,
    growth_rate_pct_monthly=10,
    replication_factor=3,
    disk_per_node_tb=1.5,
    current_nodes=100,
    months_ahead=12,
)
for p in plan:
    print(f"Month {p['month']:2d}: {p['raw_data_tb']:8.1f} TB raw, "
          f"{p['nodes_needed']:4d} nodes needed (+{p['nodes_to_add']})")

7. Code Examples

7.1 Python: Consistent Hashing Implementation

"""
Consistent Hashing with Virtual Nodes
Tham chiếu: [[Tuan-10-Consistent-Hashing]]
"""
 
import hashlib
from bisect import bisect_right, insort
from typing import Optional
 
 
class ConsistentHash:
    """
    Consistent hashing ring with virtual nodes.
 
    Mỗi physical node được map thành `num_vnodes` virtual nodes
    trên hash ring để phân bổ đều tải.
    """
 
    def __init__(self, num_vnodes: int = 150):
        self.num_vnodes = num_vnodes
        self.ring: list[int] = []             # sorted list of hash positions
        self.ring_map: dict[int, str] = {}    # hash position → physical node
        self.nodes: set[str] = set()
 
    def _hash(self, key: str) -> int:
        """MD5 hash → integer position trên ring."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
 
    def add_node(self, node: str) -> list[int]:
        """
        Thêm physical node vào ring.
        Returns: list of vnode positions.
        """
        if node in self.nodes:
            return []
 
        self.nodes.add(node)
        positions = []
        for i in range(self.num_vnodes):
            vnode_key = f"{node}:vnode{i}"
            pos = self._hash(vnode_key)
            insort(self.ring, pos)       # insert giữ sorted order
            self.ring_map[pos] = node
            positions.append(pos)
        return positions
 
    def remove_node(self, node: str) -> None:
        """Xoá physical node khỏi ring (graceful decommission)."""
        if node not in self.nodes:
            return
 
        self.nodes.discard(node)
        self.ring = [pos for pos in self.ring if self.ring_map.get(pos) != node]
        self.ring_map = {pos: n for pos, n in self.ring_map.items() if n != node}
 
    def get_node(self, key: str) -> Optional[str]:
        """Tìm node chịu trách nhiệm cho key."""
        if not self.ring:
            return None
 
        h = self._hash(key)
        idx = bisect_right(self.ring, h)
        # Wrap around ring nếu hash > max position
        if idx == len(self.ring):
            idx = 0
        return self.ring_map[self.ring[idx]]
 
    def get_replica_nodes(self, key: str, n: int = 3) -> list[str]:
        """
        Tìm N node liên tiếp trên ring (cho replication).
        Đảm bảo N nodes là DISTINCT physical nodes.
        """
        if not self.ring or n > len(self.nodes):
            return list(self.nodes)
 
        h = self._hash(key)
        idx = bisect_right(self.ring, h)
 
        replicas = []
        seen_nodes = set()
        ring_len = len(self.ring)
 
        for i in range(ring_len):
            pos = self.ring[(idx + i) % ring_len]
            node = self.ring_map[pos]
            if node not in seen_nodes:
                replicas.append(node)
                seen_nodes.add(node)
                if len(replicas) == n:
                    break
 
        return replicas
 
 
# === Demo ===
if __name__ == "__main__":
    ch = ConsistentHash(num_vnodes=150)
 
    for node in ["node-A", "node-B", "node-C", "node-D", "node-E"]:
        ch.add_node(node)
 
    # Test key distribution
    distribution: dict[str, int] = {n: 0 for n in ch.nodes}
    for i in range(100_000):
        node = ch.get_node(f"key:{i}")
        distribution[node] += 1
 
    print("Key distribution across 5 nodes (100K keys):")
    for node, count in sorted(distribution.items()):
        bar = "#" * (count // 500)
        print(f"  {node}: {count:6d} ({count/1000:.1f}%) {bar}")
 
    # Test replication
    replicas = ch.get_replica_nodes("user:1001", n=3)
    print(f"\nReplicas for 'user:1001': {replicas}")
 
    # Test node removal (simulate decommission)
    print(f"\nRemoving node-C...")
    ch.remove_node("node-C")
    new_replicas = ch.get_replica_nodes("user:1001", n=3)
    print(f"New replicas for 'user:1001': {new_replicas}")

7.2 Python: Vector Clock Implementation

"""
Vector Clock — Conflict detection for concurrent writes.
Dùng để xác định: 2 versions là causally related hay concurrent?
"""
 
from __future__ import annotations
from dataclasses import dataclass, field
from copy import deepcopy
from enum import Enum
 
 
class Relation(Enum):
    BEFORE = "BEFORE"         # A happened before B
    AFTER = "AFTER"           # A happened after B
    CONCURRENT = "CONCURRENT" # Neither — conflict!
    EQUAL = "EQUAL"           # Same version
 
 
@dataclass
class VectorClock:
    """
    Vector clock: {node_id: counter} dictionary.
 
    Rules:
    - Khi node X thực hiện write: increment clock[X]
    - Khi node X nhận message từ Y: merge(clock_X, clock_Y), rồi increment clock[X]
    """
 
    clock: dict[str, int] = field(default_factory=dict)
 
    def increment(self, node_id: str) -> VectorClock:
        """Node thực hiện local event → tăng counter."""
        new_clock = deepcopy(self)
        new_clock.clock[node_id] = new_clock.clock.get(node_id, 0) + 1
        return new_clock
 
    def merge(self, other: VectorClock) -> VectorClock:
        """Merge 2 vector clocks: giữ max counter cho mỗi node."""
        merged = VectorClock()
        all_nodes = set(self.clock.keys()) | set(other.clock.keys())
        for node in all_nodes:
            merged.clock[node] = max(
                self.clock.get(node, 0),
                other.clock.get(node, 0),
            )
        return merged
 
    def compare(self, other: VectorClock) -> Relation:
        """
        So sánh 2 vector clocks:
        - BEFORE: self < other (self happened before other)
        - AFTER: self > other (self happened after other)
        - CONCURRENT: neither ≤ nor ≥ → CONFLICT
        - EQUAL: same
        """
        all_nodes = set(self.clock.keys()) | set(other.clock.keys())
 
        self_less = False
        other_less = False
 
        for node in all_nodes:
            s = self.clock.get(node, 0)
            o = other.clock.get(node, 0)
 
            if s < o:
                self_less = True
            elif s > o:
                other_less = True
 
        if self_less and other_less:
            return Relation.CONCURRENT
        elif self_less:
            return Relation.BEFORE
        elif other_less:
            return Relation.AFTER
        else:
            return Relation.EQUAL
 
    def __repr__(self) -> str:
        items = ", ".join(f"{k}:{v}" for k, v in sorted(self.clock.items()))
        return f"VC({items})"
 
 
@dataclass
class VersionedValue:
    """Key-value pair with vector clock for conflict detection."""
 
    key: str
    value: str
    vector_clock: VectorClock
 
    def __repr__(self) -> str:
        return f"({self.key}={self.value}, {self.vector_clock})"
 
 
# === Demo: Shopping Cart conflict scenario ===
if __name__ == "__main__":
    print("=== Shopping Cart Conflict Detection ===\n")
 
    # Step 1: Client A writes on node Sx
    vc1 = VectorClock().increment("Sx")
    v1 = VersionedValue("cart", "[iPhone]", vc1)
    print(f"Step 1 - Client A writes: {v1}")
 
    # Step 2: Client A writes again on node Sx
    vc2 = vc1.increment("Sx")
    v2 = VersionedValue("cart", "[iPhone, AirPods]", vc2)
    print(f"Step 2 - Client A writes: {v2}")
    print(f"  v1 vs v2: {vc1.compare(vc2)} → v2 supersedes v1")
 
    # Step 3: Client B reads v2, writes on node Sy
    vc3 = vc2.merge(VectorClock()).increment("Sy")
    v3 = VersionedValue("cart", "[iPhone, AirPods, MacBook]", vc3)
    print(f"\nStep 3 - Client B writes: {v3}")
 
    # Step 4: Client C reads v2 (hasn't seen v3!), writes on node Sz
    vc4 = vc2.merge(VectorClock()).increment("Sz")
    v4 = VersionedValue("cart", "[iPhone, AirPods, iPad]", vc4)
    print(f"Step 4 - Client C writes: {v4}")
 
    # Step 5: Detect conflict
    relation = vc3.compare(vc4)
    print(f"\nStep 5 - v3 vs v4: {relation}")
    print(f"  v3: {v3}")
    print(f"  v4: {v4}")
 
    if relation == Relation.CONCURRENT:
        # Client-side merge
        merged_vc = vc3.merge(vc4).increment("Sx")
        merged = VersionedValue("cart", "[iPhone, AirPods, MacBook, iPad]", merged_vc)
        print(f"\n  → Client resolves conflict by merging:")
        print(f"  → {merged}")

7.3 Python: Simple In-Memory KV Store with Replication

"""
Simple distributed KV Store simulator.
Demonstrates: consistent hashing, replication, quorum reads/writes.
"""
 
import time
import threading
from dataclasses import dataclass, field
from typing import Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
 
 
@dataclass
class Entry:
    value: str
    timestamp: float
    is_tombstone: bool = False
 
 
class KVNode:
    """Simulates a single KV store node."""
 
    def __init__(self, node_id: str, latency_ms: float = 1.0):
        self.node_id = node_id
        self.latency_ms = latency_ms
        self.store: dict[str, Entry] = {}
        self.is_alive = True
        self._lock = threading.Lock()
 
    def put(self, key: str, value: str) -> bool:
        """Write key-value pair. Returns True if successful."""
        if not self.is_alive:
            raise ConnectionError(f"Node {self.node_id} is down")
 
        time.sleep(self.latency_ms / 1000)  # simulate latency
 
        with self._lock:
            self.store[key] = Entry(
                value=value,
                timestamp=time.time(),
                is_tombstone=False,
            )
        return True
 
    def get(self, key: str) -> Optional[Entry]:
        """Read value for key. Returns None if not found."""
        if not self.is_alive:
            raise ConnectionError(f"Node {self.node_id} is down")
 
        time.sleep(self.latency_ms / 1000)  # simulate latency
 
        with self._lock:
            entry = self.store.get(key)
            if entry and entry.is_tombstone:
                return None
            return entry
 
    def delete(self, key: str) -> bool:
        """Soft delete: write tombstone."""
        if not self.is_alive:
            raise ConnectionError(f"Node {self.node_id} is down")
 
        with self._lock:
            self.store[key] = Entry(
                value="",
                timestamp=time.time(),
                is_tombstone=True,
            )
        return True
 
 
class DistributedKVStore:
    """
    Coordinator that routes requests to nodes using consistent hashing.
    Supports tunable quorum (N, W, R).
    """
 
    def __init__(self, n: int = 3, w: int = 2, r: int = 2):
        self.N = n  # replication factor
        self.W = w  # write quorum
        self.R = r  # read quorum
        self.nodes: dict[str, KVNode] = {}
        self.hash_ring: Optional[object] = None  # ConsistentHash from 7.1
 
    def add_node(self, node: KVNode) -> None:
        self.nodes[node.node_id] = node
 
    def _get_replicas(self, key: str) -> list[KVNode]:
        """Get N replica nodes for a key (simplified: round-robin)."""
        node_ids = sorted(self.nodes.keys())
        # Simple hash to pick starting node
        start = hash(key) % len(node_ids)
        replicas = []
        for i in range(self.N):
            idx = (start + i) % len(node_ids)
            replicas.append(self.nodes[node_ids[idx]])
        return replicas
 
    def put(self, key: str, value: str) -> bool:
        """
        Quorum write: send to N replicas, wait for W ACKs.
        """
        replicas = self._get_replicas(key)
        acks = 0
        errors = []
 
        with ThreadPoolExecutor(max_workers=self.N) as executor:
            futures = {
                executor.submit(node.put, key, value): node
                for node in replicas
            }
 
            for future in as_completed(futures):
                node = futures[future]
                try:
                    if future.result():
                        acks += 1
                        if acks >= self.W:
                            return True  # Quorum reached!
                except ConnectionError as e:
                    errors.append(str(e))
 
        if acks >= self.W:
            return True
 
        raise Exception(
            f"Write quorum not met: {acks}/{self.W} ACKs. "
            f"Errors: {errors}"
        )
 
    def get(self, key: str) -> Optional[str]:
        """
        Quorum read: read from N replicas, wait for R responses,
        return value with latest timestamp.
        """
        replicas = self._get_replicas(key)
        responses: list[tuple[Entry, KVNode]] = []
        errors = []
 
        with ThreadPoolExecutor(max_workers=self.N) as executor:
            futures = {
                executor.submit(node.get, key): node
                for node in replicas
            }
 
            for future in as_completed(futures):
                node = futures[future]
                try:
                    entry = future.result()
                    if entry:
                        responses.append((entry, node))
                    if len(responses) >= self.R:
                        break
                except ConnectionError as e:
                    errors.append(str(e))
 
        if not responses:
            return None
 
        # Return value with latest timestamp (LWW)
        latest = max(responses, key=lambda x: x[0].timestamp)
        return latest[0].value
 
 
# === Demo ===
if __name__ == "__main__":
    # Create cluster with 5 nodes
    store = DistributedKVStore(n=3, w=2, r=2)
    for i in range(5):
        store.add_node(KVNode(f"node-{i}", latency_ms=0.5))
 
    # Write
    store.put("user:1001", '{"name": "Hieu", "role": "student"}')
    print("Written: user:1001")
 
    # Read
    value = store.get("user:1001")
    print(f"Read: user:1001 = {value}")
 
    # Simulate node failure
    store.nodes["node-0"].is_alive = False
    print("\nnode-0 is DOWN")
 
    # Should still work (sloppy quorum)
    value = store.get("user:1001")
    print(f"Read after failure: user:1001 = {value}")
 
    # Write still works (W=2, still have 4 alive nodes)
    store.put("user:1001", '{"name": "Hieu", "role": "engineer"}')
    print("Written updated value despite node-0 being down")

7.4 Python: Merkle Tree Comparison

"""
Merkle Tree — Efficient data comparison between replicas.
Used in anti-entropy repair to find which key ranges are out of sync.
"""
 
import hashlib
from dataclasses import dataclass
from typing import Optional
 
 
@dataclass
class MerkleNode:
    hash_value: str
    left: Optional["MerkleNode"] = None
    right: Optional["MerkleNode"] = None
    key_range: tuple[int, int] = (0, 0)  # (start, end) token range
    is_leaf: bool = False
 
 
class MerkleTree:
    """
    Merkle tree for a range of keys.
    Leaf nodes contain hash of actual data in that key range.
    Internal nodes contain hash(left.hash + right.hash).
    """
 
    def __init__(self, data: dict[int, str], token_range: tuple[int, int] = (0, 1024)):
        """
        Args:
            data: {token_position: value_hash} — sorted by token
            token_range: (min_token, max_token) cho node này
        """
        self.data = data
        self.token_range = token_range
        self.root = self._build(token_range[0], token_range[1])
 
    def _hash(self, content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()[:16]
 
    def _build(self, start: int, end: int, depth: int = 0, max_depth: int = 4) -> MerkleNode:
        """Recursively build Merkle tree."""
        # Leaf node: hash all data in this range
        if depth >= max_depth or end - start <= 1:
            keys_in_range = {k: v for k, v in self.data.items() if start <= k < end}
            content = "".join(f"{k}:{v}" for k, v in sorted(keys_in_range.items()))
            return MerkleNode(
                hash_value=self._hash(content) if content else self._hash("empty"),
                key_range=(start, end),
                is_leaf=True,
            )
 
        mid = (start + end) // 2
        left = self._build(start, mid, depth + 1, max_depth)
        right = self._build(mid, end, depth + 1, max_depth)
 
        return MerkleNode(
            hash_value=self._hash(left.hash_value + right.hash_value),
            left=left,
            right=right,
            key_range=(start, end),
        )
 
    @staticmethod
    def find_differences(
        node1: Optional[MerkleNode],
        node2: Optional[MerkleNode],
    ) -> list[tuple[int, int]]:
        """
        Compare 2 Merkle trees and return list of key ranges that differ.
        This is the magic: O(log N) comparison instead of O(N).
        """
        if node1 is None or node2 is None:
            return [(0, 0)]
 
        # Hashes match → subtrees are identical, no need to go deeper!
        if node1.hash_value == node2.hash_value:
            return []
 
        # Leaf node with different hash → this range needs sync
        if node1.is_leaf or node2.is_leaf:
            return [node1.key_range]
 
        # Internal node: recurse into children
        diffs = []
        diffs.extend(MerkleTree.find_differences(node1.left, node2.left))
        diffs.extend(MerkleTree.find_differences(node1.right, node2.right))
        return diffs
 
 
# === Demo: Compare 2 replicas ===
if __name__ == "__main__":
    # Replica 1 data: {token_position: value_hash}
    replica1_data = {
        10: "hash_a", 50: "hash_b", 100: "hash_c", 200: "hash_d",
        300: "hash_e", 400: "hash_f", 500: "hash_g", 600: "hash_h",
        700: "hash_i", 800: "hash_j", 900: "hash_k", 1000: "hash_l",
    }
 
    # Replica 2: identical except tokens 200 and 700 have different values
    replica2_data = replica1_data.copy()
    replica2_data[200] = "hash_d_MODIFIED"    # Changed!
    replica2_data[700] = "hash_i_MODIFIED"    # Changed!
 
    tree1 = MerkleTree(replica1_data, (0, 1024))
    tree2 = MerkleTree(replica2_data, (0, 1024))
 
    print(f"Replica 1 root hash: {tree1.root.hash_value}")
    print(f"Replica 2 root hash: {tree2.root.hash_value}")
    print(f"Roots match: {tree1.root.hash_value == tree2.root.hash_value}")
 
    diffs = MerkleTree.find_differences(tree1.root, tree2.root)
    print(f"\nKey ranges that differ: {diffs}")
    print(f"Only need to sync {len(diffs)} ranges instead of checking all {len(replica1_data)} keys!")
 
    # In real system: chỉ transfer data trong các ranges khác biệt
    for start, end in diffs:
        keys_to_sync = [k for k in replica1_data if start <= k < end]
        print(f"  Range [{start}, {end}): sync keys {keys_to_sync}")

8. Mermaid Diagrams — Tổng hợp

8.1 Consistent Hashing Ring with Replication

flowchart TD
    subgraph "Consistent Hash Ring (N=3, 6 physical nodes)"
        direction LR

        K["key: 'user:1001'<br/>hash = 42"]

        Ring["
        0 ──── Node A (pos 30) ←── Primary
        │
        50 ─── Node B (pos 70) ←── Replica 1
        │
        100 ── Node C (pos 100) ←── Replica 2
        │
        150 ── Node D (pos 180)
        │
        200 ── Node E (pos 220)
        │
        250 ── Node F (pos 310)
        │
        ↩ wrap to 0
        "]

        K -->|"hash(key)=42 → next node clockwise"| Ring
    end

    Note["key hash=42 → gần Node B nhất (pos 70)<br/>Replicas: B (primary), C, D<br/>(3 distinct physical nodes clockwise)"]

    style K fill:#ff9800,color:#000
    style Note fill:#e8f5e9,color:#000

8.2 Write Path Detail

sequenceDiagram
    participant C as Client
    participant Co as Coordinator
    participant R1 as Replica 1 (Primary)
    participant R2 as Replica 2
    participant R3 as Replica 3

    C->>Co: put("user:1001", value)
    Co->>Co: hash("user:1001") → determine replicas

    par Send to all N=3 replicas
        Co->>R1: Write request
        Co->>R2: Write request
        Co->>R3: Write request
    end

    Note over R1: 1. Append to Commit Log
    Note over R1: 2. Write to Memtable
    R1-->>Co: ACK ✅

    Note over R2: 1. Append to Commit Log
    Note over R2: 2. Write to Memtable
    R2-->>Co: ACK ✅

    Co->>C: Success (W=2 quorum met)

    Note over R3: 1. Append to Commit Log
    Note over R3: 2. Write to Memtable
    R3-->>Co: ACK (late, but data saved)

    Note over R1,R3: Later: Memtable flush → SSTable → Compaction

8.3 Read Path Detail

sequenceDiagram
    participant C as Client
    participant Co as Coordinator
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3

    C->>Co: get("user:1001")
    Co->>Co: hash → determine replicas

    par Read from all N=3 replicas
        Co->>R1: Read request
        Co->>R2: Read request
        Co->>R3: Read request
    end

    Note over R1: Check Memtable → Found!
    R1-->>Co: value_v3 (ts=1003)

    Note over R2: Memtable miss → Bloom Filter<br/>→ SSTable → Found
    R2-->>Co: value_v3 (ts=1003)

    Co->>C: Return value_v3 (R=2 quorum met, latest ts)

    Note over R3: Slow response...
    R3-->>Co: value_v2 (ts=1002) — STALE!

    Note over Co: Read Repair: send v3 to R3
    Co->>R3: Write value_v3 (repair)

8.4 Gossip Protocol Visualization

flowchart TD
    subgraph "T=0s: Initial State"
        A0["A: {A:1, B:1, C:1, D:1, E:1}"]
        B0["B: {A:1, B:1, C:1, D:1, E:1}"]
        C0["C: {A:1, B:1, C:1, D:1, E:1}"]
        D0["D: {A:1, B:1, C:1, D:0, E:1}"]
        E0["E: DEAD 💀"]
    end

    subgraph "T=1s: Round 1"
        A1["A gossips to C"]
        B1["B gossips to D"]
        A1 -->|"A increments own counter"| A1r["A: {A:2, B:1, C:1, D:1, E:1}"]
    end

    subgraph "T=2s: Round 2"
        C1["C gossips to B<br/>B merges: E still at 1"]
        D1["D gossips to A<br/>A sees E still at 1"]
    end

    subgraph "T=5s: E heartbeat stale > 5s"
        AF["A: E suspected! phi > 8"]
        BF["B: E suspected!"]
        CF["C: E suspected!"]
        DF["D: E suspected!"]
    end

    A0 --> A1
    B0 --> B1
    A1r --> C1
    C1 --> AF

    style E0 fill:#e53935,color:#fff
    style AF fill:#ff9800,color:#000
    style BF fill:#ff9800,color:#000
    style CF fill:#ff9800,color:#000
    style DF fill:#ff9800,color:#000

9. Aha Moments — Khoảnh khắc “À ha!”

#1 — LSM Tree trade-off: Write nhanh vì chỉ cần append vào commit log + write vào memory. Cái giá phải trả: read có thể chậm (check memtable → bloom filter → nhiều SSTables) và write amplification do compaction. Không có storage engine nào tối ưu cả read lẫn write — đây là trade-off cốt lõi.

#2 — W + R > N không phải “silver bullet”: Ngay cả khi W + R > N, vẫn có thể đọc stale data nếu: (a) write chưa complete trên W nodes khi read xảy ra, (b) sloppy quorum gửi write cho node ngoài preference list. Strong consistency chỉ đúng khi mọi thứ hoạt động bình thường.

#3 — Gossip protocol = eventual consistency cho metadata: Cluster membership cũng là dạng distributed state. Gossip đảm bảo eventually consistent — tức là có khoảng thời gian 1 node nghĩ node X còn sống trong khi node Y đã biết X chết. Đây là nguồn gốc của nhiều edge case.

#4 — Tombstones là “technical debt” của KV store: Khi delete key, thực tế chỉ ghi tombstone (đánh dấu đã xoá). Tombstone phải tồn tại đủ lâu (gc_grace_seconds) để propagate tới mọi replica. Nếu compaction chạy trước khi repair → tombstone bị xoá → data “sống lại” (zombie resurrection). Đây là bug khó debug nhất trong Cassandra.

#5 — Vector clock giải quyết vấn đề mà timestamp không thể: Trong distributed system, clock skew giữa các node có thể lên tới hàng chục milliseconds. 2 writes cách nhau 5ms trên 2 node khác nhau — timestamp không thể xác định ai trước ai. Vector clock không phụ thuộc vào wall clock → chính xác hơn. Nhưng cái giá: phức tạp hơn, client phải handle conflict.

#6 — Merkle tree biến O(N) thành O(log N): So sánh 1 tỷ keys giữa 2 replicas bằng brute force = transfer 10TB. Bằng Merkle tree = compare vài chục hashes (~1KB). Đây là ví dụ tuyệt vời của “right data structure changes everything”.

#7 — Tất cả đều là trade-off: Distributed KV store là bài tập về trade-off engineering: consistency vs availability, write speed vs read speed, simplicity vs correctness, storage efficiency vs write throughput. Không có “best” design — chỉ có design phù hợp nhất với use case.

10. Common Pitfalls — Sai lầm thường gặp

Pitfall 1: Write Amplification bị đánh giá thấp

Sai: “12K writes/s, Cassandra dư sức.” Đúng: Mỗi client write → N replica writes → compaction rewrite → thực tế 30x–50x write amplification. Với LCS, mỗi byte data có thể bị viết lại 10–30 lần qua các levels compaction.

A c t u a l d i s k w r i t es = Cl i e n t w r i t es \times N \times C o m p a c t i o n am pl i f i c a t i o n

= 12 K \times 3 \times 10 = 360 K d i s k w r i t es / s

Phải tính write amplification khi chọn SSD (SSD có giới hạn write endurance - TBW).

Pitfall 2: Tombstone Accumulation — “Zombie Data”

Sai: Delete hàng triệu key rồi nghĩ data đã mất. Đúng: Tombstones tích tụ, gây:

Read chậm (phải scan qua tombstones)

Disk usage tăng

Nếu repair không chạy trong gc_grace_seconds → zombie resurrection: data đã xoá sống lại!

Giải pháp: Tránh mass delete. Dùng TTL thay vì explicit delete. Monitor tombstone ratio per read.

Pitfall 3: Gossip Protocol Convergence Time bị quên

Sai: “Node chết là cluster biết ngay.” Đúng: Gossip cần $O (lo g N)$ rounds để propagate. Với 100 nodes, 1s interval → 7-20 giây trước khi toàn cluster nhận ra. Trong khoảng thời gian đó, một số requests có thể bị route tới node chết → timeout → tăng latency.

Pitfall 4: Vector Clock Bloat

Sai: Dùng vector clock cho mọi key, mặc kệ size. Đúng: Vector clock có 1 entry per node that ever wrote to the key. Với 100 nodes, mỗi VC entry ~20 bytes → 2KB overhead per key. Với 1 tỷ keys → 2TB chỉ cho vector clocks!

Giải pháp:

Truncate VC khi quá dài (Dynamo giữ max 10 entries, xoá entry cũ nhất)
Chấp nhận risk: truncation có thể gây false concurrent detection

Pitfall 5: Sloppy Quorum tạo ảo giác consistency

Sai: “W=2, R=2, N=3 → W+R > N → strong consistency.” Đúng: Với sloppy quorum, W=2 writes có thể đi tới node D (thay thế node B chết). R=2 reads vẫn đọc từ A, B, C → không overlap → stale read!

Sl o pp y q u or u m : W + R > N KH \hat{O} NG đ ảm bảo strong consistency

Chỉ strict quorum mới đảm bảo overlap. Trade-off: strict quorum → giảm availability khi node down.

Pitfall 6: Hot Key / Hot Partition

Sai: “Consistent hashing phân bổ đều, không cần lo.” Đúng: Nếu 1 key cực hot (celebrity tweet, viral product), tất cả reads/writes tập trung vào N nodes chịu trách nhiệm key đó → overload N nodes, 97 nodes khác rảnh.

Giải pháp:

Client-side caching cho hot reads
Key splitting: hot_key → hot_key:shard_1, hot_key:shard_2, … (scatter reads)
Separate hot key handling layer

Pitfall 7: Compaction làm gián đoạn production traffic

Sai: “Compaction chạy background, không ảnh hưởng gì.” Đúng: Compaction tiêu tốn disk I/O, CPU, và temporary disk space. Khi compaction chạy nặng:

Read latency tăng vọt (disk I/O contention)

Disk usage đột ngột tăng 2x (old + new SSTables cùng tồn tại)

Nếu disk full → node crash → cascade failure

Giải pháp:

Giữ disk usage < 50% (headroom cho compaction)
Throttle compaction throughput trong peak hours
Monitor pending_compactions metric

11. Bài tập tự luyện

Bài 1: Tunable Consistency Scenarios

Cho cluster N=3. Xác định mức consistency cho mỗi config:

Config	W	R	W+R	Strong hay Eventual?	Tại sao?
A	1	1	?	?	?
B	2	2	?	?	?
C	3	1	?	?	?
D	1	3	?	?	?
E	2	1	?	?	?

Bài 2: Vector Clock Comparison

Cho 3 vector clocks:

VC1 = {A:3, B:2, C:1}
VC2 = {A:3, B:3, C:1}
VC3 = {A:4, B:1, C:2}

Xác định:

VC1 vs VC2: BEFORE, AFTER, hay CONCURRENT?
VC1 vs VC3: BEFORE, AFTER, hay CONCURRENT?
VC2 vs VC3: BEFORE, AFTER, hay CONCURRENT?

Bài 3: Capacity Planning

Cluster hiện tại: 50 nodes, mỗi node 2TB disk, N=3, data 5TB raw. Data growth: 20%/tháng.

Tính:

Khi nào cluster hết disk? (giả sử max 50% usage cho compaction headroom)
Cần thêm bao nhiêu node sau 6 tháng?
Inter-node bandwidth nếu write QPS = 20K, avg value = 5KB?

Bài 4: Design Decision

Em đang thiết kế KV store cho 2 use case khác nhau. Chọn config phù hợp:

Use case A: Session store cho e-commerce (availability quan trọng hơn consistency)

CP hay AP?
N, W, R = ?
LWW hay Vector Clock?
STCS hay LCS?

Use case B: Distributed lock manager (consistency quan trọng hơn availability)

CP hay AP?
N, W, R = ?
Giải thích tại sao KV store có thể không phải lựa chọn tốt nhất cho case này

12. Quick Reference — Các con số cần nhớ

Cassandra Performance Benchmarks (single node, SSD)

Operation	Throughput	Latency (p99)
Write (1KB value)	20,000–50,000 ops/s	< 5ms
Read (1KB value, memtable hit)	50,000–100,000 ops/s	< 2ms
Read (1KB value, SSTable)	10,000–30,000 ops/s	< 10ms
Scan (100 rows)	5,000–10,000 ops/s	< 50ms

Key Formulas

Formula	Ý nghĩa
$W + R > N$	Strong consistency condition
$P (d a t a l oss) = (1 - A)^{N}$	Probability all N replicas fail
$T_{g oss i p} \approx O (lo g N) \times in t er v a l$	Gossip convergence time
$FPR = (1 - e^{- kn / m})^{k}$	Bloom filter false positive rate
$W r i t e am p = N \times co m p a c t i o n_f a c t or$	Total write amplification

Typical Configurations

Use Case	N	W	R	Consistency
General purpose	3	2	2	Strong
Write-heavy analytics	3	1	1	Eventual
Read-heavy cache	3	3	1	Strong (slow write, fast read)
Maximum availability	3	1	1	Eventual
Cross-DC	3 per DC	LOCAL_QUORUM	LOCAL_QUORUM	Strong within DC

Tham khảo

Alex Xu, System Design Interview — Chapter 6: Design a Key-Value Store
DeCandia et al., Dynamo: Amazon’s Highly Available Key-Value Store (SOSP 2007)
Lakshman & Malik, Cassandra — A Decentralized Structured Storage System (2010)
sdi.anhvy.dev — Key-Value Store
Tuan-01-Scale-From-Zero-To-Millions — Horizontal scaling
Tuan-02-Back-of-the-envelope — Estimation framework
Tuan-03-Networking-DNS-CDN — Cross-DC latency
Tuan-04-API-Design-REST-gRPC — gRPC cho inter-node communication
Tuan-05-Load-Balancer — Request routing
Tuan-06-Cache-Strategy — Memtable as write-back cache
Tuan-07-Database-Sharding-Replication — Partitioning & replication foundations
Tuan-08-Message-Queue — Hinted handoff as queue
Tuan-09-Rate-Limiter — Protecting nodes from overload
Tuan-10-Consistent-Hashing — Core partitioning algorithm
Tuan-11-Microservices-Pattern — KV store as backing service
Tuan-13-Monitoring-Observability — Cluster monitoring
Tuan-14-AuthN-AuthZ-Security — Client authentication
Tuan-15-Data-Security-Encryption — Encryption at rest, KMS

Capstone complete. Tuần này tổng hợp toàn bộ 19 tuần trước. Nếu em hiểu bài này, em đã có nền tảng vững chắc cho mọi System Design Interview.

lthieu's notes

Explorer

Tuan-20-Design-Key-Value-Store

Tuần 20: Design Distributed Key-Value Store (Capstone)

Tại sao đây là Capstone Week?

Step 1 — Understand the Problem & Establish Design Scope

1.1 Functional Requirements (Yêu cầu chức năng)

1.2 Non-Functional Requirements (Yêu cầu phi chức năng)

1.3 Capacity Estimation (Ước lượng năng lực)

Step 2 — High-Level Design

2.1 CAP Theorem Deep Dive

CAP Theorem là gì?

CP vs AP — Lựa chọn thực tế

Tại sao KV Store thường chọn AP?

2.2 High-Level Architecture

2.3 Data Partitioning — Consistent Hashing

2.4 Data Replication

Step 3 — Design Deep Dive

3.1 Consistency Models (Mô hình nhất quán)

3.2 Quorum Consensus — Tunable Consistency

Các cấu hình phổ biến

3.3 Inconsistency Resolution — Xử lý xung đột

Vấn đề: Concurrent Writes

Giải pháp 1: Last-Write-Wins (LWW)

Giải pháp 2: Vector Clocks (Đồng hồ vector)

3.4 Handling Failures — Xử lý lỗi

3.4.1 Failure Detection — Gossip Protocol

3.4.2 Temporary Failures — Sloppy Quorum & Hinted Handoff

3.4.3 Permanent Failures — Anti-Entropy with Merkle Trees

3.4.4 Data Center Outage — Cross-DC Replication

3.5 Write Path — Luồng ghi chi tiết

3.6 Read Path — Luồng đọc chi tiết

3.7 Compaction Strategies

Size-Tiered Compaction (STCS)

Leveled Compaction (LCS)

3.8 Gossip Protocol for Cluster Membership

3.9 Tổng hợp — Complete Architecture

Step 4 — Wrap Up

Tổng kết các trade-offs

Tóm tắt components và vai trò

So sánh với các hệ thống thực tế

5. Security — Bảo mật cho Distributed KV Store

5.1 Encryption at Rest (Mã hoá dữ liệu trên disk)

5.2 Inter-Node TLS (Mã hoá giao tiếp giữa các node)

5.3 Client Authentication & Access Control

5.4 Data Protection with Multiple Replicas

6. DevOps — Vận hành Cassandra Cluster

6.1 Monitoring với nodetool

6.2 Repair Operations

6.3 Compaction Monitoring

6.4 Prometheus JMX Metrics

6.5 Grafana Dashboard Essentials

6.6 Capacity Planning

7. Code Examples

7.1 Python: Consistent Hashing Implementation

7.2 Python: Vector Clock Implementation

7.3 Python: Simple In-Memory KV Store with Replication

7.4 Python: Merkle Tree Comparison

8. Mermaid Diagrams — Tổng hợp

8.1 Consistent Hashing Ring with Replication

8.2 Write Path Detail

8.3 Read Path Detail

8.4 Gossip Protocol Visualization

9. Aha Moments — Khoảnh khắc “À ha!”

10. Common Pitfalls — Sai lầm thường gặp

Pitfall 1: Write Amplification bị đánh giá thấp

Pitfall 2: Tombstone Accumulation — “Zombie Data”

Pitfall 3: Gossip Protocol Convergence Time bị quên

Pitfall 4: Vector Clock Bloat

Pitfall 5: Sloppy Quorum tạo ảo giác consistency

Pitfall 6: Hot Key / Hot Partition

Pitfall 7: Compaction làm gián đoạn production traffic

11. Bài tập tự luyện

Bài 1: Tunable Consistency Scenarios

Bài 2: Vector Clock Comparison

Bài 3: Capacity Planning

Bài 4: Design Decision

12. Quick Reference — Các con số cần nhớ

Cassandra Performance Benchmarks (single node, SSD)

Key Formulas