Case Study: Design a Distributed Message Queue

“Message Queue la xuong song cua moi distributed system. Khong hieu cach no hoat dong ben trong thi chi la nguoi dung — hieu cach thiet ke no thi moi la kien truc su.”

Tags: system-design message-queue kafka distributed-log replication alex-xu-vol2 case-study Student: Hieu Source: Alex Xu — System Design Interview Volume 2, Chapter 4 Prerequisite: Tuan-01-Scale-From-Zero-To-Millions · Tuan-02-Back-of-the-envelope · Tuan-08-Message-Queue Lien quan: Tuan-07-Database-Sharding-Replication · Tuan-10-Consistent-Hashing · Tuan-13-Monitoring-Observability · Case-Design-Payment-System

1. Context & Why — Tai sao can thiet ke Message Queue tu dau?

1.1 Analogy: He thong buu dien quoc gia

Hieu, tuong tuong em duoc giao nhiem vu thiet ke he thong buu dien quoc gia tu con so khong. Khong phai chi dat mot cai hop thu o goc pho — ma la toan bo he thong: tu luc nguoi gui bo thu vao hop, den luc nguoi nhan cam thu tren tay.

He thong nay phai giai quyet nhung van de sau:

Nhan thu (Ingestion): Hang trieu nguoi gui thu moi ngay. Moi buu cuc phai nhan thu nhanh chong, khong duoc de nguoi gui phai xep hang qua lau. Tuong tu producer gui message vao broker.
Phan loai (Routing): Thu gui den Ha Noi phai di Ha Noi, thu gui Sai Gon phai di Sai Gon. Khong duoc nham. Tuong tu viec partition message theo key — tat ca message cua cung mot khach hang phai vao cung mot partition de dam bao thu tu.
Luu tru (Storage): Truoc khi giao, thu phai duoc luu an toan trong kho. Neu kho bi chay, thu phai co ban sao o kho khac. Tuong tu replication trong message queue — moi partition co nhieu replica de dam bao khong mat du lieu.
Giao hang (Delivery): Nguoi dua thu phai giao dung nguoi, dung dia chi. Neu nguoi nhan khong co nha, phai quay lai giao lan sau (retry). Neu giao nhieu lan ma khong duoc, chuyen vao kho thu chet (Dead Letter Queue). Tuong tu consumer doc message va xu ly.
Dam bao khong mat (Durability): Mot la thu mat = mat long tin cua ca he thong. Tuong tu delivery guarantee — at-least-once, exactly-once.
Dam bao khong trung (Deduplication): Khong duoc giao cung mot la thu hai lan cho nguoi nhan. Tuong tu exactly-once semantics voi idempotent producer.
Theo doi (Tracking): Nguoi gui muon biet thu da den chua. Tuong tu offset tracking — consumer biet minh da doc den dau.

1.2 Bai toan thiet ke tu scratch

Trong Tuan-08-Message-Queue, em da hoc cach su dung message queue (Kafka, RabbitMQ) trong system design. Bay gio, chung ta se thiet ke mot distributed message queue tu dau — giong nhu Kafka, Pulsar, hay Redpanda.

Day la bai toan kho vi no ket hop nhieu khai niem:

Thach thuc	Giai thich
High throughput	1 trieu messages/giay — moi message phai duoc nhan, luu, va giao
Durability	Khong duoc mat message — du broker bi crash
Ordering	Messages trong cung partition phai giu dung thu tu
Scalability	Them broker, them partition khi traffic tang
Fault tolerance	Broker chet → he thong van hoat dong binh thuong
Low latency	Producer gui → consumer nhan trong vai millisecond
Replay capability	Consumer co the doc lai message tu bat ky thoi diem nao
Exactly-once delivery	Trong nhieu truong hop, khong duoc xu ly trung

1.3 Tai sao khong dung co san?

Cau hoi hay: “Tai sao khong cu dung Kafka?”

Interview: Day la bai thiet ke — interviewer muon thay em hieu tai sao Kafka thiet ke nhu vay, khong phai chi biet cach dung.
Thuc te: Nhieu cong ty lon (LinkedIn, Uber, Confluent) da xay dung hoac custom message queue rieng vi yeu cau dac thu.
Hieu sau: Khi em biet cach thiet ke message queue, em se dung no tot hon — biet parameter nao quan trong, biet khi nao dung at-least-once vs exactly-once, biet cach tune performance.

Aha Moment: Bai nay bo sung cho Tuan-08-Message-Queue. Tuan 08 day em dung MQ nhu mot component. Bai nay day em xay MQ tu scratch. Hai goc nhin bo sung cho nhau — nguoi dung vs kien truc su.

2. Deep Dive — Alex Xu 4-Step Framework

Step 1: Requirements — Hieu va gioi han bai toan

2.1.1 Functional Requirements

Chuc nang	Mo ta chi tiet
Producer API	Producer gui message vao mot topic cu the. Ho tro batch gui nhieu message cung luc
Consumer API	Consumer doc message tu topic. Ho tro pull-based model (consumer chu dong keo data)
Topic management	Tao, xoa, liet ke topics. Moi topic co nhieu partitions
Message retention	Giu message trong mot khoang thoi gian (7 ngay default) hoac theo kich thuoc
Message replay	Consumer co the doc lai message tu bat ky offset nao trong retention period
Consumer group	Nhom consumer cung doc mot topic — moi partition chi gan cho 1 consumer trong group
Delivery semantics	Ho tro at-most-once, at-least-once (default), va exactly-once
Message ordering	Dam bao ordering trong cung mot partition

2.1.2 Non-Functional Requirements

Yeu cau	Muc tieu	Ly do
Throughput	1,000,000 messages/sec (write)	Xu ly event streaming cho he thong lon (e-commerce, fintech, IoT)
Latency	< 10ms P99 (end-to-end)	Producer gui → consumer nhan phai nhanh
Durability	Khong mat message khi co ⇐ 2 broker failures dong thoi	Du lieu la tai san — mat message = mat tien
Availability	99.99% uptime	Message queue la backbone — no chet thi toan he thong chet
Scalability	Scale horizontally bang cach them broker	Traffic tang → them may, khong can thiet ke lai
Storage efficiency	Luu tru hieu qua tren disk	Message duoc giu nhieu ngay — disk la bottleneck
Ordering guarantee	Messages trong cung partition giu dung thu tu	Nhieu use case yeu cau strict ordering (payment events, order state changes)

2.1.3 Out of Scope

De giu bai tap trung, chung ta khong thiet ke:

Schema registry (Avro, Protobuf schema management)
Stream processing engine (Kafka Streams, Flink)
Multi-datacenter replication (MirrorMaker)
Tiered storage (hot/cold storage)

2.1.4 API Design

Producer API:

API	Parameters	Mo ta
`produce(topic, message, key?, partition?)`	topic: string, message: bytes, key: bytes (optional), partition: int (optional)	Gui message vao topic. Neu co key → hash(key) chon partition. Neu co partition → gui truc tiep
`produce_batch(topic, messages[])`	topic: string, messages: list of (key, value)	Gui nhieu message cung luc — giam network overhead

Consumer API:

API	Parameters	Mo ta
`subscribe(topics[], group_id)`	topics: list of string, group_id: string	Consumer tham gia group va subscribe vao topics
`poll(timeout)`	timeout: ms	Keo batch messages moi. Tra ve list messages
`commit(offsets)`	offsets: map<partition, offset>	Commit offset — danh dau da xu ly den dau
`seek(partition, offset)`	partition: int, offset: long	Nhay den vi tri bat ky — dung cho replay

Step 2: High-Level Design — Kien truc tong quan

2.2.1 Cac thanh phan chinh

flowchart TB
    subgraph Producers
        P1["Producer 1<br/>(Order Service)"]
        P2["Producer 2<br/>(Payment Service)"]
        P3["Producer 3<br/>(IoT Devices)"]
    end

    subgraph "Broker Cluster"
        B1["Broker 1<br/>Partitions: 0, 3, 6"]
        B2["Broker 2<br/>Partitions: 1, 4, 7"]
        B3["Broker 3<br/>Partitions: 2, 5, 8"]
    end

    subgraph "Metadata Store"
        MS["ZooKeeper / etcd<br/>- Broker registry<br/>- Topic config<br/>- Partition assignment<br/>- Consumer group state"]
    end

    subgraph "Coordinator"
        CO["Coordinator<br/>- Leader election<br/>- Partition assignment<br/>- Consumer rebalancing"]
    end

    subgraph Consumers
        subgraph "Consumer Group A (Order Processing)"
            C1["Consumer A1"]
            C2["Consumer A2"]
            C3["Consumer A3"]
        end
        subgraph "Consumer Group B (Analytics)"
            C4["Consumer B1"]
            C5["Consumer B2"]
        end
    end

    P1 & P2 & P3 --> B1 & B2 & B3
    B1 & B2 & B3 <--> MS
    CO <--> MS
    CO --> B1 & B2 & B3
    B1 & B2 & B3 --> C1 & C2 & C3
    B1 & B2 & B3 --> C4 & C5

    style B1 fill:#1e88e5,color:#fff
    style B2 fill:#1e88e5,color:#fff
    style B3 fill:#1e88e5,color:#fff
    style MS fill:#f57c00,color:#fff
    style CO fill:#7b1fa2,color:#fff

2.2.2 Vai tro tung thanh phan

Thanh phan	Vai tro	Chi tiet
Producer	Gui message vao broker	Chon partition (round-robin, key-based hash, custom). Batch va compress truoc khi gui
Broker	Luu tru va phuc vu message	Moi broker quan ly mot so partitions. Ghi message vao disk (WAL). Phuc vu read cho consumer
Metadata Store	Luu metadata cua cluster	Danh sach broker, topic config, partition assignment, consumer group state. Dung ZooKeeper hoac etcd
Coordinator	Dieu phoi cluster	Leader election cho partition, consumer group rebalancing, broker membership management
Consumer	Doc message tu broker	Pull-based — consumer chu dong keo data. Track offset de biet doc den dau
Consumer Group	Nhom consumer chia nhau doc	Moi partition chi gan cho 1 consumer trong group. Cho phep parallel processing

2.2.3 Message Flow tong quan

sequenceDiagram
    participant P as Producer
    participant LB as Broker (Leader)
    participant F1 as Broker (Follower 1)
    participant F2 as Broker (Follower 2)
    participant C as Consumer

    P->>LB: 1. Send message batch
    LB->>LB: 2. Write to WAL (disk)
    LB->>F1: 3. Replicate to follower 1
    LB->>F2: 3. Replicate to follower 2
    F1-->>LB: 4. ACK replication
    F2-->>LB: 4. ACK replication
    LB-->>P: 5. ACK to producer (acks=all)

    C->>LB: 6. Poll(offset=100)
    LB->>LB: 7. Read from WAL/page cache
    LB-->>C: 8. Return messages [100-150]
    C->>C: 9. Process messages
    C->>LB: 10. Commit offset=150

Step 3: Deep Dive — Chi tiet tung thanh phan

2.3.1 Data Model — Mo hinh du lieu

Topic

Topic la logical channel de to chuc messages. Tuong tu nhu chuyen muc trong buu dien — thu kinh doanh, thu ca nhan, thu quang cao.

Thuoc tinh	Mo ta	Vi du
Name	Ten topic — dinh danh duy nhat	`orders`, `payments`, `user-events`
Partition count	So partitions — quyet dinh parallelism	12 partitions
Replication factor	So ban sao moi partition	3 (1 leader + 2 followers)
Retention	Thoi gian giu message	7 ngay (default)
Cleanup policy	Cach don dep message cu	`delete` (xoa theo thoi gian) hoac `compact` (giu latest per key)

Partition

Partition la don vi co ban cua parallelism va ordering. Tuong tu nhu nhieu o buu dien song song — moi o xu ly doc lap.

Moi partition la mot ordered, immutable sequence of messages. Messages duoc append vao cuoi — khong bao gio sua hay xoa (chi xoa khi retention het han).

Dac diem quan trong:

Messages trong cung mot partition co thu tu dam bao (ordering guarantee)
Messages o khac partition khong co thu tu voi nhau
Moi partition co mot leader va nhieu followers (replicas)
Chi leader xu ly read/write — followers chi replicate

Offset

Offset la so thu tu cua message trong partition. Bat dau tu 0, tang dan. Tuong tu so thu tu cua thu trong hop thu — thu so 0, thu so 1, thu so 2…

Thong tin	Chi tiet
Kieu du lieu	64-bit integer
Pham vi	0 den 2^63 - 1 (du cho 292 nam voi 1M msg/sec)
Tinh chat	Monotonically increasing trong partition
Muc dich	Consumer dung offset de biet doc den dau, co the seek den bat ky offset nao

Message (Record)

Moi message gom cac thanh phan:

Truong	Kieu	Mo ta
Key	bytes (nullable)	Dung de quyet dinh partition. Vi du: `user_id`, `order_id`. Nullable — neu null thi round-robin
Value	bytes	Noi dung message. Co the la JSON, Avro, Protobuf. Broker khong care format — chi la bytes
Timestamp	long (epoch ms)	Thoi diem message duoc tao (CreateTime) hoac thoi diem broker nhan (LogAppendTime)
Headers	list of key-value	Metadata bo sung. Vi du: `trace-id`, `source-service`, `content-type`
Offset	long	Duoc broker gan khi message duoc ghi vao partition. Producer khong set
Partition	int	Partition ma message thuoc ve
CRC	int	Checksum de phat hien data corruption

Aha Moment: Broker khong parse hay hieu noi dung message. Voi broker, message chi la mot cuc bytes. Dieu nay giup broker cuc ky nhanh — no chi can ghi bytes vao disk va doc bytes tu disk. Khong co serialization/deserialization overhead.

2.3.2 Broker Storage — Cach luu tru tren disk

Day la phan quan trong nhat cua bai thiet ke. Storage design quyet dinh throughput va durability cua toan he thong.

Write-Ahead Log (WAL) per Partition

Moi partition duoc luu tren disk nhu mot append-only log. Tuong tu nhu cuon so ghi chep cua buu dien — chi ghi them, khong bao gio xoa hay sua.

Tai sao append-only?

Disk sequential write cuc nhanh: HDD sequential write dat 200-300 MB/s (nhanh hon random write 1000x). SSD sequential write dat 500-3000 MB/s.
Khong can lock phuc tap: Chi ghi vao cuoi → khong co conflict
Tu nhien ho tro ordering: Message duoc ghi theo thu tu → doc ra cung dung thu tu

Segment Files

Log cua moi partition duoc chia thanh nhieu segment files. Moi segment la mot file tren disk.

Topic: orders, Partition: 0

/data/orders-0/
    00000000000000000000.log     ← Segment 0: offset 0 - 999,999
    00000000000000000000.index   ← Offset index cho segment 0
    00000000000000000000.timeindex ← Time index cho segment 0
    00000000000001000000.log     ← Segment 1: offset 1,000,000 - 1,999,999
    00000000000001000000.index
    00000000000001000000.timeindex
    00000000000002000000.log     ← Segment 2 (active segment - dang ghi)
    00000000000002000000.index
    00000000000002000000.timeindex

Tai sao chia segment?

Ly do	Giai thich
Retention cleanup	Xoa segment cu de giai phong disk — xoa ca file, khong can scan tung message
Compaction	Nen segment cu — giu lai latest value per key
File system limits	Mot file qua lon → OS xu ly cham. Segment nho hon de quan ly
Recovery	Khi broker restart, chi can recover active segment — cac segment cu da duoc flush

Cau hinh segment:

Config	Default	Mo ta
`segment.bytes`	1 GB	Kich thuoc toi da moi segment
`segment.ms`	7 ngay	Thoi gian toi da truoc khi roll segment moi
`segment.index.bytes`	10 MB	Kich thuoc index file

Index Files — Tim message nhanh

Khi consumer muon doc tu offset 1,500,000 — lam sao broker tim dung vi tri tren disk ma khong can scan tu dau?

Offset Index: Map tu offset → vi tri (position) trong segment file.

Offset Index (sparse — khong luu moi offset):
Offset 1,000,000 → Position 0
Offset 1,000,100 → Position 15,200
Offset 1,000,200 → Position 30,800
Offset 1,000,300 → Position 46,100
...

Khi can tim offset 1,000,150:

Binary search trong index → tim entry gan nhat nho hon: offset 1,000,100 tai position 15,200
Scan sequential tu position 15,200 den khi gap offset 1,000,150

Timestamp Index: Map tu timestamp → offset. Dung khi consumer muon doc message tu mot thoi diem cu the (vi du: “doc lai tat ca message tu 2 ngay truoc”).

Timestamp Index:
Timestamp 1710700000000 → Offset 1,000,000
Timestamp 1710700060000 → Offset 1,000,500
Timestamp 1710700120000 → Offset 1,001,200
...

Aha Moment: Index la sparse (khong luu moi offset, chi luu moi N offset). Dieu nay tiet kiem disk va memory. Trade-off: can scan mot doan nho de tim chinh xac — nhung doan nay thuong nam trong page cache nen rat nhanh.

Memory-Mapped Files (mmap) va Page Cache

Day la “bi quyet” giup message queue dat throughput cuc cao:

Page Cache cua OS: Khi broker ghi message vao file, data khong ghi thang vao disk. No di qua page cache cua OS truoc:

Producer gui message → broker ghi vao page cache (RAM)
OS tu dong flush page cache xuong disk (async)
Consumer doc message → doc tu page cache (neu data con trong cache) → zero disk I/O

Ket qua: Trong truong hop ly tuong (consumer doc message vua moi duoc ghi), toan bo read xay ra tu RAM — khong cham disk. Day la ly do message queue co the dat millions of messages/sec tren hardware thuong.

Memory-mapped files (mmap): Index files duoc map vao memory thong qua mmap. Dieu nay cho phep truy cap index nhu truy cap mang trong RAM — khong can read() system call.

Ky thuat	Ap dung vao	Loi ich
Page cache	Segment files (.log)	Write nhanh (async flush), read nhanh (tu RAM)
mmap	Index files (.index, .timeindex)	Truy cap index nhanh nhu RAM
sendfile() / zero-copy	Transfer data tu disk → network	Giam CPU usage, giam copy giua kernel va user space

Zero-copy transfer: Khi consumer doc message, broker dung sendfile() system call de chuyen data truc tiep tu page cache → network socket, khong can copy qua user space. Giam 2 lan copy va 2 lan context switch.

flowchart LR
    subgraph "Truyen thong (4 copies)"
        D1["Disk"] -->|"1. read()"| K1["Kernel Buffer"]
        K1 -->|"2. copy"| U1["User Buffer"]
        U1 -->|"3. write()"| K2["Socket Buffer"]
        K2 -->|"4. send"| N1["Network"]
    end

    subgraph "Zero-copy (2 copies)"
        D2["Disk/Page Cache"] -->|"1. sendfile()"| K3["Kernel Buffer"]
        K3 -->|"2. send"| N2["Network"]
    end

    style D2 fill:#43a047,color:#fff
    style K3 fill:#43a047,color:#fff
    style N2 fill:#43a047,color:#fff

2.3.3 Producer — Gui message hieu qua

Partitioning Strategy

Khi producer gui message, no phai chon partition nao se nhan message. Co 3 strategy:

1. Round-Robin (khong co key):

Message duoc gui luan phien den cac partition: P0, P1, P2, P0, P1, P2…
Uu diem: Phan bo deu tai — khong bi hotspot
Nhuoc diem: Khong dam bao ordering — messages cua cung entity co the nam o khac partition
Dung khi: Khong can ordering (vi du: log messages, metrics)

2. Key-based Hash (co key):

partition = hash(key) % num_partitions
Tat ca messages cung key → luon vao cung partition
Uu diem: Dam bao ordering cho cung key
Nhuoc diem: Co the bi hotspot neu key phan bo khong deu (vi du: celebrity user co nhieu event)
Dung khi: Can ordering per entity (vi du: key = order_id → tat ca event cua 1 order theo dung thu tu)

3. Custom Partitioner:

Producer tu implement logic chon partition
Dung khi: Logic dac biet — vi du: high-priority messages vao partition rieng, geo-based routing

Pitfall: Khi thay doi so partition (vi du: tu 6 len 12), hash(key) % num_partitions se thay doi. Message cua cung key co the di vao partition khac. Day la ly do partition count rat kho thay doi sau khi da chay production. Chon dung tu dau!

Batching

Producer khong gui tung message mot. No gom nhieu messages thanh 1 batch roi gui cung luc.

Config	Default	Mo ta
`batch.size`	16 KB	Kich thuoc toi da cua batch
`linger.ms`	0 ms (gui ngay)	Thoi gian cho de gom them messages vao batch

Trade-off:

linger.ms = 0: Latency thap nhat, nhung throughput thap (gui tung message)
linger.ms = 5-50ms: Tang throughput dang ke (gom nhieu messages), nhung latency tang them vai ms
batch.size lon: It network round-trip hon, nhung memory usage cao hon

Compression

Batch duoc compress truoc khi gui qua network. Broker luu batch da compress — khong decompress. Consumer moi decompress.

Algorithm	Compression ratio	CPU usage	Toc do	Khi nao dung
gzip	Cao nhat (~70-80%)	Cao	Cham	Bandwidth limited, batch lon
snappy	Trung binh (~50-60%)	Thap	Nhanh	Can balance giua compression va CPU
lz4	Trung binh (~55-65%)	Rat thap	Rat nhanh	Default choice — tot nhat cho da so truong hop
zstd	Cao (~65-75%)	Trung binh	Nhanh	Muon compression tot hon lz4, chap nhan them CPU

Aha Moment: Compression xay ra o batch level, khong phai message level. Batch cang lon → compression ratio cang tot (vi co nhieu data giong nhau de compress). Day la ly do tang linger.ms va batch.size giup tiet kiem bandwidth dang ke.

Acknowledgment (acks)

Sau khi gui message, producer cho broker xac nhan (ACK). Muc ACK quyet dinh durability vs latency:

acks	Hanh vi	Durability	Latency	Khi nao dung
0	Producer khong cho ACK. “Fire and forget”	Thap nhat — co the mat message	Thap nhat	Metrics, logs — chap nhan mat
1	Cho ACK tu leader. Leader ghi xong → ACK	Trung binh — mat neu leader crash truoc khi replicate	Trung binh	Da so truong hop
all (-1)	Cho ACK tu tat ca ISR (In-Sync Replicas). Leader + followers ghi xong → ACK	Cao nhat — chi mat khi tat ca ISR crash cung luc	Cao nhat	Payment events, critical data — khong duoc mat

Pitfall: acks=all khong co nghia la tat ca replicas. No chi doi tat ca replicas trong ISR (In-Sync Replicas). Neu ISR chi con 1 (leader) vi followers lag → acks=all tuong duong acks=1. Day la ly do can set min.insync.replicas=2.

2.3.4 Consumer — Doc message hieu qua

Consumer Group

Consumer group la co che chia cong viec giua nhieu consumers. Tuong tu nhom nhan vien buu dien — moi nguoi phu trach mot khu vuc, khong ai lam trung voi ai.

Quy tac vang:

Trong 1 consumer group: moi partition chi gan cho dung 1 consumer
1 consumer co the doc nhieu partitions
Neu num_consumers > num_partitions → consumer thua se idle
Neu num_consumers < num_partitions → mot consumer doc nhieu partitions
Ly tuong: num_consumers = num_partitions

Nhieu consumer groups doc cung topic doc lap voi nhau:

Topic: orders (6 partitions)

Consumer Group "order-processing":
  Consumer 1 → Partition 0, 1
  Consumer 2 → Partition 2, 3
  Consumer 3 → Partition 4, 5

Consumer Group "analytics":
  Consumer A → Partition 0, 1, 2
  Consumer B → Partition 3, 4, 5

Consumer Group "audit":
  Consumer X → Partition 0, 1, 2, 3, 4, 5  (1 consumer doc tat ca)

Moi group co offset rieng. Group “order-processing” co the da doc den offset 10,000 trong khi group “analytics” moi doc den offset 5,000.

Partition Assignment Strategies

Khi consumer join/leave group, partitions phai duoc re-assign. Co nhieu strategy:

1. Range Assignment:

Sap xep partitions theo so thu tu, chia deu cho consumers
Vi du: 6 partitions, 3 consumers → C1: [P0,P1], C2: [P2,P3], C3: [P4,P5]
Uu diem: Don gian, deterministic
Nhuoc diem: Consumer dau tien co the bi nhieu partition hon neu chia khong deu

2. Round-Robin Assignment:

Phan phoi partition luan phien: C1→P0, C2→P1, C3→P2, C1→P3, C2→P4, C3→P5
Uu diem: Phan bo deu hon range
Nhuoc diem: Khi rebalance, nhieu partition bi di chuyen

3. Sticky Assignment:

Giu nguyen assignment cu nhieu nhat co the, chi di chuyen partition cua consumer da roi di
Uu diem: Giam disruption khi rebalance — consumer giu lai state da co
Nhuoc diem: Phuc tap hon de implement

Aha Moment: Sticky assignment la best practice trong production. Khi 1 consumer crash, chi partitions cua no bi reassign — cac consumer khac giu nguyen. Giam thoi gian “stop the world” dang ke.

Offset Management

Consumer phai bao cho broker biet da doc den offset nao. Qua trinh nay goi la commit offset.

Auto Commit:

Consumer tu dong commit offset sau moi auto.commit.interval.ms (default 5s)
Uu diem: Don gian, khong can code them
Nhuoc diem: Co the mat message hoac xu ly trung
- Consumer doc message, chua xu ly xong → auto commit → consumer crash → message mat (at-most-once)
- Consumer xu ly xong, chua commit → crash → rebalance → consumer moi doc lai (at-least-once voi duplicates)

Manual Commit:

Consumer chi commit khi da xu ly xong message
Synchronous commit: commitSync() — block cho den khi commit thanh cong. An toan nhung cham
Asynchronous commit: commitAsync() — khong block. Nhanh nhung co the fail silently

Best practice: Dung commitAsync() trong vong lap xu ly, va commitSync() truoc khi consumer shutdown. Ket hop ca hai de dat balance giua performance va safety.

Rebalancing Protocol

Khi consumer join/leave group, hệ thống phai rebalance — gan lai partitions cho consumers.

sequenceDiagram
    participant C1 as Consumer 1
    participant C2 as Consumer 2
    participant C3 as Consumer 3 (new)
    participant GC as Group Coordinator

    Note over C1,GC: Consumer 3 joins group
    C3->>GC: JoinGroup request
    GC->>GC: Trigger rebalance
    GC->>C1: Revoke partitions
    GC->>C2: Revoke partitions
    C1->>C1: Commit offsets, release partitions
    C2->>C2: Commit offsets, release partitions
    C1->>GC: JoinGroup (re-join)
    C2->>GC: JoinGroup (re-join)
    C3->>GC: JoinGroup
    GC->>C1: Assign: Consumer 1 is leader
    Note over C1: Leader computes assignment
    C1->>GC: SyncGroup (assignment plan)
    GC->>C1: SyncGroup response (P0, P1)
    GC->>C2: SyncGroup response (P2, P3)
    GC->>C3: SyncGroup response (P4, P5)
    Note over C1,C3: Resume consuming with new assignment

Eager Rebalancing (truyen thong):

Tat ca consumers dung lai, tra partitions ve, roi nhan assignment moi
Van de: “Stop the world” — trong thoi gian rebalance, khong consumer nao doc duoc

Cooperative / Incremental Rebalancing (moi):

Chi revoke partitions can di chuyen. Consumer giu lai partitions khong thay doi
Uu diem: Giam disruption dang ke — da so consumers van hoat dong binh thuong trong rebalance
Day la best practice cho production systems

Pitfall: Rebalancing la thoi diem nguy hiem nhat cua consumer group. Trong khi rebalance, khong ai xu ly messages → consumer lag tang. Neu rebalance xay ra lien tuc (vi du: consumer unstable, session timeout qua ngan) → he thong khong hoat dong duoc. Monitoring rebalance frequency la critical.

2.3.5 Replication — Dam bao du lieu khong mat

Leader-Follower Model

Moi partition co 1 leader va nhieu followers. Chi leader nhan read/write. Followers chi replicate tu leader.

flowchart TB
    subgraph "Topic: orders, Partition 0"
        subgraph "Broker 1"
            L0["Partition 0<br/>LEADER<br/>Offset: 0-15,000"]
        end
        subgraph "Broker 2"
            F0a["Partition 0<br/>FOLLOWER (ISR)<br/>Offset: 0-14,998"]
        end
        subgraph "Broker 3"
            F0b["Partition 0<br/>FOLLOWER (ISR)<br/>Offset: 0-14,995"]
        end
    end

    subgraph "Topic: orders, Partition 1"
        subgraph "Broker 2 "
            L1["Partition 1<br/>LEADER<br/>Offset: 0-12,000"]
        end
        subgraph "Broker 3 "
            F1a["Partition 1<br/>FOLLOWER (ISR)<br/>Offset: 0-11,999"]
        end
        subgraph "Broker 1 "
            F1b["Partition 1<br/>FOLLOWER (ISR)<br/>Offset: 0-11,997"]
        end
    end

    L0 -->|"Replicate"| F0a
    L0 -->|"Replicate"| F0b
    L1 -->|"Replicate"| F1a
    L1 -->|"Replicate"| F1b

    style L0 fill:#e53935,color:#fff
    style L1 fill:#e53935,color:#fff
    style F0a fill:#1e88e5,color:#fff
    style F0b fill:#1e88e5,color:#fff
    style F1a fill:#1e88e5,color:#fff
    style F1b fill:#1e88e5,color:#fff

Tai sao leader phan bo tren nhieu brokers?

Partition 0 leader o Broker 1, Partition 1 leader o Broker 2…
Phan bo deu load — khong broker nao bi overload
Neu Broker 1 chet → chi leader cua partitions tren Broker 1 can failover

ISR (In-Sync Replicas)

ISR la tap hop cac replicas dang theo kip leader. Mot replica nam trong ISR neu:

No da fetch du lieu tu leader va khong bi lag qua nhieu
Cau hinh: replica.lag.time.max.ms (default 30s) — neu follower khong fetch trong 30s, bi loai khoi ISR

Thuat ngu	Dinh nghia
ISR	Tap replicas dang sync voi leader (bao gom leader)
OSR (Out-of-Sync Replicas)	Replicas bi lag — chua theo kip leader
LEO (Log End Offset)	Offset cua message cuoi cung tren moi replica
HW (High Watermark)	Offset cao nhat ma tat ca ISR da replicate. Consumer chi doc duoc den HW

High Watermark la gi?

Leader:    [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
Follower1: [0] [1] [2] [3] [4] [5] [6] [7]
Follower2: [0] [1] [2] [3] [4] [5] [6]
                                        ↑
                                   High Watermark = 6

Consumer chi doc duoc message co offset ⇐ 6 (High Watermark). Messages 7, 8, 9 da o leader nhung chua duoc tat ca ISR replicate → chua visible cho consumer.

Tai sao? Neu leader crash truoc khi replicate 7, 8, 9 → follower len lam leader moi → messages 7, 8, 9 mat. Neu consumer da doc chung → data inconsistency.

min.insync.replicas

Config nay quyet dinh so ISR toi thieu truoc khi leader chap nhan write.

Cau hinh	Hanh vi
`replication.factor = 3`	Moi partition co 3 replicas
`min.insync.replicas = 2`	Can it nhat 2 replicas trong ISR de ghi duoc
`acks = all`	Producer doi tat ca ISR xac nhan

Ket hop: acks=all + min.insync.replicas=2 + replication.factor=3

Producer gui message → leader ghi + it nhat 1 follower ghi → ACK
Co the chiu 1 broker failure ma khong mat data va khong ngung ghi
Neu 2 brokers chet → ISR < min.insync.replicas → leader tu choi write → producer nhan error

Aha Moment: min.insync.replicas la “circuit breaker” cua replication. No ngan leader ghi khi khong du replicas — tranh truong hop ghi message vao leader roi leader chet → mat data.

Unclean Leader Election

Khi leader chet va khong co follower nao trong ISR (tat ca followers deu lag) — he thong co 2 lua chon:

Lua chon	Config	Hau qua
Cho doi ISR follower online	`unclean.leader.election.enable = false`	Partition khong hoat dong cho den khi ISR follower quay lai. Durability > Availability
Chon follower ngoai ISR lam leader	`unclean.leader.election.enable = true`	Partition hoat dong ngay, nhung mat messages chua duoc replicate. Availability > Durability

Trade-off kinh dien: Durability vs Availability

Payment system, financial data → false (khong duoc mat data)
Metrics, logs → true (chap nhan mat vai message, mien la he thong khong dung)

2.3.6 Delivery Semantics — Dam bao giao message

Day la mot trong nhung khai niem kho nhat va hay bi hieu sai nhat trong message queue.

At-Most-Once

Dinh nghia: Message duoc xu ly toi da 1 lan. Co the mat, nhung khong bao gio trung.

Cach implement:

Producer: acks = 0 (khong cho ACK)
Consumer: Commit offset truoc khi xu ly message

Khi nao dung: Metrics, logs, analytics — mat vai data point khong anh huong ket qua.

Rui ro: Message mat vinh vien — producer khong biet, consumer khong biet.

At-Least-Once (Default)

Dinh nghia: Message duoc xu ly it nhat 1 lan. Khong mat, nhung co the trung.

Cach implement:

Producer: acks = all + retry khi gui that bai
Consumer: Xu ly message truoc, roi moi commit offset

Khi nao dung: Da so truong hop — notification, email, order processing (voi idempotent consumer).

Rui ro: Message co the duoc xu ly 2 lan. Vi du: consumer xu ly xong, chua commit offset, consumer crash → message duoc xu ly lai.

Giai phap: Consumer phai idempotent — xu ly cung message 2 lan van cho ket qua giong nhau. Vi du: dung unique ID cua message de check “da xu ly chua” truoc khi xu ly.

Exactly-Once

Dinh nghia: Message duoc xu ly chinh xac 1 lan. Khong mat, khong trung.

Cach implement (phuc tap nhat):

Idempotent Producer:
- Broker gan moi producer mot Producer ID (PID)
- Moi message co Sequence Number tang dan per partition
- Neu broker nhan message voi cung PID + sequence number → tu dong bo qua (deduplicate)
- Giai quyet: producer retry gui trung → broker chi ghi 1 lan
Transactional API:
- Producer bat dau transaction → gui messages vao nhieu partitions → commit hoac abort
- Consumer chi doc messages cua committed transactions (isolation.level = read_committed)
- Giai quyet: atomic write across partitions — hoac tat ca messages duoc ghi, hoac khong co gi
Consumer-side deduplication:
- Du co idempotent producer va transactions, consumer van can xu ly idempotent
- Ly do: “exactly-once” cua broker chi dam bao ghi 1 lan. Consumer xu ly la logic rieng — co the fail giua chung

Aha Moment: “Exactly-once” trong distributed systems la cuc ky kho va thuong la ket hop cua nhieu co che: idempotent producer (deduplicate write) + transactions (atomic write) + idempotent consumer (deduplicate processing). Khong co “magic button” nao lam exactly-once tu dong.

2.3.7 Message Retention — Giu message bao lau?

Khac voi traditional message queue (RabbitMQ: message bi xoa sau khi consumed), distributed message queue giu message theo retention policy — bat ke da duoc consume hay chua.

Time-Based Retention

Config	Default	Mo ta
`retention.ms`	604,800,000 (7 ngay)	Message cu hon 7 ngay bi xoa
`retention.minutes`	-	Tinh theo phut
`retention.hours`	168	Tinh theo gio

Khi segment file cu nhat co timestamp > retention → xoa ca segment file.

Size-Based Retention

Config	Default	Mo ta
`retention.bytes`	-1 (unlimited)	Tong kich thuoc toi da per partition

Khi tong kich thuoc cac segments vuot qua retention.bytes → xoa segment cu nhat.

Log Compaction — Giu latest per key

Day la retention policy dac biet. Thay vi xoa theo thoi gian, broker giu lai message moi nhat cho moi key va xoa cac message cu hon.

TRUOC compaction:
Key=user1, Value=Hanoi     (offset 0)
Key=user2, Value=Saigon    (offset 1)
Key=user1, Value=Danang    (offset 2)   ← user1 cap nhat dia chi
Key=user3, Value=Hue       (offset 3)
Key=user2, Value=null      (offset 4)   ← user2 bi xoa (tombstone)
Key=user1, Value=Dalat     (offset 5)   ← user1 cap nhat lan nua

SAU compaction:
Key=user3, Value=Hue       (offset 3)   ← giu nguyen
Key=user2, Value=null      (offset 4)   ← tombstone (bi xoa sau)
Key=user1, Value=Dalat     (offset 5)   ← chi giu latest

Dung khi nao?

Database changelog (CDC): moi row la 1 key, value la state moi nhat
Configuration store: moi key la config name, value la config value
User profile updates: moi key la user_id, value la profile moi nhat

Pitfall: Log compaction khong xay ra ngay lap tuc. Co mot “dirty ratio” threshold. Compaction thread chay background va co the tao I/O load lon. Trong production, can monitoring disk I/O trong luc compaction.

2.3.8 Coordination — Quan ly cluster

ZooKeeper (truyen thong)

ZooKeeper la external coordination service duoc Kafka dung tu dau de:

Chuc nang	Chi tiet
Broker registry	Moi broker dang ky vao ZooKeeper khi start. ZK biet broker nao dang song
Controller election	Chon 1 broker lam “Controller” — phu trach leader election cho partitions
Topic/partition metadata	Luu danh sach topics, partitions, replica assignment
Consumer group state	(cu) Luu consumer offsets va group membership. Tu Kafka 0.9+, consumer offsets luu trong internal topic `__consumer_offsets`

Van de cua ZooKeeper:

Them 1 dependency: ZooKeeper la hệ thống rieng, can deploy va van hanh rieng biệt
Scaling bottleneck: ZK cluster thuong chi 3-5 nodes. Metadata updates tro thanh bottleneck khi cluster lon (hang nghin partitions)
Split brain risk: Neu ZK va broker cluster bi network partition → co the co hai leaders cho cung partition
Operational complexity: Phai biet van hanh 2 he thong: Kafka + ZooKeeper

KRaft (ZooKeeper-less) — Xu huong moi

Tu Kafka 3.3+, KRaft thay the ZooKeeper bang internal Raft-based consensus:

Khia canh	ZooKeeper	KRaft
Architecture	External ZK cluster	Built-in — metadata stored trong Kafka brokers
Metadata storage	ZK znodes	Internal `__cluster_metadata` topic (Raft log)
Controller	1 broker duoc ZK bau lam controller	Quorum of controllers (3-5 brokers co role controller)
Failover time	10-30 giay (ZK session timeout)	1-3 giay (Raft election)
Scaling	ZK la bottleneck	Metadata replicate qua Raft — scale tot hon
Operational	2 clusters (Kafka + ZK)	1 cluster (chi Kafka)

Controller va metadata management trong KRaft:

Mot nhom brokers co role controller (thuong 3 hoac 5)
Cac controllers dung Raft consensus de bau leader va replicate metadata
Active controller xu ly tat ca metadata changes (tao topic, leader election, v.v.)
Cac controller khac la follower — san sang thay the neu active controller chet

Aha Moment: KRaft khong chi don gian la “bo ZooKeeper”. No thay doi co ban cach Kafka quan ly metadata — tu external coordination sang internal log-based coordination. Day la xu huong cua distributed systems: tu ngoai vao trong, tu complexity sang simplicity.

2.3.9 Push vs Pull — Hai mo hinh tieu thu

Khia canh	Pull (Kafka model)	Push (RabbitMQ model)
Ai dieu khien toc do?	Consumer — keo data theo toc do cua minh	Broker — day data xuong consumer
Consumer cham?	Khong anh huong broker — data van nam tren disk	Broker phai buffer hoac drop messages
Consumer nhanh?	Keo data lien tuc, khong phai doi	Broker co the khong day du nhanh
Batching	Consumer tu chon batch size toi uu	Broker quyet dinh batch — co the khong phu hop
Long polling	Consumer poll voi timeout — neu khong co data moi, doi mot luc roi poll lai	Khong can — broker day ngay khi co data
Back-pressure	Tu nhien — consumer chi keo khi san sang	Can co che rieng (prefetch count, flow control)
Replay	De dang — consumer chi can seek offset	Kho — message da bi xoa sau khi ACK
Complexity	Consumer phuc tap hon (phai tu quan ly polling)	Consumer don gian hon (chi nhan va xu ly)

Tai sao thiet ke cua chung ta chon Pull?

Consumer tu quyet dinh toc do — khong bi overwhelm
Batching toi uu — consumer keo dung luong data phu hop
Replay de dang — seek den bat ky offset nao
Back-pressure tu nhien — khong can co che dac biet
Scale consumer doc lap — them consumer khong anh huong broker

Nhuoc diem cua Pull: Khi khong co data moi, consumer van phai poll → lang phi. Giai phap: Long polling — consumer gui poll request, broker giu request cho den khi co data moi hoac timeout.

2.3.10 Dead Letter Queue (DLQ) — Xu ly “thu chet”

Van de: Consumer nhan message nhung khong xu ly duoc. Vi du:

Message bi corrupt (bad format)
Business logic fail (invalid data)
Dependency bi loi (database down → retry nhieu lan van fail)

Neu cu retry mai → consumer bi stuck tai message nay, khong xu ly duoc messages phia sau. Day goi la poison message.

Giai phap: Dead Letter Queue

Main Topic: orders
  ↓ (consumer doc)
  Consumer: xu ly message
    ↓ (thanh cong) → commit offset, tiep tuc
    ↓ (that bai lan 1) → retry
    ↓ (that bai lan 2) → retry
    ↓ (that bai lan 3) → chuyen vao DLQ

DLQ Topic: orders-dlq
  ↓ (ops team review, fix, re-publish)

Cach implement DLQ:

Consumer co max.retries (vi du: 3)
Sau khi retry het → producer message vao DLQ topic (vi du: orders-dlq)
Commit offset cua message goc → consumer tiep tuc xu ly message tiep theo
DLQ topic duoc monitor boi ops team
Sau khi fix root cause, message trong DLQ duoc re-publish vao main topic

Config	Mo ta
Max retries	So lan retry truoc khi chuyen vao DLQ
Retry delay	Thoi gian doi giua cac lan retry (exponential backoff)
DLQ topic name	Convention: `{original-topic}-dlq`
DLQ retention	Thuong dai hon main topic (30-90 ngay) de co thoi gian investigate

Pitfall: Khong co DLQ → poison message se block consumer vinh vien (neu auto-commit tat) hoac bi mat (neu auto-commit bat va consumer crash truoc khi xu ly). Ca hai truong hop deu nguy hiem. Luon thiet ke DLQ cho production systems.

2.3.11 Scaling — Mo rong he thong

Them Broker

Khi cluster can them capacity (disk, CPU, network):

Start broker moi, join cluster
Broker moi chua co partition nao — no la “empty”
Admin chay partition reassignment — di chuyen mot so partitions tu brokers cu sang broker moi
Trong qua trinh di chuyen, data duoc replicate tu broker cu sang broker moi
Khi replicate xong, broker moi nhan role (leader hoac follower) va broker cu giai phong partition

Van de: Partition reassignment ton bandwidth va disk I/O. Trong production:

Chay reassignment ngoai gio cao diem
Gioi han toc do reassignment (throttle) de khong anh huong traffic production
Monitor replication lag trong qua trinh reassignment

Them Partition

Khi topic can them parallelism (nhieu consumers hon muon doc dong thoi):

Admin tang partition count cua topic (vi du: tu 6 len 12)
Broker tao partitions moi tren cac brokers
Consumer group rebalance — partitions duoc gan lai cho consumers
Messages moi duoc phan phoi vao ca partitions cu va moi

Van de nghiem trong:

Key ordering bi pha: hash(key) % 6 khac voi hash(key) % 12. Messages cua cung key truoc kia vao P0, bay gio co the vao P6. Ordering per key bi pha trong thoi gian chuyen doi.
Khong the giam partition: Kafka khong ho tro giam so partition. Mot khi da tang, khong quay lai duoc.

Pitfall: Partition count la quyet dinh kho thay doi nhat trong message queue. Chon qua it → khong scale duoc. Chon qua nhieu → lang phi resources (moi partition ton memory, file handles, replication bandwidth). Rule of thumb: bat dau voi partition count = 2-3x so consumers du kien, de room cho scale.

3. Estimation — Uoc luong he thong

3.1 Throughput Estimation

Assumptions:

Thong so	Gia tri	Giai thich
Target throughput	1,000,000 messages/sec	Yeu cau tu requirements
Average message size	1 KB	JSON event trung binh
Peak/Average ratio	3x	Flash sales, events
Replication factor	3	1 leader + 2 followers

W r i t e t h ro ug h p u t = 1, 000, 000 m s g / sec \times 1 K B = 1 GB / sec

P e ak w r i t e t h ro ug h p u t = 1 GB / sec \times 3 = 3 GB / sec

T o t a l w r i t e (in c l u d in g re pl i c a t i o n) = 1 GB / sec \times 3 re pl i c a s = 3 GB / sec (n or ma l)

P e ak t o t a l w r i t e = 3 GB / sec \times 3 re pl i c a s = 9 GB / sec

3.2 Storage Estimation

Thong so	Gia tri
Messages/day	1,000,000 msg/sec x 86,400 sec = 86.4 billion messages
Data/day (raw)	86.4B x 1 KB = 86.4 TB
Compression ratio	~50% (lz4)
Data/day (compressed)	86.4 TB x 0.5 = 43.2 TB
Replication factor	3
Data/day (with replication)	43.2 TB x 3 = 129.6 TB
Retention	7 ngay

T o t a l St or a g e = 129.6 TB / d a y \times 7 d a ys = 907.2 TB \approx 0.9 PB

S o b ro k er (n e u m o i b ro k er co 12 d i s k s \times 4 TB) = \frac{907.2 TB}{48 TB / b ro k er} \approx 19 b ro k ers (c h o s t or a g e)

3.3 Network Bandwidth Estimation

Inbound (producer → broker):

I nb o u n d ban d w i d t h = 1 GB / sec (p ro d u ce) = 8 G b p s

Replication bandwidth (leader → followers):

R e pl i c a t i o n ban d w i d t h = 1 GB / sec \times 2 f o ll o w ers = 2 GB / sec = 16 G b p s

Outbound (broker → consumers):

Gia su co 3 consumer groups, moi group doc toan bo data:

O u t b o u n d ban d w i d t h = 1 GB / sec \times 3 g ro u p s = 3 GB / sec = 24 G b p s

Total network per cluster:

T o t a l n e tw or k = 8 + 16 + 24 = 48 G b p s

P er b ro k er (19 b ro k ers) = \frac{48 G b p s}{19} \approx 2.5 G b p s / b ro k er

Moi broker can 10 Gbps NIC de co du headroom.

3.4 Consumer Throughput Estimation

Consumer processing	Time	Messages/sec per consumer
Simple log/forward	0.1 ms/msg	10,000 msg/sec
Database write	1 ms/msg	1,000 msg/sec
Complex processing	5 ms/msg	200 msg/sec

S o co n s u m ers (s im pl e) = \frac{1 , 000 , 000}{10 , 000} = 100 co n s u m ers

S o co n s u m ers (D B w r i t e) = \frac{1 , 000 , 000}{1 , 000} = 1, 000 co n s u m ers

Aha Moment: So partitions phai >= so consumers. Neu can 1,000 consumers → can it nhat 1,000 partitions. Day la ly do chon partition count la quyet dinh quan trong.

3.5 Tom tat Estimation

Metric	Gia tri
Write throughput	1 GB/sec (3 GB/sec peak)
Total write (with replication)	3 GB/sec (9 GB/sec peak)
Storage (7-day retention, RF=3, compressed)	~907 TB (~0.9 PB)
Number of brokers	~19+ (storage-based)
Network per broker	~2.5 Gbps (can 10 Gbps NIC)
Consumers needed	100 (simple) to 1,000 (DB write)
Partitions needed	>= max consumer count

4. Security — Bao mat message queue

4.1 Encryption — Ma hoa du lieu

4.1.1 Encryption in Transit (TLS)

Ket noi	TLS	Mo ta
Producer → Broker	Co	TLS 1.2/1.3. Ngan man-in-the-middle doc messages
Broker → Broker (replication)	Co	Inter-broker replication cung phai encrypt
Broker → Consumer	Co	Consumer doc messages qua TLS
Broker → ZooKeeper/Controller	Co	Metadata communication phai secure

Luu y: TLS giam throughput khoang 10-30% do encryption/decryption overhead. Trong internal network (trusted), co the can nhac dung PLAINTEXT cho inter-broker replication de tang performance — nhung day la trade-off.

4.1.2 Encryption at Rest

Phuong phap	Mo ta	Khi nao dung
Disk encryption (dm-crypt, LUKS)	Encrypt toan bo disk — transparent cho Kafka	Standard cho moi production deployment
File system encryption	Encrypt o file system level (eCryptfs, fscrypt)	Khi khong co full disk encryption
Message-level encryption	Producer encrypt message body truoc khi gui	Khi broker khong duoc phep doc noi dung (multi-tenant, sensitive data)

Message-level encryption: Producer encrypt bang public key cua consumer. Broker chi thay ciphertext — khong doc duoc. Consumer decrypt bang private key. Dung cho:

Healthcare data (HIPAA)
Financial data (PCI-DSS)
Multi-tenant platforms (tenant A khong doc duoc data cua tenant B, ke ca broker admin)

Pitfall: Message-level encryption lam log compaction khong hoat dong (broker khong doc duoc key de so sanh). Can can nhac trade-off giua security va functionality.

4.2 Authentication — Xac thuc danh tinh

Phuong phap	Mo ta	Khi nao dung
SASL/PLAIN	Username/password	Dev/test environments. Khong an toan cho production (password di plain text, can dung kem TLS)
SASL/SCRAM	Challenge-response (khong gui password)	Production — an toan hon PLAIN
SASL/GSSAPI (Kerberos)	Enterprise SSO	Enterprise environments co Kerberos infrastructure
SASL/OAUTHBEARER	OAuth 2.0 tokens	Cloud-native environments, integration voi identity providers
mTLS	Mutual TLS — client va server deu present certificate	Zero-trust environments, service-to-service authentication

Best practice: Dung mTLS cho service-to-service (producer/consumer la services) va SASL/SCRAM hoac OAUTHBEARER cho human users (admin tools, monitoring).

4.3 Authorization — Phan quyen truy cap

ACL (Access Control List) per topic:

Principal	Resource	Operation	Permission
`order-service`	Topic: `orders`	WRITE	ALLOW
`order-service`	Topic: `payments`	WRITE	DENY
`analytics-service`	Topic: `orders`	READ	ALLOW
`analytics-service`	Topic: `orders`	WRITE	DENY
`admin`	Topic: `*`	ALL	ALLOW

Principle of least privilege: Moi service chi co quyen vua du de lam viec cua no. Order service chi ghi vao orders topic, khong duoc doc payments topic.

4.4 Audit Logging

Event	Thong tin log
Topic created/deleted	Who, when, topic name, config
ACL changed	Who, when, old ACL, new ACL
Authentication failure	Who (attempted), when, IP address, reason
Admin operation	Who, when, operation (reassignment, config change)

Audit logs phai duoc gui den external system (SIEM, Elasticsearch) — khong luu tren broker (de tranh bi xoa boi attacker).

Aha Moment: Security trong message queue thuong bi bo qua vi “no la internal system”. Nhung neu attacker truy cap duoc broker → doc duoc toan bo messages cua moi service. Message queue la central nervous system — bao ve no la bao ve toan bo he thong.

5. DevOps — Van hanh va giam sat

5.1 Critical Metrics — Cac chi so quan trong nhat

5.1.1 Broker Health

Metric	Threshold	Y nghia
Under-replicated partitions	> 0	Co partition chua duoc replicate du. Alert ngay — risk cua data loss
ISR shrink rate	> 0	Followers dang bi loai khoi ISR. Broker overloaded hoac network issue
Active controller count	!= 1	Phai luon co dung 1 controller. 0 = khong ai dieu phoi. >1 = split brain
Offline partitions	> 0	Partition khong co leader. Critical alert — data khong doc/ghi duoc
Request handler idle ratio	< 20%	Broker dang qua tai — request threads gan het

5.1.2 Producer Metrics

Metric	Threshold	Y nghia
Produce request rate	Bao nhieu requests/sec	Throughput cua producers
Produce latency P99	> 100ms	Ghi cham — co the do broker overloaded, replication lag
Record error rate	> 0	Producer gui that bai — broker reject hoac network error
Batch size average	< 1KB	Batch qua nho — tang linger.ms de gom nhieu hon
Compression ratio	> 0.8	Compression khong hieu qua — check data pattern

5.1.3 Consumer Metrics — Quan trong nhat

Metric	Threshold	Y nghia
Consumer lag	> 10,000 messages	Consumer khong theo kip producer. Day la metric #1
Consumer lag trend	Tang lien tuc	Consumer cham hon producer — se tran memory/disk neu khong fix
Commit rate	Giam dot ngot	Consumer co the bi stuck hoac crash
Rebalance rate	> 1/hour	Rebalance qua thuong xuyen — consumer unstable
Poll interval	> max.poll.interval.ms	Consumer bi kick khoi group — processing qua cham

Aha Moment: Consumer lag la metric #1 cua toan bo message queue system. Neu lag tang → messages dang bi xu ly cham hon toc do gui → cuoi cung retention het → messages bi xoa truoc khi consumer doc → data loss. Monitoring consumer lag la bat buoc.

5.1.4 Infrastructure Metrics

Metric	Threshold	Y nghia
Disk usage	> 80%	Gan het disk — can them disk hoac giam retention
Disk I/O wait	> 10%	Disk la bottleneck — can SSD hoac giam load
Network throughput	> 70% capacity	Network la bottleneck — can nang cap NIC
CPU usage	> 70%	Thuong do compression/decompression
JVM GC pause	> 200ms	GC pause lam broker khong respond — client timeout
File descriptor count	> 80% ulimit	Moi partition + segment + connection ton 1 fd. Het fd = broker crash

5.2 Alerting Strategy

Severity	Metric	Action
P1 — Critical	Offline partitions > 0	Ngay lập tức: page on-call. Data khong truy cap duoc
P1 — Critical	Active controller count != 1	Page on-call. Cluster khong co coordinator
P2 — High	Under-replicated partitions > 0 (> 5 min)	Investigate broker health, disk, network
P2 — High	Consumer lag tang lien tuc (> 30 min)	Scale consumers hoac investigate bottleneck
P3 — Medium	Disk usage > 80%	Plan capacity — them disk hoac giam retention
P3 — Medium	ISR shrink rate > 0	Check follower health, replication bandwidth
P4 — Low	Produce latency P99 > 100ms	Investigate — co the la transient
P4 — Low	Rebalance rate > 1/hour	Check consumer stability, session timeout config

5.3 Operational Runbook

5.3.1 Broker Failure

Alert: Under-replicated partitions tang
Check: Broker nao bi mat? (broker.id khong con trong cluster)
Automatic: Controller tu dong elect leader moi tu ISR cho cac partitions cua broker chet
Action: Restart broker hoac thay hardware. Sau khi broker online, tu dong rejoin va sync data
Monitor: ISR count tro ve replication factor → recovery hoan tat

5.3.2 Consumer Lag Spike

Alert: Consumer lag > threshold
Check: Consumer co dang chay khong? Processing time tren moi message?
Action short-term: Scale consumers (them instances). Luu y: can du partitions
Action long-term: Optimize processing logic, tang partition count (neu can)
Monitor: Lag giam dan ve 0

5.3.3 Disk Full

Alert: Disk usage > 80%
Action ngay: Giam retention (vi du: tu 7 ngay xuong 3 ngay) de free disk
Action trung han: Them disk, them broker
Action dai han: Implement tiered storage (hot/cold) hoac tang compression

5.4 Capacity Planning

Thoi diem	Action
Weekly	Review consumer lag trends, disk growth rate
Monthly	Review throughput growth, plan broker additions
Quarterly	Review partition counts, rebalance strategy, retention policy
Yearly	Major capacity planning — hardware refresh, architecture review

Pitfall: Nhieu team chi monitoring throughput va latency nhung quen consumer lag. He thong trong “healthy” (broker khong bi loi) nhung data dang mat vi consumer khong theo kip va messages bi xoa khi retention het. Consumer lag la “silent killer” cua message queue.

6. Mermaid Diagrams — Tong hop kien truc

6.1 Broker Architecture voi WAL Segments

flowchart TB
    subgraph "Broker 1"
        subgraph "Partition 0 (Leader)"
            direction TB
            SEG0["Segment 0<br/>Offset 0-999K<br/>📁 00000000.log<br/>📁 00000000.index<br/>📁 00000000.timeindex"]
            SEG1["Segment 1<br/>Offset 1M-1.99M<br/>📁 01000000.log<br/>📁 01000000.index<br/>📁 01000000.timeindex"]
            SEG2["Segment 2 (ACTIVE)<br/>Offset 2M-...<br/>📁 02000000.log<br/>📁 02000000.index<br/>📁 02000000.timeindex"]
            SEG0 --> SEG1 --> SEG2
        end

        subgraph "Write Path"
            PC["Page Cache<br/>(OS RAM)"]
            FL["Flush to Disk<br/>(async)"]
        end

        subgraph "Read Path"
            IDX["Index Lookup<br/>(mmap)"]
            ZC["Zero-Copy<br/>sendfile()"]
        end
    end

    P["Producer"] -->|"Append"| SEG2
    SEG2 -->|"Write"| PC
    PC -->|"Async"| FL

    C["Consumer"] -->|"Seek(offset)"| IDX
    IDX -->|"Position"| PC
    PC -->|"sendfile()"| ZC
    ZC -->|"Data"| C

    style SEG2 fill:#e53935,color:#fff
    style PC fill:#1e88e5,color:#fff
    style ZC fill:#43a047,color:#fff

6.2 Replication Flow — Leader + Followers + ISR

flowchart TB
    subgraph "Partition 0 Replication"
        P["Producer<br/>acks=all"] -->|"1. Send msg"| L["Broker 1<br/>LEADER<br/>LEO: 1000"]

        L -->|"2. Write WAL"| WAL["WAL<br/>(disk)"]
        L -->|"3. Replicate"| F1["Broker 2<br/>FOLLOWER<br/>LEO: 999<br/>✅ IN ISR"]
        L -->|"3. Replicate"| F2["Broker 3<br/>FOLLOWER<br/>LEO: 998<br/>✅ IN ISR"]
        L -.->|"3. Replicate"| F3["Broker 4<br/>FOLLOWER<br/>LEO: 500<br/>❌ OUT OF ISR<br/>(lag > 30s)"]

        F1 -->|"4. ACK"| L
        F2 -->|"4. ACK"| L
        L -->|"5. ACK to Producer<br/>(all ISR confirmed)"| P

        HW["High Watermark = 998<br/>(min LEO of ISR)"]
        L --- HW

        C["Consumer"] -->|"6. Read <= HW"| L
    end

    style L fill:#e53935,color:#fff
    style F1 fill:#1e88e5,color:#fff
    style F2 fill:#1e88e5,color:#fff
    style F3 fill:#757575,color:#fff
    style HW fill:#f57c00,color:#fff

6.3 Consumer Group Rebalancing (Cooperative)

flowchart TB
    subgraph "TRUOC — 2 consumers, 6 partitions"
        direction LR
        C1_B["Consumer 1<br/>P0, P1, P2"]
        C2_B["Consumer 2<br/>P3, P4, P5"]
    end

    subgraph "Consumer 3 joins"
        direction TB
        EV["Rebalance triggered"]
    end

    subgraph "SAU — 3 consumers, 6 partitions (Cooperative)"
        direction LR
        C1_A["Consumer 1<br/>P0, P1<br/>(giu P0,P1 — tra P2)"]
        C2_A["Consumer 2<br/>P3, P4<br/>(giu P3,P4 — tra P5)"]
        C3_A["Consumer 3<br/>P2, P5<br/>(nhan P2, P5)"]
    end

    C1_B --> EV
    C2_B --> EV
    EV --> C1_A
    EV --> C2_A
    EV --> C3_A

    style EV fill:#f57c00,color:#fff
    style C3_A fill:#43a047,color:#fff

6.4 End-to-End Message Flow (Chi tiet)

flowchart LR
    subgraph "Producer Side"
        APP["Application"] -->|"1. produce()"| SER["Serializer<br/>(JSON/Avro)"]
        SER -->|"2. serialize"| PART["Partitioner<br/>hash(key) % N"]
        PART -->|"3. assign partition"| BATCH["Batch Buffer<br/>(per partition)"]
        BATCH -->|"4. batch full<br/>or linger.ms"| COMP["Compressor<br/>(lz4/zstd)"]
        COMP -->|"5. compressed batch"| NET1["Network<br/>Send"]
    end

    subgraph "Broker Side"
        NET1 -->|"6. receive"| VAL["Validate<br/>CRC, auth, ACL"]
        VAL -->|"7. append"| WAL2["WAL<br/>(page cache)"]
        WAL2 -->|"8. replicate"| REP["Followers"]
        REP -->|"9. ACK"| ACK["ACK to<br/>Producer"]
    end

    subgraph "Consumer Side"
        POLL["poll()"] -->|"10. fetch"| WAL2
        WAL2 -->|"11. zero-copy"| DECOMP["Decompressor"]
        DECOMP -->|"12. decompress"| DESER["Deserializer"]
        DESER -->|"13. deserialize"| PROC["Process<br/>Message"]
        PROC -->|"14. commit"| OFFSET["Commit<br/>Offset"]
    end

    style BATCH fill:#1e88e5,color:#fff
    style WAL2 fill:#e53935,color:#fff
    style OFFSET fill:#43a047,color:#fff

7. Aha Moments & Pitfalls — Nhung dieu can nho

7.1 Aha Moments

#	Insight	Giai thich
1	WAL la nen tang	Toan bo message queue duoc xay tren mot y tuong don gian: append-only log. Ghi vao cuoi file la thao tac nhanh nhat tren disk. Tu day, moi thu khac (replication, retention, replay) tro nen tu nhien
2	ISR can bang durability vs availability	ISR khong phai “tat ca replicas” — no la “cac replicas dang theo kip”. Config min.insync.replicas quyet dinh ban chap nhan mat bao nhieu replicas ma van ghi duoc
3	Consumer lag la metric #1	Broker throughput cao, latency thap — nhung consumer lag tang → data dang mat dan. Day la “silent killer”. Monitor consumer lag truoc moi thu khac
4	Partition count kho thay doi	Tang partition → key routing thay doi → ordering bi pha. Giam partition → khong the. Chon partition count dung tu dau la quyet dinh architecture quan trong nhat
5	Pull model uu viet cho streaming	Consumer tu quyet dinh toc do, batch size, va khi nao doc. Back-pressure tu nhien. Replay de dang. Day la ly do Kafka (pull) thang the cho event streaming so voi RabbitMQ (push)
6	Zero-copy la “bi quyet” performance	Khong copy data tu kernel → user space → kernel. Truc tiep tu page cache → network. Giam CPU usage, giam latency. Day la ly do message queue dat millions msg/sec tren hardware thuong
7	Broker khong hieu message	Broker chi thay bytes. Khong parse, khong validate noi dung. Dieu nay giup broker cuc nhanh va generic — phu hop voi moi loai data
8	Exactly-once la ket hop nhieu co che	Khong co “magic button”. Can idempotent producer + transactions + idempotent consumer. Hieu tung co che giup em biet khi nao actually can exactly-once vs at-least-once da du

7.2 Common Pitfalls

#	Pitfall	Hau qua	Cach tranh
1	Khong set message key	Round-robin → khong ordering. Messages cua cung entity di vao partition khac nhau	Luon set key = entity ID (user_id, order_id) khi can ordering
2	Partition count qua it	Khong scale duoc consumers. Throughput bi gioi han	Chon partition count = 2-3x so consumers du kien
3	acks=0 cho critical data	Mat message khi broker crash	Dung acks=all + min.insync.replicas=2 cho moi data quan trong
4	Auto commit voi complex processing	Mat message hoac xu ly trung	Dung manual commit — chi commit sau khi xu ly xong
5	Khong co DLQ	Poison message block consumer vinh vien	Luon design DLQ voi max retries + exponential backoff
6	Tang partition count tuy tien	Key routing thay doi, ordering per key bi pha	Plan partition count truoc. Neu phai tang, chap nhan downtime cho re-keying
7	Khong monitor consumer lag	Data mat do retention expire truoc khi consumer doc	Alert consumer lag > threshold. Day la P2 alert
8	unclean.leader.election = true cho financial data	Mat messages khi leader crash	Tat unclean leader election cho moi topic quan trong
9	Rebalance storm	Consumer group lien tuc rebalance → khong ai xu ly messages	Tang session.timeout.ms, tang max.poll.interval.ms, dung sticky assignment
10	Khong encrypt inter-broker traffic	Attacker trong internal network doc duoc tat ca messages	TLS cho moi connection, ke ca inter-broker

8. Summary — Tong ket

8.1 So sanh thiet ke cua chung ta voi Kafka thuc te

Component	Thiet ke cua chung ta	Apache Kafka	Ghi chu
Storage	WAL + segments + index	Giong	Kafka dung chinh xac mo hinh nay
Replication	Leader-follower + ISR	Giong	ISR la sang tao cua Kafka team
Coordination	ZooKeeper/KRaft	ZK → KRaft	Kafka dang migrate sang KRaft
Consumer model	Pull + consumer groups	Giong	Pull la core design cua Kafka
Delivery semantics	At-most/least/exactly once	Giong	Exactly-once tu Kafka 0.11+
Retention	Time/size/compaction	Giong	3 policies giong nhau
Compression	gzip, snappy, lz4, zstd	Giong	zstd tu Kafka 2.1+

8.2 Khi nao dung message queue architecture nay?

Use case	Phu hop?	Ly do
Event streaming (logs, metrics, clickstream)	Rat phu hop	High throughput, retention, replay
Microservice communication (async)	Phu hop	Decoupling, at-least-once, consumer groups
Real-time analytics pipeline	Rat phu hop	Multiple consumer groups doc cung data
Task queue (background jobs)	Duoc, nhung RabbitMQ co the tot hon	Pull model khong optimal cho ngan job queue
Request-reply pattern	Khong phu hop	Pull model khong tot cho synchronous communication
Small-scale system (< 1000 msg/sec)	Overkill	Dung Redis Pub/Sub hoac SQS don gian hon

8.3 Lien ket voi cac bai khac

Bai	Lien quan
Tuan-08-Message-Queue	Bai nay bo sung — Tuan 08 day dung MQ, bai nay day thiet ke MQ
Tuan-10-Consistent-Hashing	Partition assignment co the dung consistent hashing de giam rebalance disruption
Tuan-07-Database-Sharding-Replication	Replication model (leader-follower, ISR) tuong tu database replication. Sharding tuong tu partitioning
Tuan-13-Monitoring-Observability	Consumer lag monitoring, broker health monitoring la ung dung truc tiep
Tuan-14-AuthN-AuthZ-Security	SASL, mTLS, ACL trong message queue la ung dung cua authn/authz patterns
Case-Design-Payment-System	Payment system dung message queue voi exactly-once semantics
Case-Design-Stock-Exchange	Stock exchange dung message queue cho event sequencing

“Hieu, sau bai nay em da biet cach thiet ke mot distributed message queue tu scratch — khong chi biet dung no. Khi interview hoi ‘Design Kafka’, em co the tu tin giai thich moi quyet dinh thiet ke: tai sao append-only log, tai sao pull model, tai sao ISR, tai sao partition la don vi cua parallelism. Do la su khac biet giua Backend Dev va System Architect.”

lthieu's notes

Explorer

Case-Design-Distributed-Message-Queue