Case Study: Design a Distributed Message Queue
"The message queue is the backbone of every distributed system. If you only understand how to use one, you are just a user — understand how to design one, and you become an architect."
Tags: system-design message-queue kafka distributed-log replication alex-xu-vol2 case-study Student: Hieu Source: Alex Xu — System Design Interview Volume 2, Chapter 4 Prerequisite: Tuan-01-Scale-From-Zero-To-Millions · Tuan-02-Back-of-the-envelope · Tuan-08-Message-Queue Related: Tuan-07-Database-Sharding-Replication · Tuan-10-Consistent-Hashing · Tuan-13-Monitoring-Observability · Case-Design-Payment-System
1. Context & Why — Why design a message queue from scratch?
1.1 Analogy: A national postal system
Hieu, imagine you are tasked with designing a national postal system from scratch. Not just placing a mailbox on a street corner — the entire system: from the moment a sender drops a letter into a box until the recipient holds it in their hands.
The system must solve the following problems:
- Ingestion: Millions of senders every day. Each post office must accept mail quickly, without making senders queue for long. Analogous to producers sending messages to a broker.
- Routing: Mail addressed to Ha Noi must go to Ha Noi; mail for Sai Gon must go to Sai Gon — never mixed up. Analogous to partitioning messages by key: all messages for the same customer go to the same partition to preserve ordering.
- Storage: Before delivery, mail must be stored safely in a warehouse. If one warehouse burns down, copies must exist in another. Analogous to replication — each partition has multiple replicas so no data is lost.
- Delivery: The mail carrier must deliver to the right person at the right address. If the recipient is not home, retry later. After too many failed attempts, move the letter to the dead-letter warehouse (Dead Letter Queue). Analogous to consumers reading and processing messages.
- Durability: One lost letter destroys trust in the whole system. Analogous to delivery guarantees — at-least-once, exactly-once.
- Deduplication: The same letter must never be delivered to the recipient twice. Analogous to exactly-once semantics with an idempotent producer.
- Tracking: Senders want to know whether their letter has arrived. Analogous to offset tracking — a consumer knows how far it has read.
1.2 The problem: design from scratch
In Tuan-08-Message-Queue you learned how to use message queues (Kafka, RabbitMQ) in system design. Now we will design a distributed message queue from scratch — like Kafka, Pulsar, or Redpanda.
This is a hard problem because it combines many concepts:
| Challenge | Explanation |
|---|---|
| High throughput | 1 million messages/sec — every message must be accepted, stored, and delivered |
| Durability | No message may be lost — even if a broker crashes |
| Ordering | Messages within the same partition must keep their order |
| Scalability | Add brokers and partitions as traffic grows |
| Fault tolerance | A broker dies → the system keeps operating normally |
| Low latency | Producer send → consumer receive within a few milliseconds |
| Replay capability | Consumers can re-read messages from any point in time |
| Exactly-once delivery | In many cases, duplicate processing is unacceptable |
1.3 Why not use an off-the-shelf solution?
A fair question: "Why not just use Kafka?"
- Interviews: This is a design exercise — the interviewer wants to see that you understand why Kafka is designed the way it is, not merely how to use it.
- Real world: Many large companies (LinkedIn, Uber, Confluent) have built or heavily customized their own message queues for specialized requirements.
- Deeper understanding: Once you know how to design a message queue, you will use one better — you will know which parameters matter, when to pick at-least-once vs exactly-once, and how to tune performance.
Aha Moment: This case study complements Tuan-08-Message-Queue. Week 08 taught you to use an MQ as a component; this one teaches you to build an MQ from scratch. Two complementary perspectives — user vs architect.
2. Deep Dive — Alex Xu's 4-Step Framework
Step 1: Requirements — Understand and scope the problem
2.1.1 Functional Requirements
| Feature | Description |
|---|---|
| Producer API | Producers send messages to a specific topic. Supports batching multiple messages per request |
| Consumer API | Consumers read messages from a topic. Pull-based model (the consumer actively fetches data) |
| Topic management | Create, delete, and list topics. Each topic has multiple partitions |
| Message retention | Keep messages for a time window (7 days by default) or up to a size limit |
| Message replay | Consumers can re-read messages from any offset within the retention period |
| Consumer group | A group of consumers shares a topic — each partition is assigned to exactly 1 consumer in the group |
| Delivery semantics | Supports at-most-once, at-least-once (default), and exactly-once |
| Message ordering | Ordering is guaranteed within a single partition |
2.1.2 Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Throughput | 1,000,000 messages/sec (write) | Event streaming for large systems (e-commerce, fintech, IoT) |
| Latency | < 10 ms P99 (end-to-end) | Producer send → consumer receive must be fast |
| Durability | No message loss with ≤ 2 simultaneous broker failures | Data is an asset — a lost message is lost money |
| Availability | 99.99% uptime | The message queue is the backbone — if it dies, the whole system dies |
| Scalability | Scale horizontally by adding brokers | Traffic grows → add machines, no redesign needed |
| Storage efficiency | Store efficiently on disk | Messages are kept for days — disk is the bottleneck |
| Ordering guarantee | Messages within a partition keep their order | Many use cases require strict ordering (payment events, order state changes) |
2.1.3 Out of Scope
To keep the exercise focused, we will not design:
- Schema registry (Avro, Protobuf schema management)
- Stream processing engine (Kafka Streams, Flink)
- Multi-datacenter replication (MirrorMaker)
- Tiered storage (hot/cold storage)
2.1.4 API Design
Producer API:
| API | Parameters | Description |
|---|---|---|
| `produce(topic, message, key?, partition?)` | `topic: string`, `message: bytes`, `key: bytes` (optional), `partition: int` (optional) | Send a message to a topic. If a key is given → `hash(key)` selects the partition. If a partition is given → send directly to it |
| `produce_batch(topic, messages[])` | `topic: string`, `messages: list of (key, value)` | Send many messages in one request — reduces network overhead |
Consumer API:
| API | Parameters | Description |
|---|---|---|
| `subscribe(topics[], group_id)` | `topics: list of string`, `group_id: string` | Join a consumer group and subscribe to topics |
| `poll(timeout)` | `timeout: ms` | Fetch a batch of new messages. Returns a list of messages |
| `commit(offsets)` | `offsets: map<partition, offset>` | Commit offsets — mark how far processing has progressed |
| `seek(partition, offset)` | `partition: int`, `offset: long` | Jump to an arbitrary position — used for replay |
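To make the API surface concrete, here is a minimal in-memory Python sketch of these four operations. Everything here is illustrative, not part of the real design: the class name `MiniQueue`, the single-topic simplification, and the dict-of-lists "storage" are all stand-ins for a real partitioned, replicated log.

```python
# Minimal in-memory sketch of the produce / poll / commit / seek API above.
# The topic argument is accepted but ignored (one-topic simplification).
import zlib
from collections import defaultdict

class MiniQueue:
    def __init__(self, num_partitions=3):
        self.num_partitions = num_partitions
        self.log = defaultdict(list)          # partition -> list of (key, value)
        self.committed = defaultdict(int)     # (group, partition) -> next offset

    def produce(self, topic, message, key=None, partition=None):
        if partition is None:                 # key-based hash, else partition 0
            partition = zlib.crc32(key) % self.num_partitions if key else 0
        self.log[partition].append((key, message))
        return partition, len(self.log[partition]) - 1   # (partition, offset)

    def poll(self, group_id, partition, max_records=100):
        start = self.committed[(group_id, partition)]    # resume at committed offset
        return self.log[partition][start:start + max_records]

    def commit(self, group_id, partition, offset):
        self.committed[(group_id, partition)] = offset

    def seek(self, group_id, partition, offset):
        self.committed[(group_id, partition)] = offset   # replay from here

q = MiniQueue()
p, off = q.produce("orders", b"order-created", key=b"user-42")
records = q.poll("order-processing", p)      # [(b"user-42", b"order-created")]
q.commit("order-processing", p, off + 1)     # mark the record as processed
```

Note how `commit` stores the *next* offset to read, so a subsequent `poll` returns nothing until new messages arrive — the same convention the real consumer API uses.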
Step 2: High-Level Design — Overall architecture
2.2.1 Main components
```mermaid
flowchart TB
    subgraph Producers
        P1["Producer 1<br/>(Order Service)"]
        P2["Producer 2<br/>(Payment Service)"]
        P3["Producer 3<br/>(IoT Devices)"]
    end
    subgraph "Broker Cluster"
        B1["Broker 1<br/>Partitions: 0, 3, 6"]
        B2["Broker 2<br/>Partitions: 1, 4, 7"]
        B3["Broker 3<br/>Partitions: 2, 5, 8"]
    end
    subgraph "Metadata Store"
        MS["ZooKeeper / etcd<br/>- Broker registry<br/>- Topic config<br/>- Partition assignment<br/>- Consumer group state"]
    end
    subgraph "Coordinator"
        CO["Coordinator<br/>- Leader election<br/>- Partition assignment<br/>- Consumer rebalancing"]
    end
    subgraph Consumers
        subgraph "Consumer Group A (Order Processing)"
            C1["Consumer A1"]
            C2["Consumer A2"]
            C3["Consumer A3"]
        end
        subgraph "Consumer Group B (Analytics)"
            C4["Consumer B1"]
            C5["Consumer B2"]
        end
    end
    P1 & P2 & P3 --> B1 & B2 & B3
    B1 & B2 & B3 <--> MS
    CO <--> MS
    CO --> B1 & B2 & B3
    B1 & B2 & B3 --> C1 & C2 & C3
    B1 & B2 & B3 --> C4 & C5
    style B1 fill:#1e88e5,color:#fff
    style B2 fill:#1e88e5,color:#fff
    style B3 fill:#1e88e5,color:#fff
    style MS fill:#f57c00,color:#fff
    style CO fill:#7b1fa2,color:#fff
```
2.2.2 Role of each component
| Component | Role | Details |
|---|---|---|
| Producer | Sends messages to brokers | Chooses a partition (round-robin, key-based hash, custom). Batches and compresses before sending |
| Broker | Stores and serves messages | Each broker owns a set of partitions. Writes messages to disk (WAL). Serves reads to consumers |
| Metadata Store | Holds cluster metadata | Broker list, topic config, partition assignment, consumer group state. ZooKeeper or etcd |
| Coordinator | Coordinates the cluster | Partition leader election, consumer group rebalancing, broker membership management |
| Consumer | Reads messages from brokers | Pull-based — the consumer actively fetches data. Tracks offsets to know how far it has read |
| Consumer Group | Consumers sharing the work | Each partition is assigned to exactly 1 consumer in the group. Enables parallel processing |
2.2.3 Message flow overview
```mermaid
sequenceDiagram
    participant P as Producer
    participant LB as Broker (Leader)
    participant F1 as Broker (Follower 1)
    participant F2 as Broker (Follower 2)
    participant C as Consumer
    P->>LB: 1. Send message batch
    LB->>LB: 2. Write to WAL (disk)
    LB->>F1: 3. Replicate to follower 1
    LB->>F2: 3. Replicate to follower 2
    F1-->>LB: 4. ACK replication
    F2-->>LB: 4. ACK replication
    LB-->>P: 5. ACK to producer (acks=all)
    C->>LB: 6. Poll(offset=100)
    LB->>LB: 7. Read from WAL/page cache
    LB-->>C: 8. Return messages [100-150]
    C->>C: 9. Process messages
    C->>LB: 10. Commit offset=150
```
Step 3: Deep Dive — Component details
2.3.1 Data Model
Topic
A topic is a logical channel for organizing messages. Like categories at the post office — business mail, personal mail, advertising mail.
| Attribute | Description | Example |
|---|---|---|
| Name | Topic name — a unique identifier | orders, payments, user-events |
| Partition count | Number of partitions — determines parallelism | 12 partitions |
| Replication factor | Number of copies of each partition | 3 (1 leader + 2 followers) |
| Retention | How long messages are kept | 7 days (default) |
| Cleanup policy | How old messages are removed | delete (by age) or compact (keep the latest per key) |
Partition
A partition is the basic unit of parallelism and ordering. Like multiple parallel counters at the post office — each operating independently.
Each partition is an ordered, immutable sequence of messages. Messages are appended at the end — never modified or deleted (removed only when retention expires).
Key properties:
- Messages within a single partition have guaranteed ordering
- Messages in different partitions have no ordering relative to each other
- Each partition has one leader and multiple followers (replicas)
- Only the leader serves reads and writes — followers only replicate
Offset
An offset is the sequence number of a message within a partition. It starts at 0 and increases. Like the sequence numbers of letters in a mailbox — letter 0, letter 1, letter 2…
| Property | Details |
|---|---|
| Data type | 64-bit integer |
| Range | 0 to 2^63 − 1 (enough for roughly 292,000 years at 1M msg/sec) |
| Behavior | Monotonically increasing within a partition |
| Purpose | Consumers use the offset to track how far they have read, and can seek to any offset |
Message (Record)
Each message consists of:
| Field | Type | Description |
|---|---|---|
| Key | bytes (nullable) | Determines the partition. E.g. user_id, order_id. If null → round-robin |
| Value | bytes | Message payload. Can be JSON, Avro, Protobuf. The broker does not care about the format — it is just bytes |
| Timestamp | long (epoch ms) | When the message was created (CreateTime) or when the broker received it (LogAppendTime) |
| Headers | list of key-value | Extra metadata. E.g. trace-id, source-service, content-type |
| Offset | long | Assigned by the broker when the message is written to the partition. Producers never set it |
| Partition | int | The partition the message belongs to |
| CRC | int | Checksum for detecting data corruption |
Aha Moment: The broker never parses or interprets message contents. To the broker, a message is just a blob of bytes. This keeps the broker extremely fast — it only writes bytes to disk and reads bytes from disk, with no serialization/deserialization overhead.
2.3.2 Broker Storage — On-disk layout
This is the most important part of the design. The storage design determines the throughput and durability of the entire system.
Write-Ahead Log (WAL) per Partition
Each partition is stored on disk as an append-only log. Like the post office's ledger — entries are only appended, never deleted or modified.
Why append-only?
- Sequential disk writes are extremely fast: HDD sequential writes reach 200–300 MB/s (commonly cited as up to ~1000× faster than random writes). SSD sequential writes reach 500–3000 MB/s.
- No complex locking: writes only go to the end → no conflicts
- Ordering comes for free: messages written in order are read back in the same order
Segment Files
Each partition's log is split into multiple segment files. Each segment is one file on disk.

```text
Topic: orders, Partition: 0
/data/orders-0/
  00000000000000000000.log        ← Segment 0: offsets 0 – 999,999
  00000000000000000000.index      ← Offset index for segment 0
  00000000000000000000.timeindex  ← Time index for segment 0
  00000000000001000000.log        ← Segment 1: offsets 1,000,000 – 1,999,999
  00000000000001000000.index
  00000000000001000000.timeindex
  00000000000002000000.log        ← Segment 2 (active segment — being written)
  00000000000002000000.index
  00000000000002000000.timeindex
```
Why split into segments?
| Reason | Explanation |
|---|---|
| Retention cleanup | Delete old segments to free disk space — drop whole files, no per-message scanning |
| Compaction | Compact old segments — keep only the latest value per key |
| File system limits | One huge file is slow for the OS to handle. Smaller segments are easier to manage |
| Recovery | On broker restart, only the active segment needs recovery — older segments were already flushed |
Segment configuration:
| Config | Default | Description |
|---|---|---|
| `segment.bytes` | 1 GB | Maximum size per segment |
| `segment.ms` | 7 days | Maximum age before rolling a new segment |
| `segment.index.bytes` | 10 MB | Index file size |
Index Files — fast message lookup
When a consumer wants to read from offset 1,500,000 — how does the broker find the right position on disk without scanning from the beginning?
Offset Index: maps an offset → its byte position within the segment file.

```text
Offset Index (sparse — not every offset is stored):
Offset 1,000,000 → Position 0
Offset 1,000,100 → Position 15,200
Offset 1,000,200 → Position 30,800
Offset 1,000,300 → Position 46,100
...
```

To find offset 1,000,150:
- Binary-search the index → find the greatest entry ≤ the target: offset 1,000,100 at position 15,200
- Scan sequentially from position 15,200 until offset 1,000,150 is reached
Timestamp Index: maps a timestamp → an offset. Used when a consumer wants to read from a specific point in time (e.g. "replay all messages from 2 days ago").

```text
Timestamp Index:
Timestamp 1710700000000 → Offset 1,000,000
Timestamp 1710700060000 → Offset 1,000,500
Timestamp 1710700120000 → Offset 1,001,200
...
```

Aha Moment: The index is sparse (it stores only every Nth offset, not all of them). This saves disk and memory. The trade-off: a short sequential scan is needed to find the exact message — but that stretch usually sits in the page cache, so it is very fast.
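The two-step lookup can be sketched in a few lines of Python. The index entries below mirror the illustrative table above (they are not real data); `bisect` performs the binary search for the greatest indexed offset not exceeding the target.

```python
# Sparse offset-index lookup: binary-search for the greatest indexed offset
# <= the target, then scan forward on disk from the returned byte position.
import bisect

# (offset, byte position in the segment file) — only every ~100th offset is indexed
index = [(1_000_000, 0), (1_000_100, 15_200),
         (1_000_200, 30_800), (1_000_300, 46_100)]

def locate(target_offset):
    offsets = [off for off, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1  # greatest entry <= target
    return index[i]   # the broker scans sequentially from this byte position

entry = locate(1_000_150)
# entry == (1_000_100, 15_200): start the sequential scan at byte 15,200
```

Making the index denser shrinks the scan but grows the index — the sparsity interval is exactly the knob the Aha Moment above describes.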
Memory-Mapped Files (mmap) and the Page Cache
This is the "secret sauce" behind the extremely high throughput of message queues:
The OS page cache: when the broker writes a message to a file, the data does not go straight to disk. It passes through the OS page cache first:
- Producer sends a message → the broker writes it to the page cache (RAM)
- The OS flushes the page cache to disk asynchronously
- Consumer reads the message → served from the page cache (if the data is still cached) → zero disk I/O
Result: in the ideal case (a consumer reading messages that were just written), every read is served entirely from RAM — the disk is never touched. This is why a message queue can reach millions of messages/sec on commodity hardware.
Memory-mapped files (mmap): index files are mapped into memory via mmap. Index lookups then behave like array accesses in RAM — no read() system calls.
| Technique | Applied to | Benefit |
|---|---|---|
| Page cache | Segment files (.log) | Fast writes (async flush), fast reads (from RAM) |
| mmap | Index files (.index, .timeindex) | Index access at RAM speed |
| sendfile() / zero-copy | Disk → network transfer | Lower CPU usage, fewer copies between kernel and user space |
Zero-copy transfer: when a consumer reads messages, the broker uses the sendfile() system call to move data directly from the page cache to the network socket without copying through user space — eliminating 2 copies and 2 context switches.
```mermaid
flowchart LR
    subgraph "Traditional (4 copies)"
        D1["Disk"] -->|"1. read()"| K1["Kernel Buffer"]
        K1 -->|"2. copy"| U1["User Buffer"]
        U1 -->|"3. write()"| K2["Socket Buffer"]
        K2 -->|"4. send"| N1["Network"]
    end
    subgraph "Zero-copy (2 copies)"
        D2["Disk/Page Cache"] -->|"1. sendfile()"| K3["Kernel Buffer"]
        K3 -->|"2. send"| N2["Network"]
    end
    style D2 fill:#43a047,color:#fff
    style K3 fill:#43a047,color:#fff
    style N2 fill:#43a047,color:#fff
```
2.3.3 Producer — Sending messages efficiently
Partitioning Strategy
When a producer sends a message, it must choose which partition receives it. There are 3 strategies:
1. Round-Robin (no key):
- Messages are sent to partitions in turn: P0, P1, P2, P0, P1, P2…
- Pros: even load distribution — no hotspots
- Cons: no ordering guarantee — messages for the same entity may land in different partitions
- Use when: ordering does not matter (e.g. log messages, metrics)
2. Key-based Hash (with a key):
- partition = hash(key) % num_partitions
- All messages with the same key → always the same partition
- Pros: ordering guaranteed per key
- Cons: hotspots if keys are skewed (e.g. a celebrity user generating many events)
- Use when: per-entity ordering is needed (e.g. key = order_id → all events of one order arrive in order)
3. Custom Partitioner:
- The producer implements its own partition-selection logic
- Use when: special routing is needed — e.g. high-priority messages to a dedicated partition, geo-based routing
Pitfall: When the partition count changes (say from 6 to 12), hash(key) % num_partitions changes too. Messages for the same key may start going to a different partition. This is why the partition count is very hard to change once you are in production. Choose it correctly from the start!
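The pitfall above is easy to demonstrate. A sketch using `crc32` as a deterministic stand-in for the producer's hash function (real clients use murmur2, crc32, etc. — the exact function does not matter to the argument):

```python
# When the partition count changes, hash(key) % num_partitions changes for
# many keys, breaking per-key ordering across the resize boundary.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

key = b"order-12345"
before = partition_for(key, 6)    # partition while the topic has 6 partitions
after = partition_for(key, 12)    # partition after scaling to 12

# Over many keys, roughly half land in a different partition after the resize:
moved = sum(1 for i in range(1000)
            if partition_for(b"key-%d" % i, 6) != partition_for(b"key-%d" % i, 12))
```

Any key whose partition "moved" now has old messages in one partition and new messages in another — a consumer can no longer rely on reading them in order.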
Batching
The producer does not send messages one at a time. It groups many messages into a batch and sends them together.
| Config | Default | Description |
|---|---|---|
| `batch.size` | 16 KB | Maximum batch size |
| `linger.ms` | 0 ms (send immediately) | How long to wait to accumulate more messages into a batch |
Trade-offs:
- `linger.ms = 0`: lowest latency, but lower throughput (messages sent one by one)
- `linger.ms = 5–50 ms`: significantly higher throughput (larger batches), at the cost of a few ms of extra latency
- Large `batch.size`: fewer network round-trips, but higher memory usage
Compression
Batches are compressed before going over the network. The broker stores the batch compressed — it never decompresses. Only the consumer decompresses.
| Algorithm | Compression ratio | CPU usage | Speed | When to use |
|---|---|---|---|---|
| gzip | Highest (~70–80%) | High | Slow | Bandwidth-limited, large batches |
| snappy | Medium (~50–60%) | Low | Fast | Balance between compression and CPU |
| lz4 | Medium (~55–65%) | Very low | Very fast | Default choice — best for most cases |
| zstd | High (~65–75%) | Medium | Fast | Better compression than lz4, at some CPU cost |
Aha Moment: Compression happens at the batch level, not per message. The larger the batch → the better the compression ratio (more similar data to compress). This is why raising `linger.ms` and `batch.size` saves significant bandwidth.
Acknowledgment (acks)
After sending a message, the producer waits for the broker's acknowledgment (ACK). The acks level trades durability against latency:
| acks | Behavior | Durability | Latency | When to use |
|---|---|---|---|---|
| 0 | Producer does not wait for an ACK — "fire and forget" | Lowest — messages can be lost | Lowest | Metrics, logs — loss is acceptable |
| 1 | Wait for the leader's ACK. Leader persists the write → ACK | Medium — lost if the leader crashes before replicating | Medium | Most cases |
| all (-1) | Wait for ACKs from all ISR (In-Sync Replicas). Leader + followers persist → ACK | Highest — lost only if all ISR crash at once | Highest | Payment events, critical data — must not be lost |
Pitfall: `acks=all` does not mean all replicas. It only waits for all replicas currently in the ISR (In-Sync Replicas). If the ISR has shrunk to just the leader because the followers are lagging → `acks=all` degenerates to `acks=1`. This is why you must also set `min.insync.replicas=2`.
2.3.4 Consumer — Reading messages efficiently
Consumer Group
A consumer group is the mechanism for dividing work among multiple consumers. Like a team of mail carriers — each covers one area, with no overlap.
Golden rules:
- Within one consumer group: each partition is assigned to exactly 1 consumer
- 1 consumer may read multiple partitions
- If `num_consumers > num_partitions` → the extra consumers sit idle
- If `num_consumers < num_partitions` → some consumers read multiple partitions
- Ideal: `num_consumers = num_partitions`
Multiple consumer groups read the same topic independently of one another:

```text
Topic: orders (6 partitions)
Consumer Group "order-processing":
  Consumer 1 → Partitions 0, 1
  Consumer 2 → Partitions 2, 3
  Consumer 3 → Partitions 4, 5
Consumer Group "analytics":
  Consumer A → Partitions 0, 1, 2
  Consumer B → Partitions 3, 4, 5
Consumer Group "audit":
  Consumer X → Partitions 0, 1, 2, 3, 4, 5 (one consumer reads everything)
```

Each group keeps its own offsets. Group "order-processing" may have reached offset 10,000 while group "analytics" is still at offset 5,000.
Partition Assignment Strategies
When a consumer joins or leaves the group, partitions must be reassigned. Several strategies exist:
1. Range Assignment:
- Sort the partitions by number and split them evenly among consumers
- Example: 6 partitions, 3 consumers → C1: [P0,P1], C2: [P2,P3], C3: [P4,P5]
- Pros: simple, deterministic
- Cons: when the split is uneven, the first consumers can end up with more partitions
2. Round-Robin Assignment:
- Distribute partitions in turn: C1→P0, C2→P1, C3→P2, C1→P3, C2→P4, C3→P5
- Pros: more even distribution than range
- Cons: a rebalance moves many partitions around
3. Sticky Assignment:
- Keep as much of the previous assignment as possible; move only the partitions of consumers that have left
- Pros: less disruption on rebalance — consumers keep their accumulated state
- Cons: more complex to implement
Aha Moment: Sticky assignment is the production best practice. When one consumer crashes, only its partitions are reassigned — the other consumers keep theirs, dramatically reducing "stop the world" time.
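The first two strategies are simple enough to sketch directly. The function names below are illustrative; real clients implement these as pluggable assignor classes.

```python
# Range assignment: sort consumers, give each a contiguous slice of partitions;
# the first consumers absorb any remainder (the unevenness noted above).
def range_assign(partitions, consumers):
    per, extra = divmod(len(partitions), len(consumers))
    out, i = {}, 0
    for n, c in enumerate(sorted(consumers)):
        take = per + (1 if n < extra else 0)
        out[c] = partitions[i:i + take]
        i += take
    return out

# Round-robin assignment: deal partitions out one at a time, in turn.
def round_robin_assign(partitions, consumers):
    cs = sorted(consumers)
    out = {c: [] for c in cs}
    for n, p in enumerate(partitions):
        out[cs[n % len(cs)]].append(p)
    return out

plan = range_assign([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"])
# plan == {'c1': [0, 1], 'c2': [2, 3], 'c3': [4, 5]}
```

With 7 partitions and 3 consumers, `range_assign` gives c1 three partitions and the others two — exactly the skew the "cons" bullet describes.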
Offset Management
A consumer must tell the broker how far it has read. This process is called committing an offset.
Auto Commit:
- The consumer commits automatically every `auto.commit.interval.ms` (default 5 s)
- Pros: simple, no extra code
- Cons: messages can be lost or processed twice
- Consumer reads a message, has not finished processing → auto commit fires → consumer crashes → the message is lost (at-most-once)
- Consumer finishes processing, has not committed → crashes → rebalance → the new consumer re-reads it (at-least-once with duplicates)
Manual Commit:
- The consumer commits only after it has fully processed the messages
- Synchronous commit: `commitSync()` — blocks until the commit succeeds. Safe but slow
- Asynchronous commit: `commitAsync()` — non-blocking. Fast but can fail silently
Best practice: use `commitAsync()` inside the processing loop and `commitSync()` before the consumer shuts down. Combining both balances performance and safety.
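The crucial detail in manual commit is the ordering: process first, commit after. The sketch below simulates that loop against a plain Python list standing in for the broker (no real client is involved), so the at-least-once property is visible: a crash between processing and the commit line would replay the current batch, never skip it.

```python
# At-least-once consumption sketch: do the work BEFORE committing the offset.
# Duplicates are possible on a crash mid-batch; message loss is not.
records = [f"msg-{i}" for i in range(5)]   # stand-in for a partition's log
committed_offset = 0                       # durable consumer position
processed = []

def process(record):
    processed.append(record)               # the "business logic"

while committed_offset < len(records):
    batch = records[committed_offset:committed_offset + 2]   # poll()
    for record in batch:
        process(record)                    # 1. process every record in the batch
    committed_offset += len(batch)         # 2. only then advance the commit
# A crash before step 2 re-delivers the batch on restart — duplicates, no loss.
```

Committing in the opposite order (commit first, process after) turns the same loop into at-most-once: a crash between the two steps silently drops the batch.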
Rebalancing Protocol
When a consumer joins or leaves the group, the system must rebalance — reassign partitions to consumers.
```mermaid
sequenceDiagram
    participant C1 as Consumer 1
    participant C2 as Consumer 2
    participant C3 as Consumer 3 (new)
    participant GC as Group Coordinator
    Note over C1,GC: Consumer 3 joins group
    C3->>GC: JoinGroup request
    GC->>GC: Trigger rebalance
    GC->>C1: Revoke partitions
    GC->>C2: Revoke partitions
    C1->>C1: Commit offsets, release partitions
    C2->>C2: Commit offsets, release partitions
    C1->>GC: JoinGroup (re-join)
    C2->>GC: JoinGroup (re-join)
    C3->>GC: JoinGroup
    GC->>C1: Assign: Consumer 1 is leader
    Note over C1: Leader computes assignment
    C1->>GC: SyncGroup (assignment plan)
    GC->>C1: SyncGroup response (P0, P1)
    GC->>C2: SyncGroup response (P2, P3)
    GC->>C3: SyncGroup response (P4, P5)
    Note over C1,C3: Resume consuming with new assignment
```
Eager Rebalancing (traditional):
- All consumers stop, give back their partitions, then receive new assignments
- Problem: "stop the world" — during the rebalance, no consumer can read anything
Cooperative / Incremental Rebalancing (newer):
- Only the partitions that must move are revoked. Consumers keep the partitions that are unaffected
- Pros: far less disruption — most consumers keep working normally throughout the rebalance
- This is the best practice for production systems
Pitfall: Rebalancing is the most dangerous moment in a consumer group's life. During a rebalance nobody processes messages → consumer lag grows. If rebalances happen continuously (unstable consumers, session timeout set too short) → the system grinds to a halt. Monitoring rebalance frequency is critical.
2.3.5 Replication — Guaranteeing no data loss
Leader-Follower Model
Each partition has 1 leader and several followers. Only the leader serves reads and writes. Followers only replicate from the leader.
```mermaid
flowchart TB
    subgraph "Topic: orders, Partition 0"
        subgraph "Broker 1"
            L0["Partition 0<br/>LEADER<br/>Offset: 0-15,000"]
        end
        subgraph "Broker 2"
            F0a["Partition 0<br/>FOLLOWER (ISR)<br/>Offset: 0-14,998"]
        end
        subgraph "Broker 3"
            F0b["Partition 0<br/>FOLLOWER (ISR)<br/>Offset: 0-14,995"]
        end
    end
    subgraph "Topic: orders, Partition 1"
        subgraph "Broker 2 "
            L1["Partition 1<br/>LEADER<br/>Offset: 0-12,000"]
        end
        subgraph "Broker 3 "
            F1a["Partition 1<br/>FOLLOWER (ISR)<br/>Offset: 0-11,999"]
        end
        subgraph "Broker 1 "
            F1b["Partition 1<br/>FOLLOWER (ISR)<br/>Offset: 0-11,997"]
        end
    end
    L0 -->|"Replicate"| F0a
    L0 -->|"Replicate"| F0b
    L1 -->|"Replicate"| F1a
    L1 -->|"Replicate"| F1b
    style L0 fill:#e53935,color:#fff
    style L1 fill:#e53935,color:#fff
    style F0a fill:#1e88e5,color:#fff
    style F0b fill:#1e88e5,color:#fff
    style F1a fill:#1e88e5,color:#fff
    style F1b fill:#1e88e5,color:#fff
```
Why are leaders spread across multiple brokers?
- Partition 0's leader on Broker 1, Partition 1's leader on Broker 2…
- Load is distributed evenly — no single broker gets overloaded
- If Broker 1 dies → only the leaders hosted on Broker 1 need failover
ISR (In-Sync Replicas)
The ISR is the set of replicas that are keeping up with the leader. A replica stays in the ISR if:
- It has fetched recent data from the leader and is not lagging too far behind
- Config: `replica.lag.time.max.ms` (default 30 s) — a follower that has not fetched within 30 s is removed from the ISR
| Term | Definition |
|---|---|
| ISR | The set of replicas in sync with the leader (including the leader itself) |
| OSR (Out-of-Sync Replicas) | Lagging replicas — not caught up with the leader |
| LEO (Log End Offset) | The offset of the latest message on each replica |
| HW (High Watermark) | The highest offset replicated by all ISR members. Consumers can only read up to the HW |
What is the High Watermark?

```text
Leader:    [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
Follower1: [0] [1] [2] [3] [4] [5] [6] [7]
Follower2: [0] [1] [2] [3] [4] [5] [6]
                                    ↑
                         High Watermark = 6
```

Consumers can only read messages with offset ≤ 6 (the High Watermark). Messages 7, 8, 9 exist on the leader but have not been replicated by all ISR members → not yet visible to consumers.
Why? If the leader crashed before replicating 7, 8, 9 → a follower would become the new leader → messages 7, 8, 9 would be gone. If a consumer had already read them → data inconsistency.
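The rule is just a minimum. If we define each replica's LEO as the *next* offset it will write (so a replica holding messages 0–6 has LEO 7), the high watermark is the minimum LEO across the ISR, and consumers read strictly below it — up to offset 6 in the diagram above. A sketch, with the LEO values taken from that diagram:

```python
# High watermark = min Log End Offset across the ISR, where LEO is the next
# offset a replica will write. Consumers may read offsets strictly below it.
def high_watermark(leo_by_replica, isr):
    return min(leo_by_replica[r] for r in isr)

leo = {"leader": 10, "follower1": 8, "follower2": 7}   # from the diagram above
hw = high_watermark(leo, isr=["leader", "follower1", "follower2"])
# hw == 7: offsets 0..6 are visible to consumers; 7, 8, 9 are not yet
```

If follower2 fell out of the ISR, the watermark would jump to `min(10, 8) = 8`, instantly exposing offset 7 — which is why ISR membership and the watermark must move together.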
min.insync.replicas
This config sets the minimum ISR size required before the leader accepts writes.
| Setting | Behavior |
|---|---|
| `replication.factor = 3` | Each partition has 3 replicas |
| `min.insync.replicas = 2` | At least 2 replicas must be in the ISR for a write to succeed |
| `acks = all` | The producer waits for all ISR members to acknowledge |
Combined: acks=all + min.insync.replicas=2 + replication.factor=3
- Producer sends a message → the leader plus at least 1 follower persist it → ACK
- The system tolerates 1 broker failure with no data loss and no write outage
- If 2 brokers die → ISR < min.insync.replicas → the leader rejects writes → the producer receives an error
Aha Moment: `min.insync.replicas` is the "circuit breaker" of replication. It stops the leader from accepting writes when there are not enough replicas — preventing the scenario where a message is written only to the leader, the leader then dies, and the data is lost.
Unclean Leader Election
When the leader dies and no follower is in the ISR (all followers are lagging) — the system has 2 choices:
| Choice | Config | Consequence |
|---|---|---|
| Wait for an ISR follower to come back online | `unclean.leader.election.enable = false` | The partition stays unavailable until an ISR follower returns. Durability > Availability |
| Elect a non-ISR follower as leader | `unclean.leader.election.enable = true` | The partition resumes immediately, but unreplicated messages are lost. Availability > Durability |
The classic trade-off: Durability vs Availability
- Payment systems, financial data → `false` (data loss is unacceptable)
- Metrics, logs → `true` (losing a few messages is fine, as long as the system stays up)
2.3.6 Delivery Semantics — Message delivery guarantees
This is one of the hardest and most frequently misunderstood concepts in message queues.
At-Most-Once
Definition: each message is processed at most once. It can be lost, but it is never duplicated.
How to implement:
- Producer: `acks = 0` (do not wait for ACKs)
- Consumer: commit the offset before processing the message
When to use: metrics, logs, analytics — losing a few data points does not change the result.
Risk: a lost message is gone forever — neither the producer nor the consumer knows.
At-Least-Once (Default)
Definition: each message is processed at least once. Never lost, but possibly duplicated.
How to implement:
- Producer: `acks = all` + retry on send failure
- Consumer: process the message first, then commit the offset
When to use: most cases — notifications, email, order processing (with an idempotent consumer).
Risk: a message may be processed twice. Example: the consumer finishes processing, crashes before committing the offset → the message is processed again.
Solution: the consumer must be idempotent — processing the same message twice yields the same result. Example: use the message's unique ID to check "already processed?" before acting.
Exactly-Once
Definition: each message is processed exactly once. Never lost, never duplicated.
How to implement (the most complex):
- Idempotent Producer:
  - The broker assigns each producer a Producer ID (PID)
  - Each message carries a sequence number that increases per partition
  - If the broker receives a message with a PID + sequence number it has already seen → it silently drops it (deduplication)
  - Solves: producer retries that resend duplicates → the broker writes only once
- Transactional API:
  - The producer starts a transaction → writes messages to multiple partitions → commits or aborts
  - Consumers read only messages from committed transactions (`isolation.level = read_committed`)
  - Solves: atomic writes across partitions — either all the messages are written, or none of them are
- Consumer-side deduplication:
  - Even with an idempotent producer and transactions, the consumer must still process idempotently
  - Reason: the broker's "exactly-once" only guarantees a single write. Consumer processing is separate logic — it can fail midway
Aha Moment: "Exactly-once" in distributed systems is extremely hard and is usually a combination of mechanisms: idempotent producer (deduplicate writes) + transactions (atomic writes) + idempotent consumer (deduplicate processing). There is no magic button that makes things exactly-once automatically.
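The idempotent-producer mechanism above boils down to a small amount of broker-side state: the last sequence number appended per producer. A sketch (class and method names are illustrative):

```python
# Broker-side deduplication for the idempotent producer: the broker keeps the
# last sequence number appended per producer ID and drops retried duplicates.
class PartitionLog:
    def __init__(self):
        self.records = []
        self.last_seq = {}          # producer_id -> last appended sequence number

    def append(self, producer_id, seq, message):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False            # duplicate retry — already written, drop it
        self.records.append(message)
        self.last_seq[producer_id] = seq
        return True

log = PartitionLog()
log.append("pid-1", 0, b"charge $10")
log.append("pid-1", 0, b"charge $10")   # network retry of the same send
# log.records == [b"charge $10"]: written exactly once despite the retry
```

Note what this does *not* solve: if the consumer processes `b"charge $10"` and crashes before committing its offset, it will process it again — which is why consumer-side idempotency remains necessary.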
2.3.7 Message Retention — How long are messages kept?
Unlike a traditional message queue (RabbitMQ deletes messages once they are consumed), a distributed message queue keeps messages according to a retention policy — regardless of whether they have been consumed.
Time-Based Retention
| Config | Default | Description |
|---|---|---|
| `retention.ms` | 604,800,000 (7 days) | Messages older than 7 days are deleted |
| `retention.minutes` | – | Retention expressed in minutes |
| `retention.hours` | 168 | Retention expressed in hours |
When the oldest segment file's messages are older than the retention window → the entire segment file is deleted.
Size-Based Retention
| Config | Default | Description |
|---|---|---|
| `retention.bytes` | -1 (unlimited) | Maximum total size per partition |
When the total size of the segments exceeds `retention.bytes` → the oldest segment is deleted.
Log Compaction — Keep the latest value per key
This is a special retention policy. Instead of deleting by time, the broker keeps the latest message for each key and deletes the older ones.
BEFORE compaction:
Key=user1, Value=Hanoi (offset 0)
Key=user2, Value=Saigon (offset 1)
Key=user1, Value=Danang (offset 2) ← user1 updated their address
Key=user3, Value=Hue (offset 3)
Key=user2, Value=null (offset 4) ← user2 deleted (tombstone)
Key=user1, Value=Dalat (offset 5) ← user1 updated again
AFTER compaction:
Key=user3, Value=Hue (offset 3) ← kept as-is
Key=user2, Value=null (offset 4) ← tombstone (purged later)
Key=user1, Value=Dalat (offset 5) ← only the latest kept
When should you use it?
- Database changelog (CDC): each row is a key, the value is its latest state
- Configuration store: each key is a config name, the value is the config value
- User profile updates: each key is a user_id, the value is the latest profile
Pitfall: Log compaction does not happen immediately. There is a "dirty ratio" threshold. The compaction thread runs in the background and can generate heavy I/O load. In production, monitor disk I/O while compaction runs.
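The user1/user2/user3 example above can be reproduced with a short sketch. This is the compaction *idea* only (a real compactor works segment by segment in the background; `compact` here is illustrative):

```python
# Toy sketch of log compaction: keep only the latest record per key,
# preserving offsets. Tombstones (value=None) survive this pass and are
# purged later, as in the example above.

def compact(log):
    """log: list of (offset, key, value) in offset order."""
    latest = {}                      # key -> (offset, key, value)
    for record in log:
        _, key, _ = record
        latest[key] = record         # later offsets overwrite earlier ones
    return sorted(latest.values())   # back in offset order

log = [
    (0, "user1", "Hanoi"),
    (1, "user2", "Saigon"),
    (2, "user1", "Danang"),
    (3, "user3", "Hue"),
    (4, "user2", None),              # tombstone
    (5, "user1", "Dalat"),
]
assert compact(log) == [(3, "user3", "Hue"), (4, "user2", None), (5, "user1", "Dalat")]
```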
2.3.8 Coordination — Managing the cluster
ZooKeeper (traditional)
ZooKeeper is the external coordination service Kafka used from the beginning to:
| Function | Details |
|---|---|
| Broker registry | Each broker registers with ZooKeeper on startup. ZK knows which brokers are alive |
| Controller election | Elects one broker as the "Controller" — responsible for partition leader election |
| Topic/partition metadata | Stores the list of topics, partitions, and replica assignments |
| Consumer group state | (legacy) Stored consumer offsets and group membership. Since Kafka 0.9+, consumer offsets live in the internal `__consumer_offsets` topic |
Problems with ZooKeeper:
- One more dependency: ZooKeeper is a separate system that must be deployed and operated on its own
- Scaling bottleneck: a ZK cluster is typically only 3-5 nodes. Metadata updates become a bottleneck for large clusters (thousands of partitions)
- Split-brain risk: if ZK and the broker cluster are network-partitioned → there can be two leaders for the same partition
- Operational complexity: you must know how to operate two systems: Kafka + ZooKeeper
KRaft (ZooKeeper-less) — The new direction
Since Kafka 3.3+, KRaft replaces ZooKeeper with internal Raft-based consensus:
| Aspect | ZooKeeper | KRaft |
|---|---|---|
| Architecture | External ZK cluster | Built-in — metadata stored in Kafka brokers |
| Metadata storage | ZK znodes | Internal `__cluster_metadata` topic (Raft log) |
| Controller | 1 broker elected controller by ZK | Quorum of controllers (3-5 brokers with the controller role) |
| Failover time | 10-30 seconds (ZK session timeout) | 1-3 seconds (Raft election) |
| Scaling | ZK is the bottleneck | Metadata replicated via Raft — scales better |
| Operations | 2 clusters (Kafka + ZK) | 1 cluster (Kafka only) |
Controller and metadata management in KRaft:
- A group of brokers carries the controller role (usually 3 or 5)
- The controllers use Raft consensus to elect a leader and replicate metadata
- The active controller handles all metadata changes (topic creation, leader election, etc.)
- The other controllers are followers — ready to take over if the active controller dies
Aha Moment: KRaft is not simply "dropping ZooKeeper". It fundamentally changes how Kafka manages metadata — from external coordination to internal log-based coordination. This is the direction of distributed systems: from outside in, from complexity to simplicity.
2.3.9 Push vs Pull — Two consumption models
| Aspect | Pull (Kafka model) | Push (RabbitMQ model) |
|---|---|---|
| Who controls the rate? | Consumer — pulls data at its own pace | Broker — pushes data down to the consumer |
| Slow consumer? | No impact on the broker — data stays on disk | Broker must buffer or drop messages |
| Fast consumer? | Pulls data continuously, never waits | Broker may not push fast enough |
| Batching | Consumer picks its own optimal batch size | Broker decides the batch — may not fit |
| Long polling | Consumer polls with a timeout — if there is no new data, it waits a while and polls again | Not needed — broker pushes as soon as data arrives |
| Back-pressure | Natural — the consumer only pulls when ready | Needs a dedicated mechanism (prefetch count, flow control) |
| Replay | Easy — the consumer just seeks to an offset | Hard — messages are deleted after ACK |
| Complexity | Consumer is more complex (must manage its own polling) | Consumer is simpler (just receive and process) |
Why does our design choose Pull?
- The consumer sets its own pace — it never gets overwhelmed
- Optimal batching — the consumer pulls exactly as much data as it can handle
- Easy replay — seek to any offset
- Natural back-pressure — no special mechanism needed
- Consumers scale independently — adding consumers does not affect the broker
Downside of Pull: when there is no new data, the consumer still polls → wasted work. Solution: long polling — the consumer sends a poll request, and the broker holds the request until new data arrives or the timeout expires.
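The long-polling idea can be simulated with a blocking queue. This is a toy model (the class and method names are illustrative, not a real client API): the "broker" holds the fetch request until data arrives or the timeout fires, so an idle consumer does not busy-poll:

```python
import queue
import threading

# Toy sketch of long polling: poll() blocks until a message arrives or the
# timeout expires, instead of returning empty immediately.

class LongPollBroker:
    def __init__(self):
        self.q = queue.Queue()

    def publish(self, msg):
        self.q.put(msg)

    def poll(self, timeout_s):
        """Block up to timeout_s waiting for one message; None on timeout."""
        try:
            return self.q.get(timeout=timeout_s)
        except queue.Empty:
            return None

broker = LongPollBroker()
assert broker.poll(timeout_s=0.05) is None        # no data: held, then timed out
threading.Timer(0.05, broker.publish, args=("order-42",)).start()
assert broker.poll(timeout_s=1.0) == "order-42"   # held until data arrived
```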
2.3.10 Dead Letter Queue (DLQ) — Handling "dead letters"
Problem: the consumer receives a message but cannot process it. For example:
- The message is corrupt (bad format)
- Business logic fails (invalid data)
- A dependency is broken (database down → retries keep failing)
If it keeps retrying forever → the consumer gets stuck on this message and cannot process the ones behind it. This is called a poison message.
Solution: Dead Letter Queue
Main Topic: orders
↓ (consumer reads)
Consumer: processes the message
↓ (success) → commit offset, continue
↓ (failure #1) → retry
↓ (failure #2) → retry
↓ (failure #3) → move to DLQ
DLQ Topic: orders-dlq
↓ (ops team reviews, fixes, re-publishes)
How to implement a DLQ:
- The consumer has `max.retries` (e.g. 3)
- After retries are exhausted → produce the message to the DLQ topic (e.g. `orders-dlq`)
- Commit the offset of the original message → the consumer moves on to the next message
- The ops team monitors the DLQ topic
- After the root cause is fixed, messages in the DLQ are re-published to the main topic
| Config | Description |
|---|---|
| Max retries | Number of retries before moving to the DLQ |
| Retry delay | Wait time between retries (exponential backoff) |
| DLQ topic name | Convention: {original-topic}-dlq |
| DLQ retention | Usually longer than the main topic (30-90 days) to leave time to investigate |
Pitfall: Without a DLQ, a poison message either blocks the consumer forever (if auto-commit is off) or is lost (if auto-commit is on and the consumer crashes before processing). Both cases are dangerous. Always design a DLQ for production systems.
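The retry-then-DLQ flow above fits in a few lines. This is a sketch under assumed names (`consume`, `publish`, the `-dlq` suffix follow the convention described, but none of this is a real client API):

```python
# Toy sketch of the retry-then-DLQ flow: try the handler max_retries times,
# then publish to "<topic>-dlq" and advance, so a poison message never blocks.

def consume(message, handler, publish, topic="orders", max_retries=3):
    for _ in range(max_retries):
        try:
            handler(message)
            return "processed"           # success: caller commits the offset
        except Exception:
            continue                      # real code would back off exponentially
    publish(f"{topic}-dlq", message)      # poison message: park it for ops review
    return "dead-lettered"                # offset still advances -> no blocking

def always_fails(msg):
    raise ValueError("bad format")        # simulates a poison message

dlq = []
assert consume({"id": 1}, always_fails, lambda t, m: dlq.append((t, m))) == "dead-lettered"
assert dlq == [("orders-dlq", {"id": 1})]
assert consume({"id": 2}, lambda m: None, lambda t, m: dlq.append((t, m))) == "processed"
assert len(dlq) == 1                      # healthy messages never touch the DLQ
```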
2.3.11 Scaling — Growing the system
Adding a Broker
When the cluster needs more capacity (disk, CPU, network):
- Start the new broker and join it to the cluster
- The new broker has no partitions yet — it is "empty"
- An admin runs partition reassignment — moving some partitions from old brokers to the new one
- During the move, data is replicated from the old brokers to the new one
- Once replication finishes, the new broker takes on its role (leader or follower) and the old broker releases the partition
Problem: partition reassignment consumes bandwidth and disk I/O. In production:
- Run reassignments outside peak hours
- Throttle the reassignment rate so it does not affect production traffic
- Monitor replication lag during the reassignment
Adding Partitions
When a topic needs more parallelism (more consumers reading concurrently):
- An admin increases the topic's partition count (e.g. from 6 to 12)
- The broker creates the new partitions across brokers
- The consumer group rebalances — partitions are reassigned to consumers
- New messages are distributed across both old and new partitions
Serious problems:
- Key ordering breaks: `hash(key) % 6` differs from `hash(key) % 12`. Messages for a key that used to land on P0 may now land on P6. Per-key ordering is broken during the transition.
- Partitions cannot be decreased: Kafka does not support reducing the partition count. Once increased, there is no going back.
Pitfall: Partition count is the hardest decision to change in a message queue. Too few → you cannot scale. Too many → wasted resources (each partition costs memory, file handles, and replication bandwidth). Rule of thumb: start with a partition count of 2-3x the expected number of consumers, leaving room to scale.
3. Estimation — Sizing the system
3.1 Throughput Estimation
Assumptions:
| Parameter | Value | Explanation |
|---|---|---|
| Target throughput | 1,000,000 messages/sec | From the requirements |
| Average message size | 1 KB | Average JSON event |
| Peak/Average ratio | 3x | Flash sales, events |
| Replication factor | 3 | 1 leader + 2 followers |
3.2 Storage Estimation
| Parameter | Value |
|---|---|
| Messages/day | 1,000,000 msg/sec x 86,400 sec = 86.4 billion messages |
| Data/day (raw) | 86.4B x 1 KB = 86.4 TB |
| Compression ratio | ~50% (lz4) |
| Data/day (compressed) | 86.4 TB x 0.5 = 43.2 TB |
| Replication factor | 3 |
| Data/day (with replication) | 43.2 TB x 3 = 129.6 TB |
| Retention | 7 days |
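Carrying this table through the 7-day retention gives the storage figure quoted later in the summary (3.5). A quick check of the arithmetic:

```python
# Arithmetic behind the storage table, carried through 7-day retention.
# Uses decimal units (1 TB = 1e9 KB), matching the table above.

msgs_per_sec = 1_000_000
msg_size_kb = 1
seconds_per_day = 86_400

raw_tb_per_day = msgs_per_sec * msg_size_kb * seconds_per_day / 1e9
compressed = raw_tb_per_day * 0.5      # ~50% lz4 compression
replicated = compressed * 3            # replication factor 3
week = replicated * 7                  # 7-day retention

assert round(raw_tb_per_day, 1) == 86.4
assert round(replicated, 1) == 129.6
assert round(week, 1) == 907.2         # ~0.9 PB, as in the summary table
```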
3.3 Network Bandwidth Estimation
Inbound (producer → broker):
Replication bandwidth (leader → followers):
Outbound (broker → consumers):
Assume 3 consumer groups, each reading the full data stream:
Total network per cluster:
Each broker needs a 10 Gbps NIC to have enough headroom.
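The figures under the headings above did not survive; the sketch below reconstructs them from the stated assumptions (1M msg/s x 1 KB, RF=3, 3 consumer groups, ~19 brokers) — treat the intermediate numbers as my reconstruction, not the author's. The per-broker result does land on the ~2.5 Gbps quoted in the summary (3.5):

```python
# Reconstructed bandwidth estimate from the section's stated assumptions.

inbound_gbs = 1_000_000 * 1 / 1e6    # 1M msg/s x 1 KB = 1 GB/s from producers
replication_gbs = inbound_gbs * 2    # leader streams to 2 followers
outbound_gbs = inbound_gbs * 3       # 3 consumer groups, each reads everything
total_gbps = (inbound_gbs + replication_gbs + outbound_gbs) * 8  # GB/s -> Gbps

brokers = 19                          # from the summary's broker estimate
per_broker_gbps = total_gbps / brokers

assert total_gbps == 48
assert round(per_broker_gbps, 1) == 2.5   # hence 10 Gbps NICs for headroom
```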
3.4 Consumer Throughput Estimation
| Consumer processing | Time | Messages/sec per consumer |
|---|---|---|
| Simple log/forward | 0.1 ms/msg | 10,000 msg/sec |
| Database write | 1 ms/msg | 1,000 msg/sec |
| Complex processing | 5 ms/msg | 200 msg/sec |
Aha Moment: the number of partitions must be >= the number of consumers. If you need 1,000 consumers → you need at least 1,000 partitions. This is why choosing the partition count is such an important decision.
3.5 Estimation Summary
| Metric | Value |
|---|---|
| Write throughput | 1 GB/sec (3 GB/sec peak) |
| Total write (with replication) | 3 GB/sec (9 GB/sec peak) |
| Storage (7-day retention, RF=3, compressed) | ~907 TB (~0.9 PB) |
| Number of brokers | ~19+ (storage-based) |
| Network per broker | ~2.5 Gbps (needs a 10 Gbps NIC) |
| Consumers needed | 100 (simple) to 1,000 (DB write) |
| Partitions needed | >= max consumer count |
4. Security — Securing the message queue
4.1 Encryption
4.1.1 Encryption in Transit (TLS)
| Connection | TLS | Description |
|---|---|---|
| Producer → Broker | Yes | TLS 1.2/1.3. Prevents a man-in-the-middle from reading messages |
| Broker → Broker (replication) | Yes | Inter-broker replication must also be encrypted |
| Broker → Consumer | Yes | Consumers read messages over TLS |
| Broker → ZooKeeper/Controller | Yes | Metadata communication must be secure |
Note: TLS reduces throughput by roughly 10-30% due to encryption/decryption overhead. On an internal (trusted) network, you may consider PLAINTEXT for inter-broker replication to gain performance — but that is a trade-off.
4.1.2 Encryption at Rest
| Method | Description | When to use |
|---|---|---|
| Disk encryption (dm-crypt, LUKS) | Encrypts the whole disk — transparent to Kafka | Standard for every production deployment |
| File system encryption | Encrypts at the file system level (eCryptfs, fscrypt) | When full disk encryption is unavailable |
| Message-level encryption | Producer encrypts the message body before sending | When the broker must not read the contents (multi-tenant, sensitive data) |
Message-level encryption: the producer encrypts with the consumer's public key. The broker sees only ciphertext — it cannot read it. The consumer decrypts with its private key. Use it for:
- Healthcare data (HIPAA)
- Financial data (PCI-DSS)
- Multi-tenant platforms (tenant A cannot read tenant B's data, not even a broker admin)
Pitfall: message-level encryption breaks log compaction (the broker cannot read the key to compare). Weigh the trade-off between security and functionality.
4.2 Authentication — Verifying identity
| Method | Description | When to use |
|---|---|---|
| SASL/PLAIN | Username/password | Dev/test environments. Unsafe for production (password travels in plain text; must be combined with TLS) |
| SASL/SCRAM | Challenge-response (password never sent) | Production — safer than PLAIN |
| SASL/GSSAPI (Kerberos) | Enterprise SSO | Enterprise environments with Kerberos infrastructure |
| SASL/OAUTHBEARER | OAuth 2.0 tokens | Cloud-native environments, integration with identity providers |
| mTLS | Mutual TLS — both client and server present certificates | Zero-trust environments, service-to-service authentication |
Best practice: use mTLS for service-to-service (producers/consumers are services) and SASL/SCRAM or OAUTHBEARER for human users (admin tools, monitoring).
4.3 Authorization — Access control
ACL (Access Control List) per topic:
| Principal | Resource | Operation | Permission |
|---|---|---|---|
| order-service | Topic: orders | WRITE | ALLOW |
| order-service | Topic: payments | WRITE | DENY |
| analytics-service | Topic: orders | READ | ALLOW |
| analytics-service | Topic: orders | WRITE | DENY |
| admin | Topic: * | ALL | ALLOW |
Principle of least privilege: each service gets just enough permissions to do its job. The order service only writes to the orders topic and may not read the payments topic.
4.4 Audit Logging
| Event | Information logged |
|---|---|
| Topic created/deleted | Who, when, topic name, config |
| ACL changed | Who, when, old ACL, new ACL |
| Authentication failure | Who (attempted), when, IP address, reason |
| Admin operation | Who, when, operation (reassignment, config change) |
Audit logs must be shipped to an external system (SIEM, Elasticsearch) — not stored on the broker (so an attacker cannot delete them).
Aha Moment: security for a message queue is often neglected because "it's an internal system". But if an attacker reaches the broker → they can read every service's messages. The message queue is the central nervous system — protecting it protects the whole system.
5. DevOps — Operations and monitoring
5.1 Critical Metrics — The most important indicators
5.1.1 Broker Health
| Metric | Threshold | Meaning |
|---|---|---|
| Under-replicated partitions | > 0 | Some partitions are not fully replicated. Alert immediately — risk of data loss |
| ISR shrink rate | > 0 | Followers are dropping out of the ISR. Broker overloaded or network issue |
| Active controller count | != 1 | There must always be exactly 1 controller. 0 = no coordinator. >1 = split brain |
| Offline partitions | > 0 | Partition has no leader. Critical alert — data cannot be read or written |
| Request handler idle ratio | < 20% | The broker is overloaded — request threads nearly exhausted |
5.1.2 Producer Metrics
| Metric | Threshold | Meaning |
|---|---|---|
| Produce request rate | requests/sec | Producer throughput |
| Produce latency P99 | > 100ms | Slow writes — possibly an overloaded broker or replication lag |
| Record error rate | > 0 | Producer sends failing — broker rejects or network error |
| Batch size average | < 1KB | Batches too small — increase `linger.ms` to accumulate more |
| Compression ratio | > 0.8 | Compression ineffective — check the data pattern |
5.1.3 Consumer Metrics — The most important
| Metric | Threshold | Meaning |
|---|---|---|
| Consumer lag | > 10,000 messages | The consumer cannot keep up with the producer. This is metric #1 |
| Consumer lag trend | Continuously rising | Consumer slower than producer — memory/disk will overflow if not fixed |
| Commit rate | Sudden drop | The consumer may be stuck or crashed |
| Rebalance rate | > 1/hour | Rebalancing too often — consumers are unstable |
| Poll interval | > max.poll.interval.ms | Consumer kicked out of the group — processing too slow |
Aha Moment: consumer lag is the #1 metric of the entire message queue system. If lag keeps growing → messages are processed slower than they are produced → eventually retention expires → messages are deleted before the consumer reads them → data loss. Monitoring consumer lag is mandatory.
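The metric itself is simple arithmetic. A minimal sketch (function names are illustrative; real tools read the log-end and committed offsets from the broker):

```python
# Sketch of the #1 metric: lag = log-end offset minus committed offset,
# summed over partitions, plus a naive trend check over samples.

def consumer_lag(log_end_offsets, committed_offsets):
    """Both args: dict partition -> offset. Returns total outstanding messages."""
    return sum(log_end_offsets[p] - committed_offsets.get(p, 0)
               for p in log_end_offsets)

def lag_is_growing(samples):
    """samples: lag readings over time; True if strictly rising throughout."""
    return all(b > a for a, b in zip(samples, samples[1:]))

assert consumer_lag({0: 1000, 1: 500}, {0: 990, 1: 480}) == 30
assert lag_is_growing([10, 250, 4000, 12000])         # page someone
assert not lag_is_growing([12000, 8000, 3000, 0])     # consumers caught up
```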
5.1.4 Infrastructure Metrics
| Metric | Threshold | Meaning |
|---|---|---|
| Disk usage | > 80% | Disk nearly full — add disks or reduce retention |
| Disk I/O wait | > 10% | Disk is the bottleneck — need SSDs or less load |
| Network throughput | > 70% capacity | Network is the bottleneck — upgrade the NICs |
| CPU usage | > 70% | Usually compression/decompression |
| JVM GC pause | > 200ms | GC pauses make the broker unresponsive — client timeouts |
| File descriptor count | > 80% ulimit | Each partition + segment + connection costs 1 fd. Out of fds = broker crash |
5.2 Alerting Strategy
| Severity | Metric | Action |
|---|---|---|
| P1 — Critical | Offline partitions > 0 | Immediately page on-call. Data is inaccessible |
| P1 — Critical | Active controller count != 1 | Page on-call. The cluster has no coordinator |
| P2 — High | Under-replicated partitions > 0 (> 5 min) | Investigate broker health, disk, network |
| P2 — High | Consumer lag rising continuously (> 30 min) | Scale consumers or investigate the bottleneck |
| P3 — Medium | Disk usage > 80% | Plan capacity — add disks or reduce retention |
| P3 — Medium | ISR shrink rate > 0 | Check follower health, replication bandwidth |
| P4 — Low | Produce latency P99 > 100ms | Investigate — may be transient |
| P4 — Low | Rebalance rate > 1/hour | Check consumer stability, session timeout config |
5.3 Operational Runbook
5.3.1 Broker Failure
- Alert: under-replicated partitions rising
- Check: which broker is gone? (its `broker.id` no longer appears in the cluster)
- Automatic: the controller elects new leaders from the ISR for the dead broker's partitions
- Action: restart the broker or replace the hardware. Once back online, it automatically rejoins and syncs data
- Monitor: ISR count returns to the replication factor → recovery complete
5.3.2 Consumer Lag Spike
- Alert: consumer lag > threshold
- Check: is the consumer running? What is the processing time per message?
- Short-term action: scale consumers (add instances). Note: there must be enough partitions
- Long-term action: optimize the processing logic, increase the partition count (if needed)
- Monitor: lag drains back down to 0
5.3.3 Disk Full
- Alert: disk usage > 80%
- Immediate action: reduce retention (e.g. from 7 days to 3) to free disk
- Medium-term action: add disks, add brokers
- Long-term action: implement tiered storage (hot/cold) or increase compression
5.4 Capacity Planning
| Cadence | Action |
|---|---|
| Weekly | Review consumer lag trends, disk growth rate |
| Monthly | Review throughput growth, plan broker additions |
| Quarterly | Review partition counts, rebalancing strategy, retention policy |
| Yearly | Major capacity planning — hardware refresh, architecture review |
Pitfall: many teams monitor only throughput and latency and forget consumer lag. The system looks "healthy" (no broker errors) while data is being lost, because consumers cannot keep up and messages are deleted when retention expires. Consumer lag is the "silent killer" of message queues.
6. Mermaid Diagrams — Architecture overview
6.1 Broker Architecture with WAL Segments
```mermaid
flowchart TB
  subgraph "Broker 1"
    subgraph "Partition 0 (Leader)"
      direction TB
      SEG0["Segment 0<br/>Offset 0-999K<br/>📁 00000000.log<br/>📁 00000000.index<br/>📁 00000000.timeindex"]
      SEG1["Segment 1<br/>Offset 1M-1.99M<br/>📁 01000000.log<br/>📁 01000000.index<br/>📁 01000000.timeindex"]
      SEG2["Segment 2 (ACTIVE)<br/>Offset 2M-...<br/>📁 02000000.log<br/>📁 02000000.index<br/>📁 02000000.timeindex"]
      SEG0 --> SEG1 --> SEG2
    end
    subgraph "Write Path"
      PC["Page Cache<br/>(OS RAM)"]
      FL["Flush to Disk<br/>(async)"]
    end
    subgraph "Read Path"
      IDX["Index Lookup<br/>(mmap)"]
      ZC["Zero-Copy<br/>sendfile()"]
    end
  end
  P["Producer"] -->|"Append"| SEG2
  SEG2 -->|"Write"| PC
  PC -->|"Async"| FL
  C["Consumer"] -->|"Seek(offset)"| IDX
  IDX -->|"Position"| PC
  PC -->|"sendfile()"| ZC
  ZC -->|"Data"| C
  style SEG2 fill:#e53935,color:#fff
  style PC fill:#1e88e5,color:#fff
  style ZC fill:#43a047,color:#fff
```
6.2 Replication Flow — Leader + Followers + ISR
```mermaid
flowchart TB
  subgraph "Partition 0 Replication"
    P["Producer<br/>acks=all"] -->|"1. Send msg"| L["Broker 1<br/>LEADER<br/>LEO: 1000"]
    L -->|"2. Write WAL"| WAL["WAL<br/>(disk)"]
    L -->|"3. Replicate"| F1["Broker 2<br/>FOLLOWER<br/>LEO: 999<br/>✅ IN ISR"]
    L -->|"3. Replicate"| F2["Broker 3<br/>FOLLOWER<br/>LEO: 998<br/>✅ IN ISR"]
    L -.->|"3. Replicate"| F3["Broker 4<br/>FOLLOWER<br/>LEO: 500<br/>❌ OUT OF ISR<br/>(lag > 30s)"]
    F1 -->|"4. ACK"| L
    F2 -->|"4. ACK"| L
    L -->|"5. ACK to Producer<br/>(all ISR confirmed)"| P
    HW["High Watermark = 998<br/>(min LEO of ISR)"]
    L --- HW
    C["Consumer"] -->|"6. Read <= HW"| L
  end
  style L fill:#e53935,color:#fff
  style F1 fill:#1e88e5,color:#fff
  style F2 fill:#1e88e5,color:#fff
  style F3 fill:#757575,color:#fff
  style HW fill:#f57c00,color:#fff
```
6.3 Consumer Group Rebalancing (Cooperative)
```mermaid
flowchart TB
  subgraph "BEFORE — 2 consumers, 6 partitions"
    direction LR
    C1_B["Consumer 1<br/>P0, P1, P2"]
    C2_B["Consumer 2<br/>P3, P4, P5"]
  end
  subgraph "Consumer 3 joins"
    direction TB
    EV["Rebalance triggered"]
  end
  subgraph "AFTER — 3 consumers, 6 partitions (Cooperative)"
    direction LR
    C1_A["Consumer 1<br/>P0, P1<br/>(keeps P0,P1 — gives up P2)"]
    C2_A["Consumer 2<br/>P3, P4<br/>(keeps P3,P4 — gives up P5)"]
    C3_A["Consumer 3<br/>P2, P5<br/>(takes P2, P5)"]
  end
  C1_B --> EV
  C2_B --> EV
  EV --> C1_A
  EV --> C2_A
  EV --> C3_A
  style EV fill:#f57c00,color:#fff
  style C3_A fill:#43a047,color:#fff
```
6.4 End-to-End Message Flow (Detailed)
```mermaid
flowchart LR
  subgraph "Producer Side"
    APP["Application"] -->|"1. produce()"| SER["Serializer<br/>(JSON/Avro)"]
    SER -->|"2. serialize"| PART["Partitioner<br/>hash(key) % N"]
    PART -->|"3. assign partition"| BATCH["Batch Buffer<br/>(per partition)"]
    BATCH -->|"4. batch full<br/>or linger.ms"| COMP["Compressor<br/>(lz4/zstd)"]
    COMP -->|"5. compressed batch"| NET1["Network<br/>Send"]
  end
  subgraph "Broker Side"
    NET1 -->|"6. receive"| VAL["Validate<br/>CRC, auth, ACL"]
    VAL -->|"7. append"| WAL2["WAL<br/>(page cache)"]
    WAL2 -->|"8. replicate"| REP["Followers"]
    REP -->|"9. ACK"| ACK["ACK to<br/>Producer"]
  end
  subgraph "Consumer Side"
    POLL["poll()"] -->|"10. fetch"| WAL2
    WAL2 -->|"11. zero-copy"| DECOMP["Decompressor"]
    DECOMP -->|"12. decompress"| DESER["Deserializer"]
    DESER -->|"13. deserialize"| PROC["Process<br/>Message"]
    PROC -->|"14. commit"| OFFSET["Commit<br/>Offset"]
  end
  style BATCH fill:#1e88e5,color:#fff
  style WAL2 fill:#e53935,color:#fff
  style OFFSET fill:#43a047,color:#fff
```
7. Aha Moments & Pitfalls — Things to remember
7.1 Aha Moments
| # | Insight | Explanation |
|---|---|---|
| 1 | The WAL is the foundation | The whole message queue is built on one simple idea: the append-only log. Appending to the end of a file is the fastest disk operation. From there, everything else (replication, retention, replay) follows naturally |
| 2 | The ISR balances durability vs availability | The ISR is not "all replicas" — it is "the replicas that are keeping up". The `min.insync.replicas` config decides how many replicas you can lose and still accept writes |
| 3 | Consumer lag is metric #1 | Broker throughput is high and latency is low — yet consumer lag keeps growing → data is quietly being lost. This is the "silent killer". Monitor consumer lag before everything else |
| 4 | Partition count is hard to change | Increasing partitions → key routing changes → ordering breaks. Decreasing → impossible. Getting the partition count right up front is the most important architectural decision |
| 5 | Pull wins for streaming | The consumer decides its rate, its batch size, and when to read. Natural back-pressure. Easy replay. This is why Kafka (pull) won event streaming over RabbitMQ (push) |
| 6 | Zero-copy is the performance "secret" | No copying data from kernel → user space → kernel. Straight from page cache → network. Lower CPU usage, lower latency. This is how message queues reach millions of msg/sec on commodity hardware |
| 7 | The broker does not understand messages | The broker sees only bytes. It does not parse or validate contents. This keeps it extremely fast and generic — suitable for any kind of data |
| 8 | Exactly-once is a combination of mechanisms | There is no "magic button". You need idempotent producer + transactions + idempotent consumer. Understanding each mechanism tells you when you actually need exactly-once vs when at-least-once is enough |
7.2 Common Pitfalls
| # | Pitfall | Consequence | How to avoid |
|---|---|---|---|
| 1 | Not setting a message key | Round-robin → no ordering. Messages for the same entity land on different partitions | Always set key = entity ID (user_id, order_id) when ordering matters |
| 2 | Too few partitions | Cannot scale consumers. Throughput is capped | Pick a partition count of 2-3x the expected number of consumers |
| 3 | acks=0 for critical data | Messages lost when a broker crashes | Use acks=all + min.insync.replicas=2 for all important data |
| 4 | Auto commit with complex processing | Messages lost or processed twice | Use manual commits — commit only after processing completes |
| 5 | No DLQ | A poison message blocks the consumer forever | Always design a DLQ with max retries + exponential backoff |
| 6 | Increasing the partition count casually | Key routing changes, per-key ordering breaks | Plan the partition count ahead. If you must increase it, accept downtime for re-keying |
| 7 | Not monitoring consumer lag | Data lost because retention expires before the consumer reads | Alert on consumer lag > threshold. This is a P2 alert |
| 8 | unclean.leader.election = true for financial data | Messages lost when the leader crashes | Disable unclean leader election for every important topic |
| 9 | Rebalance storm | The consumer group rebalances continuously → nobody processes messages | Increase session.timeout.ms and max.poll.interval.ms, use sticky assignment |
| 10 | Unencrypted inter-broker traffic | An attacker on the internal network can read every message | TLS on every connection, including inter-broker |
8. Summary
8.1 Comparing our design with real Kafka
| Component | Our design | Apache Kafka | Notes |
|---|---|---|---|
| Storage | WAL + segments + index | Same | Kafka uses exactly this model |
| Replication | Leader-follower + ISR | Same | The ISR is the Kafka team's invention |
| Coordination | ZooKeeper/KRaft | ZK → KRaft | Kafka is migrating to KRaft |
| Consumer model | Pull + consumer groups | Same | Pull is Kafka's core design |
| Delivery semantics | At-most/least/exactly once | Same | Exactly-once since Kafka 0.11+ |
| Retention | Time/size/compaction | Same | The same 3 policies |
| Compression | gzip, snappy, lz4, zstd | Same | zstd since Kafka 2.1+ |
8.2 When should you use this message queue architecture?
| Use case | Fit? | Why |
|---|---|---|
| Event streaming (logs, metrics, clickstream) | Excellent fit | High throughput, retention, replay |
| Microservice communication (async) | Good fit | Decoupling, at-least-once, consumer groups |
| Real-time analytics pipeline | Excellent fit | Multiple consumer groups read the same data |
| Task queue (background jobs) | Works, but RabbitMQ may be better | The pull model is not optimal for short job queues |
| Request-reply pattern | Poor fit | The pull model is bad for synchronous communication |
| Small-scale system (< 1000 msg/sec) | Overkill | Redis Pub/Sub or SQS is simpler |
8.3 Links to other lessons
| Lesson | Relation |
|---|---|
| Tuan-08-Message-Queue | This lesson complements it — Tuan 08 teaches how to use an MQ, this one teaches how to design one |
| Tuan-10-Consistent-Hashing | Partition assignment can use consistent hashing to reduce rebalance disruption |
| Tuan-07-Database-Sharding-Replication | The replication model (leader-follower, ISR) mirrors database replication. Sharding mirrors partitioning |
| Tuan-13-Monitoring-Observability | Consumer lag and broker health monitoring are direct applications |
| Tuan-14-AuthN-AuthZ-Security | SASL, mTLS, and ACLs in the message queue apply authn/authz patterns |
| Case-Design-Payment-System | A payment system uses a message queue with exactly-once semantics |
| Case-Design-Stock-Exchange | A stock exchange uses a message queue for event sequencing |
"Hieu, after this lesson you know how to design a distributed message queue from scratch — not just how to use one. When an interviewer asks 'Design Kafka', you can confidently explain every design decision: why an append-only log, why the pull model, why the ISR, why the partition is the unit of parallelism. That is the difference between a Backend Dev and a System Architect."