Case Study: Design Ad Click Event Aggregation

“1 tỷ lượt click mỗi ngày, mỗi click phải được đếm đúng 1 lần, kết quả phải có trong vài phút. Đây là bài toán mà sai 0.1% cũng có thể mất hàng triệu đô quảng cáo.”

Tags: system-design ad-click event-aggregation streaming exactly-once mapreduce alex-xu-vol2 Student: Hieu Source: Alex Xu — System Design Interview Volume 2, Chapter 3 Prerequisite: Tuan-08-Message-Queue · Tuan-02-Back-of-the-envelope · Tuan-01-Scale-From-Zero-To-Millions Lien quan: Tuan-13-Monitoring-Observability · Case-Design-Metrics-Monitoring-Alerting · Tuan-07-Database-Sharding-Replication

1. Context & Why — Tại sao Ad Click Event Aggregation quan trọng?

1.1 Analogy: Đếm phiếu bầu cử quốc gia real-time

Hiếu, bạn hãy tưởng tượng bài toán này như việc đếm phiếu bầu cử quốc gia diễn ra real-time:

Bầu cử quốc gia	Ad Click Aggregation	Điểm chung
Hàng triệu phiếu từ khắp cả nước	Hàng tỷ click từ khắp thế giới	Volume cực lớn, đến liên tục
Mỗi phiếu chỉ được đếm 1 lần — đếm 2 lần = gian lận	Mỗi click chỉ được đếm 1 lần — đếm 2 lần = tính tiền sai nhà quảng cáo	Exactly-once là yêu cầu bắt buộc
Kết quả phải cập nhật liên tục — báo chí cần biết ai đang dẫn mỗi phút	Dashboard phải cập nhật liên tục — advertiser cần biết bao nhiêu click trong 1 phút qua	Real-time aggregation với latency thấp
Chia theo khu vực — đếm theo tỉnh, theo thành phố	Chia theo filter — đếm theo ad_id, campaign_id, country	Multi-dimensional aggregation
Phiếu đến muộn — có văn phòng gửi kết quả chậm vì xa	Late events — click event đến muộn do network delay	Late event handling là thách thức lớn
Kiểm tra chéo — kết quả đếm thủ công phải khớp với máy	Reconciliation — batch job kiểm tra lại kết quả streaming	Reconciliation đảm bảo chính xác
Có thể bầu trùng — một người đi bầu 2 lần	Click fraud — bot click nhiều lần	Deduplication cần xử lý

Aha Moment: Bài toán này không phải là “đếm click” đơn giản. Nó là bài toán đếm đúng, đếm đủ, đếm nhanh ở quy mô hàng tỷ event mỗi ngày. Sai một chút = mất tiền thật — hoặc nhà quảng cáo bị tính thừa, hoặc platform mất doanh thu.

1.2 Tại sao bài toán này khó?

Hiếu, đếm số tưởng như đơn giản — nhưng đếm số ở quy mô lớn, real-time, và chính xác tuyệt đối lại là một trong những bài toán khó nhất của distributed systems:

Thách thức	Giải thích
Volume cực lớn	1 tỷ click/ngày = ~11,574 click/giây trung bình, peak có thể 50K+/giây
Real-time	Advertiser muốn thấy kết quả trong vài phút, không phải cuối ngày
Exactly-once	Đếm thừa = tính tiền sai. Đếm thiếu = mất doanh thu. Cả hai đều không chấp nhận được
Late events	Event có thể đến muộn 5 phút, 1 giờ, thậm chí 1 ngày do mobile network, CDN delay
Multi-dimensional	Cần aggregate theo nhiều chiều: ad_id, campaign_id, country, device_type, time window
Fault tolerance	Hệ thống không được mất data khi server crash, network partition
Hot spots	Quảng cáo viral có thể nhận 100x traffic bình thường — tạo hot partition
Consistency vs Latency	Cần cân bằng giữa độ chính xác và tốc độ trả kết quả

1.3 Business context — Tại sao advertiser quan tâm?

Trong mô hình quảng cáo online (Google Ads, Facebook Ads, TikTok Ads), advertiser trả tiền theo CPC (Cost Per Click). Mỗi click = một khoản tiền. Ví dụ:

Advertiser A chạy quảng cáo với CPC = $0.50
1 ngày có 1 triệu click vào quảng cáo của A
Tổng chi phí = $500,000/ngày

Nếu hệ thống đếm sai 1% → sai $5, 000/ n g \overset{a}{ˋ} y * * \to s ai * *$ 1.8 triệu/năm. Với hàng triệu advertiser, con số này nhân lên gấp bội. Đây là lý do data accuracy là non-negotiable.

1.4 Scope của bài toán

Thuộc tính	Trong scope	Ngoài scope
Click event aggregation (đếm click theo ad_id, time window)	Yes	—
Real-time query (advertiser xem dashboard)	Yes	—
Filtering (theo campaign_id, country)	Yes	—
Click fraud detection (bot, click farms)	Liên quan (Section 4)	Deep dive riêng
Conversion tracking (click → purchase)	—	Bài toán khác
Ad serving (chọn quảng cáo nào hiển thị)	—	Bài toán khác
Billing (tính tiền advertiser)	—	Downstream của aggregation

2. Deep Dive — Alex Xu 4-Step Framework

Step 1 — Understand the Problem & Establish Design Scope

2.1.1 Functional Requirements

Chức năng	Mô tả chi tiết
Aggregate ad clicks	Đếm số lượng click và unique click cho mỗi ad_id trong các time window: 1 phút, 5 phút, 1 giờ
Return aggregated data	Query API trả về click_count và unique_click_count cho một ad_id trong một khoảng thời gian
Support filtering	Filter kết quả theo ad_id, campaign_id, country, device_type
Support multi-dimensional aggregation	Aggregate theo nhiều chiều cùng lúc: ví dụ click count per ad_id per country per minute
Store raw events	Lưu raw click event để reconciliation và re-processing

Các câu hỏi bạn nên hỏi interviewer:

Câu hỏi	Câu trả lời giả định	Tại sao hỏi
Bao nhiêu click/ngày?	1 billion (1 tỷ) click/ngày	Quyết định throughput của toàn hệ thống
Bao nhiêu ad đang active?	2 triệu ad	Quyết định cardinality của aggregation
Aggregation window nào?	1 phút, 5 phút, 1 giờ	Quyết định windowing strategy
Latency yêu cầu?	< vài phút từ click đến queryable	Quyết định streaming vs batch
Data accuracy?	Exactly-once — không được đếm thừa hay thiếu	Quyết định delivery semantics
Cần store raw data không?	Có — để reconciliation và ad-hoc analysis	Quyết định storage strategy
Bao lâu giữ raw data?	Raw: 7 ngày, Aggregated: years	Quyết định data retention
Click fraud có trong scope không?	Basic dedup có, advanced fraud detection ngoài scope	Ranh giới thiết kế

2.1.2 Non-Functional Requirements

Requirement	Target	Lý do
Throughput	1B clicks/ngày, peak 50K clicks/sec	Volume lớn, bursty traffic
Latency	End-to-end < vài phút (từ click đến queryable)	Near-real-time dashboard
Accuracy	Exactly-once processing	Liên quan đến tiền bạc
Availability	99.99% cho ingestion pipeline	Mất event = mất doanh thu
Scalability	Scale horizontal khi traffic tăng	Mùa sale, viral ads
Fault tolerance	Không mất data khi node failure	Zero data loss
Idempotency	Re-process không thay đổi kết quả	Retry safety
Queryability	Query response < 1 giây cho single ad_id	Dashboard UX

2.1.3 API Design

API 1: Aggregate click count cho một ad_id

GET /v1/ads/{ad_id}/aggregated_count

Parameters:

Parameter	Type	Mô tả
`from`	long (epoch)	Thời điểm bắt đầu
`to`	long (epoch)	Thời điểm kết thúc
`filter`	object	Optional: `{campaign_id: "xxx", country: "VN"}`

Response:

Field	Type	Mô tả
`ad_id`	string	ID của quảng cáo
`click_count`	long	Tổng số click
`unique_click_count`	long	Số click từ distinct users

API 2: Top N clicked ads trong khoảng thời gian

GET /v1/ads/top_clicked

Parameters:

Parameter	Type	Mô tả
`count`	int	Số lượng top ads cần trả về (N)
`window`	int	Khoảng thời gian tính bằng phút (1, 5, 60)
`filter`	object	Optional filter

Step 2 — High-Level Design

2.2.1 Kiến trúc tổng quan

Hiếu, kiến trúc này giống như dây chuyền xử lý cá trong nhà máy thủy sản:

Tàu cá (Client) mang cá về cảng → Click events từ user browsers/apps
Cảng cá (Load Balancer) phân phối cá vào các băng chuyền → Ingestion Service nhận events
Kho lạnh (Message Queue) giữ cá tươi trước khi chế biến → Kafka buffer events
Dây chuyền chế biến (Aggregation Service) cắt, lọc, đóng gói → Map-Reduce aggregate clicks
Kho thành phẩm (Database) lưu sản phẩm đã đóng gói → Aggregated DB lưu kết quả
Quầy bán hàng (Query Service) phục vụ khách hàng → Query API trả kết quả cho advertiser

flowchart LR
    subgraph "Data Generation"
        C1["User Clicks<br/>(Browsers/Apps)"]
    end

    subgraph "Log Ingestion"
        LB["Load Balancer"]
        IS1["Ingestion<br/>Service 1"]
        IS2["Ingestion<br/>Service 2"]
        ISN["Ingestion<br/>Service N"]
    end

    subgraph "Message Queue"
        K["Apache Kafka<br/>(Partitioned by ad_id)"]
    end

    subgraph "Streaming Aggregation"
        AG1["Aggregation Node 1<br/>(Map + Aggregate)"]
        AG2["Aggregation Node 2<br/>(Map + Aggregate)"]
        AGN["Aggregation Node N<br/>(Map + Aggregate)"]
    end

    subgraph "Second Message Queue (Optional)"
        K2["Kafka Topic 2<br/>(Aggregated Results)"]
    end

    subgraph "Storage"
        RAW[("Raw Data Store<br/>(Cassandra / S3)")]
        AGG[("Aggregated DB<br/>(Cassandra / ClickHouse)")]
    end

    subgraph "Serving"
        QS["Query Service"]
        CACHE[("Redis Cache")]
        DASH["Advertiser<br/>Dashboard"]
    end

    C1 --> LB
    LB --> IS1 & IS2 & ISN
    IS1 & IS2 & ISN --> K
    K --> AG1 & AG2 & AGN
    K -->|"Raw events"| RAW
    AG1 & AG2 & AGN --> K2
    K2 --> AGG
    AGG --> QS
    QS --> CACHE
    QS --> DASH

    style K fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style K2 fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style AGG fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style QS fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff

2.2.2 Data Flow — End to End

sequenceDiagram
    participant User as User Browser
    participant IS as Ingestion Service
    participant Kafka as Kafka (Raw Events)
    participant AGG as Aggregation Service
    participant Kafka2 as Kafka (Aggregated)
    participant DB as Aggregated DB
    participant QS as Query Service
    participant ADV as Advertiser Dashboard

    User->>IS: Click event (ad_id, timestamp, user_id, ip, country)
    IS->>IS: Validate & enrich event
    IS->>Kafka: Produce to partition (hash ad_id)

    Note over Kafka,AGG: Streaming Aggregation (per minute window)
    Kafka->>AGG: Consume batch of events
    AGG->>AGG: Map: filter + transform
    AGG->>AGG: Aggregate: count by ad_id per window
    AGG->>AGG: Reduce: merge partial results
    AGG->>Kafka2: Produce aggregated result

    Kafka2->>DB: Write aggregated data
    Note over DB: (ad_id, window_start, click_count, unique_count)

    ADV->>QS: GET /v1/ads/123/aggregated_count?from=...&to=...
    QS->>DB: Query aggregated data
    DB-->>QS: Results
    QS-->>ADV: JSON response

2.2.3 Component Responsibilities

Component	Trách nhiệm	Technology choices
Ingestion Service	Nhận click event từ client, validate, enrich (geo lookup), gửi vào Kafka	Custom service (Go/Java), stateless, horizontally scalable
Kafka (Raw)	Buffer raw events, decouple ingestion từ processing, replay capability	Apache Kafka, partitioned by ad_id
Raw Data Store	Lưu raw events cho reconciliation, ad-hoc query, re-processing	Cassandra (write-heavy), hoặc S3 (cost-effective)
Aggregation Service	Stream processing: map, aggregate, reduce trong time window	Apache Flink, Spark Streaming
Kafka (Aggregated)	Buffer aggregated results trước khi ghi DB, decouple aggregation từ storage	Apache Kafka
Aggregated DB	Lưu kết quả aggregation, phục vụ read queries	Cassandra, ClickHouse
Query Service	Parse query, fetch data từ DB, apply filter, return results	Custom service với caching layer
Redis Cache	Cache hot queries (top ads, recent aggregations)	Redis cluster

2.2.4 Tại sao cần 2 Kafka topics?

Lợi ích	Giải thích
Decoupling	Aggregation service và DB writer có thể scale độc lập
Exactly-once	Kafka transactions đảm bảo aggregated result được ghi chính xác 1 lần
Replay	Nếu DB writer lỗi, có thể replay từ Kafka topic 2 mà không cần re-aggregate
Multiple consumers	Nhiều downstream systems có thể đọc aggregated data (alerting, billing, analytics)

Step 3 — Design Deep Dive

2.3.1 Data Model

Raw Click Event

Field	Type	Mô tả	Ví dụ
`ad_id`	string	ID của quảng cáo	”ad_001”
`click_timestamp`	long (epoch ms)	Thời điểm user click (event time)	1678886400000
`user_id`	string	ID của user (cookie/device ID)	“user_abc123”
`ip`	string	IP address của user	”203.113.152.4”
`country`	string	Country code (từ IP geo lookup)	“VN”
`device_type`	string	Loại thiết bị	”mobile”
`campaign_id`	string	Campaign chứa ad này	”camp_xyz”

Kích thước trung bình: ~0.5-1 KB per event.

Aggregated Data

Field	Type	Mô tả	Ví dụ
`ad_id`	string	ID của quảng cáo	”ad_001”
`window_start`	long (epoch)	Bắt đầu của time window	1678886400
`window_size`	int	Kích thước window (phút)	1
`click_count`	long	Tổng số click trong window	45,231
`unique_click_count`	long	Số click từ distinct users	38,102
`filter_id`	string	Composite key của filter dimensions	”country=VN”

Aha Moment: Raw event và aggregated data có đặc điểm hoàn toàn khác nhau. Raw event write-heavy, append-only, high volume. Aggregated data read-heavy, update-in-place (upsert), lower volume. Đây là lý do chúng cần databases khác nhau — hoặc ít nhất là tables khác nhau với access patterns khác nhau.

Star Schema cho Aggregation

Hiếu, data model cho aggregation tương tự star schema trong data warehousing:

Thành phần	Vai trò	Ví dụ
Fact table	Lưu metric (click_count)	`aggregated_clicks (ad_id, window, click_count, unique_count)`
Dimension tables	Lưu thông tin filter	`ads (ad_id, campaign_id, advertiser_id)`, `campaigns (campaign_id, budget)`

Query pattern: join fact table với dimension tables để filter theo campaign_id, advertiser_id, country.

2.3.2 Choosing the Right Database

Yêu cầu	Raw Data	Aggregated Data
Write pattern	Append-only, sequential	Upsert (update or insert)
Read pattern	Rare (reconciliation, re-processing)	Frequent (dashboard queries)
Volume	Cực lớn (~1TB/ngày)	Nhỏ hơn nhiều (~vài GB/ngày)
Query pattern	Full scan, time-range scan	Point query (ad_id + time range), top-N
Retention	7 ngày (sau đó archive)	Years
Consistency	Eventual OK	Strong preferred

Raw Data → Cassandra hoặc S3

Lựa chọn	Ưu điểm	Nhược điểm
Cassandra	Write-optimized (LSM-tree), time-series friendly, TTL support, distributed	Query flexibility hạn chế, expensive storage
S3 + Parquet	Cực rẻ, unlimited storage, Athena/Spark có thể query	Latency cao, không phù hợp real-time query
Hybrid (Cassandra 7 ngày + S3 archive)	Kết hợp cả hai	Phức tạp operations

Kết luận: Dùng Cassandra cho raw data với TTL 7 ngày. Archive sang S3 (Parquet format) cho long-term storage. Cassandra write-heavy performance rất phù hợp với use case này — partition key là ad_id, clustering key là click_timestamp.

Aggregated Data → Cassandra hoặc ClickHouse

Lựa chọn	Ưu điểm	Nhược điểm
Cassandra	Consistent với raw data store, write-optimized	OLAP queries chậm (không có secondary index tốt)
ClickHouse	OLAP-optimized, column-oriented, blazing fast aggregation queries	Write pattern khác (batch insert tốt hơn single upsert)
Druid	Real-time OLAP, pre-aggregation tại ingestion	Phức tạp operations

Kết luận: ClickHouse là lựa chọn tốt nhất cho aggregated data vì:

Column-oriented storage nên tối ưu cho aggregation queries (SUM, COUNT, GROUP BY)
Query speed cực nhanh cho dashboard use case
Built-in materialized views có thể tự động aggregate
Compression ratio cao cho time-series data

Trade-off: Nếu team chưa có kinh nghiệm với ClickHouse, dùng Cassandra cho cả raw và aggregated data để giảm operational complexity. Performance không tốt bằng nhưng đơn giản hơn.

2.3.3 Aggregation Service — MapReduce Model

Đây là trái tim của hệ thống. Aggregation service dùng mô hình MapReduce để xử lý events theo 3 giai đoạn:

Giai đoạn 1: Map Node (Filter + Transform)

Nhiệm vụ	Chi tiết
Nhận raw event từ Kafka	Mỗi Map node nhận events từ 1 hoặc nhiều Kafka partitions
Filter	Loại bỏ invalid events (missing ad_id, invalid timestamp)
Transform	Enrich data (geo lookup từ IP → country), normalize fields
Emit key-value	Output: `(ad_id, 1)` — nghĩa là “ad này được click 1 lần”

Giai đoạn 2: Aggregate Node (Count by ad_id per window)

Nhiệm vụ	Chi tiết
Nhận (ad_id, 1) từ Map	Group by ad_id
Aggregate trong time window	Đếm số lượng clicks và unique clicks trong từng window (1 phút)
Maintain state	Dùng in-memory state (Flink state backend) để lưu partial aggregation
Emit partial result	Output: `(ad_id, window, partial_count, partial_unique_count)`

Giai đoạn 3: Reduce Node (Merge Results)

Nhiệm vụ	Chi tiết
Nhận partial results từ nhiều Aggregate nodes	Trường hợp ad_id bị chia ra nhiều node
Merge	Cộng tất cả partial_count lại, merge HyperLogLog cho unique count
Output final result	`(ad_id, window, final_click_count, final_unique_count)`
Ghi vào Kafka topic 2	Aggregated result sẵn sàng ghi DB

flowchart TB
    subgraph "Kafka Partitions"
        P1["Partition 0<br/>ad_001, ad_005, ad_009..."]
        P2["Partition 1<br/>ad_002, ad_006, ad_010..."]
        P3["Partition 2<br/>ad_003, ad_007, ad_011..."]
        P4["Partition 3<br/>ad_004, ad_008, ad_012..."]
    end

    subgraph "Map Nodes"
        M1["Map Node 1<br/>Filter + Transform<br/>Emit (ad_id, 1)"]
        M2["Map Node 2<br/>Filter + Transform<br/>Emit (ad_id, 1)"]
        M3["Map Node 3<br/>Filter + Transform<br/>Emit (ad_id, 1)"]
        M4["Map Node 4<br/>Filter + Transform<br/>Emit (ad_id, 1)"]
    end

    subgraph "Aggregate Nodes (Windowed)"
        A1["Aggregate Node 1<br/>Window: 1 min<br/>Count by ad_id"]
        A2["Aggregate Node 2<br/>Window: 1 min<br/>Count by ad_id"]
    end

    subgraph "Reduce Nodes"
        R1["Reduce Node<br/>Merge partial counts<br/>Final result per ad_id"]
    end

    subgraph "Output"
        K2["Kafka Topic 2<br/>(Aggregated Results)"]
        DB[("ClickHouse")]
    end

    P1 --> M1
    P2 --> M2
    P3 --> M3
    P4 --> M4

    M1 & M2 --> A1
    M3 & M4 --> A2

    A1 & A2 --> R1
    R1 --> K2
    K2 --> DB

    style M1 fill:#42A5F5,color:#fff
    style M2 fill:#42A5F5,color:#fff
    style M3 fill:#42A5F5,color:#fff
    style M4 fill:#42A5F5,color:#fff
    style A1 fill:#FFA726,color:#fff
    style A2 fill:#FFA726,color:#fff
    style R1 fill:#EF5350,color:#fff

Aha Moment: Trong thực tế, khi dùng Apache Flink, Map và Aggregate thường merge thành 1 operator. Flink tự động làm chuyện này qua operator chaining để giảm network overhead. Mô hình 3 giai đoạn là để hiểu concept — implementation thực tế compact hơn.

Unique Click Count — HyperLogLog

Đếm unique clicks (distinct user_id) là bài toán khó vì:

Không thể giữ tất cả user_id trong memory (quá nhiều)
Exact count yêu cầu O(n) memory per ad_id per window

Giải pháp: Dùng HyperLogLog (HLL) — probabilistic data structure:

Thuộc tính	Giá trị
Memory	~12 KB per counter (cố định, bất kể bao nhiêu elements)
Accuracy	Sai số ~0.81% (với 16K registers)
Mergeable	Có thể merge 2 HLL counters → perfect cho distributed aggregation
Trade-off	Mất độ chính xác tuyệt đối, nhưng tiết kiệm memory cực lớn

Với 2 triệu active ads, mỗi ad 1 HLL counter = 2M x 12KB = ~24 GB memory — hoàn toàn khả thi.

Nếu cần exact unique count (một số advertiser trả thêm tiền cho độ chính xác), dùng bitmap hoặc Bloom filter với memory lớn hơn.

2.3.4 Streaming vs Batching — Lambda vs Kappa Architecture

Đây là quyết định kiến trúc quan trọng nhất của bài toán. Hiếu cần hiểu 2 trường phái:

Lambda Architecture (Batch + Speed Layer)

flowchart TB
    subgraph "Data Source"
        EVENTS["Click Events"]
    end

    subgraph "Batch Layer"
        BATCH_STORE[("Batch Storage<br/>(HDFS / S3)")]
        BATCH_PROC["Batch Processing<br/>(Spark / MapReduce)<br/>Chạy mỗi giờ hoặc mỗi ngày"]
        BATCH_VIEW[("Batch View<br/>(accurate, complete)")]
    end

    subgraph "Speed Layer"
        STREAM_PROC["Stream Processing<br/>(Flink / Storm)<br/>Real-time"]
        RT_VIEW[("Real-time View<br/>(approximate, fast)")]
    end

    subgraph "Serving Layer"
        MERGE["Merge Results<br/>Batch View + Real-time View"]
        QS["Query Service"]
    end

    EVENTS --> BATCH_STORE
    EVENTS --> STREAM_PROC
    BATCH_STORE --> BATCH_PROC
    BATCH_PROC --> BATCH_VIEW
    STREAM_PROC --> RT_VIEW
    BATCH_VIEW --> MERGE
    RT_VIEW --> MERGE
    MERGE --> QS

    style BATCH_PROC fill:#4CAF50,color:#fff
    style STREAM_PROC fill:#FF9800,color:#fff
    style MERGE fill:#2196F3,color:#fff

Ưu điểm	Nhược điểm
Batch layer đảm bảo accuracy	2 codebases — phải maintain 2 pipeline (batch + streaming)
Speed layer đảm bảo low latency	Logic trùng lặp — cùng 1 aggregation nhưng viết 2 lần
Nếu streaming sai, batch sẽ sửa	Phức tạp debug — sai ở layer nào?
	Merge logic không đơn giản

Kappa Architecture (Streaming Only)

flowchart LR
    subgraph "Data Source"
        EVENTS["Click Events"]
    end

    subgraph "Streaming Layer (Only Layer)"
        KAFKA["Kafka<br/>(Immutable Log)"]
        FLINK["Stream Processing<br/>(Apache Flink)<br/>Real-time Aggregation"]
    end

    subgraph "Serving Layer"
        DB[("Aggregated DB")]
        QS["Query Service"]
    end

    subgraph "Reconciliation (Periodic)"
        RECON["Batch Reconciliation<br/>(Hourly / Daily)<br/>Verify streaming results"]
    end

    EVENTS --> KAFKA
    KAFKA --> FLINK
    FLINK --> DB
    DB --> QS
    KAFKA -.->|"Replay if needed"| FLINK
    DB -.-> RECON
    KAFKA -.-> RECON

    style KAFKA fill:#FF9800,color:#fff
    style FLINK fill:#4CAF50,color:#fff
    style RECON fill:#9E9E9E,color:#fff

Ưu điểm	Nhược điểm
1 codebase — chỉ cần maintain 1 pipeline	Cần streaming framework mature (Flink có exactly-once)
Đơn giản hơn Lambda nhiều	Nếu cần re-process, replay từ Kafka (cần đủ retention)
Kafka là immutable log — có thể replay bất kỳ lúc nào
Apache Flink hỗ trợ exactly-once processing

Kết luận của Alex Xu: Kappa architecture là lựa chọn ưu tiên cho bài toán này. Lý do: Apache Flink đã mature đủ để đảm bảo exactly-once processing. Lambda chỉ cần thiết khi streaming framework chưa đủ tin cậy — điều này không còn đúng với Flink 2024+.

Aha Moment: Lambda architecture ra đời vì streaming frameworks ngày xưa (Storm, Samza) không đảm bảo exactly-once. Khi Flink giải quyết vấn đề này, lý do tồn tại của Lambda giảm đi đáng kể. Tuy nhiên, reconciliation (batch job kiểm tra lại) vẫn cần thiết như safety net — đây không phải là Lambda, mà là best practice.

2.3.5 Watermark và Late Events

Đây là khái niệm khó nhất trong stream processing. Hiếu cần hiểu rõ:

Event Time vs Processing Time

Khái niệm	Định nghĩa	Ví dụ
Event time	Thời điểm sự kiện thực sự xảy ra (user click)	14:00:00 — user click vào quảng cáo
Processing time	Thời điểm hệ thống nhận được event	14:00:05 — event đến Flink (trễ 5 giây do network)
Ingestion time	Thời điểm event vào Kafka	14:00:02 — Kafka nhận event

Tại sao quan trọng? Vì bạn aggregate theo time window. Nếu dùng processing time, 1 click xảy ra lúc 13:59:59 nhưng đến lúc 14:00:05 sẽ bị đếm vào window 14:00-14:01 thay vì 13:59-14:00. Sai window = sai kết quả.

→ Kết luận: Luôn dùng event time cho aggregation. Processing time chỉ dùng cho monitoring hệ thống.

Watermark là gì?

Watermark = lời tuyên bố của hệ thống rằng: “Tất cả event với event_time ⇐ W đã đến hết”.

Ví dụ: Watermark = 14:05:00 có nghĩa là hệ thống tin rằng không còn event nào với timestamp trước 14:05:00 sẽ đến nữa. Hệ thống có thể đóng (close) tất cả window trước 14:05:00 và emit kết quả.

Vấn đề: Làm sao biết tất cả event đã đến? Không thể biết chắc chắn. Vì vậy watermark luôn có allowed lateness (độ trễ cho phép).

Watermark strategy	Mô tả	Trade-off
Punctuated watermark	Emit watermark dựa trên event data (ví dụ: max event_time - 5s)	Phù hợp traffic đều
Periodic watermark	Emit watermark định kỳ (ví dụ: mỗi 200ms, watermark = max event_time seen - 10s)	Phổ biến nhất, đơn giản
Watermark = max_event_time - allowed_lateness	Ví dụ: max_event_time - 5 phút	Balance giữa latency và completeness

Late Event Handling

Khi event đến sau watermark (late event), có 3 cách xử lý:

Strategy	Mô tả	Khi nào dùng
Drop	Bỏ qua late event	Khi data accuracy không quan trọng (dashboard xấp xỉ)
Update aggregation	Mở lại window, cập nhật kết quả, emit lại	Khi cần accuracy cao — dùng cho ad click
Side output	Gửi late event sang 1 stream riêng để xử lý sau	Khi late event cần xử lý khác (reconciliation, alerting)

Trong bài toán ad click, Alex Xu khuyên dùng kết hợp Update + Side output:

Late event trong 5 phút: update aggregation (mở lại window, đếm lại)
Late event quá 5 phút: gửi vào side output → reconciliation batch job xử lý

Aha Moment: Watermark là khái niệm khó nhất trong stream processing vì nó liên quan đến trade-off giữa latency và completeness. Watermark quá nhanh (allowed_lateness nhỏ) → nhiều late events bị bỏ → sai kết quả. Watermark quá chậm (allowed_lateness lớn) → latency cao → dashboard cập nhật chậm. Không có giá trị “đúng” — chỉ có giá trị “phù hợp với business yêu cầu”.

2.3.6 Exactly-Once Processing

Trong hệ thống phân tán, exactly-once là cực kỳ khó vì có nhiều điểm failure:

Điểm failure	Vấn đề	Giải pháp
Kafka consumer đọc event	Consumer crash sau khi xử lý nhưng trước khi commit offset	Flink checkpointing (snapshot state + Kafka offset cùng lúc)
Aggregation service	Node crash giữa chừng xử lý	Flink savepoints/checkpoints — resume từ last consistent state
Ghi vào DB	Ghi thành công nhưng chưa ack	Idempotent writes — ghi lại cùng data không thay đổi kết quả
Kafka producer (ghi aggregated result)	Producer crash sau khi Kafka nhận nhưng trước khi ack	Kafka idempotent producer (enable.idempotence=true)

Exactly-Once End-to-End Flow

flowchart TB
    subgraph "Source: Kafka Consumer"
        KC["Kafka Consumer<br/>Track offset per partition"]
    end

    subgraph "Processing: Flink"
        CP["Checkpoint Manager<br/>(Periodic snapshots)"]
        STATE["Flink State Backend<br/>(RocksDB / Heap)"]
        OP["Aggregation Operator<br/>(Stateful processing)"]
    end

    subgraph "Sink: Kafka Producer + DB"
        KP["Kafka Producer<br/>(Idempotent + Transactional)"]
        DB[("ClickHouse<br/>(Idempotent Upsert)")]
    end

    subgraph "Checkpoint Flow"
        BARRIER["Checkpoint Barrier<br/>(injected into stream)"]
    end

    KC -->|"Events"| OP
    OP --> STATE
    OP --> KP
    KP --> DB

    CP -->|"Trigger checkpoint"| BARRIER
    BARRIER -->|"Snapshot: offset + state + pending writes"| CP
    CP -->|"On success: commit Kafka offset + flush writes"| KC
    CP -->|"On success: commit Kafka transaction"| KP

    style CP fill:#e53935,color:#fff
    style STATE fill:#1e88e5,color:#fff
    style OP fill:#43a047,color:#fff

Cơ chế Flink Checkpoint hoạt động như thế nào:

Checkpoint Manager định kỳ (ví dụ mỗi 30 giây) inject checkpoint barrier vào stream
Khi barrier đi qua operator, operator snapshot state (partial aggregation, HLL counters) vào durable storage (HDFS/S3)
Đồng thời lưu Kafka consumer offset tại thời điểm đó
Khi tất cả operators đã snapshot xong → checkpoint hoàn thành
Nếu failure xảy ra → Flink rollback về last successful checkpoint:
- Reset Kafka consumer offset về checkpoint offset (đọc lại events từ đó)
- Restore operator state từ checkpoint
- Xử lý lại events từ checkpoint offset → kết quả giống hệt vì state giống hệt

Quan trọng: Exactly-once trong Flink không có nghĩa là mỗi event chỉ được xử lý 1 lần. Nó có nghĩa là kết quả cuối cùng giống như mỗi event chỉ được xử lý 1 lần. Thực tế, events có thể được xử lý lại (sau checkpoint restore) nhưng kết quả không thay đổi nhờ idempotent writes.

2.3.7 Deduplication — Cùng 1 click bị gửi 2 lần

Khác với exactly-once (vấn đề của hệ thống), deduplication xử lý vấn đề từ phía client:

Nguyên nhân	Mô tả	Ví dụ
Client retry	User click, request timeout, client gửi lại	1 click → 2 events
Double click	User click nhanh 2 lần	2 clicks thực tế nhưng chỉ nên đếm 1
Page reload	User refresh trang có quảng cáo	Click event gửi lại
CDN/Proxy retry	Intermediate server retry request	1 click → nhiều events

Dedup Strategy

Cách tiếp cận	Mô tả	Trade-off
Dedup key: (ad_id, user_id, click_timestamp)	2 events có cùng 3 field = duplicate	Cần lưu seen keys trong window → memory
Dedup window	Chỉ check duplicate trong 1 khoảng thời gian (ví dụ 5 phút)	Duplicate xa hơn window không bị bắt
Bloom filter	Probabilistic — check “đã thấy chưa?” với false positive rate thấp	Tiết kiệm memory, nhưng có false positive (bỏ nhầm event thật)
External dedup store (Redis)	Lưu dedup key trong Redis với TTL	Chính xác hơn Bloom filter, nhưng thêm latency và dependency

Kết luận: Dùng Bloom filter trong Flink operator với dedup window 5 phút. False positive rate ~0.1% là chấp nhận được (bỏ nhầm 1/1000 click thật — insignificant với 1 tỷ clicks/ngày).

Cho trường hợp cần chính xác tuyệt đối (billing), dùng Redis dedup với key {ad_id}:{user_id}:{minute_bucket} và TTL 10 phút.

2.3.8 Time Window Types

Flink (và các streaming framework khác) hỗ trợ nhiều loại time window:

Tumbling Window

Thuộc tính	Giá trị
Định nghĩa	Fixed-size, non-overlapping windows
Ví dụ	Window 1 phút: [14:00-14:01), [14:01-14:02), [14:02-14:03)
Khi nào dùng	Ad click aggregation — đếm click trong từng phút, từng giờ
Đặc điểm	Mỗi event thuộc đúng 1 window. Đơn giản nhất, efficient nhất

Sliding Window

Thuộc tính	Giá trị
Định nghĩa	Fixed-size windows nhưng overlap với nhau
Ví dụ	Window 5 phút, slide mỗi 1 phút: [14:00-14:05), [14:01-14:06), [14:02-14:07)
Khi nào dùng	Khi cần “moving average” — ví dụ “số click trong 5 phút gần nhất” cập nhật mỗi phút
Đặc điểm	Mỗi event thuộc nhiều windows. Tốn memory hơn tumbling

Session Window

Thuộc tính	Giá trị
Định nghĩa	Window đóng khi không có event mới trong 1 khoảng thời gian (gap)
Ví dụ	Session gap 30 phút: events lúc 14:00, 14:05, 14:10 thuộc 1 session. Event lúc 14:50 bắt đầu session mới
Khi nào dùng	User behavior analysis — 1 session browsing của user
Đặc điểm	Kích thước window động (dynamic), không cố định

Cho bài toán ad click: Dùng tumbling window cho 1 phút, 5 phút, 1 giờ. Đây là lựa chọn tự nhiên nhất vì:

Mỗi click chỉ thuộc 1 window → đơn giản, không duplicate counting
Window 5 phút = aggregate 5 windows 1-phút → có thể tính từ window nhỏ hơn
Window 1 giờ = aggregate 60 windows 1-phút → tiếp tục cascade

Aha Moment: Không cần 3 pipeline riêng cho 3 window sizes. Chỉ cần aggregate window 1 phút (smallest), sau đó cascade aggregation: 5 windows 1-phút → 1 window 5-phút, 12 windows 5-phút → 1 window 1-giờ. Đây tiết kiệm compute và storage đáng kể.

2.3.9 Scaling — Xử lý traffic lớn

Kafka Partitioning

Chiến lược	Mô tả	Ưu/Nhược
Partition by ad_id	Hash(ad_id) % num_partitions	Tất cả clicks của 1 ad vào cùng partition → aggregation đơn giản, NHƯNG hot ads tạo hot partition
Partition by ad_id + minute	Hash(ad_id + minute_bucket)	Phân tán hơn, nhưng cần reduce step để merge
Random partition	Round-robin	Phân tán đều nhất, nhưng aggregation phải shuffle data (expensive)

Kết luận: Dùng partition by ad_id làm default. Xử lý hot partition riêng (Section 2.3.11).

Flink Parallelism

Component	Parallelism	Lý do
Source (Kafka consumer)	= Số Kafka partitions	Mỗi consumer đọc 1 partition
Map operator	= Số Kafka partitions	1:1 với source
Aggregate operator	Tùy throughput	keyBy(ad_id) tự động distribute
Sink (Kafka producer)	Tùy throughput	Có thể nhỏ hơn source

Khi cần scale:

Tăng Kafka partitions (ví dụ từ 100 → 200)
Tăng Flink parallelism tương ứng
Tăng consumer group instances
Flink auto-scaling (reactive mode) — Flink 1.13+ hỗ trợ

Aggregation Node Horizontal Scaling

Flink tự động distribute state dựa trên key group. Khi tăng parallelism:

State của key group A được migrate từ node 1 sang node 2
Trong quá trình migration, có brief pause (managed bởi checkpoint/restore)
Không cần manual resharding như với database

2.3.10 Fault Tolerance

Cơ chế	Mức	Mô tả
Flink Checkpoints	Streaming	Periodic snapshot của tất cả operator state + Kafka offsets. Automatic recovery
Flink Savepoints	Streaming	Manual checkpoint cho planned maintenance (upgrade, config change)
Kafka Consumer Group Rebalancing	Messaging	Khi 1 consumer die, partitions được redistribute cho consumers còn lại
Kafka Replication	Messaging	Mỗi partition có 3 replicas — tolerate 2 broker failures
DB Replication	Storage	Cassandra/ClickHouse replication factor 3
End-to-end exactly-once	Toàn hệ thống	Flink checkpoint + Kafka transaction + idempotent DB write

Failure scenarios và recovery:

Scenario	Hệ thống phản ứng
1 Flink TaskManager crash	JobManager detect (heartbeat timeout) → redeploy tasks → restore từ last checkpoint
Flink JobManager crash	Standby JobManager take over (HA mode với ZooKeeper) → restore job từ last checkpoint
1 Kafka broker crash	ISR (In-Sync Replicas) → leader election → consumer reconnect → continue từ last committed offset
DB node crash	Replication → query route sang replica → repair node sau
Network partition (Flink ↔ Kafka)	Consumer timeout → rebalance → reconnect → replay từ last offset
Entire data center failure	Cross-DC replication (Kafka MirrorMaker, Cassandra multi-DC) → failover sang DC khác

2.3.11 Reconciliation — Verify Streaming Results

Dù Flink có exactly-once, vẫn cần reconciliation vì:

Bug trong aggregation logic
Clock skew giữa servers
Edge cases trong watermark/late event handling
Data corruption

Reconciliation Flow

Bước	Mô tả
1. Batch job chạy mỗi giờ	Đọc raw events từ Cassandra/S3 cho window vừa qua
2. Re-aggregate	Tính lại click_count và unique_count bằng batch processing (Spark)
3. So sánh	Compare batch result với streaming result trong aggregated DB
4. Detect drift	Nếu sai lệch > threshold (ví dụ 0.1%) → alert
5. Auto-correct (optional)	Ghi đè kết quả streaming bằng kết quả batch (batch là source of truth)

Metric	Threshold	Action
Drift < 0.01%	Normal	Log only
Drift 0.01% - 0.1%	Warning	Alert team
Drift > 0.1%	Critical	Alert + auto-correct + investigate root cause

Aha Moment: Reconciliation là safety net, không phải primary mechanism. Trong hệ thống tốt, reconciliation luôn cho kết quả khớp (drift ~ 0%). Nếu drift thường xuyên > 0, có bug trong streaming pipeline.

2.3.12 Hot Shard Problem — Quảng cáo viral

Vấn đề: Quảng cáo viral (ví dụ Super Bowl ad) có thể nhận 100x-1000x traffic bình thường. Vì partition by ad_id, tất cả clicks của ad này vào 1 Kafka partition và 1 Flink subtask → hot shard → bottleneck.

Triệu chứng	Hậu quả
1 Kafka partition nhận quá nhiều data	Consumer lag tăng, latency tăng
1 Flink subtask dùng quá nhiều CPU/memory	Backpressure, checkpoint timeout
Các partition/subtask khác idle	Waste resources

Giải pháp: Salted Keys + Secondary Aggregation

Bước	Mô tả
1. Detect hot key	Monitor click rate per ad_id. Nếu vượt threshold (ví dụ 10x average) → mark as hot
2. Salt the key	Thay vì partition by ad_id, partition by `ad_id + random(0..N)` với N = số shards mong muốn
3. First aggregation	Mỗi shard aggregate riêng → N partial results cho 1 ad_id
4. Secondary aggregation	1 reducer merge N partial results → final result cho ad_id

Ví dụ với ad_001 là hot key, N=10:

Thay vì 1 partition nhận tất cả clicks của ad_001
10 partitions nhận clicks: ad_001_0, ad_001_1, …, ad_001_9
10 aggregation nodes đếm riêng
1 reduce node cộng 10 partial counts lại

Trade-off:

Pro: Traffic phân tán đều, không còn hot shard
Con: Thêm 1 aggregation step, logic phức tạp hơn, latency tăng chút ít
Con: Unique count (HLL) cần merge — nhưng HLL hỗ trợ merge tốt nên không vấn đề

Aha Moment: Hot shard problem không chỉ xảy ra trong bài toán này. Nó xuất hiện bất cứ khi nào bạn partition by key và 1 key có traffic bất cân xứng. Giải pháp “salted keys + secondary aggregation” là general pattern áp dụng được cho nhiều bài toán khác.

3. Back-of-the-Envelope Estimation

3.1 Traffic Estimation

Assumptions:

1 tỷ (1 billion) clicks/ngày
Peak traffic = 5x average

$Average QPS = \frac{1 , 000 , 000 , 000}{86 , 400} \approx 11, 574 clicks/sec$

$Peak QPS = 11, 574 \times 5 \approx 57, 870 clicks/sec \approx 50 K- 60 K clicks/sec$

3.2 Kafka Throughput

Raw event size: ~1 KB (ad_id + timestamp + user_id + ip + country + metadata)

$Average throughput = 11, 574 \times 1 KB \approx 11.3 MB/sec$

$Peak throughput = 57, 870 \times 1 KB \approx 56.5 MB/sec$

Kafka partition estimation:

Mỗi partition handle ~10 MB/sec (conservative)
Số partitions cần thiết cho peak: $⌈ 56.5/10 ⌉ = 6$ partitions tối thiểu
Thực tế dùng 100-200 partitions để scale consumers và đảm bảo parallelism cho Flink

Kafka retention:

Raw topic retention: 7 ngày
Storage per day: $1 B \times 1 KB = 1 TB/ng \overset{a}{ˋ} y$
7-day retention: $1 TB \times 7 = 7 TB$
Với replication factor 3: $7 TB \times 3 = 21 TB$ Kafka cluster storage

3.3 Raw Data Storage

$Daily raw storage = 1, 000, 000, 000 \times 1 KB = 1 TB/ng \overset{a}{ˋ} y$

$7-day retention (Cassandra) = 7 TB$

$Long-term archive (S3, compressed 5:1) = \frac{1 TB}{5} \times 365 = 73 TB/n \overset{a}{˘} m$

3.4 Aggregated Data Storage

Assumptions:

2 triệu active ads
Aggregation windows: 1 phút, 5 phút, 1 giờ
Mỗi aggregated record: ~100 bytes (ad_id + window + counts)

$Records per minute = 2, 000, 000 (1 per active ad)$

$Records per day (1-min window) = 2, 000, 000 \times 1, 440 ph \overset{u}{ˊ} t = 2.88 \times 1 0^{9} records$

$Storage per day (1-min) = 2.88 \times 1 0^{9} \times 100 B = 288 GB/ng \overset{a}{ˋ} y$

Với ClickHouse compression ratio ~10:1:

$Compressed storage per day \approx 28.8 GB/ng \overset{a}{ˋ} y$

$Storage per year \approx 28.8 \times 365 = 10.5 TB/n \overset{a}{˘} m (compressed)$

3.5 Memory cho Flink State

HyperLogLog cho unique count:

$HLL memory = 2, 000, 000 ads \times 12 KB/HLL = 24 GB$

Partial aggregation state (tumbling window 1 min):

$State per ad \approx 100 bytes (counts + metadata)$

$Total state = 2, 000, 000 \times 100 B = 200 MB$

Tổng memory Flink state: ~25 GB — phân bổ trên nhiều TaskManagers, hoàn toàn khả thi.

3.6 Tổng hợp

Resource	Giá trị
Traffic	11.5K avg / 58K peak clicks/sec
Kafka cluster storage	21 TB (7-day retention, RF=3)
Kafka throughput	11 MB/s avg / 57 MB/s peak
Kafka partitions	100-200
Raw data (Cassandra)	7 TB (7-day TTL)
Raw data (S3 archive)	73 TB/year (compressed)
Aggregated data (ClickHouse)	10.5 TB/year (compressed)
Flink state memory	~25 GB (distributed)
Flink TaskManagers	10-20 (tùy CPU/memory spec)

4. Security & Data Privacy

4.1 Click Fraud Detection

Click fraud là mối đe dọa lớn nhất đối với hệ thống quảng cáo. Nếu không phát hiện, advertiser mất tiền, platform mất uy tín.

Loại fraud	Mô tả	Dấu hiệu nhận biết
Bot clicks	Automated scripts click quảng cáo liên tục	Click rate bất thường từ 1 IP, không có mouse movement/scroll
Click farms	Nhóm người được trả tiền để click	Nhiều clicks từ cùng geo location, pattern giống nhau
Competitor clicking	Đối thủ click quảng cáo của bạn để hao ngân sách	Clicks từ cùng IP range, không conversion
Ad stacking	Nhiều quảng cáo chồng lên nhau, 1 click đếm cho nhiều ad	Click coordinates giống nhau cho nhiều ads
Pixel stuffing	Quảng cáo hiển thị 1x1 pixel, không ai thấy nhưng đếm là “impression”	Impression không tương ứng với click

Anti-Fraud Strategies

Strategy	Mô tả	Khi nào áp dụng
IP-based rate limiting	Giới hạn số click từ 1 IP per ad per time window	Real-time, tại ingestion layer
User-agent fingerprinting	Phát hiện bot dựa trên browser fingerprint	Real-time, tại ingestion layer
Click-through rate (CTR) anomaly	CTR bất thường cao cho 1 ad → suspicious	Near-real-time, tại aggregation layer
Geo anomaly	Click từ country không phù hợp với target audience	Near-real-time
Session analysis	Phân tích hành vi sau click (bounce rate, time on page)	Batch analysis
Machine learning model	Train model phát hiện fraud patterns	Batch + real-time inference

Implementation: Fraud detection không làm trong aggregation pipeline chính. Thay vào đó:

Raw events đi qua fraud detection service (parallel với aggregation)
Fraud service flag suspicious clicks
Flagged clicks không được đếm trong aggregation (hoặc đếm riêng)
Advertiser có thể dispute và review flagged clicks

4.2 User Privacy

Nguyên tắc	Mô tả
Aggregate only	Report chỉ chứa aggregated numbers (click_count), KHÔNG chứa individual user data
No PII in reports	user_id, ip KHÔNG xuất hiện trong aggregated results
PII only for dedup	user_id chỉ dùng trong dedup window (5 phút), sau đó discard
IP hashing	IP address được hash trước khi lưu (không lưu raw IP)
Data minimization	Chỉ thu thập data cần thiết cho aggregation, không hơn
GDPR/CCPA compliance	User có quyền yêu cầu xóa data. Raw data TTL 7 ngày hỗ trợ điều này
Consent-based tracking	Chỉ track click khi user đã consent (cookie consent banner)

4.3 Data Retention Policies

Data type	Retention	Lý do	Storage
Raw click events	7 ngày (hot) + 90 ngày (archive)	Reconciliation + dispute resolution	Cassandra → S3
Aggregated data (1-min)	1 năm	Dashboard + reporting	ClickHouse
Aggregated data (1-hour)	3 năm	Long-term analytics	ClickHouse (compressed)
Aggregated data (1-day)	Vĩnh viễn	Historical trend	ClickHouse (cold storage)
Dedup keys	10 phút	Dedup window	Redis (TTL)
Fraud detection logs	1 năm	Audit + dispute	S3

5. DevOps & Monitoring

5.1 Key Metrics cần monitor

Kafka Metrics

Metric	Mô tả	Alert threshold
Consumer lag	Số messages chưa được consume	> 100K messages → Warning, > 1M → Critical
Under-replicated partitions	Partitions chưa đủ replicas	> 0 → Warning
Broker disk usage	Disk usage của Kafka broker	> 80% → Warning, > 90% → Critical
Producer error rate	Tỷ lệ lỗi khi produce message	> 0.1% → Warning
Request latency (p99)	Latency của produce/consume request	> 100ms → Warning

Flink Metrics

Metric	Mô tả	Alert threshold
Checkpoint duration	Thời gian hoàn thành 1 checkpoint	> 30s → Warning, > 60s → Critical
Checkpoint failure rate	Tỷ lệ checkpoint thất bại	> 0 trong 1 giờ → Critical
Backpressure	Operator đang bị chậm, buffer đầy	isBackPressured = true > 5 phút → Warning
Records per second	Throughput của Flink job	Giảm > 50% so với baseline → Warning
State size	Kích thước state của operators	Tăng > 2x so với baseline → Warning
Late events dropped	Số late events bị drop	> 1% total events → Warning
TaskManager failures	Số TaskManager crash	> 0 → Alert

Data Freshness & Accuracy

Metric	Mô tả	Alert threshold
End-to-end latency	Thời gian từ click xảy ra đến data queryable trong DB	> 5 phút → Warning, > 10 phút → Critical
Data freshness	Khoảng cách giữa “now” và timestamp của record mới nhất trong DB	> 3 phút → Warning
Reconciliation drift	% chênh lệch giữa streaming và batch result	> 0.01% → Warning, > 0.1% → Critical
Query latency (p99)	Latency của query service	> 500ms → Warning, > 2s → Critical

Database Metrics

Metric	Mô tả	Alert threshold
Write latency (p99)	Latency ghi vào ClickHouse/Cassandra	> 100ms → Warning
Disk usage	Storage usage	> 75% → Warning
Compaction lag (Cassandra)	Pending compactions	> 10 → Warning
Query queue size (ClickHouse)	Số queries đang chờ	> 50 → Warning

5.2 Alerting Strategy

Severity	Ví dụ	Notification	Response time
P1 - Critical	Consumer lag > 1M, checkpoint failures, data freshness > 10 min	PagerDuty (phone call)	< 15 phút
P2 - Warning	Consumer lag > 100K, reconciliation drift > 0.01%, high backpressure	Slack + PagerDuty (SMS)	< 1 giờ
P3 - Info	Traffic spike, scheduled maintenance	Slack	Next business day

5.3 Runbook — Common Issues

Vấn đề	Nguyên nhân thường gặp	Cách xử lý
Consumer lag tăng đột ngột	Traffic spike (viral ad), slow consumer, GC pause	Scale Flink parallelism, check backpressure, check GC logs
Checkpoint failure	State quá lớn, slow storage, network issue	Tăng checkpoint timeout, check state size, check HDFS/S3 health
Reconciliation drift	Bug trong aggregation logic, clock skew, late events	So sánh raw events với aggregated results, check watermark config
High query latency	ClickHouse overloaded, missing index, large result set	Add materialized view, optimize query, scale read replicas
Kafka broker down	Disk full, hardware failure, network partition	Check disk, failover sang replica, add new broker

5.4 Deployment Strategy

Chiến lược	Mô tả	Khi nào dùng
Blue-green deployment	Chạy 2 Flink jobs song song (old + new), compare results, switch traffic	Major logic changes
Canary deployment	Route 5% traffic sang new version, monitor metrics	Minor changes
Savepoint + restart	Take savepoint của Flink job, deploy new version, restore từ savepoint	Config changes, minor updates
A/B testing	2 pipelines xử lý cùng data, compare results	Testing new aggregation logic

6. Mermaid Diagrams — Tổng hợp

6.1 High-Level Architecture (Complete)

flowchart TB
    subgraph "Clients"
        BROWSER["User Browsers"]
        APP["Mobile Apps"]
    end

    subgraph "Ingestion Layer"
        LB["Load Balancer<br/>(L7)"]
        IS["Ingestion Service Cluster<br/>(Stateless, Auto-scale)"]
    end

    subgraph "Message Queue Layer"
        KR["Kafka — Raw Events Topic<br/>(100+ partitions, RF=3)"]
    end

    subgraph "Processing Layer"
        FRAUD["Fraud Detection<br/>Service (Async)"]
        FLINK["Apache Flink Cluster<br/>Map → Aggregate → Reduce"]
    end

    subgraph "Storage Layer"
        RAW_CASS[("Cassandra<br/>Raw Events<br/>TTL 7 days")]
        RAW_S3[("S3 Archive<br/>Raw Events<br/>Parquet")]
        AGG_CH[("ClickHouse<br/>Aggregated Data")]
    end

    subgraph "Serving Layer"
        QS["Query Service"]
        REDIS[("Redis Cache")]
    end

    subgraph "Reconciliation"
        SPARK["Spark Batch Job<br/>(Hourly)"]
    end

    subgraph "Consumers"
        DASH["Advertiser Dashboard"]
        BILLING["Billing Service"]
        ANALYTICS["Analytics Platform"]
    end

    BROWSER & APP --> LB
    LB --> IS
    IS --> KR
    KR --> FRAUD
    KR --> FLINK
    KR --> RAW_CASS
    RAW_CASS -.->|"Archive"| RAW_S3
    FLINK --> AGG_CH
    AGG_CH --> QS
    QS --> REDIS
    QS --> DASH & BILLING & ANALYTICS

    RAW_S3 -.-> SPARK
    AGG_CH -.-> SPARK

    style KR fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style FLINK fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style AGG_CH fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
    style FRAUD fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style SPARK fill:#9C27B0,stroke:#333,stroke-width:2px,color:#fff

6.2 Exactly-Once End-to-End Flow (Detailed)

sequenceDiagram
    participant Kafka as Kafka (Source)
    participant Flink as Flink Operator
    participant State as Flink State (RocksDB)
    participant CM as Checkpoint Manager
    participant Storage as Durable Storage (HDFS)
    participant Sink as Kafka (Sink) + DB

    Note over Kafka,Sink: Normal Processing
    Kafka->>Flink: Event batch (offset 100-200)
    Flink->>State: Update aggregation state
    Flink->>Sink: Emit partial result (uncommitted)

    Note over CM,Storage: Checkpoint Triggered (every 30s)
    CM->>Flink: Inject checkpoint barrier
    Flink->>State: Snapshot current state
    State->>Storage: Persist state snapshot
    Flink->>CM: Report: offset=200, state=snapshot_42
    CM->>Kafka: Commit consumer offset=200
    CM->>Sink: Commit Kafka transaction (flush writes to DB)

    Note over Kafka,Sink: Failure & Recovery
    Flink->>Flink: TaskManager CRASH!
    CM->>Storage: Load last checkpoint (snapshot_42, offset=200)
    CM->>Kafka: Reset consumer to offset=200
    CM->>Flink: Restore state from snapshot_42
    Kafka->>Flink: Replay events from offset 200
    Flink->>State: Re-process (same state → same result)
    Flink->>Sink: Re-emit results (idempotent write → no duplicate)

6.3 Lambda vs Kappa Architecture Comparison

flowchart TB
    subgraph "Lambda Architecture"
        direction TB
        L_SRC["Events"] --> L_BATCH["Batch Layer<br/>(Spark, chạy mỗi giờ)"]
        L_SRC --> L_SPEED["Speed Layer<br/>(Flink, real-time)"]
        L_BATCH --> L_BV["Batch View<br/>(accurate, stale)"]
        L_SPEED --> L_RV["Real-time View<br/>(fast, approximate)"]
        L_BV --> L_MERGE["Merge"]
        L_RV --> L_MERGE
        L_MERGE --> L_QS["Query"]

        style L_BATCH fill:#4CAF50,color:#fff
        style L_SPEED fill:#FF9800,color:#fff
        style L_MERGE fill:#f44336,color:#fff
    end

    subgraph "Kappa Architecture (Preferred)"
        direction TB
        K_SRC["Events"] --> K_KAFKA["Kafka<br/>(Immutable Log)"]
        K_KAFKA --> K_STREAM["Stream Layer<br/>(Flink, real-time<br/>exactly-once)"]
        K_STREAM --> K_DB["Serving DB"]
        K_DB --> K_QS["Query"]
        K_KAFKA -.->|"Replay<br/>if needed"| K_STREAM
        K_DB -.-> K_RECON["Reconciliation<br/>(Safety net)"]

        style K_KAFKA fill:#FF9800,color:#fff
        style K_STREAM fill:#4CAF50,color:#fff
        style K_RECON fill:#9E9E9E,color:#fff
    end

6.4 Hot Shard Solution — Salted Keys

flowchart TB
    subgraph "Problem: Hot Shard"
        direction LR
        HOT_AD["ad_001 (viral)<br/>100K clicks/sec"] --> P1_HOT["Partition 0<br/>OVERLOADED"]
        NORMAL1["ad_002<br/>10 clicks/sec"] --> P2["Partition 1<br/>idle"]
        NORMAL2["ad_003<br/>15 clicks/sec"] --> P3["Partition 2<br/>idle"]

        style P1_HOT fill:#f44336,color:#fff
    end

    subgraph "Solution: Salted Keys + Secondary Aggregation"
        direction TB
        SALT["ad_001 → ad_001_0, ad_001_1, ..., ad_001_9"]

        subgraph "First Aggregation (Parallel)"
            S0["ad_001_0<br/>count=10K"]
            S1["ad_001_1<br/>count=10K"]
            S9["ad_001_9<br/>count=10K"]
        end

        subgraph "Secondary Aggregation (Merge)"
            REDUCE["Reduce Node<br/>10K x 10 = 100K"]
        end

        SALT --> S0 & S1 & S9
        S0 & S1 & S9 --> REDUCE

        style SALT fill:#FF9800,color:#fff
        style REDUCE fill:#4CAF50,color:#fff
    end

7. Aha Moments & Pitfalls

7.1 Aha Moments — Những điều Hiếu cần nhớ

#	Insight	Giải thích
1	Kappa > Lambda cho hầu hết trường hợp	Lambda architecture ra đời vì streaming frameworks chưa mature. Với Flink exactly-once, Kappa đơn giản hơn và đủ chính xác. Chỉ cần thêm reconciliation batch job làm safety net — đây không phải Lambda
2	Watermark là concept khó nhất	Nó đại diện cho trade-off giữa latency và completeness. Không có giá trị “đúng” — chỉ có giá trị phù hợp với business. Cách duy nhất hiểu watermark: thực hành với Flink
3	Hot shard có thể giết performance	1 quảng cáo viral → 1 partition overloaded → toàn bộ pipeline chậm lại (backpressure). Salted keys là giải pháp, nhưng tăng complexity. Monitor click rate per ad_id để detect sớm
4	Reconciliation là safety net bắt buộc	Dù streaming exactly-once, vẫn cần batch job verify. Bug trong logic, clock skew, edge cases — tất cả có thể gây sai lệch nhỏ. Reconciliation phát hiện trước khi advertiser phát hiện
5	Exactly-once không có nghĩa “xử lý 1 lần”	Nó có nghĩa “kết quả giống như xử lý 1 lần”. Events có thể bị replay (sau checkpoint restore) nhưng kết quả không đổi nhờ idempotent writes. Hiểu sai concept này → thiết kế sai
6	Cascade aggregation tiết kiệm compute	Aggregate 1-min window, rồi tính 5-min từ 5x 1-min, 1-hour từ 12x 5-min. Không cần 3 pipeline riêng — 1 pipeline + cascade
7	HyperLogLog là hero ẩn mình	Đếm unique users trong 12KB memory per counter với sai số < 1%. Không có HLL, bài toán unique count ở scale này gần như bất khả thi với memory hợp lý
8	Raw data là bảo hiểm	Luôn lưu raw events (dù đắt). Khi logic sai, khi cần re-process, khi advertiser dispute — raw data là nguồn sự thật duy nhất

7.2 Pitfalls — Những lỗi thường gặp

#	Pitfall	Hậu quả	Cách tránh
1	Dùng processing time thay vì event time	Click xảy ra lúc 13:59 bị đếm vào window 14:00	Luôn dùng event time cho aggregation. Processing time chỉ cho system monitoring
2	Không xử lý late events	Click đến muộn bị mất → under-count → advertiser khiếu nại	Configure allowed lateness, dùng side output cho late events
3	Không có reconciliation	Bug trong streaming làm sai kết quả mà không ai biết	Chạy batch reconciliation mỗi giờ, alert khi drift > threshold
4	Partition by random key	Mỗi aggregation node cần dữ liệu của mọi ad → shuffle quá nhiều data	Partition by ad_id để data locality. Chỉ salt khi cần xử lý hot keys
5	Checkpoint interval quá lớn	Khi failure, phải replay nhiều events → recovery chậm, duplicate window lớn	Checkpoint mỗi 30s-1 phút. Balance giữa overhead và recovery time
6	Không monitor consumer lag	Pipeline chậm mà không biết → data stale → advertiser thấy số cũ	Consumer lag là metric #1 cần monitor. Alert ngay khi tăng bất thường
7	Single point of failure cho fraud detection	Fraud service down → tất cả clicks được chấp nhận → mất tiền	Fraud detection async, không block aggregation pipeline. Fallback: flag và review sau
8	Unique count dùng exact counting	O(n) memory per ad per window → memory explosion với 2M ads	Dùng HyperLogLog. Chỉ dùng exact counting cho billing-critical use cases
9	Không lưu raw data	Không thể re-process khi logic sai, không thể reconcile	Luôn lưu raw events — Cassandra (7 ngày) + S3 (archive)
10	Scale vertically thay vì horizontally	Hit hardware limit, single point of failure	Kafka partitions + Flink parallelism cho horizontal scaling. Vertical chỉ cho DB optimization

7.3 Interview Tips

Tip	Giải thích
Bắt đầu với requirements	Hỏi rõ: bao nhiêu clicks/ngày, latency target, accuracy yêu cầu. Không nhảy vào solution ngay
Nói về trade-offs	Lambda vs Kappa, HLL vs exact count, Cassandra vs ClickHouse — interviewer muốn nghe bạn cân nhắc, không phải chỉ chọn 1
Vẽ diagram trước, chi tiết sau	High-level flow: Ingestion → Queue → Aggregation → DB → Query. Rồi dive vào từng component
Nhắc đến exactly-once sớm	Đây là yêu cầu khó nhất của bài toán. Nếu bạn biết giải thích Flink checkpoint + idempotent write → ấn tượng lớn
Hot shard là bonus point	Nhiều ứng viên không nghĩ tới. Nếu bạn tự đề cập và giải pháp salted keys → interviewer biết bạn có kinh nghiệm thực tế
Reconciliation cho thấy bạn là senior	Junior engineer chỉ nghĩ streaming là đủ. Senior biết cần safety net. Nhắc đến reconciliation batch job → chứng tỏ production experience

8. Tổng kết — Summary Table

Khía cạnh	Quyết định	Lý do
Architecture	Kappa (streaming only + reconciliation)	Đơn giản hơn Lambda, Flink exactly-once mature
Message queue	Apache Kafka	Durable, partitioned, replayable
Streaming engine	Apache Flink	Exactly-once, stateful processing, checkpoint/savepoint
Raw data store	Cassandra (7 days) + S3 (archive)	Write-heavy, time-series, cost-effective
Aggregated data store	ClickHouse	OLAP-optimized, fast aggregation queries
Partitioning strategy	By ad_id (salted for hot keys)	Data locality cho aggregation
Time window	Tumbling 1-min, cascade to 5-min and 1-hour	Simple, no overlap, efficient
Unique counting	HyperLogLog	Memory efficient, mergeable, ~1% error OK
Deduplication	Bloom filter (5-min window)	Memory efficient, low false positive
Late events	Watermark + allowed lateness 5 min + side output	Balance latency và completeness
Reconciliation	Spark batch job hourly	Safety net, detect drift
Fraud detection	Async, parallel pipeline	Không block aggregation

9. Internal Links & Further Reading

Prerequisite (đã học)

Tuan-01-Scale-From-Zero-To-Millions — Nền tảng về scalability
Tuan-02-Back-of-the-envelope — Kỹ năng estimation
Tuan-08-Message-Queue — Kafka fundamentals, consumer groups, partitions

Liên quan (tham khảo chéo)

Tuan-13-Monitoring-Observability — Monitoring pipeline tương tự (metrics ingestion)
Case-Design-Metrics-Monitoring-Alerting — Bài toán tương tự nhưng cho infrastructure metrics
Tuan-07-Database-Sharding-Replication — Cassandra partitioning, ClickHouse replication

Concepts để tìm hiểu thêm

Apache Flink — Stateful stream processing, checkpointing, exactly-once semantics
ClickHouse — Column-oriented OLAP database, materialized views
HyperLogLog — Probabilistic cardinality estimation
Kafka Streams vs Flink — Khi nào dùng cái nào
Event sourcing — Lưu tất cả events, derive state từ events (Kafka là event log)

“Hệ thống tốt không phải là hệ thống không bao giờ sai — mà là hệ thống biết khi nào nó sai, và tự sửa được. Reconciliation chính là cơ chế tự sửa đó.”

lthieu's notes

Explorer

Case-Design-Ad-Click-Event-Aggregation