Week 19: Design a Notification System

“A good notification earns the user's thanks. A bad one gets the app uninstalled. The line between them is thinner than you think.”

Tags: system-design notification case-study alex-xu
Student: Hieu
Prerequisites: Tuan-08-Message-Queue · Tuan-09-Rate-Limiter · Tuan-06-Cache-Strategy
Related: Tuan-17-Design-Chat-System · Tuan-18-Design-News-Feed · Tuan-13-Monitoring-Observability · Tuan-14-CI-CD-Pipeline


Step 1 — Understand the Problem & Establish Design Scope

1.1 Functional Requirements

Hieu, before drawing any diagram, ask the interviewer to nail down the scope. These are the key questions:

| Question | Assumed Answer | Why Ask |
|---|---|---|
| Which notification types must the system support? | Push notifications (iOS/Android), SMS, email | Each channel has its own architecture |
| Who or what triggers a notification? | Event-driven: new message, order update, marketing campaign, system alert | Distinguishes transactional vs marketing |
| Is there a real-time requirement? | Push & SMS: near real-time (< 5s). Email: best-effort (< 5 min) | Affects queue priority |
| Can users opt out per channel? | Yes: per-channel, per-notification-type user preferences | Requires a User Preference Service |
| Are there priority levels? | Urgent (OTP, security alert), Normal (order update), Low (marketing) | Priority queue design |
| Do we need analytics? | Yes: delivery rate, open rate, click-through rate | Analytics pipeline |
| Multi-language support? | Yes: i18n via a template engine | Template design |

1.2 Non-functional Requirements

| Requirement | Target | Notes |
|---|---|---|
| Scalability | 100M DAU, 500M notifications/day | Large-scale system |
| Availability | 99.9% (three nines) | 8.77 hours of downtime/year |
| Latency | Urgent: < 3s, Normal: < 30s, Low: best-effort | SLA per priority |
| Reliability | At-least-once delivery, ideally exactly-once | Deduplication mechanism |
| Compliance | CAN-SPAM (email), GDPR (EU), TCPA (SMS) | Legal requirements |

1.3 Capacity Estimation

Assumptions

| Parameter | Value | Explanation |
|---|---|---|
| DAU | 100M | A Facebook/Uber-scale system |
| Total notifications/day | 500M | ~5 notifications/user/day on average |
| Channel breakdown | Push: 60%, Email: 30%, SMS: 10% | Push is cheapest, SMS most expensive |
| Avg notification payload | Push: 1 KB, Email: 50 KB, SMS: 0.2 KB | Including metadata |
| Analytics event size | 0.5 KB per notification | Delivery + open + click tracking |
| Retention | Notification log: 90 days, Analytics: 1 year | Compliance + business intelligence |

Notification Volume per Channel
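With the 60/30/10 split from the assumptions table, the per-channel volumes work out as follows (a quick sanity check in Python, using integer arithmetic to avoid float rounding):

```python
total_per_day = 500_000_000                # from the assumptions table

push_per_day = total_per_day * 60 // 100   # 60% → 300M push/day
email_per_day = total_per_day * 30 // 100  # 30% → 150M email/day
sms_per_day = total_per_day * 10 // 100    # 10% → 50M SMS/day
```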

QPS Calculation

Why a 5x peak multiplier? Mass-blast marketing campaigns, flash sales, breaking news: all of them create burst traffic.
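The figures behind the summary table can be reconstructed directly:

```python
SECONDS_PER_DAY = 86_400
total_per_day = 500_000_000

avg_qps = total_per_day / SECONDS_PER_DAY  # ≈ 5,787/s, rounded to ~5,800/s
peak_qps = avg_qps * 5                     # ≈ 28,935/s, rounded to ~30,000/s
```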

Worker Count Estimation

Assume each worker handles 100 notifications/s (including template rendering, the provider API call, and retry logic):

Rule of thumb: provision 1.5x the peak worker count for headroom → ~450 workers.
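Worked out in Python, with the rounded peak QPS from the estimate above:

```python
peak_qps = 30_000                 # rounded peak from the QPS estimate
per_worker_qps = 100              # template render + provider call + retry logic

workers_at_peak = peak_qps // per_worker_qps   # 300 workers at peak
provisioned = workers_at_peak * 3 // 2         # 1.5x headroom → 450 workers
```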

Queue Depth Estimation

If a downstream provider (APNs/FCM/SES) stalls for 30 seconds:

Kafka partitions for throughput: each partition handles ~10K msg/s → the push channel needs 3 partitions at peak.
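Both the backlog and the partition count follow from the peak QPS estimate. Strictly, 2 partitions cover the push channel's 18K msg/s peak, so provisioning 3 (as above) includes headroom:

```python
peak_qps = 30_000
stall_seconds = 30

backlog = peak_qps * stall_seconds            # 900,000 buffered messages

push_peak_qps = peak_qps * 60 // 100          # 18,000/s on the push channel
per_partition_qps = 10_000
partitions_needed = -(-push_peak_qps // per_partition_qps)  # ceil → 2 strict minimum
```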

Storage Estimation

Notification log (90 days):

Analytics data (1 year):

Email content storage (90 days):
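The three figures in the summary table can be reconstructed from the assumptions (decimal units assumed, i.e. 1 KB = 1,000 B and 1 TB = 10^12 B):

```python
TB = 10 ** 12   # decimal terabyte

log_bytes = 500_000_000 * 1_000 * 90        # ~1 KB/log entry × 90 days = 45 TB
analytics_bytes = 500_000_000 * 500 * 365   # 0.5 KB/event × 1 year ≈ 91 TB
email_bytes = 150_000_000 * 50_000 * 90     # 50 KB/email × 90 days = 675 TB
```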

Alert: email content dominates the storage budget! Use compression plus object storage (S3) instead of a database.

Bandwidth Estimation
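The outbound bandwidth can be derived from the per-channel volumes and payload sizes (decimal units assumed; these figures do not appear in the summary table, so treat them as a rough sketch):

```python
daily_bytes = (
    300_000_000 * 1_000      # push payloads
    + 150_000_000 * 50_000   # email payloads (dominates)
    + 50_000_000 * 200       # SMS payloads
)                            # 7.81e12 B/day

avg_mb_per_s = daily_bytes / 86_400 / 1e6   # ≈ 90 MB/s on average
peak_mb_per_s = avg_mb_per_s * 5            # ≈ 450 MB/s at the 5x peak
```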

Estimation Summary

| Metric | Value |
|---|---|
| Total QPS (avg) | ~5,800/s |
| Total QPS (peak) | ~30,000/s |
| Push notifications/day | 300M |
| Email notifications/day | 150M |
| SMS notifications/day | 50M |
| Workers needed (peak + headroom) | ~450 |
| Queue depth (30s buffer) | ~900K messages |
| Notification log storage (90d) | ~45 TB |
| Analytics storage (1yr) | ~91 TB |
| Email content storage (90d) | ~675 TB |

Step 2 — Propose High-Level Design

2.1 Analogy (Everyday Example)

Hieu, picture the notification system as a national postal service:

  • Event producers = the letter writers (the services in the system)
  • Notification service = the central post office: sorting, address checking, stamping
  • Template engine = the printer: fills details into a pre-written letter template
  • User preference service = the registry: who wants mail and who does not
  • Rate limiter = the inspector: stops too many letters going to one person
  • Message queues = delivery trucks sorted by region (push/SMS/email)
  • Delivery workers = the mail carriers: each kind knows how to deliver its own mail
  • Dead letter queue = the warehouse for undeliverable mail: retried later

2.2 Architecture Overview

flowchart TB
    subgraph "Event Producers"
        EP1["Order Service"]
        EP2["Chat Service"]
        EP3["Marketing Service"]
        EP4["Auth Service<br/>(OTP, Security)"]
        EP5["Scheduler<br/>(Cron Jobs)"]
    end

    subgraph "Notification Platform"
        NS["Notification Service<br/>(API Gateway)"]
        VAL["Validation &<br/>Deduplication"]
        UPS["User Preference<br/>Service"]
        TE["Template Engine<br/>(i18n)"]
        RL["Rate Limiter"]
        PQ["Priority Router"]
    end

    subgraph "Message Queues (Kafka)"
        direction LR
        Q_PUSH_U["Push Queue<br/>(Urgent)"]
        Q_PUSH_N["Push Queue<br/>(Normal)"]
        Q_EMAIL_U["Email Queue<br/>(Urgent)"]
        Q_EMAIL_N["Email Queue<br/>(Normal)"]
        Q_SMS_U["SMS Queue<br/>(Urgent)"]
        Q_SMS_N["SMS Queue<br/>(Normal)"]
    end

    subgraph "Delivery Workers"
        PW["Push Workers<br/>(FCM / APNs)"]
        EW["Email Workers<br/>(SES / SendGrid)"]
        SW["SMS Workers<br/>(Twilio)"]
    end

    subgraph "External Providers"
        FCM["Firebase Cloud<br/>Messaging"]
        APNS["Apple Push<br/>Notification Service"]
        SES["Amazon SES /<br/>SendGrid"]
        TWILIO["Twilio /<br/>MessageBird"]
    end

    subgraph "Data Stores"
        DB[("Notification Log<br/>(Cassandra)")]
        REDIS[("User Preferences<br/>+ Device Tokens<br/>(Redis)")]
        S3[("Email Templates<br/>+ Content<br/>(S3)")]
        ANALYTICS[("Analytics<br/>(ClickHouse)")]
    end

    subgraph "Reliability"
        DLQ["Dead Letter Queue"]
        RETRY["Retry Scheduler"]
    end

    EP1 & EP2 & EP3 & EP4 & EP5 --> NS
    NS --> VAL
    VAL --> UPS
    UPS --> TE
    TE --> RL
    RL --> PQ

    PQ --> Q_PUSH_U & Q_PUSH_N
    PQ --> Q_EMAIL_U & Q_EMAIL_N
    PQ --> Q_SMS_U & Q_SMS_N

    Q_PUSH_U & Q_PUSH_N --> PW
    Q_EMAIL_U & Q_EMAIL_N --> EW
    Q_SMS_U & Q_SMS_N --> SW

    PW --> FCM & APNS
    EW --> SES
    SW --> TWILIO

    PW & EW & SW --> DB
    PW & EW & SW --> ANALYTICS
    PW & EW & SW -->|"Failed"| DLQ
    DLQ --> RETRY
    RETRY --> Q_PUSH_N & Q_EMAIL_N & Q_SMS_N

    NS --> REDIS
    TE --> S3

    style NS fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style PQ fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style DLQ fill:#f44336,stroke:#333,stroke-width:2px,color:#fff

2.3 Component Responsibilities

| Component | Responsibility | Technology |
|---|---|---|
| Notification Service | API gateway: receives requests from producers, validation | Go / Java microservice |
| Validation & Dedup | Schema validation, idempotency check (idempotency key) | In-process + Redis |
| User Preference Service | Checks opt-in/opt-out per user, per channel, per type | Redis Hash |
| Template Engine | Renders notification content from template + parameters, i18n | Handlebars / Jinja2 |
| Rate Limiter | Per-user + per-channel rate limits, anti-spam | Redis + Token Bucket |
| Priority Router | Routes each notification to the right queue by priority + channel | In-process logic |
| Message Queues | Decouple producers & consumers, buffer burst traffic | Kafka (one topic per channel × priority, plus DLQ; see 3.2) |
| Delivery Workers | Consume from the queues, call the external provider, handle responses | Stateless workers, autoscaled |
| Dead Letter Queue | Holds notifications that fail after max retries | Kafka DLQ topic |
| Notification Log | Stores each notification's state (created, sent, delivered, opened) | Cassandra (write-heavy, time-series) |
| Analytics Service | Aggregates delivery/open/click metrics, A/B testing | ClickHouse / Druid |

2.4 Notification Delivery Flow

sequenceDiagram
    participant Producer as Event Producer
    participant NS as Notification Service
    participant Redis as Redis (Prefs + Dedup)
    participant TE as Template Engine
    participant RL as Rate Limiter
    participant Kafka as Kafka Queue
    participant Worker as Delivery Worker
    participant Provider as External Provider<br/>(FCM/SES/Twilio)
    participant DB as Notification Log

    Producer->>NS: POST /api/v1/notifications<br/>{user_id, type, channel, data, priority}
    NS->>Redis: Check idempotency key
    Redis-->>NS: Not duplicate

    NS->>Redis: Get user preferences
    Redis-->>NS: {push: true, email: true, sms: false}

    Note over NS: User opted out of SMS → skip SMS channel

    NS->>TE: Render template(type, locale, data)
    TE-->>NS: Rendered content

    NS->>RL: Check rate limit(user_id, channel)
    RL-->>NS: Allowed (under limit)

    NS->>Kafka: Publish to push_normal topic
    NS->>Kafka: Publish to email_normal topic
    NS->>DB: Log status = QUEUED

    Kafka->>Worker: Consume message
    Worker->>Provider: Send notification
    Provider-->>Worker: Success (message_id)

    Worker->>DB: Update status = SENT
    Worker->>DB: Log provider_message_id

    Note over Provider: Provider delivers to device/inbox

    Provider-->>Worker: Delivery callback (webhook)
    Worker->>DB: Update status = DELIVERED

Step 3 — Design Deep Dive

3.1 Notification Template Engine

Why templates?

Instead of letting every service compose its own notification content (inconsistent, hard to maintain), centralize the templates:

// Template: order_shipped
// Locale: vi
Chào {{user_name}}, đơn hàng #{{order_id}} đã được giao cho {{carrier}}.
Theo dõi tại: {{tracking_url}}

// Locale: en
Hi {{user_name}}, your order #{{order_id}} has been shipped via {{carrier}}.
Track it here: {{tracking_url}}

Template Storage Design

{
  "template_id": "order_shipped",
  "version": 3,
  "channels": {
    "push": {
      "vi": {
        "title": "Đơn hàng đã gửi!",
        "body": "Đơn #{{order_id}} đang trên đường đến bạn."
      },
      "en": {
        "title": "Order Shipped!",
        "body": "Order #{{order_id}} is on its way."
      }
    },
    "email": {
      "vi": {
        "subject": "Đơn hàng #{{order_id}} đã được gửi",
        "html_template_s3_key": "templates/order_shipped/vi/v3.html"
      },
      "en": {
        "subject": "Your order #{{order_id}} has shipped",
        "html_template_s3_key": "templates/order_shipped/en/v3.html"
      }
    },
    "sms": {
      "vi": {
        "body": "Don hang #{{order_id}} da gui. Theo doi: {{tracking_url}}"
      },
      "en": {
        "body": "Order #{{order_id}} shipped. Track: {{tracking_url}}"
      }
    }
  },
  "required_params": ["user_name", "order_id", "carrier", "tracking_url"],
  "created_at": "2026-01-15T10:00:00Z"
}

Aha Moment #1: template versioning really matters. When you update a template, notifications already in the queue must still render with the old version. Store template_version in the message payload.

i18n Strategy

  • The user profile stores preferred_locale (defaulting to the device locale)
  • Fallback chain: user_locale → country_default → en
  • SMS avoids Unicode to save segments (Unicode SMS = 70 chars/segment vs GSM-7 = 160 chars/segment)
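The fallback chain can be sketched as a small resolver; the country-default mapping and function name here are illustrative, not part of the original design:

```python
COUNTRY_DEFAULT = {"VN": "vi", "US": "en"}  # hypothetical country → locale map

def resolve_locale(user_locale, country, available):
    """Walk the fallback chain: user_locale → country_default → en."""
    for candidate in (user_locale, COUNTRY_DEFAULT.get(country), "en"):
        if candidate and candidate in available:
            return candidate
    return "en"  # final fallback even if "en" is missing from `available`
```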

3.2 Priority Queue Design

Priority Levels

| Priority | Use Cases | SLA | Queue |
|---|---|---|---|
| URGENT | OTP, security alert, payment confirmation | < 3 seconds | Dedicated urgent queue, pre-allocated workers |
| NORMAL | Order update, message notification, social interaction | < 30 seconds | Standard queue |
| LOW | Marketing campaign, weekly digest, recommendations | Best-effort (< 5 min) | Low-priority queue, off-peak delivery |

Implementation: Separate Queues (Not Priority Field)

Why separate Kafka topics instead of a single queue with a priority field?

  1. Resource isolation: the urgent queue has a dedicated worker pool, so a marketing burst cannot delay OTPs
  2. Independent scaling: urgent workers stay always-on (never autoscaled to zero); marketing workers scale to zero off-peak
  3. Different retry policies: urgent gets aggressive retries (3 attempts within 10s); low gets gentle retries (3 attempts within 1 hour)

Kafka topics:
├── notification.push.urgent     (3 partitions, replication=3)
├── notification.push.normal     (6 partitions, replication=3)
├── notification.push.low        (3 partitions, replication=2)
├── notification.email.urgent    (3 partitions, replication=3)
├── notification.email.normal    (6 partitions, replication=3)
├── notification.email.low       (3 partitions, replication=2)
├── notification.sms.urgent      (3 partitions, replication=3)
├── notification.sms.normal      (3 partitions, replication=3)
├── notification.sms.low         (2 partitions, replication=2)
└── notification.dlq             (3 partitions, replication=3)

3.3 Rate Limiting: Fighting Notification Fatigue

Why rate-limit notifications?

  • A user who gets 50 pushes/day uninstalls the app
  • Too much email → ISPs mark it as spam → deliverability drops for the whole domain
  • Too much SMS → costs explode, plus TCPA violations

Rate Limit Tiers

| Tier | Limit | Scope | Example |
|---|---|---|---|
| Per-user per-channel | 10 push/hour, 5 email/day, 2 SMS/day | One specific user | User A receives at most 10 push notifications/hour |
| Per-user total | 30 notifications/day | All channels combined | User A receives at most 30 notifications/day in total |
| Per-type | 1/event_type/hour | Same notification type | Never send “someone liked your post” more than once/hour |
| Global per-channel | 100K email/hour | Whole system | Avoid ISP throttling |

Exception: URGENT priority (OTP, security alerts) bypasses the rate limiter. A user who requests an OTP must receive it immediately; it must never be blocked.

Redis Implementation

# Per-user per-channel rate limit (fixed-window counter, one key per hour)
Key:    rate:{user_id}:{channel}:{window}
Value:  count
TTL:    window duration

# Example: User 12345, push channel, hourly window
Key:    rate:12345:push:2026031810
Value:  7
TTL:    3600

# Check: if value < 10 → allowed, INCR
#         if value >= 10 → rejected, queue for later or drop
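A minimal in-memory sketch of this fixed-window counter; production would use the Redis `INCR` + `EXPIRE` pattern shown above, and the class name here is illustrative:

```python
import time

class FixedWindowLimiter:
    """Fixed-window counter, e.g. 10 push notifications per hour per user."""

    def __init__(self, max_count, window_seconds):
        self.max_count = max_count
        self.window = window_seconds
        self.counters = {}  # (user_id, channel, window_id) → count

    def allow(self, user_id, channel, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)       # e.g. the hour bucket
        key = (user_id, channel, window_id)
        count = self.counters.get(key, 0)
        if count >= self.max_count:
            return False                          # over limit: reject or defer
        self.counters[key] = count + 1            # equivalent of Redis INCR
        return True
```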

3.4 Retry Mechanism & Dead Letter Queue

Retry Strategy: Exponential Backoff

flowchart TD
    A["Worker picks message<br/>from Kafka"] --> B{"Send to Provider"}
    B -->|"Success"| C["Update status = SENT<br/>ACK message"]
    B -->|"Fail (timeout,<br/>5xx, rate limit)"| D{"Retry count<br/>< max_retries?"}
    D -->|"Yes"| E["Calculate backoff:<br/>delay = min(base × 2^retry, max_delay)<br/>+ random jitter"]
    E --> F["Re-publish to same topic<br/>with incremented retry_count<br/>and scheduled_at = now + delay"]
    F --> A
    D -->|"No (exhausted)"| G["Publish to DLQ"]
    G --> H["Alert + Manual Review"]

    B -->|"Fail (4xx client error,<br/>invalid token)"| I["Permanent failure<br/>Update status = FAILED"]
    I --> J["Clean up: remove<br/>invalid device token"]

    style G fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style C fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style I fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff

Retry Configuration per Priority

| Priority | Max Retries | Base Delay | Max Delay | Total Window |
|---|---|---|---|---|
| URGENT | 5 | 1s | 30s | ~1 min |
| NORMAL | 3 | 30s | 5 min | ~15 min |
| LOW | 3 | 5 min | 1 hour | ~3 hours |

Backoff Formula

Jitter really matters: without jitter, all retries land at the same instant, creating a thundering herd against the provider API.
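A sketch of the formula from the retry flowchart, `delay = min(base × 2^retry, max_delay) + random jitter`; the jitter ratio is an assumption, not specified in the design:

```python
import random

def backoff_delay(retry_count, base, max_delay, jitter_ratio=0.1):
    """Exponential backoff with a cap and additive random jitter."""
    delay = min(base * (2 ** retry_count), max_delay)
    jitter = random.uniform(0, delay * jitter_ratio)  # de-synchronize retries
    return delay + jitter

# NORMAL priority (base 30s, cap 5 min): retries land near 30s, 60s, 120s
```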

Dead Letter Queue (DLQ) Processing

Messages in the DLQ are handled by manual review or automated recovery:

  1. Invalid device token → remove the token from Redis, mark the user's device as inactive
  2. Provider outage → bulk re-process once the provider recovers
  3. Permanent failure → log, alert, move to archive

3.5 Push Notification — FCM/APNs Integration

Device Token Management

Each user can have multiple devices, and each device has a device token (registration token) issued by FCM or APNs.

Redis Hash — device tokens per user:
Key:    devices:{user_id}
Fields:
  {device_id_1}: {"token": "fcm_token_abc", "platform": "android", "app_version": "3.2.1", "last_active": "2026-03-17T10:00:00Z"}
  {device_id_2}: {"token": "apns_token_xyz", "platform": "ios", "app_version": "3.1.0", "last_active": "2026-03-18T08:30:00Z"}

Token Lifecycle

| Event | Action |
|---|---|
| User installs app | Register device token → Redis |
| User opens app | Refresh the token (tokens can change) |
| Push fails with InvalidRegistration (FCM) or BadDeviceToken (APNs) | Remove the token from Redis |
| User uninstalls app | Token becomes invalid → detected on the next failed push |
| User logs out | Dissociate the token from the user (don't delete it; re-associate on next login) |

Silent Push (Data-only Push)

Not every push shows a notification banner. Silent push is used to:

  • Trigger background data sync
  • Update badge count
  • Invalidate local cache
// FCM silent push payload
{
  "to": "device_token",
  "data": {
    "type": "badge_update",
    "unread_count": 5
  }
  // Không có "notification" key → silent push
}

3.6 Email — Deliverability & Integration

Email Architecture

Notification Service → Email Worker → Email Sending Service (SES/SendGrid)
                                          ↓
                                    SMTP Relay / API
                                          ↓
                                    Recipient ISP (Gmail, Outlook, Yahoo)
                                          ↓
                                    User's Inbox (or Spam folder!)

Deliverability Essentials — SPF, DKIM, DMARC

| Protocol | Purpose | DNS Record |
|---|---|---|
| SPF (Sender Policy Framework) | Declares which IPs may send email for the domain | TXT "v=spf1 include:amazonses.com -all" |
| DKIM (DomainKeys Identified Mail) | Signs email with a private key; receivers verify via the public key published in DNS | TXT record containing the DKIM public key |
| DMARC (Domain-based Message Auth) | Policy when SPF/DKIM fail: none, quarantine, reject | TXT "v=DMARC1; p=reject; rua=mailto:[email protected]" |

Aha Moment #2: without correct SPF + DKIM + DMARC setup, notification emails land in the spam folder. And once an ISP blacklists your domain, recovery takes weeks. This is why email deliverability is a skill of its own.

Bounce Handling

| Bounce Type | Example | Action |
|---|---|---|
| Hard bounce | Email address does not exist (550) | Remove the email from the user profile; NEVER send to it again |
| Soft bounce | Mailbox full (452), temporary server error | Retry 3 times, then mark as soft-bounced |
| Complaint | User clicks “Report Spam” | Unsubscribe immediately, log the complaint |

ISPs track both bounce rate and complaint rate. If the bounce rate exceeds 5% or the complaint rate exceeds 0.1%, the domain gets throttled or blacklisted.

SES/SendGrid Integration

  • Use a dedicated IP instead of a shared IP (control your own reputation)
  • IP warming: ramp a new IP up slowly (100 emails/day, increasing to full volume over 4-6 weeks)
  • Feedback loop: SES SNS notifications for bounces/complaints → auto-update the user preference store

3.7 SMS — Cost Optimization

SMS Pricing Reality

| Route | Price/SMS (USD) | Cost at 50M SMS/day |
|---|---|---|
| US (short code) | $0.0075 | $375,000/day |
| US (long code) | $0.0050 | $250,000/day |
| Vietnam | $0.04 | $2,000,000/day |
| India | $0.003 | $150,000/day |

Aha Moment #3: SMS is the most expensive channel, costing thousands of times more than push notifications (which are nearly free). That is why SMS should be reserved for high-value notifications (OTP, security, payment).

Cost Optimization Strategies

  1. Channel preference cascade: push first → email → SMS (only if urgent and the user has no push token)
  2. Smart routing: use short codes in the US when throughput matters; long codes ($0.0050) are cheaper but slower
  3. SMS content optimization: GSM-7 encoding (160 chars/segment) instead of Unicode (70 chars/segment)
  4. Batching: aggregate similar SMS (one combined SMS instead of 3 separate ones)
  5. Geographic routing: use a local provider per region (Twilio US, MessageBird EU, a local provider in VN)

3.8 Deduplication — Exactly-Once Delivery

Why dedup?

  • Network failure after a successful send → the producer retries → the user receives it twice
  • A Kafka consumer crashes after sending but before committing its offset → reprocessing → duplicate
  • A marketing campaign accidentally triggered twice → users receive a double email

Idempotency Key Design

Idempotency key = hash(event_source + event_id + user_id + channel + notification_type)

Example:
  event_source: "order_service"
  event_id: "order_12345_shipped"
  user_id: "user_67890"
  channel: "push"
  notification_type: "order_shipped"

  Key: SHA256("order_service:order_12345_shipped:user_67890:push:order_shipped")
       = "a1b2c3d4..."
Redis SET with TTL:
  SETNX dedup:{idempotency_key} 1 EX 86400  (24h TTL)
  → If OK (key didn't exist) → process notification
  → If nil (key already exists) → skip (duplicate)

Trade-off: too short a TTL misses duplicates; too long a TTL bloats Redis memory. 24h is the sweet spot for most use cases.

3.9 Analytics — Measuring Notification Effectiveness

Metrics Pipeline

Notification sent → Delivery callback → Open/Read event → Click event → Conversion event
     ↓                    ↓                   ↓                ↓              ↓
   SENT               DELIVERED            OPENED           CLICKED       CONVERTED

Key Metrics

| Metric | Formula | Target | How Measured |
|---|---|---|---|
| Delivery Rate | delivered / sent | > 95% (push), > 98% (email) | Provider callback |
| Open Rate | opened / delivered | > 15% (push), > 20% (email) | Tracking pixel (email), app-open event (push) |
| Click-through Rate (CTR) | clicked / opened | > 5% | Redirect URL tracking |
| Opt-out Rate | unsubscribed / delivered | < 0.5% | Unsubscribe link clicks |
| Conversion Rate | converted / clicked | Varies by campaign | Deep-link attribution |

A/B Testing

  • Split users into buckets (control vs variant)
  • Test: notification title, body, send time, channel
  • Track metrics per variant → pick winner
  • Example: “Đơn hàng đã gửi!” (“Order shipped!”) vs “Đơn hàng đang trên đường đến bạn!” (“Your order is on its way!”) → which has the higher open rate?
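Bucket assignment must be deterministic so a user always sees the same variant across sends. One common sketch hashes the user id together with the experiment name; the function and names here are illustrative:

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "variant")):
    """Stable hash of (experiment, user) → bucket index → variant name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)  # uniform over variants
    return variants[bucket]
```

Because the assignment is a pure function of the inputs, no per-user bucket table is needed; any worker computes the same answer.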

3.10 User Preference Store

Redis Hash Design

Key:    prefs:{user_id}
Fields:
  push_enabled:           "true"
  email_enabled:          "true"
  sms_enabled:            "false"
  push_marketing:         "false"    # opt-out marketing push
  push_social:            "true"     # opt-in social push
  push_transactional:     "true"     # opt-in transactional push
  email_marketing:        "false"
  email_transactional:    "true"
  quiet_hours_start:      "22:00"    # local time
  quiet_hours_end:        "08:00"
  timezone:               "Asia/Ho_Chi_Minh"
  locale:                 "vi"
Compliance Requirements

| Regulation | Requirement | Implementation |
|---|---|---|
| CAN-SPAM (US, email) | Unsubscribe link in every email, processed within 10 days | List-Unsubscribe header, one-click unsubscribe |
| GDPR (EU) | Explicit consent, right to erasure | Double opt-in, preference deletion API |
| TCPA (US, SMS) | Explicit written consent before marketing SMS | Consent record + timestamp in the DB |

3.11 Batch Notification — Aggregation

Problem

10 people like a user's post within 5 minutes → send 10 notifications? No!

Solution: Event Aggregation

Window: 5 minutes
Events:
  10:00 - Alice liked your post
  10:01 - Bob liked your post
  10:03 - Charlie liked your post

Aggregated notification (at 10:05):
  "Alice, Bob, Charlie liked your post"

If > 3 people:
  "Alice, Bob, and 8 others liked your post"

Implementation

  • Tumbling window (fixed 5-min intervals): Simple, predictable
  • Session window (gap-based): Send when no new events for 2 minutes
  • Storage: Redis Sorted Set per user per event type, score = timestamp
  • Aggregation worker runs every window interval, collects events, sends single notification
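The formatting step of the aggregation worker can be sketched as follows. In production the event list would come from the Redis Sorted Set described above (e.g. a ZRANGEBYSCORE over the window); here a plain list stands in, and the function name is illustrative:

```python
def aggregate_likes(names, max_named=3):
    """Collapse N like-events in one window into a single notification line."""
    if len(names) <= max_named:
        return f"{', '.join(names)} liked your post"
    shown = names[:2]                       # name the first two actors
    rest = len(names) - 2                   # summarize everyone else
    return f"{', '.join(shown)}, and {rest} others liked your post"
```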

3.12 Timezone-Aware Delivery

Problem

Marketing notification scheduled for 9:00 AM — but user is in which timezone?

Solution

User timezone: "Asia/Ho_Chi_Minh" (UTC+7)
Scheduled send: 9:00 AM local time

Server calculation:
  UTC time = 9:00 AM - 7h = 2:00 AM UTC

For user in "America/New_York" (UTC-4):
  UTC time = 9:00 AM + 4h = 1:00 PM UTC

  • Quiet hours: do not send between 22:00 and 08:00 local time (buffer in the queue, send at 08:00)
  • Exception: URGENT priority ignores quiet hours (an OTP or security alert at 3 AM is fine)
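The offset arithmetic above is safest done with the standard zoneinfo library, which is DST-aware (New York is UTC-4 only during daylight time); the function name is illustrative:

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

def utc_send_time(day, local_hour, tz_name):
    """Convert a 'send at HH:00 local time' schedule to a UTC instant."""
    local = datetime.combine(day, time(local_hour, 0), tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)

hcm = utc_send_time(date(2026, 3, 18), 9, "Asia/Ho_Chi_Minh")  # 02:00 UTC
nyc = utc_send_time(date(2026, 3, 18), 9, "America/New_York")  # 13:00 UTC (EDT)
```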

4. Security — Notification-Specific Threats

4.1 Notification Spoofing Prevention

Threat: an attacker sends fake notification requests impersonating an internal service.

Mitigations:

  • Service-to-service authentication: mTLS or JWT carrying a service identity
  • API key per producer: each service gets its own rate-limited API key
  • Request signing: an HMAC signature over the payload, verified by the Notification Service before processing
  • Allow-list: only registered event types from registered producer services are accepted
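The request-signing idea can be sketched with the standard hmac module; the secret handling here is illustrative (a real deployment would issue a distinct secret per producer via a secret manager):

```python
import hashlib
import hmac

SHARED_SECRET = b"per-service-secret"  # illustrative; never hard-code in production

def sign(payload: bytes) -> str:
    """Producer side: HMAC-SHA256 signature sent alongside the payload."""
    return hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str) -> bool:
    """Notification Service side: constant-time compare to resist timing attacks."""
    return hmac.compare_digest(sign(payload), signature)
```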

4.2 Phishing Detection in Email

Threat: a compromised internal service sends phishing email through the notification system.

Mitigations:

  • Template-only policy: email content MUST use pre-approved templates; no arbitrary HTML allowed
  • URL allowlist: links in email may only point to approved domains
  • Content scanning: scan email content for known phishing patterns before sending
  • DMARC enforcement: p=reject → email spoofing the domain gets rejected by ISPs

4.3 PII in Notifications

Threat: the push notification preview on the lock screen leaks sensitive information.

Mitigations:

  • Content masking: show “You received a transfer of ****5,000,000 VND” instead of the full details
  • Hidden preview: iOS/Android can hide notification content on the lock screen
  • Encrypted push payload: encrypt the payload; the app decrypts after the user unlocks the device

4.4 Encrypted Push Payload

// Standard push (visible on lock screen):
{
  "title": "Chuyển khoản thành công",
  "body": "Bạn nhận 5,000,000 VND từ Nguyễn Văn A"  ← PII leak!
}

// Encrypted push:
{
  "data": {
    "encrypted_payload": "aes256gcm:iv:ciphertext:tag",
    "key_id": "user_key_v3"
  }
  // App decrypts after biometric auth
}

5. DevOps — Production Operations

5.1 Kafka Configuration for Notification Ingestion

# kafka-notification-topics.yml
topics:
  - name: notification.push.urgent
    partitions: 3
    replication_factor: 3
    config:
      retention.ms: 86400000        # 24h
      min.insync.replicas: 2        # Strong durability for urgent
      max.message.bytes: 1048576    # 1MB max
 
  - name: notification.push.normal
    partitions: 6
    replication_factor: 3
    config:
      retention.ms: 259200000       # 72h
      min.insync.replicas: 2
 
  - name: notification.email.normal
    partitions: 6
    replication_factor: 3
    config:
      retention.ms: 259200000
      min.insync.replicas: 2
 
  - name: notification.sms.urgent
    partitions: 3
    replication_factor: 3
    config:
      retention.ms: 86400000
      min.insync.replicas: 2
 
  - name: notification.dlq
    partitions: 3
    replication_factor: 3
    config:
      retention.ms: 604800000       # 7 days — manual review window
      min.insync.replicas: 2

5.2 Worker Autoscaling

# k8s-hpa-push-workers.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: push-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: push-worker
  minReplicas: 20          # Always-on baseline
  maxReplicas: 200         # Peak capacity
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumergroup_lag
          selector:
            matchLabels:
              topic: notification.push.normal
        target:
          type: AverageValue
          averageValue: "1000"    # Scale up when lag > 1000 per pod
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # React fast to burst
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Slow scale-down to avoid flapping
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120

5.3 Monitoring & Alerting

# prometheus-notification-alerts.yml
groups:
  - name: notification_delivery
    rules:
      - alert: DeliveryRateDropped
        expr: |
          (
            sum(rate(notification_delivered_total[5m]))
            /
            sum(rate(notification_sent_total[5m]))
          ) < 0.90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Notification delivery rate dropped below 90%"
          description: "Current delivery rate: {{ $value | humanizePercentage }}"
 
      - alert: UrgentQueueLagHigh
        expr: kafka_consumergroup_lag{topic=~"notification.*.urgent"} > 5000
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Urgent notification queue lag is high ({{ $value }})"
 
      - alert: DLQGrowing
        expr: rate(kafka_topic_messages_in_total{topic="notification.dlq"}[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Dead letter queue receiving {{ $value }} msg/s"
 
      - alert: SMSCostSpike
        expr: |
          sum(rate(sms_sent_total[1h])) * 0.0075 * 24 > 100000
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Projected daily SMS cost exceeds $100K"
 
      - alert: EmailBounceRateHigh
        expr: |
          sum(rate(email_bounce_total{type="hard"}[1h]))
          /
          sum(rate(email_sent_total[1h]))
          > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Email hard bounce rate exceeds 5% — risk of domain blacklist"
 
      - alert: PushTokenInvalidRate
        expr: |
          sum(rate(push_token_invalid_total[1h]))
          /
          sum(rate(push_sent_total[1h]))
          > 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "10%+ push tokens are invalid — device token cleanup needed"

5.4 Grafana Dashboard Panels

| Panel | PromQL | Threshold |
|---|---|---|
| Notifications/sec by channel | sum(rate(notification_sent_total[1m])) by (channel) | Compare vs estimation |
| Delivery rate by channel | delivered / sent by channel | Push > 95%, Email > 98% |
| P99 notification latency | histogram_quantile(0.99, notification_e2e_duration_seconds_bucket) | Urgent < 3s |
| Kafka consumer lag | kafka_consumergroup_lag by topic | Urgent < 1K, Normal < 100K |
| DLQ message count | kafka_topic_partition_current_offset{topic="notification.dlq"} | < 1000 |
| SMS daily cost projection | sum(sms_sent_total) * cost_per_sms * (86400/elapsed) | < $50K/day |
| Email bounce rate | hard_bounce / sent | < 2% |
| Opt-out rate (rolling 7d) | unsubscribe_total / delivered_total | < 0.5% |

5.5 SES/SendGrid Operational Notes

  • Dedicated IP pool: separate IPs for transactional vs marketing email (separate reputations)
  • IP warming schedule: Day 1: 200 emails → Day 7: 10K → Day 14: 50K → Day 30: full volume
  • Suppression list: SES auto-maintains bounce/complaint list — sync to internal preference store
  • Sending quota: SES default = 200 emails/sec. Request increase gradually.
  • Dashboard: monitor the SES console for sending statistics, bounce/complaint rates, and the reputation dashboard

6. Code Examples

6.1 Python: Notification Service with Priority Queue

"""
Notification Service — Core processing pipeline
Handles validation, deduplication, preference check, template rendering,
rate limiting, and queue routing.
"""
 
import hashlib
import json
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
 
import redis
from confluent_kafka import Producer
 
 
class Channel(Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"
 
 
class Priority(Enum):
    URGENT = "urgent"
    NORMAL = "normal"
    LOW = "low"
 
 
@dataclass
class NotificationRequest:
    event_source: str
    event_id: str
    user_id: str
    channels: list[Channel]
    notification_type: str
    priority: Priority
    template_params: dict
    idempotency_key: Optional[str] = None
    scheduled_at: Optional[float] = None
 
    def __post_init__(self):
        if not self.idempotency_key:
            # Auto-generate idempotency key
            raw = f"{self.event_source}:{self.event_id}:{self.user_id}"
            self.idempotency_key = hashlib.sha256(raw.encode()).hexdigest()
 
 
@dataclass
class NotificationMessage:
    """Message published to Kafka after processing."""
    notification_id: str
    user_id: str
    channel: Channel
    priority: Priority
    rendered_content: dict
    device_tokens: list[str] = field(default_factory=list)
    email_address: str = ""
    phone_number: str = ""
    retry_count: int = 0
    created_at: float = field(default_factory=time.time)
 
 
class NotificationService:
    """
    Core notification processing pipeline.
 
    Flow: Validate → Dedup → Check Preferences → Render Template
          → Rate Limit → Route to Queue
    """
 
    DEDUP_TTL = 86400  # 24 hours
    RATE_LIMITS = {
        Channel.PUSH: {"window": 3600, "max_count": 10},    # 10/hour
        Channel.EMAIL: {"window": 86400, "max_count": 5},    # 5/day
        Channel.SMS: {"window": 86400, "max_count": 2},      # 2/day
    }
 
    def __init__(self, redis_client: redis.Redis, kafka_producer: Producer):
        self.redis = redis_client
        self.producer = kafka_producer
 
    def process(self, request: NotificationRequest) -> dict:
        """Main entry point — process a notification request."""
        results = {}
 
        # Step 1: Deduplication check
        if self._is_duplicate(request.idempotency_key):
            return {"status": "skipped", "reason": "duplicate"}
 
        for channel in request.channels:
            result = self._process_channel(request, channel)
            results[channel.value] = result
 
        return {"status": "processed", "channels": results}
 
    def _is_duplicate(self, idempotency_key: str) -> bool:
        """Check Redis for duplicate using SETNX."""
        key = f"dedup:{idempotency_key}"
        was_set = self.redis.set(key, 1, nx=True, ex=self.DEDUP_TTL)
        return not was_set  # None = key existed = duplicate
 
    def _process_channel(self, request: NotificationRequest, channel: Channel) -> dict:
        # Step 2: Check user preferences
        if not self._check_preference(request.user_id, channel, request.notification_type):
            return {"status": "skipped", "reason": "user_opted_out"}
 
        # Step 3: Check quiet hours (skip for URGENT)
        if request.priority != Priority.URGENT:
            if self._in_quiet_hours(request.user_id):
                return {"status": "deferred", "reason": "quiet_hours"}
 
        # Step 4: Render template
        locale = self._get_user_locale(request.user_id)
        content = self._render_template(
            request.notification_type, channel, locale, request.template_params
        )
 
        # Step 5: Rate limit check (URGENT bypasses)
        if request.priority != Priority.URGENT:
            if not self._check_rate_limit(request.user_id, channel):
                return {"status": "rate_limited", "reason": "too_many_notifications"}
 
        # Step 6: Build message and publish to Kafka
        message = NotificationMessage(
            notification_id=f"{request.idempotency_key}:{channel.value}",
            user_id=request.user_id,
            channel=channel,
            priority=request.priority,
            rendered_content=content,
        )
 
        # Enrich with delivery target
        if channel == Channel.PUSH:
            message.device_tokens = self._get_device_tokens(request.user_id)
            if not message.device_tokens:
                return {"status": "skipped", "reason": "no_device_tokens"}
        elif channel == Channel.EMAIL:
            message.email_address = self._get_email(request.user_id)
        elif channel == Channel.SMS:
            message.phone_number = self._get_phone(request.user_id)
 
        topic = f"notification.{channel.value}.{request.priority.value}"
        self._publish(topic, message)
 
        return {"status": "queued", "topic": topic}
 
    def _check_preference(self, user_id: str, channel: Channel, notif_type: str) -> bool:
        """Check user preference from Redis hash."""
        prefs = self.redis.hgetall(f"prefs:{user_id}")
        if not prefs:
            return True  # Default: all enabled
 
        # Check channel-level opt-out.
        # hgetall returns bytes keys, so encode before lookup.
        channel_key = f"{channel.value}_enabled".encode()
        if prefs.get(channel_key, b"true") == b"false":
            return False

        # Check type-level opt-out (e.g., push_marketing)
        type_key = f"{channel.value}_{notif_type}".encode()
        if prefs.get(type_key, b"true") == b"false":
            return False
 
        return True
 
    def _in_quiet_hours(self, user_id: str) -> bool:
        """Check if current time is within user's quiet hours."""
        prefs = self.redis.hgetall(f"prefs:{user_id}")
        quiet_start = prefs.get(b"quiet_hours_start", b"").decode()
        quiet_end = prefs.get(b"quiet_hours_end", b"").decode()
        if not quiet_start or not quiet_end:
            return False
 
        tz = prefs.get(b"timezone", b"UTC").decode()
        # In production: convert current UTC time to user's local time
        # and check if within [quiet_start, quiet_end]
        # Simplified here for brevity
        return False
 
    def _check_rate_limit(self, user_id: str, channel: Channel) -> bool:
        """Sliding window rate limit using Redis INCR + TTL."""
        config = self.RATE_LIMITS[channel]
        window = int(time.time()) // config["window"]
        key = f"rate:{user_id}:{channel.value}:{window}"
 
        count = self.redis.incr(key)
        if count == 1:
            self.redis.expire(key, config["window"])
 
        return count <= config["max_count"]
 
    def _get_user_locale(self, user_id: str) -> str:
        locale = self.redis.hget(f"prefs:{user_id}", "locale")
        return locale.decode() if locale else "en"
 
    def _render_template(
        self, notif_type: str, channel: Channel, locale: str, params: dict
    ) -> dict:
        """
        Render notification template with parameters.
        In production: load template from S3/cache, use Jinja2/Handlebars.
        """
        # Simplified — in production this loads from template store
        templates = {
            ("order_shipped", Channel.PUSH, "vi"): {
                "title": "Don hang da gui!",
                "body": "Don #{order_id} dang tren duong den ban.",
            },
            ("order_shipped", Channel.PUSH, "en"): {
                "title": "Order Shipped!",
                "body": "Order #{order_id} is on its way.",
            },
        }
        template = templates.get((notif_type, channel, locale), {})
        rendered = {}
        for key, value in template.items():
            for param_name, param_value in params.items():
                value = value.replace(f"{{{param_name}}}", str(param_value))
            rendered[key] = value
        return rendered
 
    def _get_device_tokens(self, user_id: str) -> list[str]:
        devices = self.redis.hgetall(f"devices:{user_id}")
        tokens = []
        for device_id, device_data in devices.items():
            data = json.loads(device_data)
            tokens.append(data["token"])
        return tokens
 
    def _get_email(self, user_id: str) -> str:
        value = self.redis.hget(f"user:{user_id}", "email")
        return value.decode() if value else ""

    def _get_phone(self, user_id: str) -> str:
        value = self.redis.hget(f"user:{user_id}", "phone")
        return value.decode() if value else ""
 
    def _publish(self, topic: str, message: NotificationMessage):
        payload = json.dumps({
            "notification_id": message.notification_id,
            "user_id": message.user_id,
            "channel": message.channel.value,
            "priority": message.priority.value,
            "content": message.rendered_content,
            "device_tokens": message.device_tokens,
            "email_address": message.email_address,
            "phone_number": message.phone_number,
            "retry_count": message.retry_count,
            "created_at": message.created_at,
        })
        self.producer.produce(topic, value=payload.encode("utf-8"))
        self.producer.flush()
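The SETNX-based dedup in `_is_duplicate` can be exercised without a live Redis. The `FakeRedis` stub below is illustrative only; it mimics just the one redis-py call the dedup path needs:

```python
import hashlib
import time

class FakeRedis:
    """In-memory stand-in for redis-py's set(nx=..., ex=...)."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set(self, key, value, nx=False, ex=None):
        now = time.time()
        entry = self.store.get(key)
        if entry and entry[1] > now and nx:
            return None  # key exists -> SETNX fails, like redis-py
        self.store[key] = (value, now + (ex or float("inf")))
        return True

def is_duplicate(r, idempotency_key: str, ttl: int = 86400) -> bool:
    """Same logic as NotificationService._is_duplicate."""
    return not r.set(f"dedup:{idempotency_key}", 1, nx=True, ex=ttl)

r = FakeRedis()
key = hashlib.sha256(b"order-service:evt-42:user-7").hexdigest()
first = is_duplicate(r, key)   # False -- first delivery goes through
second = is_duplicate(r, key)  # True  -- producer retry is filtered
```

Because the key is derived from `event_source:event_id:user_id`, any upstream retry of the same event maps to the same key and is dropped inside the TTL window.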

6.2 Python: FCM Push Notification Sender

"""
Push Notification Worker — Consumes from Kafka, sends via FCM/APNs.
Handles retries with exponential backoff.
"""
 
import json
import time
import random
import logging
from dataclasses import dataclass
 
import requests
from confluent_kafka import Consumer, Producer
 
logger = logging.getLogger(__name__)
 
 
@dataclass
class RetryConfig:
    max_retries: int
    base_delay_seconds: float
    max_delay_seconds: float
 
 
RETRY_CONFIGS = {
    "urgent": RetryConfig(max_retries=5, base_delay_seconds=1, max_delay_seconds=30),
    "normal": RetryConfig(max_retries=3, base_delay_seconds=30, max_delay_seconds=300),
    "low": RetryConfig(max_retries=3, base_delay_seconds=300, max_delay_seconds=3600),
}
 
 
class FCMPushSender:
    """Send push notifications via Firebase Cloud Messaging HTTP v1 API."""
 
    FCM_URL = "https://fcm.googleapis.com/v1/projects/{project_id}/messages:send"
 
    def __init__(self, project_id: str, service_account_token: str):
        self.url = self.FCM_URL.format(project_id=project_id)
        # NOTE: FCM v1 uses short-lived OAuth2 access tokens (~1 hour).
        # In production, refresh via google-auth rather than passing a static string.
        self.headers = {
            "Authorization": f"Bearer {service_account_token}",
            "Content-Type": "application/json",
        }
 
    def send(self, device_token: str, title: str, body: str, data: dict = None) -> dict:
        """
        Send push notification to a single device via FCM.
        Returns: {"success": True, "message_id": "..."} or {"success": False, "error": "..."}
        """
        payload = {
            "message": {
                "token": device_token,
                "notification": {
                    "title": title,
                    "body": body,
                },
            }
        }
        if data:
            payload["message"]["data"] = {k: str(v) for k, v in data.items()}
 
        try:
            response = requests.post(
                self.url, headers=self.headers, json=payload, timeout=5
            )
 
            if response.status_code == 200:
                return {"success": True, "message_id": response.json().get("name")}
            elif response.status_code == 404:
                # Invalid device token — permanent failure
                return {
                    "success": False,
                    "error": "INVALID_TOKEN",
                    "retriable": False,
                }
            elif response.status_code == 429:
                # Rate limited by FCM
                return {
                    "success": False,
                    "error": "RATE_LIMITED",
                    "retriable": True,
                }
            else:
                return {
                    "success": False,
                    "error": f"HTTP_{response.status_code}",
                    "retriable": response.status_code >= 500,
                }
 
        except requests.exceptions.Timeout:
            return {"success": False, "error": "TIMEOUT", "retriable": True}
        except requests.exceptions.ConnectionError:
            return {"success": False, "error": "CONNECTION_ERROR", "retriable": True}
 
 
class PushWorker:
    """
    Kafka consumer worker that processes push notification messages.
    Handles retry with exponential backoff and DLQ routing.
    """
 
    def __init__(
        self,
        consumer: Consumer,
        producer: Producer,
        fcm_sender: FCMPushSender,
        redis_client,
    ):
        self.consumer = consumer
        self.producer = producer
        self.fcm = fcm_sender
        self.redis = redis_client
 
    def run(self):
        """Main consumer loop."""
        topics = [
            "notification.push.urgent",
            "notification.push.normal",
            "notification.push.low",
        ]
        self.consumer.subscribe(topics)
        logger.info(f"Push worker subscribed to {topics}")
 
        while True:
            msg = self.consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                logger.error(f"Consumer error: {msg.error()}")
                continue
 
            try:
                self._process_message(json.loads(msg.value().decode("utf-8")))
                self.consumer.commit(msg)
            except Exception as e:
                logger.exception(f"Failed to process message: {e}")
 
    def _process_message(self, message: dict):
        """Process a single notification message."""
        priority = message["priority"]
        retry_count = message.get("retry_count", 0)
        retry_config = RETRY_CONFIGS[priority]
 
        for token in message["device_tokens"]:
            result = self.fcm.send(
                device_token=token,
                title=message["content"].get("title", ""),
                body=message["content"].get("body", ""),
                data={"notification_id": message["notification_id"]},
            )
 
            if result["success"]:
                logger.info(
                    f"Push sent: {message['notification_id']} -> {token[:20]}..."
                )
                self._update_status(message["notification_id"], "SENT")
            elif not result.get("retriable", False):
                # Permanent failure — e.g., invalid token
                logger.warning(
                    f"Permanent failure for token {token[:20]}: {result['error']}"
                )
                self._remove_invalid_token(message["user_id"], token)
                self._update_status(message["notification_id"], "FAILED")
            elif retry_count < retry_config.max_retries:
                # Retriable failure — schedule retry
                delay = self._calculate_backoff(
                    retry_count, retry_config.base_delay_seconds,
                    retry_config.max_delay_seconds,
                )
                logger.info(
                    f"Scheduling retry #{retry_count + 1} in {delay:.1f}s "
                    f"for {message['notification_id']}"
                )
                message["retry_count"] = retry_count + 1
                message["scheduled_at"] = time.time() + delay
                topic = f"notification.push.{priority}"
                self.producer.produce(
                    topic, value=json.dumps(message).encode("utf-8")
                )
                self.producer.flush()
            else:
                # Exhausted retries — send to DLQ
                logger.error(
                    f"Max retries exhausted for {message['notification_id']}"
                )
                self.producer.produce(
                    "notification.dlq",
                    value=json.dumps(message).encode("utf-8"),
                )
                self.producer.flush()
                self._update_status(message["notification_id"], "DLQ")
 
    @staticmethod
    def _calculate_backoff(
        retry_count: int, base_delay: float, max_delay: float
    ) -> float:
        """Exponential backoff with jitter."""
        delay = min(base_delay * (2 ** retry_count), max_delay)
        jitter = random.uniform(0, 1.0)  # Up to 1 second jitter
        return delay + jitter
 
    def _remove_invalid_token(self, user_id: str, token: str):
        """Remove invalid device token from Redis."""
        devices = self.redis.hgetall(f"devices:{user_id}")
        for device_id, device_data in devices.items():
            data = json.loads(device_data)
            if data.get("token") == token:
                self.redis.hdel(f"devices:{user_id}", device_id)
                logger.info(f"Removed invalid token for user {user_id}")
                break
 
    def _update_status(self, notification_id: str, status: str):
        """Update notification status in log (Cassandra in production)."""
        # In production: write to Cassandra notification_log table
        logger.info(f"Status update: {notification_id}{status}")

6.3 Python: Email Sender with Retry and Bounce Handling

"""
Email Worker — Sends emails via Amazon SES with bounce handling.
"""
 
import json
import logging
from dataclasses import dataclass
 
import boto3
from botocore.exceptions import ClientError
 
logger = logging.getLogger(__name__)
 
 
@dataclass
class EmailResult:
    success: bool
    message_id: str = ""
    error: str = ""
    error_type: str = ""  # "hard_bounce", "soft_bounce", "throttle", "other"
    retriable: bool = False
 
 
class SESEmailSender:
    """Send emails via Amazon SES with proper error categorization."""
 
    def __init__(self, region: str = "us-east-1", config_set: str = "notification"):
        self.client = boto3.client("ses", region_name=region)
        self.config_set = config_set
 
    def send(
        self,
        to_address: str,
        subject: str,
        html_body: str,
        from_address: str = "[email protected]",
        reply_to: str = None,
        headers: dict = None,
    ) -> EmailResult:
        """Send a single email via SES."""
        try:
            message = {
                "Subject": {"Data": subject, "Charset": "UTF-8"},
                "Body": {
                    "Html": {"Data": html_body, "Charset": "UTF-8"},
                },
            }
 
            kwargs = {
                "Source": from_address,
                "Destination": {"ToAddresses": [to_address]},
                "Message": message,
                "ConfigurationSetName": self.config_set,
            }
 
            if reply_to:
                kwargs["ReplyToAddresses"] = [reply_to]
 
            # CAN-SPAM requires a List-Unsubscribe header, but the classic SES
            # send_email API cannot set custom headers. We record the URL as a
            # message tag here; production code should use send_raw_email
            # (or SESv2) to emit the actual header.
            if headers and "List-Unsubscribe" in headers:
                kwargs["Tags"] = [
                    {"Name": "unsubscribe_url", "Value": headers["List-Unsubscribe"]}
                ]
 
            response = self.client.send_email(**kwargs)
 
            return EmailResult(
                success=True,
                message_id=response["MessageId"],
            )
 
        except ClientError as e:
            error_code = e.response["Error"]["Code"]
 
            if error_code == "MessageRejected":
                # Typically: email address on suppression list
                return EmailResult(
                    success=False,
                    error=str(e),
                    error_type="hard_bounce",
                    retriable=False,
                )
            elif error_code == "Throttling":
                return EmailResult(
                    success=False,
                    error="SES rate limit exceeded",
                    error_type="throttle",
                    retriable=True,
                )
            elif error_code == "AccountSendingPausedException":
                # SES suspended sending — critical alert needed
                logger.critical("SES account sending is paused!")
                return EmailResult(
                    success=False,
                    error="SES sending paused",
                    error_type="other",
                    retriable=False,
                )
            else:
                return EmailResult(
                    success=False,
                    error=str(e),
                    error_type="other",
                    retriable=True,
                )
 
        except Exception as e:
            return EmailResult(
                success=False,
                error=str(e),
                error_type="other",
                retriable=True,
            )
 
 
class BounceHandler:
    """
    Process SES bounce/complaint notifications via SNS webhook.
    Updates user preferences to prevent future sends to invalid addresses.
    """
 
    def __init__(self, redis_client):
        self.redis = redis_client
 
    def handle_sns_notification(self, sns_message: dict):
        """Process SNS notification from SES feedback."""
        notif_type = sns_message.get("notificationType")
 
        if notif_type == "Bounce":
            self._handle_bounce(sns_message["bounce"])
        elif notif_type == "Complaint":
            self._handle_complaint(sns_message["complaint"])
        elif notif_type == "Delivery":
            self._handle_delivery(sns_message["delivery"])
 
    def _handle_bounce(self, bounce: dict):
        bounce_type = bounce["bounceType"]  # "Permanent" or "Transient"
 
        for recipient in bounce["bouncedRecipients"]:
            email = recipient["emailAddress"]
 
            if bounce_type == "Permanent":
                # Hard bounce — disable email for this user
                logger.warning(f"Hard bounce for {email} — disabling email")
                self._disable_email_for_user(email)
            else:
                # Soft bounce — log, but don't disable yet
                logger.info(f"Soft bounce for {email} — monitoring")
 
    def _handle_complaint(self, complaint: dict):
        """User clicked 'Report Spam' — immediately unsubscribe."""
        for recipient in complaint["complainedRecipients"]:
            email = recipient["emailAddress"]
            logger.warning(f"Spam complaint from {email} — unsubscribing immediately")
            self._disable_email_for_user(email)
 
    def _handle_delivery(self, delivery: dict):
        """Successful delivery confirmation."""
        for recipient in delivery["recipients"]:
            logger.info(f"Email delivered to {recipient}")
 
    def _disable_email_for_user(self, email: str):
        """
        Find user by email and disable email notifications.
        In production: lookup user by email in DB, then update Redis prefs.
        """
        # Simplified — in production: DB lookup
        # user_id = db.query("SELECT id FROM users WHERE email = ?", email)
        # self.redis.hset(f"prefs:{user_id}", "email_enabled", "false")
        pass

7. Aha Moments — Summary

#1 — Template versioning: when you update a template, notifications already sitting in the queue still reference the old version. Always embed template_version in the message; otherwise users can receive notifications in the wrong format when a template changes mid-flight.
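A minimal sketch of the version-pinning idea; `TEMPLATE_STORE`, `ACTIVE_VERSION`, and the template strings are hypothetical names for illustration:

```python
# Hypothetical template store keyed by (type, version)
TEMPLATE_STORE = {
    ("order_shipped", 1): "Order #{order_id} shipped",
    ("order_shipped", 2): "Good news! Order #{order_id} is on its way",
}
ACTIVE_VERSION = {"order_shipped": 2}

def build_message(notif_type: str, params: dict) -> dict:
    # Pin the version ONCE, at enqueue time
    return {"type": notif_type,
            "template_version": ACTIVE_VERSION[notif_type],
            "params": params}

def render(message: dict) -> str:
    # The worker renders with the pinned version, even if ACTIVE_VERSION moved on
    out = TEMPLATE_STORE[(message["type"], message["template_version"])]
    for k, v in message["params"].items():
        out = out.replace(f"{{{k}}}", str(v))
    return out

msg = build_message("order_shipped", {"order_id": 99})
ACTIVE_VERSION["order_shipped"] = 1   # template "changes" while msg is queued
rendered = render(msg)                # still rendered with version 2
```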

#2 — Email deliverability is an ecosystem of its own: SPF + DKIM + DMARC + dedicated IPs + IP warming + bounce handling + complaint handling. Skip any one step and email lands in spam; recovering from a blacklist takes weeks.

#3 — SMS is the most expensive channel: push is nearly free (FCM/APNs do not charge), while SMS costs about $0.04/message. At 50M SMS/day that is roughly $2M/day. This is why the preference cascade (push > email > SMS) is mandatory.

#4 — Deduplication window trade-off: a TTL that is too short misses duplicates and annoys users; one that is too long bloats Redis memory. A 24h window with SHA-256 keys (32 bytes) for 500M notifications is ~16 GB of Redis, which is acceptable.

#5 — Separate queues per priority: with a single queue plus a priority field, a marketing blast will delay OTPs. Separate queues with dedicated worker pools give resource isolation, so urgent notifications are never affected by bulk sends.

#6 — Quiet hours + timezone: a marketing notification at 3 AM causes rage uninstalls, but an OTP at 3 AM MUST be sent. A priority-based quiet-hours bypass is the kind of design nuance interviewers reward.

#7 — Event aggregation reduces notification fatigue: one "10 people liked your post" beats ten separate "X liked your post" notifications. A 5-minute aggregation window backed by a Redis Sorted Set is a production-ready pattern.
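A pure-Python sketch of the aggregation rule; in production the per-post event buffer would be the Redis Sorted Set scored by timestamp, and a plain list stands in for it here:

```python
import time

AGG_WINDOW = 300  # 5-minute aggregation window, in seconds

def aggregate_likes(events, now=None):
    """
    events: list of (timestamp, actor_name) like-events for one post.
    Collapses everything inside the window into a single notification text.
    """
    now = now or time.time()
    recent = [actor for ts, actor in events if now - ts <= AGG_WINDOW]
    if not recent:
        return None
    if len(recent) == 1:
        return f"{recent[0]} liked your post"
    return f"{recent[0]} and {len(recent) - 1} others liked your post"

now = 1_700_000_000
events = [(now - 10, "An"), (now - 120, "Binh"), (now - 290, "Chi"),
          (now - 900, "Dung")]  # Dung falls outside the 5-minute window
text = aggregate_likes(events, now=now)  # "An and 2 others liked your post"
```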


8. Common Pitfalls

Pitfall 1: Notification Fatigue

Wrong: send a notification for every event ("A liked your post", "B liked your post", "C liked your post"...). Right: aggregate events, rate-limit per user, respect quiet hours, and give users granular opt-out (per type, per channel). Every notification must earn the right to be sent.

Pitfall 2: Thundering Herd on Mass Notification

Wrong: a marketing campaign sends 100M push notifications at once, causing a 100x spike over normal QPS: Kafka lag, FCM rate limits, worker crashes. Right: staggered delivery that spreads the 100M notifications over 30-60 minutes, a global rate limit on marketing sends, and separate Kafka partitions for marketing vs transactional traffic.
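A back-of-envelope sketch of staggered delivery; the 45-minute window and 500K batch size are assumed tuning values inside the 30-60 minute guideline:

```python
import math

def stagger_plan(total: int, window_minutes: int, batch_size: int):
    """
    Spread `total` sends over `window_minutes` in fixed-size batches.
    Returns (batches, seconds_between_batch_starts, sustained_sends_per_sec).
    """
    batches = math.ceil(total / batch_size)
    interval = (window_minutes * 60) / batches
    rate = total / (window_minutes * 60)
    return batches, interval, rate

# 100M pushes over 45 minutes in 500K-message batches
batches, interval, rate = stagger_plan(100_000_000, 45, 500_000)
# 200 batches, one every 13.5s, ~37K sends/sec sustained
```

A sustained ~37K sends/sec is a steady load workers can be sized for, instead of a single 100M-message wall hitting Kafka and FCM at once.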

Pitfall 3: SMS Cost Explosion

Wrong: default to SMS for every notification type. Right: SMS only for URGENT (OTP, security). Preference cascade: push first, then email, then SMS as a last resort. Budget alerts when projected daily SMS cost exceeds a threshold. Geographic routing (local SMS providers are cheaper than Twilio in many countries).
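The cascade can be sketched as a cheapest-first channel selector; the flag names in `prefs` are hypothetical:

```python
def pick_channel(prefs: dict, has_push_token: bool) -> str:
    """
    Cheapest-first cascade: push -> email -> SMS.
    prefs: per-channel opt-in flags, e.g. {"push": True, "email": True, "sms": True}
    """
    if prefs.get("push") and has_push_token:
        return "push"   # effectively free (FCM/APNs)
    if prefs.get("email"):
        return "email"  # cheap, slower
    if prefs.get("sms"):
        return "sms"    # last resort: the expensive channel
    return "none"
```

URGENT notifications would bypass this selector and fan out to every enabled channel; the cascade applies to normal and low-priority traffic where cost matters.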

Pitfall 4: Timezone-Unaware Delivery

Wrong: schedule a "morning notification" at 9 AM UTC, so a user in Vietnam gets it at 4 PM and a user on the US West Coast at 1 AM. Right: store the user's timezone, convert the scheduled time to their local time, and buffer in the queue during quiet hours, sending once they end.
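A sketch of the timezone-aware check using the stdlib `zoneinfo`; the 22:00-07:00 quiet window is an assumed default:

```python
from datetime import datetime, time as dtime
from zoneinfo import ZoneInfo

def in_quiet_hours(utc_now: datetime, tz: str,
                   start: dtime = dtime(22, 0), end: dtime = dtime(7, 0)) -> bool:
    """True if the user's LOCAL time is inside [start, end); the window may
    cross midnight, as the default 22:00 -> 07:00 does."""
    local = utc_now.astimezone(ZoneInfo(tz)).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end  # window crosses midnight

# A "9 AM" campaign scheduled naively at 09:00 UTC:
now = datetime(2024, 1, 15, 9, 0, tzinfo=ZoneInfo("UTC"))
vn = in_quiet_hours(now, "Asia/Ho_Chi_Minh")     # 16:00 local -> not quiet
la = in_quiet_hours(now, "America/Los_Angeles")  # 01:00 local -> quiet hours
```

A non-urgent message for the LA user would be buffered and released after 07:00 local time, while the Vietnam user receives it immediately.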

Pitfall 5: Ignoring Device Token Lifecycle

Wrong: store a device token once and use it forever. Right: tokens change on app reinstall, OS update, or token refresh. When FCM reports an invalid token (InvalidRegistration in the legacy API, UNREGISTERED in HTTP v1), delete it. Refresh the token on every app open; otherwise push failure rates climb steadily over time (token decay).

Pitfall 6: Single Point of Failure in Template Engine

Wrong: the template engine is a single service, so when it goes down every notification fails. Right: cache rendered templates, fall back to plain text when the template service is unavailable, keep the engine stateless and horizontally scalable, and pre-render marketing templates before each campaign.
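A sketch of the plain-text fallback; `render_with_fallback` and `broken_renderer` are hypothetical names for illustration:

```python
def render_with_fallback(render_fn, notif_type: str, params: dict) -> dict:
    """
    If the template service is down or the template is missing, degrade to a
    plain-text body instead of failing the whole notification.
    """
    try:
        content = render_fn(notif_type, params)
        if content:
            return content
    except Exception:
        pass  # log in production; never let a template failure block delivery
    return {"title": notif_type.replace("_", " ").title(),
            "body": ", ".join(f"{k}: {v}" for k, v in params.items())}

def broken_renderer(notif_type, params):
    raise ConnectionError("template service unavailable")

content = render_with_fallback(broken_renderer, "order_shipped", {"order_id": 99})
# Degrades to {"title": "Order Shipped", "body": "order_id: 99"}
```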

Pitfall 7: No Idempotency → Duplicate Notifications

Wrong: retry on a network timeout without a duplicate check, so the user receives "Order shipped" three times. Right: idempotency key = hash(event_source + event_id + user_id + channel), stored with SETNX in Redis plus a TTL. Producer retries are then safe because duplicates are filtered out.


Step 4 — Wrap Up

Architecture Summary

| Layer | Components | Key Design Decision |
|---|---|---|
| Ingestion | Notification Service API | Validation + dedup + preference check before anything enters the queue |
| Processing | Template Engine, Rate Limiter, Priority Router | Separated concerns; each component scales independently |
| Queueing | Kafka (9 topics: 3 channels x 3 priorities) | Separate queues for resource isolation |
| Delivery | Push/Email/SMS Workers | Stateless, autoscaled on Kafka lag |
| Reliability | Retry + DLQ | Exponential backoff + jitter; permanent vs transient failures |
| Data | Cassandra (log), Redis (prefs + dedup), ClickHouse (analytics) | Right tool for each access pattern |

Scalability Dimensions

| If you need to scale… | Solution |
|---|---|
| More notifications/sec | Add Kafka partitions + more workers |
| More channels (WhatsApp, Telegram) | Add a new Kafka topic + a new worker type — plug-in architecture |
| More notification types | Add a new template — zero code change |
| Global deployment | Multi-region Kafka + region-local workers + local SMS providers |

What interviewers reward

  1. Priority-based queueing with resource isolation (not a single queue + priority field)
  2. Rate limiting at multiple levels (per-user, per-channel, per-type, global)
  3. Effectively exactly-once delivery via idempotency keys
  4. Email deliverability awareness (SPF/DKIM/DMARC, bounce handling, IP warming)
  5. Cost optimization for the SMS channel
  6. Timezone-aware delivery with quiet hours
  7. Event aggregation for batch notifications
  8. An analytics pipeline kept separate from the delivery path

Next week: Tuan-20-Design-Key-Value-Store — Distributed KV Store design deep dive