Week 19: Design a Notification System

“A good notification earns the user's thanks. A bad one gets the app uninstalled. The line between them is thinner than you think.”

Tags: system-design notification case-study alex-xu
Student: Hieu
Prerequisites: Tuan-08-Message-Queue · Tuan-09-Rate-Limiter · Tuan-06-Cache-Strategy
Related: Tuan-17-Design-Chat-System · Tuan-18-Design-News-Feed · Tuan-13-Monitoring-Observability · Tuan-14-CI-CD-Pipeline


Step 1 — Understand the Problem & Establish Design Scope

1.1 Functional Requirements

Hieu, before drawing any diagram, ask the interviewer to nail down the scope. These are the key questions:

| Question | Assumed Answer | Why Ask |
|---|---|---|
| Which notification types must the system support? | Push notifications (iOS/Android), SMS, email | Each channel has its own architecture |
| Who or what triggers a notification? | Event-driven: new message, order update, marketing campaign, system alert | Distinguishes transactional vs marketing |
| Is there a real-time requirement? | Push & SMS: near real-time (< 5s). Email: best-effort (< 5 min) | Affects queue priority |
| Can users opt out per channel? | Yes: per-channel, per-notification-type user preferences | Requires a User Preference Service |
| Are there priority levels? | Urgent (OTP, security alert), Normal (order update), Low (marketing) | Priority queue design |
| Do we need analytics? | Yes: delivery rate, open rate, click-through rate | Analytics pipeline |
| Multi-language support? | Yes: i18n via a template engine | Template design |

1.2 Non-functional Requirements

| Requirement | Target | Notes |
|---|---|---|
| Scalability | 100M DAU, 500M notifications/day | Large-scale system |
| Availability | 99.9% (three nines) | 8.77 hours of downtime/year |
| Latency | Urgent: < 3s, Normal: < 30s, Low: best-effort | SLA per priority |
| Reliability | At-least-once delivery, ideally exactly-once | Deduplication mechanism |
| Compliance | CAN-SPAM (email), GDPR (EU), TCPA (SMS) | Legal requirements |

1.3 Capacity Estimation

Assumptions

| Parameter | Value | Explanation |
|---|---|---|
| DAU | 100M | A Facebook/Uber-scale system |
| Total notifications/day | 500M | ~5 notifications/user/day on average |
| Channel breakdown | Push: 60%, Email: 30%, SMS: 10% | Push is cheapest, SMS most expensive |
| Avg notification payload | Push: 1 KB, Email: 50 KB, SMS: 0.2 KB | Including metadata |
| Analytics event size | 0.5 KB per notification | Delivery + open + click tracking |
| Retention | Notification log: 90 days, Analytics: 1 year | Compliance + business intelligence |

Notification Volume per Channel
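With the 60/30/10 split from the assumptions table, the per-channel volumes work out as follows (a quick sanity check in Python, using integer arithmetic to avoid float rounding):

```python
total_per_day = 500_000_000                # from the assumptions table

push_per_day = total_per_day * 60 // 100   # 60% → 300M push/day
email_per_day = total_per_day * 30 // 100  # 30% → 150M email/day
sms_per_day = total_per_day * 10 // 100    # 10% → 50M SMS/day
```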

QPS Calculation

Why a 5x peak multiplier? Mass-blast marketing campaigns, flash sales, breaking news: all of them create burst traffic.
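The figures behind the summary table can be reconstructed directly:

```python
SECONDS_PER_DAY = 86_400
total_per_day = 500_000_000

avg_qps = total_per_day / SECONDS_PER_DAY  # ≈ 5,787/s, rounded to ~5,800/s
peak_qps = avg_qps * 5                     # ≈ 28,935/s, rounded to ~30,000/s
```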

Worker Count Estimation

Assume each worker handles 100 notifications/s (including template rendering, the provider API call, and retry logic):

Rule of thumb: provision 1.5x the peak worker count for headroom → ~450 workers.
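Worked out in Python, with the rounded peak QPS from the estimate above:

```python
peak_qps = 30_000                 # rounded peak from the QPS estimate
per_worker_qps = 100              # template render + provider call + retry logic

workers_at_peak = peak_qps // per_worker_qps   # 300 workers at peak
provisioned = workers_at_peak * 3 // 2         # 1.5x headroom → 450 workers
```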

Queue Depth Estimation

If a downstream provider (APNs/FCM/SES) stalls for 30 seconds:

Kafka partitions for throughput: each partition handles ~10K msg/s → the push channel needs 3 partitions at peak.
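Both the backlog and the partition count follow from the peak QPS estimate. Strictly, 2 partitions cover the push channel's 18K msg/s peak, so provisioning 3 (as above) includes headroom:

```python
peak_qps = 30_000
stall_seconds = 30

backlog = peak_qps * stall_seconds            # 900,000 buffered messages

push_peak_qps = peak_qps * 60 // 100          # 18,000/s on the push channel
per_partition_qps = 10_000
partitions_needed = -(-push_peak_qps // per_partition_qps)  # ceil → 2 strict minimum
```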

Storage Estimation

Notification log (90 days):

Analytics data (1 year):

Email content storage (90 days):
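The three figures in the summary table can be reconstructed from the assumptions (decimal units assumed, i.e. 1 KB = 1,000 B and 1 TB = 10^12 B):

```python
TB = 10 ** 12   # decimal terabyte

log_bytes = 500_000_000 * 1_000 * 90        # ~1 KB/log entry × 90 days = 45 TB
analytics_bytes = 500_000_000 * 500 * 365   # 0.5 KB/event × 1 year ≈ 91 TB
email_bytes = 150_000_000 * 50_000 * 90     # 50 KB/email × 90 days = 675 TB
```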

Alert: email content dominates the storage budget! Use compression plus object storage (S3) instead of a database.

Bandwidth Estimation
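The outbound bandwidth can be derived from the per-channel volumes and payload sizes (decimal units assumed; these figures do not appear in the summary table, so treat them as a rough sketch):

```python
daily_bytes = (
    300_000_000 * 1_000      # push payloads
    + 150_000_000 * 50_000   # email payloads (dominates)
    + 50_000_000 * 200       # SMS payloads
)                            # 7.81e12 B/day

avg_mb_per_s = daily_bytes / 86_400 / 1e6   # ≈ 90 MB/s on average
peak_mb_per_s = avg_mb_per_s * 5            # ≈ 450 MB/s at the 5x peak
```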

Estimation Summary

| Metric | Value |
|---|---|
| Total QPS (avg) | ~5,800/s |
| Total QPS (peak) | ~30,000/s |
| Push notifications/day | 300M |
| Email notifications/day | 150M |
| SMS notifications/day | 50M |
| Workers needed (peak + headroom) | ~450 |
| Queue depth (30s buffer) | ~900K messages |
| Notification log storage (90d) | ~45 TB |
| Analytics storage (1yr) | ~91 TB |
| Email content storage (90d) | ~675 TB |

Step 2 — Propose High-Level Design

2.1 Analogy (Everyday Example)

Hieu, picture the notification system as a national postal service:

  • Event producers = the letter writers (the services in the system)
  • Notification service = the central post office: sorting, address checking, stamping
  • Template engine = the printer: fills details into a pre-written letter template
  • User preference service = the registry: who wants mail and who does not
  • Rate limiter = the inspector: stops too many letters going to one person
  • Message queues = delivery trucks sorted by region (push/SMS/email)
  • Delivery workers = the mail carriers: each kind knows how to deliver its own mail
  • Dead letter queue = the warehouse for undeliverable mail: retried later

2.2 Architecture Overview

flowchart TB
    subgraph "Event Producers"
        EP1["Order Service"]
        EP2["Chat Service"]
        EP3["Marketing Service"]
        EP4["Auth Service<br/>(OTP, Security)"]
        EP5["Scheduler<br/>(Cron Jobs)"]
    end

    subgraph "Notification Platform"
        NS["Notification Service<br/>(API Gateway)"]
        VAL["Validation &<br/>Deduplication"]
        UPS["User Preference<br/>Service"]
        TE["Template Engine<br/>(i18n)"]
        RL["Rate Limiter"]
        PQ["Priority Router"]
    end

    subgraph "Message Queues (Kafka)"
        direction LR
        Q_PUSH_U["Push Queue<br/>(Urgent)"]
        Q_PUSH_N["Push Queue<br/>(Normal)"]
        Q_EMAIL_U["Email Queue<br/>(Urgent)"]
        Q_EMAIL_N["Email Queue<br/>(Normal)"]
        Q_SMS_U["SMS Queue<br/>(Urgent)"]
        Q_SMS_N["SMS Queue<br/>(Normal)"]
    end

    subgraph "Delivery Workers"
        PW["Push Workers<br/>(FCM / APNs)"]
        EW["Email Workers<br/>(SES / SendGrid)"]
        SW["SMS Workers<br/>(Twilio)"]
    end

    subgraph "External Providers"
        FCM["Firebase Cloud<br/>Messaging"]
        APNS["Apple Push<br/>Notification Service"]
        SES["Amazon SES /<br/>SendGrid"]
        TWILIO["Twilio /<br/>MessageBird"]
    end

    subgraph "Data Stores"
        DB[("Notification Log<br/>(Cassandra)")]
        REDIS[("User Preferences<br/>+ Device Tokens<br/>(Redis)")]
        S3[("Email Templates<br/>+ Content<br/>(S3)")]
        ANALYTICS[("Analytics<br/>(ClickHouse)")]
    end

    subgraph "Reliability"
        DLQ["Dead Letter Queue"]
        RETRY["Retry Scheduler"]
    end

    EP1 & EP2 & EP3 & EP4 & EP5 --> NS
    NS --> VAL
    VAL --> UPS
    UPS --> TE
    TE --> RL
    RL --> PQ

    PQ --> Q_PUSH_U & Q_PUSH_N
    PQ --> Q_EMAIL_U & Q_EMAIL_N
    PQ --> Q_SMS_U & Q_SMS_N

    Q_PUSH_U & Q_PUSH_N --> PW
    Q_EMAIL_U & Q_EMAIL_N --> EW
    Q_SMS_U & Q_SMS_N --> SW

    PW --> FCM & APNS
    EW --> SES
    SW --> TWILIO

    PW & EW & SW --> DB
    PW & EW & SW --> ANALYTICS
    PW & EW & SW -->|"Failed"| DLQ
    DLQ --> RETRY
    RETRY --> Q_PUSH_N & Q_EMAIL_N & Q_SMS_N

    NS --> REDIS
    TE --> S3

    style NS fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style PQ fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style DLQ fill:#f44336,stroke:#333,stroke-width:2px,color:#fff

2.3 Component Responsibilities

| Component | Responsibility | Technology |
|---|---|---|
| Notification Service | API gateway: receives requests from producers, validation | Go / Java microservice |
| Validation & Dedup | Schema validation, idempotency check (idempotency key) | In-process + Redis |
| User Preference Service | Checks opt-in/opt-out per user, per channel, per type | Redis Hash |
| Template Engine | Renders notification content from template + parameters, i18n | Handlebars / Jinja2 |
| Rate Limiter | Per-user + per-channel rate limits, anti-spam | Redis + Token Bucket |
| Priority Router | Routes each notification to the right queue by priority + channel | In-process logic |
| Message Queues | Decouple producers & consumers, buffer burst traffic | Kafka (one topic per channel × priority, plus DLQ; see 3.2) |
| Delivery Workers | Consume from the queues, call the external provider, handle responses | Stateless workers, autoscaled |
| Dead Letter Queue | Holds notifications that fail after max retries | Kafka DLQ topic |
| Notification Log | Stores each notification's state (created, sent, delivered, opened) | Cassandra (write-heavy, time-series) |
| Analytics Service | Aggregates delivery/open/click metrics, A/B testing | ClickHouse / Druid |

2.4 Notification Delivery Flow

sequenceDiagram
    participant Producer as Event Producer
    participant NS as Notification Service
    participant Redis as Redis (Prefs + Dedup)
    participant TE as Template Engine
    participant RL as Rate Limiter
    participant Kafka as Kafka Queue
    participant Worker as Delivery Worker
    participant Provider as External Provider<br/>(FCM/SES/Twilio)
    participant DB as Notification Log

    Producer->>NS: POST /api/v1/notifications<br/>{user_id, type, channel, data, priority}
    NS->>Redis: Check idempotency key
    Redis-->>NS: Not duplicate

    NS->>Redis: Get user preferences
    Redis-->>NS: {push: true, email: true, sms: false}

    Note over NS: User opted out of SMS → skip SMS channel

    NS->>TE: Render template(type, locale, data)
    TE-->>NS: Rendered content

    NS->>RL: Check rate limit(user_id, channel)
    RL-->>NS: Allowed (under limit)

    NS->>Kafka: Publish to push_normal topic
    NS->>Kafka: Publish to email_normal topic
    NS->>DB: Log status = QUEUED

    Kafka->>Worker: Consume message
    Worker->>Provider: Send notification
    Provider-->>Worker: Success (message_id)

    Worker->>DB: Update status = SENT
    Worker->>DB: Log provider_message_id

    Note over Provider: Provider delivers to device/inbox

    Provider-->>Worker: Delivery callback (webhook)
    Worker->>DB: Update status = DELIVERED

Step 3 — Design Deep Dive

3.1 Notification Template Engine

Why templates?

Instead of letting every service compose its own notification content (inconsistent, hard to maintain), centralize the templates:

// Template: order_shipped
// Locale: vi
Chào {{user_name}}, đơn hàng #{{order_id}} đã được giao cho {{carrier}}.
Theo dõi tại: {{tracking_url}}

// Locale: en
Hi {{user_name}}, your order #{{order_id}} has been shipped via {{carrier}}.
Track it here: {{tracking_url}}

Template Storage Design

{
  "template_id": "order_shipped",
  "version": 3,
  "channels": {
    "push": {
      "vi": {
        "title": "Đơn hàng đã gửi!",
        "body": "Đơn #{{order_id}} đang trên đường đến bạn."
      },
      "en": {
        "title": "Order Shipped!",
        "body": "Order #{{order_id}} is on its way."
      }
    },
    "email": {
      "vi": {
        "subject": "Đơn hàng #{{order_id}} đã được gửi",
        "html_template_s3_key": "templates/order_shipped/vi/v3.html"
      },
      "en": {
        "subject": "Your order #{{order_id}} has shipped",
        "html_template_s3_key": "templates/order_shipped/en/v3.html"
      }
    },
    "sms": {
      "vi": {
        "body": "Don hang #{{order_id}} da gui. Theo doi: {{tracking_url}}"
      },
      "en": {
        "body": "Order #{{order_id}} shipped. Track: {{tracking_url}}"
      }
    }
  },
  "required_params": ["user_name", "order_id", "carrier", "tracking_url"],
  "created_at": "2026-01-15T10:00:00Z"
}

Aha Moment #1: template versioning really matters. When you update a template, notifications already in the queue must still render with the old version. Store template_version in the message payload.

i18n Strategy

  • The user profile stores preferred_locale (defaulting to the device locale)
  • Fallback chain: user_locale → country_default → en
  • SMS avoids Unicode to save segments (Unicode SMS = 70 chars/segment vs GSM-7 = 160 chars/segment)
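The fallback chain can be sketched as a small resolver; the country-default mapping and function name here are illustrative, not part of the original design:

```python
COUNTRY_DEFAULT = {"VN": "vi", "US": "en"}  # hypothetical country → locale map

def resolve_locale(user_locale, country, available):
    """Walk the fallback chain: user_locale → country_default → en."""
    for candidate in (user_locale, COUNTRY_DEFAULT.get(country), "en"):
        if candidate and candidate in available:
            return candidate
    return "en"  # final fallback even if "en" is missing from `available`
```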

3.2 Priority Queue Design

Priority Levels

| Priority | Use Cases | SLA | Queue |
|---|---|---|---|
| URGENT | OTP, security alert, payment confirmation | < 3 seconds | Dedicated urgent queue, pre-allocated workers |
| NORMAL | Order update, message notification, social interaction | < 30 seconds | Standard queue |
| LOW | Marketing campaign, weekly digest, recommendations | Best-effort (< 5 min) | Low-priority queue, off-peak delivery |

Implementation: Separate Queues (Not Priority Field)

Why separate Kafka topics instead of a single queue with a priority field?

  1. Resource isolation: the urgent queue has a dedicated worker pool, so a marketing burst cannot delay OTPs
  2. Independent scaling: urgent workers stay always-on (never autoscaled to zero); marketing workers scale to zero off-peak
  3. Different retry policies: urgent gets aggressive retries (3 attempts within 10s); low gets gentle retries (3 attempts within 1 hour)

Kafka topics:
├── notification.push.urgent     (3 partitions, replication=3)
├── notification.push.normal     (6 partitions, replication=3)
├── notification.push.low        (3 partitions, replication=2)
├── notification.email.urgent    (3 partitions, replication=3)
├── notification.email.normal    (6 partitions, replication=3)
├── notification.email.low       (3 partitions, replication=2)
├── notification.sms.urgent      (3 partitions, replication=3)
├── notification.sms.normal      (3 partitions, replication=3)
├── notification.sms.low         (2 partitions, replication=2)
└── notification.dlq             (3 partitions, replication=3)

3.3 Rate Limiting: Fighting Notification Fatigue

Why rate-limit notifications?

  • A user who gets 50 pushes/day uninstalls the app
  • Too much email → ISPs mark it as spam → deliverability drops for the whole domain
  • Too much SMS → costs explode, plus TCPA violations

Rate Limit Tiers

| Tier | Limit | Scope | Example |
|---|---|---|---|
| Per-user per-channel | 10 push/hour, 5 email/day, 2 SMS/day | One specific user | User A receives at most 10 push notifications/hour |
| Per-user total | 30 notifications/day | All channels combined | User A receives at most 30 notifications/day in total |
| Per-type | 1/event_type/hour | Same notification type | Never send “someone liked your post” more than once/hour |
| Global per-channel | 100K email/hour | Whole system | Avoid ISP throttling |

Exception: URGENT priority (OTP, security alerts) bypasses the rate limiter. A user who requests an OTP must receive it immediately; it must never be blocked.

Redis Implementation

# Per-user per-channel rate limit (fixed-window counter, one key per hour)
Key:    rate:{user_id}:{channel}:{window}
Value:  count
TTL:    window duration

# Example: User 12345, push channel, hourly window
Key:    rate:12345:push:2026031810
Value:  7
TTL:    3600

# Check: if value < 10 → allowed, INCR
#         if value >= 10 → rejected, queue for later or drop
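A minimal in-memory sketch of this fixed-window counter; production would use the Redis `INCR` + `EXPIRE` pattern shown above, and the class name here is illustrative:

```python
import time

class FixedWindowLimiter:
    """Fixed-window counter, e.g. 10 push notifications per hour per user."""

    def __init__(self, max_count, window_seconds):
        self.max_count = max_count
        self.window = window_seconds
        self.counters = {}  # (user_id, channel, window_id) → count

    def allow(self, user_id, channel, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)       # e.g. the hour bucket
        key = (user_id, channel, window_id)
        count = self.counters.get(key, 0)
        if count >= self.max_count:
            return False                          # over limit: reject or defer
        self.counters[key] = count + 1            # equivalent of Redis INCR
        return True
```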

3.4 Retry Mechanism & Dead Letter Queue

Retry Strategy: Exponential Backoff

flowchart TD
    A["Worker picks message<br/>from Kafka"] --> B{"Send to Provider"}
    B -->|"Success"| C["Update status = SENT<br/>ACK message"]
    B -->|"Fail (timeout,<br/>5xx, rate limit)"| D{"Retry count<br/>< max_retries?"}
    D -->|"Yes"| E["Calculate backoff:<br/>delay = min(base × 2^retry, max_delay)<br/>+ random jitter"]
    E --> F["Re-publish to same topic<br/>with incremented retry_count<br/>and scheduled_at = now + delay"]
    F --> A
    D -->|"No (exhausted)"| G["Publish to DLQ"]
    G --> H["Alert + Manual Review"]

    B -->|"Fail (4xx client error,<br/>invalid token)"| I["Permanent failure<br/>Update status = FAILED"]
    I --> J["Clean up: remove<br/>invalid device token"]

    style G fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
    style C fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style I fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff

Retry Configuration per Priority

| Priority | Max Retries | Base Delay | Max Delay | Total Window |
|---|---|---|---|---|
| URGENT | 5 | 1s | 30s | ~1 min |
| NORMAL | 3 | 30s | 5 min | ~15 min |
| LOW | 3 | 5 min | 1 hour | ~3 hours |

Backoff Formula

Jitter really matters: without jitter, all retries land at the same instant, creating a thundering herd against the provider API.
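A sketch of the formula from the retry flowchart, `delay = min(base × 2^retry, max_delay) + random jitter`; the jitter ratio is an assumption, not specified in the design:

```python
import random

def backoff_delay(retry_count, base, max_delay, jitter_ratio=0.1):
    """Exponential backoff with a cap and additive random jitter."""
    delay = min(base * (2 ** retry_count), max_delay)
    jitter = random.uniform(0, delay * jitter_ratio)  # de-synchronize retries
    return delay + jitter

# NORMAL priority (base 30s, cap 5 min): retries land near 30s, 60s, 120s
```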

Dead Letter Queue (DLQ) Processing

Messages in the DLQ are handled by manual review or automated recovery:

  1. Invalid device token → remove the token from Redis, mark the user's device as inactive
  2. Provider outage → bulk re-process once the provider recovers
  3. Permanent failure → log, alert, move to archive

3.5 Push Notification — FCM/APNs Integration

Device Token Management

Each user can have multiple devices, and each device has a device token (registration token) issued by FCM or APNs.

Redis Hash — device tokens per user:
Key:    devices:{user_id}
Fields:
  {device_id_1}: {"token": "fcm_token_abc", "platform": "android", "app_version": "3.2.1", "last_active": "2026-03-17T10:00:00Z"}
  {device_id_2}: {"token": "apns_token_xyz", "platform": "ios", "app_version": "3.1.0", "last_active": "2026-03-18T08:30:00Z"}

Token Lifecycle

| Event | Action |
|---|---|
| User installs app | Register device token → Redis |
| User opens app | Refresh the token (tokens can change) |
| Push fails with InvalidRegistration (FCM) or BadDeviceToken (APNs) | Remove the token from Redis |
| User uninstalls app | Token becomes invalid → detected on the next failed push |
| User logs out | Dissociate the token from the user (don't delete it; re-associate on next login) |

Silent Push (Data-only Push)

Not every push shows a notification banner. Silent push is used to:

  • Trigger background data sync
  • Update badge count
  • Invalidate local cache
// FCM silent push payload
{
  "to": "device_token",
  "data": {
    "type": "badge_update",
    "unread_count": 5
  }
  // Không có "notification" key → silent push
}

3.6 Email — Deliverability & Integration

Email Architecture

Notification Service → Email Worker → Email Sending Service (SES/SendGrid)
                                          ↓
                                    SMTP Relay / API
                                          ↓
                                    Recipient ISP (Gmail, Outlook, Yahoo)
                                          ↓
                                    User's Inbox (or Spam folder!)

Deliverability Essentials — SPF, DKIM, DMARC

| Protocol | Purpose | DNS Record |
|---|---|---|
| SPF (Sender Policy Framework) | Declares which IPs may send email for the domain | TXT "v=spf1 include:amazonses.com -all" |
| DKIM (DomainKeys Identified Mail) | Signs email with a private key; receivers verify via the public key published in DNS | TXT record containing the DKIM public key |
| DMARC (Domain-based Message Auth) | Policy when SPF/DKIM fail: none, quarantine, reject | TXT "v=DMARC1; p=reject; rua=mailto:[email protected]" |

Aha Moment #2: without correct SPF + DKIM + DMARC setup, notification emails land in the spam folder. And once an ISP blacklists your domain, recovery takes weeks. This is why email deliverability is a skill of its own.

Bounce Handling

| Bounce Type | Example | Action |
|---|---|---|
| Hard bounce | Email address does not exist (550) | Remove the email from the user profile; NEVER send to it again |
| Soft bounce | Mailbox full (452), temporary server error | Retry 3 times, then mark as soft-bounced |
| Complaint | User clicks “Report Spam” | Unsubscribe immediately, log the complaint |

ISPs track both bounce rate and complaint rate. If the bounce rate exceeds 5% or the complaint rate exceeds 0.1%, the domain gets throttled or blacklisted.

SES/SendGrid Integration

  • Use a dedicated IP instead of a shared IP (control your own reputation)
  • IP warming: ramp a new IP up slowly (100 emails/day, increasing to full volume over 4-6 weeks)
  • Feedback loop: SES SNS notifications for bounces/complaints → auto-update the user preference store

3.7 SMS — Cost Optimization

SMS Pricing Reality

| Route | Price/SMS (USD) | Cost at 50M SMS/day |
|---|---|---|
| US (short code) | $0.0075 | $375,000/day |
| US (long code) | $0.0050 | $250,000/day |
| Vietnam | $0.04 | $2,000,000/day |
| India | $0.003 | $150,000/day |

Aha Moment #3: SMS is the most expensive channel, costing thousands of times more than push notifications (which are nearly free). That is why SMS should be reserved for high-value notifications (OTP, security, payment).

Cost Optimization Strategies

  1. Channel preference cascade: push first → email → SMS (only if urgent and the user has no push token)
  2. Smart routing: use short codes in the US when throughput matters; long codes ($0.0050) are cheaper but slower
  3. SMS content optimization: GSM-7 encoding (160 chars/segment) instead of Unicode (70 chars/segment)
  4. Batching: aggregate similar SMS (one combined SMS instead of 3 separate ones)
  5. Geographic routing: use a local provider per region (Twilio US, MessageBird EU, a local provider in VN)

3.8 Deduplication — Exactly-Once Delivery

Why dedup?

  • Network failure after a successful send → the producer retries → the user receives it twice
  • A Kafka consumer crashes after sending but before committing its offset → reprocessing → duplicate
  • A marketing campaign accidentally triggered twice → users receive a double email

Idempotency Key Design

Idempotency key = hash(event_source + event_id + user_id + channel + notification_type)

Example:
  event_source: "order_service"
  event_id: "order_12345_shipped"
  user_id: "user_67890"
  channel: "push"
  notification_type: "order_shipped"

  Key: SHA256("order_service:order_12345_shipped:user_67890:push:order_shipped")
       = "a1b2c3d4..."
Redis SET with TTL:
  SETNX dedup:{idempotency_key} 1 EX 86400  (24h TTL)
  → If OK (key didn't exist) → process notification
  → If nil (key already exists) → skip (duplicate)

Trade-off: too short a TTL misses duplicates; too long a TTL bloats Redis memory. 24h is the sweet spot for most use cases.

3.9 Analytics — Measuring Notification Effectiveness

Metrics Pipeline

Notification sent → Delivery callback → Open/Read event → Click event → Conversion event
     ↓                    ↓                   ↓                ↓              ↓
   SENT               DELIVERED            OPENED           CLICKED       CONVERTED

Key Metrics

| Metric | Formula | Target | How Measured |
|---|---|---|---|
| Delivery Rate | delivered / sent | > 95% (push), > 98% (email) | Provider callback |
| Open Rate | opened / delivered | > 15% (push), > 20% (email) | Tracking pixel (email), app-open event (push) |
| Click-through Rate (CTR) | clicked / opened | > 5% | Redirect URL tracking |
| Opt-out Rate | unsubscribed / delivered | < 0.5% | Unsubscribe link clicks |
| Conversion Rate | converted / clicked | Varies by campaign | Deep-link attribution |

A/B Testing

  • Split users into buckets (control vs variant)
  • Test: notification title, body, send time, channel
  • Track metrics per variant → pick winner
  • Example: “Đơn hàng đã gửi!” (“Order shipped!”) vs “Đơn hàng đang trên đường đến bạn!” (“Your order is on its way!”) → which has the higher open rate?
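Bucket assignment must be deterministic so a user always sees the same variant across sends. One common sketch hashes the user id together with the experiment name; the function and names here are illustrative:

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "variant")):
    """Stable hash of (experiment, user) → bucket index → variant name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)  # uniform over variants
    return variants[bucket]
```

Because the assignment is a pure function of the inputs, no per-user bucket table is needed; any worker computes the same answer.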

3.10 User Preference Store

Redis Hash Design

Key:    prefs:{user_id}
Fields:
  push_enabled:           "true"
  email_enabled:          "true"
  sms_enabled:            "false"
  push_marketing:         "false"    # opt-out marketing push
  push_social:            "true"     # opt-in social push
  push_transactional:     "true"     # opt-in transactional push
  email_marketing:        "false"
  email_transactional:    "true"
  quiet_hours_start:      "22:00"    # local time
  quiet_hours_end:        "08:00"
  timezone:               "Asia/Ho_Chi_Minh"
  locale:                 "vi"
Compliance Requirements

| Regulation | Requirement | Implementation |
|---|---|---|
| CAN-SPAM (US, email) | Unsubscribe link in every email, processed within 10 days | List-Unsubscribe header, one-click unsubscribe |
| GDPR (EU) | Explicit consent, right to erasure | Double opt-in, preference deletion API |
| TCPA (US, SMS) | Explicit written consent before marketing SMS | Consent record + timestamp in the DB |

3.11 Batch Notification — Aggregation

Problem

10 people like a user's post within 5 minutes → send 10 notifications? No!

Solution: Event Aggregation

Window: 5 minutes
Events:
  10:00 - Alice liked your post
  10:01 - Bob liked your post
  10:03 - Charlie liked your post

Aggregated notification (at 10:05):
  "Alice, Bob, Charlie liked your post"

If > 3 people:
  "Alice, Bob, and 8 others liked your post"

Implementation

  • Tumbling window (fixed 5-min intervals): Simple, predictable
  • Session window (gap-based): Send when no new events for 2 minutes
  • Storage: Redis Sorted Set per user per event type, score = timestamp
  • Aggregation worker runs every window interval, collects events, sends single notification
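The formatting step of the aggregation worker can be sketched as follows. In production the event list would come from the Redis Sorted Set described above (e.g. a ZRANGEBYSCORE over the window); here a plain list stands in, and the function name is illustrative:

```python
def aggregate_likes(names, max_named=3):
    """Collapse N like-events in one window into a single notification line."""
    if len(names) <= max_named:
        return f"{', '.join(names)} liked your post"
    shown = names[:2]                       # name the first two actors
    rest = len(names) - 2                   # summarize everyone else
    return f"{', '.join(shown)}, and {rest} others liked your post"
```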

3.12 Timezone-Aware Delivery

Problem

Marketing notification scheduled for 9:00 AM — but user is in which timezone?

Solution

User timezone: "Asia/Ho_Chi_Minh" (UTC+7)
Scheduled send: 9:00 AM local time

Server calculation:
  UTC time = 9:00 AM - 7h = 2:00 AM UTC

For user in "America/New_York" (UTC-4):
  UTC time = 9:00 AM + 4h = 1:00 PM UTC

  • Quiet hours: do not send between 22:00 and 08:00 local time (buffer in the queue, send at 08:00)
  • Exception: URGENT priority ignores quiet hours (an OTP or security alert at 3 AM is fine)
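The offset arithmetic above is safest done with the standard zoneinfo library, which is DST-aware (New York is UTC-4 only during daylight time); the function name is illustrative:

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

def utc_send_time(day, local_hour, tz_name):
    """Convert a 'send at HH:00 local time' schedule to a UTC instant."""
    local = datetime.combine(day, time(local_hour, 0), tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)

hcm = utc_send_time(date(2026, 3, 18), 9, "Asia/Ho_Chi_Minh")  # 02:00 UTC
nyc = utc_send_time(date(2026, 3, 18), 9, "America/New_York")  # 13:00 UTC (EDT)
```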

4. Security — Notification-Specific Threats

4.1 Notification Spoofing Prevention

Threat: an attacker sends fake notification requests impersonating an internal service.

Mitigations:

  • Service-to-service authentication: mTLS or JWT carrying a service identity
  • API key per producer: each service gets its own rate-limited API key
  • Request signing: an HMAC signature over the payload, verified by the Notification Service before processing
  • Allow-list: only registered event types from registered producer services are accepted
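The request-signing idea can be sketched with the standard hmac module; the secret handling here is illustrative (a real deployment would issue a distinct secret per producer via a secret manager):

```python
import hashlib
import hmac

SHARED_SECRET = b"per-service-secret"  # illustrative; never hard-code in production

def sign(payload: bytes) -> str:
    """Producer side: HMAC-SHA256 signature sent alongside the payload."""
    return hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str) -> bool:
    """Notification Service side: constant-time compare to resist timing attacks."""
    return hmac.compare_digest(sign(payload), signature)
```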

4.2 Phishing Detection in Email

Threat: a compromised internal service sends phishing email through the notification system.

Mitigations:

  • Template-only policy: email content MUST use pre-approved templates; no arbitrary HTML allowed
  • URL allowlist: links in email may only point to approved domains
  • Content scanning: scan email content for known phishing patterns before sending
  • DMARC enforcement: p=reject → email spoofing the domain gets rejected by ISPs

4.3 PII in Notifications

Threat: the push notification preview on the lock screen leaks sensitive information.

Mitigations:

  • Content masking: show “You received a transfer of ****5,000,000 VND” instead of the full details
  • Hidden preview: iOS/Android can hide notification content on the lock screen
  • Encrypted push payload: encrypt the payload; the app decrypts after the user unlocks the device

4.4 Encrypted Push Payload

// Standard push (visible on lock screen):
{
  "title": "Chuyển khoản thành công",
  "body": "Bạn nhận 5,000,000 VND từ Nguyễn Văn A"  ← PII leak!
}

// Encrypted push:
{
  "data": {
    "encrypted_payload": "aes256gcm:iv:ciphertext:tag",
    "key_id": "user_key_v3"
  }
  // App decrypts after biometric auth
}

5. DevOps — Production Operations

5.1 Kafka Configuration for Notification Ingestion

# kafka-notification-topics.yml
topics:
  - name: notification.push.urgent
    partitions: 3
    replication_factor: 3
    config:
      retention.ms: 86400000        # 24h
      min.insync.replicas: 2        # Strong durability for urgent
      max.message.bytes: 1048576    # 1MB max
 
  - name: notification.push.normal
    partitions: 6
    replication_factor: 3
    config:
      retention.ms: 259200000       # 72h
      min.insync.replicas: 2
 
  - name: notification.email.normal
    partitions: 6
    replication_factor: 3
    config:
      retention.ms: 259200000
      min.insync.replicas: 2
 
  - name: notification.sms.urgent
    partitions: 3
    replication_factor: 3
    config:
      retention.ms: 86400000
      min.insync.replicas: 2
 
  - name: notification.dlq
    partitions: 3
    replication_factor: 3
    config:
      retention.ms: 604800000       # 7 days — manual review window
      min.insync.replicas: 2

5.2 Worker Autoscaling

# k8s-hpa-push-workers.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: push-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: push-worker
  minReplicas: 20          # Always-on baseline
  maxReplicas: 200         # Peak capacity
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumergroup_lag
          selector:
            matchLabels:
              topic: notification.push.normal
        target:
          type: AverageValue
          averageValue: "1000"    # Scale up when lag > 1000 per pod
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # React fast to burst
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Slow scale-down to avoid flapping
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120

5.3 Monitoring & Alerting

# prometheus-notification-alerts.yml
groups:
  - name: notification_delivery
    rules:
      - alert: DeliveryRateDropped
        expr: |
          (
            sum(rate(notification_delivered_total[5m]))
            /
            sum(rate(notification_sent_total[5m]))
          ) < 0.90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Notification delivery rate dropped below 90%"
          description: "Current delivery rate: {{ $value | humanizePercentage }}"
 
      - alert: UrgentQueueLagHigh
        expr: kafka_consumergroup_lag{topic=~"notification.*.urgent"} > 5000
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Urgent notification queue lag is high ({{ $value }})"
 
      - alert: DLQGrowing
        expr: rate(kafka_topic_messages_in_total{topic="notification.dlq"}[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Dead letter queue receiving {{ $value }} msg/s"
 
      - alert: SMSCostSpike
        expr: |
          sum(rate(sms_sent_total[1h])) * 0.0075 * 24 > 100000
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Projected daily SMS cost exceeds $100K"
 
      - alert: EmailBounceRateHigh
        expr: |
          sum(rate(email_bounce_total{type="hard"}[1h]))
          /
          sum(rate(email_sent_total[1h]))
          > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Email hard bounce rate exceeds 5% — risk of domain blacklist"
 
      - alert: PushTokenInvalidRate
        expr: |
          sum(rate(push_token_invalid_total[1h]))
          /
          sum(rate(push_sent_total[1h]))
          > 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "10%+ push tokens are invalid — device token cleanup needed"

5.4 Grafana Dashboard Panels

| Panel | PromQL | Threshold |
|---|---|---|
| Notifications/sec by channel | sum(rate(notification_sent_total[1m])) by (channel) | Compare vs estimation |
| Delivery rate by channel | delivered / sent by channel | Push > 95%, Email > 98% |
| P99 notification latency | histogram_quantile(0.99, notification_e2e_duration_seconds_bucket) | Urgent < 3s |
| Kafka consumer lag | kafka_consumergroup_lag by topic | Urgent < 1K, Normal < 100K |
| DLQ message count | kafka_topic_partition_current_offset{topic="notification.dlq"} | < 1000 |
| SMS daily cost projection | sum(sms_sent_total) * cost_per_sms * (86400/elapsed) | < $50K/day |
| Email bounce rate | hard_bounce / sent | < 2% |
| Opt-out rate (rolling 7d) | unsubscribe_total / delivered_total | < 0.5% |

5.5 SES/SendGrid Operational Notes

  • Dedicated IP pool: separate IPs for transactional vs marketing email (separate reputations)
  • IP warming schedule: Day 1: 200 emails → Day 7: 10K → Day 14: 50K → Day 30: full volume
  • Suppression list: SES auto-maintains bounce/complaint list — sync to internal preference store
  • Sending quota: SES default = 200 emails/sec. Request increase gradually.
  • Dashboard: monitor the SES console for sending statistics, bounce/complaint rates, and the reputation dashboard

6. Code Examples

6.1 Python: Notification Service with Priority Queue

"""
Notification Service — Core processing pipeline
Handles validation, deduplication, preference check, template rendering,
rate limiting, and queue routing.
"""
 
import hashlib
import json
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
 
import redis
from confluent_kafka import Producer
 
 
class Channel(Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"
 
 
class Priority(Enum):
    URGENT = "urgent"
    NORMAL = "normal"
    LOW = "low"
 
 
@dataclass
class NotificationRequest:
    event_source: str
    event_id: str
    user_id: str
    channels: list[Channel]
    notification_type: str
    priority: Priority
    template_params: dict
    idempotency_key: Optional[str] = None
    scheduled_at: Optional[float] = None
 
    def __post_init__(self):
        if not self.idempotency_key:
            # Auto-generate idempotency key
            raw = f"{self.event_source}:{self.event_id}:{self.user_id}"
            self.idempotency_key = hashlib.sha256(raw.encode()).hexdigest()
 
 
@dataclass
class NotificationMessage:
    """Message published to Kafka after processing."""
    notification_id: str
    user_id: str
    channel: Channel
    priority: Priority
    rendered_content: dict
    device_tokens: list[str] = field(default_factory=list)
    email_address: str = ""
    phone_number: str = ""
    retry_count: int = 0
    created_at: float = field(default_factory=time.time)
 
 
class NotificationService:
    """
    Core notification processing pipeline.
 
    Flow: Validate → Dedup → Check Preferences → Render Template
          → Rate Limit → Route to Queue
    """
 
    DEDUP_TTL = 86400  # 24 hours
    RATE_LIMITS = {
        Channel.PUSH: {"window": 3600, "max_count": 10},    # 10/hour
        Channel.EMAIL: {"window": 86400, "max_count": 5},    # 5/day
        Channel.SMS: {"window": 86400, "max_count": 2},      # 2/day
    }
 
    def __init__(self, redis_client: redis.Redis, kafka_producer: Producer):
        self.redis = redis_client
        self.producer = kafka_producer
 
    def process(self, request: NotificationRequest) -> dict:
        """Main entry point — process a notification request."""
        results = {}
 
        # Step 1: Deduplication check
        if self._is_duplicate(request.idempotency_key):
            return {"status": "skipped", "reason": "duplicate"}
 
        for channel in request.channels:
            result = self._process_channel(request, channel)
            results[channel.value] = result
 
        return {"status": "processed", "channels": results}
 
    def _is_duplicate(self, idempotency_key: str) -> bool:
        """Check Redis for duplicate using SETNX."""
        key = f"dedup:{idempotency_key}"
        was_set = self.redis.set(key, 1, nx=True, ex=self.DEDUP_TTL)
        return not was_set  # None = key existed = duplicate
 
    def _process_channel(self, request: NotificationRequest, channel: Channel) -> dict:
        # Step 2: Check user preferences
        if not self._check_preference(request.user_id, channel, request.notification_type):
            return {"status": "skipped", "reason": "user_opted_out"}
 
        # Step 3: Check quiet hours (skip for URGENT)
        if request.priority != Priority.URGENT:
            if self._in_quiet_hours(request.user_id):
                return {"status": "deferred", "reason": "quiet_hours"}
 
        # Step 4: Render template
        locale = self._get_user_locale(request.user_id)
        content = self._render_template(
            request.notification_type, channel, locale, request.template_params
        )
 
        # Step 5: Rate limit check (URGENT bypasses)
        if request.priority != Priority.URGENT:
            if not self._check_rate_limit(request.user_id, channel):
                return {"status": "rate_limited", "reason": "too_many_notifications"}
 
        # Step 6: Build message and publish to Kafka
        message = NotificationMessage(
            notification_id=f"{request.idempotency_key}:{channel.value}",
            user_id=request.user_id,
            channel=channel,
            priority=request.priority,
            rendered_content=content,
        )
 
        # Enrich with delivery target
        if channel == Channel.PUSH:
            message.device_tokens = self._get_device_tokens(request.user_id)
            if not message.device_tokens:
                return {"status": "skipped", "reason": "no_device_tokens"}
        elif channel == Channel.EMAIL:
            message.email_address = self._get_email(request.user_id)
        elif channel == Channel.SMS:
            message.phone_number = self._get_phone(request.user_id)
 
        topic = f"notification.{channel.value}.{request.priority.value}"
        self._publish(topic, message)
 
        return {"status": "queued", "topic": topic}
 
    def _check_preference(self, user_id: str, channel: Channel, notif_type: str) -> bool:
        """Check user preference from Redis hash."""
        prefs = self.redis.hgetall(f"prefs:{user_id}")
        if not prefs:
            return True  # Default: all enabled
 
        # Check channel-level opt-out.
        # hgetall returns bytes keys, so encode before lookup.
        channel_key = f"{channel.value}_enabled".encode()
        if prefs.get(channel_key, b"true") == b"false":
            return False

        # Check type-level opt-out (e.g., push_marketing)
        type_key = f"{channel.value}_{notif_type}".encode()
        if prefs.get(type_key, b"true") == b"false":
            return False
 
        return True
 
    def _in_quiet_hours(self, user_id: str) -> bool:
        """Check if current time is within user's quiet hours."""
        prefs = self.redis.hgetall(f"prefs:{user_id}")
        quiet_start = prefs.get(b"quiet_hours_start", b"").decode()
        quiet_end = prefs.get(b"quiet_hours_end", b"").decode()
        if not quiet_start or not quiet_end:
            return False
 
        tz = prefs.get(b"timezone", b"UTC").decode()
        # In production: convert current UTC time to user's local time
        # and check if within [quiet_start, quiet_end]
        # Simplified here for brevity
        return False
 
    def _check_rate_limit(self, user_id: str, channel: Channel) -> bool:
        """Sliding window rate limit using Redis INCR + TTL."""
        config = self.RATE_LIMITS[channel]
        window = int(time.time()) // config["window"]
        key = f"rate:{user_id}:{channel.value}:{window}"
 
        count = self.redis.incr(key)
        if count == 1:
            self.redis.expire(key, config["window"])
 
        return count <= config["max_count"]
 
    def _get_user_locale(self, user_id: str) -> str:
        locale = self.redis.hget(f"prefs:{user_id}", "locale")
        return locale.decode() if locale else "en"
 
    def _render_template(
        self, notif_type: str, channel: Channel, locale: str, params: dict
    ) -> dict:
        """
        Render notification template with parameters.
        In production: load template from S3/cache, use Jinja2/Handlebars.
        """
        # Simplified — in production this loads from template store
        templates = {
            ("order_shipped", Channel.PUSH, "vi"): {
                "title": "Don hang da gui!",
                "body": "Don #{order_id} dang tren duong den ban.",
            },
            ("order_shipped", Channel.PUSH, "en"): {
                "title": "Order Shipped!",
                "body": "Order #{order_id} is on its way.",
            },
        }
        template = templates.get((notif_type, channel, locale), {})
        rendered = {}
        for key, value in template.items():
            for param_name, param_value in params.items():
                value = value.replace(f"{{{param_name}}}", str(param_value))
            rendered[key] = value
        return rendered
 
    def _get_device_tokens(self, user_id: str) -> list[str]:
        devices = self.redis.hgetall(f"devices:{user_id}")
        tokens = []
        for device_id, device_data in devices.items():
            data = json.loads(device_data)
            tokens.append(data["token"])
        return tokens
 
    def _get_email(self, user_id: str) -> str:
        value = self.redis.hget(f"user:{user_id}", "email")
        return value.decode() if value else ""

    def _get_phone(self, user_id: str) -> str:
        value = self.redis.hget(f"user:{user_id}", "phone")
        return value.decode() if value else ""
 
    def _publish(self, topic: str, message: NotificationMessage):
        payload = json.dumps({
            "notification_id": message.notification_id,
            "user_id": message.user_id,
            "channel": message.channel.value,
            "priority": message.priority.value,
            "content": message.rendered_content,
            "device_tokens": message.device_tokens,
            "email_address": message.email_address,
            "phone_number": message.phone_number,
            "retry_count": message.retry_count,
            "created_at": message.created_at,
        })
        self.producer.produce(topic, value=payload.encode("utf-8"))
        self.producer.flush()
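The SETNX-based dedup in `_is_duplicate` can be exercised without a live Redis. The `FakeRedis` stub below is illustrative only; it mimics just the one redis-py call the dedup path needs:

```python
import hashlib
import time

class FakeRedis:
    """In-memory stand-in for redis-py's set(nx=..., ex=...)."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set(self, key, value, nx=False, ex=None):
        now = time.time()
        entry = self.store.get(key)
        if entry and entry[1] > now and nx:
            return None  # key exists -> SETNX fails, like redis-py
        self.store[key] = (value, now + (ex or float("inf")))
        return True

def is_duplicate(r, idempotency_key: str, ttl: int = 86400) -> bool:
    """Same logic as NotificationService._is_duplicate."""
    return not r.set(f"dedup:{idempotency_key}", 1, nx=True, ex=ttl)

r = FakeRedis()
key = hashlib.sha256(b"order-service:evt-42:user-7").hexdigest()
first = is_duplicate(r, key)   # False -- first delivery goes through
second = is_duplicate(r, key)  # True  -- producer retry is filtered
```

Because the key is derived from `event_source:event_id:user_id`, any upstream retry of the same event maps to the same key and is dropped inside the TTL window.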

6.2 Python: FCM Push Notification Sender

"""
Push Notification Worker — Consumes from Kafka, sends via FCM/APNs.
Handles retries with exponential backoff.
"""
 
import json
import time
import random
import logging
from dataclasses import dataclass
 
import requests
from confluent_kafka import Consumer, Producer
 
logger = logging.getLogger(__name__)
 
 
@dataclass
class RetryConfig:
    max_retries: int
    base_delay_seconds: float
    max_delay_seconds: float
 
 
RETRY_CONFIGS = {
    "urgent": RetryConfig(max_retries=5, base_delay_seconds=1, max_delay_seconds=30),
    "normal": RetryConfig(max_retries=3, base_delay_seconds=30, max_delay_seconds=300),
    "low": RetryConfig(max_retries=3, base_delay_seconds=300, max_delay_seconds=3600),
}
 
 
class FCMPushSender:
    """Send push notifications via Firebase Cloud Messaging HTTP v1 API."""
 
    FCM_URL = "https://fcm.googleapis.com/v1/projects/{project_id}/messages:send"
 
    def __init__(self, project_id: str, service_account_token: str):
        self.url = self.FCM_URL.format(project_id=project_id)
        # NOTE: FCM v1 uses short-lived OAuth2 access tokens (~1 hour).
        # In production, refresh via google-auth rather than passing a static string.
        self.headers = {
            "Authorization": f"Bearer {service_account_token}",
            "Content-Type": "application/json",
        }
 
    def send(self, device_token: str, title: str, body: str, data: dict = None) -> dict:
        """
        Send push notification to a single device via FCM.
        Returns: {"success": True, "message_id": "..."} or {"success": False, "error": "..."}
        """
        payload = {
            "message": {
                "token": device_token,
                "notification": {
                    "title": title,
                    "body": body,
                },
            }
        }
        if data:
            payload["message"]["data"] = {k: str(v) for k, v in data.items()}
 
        try:
            response = requests.post(
                self.url, headers=self.headers, json=payload, timeout=5
            )
 
            if response.status_code == 200:
                return {"success": True, "message_id": response.json().get("name")}
            elif response.status_code == 404:
                # Invalid device token — permanent failure
                return {
                    "success": False,
                    "error": "INVALID_TOKEN",
                    "retriable": False,
                }
            elif response.status_code == 429:
                # Rate limited by FCM
                return {
                    "success": False,
                    "error": "RATE_LIMITED",
                    "retriable": True,
                }
            else:
                return {
                    "success": False,
                    "error": f"HTTP_{response.status_code}",
                    "retriable": response.status_code >= 500,
                }
 
        except requests.exceptions.Timeout:
            return {"success": False, "error": "TIMEOUT", "retriable": True}
        except requests.exceptions.ConnectionError:
            return {"success": False, "error": "CONNECTION_ERROR", "retriable": True}
 
 
class PushWorker:
    """
    Kafka consumer worker that processes push notification messages.
    Handles retry with exponential backoff and DLQ routing.
    """
 
    def __init__(
        self,
        consumer: Consumer,
        producer: Producer,
        fcm_sender: FCMPushSender,
        redis_client,
    ):
        self.consumer = consumer
        self.producer = producer
        self.fcm = fcm_sender
        self.redis = redis_client
 
    def run(self):
        """Main consumer loop."""
        topics = [
            "notification.push.urgent",
            "notification.push.normal",
            "notification.push.low",
        ]
        self.consumer.subscribe(topics)
        logger.info(f"Push worker subscribed to {topics}")
 
        while True:
            msg = self.consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                logger.error(f"Consumer error: {msg.error()}")
                continue
 
            try:
                self._process_message(json.loads(msg.value().decode("utf-8")))
                self.consumer.commit(msg)
            except Exception as e:
                logger.exception(f"Failed to process message: {e}")
 
    def _process_message(self, message: dict):
        """Process a single notification message."""
        priority = message["priority"]
        retry_count = message.get("retry_count", 0)
        retry_config = RETRY_CONFIGS[priority]
 
        for token in message["device_tokens"]:
            result = self.fcm.send(
                device_token=token,
                title=message["content"].get("title", ""),
                body=message["content"].get("body", ""),
                data={"notification_id": message["notification_id"]},
            )
 
            if result["success"]:
                logger.info(
                    f"Push sent: {message['notification_id']} -> {token[:20]}..."
                )
                self._update_status(message["notification_id"], "SENT")
            elif not result.get("retriable", False):
                # Permanent failure — e.g., invalid token
                logger.warning(
                    f"Permanent failure for token {token[:20]}: {result['error']}"
                )
                self._remove_invalid_token(message["user_id"], token)
                self._update_status(message["notification_id"], "FAILED")
            elif retry_count < retry_config.max_retries:
                # Retriable failure — schedule retry
                delay = self._calculate_backoff(
                    retry_count, retry_config.base_delay_seconds,
                    retry_config.max_delay_seconds,
                )
                logger.info(
                    f"Scheduling retry #{retry_count + 1} in {delay:.1f}s "
                    f"for {message['notification_id']}"
                )
                message["retry_count"] = retry_count + 1
                message["scheduled_at"] = time.time() + delay
                topic = f"notification.push.{priority}"
                self.producer.produce(
                    topic, value=json.dumps(message).encode("utf-8")
                )
                self.producer.flush()
            else:
                # Exhausted retries — send to DLQ
                logger.error(
                    f"Max retries exhausted for {message['notification_id']}"
                )
                self.producer.produce(
                    "notification.dlq",
                    value=json.dumps(message).encode("utf-8"),
                )
                self.producer.flush()
                self._update_status(message["notification_id"], "DLQ")
 
    @staticmethod
    def _calculate_backoff(
        retry_count: int, base_delay: float, max_delay: float
    ) -> float:
        """Exponential backoff with jitter."""
        delay = min(base_delay * (2 ** retry_count), max_delay)
        jitter = random.uniform(0, 1.0)  # Up to 1 second jitter
        return delay + jitter
 
    def _remove_invalid_token(self, user_id: str, token: str):
        """Remove invalid device token from Redis."""
        devices = self.redis.hgetall(f"devices:{user_id}")
        for device_id, device_data in devices.items():
            data = json.loads(device_data)
            if data.get("token") == token:
                self.redis.hdel(f"devices:{user_id}", device_id)
                logger.info(f"Removed invalid token for user {user_id}")
                break
 
    def _update_status(self, notification_id: str, status: str):
        """Update notification status in log (Cassandra in production)."""
        # In production: write to Cassandra notification_log table
        logger.info(f"Status update: {notification_id}{status}")

6.3 Python: Email Sender with Retry and Bounce Handling

"""
Email Worker — Sends emails via Amazon SES with bounce handling.
"""
 
import json
import logging
from dataclasses import dataclass
 
import boto3
from botocore.exceptions import ClientError
 
logger = logging.getLogger(__name__)
 
 
@dataclass
class EmailResult:
    success: bool
    message_id: str = ""
    error: str = ""
    error_type: str = ""  # "hard_bounce", "soft_bounce", "throttle", "other"
    retriable: bool = False
 
 
class SESEmailSender:
    """Send emails via Amazon SES with proper error categorization."""
 
    def __init__(self, region: str = "us-east-1", config_set: str = "notification"):
        self.client = boto3.client("ses", region_name=region)
        self.config_set = config_set
 
    def send(
        self,
        to_address: str,
        subject: str,
        html_body: str,
        from_address: str = "[email protected]",
        reply_to: str = None,
        headers: dict = None,
    ) -> EmailResult:
        """Send a single email via SES."""
        try:
            message = {
                "Subject": {"Data": subject, "Charset": "UTF-8"},
                "Body": {
                    "Html": {"Data": html_body, "Charset": "UTF-8"},
                },
            }
 
            kwargs = {
                "Source": from_address,
                "Destination": {"ToAddresses": [to_address]},
                "Message": message,
                "ConfigurationSetName": self.config_set,
            }
 
            if reply_to:
                kwargs["ReplyToAddresses"] = [reply_to]
 
            # CAN-SPAM requires a List-Unsubscribe header, but the classic SES
            # send_email API cannot set custom headers. We record the URL as a
            # message tag here; production code should use send_raw_email
            # (or SESv2) to emit the actual header.
            if headers and "List-Unsubscribe" in headers:
                kwargs["Tags"] = [
                    {"Name": "unsubscribe_url", "Value": headers["List-Unsubscribe"]}
                ]
 
            response = self.client.send_email(**kwargs)
 
            return EmailResult(
                success=True,
                message_id=response["MessageId"],
            )
 
        except ClientError as e:
            error_code = e.response["Error"]["Code"]
 
            if error_code == "MessageRejected":
                # Typically: email address on suppression list
                return EmailResult(
                    success=False,
                    error=str(e),
                    error_type="hard_bounce",
                    retriable=False,
                )
            elif error_code == "Throttling":
                return EmailResult(
                    success=False,
                    error="SES rate limit exceeded",
                    error_type="throttle",
                    retriable=True,
                )
            elif error_code == "AccountSendingPausedException":
                # SES suspended sending — critical alert needed
                logger.critical("SES account sending is paused!")
                return EmailResult(
                    success=False,
                    error="SES sending paused",
                    error_type="other",
                    retriable=False,
                )
            else:
                return EmailResult(
                    success=False,
                    error=str(e),
                    error_type="other",
                    retriable=True,
                )
 
        except Exception as e:
            return EmailResult(
                success=False,
                error=str(e),
                error_type="other",
                retriable=True,
            )
 
 
class BounceHandler:
    """
    Process SES bounce/complaint notifications via SNS webhook.
    Updates user preferences to prevent future sends to invalid addresses.
    """
 
    def __init__(self, redis_client):
        self.redis = redis_client
 
    def handle_sns_notification(self, sns_message: dict):
        """Process SNS notification from SES feedback."""
        notif_type = sns_message.get("notificationType")
 
        if notif_type == "Bounce":
            self._handle_bounce(sns_message["bounce"])
        elif notif_type == "Complaint":
            self._handle_complaint(sns_message["complaint"])
        elif notif_type == "Delivery":
            self._handle_delivery(sns_message["delivery"])
 
    def _handle_bounce(self, bounce: dict):
        bounce_type = bounce["bounceType"]  # "Permanent" or "Transient"
 
        for recipient in bounce["bouncedRecipients"]:
            email = recipient["emailAddress"]
 
            if bounce_type == "Permanent":
                # Hard bounce — disable email for this user
                logger.warning(f"Hard bounce for {email} — disabling email")
                self._disable_email_for_user(email)
            else:
                # Soft bounce — log, but don't disable yet
                logger.info(f"Soft bounce for {email} — monitoring")
 
    def _handle_complaint(self, complaint: dict):
        """User clicked 'Report Spam' — immediately unsubscribe."""
        for recipient in complaint["complainedRecipients"]:
            email = recipient["emailAddress"]
            logger.warning(f"Spam complaint from {email} — unsubscribing immediately")
            self._disable_email_for_user(email)
 
    def _handle_delivery(self, delivery: dict):
        """Successful delivery confirmation."""
        for recipient in delivery["recipients"]:
            logger.info(f"Email delivered to {recipient}")
 
    def _disable_email_for_user(self, email: str):
        """
        Find user by email and disable email notifications.
        In production: lookup user by email in DB, then update Redis prefs.
        """
        # Simplified — in production: DB lookup
        # user_id = db.query("SELECT id FROM users WHERE email = ?", email)
        # self.redis.hset(f"prefs:{user_id}", "email_enabled", "false")
        pass

7. Aha Moments — Summary

#1 — Template versioning: when you update a template, notifications already sitting in the queue still reference the old version. Always embed template_version in the message; otherwise users can receive notifications in the wrong format when a template changes mid-flight.
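A minimal sketch of the version-pinning idea; `TEMPLATE_STORE`, `ACTIVE_VERSION`, and the template strings are hypothetical names for illustration:

```python
# Hypothetical template store keyed by (type, version)
TEMPLATE_STORE = {
    ("order_shipped", 1): "Order #{order_id} shipped",
    ("order_shipped", 2): "Good news! Order #{order_id} is on its way",
}
ACTIVE_VERSION = {"order_shipped": 2}

def build_message(notif_type: str, params: dict) -> dict:
    # Pin the version ONCE, at enqueue time
    return {"type": notif_type,
            "template_version": ACTIVE_VERSION[notif_type],
            "params": params}

def render(message: dict) -> str:
    # The worker renders with the pinned version, even if ACTIVE_VERSION moved on
    out = TEMPLATE_STORE[(message["type"], message["template_version"])]
    for k, v in message["params"].items():
        out = out.replace(f"{{{k}}}", str(v))
    return out

msg = build_message("order_shipped", {"order_id": 99})
ACTIVE_VERSION["order_shipped"] = 1   # template "changes" while msg is queued
rendered = render(msg)                # still rendered with version 2
```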

#2 — Email deliverability is an ecosystem of its own: SPF + DKIM + DMARC + dedicated IPs + IP warming + bounce handling + complaint handling. Skip any one step and email lands in spam; recovering from a blacklist takes weeks.

#3 — SMS is the most expensive channel: push is nearly free (FCM/APNs do not charge), while SMS costs about $0.04/message. At 50M SMS/day that is roughly $2M/day. This is why the preference cascade (push > email > SMS) is mandatory.

#4 — Deduplication window trade-off: a TTL that is too short misses duplicates and annoys users; one that is too long bloats Redis memory. A 24h window with SHA-256 keys (32 bytes) for 500M notifications is ~16 GB of Redis, which is acceptable.

#5 — Separate queues per priority: with a single queue plus a priority field, a marketing blast will delay OTPs. Separate queues with dedicated worker pools give resource isolation, so urgent notifications are never affected by bulk sends.

#6 — Quiet hours + timezone: a marketing notification at 3 AM causes rage uninstalls, but an OTP at 3 AM MUST be sent. A priority-based quiet-hours bypass is the kind of design nuance interviewers reward.

#7 — Event aggregation reduces notification fatigue: one "10 people liked your post" beats ten separate "X liked your post" notifications. A 5-minute aggregation window backed by a Redis Sorted Set is a production-ready pattern.
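A pure-Python sketch of the aggregation rule; in production the per-post event buffer would be the Redis Sorted Set scored by timestamp, and a plain list stands in for it here:

```python
import time

AGG_WINDOW = 300  # 5-minute aggregation window, in seconds

def aggregate_likes(events, now=None):
    """
    events: list of (timestamp, actor_name) like-events for one post.
    Collapses everything inside the window into a single notification text.
    """
    now = now or time.time()
    recent = [actor for ts, actor in events if now - ts <= AGG_WINDOW]
    if not recent:
        return None
    if len(recent) == 1:
        return f"{recent[0]} liked your post"
    return f"{recent[0]} and {len(recent) - 1} others liked your post"

now = 1_700_000_000
events = [(now - 10, "An"), (now - 120, "Binh"), (now - 290, "Chi"),
          (now - 900, "Dung")]  # Dung falls outside the 5-minute window
text = aggregate_likes(events, now=now)  # "An and 2 others liked your post"
```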


8. Common Pitfalls

Pitfall 1: Notification Fatigue

Wrong: send a notification for every event ("A liked your post", "B liked your post", "C liked your post"...). Right: aggregate events, rate-limit per user, respect quiet hours, and give users granular opt-out (per type, per channel). Every notification must earn the right to be sent.

Pitfall 2: Thundering Herd on Mass Notification

Wrong: a marketing campaign sends 100M push notifications at once, causing a 100x spike over normal QPS: Kafka lag, FCM rate limits, worker crashes. Right: staggered delivery that spreads the 100M notifications over 30-60 minutes, a global rate limit on marketing sends, and separate Kafka partitions for marketing vs transactional traffic.
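A back-of-envelope sketch of staggered delivery; the 45-minute window and 500K batch size are assumed tuning values inside the 30-60 minute guideline:

```python
import math

def stagger_plan(total: int, window_minutes: int, batch_size: int):
    """
    Spread `total` sends over `window_minutes` in fixed-size batches.
    Returns (batches, seconds_between_batch_starts, sustained_sends_per_sec).
    """
    batches = math.ceil(total / batch_size)
    interval = (window_minutes * 60) / batches
    rate = total / (window_minutes * 60)
    return batches, interval, rate

# 100M pushes over 45 minutes in 500K-message batches
batches, interval, rate = stagger_plan(100_000_000, 45, 500_000)
# 200 batches, one every 13.5s, ~37K sends/sec sustained
```

A sustained ~37K sends/sec is a steady load workers can be sized for, instead of a single 100M-message wall hitting Kafka and FCM at once.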

Pitfall 3: SMS Cost Explosion

Wrong: default to SMS for every notification type. Right: SMS only for URGENT (OTP, security). Preference cascade: push first, then email, then SMS as a last resort. Budget alerts when projected daily SMS cost exceeds a threshold. Geographic routing (local SMS providers are cheaper than Twilio in many countries).
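The cascade can be sketched as a cheapest-first channel selector; the flag names in `prefs` are hypothetical:

```python
def pick_channel(prefs: dict, has_push_token: bool) -> str:
    """
    Cheapest-first cascade: push -> email -> SMS.
    prefs: per-channel opt-in flags, e.g. {"push": True, "email": True, "sms": True}
    """
    if prefs.get("push") and has_push_token:
        return "push"   # effectively free (FCM/APNs)
    if prefs.get("email"):
        return "email"  # cheap, slower
    if prefs.get("sms"):
        return "sms"    # last resort: the expensive channel
    return "none"
```

URGENT notifications would bypass this selector and fan out to every enabled channel; the cascade applies to normal and low-priority traffic where cost matters.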

Pitfall 4: Timezone-Unaware Delivery

Wrong: schedule a "morning notification" at 9 AM UTC, so a user in Vietnam gets it at 4 PM and a user on the US West Coast at 1 AM. Right: store the user's timezone, convert the scheduled time to their local time, and buffer in the queue during quiet hours, sending once they end.
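A sketch of the timezone-aware check using the stdlib `zoneinfo`; the 22:00-07:00 quiet window is an assumed default:

```python
from datetime import datetime, time as dtime
from zoneinfo import ZoneInfo

def in_quiet_hours(utc_now: datetime, tz: str,
                   start: dtime = dtime(22, 0), end: dtime = dtime(7, 0)) -> bool:
    """True if the user's LOCAL time is inside [start, end); the window may
    cross midnight, as the default 22:00 -> 07:00 does."""
    local = utc_now.astimezone(ZoneInfo(tz)).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end  # window crosses midnight

# A "9 AM" campaign scheduled naively at 09:00 UTC:
now = datetime(2024, 1, 15, 9, 0, tzinfo=ZoneInfo("UTC"))
vn = in_quiet_hours(now, "Asia/Ho_Chi_Minh")     # 16:00 local -> not quiet
la = in_quiet_hours(now, "America/Los_Angeles")  # 01:00 local -> quiet hours
```

A non-urgent message for the LA user would be buffered and released after 07:00 local time, while the Vietnam user receives it immediately.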

Pitfall 5: Ignoring Device Token Lifecycle

Wrong: store a device token once and use it forever. Right: tokens change on app reinstall, OS update, or token refresh. When FCM reports an invalid token (InvalidRegistration in the legacy API, UNREGISTERED in HTTP v1), delete it. Refresh the token on every app open; otherwise push failure rates climb steadily over time (token decay).

Pitfall 6: Single Point of Failure in Template Engine

Wrong: the template engine is a single service, so when it goes down every notification fails. Right: cache rendered templates, fall back to plain text when the template service is unavailable, keep the engine stateless and horizontally scalable, and pre-render marketing templates before each campaign.
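A sketch of the plain-text fallback; `render_with_fallback` and `broken_renderer` are hypothetical names for illustration:

```python
def render_with_fallback(render_fn, notif_type: str, params: dict) -> dict:
    """
    If the template service is down or the template is missing, degrade to a
    plain-text body instead of failing the whole notification.
    """
    try:
        content = render_fn(notif_type, params)
        if content:
            return content
    except Exception:
        pass  # log in production; never let a template failure block delivery
    return {"title": notif_type.replace("_", " ").title(),
            "body": ", ".join(f"{k}: {v}" for k, v in params.items())}

def broken_renderer(notif_type, params):
    raise ConnectionError("template service unavailable")

content = render_with_fallback(broken_renderer, "order_shipped", {"order_id": 99})
# Degrades to {"title": "Order Shipped", "body": "order_id: 99"}
```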

Pitfall 7: No Idempotency → Duplicate Notifications

Wrong: retry on a network timeout without a duplicate check, so the user receives "Order shipped" three times. Right: idempotency key = hash(event_source + event_id + user_id + channel), stored with SETNX in Redis plus a TTL. Producer retries are then safe because duplicates are filtered out.


Step 4 — Wrap Up

Architecture Summary

| Layer | Components | Key Design Decision |
|---|---|---|
| Ingestion | Notification Service API | Validation + dedup + preference check before anything enters the queue |
| Processing | Template Engine, Rate Limiter, Priority Router | Separated concerns; each component scales independently |
| Queueing | Kafka (9 topics: 3 channels x 3 priorities) | Separate queues for resource isolation |
| Delivery | Push/Email/SMS Workers | Stateless, autoscaled on Kafka lag |
| Reliability | Retry + DLQ | Exponential backoff + jitter; permanent vs transient failures |
| Data | Cassandra (log), Redis (prefs + dedup), ClickHouse (analytics) | Right tool for each access pattern |

Scalability Dimensions

| If you need to scale… | Solution |
|---|---|
| More notifications/sec | Add Kafka partitions + more workers |
| More channels (WhatsApp, Telegram) | Add a new Kafka topic + a new worker type — plug-in architecture |
| More notification types | Add a new template — zero code change |
| Global deployment | Multi-region Kafka + region-local workers + local SMS providers |

What interviewers reward

  1. Priority-based queueing with resource isolation (not a single queue + priority field)
  2. Rate limiting at multiple levels (per-user, per-channel, per-type, global)
  3. Effectively exactly-once delivery via idempotency keys
  4. Email deliverability awareness (SPF/DKIM/DMARC, bounce handling, IP warming)
  5. Cost optimization for the SMS channel
  6. Timezone-aware delivery with quiet hours
  7. Event aggregation for batch notifications
  8. An analytics pipeline kept separate from the delivery path

Next week: Tuan-20-Design-Key-Value-Store — Distributed KV Store design deep dive