Bonus Week: Multi-Tenancy SaaS Patterns
“Day 1: single tenant, everything is simple. Day 100: 1,000 tenants, and one ‘noisy neighbor’ enterprise customer is choking the other 999. Day 365: a compliance audit demands ‘prove that tenant A’s data cannot leak into tenant B’ — and you have no answer. Multi-tenancy patterns are the architecture that decides whether a SaaS lives or dies.”
Tags: system-design multi-tenancy saas isolation noisy-neighbor bonus Student: Hieu (Backend Dev → Architect) Prerequisite: Tuan-07-Database-Sharding-Replication · Tuan-09-Rate-Limiter · Tuan-11-Microservices-Pattern Related: Tuan-Bonus-Multi-Region-Active-Active-DSQL · Case-Design-Payment-System · Tuan-14-AuthN-AuthZ-Security
1. Context & Why
An everyday analogy — real estate models
Hieu, imagine you manage three housing models:
Silo (detached houses):
- Each family gets its own house, its own land, its own fence
- Absolute privacy, 100% isolation
- Expensive: one house per family
- If house A burns down, only A loses — the fire doesn't spread to B
Pool (apartment building):
- One building, many units
- Shared resources (elevators, hallways, plumbing)
- Cheap: costs are shared
- One unit catches fire → it can spread; one noisy resident → the whole floor hears it
Hybrid (tiered complex):
- Tier 1 (Free): a crowded apartment block
- Tier 2 (Pro): an upscale, low-density building
- Tier 3 (Enterprise): a private villa
- Flexible: pay more, get more isolation
These are exactly the three multi-tenancy models in SaaS:
| Model | Isolation | Cost/tenant | Use case |
|---|---|---|---|
| Silo | Highest | Highest | Enterprise, regulated industries |
| Pool | Low | Lowest | Free tier, startups, B2C |
| Hybrid (Bridge) | Per tier | Per tier | SaaS with multiple plans |
Why does a backend dev need to understand multi-tenancy?
| Reason | Consequence if you get it wrong |
|---|---|
| Most products are SaaS | The problem is multi-tenant by default |
| Compliance (SOC 2, HIPAA, PCI) | “Prove isolation” — without the right architecture, the audit fails |
| Noisy neighbor | One large tenant can effectively DDoS the whole platform |
| Cost economics | Pool for the free tier (cheap), silo for enterprise (premium) |
| Data residency | EU tenants must store data in the EU; per-tenant region selection |
| BYOK (Bring Your Own Key) | Enterprise customers demand it; requires tenant-level encryption |
Why doesn't Alex Xu go deep here?
Alex Xu Vol 1+2 covers sharding (tenant-by-shard) but not the full multi-tenancy spectrum: the pool model, RLS, K8s namespace-per-tenant, blast-radius isolation. Multi-tenancy is a cross-cutting concern that shows up at every layer (DB, app, network, observability).
Primary references
- AWS SaaS Tenant Isolation Strategies (whitepaper) — https://d1.awsstatic.com/whitepapers/saas-tenant-isolation-strategies.pdf
- AWS SaaS Builder Toolkit (SBT) — https://github.com/awslabs/sbt-aws
- Tenancy patterns (Microsoft Learn) — https://learn.microsoft.com/en-us/azure/architecture/guide/multitenant/considerations/tenancy-models
- Neon, Noisy Neighbor problem — https://neon.tech/blog/noisy-neighbors
2. Deep Dive — Core Concepts
2.1 Tenancy Models — The Three Main Ones
2.1.1 Silo Model (1 tenant = 1 stack)
Tenant A: Tenant B: Tenant C:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ App A │ │ App B │ │ App C │
└────┬─────┘ └────┬─────┘ └────┬─────┘
┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐
│ DB A │ │ DB B │ │ DB C │
└──────────┘ └──────────┘ └──────────┘
Pros:
- Maximum isolation (security, performance, compliance)
- Per-tenant customization (different versions, features)
- Easy to migrate / archive 1 tenant
Cons:
- High operational overhead (1000 tenants = 1000 DB instances)
- Expensive
- Hard to push fix to all tenants quickly
- Complex deployment automation
Use case: Enterprise SaaS (Salesforce big customers), regulated (HIPAA/PCI).
2.1.2 Pool Model (all tenants share)
┌──────────────┐
│ Shared App │
│ (multi-tenant)│
└──────┬───────┘
│
┌──────▼───────┐
│ Shared DB │
│ (tenant_id │
│ column + │
│ RLS) │
└──────────────┘
Pros:
- Lowest cost (high tenant density)
- Easy upgrades (1 deploy → all tenants)
- Operational simplicity
Cons:
- Noisy neighbor risk
- Higher blast radius (1 bad query → all tenants slow)
- Compliance harder to prove
- “Account takeover” can leak across tenants if there is an RLS bug
Use case: B2C, freemium SaaS, low-value tenants.
2.1.3 Hybrid / Bridge / Pod Model
Concept: Combine — pool by default, silo for enterprise.
Tier 1 (Free): [Shared Pool — 10K tenants]
Tier 2 (Pro): [Pod 1: 100 tenants] [Pod 2: 100 tenants] ...
Tier 3 (Enterprise): [Silo 1] [Silo 2] [Silo 3]
Implementation patterns:
- Sharded multi-tenant (Shopify pattern): pods of N tenants
- Tier-based: Different deployments per tier
- Cell-based: Independent cells (limit blast radius)
Pros:
- Best of both worlds
- Right-cost per tenant
- Flexibility to upgrade tenant tier
Cons:
- Operational complexity (multiple deployment shapes)
- Cross-tier feature parity tricky
2.2 Database Tenancy Patterns
2.2.1 Database per tenant (Silo)
-- Each tenant has own database
CREATE DATABASE tenant_acme;
CREATE DATABASE tenant_widget_co;
-- The app picks the tenant's connection string from the identifier in the URL/header
-- (connect to the tenant's database directly — see the routing sketch below)
Pros: Maximum isolation, easy backup/restore per tenant
Cons: 1000 tenants = 1000 DBs → connection-pool nightmare
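A routing layer for the silo model can be as small as a catalog lookup. A minimal sketch follows — the catalog dict and helper names are assumptions, not part of any framework: resolve the tenant's dedicated DSN, then connect to that database.
# Minimal sketch (assumed catalog/helper names): resolve a tenant's
# dedicated DSN from a control-plane catalog, then connect to it.
import asyncpg

tenant_dsns: dict[str, str] = {}  # tenant_id -> DSN, loaded from a catalog table

async def get_tenant_connection(tenant_id: str) -> asyncpg.Connection:
    dsn = tenant_dsns.get(tenant_id)
    if dsn is None:
        raise LookupError(f"unknown tenant: {tenant_id}")
    # Production code would keep one pool per tenant database;
    # a raw connection keeps the sketch short.
    return await asyncpg.connect(dsn)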
2.2.2 Schema per tenant (PostgreSQL)
-- Single database, schema per tenant
CREATE SCHEMA tenant_acme;
CREATE SCHEMA tenant_widget;
CREATE TABLE tenant_acme.users (...);
CREATE TABLE tenant_widget.users (...);
-- App sets search_path
SET search_path = tenant_acme;
SELECT * FROM users; -- Resolves to tenant_acme.users
Pros: Easier than DB-per-tenant, still good isolation
Cons: Migration complexity (apply to every schema — see the sketch below), connection pooling tricky
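The migration pain is concrete: every DDL change must be replayed once per schema. A hedged sketch with asyncpg, discovering schemas from information_schema; `migration_sql` is whatever your migration tool emits:
# Hedged sketch: replay one migration across every tenant schema.
import asyncpg

async def migrate_all_schemas(pool: asyncpg.Pool, migration_sql: str) -> None:
    async with pool.acquire() as conn:
        schemas = await conn.fetch(
            "SELECT schema_name FROM information_schema.schemata "
            "WHERE schema_name LIKE 'tenant_%'"
        )
    for record in schemas:
        schema = record["schema_name"]
        async with pool.acquire() as conn:
            async with conn.transaction():
                # Scope unqualified table names to this tenant's schema
                await conn.execute(f'SET LOCAL search_path = "{schema}"')
                await conn.execute(migration_sql)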
2.2.3 Shared schema with tenant_id (Pool)
-- Single schema, tenant_id column on every table
CREATE TABLE users (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    email TEXT,
    ...
);
-- Index includes tenant_id
CREATE INDEX idx_users_tenant ON users (tenant_id, email);
-- Every query MUST filter by tenant_id
SELECT * FROM users WHERE tenant_id = $1 AND email = $2;
Risk: Forget the WHERE tenant_id filter once → leak across tenants. Use Row-Level Security (RLS).
2.2.4 PostgreSQL Row-Level Security (RLS)
-- Enable RLS
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
-- Policy: users see only their tenant
CREATE POLICY tenant_isolation ON users
USING (tenant_id = current_setting('app.tenant_id')::uuid);
-- App sets tenant context per session
SET app.tenant_id = 'acme-uuid-here';
SELECT * FROM users; -- Auto-filtered to the acme tenant
Magic: Even if the app forgets the WHERE clause, RLS prevents a cross-tenant leak.
Performance: RLS adds query plan overhead (~5-10%). Add tenant_id to indexes.
Tools:
- pgvector: also supports RLS
- Supabase: built on Postgres RLS
- PostGraphile: builds authorization on top of Postgres RLS
2.2.5 Modern Alternative — Neon Branching
Neon (serverless Postgres) supports per-tenant branching:
- Each tenant = separate Postgres branch
- Copy-on-write storage → cheap
- Independent compute (autosuspend)
- Sub-second branch creation
Use case: Per-tenant dev/staging environments, “free tier with own DB”.
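If provisioning is driven from code, branch creation is one HTTP call. A hedged sketch — the v2 endpoint and payload shape below are assumptions drawn from Neon's public API and should be verified against current docs:
# Hedged sketch: create a per-tenant branch through Neon's HTTP API
# (endpoint/payload are assumptions — verify against Neon's docs).
import httpx

async def create_tenant_branch(project_id: str, tenant_id: str, api_key: str) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"https://console.neon.tech/api/v2/projects/{project_id}/branches",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"branch": {"name": f"tenant-{tenant_id}"}},
        )
        resp.raise_for_status()
        return resp.json()  # includes the branch id and connection info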
2.3 Network-Level Isolation
2.3.1 Kubernetes Namespace-per-Tenant
# Tenant gets own namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
---
# NetworkPolicy: deny cross-tenant traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-tenant
  namespace: tenant-acme
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: tenant-acme
        - namespaceSelector:
            matchLabels:
              name: shared-platform
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: tenant-acme
        - namespaceSelector:
            matchLabels:
              name: shared-platform
2.3.2 K8s vCluster (per-tenant virtual cluster)
vCluster (Loft Labs): Each tenant gets virtual K8s cluster inside shared physical cluster.
# Create vCluster for tenant
vcluster create acme-cluster --namespace tenant-acme
# Tenant has own k8s API, can deploy own resources
# Physical cluster admin doesn't see the tenant's resources
Pros: Strong K8s-level isolation; the tenant has admin rights inside its own vCluster
Cons: Resource overhead, complexity
2.3.3 Capsule
Capsule: K8s native multi-tenancy operator.
- Tenants = K8s CRD
- Quota enforcement
- Network policies per tenant
- RBAC per tenant
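As a sketch, a Capsule Tenant is just a cluster-scoped custom resource, so provisioning can go through the standard Kubernetes Python client. The group/version (`capsule.clastix.io/v1beta2`) and the `spec.owners` shape follow Capsule's docs but depend on the installed CRD version — treat them as assumptions:
# Hedged sketch: create a Capsule Tenant via the CustomObjects API.
from kubernetes import client, config

def create_capsule_tenant(name: str, owner: str) -> None:
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    api = client.CustomObjectsApi()
    tenant = {
        "apiVersion": "capsule.clastix.io/v1beta2",
        "kind": "Tenant",
        "metadata": {"name": name},
        "spec": {"owners": [{"name": owner, "kind": "User"}]},
    }
    # Tenant is cluster-scoped, hence the cluster-level call
    api.create_cluster_custom_object(
        group="capsule.clastix.io",
        version="v1beta2",
        plural="tenants",
        body=tenant,
    )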
2.4 Tenant Context Propagation
Critical: Every operation must know “which tenant”.
# Middleware extracts tenant from JWT/header
async def tenant_middleware(request, call_next):
    # 1. Extract from JWT
    token = request.headers.get("Authorization", "").replace("Bearer ", "")
    claims = jwt.decode(token, key, algorithms=["RS256"])
    tenant_id = claims["tenant_id"]
    # 2. Set context (contextvars are asyncio-safe)
    tenant_context.set(tenant_id)
    # 3. Set the DB session variable for RLS. Use set_config with a bind
    #    parameter — string-formatting the value into SET is an injection risk.
    async with get_db_pool().acquire() as conn:
        await conn.execute(
            "SELECT set_config('app.tenant_id', $1, false)", tenant_id
        )
    # 4. Attach tenant to logs/metrics
    log.contextualize(tenant_id=tenant_id)
    response = await call_next(request)
    response.headers["X-Tenant-ID"] = tenant_id
    return response
Common bugs:
- Background jobs without tenant context (cross-tenant leak)
- Logger forgets tenant_id (unable to debug per-tenant)
- Cache key doesn’t include tenant_id (cache leak — see the sketch below)
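The cache pitfall has a mechanical fix: never let application code build raw cache keys. A minimal sketch, assuming the tenant contextvar from the middleware above and the redis-py asyncio client:
# Minimal sketch: a cache wrapper that namespaces every key by the
# current tenant, so "user:123" can never collide across tenants.
from contextvars import ContextVar
import redis.asyncio as redis

tenant_ctx: ContextVar[str] = ContextVar("tenant_id")  # shared with the middleware

class TenantCache:
    def __init__(self, client: redis.Redis):
        self.client = client

    def _key(self, key: str) -> str:
        # Raises LookupError if no tenant is set — fail closed, never global
        return f"{tenant_ctx.get()}:{key}"

    async def get(self, key: str):
        return await self.client.get(self._key(key))

    async def set(self, key: str, value, ttl: int = 300):
        await self.client.set(self._key(key), value, ex=ttl)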
2.5 Noisy Neighbor Mitigation
Problem: 1 tenant runs heavy report query, blocks others.
2.5.1 Per-tenant rate limiting
# Fixed-window counter per tenant (a simple approximation of a token bucket)
class TenantRateLimiter:
    def __init__(self, redis):
        self.redis = redis

    async def check(self, tenant_id: str, action: str) -> bool:
        key = f"ratelimit:{tenant_id}:{action}"
        count = await self.redis.incr(key)
        if count == 1:
            await self.redis.expire(key, 60)  # 1-minute window
        return count <= self._limit_for(tenant_id, action)

    def _limit_for(self, tenant_id, action):
        tier = self._get_tier(tenant_id)  # tier lookup, e.g. from tenant config
        limits = {
            "free": {"api": 100, "report": 5},
            "pro": {"api": 1000, "report": 50},
            "enterprise": {"api": 10000, "report": 500},
        }
        return limits[tier][action]
2.5.2 Resource quotas (K8s)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-acme-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "20"
2.5.3 Database connection pooling per tenant
# PgBouncer settings per tenant for predictable resource usage
pgbouncer_config = {
    "tenant_acme": {
        "max_client_conn": 100,
        "default_pool_size": 20,
    },
    "tenant_widget": {
        "max_client_conn": 50,
        "default_pool_size": 10,
    },
}
2.5.4 Query timeouts per tenant
-- PostgreSQL: SET statement_timeout per session
SET app.tenant_id = 'tenant-acme';
SET statement_timeout = '5s'; -- Free tier
-- Enterprise tier:
SET statement_timeout = '60s';
2.5.5 Fair scheduling
Async job processing with fair queueing:
Naive: FIFO queue. 1 tenant submits 10K jobs → blocks others.
Fair: Round-robin across tenants:
Queue tenant-A: [j1, j2, j3, j4]
Queue tenant-B: [j5]
Queue tenant-C: [j6, j7]
Process: j1, j5, j6, j2, j7, j3, j4
→ No starvation
Tools: Sidekiq Enterprise (Ruby), AWS SQS FIFO with group ID.
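With SQS FIFO, the fairness unit is the MessageGroupId: messages are ordered per group, and consumers can make progress across groups, so one tenant's backlog does not serialize everyone else. A hedged boto3 sketch (the queue URL is a placeholder):
# Hedged sketch: per-tenant MessageGroupId on an SQS FIFO queue.
import json
import uuid
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs.fifo"  # placeholder

def submit_job(tenant_id: str, payload: dict) -> None:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(payload),
        MessageGroupId=tenant_id,                   # fairness/ordering unit
        MessageDeduplicationId=str(uuid.uuid4()),   # or enable content-based dedup
    )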
2.6 Tenant Onboarding & Provisioning
Automated provisioning pipeline:
async def onboard_tenant(name: str, tier: str):
    """Provision a new tenant idempotently."""
    tenant_id = generate_uuid()
    # 1. Database
    if tier == "enterprise":
        # Silo: dedicated DB
        await create_database(f"tenant_{tenant_id}")
        await run_migrations(tenant_id)
    else:
        # Pool: insert tenant record; RLS context handles isolation
        await db.execute(
            "INSERT INTO tenants (id, name, tier) VALUES ($1, $2, $3)",
            tenant_id, name, tier
        )
    # 2. K8s namespace (enterprise only)
    if tier == "enterprise":
        await k8s.create_namespace(f"tenant-{tenant_id}")
        await k8s.apply_quota(tenant_id, tier_to_quota(tier))
    # 3. Storage bucket
    await s3.create_bucket(f"tenant-{tenant_id}-uploads")
    # 4. Encryption key (BYOK pattern)
    if tier == "enterprise":
        kms_key = await kms.create_key(
            description=f"Tenant {name}",
            tags={"tenant_id": tenant_id}
        )
        await store_tenant_kms_key(tenant_id, kms_key)
    # 5. Default admin user
    admin_user = await create_user(
        email=name + "@admin.com",
        tenant_id=tenant_id,
        role="admin"
    )
    # 6. Welcome email + setup checklist
    await send_welcome(admin_user.email, tenant_id)
    return tenant_id
Idempotency: a failure mid-way must be safe to retry. Each step checks "already exists" (see the sketch below).
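An "already exists"-tolerant step looks like this in practice. A minimal sketch for the bucket step using boto3 (the helper name is an assumption):
# Minimal sketch: retried provisioning runs skip work already done.
import boto3

def ensure_bucket(bucket_name: str) -> None:
    s3 = boto3.client("s3")
    try:
        s3.create_bucket(Bucket=bucket_name)
    except s3.exceptions.BucketAlreadyOwnedByYou:
        pass  # a previous attempt created it — safe to continue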
2.7 Tenant Deletion (GDPR / offboarding)
async def delete_tenant(tenant_id: str):
    """GDPR-compliant tenant deletion."""
    # 1. Soft delete: mark for deletion (30-day grace period)
    await db.execute(
        "UPDATE tenants SET status = 'deleting', delete_after = NOW() + INTERVAL '30 days' WHERE id = $1",
        tenant_id
    )
    # 2. Disable access immediately
    await invalidate_all_tokens(tenant_id)
    await k8s.scale_namespace(f"tenant-{tenant_id}", replicas=0)
    # 3. After 30 days (cron job)
    if past_grace_period(tenant_id):
        # Delete data
        if is_silo_tenant(tenant_id):
            await drop_database(f"tenant_{tenant_id}")
        else:
            await db.execute(
                "DELETE FROM users WHERE tenant_id = $1", tenant_id
            )
            # ... all tenant tables
        # Delete S3
        await s3.delete_bucket(f"tenant-{tenant_id}-uploads")
        # Schedule KMS key deletion (AWS enforces a 7-30 day waiting period)
        await kms.schedule_key_deletion(get_tenant_kms_key(tenant_id), 30)
        # Delete K8s namespace
        await k8s.delete_namespace(f"tenant-{tenant_id}")
    # Audit log
    await audit_log.record(
        event="tenant_deleted",
        tenant_id=tenant_id,
        timestamp=now()
    )
2.8 Cell-based Multi-Tenancy
Pattern: Group N tenants into “cells”, each cell isolated.
Cell 1: Tenants A-F (max 100 tenants)
Cell 2: Tenants G-M
Cell 3: Tenants N-T
Cell 4: Tenants U-Z
Each cell:
- Own DB instance
- Own app deployment
- Independent failure domain
Pros:
- Blast radius = 1 cell (10-100 tenants)
- Easier capacity planning
- A/B test deploy at cell level
Cons:
- Needs a cell router — which cell does tenant X belong to? (see the sketch below)
- Operational overhead (N cells to manage)
Reference: Tuan-11-Microservices-Pattern, section 2.15 Cell-based Architecture.
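The cell router itself is usually an explicit lookup table rather than a hash function, so one tenant can be rebalanced to another cell without remapping everyone else. A minimal sketch (the endpoint map and assignment store are assumptions):
# Minimal cell-router sketch: explicit tenant→cell assignment.
CELL_ENDPOINTS = {
    "cell-1": "https://cell-1.internal.example.com",
    "cell-2": "https://cell-2.internal.example.com",
}

tenant_to_cell: dict[str, str] = {}  # loaded from a small, replicated control-plane store

def route(tenant_id: str) -> str:
    cell = tenant_to_cell.get(tenant_id)
    if cell is None:
        raise LookupError(f"tenant {tenant_id} is not assigned to a cell")
    return CELL_ENDPOINTS[cell]
Note that the lookup table is the one shared dependency in a cell architecture, so it must itself be highly available.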
3. Estimation
3.1 Density per pool
PostgreSQL pool model with proper indexes:
- 1 PG instance handles ~10K-100K small tenants
- Bottleneck: connection pooling (PgBouncer can handle 1000s of clients → 100 server connections)
Per-tenant overhead (pool, RLS):
- ~5-10% query overhead from RLS policy evaluation
- Minimal storage overhead (just tenant_id column)
3.2 Cost per tenant
| Model | Approx. cost per tenant/month |
|---|---|
| Pool (shared infra) | ~$1 |
| Pod (10-100 tenants per pod) | ~$50 |
| Silo (dedicated infra) | ~$500+ |
ROI calculation:
- Free tier (low ARPU $0): MUST be pool
- Pro tier ($50/month ARPU): Pod or pool
- Enterprise ($1000+/month ARPU): Silo OK
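A quick worked example of that tier math, using the illustrative numbers from the table above:
# Worked example of the tier math above (illustrative numbers only)
arpu = {"free": 0, "pro": 50, "enterprise": 1000}   # revenue per tenant/month
infra = {"pool": 1, "pod": 50, "silo": 500}         # cost per tenant/month

for tier, model in [("free", "pool"), ("pro", "pod"), ("enterprise", "silo")]:
    print(f"{tier}: ${arpu[tier] - infra[model]}/tenant/month margin on {model}")
# free: -$1 (loss leader), pro: $0 (breakeven — pool is safer), enterprise: $500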
3.3 Onboarding time
| Operation | Pool | Silo |
|---|---|---|
| Tenant creation | < 1s (insert row) | 5-30 min (provision DB, namespace, etc.) |
| Tenant deletion | Minutes (DELETE rows) | Hours (drop DB, cleanup) |
3.4 Noisy neighbor blast radius
| Model | Blast radius |
|---|---|
| Pure Pool | 100% (all tenants) |
| Pod | 1-10% (single pod) |
| Silo | 0.1% (single tenant) |
| Cell-based pod | 1% (single cell) |
4. Security First
4.1 The “Cross-tenant leak” nightmare
Worst case: Bug allows tenant A user to read tenant B data.
Defense in depth:
- App layer: Always filter by tenant_id (enforced in code review)
- DB layer: RLS policies (defense even if app forgets)
- Network: Per-tenant namespaces (vCluster, Capsule)
- Audit: Log every cross-tenant access attempt
- Testing: Automated tenant isolation tests
4.2 Tenant Isolation Testing
# Pytest: ensure tenant isolation
import pytest

@pytest.mark.parametrize("attacker_tenant,victim_tenant", [
    ("tenant-a", "tenant-b"),
])
def test_cannot_read_other_tenant(attacker_tenant, victim_tenant):
    # Create data in the victim tenant
    with set_tenant_context(victim_tenant):
        victim_user = create_user(email="[email protected]")
    # Try to read it as the attacker
    with set_tenant_context(attacker_tenant):
        result = get_user_by_email("[email protected]")
        assert result is None, "Cross-tenant leak detected!"
    # Try direct ID access
    with set_tenant_context(attacker_tenant):
        result = get_user_by_id(victim_user.id)
        assert result is None, "Cross-tenant leak via direct ID!"

# Run for ALL endpoints, not just a selection
def test_all_endpoints_isolated():
    for endpoint in get_all_api_endpoints():
        for method in endpoint.methods:
            # Attempt cross-tenant access
            assert is_endpoint_isolated(endpoint, method)
4.3 BYOK (Bring Your Own Key)
Enterprise tenants want to control their encryption keys.
# Per-tenant KMS key
def encrypt_for_tenant(tenant_id: str, plaintext: bytes) -> bytes:
    key_arn = get_tenant_kms_key(tenant_id)  # from tenant config
    return kms.encrypt(key_id=key_arn, plaintext=plaintext)

def decrypt_for_tenant(tenant_id: str, ciphertext: bytes) -> bytes:
    # KMS auto-detects the key from metadata embedded in the ciphertext
    return kms.decrypt(ciphertext=ciphertext)
Implications:
- Tenant can revoke key → permanent data loss for that tenant (intentional)
- Per-tenant KMS = per-tenant cost
- Compliance benefit: tenant proves control of their data
4.4 Audit Logging Per Tenant
Every action must be auditable per tenant:
CREATE TABLE audit_log (
    id BIGSERIAL,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    tenant_id UUID NOT NULL,
    user_id UUID,
    action TEXT NOT NULL,
    resource_type TEXT,
    resource_id TEXT,
    ip_address INET,
    user_agent TEXT,
    metadata JSONB,
    PRIMARY KEY (id, tenant_id)  -- partition key must be part of the PK
    -- Append-only: REVOKE UPDATE/DELETE from application roles
) PARTITION BY LIST (tenant_id);
-- Partition by tenant for efficient retrieval
CREATE TABLE audit_log_tenant_acme PARTITION OF audit_log
    FOR VALUES IN ('acme-tenant-uuid');
Forward to immutable storage (S3 Object Lock, blockchain) for compliance — see the sketch below.
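One way to do the immutable-storage step is S3 Object Lock in COMPLIANCE mode, which blocks modification and deletion until the retention date. A hedged boto3 sketch — the bucket must have been created with Object Lock enabled, and the names plus the roughly 7-year retention are placeholders:
# Hedged sketch: WORM archival of audit batches via S3 Object Lock.
import json
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")

def archive_audit_batch(tenant_id: str, batch_id: str, events: list[dict]) -> None:
    s3.put_object(
        Bucket="audit-archive",  # placeholder
        Key=f"{tenant_id}/{batch_id}.jsonl",
        Body="\n".join(json.dumps(e) for e in events).encode(),
        ObjectLockMode="COMPLIANCE",  # not even the root account can shorten it
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=2555),
    )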
4.5 Compliance Frameworks
| Framework | Multi-tenancy requirements |
|---|---|
| SOC 2 Type II | “Logical isolation” (RLS sufficient if properly tested) |
| HIPAA | Stronger isolation; many require silo for PHI |
| PCI DSS | Card data must be tokenized; cardholder env separated |
| GDPR | Tenant consent, data portability, right to delete |
| ISO 27001 | Risk assessment per tenant, controls documented |
| FedRAMP | High often requires gov cloud silo |
5. DevOps — Operating a Multi-Tenant SaaS
5.1 Provisioning Pipeline
# Argo Workflow: tenant provisioning
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: provision-tenant-
spec:
  entrypoint: provision
  arguments:
    parameters:
      - name: tenant-name
      - name: tier
  templates:
    - name: provision
      steps:
        - - name: create-namespace
            template: k8s-namespace
        - - name: create-database
            template: db-setup
        - - name: apply-quota
            template: resource-quota
        - - name: setup-monitoring
            template: monitoring-config
        - - name: send-welcome
            template: notify
    - name: k8s-namespace
      script:
        image: kubectl:latest
        command: [sh]
        source: |
          kubectl create namespace tenant-{{workflow.parameters.tenant-name}}
          kubectl label namespace tenant-{{workflow.parameters.tenant-name}} \
            tenant={{workflow.parameters.tenant-name}} \
            tier={{workflow.parameters.tier}}
5.2 Per-tenant monitoring
Critical: Dashboards filterable by tenant.
# Prometheus queries per tenant
rate(http_requests_total{tenant_id="acme"}[5m])
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{tenant_id="acme"}[5m])
)
# Per-tenant resource usage
sum(rate(container_cpu_usage_seconds_total{namespace=~"tenant-.*"}[5m]))
  by (namespace)
Grafana variable: a $tenant dropdown filters all panels.
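For those queries to resolve, the application must emit tenant_id as a metric label. A minimal sketch with prometheus_client; mind the cardinality caveat — with tens of thousands of tenants, label only paying tiers or pre-aggregate with recording rules:
# Minimal sketch: tenant_id as a label on application metrics.
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "HTTP requests", ["tenant_id", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["tenant_id"])

def record_request(tenant_id: str, status: int, duration_s: float) -> None:
    REQUESTS.labels(tenant_id=tenant_id, status=str(status)).inc()
    LATENCY.labels(tenant_id=tenant_id).observe(duration_s)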
5.3 Per-tenant alerting
- alert: TenantQuotaApproached
  expr: |
    (
      sum(container_memory_usage_bytes) by (namespace) /
      kube_resourcequota{type="hard", resource="requests.memory"}
    ) > 0.8
  for: 5m
  labels:
    severity: warning
    tenant: "{{ $labels.namespace }}"
  annotations:
    summary: "Tenant {{ $labels.namespace }} at 80% memory quota"
- alert: TenantSLOViolation
  expr: |
    (
      rate(http_requests_total{status=~"5..", tenant=~".+"}[5m]) /
      rate(http_requests_total{tenant=~".+"}[5m])
    ) > 0.01
  for: 10m
  labels:
    severity: critical
    tenant: "{{ $labels.tenant }}"
  annotations:
    summary: "SLO violation for tenant {{ $labels.tenant }}: > 1% error rate"
5.4 Tenant cost allocation
FinOps integration: Track infrastructure cost per tenant.
# Daily cost allocation job
def allocate_costs():
total_cost = aws.get_billing_total() # $X
# Allocation strategies:
# 1. By usage (compute time, storage)
tenant_usage = {}
for tenant_id in get_all_tenants():
tenant_usage[tenant_id] = {
"compute": prometheus.get_metric(
f'sum(container_cpu_usage_seconds_total{{tenant_id="{tenant_id}"}})'
),
"storage": s3.get_bucket_size(f"tenant-{tenant_id}-*"),
"network": cloudwatch.get_metric("NetworkOut", tenant_id),
}
# 2. By tier (proportional)
tier_weights = {"free": 0.1, "pro": 1.0, "enterprise": 10.0}
for tenant_id in get_all_tenants():
cost = calculate_tenant_cost(tenant_id, total_cost, tenant_usage)
save_cost_record(tenant_id, cost, date=today())5.5 Disaster scenarios
| Scenario | Mitigation |
|---|---|
| Cross-tenant data leak detected | Halt service, run security incident response and forensics, notify affected customers |
| 1 tenant overuses → impacts others | Quotas + circuit breaker; auto-isolate |
| Migration fails for some tenants | Per-tenant migration tracking, rollback per tenant |
| Tenant requests forensic copy | Pre-built tooling for per-tenant backup export |
6. Code Implementation
6.1 PostgreSQL RLS-based pool tenancy
-- Migration: enable multi-tenancy
ALTER TABLE users ADD COLUMN tenant_id UUID NOT NULL;
CREATE INDEX idx_users_tenant_email ON users (tenant_id, email);
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON users
FOR ALL
USING (tenant_id = current_setting('app.tenant_id')::uuid);
-- Repeat for all tenant tables
# FastAPI app with tenant context
from contextvars import ContextVar
from fastapi import FastAPI, Request, HTTPException, Depends
import jwt
import asyncpg

app = FastAPI()
tenant_ctx: ContextVar[str] = ContextVar("tenant_id")
db_pool: asyncpg.Pool = None  # initialized at startup

@app.middleware("http")
async def tenant_middleware(request: Request, call_next):
    auth = request.headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise HTTPException(401, "Missing token")
    try:
        claims = jwt.decode(
            auth[7:], JWT_SECRET, algorithms=["HS256"]
        )
        tenant_id = claims["tenant_id"]
    except jwt.InvalidTokenError:  # PyJWT's base decode error
        raise HTTPException(401, "Invalid token")
    tenant_ctx.set(tenant_id)
    response = await call_next(request)
    response.headers["X-Tenant-ID"] = tenant_id
    return response

async def get_db():
    """DB connection with the tenant RLS context set."""
    async with db_pool.acquire() as conn:
        async with conn.transaction():
            # SET does not take bind parameters; set_config does, and
            # is_local=true scopes the value to this transaction
            await conn.execute(
                "SELECT set_config('app.tenant_id', $1, true)",
                tenant_ctx.get()
            )
            yield conn

@app.get("/users")
async def list_users(conn=Depends(get_db)):
    # No need to filter by tenant_id — RLS does it
    rows = await conn.fetch("SELECT id, email FROM users LIMIT 100")
    return [dict(r) for r in rows]
6.2 Tenant context propagation across services
"""
Propagate tenant context across HTTP, gRPC, async jobs.
"""
import httpx
from contextvars import ContextVar
tenant_ctx: ContextVar[str] = ContextVar("tenant_id")
class TenantAwareHTTPClient(httpx.AsyncClient):
"""HTTP client that auto-propagates tenant context."""
async def request(self, method, url, **kwargs):
headers = kwargs.pop("headers", {})
try:
tenant_id = tenant_ctx.get()
headers["X-Tenant-ID"] = tenant_id
except LookupError:
pass # No tenant context (e.g., system call)
return await super().request(method, url, headers=headers, **kwargs)
# Usage
client = TenantAwareHTTPClient()
async with client:
# Tenant ID auto-included in headers
response = await client.get("https://internal-service/data")
# For async jobs (Celery, RQ, etc.)
class TenantAwareTask:
def apply_async(self, args, **kwargs):
try:
tenant_id = tenant_ctx.get()
kwargs["headers"] = kwargs.get("headers", {})
kwargs["headers"]["tenant_id"] = tenant_id
except LookupError:
pass
return super().apply_async(args, **kwargs)
# Worker side: restore context
@worker.before_task_publish.connect
def set_tenant_context(headers=None, **kwargs):
if headers and "tenant_id" in headers:
tenant_ctx.set(headers["tenant_id"])6.3 Fair queueing for async jobs
"""
Fair scheduler: prevents 1 tenant from monopolizing job queue.
"""
import asyncio
from collections import deque
from typing import Optional
class FairTenantQueue:
"""Round-robin across tenant queues."""
def __init__(self):
self.queues: dict[str, deque] = {}
self.tenant_order: list[str] = []
self.idx = 0
self.lock = asyncio.Lock()
self.event = asyncio.Event()
async def enqueue(self, tenant_id: str, job):
async with self.lock:
if tenant_id not in self.queues:
self.queues[tenant_id] = deque()
self.tenant_order.append(tenant_id)
self.queues[tenant_id].append(job)
self.event.set()
async def dequeue(self) -> Optional[tuple]:
while True:
async with self.lock:
# Round-robin
tried = 0
while tried < len(self.tenant_order):
tenant = self.tenant_order[self.idx]
self.idx = (self.idx + 1) % len(self.tenant_order)
queue = self.queues[tenant]
if queue:
job = queue.popleft()
return tenant, job
tried += 1
# All empty, wait for new job
self.event.clear()
await self.event.wait()
# Worker
async def worker(queue: FairTenantQueue):
while True:
result = await queue.dequeue()
if result is None:
await asyncio.sleep(0.1)
continue
tenant_id, job = result
tenant_ctx.set(tenant_id)
await process(job)7. System Design Diagrams
7.1 Tenancy Models Spectrum
flowchart LR
  subgraph Pool["Pool Model"]
    SharedApp[Shared App<br/>multi-tenant]
    SharedDB[(Shared DB<br/>tenant_id + RLS)]
    SharedApp --> SharedDB
  end
  subgraph Pod["Pod Model"]
    PodApp1[Pod 1<br/>~100 tenants]
    PodApp2[Pod 2<br/>~100 tenants]
    PodDB1[(Pod 1 DB)]
    PodDB2[(Pod 2 DB)]
    PodApp1 --> PodDB1
    PodApp2 --> PodDB2
  end
  subgraph Silo["Silo Model"]
    SiloApp1[Tenant A App]
    SiloApp2[Tenant B App]
    SiloDB1[(Tenant A DB)]
    SiloDB2[(Tenant B DB)]
    SiloApp1 --> SiloDB1
    SiloApp2 --> SiloDB2
  end
  Pool -->|"Free Tier"| Pod
  Pod -->|"Pro Tier"| Silo
  Silo -->|"Enterprise Tier"| End[Hybrid SaaS]
  style Pool fill:#fff9c4
  style Pod fill:#c8e6c9
  style Silo fill:#bbdefb
7.2 RLS Tenant Isolation
sequenceDiagram
  participant User
  participant App
  participant Auth as JWT Auth
  participant DB as PostgreSQL
  User->>App: GET /users (Bearer token)
  App->>Auth: Decode JWT
  Auth-->>App: tenant_id="acme"
  App->>DB: BEGIN<br/>SET app.tenant_id = 'acme'
  App->>DB: SELECT * FROM users<br/>(no WHERE clause!)
  DB->>DB: Apply RLS policy:<br/>tenant_id = current_setting('app.tenant_id')
  DB-->>App: Only acme users returned
  App->>DB: COMMIT
  App-->>User: { users: [...] }
  Note over DB: Even if app forgets WHERE,<br/>RLS prevents cross-tenant leak
7.3 Cell-Based Multi-Tenancy
flowchart TB
  Router[Cell Router<br/>tenant_id → cell_id]
  Router --> Cell1
  Router --> Cell2
  Router --> Cell3
  subgraph Cell1["Cell 1 (Tenants 1-100)"]
    App1[App Tier]
    DB1[(DB)]
    App1 --> DB1
  end
  subgraph Cell2["Cell 2 (Tenants 101-200)"]
    App2[App Tier]
    DB2[(DB)]
    App2 --> DB2
  end
  subgraph Cell3["Cell 3 (Tenants 201-300)"]
    App3[App Tier]
    DB3[(DB)]
    App3 --> DB3
  end
  User[User of tenant 50] --> Router
  Note["Bug in Cell 1 → only 100 tenants affected<br/>vs 300 in single pool"]
  style Note fill:#fff9c4
7.4 Noisy Neighbor Mitigation Layers
flowchart TD
  Request[Tenant Request]
  Request --> L1{Rate Limit<br/>per tenant?}
  L1 -->|Exceeded| Block1[429 Too Many Requests]
  L1 -->|OK| L2{Resource Quota<br/>K8s ResourceQuota?}
  L2 -->|Exceeded| Block2[503 Service Unavailable]
  L2 -->|OK| L3{Connection Pool<br/>tenant-aware?}
  L3 -->|Exceeded| Block3[Wait or Reject]
  L3 -->|OK| L4{Query Timeout<br/>per tenant?}
  L4 -->|Exceeded| Block4[Cancel query]
  L4 -->|OK| Success[Process request]
  style Block1 fill:#ffcdd2
  style Block2 fill:#ffcdd2
  style Block3 fill:#ffcdd2
  style Block4 fill:#ffcdd2
  style Success fill:#c8e6c9
8. Aha Moments & Pitfalls
Aha Moments
#1: Multi-tenancy is the first decision that shapes everything else. Pool vs silo vs hybrid determines cost, complexity, compliance, and scalability. It is hard to migrate later.
#2: PostgreSQL RLS is defense-in-depth. Even if the app code has a bug and forgets a WHERE clause, RLS prevents the cross-tenant leak. Mandatory for the pool model.
#3: The noisy neighbor is a silent killer. One enterprise customer with a bad query makes 99% of the small tenants slow → churn. Per-tenant rate limits and quotas are mandatory.
#4: The cell-based pattern reduces blast radius. Pure pool: 100% impact. Cell-based pool: 1-10% impact. AWS DynamoDB, Slack, and Stripe all use this pattern.
#5: Tenant context propagation is hard. HTTP requests are fine, async jobs are tricky, and background workers are easy to forget. A logger without tenant_id makes per-tenant debugging impossible.
#6: BYOK is an enterprise differentiator. Big customers care about key control. Implementation cost is high, but premium pricing justifies it.
#7: Compliance shapes architecture. HIPAA and FedRAMP often force the silo model. Plan upfront; retrofitting is expensive.
#8: Tenant deletion is a requirement, not a nice-to-have. GDPR grants the right to delete. Architect for hard delete plus an audit trail from day one.
Pitfalls
Pitfall 1: Hardcode tenant in URL
Wrong: acme.app.com/users — tenant identified by subdomain only. Right: tenant from the JWT claim; subdomain as a secondary check.
Pitfall 2: Forgetting tenant_id in indexes
Wrong: CREATE INDEX ON users (email) → slow cross-tenant scans. Right: CREATE INDEX ON users (tenant_id, email).
Pitfall 3: No RLS in the pool model
Wrong: “the code always filters by tenant_id” → one SQL bug = data breach. Right: defense-in-depth: app-level filters + RLS.
Pitfall 4: Background jobs without context
Wrong: a cron job processes all tenants and forgets to set context → queries miss the filter. Right: explicitly set tenant context per job, with an audit log.
Pitfall 5: Cache keys without tenant_id
Wrong: cache.set("user:123", data) → tenant A’s user 123 collides with tenant B’s user 123. Right: cache.set(f"user:{tenant_id}:123", data).
Pitfall 6: No fair scheduling
Wrong: a FIFO job queue → one tenant submits 100K jobs → everyone else waits for hours. Right: round-robin or weighted fair scheduling.
Pitfall 7: A single connection pool
Wrong: one PgBouncer for all tenants → one tenant exhausts the pool → others fail. Right: tenant-aware pool quotas or separate pools per tier.
Pitfall 8: No tenant-level monitoring
Wrong: aggregate metrics only — you can’t see that “tenant X has a 50% error rate”. Right: a tenant_id label on all metrics; per-tenant dashboards.
Pitfall 9: The same KMS key for all tenants
Wrong: one master key encrypts all tenants → a key compromise = total breach. Right: per-tenant KMS keys (a cost-vs-security trade-off).
Pitfall 10: No isolation testing
Wrong: manual review only → a bug slips into production → a customer discovers the cross-tenant leak. Right: automated tests for every endpoint; tenant_isolation_test.py runs on every PR.
9. Internal Links
| Topic | Connection |
|---|---|
| Tuan-07-Database-Sharding-Replication | Sharding by tenant_id; pool model uses tenant_id column |
| Tuan-09-Rate-Limiter | Per-tenant rate limiting |
| Tuan-11-Microservices-Pattern | Cell-based architecture; namespace per tenant |
| Tuan-14-AuthN-AuthZ-Security | Tenant context from JWT; RBAC scoped to tenant |
| Tuan-15-Data-Security-Encryption | BYOK per tenant |
| Tuan-Bonus-Multi-Region-Active-Active-DSQL | Tenant-region affinity |
| Case-Design-Payment-System | Per-merchant tenant model |
References
Whitepapers:
- AWS, SaaS Tenant Isolation Strategies — https://d1.awsstatic.com/whitepapers/saas-tenant-isolation-strategies.pdf
- AWS, SaaS Architecture Fundamentals — https://d1.awsstatic.com/whitepapers/saas-architecture-fundamentals.pdf
- Microsoft Learn, Multi-tenancy considerations — https://learn.microsoft.com/en-us/azure/architecture/guide/multitenant/
Engineering blogs:
- Neon, The Noisy Neighbor Problem — https://neon.tech/blog/noisy-neighbors
- Shopify, Sharding for resilience — https://shopify.engineering/sharding-customers-tools-and-techniques
- WorkOS, Multi-Tenant Architecture — https://workos.com/blog/multi-tenant-architecture
- Supabase, Postgres RLS — https://supabase.com/docs/guides/auth/row-level-security
- Stripe, Per-customer SLAs — https://stripe.com/blog/
Tools:
- AWS SaaS Builder Toolkit (SBT) — https://github.com/awslabs/sbt-aws
- vCluster — https://www.vcluster.com/
- Capsule (K8s multi-tenancy) — https://capsule.clastix.io/
- Loft (K8s SaaS) — https://loft.sh/
Next up: Tuan-Bonus-MCP-Architecture — Model Context Protocol for LLM tools.