Tuần 14 — OLAP & Columnar: ClickHouse, DuckDB
“OLTP đo nhỏ và nhiều: 1ms per query, 10K queries/s. OLAP đo to và ít: scan 1B rows in 1 second. Postgres làm OLAP cũng được, nhưng chậm 100x so với columnar. Tuần này dạy bạn vì sao + khi nào move.”
Tags: database clickhouse duckdb olap columnar analytics Thời lượng: 7 ngày (5-7h/ngày) Prerequisites: Tuan-01-DB-Internals-Refresh (storage concept) Liên quan: Case-Design-Data-Realtime-Analytics · Tuan-Bonus-Lakehouse-Engineering
1. Context & Why
1.1 OLTP vs OLAP — Fundamentally Different
graph TB subgraph "OLTP - Online Transaction Processing" OLTP1[Small transactions] OLTP2[Read few rows by key] OLTP3[Write per request] OLTP4[Latency: ms] OLTP5[Postgres, MySQL, Oracle] end subgraph "OLAP - Online Analytical Processing" OLAP1[Large scans] OLAP2[Aggregate billions of rows] OLAP3[Bulk insert, rare update] OLAP4[Latency: seconds] OLAP5[ClickHouse, BigQuery, Snowflake] end style OLTP4 fill:#c8e6c9 style OLAP4 fill:#c8e6c9
1.2 Columnar vs Row Storage
graph LR subgraph "Row-oriented Postgres" R1["Row 1: id=1 | name=Alice | age=30 | city=SF"] R2["Row 2: id=2 | name=Bob | age=25 | city=NY"] R3["Row 3: id=3 | name=Carol | age=35 | city=LA"] R1 --> R2 --> R3 end subgraph "Column-oriented ClickHouse" C1[id: 1, 2, 3] C2[name: Alice, Bob, Carol] C3[age: 30, 25, 35] C4[city: SF, NY, LA] end
1.3 Why columnar wins for analytics
Query: SELECT avg(age) FROM users WHERE city = 'SF'
Row store:
- Read all columns of all rows
- Filter, compute avg
- For 100M rows × 100 bytes = 10GB scanned for 4-byte age
Columnar:
- Read only
citycolumn → filter rows - Read only
agecolumn → compute - For 100M rows × 4 bytes age + 4 bytes city = 800MB scanned
~10-100x less I/O. Plus compression on column (homogeneous data compresses better).
1.4 Mục tiêu tuần
- Columnar storage internals
- Vectorized execution
- ClickHouse MergeTree family (ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree)
- Compression codecs (LZ4, ZSTD, Delta, T64)
- Materialized views — incremental aggregation pattern
- Sharding + replication
- DuckDB — in-process analytics
- Star schema vs wide table
- BigQuery, Snowflake basics for context
- When ClickHouse vs Postgres
1.5 Tham chiếu
- ClickHouse Documentation — https://clickhouse.com/docs
- The Story of ClickHouse — Aleksey Milovidov talks
- DuckDB Documentation — https://duckdb.org/docs/
- Snowflake paper (SIGMOD 2016) — Vuppalapati et al
- Designing Data-Intensive Applications Ch.3 (Column-oriented)
2. Columnar Storage Internals
2.1 ClickHouse on disk
table_name/
├── 20260501_1_1_0/ (partition_minBlock_maxBlock_level)
│ ├── primary.idx (sparse index)
│ ├── partition.dat
│ ├── id.bin (column data, compressed)
│ ├── id.mrk (marks - offsets to granules)
│ ├── name.bin
│ ├── name.mrk
│ ├── age.bin
│ ├── age.mrk
│ └── checksums.txt
├── 20260502_2_2_0/
└── ...
Each column = separate file. Compressed in blocks (granules = 8192 rows by default via index_granularity). Trên thực tế index_granularity_bytes (default 10MB) thường dominates — granule split khi đạt N rows HOẶC M bytes, whichever first. Tune cả 2 cho wide rows.
2.2 Sparse primary index
Primary index marks every 8192 rows (granule):
granule 0: id=1
granule 1: id=8193
granule 2: id=16385
...
Query WHERE id = 12345:
- Binary search index → granule 1 (id 8193-16384)
- Read granule 1 only (~8K rows)
- Filter inside granule
vs Postgres B-tree (every row in index). ClickHouse: 100x smaller index, but slower point lookup.
→ Designed for scan, not point.
2.3 Compression
Columnar = homogeneous data per column → high compression.
| Codec | Use case | Compression |
|---|---|---|
| LZ4 (default) | General | 2-3x |
| ZSTD | Better ratio | 4-5x |
| Delta | Time-series, sequential | 5-10x |
| DoubleDelta | Timestamps | 10-20x |
| T64 | Integers | 3-5x |
| Gorilla | Float time-series | 10x+ |
CREATE TABLE events (
ts DateTime CODEC(DoubleDelta, ZSTD),
user_id UInt64 CODEC(T64, ZSTD),
event_type LowCardinality(String), -- enum-like
value Float64 CODEC(Gorilla, ZSTD)
) ENGINE = MergeTree() ORDER BY (ts, user_id);2.4 LowCardinality
Type wrapper for columns with few distinct values:
event_type LowCardinality(String) -- 'click', 'view', 'purchase' etcInternally: dictionary encoding. 10-100x compression for repeated values.
2.5 Vectorized execution
Process data in batches of typed values (vectors).
Row-based: for each row, evaluate expression
→ branch prediction, cache miss, function call overhead
Vectorized: for batch of 1024 rows, apply operation on whole vector
→ SIMD instructions, cache-friendly, ~10-100x faster
Modern OLAP DBs: ClickHouse, DuckDB, Apache Arrow, Polars — all vectorized.
3. ClickHouse Engines
3.1 MergeTree family
graph TB MT[MergeTree<br/>base, partitioned, sorted] RMT[ReplacingMergeTree<br/>upsert by sort key] SMT[SummingMergeTree<br/>auto-sum numeric on merge] AMT[AggregatingMergeTree<br/>store aggregate state] CMT[CollapsingMergeTree<br/>+1/-1 sign rows] VCMT[VersionedCollapsingMergeTree<br/>versioned collapse] MT --> RMT MT --> SMT MT --> AMT MT --> CMT CMT --> VCMT
3.2 MergeTree basics
CREATE TABLE events (
timestamp DateTime,
user_id UInt64,
event String,
value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp) -- monthly partitions
ORDER BY (timestamp, user_id) -- sort key (= primary key default)
SETTINGS index_granularity = 8192;PARTITION BY— folder per partition, allows partition pruning + drop partitionORDER BY— physical sort, indexed by primary key- Optional
PRIMARY KEYseparate from ORDER BY — but must be a prefix of ORDER BY. Use case: PK smaller (RAM) than full sort key
3.3 ReplacingMergeTree
Dedup on merge: keep latest by ORDER BY.
CREATE TABLE users (
id UInt64,
name String,
updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id;INSERT with same id → eventually deduped on background merge. Use FINAL modifier or OPTIMIZE TABLE to force.
Caveat: only eventually consistent. SELECT immediately after INSERT might see duplicates.
3.4 SummingMergeTree
Auto-sum numeric columns on merge for rows with same ORDER BY.
CREATE TABLE daily_stats (
date Date,
user_id UInt64,
clicks UInt64,
views UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (date, user_id);
INSERT INTO daily_stats VALUES ('2026-05-16', 42, 1, 5);
INSERT INTO daily_stats VALUES ('2026-05-16', 42, 2, 3);
-- Eventually: ('2026-05-16', 42, 3, 8)3.5 AggregatingMergeTree
Store aggregate state (not final result), final by quantileState, uniqState, etc.
CREATE TABLE user_metrics (
date Date,
user_id UInt64,
unique_pages AggregateFunction(uniq, String),
avg_session_sec AggregateFunction(avg, UInt32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (date, user_id);Used with materialized views (section 4).
4. Materialized Views — Incremental Aggregation
4.1 Concept
In Postgres, MV = snapshot, REFRESH whole.
In ClickHouse, MV = trigger: every insert to source table → computed into MV in real-time.
4.2 Example
Source table:
CREATE TABLE events (
ts DateTime, user_id UInt64, event String
) ENGINE = MergeTree() ORDER BY ts;Materialized view aggregating daily:
CREATE MATERIALIZED VIEW events_daily
ENGINE = AggregatingMergeTree()
ORDER BY (date, event)
AS
SELECT
toDate(ts) AS date,
event,
uniqState(user_id) AS unique_users,
count() AS total_events
FROM events
GROUP BY date, event;Query daily aggregates:
SELECT
date, event,
uniqMerge(unique_users) AS unique_users,
sum(total_events) AS total
FROM events_daily
GROUP BY date, event;4.3 Power
Source table 100B rows. MV daily aggregate <100K rows. Dashboard query <100ms.
vs Postgres MV REFRESH — takes hours to compute.
4.4 Multi-layer MVs
graph LR Raw[events_raw<br/>100B rows] --> MV1[events_hourly<br/>via MV] MV1 --> MV2[events_daily<br/>via MV] MV2 --> MV3[events_monthly<br/>via MV]
Each level reduces 60-1000x.
5. Sharding & Replication
5.1 Distributed table
-- Local table on each shard
CREATE TABLE events_local (...) ENGINE = MergeTree() ORDER BY ...;
-- Distributed table - cluster-wide view
CREATE TABLE events ON CLUSTER cluster1 AS events_local
ENGINE = Distributed(cluster1, default, events_local, rand());Query SELECT * FROM events → distribute to all shards → aggregate.
5.2 Replication
ReplicatedMergeTree for fault tolerance:
CREATE TABLE events_local ON CLUSTER cluster1
(...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY ...;Uses ZooKeeper (or ClickHouse Keeper) for coordination.
5.3 Sharding key
Random (good distribution) or hashed by user_id (group same user’s data).
Trade-off: random → all queries fan out; hashed → can prune to single shard with user_id filter.
6. ClickHouse Advanced
6.1 SAMPLE clause
Approximate queries for huge tables:
SELECT count() FROM events SAMPLE 0.1; -- approximate from 10% sampleNeed SAMPLE BY defined on table.
6.2 ARRAY JOIN
Explode array column to multiple rows (like SQL UNNEST).
CREATE TABLE products (id UInt64, tags Array(String)) ENGINE = MergeTree() ORDER BY id;
SELECT id, tag
FROM products
ARRAY JOIN tags AS tag
WHERE tag = 'rust';6.3 Window functions
SELECT
user_id,
date,
purchase,
sum(purchase) OVER (PARTITION BY user_id ORDER BY date) AS lifetime_spent
FROM orders;ClickHouse fully supports SQL window functions since 21.x.
6.4 Joins
ClickHouse joins are constrained — build hash table from right side in memory.
SELECT *
FROM events e
LEFT JOIN users u ON u.id = e.user_id;If users huge (>10M rows × wide) → OOM. Workarounds:
- Distributed dictionary for small lookup
- Denormalize at ingest
6.5 Dictionaries
External small lookup table loaded in memory:
CREATE DICTIONARY user_dict (
id UInt64,
name String,
country String
)
PRIMARY KEY id
SOURCE(HTTP(url 'http://api/users.csv'))
LAYOUT(HASHED())
LIFETIME(60);
-- Use in query
SELECT
e.event,
dictGet('user_dict', 'country', e.user_id) AS country
FROM events e;Fast in-memory lookup, no JOIN.
6.6 Projections (since 21.3)
In-table secondary “view” with different sort/aggregate. Auto-used when matches query.
ALTER TABLE events ADD PROJECTION p_by_user (
SELECT * ORDER BY user_id
);
ALTER TABLE events MATERIALIZE PROJECTION p_by_user;Query SELECT * FROM events WHERE user_id = 42 → uses projection automatically.
7. DuckDB — In-process Analytics
7.1 Concept
SQLite for analytics. Single binary, embedded in Python/R/Node, no server.
import duckdb
con = duckdb.connect("analytics.db")
con.execute("CREATE TABLE events AS SELECT * FROM 'events.parquet'")
result = con.execute("SELECT date, count(*) FROM events GROUP BY date").df()7.2 Use cases
- Local analytics on Parquet/CSV files
- Notebook data exploration
- ETL pipelines
- App-embedded analytics
7.3 Features 2024-2026
- Reads Parquet, CSV, JSON, Arrow natively
- Joins, window functions, full SQL
- Vectorized execution
- Smaller than ClickHouse but similar perf for single-machine
7.4 DuckDB vs ClickHouse
| DuckDB | ClickHouse | |
|---|---|---|
| Deployment | Embedded library | Server cluster |
| Concurrency | Single-process | Multi-user |
| Data size | Single machine (TB) | Cluster (PB) |
| Best for | Analyst notebooks, embedded | Production analytics service |
7.5 MotherDuck
Cloud DuckDB service. Hybrid local + cloud. 2024 launch.
8. Cloud Data Warehouses
8.1 BigQuery
- Serverless, pay per query (bytes scanned)
- Petabyte scale
- Multi-region
- Streaming inserts
- SQL similar to standard
SELECT user_id, count(*) FROM `project.dataset.events`
WHERE _PARTITIONTIME >= '2026-05-01'
GROUP BY user_id;Cost: $5/TB scanned. Optimize: partition + cluster.
8.2 Snowflake
- Multi-cloud (AWS, GCP, Azure)
- Compute decoupled from storage
- “Virtual warehouses” — multiple compute clusters on same data
- Time-travel (query past versions)
SELECT * FROM events AT(OFFSET => -3600); -- 1 hour ago8.3 Databricks SQL / Photon
Spark-based, Delta Lake format. Lakehouse architecture. Covered Tuan-Bonus-Lakehouse-Engineering.
8.4 Decision
| Need | Pick |
|---|---|
| Real-time, sub-second | ClickHouse |
| Pure SQL on existing files | DuckDB |
| Serverless, AWS/GCP-native | BigQuery |
| Multi-cloud, separate compute | Snowflake |
| Open format (Iceberg/Delta) | Databricks, Trino |
9. Schema Design for Analytics
9.1 Star Schema
graph TB F[Fact: events<br/>user_id, product_id, date_id, amount] D1[Dim: users<br/>id, name, country, signup_date] D2[Dim: products<br/>id, name, category] D3[Dim: dates<br/>id, day, week, month] F --> D1 F --> D2 F --> D3
- Fact = events/transactions, narrow rows, billions
- Dim = entities, small, references in fact
Joins for context.
9.2 Wide Table (denormalized)
CREATE TABLE events_wide (
ts DateTime, user_id, product_id, amount,
-- denormalized user fields
user_name, user_country, user_signup_date,
-- denormalized product fields
product_name, product_category
);No joins, fast scan. Larger storage (acceptable with compression).
Pattern 2024: ClickHouse/DuckDB favor wide tables. Snowflake/BigQuery handle either.
9.3 Slowly Changing Dimensions (SCD)
User changes country. How to track?
- SCD Type 1: overwrite (lose history)
- SCD Type 2: new row with valid_from/valid_to (keep history)
- SCD Type 6: hybrid (current + history)
Type 2 most common for analytics.
10. Real-Time Analytics Architecture
graph LR App[App] -->|events| Kafka[Kafka] Kafka --> CH[ClickHouse<br/>raw layer] CH -.materialized.-> Agg1[hourly aggs] CH -.materialized.-> Agg2[daily aggs] Agg1 --> Dashboard[Grafana/Metabase] Agg2 --> Dashboard PG[(Postgres OLTP)] -.CDC.-> Kafka
Pattern:
- Events from app → Kafka
- CDC from OLTP → Kafka
- ClickHouse consumes (Kafka engine) → raw table
- Materialized views compute aggregates real-time
- BI tools query aggregates
Will deep-dive Case-Design-Data-Realtime-Analytics.
11. Postgres for Analytics — When OK
Postgres NOT terrible for analytics if:
- Data < 100GB
- Queries < few seconds OK
- No need <1s dashboard
- Want one DB instead of two
Optimizations:
- BRIN index on time column
- Partitioning by time
- Aggregate-only MVs (refresh periodically)
pg_cronfor scheduled aggregation- citus extension for sharding
But at 100GB+ analytics workload, ClickHouse 100x faster + 10x cheaper. Move when justified.
12. Anti-patterns
| Pattern | Why bad | Fix |
|---|---|---|
| Use ClickHouse as OLTP | High-frequency updates poor | Use Postgres for OLTP |
| Random ORDER BY in MergeTree | Compression poor, no pruning | Match sort key to query pattern |
| Single huge partition | No pruning | Partition by time |
| Frequent INSERTs single rows | Many small parts | Batch inserts (1K+ rows) |
| ALTER columns frequently | Background merge churn | Stable schema |
| Use Postgres at petabyte | Slow, expensive | ClickHouse/Snowflake |
| Denormalize without need | Wasted storage | Star schema OK |
| Forget to MERGE | Dedup/sum doesn’t happen | OPTIMIZE TABLE or wait |
| Query without partition filter | Full scan all parts | Always filter ts |
| Heavy joins ClickHouse | OOM | Dictionary or denormalize |
13. Lab
13.1 Day 1: ClickHouse setup
docker run -d --name ch -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server:24.5
clickhouse-client -h localhostLoad NYC Taxi dataset (sample analytical workload).
13.2 Day 2: MergeTree
Create table, partition by month, order by (pickup_time, payment_type). Run benchmark vs Postgres on same data.
13.3 Day 3: Materialized view
Build hourly aggregate via MV. Watch real-time updates.
13.4 Day 4: Compression
Compare LZ4 vs ZSTD vs Delta on time-series data.
13.5 Day 5: Distributed
3-node cluster. Distributed table. Sharding by user_id.
13.6 Day 6: DuckDB
import duckdb
con = duckdb.connect(":memory:")
con.execute("CREATE TABLE events AS SELECT * FROM 'events.parquet'")
con.execute("SELECT date, count(*) FROM events GROUP BY date").df()Benchmark vs ClickHouse on same dataset.
13.7 Day 7: Migration plan
Take Postgres analytics workload, plan migration to ClickHouse:
- Identify hot queries
- Design ClickHouse schema
- ETL plan
- Cutover
14. Self-check
- Columnar vs row storage — vẽ + explain?
- Vectorized execution — vì sao 10-100x faster?
- MergeTree family — 5 engines + use cases?
- Materialized view ClickHouse vs Postgres MV — khác biệt?
- LowCardinality — khi nào dùng?
- Sparse primary index — trade-off vs B-tree?
- ClickHouse joins — limit + workaround?
- DuckDB vs ClickHouse — pick khi nào?
- Star schema vs wide table — modern OLAP preference?
- Postgres for analytics — limit ở đâu?
15. Tiếp theo
Bài tiếp: Tuan-15-Vector-DB-AI — vector search cho AI workload.
Tuần 14 hoàn thành. Column store changes everything for analytics. Cập nhật: 2026-05-16