Tuần 14 — OLAP & Columnar: ClickHouse, DuckDB

“OLTP đo nhỏ và nhiều: 1ms per query, 10K queries/s. OLAP đo to và ít: scan 1B rows in 1 second. Postgres làm OLAP cũng được, nhưng chậm 100x so với columnar. Tuần này dạy bạn vì sao + khi nào move.”

Tags: database clickhouse duckdb olap columnar analytics Thời lượng: 7 ngày (5-7h/ngày) Prerequisites: Tuan-01-DB-Internals-Refresh (storage concept) Liên quan: Case-Design-Data-Realtime-Analytics · Tuan-Bonus-Lakehouse-Engineering

1. Context & Why

1.1 OLTP vs OLAP — Fundamentally Different

graph TB
    subgraph "OLTP - Online Transaction Processing"
        OLTP1[Small transactions]
        OLTP2[Read few rows by key]
        OLTP3[Write per request]
        OLTP4[Latency: ms]
        OLTP5[Postgres, MySQL, Oracle]
    end

    subgraph "OLAP - Online Analytical Processing"
        OLAP1[Large scans]
        OLAP2[Aggregate billions of rows]
        OLAP3[Bulk insert, rare update]
        OLAP4[Latency: seconds]
        OLAP5[ClickHouse, BigQuery, Snowflake]
    end

    style OLTP4 fill:#c8e6c9
    style OLAP4 fill:#c8e6c9

1.2 Columnar vs Row Storage

graph LR
    subgraph "Row-oriented Postgres"
        R1["Row 1: id=1 | name=Alice | age=30 | city=SF"]
        R2["Row 2: id=2 | name=Bob | age=25 | city=NY"]
        R3["Row 3: id=3 | name=Carol | age=35 | city=LA"]

        R1 --> R2 --> R3
    end

    subgraph "Column-oriented ClickHouse"
        C1[id: 1, 2, 3]
        C2[name: Alice, Bob, Carol]
        C3[age: 30, 25, 35]
        C4[city: SF, NY, LA]
    end

1.3 Why columnar wins for analytics

Query: SELECT avg(age) FROM users WHERE city = 'SF'

Row store:

Read all columns of all rows
Filter, compute avg
For 100M rows × 100 bytes = 10GB scanned for 4-byte age

Columnar:

Read only city column → filter rows
Read only age column → compute
For 100M rows × 4 bytes age + 4 bytes city = 800MB scanned

~10-100x less I/O. Plus compression on column (homogeneous data compresses better).

1.4 Mục tiêu tuần

1.5 Tham chiếu

ClickHouse Documentation — https://clickhouse.com/docs
The Story of ClickHouse — Aleksey Milovidov talks
DuckDB Documentation — https://duckdb.org/docs/
Snowflake paper (SIGMOD 2016) — Vuppalapati et al
Designing Data-Intensive Applications Ch.3 (Column-oriented)

2. Columnar Storage Internals

2.1 ClickHouse on disk

table_name/
├── 20260501_1_1_0/        (partition_minBlock_maxBlock_level)
│   ├── primary.idx        (sparse index)
│   ├── partition.dat
│   ├── id.bin             (column data, compressed)
│   ├── id.mrk             (marks - offsets to granules)
│   ├── name.bin
│   ├── name.mrk
│   ├── age.bin
│   ├── age.mrk
│   └── checksums.txt
├── 20260502_2_2_0/
└── ...

Each column = separate file. Compressed in blocks (granules = 8192 rows by default via index_granularity). Trên thực tế index_granularity_bytes (default 10MB) thường dominates — granule split khi đạt N rows HOẶC M bytes, whichever first. Tune cả 2 cho wide rows.

2.2 Sparse primary index

Primary index marks every 8192 rows (granule):
  granule 0: id=1
  granule 1: id=8193
  granule 2: id=16385
  ...

Query WHERE id = 12345:

Binary search index → granule 1 (id 8193-16384)
Read granule 1 only (~8K rows)
Filter inside granule

vs Postgres B-tree (every row in index). ClickHouse: 100x smaller index, but slower point lookup.

→ Designed for scan, not point.

2.3 Compression

Columnar = homogeneous data per column → high compression.

Codec	Use case	Compression
LZ4 (default)	General	2-3x
ZSTD	Better ratio	4-5x
Delta	Time-series, sequential	5-10x
DoubleDelta	Timestamps	10-20x
T64	Integers	3-5x
Gorilla	Float time-series	10x+

CREATE TABLE events (
    ts DateTime CODEC(DoubleDelta, ZSTD),
    user_id UInt64 CODEC(T64, ZSTD),
    event_type LowCardinality(String),  -- enum-like
    value Float64 CODEC(Gorilla, ZSTD)
) ENGINE = MergeTree() ORDER BY (ts, user_id);

2.4 LowCardinality

Type wrapper for columns with few distinct values:

event_type LowCardinality(String)  -- 'click', 'view', 'purchase' etc

Internally: dictionary encoding. 10-100x compression for repeated values.

2.5 Vectorized execution

Process data in batches of typed values (vectors).

Row-based: for each row, evaluate expression
  → branch prediction, cache miss, function call overhead

Vectorized: for batch of 1024 rows, apply operation on whole vector
  → SIMD instructions, cache-friendly, ~10-100x faster

Modern OLAP DBs: ClickHouse, DuckDB, Apache Arrow, Polars — all vectorized.

3. ClickHouse Engines

3.1 MergeTree family

graph TB
    MT[MergeTree<br/>base, partitioned, sorted]
    RMT[ReplacingMergeTree<br/>upsert by sort key]
    SMT[SummingMergeTree<br/>auto-sum numeric on merge]
    AMT[AggregatingMergeTree<br/>store aggregate state]
    CMT[CollapsingMergeTree<br/>+1/-1 sign rows]
    VCMT[VersionedCollapsingMergeTree<br/>versioned collapse]

    MT --> RMT
    MT --> SMT
    MT --> AMT
    MT --> CMT
    CMT --> VCMT

3.2 MergeTree basics

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event String,
    value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)   -- monthly partitions
ORDER BY (timestamp, user_id)       -- sort key (= primary key default)
SETTINGS index_granularity = 8192;

PARTITION BY — folder per partition, allows partition pruning + drop partition
ORDER BY — physical sort, indexed by primary key
Optional PRIMARY KEY separate from ORDER BY — but must be a prefix of ORDER BY. Use case: PK smaller (RAM) than full sort key

3.3 ReplacingMergeTree

Dedup on merge: keep latest by ORDER BY.

CREATE TABLE users (
    id UInt64,
    name String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id;

INSERT with same id → eventually deduped on background merge. Use FINAL modifier or OPTIMIZE TABLE to force.

Caveat: only eventually consistent. SELECT immediately after INSERT might see duplicates.

3.4 SummingMergeTree

Auto-sum numeric columns on merge for rows with same ORDER BY.

CREATE TABLE daily_stats (
    date Date,
    user_id UInt64,
    clicks UInt64,
    views UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (date, user_id);
 
INSERT INTO daily_stats VALUES ('2026-05-16', 42, 1, 5);
INSERT INTO daily_stats VALUES ('2026-05-16', 42, 2, 3);
-- Eventually: ('2026-05-16', 42, 3, 8)

3.5 AggregatingMergeTree

Store aggregate state (not final result), final by quantileState, uniqState, etc.

CREATE TABLE user_metrics (
    date Date,
    user_id UInt64,
    unique_pages AggregateFunction(uniq, String),
    avg_session_sec AggregateFunction(avg, UInt32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (date, user_id);

Used with materialized views (section 4).

4. Materialized Views — Incremental Aggregation

4.1 Concept

In Postgres, MV = snapshot, REFRESH whole.

In ClickHouse, MV = trigger: every insert to source table → computed into MV in real-time.

4.2 Example

Source table:

CREATE TABLE events (
    ts DateTime, user_id UInt64, event String
) ENGINE = MergeTree() ORDER BY ts;

Materialized view aggregating daily:

CREATE MATERIALIZED VIEW events_daily
ENGINE = AggregatingMergeTree()
ORDER BY (date, event)
AS
SELECT
    toDate(ts) AS date,
    event,
    uniqState(user_id) AS unique_users,
    count() AS total_events
FROM events
GROUP BY date, event;

Query daily aggregates:

SELECT
    date, event,
    uniqMerge(unique_users) AS unique_users,
    sum(total_events) AS total
FROM events_daily
GROUP BY date, event;

4.3 Power

Source table 100B rows. MV daily aggregate <100K rows. Dashboard query <100ms.

vs Postgres MV REFRESH — takes hours to compute.

4.4 Multi-layer MVs

graph LR
    Raw[events_raw<br/>100B rows] --> MV1[events_hourly<br/>via MV]
    MV1 --> MV2[events_daily<br/>via MV]
    MV2 --> MV3[events_monthly<br/>via MV]

Each level reduces 60-1000x.

5. Sharding & Replication

5.1 Distributed table

-- Local table on each shard
CREATE TABLE events_local (...) ENGINE = MergeTree() ORDER BY ...;
 
-- Distributed table - cluster-wide view
CREATE TABLE events ON CLUSTER cluster1 AS events_local
ENGINE = Distributed(cluster1, default, events_local, rand());

Query SELECT * FROM events → distribute to all shards → aggregate.

5.2 Replication

ReplicatedMergeTree for fault tolerance:

CREATE TABLE events_local ON CLUSTER cluster1
(...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY ...;

Uses ZooKeeper (or ClickHouse Keeper) for coordination.

5.3 Sharding key

Random (good distribution) or hashed by user_id (group same user’s data).

Trade-off: random → all queries fan out; hashed → can prune to single shard with user_id filter.

6. ClickHouse Advanced

6.1 SAMPLE clause

Approximate queries for huge tables:

SELECT count() FROM events SAMPLE 0.1;  -- approximate from 10% sample

Need SAMPLE BY defined on table.

6.2 ARRAY JOIN

Explode array column to multiple rows (like SQL UNNEST).

CREATE TABLE products (id UInt64, tags Array(String)) ENGINE = MergeTree() ORDER BY id;
 
SELECT id, tag
FROM products
ARRAY JOIN tags AS tag
WHERE tag = 'rust';

6.3 Window functions

SELECT
    user_id,
    date,
    purchase,
    sum(purchase) OVER (PARTITION BY user_id ORDER BY date) AS lifetime_spent
FROM orders;

ClickHouse fully supports SQL window functions since 21.x.

6.4 Joins

ClickHouse joins are constrained — build hash table from right side in memory.

SELECT *
FROM events e
LEFT JOIN users u ON u.id = e.user_id;

If users huge (>10M rows × wide) → OOM. Workarounds:

Distributed dictionary for small lookup
Denormalize at ingest

6.5 Dictionaries

External small lookup table loaded in memory:

CREATE DICTIONARY user_dict (
    id UInt64,
    name String,
    country String
)
PRIMARY KEY id
SOURCE(HTTP(url 'http://api/users.csv'))
LAYOUT(HASHED())
LIFETIME(60);
 
-- Use in query
SELECT
    e.event,
    dictGet('user_dict', 'country', e.user_id) AS country
FROM events e;

Fast in-memory lookup, no JOIN.

6.6 Projections (since 21.3)

In-table secondary “view” with different sort/aggregate. Auto-used when matches query.

ALTER TABLE events ADD PROJECTION p_by_user (
    SELECT * ORDER BY user_id
);
 
ALTER TABLE events MATERIALIZE PROJECTION p_by_user;

Query SELECT * FROM events WHERE user_id = 42 → uses projection automatically.

7. DuckDB — In-process Analytics

7.1 Concept

SQLite for analytics. Single binary, embedded in Python/R/Node, no server.

import duckdb
con = duckdb.connect("analytics.db")
con.execute("CREATE TABLE events AS SELECT * FROM 'events.parquet'")
result = con.execute("SELECT date, count(*) FROM events GROUP BY date").df()

7.2 Use cases

Local analytics on Parquet/CSV files
Notebook data exploration
ETL pipelines
App-embedded analytics

7.3 Features 2024-2026

Reads Parquet, CSV, JSON, Arrow natively
Joins, window functions, full SQL
Vectorized execution
Smaller than ClickHouse but similar perf for single-machine

7.4 DuckDB vs ClickHouse

	DuckDB	ClickHouse
Deployment	Embedded library	Server cluster
Concurrency	Single-process	Multi-user
Data size	Single machine (TB)	Cluster (PB)
Best for	Analyst notebooks, embedded	Production analytics service

7.5 MotherDuck

Cloud DuckDB service. Hybrid local + cloud. 2024 launch.

8. Cloud Data Warehouses

8.1 BigQuery

Serverless, pay per query (bytes scanned)
Petabyte scale
Multi-region
Streaming inserts
SQL similar to standard

SELECT user_id, count(*) FROM `project.dataset.events`
WHERE _PARTITIONTIME >= '2026-05-01'
GROUP BY user_id;

Cost: $5/TB scanned. Optimize: partition + cluster.

8.2 Snowflake

Multi-cloud (AWS, GCP, Azure)
Compute decoupled from storage
“Virtual warehouses” — multiple compute clusters on same data
Time-travel (query past versions)

SELECT * FROM events AT(OFFSET => -3600);  -- 1 hour ago

8.3 Databricks SQL / Photon

Spark-based, Delta Lake format. Lakehouse architecture. Covered Tuan-Bonus-Lakehouse-Engineering.

8.4 Decision

Need	Pick
Real-time, sub-second	ClickHouse
Pure SQL on existing files	DuckDB
Serverless, AWS/GCP-native	BigQuery
Multi-cloud, separate compute	Snowflake
Open format (Iceberg/Delta)	Databricks, Trino

9. Schema Design for Analytics

9.1 Star Schema

graph TB
    F[Fact: events<br/>user_id, product_id, date_id, amount]
    D1[Dim: users<br/>id, name, country, signup_date]
    D2[Dim: products<br/>id, name, category]
    D3[Dim: dates<br/>id, day, week, month]

    F --> D1
    F --> D2
    F --> D3

Fact = events/transactions, narrow rows, billions
Dim = entities, small, references in fact

Joins for context.

9.2 Wide Table (denormalized)

CREATE TABLE events_wide (
    ts DateTime, user_id, product_id, amount,
    -- denormalized user fields
    user_name, user_country, user_signup_date,
    -- denormalized product fields
    product_name, product_category
);

No joins, fast scan. Larger storage (acceptable with compression).

Pattern 2024: ClickHouse/DuckDB favor wide tables. Snowflake/BigQuery handle either.

9.3 Slowly Changing Dimensions (SCD)

User changes country. How to track?

SCD Type 1: overwrite (lose history)
SCD Type 2: new row with valid_from/valid_to (keep history)
SCD Type 6: hybrid (current + history)

Type 2 most common for analytics.

10. Real-Time Analytics Architecture

graph LR
    App[App] -->|events| Kafka[Kafka]
    Kafka --> CH[ClickHouse<br/>raw layer]
    CH -.materialized.-> Agg1[hourly aggs]
    CH -.materialized.-> Agg2[daily aggs]
    Agg1 --> Dashboard[Grafana/Metabase]
    Agg2 --> Dashboard

    PG[(Postgres OLTP)] -.CDC.-> Kafka

Pattern:

Events from app → Kafka
CDC from OLTP → Kafka
ClickHouse consumes (Kafka engine) → raw table
Materialized views compute aggregates real-time
BI tools query aggregates

Will deep-dive Case-Design-Data-Realtime-Analytics.

11. Postgres for Analytics — When OK

Postgres NOT terrible for analytics if:

Data < 100GB
Queries < few seconds OK
No need <1s dashboard
Want one DB instead of two

Optimizations:

BRIN index on time column
Partitioning by time
Aggregate-only MVs (refresh periodically)
pg_cron for scheduled aggregation
citus extension for sharding

But at 100GB+ analytics workload, ClickHouse 100x faster + 10x cheaper. Move when justified.

12. Anti-patterns

Pattern	Why bad	Fix
Use ClickHouse as OLTP	High-frequency updates poor	Use Postgres for OLTP
Random ORDER BY in MergeTree	Compression poor, no pruning	Match sort key to query pattern
Single huge partition	No pruning	Partition by time
Frequent INSERTs single rows	Many small parts	Batch inserts (1K+ rows)
ALTER columns frequently	Background merge churn	Stable schema
Use Postgres at petabyte	Slow, expensive	ClickHouse/Snowflake
Denormalize without need	Wasted storage	Star schema OK
Forget to MERGE	Dedup/sum doesn’t happen	OPTIMIZE TABLE or wait
Query without partition filter	Full scan all parts	Always filter ts
Heavy joins ClickHouse	OOM	Dictionary or denormalize

13. Lab

13.1 Day 1: ClickHouse setup

docker run -d --name ch -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server:24.5
 
clickhouse-client -h localhost

Load NYC Taxi dataset (sample analytical workload).

13.2 Day 2: MergeTree

Create table, partition by month, order by (pickup_time, payment_type). Run benchmark vs Postgres on same data.

13.3 Day 3: Materialized view

Build hourly aggregate via MV. Watch real-time updates.

13.4 Day 4: Compression

Compare LZ4 vs ZSTD vs Delta on time-series data.

13.5 Day 5: Distributed

3-node cluster. Distributed table. Sharding by user_id.

13.6 Day 6: DuckDB

import duckdb
con = duckdb.connect(":memory:")
con.execute("CREATE TABLE events AS SELECT * FROM 'events.parquet'")
con.execute("SELECT date, count(*) FROM events GROUP BY date").df()

Benchmark vs ClickHouse on same dataset.

13.7 Day 7: Migration plan

Take Postgres analytics workload, plan migration to ClickHouse:

Identify hot queries
Design ClickHouse schema
ETL plan
Cutover

14. Self-check

Columnar vs row storage — vẽ + explain?
Vectorized execution — vì sao 10-100x faster?
MergeTree family — 5 engines + use cases?
Materialized view ClickHouse vs Postgres MV — khác biệt?
LowCardinality — khi nào dùng?
Sparse primary index — trade-off vs B-tree?
ClickHouse joins — limit + workaround?
DuckDB vs ClickHouse — pick khi nào?
Star schema vs wide table — modern OLAP preference?
Postgres for analytics — limit ở đâu?

15. Tiếp theo

Bài tiếp: Tuan-15-Vector-DB-AI — vector search cho AI workload.

Tuần 14 hoàn thành. Column store changes everything for analytics. Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-14-OLAP-Columnar-ClickHouse