Tuần 14 — OLAP & Columnar: ClickHouse, DuckDB

“OLTP đo nhỏ và nhiều: 1ms per query, 10K queries/s. OLAP đo to và ít: scan 1B rows in 1 second. Postgres làm OLAP cũng được, nhưng chậm 100x so với columnar. Tuần này dạy bạn vì sao + khi nào move.”

Tags: database clickhouse duckdb olap columnar analytics Thời lượng: 7 ngày (5-7h/ngày) Prerequisites: Tuan-01-DB-Internals-Refresh (storage concept) Liên quan: Case-Design-Data-Realtime-Analytics · Tuan-Bonus-Lakehouse-Engineering


1. Context & Why

1.1 OLTP vs OLAP — Fundamentally Different

graph TB
    subgraph "OLTP - Online Transaction Processing"
        OLTP1[Small transactions]
        OLTP2[Read few rows by key]
        OLTP3[Write per request]
        OLTP4[Latency: ms]
        OLTP5[Postgres, MySQL, Oracle]
    end

    subgraph "OLAP - Online Analytical Processing"
        OLAP1[Large scans]
        OLAP2[Aggregate billions of rows]
        OLAP3[Bulk insert, rare update]
        OLAP4[Latency: seconds]
        OLAP5[ClickHouse, BigQuery, Snowflake]
    end

    style OLTP4 fill:#c8e6c9
    style OLAP4 fill:#c8e6c9

1.2 Columnar vs Row Storage

graph LR
    subgraph "Row-oriented Postgres"
        R1["Row 1: id=1 | name=Alice | age=30 | city=SF"]
        R2["Row 2: id=2 | name=Bob | age=25 | city=NY"]
        R3["Row 3: id=3 | name=Carol | age=35 | city=LA"]

        R1 --> R2 --> R3
    end

    subgraph "Column-oriented ClickHouse"
        C1[id: 1, 2, 3]
        C2[name: Alice, Bob, Carol]
        C3[age: 30, 25, 35]
        C4[city: SF, NY, LA]
    end

1.3 Why columnar wins for analytics

Query: SELECT avg(age) FROM users WHERE city = 'SF'

Row store:

  • Read all columns of all rows
  • Filter, compute avg
  • For 100M rows × 100 bytes = 10GB scanned for 4-byte age

Columnar:

  • Read only city column → filter rows
  • Read only age column → compute
  • For 100M rows × 4 bytes age + 4 bytes city = 800MB scanned

~10-100x less I/O. Plus compression on column (homogeneous data compresses better).

1.4 Mục tiêu tuần

  • Columnar storage internals
  • Vectorized execution
  • ClickHouse MergeTree family (ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree)
  • Compression codecs (LZ4, ZSTD, Delta, T64)
  • Materialized views — incremental aggregation pattern
  • Sharding + replication
  • DuckDB — in-process analytics
  • Star schema vs wide table
  • BigQuery, Snowflake basics for context
  • When ClickHouse vs Postgres

1.5 Tham chiếu

  • ClickHouse Documentationhttps://clickhouse.com/docs
  • The Story of ClickHouse — Aleksey Milovidov talks
  • DuckDB Documentationhttps://duckdb.org/docs/
  • Snowflake paper (SIGMOD 2016) — Vuppalapati et al
  • Designing Data-Intensive Applications Ch.3 (Column-oriented)

2. Columnar Storage Internals

2.1 ClickHouse on disk

table_name/
├── 20260501_1_1_0/        (partition_minBlock_maxBlock_level)
│   ├── primary.idx        (sparse index)
│   ├── partition.dat
│   ├── id.bin             (column data, compressed)
│   ├── id.mrk             (marks - offsets to granules)
│   ├── name.bin
│   ├── name.mrk
│   ├── age.bin
│   ├── age.mrk
│   └── checksums.txt
├── 20260502_2_2_0/
└── ...

Each column = separate file. Compressed in blocks (granules = 8192 rows by default via index_granularity). Trên thực tế index_granularity_bytes (default 10MB) thường dominates — granule split khi đạt N rows HOẶC M bytes, whichever first. Tune cả 2 cho wide rows.

2.2 Sparse primary index

Primary index marks every 8192 rows (granule):
  granule 0: id=1
  granule 1: id=8193
  granule 2: id=16385
  ...

Query WHERE id = 12345:

  1. Binary search index → granule 1 (id 8193-16384)
  2. Read granule 1 only (~8K rows)
  3. Filter inside granule

vs Postgres B-tree (every row in index). ClickHouse: 100x smaller index, but slower point lookup.

→ Designed for scan, not point.

2.3 Compression

Columnar = homogeneous data per column → high compression.

CodecUse caseCompression
LZ4 (default)General2-3x
ZSTDBetter ratio4-5x
DeltaTime-series, sequential5-10x
DoubleDeltaTimestamps10-20x
T64Integers3-5x
GorillaFloat time-series10x+
CREATE TABLE events (
    ts DateTime CODEC(DoubleDelta, ZSTD),
    user_id UInt64 CODEC(T64, ZSTD),
    event_type LowCardinality(String),  -- enum-like
    value Float64 CODEC(Gorilla, ZSTD)
) ENGINE = MergeTree() ORDER BY (ts, user_id);

2.4 LowCardinality

Type wrapper for columns with few distinct values:

event_type LowCardinality(String)  -- 'click', 'view', 'purchase' etc

Internally: dictionary encoding. 10-100x compression for repeated values.

2.5 Vectorized execution

Process data in batches of typed values (vectors).

Row-based: for each row, evaluate expression
  → branch prediction, cache miss, function call overhead

Vectorized: for batch of 1024 rows, apply operation on whole vector
  → SIMD instructions, cache-friendly, ~10-100x faster

Modern OLAP DBs: ClickHouse, DuckDB, Apache Arrow, Polars — all vectorized.


3. ClickHouse Engines

3.1 MergeTree family

graph TB
    MT[MergeTree<br/>base, partitioned, sorted]
    RMT[ReplacingMergeTree<br/>upsert by sort key]
    SMT[SummingMergeTree<br/>auto-sum numeric on merge]
    AMT[AggregatingMergeTree<br/>store aggregate state]
    CMT[CollapsingMergeTree<br/>+1/-1 sign rows]
    VCMT[VersionedCollapsingMergeTree<br/>versioned collapse]

    MT --> RMT
    MT --> SMT
    MT --> AMT
    MT --> CMT
    CMT --> VCMT

3.2 MergeTree basics

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event String,
    value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)   -- monthly partitions
ORDER BY (timestamp, user_id)       -- sort key (= primary key default)
SETTINGS index_granularity = 8192;
  • PARTITION BY — folder per partition, allows partition pruning + drop partition
  • ORDER BY — physical sort, indexed by primary key
  • Optional PRIMARY KEY separate from ORDER BY — but must be a prefix of ORDER BY. Use case: PK smaller (RAM) than full sort key

3.3 ReplacingMergeTree

Dedup on merge: keep latest by ORDER BY.

CREATE TABLE users (
    id UInt64,
    name String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id;

INSERT with same id → eventually deduped on background merge. Use FINAL modifier or OPTIMIZE TABLE to force.

Caveat: only eventually consistent. SELECT immediately after INSERT might see duplicates.

3.4 SummingMergeTree

Auto-sum numeric columns on merge for rows with same ORDER BY.

CREATE TABLE daily_stats (
    date Date,
    user_id UInt64,
    clicks UInt64,
    views UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (date, user_id);
 
INSERT INTO daily_stats VALUES ('2026-05-16', 42, 1, 5);
INSERT INTO daily_stats VALUES ('2026-05-16', 42, 2, 3);
-- Eventually: ('2026-05-16', 42, 3, 8)

3.5 AggregatingMergeTree

Store aggregate state (not final result), final by quantileState, uniqState, etc.

CREATE TABLE user_metrics (
    date Date,
    user_id UInt64,
    unique_pages AggregateFunction(uniq, String),
    avg_session_sec AggregateFunction(avg, UInt32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (date, user_id);

Used with materialized views (section 4).


4. Materialized Views — Incremental Aggregation

4.1 Concept

In Postgres, MV = snapshot, REFRESH whole.

In ClickHouse, MV = trigger: every insert to source table → computed into MV in real-time.

4.2 Example

Source table:

CREATE TABLE events (
    ts DateTime, user_id UInt64, event String
) ENGINE = MergeTree() ORDER BY ts;

Materialized view aggregating daily:

CREATE MATERIALIZED VIEW events_daily
ENGINE = AggregatingMergeTree()
ORDER BY (date, event)
AS
SELECT
    toDate(ts) AS date,
    event,
    uniqState(user_id) AS unique_users,
    count() AS total_events
FROM events
GROUP BY date, event;

Query daily aggregates:

SELECT
    date, event,
    uniqMerge(unique_users) AS unique_users,
    sum(total_events) AS total
FROM events_daily
GROUP BY date, event;

4.3 Power

Source table 100B rows. MV daily aggregate <100K rows. Dashboard query <100ms.

vs Postgres MV REFRESH — takes hours to compute.

4.4 Multi-layer MVs

graph LR
    Raw[events_raw<br/>100B rows] --> MV1[events_hourly<br/>via MV]
    MV1 --> MV2[events_daily<br/>via MV]
    MV2 --> MV3[events_monthly<br/>via MV]

Each level reduces 60-1000x.


5. Sharding & Replication

5.1 Distributed table

-- Local table on each shard
CREATE TABLE events_local (...) ENGINE = MergeTree() ORDER BY ...;
 
-- Distributed table - cluster-wide view
CREATE TABLE events ON CLUSTER cluster1 AS events_local
ENGINE = Distributed(cluster1, default, events_local, rand());

Query SELECT * FROM events → distribute to all shards → aggregate.

5.2 Replication

ReplicatedMergeTree for fault tolerance:

CREATE TABLE events_local ON CLUSTER cluster1
(...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY ...;

Uses ZooKeeper (or ClickHouse Keeper) for coordination.

5.3 Sharding key

Random (good distribution) or hashed by user_id (group same user’s data).

Trade-off: random → all queries fan out; hashed → can prune to single shard with user_id filter.


6. ClickHouse Advanced

6.1 SAMPLE clause

Approximate queries for huge tables:

SELECT count() FROM events SAMPLE 0.1;  -- approximate from 10% sample

Need SAMPLE BY defined on table.

6.2 ARRAY JOIN

Explode array column to multiple rows (like SQL UNNEST).

CREATE TABLE products (id UInt64, tags Array(String)) ENGINE = MergeTree() ORDER BY id;
 
SELECT id, tag
FROM products
ARRAY JOIN tags AS tag
WHERE tag = 'rust';

6.3 Window functions

SELECT
    user_id,
    date,
    purchase,
    sum(purchase) OVER (PARTITION BY user_id ORDER BY date) AS lifetime_spent
FROM orders;

ClickHouse fully supports SQL window functions since 21.x.

6.4 Joins

ClickHouse joins are constrained — build hash table from right side in memory.

SELECT *
FROM events e
LEFT JOIN users u ON u.id = e.user_id;

If users huge (>10M rows × wide) → OOM. Workarounds:

  • Distributed dictionary for small lookup
  • Denormalize at ingest

6.5 Dictionaries

External small lookup table loaded in memory:

CREATE DICTIONARY user_dict (
    id UInt64,
    name String,
    country String
)
PRIMARY KEY id
SOURCE(HTTP(url 'http://api/users.csv'))
LAYOUT(HASHED())
LIFETIME(60);
 
-- Use in query
SELECT
    e.event,
    dictGet('user_dict', 'country', e.user_id) AS country
FROM events e;

Fast in-memory lookup, no JOIN.

6.6 Projections (since 21.3)

In-table secondary “view” with different sort/aggregate. Auto-used when matches query.

ALTER TABLE events ADD PROJECTION p_by_user (
    SELECT * ORDER BY user_id
);
 
ALTER TABLE events MATERIALIZE PROJECTION p_by_user;

Query SELECT * FROM events WHERE user_id = 42 → uses projection automatically.


7. DuckDB — In-process Analytics

7.1 Concept

SQLite for analytics. Single binary, embedded in Python/R/Node, no server.

import duckdb
con = duckdb.connect("analytics.db")
con.execute("CREATE TABLE events AS SELECT * FROM 'events.parquet'")
result = con.execute("SELECT date, count(*) FROM events GROUP BY date").df()

7.2 Use cases

  • Local analytics on Parquet/CSV files
  • Notebook data exploration
  • ETL pipelines
  • App-embedded analytics

7.3 Features 2024-2026

  • Reads Parquet, CSV, JSON, Arrow natively
  • Joins, window functions, full SQL
  • Vectorized execution
  • Smaller than ClickHouse but similar perf for single-machine

7.4 DuckDB vs ClickHouse

DuckDBClickHouse
DeploymentEmbedded libraryServer cluster
ConcurrencySingle-processMulti-user
Data sizeSingle machine (TB)Cluster (PB)
Best forAnalyst notebooks, embeddedProduction analytics service

7.5 MotherDuck

Cloud DuckDB service. Hybrid local + cloud. 2024 launch.


8. Cloud Data Warehouses

8.1 BigQuery

  • Serverless, pay per query (bytes scanned)
  • Petabyte scale
  • Multi-region
  • Streaming inserts
  • SQL similar to standard
SELECT user_id, count(*) FROM `project.dataset.events`
WHERE _PARTITIONTIME >= '2026-05-01'
GROUP BY user_id;

Cost: $5/TB scanned. Optimize: partition + cluster.

8.2 Snowflake

  • Multi-cloud (AWS, GCP, Azure)
  • Compute decoupled from storage
  • “Virtual warehouses” — multiple compute clusters on same data
  • Time-travel (query past versions)
SELECT * FROM events AT(OFFSET => -3600);  -- 1 hour ago

8.3 Databricks SQL / Photon

Spark-based, Delta Lake format. Lakehouse architecture. Covered Tuan-Bonus-Lakehouse-Engineering.

8.4 Decision

NeedPick
Real-time, sub-secondClickHouse
Pure SQL on existing filesDuckDB
Serverless, AWS/GCP-nativeBigQuery
Multi-cloud, separate computeSnowflake
Open format (Iceberg/Delta)Databricks, Trino

9. Schema Design for Analytics

9.1 Star Schema

graph TB
    F[Fact: events<br/>user_id, product_id, date_id, amount]
    D1[Dim: users<br/>id, name, country, signup_date]
    D2[Dim: products<br/>id, name, category]
    D3[Dim: dates<br/>id, day, week, month]

    F --> D1
    F --> D2
    F --> D3
  • Fact = events/transactions, narrow rows, billions
  • Dim = entities, small, references in fact

Joins for context.

9.2 Wide Table (denormalized)

CREATE TABLE events_wide (
    ts DateTime, user_id, product_id, amount,
    -- denormalized user fields
    user_name, user_country, user_signup_date,
    -- denormalized product fields
    product_name, product_category
);

No joins, fast scan. Larger storage (acceptable with compression).

Pattern 2024: ClickHouse/DuckDB favor wide tables. Snowflake/BigQuery handle either.

9.3 Slowly Changing Dimensions (SCD)

User changes country. How to track?

  • SCD Type 1: overwrite (lose history)
  • SCD Type 2: new row with valid_from/valid_to (keep history)
  • SCD Type 6: hybrid (current + history)

Type 2 most common for analytics.


10. Real-Time Analytics Architecture

graph LR
    App[App] -->|events| Kafka[Kafka]
    Kafka --> CH[ClickHouse<br/>raw layer]
    CH -.materialized.-> Agg1[hourly aggs]
    CH -.materialized.-> Agg2[daily aggs]
    Agg1 --> Dashboard[Grafana/Metabase]
    Agg2 --> Dashboard

    PG[(Postgres OLTP)] -.CDC.-> Kafka

Pattern:

  • Events from app → Kafka
  • CDC from OLTP → Kafka
  • ClickHouse consumes (Kafka engine) → raw table
  • Materialized views compute aggregates real-time
  • BI tools query aggregates

Will deep-dive Case-Design-Data-Realtime-Analytics.


11. Postgres for Analytics — When OK

Postgres NOT terrible for analytics if:

  • Data < 100GB
  • Queries < few seconds OK
  • No need <1s dashboard
  • Want one DB instead of two

Optimizations:

  • BRIN index on time column
  • Partitioning by time
  • Aggregate-only MVs (refresh periodically)
  • pg_cron for scheduled aggregation
  • citus extension for sharding

But at 100GB+ analytics workload, ClickHouse 100x faster + 10x cheaper. Move when justified.


12. Anti-patterns

PatternWhy badFix
Use ClickHouse as OLTPHigh-frequency updates poorUse Postgres for OLTP
Random ORDER BY in MergeTreeCompression poor, no pruningMatch sort key to query pattern
Single huge partitionNo pruningPartition by time
Frequent INSERTs single rowsMany small partsBatch inserts (1K+ rows)
ALTER columns frequentlyBackground merge churnStable schema
Use Postgres at petabyteSlow, expensiveClickHouse/Snowflake
Denormalize without needWasted storageStar schema OK
Forget to MERGEDedup/sum doesn’t happenOPTIMIZE TABLE or wait
Query without partition filterFull scan all partsAlways filter ts
Heavy joins ClickHouseOOMDictionary or denormalize

13. Lab

13.1 Day 1: ClickHouse setup

docker run -d --name ch -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server:24.5
 
clickhouse-client -h localhost

Load NYC Taxi dataset (sample analytical workload).

13.2 Day 2: MergeTree

Create table, partition by month, order by (pickup_time, payment_type). Run benchmark vs Postgres on same data.

13.3 Day 3: Materialized view

Build hourly aggregate via MV. Watch real-time updates.

13.4 Day 4: Compression

Compare LZ4 vs ZSTD vs Delta on time-series data.

13.5 Day 5: Distributed

3-node cluster. Distributed table. Sharding by user_id.

13.6 Day 6: DuckDB

import duckdb
con = duckdb.connect(":memory:")
con.execute("CREATE TABLE events AS SELECT * FROM 'events.parquet'")
con.execute("SELECT date, count(*) FROM events GROUP BY date").df()

Benchmark vs ClickHouse on same dataset.

13.7 Day 7: Migration plan

Take Postgres analytics workload, plan migration to ClickHouse:

  • Identify hot queries
  • Design ClickHouse schema
  • ETL plan
  • Cutover

14. Self-check

  1. Columnar vs row storage — vẽ + explain?
  2. Vectorized execution — vì sao 10-100x faster?
  3. MergeTree family — 5 engines + use cases?
  4. Materialized view ClickHouse vs Postgres MV — khác biệt?
  5. LowCardinality — khi nào dùng?
  6. Sparse primary index — trade-off vs B-tree?
  7. ClickHouse joins — limit + workaround?
  8. DuckDB vs ClickHouse — pick khi nào?
  9. Star schema vs wide table — modern OLAP preference?
  10. Postgres for analytics — limit ở đâu?

15. Tiếp theo

Bài tiếp: Tuan-15-Vector-DB-AI — vector search cho AI workload.


Tuần 14 hoàn thành. Column store changes everything for analytics. Cập nhật: 2026-05-16