Bonus 7 — Lakehouse Engineering: Iceberg, Delta, Hudi
“Data Lake (cheap S3) + Data Warehouse (rich SQL) = Lakehouse. Tuần này đào sâu engineering side của Apache Iceberg — table format trở thành OLAP standard 2024-2026.”
Tags: database lakehouse iceberg delta parquet data-engineering unity-catalog polaris Thời lượng: 5-6 ngày Prerequisites: Tuan-14-OLAP-Columnar-ClickHouse Liên quan: Case-Design-Modern-Data-Lakehouse (SD course) · Tuan-Bonus-CDC-Debezium
1. Why Lakehouse
1.1 Three eras
graph LR DW[Data Warehouse era<br/>Teradata, Oracle<br/>$$$ per TB<br/>structured SQL] --> Lake[Data Lake era<br/>S3 + Parquet<br/>cheap but no ACID<br/>schema chaos] --> Lakehouse[Lakehouse era 2020+<br/>Iceberg/Delta/Hudi<br/>cheap + ACID<br/>multi-engine]
1.2 Lakehouse value
- Cheap storage (S3 ~0.05/GB egress)
- Open format (no vendor lock)
- Many compute engines (Spark, Trino, Snowflake, BigQuery, Athena, DuckDB, ClickHouse 2024)
- ACID transactions on object store
- Time travel, schema evolution
- Format separation from compute (store once, query many)
1.3 Three competing formats — current status 2024-2026
| Iceberg | Delta Lake | Hudi | |
|---|---|---|---|
| Origin | Netflix → Apache | Databricks → Linux Foundation | Uber → Apache |
| OSS | Apache 2.0 | OSS + Databricks ext | Apache 2.0 |
| Vendor neutrality | Highest | Databricks-leaning | Apache-aligned |
| Adoption 2024-2026 | Winning broadly | Strong (Databricks) | Smaller |
| Compute engines | Many | Many (esp. Spark) | Spark-centric |
| Time travel | Yes | Yes | Yes |
| Schema evolution | Best | Good | Good |
| CDC | Limited | Native (Delta Change Feed) | Strong (built for CDC) |
| Snapshot isolation | Yes | Yes | Yes |
1.4 Industry consolidation 2024
June 2024: Databricks acquired Tabular (Iceberg’s commercial steward, founded by Ryan Blue + Daniel Weeks who created Iceberg at Netflix) for $1B+.
Implications:
- Databricks now influences both Iceberg + Delta
- UniForm product bridges Delta ↔ Iceberg (read Delta as Iceberg)
- Convergence between formats
Apache Polaris (Snowflake, 2024): Open catalog standard for Iceberg.
Unity Catalog OSS (Databricks, 2024): Open catalog (was proprietary).
Catalog war: Polaris (Snowflake) vs Unity Catalog (Databricks) vs HMS + REST → still settling.
1.5 Net effect 2024-2026
- Iceberg becomes dominant format
- Catalog standard converging via REST API
- Multi-engine future (your data in Iceberg, query with whatever)
- Vendor differentiation moves to: compute, catalog UI, governance, ML integration
2. Apache Iceberg Architecture
2.1 Layers
graph TB Catalog[Catalog<br/>Hive Metastore / Glue / REST / Polaris / Unity / Snowflake] Catalog --> Meta[Current metadata file<br/>v_N.metadata.json] Meta --> ML[Manifest list<br/>snap-X.avro] ML --> M1[Manifest 1<br/>data-files-list.avro<br/>+ stats per file] ML --> M2[Manifest 2] M1 --> D1[Parquet file 1] M1 --> D2[Parquet file 2] M2 --> D3[Parquet file 3] M2 --> D4[Parquet file 4] style Catalog fill:#fff9c4
2.2 Metadata vs data
- Data: Parquet files in S3 (or any object store)
- Metadata: JSON + Avro pointing to data, partitions, stats
- Snapshot: pointer to current state of metadata
Read flow:
- Catalog → current
v_N.metadata.json - Metadata → current manifest list (
snap-X.avro) - Manifest list → relevant manifests (pruned by partition)
- Manifests → relevant data files (pruned by file-level stats)
- Read only needed columns from Parquet files
Pruning at every step → query 1PB but read 100MB.
Write flow:
- Append new Parquet file(s) to data location
- Create new manifest pointing to new file(s)
- New manifest list pointing to all manifests
- New metadata pointing to manifest list
- Atomic commit: catalog updates current metadata pointer
2.3 Snapshot isolation
Every commit = new snapshot. Old snapshots retained (configurable retention).
-- Time travel
SELECT * FROM orders FOR VERSION AS OF 12345;
SELECT * FROM orders FOR TIMESTAMP AS OF '2026-05-16 10:00:00';
-- See snapshot history
SELECT * FROM orders.snapshots;
SELECT * FROM orders.history;2.4 Schema evolution
-- Add column - metadata only, no rewrite
ALTER TABLE orders ADD COLUMN region STRING;
-- Drop column - metadata only
ALTER TABLE orders DROP COLUMN deprecated;
-- Rename column - metadata only
ALTER TABLE orders RENAME COLUMN old_name TO new_name;
-- Type widening - metadata only (int → bigint)
ALTER TABLE orders ALTER COLUMN id TYPE bigint;Iceberg tracks field IDs, not names. Renames don’t break existing data.
vs Delta Lake: column name based, schema changes more conservative.
2.5 Partition evolution
Unique to Iceberg: change partition spec without rewriting data. New writes use new partitioning.
-- Initially partitioned by day
CREATE TABLE events (id BIGINT, ts TIMESTAMP, payload STRING)
USING iceberg
PARTITIONED BY (days(ts));
-- Later, change to month for older data efficiency
ALTER TABLE events DROP PARTITION FIELD days(ts);
ALTER TABLE events ADD PARTITION FIELD months(ts);
-- Old data partitioned by day, new data by month
-- Read transparently combinesvs Hive (legacy): partition is in physical path, can’t change without rewrite.
2.6 Hidden partitioning
CREATE TABLE events (
id BIGINT, ts TIMESTAMP, payload STRING
) USING iceberg
PARTITIONED BY (days(ts));
-- Insert (user writes natural)
INSERT INTO events VALUES (1, TIMESTAMP '2026-05-16 10:00:00', '...');
-- Query (no need to know about partitions)
SELECT * FROM events WHERE ts > '2026-05-15';
-- Iceberg auto partition-prunesUser writes natural query, Iceberg handles partitioning detail. Big improvement over Hive era.
3. Z-Ordering & Sort Order
3.1 Z-order (multi-dim clustering)
Files contain rows close in multiple dimensions.
ALTER TABLE events WRITE ORDERED BY z-order(user_id, ts);
-- After Z-order, query filters benefit:
SELECT * FROM events WHERE user_id = 42 AND ts > '2026-05-15';
-- Far fewer files scannedTrade-off: write-time cost (rewrite to z-order) for read-time speedup. Periodic operation:
CALL system.rewrite_data_files(
table => 'events',
sort_order => 'zorder(user_id, ts)'
);3.2 Liquid clustering (Delta Lake equivalent)
Delta auto-tunes clustering keys.
ALTER TABLE events CLUSTER BY (user_id, ts);3.3 When use Z-order
- Multi-column WHERE common
- Cardinality of columns moderate
- Trade write cost for read speedup
4. Read Optimizations
4.1 Column pruning
Parquet stores columns separately → read only needed columns.
SELECT id, total FROM events;
-- Reads only 'id' + 'total' columns from Parquet, skip rest10-100x I/O save vs row-oriented.
4.2 Predicate pushdown
Filter pushed down to file level. Stats (min/max per file) → skip files entirely.
Query: WHERE ts BETWEEN '2026-05-01' AND '2026-05-07'
Iceberg manifest stats: file1.ts_min=2026-04-01, file1.ts_max=2026-04-30
→ skip file1 entirely
4.3 Manifest pruning
Stats at manifest level → skip whole groups of files. Important for huge tables.
4.4 Bloom filters
Parquet supports Bloom filters per column.
ALTER TABLE events WRITE BLOOM FILTERS (user_id);
-- Now user_id = X queries skip files without that user4.5 Combined effect
Table: 1 PB
Query: filter 1 user, 1 day, project 3 columns
Pruning steps:
- Manifest pruning: 10K manifests → 50 relevant
- File pruning by partition: 50 → 10 files
- Bloom filter on user_id: 10 → 3 files
- Column pruning: 3 columns out of 100
- Row group pruning within file: 50% rows
Net: read ~100MB of 1PB → 0.00001% I/O
5. Maintenance Operations
5.1 Compaction
Many small files = poor read perf. Periodic compaction:
-- Spark/Trino procedure
CALL system.rewrite_data_files(
table => 'db.events',
options => map('min-input-files', '5', 'target-file-size-bytes', '536870912')
);Best practice: schedule daily/weekly, off-peak.
5.2 Snapshot expiration
Old snapshots accumulate. Each retained snapshot = retention cost.
CALL system.expire_snapshots(
table => 'db.events',
older_than => TIMESTAMP '2026-04-16 00:00:00',
retain_last => 100
);Trade-off: retention vs storage cost. Typical: keep 30-90 days.
5.3 Orphan file cleanup
Files no longer referenced:
CALL system.remove_orphan_files(
table => 'db.events',
older_than => TIMESTAMP '2026-04-16 00:00:00'
);⚠️ Caution: dry-run first. Active writers might be creating files seen as orphan in short windows.
5.4 Statistics refresh
CALL system.refresh_statistics(table => 'db.events');
ANALYZE TABLE db.events COMPUTE STATISTICS;Better query planning.
5.5 Manifest rewrite
CALL system.rewrite_manifests(table => 'db.events');Consolidates manifest files when too fragmented.
5.6 Catalog migration
Move table catalog without copying data:
# Iceberg
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0 \
-e "CALL system.register_table('rest_catalog.db.events', 's3://bucket/path/metadata/v1.metadata.json')"6. Streaming Writes
6.1 CDC into Lakehouse
graph LR PG[(Postgres)] -.Debezium.-> Kafka[(Kafka)] Kafka --> Streaming[Spark Structured Streaming / Flink] Streaming -.upsert merge.-> Iceberg[(Iceberg table)] Iceberg --> BI[BI tools]
Pattern:
- CDC events to Kafka
- Streaming job consumes (Spark/Flink)
- Merge into Iceberg (upsert)
6.2 Spark Structured Streaming
from pyspark.sql import SparkSession
stream = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "orders") \
.load()
query = stream.writeStream \
.format("iceberg") \
.outputMode("append") \
.option("checkpointLocation", "/tmp/checkpoint") \
.toTable("lakehouse.orders")
query.awaitTermination()6.3 Flink + Iceberg
TableEnvironment tEnv = ...;
tEnv.executeSql("""
CREATE TABLE events_iceberg (
id BIGINT, ts TIMESTAMP_LTZ(3), data STRING
) PARTITIONED BY (days(ts)) WITH (
'connector' = 'iceberg',
'catalog-name' = 'rest_catalog',
...
)
""");
tEnv.executeSql("INSERT INTO events_iceberg SELECT * FROM kafka_source");Flink writes Iceberg natively, streaming-friendly with checkpointing.
6.4 Upsert semantics
MERGE INTO orders t
USING new_orders s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;Iceberg supports MERGE since v2. Behind scenes: writes deletion vectors (Iceberg V3) or position deletes (V2).
7. Iceberg Spec V3 (2024)
Major spec update in 2024 work:
7.1 Deletion vectors
V2: position deletes (file + row position). V3: bitmap-based deletion vectors. More efficient for many small deletes.
7.2 Variant type
JSON-like data with binary encoding. Better than storing raw JSON strings.
7.3 Encryption
Native column-level encryption in V3.
7.4 Time travel improvements
Better support for change data feed across snapshots.
Status of spec: actively being adopted by engines. Check compatibility before using V3 features.
8. Compute Engines on Lakehouse
8.1 Spark
Most full-featured. Batch + streaming. Heavy footprint.
spark = SparkSession.builder \
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0") \
.config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog") \
.getOrCreate()
spark.sql("SELECT * FROM lakehouse.events LIMIT 10").show()8.2 Trino / Presto
SQL query engine, federated (query multiple sources). Good for ad-hoc analytics.
SELECT count(*) FROM iceberg.events.events WHERE date_col > DATE '2026-01-01';Fast for SELECTs. Less for writes.
8.3 Snowflake (2024 Iceberg support)
Reads + writes Iceberg tables since 2024.
- “External Iceberg tables” (read-only)
- “Iceberg tables managed by Snowflake” (read-write)
No data duplication if data already in S3.
8.4 BigQuery (2024)
External tables on Iceberg in GCS / S3. Federated query.
CREATE EXTERNAL TABLE iceberg_events
OPTIONS (
format = 'ICEBERG',
uris = ['gs://bucket/path/metadata/']
);8.5 DuckDB (2024)
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/path/metadata/v1.metadata.json');Embedded analytics on cloud lakehouse. Surprisingly fast for medium-size queries.
8.6 ClickHouse (2024)
SELECT * FROM iceberg('s3://bucket/path');ClickHouse 24.x added Iceberg + Delta reader. Hot + cold pattern: ClickHouse hot, Iceberg cold, single query joins.
8.7 Athena (AWS)
Serverless Trino. Pay per TB scanned ($5/TB).
8.8 Decision
| Need | Pick |
|---|---|
| Real-time, sub-second hot data | ClickHouse |
| Pure SQL on existing files | DuckDB |
| Serverless, AWS/GCP-native | Athena / BigQuery |
| Multi-cloud, separate compute | Snowflake |
| Open format primary store | Iceberg + Trino |
| Heavy batch processing | Spark |
| Streaming write | Flink + Iceberg |
9. Catalog Wars 2024-2026
9.1 Catalog options
graph TB Catalog[Iceberg catalog<br/>tells engines where tables are] Catalog --> Hive[Hive Metastore<br/>legacy] Catalog --> Glue[AWS Glue<br/>AWS proprietary] Catalog --> REST[REST Catalog<br/>open spec] Catalog --> Polaris[Apache Polaris<br/>Snowflake 2024 OSS] Catalog --> Unity[Unity Catalog<br/>Databricks 2024 OSS] Catalog --> Snow[Snowflake catalog<br/>proprietary] style REST fill:#c8e6c9 style Polaris fill:#c8e6c9 style Unity fill:#c8e6c9
9.2 REST Catalog Spec
Iceberg defined an HTTP REST API for catalog operations. Engines implement this.
Importance: vendor-neutral interface. Switch catalog without re-engineering.
9.3 Polaris (Snowflake)
Apache-incubated 2024. Implements REST spec. Multi-engine access control.
9.4 Unity Catalog OSS (Databricks 2024)
Databricks open-sourced Unity Catalog. Implements REST spec too.
Lineage tracking, governance built-in.
9.5 Choosing catalog
flowchart TD A[Catalog?] --> B{Multi-cloud or vendor-neutral?} B -->|Yes| C[Polaris or Unity Catalog OSS] B -->|No, AWS only| D[Glue] B -->|No, GCP only| E[BigLake / BigQuery metastore] B -->|Snowflake-centric| F[Snowflake catalog] B -->|Databricks-centric| G[Unity Catalog DB] style C fill:#c8e6c9
10. Patterns
10.1 Bronze-Silver-Gold (Medallion)
graph LR Bronze[Bronze<br/>raw, append-only<br/>schema as-is<br/>retain all] --> Silver[Silver<br/>cleaned, validated<br/>conform schema<br/>typed] Silver --> Gold[Gold<br/>business-level<br/>aggregations<br/>ML features] Bronze -.->|reprocess on schema change| Silver Silver -.->|reprocess| Gold
Storage cost: same data 3x. Justified by:
- Reprocessing safety (rebuild Silver/Gold from Bronze)
- Schema evolution flexibility
- Auditability
10.2 SCD (Slowly Changing Dimensions)
Type 2 SCD natural in Iceberg via merge:
MERGE INTO users_history t
USING users_changes c ON t.id = c.id AND t.is_current
WHEN MATCHED AND c.changed THEN
UPDATE SET is_current = false, valid_to = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (id, attrs..., is_current, valid_from)
VALUES (c.id, c.attrs..., true, current_timestamp());10.3 Schema evolution
-- Add column - metadata only
ALTER TABLE events ADD COLUMN region STRING;
-- Existing data NULL for region; new writes have value
-- Reads see schema with new columnIceberg handles via metadata pointer, not data rewrite. Compare to Hive where adding column requires rewriting partitions.
10.4 Time travel for debug
-- Was data correct yesterday?
SELECT count(*) FROM events FOR TIMESTAMP AS OF '2026-05-15 00:00:00'
WHERE region = 'EU';
-- Diff between snapshots
SELECT * FROM events FOR VERSION AS OF 100
EXCEPT
SELECT * FROM events FOR VERSION AS OF 99;Powerful debugging tool.
10.5 Lakehouse + Hot Store
graph LR Source[(OLTP / events)] --> ClickHouse[(ClickHouse hot<br/>last 30 days)] Source -.also.-> Iceberg[(Iceberg cold<br/>5 year retention)] BI[Dashboard] --> ClickHouse Analyst[Ad-hoc query] --> Trino[Trino] --> Iceberg AdHoc -.federated.-> Trino
Hot data in ClickHouse (sub-sec dashboards). Cold in Iceberg (cheap, long retention). Query federated.
11. Comparison to ClickHouse / Snowflake
| Iceberg + Trino | ClickHouse | Snowflake | |
|---|---|---|---|
| Storage | S3 (cheap) | Local SSD (expensive) | S3 micro-partitions (managed) |
| Latency | seconds (S3) | sub-second | sub-second |
| Compute engines | Many | One (CH) | Snowflake only |
| Cost (1TB stored) | $25/month | $200-500/month | $25/month + compute |
| Best for | Long retention, ad-hoc, open format | Real-time dashboards | Enterprise BI, governance |
| Update/delete | Yes (merge) | Limited (ReplacingMT) | Yes |
| Lock-in | None (open format) | High (own format) | High |
Pattern: ClickHouse for hot (recent, real-time queries). Iceberg for cold (historical, batch). Snowflake for managed end-to-end with strong governance need.
12. Anti-patterns
| Pattern | Why bad | Fix |
|---|---|---|
| Too many small files | Bad read perf | Periodic compaction |
| No partition pruning | Full table scan | Design partition matching query |
| Schema thrashing (many evolutions) | Metadata churn | Stabilize schema design |
| Forgetting snapshot expiration | Metadata grows + cost | Regular expiration |
| Use Iceberg for OLTP | Latency too high (seconds) | Use Postgres/CockroachDB |
| Manual file deletion | Breaks tracking | Use system procedures |
| Skip catalog choice | Vendor lock-in surprise | Pick REST-spec catalog |
| Many concurrent writers same table | Commit conflicts | Coordinate, write daily batches |
| Forget statistics refresh | Bad query plans | Schedule |
| Z-order without query pattern match | Wasted compute | Match z-order to actual filter cols |
13. Lab
Day 1: Iceberg basics with PySpark
spark = SparkSession.builder \
.appName("iceberg") \
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0") \
.config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.local.type", "hadoop") \
.config("spark.sql.catalog.local.warehouse", "/tmp/iceberg") \
.getOrCreate()
spark.sql("CREATE TABLE local.db.events (id bigint, ts timestamp, data string) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'hello')")
# Time travel
v1 = spark.sql("SELECT current_snapshot_id FROM local.db.events").collect()[0]
spark.sql("INSERT INTO local.db.events VALUES (2, current_timestamp(), 'world')")
spark.sql(f"SELECT * FROM local.db.events FOR VERSION AS OF {v1}").show()Day 2: Trino on Iceberg
Run Trino, connect to Iceberg catalog, query.
SELECT count(*), date_trunc('hour', ts) AS hour FROM iceberg.db.events
GROUP BY hour ORDER BY hour DESC LIMIT 24;Day 3: DuckDB Iceberg
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/path/metadata/v1.metadata.json') LIMIT 10;Day 4: Streaming with Flink
Setup Flink job consuming Kafka → Iceberg. Verify writes.
Day 5: Maintenance
Run compaction, snapshot expiration, orphan cleanup. Observe metadata files shrink.
Day 6: Schema evolution
ADD COLUMN, DROP COLUMN, RENAME. Verify old snapshots still readable.
Day 7: Partition evolution
Initial partition by day, switch to month. Read both old and new together.
14. Self-check
- Lakehouse 3 eras — value proposition mỗi giai đoạn?
- Iceberg vs Delta vs Hudi — current standing 2024?
- Iceberg layered architecture — catalog → meta → manifest → file. Mục đích?
- Partition evolution — vì sao Iceberg unique vs Hive?
- Z-order — trade-off?
- Iceberg Spec V3 (2024) — đặc điểm mới?
- Catalog wars — Polaris vs Unity Catalog?
- Lakehouse + Hot Store pattern?
- SCD Type 2 trong Iceberg — MERGE pattern?
- Khi nào lakehouse thắng ClickHouse / Snowflake?
15. Tiếp theo
Tuan-Bonus-DB-Security-Compliance
Cập nhật: 2026-05-16