Bonus 7 — Lakehouse Engineering: Iceberg, Delta, Hudi

“Data Lake (cheap S3) + Data Warehouse (rich SQL) = Lakehouse. Tuần này đào sâu engineering side của Apache Iceberg — table format trở thành OLAP standard 2024-2026.”

Tags: database lakehouse iceberg delta parquet data-engineering unity-catalog polaris Thời lượng: 5-6 ngày Prerequisites: Tuan-14-OLAP-Columnar-ClickHouse Liên quan: Case-Design-Modern-Data-Lakehouse (SD course) · Tuan-Bonus-CDC-Debezium


1. Why Lakehouse

1.1 Three eras

graph LR
    DW[Data Warehouse era<br/>Teradata, Oracle<br/>$$$ per TB<br/>structured SQL] --> Lake[Data Lake era<br/>S3 + Parquet<br/>cheap but no ACID<br/>schema chaos] --> Lakehouse[Lakehouse era 2020+<br/>Iceberg/Delta/Hudi<br/>cheap + ACID<br/>multi-engine]

1.2 Lakehouse value

  • Cheap storage (S3 ~0.05/GB egress)
  • Open format (no vendor lock)
  • Many compute engines (Spark, Trino, Snowflake, BigQuery, Athena, DuckDB, ClickHouse 2024)
  • ACID transactions on object store
  • Time travel, schema evolution
  • Format separation from compute (store once, query many)

1.3 Three competing formats — current status 2024-2026

IcebergDelta LakeHudi
OriginNetflix → ApacheDatabricks → Linux FoundationUber → Apache
OSSApache 2.0OSS + Databricks extApache 2.0
Vendor neutralityHighestDatabricks-leaningApache-aligned
Adoption 2024-2026Winning broadlyStrong (Databricks)Smaller
Compute enginesManyMany (esp. Spark)Spark-centric
Time travelYesYesYes
Schema evolutionBestGoodGood
CDCLimitedNative (Delta Change Feed)Strong (built for CDC)
Snapshot isolationYesYesYes

1.4 Industry consolidation 2024

June 2024: Databricks acquired Tabular (Iceberg’s commercial steward, founded by Ryan Blue + Daniel Weeks who created Iceberg at Netflix) for $1B+.

Implications:

  • Databricks now influences both Iceberg + Delta
  • UniForm product bridges Delta ↔ Iceberg (read Delta as Iceberg)
  • Convergence between formats

Apache Polaris (Snowflake, 2024): Open catalog standard for Iceberg.

Unity Catalog OSS (Databricks, 2024): Open catalog (was proprietary).

Catalog war: Polaris (Snowflake) vs Unity Catalog (Databricks) vs HMS + REST → still settling.

1.5 Net effect 2024-2026

  • Iceberg becomes dominant format
  • Catalog standard converging via REST API
  • Multi-engine future (your data in Iceberg, query with whatever)
  • Vendor differentiation moves to: compute, catalog UI, governance, ML integration

2. Apache Iceberg Architecture

2.1 Layers

graph TB
    Catalog[Catalog<br/>Hive Metastore / Glue / REST / Polaris / Unity / Snowflake]
    Catalog --> Meta[Current metadata file<br/>v_N.metadata.json]
    Meta --> ML[Manifest list<br/>snap-X.avro]
    ML --> M1[Manifest 1<br/>data-files-list.avro<br/>+ stats per file]
    ML --> M2[Manifest 2]
    M1 --> D1[Parquet file 1]
    M1 --> D2[Parquet file 2]
    M2 --> D3[Parquet file 3]
    M2 --> D4[Parquet file 4]

    style Catalog fill:#fff9c4

2.2 Metadata vs data

  • Data: Parquet files in S3 (or any object store)
  • Metadata: JSON + Avro pointing to data, partitions, stats
  • Snapshot: pointer to current state of metadata

Read flow:

  1. Catalog → current v_N.metadata.json
  2. Metadata → current manifest list (snap-X.avro)
  3. Manifest list → relevant manifests (pruned by partition)
  4. Manifests → relevant data files (pruned by file-level stats)
  5. Read only needed columns from Parquet files

Pruning at every step → query 1PB but read 100MB.

Write flow:

  1. Append new Parquet file(s) to data location
  2. Create new manifest pointing to new file(s)
  3. New manifest list pointing to all manifests
  4. New metadata pointing to manifest list
  5. Atomic commit: catalog updates current metadata pointer

2.3 Snapshot isolation

Every commit = new snapshot. Old snapshots retained (configurable retention).

-- Time travel
SELECT * FROM orders FOR VERSION AS OF 12345;
SELECT * FROM orders FOR TIMESTAMP AS OF '2026-05-16 10:00:00';
 
-- See snapshot history
SELECT * FROM orders.snapshots;
SELECT * FROM orders.history;

2.4 Schema evolution

-- Add column - metadata only, no rewrite
ALTER TABLE orders ADD COLUMN region STRING;
 
-- Drop column - metadata only
ALTER TABLE orders DROP COLUMN deprecated;
 
-- Rename column - metadata only
ALTER TABLE orders RENAME COLUMN old_name TO new_name;
 
-- Type widening - metadata only (int → bigint)
ALTER TABLE orders ALTER COLUMN id TYPE bigint;

Iceberg tracks field IDs, not names. Renames don’t break existing data.

vs Delta Lake: column name based, schema changes more conservative.

2.5 Partition evolution

Unique to Iceberg: change partition spec without rewriting data. New writes use new partitioning.

-- Initially partitioned by day
CREATE TABLE events (id BIGINT, ts TIMESTAMP, payload STRING)
USING iceberg
PARTITIONED BY (days(ts));
 
-- Later, change to month for older data efficiency
ALTER TABLE events DROP PARTITION FIELD days(ts);
ALTER TABLE events ADD PARTITION FIELD months(ts);
 
-- Old data partitioned by day, new data by month
-- Read transparently combines

vs Hive (legacy): partition is in physical path, can’t change without rewrite.

2.6 Hidden partitioning

CREATE TABLE events (
    id BIGINT, ts TIMESTAMP, payload STRING
) USING iceberg
PARTITIONED BY (days(ts));
 
-- Insert (user writes natural)
INSERT INTO events VALUES (1, TIMESTAMP '2026-05-16 10:00:00', '...');
 
-- Query (no need to know about partitions)
SELECT * FROM events WHERE ts > '2026-05-15';
-- Iceberg auto partition-prunes

User writes natural query, Iceberg handles partitioning detail. Big improvement over Hive era.


3. Z-Ordering & Sort Order

3.1 Z-order (multi-dim clustering)

Files contain rows close in multiple dimensions.

ALTER TABLE events WRITE ORDERED BY z-order(user_id, ts);
 
-- After Z-order, query filters benefit:
SELECT * FROM events WHERE user_id = 42 AND ts > '2026-05-15';
-- Far fewer files scanned

Trade-off: write-time cost (rewrite to z-order) for read-time speedup. Periodic operation:

CALL system.rewrite_data_files(
    table => 'events',
    sort_order => 'zorder(user_id, ts)'
);

3.2 Liquid clustering (Delta Lake equivalent)

Delta auto-tunes clustering keys.

ALTER TABLE events CLUSTER BY (user_id, ts);

3.3 When use Z-order

  • Multi-column WHERE common
  • Cardinality of columns moderate
  • Trade write cost for read speedup

4. Read Optimizations

4.1 Column pruning

Parquet stores columns separately → read only needed columns.

SELECT id, total FROM events;
-- Reads only 'id' + 'total' columns from Parquet, skip rest

10-100x I/O save vs row-oriented.

4.2 Predicate pushdown

Filter pushed down to file level. Stats (min/max per file) → skip files entirely.

Query: WHERE ts BETWEEN '2026-05-01' AND '2026-05-07'
Iceberg manifest stats: file1.ts_min=2026-04-01, file1.ts_max=2026-04-30
→ skip file1 entirely

4.3 Manifest pruning

Stats at manifest level → skip whole groups of files. Important for huge tables.

4.4 Bloom filters

Parquet supports Bloom filters per column.

ALTER TABLE events WRITE BLOOM FILTERS (user_id);
-- Now user_id = X queries skip files without that user

4.5 Combined effect

Table: 1 PB
Query: filter 1 user, 1 day, project 3 columns

Pruning steps:
- Manifest pruning: 10K manifests → 50 relevant
- File pruning by partition: 50 → 10 files
- Bloom filter on user_id: 10 → 3 files
- Column pruning: 3 columns out of 100
- Row group pruning within file: 50% rows

Net: read ~100MB of 1PB → 0.00001% I/O

5. Maintenance Operations

5.1 Compaction

Many small files = poor read perf. Periodic compaction:

-- Spark/Trino procedure
CALL system.rewrite_data_files(
    table => 'db.events',
    options => map('min-input-files', '5', 'target-file-size-bytes', '536870912')
);

Best practice: schedule daily/weekly, off-peak.

5.2 Snapshot expiration

Old snapshots accumulate. Each retained snapshot = retention cost.

CALL system.expire_snapshots(
    table => 'db.events',
    older_than => TIMESTAMP '2026-04-16 00:00:00',
    retain_last => 100
);

Trade-off: retention vs storage cost. Typical: keep 30-90 days.

5.3 Orphan file cleanup

Files no longer referenced:

CALL system.remove_orphan_files(
    table => 'db.events',
    older_than => TIMESTAMP '2026-04-16 00:00:00'
);

⚠️ Caution: dry-run first. Active writers might be creating files seen as orphan in short windows.

5.4 Statistics refresh

CALL system.refresh_statistics(table => 'db.events');
ANALYZE TABLE db.events COMPUTE STATISTICS;

Better query planning.

5.5 Manifest rewrite

CALL system.rewrite_manifests(table => 'db.events');

Consolidates manifest files when too fragmented.

5.6 Catalog migration

Move table catalog without copying data:

# Iceberg
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0 \
    -e "CALL system.register_table('rest_catalog.db.events', 's3://bucket/path/metadata/v1.metadata.json')"

6. Streaming Writes

6.1 CDC into Lakehouse

graph LR
    PG[(Postgres)] -.Debezium.-> Kafka[(Kafka)]
    Kafka --> Streaming[Spark Structured Streaming / Flink]
    Streaming -.upsert merge.-> Iceberg[(Iceberg table)]
    Iceberg --> BI[BI tools]

Pattern:

  • CDC events to Kafka
  • Streaming job consumes (Spark/Flink)
  • Merge into Iceberg (upsert)

6.2 Spark Structured Streaming

from pyspark.sql import SparkSession
 
stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "orders") \
    .load()
 
query = stream.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .option("checkpointLocation", "/tmp/checkpoint") \
    .toTable("lakehouse.orders")
 
query.awaitTermination()
TableEnvironment tEnv = ...;
tEnv.executeSql("""
    CREATE TABLE events_iceberg (
        id BIGINT, ts TIMESTAMP_LTZ(3), data STRING
    ) PARTITIONED BY (days(ts)) WITH (
        'connector' = 'iceberg',
        'catalog-name' = 'rest_catalog',
        ...
    )
""");
 
tEnv.executeSql("INSERT INTO events_iceberg SELECT * FROM kafka_source");

Flink writes Iceberg natively, streaming-friendly with checkpointing.

6.4 Upsert semantics

MERGE INTO orders t
USING new_orders s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Iceberg supports MERGE since v2. Behind scenes: writes deletion vectors (Iceberg V3) or position deletes (V2).


7. Iceberg Spec V3 (2024)

Major spec update in 2024 work:

7.1 Deletion vectors

V2: position deletes (file + row position). V3: bitmap-based deletion vectors. More efficient for many small deletes.

7.2 Variant type

JSON-like data with binary encoding. Better than storing raw JSON strings.

7.3 Encryption

Native column-level encryption in V3.

7.4 Time travel improvements

Better support for change data feed across snapshots.

Status of spec: actively being adopted by engines. Check compatibility before using V3 features.


8. Compute Engines on Lakehouse

8.1 Spark

Most full-featured. Batch + streaming. Heavy footprint.

spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0") \
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog") \
    .getOrCreate()
 
spark.sql("SELECT * FROM lakehouse.events LIMIT 10").show()

8.2 Trino / Presto

SQL query engine, federated (query multiple sources). Good for ad-hoc analytics.

SELECT count(*) FROM iceberg.events.events WHERE date_col > DATE '2026-01-01';

Fast for SELECTs. Less for writes.

8.3 Snowflake (2024 Iceberg support)

Reads + writes Iceberg tables since 2024.

  • “External Iceberg tables” (read-only)
  • “Iceberg tables managed by Snowflake” (read-write)

No data duplication if data already in S3.

8.4 BigQuery (2024)

External tables on Iceberg in GCS / S3. Federated query.

CREATE EXTERNAL TABLE iceberg_events
OPTIONS (
    format = 'ICEBERG',
    uris = ['gs://bucket/path/metadata/']
);

8.5 DuckDB (2024)

INSTALL iceberg;
LOAD iceberg;
 
SELECT * FROM iceberg_scan('s3://bucket/path/metadata/v1.metadata.json');

Embedded analytics on cloud lakehouse. Surprisingly fast for medium-size queries.

8.6 ClickHouse (2024)

SELECT * FROM iceberg('s3://bucket/path');

ClickHouse 24.x added Iceberg + Delta reader. Hot + cold pattern: ClickHouse hot, Iceberg cold, single query joins.

8.7 Athena (AWS)

Serverless Trino. Pay per TB scanned ($5/TB).

8.8 Decision

NeedPick
Real-time, sub-second hot dataClickHouse
Pure SQL on existing filesDuckDB
Serverless, AWS/GCP-nativeAthena / BigQuery
Multi-cloud, separate computeSnowflake
Open format primary storeIceberg + Trino
Heavy batch processingSpark
Streaming writeFlink + Iceberg

9. Catalog Wars 2024-2026

9.1 Catalog options

graph TB
    Catalog[Iceberg catalog<br/>tells engines where tables are]
    Catalog --> Hive[Hive Metastore<br/>legacy]
    Catalog --> Glue[AWS Glue<br/>AWS proprietary]
    Catalog --> REST[REST Catalog<br/>open spec]
    Catalog --> Polaris[Apache Polaris<br/>Snowflake 2024 OSS]
    Catalog --> Unity[Unity Catalog<br/>Databricks 2024 OSS]
    Catalog --> Snow[Snowflake catalog<br/>proprietary]

    style REST fill:#c8e6c9
    style Polaris fill:#c8e6c9
    style Unity fill:#c8e6c9

9.2 REST Catalog Spec

Iceberg defined an HTTP REST API for catalog operations. Engines implement this.

Importance: vendor-neutral interface. Switch catalog without re-engineering.

9.3 Polaris (Snowflake)

Apache-incubated 2024. Implements REST spec. Multi-engine access control.

9.4 Unity Catalog OSS (Databricks 2024)

Databricks open-sourced Unity Catalog. Implements REST spec too.

Lineage tracking, governance built-in.

9.5 Choosing catalog

flowchart TD
    A[Catalog?] --> B{Multi-cloud or vendor-neutral?}
    B -->|Yes| C[Polaris or Unity Catalog OSS]
    B -->|No, AWS only| D[Glue]
    B -->|No, GCP only| E[BigLake / BigQuery metastore]
    B -->|Snowflake-centric| F[Snowflake catalog]
    B -->|Databricks-centric| G[Unity Catalog DB]

    style C fill:#c8e6c9

10. Patterns

10.1 Bronze-Silver-Gold (Medallion)

graph LR
    Bronze[Bronze<br/>raw, append-only<br/>schema as-is<br/>retain all] --> Silver[Silver<br/>cleaned, validated<br/>conform schema<br/>typed]
    Silver --> Gold[Gold<br/>business-level<br/>aggregations<br/>ML features]

    Bronze -.->|reprocess on schema change| Silver
    Silver -.->|reprocess| Gold

Storage cost: same data 3x. Justified by:

  • Reprocessing safety (rebuild Silver/Gold from Bronze)
  • Schema evolution flexibility
  • Auditability

10.2 SCD (Slowly Changing Dimensions)

Type 2 SCD natural in Iceberg via merge:

MERGE INTO users_history t
USING users_changes c ON t.id = c.id AND t.is_current
WHEN MATCHED AND c.changed THEN
    UPDATE SET is_current = false, valid_to = current_timestamp()
WHEN NOT MATCHED THEN
    INSERT (id, attrs..., is_current, valid_from)
    VALUES (c.id, c.attrs..., true, current_timestamp());

10.3 Schema evolution

-- Add column - metadata only
ALTER TABLE events ADD COLUMN region STRING;
-- Existing data NULL for region; new writes have value
-- Reads see schema with new column

Iceberg handles via metadata pointer, not data rewrite. Compare to Hive where adding column requires rewriting partitions.

10.4 Time travel for debug

-- Was data correct yesterday?
SELECT count(*) FROM events FOR TIMESTAMP AS OF '2026-05-15 00:00:00'
WHERE region = 'EU';
 
-- Diff between snapshots
SELECT * FROM events FOR VERSION AS OF 100
EXCEPT
SELECT * FROM events FOR VERSION AS OF 99;

Powerful debugging tool.

10.5 Lakehouse + Hot Store

graph LR
    Source[(OLTP / events)] --> ClickHouse[(ClickHouse hot<br/>last 30 days)]
    Source -.also.-> Iceberg[(Iceberg cold<br/>5 year retention)]

    BI[Dashboard] --> ClickHouse
    Analyst[Ad-hoc query] --> Trino[Trino] --> Iceberg
    AdHoc -.federated.-> Trino

Hot data in ClickHouse (sub-sec dashboards). Cold in Iceberg (cheap, long retention). Query federated.


11. Comparison to ClickHouse / Snowflake

Iceberg + TrinoClickHouseSnowflake
StorageS3 (cheap)Local SSD (expensive)S3 micro-partitions (managed)
Latencyseconds (S3)sub-secondsub-second
Compute enginesManyOne (CH)Snowflake only
Cost (1TB stored)$25/month$200-500/month$25/month + compute
Best forLong retention, ad-hoc, open formatReal-time dashboardsEnterprise BI, governance
Update/deleteYes (merge)Limited (ReplacingMT)Yes
Lock-inNone (open format)High (own format)High

Pattern: ClickHouse for hot (recent, real-time queries). Iceberg for cold (historical, batch). Snowflake for managed end-to-end with strong governance need.


12. Anti-patterns

PatternWhy badFix
Too many small filesBad read perfPeriodic compaction
No partition pruningFull table scanDesign partition matching query
Schema thrashing (many evolutions)Metadata churnStabilize schema design
Forgetting snapshot expirationMetadata grows + costRegular expiration
Use Iceberg for OLTPLatency too high (seconds)Use Postgres/CockroachDB
Manual file deletionBreaks trackingUse system procedures
Skip catalog choiceVendor lock-in surprisePick REST-spec catalog
Many concurrent writers same tableCommit conflictsCoordinate, write daily batches
Forget statistics refreshBad query plansSchedule
Z-order without query pattern matchWasted computeMatch z-order to actual filter cols

13. Lab

Day 1: Iceberg basics with PySpark

spark = SparkSession.builder \
    .appName("iceberg") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg") \
    .getOrCreate()
 
spark.sql("CREATE TABLE local.db.events (id bigint, ts timestamp, data string) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'hello')")
 
# Time travel
v1 = spark.sql("SELECT current_snapshot_id FROM local.db.events").collect()[0]
spark.sql("INSERT INTO local.db.events VALUES (2, current_timestamp(), 'world')")
spark.sql(f"SELECT * FROM local.db.events FOR VERSION AS OF {v1}").show()

Day 2: Trino on Iceberg

Run Trino, connect to Iceberg catalog, query.

SELECT count(*), date_trunc('hour', ts) AS hour FROM iceberg.db.events
GROUP BY hour ORDER BY hour DESC LIMIT 24;

Day 3: DuckDB Iceberg

INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/path/metadata/v1.metadata.json') LIMIT 10;

Setup Flink job consuming Kafka → Iceberg. Verify writes.

Day 5: Maintenance

Run compaction, snapshot expiration, orphan cleanup. Observe metadata files shrink.

Day 6: Schema evolution

ADD COLUMN, DROP COLUMN, RENAME. Verify old snapshots still readable.

Day 7: Partition evolution

Initial partition by day, switch to month. Read both old and new together.


14. Self-check

  1. Lakehouse 3 eras — value proposition mỗi giai đoạn?
  2. Iceberg vs Delta vs Hudi — current standing 2024?
  3. Iceberg layered architecture — catalog → meta → manifest → file. Mục đích?
  4. Partition evolution — vì sao Iceberg unique vs Hive?
  5. Z-order — trade-off?
  6. Iceberg Spec V3 (2024) — đặc điểm mới?
  7. Catalog wars — Polaris vs Unity Catalog?
  8. Lakehouse + Hot Store pattern?
  9. SCD Type 2 trong Iceberg — MERGE pattern?
  10. Khi nào lakehouse thắng ClickHouse / Snowflake?

15. Tiếp theo

Tuan-Bonus-DB-Security-Compliance

Cập nhật: 2026-05-16