Bonus 6 — MySQL Internals & High Performance

“Khóa này Postgres-first vì lý do tốt. Nhưng MySQL/MariaDB chạy nửa internet — đặc biệt Shopify, Booking, FB, Wikipedia, GitHub. Hiểu khác biệt với Postgres khi cần move qua lại.”

Tags: database mysql mariadb innodb percona vitess Thời lượng: 5-6 ngày Prerequisites: Tuan-01-DB-Internals-Refresh (Postgres internals for comparison) Liên quan: Tuan-05-Transactions-Locking · Tuan-08-Zero-Downtime-Migration

1. MySQL Ecosystem 2024-2026

1.1 Forks + flavors

MySQL Community (Oracle) — original, GPLv2 + commercial dual
MariaDB — fork by original author (Monty), OSS purist, GPLv2
Percona Server — Oracle MySQL + perf patches, drop-in
Vitess — sharding layer on MySQL (YouTube, Slack, Square, GitHub)
PlanetScale — managed Vitess (announced shutdown of free tier 2024, still operating)
Aurora MySQL — AWS reimplementation, MySQL-compatible
Aiven, RDS, Cloud SQL — managed MySQL

License divergence:

MySQL (Oracle) GPLv2 + commercial
MariaDB GPLv2 only
Percona GPLv2

1.2 Major version timeline

MySQL 5.6 (2013) — InnoDB online DDL
MySQL 5.7 (2015) — JSON, generated columns
MySQL 8.0 (2018) — window functions, CTEs, atomic DDL, transactional data dictionary
MySQL 8.0.x (continuous releases 2018-2024)
MySQL 8.4 LTS (2024) — first true LTS release model
MySQL 9.0 Innovation (2024) — vector type, JS functions
MariaDB 10.6 LTS (2021) — System-versioned tables, JSON
MariaDB 11.x (2023+) — Optimizer trace, OPTIMIZE ENGINE_TABLE

1.3 MySQL 9.0 — vector support (2024)

CREATE TABLE docs (
    id INT,
    embedding VECTOR(1536)
);
INSERT INTO docs VALUES (1, '[0.1, 0.2, ...]');
SELECT * FROM docs ORDER BY VECTOR_DISTANCE(embedding, '[...]', 'COSINE') LIMIT 10;

Behind pgvector in maturity but emerging.

2. InnoDB Storage Engine

2.1 Architecture

graph TB
    Buffer[InnoDB Buffer Pool<br/>LRU + adaptive] --> Pages[16KB pages]
    Pages --> Tablespace[Tablespace files<br/>.ibd per table or shared]

    Buffer --> RedoLog[Redo log<br/>circular, fixed size<br/>physical changes]
    Buffer --> UndoLog[Undo log<br/>versions for MVCC + rollback]
    Buffer --> Binlog[Binary log<br/>logical changes for replication]

    Buffer --> DoubleWrite[Doublewrite buffer<br/>torn-page protection]

2.2 Clustered Index — fundamental vs Postgres

Key difference: InnoDB table data is physically stored inside primary key B+tree. Table IS the PK index.

Postgres: heap table (unordered) + separate B-tree PK index → PK lookup = index lookup + heap fetch (2 random I/O typical)

InnoDB: PK is the B+tree leaf containing row data → PK lookup = single B+tree walk

Implications:

PK lookup: InnoDB ~1-2x faster
Secondary index lookup: hits secondary → finds PK → looks up PK tree (2 walks) vs Postgres (1 walk + 1 heap fetch)
Choose PK carefully — affects EVERY secondary index size
PK should be: small, monotonic for insert performance
Updating PK = physically move row (avoid!)

2.3 Why monotonic PK matters in InnoDB

UUID v4 PK in InnoDB:

Random insert position in B+tree
Page splits all over
Slower writes (lot of fragmentation)
Worse than Postgres for same workload

→ Use auto-increment BIGINT, UUID v7, or app-generated sortable ID.

2.4 Page format

Similar concept to Postgres but 16KB default (vs Postgres 8KB).

More space per page, less overhead per row
Higher fan-out in B+tree (~3 levels for billion rows)

2.5 Three logs

Different from Postgres’ single WAL:

Redo log:

Physical changes (“page X bytes Y..Z changed to …“)
Circular, fixed size
Recovery from crash
MySQL 8.0.30+: innodb_redo_log_capacity replaces old innodb_log_file_size × innodb_log_files_in_group

# Modern (8.0.30+)
innodb_redo_log_capacity = 4G
 
# Legacy (deprecated)
innodb_log_file_size = 2G
innodb_log_files_in_group = 2

Undo log:

Pre-modification values
Used for: rollback + MVCC reads (old versions)
Separate from redo, stored in undo tablespace
Auto-truncated when transactions complete

Binary log (binlog):

Logical changes (statements or row events)
For replication
For PITR
Append only

3 logs vs Postgres 1 (WAL). More to manage, more nuance, more places things can go wrong.

2.6 Doublewrite buffer

Combat torn page writes. Before writing pages to .ibd, write to doublewrite buffer first. If crash mid-write, recover from doublewrite.

In Postgres = full_page_writes in WAL. Different mechanism, same goal.

2.7 Atomic DDL (MySQL 8.0+)

Pre-8.0: DDL was multi-step, could leave inconsistent state on crash. 8.0+: data dictionary in InnoDB, DDL atomic.

ALTER TABLE huge ADD COLUMN x INT;  -- atomic, recoverable

2.8 Instant DDL (8.0.12+, expanded 8.0.29+)

Many DDL operations now instant (metadata only):

ADD COLUMN at end (8.0.12+)
ADD COLUMN anywhere (8.0.29+)
DROP COLUMN (8.0.29+)
RENAME COLUMN (8.0.x)

-- Instant in 8.0.29+
ALTER TABLE huge ADD COLUMN new_col INT, ALGORITHM=INSTANT;
ALTER TABLE huge DROP COLUMN old_col, ALGORITHM=INSTANT;

Limitations: not all columns; row format must support.

2.9 Histogram statistics (8.0.3+)

Optimizer uses histograms for non-indexed columns.

ANALYZE TABLE orders UPDATE HISTOGRAM ON status WITH 100 BUCKETS;

Helps planner pick join order without index on every column.

3. MVCC in InnoDB

3.1 Read view per transaction

Each transaction allocates read view at start (RR) or per-statement (RC)
Tracks active transactions at snapshot
Reads use undo log to reconstruct prior versions

3.2 Differences from Postgres

	Postgres MVCC	InnoDB MVCC
Old versions location	In heap (same page)	Separate undo log
VACUUM needed?	Yes (clean dead tuples)	No (auto-truncate undo)
Tuple has xmin/xmax	Yes	Hidden DB_TRX_ID column
MVCC bloat	Yes (table grows)	No (rollback segment)
Replication	WAL streaming	Binlog (logical)

3.3 Hidden columns

InnoDB tables internally have 3 extra columns:

DB_TRX_ID — last transaction that modified the row
DB_ROLL_PTR — pointer to undo log entry
DB_ROW_ID — used when no PK defined (avoid this!)

4. Locking — Gap & Next-Key Locks

4.1 Row locks

SELECT ... FOR UPDATE, FOR SHARE similar to Postgres.

4.2 Gap locks — InnoDB unique

In REPEATABLE READ (default), InnoDB locks not just rows but gaps between rows to prevent phantom reads.

-- Session A
BEGIN;
SELECT * FROM users WHERE age BETWEEN 20 AND 30 FOR UPDATE;
-- Locks rows 20-30, AND gaps before 20, after 30, between rows
 
-- Session B
INSERT INTO users (name, age) VALUES ('X', 25);
-- BLOCKS due to gap lock

vs Postgres SERIALIZABLE = SSI with predicate locks. Different mechanism, same goal.

4.3 Next-key locks

Combination of row lock + gap before it.

Index: (10, 20, 30, 40)
Next-key lock on 30:
  - Locks row 30
  - Locks gap (20, 30] (exclusive 20, inclusive 30)

4.4 Deadlock pattern

Deadlocks more common in InnoDB due to gap locks. Tune:

innodb_lock_wait_timeout = 50    # default 50s
innodb_print_all_deadlocks = ON  # log all

Workaround for high deadlock rate: drop to READ COMMITTED isolation (no gap locks). App handles phantom reads.

SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

Pattern at scale (Shopify, GitHub): default RC, not RR. Postgres opposite (default RC, opt-in RR/Serializable).

4.5 Auto-increment locks (8.0+)

innodb_autoinc_lock_mode = 2 — “interleaved” — no lock for INSERT
Default in MySQL 8.0+
Pre-8.0 default 1 — could lock for INSERT…SELECT
Use 2 for high-concurrency INSERT

5. Replication

5.1 Binlog-based async (traditional)

Default. Primary writes binlog. Replicas read + apply.

# my.cnf primary
log-bin = mysql-bin
server-id = 1
binlog_format = ROW    # ROW > STATEMENT for safety
gtid_mode = ON
enforce_gtid_consistency = ON
 
# replica
server-id = 2
replicate-do-db = appdb

5.2 GTID — Global Transaction ID

Modern replication uses GTID (server_uuid:txn_number). Auto-positioning replicas:

-- Replica setup auto
CHANGE MASTER TO MASTER_HOST='primary', MASTER_USER='repl', MASTER_AUTO_POSITION=1;
START SLAVE;  -- or START REPLICA in 8.0.22+

No need to track binlog file + position manually.

5.3 Semi-sync

Primary waits for AT LEAST 1 replica to ack before commit.

rpl_semi_sync_master_enabled = ON
rpl_semi_sync_master_timeout = 1000  # ms (fallback to async)

Compromise: tighter durability without full sync overhead.

5.4 Parallel replication (8.0+)

Replica apply parallelized:

slave_parallel_workers = 8
slave_parallel_type = LOGICAL_CLOCK   # default 8.0

LOGICAL_CLOCK based on commit timestamps → parallel apply for non-conflicting transactions.

This is a bigger advantage MySQL has over Postgres which still has single-threaded WAL apply.

5.5 Group Replication (5.7.17+)

Built-in cluster, multi-primary or single-primary. Paxos-based.

plugin_load_add = 'group_replication.so'
group_replication_group_seeds = "node1:33061,node2:33061,node3:33061"
group_replication_start_on_boot = ON

vs Postgres: Patroni, repmgr handle similar role at orchestration level.

5.6 InnoDB Cluster

Group Replication + MySQL Router + MySQL Shell admin → “InnoDB Cluster”. Easier setup than raw Group Replication.

5.7 InnoDB ClusterSet (8.0.27+)

Disaster recovery across data centers. Cross-region MySQL clusters.

6. Performance Patterns

6.1 Buffer pool sizing

innodb_buffer_pool_size = 12G  # ~70-80% RAM (different rule from Postgres 25-35%)
innodb_buffer_pool_instances = 8  # default 8 for buffer_pool > 1GB

Why higher than Postgres? InnoDB uses buffer pool more aggressively, doesn’t rely heavily on OS page cache.

Postgres lets OS handle disk cache; MySQL takes more responsibility.

6.2 Redo log sizing

# MySQL 8.0.30+ modern
innodb_redo_log_capacity = 4G
 
# Pre-8.0.30 (still works, deprecated)
innodb_log_file_size = 2G
innodb_log_files_in_group = 2

Larger redo log → fewer checkpoints, better write throughput. Trade-off: longer recovery time.

6.3 Flush settings

innodb_flush_log_at_trx_commit = 1   # 1=durable, 2=lose 1s on crash, 0=lose more
sync_binlog = 1                       # fsync binlog after each commit

1, 1 is ACID-compliant. 2, 0 faster but data loss on crash. Trade-off knob.

# Production safe
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
innodb_flush_method = O_DIRECT       # bypass OS cache for InnoDB

6.4 Performance Schema

Like Postgres pg_stat_statements but more granular.

SELECT * FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC LIMIT 10;
 
-- Top queries by time
SELECT digest_text, count_star, avg_timer_wait/1e6 AS avg_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC LIMIT 10;
 
-- Wait events
SELECT * FROM performance_schema.events_waits_summary_global_by_event_name
ORDER BY sum_timer_wait DESC LIMIT 20;

6.5 sys schema

Wrapper views on performance_schema for easier human reading.

SELECT * FROM sys.statements_with_runtimes_in_95th_percentile;
SELECT * FROM sys.io_global_by_file_by_bytes;
SELECT * FROM sys.schema_unused_indexes;
SELECT * FROM sys.schema_redundant_indexes;

6.6 Query cache (deprecated)

MySQL 8.0 removed query cache entirely.

Pre-8.0 query cache often hurt more than helped (cache invalidation overhead). Use Redis or app-level instead.

6.7 Optimizer hints (modern syntax)

SELECT /*+ INDEX(orders idx_user) */ * FROM orders WHERE user_id = 42;
SELECT /*+ MAX_EXECUTION_TIME(1000) */ ...;
SELECT /*+ NO_INDEX_MERGE(orders) */ ...;

More versatile than legacy USE INDEX.

6.8 ProxySQL — query routing

MySQL middleware:

Read/write split (proxy to replicas for reads)
Connection pooling
Query firewall (reject patterns)
Query rewriting

Setup:

INSERT INTO mysql_servers VALUES (1, 'primary', 3306, 1000, 'ONLINE');
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup)
VALUES (1, 1, '^SELECT.*FOR UPDATE', 1);  -- writes to primary

vs Postgres pgBouncer + app-level routing.

7. Sharding with Vitess

7.1 Architecture

graph TB
    Client --> Gate[VTGate<br/>SQL parser, router<br/>stateless]
    Gate --> Tablet1[VTTablet 1<br/>shard 1]
    Gate --> Tablet2[VTTablet 2<br/>shard 2]
    Gate --> Tablet3[VTTablet 3<br/>shard 3]

    Tablet1 --> MySQL1[(MySQL)]
    Tablet2 --> MySQL2[(MySQL)]
    Tablet3 --> MySQL3[(MySQL)]

    Topo[Topology server<br/>etcd/zk/consul]
    Tablet1 -.-> Topo
    Tablet2 -.-> Topo
    Tablet3 -.-> Topo
    Gate -.-> Topo

VTGate makes shards look like single MySQL to client.

7.2 VSchema

keyspaces:
  app:
    sharded: true
    vindexes:
      hash:
        type: hash
    tables:
      users:
        column_vindexes:
          - column: id
            name: hash
      orders:
        column_vindexes:
          - column: user_id
            name: hash

Routes by hashing column. VTGate parses query, routes to correct shard(s).

7.3 Operations

# Reshard
vtctldclient Reshard --workflow=v1 \
    --source-shards=- --target-shards=-80,80-
 
# MoveTables
vtctldclient MoveTables --workflow=move --target-keyspace=appdb \
    --source-keyspace=legacy --tables=users,orders

Online resharding — split shards as data grows. Big advantage over manual.

7.4 When Vitess

Outgrew single MySQL (>10TB or >50K TPS sustained)
Need to keep MySQL ecosystem
Want online resharding
Team comfortable with infra complexity

Big users: YouTube, Slack, GitHub (some), Square, Shopify (with Mysql sharding).

7.5 PlanetScale

Managed Vitess. Branching workflow (like git for schema), zero-downtime migrations via gh-ost integration.

Notable: removed free tier April 2024. Still operates as paid service.

8. Migration Tools Specific to MySQL

8.1 gh-ost (covered Tuần 08)

Online schema change. Better than pt-osc for most cases.

gh-ost \
    --alter="ADD COLUMN test_col INT NOT NULL DEFAULT 0" \
    --database=test --table=users \
    --execute

How: ghost table + binlog replay → atomic swap.

8.2 pt-online-schema-change (Percona Toolkit)

Older, trigger-based. Still works.

8.3 Vitess MoveTables, Reshard

Move data between keyspaces without downtime.

8.4 Percona Toolkit

Mature ecosystem:

pt-archiver — archive old rows
pt-table-checksum — verify replica integrity
pt-deadlock-logger — log deadlocks
pt-query-digest — analyze slow log
pt-summary — system info

8.5 mydumper / myloader

Multi-threaded mysqldump replacement. Much faster for big DBs.

9. Postgres vs MySQL Decision

flowchart TD
    A[Choose] --> B{Existing ecosystem?}
    B -->|Postgres apps team| C[Postgres]
    B -->|MySQL apps team| D[MySQL]
    A --> E{Heavy schema evolution?}
    E -->|Yes complex DDL| F[Postgres - better online DDL legacy]
    E -->|MySQL 8.0.29+ instant DDL OK| OK[Either]
    A --> G{Need rich types like JSONB, arrays, custom?}
    G -->|Yes| F
    A --> H{Massive horizontal scale needed?}
    H -->|Yes >10TB sharded| I[Vitess/MySQL or distributed Postgres]
    A --> J{Replication complex multi-region?}
    J -->|Yes| K[MySQL Group Replication or Aurora]
    A --> L{Parallel replication apply critical?}
    L -->|Yes high write rate| K2[MySQL 8 - parallel apply ahead of PG]

    style C fill:#c8e6c9
    style F fill:#c8e6c9

Personal opinion 2024-2026:

New project, no constraint → Postgres for richer features
Need parallel replica apply → MySQL 8
Heavy sharding from day 1 → Vitess (MySQL) or CockroachDB (Postgres-compat)
Existing team strong in one → stay with it; both are excellent

10. Anti-patterns

Pattern	Why bad	Fix
UUID v4 PK in InnoDB	Page splits, slow inserts	Auto-inc or UUID v7
No PK on InnoDB table	Hidden DB_ROW_ID slow + 6 bytes/row	Always explicit PK
`innodb_buffer_pool_size = 25% RAM` (Postgres rule)	InnoDB needs more	70-80% RAM
`binlog_format = STATEMENT`	Unsafe with non-deterministic functions	ROW
`innodb_flush_log_at_trx_commit = 0` for OLTP	Data loss on crash	1 (default)
Enabling query cache pre-8.0	Mostly hurts perf	Already removed in 8.0
Long-running transactions	Undo log grows	Short tx; monitor `innodb_history_list_length`
Frequent gap lock deadlocks	Default RR	RC isolation level
Forgot to set `innodb_redo_log_capacity` (or old `log_file_size`)	Frequent checkpoints	Sizing right
Auto-increment lock mode 0	Serialize inserts	Mode 2 (default 8.0+)

11. Lab

Day 1: MySQL setup + performance schema

docker run -d -p 3306:3306 -e MYSQL_ROOT_PASSWORD=lab mysql:8.0

Enable performance schema, run mysqlslap workload, observe.

Day 2: Clustered index demonstration

Create table with auto-increment PK
Insert 1M rows, time it
Repeat with UUID v4 PK, observe slower
Compare with Postgres equivalent

Day 3: Gap locks reproduce

Two sessions, REPEATABLE READ, reproduce gap lock blocking. Switch to READ COMMITTED, observe no block.

Day 4: GTID replication

2 MySQL instances, GTID-based replication. Test failover.

docker run -d --name primary mysql:8.0 mysqld --log-bin --server-id=1 --gtid-mode=ON --enforce-gtid-consistency
docker run -d --name replica mysql:8.0 mysqld --server-id=2 --gtid-mode=ON --enforce-gtid-consistency --read-only

Day 5: Parallel replication

Generate write workload. Watch replica lag with parallel workers 1 vs 8. Compare.

Day 6: gh-ost migration

Run online schema change on 10M-row table while load running.

Day 7: ProxySQL routing

Setup: app → ProxySQL → primary + replicas. Configure read/write split. Test.

12. Self-check

Clustered index InnoDB vs heap Postgres — pros/cons mỗi cái?
UUID v4 PK trong InnoDB — vì sao tệ hơn Postgres?
Redo / Undo / Binlog — 3 logs InnoDB, vai trò?
Gap lock — vì sao deadlock common hơn Postgres?
innodb_buffer_pool_size 70-80% RAM — vì sao khác Postgres?
GTID replication — auto-positioning là gì?
Parallel replication 8.0+ — advantage so Postgres?
Instant DDL 8.0.29+ — operations support?
Vitess sharding cơ chế?
Khi nào pick MySQL vs Postgres 2024-2026?

13. Tiếp theo

Tuan-Bonus-Lakehouse-Engineering

Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-Bonus-MySQL-Internals