Bonus 6 — MySQL Internals & High Performance
“Khóa này Postgres-first vì lý do tốt. Nhưng MySQL/MariaDB chạy nửa internet — đặc biệt Shopify, Booking, FB, Wikipedia, GitHub. Hiểu khác biệt với Postgres khi cần move qua lại.”
Tags: database mysql mariadb innodb percona vitess Thời lượng: 5-6 ngày Prerequisites: Tuan-01-DB-Internals-Refresh (Postgres internals for comparison) Liên quan: Tuan-05-Transactions-Locking · Tuan-08-Zero-Downtime-Migration
1. MySQL Ecosystem 2024-2026
1.1 Forks + flavors
- MySQL Community (Oracle) — original, GPLv2 + commercial dual
- MariaDB — fork by original author (Monty), OSS purist, GPLv2
- Percona Server — Oracle MySQL + perf patches, drop-in
- Vitess — sharding layer on MySQL (YouTube, Slack, Square, GitHub)
- PlanetScale — managed Vitess (announced shutdown of free tier 2024, still operating)
- Aurora MySQL — AWS reimplementation, MySQL-compatible
- Aiven, RDS, Cloud SQL — managed MySQL
License divergence:
- MySQL (Oracle) GPLv2 + commercial
- MariaDB GPLv2 only
- Percona GPLv2
1.2 Major version timeline
- MySQL 5.6 (2013) — InnoDB online DDL
- MySQL 5.7 (2015) — JSON, generated columns
- MySQL 8.0 (2018) — window functions, CTEs, atomic DDL, transactional data dictionary
- MySQL 8.0.x (continuous releases 2018-2024)
- MySQL 8.4 LTS (2024) — first true LTS release model
- MySQL 9.0 Innovation (2024) — vector type, JS functions
- MariaDB 10.6 LTS (2021) — System-versioned tables, JSON
- MariaDB 11.x (2023+) — Optimizer trace, OPTIMIZE ENGINE_TABLE
1.3 MySQL 9.0 — vector support (2024)
CREATE TABLE docs (
id INT,
embedding VECTOR(1536)
);
INSERT INTO docs VALUES (1, '[0.1, 0.2, ...]');
SELECT * FROM docs ORDER BY VECTOR_DISTANCE(embedding, '[...]', 'COSINE') LIMIT 10;Behind pgvector in maturity but emerging.
2. InnoDB Storage Engine
2.1 Architecture
graph TB Buffer[InnoDB Buffer Pool<br/>LRU + adaptive] --> Pages[16KB pages] Pages --> Tablespace[Tablespace files<br/>.ibd per table or shared] Buffer --> RedoLog[Redo log<br/>circular, fixed size<br/>physical changes] Buffer --> UndoLog[Undo log<br/>versions for MVCC + rollback] Buffer --> Binlog[Binary log<br/>logical changes for replication] Buffer --> DoubleWrite[Doublewrite buffer<br/>torn-page protection]
2.2 Clustered Index — fundamental vs Postgres
Key difference: InnoDB table data is physically stored inside primary key B+tree. Table IS the PK index.
Postgres: heap table (unordered) + separate B-tree PK index → PK lookup = index lookup + heap fetch (2 random I/O typical)
InnoDB: PK is the B+tree leaf containing row data → PK lookup = single B+tree walk
Implications:
- PK lookup: InnoDB ~1-2x faster
- Secondary index lookup: hits secondary → finds PK → looks up PK tree (2 walks) vs Postgres (1 walk + 1 heap fetch)
- Choose PK carefully — affects EVERY secondary index size
- PK should be: small, monotonic for insert performance
- Updating PK = physically move row (avoid!)
2.3 Why monotonic PK matters in InnoDB
UUID v4 PK in InnoDB:
- Random insert position in B+tree
- Page splits all over
- Slower writes (lot of fragmentation)
- Worse than Postgres for same workload
→ Use auto-increment BIGINT, UUID v7, or app-generated sortable ID.
2.4 Page format
Similar concept to Postgres but 16KB default (vs Postgres 8KB).
- More space per page, less overhead per row
- Higher fan-out in B+tree (~3 levels for billion rows)
2.5 Three logs
Different from Postgres’ single WAL:
Redo log:
- Physical changes (“page X bytes Y..Z changed to …“)
- Circular, fixed size
- Recovery from crash
- MySQL 8.0.30+:
innodb_redo_log_capacityreplaces oldinnodb_log_file_size×innodb_log_files_in_group
# Modern (8.0.30+)
innodb_redo_log_capacity = 4G
# Legacy (deprecated)
innodb_log_file_size = 2G
innodb_log_files_in_group = 2Undo log:
- Pre-modification values
- Used for: rollback + MVCC reads (old versions)
- Separate from redo, stored in undo tablespace
- Auto-truncated when transactions complete
Binary log (binlog):
- Logical changes (statements or row events)
- For replication
- For PITR
- Append only
3 logs vs Postgres 1 (WAL). More to manage, more nuance, more places things can go wrong.
2.6 Doublewrite buffer
Combat torn page writes. Before writing pages to .ibd, write to doublewrite buffer first. If crash mid-write, recover from doublewrite.
In Postgres = full_page_writes in WAL. Different mechanism, same goal.
2.7 Atomic DDL (MySQL 8.0+)
Pre-8.0: DDL was multi-step, could leave inconsistent state on crash. 8.0+: data dictionary in InnoDB, DDL atomic.
ALTER TABLE huge ADD COLUMN x INT; -- atomic, recoverable2.8 Instant DDL (8.0.12+, expanded 8.0.29+)
Many DDL operations now instant (metadata only):
- ADD COLUMN at end (8.0.12+)
- ADD COLUMN anywhere (8.0.29+)
- DROP COLUMN (8.0.29+)
- RENAME COLUMN (8.0.x)
-- Instant in 8.0.29+
ALTER TABLE huge ADD COLUMN new_col INT, ALGORITHM=INSTANT;
ALTER TABLE huge DROP COLUMN old_col, ALGORITHM=INSTANT;Limitations: not all columns; row format must support.
2.9 Histogram statistics (8.0.3+)
Optimizer uses histograms for non-indexed columns.
ANALYZE TABLE orders UPDATE HISTOGRAM ON status WITH 100 BUCKETS;Helps planner pick join order without index on every column.
3. MVCC in InnoDB
3.1 Read view per transaction
- Each transaction allocates read view at start (RR) or per-statement (RC)
- Tracks active transactions at snapshot
- Reads use undo log to reconstruct prior versions
3.2 Differences from Postgres
| Postgres MVCC | InnoDB MVCC | |
|---|---|---|
| Old versions location | In heap (same page) | Separate undo log |
| VACUUM needed? | Yes (clean dead tuples) | No (auto-truncate undo) |
| Tuple has xmin/xmax | Yes | Hidden DB_TRX_ID column |
| MVCC bloat | Yes (table grows) | No (rollback segment) |
| Replication | WAL streaming | Binlog (logical) |
3.3 Hidden columns
InnoDB tables internally have 3 extra columns:
DB_TRX_ID— last transaction that modified the rowDB_ROLL_PTR— pointer to undo log entryDB_ROW_ID— used when no PK defined (avoid this!)
4. Locking — Gap & Next-Key Locks
4.1 Row locks
SELECT ... FOR UPDATE, FOR SHARE similar to Postgres.
4.2 Gap locks — InnoDB unique
In REPEATABLE READ (default), InnoDB locks not just rows but gaps between rows to prevent phantom reads.
-- Session A
BEGIN;
SELECT * FROM users WHERE age BETWEEN 20 AND 30 FOR UPDATE;
-- Locks rows 20-30, AND gaps before 20, after 30, between rows
-- Session B
INSERT INTO users (name, age) VALUES ('X', 25);
-- BLOCKS due to gap lockvs Postgres SERIALIZABLE = SSI with predicate locks. Different mechanism, same goal.
4.3 Next-key locks
Combination of row lock + gap before it.
Index: (10, 20, 30, 40)
Next-key lock on 30:
- Locks row 30
- Locks gap (20, 30] (exclusive 20, inclusive 30)
4.4 Deadlock pattern
Deadlocks more common in InnoDB due to gap locks. Tune:
innodb_lock_wait_timeout = 50 # default 50s
innodb_print_all_deadlocks = ON # log allWorkaround for high deadlock rate: drop to READ COMMITTED isolation (no gap locks). App handles phantom reads.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;Pattern at scale (Shopify, GitHub): default RC, not RR. Postgres opposite (default RC, opt-in RR/Serializable).
4.5 Auto-increment locks (8.0+)
innodb_autoinc_lock_mode = 2— “interleaved” — no lock for INSERT- Default in MySQL 8.0+
- Pre-8.0 default 1 — could lock for INSERT…SELECT
- Use 2 for high-concurrency INSERT
5. Replication
5.1 Binlog-based async (traditional)
Default. Primary writes binlog. Replicas read + apply.
# my.cnf primary
log-bin = mysql-bin
server-id = 1
binlog_format = ROW # ROW > STATEMENT for safety
gtid_mode = ON
enforce_gtid_consistency = ON
# replica
server-id = 2
replicate-do-db = appdb5.2 GTID — Global Transaction ID
Modern replication uses GTID (server_uuid:txn_number). Auto-positioning replicas:
-- Replica setup auto
CHANGE MASTER TO MASTER_HOST='primary', MASTER_USER='repl', MASTER_AUTO_POSITION=1;
START SLAVE; -- or START REPLICA in 8.0.22+No need to track binlog file + position manually.
5.3 Semi-sync
Primary waits for AT LEAST 1 replica to ack before commit.
rpl_semi_sync_master_enabled = ON
rpl_semi_sync_master_timeout = 1000 # ms (fallback to async)Compromise: tighter durability without full sync overhead.
5.4 Parallel replication (8.0+)
Replica apply parallelized:
slave_parallel_workers = 8
slave_parallel_type = LOGICAL_CLOCK # default 8.0LOGICAL_CLOCK based on commit timestamps → parallel apply for non-conflicting transactions.
This is a bigger advantage MySQL has over Postgres which still has single-threaded WAL apply.
5.5 Group Replication (5.7.17+)
Built-in cluster, multi-primary or single-primary. Paxos-based.
plugin_load_add = 'group_replication.so'
group_replication_group_seeds = "node1:33061,node2:33061,node3:33061"
group_replication_start_on_boot = ONvs Postgres: Patroni, repmgr handle similar role at orchestration level.
5.6 InnoDB Cluster
Group Replication + MySQL Router + MySQL Shell admin → “InnoDB Cluster”. Easier setup than raw Group Replication.
5.7 InnoDB ClusterSet (8.0.27+)
Disaster recovery across data centers. Cross-region MySQL clusters.
6. Performance Patterns
6.1 Buffer pool sizing
innodb_buffer_pool_size = 12G # ~70-80% RAM (different rule from Postgres 25-35%)
innodb_buffer_pool_instances = 8 # default 8 for buffer_pool > 1GBWhy higher than Postgres? InnoDB uses buffer pool more aggressively, doesn’t rely heavily on OS page cache.
Postgres lets OS handle disk cache; MySQL takes more responsibility.
6.2 Redo log sizing
# MySQL 8.0.30+ modern
innodb_redo_log_capacity = 4G
# Pre-8.0.30 (still works, deprecated)
innodb_log_file_size = 2G
innodb_log_files_in_group = 2Larger redo log → fewer checkpoints, better write throughput. Trade-off: longer recovery time.
6.3 Flush settings
innodb_flush_log_at_trx_commit = 1 # 1=durable, 2=lose 1s on crash, 0=lose more
sync_binlog = 1 # fsync binlog after each commit1, 1 is ACID-compliant. 2, 0 faster but data loss on crash. Trade-off knob.
# Production safe
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
innodb_flush_method = O_DIRECT # bypass OS cache for InnoDB6.4 Performance Schema
Like Postgres pg_stat_statements but more granular.
SELECT * FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC LIMIT 10;
-- Top queries by time
SELECT digest_text, count_star, avg_timer_wait/1e6 AS avg_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC LIMIT 10;
-- Wait events
SELECT * FROM performance_schema.events_waits_summary_global_by_event_name
ORDER BY sum_timer_wait DESC LIMIT 20;6.5 sys schema
Wrapper views on performance_schema for easier human reading.
SELECT * FROM sys.statements_with_runtimes_in_95th_percentile;
SELECT * FROM sys.io_global_by_file_by_bytes;
SELECT * FROM sys.schema_unused_indexes;
SELECT * FROM sys.schema_redundant_indexes;6.6 Query cache (deprecated)
MySQL 8.0 removed query cache entirely.
Pre-8.0 query cache often hurt more than helped (cache invalidation overhead). Use Redis or app-level instead.
6.7 Optimizer hints (modern syntax)
SELECT /*+ INDEX(orders idx_user) */ * FROM orders WHERE user_id = 42;
SELECT /*+ MAX_EXECUTION_TIME(1000) */ ...;
SELECT /*+ NO_INDEX_MERGE(orders) */ ...;More versatile than legacy USE INDEX.
6.8 ProxySQL — query routing
MySQL middleware:
- Read/write split (proxy to replicas for reads)
- Connection pooling
- Query firewall (reject patterns)
- Query rewriting
Setup:
INSERT INTO mysql_servers VALUES (1, 'primary', 3306, 1000, 'ONLINE');
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup)
VALUES (1, 1, '^SELECT.*FOR UPDATE', 1); -- writes to primaryvs Postgres pgBouncer + app-level routing.
7. Sharding with Vitess
7.1 Architecture
graph TB Client --> Gate[VTGate<br/>SQL parser, router<br/>stateless] Gate --> Tablet1[VTTablet 1<br/>shard 1] Gate --> Tablet2[VTTablet 2<br/>shard 2] Gate --> Tablet3[VTTablet 3<br/>shard 3] Tablet1 --> MySQL1[(MySQL)] Tablet2 --> MySQL2[(MySQL)] Tablet3 --> MySQL3[(MySQL)] Topo[Topology server<br/>etcd/zk/consul] Tablet1 -.-> Topo Tablet2 -.-> Topo Tablet3 -.-> Topo Gate -.-> Topo
VTGate makes shards look like single MySQL to client.
7.2 VSchema
keyspaces:
app:
sharded: true
vindexes:
hash:
type: hash
tables:
users:
column_vindexes:
- column: id
name: hash
orders:
column_vindexes:
- column: user_id
name: hashRoutes by hashing column. VTGate parses query, routes to correct shard(s).
7.3 Operations
# Reshard
vtctldclient Reshard --workflow=v1 \
--source-shards=- --target-shards=-80,80-
# MoveTables
vtctldclient MoveTables --workflow=move --target-keyspace=appdb \
--source-keyspace=legacy --tables=users,ordersOnline resharding — split shards as data grows. Big advantage over manual.
7.4 When Vitess
- Outgrew single MySQL (>10TB or >50K TPS sustained)
- Need to keep MySQL ecosystem
- Want online resharding
- Team comfortable with infra complexity
Big users: YouTube, Slack, GitHub (some), Square, Shopify (with Mysql sharding).
7.5 PlanetScale
Managed Vitess. Branching workflow (like git for schema), zero-downtime migrations via gh-ost integration.
Notable: removed free tier April 2024. Still operates as paid service.
8. Migration Tools Specific to MySQL
8.1 gh-ost (covered Tuần 08)
Online schema change. Better than pt-osc for most cases.
gh-ost \
--alter="ADD COLUMN test_col INT NOT NULL DEFAULT 0" \
--database=test --table=users \
--executeHow: ghost table + binlog replay → atomic swap.
8.2 pt-online-schema-change (Percona Toolkit)
Older, trigger-based. Still works.
8.3 Vitess MoveTables, Reshard
Move data between keyspaces without downtime.
8.4 Percona Toolkit
Mature ecosystem:
pt-archiver— archive old rowspt-table-checksum— verify replica integritypt-deadlock-logger— log deadlockspt-query-digest— analyze slow logpt-summary— system info
8.5 mydumper / myloader
Multi-threaded mysqldump replacement. Much faster for big DBs.
9. Postgres vs MySQL Decision
flowchart TD A[Choose] --> B{Existing ecosystem?} B -->|Postgres apps team| C[Postgres] B -->|MySQL apps team| D[MySQL] A --> E{Heavy schema evolution?} E -->|Yes complex DDL| F[Postgres - better online DDL legacy] E -->|MySQL 8.0.29+ instant DDL OK| OK[Either] A --> G{Need rich types like JSONB, arrays, custom?} G -->|Yes| F A --> H{Massive horizontal scale needed?} H -->|Yes >10TB sharded| I[Vitess/MySQL or distributed Postgres] A --> J{Replication complex multi-region?} J -->|Yes| K[MySQL Group Replication or Aurora] A --> L{Parallel replication apply critical?} L -->|Yes high write rate| K2[MySQL 8 - parallel apply ahead of PG] style C fill:#c8e6c9 style F fill:#c8e6c9
Personal opinion 2024-2026:
- New project, no constraint → Postgres for richer features
- Need parallel replica apply → MySQL 8
- Heavy sharding from day 1 → Vitess (MySQL) or CockroachDB (Postgres-compat)
- Existing team strong in one → stay with it; both are excellent
10. Anti-patterns
| Pattern | Why bad | Fix |
|---|---|---|
| UUID v4 PK in InnoDB | Page splits, slow inserts | Auto-inc or UUID v7 |
| No PK on InnoDB table | Hidden DB_ROW_ID slow + 6 bytes/row | Always explicit PK |
innodb_buffer_pool_size = 25% RAM (Postgres rule) | InnoDB needs more | 70-80% RAM |
binlog_format = STATEMENT | Unsafe with non-deterministic functions | ROW |
innodb_flush_log_at_trx_commit = 0 for OLTP | Data loss on crash | 1 (default) |
| Enabling query cache pre-8.0 | Mostly hurts perf | Already removed in 8.0 |
| Long-running transactions | Undo log grows | Short tx; monitor innodb_history_list_length |
| Frequent gap lock deadlocks | Default RR | RC isolation level |
Forgot to set innodb_redo_log_capacity (or old log_file_size) | Frequent checkpoints | Sizing right |
| Auto-increment lock mode 0 | Serialize inserts | Mode 2 (default 8.0+) |
11. Lab
Day 1: MySQL setup + performance schema
docker run -d -p 3306:3306 -e MYSQL_ROOT_PASSWORD=lab mysql:8.0Enable performance schema, run mysqlslap workload, observe.
Day 2: Clustered index demonstration
- Create table with auto-increment PK
- Insert 1M rows, time it
- Repeat with UUID v4 PK, observe slower
- Compare with Postgres equivalent
Day 3: Gap locks reproduce
Two sessions, REPEATABLE READ, reproduce gap lock blocking. Switch to READ COMMITTED, observe no block.
Day 4: GTID replication
2 MySQL instances, GTID-based replication. Test failover.
docker run -d --name primary mysql:8.0 mysqld --log-bin --server-id=1 --gtid-mode=ON --enforce-gtid-consistency
docker run -d --name replica mysql:8.0 mysqld --server-id=2 --gtid-mode=ON --enforce-gtid-consistency --read-onlyDay 5: Parallel replication
Generate write workload. Watch replica lag with parallel workers 1 vs 8. Compare.
Day 6: gh-ost migration
Run online schema change on 10M-row table while load running.
Day 7: ProxySQL routing
Setup: app → ProxySQL → primary + replicas. Configure read/write split. Test.
12. Self-check
- Clustered index InnoDB vs heap Postgres — pros/cons mỗi cái?
- UUID v4 PK trong InnoDB — vì sao tệ hơn Postgres?
- Redo / Undo / Binlog — 3 logs InnoDB, vai trò?
- Gap lock — vì sao deadlock common hơn Postgres?
innodb_buffer_pool_size70-80% RAM — vì sao khác Postgres?- GTID replication — auto-positioning là gì?
- Parallel replication 8.0+ — advantage so Postgres?
- Instant DDL 8.0.29+ — operations support?
- Vitess sharding cơ chế?
- Khi nào pick MySQL vs Postgres 2024-2026?
13. Tiếp theo
Tuan-Bonus-Lakehouse-Engineering
Cập nhật: 2026-05-16