Foundations: Computer Architecture for Architects

“DDIA Chapter 1: ‘A 10ms disk read = 10 million CPU cycles’. Understanding the memory hierarchy = understanding 90% of performance optimization. Why is Llama-70B inference limited by memory bandwidth, not compute? Why does false sharing kill performance? Why is the H100 GPU expensive because of HBM3? An architect needs to understand the hardware they build on.”

Tags: cs-foundations computer-architecture performance fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-Foundations-OS-Essentials · Tuan-Bonus-LLM-Serving-Infrastructure · Tuan-Bonus-Vector-Database-Internals


1. Context & Why

Why do you need to understand Computer Architecture?

Production problem                                  Computer Architecture concept
Why Llama-70B serving = 50 tok/s/GPU                Memory bandwidth (HBM3 ~3.35 TB/s)
Why cache hit/miss matters 100x for perf            L1 (1 ns) vs DRAM (100 ns)
Why “false sharing” tanks multithreading            Cache line invalidation
Why NUMA-aware scheduling matters                   Memory locality across sockets
Why GPU ≫ CPU for ML                                SIMD parallelism + HBM bandwidth
Why ARM Graviton is ~20% cheaper at the same perf   Different uarch, fewer transistors
Why io_uring outperforms epoll                      Fewer mode switches
Why LMAX Disruptor hits 100x throughput             Mechanical sympathy with cache

Mechanical sympathy (Martin Thompson, LMAX): “Code that works WITH the hardware, not against it.” This requires hardware knowledge.


2. Deep Dive — Khái niệm cốt lõi

2.1 The Memory Hierarchy

Most important diagram in Computer Architecture:

                          Latency       Size       Cost/GB
┌────────────────┐
│  CPU Registers │       ~0.3 ns        ~KB         $$$$$
├────────────────┤
│  L1 Cache      │       ~1 ns          ~32-128 KB  $$$$
├────────────────┤
│  L2 Cache      │       ~3-4 ns        ~256 KB-1MB $$$
├────────────────┤
│  L3 Cache      │       ~10-15 ns      ~4-64 MB    $$
├────────────────┤
│  DRAM (Main)   │       ~80-150 ns     ~16 GB-2TB  $
├────────────────┤
│  NVMe SSD      │       ~50-100 μs     ~512 GB-32TB ¢
├────────────────┤
│  HDD           │       ~5-15 ms       ~1-20 TB     ¢
├────────────────┤
│  Network (DC)  │       ~0.5 ms        unlimited    ¢
├────────────────┤
│  Network (cont)│       ~150 ms        unlimited    ¢
└────────────────┘

Key ratios:

  • L1 to DRAM: 100x slower
  • DRAM to NVMe: 500x slower
  • NVMe to HDD: 100x slower
  • DRAM to network DC: 5000x slower

Implication: Data locality > algorithmic complexity for performance.

A quote often attributed to Jeff Dean: “If your code is dominated by memory access, big-O analysis lies to you.”

2.2 Cache Lines

Caches operate on fixed-size chunks called cache lines (typically 64 bytes).

Reading 1 byte = entire 64-byte line loaded into cache.

Cache line:
| 0 | 1 | 2 | ... | 63 |  ← 64 bytes

Implications:

  • Spatial locality: accessing nearby memory is fast (already in cache)
  • Sequential access ≫ random access
  • Struct layout matters: hot fields together

2.2.1 Sequential vs Random access benchmark

// Sequential (cache-friendly): the hardware prefetcher streams lines ahead
for (int i = 0; i < N; i++)
    sum += arr[i];        // ~0.5 ns/iter (L1 hit after first line)
 
// Random (cache-hostile): indices[] holds a random permutation of 0..N-1
for (int i = 0; i < N; i++)
    sum += arr[indices[i]]; // ~100 ns/iter (DRAM miss on large arrays)

200x slower for random access. Same algorithmic complexity O(N).

2.2.2 Why arrays > linked lists

Array: contiguous → cache-friendly → fast. Linked list: scattered → cache-miss → slow.

Even O(N) array iteration outperforms O(log N) random tree traversal for small N.
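A minimal C sketch of the effect (N and the timing loop are illustrative; note that freshly malloc’d nodes may still land near-contiguously — long-lived, churned lists scatter far worse):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

struct Node { int value; struct Node *next; };

static double ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    // Array: one contiguous 40 MB block — prefetcher-friendly
    int *arr = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) arr[i] = i;

    // Linked list: one allocation per node — traversal chases pointers
    struct Node *head = NULL;
    for (int i = 0; i < N; i++) {
        struct Node *n = malloc(sizeof *n);
        n->value = i; n->next = head; head = n;
    }

    struct timespec t0, t1;
    long long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) sum += arr[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("array: %.0f ms (sum=%lld)\n", ms(t0, t1), sum);

    sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (struct Node *n = head; n; n = n->next) sum += n->value;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("list:  %.0f ms (sum=%lld)\n", ms(t0, t1), sum);
    return 0;
}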

2.3 False Sharing — Multithreading Killer

Classic bug: Two threads write to different variables in same cache line → cache invalidation storm.

struct Stats {
    int counter_A;  // bytes 0-3
    int counter_B;  // bytes 4-7
} stats;  // Both in same 64-byte cache line!
 
// Thread 1: stats.counter_A++
// Thread 2: stats.counter_B++
 
// Both threads invalidate each other's cache line
// → 100x slower than expected

Fix — padding to separate cache lines:

struct Stats {
    int counter_A;
    char padding1[60];  // pad to 64 bytes
    int counter_B;
    char padding2[60];
} __attribute__((aligned(64)));

Or use language constructs:

// Rust
#[repr(align(64))]
struct CacheAligned<T>(T);

// Java (jdk.internal.vm.annotation.Contended; needs -XX:-RestrictContended)
@Contended
public class Counter { volatile long count; }

Real-world: LMAX Disruptor (Martin Thompson) achieves 6M ops/sec partly via cache-aware design.

2.4 NUMA — Non-Uniform Memory Access

Modern multi-socket servers: each socket has local DRAM. Cross-socket access slower.

┌──────────────┐       ┌──────────────┐
│  CPU 0       │ ◄───► │  CPU 1       │ ← Cross-socket
│  (Socket 0)  │       │  (Socket 1)  │   ~2x slower
│              │       │              │
│  DRAM 0      │       │  DRAM 1      │
│  (local)     │       │  (local)     │
└──────────────┘       └──────────────┘

Latency:

  • Local NUMA node: 100ns
  • Remote NUMA node: 200-300ns

Implication: Pin process + memory to same NUMA node.

# Check NUMA topology
numactl --hardware
 
# Run process bound to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./my-server

Used by: High-perf databases (Postgres, MySQL), JVM (-XX:+UseNUMA), DPDK.

2.5 SIMD — Single Instruction Multiple Data

Modern CPUs can process multiple data points per instruction:

Extension   Width     Year   Operations/cycle
MMX         64-bit    1997   2 × int32
SSE         128-bit   1999   4 × float32
AVX         256-bit   2011   8 × float32
AVX-512     512-bit   2016   16 × float32

#include <immintrin.h>

// Naive scalar sum: serial dependency chain, one element at a time
float sum = 0;
for (int i = 0; i < 8; i++) sum += arr[i];

// AVX: load 8 floats with one instruction, then reduce horizontally
__m256 v = _mm256_loadu_ps(arr);
__m128 s = _mm_add_ps(_mm256_castps256_ps128(v),
                      _mm256_extractf128_ps(v, 1)); // 4 partial sums
s = _mm_hadd_ps(s, s);                              // 2 partial sums
s = _mm_hadd_ps(s, s);                              // total in lane 0
float simd_sum = _mm_cvtss_f32(s);

Used by:

  • Numpy (uses BLAS / MKL)
  • Image / video codecs
  • Cryptography (AES-NI)
  • Databases (vectorized execution: ClickHouse, DuckDB)
  • ML inference (CPU paths)

2.6 Branch Prediction & Speculation

Pipeline: Modern CPU has 14-20 stage pipeline. Branch instructions are tricky.

Branch predictor: guesses which way branch goes.

  • Hit: pipeline flows
  • Miss: pipeline flushed, ~10-20 cycle penalty

// Predictable: regular pattern
for (int i = 0; i < N; i++) {
    if (i % 2 == 0) ...  // Pattern, easy to predict
}
 
// Unpredictable: random
for (int i = 0; i < N; i++) {
    if (random_bool()) ...  // Hard to predict, 50% miss
}

Branchless code (no if):

// With branch
if (a > b) max = a; else max = b;
 
// Branchless (compiler may already do this; relies on arithmetic right shift)
int max = a - ((a - b) & ((a - b) >> 31));

Production impact: Sorted data ≫ unsorted for filtering operations because of branch predictor.
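The classic demonstration, as a sketch (threshold and element type are arbitrary):

#include <stddef.h>

// Sum only elements >= 128. On sorted data the branch is taken in one long
// run, then not-taken in another — the predictor is nearly perfect. On
// random data it mispredicts ~50% of the time: same instructions, same O(N),
// several times slower.
long long filter_sum(const unsigned char *data, size_t n) {
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] >= 128) sum += data[i];
    return sum;
}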

2.7 The Roofline Model

Plot: arithmetic intensity (ops/byte) vs performance (FLOPS/sec).

Performance
    ▲
peak ┼──────────  ← compute-bound region (limited by FLOPS)
     │       /
     │      /
     │     /
     │    / ← memory-bound region (limited by bandwidth)
     │   /
     │  /
     │ /
     └─────────────► Arithmetic Intensity (ops/byte)
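The model reduces to one formula: attainable throughput = min(peak FLOPS, bandwidth × arithmetic intensity). A small sketch using the H100 figures cited in this note:

#include <stdio.h>

// Roofline: attainable = min(peak_flops, bandwidth * intensity)
double roofline(double peak_flops, double bw_bytes_per_s, double intensity) {
    double mem_bound = bw_bytes_per_s * intensity;
    return mem_bound < peak_flops ? mem_bound : peak_flops;
}

int main(void) {
    double peak = 989e12;   // H100 FP16 Tensor, ~989 TFLOPS
    double bw   = 3.35e12;  // HBM3, ~3.35 TB/s

    // Ridge point: intensity where a kernel stops being memory-bound
    printf("ridge:  %.0f FLOPs/byte\n", peak / bw);                    // ~295

    // LLM decode: ~1 op per byte of weights read -> deeply memory-bound
    printf("decode: %.1f TFLOPS\n", roofline(peak, bw, 1.0) / 1e12);   // ~3.4

    // Large matmul: hundreds of ops per byte -> compute-bound
    printf("matmul: %.0f TFLOPS\n", roofline(peak, bw, 500.0) / 1e12); // 989
    return 0;
}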

Application: LLM inference:

  • Decode phase: 1 op (multiply) per byte read from memory
  • Memory bandwidth limit dominates → can’t speed up via more compute
  • This is why HBM3 (3.35 TB/s) matters for H100

Application: ML training:

  • Lots of matrix multiplications → high arithmetic intensity
  • Compute-bound → benefit from more FLOPS

2.8 GPU Architecture

GPUs = thousands of simple cores, optimized for parallel SIMD.

2.8.1 NVIDIA A100 / H100 architecture

A100:

  • 6,912 CUDA cores
  • 432 Tensor cores (mixed precision)
  • 80 GB HBM2e (2 TB/s bandwidth)
  • 312 TFLOPS (FP16), 624 TFLOPS (FP16 with sparsity)

H100 (2022):

  • 16,896 CUDA cores (SXM5; the PCIe variant has 14,592)
  • 528 Tensor cores (SXM5; PCIe: 456)
  • 80 GB HBM3 (3.35 TB/s bandwidth)
  • 989 TFLOPS (FP16), 4 PFLOPS (FP8 with sparsity)
  • Transformer Engine (FP8 with auto-scaling)

2.8.2 SM (Streaming Multiprocessor)

  • 32 threads execute as warp (lock-step)
  • All threads in warp execute same instruction
  • Branch divergence within warp → serialize → slow

// Good: all warp threads same path
if (thread_id < N) compute(thread_id);
 
// Bad: branch divergence
if (data[thread_id] > 0) ...  // half threads diverge

2.8.3 Memory hierarchy on GPU

Registers (per thread)        ~1 cycle
   ↓
Shared memory (per SM)        ~5-10 cycles
   ↓
L2 Cache (shared)             ~30-50 cycles
   ↓
HBM (Global memory)           ~200-400 cycles

Implication: Maximize shared memory usage. This is why FlashAttention matters — it keeps the attention computation in on-chip shared memory instead of round-tripping through HBM.

2.9 Storage Hierarchy

2.9.1 NVMe SSD

  • PCIe 4.0 x4: ~7 GB/s sequential
  • PCIe 5.0 x4 (2024+): ~14 GB/s sequential
  • IOPS: 1M+ random reads
  • Latency: ~50 μs

2.9.2 NVMe-oF (over fabric)

  • NVMe over RDMA / TCP
  • Disaggregated storage
  • Sub-100 μs for remote NVMe
  • Used by: AWS Nitro, Snowflake, Aurora

2.9.3 Persistent Memory (PMem / Optane)

  • Intel Optane (discontinued 2022, but the tech remains relevant)
  • DRAM-like latency, persistent
  • Used by: SAP HANA, Redis (modes), Postgres extensions

2.9.4 CXL (Compute Express Link)

  • New (2022+) interconnect
  • Disaggregated memory pools
  • 128 GB+ memory expansion via a PCIe slot
  • Latency higher than DRAM but far lower than NVMe

Future: CXL enables memory-centric architectures.

2.10 ARM vs x86 — Why Graviton matters

ARM (Graviton, Apple Silicon) vs x86 (Intel/AMD):

                    x86                           ARM
Architecture        CISC                          RISC
Power efficiency    Lower                         Higher (~30%)
Cost                Higher                        Lower
Cores per chip      Fewer, complex                Many, simpler
Best for            Compute-heavy single thread   Throughput, parallel

AWS Graviton 4 (2024):

  • 96 cores @ 2.8 GHz
  • 30% faster than previous gen
  • 60% less energy than x86
  • 20-40% cheaper for same perf

Migration: Most languages compile fine. Watch for native deps (e.g., Pillow Python).

2.11 Latency Numbers — Updated 2024

Jeff Dean’s “Latency Numbers Every Programmer Should Know” (updated):

L1 cache reference                    1 ns
Branch mispredict                     5 ns
L2 cache reference                    4 ns
Mutex lock/unlock                    25 ns
Main memory reference               100 ns
Compress 1KB                      3,000 ns (3 μs)
Send 1KB over 1 Gbps                10,000 ns (10 μs)
SSD random read (fast NVMe)        16,000 ns (16 μs)
NVMe random read (typical)         50,000 ns (50 μs)
Read 1MB sequential from RAM       250,000 ns (0.25 ms)
Read 1MB sequential from NVMe      500,000 ns (0.5 ms)
Round trip same datacenter         500,000 ns (0.5 ms)
Read 1MB sequential from HDD     5,000,000 ns (5 ms)
Disk seek                       10,000,000 ns (10 ms)
Send packet CA→Netherlands     150,000,000 ns (150 ms)

Memorize ratios:

  • L1 : DRAM = 1 : 100
  • DRAM : NVMe = 1 : 500
  • NVMe : HDD = 1 : 100

3. Practical Applications

3.1 Why LLM serving is memory-bound

Llama-70B FP16 inference:

  • Model: 140 GB
  • Per token: read entire model weights once
  • Bandwidth: 3.35 TB/s (H100 HBM3)
  • Theoretical max throughput: 3,350 GB/s ÷ 140 GB = 24 tokens/s

→ Can’t go faster than memory bandwidth allows. Continuous batching helps because multiple requests share weight reads.
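A back-of-envelope sketch of that point: every pass over the weights yields one token for each request in the batch, so aggregate tokens/s scales with batch size until compute or KV-cache memory becomes the limit (those ceilings are deliberately omitted here):

#include <stdio.h>

int main(void) {
    double bw_gbs   = 3350.0;  // H100 HBM3 bandwidth, GB/s
    double model_gb = 140.0;   // Llama-70B FP16 weights
    double passes_per_s = bw_gbs / model_gb;  // ~24 weight passes/second

    for (int batch = 1; batch <= 64; batch *= 4) {
        // One weight pass produces one token for EVERY request in the batch
        printf("batch %2d -> ~%.0f tok/s aggregate\n",
               batch, passes_per_s * batch);
    }
    // Real systems cap out earlier: KV cache eats HBM, and large batches
    // shift the workload toward compute-bound.
    return 0;
}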

3.2 Why ClickHouse is fast

ClickHouse architectural choices:

  1. Columnar storage → cache-friendly (process column at a time)
  2. Vectorized execution → SIMD-accelerated
  3. JIT compilation for hot queries
  4. Late materialization — don’t read all columns

→ 10-100x faster than row-oriented OLAP databases.

3.3 Why memory bandwidth matters more than CPU GHz

Modern CPUs spend 50-80% of time waiting for memory. Adding more GHz doesn’t help.

Pareto rule for performance:

  • Reduce cache misses
  • Improve data layout
  • Use SIMD
  • Then optimize algorithm
  • (Adding cores last)

3.4 LMAX Disruptor — Hardware-aware design

Disruptor (2010): Lock-free ring buffer that achieves 6M+ ops/sec.

Tricks:

  • Single writer per slot → no synchronization
  • Cache-line aligned slots → no false sharing
  • Mechanical sympathy with CPU pipeline
  • Sequence numbers (no shared lock)

Design pattern: When you NEED max throughput, design with cache in mind.
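A minimal C11 single-producer/single-consumer ring borrowing two of these tricks — one writer per index, and head/tail on separate cache lines. An illustrative sketch, not LMAX’s actual implementation:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1024  // power of two -> cheap index masking

typedef struct {
    // head and tail each own a 64-byte cache line: no false sharing
    _Alignas(64) atomic_size_t head;  // written only by the consumer
    _Alignas(64) atomic_size_t tail;  // written only by the producer
    _Alignas(64) uint64_t slots[RING_SIZE];
} spsc_ring;

bool ring_push(spsc_ring *r, uint64_t v) {
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == RING_SIZE) return false;  // full
    r->slots[t & (RING_SIZE - 1)] = v;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}

bool ring_pop(spsc_ring *r, uint64_t *out) {
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h == t) return false;  // empty
    *out = r->slots[h & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}

Because each index has exactly one writer, monotonically increasing sequence counters plus acquire/release ordering replace locks entirely.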

3.5 Storage tier strategy

TierStorageUse caseCost/GB
HotDRAM (Redis)Sub-ms reads$5-10
WarmNVMe (Postgres)Active data$0.10-0.30
CoolSATA SSDLess active$0.03-0.10
ColdS3 StandardBackups$0.023
FrozenS3 GlacierArchive$0.001-0.004

Architecture pattern: Lakehouse uses this — hot data in compute cache, warm in S3 IA, cold in Glacier.


4. Performance Testing & Profiling

4.1 Benchmark methodology

Trinity of benchmarking:

  1. Microbenchmarks: Specific function (use criterion in Rust, JMH in Java)
  2. Synthetic load tests: Tool like wrk, vegeta, k6
  3. Production shadow traffic: Real-world patterns

Pitfalls:

  • Warmup needed (JIT, cache, branch predictor)
  • Don’t average — use percentiles (see the harness sketch below)
  • Watch for thermal throttling
  • Disable turbo boost for reproducibility
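
A tiny C harness showing the warmup-then-percentiles discipline (the work function, iteration counts, and percentile cuts are illustrative placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static int cmp_ll(const void *a, const void *b) {
    long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

void bench(void (*work)(void), int warmup, int iters) {
    for (int i = 0; i < warmup; i++) work();  // warm caches, predictor, JIT

    long long *lat = malloc(iters * sizeof *lat);
    for (int i = 0; i < iters; i++) {
        long long t0 = now_ns();
        work();
        lat[i] = now_ns() - t0;
    }
    qsort(lat, iters, sizeof *lat, cmp_ll);
    // Report percentiles, never the mean: tails are what users feel
    printf("p50=%lldns p99=%lldns p999=%lldns\n",
           lat[iters / 2], lat[iters * 99 / 100], lat[iters * 999 / 1000]);
    free(lat);
}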

4.2 perf — Linux profiling

# CPU profiling
perf record -F 99 -g ./my-app    # Sample 99Hz, with stack
perf report                        # Interactive view
 
# Cache miss profiling
perf stat -e cache-misses,cache-references ./my-app
 
# Cycle attribution
perf annotate -d ./my-app

4.3 Brendan Gregg’s flame graphs

# Capture
perf record -F 99 -ag -- sleep 30
 
# Generate flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

→ Visual: y-axis = stack depth; x-axis = alphabetical frame ordering (not time); frame width = share of samples (time spent).

4.4 eBPF-based profiling

# CPU profile via eBPF (no perf needed)
bpftrace -e 'profile:hz:99 /pid == 1234/ { @[ustack] = count(); }'
 
# Off-CPU analysis
offcputime -p 1234
 
# Cache misses by function (BCC tool; samples LLC events system-wide)
llcstat

Tools: BCC, bpftrace, Pixie, Pyroscope, Parca.


5. Architecture Implications

5.1 When CPU bound

Symptoms: high CPU%, low waiting%, performance scales with cores.

Optimizations:

  • Profile, optimize hot loops
  • Use SIMD where possible
  • Multithreading
  • Better algorithm

5.2 When memory bandwidth bound

Symptoms: CPU% medium, but adding cores doesn’t help.

Optimizations:

  • Reduce data size (compression, quantization)
  • Batch operations
  • Improve locality (data layout)
  • Use cache-aware algorithms

Examples: LLM inference, in-memory analytics, big data scans.

5.3 When I/O bound

Symptoms: low CPU, high iowait.

Optimizations:

  • Async I/O (epoll, io_uring) — see the liburing sketch after this list
  • Connection pooling
  • Reduce I/O (cache, batch)
  • Faster storage (NVMe)
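
Why io_uring means fewer mode switches, as a minimal liburing sketch (assumes liburing is installed, link with -luring; the file name is a placeholder). One submission syscall and one completion wait replace the read()+epoll round trip:

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);  // submission + completion queues

    int fd = open("data.bin", O_RDONLY);
    char buf[4096];

    // Queue one read, then submit with a single syscall
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    // Wait for the completion — no separate read() call needed
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}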

5.4 When network bound

Symptoms: txqueuelen saturated, retransmits.

Optimizations:

  • Compression
  • HTTP/2 multiplexing
  • Connection reuse
  • CDN / edge

6. Code Examples

6.1 Cache-friendly vs hostile

// Cache-friendly: row-major iteration
int matrix[N][N];
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        matrix[i][j] = i + j;
// ~10x faster than column-major!
 
// Cache-hostile
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        matrix[i][j] = i + j;  // Stride access

6.2 False sharing demo

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Instant;
 
// BAD: false sharing — a and b share one 64-byte cache line
struct BadCounter {
    a: AtomicU64,  // 8 bytes
    b: AtomicU64,  // 8 bytes — same cache line!
}
 
// GOOD: each counter aligned to its own 64-byte cache line
#[repr(align(64))]
struct PaddedAtomic(AtomicU64);
 
struct GoodCounter {
    a: PaddedAtomic,
    b: PaddedAtomic,
}
 
static BAD: BadCounter = BadCounter { a: AtomicU64::new(0), b: AtomicU64::new(0) };
static GOOD: GoodCounter = GoodCounter {
    a: PaddedAtomic(AtomicU64::new(0)),
    b: PaddedAtomic(AtomicU64::new(0)),
};
 
// Two threads hammer two *different* counters; only the layout differs
fn bench(name: &str, a: &'static AtomicU64, b: &'static AtomicU64) {
    let start = Instant::now();
    let t1 = thread::spawn(move || for _ in 0..50_000_000u64 { a.fetch_add(1, Ordering::Relaxed); });
    let t2 = thread::spawn(move || for _ in 0..50_000_000u64 { b.fetch_add(1, Ordering::Relaxed); });
    t1.join().unwrap();
    t2.join().unwrap();
    println!("{name}: {:?}", start.elapsed());
}
 
fn main() {
    bench("shared line", &BAD.a, &BAD.b);
    bench("padded     ", &GOOD.a.0, &GOOD.b.0);
}

Result: PaddedAtomic 5-10x faster than naive when 2 threads contend.

6.3 SIMD with Rust

// Nightly Rust: requires #![feature(portable_simd)]
use std::simd::prelude::*;
 
fn sum_simd(arr: &[f32]) -> f32 {
    let chunks = arr.chunks_exact(8);
    let remainder = chunks.remainder();
 
    let sum_vec = chunks
        .map(|chunk| f32x8::from_slice(chunk))
        .fold(f32x8::splat(0.0), |acc, v| acc + v);
 
    let sum_remainder: f32 = remainder.iter().sum();
    sum_vec.reduce_sum() + sum_remainder
}
 
// 4-8x faster than scalar sum

6.4 NUMA-aware allocation

#include <numa.h>  // link with -lnuma
 
if (numa_available() >= 0) {
    // Allocate on local NUMA node
    void *mem = numa_alloc_local(SIZE);
    // ...
    numa_free(mem, SIZE);
}

7. System Design Diagrams

7.1 Memory Hierarchy

flowchart TB
    CPU[CPU Core]

    CPU <-->|0.3ns| Reg[Registers]
    Reg <-->|1ns| L1[L1 Cache<br/>32-128 KB]
    L1 <-->|3ns| L2[L2 Cache<br/>256 KB-1 MB]
    L2 <-->|10ns| L3[L3 Cache<br/>4-64 MB shared]
    L3 <-->|100ns| DRAM[DRAM<br/>16 GB-2 TB]
    DRAM <-->|50μs| NVMe[NVMe SSD]
    NVMe <-->|5ms| HDD[HDD]
    NVMe <-->|0.5ms| NetDC[Network DC]
    NetDC <-->|150ms| NetCont[Network Cross-Continent]

    style Reg fill:#1b5e20,color:#fff
    style L1 fill:#2e7d32,color:#fff
    style L2 fill:#43a047,color:#fff
    style L3 fill:#66bb6a,color:#fff
    style DRAM fill:#a5d6a7,color:#000
    style NVMe fill:#fff9c4,color:#000
    style HDD fill:#ffe0b2,color:#000

7.2 NUMA Architecture

flowchart LR
    subgraph S0["Socket 0"]
        CPU0[CPU 0<br/>cores 0-7]
        DRAM0[DRAM 0<br/>local]
        CPU0 <--> DRAM0
    end

    subgraph S1["Socket 1"]
        CPU1[CPU 1<br/>cores 8-15]
        DRAM1[DRAM 1<br/>local]
        CPU1 <--> DRAM1
    end

    CPU0 <-.QPI/UPI<br/>cross-socket.-> CPU1
    CPU0 <-.slower.-> DRAM1
    CPU1 <-.slower.-> DRAM0

    style CPU0 fill:#bbdefb
    style CPU1 fill:#c8e6c9

7.3 GPU Architecture (Simplified)

flowchart TB
    subgraph GPU["NVIDIA H100 SXM5"]
        subgraph SMs["132 Streaming Multiprocessors"]
            SM1[SM 1<br/>128 cores<br/>Tensor cores]
            SM2[...]
            SM132[SM 132]
        end

        L2[L2 Cache<br/>50 MB shared]
        HBM[HBM3<br/>80 GB<br/>3.35 TB/s]

        SMs --> L2
        L2 --> HBM
    end

    NVLink[NVLink 4<br/>900 GB/s<br/>to other GPUs]
    PCIe[PCIe 5<br/>128 GB/s<br/>to CPU]

    GPU --> NVLink
    GPU --> PCIe

7.4 Roofline Model

flowchart LR
    subgraph Compute["Compute-Bound Region"]
        C1[High arithmetic<br/>intensity]
        C2[Performance ≈ peak FLOPS]
        C3[Examples: ML training,<br/>scientific compute]
    end

    subgraph Memory["Memory-Bound Region"]
        M1[Low arithmetic<br/>intensity]
        M2[Performance ≈ bandwidth × intensity]
        M3[Examples: LLM decode,<br/>analytics scans]
    end

    Compute -.compute optimization.-> M1
    Memory -.bandwidth optimization.-> C1

    style Compute fill:#c8e6c9
    style Memory fill:#fff9c4

8. Aha Moments & Pitfalls

Aha Moments

#1: Memory hierarchy jumps are huge at every level: L1→DRAM ~100x, DRAM→NVMe ~500x, NVMe→HDD ~100x. Architecture decisions = where data lives.

#2: Modern CPUs spend 50-80% time waiting for memory. Adding GHz doesn’t help. Reduce cache misses → reduce data size → improve locality.

#3: LLM inference = memory bandwidth bound. 3.35 TB/s ÷ 140 GB = 24 tok/s theoretical max. Continuous batching shares weight reads.

#4: False sharing is silent killer. Two threads writing different vars in same cache line = serialization. Pad to 64 bytes.

#5: NUMA matters at scale. 2-socket server: pin process + memory to same node. Saves 50%+ memory latency.

#6: SIMD can give 4-16x speedup for free. Compiler auto-vectorizes simple loops. Vectorized DBs (DuckDB, ClickHouse) win because of this.

#7: Branch predictor matters. Sorted data >> unsorted. ~20 cycle penalty per misprediction.

#8: GPU is parallel SIMD on steroids. 15K+ cores executing in 32-thread lockstep warps. Branch divergence kills GPU perf.

Pitfalls

Pitfall 1: O(N) beats O(log N) for small N

Pointer-chasing O(log N) structures (balanced trees, skip lists) are often slower than an O(N) array scan because every hop is a cache miss. Fix: Profile the actual workload, not just Big O.

Pitfall 2: Same cache line struct fields

Multi-threaded counters in same struct → false sharing. Fix: Pad to 64 bytes between thread-local fields.

Pitfall 3: Random access patterns

Hash map with poor distribution → cache miss every lookup. Fix: Better hash, or use array-based DS for small data.

Pitfall 4: Over-rely on CPU GHz

“Just buy faster CPU” — but workload is memory-bound. Fix: Profile, find real bottleneck.

Pitfall 5: NUMA-blind allocation

JVM defaults can spread memory across NUMA → 2x slower. Fix: -XX:+UseNUMA for JVM, numactl for processes.

Pitfall 6: Thermal throttling in benchmarks

Hot CPU clocks down → benchmarks unreliable. Fix: Disable turbo boost, monitor temps, use sustained workload.

Pitfall 7: Sequential vs parallel mix

Add cores but workload sequential → no improvement (Amdahl’s law). Fix: Profile parallel speedup curve before scaling.

Pitfall 8: Forget about TLB

Random access across huge memory → TLB misses → 100ns each. Fix: Huge pages (madvise(MADV_HUGEPAGE)), Transparent HugePages.
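
A minimal sketch of that fix (sizes assume x86-64 2 MB huge pages):

#include <stdlib.h>
#include <sys/mman.h>

// Back a large, randomly-accessed region with transparent huge pages:
// 2 MB pages mean 512x fewer TLB entries than 4 KB pages.
size_t len = 1UL << 30;                         // 1 GiB
void *p = aligned_alloc(2 * 1024 * 1024, len);  // align to huge-page size
madvise(p, len, MADV_HUGEPAGE);                 // hint: use THP for this range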

Pitfall 9: GPU branch divergence

Naive CUDA code with if/else → warp serialization. Fix: Algorithm redesign, predicated execution.

Pitfall 10: Assume HDD = SSD

Designed for sequential HDD → poor on random NVMe (or vice versa). Fix: Match algorithm to storage characteristics.


Topic                                     Connects to
Tuan-Foundations-OS-Essentials            Virtual memory uses MMU + caches
Tuan-Bonus-LLM-Serving-Infrastructure     LLM = memory-bandwidth bound (HBM3)
Tuan-Bonus-Vector-Database-Internals      Vector search uses SIMD + cache
Tuan-Foundations-Database-Internals       Storage hierarchy, B-tree vs LSM
Case-Design-Stock-Exchange                LMAX Disruptor, mechanical sympathy
Tuan-13-Monitoring-Observability          perf, eBPF for profiling

References

Books:

  • Computer Systems: A Programmer’s Perspective (CSAPP, Bryant & O’Hallaron)
  • Computer Architecture: A Quantitative Approach (Hennessy & Patterson)
  • Systems Performance (Brendan Gregg)
  • The Art of Multiprocessor Programming (Herlihy & Shavit)


Next: Tuan-Foundations-Database-Internals — Storage engines (B-tree, LSM), MVCC, query optimizer.