
🌐 Distributed Systems Modeling

Level: Advanced | Solves: Design data models for distributed databases with proper sharding and consistency

Distributed Data Fundamentals

💡 Professor Tom

Distributed systems are not "a relational database, just bigger". They are an entirely different paradigm with their own trade-offs. The CAP theorem is not abstract theory: it is a reality check for every design decision.

CAP Theorem Reality

┌─────────────────────────────────────────────────────────────────┐
│                    CAP THEOREM                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                    CONSISTENCY                                  │
│                         ▲                                       │
│                        /│\                                      │
│                       / │ \                                     │
│                      /  │  \                                    │
│                     /   │   \                                   │
│                    / CP │ CA \                                  │
│                   /     │     \                                 │
│                  /      │      \                                │
│                 /       │       \                               │
│                /        │        \                              │
│               ▼─────────┴─────────▼                             │
│         PARTITION              AVAILABILITY                     │
│         TOLERANCE                                               │
│                    \         /                                  │
│                     \  AP   /                                   │
│                      \     /                                    │
│                       \   /                                     │
│                        \ /                                      │
│                         ▼                                       │
│                                                                 │
│  CP: MongoDB, HBase, Redis Cluster                              │
│      → Consistent but may be unavailable during partition       │
│                                                                 │
│  AP: Cassandra, DynamoDB, CouchDB                               │
│      → Available but may return stale data                      │
│                                                                 │
│  CA: Traditional RDBMS (single node)                            │
│      → Not partition tolerant (not truly distributed)           │
│                                                                 │
│  REALITY: Network partitions WILL happen.                       │
│           Choose between C and A during partition.              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Distributed Identifiers

UUID vs Sequential IDs

┌─────────────────────────────────────────────────────────────────┐
│              IDENTIFIER STRATEGIES                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SEQUENTIAL (Auto-increment)        UUID/ULID                   │
│  ──────────────────────────         ────────                    │
│                                                                 │
│  ✅ Compact (8 bytes)               ❌ Large (16 bytes)         │
│  ✅ Sortable by creation            ✅ Globally unique          │
│  ✅ Human readable                  ✅ No coordination needed   │
│  ❌ Requires coordination           ✅ Generate anywhere        │
│  ❌ Reveals business info           ✅ No info leakage          │
│  ❌ Hot spot on insert              ❌ Random distribution      │
│                                                                 │
│  DISTRIBUTED-FRIENDLY OPTIONS:                                  │
│                                                                 │
│  1. UUID v4 (Random)                                            │
│     550e8400-e29b-41d4-a716-446655440000                        │
│     → Truly random, no ordering                                 │
│                                                                 │
│  2. UUID v7 (Time-ordered)                                      │
│     018e5e5c-7b3a-7000-8000-000000000000                        │
│     → Time-sortable, better index locality                      │
│                                                                 │
│  3. ULID (Universally Unique Lexicographically Sortable ID)     │
│     01ARZ3NDEKTSV4RRFFQ69G5FAV                                  │
│     → Sortable, URL-safe, 26 chars                              │
│                                                                 │
│  4. Snowflake ID (Twitter)                                      │
│     1234567890123456789                                         │
│     → 64-bit, time + machine + sequence                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
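Option 2 (UUID v7) can be sketched with the standard library alone. The `uuid7` helper below is illustrative, not a production implementation; real projects should prefer a vetted library. It packs a 48-bit millisecond timestamp into the top bits so IDs sort roughly by creation time:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Minimal UUIDv7 sketch: 48-bit ms timestamp, then version/variant bits, then random."""
    ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (ts_ms << 80) | rand
    # Overwrite the version nibble (7) and variant bits (10xx) per RFC 9562.
    value &= ~(0xF << 76)
    value |= 0x7 << 76
    value &= ~(0x3 << 62)
    value |= 0x2 << 62
    return uuid.UUID(int=value)

a, b = uuid7(), uuid7()
print(a, b)
```

Because the timestamp occupies the most significant bits, IDs generated in different milliseconds sort chronologically as strings, which is what gives v7 its better index locality compared to v4.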

Snowflake ID Structure

┌─────────────────────────────────────────────────────────────────┐
│              SNOWFLAKE ID (64 bits)                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─┬────────────────────────┬──────────┬────────────────┐       │
│  │0│      Timestamp         │ Machine  │   Sequence     │       │
│  │ │      (41 bits)         │ (10 bits)│   (12 bits)    │       │
│  └─┴────────────────────────┴──────────┴────────────────┘       │
│   1          41                  10           12                │
│                                                                 │
│  BREAKDOWN:                                                     │
│  • Sign bit: Always 0 (positive number)                         │
│  • Timestamp: Milliseconds since epoch (69 years)               │
│  • Machine ID: 1024 unique machines                             │
│  • Sequence: 4096 IDs per millisecond per machine               │
│                                                                 │
│  CAPACITY: 4,096,000 IDs/second/machine                         │
│                                                                 │
│  BENEFITS:                                                      │
│  ✅ Time-sortable (roughly)                                     │
│  ✅ No coordination between machines                            │
│  ✅ Compact (64-bit integer)                                    │
│  ✅ Extractable timestamp                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
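The layout above translates directly into bit shifts. Below is a minimal single-process sketch; `SnowflakeGenerator` is an illustrative name, and Twitter's epoch constant is used only as an example (any fixed epoch works):

```python
import threading
import time

class SnowflakeGenerator:
    """Sketch of a Snowflake-style ID generator.
    Layout: 1 sign bit | 41-bit ms timestamp | 10-bit machine id | 12-bit sequence."""
    EPOCH = 1288834974657  # Twitter's custom epoch; any fixed epoch works

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ts = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ts = int(time.time() * 1000)
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:
                    # Exhausted 4096 IDs this millisecond: spin until the next one.
                    while ts <= self.last_ts:
                        ts = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ts = ts
            return ((ts - self.EPOCH) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=7)
ids = [gen.next_id() for _ in range(5)]
# The timestamp is extractable by shifting the ID back:
ms = (ids[0] >> 22) + SnowflakeGenerator.EPOCH
```

Note the coordination-free property: each machine only needs a unique `machine_id`; no network round trip is needed per ID.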

Sharding Strategies

Sharding Key Selection

⚠️ Sharding Key is Forever

Changing sharding key requires full data migration. Choose wisely upfront.

┌─────────────────────────────────────────────────────────────────┐
│              SHARDING KEY CRITERIA                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  GOOD SHARDING KEY:                                             │
│  ✅ High cardinality (many unique values)                       │
│  ✅ Even distribution (no hot spots)                            │
│  ✅ Immutable (never changes)                                   │
│  ✅ Frequently used in queries                                  │
│  ✅ Supports common access patterns                             │
│                                                                 │
│  BAD SHARDING KEY:                                              │
│  ❌ Low cardinality (status, country)                           │
│  ❌ Monotonically increasing (timestamp, auto-increment)        │
│  ❌ Skewed distribution (celebrity user_id)                     │
│  ❌ Mutable (email, username)                                   │
│                                                                 │
│  EXAMPLES:                                                      │
│                                                                 │
│  E-commerce Orders:                                             │
│  ✅ customer_id (queries by customer)                           │
│  ❌ order_date (hot partition for today)                        │
│  ❌ status (only 5 values)                                      │
│                                                                 │
│  Social Media Posts:                                            │
│  ✅ user_id (queries by user)                                   │
│  ❌ created_at (hot partition for recent)                       │
│  ⚠️ post_id (good distribution, but queries need user_id)      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘


Sharding Patterns

┌─────────────────────────────────────────────────────────────────┐
│              SHARDING PATTERNS                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. HASH-BASED SHARDING                                         │
│     shard = hash(key) % num_shards                              │
│                                                                 │
│     ┌─────┐  hash   ┌─────────┐                                 │
│     │ key │ ──────► │ shard_2 │                                 │
│     └─────┘         └─────────┘                                 │
│                                                                 │
│     ✅ Even distribution                                        │
│     ❌ Range queries require scatter-gather                     │
│     ❌ Resharding is expensive                                  │
│                                                                 │
│  2. RANGE-BASED SHARDING                                        │
│     shard = range(key)                                          │
│                                                                 │
│     A-M → Shard 1                                               │
│     N-Z → Shard 2                                               │
│                                                                 │
│     ✅ Range queries efficient                                  │
│     ❌ Potential hot spots                                      │
│     ❌ Uneven distribution                                      │
│                                                                 │
│  3. DIRECTORY-BASED SHARDING                                    │
│     Lookup table maps key → shard                               │
│                                                                 │
│     ┌─────────────────┐                                         │
│     │ key   │ shard   │                                         │
│     │ user1 │ shard_1 │                                         │
│     │ user2 │ shard_3 │                                         │
│     └─────────────────┘                                         │
│                                                                 │
│     ✅ Flexible placement                                       │
│     ❌ Lookup overhead                                          │
│     ❌ Directory is SPOF                                        │
│                                                                 │
│  4. CONSISTENT HASHING                                          │
│     Virtual ring with nodes                                     │
│                                                                 │
│     ✅ Minimal reshuffling on node add/remove                   │
│     ✅ Good for caches, distributed storage                     │
│     ❌ More complex implementation                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
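Pattern 4 is the least intuitive of the four, so here is a minimal sketch of a consistent-hash ring with virtual nodes (not production code; MD5 is used only as a cheap uniform hash). The demo at the bottom shows the key property: adding a node moves only the keys that land on the new node, not the whole keyspace:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (a sketch, not production code)."""
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.vnodes):  # vnodes smooth out the distribution
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard_a", "shard_b", "shard_c"])
before = {f"user{i}": ring.get_node(f"user{i}") for i in range(1000)}
ring.add_node("shard_d")
moved = sum(1 for k, n in before.items() if ring.get_node(k) != n)
print(f"{moved}/1000 keys moved")  # expect roughly a quarter of the keys, not all
```

Contrast this with `hash(key) % num_shards`: going from 3 to 4 shards there would remap roughly 3 out of every 4 keys.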

Consistency Models

Consistency Spectrum

┌─────────────────────────────────────────────────────────────────┐
│              CONSISTENCY SPECTRUM                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  STRONG ◄─────────────────────────────────────────► EVENTUAL    │
│                                                                 │
│  Linearizable                                                   │
│  │  • Single global order                                       │
│  │  • Reads see latest write                                    │
│  │  • Highest latency                                           │
│  │                                                              │
│  Sequential Consistency                                         │
│  │  • Operations in program order                               │
│  │  • All processes see same order                              │
│  │                                                              │
│  Causal Consistency                                             │
│  │  • Causally related ops in order                             │
│  │  • Concurrent ops may differ                                 │
│  │                                                              │
│  Read-Your-Writes                                               │
│  │  • See your own writes                                       │
│  │  • Others may see stale                                      │
│  │                                                              │
│  Eventual Consistency                                           │
│     • Eventually converges                                      │
│     • May read stale data                                       │
│     • Lowest latency                                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Consistency Patterns

┌─────────────────────────────────────────────────────────────────┐
│              CONSISTENCY PATTERNS                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. READ-AFTER-WRITE CONSISTENCY                                │
│     User sees their own writes immediately                      │
│                                                                 │
│     Implementation:                                             │
│     • Read from leader after write                              │
│     • Include version in response, require in read              │
│     • Sticky sessions to same replica                           │
│                                                                 │
│  2. MONOTONIC READS                                             │
│     Never see older data after seeing newer                     │
│                                                                 │
│     Implementation:                                             │
│     • Track last-read timestamp per user                        │
│     • Route to replica with >= timestamp                        │
│                                                                 │
│  3. QUORUM READS/WRITES                                         │
│     W + R > N ensures overlap                                   │
│                                                                 │
│     N=3 replicas:                                               │
│     • W=2, R=2: Strong consistency                              │
│     • W=1, R=1: Eventual consistency, fast                      │
│     • W=3, R=1: Fast reads, slow writes                         │
│                                                                 │
│  4. CONFLICT RESOLUTION                                         │
│     When concurrent writes conflict:                            │
│     • Last-Write-Wins (LWW): Timestamp decides                  │
│     • Vector Clocks: Track causality                            │
│     • CRDTs: Mathematically mergeable                           │
│     • Application-level: Custom merge logic                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
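The W + R > N rule in pattern 3 can be verified by brute force: enumerate every possible write quorum and read quorum and check that they always share at least one replica. A small sketch:

```python
import itertools

def quorum_overlap(n: int, w: int, r: int) -> bool:
    """W + R > N guarantees every read quorum intersects every write quorum."""
    return w + r > n

def always_overlaps(n: int, w: int, r: int) -> bool:
    """Brute-force check: does every size-W write set share a replica with every size-R read set?"""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in itertools.combinations(replicas, w)
               for rs in itertools.combinations(replicas, r))

for n, w, r in [(3, 2, 2), (3, 1, 1), (3, 3, 1)]:
    print(f"N={n} W={w} R={r}: rule={quorum_overlap(n, w, r)} "
          f"brute-force={always_overlaps(n, w, r)}")
```

The W=1, R=1 case fails exactly as the spectrum predicts: a read can hit a replica the write never touched, which is the mechanism behind stale reads under eventual consistency.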

Cross-Shard Operations

Distributed Transactions

┌─────────────────────────────────────────────────────────────────┐
│              DISTRIBUTED TRANSACTION PATTERNS                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. TWO-PHASE COMMIT (2PC)                                      │
│                                                                 │
│     Coordinator                                                 │
│         │                                                       │
│         ├──► Prepare? ──► Shard 1 ──► Yes                       │
│         ├──► Prepare? ──► Shard 2 ──► Yes                       │
│         │                                                       │
│         ├──► Commit ────► Shard 1                               │
│         └──► Commit ────► Shard 2                               │
│                                                                 │
│     ❌ Blocking if coordinator fails                            │
│     ❌ High latency                                             │
│     ✅ Strong consistency                                       │
│                                                                 │
│  2. SAGA PATTERN                                                │
│                                                                 │
│     T1 ──► T2 ──► T3 ──► Success                                │
│      │     │     │                                              │
│      │     │     └──► C3 (compensate)                           │
│      │     └──► C2                                              │
│      └──► C1                                                    │
│                                                                 │
│     ✅ No blocking                                              │
│     ✅ Better availability                                      │
│     ❌ Eventual consistency                                     │
│     ❌ Complex compensation logic                               │
│                                                                 │
│  3. OUTBOX PATTERN                                              │
│                                                                 │
│     ┌─────────────────────────────────┐                         │
│     │ Transaction:                    │                         │
│     │   UPDATE orders SET status=...  │                         │
│     │   INSERT INTO outbox (event)    │                         │
│     └─────────────────────────────────┘                         │
│                    │                                            │
│                    ▼                                            │
│     Background process reads outbox → publishes events          │
│                                                                 │
│     ✅ Atomic with local transaction                            │
│     ✅ At-least-once delivery                                   │
│     ❌ Requires polling or CDC                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
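The outbox pattern from the diagram can be sketched end to end with SQLite standing in for the shard's local database (table and column names here mirror the diagram but are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event TEXT, published INTEGER DEFAULT 0);
    INSERT INTO orders (id, status) VALUES (42, 'pending');
""")

# One LOCAL transaction updates state AND records the event atomically:
# either both happen or neither does.
with conn:
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 42")
    conn.execute("INSERT INTO outbox (event) VALUES (?)",
                 (json.dumps({"type": "order_shipped", "order_id": 42}),))

def drain_outbox(conn, publish):
    """Background relay: poll unpublished rows and push them to a broker."""
    rows = conn.execute("SELECT id, event FROM outbox WHERE published = 0").fetchall()
    for row_id, event in rows:
        publish(json.loads(event))  # at-least-once: a crash here means a retry later
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

drain_outbox(conn, publish=print)
```

Because `publish` runs before the `published = 1` update, a crash between the two causes a re-delivery on the next poll: this is the at-least-once guarantee, and it is why downstream consumers must be idempotent.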

Cross-Shard Query Patterns

```sql
-- Pattern 1: Scatter-Gather
-- Query every shard, aggregate results at the coordinator.
-- Application pseudocode:
--   results = []
--   for shard in shards:
--       results.extend(shard.query("SELECT * FROM orders WHERE status = 'pending'"))
--   return sorted(results, key=lambda x: x.created_at)[:100]

-- Pattern 2: Global Secondary Index
-- Maintain a separate index table for cross-shard queries
CREATE TABLE order_status_index (
    status VARCHAR(20),
    order_id BIGINT,
    shard_id INT,
    PRIMARY KEY (status, order_id)
);
-- Query the index first, then fetch from the specific shards

-- Pattern 3: Denormalization
-- Store frequently-queried data in multiple shards:
--   orders table (sharded by customer_id)
--   order_by_status table (sharded by status)
-- Write to both, read from whichever fits the query
```
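The scatter-gather pseudocode in pattern 1 can be made concrete. The sketch below uses an illustrative `FakeShard` in place of real shard connections, runs the fan-out in parallel, and k-way merges the per-shard results (each shard returns rows already sorted) instead of re-sorting everything at the coordinator:

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(shards, query, limit=100):
    """Fan the query out to all shards in parallel, then k-way merge sorted results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: s.query(query), shards))
    merged = heapq.merge(*per_shard, key=lambda row: row["created_at"])
    return list(itertools.islice(merged, limit))

class FakeShard:
    """Stand-in for a real shard connection; returns rows pre-sorted by created_at."""
    def __init__(self, rows):
        self.rows = sorted(rows, key=lambda r: r["created_at"])
    def query(self, _sql):
        return self.rows

shards = [FakeShard([{"id": i, "created_at": i} for i in range(0, 10, 2)]),
          FakeShard([{"id": i, "created_at": i} for i in range(1, 10, 2)])]
top = scatter_gather(shards, "SELECT * FROM orders WHERE status = 'pending'", limit=5)
print([r["created_at"] for r in top])  # → [0, 1, 2, 3, 4]
```

The merge-instead-of-sort choice matters at scale: the coordinator only needs to buffer one row per shard at a time rather than every shard's full result set.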

Data Modeling Best Practices

┌─────────────────────────────────────────────────────────────────┐
│              DISTRIBUTED MODELING CHECKLIST                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ✅ Choose sharding key based on access patterns                │
│  ✅ Design for single-shard queries when possible               │
│  ✅ Use distributed-friendly IDs (ULID, Snowflake)              │
│  ✅ Denormalize to avoid cross-shard joins                      │
│  ✅ Plan for eventual consistency                               │
│  ✅ Implement idempotent operations                             │
│  ✅ Design compensation logic for failures                      │
│  ✅ Monitor shard distribution and hot spots                    │
│  ✅ Plan resharding strategy upfront                            │
│  ✅ Test failure scenarios (network partition, node failure)    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔍 Common Review Comments

⚠️ Distributed Systems Review Feedback

Distributed system designs have no undo — resharding and consistency changes are multi-quarter projects.

| Review Comment | Meaning | Fix |
| --- | --- | --- |
| "Sharding key doesn't align with the primary access pattern" | Most queries need cross-shard scatter-gather | Redesign the shard key around the query pattern |
| "Cross-shard transactions will kill throughput" | 2PC across shards adds latency | Saga pattern or single-shard design |
| "No resharding plan yet" | Data growth will require resharding | Design for resharding from day 1 |
| "A mutable sharding key is used" | Changing an entity's shard requires a cross-shard move | Immutable shard keys only |
| "No eventual consistency handling" | Read-your-writes fails silently | Sticky sessions or explicit consistency |
| "Clock skew not considered" | Timestamp ordering is invalid across nodes | Logical clocks (Lamport, vector) |

⚠️ Anti-patterns

Anti-pattern: Cross-Shard Joins

┌─────────────────────────────────────────────────────────────────┐
│            CROSS-SHARD JOIN ANTI-PATTERN                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  Shard A: Users 1-1000                                          │
│  Shard B: Users 1001-2000                                       │
│  Query: SELECT * FROM users u JOIN orders o ON u.id = o.user_id │
│         WHERE o.created_at > '2024-01-01'                       │
│  → Orders for user 500 on Shard A, user 1500 on Shard B         │
│  → Scatter-gather to ALL shards, merge at coordinator           │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Latency: Proportional to slowest shard                       │
│  • Memory: Coordinator buffers all results                      │
│  • Availability: One shard down = query fails                   │
│                                                                 │
│  ✅ SOLUTION:                                                   │
│  • Collocate related data: Shard users + orders by user_id      │
│  • Denormalize: Embed order summary in user document            │
│  • Pre-aggregate: Materialized views per shard                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern: Mutable Sharding Key

┌─────────────────────────────────────────────────────────────────┐
│            MUTABLE SHARD KEY ANTI-PATTERN                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  Shard key: customer.region (US, EU, APAC)                      │
│  → Customer moves from US to EU                                 │
│  → All customer data must migrate from Shard-US to Shard-EU     │
│  → During migration: partial data, consistency issues           │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Complex migration logic                                      │
│  • Inconsistent state during transition                         │
│  • Cross-shard transactions required                            │
│                                                                 │
│  ✅ SOLUTION: Immutable shard key                               │
│  Shard by: customer_id (never changes)                          │
│  Store region as attribute (can change without resharding)      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern: Ignoring Eventual Consistency UX

┌─────────────────────────────────────────────────────────────────┐
│         CONSISTENCY UX ANTI-PATTERN                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  User: POST /orders → 201 Created                               │
│  User: GET /orders → Empty list (read from replica)             │
│  User: "Where's my order?!" → Support ticket                    │
│                                                                 │
│  REAL INCIDENT:                                                 │
│  E-commerce app with an eventually consistent order list        │
│  → Users place duplicate orders ("didn't see first one")        │
│  → $2M in duplicate shipments before fix                        │
│                                                                 │
│  ✅ SOLUTIONS:                                                  │
│  1. Read-your-writes: Sticky sessions to primary                │
│  2. Optimistic UI: Show pending item immediately                │
│  3. Explicit sync: "Order processing..." spinner                │
│  4. Version checking: Poll until version matches                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
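Solution 4 (version checking) can be sketched as follows. `Replica` is a toy stand-in; in a real system the client would carry the version number returned by the write and refuse to render a replica read that is older:

```python
import time

class Replica:
    """Toy replica: exposes its replication version alongside the data."""
    def __init__(self, version=0, data=None):
        self.version = version
        self.data = data or []
    def read(self):
        return self.version, list(self.data)

def read_your_writes(replica, min_version, timeout=1.0, interval=0.05):
    """Poll the replica until it has caught up to the version our write produced."""
    deadline = time.monotonic() + timeout
    while True:
        version, data = replica.read()
        if version >= min_version:
            return data
        if time.monotonic() >= deadline:
            raise TimeoutError("replica lagging; fall back to the primary")
        time.sleep(interval)  # replica still catching up: retry

fresh = Replica(version=5, data=["order-1"])
print(read_your_writes(fresh, min_version=5))  # → ['order-1']
```

The timeout branch is the important part: when the replica never catches up, the client should fall back to the primary (solution 1) rather than show the user an empty order list.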

🔄 Migration Notes

Resharding Strategies

| Strategy | Description | Trade-off |
| --- | --- | --- |
| Double-write | Write to old + new shard | Consistency complexity |
| Ghost tables | Shadow writes, flip at cutover | Storage cost |
| Logical dump | Export/import with downtime | Simplest, requires a window |
| Online migration | Streaming replication + cutover | Complex, zero downtime |

Dual-Write Pattern

┌─────────────────────────────────────────────────────────────────┐
│               DUAL-WRITE MIGRATION                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PHASE 1: Start dual-write                                      │
│  Write → Old Shard + New Shard                                  │
│  Read  → Old Shard                                              │
│                                                                 │
│  PHASE 2: Backfill                                              │
│  Copy historical data: Old → New (batched)                      │
│  Verify checksums                                               │
│                                                                 │
│  PHASE 3: Shadow read                                           │
│  Read both, compare results, log discrepancies                  │
│                                                                 │
│  PHASE 4: Cutover                                               │
│  Read  → New Shard                                              │
│  Write → New Shard + Old Shard (safety)                         │
│                                                                 │
│  PHASE 5: Cleanup                                               │
│  Stop writes to Old Shard                                       │
│  Deprecate and archive                                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
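Phase 1 can be wrapped in a small adapter in front of the data layer. `DualWriteStore` is an illustrative name, and plain dicts stand in for the two shards; note that a failed write to the new shard is logged for the backfill to repair rather than surfaced to the user, because the old shard remains the source of truth until cutover:

```python
import logging

class DualWriteStore:
    """Sketch of phase 1: writes go to both shards, reads stay on the old one."""
    def __init__(self, old, new, read_from_new=False):
        self.old, self.new = old, new
        self.read_from_new = read_from_new  # flipped at the phase-4 cutover

    def put(self, key, value):
        self.old[key] = value  # old shard is the source of truth
        try:
            self.new[key] = value
        except Exception:
            logging.exception("new-shard write failed; backfill will repair %r", key)

    def get(self, key):
        return (self.new if self.read_from_new else self.old).get(key)

store = DualWriteStore(old={}, new={})
store.put("user:1", {"name": "An"})
print(store.get("user:1"))
```

Flipping `read_from_new` is then the phase-4 cutover, and it is a one-line config change rather than a code change, which makes rollback equally cheap.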

📎 Cross-References