
🗃️ DATABASE DESIGN

Scaling Your Data: From a Single Server to a Distributed System


1. Vertical vs Horizontal Scaling

┌─────────────────────────────────────────────────────────┐
│              SCALING APPROACHES                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  VERTICAL SCALING (Scale Up):                           │
│  ─────────────────────────────                          │
│  [Small Server] → [BIGGER Server] → [BIGGEST Server]    │
│     4 vCPU          16 vCPU          64 vCPU            │
│     16GB RAM        128GB RAM        1TB RAM            │
│                                                         │
│  ✅ Simple: No code changes                             │
│  ✅ Strong consistency guaranteed                       │
│  ❌ Hardware limits (~256 cores, ~12TB RAM today)       │
│  ❌ Single point of failure                             │
│  ❌ Cost grows exponentially                            │
│                                                         │
│  ─────────────────────────────────────────────────────  │
│                                                         │
│  HORIZONTAL SCALING (Scale Out):                        │
│  ──────────────────────────────                         │
│  [Server] → [Server][Server][Server][Server]...         │
│                                                         │
│  ✅ Theoretically unlimited scale                       │
│  ✅ Better fault tolerance (redundancy)                 │
│  ✅ Linear cost growth                                  │
│  ❌ Complexity: distributed transactions, consistency   │
│  ❌ Network latency between nodes                       │
│                                                         │
└─────────────────────────────────────────────────────────┘

TIP

🔧 Engineer Raizo: Rule of thumb: scale VERTICALLY until you hit hardware limits or the cost becomes unreasonable, THEN scale horizontally. Premature horizontal scaling = unnecessary complexity.


2. Replication

2.1 Master-Slave (Primary-Replica) Architecture

┌─────────────────────────────────────────────────────────┐
│              MASTER-SLAVE REPLICATION                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│           ┌─────────────┐                               │
│           │   MASTER    │ ◄── All WRITES go here       │
│           │  (Primary)  │                               │
│           └──────┬──────┘                               │
│                  │                                      │
│        ┌─────────┼─────────┐                            │
│        │ Replication Stream │ (async or sync)           │
│        ▼         ▼         ▼                            │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                 │
│  │  SLAVE 1 │ │  SLAVE 2 │ │  SLAVE 3 │                 │
│  │ (Replica)│ │ (Replica)│ │ (Replica)│                 │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘                 │
│       │            │            │                       │
│       └────────────┴────────────┘                       │
│                    │                                    │
│                    ▼                                    │
│              READS distributed here                     │
│                                                         │
└─────────────────────────────────────────────────────────┘
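The routing in the diagram can be sketched as a thin wrapper that sends every write to the master and load-balances reads across the replicas. This is a minimal sketch: `primary` and `replicas` stand in for real database connection handles with an `execute` method.

```python
import itertools

class ReplicatedDB:
    """Route writes to the primary; round-robin reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._rr = itertools.cycle(replicas)  # round-robin over replicas

    def write(self, sql, params=()):
        # All writes MUST go to the primary.
        return self.primary.execute(sql, params)

    def read(self, sql, params=()):
        # Reads are distributed across the replicas.
        replica = next(self._rr)
        return replica.execute(sql, params)
```

Real routers (e.g. ProxySQL, pgpool) add health checks and replica lag awareness on top of this basic split.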

2.2 Replication Lag Problem

┌─────────────────────────────────────────────────────────┐
│                 REPLICATION LAG                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  T=0ms:  Client writes "balance=100" to MASTER          │
│  T=1ms:  Master confirms write to client                │
│  T=2ms:  Client reads from SLAVE (load balanced)        │
│          → SLAVE still has "balance=0" (lag!)           │
│  T=50ms: Replication catches up                         │
│                                                         │
│  TIMELINE:                                              │
│  ─────────                                              │
│  Master:  [0ms: balance=100] ─────────────────────────► │
│                        │                                │
│                   replication                           │
│                   (async, 50ms lag)                     │
│                        │                                │
│  Slave:   [0ms: balance=0] ───► [50ms: balance=100]     │
│                    ▲                                    │
│                    │                                    │
│              Client reads here (STALE!)                 │
│                                                         │
└─────────────────────────────────────────────────────────┘

Solutions:

python
# Solution 1: Read Your Own Writes
# After write, force reads to Master for X seconds
def update_user(user_id: str, data: dict):
    master_db.execute("UPDATE users SET ...", data)
    # Set marker: "this user modified their data"
    cache.set(f"dirty:{user_id}", "1", ttl=30)

def get_user(user_id: str):
    if cache.exists(f"dirty:{user_id}"):
        return master_db.query(user_id)  # Read from master
    return replica_db.query(user_id)  # Normal: read from replica

# Solution 2: Synchronous Replication (Consistency over Performance)
# PostgreSQL: synchronous_commit = on
# MySQL: rpl_semi_sync_master_enabled = 1
# Trade-off: Write latency increases significantly!

# Solution 3: Monotonic Reads
# Track last read position, only read from replicas that are caught up
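Solution 3 can be fleshed out with log-position tracking: the client remembers the highest log sequence number (LSN) it has observed and only reads from replicas that have replayed at least that far. A sketch, assuming hypothetical `replica.lsn()` / `replica.query()` helpers and a master whose `execute` returns the commit LSN:

```python
class MonotonicReader:
    """Monotonic reads: never serve data older than what this client saw."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas
        self.last_seen_lsn = 0  # highest log position this client observed

    def write(self, sql):
        # Assume the master returns the LSN of the committed write.
        self.last_seen_lsn = self.master.execute(sql)

    def read(self, sql):
        # Pick any replica that has caught up to our last-seen position.
        for replica in self.replicas:
            if replica.lsn() >= self.last_seen_lsn:
                return replica.query(sql)
        # No replica has caught up -> fall back to the master.
        return self.master.query(sql)
```

In PostgreSQL the same idea maps onto `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on a standby.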

IMPORTANT

The HPN philosophy on Eventual Consistency: replication lag is UNAVOIDABLE with async replication. You MUST design your application logic to accept eventual consistency. Example: after a user posts a comment, redirect back to the page after a 2-second delay.


3. Sharding

3.1 Sharding Concept

┌─────────────────────────────────────────────────────────┐
│                    SHARDING                             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Split data across multiple databases based on KEY      │
│                                                         │
│  ┌───────────────────────────────────────────────────┐  │
│  │               ALL USERS DATA                       │  │
│  │  (100 million rows - too big for one server!)     │  │
│  └───────────────────────────────────────────────────┘  │
│                         │                               │
│                    SHARD KEY                            │
│                    (user_id)                            │
│                         │                               │
│         ┌───────────────┼───────────────┐               │
│         ▼               ▼               ▼               │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐        │
│  │  SHARD 1   │  │  SHARD 2   │  │  SHARD 3   │        │
│  │ users 1-33M│  │users 34-66M│  │users 67-100M│       │
│  └────────────┘  └────────────┘  └────────────┘        │
│                                                         │
└─────────────────────────────────────────────────────────┘

3.2 Range-based vs Hash-based Sharding

┌─────────────────────────────────────────────────────────┐
│                 SHARDING STRATEGIES                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  RANGE-BASED:                                           │
│  ────────────                                           │
│  Shard 1: user_id 1-1,000,000                           │
│  Shard 2: user_id 1,000,001-2,000,000                   │
│  Shard 3: user_id 2,000,001-3,000,000                   │
│                                                         │
│  ✅ Easy range queries (get users 500-600)              │
│  ✅ Simple to understand                                │
│  ❌ Uneven distribution (new users → last shard)        │
│  ❌ Hot shard problem                                   │
│                                                         │
│  ─────────────────────────────────────────────────────  │
│                                                         │
│  HASH-BASED:                                            │
│  ───────────                                            │
│  shard_id = hash(user_id) % num_shards                  │
│                                                         │
│  ✅ Even distribution (statistically)                   │
│  ✅ No hot shard by default                             │
│  ❌ Range queries impossible (must query all shards)    │
│  ❌ Resharding nightmare (mod N problem!)               │
│     → Use CONSISTENT HASHING instead!                   │
│                                                         │
└─────────────────────────────────────────────────────────┘
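The "mod N problem" above can be made concrete: with `hash(key) % N`, growing from N to N+1 shards remaps roughly N/(N+1) of all keys, while a consistent-hash ring moves only about 1/(N+1) of them. A minimal ring sketch using `bisect` and a stable digest (Python's built-in `hash()` is salted per process, so it cannot be used for shard placement); real deployments tune the number of virtual nodes per server:

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    # Stable across processes, unlike the built-in hash().
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node appears `vnodes` times on the ring
        # to smooth out the key distribution.
        self._ring = sorted(
            (stable_hash(f"{node}#{v}"), node)
            for node in nodes
            for v in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's position.
        idx = bisect.bisect(self._positions, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Adding a node now only steals keys from its clockwise neighbors, instead of reshuffling `hash % N` for almost everyone.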

3.3 Hot Partition Problem (The Justin Bieber Problem)

┌─────────────────────────────────────────────────────────┐
│              HOT PARTITION DISASTER                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  SCENARIO: Twitter shards by user_id                    │
│                                                         │
│  Shard 1: [user:12345 - normal user, 10 followers]     │
│           [user:12346 - normal user, 50 followers]     │
│           ...                                           │
│           CPU: 5%, Memory: 20%                          │
│                                                         │
│  Shard 7: [user:justinbieber - 150 MILLION followers]  │
│           │                                             │
│           └─► Every tweet = 150M timeline writes!       │
│           CPU: 100% 💥, Memory: 100% 💥                 │
│           Disk I/O: saturated                           │
│           Other users on Shard 7: SLOW!                 │
│                                                         │
└─────────────────────────────────────────────────────────┘

Solutions:

python
# Solution 1: Virtual Users / Sub-sharding
# Split a hot user's data across multiple "virtual" partitions
N = 16  # total number of base shards (example value)

def get_shard_for_tweet(user_id: str, tweet_id: str) -> str:
    if user_id in HOT_USERS:  # maintain a list of known celebrities
        # Distribute the celebrity's tweets across 10 sub-shards
        sub_shard = hash(tweet_id) % 10
        return f"shard_{hash(user_id) % N}_{sub_shard}"
    return f"shard_{hash(user_id) % N}"

# Solution 2: Separate Hot Path
# Celebrity tweets go to specialized high-capacity cluster
HOT_USER_THRESHOLD = 1_000_000  # followers

def route_tweet(user_id: str, tweet: Tweet):
    follower_count = get_follower_count(user_id)
    if follower_count > HOT_USER_THRESHOLD:
        return celebrity_cluster.write(tweet)  # Dedicated infra
    return normal_shard.write(user_id, tweet)

# Solution 3: Fan-out at Read vs Write
# Don't pre-compute timelines for celebrities;
# instead, merge their tweets at read time
from typing import List

def get_timeline(user_id: str) -> List[Tweet]:
    # Get tweets from followed normal users (pre-computed)
    timeline = cache.get(f"timeline:{user_id}")
    
    # Merge with celebrity tweets on-the-fly
    for celeb_id in get_followed_celebrities(user_id):
        celeb_tweets = celeb_cache.get_recent(celeb_id)
        timeline = merge_sorted(timeline, celeb_tweets)
    
    return timeline[:100]  # Top 100
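The `merge_sorted` call in Solution 3 can be backed by `heapq.merge`, which lazily merges already-sorted streams. A sketch, assuming each tweet is a dict carrying a `"ts"` timestamp and each input timeline is already sorted newest-first:

```python
import heapq
from operator import itemgetter

def merge_sorted(*timelines, key=itemgetter("ts"), limit=100):
    """Merge timelines sorted newest-first; keep only the newest `limit`."""
    # heapq.merge merges lazily; reverse=True preserves newest-first order.
    merged = heapq.merge(*timelines, key=key, reverse=True)
    return [tweet for _, tweet in zip(range(limit), merged)]
```

Because the merge is lazy, only the first `limit` tweets are ever materialized, even if a celebrity cache holds thousands.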

4. CAP Theorem Applied

4.1 The CAP Triangle

┌─────────────────────────────────────────────────────────┐
│                   CAP THEOREM                           │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  In a distributed system, you can only have 2 of 3:     │
│                                                         │
│                 CONSISTENCY (C)                         │
│                      /\                                 │
│                     /  \                                │
│                    /    \                               │
│                   /      \                              │
│                  /   ??   \                             │
│                 /          \                            │
│                /____________\                           │
│     AVAILABILITY (A)    PARTITION                       │
│                         TOLERANCE (P)                   │
│                                                         │
│  ⚠️ In production, Partition WILL happen               │
│     (network failure, cable cut, datacenter split)     │
│                                                         │
│  So the REAL choice is: C+P or A+P                      │
│                                                         │
└─────────────────────────────────────────────────────────┘

4.2 The Scenario: A Submarine Cable Is Cut

┌─────────────────────────────────────────────────────────┐
│        PARTITION SCENARIO: SUBMARINE CABLE CUT          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│            ┌──────────────┐                             │
│            │   US EAST    │                             │
│            │  Datacenter  │                             │
│            │  (Primary)   │                             │
│            └──────────────┘                             │
│                   │                                     │
│                   │ ← CABLE CUT! ✂️                     │
│                   │   (Network Partition)               │
│                   ╳                                     │
│                   │                                     │
│            ┌──────────────┐                             │
│            │    EUROPE    │                             │
│            │  Datacenter  │                             │
│            │  (Replica)   │                             │
│            └──────────────┘                             │
│                                                         │
│  European users are sending requests...                 │
│  What do you do?                                        │
│                                                         │
└─────────────────────────────────────────────────────────┘

4.3 Choice 1: Consistency + Partition Tolerance (CP)

┌─────────────────────────────────────────────────────────┐
│              CP SYSTEM (Consistency First)              │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  BEHAVIOR: Reject writes, serve reads from local cache  │
│                                                         │
│  European User Request:                                 │
│    "Transfer $1000 from Account A to B"                 │
│                                                         │
│  System Response:                                       │
│    HTTP 503: "Service temporarily unavailable.          │
│    We cannot guarantee consistency. Please retry."      │
│                                                         │
│  WHY? Without syncing to Primary:                       │
│    - User A might overdraft (both DCs accept transfer)  │
│    - Account balance would be inconsistent              │
│                                                         │
│  EXAMPLES: Banking, Financial systems, Inventory        │
│                                                         │
│  Trade-off: Users in EU CANNOT use system during        │
│             partition. But data is ALWAYS correct.      │
│                                                         │
└─────────────────────────────────────────────────────────┘

4.4 Choice 2: Availability + Partition Tolerance (AP)

┌─────────────────────────────────────────────────────────┐
│              AP SYSTEM (Availability First)             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  BEHAVIOR: Accept all operations, resolve later         │
│                                                         │
│  European User Request:                                 │
│    "Post new tweet: Hello World!"                       │
│                                                         │
│  System Response:                                       │
│    HTTP 200: "Tweet posted successfully!"               │
│    (Stored locally, will sync when cable restored)      │
│                                                         │
│  WHY? For social media:                                 │
│    - User experience > perfect consistency              │
│    - Duplicate tweets can be resolved later             │
│    - "Eventually consistent" is acceptable              │
│                                                         │
│  EXAMPLES: Social media, DNS, Shopping cart             │
│                                                         │
│  Trade-off: Some users might see stale data,            │
│             conflicts need resolution strategy          │
│                                                         │
└─────────────────────────────────────────────────────────┘
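The two behaviors can be contrasted directly in code. During a partition, a CP node refuses writes it cannot replicate, while an AP node accepts them locally and queues them for later sync. A sketch under stated assumptions: `peer_reachable` stands in for real failure detection, and the replay is naive last-write-wins (real systems use vector clocks or CRDTs to resolve concurrent conflicts).

```python
class CPNode:
    """Consistency first: refuse writes during a partition."""

    def __init__(self, store):
        self.store = store
        self.peer_reachable = True  # set False when the partition is detected

    def write(self, key, value):
        if not self.peer_reachable:
            # Equivalent of HTTP 503: consistency cannot be guaranteed.
            raise RuntimeError("503 Service Unavailable: partition in progress")
        self.store[key] = value


class APNode:
    """Availability first: always accept, sync when the partition heals."""

    def __init__(self, store):
        self.store = store
        self.pending = []  # writes waiting to be replicated to peers

    def write(self, key, value):
        self.store[key] = value            # always succeeds locally
        self.pending.append((key, value))  # replay to peers later

    def heal(self, peer_store):
        # Naive last-write-wins replay once connectivity returns.
        for key, value in self.pending:
            peer_store[key] = value
        self.pending.clear()
```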

IMPORTANT

🎓 Professor Tom: CAP is not a binary choice. In practice, most systems are "tunable" - you can choose the consistency level PER OPERATION. Cassandra, for example, lets you read at ONE (fast, maybe stale) or at QUORUM (slower, consistent).
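The tunable model boils down to one inequality: with N replicas, a write acknowledged by W nodes and a read served from R nodes are guaranteed to overlap on at least one up-to-date replica whenever R + W > N. A worked check (the rule itself, not tied to any specific database):

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees read and write quorums overlap on >= 1 replica."""
    return r + w > n

# With N=3 replicas:
#   read ONE (R=1) + write ONE (W=1)       -> fast, but reads may be stale
#   read QUORUM (R=2) + write QUORUM (W=2) -> overlap guaranteed
```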


5. Decision Matrix

Requirement               | Recommended Approach
------------------------- | ---------------------------------------------
Read-heavy, write-light   | Master + Multiple Replicas
Write-heavy               | Sharding (hash-based with consistent hashing)
Strong consistency        | Synchronous replication / CP system
Geographic distribution   | Multi-master with conflict resolution
Analytics on full data    | Read replicas dedicated for OLAP

6. Module Summary


🎓 Fundamentals Complete!

You have completed the Fundamentals. You now have a foundation in:

  • Scalability: TCP/UDP, HTTP evolution, QUIC
  • Load Balancing: L4/L7, Consistent Hashing
  • Caching: Patterns, Thundering Herd, Penetration
  • Database Design: Replication, Sharding, CAP

👉 Back to System Design Universe →