Database Design — Chọn đúng database cho hệ thống

Năm 2010, Twitter chuyển từ MySQL sang một hệ thống tự phát triển tên Manhattan — không phải vì MySQL kém, mà vì ở quy mô 400M tweets/ngày, mô hình dữ liệu quan hệ tạo ra quá nhiều JOIN đắt đỏ. Ngược lại, Shopify phục vụ hàng triệu merchants trên MySQL đến tận hôm nay — bằng cách shard thông minh thay vì thay đổi database engine.

Không có database "tốt nhất". Chỉ có database phù hợp nhất cho workload cụ thể. Chọn sai database ở giai đoạn đầu giống như xây nhà trên nền móng không đúng — mọi thứ phía trên đều bất ổn, và chi phí sửa tăng theo cấp số nhân theo thời gian.

Bài này trang bị cho bạn framework để đưa ra quyết định database có cơ sở — không phải dựa trên hype hay quen thuộc, mà dựa trên workload analysis và trade-off engineering.

Bức tranh tư duy

Chọn database giống như chọn phương tiện di chuyển ở Việt Nam.

Xe máy (Redis): Cực nhanh, linh hoạt, nhưng chở được ít — tốt cho "chở ít đồ, đi nhanh" (cache, session, counter). Ô tô (PostgreSQL): Đáng tin cậy, chở được nhiều, có luật giao thông rõ ràng (ACID) — tốt cho "vận chuyển hàng quý giá" (financial data, user profiles). Xe tải (Cassandra): Chở hàng tấn, chạy nhiều tuyến song song, nhưng khó lùi khó quay (schema changes, complex queries) — tốt cho "vận chuyển khối lượng lớn" (time-series, logs, IoT). Xe buýt (Elasticsearch): Đón trả khách linh hoạt, tìm đường nhanh — tốt cho "tìm kiếm" (full-text search, analytics).

Lưu ý: decision tree này là điểm khởi đầu, không phải kết luận. Mỗi nhánh có hàng chục nuances mà chúng ta sẽ phân tích chi tiết bên dưới.

Cốt lõi kỹ thuật

SQL vs NoSQL — Decision Framework

Tiêu chí	SQL (Relational)	NoSQL
Data model	Tables, rows, relationships	Document, key-value, wide-column, graph
Schema	Fixed, enforced	Flexible, schema-on-read
Transactions	ACID (strong guarantees)	Thường BASE (eventual consistency)
Scaling	Vertical chủ yếu, horizontal phức tạp	Horizontal-first design
Query	SQL — cực mạnh cho JOIN, aggregation	Hạn chế — tối ưu cho access pattern cụ thể
Best for	Complex relationships, consistency critical	High throughput, flexible schema, horizontal scale

Quy tắc thực tế: Nếu bạn chưa chắc chắn, chọn PostgreSQL. Nó cover 80% use cases, có ACID, JSON support (document-like), full-text search, và extensible. Chỉ chuyển sang NoSQL khi có evidence rõ ràng rằng PostgreSQL không đáp ứng workload.

ACID vs BASE

ACID (SQL databases):

Thuộc tính	Ý nghĩa	Ví dụ
Atomicity	Tất cả hoặc không gì cả	Chuyển tiền: trừ A + cộng B thành công hoặc cả hai rollback
Consistency	Dữ liệu luôn ở trạng thái hợp lệ	Balance không bao giờ âm
Isolation	Transactions không thấy data chưa commit	2 người cùng đặt vé cuối cùng
Durability	Data đã commit không bao giờ mất	Server crash → data vẫn còn sau restart

BASE (NoSQL databases):

Thuộc tính	Ý nghĩa	Trade-off
Basically Available	Hệ thống luôn trả response	Response có thể stale
Soft state	State có thể thay đổi theo thời gian	Không guarantee consistency tại mọi thời điểm
Eventual consistency	Cuối cùng sẽ consistent	"Cuối cùng" có thể là ms hoặc phút

Indexing Strategies

Index là cấu trúc dữ liệu phụ giúp database tìm rows nhanh mà không cần scan toàn bộ table.

B-tree Index — Default trong hầu hết SQL databases

                    [50]
                   /    \
             [20, 35]    [65, 80]
            /   |   \   /   |   \
          [10] [25] [40] [55] [70] [90]

Tốt cho: Equality (=), range (BETWEEN, >, <), ORDER BY, prefix LIKE ('abc%')
Không tốt cho: Full-text search, suffix LIKE ('%abc'), high write throughput
Complexity: O(log N) cho lookup, insert, delete

Hash Index — Nhanh nhất cho equality lookup

Tốt cho: Exact match (WHERE id = 123)
Không tốt cho: Range query, ORDER BY
Complexity: O(1) average cho lookup
Dùng bởi: Redis internally, PostgreSQL hash index, memory-optimized tables

GIN Index (Generalized Inverted Index) — Cho composite values

Tốt cho: Full-text search, JSONB queries, array contains
Không tốt cho: Simple equality (overkill), high write throughput (update chậm)
PostgreSQL example: CREATE INDEX idx_tags ON articles USING gin(tags);

Index Type	Lookup	Range	Full-text	Write cost	Space
B-tree	O(log N)	✅ Tốt	❌	Trung bình	Trung bình
Hash	O(1)	❌	❌	Thấp	Thấp
GIN	O(log N)	❌	✅ Rất tốt	Cao	Cao

Normalization vs Denormalization

Normalization (tách bảng, giảm trùng lặp):

sql

-- 3NF: Mỗi fact lưu đúng 1 nơi
users:   id, name, email
orders:  id, user_id, total, created_at
items:   id, order_id, product_id, quantity, price

Denormalization (gộp bảng, tăng redundancy):

sql

-- Denormalized: Tất cả trong 1 document
{
  "user_id": 1,
  "user_name": "Minh",
  "orders": [{
    "total": 500000,
    "items": [{"product": "Laptop", "qty": 1, "price": 500000}]
  }]
}

Tiêu chí	Normalization	Denormalization
Read	Chậm (nhiều JOINs)	Nhanh (1 query)
Write	Nhanh (update 1 chỗ)	Chậm (update nhiều chỗ)
Consistency	Cao (single source of truth)	Thấp (data có thể inconsistent)
Storage	Tiết kiệm	Tốn hơn
Best for	Write-heavy, data integrity critical	Read-heavy, query pattern cố định

CAP Theorem Applied to Databases

Database	CAP Trade-off	Giải thích
PostgreSQL	CA (single node), CP (với sync replicas)	Strong consistency, sacrifice availability khi partition
MySQL	CA (single node), CP (Group Replication)	Tương tự PostgreSQL
MongoDB	CP (default) hoặc AP (tuỳ config)	`w:majority` = CP, `w:1` = AP
Cassandra	AP	Tunable consistency, nhưng default eventual
Redis	AP (Cluster mode)	Async replication, có thể mất data khi failover
CockroachDB	CP	NewSQL: distributed SQL với strong consistency

NewSQL — Best of Both Worlds?

NewSQL databases (CockroachDB, TiDB, Spanner) hứa hẹn: SQL interface + horizontal scalability + ACID transactions.

Trade-offs thực tế:

Latency cao hơn single-node SQL (cross-node consensus)
Complexity vận hành cao
Ecosystem và tooling chưa mature bằng PostgreSQL/MySQL
Cost cao hơn (đặc biệt Spanner)

Khi nào cân nhắc: >10TB data + cần ACID + cần horizontal scale + team có expertise.

Thực chiến

Bối cảnh: Startup xây dựng social platform, 1M users target Year 1. Features: user profiles, posts, comments, likes, real-time feed, search.

Mục tiêu: Chọn database stack tối ưu cho từng workload.

Read/Write Ratio Analysis:

User profiles: 95% read / 5% write → Read-heavy
Posts timeline: 90% read / 10% write → Read-heavy
Likes counter: 50% read / 50% write → Mixed
Notifications: 10% read / 90% write → Write-heavy
Search: 100% read → Read-only

Database Stack đề xuất:

Workload	Database	Lý do
User profiles, Posts, Comments	PostgreSQL	ACID cho data integrity, JOIN cho relationships
Timeline cache	Redis (Sorted Set)	Sub-ms latency, sorted by timestamp
Like counters	Redis (INCR)	Atomic increment, persist periodically to PG
Notifications	Cassandra hoặc PostgreSQL partitioned	Write-heavy, time-series pattern
Full-text search	Elasticsearch	Inverted index, relevance scoring

Phân tích: Tại 1M users, PostgreSQL single instance handle thoải mái (với proper indexing). Redis cho timeline và counters. Elasticsearch sync từ PostgreSQL qua Change Data Capture. Chưa cần Cassandra ở scale này.

Tình huống: Index optimization cho query chậm

Bối cảnh: Query tìm kiếm đơn hàng mất 3.2s trên table 50M rows:

sql

SELECT * FROM orders
WHERE user_id = 12345
  AND status = 'completed'
  AND created_at > '2024-01-01'
ORDER BY created_at DESC
LIMIT 20;

Phân tích với EXPLAIN ANALYZE:

Seq Scan on orders → quét toàn bộ 50M rows
Không có index phù hợp

Giải pháp: Composite index theo query pattern:

sql

CREATE INDEX idx_orders_user_status_date
ON orders (user_id, status, created_at DESC);

Kết quả: Query giảm từ 3.2s xuống 2ms — B-tree index traverse O(log 50M) ≈ 26 comparisons thay vì scan 50M rows.

Edge case đã xử lý: created_at DESC trong index khớp với ORDER BY created_at DESC → không cần sort riêng.

Sai lầm điển hình

❌ Sai lầm 1: Dùng MongoDB cho mọi thứ

Vấn đề: Team chọn MongoDB vì "flexible schema, dễ bắt đầu" rồi nhận ra cần transactions, JOINs, và data integrity constraints mà MongoDB không support tốt.

Tại sao hay mắc: MongoDB marketing hiệu quả, và flexible schema thực sự hấp dẫn ở giai đoạn prototype. Nhưng "dễ bắt đầu" không có nghĩa là "dễ vận hành ở production scale".

Đúng: Phân tích workload trước khi chọn. Nếu data có relationships rõ ràng (user → orders → items), SQL database gần như luôn là lựa chọn tốt hơn. MongoDB tốt cho: CMS content, product catalogs với schema biến đổi, event logging.

❌ Sai lầm 2: Over-Normalizing

Vấn đề: Thiết kế đến 6NF với hàng chục bảng nhỏ, mỗi query cần 8+ JOINs → performance thảm hại.

sql

-- Over-normalized: 5 JOINs cho 1 page load
SELECT u.name, p.title, c.body, l.count, t.name
FROM users u
JOIN posts p ON u.id = p.user_id
JOIN comments c ON p.id = c.post_id
JOIN like_counts l ON p.id = l.post_id
JOIN tags t ON p.id = t.post_id
WHERE u.id = 123;

Tại sao sai: Mỗi JOIN thêm disk I/O và CPU cost. Ở production scale (millions rows), 5+ JOINs có thể biến query 2ms thành 2s.

Đúng: Normalize đến 3NF cho write-heavy data. Denormalize (hoặc dùng materialized views) cho read-heavy access patterns. Cân bằng giữa data integrity và query performance.

❌ Sai lầm 3: Thiếu indexes

Vấn đề: Table 10M+ rows mà không có index cho queries thường dùng → mọi query đều full table scan.

Tại sao sai: Không có index, PostgreSQL phải đọc toàn bộ data pages. Với 10M rows × 100 bytes/row = 1GB data, mỗi query đọc 1GB từ disk thay vì 3-4 pages (32KB).

Đúng: Analyze query patterns → tạo indexes cho WHERE, JOIN, ORDER BY columns. Dùng pg_stat_user_indexes để kiểm tra index usage. Remove unused indexes (chúng vẫn tốn write cost).

❌ Sai lầm 4: N+1 Query Problem

Vấn đề: Fetch 100 users, rồi loop qua từng user để fetch orders → 1 + 100 = 101 queries.

python

# ❌ SAI: N+1 queries
users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    orders = db.query(f"SELECT * FROM orders WHERE user_id = {user.id}")

Tại sao sai: 101 round trips đến database thay vì 1-2. Mỗi round trip có network latency (~0.5ms trong datacenter). 101 queries × 0.5ms = 50ms chỉ cho network, chưa tính query execution.

python

# ✅ ĐÚNG: 2 queries, JOIN hoặc IN clause
users = db.query("SELECT * FROM users LIMIT 100")
user_ids = [u.id for u in users]
orders = db.query(f"SELECT * FROM orders WHERE user_id IN ({','.join(user_ids)})")

❌ Sai lầm 5: Không có connection pooling

Vấn đề: Mỗi request tạo database connection mới → connection setup overhead + database connection limit exhausted.

Tại sao sai: PostgreSQL fork process mới cho mỗi connection (~10MB RAM). 200 concurrent connections = 2GB RAM chỉ cho connections. Tạo connection mất ~5-10ms (TCP + TLS handshake + auth).

Đúng: Dùng connection pooler (PgBouncer cho PostgreSQL). Pool 20-50 connections có thể serve hàng nghìn concurrent requests nhờ transaction-level pooling.

Under the Hood

B-tree vs LSM-tree Storage Engines

Hai kiến trúc storage engine chính, với trade-offs hoàn toàn trái ngược:

B-tree (PostgreSQL, MySQL InnoDB):

Write: Đọc page → update in-place → write page lại → update WAL
Read: Traverse tree O(log N) → đọc leaf page
Tối ưu cho read-heavy workloads

LSM-tree (Cassandra, RocksDB, LevelDB):

Write: Append vào memtable (in-memory) → flush thành SSTable trên disk (sequential write)
Read: Check memtable → check bloom filters → check SSTables (có thể chậm)
Tối ưu cho write-heavy workloads

Tiêu chí	B-tree	LSM-tree
Write	Random I/O (chậm trên HDD)	Sequential I/O (nhanh)
Read	Predictable O(log N)	Có thể chậm (check nhiều SSTables)
Space	Fragmentation theo thời gian	Compaction overhead
Write amplification	~2-3x	~10-30x (compaction)
Read amplification	1x (single page)	~1-5x (multiple SSTables)
Best for	OLTP, read-heavy	Write-heavy, time-series, logging

WAL (Write-Ahead Log)

WAL đảm bảo durability: mọi thay đổi được ghi vào log trước khi apply vào data files.

1. Transaction: UPDATE balance SET amount = 900 WHERE user_id = 1
2. Write WAL record: {txn_id: 42, table: balance, row: 1, old: 1000, new: 900}
3. Flush WAL to disk (fsync)
4. Apply change to data page (có thể async)
5. Commit

Khi crash xảy ra giữa step 3 và 4: restart → replay WAL → data consistent. WAL là lý do PostgreSQL/MySQL không mất data khi server crash.

WAL size management: WAL segments tích lũy → cần archiving hoặc checkpoint để giải phóng. Checkpoint frequency là trade-off giữa recovery time và I/O overhead.

MVCC (Multi-Version Concurrency Control)

MVCC cho phép nhiều transactions đọc/ghi cùng data mà không cần lock — bằng cách giữ nhiều versions của mỗi row.

Transaction T1 (read):  Thấy version tại thời điểm T1 bắt đầu
Transaction T2 (write): Tạo version mới, không ảnh hưởng T1
Transaction T3 (read):  Nếu bắt đầu sau T2 commit → thấy version mới

Trade-off: MVCC tránh read locks (readers không block writers), nhưng tạo ra dead tuples cần VACUUM (PostgreSQL) hoặc purge (MySQL). Bỏ quên VACUUM → table bloat → performance degradation.

Connection Pooling Math

Bài toán: App có 50 instances, mỗi instance cần 20 DB connections. Tổng: 1000 connections. PostgreSQL recommend max_connections = 100-300.

Giải pháp: PgBouncer ở transaction mode:

50 app instances × 20 connections = 1000 connections đến PgBouncer
PgBouncer → 50 connections đến PostgreSQL (transaction pooling)

Vì mỗi query chỉ hold connection ~5ms trong tổng request 200ms
→ Utilization ratio: 5ms / 200ms = 2.5%
→ 1000 × 2.5% = 25 connections cần thiết cùng lúc
→ Pool 50 connections dư sức (headroom cho burst)

Formula: pool_size = active_connections × (avg_query_time / avg_request_time) × safety_factor

Trade-offs tổng hợp

Quyết định	Khi NÊN	Khi KHÔNG NÊN
PostgreSQL	Default choice, complex queries, ACID	Extreme write throughput (>100K writes/s)
MongoDB	Flexible schema, document-oriented data	Strong consistency, complex JOINs
Redis	Cache, session, counters, sub-ms latency	Primary data store (trừ khi accept data loss)
Cassandra	Write-heavy, multi-region, time-series	Complex queries, small scale (<100GB)
Elasticsearch	Full-text search, analytics	Primary data store, transactions

Checklist ghi nhớ

✅ Checklist triển khai

Database Selection

[ ] Phân tích read/write ratio trước khi chọn database
[ ] Default PostgreSQL nếu không có lý do cụ thể cho NoSQL
[ ] Xác định CAP trade-off: CP hay AP cho use case này?
[ ] Polyglot persistence: dùng nhiều databases cho nhiều workloads

Indexing

[ ] Index cho tất cả columns trong WHERE, JOIN, ORDER BY thường dùng
[ ] Composite index đúng thứ tự (high selectivity columns trước)
[ ] Monitor index usage, remove unused indexes
[ ] EXPLAIN ANALYZE cho mọi slow query (>100ms)

Performance

[ ] Connection pooling (PgBouncer/HikariCP) — không direct connections
[ ] N+1 query detection (ORM query logging)
[ ] VACUUM/maintenance schedule cho PostgreSQL
[ ] Read replicas cho read-heavy workloads (>80% reads)

Data Integrity

[ ] Foreign keys cho critical relationships
[ ] Constraints (NOT NULL, CHECK, UNIQUE) thay vì validate chỉ ở app
[ ] Backup strategy: daily full + WAL archiving cho point-in-time recovery
[ ] Test restore procedure định kỳ (backup không test = không có backup)

Bài tập luyện tập

Bài 1: Database Selection — Foundation

Đề bài: Cho 4 hệ thống, chọn database phù hợp nhất:

System A: Banking transaction ledger, 10K TPS, zero data loss tolerance
System B: IoT sensor data, 500K writes/s, query by time range
System C: E-commerce product catalog, nested categories, variable attributes
System D: Social media feed, 1M reads/s, eventual consistency OK

🧠 Quiz

Câu hỏi: Database nào phù hợp nhất cho System B (IoT sensor data)?

[ ] A. PostgreSQL
[ ] B. MongoDB
[x] C. Cassandra (hoặc TimescaleDB)
[ ] D. Redis Giải thích: IoT sensor data là write-heavy, time-series, cần horizontal scale. Cassandra với LSM-tree tối ưu cho sequential writes. TimescaleDB (PostgreSQL extension) cũng phù hợp nếu cần SQL query. Redis không phù hợp vì data quá lớn cho in-memory. MongoDB không tối ưu cho write throughput ở mức này.

Bài 2: Index Design — Intermediate

Đề bài: Table events có 100M rows, schema:

sql

CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    payload JSONB,
    created_at TIMESTAMPTZ NOT NULL
);

Queries thường dùng:

WHERE user_id = ? AND event_type = ? ORDER BY created_at DESC LIMIT 50
WHERE created_at > ? AND created_at < ?
WHERE payload @> '{"action": "purchase"}'

Thiết kế index strategy cho cả 3 queries.

💡 Gợi ý

Query 1: composite index, column order matters
Query 2: single column hoặc range index
Query 3: GIN index cho JSONB

✅ Lời giải

sql

-- Query 1: Composite B-tree (most selective first)
CREATE INDEX idx_events_user_type_date
ON events (user_id, event_type, created_at DESC);

-- Query 2: B-tree on timestamp (range query)
CREATE INDEX idx_events_created
ON events (created_at);

-- Query 3: GIN index for JSONB containment
CREATE INDEX idx_events_payload
ON events USING gin(payload);

Phân tích:

Query 1: B-tree traverse → filter user_id → filter event_type → index-ordered by created_at → no sort needed
Query 2: B-tree range scan trên created_at
Query 3: GIN inverted index cho JSONB path lookup

Trade-off: 3 indexes trên table 100M rows sẽ tốn disk space (~20-30% table size mỗi index) và slow down writes ~10-15%. Chấp nhận được nếu read-heavy.

Bài 3: Database Architecture Design — Advanced

Đề bài: Thiết kế database architecture cho food delivery platform (như GrabFood):

10M users, 500K daily orders
Real-time rider tracking (GPS updates mỗi 3s)
Menu search across 100K restaurants
Order history với full-text search
Peak: 3x average (lunch/dinner time)

Xác định: database stack, sharding strategy (nếu cần), caching layer, và estimated hardware.

💡 Gợi ý

Tách workloads: transactional vs analytical vs search vs real-time
GPS updates = extreme write throughput
Menu search = full-text search requirement
Order data cần ACID

✅ Lời giải

Database Stack:

Workload	Database	Sizing
Users, Orders, Payments	PostgreSQL (primary)	500K orders/day × 1KB = 500MB/day, manageable single instance
Rider GPS tracking	Redis (Geo commands)	50K active riders × update/3s = 16K writes/s
Menu search	Elasticsearch	100K restaurants × 50 items = 5M documents
Order history search	Elasticsearch (synced from PG)	Async sync via CDC
Session, Rate limiting	Redis	Standard caching layer

Sharding: Chưa cần ở 10M users. PostgreSQL single primary + 2 read replicas đủ cho 500K orders/day.

Caching: Redis cache cho restaurant menus (TTL 5 min), user sessions (TTL 24h). Target 90%+ cache hit rate.

Hardware estimate: PostgreSQL 16-core/64GB, 2 read replicas. Redis 8GB (GPS + cache). Elasticsearch 3-node cluster 16GB each.

Liên kết học tiếp

Từ khóa glossary: SQL, NoSQL, ACID, BASE, B-tree, LSM-tree, WAL, MVCC, Indexing, Sharding, Normalization, Denormalization, CAP Theorem, Connection Pooling, NewSQL, GIN Index

Tìm kiếm liên quan: chọn database, SQL vs NoSQL, indexing strategy, database sharding, B-tree vs LSM-tree, connection pooling, N+1 query

📸 Caselet: Instagram — Lưu trữ và phục vụ 2 tỷ+ bức ảnh

Instagram là một trong những case study kinh điển về database design ở quy mô khổng lồ. Với hơn 2 tỷ bức ảnh được lưu trữ và 100 triệu+ ảnh mới mỗi ngày, việc chọn đúng database cho từng loại dữ liệu không chỉ là bài tập lý thuyết — nó quyết định hệ thống sống hay chết. Điều thú vị là Instagram không dùng một database duy nhất — họ áp dụng polyglot persistence, nghĩa là mỗi loại dữ liệu được phục vụ bởi database phù hợp nhất với access pattern của nó.

User data (profiles, relationships, followers/following) được lưu trong PostgreSQL với horizontal sharding theo user_id. Đây là dữ liệu có cấu trúc rõ ràng, cần ACID transactions (ví dụ: khi user A follow user B, cả hai bảng followers và following phải update atomic). Photo metadata (timestamps, locations, filters applied, captions) lại được lưu trong Cassandra — vì đây là workload write-heavy với time-series access pattern: ảnh mới liên tục được upload, và khi đọc thì thường đọc theo timeline (ảnh mới nhất trước). Cassandra excels ở write throughput và horizontal scaling mà không cần complex sharding logic. Còn ảnh thực tế thì hoàn toàn không nằm trong database nào — chúng được lưu trên Amazon S3 (object storage) và phân phối qua CDN (Content Delivery Network) đến users trên toàn cầu.

Feed generation là bài toán phức tạp nhất. Instagram sử dụng chiến lược fanout-on-write: khi user A đăng ảnh, hệ thống pre-compute feed cho tất cả followers của A và lưu vào Redis. Database (PostgreSQL + Cassandra) đóng vai trò source of truth, còn Redis là serving layer cho tốc độ đọc. Chiến lược này hoạt động tốt cho user bình thường, nhưng với celebrity users (10M-100M followers), fanout-on-write sẽ tạo ra hàng trăm triệu writes cho một post duy nhất — Instagram xử lý bằng hybrid approach: fanout-on-write cho user thường, fanout-on-read cho celebrities.

Sharding strategy của Instagram tiến hóa qua nhiều giai đoạn. Giai đoạn đầu: một PostgreSQL server duy nhất với vertical scaling (thêm RAM, CPU, SSD). Khi traffic tăng: thêm read replicas (1 primary ghi, N replicas đọc). Ở quy mô lớn: horizontal sharding theo user_id — ban đầu dùng range-based sharding (user 1-1M ở shard 1, 1M-2M ở shard 2), nhưng nhanh chóng chuyển sang hash-based sharding vì range-based tạo ra hot spots (shard chứa user mới đăng ký luôn bận hơn). Thách thức lớn nhất: celebrity users với 100M followers tạo ra hot shards — dữ liệu liên quan đến họ (followers list, feed fanout) tập trung vào một shard, gây overload. Instagram xử lý bằng cách tách celebrity data ra dedicated infrastructure riêng.

Replication phục vụ read-heavy workload hiệu quả: mỗi shard có 1 primary cho writes và 12+ read replicas cho reads. Replication lag dao động 50-200ms — chấp nhận được cho timeline browsing (bạn không cần thấy ảnh mới post 100ms trước), nhưng KHÔNG chấp nhận được cho trải nghiệm "ảnh tôi vừa upload có hiển thị chưa?". Giải pháp: read-after-write consistency — trong 5 giây sau khi user thực hiện write, mọi read request của user đó được route đến primary (đảm bảo thấy data mới nhất), sau đó fall back về replicas.

CAP theorem được áp dụng khác nhau cho từng loại dữ liệu tại Instagram. User profiles: chọn CP (Consistency + Partition tolerance) — bạn PHẢI thấy profile edit mới nhất của chính mình ngay lập tức. Feed/timeline: chọn AP (Availability + Partition tolerance) — hiển thị feed hơi cũ vài giây vẫn chấp nhận được, nhưng feed phải luôn load được. Likes/comments count: chọn AP với eventual consistency — số like hiện 10,234 thay vì 10,237 trong vài giây không ai nhận ra. Direct messages: chọn CP — tin nhắn phải đúng thứ tự và delivered exactly once, mất tin nhắn là unacceptable. Câu hỏi suy ngẫm: "Instagram dùng cả PostgreSQL VÀ Cassandra. Tại sao không chọn MỘT database cho tất cả?"

Bài học cho architect

Polyglot persistence — dùng nhiều loại database cho các workload khác nhau — là pattern chuẩn ở quy mô lớn. Mỗi database excels ở một kiểu access pattern. Chi phí vận hành nhiều database được bù lại bằng performance và scalability tối ưu cho từng service.

🏗️ Thực hành: Chọn Replication hay Sharding cho từng kịch bản

Cho mỗi kịch bản dưới đây, hãy quyết định strategy phù hợp nhất.

Các lựa chọn:

📖 Read Replicas — 1 primary + N replicas (read scaling)
🔀 Horizontal Sharding — Chia data theo key (write + read scaling)
📖+🔀 Cả hai — Sharding + replicas per shard
🔄 Multi-primary Replication — Nhiều nodes đều ghi được

Kịch bản:

#	Hệ thống	Đặc điểm workload	Read:Write ratio	Data size	Strategy?
1	Blog platform	Đọc nhiều, viết ít, content ít thay đổi	100:1	50 GB	???
2	E-commerce orders	Ghi nhiều khi flash sale, đọc nhiều sau đó	10:1 (bình thường), 1:5 (flash sale)	500 GB	???
3	IoT sensor data	Write-heavy, time-series, append-only	1:100	10 TB+	???
4	Social media DMs	Read & write đều cao, data per-user	1:1	5 TB	???
5	Analytics dashboard	Complex queries, large scans, batch updates	50:1	2 TB	???
6	Global banking	Strong consistency, multi-region, regulatory	5:1	1 TB	???
7	User profiles	Read-heavy, small updates, millions of users	50:1	200 GB	???
8	Real-time leaderboard	Read-heavy, frequent score updates, ranked	20:1	10 GB	???

📋 Đáp án tham khảo

#	Hệ thống	Strategy	Giải thích
1	Blog	Read Replicas	50GB dễ fit 1 node. Read replicas xử lý read-heavy. Không cần sharding vì data size nhỏ
2	E-commerce orders	Sharding + Read Replicas	Flash sale cần write scaling (sharding by order_id). Bình thường cần read scaling (replicas). Shard by user_id hoặc order_id
3	IoT sensors	Horizontal Sharding	10TB+ không fit 1 node. Write-heavy → cần phân tán writes. Shard by device_id + time range. Read replicas ít hữu ích vì write-dominant
4	Social DMs	Sharding + Read Replicas	Per-user sharding (shard by conversation_id hoặc user_id). Replicas cho read scaling. Cần consistency cho message ordering
5	Analytics	Read Replicas	Complex queries benefit từ dedicated read replicas (không ảnh hưởng primary). 2TB fit 1 node. Hoặc dùng columnar DB (ClickHouse)
6	Global banking	Multi-primary Replication (CockroachDB/Spanner)	Strong consistency cross-region bắt buộc. Sharding có thể dùng nhưng cần distributed transactions. Spanner/CockroachDB handle cả hai
7	User profiles	Sharding (nếu >100M users) hoặc Read Replicas (nếu <10M)	Data per-user nhỏ nhưng millions of users → tổng lớn. Shard by user_id. Read replicas per shard nếu read traffic cao
8	Leaderboard	Read Replicas + Redis	10GB nhỏ → 1 node đủ. Dùng Redis sorted sets cho real-time ranking. DB là source of truth, Redis là read layer

Nguyên tắc quyết định:

Data size > khả năng 1 node? → Sharding bắt buộc
Read-heavy mà data fit 1 node? → Read Replicas đủ
Write-heavy? → Sharding (phân tán writes)
Cả hai cao? → Sharding + Replicas per shard
Cross-region consistency bắt buộc? → Multi-primary (Spanner/CockroachDB)

Database Design — Chọn đúng database cho hệ thống ​

Bức tranh tư duy ​

Cốt lõi kỹ thuật ​

SQL vs NoSQL — Decision Framework ​

ACID vs BASE ​

Indexing Strategies ​

Normalization vs Denormalization ​

CAP Theorem Applied to Databases ​

NewSQL — Best of Both Worlds? ​

Thực chiến ​

Tình huống: Chọn database cho social media platform ​

Tình huống: Index optimization cho query chậm ​

Sai lầm điển hình ​

❌ Sai lầm 1: Dùng MongoDB cho mọi thứ ​

❌ Sai lầm 2: Over-Normalizing ​

❌ Sai lầm 3: Thiếu indexes ​

❌ Sai lầm 4: N+1 Query Problem ​

❌ Sai lầm 5: Không có connection pooling ​

Under the Hood ​

B-tree vs LSM-tree Storage Engines ​

WAL (Write-Ahead Log) ​

MVCC (Multi-Version Concurrency Control) ​

Connection Pooling Math ​

Trade-offs tổng hợp ​

Checklist ghi nhớ ​

Bài tập luyện tập ​

Bài 1: Database Selection — Foundation ​

Bài 2: Index Design — Intermediate ​

Bài 3: Database Architecture Design — Advanced ​

Liên kết học tiếp ​

📸 Caselet: Instagram — Lưu trữ và phục vụ 2 tỷ+ bức ảnh ​

🏗️ Thực hành: Chọn Replication hay Sharding cho từng kịch bản ​