🐦 DESIGN TWITTER
Timeline & Fan-out Patterns at Scale
🎓 Professor Tom
Twitter is the classic Fan-out problem: how do you deliver a single tweet to millions of followers in real time? This is where you learn the difference between the Push and Pull models.
📊 Back-of-Envelope Calculations
Scale Assumptions
| Metric | Value | Rationale |
|---|---|---|
| Monthly Active Users (MAU) | 500M | Global social platform |
| Daily Active Users (DAU) | 300M | ~60% of MAU |
| Tweets per user per day | 2 | Average, including retweets |
| Timeline reads per user per day | 10 | Users check feed multiple times |
| Average followers per user | 200 | Power law distribution |
| Celebrity threshold | 10K+ followers | Requires special handling |
| Tweet size (avg) | 1 KB | 280 chars + metadata + media refs |
QPS Calculations
Tweet Write Operations:
Daily tweets = 300M DAU x 2 tweets/user = 600M tweets/day
Average Write QPS = 600M / 86,400 seconds = ~7,000 QPS
Peak Write QPS = 7,000 x 3 = ~21,000 QPS
Timeline Read Operations:
Daily reads = 300M DAU x 10 reads/user = 3B reads/day
Average Read QPS = 3B / 86,400 = ~35,000 QPS
Peak Read QPS = 35,000 x 3 = ~100,000 QPS
Read:Write Ratio = 35,000 : 7,000 = 5:1
Calculation Breakdown
- 86,400 = seconds in a day (24 x 60 x 60)
- Peak multiplier = 3x, the industry standard for social apps
- Read-heavy workload -> optimize for reads (caching, fan-out on write)
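These QPS figures can be checked with a few lines of Python (all values are the section's own assumptions):

```python
# Back-of-envelope QPS estimates for the assumed 300M-DAU scale.
DAU = 300_000_000
TWEETS_PER_USER_PER_DAY = 2
READS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
PEAK_MULTIPLIER = 3             # industry rule of thumb for social apps

write_qps = DAU * TWEETS_PER_USER_PER_DAY / SECONDS_PER_DAY
read_qps = DAU * READS_PER_USER_PER_DAY / SECONDS_PER_DAY

print(f"Avg write QPS:  {write_qps:,.0f}")                     # 6,944  (~7,000)
print(f"Peak write QPS: {write_qps * PEAK_MULTIPLIER:,.0f}")   # 20,833 (~21,000)
print(f"Avg read QPS:   {read_qps:,.0f}")                      # 34,722 (~35,000)
print(f"Peak read QPS:  {read_qps * PEAK_MULTIPLIER:,.0f}")    # 104,167 (~100,000)
print(f"Read:write ratio: {read_qps / write_qps:.0f}:1")       # 5:1
```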
Storage Calculations
Tweet Storage:
Daily tweet storage = 600M tweets x 1 KB = 600 GB/day
Monthly tweet storage = 600 GB x 30 = 18 TB/month
Yearly tweet storage = 18 TB x 12 = 216 TB/year
With 3x replication = 216 TB x 3 = 648 TB/year
Media Storage (images, videos):
Assume 20% tweets have media, avg 500 KB per media
Daily media = 600M x 0.2 x 500 KB = 60 TB/day
Monthly media = 60 TB x 30 = 1.8 PB/month
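The storage arithmetic above can likewise be scripted, using the same assumptions (1 KB per tweet, 20% of tweets carrying ~500 KB of media):

```python
# Storage growth estimates derived from the daily tweet volume.
DAILY_TWEETS = 600_000_000
TWEET_SIZE_KB = 1
MEDIA_RATIO = 0.2        # share of tweets with media
MEDIA_SIZE_KB = 500
REPLICATION = 3

daily_tweet_gb = DAILY_TWEETS * TWEET_SIZE_KB / 1_000_000    # KB -> GB
yearly_tweet_tb = daily_tweet_gb * 30 * 12 / 1_000           # GB -> TB
daily_media_tb = DAILY_TWEETS * MEDIA_RATIO * MEDIA_SIZE_KB / 1_000_000_000

print(f"Tweets: {daily_tweet_gb:.0f} GB/day, {yearly_tweet_tb:.0f} TB/year "
      f"({yearly_tweet_tb * REPLICATION:.0f} TB/year with 3x replication)")
print(f"Media:  {daily_media_tb:.0f} TB/day, "
      f"{daily_media_tb * 30 / 1_000:.1f} PB/month")
```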
Timeline Cache (Redis):
Cache last 800 tweets per user (tweet IDs only)
Per user cache = 800 x 8 bytes (tweet_id) = 6.4 KB
Total cache = 300M users x 6.4 KB = ~2 TB
Bandwidth Calculations
Outbound Bandwidth (Timeline reads):
Avg timeline response = 20 tweets x 1 KB = 20 KB
Peak bandwidth = 100,000 QPS x 20 KB = 2 GB/s = 16 Gbps
Inbound Bandwidth (Tweet posts):
Peak inbound = 21,000 QPS x 1 KB = 21 MB/s = 168 Mbps
🏗️ High-Level Architecture
Component Responsibilities
| Component | Responsibility | Technology |
|---|---|---|
| API Gateway | Rate limiting, authentication, routing | Kong/nginx |
| Tweet Service | Create, delete, like tweets | Go/Java microservice |
| Timeline Service | Fetch user's home timeline | Go microservice |
| User Service | User profiles, follow/unfollow | Go microservice |
| Fan-out Workers | Distribute tweets to follower timelines | Kafka consumers |
| Timeline Cache | Pre-computed timelines per user | Redis Sorted Sets |
| Tweet DB | Persistent tweet storage | PostgreSQL (sharded by tweet_id) |
| Social Graph | Follower/following relationships | Cassandra |
🔧 Raizo's Note
Why shard by tweet_id instead of user_id?
If we sharded by user_id, celebrity accounts (Elon Musk, Taylor Swift) would create hot spots: a single shard would hold far too much data. Sharding by tweet_id distributes load more evenly, because tweets are created continuously by many users.
🔄 Core Flows
Flow 1: Tweet Posting (Write Path)
Flow 2: Timeline Retrieval (Read Path)
🎓 Professor Tom
Cursor-based pagination is the best practice for infinite scroll:
- Don't use `OFFSET`: its performance degrades badly on large datasets
- Use `tweet_id` or `timestamp` as the cursor
- Query: `WHERE tweet_id < :cursor ORDER BY tweet_id DESC LIMIT 20`
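The cursor query above can be demonstrated end to end with SQLite (an illustrative stand-in for the sharded tweet store; table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [(i, f"tweet {i}") for i in range(1, 101)],  # 100 tweets, ids 1..100
)

def timeline_page(cursor_id=None, limit=20):
    """Return the next page of tweet_ids strictly older than the cursor."""
    if cursor_id is None:
        rows = conn.execute(
            "SELECT tweet_id FROM tweets ORDER BY tweet_id DESC LIMIT ?",
            (limit,),
        )
    else:
        rows = conn.execute(
            "SELECT tweet_id FROM tweets WHERE tweet_id < ? "
            "ORDER BY tweet_id DESC LIMIT ?",
            (cursor_id, limit),
        )
    return [r[0] for r in rows]

page1 = timeline_page()            # ids 100..81
page2 = timeline_page(page1[-1])   # ids 80..61 -- no overlap, no OFFSET scan
```

Because each page is keyed off the last id seen, the database never has to skip rows the way `OFFSET` does, so page N is as cheap as page 1.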
💡 Deep Dive: Fan-out Strategies
The Core Problem
When user A posts a tweet, how do A's 10 million followers see that tweet in their own timelines?
There are two main approaches: the Pull model and the Push model.
Strategy 1: Pull Model (Fan-out on Read)
┌─────────────────────────────────────────────────────────────┐
│ PULL MODEL │
│ (Fan-out on Read) │
├─────────────────────────────────────────────────────────────┤
│ │
│ User A posts tweet │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Tweet DB │ ← Store tweet once │
│ └─────────────┘ │
│ │
│ Later, when Follower B opens timeline: │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ 1. Get list of users B follows │ │
│ │ 2. Query tweets from ALL followed users │ │
│ │ 3. Merge & sort by timestamp │ │
│ │ 4. Return top N tweets │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
| Pros | Cons |
|---|---|
| ✅ Write is fast (O(1)) | ❌ Read is slow (O(followings)) |
| ✅ No wasted work for inactive users | ❌ High latency at read time |
| ✅ Simple implementation | ❌ Heavy DB load on reads |
| ✅ Always fresh data | ❌ Doesn't scale for users following 1000+ accounts |
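Steps 2-3 of the pull model are essentially a k-way merge of per-author tweet lists. A toy version over in-memory data (hypothetical authors and timestamps, each list already newest-first):

```python
import heapq

# Hypothetical per-author tweet lists, sorted newest-first.
# Each entry is (timestamp, tweet_id).
tweets_by_author = {
    "alice": [(105, "a3"), (101, "a2"), (90, "a1")],
    "bob":   [(104, "b2"), (99, "b1")],
    "carol": [(103, "c1")],
}

def pull_timeline(followed, n=4):
    """Merge the followed users' tweet lists and return the newest n ids."""
    merged = heapq.merge(
        *(tweets_by_author.get(u, []) for u in followed),
        key=lambda t: t[0],
        reverse=True,  # inputs and output are both newest-first
    )
    return [tweet_id for _, tweet_id in list(merged)[:n]]

print(pull_timeline(["alice", "bob", "carol"]))  # ['a3', 'b2', 'c1', 'a2']
```

The merge itself is cheap; the real cost at scale is step 2, fetching every followed user's recent tweets from the database on every read.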
Strategy 2: Push Model (Fan-out on Write)
┌─────────────────────────────────────────────────────────────┐
│ PUSH MODEL │
│ (Fan-out on Write) │
├─────────────────────────────────────────────────────────────┤
│ │
│ User A posts tweet (has 1000 followers) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Tweet DB │ ← Store tweet │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Fan-out Workers │ │
│ │ Push tweet_id to 1000 follower caches │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────┬─────────┬─────────┬─────────┐ │
│ │Timeline │Timeline │Timeline │ ... │ │
│ │ User 1 │ User 2 │ User 3 │ │ │
│ └─────────┴─────────┴─────────┴─────────┘ │
│ │
│ When Follower B opens timeline: │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Just read from B's pre-computed cache │ ← O(1) │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
| Pros | Cons |
|---|---|
| ✅ Read is instant (O(1)) | ❌ Write is slow (O(followers)) |
| ✅ Low latency for users | ❌ Celebrity problem (100M followers = 100M writes) |
| ✅ Predictable read performance | ❌ Wasted work for inactive followers |
| ✅ Can pre-compute rankings | ❌ High storage for timeline caches |
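The O(followers) write cost is easy to see in a toy fan-out, with plain dicts standing in for the per-follower timeline caches (a sketch only; the Redis implementation appears later in this section):

```python
# In-memory stand-in for per-follower timeline caches.
timelines = {}   # follower_id -> list of (timestamp, tweet_id), newest first

def fan_out(tweet_id, timestamp, followers, cap=800):
    """Push one tweet into every follower's timeline; returns the write count."""
    for follower in followers:
        timeline = timelines.setdefault(follower, [])
        timeline.insert(0, (timestamp, tweet_id))
        del timeline[cap:]            # keep only the newest `cap` entries
    return len(followers)             # O(followers) writes per tweet

writes = fan_out("t1", 100, [f"user{i}" for i in range(1000)])
# 1,000 followers -> 1,000 timeline writes for a single tweet;
# at 100M followers the same post would require 100M writes.
```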
The Celebrity Problem (Justin Bieber Problem)
⚠️ The Problem
When Justin Bieber (100M followers) tweets, the Push model has to perform 100 million writes. At 7,000 write QPS, the fan-out takes about 4 hours to finish!
100,000,000 followers / 7,000 QPS = 14,285 seconds = ~4 hours
Strategy 3: Hybrid Solution (Twitter's Actual Approach)
How Hybrid Works:
Normal users (< 10K followers): Use Push model
- Tweet is fanned out to all follower timeline caches
- Fast read experience for followers
Celebrities (>= 10K followers): Use Pull model
- Tweet is NOT fanned out
- Stored in a separate "celebrity tweets" list
- Merged at read time when follower opens timeline
At read time:
- Fetch pre-computed timeline from cache (pushed tweets)
- Fetch recent tweets from followed celebrities (pulled tweets)
- Merge and sort by timestamp
- Return unified timeline
🎓 Professor Tom
Threshold tuning: 10K is only a starting point. In production, Twitter weighs several factors:
- Follower count
- Tweet frequency
- Follower activity rate
- Time of day
Some accounts can switch between Push and Pull dynamically.
Implementation: Redis Timeline Cache
```python
# Assumes a configured redis-py client `redis` (decode_responses=True),
# plus get_followers / get_celebrity_followings helpers backed by the
# social graph store.
CELEBRITY_THRESHOLD = 10_000

# Push model: add a tweet to each follower's timeline
def fan_out_tweet(tweet_id: str, author_id: str, timestamp: int):
    followers = get_followers(author_id)

    if len(followers) >= CELEBRITY_THRESHOLD:
        # Celebrity: skip fan-out, add to celebrity list
        redis.zadd(f"celebrity_tweets:{author_id}", {tweet_id: timestamp})
        return

    # Normal user: fan-out to all followers in one pipelined round trip
    pipeline = redis.pipeline()
    for follower_id in followers:
        pipeline.zadd(f"timeline:{follower_id}", {tweet_id: timestamp})
        # Keep only the last 800 tweets
        pipeline.zremrangebyrank(f"timeline:{follower_id}", 0, -801)
    pipeline.execute()

# Read timeline with hybrid merge
def get_timeline(user_id: str, count: int = 20) -> list:
    # 1. Get pushed tweets from cache
    pushed_tweets = redis.zrevrange(
        f"timeline:{user_id}", 0, count - 1, withscores=True
    )

    # 2. Get celebrity tweets (pull model)
    celebrity_tweets = []
    for celeb_id in get_celebrity_followings(user_id):
        celebrity_tweets.extend(redis.zrevrange(
            f"celebrity_tweets:{celeb_id}", 0, 10, withscores=True
        ))

    # 3. Merge and sort by timestamp (the sorted-set score)
    all_tweets = pushed_tweets + celebrity_tweets
    all_tweets.sort(key=lambda x: x[1], reverse=True)
    return all_tweets[:count]
```
🔧 Raizo's Note
Production gotcha: Redis ZADD across millions of keys needs careful memory management:
- Set `maxmemory-policy volatile-lru` to auto-evict old timelines (timeline keys need TTLs set for this policy to apply)
- Monitor memory usage with `INFO memory`
- Consider Redis Cluster for horizontal scaling
⚖️ Trade-offs Analysis
Architecture Decision Matrix
| Decision | Option A | Option B | Chosen | Rationale |
|---|---|---|---|---|
| Feed Generation | Push (Fan-out on Write) | Pull (Fan-out on Read) | Hybrid | Balance write cost vs read latency |
| Tweet Storage | Single PostgreSQL | Sharded PostgreSQL | Sharded by tweet_id | Avoid hot spots from celebrities |
| Timeline Cache | Per-user Redis keys | Shared cache | Per-user Sorted Sets | O(1) read, easy pagination |
| ID Generation | Auto-increment | UUID | Snowflake ID | Time-sortable, distributed, no coordination |
| Social Graph | PostgreSQL | Cassandra | Cassandra | Write-heavy (follows), denormalized queries |
| Message Queue | RabbitMQ | Kafka | Kafka | High throughput, replay capability |
Snowflake ID Deep Dive
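A minimal generator matching the 64-bit layout diagrammed below (the epoch constant and worker-id handling here are assumptions; real deployments choose their own epoch):

```python
import threading
import time

class SnowflakeGenerator:
    """64-bit IDs: 41-bit ms timestamp | 10-bit machine id | 12-bit sequence."""
    EPOCH = 1288834974657  # Twitter's snowflake epoch (2010-11-04), in ms

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024        # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit counter
                if self.sequence == 0:
                    # 4096 ids issued this millisecond: spin until the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # time-sortable: later IDs are numerically larger
```

Because the timestamp occupies the high bits, `ORDER BY id` is equivalent to `ORDER BY time`, which is exactly what the cursor pagination earlier relies on.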
┌─────────────────────────────────────────────────────────────┐
│ SNOWFLAKE ID (64 bits) │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┬──────────────┬────────────┬─────────────────┐ │
│ │ Sign(1) │ Timestamp(41)│ Machine(10)│ Sequence(12) │ │
│ │ 0 │ ms since │ Worker ID │ Counter │ │
│ │ │ epoch │ (1024) │ (4096/ms) │ │
│ └─────────┴──────────────┴────────────┴─────────────────┘ │
│ │
│ Benefits: │
│ • Time-sortable: ORDER BY id = ORDER BY time │
│ • Distributed: No central coordinator needed │
│ • High throughput: 4096 IDs per ms per worker │
│ • Compact: 64-bit integer fits in index efficiently │
│ │
└─────────────────────────────────────────────────────────────┘
🚨 Failure Scenarios & Mitigations
Scenario 1: Redis Timeline Cache Failure
| Aspect | Details |
|---|---|
| Impact | Users see stale or empty timelines |
| Detection | Redis health checks, latency spikes |
| Mitigation | Redis Cluster with 3 replicas per shard |
| Fallback | Switch to Pull model from DB |
| Recovery | Rebuild cache from DB on restart |
🔧 Raizo's Note
Circuit Breaker Pattern: when Redis fails, don't let requests pile up. Implement a circuit breaker to fail fast and fall back to the DB. Hystrix or resilience4j are good choices.
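A bare-bones sketch of the pattern (hypothetical class and thresholds; in production you would use resilience4j or similar rather than rolling your own):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated errors; retry the primary after a cooldown."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, primary, fallback):
        # Open state: skip the failing dependency until the cooldown expires
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.failures = 0  # half-open: allow one trial call through
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            return fallback()

breaker = CircuitBreaker(max_failures=2)

def redis_read():      # stand-in for the Redis timeline fetch
    raise ConnectionError("redis down")

def db_read():         # stand-in for the pull-model DB fallback
    return ["tweet_from_db"]

for _ in range(5):
    timeline = breaker.call(redis_read, db_read)
```

After two consecutive failures the breaker opens, so the remaining calls skip Redis entirely instead of waiting on a dead connection.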
Scenario 2: Celebrity Tweet Storm
| Aspect | Details |
|---|---|
| Impact | Fan-out queue backs up, delays for all users |
| Example | Elon Musk tweets during Super Bowl |
| Detection | Kafka consumer lag monitoring |
| Mitigation | Separate queue for celebrity tweets |
| Solution | Pull model for celebrity followers |
Scenario 3: Database Shard Failure
| Aspect | Details |
|---|---|
| Impact | Tweets on that shard unavailable |
| Detection | DB health checks, connection errors |
| Mitigation | Synchronous replication to standby |
| Failover | Automatic promotion (< 30 seconds) |
| Data Loss | Near-zero with sync replication |
Scenario 4: Kafka Cluster Failure
| Aspect | Details |
|---|---|
| Impact | Fan-out stops, new tweets not distributed |
| Detection | Producer errors, consumer lag |
| Mitigation | Multi-AZ Kafka deployment |
| Fallback | Write directly to cache (degraded mode) |
| Recovery | Replay from Kafka offset on recovery |
🔧 Raizo's Note
Idempotency is critical: fan-out workers MUST be idempotent, because Kafka can deliver messages multiple times (at-least-once). Use tweet_id as the dedup key in the Redis ZADD: re-adding the same member simply overwrites its score, so no duplicate entry is created.
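Sorted-set semantics give this dedup for free; a plain dict models the behavior (member as key, score as value):

```python
# A Redis sorted set keyed by tweet_id behaves like a dict: re-adding the
# same member just overwrites its score, so a redelivered Kafka message
# cannot create a duplicate timeline entry.
timeline = {}   # tweet_id -> timestamp (member -> score)

def zadd(tweet_id, timestamp):
    timeline[tweet_id] = timestamp

zadd("tweet_42", 1700000000)
zadd("tweet_42", 1700000000)   # at-least-once redelivery: a harmless no-op

assert len(timeline) == 1      # still a single entry
```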
💰 Cost Estimation
Monthly Infrastructure Costs (300M DAU Scale)
| Service | Specification | Unit Cost | Monthly Cost |
|---|---|---|---|
| API Servers | 200 × c5.2xlarge (8 vCPU, 16GB) | $0.34/hr | $49,000 |
| Fan-out Workers | 50 × c5.xlarge (4 vCPU, 8GB) | $0.17/hr | $6,100 |
| Timeline Cache | Redis Cluster 500GB (3 replicas) | $0.068/GB/hr | $15,000 |
| Tweet Cache | Redis 100GB | $0.068/GB/hr | $5,000 |
| Tweet DB | PostgreSQL 20TB (sharded, 3 replicas) | $0.115/GB/mo | $8,000 |
| Social Graph | Cassandra 50TB (3 nodes) | $0.50/GB/mo | $25,000 |
| Kafka Cluster | 10 brokers (m5.2xlarge) | $0.384/hr | $12,000 |
| CDN | 500TB egress/month | $0.08/GB | $40,000 |
| Load Balancers | 2 × ALB | $0.025/hr + LCU | $2,000 |
| Media Storage | S3 1.8PB | $0.023/GB/mo | $42,000 |
Cost Summary
| Category | Monthly Cost | % of Total |
|---|---|---|
| Compute (API + Workers) | $55,100 | 27% |
| Caching (Redis) | $20,000 | 10% |
| Database (PostgreSQL + Cassandra) | $33,000 | 16% |
| Message Queue (Kafka) | $12,000 | 6% |
| CDN & Networking | $42,000 | 21% |
| Storage (S3) | $42,000 | 21% |
| Total | $204,100 | 100% |
🎓 Professor Tom
Cost Optimization Strategies:
- Reserved Instances: a 1-year commitment cuts compute cost by 30-40%
- Spot Instances: fan-out workers can run on spot (up to 70% cheaper)
- S3 Intelligent-Tiering: auto-moves old media to cheaper storage
- CDN caching: a higher cache hit ratio means lower origin egress
🔧 Raizo's Note
Hidden costs to watch:
- Data transfer between AZs: $0.01/GB adds up fast
- Redis memory overhead: actual usage is ~1.5x the data size
- Kafka retention: 7 days of retention means storing 7x the daily volume
- Monitoring & logging: CloudWatch/Datadog can run $10K+/month
Cost per User Metrics
Monthly cost: $204,100
DAU: 300,000,000
Cost per DAU per month = $204,100 / 300M = ~$0.00068
Cost per DAU per year = $0.00068 × 12 = ~$0.0082
Revenue per user (ads): ~$5-10/year
Gross margin: Very healthy!
🎯 Interview Checklist
Must-Mention Items ✅
| Topic | Key Points |
|---|---|
| Scale Estimation | 300M DAU, 600M tweets/day, 35K read QPS |
| Fan-out Strategy | Push vs Pull trade-offs, Hybrid solution |
| Celebrity Problem | Why 10K+ followers need special handling |
| Data Storage | Sharding strategy (by tweet_id, not user_id) |
| Caching | Redis Sorted Sets for timeline, cache invalidation |
| ID Generation | Snowflake IDs (time-sortable, distributed) |
Bonus Points 🌟
- Ranking Algorithm: Mention relevance scoring beyond chronological
- Real-time Updates: WebSocket/SSE for live timeline updates
- Search Architecture: Elasticsearch for full-text search
- Trending Topics: Sliding window counters, decay functions
- Anti-spam: Rate limiting, content moderation ML models
- Geo-distribution: Multi-region deployment, data locality
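As an example of the trending-topics bullet above, a hashtag counter over a sliding window might look like this (a toy version with fixed one-minute buckets; names are hypothetical):

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Count hashtag mentions over the last `window_minutes` minute-buckets."""
    def __init__(self, window_minutes=60):
        self.window = window_minutes
        self.buckets = deque()   # (minute, {tag: count}), oldest first

    def record(self, tag, minute):
        if not self.buckets or self.buckets[-1][0] != minute:
            self.buckets.append((minute, defaultdict(int)))
        self.buckets[-1][1][tag] += 1
        # Drop buckets that have fallen out of the window
        while self.buckets[0][0] <= minute - self.window:
            self.buckets.popleft()

    def count(self, tag):
        return sum(counts[tag] for _, counts in self.buckets)

c = SlidingWindowCounter(window_minutes=60)
c.record("#superbowl", minute=0)
c.record("#superbowl", minute=30)
c.record("#superbowl", minute=61)   # the minute-0 bucket has now expired
print(c.count("#superbowl"))        # 2
```

Production systems add a decay function on top so that a burst of recent mentions outranks a steady trickle of older ones.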
Common Mistakes ❌
| Mistake | Why It's Wrong | Better Approach |
|---|---|---|
| Single database | Won't scale past 10M users | Shard from day 1 |
| Pull-only model | High latency for users | Hybrid Push/Pull |
| Push to all followers | Celebrity problem | Threshold-based hybrid |
| Shard by user_id | Hot spots from celebrities | Shard by tweet_id |
| No caching | DB can't handle read QPS | Redis timeline cache |
| Sync fan-out | Blocks tweet posting | Async via Kafka |
⚠️ Interview Red Flags
- Not mentioning scale numbers (QPS, storage)
- Not knowing Fan-out on Write vs Fan-out on Read
- Not addressing the Celebrity problem
- Designing in a single point of failure
- Having no caching strategy
🎓 Key Takeaways
- Hybrid fan-out is the real-world solution; there is no silver bullet
- The read:write ratio drives the architecture (Twitter is 5:1 read-heavy)
- Sharding strategy must consider data distribution, not just application logic
- Snowflake IDs solve distributed ID generation elegantly
- Async processing (Kafka) decouples the write path from fan-out
- Cache everything: Redis Sorted Sets are a perfect fit for timelines
🔗 Navigation
Related Case Studies
| Case Study | Key Learning | Link |
|---|---|---|
| YouTube | Video processing, CDN, adaptive streaming | Design YouTube → |
| Uber | Real-time location, geospatial indexing | Design Uber → |
| WhatsApp | Messaging, E2EE, connection management | Design WhatsApp → |