🐦 DESIGN TWITTER

Timeline & Fan-out Patterns at Scale

🎓 Professor Tom

Twitter is the classic fan-out problem: how do you deliver one tweet to millions of followers in real time? This is where you learn the difference between the Push and Pull models.

📊 Back-of-Envelope Calculations

Scale Assumptions

| Metric | Value | Rationale |
|---|---|---|
| Monthly Active Users (MAU) | 500M | Global social platform |
| Daily Active Users (DAU) | 300M | ~60% of MAU |
| Tweets per user per day | 2 | Average, including retweets |
| Timeline reads per user per day | 10 | Users check feed multiple times |
| Average followers per user | 200 | Power law distribution |
| Celebrity threshold | 10K+ followers | Requires special handling |
| Tweet size (avg) | 1 KB | 280 chars + metadata + media refs |

QPS Calculations

Tweet Write Operations:
Daily tweets = 300M DAU x 2 tweets/user = 600M tweets/day

Average Write QPS = 600M / 86,400 seconds = ~7,000 QPS
Peak Write QPS = 7,000 x 3 = ~21,000 QPS

Timeline Read Operations:
Daily reads = 300M DAU x 10 reads/user = 3B reads/day

Average Read QPS = 3B / 86,400 = ~35,000 QPS
Peak Read QPS = 35,000 x 3 = ~100,000 QPS

Read:Write Ratio = 35,000 : 7,000 = 5:1
Calculation Breakdown
  • 86,400 = seconds in a day (24 x 60 x 60)
  • Peak multiplier = 3x = industry standard for social apps
  • Read-heavy workload -> optimize for reads (caching, fan-out on write)
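The arithmetic above can be double-checked with a few lines of Python, using only the assumptions from the scale table:

```python
# Sanity-check the QPS estimates from the assumptions above.
DAU = 300_000_000
TWEETS_PER_USER_PER_DAY = 2
READS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
PEAK_MULTIPLIER = 3              # industry rule of thumb for social apps

daily_tweets = DAU * TWEETS_PER_USER_PER_DAY   # 600M tweets/day
write_qps = daily_tweets / SECONDS_PER_DAY     # ~7,000 QPS
daily_reads = DAU * READS_PER_USER_PER_DAY     # 3B reads/day
read_qps = daily_reads / SECONDS_PER_DAY       # ~35,000 QPS

print(f"avg write QPS ~{write_qps:,.0f}, peak ~{write_qps * PEAK_MULTIPLIER:,.0f}")
print(f"avg read  QPS ~{read_qps:,.0f}, peak ~{read_qps * PEAK_MULTIPLIER:,.0f}")
print(f"read:write ratio = {read_qps / write_qps:.0f}:1")
```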

Storage Calculations

Tweet Storage:
Daily tweet storage = 600M tweets x 1 KB = 600 GB/day
Monthly tweet storage = 600 GB x 30 = 18 TB/month
Yearly tweet storage = 18 TB x 12 = 216 TB/year

With 3x replication = 216 TB x 3 = 648 TB/year

Media Storage (images, videos):
Assume 20% tweets have media, avg 500 KB per media
Daily media = 600M x 0.2 x 500 KB = 60 TB/day
Monthly media = 60 TB x 30 = 1.8 PB/month

Timeline Cache (Redis):
Cache last 800 tweets per user (tweet IDs only)
Per user cache = 800 x 8 bytes (tweet_id) = 6.4 KB
Total cache = 300M users x 6.4 KB = ~2 TB

Bandwidth Calculations

Outbound Bandwidth (Timeline reads):
Avg timeline response = 20 tweets x 1 KB = 20 KB
Peak bandwidth = 100,000 QPS x 20 KB = 2 GB/s = 16 Gbps

Inbound Bandwidth (Tweet posts):
Peak inbound = 21,000 QPS x 1 KB = 21 MB/s = 168 Mbps
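The storage and bandwidth figures follow the same pattern; a quick sanity script (mirroring the 30-day months and decimal units used above):

```python
# Storage and bandwidth sanity check (30-day months, decimal units).
DAILY_TWEETS = 600_000_000
TWEET_KB = 1
MEDIA_RATIO = 0.2    # 20% of tweets carry media
MEDIA_KB = 500

daily_tweet_gb = DAILY_TWEETS * TWEET_KB / 1_000_000                    # 600 GB/day
yearly_tweet_tb = daily_tweet_gb * 30 * 12 / 1_000                      # 216 TB/year
replicated_tb = yearly_tweet_tb * 3                                     # 648 TB with 3x replication
daily_media_tb = DAILY_TWEETS * MEDIA_RATIO * MEDIA_KB / 1_000_000_000  # 60 TB/day

PEAK_READ_QPS = 100_000
RESPONSE_KB = 20 * TWEET_KB                                   # 20 tweets per page
peak_out_gbps = PEAK_READ_QPS * RESPONSE_KB * 8 / 1_000_000   # ~16 Gbps
```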

🏗️ High-Level Architecture

Component Responsibilities

| Component | Responsibility | Technology |
|---|---|---|
| API Gateway | Rate limiting, authentication, routing | Kong/nginx |
| Tweet Service | Create, delete, like tweets | Go/Java microservice |
| Timeline Service | Fetch user's home timeline | Go microservice |
| User Service | User profiles, follow/unfollow | Go microservice |
| Fan-out Workers | Distribute tweets to follower timelines | Kafka consumers |
| Timeline Cache | Pre-computed timelines per user | Redis Sorted Sets |
| Tweet DB | Persistent tweet storage | PostgreSQL (sharded by tweet_id) |
| Social Graph | Follower/following relationships | Cassandra |

🔧 Raizo's Note

Why shard by tweet_id instead of user_id?

If you shard by user_id, celebrity accounts (Elon Musk, Taylor Swift) create hot spots: one shard holds far too much data. Sharding by tweet_id distributes load more evenly, because tweets are created continuously by many different users.

🔄 Core Flows

Flow 1: Tweet Posting (Write Path)

Flow 2: Timeline Retrieval (Read Path)

🎓 Professor Tom

Cursor-based pagination is the best practice for infinite scroll:

  • Avoid OFFSET — its performance degrades badly on large datasets
  • Use tweet_id or a timestamp as the cursor
  • Query: WHERE tweet_id < :cursor ORDER BY tweet_id DESC LIMIT 20
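A minimal sketch of that query pattern, using an in-memory SQLite table as a stand-in for the sharded tweet store:

```python
import sqlite3

# In-memory SQLite stands in for the sharded tweet DB; tweet_id is a
# time-sortable Snowflake-style ID, so ordering by id orders by time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [(i, f"tweet {i}") for i in range(1, 101)],  # 100 sample tweets
)

def get_page(cursor, limit=20):
    """Fetch the next page of tweets strictly older than the cursor."""
    if cursor is None:
        cursor = 2**63 - 1  # first page: start from the newest tweet
    return conn.execute(
        "SELECT tweet_id, body FROM tweets"
        " WHERE tweet_id < ? ORDER BY tweet_id DESC LIMIT ?",
        (cursor, limit),
    ).fetchall()

page1 = get_page(None)            # tweets 100 down to 81
page2 = get_page(page1[-1][0])    # tweets 80 down to 61, no OFFSET needed
```

Because each page is keyed off the last tweet_id seen, the cost per page stays constant no matter how deep the user scrolls.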

💡 Deep Dive: Fan-out Strategies

The Core Problem

When user A posts a tweet, how do A's 10 million followers see that tweet in their timelines?

There are two main approaches: the Pull Model and the Push Model.

Strategy 1: Pull Model (Fan-out on Read)

┌─────────────────────────────────────────────────────────────┐
│                    PULL MODEL                                │
│                 (Fan-out on Read)                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  User A posts tweet                                         │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────┐                                           │
│  │ Tweet DB    │  ← Store tweet once                       │
│  └─────────────┘                                           │
│                                                             │
│  Later, when Follower B opens timeline:                    │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────────────────────────────────┐               │
│  │ 1. Get list of users B follows          │               │
│  │ 2. Query tweets from ALL followed users │               │
│  │ 3. Merge & sort by timestamp            │               │
│  │ 4. Return top N tweets                  │               │
│  └─────────────────────────────────────────┘               │
│                                                             │
└─────────────────────────────────────────────────────────────┘
| Pros | Cons |
|---|---|
| ✅ Write is fast (O(1)) | ❌ Read is slow (O(followings)) |
| ✅ No wasted work for inactive users | ❌ High latency at read time |
| ✅ Simple implementation | ❌ Heavy DB load on reads |
| ✅ Always fresh data | ❌ Doesn't scale for users following 1000+ accounts |
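The read-time steps 2–4 above are essentially a k-way merge. A sketch with in-memory lists standing in for the per-author tweet queries:

```python
import heapq
from itertools import islice

# Each followed user's tweets, newest first: (timestamp, tweet_id).
# In production these lists would come from per-author DB queries.
tweets_by_user = {
    "alice": [(105, "a3"), (102, "a2"), (101, "a1")],
    "bob":   [(104, "b2"), (100, "b1")],
    "carol": [(103, "c1")],
}

def pull_timeline(followings, n=20):
    """k-way merge the newest-first lists and return the top n tweets."""
    merged = heapq.merge(
        *(tweets_by_user.get(u, []) for u in followings),
        key=lambda t: t[0],
        reverse=True,   # inputs are descending, so the merge is too
    )
    return list(islice(merged, n))

timeline = pull_timeline(["alice", "bob", "carol"], n=3)
# -> [(105, "a3"), (104, "b2"), (103, "c1")]
```

The merge itself is cheap; the cost that kills the Pull model is issuing one query per followed account on every timeline load.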

Strategy 2: Push Model (Fan-out on Write)

┌─────────────────────────────────────────────────────────────┐
│                    PUSH MODEL                                │
│                 (Fan-out on Write)                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  User A posts tweet (has 1000 followers)                   │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────┐                                           │
│  │ Tweet DB    │  ← Store tweet                            │
│  └──────┬──────┘                                           │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────────────────────────────────┐               │
│  │         Fan-out Workers                  │               │
│  │  Push tweet_id to 1000 follower caches  │               │
│  └─────────────────────────────────────────┘               │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────┬─────────┬─────────┬─────────┐                │
│  │Timeline │Timeline │Timeline │  ...    │                │
│  │ User 1  │ User 2  │ User 3  │         │                │
│  └─────────┴─────────┴─────────┴─────────┘                │
│                                                             │
│  When Follower B opens timeline:                           │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────────────────────────────────┐               │
│  │ Just read from B's pre-computed cache   │  ← O(1)      │
│  └─────────────────────────────────────────┘               │
│                                                             │
└─────────────────────────────────────────────────────────────┘
| Pros | Cons |
|---|---|
| ✅ Read is instant (O(1)) | ❌ Write is slow (O(followers)) |
| ✅ Low latency for users | ❌ Celebrity problem (100M followers = 100M writes) |
| ✅ Predictable read performance | ❌ Wasted work for inactive followers |
| ✅ Can pre-compute rankings | ❌ High storage for timeline caches |

The Celebrity Problem (Justin Bieber Problem)

⚠️ The Problem

When Justin Bieber (100M followers) tweets, the Push model has to perform 100 million writes. At 7,000 write QPS, the fan-out takes about 4 hours!

100,000,000 followers / 7,000 QPS = 14,285 seconds = ~4 hours

Strategy 3: Hybrid Solution (Twitter's Actual Approach)

How Hybrid Works:

  1. Normal users (< 10K followers): Use Push model

    • Tweet is fanned out to all follower timeline caches
    • Fast read experience for followers
  2. Celebrities (>= 10K followers): Use Pull model

    • Tweet is NOT fanned out
    • Stored in a separate "celebrity tweets" list
    • Merged at read time when follower opens timeline
  3. At read time:

    • Fetch pre-computed timeline from cache (pushed tweets)
    • Fetch recent tweets from followed celebrities (pulled tweets)
    • Merge and sort by timestamp
    • Return unified timeline

🎓 Professor Tom

Threshold tuning: 10K is a starting point. In production, Twitter uses several factors:

  • Follower count
  • Tweet frequency
  • Follower activity rate
  • Time of day

Some accounts can switch between Push and Pull dynamically.

Implementation: Redis Timeline Cache

(`get_followers` and `get_celebrity_followings` are assumed calls to the Social Graph service; a reachable Redis instance is assumed.)

```python
from redis import Redis

redis = Redis()               # assumes a reachable Redis instance
CELEBRITY_THRESHOLD = 10_000  # followers; above this we skip fan-out

# Push model: add a new tweet to every follower's timeline cache
def fan_out_tweet(tweet_id: str, author_id: str, timestamp: int):
    followers = get_followers(author_id)

    if len(followers) >= CELEBRITY_THRESHOLD:
        # Celebrity: skip fan-out, add to the author's celebrity list instead
        redis.zadd(f"celebrity_tweets:{author_id}", {tweet_id: timestamp})
        return

    # Normal user: fan out to all followers in one pipelined round trip
    pipeline = redis.pipeline()
    for follower_id in followers:
        pipeline.zadd(
            f"timeline:{follower_id}",
            {tweet_id: timestamp}
        )
        # Keep only the most recent 800 tweets per timeline
        pipeline.zremrangebyrank(f"timeline:{follower_id}", 0, -801)
    pipeline.execute()

# Read timeline with hybrid merge
def get_timeline(user_id: str, count: int = 20) -> list:
    # 1. Get pushed tweets from the pre-computed cache
    pushed_tweets = redis.zrevrange(
        f"timeline:{user_id}", 0, count - 1, withscores=True
    )

    # 2. Get recent tweets from followed celebrities (pull model)
    celebrity_followings = get_celebrity_followings(user_id)
    celebrity_tweets = []
    for celeb_id in celebrity_followings:
        tweets = redis.zrevrange(
            f"celebrity_tweets:{celeb_id}", 0, 9, withscores=True  # 10 newest
        )
        celebrity_tweets.extend(tweets)

    # 3. Merge both sources and sort by timestamp (the sorted-set score)
    all_tweets = pushed_tweets + celebrity_tweets
    all_tweets.sort(key=lambda x: x[1], reverse=True)

    return all_tweets[:count]
```

🔧 Raizo's Note

Production gotcha: Redis ZADD across millions of keys needs careful memory management:

  • Set maxmemory-policy volatile-lru to auto-evict old timelines (this only evicts keys with a TTL, so timeline keys need one)
  • Monitor memory usage with INFO memory
  • Consider Redis Cluster for horizontal scaling

⚖️ Trade-offs Analysis

Architecture Decision Matrix

| Decision | Option A | Option B | Chosen | Rationale |
|---|---|---|---|---|
| Feed Generation | Push (Fan-out on Write) | Pull (Fan-out on Read) | Hybrid | Balance write cost vs read latency |
| Tweet Storage | Single PostgreSQL | Sharded PostgreSQL | Sharded by tweet_id | Avoid hot spots from celebrities |
| Timeline Cache | Per-user Redis keys | Shared cache | Per-user Sorted Sets | O(1) read, easy pagination |
| ID Generation | Auto-increment | UUID | Snowflake ID | Time-sortable, distributed, no coordination |
| Social Graph | PostgreSQL | Cassandra | Cassandra | Write-heavy (follows), denormalized queries |
| Message Queue | RabbitMQ | Kafka | Kafka | High throughput, replay capability |

Snowflake ID Deep Dive

┌─────────────────────────────────────────────────────────────┐
│                 SNOWFLAKE ID (64 bits)                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────┬──────────────┬────────────┬─────────────────┐ │
│  │ Sign(1) │ Timestamp(41)│ Machine(10)│ Sequence(12)    │ │
│  │    0    │  ms since    │  Worker ID │  Counter        │ │
│  │         │  epoch       │  (1024)    │  (4096/ms)      │ │
│  └─────────┴──────────────┴────────────┴─────────────────┘ │
│                                                             │
│  Benefits:                                                  │
│  • Time-sortable: ORDER BY id = ORDER BY time              │
│  • Distributed: No central coordinator needed              │
│  • High throughput: 4096 IDs per ms per worker             │
│  • Compact: 64-bit integer fits in index efficiently       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
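A minimal single-process sketch of this bit layout (the epoch constant and the clock-handling are illustrative, not Twitter's exact implementation):

```python
import threading
import time

EPOCH_MS = 1288834974657  # Twitter's custom epoch (Nov 4, 2010)

class Snowflake:
    """64-bit IDs: 41-bit ms timestamp | 10-bit worker ID | 12-bit sequence."""

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024   # must fit in 10 bits
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()   # one generator instance per worker

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit counter
                if self.sequence == 0:
                    # 4096 IDs issued this millisecond: spin until the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

gen = Snowflake(worker_id=7)
a, b = gen.next_id(), gen.next_id()
# IDs are monotonically increasing (given a monotonic clock),
# which is why ORDER BY id doubles as ORDER BY time.
```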

🚨 Failure Scenarios & Mitigations

Scenario 1: Redis Timeline Cache Failure

| Aspect | Details |
|---|---|
| Impact | Users see stale or empty timelines |
| Detection | Redis health checks, latency spikes |
| Mitigation | Redis Cluster with 3 replicas per shard |
| Fallback | Switch to Pull model from DB |
| Recovery | Rebuild cache from DB on restart |

🔧 Raizo's Note

Circuit Breaker Pattern: when Redis fails, don't let requests pile up. Implement a circuit breaker to fail fast and fall back to the DB. Hystrix and resilience4j are good choices.
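A minimal circuit-breaker sketch for this Redis-to-DB fallback (illustrative only; resilience4j and Hystrix provide production-grade versions with half-open probing, metrics, and thread isolation):

```python
import time

class CircuitBreaker:
    """Fail fast to a fallback after repeated primary failures."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # set when the circuit trips open

    def call(self, primary, fallback):
        # Open circuit: skip the primary entirely until the timeout passes
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback()
            self.opened_at = None  # half-open: give the primary one more try
            self.failures = 0
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            return fallback()

breaker = CircuitBreaker(failure_threshold=2)

def redis_read():       # stand-in for the Redis timeline read
    raise ConnectionError("redis down")

def db_read():          # stand-in for the Pull-model DB fallback
    return ["tweet_from_db"]

results = [breaker.call(redis_read, db_read) for _ in range(3)]
# After two failures the circuit opens; the third call never touches Redis.
```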

Scenario 2: Celebrity Tweet Storm

| Aspect | Details |
|---|---|
| Impact | Fan-out queue backs up, delays for all users |
| Example | Elon Musk tweets during Super Bowl |
| Detection | Kafka consumer lag monitoring |
| Mitigation | Separate queue for celebrity tweets |
| Solution | Pull model for celebrity followers |

Scenario 3: Database Shard Failure

| Aspect | Details |
|---|---|
| Impact | Tweets on that shard unavailable |
| Detection | DB health checks, connection errors |
| Mitigation | Synchronous replication to standby |
| Failover | Automatic promotion (< 30 seconds) |
| Data Loss | Near-zero with sync replication |

Scenario 4: Kafka Cluster Failure

| Aspect | Details |
|---|---|
| Impact | Fan-out stops, new tweets not distributed |
| Detection | Producer errors, consumer lag |
| Mitigation | Multi-AZ Kafka deployment |
| Fallback | Write directly to cache (degraded mode) |
| Recovery | Replay from Kafka offset on recovery |

🔧 Raizo's Note

Idempotency is critical: fan-out workers MUST be idempotent. Kafka can deliver a message multiple times (at-least-once semantics). Use tweet_id as the dedup key in the Redis ZADD (same member with the same score = no duplicate).
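Why ZADD makes fan-out naturally idempotent: a sorted set is keyed by member, so re-adding the same tweet_id on redelivery just overwrites its score rather than creating a second entry. A dict stands in for one Redis sorted set here, since both map member to score:

```python
# One user's timeline, modeled as member -> score (like a Redis sorted set).
timeline = {}

def zadd(member: str, score: int) -> None:
    """Same semantics as ZADD for our purposes: upsert by member."""
    timeline[member] = score

# Kafka delivers the same fan-out message twice (at-least-once delivery).
zadd("tweet:42", 1700000000)
zadd("tweet:42", 1700000000)

assert len(timeline) == 1  # still exactly one entry for tweet:42
```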

💰 Cost Estimation

Monthly Infrastructure Costs (300M DAU Scale)

| Service | Specification | Unit Cost | Monthly Cost |
|---|---|---|---|
| API Servers | 200 × c5.2xlarge (8 vCPU, 16GB) | $0.34/hr | $49,000 |
| Fan-out Workers | 50 × c5.xlarge (4 vCPU, 8GB) | $0.17/hr | $6,100 |
| Timeline Cache | Redis Cluster 500GB (3 replicas) | $0.068/GB/hr | $15,000 |
| Tweet Cache | Redis 100GB | $0.068/GB/hr | $5,000 |
| Tweet DB | PostgreSQL 20TB (sharded, 3 replicas) | $0.115/GB/mo | $8,000 |
| Social Graph | Cassandra 50TB (3 nodes) | $0.50/GB/mo | $25,000 |
| Kafka Cluster | 10 brokers (m5.2xlarge) | $0.384/hr | $12,000 |
| CDN | 500TB egress/month | $0.08/GB | $40,000 |
| Load Balancers | 2 × ALB | $0.025/hr + LCU | $2,000 |
| Media Storage | S3 1.8PB | $0.023/GB/mo | $42,000 |

Cost Summary

| Category | Monthly Cost | % of Total |
|---|---|---|
| Compute (API + Workers) | $55,100 | 27% |
| Caching (Redis) | $20,000 | 10% |
| Database (PostgreSQL + Cassandra) | $33,000 | 16% |
| Message Queue (Kafka) | $12,000 | 6% |
| CDN & Networking | $42,000 | 21% |
| Storage (S3) | $42,000 | 21% |
| Total | ~$204,000 | 100% |

🎓 Professor Tom

Cost Optimization Strategies:

  1. Reserved Instances: a 1-year commitment cuts compute costs 30-40%
  2. Spot Instances: fan-out workers can run on spot (up to 70% cheaper)
  3. S3 Intelligent-Tiering: auto-moves old media to cheaper storage
  4. CDN Caching: a higher cache hit ratio means lower origin egress

🔧 Raizo's Note

Hidden costs to watch:

  • Data transfer between AZs: $0.01/GB adds up fast
  • Redis memory overhead: Actual usage ~1.5x data size
  • Kafka retention: 7 days retention = 7x daily volume
  • Monitoring & Logging: CloudWatch/Datadog can run $10K+/month

Cost per User Metrics

Monthly cost: ~$204,000
DAU: 300,000,000

Cost per DAU per month = $204,000 / 300M = ~$0.0007
Cost per DAU per year = ~$0.0007 × 12 = ~$0.008

Revenue per user (ads): ~$5-10/year
Gross margin: very healthy!

🎯 Interview Checklist

Must-Mention Items

| Topic | Key Points |
|---|---|
| Scale Estimation | 300M DAU, 600M tweets/day, 35K read QPS |
| Fan-out Strategy | Push vs Pull trade-offs, Hybrid solution |
| Celebrity Problem | Why 10K+ followers need special handling |
| Data Storage | Sharding strategy (by tweet_id, not user_id) |
| Caching | Redis Sorted Sets for timeline, cache invalidation |
| ID Generation | Snowflake IDs (time-sortable, distributed) |

Bonus Points 🌟

  • Ranking Algorithm: Mention relevance scoring beyond chronological
  • Real-time Updates: WebSocket/SSE for live timeline updates
  • Search Architecture: Elasticsearch for full-text search
  • Trending Topics: Sliding window counters, decay functions
  • Anti-spam: Rate limiting, content moderation ML models
  • Geo-distribution: Multi-region deployment, data locality

Common Mistakes

| Mistake | Why It's Wrong | Better Approach |
|---|---|---|
| Single database | Won't scale past 10M users | Shard from day 1 |
| Pull-only model | High latency for users | Hybrid Push/Pull |
| Push to all followers | Celebrity problem | Threshold-based hybrid |
| Shard by user_id | Hot spots from celebrities | Shard by tweet_id |
| No caching | DB can't handle read QPS | Redis timeline cache |
| Sync fan-out | Blocks tweet posting | Async via Kafka |

⚠️ Interview Red Flags

  • Not mentioning scale numbers (QPS, storage)
  • Not knowing Fan-out on Write vs Fan-out on Read
  • Not addressing the Celebrity problem
  • Designing in a single point of failure
  • Having no caching strategy

🎓 Key Takeaways

  1. Hybrid fan-out is the real-world solution: there is no silver bullet
  2. The Read:Write ratio drives the architecture (Twitter is 5:1 read-heavy)
  3. Sharding strategy must consider data distribution, not just logical grouping
  4. Snowflake IDs solve distributed ID generation elegantly
  5. Async processing (Kafka) decouples the write path from fan-out
  6. Cache everything: Redis Sorted Sets are a perfect fit for timelines

🔗 Navigation

| Case Study | Key Learning | Link |
|---|---|---|
| YouTube | Video processing, CDN, adaptive streaming | Design YouTube → |
| Uber | Real-time location, geospatial indexing | Design Uber → |
| WhatsApp | Messaging, E2EE, connection management | Design WhatsApp → |
