🚗 DESIGN UBER
Real-time Location Services & Driver Matching
🎓 Professor Tom
Uber is the classic geospatial indexing problem: how do you find the nearest driver among millions of drivers that are constantly moving? This is where you learn why ordinary SQL queries don't work, and why you need specialized data structures like Quadtree or S2 Geometry.
📊 Back-of-Envelope Calculations
Scale Assumptions
| Metric | Value | Rationale |
|---|---|---|
| Monthly Active Users (MAU) | 100M | Global ride-sharing platform |
| Daily Active Users (DAU) | 30M | ~30% of MAU |
| Active Drivers (peak) | 2M | Supply side at peak hours |
| Trips per day | 15M | Average across all markets |
| Location updates per driver | 4/second | GPS ping frequency |
| Average trip duration | 15 minutes | City rides |
| Average trip distance | 8 km | Urban environment |
Location Update Calculations
Driver Location Updates:
Active drivers at peak = 2,000,000
Updates per driver per second = 4
Location Update QPS = 2M × 4 = 8,000,000 QPS (8M QPS!)
This is MASSIVE - one of the highest write loads in tech.
Location data per update = 50 bytes (lat, lng, timestamp, driver_id, status)
Bandwidth for location updates = 8M × 50 bytes = 400 MB/s = 3.2 Gbps
🔧 Raizo's Note
8 million writes/second is a staggering number. No database can handle it directly. This is why Uber uses in-memory data structures (Redis Geo) and only persists aggregated data.
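The arithmetic above is easy to sanity-check in a few lines (inputs are the assumed values from the table, not measurements):

```python
# Back-of-envelope check for the location-update load (assumed inputs)
active_drivers = 2_000_000          # peak supply
updates_per_driver_per_sec = 4      # GPS ping frequency
bytes_per_update = 50               # lat, lng, timestamp, driver_id, status

qps = active_drivers * updates_per_driver_per_sec
bandwidth_mb_s = qps * bytes_per_update / 1_000_000   # decimal MB/s
bandwidth_gbps = bandwidth_mb_s * 8 / 1000            # bits, not bytes

print(qps, bandwidth_mb_s, bandwidth_gbps)
```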
Ride Request Calculations
Ride Requests:
Daily trips = 15,000,000
Peak hours = 4 hours (morning + evening rush)
Peak trip percentage = 40% of daily trips in peak hours
Peak trips = 15M × 0.4 = 6M trips in 4 hours
Peak ride request QPS = 6M / (4 × 3600) = ~420 QPS
With retries and failed matches: ~1,000 QPS peak
Matching Calculations
For each ride request:
1. Find nearby drivers (query geospatial index)
2. Calculate ETA for top 10 candidates
3. Send request to best driver
4. Wait for acceptance (timeout 15s)
5. If rejected, try next driver
Average drivers queried per request = 10
ETA calculations per request = 10
Maps API calls per request = 10
Peak Maps API calls = 1,000 QPS × 10 = 10,000 QPS
Storage Calculations
Trip Data:
Daily trips = 15M
Trip record size = 2 KB (pickup, dropoff, route, fare, timestamps)
Daily trip storage = 15M × 2 KB = 30 GB/day
Monthly trip storage = 30 GB × 30 = 900 GB/month
Yearly trip storage = 900 GB × 12 = ~11 TB/year
Driver Location History (for analytics):
Sampled at 1/minute (not every 4/sec)
2M drivers × 60 samples/hour × 24 hours = 2.88B records/day
Record size = 30 bytes
Daily location history = 2.88B × 30 = 86 GB/day
Real-time Location Index (Redis):
2M drivers × 50 bytes = 100 MB (fits in memory easily!)
Calculation Breakdown
- Peak hours: 7-9 AM and 5-7 PM are rush hours in most cities
- 4 updates/second: balances accuracy against bandwidth
- Trip record 2 KB: includes the polyline route, fare breakdown, and ratings
- Location sampling: real time uses 4/sec; history only needs 1/min
Bandwidth Calculations
Inbound (Location Updates):
8M QPS × 50 bytes = 400 MB/s = 3.2 Gbps
Outbound (Driver App):
- Ride requests: 1,000 QPS × 1 KB = 1 MB/s
- Navigation updates: 2M drivers × 1 update/5s × 500 bytes = 200 MB/s
Outbound (Rider App):
- Driver location during trip: 15M trips/day × 15 min avg ≈ 156K concurrent trips on average; assume ~1M concurrent at peak
= 1M concurrent trips × 1 update/s × 100 bytes = 100 MB/s
- ETA updates: 100 MB/s
Total Outbound: ~400 MB/s = 3.2 Gbps
🏗️ High-Level Architecture
Component Responsibilities
| Component | Responsibility | Technology |
|---|---|---|
| Location Service | Ingest 8M location updates/sec, update geo index | Go + Redis Geo |
| Matching Engine | Find optimal driver for rider, handle dispatch | Go microservice |
| Trip Service | Manage trip lifecycle (request → complete) | Go + PostgreSQL |
| Pricing Service | Calculate fares, surge pricing | Go microservice |
| Redis Geo | Real-time driver location index | Redis 6.0+ GEOADD |
| Maps Service | ETA calculation, routing, distance | Google Maps API |
| WebSocket Servers | Push real-time updates to apps | Go + gorilla/websocket |
| Kafka | Event streaming for analytics, notifications | Apache Kafka |
🔧 Raizo's Note
Why use Go for the Location Service?
Go's goroutines handle millions of concurrent connections efficiently. A single Go server can handle 100K+ concurrent WebSocket connections with a small memory footprint. Java/Node.js would need significantly more servers.
🔄 Core Flows
Flow 1: Driver Location Update
Flow 2: Ride Request & Driver Matching
Flow 3: Trip Lifecycle
🎓 Professor Tom
A state machine is critical for the Trip Service. Each trip moves through the states: REQUESTED → ACCEPTED → DRIVER_ARRIVED → IN_PROGRESS → COMPLETED
Every transition must be validated. For example: a trip cannot be COMPLETED if it was never started.
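A minimal sketch of that validation, with the transition table as data (the CANCELLED state and the exception type are illustrative additions, not from the note above):

```python
# Allowed trip-state transitions; anything not listed is rejected.
TRANSITIONS = {
    "REQUESTED":      {"ACCEPTED", "CANCELLED"},
    "ACCEPTED":       {"DRIVER_ARRIVED", "CANCELLED"},
    "DRIVER_ARRIVED": {"IN_PROGRESS", "CANCELLED"},
    "IN_PROGRESS":    {"COMPLETED"},
    "COMPLETED":      set(),   # terminal
    "CANCELLED":      set(),   # terminal
}

class InvalidTransition(Exception):
    pass

def transition(current: str, new: str) -> str:
    """Validate and apply a trip state change."""
    if new not in TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new
```

Keeping the table as data makes the rule "you cannot COMPLETE a trip that never started" a one-line check rather than scattered if-statements.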
💡 Deep Dive: Geospatial Indexing
The Core Problem
When a rider requests a ride at (10.8231, 106.6297), how do you find the nearest driver among 2 million drivers currently online?
Why SQL BETWEEN Queries Don't Work
```sql
-- Naive approach: find drivers within a bounding box
SELECT driver_id, lat, lng
FROM driver_locations
WHERE lat BETWEEN 10.82 AND 10.84
  AND lng BETWEEN 106.62 AND 106.64;
```

⚠️ Problems with SQL Approach
- Full table scan: with 2M rows, the query takes 2-5 seconds
- Index is ineffective: a B-tree index on (lat, lng) can only use one column for the range scan
- Bounding box is inaccurate: the corners of the box are farther away than the edges
- Doesn't scale: 8M updates/sec would lock the database continuously
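The corner inaccuracy is easy to verify with a quick haversine check (a sketch; the 3 km offsets are converted to degrees at this latitude):

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

lat, lng = 10.8231, 106.6297
dlat = 3 / 111.0                                   # ~3 km north, in degrees
dlng = 3 / (111.0 * math.cos(math.radians(lat)))   # ~3 km east, in degrees

edge = haversine_km(lat, lng, lat + dlat, lng)            # middle of a box edge
corner = haversine_km(lat, lng, lat + dlat, lng + dlng)   # box corner
print(f"edge: {edge:.2f} km, corner: {corner:.2f} km")    # corner ≈ 3 × √2
```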
┌─────────────────────────────────────────────────────────────┐
│ WHY BOUNDING BOX IS INACCURATE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────┐ │
│ │ × × × │ × = Drivers in box │
│ │ ┌─────────┐ │ ● = Rider location │
│ │ × │ ● │ × │ ○ = Actual 3km radius │
│ │ │ ○○○○○ │ │ │
│ │ × │ ○ ○ │ × │ Drivers at corners │
│ │ │○ ● ○│ │ are ~4.2km away! │
│ │ × │ ○ ○ │ × │ (3km × √2 = 4.24km) │
│ │ │ ○○○○○ │ │ │
│ │ × └─────────┘ × │ │
│ │ │ │
│ └─────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Solution 1: Quadtree
A Quadtree recursively divides 2D space into four quadrants.
┌─────────────────────────────────────────────────────────────┐
│ QUADTREE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Level 0 (Root): │
│ ┌─────────────────────────────────┐ │
│ │ │ │ │
│ │ NW │ NE │ │
│ │ │ │ │
│ ├───────────────┼─────────────────┤ │
│ │ │ │ │
│ │ SW │ SE │ │
│ │ │ │ │
│ └─────────────────────────────────┘ │
│ │
│ Level 2 (Subdivided): │
│ ┌────────┬────────┬────────┬────────┐ │
│ │ │ │ │ │ │ │ │ │ │
│ ├───┼────┼───┼────┼───┼────┼───┼────┤ │
│ │ │ 🚗 │ │ │ │ 🚗 │ │ │ │
│ ├────────┼────────┼────────┼────────┤ │
│ │ │ │ │ 📍 │ │ │ │ 🚗 │ 📍 = Rider │
│ ├───┼────┼───┼────┼───┼────┼───┼────┤ 🚗 = Drivers │
│ │ 🚗│ │ │ │ │ │ │ │ │
│ └────────┴────────┴────────┴────────┘ │
│ │
│ Search: Start at root → traverse to rider's cell │
│ → check adjacent cells → filter by distance │
│ │
└─────────────────────────────────────────────────────────────┘

| Pros | Cons |
|---|---|
| ✅ O(log n) search | ❌ Unbalanced with clustered data |
| ✅ Easy to implement | ❌ Rebalancing expensive |
| ✅ Good for static data | ❌ Not great for moving objects |
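To make the structure concrete, here is a toy point quadtree (a sketch, not a production implementation; the capacity of 4 points per node is arbitrary):

```python
class Quadtree:
    """Toy point quadtree: stores (x, y, payload) points, answers rectangular range queries."""
    CAPACITY = 4  # points per node before subdividing

    def __init__(self, x0, y0, x1, y1):
        self.bounds = (x0, y0, x1, y1)
        self.points = []
        self.children = None  # NW, NE, SW, SE once subdivided

    def insert(self, x, y, payload):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False  # outside this node
        if self.children is None:
            if len(self.points) < self.CAPACITY:
                self.points.append((x, y, payload))
                return True
            self._subdivide()
        return any(c.insert(x, y, payload) for c in self.children)

    def _subdivide(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [Quadtree(x0, my, mx, y1), Quadtree(mx, my, x1, y1),
                         Quadtree(x0, y0, mx, my), Quadtree(mx, y0, x1, my)]
        for p in self.points:  # push existing points down one level
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1):
        """Return payloads of all points inside the query rectangle."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 >= x1 or qy1 < y0 or qy0 >= y1:
            return []  # query rectangle does not overlap this node
        hits = [p for x, y, p in self.points if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        for c in self.children or []:
            hits += c.query(qx0, qy0, qx1, qy1)
        return hits
```

Clustered drivers (e.g. downtown at rush hour) force deep subdivision in one branch, which is exactly the imbalance the cons column refers to.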
Solution 2: Geohash
Geohash encodes lat/lng into a string; nearby locations share a prefix.
┌─────────────────────────────────────────────────────────────┐
│ GEOHASH │
├─────────────────────────────────────────────────────────────┤
│ │
│ Location: (10.8231, 106.6297) → Geohash: "w3gvk1" │
│ │
│ Precision levels: │
│ ┌──────────┬─────────────┬──────────────────────┐ │
│ │ Length │ Cell Size │ Example │ │
│ ├──────────┼─────────────┼──────────────────────┤ │
│ │ 1 │ 5000 km │ w │ │
│ │ 2 │ 1250 km │ w3 │ │
│ │ 3 │ 156 km │ w3g │ │
│ │ 4 │ 39 km │ w3gv │ │
│ │ 5 │ 4.9 km │ w3gvk │ │
│ │ 6 │ 1.2 km │ w3gvk1 │ ← Good │
│ │ 7 │ 153 m │ w3gvk1p │ │
│ │ 8 │ 38 m │ w3gvk1pq │ │
│ └──────────┴─────────────┴──────────────────────┘ │
│ │
│ Search: Find all drivers with geohash prefix "w3gvk" │
│ + 8 adjacent cells (edge cases) │
│ │
└─────────────────────────────────────────────────────────────┘

| Pros | Cons |
|---|---|
| ✅ String prefix search (fast) | ❌ Edge problem (adjacent cells different prefix) |
| ✅ Easy to index in DB | ❌ Fixed cell sizes |
| ✅ Human readable | ❌ Cells not uniform near poles |
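Encoding is just alternating binary subdivision of longitude and latitude, packed into base-32. A from-scratch sketch (production code would normally use a geohash library):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lng: float, precision: int = 6) -> str:
    """Encode lat/lng as a geohash: bits alternate lng, lat; 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    even = True  # even bit positions refine longitude, odd refine latitude
    while len(chars) < precision:
        if even:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits, lng_lo = (bits << 1) | 1, mid
            else:
                bits, lng_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

Shorter hashes are prefixes of longer ones by construction, which is what makes prefix search work; the edge problem is that two points a few meters apart can still fall in cells with different prefixes, hence the 8 adjacent cells in the search above.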
Solution 3: S2 Geometry (Google's Choice) ⭐
S2 projects Earth onto a cube, then uses Hilbert curve to map 2D → 1D.
┌─────────────────────────────────────────────────────────────┐
│ S2 GEOMETRY │
├─────────────────────────────────────────────────────────────┤
│ │
│ Earth → Cube → Hilbert Curve → 64-bit Cell ID │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌─────┬─────┬─────┬─────┐ │ │
│ │ │ L14 │ L14 │ L14 │ L14 │ Level 14: ~0.8 km² │ │
│ │ │Cell1│Cell2│Cell3│Cell4│ │ │
│ │ ├─────┼─────┼─────┼─────┤ │ │
│ │ │ 🚗 │ │ 🚗 │ │ Drivers in cells │ │
│ │ │ 🚗 │ 🚗 │ │ 🚗 │ │ │
│ │ ├─────┼─────┼─────┼─────┤ │ │
│ │ │ │ 📍 │ │ │ 📍 = Rider location │ │
│ │ │ 🚗 │ │ 🚗 │ │ │ │
│ │ └─────┴─────┴─────┴─────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Cell Hierarchy: │
│ Level 0: Earth face (~85M km²) │
│ Level 12: ~3.3 km² (city district) │
│ Level 14: ~0.8 km² (neighborhood) ← Uber uses this │
│ Level 16: ~0.05 km² (city block) │
│ Level 30: ~1 cm² (maximum precision) │
│ │
└─────────────────────────────────────────────────────────────┘
Why S2 is Superior:
| Feature | Geohash | S2 |
|---|---|---|
| Cell shape | Rectangle (varies) | Nearly square (uniform) |
| Edge handling | 8 neighbors needed | Hierarchical covering |
| Variable precision | Fixed levels | Any level 0-30 |
| Covering algorithm | Manual | Built-in GetCovering() |
| Used by | Many apps | Google Maps, Uber, Foursquare |
Implementation: Redis Geo with S2
```python
import time

import redis
from s2sphere import CellId, LatLng

redis_client = redis.Redis(decode_responses=True)

def update_driver_location(driver_id: str, lat: float, lng: float):
    """Update driver location in Redis Geo index"""
    # Redis GEOADD uses geohash internally, but we also store the S2 cell
    redis_client.geoadd("drivers:active", (lng, lat, driver_id))
    # Store S2 cell for advanced queries
    cell_id = CellId.from_lat_lng(LatLng.from_degrees(lat, lng))
    cell_token = cell_id.parent(14).to_token()  # Level 14
    redis_client.hset(f"driver:{driver_id}", mapping={
        "lat": lat,
        "lng": lng,
        "s2_cell": cell_token,
        "last_update": int(time.time()),
        "status": "available"
    })
    # Add to cell-based index for fast lookup
    redis_client.sadd(f"cell:{cell_token}:drivers", driver_id)
    # Set TTL for stale location cleanup
    redis_client.expire(f"driver:{driver_id}", 30)

def find_nearby_drivers(lat: float, lng: float, radius_km: float = 3) -> list:
    """Find drivers within radius using Redis GEORADIUS"""
    # Primary: use Redis Geo (geohash-based)
    nearby = redis_client.georadius(
        "drivers:active",
        lng, lat,
        radius_km, unit="km",
        withdist=True,
        count=20,
        sort="ASC"
    )
    # Filter by availability
    available_drivers = []
    for driver_id, distance in nearby:
        driver_data = redis_client.hgetall(f"driver:{driver_id}")
        if driver_data.get("status") == "available":
            available_drivers.append({
                "driver_id": driver_id,
                "distance_km": distance,
                "lat": float(driver_data["lat"]),
                "lng": float(driver_data["lng"])
            })
    return available_drivers
```

🔧 Raizo's Note
Redis GEORADIUS uses geohash internally, not S2. But at a 3-5 km radius the accuracy is good enough. Uber uses a custom S2 implementation for advanced features such as:
- Variable cell sizes based on driver density
- Efficient range queries across cell boundaries
- Hierarchical caching
🎯 Deep Dive: Driver Matching Logic
The Matching Problem
With 10 drivers near a rider, which one do you pick? This is not just "the nearest driver"; you also have to optimize for:
- Rider experience: low ETA, high driver rating
- Driver fairness: distribute rides evenly
- Platform efficiency: minimize empty miles
Strategy 1: Greedy Matching (Simple)
┌─────────────────────────────────────────────────────────────┐
│ GREEDY MATCHING │
├─────────────────────────────────────────────────────────────┤
│ │
│ Algorithm: │
│ 1. Find all available drivers within 3km │
│ 2. Calculate ETA for each driver │
│ 3. Pick driver with lowest ETA │
│ 4. Send request, wait for acceptance │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 🚗 D1 (2.1km, ETA 4min) │ │
│ │ \ │ │
│ │ \ │ │
│ │ 🚗 D2 (1.5km, ETA 3min) ←── SELECTED │ │
│ │ \ │ │
│ │ 📍 Rider │ │
│ │ / │ │
│ │ 🚗 D3 (1.8km, ETA 5min) │ │
│ │ / │ │
│ │ 🚗 D4 (2.5km, ETA 6min) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘

| Pros | Cons |
|---|---|
| ✅ Simple, fast O(n) | ❌ Locally optimal, not globally |
| ✅ Low latency | ❌ Can cause driver starvation |
| ✅ Easy to understand | ❌ Doesn't consider future requests |
Strategy 2: Global Optimization (Batch Matching)
┌─────────────────────────────────────────────────────────────┐
│ GLOBAL OPTIMIZATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ Batch requests every 2 seconds, solve assignment problem │
│ │
│ Riders: R1 R2 R3 R4 │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ HUNGARIAN ALGORITHM │ │
│ │ (Bipartite Matching) │ │
│ │ │ │
│ │ Cost Matrix (ETA in minutes): │ │
│ │ D1 D2 D3 D4 │ │
│ │ R1 [ 4 3 5 6 ] │ │
│ │ R2 [ 7 2 4 3 ] │ │
│ │ R3 [ 3 5 2 4 ] │ │
│ │ R4 [ 5 4 6 2 ] │ │
│ │ │ │
│ │ Optimal: R1→D2, R2→D4, R3→D3, R4→D1 │ │
│ │ Total ETA: 3+3+2+5 = 13 minutes │ │
│ │ │ │
│ │ vs Greedy: R1→D2, R2→D2(taken)→D4, ... │ │
│ │ Greedy Total: 3+3+4+6 = 16 minutes │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘

| Pros | Cons |
|---|---|
| ✅ Globally optimal | ❌ Higher latency (batching delay) |
| ✅ Better overall efficiency | ❌ Complex implementation |
| ✅ Fairer distribution | ❌ O(n³) Hungarian algorithm |
Uber's Hybrid Approach
```python
def match_driver(rider_location, ride_type, urgency="normal"):
    """
    Hybrid matching: greedy with a local optimization window
    """
    # 1. Get candidate drivers
    candidates = find_nearby_drivers(
        rider_location.lat,
        rider_location.lng,
        radius_km=3
    )
    if not candidates:
        # Expand search radius
        candidates = find_nearby_drivers(
            rider_location.lat,
            rider_location.lng,
            radius_km=5
        )
    # 2. Filter by ride type and availability
    eligible = []
    for driver in candidates:
        if not is_driver_eligible(driver, ride_type):
            continue
        if is_driver_locked(driver.id):
            continue
        eligible.append(driver)
    if not eligible:
        return None
    # 3. Calculate scores for each driver
    scored_drivers = []
    for driver in eligible[:10]:  # Limit ETA calculations
        eta = get_eta_from_maps(driver.location, rider_location)
        score = calculate_match_score(
            eta=eta,
            driver_rating=driver.rating,
            acceptance_rate=driver.acceptance_rate,
            trips_today=driver.trips_today,
            idle_time=driver.idle_time
        )
        scored_drivers.append((driver, score, eta))
    # 4. Sort by score (higher is better)
    scored_drivers.sort(key=lambda x: x[1], reverse=True)
    # 5. Try to lock and dispatch
    for driver, score, eta in scored_drivers:
        if try_lock_driver(driver.id, timeout_seconds=15):
            return dispatch_ride(driver, rider_location, eta)
    return None  # No driver available

def calculate_match_score(eta, driver_rating, acceptance_rate,
                          trips_today, idle_time):
    """
    Multi-factor scoring function.
    Weights tuned to business priorities:
    - ETA: primary factor (rider experience)
    - Rating: quality signal
    - Acceptance rate: reliability
    - Trips today: fairness (favor drivers with fewer trips)
    - Idle time: fairness (prioritize waiting drivers)
    """
    # Normalize factors to a 0-1 scale
    eta_score = max(0, 1 - eta / 15)                    # 15 min max
    rating_score = max(0, (driver_rating - 4.0) / 1.0)  # 4.0-5.0 range
    acceptance_score = acceptance_rate
    fairness_score = max(0, 1 - trips_today / 20)       # 20 trips/day target
    idle_score = min(1, idle_time / 600)                # 10 min idle = max score
    # Weighted combination
    score = (
        0.40 * eta_score +
        0.20 * rating_score +
        0.15 * acceptance_score +
        0.15 * fairness_score +
        0.10 * idle_score
    )
    return score
```

Driver Locking Mechanism
┌─────────────────────────────────────────────────────────────┐
│ DRIVER LOCKING │
├─────────────────────────────────────────────────────────────┤
│ │
│ Problem: Two riders request at same time, both get same │
│ driver → one rider left hanging │
│ │
│ Solution: Distributed lock with Redis │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Rider A requests → Lock Driver 1 (15s TTL) │ │
│ │ SET driver:1:lock ride_A EX 15 │ │
│ │ │ │
│ │ Rider B requests → Try lock Driver 1 │ │
│ │ SETNX fails (already locked) │ │
│ │ → Try Driver 2 instead │ │
│ │ │ │
│ │ Driver 1 accepts → Extend lock, start trip │ │
│ │ Driver 1 rejects → Release lock (DEL) │ │
│ │ Timeout (15s) → Lock auto-expires │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘

```python
import uuid

# Release must check ownership: delete only if the lock still holds our token
# (otherwise we could free a lock that expired and was re-acquired by another request)
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""

def try_lock_driver(driver_id: str, timeout_seconds: int = 15):
    """
    Attempt to acquire an exclusive lock on a driver.
    Returns the lock token if acquired, None if the driver is already locked.
    """
    lock_key = f"driver:{driver_id}:lock"
    lock_value = str(uuid.uuid4())  # Unique lock identifier
    # SET NX EX - atomic set-if-not-exists with expiry
    acquired = redis_client.set(
        lock_key,
        lock_value,
        nx=True,  # Only set if not exists
        ex=timeout_seconds
    )
    return lock_value if acquired else None

def release_driver_lock(driver_id: str, lock_value: str):
    """Release the driver lock only if we still own it"""
    lock_key = f"driver:{driver_id}:lock"
    redis_client.eval(RELEASE_SCRIPT, 1, lock_key, lock_value)
```

ETA Calculation
┌─────────────────────────────────────────────────────────────┐
│ ETA CALCULATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ Naive: Straight-line distance / average speed │
│ ETA = 2km / 30km/h = 4 minutes │
│ ❌ Ignores roads, traffic, one-way streets │
│ │
│ Better: Google Maps Directions API │
│ ✅ Real road network │
│ ✅ Real-time traffic │
│ ✅ Turn-by-turn routing │
│ ❌ Expensive at scale ($5/1000 requests) │
│ │
│ Uber's Approach: Hybrid │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 1. Pre-compute base ETAs between S2 cells │ │
│ │ (offline, updated hourly) │ │
│ │ │ │
│ │ 2. Apply real-time traffic multiplier │ │
│ │ (from live driver data) │ │
│ │ │ │
│ │ 3. Use Maps API only for final routing │ │
│ │ (after driver accepts) │ │
│ │ │ │
│ │ ETA = base_eta[cell_A][cell_B] × traffic_factor │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
🎓 Professor Tom
ETA accuracy is a critical metric. Uber tracks:
- ETA accuracy: predicted vs actual arrival time
- ETA variance: consistency of predictions
- User trust: users stop using the app if ETAs are unreliable
Target: 90% of ETAs within ±2 minutes of actual.
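The hybrid lookup described above can be sketched in a few lines (the cell tokens, table contents, and traffic multiplier are illustrative assumptions, not Uber's data):

```python
# Pre-computed base ETAs between S2 cells (seconds), refreshed offline.
# Keys are (origin_cell, destination_cell) tokens - illustrative values.
BASE_ETA = {
    ("w3gvk1", "w3gvk4"): 240,   # ~4 min free-flow
    ("w3gvk1", "w3gvh7"): 420,
}

def estimate_eta(origin_cell: str, dest_cell: str, traffic_factor: float) -> int:
    """ETA = pre-computed base x live traffic multiplier (1.0 = free flow)."""
    base = BASE_ETA.get((origin_cell, dest_cell))
    if base is None:
        return -1  # in the real system: fall back to a Maps API call
    return int(base * traffic_factor)

print(estimate_eta("w3gvk1", "w3gvk4", 1.5))  # rush hour: 240 * 1.5 = 360 s
```

The expensive Maps API call is reserved for the cache-miss path, which is what makes the 10,000-QPS ETA load affordable.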
🗺️ Build vs Buy: Maps Service
The Maps Dilemma
Maps are a critical dependency for Uber. Every ride needs:
- Geocoding: Address → Coordinates
- Reverse Geocoding: Coordinates → Address
- Routing: Point A → Point B directions
- ETA: Estimated time of arrival
- Distance: Actual road distance (not straight-line)
Option 1: Google Maps API
┌─────────────────────────────────────────────────────────────┐
│ GOOGLE MAPS API │
├─────────────────────────────────────────────────────────────┤
│ │
│ Pricing (as of 2024): │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ API │ Price per 1000 requests │ │
│ ├────────────────────────┼────────────────────────────┤ │
│ │ Directions │ $5.00 │ │
│ │ Distance Matrix │ $5.00 │ │
│ │ Geocoding │ $5.00 │ │
│ │ Places │ $17.00 │ │
│ │ Roads (Snap to Roads) │ $10.00 │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Uber's Usage (estimated): │
│ - 15M trips/day × 10 API calls/trip = 150M calls/day │
│ - Monthly: 150M × 30 = 4.5B calls │
│ - Cost: 4.5B / 1000 × $5 = $22.5M/month 😱 │
│ │
│ With volume discounts: ~$500K/month (still huge!) │
│ │
└─────────────────────────────────────────────────────────────┘

| Pros | Cons |
|---|---|
| ✅ Best-in-class accuracy | ❌ Extremely expensive at scale |
| ✅ Global coverage | ❌ Vendor lock-in |
| ✅ Real-time traffic | ❌ Rate limits |
| ✅ No maintenance | ❌ No customization |
| ✅ Continuous updates | ❌ Dependency risk |
Option 2: Self-Hosted OpenStreetMap (OSM)
┌─────────────────────────────────────────────────────────────┐
│ SELF-HOSTED OSM STACK │
├─────────────────────────────────────────────────────────────┤
│ │
│ Components: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Nominatim │ │ OSRM │ │ │
│ │ │ (Geocoding) │ │ (Routing) │ │ │
│ │ └─────────────┘ └─────────────┘ │ │
│ │ │ │ │ │
│ │ └────────┬─────────┘ │ │
│ │ ▼ │ │
│ │ ┌─────────────┐ │ │
│ │ │ PostgreSQL │ │ │
│ │ │ + PostGIS │ │ │
│ │ │ (OSM Data) │ │ │
│ │ └─────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Infrastructure Cost: │
│ - 50 routing servers: $15,000/month │
│ - 10 geocoding servers: $3,000/month │
│ - Database cluster: $5,000/month │
│ - Engineering team (5 people): $100,000/month │
│ - Total: ~$125,000/month │
│ │
│ vs Google Maps: $500,000/month │
│ Savings: $375,000/month = $4.5M/year │
│ │
└─────────────────────────────────────────────────────────────┘

| Pros | Cons |
|---|---|
| ✅ Much cheaper at scale | ❌ No real-time traffic |
| ✅ Full control | ❌ Data quality varies by region |
| ✅ No rate limits | ❌ Requires dedicated team |
| ✅ Customizable | ❌ Map updates lag behind |
| ✅ No vendor lock-in | ❌ Complex infrastructure |
Uber's Actual Approach: Hybrid
┌─────────────────────────────────────────────────────────────┐
│ UBER'S HYBRID APPROACH │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Use Case │ Solution │ │
│ │ ──────────────────────┼───────────────────────── │ │
│ │ Driver matching ETA │ Pre-computed + OSM │ │
│ │ (high volume, approx) │ │ │
│ │ │ │ │
│ │ Turn-by-turn nav │ Google Maps API │ │
│ │ (accuracy critical) │ │ │
│ │ │ │ │
│ │ Fare calculation │ OSM (distance) + │ │
│ │ │ Internal (time) │ │
│ │ │ │ │
│ │ Geocoding │ Google (primary) + │ │
│ │ │ OSM (fallback) │ │
│ │ │ │ │
│ │ Traffic data │ Internal (driver GPS) + │ │
│ │ │ Google (supplement) │ │
│ │ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Key Insight: Uber has MILLIONS of drivers sending GPS │
│ data every second. This is better traffic data than │
│ Google has in many markets! │
│ │
└─────────────────────────────────────────────────────────────┘
🔧 Raizo's Note
Uber's secret weapon: real-time traffic from its driver fleet.
With 2M active drivers sending their location 4 times per second, Uber has a traffic-data density Google doesn't have. They use this data to:
- Detect traffic jams before Google does
- Find the shortcuts drivers actually use
- Improve ETA accuracy in their markets
This is why Uber invests heavily in its internal maps team.
Decision Framework
┌─────────────────────────────────────────────────────────────┐
│ MAPS BUILD VS BUY DECISION │
├─────────────────────────────────────────────────────────────┤
│ │
│ Choose Google Maps API if: │
│ □ < 1M API calls/month (cost manageable) │
│ □ Global coverage needed immediately │
│ □ No engineering bandwidth for maps │
│ □ Real-time traffic is critical │
│ □ Startup phase (focus on core product) │
│ │
│ Choose Self-hosted OSM if: │
│ □ > 100M API calls/month (cost savings significant) │
│ □ Operating in limited geographies │
│ □ Have dedicated maps engineering team │
│ □ Need customization (e.g., bike lanes, truck routes) │
│ □ Have alternative traffic data source │
│ │
│ Choose Hybrid if: │
│ □ High volume but need accuracy for some use cases │
│ □ Want to reduce dependency gradually │
│ □ Have some engineering capacity │
│ □ Can tolerate complexity │
│ │
└─────────────────────────────────────────────────────────────┘
🎓 Professor Tom
Interview tip: when asked about Maps, don't just say "use the Google Maps API". Discuss:
- Cost implications at scale
- Build vs buy trade-offs
- How to reduce the dependency
- Alternative data sources (the driver fleet)
This shows you think about long-term architecture, not just an MVP.
⚖️ Trade-offs Analysis
Architecture Decision Matrix
| Decision | Option A | Option B | Chosen | Rationale |
|---|---|---|---|---|
| Location Index | PostGIS (SQL) | Redis Geo | Redis Geo | In-memory, handles 8M writes/sec |
| Geospatial Algorithm | Geohash | S2 Geometry | S2 | Variable cell sizes, better covering |
| Matching Strategy | Greedy (first available) | Global optimization | Hybrid | Balance latency vs efficiency |
| Map Data | Google Maps API | Self-hosted OSM | Hybrid | Cost vs accuracy trade-off |
| ETA Calculation | Real-time routing | Pre-computed | Hybrid | Pre-computed base + real-time adjustment |
| Driver State | PostgreSQL | Redis | Redis | Fast reads, TTL for stale data |
| Trip Storage | NoSQL | PostgreSQL | PostgreSQL | ACID for financial data |
| Real-time Updates | HTTP Polling | WebSocket | WebSocket | Lower latency, less bandwidth |
Location Index: PostGIS vs Redis Geo
┌─────────────────────────────────────────────────────────────┐
│ POSTGIS vs REDIS GEO │
├─────────────────────────────────────────────────────────────┤
│ │
│ PostGIS: │
│ + Rich geospatial queries (polygons, intersections) │
│ + Persistent storage │
│ + Complex spatial joins │
│ - Disk I/O bottleneck │
│ - ~1000 writes/sec max │
│ - Query latency 10-100ms │
│ │
│ Redis Geo: │
│ + In-memory (microsecond latency) │
│ + 100K+ writes/sec per node │
│ + Simple GEORADIUS queries │
│ - Limited query types │
│ - Memory-bound │
│ - No persistence (need backup strategy) │
│ │
│ Verdict: Redis Geo for real-time, PostGIS for analytics │
│ │
└─────────────────────────────────────────────────────────────┘
🚨 Failure Scenarios & Mitigations
Scenario 1: Location Service Lag
| Aspect | Details |
|---|---|
| Impact | Stale driver locations → bad matches, wrong ETAs |
| Cause | Network issues, Redis overload, driver app bugs |
| Detection | Monitor last_update timestamp, alert if > 30s |
| Mitigation | Location TTL (auto-expire after 30s) |
| Fallback | Use last known location + velocity vector prediction |
```python
import logging
import time
from typing import Optional

logger = logging.getLogger(__name__)

def get_driver_location(driver_id: str) -> Optional[Location]:
    driver_data = redis_client.hgetall(f"driver:{driver_id}")
    if not driver_data:
        return None
    last_update = int(driver_data.get("last_update", 0))
    age_seconds = time.time() - last_update
    if age_seconds > 30:
        # Location too stale, don't use
        logger.warning(f"Stale location for driver {driver_id}: {age_seconds:.0f}s old")
        return None
    if age_seconds > 10:
        # Predict current position based on last known velocity
        return predict_location(driver_data, age_seconds)
    return Location(
        lat=float(driver_data["lat"]),
        lng=float(driver_data["lng"])
    )
```

🔧 Raizo's Note
Location prediction uses simple physics:

new_lat = old_lat + (speed × cos(heading) × time) / 111000
new_lng = old_lng + (speed × sin(heading) × time) / (111000 × cos(lat))

111000 ≈ meters per degree of latitude. Not perfect, but better than stale data.
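A hedged sketch of that dead-reckoning step (the `speed` and `heading` fields are assumptions not stored in the earlier Redis hash; speed in m/s, heading in degrees clockwise from north):

```python
import math

def predict_location(driver_data: dict, age_seconds: float) -> tuple:
    """Dead-reckon a stale position forward along the last known velocity."""
    lat = float(driver_data["lat"])
    lng = float(driver_data["lng"])
    speed = float(driver_data.get("speed", 0.0))                    # m/s
    heading = math.radians(float(driver_data.get("heading", 0.0)))  # 0 = north
    meters_per_deg = 111_000.0  # meters per degree of latitude
    new_lat = lat + (speed * math.cos(heading) * age_seconds) / meters_per_deg
    new_lng = lng + (speed * math.sin(heading) * age_seconds) / (
        meters_per_deg * math.cos(math.radians(lat)))
    return (new_lat, new_lng)
```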
Scenario 2: Matching Service Overload (Concert Ends)
| Aspect | Details |
|---|---|
| Impact | Long wait times, failed matches, angry users |
| Cause | 50,000 people request rides simultaneously |
| Detection | Queue depth monitoring, latency spikes |
| Mitigation | Queue-based matching with priority |
| Solution | Surge pricing to balance supply/demand |
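The detection row above can be approximated with a rolling-rate comparison (a sketch; the window sizes and 3x factor are assumptions):

```python
from collections import deque

class SpikeDetector:
    """Flag a demand spike when the recent request rate exceeds a
    multiple of the longer-term baseline rate."""

    def __init__(self, window: int = 60, baseline_window: int = 900, factor: float = 3.0):
        self.window = window                  # "recent" window, seconds
        self.baseline_window = baseline_window
        self.factor = factor
        self.events = deque()                 # request timestamps (seconds)

    def record(self, ts: float):
        self.events.append(ts)
        # Drop anything older than the baseline window
        while self.events and self.events[0] < ts - self.baseline_window:
            self.events.popleft()

    def is_spike(self, now: float) -> bool:
        recent = sum(1 for t in self.events if t >= now - self.window)
        recent_rate = recent / self.window
        baseline_rate = len(self.events) / self.baseline_window
        return baseline_rate > 0 and recent_rate > self.factor * baseline_rate
```

In practice the spike signal would feed both surge pricing and the request queue described below, rather than just a boolean.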
┌─────────────────────────────────────────────────────────────┐
│ HANDLING DEMAND SPIKES │
├─────────────────────────────────────────────────────────────┤
│ │
│ Normal: 1,000 requests/sec → Direct matching │
│ │
│ Spike: 10,000 requests/sec → Queue + Surge │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 1. Detect spike (requests > 3x normal) │ │
│ │ │ │
│ │ 2. Enable surge pricing │ │
│ │ - Reduces demand (price-sensitive users wait) │ │
│ │ - Increases supply (drivers come online) │ │
│ │ │ │
│ │ 3. Queue requests with estimated wait time │ │
│ │ "Your ride will be ready in ~8 minutes" │ │
│ │ │ │
│ │ 4. Pre-position drivers for known events │ │
│ │ (concerts, sports games, conferences) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Scenario 3: Maps API Failure
| Aspect | Details |
|---|---|
| Impact | No ETA calculation, no turn-by-turn navigation |
| Cause | Google Maps outage, rate limiting, network issues |
| Detection | API error rates, latency monitoring |
| Mitigation | Cache recent routes, multiple providers |
| Fallback | Straight-line distance × 1.4 factor |
```python
def get_eta(origin: Location, destination: Location) -> int:
    """Get ETA in seconds with a fallback chain"""
    # Try primary: Google Maps
    try:
        eta = google_maps_client.get_eta(origin, destination)
        cache_route(origin, destination, eta)
        return eta
    except GoogleMapsError as e:
        logger.warning(f"Google Maps failed: {e}")
    # Try cache
    cached_eta = get_cached_eta(origin, destination)
    if cached_eta:
        return cached_eta
    # Try secondary: OSM/OSRM
    try:
        return osrm_client.get_eta(origin, destination)
    except OSRMError as e:
        logger.warning(f"OSRM failed: {e}")
    # Last resort: straight-line estimate
    distance_km = haversine_distance(origin, destination)
    avg_speed_kmh = 25  # Urban average
    eta_hours = distance_km / avg_speed_kmh
    eta_seconds = int(eta_hours * 3600 * 1.4)  # 1.4x factor for road detours
    logger.warning(f"Using fallback ETA: {eta_seconds}s")
    return eta_seconds
```

Scenario 4: Redis Cluster Failure
| Aspect | Details |
|---|---|
| Impact | Cannot find nearby drivers, matching fails |
| Cause | Node failure, network partition, memory exhaustion |
| Detection | Redis health checks, connection errors |
| Mitigation | Redis Cluster with replicas (3 replicas per shard) |
| Failover | Automatic promotion of replica (< 10 seconds) |
🔧 Raizo's Note
Redis Cluster gotchas:
- GEORADIUS does not work across slots - make sure all geo data for one query lives in the same slot
- Use hash tags: `drivers:{city}:active` → all drivers in the same city map to the same slot
- Monitor memory: Redis OOM = instant outage
- Set `maxmemory-policy volatile-ttl` to auto-evict stale locations
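Redis derives the cluster slot from the substring between the first `{` and the next `}` when one is present, so keys sharing that tag land in the same slot. A tiny sketch of the tag-extraction rule:

```python
def hash_tag(key: str) -> str:
    """Return the hash-tag portion of a Redis key, or the whole key if none.
    Per the Redis Cluster spec: first '{', then the next '}', and the tag
    must be non-empty - otherwise the entire key is hashed."""
    start = key.find("{")
    if start == -1:
        return key
    end = key.find("}", start + 1)
    if end == -1 or end == start + 1:
        return key
    return key[start + 1:end]

# All keys for one city share the tag -> same slot -> GEORADIUS works
print(hash_tag("drivers:{saigon}:active"))
print(hash_tag("cell:{saigon}:w3gvk1:drivers"))
```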
Scenario 5: Payment Gateway Failure
| Aspect | Details |
|---|---|
| Impact | Cannot complete trips, drivers not paid |
| Cause | Stripe/Braintree outage, network issues |
| Detection | Payment error rates, timeout monitoring |
| Mitigation | Multiple payment providers, retry queue |
| Fallback | Complete trip, charge later (trust-based) |
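A hedged sketch of the retry step from the table above (the gateway client and backoff parameters are illustrative):

```python
import time

def charge_with_retries(charge_fn, trip_id: str, amount_cents: int,
                        max_attempts: int = 3, base_delay: float = 1.0) -> bool:
    """Try to charge; on failure, retry with exponential backoff.
    Returns True on success, False if all attempts failed - the caller then
    queues the trip for reconciliation while the trip itself stays COMPLETED."""
    for attempt in range(max_attempts):
        try:
            charge_fn(trip_id, amount_cents)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False
```

The key property matches the diagram below: payment failure never blocks trip completion, it only changes which queue the trip lands in.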
┌─────────────────────────────────────────────────────────────┐
│ PAYMENT FAILURE HANDLING │
├─────────────────────────────────────────────────────────────┤
│ │
│ Trip completes → Payment fails │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 1. Mark trip as COMPLETED (don't block driver) │ │
│ │ │ │
│ │ 2. Queue payment for retry │ │
│ │ - Retry 3x with exponential backoff │ │
│ │ - Try alternate payment method │ │
│ │ │ │
│ │ 3. If all retries fail: │ │
│ │ - Notify rider to update payment │ │
│ │ - Block future rides until resolved │ │
│ │ - Pay driver from Uber's float │ │
│ │ │ │
│ │ 4. Reconcile later (batch job) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Key: Never block trip completion due to payment │
│ Driver experience > payment timing │
│ │
└─────────────────────────────────────────────────────────────┘
💰 Cost Estimation
Monthly Infrastructure Costs (100M MAU Scale)
| Service | Specification | Unit Cost | Monthly Cost |
|---|---|---|---|
| Location Service | 100 × c5.4xlarge (16 vCPU, 32GB) | $0.68/hr | $49,000 |
| Matching Engine | 50 × c5.2xlarge (8 vCPU, 16GB) | $0.34/hr | $12,200 |
| Trip Service | 30 × c5.xlarge (4 vCPU, 8GB) | $0.17/hr | $3,700 |
| WebSocket Servers | 100 × c5.xlarge | $0.17/hr | $12,200 |
| Redis Geo Cluster | 200GB, 6 nodes (3 primary + 3 replica) | $0.068/GB/hr | $8,000 |
| Redis Cache | 100GB for sessions, locks | $0.068/GB/hr | $5,000 |
| Trip Database | PostgreSQL 10TB (Multi-AZ) | $0.115/GB/mo | $5,000 |
| Analytics DB | ClickHouse 50TB | $0.10/GB/mo | $5,000 |
| Kafka Cluster | 10 brokers (m5.2xlarge) | $0.384/hr | $12,000 |
| Maps API | 1B requests/month (with discounts) | $0.50/1000 | $500,000 |
| Load Balancers | 4 × ALB (high throughput) | $0.025/hr + LCU | $5,000 |
| Data Transfer | 500TB egress | $0.05/GB | $25,000 |
Cost Summary
| Category | Monthly Cost | % of Total |
|---|---|---|
| Compute (Services) | $77,100 | 12% |
| Caching (Redis) | $13,000 | 2% |
| Database (PostgreSQL + ClickHouse) | $10,000 | 2% |
| Message Queue (Kafka) | $12,000 | 2% |
| Maps API | $500,000 | 78% |
| Networking | $30,000 | 5% |
| Total | ~$642,000 | 100% |
⚠️ Maps API Dominates Cost!
78% of the infrastructure cost is the Maps API. This is why Uber invests heavily in:
- Internal maps team
- Pre-computed ETAs
- Driver fleet data for traffic
- OSM-based routing for non-critical paths
With a hybrid approach, Maps cost can be brought down to ~$100K/month.
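The hybrid approach can be sketched as a tiered lookup: serve ETAs from a pre-computed cell-to-cell table adjusted by a live traffic factor (derived from driver GPS), and only fall back to the paid Maps API on a cache miss. The table contents, factor values, and function names here are illustrative assumptions:

```python
# Hypothetical hybrid ETA lookup. In practice the pre-computed table and
# traffic factors would be produced by offline jobs over driver GPS data.
PRECOMPUTED_ETA = {("cell_a", "cell_b"): 420}   # seconds at free-flow speed
TRAFFIC_FACTOR = {"cell_b": 1.5}                # live congestion multiplier

def eta_seconds(src_cell, dst_cell, maps_api_fallback):
    """Return an ETA estimate, touching the paid API only on cache miss."""
    base = PRECOMPUTED_ETA.get((src_cell, dst_cell))
    if base is None:
        return maps_api_fallback(src_cell, dst_cell)  # rare, paid path
    return base * TRAFFIC_FACTOR.get(dst_cell, 1.0)   # free, common path
```

Because popular corridors dominate traffic, even a modest table hit rate keeps most requests off the paid path.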
Cost Optimization Strategies
┌─────────────────────────────────────────────────────────────┐
│ COST OPTIMIZATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. Maps API Reduction: │
│ - Pre-compute cell-to-cell ETAs (offline) │
│ - Use driver GPS for traffic instead of API │
│ - Cache routes for popular corridors │
│ - Potential savings: $400K/month │
│ │
│ 2. Compute Optimization: │
│ - Reserved Instances (1-year): 30% savings │
│ - Spot Instances for analytics: 70% savings │
│ - Right-sizing based on actual usage │
│ - Potential savings: $25K/month │
│ │
│ 3. Data Transfer: │
│ - Keep services in same AZ when possible │
│ - Compress WebSocket messages │
│ - Use regional endpoints │
│ - Potential savings: $10K/month │
│ │
│ Optimized Total: ~$200K/month (vs $642K baseline) │
│ │
└─────────────────────────────────────────────────────────────┘
Cost per Trip Metrics
Monthly cost: $642,000 (baseline) or $200,000 (optimized)
Monthly trips: 15M × 30 = 450M trips
Baseline cost per trip = $642,000 / 450M = $0.0014
Optimized cost per trip = $200,000 / 450M = $0.0004
Average fare: $15
Platform fee (25%): $3.75
Infrastructure cost: $0.0014
Gross margin per trip: $3.75 - $0.0014 = $3.7486 (99.96% margin!)
🎓 Giáo sư Tom
Ride-sharing unit economics are very healthy on the infrastructure side. The main costs are:
- Driver incentives (subsidies, bonuses)
- Customer acquisition (marketing, promos)
- Insurance & compliance
Infrastructure accounts for < 1% of revenue. This is why ride-sharing can scale globally.
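The per-trip figures above can be reproduced with a quick script (numbers taken directly from the baseline table; the rounding is only for display):

```python
# Reproduce the cost-per-trip metrics from the baseline numbers above.
monthly_cost = 642_000                     # baseline infrastructure, USD
monthly_trips = 15_000_000 * 30            # 15M trips/day -> 450M/month
cost_per_trip = monthly_cost / monthly_trips   # ~$0.0014

fare = 15.0
platform_fee = fare * 0.25                 # $3.75 Uber take per trip
margin = platform_fee - cost_per_trip      # ~$3.7486
margin_pct = margin / platform_fee * 100   # ~99.96%
```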
🎯 Interview Checklist
Must-Mention Items ✅
| Topic | Key Points |
|---|---|
| Scale Estimation | 100M MAU, 8M location updates/sec, 15M trips/day |
| Geospatial Indexing | Why SQL doesn't work, Quadtree/Geohash/S2 comparison |
| S2 Geometry | Cell hierarchy, Level 14 for neighborhood-level |
| Driver Matching | Greedy vs global optimization, scoring function |
| Driver Locking | Redis SETNX, prevent double-dispatch |
| ETA Calculation | Pre-computed + real-time adjustment |
| Maps Trade-off | Google API cost vs self-hosted OSM |
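The "Driver Locking" row above deserves a concrete sketch. In production this is a Redis `SET driver:{id}:lock <ride_id> NX EX 15`; the in-memory stand-in below mimics those SETNX-with-TTL semantics so the double-dispatch behavior is visible (class and key names are assumptions):

```python
import time

class DriverLock:
    """In-memory stand-in for Redis `SET driver:{id}:lock NX EX <ttl>`.

    Prevents double-dispatch: if two concurrent matches target the same
    driver, only the first acquire succeeds until release or TTL expiry.
    """
    def __init__(self, clock=time.monotonic):
        self._locks = {}     # driver_id -> (ride_id, expires_at)
        self._clock = clock  # injectable for testing

    def acquire(self, driver_id, ride_id, ttl=15):
        now = self._clock()
        holder = self._locks.get(driver_id)
        if holder and holder[1] > now:       # still held by another ride
            return False
        self._locks[driver_id] = (ride_id, now + ttl)  # NX + EX in one step
        return True

    def release(self, driver_id, ride_id):
        # Only the lock holder may release (compare ride_id, like a lock token)
        if self._locks.get(driver_id, (None,))[0] == ride_id:
            del self._locks[driver_id]
```

Note the release path checks the ride_id token, mirroring the Redis pattern of compare-and-delete to avoid releasing someone else's lock.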
Bonus Points 🌟
- Surge Pricing: Dynamic pricing algorithm based on supply/demand ratio
- Driver Pre-positioning: ML model to predict demand hotspots
- Fraud Detection: Fake GPS, driver collusion, promo abuse
- Multi-destination: Optimize route for multiple stops
- Pool/Share Rides: Matching multiple riders going same direction
- Driver Earnings: Fair distribution algorithm
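The surge-pricing bonus point above can be illustrated with a toy model: the multiplier grows with the open-requests-to-available-drivers ratio, quantized in 0.25x steps and capped. The step, cap, and curve shape are illustrative assumptions, not Uber's actual pricing function:

```python
def surge_multiplier(open_requests, available_drivers,
                     base=1.0, cap=3.0, step=0.25):
    """Toy surge model: price rises with the demand/supply ratio.

    The step and cap are illustrative; real surge curves are tuned per market.
    """
    if available_drivers == 0:
        return cap                             # no supply at all: max surge
    ratio = open_requests / available_drivers
    if ratio <= 1.0:
        return base                            # supply meets demand
    surge = base + step * round(ratio - 1.0)   # quantize to 0.25x increments
    return min(round(surge, 2), cap)
```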
Common Mistakes ❌
| Mistake | Why It's Wrong | Better Approach |
|---|---|---|
| SQL for location queries | Can't handle 8M writes/sec | Redis Geo (in-memory) |
| Simple lat/lng range | Inaccurate, slow | S2 Geometry cells |
| First-available matching | Suboptimal, unfair | Multi-factor scoring |
| No driver locking | Race conditions | Redis distributed lock |
| Google Maps for everything | $500K+/month | Hybrid approach |
| Sync ETA calculation | Blocks matching | Pre-computed + cache |
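The "first-available matching" mistake in the table has a simple fix: a multi-factor score. The sketch below normalizes each factor to [0, 1] and combines them with weights; the weights, horizons, and factor choices are hypothetical, meant only to show the shape of the scoring function:

```python
def driver_score(eta_min, rating, acceptance_rate, idle_min,
                 w_eta=0.5, w_rating=0.2, w_accept=0.2, w_idle=0.1):
    """Hypothetical weighted score; higher is better. Weights are illustrative."""
    eta_term = max(0.0, 1 - eta_min / 15)   # closer is better, 15-min horizon
    idle_term = min(1.0, idle_min / 30)     # fairness: favor long-idle drivers
    return (w_eta * eta_term
            + w_rating * (rating / 5)       # 5-star scale -> [0, 1]
            + w_accept * acceptance_rate
            + w_idle * idle_term)

def best_driver(candidates):
    """candidates: list of (driver_id, eta_min, rating, acceptance_rate, idle_min)."""
    return max(candidates, key=lambda c: driver_score(*c[1:]))[0]
```

ETA still dominates (w_eta = 0.5), but rating, acceptance rate, and idle time break ties and keep dispatch fair, which pure nearest-driver matching cannot do.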
⚠️ Interview Red Flags
- Not knowing why SQL BETWEEN doesn't work for geo queries
- Not mentioning the driver locking mechanism
- Not discussing Maps API cost at scale
- A single point of failure in the architecture
- No fallback strategy for external dependencies
🎓 Key Takeaways
- Geospatial indexing is the core challenge - S2 Geometry is the industry standard
- 8M writes/sec demands in-memory solutions (Redis), not a traditional DB
- Driver matching must balance multiple factors, not just distance
- Distributed locking is critical to prevent race conditions
- Maps API is the biggest cost driver - a hybrid approach saves millions
- Real-time updates via WebSocket, not polling
- Graceful degradation for every external dependency
🔗 Navigation
Related Case Studies
| Case Study | Key Learning | Link |
|---|---|---|
| Twitter | Fan-out patterns, timeline caching | Design Twitter → |
| YouTube | Video processing, CDN, adaptive streaming | Design YouTube → |
| WhatsApp | Real-time messaging, E2EE, connection management | Design WhatsApp → |