🚀 Deployment Patterns
Level: Advanced · Solves: Deploying ML models to production safely and efficiently
ML Deployment Challenges
💡 Professor Tom
"Training a model is the easy part. Deploying it reliably is where the real engineering begins." ML deployment differs from ordinary software deployment because a model can fail silently: predictions are wrong, but no error is raised.
┌─────────────────────────────────────────────────────────────────┐
│ ML DEPLOYMENT CHALLENGES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Software: │
│ • Code works or crashes │
│ • Errors are explicit │
│ • Testing is deterministic │
│ │
│ ML Systems: │
│ • Model can be "wrong" without crashing │
│ • Errors are statistical (accuracy drops) │
│ • Testing requires data distribution awareness │
│ • Performance degrades over time (drift) │
│ • Dependencies: code + data + model + features │
│ │
└─────────────────────────────────────────────────────────────────┘

Serving Patterns
Batch vs Online Serving
┌─────────────────────────────────────────────────────────────────┐
│ BATCH vs ONLINE SERVING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ BATCH SERVING │
│ ───────────── │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Data │───►│ Model │───►│ Results │───►│ Storage │ │
│ │ (batch) │ │ │ │ │ │ (DB/S3) │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ • Scheduled (hourly, daily) │
│ • High throughput │
│ • Latency not critical │
│ • Examples: recommendations, risk scores │
│ │
│ ONLINE SERVING │
│ ────────────── │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Request │───►│ Model │───►│Response │ │
│ │ (API) │ │ Service │ │ (JSON) │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ • Real-time (milliseconds) │
│ • Low latency critical │
│ • Per-request predictions │
│ • Examples: fraud detection, search ranking │
│ │
└─────────────────────────────────────────────────────────────────┘

Choosing Serving Pattern
| Factor | Batch | Online |
|---|---|---|
| Latency requirement | Hours/minutes OK | Milliseconds required |
| Prediction freshness | Can be stale | Must be real-time |
| Feature availability | All features available | Limited to real-time features |
| Cost | Lower (scheduled) | Higher (always-on) |
| Complexity | Simpler | More complex |
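The batch pattern in the table above amounts to a scheduled job: load a batch of entities, score them in one bulk call, and persist the results for later lookup. A minimal sketch, where `ThresholdModel` and the plain dict `store` are stand-ins for a real model and the DB/S3 destination:

```python
from datetime import datetime, timezone

class ThresholdModel:
    """Stand-in model: scores each feature row with a simple rule."""
    def predict(self, rows):
        return [1 if sum(row) > 1.0 else 0 for row in rows]

def run_batch_job(rows_by_id, model, store):
    """Score every entity in the batch and write results to the store."""
    ids = list(rows_by_id)
    # One vectorized call over the whole batch: this is where batch
    # serving gets its throughput advantage over per-request inference.
    preds = model.predict([rows_by_id[i] for i in ids])
    scored_at = datetime.now(timezone.utc).isoformat()
    for entity_id, pred in zip(ids, preds):
        store[entity_id] = {"prediction": pred, "scored_at": scored_at}
    return len(ids)
```

In production this function would run on a scheduler (cron, Airflow) and write to a fast lookup store, but the shape of the job is the same.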
Hybrid Pattern
┌─────────────────────────────────────────────────────────────────┐
│ HYBRID SERVING PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ BATCH LAYER │ │
│ │ • Pre-compute predictions for known entities │ │
│ │ • Store in fast lookup (Redis, DynamoDB) │ │
│ │ • Update periodically (hourly/daily) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SERVING LAYER │ │
│ │ Request → Cache lookup → Hit? → Return cached │ │
│ │ │ │ │
│ │ Miss │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Online inference │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Benefits: │
│ • Fast response for known entities (cache hit) │
│ • Fallback to online for new entities │
│ • Cost-effective │
│ │
└─────────────────────────────────────────────────────────────────┘

Deployment Strategies
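A minimal sketch of the hybrid serving layer above: cache lookup first, online inference as the fallback. A plain dict stands in for Redis/DynamoDB, and `model` is a hypothetical object with a scikit-learn-style `predict`:

```python
def hybrid_predict(entity_id, features, cache, model):
    """Return a pre-computed batch prediction when available,
    otherwise fall back to online inference."""
    cached = cache.get(entity_id)
    if cached is not None:
        # Cache hit: prediction was pre-computed by the batch layer.
        return cached
    # Cache miss (e.g. a new entity): run online inference.
    prediction = model.predict([features])[0]
    cache[entity_id] = prediction  # optionally warm the cache
    return prediction
```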
Blue-Green Deployment
┌─────────────────────────────────────────────────────────────────┐
│ BLUE-GREEN DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Blue is live │
│ ┌─────────────┐ │
│ │ Traffic │────────────────►┌─────────────┐ │
│ │ (100%) │ │ Blue (v1) │ │
│ └─────────────┘ └─────────────┘ │
│ ┌─────────────┐ │
│ │ Green (v2) │ ← Deploy new │
│ └─────────────┘ │
│ │
│ Step 2: Switch to Green │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Traffic │ │ Blue (v1) │ ← Standby │
│ │ (100%) │────────────────►┌─────────────┐ │
│ └─────────────┘ │ Green (v2) │ │
│ └─────────────┘ │
│ │
│ Rollback: Switch back to Blue instantly │
│ │
└─────────────────────────────────────────────────────────────────┘

Canary Deployment
┌─────────────────────────────────────────────────────────────────┐
│ CANARY DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: 5% traffic to canary │
│ ┌─────────────┐ 95% ┌─────────────┐ │
│ │ Traffic │────────────►│ Stable v1 │ │
│ │ │ 5% ├─────────────┤ │
│ │ │────────────►│ Canary v2 │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Phase 2: Monitor metrics │
│ • Latency: canary vs stable │
│ • Error rate: canary vs stable │
│ • Business metrics: conversion, engagement │
│ │
│ Phase 3: Gradual rollout (if metrics OK) │
│ 5% → 10% → 25% → 50% → 100% │
│ │
│ Rollback: If metrics degrade, route 100% to stable │
│ │
└─────────────────────────────────────────────────────────────────┘

Shadow Mode
🔍 Shadow Mode
Shadow mode runs the new model in parallel with the production model, but never serves its predictions to users. It is used to validate the new model on real traffic before deployment.
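A minimal sketch of this pattern, assuming `primary` and `shadow` are hypothetical model objects exposing a `predict` method:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def serve(request_features, primary, shadow):
    """Serve the primary model's prediction; run the shadow model on
    the same input and only log its output for offline comparison."""
    result = primary.predict(request_features)
    try:
        # A shadow failure must never affect the user-facing response.
        shadow_result = shadow.predict(request_features)
        log.info("primary=%s shadow=%s", result, shadow_result)
    except Exception:
        log.exception("shadow model failed")
    return result
```

In a real service the shadow call would usually run asynchronously (or from mirrored traffic) so it adds no latency to the primary path.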
┌─────────────────────────────────────────────────────────────────┐
│ SHADOW MODE DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ Request │ │
│ └─────────────┘ │
│ │ │
│ ├────────────────────────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Production │ │ Shadow │ │
│ │ Model v1 │ │ Model v2 │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ │ Serve to user │ Log only │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Response │ │ Metrics │ │
│ │ to User │ │ & Logs │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Compare: │
│ • Prediction distribution │
│ • Latency │
│ • Error rates │
│ • Business metric correlation │
│ │
└─────────────────────────────────────────────────────────────────┘

Model Serving Infrastructure
Serving Options
| Option | Pros | Cons | Best For |
|---|---|---|---|
| REST API | Simple, universal | Higher latency | General purpose |
| gRPC | Fast, typed | More complex | High-performance |
| Serverless | Auto-scaling, pay-per-use | Cold starts | Variable traffic |
| Kubernetes | Scalable, portable | Complex setup | Large scale |
| Edge | Low latency | Limited compute | IoT, mobile |
FastAPI Model Server
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()

# Load model at startup
model = joblib.load("model.pkl")
scaler = joblib.load("scaler.pkl")

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: int
    probability: float
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        # Preprocess
        features = np.array(request.features).reshape(1, -1)
        features_scaled = scaler.transform(features)

        # Predict
        prediction = model.predict(features_scaled)[0]
        probability = model.predict_proba(features_scaled)[0].max()

        return PredictionResponse(
            prediction=int(prediction),
            probability=float(probability),
            model_version="v1.2.0",
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy"}
```

Docker Deployment
```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model.pkl scaler.pkl ./
COPY app.py ./

# Expose port
EXPOSE 8000

# Run server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: model-server
        image: ml-model:v1.2.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
```

Rollback Strategies
Rollback Decision Matrix
| Signal | Severity | Action |
|---|---|---|
| Error rate spike | High | Immediate rollback |
| Latency increase >2x | High | Immediate rollback |
| Business metric drop >10% | Medium | Investigate, then rollback |
| Prediction distribution shift | Low | Monitor, gradual rollback |
Automated Rollback
```python
import time
from dataclasses import dataclass

@dataclass
class RollbackConfig:
    error_rate_threshold: float = 0.05
    latency_p99_threshold_ms: float = 200
    check_interval_seconds: int = 60
    min_samples: int = 100

def should_rollback(metrics: dict, config: RollbackConfig) -> bool:
    """Determine if rollback is needed based on metrics."""
    if metrics["sample_count"] < config.min_samples:
        return False  # Not enough data
    if metrics["error_rate"] > config.error_rate_threshold:
        print(f"Error rate {metrics['error_rate']:.2%} exceeds threshold")
        return True
    if metrics["latency_p99"] > config.latency_p99_threshold_ms:
        print(f"Latency {metrics['latency_p99']}ms exceeds threshold")
        return True
    return False

def monitor_and_rollback(deployment, config: RollbackConfig):
    """Monitor deployment and rollback if needed."""
    while True:
        metrics = deployment.get_metrics()
        if should_rollback(metrics, config):
            print("Initiating rollback...")
            deployment.rollback()
            break
        time.sleep(config.check_interval_seconds)
```

A/B Testing for ML
A/B Test Setup
┌─────────────────────────────────────────────────────────────────┐
│ ML A/B TESTING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ Traffic │ │
│ └─────────────┘ │
│ │ │
│ │ Random assignment │
│ │ (user_id % 100) │
│ │ │
│ ┌────┴────┐ │
│ │ │ │
│ ▼ ▼ │
│ Control Treatment │
│ (50%) (50%) │
│ │ │ │
│ ▼ ▼ │
│ Model A Model B │
│ │ │ │
│ └────┬────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Metrics │ │
│ │ Collection │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Statistical significance test │
│ (t-test, chi-squared, etc.) │
│ │
└─────────────────────────────────────────────────────────────────┘

A/B Test Considerations for ML
| Consideration | Description |
|---|---|
| Sample size | Need enough samples for statistical power |
| Duration | Run long enough to capture patterns |
| Novelty effect | Users may behave differently with new model |
| Network effects | User interactions may contaminate groups |
| Metric selection | Choose metrics that reflect business value |
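The random assignment (`user_id % 100`) and the significance test from the diagram can be sketched with the standard library. The two-proportion z-test below is one common choice for conversion-rate comparisons; the thresholds and helper names are illustrative:

```python
import math

def assign_group(user_id: int, treatment_pct: int = 50) -> str:
    """Deterministic assignment: the same user always lands in the
    same group, which keeps the experience consistent across visits."""
    return "treatment" if user_id % 100 < treatment_pct else "control"

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """z-test for a difference in conversion rates.

    conv_*: number of conversions; n_*: number of users per group.
    Returns (z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

Note that hashing on raw `user_id` modulo 100 reuses the same buckets across experiments; real platforms usually hash `user_id` together with an experiment salt to decorrelate tests.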
Best Practices
Deployment Checklist
| Phase | Check | Why |
|---|---|---|
| Pre-deploy | Model validated offline | Catch obvious issues |
| Pre-deploy | Feature pipeline tested | Ensure features available |
| Pre-deploy | Rollback plan ready | Quick recovery |
| Deploy | Shadow mode first | Validate on real traffic |
| Deploy | Canary with monitoring | Gradual rollout |
| Post-deploy | Metrics dashboards | Track performance |
| Post-deploy | Alerts configured | Catch degradation |
Common Pitfalls
🚨 Deployment Anti-Patterns
- Big bang deployment: 100% traffic to new model immediately
- No rollback plan: Can't recover from bad deployment
- Missing monitoring: Don't know when model fails
- Feature skew: Training features ≠ serving features
- No shadow testing: Deploy without real traffic validation
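Feature skew (the train/serve mismatch listed above) can be caught with a parity check that runs both feature pipelines on the same raw records and compares outputs. A sketch, where `offline_features` and `online_features` are hypothetical stand-ins for the training and serving pipelines:

```python
def check_feature_parity(raw_records, offline_features, online_features,
                         tolerance: float = 1e-6):
    """Run the offline (training) and online (serving) feature pipelines
    on identical raw inputs and collect records whose vectors disagree."""
    mismatches = []
    for record in raw_records:
        off = offline_features(record)
        on = online_features(record)
        if len(off) != len(on) or any(
            abs(a - b) > tolerance for a, b in zip(off, on)
        ):
            mismatches.append((record, off, on))
    return mismatches
```

Running this check in CI, and again during shadow mode, turns silent feature skew into an explicit, testable failure.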
✅ Operational Checklist
Pre-Deployment Gate
| Check | Required | Owner | Notes |
|---|---|---|---|
| Offline evaluation passed | ✓ | ML Engineer | See evaluation.md |
| Feature pipeline validated | ✓ | Data Engineer | No train/serve skew |
| Model registered | ✓ | ML Engineer | See governance.md |
| Rollback plan documented | ✓ | ML Lead | Previous version ready |
| Monitoring configured | ✓ | ML Engineer | Dashboards, alerts |
| Load test passed | ✓ | Platform | Can handle expected traffic |
Deployment Execution Gate
| Check | Required | Owner |
|---|---|---|
| Shadow mode validated | ✓ | ML Engineer |
| Canary started (5-10%) | ✓ | ML Engineer |
| Metrics monitored for 24h | ✓ | ML Engineer |
| No SLA violations | ✓ | ML Lead |
| Business metrics stable | ✓ | PM |
Post-Deployment Gate
| Check | Required | Frequency |
|---|---|---|
| Gradual rollout to 100% | ✓ | If metrics OK |
| Full monitoring active | ✓ | Continuous |
| Documentation updated | ✓ | Within 1 day |
| Runbook updated | ✓ | Within 1 day |
| Post-mortem if issues | As needed | Within 1 week |
📎 Cross-References
- 📎 Experimentation - From experiment to deployment
- 📎 ML Monitoring - Post-deployment monitoring
- 📎 ML Governance - Approval gates
- 📎 Feature Engineering - Feature pipeline for serving
- 📎 System Design - ML system architecture patterns