🚀 Deployment Patterns
Level: Advanced
Solves: Deploying ML models to production safely and efficiently
ML Deployment Challenges
💡 Professor Tom
"Training a model is the easy part. Deploying it reliably is where the real engineering begins." ML deployment differs from ordinary software deployment because a model can fail silently: predictions go wrong, but no error is raised.
┌─────────────────────────────────────────────────────────────────┐
│ ML DEPLOYMENT CHALLENGES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Software: │
│ • Code works or crashes │
│ • Errors are explicit │
│ • Testing is deterministic │
│ │
│ ML Systems: │
│ • Model can be "wrong" without crashing │
│ • Errors are statistical (accuracy drops) │
│ • Testing requires data distribution awareness │
│ • Performance degrades over time (drift) │
│ • Dependencies: code + data + model + features │
│ │
└─────────────────────────────────────────────────────────────────┘
Serving Patterns
Batch vs Online Serving
┌─────────────────────────────────────────────────────────────────┐
│ BATCH vs ONLINE SERVING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ BATCH SERVING │
│ ───────────── │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Data │───►│ Model │───►│ Results │───►│ Storage │ │
│ │ (batch) │ │ │ │ │ │ (DB/S3) │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ • Scheduled (hourly, daily) │
│ • High throughput │
│ • Latency not critical │
│ • Examples: recommendations, risk scores │
│ │
│ ONLINE SERVING │
│ ────────────── │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Request │───►│ Model │───►│Response │ │
│ │ (API) │ │ Service │ │ (JSON) │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ • Real-time (milliseconds) │
│ • Low latency critical │
│ • Per-request predictions │
│ • Examples: fraud detection, search ranking │
│ │
└─────────────────────────────────────────────────────────────────┘
Choosing Serving Pattern
| Factor | Batch | Online |
|---|---|---|
| Latency requirement | Hours/minutes OK | Milliseconds required |
| Prediction freshness | Can be stale | Must be real-time |
| Feature availability | All features available | Limited to real-time features |
| Cost | Lower (scheduled) | Higher (always-on) |
| Complexity | Simpler | More complex |
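As a concrete contrast before the hybrid pattern, a minimal batch-serving job might look like the sketch below (the paths, the FEATURES list, and the schema are hypothetical; scheduling would come from cron or an orchestrator such as Airflow):

```python
import joblib
import pandas as pd

FEATURES = ["age", "tenure_days", "activity_score"]  # hypothetical feature columns

# Load the trained model and today's entities to score (illustrative paths)
model = joblib.load("model.pkl")
df = pd.read_parquet("users_to_score.parquet")

# One vectorized call over the whole batch: throughput matters, latency does not
df["risk_score"] = model.predict_proba(df[FEATURES])[:, 1]

# Persist for downstream lookup (DB, S3, or the cache used by the hybrid pattern)
df[["user_id", "risk_score"]].to_parquet("scores.parquet")
```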
Hybrid Pattern
┌─────────────────────────────────────────────────────────────────┐
│ HYBRID SERVING PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ BATCH LAYER │ │
│ │ • Pre-compute predictions for known entities │ │
│ │ • Store in fast lookup (Redis, DynamoDB) │ │
│ │ • Update periodically (hourly/daily) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SERVING LAYER │ │
│ │ Request → Cache lookup → Hit? → Return cached │ │
│ │ │ │ │
│ │ Miss │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Online inference │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Benefits: │
│ • Fast response for known entities (cache hit) │
│ • Fallback to online for new entities │
│ • Cost-effective │
│ │
└─────────────────────────────────────────────────────────────────┘
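A sketch of the serving-layer logic with Redis as the lookup store (the key naming, TTL, and the `model`/`features` arguments are assumptions):

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 3600  # batch layer refreshes entries; the TTL is a safety net

def get_prediction(entity_id: str, model, features) -> float:
    """Serve a pre-computed batch prediction when available, else infer online."""
    cached = cache.get(f"pred:{entity_id}")
    if cached is not None:
        return json.loads(cached)  # cache hit: value pre-computed by the batch layer

    # Cache miss (e.g., a brand-new entity): fall back to online inference
    prediction = float(model.predict([features])[0])
    cache.set(f"pred:{entity_id}", json.dumps(prediction), ex=CACHE_TTL_SECONDS)
    return prediction
```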
Deployment Strategies
Blue-Green Deployment
┌─────────────────────────────────────────────────────────────────┐
│ BLUE-GREEN DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Blue is live │
│ ┌─────────────┐ │
│ │ Traffic │────────────────►┌─────────────┐ │
│ │ (100%) │ │ Blue (v1) │ │
│ └─────────────┘ └─────────────┘ │
│ ┌─────────────┐ │
│ │ Green (v2) │ ← Deploy new │
│ └─────────────┘ │
│ │
│ Step 2: Switch to Green │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Traffic │ │ Blue (v1) │ ← Standby │
│ │ (100%) │────────────────►┌─────────────┐ │
│ └─────────────┘ │ Green (v2) │ │
│ └─────────────┘ │
│ │
│ Rollback: Switch back to Blue instantly │
│ │
└─────────────────────────────────────────────────────────────────┘
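The switch itself is usually a load-balancer or DNS change, which is exactly what makes rollback instant; a toy application-level sketch of the same idea (the hostnames are hypothetical):

```python
# Hypothetical hostnames for the two identical environments
ENDPOINTS = {
    "blue": "http://ml-model-blue:8000",    # v1, currently live
    "green": "http://ml-model-green:8000",  # v2, freshly deployed and warmed up
}

# The single switch; in practice this lives in config or service discovery
# so it can be flipped, and flipped back, without redeploying anything.
ACTIVE_COLOR = "blue"

def serving_endpoint() -> str:
    return ENDPOINTS[ACTIVE_COLOR]
```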
Canary Deployment
┌─────────────────────────────────────────────────────────────────┐
│ CANARY DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: 5% traffic to canary │
│ ┌─────────────┐ 95% ┌─────────────┐ │
│ │ Traffic │────────────►│ Stable v1 │ │
│ │ │ 5% ├─────────────┤ │
│ │ │────────────►│ Canary v2 │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Phase 2: Monitor metrics │
│ • Latency: canary vs stable │
│ • Error rate: canary vs stable │
│ • Business metrics: conversion, engagement │
│ │
│ Phase 3: Gradual rollout (if metrics OK) │
│ 5% → 10% → 25% → 50% → 100% │
│ │
│ Rollback: If metrics degrade, route 100% to stable │
│ │
└─────────────────────────────────────────────────────────────────┘
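Traffic splitting normally lives in the load balancer or service mesh, but the assignment logic itself fits in a few lines; a sketch using hash bucketing (the diagram's `user_id % 100` also works when numeric IDs are uniformly distributed):

```python
import hashlib

def route_request(user_id: str, canary_percent: int = 5) -> str:
    """Deterministically send a stable slice of users to the canary model."""
    # Hashing maps each user to a stable bucket in [0, 100)
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary-v2" if bucket < canary_percent else "stable-v1"

print(route_request("user-42"))      # phase 1: 5% canary
print(route_request("user-42", 25))  # later phases widen the slice; earlier
                                     # canary users stay in the canary
```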
Shadow Mode
🔍 Shadow Mode
Shadow mode runs the new model in parallel with the production model, but never serves its predictions to users. It is used to validate the model on real traffic before deploying it.
┌─────────────────────────────────────────────────────────────────┐
│ SHADOW MODE DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ Request │ │
│ └─────────────┘ │
│ │ │
│ ├────────────────────────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Production │ │ Shadow │ │
│ │ Model v1 │ │ Model v2 │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ │ Serve to user │ Log only │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Response │ │ Metrics │ │
│ │ to User │ │ & Logs │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Compare: │
│ • Prediction distribution │
│ • Latency │
│ • Error rates │
│ • Business metric correlation │
│ │
└─────────────────────────────────────────────────────────────────┘
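A minimal sketch of the pattern in application code, assuming two already-loaded model objects; in production the shadow call is usually made asynchronously so it can never add user-facing latency:

```python
import logging
import time

logger = logging.getLogger("shadow")

def predict_with_shadow(features, production_model, shadow_model):
    """Serve the production prediction; log the shadow prediction for offline comparison."""
    result = production_model.predict(features)

    try:
        start = time.perf_counter()
        shadow_result = shadow_model.predict(features)
        shadow_latency_ms = (time.perf_counter() - start) * 1000
        logger.info("shadow_compare prod=%s shadow=%s shadow_latency_ms=%.1f",
                    result, shadow_result, shadow_latency_ms)
    except Exception:
        # A shadow failure must never affect the user-facing response
        logger.exception("shadow model failed")

    return result  # only the production result is ever served
```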
Model Serving Infrastructure
Serving Options
| Option | Pros | Cons | Best For |
|---|---|---|---|
| REST API | Simple, universal | Higher latency | General purpose |
| gRPC | Fast, typed | More complex | High-performance |
| Serverless | Auto-scaling, pay-per-use | Cold starts | Variable traffic |
| Kubernetes | Scalable, portable | Complex setup | Large scale |
| Edge | Low latency | Limited compute | IoT, mobile |
FastAPI Model Server
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()

# Load model at startup
model = joblib.load("model.pkl")
scaler = joblib.load("scaler.pkl")

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: int
    probability: float
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        # Preprocess
        features = np.array(request.features).reshape(1, -1)
        features_scaled = scaler.transform(features)

        # Predict
        prediction = model.predict(features_scaled)[0]
        probability = model.predict_proba(features_scaled)[0].max()

        return PredictionResponse(
            prediction=int(prediction),
            probability=float(probability),
            model_version="v1.2.0",
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
return {"status": "healthy"}Docker Deployment
Docker Deployment
```dockerfile
FROM python:3.10-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model and code
COPY model.pkl scaler.pkl ./
COPY app.py ./
# Expose port
EXPOSE 8000
# Run server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]Kubernetes Deployment
Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
        - name: model-server
          image: ml-model:v1.2.0
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
```
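Since a Deployment uses the RollingUpdate strategy by default, a bad release can also be reverted with `kubectl rollout undo deployment/ml-model-server`, which restores the previous ReplicaSet.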
Rollback Strategies
Rollback Decision Matrix
| Signal | Severity | Action |
|---|---|---|
| Error rate spike | High | Immediate rollback |
| Latency increase >2x | High | Immediate rollback |
| Business metric drop >10% | Medium | Investigate, then rollback |
| Prediction distribution shift | Low | Monitor, gradual rollback |
Automated Rollback
```python
import time
from dataclasses import dataclass

@dataclass
class RollbackConfig:
    error_rate_threshold: float = 0.05
    latency_p99_threshold_ms: float = 200
    check_interval_seconds: int = 60
    min_samples: int = 100

def should_rollback(metrics: dict, config: RollbackConfig) -> bool:
    """Determine if rollback is needed based on metrics."""
    if metrics["sample_count"] < config.min_samples:
        return False  # Not enough data

    if metrics["error_rate"] > config.error_rate_threshold:
        print(f"Error rate {metrics['error_rate']:.2%} exceeds threshold")
        return True

    if metrics["latency_p99"] > config.latency_p99_threshold_ms:
        print(f"Latency {metrics['latency_p99']}ms exceeds threshold")
        return True

    return False

def monitor_and_rollback(deployment, config: RollbackConfig):
    """Monitor deployment and rollback if needed."""
    while True:
        metrics = deployment.get_metrics()
        if should_rollback(metrics, config):
            print("Initiating rollback...")
            deployment.rollback()
            break
        time.sleep(config.check_interval_seconds)
```
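A runnable usage sketch; `StubDeployment` and its metric values are illustrative stand-ins for a real serving-platform client:

```python
class StubDeployment:
    """Hypothetical stand-in for a serving-platform client."""

    def get_metrics(self) -> dict:
        # An unhealthy snapshot: 8% errors trips the 5% default threshold
        return {"sample_count": 500, "error_rate": 0.08, "latency_p99": 120}

    def rollback(self):
        print("Routing 100% of traffic back to the previous version")

monitor_and_rollback(StubDeployment(), RollbackConfig())
```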
A/B Testing for ML
A/B Test Setup
┌─────────────────────────────────────────────────────────────────┐
│ ML A/B TESTING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ Traffic │ │
│ └─────────────┘ │
│ │ │
│ │ Random assignment │
│ │ (user_id % 100) │
│ │ │
│ ┌────┴────┐ │
│ │ │ │
│ ▼ ▼ │
│ Control Treatment │
│ (50%) (50%) │
│ │ │ │
│ ▼ ▼ │
│ Model A Model B │
│ │ │ │
│ └────┬────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Metrics │ │
│ │ Collection │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Statistical significance test │
│ (t-test, chi-squared, etc.) │
│ │
└─────────────────────────────────────────────────────────────────┘
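A sketch of the final significance step with `scipy.stats.chi2_contingency` on conversion counts (the counts are made-up illustration data):

```python
from scipy.stats import chi2_contingency

# Hypothetical outcome counts: [converted, not converted] per group
control = [520, 9480]    # Model A, n = 10,000
treatment = [580, 9420]  # Model B, n = 10,000

chi2, p_value, dof, expected = chi2_contingency([control, treatment])
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
# Promote Model B only if p is below the chosen significance level (e.g. 0.05)
# and the lift is large enough to matter for the business
```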
A/B Test Considerations for ML
| Consideration | Description |
|---|---|
| Sample size | Need enough samples for statistical power |
| Duration | Run long enough to capture patterns |
| Novelty effect | Users may behave differently with new model |
| Network effects | User interactions may contaminate groups |
| Metric selection | Choose metrics that reflect business value |
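For the sample-size row, a back-of-the-envelope check using the common rule of thumb n ≈ 16·p(1−p)/δ² users per arm (roughly 80% power at α = 0.05 for a two-sided test on proportions; an approximation, not a substitute for a real power analysis):

```python
def approx_samples_per_arm(baseline_rate: float, min_detectable_lift: float) -> int:
    """Rule of thumb: n ~= 16 * p * (1 - p) / delta**2 users per group."""
    p, delta = baseline_rate, min_detectable_lift
    return int(16 * p * (1 - p) / delta ** 2)

# Detecting an absolute +0.5 percentage-point lift on a 5% baseline conversion rate
print(approx_samples_per_arm(0.05, 0.005))  # -> 30400 users per arm
```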
Best Practices
Deployment Checklist
| Phase | Check | Why |
|---|---|---|
| Pre-deploy | Model validated offline | Catch obvious issues |
| Pre-deploy | Feature pipeline tested | Ensure features available |
| Pre-deploy | Rollback plan ready | Quick recovery |
| Deploy | Shadow mode first | Validate on real traffic |
| Deploy | Canary with monitoring | Gradual rollout |
| Post-deploy | Metrics dashboards | Track performance |
| Post-deploy | Alerts configured | Catch degradation |
Common Pitfalls
🚨 Deployment Anti-Patterns
- Big bang deployment: 100% traffic to new model immediately
- No rollback plan: Can't recover from bad deployment
- Missing monitoring: Don't know when model fails
- Feature skew: Training features ≠ serving features
- No shadow testing: Deploy without real traffic validation
✅ Operational Checklist
Pre-Deployment Gate
| Check | Required | Owner | Notes |
|---|---|---|---|
| Offline evaluation passed | ✓ | ML Engineer | See evaluation.md |
| Feature pipeline validated | ✓ | Data Engineer | No train/serve skew |
| Model registered | ✓ | ML Engineer | See governance.md |
| Rollback plan documented | ✓ | ML Lead | Previous version ready |
| Monitoring configured | ✓ | ML Engineer | Dashboards, alerts |
| Load test passed | ✓ | Platform | Can handle expected traffic |
Deployment Execution Gate
| Check | Required | Owner |
|---|---|---|
| Shadow mode validated | ✓ | ML Engineer |
| Canary started (5-10%) | ✓ | ML Engineer |
| Metrics monitored for 24h | ✓ | ML Engineer |
| No SLA violations | ✓ | ML Lead |
| Business metrics stable | ✓ | PM |
Post-Deployment Gate
| Check | Required | When |
|---|---|---|
| Gradual rollout to 100% | ✓ | If metrics OK |
| Full monitoring active | ✓ | Continuous |
| Documentation updated | ✓ | Within 1 day |
| Runbook updated | ✓ | Within 1 day |
| Post-mortem if issues | As needed | Within 1 week |
📎 Cross-References
- 📎 Experimentation - From experiment to deployment
- 📎 ML Monitoring - Post-deployment monitoring
- 📎 ML Governance - Approval gates
- 📎 Feature Engineering - Feature pipeline for serving
- 📎 System Design - ML system architecture patterns
🧠 Quiz
Question 1: What is the advantage of canary deployment in ML?
- [ ] A) It speeds up inference
- [x] B) It reduces risk by routing only a small fraction of traffic to the new model
- [ ] C) It lowers infrastructure cost
- [ ] D) It automatically retrains the model
💡 Explanation: Canary deployment lets you trial a new model on a small slice of traffic (for example 5%), monitor its performance, and widen the rollout only once the new model is confirmed to work well, which limits the blast radius if something goes wrong.
Question 2: How does shadow mode differ from canary deployment?
- [ ] A) Shadow mode is faster than canary
- [ ] B) Shadow mode uses fewer resources
- [x] C) Shadow mode runs the new model in parallel but never serves its results to users; predictions are only logged for comparison
- [ ] D) Shadow mode needs no monitoring
💡 Explanation: Shadow mode (dark launch) runs the new model alongside the current production model and logs both sets of predictions for comparison, but only the old model serves users. This lets you evaluate the new model on real traffic with zero user-facing risk.
💡 Giải thích: Shadow mode (dark launch) chạy model mới song song với model production hiện tại, log predictions của cả hai để so sánh, nhưng chỉ model cũ serve cho users. Điều này cho phép đánh giá model mới trên real traffic mà không có rủi ro.