🚀 Deployment Patterns
Level: Advanced · Solves: Deploying ML models to production safely and efficiently
ML Deployment Challenges
💡 Professor Tom
"Training a model is the easy part. Deploying it reliably is where the real engineering begins." ML deployment differs from ordinary software deployment because a model can fail silently: predictions are wrong, but no error is raised.
┌─────────────────────────────────────────────────────────────────┐
│ ML DEPLOYMENT CHALLENGES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Software: │
│ • Code works or crashes │
│ • Errors are explicit │
│ • Testing is deterministic │
│ │
│ ML Systems: │
│ • Model can be "wrong" without crashing │
│ • Errors are statistical (accuracy drops) │
│ • Testing requires data distribution awareness │
│ • Performance degrades over time (drift) │
│ • Dependencies: code + data + model + features │
│ │
└─────────────────────────────────────────────────────────────────┘

Serving Patterns
Batch vs Online Serving
┌─────────────────────────────────────────────────────────────────┐
│ BATCH vs ONLINE SERVING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ BATCH SERVING │
│ ───────────── │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Data │───►│ Model │───►│ Results │───►│ Storage │ │
│ │ (batch) │ │ │ │ │ │ (DB/S3) │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ • Scheduled (hourly, daily) │
│ • High throughput │
│ • Latency not critical │
│ • Examples: recommendations, risk scores │
│ │
│ ONLINE SERVING │
│ ────────────── │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Request │───►│ Model │───►│Response │ │
│ │ (API) │ │ Service │ │ (JSON) │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ • Real-time (milliseconds) │
│ • Low latency critical │
│ • Per-request predictions │
│ • Examples: fraud detection, search ranking │
│ │
└─────────────────────────────────────────────────────────────────┘

Choosing Serving Pattern
| Factor | Batch | Online |
|---|---|---|
| Latency requirement | Hours/minutes OK | Milliseconds required |
| Prediction freshness | Can be stale | Must be real-time |
| Feature availability | All features available | Limited to real-time features |
| Cost | Lower (scheduled) | Higher (always-on) |
| Complexity | Simpler | More complex |
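The batch pattern in the table above amounts to a scheduled job: load a batch of entities, score them in one bulk call, and persist the results for later lookup. A minimal sketch, where `ThresholdModel` and the plain dict `store` are stand-ins for a real model and the DB/S3 destination:

```python
from datetime import datetime, timezone

class ThresholdModel:
    """Stand-in model: scores each feature row with a simple rule."""
    def predict(self, rows):
        return [1 if sum(row) > 1.0 else 0 for row in rows]

def run_batch_job(rows_by_id, model, store):
    """Score every entity in the batch and write results to the store."""
    ids = list(rows_by_id)
    # One vectorized call over the whole batch: this is where batch
    # serving gets its throughput advantage over per-request inference.
    preds = model.predict([rows_by_id[i] for i in ids])
    scored_at = datetime.now(timezone.utc).isoformat()
    for entity_id, pred in zip(ids, preds):
        store[entity_id] = {"prediction": pred, "scored_at": scored_at}
    return len(ids)
```

In production this function would run on a scheduler (cron, Airflow) and write to a fast lookup store, but the shape of the job is the same.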
Hybrid Pattern
┌─────────────────────────────────────────────────────────────────┐
│ HYBRID SERVING PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ BATCH LAYER │ │
│ │ • Pre-compute predictions for known entities │ │
│ │ • Store in fast lookup (Redis, DynamoDB) │ │
│ │ • Update periodically (hourly/daily) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SERVING LAYER │ │
│ │ Request → Cache lookup → Hit? → Return cached │ │
│ │ │ │ │
│ │ Miss │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Online inference │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Benefits: │
│ • Fast response for known entities (cache hit) │
│ • Fallback to online for new entities │
│ • Cost-effective │
│ │
└─────────────────────────────────────────────────────────────────┘

Deployment Strategies
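A minimal sketch of the hybrid serving layer above: cache lookup first, online inference as the fallback. A plain dict stands in for Redis/DynamoDB, and `model` is a hypothetical object with a scikit-learn-style `predict`:

```python
def hybrid_predict(entity_id, features, cache, model):
    """Return a pre-computed batch prediction when available,
    otherwise fall back to online inference."""
    cached = cache.get(entity_id)
    if cached is not None:
        # Cache hit: prediction was pre-computed by the batch layer.
        return cached
    # Cache miss (e.g. a new entity): run online inference.
    prediction = model.predict([features])[0]
    cache[entity_id] = prediction  # optionally warm the cache
    return prediction
```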
Blue-Green Deployment
┌─────────────────────────────────────────────────────────────────┐
│ BLUE-GREEN DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Blue is live │
│ ┌─────────────┐ │
│ │ Traffic │────────────────►┌─────────────┐ │
│ │ (100%) │ │ Blue (v1) │ │
│ └─────────────┘ └─────────────┘ │
│ ┌─────────────┐ │
│ │ Green (v2) │ ← Deploy new │
│ └─────────────┘ │
│ │
│ Step 2: Switch to Green │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Traffic │ │ Blue (v1) │ ← Standby │
│ │ (100%) │────────────────►┌─────────────┐ │
│ └─────────────┘ │ Green (v2) │ │
│ └─────────────┘ │
│ │
│ Rollback: Switch back to Blue instantly │
│ │
└─────────────────────────────────────────────────────────────────┘

Canary Deployment
┌─────────────────────────────────────────────────────────────────┐
│ CANARY DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: 5% traffic to canary │
│ ┌─────────────┐ 95% ┌─────────────┐ │
│ │ Traffic │────────────►│ Stable v1 │ │
│ │ │ 5% ├─────────────┤ │
│ │ │────────────►│ Canary v2 │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Phase 2: Monitor metrics │
│ • Latency: canary vs stable │
│ • Error rate: canary vs stable │
│ • Business metrics: conversion, engagement │
│ │
│ Phase 3: Gradual rollout (if metrics OK) │
│ 5% → 10% → 25% → 50% → 100% │
│ │
│ Rollback: If metrics degrade, route 100% to stable │
│ │
└─────────────────────────────────────────────────────────────────┘

Shadow Mode
🔍 Shadow Mode
Shadow mode runs the new model in parallel with the production model, but never serves its predictions to users. It is used to validate the new model on real traffic before deployment.
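A minimal sketch of this pattern, assuming `primary` and `shadow` are hypothetical model objects exposing a `predict` method:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def serve(request_features, primary, shadow):
    """Serve the primary model's prediction; run the shadow model on
    the same input and only log its output for offline comparison."""
    result = primary.predict(request_features)
    try:
        # A shadow failure must never affect the user-facing response.
        shadow_result = shadow.predict(request_features)
        log.info("primary=%s shadow=%s", result, shadow_result)
    except Exception:
        log.exception("shadow model failed")
    return result
```

In a real service the shadow call would usually run asynchronously (or from mirrored traffic) so it adds no latency to the primary path.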
┌─────────────────────────────────────────────────────────────────┐
│ SHADOW MODE DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ Request │ │
│ └─────────────┘ │
│ │ │
│ ├────────────────────────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Production │ │ Shadow │ │
│ │ Model v1 │ │ Model v2 │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ │ Serve to user │ Log only │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Response │ │ Metrics │ │
│ │ to User │ │ & Logs │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Compare: │
│ • Prediction distribution │
│ • Latency │
│ • Error rates │
│ • Business metric correlation │
│ │
└─────────────────────────────────────────────────────────────────┘

Model Serving Infrastructure
Serving Options
| Option | Pros | Cons | Best For |
|---|---|---|---|
| REST API | Simple, universal | Higher latency | General purpose |
| gRPC | Fast, typed | More complex | High-performance |
| Serverless | Auto-scaling, pay-per-use | Cold starts | Variable traffic |
| Kubernetes | Scalable, portable | Complex setup | Large scale |
| Edge | Low latency | Limited compute | IoT, mobile |
FastAPI Model Server
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()

# Load model at startup
model = joblib.load("model.pkl")
scaler = joblib.load("scaler.pkl")

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: int
    probability: float
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        # Preprocess
        features = np.array(request.features).reshape(1, -1)
        features_scaled = scaler.transform(features)

        # Predict
        prediction = model.predict(features_scaled)[0]
        probability = model.predict_proba(features_scaled)[0].max()

        return PredictionResponse(
            prediction=int(prediction),
            probability=float(probability),
            model_version="v1.2.0",
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy"}
```

Docker Deployment
```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model.pkl scaler.pkl ./
COPY app.py ./

# Expose port
EXPOSE 8000

# Run server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: model-server
        image: ml-model:v1.2.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
```

Rollback Strategies
Rollback Decision Matrix
| Signal | Severity | Action |
|---|---|---|
| Error rate spike | High | Immediate rollback |
| Latency increase >2x | High | Immediate rollback |
| Business metric drop >10% | Medium | Investigate, then rollback |
| Prediction distribution shift | Low | Monitor, gradual rollback |
Automated Rollback
```python
import time
from dataclasses import dataclass

@dataclass
class RollbackConfig:
    error_rate_threshold: float = 0.05
    latency_p99_threshold_ms: float = 200
    check_interval_seconds: int = 60
    min_samples: int = 100

def should_rollback(metrics: dict, config: RollbackConfig) -> bool:
    """Determine if rollback is needed based on metrics."""
    if metrics["sample_count"] < config.min_samples:
        return False  # Not enough data
    if metrics["error_rate"] > config.error_rate_threshold:
        print(f"Error rate {metrics['error_rate']:.2%} exceeds threshold")
        return True
    if metrics["latency_p99"] > config.latency_p99_threshold_ms:
        print(f"Latency {metrics['latency_p99']}ms exceeds threshold")
        return True
    return False

def monitor_and_rollback(deployment, config: RollbackConfig):
    """Monitor deployment and rollback if needed."""
    while True:
        metrics = deployment.get_metrics()
        if should_rollback(metrics, config):
            print("Initiating rollback...")
            deployment.rollback()
            break
        time.sleep(config.check_interval_seconds)
```

A/B Testing for ML
A/B Test Setup
┌─────────────────────────────────────────────────────────────────┐
│ ML A/B TESTING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ Traffic │ │
│ └─────────────┘ │
│ │ │
│ │ Random assignment │
│ │ (user_id % 100) │
│ │ │
│ ┌────┴────┐ │
│ │ │ │
│ ▼ ▼ │
│ Control Treatment │
│ (50%) (50%) │
│ │ │ │
│ ▼ ▼ │
│ Model A Model B │
│ │ │ │
│ └────┬────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Metrics │ │
│ │ Collection │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Statistical significance test │
│ (t-test, chi-squared, etc.) │
│ │
└─────────────────────────────────────────────────────────────────┘

A/B Test Considerations for ML
| Consideration | Description |
|---|---|
| Sample size | Need enough samples for statistical power |
| Duration | Run long enough to capture patterns |
| Novelty effect | Users may behave differently with new model |
| Network effects | User interactions may contaminate groups |
| Metric selection | Choose metrics that reflect business value |
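The random assignment (`user_id % 100`) and the significance test from the diagram can be sketched with the standard library. The two-proportion z-test below is one common choice for conversion-rate comparisons; the thresholds and helper names are illustrative:

```python
import math

def assign_group(user_id: int, treatment_pct: int = 50) -> str:
    """Deterministic assignment: the same user always lands in the
    same group, which keeps the experience consistent across visits."""
    return "treatment" if user_id % 100 < treatment_pct else "control"

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """z-test for a difference in conversion rates.

    conv_*: number of conversions; n_*: number of users per group.
    Returns (z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

Note that hashing on raw `user_id` modulo 100 reuses the same buckets across experiments; real platforms usually hash `user_id` together with an experiment salt to decorrelate tests.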
Best Practices
Deployment Checklist
| Phase | Check | Why |
|---|---|---|
| Pre-deploy | Model validated offline | Catch obvious issues |
| Pre-deploy | Feature pipeline tested | Ensure features available |
| Pre-deploy | Rollback plan ready | Quick recovery |
| Deploy | Shadow mode first | Validate on real traffic |
| Deploy | Canary with monitoring | Gradual rollout |
| Post-deploy | Metrics dashboards | Track performance |
| Post-deploy | Alerts configured | Catch degradation |
Common Pitfalls
🚨 Deployment Anti-Patterns
- Big bang deployment: 100% traffic to new model immediately
- No rollback plan: Can't recover from bad deployment
- Missing monitoring: Don't know when model fails
- Feature skew: Training features ≠ serving features
- No shadow testing: Deploy without real traffic validation
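Feature skew (the train/serve mismatch listed above) can be caught with a parity check that runs both feature pipelines on the same raw records and compares outputs. A sketch, where `offline_features` and `online_features` are hypothetical stand-ins for the training and serving pipelines:

```python
def check_feature_parity(raw_records, offline_features, online_features,
                         tolerance: float = 1e-6):
    """Run the offline (training) and online (serving) feature pipelines
    on identical raw inputs and collect records whose vectors disagree."""
    mismatches = []
    for record in raw_records:
        off = offline_features(record)
        on = online_features(record)
        if len(off) != len(on) or any(
            abs(a - b) > tolerance for a, b in zip(off, on)
        ):
            mismatches.append((record, off, on))
    return mismatches
```

Running this check in CI, and again during shadow mode, turns silent feature skew into an explicit, testable failure.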
✅ Operational Checklist
Pre-Deployment Gate
| Check | Required | Owner | Notes |
|---|---|---|---|
| Offline evaluation passed | ✓ | ML Engineer | See evaluation.md |
| Feature pipeline validated | ✓ | Data Engineer | No train/serve skew |
| Model registered | ✓ | ML Engineer | See governance.md |
| Rollback plan documented | ✓ | ML Lead | Previous version ready |
| Monitoring configured | ✓ | ML Engineer | Dashboards, alerts |
| Load test passed | ✓ | Platform | Can handle expected traffic |
Deployment Execution Gate
| Check | Required | Owner |
|---|---|---|
| Shadow mode validated | ✓ | ML Engineer |
| Canary started (5-10%) | ✓ | ML Engineer |
| Metrics monitored for 24h | ✓ | ML Engineer |
| No SLA violations | ✓ | ML Lead |
| Business metrics stable | ✓ | PM |
Post-Deployment Gate
| Check | Required | Frequency |
|---|---|---|
| Gradual rollout to 100% | ✓ | If metrics OK |
| Full monitoring active | ✓ | Continuous |
| Documentation updated | ✓ | Within 1 day |
| Runbook updated | ✓ | Within 1 day |
| Post-mortem if issues | As needed | Within 1 week |
📎 Cross-References
- 📎 Experimentation - From experiment to deployment
- 📎 ML Monitoring - Post-deployment monitoring
- 📎 ML Governance - Approval gates
- 📎 Feature Engineering - Feature pipeline for serving
- 📎 System Design - ML system architecture patterns