
🚀 Deployment Patterns

Level: Advanced | Solves: Deploying ML models to production safely and efficiently

ML Deployment Challenges

💡 Giáo sư Tom

"Training a model is the easy part. Deploying it reliably is where the real engineering begins." ML deployment differs from ordinary software deployment because a model can fail silently: predictions go wrong without raising any error.

┌─────────────────────────────────────────────────────────────────┐
│              ML DEPLOYMENT CHALLENGES                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Traditional Software:                                          │
│  • Code works or crashes                                        │
│  • Errors are explicit                                          │
│  • Testing is deterministic                                     │
│                                                                 │
│  ML Systems:                                                    │
│  • Model can be "wrong" without crashing                        │
│  • Errors are statistical (accuracy drops)                      │
│  • Testing requires data distribution awareness                 │
│  • Performance degrades over time (drift)                       │
│  • Dependencies: code + data + model + features                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Serving Patterns

Batch vs Online Serving

┌─────────────────────────────────────────────────────────────────┐
│              BATCH vs ONLINE SERVING                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  BATCH SERVING                                                  │
│  ─────────────                                                  │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐       │
│  │  Data   │───►│  Model  │───►│ Results │───►│ Storage │       │
│  │ (batch) │    │         │    │         │    │ (DB/S3) │       │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘       │
│                                                                 │
│  • Scheduled (hourly, daily)                                    │
│  • High throughput                                              │
│  • Latency not critical                                         │
│  • Examples: recommendations, risk scores                       │
│                                                                 │
│  ONLINE SERVING                                                 │
│  ──────────────                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                      │
│  │ Request │───►│  Model  │───►│Response │                      │
│  │ (API)   │    │ Service │    │ (JSON)  │                      │
│  └─────────┘    └─────────┘    └─────────┘                      │
│                                                                 │
│  • Real-time (milliseconds)                                     │
│  • Low latency critical                                         │
│  • Per-request predictions                                      │
│  • Examples: fraud detection, search ranking                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
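The batch path above can be sketched as a scheduled scoring job: score every known entity and write the results to storage. This is a minimal sketch, where the model stub and the dict standing in for a DB/S3 sink are illustrative assumptions:

```python
# Sketch of batch serving: pre-compute predictions for all known
# entities on a schedule. A plain dict stands in for the real
# storage layer (database, S3, etc.); the model is a stub.
def run_batch_scoring(entities, model, storage):
    """entities: entity_id -> feature list; model: features -> score."""
    for entity_id, features in entities.items():
        storage[entity_id] = model(features)  # persist pre-computed result
    return storage

# Hypothetical run: in production this would be triggered hourly/daily
storage = run_batch_scoring(
    entities={"user_1": [1.0, 2.0], "user_2": [3.0, 4.0]},
    model=lambda features: max(features),
    storage={},
)
```

A real job would load features from a warehouse and run on a scheduler (cron, Airflow), but the shape is the same: read batch, score, write results.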

Choosing a Serving Pattern

| Factor | Batch | Online |
|---|---|---|
| Latency requirement | Hours/minutes OK | Milliseconds required |
| Prediction freshness | Can be stale | Must be real-time |
| Feature availability | All features available | Limited to real-time features |
| Cost | Lower (scheduled) | Higher (always-on) |
| Complexity | Simpler | More complex |

Hybrid Pattern

┌─────────────────────────────────────────────────────────────────┐
│              HYBRID SERVING PATTERN                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    BATCH LAYER                           │    │
│  │  • Pre-compute predictions for known entities            │    │
│  │  • Store in fast lookup (Redis, DynamoDB)                │    │
│  │  • Update periodically (hourly/daily)                    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   SERVING LAYER                          │    │
│  │  Request → Cache lookup → Hit? → Return cached           │    │
│  │                            │                             │    │
│  │                           Miss                           │    │
│  │                            │                             │    │
│  │                            ▼                             │    │
│  │                    Online inference                      │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  Benefits:                                                      │
│  • Fast response for known entities (cache hit)                 │
│  • Fallback to online for new entities                          │
│  • Cost-effective                                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
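The serving-layer flow in the diagram is a cache lookup with an online fallback. A minimal sketch, in which a plain dict stands in for Redis/DynamoDB and the model is a stub:

```python
# Sketch of the hybrid serving layer: cache hit returns the
# pre-computed batch prediction; cache miss falls back to
# online inference for entities the batch job has not seen.
def make_hybrid_predictor(batch_predictions, online_model):
    """batch_predictions: entity_id -> score; online_model: features -> score."""
    def predict(entity_id, features):
        cached = batch_predictions.get(entity_id)
        if cached is not None:                    # cache hit: known entity
            return cached, "batch"
        return online_model(features), "online"   # cache miss: new entity
    return predict

# Hypothetical usage: "user_1" is pre-computed, "user_9" is new
predict = make_hybrid_predictor(
    batch_predictions={"user_1": 0.87},
    online_model=lambda features: sum(features) / len(features),
)
```

In production the cache would also carry a TTL matching the batch refresh interval, so stale entries expire rather than being served indefinitely.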

Deployment Strategies

Blue-Green Deployment

┌─────────────────────────────────────────────────────────────────┐
│              BLUE-GREEN DEPLOYMENT                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Step 1: Blue is live                                           │
│  ┌─────────────┐                                                │
│  │   Traffic   │────────────────►┌─────────────┐                │
│  │   (100%)    │                 │  Blue (v1)  │                │
│  └─────────────┘                 └─────────────┘                │
│                                  ┌─────────────┐                │
│                                  │ Green (v2)  │ ← Deploy new   │
│                                  └─────────────┘                │
│                                                                 │
│  Step 2: Switch to Green                                        │
│  ┌─────────────┐                 ┌─────────────┐                │
│  │   Traffic   │                 │  Blue (v1)  │ ← Standby      │
│  │   (100%)    │───────┐         └─────────────┘                │
│  └─────────────┘       │         ┌─────────────┐                │
│                        └────────►│ Green (v2)  │                │
│                                  └─────────────┘                │
│                                                                 │
│  Rollback: Switch back to Blue instantly                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
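The instant switch and rollback can be sketched as an atomic pointer flip between two loaded model versions. This is a simplified in-process sketch; the class name, model stubs, and lock are illustrative assumptions (real blue-green switches usually happen at the load balancer):

```python
# Sketch of blue-green switching: both versions stay loaded and a
# single pointer decides which one serves 100% of traffic.
import threading

class BlueGreenRouter:
    def __init__(self, blue, green):
        self._models = {"blue": blue, "green": green}
        self._live = "blue"                 # blue serves initially
        self._lock = threading.Lock()

    def predict(self, x):
        return self._models[self._live](x)  # all traffic to the live color

    def switch(self):
        """Flip traffic to the other color; rollback is the same call."""
        with self._lock:
            self._live = "green" if self._live == "blue" else "blue"

router = BlueGreenRouter(blue=lambda x: "v1", green=lambda x: "v2")
router.switch()   # cut over to green; calling switch() again rolls back
```

The key property is that rollback is a single pointer flip, not a redeploy, which is what makes blue-green recovery near-instant.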

Canary Deployment

┌─────────────────────────────────────────────────────────────────┐
│              CANARY DEPLOYMENT                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Phase 1: 5% traffic to canary                                  │
│  ┌─────────────┐     95%     ┌─────────────┐                    │
│  │   Traffic   │────────────►│  Stable v1  │                    │
│  │             │     5%      ├─────────────┤                    │
│  │             │────────────►│  Canary v2  │                    │
│  └─────────────┘             └─────────────┘                    │
│                                                                 │
│  Phase 2: Monitor metrics                                       │
│  • Latency: canary vs stable                                    │
│  • Error rate: canary vs stable                                 │
│  • Business metrics: conversion, engagement                     │
│                                                                 │
│  Phase 3: Gradual rollout (if metrics OK)                       │
│  5% → 10% → 25% → 50% → 100%                                    │
│                                                                 │
│  Rollback: If metrics degrade, route 100% to stable             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
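The traffic split in Phase 1 can be sketched with a stable hash of the user id, so each user consistently lands on the same version across requests. A minimal sketch; the percentages are the illustrative ones from the diagram:

```python
# Sketch of canary routing: hash the user id into 100 buckets and
# send the first canary_pct * 100 buckets to the canary. The hash is
# stable, so a given user always sees the same model version.
import hashlib

def route(user_id: str, canary_pct: float) -> str:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct * 100 else "stable"

# Gradual rollout is just raising canary_pct: 0.05 -> 0.10 -> ... -> 1.0
# Rollback is setting it to 0.0 (100% of traffic back to stable).
```

Using a hash rather than `random()` matters: it keeps the canary population fixed, so per-user metrics for canary vs stable are comparable over time.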

Shadow Mode

🔍 Shadow Mode

Shadow mode runs the new model in parallel with the production model but never serves its predictions to users. It is used to validate the model on real traffic before deployment.

┌─────────────────────────────────────────────────────────────────┐
│              SHADOW MODE DEPLOYMENT                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐                                                │
│  │   Request   │                                                │
│  └─────────────┘                                                │
│         │                                                       │
│         ├────────────────────────────────────┐                  │
│         │                                    │                  │
│         ▼                                    ▼                  │
│  ┌─────────────┐                      ┌─────────────┐           │
│  │ Production  │                      │   Shadow    │           │
│  │  Model v1   │                      │  Model v2   │           │
│  └─────────────┘                      └─────────────┘           │
│         │                                    │                  │
│         │ Serve to user                      │ Log only         │
│         ▼                                    ▼                  │
│  ┌─────────────┐                      ┌─────────────┐           │
│  │  Response   │                      │   Metrics   │           │
│  │  to User    │                      │   & Logs    │           │
│  └─────────────┘                      └─────────────┘           │
│                                                                 │
│  Compare:                                                       │
│  • Prediction distribution                                      │
│  • Latency                                                      │
│  • Error rates                                                  │
│  • Business metric correlation                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
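The "serve one, log the other" split in the diagram can be sketched as follows. The model stubs and the in-memory log are illustrative assumptions; real systems would log asynchronously so the shadow call adds no user-facing latency:

```python
# Sketch of shadow-mode serving: the user always receives the
# production prediction; the shadow model's output is only logged
# for offline comparison of distributions, latency, and errors.
shadow_log = []

def serve(request, production_model, shadow_model):
    response = production_model(request)         # served to the user
    try:
        shadow_pred = shadow_model(request)      # computed, never served
        shadow_log.append({"request": request,
                           "production": response,
                           "shadow": shadow_pred})
    except Exception as exc:
        # A broken shadow model must never break the user path
        shadow_log.append({"request": request, "shadow_error": str(exc)})
    return response
```

Note the try/except around the shadow call: the whole point of shadow mode is that v2 can misbehave without any user impact.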

Model Serving Infrastructure

Serving Options

| Option | Pros | Cons | Best For |
|---|---|---|---|
| REST API | Simple, universal | Higher latency | General purpose |
| gRPC | Fast, typed | More complex | High-performance |
| Serverless | Auto-scaling, pay-per-use | Cold starts | Variable traffic |
| Kubernetes | Scalable, portable | Complex setup | Large scale |
| Edge | Low latency | Limited compute | IoT, mobile |

FastAPI Model Server

python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()

# Load model at startup
model = joblib.load("model.pkl")
scaler = joblib.load("scaler.pkl")

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: int
    probability: float
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        # Preprocess
        features = np.array(request.features).reshape(1, -1)
        features_scaled = scaler.transform(features)
        
        # Predict
        prediction = model.predict(features_scaled)[0]
        probability = model.predict_proba(features_scaled)[0].max()
        
        return PredictionResponse(
            prediction=int(prediction),
            probability=float(probability),
            model_version="v1.2.0"
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy"}

Docker Deployment

dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model.pkl scaler.pkl ./
COPY app.py ./

# Expose port
EXPOSE 8000

# Run server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Kubernetes Deployment

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: model-server
        image: ml-model:v1.2.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Rollback Strategies

Rollback Decision Matrix

| Signal | Severity | Action |
|---|---|---|
| Error rate spike | High | Immediate rollback |
| Latency increase >2x | High | Immediate rollback |
| Business metric drop >10% | Medium | Investigate, then rollback |
| Prediction distribution shift | Low | Monitor, gradual rollback |

Automated Rollback

python
import time
from dataclasses import dataclass

@dataclass
class RollbackConfig:
    error_rate_threshold: float = 0.05
    latency_p99_threshold_ms: float = 200
    check_interval_seconds: int = 60
    min_samples: int = 100

def should_rollback(metrics: dict, config: RollbackConfig) -> bool:
    """Determine if rollback is needed based on metrics."""
    
    if metrics["sample_count"] < config.min_samples:
        return False  # Not enough data
    
    if metrics["error_rate"] > config.error_rate_threshold:
        print(f"Error rate {metrics['error_rate']:.2%} exceeds threshold")
        return True
    
    if metrics["latency_p99"] > config.latency_p99_threshold_ms:
        print(f"Latency {metrics['latency_p99']}ms exceeds threshold")
        return True
    
    return False

def monitor_and_rollback(deployment, config: RollbackConfig):
    """Monitor deployment and rollback if needed."""
    while True:
        metrics = deployment.get_metrics()
        
        if should_rollback(metrics, config):
            print("Initiating rollback...")
            deployment.rollback()
            break
        
        time.sleep(config.check_interval_seconds)

A/B Testing for ML

A/B Test Setup

┌─────────────────────────────────────────────────────────────────┐
│              ML A/B TESTING                                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐                                                │
│  │   Traffic   │                                                │
│  └─────────────┘                                                │
│         │                                                       │
│         │ Random assignment                                     │
│         │ (user_id % 100)                                       │
│         │                                                       │
│    ┌────┴────┐                                                  │
│    │         │                                                  │
│    ▼         ▼                                                  │
│  Control   Treatment                                            │
│  (50%)     (50%)                                                │
│    │         │                                                  │
│    ▼         ▼                                                  │
│  Model A   Model B                                              │
│    │         │                                                  │
│    └────┬────┘                                                  │
│         │                                                       │
│         ▼                                                       │
│  ┌─────────────┐                                                │
│  │  Metrics    │                                                │
│  │  Collection │                                                │
│  └─────────────┘                                                │
│         │                                                       │
│         ▼                                                       │
│  Statistical significance test                                  │
│  (t-test, chi-squared, etc.)                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
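The flow above, deterministic assignment via `user_id % 100` followed by a significance test, can be sketched as below. Assignment follows the diagram; the two-proportion z-test is one common choice for conversion-rate comparisons, and the counts in the example are made up:

```python
# Sketch of an ML A/B test: deterministic 50/50 assignment, then a
# two-proportion z-test on conversion counts from each arm.
import math

def assign(user_id: int) -> str:
    """Stable 50/50 split via user_id % 100, as in the diagram."""
    return "treatment" if user_id % 100 < 50 else "control"

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for H0: both arms have the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical counts: |z| > 1.96 would reject H0 at the 5% level
z = two_proportion_z(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
```

As the table below notes, significance alone is not enough: the test must also run long enough to outlast novelty effects and weekly seasonality.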

A/B Test Considerations for ML

| Consideration | Description |
|---|---|
| Sample size | Need enough samples for statistical power |
| Duration | Run long enough to capture patterns |
| Novelty effect | Users may behave differently with new model |
| Network effects | User interactions may contaminate groups |
| Metric selection | Choose metrics that reflect business value |

Best Practices

Deployment Checklist

| Phase | Check | Why |
|---|---|---|
| Pre-deploy | Model validated offline | Catch obvious issues |
| Pre-deploy | Feature pipeline tested | Ensure features available |
| Pre-deploy | Rollback plan ready | Quick recovery |
| Deploy | Shadow mode first | Validate on real traffic |
| Deploy | Canary with monitoring | Gradual rollout |
| Post-deploy | Metrics dashboards | Track performance |
| Post-deploy | Alerts configured | Catch degradation |

Common Pitfalls

🚨 Deployment Anti-Patterns

  • Big bang deployment: 100% traffic to new model immediately
  • No rollback plan: Can't recover from bad deployment
  • Missing monitoring: Don't know when model fails
  • Feature skew: Training features ≠ serving features
  • No shadow testing: Deploy without real traffic validation

Operational Checklist

Pre-Deployment Gate

| Check | Required | Owner | Notes |
|---|---|---|---|
| Offline evaluation passed | | ML Engineer | See evaluation.md |
| Feature pipeline validated | | Data Engineer | No train/serve skew |
| Model registered | | ML Engineer | See governance.md |
| Rollback plan documented | | ML Lead | Previous version ready |
| Monitoring configured | | ML Engineer | Dashboards, alerts |
| Load test passed | | Platform | Can handle expected traffic |

Deployment Execution Gate

| Check | Required | Owner |
|---|---|---|
| Shadow mode validated | | ML Engineer |
| Canary started (5-10%) | | ML Engineer |
| Metrics monitored for 24h | | ML Engineer |
| No SLA violations | | ML Lead |
| Business metrics stable | | PM |

Post-Deployment Gate

| Check | Required | Frequency |
|---|---|---|
| Gradual rollout to 100% | | If metrics OK |
| Full monitoring active | | Continuous |
| Documentation updated | | Within 1 day |
| Runbook updated | | Within 1 day |
| Post-mortem if issues | As needed | Within 1 week |

📎 Cross-References