
🔧 Model Ops

Level: Ops
Solves: Managing the model lifecycle from training to production, with versioning and monitoring

Model Ops Overview

💡 Professor Tom

Model Ops is the bridge between research and production. A model that performs well in a notebook will not necessarily perform well in production. Model Ops ensures that models are tracked, versioned, and monitored properly.

┌─────────────────────────────────────────────────────────────────┐
│              MODEL LIFECYCLE                                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│  │ DEVELOP │ →  │ TRAIN   │ →  │ VALIDATE│ →  │ DEPLOY  │      │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│       │              │              │              │            │
│       ▼              ▼              ▼              ▼            │
│  ┌─────────────────────────────────────────────────────┐       │
│  │              MODEL REGISTRY                          │       │
│  │  • Version control                                   │       │
│  │  • Metadata tracking                                 │       │
│  │  • Artifact storage                                  │       │
│  │  • Lineage tracking                                  │       │
│  └─────────────────────────────────────────────────────┘       │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────┐       │
│  │              MONITORING                              │       │
│  │  • Performance metrics                               │       │
│  │  • Data drift                                        │       │
│  │  • Model drift                                       │       │
│  │  • Alerts                                            │       │
│  └─────────────────────────────────────────────────────┘       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Model Registry

Why Model Registry?

┌─────────────────────────────────────────────────────────────────┐
│              MODEL REGISTRY BENEFITS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WITHOUT REGISTRY                  WITH REGISTRY                │
│  ────────────────                  ─────────────                │
│  • model_v1.pt                     • Semantic versioning        │
│  • model_v2_final.pt               • Metadata attached          │
│  • model_v2_final_REAL.pt          • Reproducibility            │
│  • model_best_dont_delete.pt       • Rollback capability        │
│  • ???                             • Audit trail                │
│                                                                 │
│  CORE FEATURES                                                  │
│  ─────────────                                                  │
│  • Centralized storage                                          │
│  • Version management                                           │
│  • Stage transitions (dev → staging → prod)                     │
│  • Access control                                               │
│  • Lineage tracking                                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
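To make the core features above concrete, here is a minimal in-memory sketch of a registry. It is a toy, not how MLflow or any other tool listed here is implemented; the class names, the `s3://example-bucket/...` URIs, and the metadata keys are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    metadata: dict[str, Any] = field(default_factory=dict)


class InMemoryRegistry:
    """Toy registry: versions are append-only and numbered from 1."""

    def __init__(self) -> None:
        self._models: dict[str, list[ModelVersion]] = {}
        self._audit_log: list[str] = []

    def register(self, name: str, artifact_uri: str, **metadata: Any) -> ModelVersion:
        versions = self._models.setdefault(name, [])
        model_version = ModelVersion(len(versions) + 1, artifact_uri, dict(metadata))
        versions.append(model_version)
        self._audit_log.append(f"registered {name} v{model_version.version}")
        return model_version

    def get(self, name: str, version: int) -> ModelVersion:
        # Old versions are never overwritten, which is what makes rollback possible.
        return self._models[name][version - 1]

    def latest(self, name: str) -> ModelVersion:
        return self._models[name][-1]

    def audit_log(self) -> list[str]:
        return list(self._audit_log)


registry = InMemoryRegistry()
registry.register("image_classifier", "s3://example-bucket/v1.pt", accuracy=0.93)
registry.register("image_classifier", "s3://example-bucket/v2.pt", accuracy=0.95)
```

Because versions are append-only, "rollback" here is simply `registry.get("image_classifier", 1)`; a real registry adds persistent storage and access control on top of the same idea.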

Model Registry Tools

| Tool | Type | Best For |
|------|------|----------|
| MLflow | Open source | General ML, self-hosted |
| Weights & Biases | SaaS | Experiment tracking + registry |
| Neptune | SaaS | Experiment tracking + registry |
| DVC | Open source | Data + model versioning |
| SageMaker Model Registry | AWS | AWS-native workflows |
| Vertex AI Model Registry | GCP | GCP-native workflows |

MLflow Model Registry

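python
import mlflow
import torch
from mlflow.tracking import MlflowClient

# Placeholder so the snippet runs end to end; substitute your trained torch.nn.Module.
model = torch.nn.Linear(4, 2)

# Log model during training
with mlflow.start_run():
    # Log parameters
    mlflow.log_params({
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 100
    })

    # Log metrics (example values; log the numbers your evaluation produced)
    mlflow.log_metrics({
        "accuracy": 0.95,
        "f1_score": 0.93
    })

    # Log the model and register it as a new version of "image_classifier"
    mlflow.pytorch.log_model(
        model,
        "model",
        registered_model_name="image_classifier"
    )

# Manage model versions
client = MlflowClient()

# Note: version=1 assumes this was the first version registered under this name.
# Note: model stages are deprecated since MLflow 2.9 in favor of aliases and tags;
# the calls below still work but emit a deprecation warning on newer releases.

# Transition to staging
client.transition_model_version_stage(
    name="image_classifier",
    version=1,
    stage="Staging"
)

# Transition to production
client.transition_model_version_stage(
    name="image_classifier",
    version=1,
    stage="Production"
)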

Model Versioning

Versioning Strategy

┌─────────────────────────────────────────────────────────────────┐
│              VERSIONING STRATEGY                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SEMANTIC VERSIONING FOR MODELS                                 │
│  ──────────────────────────────                                 │
│  MAJOR.MINOR.PATCH                                              │
│                                                                 │
│  MAJOR: Breaking changes                                        │
│  • Different input/output schema                                │
│  • Incompatible API changes                                     │
│  • Significant architecture changes                             │
│                                                                 │
│  MINOR: New features, backward compatible                       │
│  • New output fields                                            │
│  • Performance improvements                                     │
│  • Additional capabilities                                      │
│                                                                 │
│  PATCH: Bug fixes, backward compatible                          │
│  • Bug fixes                                                    │
│  • Minor accuracy improvements                                  │
│  • Retraining on same data                                      │
│                                                                 │
│  EXAMPLE                                                        │
│  ───────                                                        │
│  v1.0.0 → Initial production release                            │
│  v1.0.1 → Bug fix in preprocessing                              │
│  v1.1.0 → Added confidence scores                               │
│  v2.0.0 → New architecture, different output format             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
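The bump rules above can be sketched as a small helper. This is an illustrative sketch only: the `Change` enum and its member names are hypothetical labels for the three categories in the diagram, not part of any registry's API.

```python
from enum import Enum


class Change(Enum):
    SCHEMA_CHANGE = "major"      # different input/output schema, new architecture
    NEW_OUTPUT_FIELD = "minor"   # backward-compatible addition
    RETRAIN_SAME_DATA = "patch"  # bug fix or retraining on the same data


def bump(version: str, change: Change) -> str:
    """Return the next version string for a 'vMAJOR.MINOR.PATCH' version."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if change.value == "major":
        return f"v{major + 1}.0.0"
    if change.value == "minor":
        return f"v{major}.{minor + 1}.0"
    return f"v{major}.{minor}.{patch + 1}"
```

Applied in sequence, `bump("v1.0.0", Change.RETRAIN_SAME_DATA)`, then `Change.NEW_OUTPUT_FIELD`, then `Change.SCHEMA_CHANGE` reproduces the v1.0.1, v1.1.0, v2.0.0 progression from the example.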

What to Version

┌─────────────────────────────────────────────────────────────────┐
│              VERSIONING CHECKLIST                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  MODEL ARTIFACTS                                                │
│  ───────────────                                                │
│  □ Model weights (.pt, .onnx, .pb)                              │
│  □ Model architecture definition                                │
│  □ Preprocessing code/config                                    │
│  □ Postprocessing code/config                                   │
│                                                                 │
│  TRAINING ARTIFACTS                                             │
│  ─────────────────                                              │
│  □ Training script                                              │
│  □ Hyperparameters                                              │
│  □ Training data version/hash                                   │
│  □ Validation data version/hash                                 │
│  □ Random seeds                                                 │
│                                                                 │
│  ENVIRONMENT                                                    │
│  ───────────                                                    │
│  □ Python version                                               │
│  □ Dependencies (requirements.txt, conda.yaml)                  │
│  □ CUDA version                                                 │
│  □ Hardware specs                                               │
│                                                                 │
│  METADATA                                                       │
│  ────────                                                       │
│  □ Training metrics                                             │
│  □ Validation metrics                                           │
│  □ Training duration                                            │
│  □ Author/team                                                  │
│  □ Description/changelog                                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
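Several items on this checklist can be captured automatically at training time. The sketch below hashes artifact files and records the interpreter and platform; the function names and the manifest layout are assumptions chosen for illustration, and it covers only part of the checklist (no CUDA or hardware detection).

```python
import hashlib
import platform
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(artifacts: list[Path], hyperparameters: dict, seed: int) -> dict:
    """Collect the versionable facts about one training run into a single dict."""
    return {
        "artifacts": {p.name: sha256_of_file(p) for p in artifacts},
        "hyperparameters": hyperparameters,
        "seed": seed,
        "environment": {
            "python_version": platform.python_version(),
            "platform": platform.platform(),
        },
    }
```

Storing the resulting dict next to the model (for example as JSON) gives a cheap way to check later whether two runs used the same data and weights.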

Model Metadata

Essential Metadata

python
# Model metadata schema
model_metadata = {
    # Identity
    "name": "image_classifier",
    "version": "1.2.0",
    "description": "ResNet50 for product classification",
    
    # Training info
    "training": {
        "dataset": "product_images_v3",
        "dataset_size": 100000,
        "epochs": 50,
        "batch_size": 32,
        "learning_rate": 0.001,
        "optimizer": "AdamW",
        "training_time_hours": 12.5,
    },
    
    # Performance
    "metrics": {
        "accuracy": 0.95,
        "f1_score": 0.93,
        "precision": 0.94,
        "recall": 0.92,
        "latency_p50_ms": 15,
        "latency_p99_ms": 45,
    },
    
    # Environment
    "environment": {
        "python_version": "3.10",
        "pytorch_version": "2.0.1",
        "cuda_version": "11.8",
        "gpu": "A100-40GB",
    },
    
    # Lineage
    "lineage": {
        "parent_model": "image_classifier:1.1.0",
        "training_run_id": "run_abc123",
        "git_commit": "a1b2c3d4",
    },
    
    # Governance
    "governance": {
        "owner": "ml-team",
        "created_at": "2024-01-15T10:30:00Z",
        "approved_by": "senior-ml-engineer",
        "stage": "production",
    }
}
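A schema like this is most useful when it is checked before a model is registered. The following is a minimal validation sketch; which fields are required and the [0, 1] range check on ratio metrics are assumptions, not rules stated by any particular registry.

```python
# Top-level fields assumed to be mandatory, with their expected types.
REQUIRED_FIELDS = {
    "name": str,
    "version": str,
    "training": dict,
    "metrics": dict,
    "environment": dict,
    "lineage": dict,
    "governance": dict,
}

# Metrics that are ratios and should therefore fall within [0, 1].
RATIO_METRICS = ("accuracy", "f1_score", "precision", "recall")


def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, expected_type in REQUIRED_FIELDS.items():
        if key not in metadata:
            problems.append(f"missing field: {key}")
        elif not isinstance(metadata[key], expected_type):
            problems.append(f"{key} must be a {expected_type.__name__}")
    metrics = metadata.get("metrics")
    if isinstance(metrics, dict):
        for name in RATIO_METRICS:
            value = metrics.get(name)
            if value is not None and not 0.0 <= value <= 1.0:
                problems.append(f"metrics.{name} out of range [0, 1]: {value}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a CI job report everything that is wrong in a single run.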

Model Stages

Stage Transitions

┌─────────────────────────────────────────────────────────────────┐
│              MODEL STAGES                                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐     │
│  │   NONE   │ → │ STAGING  │ → │PRODUCTION│ → │ ARCHIVED │     │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘     │
│                                                                 │
│  NONE (Development)                                             │
│  ─────────────────                                              │
│  • Experimental models                                          │
│  • Not validated                                                │
│  • May be deleted                                               │
│                                                                 │
│  STAGING                                                        │
│  ───────                                                        │
│  • Candidate for production                                     │
│  • Undergoing validation                                        │
│  • A/B testing                                                  │
│                                                                 │
│  PRODUCTION                                                     │
│  ──────────                                                     │
│  • Serving live traffic                                         │
│  • Fully validated                                              │
│  • Monitored                                                    │
│                                                                 │
│  ARCHIVED                                                       │
│  ────────                                                       │
│  • No longer in use                                             │
│  • Kept for audit/rollback                                      │
│  • May be deleted after retention period                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
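The stage flow above can be enforced as a small state machine. This sketch encodes the linear path from the diagram and adds one transition that is an assumption on my part: Archived back to Production, to reflect that archived versions are kept for rollback.

```python
# Allowed target stages for each current stage.
ALLOWED_TRANSITIONS = {
    "None": {"Staging"},
    "Staging": {"Production"},
    "Production": {"Archived"},
    "Archived": {"Production"},  # assumption: rollback restores an archived version
}


class InvalidTransition(ValueError):
    """Raised when a stage change skips or reverses the expected flow."""


def transition(current: str, target: str) -> str:
    """Return the new stage, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"cannot move from {current!r} to {target!r}")
    return target
```

For example, `transition("None", "Staging")` returns `"Staging"`, while `transition("None", "Production")` raises `InvalidTransition` because it skips validation in staging.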

Promotion Criteria

┌─────────────────────────────────────────────────────────────────┐
│              PROMOTION CRITERIA                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  NONE → STAGING                                                 │
│  ──────────────                                                 │
│  □ Offline metrics meet threshold                               │
│  □ Model passes unit tests                                      │
│  □ Code review completed                                        │
│  □ Documentation updated                                        │
│                                                                 │
│  STAGING → PRODUCTION                                           │
│  ─────────────────────                                          │
│  □ A/B test shows improvement                                   │
│  □ No regression in key metrics                                 │
│  □ Latency within SLA                                           │
│  □ Error rate acceptable                                        │
│  □ Stakeholder approval                                         │
│                                                                 │
│  PRODUCTION → ARCHIVED                                          │
│  ────────────────────                                           │
│  □ Replaced by newer version                                    │
│  □ Traffic fully migrated                                       │
│  □ Monitoring shows no issues                                   │
│  □ Rollback period passed                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
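A checklist like this is easier to enforce when it is encoded as an automated gate. Below is a sketch for the Staging → Production step only; the `CandidateReport` fields and the default thresholds (100 ms p99 latency, 1% error rate) are hypothetical and should be replaced with your own SLA values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateReport:
    ab_test_lift: float          # relative improvement over the current model
    key_metric_regression: bool  # True if any key metric got worse
    p99_latency_ms: float
    error_rate: float            # fraction of failed requests, e.g. 0.002
    stakeholder_approved: bool


def promotion_blockers(
    report: CandidateReport,
    latency_sla_ms: float = 100.0,
    max_error_rate: float = 0.01,
) -> list[str]:
    """Return the unmet criteria; an empty list means the model may be promoted."""
    blockers = []
    if report.ab_test_lift <= 0:
        blockers.append("A/B test shows no improvement")
    if report.key_metric_regression:
        blockers.append("regression in key metrics")
    if report.p99_latency_ms > latency_sla_ms:
        blockers.append("latency outside SLA")
    if report.error_rate > max_error_rate:
        blockers.append("error rate too high")
    if not report.stakeholder_approved:
        blockers.append("missing stakeholder approval")
    return blockers
```

Listing every blocker, rather than stopping at the first, gives the team a complete picture of what still has to be fixed before promotion.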

Monitoring Integration

Key Metrics to Monitor

┌─────────────────────────────────────────────────────────────────┐
│              MONITORING METRICS                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PERFORMANCE METRICS                                            │
│  ───────────────────                                            │
│  • Prediction accuracy (if labels available)                    │
│  • Confidence score distribution                                │
│  • Prediction distribution                                      │
│                                                                 │
│  OPERATIONAL METRICS                                            │
│  ───────────────────                                            │
│  • Latency (p50, p95, p99)                                      │
│  • Throughput (requests/second)                                 │
│  • Error rate                                                   │
│  • GPU/CPU utilization                                          │
│  • Memory usage                                                 │
│                                                                 │
│  DRIFT METRICS                                                  │
│  ─────────────                                                  │
│  • Input feature distribution                                   │
│  • Prediction distribution                                      │
│  • Data quality metrics                                         │
│                                                                 │
│  BUSINESS METRICS                                               │
│  ────────────────                                               │
│  • Conversion rate                                              │
│  • User engagement                                              │
│  • Revenue impact                                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
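One common way to quantify drift in an input feature or in the prediction distribution is the Population Stability Index (PSI), computed as the sum over bins of (actual − expected) · ln(actual / expected). The sketch below is a plain-Python version under simple assumptions: equal-width bins derived from the reference sample, and a small `eps` floor to avoid taking the log of zero.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of zero, and the value grows as the live distribution moves away from the reference. Production systems often use quantile bins instead of equal-width ones, which is more robust to outliers.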

Alerting Strategy

python
# Example alerting rules
alerts = {
    "latency_high": {
        "condition": "p99_latency > 100ms",
        "severity": "warning",
        "action": "notify_oncall"
    },
    "error_rate_high": {
        "condition": "error_rate > 1%",
        "severity": "critical",
        "action": "page_oncall"
    },
    "accuracy_drop": {
        "condition": "accuracy < baseline - 5%",
        "severity": "warning",
        "action": "notify_ml_team"
    },
    "drift_detected": {
        "condition": "psi_score > 0.2",
        "severity": "warning",
        "action": "notify_ml_team"
    }
}
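The string conditions above are descriptive; to evaluate them in code, the rules need a machine-readable form. The sketch below restates the same four rules with explicit metrics and thresholds. The `AlertRule` class and metric key names are hypothetical, and "baseline - 5%" is interpreted here as a drop of 5 percentage points, which is an assumption.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    compare: Callable[[float, float], bool]
    threshold: float
    severity: str
    action: str


RULES = [
    AlertRule("latency_high", "p99_latency_ms", operator.gt, 100.0, "warning", "notify_oncall"),
    AlertRule("error_rate_high", "error_rate", operator.gt, 0.01, "critical", "page_oncall"),
    # accuracy_drop = baseline accuracy minus current accuracy, as a fraction
    AlertRule("accuracy_drop", "accuracy_drop", operator.gt, 0.05, "warning", "notify_ml_team"),
    AlertRule("drift_detected", "psi_score", operator.gt, 0.2, "warning", "notify_ml_team"),
]


def firing_alerts(metrics: dict[str, float], rules: list[AlertRule] = RULES) -> list[AlertRule]:
    """Return the rules whose condition holds; metrics that are absent are skipped."""
    return [
        rule for rule in rules
        if rule.metric in metrics and rule.compare(metrics[rule.metric], rule.threshold)
    ]
```

Skipping missing metrics keeps the evaluator from raising when, for example, labels are not yet available and accuracy cannot be computed; whether a missing metric should itself trigger an alert is a separate design decision.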

Rollback Strategy

Rollback Procedures

┌─────────────────────────────────────────────────────────────────┐
│              ROLLBACK STRATEGY                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WHEN TO ROLLBACK                                               │
│  ────────────────                                               │
│  • Error rate exceeds threshold                                 │
│  • Latency exceeds SLA                                          │
│  • Significant accuracy drop                                    │
│  • Business metrics degradation                                 │
│                                                                 │
│  ROLLBACK TYPES                                                 │
│  ──────────────                                                 │
│                                                                 │
│  INSTANT ROLLBACK                                               │
│  • Switch traffic to previous version                           │
│  • No downtime                                                  │
│  • Requires previous version deployed                           │
│                                                                 │
│  GRADUAL ROLLBACK                                               │
│  • Slowly shift traffic back                                    │
│  • Monitor during transition                                    │
│  • Safer but slower                                             │
│                                                                 │
│  ROLLBACK CHECKLIST                                             │
│  ─────────────────                                              │
│  □ Identify issue and confirm rollback needed                   │
│  □ Notify stakeholders                                          │
│  □ Execute rollback                                             │
│  □ Verify previous version working                              │
│  □ Post-mortem and root cause analysis                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
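The two rollback types differ only in how quickly traffic moves back to the previous version. The sketch below generates the traffic split for each step; the function name and the fixed-percentage stepping are illustrative assumptions, and actually applying the weights depends on your load balancer or service mesh.

```python
def rollback_weights(step_percent: int = 25) -> list[tuple[int, int]]:
    """Return (previous_version_pct, current_version_pct) pairs, ending at (100, 0)."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in the range 1..100")
    weights = []
    previous = 0
    while previous < 100:
        previous = min(previous + step_percent, 100)
        weights.append((previous, 100 - previous))
    return weights
```

With `step_percent=25` this yields four steps from 25/75 to 100/0, which models a gradual rollback with a monitoring pause between steps. With `step_percent=100` it yields the single step `(100, 0)`, which is the instant rollback case.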

CI/CD for Models

Model CI/CD Pipeline

┌─────────────────────────────────────────────────────────────────┐
│              MODEL CI/CD PIPELINE                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│  │  CODE   │ →  │  BUILD  │ →  │  TEST   │ →  │ DEPLOY  │      │
│  │ COMMIT  │    │         │    │         │    │         │      │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│                                                                 │
│  CODE COMMIT                                                    │
│  ───────────                                                    │
│  • Training code changes                                        │
│  • Model config changes                                         │
│  • Trigger: Git push                                            │
│                                                                 │
│  BUILD                                                          │
│  ─────                                                          │
│  • Train model (or load pre-trained)                            │
│  • Export to inference format                                   │
│  • Build container image                                        │
│                                                                 │
│  TEST                                                           │
│  ────                                                           │
│  • Unit tests                                                   │
│  • Integration tests                                            │
│  • Performance tests                                            │
│  • Accuracy validation                                          │
│                                                                 │
│  DEPLOY                                                         │
│  ──────                                                         │
│  • Deploy to staging                                            │
│  • Run canary tests                                             │
│  • Gradual rollout to production                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

GitHub Actions Example

yaml
# .github/workflows/model-ci.yml
name: Model CI/CD

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'

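jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      # Assumes dependencies are listed in requirements.txt
      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run unit tests
        run: pytest tests/unit

      - name: Run model validation
        run: python scripts/validate_model.py

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      # Each job starts on a fresh runner, so the repository must be checked out again
      - uses: actions/checkout@v4

      - name: Build inference container
        run: docker build -t model:${{ github.sha }} .

      # Assumes registry login is already configured and that the image
      # name carries your registry prefix in a real setup
      - name: Push to registry
        run: docker push model:${{ github.sha }}

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Assumes cluster credentials (kubeconfig) are available to the runner
      - name: Deploy to staging
        run: kubectl apply -f k8s/staging/

      - name: Run integration tests
        run: |
          pip install -r requirements.txt
          pytest tests/integration

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to production
        run: kubectl apply -f k8s/production/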
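The workflow calls `scripts/validate_model.py`, whose contents are not shown. One plausible shape for such a script is sketched below; the `metrics.json` file name, the metric names, and the threshold values are all assumptions. The key idea is that a non-zero exit code fails the CI job.

```python
import json
import sys
from pathlib import Path

# Hypothetical minimum values a candidate model must reach.
THRESHOLDS = {"accuracy": 0.90, "f1_score": 0.90}


def failed_checks(metrics: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def main(metrics_path: str = "metrics.json") -> int:
    """Read metrics written by the training step and return a process exit code."""
    metrics = json.loads(Path(metrics_path).read_text())
    failures = failed_checks(metrics)
    for failure in failures:
        print(f"validation failed: {failure}", file=sys.stderr)
    return 1 if failures else 0
```

In the actual script, the entry point would end with `raise SystemExit(main())` so that the return value becomes the exit code the workflow step sees.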

📎 Cross-References