🔧 Model Ops

Level: Ops
Solves: Managing the model lifecycle from training to production with versioning and monitoring

Model Ops Overview

💡 Professor Tom

Model Ops is the bridge between research and production. A model that performs well in a notebook will not necessarily perform well in production. Model Ops ensures that models are tracked, versioned, and monitored properly.

┌─────────────────────────────────────────────────────────────────┐
│              MODEL LIFECYCLE                                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│  │ DEVELOP │ →  │ TRAIN   │ →  │ VALIDATE│ →  │ DEPLOY  │      │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│       │              │              │              │            │
│       ▼              ▼              ▼              ▼            │
│  ┌─────────────────────────────────────────────────────┐       │
│  │              MODEL REGISTRY                          │       │
│  │  • Version control                                   │       │
│  │  • Metadata tracking                                 │       │
│  │  • Artifact storage                                  │       │
│  │  • Lineage tracking                                  │       │
│  └─────────────────────────────────────────────────────┘       │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────┐       │
│  │              MONITORING                              │       │
│  │  • Performance metrics                               │       │
│  │  • Data drift                                        │       │
│  │  • Model drift                                       │       │
│  │  • Alerts                                            │       │
│  └─────────────────────────────────────────────────────┘       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Model Registry

Why Model Registry?

┌─────────────────────────────────────────────────────────────────┐
│              MODEL REGISTRY BENEFITS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WITHOUT REGISTRY                  WITH REGISTRY                │
│  ────────────────                  ─────────────                │
│  • model_v1.pt                     • Semantic versioning        │
│  • model_v2_final.pt               • Metadata attached          │
│  • model_v2_final_REAL.pt          • Reproducibility            │
│  • model_best_dont_delete.pt       • Rollback capability        │
│  • ???                             • Audit trail                │
│                                                                 │
│  CORE FEATURES                                                  │
│  ─────────────                                                  │
│  • Centralized storage                                          │
│  • Version management                                           │
│  • Stage transitions (dev → staging → prod)                     │
│  • Access control                                               │
│  • Lineage tracking                                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Model Registry Tools

Tool	Type	Best For
MLflow	Open source	General ML, self-hosted
Weights & Biases	SaaS	Experiment tracking + registry
Neptune	SaaS	Experiment tracking + registry
DVC	Open source	Data + model versioning
SageMaker Model Registry	AWS	AWS-native workflows
Vertex AI Model Registry	GCP	GCP-native workflows

MLflow Model Registry

python

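import mlflow
import mlflow.pytorch
import torch
from mlflow.tracking import MlflowClient

# Placeholder so the example runs end to end; substitute your trained model
model = torch.nn.Linear(4, 2)

# Log the run and register the model in one step
with mlflow.start_run():
    # Log parameters
    mlflow.log_params({
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 100
    })

    # Log metrics
    mlflow.log_metrics({
        "accuracy": 0.95,
        "f1_score": 0.93
    })

    # Log the model; registered_model_name creates the registered model if it
    # does not exist yet, otherwise it adds a new version
    mlflow.pytorch.log_model(
        model,
        "model",
        registered_model_name="image_classifier"
    )

# Manage model versions
client = MlflowClient()

# Note: recent MLflow releases (2.9 and later) deprecate stages in favor of
# aliases (client.set_registered_model_alias); check your installed version

# Transition to staging
client.transition_model_version_stage(
    name="image_classifier",
    version=1,
    stage="Staging"
)

# Transition to production
client.transition_model_version_stage(
    name="image_classifier",
    version=1,
    stage="Production"
)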

Model Versioning

Versioning Strategy

┌─────────────────────────────────────────────────────────────────┐
│              VERSIONING STRATEGY                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SEMANTIC VERSIONING FOR MODELS                                 │
│  ──────────────────────────────                                 │
│  MAJOR.MINOR.PATCH                                              │
│                                                                 │
│  MAJOR: Breaking changes                                        │
│  • Different input/output schema                                │
│  • Incompatible API changes                                     │
│  • Significant architecture changes                             │
│                                                                 │
│  MINOR: New features, backward compatible                       │
│  • New output fields                                            │
│  • Performance improvements                                     │
│  • Additional capabilities                                      │
│                                                                 │
│  PATCH: Bug fixes, backward compatible                          │
│  • Bug fixes                                                    │
│  • Minor accuracy improvements                                  │
│  • Retraining on same data                                      │
│                                                                 │
│  EXAMPLE                                                        │
│  ───────                                                        │
│  v1.0.0 → Initial production release                            │
│  v1.0.1 → Bug fix in preprocessing                              │
│  v1.1.0 → Added confidence scores                               │
│  v2.0.0 → New architecture, different output format             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
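The bump rules above can be sketched in a few lines of Python. This is a minimal illustration, not a registry feature: the `ModelVersion` class and its `bump` method are hypothetical names introduced here, and the mapping from a change to "major", "minor", or "patch" is still a human decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ModelVersion:
    """Hypothetical MAJOR.MINOR.PATCH version for a model."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "ModelVersion":
        # Accepts "1.2.0" as well as "v1.2.0"
        major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
        return cls(major, minor, patch)

    def bump(self, change: str) -> "ModelVersion":
        # A higher-level bump resets the lower-level counters to zero
        if change == "major":
            return ModelVersion(self.major + 1, 0, 0)
        if change == "minor":
            return ModelVersion(self.major, self.minor + 1, 0)
        if change == "patch":
            return ModelVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change!r}")

    def __str__(self) -> str:
        return f"v{self.major}.{self.minor}.{self.patch}"


release = ModelVersion.parse("v1.0.0")
bugfix = release.bump("patch")          # bug fix in preprocessing
with_scores = bugfix.bump("minor")      # added confidence scores
new_arch = with_scores.bump("major")    # new architecture, new output format
```

Because the dataclass is declared with `order=True`, versions compare field by field, so `new_arch > with_scores` holds without any string parsing.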

What to Version

┌─────────────────────────────────────────────────────────────────┐
│              VERSIONING CHECKLIST                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  MODEL ARTIFACTS                                                │
│  ───────────────                                                │
│  □ Model weights (.pt, .onnx, .pb)                              │
│  □ Model architecture definition                                │
│  □ Preprocessing code/config                                    │
│  □ Postprocessing code/config                                   │
│                                                                 │
│  TRAINING ARTIFACTS                                             │
│  ─────────────────                                              │
│  □ Training script                                              │
│  □ Hyperparameters                                              │
│  □ Training data version/hash                                   │
│  □ Validation data version/hash                                 │
│  □ Random seeds                                                 │
│                                                                 │
│  ENVIRONMENT                                                    │
│  ───────────                                                    │
│  □ Python version                                               │
│  □ Dependencies (requirements.txt, conda.yaml)                  │
│  □ CUDA version                                                 │
│  □ Hardware specs                                               │
│                                                                 │
│  METADATA                                                       │
│  ────────                                                       │
│  □ Training metrics                                             │
│  □ Validation metrics                                           │
│  □ Training duration                                            │
│  □ Author/team                                                  │
│  □ Description/changelog                                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
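Two items on the checklist, the data version/hash and the environment, are easy to capture automatically. The sketch below uses only the standard library; the function names `file_sha256` and `capture_environment` are illustrative assumptions, and a real setup would likely also record dependency and CUDA versions.

```python
import hashlib
import platform
from pathlib import Path


def file_sha256(path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_environment() -> dict:
    """Collect basic interpreter and OS details to store next to the model."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
```

Storing the digest of the training and validation files alongside the model makes it possible to check later whether a retraining run really used the same data.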

Model Metadata

Essential Metadata

python

# Model metadata schema
model_metadata = {
    # Identity
    "name": "image_classifier",
    "version": "1.2.0",
    "description": "ResNet50 for product classification",
    
    # Training info
    "training": {
        "dataset": "product_images_v3",
        "dataset_size": 100000,
        "epochs": 50,
        "batch_size": 32,
        "learning_rate": 0.001,
        "optimizer": "AdamW",
        "training_time_hours": 12.5,
    },
    
    # Performance
    "metrics": {
        "accuracy": 0.95,
        "f1_score": 0.93,
        "precision": 0.94,
        "recall": 0.92,
        "latency_p50_ms": 15,
        "latency_p99_ms": 45,
    },
    
    # Environment
    "environment": {
        "python_version": "3.10",
        "pytorch_version": "2.0.1",
        "cuda_version": "11.8",
        "gpu": "A100-40GB",
    },
    
    # Lineage
    "lineage": {
        "parent_model": "image_classifier:1.1.0",
        "training_run_id": "run_abc123",
        "git_commit": "a1b2c3d4",
    },
    
    # Governance
    "governance": {
        "owner": "ml-team",
        "created_at": "2024-01-15T10:30:00Z",
        "approved_by": "senior-ml-engineer",
        "stage": "production",
    }
}

Model Stages

Stage Transitions

┌─────────────────────────────────────────────────────────────────┐
│              MODEL STAGES                                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌──────────┐    ┌─────────┐      │
│  │  NONE   │ →  │STAGING  │ →  │PRODUCTION│ →  │ARCHIVED │      │
│  └─────────┘    └─────────┘    └──────────┘    └─────────┘      │
│                                                                 │
│  NONE (Development)                                             │
│  ─────────────────                                              │
│  • Experimental models                                          │
│  • Not validated                                                │
│  • May be deleted                                               │
│                                                                 │
│  STAGING                                                        │
│  ───────                                                        │
│  • Candidate for production                                     │
│  • Undergoing validation                                        │
│  • A/B testing                                                  │
│                                                                 │
│  PRODUCTION                                                     │
│  ──────────                                                     │
│  • Serving live traffic                                         │
│  • Fully validated                                              │
│  • Monitored                                                    │
│                                                                 │
│  ARCHIVED                                                       │
│  ────────                                                       │
│  • No longer in use                                             │
│  • Kept for audit/rollback                                      │
│  • May be deleted after retention period                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
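The stage flow above can be enforced with a small lookup table. This is a sketch under stated assumptions: the table contains only the forward moves drawn in the diagram, and `ALLOWED_TRANSITIONS`, `can_transition`, and `transition` are hypothetical names. Extra edges, such as Archived back to Production for a rollback, are a policy choice left out here.

```python
# Forward moves read directly off the stage diagram
ALLOWED_TRANSITIONS = {
    "None": {"Staging"},
    "Staging": {"Production"},
    "Production": {"Archived"},
    "Archived": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the move from current to target is in the table."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def transition(current: str, target: str) -> str:
    """Return the new stage, or raise if the move is not allowed."""
    if not can_transition(current, target):
        raise ValueError(f"transition {current} -> {target} is not allowed")
    return target
```

A check like this would typically run before calling the registry, so that a model cannot jump from development straight to production.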

Promotion Criteria

┌─────────────────────────────────────────────────────────────────┐
│              PROMOTION CRITERIA                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  NONE → STAGING                                                 │
│  ──────────────                                                 │
│  □ Offline metrics meet threshold                               │
│  □ Model passes unit tests                                      │
│  □ Code review completed                                        │
│  □ Documentation updated                                        │
│                                                                 │
│  STAGING → PRODUCTION                                           │
│  ─────────────────────                                          │
│  □ A/B test shows improvement                                   │
│  □ No regression in key metrics                                 │
│  □ Latency within SLA                                           │
│  □ Error rate acceptable                                        │
│  □ Stakeholder approval                                         │
│                                                                 │
│  PRODUCTION → ARCHIVED                                          │
│  ────────────────────                                           │
│  □ Replaced by newer version                                    │
│  □ Traffic fully migrated                                       │
│  □ Monitoring shows no issues                                   │
│  □ Rollback period passed                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
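The STAGING → PRODUCTION checklist lends itself to an automated gate. The sketch below is one possible shape, not a standard API: `StagingReport` and `production_blockers` are hypothetical names, and the default thresholds (100 ms p99 latency, 1% error rate) are assumptions that should be replaced with your own SLA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StagingReport:
    """Hypothetical summary of a candidate model's time in staging."""
    ab_test_improved: bool
    has_metric_regression: bool
    latency_p99_ms: float
    error_rate: float
    stakeholder_approved: bool


def production_blockers(
    report: StagingReport,
    latency_sla_ms: float = 100.0,   # assumed SLA
    max_error_rate: float = 0.01,    # assumed limit (1%)
) -> List[str]:
    """Return the unmet criteria; an empty list means the model may be promoted."""
    blockers = []
    if not report.ab_test_improved:
        blockers.append("A/B test shows no improvement")
    if report.has_metric_regression:
        blockers.append("Regression in key metrics")
    if report.latency_p99_ms > latency_sla_ms:
        blockers.append("Latency exceeds SLA")
    if report.error_rate > max_error_rate:
        blockers.append("Error rate too high")
    if not report.stakeholder_approved:
        blockers.append("Missing stakeholder approval")
    return blockers
```

Returning the list of blockers rather than a single boolean makes the gate's decision easy to log and to show in a review.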

Monitoring Integration

Key Metrics to Monitor

┌─────────────────────────────────────────────────────────────────┐
│              MONITORING METRICS                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PERFORMANCE METRICS                                            │
│  ───────────────────                                            │
│  • Prediction accuracy (if labels available)                    │
│  • Confidence score distribution                                │
│  • Prediction distribution                                      │
│                                                                 │
│  OPERATIONAL METRICS                                            │
│  ───────────────────                                            │
│  • Latency (p50, p95, p99)                                      │
│  • Throughput (requests/second)                                 │
│  • Error rate                                                   │
│  • GPU/CPU utilization                                          │
│  • Memory usage                                                 │
│                                                                 │
│  DRIFT METRICS                                                  │
│  ─────────────                                                  │
│  • Input feature distribution                                   │
│  • Prediction distribution                                      │
│  • Data quality metrics                                         │
│                                                                 │
│  BUSINESS METRICS                                               │
│  ────────────────                                               │
│  • Conversion rate                                              │
│  • User engagement                                              │
│  • Revenue impact                                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Alerting Strategy

python

# Example alerting rules
alerts = {
    "latency_high": {
        "condition": "p99_latency > 100ms",
        "severity": "warning",
        "action": "notify_oncall"
    },
    "error_rate_high": {
        "condition": "error_rate > 1%",
        "severity": "critical",
        "action": "page_oncall"
    },
    "accuracy_drop": {
        "condition": "accuracy < baseline - 5%",
        "severity": "warning",
        "action": "notify_ml_team"
    },
    "drift_detected": {
        "condition": "psi_score > 0.2",
        "severity": "warning",
        "action": "notify_ml_team"
    }
}
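The `drift_detected` rule relies on a PSI (Population Stability Index) score. A minimal pure-Python sketch of that score follows, assuming equal-width bins derived from the reference sample; the function name `psi` and the `bins` and `eps` defaults are illustrative choices, and production systems often use quantile bins instead.

```python
import math


def psi(expected, actual, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` relative to the `expected` sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for value in values:
            # Values outside the reference range are clipped into the edge bins
            index = min(max(int((value - lo) / width), 0), bins - 1)
            counts[index] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(count / len(values), eps) for count in counts]

    expected_share = proportions(expected)
    actual_share = proportions(actual)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_share, actual_share)
    )


reference = [i / 100 for i in range(100)]        # training-time feature values
shifted = [0.5 + i / 200 for i in range(100)]    # live values, shifted upward
drift_alert = psi(reference, shifted) > 0.2      # threshold from the rule above
```

Identical samples give a PSI of zero, and the score grows as the live distribution moves away from the reference.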

Rollback Strategy

Rollback Procedures

┌─────────────────────────────────────────────────────────────────┐
│              ROLLBACK STRATEGY                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WHEN TO ROLLBACK                                               │
│  ────────────────                                               │
│  • Error rate exceeds threshold                                 │
│  • Latency exceeds SLA                                          │
│  • Significant accuracy drop                                    │
│  • Business metrics degradation                                 │
│                                                                 │
│  ROLLBACK TYPES                                                 │
│  ──────────────                                                 │
│                                                                 │
│  INSTANT ROLLBACK                                               │
│  • Switch traffic to previous version                           │
│  • No downtime                                                  │
│  • Requires previous version deployed                           │
│                                                                 │
│  GRADUAL ROLLBACK                                               │
│  • Slowly shift traffic back                                    │
│  • Monitor during transition                                    │
│  • Safer but slower                                             │
│                                                                 │
│  ROLLBACK CHECKLIST                                             │
│  ─────────────────                                              │
│  □ Identify issue and confirm rollback needed                   │
│  □ Notify stakeholders                                          │
│  □ Execute rollback                                             │
│  □ Verify previous version working                              │
│  □ Post-mortem and root cause analysis                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
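The difference between an instant and a gradual rollback comes down to how quickly the new version's traffic share reaches zero. The sketch below only computes the sequence of shares; `gradual_rollback_plan` is a hypothetical helper, and applying each share to a load balancer or service mesh is outside its scope.

```python
from typing import List


def gradual_rollback_plan(new_version_share: int, step: int = 25) -> List[int]:
    """Return the new version's traffic share (in percent) after each step.

    The plan always ends at 0, meaning all traffic is back on the previous
    version. A step at least as large as the current share is an instant rollback.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    plan = []
    share = new_version_share
    while share > 0:
        share = max(share - step, 0)
        plan.append(share)
    return plan


gradual = gradual_rollback_plan(100, step=25)   # shift traffic back in stages
instant = gradual_rollback_plan(100, step=100)  # switch everything at once
```

Between steps of a gradual plan, the monitoring metrics above would be checked before shifting more traffic.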

CI/CD for Models

Model CI/CD Pipeline

┌─────────────────────────────────────────────────────────────────┐
│              MODEL CI/CD PIPELINE                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│  │  CODE   │ →  │  BUILD  │ →  │  TEST   │ →  │ DEPLOY  │      │
│  │ COMMIT  │    │         │    │         │    │         │      │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│                                                                 │
│  CODE COMMIT                                                    │
│  ───────────                                                    │
│  • Training code changes                                        │
│  • Model config changes                                         │
│  • Trigger: Git push                                            │
│                                                                 │
│  BUILD                                                          │
│  ─────                                                          │
│  • Train model (or load pre-trained)                            │
│  • Export to inference format                                   │
│  • Build container image                                        │
│                                                                 │
│  TEST                                                           │
│  ────                                                           │
│  • Unit tests                                                   │
│  • Integration tests                                            │
│  • Performance tests                                            │
│  • Accuracy validation                                          │
│                                                                 │
│  DEPLOY                                                         │
│  ──────                                                         │
│  • Deploy to staging                                            │
│  • Run canary tests                                             │
│  • Gradual rollout to production                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

GitHub Actions Example

yaml

# .github/workflows/model-ci.yml
name: Model CI/CD

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'

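jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # Assumption: test dependencies are listed in requirements.txt
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run unit tests
        run: pytest tests/unit

      - name: Run model validation
        run: python scripts/validate_model.py

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      # Each job starts on a fresh runner, so the repo must be checked out again
      - uses: actions/checkout@v3

      # Assumes the runner is already authenticated to the image registry
      - name: Build inference container
        run: docker build -t model:${{ github.sha }} .

      - name: Push to registry
        run: docker push model:${{ github.sha }}

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # Assumes kubectl is already configured with cluster credentials
      - name: Deploy to staging
        run: kubectl apply -f k8s/staging/

      - name: Run integration tests
        run: pytest tests/integration

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    # A protected environment can require manual approval before this job runs
    environment: production
    steps:
      - uses: actions/checkout@v3

      - name: Deploy to production
        run: kubectl apply -f k8s/production/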
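The workflow calls `scripts/validate_model.py` without showing it. One way such a script might look is sketched below; the contents are an assumption, including the `metrics.json` file name, the `THRESHOLDS` values, and the helper names. The idea is simply to compare evaluation metrics against minimums and return a non-zero exit code so the CI job fails.

```python
import json

# Assumed minimum acceptable values; tune these for your model
THRESHOLDS = {"accuracy": 0.90, "f1_score": 0.88}


def find_failures(metrics: dict, thresholds: dict) -> list:
    """Return a message for every metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def main(metrics_path: str = "metrics.json") -> int:
    """Load metrics from a JSON file and return a process exit code."""
    with open(metrics_path) as handle:
        metrics = json.load(handle)
    failures = find_failures(metrics, THRESHOLDS)
    for failure in failures:
        print(f"FAILED {failure}")
    return 1 if failures else 0
```

In the script itself, the last line would be `sys.exit(main())` so that a failed check stops the pipeline before the build job starts.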

📎 Cross-References

📎 Inference Optimization - Production inference
📎 ML Monitoring - Monitoring strategies
📎 ML Deployment - Deployment patterns
📎 ML Experimentation - Experiment tracking
📎 LLM Observability - LLM-specific monitoring

🧠 Quiz

Question 1: What does model versioning in production need to track?

[ ] A) Only model weights
[ ] B) Only hyperparameters
[x] C) Model weights, training data version, code version, hyperparameters, and dependencies
[ ] D) Only performance metrics

💡 Explanation: Comprehensive model versioning tracks model artifacts (weights), the training data snapshot/version, the code commit hash, hyperparameters, library versions, and the hardware environment. Missing any one of these can make it hard to reproduce or roll back a model.

Question 2: Why is monitoring integration needed in model ops?

[ ] A) Only to comply with process
[x] B) To detect model degradation and data drift early, and to trigger automated retraining or rollback
[ ] C) To reduce inference cost
[ ] D) It is only needed for large models

💡 Explanation: Monitoring integration connects model performance metrics to the alerting system, so a drop in model quality (data drift, concept drift) is detected early and can automatically trigger actions: alerting the team, rolling back to the previous version, or kicking off a retraining pipeline.

Question 3: What advantage does blue-green deployment offer for ML models?

[ ] A) Lower infrastructure cost
[ ] B) Higher model accuracy
[x] C) Instant rollback by switching traffic between two environments (blue and green)
[ ] D) No testing needed before deployment

💡 Explanation: Blue-green deployment maintains two identical environments. If a problem is found after the new model is deployed to green, you can roll back instantly by switching traffic back to blue (the old model). The downside is the cost of maintaining two environments.
