🔧 Model Ops
Level: Ops | Solves: Managing the model lifecycle from training to production with versioning and monitoring
Model Ops Overview
💡 Professor Tom
Model Ops is the bridge between research and production. A model that performs well in a notebook won't necessarily perform well in production. Model Ops ensures models are properly tracked, versioned, and monitored.
┌─────────────────────────────────────────────────────────────────┐
│ MODEL LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ DEVELOP │ → │ TRAIN │ → │ VALIDATE│ → │ DEPLOY │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ MODEL REGISTRY │ │
│ │ • Version control │ │
│ │ • Metadata tracking │ │
│ │ • Artifact storage │ │
│ │ • Lineage tracking │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ MONITORING │ │
│ │ • Performance metrics │ │
│ │ • Data drift │ │
│ │ • Model drift │ │
│ │ • Alerts │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Model Registry
Why Model Registry?
┌─────────────────────────────────────────────────────────────────┐
│ MODEL REGISTRY BENEFITS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ WITHOUT REGISTRY WITH REGISTRY │
│ ──────────────── ───────────── │
│ • model_v1.pt • Semantic versioning │
│ • model_v2_final.pt • Metadata attached │
│ • model_v2_final_REAL.pt • Reproducibility │
│ • model_best_dont_delete.pt • Rollback capability │
│ • ??? • Audit trail │
│ │
│ CORE FEATURES │
│ ───────────── │
│ • Centralized storage │
│ • Version management │
│ • Stage transitions (dev → staging → prod) │
│ • Access control │
│ • Lineage tracking │
│ │
└─────────────────────────────────────────────────────────────────┘

Model Registry Tools
| Tool | Type | Best For |
|---|---|---|
| MLflow | Open source | General ML, self-hosted |
| Weights & Biases | SaaS | Experiment tracking + registry |
| Neptune | SaaS | Experiment tracking + registry |
| DVC | Open source | Data + model versioning |
| SageMaker Model Registry | AWS | AWS-native workflows |
| Vertex AI Model Registry | GCP | GCP-native workflows |
MLflow Model Registry
```python
import mlflow
from mlflow.tracking import MlflowClient

# Log model during training
with mlflow.start_run():
    # Log parameters
    mlflow.log_params({
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 100
    })

    # Log metrics
    mlflow.log_metrics({
        "accuracy": 0.95,
        "f1_score": 0.93
    })

    # Log model (`model` here is your trained torch.nn.Module)
    mlflow.pytorch.log_model(
        model,
        "model",
        registered_model_name="image_classifier"
    )

# Manage model versions
# (note: newer MLflow releases favor model-version aliases over stages)
client = MlflowClient()

# Transition to staging
client.transition_model_version_stage(
    name="image_classifier",
    version=1,
    stage="Staging"
)

# Transition to production
client.transition_model_version_stage(
    name="image_classifier",
    version=1,
    stage="Production"
)
```

Model Versioning
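MAJOR.MINOR.PATCH model versions can be bumped programmatically rather than by hand. A minimal sketch of such a helper (the `bump` function is illustrative, not part of any registry API):

```python
def bump(version: str, level: str) -> str:
    """Bump a MAJOR.MINOR.PATCH model version string.

    level: "major" for breaking changes (schema/architecture),
           "minor" for backward-compatible additions,
           "patch" for bug fixes and retrains on the same data.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level}")

print(bump("1.0.0", "patch"))  # 1.0.1 (bug fix in preprocessing)
print(bump("1.0.1", "minor"))  # 1.1.0 (added confidence scores)
print(bump("1.1.0", "major"))  # 2.0.0 (new architecture)
```

The printed sequence mirrors the example history in the versioning strategy below.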
Versioning Strategy
┌─────────────────────────────────────────────────────────────────┐
│ VERSIONING STRATEGY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SEMANTIC VERSIONING FOR MODELS │
│ ────────────────────────────── │
│ MAJOR.MINOR.PATCH │
│ │
│ MAJOR: Breaking changes │
│ • Different input/output schema │
│ • Incompatible API changes │
│ • Significant architecture changes │
│ │
│ MINOR: New features, backward compatible │
│ • New output fields │
│ • Performance improvements │
│ • Additional capabilities │
│ │
│ PATCH: Bug fixes, backward compatible │
│ • Bug fixes │
│ • Minor accuracy improvements │
│ • Retraining on same data │
│ │
│ EXAMPLE │
│ ─────── │
│ v1.0.0 → Initial production release │
│ v1.0.1 → Bug fix in preprocessing │
│ v1.1.0 → Added confidence scores │
│ v2.0.0 → New architecture, different output format │
│ │
└─────────────────────────────────────────────────────────────────┘

What to Version
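Data and artifact versions in the checklist below can be pinned with content hashes, so a registry entry points at exactly one byte-identical snapshot. A minimal sketch using hashlib (in practice you would stream files from disk; raw bytes are shown for brevity):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Return a short, stable fingerprint for a dataset or artifact."""
    return hashlib.sha256(data).hexdigest()[:12]

# Illustrative: hash the serialized artifacts you intend to version.
train_hash = content_hash(b"training data snapshot")
model_hash = content_hash(b"serialized model weights")
print({"training_data": train_hash, "model_weights": model_hash})
```

Storing these fingerprints in the model metadata makes "which data trained this model?" answerable months later.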
┌─────────────────────────────────────────────────────────────────┐
│ VERSIONING CHECKLIST │
├─────────────────────────────────────────────────────────────────┤
│ │
│ MODEL ARTIFACTS │
│ ─────────────── │
│ □ Model weights (.pt, .onnx, .pb) │
│ □ Model architecture definition │
│ □ Preprocessing code/config │
│ □ Postprocessing code/config │
│ │
│ TRAINING ARTIFACTS │
│ ───────────────── │
│ □ Training script │
│ □ Hyperparameters │
│ □ Training data version/hash │
│ □ Validation data version/hash │
│ □ Random seeds │
│ │
│ ENVIRONMENT │
│ ─────────── │
│ □ Python version │
│ □ Dependencies (requirements.txt, conda.yaml) │
│ □ CUDA version │
│ □ Hardware specs │
│ │
│ METADATA │
│ ──────── │
│ □ Training metrics │
│ □ Validation metrics │
│ □ Training duration │
│ □ Author/team │
│ □ Description/changelog │
│ │
└─────────────────────────────────────────────────────────────────┘

Model Metadata
Essential Metadata
```python
# Model metadata schema
model_metadata = {
    # Identity
    "name": "image_classifier",
    "version": "1.2.0",
    "description": "ResNet50 for product classification",

    # Training info
    "training": {
        "dataset": "product_images_v3",
        "dataset_size": 100000,
        "epochs": 50,
        "batch_size": 32,
        "learning_rate": 0.001,
        "optimizer": "AdamW",
        "training_time_hours": 12.5,
    },

    # Performance
    "metrics": {
        "accuracy": 0.95,
        "f1_score": 0.93,
        "precision": 0.94,
        "recall": 0.92,
        "latency_p50_ms": 15,
        "latency_p99_ms": 45,
    },

    # Environment
    "environment": {
        "python_version": "3.10",
        "pytorch_version": "2.0.1",
        "cuda_version": "11.8",
        "gpu": "A100-40GB",
    },

    # Lineage
    "lineage": {
        "parent_model": "image_classifier:1.1.0",
        "training_run_id": "run_abc123",
        "git_commit": "a1b2c3d4",
    },

    # Governance
    "governance": {
        "owner": "ml-team",
        "created_at": "2024-01-15T10:30:00Z",
        "approved_by": "senior-ml-engineer",
        "stage": "production",
    }
}
```

Model Stages
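Registries typically restrict which stage transitions are legal. A minimal sketch of such a guard, assuming the four stages diagrammed below (the transition table itself is an assumption, not any registry's built-in policy):

```python
# Allowed stage transitions, mirroring None → Staging → Production → Archived.
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "None", "Archived"},
    "Production": {"Staging", "Archived"},   # demote or retire
    "Archived": set(),                        # terminal: re-register to revive
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving a model version from current to target is allowed."""
    return target in ALLOWED.get(current, set())

print(can_transition("Staging", "Production"))   # legal promotion
print(can_transition("Archived", "Production"))  # illegal revival
```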
Stage Transitions
┌─────────────────────────────────────────────────────────────────┐
│ MODEL STAGES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ NONE │ → │STAGING │ → │PRODUCTION│ → │ARCHIVED │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ NONE (Development) │
│ ───────────────── │
│ • Experimental models │
│ • Not validated │
│ • May be deleted │
│ │
│ STAGING │
│ ─────── │
│ • Candidate for production │
│ • Undergoing validation │
│ • A/B testing │
│ │
│ PRODUCTION │
│ ────────── │
│ • Serving live traffic │
│ • Fully validated │
│ • Monitored │
│ │
│ ARCHIVED │
│ ──────── │
│ • No longer in use │
│ • Kept for audit/rollback │
│ • May be deleted after retention period │
│ │
└─────────────────────────────────────────────────────────────────┘

Promotion Criteria
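Promotion gates like the ones below can be encoded as an automated pre-promotion check that reports exactly which criteria failed. A minimal sketch with illustrative thresholds (metric names and limits are assumptions):

```python
def staging_to_production_ok(metrics: dict) -> list:
    """Return the list of failed promotion criteria (empty means promote)."""
    failures = []
    if metrics.get("ab_lift", 0.0) <= 0.0:
        failures.append("A/B test shows no improvement")
    if metrics.get("latency_p99_ms", float("inf")) > 100:
        failures.append("latency outside SLA")
    if metrics.get("error_rate", 1.0) > 0.01:
        failures.append("error rate too high")
    return failures

candidate = {"ab_lift": 0.02, "latency_p99_ms": 45, "error_rate": 0.004}
print(staging_to_production_ok(candidate))  # empty list: safe to promote
```

Human sign-off (stakeholder approval) still happens outside the script; the check only automates the measurable gates.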
┌─────────────────────────────────────────────────────────────────┐
│ PROMOTION CRITERIA │
├─────────────────────────────────────────────────────────────────┤
│ │
│ NONE → STAGING │
│ ────────────── │
│ □ Offline metrics meet threshold │
│ □ Model passes unit tests │
│ □ Code review completed │
│ □ Documentation updated │
│ │
│ STAGING → PRODUCTION │
│ ───────────────────── │
│ □ A/B test shows improvement │
│ □ No regression in key metrics │
│ □ Latency within SLA │
│ □ Error rate acceptable │
│ □ Stakeholder approval │
│ │
│ PRODUCTION → ARCHIVED │
│ ──────────────────── │
│ □ Replaced by newer version │
│ □ Traffic fully migrated │
│ □ Monitoring shows no issues │
│ □ Rollback period passed │
│ │
└─────────────────────────────────────────────────────────────────┘

Monitoring Integration
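Percentile latencies such as p50/p95/p99 are computed from raw per-request timings. A minimal sketch using the nearest-rank method (sample data is illustrative):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample covering pct% of the data."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 14, 15, 15, 16, 18, 22, 30, 45, 120]
print(percentile(latencies_ms, 50))  # p50: typical request
print(percentile(latencies_ms, 99))  # p99: tail latency, dominated by outliers
```

Note how a single slow request dominates p99 while leaving p50 untouched; that is why SLAs are usually stated on tail percentiles.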
Key Metrics to Monitor
┌─────────────────────────────────────────────────────────────────┐
│ MONITORING METRICS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PERFORMANCE METRICS │
│ ─────────────────── │
│ • Prediction accuracy (if labels available) │
│ • Confidence score distribution │
│ • Prediction distribution │
│ │
│ OPERATIONAL METRICS │
│ ─────────────────── │
│ • Latency (p50, p95, p99) │
│ • Throughput (requests/second) │
│ • Error rate │
│ • GPU/CPU utilization │
│ • Memory usage │
│ │
│ DRIFT METRICS │
│ ───────────── │
│ • Input feature distribution │
│ • Prediction distribution │
│ • Data quality metrics │
│ │
│ BUSINESS METRICS │
│ ──────────────── │
│ • Conversion rate │
│ • User engagement │
│ • Revenue impact │
│ │
└─────────────────────────────────────────────────────────────────┘

Alerting Strategy
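The psi_score referenced in the drift alert below is the Population Stability Index. A minimal sketch of computing it from two binned distributions (per-bin fractions are assumed pre-computed and each sum to 1):

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant drift.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # feature distribution at training time
live     = [0.10, 0.20, 0.30, 0.40]   # feature distribution in production
print(round(psi(baseline, live), 4))
```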
```python
# Example alerting rules
alerts = {
    "latency_high": {
        "condition": "p99_latency > 100ms",
        "severity": "warning",
        "action": "notify_oncall"
    },
    "error_rate_high": {
        "condition": "error_rate > 1%",
        "severity": "critical",
        "action": "page_oncall"
    },
    "accuracy_drop": {
        "condition": "accuracy < baseline - 5%",
        "severity": "warning",
        "action": "notify_ml_team"
    },
    "drift_detected": {
        "condition": "psi_score > 0.2",
        "severity": "warning",
        "action": "notify_ml_team"
    }
}
```

Rollback Strategy
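The "when to rollback" triggers below can be checked automatically against the previous version's baseline. A minimal sketch (thresholds and metric names are illustrative assumptions):

```python
def should_rollback(live, baseline):
    """Trigger rollback when the new version regresses past the thresholds."""
    return (
        live["error_rate"] > 0.01                           # error budget blown
        or live["latency_p99_ms"] > 100                     # SLA breach
        or live["accuracy"] < baseline["accuracy"] - 0.05   # significant drop
    )

baseline = {"accuracy": 0.95}
live = {"error_rate": 0.002, "latency_p99_ms": 60, "accuracy": 0.88}
print(should_rollback(live, baseline))  # accuracy dropped more than 5 points
```

Business-metric degradation usually still needs a human in the loop; only the operational triggers lend themselves to fully automated rollback.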
Rollback Procedures
┌─────────────────────────────────────────────────────────────────┐
│ ROLLBACK STRATEGY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ WHEN TO ROLLBACK │
│ ──────────────── │
│ • Error rate exceeds threshold │
│ • Latency exceeds SLA │
│ • Significant accuracy drop │
│ • Business metrics degradation │
│ │
│ ROLLBACK TYPES │
│ ────────────── │
│ │
│ INSTANT ROLLBACK │
│ • Switch traffic to previous version │
│ • No downtime │
│ • Requires previous version deployed │
│ │
│ GRADUAL ROLLBACK │
│ • Slowly shift traffic back │
│ • Monitor during transition │
│ • Safer but slower │
│ │
│ ROLLBACK CHECKLIST │
│ ───────────────── │
│ □ Identify issue and confirm rollback needed │
│ □ Notify stakeholders │
│ □ Execute rollback │
│ □ Verify previous version working │
│ □ Post-mortem and root cause analysis │
│ │
└─────────────────────────────────────────────────────────────────┘

CI/CD for Models
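Model CI adds an accuracy-validation gate on top of the usual test stages: a small script that fails the build when the candidate underperforms. A hypothetical sketch of such a gate (metric names and thresholds are assumptions):

```python
import sys

def validate(metrics, thresholds):
    """Compare candidate metrics against minimum thresholds; return failures."""
    return [
        f"{name}: {metrics.get(name, 0.0):.3f} < {minimum:.3f}"
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]

if __name__ == "__main__":
    # In a real pipeline, load these from the evaluation run's output.
    candidate = {"accuracy": 0.95, "f1_score": 0.93}
    failures = validate(candidate, {"accuracy": 0.90, "f1_score": 0.90})
    if failures:
        print("validation failed:", failures)
        sys.exit(1)   # non-zero exit fails the CI job
    print("validation passed")
```

A non-zero exit code is all the CI runner needs to block the deploy stages that depend on it.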
Model CI/CD Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ MODEL CI/CD PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ CODE │ → │ BUILD │ → │ TEST │ → │ DEPLOY │ │
│ │ COMMIT │ │ │ │ │ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ CODE COMMIT │
│ ─────────── │
│ • Training code changes │
│ • Model config changes │
│ • Trigger: Git push │
│ │
│ BUILD │
│ ───── │
│ • Train model (or load pre-trained) │
│ • Export to inference format │
│ • Build container image │
│ │
│ TEST │
│ ──── │
│ • Unit tests │
│ • Integration tests │
│ • Performance tests │
│ • Accuracy validation │
│ │
│ DEPLOY │
│ ────── │
│ • Deploy to staging │
│ • Run canary tests │
│ • Gradual rollout to production │
│ │
└─────────────────────────────────────────────────────────────────┘

GitHub Actions Example
```yaml
# .github/workflows/model-ci.yml
name: Model CI/CD

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run unit tests
        run: pytest tests/unit
      - name: Run model validation
        run: python scripts/validate_model.py

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3   # needed so the Dockerfile is available
      - name: Build inference container
        run: docker build -t model:${{ github.sha }} .
      - name: Push to registry
        run: docker push model:${{ github.sha }}

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: kubectl apply -f k8s/staging/
      - name: Run integration tests
        run: pytest tests/integration

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: kubectl apply -f k8s/production/
```

📎 Cross-References
- 📎 Inference Optimization - Production inference
- 📎 ML Monitoring - Monitoring strategies
- 📎 ML Deployment - Deployment patterns
- 📎 ML Experimentation - Experiment tracking
- 📎 LLM Observability - LLM-specific monitoring