🔧 Model Ops
Level: Ops · Solves: Managing the model lifecycle from training to production with versioning and monitoring
Model Ops Overview
💡 Professor Tom
Model Ops is the bridge between research and production. A model that performs well in a notebook will not necessarily perform well in production. Model Ops ensures models are properly tracked, versioned, and monitored.
┌─────────────────────────────────────────────────────────────────┐
│ MODEL LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ DEVELOP │ → │ TRAIN │ → │ VALIDATE│ → │ DEPLOY │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ MODEL REGISTRY │ │
│ │ • Version control │ │
│ │ • Metadata tracking │ │
│ │ • Artifact storage │ │
│ │ • Lineage tracking │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ MONITORING │ │
│ │ • Performance metrics │ │
│ │ • Data drift │ │
│ │ • Model drift │ │
│ │ • Alerts │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Model Registry
Why Model Registry?
┌─────────────────────────────────────────────────────────────────┐
│ MODEL REGISTRY BENEFITS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ WITHOUT REGISTRY WITH REGISTRY │
│ ──────────────── ───────────── │
│ • model_v1.pt • Semantic versioning │
│ • model_v2_final.pt • Metadata attached │
│ • model_v2_final_REAL.pt • Reproducibility │
│ • model_best_dont_delete.pt • Rollback capability │
│ • ??? • Audit trail │
│ │
│ CORE FEATURES │
│ ───────────── │
│ • Centralized storage │
│ • Version management │
│ • Stage transitions (dev → staging → prod) │
│ • Access control │
│ • Lineage tracking │
│ │
└─────────────────────────────────────────────────────────────────┘
Model Registry Tools
| Tool | Type | Best For |
|---|---|---|
| MLflow | Open source | General ML, self-hosted |
| Weights & Biases | SaaS | Experiment tracking + registry |
| Neptune | SaaS | Experiment tracking + registry |
| DVC | Open source | Data + model versioning |
| SageMaker Model Registry | AWS | AWS-native workflows |
| Vertex AI Model Registry | GCP | GCP-native workflows |
MLflow Model Registry
```python
import mlflow
from mlflow.tracking import MlflowClient

# Log model during training
with mlflow.start_run():
    # Log parameters
    mlflow.log_params({
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 100
    })

    # Log metrics
    mlflow.log_metrics({
        "accuracy": 0.95,
        "f1_score": 0.93
    })

    # Log and register the model in one step
    mlflow.pytorch.log_model(
        model,
        "model",
        registered_model_name="image_classifier"
    )

# Manage model versions
client = MlflowClient()

# Transition to staging
client.transition_model_version_stage(
    name="image_classifier",
    version=1,
    stage="Staging"
)

# Transition to production
client.transition_model_version_stage(
    name="image_classifier",
    version=1,
    stage="Production"
)
```
Model Versioning
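At serving time, consumers typically load whatever version currently holds a given stage rather than pinning a version number. A minimal sketch of how that lookup is usually expressed — the `models:/` URI scheme is MLflow's; the `model_uri` helper itself is a hypothetical convenience, not part of the MLflow API:

```python
def model_uri(name: str, stage: str) -> str:
    """Build an MLflow registry URI that resolves to the latest
    version of `name` currently in the given stage."""
    return f"models:/{name}/{stage}"

# Serving code would then load by stage, e.g.:
#   model = mlflow.pyfunc.load_model(model_uri("image_classifier", "Production"))
print(model_uri("image_classifier", "Production"))  # models:/image_classifier/Production
```

Because the URI names a stage rather than a version, promoting a new version in the registry changes what serving loads without any code change.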
Versioning Strategy
┌─────────────────────────────────────────────────────────────────┐
│ VERSIONING STRATEGY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SEMANTIC VERSIONING FOR MODELS │
│ ────────────────────────────── │
│ MAJOR.MINOR.PATCH │
│ │
│ MAJOR: Breaking changes │
│ • Different input/output schema │
│ • Incompatible API changes │
│ • Significant architecture changes │
│ │
│ MINOR: New features, backward compatible │
│ • New output fields │
│ • Performance improvements │
│ • Additional capabilities │
│ │
│ PATCH: Bug fixes, backward compatible │
│ • Bug fixes │
│ • Minor accuracy improvements │
│ • Retraining on same data │
│ │
│ EXAMPLE │
│ ─────── │
│ v1.0.0 → Initial production release │
│ v1.0.1 → Bug fix in preprocessing │
│ v1.1.0 → Added confidence scores │
│ v2.0.0 → New architecture, different output format │
│ │
└─────────────────────────────────────────────────────────────────┘
What to Version
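The MAJOR.MINOR.PATCH rules above can be sketched as a small helper that derives the next version from the kind of change. The `bump` function is a hypothetical illustration, not part of any registry API:

```python
def bump(version: str, change: str) -> str:
    """Return the next semantic version for a model change.

    change: 'major' (breaking schema/architecture change),
            'minor' (backward-compatible feature),
            'patch' (bug fix or retrain on the same data).
    """
    major, minor, patch = (int(x) for x in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

print(bump("1.0.0", "patch"))  # 1.0.1  (bug fix in preprocessing)
print(bump("1.0.1", "minor"))  # 1.1.0  (added confidence scores)
print(bump("1.1.0", "major"))  # 2.0.0  (new architecture, new output format)
```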
┌─────────────────────────────────────────────────────────────────┐
│ VERSIONING CHECKLIST │
├─────────────────────────────────────────────────────────────────┤
│ │
│ MODEL ARTIFACTS │
│ ─────────────── │
│ □ Model weights (.pt, .onnx, .pb) │
│ □ Model architecture definition │
│ □ Preprocessing code/config │
│ □ Postprocessing code/config │
│ │
│ TRAINING ARTIFACTS │
│ ───────────────── │
│ □ Training script │
│ □ Hyperparameters │
│ □ Training data version/hash │
│ □ Validation data version/hash │
│ □ Random seeds │
│ │
│ ENVIRONMENT │
│ ─────────── │
│ □ Python version │
│ □ Dependencies (requirements.txt, conda.yaml) │
│ □ CUDA version │
│ □ Hardware specs │
│ │
│ METADATA │
│ ──────── │
│ □ Training metrics │
│ □ Validation metrics │
│ □ Training duration │
│ □ Author/team │
│ □ Description/changelog │
│ │
└─────────────────────────────────────────────────────────────────┘
Model Metadata
Essential Metadata
```python
# Model metadata schema
model_metadata = {
    # Identity
    "name": "image_classifier",
    "version": "1.2.0",
    "description": "ResNet50 for product classification",

    # Training info
    "training": {
        "dataset": "product_images_v3",
        "dataset_size": 100000,
        "epochs": 50,
        "batch_size": 32,
        "learning_rate": 0.001,
        "optimizer": "AdamW",
        "training_time_hours": 12.5,
    },

    # Performance
    "metrics": {
        "accuracy": 0.95,
        "f1_score": 0.93,
        "precision": 0.94,
        "recall": 0.92,
        "latency_p50_ms": 15,
        "latency_p99_ms": 45,
    },

    # Environment
    "environment": {
        "python_version": "3.10",
        "pytorch_version": "2.0.1",
        "cuda_version": "11.8",
        "gpu": "A100-40GB",
    },

    # Lineage
    "lineage": {
        "parent_model": "image_classifier:1.1.0",
        "training_run_id": "run_abc123",
        "git_commit": "a1b2c3d4",
    },

    # Governance
    "governance": {
        "owner": "ml-team",
        "created_at": "2024-01-15T10:30:00Z",
        "approved_by": "senior-ml-engineer",
        "stage": "production",
    }
}
```
Model Stages
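Before registering a model, it pays to fail fast when required metadata is missing. A minimal sketch of such a check — the required-field list mirrors the schema above, and `validate_metadata` is a hypothetical helper:

```python
# Top-level fields every registered model is expected to carry
# (mirrors the metadata schema in the previous section).
REQUIRED_FIELDS = {
    "name": str, "version": str,
    "training": dict, "metrics": dict,
    "environment": dict, "lineage": dict, "governance": dict,
}

def validate_metadata(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata is valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in meta:
            problems.append(f"missing field: {field}")
        elif not isinstance(meta[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems

print(validate_metadata({"name": "image_classifier"}))  # reports missing fields
```

Running this check in CI (before the registry call) keeps incomplete models out of the registry in the first place.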
Stage Transitions
┌─────────────────────────────────────────────────────────────────┐
│ MODEL STAGES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ NONE │ → │STAGING │ → │PRODUCTION│ → │ARCHIVED │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ NONE (Development) │
│ ───────────────── │
│ • Experimental models │
│ • Not validated │
│ • May be deleted │
│ │
│ STAGING │
│ ─────── │
│ • Candidate for production │
│ • Undergoing validation │
│ • A/B testing │
│ │
│ PRODUCTION │
│ ────────── │
│ • Serving live traffic │
│ • Fully validated │
│ • Monitored │
│ │
│ ARCHIVED │
│ ──────── │
│ • No longer in use │
│ • Kept for audit/rollback │
│ • May be deleted after retention period │
│ │
└─────────────────────────────────────────────────────────────────┘
Promotion Criteria
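The stage diagram above is effectively a small state machine, and enforcing it in code prevents accidental jumps such as None → Production. A sketch with a hypothetical helper — the exact set of allowed edges is a policy choice (here, rolling a Production model back to Staging is permitted as an example):

```python
# Allowed stage transitions; anything not listed is rejected.
ALLOWED_TRANSITIONS = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived", "None"},
    "Production": {"Archived", "Staging"},
    "Archived": set(),  # terminal; restoring would mean a new version
}

def can_transition(current: str, target: str) -> bool:
    return target in ALLOWED_TRANSITIONS.get(current, set())

print(can_transition("Staging", "Production"))  # True
print(can_transition("None", "Production"))     # False
```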
┌─────────────────────────────────────────────────────────────────┐
│ PROMOTION CRITERIA │
├─────────────────────────────────────────────────────────────────┤
│ │
│ NONE → STAGING │
│ ────────────── │
│ □ Offline metrics meet threshold │
│ □ Model passes unit tests │
│ □ Code review completed │
│ □ Documentation updated │
│ │
│ STAGING → PRODUCTION │
│ ───────────────────── │
│ □ A/B test shows improvement │
│ □ No regression in key metrics │
│ □ Latency within SLA │
│ □ Error rate acceptable │
│ □ Stakeholder approval │
│ │
│ PRODUCTION → ARCHIVED │
│ ──────────────────── │
│ □ Replaced by newer version │
│ □ Traffic fully migrated │
│ □ Monitoring shows no issues │
│ □ Rollback period passed │
│ │
└─────────────────────────────────────────────────────────────────┘
Monitoring Integration
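The promotion checklists above translate naturally into an automated gate: a transition is allowed only when every criterion for it holds. A minimal sketch — the criterion names and `can_promote` helper are illustrative, not any particular registry's API:

```python
# Criteria per transition, mirroring the checklist above.
PROMOTION_CRITERIA = {
    ("Staging", "Production"): [
        "ab_test_improvement", "no_metric_regression",
        "latency_within_sla", "error_rate_ok", "stakeholder_approval",
    ],
}

def can_promote(src: str, dst: str, checks: dict) -> bool:
    """checks maps criterion name -> bool (result of an automated or manual check).
    A missing criterion counts as failed."""
    return all(checks.get(c, False) for c in PROMOTION_CRITERIA.get((src, dst), []))

checks = {"ab_test_improvement": True, "no_metric_regression": True,
          "latency_within_sla": True, "error_rate_ok": True,
          "stakeholder_approval": False}
print(can_promote("Staging", "Production", checks))  # False: approval missing
```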
Key Metrics to Monitor
┌─────────────────────────────────────────────────────────────────┐
│ MONITORING METRICS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PERFORMANCE METRICS │
│ ─────────────────── │
│ • Prediction accuracy (if labels available) │
│ • Confidence score distribution │
│ • Prediction distribution │
│ │
│ OPERATIONAL METRICS │
│ ─────────────────── │
│ • Latency (p50, p95, p99) │
│ • Throughput (requests/second) │
│ • Error rate │
│ • GPU/CPU utilization │
│ • Memory usage │
│ │
│ DRIFT METRICS │
│ ───────────── │
│ • Input feature distribution │
│ • Prediction distribution │
│ • Data quality metrics │
│ │
│ BUSINESS METRICS │
│ ──────────────── │
│ • Conversion rate │
│ • User engagement │
│ • Revenue impact │
│ │
└─────────────────────────────────────────────────────────────────┘
Alerting Strategy
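Among the drift metrics above, the Population Stability Index (PSI) is a common way to compare an input or prediction distribution in production against the training baseline; the 0.2 threshold used in the drift alert below is a widely cited rule of thumb, not a universal constant. A minimal sketch:

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are per-bin proportions summing to ~1;
    eps guards against log(0) for empty bins.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]
print(psi(baseline, [0.25, 0.25, 0.25, 0.25]))        # ~0.0: no drift
print(psi(baseline, [0.10, 0.20, 0.30, 0.40]) > 0.2)  # True: noticeable shift
```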
```python
# Example alerting rules
alerts = {
    "latency_high": {
        "condition": "p99_latency > 100ms",
        "severity": "warning",
        "action": "notify_oncall"
    },
    "error_rate_high": {
        "condition": "error_rate > 1%",
        "severity": "critical",
        "action": "page_oncall"
    },
    "accuracy_drop": {
        "condition": "accuracy < baseline - 5%",
        "severity": "warning",
        "action": "notify_ml_team"
    },
    "drift_detected": {
        "condition": "psi_score > 0.2",
        "severity": "warning",
        "action": "notify_ml_team"
    }
}
```
Rollback Strategy
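The alert rules above can be evaluated with a simple loop if each condition is represented as a callable over a metrics snapshot — a deliberate simplification of real rule engines such as Prometheus, which express conditions in a query language:

```python
# Conditions as callables over a metrics snapshot (illustrative names).
ALERTS = {
    "latency_high": (lambda m: m["p99_latency_ms"] > 100, "warning"),
    "error_rate_high": (lambda m: m["error_rate"] > 0.01, "critical"),
    "accuracy_drop": (lambda m: m["accuracy"] < m["baseline_accuracy"] - 0.05, "warning"),
    "drift_detected": (lambda m: m["psi_score"] > 0.2, "warning"),
}

def evaluate(metrics: dict) -> list[tuple[str, str]]:
    """Return (alert_name, severity) for every rule that fires."""
    return [(name, sev) for name, (cond, sev) in ALERTS.items() if cond(metrics)]

snapshot = {"p99_latency_ms": 120, "error_rate": 0.002,
            "accuracy": 0.95, "baseline_accuracy": 0.96, "psi_score": 0.05}
print(evaluate(snapshot))  # [('latency_high', 'warning')]
```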
Rollback Procedures
┌─────────────────────────────────────────────────────────────────┐
│ ROLLBACK STRATEGY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ WHEN TO ROLLBACK │
│ ──────────────── │
│ • Error rate exceeds threshold │
│ • Latency exceeds SLA │
│ • Significant accuracy drop │
│ • Business metrics degradation │
│ │
│ ROLLBACK TYPES │
│ ────────────── │
│ │
│ INSTANT ROLLBACK │
│ • Switch traffic to previous version │
│ • No downtime │
│ • Requires previous version deployed │
│ │
│ GRADUAL ROLLBACK │
│ • Slowly shift traffic back │
│ • Monitor during transition │
│ • Safer but slower │
│ │
│ ROLLBACK CHECKLIST │
│ ───────────────── │
│ □ Identify issue and confirm rollback needed │
│ □ Notify stakeholders │
│ □ Execute rollback │
│ □ Verify previous version working │
│ □ Post-mortem and root cause analysis │
│ │
└─────────────────────────────────────────────────────────────────┘
CI/CD for Models
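A gradual rollback is essentially a traffic-shift schedule run in reverse. A minimal sketch that produces the weight steps — the step count and percentages are illustrative:

```python
def rollback_schedule(steps: int = 4) -> list[tuple[int, int]]:
    """Return (new_version_pct, old_version_pct) pairs shifting
    traffic from the failing new version back to the previous one."""
    return [(100 - round(100 * i / steps), round(100 * i / steps))
            for i in range(1, steps + 1)]

for new_pct, old_pct in rollback_schedule():
    # In a real system: update the traffic split, wait, check metrics,
    # and only continue to the next step if the old version is healthy.
    print(f"new={new_pct}%  old={old_pct}%")
# new=75% old=25% ... new=0% old=100%
```

An instant rollback is the degenerate case `steps=1`: all traffic moves back in one switch.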
Model CI/CD Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ MODEL CI/CD PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ CODE │ → │ BUILD │ → │ TEST │ → │ DEPLOY │ │
│ │ COMMIT │ │ │ │ │ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ CODE COMMIT │
│ ─────────── │
│ • Training code changes │
│ • Model config changes │
│ • Trigger: Git push │
│ │
│ BUILD │
│ ───── │
│ • Train model (or load pre-trained) │
│ • Export to inference format │
│ • Build container image │
│ │
│ TEST │
│ ──── │
│ • Unit tests │
│ • Integration tests │
│ • Performance tests │
│ • Accuracy validation │
│ │
│ DEPLOY │
│ ────── │
│ • Deploy to staging │
│ • Run canary tests │
│ • Gradual rollout to production │
│ │
└─────────────────────────────────────────────────────────────────┘
GitHub Actions Example
```yaml
# .github/workflows/model-ci.yml
name: Model CI/CD

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run unit tests
        run: pytest tests/unit
      - name: Run model validation
        run: python scripts/validate_model.py

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Build inference container
        run: docker build -t model:${{ github.sha }} .
      - name: Push to registry
        run: docker push model:${{ github.sha }}

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: kubectl apply -f k8s/staging/
      - name: Run integration tests
        run: pytest tests/integration

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: kubectl apply -f k8s/production/
```
📎 Cross-References
- 📎 Inference Optimization - Production inference
- 📎 ML Monitoring - Monitoring strategies
- 📎 ML Deployment - Deployment patterns
- 📎 ML Experimentation - Experiment tracking
- 📎 LLM Observability - LLM-specific monitoring
🧠 Quiz
Question 1: What does model versioning in production need to track?
- [ ] A) Model weights only
- [ ] B) Hyperparameters only
- [x] C) Model weights, training data version, code version, hyperparameters, and dependencies
- [ ] D) Performance metrics only
💡 Explanation: Comprehensive model versioning tracks model artifacts (weights), the training data snapshot/version, the code commit hash, hyperparameters, library versions, and the hardware environment. Missing any one of these components can make reproduction or rollback difficult.
Question 2: Why is monitoring integration needed in model ops?
- [ ] A) Only for process compliance
- [x] B) To detect model degradation and data drift early, and trigger automated retraining or rollback
- [ ] C) To reduce inference cost
- [ ] D) It is only needed for large models
💡 Explanation: Monitoring integration connects model performance metrics to the alerting system, enabling early detection when model quality degrades (data drift, concept drift) and automatically triggering actions: alerting the team, rolling back to the previous version, or kicking off a retraining pipeline.
Question 3: What is the advantage of blue-green deployment for ML models?
- [ ] A) Lower infrastructure cost
- [ ] B) Higher model accuracy
- [x] C) It enables instant rollback by switching traffic between two environments (blue and green)
- [ ] D) No testing is needed before deployment
💡 Explanation: Blue-green deployment maintains two identical environments. When a new model is deployed to green and a problem is detected, you can instantly roll back by switching traffic to blue (the old model). The trade-off is the cost of maintaining two environments.