📈 ML Monitoring
Level: Ops Solves: Detect and handle model degradation in production, ensuring ML systems keep working correctly over time
Why is ML Monitoring Different?
💡 Professor Tom
"Traditional software fails loudly. ML systems fail silently." A model can return completely wrong predictions without raising a single error. Monitoring ML is not just about uptime and latency; it is about the quality of predictions over time.
┌─────────────────────────────────────────────────────────────────┐
│ ML MONITORING vs TRADITIONAL MONITORING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL SOFTWARE: │
│ ───────────────────── │
│ • Is the service up? (health check) │
│ • Is latency acceptable? (response time) │
│ • Are there errors? (error rate) │
│ • Is resource usage OK? (CPU, memory) │
│ │
│ ML SYSTEMS (all above PLUS): │
│ ───────────────────────────── │
│ • Are predictions still accurate? (model performance) │
│ • Has input data changed? (data drift) │
│ • Has target distribution changed? (concept drift) │
│ • Are predictions fair? (bias monitoring) │
│ • Are features computed correctly? (feature quality) │
│ │
└─────────────────────────────────────────────────────────────────┘
Types of Drift
Data Drift (Covariate Shift)
┌─────────────────────────────────────────────────────────────────┐
│ DATA DRIFT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Definition: Input feature distribution changes over time │
│ │
│ Training Data Production Data │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ ╱╲ │ │ ╱╲ │ │
│ │ ╱ ╲ │ → │ ╱ ╲ │ │
│ │ ╱ ╲ │ │ ╱ ╲ │ │
│ │ ╱ ╲ │ │ ╱ ╲│ │
│ └─────────────┘ └─────────────┘ │
│ Mean: 50 Mean: 70 (shifted!) │
│ │
│ Examples: │
│ • User demographics change (younger users join) │
│ • Seasonal patterns (holiday shopping) │
│ • Market conditions change (economic downturn) │
│ │
│ Impact: Model may not generalize to new distribution │
│ │
└─────────────────────────────────────────────────────────────────┘
Concept Drift
┌─────────────────────────────────────────────────────────────────┐
│ CONCEPT DRIFT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Definition: Relationship between features and target changes │
│ │
│ Training Period Production Period │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ X=high → │ │ X=high → │ │
│ │ Y=positive │ → │ Y=negative │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Examples: │
│ • Fraud patterns evolve (adversarial) │
│ • User preferences change (trends) │
│ • Economic conditions affect behavior │
│ • COVID changed everything │
│ │
│ Impact: Model's learned patterns become invalid │
│ │
└─────────────────────────────────────────────────────────────────┘
Prediction Drift
| Drift Type | What Changes | Detection Method |
|---|---|---|
| Data Drift | Input distribution | Statistical tests on features |
| Concept Drift | Feature-target relationship | Performance degradation |
| Prediction Drift | Output distribution | Statistical tests on predictions |
| Label Drift | Target distribution | Monitor label statistics |
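As a concrete example of the "statistical tests on features" row, here is a minimal sketch of data drift detection using SciPy's two-sample Kolmogorov-Smirnov test. The function name, the synthetic data, and the 0.05 significance threshold are illustrative assumptions, not part of any specific library API beyond `scipy.stats.ks_2samp`:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(reference: np.ndarray, current: np.ndarray,
                         alpha: float = 0.05) -> bool:
    """Two-sample KS test: a low p-value means the distributions differ."""
    statistic, p_value = ks_2samp(reference, current)
    return bool(p_value < alpha)

# Illustrative data mirroring the diagram above: production mean shifted 50 -> 70
rng = np.random.default_rng(0)
training = rng.normal(50, 10, 5000)
production = rng.normal(70, 10, 5000)
print(detect_feature_drift(training, production))  # True: distribution shifted
```

In practice you would run this per feature against a stored training-time reference sample, and combine it with PSI for a magnitude estimate, since a KS test on large samples can flag tiny, harmless shifts.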
Drift Detection Methods
Statistical Tests
┌─────────────────────────────────────────────────────────────────┐
│ DRIFT DETECTION METHODS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ NUMERICAL FEATURES: │
│ ─────────────────── │
│ • Kolmogorov-Smirnov (KS) test │
│ • Population Stability Index (PSI) │
│ • Wasserstein distance │
│ • Jensen-Shannon divergence │
│ │
│ CATEGORICAL FEATURES: │
│ ───────────────────── │
│ • Chi-squared test │
│ • Population Stability Index (PSI) │
│ • Jensen-Shannon divergence │
│ │
│ MULTIVARIATE: │
│ ───────────── │
│ • Maximum Mean Discrepancy (MMD) │
│ • Domain classifier │
│ │
└─────────────────────────────────────────────────────────────────┘
Population Stability Index (PSI)
```python
import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """
    Calculate Population Stability Index.

    PSI < 0.1: No significant change
    0.1 <= PSI < 0.2: Moderate change, investigate
    PSI >= 0.2: Significant change, action required
    """
    # Create bins from expected distribution
    breakpoints = np.percentile(expected, np.linspace(0, 100, bins + 1))
    breakpoints[0] = -np.inf
    breakpoints[-1] = np.inf

    # Calculate proportions
    expected_counts = np.histogram(expected, breakpoints)[0]
    actual_counts = np.histogram(actual, breakpoints)[0]
    expected_pct = expected_counts / len(expected)
    actual_pct = actual_counts / len(actual)

    # Avoid division by zero
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)

    # Calculate PSI
    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return psi

# Usage
psi = calculate_psi(training_feature, production_feature)
if psi >= 0.2:
    alert("Significant drift detected!")
```
Drift Detection with Evidently
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Create drift report
report = Report(metrics=[
    DataDriftPreset(),
    TargetDriftPreset(),
])
report.run(
    reference_data=training_df,
    current_data=production_df,
)

# Get results
drift_results = report.as_dict()

# Check for drift
if drift_results["metrics"][0]["result"]["dataset_drift"]:
    print("Data drift detected!")

# Save report
report.save_html("drift_report.html")
```
Performance Monitoring
Key Metrics to Track
┌─────────────────────────────────────────────────────────────────┐
│ ML PERFORMANCE METRICS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ MODEL QUALITY: │
│ ────────────── │
│ • Accuracy, Precision, Recall, F1 (classification) │
│ • MAE, RMSE, R² (regression) │
│ • AUC-ROC, AUC-PR (ranking) │
│ • Calibration metrics (Brier score) │
│ │
│ OPERATIONAL: │
│ ──────────── │
│ • Prediction latency (p50, p95, p99) │
│ • Throughput (predictions/second) │
│ • Error rate (failed predictions) │
│ • Resource utilization (CPU, memory, GPU) │
│ │
│ DATA QUALITY: │
│ ───────────── │
│ • Missing value rate │
│ • Feature value ranges │
│ • Feature correlation stability │
│ • Schema violations │
│ │
│ BUSINESS: │
│ ───────── │
│ • Conversion rate │
│ • Revenue impact │
│ • User engagement │
│ • Customer satisfaction │
│ │
└─────────────────────────────────────────────────────────────────┘
Monitoring Dashboard
```python
import numpy as np
import prometheus_client as prom

# Define metrics
prediction_latency = prom.Histogram(
    'model_prediction_latency_seconds',
    'Time spent processing prediction',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)
prediction_count = prom.Counter(
    'model_predictions_total',
    'Total number of predictions',
    ['model_version', 'prediction_class']
)
feature_value = prom.Gauge(
    'model_feature_value',
    'Feature value distribution',
    ['feature_name', 'statistic']
)

# Use in prediction endpoint
@prediction_latency.time()
def predict(features):
    prediction = model.predict(features)

    # Track prediction distribution
    prediction_count.labels(
        model_version='v1.2.0',
        prediction_class=str(prediction)
    ).inc()

    # Track feature statistics
    for name, value in features.items():
        feature_value.labels(
            feature_name=name,
            statistic='mean'
        ).set(np.mean(value))

    return prediction
```
Fairness Monitoring
Fairness Metrics
┌─────────────────────────────────────────────────────────────────┐
│ FAIRNESS METRICS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DEMOGRAPHIC PARITY: │
│ ─────────────────── │
│ P(Ŷ=1|A=0) = P(Ŷ=1|A=1) │
│ "Positive prediction rate should be equal across groups" │
│ │
│ EQUALIZED ODDS: │
│ ─────────────── │
│ P(Ŷ=1|Y=1,A=0) = P(Ŷ=1|Y=1,A=1) (Equal TPR) │
│ P(Ŷ=1|Y=0,A=0) = P(Ŷ=1|Y=0,A=1) (Equal FPR) │
│ "Error rates should be equal across groups" │
│ │
│ CALIBRATION: │
│ ──────────── │
│ P(Y=1|Ŷ=p,A=0) = P(Y=1|Ŷ=p,A=1) = p │
│ "Predicted probabilities should be accurate for all groups" │
│ │
│ Where: │
│ • A = Protected attribute (gender, race, age) │
│ • Y = True label │
│ • Ŷ = Predicted label │
│ │
└─────────────────────────────────────────────────────────────────┘
Fairness Monitoring Implementation
```python
import numpy as np
from sklearn.metrics import confusion_matrix

def calculate_fairness_metrics(y_true, y_pred, protected_attribute):
    """Calculate fairness metrics across protected groups."""
    groups = np.unique(protected_attribute)
    metrics = {}

    for group in groups:
        mask = protected_attribute == group
        y_true_group = y_true[mask]
        y_pred_group = y_pred[mask]
        tn, fp, fn, tp = confusion_matrix(y_true_group, y_pred_group).ravel()
        metrics[group] = {
            'positive_rate': (tp + fp) / len(y_pred_group),
            'tpr': tp / (tp + fn) if (tp + fn) > 0 else 0,
            'fpr': fp / (fp + tn) if (fp + tn) > 0 else 0,
            'accuracy': (tp + tn) / len(y_pred_group),
        }

    # Calculate disparities between the first two groups
    group_list = list(groups)
    if len(group_list) >= 2:
        metrics['disparity'] = {
            'positive_rate_ratio': metrics[group_list[0]]['positive_rate'] /
                                   max(metrics[group_list[1]]['positive_rate'], 0.001),
            'tpr_difference': abs(metrics[group_list[0]]['tpr'] -
                                  metrics[group_list[1]]['tpr']),
        }
    return metrics

# Alert if disparity exceeds threshold (the 80% rule)
fairness = calculate_fairness_metrics(y_true, y_pred, gender)
if fairness['disparity']['positive_rate_ratio'] < 0.8:
    alert("Potential fairness issue: positive rate disparity")
```
SLA Management
ML SLA Components
| SLA Component | Metric | Typical Target |
|---|---|---|
| Availability | Uptime percentage | 99.9% |
| Latency | p99 response time | < 100ms |
| Throughput | Requests per second | > 1000 RPS |
| Accuracy | Model performance | No more than 5% below training baseline |
| Freshness | Model age | < 7 days |
| Drift | PSI score | < 0.2 |
SLA Monitoring
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MLSLA:
    availability_target: float = 0.999
    latency_p99_ms: float = 100
    accuracy_threshold: float = 0.85
    max_drift_psi: float = 0.2
    max_model_age_days: int = 7

def check_sla_compliance(metrics: dict, sla: MLSLA) -> dict:
    """Check if current metrics meet SLA requirements."""
    violations = []

    # Availability check
    if metrics['availability'] < sla.availability_target:
        violations.append({
            'metric': 'availability',
            'current': metrics['availability'],
            'target': sla.availability_target,
            'severity': 'critical'
        })

    # Latency check
    if metrics['latency_p99'] > sla.latency_p99_ms:
        violations.append({
            'metric': 'latency_p99',
            'current': metrics['latency_p99'],
            'target': sla.latency_p99_ms,
            'severity': 'high'
        })

    # Accuracy check
    if metrics['accuracy'] < sla.accuracy_threshold:
        violations.append({
            'metric': 'accuracy',
            'current': metrics['accuracy'],
            'target': sla.accuracy_threshold,
            'severity': 'high'
        })

    # Drift check
    if metrics['drift_psi'] > sla.max_drift_psi:
        violations.append({
            'metric': 'drift_psi',
            'current': metrics['drift_psi'],
            'target': sla.max_drift_psi,
            'severity': 'medium'
        })

    # Model freshness check
    model_age = (datetime.now() - metrics['model_deployed_at']).days
    if model_age > sla.max_model_age_days:
        violations.append({
            'metric': 'model_age',
            'current': model_age,
            'target': sla.max_model_age_days,
            'severity': 'low'
        })

    return {
        'compliant': len(violations) == 0,
        'violations': violations
    }
```
Alerting Strategy
Alert Hierarchy
┌─────────────────────────────────────────────────────────────────┐
│ ALERTING HIERARCHY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CRITICAL (Page immediately): │
│ ──────────────────────────── │
│ • Service down (availability < 99%) │
│ • Error rate > 5% │
│ • Latency p99 > 5x baseline │
│ │
│ HIGH (Page during business hours): │
│ ────────────────────────────────── │
│ • Accuracy drop > 10% │
│ • Significant drift detected (PSI > 0.25) │
│ • Fairness violation │
│ │
│ MEDIUM (Slack notification): │
│ ──────────────────────────── │
│ • Moderate drift (0.1 < PSI < 0.25) │
│ • Accuracy drop 5-10% │
│ • Model approaching staleness │
│ │
│ LOW (Daily digest): │
│ ─────────────────── │
│ • Minor drift (PSI < 0.1) │
│ • Feature distribution changes │
│ • Resource utilization trends │
│ │
└─────────────────────────────────────────────────────────────────┘
Alert Fatigue Prevention
⚠️ Alert Fatigue
Too many alerts = no alerts. Engineers will ignore them. Be selective about what triggers alerts.
| Strategy | Description |
|---|---|
| Threshold tuning | Set thresholds based on historical data |
| Alert aggregation | Group related alerts together |
| Cooldown periods | Don't re-alert for same issue |
| Severity levels | Not everything is critical |
| Actionable alerts | Every alert should have a clear action |
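The "cooldown periods" strategy above can be sketched as a small wrapper that suppresses repeat alerts for the same issue within a window. This is a minimal sketch; the 30-minute default and the alert key format are illustrative assumptions:

```python
import time
from typing import Dict, Optional

class AlertCooldown:
    """Suppress duplicate alerts for the same key within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 1800):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}

    def should_alert(self, key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # same issue, still in cooldown: suppress
        self._last_sent[key] = now
        return True

# Usage: only the first drift alert per feature fires within 30 minutes
cooldown = AlertCooldown(cooldown_seconds=1800)
if cooldown.should_alert("drift:feature_age"):
    print("ALERT: drift on feature_age")
```

Production alerting systems (PagerDuty, Alertmanager) provide grouping and inhibition natively; a wrapper like this is mainly useful for custom monitoring jobs.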
✅ Operational Checklist
Monitoring Setup Gate
| Check | Required | Owner | Notes |
|---|---|---|---|
| Baseline metrics captured | ✓ | ML Engineer | From training data |
| Drift thresholds defined | ✓ | ML Lead | PSI > 0.1 warning, > 0.2 critical |
| SLA defined | ✓ | Product + ML Lead | Latency, accuracy, availability |
| Alerting configured | ✓ | Platform | PagerDuty/Slack integration |
| Dashboards created | ✓ | ML Engineer | Grafana/Datadog |
| Fairness baselines set | ✓ | ML Lead | Group metrics documented |
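Capturing baseline metrics from training data (the first checklist row) might look like the following sketch. The choice of statistics and the `baseline.json` filename are assumptions; the point is to snapshot training-time feature statistics that later drift checks compare against:

```python
import json
import numpy as np
from typing import Dict

def capture_baseline(features: Dict[str, np.ndarray],
                     path: str = "baseline.json") -> dict:
    """Snapshot per-feature training statistics for later production comparison."""
    baseline = {}
    for name, values in features.items():
        baseline[name] = {
            "mean": float(np.nanmean(values)),
            "std": float(np.nanstd(values)),
            "p01": float(np.nanpercentile(values, 1)),
            "p99": float(np.nanpercentile(values, 99)),
            "missing_rate": float(np.mean(np.isnan(values))),
        }
    # Persist alongside the model artifact so monitoring jobs can load it
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline
```

Storing the baseline as a versioned artifact next to the model keeps the reference distribution tied to the exact training run it came from.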
Ongoing Monitoring Cadence
| Check | Frequency | Owner |
|---|---|---|
| Latency/error review | Daily | Ops |
| Drift check | Daily | ML Engineer |
| Accuracy evaluation | Weekly (with labels) | ML Engineer |
| Fairness audit | Weekly | ML Lead |
| Business metric review | Weekly | PM + ML Lead |
| SLA compliance report | Monthly | ML Lead |
Alert Response Runbook
| Alert Type | Response Time | First Responder | Escalation |
|---|---|---|---|
| Service down | 5 min | On-call | Page ML Lead |
| Latency spike | 15 min | On-call | Investigate |
| Accuracy drop | 1 hour | ML Engineer | Consider rollback |
| Drift detected | 24 hours | ML Engineer | Plan retrain |
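The runbook above could be encoded so that alerts carry their response metadata programmatically. The `RUNBOOK` mapping mirrors the table; the field names and alert-type keys are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunbookEntry:
    response_time_minutes: int
    first_responder: str
    escalation: str

# Mirrors the alert response runbook table above
RUNBOOK = {
    "service_down": RunbookEntry(5, "On-call", "Page ML Lead"),
    "latency_spike": RunbookEntry(15, "On-call", "Investigate"),
    "accuracy_drop": RunbookEntry(60, "ML Engineer", "Consider rollback"),
    "drift_detected": RunbookEntry(24 * 60, "ML Engineer", "Plan retrain"),
}

def route_alert(alert_type: str) -> RunbookEntry:
    """Look up who responds and how fast; unknown alert types go to on-call."""
    return RUNBOOK.get(alert_type, RunbookEntry(15, "On-call", "Triage manually"))
```

Keeping the runbook in code (or config) next to the alert definitions prevents the documentation and the paging rules from drifting apart.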
📎 Cross-References
- 📎 Evaluation Fundamentals - Metrics definitions
- 📎 Deployment Patterns - Deployment monitoring
- 📎 Feature Engineering - Feature quality monitoring
- 📎 ML Security - Security monitoring
- 📎 ML Governance - SLA governance
- 📎 Data Leakage - Data quality issues