📈 ML Monitoring

Level: Ops Solves: Detect and handle model degradation in production, and ensure ML systems keep working correctly over time

Why Is ML Monitoring Different?

💡 Professor Tom

"Traditional software fails loudly. ML systems fail silently." Model có thể trả về predictions hoàn toàn sai mà không có error nào. Monitoring ML không chỉ là uptime và latency - đó là về chất lượng predictions theo thời gian.

┌─────────────────────────────────────────────────────────────────┐
│              ML MONITORING vs TRADITIONAL MONITORING            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  TRADITIONAL SOFTWARE:                                          │
│  ─────────────────────                                          │
│  • Is the service up? (health check)                            │
│  • Is latency acceptable? (response time)                       │
│  • Are there errors? (error rate)                               │
│  • Is resource usage OK? (CPU, memory)                          │
│                                                                 │
│  ML SYSTEMS (all above PLUS):                                   │
│  ─────────────────────────────                                  │
│  • Are predictions still accurate? (model performance)          │
│  • Has input data changed? (data drift)                         │
│  • Has target distribution changed? (concept drift)             │
│  • Are predictions fair? (bias monitoring)                      │
│  • Are features computed correctly? (feature quality)           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Types of Drift

Data Drift (Covariate Shift)

┌─────────────────────────────────────────────────────────────────┐
│                    DATA DRIFT                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Definition: Input feature distribution changes over time       │
│                                                                 │
│  Training Data          Production Data                         │
│  ┌─────────────┐        ┌─────────────┐                         │
│  │    ╱╲       │        │        ╱╲   │                         │
│  │   ╱  ╲      │   →    │       ╱  ╲  │                         │
│  │  ╱    ╲     │        │      ╱    ╲ │                         │
│  │ ╱      ╲    │        │     ╱      ╲│                         │
│  └─────────────┘        └─────────────┘                         │
│  Mean: 50               Mean: 70 (shifted!)                     │
│                                                                 │
│  Examples:                                                      │
│  • User demographics change (younger users join)                │
│  • Seasonal patterns (holiday shopping)                         │
│  • Market conditions change (economic downturn)                 │
│                                                                 │
│  Impact: Model may not generalize to new distribution           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Concept Drift

┌─────────────────────────────────────────────────────────────────┐
│                    CONCEPT DRIFT                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Definition: Relationship between features and target changes   │
│                                                                 │
│  Training Period        Production Period                       │
│  ┌─────────────┐        ┌─────────────┐                         │
│  │ X=high →    │        │ X=high →    │                         │
│  │ Y=positive  │   →    │ Y=negative  │                         │
│  └─────────────┘        └─────────────┘                         │
│                                                                 │
│  Examples:                                                      │
│  • Fraud patterns evolve (adversarial)                          │
│  • User preferences change (trends)                             │
│  • Economic conditions affect behavior                          │
│  • COVID changed everything                                     │
│                                                                 │
│  Impact: Model's learned patterns become invalid                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Prediction Drift

| Drift Type       | What Changes                | Detection Method                 |
|------------------|-----------------------------|----------------------------------|
| Data Drift       | Input distribution          | Statistical tests on features    |
| Concept Drift    | Feature-target relationship | Performance degradation          |
| Prediction Drift | Output distribution         | Statistical tests on predictions |
| Label Drift      | Target distribution         | Monitor label statistics         |
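Because concept drift is typically detected through performance degradation, a common pattern is a sliding-window accuracy monitor over recently labeled predictions. A minimal sketch - the `WindowedAccuracyMonitor` name and the thresholds are illustrative assumptions, not a standard API:

```python
from collections import deque

class WindowedAccuracyMonitor:
    """Alarm when rolling accuracy drops below (baseline - margin).

    A simple concept-drift proxy: if the feature-target relationship
    changes, accuracy on recent labeled samples degrades.
    """

    def __init__(self, baseline_accuracy: float, margin: float = 0.10, window: int = 100):
        self.baseline = baseline_accuracy
        self.margin = margin
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def observe(self, y_true, y_pred) -> bool:
        """Record one labeled prediction; return True if drift is suspected."""
        self.outcomes.append(1 if y_true == y_pred else 0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        rolling_acc = sum(self.outcomes) / len(self.outcomes)
        return rolling_acc < self.baseline - self.margin

# Usage: offline baseline accuracy 0.90, alarm if rolling accuracy < 0.80
monitor = WindowedAccuracyMonitor(baseline_accuracy=0.90, margin=0.10, window=100)
```

Note that this requires labels, which often arrive with delay; until labels are available, prediction drift on the output distribution is the usual stand-in.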

Drift Detection Methods

Statistical Tests

┌─────────────────────────────────────────────────────────────────┐
│              DRIFT DETECTION METHODS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  NUMERICAL FEATURES:                                            │
│  ───────────────────                                            │
│  • Kolmogorov-Smirnov (KS) test                                 │
│  • Population Stability Index (PSI)                             │
│  • Wasserstein distance                                         │
│  • Jensen-Shannon divergence                                    │
│                                                                 │
│  CATEGORICAL FEATURES:                                          │
│  ─────────────────────                                          │
│  • Chi-squared test                                             │
│  • Population Stability Index (PSI)                             │
│  • Jensen-Shannon divergence                                    │
│                                                                 │
│  MULTIVARIATE:                                                  │
│  ─────────────                                                  │
│  • Maximum Mean Discrepancy (MMD)                               │
│  • Domain classifier                                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
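The univariate tests above are available in `scipy.stats` (assuming scipy is installed). A minimal sketch for one numerical and one categorical feature - the significance threshold and the category counts are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

rng = np.random.default_rng(42)

# Numerical feature: two-sample Kolmogorov-Smirnov test.
# Null hypothesis: training and production samples share a distribution.
train_feature = rng.normal(loc=50, scale=10, size=2000)
prod_feature = rng.normal(loc=70, scale=10, size=2000)   # mean has shifted

ks_stat, ks_p = ks_2samp(train_feature, prod_feature)
if ks_p < 0.01:
    print(f"Numerical drift: KS={ks_stat:.3f}, p={ks_p:.2e}")

# Categorical feature: chi-squared test on the contingency table of
# category counts in the reference window vs the current window.
train_counts = np.array([800, 150, 50])   # hypothetical counts for categories A/B/C
prod_counts = np.array([500, 300, 200])

chi2, chi_p, _, _ = chi2_contingency(np.stack([train_counts, prod_counts]))
if chi_p < 0.01:
    print(f"Categorical drift: chi2={chi2:.1f}, p={chi_p:.2e}")
```

With large production windows these tests flag even tiny, harmless shifts, which is why distance measures like PSI or Wasserstein are often preferred at scale.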

Population Stability Index (PSI)

python
import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """
    Calculate Population Stability Index.
    
    PSI < 0.1: No significant change
    0.1 <= PSI < 0.2: Moderate change, investigate
    PSI >= 0.2: Significant change, action required
    """
    # Create bins from expected distribution
    breakpoints = np.percentile(expected, np.linspace(0, 100, bins + 1))
    breakpoints[0] = -np.inf
    breakpoints[-1] = np.inf
    
    # Calculate proportions
    expected_counts = np.histogram(expected, breakpoints)[0]
    actual_counts = np.histogram(actual, breakpoints)[0]
    
    expected_pct = expected_counts / len(expected)
    actual_pct = actual_counts / len(actual)
    
    # Avoid division by zero
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)
    
    # Calculate PSI
    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    
    return psi

# Usage
psi = calculate_psi(training_feature, production_feature)
if psi >= 0.2:
    alert("Significant drift detected!")

Drift Detection with Evidently

python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Create drift report
report = Report(metrics=[
    DataDriftPreset(),
    TargetDriftPreset(),
])

report.run(
    reference_data=training_df,
    current_data=production_df,
)

# Get results
drift_results = report.as_dict()

# Check for drift
if drift_results["metrics"][0]["result"]["dataset_drift"]:
    print("Data drift detected!")
    
# Save report
report.save_html("drift_report.html")

Performance Monitoring

Key Metrics to Track

┌─────────────────────────────────────────────────────────────────┐
│              ML PERFORMANCE METRICS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  MODEL QUALITY:                                                 │
│  ──────────────                                                 │
│  • Accuracy, Precision, Recall, F1 (classification)             │
│  • MAE, RMSE, R² (regression)                                   │
│  • AUC-ROC, AUC-PR (ranking)                                    │
│  • Calibration metrics (Brier score)                            │
│                                                                 │
│  OPERATIONAL:                                                   │
│  ────────────                                                   │
│  • Prediction latency (p50, p95, p99)                           │
│  • Throughput (predictions/second)                              │
│  • Error rate (failed predictions)                              │
│  • Resource utilization (CPU, memory, GPU)                      │
│                                                                 │
│  DATA QUALITY:                                                  │
│  ─────────────                                                  │
│  • Missing value rate                                           │
│  • Feature value ranges                                         │
│  • Feature correlation stability                                │
│  • Schema violations                                            │
│                                                                 │
│  BUSINESS:                                                      │
│  ─────────                                                      │
│  • Conversion rate                                              │
│  • Revenue impact                                               │
│  • User engagement                                              │
│  • Customer satisfaction                                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
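The data-quality checks listed above (missing values, value ranges, schema) can be computed with a few lines of pandas. This is a minimal sketch; the function name, thresholds, and the `expected_schema`/`value_ranges` mappings are illustrative assumptions:

```python
import pandas as pd

def data_quality_report(df: pd.DataFrame, expected_schema: dict, value_ranges: dict) -> dict:
    """Per-feature quality signals: missing rate, schema and range violations."""
    report = {}
    for col, dtype in expected_schema.items():
        issues = []
        if col not in df.columns:
            report[col] = {"missing_rate": 1.0, "issues": ["column absent"]}
            continue
        if str(df[col].dtype) != dtype:
            issues.append(f"dtype {df[col].dtype}, expected {dtype}")
        missing_rate = df[col].isna().mean()
        if col in value_ranges:
            lo, hi = value_ranges[col]
            out_of_range = ((df[col] < lo) | (df[col] > hi)).mean()
            if out_of_range > 0:
                issues.append(f"{out_of_range:.1%} values outside [{lo}, {hi}]")
        report[col] = {"missing_rate": float(missing_rate), "issues": issues}
    return report

# Usage with a hypothetical feature frame
df = pd.DataFrame({"age": [25, 40, None, 130], "income": [30_000, 55_000, 42_000, 61_000]})
report = data_quality_report(
    df,
    expected_schema={"age": "float64", "income": "int64"},
    value_ranges={"age": (0, 120)},
)
```

In practice these checks run on each batch or window of production data, and the results feed the same alerting pipeline as the drift metrics.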

Monitoring Dashboard

python
import numpy as np  # used below for feature statistics
import prometheus_client as prom

# Define metrics
prediction_latency = prom.Histogram(
    'model_prediction_latency_seconds',
    'Time spent processing prediction',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)

prediction_count = prom.Counter(
    'model_predictions_total',
    'Total number of predictions',
    ['model_version', 'prediction_class']
)

feature_value = prom.Gauge(
    'model_feature_value',
    'Feature value distribution',
    ['feature_name', 'statistic']
)

# Use in prediction endpoint
@prediction_latency.time()
def predict(features):
    prediction = model.predict(features)
    
    # Track prediction distribution
    prediction_count.labels(
        model_version='v1.2.0',
        prediction_class=str(prediction)
    ).inc()
    
    # Track feature statistics
    for name, value in features.items():
        feature_value.labels(
            feature_name=name,
            statistic='mean'
        ).set(np.mean(value))
    
    return prediction

Fairness Monitoring

Fairness Metrics

┌─────────────────────────────────────────────────────────────────┐
│              FAIRNESS METRICS                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DEMOGRAPHIC PARITY:                                            │
│  ───────────────────                                            │
│  P(Ŷ=1|A=0) = P(Ŷ=1|A=1)                                        │
│  "Positive prediction rate should be equal across groups"       │
│                                                                 │
│  EQUALIZED ODDS:                                                │
│  ───────────────                                                │
│  P(Ŷ=1|Y=1,A=0) = P(Ŷ=1|Y=1,A=1)  (Equal TPR)                   │
│  P(Ŷ=1|Y=0,A=0) = P(Ŷ=1|Y=0,A=1)  (Equal FPR)                   │
│  "Error rates should be equal across groups"                    │
│                                                                 │
│  CALIBRATION:                                                   │
│  ────────────                                                   │
│  P(Y=1|Ŷ=p,A=0) = P(Y=1|Ŷ=p,A=1) = p                            │
│  "Predicted probabilities should be accurate for all groups"   │
│                                                                 │
│  Where:                                                         │
│  • A = Protected attribute (gender, race, age)                  │
│  • Y = True label                                               │
│  • Ŷ = Predicted label                                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Fairness Monitoring Implementation

python
import numpy as np
from sklearn.metrics import confusion_matrix

def calculate_fairness_metrics(y_true, y_pred, protected_attribute):
    """Calculate fairness metrics across protected groups."""
    
    groups = np.unique(protected_attribute)
    metrics = {}
    
    for group in groups:
        mask = protected_attribute == group
        y_true_group = y_true[mask]
        y_pred_group = y_pred[mask]
        
        tn, fp, fn, tp = confusion_matrix(y_true_group, y_pred_group).ravel()
        
        metrics[group] = {
            'positive_rate': (tp + fp) / len(y_pred_group),
            'tpr': tp / (tp + fn) if (tp + fn) > 0 else 0,
            'fpr': fp / (fp + tn) if (fp + tn) > 0 else 0,
            'accuracy': (tp + tn) / len(y_pred_group),
        }
    
    # Calculate disparities
    group_list = list(groups)
    if len(group_list) >= 2:
        metrics['disparity'] = {
            'positive_rate_ratio': metrics[group_list[0]]['positive_rate'] / 
                                   max(metrics[group_list[1]]['positive_rate'], 0.001),
            'tpr_difference': abs(metrics[group_list[0]]['tpr'] - 
                                  metrics[group_list[1]]['tpr']),
        }
    
    return metrics

# Alert if disparity exceeds threshold
fairness = calculate_fairness_metrics(y_true, y_pred, gender)
if fairness['disparity']['positive_rate_ratio'] < 0.8:
    alert("Potential fairness issue: positive rate disparity")

SLA Management

ML SLA Components

| SLA Component | Metric              | Typical Target  |
|---------------|---------------------|-----------------|
| Availability  | Uptime percentage   | 99.9%           |
| Latency       | p99 response time   | < 100ms         |
| Throughput    | Requests per second | > 1000 RPS      |
| Accuracy      | Model performance   | > baseline + 5% |
| Freshness     | Model age           | < 7 days        |
| Drift         | PSI score           | < 0.2           |

SLA Monitoring

python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MLSLA:
    availability_target: float = 0.999
    latency_p99_ms: float = 100
    accuracy_threshold: float = 0.85
    max_drift_psi: float = 0.2
    max_model_age_days: int = 7

def check_sla_compliance(metrics: dict, sla: MLSLA) -> dict:
    """Check if current metrics meet SLA requirements."""
    
    violations = []
    
    # Availability check
    if metrics['availability'] < sla.availability_target:
        violations.append({
            'metric': 'availability',
            'current': metrics['availability'],
            'target': sla.availability_target,
            'severity': 'critical'
        })
    
    # Latency check
    if metrics['latency_p99'] > sla.latency_p99_ms:
        violations.append({
            'metric': 'latency_p99',
            'current': metrics['latency_p99'],
            'target': sla.latency_p99_ms,
            'severity': 'high'
        })
    
    # Accuracy check
    if metrics['accuracy'] < sla.accuracy_threshold:
        violations.append({
            'metric': 'accuracy',
            'current': metrics['accuracy'],
            'target': sla.accuracy_threshold,
            'severity': 'high'
        })
    
    # Drift check
    if metrics['drift_psi'] > sla.max_drift_psi:
        violations.append({
            'metric': 'drift_psi',
            'current': metrics['drift_psi'],
            'target': sla.max_drift_psi,
            'severity': 'medium'
        })
    
    # Model freshness check
    model_age = (datetime.now() - metrics['model_deployed_at']).days
    if model_age > sla.max_model_age_days:
        violations.append({
            'metric': 'model_age',
            'current': model_age,
            'target': sla.max_model_age_days,
            'severity': 'low'
        })
    
    return {
        'compliant': len(violations) == 0,
        'violations': violations
    }

Alerting Strategy

Alert Hierarchy

┌─────────────────────────────────────────────────────────────────┐
│              ALERTING HIERARCHY                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  CRITICAL (Page immediately):                                   │
│  ────────────────────────────                                   │
│  • Service down (availability < 99%)                            │
│  • Error rate > 5%                                              │
│  • Latency p99 > 5x baseline                                    │
│                                                                 │
│  HIGH (Page during business hours):                             │
│  ──────────────────────────────────                             │
│  • Accuracy drop > 10%                                          │
│  • Significant drift detected (PSI > 0.25)                      │
│  • Fairness violation                                           │
│                                                                 │
│  MEDIUM (Slack notification):                                   │
│  ────────────────────────────                                   │
│  • Moderate drift (0.1 < PSI < 0.25)                            │
│  • Accuracy drop 5-10%                                          │
│  • Model approaching staleness                                  │
│                                                                 │
│  LOW (Daily digest):                                            │
│  ───────────────────                                            │
│  • Minor drift (PSI < 0.1)                                      │
│  • Feature distribution changes                                 │
│  • Resource utilization trends                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Alert Fatigue Prevention

⚠️ Alert Fatigue

Too many alerts = no alerts. Engineers will ignore them. Be selective about what triggers alerts.

| Strategy          | Description                             |
|-------------------|-----------------------------------------|
| Threshold tuning  | Set thresholds based on historical data |
| Alert aggregation | Group related alerts together           |
| Cooldown periods  | Don't re-alert for the same issue       |
| Severity levels   | Not everything is critical              |
| Actionable alerts | Every alert should have a clear action  |
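Cooldown periods are straightforward to implement as a gate in front of the alert sender. A minimal sketch - the `AlertCooldown` name and the injectable clock are illustrative, not part of any alerting library:

```python
import time

class AlertCooldown:
    """Suppress repeat alerts for the same (alert_name, model) key
    until a cooldown period has elapsed - one way to cut alert fatigue."""

    def __init__(self, cooldown_seconds: float = 3600, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock                         # injectable for testing
        self.last_fired: dict[tuple, float] = {}

    def should_fire(self, alert_name: str, model: str) -> bool:
        key = (alert_name, model)
        now = self.clock()
        last = self.last_fired.get(key)
        if last is not None and now - last < self.cooldown:
            return False                           # still cooling down: suppress
        self.last_fired[key] = now
        return True

# Usage: the first drift alert fires; an identical one a minute later is suppressed
gate = AlertCooldown(cooldown_seconds=3600)
```

Aggregation works the same way: group alerts by key before they reach the pager, and send one summary instead of a burst.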

Operational Checklist

Monitoring Setup Gate

| Check                     | Required | Owner             | Notes                             |
|---------------------------|----------|-------------------|-----------------------------------|
| Baseline metrics captured |          | ML Engineer       | From training data                |
| Drift thresholds defined  |          | ML Lead           | PSI > 0.1 warning, > 0.2 critical |
| SLA defined               |          | Product + ML Lead | Latency, accuracy, availability   |
| Alerting configured       |          | Platform          | PagerDuty/Slack integration       |
| Dashboards created        |          | ML Engineer       | Grafana/Datadog                   |
| Fairness baselines set    |          | ML Lead           | Group metrics documented          |

Ongoing Monitoring Cadence

| Check                  | Frequency            | Owner        |
|------------------------|----------------------|--------------|
| Latency/error review   | Daily                | Ops          |
| Drift check            | Daily                | ML Engineer  |
| Accuracy evaluation    | Weekly (with labels) | ML Engineer  |
| Fairness audit         | Weekly               | ML Lead      |
| Business metric review | Weekly               | PM + ML Lead |
| SLA compliance report  | Monthly              | ML Lead      |

Alert Response Runbook

| Alert Type     | Response Time | First Responder | Escalation        |
|----------------|---------------|-----------------|-------------------|
| Service down   | 5 min         | On-call         | Page ML Lead      |
| Latency spike  | 15 min        | On-call         | Investigate       |
| Accuracy drop  | 1 hour        | ML Engineer     | Consider rollback |
| Drift detected | 24 hours      | ML Engineer     | Plan retrain      |

📎 Cross-References