📈 ML Monitoring
Level: Ops Solves: Detect and handle model degradation in production, ensuring ML systems keep working correctly over time
Why is ML Monitoring Different?
💡 Professor Tom
"Traditional software fails loudly. ML systems fail silently." A model can return completely wrong predictions without raising a single error. Monitoring ML is not just about uptime and latency; it is about the quality of predictions over time.
┌─────────────────────────────────────────────────────────────────┐
│ ML MONITORING vs TRADITIONAL MONITORING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL SOFTWARE: │
│ ───────────────────── │
│ • Is the service up? (health check) │
│ • Is latency acceptable? (response time) │
│ • Are there errors? (error rate) │
│ • Is resource usage OK? (CPU, memory) │
│ │
│ ML SYSTEMS (all above PLUS): │
│ ───────────────────────────── │
│ • Are predictions still accurate? (model performance) │
│ • Has input data changed? (data drift) │
│ • Has target distribution changed? (concept drift) │
│ • Are predictions fair? (bias monitoring) │
│ • Are features computed correctly? (feature quality) │
│ │
└─────────────────────────────────────────────────────────────────┘
Types of Drift
Data Drift (Covariate Shift)
┌─────────────────────────────────────────────────────────────────┐
│ DATA DRIFT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Definition: Input feature distribution changes over time │
│ │
│ Training Data Production Data │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ ╱╲ │ │ ╱╲ │ │
│ │ ╱ ╲ │ → │ ╱ ╲ │ │
│ │ ╱ ╲ │ │ ╱ ╲ │ │
│ │ ╱ ╲ │ │ ╱ ╲│ │
│ └─────────────┘ └─────────────┘ │
│ Mean: 50 Mean: 70 (shifted!) │
│ │
│ Examples: │
│ • User demographics change (younger users join) │
│ • Seasonal patterns (holiday shopping) │
│ • Market conditions change (economic downturn) │
│ │
│ Impact: Model may not generalize to new distribution │
│ │
└─────────────────────────────────────────────────────────────────┘
Concept Drift
┌─────────────────────────────────────────────────────────────────┐
│ CONCEPT DRIFT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Definition: Relationship between features and target changes │
│ │
│ Training Period Production Period │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ X=high → │ │ X=high → │ │
│ │ Y=positive │ → │ Y=negative │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Examples: │
│ • Fraud patterns evolve (adversarial) │
│ • User preferences change (trends) │
│ • Economic conditions affect behavior │
│ • COVID changed everything │
│ │
│ Impact: Model's learned patterns become invalid │
│ │
└─────────────────────────────────────────────────────────────────┘
Prediction Drift
| Drift Type | What Changes | Detection Method |
|---|---|---|
| Data Drift | Input distribution | Statistical tests on features |
| Concept Drift | Feature-target relationship | Performance degradation |
| Prediction Drift | Output distribution | Statistical tests on predictions |
| Label Drift | Target distribution | Monitor label statistics |
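As a concrete example of the "statistical tests on features" row, here is a minimal sketch of data drift detection using SciPy's two-sample Kolmogorov-Smirnov test. The function name, the synthetic data, and the 0.05 significance threshold are illustrative assumptions, not part of any specific library API beyond `scipy.stats.ks_2samp`:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(reference: np.ndarray, current: np.ndarray,
                         alpha: float = 0.05) -> bool:
    """Two-sample KS test: a low p-value means the distributions differ."""
    statistic, p_value = ks_2samp(reference, current)
    return bool(p_value < alpha)

# Illustrative data mirroring the diagram above: production mean shifted 50 -> 70
rng = np.random.default_rng(0)
training = rng.normal(50, 10, 5000)
production = rng.normal(70, 10, 5000)
print(detect_feature_drift(training, production))  # True: distribution shifted
```

In practice you would run this per feature against a stored training-time reference sample, and combine it with PSI for a magnitude estimate, since a KS test on large samples can flag tiny, harmless shifts.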
Drift Detection Methods
Statistical Tests
┌─────────────────────────────────────────────────────────────────┐
│ DRIFT DETECTION METHODS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ NUMERICAL FEATURES: │
│ ─────────────────── │
│ • Kolmogorov-Smirnov (KS) test │
│ • Population Stability Index (PSI) │
│ • Wasserstein distance │
│ • Jensen-Shannon divergence │
│ │
│ CATEGORICAL FEATURES: │
│ ───────────────────── │
│ • Chi-squared test │
│ • Population Stability Index (PSI) │
│ • Jensen-Shannon divergence │
│ │
│ MULTIVARIATE: │
│ ───────────── │
│ • Maximum Mean Discrepancy (MMD) │
│ • Domain classifier │
│ │
└─────────────────────────────────────────────────────────────────┘
Population Stability Index (PSI)
```python
import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """
    Calculate Population Stability Index.

    PSI < 0.1: No significant change
    0.1 <= PSI < 0.2: Moderate change, investigate
    PSI >= 0.2: Significant change, action required
    """
    # Create bins from expected distribution
    breakpoints = np.percentile(expected, np.linspace(0, 100, bins + 1))
    breakpoints[0] = -np.inf
    breakpoints[-1] = np.inf

    # Calculate proportions
    expected_counts = np.histogram(expected, breakpoints)[0]
    actual_counts = np.histogram(actual, breakpoints)[0]
    expected_pct = expected_counts / len(expected)
    actual_pct = actual_counts / len(actual)

    # Avoid division by zero
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)

    # Calculate PSI
    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return psi

# Usage
psi = calculate_psi(training_feature, production_feature)
if psi >= 0.2:
    alert("Significant drift detected!")
```
Drift Detection with Evidently
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Create drift report
report = Report(metrics=[
    DataDriftPreset(),
    TargetDriftPreset(),
])
report.run(
    reference_data=training_df,
    current_data=production_df,
)

# Get results
drift_results = report.as_dict()

# Check for drift
if drift_results["metrics"][0]["result"]["dataset_drift"]:
    print("Data drift detected!")

# Save report
report.save_html("drift_report.html")
```
Performance Monitoring
Key Metrics to Track
┌─────────────────────────────────────────────────────────────────┐
│ ML PERFORMANCE METRICS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ MODEL QUALITY: │
│ ────────────── │
│ • Accuracy, Precision, Recall, F1 (classification) │
│ • MAE, RMSE, R² (regression) │
│ • AUC-ROC, AUC-PR (ranking) │
│ • Calibration metrics (Brier score) │
│ │
│ OPERATIONAL: │
│ ──────────── │
│ • Prediction latency (p50, p95, p99) │
│ • Throughput (predictions/second) │
│ • Error rate (failed predictions) │
│ • Resource utilization (CPU, memory, GPU) │
│ │
│ DATA QUALITY: │
│ ───────────── │
│ • Missing value rate │
│ • Feature value ranges │
│ • Feature correlation stability │
│ • Schema violations │
│ │
│ BUSINESS: │
│ ───────── │
│ • Conversion rate │
│ • Revenue impact │
│ • User engagement │
│ • Customer satisfaction │
│ │
└─────────────────────────────────────────────────────────────────┘
Monitoring Dashboard
```python
import numpy as np
import prometheus_client as prom

# Define metrics
prediction_latency = prom.Histogram(
    'model_prediction_latency_seconds',
    'Time spent processing prediction',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)
prediction_count = prom.Counter(
    'model_predictions_total',
    'Total number of predictions',
    ['model_version', 'prediction_class']
)
feature_value = prom.Gauge(
    'model_feature_value',
    'Feature value distribution',
    ['feature_name', 'statistic']
)

# Use in prediction endpoint
@prediction_latency.time()
def predict(features):
    prediction = model.predict(features)

    # Track prediction distribution
    prediction_count.labels(
        model_version='v1.2.0',
        prediction_class=str(prediction)
    ).inc()

    # Track feature statistics
    for name, value in features.items():
        feature_value.labels(
            feature_name=name,
            statistic='mean'
        ).set(np.mean(value))

    return prediction
```
Fairness Monitoring
Fairness Metrics
┌─────────────────────────────────────────────────────────────────┐
│ FAIRNESS METRICS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DEMOGRAPHIC PARITY: │
│ ─────────────────── │
│ P(Ŷ=1|A=0) = P(Ŷ=1|A=1) │
│ "Positive prediction rate should be equal across groups" │
│ │
│ EQUALIZED ODDS: │
│ ─────────────── │
│ P(Ŷ=1|Y=1,A=0) = P(Ŷ=1|Y=1,A=1) (Equal TPR) │
│ P(Ŷ=1|Y=0,A=0) = P(Ŷ=1|Y=0,A=1) (Equal FPR) │
│ "Error rates should be equal across groups" │
│ │
│ CALIBRATION: │
│ ──────────── │
│ P(Y=1|Ŷ=p,A=0) = P(Y=1|Ŷ=p,A=1) = p │
│ "Predicted probabilities should be accurate for all groups" │
│ │
│ Where: │
│ • A = Protected attribute (gender, race, age) │
│ • Y = True label │
│ • Ŷ = Predicted label │
│ │
└─────────────────────────────────────────────────────────────────┘
Fairness Monitoring Implementation
```python
import numpy as np
from sklearn.metrics import confusion_matrix

def calculate_fairness_metrics(y_true, y_pred, protected_attribute):
    """Calculate fairness metrics across protected groups."""
    groups = np.unique(protected_attribute)
    metrics = {}

    for group in groups:
        mask = protected_attribute == group
        y_true_group = y_true[mask]
        y_pred_group = y_pred[mask]
        tn, fp, fn, tp = confusion_matrix(y_true_group, y_pred_group).ravel()
        metrics[group] = {
            'positive_rate': (tp + fp) / len(y_pred_group),
            'tpr': tp / (tp + fn) if (tp + fn) > 0 else 0,
            'fpr': fp / (fp + tn) if (fp + tn) > 0 else 0,
            'accuracy': (tp + tn) / len(y_pred_group),
        }

    # Calculate disparities between the first two groups
    group_list = list(groups)
    if len(group_list) >= 2:
        metrics['disparity'] = {
            'positive_rate_ratio': metrics[group_list[0]]['positive_rate'] /
                                   max(metrics[group_list[1]]['positive_rate'], 0.001),
            'tpr_difference': abs(metrics[group_list[0]]['tpr'] -
                                  metrics[group_list[1]]['tpr']),
        }
    return metrics

# Alert if disparity exceeds threshold (the 80% rule)
fairness = calculate_fairness_metrics(y_true, y_pred, gender)
if fairness['disparity']['positive_rate_ratio'] < 0.8:
    alert("Potential fairness issue: positive rate disparity")
```
SLA Management
ML SLA Components
| SLA Component | Metric | Typical Target |
|---|---|---|
| Availability | Uptime percentage | 99.9% |
| Latency | p99 response time | < 100ms |
| Throughput | Requests per second | > 1000 RPS |
| Accuracy | Model performance | No more than 5% below training baseline |
| Freshness | Model age | < 7 days |
| Drift | PSI score | < 0.2 |
SLA Monitoring
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MLSLA:
    availability_target: float = 0.999
    latency_p99_ms: float = 100
    accuracy_threshold: float = 0.85
    max_drift_psi: float = 0.2
    max_model_age_days: int = 7

def check_sla_compliance(metrics: dict, sla: MLSLA) -> dict:
    """Check if current metrics meet SLA requirements."""
    violations = []

    # Availability check
    if metrics['availability'] < sla.availability_target:
        violations.append({
            'metric': 'availability',
            'current': metrics['availability'],
            'target': sla.availability_target,
            'severity': 'critical'
        })

    # Latency check
    if metrics['latency_p99'] > sla.latency_p99_ms:
        violations.append({
            'metric': 'latency_p99',
            'current': metrics['latency_p99'],
            'target': sla.latency_p99_ms,
            'severity': 'high'
        })

    # Accuracy check
    if metrics['accuracy'] < sla.accuracy_threshold:
        violations.append({
            'metric': 'accuracy',
            'current': metrics['accuracy'],
            'target': sla.accuracy_threshold,
            'severity': 'high'
        })

    # Drift check
    if metrics['drift_psi'] > sla.max_drift_psi:
        violations.append({
            'metric': 'drift_psi',
            'current': metrics['drift_psi'],
            'target': sla.max_drift_psi,
            'severity': 'medium'
        })

    # Model freshness check
    model_age = (datetime.now() - metrics['model_deployed_at']).days
    if model_age > sla.max_model_age_days:
        violations.append({
            'metric': 'model_age',
            'current': model_age,
            'target': sla.max_model_age_days,
            'severity': 'low'
        })

    return {
        'compliant': len(violations) == 0,
        'violations': violations
    }
```
Alerting Strategy
Alert Hierarchy
┌─────────────────────────────────────────────────────────────────┐
│ ALERTING HIERARCHY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CRITICAL (Page immediately): │
│ ──────────────────────────── │
│ • Service down (availability < 99%) │
│ • Error rate > 5% │
│ • Latency p99 > 5x baseline │
│ │
│ HIGH (Page during business hours): │
│ ────────────────────────────────── │
│ • Accuracy drop > 10% │
│ • Significant drift detected (PSI > 0.25) │
│ • Fairness violation │
│ │
│ MEDIUM (Slack notification): │
│ ──────────────────────────── │
│ • Moderate drift (0.1 < PSI < 0.25) │
│ • Accuracy drop 5-10% │
│ • Model approaching staleness │
│ │
│ LOW (Daily digest): │
│ ─────────────────── │
│ • Minor drift (PSI < 0.1) │
│ • Feature distribution changes │
│ • Resource utilization trends │
│ │
└─────────────────────────────────────────────────────────────────┘
Alert Fatigue Prevention
⚠️ Alert Fatigue
Too many alerts = no alerts. Engineers will ignore them. Be selective about what triggers alerts.
| Strategy | Description |
|---|---|
| Threshold tuning | Set thresholds based on historical data |
| Alert aggregation | Group related alerts together |
| Cooldown periods | Don't re-alert for same issue |
| Severity levels | Not everything is critical |
| Actionable alerts | Every alert should have a clear action |
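The "cooldown periods" strategy above can be sketched as a small wrapper that suppresses repeat alerts for the same issue within a window. This is a minimal sketch; the 30-minute default and the alert key format are illustrative assumptions:

```python
import time
from typing import Dict, Optional

class AlertCooldown:
    """Suppress duplicate alerts for the same key within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 1800):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}

    def should_alert(self, key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # same issue, still in cooldown: suppress
        self._last_sent[key] = now
        return True

# Usage: only the first drift alert per feature fires within 30 minutes
cooldown = AlertCooldown(cooldown_seconds=1800)
if cooldown.should_alert("drift:feature_age"):
    print("ALERT: drift on feature_age")
```

Production alerting systems (PagerDuty, Alertmanager) provide grouping and inhibition natively; a wrapper like this is mainly useful for custom monitoring jobs.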
✅ Operational Checklist
Monitoring Setup Gate
| Check | Required | Owner | Notes |
|---|---|---|---|
| Baseline metrics captured | ✓ | ML Engineer | From training data |
| Drift thresholds defined | ✓ | ML Lead | PSI > 0.1 warning, > 0.2 critical |
| SLA defined | ✓ | Product + ML Lead | Latency, accuracy, availability |
| Alerting configured | ✓ | Platform | PagerDuty/Slack integration |
| Dashboards created | ✓ | ML Engineer | Grafana/Datadog |
| Fairness baselines set | ✓ | ML Lead | Group metrics documented |
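Capturing baseline metrics from training data (the first checklist row) might look like the following sketch. The choice of statistics and the `baseline.json` filename are assumptions; the point is to snapshot training-time feature statistics that later drift checks compare against:

```python
import json
import numpy as np
from typing import Dict

def capture_baseline(features: Dict[str, np.ndarray],
                     path: str = "baseline.json") -> dict:
    """Snapshot per-feature training statistics for later production comparison."""
    baseline = {}
    for name, values in features.items():
        baseline[name] = {
            "mean": float(np.nanmean(values)),
            "std": float(np.nanstd(values)),
            "p01": float(np.nanpercentile(values, 1)),
            "p99": float(np.nanpercentile(values, 99)),
            "missing_rate": float(np.mean(np.isnan(values))),
        }
    # Persist alongside the model artifact so monitoring jobs can load it
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline
```

Storing the baseline as a versioned artifact next to the model keeps the reference distribution tied to the exact training run it came from.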
Ongoing Monitoring Cadence
| Check | Frequency | Owner |
|---|---|---|
| Latency/error review | Daily | Ops |
| Drift check | Daily | ML Engineer |
| Accuracy evaluation | Weekly (with labels) | ML Engineer |
| Fairness audit | Weekly | ML Lead |
| Business metric review | Weekly | PM + ML Lead |
| SLA compliance report | Monthly | ML Lead |
Alert Response Runbook
| Alert Type | Response Time | First Responder | Escalation |
|---|---|---|---|
| Service down | 5 min | On-call | Page ML Lead |
| Latency spike | 15 min | On-call | Investigate |
| Accuracy drop | 1 hour | ML Engineer | Consider rollback |
| Drift detected | 24 hours | ML Engineer | Plan retrain |
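The runbook above could be encoded so that alerts carry their response metadata programmatically. The `RUNBOOK` mapping mirrors the table; the field names and alert-type keys are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunbookEntry:
    response_time_minutes: int
    first_responder: str
    escalation: str

# Mirrors the alert response runbook table above
RUNBOOK = {
    "service_down": RunbookEntry(5, "On-call", "Page ML Lead"),
    "latency_spike": RunbookEntry(15, "On-call", "Investigate"),
    "accuracy_drop": RunbookEntry(60, "ML Engineer", "Consider rollback"),
    "drift_detected": RunbookEntry(24 * 60, "ML Engineer", "Plan retrain"),
}

def route_alert(alert_type: str) -> RunbookEntry:
    """Look up who responds and how fast; unknown alert types go to on-call."""
    return RUNBOOK.get(alert_type, RunbookEntry(15, "On-call", "Triage manually"))
```

Keeping the runbook in code (or config) next to the alert definitions prevents the documentation and the paging rules from drifting apart.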
📎 Cross-References
- 📎 Evaluation Fundamentals - Metrics definitions
- 📎 Deployment Patterns - Deployment monitoring
- 📎 Feature Engineering - Feature quality monitoring
- 📎 ML Security - Security monitoring
- 📎 ML Governance - SLA governance
- 📎 Data Leakage - Data quality issues