
🚨 Data Leakage & Evaluation Pitfalls

Level: Foundation · Solves: Prevent "too good to be true" models that fail catastrophically in production

❌ The #1 ML Failure Mode

Data leakage is the leading cause of "the model works great offline but fails in production". It is invisible in offline metrics; it shows up only in business impact.

Leakage Taxonomy

┌─────────────────────────────────────────────────────────────────┐
│                    LEAKAGE TAXONOMY                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. TARGET LEAKAGE                                              │
│  ────────────────                                               │
│  Feature contains direct information about target               │
│                                                                 │
│  Example: Predicting fraud                                      │
│  ❌ Feature: "fraud_investigation_date" (only exists if fraud)  │
│  → Model learns: if investigation_date exists → fraud           │
│  → 99.9% AUC offline, 50% in production                         │
│                                                                 │
│  2. TEMPORAL LEAKAGE                                            │
│  ──────────────────                                             │
│  Using future information to predict past                       │
│                                                                 │
│  Example: Predict next-day stock price                          │
│  ❌ Using data from day T+1 to predict day T                    │
│  ❌ Random train/test split instead of temporal split           │
│                                                                 │
│  3. PREPROCESSING LEAKAGE                                       │
│  ─────────────────────────                                      │
│  Test data influences training preprocessing                    │
│                                                                 │
│  Example: Feature scaling                                       │
│  ❌ scaler.fit_transform(all_data) then split                   │
│  → Test set statistics leak into training                       │
│                                                                 │
│  4. FEATURE LEAKAGE                                             │
│  ─────────────────                                              │
│  Feature derived from target or unavailable at prediction time  │
│                                                                 │
│  Example: Predict customer churn                                │
│  ❌ Feature: "cancellation_reason" (only exists after churn)    │
│  ❌ Feature: "days_until_churn" (literally the target)          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
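The preprocessing-leakage pattern (case 3 above) can be demonstrated in a few lines. This is a minimal sketch with synthetic data, where the test distribution is deliberately shifted so the leak is visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(100, 1))
test = rng.normal(loc=5.0, scale=1.0, size=(50, 1))   # shifted distribution

# ❌ LEAKY: scaler statistics include the test set
leaky = StandardScaler().fit(np.vstack([train, test]))

# ✅ CORRECT: fit on train only, then transform test with train statistics
clean = StandardScaler().fit(train)

# The leaky scaler's mean is pulled toward the test distribution
print(leaky.mean_[0], clean.mean_[0])
```

The leaky scaler's mean sits between the two distributions, so the training data is normalized with statistics the model could never have seen at deployment time.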

Detection Strategies

Red Flags Checklist

| Red Flag | What It Means | Investigation |
|---|---|---|
| AUC > 0.95 on first try | Suspiciously good | Check for target leakage |
| Perfect score on a subset | Data issue | Examine that subset |
| Performance drops in production | Train/deploy mismatch | Audit the feature pipeline |
| Feature importance shows an unexpected winner | Possible proxy for the target | Analyze correlation |
| Model "learns" in 1 epoch | Too easy | Check data splitting |

Statistical Detection

```python
import numpy as np
import pandas as pd


def detect_target_leakage(df: pd.DataFrame, target: str, threshold: float = 0.9):
    """
    Detect features suspiciously correlated with the target.

    Returns numeric features with |correlation| > threshold, plus
    categorical values that only ever occur with one target class.
    """
    suspicious = []

    for col in df.columns:
        if col == target:
            continue

        if pd.api.types.is_numeric_dtype(df[col]):
            corr = abs(df[col].corr(df[target]))
            if pd.notna(corr) and corr > threshold:
                suspicious.append({
                    'feature': col,
                    'correlation': corr,
                    'reason': 'high_correlation'
                })
        else:
            # For categoricals: flag values whose subset has a single target class
            for val in df[col].dropna().unique():
                subset = df[df[col] == val]
                if subset[target].nunique() == 1:
                    suspicious.append({
                        'feature': col,
                        'value': val,
                        'reason': 'single_class_subset'
                    })

    return suspicious


def detect_temporal_leakage(df: pd.DataFrame, time_col: str,
                            train_mask: pd.Series, test_mask: pd.Series):
    """
    Check whether any test samples are timestamped at or before
    the latest training sample.
    """
    train_max = df.loc[train_mask, time_col].max()
    test_min = df.loc[test_mask, time_col].min()

    if test_min <= train_max:
        overlap = df.loc[test_mask & (df[time_col] <= train_max)]
        return {
            'leakage_detected': True,
            'overlapping_samples': len(overlap),
            'train_max': train_max,
            'test_min': test_min
        }
    return {'leakage_detected': False}
```
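The detectors above cover target and temporal leakage. A third common source, entity (group) leakage, where the same user or session lands in both splits, can be screened with a simple overlap check. The helper name and toy IDs here are illustrative:

```python
import pandas as pd

def detect_entity_overlap(train_ids: pd.Series, test_ids: pd.Series) -> set:
    """Entities appearing in both splits leak information across them."""
    return set(train_ids) & set(test_ids)

train_users = pd.Series([1, 2, 3, 4])
test_users = pd.Series([4, 5, 6])
overlap = detect_entity_overlap(train_users, test_users)
print(overlap)  # user 4 appears in both splits
```

Any non-empty overlap means the split must be redone with a group-aware strategy such as `GroupKFold`.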

Feature Availability Audit

┌─────────────────────────────────────────────────────────────────┐
│               FEATURE AVAILABILITY MATRIX                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  For EACH feature, answer:                                      │
│                                                                 │
│  1. WHEN is this feature available?                             │
│     □ At prediction time (T+0)                                  │
│     □ After prediction (T+1, T+7, ...)                          │
│     □ Only after target is known                                │
│                                                                 │
│  2. HOW is this feature computed?                               │
│     □ Raw from source system                                    │
│     □ Aggregated over time window                               │
│     □ Derived from other features                               │
│                                                                 │
│  3. WHAT data does computation depend on?                       │
│     □ Only historical data                                      │
│     □ Current data point                                        │
│     □ Future data (LEAKAGE!)                                    │
│                                                                 │
│  EXAMPLE AUDIT:                                                 │
│  ─────────────                                                  │
│  Feature: avg_purchases_last_30_days                            │
│  ✓ Available at T+0                                             │
│  ✓ Aggregated over past 30 days                                 │
│  ✓ Only uses historical data                                    │
│  → SAFE                                                         │
│                                                                 │
│  Feature: total_purchases_this_month                            │
│  ❌ If predicting mid-month, includes future data               │
│  → LEAKAGE (temporal)                                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
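The two audit examples in the matrix can be made concrete with pandas. This is a sketch on synthetic daily data for one hypothetical customer; `shift(1)` keeps day T's feature strictly historical, while a calendar-month total silently includes the future:

```python
import pandas as pd

# Hypothetical daily purchase counts for one customer
purchases = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=60, freq='D'),
    'n_purchases': range(60),
}).set_index('date')

# ✅ SAFE: rolling mean over the past 30 days, shifted so day T's
# feature excludes day T itself (known only after the day ends)
safe = purchases['n_purchases'].shift(1).rolling(30).mean()

# ❌ LEAKY: calendar-month total includes days after the prediction point
leaky = purchases['n_purchases'].groupby(purchases.index.to_period('M')).transform('sum')
```

If predicting on January 1st, `leaky` already contains every purchase made through January 31st.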

Evaluation Pitfalls

Common Mistakes

┌─────────────────────────────────────────────────────────────────┐
│               EVALUATION PITFALLS                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PITFALL 1: WRONG SPLITTING                                     │
│  ───────────────────────────                                    │
│  ❌ Random split for time-series data                           │
│  ❌ No stratification for imbalanced classes                    │
│  ❌ Same entity in train and test (user/session leak)           │
│                                                                 │
│  PITFALL 2: METRIC GAMING                                       │
│  ─────────────────────────                                      │
│  ❌ Optimizing AUC when business needs precision@k              │
│  ❌ Reporting best validation score, not test score             │
│  ❌ Threshold tuning on test set                                │
│                                                                 │
│  PITFALL 3: OFFLINE/ONLINE GAP                                  │
│  ─────────────────────────────                                  │
│  ❌ Batch features in training, realtime in serving             │
│  ❌ Different preprocessing pipelines                           │
│  ❌ Stale features in production                                │
│                                                                 │
│  PITFALL 4: CALIBRATION BLINDNESS                               │
│  ──────────────────────────────                                 │
│  ❌ Good ranking but bad probability estimates                  │
│  ❌ Not checking calibration curves                             │
│  ❌ Using uncalibrated probabilities for thresholds             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
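Pitfall 2 contrasts AUC with precision@k. A minimal precision@k implementation for binary labels and real-valued scores (illustrative; the function name is ours) looks like this:

```python
import numpy as np

def precision_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    """Fraction of true positives among the k highest-scored items."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.mean(y_true[top_k]))

y_true = np.array([0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.9, 0.4, 0.8, 0.3, 0.7])
print(precision_at_k(y_true, y_score, k=3))
```

A model can have excellent AUC yet poor precision@k if its top-ranked items are wrong, which is exactly the gap metric gaming hides.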

Correct Splitting Strategies

| Data Type | Split Strategy | Implementation |
|---|---|---|
| Time-series | Temporal split | Train on T-N:T-1, test on T |
| Users/entities | Group split | All of a user's data in the same split |
| Imbalanced | Stratified | Preserve class ratios |
| Multi-label | Iterative stratification | scikit-multilearn |
| Cross-validation | Time-series CV | sklearn `TimeSeriesSplit` |
```python
from sklearn.model_selection import (
    TimeSeriesSplit,
    GroupKFold,
    StratifiedKFold
)

# Time-series: never train on the future
tscv = TimeSeriesSplit(n_splits=5, gap=7)  # gap of 7 samples (7 days at daily granularity)

# Users: keep all of a user's data together
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=user_ids):
    # No user appears in both train and test
    pass

# Imbalanced: preserve class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

Calibration

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve, CalibratedClassifierCV


def check_calibration(y_true, y_prob, n_bins=10):
    """
    Check whether predicted probabilities match actual frequencies.

    Good calibration: among samples predicted at ~30%, ~30% are positive.
    """
    fraction_pos, mean_pred = calibration_curve(
        y_true, y_prob, n_bins=n_bins
    )

    # Perfect calibration follows the diagonal
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect')
    plt.plot(mean_pred, fraction_pos, 's-', label='Model')
    plt.xlabel('Mean predicted probability')
    plt.ylabel('Fraction of positives')
    plt.legend()
    plt.title('Calibration Curve')
    plt.show()

    # Unweighted approximation of Expected Calibration Error (ECE);
    # the standard ECE weights each bin by its sample count
    ece = np.abs(fraction_pos - mean_pred).mean()
    print(f"ECE: {ece:.4f}")
    return ece


# Fix calibration with isotonic regression
calibrated_model = CalibratedClassifierCV(
    base_model,
    method='isotonic',  # or 'sigmoid' (Platt scaling)
    cv=5
)
calibrated_model.fit(X_train, y_train)
```

Prevention Strategies

Pipeline Design

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# ✅ CORRECT: preprocessing is fitted only on training data
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])),
    ('model', LogisticRegression())
])

# The Pipeline guarantees scaler.fit() only sees training data
pipeline.fit(X_train, y_train)            # fit on train
score = pipeline.score(X_test, y_test)    # test transformed with train statistics
```

Feature Store Discipline

┌─────────────────────────────────────────────────────────────────┐
│               FEATURE STORE PREVENTS LEAKAGE                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  TRAINING TIME:                                                 │
│  ──────────────                                                 │
│  get_features(entity_id, timestamp=T)                           │
│  → Returns features as they were at time T                      │
│  → Point-in-time correctness guaranteed                         │
│                                                                 │
│  SERVING TIME:                                                  │
│  ─────────────                                                  │
│  get_features(entity_id)                                        │
│  → Returns current feature values                               │
│  → Same computation as training                                 │
│                                                                 │
│  WHY THIS PREVENTS LEAKAGE:                                     │
│  ──────────────────────────                                     │
│  1. Training features computed with same logic as serving       │
│  2. Point-in-time join prevents future data in training         │
│  3. Feature definitions versioned and auditable                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
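The point-in-time join described above can be sketched with `pandas.merge_asof`, which picks, per user, the latest feature row at or before each label timestamp. The tables and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical label events and a feature table with update timestamps
labels = pd.DataFrame({
    'user_id': [1, 1, 2],
    'event_time': pd.to_datetime(['2024-03-01', '2024-03-15', '2024-03-10']),
})
features = pd.DataFrame({
    'user_id': [1, 1, 2],
    'feature_time': pd.to_datetime(['2024-02-20', '2024-03-10', '2024-03-12']),
    'avg_spend': [10.0, 25.0, 40.0],
})

# Point-in-time join: for each label, take the latest feature value
# computed at or before the label's timestamp, never after it
training = pd.merge_asof(
    labels.sort_values('event_time'),
    features.sort_values('feature_time'),
    left_on='event_time',
    right_on='feature_time',
    by='user_id',
    direction='backward',
)
print(training[['user_id', 'event_time', 'avg_spend']])
```

Note that user 2's label gets a missing feature value, because that user's only feature row was computed after the label's timestamp; a naive join would have leaked it.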

Operational Checklist

Pre-Training Gate

| Check | Required | How to Verify |
|---|---|---|
| Feature availability audit complete | ☐ | Document when each feature is available |
| Temporal split implemented | ☐ | Test timestamps > all train timestamps |
| Entity leakage prevented | ☐ | Same entity not in both train and test |
| Preprocessing fitted on train only | ☐ | Use an sklearn `Pipeline` |
| Target leakage scan passed | ☐ | Run `detect_target_leakage()` |

Pre-Deployment Gate

| Check | Required | How to Verify |
|---|---|---|
| Offline/online parity validated | ☐ | Compare feature values across pipelines |
| Calibration checked | ☐ | ECE < 0.05 |
| Business metric aligned | ☐ | Actual business KPI, not just AUC |
| Holdout evaluation done | ☐ | True holdout, not the validation set |
| Suspicious performance investigated | ☐ | AUC > 0.95 explained |

Production Monitoring

| Check | Required | Trigger |
|---|---|---|
| Feature distribution drift | ☐ | Daily |
| Prediction distribution stability | ☐ | Daily |
| Offline-online metric gap | ☐ | Weekly comparison |
| Calibration stability | ☐ | Monthly |
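The feature-drift check in the table can be sketched with a two-sample Kolmogorov-Smirnov test; the threshold, function name, and synthetic data here are illustrative:

```python
import numpy as np
from scipy import stats

def feature_drift(reference: np.ndarray, current: np.ndarray,
                  alpha: float = 0.01) -> dict:
    """Two-sample KS test between training-time and live feature values."""
    result = stats.ks_2samp(reference, current)
    return {
        'statistic': float(result.statistic),
        'p_value': float(result.pvalue),
        'drifted': bool(result.pvalue < alpha),
    }

rng = np.random.default_rng(42)
ref = rng.normal(0.0, 1.0, 5000)       # distribution at training time
shifted = rng.normal(0.5, 1.0, 5000)   # production distribution has drifted
print(feature_drift(ref, shifted))
```

In practice the reference sample is frozen at training time and the check runs on each day's serving traffic; population stability index (PSI) is a common alternative for binned features.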

📎 Cross-References