🚨 Data Leakage & Evaluation Pitfalls
Level: Foundation
Solves: Prevent "too good to be true" models that fail catastrophically in production
❌ The #1 ML Failure Mode
Data leakage is the leading cause of "model works great offline, fails in production". It is invisible in offline metrics; it only shows up in business impact.
Leakage Taxonomy
┌─────────────────────────────────────────────────────────────────┐
│ LEAKAGE TAXONOMY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. TARGET LEAKAGE │
│ ──────────────── │
│ Feature contains direct information about target │
│ │
│ Example: Predicting fraud │
│ ❌ Feature: "fraud_investigation_date" (only exists if fraud) │
│ → Model learns: if investigation_date exists → fraud │
│ → 99.9% AUC offline, 50% in production │
│ │
│ 2. TEMPORAL LEAKAGE │
│ ────────────────── │
│ Using future information to predict past │
│ │
│ Example: Predict next-day stock price │
│ ❌ Using data from day T+1 to predict day T │
│ ❌ Random train/test split instead of temporal split │
│ │
│ 3. PREPROCESSING LEAKAGE │
│ ───────────────────────── │
│ Test data influences training preprocessing │
│ │
│ Example: Feature scaling │
│ ❌ scaler.fit_transform(all_data) then split │
│ → Test set statistics leak into training │
│ │
│ 4. FEATURE LEAKAGE │
│ ───────────────── │
│ Feature derived from target or unavailable at prediction time │
│ │
│ Example: Predict customer churn │
│ ❌ Feature: "cancellation_reason" (only exists after churn) │
│ ❌ Feature: "days_until_churn" (literally the target) │
│ │
└─────────────────────────────────────────────────────────────────┘
Detection Strategies
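Preprocessing leakage (item 3 in the taxonomy above) is easy to demonstrate directly: fitting a scaler on all data before splitting lets test-set statistics contaminate the "training" parameters. A minimal sketch on synthetic data (all values fabricated for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
X[800:] += 5.0  # pretend the last 20% (future data) has drifted

# Temporal-style split: first 800 rows train, last 200 test
X_train, X_test = train_test_split(X, test_size=0.2, shuffle=False)

# ❌ LEAKY: scaler is fitted before the split is honored
leaky_mean = StandardScaler().fit(X).mean_

# ✅ CORRECT: scaler is fitted on training rows only
clean_mean = StandardScaler().fit(X_train).mean_

# The difference between the two is test-set information
# that leaked into the "training" preprocessing
print(np.abs(leaky_mean - clean_mean))
```

The gap between the two means is exactly the information about the drifted test rows that the leaky pipeline smuggles into training.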
Red Flags Checklist
| Red Flag | What It Means | Investigation |
|---|---|---|
| AUC > 0.95 on first try | Suspiciously good | Check for target leakage |
| Perfect score on subset | Data issue | Examine that subset |
| Performance drops in production | Train/deploy mismatch | Audit feature pipeline |
| Feature importance shows unexpected winner | Proxy for target? | Analyze correlation |
| Model "learns" in 1 epoch | Too easy | Check data splitting |
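The first red flag can be reproduced on purpose: inject a feature that is essentially the label plus noise and watch AUC jump to near 1.0. A synthetic illustration (data and noise level are fabricated):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Simulate target leakage: append a column that is (almost) the label itself
leak = y + np.random.default_rng(0).normal(0, 0.05, len(y))
X_leaky = np.column_stack([X, leak])

Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
auc = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
print(f"AUC with leaky feature: {auc:.3f}")  # suspiciously close to 1.0
```

In a real audit, an AUC like this on the first attempt should trigger a feature-by-feature leakage review, not a celebration.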
Statistical Detection
```python
import pandas as pd


def detect_target_leakage(df: pd.DataFrame, target: str, threshold: float = 0.9):
    """
    Detect features suspiciously correlated with the target.
    Returns features with |correlation| > threshold.
    """
    suspicious = []
    for col in df.columns:
        if col == target:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            corr = abs(df[col].corr(df[target]))
            if corr > threshold:
                suspicious.append({
                    'feature': col,
                    'correlation': corr,
                    'reason': 'high_correlation'
                })
        else:
            # For categoricals: flag values that map to a single target class
            for val in df[col].dropna().unique():
                subset = df[df[col] == val]
                if subset[target].nunique() == 1:
                    suspicious.append({
                        'feature': col,
                        'value': val,
                        'reason': 'single_class_subset'
                    })
    return suspicious


def detect_temporal_leakage(df: pd.DataFrame, time_col: str,
                            train_mask: pd.Series, test_mask: pd.Series):
    """
    Check whether any test samples are timestamped at or before the latest
    train sample (i.e., the temporal split is not clean).
    """
    train_max = df.loc[train_mask, time_col].max()
    test_min = df.loc[test_mask, time_col].min()
    if test_min <= train_max:
        overlap = df.loc[test_mask & (df[time_col] <= train_max)]
        return {
            'leakage_detected': True,
            'overlapping_samples': len(overlap),
            'train_max': train_max,
            'test_min': test_min
        }
    return {'leakage_detected': False}
```
Feature Availability Audit
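The audit questions in the matrix below can be made executable as a lightweight checker over per-feature metadata. The feature names and availability tags here are hypothetical, matching the churn examples used throughout this section:

```python
# Hypothetical feature metadata: when each feature becomes available,
# in days relative to prediction time T (0 = at prediction time),
# and whether its computation touches data from after T.
FEATURE_AVAILABILITY = {
    'avg_purchases_last_30_days': {'available_at': 0,  'uses_future_data': False},
    'total_purchases_this_month': {'available_at': 0,  'uses_future_data': True},
    'cancellation_reason':        {'available_at': 30, 'uses_future_data': True},
}


def audit_features(metadata: dict) -> list:
    """Flag features unavailable at prediction time or built on future data."""
    unsafe = []
    for name, meta in metadata.items():
        if meta['available_at'] > 0 or meta['uses_future_data']:
            unsafe.append(name)
    return unsafe


print(audit_features(FEATURE_AVAILABILITY))
# → ['total_purchases_this_month', 'cancellation_reason']
```

Keeping this metadata next to feature definitions turns the audit from a one-off exercise into a check that can run in CI.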
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE AVAILABILITY MATRIX │
├─────────────────────────────────────────────────────────────────┤
│ │
│ For EACH feature, answer: │
│ │
│ 1. WHEN is this feature available? │
│ □ At prediction time (T+0) │
│ □ After prediction (T+1, T+7, ...) │
│ □ Only after target is known │
│ │
│ 2. HOW is this feature computed? │
│ □ Raw from source system │
│ □ Aggregated over time window │
│ □ Derived from other features │
│ │
│ 3. WHAT data does computation depend on? │
│ □ Only historical data │
│ □ Current data point │
│ □ Future data (LEAKAGE!) │
│ │
│ EXAMPLE AUDIT: │
│ ───────────── │
│ Feature: avg_purchases_last_30_days │
│ ✓ Available at T+0 │
│ ✓ Aggregated over past 30 days │
│ ✓ Only uses historical data │
│ → SAFE │
│ │
│ Feature: total_purchases_this_month │
│ ❌ If predicting mid-month, includes future data │
│ → LEAKAGE (temporal) │
│ │
└─────────────────────────────────────────────────────────────────┘
Evaluation Pitfalls
Common Mistakes
┌─────────────────────────────────────────────────────────────────┐
│ EVALUATION PITFALLS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PITFALL 1: WRONG SPLITTING │
│ ─────────────────────────── │
│ ❌ Random split for time-series data │
│ ❌ No stratification for imbalanced classes │
│ ❌ Same entity in train and test (user/session leak) │
│ │
│ PITFALL 2: METRIC GAMING │
│ ───────────────────────── │
│ ❌ Optimizing AUC when business needs precision@k │
│ ❌ Reporting best validation score, not test score │
│ ❌ Threshold tuning on test set │
│ │
│ PITFALL 3: OFFLINE/ONLINE GAP │
│ ───────────────────────────── │
│ ❌ Batch features in training, realtime in serving │
│ ❌ Different preprocessing pipelines │
│ ❌ Stale features in production │
│ │
│ PITFALL 4: CALIBRATION BLINDNESS │
│ ────────────────────────────── │
│ ❌ Good ranking but bad probability estimates │
│ ❌ Not checking calibration curves │
│ ❌ Using uncalibrated probabilities for thresholds │
│ │
└─────────────────────────────────────────────────────────────────┘
Correct Splitting Strategies
| Data Type | Split Strategy | Implementation |
|---|---|---|
| Time-series | Temporal split | Train on T-N:T-1, test on T |
| Users/entities | Group split | All user data in same split |
| Imbalanced | Stratified | Preserve class ratios |
| Multi-label | Iterative stratification | scikit-multilearn |
| Cross-validation | Time-series CV | sklearn TimeSeriesSplit |
```python
from sklearn.model_selection import (
    TimeSeriesSplit,
    GroupKFold,
    StratifiedKFold
)

# Time-series: never train on the future
tscv = TimeSeriesSplit(n_splits=5, gap=7)  # gap of 7 samples between train and test
                                           # (e.g., 7 days of daily data)

# Users/entities: keep all of a user's data in the same fold
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=user_ids):
    # No user appears in both train and test
    pass

# Imbalanced classes: preserve the class distribution in each fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```
Calibration
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve, CalibratedClassifierCV


def check_calibration(y_true, y_prob, n_bins=10):
    """
    Check whether predicted probabilities match observed frequencies.
    Good calibration: predicted 30% → ~30% actually positive.
    """
    fraction_pos, mean_pred = calibration_curve(
        y_true, y_prob, n_bins=n_bins
    )

    # Perfect calibration lies on the diagonal
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect')
    plt.plot(mean_pred, fraction_pos, 's-', label='Model')
    plt.xlabel('Mean predicted probability')
    plt.ylabel('Fraction of positives')
    plt.legend()
    plt.title('Calibration Curve')
    plt.show()

    # Expected Calibration Error (here: unweighted average across bins)
    ece = np.abs(fraction_pos - mean_pred).mean()
    print(f"ECE: {ece:.4f}")
    return ece


# Fix miscalibration with isotonic regression (or Platt scaling)
calibrated_model = CalibratedClassifierCV(
    base_model,           # any sklearn-compatible classifier
    method='isotonic',    # or 'sigmoid' for small datasets
    cv=5
)
calibrated_model.fit(X_train, y_train)
```
Prevention Strategies
Pipeline Design
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# ✅ CORRECT: fit preprocessing only on training data
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])),
    ('model', LogisticRegression())
])

# The Pipeline guarantees scaler.fit() only ever sees training data
pipeline.fit(X_train, y_train)          # fit on train
score = pipeline.score(X_test, y_test)  # transform test using train params
```
Feature Store Discipline
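The point-in-time correctness described below can be approximated without a full feature store using `pandas.merge_asof`, which joins each label row to the latest feature value known at or before its timestamp. Column names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical label events and a feature's value history
labels = pd.DataFrame({
    'user_id': [1, 1],
    'ts': pd.to_datetime(['2024-01-10', '2024-02-10']),
    'churned': [0, 1],
})
features = pd.DataFrame({
    'user_id': [1, 1, 1],
    'ts': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),
    'purchase_count_30d': [5, 2, 0],
})

# Point-in-time join: each label row gets the latest feature value
# at or before its own timestamp; later values cannot leak in
train = pd.merge_asof(
    labels.sort_values('ts'),
    features.sort_values('ts'),
    on='ts', by='user_id', direction='backward'
)
print(train[['ts', 'churned', 'purchase_count_30d']])
```

A naive join on `user_id` alone would happily attach the March feature value to the January label, which is exactly the temporal leakage a point-in-time join prevents.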
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE STORE PREVENTS LEAKAGE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING TIME: │
│ ────────────── │
│ get_features(entity_id, timestamp=T) │
│ → Returns features as they were at time T │
│ → Point-in-time correctness guaranteed │
│ │
│ SERVING TIME: │
│ ───────────── │
│ get_features(entity_id) │
│ → Returns current feature values │
│ → Same computation as training │
│ │
│ WHY THIS PREVENTS LEAKAGE: │
│ ────────────────────────── │
│ 1. Training features computed with same logic as serving │
│ 2. Point-in-time join prevents future data in training │
│ 3. Feature definitions versioned and auditable │
│ │
└─────────────────────────────────────────────────────────────────┘
✅ Operational Checklist
Pre-Training Gate
| Check | Required | How to Verify |
|---|---|---|
| Feature availability audit complete | ✓ | Document when each feature is available |
| Temporal split implemented | ✓ | Test timestamps > all train timestamps |
| Entity leakage prevented | ✓ | Same entity not in train and test |
| Preprocessing fitted on train only | ✓ | Use sklearn Pipeline |
| Target leakage scan passed | ✓ | Run detect_target_leakage() |
Pre-Deployment Gate
| Check | Required | How to Verify |
|---|---|---|
| Offline/online parity validated | ✓ | Compare feature values |
| Calibration checked | ✓ | ECE < 0.05 |
| Business metric aligned | ✓ | Not just AUC, actual business KPI |
| Holdout evaluation done | ✓ | True holdout, not validation set |
| Suspicious performance investigated | ✓ | AUC > 0.95 explained |
Production Monitoring
| Check | Required | Trigger |
|---|---|---|
| Feature distribution drift | ✓ | Daily |
| Prediction distribution stability | ✓ | Daily |
| Offline-online metric gap | ✓ | Weekly comparison |
| Calibration stability | ✓ | Monthly |
📎 Cross-References
- 📎 Problem Framing - Objective definition and constraints
- 📎 Evaluation Fundamentals - Metrics and splitting strategies
- 📎 Feature Engineering - Feature store và pipeline design
- 📎 ML Monitoring - Drift detection
- 📎 Data Governance - Data lineage