📊 Evaluation Fundamentals
Level: Foundation Solves: Evaluating models accurately, avoiding overfitting, and ensuring generalization
Why does Evaluation matter?
💡 Professor Tom
Model evaluation is more than just running model.score(). It is the art of ensuring your model will perform well on data it has never seen. A flawed evaluation strategy can lead you to deploy a terrible model without ever knowing it.
┌─────────────────────────────────────────────────────────────────┐
│ EVALUATION HIERARCHY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ OFFLINE EVALUATION │ │
│ │ • Train/Val/Test splits │ │
│ │ • Cross-validation │ │
│ │ • Holdout metrics │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ONLINE EVALUATION │ │
│ │ • A/B testing │ │
│ │ • Shadow mode │ │
│ │ • Canary deployment │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ PRODUCTION MONITORING │ │
│ │ • Drift detection │ │
│ │ • Performance tracking │ │
│ │ • Business metrics │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Data Splitting Strategies
Basic Train/Val/Test Split
┌─────────────────────────────────────────────────────────────────┐
│ STANDARD DATA SPLIT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ FULL DATASET │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ TRAIN │ │ VALIDATION │ │ TEST │ │
│ │ (60-70%) │ │ (15-20%) │ │ (15-20%) │ │
│ ├─────────────┤ ├─────────────┤ ├─────────────┤ │
│ │ Model │ │ Hyperparameter │ │ Final │ │
│ │ training │ │ tuning │ │ evaluation │ │
│ │ │ │ Model select │ │ Report │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ⚠️ RULE: Never touch test set until final evaluation! │
│ │
└─────────────────────────────────────────────────────────────────┘
Temporal Split (Time Series)
⚠️ Time Series Data
With time series, NEVER use a random split. Always split by time to avoid data leakage.
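A minimal sketch of the rule above, assuming a pandas DataFrame with a `date` column (column names and the 60/20/20 ratios are illustrative):

```python
import pandas as pd

# Hypothetical monthly data (Jan through Oct)
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="MS"),
    "target": range(10),
})

# Sort by time first, then slice chronologically -- never shuffle
df = df.sort_values("date").reset_index(drop=True)
n = len(df)
train = df.iloc[: int(n * 0.6)]                 # oldest 60%
val = df.iloc[int(n * 0.6): int(n * 0.8)]       # next 20%
test = df.iloc[int(n * 0.8):]                   # most recent 20%

# Every training timestamp precedes every validation/test timestamp
assert train["date"].max() < val["date"].min() < test["date"].min()
```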
┌─────────────────────────────────────────────────────────────────┐
│ TEMPORAL DATA SPLIT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Timeline: ──────────────────────────────────────────────────► │
│ │
│ ┌──────────────┬──────────────┬──────────────┐ │
│ │ TRAIN │ VALIDATION │ TEST │ │
│ │ Jan - Jun │ Jul - Aug │ Sep - Oct │ │
│ └──────────────┴──────────────┴──────────────┘ │
│ │
│ ❌ WRONG: Random split mixes future data into training │
│ ✅ CORRECT: Temporal split preserves time ordering │
│ │
└─────────────────────────────────────────────────────────────────┘
Stratified Split
```python
# A stratified split preserves the class distribution
from sklearn.model_selection import train_test_split

# ❌ WRONG: a plain random split can leave classes imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# ✅ CORRECT: a stratified split preserves the class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
Group Split
┌─────────────────────────────────────────────────────────────────┐
│ GROUP-AWARE SPLIT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Use Case: Multiple samples per entity (user, patient, etc.) │
│ │
│ ❌ WRONG: Same user in both train and test │
│ Train: [User_A_sample1, User_B_sample1, User_A_sample2] │
│ Test: [User_A_sample3, User_C_sample1] │
│ → Model memorizes User_A patterns! │
│ │
│ ✅ CORRECT: Users completely separated │
│ Train: [User_A_sample1, User_A_sample2, User_A_sample3] │
│ Test: [User_B_sample1, User_C_sample1] │
│ → Model must generalize to new users │
│ │
└─────────────────────────────────────────────────────────────────┘
Cross-Validation
K-Fold Cross-Validation
┌─────────────────────────────────────────────────────────────────┐
│ 5-FOLD CROSS-VALIDATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Fold 1: [VAL] [TRAIN] [TRAIN] [TRAIN] [TRAIN] → Score_1 │
│ Fold 2: [TRAIN] [VAL] [TRAIN] [TRAIN] [TRAIN] → Score_2 │
│ Fold 3: [TRAIN] [TRAIN] [VAL] [TRAIN] [TRAIN] → Score_3 │
│ Fold 4: [TRAIN] [TRAIN] [TRAIN] [VAL] [TRAIN] → Score_4 │
│ Fold 5: [TRAIN] [TRAIN] [TRAIN] [TRAIN] [VAL] → Score_5 │
│ │
│ Final Score = mean(Score_1, ..., Score_5) ± std │
│ │
│ Benefits: │
│ • Uses all data for both training and validation │
│ • Provides variance estimate │
│ • More robust than single split │
│ │
└─────────────────────────────────────────────────────────────────┘
CV Variants
| Variant | Use Case | Description |
|---|---|---|
| K-Fold | General purpose | Standard k splits |
| Stratified K-Fold | Imbalanced classes | Preserves class ratio |
| Group K-Fold | Grouped data | Keeps groups together |
| Time Series Split | Temporal data | Expanding window |
| Leave-One-Out | Small datasets | K = N samples |
| Nested CV | Model selection + evaluation | Inner + outer loops |
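The standard `mean ± std` report from the diagram above can be produced with `cross_val_score`; a minimal sketch using Stratified K-Fold on synthetic data (dataset and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary-classification data for illustration
X, y = make_classification(n_samples=200, random_state=42)

# 5-fold stratified CV: each fold preserves the class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# Final Score = mean(Score_1, ..., Score_5) ± std
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Swapping `StratifiedKFold` for `GroupKFold` (plus a `groups=` argument) or `TimeSeriesSplit` gives the other variants in the table.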
Time Series Cross-Validation
┌─────────────────────────────────────────────────────────────────┐
│ TIME SERIES CROSS-VALIDATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Expanding Window: │
│ Fold 1: [TRAIN ] [VAL] │
│ Fold 2: [TRAIN ] [VAL] │
│ Fold 3: [TRAIN ] [VAL] │
│ Fold 4: [TRAIN ] [VAL] │
│ │
│ Sliding Window: │
│ Fold 1: [TRAIN ] [VAL] │
│ Fold 2: [TRAIN ] [VAL] │
│ Fold 3: [TRAIN ] [VAL] │
│ Fold 4: [TRAIN ] [VAL] │
│ │
└─────────────────────────────────────────────────────────────────┘
Classification Metrics
Confusion Matrix
┌─────────────────────────────────────────────────────────────────┐
│ CONFUSION MATRIX │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PREDICTED │
│ Positive Negative │
│ ┌──────────┬──────────┐ │
│ Positive │ TP │ FN │ ← Actual Positives │
│ ACTUAL ├──────────┼──────────┤ │
│ Negative │ FP │ TN │ ← Actual Negatives │
│ └──────────┴──────────┘ │
│ ↑ ↑ │
│ Predicted Predicted │
│ Positives Negatives │
│ │
│ TP = True Positive (Correct positive prediction) │
│ TN = True Negative (Correct negative prediction) │
│ FP = False Positive (Type I error - False alarm) │
│ FN = False Negative (Type II error - Missed detection) │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Metrics
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes |
| Precision | TP/(TP+FP) | FP is costly |
| Recall | TP/(TP+FN) | FN is costly |
| F1-Score | 2×(P×R)/(P+R) | Balance P and R |
| Specificity | TN/(TN+FP) | Need true negative rate |
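A worked example of the table's formulas on hypothetical confusion-matrix counts:

```python
# Hypothetical counts from a confusion matrix
TP, TN, FP, FN = 80, 90, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)        # 170/200 = 0.85
precision = TP / (TP + FP)                        # 80/90  ≈ 0.889
recall = TP / (TP + FN)                           # 80/100 = 0.80
f1 = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)                      # 90/100 = 0.90
```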
ROC and AUC
┌─────────────────────────────────────────────────────────────────┐
│ ROC CURVE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ True Positive Rate (Recall) │
│ 1.0 ┤ ╭──────── │
│ │ ╭────╯ │
│ │ ╭────╯ │
│ 0.5 ┤ ╭────╯ Perfect: AUC = 1.0 │
│ │ ╭────╯ Good: AUC > 0.8 │
│ │ ╭────╯ Random: AUC = 0.5 │
│ │ ╭────╯ │
│ 0.0 ┼─────╯───────────────────────────────────────── │
│ 0.0 0.5 1.0 │
│ False Positive Rate (1 - Specificity) │
│ │
│ AUC = Area Under Curve │
│ • Threshold-independent metric │
│ • Measures ranking quality │
│ • Good for imbalanced datasets │
│ │
└─────────────────────────────────────────────────────────────────┘
Precision-Recall Curve
💡 When should you use the PR Curve instead of ROC?
For highly imbalanced datasets (e.g., fraud detection with 0.1% positives), the PR curve gives better insight because it focuses on the positive class.
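A quick comparison of the two views on synthetic imbalanced data (the 95/5 class split and the model are illustrative); average precision summarizes the PR curve the way AUC summarizes ROC:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~95% negatives
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, proba)             # ROC view
ap = average_precision_score(y_te, proba)    # PR view (positive-class focused)
print(f"ROC-AUC: {auc:.3f}  AP: {ap:.3f}")
```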
Regression Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | mean(\|y − ŷ\|) | Average absolute error |
| MSE | mean((y - ŷ)²) | Penalizes large errors |
| RMSE | √MSE | Same unit as target |
| MAPE | mean(\|y − ŷ\|/y) × 100 | Percentage error (undefined when y = 0) |
| R² | 1 - SS_res/SS_tot | Variance explained |
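The formulas above, computed by hand on a tiny made-up example:

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))             # 0.5
mse = np.mean((y_true - y_pred) ** 2)              # 0.375
rmse = np.sqrt(mse)                                # same unit as y
mape = np.mean(np.abs(y_true - y_pred) / y_true) * 100
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```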
Metric Selection Guide
┌─────────────────────────────────────────────────────────────────┐
│ REGRESSION METRIC SELECTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Question Recommended Metric │
│ ──────── ────────────────── │
│ │
│ "Average error magnitude?" → MAE (robust to outliers) │
│ │
│ "Penalize large errors?" → MSE/RMSE (sensitive) │
│ │
│ "Relative error?" → MAPE (percentage-based) │
│ │
│ "How much variance explained?"→ R² (0 to 1 scale) │
│ │
│ "Business-specific cost?" → Custom weighted metric │
│ │
└─────────────────────────────────────────────────────────────────┘
Calibration
What is Calibration?
📊 Calibration Definition
A model is calibrated when its predicted probabilities reflect the actual probabilities. Example: out of 100 predictions with P=0.7, about 70 cases should be positive.
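This definition can be checked empirically with scikit-learn's `calibration_curve`; a sketch using a simulated, perfectly calibrated model (the simulation setup is illustrative):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)
proba = rng.uniform(size=5000)
# Simulate a perfectly calibrated model: P(y=1 | p) == p
y = (rng.uniform(size=5000) < proba).astype(int)

# For each bin: actual positive rate vs mean predicted probability
frac_pos, mean_pred = calibration_curve(y, proba, n_bins=10)
# A calibrated model's points lie near the diagonal frac_pos ≈ mean_pred
```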
Calibration Plot
┌─────────────────────────────────────────────────────────────────┐
│ CALIBRATION PLOT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Actual Positive Rate │
│ 1.0 ┤ ╱ │
│ │ ╱ │
│ │ ╱ Perfect calibration │
│ 0.5 ┤ ╱ │
│ │ ╱ │
│ │ ╱ │
│ │ ╱ │
│ 0.0 ┼──────╱──────────────────────────────────── │
│ 0.0 0.5 1.0 │
│ Predicted Probability │
│ │
│ • Points above line: Under-confident (predicts lower than actual) │
│ • Points below line: Over-confident (predicts higher than actual) │
│ │
└─────────────────────────────────────────────────────────────────┘
Calibration Methods
| Method | Description | Use Case |
|---|---|---|
| Platt Scaling | Logistic regression on predictions | Binary classification |
| Isotonic Regression | Non-parametric monotonic fit | More flexible |
| Temperature Scaling | Single parameter scaling | Neural networks |
python
from sklearn.calibration import CalibratedClassifierCV
# Calibrate model predictions
calibrated_model = CalibratedClassifierCV(
base_estimator=model,
method='isotonic', # or 'sigmoid' for Platt scaling
cv=5
)
calibrated_model.fit(X_train, y_train)Evaluation Best Practices
Evaluation Checklist
| Step | Action | Why |
|---|---|---|
| 1 | Define metrics before training | Avoid metric shopping |
| 2 | Use appropriate split strategy | Prevent leakage |
| 3 | Report confidence intervals | Quantify uncertainty |
| 4 | Compare to baseline | Establish improvement |
| 5 | Check calibration | Ensure reliable probabilities |
| 6 | Evaluate on subgroups | Detect fairness issues |
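The confidence intervals from step 3 can come from the cross-validation spread or from a bootstrap over test-set predictions; a minimal bootstrap sketch (the per-sample correctness flags are simulated here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for per-sample hit/miss flags on a 500-example test set
correct = rng.uniform(size=500) < 0.85

# Resample the test set with replacement and recompute the metric each time
boot = [rng.choice(correct, size=correct.size, replace=True).mean()
        for _ in range(1000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Accuracy: {correct.mean():.3f} (95% CI [{lo:.3f}, {hi:.3f}])")
```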
Common Pitfalls
🚨 Evaluation Anti-Patterns
- Metric shopping: Trying many metrics, reporting best one
- Test set peeking: Using test set for model selection
- Ignoring variance: Reporting single number without confidence
- Wrong baseline: Comparing to weak baseline
- Leaky evaluation: Future data in training
Statistical Significance
```python
# Compare two models with a paired t-test
from scipy import stats

# Cross-validation scores for two models (same folds, so the scores are paired)
scores_model_a = [0.82, 0.85, 0.83, 0.84, 0.86]
scores_model_b = [0.84, 0.87, 0.85, 0.86, 0.88]

# Paired t-test (approximate here: CV folds share training data,
# so the test's independence assumption is only loosely satisfied)
t_stat, p_value = stats.ttest_rel(scores_model_a, scores_model_b)
if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("No significant difference")
```
✅ Operational Checklist
Offline Evaluation Gate (Pre-Deployment)
| Check | Required | Owner | Notes |
|---|---|---|---|
| Correct split strategy used | ✓ | ML Engineer | Temporal/Group as needed |
| Test set not touched during training | ✓ | ML Engineer | No peeking! |
| Business metric computed | ✓ | ML Engineer | Not just AUC |
| Baseline comparison done | ✓ | ML Engineer | vs current solution |
| Calibration checked | ✓ | ML Engineer | ECE < 0.05 |
| Fairness metrics reviewed | ✓ | ML Lead | Subgroup performance |
| Confidence intervals reported | ✓ | ML Engineer | Cross-validation std |
| Leakage audit passed | ✓ | ML Lead | See leakage.md |
Metric Reporting Standards
| Metric Type | Report Format | Example |
|---|---|---|
| Classification | AUC, Precision@K, Recall@K | AUC: 0.85 ± 0.02 |
| Regression | RMSE, MAE, R² | RMSE: 12.3 ± 1.1 |
| Ranking | NDCG@K, MAP | NDCG@10: 0.72 |
| Calibration | ECE, Brier Score | ECE: 0.03 |
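The ECE values above (including the "ECE < 0.05" gate) can be computed with a simple binned estimator. This sketch uses the common equal-width-bin definition, which is one convention among several, and assumes binary labels with predicted probabilities:

```python
import numpy as np

def ece(y_true, y_prob, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # Last bin is closed on the right so that p = 1.0 is counted
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            total += mask.mean() * gap   # weight gap by bin population
    return total

# A perfectly confident, perfectly correct model has ECE 0
assert ece(np.array([1, 0, 1, 0]), np.array([1.0, 0.0, 1.0, 0.0])) == 0.0
```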
📎 Cross-References
- 📎 Data Leakage & Evaluation Pitfalls - Evaluation pitfalls deep-dive
- 📎 Problem Framing - Defining metrics that match the business objective
- 📎 Model Selection - Choosing a model based on evaluation results
- 📎 ML Monitoring - Online evaluation and drift detection
- 📎 Experimentation - A/B testing and statistical significance
- 📎 ML Governance - Approval gates