🎯 Problem Framing
Level: Foundation. Solves: defining the ML problem correctly and avoiding common pitfalls such as data leakage and misaligned objectives.
Why does Problem Framing matter?
💡 Professor Tom
80% of ML projects fail not because the model is weak, but because the problem was framed incorrectly from the start. A perfect model that solves the wrong problem is still a failure. Time invested at this step saves months of refactoring later.
┌─────────────────────────────────────────────────────────────────┐
│ ML PROJECT SUCCESS FACTORS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ PROBLEM FRAMING (60%) │ │
│ │ • Clear objective definition │ │
│ │ • Correct success metrics │ │
│ │ • Leakage prevention │ │
│ │ • Constraint identification │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ DATA QUALITY (25%) │ │
│ │ • Representative data │ │
│ │ • Proper labeling │ │
│ │ • Feature engineering │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ MODEL SELECTION (15%) │ │
│ │ • Algorithm choice │ │
│ │ • Hyperparameter tuning │ │
│ │ • Architecture design │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Objective Definition Framework
Business Objective vs ML Objective
| Aspect | Business Objective | ML Objective |
|---|---|---|
| Focus | Business outcome | Measurable prediction |
| Metric | Revenue, retention | Accuracy, AUC, RMSE |
| Stakeholder | Business team | ML team |
| Example | "Reduce churn" | "Predict churn probability" |
The Translation Process
┌─────────────────────────────────────────────────────────────────┐
│ BUSINESS → ML OBJECTIVE TRANSLATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Business Goal ML Problem Type ML Objective │
│ ───────────── ─────────────── ──────────── │
│ │
│ "Reduce churn" → Classification → Predict P(churn) │
│ │
│ "Increase sales" → Recommendation → Rank products │
│ │
│ "Detect fraud" → Anomaly Detection → Score anomaly │
│ │
│ "Forecast demand" → Regression → Predict quantity │
│ │
│ "Segment users" → Clustering → Group similar │
│ │
└─────────────────────────────────────────────────────────────────┘
Objective Definition Checklist
⚠️ Before writing any model code
Answer ALL of the following questions first:
| Category | Question | Why It Matters |
|---|---|---|
| What | What is the prediction target? | Defines label |
| When | At what point in time is the prediction needed? | Defines feature availability |
| Who | Who or what is the prediction for? | Defines entity |
| How | How will the prediction be used? | Defines threshold/ranking |
| Why | What is the business impact of the prediction? | Defines success metric |
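The five questions above can be enforced as a lightweight pre-coding gate. A minimal sketch (the `validate_framing` helper and the churn answers are illustrative, not from a real project):

```python
# Illustrative sketch: capture the five framing questions as a structured
# record and refuse to proceed until every one has a concrete answer.
FRAMING_QUESTIONS = ["what", "when", "who", "how", "why"]

def validate_framing(framing: dict) -> list:
    """Return the framing questions that are still unanswered."""
    return [q for q in FRAMING_QUESTIONS
            if not framing.get(q, "").strip()]

# Hypothetical answers for a churn-prediction project
churn_framing = {
    "what": "P(customer churns within 30 days)",
    "when": "at the start of each billing cycle",
    "who": "each active subscriber",
    "how": "top-decile scores routed to the retention team",
    "why": "reduce monthly churn; success metric = retention uplift",
}

missing = validate_framing(churn_framing)
assert missing == [], f"Unanswered framing questions: {missing}"
```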
Constraint Identification
Types of Constraints
┌─────────────────────────────────────────────────────────────────┐
│ ML PROJECT CONSTRAINTS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ TECHNICAL │ │ BUSINESS │ │ ETHICAL │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ • Latency │ │ • Budget │ │ • Fairness │ │
│ │ • Throughput │ │ • Timeline │ │ • Privacy │ │
│ │ • Memory │ │ • Interpretability│ │ • Bias │ │
│ │ • Compute │ │ • Maintenance │ │ • Transparency │ │
│ │ • Data volume │ │ • Team skills │ │ • Consent │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Constraint Trade-offs Matrix
| Constraint | Impact on Model | Mitigation Strategy |
|---|---|---|
| Low latency (<10ms) | Simpler models, fewer features | Feature caching, model distillation |
| High interpretability | Linear models, decision trees | SHAP, LIME for complex models |
| Limited labeled data | Overfitting risk | Transfer learning, semi-supervised |
| Real-time features | Feature engineering complexity | Feature stores, streaming pipelines |
| Fairness requirements | Constrained optimization | Fairness-aware training |
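Latency constraints are cheap to verify early, before committing to a model family. A hedged sketch of a p99 latency check, assuming a stand-in `predict` function in place of a real model:

```python
# Illustrative sketch: measure whether prediction latency fits a 10 ms
# budget. `predict` is a placeholder for any model's inference call.
import time

def predict(features):
    # Placeholder model: a linear score over the feature vector
    return sum(w * x for w, x in zip([0.2, -0.1, 0.5], features))

def p99_latency_ms(fn, sample, n_calls=1000):
    """Time n_calls invocations and return the 99th-percentile latency."""
    timings = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn(sample)
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return timings[int(0.99 * len(timings)) - 1]

latency = p99_latency_ms(predict, [1.0, 2.0, 3.0])
assert latency < 10, f"p99 latency {latency:.2f} ms exceeds 10 ms budget"
```

The same harness can be pointed at a real model's `predict` to compare candidates against the budget from the constraints table.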
Data Leakage Prevention
What is Data Leakage?
🚨 Data Leakage
Data leakage occurs when information from the future (or information not available at prediction time) "leaks" into the training data, producing a model with deceptively high offline performance that fails in production.
Types of Leakage
┌─────────────────────────────────────────────────────────────────┐
│ TYPES OF DATA LEAKAGE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. TARGET LEAKAGE │
│ ───────────────── │
│      Feature contains information about the target              │
│ │
│ Example: Predicting loan default │
│ ❌ Feature: "collection_calls_count" │
│         → Only available after the default has happened!        │
│ │
│ 2. TRAIN-TEST CONTAMINATION │
│ ───────────────────────── │
│      Test data "sees" the training data                         │
│ │
│ Example: Time series forecasting │
│      ❌ Random split instead of temporal split                  │
│         → Future data in the training set!                      │
│ │
│ 3. FEATURE LEAKAGE │
│ ─────────────── │
│ Feature computed với future information │
│ │
│ Example: Customer lifetime value │
│ ❌ Feature: "total_purchases_ever" │
│ → Includes future purchases! │
│ │
└─────────────────────────────────────────────────────────────────┘
Leakage Detection Checklist
| Check | Question | Red Flag |
|---|---|---|
| Temporal | Is the feature available at prediction time? | Feature created after the prediction point |
| Causal | Is the feature a cause or an effect of the target? | Feature is a consequence of the target |
| Proxy | Is the feature a direct proxy for the target? | Near-perfect correlation |
| Processing | Does preprocessing use test data? | Scaler fit on the full dataset |
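The temporal check above is easiest to enforce at split time. A minimal sketch of a cutoff-based (temporal) split for time-ordered data, with illustrative shapes:

```python
# Sketch of preventing train-test contamination in time-ordered data:
# split on a cutoff date rather than at random, so no future rows leak
# into training. Data below is synthetic for illustration.
from datetime import date, timedelta

# One row per day: (observation_date, feature, label)
rows = [(date(2024, 1, 1) + timedelta(days=i), float(i), i % 2)
        for i in range(100)]

cutoff = date(2024, 3, 1)

# ✅ Temporal split: everything before the cutoff trains, the rest tests
train = [r for r in rows if r[0] < cutoff]
test = [r for r in rows if r[0] >= cutoff]

# Invariant: every training row strictly precedes every test row
assert max(r[0] for r in train) < min(r[0] for r in test)
```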
Prevention Strategies
```python
# ❌ WRONG: leakage through preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fit on ALL data, including future test rows
X_train, X_test = train_test_split(X_scaled)

# ✅ CORRECT: split first, then fit preprocessing on the training set only
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit only on train
X_test_scaled = scaler.transform(X_test)        # Transform test
```
Problem Type Selection
Decision Framework
┌─────────────────────────────────────────────────────────────────┐
│ ML PROBLEM TYPE DECISION TREE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Do you have labels? │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ │ │
│ YES NO │
│ │ │ │
│ Is target continuous? → UNSUPERVISED │
│ │ • Clustering │
│ ┌────────┴────────┐ • Anomaly Detection │
│ │ │ • Dimensionality Reduction │
│ YES NO │
│ │ │ │
│ REGRESSION Is it ranking? │
│ • Linear │ │
│ • Tree-based ┌────┴────┐ │
│ • Neural │ │ │
│ YES NO │
│ │ │ │
│ RANKING CLASSIFICATION │
│ • LTR • Binary │
│ • Rec • Multi-class │
│ • Multi-label │
│ │
└─────────────────────────────────────────────────────────────────┘
Common Reframing Patterns
| Original Problem | Reframed As | Benefit |
|---|---|---|
| Regression → Classification | Bucket continuous target | Simpler, more robust |
| Classification → Ranking | Rank by probability | Better for recommendations |
| Multi-class → Binary | One-vs-rest | Easier to interpret |
| Point prediction → Distribution | Predict uncertainty | Better decision making |
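The first pattern (regression → classification) can be as simple as binning the continuous target. A sketch with made-up bin edges; in practice the tiers come from business definitions:

```python
# Sketch: bucket a continuous target (e.g. predicted spend in dollars)
# into ordinal classes. Bin edges are illustrative assumptions.
import bisect

BIN_EDGES = [50.0, 200.0]      # < 50 → "low", [50, 200) → "mid", >= 200 → "high"
LABELS = ["low", "mid", "high"]

def bucketize(spend: float) -> str:
    """Map a continuous value to its class label via binary search."""
    return LABELS[bisect.bisect_right(BIN_EDGES, spend)]

assert bucketize(10.0) == "low"
assert bucketize(120.0) == "mid"
assert bucketize(900.0) == "high"
```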
Success Metrics Alignment
Metric Selection Framework
| Business Goal | ML Metric | Threshold Consideration |
|---|---|---|
| Minimize false positives | Precision | High precision, lower recall |
| Minimize false negatives | Recall | High recall, lower precision |
| Balance both | F1-Score | Harmonic mean |
| Ranking quality | AUC-ROC, NDCG | Threshold-independent |
| Calibrated probabilities | Brier Score, Log Loss | Probability accuracy |
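Threshold choice can be driven directly by asymmetric error costs instead of a generic metric. A sketch assuming illustrative scores and labels, using the $10/$500 cost figures from the fraud example in this section:

```python
# Sketch: pick a decision threshold by minimizing expected cost.
# Scores and labels below are made up for illustration.
COST_FP, COST_FN = 10.0, 500.0   # cost figures from the fraud example

labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]   # 1 = fraud
scores = [0.1, 0.2, 0.15, 0.4, 0.35, 0.6, 0.7, 0.8, 0.55, 0.9]

def expected_cost(threshold: float) -> float:
    """Total cost of false positives and false negatives at a threshold."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return COST_FP * fp + COST_FN * fn

# Sweep candidate thresholds and keep the cheapest
best = min((t / 100 for t in range(1, 100)), key=expected_cost)

# With FN 50x costlier than FP, the chosen threshold sits low:
# high recall, at the price of more false alarms.
assert best < 0.5
assert expected_cost(best) <= expected_cost(0.5)
```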
Metric-Business Alignment Example
┌─────────────────────────────────────────────────────────────────┐
│ FRAUD DETECTION: METRIC ALIGNMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Business Context: │
│ • False Positive: Block legitimate transaction → Customer angry│
│ • False Negative: Miss fraud → Financial loss │
│ │
│ Cost Analysis: │
│ • Cost of FP: $10 (customer service, friction) │
│ • Cost of FN: $500 (average fraud amount) │
│ │
│ Optimal Threshold: │
│ • FN is 50x more costly than FP │
│ • Optimize for high recall (catch more fraud) │
│ • Accept lower precision (more false alarms) │
│ │
│ Recommended Metric: Recall @ Precision = 0.5 │
│ Or: Custom cost-weighted metric │
│ │
└─────────────────────────────────────────────────────────────────┘
Best Practices
Problem Framing Document Template
```markdown
## ML Problem Framing Document

### 1. Business Context
- Business objective: [What business outcome?]
- Current solution: [How is it solved today?]
- Success criteria: [What defines success?]

### 2. ML Formulation
- Prediction target: [What to predict?]
- Entity: [Predict for what/whom?]
- Prediction time: [When is prediction needed?]
- Problem type: [Classification/Regression/etc.]

### 3. Data Availability
- Label source: [Where do labels come from?]
- Feature sources: [What data is available?]
- Temporal constraints: [What's available at prediction time?]

### 4. Constraints
- Latency: [Max prediction time?]
- Interpretability: [Need explanations?]
- Fairness: [Protected attributes?]

### 5. Risks
- Leakage risks: [Potential leakage sources?]
- Distribution shift: [How might data change?]
- Feedback loops: [Does prediction affect future data?]
```
✅ Operational Checklist
Problem Framing Gate (Gate 0)
| Check | Required | Owner | Notes |
|---|---|---|---|
| Business objective documented | ✓ | PM | Clear success criteria |
| ML objective defined | ✓ | ML Lead | Prediction target, entity, timing |
| Problem type selected | ✓ | ML Engineer | Classification/Regression/Ranking |
| Constraints identified | ✓ | Tech Lead | Latency, interpretability, fairness |
| Data availability confirmed | ✓ | Data Engineer | Labels, features, volume |
| Leakage risks assessed | ✓ | ML Lead | See leakage.md |
| Success metrics aligned | ✓ | PM + ML Lead | Business ↔ ML metric mapping |
| Baseline defined | ✓ | ML Engineer | What to compare against |
Pre-Experimentation Gate
| Check | Required | Owner |
|---|---|---|
| Problem framing doc approved | ✓ | ML Lead |
| Data access granted | ✓ | Data Governance |
| Compute resources allocated | ✓ | Platform |
| Experiment tracking configured | ✓ | ML Engineer |
📎 Cross-References
- 📎 Data Leakage & Evaluation Pitfalls - Deep dive into leakage detection
- 📎 Evaluation Fundamentals - Metrics and evaluation strategies
- 📎 Feature Engineering - Feature design with leakage prevention
- 📎 Model Selection - Choosing a model that fits the constraints
- 📎 ML Governance - Approval gates
- 📎 Data Modeling Fundamentals - Data structure design