🎯 Problem Framing
Level: Foundation. Solves: defining the ML problem correctly and avoiding common pitfalls such as data leakage and misaligned objectives.
Why does Problem Framing matter?
💡 Professor Tom
80% of ML projects fail not because the model is weak, but because the problem was framed incorrectly from the start. A perfect model that solves the wrong problem is still a failure. Time invested at this step saves months of refactoring later.
┌─────────────────────────────────────────────────────────────────┐
│ ML PROJECT SUCCESS FACTORS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ PROBLEM FRAMING (60%) │ │
│ │ • Clear objective definition │ │
│ │ • Correct success metrics │ │
│ │ • Leakage prevention │ │
│ │ • Constraint identification │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ DATA QUALITY (25%) │ │
│ │ • Representative data │ │
│ │ • Proper labeling │ │
│ │ • Feature engineering │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ MODEL SELECTION (15%) │ │
│ │ • Algorithm choice │ │
│ │ • Hyperparameter tuning │ │
│ │ • Architecture design │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Objective Definition Framework
Business Objective vs ML Objective
| Aspect | Business Objective | ML Objective |
|---|---|---|
| Focus | Business outcome | Measurable prediction |
| Metric | Revenue, retention | Accuracy, AUC, RMSE |
| Stakeholder | Business team | ML team |
| Example | "Reduce churn" | "Predict churn probability" |
The Translation Process
┌─────────────────────────────────────────────────────────────────┐
│ BUSINESS → ML OBJECTIVE TRANSLATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Business Goal ML Problem Type ML Objective │
│ ───────────── ─────────────── ──────────── │
│ │
│ "Reduce churn" → Classification → Predict P(churn) │
│ │
│ "Increase sales" → Recommendation → Rank products │
│ │
│ "Detect fraud" → Anomaly Detection → Score anomaly │
│ │
│ "Forecast demand" → Regression → Predict quantity │
│ │
│ "Segment users" → Clustering → Group similar │
│ │
└─────────────────────────────────────────────────────────────────┘
Objective Definition Checklist
⚠️ Before writing any model code
Answer ALL of the following questions first:
| Category | Question | Why It Matters |
|---|---|---|
| What | What is the prediction target? | Defines label |
| When | At what point in time is the prediction needed? | Defines feature availability |
| Who | Who or what is the prediction for? | Defines entity |
| How | How will the prediction be used? | Defines threshold/ranking |
| Why | What is the business impact of the prediction? | Defines success metric |
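The five questions above can be enforced as a lightweight pre-coding gate. A minimal sketch (the `validate_framing` helper and the churn answers are illustrative, not from a real project):

```python
# Illustrative sketch: capture the five framing questions as a structured
# record and refuse to proceed until every one has a concrete answer.
FRAMING_QUESTIONS = ["what", "when", "who", "how", "why"]

def validate_framing(framing: dict) -> list:
    """Return the framing questions that are still unanswered."""
    return [q for q in FRAMING_QUESTIONS
            if not framing.get(q, "").strip()]

# Hypothetical answers for a churn-prediction project
churn_framing = {
    "what": "P(customer churns within 30 days)",
    "when": "at the start of each billing cycle",
    "who": "each active subscriber",
    "how": "top-decile scores routed to the retention team",
    "why": "reduce monthly churn; success metric = retention uplift",
}

missing = validate_framing(churn_framing)
assert missing == [], f"Unanswered framing questions: {missing}"
```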
Constraint Identification
Types of Constraints
┌─────────────────────────────────────────────────────────────────┐
│ ML PROJECT CONSTRAINTS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ TECHNICAL │ │ BUSINESS │ │ ETHICAL │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ • Latency │ │ • Budget │ │ • Fairness │ │
│ │ • Throughput │ │ • Timeline │ │ • Privacy │ │
│ │ • Memory │ │ • Interpretability│ │ • Bias │ │
│ │ • Compute │ │ • Maintenance │ │ • Transparency │ │
│ │ • Data volume │ │ • Team skills │ │ • Consent │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Constraint Trade-offs Matrix
| Constraint | Impact on Model | Mitigation Strategy |
|---|---|---|
| Low latency (<10ms) | Simpler models, fewer features | Feature caching, model distillation |
| High interpretability | Linear models, decision trees | SHAP, LIME for complex models |
| Limited labeled data | Overfitting risk | Transfer learning, semi-supervised |
| Real-time features | Feature engineering complexity | Feature stores, streaming pipelines |
| Fairness requirements | Constrained optimization | Fairness-aware training |
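Latency constraints are cheap to verify early, before committing to a model family. A hedged sketch of a p99 latency check, assuming a stand-in `predict` function in place of a real model:

```python
# Illustrative sketch: measure whether prediction latency fits a 10 ms
# budget. `predict` is a placeholder for any model's inference call.
import time

def predict(features):
    # Placeholder model: a linear score over the feature vector
    return sum(w * x for w, x in zip([0.2, -0.1, 0.5], features))

def p99_latency_ms(fn, sample, n_calls=1000):
    """Time n_calls invocations and return the 99th-percentile latency."""
    timings = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn(sample)
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return timings[int(0.99 * len(timings)) - 1]

latency = p99_latency_ms(predict, [1.0, 2.0, 3.0])
assert latency < 10, f"p99 latency {latency:.2f} ms exceeds 10 ms budget"
```

The same harness can be pointed at a real model's `predict` to compare candidates against the budget from the constraints table.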
Data Leakage Prevention
What is Data Leakage?
🚨 Data Leakage
Data leakage occurs when information from the future (or information not available at prediction time) "leaks" into the training data, producing a model with deceptively high offline performance that fails in production.
Types of Leakage
┌─────────────────────────────────────────────────────────────────┐
│ TYPES OF DATA LEAKAGE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. TARGET LEAKAGE │
│ ───────────────── │
│      Feature contains information about the target              │
│ │
│ Example: Predicting loan default │
│ ❌ Feature: "collection_calls_count" │
│         → Only available after the default has happened!        │
│ │
│ 2. TRAIN-TEST CONTAMINATION │
│ ───────────────────────── │
│      Test data "sees" the training data                         │
│ │
│ Example: Time series forecasting │
│      ❌ Random split instead of temporal split                  │
│         → Future data in the training set!                      │
│ │
│ 3. FEATURE LEAKAGE │
│ ─────────────── │
│ Feature computed với future information │
│ │
│ Example: Customer lifetime value │
│ ❌ Feature: "total_purchases_ever" │
│ → Includes future purchases! │
│ │
└─────────────────────────────────────────────────────────────────┘
Leakage Detection Checklist
| Check | Question | Red Flag |
|---|---|---|
| Temporal | Is the feature available at prediction time? | Feature created after the prediction point |
| Causal | Is the feature a cause or an effect of the target? | Feature is a consequence of the target |
| Proxy | Is the feature a direct proxy for the target? | Near-perfect correlation |
| Processing | Does preprocessing use test data? | Scaler fit on the full dataset |
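The temporal check above is easiest to enforce at split time. A minimal sketch of a cutoff-based (temporal) split for time-ordered data, with illustrative shapes:

```python
# Sketch of preventing train-test contamination in time-ordered data:
# split on a cutoff date rather than at random, so no future rows leak
# into training. Data below is synthetic for illustration.
from datetime import date, timedelta

# One row per day: (observation_date, feature, label)
rows = [(date(2024, 1, 1) + timedelta(days=i), float(i), i % 2)
        for i in range(100)]

cutoff = date(2024, 3, 1)

# ✅ Temporal split: everything before the cutoff trains, the rest tests
train = [r for r in rows if r[0] < cutoff]
test = [r for r in rows if r[0] >= cutoff]

# Invariant: every training row strictly precedes every test row
assert max(r[0] for r in train) < min(r[0] for r in test)
```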
Prevention Strategies
```python
# ❌ WRONG: leakage through preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fit on ALL data, including future test rows
X_train, X_test = train_test_split(X_scaled)

# ✅ CORRECT: split first, then fit preprocessing on the training set only
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit only on train
X_test_scaled = scaler.transform(X_test)        # Transform test
```
Problem Type Selection
Decision Framework
┌─────────────────────────────────────────────────────────────────┐
│ ML PROBLEM TYPE DECISION TREE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Do you have labels? │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ │ │
│ YES NO │
│ │ │ │
│ Is target continuous? → UNSUPERVISED │
│ │ • Clustering │
│ ┌────────┴────────┐ • Anomaly Detection │
│ │ │ • Dimensionality Reduction │
│ YES NO │
│ │ │ │
│ REGRESSION Is it ranking? │
│ • Linear │ │
│ • Tree-based ┌────┴────┐ │
│ • Neural │ │ │
│ YES NO │
│ │ │ │
│ RANKING CLASSIFICATION │
│ • LTR • Binary │
│ • Rec • Multi-class │
│ • Multi-label │
│ │
└─────────────────────────────────────────────────────────────────┘
Common Reframing Patterns
| Original Problem | Reframed As | Benefit |
|---|---|---|
| Regression → Classification | Bucket continuous target | Simpler, more robust |
| Classification → Ranking | Rank by probability | Better for recommendations |
| Multi-class → Binary | One-vs-rest | Easier to interpret |
| Point prediction → Distribution | Predict uncertainty | Better decision making |
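The first pattern (regression → classification) can be as simple as binning the continuous target. A sketch with made-up bin edges; in practice the tiers come from business definitions:

```python
# Sketch: bucket a continuous target (e.g. predicted spend in dollars)
# into ordinal classes. Bin edges are illustrative assumptions.
import bisect

BIN_EDGES = [50.0, 200.0]      # < 50 → "low", [50, 200) → "mid", >= 200 → "high"
LABELS = ["low", "mid", "high"]

def bucketize(spend: float) -> str:
    """Map a continuous value to its class label via binary search."""
    return LABELS[bisect.bisect_right(BIN_EDGES, spend)]

assert bucketize(10.0) == "low"
assert bucketize(120.0) == "mid"
assert bucketize(900.0) == "high"
```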
Success Metrics Alignment
Metric Selection Framework
| Business Goal | ML Metric | Threshold Consideration |
|---|---|---|
| Minimize false positives | Precision | High precision, lower recall |
| Minimize false negatives | Recall | High recall, lower precision |
| Balance both | F1-Score | Harmonic mean |
| Ranking quality | AUC-ROC, NDCG | Threshold-independent |
| Calibrated probabilities | Brier Score, Log Loss | Probability accuracy |
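Threshold choice can be driven directly by asymmetric error costs instead of a generic metric. A sketch assuming illustrative scores and labels, using the $10/$500 cost figures from the fraud example in this section:

```python
# Sketch: pick a decision threshold by minimizing expected cost.
# Scores and labels below are made up for illustration.
COST_FP, COST_FN = 10.0, 500.0   # cost figures from the fraud example

labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]   # 1 = fraud
scores = [0.1, 0.2, 0.15, 0.4, 0.35, 0.6, 0.7, 0.8, 0.55, 0.9]

def expected_cost(threshold: float) -> float:
    """Total cost of false positives and false negatives at a threshold."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return COST_FP * fp + COST_FN * fn

# Sweep candidate thresholds and keep the cheapest
best = min((t / 100 for t in range(1, 100)), key=expected_cost)

# With FN 50x costlier than FP, the chosen threshold sits low:
# high recall, at the price of more false alarms.
assert best < 0.5
assert expected_cost(best) <= expected_cost(0.5)
```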
Metric-Business Alignment Example
┌─────────────────────────────────────────────────────────────────┐
│ FRAUD DETECTION: METRIC ALIGNMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Business Context: │
│ • False Positive: Block legitimate transaction → Customer angry│
│ • False Negative: Miss fraud → Financial loss │
│ │
│ Cost Analysis: │
│ • Cost of FP: $10 (customer service, friction) │
│ • Cost of FN: $500 (average fraud amount) │
│ │
│ Optimal Threshold: │
│ • FN is 50x more costly than FP │
│ • Optimize for high recall (catch more fraud) │
│ • Accept lower precision (more false alarms) │
│ │
│ Recommended Metric: Recall @ Precision = 0.5 │
│ Or: Custom cost-weighted metric │
│ │
└─────────────────────────────────────────────────────────────────┘
Best Practices
Problem Framing Document Template
```markdown
## ML Problem Framing Document

### 1. Business Context
- Business objective: [What business outcome?]
- Current solution: [How is it solved today?]
- Success criteria: [What defines success?]

### 2. ML Formulation
- Prediction target: [What to predict?]
- Entity: [Predict for what/whom?]
- Prediction time: [When is prediction needed?]
- Problem type: [Classification/Regression/etc.]

### 3. Data Availability
- Label source: [Where do labels come from?]
- Feature sources: [What data is available?]
- Temporal constraints: [What's available at prediction time?]

### 4. Constraints
- Latency: [Max prediction time?]
- Interpretability: [Need explanations?]
- Fairness: [Protected attributes?]

### 5. Risks
- Leakage risks: [Potential leakage sources?]
- Distribution shift: [How might data change?]
- Feedback loops: [Does prediction affect future data?]
```
✅ Operational Checklist
Problem Framing Gate (Gate 0)
| Check | Required | Owner | Notes |
|---|---|---|---|
| Business objective documented | ✓ | PM | Clear success criteria |
| ML objective defined | ✓ | ML Lead | Prediction target, entity, timing |
| Problem type selected | ✓ | ML Engineer | Classification/Regression/Ranking |
| Constraints identified | ✓ | Tech Lead | Latency, interpretability, fairness |
| Data availability confirmed | ✓ | Data Engineer | Labels, features, volume |
| Leakage risks assessed | ✓ | ML Lead | See leakage.md |
| Success metrics aligned | ✓ | PM + ML Lead | Business ↔ ML metric mapping |
| Baseline defined | ✓ | ML Engineer | What to compare against |
Pre-Experimentation Gate
| Check | Required | Owner |
|---|---|---|
| Problem framing doc approved | ✓ | ML Lead |
| Data access granted | ✓ | Data Governance |
| Compute resources allocated | ✓ | Platform |
| Experiment tracking configured | ✓ | ML Engineer |
📎 Cross-References
- 📎 Data Leakage & Evaluation Pitfalls - Deep dive into leakage detection
- 📎 Evaluation Fundamentals - Metrics and evaluation strategies
- 📎 Feature Engineering - Feature design with leakage prevention
- 📎 Model Selection - Choosing a model that fits the constraints
- 📎 ML Governance - Approval gates
- 📎 Data Modeling Fundamentals - Data structure design