🧪 Experimentation
Level: Advanced · Solves: Managing experiments systematically, ensuring reproducibility and collaboration
Why does Experimentation Management matter?
💡 Professor Tom
"If you can't reproduce a result, you don't have a result." Running ML experiments without proper tracking is like doing science without taking notes. A month later, you won't remember which hyperparameters produced your best model.
┌─────────────────────────────────────────────────────────────────┐
│ EXPERIMENTATION CHAOS vs ORDER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ WITHOUT TRACKING: │
│ ───────────────── │
│ • "Which notebook had the best model?" │
│ • "What hyperparameters did I use?" │
│ • "Why did this model perform better?" │
│ • "Can we reproduce last week's results?" │
│ • model_v2_final_FINAL_v3.pkl │
│ │
│ WITH PROPER TRACKING: │
│ ───────────────────── │
│ • Every experiment logged with parameters │
│ • Metrics tracked over time │
│ • Artifacts versioned and stored │
│ • Easy comparison between runs │
│ • One-click reproduction │
│ │
└─────────────────────────────────────────────────────────────────┘
Experiment Tracking Components
What to Track
┌─────────────────────────────────────────────────────────────────┐
│ EXPERIMENT TRACKING COMPONENTS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ PARAMETERS │ │
│ │ • Hyperparameters (learning_rate, n_estimators) │ │
│ │ • Data parameters (train_size, features_used) │ │
│ │ • Environment (python_version, library_versions) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ METRICS │ │
│ │ • Training metrics (loss, accuracy per epoch) │ │
│ │ • Validation metrics (AUC, F1, RMSE) │ │
│ │ • Custom business metrics │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ARTIFACTS │ │
│ │ • Model files (model.pkl, model.h5) │ │
│ │ • Plots (confusion_matrix.png, roc_curve.png) │ │
│ │ • Data samples (predictions.csv) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ CODE │ │
│ │ • Git commit hash │ │
│ │ • Notebook snapshot │ │
│ │ • Config files │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Tracking Hierarchy
| Level | Description | Example |
|---|---|---|
| Project | High-level ML initiative | "Customer Churn Prediction" |
| Experiment | Specific approach/hypothesis | "XGBoost with new features" |
| Run | Single training execution | "Run #42 with lr=0.01" |
| Metric | Measured value | "val_auc=0.85" |
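This hierarchy can be modeled as nested records. A minimal pure-Python sketch (the class and field names are illustrative, not from any tracking library):

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

@dataclass
class Experiment:
    name: str
    runs: list = field(default_factory=list)

@dataclass
class Project:
    name: str
    experiments: list = field(default_factory=list)

# One project holds many experiments; each experiment holds many runs
proj = Project("Customer Churn Prediction")
exp = Experiment("XGBoost with new features")
exp.runs.append(Run("Run #42", params={"lr": 0.01}, metrics={"val_auc": 0.85}))
proj.experiments.append(exp)
```

Tools like MLflow and W&B implement exactly this shape: you pick an experiment once, then each training execution becomes a run with its own parameters and metrics.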
MLflow
MLflow Architecture
┌─────────────────────────────────────────────────────────────────┐
│ MLFLOW COMPONENTS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ MLflow │ │ MLflow │ │ MLflow │ │
│ │ Tracking │ │ Projects │ │ Models │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ • Log params │ │ • Package code │ │ • Model format │ │
│ │ • Log metrics │ │ • Dependencies │ │ • Model registry│ │
│ │ • Log artifacts │ │ • Entry points │ │ • Deployment │ │
│ │ • Compare runs │ │ • Reproducibility│ │ • Versioning │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ MLflow Model Registry │ │
│ │ • Stage transitions (Staging → Production) │ │
│ │ • Model versioning │ │
│ │ • Approval workflows │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
MLflow Tracking Example
```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Assumes X_train, X_test, y_train, y_test are already defined

# Set experiment
mlflow.set_experiment("churn-prediction")

# Start run
with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 5,
        "random_state": 42,
    }
    mlflow.log_params(params)

    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Predictions
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # Log metrics
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("auc", roc_auc_score(y_test, y_proba))

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts (the file must already exist on disk)
    mlflow.log_artifact("feature_importance.png")

    # Log tags
    mlflow.set_tag("model_type", "random_forest")
    mlflow.set_tag("data_version", "v2.1")
```
MLflow Model Registry
```python
import mlflow

# Register model (run_id comes from a completed tracking run)
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "churn-model")

# Transition to staging
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="churn-model",
    version=1,
    stage="Staging",
)

# Load model from registry
model = mlflow.pyfunc.load_model("models:/churn-model/Staging")
```
Weights & Biases (W&B)
W&B Features
┌─────────────────────────────────────────────────────────────────┐
│ W&B FEATURES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Experiments │ │ Artifacts │ │ Sweeps │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ • Run tracking │ │ • Dataset │ │ • Hyperparameter│ │
│ │ • Comparison │ │ versioning │ │ optimization │ │
│ │ • Visualization │ │ • Model storage │ │ • Distributed │ │
│ │ • Collaboration │ │ • Lineage │ │ search │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Reports │ │ Tables │ │ Alerts │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ • Shareable │ │ • Data │ │ • Metric │ │
│ │ dashboards │ │ exploration │ │ thresholds │ │
│ │ • Markdown │ │ • Predictions │ │ • Run failures │ │
│ │ • Embedding │ │ • Comparisons │ │ • Notifications │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
W&B Example
```python
import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Initialize run
wandb.init(
    project="churn-prediction",
    name="rf-baseline",
    config={
        "n_estimators": 100,
        "max_depth": 10,
    },
)

# Access config
config = wandb.config

# Train model
model = RandomForestClassifier(
    n_estimators=config.n_estimators,
    max_depth=config.max_depth,
)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Log metrics
wandb.log({
    "accuracy": accuracy_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_proba),
    "confusion_matrix": wandb.plot.confusion_matrix(
        y_true=y_test, preds=y_pred, class_names=["No Churn", "Churn"]
    ),
})

# Log model artifact (assumes model.pkl was saved to disk)
artifact = wandb.Artifact("model", type="model")
artifact.add_file("model.pkl")
wandb.log_artifact(artifact)

wandb.finish()
```
W&B Sweeps (Hyperparameter Tuning)
```python
import wandb
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Define sweep config
sweep_config = {
    "method": "bayes",  # or "grid", "random"
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "n_estimators": {"values": [100, 200, 500]},
        "max_depth": {"min": 3, "max": 15},
        # log_uniform_values samples between the actual bounds on a log scale
        "learning_rate": {"min": 0.001, "max": 0.1, "distribution": "log_uniform_values"},
    },
}

# Create sweep
sweep_id = wandb.sweep(sweep_config, project="churn-prediction")

# Define training function
def train():
    wandb.init()
    config = wandb.config
    model = XGBClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth,
        learning_rate=config.learning_rate,
    )
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    wandb.log({"val_auc": auc})

# Run sweep
wandb.agent(sweep_id, train, count=50)
```
Reproducibility
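Reproducibility starts with recording exactly what produced a run. A minimal sketch using only the standard library plus `git` on the PATH (the helper name is an illustration; log the resulting dict as tags or params in whichever tracker you use):

```python
import platform
import subprocess
import sys

def snapshot_environment() -> dict:
    """Collect the basic facts needed to reproduce a run later."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"  # not a git repo, or git not installed
    return {
        "git_commit": commit,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }

print(snapshot_environment())
```

Pair this with a check that the working tree is clean (`git status --porcelain` is empty) before launching a run, so the logged commit really is the code that ran.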
Reproducibility Checklist
┌─────────────────────────────────────────────────────────────────┐
│ REPRODUCIBILITY CHECKLIST │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ☐ CODE VERSIONING │
│ • Git commit hash logged │
│ • No uncommitted changes │
│ • Branch/tag recorded │
│ │
│ ☐ DATA VERSIONING │
│ • Data version/hash logged │
│ • Data source documented │
│ • Preprocessing steps recorded │
│ │
│ ☐ ENVIRONMENT │
│ • requirements.txt / environment.yml │
│ • Python version │
│ • CUDA/GPU versions (if applicable) │
│ │
│ ☐ RANDOM SEEDS │
│ • All random seeds set and logged │
│ • numpy, random, torch seeds │
│ │
│ ☐ CONFIGURATION │
│ • All hyperparameters logged │
│ • Config files versioned │
│ │
└─────────────────────────────────────────────────────────────────┘
Setting Random Seeds
```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    """Set all random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # For CUDA determinism (may impact performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Set environment variable
    os.environ["PYTHONHASHSEED"] = str(seed)

# Use at the start of every experiment
set_seed(42)
```
Data Versioning with DVC
```bash
# Initialize DVC
dvc init

# Track data file
dvc add data/train.csv

# Commit to git
git add data/train.csv.dvc .gitignore
git commit -m "Add training data v1"

# Push data to remote storage
dvc push

# Pull data in another environment
dvc pull
```
Experiment Organization
Project Structure
```text
ml-project/
├── configs/
│   ├── base.yaml           # Base configuration
│   ├── experiment_1.yaml   # Experiment-specific overrides
│   └── experiment_2.yaml
├── data/
│   ├── raw/                # Original data (DVC tracked)
│   ├── processed/          # Processed data
│   └── features/           # Feature store
├── models/
│   └── artifacts/          # Saved models
├── notebooks/
│   ├── exploration/        # EDA notebooks
│   └── experiments/        # Experiment notebooks
├── src/
│   ├── data/               # Data processing
│   ├── features/           # Feature engineering
│   ├── models/             # Model definitions
│   └── evaluation/         # Evaluation utilities
├── tests/
├── mlflow.db               # MLflow tracking database
├── requirements.txt
└── README.md
```
Configuration Management with Hydra
```yaml
# config/config.yaml
defaults:
  - model: xgboost
  - data: default

experiment:
  name: "baseline"
  seed: 42
```

```yaml
# config/model/xgboost.yaml
# (config-group files are placed under their group key automatically,
#  so this file's contents appear as cfg.model.*)
name: xgboost
params:
  n_estimators: 100
  max_depth: 6
  learning_rate: 0.1
```

```python
# train.py
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="config", version_base=None)
def train(cfg: DictConfig):
    print(f"Experiment: {cfg.experiment.name}")
    print(f"Model: {cfg.model.name}")
    print(f"Params: {cfg.model.params}")
    # Training logic here...

if __name__ == "__main__":
    train()
```

```bash
# Run with overrides
python train.py model.params.n_estimators=200 experiment.name=exp_v2
```
Best Practices
Experiment Naming Convention
| Component | Convention | Example |
|---|---|---|
| Project | {team}-{problem} | ml-churn-prediction |
| Experiment | {approach}-{version} | xgboost-v2 |
| Run | {date}-{description} | 20240315-new-features |
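The conventions above are easy to enforce with a tiny helper. A sketch (the function name and format string are assumptions, not from any library):

```python
from datetime import date
from typing import Optional

def run_name(description: str, when: Optional[date] = None) -> str:
    """Build a run name following the {date}-{description} convention."""
    when = when or date.today()
    return f"{when:%Y%m%d}-{description}"

print(run_name("new-features", date(2024, 3, 15)))  # → 20240315-new-features
```

Generating names programmatically keeps them sortable and greppable, and removes the temptation to hand-type `model_v2_final_FINAL_v3`.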
Comparison and Analysis
```python
# MLflow: Compare runs
import mlflow

# Get all runs from the experiment as a DataFrame
experiment = mlflow.get_experiment_by_name("churn-prediction")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Filter and sort
best_runs = runs[runs["metrics.auc"] > 0.8].sort_values(
    "metrics.auc", ascending=False
)

# Compare parameters of top runs
print(best_runs[["params.n_estimators", "params.max_depth", "metrics.auc"]])
```
Common Pitfalls
🚨 Experimentation Anti-Patterns
- No tracking: Running experiments without logging
- Incomplete logging: Missing parameters or metrics
- No versioning: Can't reproduce results
- Messy organization: Experiments scattered across notebooks
- No comparison: Not analyzing what worked and why
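A lightweight guard against the "no tracking / incomplete logging" failure modes is to funnel every training call through one wrapper that records parameters and results unconditionally. A pure-Python sketch (the decorator name is illustrative; replace the `print` with your tracker's logging calls):

```python
import functools
import json
from datetime import datetime, timezone

def tracked(train_fn):
    """Wrap a training function so every call logs its params and metrics."""
    @functools.wraps(train_fn)
    def wrapper(**params):
        record = {
            "started_at": datetime.now(timezone.utc).isoformat(),
            "params": params,
        }
        record["metrics"] = train_fn(**params)
        print(json.dumps(record))  # swap for mlflow/wandb logging
        return record["metrics"]
    return wrapper

@tracked
def train(n_estimators=100, max_depth=10):
    # ... fit a model here and return its evaluation metrics ...
    return {"val_auc": 0.85}

train(n_estimators=200)
```

Because the decorator sits on the only entry point, an experiment physically cannot run untracked, which is cheaper than relying on discipline.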
📎 Cross-References
- 📎 Evaluation Fundamentals - Metrics to track
- 📎 Model Selection - Experiments for model comparison
- 📎 Deployment Patterns - From experiment to production
- 📎 ML Monitoring - Production experiment tracking