🔧 Feature Engineering
Level: Core
Solves: Design and manage features systematically, ensuring consistency between training and serving
Why is Feature Engineering important?
💡 Professor Tom
"Applied machine learning is basically feature engineering" - Andrew Ng.
The best model with poor features will lose to a simple model with good features. Feature engineering is where domain knowledge meets data science.
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE ENGINEERING IMPACT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Model Performance Contribution: │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Feature Engineering ████████████████████████████ 70% │ │
│ │ Model Selection ████████████ 20% │ │
│ │ Hyperparameter Tuning ████ 10% │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ "Garbage in, garbage out" - No model can fix bad features │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Types
Categorical Features
| Type | Description | Encoding Methods |
|---|---|---|
| Nominal | No order (color, country) | One-hot, Target encoding |
| Ordinal | Has order (low/med/high) | Label encoding, Ordinal encoding |
| High cardinality | Many unique values (user_id) | Hashing, Embedding |
Numerical Features
| Type | Description | Transformation |
|---|---|---|
| Continuous | Any value (price, age) | Scaling, Binning |
| Discrete | Integer values (count) | Log transform, Binning |
| Ratio | Has true zero (distance) | Standard scaling |
| Interval | No true zero (temperature) | Min-max scaling |
Temporal Features
┌─────────────────────────────────────────────────────────────────┐
│ TEMPORAL FEATURE EXTRACTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ From timestamp "2024-03-15 14:30:00": │
│ │
│ Cyclical Features: │
│ • hour_sin, hour_cos (captures daily patterns) │
│ • day_of_week_sin/cos (captures weekly patterns) │
│ • month_sin/cos (captures seasonal patterns) │
│ │
│ Binary Features: │
│ • is_weekend (0 or 1) │
│ • is_holiday (0 or 1) │
│ • is_business_hours (0 or 1) │
│ │
│ Lag Features: │
│ • value_1h_ago (1 hour lag) │
│ • value_24h_ago (1 day lag) │
│ • value_7d_ago (1 week lag) │
│ │
│ Rolling Features: │
│ • rolling_mean_7d (7-day moving average) │
│ • rolling_std_7d (7-day standard deviation) │
│ • rolling_max_30d (30-day maximum) │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Transformation Techniques
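A minimal pandas sketch of the cyclical, lag, and rolling features described above (timestamps and column names are illustrative, not from any specific dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-03-15", periods=48, freq="h"),
    "value": np.arange(48, dtype=float),
})

# Cyclical encoding: map hour 0..23 onto a circle so 23:00 and 00:00
# end up close together, unlike a raw integer encoding
hour = df["timestamp"].dt.hour
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# Binary feature
df["is_weekend"] = (df["timestamp"].dt.dayofweek >= 5).astype(int)

# Lag feature: value 24 rows (1 day) earlier
df["value_24h_ago"] = df["value"].shift(24)

# Rolling feature: trailing 24-hour mean
df["rolling_mean_24h"] = df["value"].rolling(24).mean()
```

The same sin/cos pattern extends to day-of-week (period 7) and month (period 12).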
Scaling Methods
| Method | Formula | Use Case |
|---|---|---|
| Standard Scaling | (x - μ) / σ | Gaussian-like distributions |
| Min-Max Scaling | (x - min) / (max - min) | Bounded range [0, 1] |
| Robust Scaling | (x - median) / IQR | Outlier-resistant |
| Log Transform | log(x + 1) | Right-skewed distributions |
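The four methods in the table can be compared side by side; a sketch using scikit-learn on a small right-skewed sample (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

standard = StandardScaler().fit_transform(x)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(x)      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(x)      # (x - median) / IQR
log1p = np.log1p(x)                           # log(x + 1) for skew

# Min-max is squashed by the outlier: the four normal points all land
# below 0.031, while robust scaling keeps them spread out
print(minmax.ravel().round(3))
print(robust.ravel().round(3))
```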
Encoding Strategies
```python
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.feature_extraction import FeatureHasher

# One-Hot Encoding (low cardinality)
# ❌ WRONG for high cardinality - creates a sparse matrix
pd.get_dummies(df['country'])  # 200 countries = 200 columns

# Target Encoding (high cardinality)
# ✅ CORRECT - single column with target statistics
encoder = TargetEncoder()
df['country_encoded'] = encoder.fit_transform(df['country'], df['target'])

# Hashing (very high cardinality)
# ✅ CORRECT for user_id, product_id
hasher = FeatureHasher(n_features=1000, input_type='string')
# with input_type='string', each sample is an iterable of strings
hashed = hasher.transform([[uid] for uid in df['user_id'].astype(str)])
```
Feature Interactions
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE INTERACTION PATTERNS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Arithmetic Interactions: │
│ • price_per_sqft = price / square_feet │
│ • total_spend = quantity × unit_price │
│ • profit_margin = (revenue - cost) / revenue │
│ │
│ Polynomial Features: │
│ • age² (quadratic relationship) │
│ • age × income (interaction term) │
│ │
│ Domain-Specific: │
│ • BMI = weight / height² │
│ • CTR = clicks / impressions │
│ • recency × frequency (RFM score) │
│ │
│ ⚠️ Warning: More features ≠ better model │
│ → Feature selection is equally important │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Pipeline Design
Pipeline Architecture
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE PIPELINE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Raw Data │───►│ Feature │───►│ Feature │ │
│ │ Sources │ │ Engineering │ │ Store │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ • Events │ │ • Transform │ │ • Offline │ │
│ │ • Logs │ │ • Aggregate │ │ (batch) │ │
│ │ • Tables │ │ • Join │ │ • Online │ │
│ │ • APIs │ │ • Validate │ │ (real-time)│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ┌────────────────────┴────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐
│ │ Training │ │ Serving │
│ │ Pipeline │ │ Pipeline │
│ └─────────────┘ └─────────────┘
│ │
└─────────────────────────────────────────────────────────────────┘
Sklearn Pipeline Example
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Define feature groups
numeric_features = ['age', 'income', 'tenure']
categorical_features = ['gender', 'region', 'plan_type']

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline with model
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
```
Offline/Online Parity
The Training-Serving Skew Problem
🚨 Training-Serving Skew
One of the hardest bugs to debug in ML: the model works well offline but fails in production because features are computed differently.
┌─────────────────────────────────────────────────────────────────┐
│ TRAINING-SERVING SKEW SOURCES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. DATA SKEW │
│ Training data ≠ Production data distribution │
│ Example: Training on historical data, serving on new users │
│ │
│ 2. FEATURE SKEW │
│ Feature computation differs between training and serving │
│ Example: Different aggregation windows │
│ │
│ 3. LABEL SKEW │
│ Label definition changes over time │
│ Example: "Churn" definition updated │
│ │
│ 4. TEMPORAL SKEW │
│ Time-dependent features computed incorrectly │
│ Example: Using future data in training │
│ │
└─────────────────────────────────────────────────────────────────┘
Ensuring Parity
| Strategy | Description | Implementation |
|---|---|---|
| Single codebase | Same code for training/serving | Shared feature library |
| Feature store | Centralized feature management | Feast, Tecton, Vertex AI |
| Point-in-time joins | Correct temporal joins | Avoid future leakage |
| Feature validation | Compare distributions | Great Expectations |
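A minimal sketch of the "single codebase" strategy: one shared function computes the feature for both the batch (training) path and the online (serving) path, so the two cannot drift apart. Function and column names here are hypothetical:

```python
import pandas as pd

def days_since_last_order(now: pd.Timestamp, last_order: pd.Timestamp) -> int:
    """Shared feature logic used by both pipelines."""
    return (now - last_order).days

# Batch path: applied over a training DataFrame
train = pd.DataFrame({
    "user_id": [1, 2],
    "last_order": pd.to_datetime(["2024-03-01", "2024-03-10"]),
})
now = pd.Timestamp("2024-03-15")
train["days_since_last_order"] = [
    days_since_last_order(now, ts) for ts in train["last_order"]
]

# Online path: single lookup at request time
online_value = days_since_last_order(now, pd.Timestamp("2024-03-01"))

# Parity check: same inputs must yield identical feature values
assert online_value == train["days_since_last_order"].iloc[0]
```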
Feature Stores
What is a Feature Store?
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE STORE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ FEATURE STORE │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Feature │ │ Offline │ │ Online │ │ │
│ │ │ Registry │ │ Store │ │ Store │ │ │
│ │ ├─────────────┤ ├─────────────┤ ├─────────────┤ │ │
│ │ │ • Metadata │ │ • Historical│ │ • Latest │ │ │
│ │ │ • Lineage │ │ • Batch │ │ • Low latency│ │ │
│ │ │ • Versioning│ │ • Training │ │ • Serving │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Benefits: │
│ • Feature reuse across teams │
│ • Consistent training/serving │
│ • Point-in-time correctness │
│ • Feature discovery and documentation │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Store Options
| Tool | Type | Best For |
|---|---|---|
| Feast | Open source | Flexibility, self-hosted |
| Tecton | Managed | Enterprise, real-time |
| Vertex AI Feature Store | GCP managed | GCP ecosystem |
| SageMaker Feature Store | AWS managed | AWS ecosystem |
| Databricks Feature Store | Databricks | Spark-based workflows |
Feast Example
```python
from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

# Define entity
user = Entity(
    name="user_id",
    join_keys=["user_id"],
    description="User identifier",
)

# Define feature view
# (user_source is a batch source, e.g. a FileSource, defined elsewhere)
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="total_purchases", dtype=Float32),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,
    source=user_source,
)

store = FeatureStore(repo_path=".")

# Get features for training (entity_df holds entity keys + event timestamps)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_features:total_purchases", "user_features:avg_order_value"],
).to_df()

# Get features for serving
online_features = store.get_online_features(
    features=["user_features:total_purchases"],
    entity_rows=[{"user_id": 12345}],
).to_dict()
```
Feature Selection
Selection Methods
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE SELECTION METHODS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. FILTER METHODS (Fast, model-agnostic) │
│ • Correlation with target │
│ • Mutual information │
│ • Chi-squared test │
│ • Variance threshold │
│ │
│ 2. WRAPPER METHODS (Accurate, expensive) │
│ • Forward selection │
│ • Backward elimination │
│ • Recursive feature elimination (RFE) │
│ │
│ 3. EMBEDDED METHODS (Balance) │
│ • L1 regularization (Lasso) │
│ • Tree-based importance │
│ • Permutation importance │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Importance Analysis
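A sketch of one filter method (mutual information) and one wrapper method (RFE) from the box above, run on a synthetic dataset where only a few features are informative (all parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Filter: rank features by mutual information with the target
filter_selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
print("filter keeps:", filter_selector.get_support(indices=True))

# Wrapper: recursively drop the weakest feature, refitting each time
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("RFE keeps:", rfe.get_support(indices=True))
```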
```python
import pandas as pd
import shap
from sklearn.inspection import permutation_importance

# model, X_test, y_test, feature_names come from your training run

# SHAP values for feature importance
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Permutation importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean
}).sort_values('importance', ascending=False)
```
Best Practices
Feature Engineering Checklist
| Step | Action | Why |
|---|---|---|
| 1 | Understand domain | Features should make business sense |
| 2 | Explore data | Find patterns, outliers, missing values |
| 3 | Handle missing values | Imputation strategy matters |
| 4 | Encode categoricals | Choose appropriate method |
| 5 | Scale numericals | Required for many algorithms |
| 6 | Create interactions | Capture non-linear relationships |
| 7 | Select features | Remove noise, reduce overfitting |
| 8 | Validate pipeline | Ensure reproducibility |
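Step 8 (validate pipeline) can be sketched as a serialization round-trip: persist the fitted pipeline and check the reloaded copy reproduces identical predictions. A minimal example; in practice joblib and a model registry are common:

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())]).fit(X, y)

# Round-trip through serialization
buf = io.BytesIO()
pickle.dump(pipe, buf)
buf.seek(0)
reloaded = pickle.load(buf)

# Reproducibility check: identical predictions before and after
assert np.array_equal(pipe.predict(X), reloaded.predict(X))
```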
Common Pitfalls
⚠️ Feature Engineering Anti-Patterns
- Data leakage: Using future information
- Target leakage: Features derived from target
- Overfitting: Too many features for data size
- Inconsistency: Different logic in training/serving
- Missing documentation: Features without explanation
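The data-leakage anti-pattern has a common temporal form: a rolling feature whose window includes the current row. A minimal sketch of the fix (shift before rolling):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# ❌ Leaky: the window includes the current row's own value
leaky = s.rolling(2).mean()          # at t=1: mean(10, 20) = 15

# ✅ Leak-free: shift(1) so the window ends at the previous row
safe = s.shift(1).rolling(2).mean()  # at t=2: mean(10, 20) = 15

assert leaky.iloc[1] == 15.0
assert safe.iloc[2] == 15.0
assert np.isnan(safe.iloc[1])        # not enough history yet
```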
📎 Cross-References
- 📎 Problem Framing - Defining which features are needed
- 📎 Model Selection - Choosing a model suited to the features
- 📎 Data Modeling - Data structures for features
- 📎 ML Monitoring - Feature drift detection