🔧 Feature Engineering
Level: Core
Solves: Design and manage features systematically, ensuring consistency between training and serving
Why is Feature Engineering important?
💡 Professor Tom
"Applied machine learning is basically feature engineering" - Andrew Ng.
The best model with poor features will lose to a simple model with good features. Feature engineering is where domain knowledge meets data science.
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE ENGINEERING IMPACT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Model Performance Contribution: │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Feature Engineering ████████████████████████████ 70% │ │
│ │ Model Selection ████████████ 20% │ │
│ │ Hyperparameter Tuning ████ 10% │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ "Garbage in, garbage out" - No model can fix bad features │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Types
Categorical Features
| Type | Description | Encoding Methods |
|---|---|---|
| Nominal | No order (color, country) | One-hot, Target encoding |
| Ordinal | Has order (low/med/high) | Label encoding, Ordinal encoding |
| High cardinality | Many unique values (user_id) | Hashing, Embedding |
Numerical Features
| Type | Description | Transformation |
|---|---|---|
| Continuous | Any value (price, age) | Scaling, Binning |
| Discrete | Integer values (count) | Log transform, Binning |
| Ratio | Has true zero (distance) | Standard scaling |
| Interval | No true zero (temperature) | Min-max scaling |
Temporal Features
┌─────────────────────────────────────────────────────────────────┐
│ TEMPORAL FEATURE EXTRACTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ From timestamp "2024-03-15 14:30:00": │
│ │
│ Cyclical Features: │
│ • hour_sin, hour_cos (captures daily patterns) │
│ • day_of_week_sin/cos (captures weekly patterns) │
│ • month_sin/cos (captures seasonal patterns) │
│ │
│ Binary Features: │
│ • is_weekend (0 or 1) │
│ • is_holiday (0 or 1) │
│ • is_business_hours (0 or 1) │
│ │
│ Lag Features: │
│ • value_1h_ago (1 hour lag) │
│ • value_24h_ago (1 day lag) │
│ • value_7d_ago (1 week lag) │
│ │
│ Rolling Features: │
│ • rolling_mean_7d (7-day moving average) │
│ • rolling_std_7d (7-day standard deviation) │
│ • rolling_max_30d (30-day maximum) │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Transformation Techniques
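A minimal pandas sketch of the cyclical, lag, and rolling features described above (timestamps and column names are illustrative, not from any specific dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-03-15", periods=48, freq="h"),
    "value": np.arange(48, dtype=float),
})

# Cyclical encoding: map hour 0..23 onto a circle so 23:00 and 00:00
# end up close together, unlike a raw integer encoding
hour = df["timestamp"].dt.hour
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# Binary feature
df["is_weekend"] = (df["timestamp"].dt.dayofweek >= 5).astype(int)

# Lag feature: value 24 rows (1 day) earlier
df["value_24h_ago"] = df["value"].shift(24)

# Rolling feature: trailing 24-hour mean
df["rolling_mean_24h"] = df["value"].rolling(24).mean()
```

The same sin/cos pattern extends to day-of-week (period 7) and month (period 12).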
Scaling Methods
| Method | Formula | Use Case |
|---|---|---|
| Standard Scaling | (x - μ) / σ | Gaussian-like distributions |
| Min-Max Scaling | (x - min) / (max - min) | Bounded range [0, 1] |
| Robust Scaling | (x - median) / IQR | Outlier-resistant |
| Log Transform | log(x + 1) | Right-skewed distributions |
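The four methods in the table can be compared side by side; a sketch using scikit-learn on a small right-skewed sample (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

standard = StandardScaler().fit_transform(x)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(x)      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(x)      # (x - median) / IQR
log1p = np.log1p(x)                           # log(x + 1) for skew

# Min-max is squashed by the outlier: the four normal points all land
# below 0.031, while robust scaling keeps them spread out
print(minmax.ravel().round(3))
print(robust.ravel().round(3))
```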
Encoding Strategies
```python
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.feature_extraction import FeatureHasher

# One-Hot Encoding (low cardinality)
# ❌ WRONG for high cardinality - creates a sparse matrix
pd.get_dummies(df['country'])  # 200 countries = 200 columns

# Target Encoding (high cardinality)
# ✅ CORRECT - single column with target statistics
encoder = TargetEncoder()
df['country_encoded'] = encoder.fit_transform(df['country'], df['target'])

# Hashing (very high cardinality)
# ✅ CORRECT for user_id, product_id
hasher = FeatureHasher(n_features=1000, input_type='string')
# with input_type='string', each sample is an iterable of strings
hashed = hasher.transform([[uid] for uid in df['user_id'].astype(str)])
```
Feature Interactions
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE INTERACTION PATTERNS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Arithmetic Interactions: │
│ • price_per_sqft = price / square_feet │
│ • total_spend = quantity × unit_price │
│ • profit_margin = (revenue - cost) / revenue │
│ │
│ Polynomial Features: │
│ • age² (quadratic relationship) │
│ • age × income (interaction term) │
│ │
│ Domain-Specific: │
│ • BMI = weight / height² │
│ • CTR = clicks / impressions │
│ • recency × frequency (RFM score) │
│ │
│ ⚠️ Warning: More features ≠ better model │
│ → Feature selection is equally important │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Pipeline Design
Pipeline Architecture
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE PIPELINE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Raw Data │───►│ Feature │───►│ Feature │ │
│ │ Sources │ │ Engineering │ │ Store │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ • Events │ │ • Transform │ │ • Offline │ │
│ │ • Logs │ │ • Aggregate │ │ (batch) │ │
│ │ • Tables │ │ • Join │ │ • Online │ │
│ │ • APIs │ │ • Validate │ │ (real-time)│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ┌────────────────────┴────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐
│ │ Training │ │ Serving │
│ │ Pipeline │ │ Pipeline │
│ └─────────────┘ └─────────────┘
│ │
└─────────────────────────────────────────────────────────────────┘
Sklearn Pipeline Example
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Define feature groups
numeric_features = ['age', 'income', 'tenure']
categorical_features = ['gender', 'region', 'plan_type']

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline with model
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
```
Offline/Online Parity
The Training-Serving Skew Problem
🚨 Training-Serving Skew
One of the hardest bugs to debug in ML: the model works well offline but fails in production because features are computed differently.
┌─────────────────────────────────────────────────────────────────┐
│ TRAINING-SERVING SKEW SOURCES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. DATA SKEW │
│ Training data ≠ Production data distribution │
│ Example: Training on historical data, serving on new users │
│ │
│ 2. FEATURE SKEW │
│ Feature computation differs between training and serving │
│ Example: Different aggregation windows │
│ │
│ 3. LABEL SKEW │
│ Label definition changes over time │
│ Example: "Churn" definition updated │
│ │
│ 4. TEMPORAL SKEW │
│ Time-dependent features computed incorrectly │
│ Example: Using future data in training │
│ │
└─────────────────────────────────────────────────────────────────┘
Ensuring Parity
| Strategy | Description | Implementation |
|---|---|---|
| Single codebase | Same code for training/serving | Shared feature library |
| Feature store | Centralized feature management | Feast, Tecton, Vertex AI |
| Point-in-time joins | Correct temporal joins | Avoid future leakage |
| Feature validation | Compare distributions | Great Expectations |
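A minimal sketch of the "single codebase" strategy: one shared function computes the feature for both the batch (training) path and the online (serving) path, so the two cannot drift apart. Function and column names here are hypothetical:

```python
import pandas as pd

def days_since_last_order(now: pd.Timestamp, last_order: pd.Timestamp) -> int:
    """Shared feature logic used by both pipelines."""
    return (now - last_order).days

# Batch path: applied over a training DataFrame
train = pd.DataFrame({
    "user_id": [1, 2],
    "last_order": pd.to_datetime(["2024-03-01", "2024-03-10"]),
})
now = pd.Timestamp("2024-03-15")
train["days_since_last_order"] = [
    days_since_last_order(now, ts) for ts in train["last_order"]
]

# Online path: single lookup at request time
online_value = days_since_last_order(now, pd.Timestamp("2024-03-01"))

# Parity check: same inputs must yield identical feature values
assert online_value == train["days_since_last_order"].iloc[0]
```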
Feature Stores
What is a Feature Store?
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE STORE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ FEATURE STORE │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Feature │ │ Offline │ │ Online │ │ │
│ │ │ Registry │ │ Store │ │ Store │ │ │
│ │ ├─────────────┤ ├─────────────┤ ├─────────────┤ │ │
│ │ │ • Metadata │ │ • Historical│ │ • Latest │ │ │
│ │ │ • Lineage │ │ • Batch │ │ • Low latency│ │ │
│ │ │ • Versioning│ │ • Training │ │ • Serving │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Benefits: │
│ • Feature reuse across teams │
│ • Consistent training/serving │
│ • Point-in-time correctness │
│ • Feature discovery and documentation │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Store Options
| Tool | Type | Best For |
|---|---|---|
| Feast | Open source | Flexibility, self-hosted |
| Tecton | Managed | Enterprise, real-time |
| Vertex AI Feature Store | GCP managed | GCP ecosystem |
| SageMaker Feature Store | AWS managed | AWS ecosystem |
| Databricks Feature Store | Databricks | Spark-based workflows |
Feast Example
```python
from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

# Define entity
user = Entity(
    name="user_id",
    join_keys=["user_id"],
    description="User identifier",
)

# Define feature view
# (user_source is a batch source, e.g. a FileSource, defined elsewhere)
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="total_purchases", dtype=Float32),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,
    source=user_source,
)

store = FeatureStore(repo_path=".")

# Get features for training (entity_df holds entity keys + event timestamps)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_features:total_purchases", "user_features:avg_order_value"],
).to_df()

# Get features for serving
online_features = store.get_online_features(
    features=["user_features:total_purchases"],
    entity_rows=[{"user_id": 12345}],
).to_dict()
```
Feature Selection
Selection Methods
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE SELECTION METHODS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. FILTER METHODS (Fast, model-agnostic) │
│ • Correlation with target │
│ • Mutual information │
│ • Chi-squared test │
│ • Variance threshold │
│ │
│ 2. WRAPPER METHODS (Accurate, expensive) │
│ • Forward selection │
│ • Backward elimination │
│ • Recursive feature elimination (RFE) │
│ │
│ 3. EMBEDDED METHODS (Balance) │
│ • L1 regularization (Lasso) │
│ • Tree-based importance │
│ • Permutation importance │
│ │
└─────────────────────────────────────────────────────────────────┘
Feature Importance Analysis
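A sketch of one filter method (mutual information) and one wrapper method (RFE) from the box above, run on a synthetic dataset where only a few features are informative (all parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Filter: rank features by mutual information with the target
filter_selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
print("filter keeps:", filter_selector.get_support(indices=True))

# Wrapper: recursively drop the weakest feature, refitting each time
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("RFE keeps:", rfe.get_support(indices=True))
```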
```python
import pandas as pd
import shap
from sklearn.inspection import permutation_importance

# model, X_test, y_test, feature_names come from your training run

# SHAP values for feature importance
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Permutation importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean
}).sort_values('importance', ascending=False)
```
Best Practices
Feature Engineering Checklist
| Step | Action | Why |
|---|---|---|
| 1 | Understand domain | Features should make business sense |
| 2 | Explore data | Find patterns, outliers, missing values |
| 3 | Handle missing values | Imputation strategy matters |
| 4 | Encode categoricals | Choose appropriate method |
| 5 | Scale numericals | Required for many algorithms |
| 6 | Create interactions | Capture non-linear relationships |
| 7 | Select features | Remove noise, reduce overfitting |
| 8 | Validate pipeline | Ensure reproducibility |
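Step 8 (validate pipeline) can be sketched as a serialization round-trip: persist the fitted pipeline and check the reloaded copy reproduces identical predictions. A minimal example; in practice joblib and a model registry are common:

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())]).fit(X, y)

# Round-trip through serialization
buf = io.BytesIO()
pickle.dump(pipe, buf)
buf.seek(0)
reloaded = pickle.load(buf)

# Reproducibility check: identical predictions before and after
assert np.array_equal(pipe.predict(X), reloaded.predict(X))
```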
Common Pitfalls
⚠️ Feature Engineering Anti-Patterns
- Data leakage: Using future information
- Target leakage: Features derived from target
- Overfitting: Too many features for data size
- Inconsistency: Different logic in training/serving
- Missing documentation: Features without explanation
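The data-leakage anti-pattern has a common temporal form: a rolling feature whose window includes the current row. A minimal sketch of the fix (shift before rolling):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# ❌ Leaky: the window includes the current row's own value
leaky = s.rolling(2).mean()          # at t=1: mean(10, 20) = 15

# ✅ Leak-free: shift(1) so the window ends at the previous row
safe = s.shift(1).rolling(2).mean()  # at t=2: mean(10, 20) = 15

assert leaky.iloc[1] == 15.0
assert safe.iloc[2] == 15.0
assert np.isnan(safe.iloc[1])        # not enough history yet
```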
📎 Cross-References
- 📎 Problem Framing - Defining which features are needed
- 📎 Model Selection - Choosing a model suited to the features
- 📎 Data Modeling - Data structures for features
- 📎 ML Monitoring - Feature drift detection