🔧 Feature Engineering

Level: Core Solves: Systematically designing and managing features, ensuring consistency between training and serving

Tại sao Feature Engineering quan trọng?

💡 Giáo sư Tom

"Applied machine learning is basically feature engineering" - Andrew Ng.

The best model with bad features will lose to a simple model with good features. Feature engineering is where domain knowledge meets data science.

┌─────────────────────────────────────────────────────────────────┐
│              FEATURE ENGINEERING IMPACT                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Model Performance Contribution:                                │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Feature Engineering  ████████████████████████████  70%  │    │
│  │ Model Selection      ████████████                  20%  │    │
│  │ Hyperparameter Tuning████                          10%  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  "Garbage in, garbage out" - No model can fix bad features      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Feature Types

Categorical Features

| Type | Description | Encoding Methods |
|---|---|---|
| Nominal | No order (color, country) | One-hot, Target encoding |
| Ordinal | Has order (low/med/high) | Label encoding, Ordinal encoding |
| High cardinality | Many unique values (user_id) | Hashing, Embedding |

Numerical Features

| Type | Description | Transformation |
|---|---|---|
| Continuous | Any value (price, age) | Scaling, Binning |
| Discrete | Integer values (count) | Log transform, Binning |
| Ratio | Has true zero (distance) | Standard scaling |
| Interval | No true zero (temperature) | Min-max scaling |

Temporal Features

┌─────────────────────────────────────────────────────────────────┐
│              TEMPORAL FEATURE EXTRACTION                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  From timestamp "2024-03-15 14:30:00":                          │
│                                                                 │
│  Cyclical Features:                                             │
│  • hour_sin, hour_cos     (captures daily patterns)             │
│  • day_of_week_sin/cos    (captures weekly patterns)            │
│  • month_sin/cos          (captures seasonal patterns)          │
│                                                                 │
│  Binary Features:                                               │
│  • is_weekend             (0 or 1)                              │
│  • is_holiday             (0 or 1)                              │
│  • is_business_hours      (0 or 1)                              │
│                                                                 │
│  Lag Features:                                                  │
│  • value_1h_ago           (1 hour lag)                          │
│  • value_24h_ago          (1 day lag)                           │
│  • value_7d_ago           (1 week lag)                          │
│                                                                 │
│  Rolling Features:                                              │
│  • rolling_mean_7d        (7-day moving average)                │
│  • rolling_std_7d         (7-day standard deviation)            │
│  • rolling_max_30d        (30-day maximum)                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
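The temporal features in the diagram above can all be derived with pandas; a minimal sketch on a made-up hourly series (column names and the 24-row window are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series for illustration
df = pd.DataFrame({
    "ts": pd.date_range("2024-03-01", periods=48, freq="h"),
    "value": np.arange(48, dtype=float),
})

# Cyclical features: map hour 0-23 onto the unit circle so that
# 23:00 and 00:00 end up close together
df["hour_sin"] = np.sin(2 * np.pi * df["ts"].dt.hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["ts"].dt.hour / 24)

# Binary feature
df["is_weekend"] = (df["ts"].dt.dayofweek >= 5).astype(int)

# Lag feature: value 24 rows (1 day) earlier
df["value_24h_ago"] = df["value"].shift(24)

# Rolling feature: 24-hour moving average
df["rolling_mean_24h"] = df["value"].rolling(window=24).mean()
```

Note that `shift` and `rolling` only look backward in time, which is exactly what prevents the future-leakage problem discussed later under temporal skew.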

Feature Transformation Techniques

Scaling Methods

| Method | Formula | Use Case |
|---|---|---|
| Standard Scaling | (x - μ) / σ | Gaussian-like distributions |
| Min-Max Scaling | (x - min) / (max - min) | Bounded range [0, 1] |
| Robust Scaling | (x - median) / IQR | Outlier-resistant |
| Log Transform | log(x + 1) | Right-skewed distributions |
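All four methods in the table have direct scikit-learn/NumPy counterparts; a minimal sketch on made-up numbers with one outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Small right-skewed sample with an outlier, for illustration
X = np.array([[1.0], [2.0], [3.0], [100.0]])

# Standard scaling: zero mean, unit variance
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: squashes everything into [0, 1]
X_mm = MinMaxScaler().fit_transform(X)

# Robust scaling: uses median/IQR, so the outlier 100 barely moves the center
X_rob = RobustScaler().fit_transform(X)

# Log transform: log1p computes log(x + 1), handling zeros and
# compressing the right tail
X_log = np.log1p(X)
```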

Encoding Strategies

```python
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.feature_extraction import FeatureHasher

# One-Hot Encoding (low cardinality)
# ❌ WRONG for high cardinality - creates a sparse, wide matrix
pd.get_dummies(df['country'])  # 200 countries = 200 columns

# Target Encoding (high cardinality)
# ✅ CORRECT - single column with target statistics
encoder = TargetEncoder()
df['country_encoded'] = encoder.fit_transform(df['country'], df['target'])

# Hashing (very high cardinality)
# ✅ CORRECT for user_id, product_id
hasher = FeatureHasher(n_features=1000, input_type='string')
hashed = hasher.transform([['user_123'], ['user_456']])  # sparse matrix
```

Feature Interactions

┌─────────────────────────────────────────────────────────────────┐
│              FEATURE INTERACTION PATTERNS                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Arithmetic Interactions:                                       │
│  • price_per_sqft = price / square_feet                         │
│  • total_spend = quantity × unit_price                          │
│  • profit_margin = (revenue - cost) / revenue                   │
│                                                                 │
│  Polynomial Features:                                           │
│  • age² (quadratic relationship)                                │
│  • age × income (interaction term)                              │
│                                                                 │
│  Domain-Specific:                                               │
│  • BMI = weight / height²                                       │
│  • CTR = clicks / impressions                                   │
│  • recency × frequency (RFM score)                              │
│                                                                 │
│  ⚠️ Warning: More features ≠ better model                       │
│  → Feature selection is equally important                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
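The arithmetic and polynomial interactions above take a few lines with pandas and scikit-learn; a minimal sketch on hypothetical listing data (column names and values are made up):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical listing data for illustration
df = pd.DataFrame({
    "price": [300000.0, 450000.0],
    "square_feet": [1000.0, 1500.0],
})

# Arithmetic interaction: ratio feature from the diagram above
df["price_per_sqft"] = df["price"] / df["square_feet"]

# Polynomial features: squares and pairwise products of the inputs,
# without the constant bias column
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["price", "square_feet"]])
# Output columns: price, square_feet, price², price×square_feet, square_feet²
```

For 2 inputs at degree 2 this already yields 5 columns, which is why the warning in the diagram applies: interaction counts grow combinatorially, so pair this with feature selection.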

Feature Pipeline Design

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│              FEATURE PIPELINE ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │ Raw Data    │───►│ Feature     │───►│ Feature     │          │
│  │ Sources     │    │ Engineering │    │ Store       │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│        │                   │                  │                 │
│        │                   │                  │                 │
│        ▼                   ▼                  ▼                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │ • Events    │    │ • Transform │    │ • Offline   │          │
│  │ • Logs      │    │ • Aggregate │    │   (batch)   │          │
│  │ • Tables    │    │ • Join      │    │ • Online    │          │
│  │ • APIs      │    │ • Validate  │    │   (real-time)│         │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│                                               │                 │
│                          ┌────────────────────┴────────────┐    │
│                          │                                 │    │
│                          ▼                                 ▼    │
│                   ┌─────────────┐                  ┌─────────────┐
│                   │  Training   │                  │  Serving    │
│                   │  Pipeline   │                  │  Pipeline   │
│                   └─────────────┘                  └─────────────┘
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Sklearn Pipeline Example

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Define feature groups
numeric_features = ['age', 'income', 'tenure']
categorical_features = ['gender', 'region', 'plan_type']

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Full pipeline with model
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
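Such a pipeline can be exercised end to end; below is a self-contained sketch on a tiny synthetic frame (the data and labels are made up for illustration). Fitting the pipeline fits imputers, scalers, and encoders on training data only, so the exact same transformations are replayed at predict time:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame mirroring the feature groups above
df = pd.DataFrame({
    "age": [25, 40, np.nan, 31],
    "income": [50000.0, 80000.0, 60000.0, np.nan],
    "tenure": [1, 5, 3, 2],
    "gender": ["F", "M", None, "F"],
    "region": ["north", "south", "north", "east"],
    "plan_type": ["basic", "pro", "basic", "pro"],
})
y = pd.Series([0, 1, 0, 1])

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]),
     ["age", "income", "tenure"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant",
                                                fill_value="missing")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]),
     ["gender", "region", "plan_type"]),
])

full_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])

# One fit call trains every step; one predict call replays them in order
full_pipeline.fit(df, y)
preds = full_pipeline.predict(df)
```

Because the fitted pipeline is a single serializable object, shipping it to serving is one way to guarantee the training/serving parity discussed in the next section.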

Offline/Online Parity

The Training-Serving Skew Problem

🚨 Training-Serving Skew

One of the hardest bugs to debug in ML: the model works well offline but fails in production because features are computed differently.

┌─────────────────────────────────────────────────────────────────┐
│              TRAINING-SERVING SKEW SOURCES                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. DATA SKEW                                                   │
│     Training data ≠ Production data distribution                │
│     Example: Training on historical data, serving on new users  │
│                                                                 │
│  2. FEATURE SKEW                                                │
│     Feature computation differs between training and serving    │
│     Example: Different aggregation windows                      │
│                                                                 │
│  3. LABEL SKEW                                                  │
│     Label definition changes over time                          │
│     Example: "Churn" definition updated                         │
│                                                                 │
│  4. TEMPORAL SKEW                                               │
│     Time-dependent features computed incorrectly                │
│     Example: Using future data in training                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Ensuring Parity

| Strategy | Description | Implementation |
|---|---|---|
| Single codebase | Same code for training/serving | Shared feature library |
| Feature store | Centralized feature management | Feast, Tecton, Vertex AI |
| Point-in-time joins | Correct temporal joins | Avoid future leakage |
| Feature validation | Compare distributions | Great Expectations |

Feature Stores

What is a Feature Store?

┌─────────────────────────────────────────────────────────────────┐
│                    FEATURE STORE ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    FEATURE STORE                         │    │
│  ├─────────────────────────────────────────────────────────┤    │
│  │                                                         │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │    │
│  │  │ Feature     │  │ Offline     │  │ Online      │      │    │
│  │  │ Registry    │  │ Store       │  │ Store       │      │    │
│  │  ├─────────────┤  ├─────────────┤  ├─────────────┤      │    │
│  │  │ • Metadata  │  │ • Historical│  │ • Latest    │      │    │
│  │  │ • Lineage   │  │ • Batch     │  │ • Low latency│     │    │
│  │  │ • Versioning│  │ • Training  │  │ • Serving   │      │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘      │    │
│  │                                                         │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  Benefits:                                                      │
│  • Feature reuse across teams                                   │
│  • Consistent training/serving                                  │
│  • Point-in-time correctness                                    │
│  • Feature discovery and documentation                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Feature Store Options

| Tool | Type | Best For |
|---|---|---|
| Feast | Open source | Flexibility, self-hosted |
| Tecton | Managed | Enterprise, real-time |
| Vertex AI Feature Store | GCP managed | GCP ecosystem |
| SageMaker Feature Store | AWS managed | AWS ecosystem |
| Databricks Feature Store | Databricks | Spark-based workflows |

Feast Example

```python
from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

# Define entity (the join key used for feature lookups)
user = Entity(
    name="user_id",
    join_keys=["user_id"],
    description="User identifier",
)

# Define feature view
# user_source (a batch data source, e.g. a FileSource) is assumed to be
# defined elsewhere in the feature repo
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="total_purchases", dtype=Float32),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,
    source=user_source,
)

store = FeatureStore(repo_path=".")

# Get features for training (point-in-time correct join;
# entity_df holds user_ids plus event timestamps)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_features:total_purchases", "user_features:avg_order_value"],
).to_df()

# Get features for serving (low-latency lookup from the online store)
online_features = store.get_online_features(
    features=["user_features:total_purchases"],
    entity_rows=[{"user_id": 12345}],
).to_dict()
```

Feature Selection

Selection Methods

┌─────────────────────────────────────────────────────────────────┐
│              FEATURE SELECTION METHODS                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. FILTER METHODS (Fast, model-agnostic)                       │
│     • Correlation with target                                   │
│     • Mutual information                                        │
│     • Chi-squared test                                          │
│     • Variance threshold                                        │
│                                                                 │
│  2. WRAPPER METHODS (Accurate, expensive)                       │
│     • Forward selection                                         │
│     • Backward elimination                                      │
│     • Recursive feature elimination (RFE)                       │
│                                                                 │
│  3. EMBEDDED METHODS (Balance)                                  │
│     • L1 regularization (Lasso)                                 │
│     • Tree-based importance                                     │
│     • Permutation importance                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
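Each family in the diagram has a scikit-learn counterpart; a minimal sketch on synthetic data (dataset shape, `k=5`, and the regularization strength are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of which are informative
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# Filter: keep the 5 features with the highest mutual information
# with the target (fast, model-agnostic)
X_filter = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

# Wrapper: recursive feature elimination repeatedly drops the weakest
# feature according to a fitted model (accurate, expensive)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: L1 regularization drives weak coefficients to exactly zero
# as a side effect of training
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
```

`rfe.support_` is a boolean mask of the surviving features, and the zeroed entries of `lasso.coef_` mark the features the L1 penalty discarded.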

Feature Importance Analysis

```python
import pandas as pd
import shap
from sklearn.inspection import permutation_importance

# model, X_test, y_test, feature_names come from the training pipeline above

# SHAP values for feature importance (TreeExplainer works for tree models)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Permutation importance: the drop in score when a feature is shuffled
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean
}).sort_values('importance', ascending=False)
```

Best Practices

Feature Engineering Checklist

| Step | Action | Why |
|---|---|---|
| 1 | Understand domain | Features should make business sense |
| 2 | Explore data | Find patterns, outliers, missing values |
| 3 | Handle missing values | Imputation strategy matters |
| 4 | Encode categoricals | Choose appropriate method |
| 5 | Scale numericals | Required for many algorithms |
| 6 | Create interactions | Capture non-linear relationships |
| 7 | Select features | Remove noise, reduce overfitting |
| 8 | Validate pipeline | Ensure reproducibility |

Common Pitfalls

⚠️ Feature Engineering Anti-Patterns

  • Data leakage: Using future information
  • Target leakage: Features derived from target
  • Overfitting: Too many features for data size
  • Inconsistency: Different logic in training/serving
  • Missing documentation: Features without explanation

📎 Cross-References