Practice: Feature Engineering

🎯 Objectives

After completing this exercise, you will be able to:

Create new features from raw data
Handle missing values and encode categorical features
Recognize feature leakage and know how to prevent it

Exercise description

You are given a dataset of e-commerce transactions. Your task is to build a feature pipeline for a model that predicts churn (customers leaving).

Requirements

Exercise 1: Handling Missing Values

You are given a DataFrame whose columns contain missing values. Write a function that handles them appropriately for each type of feature.

python

import pandas as pd
import numpy as np

data = {
    'age': [25, np.nan, 35, 40, np.nan, 28],
    'income': [50000, 60000, np.nan, 80000, 45000, np.nan],
    'city': ['HCM', 'HN', None, 'HCM', 'DN', 'HN'],
    'purchase_count': [5, 12, 8, np.nan, 3, 15]
}
df = pd.DataFrame(data)

def handle_missing(df):
    """Handle missing values appropriately for each column."""
    # TODO: Implement
    pass

Exercise 2: Feature Creation from Timestamps

Create time-based features from the transaction timestamp column.

python

transactions = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'timestamp': pd.to_datetime([
        '2024-01-15 09:30', '2024-01-20 14:00', '2024-02-01 22:15',
        '2024-01-10 08:00', '2024-03-01 16:30', '2024-01-25 11:45'
    ]),
    'amount': [100, 250, 80, 500, 120, 300]
})

def create_time_features(df):
    """Create features: hour_of_day, day_of_week, is_weekend, days_since_first."""
    # TODO: Implement
    pass

Exercise 3: Categorical Encoding

Choose an appropriate encoding strategy for each type of categorical feature.

python

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],       # Nominal
    'size': ['S', 'M', 'L', 'XL', 'M'],                     # Ordinal
    'city': ['HCM', 'HN', 'DN', 'HP', 'CT'],                # High cardinality
})

def encode_categoricals(df):
    """Apply an appropriate encoding to each type of categorical feature."""
    # TODO: One-hot for nominal, ordinal encoding for ordinal,
    # target/frequency encoding for high-cardinality
    pass

Hints

Exercise 1: Use the median for numerical columns (it is robust to outliers) and the mode for categorical columns. Consider adding an indicator column is_missing_X.
Exercise 2: Use the pandas dt accessor. For days_since_first, group by user_id, take the minimum timestamp per user, and subtract it from each transaction's timestamp.
Exercise 3: Use pd.get_dummies() for nominal features and a mapping dict for ordinal features; for high-cardinality features, consider frequency encoding.
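
The indicator-column idea mentioned for Exercise 1 can be sketched as follows. This is a minimal illustration rather than part of the reference solution; the helper name `add_missing_indicators` and the `is_missing_` column prefix are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd


def add_missing_indicators(df, columns):
    """Return a copy of df with one 0/1 is_missing_<col> column per listed column.

    Hypothetical helper: call it before imputing, otherwise the
    information about which rows were missing is already lost.
    """
    out = df.copy()
    for col in columns:
        # isna() treats both np.nan and None as missing
        out[f"is_missing_{col}"] = out[col].isna().astype(int)
    return out


sample = pd.DataFrame({
    "age": [25, np.nan, 35],
    "city": ["HCM", None, "HN"],
})
flagged = add_missing_indicators(sample, ["age", "city"])
print(flagged)
```

Keeping the flags as separate columns lets a model learn from the fact that a value was missing, independently of whatever value is later filled in.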

Solution

python

# Exercise 1
def handle_missing(df):
    df = df.copy()
    df['age'] = df['age'].fillna(df['age'].median())
    df['income'] = df['income'].fillna(df['income'].median())
    df['city'] = df['city'].fillna(df['city'].mode()[0])
    df['purchase_count'] = df['purchase_count'].fillna(df['purchase_count'].median())
    return df

# Exercise 2
def create_time_features(df):
    df = df.copy()
    df['hour_of_day'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    first_purchase = df.groupby('user_id')['timestamp'].transform('min')
    df['days_since_first'] = (df['timestamp'] - first_purchase).dt.days
    return df

# Exercise 3
def encode_categoricals(df):
    df = df.copy()
    # Nominal: one-hot
    df = pd.get_dummies(df, columns=['color'], prefix='color')
    # Ordinal: mapping
    size_map = {'S': 1, 'M': 2, 'L': 3, 'XL': 4}
    df['size_encoded'] = df['size'].map(size_map)
    # High cardinality: frequency encoding
    freq = df['city'].value_counts(normalize=True)
    df['city_freq'] = df['city'].map(freq)
    return df.drop(columns=['size', 'city'])
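
The functions above compute medians and category frequencies on the whole DataFrame. That works for toy data, but once the data is split into training and test sets, statistics computed over all rows let information from the test set leak into the features. One possible way to avoid this is to learn the statistics on the training rows only and reuse them everywhere else. The sketch below assumes two hypothetical helpers, `fit_feature_stats` and `apply_feature_stats`, which are not part of the exercise, and it assumes that unseen categories should get a frequency of 0.0.

```python
import numpy as np
import pandas as pd


def fit_feature_stats(train, num_cols, cat_col):
    """Learn imputation medians and category frequencies from training rows only."""
    return {
        "medians": train[num_cols].median(),
        "freq": train[cat_col].value_counts(normalize=True),
    }


def apply_feature_stats(df, stats, num_cols, cat_col):
    """Apply previously learned statistics without recomputing them on df."""
    out = df.copy()
    # fillna with a Series fills each column using the value under its label
    out[num_cols] = out[num_cols].fillna(stats["medians"])
    # Categories never seen in training get frequency 0.0 (an assumption of this sketch)
    out[f"{cat_col}_freq"] = out[cat_col].map(stats["freq"]).fillna(0.0)
    return out


train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "city": ["HCM", "HCM", "HN", "DN"],
})
test = pd.DataFrame({
    "age": [np.nan, 100.0],
    "city": ["HN", "HP"],
})

stats = fit_feature_stats(train, ["age"], "city")
train_feat = apply_feature_stats(train, stats, ["age"], "city")
test_feat = apply_feature_stats(test, stats, ["age"], "city")
print(test_feat)
```

Here the missing test age is filled with the training median, so the test row with age 100 has no influence on it, and the city that never appears in training falls back to the default frequency.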
