EDA: Explore Data Like an Engineer, Not a Painter

🎯 Objectives

After this lesson, you will:

Treat EDA as a systematic quality assurance process, not as drawing charts to look nice
Master a 6-step checklist to validate any dataset before modeling
Distinguish missing data patterns (MCAR / MAR / MNAR) and choose the right imputation strategy
Detect data leakage, outliers, and distribution drift before they break the model

EDA ≠ Pretty Charts

Many tutorials teach EDA as: "import seaborn, draw a heatmap, done". That is not EDA; that is data decoration. Real EDA is quality assurance for data. Just as you would not deploy code without running the tests, you should not train a model without running EDA.

When you skip EDA

Consequence	Example
Data leakage	A feature contains future information → the model "cheats"
Garbage model	Training on data with 40% duplicates → the model memorizes
Silent failures	The distribution shifts after 3 months and the model is wrong without anyone noticing

EDA Engineering Checklist: 6 Steps

Step 1: Schema & Types       → Are the dtypes correct?
Step 2: Missing Data         → What is missing? Why?
Step 3: Distributions        → Plausible distributions? Outliers?
Step 4: Target Variable      → Clean target? Imbalanced?
Step 5: Correlations         → Multicollinearity? Leakage?
Step 6: Temporal & Drift     → Stable over time?

Real Scenario: Duplicate Users Break a Recommendation Model

An e-commerce team trains a recommendation model on 2M records. Precision@10 = 0.85 on the test set.

Production: precision drops to 0.31.

Root cause: 40% of users were duplicated (same email, different user_id because of a registration bug). The same person appeared in both train and test, so the model "memorized" their preferences and the test score was inflated. Comparing df['email'].nunique() with df['user_id'].nunique() during EDA would have exposed the problem immediately.

Lesson: 30 minutes of EDA saves 3 weeks of production debugging.
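
The identity check from the scenario can be sketched as follows. The column names `email` and `user_id` come from the scenario; the toy DataFrame and the helper name `find_shared_emails` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one person registered twice under the same email.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "email": ["an@example.com", "binh@example.com", "an@example.com", "chi@example.com"],
})

def find_shared_emails(df: pd.DataFrame) -> pd.Series:
    """Return emails that map to more than one user_id, with the number of ids."""
    ids_per_email = df.groupby("email")["user_id"].nunique()
    return ids_per_email[ids_per_email > 1]

shared = find_shared_emails(users)
print(f"user_ids: {users['user_id'].nunique()} | emails: {users['email'].nunique()}")
print(shared)
```

If the two unique counts differ, the same person is likely present under several ids, and a random train/test split would put them on both sides.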

Step 1: Schema & Types

The simplest step, and the one skipped most often. Wrong data types → every calculation downstream is wrong.

python

import pandas as pd

df = pd.read_csv('orders.csv')
df.info()       # Overview: rows, columns, dtypes, non-null counts
df.head(10)     # Always look at the first few rows

Common traps: numbers stored as strings ("$12.99"), dates stored as object, categoricals stored as int (gender: 0, 1), IDs stored as int (the mean of user_id is meaningless).

Example: Order table with price stored as a string

python

# ❌ What the raw data looks like
# 0    $12.99
# 1    $5.50
# 2    N/A
# 3    -$1.50    ← refund? or an error?

# ✅ Cleaning pipeline
df['price_clean'] = pd.to_numeric(
    df['price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False),
    errors='coerce',   # 'N/A' and other unparseable values become NaN
)

negative_prices = df[df['price_clean'] < 0]
print(f"{len(negative_prices)} negative values: verify with the business team")

Automated Schema Validation

Use Pandera to define a schema and validate every time new data arrives:

python

import pandera as pa
schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, nullable=False, unique=True),
    "price": pa.Column(float, pa.Check.ge(0)),
    "created_at": pa.Column("datetime64[ns]"),
})
schema.validate(df)  # Raises an error if the data violates the schema

Step 2: Missing Data Patterns

Missing data is more than counting NaN. The pattern decides how you handle it.

python

# Missing rate in percent (more informative than absolute counts)
print((df.isnull().mean() * 100).round(2))

MCAR vs MAR vs MNAR

Type	Meaning	Example	Handling
MCAR	Missing completely at random	A sensor fails at random	Drop or use simple imputation
MAR	Missingness depends on other observed features	Income is missing more often among younger users	Impute using the related features
MNAR	Missingness depends on the missing value itself	Wealthy people do not report their income	Most dangerous: wrong imputation introduces bias

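Look at the pattern, not just the counts

user_id  age  income  purchase
  ✓       ✓     ✗       ✓         ← income missing (common for young users)
  ✓       ✓     ✓       ✓
  ✓       ✗     ✗       ✓         ← age AND income missing together (same users?)
  ✓       ✓     ✓       ✓
  ✓       ✗     ✗       ✓         ← age + income missing together again

age and income are often missing together → they probably share a source (for example, a signup form that lets users skip personal info). Missingness structured like this is unlikely to be MCAR; if it can be explained by an observed attribute such as the signup channel, it points to MAR.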

python

both_missing = df[['age', 'income']].isnull().all(axis=1).sum()
only_age = df['age'].isnull().sum() - both_missing
only_income = df['income'].isnull().sum() - both_missing
print(f"Both: {both_missing} | Only age: {only_age} | Only income: {only_income}")
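
One way to probe whether missingness depends on an observed feature is to compare the missing rate across groups of that feature. The sketch below uses synthetic data in which income is deliberately hidden more often for users under 25; the data, the age bands, and the name `df_demo` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing far more often for users under 25.
rng = np.random.default_rng(0)
n = 2000
age = rng.integers(18, 60, size=n)            # ages 18..59
income = rng.lognormal(10.5, 0.8, size=n)
p_missing = np.where(age < 25, 0.6, 0.1)      # missingness depends on age
income = np.where(rng.random(n) < p_missing, np.nan, income)
df_demo = pd.DataFrame({"age": age, "income": income})

# Missing rate of income within each age band.
age_band = pd.cut(df_demo["age"], bins=[17, 24, 39, 59],
                  labels=["18-24", "25-39", "40-59"])
missing_by_band = df_demo["income"].isnull().groupby(age_band, observed=True).mean()
print(missing_by_band.round(3))
```

A clear gap between groups argues against MCAR and is consistent with MAR. It cannot rule out MNAR, because that would require knowing the values that are missing.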

DROP vs IMPUTE Decision Tree

Situation	Action
> 50% missing	Drop column
Pattern = MNAR	Flag as feature (`is_missing_income = 1`)
< 5% missing & MCAR	Drop rows
MAR	Impute based on relationships (KNN, MICE)
Numerical, MCAR	Mean / Median
Time-series	Forward-fill
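
Two rows of the table can be combined in a short sketch: keep a missing-value flag, then impute from a related feature. Group-median imputation is used here as a simple stand-in for the KNN/MICE approaches named in the table; the toy data and the column `income_imputed` are hypothetical.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "city": ["Hanoi", "Hanoi", "Hanoi", "HCMC", "HCMC", "HCMC"],
    "income": [40_000, np.nan, 60_000, 90_000, 110_000, np.nan],
})

# 1) Keep the missingness signal as its own feature before filling anything.
df_demo["is_missing_income"] = df_demo["income"].isnull().astype(int)

# 2) Impute from a related feature: median income within the same city.
city_median = df_demo.groupby("city")["income"].transform("median")
df_demo["income_imputed"] = df_demo["income"].fillna(city_median)
print(df_demo)
```

In a real pipeline the medians would be computed on the training split only and then reused for validation and test data, otherwise the imputation itself leaks information.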

WARNING

Never impute the target variable. If the target is missing → drop that row. Imputing the target means inventing labels for the model to learn.

Step 3: Distribution Profiling

Numerical: `df.describe()` is the first tool to reach for

python

#          price     age    income
# count   10000    9500      7800      ← differing counts = missing values
# mean    45.20    34.5    120000      ← mean vs median = skewness
# 50%     29.00    33.0     50000      ← median
# max   9999.99   120.0   5000000      ← suspicious max?

Analysis: income has mean=120k but median=50k → strongly right-skewed. age max=120 → plausible or an error? price max=9999.99 → a real price or a placeholder?

Categorical: Don't forget them

python

for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(5))

Cardinality Check

< 20 unique: Good fit for one-hot encoding
20-100: Consider target encoding
> 100: One-hot creates a huge sparse matrix; use a different strategy
Close to the number of rows: Probably an ID column; do not use it as a feature
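
The thresholds above can be turned into a small report. This is a sketch only: the helper name `cardinality_report` and the 0.9 unique-ratio cutoff for "close to the number of rows" are assumptions, not fixed rules.

```python
import pandas as pd

def cardinality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize unique counts for non-numeric columns and suggest a next step."""
    rows = []
    for col in df.select_dtypes(exclude="number").columns:
        n_unique = df[col].nunique()
        ratio = n_unique / len(df)
        if ratio > 0.9:                      # assumed cutoff for "almost one value per row"
            hint = "likely ID, do not use as feature"
        elif n_unique < 20:
            hint = "one-hot"
        elif n_unique <= 100:
            hint = "consider target encoding"
        else:
            hint = "needs another strategy"
        rows.append({"column": col, "n_unique": n_unique,
                     "unique_ratio": round(ratio, 3), "hint": hint})
    return pd.DataFrame(rows)

demo = pd.DataFrame({
    "order_code": [f"ORD-{i:04d}" for i in range(200)],   # unique per row
    "city": ["Hanoi", "HCMC", "Danang", "Other"] * 50,     # 4 distinct values
})
report = cardinality_report(demo)
print(report)
```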

Detecting Outliers

python

import numpy as np

def detect_outliers_iqr(series, multiplier=1.5):
    """IQR method: robust to non-normal distributions."""
    Q1, Q3 = series.quantile(0.25), series.quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - multiplier * IQR, Q3 + multiplier * IQR
    return series[(series < lower) | (series > upper)], lower, upper

outliers, low, high = detect_outliers_iqr(df['income'].dropna())
print(f"Bounds: [{low:,.0f}, {high:,.0f}] | Outliers: {len(outliers)}")

Salary with a Billionaire Outlier

Median $50k, Mean $120k, Max $50M → there is a billionaire in the dataset.

Mean imputation → everyone with a missing salary becomes "almost rich". Standard scaling → the mean and std are dominated by the outlier, so all other values are squeezed close to 0. Solution: log-transform or a robust scaler (based on median + IQR).
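
The effect of a single extreme value on each scaling method can be seen on synthetic numbers. The salary figures below are invented to mirror the example; only NumPy is used, so the formulas for standard and robust scaling are written out by hand.

```python
import numpy as np

# Hypothetical salaries: 999 people around 50k plus one at 50M.
rng = np.random.default_rng(42)
salary = np.append(rng.normal(50_000, 10_000, 999).clip(min=10_000), 50_000_000)

# Standard scaling: (x - mean) / std, both dominated by the outlier.
z = (salary - salary.mean()) / salary.std()

# Robust scaling: (x - median) / IQR, barely affected by one extreme value.
q1, med, q3 = np.percentile(salary, [25, 50, 75])
robust = (salary - med) / (q3 - q1)

# Log transform compresses the long right tail.
logged = np.log1p(salary)

typical = slice(0, 999)  # everyone except the outlier
print(f"z-score spread of typical users: {z[typical].std():.4f}")
print(f"robust spread of typical users:  {robust[typical].std():.4f}")
print(f"log range: {logged.min():.2f} to {logged.max():.2f}")
```

After standard scaling the typical users are almost indistinguishable from one another, while robust scaling keeps their spread on a usable scale.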

Step 4: Target Variable Analysis

Class Distribution (Classification)

python

print(df['target'].value_counts(normalize=True))
#  0    0.95  (95%)  ← Negative
#  1    0.05  ( 5%)  ← Positive — IMBALANCED!

A model that always predicts class 0 still reaches 95% accuracy. This is why accuracy alone is misleading on imbalanced data. Details are covered in Lesson 06: Imbalanced Data.
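
The 95% figure can be reproduced with a trivial baseline. The labels below are constructed to match the 95/5 split shown above; recall is computed by hand to show what accuracy hides.

```python
import numpy as np

# 95 negatives and 5 positives, matching the class distribution above.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)  # a "model" that always predicts class 0

accuracy = (y_pred == y_true).mean()
true_pos = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_pos / (y_true == 1).sum()
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# → accuracy=0.95, recall=0.00
```

The baseline never finds a single positive case, which is exactly the class the model usually exists to detect.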

Target Leakage Detection

Target leakage = a feature carries information derived from the target that is not available at prediction time.

python

# Find numeric features with suspiciously high correlation to the target
corr_with_target = (
    df.corr(numeric_only=True)['target']
    .drop('target')                      # the target correlates 1.0 with itself
    .abs()
    .sort_values(ascending=False)
)
suspicious = corr_with_target[corr_with_target > 0.95].index.tolist()
if suspicious:
    print(f"⚠️  POSSIBLE LEAKAGE: {suspicious}")

Example: predicting loan default with a feature loan_status whose values are 'paid', 'defaulted' → this is the target in another form. Because loan_status is categorical, a numeric correlation check will not flag it; it has to be caught by reviewing each feature.

WARNING

Golden rule: for every feature, ask: "At prediction time, does this information already exist?"

Target leakage is often subtle: when predicting whether a patient needs the ICU, the feature "number of days in ICU" is only known after the patient has been admitted.
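
For categorical features such as loan_status, one rough heuristic is to look for columns whose categories split the target perfectly. The helper `perfectly_separating_columns` and the toy loan table are hypothetical; the check is a screening aid under these assumptions, not a complete leakage detector.

```python
import pandas as pd

def perfectly_separating_columns(df: pd.DataFrame, target: str) -> list:
    """Return non-numeric columns where every category has a target rate of exactly 0 or 1."""
    flagged = []
    for col in df.select_dtypes(exclude="number").columns:
        rate = df.groupby(col)[target].mean()
        if len(rate) > 1 and rate.isin([0.0, 1.0]).all():
            flagged.append(col)
    return flagged

loans = pd.DataFrame({
    "loan_status": ["paid", "defaulted", "paid", "defaulted", "paid", "paid"],
    "region": ["north", "north", "south", "south", "north", "south"],
    "defaulted": [0, 1, 0, 1, 0, 0],
})
print(perfectly_separating_columns(loans, target="defaulted"))
# → ['loan_status']
```

ID-like columns with one row per category are flagged as well, so every hit still needs the "does this exist at prediction time?" question.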

Step 5: Correlation & Relationships

Correlation Matrix

python

corr_matrix = df.select_dtypes(include='number').corr()

# Find highly correlated pairs
high_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.7:
            high_corr.append((
                corr_matrix.columns[i], corr_matrix.columns[j],
                round(corr_matrix.iloc[i, j], 3)
            ))
for f1, f2, r in sorted(high_corr, key=lambda x: abs(x[2]), reverse=True):
    print(f"  {f1} ↔ {f2}: {r}")

Correlation ≠ Causation

📈 Ice cream sales ←─correlate─→ 📈 Drowning
           \                        /
            └──── 🌞 Summer ──────┘    (confounding variable)

Two variables can correlate strongly without influencing each other. Always look for a confounding variable.

Multicollinearity: VIF Check

Two features with correlation > 0.9 carry nearly the same information → coefficients become unstable and feature importance is split between them.

python

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(df_numeric):
    """VIF > 5: worth a look. VIF > 10: severe multicollinearity."""
    # Drop NaN rows once and add an intercept, so non-zero means do not inflate VIF
    X = sm.add_constant(df_numeric.dropna(), has_constant='add')
    vif = pd.DataFrame({
        'Feature': df_numeric.columns,
        'VIF': [variance_inflation_factor(X.values, i)
                for i in range(1, X.shape[1])],  # index 0 is the constant
    })
    return vif.sort_values('VIF', ascending=False)

Solutions for Multicollinearity

Drop one of the two features (keep the one with the clearer business meaning)
PCA: Combine correlated features into new components
L1 Regularization (Lasso): Tends to zero out redundant features automatically
Domain knowledge: Create a new feature (avg_order_value = total_spend / num_orders)

Step 6: Temporal & Data Drift

Train/test split MUST respect time

python

from sklearn.model_selection import train_test_split

# ❌ Random split mixes past and future
X_train, X_test = train_test_split(X, test_size=0.2)

# ✅ Split by time
df_sorted = df.sort_values('timestamp')
split_idx = int(len(df_sorted) * 0.8)
train, test = df_sorted.iloc[:split_idx], df_sorted.iloc[split_idx:]

Feature Distribution Drift

python

def compare_distributions(df, feature, time_col, p1, p2):
    """Compare a feature's distribution between two time periods."""
    d1 = df[df[time_col].between(*p1)][feature].describe()
    d2 = df[df[time_col].between(*p2)][feature].describe()
    drift = ((d2 - d1) / d1 * 100).round(1)
    return pd.DataFrame({'Period1': d1, 'Period2': d2, 'Drift%': drift})

DANGER

A mean or std drift > 20% between time windows is a signal of data drift (a rule of thumb, so tune the threshold per feature). A model trained on old data will degrade. Retrain periodically and set up monitoring alerts.
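
A usage sketch for compare_distributions, applying the 20% rule to synthetic data. The function is repeated so the block runs on its own; the order data, the 30% price increase in February, and the column names are invented for illustration.

```python
import numpy as np
import pandas as pd

def compare_distributions(df, feature, time_col, p1, p2):
    """Compare a feature's distribution between two time periods."""
    d1 = df[df[time_col].between(*p1)][feature].describe()
    d2 = df[df[time_col].between(*p2)][feature].describe()
    drift = ((d2 - d1) / d1 * 100).round(1)
    return pd.DataFrame({'Period1': d1, 'Period2': d2, 'Drift%': drift})

# Hypothetical order values: prices rise by about 30% in February.
rng = np.random.default_rng(7)
dates = pd.date_range("2024-01-01", "2024-02-29", freq="D")
ts = dates[rng.integers(0, len(dates), size=4000)]
price = rng.normal(100, 20, size=4000)
price = np.where(ts.month == 2, price * 1.3, price)
orders = pd.DataFrame({"timestamp": ts, "price": price})

result = compare_distributions(
    orders, "price", "timestamp",
    p1=(pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-31")),
    p2=(pd.Timestamp("2024-02-01"), pd.Timestamp("2024-02-29")),
)
mean_drift = result.loc["mean", "Drift%"]
print(result)
print("ALERT" if abs(mean_drift) > 20 else "ok", f"(mean drift {mean_drift}%)")
```

The count row also appears in the Drift% column; a large change there points to a volume shift rather than a change in the feature's values.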

Tools: ydata-profiling

One-line Report

python

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="EDA Report", explorative=True)
profile.to_notebook_iframe()   # Inside a notebook
profile.to_file("report.html") # Export HTML

It automatically surfaces things that manual EDA easily misses: high correlation, constant columns, duplicate rows, text length distribution, missing data patterns.

WARNING

Don't use it as a crutch. An automated report gives you facts, not insight. You still have to:

Detect target leakage (the tool does not know which feature is the target)
Check temporal patterns (by default the tool does not reason about time ordering)
Decide on an imputation strategy based on MCAR/MAR/MNAR

Tools produce facts. Engineers produce insights.

🔥 GPU Paragraph

EDA runs on the CPU and usually on sampled data, which is fine for most projects. With production datasets (100M+ rows), pandas becomes slow and memory-hungry; consider: Polars (Rust-based, often several times faster than pandas depending on the workload), Dask (distributed, handles data larger than RAM), RAPIDS cuDF (GPU-accelerated), or Apache Spark (enterprise-scale).

Mindset: the EDA mindset stays the same whatever the tool: still the same 6 steps. Only the tool changes with scale.

🧠 Common Beginner Misconception

"EDA is optional for experienced people" or "I only need to run AutoML"

Without EDA, you do not know whether the data is valid. AutoML on garbage data produces a garbage model with high confidence, the most dangerous kind of model.

A model that is wrong but 98% confident will pass code review (nice metrics), get deployed to production (high test performance), and cause real damage (nobody suspects it), until someone discovers that the feature days_since_churn leaks the target is_churned.

The most dangerous model is one that looks good on test data but was trained on leaked features. The more experience you have, the faster EDA gets, but it never drops to zero.

⚡ Fast Exercise

Look at the output below and identify 3 data quality issues:

python

df.info()
#  Column      Non-Null Count  Dtype
#  user_id     50000 non-null  int64
#  age         48500 non-null  float64
#  salary      32000 non-null  object    ← ???
#  signup_date 50000 non-null  object    ← ???
#  is_premium  50000 non-null  int64

df.describe()
#           user_id        age   is_premium
# mean    25000.50      34.20         0.05
# std     14433.90      85.30         0.22
# min         1.00       0.00         0.00
# max     50000.00     999.00         1.00

Answers

Issue 1, wrong dtype: salary is object (string) instead of float64. It needs cleaning and conversion.

Issue 2, missing data: salary has 32000/50000 non-null → 36% missing. Analyze the pattern: MCAR or MAR?

Issue 3, outlier/invalid values: age max = 999, std = 85.3 (abnormal). The value 999 is almost certainly a placeholder for missing and should be replaced with NaN.

Bonus: signup_date is object instead of datetime64. is_premium is only 5% positive → imbalanced if used as the target.

🪤 Gotcha

python

# ❌ df.describe() only shows numeric columns by default
print(df.describe())
# → 'city', 'salary' (object), and 'signup_date' do NOT appear

# ✅ Pass include='all' to see ALL columns
print(df.describe(include='all'))
# Now visible: city has 150 unique values, top = 'Hanoi' (8200 entries)

df.describe() also skips NaN when computing statistics. Differing count values across columns are a sign of missing data.

📊 Performance Note

Profiling Large Data

With a DataFrame > 1M rows, avoid profiling everything at once: the notebook can run out of memory.

python

from ydata_profiling import ProfileReport

# Profile a sample and check that it is representative
sample = df.sample(n=50_000, random_state=42)
print("Full target rate:", df['target'].mean())
print("Sample target rate:", sample['target'].mean())

# ydata-profiling minimal mode for large data
ProfileReport(sample, minimal=True)

Rule of thumb: 50,000 rows are enough for most EDA, though rare categories and extreme values can be missed in a sample. Profile on the sample, validate edge cases on the full data.
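
When the target is rare, a plain random sample can drift from the full target rate by chance. One option is to sample the same fraction from each class; the sketch below assumes a binary `target` column and uses a synthetic 200,000-row frame named `big`.

```python
import numpy as np
import pandas as pd

# Hypothetical large dataset with a 2% positive rate.
rng = np.random.default_rng(1)
big = pd.DataFrame({
    "target": (rng.random(200_000) < 0.02).astype(int),
    "x": rng.normal(size=200_000),
})

# Sample the same fraction from each class so the rare class keeps its share.
frac = 50_000 / len(big)
strat = big.groupby("target", group_keys=False).sample(frac=frac, random_state=42)

print("Full target rate:  ", round(big["target"].mean(), 4))
print("Sample target rate:", round(strat["target"].mean(), 4))
```

Stratifying on the target keeps the class balance stable, but it does nothing for rare values in other columns, so edge cases still need a pass over the full data.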

🚫 Production Anti-pattern

Run EDA once, then forget it

A data scientist runs EDA at the start of the project and deploys the model. 6 months later the data distribution has changed, the model is wrong, and nobody knows.

Solution: Automated Data Quality Checks

python

import pandas as pd

# Great Expectations: define expectations, validate continuously
# (fluent API of the pre-1.0 releases; entry points differ in other versions)
import great_expectations as gx
context = gx.get_context()
validator = context.sources.pandas_default.read_csv("orders.csv")
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("price", min_value=0, max_value=10000)

# Pandera: schema validation in code
import pandera as pa
class OrderSchema(pa.DataFrameModel):
    user_id: int = pa.Field(nullable=False)
    price: float = pa.Field(ge=0, le=10000)
    status: str = pa.Field(isin=["pending", "shipped", "delivered"])

@pa.check_types
def process_orders(df: pa.typing.DataFrame[OrderSchema]) -> pd.Series:
    return df.groupby('status')['price'].sum()

The right mindset: EDA is not a one-time task. Notebook EDA is for exploration. Automated checks are for production.

🎮 Playground: Complete EDA Workflow

python

import pandas as pd
import numpy as np

# Create a sample e-commerce dataset
np.random.seed(42)
n = 5000
df = pd.DataFrame({
    'user_id': range(1, n + 1),
    'age': np.random.normal(35, 10, n).astype(int),
    'income': np.where(np.random.random(n) > 0.3,
                       np.random.lognormal(10.5, 0.8, n), np.nan),
    'city': np.random.choice(['Hanoi', 'HCMC', 'Danang', 'Other'], n,
                             p=[0.35, 0.40, 0.10, 0.15]),
    'num_orders': np.random.poisson(5, n),
    'total_spend': None,
    'signup_date': pd.date_range('2023-01-01', periods=n, freq='2h'),
    'is_churned': None,
})
df['total_spend'] = (df['num_orders'] * np.random.normal(50, 15, n)).clip(lower=0)
churn_prob = 1 / (1 + np.exp(-(-2 + 0.03 * df['age'] - 0.3 * df['num_orders'])))
df['is_churned'] = (np.random.random(n) < churn_prob).astype(int)

# Inject data quality issues
df.loc[0:4, 'age'] = 999           # Placeholder values
df.loc[100:102, 'age'] = -5        # Invalid negative age
df.loc[200:250, 'city'] = None     # Missing city
dup = df.iloc[:50].copy()           # Duplicate users
dup['user_id'] = range(n + 1, n + 51)
df = pd.concat([df, dup], ignore_index=True)

# STEP 1: Schema
print("=== STEP 1: Schema ===")
print(df.dtypes)

# STEP 2: Missing Data
print("\n=== STEP 2: Missing ===")
print((df.isnull().mean() * 100).round(1))

# STEP 3: Distributions
print("\n=== STEP 3: Distributions ===")
print(df.describe())
print(f"❓ age max={df['age'].max()}: placeholder or real?")

# STEP 4: Target
print("\n=== STEP 4: Target ===")
print(df['is_churned'].value_counts(normalize=True))

# STEP 5: Correlations
print("\n=== STEP 5: Correlations ===")
corr = df.select_dtypes(include='number').corr()
print(f"num_orders ↔ total_spend: {corr.loc['num_orders','total_spend']:.3f}")

# STEP 6: Temporal
print("\n=== STEP 6: Temporal ===")
df['month'] = df['signup_date'].dt.to_period('M')
print(df.groupby('month')['is_churned'].mean().head(6))

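# BONUS: Duplicates (the injected copies have new user_ids, so compare all other columns)
feature_cols = [c for c in df.columns if c != 'user_id']
print(f"\nRows: {len(df)} | Unique user_ids: {df['user_id'].nunique()} | "
      f"Duplicate profiles: {df.duplicated(subset=feature_cols).sum()}")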

Suggested Next Pages

Next lesson	Content	Why it matters
Feature Engineering →	Turn raw data into features for the model	EDA gives understanding; FE turns understanding into predictive power
Imbalanced Data →	Handle the class imbalance found in Step 4	A 95/5 ratio needs its own strategy

Checklist Summary

Schema & Types → df.info(): are the dtypes correct?
Missing Data → df.isnull().mean(): what is the missing pattern?
Distributions → df.describe(include='all'): outliers? skew?
Target Variable → Imbalanced? Leakage?
Correlations → df.corr(numeric_only=True): multicollinearity?
Temporal & Drift → Does the distribution change over time?

If you remember only one thing: EDA is quality assurance, not data decoration. Every chart must answer a specific question.
