🔄 Normalization

Level: Foundation Solves: Hiểu và áp dụng normalization để loại bỏ data redundancy và anomalies

Tại sao cần Normalization?

💡 Giáo sư Tom

Normalization không phải là academic exercise. Đó là cách bạn tránh được những đêm thức trắng debug data inconsistencies. Mỗi violation của normal form là một bug waiting to happen.

Data Anomalies

┌─────────────────────────────────────────────────────────────────┐
│                    DATA ANOMALIES                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  UNNORMALIZED TABLE: student_courses                            │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ student_id │ student_name │ course_id │ course_name     │    │
│  ├────────────┼──────────────┼───────────┼─────────────────┤    │
│  │ 1          │ Alice        │ CS101     │ Intro to CS     │    │
│  │ 1          │ Alice        │ CS102     │ Data Structures │    │
│  │ 2          │ Bob          │ CS101     │ Intro to CS     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ANOMALIES:                                                     │
│                                                                 │
│  ❌ UPDATE ANOMALY:                                             │
│     Đổi tên "Alice" → phải update nhiều rows                    │
│     Quên 1 row → data inconsistent                              │
│                                                                 │
│  ❌ INSERT ANOMALY:                                             │
│     Thêm course mới → cần student (không có student nào)        │
│     Không thể lưu course độc lập                                │
│                                                                 │
│  ❌ DELETE ANOMALY:                                             │
│     Xóa Bob → mất luôn thông tin CS101 nếu Bob là student cuối  │
│     Mất data không liên quan                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Normal Forms Overview

First Normal Form (1NF)

Rules

Atomic values: Mỗi cell chứa một giá trị duy nhất
No repeating groups: Không có arrays hay nested structures
Unique rows: Có primary key để identify mỗi row

Ví dụ: Violation và Fix

┌─────────────────────────────────────────────────────────────────┐
│                    1NF VIOLATION                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ BEFORE (Violates 1NF):                                      │
│  ┌──────────┬─────────────────────────────────────┐             │
│  │ order_id │ products                            │             │
│  ├──────────┼─────────────────────────────────────┤             │
│  │ 1        │ iPhone, MacBook, AirPods            │             │
│  │ 2        │ iPad                                │             │
│  └──────────┴─────────────────────────────────────┘             │
│                                                                 │
│  ✅ AFTER (1NF Compliant):                                      │
│  ┌──────────┬────────────┐                                      │
│  │ order_id │ product    │                                      │
│  ├──────────┼────────────┤                                      │
│  │ 1        │ iPhone     │                                      │
│  │ 1        │ MacBook    │                                      │
│  │ 1        │ AirPods    │                                      │
│  │ 2        │ iPad       │                                      │
│  └──────────┴────────────┘                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Second Normal Form (2NF)

Rules

Must be in 1NF
No partial dependencies: Non-key attributes phải depend on entire primary key

Ví dụ: Partial Dependency

┌─────────────────────────────────────────────────────────────────┐
│                    2NF VIOLATION                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ BEFORE (Violates 2NF):                                      │
│  Table: order_items                                             │
│  PK: (order_id, product_id)                                     │
│  ┌──────────┬────────────┬──────────────┬───────────┐           │
│  │ order_id │ product_id │ product_name │ quantity  │           │
│  ├──────────┼────────────┼──────────────┼───────────┤           │
│  │ 1        │ P001       │ iPhone       │ 2         │           │
│  │ 1        │ P002       │ MacBook      │ 1         │           │
│  │ 2        │ P001       │ iPhone       │ 1         │           │
│  └──────────┴────────────┴──────────────┴───────────┘           │
│                                                                 │
│  PROBLEM: product_name depends only on product_id               │
│           (partial dependency on composite key)                 │
│                                                                 │
│  ✅ AFTER (2NF Compliant):                                      │
│                                                                 │
│  Table: order_items          Table: products                    │
│  ┌──────────┬────────────┬───────────┐  ┌────────────┬────────┐ │
│  │ order_id │ product_id │ quantity  │  │ product_id │ name   │ │
│  ├──────────┼────────────┼───────────┤  ├────────────┼────────┤ │
│  │ 1        │ P001       │ 2         │  │ P001       │ iPhone │ │
│  │ 1        │ P002       │ 1         │  │ P002       │ MacBook│ │
│  │ 2        │ P001       │ 1         │  └────────────┴────────┘ │
│  └──────────┴────────────┴───────────┘                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Third Normal Form (3NF)

Rules

Must be in 2NF
No transitive dependencies: Non-key attributes không depend on other non-key attributes

Ví dụ: Transitive Dependency

┌─────────────────────────────────────────────────────────────────┐
│                    3NF VIOLATION                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ BEFORE (Violates 3NF):                                      │
│  Table: employees                                               │
│  PK: employee_id                                                │
│  ┌─────────────┬──────────┬─────────────┬───────────────┐       │
│  │ employee_id │ name     │ dept_id     │ dept_name     │       │
│  ├─────────────┼──────────┼─────────────┼───────────────┤       │
│  │ E001        │ Alice    │ D01         │ Engineering   │       │
│  │ E002        │ Bob      │ D01         │ Engineering   │       │
│  │ E003        │ Carol    │ D02         │ Marketing     │       │
│  └─────────────┴──────────┴─────────────┴───────────────┘       │
│                                                                 │
│  PROBLEM: dept_name depends on dept_id (not on employee_id)     │
│           employee_id → dept_id → dept_name (transitive)        │
│                                                                 │
│  ✅ AFTER (3NF Compliant):                                      │
│                                                                 │
│  Table: employees            Table: departments                 │
│  ┌─────────────┬────────┬─────────┐  ┌─────────┬─────────────┐  │
│  │ employee_id │ name   │ dept_id │  │ dept_id │ dept_name   │  │
│  ├─────────────┼────────┼─────────┤  ├─────────┼─────────────┤  │
│  │ E001        │ Alice  │ D01     │  │ D01     │ Engineering │  │
│  │ E002        │ Bob    │ D01     │  │ D02     │ Marketing   │  │
│  │ E003        │ Carol  │ D02     │  └─────────┴─────────────┘  │
│  └─────────────┴────────┴─────────┘                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Boyce-Codd Normal Form (BCNF)

Rules

Must be in 3NF
Every determinant is a candidate key: Nếu X → Y, thì X phải là superkey

Ví dụ: 3NF nhưng không BCNF

┌─────────────────────────────────────────────────────────────────┐
│                    BCNF VIOLATION                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCENARIO: Students, Subjects, Teachers                         │
│  - Each subject is taught by one teacher                        │
│  - Each teacher teaches only one subject                        │
│  - Students can take multiple subjects                          │
│                                                                 │
│  ❌ BEFORE (3NF but not BCNF):                                  │
│  Table: student_subjects                                        │
│  PK: (student_id, subject)                                      │
│  ┌────────────┬───────────┬─────────┐                           │
│  │ student_id │ subject   │ teacher │                           │
│  ├────────────┼───────────┼─────────┤                           │
│  │ S001       │ Math      │ T01     │                           │
│  │ S001       │ Physics   │ T02     │                           │
│  │ S002       │ Math      │ T01     │                           │
│  └────────────┴───────────┴─────────┘                           │
│                                                                 │
│  PROBLEM: teacher → subject (teacher determines subject)        │
│           But teacher is not a superkey!                        │
│                                                                 │
│  ✅ AFTER (BCNF Compliant):                                     │
│                                                                 │
│  Table: student_teachers     Table: teacher_subjects            │
│  ┌────────────┬─────────┐    ┌─────────┬───────────┐            │
│  │ student_id │ teacher │    │ teacher │ subject   │            │
│  ├────────────┼─────────┤    ├─────────┼───────────┤            │
│  │ S001       │ T01     │    │ T01     │ Math      │            │
│  │ S001       │ T02     │    │ T02     │ Physics   │            │
│  │ S002       │ T01     │    └─────────┴───────────┘            │
│  └────────────┴─────────┘                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Denormalization Trade-offs

⚠️ Denormalization là Optimization

Chỉ denormalize khi bạn có performance problems đã được đo lường. Premature denormalization là root of all evil trong data modeling.

Khi nào Denormalize?

Scenario	Denormalization Strategy	Trade-off
Read-heavy workloads	Duplicate data để tránh JOINs	Storage ↑, Write complexity ↑
Reporting/Analytics	Pre-aggregated tables	Data freshness ↓
Caching layer	Materialized views	Maintenance overhead ↑
Distributed systems	Embed related data	Consistency complexity ↑

Denormalization Patterns

┌─────────────────────────────────────────────────────────────────┐
│                 DENORMALIZATION PATTERNS                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. DUPLICATE COLUMNS                                           │
│     orders.customer_name (copy from customers)                  │
│     → Avoid JOIN for common queries                             │
│                                                                 │
│  2. PRE-COMPUTED AGGREGATES                                     │
│     products.review_count, products.avg_rating                  │
│     → Avoid COUNT/AVG on every query                            │
│                                                                 │
│  3. MATERIALIZED VIEWS                                          │
│     CREATE MATERIALIZED VIEW daily_sales AS ...                 │
│     → Pre-computed complex queries                              │
│                                                                 │
│  4. SUMMARY TABLES                                              │
│     monthly_revenue, user_activity_summary                      │
│     → Historical aggregations                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Maintaining Denormalized Data

sql

-- Option 1: Triggers (synchronous)
CREATE TRIGGER update_order_customer_name
AFTER UPDATE ON customers
FOR EACH ROW
UPDATE orders SET customer_name = NEW.name
WHERE customer_id = NEW.customer_id;

-- Option 2: Application logic (controlled)
-- Update both tables in same transaction

-- Option 3: Async sync (eventual consistency)
-- Use CDC (Change Data Capture) to sync

Quick Reference: Normal Forms

Form	Rule	Eliminates
1NF	Atomic values, no repeating groups	Repeating groups
2NF	1NF + No partial dependencies	Partial dependencies
3NF	2NF + No transitive dependencies	Transitive dependencies
BCNF	Every determinant is a candidate key	Non-key determinants

🔍 Common Review Comments

⚠️ Normalization Review Feedback

Những comments này thường xuất hiện khi review schema normalization decisions. Anticipate chúng để defend design choices của bạn.

Review Comment	Ý nghĩa	Cách khắc phục
"Tại sao denormalize mà chưa có load testing?"	Denormalization là optimization, cần evidence	Benchmark first, denormalize với data-driven decision
"Transitive dependency này sẽ gây sync issues at scale"	Hidden dependency sẽ tạo update anomalies khi có nhiều writers	Normalize hoặc document sync strategy rõ ràng
"Composite key này có partial dependency"	2NF violation, một phần key determine non-key attribute	Tách table, move attribute sang table của partial key
"Denormalization này không có maintenance strategy"	Duplicated data sẽ drift nếu không có sync mechanism	Define triggers, CDC, hoặc application-level sync
"Normalization level quá cao cho OLTP"	5NF/6NF overkill cho transactional workload	Target 3NF/BCNF cho OLTP, document trade-off
"Data redundancy ở đây intentional hay bug?"	Unclear whether denormalization is by design	Comment trong schema hoặc document trong ADR

⚠️ Anti-patterns

❌ Normalization Mistakes

Những anti-patterns này dẫn đến data inconsistency, incorrect reporting, và debugging nightmares khi system scales.

Anti-pattern 1: Premature Denormalization

┌─────────────────────────────────────────────────────────────────┐
│            PREMATURE DENORMALIZATION ANTI-PATTERN               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  "JOINs are slow, let's denormalize everything upfront"         │
│                                                                 │
│  CREATE TABLE orders (                                          │
│      order_id BIGINT,                                           │
│      customer_id BIGINT,                                        │
│      customer_name VARCHAR(200),      -- Duplicated!            │
│      customer_email VARCHAR(255),     -- Duplicated!            │
│      product_id BIGINT,                                         │
│      product_name VARCHAR(200),       -- Duplicated!            │
│      product_price DECIMAL(10,2),     -- Duplicated!            │
│      shipping_city VARCHAR(100),      -- Duplicated!            │
│      shipping_country VARCHAR(100)    -- Duplicated!            │
│  );                                                             │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Update anomalies: Customer changes name → orders inconsistent│
│  • Storage waste: Same data repeated millions of times          │
│  • Write amplification: Every update touches multiple rows      │
│  • No single source of truth: Which customer_name is correct?   │
│  • Premature optimization: JOIN might not even be slow          │
│                                                                 │
│  CORRECT APPROACH:                                              │
│  1. Start normalized (3NF)                                      │
│  2. Profile queries, identify actual bottlenecks                │
│  3. Denormalize specific hot paths với caching/materialized view│
│  4. Document sync strategy cho denormalized data                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern 2: Over-Normalization for OLTP

┌─────────────────────────────────────────────────────────────────┐
│            OVER-NORMALIZATION ANTI-PATTERN                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  "Let's go to 5NF for maximum data integrity"                   │
│                                                                 │
│  -- Normalized to death:                                        │
│  SELECT * FROM orders                                           │
│  JOIN order_customers ON ...                                    │
│  JOIN customer_names ON ...                                     │
│  JOIN customer_emails ON ...                                    │
│  JOIN order_products ON ...                                     │
│  JOIN product_names ON ...                                      │
│  JOIN product_prices ON ...                                     │
│  JOIN product_price_history ON ...                              │
│  -- 15 JOINs for a simple order display                         │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Query complexity: Simple reads become complex queries        │
│  • Join overhead: CPU và I/O tăng exponentially                 │
│  • Developer confusion: Hard to understand data model           │
│  • ORM nightmare: N+1 queries everywhere                        │
│  • Latency: User-facing endpoint becomes slow                   │
│                                                                 │
│  CORRECT APPROACH:                                              │
│  • OLTP: Target 3NF, accept controlled redundancy               │
│  • OLAP: Denormalize intentionally for query patterns           │
│  • Hybrid: Normalize writes, denormalize reads (CQRS)           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern 3: Ignoring 2NF in Composite Keys

┌─────────────────────────────────────────────────────────────────┐
│            SUBTLE 2NF VIOLATION ANTI-PATTERN                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  CREATE TABLE course_enrollments (                              │
│      student_id BIGINT,                                         │
│      course_id BIGINT,                                          │
│      course_name VARCHAR(200),    -- Depends only on course_id! │
│      instructor_name VARCHAR(200),-- Depends only on course_id! │
│      enrollment_date DATE,                                      │
│      PRIMARY KEY (student_id, course_id)                        │
│  );                                                             │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Subtle update anomaly: Change course name → update all rows  │
│  • Data inconsistency: "CS101" shows different names per student│
│  • Hidden bugs: Queries assume course_name is consistent        │
│  • Reporting errors: COUNT(DISTINCT course_name) != COUNT(course_id) │
│                                                                 │
│  WHY IT'S MISSED:                                               │
│  • Developers focus on functional correctness                   │
│  • Works fine with small data                                   │
│  • Only breaks at scale with concurrent updates                 │
│                                                                 │
│  FIX: Move course_name to courses table (proper 2NF)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern 4: Denormalization Without Sync Strategy

┌─────────────────────────────────────────────────────────────────┐
│          ORPHANED DENORMALIZATION ANTI-PATTERN                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  -- Added for "performance" without thinking about maintenance  │
│  ALTER TABLE orders ADD COLUMN customer_name VARCHAR(200);      │
│  UPDATE orders o SET customer_name = (                          │
│      SELECT name FROM customers c WHERE c.id = o.customer_id    │
│  );                                                             │
│  -- And then... nothing. No triggers, no sync, no CDC.          │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Data drift: customer_name in orders becomes stale            │
│  • Silent corruption: No errors, just wrong data                │
│  • Trust erosion: "Which name is correct?"                      │
│  • Audit failures: Cannot explain data discrepancies            │
│  • Debugging hell: "Why does the order show old name?"          │
│                                                                 │
│  SYNC OPTIONS (Pick One):                                       │
│  1. Database trigger: Automatic, but adds write overhead        │
│  2. Application sync: Controlled, but requires discipline       │
│  3. CDC pipeline: Async, eventual consistency acceptable        │
│  4. Materialized view: Query-time computation, fresh data       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔄 Migration Notes

Normalizing Legacy Tables

┌─────────────────────────────────────────────────────────────────┐
│               LEGACY NORMALIZATION MIGRATION                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCENARIO: Legacy denormalized table → Normalized schema        │
│                                                                 │
│  STEP 1: ANALYSIS                                               │
│  • Identify functional dependencies                             │
│  • Map current column usage in application code                 │
│  • Estimate data volume và migration time                       │
│                                                                 │
│  STEP 2: CREATE NEW TABLES (Expand)                             │
│  CREATE TABLE customers_new (                                   │
│      customer_id BIGINT PRIMARY KEY,                            │
│      name VARCHAR(200),                                         │
│      email VARCHAR(255)                                         │
│  );                                                             │
│                                                                 │
│  STEP 3: BACKFILL (Batched)                                     │
│  -- Avoid locking, process in chunks                            │
│  INSERT INTO customers_new (customer_id, name, email)           │
│  SELECT DISTINCT customer_id, customer_name, customer_email     │
│  FROM legacy_orders                                             │
│  WHERE customer_id BETWEEN :start AND :end                      │
│  ON CONFLICT (customer_id) DO NOTHING;                          │
│                                                                 │
│  STEP 4: ADD FOREIGN KEY                                        │
│  ALTER TABLE orders ADD COLUMN customer_id_new BIGINT           │
│      REFERENCES customers_new(customer_id);                     │
│  UPDATE orders SET customer_id_new = customer_id                │
│      WHERE customer_id_new IS NULL;                             │
│                                                                 │
│  STEP 5: UPDATE APPLICATION                                     │
│  -- Dual-write period: write to both old và new columns         │
│  -- Read from new, fallback to old                              │
│                                                                 │
│  STEP 6: CLEANUP (Contract)                                     │
│  -- After deprecation period                                    │
│  ALTER TABLE orders DROP COLUMN customer_name;                  │
│  ALTER TABLE orders DROP COLUMN customer_email;                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Backfill Strategies

Strategy	Use Case	Trade-off
Full scan batch	Small tables (<1M rows)	Simple, but locks table
Chunked by ID range	Large tables with sequential IDs	No locks, predictable progress
CDC-based	Very large tables, zero downtime	Complex setup, eventual consistency
Shadow table swap	Critical tables, must be atomic	Doubles storage temporarily

Handling Data Conflicts During Normalization

sql

-- Problem: Legacy data has inconsistent duplicates
-- customer_id=123 has customer_name='Alice' in some rows, 'Alice Smith' in others

-- Solution 1: Latest wins
INSERT INTO customers_new (customer_id, name)
SELECT customer_id, name
FROM (
    SELECT customer_id, customer_name as name,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) as rn
    FROM legacy_orders
) t WHERE rn = 1;

-- Solution 2: Most frequent wins
INSERT INTO customers_new (customer_id, name)
SELECT customer_id, name
FROM (
    SELECT customer_id, customer_name as name,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY cnt DESC) as rn
    FROM (
        SELECT customer_id, customer_name, COUNT(*) as cnt
        FROM legacy_orders
        GROUP BY customer_id, customer_name
    ) t1
) t2 WHERE rn = 1;

-- Solution 3: Flag for manual review
INSERT INTO customers_new (customer_id, name, needs_review)
SELECT customer_id, MIN(name), COUNT(DISTINCT name) > 1
FROM legacy_orders
GROUP BY customer_id;

💡 Migration Best Practice

Luôn dry-run migration trên production clone trước khi chạy thật. Measure:

Execution time
Lock duration
Index rebuild time
Rollback procedure

📎 Cross-References

📎 Modeling Fundamentals - Conceptual/Logical/Physical modeling
📎 OLTP Modeling - Áp dụng normalization cho transactional systems
📎 OLAP Modeling - Khi nào denormalize cho analytics
📎 Schema Evolution - Migration strategies chi tiết
📎 SQL Advanced Joins - Query patterns cho normalized schemas

🔄 Normalization ​

Tại sao cần Normalization? ​

Data Anomalies ​

Normal Forms Overview ​

First Normal Form (1NF) ​

Rules ​

Ví dụ: Violation và Fix ​

Second Normal Form (2NF) ​

Rules ​

Ví dụ: Partial Dependency ​

Third Normal Form (3NF) ​

Rules ​

Ví dụ: Transitive Dependency ​

Boyce-Codd Normal Form (BCNF) ​

Rules ​

Ví dụ: 3NF nhưng không BCNF ​

Denormalization Trade-offs ​

Khi nào Denormalize? ​

Denormalization Patterns ​

Maintaining Denormalized Data ​

Quick Reference: Normal Forms ​

🔍 Common Review Comments ​

⚠️ Anti-patterns ​

Anti-pattern 1: Premature Denormalization ​

Anti-pattern 2: Over-Normalization for OLTP ​

Anti-pattern 3: Ignoring 2NF in Composite Keys ​

Anti-pattern 4: Denormalization Without Sync Strategy ​

🔄 Migration Notes ​

Normalizing Legacy Tables ​

Backfill Strategies ​

Handling Data Conflicts During Normalization ​

📎 Cross-References ​

🔄 Normalization

Tại sao cần Normalization?

Data Anomalies

Normal Forms Overview

First Normal Form (1NF)

Rules

Ví dụ: Violation và Fix

Second Normal Form (2NF)

Rules

Ví dụ: Partial Dependency

Third Normal Form (3NF)

Rules

Ví dụ: Transitive Dependency

Boyce-Codd Normal Form (BCNF)

Rules

Ví dụ: 3NF nhưng không BCNF

Denormalization Trade-offs

Khi nào Denormalize?

Denormalization Patterns

Maintaining Denormalized Data

Quick Reference: Normal Forms

🔍 Common Review Comments

⚠️ Anti-patterns

Anti-pattern 1: Premature Denormalization

Anti-pattern 2: Over-Normalization for OLTP

Anti-pattern 3: Ignoring 2NF in Composite Keys

Anti-pattern 4: Denormalization Without Sync Strategy

🔄 Migration Notes

Normalizing Legacy Tables

Backfill Strategies

Handling Data Conflicts During Normalization

📎 Cross-References