Skip to content

🔄 Normalization

Level: Foundation Solves: Hiểu và áp dụng normalization để loại bỏ data redundancy và anomalies

Tại sao cần Normalization?

💡 Giáo sư Tom

Normalization không phải là academic exercise. Đó là cách bạn tránh được những đêm thức trắng debug data inconsistencies. Mỗi violation của normal form là một bug waiting to happen.

Data Anomalies

┌─────────────────────────────────────────────────────────────────┐
│                    DATA ANOMALIES                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  UNNORMALIZED TABLE: student_courses                            │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ student_id │ student_name │ course_id │ course_name     │    │
│  ├────────────┼──────────────┼───────────┼─────────────────┤    │
│  │ 1          │ Alice        │ CS101     │ Intro to CS     │    │
│  │ 1          │ Alice        │ CS102     │ Data Structures │    │
│  │ 2          │ Bob          │ CS101     │ Intro to CS     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ANOMALIES:                                                     │
│                                                                 │
│  ❌ UPDATE ANOMALY:                                             │
│     Đổi tên "Alice" → phải update nhiều rows                    │
│     Quên 1 row → data inconsistent                              │
│                                                                 │
│  ❌ INSERT ANOMALY:                                             │
│     Thêm course mới → cần student (không có student nào)        │
│     Không thể lưu course độc lập                                │
│                                                                 │
│  ❌ DELETE ANOMALY:                                             │
│     Xóa Bob → mất luôn thông tin CS101 nếu Bob là student cuối  │
│     Mất data không liên quan                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Normal Forms Overview

First Normal Form (1NF)

Rules

  1. Atomic values: Mỗi cell chứa một giá trị duy nhất
  2. No repeating groups: Không có arrays hay nested structures
  3. Unique rows: Có primary key để identify mỗi row

Ví dụ: Violation và Fix

┌─────────────────────────────────────────────────────────────────┐
│                    1NF VIOLATION                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ BEFORE (Violates 1NF):                                      │
│  ┌──────────┬─────────────────────────────────────┐             │
│  │ order_id │ products                            │             │
│  ├──────────┼─────────────────────────────────────┤             │
│  │ 1        │ iPhone, MacBook, AirPods            │             │
│  │ 2        │ iPad                                │             │
│  └──────────┴─────────────────────────────────────┘             │
│                                                                 │
│  ✅ AFTER (1NF Compliant):                                      │
│  ┌──────────┬────────────┐                                      │
│  │ order_id │ product    │                                      │
│  ├──────────┼────────────┤                                      │
│  │ 1        │ iPhone     │                                      │
│  │ 1        │ MacBook    │                                      │
│  │ 1        │ AirPods    │                                      │
│  │ 2        │ iPad       │                                      │
│  └──────────┴────────────┘                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Second Normal Form (2NF)

Rules

  1. Must be in 1NF
  2. No partial dependencies: Non-key attributes phải depend on entire primary key

Ví dụ: Partial Dependency

┌─────────────────────────────────────────────────────────────────┐
│                    2NF VIOLATION                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ BEFORE (Violates 2NF):                                      │
│  Table: order_items                                             │
│  PK: (order_id, product_id)                                     │
│  ┌──────────┬────────────┬──────────────┬───────────┐           │
│  │ order_id │ product_id │ product_name │ quantity  │           │
│  ├──────────┼────────────┼──────────────┼───────────┤           │
│  │ 1        │ P001       │ iPhone       │ 2         │           │
│  │ 1        │ P002       │ MacBook      │ 1         │           │
│  │ 2        │ P001       │ iPhone       │ 1         │           │
│  └──────────┴────────────┴──────────────┴───────────┘           │
│                                                                 │
│  PROBLEM: product_name depends only on product_id               │
│           (partial dependency on composite key)                 │
│                                                                 │
│  ✅ AFTER (2NF Compliant):                                      │
│                                                                 │
│  Table: order_items          Table: products                    │
│  ┌──────────┬────────────┬───────────┐  ┌────────────┬────────┐ │
│  │ order_id │ product_id │ quantity  │  │ product_id │ name   │ │
│  ├──────────┼────────────┼───────────┤  ├────────────┼────────┤ │
│  │ 1        │ P001       │ 2         │  │ P001       │ iPhone │ │
│  │ 1        │ P002       │ 1         │  │ P002       │ MacBook│ │
│  │ 2        │ P001       │ 1         │  └────────────┴────────┘ │
│  └──────────┴────────────┴───────────┘                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Third Normal Form (3NF)

Rules

  1. Must be in 2NF
  2. No transitive dependencies: Non-key attributes không depend on other non-key attributes

Ví dụ: Transitive Dependency

┌─────────────────────────────────────────────────────────────────┐
│                    3NF VIOLATION                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ BEFORE (Violates 3NF):                                      │
│  Table: employees                                               │
│  PK: employee_id                                                │
│  ┌─────────────┬──────────┬─────────────┬───────────────┐       │
│  │ employee_id │ name     │ dept_id     │ dept_name     │       │
│  ├─────────────┼──────────┼─────────────┼───────────────┤       │
│  │ E001        │ Alice    │ D01         │ Engineering   │       │
│  │ E002        │ Bob      │ D01         │ Engineering   │       │
│  │ E003        │ Carol    │ D02         │ Marketing     │       │
│  └─────────────┴──────────┴─────────────┴───────────────┘       │
│                                                                 │
│  PROBLEM: dept_name depends on dept_id (not on employee_id)     │
│           employee_id → dept_id → dept_name (transitive)        │
│                                                                 │
│  ✅ AFTER (3NF Compliant):                                      │
│                                                                 │
│  Table: employees            Table: departments                 │
│  ┌─────────────┬────────┬─────────┐  ┌─────────┬─────────────┐  │
│  │ employee_id │ name   │ dept_id │  │ dept_id │ dept_name   │  │
│  ├─────────────┼────────┼─────────┤  ├─────────┼─────────────┤  │
│  │ E001        │ Alice  │ D01     │  │ D01     │ Engineering │  │
│  │ E002        │ Bob    │ D01     │  │ D02     │ Marketing   │  │
│  │ E003        │ Carol  │ D02     │  └─────────┴─────────────┘  │
│  └─────────────┴────────┴─────────┘                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Boyce-Codd Normal Form (BCNF)

Rules

  1. Must be in 3NF
  2. Every determinant is a candidate key: Nếu X → Y, thì X phải là superkey

Ví dụ: 3NF nhưng không BCNF

┌─────────────────────────────────────────────────────────────────┐
│                    BCNF VIOLATION                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCENARIO: Students, Subjects, Teachers                         │
│  - Each subject is taught by one teacher                        │
│  - Each teacher teaches only one subject                        │
│  - Students can take multiple subjects                          │
│                                                                 │
│  ❌ BEFORE (3NF but not BCNF):                                  │
│  Table: student_subjects                                        │
│  PK: (student_id, subject)                                      │
│  ┌────────────┬───────────┬─────────┐                           │
│  │ student_id │ subject   │ teacher │                           │
│  ├────────────┼───────────┼─────────┤                           │
│  │ S001       │ Math      │ T01     │                           │
│  │ S001       │ Physics   │ T02     │                           │
│  │ S002       │ Math      │ T01     │                           │
│  └────────────┴───────────┴─────────┘                           │
│                                                                 │
│  PROBLEM: teacher → subject (teacher determines subject)        │
│           But teacher is not a superkey!                        │
│                                                                 │
│  ✅ AFTER (BCNF Compliant):                                     │
│                                                                 │
│  Table: student_teachers     Table: teacher_subjects            │
│  ┌────────────┬─────────┐    ┌─────────┬───────────┐            │
│  │ student_id │ teacher │    │ teacher │ subject   │            │
│  ├────────────┼─────────┤    ├─────────┼───────────┤            │
│  │ S001       │ T01     │    │ T01     │ Math      │            │
│  │ S001       │ T02     │    │ T02     │ Physics   │            │
│  │ S002       │ T01     │    └─────────┴───────────┘            │
│  └────────────┴─────────┘                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Denormalization Trade-offs

⚠️ Denormalization là Optimization

Chỉ denormalize khi bạn có performance problems đã được đo lường. Premature denormalization là root of all evil trong data modeling.

Khi nào Denormalize?

ScenarioDenormalization StrategyTrade-off
Read-heavy workloadsDuplicate data để tránh JOINsStorage ↑, Write complexity ↑
Reporting/AnalyticsPre-aggregated tablesData freshness ↓
Caching layerMaterialized viewsMaintenance overhead ↑
Distributed systemsEmbed related dataConsistency complexity ↑

Denormalization Patterns

┌─────────────────────────────────────────────────────────────────┐
│                 DENORMALIZATION PATTERNS                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. DUPLICATE COLUMNS                                           │
│     orders.customer_name (copy from customers)                  │
│     → Avoid JOIN for common queries                             │
│                                                                 │
│  2. PRE-COMPUTED AGGREGATES                                     │
│     products.review_count, products.avg_rating                  │
│     → Avoid COUNT/AVG on every query                            │
│                                                                 │
│  3. MATERIALIZED VIEWS                                          │
│     CREATE MATERIALIZED VIEW daily_sales AS ...                 │
│     → Pre-computed complex queries                              │
│                                                                 │
│  4. SUMMARY TABLES                                              │
│     monthly_revenue, user_activity_summary                      │
│     → Historical aggregations                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Maintaining Denormalized Data

sql
-- Option 1: Triggers (synchronous)
CREATE TRIGGER update_order_customer_name
AFTER UPDATE ON customers
FOR EACH ROW
UPDATE orders SET customer_name = NEW.name
WHERE customer_id = NEW.customer_id;

-- Option 2: Application logic (controlled)
-- Update both tables in same transaction

-- Option 3: Async sync (eventual consistency)
-- Use CDC (Change Data Capture) to sync

Quick Reference: Normal Forms

FormRuleEliminates
1NFAtomic values, no repeating groupsRepeating groups
2NF1NF + No partial dependenciesPartial dependencies
3NF2NF + No transitive dependenciesTransitive dependencies
BCNFEvery determinant is a candidate keyNon-key determinants

🔍 Common Review Comments

⚠️ Normalization Review Feedback

Những comments này thường xuất hiện khi review schema normalization decisions. Anticipate chúng để defend design choices của bạn.

Review CommentÝ nghĩaCách khắc phục
"Tại sao denormalize mà chưa có load testing?"Denormalization là optimization, cần evidenceBenchmark first, denormalize với data-driven decision
"Transitive dependency này sẽ gây sync issues at scale"Hidden dependency sẽ tạo update anomalies khi có nhiều writersNormalize hoặc document sync strategy rõ ràng
"Composite key này có partial dependency"2NF violation, một phần key determine non-key attributeTách table, move attribute sang table của partial key
"Denormalization này không có maintenance strategy"Duplicated data sẽ drift nếu không có sync mechanismDefine triggers, CDC, hoặc application-level sync
"Normalization level quá cao cho OLTP"5NF/6NF overkill cho transactional workloadTarget 3NF/BCNF cho OLTP, document trade-off
"Data redundancy ở đây intentional hay bug?"Unclear whether denormalization is by designComment trong schema hoặc document trong ADR

⚠️ Anti-patterns

❌ Normalization Mistakes

Những anti-patterns này dẫn đến data inconsistency, incorrect reporting, và debugging nightmares khi system scales.

Anti-pattern 1: Premature Denormalization

┌─────────────────────────────────────────────────────────────────┐
│            PREMATURE DENORMALIZATION ANTI-PATTERN               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  "JOINs are slow, let's denormalize everything upfront"         │
│                                                                 │
│  CREATE TABLE orders (                                          │
│      order_id BIGINT,                                           │
│      customer_id BIGINT,                                        │
│      customer_name VARCHAR(200),      -- Duplicated!            │
│      customer_email VARCHAR(255),     -- Duplicated!            │
│      product_id BIGINT,                                         │
│      product_name VARCHAR(200),       -- Duplicated!            │
│      product_price DECIMAL(10,2),     -- Duplicated!            │
│      shipping_city VARCHAR(100),      -- Duplicated!            │
│      shipping_country VARCHAR(100)    -- Duplicated!            │
│  );                                                             │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Update anomalies: Customer changes name → orders inconsistent│
│  • Storage waste: Same data repeated millions of times          │
│  • Write amplification: Every update touches multiple rows      │
│  • No single source of truth: Which customer_name is correct?   │
│  • Premature optimization: JOIN might not even be slow          │
│                                                                 │
│  CORRECT APPROACH:                                              │
│  1. Start normalized (3NF)                                      │
│  2. Profile queries, identify actual bottlenecks                │
│  3. Denormalize specific hot paths với caching/materialized view│
│  4. Document sync strategy cho denormalized data                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern 2: Over-Normalization for OLTP

┌─────────────────────────────────────────────────────────────────┐
│            OVER-NORMALIZATION ANTI-PATTERN                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  "Let's go to 5NF for maximum data integrity"                   │
│                                                                 │
│  -- Normalized to death:                                        │
│  SELECT * FROM orders                                           │
│  JOIN order_customers ON ...                                    │
│  JOIN customer_names ON ...                                     │
│  JOIN customer_emails ON ...                                    │
│  JOIN order_products ON ...                                     │
│  JOIN product_names ON ...                                      │
│  JOIN product_prices ON ...                                     │
│  JOIN product_price_history ON ...                              │
│  -- 15 JOINs for a simple order display                         │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Query complexity: Simple reads become complex queries        │
│  • Join overhead: CPU và I/O tăng exponentially                 │
│  • Developer confusion: Hard to understand data model           │
│  • ORM nightmare: N+1 queries everywhere                        │
│  • Latency: User-facing endpoint becomes slow                   │
│                                                                 │
│  CORRECT APPROACH:                                              │
│  • OLTP: Target 3NF, accept controlled redundancy               │
│  • OLAP: Denormalize intentionally for query patterns           │
│  • Hybrid: Normalize writes, denormalize reads (CQRS)           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern 3: Ignoring 2NF in Composite Keys

┌─────────────────────────────────────────────────────────────────┐
│            SUBTLE 2NF VIOLATION ANTI-PATTERN                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  CREATE TABLE course_enrollments (                              │
│      student_id BIGINT,                                         │
│      course_id BIGINT,                                          │
│      course_name VARCHAR(200),    -- Depends only on course_id! │
│      instructor_name VARCHAR(200),-- Depends only on course_id! │
│      enrollment_date DATE,                                      │
│      PRIMARY KEY (student_id, course_id)                        │
│  );                                                             │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Subtle update anomaly: Change course name → update all rows  │
│  • Data inconsistency: "CS101" shows different names per student│
│  • Hidden bugs: Queries assume course_name is consistent        │
│  • Reporting errors: COUNT(DISTINCT course_name) != COUNT(course_id) │
│                                                                 │
│  WHY IT'S MISSED:                                               │
│  • Developers focus on functional correctness                   │
│  • Works fine with small data                                   │
│  • Only breaks at scale with concurrent updates                 │
│                                                                 │
│  FIX: Move course_name to courses table (proper 2NF)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern 4: Denormalization Without Sync Strategy

┌─────────────────────────────────────────────────────────────────┐
│          ORPHANED DENORMALIZATION ANTI-PATTERN                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  -- Added for "performance" without thinking about maintenance  │
│  ALTER TABLE orders ADD COLUMN customer_name VARCHAR(200);      │
│  UPDATE orders o SET customer_name = (                          │
│      SELECT name FROM customers c WHERE c.id = o.customer_id    │
│  );                                                             │
│  -- And then... nothing. No triggers, no sync, no CDC.          │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Data drift: customer_name in orders becomes stale            │
│  • Silent corruption: No errors, just wrong data                │
│  • Trust erosion: "Which name is correct?"                      │
│  • Audit failures: Cannot explain data discrepancies            │
│  • Debugging hell: "Why does the order show old name?"          │
│                                                                 │
│  SYNC OPTIONS (Pick One):                                       │
│  1. Database trigger: Automatic, but adds write overhead        │
│  2. Application sync: Controlled, but requires discipline       │
│  3. CDC pipeline: Async, eventual consistency acceptable        │
│  4. Materialized view: Query-time computation, fresh data       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔄 Migration Notes

Normalizing Legacy Tables

┌─────────────────────────────────────────────────────────────────┐
│               LEGACY NORMALIZATION MIGRATION                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCENARIO: Legacy denormalized table → Normalized schema        │
│                                                                 │
│  STEP 1: ANALYSIS                                               │
│  • Identify functional dependencies                             │
│  • Map current column usage in application code                 │
│  • Estimate data volume và migration time                       │
│                                                                 │
│  STEP 2: CREATE NEW TABLES (Expand)                             │
│  CREATE TABLE customers_new (                                   │
│      customer_id BIGINT PRIMARY KEY,                            │
│      name VARCHAR(200),                                         │
│      email VARCHAR(255)                                         │
│  );                                                             │
│                                                                 │
│  STEP 3: BACKFILL (Batched)                                     │
│  -- Avoid locking, process in chunks                            │
│  INSERT INTO customers_new (customer_id, name, email)           │
│  SELECT DISTINCT customer_id, customer_name, customer_email     │
│  FROM legacy_orders                                             │
│  WHERE customer_id BETWEEN :start AND :end                      │
│  ON CONFLICT (customer_id) DO NOTHING;                          │
│                                                                 │
│  STEP 4: ADD FOREIGN KEY                                        │
│  ALTER TABLE orders ADD COLUMN customer_id_new BIGINT           │
│      REFERENCES customers_new(customer_id);                     │
│  UPDATE orders SET customer_id_new = customer_id                │
│      WHERE customer_id_new IS NULL;                             │
│                                                                 │
│  STEP 5: UPDATE APPLICATION                                     │
│  -- Dual-write period: write to both old và new columns         │
│  -- Read from new, fallback to old                              │
│                                                                 │
│  STEP 6: CLEANUP (Contract)                                     │
│  -- After deprecation period                                    │
│  ALTER TABLE orders DROP COLUMN customer_name;                  │
│  ALTER TABLE orders DROP COLUMN customer_email;                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Backfill Strategies

StrategyUse CaseTrade-off
Full scan batchSmall tables (<1M rows)Simple, but locks table
Chunked by ID rangeLarge tables with sequential IDsNo locks, predictable progress
CDC-basedVery large tables, zero downtimeComplex setup, eventual consistency
Shadow table swapCritical tables, must be atomicDoubles storage temporarily

Handling Data Conflicts During Normalization

sql
-- Problem: Legacy data has inconsistent duplicates
-- customer_id=123 has customer_name='Alice' in some rows, 'Alice Smith' in others

-- Solution 1: Latest wins
INSERT INTO customers_new (customer_id, name)
SELECT customer_id, name
FROM (
    SELECT customer_id, customer_name as name,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) as rn
    FROM legacy_orders
) t WHERE rn = 1;

-- Solution 2: Most frequent wins
INSERT INTO customers_new (customer_id, name)
SELECT customer_id, name
FROM (
    SELECT customer_id, customer_name as name,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY cnt DESC) as rn
    FROM (
        SELECT customer_id, customer_name, COUNT(*) as cnt
        FROM legacy_orders
        GROUP BY customer_id, customer_name
    ) t1
) t2 WHERE rn = 1;

-- Solution 3: Flag for manual review
INSERT INTO customers_new (customer_id, name, needs_review)
SELECT customer_id, MIN(name), COUNT(DISTINCT name) > 1
FROM legacy_orders
GROUP BY customer_id;

💡 Migration Best Practice

Luôn dry-run migration trên production clone trước khi chạy thật. Measure:

  • Execution time
  • Lock duration
  • Index rebuild time
  • Rollback procedure

📎 Cross-References