Giao diện
🔄 Normalization
Level: Foundation Solves: Hiểu và áp dụng normalization để loại bỏ data redundancy và anomalies
Tại sao cần Normalization?
💡 Giáo sư Tom
Normalization không phải là academic exercise. Đó là cách bạn tránh được những đêm thức trắng debug data inconsistencies. Mỗi violation của normal form là một bug waiting to happen.
Data Anomalies
┌─────────────────────────────────────────────────────────────────┐
│ DATA ANOMALIES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ UNNORMALIZED TABLE: student_courses │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ student_id │ student_name │ course_id │ course_name │ │
│ ├────────────┼──────────────┼───────────┼─────────────────┤ │
│ │ 1 │ Alice │ CS101 │ Intro to CS │ │
│ │ 1 │ Alice │ CS102 │ Data Structures │ │
│ │ 2 │ Bob │ CS101 │ Intro to CS │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ANOMALIES: │
│ │
│ ❌ UPDATE ANOMALY: │
│ Đổi tên "Alice" → phải update nhiều rows │
│ Quên 1 row → data inconsistent │
│ │
│ ❌ INSERT ANOMALY: │
│ Thêm course mới → cần student (không có student nào) │
│ Không thể lưu course độc lập │
│ │
│ ❌ DELETE ANOMALY: │
│ Xóa Bob → mất luôn thông tin CS101 nếu Bob là student cuối │
│ Mất data không liên quan │
│ │
└─────────────────────────────────────────────────────────────────┘Normal Forms Overview
First Normal Form (1NF)
Rules
- Atomic values: Mỗi cell chứa một giá trị duy nhất
- No repeating groups: Không có arrays hay nested structures
- Unique rows: Có primary key để identify mỗi row
Ví dụ: Violation và Fix
┌─────────────────────────────────────────────────────────────────┐
│ 1NF VIOLATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ BEFORE (Violates 1NF): │
│ ┌──────────┬─────────────────────────────────────┐ │
│ │ order_id │ products │ │
│ ├──────────┼─────────────────────────────────────┤ │
│ │ 1 │ iPhone, MacBook, AirPods │ │
│ │ 2 │ iPad │ │
│ └──────────┴─────────────────────────────────────┘ │
│ │
│ ✅ AFTER (1NF Compliant): │
│ ┌──────────┬────────────┐ │
│ │ order_id │ product │ │
│ ├──────────┼────────────┤ │
│ │ 1 │ iPhone │ │
│ │ 1 │ MacBook │ │
│ │ 1 │ AirPods │ │
│ │ 2 │ iPad │ │
│ └──────────┴────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘Second Normal Form (2NF)
Rules
- Must be in 1NF
- No partial dependencies: Non-key attributes phải depend on entire primary key
Ví dụ: Partial Dependency
┌─────────────────────────────────────────────────────────────────┐
│ 2NF VIOLATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ BEFORE (Violates 2NF): │
│ Table: order_items │
│ PK: (order_id, product_id) │
│ ┌──────────┬────────────┬──────────────┬───────────┐ │
│ │ order_id │ product_id │ product_name │ quantity │ │
│ ├──────────┼────────────┼──────────────┼───────────┤ │
│ │ 1 │ P001 │ iPhone │ 2 │ │
│ │ 1 │ P002 │ MacBook │ 1 │ │
│ │ 2 │ P001 │ iPhone │ 1 │ │
│ └──────────┴────────────┴──────────────┴───────────┘ │
│ │
│ PROBLEM: product_name depends only on product_id │
│ (partial dependency on composite key) │
│ │
│ ✅ AFTER (2NF Compliant): │
│ │
│ Table: order_items Table: products │
│ ┌──────────┬────────────┬───────────┐ ┌────────────┬────────┐ │
│ │ order_id │ product_id │ quantity │ │ product_id │ name │ │
│ ├──────────┼────────────┼───────────┤ ├────────────┼────────┤ │
│ │ 1 │ P001 │ 2 │ │ P001 │ iPhone │ │
│ │ 1 │ P002 │ 1 │ │ P002 │ MacBook│ │
│ │ 2 │ P001 │ 1 │ └────────────┴────────┘ │
│ └──────────┴────────────┴───────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘Third Normal Form (3NF)
Rules
- Must be in 2NF
- No transitive dependencies: Non-key attributes không depend on other non-key attributes
Ví dụ: Transitive Dependency
┌─────────────────────────────────────────────────────────────────┐
│ 3NF VIOLATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ BEFORE (Violates 3NF): │
│ Table: employees │
│ PK: employee_id │
│ ┌─────────────┬──────────┬─────────────┬───────────────┐ │
│ │ employee_id │ name │ dept_id │ dept_name │ │
│ ├─────────────┼──────────┼─────────────┼───────────────┤ │
│ │ E001 │ Alice │ D01 │ Engineering │ │
│ │ E002 │ Bob │ D01 │ Engineering │ │
│ │ E003 │ Carol │ D02 │ Marketing │ │
│ └─────────────┴──────────┴─────────────┴───────────────┘ │
│ │
│ PROBLEM: dept_name depends on dept_id (not on employee_id) │
│ employee_id → dept_id → dept_name (transitive) │
│ │
│ ✅ AFTER (3NF Compliant): │
│ │
│ Table: employees Table: departments │
│ ┌─────────────┬────────┬─────────┐ ┌─────────┬─────────────┐ │
│ │ employee_id │ name │ dept_id │ │ dept_id │ dept_name │ │
│ ├─────────────┼────────┼─────────┤ ├─────────┼─────────────┤ │
│ │ E001 │ Alice │ D01 │ │ D01 │ Engineering │ │
│ │ E002 │ Bob │ D01 │ │ D02 │ Marketing │ │
│ │ E003 │ Carol │ D02 │ └─────────┴─────────────┘ │
│ └─────────────┴────────┴─────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘Boyce-Codd Normal Form (BCNF)
Rules
- Must be in 3NF
- Every determinant is a candidate key: Nếu X → Y, thì X phải là superkey
Ví dụ: 3NF nhưng không BCNF
┌─────────────────────────────────────────────────────────────────┐
│ BCNF VIOLATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO: Students, Subjects, Teachers │
│ - Each subject is taught by one teacher │
│ - Each teacher teaches only one subject │
│ - Students can take multiple subjects │
│ │
│ ❌ BEFORE (3NF but not BCNF): │
│ Table: student_subjects │
│ PK: (student_id, subject) │
│ ┌────────────┬───────────┬─────────┐ │
│ │ student_id │ subject │ teacher │ │
│ ├────────────┼───────────┼─────────┤ │
│ │ S001 │ Math │ T01 │ │
│ │ S001 │ Physics │ T02 │ │
│ │ S002 │ Math │ T01 │ │
│ └────────────┴───────────┴─────────┘ │
│ │
│ PROBLEM: teacher → subject (teacher determines subject) │
│ But teacher is not a superkey! │
│ │
│ ✅ AFTER (BCNF Compliant): │
│ │
│ Table: student_teachers Table: teacher_subjects │
│ ┌────────────┬─────────┐ ┌─────────┬───────────┐ │
│ │ student_id │ teacher │ │ teacher │ subject │ │
│ ├────────────┼─────────┤ ├─────────┼───────────┤ │
│ │ S001 │ T01 │ │ T01 │ Math │ │
│ │ S001 │ T02 │ │ T02 │ Physics │ │
│ │ S002 │ T01 │ └─────────┴───────────┘ │
│ └────────────┴─────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘Denormalization Trade-offs
⚠️ Denormalization là Optimization
Chỉ denormalize khi bạn có performance problems đã được đo lường. Premature denormalization là root of all evil trong data modeling.
Khi nào Denormalize?
| Scenario | Denormalization Strategy | Trade-off |
|---|---|---|
| Read-heavy workloads | Duplicate data để tránh JOINs | Storage ↑, Write complexity ↑ |
| Reporting/Analytics | Pre-aggregated tables | Data freshness ↓ |
| Caching layer | Materialized views | Maintenance overhead ↑ |
| Distributed systems | Embed related data | Consistency complexity ↑ |
Denormalization Patterns
┌─────────────────────────────────────────────────────────────────┐
│ DENORMALIZATION PATTERNS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. DUPLICATE COLUMNS │
│ orders.customer_name (copy from customers) │
│ → Avoid JOIN for common queries │
│ │
│ 2. PRE-COMPUTED AGGREGATES │
│ products.review_count, products.avg_rating │
│ → Avoid COUNT/AVG on every query │
│ │
│ 3. MATERIALIZED VIEWS │
│ CREATE MATERIALIZED VIEW daily_sales AS ... │
│ → Pre-computed complex queries │
│ │
│ 4. SUMMARY TABLES │
│ monthly_revenue, user_activity_summary │
│ → Historical aggregations │
│ │
└─────────────────────────────────────────────────────────────────┘Maintaining Denormalized Data
sql
-- Option 1: Triggers (synchronous)
CREATE TRIGGER update_order_customer_name
AFTER UPDATE ON customers
FOR EACH ROW
UPDATE orders SET customer_name = NEW.name
WHERE customer_id = NEW.customer_id;
-- Option 2: Application logic (controlled)
-- Update both tables in same transaction
-- Option 3: Async sync (eventual consistency)
-- Use CDC (Change Data Capture) to syncQuick Reference: Normal Forms
| Form | Rule | Eliminates |
|---|---|---|
| 1NF | Atomic values, no repeating groups | Repeating groups |
| 2NF | 1NF + No partial dependencies | Partial dependencies |
| 3NF | 2NF + No transitive dependencies | Transitive dependencies |
| BCNF | Every determinant is a candidate key | Non-key determinants |
🔍 Common Review Comments
⚠️ Normalization Review Feedback
Những comments này thường xuất hiện khi review schema normalization decisions. Anticipate chúng để defend design choices của bạn.
| Review Comment | Ý nghĩa | Cách khắc phục |
|---|---|---|
| "Tại sao denormalize mà chưa có load testing?" | Denormalization là optimization, cần evidence | Benchmark first, denormalize với data-driven decision |
| "Transitive dependency này sẽ gây sync issues at scale" | Hidden dependency sẽ tạo update anomalies khi có nhiều writers | Normalize hoặc document sync strategy rõ ràng |
| "Composite key này có partial dependency" | 2NF violation, một phần key determine non-key attribute | Tách table, move attribute sang table của partial key |
| "Denormalization này không có maintenance strategy" | Duplicated data sẽ drift nếu không có sync mechanism | Define triggers, CDC, hoặc application-level sync |
| "Normalization level quá cao cho OLTP" | 5NF/6NF overkill cho transactional workload | Target 3NF/BCNF cho OLTP, document trade-off |
| "Data redundancy ở đây intentional hay bug?" | Unclear whether denormalization is by design | Comment trong schema hoặc document trong ADR |
⚠️ Anti-patterns
❌ Normalization Mistakes
Những anti-patterns này dẫn đến data inconsistency, incorrect reporting, và debugging nightmares khi system scales.
Anti-pattern 1: Premature Denormalization
┌─────────────────────────────────────────────────────────────────┐
│ PREMATURE DENORMALIZATION ANTI-PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ PROBLEM: │
│ "JOINs are slow, let's denormalize everything upfront" │
│ │
│ CREATE TABLE orders ( │
│ order_id BIGINT, │
│ customer_id BIGINT, │
│ customer_name VARCHAR(200), -- Duplicated! │
│ customer_email VARCHAR(255), -- Duplicated! │
│ product_id BIGINT, │
│ product_name VARCHAR(200), -- Duplicated! │
│ product_price DECIMAL(10,2), -- Duplicated! │
│ shipping_city VARCHAR(100), -- Duplicated! │
│ shipping_country VARCHAR(100) -- Duplicated! │
│ ); │
│ │
│ CONSEQUENCES: │
│ • Update anomalies: Customer changes name → orders inconsistent│
│ • Storage waste: Same data repeated millions of times │
│ • Write amplification: Every update touches multiple rows │
│ • No single source of truth: Which customer_name is correct? │
│ • Premature optimization: JOIN might not even be slow │
│ │
│ CORRECT APPROACH: │
│ 1. Start normalized (3NF) │
│ 2. Profile queries, identify actual bottlenecks │
│ 3. Denormalize specific hot paths với caching/materialized view│
│ 4. Document sync strategy cho denormalized data │
│ │
└─────────────────────────────────────────────────────────────────┘Anti-pattern 2: Over-Normalization for OLTP
┌─────────────────────────────────────────────────────────────────┐
│ OVER-NORMALIZATION ANTI-PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ PROBLEM: │
│ "Let's go to 5NF for maximum data integrity" │
│ │
│ -- Normalized to death: │
│ SELECT * FROM orders │
│ JOIN order_customers ON ... │
│ JOIN customer_names ON ... │
│ JOIN customer_emails ON ... │
│ JOIN order_products ON ... │
│ JOIN product_names ON ... │
│ JOIN product_prices ON ... │
│ JOIN product_price_history ON ... │
│ -- 15 JOINs for a simple order display │
│ │
│ CONSEQUENCES: │
│ • Query complexity: Simple reads become complex queries │
│ • Join overhead: CPU và I/O tăng exponentially │
│ • Developer confusion: Hard to understand data model │
│ • ORM nightmare: N+1 queries everywhere │
│ • Latency: User-facing endpoint becomes slow │
│ │
│ CORRECT APPROACH: │
│ • OLTP: Target 3NF, accept controlled redundancy │
│ • OLAP: Denormalize intentionally for query patterns │
│ • Hybrid: Normalize writes, denormalize reads (CQRS) │
│ │
└─────────────────────────────────────────────────────────────────┘Anti-pattern 3: Ignoring 2NF in Composite Keys
┌─────────────────────────────────────────────────────────────────┐
│ SUBTLE 2NF VIOLATION ANTI-PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ PROBLEM: │
│ CREATE TABLE course_enrollments ( │
│ student_id BIGINT, │
│ course_id BIGINT, │
│ course_name VARCHAR(200), -- Depends only on course_id! │
│ instructor_name VARCHAR(200),-- Depends only on course_id! │
│ enrollment_date DATE, │
│ PRIMARY KEY (student_id, course_id) │
│ ); │
│ │
│ CONSEQUENCES: │
│ • Subtle update anomaly: Change course name → update all rows │
│ • Data inconsistency: "CS101" shows different names per student│
│ • Hidden bugs: Queries assume course_name is consistent │
│ • Reporting errors: COUNT(DISTINCT course_name) != COUNT(course_id) │
│ │
│ WHY IT'S MISSED: │
│ • Developers focus on functional correctness │
│ • Works fine with small data │
│ • Only breaks at scale with concurrent updates │
│ │
│ FIX: Move course_name to courses table (proper 2NF) │
│ │
└─────────────────────────────────────────────────────────────────┘Anti-pattern 4: Denormalization Without Sync Strategy
┌─────────────────────────────────────────────────────────────────┐
│ ORPHANED DENORMALIZATION ANTI-PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ PROBLEM: │
│ -- Added for "performance" without thinking about maintenance │
│ ALTER TABLE orders ADD COLUMN customer_name VARCHAR(200); │
│ UPDATE orders o SET customer_name = ( │
│ SELECT name FROM customers c WHERE c.id = o.customer_id │
│ ); │
│ -- And then... nothing. No triggers, no sync, no CDC. │
│ │
│ CONSEQUENCES: │
│ • Data drift: customer_name in orders becomes stale │
│ • Silent corruption: No errors, just wrong data │
│ • Trust erosion: "Which name is correct?" │
│ • Audit failures: Cannot explain data discrepancies │
│ • Debugging hell: "Why does the order show old name?" │
│ │
│ SYNC OPTIONS (Pick One): │
│ 1. Database trigger: Automatic, but adds write overhead │
│ 2. Application sync: Controlled, but requires discipline │
│ 3. CDC pipeline: Async, eventual consistency acceptable │
│ 4. Materialized view: Query-time computation, fresh data │
│ │
└─────────────────────────────────────────────────────────────────┘🔄 Migration Notes
Normalizing Legacy Tables
┌─────────────────────────────────────────────────────────────────┐
│ LEGACY NORMALIZATION MIGRATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO: Legacy denormalized table → Normalized schema │
│ │
│ STEP 1: ANALYSIS │
│ • Identify functional dependencies │
│ • Map current column usage in application code │
│ • Estimate data volume và migration time │
│ │
│ STEP 2: CREATE NEW TABLES (Expand) │
│ CREATE TABLE customers_new ( │
│ customer_id BIGINT PRIMARY KEY, │
│ name VARCHAR(200), │
│ email VARCHAR(255) │
│ ); │
│ │
│ STEP 3: BACKFILL (Batched) │
│ -- Avoid locking, process in chunks │
│ INSERT INTO customers_new (customer_id, name, email) │
│ SELECT DISTINCT customer_id, customer_name, customer_email │
│ FROM legacy_orders │
│ WHERE customer_id BETWEEN :start AND :end │
│ ON CONFLICT (customer_id) DO NOTHING; │
│ │
│ STEP 4: ADD FOREIGN KEY │
│ ALTER TABLE orders ADD COLUMN customer_id_new BIGINT │
│ REFERENCES customers_new(customer_id); │
│ UPDATE orders SET customer_id_new = customer_id │
│ WHERE customer_id_new IS NULL; │
│ │
│ STEP 5: UPDATE APPLICATION │
│ -- Dual-write period: write to both old và new columns │
│ -- Read from new, fallback to old │
│ │
│ STEP 6: CLEANUP (Contract) │
│ -- After deprecation period │
│ ALTER TABLE orders DROP COLUMN customer_name; │
│ ALTER TABLE orders DROP COLUMN customer_email; │
│ │
└─────────────────────────────────────────────────────────────────┘Backfill Strategies
| Strategy | Use Case | Trade-off |
|---|---|---|
| Full scan batch | Small tables (<1M rows) | Simple, but locks table |
| Chunked by ID range | Large tables with sequential IDs | No locks, predictable progress |
| CDC-based | Very large tables, zero downtime | Complex setup, eventual consistency |
| Shadow table swap | Critical tables, must be atomic | Doubles storage temporarily |
Handling Data Conflicts During Normalization
sql
-- Problem: Legacy data has inconsistent duplicates
-- customer_id=123 has customer_name='Alice' in some rows, 'Alice Smith' in others
-- Solution 1: Latest wins
INSERT INTO customers_new (customer_id, name)
SELECT customer_id, name
FROM (
SELECT customer_id, customer_name as name,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) as rn
FROM legacy_orders
) t WHERE rn = 1;
-- Solution 2: Most frequent wins
INSERT INTO customers_new (customer_id, name)
SELECT customer_id, name
FROM (
SELECT customer_id, customer_name as name,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY cnt DESC) as rn
FROM (
SELECT customer_id, customer_name, COUNT(*) as cnt
FROM legacy_orders
GROUP BY customer_id, customer_name
) t1
) t2 WHERE rn = 1;
-- Solution 3: Flag for manual review
INSERT INTO customers_new (customer_id, name, needs_review)
SELECT customer_id, MIN(name), COUNT(DISTINCT name) > 1
FROM legacy_orders
GROUP BY customer_id;💡 Migration Best Practice
Luôn dry-run migration trên production clone trước khi chạy thật. Measure:
- Execution time
- Lock duration
- Index rebuild time
- Rollback procedure
📎 Cross-References
- 📎 Modeling Fundamentals - Conceptual/Logical/Physical modeling
- 📎 OLTP Modeling - Áp dụng normalization cho transactional systems
- 📎 OLAP Modeling - Khi nào denormalize cho analytics
- 📎 Schema Evolution - Migration strategies chi tiết
- 📎 SQL Advanced Joins - Query patterns cho normalized schemas