📊 OLAP/Dimensional Modeling

Level: Core Solves: Thiết kế data models tối ưu cho Online Analytical Processing và data warehousing

Dimensional Modeling Fundamentals

💡 Giáo sư Tom

Dimensional modeling là nghệ thuật của việc denormalize có chủ đích. Bạn trade storage và write complexity để lấy query performance và business user friendliness. Ralph Kimball đã đúng - business users nghĩ theo dimensions và measures.

Star Schema Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    STAR SCHEMA                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                    ┌─────────────┐                              │
│                    │ dim_product │                              │
│                    │─────────────│                              │
│                    │ product_key │                              │
│                    │ product_id  │                              │
│                    │ name        │                              │
│                    │ category    │                              │
│                    │ brand       │                              │
│                    └──────┬──────┘                              │
│                           │                                     │
│  ┌─────────────┐    ┌─────┴─────┐    ┌─────────────┐            │
│  │ dim_customer│    │fact_sales │    │  dim_date   │            │
│  │─────────────│    │───────────│    │─────────────│            │
│  │customer_key │◄───│date_key   │───►│ date_key    │            │
│  │ customer_id │    │customer_key│   │ full_date   │            │
│  │ name        │    │product_key │   │ year        │            │
│  │ segment     │    │store_key   │   │ quarter     │            │
│  │ region      │    │───────────│    │ month       │            │
│  └─────────────┘    │quantity   │    │ week        │            │
│                     │unit_price │    │ day_of_week │            │
│  ┌─────────────┐    │total_amount│   └─────────────┘            │
│  │  dim_store  │    │discount   │                               │
│  │─────────────│    └───────────┘                               │
│  │ store_key   │◄────────┘                                      │
│  │ store_id    │                                                │
│  │ store_name  │                                                │
│  │ city        │                                                │
│  │ country     │                                                │
│  └─────────────┘                                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Fact Tables

Fact Table Types

Type	Description	Example
Transaction	One row per event	fact_sales, fact_orders
Periodic Snapshot	One row per period	fact_daily_inventory
Accumulating Snapshot	One row per lifecycle	fact_order_fulfillment
Factless	Events without measures	fact_student_attendance

Grain Definition

⚠️ Grain is Everything

Grain là quyết định quan trọng nhất trong dimensional modeling. Sai grain = sai mọi thứ. Luôn declare grain trước khi design.

┌─────────────────────────────────────────────────────────────────┐
│                    GRAIN EXAMPLES                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  FACT TABLE              GRAIN STATEMENT                        │
│  ──────────              ───────────────                        │
│                                                                 │
│  fact_sales              One row per line item per transaction  │
│                                                                 │
│  fact_daily_inventory    One row per product per store per day  │
│                                                                 │
│  fact_order_fulfillment  One row per order (tracks lifecycle)   │
│                                                                 │
│  fact_page_views         One row per page view event            │
│                                                                 │
│                                                                 │
│  GRAIN DETERMINES:                                              │
│  • Which dimensions can be attached                             │
│  • What measures make sense                                     │
│  • How to aggregate correctly                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Fact Table Design

sql

-- Transaction Fact Table
CREATE TABLE fact_sales (
    -- Surrogate keys to dimensions
    date_key        INTEGER NOT NULL REFERENCES dim_date(date_key),
    customer_key    INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    product_key     INTEGER NOT NULL REFERENCES dim_product(product_key),
    store_key       INTEGER NOT NULL REFERENCES dim_store(store_key),
    
    -- Degenerate dimension (no separate dim table)
    transaction_id  VARCHAR(50) NOT NULL,
    
    -- Measures (facts)
    quantity        INTEGER NOT NULL,
    unit_price      DECIMAL(10, 2) NOT NULL,
    discount_amount DECIMAL(10, 2) DEFAULT 0,
    total_amount    DECIMAL(12, 2) NOT NULL,
    
    -- Composite primary key
    PRIMARY KEY (date_key, customer_key, product_key, store_key, transaction_id)
);

-- Periodic Snapshot Fact Table
CREATE TABLE fact_daily_inventory (
    date_key        INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key     INTEGER NOT NULL REFERENCES dim_product(product_key),
    store_key       INTEGER NOT NULL REFERENCES dim_store(store_key),
    
    -- Snapshot measures
    quantity_on_hand    INTEGER NOT NULL,
    quantity_on_order   INTEGER NOT NULL,
    reorder_point       INTEGER NOT NULL,
    
    PRIMARY KEY (date_key, product_key, store_key)
);

Dimension Tables

Dimension Design Principles

sql

-- Well-designed dimension table
CREATE TABLE dim_customer (
    -- Surrogate key (warehouse-generated)
    customer_key    SERIAL PRIMARY KEY,
    
    -- Natural key (from source system)
    customer_id     VARCHAR(50) NOT NULL,
    
    -- Descriptive attributes
    first_name      VARCHAR(100),
    last_name       VARCHAR(100),
    email           VARCHAR(255),
    phone           VARCHAR(50),
    
    -- Hierarchies
    city            VARCHAR(100),
    state           VARCHAR(100),
    country         VARCHAR(100),
    region          VARCHAR(50),
    
    -- Derived attributes
    customer_segment    VARCHAR(50),  -- Gold, Silver, Bronze
    lifetime_value_tier VARCHAR(20),
    
    -- SCD metadata
    effective_date  DATE NOT NULL,
    expiration_date DATE,
    is_current      BOOLEAN DEFAULT TRUE,
    
    -- Audit
    source_system   VARCHAR(50),
    load_timestamp  TIMESTAMP DEFAULT NOW()
);

Surrogate Keys vs Natural Keys

┌─────────────────────────────────────────────────────────────────┐
│              SURROGATE vs NATURAL KEYS                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SURROGATE KEY (Recommended)        NATURAL KEY                 │
│  ────────────────────────           ───────────                 │
│                                                                 │
│  ✅ Integer, compact                ❌ Often VARCHAR, larger    │
│  ✅ Stable (never changes)          ❌ May change in source     │
│  ✅ Enables SCD tracking            ❌ Hard to track history    │
│  ✅ Source-system independent       ❌ Tied to source system    │
│  ✅ Better JOIN performance         ❌ Slower JOINs             │
│                                                                 │
│  PATTERN:                                                       │
│  • Surrogate key: customer_key (INTEGER, auto-generated)        │
│  • Natural key: customer_id (VARCHAR, from source)              │
│  • Both stored in dimension table                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Slowly Changing Dimensions (SCD)

SCD Types Overview

Type	Strategy	History	Use Case
Type 0	Retain original	No changes	Fixed attributes
Type 1	Overwrite	No history	Corrections only
Type 2	Add new row	Full history	Track changes
Type 3	Add column	Limited history	Previous value only
Type 4	Mini-dimension	Separate table	Rapidly changing
Type 6	Hybrid (1+2+3)	Full + current	Best of all worlds

SCD Type 2 Implementation

sql

-- SCD Type 2: Full history tracking
CREATE TABLE dim_customer_scd2 (
    customer_key    SERIAL PRIMARY KEY,
    customer_id     VARCHAR(50) NOT NULL,  -- Natural key
    
    -- Attributes that may change
    name            VARCHAR(200),
    email           VARCHAR(255),
    address         VARCHAR(500),
    segment         VARCHAR(50),
    
    -- SCD Type 2 metadata
    effective_date  DATE NOT NULL,
    expiration_date DATE DEFAULT '9999-12-31',
    is_current      BOOLEAN DEFAULT TRUE,
    version         INTEGER DEFAULT 1
);

-- Example data showing history
-- customer_key | customer_id | name  | segment | effective_date | expiration_date | is_current
-- 1            | C001        | Alice | Bronze  | 2023-01-01     | 2023-06-30      | FALSE
-- 2            | C001        | Alice | Silver  | 2023-07-01     | 2024-01-31      | FALSE  
-- 3            | C001        | Alice | Gold    | 2024-02-01     | 9999-12-31      | TRUE

SCD Type 2 ETL Pattern

sql

-- Step 1: Expire existing current records that have changes
UPDATE dim_customer_scd2 d
SET 
    expiration_date = CURRENT_DATE - 1,
    is_current = FALSE
FROM staging_customers s
WHERE d.customer_id = s.customer_id
  AND d.is_current = TRUE
  AND (d.name != s.name OR d.segment != s.segment);

-- Step 2: Insert new versions for changed records
INSERT INTO dim_customer_scd2 (customer_id, name, email, segment, effective_date)
SELECT 
    s.customer_id,
    s.name,
    s.email,
    s.segment,
    CURRENT_DATE
FROM staging_customers s
JOIN dim_customer_scd2 d ON s.customer_id = d.customer_id
WHERE d.expiration_date = CURRENT_DATE - 1;

-- Step 3: Insert completely new customers
INSERT INTO dim_customer_scd2 (customer_id, name, email, segment, effective_date)
SELECT 
    s.customer_id,
    s.name,
    s.email,
    s.segment,
    CURRENT_DATE
FROM staging_customers s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_customer_scd2 d WHERE d.customer_id = s.customer_id
);

Advanced Patterns

Snowflake Schema

┌─────────────────────────────────────────────────────────────────┐
│                   SNOWFLAKE SCHEMA                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐    ┌─────────────┐    ┌───────────┐               │
│  │dim_brand │◄───│ dim_product │◄───│fact_sales │               │
│  └──────────┘    └──────┬──────┘    └───────────┘               │
│                         │                                       │
│                         ▼                                       │
│                  ┌─────────────┐                                │
│                  │dim_category │                                │
│                  └──────┬──────┘                                │
│                         │                                       │
│                         ▼                                       │
│                  ┌─────────────┐                                │
│                  │dim_subcategory│                              │
│                  └─────────────┘                                │
│                                                                 │
│  TRADE-OFFS:                                                    │
│  ✅ Less storage (normalized dimensions)                        │
│  ✅ Easier dimension maintenance                                │
│  ❌ More JOINs = slower queries                                 │
│  ❌ More complex for business users                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Conformed Dimensions

┌─────────────────────────────────────────────────────────────────┐
│                 CONFORMED DIMENSIONS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SHARED DIMENSION ACROSS FACT TABLES:                           │
│                                                                 │
│  ┌─────────────┐                                                │
│  │  dim_date   │◄──────┬──────────┬──────────┐                  │
│  └─────────────┘       │          │          │                  │
│                        │          │          │                  │
│                  ┌─────┴────┐ ┌───┴────┐ ┌───┴────┐             │
│                  │fact_sales│ │fact_inv│ │fact_hr │             │
│                  └──────────┘ └────────┘ └────────┘             │
│                                                                 │
│  BENEFITS:                                                      │
│  • Consistent reporting across subject areas                    │
│  • Drill-across queries possible                                │
│  • Single source of truth for dimension                         │
│                                                                 │
│  REQUIREMENTS:                                                  │
│  • Same surrogate keys                                          │
│  • Same attributes and hierarchies                              │
│  • Same grain                                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Date Dimension

sql

-- Essential date dimension
CREATE TABLE dim_date (
    date_key        INTEGER PRIMARY KEY,  -- YYYYMMDD format
    full_date       DATE NOT NULL UNIQUE,
    
    -- Calendar attributes
    year            INTEGER NOT NULL,
    quarter         INTEGER NOT NULL,
    month           INTEGER NOT NULL,
    month_name      VARCHAR(20) NOT NULL,
    week_of_year    INTEGER NOT NULL,
    day_of_month    INTEGER NOT NULL,
    day_of_week     INTEGER NOT NULL,
    day_name        VARCHAR(20) NOT NULL,
    
    -- Fiscal calendar (adjust for your fiscal year)
    fiscal_year     INTEGER NOT NULL,
    fiscal_quarter  INTEGER NOT NULL,
    fiscal_month    INTEGER NOT NULL,
    
    -- Flags
    is_weekend      BOOLEAN NOT NULL,
    is_holiday      BOOLEAN DEFAULT FALSE,
    holiday_name    VARCHAR(100),
    
    -- Relative flags (useful for filtering)
    is_current_day  BOOLEAN DEFAULT FALSE,
    is_current_week BOOLEAN DEFAULT FALSE,
    is_current_month BOOLEAN DEFAULT FALSE
);

-- Generate date dimension (PostgreSQL)
INSERT INTO dim_date
SELECT 
    TO_CHAR(d, 'YYYYMMDD')::INTEGER as date_key,
    d as full_date,
    EXTRACT(YEAR FROM d) as year,
    EXTRACT(QUARTER FROM d) as quarter,
    EXTRACT(MONTH FROM d) as month,
    TO_CHAR(d, 'Month') as month_name,
    EXTRACT(WEEK FROM d) as week_of_year,
    EXTRACT(DAY FROM d) as day_of_month,
    EXTRACT(DOW FROM d) as day_of_week,
    TO_CHAR(d, 'Day') as day_name,
    -- Fiscal year starting April
    CASE WHEN EXTRACT(MONTH FROM d) >= 4 
         THEN EXTRACT(YEAR FROM d) 
         ELSE EXTRACT(YEAR FROM d) - 1 END as fiscal_year,
    -- ... more calculations
    EXTRACT(DOW FROM d) IN (0, 6) as is_weekend
FROM generate_series('2020-01-01'::date, '2030-12-31'::date, '1 day') as d;

🔍 Common Review Comments

⚠️ OLAP/DW Design Review Feedback

Data Warehouse designs often reveal issues only after data is loaded. These comments catch problems before they become expensive to fix.

Review Comment	Ý nghĩa	Cách khắc phục
"Grain không declared — làm sao biết aggregation correct?"	Missing grain statement → double-counting risk	Document grain FIRST, before any other design
"Dimension này nên conformed across fact tables"	Same dimension với different keys → drill-across fails	Centralize dimension definition, share surrogate keys
"SCD Type 2 không có effective_date index"	Historical queries scan entire dimension	Index (natural_key, effective_date) cho range lookups
"Fact table thiếu degenerate dimension"	Transaction ID chỉ lưu mà không model → join khó	Include transaction_id in fact, mark as degenerate dimension
"Dimension quá rộng — nên snowflake hoặc mini-dimension"	100+ columns dimension → slow ETL, excessive memory	Split rapidly changing attributes to mini-dimension
"Late-arriving fact strategy chưa defined"	Facts arrive after dimension cutoff → orphan facts	Define late-arriving dimension member pattern
"Aggregate table refresh schedule conflicts với ETL window"	Materialized views refresh during load → stale data	Coordinate refresh với ETL completion

📊 Grain & Cardinality Deep-Dive

❌ Grain Is The #1 DW Mistake

Hầu hết data warehouse failures bắt đầu từ grain definition sai. Một khi data được loaded với wrong grain, recovery tốn millions.

Grain Declaration Standards

┌─────────────────────────────────────────────────────────────────┐
│               GRAIN DECLARATION TEMPLATE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  FACT TABLE: fact_sales                                         │
│                                                                 │
│  GRAIN STATEMENT (must answer all questions):                   │
│  ─────────────────────────────────────────────                  │
│  "One row represents: One line item on one transaction          │
│                       at one store, by one customer,            │
│                       on one date, for one product"             │
│                                                                 │
│  GRAIN COMPONENTS:                                              │
│  • WHAT is being measured? → Line item (not order, not product) │
│  • WHEN? → Transaction timestamp (not day, not week)            │
│  • WHERE? → Store location                                      │
│  • WHO? → Customer (if known)                                   │
│  • HOW OFTEN? → Per transaction                                 │
│                                                                 │
│  ADDITIVE MEASURES (can SUM across all dimensions):             │
│  • quantity, revenue, discount                                  │
│                                                                 │
│  SEMI-ADDITIVE (can SUM across some dimensions):                │
│  • inventory_on_hand (cannot SUM across time)                   │
│                                                                 │
│  NON-ADDITIVE (cannot SUM):                                     │
│  • unit_price, ratio, percentage                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Cardinality Explosion Warning

┌─────────────────────────────────────────────────────────────────┐
│               CARDINALITY EXPLOSION                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCENARIO: E-commerce data warehouse                            │
│                                                                 │
│  Dimensions:                                                    │
│  • dim_date: 3,650 rows (10 years)                              │
│  • dim_customer: 10,000,000 rows                                │
│  • dim_product: 500,000 rows                                    │
│  • dim_store: 1,000 rows                                        │
│                                                                 │
│  POTENTIAL COMBINATIONS:                                        │
│  3,650 × 10M × 500K × 1K = 18,250,000,000,000,000,000 rows      │
│                            (18 quintillion!)                    │
│                                                                 │
│  REALITY: Most combinations don't occur                         │
│  → Actual fact table: ~500M rows (sparse matrix)                │
│                                                                 │
│  PROBLEMS WHEN GRAIN IS WRONG:                                  │
│  • Too coarse: Loss of detail, cannot drill down                │
│  • Too fine: Storage explosion, slow queries                    │
│  • Misaligned: JOIN creates Cartesian product                   │
│                                                                 │
│  REAL INCIDENT:                                                 │
│  Warehouse defined grain as "order" but stored "line items"     │
│  → Report showed $500M revenue instead of $100M                 │
│  → CEO made investment decision on wrong data                   │
│  → 3-month restatement project                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

⚠️ Anti-patterns

❌ Data Warehouse Design Mistakes

Những anti-patterns này không visible ngay — chúng emerge từ từ qua years của accumulated bad data.

Anti-pattern 1: Wrong Grain Definition

┌─────────────────────────────────────────────────────────────────┐
│               WRONG GRAIN ANTI-PATTERN                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  -- Claimed grain: "One row per order"                          │
│  -- Actual data: One row per order LINE ITEM                    │
│                                                                 │
│  fact_sales:                                                    │
│  order_id │ product │ quantity │ revenue                        │
│  1        │ iPhone  │ 1        │ 999                            │
│  1        │ Case    │ 2        │ 50                             │
│  1        │ Charger │ 1        │ 29                             │
│                                                                 │
│  ❌ WRONG QUERY (assumes order grain):                          │
│  SELECT COUNT(order_id) FROM fact_sales → Returns 3, not 1!     │
│  SELECT SUM(revenue) → Correct by accident                      │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • COUNT inflated: Every report shows wrong counts              │
│  • AVG wrong: Average order value divided by line items         │
│  • JOIN issues: Each order JOINs multiple times                 │
│  • Trust destroyed: Users stop believing warehouse              │
│                                                                 │
│  FIX:                                                           │
│  1. ACKNOWLEDGE actual grain (line items)                       │
│  2. Create proper order-level aggregate (if needed)             │
│  3. Update all reports to use DISTINCT counts                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern 2: Missing Conformed Dimensions

┌─────────────────────────────────────────────────────────────────┐
│           NON-CONFORMED DIMENSION ANTI-PATTERN                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  -- Sales team: dim_customer with customer_key = 1,2,3...       │
│  -- Support team: dim_customer with customer_key = 101,102...   │
│  -- Marketing: dim_customer with customer_id (natural key)      │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Cannot combine: "How many customers bought AND opened ticket?"│
│  • Data silos: Each team has different truth                    │
│  • Drill-across fails: Cannot navigate between fact tables      │
│  • Reconciliation nightmare: Who has the right customer count?  │
│                                                                 │
│  REAL INCIDENT:                                                 │
│  Sales reported 50K active customers                            │
│  Marketing reported 75K active customers                        │
│  CEO asked "Which is right?" — Neither could answer             │
│                                                                 │
│  FIX:                                                           │
│  1. Create ONE conformed customer dimension                     │
│  2. All fact tables use same customer_key                       │
│  3. Single source of truth for customer attributes              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern 3: SCD Type 2 Without Partition Pruning

┌─────────────────────────────────────────────────────────────────┐
│           SLOW SCD TYPE 2 ANTI-PATTERN                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  dim_customer_scd2: 100M rows (10M customers × 10 versions avg) │
│                                                                 │
│  -- Query current customers:                                    │
│  SELECT * FROM dim_customer_scd2 WHERE is_current = TRUE        │
│  -- Scans ALL 100M rows to find 10M current!                    │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Full table scans: Every query touches all history            │
│  • Slow JOINs: Fact-to-dimension JOIN is expensive              │
│  • ETL bottleneck: SCD processing takes hours                   │
│  • Index bloat: B-tree on is_current has poor selectivity       │
│                                                                 │
│  BETTER APPROACHES:                                             │
│                                                                 │
│  1. PARTIAL INDEX:                                              │
│  CREATE INDEX idx_current ON dim_customer                       │
│      WHERE is_current = TRUE;                                   │
│                                                                 │
│  2. PARTITIONING:                                               │
│  PARTITION BY LIST (is_current)                                 │
│  -- Current partition is small, history partition is large      │
│                                                                 │
│  3. SEPARATE TABLES:                                            │
│  dim_customer_current: 10M rows (active)                        │
│  dim_customer_history: 90M rows (archived)                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern 4: Ignoring Late-Arriving Facts/Dimensions

┌─────────────────────────────────────────────────────────────────┐
│           LATE-ARRIVING DATA ANTI-PATTERN                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  -- ETL runs at midnight                                        │
│  -- Fact arrives: order_date = 2024-01-15, customer_id = 999    │
│  -- But customer 999 was created at 2024-01-15 11:00 PM         │
│  -- Dimension ETL already ran → customer 999 not in dim yet!    │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Orphan facts: Facts reference non-existent dimension keys    │
│  • Failed loads: FK constraint violations                       │
│  • Missing data: Facts dropped silently                         │
│  • Wrong attribution: Facts assigned to "Unknown" member        │
│                                                                 │
│  PATTERNS FOR HANDLING:                                         │
│                                                                 │
│  1. INFERRED DIMENSION MEMBER:                                  │
│     Insert placeholder: customer_key=999, name="Unknown"        │
│     Update when real data arrives                               │
│                                                                 │
│  2. LATE-ARRIVING FACT TABLE:                                   │
│     Stage late facts separately                                 │
│     Reprocess when dimensions catch up                          │
│                                                                 │
│  3. MICRO-BATCH ETL:                                            │
│     Run dimension loads more frequently                         │
│     Reduce window for late-arriving scenarios                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔄 Migration Notes

SCD Type 2 Backfill

sql

-- Scenario: Converting Type 1 to Type 2 (adding history)

-- Step 1: Add SCD columns
ALTER TABLE dim_customer ADD COLUMN effective_date DATE;
ALTER TABLE dim_customer ADD COLUMN expiration_date DATE DEFAULT '9999-12-31';
ALTER TABLE dim_customer ADD COLUMN is_current BOOLEAN DEFAULT TRUE;
ALTER TABLE dim_customer ADD COLUMN version INTEGER DEFAULT 1;

-- Step 2: Set effective_date for existing records
UPDATE dim_customer 
SET effective_date = COALESCE(created_at::date, '2020-01-01');

-- Step 3: Create partial index for current records
CREATE INDEX idx_dim_customer_current ON dim_customer(customer_id) 
WHERE is_current = TRUE;

-- Step 4: Update ETL to handle SCD Type 2 logic going forward

Late-Arriving Dimension Handling

sql

-- Create placeholder for unknown dimension members
INSERT INTO dim_customer (customer_key, customer_id, name, is_inferred)
VALUES (-1, 'UNKNOWN', 'Unknown Customer', FALSE);

-- Insert inferred member when fact references unknown customer
INSERT INTO dim_customer (customer_key, customer_id, name, is_inferred)
SELECT 
    nextval('dim_customer_seq'),
    f.customer_id,
    'Inferred - Pending Update',
    TRUE
FROM staging_facts f
LEFT JOIN dim_customer d ON f.customer_id = d.customer_id
WHERE d.customer_key IS NULL
ON CONFLICT (customer_id) DO NOTHING;

-- Later: Update inferred members when real data arrives
UPDATE dim_customer d
SET 
    name = s.name,
    email = s.email,
    is_inferred = FALSE
FROM staging_customers s
WHERE d.customer_id = s.customer_id
  AND d.is_inferred = TRUE;

Aggregate Table Refresh Strategy

Strategy	Use Case	Trade-off
Full rebuild	Small aggregates (<1M rows)	Simple, but expensive
Incremental	Large aggregates, additive measures	Complex, but fast
Partition swap	Date-partitioned aggregates	Atomic, no downtime
Materialized view	Complex JOINs, need freshness	DB-managed, but refresh overhead

📎 Cross-References

📎 OLTP Modeling - So sánh với transactional modeling
📎 Normalization - Khi nào denormalize
📎 Schema Evolution - SCD migration strategies
📎 SQL Window Functions - Analytics queries
📎 SQL Aggregation - GROUP BY patterns cho OLAP

📊 OLAP/Dimensional Modeling ​

Dimensional Modeling Fundamentals ​

Star Schema Architecture ​

Fact Tables ​

Fact Table Types ​

Grain Definition ​

Fact Table Design ​

Dimension Tables ​

Dimension Design Principles ​

Surrogate Keys vs Natural Keys ​

Slowly Changing Dimensions (SCD) ​

SCD Types Overview ​

SCD Type 2 Implementation ​

SCD Type 2 ETL Pattern ​

Advanced Patterns ​

Snowflake Schema ​

Conformed Dimensions ​

Date Dimension ​

🔍 Common Review Comments ​

📊 Grain & Cardinality Deep-Dive ​

Grain Declaration Standards ​

Cardinality Explosion Warning ​

⚠️ Anti-patterns ​

Anti-pattern 1: Wrong Grain Definition ​

Anti-pattern 2: Missing Conformed Dimensions ​

Anti-pattern 3: SCD Type 2 Without Partition Pruning ​

Anti-pattern 4: Ignoring Late-Arriving Facts/Dimensions ​

🔄 Migration Notes ​

SCD Type 2 Backfill ​

Late-Arriving Dimension Handling ​

Aggregate Table Refresh Strategy ​

📎 Cross-References ​