
🔄 Schema Evolution

Level: Ops · Solves: Managing schema changes in production without causing downtime or data loss

Schema Evolution Challenges

💡 Professor Tom

Schema evolution is where theory meets reality. You can't just "ALTER TABLE" in production against millions of rows and expect zero downtime. Every migration is a potential incident waiting to happen.

Why Schema Evolution is Hard

┌─────────────────────────────────────────────────────────────────┐
│              SCHEMA EVOLUTION CHALLENGES                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. BACKWARD COMPATIBILITY                                      │
│     Old code must work with new schema                          │
│     → Rolling deployments, blue-green                           │
│                                                                 │
│  2. FORWARD COMPATIBILITY                                       │
│     New code must work with old schema                          │
│     → Rollback scenarios                                        │
│                                                                 │
│  3. DATA MIGRATION                                              │
│     Existing data must be transformed                           │
│     → Large tables, long-running operations                     │
│                                                                 │
│  4. ZERO DOWNTIME                                               │
│     Changes must not interrupt service                          │
│     → Online DDL, shadow tables                                 │
│                                                                 │
│  5. DISTRIBUTED SYSTEMS                                         │
│     Multiple services, multiple databases                       │
│     → Coordination, eventual consistency                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Contracts

What is a Data Contract?

┌─────────────────────────────────────────────────────────────────┐
│                    DATA CONTRACT                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  A DATA CONTRACT is a formal agreement between:                 │
│  • Data Producer (writes data)                                  │
│  • Data Consumer (reads data)                                   │
│                                                                 │
│  CONTRACT INCLUDES:                                             │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Schema Definition                                        │    │
│  │ • Field names and types                                  │    │
│  │ • Required vs optional fields                            │    │
│  │ • Constraints and validations                            │    │
│  ├─────────────────────────────────────────────────────────┤    │
│  │ Semantic Meaning                                         │    │
│  │ • What each field represents                             │    │
│  │ • Units, formats, enums                                  │    │
│  │ • Business rules                                         │    │
│  ├─────────────────────────────────────────────────────────┤    │
│  │ SLA & Quality                                            │    │
│  │ • Freshness requirements                                 │    │
│  │ • Completeness expectations                              │    │
│  │ • Accuracy guarantees                                    │    │
│  ├─────────────────────────────────────────────────────────┤    │
│  │ Evolution Rules                                          │    │
│  │ • How changes are communicated                           │    │
│  │ • Deprecation policy                                     │    │
│  │ • Version compatibility                                  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Contract Example (YAML)

```yaml
# data-contracts/orders.yaml
apiVersion: datacontract/v1
kind: DataContract
metadata:
  name: orders
  version: 2.1.0
  owner: order-service-team

schema:
  type: object
  properties:
    order_id:
      type: string
      format: uuid
      description: Unique identifier for the order
      required: true
    customer_id:
      type: string
      format: uuid
      description: Reference to customer
      required: true
    status:
      type: string
      enum: [pending, confirmed, shipped, delivered, cancelled]
      description: Current order status
      required: true
    total_amount:
      type: number
      format: decimal
      minimum: 0
      description: Total order amount in USD
      required: true
    created_at:
      type: string
      format: date-time
      description: Order creation timestamp (UTC)
      required: true
    # New field in v2.1.0
    shipping_method:
      type: string
      enum: [standard, express, overnight]
      description: Selected shipping method
      required: false  # Optional for backward compatibility
      default: standard
      added_in: 2.1.0

quality:
  freshness:
    max_delay: 5 minutes
  completeness:
    required_fields: 99.9%
  accuracy:
    validation_rules:
      - total_amount >= 0
      - created_at <= now()

evolution:
  compatibility: backward
  deprecation_notice: 90 days
  breaking_change_policy: major_version_bump
```
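A contract like this can also be enforced in code at the producer or consumer boundary. Below is a minimal validation sketch mirroring the field rules from the YAML above; `validate_order` is a hypothetical helper, and a real deployment would use a full JSON Schema validator instead of hand-rolled checks:

```python
# Minimal validator for the orders contract above (a sketch, not a full
# JSON Schema implementation). Field rules mirror the YAML definition.
REQUIRED = {"order_id", "customer_id", "status", "total_amount", "created_at"}
STATUS_ENUM = {"pending", "confirmed", "shipped", "delivered", "cancelled"}

def validate_order(record: dict) -> list[str]:
    """Return a list of contract violations (empty list = valid)."""
    errors = []
    for field in REQUIRED - record.keys():
        errors.append(f"missing required field: {field}")
    if "status" in record and record["status"] not in STATUS_ENUM:
        errors.append(f"invalid status: {record['status']}")
    if isinstance(record.get("total_amount"), (int, float)) and record["total_amount"] < 0:
        errors.append("total_amount must be >= 0")
    # shipping_method is optional in v2.1.0 -> apply the default, don't fail
    record.setdefault("shipping_method", "standard")
    return errors
```

Note how the optional v2.1.0 field is defaulted rather than rejected: that is exactly what keeps the change backward compatible.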

Compatibility Rules

Schema Compatibility Types

┌─────────────────────────────────────────────────────────────────┐
│              COMPATIBILITY TYPES                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  BACKWARD COMPATIBLE (Most Common)                              │
│  New schema can read old data                                   │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ ✅ Add optional field with default                       │    │
│  │ ✅ Remove optional field                                 │    │
│  │ ✅ Widen type (int → long)                               │    │
│  │ ❌ Add required field                                    │    │
│  │ ❌ Remove required field                                 │    │
│  │ ❌ Rename field                                          │    │
│  │ ❌ Change field type incompatibly                        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  FORWARD COMPATIBLE                                             │
│  Old schema can read new data                                   │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ ✅ Add optional field (old code ignores)                 │    │
│  │ ✅ Remove optional field                                 │    │
│  │ ❌ Add required field                                    │    │
│  │ ❌ Change field type                                     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  FULL COMPATIBLE                                                │
│  Both backward AND forward compatible                           │
│  Most restrictive, safest for rolling deployments               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
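The rules in the box above can be turned into a small decision helper, which is roughly what schema registries do when they gate a change. A sketch; the change-type labels are hypothetical, not a real library API:

```python
# Classify a single schema change per the compatibility rules above.
BACKWARD_SAFE = {"add_optional_field", "remove_optional_field", "widen_type"}
FORWARD_SAFE = {"add_optional_field", "remove_optional_field"}

def compatibility(change: str) -> str:
    """Return 'full', 'backward', 'forward', or 'breaking'."""
    backward = change in BACKWARD_SAFE
    forward = change in FORWARD_SAFE
    if backward and forward:
        return "full"
    if backward:
        return "backward"
    if forward:
        return "forward"
    return "breaking"
```

For example, widening a type keeps old data readable (backward) but old readers can't handle the wider values (not forward), so `compatibility("widen_type")` returns `"backward"`.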

Safe vs Unsafe Changes

| Change Type | Backward | Forward | Recommendation |
|---|---|---|---|
| Add optional field | ✅ | ✅ | Safe, use defaults |
| Add required field | ❌ | ❌ | Make optional first |
| Remove optional field | ✅ | ✅ | Deprecate first |
| Remove required field | ❌ | ❌ | Never |
| Rename field | ❌ | ❌ | Add new, deprecate old |
| Change type (widen) | ✅ | ❌ | Usually safe |
| Change type (narrow) | ❌ | ❌ | Never |
| Add enum value | ✅ | ⚠️ | Safe for producers |
| Remove enum value | ❌ | ✅ | Dangerous |

Migration Strategies

Online Schema Migration Pattern

┌─────────────────────────────────────────────────────────────────┐
│              EXPAND-CONTRACT PATTERN                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PHASE 1: EXPAND                                                │
│  Add new column, keep old column                                │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ ALTER TABLE users ADD COLUMN full_name VARCHAR(200);    │    │
│  │ -- Both first_name/last_name AND full_name exist        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  PHASE 2: MIGRATE                                               │
│  Backfill data, update application to write both                │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ -- Backfill existing data                               │    │
│  │ UPDATE users SET full_name =                            │    │
│  │   first_name || ' ' || last_name                        │    │
│  │ WHERE full_name IS NULL;                                │    │
│  │                                                         │    │
│  │ -- Application writes to both columns                   │    │
│  │ INSERT INTO users (first_name, last_name, full_name)    │    │
│  │ VALUES ('John', 'Doe', 'John Doe');                     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  PHASE 3: CONTRACT                                              │
│  Remove old columns after all consumers migrated                │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ -- After deprecation period                             │    │
│  │ ALTER TABLE users DROP COLUMN first_name;               │    │
│  │ ALTER TABLE users DROP COLUMN last_name;                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
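The three phases above can be walked through end to end on an embedded database. This sketch uses SQLite as a stand-in for the production database (phase 3 uses the portable create-copy-rename form rather than `DROP COLUMN`); table and column names follow the example:

```python
import sqlite3

# Expand-contract walkthrough (illustrative; production would use online
# DDL tooling rather than plain ALTER TABLE on a live table).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
db.execute("INSERT INTO users (first_name, last_name) VALUES ('John', 'Doe')")

# Phase 1: EXPAND - add the new column alongside the old ones
db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Phase 2: MIGRATE - backfill existing rows, then dual-write on new inserts
db.execute("UPDATE users SET full_name = first_name || ' ' || last_name WHERE full_name IS NULL")
db.execute("INSERT INTO users (first_name, last_name, full_name) VALUES ('Jane', 'Roe', 'Jane Roe')")

# Phase 3: CONTRACT - once every reader uses full_name, drop the old columns
# (done here via shadow table + rename, which works on any SQLite version)
db.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")
db.execute("INSERT INTO users_new SELECT id, full_name FROM users")
db.execute("DROP TABLE users")
db.execute("ALTER TABLE users_new RENAME TO users")

names = [r[0] for r in db.execute("SELECT full_name FROM users ORDER BY id")]
# names == ['John Doe', 'Jane Roe']
```

At no point is the table unreadable by either the old code (phases 1-2) or the new code (phases 2-3), which is the whole point of the pattern.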

Large Table Migration

```sql
-- Problem: ALTER TABLE on 100M rows = hours of locking

-- Solution 1: pt-online-schema-change (Percona)
-- Creates shadow table, copies data, swaps atomically
-- pt-online-schema-change --alter "ADD COLUMN status VARCHAR(20)" D=mydb,t=orders

-- Solution 2: gh-ost (GitHub)
-- Similar but uses binary log for copying
-- gh-ost --alter="ADD COLUMN status VARCHAR(20)" --database=mydb --table=orders

-- Solution 3: Manual shadow table approach
-- Step 1: Create new table with desired schema
CREATE TABLE orders_new (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT,
    status VARCHAR(20) DEFAULT 'pending',  -- New column
    created_at TIMESTAMP
);

-- Step 2: Copy data in batches
INSERT INTO orders_new (id, customer_id, created_at)
SELECT id, customer_id, created_at
FROM orders
WHERE id BETWEEN 1 AND 100000;
-- Repeat for all batches

-- Step 3: Sync recent changes (use triggers or CDC)
-- Step 4: Atomic swap
RENAME TABLE orders TO orders_old, orders_new TO orders;

-- Step 5: Drop old table after verification
DROP TABLE orders_old;
```
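The batched copy in Step 2 is usually driven from application code so each range runs in its own small transaction. A sketch of that driver, with `sqlite3` standing in for the production database driver and table names matching the example:

```python
import sqlite3

# Application-side driver for the batched shadow-table copy (Step 2 above).
def copy_in_batches(db: sqlite3.Connection, batch_size: int = 100_000) -> None:
    max_id = db.execute("SELECT COALESCE(MAX(id), 0) FROM orders").fetchone()[0]
    current = 0
    while current < max_id:
        # Copy one PK range; omitted columns (status) take their defaults
        db.execute(
            "INSERT INTO orders_new (id, customer_id, created_at) "
            "SELECT id, customer_id, created_at FROM orders "
            "WHERE id > ? AND id <= ?",
            (current, current + batch_size),
        )
        db.commit()  # commit per batch: small transactions keep lock windows short
        current += batch_size
```

Ranging on the primary key (rather than `OFFSET`) keeps every batch an index-range scan, so cost stays flat as the copy progresses.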

Zero-Downtime Migration Checklist

┌─────────────────────────────────────────────────────────────────┐
│              MIGRATION CHECKLIST                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PRE-MIGRATION:                                                 │
│  □ Schema change is backward compatible                         │
│  □ Migration tested in staging with production-like data        │
│  □ Rollback plan documented                                     │
│  □ Monitoring alerts configured                                 │
│  □ Team notified, on-call aware                                 │
│                                                                 │
│  DURING MIGRATION:                                              │
│  □ Run during low-traffic period                                │
│  □ Monitor database metrics (locks, replication lag)            │
│  □ Monitor application errors                                   │
│  □ Batch large data migrations                                  │
│  □ Keep transaction sizes small                                 │
│                                                                 │
│  POST-MIGRATION:                                                │
│  □ Verify data integrity                                        │
│  □ Check application functionality                              │
│  □ Monitor for 24-48 hours                                      │
│  □ Update documentation                                         │
│  □ Schedule cleanup of deprecated columns                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Version Management

Semantic Versioning for Schemas

┌─────────────────────────────────────────────────────────────────┐
│              SCHEMA VERSIONING                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  VERSION: MAJOR.MINOR.PATCH                                     │
│                                                                 │
│  MAJOR (Breaking Change)                                        │
│  • Remove required field                                        │
│  • Change field type incompatibly                               │
│  • Rename field                                                 │
│  → Requires consumer migration                                  │
│                                                                 │
│  MINOR (Backward Compatible)                                    │
│  • Add optional field                                           │
│  • Add new enum value                                           │
│  • Widen type                                                   │
│  → Consumers can upgrade at their pace                          │
│                                                                 │
│  PATCH (Bug Fix)                                                │
│  • Fix documentation                                            │
│  • Clarify semantics                                            │
│  • No schema change                                             │
│  → Transparent to consumers                                     │
│                                                                 │
│  EXAMPLE HISTORY:                                               │
│  v1.0.0 - Initial schema                                        │
│  v1.1.0 - Added optional 'shipping_method' field                │
│  v1.2.0 - Added 'express' to shipping_method enum               │
│  v2.0.0 - Removed deprecated 'legacy_status' field              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
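The bump rules above are mechanical enough to automate in CI. A sketch of a version-bump helper; the change labels are hypothetical names for the categories in the box, not a standard API:

```python
# Decide the version bump for a set of schema changes, following the
# MAJOR/MINOR/PATCH rules above.
BREAKING = {"remove_required_field", "rename_field", "narrow_type"}
COMPATIBLE = {"add_optional_field", "add_enum_value", "widen_type"}

def bump(version: str, changes: set[str]) -> str:
    major, minor, patch = map(int, version.split("."))
    if changes & BREAKING:           # any breaking change wins
        return f"{major + 1}.0.0"
    if changes & COMPATIBLE:         # backward-compatible addition
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"   # docs-only / semantic fix
```

Running this against the example history works out: adding the optional `shipping_method` field takes v1.0.0 to v1.1.0, and removing `legacy_status` forces v2.0.0.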

Schema Registry

┌─────────────────────────────────────────────────────────────────┐
│              SCHEMA REGISTRY WORKFLOW                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐     ┌─────────────────┐     ┌──────────┐          │
│  │ Producer │────►│ Schema Registry │◄────│ Consumer │          │
│  └──────────┘     └─────────────────┘     └──────────┘          │
│       │                   │                    │                │
│       │  1. Register      │                    │                │
│       │     schema        │                    │                │
│       │──────────────────►│                    │                │
│       │                   │                    │                │
│       │  2. Get schema ID │                    │                │
│       │◄──────────────────│                    │                │
│       │                   │                    │                │
│       │  3. Send data     │                    │                │
│       │     with ID       │                    │                │
│       │───────────────────┼───────────────────►│                │
│       │                   │                    │                │
│       │                   │  4. Fetch schema   │                │
│       │                   │◄───────────────────│                │
│       │                   │                    │                │
│       │                   │  5. Return schema  │                │
│       │                   │───────────────────►│                │
│       │                   │                    │                │
│       │                   │  6. Deserialize    │                │
│       │                   │     with schema    │                │
│                                                                 │
│  BENEFITS:                                                      │
│  • Centralized schema management                                │
│  • Automatic compatibility checking                             │
│  • Schema evolution tracking                                    │
│  • Reduced payload size (ID vs full schema)                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
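The register/fetch round trip in the diagram can be sketched with an in-memory registry. This is illustrative only; a real registry (e.g. Confluent Schema Registry) also runs the compatibility checks from the previous section before accepting a new version:

```python
# In-memory sketch of the registry workflow above: the producer registers
# a schema and gets a compact ID; the consumer resolves the ID back to
# the full schema before deserializing.
class SchemaRegistry:
    def __init__(self) -> None:
        self._schemas: dict[int, str] = {}
        self._next_id = 1

    def register(self, schema: str) -> int:
        # Idempotent: re-registering an identical schema returns its ID
        for sid, existing in self._schemas.items():
            if existing == schema:
                return sid
        sid = self._next_id
        self._schemas[sid] = schema
        self._next_id += 1
        return sid

    def fetch(self, schema_id: int) -> str:
        return self._schemas[schema_id]
```

The producer then ships `(schema_id, payload)` instead of embedding the full schema in every message, which is where the payload-size benefit comes from.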

🔍 Common Review Comments

⚠️ Schema Evolution Review Feedback

Schema changes in production are irreversible. These comments prevent outages and data loss.

| Review Comment | Meaning | How to fix |
|---|---|---|
| "Breaking change without major version bump" | Consumer will break silently | Bump major version, notify consumers |
| "No deprecation notice for removed field" | Consumers still using old field | 3-release deprecation cycle |
| "Backfill query will lock table for hours" | Full table scan/update | Batch by PK range |
| "No rollback plan documented" | If migration fails, stuck | Document rollback procedure |
| "Migration tested on 1% prod data" | Edge cases missed | Test on full prod clone |
| "No compatibility matrix" | Unknown producer/consumer versions | Maintain compatibility doc |

⚠️ Anti-patterns (Expanded)

Anti-pattern: Big-Bang Migration

┌─────────────────────────────────────────────────────────────────┐
│            BIG-BANG MIGRATION ANTI-PATTERN                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  1. Write migration script                                      │
│  2. Schedule 4-hour maintenance window                          │
│  3. Take system offline                                         │
│  4. Run migration                                               │
│  5. Pray it works                                               │
│  6. Resume service                                              │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Downtime: Hours of lost revenue                              │
│  • Rollback impossible: Changes committed                       │
│  • Risk concentrated: All-or-nothing                            │
│  • Testing limited: Cannot test at scale                        │
│                                                                 │
│  REAL INCIDENT: 6-hour "30-minute" maintenance window           │
│  Migration discovered index corruption mid-way                  │
│  No rollback possible → manual data repair                      │
│                                                                 │
│  ✅ SOLUTION: Expand-Contract Pattern                           │
│  1. Expand: Add new schema alongside old                        │
│  2. Dual-write: Write to both                                   │
│  3. Backfill: Migrate historical data                           │
│  4. Switch reads: Move traffic gradually                        │
│  5. Contract: Remove old schema                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Anti-pattern: Skipping Compatibility Checks

┌─────────────────────────────────────────────────────────────────┐
│         COMPATIBILITY SKIP ANTI-PATTERN                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ PROBLEM:                                                    │
│  Producer: "I'm adding a new required field"                    │
│  Consumer: (hasn't updated yet)                                 │
│  → Consumer crashes on first message                            │
│                                                                 │
│  CONSEQUENCES:                                                  │
│  • Producer/consumer version mismatch → breakage                │
│  • Silent data loss if consumer ignores unknown fields          │
│  • Debugging nightmare across team boundaries                   │
│                                                                 │
│  ✅ SOLUTION: Schema Registry                                   │
│  • Confluent Schema Registry (Avro/Protobuf)                    │
│  • Validate compatibility at publish time                       │
│  • Block incompatible changes                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

📋 Versioning Strategy Deep-Dive

Semantic Versioning for Schemas

MAJOR.MINOR.PATCH

MAJOR: Breaking changes (removed fields, type changes)
       → Requires consumer update
       
MINOR: Backward-compatible additions (new optional fields)
       → Consumers work without update
       
PATCH: Documentation, no schema change
       → Transparent to consumers

Deprecation Timeline

| Week | Action |
|---|---|
| 0 | Mark field deprecated in schema + docs |
| +4 | Log warning when deprecated field used |
| +8 | Return error in non-prod, warning in prod |
| +12 | Remove field in next major version |
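The timeline above can be enforced in the access layer so the escalation happens automatically. A minimal sketch, assuming a hypothetical `deprecation_action` hook called wherever the deprecated field is read:

```python
# Map weeks since deprecation to the escalation step in the timeline above.
def deprecation_action(weeks: int, env: str = "prod") -> str:
    if weeks < 4:
        return "deprecated"   # marked in schema + docs only
    if weeks < 8:
        return "warn"         # log a warning on each use
    if weeks < 12:
        return "warn" if env == "prod" else "error"   # fail fast outside prod
    return "removed"          # gone in the next major version
```

Failing in non-prod first (weeks 8-12) surfaces stragglers in CI and staging before the field actually disappears.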

📦 Backfill Procedures

Safe Backfill Pattern

```sql
-- ❌ DANGEROUS: Full table lock
UPDATE users SET new_column = computed_value;

-- ✅ SAFE: Batched by PK range
-- NOTE: COMMIT inside a DO block requires PostgreSQL 11+
DO $$
DECLARE
    batch_size INT := 10000;
    max_id BIGINT;
    current_id BIGINT := 0;
BEGIN
    SELECT MAX(id) INTO max_id FROM users;

    WHILE current_id < max_id LOOP
        UPDATE users
        SET new_column = computed_value
        WHERE id > current_id
          AND id <= current_id + batch_size
          AND new_column IS NULL;

        current_id := current_id + batch_size;
        COMMIT;

        -- Throttle to avoid overwhelming DB
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;
```

Backfill Monitoring

| Metric | Threshold | Action |
|---|---|---|
| Rows processed | Track progress | Log every 100K |
| Replication lag | < 10 seconds | Pause if exceeded |
| Lock wait time | < 1 second | Reduce batch size |
| CPU usage | < 50% | Safe to continue |
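These thresholds are easy to wire into the backfill loop itself so it self-throttles instead of waiting for a human to react. A sketch, with the metric values assumed to come from whatever monitoring endpoint your database exposes:

```python
# Turn the monitoring thresholds above into a per-batch throttling decision.
def backfill_decision(replication_lag_s: float, lock_wait_s: float, cpu_pct: float) -> str:
    if replication_lag_s >= 10:
        return "pause"         # stop and let replicas catch up
    if lock_wait_s >= 1:
        return "shrink_batch"  # reduce batch size
    if cpu_pct >= 50:
        return "throttle"      # lengthen the sleep between batches
    return "continue"
```

Checked once per batch, this keeps the most severe condition (replication lag) in charge, since a lagging replica is the failure mode that affects readers, not just the migration.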

📎 Cross-References