🔄 Schema Evolution
Level: Ops · Solves: Managing schema changes in production without causing downtime or data loss
Schema Evolution Challenges
💡 Professor Tom
Schema evolution is where theory meets reality. You can't just "ALTER TABLE" in production with millions of rows and zero downtime. Every migration is a potential incident waiting to happen.
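A concrete illustration of that point, as a sketch assuming PostgreSQL (the table and column names are hypothetical):
sql
-- Looks harmless, but before PostgreSQL 11, adding a column with a
-- non-null default rewrote the entire table under an ACCESS EXCLUSIVE
-- lock; on millions of rows, that means minutes to hours of blocked writes.
ALTER TABLE users ADD COLUMN is_verified BOOLEAN NOT NULL DEFAULT false;

-- Safer pattern: add the column as nullable (a metadata-only change),
-- backfill in batches, then enforce the constraint.
ALTER TABLE users ADD COLUMN is_verified BOOLEAN;
-- ...batched backfill (see Backfill Procedures below)...
ALTER TABLE users ALTER COLUMN is_verified SET NOT NULL;  -- scans the table, but no rewrite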
Why Schema Evolution is Hard
┌─────────────────────────────────────────────────────────────────┐
│ SCHEMA EVOLUTION CHALLENGES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. BACKWARD COMPATIBILITY │
│ Old code must work with new schema │
│ → Rolling deployments, blue-green │
│ │
│ 2. FORWARD COMPATIBILITY │
│ New code must work with old schema │
│ → Rollback scenarios │
│ │
│ 3. DATA MIGRATION │
│ Existing data must be transformed │
│ → Large tables, long-running operations │
│ │
│ 4. ZERO DOWNTIME │
│ Changes must not interrupt service │
│ → Online DDL, shadow tables │
│ │
│ 5. DISTRIBUTED SYSTEMS │
│ Multiple services, multiple databases │
│ → Coordination, eventual consistency │
│ │
└─────────────────────────────────────────────────────────────────┘
Data Contracts
What is a Data Contract?
┌─────────────────────────────────────────────────────────────────┐
│ DATA CONTRACT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ A DATA CONTRACT is a formal agreement between: │
│ • Data Producer (writes data) │
│ • Data Consumer (reads data) │
│ │
│ CONTRACT INCLUDES: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Schema Definition │ │
│ │ • Field names and types │ │
│ │ • Required vs optional fields │ │
│ │ • Constraints and validations │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ Semantic Meaning │ │
│ │ • What each field represents │ │
│ │ • Units, formats, enums │ │
│ │ • Business rules │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ SLA & Quality │ │
│ │ • Freshness requirements │ │
│ │ • Completeness expectations │ │
│ │ • Accuracy guarantees │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ Evolution Rules │ │
│ │ • How changes are communicated │ │
│ │ • Deprecation policy │ │
│ │ • Version compatibility │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Data Contract Example (YAML)
yaml
# data-contracts/orders.yaml
apiVersion: datacontract/v1
kind: DataContract
metadata:
  name: orders
  version: 2.1.0
  owner: order-service-team
schema:
  type: object
  properties:
    order_id:
      type: string
      format: uuid
      description: Unique identifier for the order
      required: true
    customer_id:
      type: string
      format: uuid
      description: Reference to customer
      required: true
    status:
      type: string
      enum: [pending, confirmed, shipped, delivered, cancelled]
      description: Current order status
      required: true
    total_amount:
      type: number
      format: decimal
      minimum: 0
      description: Total order amount in USD
      required: true
    created_at:
      type: string
      format: date-time
      description: Order creation timestamp (UTC)
      required: true
    # New field in v2.1.0
    shipping_method:
      type: string
      enum: [standard, express, overnight]
      description: Selected shipping method
      required: false  # Optional for backward compatibility
      default: standard
      added_in: 2.1.0
quality:
  freshness:
    max_delay: 5 minutes
  completeness:
    required_fields: 99.9%
  accuracy:
    validation_rules:
      - total_amount >= 0
      - created_at <= now()
evolution:
  compatibility: backward
  deprecation_notice: 90 days
  breaking_change_policy: major_version_bump
Compatibility Rules
Schema Compatibility Types
┌─────────────────────────────────────────────────────────────────┐
│ COMPATIBILITY TYPES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ BACKWARD COMPATIBLE (Most Common) │
│ New schema can read old data │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ✅ Add optional field with default │ │
│ │ ✅ Remove optional field │ │
│ │ ✅ Widen type (int → long) │ │
│ │ ❌ Add required field │ │
│ │ ❌ Remove required field │ │
│ │ ❌ Rename field │ │
│ │ ❌ Change field type incompatibly │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ FORWARD COMPATIBLE │
│ Old schema can read new data │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ✅ Add optional field (old code ignores) │ │
│ │ ✅ Remove optional field │ │
│ │ ❌ Add required field │ │
│ │ ❌ Change field type │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ FULL COMPATIBLE │
│ Both backward AND forward compatible │
│ Most restrictive, safest for rolling deployments │
│ │
└─────────────────────────────────────────────────────────────────┘
Safe vs Unsafe Changes
| Change Type | Backward | Forward | Recommendation |
|---|---|---|---|
| Add optional field | ✅ | ✅ | Safe, use defaults |
| Add required field | ❌ | ❌ | Make optional first |
| Remove optional field | ✅ | ❌ | Deprecate first |
| Remove required field | ❌ | ❌ | Never |
| Rename field | ❌ | ❌ | Add new, deprecate old |
| Change type (widen) | ✅ | ❌ | Usually safe |
| Change type (narrow) | ❌ | ❌ | Never |
| Add enum value | ✅ | ❌ | Safe for producers |
| Remove enum value | ❌ | ✅ | Dangerous |
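A few of the table's rows translated into DDL, as a sketch assuming PostgreSQL (column names are hypothetical):
sql
-- Add optional field with a default: safe (metadata-only change since PostgreSQL 11)
ALTER TABLE orders ADD COLUMN shipping_method VARCHAR(20) DEFAULT 'standard';

-- Widen a type (int -> bigint): backward compatible at the schema level,
-- but in PostgreSQL this rewrites the table, so treat it as a
-- large-table migration (see below).
ALTER TABLE orders ALTER COLUMN quantity TYPE BIGINT;

-- Rename: breaks both directions. Add the new column, dual-write,
-- and deprecate the old one instead.
-- ALTER TABLE orders RENAME COLUMN qty TO quantity;  -- avoid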
Migration Strategies
Online Schema Migration Pattern
┌─────────────────────────────────────────────────────────────────┐
│ EXPAND-CONTRACT PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PHASE 1: EXPAND │
│ Add new column, keep old column │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ALTER TABLE users ADD COLUMN full_name VARCHAR(200); │ │
│ │ -- Both first_name/last_name AND full_name exist │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ PHASE 2: MIGRATE │
│ Backfill data, update application to write both │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ -- Backfill existing data │ │
│ │ UPDATE users SET full_name = first_name || ' ' || last_name │ │
│ │ WHERE full_name IS NULL; │ │
│ │ │ │
│ │ -- Application writes to both columns │ │
│ │ INSERT INTO users (first_name, last_name, full_name) │ │
│ │ VALUES ('John', 'Doe', 'John Doe'); │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ PHASE 3: CONTRACT │
│ Remove old columns after all consumers migrated │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ -- After deprecation period │ │
│ │ ALTER TABLE users DROP COLUMN first_name; │ │
│ │ ALTER TABLE users DROP COLUMN last_name; │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Large Table Migration
sql
-- Problem: ALTER TABLE on 100M rows = hours of locking
-- Solution 1: pt-online-schema-change (Percona)
-- Creates shadow table, copies data, swaps atomically
-- pt-online-schema-change --alter "ADD COLUMN status VARCHAR(20)" D=mydb,t=orders
-- Solution 2: gh-ost (GitHub)
-- Similar but uses binary log for copying
-- gh-ost --alter="ADD COLUMN status VARCHAR(20)" --database=mydb --table=orders
-- Solution 3: Manual shadow table approach
-- Step 1: Create new table with desired schema
CREATE TABLE orders_new (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT,
    status VARCHAR(20) DEFAULT 'pending',  -- New column
    created_at TIMESTAMP
);
-- Step 2: Copy data in batches
INSERT INTO orders_new (id, customer_id, created_at)
SELECT id, customer_id, created_at
FROM orders
WHERE id BETWEEN 1 AND 100000;
-- Repeat for all batches
-- Step 3: Sync recent changes (use triggers or CDC)
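-- A minimal trigger sketch for Step 3 (MySQL syntax; names are
-- illustrative). It mirrors every insert into the shadow table while
-- the batch copy runs; pt-online-schema-change uses the same approach
-- internally. An AFTER UPDATE twin and DELETE handling are omitted
-- for brevity.
CREATE TRIGGER orders_sync_insert
AFTER INSERT ON orders FOR EACH ROW
  REPLACE INTO orders_new (id, customer_id, created_at)
  VALUES (NEW.id, NEW.customer_id, NEW.created_at);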
-- Step 4: Atomic swap
RENAME TABLE orders TO orders_old, orders_new TO orders;
-- Step 5: Drop old table after verification
DROP TABLE orders_old;
Zero-Downtime Migration Checklist
┌─────────────────────────────────────────────────────────────────┐
│ MIGRATION CHECKLIST │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PRE-MIGRATION: │
│ □ Schema change is backward compatible │
│ □ Migration tested in staging with production-like data │
│ □ Rollback plan documented │
│ □ Monitoring alerts configured │
│ □ Team notified, on-call aware │
│ │
│ DURING MIGRATION: │
│ □ Run during low-traffic period │
│ □ Monitor database metrics (locks, replication lag) │
│ □ Monitor application errors │
│ □ Batch large data migrations │
│ □ Keep transaction sizes small │
│ │
│ POST-MIGRATION: │
│ □ Verify data integrity │
│ □ Check application functionality │
│ □ Monitor for 24-48 hours │
│ □ Update documentation │
│ □ Schedule cleanup of deprecated columns │
│ │
└─────────────────────────────────────────────────────────────────┘
Version Management
Semantic Versioning for Schemas
┌─────────────────────────────────────────────────────────────────┐
│ SCHEMA VERSIONING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ VERSION: MAJOR.MINOR.PATCH │
│ │
│ MAJOR (Breaking Change) │
│ • Remove required field │
│ • Change field type incompatibly │
│ • Rename field │
│ → Requires consumer migration │
│ │
│ MINOR (Backward Compatible) │
│ • Add optional field │
│ • Add new enum value │
│ • Widen type │
│ → Consumers can upgrade at their pace │
│ │
│ PATCH (Bug Fix) │
│ • Fix documentation │
│ • Clarify semantics │
│ • No schema change │
│ → Transparent to consumers │
│ │
│ EXAMPLE HISTORY: │
│ v1.0.0 - Initial schema │
│ v1.1.0 - Added optional 'shipping_method' field │
│ v1.2.0 - Added 'express' to shipping_method enum │
│ v2.0.0 - Removed deprecated 'legacy_status' field │
│ │
└─────────────────────────────────────────────────────────────────┘
Schema Registry
┌─────────────────────────────────────────────────────────────────┐
│ SCHEMA REGISTRY WORKFLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌─────────────────┐ ┌──────────┐ │
│ │ Producer │────►│ Schema Registry │◄────│ Consumer │ │
│ └──────────┘ └─────────────────┘ └──────────┘ │
│ │ │ │ │
│ │ 1. Register │ │ │
│ │ schema │ │ │
│ │──────────────────►│ │ │
│ │ │ │ │
│ │ 2. Get schema ID │ │ │
│ │◄──────────────────│ │ │
│ │ │ │ │
│ │ 3. Send data │ │ │
│ │ with ID │ │ │
│ │───────────────────┼───────────────────►│ │
│ │ │ │ │
│ │ │ 4. Fetch schema │ │
│ │ │◄───────────────────│ │
│ │ │ │ │
│ │ │ 5. Return schema │ │
│ │ │───────────────────►│ │
│ │ │ │ │
│ │ │ 6. Deserialize │ │
│ │ │ with schema │ │
│ │
│ BENEFITS: │
│ • Centralized schema management │
│ • Automatic compatibility checking │
│ • Schema evolution tracking │
│ • Reduced payload size (ID vs full schema) │
│ │
└─────────────────────────────────────────────────────────────────┘
🔍 Common Review Comments
⚠️ Schema Evolution Review Feedback
Schema changes in production are often impossible to roll back once data has been dropped or transformed. These comments prevent outages and data loss.
| Review Comment | Meaning | How to fix |
|---|---|---|
| "Breaking change without major version bump" | Consumer will break silently | Bump major version, notify consumers |
| "No deprecation notice for removed field" | Consumers still using old field | 3-release deprecation cycle |
| "Backfill query will lock table for hours" | Full table scan/update | Batch by PK range |
| "No rollback plan documented" | If migration fails, stuck | Document rollback procedure |
| "Migration tested on 1% prod data" | Edge cases missed | Test on full prod clone |
| "No compatibility matrix" | Unknown producer/consumer versions | Maintain compatibility doc |
⚠️ Anti-patterns (Expanded)
Anti-pattern: Big-Bang Migration
┌─────────────────────────────────────────────────────────────────┐
│ BIG-BANG MIGRATION ANTI-PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ PROBLEM: │
│ 1. Write migration script │
│ 2. Schedule 4-hour maintenance window │
│ 3. Take system offline │
│ 4. Run migration │
│ 5. Pray it works │
│ 6. Resume service │
│ │
│ CONSEQUENCES: │
│ • Downtime: Hours of lost revenue │
│ • Rollback impossible: Changes committed │
│ • Risk concentrated: All-or-nothing │
│ • Testing limited: Cannot test at scale │
│ │
│ REAL INCIDENT: 6-hour "30-minute" maintenance window │
│ Migration discovered index corruption mid-way │
│ No rollback possible → manual data repair │
│ │
│ ✅ SOLUTION: Expand-Contract Pattern │
│ 1. Expand: Add new schema alongside old │
│ 2. Dual-write: Write to both │
│ 3. Backfill: Migrate historical data │
│ 4. Switch reads: Move traffic gradually │
│ 5. Contract: Remove old schema │
│ │
└─────────────────────────────────────────────────────────────────┘
Anti-pattern: Skipping Compatibility Checks
┌─────────────────────────────────────────────────────────────────┐
│ COMPATIBILITY SKIP ANTI-PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ PROBLEM: │
│ Producer: "I'm adding a new required field" │
│ Consumer: (hasn't updated yet) │
│ → Consumer crashes on first message │
│ │
│ CONSEQUENCES: │
│ • Producer/consumer version mismatch → breakage │
│ • Silent data loss if consumer ignores unknown fields │
│ • Debugging nightmare across team boundaries │
│ │
│ ✅ SOLUTION: Schema Registry │
│ • Confluent Schema Registry (Avro/Protobuf) │
│ • Validate compatibility at publish time │
│ • Block incompatible changes │
│ │
└─────────────────────────────────────────────────────────────────┘
📋 Versioning Strategy Deep-Dive
Semantic Versioning for Schemas
MAJOR.MINOR.PATCH

MAJOR: Breaking changes (removed fields, type changes)
  → Requires consumer update
MINOR: Backward-compatible additions (new optional fields)
  → Consumers work without update
PATCH: Documentation, no schema change
  → Transparent to consumers
Deprecation Timeline
| Week | Action |
|---|---|
| 0 | Mark field deprecated in schema + docs |
| +4 | Log warning when deprecated field used |
| +8 | Return error in non-prod, warning in prod |
| +12 | Remove field in next major version |
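One lightweight way to make week 0 visible to everyone who touches the database is to record the deprecation in the schema itself. A sketch assuming PostgreSQL (reusing the 'legacy_status' field from the version history above):
sql
-- Mark the column as deprecated where every engineer will see it
-- (psql \d+, most GUI clients, and schema-dump diffs all show comments).
COMMENT ON COLUMN orders.legacy_status IS
  'DEPRECATED: scheduled for removal in the next major version. Use status instead.';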
📦 Backfill Procedures
Safe Backfill Pattern
sql
-- ❌ DANGEROUS: one huge transaction that touches (and locks) every row
UPDATE users SET new_column = computed_value;

-- ✅ SAFE: batched by PK range
-- Note: COMMIT inside a DO block requires PostgreSQL 11+ and must not
-- run inside an outer transaction block.
DO $$
DECLARE
    batch_size INT := 10000;
    max_id     BIGINT;
    current_id BIGINT := 0;
BEGIN
    SELECT MAX(id) INTO max_id FROM users;
    WHILE current_id < max_id LOOP
        UPDATE users
        SET new_column = computed_value
        WHERE id > current_id
          AND id <= current_id + batch_size
          AND new_column IS NULL;
        current_id := current_id + batch_size;
        COMMIT;
        -- Throttle to avoid overwhelming the DB
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;
Backfill Monitoring
| Metric | Threshold | Action |
|---|---|---|
| Rows processed | Track progress | Log every 100K |
| Replication lag | < 10 seconds | Pause if exceeded |
| Lock wait time | < 1 second | Reduce batch size |
| CPU usage | < 50% | Safe to continue |
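The replication-lag and lock-wait checks in the table can be scripted. A sketch assuming PostgreSQL 10+ (thresholds match the table above):
sql
-- Replication lag per standby: pause the backfill if this exceeds ~10 seconds
SELECT application_name,
       EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds
FROM pg_stat_replication;

-- Sessions currently waiting on locks: reduce the batch size if these pile up
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';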
📎 Cross-References
- 📎 Data Governance - Governance policies for schema changes
- 📎 OLTP Modeling - Constraint management
- 📎 Distributed Systems - Schema evolution in distributed DBs
- 📎 SQL Transactions - Safe migration transactions