
🛡️ Reliability & DR

Level: Ops | Solves: Design and implement high availability and disaster recovery strategies for enterprise workloads

🎯 Outcomes

After applying the material on this page, you will be able to:

  • Define appropriate RTO/RPO targets for each workload based on business requirements
  • Design a Multi-AZ architecture for high availability within a single region
  • Deploy Multi-Region DR with an appropriate strategy (Pilot Light, Warm Standby, Active-Active)
  • Build DR runbooks with clear procedures and testing schedules
  • Implement chaos engineering to validate resilience assumptions
  • Optimize DR cost based on actual RTO/RPO requirements

When to use

| DR Strategy | RTO | RPO | Use Case | Cost |
|---|---|---|---|---|
| Backup & Restore | 24h+ | 24h+ | Non-critical, cost-sensitive | Lowest |
| Pilot Light | 1-4h | Minutes | Business applications | Low |
| Warm Standby | 15-60 min | Near-zero | Important workloads | Medium |
| Multi-Site Active | Seconds | Near-zero | Mission-critical | Highest |
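
The table above can be read as a lookup: pick the cheapest strategy whose recovery bounds still satisfy the workload's targets. The sketch below is one possible way to encode that. The numeric bounds are an illustrative reading of the table (for example, "Minutes" is assumed to mean 15 minutes and "Near-zero" or "Seconds" to mean 1 minute), and `Strategy`, `STRATEGIES`, and `cheapest_strategy` are hypothetical names, not part of any AWS API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed recovery-time bound for this strategy
    rpo_minutes: float  # assumed data-loss bound for this strategy


# Ordered from cheapest to most expensive, as in the table.
# The numbers are assumptions made for illustration, not measured values.
# "24h+" is treated as exactly 24h here, which is optimistic.
STRATEGIES = [
    Strategy("Backup & Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60),
    Strategy("Pilot Light", rto_minutes=4 * 60, rpo_minutes=15),
    Strategy("Warm Standby", rto_minutes=60, rpo_minutes=1),
    Strategy("Multi-Site Active", rto_minutes=1, rpo_minutes=1),
]


def cheapest_strategy(required_rto_minutes: float,
                      required_rpo_minutes: float) -> Optional[str]:
    """Return the cheapest strategy whose assumed bounds meet both targets."""
    for strategy in STRATEGIES:
        if (strategy.rto_minutes <= required_rto_minutes
                and strategy.rpo_minutes <= required_rpo_minutes):
            return strategy.name
    return None  # none of the listed strategies meets the targets


# Example: a workload that tolerates 4 hours of downtime and 30 minutes of data loss.
print(cheapest_strategy(required_rto_minutes=240, required_rpo_minutes=30))
```

Walking the list in cost order keeps the decision explainable: the first match is the least expensive option that still meets the business targets.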

When NOT to use

| Pattern | Problem | Alternative |
|---|---|---|
| Active-Active for dev/test | Unnecessarily high cost | Backup & Restore |
| Multi-region for a single-market app | Added complexity and latency | Multi-AZ |
| Manual failover for critical systems | RTO too long | Automated failover |
| Untested DR | No way to know whether it works | Regular DR drills |

⚠️ Warning from Raizo

"A company invested $200K/year in DR infrastructure. When it actually needed to fail over, the runbook had not been updated in 2 years, credentials had expired, and replication had been broken for 3 months. Untested DR is no DR."

Reliability Fundamentals

Availability Targets

┌─────────────────────────────────────────────────────────────────┐
│                 AVAILABILITY TARGETS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Availability    Downtime/Year    Downtime/Month    Use Case    │
│  ────────────    ─────────────    ──────────────    ────────    │
│  99%             3.65 days        7.3 hours         Dev/Test    │
│  99.9%           8.76 hours       43.8 minutes      Standard    │
│  99.95%          4.38 hours       21.9 minutes      Business    │
│  99.99%          52.6 minutes     4.38 minutes      Critical    │
│  99.999%         5.26 minutes     26.3 seconds      Mission     │
│                                                                 │
│  COST IMPLICATION:                                              │
│  Each additional "9" typically doubles infrastructure cost      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
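
The downtime figures in the table follow directly from the availability percentage. As a small worked sketch (assuming a 365-day year and an average month of 730 hours), they can be recomputed like this:

```python
HOURS_PER_YEAR = 365 * 24              # 8,760 h, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 h for an average month


def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year = allowed_downtime_hours(target, HOURS_PER_YEAR)
    per_month = allowed_downtime_hours(target, HOURS_PER_MONTH)
    print(f"{target}%: {per_year:.2f} h/year, {per_month * 60:.1f} min/month")
```

The same function can be used to turn an SLA figure into an error budget for any reporting window by passing a different `period_hours`.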

RTO vs RPO

┌─────────────────────────────────────────────────────────────────┐
│                    RTO vs RPO                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Timeline:                                                      │
│                                                                 │
│  ◄────────── RPO ──────────►│◄────────── RTO ──────────►        │
│                              │                                  │
│  Last Backup    Data Loss    │    Disaster    Recovery          │
│  ─────────────────────────────────────────────────────────      │
│       │              │       │         │           │            │
│       ▼              ▼       ▼         ▼           ▼            │
│  ┌─────────┐    ┌─────────┐  │    ┌─────────┐  ┌─────────┐     │
│  │ Backup  │    │  Lost   │  │    │ Outage  │  │ Service │     │
│  │  Point  │    │  Data   │  │    │ Starts  │  │ Restored│     │
│  └─────────┘    └─────────┘  │    └─────────┘  └─────────┘     │
│                              │                                  │
│  RPO = Recovery Point Objective (acceptable data loss)          │
│  RTO = Recovery Time Objective (acceptable downtime)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
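
After a drill or a real incident, the achieved RPO and RTO can be measured from three timestamps and compared with the targets. The following is a minimal sketch; the function names and the example timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_point: datetime, outage_start: datetime) -> timedelta:
    """Data-loss window: time between the last recoverable copy and the outage."""
    return outage_start - last_recoverable_point


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the start of the outage and restored service."""
    return service_restored - outage_start


def meets_targets(rpo: timedelta, rto: timedelta,
                  rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data-loss window and the downtime are within target."""
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical incident: last backup at 02:00, outage at 09:30, service back at 10:15.
backup = datetime(2024, 1, 10, 2, 0)
outage = datetime(2024, 1, 10, 9, 30)
restored = datetime(2024, 1, 10, 10, 15)

rpo = achieved_rpo(backup, outage)
rto = achieved_rto(outage, restored)
print(rpo, rto, meets_targets(rpo, rto, timedelta(hours=24), timedelta(hours=1)))
```

Recording these values after every drill gives a measured RTO/RPO to set against the documented one.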

Multi-AZ Architecture

Standard Multi-AZ Pattern

┌─────────────────────────────────────────────────────────────────┐
│                 MULTI-AZ ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Region: us-east-1                                              │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Application Load Balancer             │    │
│  └────────────────────────────┬────────────────────────────┘    │
│                               │                                 │
│         ┌─────────────────────┼─────────────────────┐           │
│         │                     │                     │           │
│         ▼                     ▼                     ▼           │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │    AZ-A     │       │    AZ-B     │       │    AZ-C     │    │
│  │             │       │             │       │             │    │
│  │ ┌─────────┐ │       │ ┌─────────┐ │       │ ┌─────────┐ │    │
│  │ │   App   │ │       │ │   App   │ │       │ │   App   │ │    │
│  │ │ (ASG)   │ │       │ │ (ASG)   │ │       │ │ (ASG)   │ │    │
│  │ └─────────┘ │       │ └─────────┘ │       │ └─────────┘ │    │
│  │             │       │             │       │             │    │
│  │ ┌─────────┐ │       │ ┌─────────┐ │       │             │    │
│  │ │   RDS   │◄┼───────┼►│   RDS   │ │       │             │    │
│  │ │ Primary │ │ Sync  │ │ Standby │ │       │             │    │
│  │ └─────────┘ │       │ └─────────┘ │       │             │    │
│  └─────────────┘       └─────────────┘       └─────────────┘    │
│                                                                 │
│  Failover: Automatic for RDS, ALB routes to healthy targets     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Service-Specific Multi-AZ

| Service | Multi-AZ Behavior | Failover Time |
|---|---|---|
| RDS | Synchronous standby | 60-120 seconds |
| Aurora | 6 copies across 3 AZs | < 30 seconds |
| ElastiCache | Replica in different AZ | Seconds |
| EFS | Automatic across AZs | Transparent |
| S3 | Automatic across AZs | Transparent |
| DynamoDB | Automatic across AZs | Transparent |

Multi-Region Architecture

Active-Passive Pattern

┌─────────────────────────────────────────────────────────────────┐
│              ACTIVE-PASSIVE MULTI-REGION                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                      Route 53                            │    │
│  │              (Failover Routing Policy)                   │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                           │                                     │
│         ┌─────────────────┴─────────────────┐                   │
│         │ Primary                │ Secondary                    │
│         ▼                        ▼                              │
│  ┌─────────────┐          ┌─────────────┐                       │
│  │ us-east-1   │          │ us-west-2   │                       │
│  │  (Active)   │          │  (Passive)  │                       │
│  │             │          │             │                       │
│  │ ┌─────────┐ │  Async   │ ┌─────────┐ │                       │
│  │ │   RDS   │─┼──Replica─┼►│   RDS   │ │                       │
│  │ │ Primary │ │          │ │ Read Rep│ │                       │
│  │ └─────────┘ │          │ └─────────┘ │                       │
│  │             │          │             │                       │
│  │ ┌─────────┐ │  Cross-  │ ┌─────────┐ │                       │
│  │ │   S3    │─┼──Region──┼►│   S3    │ │                       │
│  │ │         │ │  Replic. │ │         │ │                       │
│  │ └─────────┘ │          │ └─────────┘ │                       │
│  └─────────────┘          └─────────────┘                       │
│                                                                 │
│  RPO: Minutes (async replication lag)                           │
│  RTO: Minutes to hours (manual promotion)                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Active-Active Pattern

┌─────────────────────────────────────────────────────────────────┐
│              ACTIVE-ACTIVE MULTI-REGION                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                      Route 53                            │    │
│  │           (Latency or Geolocation Routing)               │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                           │                                     │
│         ┌─────────────────┴─────────────────┐                   │
│         ▼                                   ▼                   │
│  ┌─────────────┐                     ┌─────────────┐            │
│  │ us-east-1   │                     │ eu-west-1   │            │
│  │  (Active)   │                     │  (Active)   │            │
│  │             │                     │             │            │
│  │ ┌─────────┐ │                     │ ┌─────────┐ │            │
│  │ │ Aurora  │◄┼─────Global DB───────┼►│ Aurora  │ │            │
│  │ │ Global  │ │   (< 1s latency)    │ │ Global  │ │            │
│  │ └─────────┘ │                     │ └─────────┘ │            │
│  │             │                     │             │            │
│  │ ┌─────────┐ │                     │ ┌─────────┐ │            │
│  │ │DynamoDB │◄┼───Global Tables─────┼►│DynamoDB │ │            │
│  │ │ Global  │ │                     │ │ Global  │ │            │
│  │ └─────────┘ │                     │ └─────────┘ │            │
│  └─────────────┘                     └─────────────┘            │
│                                                                 │
│  RPO: Near-zero (asynchronous replication, lag typically < 1s)  │
│  RTO: Seconds (automatic failover)                              │
│  Note: Aurora Global Database has one writer Region at a time   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

DR Strategy Selection

Strategy Comparison

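| Strategy | RTO | RPO | Cost | Use Case |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Non-critical |
| Pilot Light | Tens of minutes | Minutes | $$ | Important |
| Warm Standby | Minutes | Seconds to minutes | $$$ | Business Critical |
| Multi-Site Active | Seconds | Near-zero | $$$$ | Mission Critical |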

Pilot Light Implementation

┌─────────────────────────────────────────────────────────────────┐
│                    PILOT LIGHT PATTERN                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary Region (Active)         DR Region (Pilot Light)        │
│  ┌─────────────────────┐        ┌─────────────────────┐         │
│  │                     │        │                     │         │
│  │ ┌─────────────────┐ │        │ ┌─────────────────┐ │         │
│  │ │ Full App Stack  │ │        │ │ Minimal Infra   │ │         │
│  │ │ • Web servers   │ │        │ │ • DB replica    │ │         │
│  │ │ • App servers   │ │        │ │ • Core configs  │ │         │
│  │ │ • Databases     │ │        │ │ • AMIs ready    │ │         │
│  │ │ • Caches        │ │        │ │                 │ │         │
│  │ └─────────────────┘ │        │ └─────────────────┘ │         │
│  │                     │        │                     │         │
│  └─────────────────────┘        └─────────────────────┘         │
│                                                                 │
│  On Failover:                                                   │
│  1. Scale up compute (ASG, ECS)                                 │
│  2. Promote DB replica                                          │
│  3. Update DNS                                                  │
│  4. Warm up caches                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
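
The four failover steps in the diagram are order-dependent, so it helps to run them through a small sequencer that stops at the first failure and records how long each step took. The sketch below shows the idea only: `run_failover` is a hypothetical helper, and the step functions are placeholders where real calls (scaling the ASG, promoting the replica, updating DNS) would go.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_failover(steps: List[Step],
                 clock: Callable[[], float] = time.monotonic) -> List[Tuple[str, float]]:
    """Run steps in order, halt at the first failure, and record each step's duration."""
    timings = []
    for name, action in steps:
        started = clock()
        try:
            action()
        except Exception as exc:
            # Later steps depend on earlier ones, so do not continue past a failure.
            raise RuntimeError(f"failover halted at step '{name}': {exc}") from exc
        timings.append((name, clock() - started))
    return timings


def placeholder() -> None:
    """Stand-in for a real action; replace with the actual automation."""


pilot_light_steps = [
    ("Scale up compute", placeholder),
    ("Promote DB replica", placeholder),
    ("Update DNS", placeholder),
    ("Warm up caches", placeholder),
]

for step_name, seconds in run_failover(pilot_light_steps):
    print(f"{step_name}: {seconds:.3f}s")
```

The recorded timings double as measured RTO evidence for each drill.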

Game Days & Testing

DR Testing Framework

┌─────────────────────────────────────────────────────────────────┐
│                 DR TESTING FRAMEWORK                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Test Types:                                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 1. Tabletop Exercise (Quarterly)                        │    │
│  │    • Walk through runbooks                              │    │
│  │    • Identify gaps in documentation                     │    │
│  │    • No actual failover                                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 2. Component Failover (Monthly)                         │    │
│  │    • Test individual component failover                 │    │
│  │    • RDS failover, AZ failure simulation                │    │
│  │    • Measure actual RTO                                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 3. Full DR Drill (Annually)                             │    │
│  │    • Complete region failover                           │    │
│  │    • Run production traffic in DR region                │    │
│  │    • Validate all systems                               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Chaos Engineering with FIS

json
{
  "description": "AZ failure simulation",
  "roleArn": "arn:aws:iam::123456789012:role/<fis-experiment-role>",
  "targets": {
    "ec2-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Environment": "production"
      },
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stop-instances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {},
      "targets": {
        "Instances": "ec2-instances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ]
}
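
Because this experiment stops every matching production instance, it is worth linting the template before it is created. The sketch below is a hypothetical pre-flight check, not part of FIS itself: it only verifies that the template has a real stop condition and that each target is restricted to an expected Availability Zone.

```python
def check_fis_template(template: dict, allowed_azs: set) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    # Without an alarm-based stop condition the experiment cannot abort itself.
    sources = [cond.get("source") for cond in template.get("stopConditions", [])]
    if not any(source and source != "none" for source in sources):
        problems.append("no CloudWatch alarm stop condition")

    for name, target in template.get("targets", {}).items():
        # Collect every AZ named in an AvailabilityZone filter on this target.
        azs = {
            value
            for flt in target.get("filters", [])
            if flt.get("path") == "Placement.AvailabilityZone"
            for value in flt.get("values", [])
        }
        if not azs:
            problems.append(f"target '{name}' is not limited to an Availability Zone")
        elif not azs <= allowed_azs:
            unexpected = sorted(azs - allowed_azs)
            problems.append(f"target '{name}' touches unexpected AZs: {unexpected}")
    return problems
```

A typical use would be to load the JSON file with `json.load` and refuse to proceed when `check_fis_template(template, {"us-east-1a"})` returns anything.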

Runbook Template

markdown
# DR Runbook: Region Failover

## Pre-Conditions
- [ ] DR region infrastructure verified
- [ ] Database replication lag < 5 minutes
- [ ] On-call team notified

## Failover Steps

### 1. Assess Situation (5 min)
- Confirm primary region is unavailable
- Check replication status
- Notify stakeholders

### 2. Database Failover (10 min)
- Promote RDS read replica to primary
- Update application connection strings
- Verify database connectivity

### 3. Application Failover (15 min)
- Scale up ASG in DR region
- Verify application health checks
- Update Route 53 to DR region

### 4. Validation (10 min)
- Run smoke tests
- Verify critical user journeys
- Monitor error rates

## Rollback Procedure
[Steps to fail back to primary region]

## Contacts
- Primary: [name] - [phone]
- Secondary: [name] - [phone]
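
The time budgets in the step headings add up to a planned recovery time, which should not exceed the workload's RTO. As a rough sketch, assuming headings keep the `### N. Title (X min)` form used in the template, the budgets can be extracted and summed like this (`step_budgets` is a hypothetical helper):

```python
import re

# Matches headings such as "### 2. Database Failover (10 min)".
STEP_HEADING = re.compile(
    r"^###\s+\d+\.\s+(?P<title>.+?)\s+\((?P<minutes>\d+)\s*min\)\s*$"
)


def step_budgets(runbook_markdown: str) -> dict:
    """Map each step title to its time budget in minutes."""
    budgets = {}
    for line in runbook_markdown.splitlines():
        match = STEP_HEADING.match(line)
        if match:
            budgets[match.group("title")] = int(match.group("minutes"))
    return budgets


runbook = """\
### 1. Assess Situation (5 min)
### 2. Database Failover (10 min)
### 3. Application Failover (15 min)
### 4. Validation (10 min)
"""

budgets = step_budgets(runbook)
planned_minutes = sum(budgets.values())
rto_target_minutes = 60  # assumed target for this example
print(planned_minutes, planned_minutes <= rto_target_minutes)
```

The planned total is only a lower bound on real recovery time; detection and decision time before step 1 count toward RTO as well.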

Best Practices Checklist

  • [ ] Define RTO/RPO for each workload
  • [ ] Implement Multi-AZ for all production workloads
  • [ ] Set up cross-region replication for critical data
  • [ ] Create and test DR runbooks
  • [ ] Conduct regular DR drills
  • [ ] Automate failover where possible
  • [ ] Monitor replication lag
  • [ ] Document dependencies and failover order

⚖️ Trade-offs

Trade-off 1: RTO vs Cost

| RTO Target | Strategy | Monthly Cost Estimate | ROI Consideration |
|---|---|---|---|
| 24h+ | Backup & Restore | ~$100/TB stored | Accept 24h+ downtime |
| 1-4h | Pilot Light | ~$500/month base | Balance cost/recovery |
| 15-60 min | Warm Standby | ~$2,000-5,000/month | Important but not critical |
| < 1 min | Active-Active | 2x production cost | Mission-critical |

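Example downtime cost calculation:

Revenue: $1M/day
Downtime cost: $41,667/hour

Backup & Restore (24h RTO): potential loss per outage = $1M
Warm Standby ($5K/month = $60K/year): break-even = 1 averted 24h outage every ~16 years ($1M / $60K)

At an outage probability of 20%/year (1 outage in 5 years), the expected loss avoided is about $200K/year against $60K/year in cost, so Warm Standby is clearly worth it.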
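
The break-even point can be recomputed for other revenue and cost figures. The sketch below returns the annual outage probability above which a standby pays for itself. It assumes that every outage without DR costs a full 24 hours of revenue and that Warm Standby recovers within 1 hour (the upper end of its 15-60 minute range); the function name is hypothetical.

```python
def break_even_outage_probability(annual_dr_cost: float,
                                  loss_without_dr: float,
                                  loss_with_dr: float = 0.0) -> float:
    """Annual outage probability at which DR cost equals the expected loss avoided."""
    avoided = loss_without_dr - loss_with_dr
    if avoided <= 0:
        raise ValueError("DR must reduce the loss per outage")
    return annual_dr_cost / avoided


daily_revenue = 1_000_000
hourly_cost = daily_revenue / 24  # about $41,667 per hour

probability = break_even_outage_probability(
    annual_dr_cost=5_000 * 12,         # Warm Standby at $5K/month
    loss_without_dr=hourly_cost * 24,  # assumed 24h outage with Backup & Restore
    loss_with_dr=hourly_cost * 1,      # assumed 1h outage with Warm Standby
)
print(f"break-even outage probability: {probability:.1%} per year")
```

If the estimated yearly probability of a region-level outage is above the returned value, the standby costs less than the expected loss it prevents.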

Trade-off 2: RPO vs Replication Cost

| RPO Target | Replication Method | Cost Impact |
|---|---|---|
| 24h | Daily backups | Lowest |
| 1h | Hourly snapshots | Low |
| Minutes | Async replication | Medium |
| Near-zero | Sync replication | High + latency |

Trade-off 3: Automation vs Complexity

| Approach | Pros | Cons |
|---|---|---|
| Manual failover | Simple, full control | Slow, human error |
| Automated failover | Fast, consistent | Complex, false positives |
| Hybrid | Balanced | Moderate complexity |
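
One common way to limit false positives in automated or hybrid failover is to act only after several consecutive failed health checks rather than on a single failure. The class below is a minimal sketch of that debounce logic; `FailoverTrigger` and its default threshold of 3 are assumptions for illustration.

```python
from collections import deque


class FailoverTrigger:
    """Recommend failover only after `threshold` consecutive failed health checks."""

    def __init__(self, threshold: int = 3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        # Keep only the most recent `threshold` results.
        self._recent = deque(maxlen=threshold)

    def record(self, healthy: bool) -> bool:
        """Store one health-check result and return True if failover is recommended."""
        self._recent.append(healthy)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and not any(self._recent)


trigger = FailoverTrigger(threshold=3)
for result in (False, False, True, False, False, False):
    print(result, trigger.record(result))
```

In a hybrid setup, a `True` result could page the on-call engineer for approval instead of starting the failover directly.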

🚨 Failure Modes

Failure Mode 1: Replication Lag Undetected

🔥 Real-world incident

Database replication had been failing for a week without anyone noticing. When the primary region went down, the DR database was missing a week of data. RPO was 0 on paper and 1 week in practice, and data reconstruction cost $2M.

| How to detect | How to prevent |
|---|---|
| Monitor replication lag metric | CloudWatch alarm for lag > threshold |
| Daily replication health check | Automated validation job |
| DR drill failures | Monthly replication verification |
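
A lag alarm only helps if it also fires when the metric stops arriving, which is what can happen when replication is broken rather than slow. The sketch below builds the arguments for CloudWatch's `put_metric_alarm` with missing data treated as breaching. The alarm name, the five-period evaluation window, and the helper name are assumptions; the 60-second default mirrors the replication alert threshold used elsewhere on this page.

```python
def replica_lag_alarm_params(db_instance_id: str, threshold_seconds: int = 60) -> dict:
    """Build keyword arguments for a CloudWatch alarm on RDS ReplicaLag."""
    return {
        "AlarmName": f"{db_instance_id}-replica-lag",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # evaluate one-minute datapoints
        "EvaluationPeriods": 5,  # assumed: alarm after 5 breaching minutes
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        # No datapoints at all is treated as a failure, not as "OK".
        "TreatMissingData": "breaching",
    }


params = replica_lag_alarm_params("orders-db-replica")  # hypothetical instance name
print(params["AlarmName"], params["TreatMissingData"])

# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

How a stopped replica reports its lag can differ between database engines, so the alarm should be verified by deliberately pausing replication during a drill.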

Failure Mode 2: DR Runbook Outdated

| How to detect | How to prevent |
|---|---|
| Failed DR drill | Quarterly runbook review |
| Missing new services | IaC-based runbooks (auto-update) |
| Wrong credentials | Credential rotation reminder |
| Expired certificates | Certificate expiry monitoring |

Failure Mode 3: Cascading Failures

┌─────────────────────────────────────────────────────────────────┐
│                 CASCADE FAILURE PATTERN                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Database slow → 2. App timeouts → 3. Queue backlog          │
│                                    ↓                             │
│  4. Retry storms → 5. More DB load → 6. Database crash           │
│                                    ↓                             │
│  7. All dependent services fail                                 │
│                                                                 │
│  Mitigation: Circuit breakers, bulkheads, rate limiting        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

| How to detect | How to prevent |
|---|---|
| Error rate spike across services | Circuit breakers per dependency |
| Queue depth growing exponentially | Bulkhead pattern (service isolation) |
| Retry metrics spike | Exponential backoff with jitter |
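
Two of these mitigations are small enough to sketch directly: exponential backoff with full jitter, which spreads retries out so they do not arrive in synchronized waves, and a circuit breaker, which stops calling a failing dependency for a cooling-off period. This is a simplified illustration with assumed default values, not a production library.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter backoff: a random delay below min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


class CircuitBreaker:
    """Open after repeated failures; allow a trial request once the timeout passes."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the timeout has elapsed, then allow a trial (half-open).
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial
```

A caller would check `allow_request()` before each call to the dependency, report the outcome with `record_success()` or `record_failure()`, and sleep for `backoff_delay(attempt)` between retries.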

🔐 Security Baseline

DR Security Requirements

| Requirement | Implementation | Verification |
|---|---|---|
| DR credentials secure | Secrets Manager in DR region | Access audit |
| Cross-region encryption | KMS multi-region keys | Key policy review |
| Network security parity | IaC-managed Security Groups | Config drift detection |
| IAM parity | IAM Access Analyzer | Cross-region comparison |

Data Sovereignty

| Consideration | Implementation |
|---|---|
| GDPR data residency | DR region within the EU |
| Data classification | Replicate only allowed data |
| Encryption keys | Region-specific CMKs if required |
| Audit logs | Replicate to both regions |

📊 Ops Readiness

Metrics to Monitor

| Component | Metric | Alert Threshold |
|---|---|---|
| RDS Replication | ReplicaLag | > 60 seconds |
| S3 Replication | ReplicationLatency | > 15 minutes |
| Route 53 Health | HealthCheckStatus | Unhealthy |
| DR Region | ResourceCount | Mismatch with primary |
| Backup | BackupJobStatus | Failed |

DR Drill Schedule

| Drill Type | Frequency | Duration | Scope |
|---|---|---|---|
| Tabletop | Monthly | 1-2 hours | Review runbooks |
| Partial failover | Quarterly | 2-4 hours | Non-critical services |
| Full failover | Annually | 4-8 hours | All production |
| Chaos engineering | Ongoing | Varies | Continuous validation |
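
A schedule like this is easy to let slip, so it can help to check the last drill dates automatically. The sketch below flags drill types that have never run or are overdue. The maximum intervals (31, 92, and 366 days) are an assumed reading of "Monthly", "Quarterly", and "Annually", and `overdue_drills` is a hypothetical helper.

```python
from datetime import date, timedelta

# Longest acceptable gap between two drills of each type (assumed values).
MAX_INTERVAL = {
    "Tabletop": timedelta(days=31),
    "Partial failover": timedelta(days=92),
    "Full failover": timedelta(days=366),
}


def overdue_drills(last_run: dict, today: date) -> list:
    """Return drill types that have never run or whose last run is too old."""
    overdue = []
    for drill, max_interval in MAX_INTERVAL.items():
        last = last_run.get(drill)
        if last is None or today - last > max_interval:
            overdue.append(drill)
    return overdue


# Hypothetical drill history.
history = {
    "Tabletop": date(2024, 5, 15),
    "Partial failover": date(2024, 1, 1),
    "Full failover": date(2023, 7, 1),
}
print(overdue_drills(history, today=date(2024, 6, 1)))
```

Running such a check on a schedule turns "last successful drill date documented" from a checklist item into an alert.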

Runbook Entry Points

| Scenario | Runbook |
|---|---|
| Primary region outage | runbook/full-region-failover.md |
| Database failover | runbook/database-failover.md |
| Replication lag alert | runbook/replication-lag-investigation.md |
| Failed DR drill | runbook/dr-drill-failure-analysis.md |
| Failback to primary | runbook/failback-procedure.md |
| Partial service recovery | runbook/partial-failover.md |

Design Review Checklist

RTO/RPO

  • [ ] RTO/RPO defined and documented for every workload
  • [ ] Business sign-off on targets
  • [ ] Cost-benefit analysis completed
  • [ ] Dependencies mapped

Architecture

  • [ ] Multi-AZ for all production workloads
  • [ ] DR strategy matches RTO/RPO
  • [ ] Replication configured and monitored
  • [ ] Failover automation tested

Testing

  • [ ] DR drills scheduled
  • [ ] Runbooks up-to-date
  • [ ] Credentials valid
  • [ ] Last successful drill date documented

Operations

  • [ ] Replication lag monitoring
  • [ ] Health check endpoints
  • [ ] On-call staff trained on DR
  • [ ] Post-incident review process

📎 Links