🛡️ Reliability & DR

Level: Ops
Solves: Designing and implementing high availability and disaster recovery strategies for enterprise workloads

🎯 Outcomes

After applying the material on this page, you will be able to:

Define RTO/RPO targets for each workload based on business requirements
Design a Multi-AZ Architecture for high availability within a single region
Deploy Multi-Region DR with an appropriate strategy (Pilot Light, Warm Standby, Active-Active)
Build DR Runbooks with clear procedures and testing schedules
Implement Chaos Engineering to validate resilience assumptions
Optimize DR Cost based on actual RTO/RPO requirements

✅ When to use

DR Strategy	RTO	RPO	Use Case	Cost
Backup & Restore	24h+	24h+	Non-critical, cost-sensitive	Lowest
Pilot Light	1-4h	Minutes	Business applications	Low
Warm Standby	15-60min	Near-zero	Important workloads	Medium
Multi-Site Active	Seconds	Near-zero	Mission-critical	Highest

❌ When NOT to use

Pattern	Problem	Alternative
Active-Active for dev/test	Unnecessarily high cost	Backup & Restore
Multi-region for a single-market app	Added complexity and latency	Multi-AZ
Manual failover for critical systems	RTO too long	Automated failover
Untested DR	No way to know whether it works	Regular DR drills

⚠️ Warning from Raizo

"A company was investing $200K/year in DR infrastructure. When it actually needed to fail over, the runbook had not been updated in 2 years, credentials had expired, and replication had been broken for 3 months. Untested DR is no DR."

Reliability Fundamentals

Availability Targets

┌─────────────────────────────────────────────────────────────────┐
│                 AVAILABILITY TARGETS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Availability    Downtime/Year    Downtime/Month    Use Case    │
│  ────────────    ─────────────    ──────────────    ────────    │
│  99%             3.65 days        7.3 hours         Dev/Test    │
│  99.9%           8.76 hours       43.8 minutes      Standard    │
│  99.95%          4.38 hours       21.9 minutes      Business    │
│  99.99%          52.6 minutes     4.38 minutes      Critical    │
│  99.999%         5.26 minutes     26.3 seconds      Mission     │
│                                                                 │
│  COST IMPLICATION:                                              │
│  Each additional "9" typically doubles infrastructure cost      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
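The downtime figures in the table follow directly from the availability percentage. A minimal Python sketch of that arithmetic, assuming a 365-day year and a month taken as one twelfth of a year (the function name `downtime_budget_seconds` is illustrative and not part of any library):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_budget_seconds(availability_pct: float,
                            period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime allowed in a period for a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_seconds * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        yearly = downtime_budget_seconds(target)
        monthly = yearly / 12  # assumption: a month is 1/12 of a year
        print(f"{target}%: {yearly / 3600:.2f} h/year, {monthly / 60:.1f} min/month")
```

Running the same function against a measured outage total gives a quick check on whether a workload is still inside its budget for the year.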

RTO vs RPO

┌─────────────────────────────────────────────────────────────────┐
│                    RTO vs RPO                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Timeline:                                                      │
│                                                                 │
│  ◄────────── RPO ──────────►│◄────────── RTO ──────────►        │
│                              │                                  │
│  Last Backup    Data Loss    │    Disaster    Recovery          │
│  ─────────────────────────────────────────────────────────      │
│       │              │       │         │           │            │
│       ▼              ▼       ▼         ▼           ▼            │
│  ┌─────────┐    ┌─────────┐  │    ┌─────────┐  ┌─────────┐     │
│  │ Backup  │    │  Lost   │  │    │ Outage  │  │ Service │     │
│  │  Point  │    │  Data   │  │    │ Starts  │  │ Restored│     │
│  └─────────┘    └─────────┘  │    └─────────┘  └─────────┘     │
│                              │                                  │
│  RPO = Recovery Point Objective (acceptable data loss)          │
│  RTO = Recovery Time Objective (acceptable downtime)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
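After a drill or a real incident, the two objectives can be checked from three timestamps: the last recovery point, the start of the outage, and the moment service was restored. The sketch below is one way to do that; the class and function names are hypothetical, and it assumes all timestamps share a time zone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # acceptable data loss window
    rto: timedelta  # acceptable downtime


def evaluate_recovery(last_recovery_point: datetime,
                      outage_start: datetime,
                      service_restored: datetime,
                      objectives: RecoveryObjectives) -> dict:
    """Compare measured data loss and downtime against the RPO/RTO targets."""
    data_loss_window = outage_start - last_recovery_point
    downtime = service_restored - outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


if __name__ == "__main__":
    targets = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
    report = evaluate_recovery(
        last_recovery_point=datetime(2024, 1, 1, 11, 50),
        outage_start=datetime(2024, 1, 1, 12, 0),
        service_restored=datetime(2024, 1, 1, 13, 30),
        objectives=targets,
    )
    print(report)
```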

Multi-AZ Architecture

Standard Multi-AZ Pattern

┌─────────────────────────────────────────────────────────────────┐
│                 MULTI-AZ ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Region: us-east-1                                              │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Application Load Balancer             │    │
│  └────────────────────────────┬────────────────────────────┘    │
│                               │                                 │
│         ┌─────────────────────┼─────────────────────┐           │
│         │                     │                     │           │
│         ▼                     ▼                     ▼           │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │    AZ-A     │       │    AZ-B     │       │    AZ-C     │    │
│  │             │       │             │       │             │    │
│  │ ┌─────────┐ │       │ ┌─────────┐ │       │ ┌─────────┐ │    │
│  │ │   App   │ │       │ │   App   │ │       │ │   App   │ │    │
│  │ │ (ASG)   │ │       │ │ (ASG)   │ │       │ │ (ASG)   │ │    │
│  │ └─────────┘ │       │ └─────────┘ │       │ └─────────┘ │    │
│  │             │       │             │       │             │    │
│  │ ┌─────────┐ │       │ ┌─────────┐ │       │             │    │
│  │ │   RDS   │◄┼───────┼►│   RDS   │ │       │             │    │
│  │ │ Primary │ │ Sync  │ │ Standby │ │       │             │    │
│  │ └─────────┘ │       │ └─────────┘ │       │             │    │
│  └─────────────┘       └─────────────┘       └─────────────┘    │
│                                                                 │
│  Failover: Automatic for RDS, ALB routes to healthy targets     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Service-Specific Multi-AZ

Service	Multi-AZ Behavior	Failover Time
RDS	Synchronous standby	60-120 seconds
Aurora	6 copies across 3 AZs	< 30 seconds
ElastiCache	Replica in different AZ	Seconds
EFS	Automatic across AZs	Transparent
S3	Automatic across AZs	Transparent
DynamoDB	Automatic across AZs	Transparent

Multi-Region Architecture

Active-Passive Pattern

┌─────────────────────────────────────────────────────────────────┐
│              ACTIVE-PASSIVE MULTI-REGION                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                      Route 53                            │    │
│  │              (Failover Routing Policy)                   │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                           │                                     │
│         ┌─────────────────┴─────────────────┐                   │
│         │ Primary                │ Secondary                    │
│         ▼                        ▼                              │
│  ┌─────────────┐          ┌─────────────┐                       │
│  │ us-east-1   │          │ us-west-2   │                       │
│  │  (Active)   │          │  (Passive)  │                       │
│  │             │          │             │                       │
│  │ ┌─────────┐ │  Async   │ ┌─────────┐ │                       │
│  │ │   RDS   │─┼──Replica─┼►│   RDS   │ │                       │
│  │ │ Primary │ │          │ │ Read Rep│ │                       │
│  │ └─────────┘ │          │ └─────────┘ │                       │
│  │             │          │             │                       │
│  │ ┌─────────┐ │  Cross-  │ ┌─────────┐ │                       │
│  │ │   S3    │─┼──Region──┼►│   S3    │ │                       │
│  │ │         │ │  Replic. │ │         │ │                       │
│  │ └─────────┘ │          │ └─────────┘ │                       │
│  └─────────────┘          └─────────────┘                       │
│                                                                 │
│  RPO: Minutes (async replication lag)                           │
│  RTO: Minutes to hours (manual promotion)                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
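The Route 53 failover routing policy in the diagram comes down to two record sets with the same name, one marked PRIMARY and one marked SECONDARY. The sketch below only builds the `ChangeBatch` structure that `change_resource_record_sets` accepts. The helper names, hostnames, and health check ID are placeholders, and it assumes CNAME records on a non-apex name with a health check attached to the primary.

```python
from typing import Optional


def failover_record(name: str, target_dns: str, role: str, set_id: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one UPSERT change for a failover record set."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,  # a short TTL keeps DNS caches from delaying failover
        "ResourceRecords": [{"Value": target_dns}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name: str, primary_dns: str, secondary_dns: str,
                          primary_health_check_id: str) -> dict:
    """Build the ChangeBatch for an active-passive pair."""
    return {
        "Comment": "Active-passive failover pair",
        "Changes": [
            failover_record(name, primary_dns, "PRIMARY", "primary",
                            primary_health_check_id),
            failover_record(name, secondary_dns, "SECONDARY", "secondary"),
        ],
    }


# Applying it (not executed here; requires boto3 and AWS credentials):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```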

Active-Active Pattern

┌─────────────────────────────────────────────────────────────────┐
│              ACTIVE-ACTIVE MULTI-REGION                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                      Route 53                            │    │
│  │           (Latency or Geolocation Routing)               │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                           │                                     │
│         ┌─────────────────┴─────────────────┐                   │
│         ▼                                   ▼                   │
│  ┌─────────────┐                     ┌─────────────┐            │
│  │ us-east-1   │                     │ eu-west-1   │            │
│  │  (Active)   │                     │  (Active)   │            │
│  │             │                     │             │            │
│  │ ┌─────────┐ │                     │ ┌─────────┐ │            │
│  │ │ Aurora  │◄┼─────Global DB───────┼►│ Aurora  │ │            │
│  │ │ Global  │ │   (< 1s latency)    │ │ Global  │ │            │
│  │ └─────────┘ │                     │ └─────────┘ │            │
│  │             │                     │             │            │
│  │ ┌─────────┐ │                     │ ┌─────────┐ │            │
│  │ │DynamoDB │◄┼───Global Tables─────┼►│DynamoDB │ │            │
│  │ │ Global  │ │                     │ │ Global  │ │            │
│  │ └─────────┘ │                     │ └─────────┘ │            │
│  └─────────────┘                     └─────────────┘            │
│                                                                 │
│  RPO: Near-zero (asynchronous replication, typically < 1s lag)  │
│  RTO: Seconds (automatic failover)                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

DR Strategy Selection

Strategy Comparison

Strategy	RTO	RPO	Cost	Use Case
Backup & Restore	Hours	Hours	$	Non-critical
Pilot Light	Tens of minutes	Minutes	$$	Important
Warm Standby	Minutes	Seconds-Minutes	$$$	Business Critical
Multi-Site Active	Seconds	Near-zero	$$$$	Mission Critical
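Choosing a strategy amounts to picking the cheapest option whose worst case still meets the workload's targets. The sketch below encodes that rule. The RTO bounds are taken from the "When to use" table on this page; the numeric RPO bounds standing in for "Minutes" and "Near-zero" are assumptions and should be replaced with measured values.

```python
from datetime import timedelta

# (strategy, assumed worst-case RTO, assumed worst-case RPO), cheapest first.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=4), timedelta(minutes=15)),
    ("Warm Standby", timedelta(hours=1), timedelta(minutes=1)),
    ("Multi-Site Active", timedelta(minutes=1), timedelta(seconds=1)),
]


def select_strategy(required_rto: timedelta, required_rpo: timedelta) -> str:
    """Return the cheapest strategy whose worst case meets both targets."""
    for name, worst_rto, worst_rpo in STRATEGIES:
        if worst_rto <= required_rto and worst_rpo <= required_rpo:
            return name
    # Nothing cheaper qualifies, so fall back to the most resilient option.
    return STRATEGIES[-1][0]


if __name__ == "__main__":
    print(select_strategy(timedelta(hours=4), timedelta(minutes=30)))
    print(select_strategy(timedelta(hours=1), timedelta(minutes=5)))
```

Because the rule compares against the worst case of each range, it is deliberately conservative: a 30-minute RTO requirement skips Warm Standby, whose range on this page extends to 60 minutes.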

Pilot Light Implementation

┌─────────────────────────────────────────────────────────────────┐
│                    PILOT LIGHT PATTERN                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary Region (Active)         DR Region (Pilot Light)        │
│  ┌─────────────────────┐        ┌─────────────────────┐         │
│  │                     │        │                     │         │
│  │ ┌─────────────────┐ │        │ ┌─────────────────┐ │         │
│  │ │ Full App Stack  │ │        │ │ Minimal Infra   │ │         │
│  │ │ • Web servers   │ │        │ │ • DB replica    │ │         │
│  │ │ • App servers   │ │        │ │ • Core configs  │ │         │
│  │ │ • Databases     │ │        │ │ • AMIs ready    │ │         │
│  │ │ • Caches        │ │        │ │                 │ │         │
│  │ └─────────────────┘ │        │ └─────────────────┘ │         │
│  │                     │        │                     │         │
│  └─────────────────────┘        └─────────────────────┘         │
│                                                                 │
│  On Failover:                                                   │
│  1. Scale up compute (ASG, ECS)                                 │
│  2. Promote DB replica                                          │
│  3. Update DNS                                                  │
│  4. Warm up caches                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
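The four failover steps are order-dependent, so an automated version should run them in sequence, time each one, and stop at the first failure. A minimal sketch of such a runner follows. The step bodies are stand-ins; real steps would call the relevant AWS APIs, which are not shown here.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    status: str      # "ok" or "failed"
    seconds: float   # wall-clock duration of the step
    error: str = ""


def run_failover(steps: List[Tuple[str, Callable[[], None]]],
                 clock: Callable[[], float] = time.monotonic) -> List[StepResult]:
    """Run steps in order, record timings, and stop at the first failure."""
    results: List[StepResult] = []
    for name, action in steps:
        started = clock()
        try:
            action()
        except Exception as exc:
            results.append(StepResult(name, "failed", clock() - started, str(exc)))
            break  # later steps depend on earlier ones, so do not continue
        results.append(StepResult(name, "ok", clock() - started))
    return results


if __name__ == "__main__":
    # Placeholder actions standing in for the real operations.
    plan = [
        ("Scale up compute", lambda: None),
        ("Promote DB replica", lambda: None),
        ("Update DNS", lambda: None),
        ("Warm up caches", lambda: None),
    ]
    for result in run_failover(plan):
        print(f"{result.name}: {result.status} ({result.seconds:.3f}s)")
```

Summing the recorded durations after a drill gives the measured RTO for the pattern.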

Game Days & Testing

DR Testing Framework

┌─────────────────────────────────────────────────────────────────┐
│                 DR TESTING FRAMEWORK                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Test Types:                                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 1. Tabletop Exercise (Quarterly)                        │    │
│  │    • Walk through runbooks                              │    │
│  │    • Identify gaps in documentation                     │    │
│  │    • No actual failover                                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 2. Component Failover (Monthly)                         │    │
│  │    • Test individual component failover                 │    │
│  │    • RDS failover, AZ failure simulation                │    │
│  │    • Measure actual RTO                                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 3. Full DR Drill (Annually)                             │    │
│  │    • Complete region failover                           │    │
│  │    • Run production traffic in DR region                │    │
│  │    • Validate all systems                               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Chaos Engineering with FIS

```json
{
  "description": "AZ failure simulation",
  "targets": {
    "ec2-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Environment": "production"
      },
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stop-instances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {},
      "targets": {
        "Instances": "ec2-instances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ]
}
```
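Because this template stops every matching production instance, it is worth linting before it is submitted. The sketch below runs a few structural checks on a template loaded as a Python dict. The function name and the specific rules are illustrative choices rather than an official validator. Note that creating a template through the FIS API also requires a `roleArn`, which the example above omits.

```python
from typing import Any, Dict, List


def validate_fis_template(template: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an FIS experiment template dict."""
    problems: List[str] = []
    targets = template.get("targets", {})

    if not template.get("roleArn"):
        problems.append("missing roleArn (IAM role that FIS assumes)")
    if not template.get("stopConditions"):
        problems.append("no stopConditions: nothing would halt the experiment")

    # Every action must point at a target defined in the same template.
    for action_name, action in template.get("actions", {}).items():
        for target_name in action.get("targets", {}).values():
            if target_name not in targets:
                problems.append(
                    f"action '{action_name}' references undefined target '{target_name}'"
                )

    # Guard rail (our own rule): selecting ALL resources needs a narrowing filter.
    for target_name, target in targets.items():
        scoped = target.get("filters") or target.get("resourceArns")
        if target.get("selectionMode") == "ALL" and not scoped:
            problems.append(
                f"target '{target_name}' selects ALL tagged resources with no filter"
            )
    return problems
```

An empty list means the template passed these checks; it says nothing about whether the IAM role has the permissions the actions need.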

Runbook Template

```markdown
# DR Runbook: Region Failover

## Pre-Conditions
- [ ] DR region infrastructure verified
- [ ] Database replication lag < 5 minutes
- [ ] On-call team notified

## Failover Steps

### 1. Assess Situation (5 min)
- Confirm primary region is unavailable
- Check replication status
- Notify stakeholders

### 2. Database Failover (10 min)
- Promote RDS read replica to primary
- Update application connection strings
- Verify database connectivity

### 3. Application Failover (15 min)
- Scale up ASG in DR region
- Verify application health checks
- Update Route 53 to DR region

### 4. Validation (10 min)
- Run smoke tests
- Verify critical user journeys
- Monitor error rates

## Rollback Procedure
[Steps to fail back to primary region]

## Contacts
- Primary: [name] - [phone]
- Secondary: [name] - [phone]
```

Best Practices Checklist

[ ] Define RTO/RPO for each workload
[ ] Implement Multi-AZ for all production workloads
[ ] Set up cross-region replication for critical data
[ ] Create and test DR runbooks
[ ] Conduct regular DR drills
[ ] Automate failover where possible
[ ] Monitor replication lag
[ ] Document dependencies and failover order

⚖️ Trade-offs

Trade-off 1: RTO vs Cost

RTO Target	Strategy	Monthly Cost Estimate	ROI Consideration
24h+	Backup & Restore	~$100/TB stored	Accept 24h+ downtime
1-4h	Pilot Light	~$500/month base	Balance cost/recovery
15-60min	Warm Standby	~$2,000-5,000/month	Important but not critical
< 1min	Active-Active	2x production cost	Mission-critical

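Example downtime cost calculation:

Revenue: $1M/day
Downtime cost: ~$41,667/hour ($1M / 24h)

Backup & Restore (24h RTO): potential loss per outage ≈ $1M
Warm Standby ($5K/month = $60K/year, 1h RTO): potential loss per outage ≈ $42K, so ≈ $958K avoided per outage
Break-even: $60K / $958K ≈ 6% outage probability per year (about 1 outage every 16 years)

If the annual probability of a full outage exceeds roughly 6%, Warm Standby is worth it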
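The break-even point can be computed rather than estimated. The sketch below is a simplified model with hypothetical function names: it assumes every outage lasts exactly the strategy's RTO, that revenue is lost linearly over time, and that the cheaper strategy's own running cost is negligible.

```python
def hourly_downtime_cost(daily_revenue: float) -> float:
    """Revenue lost per hour of downtime, assuming revenue is spread evenly."""
    return daily_revenue / 24


def break_even_outage_rate(daily_revenue: float,
                           baseline_rto_hours: float,
                           improved_rto_hours: float,
                           extra_annual_cost: float) -> float:
    """Outages per year at which the faster strategy pays for itself."""
    saved_per_outage = hourly_downtime_cost(daily_revenue) * (
        baseline_rto_hours - improved_rto_hours
    )
    if saved_per_outage <= 0:
        raise ValueError("improved RTO must be shorter than the baseline RTO")
    return extra_annual_cost / saved_per_outage


if __name__ == "__main__":
    rate = break_even_outage_rate(
        daily_revenue=1_000_000,
        baseline_rto_hours=24,     # Backup & Restore
        improved_rto_hours=1,      # Warm Standby, upper end of 15-60 min
        extra_annual_cost=5_000 * 12,
    )
    print(f"Break-even: {rate:.3f} outages/year (one every {1 / rate:.1f} years)")
```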

Trade-off 2: RPO vs Replication Cost

RPO Target	Replication Method	Cost Impact
24h	Daily backups	Lowest
1h	Hourly snapshots	Low
Minutes	Async replication	Medium
Near-zero	Sync replication	High + latency

Trade-off 3: Automation vs Complexity

Approach	Pros	Cons
Manual failover	Simple, full control	Slow, human error
Automated failover	Fast, consistent	Complex, false positives
Hybrid	Balanced	Moderate complexity

🚨 Failure Modes

Failure Mode 1: Replication Lag Undetected

🔥 Real-world incident

Database replication had been failing for a week without anyone noticing. When the primary region went down, the DR database was missing a week of data. RPO was 0 on paper and 1 week in practice, and data reconstruction cost $2M.

Detection	Prevention
Monitor replication lag metric	CloudWatch alarm cho lag > threshold
Daily replication health check	Automated validation job
DR drill failures	Monthly replication verification
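A lag threshold alone misses the case in the incident above: when replication stops entirely, the lag metric may stop reporting instead of climbing. A daily health check should therefore treat missing or stale samples as a failure. The sketch below shows that logic on plain `(timestamp, lag_seconds)` samples; the status names and default thresholds are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def replication_status(samples: List[Tuple[datetime, float]],
                       now: datetime,
                       max_lag_seconds: float = 60.0,
                       max_sample_age: timedelta = timedelta(minutes=5)) -> str:
    """Classify replication health from recent lag samples."""
    if not samples:
        return "NO_DATA"   # the metric never reported: treat as broken
    latest_time, latest_lag = max(samples, key=lambda sample: sample[0])
    if now - latest_time > max_sample_age:
        return "STALE"     # the metric stopped reporting: replication may be dead
    if latest_lag > max_lag_seconds:
        return "LAGGING"
    return "OK"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    print(replication_status([(now - timedelta(hours=3), 2.0)], now))
```

Anything other than `OK` should page someone, since the DR copy can no longer be assumed to meet its RPO.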

Failure Mode 2: DR Runbook Outdated

Detection	Prevention
Failed DR drill	Quarterly runbook review
Missing new services	IaC-based runbooks (auto-update)
Wrong credentials	Credential rotation reminder
Expired certificates	Certificate expiry monitoring

Failure Mode 3: Cascading Failures

┌─────────────────────────────────────────────────────────────────┐
│                 CASCADE FAILURE PATTERN                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Database slow → 2. App timeouts → 3. Queue backlog          │
│                                    ↓                             │
│  4. Retry storms → 5. More DB load → 6. Database crash           │
│                                    ↓                             │
│  7. All dependent services fail                                 │
│                                                                 │
│  Mitigation: Circuit breakers, bulkheads, rate limiting        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Detection	Prevention
Error rate spike across services	Circuit breakers per dependency
Queue depth growing exponentially	Bulkhead pattern (service isolation)
Retry metrics spike	Exponential backoff with jitter

🔐 Security Baseline

DR Security Requirements

Requirement	Implementation	Verification
DR credentials secure	Secrets Manager in DR region	Access audit
Cross-region encryption	KMS multi-region keys	Key policy review
Network security parity	IaC-managed Security Groups	Config drift detection
IAM parity	IAM Access Analyzer	Cross-region comparison

Data Sovereignty

Consideration	Implementation
GDPR data residency	DR region inside the EU
Data classification	Replicate only allowed data
Encryption keys	Region-specific CMKs if required
Audit logs	Replicate to both regions

📊 Ops Readiness

Metrics to Monitor

Component	Metric	Alert Threshold
RDS Replication	ReplicaLag	> 60 seconds
S3 Replication	ReplicationLatency	> 15 minutes
Route 53 Health	HealthCheckStatus	Unhealthy
DR Region	ResourceCount	Mismatch with primary
Backup	BackupJobStatus	Failed
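As an example of turning one row of this table into an alarm, the sketch below builds the keyword arguments for CloudWatch's `put_metric_alarm` for the RDS `ReplicaLag` threshold. The alarm name, evaluation window, and SNS topic are placeholders. Setting `TreatMissingData` to `breaching` is a deliberate choice here so that a replica that stops reporting also raises the alarm.

```python
def replica_lag_alarm(db_instance_id: str, sns_topic_arn: str,
                      threshold_seconds: float = 60.0) -> dict:
    """Build put_metric_alarm kwargs for an RDS replica lag alarm."""
    return {
        "AlarmName": f"dr-replica-lag-{db_instance_id}",
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                 # seconds per datapoint
        "EvaluationPeriods": 5,       # assumption: alarm after 5 bad minutes
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # no data counts as a failure
        "AlarmActions": [sns_topic_arn],
    }


# Applying it (not executed here; requires boto3 and AWS credentials):
#   boto3.client("cloudwatch").put_metric_alarm(**replica_lag_alarm(...))
```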

DR Drill Schedule

Drill Type	Frequency	Duration	Scope
Tabletop	Monthly	1-2 hours	Review runbooks
Partial failover	Quarterly	2-4 hours	Non-critical services
Full failover	Annually	4-8 hours	All production
Chaos engineering	Ongoing	Varies	Continuous validation
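A schedule like this tends to slip unless something checks it. The sketch below flags drill types whose last run is older than the interval in the table. The day counts are approximations of "monthly", "quarterly", and "annually", and the function name is hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

# Approximate maximum gap between runs, per the schedule table.
DRILL_INTERVALS: Dict[str, timedelta] = {
    "Tabletop": timedelta(days=31),
    "Partial failover": timedelta(days=92),
    "Full failover": timedelta(days=366),
}


def overdue_drills(last_run: Dict[str, date], today: date,
                   intervals: Dict[str, timedelta] = DRILL_INTERVALS) -> List[str]:
    """Return drill types that were never run or are past their interval."""
    overdue: List[str] = []
    for drill, interval in intervals.items():
        last = last_run.get(drill)
        if last is None or today - last > interval:
            overdue.append(drill)
    return overdue


if __name__ == "__main__":
    history = {"Tabletop": date(2024, 5, 20), "Partial failover": date(2024, 1, 10)}
    print(overdue_drills(history, today=date(2024, 6, 1)))
```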

Runbook Entry Points

Situation	Runbook
Primary region outage	`runbook/full-region-failover.md`
Database failover	`runbook/database-failover.md`
Replication lag alert	`runbook/replication-lag-investigation.md`
Failed DR drill	`runbook/dr-drill-failure-analysis.md`
Failback to primary	`runbook/failback-procedure.md`
Partial service recovery	`runbook/partial-failover.md`

✅ Design Review Checklist

RTO/RPO

[ ] RTO/RPO defined and documented for every workload
[ ] Business sign-off on targets
[ ] Cost-benefit analysis completed
[ ] Dependencies mapped

Architecture

[ ] Multi-AZ for all production workloads
[ ] DR strategy matches RTO/RPO
[ ] Replication configured and monitored
[ ] Failover automation tested

Testing

[ ] DR drills scheduled
[ ] Runbooks up-to-date
[ ] Credentials valid
[ ] Last successful drill date documented

Operations

[ ] Replication lag monitoring
[ ] Health check endpoints
[ ] On-call engineers trained on DR
[ ] Post-incident review process

📎 Links

📎 GCP Observability - Comparison with GCP's DR capabilities
📎 Observability & Auditing - Monitoring for DR
📎 Storage & Data Protection - Backup strategies
📎 VPC & Networking - Network architecture for DR
📎 Terraform Environments - Multi-region IaC
