🛡️ Reliability & DR
Level: Ops
Solves: Design and implement high availability and disaster recovery strategies for enterprise workloads
🎯 Objectives (Outcomes)
After applying the material on this page, you will be able to:
- Determine appropriate RTO/RPO for each workload based on business requirements
- Design a Multi-AZ architecture for high availability within a single region
- Deploy multi-region DR with an appropriate strategy (Pilot Light, Warm Standby, Active-Active)
- Build DR runbooks with clear procedures and testing schedules
- Implement chaos engineering to validate resilience assumptions
- Optimize DR cost based on actual RTO/RPO requirements
✅ When to use
| DR Strategy | RTO | RPO | Use Case | Cost |
|---|---|---|---|---|
| Backup & Restore | 24h+ | 24h+ | Non-critical, cost-sensitive | Lowest |
| Pilot Light | 1-4h | Minutes | Business applications | Low |
| Warm Standby | 15-60min | Seconds-minutes | Important workloads | Medium |
| Multi-Site Active | Seconds | Near-zero | Mission-critical | Highest |
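The table above can be sketched as a small selection function. This is a minimal sketch, not a prescription: the achievable RTO/RPO figures per strategy are illustrative assumptions drawn from the table.

```python
# Minimal sketch: map an RTO/RPO requirement to the cheapest strategy in the
# table above. The achievable-RTO/RPO figures are illustrative assumptions.

def select_dr_strategy(rto_hours: float, rpo_hours: float) -> str:
    """Return the cheapest strategy whose achievable RTO/RPO meet the target."""
    # (name, achievable RTO in hours, achievable RPO in hours), cheapest first
    strategies = [
        ("Backup & Restore", 24.0, 24.0),
        ("Pilot Light", 4.0, 0.25),          # minutes-level RPO, assumed ~15 min
        ("Warm Standby", 1.0, 1 / 60),       # up to 60 min RTO, ~1 min RPO
        ("Multi-Site Active", 1 / 60, 0.0),  # seconds RTO, near-zero RPO
    ]
    for name, achievable_rto, achievable_rpo in strategies:
        if achievable_rto <= rto_hours and achievable_rpo <= rpo_hours:
            return name
    return "Multi-Site Active"  # tightest targets need active-active

print(select_dr_strategy(rto_hours=24, rpo_hours=24))  # Backup & Restore
print(select_dr_strategy(rto_hours=2, rpo_hours=0.2))  # Warm Standby
```

In practice the decision also weighs dependencies and compliance, but encoding the thresholds keeps the strategy choice reviewable.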
❌ When NOT to use
| Pattern | Problem | Alternative |
|---|---|---|
| Active-Active for dev/test | Unnecessarily high cost | Backup & Restore |
| Multi-region for a single-market app | Added complexity and latency | Multi-AZ |
| Manual failover for critical systems | RTO too long | Automated failover |
| Untested DR | No way to know whether it works | Regular DR drills |
⚠️ A warning from Raizo
"A company invested $200K/year in DR infrastructure. When they actually needed to fail over, the runbook had not been updated in 2 years, credentials had expired, and replication had been broken for 3 months. Untested DR = no DR."
Reliability Fundamentals
Availability Targets
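The downtime figures in the table below follow directly from the availability fraction: downtime = total time × (1 − availability/100). A quick sanity check:

```python
# Derive downtime-per-year from an availability percentage; the results
# match the Availability Targets table (e.g. 99.9% -> 8.76 hours/year).

def downtime_per_year_hours(availability_pct: float) -> float:
    """Expected downtime per year, in hours, for a given availability %."""
    return 365 * 24 * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_per_year_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```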
┌─────────────────────────────────────────────────────────────────┐
│ AVAILABILITY TARGETS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Availability Downtime/Year Downtime/Month Use Case │
│ ──────────── ───────────── ────────────── ──────── │
│ 99% 3.65 days 7.3 hours Dev/Test │
│ 99.9% 8.76 hours 43.8 minutes Standard │
│ 99.95% 4.38 hours 21.9 minutes Business │
│ 99.99% 52.6 minutes 4.38 minutes Critical │
│ 99.999% 5.26 minutes 26.3 seconds Mission │
│ │
│ COST IMPLICATION: │
│ Each additional "9" typically doubles infrastructure cost │
│ │
└─────────────────────────────────────────────────────────────────┘

RTO vs RPO
┌─────────────────────────────────────────────────────────────────┐
│ RTO vs RPO │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Timeline: │
│ │
│ ◄────────── RPO ──────────►│◄────────── RTO ──────────► │
│ │ │
│ Last Backup Data Loss │ Disaster Recovery │
│ ───────────────────────────────────────────────────────── │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │ ┌─────────┐ ┌─────────┐ │
│ │ Backup │ │ Lost │ │ │ Outage │ │ Service │ │
│ │ Point │ │ Data │ │ │ Starts │ │ Restored│ │
│ └─────────┘ └─────────┘ │ └─────────┘ └─────────┘ │
│ │ │
│ RPO = Recovery Point Objective (acceptable data loss) │
│ RTO = Recovery Time Objective (acceptable downtime) │
│ │
└─────────────────────────────────────────────────────────────────┘

Multi-AZ Architecture
Standard Multi-AZ Pattern
┌─────────────────────────────────────────────────────────────────┐
│ MULTI-AZ ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Region: us-east-1 │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Application Load Balancer │ │
│ └────────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ AZ-A │ │ AZ-B │ │ AZ-C │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │ App │ │ │ │ App │ │ │ │ App │ │ │
│ │ │ (ASG) │ │ │ │ (ASG) │ │ │ │ (ASG) │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ │
│ │ │ RDS │◄┼───────┼►│ RDS │ │ │ │ │
│ │ │ Primary │ │ Sync │ │ Standby │ │ │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Failover: Automatic for RDS, ALB routes to healthy targets │
│ │
└─────────────────────────────────────────────────────────────────┘

Service-Specific Multi-AZ
| Service | Multi-AZ Behavior | Failover Time |
|---|---|---|
| RDS | Synchronous standby | 60-120 seconds |
| Aurora | 6 copies across 3 AZs | < 30 seconds |
| ElastiCache | Replica in different AZ | Seconds |
| EFS | Automatic across AZs | Transparent |
| S3 | Automatic across AZs | Transparent |
| DynamoDB | Automatic across AZs | Transparent |
Multi-Region Architecture
Active-Passive Pattern
┌─────────────────────────────────────────────────────────────────┐
│ ACTIVE-PASSIVE MULTI-REGION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Route 53 │ │
│ │ (Failover Routing Policy) │ │
│ └────────────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌─────────────────┴─────────────────┐ │
│ │ Primary │ Secondary │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ us-east-1 │ │ us-west-2 │ │
│ │ (Active) │ │ (Passive) │ │
│ │ │ │ │ │
│ │ ┌─────────┐ │ Async │ ┌─────────┐ │ │
│ │ │ RDS │─┼──Replica─┼►│ RDS │ │ │
│ │ │ Primary │ │ │ │ Read Rep│ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │
│ │ │ │ │ │
│ │ ┌─────────┐ │ Cross- │ ┌─────────┐ │ │
│ │ │ S3 │─┼──Region──┼►│ S3 │ │ │
│ │ │ │ │ Replic. │ │ │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ RPO: Minutes (async replication lag) │
│ RTO: Minutes to hours (manual promotion) │
│ │
└─────────────────────────────────────────────────────────────────┘

Active-Active Pattern
┌─────────────────────────────────────────────────────────────────┐
│ ACTIVE-ACTIVE MULTI-REGION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Route 53 │ │
│ │ (Latency or Geolocation Routing) │ │
│ └────────────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌─────────────────┴─────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ us-east-1 │ │ eu-west-1 │ │
│ │ (Active) │ │ (Active) │ │
│ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │ Aurora │◄┼─────Global DB───────┼►│ Aurora │ │ │
│ │ │ Global │ │ (< 1s latency) │ │ Global │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │
│ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │DynamoDB │◄┼───Global Tables─────┼►│DynamoDB │ │ │
│ │ │ Global │ │ │ │ Global │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ RPO: Near-zero (synchronous/near-sync replication) │
│ RTO: Seconds (automatic failover) │
│ │
└─────────────────────────────────────────────────────────────────┘

DR Strategy Selection
Strategy Comparison
| Strategy | RTO | RPO | Cost | Use Case |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Non-critical |
| Pilot Light | Tens of minutes | Minutes | $$ | Important |
| Warm Standby | Minutes | Seconds-Minutes | $$$ | Business Critical |
| Multi-Site Active | Seconds | Near-zero | $$$$ | Mission Critical |
Pilot Light Implementation
┌─────────────────────────────────────────────────────────────────┐
│ PILOT LIGHT PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Primary Region (Active) DR Region (Pilot Light) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ │ │ │ │
│ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │
│ │ │ Full App Stack │ │ │ │ Minimal Infra │ │ │
│ │ │ • Web servers │ │ │ │ • DB replica │ │ │
│ │ │ • App servers │ │ │ │ • Core configs │ │ │
│ │ │ • Databases │ │ │ │ • AMIs ready │ │ │
│ │ │ • Caches │ │ │ │ │ │ │
│ │ └─────────────────┘ │ │ └─────────────────┘ │ │
│ │ │ │ │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ On Failover: │
│ 1. Scale up compute (ASG, ECS) │
│ 2. Promote DB replica │
│ 3. Update DNS │
│ 4. Warm up caches │
│ │
└─────────────────────────────────────────────────────────────────┘

Game Days & Testing
DR Testing Framework
┌─────────────────────────────────────────────────────────────────┐
│ DR TESTING FRAMEWORK │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Test Types: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 1. Tabletop Exercise (Quarterly) │ │
│ │ • Walk through runbooks │ │
│ │ • Identify gaps in documentation │ │
│ │ • No actual failover │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 2. Component Failover (Monthly) │ │
│ │ • Test individual component failover │ │
│ │ • RDS failover, AZ failure simulation │ │
│ │ • Measure actual RTO │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 3. Full DR Drill (Annually) │ │
│ │ • Complete region failover │ │
│ │ • Run production traffic in DR region │ │
│ │ • Validate all systems │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Chaos Engineering with FIS
```json
{
  "description": "AZ failure simulation",
  "targets": {
    "ec2-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Environment": "production"
      },
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stop-instances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {},
      "targets": {
        "Instances": "ec2-instances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ]
}
```

Runbook Template
```markdown
# DR Runbook: Region Failover

## Pre-Conditions
- [ ] DR region infrastructure verified
- [ ] Database replication lag < 5 minutes
- [ ] On-call team notified

## Failover Steps

### 1. Assess Situation (5 min)
- Confirm primary region is unavailable
- Check replication status
- Notify stakeholders

### 2. Database Failover (10 min)
- Promote RDS read replica to primary
- Update application connection strings
- Verify database connectivity

### 3. Application Failover (15 min)
- Scale up ASG in DR region
- Verify application health checks
- Update Route 53 to DR region

### 4. Validation (10 min)
- Run smoke tests
- Verify critical user journeys
- Monitor error rates

## Rollback Procedure
[Steps to fail back to primary region]

## Contacts
- Primary: [name] - [phone]
- Secondary: [name] - [phone]
```

Best Practices Checklist
- [ ] Define RTO/RPO for each workload
- [ ] Implement Multi-AZ for all production workloads
- [ ] Set up cross-region replication for critical data
- [ ] Create and test DR runbooks
- [ ] Conduct regular DR drills
- [ ] Automate failover where possible
- [ ] Monitor replication lag
- [ ] Document dependencies and failover order
⚖️ Trade-offs
Trade-off 1: RTO vs Cost
| RTO Target | Strategy | Monthly Cost Estimate | ROI Consideration |
|---|---|---|---|
| 24h+ | Backup & Restore | ~$100/TB stored | Accept 24h+ downtime |
| 1-4h | Pilot Light | ~$500/month base | Balance cost/recovery |
| 15-60min | Warm Standby | ~$2,000-5,000/month | Important but not critical |
| < 1min | Active-Active | 2x production cost | Mission-critical |
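The worked example below can be sketched as a short break-even check. All figures (revenue, standby cost, outage rate, recovery times) are illustrative assumptions, and the break-even framing — expected avoided loss versus annual standby cost — is one reasonable way to interpret the numbers:

```python
# Illustrative break-even check for a Warm Standby investment.
# All inputs are assumptions; plug in your own revenue and outage rates.

def hourly_downtime_cost(daily_revenue: float) -> float:
    """Revenue lost per hour of full outage, assuming even distribution."""
    return daily_revenue / 24

def warm_standby_worth_it(daily_revenue: float,
                          standby_cost_per_month: float,
                          outages_per_year: float,
                          rto_without_hours: float = 24,
                          rto_with_hours: float = 1) -> bool:
    """True if expected avoided loss exceeds the standby's annual cost."""
    per_hour = hourly_downtime_cost(daily_revenue)
    avoided_per_outage = per_hour * (rto_without_hours - rto_with_hours)
    return outages_per_year * avoided_per_outage > standby_cost_per_month * 12

print(f"${hourly_downtime_cost(1_000_000):,.0f}/hour")  # $41,667/hour
print(warm_standby_worth_it(1_000_000, 5_000, outages_per_year=0.2))
```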
Example downtime-cost calculation:
Revenue: $1M/day
Downtime cost: $41,667/hour
Backup & Restore (24h RTO): potential loss = $1M
Warm Standby ($5K/month): break-even = 1 outage per 5 years
If outage probability > 20%/year, Warm Standby is worth it

Trade-off 2: RPO vs Replication Cost
| RPO Target | Replication Method | Cost Impact |
|---|---|---|
| 24h | Daily backups | Lowest |
| 1h | Hourly snapshots | Low |
| Minutes | Async replication | Medium |
| Near-zero | Sync replication | High + latency |
Trade-off 3: Automation vs Complexity
| Approach | Pros | Cons |
|---|---|---|
| Manual failover | Simple, full control | Slow, human error |
| Automated failover | Fast, consistent | Complex, false positives |
| Hybrid | Balanced | Moderate complexity |
🚨 Failure Modes
Failure Mode 1: Replication Lag Undetected
🔥 Real-world incident
Database replication had been failing for a week without anyone noticing. When the primary region went down, the DR database was missing a week of data. RPO of 0 on paper, RPO of 1 week in reality. $2M in data reconstruction costs.
| How to detect | How to prevent |
|---|---|
| Monitor replication lag metric | CloudWatch alarm for lag > threshold |
| Daily replication health check | Automated validation job |
| DR drill failures | Monthly replication verification |
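The "automated validation job" above might look like the following sketch. The key lesson from the incident is that a *missing* lag metric must be treated as a failure, not success, because broken replication often stops emitting the metric entirely. The function name and 60-second threshold are illustrative:

```python
# Sketch of a replication validation job: missing lag data is CRITICAL,
# because a broken replica may silently stop reporting the metric.

from typing import Optional

LAG_THRESHOLD_SECONDS = 60  # matches the ReplicaLag alert threshold on this page

def check_replication(lag_seconds: Optional[float]) -> str:
    """Classify a replication-lag sample; None means no metric was reported."""
    if lag_seconds is None:
        return "CRITICAL: no lag metric reported - replication may be broken"
    if lag_seconds > LAG_THRESHOLD_SECONDS:
        return f"WARNING: replica lag {lag_seconds:.0f}s exceeds {LAG_THRESHOLD_SECONDS}s"
    return "OK"

print(check_replication(12.0))   # healthy replica
print(check_replication(None))   # silent failure case from the incident
```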
Failure Mode 2: DR Runbook Outdated
| How to detect | How to prevent |
|---|---|
| Failed DR drill | Quarterly runbook review |
| Missing new services | IaC-based runbooks (auto-update) |
| Wrong credentials | Credential rotation reminder |
| Expired certificates | Certificate expiry monitoring |
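A sketch of how the quarterly runbook review could be enforced automatically. The paths reuse the runbook entry points listed later on this page; the 90-day window and review dates are illustrative assumptions:

```python
# Sketch: flag runbooks whose last review is older than the quarterly window,
# so a stale runbook is caught before a DR drill (or a real failover) fails.

from datetime import date, timedelta

REVIEW_WINDOW = timedelta(days=90)  # quarterly review policy

def stale_runbooks(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Return runbooks whose last review is older than REVIEW_WINDOW."""
    return sorted(name for name, reviewed in last_reviewed.items()
                  if today - reviewed > REVIEW_WINDOW)

reviews = {
    "runbook/full-region-failover.md": date(2024, 1, 10),
    "runbook/database-failover.md": date(2024, 11, 2),
}
print(stale_runbooks(reviews, today=date(2024, 12, 1)))
```

Wiring this into CI or a scheduled job turns "quarterly runbook review" from a calendar reminder into an enforced check.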
Failure Mode 3: Cascading Failures
┌─────────────────────────────────────────────────────────────────┐
│ CASCADE FAILURE PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Database slow → 2. App timeouts → 3. Queue backlog │
│ ↓ │
│ 4. Retry storms → 5. More DB load → 6. Database crash │
│ ↓ │
│ 7. All dependent services fail │
│ │
│ Mitigation: Circuit breakers, bulkheads, rate limiting │
│ │
└─────────────────────────────────────────────────────────────────┘

| How to detect | How to prevent |
|---|---|
| Error rate spike across services | Circuit breakers per dependency |
| Queue depth growing exponentially | Bulkhead pattern (service isolation) |
| Retry metrics spike | Exponential backoff with jitter |
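The "exponential backoff with jitter" mitigation can be sketched as the full-jitter variant below; the base delay and cap values are illustrative:

```python
# Full-jitter exponential backoff: each retry sleeps a random amount in
# [0, min(cap, base * 2**attempt)], which spreads retries out and damps
# the retry storms shown in the cascade diagram above.

import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Delay in seconds before retry number `attempt` (0-indexed)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

for attempt in range(5):
    ceiling = min(30.0, 0.1 * 2 ** attempt)
    print(f"attempt {attempt}: sleep up to {ceiling:.1f}s, "
          f"chosen {backoff_delay(attempt):.3f}s")
```

The jitter matters as much as the exponent: without it, all failed clients retry at the same instant and re-create the load spike that caused the failure.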
🔐 Security Baseline
DR Security Requirements
| Requirement | Implementation | Verification |
|---|---|---|
| DR credentials secure | Secrets Manager in DR region | Access audit |
| Cross-region encryption | KMS multi-region keys | Key policy review |
| Network security parity | IaC-managed Security Groups | Config drift detection |
| IAM parity | IAM Access Analyzer | Cross-region comparison |
Data Sovereignty
| Consideration | Implementation |
|---|---|
| GDPR data residency | DR region in the EU |
| Data classification | Replicate only permitted data |
| Encryption keys | Region-specific CMKs if required |
| Audit logs | Replicate to both regions |
📊 Ops Readiness
Metrics to Monitor
| Component | Metric | Alert Threshold |
|---|---|---|
| RDS Replication | ReplicaLag | > 60 seconds |
| S3 Replication | ReplicationLatency | > 15 minutes |
| Route 53 Health | HealthCheckStatus | Unhealthy |
| DR Region | ResourceCount | Mismatch with primary |
| Backup | BackupJobStatus | Failed |
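The alert thresholds in the table above, expressed as a small evaluation function. The metric names mirror the table; the aggregation logic is a sketch, not a specific monitoring product's API:

```python
# Sketch: evaluate the DR monitoring thresholds from the table above
# against a snapshot of current metric values.

def evaluate_dr_metrics(metrics: dict) -> list[str]:
    """Return a list of alert messages for any breached threshold."""
    alerts = []
    if metrics.get("ReplicaLag", 0) > 60:               # seconds
        alerts.append("RDS replication lag > 60s")
    if metrics.get("ReplicationLatency", 0) > 15 * 60:  # seconds
        alerts.append("S3 replication latency > 15 min")
    if metrics.get("HealthCheckStatus") == "Unhealthy":
        alerts.append("Route 53 health check unhealthy")
    if metrics.get("BackupJobStatus") == "Failed":
        alerts.append("Backup job failed")
    return alerts

print(evaluate_dr_metrics({"ReplicaLag": 95, "HealthCheckStatus": "Healthy"}))
# -> ['RDS replication lag > 60s']
```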
DR Drill Schedule
| Drill Type | Frequency | Duration | Scope |
|---|---|---|---|
| Tabletop | Monthly | 1-2 hours | Review runbooks |
| Partial failover | Quarterly | 2-4 hours | Non-critical services |
| Full failover | Annually | 4-8 hours | All production |
| Chaos engineering | Ongoing | Varies | Continuous validation |
Runbook Entry Points
| Scenario | Runbook |
|---|---|
| Primary region outage | runbook/full-region-failover.md |
| Database failover | runbook/database-failover.md |
| Replication lag alert | runbook/replication-lag-investigation.md |
| Failed DR drill | runbook/dr-drill-failure-analysis.md |
| Failback to primary | runbook/failback-procedure.md |
| Partial service recovery | runbook/partial-failover.md |
✅ Design Review Checklist
RTO/RPO
- [ ] RTO/RPO defined and documented for every workload
- [ ] Business sign-off on targets
- [ ] Cost-benefit analysis completed
- [ ] Dependencies mapped
Architecture
- [ ] Multi-AZ for all production workloads
- [ ] DR strategy matches RTO/RPO
- [ ] Replication configured and monitored
- [ ] Failover automation tested
Testing
- [ ] DR drills scheduled
- [ ] Runbooks up-to-date
- [ ] Credentials valid
- [ ] Last successful drill date documented
Operations
- [ ] Replication lag monitoring
- [ ] Health check endpoints
- [ ] On-call staff trained on DR procedures
- [ ] Post-incident review process
📎 Links
- 📎 GCP Observability - Comparison with GCP's DR capabilities
- 📎 Observability & Auditing - Monitoring for DR
- 📎 Storage & Data Protection - Backup strategies
- 📎 VPC & Networking - Network architecture for DR
- 📎 Terraform Environments - Multi-region IaC