💰 Cost Governance
Level: Ops Solves: Implement cost visibility, optimization, and governance for enterprise AWS environments
🎯 Outcomes
After applying the material on this page, you will be able to:
- Design a tagging strategy for cost allocation and showback
- Configure AWS Budgets with alerts and actions
- Deploy Reserved Instances and Savings Plans for cost optimization
- Implement a Spot strategy for fault-tolerant workloads
- Track unit economics to measure cost efficiency
- Build cost governance with policies and automation
✅ When to use
| Optimization | Use Case | Savings |
|---|---|---|
| Reserved Instances | Stable, predictable EC2/RDS | 30-72% |
| Savings Plans | Flexible compute commitment | 30-66% |
| Spot Instances | Fault-tolerant, flexible timing | 60-90% |
| Right-sizing | Over-provisioned resources | 20-50% |
| Storage tiering | Infrequent access data | 50-90% |
❌ When NOT to use
| Pattern | Problem | Alternative |
|---|---|---|
| RIs for dev/test | Wasted while idle | Spot or On-Demand |
| Spot for databases | Data loss risk | Reserved |
| 3-year RIs for uncertain workloads | Lock-in risk | 1-year or Savings Plans |
| No tags | Costs cannot be allocated | Mandatory tagging |
⚠️ Warning from Raizo
"One team bought $500K of Reserved Instances for a workload they were 'certain' would stay stable. Six months later, the workload migrated to containers. The RIs sat unused for 18 months. Always start with 1-year commitments and convertible options."
Cost Management Framework
FinOps Principles
┌─────────────────────────────────────────────────────────────────┐
│ FINOPS FRAMEWORK │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ INFORM │ │
│ │ • Cost visibility and allocation │ │
│ │ • Tagging and showback │ │
│ │ • Anomaly detection │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ OPTIMIZE │ │
│ │ • Right-sizing │ │
│ │ • Reserved capacity │ │
│ │ • Spot instances │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ OPERATE │ │
│ │ • Budgets and alerts │ │
│ │ • Governance policies │ │
│ │ • Continuous improvement │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Tagging Strategy
Mandatory Tags
┌─────────────────────────────────────────────────────────────────┐
│ TAGGING TAXONOMY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ COST ALLOCATION (Required): │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ CostCenter │ engineering-platform │ │
│ │ Project │ payment-gateway │ │
│ │ Environment │ production | staging | development │ │
│ │ Owner │ team-payments@company.com │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ OPERATIONAL (Required): │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Application │ checkout-service │ │
│ │ ManagedBy │ terraform | manual | cloudformation │ │
│ │ Backup │ daily | weekly | none │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ COMPLIANCE (As needed): │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ DataClassification │ public | internal | confidential │ │
│ │ Compliance │ pci-dss | hipaa | sox │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Tag Enforcement
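Each required tag gets its own statement: keys listed together under a single `Null` operator are ANDed, so a combined statement would deny only requests that omit all three tags. The `Resource` list is scoped to the resources being created, because `aws:RequestTag` is not present for the other resources that `ec2:RunInstances` touches (such as the AMI or subnet) and a `*` resource would then block every launch.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireCostCenterTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/CostCenter": "true"}}
    },
    {
      "Sid": "RequireEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/Environment": "true"}}
    },
    {
      "Sid": "RequireOwnerTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/Owner": "true"}}
    }
  ]
}
```
AWS Config Rule for Tags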
```yaml
# Config rule to check required tags
Resources:
  RequiredTagsRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: required-tags
      Description: Check that required tags are present
      InputParameters:
        tag1Key: CostCenter
        tag2Key: Environment
        tag3Key: Owner
      Source:
        Owner: AWS
        SourceIdentifier: REQUIRED_TAGS
      Scope:
        ComplianceResourceTypes:
          - AWS::EC2::Instance
          - AWS::RDS::DBInstance
          - AWS::S3::Bucket
```
AWS Budgets
Budget Types
| Type | Use Case | Alert Trigger |
|---|---|---|
| Cost Budget | Track spending | Actual or forecasted |
| Usage Budget | Track resource usage | Hours, GB, requests |
| RI Utilization | Monitor RI usage | Below threshold |
| RI Coverage | Monitor RI coverage | Below threshold |
| Savings Plans | Monitor SP usage | Below threshold |
Budget Configuration
```json
{
  "BudgetName": "Monthly-Production-Budget",
  "BudgetLimit": {
    "Amount": "10000",
    "Unit": "USD"
  },
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "CostFilters": {
    "TagKeyValue": [
      "user:Environment$production"
    ]
  },
  "NotificationsWithSubscribers": [
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "finops@company.com"
        }
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "SNS",
          "Address": "arn:aws:sns:us-east-1:123456789012:budget-alerts"
        }
      ]
    }
  ]
}
```
Budget Actions (Auto-Remediation)
┌─────────────────────────────────────────────────────────────────┐
│ BUDGET ACTIONS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Threshold: 100% of budget │
│ │
│ Actions: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 1. Apply IAM Policy │ │
│ │ • Deny ec2:RunInstances │ │
│ │ • Deny rds:CreateDBInstance │ │
│ │ • Prevents new resource creation │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 2. Apply SCP (Organization level) │ │
│ │ • Restrict entire account │ │
│ │ • Emergency cost control │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ⚠️ Use with caution - can impact production! │
│ │
└─────────────────────────────────────────────────────────────────┘
Reserved Instances & Savings Plans
Commitment Options
RI vs Savings Plans
| Feature | Reserved Instances | Savings Plans |
|---|---|---|
| Flexibility | Instance type specific | Any instance type (Compute SP); one instance family per region (EC2 Instance SP) |
| Region | Region specific | Any region (Compute SP) |
| Services | EC2, RDS, ElastiCache | EC2, Fargate, Lambda |
| Max Savings | Up to 72% | Up to 66% (Compute SP); up to 72% (EC2 Instance SP) |
| Commitment | 1 or 3 years | 1 or 3 years |
Coverage Analysis
```sql
-- Athena query for RI coverage analysis
SELECT
  line_item_product_code,
  product_instance_type,
  SUM(CASE WHEN line_item_line_item_type = 'DiscountedUsage'
           THEN line_item_usage_amount ELSE 0 END) AS reserved_usage,
  SUM(CASE WHEN line_item_line_item_type = 'Usage'
           THEN line_item_usage_amount ELSE 0 END) AS on_demand_usage,
  ROUND(
    SUM(CASE WHEN line_item_line_item_type = 'DiscountedUsage'
             THEN line_item_usage_amount ELSE 0 END) * 100.0 /
    NULLIF(SUM(line_item_usage_amount), 0), 2
  ) AS coverage_percentage
FROM cost_and_usage_report
WHERE line_item_usage_start_date >= date_add('day', -30, current_date)
  -- Keep only usage rows so fees, taxes, and credits do not inflate the denominator
  AND line_item_line_item_type IN ('DiscountedUsage', 'Usage')
GROUP BY line_item_product_code, product_instance_type
HAVING SUM(line_item_usage_amount) > 100
ORDER BY on_demand_usage DESC;
```
Spot Instances Strategy
Spot Best Practices
┌─────────────────────────────────────────────────────────────────┐
│ SPOT INSTANCE STRATEGY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DIVERSIFICATION: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • Use multiple instance types (at least 10) │ │
│ │ • Spread across all AZs │ │
│ │ • Mix instance families (m5, m5a, m5n, m6i) │ │
│ │ • Use capacity-optimized allocation strategy │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ WORKLOAD FIT: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ✅ Good for Spot: │ │
│ │ • Batch processing │ │
│ │ • CI/CD workers │ │
│ │ • Data analysis │ │
│ │ • Stateless web servers (with proper LB) │ │
│ │ │ │
│ │ ❌ Not for Spot: │ │
│ │ • Databases │ │
│ │ • Single points of failure │ │
│ │ • Long-running stateful jobs │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Mixed Instances ASG
When any override sets `WeightedCapacity`, every override has to set it, so the `.large` types carry an explicit weight of 1.
```json
{
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateId": "lt-xxx",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m5.large", "WeightedCapacity": "1"},
        {"InstanceType": "m5a.large", "WeightedCapacity": "1"},
        {"InstanceType": "m5n.large", "WeightedCapacity": "1"},
        {"InstanceType": "m6i.large", "WeightedCapacity": "1"},
        {"InstanceType": "m5.xlarge", "WeightedCapacity": "2"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 20,
      "SpotAllocationStrategy": "capacity-optimized",
      "SpotMaxPrice": ""
    }
  }
}
```
Cost Anomaly Detection
Configuration
┌─────────────────────────────────────────────────────────────────┐
│ COST ANOMALY DETECTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Monitor Types: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 1. AWS Services │ │
│ │ • Detect anomalies across all services │ │
│ │ • Good for overall cost monitoring │ │
│ │ │ │
│ │ 2. Linked Accounts │ │
│ │ • Monitor specific accounts │ │
│ │ • Good for multi-account organizations │ │
│ │ │ │
│ │ 3. Cost Categories │ │
│ │ • Monitor by business unit/project │ │
│ │ • Requires cost categories setup │ │
│ │ │ │
│ │ 4. Cost Allocation Tags │ │
│ │ • Monitor by specific tags │ │
│ │ • Most granular control │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Alert Thresholds: │
│ • Absolute: Alert when anomaly > $100 │
│ • Percentage: Alert when anomaly > 10% of expected │
│ │
└─────────────────────────────────────────────────────────────────┘
Unit Economics
Cost Per Transaction
```sql
-- Calculate cost per transaction
WITH daily_costs AS (
  SELECT
    DATE(line_item_usage_start_date) AS date,
    SUM(line_item_unblended_cost) AS total_cost
  FROM cost_and_usage_report
  WHERE line_item_product_code = 'AmazonEC2'
    AND resource_tags_user_application = 'checkout-service'
  GROUP BY DATE(line_item_usage_start_date)
),
daily_transactions AS (
  SELECT
    DATE(timestamp) AS date,
    COUNT(*) AS transaction_count
  FROM application_metrics
  WHERE metric_name = 'checkout_completed'
  GROUP BY DATE(timestamp)
)
SELECT
  c.date,
  c.total_cost,
  t.transaction_count,
  ROUND(c.total_cost / t.transaction_count, 4) AS cost_per_transaction
FROM daily_costs c
JOIN daily_transactions t ON c.date = t.date
ORDER BY c.date DESC;
```
Cost Dashboard Metrics
| Metric | Formula | Target |
|---|---|---|
| Cost per Request | Total Cost / Request Count | < $0.001 |
| Cost per User | Total Cost / Active Users | < $1/month |
| Infrastructure Efficiency | Revenue / Infrastructure Cost | > 10x |
| RI/SP Coverage | Reserved Hours / Total Hours | > 70% |
| Spot Utilization | Spot Hours / Non-prod Hours | > 80% |
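The formulas in the table can be sketched in a few lines of Python. This is a minimal illustration, not a reporting pipeline: the `PeriodTotals` fields and the sample numbers are hypothetical, and Spot utilization is left out because it needs a separate non-production hour count.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PeriodTotals:
    """Inputs for one reporting period; every field name here is hypothetical."""
    total_cost: float       # USD
    request_count: int
    active_users: int
    revenue: float          # USD
    reserved_hours: float   # compute hours covered by RIs or Savings Plans
    total_hours: float      # all compute hours in the period


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else None


def dashboard_metrics(p: PeriodTotals) -> Dict[str, Optional[float]]:
    """Apply the formulas from the metrics table to one period of totals."""
    return {
        "cost_per_request": ratio(p.total_cost, p.request_count),
        "cost_per_user": ratio(p.total_cost, p.active_users),
        "infrastructure_efficiency": ratio(p.revenue, p.total_cost),
        "ri_sp_coverage_pct": ratio(100.0 * p.reserved_hours, p.total_hours),
    }


# Made-up sample numbers for illustration only.
totals = PeriodTotals(total_cost=5000.0, request_count=10_000_000, active_users=8000,
                      revenue=60000.0, reserved_hours=700.0, total_hours=1000.0)
metrics = dashboard_metrics(totals)
```

Returning `None` for an empty period keeps a day with zero requests from being reported as a zero-cost day.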
Best Practices Checklist
- [ ] Implement mandatory tagging with enforcement
- [ ] Set up budgets with alerts at 50%, 80%, 100%
- [ ] Enable Cost Anomaly Detection
- [ ] Purchase Reserved Instances for stable workloads
- [ ] Use Savings Plans for flexible compute
- [ ] Implement Spot for fault-tolerant workloads
- [ ] Create cost allocation reports by team/project
- [ ] Review and right-size resources monthly
- [ ] Track unit economics metrics
⚖️ Trade-offs
Trade-off 1: Reserved Instances vs Savings Plans
| Aspect | Reserved Instances | Savings Plans |
|---|---|---|
| Flexibility | Instance type locked | Any instance type |
| Discount | Up to 72% | Up to 66% (Compute SP) |
| Scope | Region/AZ specific | One region or any region |
| Sellable | Yes (Marketplace) | No |
| Best for | Stable, known workloads | Dynamic compute needs |
Recommendation:
- RIs: Database instances, known baseline
- Savings Plans: Application servers, variable needs
Trade-off 2: Commitment Length
| Term | Discount | Risk | Recommendation |
|---|---|---|---|
| On-Demand | 0% | None | Unknown/variable workloads |
| 1-year | 30-40% | Low | Standard production |
| 3-year | 60-72% | High | Only ultra-stable workloads |
Rule of thumb: Start with 1-year terms, then move to 3-year terms once stability is proven.
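The risk column follows from simple arithmetic: a commitment is billed for every hour of the term, while On-Demand is billed only for hours actually used. With a discount d, the commitment is cheaper only when the used fraction of the term exceeds 1 - d. A minimal sketch, assuming a no-upfront commitment priced as a flat discount off the On-Demand rate (function names and sample numbers are illustrative):

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the term a workload must run for the commitment to pay off."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def commitment_saving(on_demand_hourly: float, discount: float,
                      term_hours: float, used_hours: float) -> float:
    """Positive result: the commitment was cheaper than paying On-Demand."""
    commitment_cost = on_demand_hourly * (1.0 - discount) * term_hours
    on_demand_cost = on_demand_hourly * used_hours
    return on_demand_cost - commitment_cost


# A 1-year term (8760 h) at a 40% discount, for a workload that ran only 6 months.
loss = commitment_saving(on_demand_hourly=1.0, discount=0.40,
                         term_hours=8760, used_hours=4380)
```

At a 40% discount the workload has to run for more than 60% of the term, so six months of use on a 1-year term ends up costing more than On-Demand would have. A deeper 3-year discount lowers the break-even fraction but stretches the period over which the workload must stay in place.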
Trade-off 3: Spot vs Reserved
| Aspect | Spot | Reserved |
|---|---|---|
| Discount | 60-90% | 30-72% |
| Availability | Not guaranteed | Guaranteed |
| Interruption | Yes (2-minute warning) | No |
| Best for | Batch, stateless | Databases, stateful |
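In practice the two are often combined: a small On-Demand (or reserved) base plus Spot for everything above it, as in the mixed instances policy earlier on this page. The sketch below estimates how a desired capacity splits and what the blended hourly cost looks like. It assumes every unit has weight 1 and that the On-Demand share above the base is rounded up; the exact rounding Auto Scaling applies is not confirmed here, and the prices are made up.

```python
import math
from typing import Tuple


def split_capacity(desired: int, on_demand_base: int,
                   on_demand_pct_above_base: int) -> Tuple[int, int]:
    """Return (on_demand_units, spot_units) for a desired capacity."""
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Assumption: round the On-Demand share up, which errs on the stable side.
    on_demand_above = math.ceil(above_base * on_demand_pct_above_base / 100)
    return base + on_demand_above, above_base - on_demand_above


def blended_hourly_cost(on_demand_units: int, spot_units: int,
                        on_demand_price: float, spot_discount: float) -> float:
    """Hourly cost when Spot units are priced at a discount off On-Demand."""
    spot_price = on_demand_price * (1.0 - spot_discount)
    return on_demand_units * on_demand_price + spot_units * spot_price


# Base of 2 and 20% On-Demand above base, matching the policy values above.
on_demand, spot = split_capacity(desired=12, on_demand_base=2,
                                 on_demand_pct_above_base=20)
cost = blended_hourly_cost(on_demand, spot, on_demand_price=0.10, spot_discount=0.70)
```

With these inputs, 4 of the 12 units stay On-Demand and the blended cost is roughly half of an all-On-Demand fleet, while an interruption can remove at most the 8 Spot units.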
🚨 Failure Modes
Failure Mode 1: Budget Overrun
🔥 Real-world incident
A developer spun up an ML training cluster and forgot to terminate it. The bill reached $80,000 after one week. There were no budget alerts, so the team found out only when the monthly bill arrived.
| Detection | Prevention |
|---|---|
| Monthly bill shock | Budget alerts at 50%, 80%, 100% |
| Cost Anomaly Detection alerts | Enable and configure thresholds |
| Cost Explorer review | Daily cost monitoring |
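To make the 50%, 80%, 100% alerts concrete, the sketch below checks which thresholds a given spend has crossed and projects month-end spend from the run rate so far. The linear projection is a stand-in chosen for illustration; AWS Budgets uses its own forecasting, and the function names are hypothetical.

```python
import calendar
from datetime import date
from typing import List, Sequence

THRESHOLDS_PCT = (50, 80, 100)


def crossed_thresholds(spend: float, budget: float,
                       thresholds: Sequence[int] = THRESHOLDS_PCT) -> List[int]:
    """Return the thresholds (percent of budget) that spend is strictly above."""
    pct_of_budget = 100.0 * spend / budget
    return [t for t in thresholds if pct_of_budget > t]


def linear_forecast(spend_to_date: float, today: date) -> float:
    """Project month-end spend by extending the average daily spend so far."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


# $4,200 spent by June 10 against a $10,000 monthly budget.
actual_alerts = crossed_thresholds(4200.0, 10000.0)
forecast = linear_forecast(4200.0, date(2024, 6, 10))
forecast_alerts = crossed_thresholds(forecast, 10000.0)
```

In this example actual spend has not crossed any threshold yet, but the projection already exceeds the budget, which is the gap a forecasted alert is meant to close.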
Failure Mode 2: Wasted Reserved Capacity
| Detection | Prevention |
|---|---|
| RI utilization report < 80% | Right-size before committing |
| Unused RIs | Start with 1-year, convertible |
| Workload migration | Savings Plans for flexibility |
Failure Mode 3: Untagged Resources
┌─────────────────────────────────────────────────────────────────┐
│ UNTAGGED RESOURCES PROBLEM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Monthly AWS Bill: $500,000 │
│ │ │
│ ├─ Team A: $??? (20% untagged) │
│ ├─ Team B: $??? (15% untagged) │
│ ├─ Unknown: $80,000 (nobody knows who owns it) │
│ │
│ Impact: costs cannot be measured, no accountability │
│ │
└─────────────────────────────────────────────────────────────────┘
| Detection | Prevention |
|---|---|
| Cost allocation report gaps | SCPs require tags |
| "Untagged" in Cost Explorer | Tag policies |
| Config rules violations | Automated tagging remediation |
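Measuring the gap is the first step toward closing it. The sketch below computes the share of spend that lacks any of the three tags the enforcement policy requires. The input shape, a list of (cost, tags) pairs, is an assumption for illustration; real data would come from the Cost and Usage Report or a resource inventory.

```python
from typing import Dict, Iterable, List, Tuple

# Tag keys taken from the enforcement policy on this page.
REQUIRED_TAGS = ("CostCenter", "Environment", "Owner")


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Return required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def untagged_spend_pct(line_items: Iterable[Tuple[float, Dict[str, str]]]) -> float:
    """Percent of total cost on items missing at least one required tag."""
    total = 0.0
    untagged = 0.0
    for cost, tags in line_items:
        total += cost
        if missing_tags(tags):
            untagged += cost
    return 100.0 * untagged / total if total else 0.0


# Hypothetical line items: (cost in USD, tags on the resource).
items = [
    (300.0, {"CostCenter": "engineering-platform", "Environment": "production",
             "Owner": "team-payments@company.com"}),
    (100.0, {"Owner": "team-payments@company.com"}),
    (100.0, {}),
]
share = untagged_spend_pct(items)
```

Weighting by cost rather than by resource count keeps a handful of expensive untagged resources from hiding behind many cheap tagged ones.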
🔐 Security Baseline
Cost Governance Security
| Requirement | Implementation | Verification |
|---|---|---|
| Budget owner access | IAM policies per team | Access review |
| Cost data access | Role-based Cost Explorer | Policy audit |
| No surprise resources | SCPs limit instance types | Compliance check |
| Billing access restricted | Billing console IAM | Access audit |
Cost-Related SCPs
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveInstances",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringLike": {
          "ec2:InstanceType": ["p4d.*", "x2*", "u-*"]
        }
      }
    }
  ]
}
```
📊 Ops Readiness
Metrics to Monitor
| Metric | Source | Alert Threshold |
|---|---|---|
| Daily spend | Cost Explorer | > budget/30 |
| RI utilization | Cost Explorer | < 80% |
| Spot interruption rate | CloudWatch | > 10% |
| Untagged resources | Config | > 5% |
| Unit cost | Custom | > baseline + 20% |
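The thresholds in this table can be expressed as a small rule set and evaluated against whatever metrics are available. This is a sketch under assumptions: the metric names are hypothetical keys, and the values would have to be collected from Cost Explorer, CloudWatch, and Config by some other means.

```python
import operator
from typing import Callable, Dict, List, Tuple

Rule = Tuple[Callable[[float, float], bool], float]

# Fixed thresholds from the table; metric names are hypothetical.
STATIC_RULES: Dict[str, Rule] = {
    "ri_utilization_pct": (operator.lt, 80.0),
    "spot_interruption_rate_pct": (operator.gt, 10.0),
    "untagged_resources_pct": (operator.gt, 5.0),
}


def breached(metrics: Dict[str, float], monthly_budget: float,
             unit_cost_baseline: float) -> List[str]:
    """Return the names of metrics that violate their alert threshold."""
    rules = dict(STATIC_RULES)
    rules["daily_spend"] = (operator.gt, monthly_budget / 30)
    rules["unit_cost"] = (operator.gt, unit_cost_baseline * 1.20)
    return sorted(
        name for name, (compare, limit) in rules.items()
        if name in metrics and compare(metrics[name], limit)
    )


alerts = breached(
    {"daily_spend": 400.0, "ri_utilization_pct": 85.0,
     "spot_interruption_rate_pct": 12.0, "untagged_resources_pct": 3.0,
     "unit_cost": 0.0013},
    monthly_budget=10000.0,
    unit_cost_baseline=0.001,
)
```

Metrics missing from the input are skipped rather than treated as breaches, so the check can run even when one data source is temporarily unavailable.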
Cost Review Schedule
| Review | Frequency | Participants | Actions |
|---|---|---|---|
| Daily glance | Daily | FinOps | Anomaly check |
| Weekly review | Weekly | FinOps + Leads | Right-sizing |
| Monthly deep-dive | Monthly | All stakeholders | Optimization |
| Quarterly planning | Quarterly | Leadership | Commitments |
Runbook Entry Points
| Scenario | Runbook |
|---|---|
| Budget alert triggered | runbook/budget-overrun-response.md |
| Cost anomaly detected | runbook/cost-anomaly-investigation.md |
| RI underutilization | runbook/ri-optimization.md |
| Untagged resources found | runbook/tagging-remediation.md |
| Spot capacity issues | runbook/spot-capacity-management.md |
| Right-sizing opportunity | runbook/right-sizing-process.md |
✅ Design Review Checklist
Visibility
- [ ] Mandatory tags defined and enforced
- [ ] Cost allocation reports configured
- [ ] Budgets set up with alerts
- [ ] Cost Anomaly Detection enabled
Optimization
- [ ] Right-sizing analysis completed
- [ ] RI/SP coverage > 70% for stable workloads
- [ ] Spot strategy for fault-tolerant workloads
- [ ] Storage tiering applied
Governance
- [ ] SCPs limit expensive resources
- [ ] Tag policies enforced
- [ ] Cost review meetings scheduled
- [ ] Team accountability defined
Operations
- [ ] Unit economics tracking
- [ ] Monthly optimization review
- [ ] Runbooks documented
- [ ] FinOps owner assigned
📎 Links
- 📎 GCP Cost & Quotas - Comparison with GCP's cost management
- 📎 Compute Decisioning - Compute cost optimization
- 📎 Storage & Data Protection - Storage cost optimization
- 📎 Account & Landing Zone - Multi-account cost allocation
- 📎 Terraform IaC - Infrastructure as Code basics