💰 Cost Governance

Level: Ops
Solves: Implement cost visibility, optimization, and governance for enterprise AWS environments

🎯 Outcomes

After applying the material on this page, you will be able to:

Design a Tagging Strategy for cost allocation and showback
Configure AWS Budgets with alerts and actions
Deploy Reserved Instances/Savings Plans for cost optimization
Implement a Spot Strategy for fault-tolerant workloads
Track Unit Economics to measure cost efficiency
Build Cost Governance with policies and automation

✅ When to use

Optimization	Use Case	Savings
Reserved Instances	Stable, predictable EC2/RDS	30-72%
Savings Plans	Flexible compute commitment	30-66%
Spot Instances	Fault-tolerant, flexible timing	60-90%
Right-sizing	Over-provisioned resources	20-50%
Storage tiering	Infrequent access data	50-90%

❌ When NOT to use

Pattern	Problem	Alternative
RI for dev/test	Wasted when idle	Spot or On-Demand
Spot for databases	Data loss risk	Reserved
3-year RI for uncertain workloads	Lock-in risk	1-year or Savings Plans
No tags	Costs cannot be allocated	Mandatory tagging

⚠️ Warning from Raizo

"A team bought $500K of Reserved Instances for a workload they were 'certain' would stay stable. Six months later, the workload migrated to containers, and the RIs sat unused for 18 months. Always start with 1-year commitments and convertible options."

Cost Management Framework

FinOps Principles

┌─────────────────────────────────────────────────────────────────┐
│                    FINOPS FRAMEWORK                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    INFORM                                │    │
│  │  • Cost visibility and allocation                        │    │
│  │  • Tagging and showback                                  │    │
│  │  • Anomaly detection                                     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   OPTIMIZE                               │    │
│  │  • Right-sizing                                          │    │
│  │  • Reserved capacity                                     │    │
│  │  • Spot instances                                        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    OPERATE                               │    │
│  │  • Budgets and alerts                                    │    │
│  │  • Governance policies                                   │    │
│  │  • Continuous improvement                                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Tagging Strategy

Mandatory Tags

┌─────────────────────────────────────────────────────────────────┐
│                 TAGGING TAXONOMY                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  COST ALLOCATION (Required):                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ CostCenter      │ engineering-platform                  │    │
│  │ Project         │ payment-gateway                       │    │
│  │ Environment     │ production | staging | development    │    │
│  │ Owner           │ team-payments@company.com             │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  OPERATIONAL (Required):                                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Application     │ checkout-service                      │    │
│  │ ManagedBy       │ terraform | manual | cloudformation   │    │
│  │ Backup          │ daily | weekly | none                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  COMPLIANCE (As needed):                                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ DataClassification │ public | internal | confidential   │    │
│  │ Compliance         │ pci-dss | hipaa | sox              │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
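The taxonomy above can be checked mechanically before a resource is deployed. The following is a minimal sketch, assuming the seven required keys and the allowed values shown in the diagram; the `validate_tags` function and the `REQUIRED_TAGS` table are hypothetical names introduced for illustration, not part of any AWS SDK.

```python
# Hypothetical validator for the tagging taxonomy above. The rule table is
# copied from the diagram; None means "any non-empty value is accepted".
REQUIRED_TAGS = {
    "CostCenter": None,
    "Project": None,
    "Environment": {"production", "staging", "development"},
    "Owner": None,
    "Application": None,
    "ManagedBy": {"terraform", "manual", "cloudformation"},
    "Backup": {"daily", "weekly", "none"},
}


def validate_tags(tags: dict) -> list:
    """Return a list of human-readable violations (empty means compliant)."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing required tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value!r}")
    return problems
```

A check like this could run in CI against planned resources, so that violations surface before the deny policy or Config rule does.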

Tag Enforcement

json

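{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireCostCenterTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/CostCenter": "true"}}
    },
    {
      "Sid": "RequireEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/Environment": "true"}}
    },
    {
      "Sid": "RequireOwnerTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/Owner": "true"}}
    }
  ]
}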
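Condition keys listed together under a single `Null` operator are combined with AND, so such a statement only denies a request in which every listed tag is missing. A common way to reject a request that misses any one tag is to emit one Deny statement per tag. The sketch below generates that shape from a list of tag keys; the function name and the resource ARN patterns are assumptions for illustration and should be adjusted to the resources you actually tag at creation time.

```python
import json


def build_tag_enforcement_policy(required_tags, actions, resources):
    """Build one Deny statement per required tag key (hypothetical helper)."""
    statements = []
    for tag in required_tags:
        statements.append({
            "Sid": f"Require{tag}Tag",
            "Effect": "Deny",
            "Action": list(actions),
            "Resource": list(resources),
            # "true" means: the condition matches when the tag key is absent.
            "Condition": {"Null": {f"aws:RequestTag/{tag}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_tag_enforcement_policy(
    required_tags=["CostCenter", "Environment", "Owner"],
    actions=["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
    # Assumed ARN patterns; scoping avoids denying untaggable sub-resources.
    resources=[
        "arn:aws:ec2:*:*:instance/*",
        "arn:aws:ec2:*:*:volume/*",
        "arn:aws:rds:*:*:db:*",
    ],
)

if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
```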

AWS Config Rule for Tags

yaml

# Config rule to check required tags
Resources:
  RequiredTagsRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: required-tags
      Description: Check that required tags are present
      InputParameters:
        tag1Key: CostCenter
        tag2Key: Environment
        tag3Key: Owner
      Source:
        Owner: AWS
        SourceIdentifier: REQUIRED_TAGS
      Scope:
        ComplianceResourceTypes:
          - AWS::EC2::Instance
          - AWS::RDS::DBInstance
          - AWS::S3::Bucket

AWS Budgets

Budget Types

Type	Use Case	Alert Trigger
Cost Budget	Track spending	Actual or forecasted
Usage Budget	Track resource usage	Hours, GB, requests
RI Utilization	Monitor RI usage	Below threshold
RI Coverage	Monitor RI coverage	Below threshold
Savings Plans	Monitor SP usage	Below threshold

Budget Configuration

json

{
  "BudgetName": "Monthly-Production-Budget",
  "BudgetLimit": {
    "Amount": "10000",
    "Unit": "USD"
  },
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "CostFilters": {
    "TagKeyValue": [
      "user:Environment$production"
    ]
  },
  "NotificationsWithSubscribers": [
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "finops@company.com"
        }
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "SNS",
          "Address": "arn:aws:sns:us-east-1:123456789012:budget-alerts"
        }
      ]
    }
  ]
}
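To reason about when the two notifications above would fire, it can help to replay them against sample numbers. This is a simplified model, assuming a notification triggers when the relevant spend (actual or forecasted) exceeds the threshold percentage of the budget limit; the `Notification` class and `triggered` function are hypothetical and do not call the AWS Budgets API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    notification_type: str  # "ACTUAL" or "FORECASTED"
    threshold_pct: float    # percentage of the budget limit


def triggered(budget_limit, actual, forecast, notifications):
    """Return the notifications whose GREATER_THAN threshold is exceeded."""
    fired = []
    for n in notifications:
        spend = actual if n.notification_type == "ACTUAL" else forecast
        if spend / budget_limit * 100 > n.threshold_pct:
            fired.append(n)
    return fired


# The two rules from the budget configuration above.
RULES = [Notification("ACTUAL", 80), Notification("FORECASTED", 100)]

if __name__ == "__main__":
    # Sample month: $8,500 spent so far, $9,800 forecasted, $10,000 limit.
    for n in triggered(10_000, 8_500, 9_800, RULES):
        print(f"alert: {n.notification_type} > {n.threshold_pct}%")
```

With these sample numbers only the 80% actual-spend rule fires, because the forecast is still below the limit.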

Budget Actions (Auto-Remediation)

┌─────────────────────────────────────────────────────────────────┐
│                 BUDGET ACTIONS                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Threshold: 100% of budget                                      │
│                                                                 │
│  Actions:                                                       │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 1. Apply IAM Policy                                     │    │
│  │    • Deny ec2:RunInstances                              │    │
│  │    • Deny rds:CreateDBInstance                          │    │
│  │    • Prevents new resource creation                     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 2. Apply SCP (Organization level)                       │    │
│  │    • Restrict entire account                            │    │
│  │    • Emergency cost control                             │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ⚠️ Use with caution - can impact production!                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Reserved Instances & Savings Plans

Commitment Options

RI vs Savings Plans

Feature	Reserved Instances	Savings Plans
Flexibility	Instance type specific	Any instance type
Region	Region specific	Any region (Compute SP)
Services	EC2, RDS, ElastiCache	EC2, Fargate, Lambda
Max Savings	Up to 72%	Up to 66% (Compute SP) or 72% (EC2 Instance SP)
Commitment	1 or 3 years	1 or 3 years

Coverage Analysis

sql

-- Athena query for RI coverage analysis over the last 30 days
-- 'DiscountedUsage' = usage covered by an RI; 'Usage' = On-Demand usage
SELECT
  line_item_product_code,
  product_instance_type,
  SUM(CASE WHEN line_item_line_item_type = 'DiscountedUsage'
      THEN line_item_usage_amount ELSE 0 END) AS reserved_usage,
  SUM(CASE WHEN line_item_line_item_type = 'Usage'
      THEN line_item_usage_amount ELSE 0 END) AS on_demand_usage,
  ROUND(
    SUM(CASE WHEN line_item_line_item_type = 'DiscountedUsage'
        THEN line_item_usage_amount ELSE 0 END) * 100.0 /
    NULLIF(SUM(line_item_usage_amount), 0), 2
  ) AS coverage_percentage
FROM cost_and_usage_report
WHERE line_item_usage_start_date >= date_add('day', -30, current_date)
  -- Keep only RI-covered and On-Demand usage so that fees, taxes, and
  -- other line item types do not inflate the denominator
  AND line_item_line_item_type IN ('DiscountedUsage', 'Usage')
GROUP BY line_item_product_code, product_instance_type
HAVING SUM(line_item_usage_amount) > 100
ORDER BY on_demand_usage DESC;

Spot Instances Strategy

Spot Best Practices

┌─────────────────────────────────────────────────────────────────┐
│                 SPOT INSTANCE STRATEGY                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DIVERSIFICATION:                                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • Use multiple instance types (at least 10)             │    │
│  │ • Spread across all AZs                                 │    │
│  │ • Mix instance families (m5, m5a, m5n, m6i)             │    │
│  │ • Use capacity-optimized allocation strategy            │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  WORKLOAD FIT:                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ ✅ Good for Spot:                                       │    │
│  │    • Batch processing                                   │    │
│  │    • CI/CD workers                                      │    │
│  │    • Data analysis                                      │    │
│  │    • Stateless web servers (with proper LB)             │    │
│  │                                                         │    │
│  │ ❌ Not for Spot:                                        │    │
│  │    • Databases                                          │    │
│  │    • Single points of failure                           │    │
│  │    • Long-running stateful jobs                         │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Mixed Instances ASG

json

{
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateId": "lt-xxx",
        "Version": "$Latest"
      },
      "Overrides": [
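        {"InstanceType": "m5.large", "WeightedCapacity": "1"},
        {"InstanceType": "m5a.large", "WeightedCapacity": "1"},
        {"InstanceType": "m5n.large", "WeightedCapacity": "1"},
        {"InstanceType": "m6i.large", "WeightedCapacity": "1"},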
        {"InstanceType": "m5.xlarge", "WeightedCapacity": "2"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 20,
      "SpotAllocationStrategy": "capacity-optimized",
      "SpotMaxPrice": ""
    }
  }
}
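The `InstancesDistribution` settings above determine how desired capacity is divided between On-Demand and Spot. The sketch below works through that arithmetic: the base capacity is filled with On-Demand first, then the percentage is applied to whatever remains. Rounding the On-Demand share up is an assumption made here for illustration, and `split_capacity` is a hypothetical helper, not an Auto Scaling API.

```python
def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand_units, spot_units) for a desired capacity.

    Units are capacity units, so a weighted instance type counts for its
    WeightedCapacity rather than for one instance.
    """
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Ceiling division in integers: assumed rounding in favour of On-Demand.
    on_demand_above = -(-above_base * on_demand_pct_above_base // 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


if __name__ == "__main__":
    # Settings from the policy above: base of 2, then 20% On-Demand.
    for desired in (1, 7, 12):
        od, spot = split_capacity(desired, 2, 20)
        print(f"desired={desired}: on_demand={od}, spot={spot}")
```

For a desired capacity of 12 this gives 4 On-Demand units (2 base plus 20% of the remaining 10) and 8 Spot units.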

Cost Anomaly Detection

Configuration

┌─────────────────────────────────────────────────────────────────┐
│              COST ANOMALY DETECTION                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Monitor Types:                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 1. AWS Services                                         │    │
│  │    • Detect anomalies across all services               │    │
│  │    • Good for overall cost monitoring                   │    │
│  │                                                         │    │
│  │ 2. Linked Accounts                                      │    │
│  │    • Monitor specific accounts                          │    │
│  │    • Good for multi-account organizations               │    │
│  │                                                         │    │
│  │ 3. Cost Categories                                      │    │
│  │    • Monitor by business unit/project                   │    │
│  │    • Requires cost categories setup                     │    │
│  │                                                         │    │
│  │ 4. Cost Allocation Tags                                 │    │
│  │    • Monitor by specific tags                           │    │
│  │    • Most granular control                              │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  Alert Thresholds:                                              │
│  • Absolute: Alert when anomaly > $100                          │
│  • Percentage: Alert when anomaly > 10% of expected             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
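The two alert thresholds above can be combined in more than one way. The sketch below models both: alert when either threshold is exceeded, or only when both are. It is a simplified illustration of the threshold logic, assuming an "expected" spend figure is already available; it does not reproduce how the AWS service estimates that figure, and `should_alert` is a hypothetical name.

```python
def should_alert(expected, actual, min_abs=100.0, min_pct=10.0, require_both=False):
    """Decide whether a cost anomaly is large enough to alert on.

    min_abs: absolute impact threshold in dollars ($100 in the diagram).
    min_pct: impact as a percentage of expected spend (10% in the diagram).
    """
    impact = actual - expected
    if impact <= 0:
        return False  # spending at or below expectation is not an anomaly here
    abs_hit = impact > min_abs
    pct_hit = expected > 0 and impact / expected * 100 > min_pct
    if require_both:
        return abs_hit and pct_hit
    return abs_hit or pct_hit
```

Requiring both thresholds suppresses noise from small accounts, where a few dollars can be a large percentage; using either one catches large absolute jumps on big accounts even when the percentage is small.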

Unit Economics

Cost Per Transaction

sql

-- Calculate cost per transaction
WITH daily_costs AS (
  SELECT 
    DATE(line_item_usage_start_date) as date,
    SUM(line_item_unblended_cost) as total_cost
  FROM cost_and_usage_report
  WHERE line_item_product_code = 'AmazonEC2'
    AND resource_tags_user_application = 'checkout-service'
  GROUP BY DATE(line_item_usage_start_date)
),
daily_transactions AS (
  SELECT 
    DATE(timestamp) as date,
    COUNT(*) as transaction_count
  FROM application_metrics
  WHERE metric_name = 'checkout_completed'
  GROUP BY DATE(timestamp)
)
SELECT 
  c.date,
  c.total_cost,
  t.transaction_count,
  ROUND(c.total_cost / t.transaction_count, 4) as cost_per_transaction
FROM daily_costs c
JOIN daily_transactions t ON c.date = t.date
ORDER BY c.date DESC;

Cost Dashboard Metrics

Metric	Formula	Target
Cost per Request	Total Cost / Request Count	< $0.001
Cost per User	Total Cost / Active Users	< $1/month
Infrastructure Efficiency	Revenue / Infrastructure Cost	> 10x
RI/SP Coverage	Reserved Hours / Total Hours	> 70%
Spot Utilization	Spot Hours / Non-prod Hours	> 80%
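The formulas in the table above are simple ratios, and a small helper makes the targets easy to check in a report or a scheduled job. This is a sketch under the assumption that the inputs (cost, request count, users, revenue, hours) have already been collected for the same period; all names and the sample figures are illustrative, not measured data.

```python
def ratio(numerator, denominator):
    """Divide safely; return None when the denominator is zero."""
    return numerator / denominator if denominator else None


def unit_metrics(total_cost, requests, active_users, revenue,
                 reserved_hours, total_hours):
    """Compute the dashboard metrics from the table above."""
    return {
        "cost_per_request": ratio(total_cost, requests),
        "cost_per_user": ratio(total_cost, active_users),
        "infrastructure_efficiency": ratio(revenue, total_cost),
        "ri_sp_coverage": ratio(reserved_hours, total_hours),
    }


# Targets from the table: (comparison, limit).
TARGETS = {
    "cost_per_request": ("<", 0.001),
    "cost_per_user": ("<", 1.0),
    "infrastructure_efficiency": (">", 10.0),
    "ri_sp_coverage": (">", 0.70),
}


def missed_targets(metrics):
    """Return the metric names that are unknown or outside their target."""
    missed = []
    for name, (op, limit) in TARGETS.items():
        value = metrics[name]
        ok = value is not None and (value < limit if op == "<" else value > limit)
        if not ok:
            missed.append(name)
    return missed
```

For example, a month with $50,000 of cost, 80 million requests, 60,000 active users, $600,000 of revenue, and 7,200 reserved hours out of 9,600 would meet all four targets.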

Best Practices Checklist

[ ] Implement mandatory tagging with enforcement
[ ] Set up budgets with alerts at 50%, 80%, 100%
[ ] Enable Cost Anomaly Detection
[ ] Purchase Reserved Instances for stable workloads
[ ] Use Savings Plans for flexible compute
[ ] Implement Spot for fault-tolerant workloads
[ ] Create cost allocation reports by team/project
[ ] Review and right-size resources monthly
[ ] Track unit economics metrics

⚖️ Trade-offs

Trade-off 1: Reserved Instances vs Savings Plans

Aspect	Reserved Instances	Savings Plans
Flexibility	Instance type locked	Any instance type
Discount	Up to 72%	Up to 66% (Compute SP) or 72% (EC2 Instance SP)
Scope	Region/AZ specific	Single Region or any Region
Sellable	Yes (Marketplace)	No
Best for	Stable, known workloads	Dynamic compute needs

Recommendation:

RIs: Database instances, known baseline
Savings Plans: Application servers, variable needs

Trade-off 2: Commitment Length

Term	Discount	Risk	Recommendation
On-Demand	0%	None	Unknown/variable workloads
1-year	30-40%	Low	Standard production
3-year	60-72%	High	Only ultra-stable workloads

Rule of thumb: Start with 1-year commitments, then move to 3-year terms once the workload has proven stable.
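The risk column can be made concrete with a break-even calculation. Under a simplified model, a commitment at discount d costs (1 - d) of the On-Demand price for every month of the term whether or not it is used, so it only pays off if the workload runs for at least term × (1 - d) months. The sketch below assumes a flat On-Demand price, full usage while the workload exists, and zero usage afterwards; both function names are hypothetical.

```python
def breakeven_months(term_months, discount):
    """Months of full usage needed before a commitment beats On-Demand."""
    return term_months * (1 - discount)


def net_savings(term_months, discount, months_used, on_demand_monthly):
    """Savings versus On-Demand; negative means the commitment lost money."""
    on_demand_cost = months_used * on_demand_monthly
    commitment_cost = term_months * on_demand_monthly * (1 - discount)
    return on_demand_cost - commitment_cost


if __name__ == "__main__":
    for term, discount in ((12, 0.30), (36, 0.60)):
        months = breakeven_months(term, discount)
        print(f"{term}-month term at {discount:.0%}: break-even after {months:.1f} months")
```

Under these assumptions a 1-year term at 30% breaks even after about 8.4 months and a 3-year term at 60% after about 14.4 months, which is why a workload that disappears after six months loses money on either.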

Trade-off 3: Spot vs Reserved

Aspect	Spot	Reserved
Discount	60-90%	30-72%
Availability	Not guaranteed	Guaranteed (zonal RIs)
Interruption	Yes (2-minute warning)	No
Best for	Batch, stateless	Databases, stateful

🚨 Failure Modes

Failure Mode 1: Budget Overrun

🔥 Real-world incident

A developer spun up an ML training cluster and forgot to terminate it. The result was an $80,000 bill after one week. There were no budget alerts, so the team only found out when the monthly bill arrived.

How to detect	How to prevent
Monthly bill shock	Budget alerts at 50%, 80%, 100%
Cost Anomaly Detection alerts	Enable and configure thresholds
Cost Explorer review	Daily cost monitoring

Failure Mode 2: Wasted Reserved Capacity

How to detect	How to prevent
RI utilization report < 80%	Right-size before committing
Unused RIs	Start with 1-year, convertible
Workload migration	Savings Plans for flexibility

Failure Mode 3: Untagged Resources

┌─────────────────────────────────────────────────────────────────┐
│                  UNTAGGED RESOURCES PROBLEM                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Monthly AWS Bill: $500,000                                     │
│  │                                                              │
│  ├─ Team A: $??? (20% untagged)                                 │
│  ├─ Team B: $??? (15% untagged)                                 │
│  └─ Unknown: $80,000 (nobody knows who owns it)                 │
│                                                                 │
│  Impact: Cannot measure costs, no accountability                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

How to detect	How to prevent
Cost allocation report gaps	SCPs require tags
"Untagged" in Cost Explorer	Tag policies
Config rules violations	Automated tagging remediation

🔐 Security Baseline

Cost Governance Security

Requirement	Implementation	Verification
Budget owner access	IAM policies per team	Access review
Cost data access	Role-based Cost Explorer	Policy audit
No surprise resources	SCPs limit instance types	Compliance check
Billing access restricted	Billing console IAM	Access audit

Cost-Related SCPs

json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveInstances",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringLike": {
          "ec2:InstanceType": ["p4d.*", "x2*", "u-*"]
        }
      }
    }
  ]
}
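Wildcard patterns in a deny list are easy to get subtly wrong, so it is worth checking which instance types they would catch before rolling the SCP out. The sketch below approximates the policy's `*` wildcard matching with Python's `fnmatchcase`, which is case-sensitive; this is an approximation for quick checks, not an IAM policy simulator, and `is_denied` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Patterns copied from the SCP above.
DENIED_PATTERNS = ["p4d.*", "x2*", "u-*"]


def is_denied(instance_type, patterns=DENIED_PATTERNS):
    """Return True if the instance type matches any denied pattern."""
    return any(fnmatchcase(instance_type, p) for p in patterns)


if __name__ == "__main__":
    for itype in ("p4d.24xlarge", "x2iedn.xlarge", "u-6tb1.metal",
                  "m5.large", "p3.2xlarge"):
        print(f"{itype}: {'denied' if is_denied(itype) else 'allowed'}")
```

Note that `x2*` has no dot, so it covers every family whose name starts with `x2`, while `p4d.*` is limited to the `p4d` family and leaves other GPU families such as `p3` untouched.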

📊 Ops Readiness

Metrics to Monitor

Metric	Source	Alert Threshold
Daily spend	Cost Explorer	> budget/30
RI utilization	Cost Explorer	< 80%
Spot interruption rate	CloudWatch	> 10%
Untagged resources	Config	> 5%
Unit cost	Custom	> baseline + 20%
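The thresholds in the table above can be evaluated together in a single daily check. The following is a minimal sketch, assuming the five readings have already been gathered from their sources and that rates are expressed as fractions (0.05 for 5%); the `breached` function and its input keys are hypothetical.

```python
def breached(readings, monthly_budget, unit_cost_baseline):
    """Return the names of metrics that cross the alert thresholds above."""
    checks = {
        # Daily spend above an even 1/30 share of the monthly budget.
        "daily_spend": readings["daily_spend"] > monthly_budget / 30,
        "ri_utilization": readings["ri_utilization"] < 0.80,
        "spot_interruption_rate": readings["spot_interruption_rate"] > 0.10,
        "untagged_resources": readings["untagged_resources"] > 0.05,
        # Unit cost more than 20% above the agreed baseline.
        "unit_cost": readings["unit_cost"] > unit_cost_baseline * 1.20,
    }
    return [name for name, hit in checks.items() if hit]


if __name__ == "__main__":
    sample = {
        "daily_spend": 400.0,
        "ri_utilization": 0.75,
        "spot_interruption_rate": 0.05,
        "untagged_resources": 0.06,
        "unit_cost": 0.0011,
    }
    print(breached(sample, monthly_budget=10_000, unit_cost_baseline=0.001))
```

Each name returned maps naturally to one of the runbooks, so the output of a check like this can drive which runbook to open.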

Cost Review Schedule

Review	Frequency	Participants	Actions
Daily glance	Daily	FinOps	Anomaly check
Weekly review	Weekly	FinOps + Leads	Right-sizing
Monthly deep-dive	Monthly	All stakeholders	Optimization
Quarterly planning	Quarterly	Leadership	Commitments

Runbook Entry Points

Situation	Runbook
Budget alert triggered	`runbook/budget-overrun-response.md`
Cost anomaly detected	`runbook/cost-anomaly-investigation.md`
RI underutilization	`runbook/ri-optimization.md`
Untagged resources found	`runbook/tagging-remediation.md`
Spot capacity issues	`runbook/spot-capacity-management.md`
Right-sizing opportunity	`runbook/right-sizing-process.md`

✅ Design Review Checklist

Visibility

[ ] Mandatory tags defined and enforced
[ ] Cost allocation reports configured
[ ] Budgets set up with alerts
[ ] Cost Anomaly Detection enabled

Optimization

[ ] Right-sizing analysis completed
[ ] RI/SP coverage > 70% for stable workloads
[ ] Spot strategy for fault-tolerant workloads
[ ] Storage tiering applied

Governance

[ ] SCPs limit expensive resources
[ ] Tag policies enforced
[ ] Cost review meetings scheduled
[ ] Team accountability defined

Operations

[ ] Unit economics tracking
[ ] Monthly optimization review
[ ] Runbooks documented
[ ] FinOps owner assigned

📎 Links

📎 GCP Cost & Quotas - Comparison with GCP's cost management
📎 Compute Decisioning - Compute cost optimization
📎 Storage & Data Protection - Storage cost optimization
📎 Account & Landing Zone - Multi-account cost allocation
📎 Terraform IaC - Infrastructure as Code basics
