💰 Cost Governance

Level: Ops
Solves: Implement cost visibility, optimization, and governance for enterprise AWS environments

🎯 Outcomes

After applying the material on this page, you will be able to:

  • Design a tagging strategy for cost allocation and showback
  • Configure AWS Budgets with alerts and actions
  • Deploy Reserved Instances/Savings Plans for cost optimization
  • Implement a Spot strategy for fault-tolerant workloads
  • Track unit economics to measure cost efficiency
  • Build cost governance with policies and automation

When to use

Optimization | Use Case | Savings
Reserved Instances | Stable, predictable EC2/RDS | 30-72%
Savings Plans | Flexible compute commitment | 30-66%
Spot Instances | Fault-tolerant, flexible timing | 60-90%
Right-sizing | Over-provisioned resources | 20-50%
Storage tiering | Infrequent access data | 50-90%

When NOT to use

Pattern | Problem | Alternative
RIs for dev/test | Wasted when idle | Spot or On-Demand
Spot for databases | Data loss risk | Reserved
3-year RIs for uncertain workloads | Lock-in risk | 1-year term or Savings Plans
No tags | Costs cannot be allocated | Mandatory tagging

⚠️ Warning from Raizo

"A team bought $500K of Reserved Instances for a workload they were 'sure' would stay stable. Six months later, the workload migrated to containers. The RIs sat unused for 18 months. Always start with 1-year commitments and convertible options."

Cost Management Framework

FinOps Principles

┌─────────────────────────────────────────────────────────────────┐
│                    FINOPS FRAMEWORK                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    INFORM                                │    │
│  │  • Cost visibility and allocation                        │    │
│  │  • Tagging and showback                                  │    │
│  │  • Anomaly detection                                     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   OPTIMIZE                               │    │
│  │  • Right-sizing                                          │    │
│  │  • Reserved capacity                                     │    │
│  │  • Spot instances                                        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    OPERATE                               │    │
│  │  • Budgets and alerts                                    │    │
│  │  • Governance policies                                   │    │
│  │  • Continuous improvement                                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Tagging Strategy

Mandatory Tags

┌─────────────────────────────────────────────────────────────────┐
│                 TAGGING TAXONOMY                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  COST ALLOCATION (Required):                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ CostCenter      │ engineering-platform                  │    │
│  │ Project         │ payment-gateway                       │    │
│  │ Environment     │ production | staging | development    │    │
│  │ Owner           │ team-payments@company.com             │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  OPERATIONAL (Required):                                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Application     │ checkout-service                      │    │
│  │ ManagedBy       │ terraform | manual | cloudformation   │    │
│  │ Backup          │ daily | weekly | none                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  COMPLIANCE (As needed):                                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ DataClassification │ public | internal | confidential   │    │
│  │ Compliance         │ pci-dss | hipaa | sox              │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
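
The taxonomy above can be checked mechanically before a resource is created. The following is a minimal Python sketch, assuming the required keys and allowed values shown in the diagram; the names REQUIRED_TAGS and find_tag_violations are hypothetical helpers, not part of any AWS SDK.

```python
# Illustrative sketch only: REQUIRED_TAGS and find_tag_violations are
# hypothetical names that mirror the taxonomy diagram, not an AWS API.
REQUIRED_TAGS = {
    "CostCenter": None,  # None means any non-empty value is accepted
    "Project": None,
    "Environment": {"production", "staging", "development"},
    "Owner": None,
    "Application": None,
    "ManagedBy": {"terraform", "manual", "cloudformation"},
    "Backup": {"daily", "weekly", "none"},
}


def find_tag_violations(tags: dict[str, str]) -> list[str]:
    """Return human-readable problems for one resource's tag set."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing required tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


if __name__ == "__main__":
    example = {
        "CostCenter": "engineering-platform",
        "Project": "payment-gateway",
        "Environment": "prod",  # not one of the allowed values
        "Owner": "team-payments@company.com",
        "Application": "checkout-service",
        "ManagedBy": "terraform",
    }
    for problem in find_tag_violations(example):
        print(problem)
```

A check like this could run in a CI step against planned resources, so that violations surface before the IAM or Config controls reject or flag them.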

Tag Enforcement

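Keys listed under a single condition operator are ANDed, so one combined Null block would deny a request only when all three tags are missing. Each required tag therefore gets its own Deny statement. Resource is scoped to instances, volumes, and DB instances because ec2:RunInstances is also evaluated against resources such as AMIs and subnets, which never carry request tags.

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireCostCenterTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/CostCenter": "true"}}
    },
    {
      "Sid": "RequireEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/Environment": "true"}}
    },
    {
      "Sid": "RequireOwnerTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVolume", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/Owner": "true"}}
    }
  ]
}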

AWS Config Rule for Tags

yaml
# Config rule to check required tags
Resources:
  RequiredTagsRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: required-tags
      Description: Check that required tags are present
      InputParameters:
        tag1Key: CostCenter
        tag2Key: Environment
        tag3Key: Owner
      Source:
        Owner: AWS
        SourceIdentifier: REQUIRED_TAGS
      Scope:
        ComplianceResourceTypes:
          - AWS::EC2::Instance
          - AWS::RDS::DBInstance
          - AWS::S3::Bucket

AWS Budgets

Budget Types

Type | Use Case | Alert Trigger
Cost Budget | Track spending | Actual or forecasted
Usage Budget | Track resource usage | Hours, GB, requests
RI Utilization | Monitor RI usage | Below threshold
RI Coverage | Monitor RI coverage | Below threshold
Savings Plans | Monitor SP usage | Below threshold

Budget Configuration

json
{
  "BudgetName": "Monthly-Production-Budget",
  "BudgetLimit": {
    "Amount": "10000",
    "Unit": "USD"
  },
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "CostFilters": {
    "TagKeyValue": [
      "user:Environment$production"
    ]
  },
  "NotificationsWithSubscribers": [
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "finops@company.com"
        }
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "SNS",
          "Address": "arn:aws:sns:us-east-1:123456789012:budget-alerts"
        }
      ]
    }
  ]
}
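
To make the two notification rules above concrete, here is a small Python sketch of how ACTUAL and FORECASTED thresholds with the GREATER_THAN operator could be evaluated against a budget limit. It is a local model for reasoning about the configuration, not a call to the AWS Budgets API; the Notification class and triggered_notifications function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    """Hypothetical local model of one budget notification rule."""
    notification_type: str  # "ACTUAL" or "FORECASTED"
    threshold_pct: float    # percentage of the budget limit


def triggered_notifications(limit, actual, forecast, notifications):
    """Return the rules whose GREATER_THAN percentage threshold is exceeded."""
    fired = []
    for n in notifications:
        # ACTUAL rules compare spend to date; FORECASTED rules compare
        # the projected end-of-period spend.
        spend = actual if n.notification_type == "ACTUAL" else forecast
        if spend > limit * n.threshold_pct / 100:
            fired.append(n)
    return fired


if __name__ == "__main__":
    rules = [Notification("ACTUAL", 80), Notification("FORECASTED", 100)]
    # Assumed figures: $7,500 spent so far, $10,400 forecast, $10,000 limit.
    for rule in triggered_notifications(10_000, 7_500, 10_400, rules):
        print(f"{rule.notification_type} threshold {rule.threshold_pct}% exceeded")
```

With the assumed figures, only the forecast rule fires: actual spend is still below 80% of the limit, but the projection exceeds 100%. This is the case where a forecast alert gives earlier warning than an actual-spend alert.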

Budget Actions (Auto-Remediation)

┌─────────────────────────────────────────────────────────────────┐
│                 BUDGET ACTIONS                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Threshold: 100% of budget                                      │
│                                                                 │
│  Actions:                                                       │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 1. Apply IAM Policy                                     │    │
│  │    • Deny ec2:RunInstances                              │    │
│  │    • Deny rds:CreateDBInstance                          │    │
│  │    • Prevents new resource creation                     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 2. Apply SCP (Organization level)                       │    │
│  │    • Restrict entire account                            │    │
│  │    • Emergency cost control                             │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ⚠️ Use with caution - can impact production!                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Reserved Instances & Savings Plans

Commitment Options

RI vs Savings Plans

Feature | Reserved Instances | Savings Plans
Flexibility | Instance type specific | Any instance type (Compute SP)
Region | Region specific | Any region (Compute SP)
Services | EC2, RDS, ElastiCache | EC2, Fargate, Lambda
Max Savings | Up to 72% | Up to 66% (Compute SP), up to 72% (EC2 Instance SP)
Commitment | 1 or 3 years | 1 or 3 years

Coverage Analysis

sql
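-- Athena query for RI coverage analysis (last 30 days)
SELECT 
  line_item_product_code,
  product_instance_type,
  -- Usage billed at the reserved rate
  SUM(CASE WHEN line_item_line_item_type = 'DiscountedUsage' 
      THEN line_item_usage_amount ELSE 0 END) as reserved_usage,
  -- Usage billed at the On-Demand rate
  SUM(CASE WHEN line_item_line_item_type = 'Usage' 
      THEN line_item_usage_amount ELSE 0 END) as on_demand_usage,
  ROUND(
    SUM(CASE WHEN line_item_line_item_type = 'DiscountedUsage' 
        THEN line_item_usage_amount ELSE 0 END) * 100.0 /
    NULLIF(SUM(line_item_usage_amount), 0), 2
  ) as coverage_percentage
FROM cost_and_usage_report
WHERE line_item_usage_start_date >= date_add('day', -30, current_date)
  -- Keep only usage rows; fee, tax, and credit rows would distort the denominator
  AND line_item_line_item_type IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
  -- Skip rows without an instance type (storage, data transfer, and so on)
  AND product_instance_type <> ''
GROUP BY line_item_product_code, product_instance_type
HAVING SUM(line_item_usage_amount) > 100
ORDER BY on_demand_usage DESC;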

Spot Instances Strategy

Spot Best Practices

┌─────────────────────────────────────────────────────────────────┐
│                 SPOT INSTANCE STRATEGY                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DIVERSIFICATION:                                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • Use multiple instance types (at least 10)             │    │
│  │ • Spread across all AZs                                 │    │
│  │ • Mix instance families (m5, m5a, m5n, m6i)             │    │
│  │ • Use capacity-optimized allocation strategy            │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  WORKLOAD FIT:                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ ✅ Good for Spot:                                       │    │
│  │    • Batch processing                                   │    │
│  │    • CI/CD workers                                      │    │
│  │    • Data analysis                                      │    │
│  │    • Stateless web servers (with proper LB)             │    │
│  │                                                         │    │
│  │ ❌ Not for Spot:                                        │    │
│  │    • Databases                                          │    │
│  │    • Single points of failure                           │    │
│  │    • Long-running stateful jobs                         │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Mixed Instances ASG

json
{
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateId": "lt-xxx",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m5.large"},
        {"InstanceType": "m5a.large"},
        {"InstanceType": "m5n.large"},
        {"InstanceType": "m6i.large"},
        {"InstanceType": "m5.xlarge", "WeightedCapacity": "2"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 20,
      "SpotAllocationStrategy": "capacity-optimized",
      "SpotMaxPrice": ""
    }
  }
}
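
The InstancesDistribution settings above determine how a desired capacity is divided between On-Demand and Spot. The Python sketch below illustrates that arithmetic. Capacity is counted in units (so an m5.xlarge with WeightedCapacity 2 counts as two), and rounding fractional On-Demand capacity upward is an assumption of this sketch that should be verified against the Auto Scaling documentation. The function name split_capacity is hypothetical.

```python
import math


def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Split desired capacity units into (on_demand, spot).

    Assumption: a fractional On-Demand share is rounded up. Check the
    actual rounding behavior in the Auto Scaling documentation.
    """
    # The base is filled with On-Demand first, up to the desired capacity.
    base = min(desired, on_demand_base)
    above_base = desired - base
    # The percentage applies only to capacity above the base.
    on_demand_above = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


if __name__ == "__main__":
    # Values from the policy above: base of 2, 20% On-Demand above the base.
    for desired in (1, 5, 12):
        on_demand, spot = split_capacity(desired, 2, 20)
        print(f"desired={desired}: on_demand={on_demand}, spot={spot}")
```

Under these assumptions, a desired capacity of 12 units yields 4 On-Demand and 8 Spot units, so the On-Demand base matters most for small groups, where it can make up the entire capacity.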

Cost Anomaly Detection

Configuration

┌─────────────────────────────────────────────────────────────────┐
│              COST ANOMALY DETECTION                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Monitor Types:                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 1. AWS Services                                         │    │
│  │    • Detect anomalies across all services               │    │
│  │    • Good for overall cost monitoring                   │    │
│  │                                                         │    │
│  │ 2. Linked Accounts                                      │    │
│  │    • Monitor specific accounts                          │    │
│  │    • Good for multi-account organizations               │    │
│  │                                                         │    │
│  │ 3. Cost Categories                                      │    │
│  │    • Monitor by business unit/project                   │    │
│  │    • Requires cost categories setup                     │    │
│  │                                                         │    │
│  │ 4. Cost Allocation Tags                                 │    │
│  │    • Monitor by specific tags                           │    │
│  │    • Most granular control                              │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  Alert Thresholds:                                              │
│  • Absolute: Alert when anomaly > $100                          │
│  • Percentage: Alert when anomaly > 10% of expected             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
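
The two alert thresholds above can be combined in different ways. The Python sketch below shows one possible evaluation, assuming the $100 absolute and 10% relative values from the diagram; the function should_alert and its require_both flag are hypothetical and do not correspond to an AWS API.

```python
def should_alert(expected, actual, min_abs_usd=100.0, min_pct=10.0,
                 require_both=False):
    """Decide whether a cost anomaly is large enough to alert on.

    expected and actual are spend figures in USD for the same period.
    Thresholds default to the values in the diagram above.
    """
    impact = actual - expected
    if impact <= 0:
        return False  # spending at or below expectation is not an anomaly here
    over_abs = impact > min_abs_usd
    # Guard against division by zero when nothing was expected.
    over_pct = expected > 0 and (impact / expected) * 100 > min_pct
    if require_both:
        return over_abs and over_pct
    return over_abs or over_pct


if __name__ == "__main__":
    print(should_alert(1000, 1150))                    # $150 over, 15% over
    print(should_alert(5000, 5080))                    # $80 over, 1.6% over
    print(should_alert(200, 250, require_both=True))   # 25% over but only $50
```

Requiring both thresholds reduces noise from small accounts where a few dollars is a large percentage, while requiring either one catches large absolute jumps on big accounts.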

Unit Economics

Cost Per Transaction

sql
-- Calculate cost per transaction
WITH daily_costs AS (
  SELECT 
    DATE(line_item_usage_start_date) as date,
    SUM(line_item_unblended_cost) as total_cost
  FROM cost_and_usage_report
  WHERE line_item_product_code = 'AmazonEC2'
    AND resource_tags_user_application = 'checkout-service'
  GROUP BY DATE(line_item_usage_start_date)
),
daily_transactions AS (
  SELECT 
    DATE(timestamp) as date,
    COUNT(*) as transaction_count
  FROM application_metrics
  WHERE metric_name = 'checkout_completed'
  GROUP BY DATE(timestamp)
)
SELECT 
  c.date,
  c.total_cost,
  t.transaction_count,
  ROUND(c.total_cost / t.transaction_count, 4) as cost_per_transaction
FROM daily_costs c
JOIN daily_transactions t ON c.date = t.date
ORDER BY c.date DESC;

Cost Dashboard Metrics

Metric | Formula | Target
Cost per Request | Total Cost / Request Count | < $0.001
Cost per User | Total Cost / Active Users | < $1/month
Infrastructure Efficiency | Revenue / Infrastructure Cost | > 10x
RI/SP Coverage | Reserved Hours / Total Hours | > 70%
Spot Utilization | Spot Hours / Non-prod Hours | > 80%
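
The first three formulas in the table can be computed from a handful of monthly totals. The Python sketch below is illustrative only: the function names, the TARGETS mapping, and the input figures are assumptions made for the example, and the targets are the ones listed in the table.

```python
def unit_metrics(total_cost, requests, active_users, revenue):
    """Compute three dashboard ratios; None where the denominator is zero."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "cost_per_request": ratio(total_cost, requests),
        "cost_per_user": ratio(total_cost, active_users),
        "infrastructure_efficiency": ratio(revenue, total_cost),
    }


# Targets taken from the table above.
TARGETS = {
    "cost_per_request": lambda v: v < 0.001,
    "cost_per_user": lambda v: v < 1.0,
    "infrastructure_efficiency": lambda v: v > 10.0,
}


def missed_targets(metrics):
    """Return the names of metrics that are undefined or off target."""
    return [
        name for name, on_target in TARGETS.items()
        if metrics[name] is None or not on_target(metrics[name])
    ]


if __name__ == "__main__":
    # Assumed monthly figures, for illustration only.
    metrics = unit_metrics(total_cost=42_000, requests=30_000_000,
                           active_users=50_000, revenue=300_000)
    print(metrics)
    print(missed_targets(metrics))
```

Treating an undefined ratio as a missed target is a deliberate choice here: a month with zero requests or zero cost data is more likely a reporting gap than a success.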

Best Practices Checklist

  • [ ] Implement mandatory tagging with enforcement
  • [ ] Set up budgets with alerts at 50%, 80%, 100%
  • [ ] Enable Cost Anomaly Detection
  • [ ] Purchase Reserved Instances for stable workloads
  • [ ] Use Savings Plans for flexible compute
  • [ ] Implement Spot for fault-tolerant workloads
  • [ ] Create cost allocation reports by team/project
  • [ ] Review and right-size resources monthly
  • [ ] Track unit economics metrics

⚖️ Trade-offs

Trade-off 1: Reserved Instances vs Savings Plans

Aspect | Reserved Instances | Savings Plans
Flexibility | Instance type locked | Any instance type
Discount | Up to 72% | Up to 66% (Compute SP), up to 72% (EC2 Instance SP)
Scope | Region or AZ specific | Single region or any region
Sellable | Yes (Standard EC2 RIs, via the RI Marketplace) | No
Best for | Stable, known workloads | Dynamic compute needs

Recommendations:

  • RIs: Database instances, known baseline
  • Savings Plans: Application servers, variable needs

Trade-off 2: Commitment Length

Term | Discount | Risk | Recommendation
On-Demand | 0% | None | Unknown/variable workloads
1-year | 30-40% | Low | Standard production
3-year | 60-72% | High | Only ultra-stable workloads

Rule of thumb: Start with a 1-year term, then move to a 3-year term once the workload has proven stable.
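
The risk column can be quantified with a simple break-even calculation. The Python sketch below assumes the commitment is paid in full for the whole term at a constant discount off the On-Demand rate, and that the workload runs at a steady rate until it is retired. The function names are hypothetical and the model ignores upfront payment options and resale.

```python
def break_even_months(term_months, discount):
    """Months of use after which a commitment becomes cheaper than On-Demand.

    The commitment costs term_months * (1 - discount) months of On-Demand
    spend, so it pays off only if the workload runs at least that long.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return term_months * (1 - discount)


def net_savings(term_months, discount, months_used, on_demand_monthly):
    """Savings versus On-Demand when the workload stops after months_used.

    A negative result means the commitment lost money.
    """
    committed_cost = term_months * on_demand_monthly * (1 - discount)
    on_demand_cost = months_used * on_demand_monthly
    return on_demand_cost - committed_cost


if __name__ == "__main__":
    print(break_even_months(36, 0.60))        # 3-year term at 60% discount
    print(break_even_months(12, 0.35))        # 1-year term at 35% discount
    print(net_savings(36, 0.60, 6, 1000.0))   # workload retired after 6 months
```

Under these assumptions, a 3-year term at a 60% discount needs about 14.4 months of use to break even, while a 1-year term at 35% needs about 7.8 months. A workload retired after 6 months loses money in both cases, but the loss is bounded by a much smaller commitment with the 1-year term.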


Trade-off 3: Spot vs Reserved

Aspect | Spot | Reserved
Discount | 60-90% | 30-72%
Availability | Not guaranteed | Guaranteed
Interruption | Yes (2-minute warning) | No
Best for | Batch, stateless | Databases, stateful

🚨 Failure Modes

Failure Mode 1: Budget Overrun

🔥 Real-world incident

A developer spun up an ML training cluster and forgot to terminate it. The bill reached $80,000 after one week. There were no budget alerts, so the team only found out when the monthly bill arrived.

How to detect | How to prevent
Monthly bill shock | Budget alerts at 50%, 80%, 100%
Cost Anomaly Detection alerts | Enable and configure thresholds
Cost Explorer review | Daily cost monitoring

Failure Mode 2: Wasted Reserved Capacity

How to detect | How to prevent
RI utilization report < 80% | Right-size before committing
Unused RIs | Start with 1-year, convertible RIs
Workload migration | Savings Plans for flexibility

Failure Mode 3: Untagged Resources

┌─────────────────────────────────────────────────────────────────┐
│                  UNTAGGED RESOURCES PROBLEM                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Monthly AWS Bill: $500,000                                     │
│  │                                                              │
│  ├─ Team A: $??? (20% untagged)                                 │
│  ├─ Team B: $??? (15% untagged)                                 │
│  └─ Unknown: $80,000 (nobody knows who owns it)                 │
│                                                                 │
│  Impact: Cannot measure costs, no accountability                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

How to detect | How to prevent
Cost allocation report gaps | SCPs that require tags
"Untagged" in Cost Explorer | Tag policies
Config rule violations | Automated tagging remediation

🔐 Security Baseline

Cost Governance Security

Requirement | Implementation | Verification
Budget owner access | IAM policies per team | Access review
Cost data access | Role-based Cost Explorer | Policy audit
No surprise resources | SCPs limit instance types | Compliance check
Billing access restricted | Billing console IAM | Access audit

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveInstances",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringLike": {
          "ec2:InstanceType": ["p4d.*", "x2*", "u-*"]
        }
      }
    }
  ]
}
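
Wildcard patterns like the ones in this policy are easy to get subtly wrong, so it can help to test them against a list of instance types before rollout. The Python sketch below approximates the policy's wildcard matching with fnmatchcase from the standard library. This is an approximation under stated assumptions: fnmatch also interprets square-bracket character classes, which IAM does not, and the is_denied helper is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Patterns copied from the policy above.
DENIED_PATTERNS = ["p4d.*", "x2*", "u-*"]


def is_denied(instance_type):
    """Return True if the instance type matches any denied pattern.

    Matching is case-sensitive. Assumption: the patterns contain only
    the * and ? wildcards, so fnmatch behaves like the policy condition.
    """
    return any(fnmatchcase(instance_type, p) for p in DENIED_PATTERNS)


if __name__ == "__main__":
    for candidate in ("p4d.24xlarge", "p4de.24xlarge", "x2idn.16xlarge",
                      "u-6tb1.metal", "m5.large"):
        print(candidate, is_denied(candidate))
```

Note that the pattern p4d.* does not match p4de.24xlarge, because the literal dot must follow p4d directly. If sibling families should also be blocked, the pattern list needs an explicit entry for them.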

📊 Ops Readiness

Metrics to Monitor

Metric | Source | Alert Threshold
Daily spend | Cost Explorer | > budget/30
RI utilization | Cost Explorer | < 80%
Spot interruption rate | CloudWatch | > 10%
Untagged resources | Config | > 5%
Unit cost | Custom | > baseline + 20%

Cost Review Schedule

Review | Frequency | Participants | Actions
Daily glance | Daily | FinOps | Anomaly check
Weekly review | Weekly | FinOps + Leads | Right-sizing
Monthly deep-dive | Monthly | All stakeholders | Optimization
Quarterly planning | Quarterly | Leadership | Commitments

Runbook Entry Points

Situation | Runbook
Budget alert triggered | runbook/budget-overrun-response.md
Cost anomaly detected | runbook/cost-anomaly-investigation.md
RI underutilization | runbook/ri-optimization.md
Untagged resources found | runbook/tagging-remediation.md
Spot capacity issues | runbook/spot-capacity-management.md
Right-sizing opportunity | runbook/right-sizing-process.md

Design Review Checklist

Visibility

  • [ ] Mandatory tags defined and enforced
  • [ ] Cost allocation reports configured
  • [ ] Budgets set up with alerts
  • [ ] Cost Anomaly Detection enabled

Optimization

  • [ ] Right-sizing analysis completed
  • [ ] RI/SP coverage > 70% for stable workloads
  • [ ] Spot strategy for fault-tolerant workloads
  • [ ] Storage tiering applied

Governance

  • [ ] SCPs limit expensive resources
  • [ ] Tag policies enforced
  • [ ] Cost review meetings scheduled
  • [ ] Team accountability defined

Operations

  • [ ] Unit economics tracking
  • [ ] Monthly optimization review
  • [ ] Runbooks documented
  • [ ] FinOps owner assigned

📎 Links