💰 Cost & Quotas
Level: Ops
Solves: Managing cost and quotas effectively for enterprise workloads, with visibility and control
🎯 Goals (Outcomes)
After applying the material on this page, you will be able to:
- Set up Billing Export to BigQuery for cost analysis
- Configure Budgets with alerts and programmatic actions
- Purchase Committed Use Discounts at the right time
- Deploy Spot VMs for fault-tolerant workloads
- Manage quotas and request increases proactively
- Implement a label strategy for cost allocation
✅ When to use
| Strategy | Use Case | Reason |
|---|---|---|
| CUDs | Stable, predictable workloads | Up to 57% discount |
| Spot VMs | Batch, CI/CD, fault-tolerant | Up to 91% discount |
| Budgets | Every project | Prevent surprises |
| Labels | Every resource | Cost allocation |
| Billing export | Organization | Deep analysis |
❌ When NOT to use
| Pattern | Problem | Alternative |
|---|---|---|
| CUDs cho variable workloads | Waste unused commitment | On-demand + Spot |
| Spot cho production databases | Preemption risk | On-demand + CUD |
| No budgets | Bill shock | Set budgets always |
| No labels | Cost allocation impossible | Mandatory labels |
| Manual quota management | Scale failures | Proactive requests |
⚠️ Warning from Raizo
"The team did not monitor quotas. On Black Friday, scale-up failed because the vCPU quota was exhausted. Two hours of downtime. Always request a quota increase BEFORE you need it."
Billing Structure
GCP Billing Hierarchy
┌─────────────────────────────────────────────────────────────────┐
│ GCP BILLING HIERARCHY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Cloud Billing Account │ │
│ │ • Payment method (credit card, invoice) │ │
│ │ • Billing contact │ │
│ │ • Currency settings │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────┼────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Project A │ │ Project B │ │ Project C │ │
│ │ $500/month │ │ $1200/month│ │ $300/month │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ENTERPRISE PATTERN: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Master Billing Account (Finance team) │ │
│ │ ├── Sub-account: Production │ │
│ │ │ └── Projects: prod-* │ │
│ │ ├── Sub-account: Development │ │
│ │ │ └── Projects: dev-*, stg-* │ │
│ │ └── Sub-account: Sandbox │ │
│ │ └── Projects: sbx-* │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Billing Export to BigQuery
```sql
-- Enable billing export in the Cloud Console first:
-- Billing > Billing export > BigQuery export

-- Query: monthly cost by project.
-- Credits are summed per row in a subquery. Joining UNNEST(credits) directly
-- would repeat a row's cost once per credit and inflate total_cost.
SELECT
  project.name AS project_name,
  SUM(cost) AS total_cost,
  SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) AS c), 0)) AS total_credits,
  SUM(cost)
    + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) AS c), 0)) AS net_cost
FROM `billing_project.billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE invoice.month = '202401'
GROUP BY project_name
ORDER BY net_cost DESC;

-- Query: cost by service and SKU.
-- Aliases differ from the service, sku, and cost columns so that
-- GROUP BY and ORDER BY are unambiguous.
SELECT
  service.description AS service_name,
  sku.description AS sku_name,
  SUM(cost) AS total_cost,
  SUM(usage.amount) AS usage_amount,
  usage.unit AS usage_unit
FROM `billing_project.billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE invoice.month = '202401'
GROUP BY service_name, sku_name, usage_unit
ORDER BY total_cost DESC
LIMIT 20;
```
Budget Management
Budget Configuration
┌─────────────────────────────────────────────────────────────────┐
│ BUDGET CONFIGURATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ BUDGET TYPES │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Specified Amount: Fixed dollar amount │ │
│ │ • Example: $10,000/month │ │
│ │ │ │
│ │ Last Month's Spend: Dynamic based on history │ │
│ │ • Example: Alert if 20% higher than last month │ │
│ │ │ │
│ │ Last Period's Spend: Target is the previous period │ │
│ │ • Example: Quarterly budget vs. last quarter's spend │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ALERT THRESHOLDS │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 50% → Email notification (early warning) │ │
│ │ 80% → Email + Slack notification │ │
│ │ 100% → Email + Slack + PagerDuty │ │
│ │ 120% → All above + auto-disable billing (optional) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ SCOPE OPTIONS │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • Billing account (all projects) │ │
│ │ • Specific projects │ │
│ │ • Specific services │ │
│ │ • Labels (e.g., team=data-engineering) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Programmatic Budget Alerts
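```python
# Cloud Function (Pub/Sub trigger) that handles budget alerts
import base64
import json
import logging


def send_email_alert(budget_name, percentage):
    """Placeholder: replace with your email integration."""
    logging.info("EMAIL: %s at %.0f%% of budget", budget_name, percentage)


def send_slack_alert(budget_name, percentage):
    """Placeholder: replace with your Slack integration."""
    logging.warning("SLACK: %s at %.0f%% of budget", budget_name, percentage)


def send_pagerduty_alert(budget_name, percentage):
    """Placeholder: replace with your PagerDuty integration."""
    logging.error("PAGERDUTY: %s at %.0f%% of budget", budget_name, percentage)


def budget_alert_handler(event, context):
    """Handle a budget notification delivered through Pub/Sub."""
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')
    budget_notification = json.loads(pubsub_message)

    budget_name = budget_notification['budgetDisplayName']
    cost_amount = budget_notification['costAmount']
    budget_amount = budget_notification['budgetAmount']
    # The field is absent until the first threshold is crossed.
    threshold = budget_notification.get('alertThresholdExceeded')
    if not threshold:
        return  # Periodic status update, nothing to escalate

    # Spend as a percentage of the budget
    percentage = (cost_amount / budget_amount) * 100

    if threshold >= 1.0:  # 100% exceeded
        # Critical: take action
        send_pagerduty_alert(budget_name, percentage)
        # Optional: disable billing
        # disable_billing_for_project(project_id)
    elif threshold >= 0.8:  # 80% exceeded
        send_slack_alert(budget_name, percentage)
    else:
        send_email_alert(budget_name, percentage)
```
Cost Optimization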
Committed Use Discounts (CUDs)
┌─────────────────────────────────────────────────────────────────┐
│ COMMITTED USE DISCOUNTS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ RESOURCE-BASED CUDs (Compute Engine) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Commitment: vCPUs + Memory for 1 or 3 years │ │
│ │ │ │
│ │ Discount: │ │
│ │ • 1-year: Up to 37% off │ │
│ │ • 3-year: Up to 57% off │ │
│ │ │ │
│ │ Flexibility: │ │
│ │ • Covers any machine shape in the committed series │ │
│ │ • Scoped to one region and one series (N2, C2, ...) │ │
│ │ • Shared across projects only if sharing is enabled │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ SPEND-BASED CUDs (BigQuery, Cloud SQL, etc.) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Commitment: Dollar amount per hour │ │
│ │ │ │
│ │ BigQuery Editions: │ │
│ │ • Standard: No commitment required │ │
│ │ • Enterprise: Commit to slots, get discount │ │
│ │ │ │
│ │ Cloud SQL: │ │
│ │ • 1-year: 25% off │ │
│ │ • 3-year: 52% off │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Spot VMs
┌─────────────────────────────────────────────────────────────────┐
│ SPOT VMs │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DISCOUNT: Up to 91% off on-demand price │
│ │
│ TRADE-OFF: Can be preempted with 30-second warning │
│ │
│ IDEAL FOR: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ✅ Batch processing (Dataproc, Dataflow) │ │
│ │ ✅ CI/CD pipelines │ │
│ │ ✅ Fault-tolerant workloads │ │
│ │ ✅ Dev/test environments │ │
│ │ ✅ Stateless containers (GKE node pools) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ NOT IDEAL FOR: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ❌ Production databases │ │
│ │ ❌ Stateful applications │ │
│ │ ❌ Long-running jobs without checkpointing │ │
│ │ ❌ User-facing services (without fallback) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ BEST PRACTICE: Mix Spot + On-demand in GKE │
│ • Spot node pool for batch workloads │
│ • On-demand node pool for critical services │
│ • Use taints/tolerations and node selectors to steer pods │
│ │
└─────────────────────────────────────────────────────────────────┘
Cost Optimization Recommendations
┌─────────────────────────────────────────────────────────────────┐
│ COST OPTIMIZATION CHECKLIST │
├─────────────────────────────────────────────────────────────────┤
│ │
│ COMPUTE │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ □ Right-size VMs (use Recommender) │ │
│ │ □ Use Spot VMs for fault-tolerant workloads │ │
│ │ □ Purchase CUDs for predictable workloads │ │
│ │ □ Use E2 machine types for cost-sensitive workloads │ │
│ │ □ Schedule non-prod VMs to stop after hours │ │
│ │ □ Delete unused disks and snapshots │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ STORAGE │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ □ Use appropriate storage class (Standard/Nearline/etc) │ │
│ │ □ Set lifecycle policies for old objects │ │
│ │ □ Enable Autoclass for unpredictable access patterns │ │
│ │ □ Delete orphaned persistent disks │ │
│ │ □ Use regional storage only when needed │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ NETWORKING │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ □ Use Premium tier only when needed │ │
│ │ □ Minimize cross-region traffic │ │
│ │ □ Use Cloud CDN for static content │ │
│ │ □ Delete unused external IPs │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ DATA │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ □ Use BigQuery slots for heavy workloads │ │
│ │ □ Partition and cluster BigQuery tables │ │
│ │ □ Set table expiration for temp data │ │
│ │ □ Use Dataproc Serverless instead of clusters │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Quota Management
Quota Types
┌─────────────────────────────────────────────────────────────────┐
│ QUOTA TYPES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ RATE QUOTAS (Requests per time period) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • API requests per minute │ │
│ │ • Queries per second │ │
│ │ • Operations per day │ │
│ │ │ │
│ │ Example: BigQuery 100 concurrent queries │ │
│ │ Example: Compute Engine API 20 requests/second │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ALLOCATION QUOTAS (Resource limits) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • Number of VMs per region │ │
│ │ • Total vCPUs per region │ │
│ │ • Number of VPCs per project │ │
│ │ │ │
│ │ Example: 24 vCPUs per region (default) │ │
│ │ Example: 15 VPCs per project │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ QUOTA INCREASE REQUEST │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 1. Go to IAM & Admin > Quotas │ │
│ │ 2. Filter by service and quota name │ │
│ │ 3. Select quota and click "Edit Quotas" │ │
│ │ 4. Enter new limit and justification │ │
│ │ 5. Wait for approval (usually 24-48 hours) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Quota Monitoring
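```bash
# List project-wide quotas (limits and current usage)
gcloud compute project-info describe --project=PROJECT_ID

# List regional quotas, one row per quota metric
gcloud compute regions describe REGION \
  --project=PROJECT_ID \
  --flatten="quotas[]" \
  --format="table(quotas.metric,quotas.limit,quotas.usage)"

# Alert when vCPU quota usage passes a threshold.
# Quota usage is reported as an absolute count, so set THRESHOLD_VCPUS to
# 80% of your regional limit (for example 19 for a limit of 24).
# Alpha commands change; confirm flags with --help before relying on this.
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="CPU Quota Alert" \
  --condition-display-name="vCPU quota usage > 80% of limit" \
  --condition-filter='resource.type="consumer_quota" AND metric.type="serviceruntime.googleapis.com/quota/allocation/usage" AND metric.label.quota_metric="compute.googleapis.com/cpus"' \
  --if="> THRESHOLD_VCPUS" \
  --duration=60s
```
Labels & Cost Allocation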
Label Strategy
```yaml
# Required labels for cost allocation
labels:
  # Business context
  cost-center: "cc-12345"       # Finance tracking
  business-unit: "engineering"  # Department
  product: "platform-api"       # Product/service
  # Technical context
  environment: "production"     # production/staging/dev
  team: "platform"              # Owning team
  managed-by: "terraform"       # How it's managed
  # Lifecycle
  created-by: "john-doe"        # Creator (label values cannot contain "@" or ".")
  expiry-date: "2024-12-31"     # For temporary resources
```
Cost Allocation Report
```sql
-- BigQuery: cost by label
SELECT
  labels.value AS team,
  SUM(cost) AS total_cost
FROM `billing_project.billing_dataset.gcp_billing_export_v1_XXXXXX`,
  UNNEST(labels) AS labels
WHERE labels.key = 'team'
  AND invoice.month = '202401'
GROUP BY team
ORDER BY total_cost DESC;

-- Cost by environment
SELECT
  COALESCE(
    (SELECT value FROM UNNEST(labels) WHERE key = 'environment'),
    'unlabeled'
  ) AS environment,
  SUM(cost) AS total_cost
FROM `billing_project.billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE invoice.month = '202401'
GROUP BY environment
ORDER BY total_cost DESC;
```
Best Practices Checklist
- [ ] Enable billing export to BigQuery
- [ ] Set up budgets with multiple thresholds
- [ ] Implement label strategy for cost allocation
- [ ] Review Recommender suggestions weekly
- [ ] Purchase CUDs for predictable workloads
- [ ] Use Spot VMs for fault-tolerant workloads
- [ ] Monitor quota usage and request increases proactively
- [ ] Schedule non-prod resources to stop after hours
⚖️ Trade-offs
Trade-off 1: CUDs vs On-Demand
| Aspect | CUDs (1-year) | CUDs (3-year) | On-Demand |
|---|---|---|---|
| Discount | Up to 37% | Up to 57% | 0% |
| Flexibility | Low | Very low | Full |
| Risk | Medium | High | None |
| Best for | Stable prod | Long-term | Variable |
Recommendation: cover 60-80% of the baseline with CUDs, and use on-demand + Spot for peaks.
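A minimal sketch of why the commitment is sized against the baseline rather than the peak. It assumes a simplified billing model in which committed vCPUs are charged every hour at the discounted rate, used or not, and anything above the commitment is charged on-demand. The usage profile, the `RATE` value, and the function name `monthly_cost` are hypothetical; only the 37% discount comes from the table above.

```python
def monthly_cost(hourly_vcpus, committed_vcpus, on_demand_rate, cud_discount):
    """Total cost of a usage profile for a fixed vCPU commitment.

    Committed vCPUs are billed every hour at the discounted rate, used or
    not. Usage above the commitment is billed at the on-demand rate.
    """
    committed_rate = on_demand_rate * (1 - cud_discount)
    total = 0.0
    for used in hourly_vcpus:
        overflow = max(0, used - committed_vcpus)
        total += committed_vcpus * committed_rate + overflow * on_demand_rate
    return total


# Hypothetical month: 100 vCPUs baseline, 200 vCPUs for a quarter of the hours.
profile = [100] * 540 + [200] * 180
RATE = 0.03      # assumed on-demand $/vCPU-hour, illustrative only
DISCOUNT = 0.37  # 1-year figure from the table above

for committed in (0, 80, 100, 200):
    print(committed, round(monthly_cost(profile, committed, RATE, DISCOUNT), 2))
```

With this profile, committing at or slightly below the baseline lowers the bill, while committing to the peak costs more than running everything on-demand, because the extra commitment sits idle for three quarters of the month.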
Trade-off 2: Spot vs On-Demand
| Aspect | Spot VMs | On-Demand |
|---|---|---|
| Discount | Up to 91% | 0% |
| Availability | Not guaranteed | Guaranteed |
| Preemption | Yes (30s warning) | No |
| Best for | Batch, CI/CD | Stateful, critical |
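Preemption is tolerable for batch work only if the job can resume where it stopped. The following is a sketch of one way to do that with a checkpoint file, not a prescribed implementation: `run_job`, the `Preempted` exception, and the `preempt_after` test hook are hypothetical names, and the checkpoint is written to a local temp directory purely for illustration. On a real Spot VM the checkpoint would need to live on storage that survives the instance, such as a Cloud Storage bucket.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Stands in for a Spot preemption in the simulation below."""


def run_job(items, checkpoint_path, process, preempt_after=None):
    """Process items in order, persisting progress after each one.

    On restart, everything recorded in the checkpoint file is skipped.
    preempt_after is a test hook that simulates a preemption.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]

    for index in range(done, len(items)):
        if preempt_after is not None and index - done >= preempt_after:
            raise Preempted()
        process(items[index])
        # Write to a temp file and rename, so an interruption mid-write
        # cannot leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)


results = []
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run_job(list(range(10)), path, results.append, preempt_after=4)
except Preempted:
    pass  # the VM would stop here; a replacement VM restarts the job
run_job(list(range(10)), path, results.append)
```

The second call picks up at the first unprocessed item, so no item is handled twice and none is skipped. A 30-second warning is rarely enough to finish a long task, which is why progress is saved continuously instead of in a shutdown hook.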
Trade-off 3: Label Enforcement
| Approach | Coverage | Implementation |
|---|---|---|
| Org policy | Enforced | Blocks non-compliant |
| Terraform validation | IaC only | Pre-deploy check |
| Post-hoc audit | Reactive | Reports unlabeled |
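As an illustration of the pre-deploy and post-hoc approaches, here is a small validator that could run in CI or in an audit script. It is a sketch under assumptions: `REQUIRED_LABELS` is a hypothetical subset of the label strategy, and the regular expressions encode a simplified, ASCII-only reading of the GCP label rules (lowercase letters, digits, underscores, and dashes, at most 63 characters, keys starting with a letter).

```python
import re

# Assumed subset of the required labels; adjust to your own policy.
REQUIRED_LABELS = {"cost-center", "environment", "team"}

# Simplified ASCII-only format rules for label keys and values.
KEY_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")
VALUE_RE = re.compile(r"[a-z0-9_-]{0,63}")


def label_violations(labels):
    """Return a list of human-readable problems; empty means compliant."""
    problems = [
        f"missing required label: {key}"
        for key in sorted(REQUIRED_LABELS - set(labels))
    ]
    for key, value in labels.items():
        if not KEY_RE.fullmatch(key):
            problems.append(f"invalid key: {key!r}")
        if not VALUE_RE.fullmatch(value):
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


print(label_violations({"team": "platform", "created-by": "john@example.com"}))
```

Returning a list of problems rather than a boolean lets the same function feed a blocking check (fail if the list is non-empty) and a report of unlabeled resources.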
🚨 Failure Modes
Failure Mode 1: Budget Overrun
🔥 Real incident
A developer created 100 n2-highmem-96 VMs and forgot about them. End of month: a $200K bill. No budget alerts were configured.
| How to detect | How to prevent |
|---|---|
| Monthly bill shock | Budgets with alerts |
| Cost spike in billing export | Programmatic actions |
| Finance team escalation | Multiple thresholds |
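The "multiple thresholds" idea can be kept as data instead of nested conditionals. This sketch mirrors the 50% / 80% / 100% / 120% ladder from the Budget Configuration section; the channel names and the function `channels_for` are hypothetical placeholders, and nothing here calls a real notification service.

```python
# (threshold ratio, channels) pairs, highest first.
ESCALATION = [
    (1.2, ["email", "slack", "pagerduty", "disable-billing"]),
    (1.0, ["email", "slack", "pagerduty"]),
    (0.8, ["email", "slack"]),
    (0.5, ["email"]),
]


def channels_for(cost_amount, budget_amount):
    """Return the channels to notify for the current spend-to-budget ratio."""
    if budget_amount <= 0:
        raise ValueError("budget_amount must be positive")
    ratio = cost_amount / budget_amount
    for threshold, channels in ESCALATION:
        if ratio >= threshold:
            return channels
    return []


print(channels_for(8500, 10000))
```

Adding or moving a threshold then means editing one row, and the ladder can be reviewed by finance without reading control flow.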
Failure Mode 2: Quota Exhaustion
| How to detect | How to prevent |
|---|---|
| Scale-up failures | Proactive quota requests |
| API rate limit errors | Quota monitoring |
| Deployment blocked | Multiple regions |
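Quota monitoring can start as a simple headroom check over the quota list. The sketch below assumes input in the shape of the `quotas` array that `gcloud compute regions describe REGION --format=json` returns (objects with `metric`, `limit`, and `usage`); the sample values and the function name `quotas_near_limit` are made up for illustration, and the 70% default follows the alert threshold used on this page.

```python
def quotas_near_limit(quotas, threshold=0.7):
    """Return (metric, utilization) pairs at or above threshold, highest first."""
    flagged = []
    for quota in quotas:
        limit = quota["limit"]
        if limit <= 0:
            continue  # a zero limit has no meaningful utilization
        utilization = quota["usage"] / limit
        if utilization >= threshold:
            flagged.append((quota["metric"], utilization))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data for one region.
sample = [
    {"metric": "CPUS", "limit": 24.0, "usage": 20.0},
    {"metric": "DISKS_TOTAL_GB", "limit": 4096.0, "usage": 500.0},
    {"metric": "IN_USE_ADDRESSES", "limit": 8.0, "usage": 6.0},
]

for metric, utilization in quotas_near_limit(sample):
    print(f"{metric}: {utilization:.0%} of quota used")
```

Running a check like this on a schedule, well ahead of known traffic peaks, leaves time for a quota increase request to be approved.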
Failure Mode 3: Unused CUDs
| How to detect | How to prevent |
|---|---|
| Low utilization in reports | Right-size before commit |
| Waste in billing analysis | Start with 1-year |
| Over-commitment | Monitor usage first |
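"Monitor usage first" has a simple break-even behind it. Under the simplifying assumption that a commitment is billed in full at the discounted rate while on-demand is billed only for what is used, a commitment with discount d pays off once utilization exceeds 1 - d. The function names below are hypothetical, and the model ignores details such as sustained use discounts.

```python
def break_even_utilization(discount):
    """Utilization above which a commitment beats paying on-demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def commitment_savings(utilization, discount):
    """Savings versus on-demand, as a fraction of the full on-demand price
    of the committed capacity. Negative means the commitment costs more."""
    return utilization - (1 - discount)


for label, discount in (("1-year", 0.37), ("3-year", 0.57)):
    print(label, f"break-even at {break_even_utilization(discount):.0%} utilization")
```

A deeper discount lowers the break-even point, but the 3-year term also extends how long a sizing mistake is paid for, which is why starting with a 1-year term is the safer default.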
🔐 Security Baseline
Billing Security Requirements
| Requirement | Implementation | Verification |
|---|---|---|
| Billing account access | Limited to finance | IAM audit |
| Budget alerts | All projects | Budget review |
| Billing export | Enabled | Export verified |
| Label enforcement | Org policy | Compliance audit |
Access Control
| Role | Scope | Users |
|---|---|---|
| Billing Account Admin | Billing account | Finance only |
| Billing Account User | Project | Team leads |
| Billing Account Viewer | Billing account | Stakeholders |
📊 Ops Readiness
Metrics to monitor
| Metric | Source | Alert Threshold |
|---|---|---|
| Daily spend | Billing export | > 120% average |
| Budget utilization | Cloud Monitoring | > 80% |
| Quota usage | Quota Monitoring | > 70% |
| CUD utilization | Billing reports | < 80% |
| Unlabeled resources | Asset Inventory | Any |
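The "daily spend > 120% average" rule can be sketched as a comparison against a trailing average. The 7-day window, the sample figures, and the function name `spend_alerts` are assumptions for illustration; in practice the daily totals would come from a query against the billing export.

```python
from statistics import mean


def spend_alerts(daily_spend, window=7, factor=1.2):
    """Return (day_index, spend, baseline) for each day whose spend exceeds
    factor times the average of the preceding `window` days."""
    alerts = []
    for day in range(window, len(daily_spend)):
        baseline = mean(daily_spend[day - window:day])
        if daily_spend[day] > factor * baseline:
            alerts.append((day, daily_spend[day], baseline))
    return alerts


# Hypothetical daily totals in dollars, with one spike.
history = [100, 105, 98, 102, 101, 99, 103, 104, 180, 110]

for day, spend, baseline in spend_alerts(history):
    print(f"day {day}: ${spend} vs trailing average ${baseline:.2f}")
```

A trailing window adapts to gradual growth, so only abrupt jumps are flagged. The trade-off is that a spike raises the baseline for the following days, which can mask a second, smaller spike.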
Runbook Entry Points
| Situation | Runbook |
|---|---|
| Budget alert triggered | runbook/budget-alert-response.md |
| Cost spike detected | runbook/cost-spike-investigation.md |
| Quota exhausted | runbook/quota-increase-request.md |
| CUD under-utilized | runbook/cud-optimization.md |
| Unlabeled resources | runbook/label-compliance.md |
✅ Design Review Checklist
Cost Visibility
- [ ] Billing export enabled
- [ ] Dashboards configured
- [ ] Labels strategy defined
- [ ] Cost allocation reports
Cost Control
- [ ] Budgets set
- [ ] Alerts configured
- [ ] Programmatic actions
- [ ] Approval workflows
Optimization
- [ ] CUD analysis done
- [ ] Spot VM opportunities
- [ ] Recommender reviewed
- [ ] Scheduling for non-prod
Quota
- [ ] Quota monitoring
- [ ] Proactive requests
- [ ] Multi-region strategy
- [ ] Alerting configured
📎 Links
- 📎 AWS Cost Governance - Comparison with AWS cost management
- 📎 Resource Hierarchy - Billing structure and hierarchy
- 📎 Compute Patterns - Cost optimization for compute
- 📎 Data Platforms - BigQuery cost management
- 📎 GCP Observability - Cost monitoring integration