📊 Observability & Audit
Level: Ops
Solves: Setting up monitoring, logging, and audit trails for enterprise workloads with centralized visibility
🎯 Objectives (Outcomes)
After applying the knowledge on this page, you will be able to:
- Set up Centralized Logging with org-level sinks
- Configure Cloud Monitoring with alerting and SLOs
- Deploy Distributed Tracing with Cloud Trace
- Analyze Audit Logs for security and compliance
- Build Compliance Dashboards for the security team
- Compare with AWS CloudWatch and CloudTrail
✅ When to use
| Tool | Use Case | Reason |
|---|---|---|
| Cloud Logging | Centralized log management | Integrated, scalable |
| Cloud Monitoring | Metrics and alerting | Native GCP integration |
| Cloud Trace | Request latency analysis | Distributed tracing |
| Log Analytics | SQL queries on logs | Complex analysis |
| Error Reporting | Exception tracking | Auto-grouping errors |
❌ When NOT to use
| Pattern | Problem | Alternative |
|---|---|---|
| Keeping logs in Cloud Logging forever | Cost | Export to GCS |
| Alerting on every metric | Alert fatigue | SLO-based alerting |
| 100% sampling for traces | Cost, noise | Sample 1-10% |
| Data Access logs for high-volume APIs | Huge cost | Selective enablement |
⚠️ Warning from Raizo
"One team enabled Data Access logs for the BigQuery API. 500TB of queries per day = $15,000/month for logging alone. Selective enablement and log filtering are critical."
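To get a feel for how an ingestion bill of that size arises, here is a rough estimator. It is a minimal sketch, not a pricing reference: the function name is hypothetical, and the default price of $0.50 per GiB and the 50 GiB monthly free allotment per project are assumptions that should be checked against current Cloud Logging pricing.

```python
def estimate_monthly_logging_cost(
    gib_per_day: float,
    price_per_gib: float = 0.50,        # assumed ingestion price in USD per GiB
    free_gib_per_month: float = 50.0,   # assumed free allotment per project
    days: int = 30,
) -> float:
    """Return the estimated monthly ingestion charge in USD."""
    total_gib = gib_per_day * days
    billable_gib = max(0.0, total_gib - free_gib_per_month)
    return billable_gib * price_per_gib


# Under these assumptions, about 1,000 GiB of Data Access logs per day
# is already enough to reach a bill in the range quoted in the warning.
print(estimate_monthly_logging_cost(1000))
```

Running the same function with the volume that remains after exclusion filters gives a quick before/after comparison for a cost review.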
Observability Stack
Google Cloud Operations Suite
┌─────────────────────────────────────────────────────────────────┐
│ CLOUD OPERATIONS SUITE (formerly Stackdriver) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Cloud Monitoring │ │
│ │ • Metrics collection │ │
│ │ • Dashboards │ │
│ │ • Alerting policies │ │
│ │ • Uptime checks │ │
│ │ • SLO monitoring │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Cloud Logging │ │
│ │ • Log ingestion │ │
│ │ • Log routing │ │
│ │ • Log-based metrics │ │
│ │ • Log analytics │ │
│ │ • Audit logs │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Cloud Trace │ │
│ │ • Distributed tracing │ │
│ │ • Latency analysis │ │
│ │ • Request flow visualization │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Error Reporting │ │
│ │ • Error aggregation │ │
│ │ • Stack trace analysis │ │
│ │ • Notification integration │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Cloud Logging
Log Types
┌─────────────────────────────────────────────────────────────────┐
│ GCP LOG TYPES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ AUDIT LOGS (Auto-enabled, critical for compliance) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Admin Activity: WHO did WHAT to WHICH resource │ │
│ │ • Always enabled, free, 400-day retention │ │
│ │ • Cannot be disabled │ │
│ │ │ │
│ │ Data Access: WHO accessed WHAT data │ │
│ │ • Must be enabled per service │ │
│ │ • Charged for ingestion │ │
│ │ • 30-day default retention │ │
│ │ │ │
│ │ System Event: GCP system actions │ │
│ │ • Always enabled, free │ │
│ │ │ │
│ │ Policy Denied: access denied by a security policy │ │
│ │ • On by default, cannot be disabled; billable │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ PLATFORM LOGS │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • VPC Flow Logs │ │
│ │ • Firewall Rules Logging │ │
│ │ • Load Balancer Logs │ │
│ │ • Cloud NAT Logs │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ APPLICATION LOGS │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • stdout/stderr from containers │ │
│ │ • Custom application logs │ │
│ │ • Cloud Functions logs │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Centralized Logging Architecture
┌─────────────────────────────────────────────────────────────────┐
│                CENTRALIZED LOGGING ARCHITECTURE                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│    Project A              Project B              Project C      │
│   ┌─────────┐            ┌─────────┐            ┌─────────┐     │
│   │  Logs   │            │  Logs   │            │  Logs   │     │
│   └────┬────┘            └────┬────┘            └────┬────┘     │
│        │                      │                      │          │
│        └──────────────────────┼──────────────────────┘          │
│                               │                                 │
│                               ▼                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                  Organization Log Sink                  │    │
│  │           (Aggregated at org or folder level)           │    │
│  └─────────────────────────────────────────────────────────┘    │
│                               │                                 │
│          ┌────────────────────┼────────────────────┐            │
│          │                    │                    │            │
│          ▼                    ▼                    ▼            │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐     │
│   │  BigQuery   │      │Cloud Storage│      │   Pub/Sub   │     │
│   │ (Analytics) │      │  (Archive)  │      │  (Stream)   │     │
│   └─────────────┘      └─────────────┘      └─────────────┘     │
│                                                                 │
│  SINK FILTER EXAMPLES:                                          │
│  • All audit logs:   logName:"cloudaudit.googleapis.com"        │
│  • Errors only:      severity >= ERROR                          │
│  • Specific service: resource.type="gce_instance"               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Log Sink Configuration
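The sink filter examples in the diagram above can be read as predicates over log entries. The sketch below imitates only those three patterns in plain Python to show how one entry can be routed to several sinks at once. It is not the Logging query language itself, and the `LogEntry` class, the sink names, and the `route` helper are hypothetical.

```python
from dataclasses import dataclass

# Cloud Logging severities in ascending order (LogSeverity enum).
SEVERITIES = ["DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
              "ERROR", "CRITICAL", "ALERT", "EMERGENCY"]


@dataclass
class LogEntry:
    log_name: str
    severity: str
    resource_type: str


def is_audit_log(entry: LogEntry) -> bool:
    # Mirrors logName:"cloudaudit.googleapis.com" (":" is a substring match).
    return "cloudaudit.googleapis.com" in entry.log_name


def is_error_or_worse(entry: LogEntry) -> bool:
    # Mirrors severity >= ERROR.
    return SEVERITIES.index(entry.severity) >= SEVERITIES.index("ERROR")


def is_gce_instance(entry: LogEntry) -> bool:
    # Mirrors resource.type="gce_instance".
    return entry.resource_type == "gce_instance"


# Hypothetical sink names, each mapped to the predicate standing in for its filter.
SINKS = {
    "audit-to-bigquery": is_audit_log,
    "errors-to-pubsub": is_error_or_worse,
    "gce-to-storage": is_gce_instance,
}


def route(entry: LogEntry) -> list[str]:
    """Return every sink whose filter matches; sinks are evaluated independently."""
    return [name for name, matches in SINKS.items() if matches(entry)]


entry = LogEntry(
    log_name="projects/demo/logs/cloudaudit.googleapis.com%2Factivity",
    severity="NOTICE",
    resource_type="gce_instance",
)
print(route(entry))
```

Because each sink applies its own filter, overlapping filters duplicate entries across destinations, which is worth keeping in mind when estimating export volume.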
bash
# Create an org-level sink that routes all audit logs to BigQuery
gcloud logging sinks create org-audit-sink \
  bigquery.googleapis.com/projects/logging-project/datasets/audit_logs \
  --organization=ORG_ID \
  --include-children \
  --log-filter='logName:"cloudaudit.googleapis.com"'
# Note: each sink's writer identity needs write access on its destination

# Create a sink that archives security-relevant logs to Cloud Storage
gcloud logging sinks create security-archive-sink \
  storage.googleapis.com/security-logs-bucket \
  --organization=ORG_ID \
  --include-children \
  --log-filter='
    logName:"cloudaudit.googleapis.com" OR
    logName:"vpc_flows" OR
    logName:"firewall"
  '
Cloud Monitoring
Metrics Types
| Type | Source | Examples |
|---|---|---|
| System Metrics | GCP services (auto) | CPU, memory, disk, network |
| Agent Metrics | Ops Agent | OS-level, custom apps |
| Custom Metrics | Your code | Business metrics, KPIs |
| Log-based Metrics | Log entries | Error counts, latency |
Alerting Best Practices
yaml
# Alert Policy Structure
displayName: "High Error Rate - Production API"
documentation:
  content: |
    ## Impact
    Users may experience errors when calling the API.
    ## Runbook
    1. Check Cloud Run logs for error details
    2. Verify downstream dependencies
    3. Check recent deployments
    ## Escalation
    Page on-call if not resolved in 15 minutes.
combiner: OR
conditions:
  - displayName: "Error rate > 1%"
    conditionThreshold:
      # Numerator: rate of 5xx responses
      filter: |
        resource.type="cloud_run_revision"
        AND metric.type="run.googleapis.com/request_count"
        AND metric.labels.response_code_class="5xx"
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
          groupByFields:
            - resource.label.service_name
      # Denominator: rate of all responses, so the threshold is a ratio (0.01 = 1%)
      denominatorFilter: |
        resource.type="cloud_run_revision"
        AND metric.type="run.googleapis.com/request_count"
      denominatorAggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
          groupByFields:
            - resource.label.service_name
      comparison: COMPARISON_GT
      thresholdValue: 0.01
      duration: 300s
notificationChannels:
  - projects/PROJECT/notificationChannels/CHANNEL_ID
alertStrategy:
  autoClose: 1800s  # Close open incidents after 30 min without new data
SLO Monitoring
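The error-budget arithmetic behind SLO monitoring can be checked with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes a 30-day rolling window and treats the budget as minutes of full unavailability.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability allowed by the SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until the budget is gone if errors keep arriving at burn_rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days / burn_rate


print(round(error_budget_minutes(0.999), 1))  # 43.2
print(round(days_to_exhaustion(14.4), 2))     # 2.08
print(round(days_to_exhaustion(3.0), 2))      # 10.0
```

A burn rate of 1 consumes the budget exactly over the window, so any sustained rate above 1 eventually breaches the SLO.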
┌─────────────────────────────────────────────────────────────────┐
│ SLO MONITORING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SERVICE LEVEL INDICATORS (SLIs) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Availability: % of successful requests │ │
│ │ Latency: % of requests < threshold │ │
│ │ Throughput: Requests per second │ │
│ │ Error Rate: % of failed requests │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ SERVICE LEVEL OBJECTIVES (SLOs) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Example: 99.9% availability over 30-day rolling window │ │
│ │ │ │
│ │ Error Budget = 100% - SLO = 0.1% │ │
│ │ In 30 days: 0.1% × 30 × 24 × 60 = 43.2 minutes │ │
│ │ │ │
│ │ If error budget exhausted → freeze deployments │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ BURN RATE ALERTS │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Fast burn: 14.4x rate → exhausts budget in 2 days │ │
│ │ Slow burn: 3x rate → exhausts budget in 10 days │ │
│ │ │ │
│ │ Alert when burn rate exceeds threshold │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Cloud Trace
Distributed Tracing
┌─────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED TRACE EXAMPLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Request: GET /api/orders/123 │
│ Total Latency: 450ms │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Cloud Run: api-gateway [50ms] │ │
│ │ ├── Cloud Run: order-service [150ms] │ │
│ │ │ ├── Cloud SQL: SELECT order [80ms] │ │
│ │ │ └── Memorystore: GET cache [5ms] │ │
│ │ └── Cloud Run: user-service [200ms] │ │
│ │ └── Firestore: GET user [180ms] │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ INSIGHTS: │
│ • Firestore query is the bottleneck (180ms) │
│ • Consider caching user data │
│ • Order service has good cache hit rate │
│ │
└─────────────────────────────────────────────────────────────────┘
Trace Instrumentation
python
# Python with OpenTelemetry
# Requires: opentelemetry-sdk, opentelemetry-exporter-gcp-trace
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup: batch spans and export them to Cloud Trace
provider = TracerProvider()
processor = BatchSpanProcessor(CloudTraceSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Usage: fetch_order() and validate() are application functions defined elsewhere
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    with tracer.start_as_current_span("fetch_order"):
        order = fetch_order(order_id)
    with tracer.start_as_current_span("validate_order"):
        validate(order)
    return order
Audit & Compliance
Audit Log Analysis
sql
-- BigQuery: Find all IAM changes in the last 7 days
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS actor,
  protopayload_auditlog.methodName AS action,
  protopayload_auditlog.resourceName AS resource,
  protopayload_auditlog.requestJson AS request  -- exported as a JSON string
FROM `project.dataset.cloudaudit_googleapis_com_activity_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
  AND protopayload_auditlog.methodName LIKE '%SetIamPolicy%'
ORDER BY timestamp DESC;

-- Find failed authentication/authorization attempts among Admin Activity calls
-- (status codes: 7 = PERMISSION_DENIED, 16 = UNAUTHENTICATED)
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS actor,
  protopayload_auditlog.status.message AS error,
  resource.labels.project_id
FROM `project.dataset.cloudaudit_googleapis_com_activity_*`
WHERE protopayload_auditlog.status.code IN (7, 16)
  AND _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
ORDER BY timestamp DESC
LIMIT 100;
Compliance Dashboards
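Security metrics on a compliance dashboard are, at bottom, counts over audit log entries. The sketch below shows one way to derive a few such counts in plain Python from entries shaped like the audit log JSON (`protoPayload.methodName`, `protoPayload.status.code`). The `summarize` helper, the counter names, and the sample entries are hypothetical, and a real dashboard would more likely use log-based metrics or scheduled queries.

```python
from collections import Counter

# google.rpc.Code values: 7 = PERMISSION_DENIED, 16 = UNAUTHENTICATED
DENIED_CODES = {7, 16}


def summarize(entries):
    """Count successful IAM changes, key creations, and denied calls."""
    counts = Counter()
    for entry in entries:
        payload = entry.get("protoPayload", {})
        method = payload.get("methodName", "")
        code = payload.get("status", {}).get("code", 0)  # missing code means OK
        if code in DENIED_CODES:
            counts["denied_calls"] += 1
        elif code == 0 and "SetIamPolicy" in method:
            counts["iam_policy_changes"] += 1
        elif code == 0 and method.endswith("CreateServiceAccountKey"):
            counts["sa_key_creations"] += 1
    return counts


# Hypothetical sample entries for illustration only.
entries = [
    {"protoPayload": {"methodName": "SetIamPolicy", "status": {}}},
    {"protoPayload": {"methodName": "google.iam.admin.v1.CreateServiceAccountKey",
                      "status": {}}},
    {"protoPayload": {"methodName": "SetIamPolicy",
                      "status": {"code": 7, "message": "Permission denied"}}},
]
print(dict(summarize(entries)))
```

Counting a denied `SetIamPolicy` call as a denial rather than a change keeps the "policy changes" figure limited to changes that actually took effect.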
┌─────────────────────────────────────────────────────────────────┐
│ COMPLIANCE MONITORING DASHBOARD │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SECURITY METRICS │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • IAM policy changes: 15 (last 24h) │ │
│ │ • Failed auth attempts: 3 (last 24h) │ │
│ │ • Service account key creations: 0 ✓ │ │
│ │ • Public resources created: 0 ✓ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ RESOURCE COMPLIANCE │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • VMs without OS Login: 2 ⚠️ │ │
│ │ • Buckets without uniform access: 0 ✓ │ │
│ │ • Unencrypted disks: 0 ✓ │ │
│ │ • Public IPs on VMs: 5 ⚠️ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ DATA ACCESS │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • BigQuery queries on PII tables: 45 (last 24h) │ │
│ │ • GCS downloads from sensitive buckets: 12 │ │
│ │ • Cross-project data access: 8 │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Best Practices Checklist
- [ ] Enable Data Access audit logs for sensitive services
- [ ] Set up org-level log sinks for centralized logging
- [ ] Export logs to BigQuery for long-term analysis
- [ ] Configure alerting policies with clear runbooks
- [ ] Implement SLO monitoring with error budgets
- [ ] Enable Cloud Trace for distributed systems
- [ ] Create compliance dashboards for security team
- [ ] Set appropriate log retention periods
⚖️ Trade-offs
Trade-off 1: Log Retention vs Cost
| Retention | Storage Cost | Use Case |
|---|---|---|
| 30 days | Default, free tier | Active troubleshooting |
| 1 year | Higher | Compliance, audit |
| 7 years | Export to GCS | Regulatory (coldline) |
Recommendation: 30 days in Cloud Logging + BigQuery export for analytics + GCS archive for compliance.
Trade-off 2: Trace Sampling Rate
| Sampling | Cost | Visibility |
|---|---|---|
| 100% | Very high | Full |
| 10% | Moderate | Good |
| 1% | Low | Basic |
| Head-based | Variable | Request context |
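A head-based sampling decision at a fixed rate can be made deterministically from the trace ID, so that every service seeing the same trace reaches the same verdict. The sketch below illustrates the idea with a hypothetical `should_sample` function; OpenTelemetry's trace-ID ratio sampler follows a similar approach, but this is not its implementation.

```python
import secrets


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = int(ratio * (1 << 64))
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_bits < bound


# Simulate 10,000 random 128-bit trace IDs at a 10% sampling rate.
trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the ID, raising the ratio later keeps every trace that the lower ratio would have kept, which makes gradual rollouts of a higher sampling rate predictable.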
Trade-off 3: Alert Granularity
| Approach | Alert Volume | Actionability |
|---|---|---|
| Metric-based | High | Variable |
| SLO-based | Low | High |
| Burn-rate | Balanced | Predictive |
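Burn-rate alerting compares the observed error ratio with the ratio the SLO allows. Here is a minimal sketch of that comparison. The thresholds 14.4 and 3 are the fast-burn and slow-burn figures from the SLO Monitoring section, while the function names and the "page"/"ticket"/"ok" labels are assumptions made for illustration.

```python
def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo)


def classify(rate: float, fast: float = 14.4, slow: float = 3.0) -> str:
    """Map a burn rate to a hypothetical response level."""
    if rate >= fast:
        return "page"
    if rate >= slow:
        return "ticket"
    return "ok"


# 20 failed requests out of 1,000 against a 99.9% SLO burns budget about 20x too fast.
rate = burn_rate(bad=20, total=1000, slo=0.999)
print(round(rate, 1), classify(rate))
```

Evaluating the same function over a short and a long window, and alerting only when both exceed the threshold, is a common way to cut down on flapping.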
🚨 Failure Modes
Failure Mode 1: Alert Fatigue
🔥 Real-world incident
On-call received 200 alerts/day. The team started ignoring alerts. A real incident went unnoticed for 4 hours, resulting in a customer-facing outage.
| How to detect | How to prevent |
|---|---|
| High alert volume | SLO-based alerting |
| Low response rate | Tune thresholds |
| Team burnout | Consolidate alerts |
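Consolidating alerts usually means folding repeated firings of the same condition into one incident. The sketch below shows one possible grouping rule, keyed on service and policy with a five-minute window. The `Incident` class, the `consolidate` function, and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    service: str
    policy: str
    first_seen: int
    last_seen: int
    count: int = 1


def consolidate(alerts, window_seconds=300):
    """Fold (timestamp, service, policy) alerts into incidents.

    An alert joins the open incident for its (service, policy) pair when it
    arrives within window_seconds of that incident's latest alert.
    """
    open_incidents = {}
    incidents = []
    for ts, service, policy in sorted(alerts):
        key = (service, policy)
        current = open_incidents.get(key)
        if current is not None and ts - current.last_seen <= window_seconds:
            current.last_seen = ts
            current.count += 1
        else:
            current = Incident(service, policy, ts, ts)
            open_incidents[key] = current
            incidents.append(current)
    return incidents


# Five raw alerts (timestamps in seconds) collapse into three incidents.
alerts = [
    (0, "api", "high-error-rate"),
    (60, "api", "high-error-rate"),
    (120, "api", "high-error-rate"),
    (130, "db", "high-latency"),
    (1000, "api", "high-error-rate"),
]
for incident in consolidate(alerts):
    print(incident)
```

The window length trades noise against delay: a longer window merges more firings but can hide a second, unrelated failure of the same policy.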
Failure Mode 2: Missing Logs
| How to detect | How to prevent |
|---|---|
| Gaps in log timeline | Monitor log ingestion |
| Investigation blocked | Verify sink configuration |
| Audit fails | Test log export regularly |
Failure Mode 3: Logging Cost Explosion
| How to detect | How to prevent |
|---|---|
| Billing spike | Log exclusion filters |
| High ingestion volume | Selective Data Access logs |
| Unexpected charges | Set logging budgets |
🔐 Security Baseline
Logging Security Requirements
| Requirement | Implementation | Verification |
|---|---|---|
| Audit logs enabled | Org-wide, cannot disable | Admin Activity always on |
| Log export | Org-level sinks | Sink configuration audit |
| Access control | IAM for log access | Access review |
| Immutable logs | Export to GCS with retention lock | Bucket configuration |
Critical Log Sources
| Log Type | Enable For | Retention |
|---|---|---|
| Admin Activity | All (auto) | 400 days |
| Data Access | Sensitive services | 30+ days |
| VPC Flow Logs | Security subnets | 30 days |
| Firewall Logs | All rules | 30 days |
📊 Ops Readiness
Metrics to Monitor
| Metric | Source | Alert Threshold |
|---|---|---|
| Log ingestion rate | Cloud Logging | Spike > 3x |
| Error log rate | Cloud Logging | > baseline |
| Alert response time | Cloud Monitoring | > 15 min |
| Trace sample rate | Cloud Trace | Drop > 50% |
| SLO burn rate | Cloud Monitoring | > budget |
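The "Spike > 3x" threshold for log ingestion needs a baseline to compare against. A minimal sketch, assuming the baseline is the median of recent ingestion samples (the function name and the choice of median are assumptions, not something the table prescribes):

```python
from statistics import median


def is_ingestion_spike(history, current, factor=3.0):
    """Return True when current ingestion exceeds factor times the median baseline."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history)
    return baseline > 0 and current > factor * baseline


# Hourly ingestion in GiB for recent hours (hypothetical numbers).
recent = [100, 110, 90, 105]
print(is_ingestion_spike(recent, 400))  # True: 400 > 3 * 102.5
print(is_ingestion_spike(recent, 250))  # False
```

The median is used here because a single earlier spike in the history would inflate a mean-based baseline and mask the next one.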
Runbook Entry Points
| Scenario | Runbook |
|---|---|
| Alert storm | runbook/alert-storm-response.md |
| Missing logs | runbook/log-investigation.md |
| High logging cost | runbook/logging-cost-optimization.md |
| Trace gaps | runbook/trace-troubleshooting.md |
| Compliance audit | runbook/audit-log-export.md |
| SLO breach | runbook/slo-breach-response.md |
✅ Design Review Checklist
Logging
- [ ] Org-level sinks configured
- [ ] Data Access logs selective
- [ ] Export to BigQuery/GCS
- [ ] Retention policies set
Monitoring
- [ ] SLOs defined
- [ ] Burn-rate alerting
- [ ] Dashboards available
- [ ] Notification channels tested
Tracing
- [ ] Cloud Trace enabled
- [ ] Sampling configured
- [ ] Key paths instrumented
- [ ] Error Reporting integrated
Compliance
- [ ] Audit log export
- [ ] Retention meets requirements
- [ ] Access control verified
- [ ] Dashboards for security team
📎 Links
- 📎 AWS Observability & Auditing - Comparison with AWS CloudWatch/CloudTrail
- 📎 Security & Data Perimeter - Security monitoring integration
- 📎 Resource Hierarchy - Org-level logging setup
- 📎 Terraform Testing - IaC for monitoring resources
- 📎 GCP Cost & Quotas - Logging cost management