📊 Observability & Audit

Level: Ops
Solves: Set up monitoring, logging, and audit trails for enterprise workloads with centralized visibility

🎯 Outcomes

After applying the material on this page, you will be able to:

Set up Centralized Logging with org-level sinks
Configure Cloud Monitoring with alerting and SLOs
Deploy Distributed Tracing with Cloud Trace
Analyze Audit Logs for security and compliance
Build Compliance Dashboards for the security team
Compare the stack with AWS CloudWatch and CloudTrail

✅ When to use

Tool	Use Case	Reason
Cloud Logging	Centralized log management	Integrated, scalable
Cloud Monitoring	Metrics và alerting	Native GCP integration
Cloud Trace	Request latency analysis	Distributed tracing
Log Analytics	SQL queries on logs	Complex analysis
Error Reporting	Exception tracking	Auto-grouping errors

❌ When NOT to use

Pattern	Problem	Alternative
Keeping logs in Cloud Logging forever	Cost	Export to GCS
Alerting on every metric	Alert fatigue	SLO-based alerting
100% sampling cho traces	Cost, noise	Sample 1-10%
Data Access logs cho high-volume APIs	Huge cost	Selective enablement

⚠️ A warning from Raizo

"One team enabled Data Access logs for the BigQuery API. 500 TB of queries per day came to $15,000/month for logging alone. Selective enablement and log filtering are critical."

Observability Stack

Google Cloud Operations Suite

┌─────────────────────────────────────────────────────────────────┐
│           CLOUD OPERATIONS SUITE (formerly Stackdriver)         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Cloud Monitoring                     │    │
│  │  • Metrics collection                                   │    │
│  │  • Dashboards                                           │    │
│  │  • Alerting policies                                    │    │
│  │  • Uptime checks                                        │    │
│  │  • SLO monitoring                                       │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Cloud Logging                        │    │
│  │  • Log ingestion                                        │    │
│  │  • Log routing                                          │    │
│  │  • Log-based metrics                                    │    │
│  │  • Log analytics                                        │    │
│  │  • Audit logs                                           │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Cloud Trace                          │    │
│  │  • Distributed tracing                                  │    │
│  │  • Latency analysis                                     │    │
│  │  • Request flow visualization                           │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Error Reporting                      │    │
│  │  • Error aggregation                                    │    │
│  │  • Stack trace analysis                                 │    │
│  │  • Notification integration                             │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Cloud Logging

Log Types

┌─────────────────────────────────────────────────────────────────┐
│                    GCP LOG TYPES                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  AUDIT LOGS (Auto-enabled, critical for compliance)             │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Admin Activity: WHO did WHAT to WHICH resource          │    │
│  │ • Always enabled, free, 400-day retention               │    │
│  │ • Cannot be disabled                                    │    │
│  │                                                         │    │
│  │ Data Access: WHO accessed WHAT data                     │    │
│  │ • Must be enabled per service                           │    │
│  │ • Charged for ingestion                                 │    │
│  │ • 30-day default retention                              │    │
│  │                                                         │    │
│  │ System Event: GCP system actions                        │    │
│  │ • Always enabled, free                                  │    │
│  │                                                         │    │
│  │ Policy Denied: security policy denials (e.g. VPC-SC)    │    │
│  │ • On by default, billed for storage; excludable         │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  PLATFORM LOGS                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • VPC Flow Logs                                         │    │
│  │ • Firewall Rules Logging                                │    │
│  │ • Load Balancer Logs                                    │    │
│  │ • Cloud NAT Logs                                        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  APPLICATION LOGS                                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • stdout/stderr from containers                         │    │
│  │ • Custom application logs                               │    │
│  │ • Cloud Functions logs                                  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Centralized Logging Architecture

┌─────────────────────────────────────────────────────────────────┐
│           CENTRALIZED LOGGING ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Project A          Project B          Project C                │
│  ┌─────────┐        ┌─────────┐        ┌─────────┐             │
│  │  Logs   │        │  Logs   │        │  Logs   │             │
│  └────┬────┘        └────┬────┘        └────┬────┘             │
│       │                  │                  │                   │
│       └──────────────────┼──────────────────┘                   │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              Organization Log Sink                      │    │
│  │  (Aggregated at org or folder level)                    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                          │                                      │
│         ┌────────────────┼────────────────┐                     │
│         │                │                │                     │
│         ▼                ▼                ▼                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │   BigQuery  │  │Cloud Storage│  │   Pub/Sub   │             │
│  │ (Analytics) │  │  (Archive)  │  │  (Stream)   │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│                                                                 │
│  SINK FILTER EXAMPLES:                                          │
│  • All audit logs: logName:"cloudaudit.googleapis.com"          │
│  • Errors only: severity >= ERROR                               │
│  • Specific service: resource.type="gce_instance"               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Log Sink Configuration

bash

# Create org-level sink for all audit logs to BigQuery
gcloud logging sinks create org-audit-sink \
  bigquery.googleapis.com/projects/logging-project/datasets/audit_logs \
  --organization=ORG_ID \
  --include-children \
  --log-filter='logName:"cloudaudit.googleapis.com"'

# Create sink for security-relevant logs to Cloud Storage
gcloud logging sinks create security-archive-sink \
  storage.googleapis.com/security-logs-bucket \
  --organization=ORG_ID \
  --include-children \
  --log-filter='
    logName:"cloudaudit.googleapis.com" OR
    logName:"vpc_flows" OR
    logName:"firewall"
  '
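
The multi-clause filter above can also be assembled programmatically, which helps keep sink definitions consistent across scripts. The following is a minimal sketch; `build_log_filter` is a hypothetical helper, not part of the gcloud CLI or any Google library, and it only concatenates strings.

```python
def build_log_filter(log_name_fragments, min_severity=None):
    """Build a Logging filter string from logName substrings.

    Hypothetical helper: fragments are OR-ed together using the
    substring operator (:), and an optional severity floor is AND-ed on.
    """
    if not log_name_fragments:
        raise ValueError("at least one logName fragment is required")
    clauses = [f'logName:"{fragment}"' for fragment in log_name_fragments]
    combined = " OR ".join(clauses)
    if min_severity is not None:
        # Parentheses make the grouping explicit regardless of precedence.
        combined = f"({combined}) AND severity>={min_severity}"
    return combined


if __name__ == "__main__":
    print(build_log_filter(["cloudaudit.googleapis.com", "vpc_flows", "firewall"]))
```

The resulting string can be passed to `--log-filter` in the commands above.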

Cloud Monitoring

Metrics Types

Type	Source	Examples
System Metrics	GCP services (auto)	CPU, memory, disk, network
Agent Metrics	Ops Agent	OS-level, custom apps
Custom Metrics	Your code	Business metrics, KPIs
Log-based Metrics	Log entries	Error counts, latency

Alerting Best Practices

yaml

# Alert Policy Structure
displayName: "High Error Rate - Production API"
documentation:
  content: |
    ## Impact
    Users may experience errors when calling the API.
    
    ## Runbook
    1. Check Cloud Run logs for error details
    2. Verify downstream dependencies
    3. Check recent deployments
    
    ## Escalation
    Page on-call if not resolved in 15 minutes.
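conditions:
  - displayName: "Error rate > 1%"
    conditionThreshold:
      # Numerator: rate of 5xx responses
      filter: |
        resource.type="cloud_run_revision"
        AND metric.type="run.googleapis.com/request_count"
        AND metric.labels.response_code_class="5xx"
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
          groupByFields:
            - resource.label.service_name
      # Denominator: rate of all responses, so the threshold is a ratio (0.01 = 1%)
      denominatorFilter: |
        resource.type="cloud_run_revision"
        AND metric.type="run.googleapis.com/request_count"
      denominatorAggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
          groupByFields:
            - resource.label.service_name
      comparison: COMPARISON_GT
      thresholdValue: 0.01
      duration: 300s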
notificationChannels:
  - projects/PROJECT/notificationChannels/CHANNEL_ID
alertStrategy:
  autoClose: 1800s  # Close open incidents after 30 min without data
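
To make the condition semantics concrete, the sketch below mimics how a threshold with `duration: 300s` and `alignmentPeriod: 60s` behaves: the error ratio has to stay above the threshold for five consecutive aligned periods before the condition is met. This is a simplified local model with hypothetical function names, not the Cloud Monitoring evaluation engine.

```python
def error_ratio(errors: int, total: int) -> float:
    """Fraction of failed requests in one alignment period."""
    return errors / total if total else 0.0


def condition_fires(windows, threshold=0.01, required_windows=5):
    """Return True if the ratio exceeds the threshold for enough periods in a row.

    windows: list of (error_count, total_count) tuples, one per 60 s
    alignment period, oldest first. required_windows=5 models a 300 s duration.
    """
    streak = 0
    for errors, total in windows:
        if error_ratio(errors, total) > threshold:
            streak += 1
            if streak >= required_windows:
                return True
        else:
            # A single healthy period resets the duration timer.
            streak = 0
    return False


if __name__ == "__main__":
    sustained = [(2, 100)] * 5                # 2% errors for 5 minutes
    interrupted = [(2, 100)] * 4 + [(0, 100)]  # recovers before 5 minutes
    print(condition_fires(sustained), condition_fires(interrupted))
```

A short spike that recovers within the duration window does not open an incident, which is one reason a non-zero `duration` reduces noisy pages.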

SLO Monitoring

┌─────────────────────────────────────────────────────────────────┐
│                    SLO MONITORING                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SERVICE LEVEL INDICATORS (SLIs)                                │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Availability: % of successful requests                 │    │
│  │ Latency: % of requests < threshold                     │    │
│  │ Throughput: Requests per second                        │    │
│  │ Error Rate: % of failed requests                       │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  SERVICE LEVEL OBJECTIVES (SLOs)                                │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Example: 99.9% availability over 30-day rolling window │    │
│  │                                                         │    │
│  │ Error Budget = 100% - SLO = 0.1%                        │    │
│  │ In 30 days: 0.1% × 30 × 24 × 60 = 43.2 minutes          │    │
│  │                                                         │    │
│  │ If error budget exhausted → freeze deployments          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  BURN RATE ALERTS                                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Fast burn: 14.4x rate → exhausts budget in ~2 days     │    │
│  │ Slow burn: 3x rate → exhausts budget in 10 days        │    │
│  │                                                         │    │
│  │ Alert when burn rate exceeds threshold                  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
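
The error budget and burn-rate figures in the diagram can be reproduced with a few lines of arithmetic. A minimal sketch, assuming a 30-day rolling window; the function names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the whole budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))  # budget for a 99.9% SLO
    print(round(days_to_exhaustion(14.4), 2))     # fast burn
    print(round(days_to_exhaustion(3.0), 2))      # slow burn
```

A burn rate of 1 means the budget lasts exactly the length of the window; anything above 1 exhausts it early.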

Cloud Trace

Distributed Tracing

┌─────────────────────────────────────────────────────────────────┐
│                 DISTRIBUTED TRACE EXAMPLE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Request: GET /api/orders/123                                   │
│  Total Latency: 450ms                                           │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Cloud Run: api-gateway                          [50ms]  │    │
│  │ ├── Cloud Run: order-service                   [150ms]  │    │
│  │ │   ├── Cloud SQL: SELECT order              [80ms]     │    │
│  │ │   └── Memorystore: GET cache               [5ms]      │    │
│  │ └── Cloud Run: user-service                   [200ms]   │    │
│  │     └── Firestore: GET user                  [180ms]    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  INSIGHTS:                                                      │
│  • Firestore query is the bottleneck (180ms)                    │
│  • Consider caching user data                                   │
│  • Order service has good cache hit rate                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
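
The bottleneck in the example trace can also be found mechanically. The sketch below models the spans from the diagram as a small tree and picks the slowest leaf span, on the assumption that leaf spans represent calls to backing services. The `Span` class is a hypothetical stand-in, not the Cloud Trace data model.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)


def leaf_spans(span):
    """Yield every span in the tree that has no children."""
    if not span.children:
        yield span
    for child in span.children:
        yield from leaf_spans(child)


def slowest_leaf(root):
    """Return the leaf span with the largest duration."""
    return max(leaf_spans(root), key=lambda s: s.duration_ms)


# Durations copied from the example trace above.
TRACE = Span("api-gateway", 50, [
    Span("order-service", 150, [
        Span("Cloud SQL: SELECT order", 80),
        Span("Memorystore: GET cache", 5),
    ]),
    Span("user-service", 200, [
        Span("Firestore: GET user", 180),
    ]),
])

if __name__ == "__main__":
    worst = slowest_leaf(TRACE)
    print(worst.name, worst.duration_ms)
```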

Trace Instrumentation

python

# Python with OpenTelemetry
# Requires: pip install opentelemetry-sdk opentelemetry-exporter-gcp-trace
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(CloudTraceSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Usage (fetch_order and validate stand in for your own application functions)
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    with tracer.start_as_current_span("fetch_order"):
        order = fetch_order(order_id)
    
    with tracer.start_as_current_span("validate_order"):
        validate(order)
    
    return order

Audit & Compliance

Audit Log Analysis

sql

-- BigQuery: Find all IAM changes in last 7 days
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail as actor,
  protopayload_auditlog.methodName as action,
  protopayload_auditlog.resourceName as resource,
  protopayload_auditlog.requestJson AS request_json
FROM `project.dataset.cloudaudit_googleapis_com_activity_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
  AND protopayload_auditlog.methodName LIKE '%SetIamPolicy%'
ORDER BY timestamp DESC;

-- Find failed calls (non-zero status code), such as denied or unauthenticated requests
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail,
  protopayload_auditlog.status.message as error,
  resource.labels.project_id
FROM `project.dataset.cloudaudit_googleapis_com_activity_*`
WHERE protopayload_auditlog.status.code != 0
  AND _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
ORDER BY timestamp DESC
LIMIT 100;
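
The same question can be asked of raw log entries, for example ones consumed from a Pub/Sub sink. A minimal sketch, assuming each entry is a dict shaped like the LogEntry JSON with a `protoPayload` audit payload; the sample entries and the `iam_policy_changes` helper are hypothetical.

```python
def iam_policy_changes(entries):
    """Yield (timestamp, actor, action, resource) for SetIamPolicy calls."""
    for entry in entries:
        payload = entry.get("protoPayload", {})
        method = payload.get("methodName", "")
        if "SetIamPolicy" not in method:
            continue
        actor = payload.get("authenticationInfo", {}).get("principalEmail", "unknown")
        yield (entry.get("timestamp"), actor, method, payload.get("resourceName"))


# Made-up sample entries for illustration only.
SAMPLE_ENTRIES = [
    {
        "timestamp": "2024-01-01T10:00:00Z",
        "protoPayload": {
            "methodName": "SetIamPolicy",
            "resourceName": "projects/example-project",
            "authenticationInfo": {"principalEmail": "admin@example.com"},
        },
    },
    {
        "timestamp": "2024-01-01T10:05:00Z",
        "protoPayload": {
            "methodName": "v1.compute.instances.insert",
            "resourceName": "projects/example-project/zones/us-central1-a/instances/vm-1",
            "authenticationInfo": {"principalEmail": "dev@example.com"},
        },
    },
]

if __name__ == "__main__":
    for change in iam_policy_changes(SAMPLE_ENTRIES):
        print(change)
```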

Compliance Dashboards

┌─────────────────────────────────────────────────────────────────┐
│              COMPLIANCE MONITORING DASHBOARD                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SECURITY METRICS                                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • IAM policy changes: 15 (last 24h)                     │    │
│  │ • Failed auth attempts: 3 (last 24h)                    │    │
│  │ • Service account key creations: 0 ✓                    │    │
│  │ • Public resources created: 0 ✓                         │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  RESOURCE COMPLIANCE                                            │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • VMs without OS Login: 2 ⚠️                            │    │
│  │ • Buckets without uniform access: 0 ✓                   │    │
│  │ • Unencrypted disks: 0 ✓                                │    │
│  │ • Public IPs on VMs: 5 ⚠️                               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  DATA ACCESS                                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • BigQuery queries on PII tables: 45 (last 24h)         │    │
│  │ • GCS downloads from sensitive buckets: 12              │    │
│  │ • Cross-project data access: 8                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Best Practices Checklist

[ ] Enable Data Access audit logs for sensitive services
[ ] Set up org-level log sinks for centralized logging
[ ] Export logs to BigQuery for long-term analysis
[ ] Configure alerting policies with clear runbooks
[ ] Implement SLO monitoring with error budgets
[ ] Enable Cloud Trace for distributed systems
[ ] Create compliance dashboards for security team
[ ] Set appropriate log retention periods

⚖️ Trade-offs

Trade-off 1: Log Retention vs Cost

Retention	Storage Cost	Use Case
30 days	Default, free tier	Active troubleshooting
1 year	Higher	Compliance, audit
7 years	Export to GCS	Regulatory (coldline)

Recommendation: 30 days in Cloud Logging + BigQuery export for analytics + GCS archive for compliance.

Trade-off 2: Trace Sampling Rate

Sampling	Cost	Visibility
100%	Very high	Full
10%	Moderate	Good
1%	Low	Basic
Head-based	Variable	Request context
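
Ratio-based head sampling is usually made deterministic by deriving the decision from the trace ID, so every service that sees the same trace reaches the same verdict. The sketch below illustrates the idea with the standard library only; it is a simplified model with hypothetical names. In the OpenTelemetry Python SDK, the comparable built-in is `TraceIdRatioBased` in `opentelemetry.sdk.trace.sampling`.

```python
import secrets

TRACE_ID_LIMIT = 1 << 64  # the decision uses the low 64 bits of the trace ID


def should_sample(trace_id: int, rate: float) -> bool:
    """Decide at the head of a request whether to keep its trace.

    The same trace_id always yields the same decision for a given rate.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    bound = round(rate * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


def new_trace_id() -> int:
    """Generate a random 128-bit trace ID."""
    return secrets.randbits(128)


if __name__ == "__main__":
    ids = [new_trace_id() for _ in range(10_000)]
    kept = sum(should_sample(i, 0.10) for i in ids)
    print(f"kept {kept} of {len(ids)} traces at a 10% rate")
```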

Trade-off 3: Alert Granularity

Approach	Alert Volume	Actionability
Metric-based	High	Variable
SLO-based	Low	High
Burn-rate	Balanced	Predictive

🚨 Failure Modes

Failure Mode 1: Alert Fatigue

🔥 Real-world incident

On-call received 200 alerts per day. The team started ignoring alerts. A real incident went unnoticed for 4 hours, causing a customer-facing outage.

How to detect	How to prevent
High alert volume	SLO-based alerting
Low response rate	Tune thresholds
Team burnout	Consolidate alerts
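
One way to approach "consolidate alerts" is to group notifications that share the same service and condition before they reach a human. A minimal sketch, assuming each alert is a dict with hypothetical `service` and `condition` keys; real alert payloads will differ.

```python
from collections import Counter


def consolidate(alerts):
    """Collapse repeated alerts into one summary per (service, condition)."""
    counts = Counter((a["service"], a["condition"]) for a in alerts)
    return [
        {"service": service, "condition": condition, "count": count}
        for (service, condition), count in counts.most_common()
    ]


if __name__ == "__main__":
    raw = [
        {"service": "api", "condition": "high-latency"},
        {"service": "api", "condition": "high-latency"},
        {"service": "api", "condition": "5xx-rate"},
        {"service": "api", "condition": "high-latency"},
    ]
    for summary in consolidate(raw):
        print(summary)
```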

Failure Mode 2: Missing Logs

How to detect	How to prevent
Gaps in log timeline	Monitor log ingestion
Investigation blocked	Verify sink configuration
Audit fails	Test log export regularly
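
Gaps in a log timeline can be detected by comparing consecutive entry timestamps against a maximum expected silence. A minimal sketch; the 5-minute default is an arbitrary assumption and should be tuned to how chatty the source normally is.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_silence=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive entries are too far apart."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_silence:
            gaps.append((earlier, later))
    return gaps


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0, 0)
    seen = [base, base + timedelta(minutes=1), base + timedelta(minutes=20)]
    for start, end in find_gaps(seen):
        print(f"no log entries between {start} and {end}")
```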

Failure Mode 3: Logging Cost Explosion

How to detect	How to prevent
Billing spike	Log exclusion filters
High ingestion volume	Selective Data Access logs
Unexpected charges	Set logging budgets
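
When setting a logging budget, a back-of-the-envelope estimate helps translate ingestion volume into money. The sketch below assumes an ingestion price of $0.50 per GiB and a 50 GiB monthly free allotment per project; both figures are assumptions to verify against current Cloud Logging pricing, and the function names are hypothetical.

```python
# ASSUMED prices; check the current Cloud Logging pricing page before relying on them.
PRICE_PER_GIB_USD = 0.50
FREE_GIB_PER_MONTH = 50.0


def monthly_ingestion_cost(ingested_gib: float) -> float:
    """Estimated monthly ingestion charge in USD for one project."""
    billable = max(0.0, ingested_gib - FREE_GIB_PER_MONTH)
    return billable * PRICE_PER_GIB_USD


def gib_for_budget(budget_usd: float) -> float:
    """GiB per month that a given budget covers, including the free allotment."""
    return budget_usd / PRICE_PER_GIB_USD + FREE_GIB_PER_MONTH


if __name__ == "__main__":
    print(monthly_ingestion_cost(1_000))
    print(gib_for_budget(15_000))
```

Under these assumptions, the $15,000/month figure quoted in the warning near the top of this page would correspond to roughly 30,000 GiB of ingested log data per month.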

🔐 Security Baseline

Logging Security Requirements

Requirement	Implementation	Verification
Audit logs enabled	Org-wide, cannot disable	Admin Activity always on
Log export	Org-level sinks	Sink configuration audit
Access control	IAM for log access	Access review
Immutable logs	Export to GCS with retention lock	Bucket configuration

Critical Log Sources

Log Type	Enable For	Retention
Admin Activity	All (auto)	400 days
Data Access	Sensitive services	30+ days
VPC Flow Logs	Security subnets	30 days
Firewall Logs	All rules	30 days

📊 Ops Readiness

Metrics to Monitor

Metric	Source	Alert Threshold
Log ingestion rate	Cloud Logging	Spike > 3x
Error log rate	Cloud Logging	> baseline
Alert response time	Cloud Monitoring	> 15 min
Trace sample rate	Cloud Trace	Drop > 50%
SLO burn rate	Cloud Monitoring	> budget
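
The "Spike > 3x" threshold for log ingestion needs a baseline to compare against. A minimal sketch, assuming the baseline is the median of recent ingestion samples; the choice of median and the function name are illustrative assumptions, not a Cloud Monitoring feature.

```python
from statistics import median


def is_ingestion_spike(history, current, factor=3.0):
    """Return True if current ingestion exceeds factor times the median of history.

    history: recent ingestion samples (for example bytes per hour).
    Returns False when there is no usable baseline.
    """
    if not history:
        return False
    baseline = median(history)
    return baseline > 0 and current > factor * baseline


if __name__ == "__main__":
    recent = [10, 12, 11, 9, 10]
    print(is_ingestion_spike(recent, 35))
    print(is_ingestion_spike(recent, 25))
```

The median is less sensitive than the mean to a single earlier spike in the history, which keeps the baseline from drifting upward after an incident.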

Runbook Entry Points

Scenario	Runbook
Alert storm	`runbook/alert-storm-response.md`
Missing logs	`runbook/log-investigation.md`
High logging cost	`runbook/logging-cost-optimization.md`
Trace gaps	`runbook/trace-troubleshooting.md`
Compliance audit	`runbook/audit-log-export.md`
SLO breach	`runbook/slo-breach-response.md`

✅ Design Review Checklist

Logging

[ ] Org-level sinks configured
[ ] Data Access logs selective
[ ] Export to BigQuery/GCS
[ ] Retention policies set

Monitoring

[ ] SLOs defined
[ ] Burn-rate alerting
[ ] Dashboards available
[ ] Notification channels tested

Tracing

[ ] Cloud Trace enabled
[ ] Sampling configured
[ ] Key paths instrumented
[ ] Error Reporting integrated

Compliance

[ ] Audit log export
[ ] Retention meets requirements
[ ] Access control verified
[ ] Dashboards for security team

📎 Links

📎 AWS Observability & Auditing - Comparison with AWS CloudWatch/CloudTrail
📎 Security & Data Perimeter - Security monitoring integration
📎 Resource Hierarchy - Org-level logging setup
📎 Terraform Testing - IaC for monitoring resources
📎 GCP Cost & Quotas - Logging cost management
