📊 Observability & Audit

Level: Ops
Solves: Setting up monitoring, logging, and an audit trail for enterprise workloads with centralized visibility

🎯 Outcomes

After applying the material on this page, you will be able to:

  • Set up centralized logging with org-level sinks
  • Configure Cloud Monitoring with alerting and SLOs
  • Deploy distributed tracing with Cloud Trace
  • Analyze audit logs for security and compliance
  • Build compliance dashboards for the security team
  • Compare the stack with AWS CloudWatch and CloudTrail

When to use

| Tool | Use Case | Reason |
| --- | --- | --- |
| Cloud Logging | Centralized log management | Integrated, scalable |
| Cloud Monitoring | Metrics and alerting | Native GCP integration |
| Cloud Trace | Request latency analysis | Distributed tracing |
| Log Analytics | SQL queries on logs | Complex analysis |
| Error Reporting | Exception tracking | Auto-grouping of errors |

When NOT to use

| Pattern | Problem | Alternative |
| --- | --- | --- |
| Keeping logs in Cloud Logging forever | Cost | Export to GCS |
| Alerting on every metric | Alert fatigue | SLO-based alerting |
| 100% sampling for traces | Cost, noise | Sample 1-10% |
| Data Access logs for high-volume APIs | Huge cost | Selective enablement |

⚠️ Warning from Raizo

"One team enabled Data Access logs for the BigQuery API. 500 TB of queries per day came to $15,000/month for logging alone. Selective enablement and log filtering are critical."

Observability Stack

Google Cloud Operations Suite

┌─────────────────────────────────────────────────────────────────┐
│           CLOUD OPERATIONS SUITE (formerly Stackdriver)         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Cloud Monitoring                     │    │
│  │  • Metrics collection                                   │    │
│  │  • Dashboards                                           │    │
│  │  • Alerting policies                                    │    │
│  │  • Uptime checks                                        │    │
│  │  • SLO monitoring                                       │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Cloud Logging                        │    │
│  │  • Log ingestion                                        │    │
│  │  • Log routing                                          │    │
│  │  • Log-based metrics                                    │    │
│  │  • Log analytics                                        │    │
│  │  • Audit logs                                           │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Cloud Trace                          │    │
│  │  • Distributed tracing                                  │    │
│  │  • Latency analysis                                     │    │
│  │  • Request flow visualization                           │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Error Reporting                      │    │
│  │  • Error aggregation                                    │    │
│  │  • Stack trace analysis                                 │    │
│  │  • Notification integration                             │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Cloud Logging

Log Types

┌─────────────────────────────────────────────────────────────────┐
│                    GCP LOG TYPES                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  AUDIT LOGS (Auto-enabled, critical for compliance)             │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Admin Activity: WHO did WHAT to WHICH resource          │    │
│  │ • Always enabled, free, 400-day retention               │    │
│  │ • Cannot be disabled                                    │    │
│  │                                                         │    │
│  │ Data Access: WHO accessed WHAT data                     │    │
│  │ • Must be enabled per service (BigQuery: always on)     │    │
│  │ • Charged for ingestion                                 │    │
│  │ • 30-day default retention                              │    │
│  │                                                         │    │
│  │ System Event: GCP system actions                        │    │
│  │ • Always enabled, free                                  │    │
│  │                                                         │    │
│  │ Policy Denied: security policy denials (e.g. VPC-SC)    │    │
│  │ • Generated by default, billable; excludable via filter │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  PLATFORM LOGS                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • VPC Flow Logs                                         │    │
│  │ • Firewall Rules Logging                                │    │
│  │ • Load Balancer Logs                                    │    │
│  │ • Cloud NAT Logs                                        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  APPLICATION LOGS                                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • stdout/stderr from containers                         │    │
│  │ • Custom application logs                               │    │
│  │ • Cloud Functions logs                                  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Centralized Logging Architecture

┌─────────────────────────────────────────────────────────────────┐
│           CENTRALIZED LOGGING ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Project A          Project B          Project C                │
│  ┌─────────┐        ┌─────────┐        ┌─────────┐              │
│  │  Logs   │        │  Logs   │        │  Logs   │              │
│  └────┬────┘        └────┬────┘        └────┬────┘              │
│       │                  │                  │                   │
│       └──────────────────┼──────────────────┘                   │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              Organization Log Sink                      │    │
│  │  (Aggregated at org or folder level)                    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                          │                                      │
│         ┌────────────────┼────────────────┐                     │
│         │                │                │                     │
│         ▼                ▼                ▼                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   BigQuery  │  │Cloud Storage│  │   Pub/Sub   │              │
│  │ (Analytics) │  │  (Archive)  │  │  (Stream)   │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│                                                                 │
│  SINK FILTER EXAMPLES:                                          │
│  • All audit logs: logName:"cloudaudit.googleapis.com"          │
│  • Errors only: severity >= ERROR                               │
│  • Specific service: resource.type="gce_instance"               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Log Sink Configuration

bash
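# Create an org-level sink that routes all audit logs to BigQuery.
# After creation, grant the sink's writer identity (printed by the command)
# write access on the destination, e.g. BigQuery Data Editor on the dataset.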
gcloud logging sinks create org-audit-sink \
  bigquery.googleapis.com/projects/logging-project/datasets/audit_logs \
  --organization=ORG_ID \
  --include-children \
  --log-filter='logName:"cloudaudit.googleapis.com"'

# Create sink for security-relevant logs to Cloud Storage
gcloud logging sinks create security-archive-sink \
  storage.googleapis.com/security-logs-bucket \
  --organization=ORG_ID \
  --include-children \
  --log-filter='
    logName:"cloudaudit.googleapis.com" OR
    logName:"vpc_flows" OR
    logName:"firewall"
  '
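
Sink filters like the ones above are plain strings, so it can help to assemble them from small pieces instead of hand-editing quotes and `OR` operators. The following is a minimal sketch; the helper names `has` and `any_of` are hypothetical and not part of any Google Cloud library.

```python
def has(field: str, value: str) -> str:
    """Build a substring ("has") comparison such as logName:"firewall"."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def any_of(*clauses: str) -> str:
    """Join clauses with OR, parenthesized so the result can be AND-ed safely."""
    if not clauses:
        raise ValueError("at least one clause is required")
    return "(" + " OR ".join(clauses) + ")"


# Same three clauses as the security-archive-sink filter above.
security_filter = any_of(
    has("logName", "cloudaudit.googleapis.com"),
    has("logName", "vpc_flows"),
    has("logName", "firewall"),
)
print(security_filter)
```

The resulting string can be passed to `--log-filter` unchanged.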

Cloud Monitoring

Metrics Types

| Type | Source | Examples |
| --- | --- | --- |
| System Metrics | GCP services (auto) | CPU, memory, disk, network |
| Agent Metrics | Ops Agent | OS-level, custom apps |
| Custom Metrics | Your code | Business metrics, KPIs |
| Log-based Metrics | Log entries | Error counts, latency |

Alerting Best Practices

yaml
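# Alert policy structure
displayName: "High Error Rate - Production API"
combiner: OR  # Required by the API, even with a single condition
documentation:
  content: |
    ## Impact
    Users may experience errors when calling the API.

    ## Runbook
    1. Check Cloud Run logs for error details
    2. Verify downstream dependencies
    3. Check recent deployments

    ## Escalation
    Page on-call if not resolved in 15 minutes.
conditions:
  - displayName: "Error rate > 1%"
    conditionThreshold:
      # Numerator: rate of 5xx responses per service
      filter: |
        resource.type="cloud_run_revision"
        AND metric.type="run.googleapis.com/request_count"
        AND metric.labels.response_code_class="5xx"
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
          groupByFields:
            - resource.label.service_name
      # Denominator: rate of all responses. Without it, the threshold would
      # compare an absolute 5xx request rate (0.01 req/s), not a 1% ratio.
      denominatorFilter: |
        resource.type="cloud_run_revision"
        AND metric.type="run.googleapis.com/request_count"
      denominatorAggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
          groupByFields:
            - resource.label.service_name
      comparison: COMPARISON_GT
      thresholdValue: 0.01
      duration: 300s
notificationChannels:
  - projects/PROJECT/notificationChannels/CHANNEL_ID
alertStrategy:
  autoClose: 1800s  # Close open incidents after 30 min without new data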
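
To make the "Error rate > 1%" condition with `duration: 300s` concrete, here is a minimal sketch of the evaluation logic in plain Python. It is an illustration only, not how Cloud Monitoring is implemented; the `Window` type and the `should_fire` helper are hypothetical, and it assumes one data point per 60-second alignment period.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one 60-second alignment period (hypothetical shape)."""
    total: int
    errors_5xx: int


def error_ratio(window: Window) -> float:
    """Share of requests that returned 5xx; 0.0 when there was no traffic."""
    return window.errors_5xx / window.total if window.total else 0.0


def should_fire(windows: list[Window], threshold: float = 0.01,
                required_windows: int = 5) -> bool:
    """True when the most recent `required_windows` periods all exceed the threshold."""
    if len(windows) < required_windows:
        return False
    return all(error_ratio(w) > threshold for w in windows[-required_windows:])


sustained = [Window(total=1000, errors_5xx=20)] * 5        # 2% errors for 5 minutes
blip = sustained[:4] + [Window(total=1000, errors_5xx=5)]  # recovers in the last minute
print(should_fire(sustained), should_fire(blip))
```

Requiring every window in the 300-second duration to breach the threshold is what keeps a one-minute blip from paging anyone.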

SLO Monitoring

┌─────────────────────────────────────────────────────────────────┐
│                    SLO MONITORING                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SERVICE LEVEL INDICATORS (SLIs)                                │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Availability: % of successful requests                  │    │
│  │ Latency: % of requests < threshold                      │    │
│  │ Throughput: Requests per second                         │    │
│  │ Error Rate: % of failed requests                        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  SERVICE LEVEL OBJECTIVES (SLOs)                                │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Example: 99.9% availability over 30-day rolling window  │    │
│  │                                                         │    │
│  │ Error Budget = 100% - SLO = 0.1%                        │    │
│  │ In 30 days: 0.1% × 30 × 24 × 60 = 43.2 minutes          │    │
│  │                                                         │    │
│  │ If error budget exhausted → freeze deployments          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  BURN RATE ALERTS                                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Fast burn: 14.4x rate → exhausts budget in ~2 days      │    │
│  │ Slow burn: 3x rate → exhausts budget in 10 days         │    │
│  │                                                         │    │
│  │ Alert when burn rate exceeds threshold                  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
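
The error budget and burn rate arithmetic in the diagram can be reproduced in a few lines. This is a minimal sketch with hypothetical function names, assuming a 30-day rolling window and a burn rate defined as the observed error ratio divided by the budgeted error ratio.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return observed_error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the whole budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


print(round(error_budget_minutes(0.999), 1))   # budget for a 99.9% SLO
print(round(burn_rate(0.0144, 0.999), 1))      # 1.44% errors against a 0.1% budget
print(round(days_to_exhaustion(14.4), 2))      # fast burn
print(days_to_exhaustion(3))                   # slow burn
```

A burn rate of 1 spends the budget exactly at the end of the window, which is why alerts are usually set at multiples well above 1.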

Cloud Trace

Distributed Tracing

┌─────────────────────────────────────────────────────────────────┐
│                 DISTRIBUTED TRACE EXAMPLE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Request: GET /api/orders/123                                   │
│  Total Latency: 450ms                                           │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Cloud Run: api-gateway                          [50ms]  │    │
│  │ ├── Cloud Run: order-service                   [150ms]  │    │
│  │ │   ├── Cloud SQL: SELECT order              [80ms]     │    │
│  │ │   └── Memorystore: GET cache               [5ms]      │    │
│  │ └── Cloud Run: user-service                   [200ms]   │    │
│  │     └── Firestore: GET user                  [180ms]    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  INSIGHTS:                                                      │
│  • Firestore query is the bottleneck (180ms)                    │
│  • Consider caching user data                                   │
│  • Order service has good cache hit rate                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
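
The first insight above, that the Firestore call dominates, can be derived mechanically from span data. The sketch below models the example trace as a small tree and picks the slowest leaf span. The `Span` type is hypothetical and stands in for whatever structure a trace export provides; the durations are copied from the diagram.

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def leaves(span: Span) -> Iterator[Span]:
    """Yield spans with no children, i.e. the calls that do the actual work."""
    if not span.children:
        yield span
    for child in span.children:
        yield from leaves(child)


def slowest_leaf(root: Span) -> Span:
    return max(leaves(root), key=lambda s: s.duration_ms)


trace = Span("Cloud Run: api-gateway", 50, [
    Span("Cloud Run: order-service", 150, [
        Span("Cloud SQL: SELECT order", 80),
        Span("Memorystore: GET cache", 5),
    ]),
    Span("Cloud Run: user-service", 200, [
        Span("Firestore: GET user", 180),
    ]),
])

bottleneck = slowest_leaf(trace)
print(bottleneck.name, bottleneck.duration_ms)
```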

Trace Instrumentation

python
# Python with OpenTelemetry
# Requires: pip install opentelemetry-sdk opentelemetry-exporter-gcp-trace
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(CloudTraceSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

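# Placeholder helpers so the example runs; replace with real data access and validation
def fetch_order(order_id: str) -> dict:
    return {"id": order_id}

def validate(order: dict) -> None:
    if "id" not in order:
        raise ValueError("order is missing an id")

# Usage: the decorator creates the parent span, the with-blocks create child spans
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    with tracer.start_as_current_span("fetch_order"):
        order = fetch_order(order_id)

    with tracer.start_as_current_span("validate_order"):
        validate(order)

    return order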

Audit & Compliance

Audit Log Analysis

sql
-- BigQuery: find all IAM policy changes in the last 7 days
-- (in the BigQuery export schema, the request payload is exposed as requestJson)
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS actor,
  protopayload_auditlog.methodName AS action,
  protopayload_auditlog.resourceName AS resource,
  protopayload_auditlog.requestJson AS request
FROM `project.dataset.cloudaudit_googleapis_com_activity_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
  AND protopayload_auditlog.methodName LIKE '%SetIamPolicy%'
ORDER BY timestamp DESC;

-- Find failed admin operations in the last day (non-OK status),
-- e.g. code 7 = PERMISSION_DENIED, code 16 = UNAUTHENTICATED
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS actor,
  protopayload_auditlog.status.code AS status_code,
  protopayload_auditlog.status.message AS error,
  resource.labels.project_id
FROM `project.dataset.cloudaudit_googleapis_com_activity_*`
WHERE protopayload_auditlog.status.code != 0
  AND _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
ORDER BY timestamp DESC
LIMIT 100;
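
When audit logs land in Cloud Storage or Pub/Sub instead of BigQuery, the same IAM-change check can be run over the JSON log entries. The following is a minimal sketch using only the standard library. It assumes entries follow the LogEntry JSON layout (`timestamp`, `protoPayload.methodName`, and so on), and the sample entries are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def iam_changes(entries: list[dict], now: datetime, days: int = 7) -> list[dict]:
    """Return SetIamPolicy calls from the last `days` days, newest first."""
    cutoff = now - timedelta(days=days)
    rows = []
    for entry in entries:
        payload = entry.get("protoPayload", {})
        ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
        if ts >= cutoff and "SetIamPolicy" in payload.get("methodName", ""):
            rows.append({
                "timestamp": ts,
                "actor": payload.get("authenticationInfo", {}).get("principalEmail"),
                "action": payload.get("methodName"),
                "resource": payload.get("resourceName"),
            })
    return sorted(rows, key=lambda r: r["timestamp"], reverse=True)


# Invented sample entries, shaped like exported audit log JSON.
sample = [
    {"timestamp": "2024-05-01T10:00:00Z",
     "protoPayload": {"methodName": "SetIamPolicy",
                      "resourceName": "projects/demo",
                      "authenticationInfo": {"principalEmail": "alice@example.com"}}},
    {"timestamp": "2024-05-02T09:30:00Z",
     "protoPayload": {"methodName": "v1.compute.instances.insert",
                      "resourceName": "projects/demo/zones/us-central1-a/instances/vm-1",
                      "authenticationInfo": {"principalEmail": "bob@example.com"}}},
    {"timestamp": "2024-03-01T08:00:00Z",
     "protoPayload": {"methodName": "SetIamPolicy",
                      "resourceName": "projects/old",
                      "authenticationInfo": {"principalEmail": "carol@example.com"}}},
]

recent = iam_changes(sample, now=datetime(2024, 5, 3, tzinfo=timezone.utc))
for row in recent:
    print(row["timestamp"].isoformat(), row["actor"], row["resource"])
```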

Compliance Dashboards

┌─────────────────────────────────────────────────────────────────┐
│              COMPLIANCE MONITORING DASHBOARD                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SECURITY METRICS                                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • IAM policy changes: 15 (last 24h)                     │    │
│  │ • Failed auth attempts: 3 (last 24h)                    │    │
│  │ • Service account key creations: 0 ✓                    │    │
│  │ • Public resources created: 0 ✓                         │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  RESOURCE COMPLIANCE                                            │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • VMs without OS Login: 2 ⚠️                            │    │
│  │ • Buckets without uniform access: 0 ✓                   │    │
│  │ • Unencrypted disks: 0 ✓                                │    │
│  │ • Public IPs on VMs: 5 ⚠️                               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  DATA ACCESS                                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ • BigQuery queries on PII tables: 45 (last 24h)         │    │
│  │ • GCS downloads from sensitive buckets: 12              │    │
│  │ • Cross-project data access: 8                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Best Practices Checklist

  • [ ] Enable Data Access audit logs for sensitive services
  • [ ] Set up org-level log sinks for centralized logging
  • [ ] Export logs to BigQuery for long-term analysis
  • [ ] Configure alerting policies with clear runbooks
  • [ ] Implement SLO monitoring with error budgets
  • [ ] Enable Cloud Trace for distributed systems
  • [ ] Create compliance dashboards for security team
  • [ ] Set appropriate log retention periods

⚖️ Trade-offs

Trade-off 1: Log Retention vs Cost

| Retention | Storage Cost | Use Case |
| --- | --- | --- |
| 30 days | Default, free tier | Active troubleshooting |
| 1 year | Higher | Compliance, audit |
| 7 years | Export to GCS | Regulatory (Coldline) |

Recommendation: keep 30 days in Cloud Logging, export to BigQuery for analytics, and archive to GCS for compliance.


Trade-off 2: Trace Sampling Rate

| Sampling | Cost | Visibility |
| --- | --- | --- |
| 100% | Very high | Full |
| 10% | Moderate | Good |
| 1% | Low | Basic |
| Head-based | Variable | Request context |
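
Head-based ratio sampling decides at the start of a request whether to record the whole trace, usually by deriving the decision from the trace ID so every service reaches the same verdict. The sketch below shows the idea in plain Python. It is written in the spirit of ratio-based samplers found in tracing SDKs, not copied from any of them, and `should_sample` is a hypothetical name.

```python
import secrets

MASK_64 = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2^64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    return (trace_id & MASK_64) < bound


# The decision depends only on the trace ID, so it is repeatable across services.
trace_id = secrets.randbits(128)
assert should_sample(trace_id, 0.1) == should_sample(trace_id, 0.1)

# Rough check of the kept fraction over random IDs (varies from run to run).
kept = sum(should_sample(secrets.randbits(128), 0.1) for _ in range(10_000))
print(f"kept {kept} of 10000 traces at a 10% ratio")
```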

Trade-off 3: Alert Granularity

| Approach | Alert Volume | Actionability |
| --- | --- | --- |
| Metric-based | High | Variable |
| SLO-based | Low | High |
| Burn-rate | Balanced | Predictive |

🚨 Failure Modes

Failure Mode 1: Alert Fatigue

🔥 Real-world incident

On-call received 200 alerts per day. The team started ignoring alerts. A real incident went unnoticed for 4 hours and became a customer-facing outage.

| How to detect | How to prevent |
| --- | --- |
| High alert volume | SLO-based alerting |
| Low response rate | Tune thresholds |
| Team burnout | Consolidate alerts |

Failure Mode 2: Missing Logs

| How to detect | How to prevent |
| --- | --- |
| Gaps in log timeline | Monitor log ingestion |
| Investigation blocked | Verify sink configuration |
| Audit fails | Test log export regularly |
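
"Gaps in log timeline" can be checked automatically by scanning exported entry timestamps for unusually long silences. The following is a minimal sketch; `find_gaps` is a hypothetical helper, and the right silence threshold depends on how chatty the source normally is.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps: list[datetime],
              max_silence: timedelta) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs where consecutive entries are too far apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_silence]


# Invented timestamps: steady traffic, then 45 minutes with no entries.
seen = [
    datetime(2024, 5, 1, 10, 0),
    datetime(2024, 5, 1, 10, 5),
    datetime(2024, 5, 1, 10, 50),
    datetime(2024, 5, 1, 10, 55),
]
gaps = find_gaps(seen, max_silence=timedelta(minutes=15))
for start, end in gaps:
    print(f"no logs between {start:%H:%M} and {end:%H:%M}")
```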

Failure Mode 3: Logging Cost Explosion

| How to detect | How to prevent |
| --- | --- |
| Billing spike | Log exclusion filters |
| High ingestion volume | Selective Data Access logs |
| Unexpected charges | Set logging budgets |
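
A rough cost estimate before enabling a noisy log source helps avoid the billing spike described above. The sketch below assumes an ingestion price of $0.50 per GiB and a 50 GiB monthly free allotment per project. Both figures are assumptions for illustration and should be checked against current Cloud Logging pricing.

```python
def monthly_ingestion_cost(gib_per_day: float,
                           price_per_gib: float = 0.50,  # assumed list price
                           free_gib: float = 50.0,       # assumed free allotment
                           days: int = 30) -> float:
    """Estimated monthly ingestion charge in USD for one project."""
    billable = max(0.0, gib_per_day * days - free_gib)
    return billable * price_per_gib


for volume in (1, 100, 1000):
    print(f"{volume:>5} GiB/day -> ${monthly_ingestion_cost(volume):,.2f}/month")
```

Under these assumptions, about 1,000 GiB of log entries per day lands in the same range as the $15,000/month figure quoted in the warning at the top of this page.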

🔐 Security Baseline

Logging Security Requirements

| Requirement | Implementation | Verification |
| --- | --- | --- |
| Audit logs enabled | Org-wide, cannot disable | Admin Activity always on |
| Log export | Org-level sinks | Sink configuration audit |
| Access control | IAM for log access | Access review |
| Immutable logs | Export to GCS with retention lock | Bucket configuration |

Critical Log Sources

| Log Type | Enable For | Retention |
| --- | --- | --- |
| Admin Activity | All (auto) | 400 days |
| Data Access | Sensitive services | 30+ days |
| VPC Flow Logs | Security subnets | 30 days |
| Firewall Logs | All rules | 30 days |

📊 Ops Readiness

Metrics to Monitor

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Log ingestion rate | Cloud Logging | Spike > 3x |
| Error log rate | Cloud Logging | > baseline |
| Alert response time | Cloud Monitoring | > 15 min |
| Trace sample rate | Cloud Trace | Drop > 50% |
| SLO burn rate | Cloud Monitoring | > budget |
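
The "Spike > 3x" threshold for log ingestion needs a baseline to compare against. Here is a minimal sketch that uses the median of recent daily volumes as that baseline. The function name and the choice of median are assumptions; a production setup would more likely express this as an alerting policy on the ingestion metric.

```python
from statistics import median


def is_ingestion_spike(history_gib: list[float], current_gib: float,
                       factor: float = 3.0) -> bool:
    """True when today's volume exceeds `factor` times the median of recent days."""
    if not history_gib:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history_gib)
    return baseline > 0 and current_gib > factor * baseline


last_week = [40.0, 42.0, 38.0, 45.0, 41.0, 39.0, 43.0]  # invented daily GiB
print(is_ingestion_spike(last_week, 50.0))   # normal variation
print(is_ingestion_spike(last_week, 200.0))  # well above 3x the median
```

The median is less sensitive than the mean to a single earlier spike, so one bad day does not inflate the baseline.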

Runbook Entry Points

| Situation | Runbook |
| --- | --- |
| Alert storm | runbook/alert-storm-response.md |
| Missing logs | runbook/log-investigation.md |
| High logging cost | runbook/logging-cost-optimization.md |
| Trace gaps | runbook/trace-troubleshooting.md |
| Compliance audit | runbook/audit-log-export.md |
| SLO breach | runbook/slo-breach-response.md |

Design Review Checklist

Logging

  • [ ] Org-level sinks configured
  • [ ] Data Access logs selective
  • [ ] Export to BigQuery/GCS
  • [ ] Retention policies set

Monitoring

  • [ ] SLOs defined
  • [ ] Burn-rate alerting
  • [ ] Dashboards available
  • [ ] Notification channels tested

Tracing

  • [ ] Cloud Trace enabled
  • [ ] Sampling configured
  • [ ] Key paths instrumented
  • [ ] Error Reporting integrated

Compliance

  • [ ] Audit log export
  • [ ] Retention meets requirements
  • [ ] Access control verified
  • [ ] Dashboards for security team

📎 Links