📊 Observability & Auditing

Level: Ops
Solves: Implementing comprehensive observability and audit trails for enterprise AWS environments

🎯 Outcomes

After applying the material on this page, you will be able to:

  • Design an observability strategy around the Three Pillars (Metrics, Logs, Traces)
  • Configure CloudWatch with alarms, dashboards, and Logs Insights
  • Deploy CloudTrail for a comprehensive cross-account audit trail
  • Apply AWS Config for compliance monitoring and auto-remediation
  • Implement X-Ray for distributed tracing
  • Build centralized log aggregation for multi-account analysis

When to use

| Service | Use Case | Why |
|---|---|---|
| CloudWatch Metrics | Resource and application monitoring | Native integration, alarms |
| CloudWatch Logs | Application logs, audit trails | Retention, Insights queries |
| CloudTrail | API auditing, security investigation | Required for compliance |
| AWS Config | Configuration compliance | Continuous monitoring, auto-fix |
| X-Ray | Distributed tracing | Microservices troubleshooting |
| CloudWatch Synthetics | Endpoint monitoring | Proactive issue detection |

When NOT to use

| Pattern | Problem | Alternative |
|---|---|---|
| CloudWatch for long-term log storage | High cost, slow queries | S3 + Athena |
| X-Ray on every request | Cost, performance overhead | Sampling strategy |
| CloudWatch alarms on every metric | Alert fatigue | Focus on SLIs |
| Custom-written Config rules for common checks | Maintenance overhead | Managed rules |

⚠️ A warning from Raizo

"One team created 500+ CloudWatch alarms. The result? Alert fatigue: everyone ignored the alerts. When a real incident hit, nobody responded because they were worn out by false positives. Remember: a few high-quality alarms beat many noisy ones."

Observability Pillars

Three Pillars Framework

┌─────────────────────────────────────────────────────────────────┐
│              OBSERVABILITY THREE PILLARS                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │   METRICS   │    │    LOGS     │    │   TRACES    │          │
│  │             │    │             │    │             │          │
│  │ CloudWatch  │    │ CloudWatch  │    │   X-Ray     │          │
│  │  Metrics    │    │    Logs     │    │             │          │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘          │
│         │                  │                  │                 │
│         ▼                  ▼                  ▼                 │
│  • CPU, Memory       • Application      • Request flow          │
│  • Request count     • Error messages   • Latency breakdown     │
│  • Latency p99       • Audit events     • Service dependencies  │
│  • Custom metrics    • Access logs      • Error propagation     │
│                                                                 │
│  CORRELATION: Use X-Ray trace ID in logs for end-to-end view   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
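
The correlation note in the diagram can be made concrete with a small logging setup. The sketch below is one possible approach, not the only one: it assumes a Lambda-style environment where the trace header is exposed in the `_X_AMZN_TRACE_ID` environment variable, and the names `current_trace_id`, `TraceIdFilter`, `JsonFormatter`, and `build_logger` are hypothetical helpers introduced here for illustration.

```python
import json
import logging
import os


def current_trace_id(default="unknown"):
    """Extract the Root trace ID from the X-Ray trace header, if present."""
    # Assumed header shape: "Root=1-...;Parent=...;Sampled=1"
    header = os.environ.get("_X_AMZN_TRACE_ID", "")
    for part in header.split(";"):
        key, _, value = part.partition("=")
        if key.strip() == "Root" and value:
            return value.strip()
    return default


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through."""

    def filter(self, record):
        record.trace_id = current_trace_id()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(name="orders"):
    """Hypothetical helper: a logger whose output carries trace_id."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger
```

With structured lines like these, a log query can filter on `trace_id` and the same value can be looked up in the X-Ray console, which is what gives the end-to-end view.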

CloudWatch Deep Dive

Metrics Architecture

Key Metrics to Monitor

| Service | Critical Metrics | Alarm Threshold |
|---|---|---|
| EC2 | CPUUtilization, StatusCheckFailed | > 80%, > 0 |
| RDS | CPUUtilization, FreeStorageSpace, DatabaseConnections | > 80%, < 20%, > 80% |
| Lambda | Errors, Duration, ConcurrentExecutions | > 1%, > 80% timeout, > 80% limit |
| ALB | TargetResponseTime, HTTPCode_ELB_5XX | > 1s p99, > 0 |
| ECS | CPUUtilization, MemoryUtilization | > 80%, > 80% |

CloudWatch Alarms Best Practices

json
{
  "AlarmName": "HighCPUUtilization",
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 3,
  "DatapointsToAlarm": 2,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold",
  "TreatMissingData": "breaching",
  "AlarmActions": [
    "arn:aws:sns:us-east-1:123456789012:ops-alerts"
  ],
  "OKActions": [
    "arn:aws:sns:us-east-1:123456789012:ops-alerts"
  ]
}

💡 Alarm Configuration

  • Use DatapointsToAlarm < EvaluationPeriods to avoid flapping
  • Set TreatMissingData appropriately (breaching for critical, notBreaching for optional)
  • Always configure OKActions so you know when an issue is resolved
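
To see why "M out of N" evaluation dampens flapping, the sketch below models the decision in plain Python. It is a deliberately simplified model, not CloudWatch's actual evaluator: it ignores the INSUFFICIENT_DATA state and treats every missing-data mode other than `breaching` as non-breaching. The function name `alarm_state` is hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm,
                treat_missing_data="missing"):
    """Return "ALARM" or "OK" for the most recent evaluation window.

    datapoints: values ordered oldest to newest; None marks a missing period.
    """
    if datapoints_to_alarm > evaluation_periods:
        raise ValueError("DatapointsToAlarm cannot exceed EvaluationPeriods")

    window = datapoints[-evaluation_periods:]
    breaching = 0
    for value in window:
        if value is None:
            # Simplification: only "breaching" counts a gap against the alarm.
            if treat_missing_data == "breaching":
                breaching += 1
        elif value > threshold:  # mirrors GreaterThanThreshold
            breaching += 1
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```

With the settings from the JSON example (threshold 80, 3 evaluation periods, 2 datapoints to alarm), a single spike such as `[50, 85, 60]` stays OK, while `[50, 85, 90]` goes to ALARM.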

CloudWatch Logs Insights

sql
# Find errors in Lambda functions
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Analyze API Gateway latency
fields @timestamp, @message
| filter @message like /requestId/
| parse @message '"latency":*,' as latency
| stats avg(latency), max(latency), pct(latency, 99) by bin(5m)

# Find slow database queries
fields @timestamp, @message
| filter @message like /duration/
| parse @message 'duration: * ms' as duration
| filter duration > 1000
| sort duration desc
| limit 50
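
Queries like these can also be run from a script, for example inside a runbook. The sketch below assumes a client object that behaves like `boto3.client("logs")`, using its `start_query` and `get_query_results` calls; the helper name `run_insights_query` and its polling defaults are hypothetical choices made for illustration.

```python
import time


def run_insights_query(logs_client, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it finishes.

    logs_client is expected to behave like boto3.client("logs").
    start and end are epoch seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start),
        endTime=int(end),
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if response["status"] in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {response['status']}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")
```

Passing the client in as an argument keeps the helper easy to test with a stub; in real use it would be called with something like `run_insights_query(boto3.client("logs"), "/aws/lambda/my-fn", query, start, end)`, where the log group name is a placeholder.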

CloudTrail for Auditing

Trail Configuration

┌─────────────────────────────────────────────────────────────────┐
│                 CLOUDTRAIL ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                  Organization Trail                      │    │
│  │  • All accounts in organization                          │    │
│  │  • All regions                                           │    │
│  │  • Management + Data events                              │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│           ┌───────────────┼───────────────┐                     │
│           ▼               ▼               ▼                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │ S3 Bucket   │  │ CloudWatch  │  │   Athena    │              │
│  │ (Log Archive│  │    Logs     │  │  (Analysis) │              │
│  │  Account)   │  │ (Real-time) │  │             │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│                                                                 │
│  Retention:                                                     │
│  • S3: 7 years (compliance)                                     │
│  • CloudWatch Logs: 90 days (operational)                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Critical Events to Monitor

json
{
  "EventPatterns": [
    {
      "name": "RootAccountUsage",
      "pattern": {
        "userIdentity": {
          "type": ["Root"]
        }
      },
      "severity": "CRITICAL"
    },
    {
      "name": "IAMPolicyChanges",
      "pattern": {
        "eventSource": ["iam.amazonaws.com"],
        "eventName": [
          "CreatePolicy",
          "DeletePolicy",
          "AttachRolePolicy",
          "DetachRolePolicy"
        ]
      },
      "severity": "HIGH"
    },
    {
      "name": "SecurityGroupChanges",
      "pattern": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": [
          "AuthorizeSecurityGroupIngress",
          "AuthorizeSecurityGroupEgress",
          "RevokeSecurityGroupIngress"
        ]
      },
      "severity": "MEDIUM"
    }
  ]
}
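
To illustrate how patterns of this shape select events, here is a minimal matcher. It is a sketch that covers only the two constructs used above (a list meaning "one of these values" and a nested object), not the full EventBridge pattern language; note also that EventBridge delivers CloudTrail records under a `detail` key, which this sketch skips. The names `matches` and `classify` are hypothetical.

```python
def matches(pattern, event):
    """Return True if every field in pattern is satisfied by event.

    A list in the pattern means "the event value must be one of these";
    a dict means "recurse into the nested object".
    """
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def classify(event, rules):
    """Return (name, severity) for each rule whose pattern matches."""
    return [(r["name"], r["severity"]) for r in rules
            if matches(r["pattern"], event)]


# Two of the rules from the JSON above, in the same structure.
RULES = [
    {"name": "RootAccountUsage",
     "pattern": {"userIdentity": {"type": ["Root"]}},
     "severity": "CRITICAL"},
    {"name": "IAMPolicyChanges",
     "pattern": {"eventSource": ["iam.amazonaws.com"],
                 "eventName": ["CreatePolicy", "DeletePolicy",
                               "AttachRolePolicy", "DetachRolePolicy"]},
     "severity": "HIGH"},
]
```

A root user calling `CreatePolicy` matches both rules, which is a useful reminder that a single event can trigger several severities and that routing should pick the highest.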

Athena Query Examples

sql
-- Find all root account activities
SELECT eventTime, eventName, sourceIPAddress, userAgent
FROM cloudtrail_logs
WHERE userIdentity.type = 'Root'
ORDER BY eventTime DESC;

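-- Detect unusual API calls by IP
-- (assumes eventTime is stored as an ISO 8601 string, as in the standard CloudTrail table)
SELECT sourceIPAddress, COUNT(*) as call_count
FROM cloudtrail_logs
WHERE from_iso8601_timestamp(eventTime) > date_add('day', -1, current_timestamp)
GROUP BY sourceIPAddress
HAVING COUNT(*) > 1000
ORDER BY call_count DESC;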

-- Track S3 bucket policy changes
SELECT eventTime, userIdentity.arn, requestParameters
FROM cloudtrail_logs
WHERE eventSource = 's3.amazonaws.com'
  AND eventName = 'PutBucketPolicy'
ORDER BY eventTime DESC;

AWS Config for Compliance

Config Rules Categories

┌─────────────────────────────────────────────────────────────────┐
│                 AWS CONFIG RULES                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Security:                                                      │
│  ├── s3-bucket-public-read-prohibited                           │
│  ├── encrypted-volumes                                          │
│  ├── iam-password-policy                                        │
│  ├── root-account-mfa-enabled                                   │
│  └── vpc-flow-logs-enabled                                      │
│                                                                 │
│  Operational:                                                   │
│  ├── ec2-instance-managed-by-systems-manager                    │
│  ├── rds-multi-az-support                                       │
│  ├── dynamodb-autoscaling-enabled                               │
│  └── cloudwatch-alarm-action-check                              │
│                                                                 │
│  Cost:                                                          │
│  ├── ec2-stopped-instance                                       │
│  ├── ebs-optimized-instance                                     │
│  └── required-tags                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Conformance Packs

yaml
# Example: CIS AWS Foundations Benchmark
ConformancePackName: CIS-AWS-Foundations-Benchmark
ConformancePackInputParameters:
  - ParameterName: AccessKeysRotatedParamMaxAccessKeyAge
    ParameterValue: "90"
TemplateBody: |
  Resources:
    IAMRootAccessKeyCheck:
      Type: AWS::Config::ConfigRule
      Properties:
        ConfigRuleName: iam-root-access-key-check
        Source:
          Owner: AWS
          SourceIdentifier: IAM_ROOT_ACCESS_KEY_CHECK
    
    MFAEnabledForIAMConsoleAccess:
      Type: AWS::Config::ConfigRule
      Properties:
        ConfigRuleName: mfa-enabled-for-iam-console-access
        Source:
          Owner: AWS
          SourceIdentifier: MFA_ENABLED_FOR_IAM_CONSOLE_ACCESS

X-Ray for Distributed Tracing

Trace Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    X-RAY TRACE FLOW                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Client Request                                                 │
│       │                                                         │
│       ▼                                                         │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│  │   ALB   │───►│ Lambda  │───►│   RDS   │    │   S3    │      │
│  │         │    │         │    │         │    │         │      │
│  │ Segment │    │ Segment │    │Subsegment│   │Subsegment│     │
│  └─────────┘    └────┬────┘    └─────────┘    └─────────┘      │
│                      │                                          │
│                      ▼                                          │
│                 ┌─────────┐                                     │
│                 │   SQS   │                                     │
│                 │         │                                     │
│                 │Subsegment│                                    │
│                 └─────────┘                                     │
│                                                                 │
│  Trace ID: 1-5f84c7a5-example12345678901234567                  │
│  Total Duration: 245ms                                          │
│  Segments: 5                                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

X-Ray SDK Integration

python
from aws_xray_sdk.core import patch_all, xray_recorder

# Patch all supported libraries so their downstream calls
# are recorded as subsegments
patch_all()


def validate_order(order_id):
    # Hypothetical placeholder for the real validation logic
    return {'order_id': order_id, 'valid': True}


@xray_recorder.capture('process_order')
def process_order(order_id):
    # Add annotation for filtering (annotations are indexed and searchable)
    xray_recorder.put_annotation('order_id', order_id)

    # Add metadata for debugging (metadata is not indexed)
    xray_recorder.put_metadata('order_details', {
        'items': 5,
        'total': 99.99
    })

    # Business logic here
    result = validate_order(order_id)
    return result

Centralized Logging Architecture

Best Practices Checklist

  • [ ] Enable CloudTrail organization trail
  • [ ] Configure CloudWatch Logs retention policies
  • [ ] Set up critical alarms with appropriate thresholds
  • [ ] Enable AWS Config with conformance packs
  • [ ] Implement X-Ray for distributed applications
  • [ ] Create CloudWatch dashboards for key services
  • [ ] Configure log aggregation to central account
  • [ ] Set up automated compliance reporting

⚖️ Trade-offs

Trade-off 1: Log Retention Cost vs Query Capability

| Storage | Cost | Query Speed | Use Case |
|---|---|---|---|
| CloudWatch Logs (30 days) | $0.50/GB ingested | Fast | Recent troubleshooting |
| S3 Standard | $0.023/GB/month | Slow (Athena) | Archive, compliance |
| S3 Glacier | $0.004/GB/month | Very slow | Long-term retention |

Recommended architecture:

Application → CloudWatch Logs (7-30 days) → Kinesis Firehose → S3

                                              Athena (ad-hoc queries)
                                              OpenSearch (search/dashboards)
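
A quick back-of-the-envelope calculation shows how the prices in this trade-off combine. The sketch below uses only the three list prices quoted in this section and deliberately ignores everything else (CloudWatch storage, Firehose, request and retrieval charges, compression), so treat the output as a rough comparison rather than a bill estimate. The function name `monthly_cost` and the 30-day month are assumptions.

```python
CLOUDWATCH_INGEST_PER_GB = 0.50   # price quoted in this section
S3_STANDARD_PER_GB_MONTH = 0.023  # price quoted in this section
S3_GLACIER_PER_GB_MONTH = 0.004   # price quoted in this section


def monthly_cost(gb_per_day, archive_months, archive_price_per_gb_month):
    """Estimate steady-state monthly (ingest, archive) cost in USD.

    Ingestion is paid once per GB; the archive holds archive_months of data.
    """
    ingested_gb = gb_per_day * 30            # assumes a 30-day month
    archived_gb = ingested_gb * archive_months
    ingest = ingested_gb * CLOUDWATCH_INGEST_PER_GB
    archive = archived_gb * archive_price_per_gb_month
    return round(ingest, 2), round(archive, 2)
```

For 10 GB/day with 12 months archived, ingestion dominates: about $150/month to ingest versus about $82.80/month to hold the archive in S3 Standard, or about $14.40/month in Glacier. Under these assumptions, reducing log volume saves more than tuning the archive tier.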

Trade-off 2: Detailed Monitoring vs Cost

| Setting | Frequency | Cost Impact | When |
|---|---|---|---|
| Basic Monitoring | 5 min | Free | Non-critical |
| Detailed Monitoring | 1 min | +$3.50/instance | Production |
| High-Resolution Metrics | 1 sec | +++ | Trading, gaming |

Trade-off 3: X-Ray Sampling vs Visibility

| Sampling Rate | Cost | Debug Capability | Recommendation |
|---|---|---|---|
| 100% | Very high | Full visibility | Development only |
| Reservoir + Rate | Medium | Balanced | Production |
| Disable | $0 | No traces | Not recommended |
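
The "Reservoir + Rate" row can be pictured as follows: a fixed number of requests per second is always traced, and a percentage of the remainder is traced on top. The sketch below is a local simulation of that idea, not the X-Ray SDK's sampler; the class name `ReservoirRateSampler` is hypothetical, and its defaults mirror X-Ray's default rule of 1 request per second plus 5%.

```python
import random


class ReservoirRateSampler:
    """Trace the first `reservoir` requests each second, then a fixed share."""

    def __init__(self, reservoir=1, rate=0.05, rng=None):
        self.reservoir = reservoir
        self.rate = rate
        self.rng = rng or random.Random()
        self._second = None
        self._used = 0

    def should_sample(self, now):
        """Decide for a request arriving at epoch time `now` (seconds)."""
        second = int(now)
        if second != self._second:
            # New one-second window: refill the reservoir.
            self._second = second
            self._used = 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to the percentage rate.
        return self.rng.random() < self.rate
```

The reservoir guarantees that low-traffic services still produce at least some traces, while the rate keeps cost roughly proportional to traffic on busy services.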

🚨 Failure Modes

Failure Mode 1: Alert Fatigue

🔥 Real-world incident

The team received 200 alerts/day. When the production database went down, the alert got lost in the noise. MTTR was 4 hours instead of 15 minutes, with $50,000 in lost revenue.

| How to detect | How to prevent |
|---|---|
| Alert response rate < 50% | Focus alerts on SLIs (latency, errors, availability) |
| High silence rate | Tune thresholds based on baselines |
| Repeated false positives | Composite alarms to reduce noise |
| Team complaints | A runbook required for every alarm |

Failure Mode 2: Log Loss

| How to detect | How to prevent |
|---|---|
| Gaps in the log timeline | CloudWatch agent with buffering |
| Missing logs during an investigation | Kinesis Firehose with retry |
| Agent crashes | Monitor agent health |
| Throttling errors | Right-size log groups |

Failure Mode 3: Blind Spots in Monitoring

| How to detect | How to prevent |
|---|---|
| Incidents without prior alerts | Chaos engineering to expose gaps |
| Users report issues before alerts fire | Synthetic monitoring |
| Dependencies not monitored | Service map review |
| Third-party failures missed | External endpoint checks |

🔐 Security Baseline

Audit Requirements

| Requirement | Implementation | Verification |
|---|---|---|
| CloudTrail enabled | Organization trail, all regions | Config Rule |
| Log integrity | CloudTrail log file validation | Automated check |
| Log encryption | SSE-KMS for S3 and CloudWatch | Bucket policy |
| Immutable logs | S3 Object Lock for audit logs | Governance mode |

Access Control

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCloudTrailModification",
      "Effect": "Deny",
      "Action": [
        "cloudtrail:DeleteTrail",
        "cloudtrail:StopLogging",
        "cloudtrail:UpdateTrail"
      ],
      "Resource": "*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/SecurityAdmin"
        }
      }
    }
  ]
}

Sensitive Data in Logs

| Data Type | Mitigation |
|---|---|
| PII | Log scrubbing, masking |
| Credentials | Never log; detect and alert |
| API keys | Redact in the application |
| Financial data | Encryption, access control |
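
Scrubbing and redaction in the application can be done with a logging filter that rewrites messages before they leave the process. The sketch below is a starting point under stated assumptions: the two regular expressions are illustrative only (AWS access key IDs with the `AKIA` or `ASIA` prefix, and a loose email matcher) and will not catch every secret or every PII format. The names `scrub` and `ScrubFilter` are hypothetical.

```python
import logging
import re

# AWS access key IDs: a prefix such as AKIA or ASIA plus 16 characters.
ACCESS_KEY_RE = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")
# Deliberately loose email matcher, good enough for masking.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub(text):
    """Replace access key IDs and email addresses with fixed markers."""
    text = ACCESS_KEY_RE.sub("[REDACTED_ACCESS_KEY]", text)
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


class ScrubFilter(logging.Filter):
    """Scrub the fully formatted message of every record."""

    def filter(self, record):
        # Format first so values passed as %-style args are scrubbed too.
        record.msg = scrub(record.getMessage())
        record.args = None
        return True
```

Attaching the filter to a handler (`handler.addFilter(ScrubFilter())`) applies it to everything that handler emits. Pattern-based masking is a safety net; the primary control remains not logging credentials in the first place.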

📊 Ops Readiness

Metrics to monitor

| Component | Metric | Alert Threshold |
|---|---|---|
| CloudWatch Agent | Agent uptime | Any down |
| CloudTrail | Trail status | Inactive |
| Log Groups | IncomingLogEvents | Drop > 50% |
| Alarms | AlarmActions failed | > 0 |
| Config | Non-compliant resources | > 0 critical |
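
The "Drop > 50%" threshold for IncomingLogEvents needs a baseline to compare against. One simple way to express the check, assuming the baseline is the average of a few comparable earlier periods (for example, the same hour on previous days), is sketched below. The function name `ingestion_drop` and the choice of baseline are assumptions, not a CloudWatch feature.

```python
def ingestion_drop(baseline_counts, current_count, max_drop=0.5):
    """Return True when current_count fell more than max_drop below baseline.

    baseline_counts: IncomingLogEvents sums for comparable earlier periods.
    """
    if not baseline_counts:
        return False  # no history yet, nothing to compare against
    baseline = sum(baseline_counts) / len(baseline_counts)
    if baseline == 0:
        return False  # a silent log group cannot drop any further
    drop = (baseline - current_count) / baseline
    return drop > max_drop
```

For a baseline of 1000 events per period, 400 events is a 60% drop and would alert, while 600 events is a 40% drop and would not.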

Dashboard Design

┌─────────────────────────────────────────────────────────────────┐
│               EXECUTIVE DASHBOARD LAYOUT                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Row 1: SLIs (Availability, Latency, Error Rate)           │  │
│  ├───────────────────────────────────────────────────────────┤  │
│  │ Row 2: Active Alarms, Recent Deployments                  │  │
│  ├───────────────────────────────────────────────────────────┤  │
│  │ Row 3: Service Health by Component                        │  │
│  ├───────────────────────────────────────────────────────────┤  │
│  │ Row 4: Cost Trends, Resource Utilization                  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Runbook Entry Points

| Situation | Runbook |
|---|---|
| Alert storm | runbook/alert-storm-triage.md |
| Log delivery failure | runbook/log-delivery-troubleshooting.md |
| CloudTrail monitoring gap | runbook/cloudtrail-gap-investigation.md |
| Config non-compliance spike | runbook/config-remediation.md |
| X-Ray tracing issues | runbook/xray-troubleshooting.md |
| Dashboard not updating | runbook/dashboard-debug.md |

Design Review Checklist

Observability Coverage

  • [ ] Three pillars implemented (Metrics, Logs, Traces)
  • [ ] SLIs defined and monitored
  • [ ] A dashboard for each service layer
  • [ ] Correlation IDs in logs

Alerting

  • [ ] Alarms focus on SLIs
  • [ ] Composite alarms to reduce noise
  • [ ] Every alarm has a runbook
  • [ ] Alerts routed to the right team

Audit & Compliance

  • [ ] CloudTrail organization trail enabled
  • [ ] Log integrity validation
  • [ ] Retention policies meet compliance
  • [ ] Access to logs restricted

Operations

  • [ ] Centralized log aggregation
  • [ ] Query playbooks documented
  • [ ] On-call has dashboard access
  • [ ] Regular review of alert effectiveness

📎 Links