📊 Observability & Auditing
Level: Ops
Solves: Implementing comprehensive observability and audit trails for enterprise AWS environments
🎯 Outcomes
After working through this page, you will be able to:
- Design an observability strategy around the Three Pillars (Metrics, Logs, Traces)
- Configure CloudWatch with alarms, dashboards, and Logs Insights
- Deploy CloudTrail for a comprehensive cross-account audit trail
- Apply AWS Config for compliance monitoring and auto-remediation
- Implement X-Ray for distributed tracing
- Build centralized log aggregation for multi-account analysis
✅ When to Use
| Service | Use Case | Reason |
|---|---|---|
| CloudWatch Metrics | Resource và application monitoring | Native integration, alarms |
| CloudWatch Logs | Application logs, audit trails | Retention, Insights queries |
| CloudTrail | API auditing, security investigation | Required for compliance |
| AWS Config | Configuration compliance | Continuous monitoring, auto-fix |
| X-Ray | Distributed tracing | Microservices troubleshooting |
| CloudWatch Synthetics | Endpoint monitoring | Proactive issue detection |
❌ When NOT to Use
| Pattern | Problem | Alternative |
|---|---|---|
| CloudWatch for long-term log storage | High cost, slow queries | S3 + Athena |
| X-Ray for every request | Cost, performance overhead | Sampling strategy |
| CloudWatch alarms for every metric | Alert fatigue | Focus on SLIs |
| Custom Config rules for common checks | Maintenance overhead | Managed rules |
⚠️ Warning from Raizo
"One team created 500+ CloudWatch alarms. The result? Alert fatigue: everyone ignored the alerts. When a real incident hit, nobody responded because they were worn out by false positives. Remember: a few high-quality alarms beat many noisy ones."
Observability Pillars
Three Pillars Framework
┌─────────────────────────────────────────────────────────────────┐
│ OBSERVABILITY THREE PILLARS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ METRICS │ │ LOGS │ │ TRACES │ │
│ │ │ │ │ │ │ │
│ │ CloudWatch │ │ CloudWatch │ │ X-Ray │ │
│ │ Metrics │ │ Logs │ │ │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ • CPU, Memory • Application • Request flow │
│ • Request count • Error messages • Latency breakdown │
│ • Latency p99 • Audit events • Service dependencies │
│ • Custom metrics • Access logs • Error propagation │
│ │
│ CORRELATION: Use X-Ray trace ID in logs for end-to-end view │
│ │
└─────────────────────────────────────────────────────────────────┘
CloudWatch Deep Dive
Metrics Architecture
Key Metrics to Monitor
| Service | Critical Metrics | Alarm Threshold |
|---|---|---|
| EC2 | CPUUtilization, StatusCheckFailed | > 80%, > 0 |
| RDS | CPUUtilization, FreeStorageSpace, DatabaseConnections | > 80%, < 20%, > 80% |
| Lambda | Errors, Duration, ConcurrentExecutions | > 1%, > 80% timeout, > 80% limit |
| ALB | TargetResponseTime, HTTPCode_ELB_5XX | > 1s p99, > 0 |
| ECS | CPUUtilization, MemoryUtilization | > 80%, > 80% |
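A relative threshold such as "Duration > 80% timeout" has to be converted into an absolute number before an alarm can be created. The sketch below shows one way to do that for the Lambda row. The helper name, the function name `order-processor`, and the choice of the `Maximum` statistic are illustrative assumptions; the dictionary keys follow the CloudWatch `PutMetricAlarm` parameter names.

```python
from typing import Any, Dict


def lambda_duration_alarm(function_name: str, timeout_seconds: float,
                          fraction: float = 0.8) -> Dict[str, Any]:
    """Build PutMetricAlarm parameters for 'Duration > 80% of timeout'.

    Hypothetical helper: it only builds the request, it does not call AWS.
    """
    # Lambda reports Duration in milliseconds, timeouts are configured in seconds.
    threshold_ms = timeout_seconds * 1000 * fraction
    return {
        "AlarmName": f"{function_name}-duration-near-timeout",
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",  # assumption: alert on the slowest invocation
        "Period": 300,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no invocations means no problem
    }


alarm = lambda_duration_alarm("order-processor", timeout_seconds=30)
print(alarm["AlarmName"], alarm["Threshold"])
```

With boto3 installed and credentials configured, the resulting dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`; that call is not executed here.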
CloudWatch Alarms Best Practices
json
{
"AlarmName": "HighCPUUtilization",
"MetricName": "CPUUtilization",
"Namespace": "AWS/EC2",
"Statistic": "Average",
"Period": 300,
"EvaluationPeriods": 3,
"DatapointsToAlarm": 2,
"Threshold": 80,
"ComparisonOperator": "GreaterThanThreshold",
"TreatMissingData": "breaching",
"AlarmActions": [
"arn:aws:sns:us-east-1:123456789012:ops-alerts"
],
"OKActions": [
"arn:aws:sns:us-east-1:123456789012:ops-alerts"
]
}
💡 Alarm Configuration
- Use DatapointsToAlarm < EvaluationPeriods to avoid flapping
- Set TreatMissingData appropriately (breaching for critical, notBreaching for optional)
- Always configure OKActions so you know when an issue is resolved
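To see why setting DatapointsToAlarm below EvaluationPeriods dampens flapping, it can help to model the "M out of N" evaluation. The function below is a deliberately simplified model, not CloudWatch's actual algorithm (which also considers a wider evaluation range and an `ignore` mode); the function name and its arguments are illustrative.

```python
from typing import Optional, Sequence


def alarm_state(datapoints: Sequence[Optional[float]], threshold: float,
                datapoints_to_alarm: int, treat_missing: str = "missing") -> str:
    """Simplified M-out-of-N evaluation; N is len(datapoints), None means missing."""
    breaching = 0
    seen = 0
    for value in datapoints:
        if value is None:
            if treat_missing == "breaching":
                breaching += 1
            # "notBreaching" counts as healthy; "missing" is simply skipped here.
            continue
        seen += 1
        if value > threshold:
            breaching += 1
    if breaching >= datapoints_to_alarm:
        return "ALARM"
    if seen == 0 and treat_missing == "missing":
        return "INSUFFICIENT_DATA"
    return "OK"


# A single CPU spike in three periods does not trigger a 2-of-3 alarm.
print(alarm_state([85, 60, 70], threshold=80, datapoints_to_alarm=2))
# Two breaching periods do.
print(alarm_state([85, 90, 70], threshold=80, datapoints_to_alarm=2))
# A missing datapoint tips the balance only when treated as breaching.
print(alarm_state([85, None, 70], 80, 2, treat_missing="breaching"))
print(alarm_state([85, None, 70], 80, 2, treat_missing="notBreaching"))
```

The last two calls show why `breaching` suits critical signals (silence is suspicious) while `notBreaching` suits optional ones.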
CloudWatch Logs Insights
sql
# Find errors in Lambda functions
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Analyze API Gateway latency
fields @timestamp, @message
| filter @message like /requestId/
| parse @message '"latency":*,' as latency
| stats avg(latency), max(latency), pct(latency, 99) by bin(5m)

# Find slow database queries
fields @timestamp, @message
| filter @message like /duration/
| parse @message 'duration: * ms' as duration
| filter duration > 1000
| sort duration desc
| limit 50
CloudTrail for Auditing
Trail Configuration
┌─────────────────────────────────────────────────────────────────┐
│ CLOUDTRAIL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Organization Trail │ │
│ │ • All accounts in organization │ │
│ │ • All regions │ │
│ │ • Management + Data events │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ S3 Bucket │ │ CloudWatch │ │ Athena │ │
│ │ (Log Archive│ │ Logs │ │ (Analysis) │ │
│ │ Account) │ │ (Real-time) │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Retention: │
│ • S3: 7 years (compliance) │
│ • CloudWatch Logs: 90 days (operational) │
│ │
└─────────────────────────────────────────────────────────────────┘
Critical Events to Monitor
json
{
"EventPatterns": [
{
"name": "RootAccountUsage",
"pattern": {
"userIdentity": {
"type": ["Root"]
}
},
"severity": "CRITICAL"
},
{
"name": "IAMPolicyChanges",
"pattern": {
"eventSource": ["iam.amazonaws.com"],
"eventName": [
"CreatePolicy",
"DeletePolicy",
"AttachRolePolicy",
"DetachRolePolicy"
]
},
"severity": "HIGH"
},
{
"name": "SecurityGroupChanges",
"pattern": {
"eventSource": ["ec2.amazonaws.com"],
"eventName": [
"AuthorizeSecurityGroupIngress",
"AuthorizeSecurityGroupEgress",
"RevokeSecurityGroupIngress"
]
},
"severity": "MEDIUM"
}
]
}
Athena Query Examples
sql
-- Find all root account activities
SELECT eventTime, eventName, sourceIPAddress, userAgent
FROM cloudtrail_logs
WHERE userIdentity.type = 'Root'
ORDER BY eventTime DESC;

-- Detect unusual API calls by IP
-- Assumes the standard CloudTrail table DDL, where eventTime is an ISO 8601 string
SELECT sourceIPAddress, COUNT(*) as call_count
FROM cloudtrail_logs
WHERE from_iso8601_timestamp(eventTime) > date_add('day', -1, now())
GROUP BY sourceIPAddress
HAVING COUNT(*) > 1000
ORDER BY call_count DESC;

-- Track S3 bucket policy changes
SELECT eventTime, userIdentity.arn, requestParameters
FROM cloudtrail_logs
WHERE eventSource = 's3.amazonaws.com'
AND eventName = 'PutBucketPolicy'
ORDER BY eventTime DESC;
AWS Config for Compliance
Config Rules Categories
┌─────────────────────────────────────────────────────────────────┐
│ AWS CONFIG RULES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Security: │
│ ├── s3-bucket-public-read-prohibited │
│ ├── encrypted-volumes │
│ ├── iam-password-policy │
│ ├── root-account-mfa-enabled │
│ └── vpc-flow-logs-enabled │
│ │
│ Operational: │
│ ├── ec2-instance-managed-by-systems-manager │
│ ├── rds-multi-az-support │
│ ├── dynamodb-autoscaling-enabled │
│ └── cloudwatch-alarm-action-check │
│ │
│ Cost: │
│ ├── ec2-stopped-instance │
│ ├── ebs-optimized-instance │
│ └── required-tags │
│ │
└─────────────────────────────────────────────────────────────────┘
Conformance Packs
yaml
# Example: CIS AWS Foundations Benchmark
ConformancePackName: CIS-AWS-Foundations-Benchmark
ConformancePackInputParameters:
  - ParameterName: AccessKeysRotatedParamMaxAccessKeyAge
    ParameterValue: "90"
TemplateBody: |
  Resources:
    IAMRootAccessKeyCheck:
      Type: AWS::Config::ConfigRule
      Properties:
        ConfigRuleName: iam-root-access-key-check
        Source:
          Owner: AWS
          SourceIdentifier: IAM_ROOT_ACCESS_KEY_CHECK
    MFAEnabledForIAMConsoleAccess:
      Type: AWS::Config::ConfigRule
      Properties:
        ConfigRuleName: mfa-enabled-for-iam-console-access
        Source:
          Owner: AWS
          SourceIdentifier: MFA_ENABLED_FOR_IAM_CONSOLE_ACCESS
X-Ray for Distributed Tracing
Trace Architecture
┌─────────────────────────────────────────────────────────────────┐
│ X-RAY TRACE FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Client Request │
│ │ │
│ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ ALB │───►│ Lambda │───►│ RDS │ │ S3 │ │
│ │ │ │ │ │ │ │ │ │
│ │ Segment │ │ Segment │ │Subsegment│ │Subsegment│ │
│ └─────────┘ └────┬────┘ └─────────┘ └─────────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ SQS │ │
│ │ │ │
│ │Subsegment│ │
│ └─────────┘ │
│ │
│ Trace ID: 1-5f84c7a5-example12345678901234567 │
│ Total Duration: 245ms │
│ Segments: 5 │
│ │
└─────────────────────────────────────────────────────────────────┘
X-Ray SDK Integration
python
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries so their calls are recorded as subsegments
patch_all()


@xray_recorder.capture('process_order')
def process_order(order_id):
    # Add an annotation (indexed, usable for filtering traces)
    xray_recorder.put_annotation('order_id', order_id)
    # Add metadata (not indexed, for debugging)
    xray_recorder.put_metadata('order_details', {
        'items': 5,
        'total': 99.99
    })
    # Business logic here; validate_order is assumed to be defined elsewhere
    result = validate_order(order_id)
    return result
Centralized Logging Architecture
Best Practices Checklist
- [ ] Enable CloudTrail organization trail
- [ ] Configure CloudWatch Logs retention policies
- [ ] Set up critical alarms with appropriate thresholds
- [ ] Enable AWS Config with conformance packs
- [ ] Implement X-Ray for distributed applications
- [ ] Create CloudWatch dashboards for key services
- [ ] Configure log aggregation to central account
- [ ] Set up automated compliance reporting
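The Three Pillars diagram recommends putting the X-Ray trace ID in logs for an end-to-end view, which ties the tracing and log-aggregation items of this checklist together. A minimal sketch, assuming the code runs in Lambda (which exposes the tracing header in the `_X_AMZN_TRACE_ID` environment variable); the helper names and the `xray_trace_id` field name are illustrative choices.

```python
import json
import os
from typing import Any, Optional


def trace_id_from_header(header: Optional[str]) -> Optional[str]:
    """Extract the Root trace ID from a header like 'Root=1-...;Parent=...;Sampled=1'."""
    if not header:
        return None
    for part in header.split(";"):
        key, _, value = part.strip().partition("=")
        if key == "Root":
            return value
    return None


def log_event(message: str, **fields: Any) -> str:
    """Emit one JSON log line, tagged with the current trace ID when available."""
    record = {"message": message, **fields}
    trace_id = trace_id_from_header(os.environ.get("_X_AMZN_TRACE_ID"))
    if trace_id:
        record["xray_trace_id"] = trace_id  # hypothetical field name
    line = json.dumps(record)
    print(line)
    return line


# Simulate the variable Lambda would set, using the example trace ID from the diagram.
os.environ["_X_AMZN_TRACE_ID"] = (
    "Root=1-5f84c7a5-example12345678901234567;Parent=53995c3f42cd8ad8;Sampled=1"
)
log_event("order validated", order_id="o-123")
```

Because every line is JSON with a stable field, a Logs Insights or Athena query can filter on the trace ID once logs reach the central account.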
⚖️ Trade-offs
Trade-off 1: Log Retention Cost vs Query Capability
| Storage | Cost | Query Speed | Use Case |
|---|---|---|---|
| CloudWatch Logs (30 days) | $0.50/GB ingested | Fast | Recent troubleshooting |
| S3 Standard | $0.023/GB/month | Slow (Athena) | Archive, compliance |
| S3 Glacier | $0.004/GB/month | Very slow | Long-term retention |
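A quick back-of-the-envelope calculation makes the trade-off concrete. The sketch below uses only the prices from the table above; real prices vary by region and over time, and the model ignores CloudWatch storage, request, retrieval, and Athena scan charges. The function name and the 10 GB/day workload are assumptions for illustration.

```python
# Prices copied from the table above (USD); treat them as illustrative.
PRICE_PER_GB = {
    "cloudwatch_ingest": 0.50,  # per GB ingested
    "s3_standard": 0.023,       # per GB stored per month
    "s3_glacier": 0.004,        # per GB stored per month
}


def monthly_cost(gb_per_day: float, retention_days: int, tier: str) -> float:
    """Steady-state monthly cost for one tier under the simplifications above."""
    if tier == "cloudwatch_ingest":
        # Ingestion is charged on volume written, independent of retention.
        return gb_per_day * 30 * PRICE_PER_GB[tier]
    # Storage tiers are charged on the data held once retention has filled up.
    stored_gb = gb_per_day * retention_days
    return stored_gb * PRICE_PER_GB[tier]


for tier, days in [("cloudwatch_ingest", 30), ("s3_standard", 365), ("s3_glacier", 2555)]:
    print(f"{tier:18s} {days:5d} days  ${monthly_cost(10, days, tier):,.2f}/month")
```

Under these assumptions, holding seven years of logs in Glacier costs less per month than ingesting one month of the same stream into CloudWatch Logs, which is the reasoning behind tiering.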
Recommended architecture:
Application → CloudWatch Logs (7-30 days) → Kinesis Firehose → S3
↓
Athena (ad-hoc queries)
OpenSearch (search/dashboards)
Trade-off 2: Detailed Monitoring vs Cost
| Setting | Frequency | Cost Impact | When |
|---|---|---|---|
| Basic Monitoring | 5 min | Free | Non-critical |
| Detailed Monitoring | 1 min | +$3.50/instance | Production |
| High-Resolution Metrics | 1 sec | +++ | Trading, gaming |
Trade-off 3: X-Ray Sampling vs Visibility
| Sampling Rate | Cost | Debug Capability | Recommendation |
|---|---|---|---|
| 100% | Very high | Full visibility | Development only |
| Reservoir + Rate | Medium | Balanced | Production |
| Disable | $0 | No traces | Not recommended |
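The "Reservoir + Rate" row can be pictured as: always trace the first few requests each second, then trace a fixed fraction of the rest. The class below is a local simulation of that idea, not the X-Ray SDK's implementation; the class name is hypothetical, and the values 1 and 0.05 mirror X-Ray's default rule (one request per second plus 5% of additional requests).

```python
import random
from typing import Optional


class ReservoirSampler:
    """Sample the first `reservoir` requests per second, then `rate` of the rest."""

    def __init__(self, reservoir: int, rate: float,
                 rng: Optional[random.Random] = None) -> None:
        self.reservoir = reservoir
        self.rate = rate
        self.rng = rng or random.Random()
        self._window: Optional[int] = None  # current one-second window
        self._used = 0                      # reservoir slots used in this window

    def should_sample(self, now: float) -> bool:
        window = int(now)
        if window != self._window:  # a new second started: refill the reservoir
            self._window = window
            self._used = 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        return self.rng.random() < self.rate


# 1000 requests spread evenly over 10 seconds (100 requests per second).
sampler = ReservoirSampler(reservoir=1, rate=0.05, rng=random.Random(42))
sampled = sum(sampler.should_sample(i / 100) for i in range(1000))
print(f"sampled {sampled} of 1000 requests")
```

The reservoir guarantees at least one trace per second even for quiet services, while the rate keeps cost roughly proportional to traffic for busy ones.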
🚨 Failure Modes
Failure Mode 1: Alert Fatigue
🔥 Real-world incident
The team was receiving 200 alerts per day. When the production database went down, the alert got lost in the noise. MTTR was 4 hours instead of 15 minutes, with a $50,000 revenue loss.
| How to detect | How to prevent |
|---|---|
| Alert response rate < 50% | Focus alerts on SLIs (latency, errors, availability) |
| High silence rate | Tune thresholds based on baselines |
| Repeated false positives | Composite alarms to reduce noise |
| Team complaints | Alert runbook required for every alarm |
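Composite alarms reduce noise by paging only when several symptoms coincide, and by staying quiet during known events such as deployments. The sketch below builds a rule expression in the CloudWatch composite alarm syntax (`ALARM("name")` combined with `AND` and `NOT`). The helper and all alarm names are hypothetical.

```python
from typing import Sequence


def composite_rule(all_of: Sequence[str], suppress_when: Sequence[str] = ()) -> str:
    """Fire when every alarm in `all_of` is in ALARM and none in `suppress_when` is."""
    if not all_of:
        raise ValueError("at least one child alarm is required")
    for name in (*all_of, *suppress_when):
        if '"' in name:
            raise ValueError(f"alarm name must not contain a double quote: {name!r}")
    terms = [f'ALARM("{name}")' for name in all_of]
    terms += [f'NOT ALARM("{name}")' for name in suppress_when]
    return " AND ".join(terms)


rule = composite_rule(
    ["api-5xx-rate-high", "api-latency-p99-high"],
    suppress_when=["deploy-in-progress"],
)
print(rule)
# ALARM("api-5xx-rate-high") AND ALARM("api-latency-p99-high") AND NOT ALARM("deploy-in-progress")
```

The resulting string would be supplied as the `AlarmRule` parameter of `PutCompositeAlarm`, with notifications attached to the composite alarm rather than to each child alarm.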
Failure Mode 2: Log Loss
| How to detect | How to prevent |
|---|---|
| Gaps in the log timeline | CloudWatch agent with buffering |
| Missing logs during investigations | Kinesis Firehose with retries |
| Agent crashes | Monitor agent health |
| Throttling errors | Right-size log groups |
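Buffering and retry, the two prevention measures in the table, can be illustrated with a small in-process model. This is a sketch of the general pattern only; it is not how the CloudWatch agent or Kinesis Firehose are implemented, and the class, its parameters, and the `flaky_send` stand-in are all hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque, List


class BufferedLogShipper:
    """Hold log lines in a bounded buffer and retry delivery with backoff."""

    def __init__(self, send: Callable[[List[str]], None], max_buffer: int = 1000,
                 max_attempts: int = 3, base_delay: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.send = send
        self.buffer: Deque[str] = deque(maxlen=max_buffer)  # oldest lines drop when full
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep
        self.dropped = 0  # expose overflow so it can be monitored

    def add(self, line: str) -> None:
        if len(self.buffer) == self.buffer.maxlen:
            self.dropped += 1
        self.buffer.append(line)

    def flush(self) -> bool:
        """Try to deliver the buffer; keep it for the next flush if all attempts fail."""
        if not self.buffer:
            return True
        batch = list(self.buffer)
        for attempt in range(self.max_attempts):
            try:
                self.send(batch)
            except Exception:  # sketch: real code would catch specific errors
                self.sleep(self.base_delay * 2 ** attempt)  # exponential backoff
            else:
                self.buffer.clear()
                return True
        return False


# Stand-in destination that is throttled twice before accepting the batch.
received: List[str] = []
failures = {"left": 2}


def flaky_send(batch: List[str]) -> None:
    if failures["left"] > 0:
        failures["left"] -= 1
        raise RuntimeError("throttled")
    received.extend(batch)


shipper = BufferedLogShipper(flaky_send, sleep=lambda seconds: None)
shipper.add("line 1")
shipper.add("line 2")
print(shipper.flush(), received)
```

Tracking `dropped` matters as much as the retry itself: a silent overflow is exactly the kind of gap that only shows up during an investigation.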
Failure Mode 3: Blind Spots in Monitoring
| How to detect | How to prevent |
|---|---|
| Incidents without prior alerts | Chaos engineering to expose gaps |
| Users report issues before alerts fire | Synthetic monitoring |
| Dependencies not monitored | Service map review |
| Third-party failures missed | External endpoint checks |
🔐 Security Baseline
Audit Requirements
| Requirement | Implementation | Verification |
|---|---|---|
| CloudTrail enabled | Organization trail, all regions | Config Rule |
| Log integrity | CloudTrail log file validation | Automated check |
| Log encryption | SSE-KMS for S3 and CloudWatch | Bucket policy |
| Immutable logs | S3 Object Lock for audit logs | Governance mode |
Access Control
json
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyCloudTrailModification",
"Effect": "Deny",
"Action": [
"cloudtrail:DeleteTrail",
"cloudtrail:StopLogging",
"cloudtrail:UpdateTrail"
],
"Resource": "*",
"Condition": {
"ArnNotLike": {
"aws:PrincipalArn": "arn:aws:iam::*:role/SecurityAdmin"
}
}
}
]
}
Sensitive Data in Logs
| Data Type | Mitigation |
|---|---|
| PII | Log scrubbing, masking |
| Credentials | Never log; detect and alert |
| API keys | Redact in the application |
| Financial data | Encryption, access control |
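Scrubbing in the application, before a line ever reaches CloudWatch Logs, is the cheapest of these mitigations. The sketch below masks three common patterns with regular expressions. The patterns and replacement tokens are illustrative assumptions and will miss many real-world formats, so they complement rather than replace the rule of not logging secrets in the first place.

```python
import re

# (pattern, replacement) pairs, applied in order. All patterns are illustrative.
_SCRUBBERS = [
    # key=value or key: value secrets such as password=..., token: ...
    (re.compile(r"(?i)\b(password|secret|token|api[_-]?key)\s*[=:]\s*\S+"),
     r"\1=[REDACTED]"),
    # AWS access key IDs: AKIA (long-term) or ASIA (temporary) plus 16 characters
    (re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY_ID]"),
    # Email addresses as a simple PII example
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),
]


def scrub(line: str) -> str:
    """Return the log line with known sensitive patterns masked."""
    for pattern, replacement in _SCRUBBERS:
        line = pattern.sub(replacement, line)
    return line


print(scrub("user=alice@example.com key=AKIAIOSFODNN7EXAMPLE password=hunter2"))
# user=[REDACTED_EMAIL] key=[REDACTED_AWS_KEY_ID] password=[REDACTED]
```

Wiring `scrub` into a logging filter or formatter keeps the masking in one place instead of relying on every call site.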
📊 Ops Readiness
Metrics to Monitor
| Component | Metric | Alert Threshold |
|---|---|---|
| CloudWatch Agent | Agent uptime | Any down |
| CloudTrail | Trail status | Inactive |
| Log Groups | IncomingLogEvents | Drop > 50% |
| Alarms | AlarmActions failed | > 0 |
| Config | Non-compliant resources | > 0 critical |
Dashboard Design
┌─────────────────────────────────────────────────────────────────┐
│ EXECUTIVE DASHBOARD LAYOUT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Row 1: SLIs (Availability, Latency, Error Rate) │ │
│ ├───────────────────────────────────────────────────────────┤ │
│ │ Row 2: Active Alarms, Recent Deployments │ │
│ ├───────────────────────────────────────────────────────────┤ │
│ │ Row 3: Service Health by Component │ │
│ ├───────────────────────────────────────────────────────────┤ │
│ │ Row 4: Cost Trends, Resource Utilization │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Runbook Entry Points
| Situation | Runbook |
|---|---|
| Alert storm | runbook/alert-storm-triage.md |
| Log delivery failure | runbook/log-delivery-troubleshooting.md |
| CloudTrail monitoring gap | runbook/cloudtrail-gap-investigation.md |
| Config non-compliance spike | runbook/config-remediation.md |
| X-Ray tracing issues | runbook/xray-troubleshooting.md |
| Dashboard not updating | runbook/dashboard-debug.md |
✅ Design Review Checklist
Observability Coverage
- [ ] Three pillars implemented (Metrics, Logs, Traces)
- [ ] SLIs defined and monitored
- [ ] Dashboard for each service layer
- [ ] Correlation IDs in logs
Alerting
- [ ] Alarms focus on SLIs
- [ ] Composite alarms reduce noise
- [ ] Every alarm has a runbook
- [ ] Alerts routed to the right team
Audit & Compliance
- [ ] CloudTrail organization trail enabled
- [ ] Log integrity validation
- [ ] Retention policies meet compliance
- [ ] Access to logs restricted
Operations
- [ ] Centralized log aggregation
- [ ] Query playbooks documented
- [ ] On-call has dashboard access
- [ ] Regular review of alert effectiveness
📎 Related Links
- 📎 GCP Observability - Comparison with Cloud Logging and Cloud Monitoring
- 📎 Security Posture - Security monitoring integration
- 📎 Reliability & DR - Monitoring trong DR strategy
- 📎 IAM Fundamentals - Auditing IAM activities
- 📎 Terraform Drift Detection - IaC drift monitoring