Introduction to Prometheus & Grafana on Kubernetes

👁️ Perspective: DevOps Lead - Observability Specialist

This module is written from the perspective of a DevOps Lead specializing in Observability at HPN. The guiding principle: "You can't fix what you can't see."

🎯 Why Do You Need Observability?

text

[SRE On-Call - 2:00 AM]
🚨 Alert: Response time spiked to 5 seconds!
❓ Question: where is the problem?
   - Slow database?
   - Not enough CPU/memory?
   - Network issue?
   - Application bug?

Without Observability = "Guessing" 🎲
With Observability = "Looking straight at the problem" 🎯

The three pillars of Observability:

Pillar	Tool	Question it answers
Metrics	Prometheus	"How is the system doing?" (CPU, Memory, RPS)
Logs	Loki / EFK	"What happened?" (Error messages)
Traces	Jaeger / Tempo	"Where did the request go?" (Distributed tracing)

🏗️ Prometheus Architecture: Pull Model

💡 THE KEY DIFFERENCE

Traditional Monitoring (Push Model):

Agents on each server PUSH metrics to a central server
Problem: the central server can be overwhelmed

Prometheus (Pull Model):

Prometheus PULLS (scrapes) metrics from targets
Targets expose metrics at the /metrics endpoint
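The pull model can be illustrated with a small sketch using only the Python standard library. This is not how Prometheus itself is implemented; it only shows the direction of the traffic. The metric name `demo_requests_total` and the helper names are hypothetical, chosen for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric text that a target would serve.
METRICS_BODY = "demo_requests_total 42\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Target side: answers GET /metrics and never pushes anything."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        payload = METRICS_BODY.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_target():
    """Start the target on a free local port and return the server object."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(host, port, timeout=2.0):
    """Scraper side: pull the current metrics text from the target."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"http://{host}:{port}/metrics", timeout=timeout) as response:
        return response.read().decode("utf-8")


def demo():
    """Run one scrape against a local target and return the scraped text."""
    server = start_target()
    host, port = server.server_address
    try:
        return scrape(host, port)
    finally:
        server.shutdown()
        server.server_close()
```

A failed `scrape` call (timeout or connection refused) is itself a signal that the target is down, which is one reason the pull direction is convenient.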

Architecture overview

Main components

Component	Role	What it provides
Prometheus Server	Collects & stores metrics	Time-series database (TSDB)
Node Exporter	Hardware/OS metrics	CPU, Memory, Disk, Network of each node
kube-state-metrics	Metrics about K8s objects	Pod status, Deployment replicas, etc.
Alertmanager	Handles alerts	Sends notifications (Slack, Email, PagerDuty)
Grafana	Visualization	Dashboards, Graphs, Panels

📡 Exporters: Where Metrics Come From

Node Exporter (Hardware Metrics)

Runs as a DaemonSet on every node and collects metrics about:

CPU usage per core
Memory usage (used, free, cached)
Disk I/O, Disk space
Network throughput

yaml

# Installed automatically with kube-prometheus-stack
# View metrics at: http://node-exporter:9100/metrics

# Example metrics:
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_memory_MemTotal_bytes 16777216000
node_filesystem_avail_bytes{mountpoint="/"} 50000000000
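Each sample line follows the pattern `name{labels} value`. As a rough sketch of how such lines can be read, the parser below handles only that simple shape. It is a simplification of the exposition format: escaped quotes, braces inside label values, and optional timestamps are not handled, and the function name is hypothetical.

```python
import re

# name, optional {label="value",...}, then the sample value
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


EXAMPLE = """
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_memory_MemTotal_bytes 16777216000
node_filesystem_avail_bytes{mountpoint="/"} 50000000000
"""

for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels, value)
```

The labels are what make later filtering and grouping possible: the same metric name appears once per label combination.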

kube-state-metrics (Cluster Metrics)

Collects metrics from the Kubernetes API about the state of objects:

yaml

# Example metrics:
kube_deployment_status_replicas_available{deployment="myapp"} 3
kube_pod_status_phase{pod="myapp-xxx", phase="Running"} 1
kube_node_status_condition{node="worker-1", condition="Ready"} 1

Application Exporters (Custom Metrics)

Your application needs to expose a /metrics endpoint:

python

# Python with Flask and prometheus_client
from flask import Flask
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency')

def handle_orders():
    # Placeholder for the real business logic
    return {'orders': []}

@app.route('/api/orders')
def orders():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/orders').inc()
    with REQUEST_LATENCY.time():
        return handle_orders()

if __name__ == '__main__':
    # Expose metrics at :8000/metrics, then start the app itself
    start_http_server(8000)
    app.run()

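yaml

# Java with Micrometer (Spring Boot), configured in application.yml
# Requires the micrometer-registry-prometheus dependency
management:
  endpoints:
    web:
      exposure:
        include: prometheus
  metrics:
    export:
      prometheus:
        # Spring Boot 2.x path; Spring Boot 3.x uses
        # management.prometheus.metrics.export.enabled
        enabled: true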

📊 PromQL: The Query Language

💡 What is PromQL?

Prometheus Query Language: the language for querying and aggregating metrics stored in Prometheus.

Basic query: `rate()`

promql

# ❓ Question: how much CPU is being used?

# Raw metric (not very useful on its own: it is a counter that only goes up)
container_cpu_usage_seconds_total

# ✅ With rate(): the per-second RATE OF CHANGE over the last 5 minutes
rate(container_cpu_usage_seconds_total[5m])

What does rate() do?

text

Time 0:00 → container_cpu_usage_seconds_total = 1000s
Time 5:00 → container_cpu_usage_seconds_total = 1300s

rate() = (1300 - 1000) / 300 seconds = 1 CPU-second per second

Meaning: the container is using 1 CPU core (100% of one core)
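The arithmetic above can be written as a short Python sketch. This is a simplified model, not Prometheus's exact algorithm: the real `rate()` also extrapolates to the edges of the time window. The function name and sample layout are assumptions made for illustration.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went down, so the process restarted and the
            # counter began again from zero: count the new value in full.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# The worked example: 1000s of CPU time at t=0, 1300s at t=300
print(simple_rate([(0, 1000.0), (300, 1300.0)]))
```

Handling the counter reset is the main reason to use `rate()` instead of subtracting raw counter values by hand.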

Other useful queries

promql

# 📈 CPU usage by namespace
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

# 💾 Current memory usage
sum(container_memory_usage_bytes) by (pod)

# 🚀 Request rate (RPS)
sum(rate(http_requests_total[5m])) by (service)

# ❌ Error rate (%), assuming the counter carries a status label
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# ⏱️ 99th percentile latency (aggregate buckets by le before taking the quantile)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
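To make the percentile query less of a black box, here is a minimal sketch of the estimation idea behind `histogram_quantile`: find the bucket that contains the requested rank, then interpolate linearly inside it. It assumes cumulative buckets with non-negative upper bounds (as for latencies) and is an approximation of the behaviour, not Prometheus's source code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count); must include +Inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan  # a histogram without a +Inf bucket is incomplete
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls beyond the last finite bucket: the best
                # available answer is the highest finite upper bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation inside the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets: le=0.1, 0.5, 1.0 seconds and +Inf
LATENCY_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, LATENCY_BUCKETS))
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries match the real latency distribution.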

📺 Grafana: The Face of Observability

💡 What is Grafana?

Grafana is a visualization tool: it turns data from Prometheus into clear, readable dashboards.

Golden Signals Dashboard

The Google SRE book defines four Golden Signals, the four most important metrics to monitor:

Signal	Description	PromQL Example
Latency	Time taken to handle a request	`histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`
Traffic	Number of requests	`sum(rate(http_requests_total[5m]))`
Errors	Error ratio	`sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
Saturation	How "full" the resources are	`container_memory_usage_bytes / container_spec_memory_limit_bytes`
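Grafana panels run queries like these through the Prometheus HTTP API, and the same can be done from a script. The sketch below builds a request URL for the instant-query endpoint (`/api/v1/query`) and parses a response of type `vector`. The base URL `http://localhost:9090` is an assumption (for example, a local port-forward), the helper names are hypothetical, and the sample response is hand-written for illustration rather than captured from a real server.

```python
import json
import urllib.parse


def build_query_url(base_url, promql):
    """Build the URL of an instant query against the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_instant_vector(body):
    """Turn an instant-vector JSON response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each item carries "value": [unix_timestamp, "value as string"]
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Hand-written sample response, for illustration only
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"service": "orders"}, "value": [1700000000, "12.5"]},
        ],
    },
})

TRAFFIC_QUERY = "sum(rate(http_requests_total[5m])) by (service)"
print(build_query_url("http://localhost:9090", TRAFFIC_QUERY))
print(parse_instant_vector(SAMPLE_RESPONSE))
```

Sample values arrive as strings in the JSON, so they are converted to `float` before use.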

Pre-built Dashboards

Grafana has thousands of dashboards already built by the community:

Dashboard ID	Name	Purpose
315	Kubernetes Cluster Monitoring	Cluster-wide overview
6417	Kubernetes Pods	Per-pod detail
1860	Node Exporter Full	Hardware metrics
7249	Kubernetes API Server	Control plane health

Import a dashboard:

Go to Grafana → Dashboards → Import
Enter the Dashboard ID (e.g. 315)
Select the Prometheus datasource
Done! 🎉

📝 Logging: Don't SSH into Nodes!

🚫 ANTI-PATTERN

bash

# ❌ WRONG: SSH into each node to read logs
ssh node-1
docker logs container-xxx
cat /var/log/...

# Problems:
# - Does not scale (10 nodes = 10 SSH sessions?)
# - Container restart = logs lost
# - No way to search across services

Centralized Logging Solutions

Stack	Components	Characteristics
EFK	Elasticsearch + Fluentd + Kibana	Powerful search, high resource usage
Loki	Loki + Promtail + Grafana	Lightweight, label-based, low cost
ELK	Elasticsearch + Logstash + Kibana	Enterprise standard

Loki: The Lightweight Choice

Why Loki?

No full-text index (unlike Elasticsearch) → much less storage (often quoted as around 10x less)
Queries by labels (like Prometheus) → familiar syntax
Native Grafana integration → one dashboard for everything

logql

# LogQL query example
{namespace="production", app="myapp"} |= "error" | json | status >= 500
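The query reads left to right as a pipeline: select a log stream by labels, keep lines containing "error", parse each line as JSON, then filter on the extracted `status` field. The Python sketch below mimics the last three stages on a list of lines that are assumed to already belong to the selected stream. It is only an analogy for reading the query, not Loki's implementation; in particular, lines that are not valid JSON are simply skipped here, and the function name is hypothetical.

```python
import json


def filter_logs(lines, needle="error", min_status=500):
    """Mimic: |= "error" | json | status >= 500"""
    matches = []
    for line in lines:
        if needle not in line:           # |= "error" (substring match)
            continue
        try:
            fields = json.loads(line)    # | json (extract fields)
        except json.JSONDecodeError:
            continue
        if not isinstance(fields, dict):
            continue
        status = fields.get("status")
        if isinstance(status, (int, float)) and status >= min_status:
            matches.append(fields)       # | status >= 500
    return matches


# Hypothetical log lines from {namespace="production", app="myapp"}
SAMPLE_LINES = [
    '{"level":"error","status":503,"msg":"upstream timeout"}',
    '{"level":"info","status":200,"msg":"ok"}',
    '{"level":"error","status":404,"msg":"not found"}',
    'plain error text that is not JSON',
]

for entry in filter_logs(SAMPLE_LINES):
    print(entry["status"], entry["msg"])
```

The cheap substring filter runs before JSON parsing, which mirrors the usual advice to put line filters early in a LogQL pipeline.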

🚀 Installation: kube-prometheus-stack

📦 RECOMMENDATION

Use the kube-prometheus-stack Helm chart, which comes pre-configured with:

Prometheus Server
Alertmanager
Grafana (with many built-in dashboards)
Node Exporter
kube-state-metrics

Quick Install

bash

# Add the Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the full stack
# (admin123 is an example password only; choose your own)
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123

# Verify
kubectl get pods -n monitoring

Access Dashboards

bash

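# Grafana (user: admin; password: the grafana.adminPassword set at install,
# or the chart default prom-operator if none was set)
kubectl port-forward -n monitoring svc/kube-prometheus-grafana 3000:80

# Service names are derived from the Helm release name; if a command below
# fails, check the actual names with: kubectl get svc -n monitoring

# Prometheus UI
kubectl port-forward -n monitoring svc/kube-prometheus-prometheus 9090:9090

# Alertmanager
kubectl port-forward -n monitoring svc/kube-prometheus-alertmanager 9093:9093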

📊 Summary

Concept	Description
Pull Model	Prometheus scrapes metrics from targets
Exporters	Sources of metrics (Node, kube-state, App)
PromQL	Query language (rate, sum, histogram_quantile)
Golden Signals	Latency, Traffic, Errors, Saturation
Grafana	Visualization & Dashboards
Loki	Lightweight centralized logging

⚠️ THINGS TO REMEMBER

Metrics ≠ Logs - Metrics answer "how much", Logs answer "what happened"
Pull model - Prometheus pulls the data; the app does not push it
Labels are powerful - Filter, aggregate, group by labels
Don't reinvent - Use the ready-made kube-prometheus-stack
Golden Signals - Start with these 4 basic metrics

🔗 Links

Previous: Module 10: HPA & Metrics Server
Reference: Module 9: Probes (Liveness for health checks)
Practice: Challenge: Observability Debug

🧠 Quiz

Question 1: Which model does Prometheus use to collect metrics?

[ ] A) Push model: the application pushes metrics to Prometheus
[x] B) Pull model: Prometheus actively scrapes metrics from targets
[ ] C) Event-driven: Prometheus listens for events
[ ] D) Streaming: continuous data flow

💡 Explanation: Prometheus uses the Pull model: it periodically scrapes (pulls) metrics from each target's /metrics endpoint. Advantages: a target that is down is easy to detect because the scrape fails, and the application does not need to be configured with where to send its metrics.

Question 2: What are the 4 Golden Signals in Observability?

[ ] A) CPU, Memory, Disk, Network
[x] B) Latency, Traffic, Errors, Saturation
[ ] C) Logs, Metrics, Traces, Events
[ ] D) Availability, Durability, Scalability, Performance

💡 Explanation: The 4 Golden Signals (Google SRE) are Latency (response time), Traffic (request volume), Errors (error rate), and Saturation (how full the resources are). These are the four core metrics to monitor.
