🔍 LLM Observability
🎓 Page Overview
This page covers observability for LLM systems, including distributed tracing, structured logging, and privacy-preserving telemetry practices.
Level: Ops · Solves: Setting up comprehensive observability for LLM applications with proper tracing, logging, and privacy controls
🎯 Observability Architecture
Three Pillars + LLM Extensions
LLM-Specific Telemetry Data
| Data Type | Purpose | Retention |
|---|---|---|
| Prompt/Response | Debugging, quality analysis | Privacy-controlled |
| Token Counts | Cost tracking | Long-term |
| Latency Breakdown | Performance optimization | 30 days |
| Model Version | Regression tracking | Long-term |
| Evaluation Scores | Quality monitoring | 90 days |
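Token counts from the table above feed directly into cost tracking. A minimal sketch of per-request cost attribution, using illustrative per-1K-token prices (the numbers are placeholders, not current vendor pricing):

```python
# Hypothetical per-1K-token prices; real prices vary by model and vendor
PRICE_PER_1K = {
    "gpt-4-turbo": {"prompt": 0.01, "completion": 0.03},
}

def request_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute the USD cost of a single request from its token counts."""
    p = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

# 500 prompt + 200 completion tokens -> 0.005 + 0.006 = 0.011 USD
cost = request_cost_usd("gpt-4-turbo", 500, 200)
```

Emitting this value as a span attribute (e.g. `llm.cost.usd`) makes cost queryable per trace, tenant, and feature.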
📊 Distributed Tracing
LLM Trace Structure
Span Attributes for LLM
```python
# Standard OpenTelemetry attributes for LLM operations
LLM_SPAN_ATTRIBUTES = {
    # Request attributes
    "llm.request.model": "gpt-4-turbo",
    "llm.request.max_tokens": 1024,
    "llm.request.temperature": 0.7,
    "llm.request.top_p": 1.0,
    # Token usage
    "llm.usage.prompt_tokens": 500,
    "llm.usage.completion_tokens": 200,
    "llm.usage.total_tokens": 700,
    # Cost (calculated)
    "llm.cost.usd": 0.0035,
    # Response metadata
    "llm.response.finish_reason": "stop",
    "llm.response.model": "gpt-4-turbo-2024-01-01",
    # RAG-specific
    "rag.retrieval.count": 5,
    "rag.retrieval.score_avg": 0.85,
    "rag.rerank.enabled": True,
}
```
Trace Example
```json
{
    "trace_id": "abc123",
    "spans": [
        {
            "name": "llm_request",
            "duration_ms": 2500,
            "attributes": {
                "llm.request.model": "gpt-4-turbo",
                "llm.usage.total_tokens": 700
            },
            "children": [
                {
                    "name": "embedding",
                    "duration_ms": 150,
                    "attributes": {
                        "embedding.model": "text-embedding-3-small",
                        "embedding.token_count": 100
                    }
                },
                {
                    "name": "vector_search",
                    "duration_ms": 50,
                    "attributes": {
                        "vector_db.engine": "pinecone",
                        "vector_db.results_count": 10
                    }
                },
                {
                    "name": "rerank",
                    "duration_ms": 200,
                    "attributes": {
                        "rerank.model": "cohere-rerank-v3",
                        "rerank.input_count": 10,
                        "rerank.output_count": 5
                    }
                },
                {
                    "name": "llm_generation",
                    "duration_ms": 2000,
                    "attributes": {
                        "llm.request.model": "gpt-4-turbo",
                        "llm.usage.prompt_tokens": 500,
                        "llm.usage.completion_tokens": 200
                    }
                }
            ]
        }
    ]
}
```
📝 Structured Logging
Log Schema for LLM Operations
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal, Optional

@dataclass
class LLMLogEntry:
    """Structured log entry for LLM operations."""
    timestamp: datetime
    level: Literal["DEBUG", "INFO", "WARN", "ERROR"]
    trace_id: str
    span_id: str
    # Request identification
    user_id: str  # May be hashed for privacy
    session_id: str
    request_id: str
    # Operation details
    operation: str  # "embedding", "generation", "retrieval"
    model: str
    model_version: str
    # Performance
    duration_ms: int
    tokens_prompt: int
    tokens_completion: int
    # Quality signals
    finish_reason: str
    error_code: Optional[str]
    # Privacy-controlled fields
    prompt_hash: str  # Never log raw prompts in production
    response_hash: str
```
Log Levels Strategy
| Level | What to Log | Example |
|---|---|---|
| DEBUG | Full prompts (dev only) | Raw input/output |
| INFO | Request metadata | Model, tokens, latency |
| WARN | Degraded service | Retry, fallback used |
| ERROR | Failures | API error, timeout |
Sensitive Data Handling
```python
# hash_pii, is_debug_mode, hash_content, and mask_pii_patterns are
# assumed helpers provided elsewhere in the codebase
def sanitize_log_data(data: dict) -> dict:
    """Remove or mask sensitive information before logging."""
    sanitized = data.copy()
    # Hash identifiers
    if "user_id" in sanitized:
        sanitized["user_id"] = hash_pii(sanitized["user_id"])
    # Remove raw content in production
    if not is_debug_mode():
        sanitized.pop("prompt", None)
        sanitized.pop("response", None)
    # Keep hashes for correlation
    if "prompt" in data:
        sanitized["prompt_hash"] = hash_content(data["prompt"])
    # Mask potential PII in metadata
    sanitized = mask_pii_patterns(sanitized)
    return sanitized
```
📈 Metrics
Key Metrics Dashboard
| Metric | Type | Description |
|---|---|---|
| llm.requests.total | Counter | Total LLM requests |
| llm.requests.duration_ms | Histogram | Request latency |
| llm.tokens.total | Counter | Token consumption |
| llm.cost.usd | Counter | Cost accumulation |
| llm.errors.total | Counter | Error count by type |
| llm.cache.hit_rate | Gauge | Semantic cache effectiveness |
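In production these counters would typically be backed by a metrics library, but the mechanics reduce to a map keyed by label values. A minimal in-memory sketch (class and method names are illustrative, not a real client API):

```python
from collections import defaultdict

class LabeledCounter:
    """Minimal in-memory counter keyed by a tuple of label values."""
    def __init__(self) -> None:
        self._values: dict = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels) -> None:
        # Sort labels so {"a": 1, "b": 2} and {"b": 2, "a": 1} share a key
        self._values[tuple(sorted(labels.items()))] += amount

    def get(self, **labels) -> float:
        return self._values[tuple(sorted(labels.items()))]

requests_total = LabeledCounter()
requests_total.inc(model="gpt-4-turbo", status="success")
requests_total.inc(model="gpt-4-turbo", status="success")
requests_total.inc(model="gpt-4-turbo", status="error")
```

Partitioning every metric by the same label set keeps dashboards and alerts consistent across models, tenants, and features.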
Metric Labels
```yaml
labels:
  - model: "gpt-4-turbo"
  - environment: "production"
  - tenant_id: "acme-corp"
  - feature: "chat-assistant"
  - status: "success|error|timeout"
```
SLI/SLO Definitions
| SLI | Formula | SLO Target |
|---|---|---|
| Availability | successful_requests / total_requests | 99.9% |
| Latency | p95(response_time) | < 3s |
| Quality | avg(user_satisfaction_score) | > 4.0/5.0 |
| Cost Efficiency | tokens_used / expected_tokens | < 1.2x |
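The availability and latency SLIs above can be computed from a batch of request records. A sketch, assuming each record carries a `success` flag and a `duration_ms` field (field names are assumptions):

```python
import math

def compute_slis(requests: list) -> dict:
    """Compute availability and p95 latency from request records."""
    total = len(requests)
    ok = sum(1 for r in requests if r["success"])
    durations = sorted(r["duration_ms"] for r in requests)
    # Nearest-rank p95: the value at ceil(0.95 * n), clamped to the list
    idx = min(len(durations) - 1, math.ceil(0.95 * total) - 1)
    p95 = durations[idx] if durations else 0.0
    return {
        "availability": ok / total if total else 0.0,
        "latency_p95_ms": p95,
    }

slis = compute_slis([
    {"success": True, "duration_ms": 1200},
    {"success": True, "duration_ms": 2100},
    {"success": False, "duration_ms": 9000},
    {"success": True, "duration_ms": 1800},
])
# availability = 3/4 = 0.75
```

In practice the same formulas run as recording rules in the metrics backend over a rolling window rather than over raw records.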
🔒 Privacy-Preserving Telemetry
Data Classification
| Category | Examples | Handling |
|---|---|---|
| Never Log | SSN, credit cards, passwords | Block at ingestion |
| Hash Only | User IDs, email addresses | One-way hash |
| Encrypt | Conversation content | Encryption at rest |
| Aggregate Only | Usage patterns | No individual data |
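Enforcing the classification table amounts to a dispatch on category before any field reaches the log pipeline. A sketch, where `encrypt_at_rest` is an assumed encryption helper, not a real API:

```python
import hashlib
from typing import Optional

def handle_field(category: str, value: str) -> Optional[str]:
    """Apply the handling policy for a data-classification category."""
    if category == "never_log":
        return None  # blocked at ingestion, nothing is emitted
    if category == "hash_only":
        # One-way hash, truncated for log compactness
        return hashlib.sha256(value.encode()).hexdigest()[:16]
    if category == "encrypt":
        return encrypt_at_rest(value)  # assumed helper, defined elsewhere
    if category == "aggregate_only":
        return "[AGGREGATED]"  # only counted, never stored individually
    raise ValueError(f"unknown category: {category}")
```

Centralizing this in one function makes privacy audits tractable: there is a single choke point to review rather than per-callsite logic.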
PII Detection Pipeline
Masking Strategies
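Applying a rule set like the one below is a single `re.sub` pass per rule. A self-contained sketch reusing two of the same patterns:

```python
import re

# Two of the masking rules, inlined as (pattern, replacement) pairs
RULES = {
    "email": (r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", "[EMAIL_REDACTED]"),
    "phone": (r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE_REDACTED]"),
}

def mask_text(text: str) -> str:
    """Replace every match of each masking rule with its redaction token."""
    for pattern, replacement in RULES.values():
        text = re.sub(pattern, replacement, text)
    return text

masked = mask_text("Contact alice@example.com or 555-123-4567")
# -> "Contact [EMAIL_REDACTED] or [PHONE_REDACTED]"
```

Regex masking catches formatted identifiers cheaply; free-text PII (names, addresses) usually needs an NER-based detector on top.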
```python
MASKING_RULES = {
    "email": {
        "pattern": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",
        "replacement": "[EMAIL_REDACTED]"
    },
    "phone": {
        "pattern": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
        "replacement": "[PHONE_REDACTED]"
    },
    "credit_card": {
        "pattern": r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b",
        "replacement": "[CC_REDACTED]"
    },
    "api_key": {
        "pattern": r"(sk-|pk-|api_)[a-zA-Z0-9]{20,}",
        "replacement": "[API_KEY_REDACTED]"
    }
}
```
Differential Privacy for Analytics
```python
import numpy as np

def add_differential_privacy(metric: float, epsilon: float = 1.0) -> float:
    """Add Laplace noise for differential privacy."""
    sensitivity = 1.0  # Depends on the metric's range
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return metric + noise
```
🛠️ Debugging Workflows
Trace-Driven Debugging
Common Debugging Scenarios
| Symptom | Investigation | Resolution |
|---|---|---|
| High latency | Check span breakdown | Optimize bottleneck |
| Low quality | Review prompt/context logs | Improve retrieval/prompts |
| High cost | Analyze token distribution | Optimize token usage |
| Inconsistent results | Compare response variations | Stabilize temperature |
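For the high-latency case, the first step is finding the dominant span in the trace. A sketch over the nested trace shape shown earlier (field names follow that example):

```python
def find_bottleneck(span: dict) -> dict:
    """Return the leaf span with the largest duration_ms in a trace tree."""
    children = span.get("children", [])
    if not children:
        return span
    # Descend into the slowest child at each level
    slowest_child = max(children, key=lambda s: s["duration_ms"])
    return find_bottleneck(slowest_child)

trace = {
    "name": "llm_request", "duration_ms": 2500,
    "children": [
        {"name": "vector_search", "duration_ms": 50},
        {"name": "llm_generation", "duration_ms": 2000},
    ],
}
find_bottleneck(trace)  # the llm_generation span
```

Here the generation span dominates, so prompt trimming or a faster model would help more than tuning retrieval.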
Replay Capability
```python
class RequestReplayer:
    """Replay historical requests for debugging."""

    def replay(self, trace_id: str, modifications: dict = None):
        """Replay a request from trace history.

        Args:
            trace_id: Original trace to replay
            modifications: Optional overrides (model, temperature, etc.)
        """
        original = self.trace_store.get(trace_id)
        # Apply modifications
        request = original.request.copy()
        if modifications:
            request.update(modifications)
        # Execute with new trace
        result = self.llm_service.execute(request)
        # Compare results
        return self.compare(original.response, result)
```
📋 Observability Checklist
Setup Phase
- [ ] Deploy distributed tracing with LLM-specific attributes
- [ ] Configure structured logging with privacy controls
- [ ] Set up metrics collection and dashboards
- [ ] Define SLIs/SLOs for LLM operations
Implementation Phase
- [ ] Add trace spans for all LLM operations
- [ ] Implement PII detection and redaction
- [ ] Configure log retention policies
- [ ] Set up alerting rules
Production Phase
- [ ] Monitor latency and token budgets
- [ ] Track cost attribution by feature/tenant
- [ ] Run regular privacy audits
- [ ] Use traces in incident response
🔗 Cross-References
- 📎 LLM Evaluation - Quality metrics integration
- 📎 ML Monitoring - Drift detection patterns
- 📎 DL Model Ops - Model versioning integration
- 📎 Cost Optimization - Cost tracking strategies
📚 Further Reading
- "OpenTelemetry for LLM Observability" - OpenTelemetry Docs
- "LangSmith Tracing Guide" - LangChain Documentation
- "Privacy-Preserving ML Monitoring" - MLOps Community