
🔍 LLM Observability

🎓 Page Overview

This page covers observability for LLM systems, including distributed tracing, structured logging, and privacy-preserving telemetry practices.

Level: Ops
Solves: Setting up comprehensive observability for LLM applications with proper tracing, logging, and privacy controls

🎯 Observability Architecture

Three Pillars + LLM Extensions

LLM-Specific Telemetry Data

| Data Type | Purpose | Retention |
|-----------|---------|-----------|
| Prompt/Response | Debugging, quality analysis | Privacy-controlled |
| Token Counts | Cost tracking | Long-term |
| Latency Breakdown | Performance optimization | 30 days |
| Model Version | Regression tracking | Long-term |
| Evaluation Scores | Quality monitoring | 90 days |

📊 Distributed Tracing

LLM Trace Structure

Span Attributes for LLM

```python
# Standard OpenTelemetry attributes for LLM operations
LLM_SPAN_ATTRIBUTES = {
    # Request attributes
    "llm.request.model": "gpt-4-turbo",
    "llm.request.max_tokens": 1024,
    "llm.request.temperature": 0.7,
    "llm.request.top_p": 1.0,

    # Token usage
    "llm.usage.prompt_tokens": 500,
    "llm.usage.completion_tokens": 200,
    "llm.usage.total_tokens": 700,

    # Cost (calculated)
    "llm.cost.usd": 0.0035,

    # Response metadata
    "llm.response.finish_reason": "stop",
    "llm.response.model": "gpt-4-turbo-2024-01-01",

    # RAG-specific
    "rag.retrieval.count": 5,
    "rag.retrieval.score_avg": 0.85,
    "rag.rerank.enabled": True,
}
```
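The llm.cost.usd attribute is calculated client-side from token usage rather than returned by the provider. A minimal sketch of that derivation — the per-1K-token prices below are placeholder values, not real rates; in practice, load current pricing from configuration:

```python
# Hypothetical per-1K-token prices; real rates change, so load them from config.
PRICING_PER_1K = {
    "gpt-4-turbo": {"prompt": 0.01, "completion": 0.03},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Derive the llm.cost.usd span attribute from token usage."""
    rates = PRICING_PER_1K[model]
    cost = (prompt_tokens / 1000) * rates["prompt"] \
         + (completion_tokens / 1000) * rates["completion"]
    return round(cost, 6)
```

Recording the cost as a span attribute at request time means cost queries become trace queries, with no separate billing pipeline to reconcile.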

Trace Example

```json
{
  "trace_id": "abc123",
  "spans": [
    {
      "name": "llm_request",
      "duration_ms": 2500,
      "attributes": {
        "llm.request.model": "gpt-4-turbo",
        "llm.usage.total_tokens": 700
      },
      "children": [
        {
          "name": "embedding",
          "duration_ms": 150,
          "attributes": {
            "embedding.model": "text-embedding-3-small",
            "embedding.token_count": 100
          }
        },
        {
          "name": "vector_search",
          "duration_ms": 50,
          "attributes": {
            "vector_db.engine": "pinecone",
            "vector_db.results_count": 10
          }
        },
        {
          "name": "rerank",
          "duration_ms": 200,
          "attributes": {
            "rerank.model": "cohere-rerank-v3",
            "rerank.input_count": 10,
            "rerank.output_count": 5
          }
        },
        {
          "name": "llm_generation",
          "duration_ms": 2000,
          "attributes": {
            "llm.request.model": "gpt-4-turbo",
            "llm.usage.prompt_tokens": 500,
            "llm.usage.completion_tokens": 200
          }
        }
      ]
    }
  ]
}
```
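Given a trace shaped like the example above, the first debugging question is usually "where did the time go?". A small sketch that flattens the span tree into a per-span latency breakdown:

```python
def latency_breakdown(span: dict) -> dict:
    """Flatten a span tree into {span_name: duration_ms} for bottleneck analysis."""
    result = {span["name"]: span["duration_ms"]}
    for child in span.get("children", []):
        result.update(latency_breakdown(child))
    return result


trace = {
    "name": "llm_request", "duration_ms": 2500,
    "children": [
        {"name": "embedding", "duration_ms": 150},
        {"name": "vector_search", "duration_ms": 50},
        {"name": "rerank", "duration_ms": 200},
        {"name": "llm_generation", "duration_ms": 2000},
    ],
}
breakdown = latency_breakdown(trace)
```

Here llm_generation accounts for 2000 of the 2500 ms, so prompt size or model choice, not retrieval, is the optimization target.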

📝 Structured Logging

Log Schema for LLM Operations

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal, Optional


@dataclass
class LLMLogEntry:
    """Structured log entry for LLM operations."""
    timestamp: datetime
    level: Literal["DEBUG", "INFO", "WARN", "ERROR"]
    trace_id: str
    span_id: str

    # Request identification
    user_id: str  # May be hashed for privacy
    session_id: str
    request_id: str

    # Operation details
    operation: str  # "embedding", "generation", "retrieval"
    model: str
    model_version: str

    # Performance
    duration_ms: int
    tokens_prompt: int
    tokens_completion: int

    # Quality signals
    finish_reason: str
    error_code: Optional[str]

    # Privacy-controlled fields
    prompt_hash: str  # Never log raw prompts in production
    response_hash: str
```

Log Levels Strategy

| Level | What to Log | Example |
|-------|-------------|---------|
| DEBUG | Full prompts (dev only) | Raw input/output |
| INFO | Request metadata | Model, tokens, latency |
| WARN | Degraded service | Retry, fallback used |
| ERROR | Failures | API error, timeout |

Sensitive Data Handling

```python
def sanitize_log_data(data: dict) -> dict:
    """Remove or mask sensitive information before logging."""
    sanitized = data.copy()

    # Hash identifiers
    if "user_id" in sanitized:
        sanitized["user_id"] = hash_pii(sanitized["user_id"])

    # Remove raw content in production
    if not is_debug_mode():
        sanitized.pop("prompt", None)
        sanitized.pop("response", None)

        # Keep hashes for correlation
        if "prompt" in data:
            sanitized["prompt_hash"] = hash_content(data["prompt"])
        if "response" in data:
            sanitized["response_hash"] = hash_content(data["response"])

    # Mask potential PII in metadata
    sanitized = mask_pii_patterns(sanitized)

    return sanitized
```
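The hash_pii and hash_content helpers used above are left abstract. One possible stdlib sketch uses a keyed hash for identifiers (so hashes cannot be recomputed from a public ID list without the key) and a plain hash for content correlation — the salt constant here is a placeholder; in practice it comes from a secret store:

```python
import hashlib
import hmac

PII_SALT = b"load-from-secret-store"  # placeholder; never hardcode in production


def hash_pii(value: str) -> str:
    """Keyed one-way hash: the same user correlates across logs without exposing the ID."""
    return hmac.new(PII_SALT, value.encode(), hashlib.sha256).hexdigest()[:16]


def hash_content(content: str) -> str:
    """Unkeyed content hash, used only to correlate identical prompts/responses."""
    return hashlib.sha256(content.encode()).hexdigest()[:16]
```

Truncating to 16 hex characters keeps log lines compact while leaving collision probability negligible at typical log volumes.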

📈 Metrics

Key Metrics Dashboard

| Metric | Type | Description |
|--------|------|-------------|
| llm.requests.total | Counter | Total LLM requests |
| llm.requests.duration_ms | Histogram | Request latency |
| llm.tokens.total | Counter | Token consumption |
| llm.cost.usd | Counter | Cost accumulation |
| llm.errors.total | Counter | Error count by type |
| llm.cache.hit_rate | Gauge | Semantic cache effectiveness |

Metric Labels

```yaml
labels:
  - model: "gpt-4-turbo"
  - environment: "production"
  - tenant_id: "acme-corp"
  - feature: "chat-assistant"
  - status: "success|error|timeout"
```

SLI/SLO Definitions

| SLI | Formula | SLO Target |
|-----|---------|------------|
| Availability | successful_requests / total_requests | 99.9% |
| Latency | p95(response_time) | < 3s |
| Quality | avg(user_satisfaction_score) | > 4.0/5.0 |
| Cost Efficiency | tokens_used / expected_tokens | < 1.2x |
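The availability SLI translates directly into an error budget for alerting. A sketch using the 99.9% target from the table:

```python
def error_budget_remaining(successful: int, total: int, slo: float = 0.999) -> float:
    """Fraction of the error budget left: 1.0 = untouched, <= 0.0 = exhausted."""
    if total == 0:
        return 1.0  # No traffic means no budget consumed
    allowed_failures = total * (1 - slo)
    actual_failures = total - successful
    return 1 - actual_failures / allowed_failures
```

With a 99.9% SLO over 100,000 requests, the budget allows 100 failures; 99,950 successes (50 failures) leaves roughly half the budget, a common threshold for paging versus ticketing.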

🔒 Privacy-Preserving Telemetry

Data Classification

| Category | Examples | Handling |
|----------|----------|----------|
| Never Log | SSN, credit cards, passwords | Block at ingestion |
| Hash Only | User IDs, email addresses | One-way hash |
| Encrypt | Conversation content | Encryption at rest |
| Aggregate Only | Usage patterns | No individual data |

PII Detection Pipeline

Masking Strategies

```python
MASKING_RULES = {
    "email": {
        "pattern": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",
        "replacement": "[EMAIL_REDACTED]"
    },
    "phone": {
        "pattern": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
        "replacement": "[PHONE_REDACTED]"
    },
    "credit_card": {
        "pattern": r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b",
        "replacement": "[CC_REDACTED]"
    },
    "api_key": {
        "pattern": r"(sk-|pk-|api_)[a-zA-Z0-9]{20,}",
        "replacement": "[API_KEY_REDACTED]"
    }
}
```
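Applying the rule table is a single re.sub pass per rule. A sketch using a two-rule subset of the table above, same shape:

```python
import re

# Subset of the rule table above, same shape.
MASKING_RULES = {
    "email": {
        "pattern": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",
        "replacement": "[EMAIL_REDACTED]",
    },
    "phone": {
        "pattern": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
        "replacement": "[PHONE_REDACTED]",
    },
}


def mask_pii_patterns(text: str) -> str:
    """Run every masking rule over a free-text field before it reaches a log sink."""
    for rule in MASKING_RULES.values():
        text = re.sub(rule["pattern"], rule["replacement"], text)
    return text
```

Because rules run in order and patterns can overlap (a credit-card number contains phone-length digit runs), put the more specific patterns first in the table.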

Differential Privacy for Analytics

```python
import numpy as np


def add_differential_privacy(metric: float, epsilon: float = 1.0) -> float:
    """Add Laplacian noise for differential privacy."""
    sensitivity = 1.0  # Depends on the metric's query sensitivity
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return metric + noise
```
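If numpy is not available in the logging path, the same Laplace mechanism can be implemented with the standard library via inverse-transform sampling — an assumed-equivalent sketch of the function above:

```python
import math
import random
from typing import Optional


def laplace_noise(scale: float, rng: Optional[random.Random] = None) -> float:
    """Sample Laplace(0, scale) by inverse transform: u ~ Uniform(-0.5, 0.5)."""
    u = (rng or random).random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))


def add_differential_privacy(metric: float, epsilon: float = 1.0,
                             sensitivity: float = 1.0) -> float:
    """Laplace mechanism: noise scale = sensitivity / epsilon."""
    return metric + laplace_noise(sensitivity / epsilon)
```

Smaller epsilon means stronger privacy but noisier analytics; the mean absolute noise equals the scale, which is a quick sanity check for the sampler.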

🛠️ Debugging Workflows

Trace-Driven Debugging

Common Debugging Scenarios

| Symptom | Investigation | Resolution |
|---------|---------------|------------|
| High latency | Check span breakdown | Optimize bottleneck |
| Low quality | Review prompt/context logs | Improve retrieval/prompts |
| High cost | Analyze token distribution | Optimize token usage |
| Inconsistent results | Compare response variations | Stabilize temperature |

Replay Capability

```python
from typing import Optional


class RequestReplayer:
    """Replay historical requests for debugging."""

    def replay(self, trace_id: str, modifications: Optional[dict] = None):
        """
        Replay a request from trace history.

        Args:
            trace_id: Original trace to replay
            modifications: Optional overrides (model, temperature, etc.)
        """
        original = self.trace_store.get(trace_id)

        # Apply modifications
        request = original.request.copy()
        if modifications:
            request.update(modifications)

        # Execute with a new trace
        result = self.llm_service.execute(request)

        # Compare results
        return self.compare(original.response, result)
```

📋 Observability Checklist

Setup Phase

  • [ ] Deploy distributed tracing with LLM-specific attributes
  • [ ] Configure structured logging with privacy controls
  • [ ] Set up metrics collection and dashboards
  • [ ] Define SLIs/SLOs for LLM operations

Implementation Phase

  • [ ] Add trace spans for all LLM operations
  • [ ] Implement PII detection and redaction
  • [ ] Configure log retention policies
  • [ ] Set up alerting rules

Production Phase

  • [ ] Monitor latency and token budgets
  • [ ] Track cost attribution by feature/tenant
  • [ ] Run regular privacy audits
  • [ ] Use traces in incident response

🔗 Cross-References

📚 Further Reading

  • "OpenTelemetry for LLM Observability" - OpenTelemetry Docs
  • "LangSmith Tracing Guide" - LangChain Documentation
  • "Privacy-Preserving ML Monitoring" - MLOps Community