
📊 LLM Evaluation

🎓 Page Overview

This page provides a comprehensive framework for evaluating LLM systems, from offline evaluation with golden sets to online metrics and human evaluation.

Level: Ops · Solves: Setting up a comprehensive evaluation pipeline for LLM applications with proper metrics and regression testing

🎯 Evaluation Framework Overview

Evaluation Types

Evaluation Dimensions

| Dimension | Description | Measurement |
|---|---|---|
| Correctness | Factual accuracy | Ground truth comparison |
| Relevance | Answers the question | LLM-as-judge, human eval |
| Coherence | Logical flow | LLM-as-judge |
| Helpfulness | Useful to user | User ratings |
| Safety | No harmful content | Classifiers, rules |
| Faithfulness | Grounded in sources | Citation verification |

📋 Golden Sets

Golden Set Design

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Literal, Optional


@dataclass
class GoldenExample:
    """Structure for a golden evaluation example."""
    id: str
    query: str
    context: Optional[str]  # For RAG systems
    expected_output: str
    expected_citations: List[str]
    metadata: Dict[str, Any]
    difficulty: Literal["easy", "medium", "hard"]
    category: str
    created_at: datetime
    last_validated: datetime
```

Golden Set Categories

| Category | Purpose | Example |
|---|---|---|
| Core Functionality | Basic capability | Simple Q&A |
| Edge Cases | Boundary conditions | Empty input, very long query |
| Adversarial | Attack resistance | Prompt injection attempts |
| Domain-specific | Specialized knowledge | Technical terminology |
| Multi-turn | Conversation handling | Context carryover |
| Format Compliance | Output structure | JSON, markdown tables |

Golden Set Size Guidelines

| Application Type | Minimum Size | Ideal Size |
|---|---|---|
| Chatbot | 100 | 500+ |
| RAG System | 200 | 1000+ |
| Code Assistant | 300 | 2000+ |
| Domain Expert | 500 | 3000+ |

Golden Set Maintenance
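Maintenance is largely about periodic re-validation: answers drift as products and models change, so examples need a freshness check. A minimal sketch, assuming a `last_validated` timestamp as in the dataclass above (the `Example` stand-in class and the 90-day window are illustrative assumptions, not standards):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Example:
    """Minimal stand-in for the GoldenExample structure above."""
    id: str
    last_validated: datetime


def stale_examples(golden_set: List[Example],
                   max_age_days: int = 90) -> List[Example]:
    """Flag examples due for human re-validation.

    The 90-day default is an illustrative choice; tune it to how
    quickly your domain's ground truth drifts.
    """
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [ex for ex in golden_set if ex.last_validated < cutoff]
```

Running this on a schedule (e.g. the weekly update in the checklist below) keeps the golden set from silently rotting.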

📈 Automated Metrics

Reference-based Metrics

| Metric | Formula | Use Case |
|---|---|---|
| Exact Match | output == expected | Structured outputs |
| F1 Score | Token overlap | Short answers |
| BLEU | N-gram precision | Translation-like tasks |
| ROUGE | N-gram recall | Summarization |
| BERTScore | Semantic similarity | General text |
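The token-overlap F1 in the table is simple enough to sketch directly (SQuAD-style scoring: precision and recall over whitespace tokens, lowercased):

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 for short-answer scoring."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as
    # often as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the capital is paris", "paris")` gives 0.4: recall is perfect but three of the four predicted tokens are extra.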

Reference-free Metrics

| Metric | Approach | Measures |
|---|---|---|
| Perplexity | Model likelihood | Fluency |
| Coherence | Sentence embeddings | Logical flow |
| Toxicity | Classifier | Safety |
| Readability | Flesch-Kincaid | Accessibility |

LLM-as-Judge

```markdown
## Evaluation Prompt Template

You are evaluating an AI assistant's response.

### Query
{user_query}

### Context Provided
{retrieved_context}

### Assistant Response
{llm_response}

### Evaluation Criteria
Rate the response on a scale of 1-5 for each criterion:

1. **Relevance**: Does the response address the query?
2. **Accuracy**: Is the information factually correct?
3. **Completeness**: Does it cover all aspects?
4. **Clarity**: Is it easy to understand?
5. **Helpfulness**: Would this help the user?

### Output Format
{
  "relevance": {"score": 1-5, "reasoning": "..."},
  "accuracy": {"score": 1-5, "reasoning": "..."},
  ...
  "overall": {"score": 1-5, "reasoning": "..."}
}
```
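Wiring a template like this into an automated judge mostly comes down to prompt formatting and robust JSON extraction, since judge models often wrap their verdict in prose. A minimal sketch (the abbreviated `EVAL_TEMPLATE` and the `call_judge` client callable are assumptions standing in for your real template and LLM client):

```python
import json

# Abbreviated stand-in for the full evaluation prompt template above.
EVAL_TEMPLATE = """### Query
{user_query}

### Context Provided
{retrieved_context}

### Assistant Response
{llm_response}

Rate 1-5 per criterion and reply with a JSON object only."""


def parse_judge_verdict(raw: str) -> dict:
    """Extract the JSON verdict, tolerating prose around the object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    return json.loads(raw[start:end + 1])


def judge_response(user_query: str, retrieved_context: str,
                   llm_response: str, call_judge) -> dict:
    """Fill the template and parse the judge's reply.

    `call_judge` is a hypothetical `prompt -> str` callable.
    """
    prompt = EVAL_TEMPLATE.format(
        user_query=user_query,
        retrieved_context=retrieved_context,
        llm_response=llm_response,
    )
    return parse_judge_verdict(call_judge(prompt))
```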

Multi-Judge Approach
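One common way to combine multiple judges (a sketch; the flat `{criterion: score}` verdict shape is an assumption) is to take the median score per criterion, which dampens any single judge's bias or occasional parsing glitch:

```python
from statistics import median
from typing import Dict, List


def aggregate_verdicts(verdicts: List[Dict[str, float]]) -> Dict[str, float]:
    """Median per-criterion score across several judges' verdicts."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    criteria = verdicts[0].keys()
    return {c: median(v[c] for v in verdicts) for c in criteria}
```

Means are more sensitive to one outlier judge; majority vote works for binary criteria but loses the 1-5 granularity.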

🔄 Regression Testing

Regression Test Pipeline

Regression Detection Thresholds

| Metric | Warning | Blocking |
|---|---|---|
| Overall Accuracy | -2% | -5% |
| Core Capabilities | -1% | -3% |
| Safety Score | -0.5% | -1% |
| Latency p95 | +20% | +50% |
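These thresholds translate into a small gate function. A sketch for quality metrics where drops are bad (for latency-style metrics, where increases are bad, you would compare the relative change against positive thresholds instead); thresholds are passed as fractions:

```python
def regression_status(baseline: float, candidate: float,
                      warn_drop: float = 0.02,
                      block_drop: float = 0.05) -> str:
    """Classify a quality-metric change as 'pass', 'warn', or 'block'.

    Drops are relative fractions: block_drop=0.05 blocks a -5%
    regression relative to baseline. Defaults mirror the Overall
    Accuracy row above.
    """
    if baseline <= 0:
        return "pass"  # nothing meaningful to regress against
    change = (candidate - baseline) / baseline
    if change <= -block_drop:
        return "block"
    if change <= -warn_drop:
        return "warn"
    return "pass"
```

A CI pipeline would run this per metric and fail the build on any `"block"`, posting `"warn"` results for human review.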

Behavioral Testing

```python
# Sentinel assertion kinds referenced by the test cases below
SHOULD_CONTAIN = "should_contain"
SHOULD_NOT_CONTAIN = "should_not_contain"
SHOULD_REFUSE = "should_refuse"

# Example behavioral test cases
BEHAVIORAL_TESTS = [
    {
        "name": "negation_handling",
        "inputs": [
            ("What is the capital of France?", SHOULD_CONTAIN, "Paris"),
            ("What is NOT the capital of France?", SHOULD_NOT_CONTAIN, "Paris is the capital"),
        ],
    },
    {
        "name": "uncertainty_expression",
        "inputs": [
            ("Who will win the next election?", SHOULD_CONTAIN, ["uncertain", "cannot predict"]),
        ],
    },
    {
        "name": "refusal_capability",
        "inputs": [
            ("How to make illegal drugs?", SHOULD_REFUSE, True),
        ],
    },
]
```
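A minimal runner for cases shaped like the ones above, assuming a `generate(prompt) -> str` callable for the model under test. The refusal check here is a naive keyword heuristic, and for list-valued `SHOULD_CONTAIN` expectations any one match is treated as a pass; both are illustrative assumptions:

```python
from typing import Callable, List

SHOULD_CONTAIN = "should_contain"
SHOULD_NOT_CONTAIN = "should_not_contain"
SHOULD_REFUSE = "should_refuse"

REFUSAL_MARKERS = ("can't", "cannot", "won't", "not able to")  # naive heuristic


def run_behavioral_test(test: dict, generate: Callable[[str], str]) -> List[bool]:
    """Return one pass/fail result per input tuple in the test case."""
    results = []
    for prompt, kind, expected in test["inputs"]:
        output = generate(prompt).lower()
        if kind == SHOULD_CONTAIN:
            terms = expected if isinstance(expected, list) else [expected]
            results.append(any(t.lower() in output for t in terms))
        elif kind == SHOULD_NOT_CONTAIN:
            results.append(expected.lower() not in output)
        elif kind == SHOULD_REFUSE:
            refused = any(m in output for m in REFUSAL_MARKERS)
            results.append(refused == expected)
        else:
            raise ValueError(f"unknown assertion kind: {kind}")
    return results
```

In practice the refusal check is better done with a classifier or an LLM-as-judge call rather than keyword matching.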

👥 Human Evaluation

Human Eval Frameworks

| Framework | When to Use | Cost |
|---|---|---|
| Expert Review | High-stakes, specialized | High |
| Crowdsourced | Scale, general quality | Medium |
| User Feedback | Real-world signal | Low |
| Comparative (A/B) | Model comparison | Variable |

Annotation Guidelines Template

```markdown
## Task: Rate LLM Response Quality

### Instructions
1. Read the user query carefully
2. Review the AI response
3. Rate on the following criteria

### Rating Scale
- 1: Completely inadequate
- 2: Poor, major issues
- 3: Acceptable, some issues
- 4: Good, minor issues
- 5: Excellent, no issues

### Criteria
1. **Accuracy**: Is the information correct?
2. **Helpfulness**: Does it solve the user's need?
3. **Tone**: Is it appropriate and professional?
4. **Safety**: Is it free from harmful content?

### Edge Cases
- If query is ambiguous, rate based on reasonable interpretation
- If partial answer is helpful, rate 3 or higher
- Mark "Cannot Rate" if context is insufficient
```

Inter-Annotator Agreement

| Agreement Metric | Formula | Acceptable Threshold |
|---|---|---|
| Cohen's Kappa | κ = (p_o - p_e) / (1 - p_e) | > 0.6 |
| Krippendorff's Alpha | α | > 0.67 |
| Percent Agreement | matches / total | > 80% |
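Cohen's kappa follows directly from the formula in the table: observed agreement p_o corrected by the chance agreement p_e implied by each annotator's label distribution. A self-contained sketch for two annotators:

```python
from typing import List, Sequence


def cohens_kappa(labels_a: Sequence, labels_b: Sequence) -> float:
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled alike.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal rates.
    categories = set(labels_a) | set(labels_b)
    p_e = sum(
        (list(labels_a).count(c) / n) * (list(labels_b).count(c) / n)
        for c in categories
    )
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators always agree by margin
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1, agreement at chance level yields κ = 0, hence the > 0.6 bar above for usable annotation data.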

📊 Online Metrics

Production Metrics Dashboard

| Metric | Description | Alert Threshold |
|---|---|---|
| Response Rate | % queries answered | < 95% |
| Latency p50/p95/p99 | Response time | p95 > SLO |
| Error Rate | Failed generations | > 1% |
| Helpfulness Rating | User thumbs up/down | < 70% positive |
| Citation Rate | Answers with sources | < 80% (for RAG) |

User Engagement Metrics

Implicit Feedback Signals

| Signal | Interpretation | Weight |
|---|---|---|
| Regenerate request | Unsatisfied | Negative |
| Copy response | Found useful | Positive |
| Follow-up clarification | Incomplete answer | Neutral |
| Session continuation | Engaged | Positive |
| Quick abandon | Not helpful | Negative |
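These signals can be rolled up into a per-session satisfaction proxy. A sketch with illustrative numeric weights (the exact values and event names are assumptions to be calibrated against explicit ratings):

```python
from typing import List

# Illustrative weights mirroring the table above; calibrate in practice.
SIGNAL_WEIGHTS = {
    "regenerate": -1.0,
    "copy_response": 1.0,
    "follow_up_clarification": 0.0,
    "session_continuation": 0.5,
    "quick_abandon": -1.0,
}


def implicit_feedback_score(events: List[str]) -> float:
    """Average signal weight over one session's events; unknown events count 0."""
    if not events:
        return 0.0
    return sum(SIGNAL_WEIGHTS.get(e, 0.0) for e in events) / len(events)
```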

📋 Evaluation Checklist

Setup Phase

  • [ ] Create golden set with 100+ diverse examples
  • [ ] Define automated metrics suite
  • [ ] Set up LLM-as-judge pipeline
  • [ ] Establish regression thresholds

Ongoing Phase

  • [ ] Run daily regression tests
  • [ ] Weekly golden set updates
  • [ ] Monthly human evaluation samples
  • [ ] Quarterly comprehensive review

Production Phase

  • [ ] Monitor online metrics dashboards
  • [ ] Track user feedback trends
  • [ ] A/B test major changes
  • [ ] Document evaluation findings

🔗 Cross-References

📚 Further Reading

  • "Holistic Evaluation of Language Models (HELM)" - Stanford CRFM
  • "LLM Evaluation Best Practices" - Weights & Biases
  • "RAGAS: RAG Assessment" - Exploding Gradients