LLM Evaluation & Benchmarking: Beyond RAG Metrics
An intensive training for ML Engineers, AI Architects, and Product Managers who need to measure and guarantee the quality of LLM systems in production — beyond RAG-specific metrics. You will learn to choose the right metric for your task type, design statistically valid prompt A/B tests, build a quality-aware model router, and implement a fully automated evaluation harness with CI/CD integration. Every module includes open-source, runnable code using LangChain, sentence-transformers, BERTScore, and Anthropic Claude.
Duration
2 days
Level
Advanced
Price
9.99 EUR/month (all courses included)
Max group
12 participants
What you will learn
- Explain why BLEU and ROUGE fail for generative AI and when BERTScore and semantic similarity are better choices
- Build a multi-dimensional scorecard combining automated metrics and LLM-as-judge scoring
- Design a statistically valid prompt A/B test — calculate required sample size and interpret p-values
- Implement a prompt version registry with regression detection that blocks bad deploys
- Benchmark three model tiers (Ollama, Haiku, Sonnet) on cost, latency P50/P95, and quality
- Build a quality-aware model router that routes queries to the cheapest sufficient model
- Integrate a custom evaluation harness into GitHub Actions CI/CD with quality gates
Course program
Module 1: LLM Evaluation Foundations: Metrics That Actually Matter
3h00
- Why BLEU and ROUGE fail for generative AI: paraphrases, synonyms, and length penalties
- BERTScore: contextual token matching with DeBERTa-XL — when to use it
- Semantic similarity with sentence-transformers: cosine distance as a quality proxy
- Choosing the right metric by task type: code, Q&A, summarization, translation, instructions
- Building a multi-dimensional scorecard: automated metrics + LLM-as-judge
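The cosine-similarity proxy from Module 1 boils down to a few lines. A minimal sketch of the math using NumPy — in the course the vectors come from a sentence-transformers model's `encode` call; the toy vectors and the model name in the comment are illustrative:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity in [-1, 1]; values near 1 mean the texts are close in meaning."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice the vectors are sentence embeddings, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice
#   ref, cand = model.encode([reference_text, candidate_text])
ref = np.array([0.2, 0.8, 0.1])      # toy "reference" embedding
cand = np.array([0.25, 0.75, 0.05])  # toy "candidate" embedding
print(round(cosine_similarity(ref, cand), 3))  # close to 1.0
```

A paraphrase with zero n-gram overlap can still score high here — exactly the case where BLEU and ROUGE break down.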
Module 2: Prompt A/B Testing: Systematic Comparison Frameworks
3h30
- Pairwise comparison vs. absolute scoring: why pairwise is 15-20% more reliable
- Position bias in LLM judges: detection and correction through randomization
- Statistical significance: binomial test, required sample size calculation
- Prompt version registry: content-hashing for immutable versioning
- Regression detection pipeline: block deploys when score drops > threshold
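The significance test for pairwise comparisons can be sketched with an exact binomial test in pure Python (no SciPy needed). Function names, the step size of 10, and the `max_n` guard are illustrative choices, not the course's reference implementation:

```python
from math import comb

def pairwise_pvalue(wins_a: int, total: int) -> float:
    """Two-sided exact binomial test: probability, under H0 that prompts
    A and B are tied (p = 0.5), of a result at least as extreme as wins_a."""
    pmf = [comb(total, k) * 0.5 ** total for k in range(total + 1)]
    observed = pmf[wins_a]
    return min(1.0, sum(p for p in pmf if p <= observed + 1e-12))

def required_sample_size(expected_win_rate: float, alpha: float = 0.05,
                         max_n: int = 5000) -> int:
    """Smallest n (searched in steps of 10) at which expected_win_rate * n
    wins would be significant at level alpha; max_n bounds the search."""
    for n in range(10, max_n + 1, 10):
        if pairwise_pvalue(round(expected_win_rate * n), n) < alpha:
            return n
    raise ValueError("effect too small to detect within max_n comparisons")

# Prompt A winning 70 of 100 pairwise comparisons is clearly significant;
# 52 of 100 is indistinguishable from a coin flip.
print(pairwise_pvalue(70, 100))
print(pairwise_pvalue(52, 100))
```

Note how quickly sample-size requirements grow as the true win rate approaches 50% — the reason the module insists on calculating n before running the test.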
Module 3: Cost, Latency, and Quality: The Production Trade-off Framework
3h30
- The three-tier model stack: Ollama (free) → Haiku ($0.80/M) → Sonnet ($3/M)
- Benchmarking framework: quality vs P50/P95 latency vs cost-per-query
- Quality-aware model router: route to cheapest model above quality floor
- Latency budget analysis: identify bottlenecks in multi-step AI pipelines
- Cost-per-interaction calculator: daily/monthly projections at scale
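The routing rule — cheapest model above the quality floor — fits in a few lines. A sketch under stated assumptions: the quality scores below are placeholders you would measure with your own benchmark harness, and the tier names/prices mirror the three-tier stack above:

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_mtok: float  # USD per million input tokens
    quality: float        # benchmark score in [0, 1] on your own eval set

# Placeholder quality scores -- measure these, don't copy them.
TIERS = [
    ModelTier("ollama-local", 0.00, 0.62),
    ModelTier("claude-haiku", 0.80, 0.78),
    ModelTier("claude-sonnet", 3.00, 0.91),
]

def route(quality_floor: float) -> ModelTier:
    """Return the cheapest tier whose measured quality clears the floor."""
    for tier in sorted(TIERS, key=lambda t: t.cost_per_mtok):
        if tier.quality >= quality_floor:
            return tier
    return TIERS[-1]  # no tier clears the floor: fall back to the strongest

print(route(0.75).name)  # -> claude-haiku (with these placeholder scores)
```

In production the floor would typically vary per query class, which is where the latency-budget and cost-per-interaction analysis in this module comes in.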
Module 4: Building Custom Evaluation Harnesses for Production
3h30
- LangChain built-in evaluators: QA, CRITERIA, LABELED_CRITERIA — when to use each
- Domain-specific rubric evaluators: customer support, code review, medical information
- CI/CD integration: GitHub Actions workflow that blocks PRs on regression
- pytest-compatible quality gates: test_quality_threshold + test_no_regression
- Time-series metrics dashboard with anomaly detection (z-score alerts)
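The z-score alerting behind the dashboard bullet can be sketched with the standard library alone; the threshold of 2.5 and the sample score series are illustrative:

```python
import statistics

def zscore_anomalies(scores: list[float], threshold: float = 2.5) -> list[int]:
    """Return indices of eval scores lying more than `threshold` standard
    deviations from the series mean -- candidates for an alert."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return []  # flat series: nothing to flag
    return [i for i, s in enumerate(scores) if abs(s - mean) / sd > threshold]

# Illustrative daily mean quality scores; day 6 is a sudden regression.
daily = [0.86, 0.85, 0.87, 0.86, 0.85, 0.86, 0.61, 0.86]
print(zscore_anomalies(daily))  # -> [6]
```

The same check slots naturally into a pytest quality gate, so CI fails the PR instead of (or in addition to) paging someone after deploy.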
Ready to get started?
9.99 EUR/month — All courses included, cancel anytime