Talki Academy

LLM Evaluation & Benchmarking: Beyond RAG Metrics

An intensive training for ML Engineers, AI Architects, and Product Managers who need to measure and guarantee the quality of LLM systems in production — beyond RAG-specific metrics. You will learn to choose the right metric for your task type, design statistically valid prompt A/B tests, build a quality-aware model router, and implement a fully automated evaluation harness with CI/CD integration. Every module includes open-source, runnable code using LangChain, sentence-transformers, BERTScore, and Anthropic Claude.

Duration
2 days
Level
Advanced
Price
9.99 EUR/month (all courses included)
Max group
12 participants

What you will learn

+ Explain why BLEU and ROUGE fail for generative AI and when BERTScore and semantic similarity are better choices
+ Build a multi-dimensional scorecard combining automated metrics and LLM-as-judge scoring
+ Design a statistically valid prompt A/B test — calculate required sample size and interpret p-values
+ Implement a prompt version registry with regression detection that blocks bad deploys
+ Benchmark three model tiers (Ollama, Haiku, Sonnet) on cost, latency P50/P95, and quality
+ Build a quality-aware model router that routes queries to the cheapest sufficient model
+ Integrate a custom evaluation harness into GitHub Actions CI/CD with quality gates

Course program

Module 1: LLM Evaluation Foundations: Metrics That Actually Matter

3h00
  • Why BLEU and ROUGE fail for generative AI: paraphrase, synonym, and length penalties
  • BERTScore: contextual token matching with DeBERTa-XL — when to use it
  • Semantic similarity with sentence-transformers: cosine distance as a quality proxy
  • Choosing the right metric by task type: code, Q&A, summarization, translation, instructions
  • Building a multi-dimensional scorecard: automated metrics + LLM-as-judge
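A taste of the core contrast in this module: the snippet below is a minimal sketch (stdlib only, toy sentences, and illustrative stand-in vectors in place of real sentence-transformers embeddings; the helper names `unigram_f1` and `cosine` are ours, not a library API) showing why exact n-gram overlap scores a perfect paraphrase near zero while embedding cosine similarity does not.

```python
import math

def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style unigram overlap F1: counts exact token matches only."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(cand.count(t), ref.count(t)) for t in set(cand))
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A paraphrase with no shared words scores zero on n-gram metrics...
ref = "the medication lowers blood pressure"
para = "this drug reduces hypertension"
print(unigram_f1(para, ref))  # 0.0 despite identical meaning

# ...while embedding similarity stays high (vectors below are illustrative
# placeholders for sentence-transformers output).
emb_ref = [0.80, 0.10, 0.55]
emb_para = [0.75, 0.15, 0.60]
print(cosine(emb_ref, emb_para))
```

In the course you replace the toy vectors with real model embeddings and fold both signals into a per-task scorecard.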

Module 2: Prompt A/B Testing: Systematic Comparison Frameworks

3h30
  • Pairwise comparison vs. absolute scoring: why pairwise is 15-20% more reliable
  • Position bias in LLM judges: detection and correction through randomization
  • Statistical significance: binomial test, required sample size calculation
  • Prompt version registry: content-hashing for immutable versioning
  • Regression detection pipeline: block deploys when score drops > threshold
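The statistical machinery above fits in a few lines of stdlib Python. This is a sketch under simplifying assumptions (exact two-sided binomial test against a 50/50 null, normal-approximation sample size at alpha=0.05 and 80% power); the function names are ours, and the registry hash mirrors the content-hashing idea, not a specific library.

```python
import hashlib
import math

def binomial_p_value(wins: int, n: int) -> float:
    """Two-sided exact binomial test against the null p=0.5
    (i.e. neither prompt variant is actually better)."""
    k = max(wins, n - wins)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def required_sample_size(effect: float) -> int:
    """Rough pairs needed to detect a win rate of 0.5 + effect
    at alpha=0.05 (z=1.96), power=0.80 (z=0.84)."""
    p = 0.5 + effect
    n = ((1.96 * 0.5 + 0.84 * math.sqrt(p * (1 - p))) / effect) ** 2
    return math.ceil(n)

def prompt_version_id(prompt: str) -> str:
    """Content-hash a prompt template for immutable registry versioning."""
    return hashlib.sha256(prompt.encode()).hexdigest()[:12]

print(binomial_p_value(70, 100))   # well below 0.05: significant
print(binomial_p_value(55, 100))   # not significant at n=100
print(required_sample_size(0.10))  # pairs needed to detect a 60/40 split
```

Note how 55/100 wins is not enough evidence at this sample size — exactly why the module insists on computing n before running the A/B test, and on randomizing response order to neutralize judge position bias.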

Module 3: Cost, Latency, and Quality: The Production Trade-off Framework

3h30
  • The three-tier model stack: Ollama (free) → Haiku ($0.80/M) → Sonnet ($3/M)
  • Benchmarking framework: quality vs P50/P95 latency vs cost-per-query
  • Quality-aware model router: route to cheapest model above quality floor
  • Latency budget analysis: identify bottlenecks in multi-step AI pipelines
  • Cost-per-interaction calculator: daily/monthly projections at scale
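The routing rule itself is simple once the benchmarking is done. Below is a minimal sketch; the tier names echo the course's three-tier stack, but the quality scores are placeholders for numbers you would measure with your own benchmark suite, and `route` is our illustrative helper, not a LangChain API.

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_mtok: float  # USD per million input tokens
    quality: float        # benchmark score in [0, 1], measured offline

# Illustrative stack; quality values are placeholder measurements.
TIERS = [
    ModelTier("ollama-local", 0.00, 0.71),
    ModelTier("claude-haiku", 0.80, 0.84),
    ModelTier("claude-sonnet", 3.00, 0.93),
]

def route(quality_floor: float) -> ModelTier:
    """Pick the cheapest tier whose measured quality clears the floor;
    fall back to the strongest tier if none does."""
    eligible = [t for t in TIERS if t.quality >= quality_floor]
    if not eligible:
        return max(TIERS, key=lambda t: t.quality)
    return min(eligible, key=lambda t: t.cost_per_mtok)

print(route(0.70).name)  # free local tier is sufficient
print(route(0.80).name)  # needs the mid tier
print(route(0.99).name)  # nothing clears the floor -> best available
```

In production the quality floor typically varies per query class (e.g. FAQ vs. legal drafting), which is where the latency-budget and cost-per-interaction analysis from this module comes in.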

Module 4: Building Custom Evaluation Harnesses for Production

3h30
  • LangChain built-in evaluators: QA, CRITERIA, LABELED_CRITERIA — when to use each
  • Domain-specific rubric evaluators: customer support, code review, medical information
  • CI/CD integration: GitHub Actions workflow that blocks PRs on regression
  • pytest-compatible quality gates: test_quality_threshold + test_no_regression
  • Time-series metrics dashboard with anomaly detection (z-score alerts)
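To make the quality gates concrete, here is a stdlib-only sketch of the two pytest-style gates and the z-score alert named above. The scores, thresholds, and function bodies are illustrative assumptions; in the course these read from the real harness output rather than hard-coded lists.

```python
import statistics

# Hypothetical scores: the current build's eval run, and mean scores
# of the last released builds (the baseline time series).
CURRENT_SCORES = [0.82, 0.88, 0.79, 0.91, 0.85]
BASELINE_MEANS = [0.83, 0.84, 0.86, 0.85, 0.84, 0.85, 0.86]

QUALITY_FLOOR = 0.80   # absolute minimum acceptable mean score
MAX_REGRESSION = 0.05  # allowed drop vs. the last released baseline
Z_ALERT = 3.0          # z-score threshold for dashboard alerts

def test_quality_threshold():
    """Gate 1: the current build's mean score must clear the floor."""
    assert statistics.mean(CURRENT_SCORES) >= QUALITY_FLOOR

def test_no_regression():
    """Gate 2: the mean must not drop more than MAX_REGRESSION
    below the most recent released baseline."""
    assert statistics.mean(CURRENT_SCORES) >= BASELINE_MEANS[-1] - MAX_REGRESSION

def is_anomaly(new_mean: float) -> bool:
    """Dashboard-side z-score alert over the baseline time series."""
    mu = statistics.mean(BASELINE_MEANS)
    sigma = statistics.stdev(BASELINE_MEANS)
    return abs(new_mean - mu) / sigma > Z_ALERT

test_quality_threshold()
test_no_regression()
print(is_anomaly(statistics.mean(CURRENT_SCORES)))
```

Because the gates are plain pytest-discoverable functions, wiring them into a GitHub Actions workflow is just `pytest` in a job step — a failing gate fails the check and blocks the PR.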

Ready to get started?

9.99 EUR/month — All courses included, cancel anytime

Request a quote · View all courses