LLM Evaluation & Benchmarking: Beyond RAG Metrics
An intensive training for ML Engineers, AI Architects, and Product Managers who need to measure and guarantee the quality of LLM systems in production — beyond RAG-specific metrics. You will learn to choose the right metric for your task type, design statistically valid prompt A/B tests, build a quality-aware model router, and implement a fully automated evaluation harness with CI/CD integration. Every module includes open-source, runnable code using LangChain, sentence-transformers, BERTScore, and Anthropic Claude.
Duration
2 days
Level
Advanced
Price
9.99 EUR/month (all courses included)
Max group
12 participants
What you will learn
- Explain why BLEU and ROUGE fail for generative AI and when BERTScore and semantic similarity are better choices
- Build a multi-dimensional scorecard combining automated metrics and LLM-as-judge scoring
- Design a statistically valid prompt A/B test — calculate required sample size and interpret p-values
- Implement a prompt version registry with regression detection that blocks bad deploys
- Benchmark three model tiers (Ollama, Haiku, Sonnet) on cost, latency P50/P95, and quality
- Build a quality-aware model router that routes queries to the cheapest sufficient model
- Integrate a custom evaluation harness into GitHub Actions CI/CD with quality gates
Course program
Module 1: LLM Evaluation Foundations: Metrics That Actually Matter
3h00
- Why BLEU and ROUGE fail for generative AI: paraphrases, synonyms, and length penalties
- BERTScore: contextual token matching with DeBERTa-XL — when to use it
- Semantic similarity with sentence-transformers: cosine distance as a quality proxy
- Choosing the right metric by task type: code, Q&A, summarization, translation, instructions
- Building a multi-dimensional scorecard: automated metrics + LLM-as-judge
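The cosine-similarity proxy from Module 1 boils down to a few lines. A minimal sketch of the math using NumPy — in the course the vectors come from a sentence-transformers model's `encode` call; the toy vectors and the model name in the comment are illustrative:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity in [-1, 1]; values near 1 mean the texts are close in meaning."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice the vectors are sentence embeddings, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice
#   ref, cand = model.encode([reference_text, candidate_text])
ref = np.array([0.2, 0.8, 0.1])      # toy "reference" embedding
cand = np.array([0.25, 0.75, 0.05])  # toy "candidate" embedding
print(round(cosine_similarity(ref, cand), 3))  # close to 1.0
```

A paraphrase with zero n-gram overlap can still score high here — exactly the case where BLEU and ROUGE break down.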
Module 2: Prompt A/B Testing: Systematic Comparison Frameworks
3h30
- Pairwise comparison vs. absolute scoring: why pairwise is 15-20% more reliable
- Position bias in LLM judges: detection and correction through randomization
- Statistical significance: binomial test, required sample size calculation
- Prompt version registry: content-hashing for immutable versioning
- Regression detection pipeline: block deploys when score drops > threshold
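The significance test for pairwise comparisons can be sketched with an exact binomial test in pure Python (no SciPy needed). Function names, the step size of 10, and the `max_n` guard are illustrative choices, not the course's reference implementation:

```python
from math import comb

def pairwise_pvalue(wins_a: int, total: int) -> float:
    """Two-sided exact binomial test: probability, under H0 that prompts
    A and B are tied (p = 0.5), of a result at least as extreme as wins_a."""
    pmf = [comb(total, k) * 0.5 ** total for k in range(total + 1)]
    observed = pmf[wins_a]
    return min(1.0, sum(p for p in pmf if p <= observed + 1e-12))

def required_sample_size(expected_win_rate: float, alpha: float = 0.05,
                         max_n: int = 5000) -> int:
    """Smallest n (searched in steps of 10) at which expected_win_rate * n
    wins would be significant at level alpha; max_n bounds the search."""
    for n in range(10, max_n + 1, 10):
        if pairwise_pvalue(round(expected_win_rate * n), n) < alpha:
            return n
    raise ValueError("effect too small to detect within max_n comparisons")

# Prompt A winning 70 of 100 pairwise comparisons is clearly significant;
# 52 of 100 is indistinguishable from a coin flip.
print(pairwise_pvalue(70, 100))
print(pairwise_pvalue(52, 100))
```

Note how quickly sample-size requirements grow as the true win rate approaches 50% — the reason the module insists on calculating n before running the test.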
Module 3: Cost, Latency, and Quality: The Production Trade-off Framework
3h30
- The three-tier model stack: Ollama (free) → Haiku ($0.80/M) → Sonnet ($3/M)
- Benchmarking framework: quality vs P50/P95 latency vs cost-per-query
- Quality-aware model router: route to cheapest model above quality floor
- Latency budget analysis: identify bottlenecks in multi-step AI pipelines
- Cost-per-interaction calculator: daily/monthly projections at scale
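The routing rule — cheapest model above the quality floor — fits in a few lines. A sketch under stated assumptions: the quality scores below are placeholders you would measure with your own benchmark harness, and the tier names/prices mirror the three-tier stack above:

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_mtok: float  # USD per million input tokens
    quality: float        # benchmark score in [0, 1] on your own eval set

# Placeholder quality scores -- measure these, don't copy them.
TIERS = [
    ModelTier("ollama-local", 0.00, 0.62),
    ModelTier("claude-haiku", 0.80, 0.78),
    ModelTier("claude-sonnet", 3.00, 0.91),
]

def route(quality_floor: float) -> ModelTier:
    """Return the cheapest tier whose measured quality clears the floor."""
    for tier in sorted(TIERS, key=lambda t: t.cost_per_mtok):
        if tier.quality >= quality_floor:
            return tier
    return TIERS[-1]  # no tier clears the floor: fall back to the strongest

print(route(0.75).name)  # -> claude-haiku (with these placeholder scores)
```

In production the floor would typically vary per query class, which is where the latency-budget and cost-per-interaction analysis in this module comes in.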
Module 4: Building Custom Evaluation Harnesses for Production
3h30
- LangChain built-in evaluators: QA, CRITERIA, LABELED_CRITERIA — when to use each
- Domain-specific rubric evaluators: customer support, code review, medical information
- CI/CD integration: GitHub Actions workflow that blocks PRs on regression
- pytest-compatible quality gates: test_quality_threshold + test_no_regression
- Time-series metrics dashboard with anomaly detection (z-score alerts)
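The z-score alerting behind the dashboard bullet can be sketched with the standard library alone; the threshold of 2.5 and the sample score series are illustrative:

```python
import statistics

def zscore_anomalies(scores: list[float], threshold: float = 2.5) -> list[int]:
    """Return indices of eval scores lying more than `threshold` standard
    deviations from the series mean -- candidates for an alert."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return []  # flat series: nothing to flag
    return [i for i, s in enumerate(scores) if abs(s - mean) / sd > threshold]

# Illustrative daily mean quality scores; day 6 is a sudden regression.
daily = [0.86, 0.85, 0.87, 0.86, 0.85, 0.86, 0.61, 0.86]
print(zscore_anomalies(daily))  # -> [6]
```

The same check slots naturally into a pytest quality gate, so CI fails the PR instead of (or in addition to) paging someone after deploy.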
Ready to get started?
9.99 EUR/month — All courses included, cancel anytime