RAG Evaluation: Metrics, Benchmarks & Production
An intensive technical training for ML Engineers, AI Developers, and MLOps teams who build RAG systems and need to ensure their reliability in production. You'll learn to measure Faithfulness, Relevance, and Context Recall using open-source tools (Ragas, TruLens, DeepEval), automate evaluation in CI/CD pipelines, and set up continuous monitoring. Real-world case study: an audit of a RAG system that answered correctly 61% of the time — improved to 89% in 3 weeks through a structured evaluation framework.
Duration: 2 days
Level: Advanced
Price: 9.99 EUR/month (all courses included)
Max group: 12 participants
What you will learn
+ Understand and compute the 7 core RAG metrics: Faithfulness, Answer Relevance, Context Recall, Context Precision, Context Relevance, Answer Correctness, Answer Similarity
+ Instrument an existing RAG pipeline with Ragas and TruLens in under one hour
+ Build a gold-standard evaluation dataset using synthetic generation (TestsetGenerator)
+ Automate RAG evaluation in a CI/CD pipeline with quality gates that block regressions
+ Detect production quality degradation in real time with alerting
+ Diagnose root causes of RAG failures (retrieval vs. generation vs. chunking)
+ Compare Ragas, TruLens, and DeepEval and choose the right tool for your stack
Course program
Module 1: RAG Metrics: Theory and Implementation
3h30
- The 7 core metrics: definitions, formulas, and interpretation
- Faithfulness vs. Answer Relevance: why confusing them is dangerous
- Context Recall and Context Precision: measuring retriever quality
- Implementation with Ragas: evaluate a pipeline in 50 lines of Python
- Workshop: evaluate a broken RAG system and identify root problems
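The core idea behind two of Module 1's metrics can be sketched in a few lines. This is a toy illustration, not the Ragas implementation: in Ragas an LLM judge extracts and verifies claims, whereas here the supported/relevant judgments are given as booleans, and the Context Precision shown is an order-agnostic simplification (Ragas' version is rank-weighted).

```python
# Toy sketch of two RAG metrics. Assumes claim-support and chunk-relevance
# judgments already exist (in Ragas, an LLM judge produces them).

def faithfulness(claims_supported: list[bool]) -> float:
    """Fraction of the answer's claims that are grounded in the retrieved context."""
    return sum(claims_supported) / len(claims_supported)

def context_precision(chunk_is_relevant: list[bool]) -> float:
    """Fraction of retrieved chunks that are relevant (order-agnostic toy version)."""
    return sum(chunk_is_relevant) / len(chunk_is_relevant)

# Example: 3 of 4 answer claims grounded, 2 of 5 retrieved chunks relevant.
print(faithfulness([True, True, True, False]))               # 0.75
print(context_precision([True, False, True, False, False]))  # 0.4
```

A high Faithfulness with a low Context Precision is a typical "broken RAG" signature from the workshop: the generator stays grounded, but the retriever is flooding the context with noise.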
Module 2: Benchmarking and Evaluation Datasets
3h30
- Building a gold-standard dataset: manual vs. synthetic approaches
- TestsetGenerator (Ragas): automatic generation of adversarial questions
- Public benchmarks: BEIR, RAGAS benchmark, TruthfulQA — when to use each
- Sampling strategies: covering edge cases and unanswerable questions
- Workshop: create a 200-question dataset for your business domain
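The sampling strategy from Module 2 can be sketched as a stratified draw over question categories. Everything here is illustrative: the `POOL` of questions and the quota numbers are made up, and in practice Ragas' TestsetGenerator would synthesize the candidate questions from your documents.

```python
import random

# Hypothetical candidate pool; in practice these would be synthesized
# from your corpus (e.g., with Ragas' TestsetGenerator).
POOL = {
    "simple":       [f"simple-{i}" for i in range(100)],
    "multi_hop":    [f"multihop-{i}" for i in range(60)],
    "edge_case":    [f"edge-{i}" for i in range(30)],
    "unanswerable": [f"unanswerable-{i}" for i in range(30)],
}

def build_testset(quota: dict[str, int], seed: int = 42) -> list[tuple[str, str]]:
    """Stratified sample: (category, question) pairs matching the quota exactly."""
    rng = random.Random(seed)  # fixed seed so the gold set is reproducible
    out: list[tuple[str, str]] = []
    for category, n in quota.items():
        out += [(category, q) for q in rng.sample(POOL[category], n)]
    return out

# A 200-question set that deliberately over-samples hard and unanswerable cases.
testset = build_testset({"simple": 100, "multi_hop": 50,
                         "edge_case": 30, "unanswerable": 20})
print(len(testset))  # 200
```

Fixing the quota per category, rather than sampling uniformly, is what guarantees that edge cases and unanswerable questions are represented even when they are rare in the corpus.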
Module 3: Production Monitoring and Alerting
3h30
- RAG monitoring architecture: LangFuse, TruLens, Phoenix (Arize)
- Operational metrics: latency p50/p95, error rate, cost per request
- Degradation detection: quality drift, embedding distribution shift
- Prometheus/Grafana alerts on RAG metrics (e.g., Faithfulness < 0.80)
- Workshop: instrument a LangChain pipeline and build a Grafana dashboard
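The alerting logic from Module 3 can be sketched as an in-process check over one monitoring window. This is a toy: the thresholds (0.80 Faithfulness floor, 2000 ms p95 ceiling) are illustrative, the percentile is a simple nearest-rank approximation, and in production these would be Prometheus alert rules evaluated over exported metrics, not Python code.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile (Prometheus approximates this with histogram buckets)."""
    vals = sorted(values)
    idx = min(int(len(vals) * p / 100), len(vals) - 1)
    return vals[idx]

def check_alerts(latencies_ms: list[float], faithfulness_scores: list[float],
                 faithfulness_floor: float = 0.80,
                 p95_ceiling_ms: float = 2000) -> list[str]:
    """Return the names of the alerts that fire for one monitoring window."""
    alerts = []
    if percentile(latencies_ms, 95) > p95_ceiling_ms:
        alerts.append("latency_p95")
    if sum(faithfulness_scores) / len(faithfulness_scores) < faithfulness_floor:
        alerts.append("faithfulness")
    return alerts

# One window: healthy latency, degraded faithfulness.
print(check_alerts(list(range(100)), [0.7] * 10))  # ['faithfulness']
```

Separating operational alerts (latency, errors, cost) from quality alerts (Faithfulness drift) matters in practice: they have different owners and different remediation paths.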
Module 4: CI/CD for RAG Systems and Error Diagnosis
3h30
- Quality gates in GitHub Actions: block a PR if Faithfulness < 0.85
- RAG failure diagnosis: decision tree for retrieval / generation / chunking issues
- A/B testing RAG configurations: chunk size, overlap, embedding models
- DeepEval vs. Ragas vs. TruLens: comparison and selection criteria
- Final workshop: complete CI/CD pipeline with automated evaluation and rollback
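The quality-gate step from Module 4 reduces to a small pure function. A minimal sketch, assuming hypothetical gate thresholds (the 0.85 Faithfulness floor matches the module's example; the 0.80 Answer Relevancy floor is made up); in CI, the `scores` dict would come from an evaluation run on the PR branch, and a non-empty result would trigger `sys.exit(1)` to fail the GitHub Actions job.

```python
# Hypothetical per-metric floors; fail the build when any metric drops below its gate.
GATES = {"faithfulness": 0.85, "answer_relevancy": 0.80}

def gate(scores: dict[str, float], gates: dict[str, float] = GATES) -> list[str]:
    """Return the names of the metrics that fail their quality gate."""
    return [m for m, floor in gates.items() if scores.get(m, 0.0) < floor]

# Example: a PR whose Faithfulness regressed below the 0.85 floor.
print(gate({"faithfulness": 0.82, "answer_relevancy": 0.91}))  # ['faithfulness']
```

Treating a missing metric as 0.0 (`scores.get(m, 0.0)`) is a deliberate fail-closed choice: a broken evaluation run blocks the merge instead of silently passing.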
Ready to get started?
9.99 EUR/month — All courses included, cancel anytime