Talki Academy

RAG Evaluation: Metrics, Benchmarks & Production

An intensive technical training for ML Engineers, AI Developers, and MLOps teams who build RAG systems and need to ensure their reliability in production. You'll learn to measure Faithfulness, Relevance, and Context Recall with open-source tools (Ragas, TruLens, DeepEval), automate evaluation in CI/CD pipelines, and set up continuous monitoring. Real-world case study: an audit of a RAG system that answered correctly only 61% of the time, improved to 89% in three weeks through a structured evaluation framework.

Duration
2 days
Level
Advanced
Price
9.99 EUR/month (all courses included)
Max group
12 participants

What you will learn

  • Understand and compute the 7 core RAG metrics: Faithfulness, Answer Relevance, Context Recall, Context Precision, Context Relevance, Answer Correctness, Answer Similarity
  • Instrument an existing RAG pipeline with Ragas and TruLens in under one hour
  • Build a gold-standard evaluation dataset using synthetic generation (TestsetGenerator)
  • Automate RAG evaluation in a CI/CD pipeline with quality gates that block regressions
  • Detect production quality degradation in real time with alerting
  • Diagnose root causes of RAG failures (retrieval vs. generation vs. chunking)
  • Compare Ragas, TruLens, and DeepEval and choose the right tool for your stack

Course program

Module 1: RAG Metrics: Theory and Implementation

3h30
  • The 7 core metrics: definitions, formulas, and interpretation
  • Faithfulness vs. Answer Relevance: why confusing them is dangerous
  • Context Recall and Context Precision: measuring retriever quality
  • Implementation with Ragas: evaluate a pipeline in 50 lines of Python
  • Workshop: evaluate a broken RAG system and identify root problems
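To give a feel for what the metrics above measure, here is a simplified, self-contained sketch of the underlying formulas. Note this is an illustration only, not the Ragas implementation: in Ragas, an LLM judge extracts and verifies claims, whereas here claim sets are passed in directly.

```python
def faithfulness(answer_claims, supported_claims):
    """Fraction of claims made in the answer that the retrieved context supports."""
    if not answer_claims:
        return 0.0
    hits = sum(1 for c in answer_claims if c in supported_claims)
    return hits / len(answer_claims)

def context_recall(ground_truth_claims, context_claims):
    """Fraction of ground-truth claims that the retrieved context covers."""
    if not ground_truth_claims:
        return 0.0
    covered = sum(1 for c in ground_truth_claims if c in context_claims)
    return covered / len(ground_truth_claims)

def context_precision(retrieved_chunks, relevant_chunks):
    """Rank-aware precision: mean of precision@k at each position
    where a relevant chunk appears (rewards ranking relevant chunks high)."""
    precisions, hits = [], 0
    for k, chunk in enumerate(retrieved_chunks, start=1):
        if chunk in relevant_chunks:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

The split also shows why confusing Faithfulness with Answer Relevance is dangerous: an answer can be fully faithful to irrelevant context, or relevant but unsupported, and only measuring both separates retrieval failures from generation failures.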

Module 2: Benchmarking and Evaluation Datasets

3h30
  • Building a gold-standard dataset: manual vs. synthetic approaches
  • TestsetGenerator (Ragas): automatic generation of adversarial questions
  • Public benchmarks: BEIR, RAGAS benchmark, TruthfulQA — when to use each
  • Sampling strategies: covering edge cases and unanswerable questions
  • Workshop: create a 200-question dataset for your business domain
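The sampling strategies bullet can be sketched as a small stratified sampler. This is a hypothetical helper (not part of Ragas or its TestsetGenerator): it enforces minimum quotas for hard categories such as edge cases and unanswerable questions before filling the rest of the testset at random.

```python
import random

def sample_testset(questions, n_total, quotas, seed=42):
    """Stratified sampling for an evaluation dataset.
    `questions` is a list of dicts with 'text' and 'category' keys;
    `quotas` maps category -> minimum number of questions to include."""
    rng = random.Random(seed)  # fixed seed keeps the testset reproducible
    by_category = {}
    for q in questions:
        by_category.setdefault(q["category"], []).append(q)
    selected = []
    # First satisfy each quota (e.g. edge cases, unanswerable questions).
    for category, quota in quotas.items():
        pool = by_category.get(category, [])
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    # Then fill the remaining slots from the rest of the pool.
    remaining = [q for q in questions if q not in selected]
    n_fill = min(len(remaining), max(0, n_total - len(selected)))
    selected.extend(rng.sample(remaining, n_fill))
    return selected
```

Without a quota step like this, uniform sampling over a large corpus tends to under-represent exactly the questions that break RAG systems in production.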

Module 3: Production Monitoring and Alerting

3h30
  • RAG monitoring architecture: LangFuse, TruLens, Phoenix (Arize)
  • Operational metrics: latency p50/p95, error rate, cost per request
  • Degradation detection: quality drift, embedding distribution shift
  • Prometheus/Grafana alerts on RAG metrics (e.g., Faithfulness < 0.80)
  • Workshop: instrument a LangChain pipeline and build a Grafana dashboard
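The degradation-detection idea above can be sketched in a few lines. This `FaithfulnessMonitor` class is an illustrative stand-in (not a LangFuse, TruLens, or Phoenix API): in production the same logic typically lives in a Prometheus alert rule over exported metrics, as in the Faithfulness < 0.80 example.

```python
from collections import deque

class FaithfulnessMonitor:
    """Rolling-window alert: fires when the mean faithfulness of the last
    `window` scored requests drops below `threshold`."""

    def __init__(self, window=50, threshold=0.80, min_samples=10):
        self.scores = deque(maxlen=window)  # old scores age out automatically
        self.threshold = threshold
        self.min_samples = min_samples      # avoid alerting on tiny samples

    def record(self, score):
        """Record one per-request faithfulness score; return alert state."""
        self.scores.append(score)
        return self.alerting()

    def alerting(self):
        if len(self.scores) < self.min_samples:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold
```

A rolling mean with a minimum sample count is a deliberately simple drift signal; the course's Grafana workshop layers percentiles and embedding-distribution checks on top of the same pattern.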

Module 4: CI/CD for RAG Systems and Error Diagnosis

3h30
  • Quality gates in GitHub Actions: block a PR if Faithfulness < 0.85
  • RAG failure diagnosis: decision tree for retrieval / generation / chunking issues
  • A/B testing RAG configurations: chunk size, overlap, embedding models
  • DeepEval vs. Ragas vs. TruLens: comparison and selection criteria
  • Final workshop: complete CI/CD pipeline with automated evaluation and rollback
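The quality-gate pattern from the module above can be sketched as a small check script. The `quality_gate` function and its thresholds are illustrative assumptions, not a DeepEval/Ragas API: a metric fails either by dropping below an absolute floor (e.g. Faithfulness < 0.85) or by regressing against the baseline recorded on the main branch.

```python
def quality_gate(metrics, baseline, abs_floor=None, max_regression=0.02):
    """Return a list of human-readable failure messages (empty = gate passes).
    `metrics` and `baseline` map metric name -> score in [0, 1]."""
    if abs_floor is None:
        abs_floor = {"faithfulness": 0.85}  # hard floor, independent of baseline
    failures = []
    for name, value in metrics.items():
        floor = abs_floor.get(name)
        if floor is not None and value < floor:
            failures.append(f"{name}={value:.3f} is below the floor {floor:.2f}")
        base = baseline.get(name)
        if base is not None and base - value > max_regression:
            failures.append(f"{name} regressed {base:.3f} -> {value:.3f}")
    return failures
```

In a GitHub Actions job, the evaluation step would write its scores to a file, this check would run on them, and the process would end with `sys.exit(1 if failures else 0)` so that a non-empty failure list blocks the PR.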

Ready to get started?

9.99 EUR/month — All courses included, cancel anytime

Request a quote · View all courses