Why RAG Evaluation Is Academic Bullshit

RAG evaluation works great until Karen from accounting shows up at 2pm with "cant login help me urgent!!!". Then everything goes to shit because she doesn't type like your beautiful test dataset.

[Image: Production monitoring dashboard]

Your evaluation dataset has questions like "What is the company's refund policy for digital products?"

Karen types "cant get mony back wtf???" and Dave from sales sends "refund pls broken app im on a call" at 7:23am on a Tuesday.

RAGAS faithfulness scores tell you if the AI accurately summarizes retrieved text. They don't tell you if that text answers the user's actual fucking question. I've watched systems get 0.9 faithfulness while users rage quit because they got perfect answers to questions they never asked.

What Actually Breaks RAG Systems

Nobody tests with actual user behavior. Evaluation datasets use perfect grammar. Users type like they're ordering pizza at 2am after three beers with autocorrect disabled.

Example that made me want to quit:

  • Evaluation: "How can I troubleshoot authentication issues with the API?"
  • Production: "api thingy broken cant login help???" (sent from CEO's personal Slack at 7am Sunday)

Standard metrics are fucking useless. RAGAS faithfulness measures if AI summarizes text accurately. It doesn't measure if that text is relevant.

I've seen this disaster multiple times:

  • User asks: "how much does premium cost"
  • System retrieves: Enterprise security documentation
  • System responds: Perfect summary of security features
  • Faithfulness score: 0.92 (great!)
  • User satisfaction: Zero (completely wrong answer)

The system gets rewarded for being accurately wrong.
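
If you only take one thing from this: score relevancy right next to faithfulness and go read the disagreements. A minimal sketch, assuming you've already exported per-question scores to a CSV (the file name is made up; the column names match what RAGAS writes out via to_pandas()):

import pandas as pd

# Per-question scores exported from an eval run (hypothetical file name)
results = pd.read_csv("eval_results.csv")

# "Accurately wrong": faithful summaries of documents that don't answer the question
accurately_wrong = results[
    (results["faithfulness"] >= 0.8) & (results["answer_relevancy"] < 0.5)
]

print(f"{len(accurately_wrong)} answers are faithful to the wrong documents")
print(accurately_wrong[["question", "answer"]].head())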

Production documents are a fucking nightmare. Dev testing uses pristine PDFs and clean markdown. Production hits you with:

  • PDFs where tables turn into garbled text soup that looks like a cat walked on the keyboard
  • Word docs with encoding that turns quotes into question marks (thanks Windows-1252)
  • Web scraping that grabs navigation menus instead of content
  • Excel "CSV" exports that explode on the first comma in a cell

I burned 6 hours debugging why our system started responding "Click here for more information" to every billing question. Turned out our web scraper was grabbing footer links instead of actual content because some frontend dev updated CSS selectors during a "quick design refresh" and nobody told the backend team. Lost my entire weekend to that trainwreck.

Three Things That Actually Matter in Production

Users type like shit. Your evaluation uses "What are the system requirements for the enterprise plan?" Users type "requirements???" and expect it to work.

I learned this the expensive way when our support RAG started serving up installation guides for "billing" questions, and three customers canceled thinking we didn't have billing support. Turns out nobody tested with one-word queries because our evaluation dataset was too fucking polite. Oops.

Test with (a quick sketch for generating these follows the list):

  • Typos and misspellings ("seperate" not "separate")
  • One-word queries ("billing", "refund", "broken")
  • Drunk user language ("thing no work why")
  • Questions that assume context ("how much?" without saying what)
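
One cheap way to get these is to mangle the polite questions you already have into the kind of garbage people actually send. A throwaway sketch - the mangling rules are mine, not from any framework, so tune them to whatever your users actually do:

import random

def mangle(question: str) -> list[str]:
    """Turn one polite eval question into messier variants users actually type."""
    words = question.rstrip("?").split()
    return [
        question.lower().rstrip("?"),                                # no caps, no punctuation
        words[-1].lower(),                                           # one-word query
        " ".join(words[:3]).lower() + "???",                         # cut off mid-thought, punctuation spam
        " ".join(random.sample(words, min(3, len(words)))).lower(),  # word salad
    ]

print(mangle("What are the system requirements for the enterprise plan?"))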

Most systems hallucinate instead of admitting ignorance. Retrieval breaks constantly:

  • Document chunking splits sentences in the worst possible places
  • Embeddings match on random words instead of meaning
  • Vector databases return garbage when they can't find good matches

I'd rather have a system that says "I don't know" than one that confidently explains how to cancel a subscription when the user asked about billing.

Users prefer honest ignorance over confident bullshit.
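
The cheapest guardrail I know is a retrieval-score floor: if nothing comes back above some threshold, say so instead of generating. A minimal sketch, assuming a retriever that hands back (document, similarity score) pairs - the method names and the 0.75 cutoff are placeholders, tune them on your own data:

def answer_or_admit(query, retriever, llm, min_score=0.75):
    """Refuse to generate when retrieval clearly found nothing useful."""
    hits = retriever.search(query, top_k=5)  # hypothetical: returns [(doc, score), ...]
    good = [doc for doc, score in hits if score >= min_score]

    if not good:
        return ("I couldn't find anything in the docs that answers this. "
                "Can you rephrase, or should I route you to a human?")

    context = "\n\n".join(doc.text for doc in good)
    return llm.generate(f"Answer using only this context:\n{context}\n\nQuestion: {query}")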

You need debugging or you're fucked when things break. Users will report "the AI is broken" and you need to figure out:

  • What documents got retrieved (probably the wrong ones)
  • Why the embeddings thought those were relevant
  • What the hell the LLM was thinking
  • How much this failure is costing you

Most frameworks give you a score and leave you to figure out why it sucks. I've debugged RAG failures while the CEO is breathing down my neck at midnight - you want actual logs showing what broke, not philosophical metrics about "semantic similarity".
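
The boring fix is logging a full trace per request so you can replay the failure later instead of guessing. A minimal sketch of what I mean - the field names are my own convention, wire it into whatever logging you already have:

import json
import time
import uuid
from datetime import datetime, timezone

def log_rag_trace(query, retrieved, response, started_at, path="rag_traces.jsonl"):
    """One JSON line per request: enough to answer 'what got retrieved, and why?'"""
    # retrieved: [(doc_id, score, text), ...] straight from your retriever
    trace = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "retrieved": [  # doc ids + scores so you can see why it picked them
            {"doc_id": doc_id, "score": round(score, 4), "snippet": text[:200]}
            for doc_id, score, text in retrieved
        ],
        "response": response,
        "latency_ms": int((time.time() - started_at) * 1000),
    }
    with open(path, "a") as f:
        f.write(json.dumps(trace) + "\n")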

Framework Reality Check

Every framework sucks at something important:

  • RAGAS: Great docs, expensive as hell, slow evaluation
  • DeepEval: Fast but docs are shit, you'll be reading source code
  • TruLens: Comprehensive debugging but setup takes a week and costs more than your car payment

I've used all three. RAGAS is where most people start and stay because switching frameworks after you've built everything around one is painful.

Synthetic datasets are conference talk bullshit. LLM-generated questions sound like this: "Could you please explain the process for initiating a refund request through the customer portal?"

Real questions from our Slack #help channel: "refund???" followed by "HELLO??" two minutes later.

Synthetic data is useful for exactly one thing: getting started before you have real user queries. After that, use actual production logs or your evaluation is lying to you.

Production monitoring beats development evaluation every time. Dev metrics use perfect test data. Production shows you:

  • Users asking about Pokemon in your enterprise SaaS docs
  • The system breaking when 50 people hit it simultaneously
  • Documents that worked in staging but are corrupted in prod
  • API costs that went from $200 to $2000 because someone removed rate limits

I monitor: user thumbs down, response times over 10 seconds, queries that return empty results, and API costs that spike when someone accidentally removes rate limiting at 2am. These metrics matter more than any faithfulness score ever will.
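
If you'd rather get paged than stare at dashboards, a handful of thresholds over those same signals is enough to start. A rough sketch - the cutoffs are the ones I use, not universal truths, and window is whatever hourly aggregate your monitoring already produces:

def check_rag_alerts(window):
    """window: dict of counters/gauges for the last hour (source them however you like)."""
    alerts = []
    if window["thumbs_down"] / max(window["responses"], 1) > 0.15:
        alerts.append("Over 15% thumbs down - the answers are probably wrong")
    if window["p95_latency_seconds"] > 10:
        alerts.append("p95 latency over 10s - users are giving up")
    if window["empty_retrievals"] / max(window["queries"], 1) > 0.10:
        alerts.append("Over 10% empty retrievals - index or embeddings broke")
    if window["api_cost_usd_today"] > 3 * window["api_cost_usd_daily_avg"]:
        alerts.append("API spend 3x the daily average - check the rate limits")
    return alerts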

RAG Evaluation Framework Comparison

Framework | What's Good                 | What Sucks                                            | Reality Check
RAGAS     | Actually works, decent docs | Will bankrupt you on API costs                        | Start here, switch later when bill hits $800/month
DeepEval  | Fast, cheaper               | Docs are shit, you'll read source code                | Use when RAGAS costs too much
TruLens   | Amazing debugging           | Takes a week to set up, costs more than your mortgage | Only for companies with dedicated ML platform teams

RAGAS Setup That Actually Works

Here's how to get RAGAS running without losing your mind. I've set this up 6 times and learned all the stupid gotchas.

[Image: RAG evaluation setup workspace]

RAGAS Installation

Use a clean environment because RAGAS will definitely conflict with something in your existing Python disaster zone. I've watched it shit the bed on numpy 1.24 vs 1.25, different torch versions, and for some godforsaken reason, Pillow 9.3.0 specifically.

## Create new venv because dependency hell is inevitable
python -m venv rag-eval-env
source rag-eval-env/bin/activate

## Install everything at once to avoid the six hours I spent debugging version conflicts
pip install ragas==0.1.16 openai==1.40.0 langchain==0.2.11 datasets==2.19.0

## If you get weird errors about tokenizers or transformers, nuke it
## rm -rf rag-eval-env && python -m venv rag-eval-env
## This happens more than I'd like to admit

API Configuration

Get your OpenAI API key ready. If you don't have one, this process will stop dead right here.

import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

## Set this or nothing works
os.environ["OPENAI_API_KEY"] = "your-actual-api-key-not-this-text"

## Use gpt-4o-mini unless you like burning money
## Seriously, gpt-4 costs 10x more and the difference is minimal for evaluation
## I learned this after running eval on 500 questions with gpt-4 and getting a $287 bill that made my manager ask "wtf did you do?"
## Note: RAGAS won't pick the model up from an env var - pass model="gpt-4o-mini" when you
## build the ChatOpenAI instances for generation and evaluation (done in the snippets below)

Creating Realistic Test Data

Your test questions are lies. You're testing with "How can I troubleshoot authentication failures?" when users type "cant login".

## What your dataset probably looks like
fake_test_questions = [
    "How can I troubleshoot authentication failures?",
    "What are the common causes of API data retrieval errors?",
    "What is the process for requesting a refund?"
]

## What users actually type (yes, this is real)
real_user_queries = [
    "cant login",
    "api broken wtf", 
    "refund now",
    "how much cost???",
    "thing no work",
    "billing"
]

Synthetic data is better than nothing but it's still not real user behavior. Use it to get started, then replace it with actual production logs ASAP.

from ragas.testset.generator import TestsetGenerator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

## Initialize generation components (ragas 0.1.x wants a generator LLM plus a critic LLM)
generator_llm = ChatOpenAI(model="gpt-4o-mini")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

## Load documents for test generation
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

## Generate test dataset
testset = generator.generate_with_langchain_docs(
    documents, 
    test_size=50
)

## Save generated dataset
testset.to_pandas().to_csv("synthetic_testset.csv", index=False)

Running RAGAS Evaluation

RAGAS is slow as molasses because every metric hammers the API. 100 questions takes 5-15 minutes if OpenAI is having a good day. When their API decides to take a shit (Friday afternoons, holidays, whenever you have a deadline), it can drag on for 45+ minutes. I once started eval at 4pm on a Thursday and came back Monday morning to find it was still fucking running.

from datasets import Dataset
import pandas as pd
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

## Load test dataset
## evaluate() needs question, answer, and contexts columns (plus ground_truth for some
## metrics) - run the questions through YOUR RAG system first to fill in answer/contexts
eval_df = pd.read_csv("synthetic_testset.csv")
eval_dataset = Dataset.from_pandas(eval_df)

## Run evaluation (grab coffee, this takes forever)
result = evaluate(
    eval_dataset,
    metrics=[
        faithfulness,     # Catches AI bullshit
        answer_relevancy, # Actually answers the question
    ],
)

## Check the damage
results_df = result.to_pandas()
print(f"Faithfulness: {results_df['faithfulness'].mean():.3f}")
print(f"Answer Relevancy: {results_df['answer_relevancy'].mean():.3f}")

## Find the worst responses (these need human review)
shit_responses = results_df[results_df['faithfulness'] < 0.5]
print(f"Found {len(shit_responses)} responses that suck")

## Look at the actual bad ones
print(shit_responses[['question', 'answer', 'faithfulness']].head())
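
Because runs are this slow and the API flakes out, I evaluate in small batches and append partial results to disk, so a dead run resumes where it stopped instead of starting over. A rough sketch - the batching and resume logic are mine, not a RAGAS feature:

import os
import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

def evaluate_in_batches(eval_df, batch_size=20, out_path="eval_partial.csv"):
    """Score batch_size rows at a time, appending to disk so a crash doesn't cost the whole run."""
    done = len(pd.read_csv(out_path)) if os.path.exists(out_path) else 0  # resume point

    for start in range(done, len(eval_df), batch_size):
        batch = Dataset.from_pandas(eval_df.iloc[start:start + batch_size].reset_index(drop=True))
        result = evaluate(batch, metrics=[faithfulness, answer_relevancy])
        result.to_pandas().to_csv(out_path, mode="a", header=(start == 0), index=False)
        print(f"Finished rows {start}-{start + batch_size}")

    return pd.read_csv(out_path)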

Production Monitoring

Don't overthink metrics. Faithfulness and relevancy catch 90% of problems. Complex custom metrics are academic masturbation that won't help you ship.

Simple quality tracking:

import json
from datetime import datetime

def log_quality_issue(query, answer, issue_type):
    """Log when things go wrong so you can fix them"""
    issue = {
        "timestamp": datetime.now().isoformat(),
        "query": query,
        "answer": answer,
        "issue_type": issue_type,
        "wtf_factor": "high" if "pokemon" in query.lower() else "normal"
    }
    
    with open("quality_issues.jsonl", "a") as f:
        f.write(json.dumps(issue) + "\n")

Sample-based production evaluation:

import random

def evaluate_sample(query, answer, contexts, sample_rate=0.05):
    """Evaluate some production queries (not all - that's expensive)"""
    # Only evaluate 5% because API costs add up fast
    if random.random() > sample_rate:
        return None
    
    try:
        result = evaluate(
            Dataset.from_dict({
                "question": [query],
                "answer": [answer],
                "contexts": [contexts]
            }),
            metrics=[faithfulness]
        )
        
        score = result.to_pandas()["faithfulness"].iloc[0]
        if score < 0.5:
            log_quality_issue(query, answer, f"faithfulness_sucks_{score:.2f}")
            
        return score
    except Exception as e:
        # API will fail sometimes, don't crash prod
        print(f"Evaluation failed: {e}")
        return None

User Feedback Integration

User feedback beats any metric. Thumbs down tells you more than a 0.8 faithfulness score ever will.

I track:

  • Thumbs down (users hate this answer)
  • "That didn't help" clicks
  • Users asking the same question again (first answer sucked)
  • Support tickets about "the AI is broken"

Automated metrics are useful but users clicking "this sucks" is ground truth.
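
The "asked the same question again" signal is easy to catch automatically. A minimal sketch using plain string similarity - swap in embeddings if you care, but this already catches the obvious repeats (it reuses log_quality_issue() from earlier):

from difflib import SequenceMatcher

def is_retry(previous_query, current_query, threshold=0.7):
    """Same user, near-identical question a few minutes later = the first answer probably sucked."""
    a, b = previous_query.lower().strip(), current_query.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

if is_retry("refund???", "refund please???"):
    log_quality_issue("refund please???", answer="", issue_type="user_retried_question")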

RAGAS Setup Video Tutorial

This 14-minute video shows actual RAGAS implementation, not just theory. Covers the gotchas I wish someone had warned me about.

Worth watching for:
  • Installation without dependency hell
  • Setting up evaluation that doesn't bankrupt you
  • Real examples of what goes wrong

Watch: RAG Evaluation with RAGAS and How to Improve Retrieval (Session 7, by CodeMint, on YouTube)

Warning: Video is from 2023, some API details have changed but the core concepts are still relevant.

Common RAG Evaluation Questions

Q: Why do my scores look great but users say my system sucks?

A: Because you're testing with "What is the refund policy?" and Carol from HR is typing "refund???" at 8:43am while her coffee is still brewing and she's walking to a meeting. Your evaluation dataset is living in some academic fantasyland while your users are typing one-handed from their phones in bathroom stalls.

Fix: Use actual user queries from logs. I've seen systems with 0.85 faithfulness that users absolutely despise because the evaluation questions were written by someone who's never used Slack while stressed.

## Use actual user query patterns
production_queries = [
    "refund policy???",
    "api key not working", 
    "premium subscription cost"
]

Q: My RAG has like 5 different components. How the hell do I evaluate this?

A: Test each piece separately or you'll never figure out what's broken when (not if) things go wrong.

Retrieval first: Are you getting the right docs? This is cheaper to test (no LLM calls) and catches most problems.

Generation second: Given perfect docs, does the LLM produce decent answers? If not, your prompt sucks.

End-to-end last: Real queries through the whole system. This catches the weird interactions between components that you'll never predict.

## Component isolation approach
def evaluate_retrieval_only(queries, ground_truth_docs):
    retrieved_docs = [retriever.get_docs(q) for q in queries]
    return calculate_precision_recall(retrieved_docs, ground_truth_docs)

def evaluate_generation_only(queries, perfect_contexts):
    responses = [llm.generate(q, ctx) for q, ctx in zip(queries, perfect_contexts)]  
    return evaluate_faithfulness(responses, perfect_contexts)

Q: Why do my scores change every time I run evaluation?

A: Because LLMs are random as fuck, even with temperature=0. I've watched the same exact query get faithfulness scores from 0.6 to 0.9 across runs because the models are basically sophisticated coin flips.

Deal with it:

  • Run evaluation 3 times and average if you need consistent numbers
  • Focus on trends ("scores are dropping") not absolutes ("0.82 is bad")
  • Use gpt-4o-mini - cheaper and less random than gpt-4
  • Accept that evaluation is approximate, not scientific

## Consistent evaluation configuration
import pandas as pd
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

evaluation_llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.0,  # As deterministic as it gets (LLM scoring still jitters)
    max_retries=3,    # Handle transient API failures
)

## Run evaluation multiple times for important datasets
def robust_evaluation(dataset, runs=3):
    results = []
    for i in range(runs):
        result = evaluate(
            dataset,
            metrics=[faithfulness, answer_relevancy],
            llm=evaluation_llm,  # Actually use the LLM configured above
        )
        results.append(result.to_pandas())

    # Average the numeric score columns across runs
    avg_results = pd.concat(results).groupby(level=0).mean(numeric_only=True)
    return avg_results

Q: How much is this evaluation going to cost me?

A: Way more than anyone admits in the tutorials. I've watched teams go from "oh cool, $50/month" to "WHAT THE FUCK $847?!" when they started evaluating everything like those Medium articles suggested. The tutorials never mention you'll be making thousands of API calls.

Reality check (from my actual bills):

  • 1000 questions = ~$10-50 depending on model (gpt-4o-mini vs gpt-4)
  • Production sampling at 5% of 10k queries/month = $200-400/month minimum
  • gpt-4 evaluation costs 10x more than gpt-4o-mini (I found out the hard way)

Don't go broke:

  • Sample 1-5% of production, not everything
  • Cache results - don't re-evaluate the same queries (sketch after this list)
  • Start small (100 questions) and scale when you see the bill
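
Here's the caching point as code: hash the inputs and never pay to score the same thing twice. A minimal sketch with a dumb file-based cache - the file name and wrapper are mine, not part of any framework:

import hashlib
import json
import os

CACHE_PATH = "eval_cache.json"
_cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}

def cached_eval(query, answer, contexts, evaluate_fn):
    """Only pay the evaluator for inputs we haven't scored before."""
    key = hashlib.sha256(json.dumps([query, answer, contexts], sort_keys=True).encode()).hexdigest()
    if key in _cache:
        return _cache[key]

    score = evaluate_fn(query, answer, contexts)  # e.g. wrap evaluate_sample() from earlier
    if score is not None:
        _cache[key] = score
        with open(CACHE_PATH, "w") as f:
            json.dump(_cache, f)
    return score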

Q: What's a "good" score? My manager keeps asking this.

A: Nobody has a goddamn clue, and anyone who gives you exact thresholds is selling you something. I've seen systems with 0.6 faithfulness that users absolutely love and systems with 0.9 that make the support team want to quit.

Rough guidelines (don't treat as gospel):

  • Faithfulness: 0.8+ if you don't want to get fired, 0.9+ if lawyers are involved
  • Relevancy: 0.7+ or users will complain, 0.8+ if you want happy customers
  • Everything else: Probably doesn't matter as much as you think

What actually matters: Trend direction. If scores drop from 0.8 to 0.7, something broke. If users are happy at 0.6, you're fine.

Q: How do I create ground truth when I don't have any?

A: Manual ground truth is expensive as hell and doesn't scale. I've tried - it sucks.

Start automated:

  • Generate Q&A pairs from docs (quick and dirty)
  • Use production logs of successful interactions
  • LLM-generated questions (not perfect but better than nothing)

Add human review:

  • Get experts to fix the worst automated examples
  • Focus on edge cases that break your system
  • Templates help non-experts create consistent data

Production signals:

  • Thumbs up/down (easiest to implement)
  • Users completing tasks after getting answers (purchase, download)
  • Queries that don't get follow-up questions (probably good answers)

Q: My system works great in evaluation but sucks in production. What gives?

A: Welcome to software engineering. Production has delightful surprises that evaluation doesn't:

  • 50 users hitting it simultaneously (not your single-threaded test)
  • Memory pressure making vector search slow as molasses
  • Network timeouts when the API decides to take a coffee break
  • Data that changed since you created your evaluation set
  • Users asking follow-up questions that assume context

Fix the gap:

  • Load test with realistic traffic (not just one query at a time)
  • Test with production memory/CPU limits
  • Update your evaluation data monthly, not yearly
  • Monitor real metrics, not just periodic evaluation scores

Q: How do I evaluate conversations instead of single questions?

A: Standard RAG evaluation assumes each query is independent. Conversations have context, memory, and can go completely off the rails.

What matters in conversations:

  • Context continuity: Does it remember what we were talking about?
  • Coherence: Do responses make sense given what happened before?
  • Memory management: Does it remember important stuff and forget irrelevant details?

Implementation approach:

def evaluate_conversation(conversation_history):
    """Evaluate multi-turn conversation quality"""
    
    # Evaluate each turn considering full context
    for i, turn in enumerate(conversation_history):
        context = conversation_history[:i]  # Previous turns as context
        current_query = turn['query']
        current_response = turn['response']
        
        # Evaluate response quality given conversation context
        # (evaluate_with_context is a placeholder - e.g. wrap RAGAS with the prior turns
        #  prepended to the question so the judge sees what the model saw)
        turn_quality = evaluate_with_context(
            query=current_query,
            response=current_response, 
            conversation_context=context
        )
        
        turn['quality_scores'] = turn_quality
    
    return conversation_history

Q: My metrics disagree with each other. Which one is lying?

A: Probably neither - they're measuring different kinds of failure. High faithfulness + low relevancy = accurately answering the wrong question. Low faithfulness + high relevancy = bullshitting about the right topic.

Decode the chaos:

  • High faithfulness, low relevancy: Your retrieval sucks, getting wrong docs
  • Low faithfulness, high relevancy: Your LLM is hallucinating but staying on topic
  • High precision, low recall: Too picky, missing good answers
  • Low precision, high recall: Returning too much garbage

Pick your poison: Customer support? Relevancy wins (help users). Legal stuff? Faithfulness wins (don't get sued). Know your priorities.

Bottom line: Users don't care about your metrics. They care about getting answers that help them do their job. Perfect evaluation scores on academic benchmarks mean nothing if users think your system is useless.

Figure out what "good" means for your users, measure that, and optimize for user happiness over metric perfection.
