Why RAG Evaluation Is Academic Bullshit

RAG evaluation works great until Karen from accounting shows up at 2pm with "cant login help me urgent!!!". Then everything goes to shit because she doesn't type like your beautiful test dataset.

[Image: Production monitoring dashboard]

Your evaluation dataset has questions like "What is the company's refund policy for digital products?"

Karen types "cant get mony back wtf???" and Dave from sales sends "refund pls broken app im on a call" at 7:23am on a Tuesday.

RAGAS faithfulness scores tell you if the AI accurately summarizes retrieved text. They don't tell you if that text answers the user's actual fucking question. I've watched systems get 0.9 faithfulness while users rage quit because they got perfect answers to questions they never asked.

What Actually Breaks RAG Systems

Nobody tests with actual user behavior. Evaluation datasets use perfect grammar. Users type like they're ordering pizza at 2am after three beers with autocorrect disabled.

Example that made me want to quit:

  • Evaluation: "How can I troubleshoot authentication issues with the API?"
  • Production: "api thingy broken cant login help???" (sent from CEO's personal Slack at 7am Sunday)

Standard metrics are fucking useless. RAGAS faithfulness measures if AI summarizes text accurately. It doesn't measure if that text is relevant.

I've seen this disaster multiple times:

  • User asks: "how much does premium cost"
  • System retrieves: Enterprise security documentation
  • System responds: Perfect summary of security features
  • Faithfulness score: 0.92 (great!)
  • User satisfaction: Zero (completely wrong answer)

The system gets rewarded for being accurately wrong.
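
If you only take one thing from this: score relevancy right next to faithfulness and go read the disagreements. A minimal sketch, assuming you've already exported per-question scores to a CSV (the file name is made up; the column names match what RAGAS writes out via to_pandas()):

import pandas as pd

# Per-question scores exported from an eval run (hypothetical file name)
results = pd.read_csv("eval_results.csv")

# "Accurately wrong": faithful summaries of documents that don't answer the question
accurately_wrong = results[
    (results["faithfulness"] >= 0.8) & (results["answer_relevancy"] < 0.5)
]

print(f"{len(accurately_wrong)} answers are faithful to the wrong documents")
print(accurately_wrong[["question", "answer"]].head())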

Production documents are a fucking nightmare. Dev testing uses pristine PDFs and clean markdown. Production hits you with:

  • PDFs where tables turn into garbled text soup that looks like a cat walked on the keyboard
  • Word docs with encoding that turns quotes into question marks (thanks Windows-1252)
  • Web scraping that grabs navigation menus instead of content
  • Excel "CSV" exports that explode on the first comma in a cell

I burned 6 hours debugging why our system started responding "Click here for more information" to every billing question. Turned out our web scraper was grabbing footer links instead of actual content because some frontend dev updated CSS selectors during a "quick design refresh" and nobody told the backend team. Lost my entire weekend to that trainwreck.

Three Things That Actually Matter in Production

Users type like shit. Your evaluation uses "What are the system requirements for the enterprise plan?" Users type "requirements???" and expect it to work.

I learned this the expensive way when our support RAG started serving up installation guides for "billing" questions, and three customers canceled thinking we didn't have billing support. Turns out nobody tested with one-word queries because our evaluation dataset was too fucking polite. Oops.

Test with (a quick sketch for generating these follows the list):

  • Typos and misspellings ("seperate" not "separate")
  • One-word queries ("billing", "refund", "broken")
  • Drunk user language ("thing no work why")
  • Questions that assume context ("how much?" without saying what)
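
One cheap way to get these is to mangle the polite questions you already have into the kind of garbage people actually send. A throwaway sketch - the mangling rules are mine, not from any framework, so tune them to whatever your users actually do:

import random

def mangle(question: str) -> list[str]:
    """Turn one polite eval question into messier variants users actually type."""
    words = question.rstrip("?").split()
    return [
        question.lower().rstrip("?"),                                # no caps, no punctuation
        words[-1].lower(),                                           # one-word query
        " ".join(words[:3]).lower() + "???",                         # cut off mid-thought, punctuation spam
        " ".join(random.sample(words, min(3, len(words)))).lower(),  # word salad
    ]

print(mangle("What are the system requirements for the enterprise plan?"))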

Most systems hallucinate instead of admitting ignorance. Retrieval breaks constantly:

  • Document chunking splits sentences in the worst possible places
  • Embeddings match on random words instead of meaning
  • Vector databases return garbage when they can't find good matches

I'd rather have a system that says "I don't know" than one that confidently explains how to cancel a subscription when the user asked about billing.

Users prefer honest ignorance over confident bullshit.
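
The cheapest guardrail I know is a retrieval-score floor: if nothing comes back above some threshold, say so instead of generating. A minimal sketch, assuming a retriever that hands back (document, similarity score) pairs - the method names and the 0.75 cutoff are placeholders, tune them on your own data:

def answer_or_admit(query, retriever, llm, min_score=0.75):
    """Refuse to generate when retrieval clearly found nothing useful."""
    hits = retriever.search(query, top_k=5)  # hypothetical: returns [(doc, score), ...]
    good = [doc for doc, score in hits if score >= min_score]

    if not good:
        return ("I couldn't find anything in the docs that answers this. "
                "Can you rephrase, or should I route you to a human?")

    context = "\n\n".join(doc.text for doc in good)
    return llm.generate(f"Answer using only this context:\n{context}\n\nQuestion: {query}")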

You need debugging or you're fucked when things break. Users will report "the AI is broken" and you need to figure out:

  • What documents got retrieved (probably the wrong ones)
  • Why the embeddings thought those were relevant
  • What the hell the LLM was thinking
  • How much this failure is costing you

Most frameworks give you a score and leave you to figure out why it sucks. I've debugged RAG failures while the CEO is breathing down my neck at midnight - you want actual logs showing what broke, not philosophical metrics about "semantic similarity".
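
The boring fix is logging a full trace per request so you can replay the failure later instead of guessing. A minimal sketch of what I mean - the field names are my own convention, wire it into whatever logging you already have:

import json
import time
import uuid
from datetime import datetime, timezone

def log_rag_trace(query, retrieved, response, started_at, path="rag_traces.jsonl"):
    """One JSON line per request: enough to answer 'what got retrieved, and why?'"""
    # retrieved: [(doc_id, score, text), ...] straight from your retriever
    trace = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "retrieved": [  # doc ids + scores so you can see why it picked them
            {"doc_id": doc_id, "score": round(score, 4), "snippet": text[:200]}
            for doc_id, score, text in retrieved
        ],
        "response": response,
        "latency_ms": int((time.time() - started_at) * 1000),
    }
    with open(path, "a") as f:
        f.write(json.dumps(trace) + "\n")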

Framework Reality Check

Every framework sucks at something important:

  • RAGAS: Great docs, expensive as hell, slow evaluation
  • DeepEval: Fast but docs are shit, you'll be reading source code
  • TruLens: Comprehensive debugging but setup takes a week and costs more than your car payment

I've used all three. RAGAS is where most people start and stay because switching frameworks after you've built everything around one is painful.

Synthetic datasets are conference talk bullshit. LLM-generated questions sound like this: "Could you please explain the process for initiating a refund request through the customer portal?"

Real questions from our Slack #help channel: "refund???" followed by "HELLO??" two minutes later.

Synthetic data is useful for exactly one thing: getting started before you have real user queries. After that, use actual production logs or your evaluation is lying to you.

Production monitoring beats development evaluation every time. Dev metrics use perfect test data. Production shows you:

  • Users asking about Pokemon in your enterprise SaaS docs
  • The system breaking when 50 people hit it simultaneously
  • Documents that worked in staging but are corrupted in prod
  • API costs that went from $200 to $2000 because someone removed rate limits

I monitor: user thumbs down, response times over 10 seconds, queries that return empty results, and API costs that spike when someone accidentally removes rate limiting at 2am. These metrics matter more than any faithfulness score ever will.
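
If you'd rather get paged than stare at dashboards, a handful of thresholds over those same signals is enough to start. A rough sketch - the cutoffs are the ones I use, not universal truths, and window is whatever hourly aggregate your monitoring already produces:

def check_rag_alerts(window):
    """window: dict of counters/gauges for the last hour (source them however you like)."""
    alerts = []
    if window["thumbs_down"] / max(window["responses"], 1) > 0.15:
        alerts.append("Over 15% thumbs down - the answers are probably wrong")
    if window["p95_latency_seconds"] > 10:
        alerts.append("p95 latency over 10s - users are giving up")
    if window["empty_retrievals"] / max(window["queries"], 1) > 0.10:
        alerts.append("Over 10% empty retrievals - index or embeddings broke")
    if window["api_cost_usd_today"] > 3 * window["api_cost_usd_daily_avg"]:
        alerts.append("API spend 3x the daily average - check the rate limits")
    return alerts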

RAG Evaluation Framework Comparison

Framework | What's Good                 | What Sucks                                            | Reality Check
RAGAS     | Actually works, decent docs | Will bankrupt you on API costs                        | Start here, switch later when bill hits $800/month
DeepEval  | Fast, cheaper               | Docs are shit, you'll read source code                | Use when RAGAS costs too much
TruLens   | Amazing debugging           | Takes a week to set up, costs more than your mortgage | Only for companies with dedicated ML platform teams

RAGAS Setup That Actually Works

Here's how to get RAGAS running without losing your mind. I've set this up 6 times and learned all the stupid gotchas.

[Image: RAG evaluation setup workspace]

RAGAS Installation

Use a clean environment because RAGAS will definitely conflict with something in your existing Python disaster zone. I've watched it shit the bed on numpy 1.24 vs 1.25, different torch versions, and for some godforsaken reason, Pillow 9.3.0 specifically.

## Create new venv because dependency hell is inevitable
python -m venv rag-eval-env
source rag-eval-env/bin/activate

## Install everything at once to avoid the six hours I spent debugging version conflicts
pip install ragas==0.1.16 openai==1.40.0 langchain==0.2.11 datasets==2.19.0

## If you get weird errors about tokenizers or transformers, nuke it
## rm -rf rag-eval-env && python -m venv rag-eval-env
## This happens more than I'd like to admit

API Configuration

Get your OpenAI API key ready. If you don't have one, this process will stop dead right here.

import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

## Set this or nothing works
os.environ["OPENAI_API_KEY"] = "your-actual-api-key-not-this-text"

## Use gpt-4o-mini unless you like burning money
## Seriously, gpt-4 costs 10x more and the difference is minimal for evaluation
## I learned this after running eval on 500 questions with gpt-4 and getting a $287 bill that made my manager ask "wtf did you do?"
## Note: RAGAS won't pick the model up from an env var - pass model="gpt-4o-mini" when you
## build the ChatOpenAI instances for generation and evaluation (done in the snippets below)

Creating Realistic Test Data

Your test questions are lies. You're testing with "How can I troubleshoot authentication failures?" when users type "cant login".

## What your dataset probably looks like
fake_test_questions = [
    "How can I troubleshoot authentication failures?",
    "What are the common causes of API data retrieval errors?",
    "What is the process for requesting a refund?"
]

## What users actually type (yes, this is real)
real_user_queries = [
    "cant login",
    "api broken wtf", 
    "refund now",
    "how much cost???",
    "thing no work",
    "billing"
]

Synthetic data is better than nothing but it's still not real user behavior. Use it to get started, then replace it with actual production logs ASAP.

from ragas.testset.generator import TestsetGenerator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

## Initialize generation components (ragas 0.1.x wants a generator LLM plus a critic LLM)
generator_llm = ChatOpenAI(model="gpt-4o-mini")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

## Load documents for test generation
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

## Generate test dataset
testset = generator.generate_with_langchain_docs(
    documents, 
    test_size=50
)

## Save generated dataset
testset.to_pandas().to_csv("synthetic_testset.csv", index=False)

Running RAGAS Evaluation

RAGAS is slow as molasses because every metric hammers the API. 100 questions takes 5-15 minutes if OpenAI is having a good day. When their API decides to take a shit (Friday afternoons, holidays, whenever you have a deadline), it can drag on for 45+ minutes. I once started eval at 4pm on a Thursday and came back Monday morning to find it was still fucking running.

from datasets import Dataset
import pandas as pd
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

## Load test dataset
## evaluate() needs question, answer, and contexts columns (plus ground_truth for some
## metrics) - run the questions through YOUR RAG system first to fill in answer/contexts
eval_df = pd.read_csv("synthetic_testset.csv")
eval_dataset = Dataset.from_pandas(eval_df)

## Run evaluation (grab coffee, this takes forever)
result = evaluate(
    eval_dataset,
    metrics=[
        faithfulness,     # Catches AI bullshit
        answer_relevancy, # Actually answers the question
    ],
)

## Check the damage
results_df = result.to_pandas()
print(f"Faithfulness: {results_df['faithfulness'].mean():.3f}")
print(f"Answer Relevancy: {results_df['answer_relevancy'].mean():.3f}")

## Find the worst responses (these need human review)
shit_responses = results_df[results_df['faithfulness'] < 0.5]
print(f"Found {len(shit_responses)} responses that suck")

## Look at the actual bad ones
print(shit_responses[['question', 'answer', 'faithfulness']].head())
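
Because runs are this slow and the API flakes out, I evaluate in small batches and append partial results to disk, so a dead run resumes where it stopped instead of starting over. A rough sketch - the batching and resume logic are mine, not a RAGAS feature:

import os
import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

def evaluate_in_batches(eval_df, batch_size=20, out_path="eval_partial.csv"):
    """Score batch_size rows at a time, appending to disk so a crash doesn't cost the whole run."""
    done = len(pd.read_csv(out_path)) if os.path.exists(out_path) else 0  # resume point

    for start in range(done, len(eval_df), batch_size):
        batch = Dataset.from_pandas(eval_df.iloc[start:start + batch_size].reset_index(drop=True))
        result = evaluate(batch, metrics=[faithfulness, answer_relevancy])
        result.to_pandas().to_csv(out_path, mode="a", header=(start == 0), index=False)
        print(f"Finished rows {start}-{start + batch_size}")

    return pd.read_csv(out_path)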

Production Monitoring

Don't overthink metrics. Faithfulness and relevancy catch 90% of problems. Complex custom metrics are academic masturbation that won't help you ship.

Simple quality tracking:

import json
from datetime import datetime

def log_quality_issue(query, answer, issue_type):
    """Log when things go wrong so you can fix them"""
    issue = {
        "timestamp": datetime.now().isoformat(),
        "query": query,
        "answer": answer,
        "issue_type": issue_type,
        "wtf_factor": "high" if "pokemon" in query.lower() else "normal"
    }
    
    with open("quality_issues.jsonl", "a") as f:
        f.write(json.dumps(issue) + "\n")

Sample-based production evaluation:

import random

def evaluate_sample(query, answer, contexts, sample_rate=0.05):
    """Evaluate some production queries (not all - that's expensive)"""
    # Only evaluate 5% because API costs add up fast
    if random.random() > sample_rate:
        return None
    
    try:
        result = evaluate(
            Dataset.from_dict({
                "question": [query],
                "answer": [answer],
                "contexts": [contexts]
            }),
            metrics=[faithfulness]
        )
        
        score = result.to_pandas()["faithfulness"].iloc[0]
        if score < 0.5:
            log_quality_issue(query, answer, f"faithfulness_sucks_{score:.2f}")
            
        return score
    except Exception as e:
        # API will fail sometimes, don't crash prod
        print(f"Evaluation failed: {e}")
        return None

User Feedback Integration

User feedback beats any metric. Thumbs down tells you more than a 0.8 faithfulness score ever will.

I track:

  • Thumbs down (users hate this answer)
  • "That didn't help" clicks
  • Users asking the same question again (first answer sucked)
  • Support tickets about "the AI is broken"

Automated metrics are useful but users clicking "this sucks" is ground truth.
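
The "asked the same question again" signal is easy to catch automatically. A minimal sketch using plain string similarity - swap in embeddings if you care, but this already catches the obvious repeats (it reuses log_quality_issue() from earlier):

from difflib import SequenceMatcher

def is_retry(previous_query, current_query, threshold=0.7):
    """Same user, near-identical question a few minutes later = the first answer probably sucked."""
    a, b = previous_query.lower().strip(), current_query.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

if is_retry("refund???", "refund please???"):
    log_quality_issue("refund please???", answer="", issue_type="user_retried_question")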

RAGAS Setup Video Tutorial

This 14-minute video shows actual RAGAS implementation, not just theory. Covers the gotchas I wish someone had warned me about.

Worth watching for:
  • Installation without dependency hell
  • Setting up evaluation that doesn't bankrupt you
  • Real examples of what goes wrong

Watch: RAG Evaluation with RAGAS and How to Improve Retrieval (Session 7, by CodeMint, on YouTube)

Warning: Video is from 2023, some API details have changed but the core concepts are still relevant.

Common RAG Evaluation Questions

Q: Why do my scores look great but users say my system sucks?

A: Because you're testing with "What is the refund policy?" and Carol from HR is typing "refund???" at 8:43am while her coffee is still brewing and she's walking to a meeting. Your evaluation dataset is living in some academic fantasyland while your users are typing one-handed from their phones in bathroom stalls.

Fix: Use actual user queries from logs. I've seen systems with 0.85 faithfulness that users absolutely despise because the evaluation questions were written by someone who's never used Slack while stressed.

## Use actual user query patterns
production_queries = [
    "refund policy???",
    "api key not working", 
    "premium subscription cost"
]

Q: My RAG has like 5 different components. How the hell do I evaluate this?

A: Test each piece separately or you'll never figure out what's broken when (not if) things go wrong.

Retrieval first: Are you getting the right docs? This is cheaper to test (no LLM calls) and catches most problems.

Generation second: Given perfect docs, does the LLM produce decent answers? If not, your prompt sucks.

End-to-end last: Real queries through the whole system. This catches the weird interactions between components that you'll never predict.

## Component isolation approach
def evaluate_retrieval_only(queries, ground_truth_docs):
    retrieved_docs = [retriever.get_docs(q) for q in queries]
    return calculate_precision_recall(retrieved_docs, ground_truth_docs)

def evaluate_generation_only(queries, perfect_contexts):
    responses = [llm.generate(q, ctx) for q, ctx in zip(queries, perfect_contexts)]  
    return evaluate_faithfulness(responses, perfect_contexts)

Q: Why do my scores change every time I run evaluation?

A: Because LLMs are random as fuck, even with temperature=0. I've watched the same exact query get faithfulness scores from 0.6 to 0.9 across runs because the models are basically sophisticated coin flips.

Deal with it:

  • Run evaluation 3 times and average if you need consistent numbers
  • Focus on trends ("scores are dropping") not absolutes ("0.82 is bad")
  • Use gpt-4o-mini - cheaper and less random than gpt-4
  • Accept that evaluation is approximate, not scientific

## Consistent evaluation configuration
import pandas as pd
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

evaluation_llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.0,  # As deterministic as it gets (LLM scoring still jitters)
    max_retries=3,    # Handle transient API failures
)

## Run evaluation multiple times for important datasets
def robust_evaluation(dataset, runs=3):
    results = []
    for i in range(runs):
        result = evaluate(
            dataset,
            metrics=[faithfulness, answer_relevancy],
            llm=evaluation_llm,  # Actually use the LLM configured above
        )
        results.append(result.to_pandas())

    # Average the numeric score columns across runs
    avg_results = pd.concat(results).groupby(level=0).mean(numeric_only=True)
    return avg_results

Q: How much is this evaluation going to cost me?

A: Way more than anyone admits in the tutorials. I've watched teams go from "oh cool, $50/month" to "WHAT THE FUCK $847?!" when they started evaluating everything like those Medium articles suggested. The tutorials never mention you'll be making thousands of API calls.

Reality check (from my actual bills):

  • 1000 questions = ~$10-50 depending on model (gpt-4o-mini vs gpt-4)
  • Production sampling at 5% of 10k queries/month = $200-400/month minimum
  • gpt-4 evaluation costs 10x more than gpt-4o-mini (I found out the hard way)

Don't go broke:

  • Sample 1-5% of production, not everything
  • Cache results - don't re-evaluate the same queries (sketch after this list)
  • Start small (100 questions) and scale when you see the bill
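
Here's the caching point as code: hash the inputs and never pay to score the same thing twice. A minimal sketch with a dumb file-based cache - the file name and wrapper are mine, not part of any framework:

import hashlib
import json
import os

CACHE_PATH = "eval_cache.json"
_cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}

def cached_eval(query, answer, contexts, evaluate_fn):
    """Only pay the evaluator for inputs we haven't scored before."""
    key = hashlib.sha256(json.dumps([query, answer, contexts], sort_keys=True).encode()).hexdigest()
    if key in _cache:
        return _cache[key]

    score = evaluate_fn(query, answer, contexts)  # e.g. wrap evaluate_sample() from earlier
    if score is not None:
        _cache[key] = score
        with open(CACHE_PATH, "w") as f:
            json.dump(_cache, f)
    return score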

Q: What's a "good" score? My manager keeps asking this.

A: Nobody has a goddamn clue, and anyone who gives you exact thresholds is selling you something. I've seen systems with 0.6 faithfulness that users absolutely love and systems with 0.9 that make the support team want to quit.

Rough guidelines (don't treat as gospel):

  • Faithfulness: 0.8+ if you don't want to get fired, 0.9+ if lawyers are involved
  • Relevancy: 0.7+ or users will complain, 0.8+ if you want happy customers
  • Everything else: Probably doesn't matter as much as you think

What actually matters: Trend direction. If scores drop from 0.8 to 0.7, something broke. If users are happy at 0.6, you're fine.

Q: How do I create ground truth when I don't have any?

A: Manual ground truth is expensive as hell and doesn't scale. I've tried - it sucks.

Start automated:

  • Generate Q&A pairs from docs (quick and dirty)
  • Use production logs of successful interactions
  • LLM-generated questions (not perfect but better than nothing)

Add human review:

  • Get experts to fix the worst automated examples
  • Focus on edge cases that break your system
  • Templates help non-experts create consistent data

Production signals:

  • Thumbs up/down (easiest to implement)
  • Users completing tasks after getting answers (purchase, download)
  • Queries that don't get follow-up questions (probably good answers)

Q: My system works great in evaluation but sucks in production. What gives?

A: Welcome to software engineering. Production has delightful surprises that evaluation doesn't:

  • 50 users hitting it simultaneously (not your single-threaded test)
  • Memory pressure making vector search slow as molasses
  • Network timeouts when the API decides to take a coffee break
  • Data that changed since you created your evaluation set
  • Users asking follow-up questions that assume context

Fix the gap:

  • Load test with realistic traffic (not just one query at a time)
  • Test with production memory/CPU limits
  • Update your evaluation data monthly, not yearly
  • Monitor real metrics, not just periodic evaluation scores

Q: How do I evaluate conversations instead of single questions?

A: Standard RAG evaluation assumes each query is independent. Conversations have context, memory, and can go completely off the rails.

What matters in conversations:

  • Context continuity: Does it remember what we were talking about?
  • Coherence: Do responses make sense given what happened before?
  • Memory management: Does it remember important stuff and forget irrelevant details?

Implementation approach:

def evaluate_conversation(conversation_history):
    """Evaluate multi-turn conversation quality"""
    
    # Evaluate each turn considering full context
    for i, turn in enumerate(conversation_history):
        context = conversation_history[:i]  # Previous turns as context
        current_query = turn['query']
        current_response = turn['response']
        
        # Evaluate response quality given conversation context
        # (evaluate_with_context is a placeholder - e.g. wrap RAGAS with the prior turns
        #  prepended to the question so the judge sees what the model saw)
        turn_quality = evaluate_with_context(
            query=current_query,
            response=current_response, 
            conversation_context=context
        )
        
        turn['quality_scores'] = turn_quality
    
    return conversation_history

Q: My metrics disagree with each other. Which one is lying?

A: Probably neither - they're measuring different kinds of failure. High faithfulness + low relevancy = accurately answering the wrong question. Low faithfulness + high relevancy = bullshitting about the right topic.

Decode the chaos:

  • High faithfulness, low relevancy: Your retrieval sucks, getting wrong docs
  • Low faithfulness, high relevancy: Your LLM is hallucinating but staying on topic
  • High precision, low recall: Too picky, missing good answers
  • Low precision, high recall: Returning too much garbage

Pick your poison: Customer support? Relevancy wins (help users). Legal stuff? Faithfulness wins (don't get sued). Know your priorities.

Bottom line: Users don't care about your metrics. They care about getting answers that help them do their job. Perfect evaluation scores on academic benchmarks mean nothing if users think your system is useless.

Figure out what "good" means for your users, measure that, and optimize for user happiness over metric perfection.
