RAG evaluation works great until Karen from accounting shows up at 2pm with "cant login help me urgent!!!". Then everything goes to shit because she doesn't type like your beautiful test dataset.
Your evaluation dataset has questions like "What is the company's refund policy for digital products?"
Karen types "cant get mony back wtf???" and Dave from sales sends "refund pls broken app im on a call" at 7:23am on a Tuesday.
RAGAS faithfulness scores tell you if the AI accurately summarizes retrieved text. They don't tell you if that text answers the user's actual fucking question. I've watched systems get 0.9 faithfulness while users rage quit because they got perfect answers to questions they never asked.
What Actually Breaks RAG Systems
Nobody tests with actual user behavior. Evaluation datasets use perfect grammar. Users type like they're ordering pizza at 2am after three beers with autocorrect disabled.
Example that made me want to quit:
- Evaluation: "How can I troubleshoot authentication issues with the API?"
- Production: "api thingy broken cant login help???" (sent from CEO's personal Slack at 7am Sunday)
Standard metrics are fucking useless on their own. RAGAS faithfulness measures whether the AI summarizes the retrieved text accurately. It doesn't measure whether that text was relevant to the question in the first place.
I've seen this disaster multiple times:
- User asks: "how much does premium cost"
- System retrieves: Enterprise security documentation
- System responds: Perfect summary of security features
- Faithfulness score: 0.92 (great!)
- User satisfaction: Zero (completely wrong answer)
The system gets rewarded for being accurately wrong.
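Here's roughly what I mean, as a sketch rather than any framework's actual metric. It uses sentence-transformers and embedding cosine similarity as a crude stand-in for the LLM-judge scoring the real metrics do, but it shows the gap: score query-to-context relevance separately from answer-to-context faithfulness, and gate on both.

```python
# Rough sketch: score relevance (query vs. retrieved text) separately from
# faithfulness (answer vs. retrieved text). Embedding cosine similarity is a
# crude proxy for the LLM-judge metrics real frameworks use, but it exposes
# the failure mode: high faithfulness + low relevance = accurately wrong.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

def score(query: str, retrieved: str, answer: str) -> dict:
    q, r, a = model.encode([query, retrieved, answer])
    return {
        "relevance": float(util.cos_sim(q, r)),     # did we retrieve the right thing?
        "faithfulness": float(util.cos_sim(a, r)),  # did the answer stick to it?
    }

s = score(
    query="how much does premium cost",
    retrieved="Our enterprise tier includes SSO, audit logs, and SOC 2 compliance.",
    answer="Enterprise includes SSO, audit logs, and SOC 2 compliance.",
)
# Expect faithfulness to look fine and relevance to be mediocre; gate on BOTH.
print(s)
```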
Production documents are a fucking nightmare. Dev testing uses pristine PDFs and clean markdown. Production hits you with:
- PDFs where tables turn into garbled text soup that looks like a cat walked on the keyboard
- Word docs with encoding that turns smart quotes into â€™ garbage or question marks (thanks, Windows-1252)
- Web scraping that grabs navigation menus instead of content
- Excel "CSV" exports that explode on the first comma in a cell
I burned 6 hours debugging why our system started responding "Click here for more information" to every billing question. Turned out our web scraper was grabbing footer links instead of actual content because some frontend dev updated CSS selectors during a "quick design refresh" and nobody told the backend team. Lost my entire weekend to that trainwreck.
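A cheap defense is a pre-ingestion sanity check before anything hits the vector store. This is a rough sketch; the regexes and thresholds are invented, so tune them against your own corpus:

```python
import re

# Crude checks for the failure modes above. Thresholds are guesses; tune them.
MOJIBAKE = re.compile("â€|Ã©|\ufffd")  # classic UTF-8 / Windows-1252 debris
NAV_JUNK = re.compile(r"(click here|privacy policy|cookie settings|©\s*\d{4})", re.I)

def looks_broken(chunk: str) -> list[str]:
    problems = []
    if MOJIBAKE.search(chunk):
        problems.append("encoding debris (mojibake)")
    if len(NAV_JUNK.findall(chunk)) >= 2:
        problems.append("looks like nav/footer boilerplate, not content")
    letters = sum(c.isalpha() for c in chunk)
    if chunk and letters / len(chunk) < 0.5:
        problems.append("mostly non-letter characters (garbled table?)")
    if len(chunk.split()) < 5:
        problems.append("too short to be a useful chunk")
    return problems

for chunk in ["Click here | Privacy Policy | © 2024 Acme", "Itâ€™s easy to request a refund."]:
    print(chunk[:40], "->", looks_broken(chunk) or "ok")
```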
Three Things That Actually Matter in Production
Users type like shit. Your evaluation uses "What are the system requirements for the enterprise plan?" Users type "requirements???" and expect it to work.
I learned this the expensive way when our support RAG started serving up installation guides for "billing" questions, and three customers canceled thinking we didn't have billing support. Turns out nobody tested with one-word queries because our evaluation dataset was too fucking polite. Oops.
Test with:
- Typos and misspellings ("seperate" not "separate")
- One-word queries ("billing", "refund", "broken")
- Drunk user language ("thing no work why")
- Questions that assume context ("how much?" without saying what)
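You don't need real logs to start roughing up your eval set. Here's a quick sketch that mangles polite questions into the kinds of queries above; the transforms are ad hoc, so swap in whatever abuse your own logs actually show:

```python
import random

# Rough sketch: turn clean eval questions into the mess real users send.
def typo(word: str, rng: random.Random) -> str:
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]   # swap two letters

def mess_up(question: str, rng: random.Random) -> list[str]:
    words = question.lower().rstrip("?").split()
    keyword = max(words, key=len)
    return [
        " ".join(typo(w, rng) if len(w) > 6 else w for w in words),  # misspellings
        keyword + "???",                                             # one-word query
        " ".join(rng.sample(words, min(3, len(words)))),             # word salad
        keyword + " no work why",                                    # missing context
    ]

rng = random.Random(42)
for v in mess_up("What are the system requirements for the enterprise plan?", rng):
    print(v)
```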
Most systems hallucinate instead of admitting ignorance. Retrieval breaks constantly:
- Document chunking splits sentences in the worst possible places
- Embeddings match on random words instead of meaning
- Vector databases return garbage when they can't find good matches
I'd rather have a system that says "I don't know" than one that confidently explains how to cancel a subscription when the user asked about billing.
Users prefer honest ignorance over confident bullshit.
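The fix doesn't have to be clever. Here's a sketch of a refusal gate; the 0.35 threshold is invented, so calibrate it on queries you know your corpus can't answer:

```python
# Rough sketch of a refusal gate: if the best retrieval score is weak, say so
# instead of letting the LLM improvise.
THRESHOLD = 0.35  # invented number; calibrate on queries with no good answer
REFUSAL = "I don't know. I couldn't find anything in the docs that answers that."

def answer_or_refuse(query, retriever, generate):
    hits = retriever(query)                     # list of (score, chunk), best first
    if not hits or hits[0][0] < THRESHOLD:
        return REFUSAL
    context = "\n\n".join(chunk for _, chunk in hits[:3])
    return generate(query, context)             # your existing prompt + LLM call

# Toy demo: retrieval finds nothing relevant, so we refuse instead of bluffing.
print(answer_or_refuse(
    "thing no work why",
    retriever=lambda q: [(0.12, "Enterprise SSO setup guide")],
    generate=lambda q, ctx: "confident nonsense",
))
```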
You need debugging tools or you're fucked when things break. Users will report "the AI is broken" and you need to figure out:
- What documents got retrieved (probably the wrong ones)
- Why the embeddings thought those were relevant
- What the hell the LLM was thinking
- How much this failure is costing you
Most frameworks give you a score and leave you to figure out why it sucks. I've debugged RAG failures while the CEO is breathing down my neck at midnight - you want actual logs showing what broke, not philosophical metrics about "semantic similarity".
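The cheapest version is one JSONL trace per request. Sketch below; the field names are invented, so log whatever your stack actually exposes:

```python
import json, time, uuid

# Rough sketch: write one JSON line per request with everything you'll want
# at midnight: what was asked, what got retrieved, what it cost.
def log_trace(query, hits, answer, model, prompt_tokens, completion_tokens,
              started, path="rag_traces.jsonl"):
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": [{"doc_id": doc, "score": round(score, 3)} for score, doc in hits],
        "answer": answer,
        "model": model,
        "latency_s": round(time.time() - started, 2),
        "tokens": {"prompt": prompt_tokens, "completion": completion_tokens},
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["trace_id"]
```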
Framework Reality Check
Every framework sucks at something important:
- RAGAS: Great docs, expensive as hell, slow evaluation
- DeepEval: Fast but docs are shit, you'll be reading source code
- TruLens: Comprehensive debugging but setup takes a week and costs more than your car payment
I've used all three. RAGAS is where most people start and stay because switching frameworks after you've built everything around one is painful.
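For reference, a minimal RAGAS run looks something like this. It's a sketch assuming ragas 0.1.x (imports and column names move between versions) and a configured judge LLM such as an OpenAI key; the dataset row is a made-up example. The point is the same as before: run relevance metrics alongside faithfulness, not faithfulness alone.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One made-up row in the ragas 0.1.x column layout.
ds = Dataset.from_dict({
    "question": ["how much does premium cost"],
    "answer": ["Enterprise includes SSO, audit logs, and SOC 2 compliance."],
    "contexts": [["Our enterprise tier includes SSO, audit logs, and SOC 2 compliance."]],
    "ground_truth": ["Premium is $29 per seat per month."],
})
print(evaluate(ds, metrics=[faithfulness, answer_relevancy, context_precision]))
```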
Synthetic datasets are conference talk bullshit. LLM-generated questions sound like this: "Could you please explain the process for initiating a refund request through the customer portal?"
Real questions from our Slack #help channel: "refund???" followed by "HELLO??" two minutes later.
Synthetic data is useful for exactly one thing: getting started before you have real user queries. After that, use actual production logs or your evaluation is lying to you.
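If you log traces like the sketch earlier, bootstrapping an eval set from them takes a few lines. The thumbs_down flag here is an assumption; use whatever complaint signal you actually have:

```python
import json, random

# Rough sketch: build an eval set from production logs instead of synthetic
# questions. Assumes the JSONL traces sketched above plus a "thumbs_down" flag.
def sample_eval_queries(path="rag_traces.jsonl", n=200, seed=0):
    records = [json.loads(line) for line in open(path)]
    bad = [r["query"] for r in records if r.get("thumbs_down")]       # keep every complaint
    rest = [r["query"] for r in records if not r.get("thumbs_down")]
    rng = random.Random(seed)
    picked = bad + rng.sample(rest, min(max(n - len(bad), 0), len(rest)))
    return picked[:n]
```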
Production monitoring beats development evaluation every time. Dev metrics use perfect test data. Production shows you:
- Users asking about Pokemon in your enterprise SaaS docs
- The system breaking when 50 people hit it simultaneously
- Documents that worked in staging but are corrupted in prod
- API costs that went from $200 to $2000 because someone removed rate limits
I monitor: user thumbs-downs, response times over 10 seconds, queries that return empty results, and API cost spikes like that rate-limit one. These metrics matter more than any faithfulness score ever will.
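Over the same JSONL traces, the daily check is short. The 10-second latency cutoff is the one above; the cost rate is invented, so plug in your model's real pricing and wire the output into whatever alerting you actually use:

```python
import json

# Rough sketch of the daily checks over the traces logged earlier.
def check_traces(path="rag_traces.jsonl", slow_s=10.0, usd_per_1k_tokens=0.01):
    records = [json.loads(line) for line in open(path)]
    slow = sum(1 for r in records if r["latency_s"] > slow_s)
    empty = sum(1 for r in records if not r["retrieved"])
    down = sum(1 for r in records if r.get("thumbs_down"))
    spend = sum(r["tokens"]["prompt"] + r["tokens"]["completion"]
                for r in records) / 1000 * usd_per_1k_tokens
    print(f"{len(records)} requests: {down} thumbs-down, {slow} slow (> {slow_s}s), "
          f"{empty} empty retrievals, ~${spend:.2f} spend")
```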