The AI Eval Gate Cheat Sheet
Most AI projects die in the gap between "works in the demo" and "works in production."
The most dangerous bug in a RAG system is the answer that looks right.
A model can produce a response that is true to its training data but unsupported by the documents you actually retrieved. It reads perfectly. The user never notices. Your accuracy score never flags it.
One metric catches it: faithfulness.
Does every claim trace back to retrieved context?
Two rules make it work. Measure it with a different model than the one that generated the answer, because nothing is a reliable judge of its own output.
And below 0.70, roughly a third or more of the claims in your responses have no support in the retrieved context, and the system has no business in front of users.
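A minimal sketch of that check, assuming faithfulness is scored as the fraction of an answer's claims that the retrieved context supports. The function names, the `Judge` signature, and the substring-matching `toy_judge` are hypothetical placeholders; in practice the judge would be a call to a separate model, and splitting an answer into claims is left out here.

```python
from typing import Callable, Sequence

# From the text: below this score, the system should not be in front of users.
STOP_THRESHOLD = 0.70

# Hypothetical judge signature: (claim, retrieved_context) -> True if supported.
Judge = Callable[[str, Sequence[str]], bool]


def check_judge_independence(generator_model: str, judge_model: str) -> None:
    """Rule 1: the judge must not be the model that wrote the answer."""
    if generator_model == judge_model:
        raise ValueError("judge model must differ from the generator model")


def faithfulness(claims: Sequence[str], context: Sequence[str], judge: Judge) -> float:
    """Fraction of claims the judge finds supported by the retrieved context."""
    if not claims:
        raise ValueError("no claims to score")
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


def passes_faithfulness_gate(score: float) -> bool:
    """Rule 2: anything below the stop threshold fails the gate."""
    return score >= STOP_THRESHOLD


def toy_judge(claim: str, context: Sequence[str]) -> bool:
    """Stand-in for a real judge model: naive case-insensitive substring match."""
    return any(claim.lower() in doc.lower() for doc in context)
```

With one retrieved document and two claims, only one of which appears in that document, `faithfulness` returns 0.5 and the gate fails. The model-name comparison is only a guard against misconfiguration; it does not make the judge any more accurate.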
That is one gate. There are three, each with a continue, refine, or stop threshold. All three are on the cheat sheet below:


