Discussion about this post

Alireza Rahmani Khalili

The gate thresholds in Stage 2 are the most underappreciated part of this framework. Most teams treat faithfulness as a vibe check (a few example outputs, some nodding in a meeting) rather than as a metric with an explicit cutoff. The consequence is exactly what you describe: systems that pass "SME review" in the loose sense but fail in production because nobody committed to a number.
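To make "committing to a number" concrete, here is a minimal sketch of a gate check. The cutoff value and the function name are hypothetical, chosen for illustration only. They are not the thresholds from the post, and averaging per-query scores is just one possible way to aggregate.

```python
from statistics import mean

# Hypothetical cutoff for illustration; not the post's actual Stage 2 number.
FAITHFULNESS_CUTOFF = 0.85

def passes_gate(scores, cutoff=FAITHFULNESS_CUTOFF):
    """Return True only if the mean faithfulness score meets the agreed cutoff."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return mean(scores) >= cutoff

print(passes_gate([0.9, 0.8, 0.95]))  # mean is about 0.883, so the gate passes
print(passes_gate([0.9, 0.7, 0.8]))   # mean is 0.8, so the gate fails
```

The point is less the arithmetic than the fact that the cutoff is written down before anyone looks at the outputs.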

One thing I'd add: the evaluation set composition matters as much as the size. 200 queries drawn from the happy path is not the same as 200 queries that stress the boundaries of your retrieval design. Teams that double the set size at each gate without changing the distribution tend to get confident about the wrong things.
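As a rough sketch of what checking composition could look like, the snippet below assumes each query carries a category tag and compares each category's share against a minimum floor. The category names, the floors, and the 180/15/5 split are all made up for illustration.

```python
from collections import Counter

def composition(queries):
    """Share of the eval set that falls into each category tag."""
    counts = Counter(q["category"] for q in queries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}

def underrepresented(queries, min_share):
    """Categories whose share of the set is below the required floor."""
    shares = composition(queries)
    return sorted(cat for cat, floor in min_share.items()
                  if shares.get(cat, 0.0) < floor)

# Hypothetical 200-query set dominated by the happy path.
eval_set = ([{"category": "happy_path"}] * 180
            + [{"category": "multi_hop"}] * 15
            + [{"category": "out_of_scope"}] * 5)
floors = {"multi_hop": 0.2, "out_of_scope": 0.1}  # assumed minimum shares

print(underrepresented(eval_set, floors))      # both stress categories are flagged
print(underrepresented(eval_set * 2, floors))  # doubling to 400 changes nothing
```

Doubling the set while keeping the same mix leaves every share unchanged, which is why size alone does not fix the blind spots.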

I'm writing about the correctness side of this in my newsletter, specifically how systems degrade silently in production after passing all three gates, if you're interested. The eval lifecycle gets you to launch. Staying correct post-launch is a different problem.
