<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The AI Runtime: Vertical Agents]]></title><description><![CDATA[How AI agents and systems actually get deployed in regulated industries — healthcare, financial services, mortgage, aviation. Reference implementations and field observations from the inside]]></description><link>https://theairuntime.com/s/vertical-agents</link><image><url>https://substackcdn.com/image/fetch/$s_!Z6cH!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a5b0f45-2e91-43c7-a826-8c934d562a69_800x800.png</url><title>The AI Runtime: Vertical Agents</title><link>https://theairuntime.com/s/vertical-agents</link></image><generator>Substack</generator><lastBuildDate>Mon, 18 May 2026 11:10:21 GMT</lastBuildDate><atom:link href="https://theairuntime.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Kranthi Manchikanti]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aiengineerweekly@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aiengineerweekly@substack.com]]></itunes:email><itunes:name><![CDATA[The AI Runtime]]></itunes:name></itunes:owner><itunes:author><![CDATA[The AI Runtime]]></itunes:author><googleplay:owner><![CDATA[aiengineerweekly@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aiengineerweekly@substack.com]]></googleplay:email><googleplay:author><![CDATA[The AI Runtime]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Brain Isn’t the LLM: How HockeyStack Built Revenue Agents]]></title><description><![CDATA[HockeyStack just raised $50M to scale a vertical agent 
platform whose reasoning engine is a custom ML pipeline &#8212; not a frontier model. Why that matters for anyone building agents.]]></description><link>https://theairuntime.com/p/the-brain-isnt-the-llm-how-hockeystack</link><guid isPermaLink="false">https://theairuntime.com/p/the-brain-isnt-the-llm-how-hockeystack</guid><dc:creator><![CDATA[The AI Runtime]]></dc:creator><pubDate>Tue, 12 May 2026 11:03:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!y_5I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p><strong>TL;DR</strong> - HockeyStack closed <a href="https://www.prnewswire.com/news-releases/hockeystack-raises-50m-to-build-revenue-agents-for-the-enterprise-302742217.html">$50M from Bessemer Venture Partners, Y Combinator, and Uncorrelated Ventures</a> to scale Revenue Agents &#8212; autonomous AI agents that work every deal and account 24/7 across new business, prospecting, and expansion. The interesting architectural choice: HockeyStack&#8217;s reasoning engine is not a frontier LLM. It is a <a href="https://www.hockeystack.com/">proprietary ML model called the Blueprint</a> that reverse-engineers each customer&#8217;s winning sales process from their event data. The LLM sits downstream as the execution and natural language layer. 
If you are designing a vertical agent, HockeyStack is the cleanest public example of an &#8220;ML brain, LLM executor&#8221; architecture &#8212; the inverse of what most teams ship.</p></div><h2>What HockeyStack Actually Sells</h2><p>HockeyStack started in 2021 as a B2B revenue analytics and attribution platform &#8212; the kind of tool that stitches Salesforce, HubSpot, ad platforms, Gong, and product data into one buyer journey so a CMO can answer &#8220;which campaign actually drove pipeline?&#8221; The founders &#8212; Emir Atl&#305;, Arda Bulut, and Bu&#287;ra G&#252;nd&#252;z, the CEO &#8212; dropped out of college in Turkey, went through Y Combinator, and built the company into a Series A attribution vendor.</p><p>That is the company HockeyStack used to be. The company they are now is something different.</p><p>In April 2026, HockeyStack announced a $50M raise and the launch of &#8220;Revenue Agents for the Enterprise.&#8221; The pitch: per-deal autonomous agents that monitor every live opportunity against a learned pattern of how the customer&#8217;s own top reps win, execute the next-best action, and loop in the human rep when judgment is required. 
The customer list spans <a href="https://www.prnewswire.com/news-releases/hockeystack-raises-50m-to-build-revenue-agents-for-the-enterprise-302742217.html">Fortune 100 revenue teams</a> including 8x8, AppsFlyer, Outreach, Yext, and Sendoso, with over 300 customers reached in under two years.</p><p>This is a category bet: HockeyStack is positioning Revenue Agents as a new product category sitting alongside (or above) attribution, CRM, and revenue intelligence. The bet is architectural, and it is the part worth studying.</p><h2>The Blueprint Is the Brain</h2><p>The single most useful sentence on HockeyStack&#8217;s site is in their description of the platform: agents follow a &#8220;<a href="https://www.hockeystack.com/">validated, data-grounded process</a>.&#8221; Read past the marketing voice and notice what is <em>not</em> being claimed. The agent is not reasoning from first principles each turn. It is not asking an LLM &#8220;what should I do next on this deal?&#8221; and trusting whatever comes back. 
It is executing against a <em>blueprint</em> &#8212; a learned, structured representation of the customer&#8217;s winning sales process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y_5I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y_5I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!y_5I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!y_5I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!y_5I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y_5I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;qFHgg9Aq&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="qFHgg9Aq" title="qFHgg9Aq" srcset="https://substackcdn.com/image/fetch/$s_!y_5I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!y_5I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!y_5I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!y_5I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11567a4f-a6e3-4264-8326-69a4547ec13b_1376x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The Blueprint is HockeyStack&#8217;s proprietary ML model. Per their own description, it is built by analyzing every won and lost deal, every touchpoint, and every signal in the customer&#8217;s data to surface specific, validated patterns. Each Blueprint is unique to a revenue motion or business unit and updates continuously as new deals close and market conditions shift.</p><p>Crucially, the Blueprint is not a fine-tuned LLM. It is described as a <a href="https://www.hockeystack.com/">machine learning model that continuously learns on new outcomes</a> &#8212; an event-chain pattern-mining pipeline trained on the customer&#8217;s own deal history. The LLM enters the picture downstream: surfacing tasks in natural language to reps, generating outreach copy, and handling the human-facing surface. The reasoning about what <em>should</em> happen on a deal is the Blueprint&#8217;s job.</p><p>This inverts the dominant pattern in AI agent products. 
Most &#8220;AI for X&#8221; startups treat a frontier LLM as the reasoning engine and bolt on retrieval, tools, and memory around it. HockeyStack treats a domain-specific ML pipeline as the reasoning engine and uses the LLM as the execution and language layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zf4b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zf4b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png 424w, https://substackcdn.com/image/fetch/$s_!zf4b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png 848w, https://substackcdn.com/image/fetch/$s_!zf4b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png 1272w, https://substackcdn.com/image/fetch/$s_!zf4b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zf4b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png" width="843" height="791" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f525a723-36e1-4e40-ac13-08f4268debe8_843x791.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:791,&quot;width&quot;:843,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55185,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineerweekly.substack.com/i/196073721?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zf4b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png 424w, https://substackcdn.com/image/fetch/$s_!zf4b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png 848w, https://substackcdn.com/image/fetch/$s_!zf4b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png 1272w, https://substackcdn.com/image/fetch/$s_!zf4b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525a723-36e1-4e40-ac13-08f4268debe8_843x791.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Detail belongs in the prose, not the diagram. Three components carry the real weight.</p><h2>Atlas: The Event-Based Substrate</h2><p>Most CRMs are record-based: a deal is a row, with fields. HockeyStack&#8217;s foundation, called Atlas, is event-based: every interaction is a timestamped event resolved to one identity graph. Per their own product page, Atlas unifies every interaction into a single event-based timeline with full identity resolution &#8212; CRM, outreach sequences, call recordings, web activity, and the data warehouse all resolved to one time-stamped source of truth.</p><p>This matters because the Blueprint cannot mine winning patterns from flattened CRM fields. 
As <a href="https://www.contentgrip.com/hockeystack-revenue-agents/">contentgrip&#8217;s coverage of the raise observed</a>, many meaningful buyer and seller signals are inherently event-like &#8212; web activity, product usage, conversation outcomes, buying-committee changes &#8212; and when those signals get flattened into static fields, teams lose the sequence, timing, and causality that define a winning play. An event-based model preserves them.</p><p>For builders, the lesson is upstream of agent design: if your reasoning layer needs sequence and causality (and most consequential agent decisions do), your data layer has to preserve them. You cannot retrofit event semantics onto a record-based store after the fact without losing fidelity.</p><h2>Revenue Agents: Per-Deal, Always-On</h2><p>The agent layer is where the Blueprint gets executed. HockeyStack&#8217;s framing: dedicated agents monitor every deal and account, execute the right moves autonomously, and flag risks, with individual Revenue Agents assigned to each deal and account, operating around the clock.</p><p>Concrete agent behaviors HockeyStack has shipped, per their <a href="https://www.hockeystack.com/agents">agents page</a>: identifying missing stakeholders and triggering outreach to unblock deals, detecting competitor dissatisfaction signals and launching displacement outreach, redistributing account attention based on revenue risk, and identifying when messaging stops converting. Each behavior is an instance of &#8220;deal deviates from the Blueprint pattern &#8594; agent acts.&#8221;</p><p>The reps interact with this through a surface called the Rep Cockpit &#8212; a daily workspace where agents surface direct tasks with reasoning. Senior leaders get separate Manager views for coaching and pipeline forecasting. 
This shape &#8212; agent surfaces work, human reviews and acts &#8212; is the same shape <a href="https://theairuntime.com/p/felix-is-a-harness-not-a-model-how">Rogo&#8217;s Felix</a> uses with email as the substrate. Different surface, same async-handoff pattern.</p><p>HockeyStack also describes a <a href="https://www.hockeystack.com/blog-posts/everything-you-need-to-know-about-ai-agents-for-gtm-teams-top-10-solutions">multi-agent orchestration model</a>: one agent retrieves data, another runs analysis, and a third validates the output before the user sees it. The validator step is doing real work &#8212; it is the guardrail that catches the LLM hallucinating a stakeholder or fabricating an account fact before that error propagates into a rep&#8217;s outreach.</p><h2>The Reverse-Engineering Bet</h2><p>There is a strong claim underneath all of this, and HockeyStack states it plainly: <a href="https://salesenablement.wordpress.com/2026/04/17/hockeystack-revenue-agents-ai-agents-that-clone-your-top-reps-to-help-everyone-at-scale/">your top performers run plays that live in their heads, and the Blueprint finds and deploys them across your entire team</a>. The bet is that &#8220;what your best rep does&#8221; is a pattern recoverable from the event stream &#8212; not just tribal knowledge.</p><p>This is non-obvious. Sales has been resistant to standardization because the tacit-to-explicit conversion loses something. Whether HockeyStack&#8217;s pattern mining actually captures what the best reps do, or just captures the surface signals correlated with their wins, is the empirical question that will determine whether this category sticks. As one industry analyst noted in coverage of the raise, enterprises will look for clear proof that an event-based architecture improves forecast accuracy, sales productivity, or expansion conversion &#8212; not just that it produces more data. That bar has not been independently cleared yet.</p><p>But it is the right bet to be making. 
If the architecture works, the moat is significant: every customer&#8217;s Blueprint is a one-of-one asset trained on their data, is hard to rip out, and gets better as it ingests more deals.</p><h2>Two Architectures for Vertical Agents</h2><p>It is worth naming the two patterns explicitly, because they map cleanly onto a choice every vertical-agent builder is now making.</p><p><strong>Pattern A &#8212; Frontier LLM as brain, harness around it.</strong> The reasoning engine is a frontier model. The vertical work is in the harness: tool layer, evals, output formatters, audit trail, data integrations. When a better frontier model ships, you swap the engine. Examples: most agentic platforms today, including the agent harness several finance and legal AI companies have publicly described.</p><p><strong>Pattern B &#8212; Domain ML as brain, LLM as executor.</strong> The reasoning engine is a custom ML pipeline trained on customer data. The LLM handles natural language interfaces, generation, and tool calling. The vertical work is in the data pipeline, the pattern model, and the per-customer training loop. HockeyStack is the clearest public example.</p><p>Neither is universally right. Pattern A is faster to ship, benefits automatically from frontier-model gains, and is easier to swap. Pattern B is more defensible if your domain has rich event data and recoverable patterns, and it gives you deterministic behavior the LLM cannot match.</p><p>In <a href="https://theairuntime.com/p/model-reliability-engineering-who">Model Reliability Engineering</a> terms: Pattern A invests heavily in Harness Engineering. Pattern B invests heavily in Context Engineering, taken to its logical extreme &#8212; the context isn&#8217;t just retrieved, it&#8217;s mined and structured into a deterministic decision pattern before the LLM ever runs.</p><h2>What&#8217;s Actually Being Transformed</h2><p>Sales orgs do not get replaced; their middle gets compressed. 
The classic problem HockeyStack is targeting &#8212; the best rep closes 2-3x more than the median, and nobody knows why &#8212; has been a fixture of sales leadership for thirty years. The traditional response was process documentation, MEDDIC training, and rep shadowing, and it did not close the gap because tacit knowledge resists capture.</p><p>If Revenue Agents work as advertised, what changes is not headcount; it is the variance band. New reps execute closer to top-quartile from week one because the agent surfaces the next move. Top reps spend less time on context-stitching (one HockeyStack customer testimonial cites three hours a day of cross-tool data wrangling eliminated, though this is vendor-curated and worth treating as directional rather than benchmarked) and more time on the relationship work that actually requires a human. Managers run pipeline reviews against a model rather than vibes.</p><p>The honest caveat: this is the <em>promise</em>. As of April 2026, the public evidence is the customer list, the funding round, and HockeyStack&#8217;s own product descriptions. Independent benchmarks of forecast-accuracy lift or expansion-conversion lift do not yet exist publicly. Buyers in this space should ask for them.</p><h2>Five Lessons If You Are Building a Vertical Agent</h2><ol><li><p><strong>Decide which brain you are building.</strong> Pattern A and Pattern B are different companies with different moats. Pick deliberately, not by default.</p></li><li><p><strong>Event-based data preserves causality. Record-based data destroys it.</strong> If your agent needs to reason about <em>why</em> something happened, your substrate has to keep the sequence.</p></li><li><p><strong>The validator agent is doing real work.</strong> Multi-agent orchestration with a dedicated check step is a cheap way to cut hallucination risk before output reaches the user.</p></li><li><p><strong>Per-customer learning is a moat. 
Per-customer training is hard.</strong> A model that gets better as the customer uses it is structurally defensible &#8212; but only if you can run that loop without ongoing human curation.</p></li><li><p><strong>Async surfaces beat new UIs.</strong> HockeyStack&#8217;s Rep Cockpit and Manager views, like Rogo&#8217;s email interface, surface agent work where the user already lives. Adoption follows the path of least friction.</p></li></ol><h2>What to Do This Week</h2><p>Pick a workflow you have watched a domain expert do &#8212; one with rich, structured signals leading up to the decision. Now ask: could a small ML model trained on past instances of this workflow predict the right next action better than an LLM prompted with the same context?</p><p>If yes, you have a candidate for Pattern B. The investment is in the data pipeline and the model, not the prompt.</p><p>If no &#8212; if the signals are sparse, unstructured, or judgment-dominated &#8212; you are in Pattern A territory, and your work is in the harness around the frontier model.</p><p>The mistake to avoid is the third pattern: a thin LLM wrapper that pretends to be either. That is the architecture that gets disrupted next quarter when the next frontier model ships and removes whatever differentiation the wrapper claimed.</p>
]]></content:encoded></item><item><title><![CDATA[How MIT’s ScienceClaw Runs Hundreds of AI Agents Without a Central Planner]]></title><description><![CDATA[MIT&#8217;s open-source agent swarm replaces the orchestrator with an artifact reactor. The architecture is worth studying even if you&#8217;ll never build a science swarm.]]></description><link>https://theairuntime.com/p/how-mits-scienceclaw-runs-hundreds</link><guid isPermaLink="false">https://theairuntime.com/p/how-mits-scienceclaw-runs-hundreds</guid><dc:creator><![CDATA[The AI Runtime]]></dc:creator><pubDate>Mon, 11 May 2026 11:04:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kKaR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR</strong> - On March 15, 2026, a team led by MIT&#8217;s Markus Buehler released ScienceClaw + Infinite, an open-source framework where autonomous AI agents conduct scientific research across <a href="https://github.com/lamm-mit/scienceclaw">a registry of more than 300 interoperable tools</a>. The system is Apache 2.0-licensed and built around a coordination pattern most production multi-agent systems don&#8217;t use: there is no central planner. Agents broadcast unsatisfied research needs into a shared index, peer agents pick those needs up via schema-overlap matching, and a component called the ArtifactReactor uses pressure-based scoring to bias the swarm toward high-impact directions. 
Every computation produces an immutable, content-hashed artifact with explicit parent lineage, accumulating in a directed acyclic graph. The repository is research-grade &#8212; five GitHub stars, four contributors, fifty-five commits as of early May 2026 &#8212; so this is not a drop-in production system. But the coordination pattern is what to take from it. If you are building multi-agent systems where the planner has become a brittle bottleneck, ScienceClaw shows what plannerless coordination via a typed-artifact substrate looks like in practice. Read the paper, skim the repo, port the patterns.</p><div><hr></div><h2>What ScienceClaw actually is</h2><p>ScienceClaw + Infinite is an open-source multi-agent framework, <a href="https://arxiv.org/abs/2603.14312">released by MIT&#8217;s Laboratory for Atomistic and Molecular Mechanics in March 2026</a>, where autonomous AI agents conduct scientific investigations across a catalog of more than 300 tools. Agents coordinate without a central scheduler: they broadcast unmet research needs and peer agents fulfill them through schema-matching on artifact types.</p>
<p>The system has three named components: an extensible registry of scientific skills, an artifact layer that preserves full computational lineage as a directed acyclic graph (DAG), and the Infinite platform &#8212; a structured space for agent-based scientific discourse with provenance-aware governance. The stack runs on top of OpenClaw, requires Node.js &#8805; 22 and Python &#8805; 3.8, and supports multiple LLM backends including Anthropic, OpenAI, and Hugging Face models alongside the default OpenClaw runtime. Once installed, agents run as a 4-hour heartbeat daemon &#8212; <code>scienceclaw-heartbeat.service</code> &#8212; that periodically scans for sessions to join, needs to fulfill, and findings to validate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kKaR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kKaR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!kKaR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png 848w, 
https://substackcdn.com/image/fetch/$s_!kKaR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!kKaR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kKaR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png" width="1376" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;0_Jawuge&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="0_Jawuge" title="0_Jawuge" srcset="https://substackcdn.com/image/fetch/$s_!kKaR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!kKaR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png 848w, 
https://substackcdn.com/image/fetch/$s_!kKaR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!kKaR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1848dbf-8152-4317-a13e-09091bddc33c_1376x768.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The paper presents four autonomous investigations: peptide design for the somatostatin receptor SSTR2, lightweight 
impact-resistant ceramic screening, cross-domain resonance bridging biology, materials, and music, and formal analogy construction between urban morphology and grain-boundary evolution. The cross-domain resonance investigation produced a concrete output: a <em>de novo</em> Hierarchical Ribbed Membrane Lattice that, when validated with 3D finite-element analysis, resonates at 2.116 kHz and exhibits nine elastic modes in the 2&#8211;8 kHz band &#8212; relevant to acoustic filtering and bio-inspired sensing. Buehler reports that no human directed the cross-domain mapping, the gap identification, or the design generation.</p><h2>The plannerless coordination loop</h2><p>Most production multi-agent frameworks are orchestrator-based. A planner LLM decomposes the user&#8217;s request into subtasks, assigns them to agents, and either supervises execution or rewires the plan as new information arrives. AutoGen, CrewAI, and most LangGraph patterns sit in this family. The orchestrator is the throat through which all coordination flows.</p><p>ScienceClaw inverts this. There is no planner. Coordination emerges from three primitives: typed artifacts produced by every computation, a global index where agents broadcast unsatisfied information needs, and pressure-based scoring that biases attention toward high-impact directions.</p><p>The mechanic is straightforward. When an agent produces an artifact &#8212; say, a list of candidate peptide sequences &#8212; it is wrapped as an immutable, content-addressed record with typed metadata and parent lineage, then dropped into a shared store. When that agent hits a question it cannot answer with its own skills &#8212; say, ADMET prediction &#8212; it broadcasts the unmet need into the global index. 
Peer agents scan this index through the ArtifactReactor during their own heartbeat cycles, pick up matching needs, run the fulfilling skill, and post the result as another comment on the same Infinite thread. The thread becomes a growing, traceable conversation between agents that never explicitly assigned each other tasks. Schema-overlap matching does the routing: when one agent posts an artifact whose schema is a downstream input for another agent&#8217;s skill, the second agent detects the match implicitly.</p><p>If the pattern feels familiar, that is because it is. This is a modern blackboard architecture &#8212; the 1970s-era pattern where multiple knowledge sources read from and write to a shared substrate &#8212; re-implemented for typed LLM agents. <a href="https://www.linkedin.com/in/markus-j-buehler-2245682/">Buehler describes it categorically</a> as a pullback in category theory: distinct domains (biology, metamaterials, music) become categories of objects, the shared feature space is a functor, and the ArtifactReactor&#8217;s schema-overlap matching behaves like the universal object connecting them. That is a fancier way to say <em>agents see each other through types, not orchestration.</em></p><h2>Why this matters: where orchestrators break</h2><p>Orchestrator-based multi-agent systems work well when the work is well-specified, the agent set is small and stable, and the planning context fits. They fall apart in the opposite regime.</p><p>As agent counts grow, the planner&#8217;s context bloats with state about every agent&#8217;s capabilities, current task, intermediate outputs, and dependencies. Plans get longer, the planner&#8217;s reasoning gets shallower per step, and small misroutings compound. Adding a new agent means changing the planner&#8217;s prompts or fine-tuning. Removing one means dependency repair. 
The planner becomes the channel through which all coordination passes &#8212; and the single point of contention.</p><p>Plannerless coordination shifts the harness. Instead of encoding routing in a planner&#8217;s prompts, ScienceClaw encodes it in the substrate: typed artifacts, schema matches, and pressure scores. Agents see each other through what they produce and what they need, not through a central agenda. An autonomous mutation layer prunes the expanding artifact DAG to resolve conflicting or redundant workflows, and persistent memory lets agents build on prior epistemic states across cycles. The result is an architecture that scales by addition: contribute an agent, contribute a skill, the swarm reorganizes around it without rewiring.</p><p>There is a second consequence worth pulling out. Every computation in ScienceClaw produces an immutable artifact with explicit parent lineage, accumulating in a directed acyclic graph that preserves the full provenance of every discovery. Provenance is what production AI teams typically bolt on as observability &#8212; a tracing layer wrapped around an existing system. Here it is the substrate. The DAG is the coordination medium <em>and</em> the audit log. You cannot have one without the other.</p><h2>How agents actually select tools</h2><p>The headline question for engineers reading this: how do agents decide which tools to call?</p><p>ScienceClaw&#8217;s answer is that there is no domain-to-tool routing table. The LLM analyzes the topic and selects three to five skills from the full catalog, with skills auto-discovered from the <code>skills/</code> directory. The README is explicit: <em>&#8220;No hardcoded domain &#8594; tool mapping &#8212; selection adapts to any research question.&#8221;</em> Add a skill folder with a <code>SKILL.md</code> and the catalog picks it up.</p><p>The catalog spans roughly fifteen tool families covering the working set of a modern computational research lab. 
Sequence and structural biology are represented by BLAST, UniProt, and PDB; literature by PubMed and ArXiv; cheminformatics by PubChem, ChEMBL, RDKit, and TDC; materials by the Materials Project and NIST WebBook; plus general-purpose web search and data visualization. Each is a thin Python wrapper that exposes a uniform invocation surface. Agents reason about which skills apply, chain them, and produce artifacts at every step.</p><p>There is a separate, smaller decision the system makes at the social layer: role assignment. ScienceClaw exposes five roles &#8212; investigator, validator, critic, synthesizer, and screener &#8212; assigned based on skills and personality during session joining. Investigators explore. Validators independently re-verify findings using different tools. Critics challenge logic and propose alternatives. Synthesizers integrate disagreements. Screeners parallelize high-throughput work. Upvotes and downvotes require structured reasoning and citations; they are evidence-backed, not sentiment. Disagreement is preserved as <code>validated</code>, <code>challenged</code>, <code>under review</code>, or <code>disputed</code> rather than forced into unanimity.</p><p>This matters for engineers because role-plus-interaction-type is a different shape of coordination than control flow. You are not writing the workflow. 
You are writing the <em>vocabulary</em> the workflow uses to assemble itself.</p><h2>The coordination loop, end to end</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DjHE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DjHE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png 424w, https://substackcdn.com/image/fetch/$s_!DjHE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png 848w, https://substackcdn.com/image/fetch/$s_!DjHE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png 1272w, https://substackcdn.com/image/fetch/$s_!DjHE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DjHE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png" width="813" height="1139" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1139,&quot;width&quot;:813,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65602,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://theairuntime.com/i/197163659?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DjHE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png 424w, https://substackcdn.com/image/fetch/$s_!DjHE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png 848w, https://substackcdn.com/image/fetch/$s_!DjHE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png 1272w, https://substackcdn.com/image/fetch/$s_!DjHE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b6fb81-fd58-4ed2-a3a3-7c132c75e90c_813x1139.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>The co-ordination loop</em></p><p><em>The eight-step coordination loop runs without a central planner. Skill-based discovery, role assignment, and schema matching happen as side effects of the heartbeat &#8212; not as orchestrated control flow. The full loop and its four-layer implementation are documented in <a href="https://github.com/lamm-mit/scienceclaw">the README</a> and <a href="https://arxiv.org/abs/2603.14312">the paper</a>.</em></p><h2>What&#8217;s actually shipped &#8212; and what to be careful about</h2><p>The four investigations in the paper are real and worth reading, but the framing matters.</p><p>The peptide design investigation targeted SSTR2, a somatostatin receptor with established cancer relevance. The lightweight ceramic work was a screening pipeline. 
The cross-domain resonance investigation produced the Hierarchical Ribbed Membrane Lattice with the 2.116 kHz primary mode that I mentioned above, and validated the design with finite-element analysis. The urban-morphology-to-grain-boundary work built a formal analogy between two fields with no prior cross-citation. The paper&#8217;s core empirical claim is that across these four cases, the framework demonstrates heterogeneous tool chaining, emergent convergence among independently operating agents, and traceable reasoning from raw computation to published finding.</p><p>What the paper does not yet show is large-scale cross-institutional coordination. Buehler&#8217;s announcement <a href="https://x.com/ProfBuehlerMIT/status/2033855565160501354">describes ScienceClaw &#215; Infinite as a swarm &#8220;across institutions, labs and the world&#8221;</a>, and the architecture is built for it: anyone can deploy an agent or contribute a skill, and the heartbeat runs 24/7 without a central coordinator. But the four investigations in the paper were produced by Buehler&#8217;s MIT team. The cross-institutional layer is a design property, not a demonstrated outcome &#8212; at least not yet.</p><p>The repo state confirms this is early. Five GitHub stars, four contributors, fifty-five commits at the time of writing. Posting to Infinite requires a minimum of 10 karma, which agents earn through commenting and voting before they can post &#8212; a sensible spam guard, but a reminder that the surrounding social layer is also under construction. There are rate limits: one post per 30 minutes, fifty comments per day, two hundred votes per day. 
This is a research artifact, generously open-sourced, that aligns with <a href="https://www.energy.gov/articles/energy-department-launches-genesis-mission-transform-american-science-and-innovation">the broader DOE Genesis Mission&#8217;s stated goal of doubling the productivity and impact of American science within a decade</a>, but it is not a production system.</p><p>That framing is also the right way to consume it.</p><p>The broader OpenClaw scientific ecosystem this sits inside is itself worth knowing about. A <a href="https://www.biorxiv.org/content/10.64898/2026.03.30.715118v1.full">bioRxiv paper from late March 2026 catalogued 91 projects and 2,230 skills across 34 scientific categories in the OpenClaw scientific agent ecosystem</a>, and ScienceClaw is one of the more architecturally distinct entries. The pattern across the ecosystem &#8212; <a href="https://www.biorxiv.org/content/10.64898/2026.03.30.715118v1.full">skill-based agent design where workflows are expressed as structured Markdown files, lowering the barrier to contribution</a> &#8212; is what makes the substrate-driven coordination model viable at all. Agents do not need to know about each other in advance because the skill catalog and the artifact types form a shared language.</p><h2>What production AI engineers should take from this</h2><p>The patterns transfer even if the framework does not.</p><p><strong>Schema-typed artifacts as a routing primitive.</strong> The most portable idea in ScienceClaw is that the <em>type</em> of an artifact is the routing signal. If an agent produces a <code>peptide_sequences</code> artifact, any agent whose <code>SKILL.md</code> declares <code>peptide_sequences</code> as an input can pick it up. That removes a layer of planner reasoning. 
Production multi-agent systems can adopt this without going fully plannerless: type your intermediate artifacts, expose schemas as inputs and outputs, and let the substrate dispatch.</p><p><strong>Provenance as substrate, not afterthought.</strong> Treat the artifact DAG as the source of truth for both coordination <em>and</em> audit. If your current observability is wrapping logs around an opaque LangGraph state, you are paying twice. ScienceClaw&#8217;s pattern &#8212; content-hashed, immutable, lineage-preserving artifacts dropped into a shared store &#8212; gives you a deterministic replay of any investigation, and the cost is mostly upfront design discipline.</p><p><strong>Roles plus interaction types as coordination semantics.</strong> The investigator/validator/critic/synthesizer split is a coordination pattern, not a UI metaphor. You can implement it on top of any agent framework: tag each agent&#8217;s purpose, define a small interaction-type vocabulary (<code>challenge</code>, <code>validate</code>, <code>extend</code>, <code>synthesize</code>, <code>request_help</code>), and write your prompts to respect those roles. You will find that consensus and disagreement become legible in your traces in a way they typically are not.</p><p><strong>Plannerless is not always the answer.</strong> Orchestrator-based architectures still win when the workload is bounded, the agent set is small, and latency matters. Plannerless coordination has overhead &#8212; the pressure scoring, the schema matching, the heartbeat cadence &#8212; and it works best when the work is open-ended and agents can be added or removed dynamically. Apply it where it fits.</p><p>If you want to experiment with these patterns without adopting ScienceClaw wholesale, the cheapest path is to add a <em>needs board</em> to your existing system. Let one agent post what it cannot do; let peer agents pick those needs up on their own schedule. 
You will learn whether plannerless coordination buys anything for your domain in about a week of work.</p><h2>FAQ</h2><p><strong>Is ScienceClaw production-ready?</strong> No. Five GitHub stars, four contributors, an academic paper from March 2026, and a Vercel-deployed Infinite platform. Treat it as a reference architecture and a research artifact, not a runtime you deploy this quarter.</p><p><strong>How is it different from CrewAI or other frameworks?</strong> Most frameworks use orchestrator-based coordination &#8212; a central agent decomposes work and assigns it. ScienceClaw uses plannerless coordination via the ArtifactReactor: agents broadcast unsatisfied needs and peers fulfill them via schema-overlap matching, without any planner assigning tasks. The closest analogue is a 1970s blackboard architecture, modernized for typed-artifact LLM agents.</p><p><strong>Can I use Claude as the agent backbone?</strong> Yes. <a href="https://github.com/lamm-mit/scienceclaw">The repository documents Anthropic, OpenAI, and Hugging Face as supported LLM backends, with OpenClaw as the default runtime</a>. Setup is via <code>LLM_BACKEND=anthropic</code> and the corresponding API key.</p><p><strong>Does it actually produce real scientific results?</strong> The paper presents four investigations across peptide design, ceramic screening, cross-domain resonance, and urban-morphology analogy, and one of them produced a finite-element-validated metamaterial design with concrete acoustic properties. Whether those count as &#8220;real scientific results&#8221; depends on whether you mean <em>novel publishable findings</em> or <em>experiments still pending wet-lab validation</em>. The framework&#8217;s contribution is the coordination pattern; the scientific outputs are early demonstrations.</p><p><strong>Should I read the paper or the repo first?</strong> The paper for the architecture and the experimental results. 
The repo&#8217;s <code>ARCHITECTURE.md</code> and the multi-agent examples in the README for the implementation patterns. Both fit in an afternoon.</p><h2>Closing</h2><p>The interesting question is not whether ScienceClaw will become the dominant scientific agent platform. It probably will not, on its own. The interesting question is what production AI engineers should port out of it before someone else does.</p><p>Type your artifacts. Make provenance substrate, not observability. Let agents post what they need rather than wait for a planner to figure it out for them. The coordination patterns ScienceClaw demonstrates are old ideas &#8212; blackboard architectures, tuple spaces, content-addressable artifacts &#8212; applied with discipline to the LLM-agent stack. They were good ideas in 1975 and they remain good ideas now.</p><p>If your multi-agent system has a planner that has become the most fragile component in your harness, ScienceClaw is the cleanest open-source reference you can read this month for what the alternative looks like. Read <a href="https://arxiv.org/abs/2603.14312">the paper</a>. Skim <a href="https://github.com/lamm-mit/scienceclaw">the repo</a>. Then go look at the planner in your own system and ask what would happen if you replaced it with a needs board, a type system, and a pressure score.</p>
]]></content:encoded></item><item><title><![CDATA[Auctor’s Bet: Traceability Is the Architecture, Not a Feature]]></title><description><![CDATA[Enterprise software only creates value when it&#8217;s actually deployed, and deployment is overwhelmingly a labor problem, not a software problem.]]></description><link>https://theairuntime.com/p/auctors-bet-traceability-is-the-architecture</link><guid isPermaLink="false">https://theairuntime.com/p/auctors-bet-traceability-is-the-architecture</guid><dc:creator><![CDATA[The AI Runtime]]></dc:creator><pubDate>Sat, 09 May 2026 11:03:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yLEC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p><strong>TL;DR</strong> - Auctor <a href="https://www.getauctor.com/series-a-announcement">emerged from stealth in April 2026 with $20M led by Sequoia Capital</a> to build an &#8220;AI-native system of action&#8221; for the messy, <a href="https://sequoiacap.com/article/services-the-new-software/">~$500B/yr labor market</a> of enterprise software implementation &#8212; the work that actually gets Salesforce, SAP, ServiceNow, or Workday running in a customer&#8217;s environment. Reading their public material with an architect&#8217;s eye, the interesting choice isn&#8217;t the agent loop or the LLM tuning. 
It&#8217;s the bet that <strong>artifact lineage is the load-bearing primitive</strong>: every user story, SoW, design doc, and Jira ticket is anchored in a graph that walks back to the discovery call that originated it. Frontier models are commodity; the project-scoped artifact graph compounds. If you&#8217;re building agentic systems for any domain where decisions accumulate across stakeholders over months &#8212; legal, healthcare RCM, B2B sales, regulated change management &#8212; study this pattern before you architect your context layer.</p></div><h2>A real problem, sized correctly</h2><p>Enterprise software only creates value when it&#8217;s actually deployed, and deployment is overwhelmingly a labor problem, not a software problem. <a href="https://sequoiacap.com/article/services-the-new-software/">Sequoia&#8217;s Julien Bek frames the ratio crisply</a>: every dollar of enterprise software pulls roughly six dollars of services behind it. Across the top ten ecosystems &#8212; ServiceNow, Salesforce, SAP, AWS, and the rest &#8212; that adds up to about nine million implementation consultants and more than half a trillion dollars in annual labor spend, growing at a double-digit pace.</p><p>The work itself is brutal. 
A single deployment can span hundreds of requirements, dozens of stakeholders, and months of negotiation between what a business says it needs and what the platform can actually do. <a href="https://www.bcg.com/publications/2024/most-large-scale-tech-programs-fail-how-to-succeed">BCG&#8217;s 2024 study of more than 1,000 large-scale tech programs</a> found that more than two-thirds miss their time, budget, or scope targets. <a href="https://finance.yahoo.com/news/auctor-raises-20m-led-sequoia-130000020.html">Auctor cites its own statistics</a> in the same vein: 50% of projects miss deadlines, and 1 in 6 exceeds budget by more than 200% &#8212; vendor-cited numbers, but directionally consistent with independent research. The interesting question isn&#8217;t whether implementation is broken. It&#8217;s whether the brokenness is <em>structural</em> &#8212; and if so, where the structural fix actually lives.</p><h2>The architecture</h2><p>Auctor&#8217;s framing is that implementation work is a context coordination problem, not a productivity problem. <a href="https://tercera.io/resources/why-auctor-is-building-the-ai-system-of-action-for-system-integrators/">In a Q&amp;A with Tercera</a>, CEO Will Sun draws a distinction between three categories of enterprise software:</p><ul><li><p><strong>System of record</strong> &#8212; the platform that holds data (CRMs, ERPs).</p></li><li><p><strong>System of work</strong> &#8212; the platform where work happens (Jira, Confluence, Asana).</p></li><li><p><strong>System of action</strong> &#8212; a platform that <em>acts</em> on the data, not just stores or displays it, while preserving the traceability and governance enterprise buyers require.</p></li></ul><p>That last category is where Auctor positions itself. 
It&#8217;s a marketing term, but the technical substance behind it is real: rather than being a chatbot that surfaces documents from your existing systems, the system itself is the substrate where decisions accumulate, artifacts are generated, and downstream tools get synced. The company describes the loop in three layers &#8212; Capture, Contextualize, Create &#8212; which read like marketing copy until you realize each layer corresponds to a non-trivial engineering surface.</p><p>Here is what the artifact graph actually looks like inside an engagement, based on what Auctor has described publicly:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yLEC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yLEC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png 424w, https://substackcdn.com/image/fetch/$s_!yLEC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png 848w, https://substackcdn.com/image/fetch/$s_!yLEC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png 1272w, https://substackcdn.com/image/fetch/$s_!yLEC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!yLEC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png" width="859" height="829" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:829,&quot;width&quot;:859,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47175,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineerweekly.substack.com/i/196073829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yLEC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png 424w, https://substackcdn.com/image/fetch/$s_!yLEC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png 848w, https://substackcdn.com/image/fetch/$s_!yLEC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png 1272w, https://substackcdn.com/image/fetch/$s_!yLEC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9efb027-13f0-45e1-aa0a-c32c3e38112c_859x829.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Artifact graph</em></p><p>The dotted line back to (A) is the actual product. The forward arrows are table stakes &#8212; anything with a decent prompt and a Confluence connector can generate a SoW from a meeting transcript today. The dotted line is where the engineering discipline lives.</p><h2>Layer 1: Capture</h2><p>The capture layer is the ingest plane. 
<a href="https://www.getauctor.com/product">Auctor lists integrations with</a> Google Meet, Microsoft Teams, Zoom, Gong, Outlook, Google Calendar, Slack, Confluence, Google Drive, OneDrive/SharePoint, Salesforce, HubSpot, Jira, Linear, Azure DevOps, Rally, and Certinia. Reading this as a list of features misses the point. Reading it as a topology of where implementation context lives is closer to right.</p><p>The non-obvious move here is that real-time meeting transcription is treated as a first-class source, not a bolt-on. Auctor&#8217;s own materials describe agents that join discovery and refinement calls, transcribe live, and pull context from past projects to steer the conversation. <a href="https://www.getauctor.com/valiantys-case-study">The Valiantys case study</a> describes this concretely: instead of consultants taking manual notes during fifteen-stakeholder discovery sessions and consolidating afterward, requirements, action items, and meeting summaries are produced as the discussion unfolds.</p><p>That sounds modest until you think about what a &#8220;captured requirement&#8221; has to mean to be useful downstream. It has to:</p><ul><li><p>be timestamped and attributed to the speaker who voiced it</p></li><li><p>be tagged with the stakeholder role that gave it weight (PMO, architect, exec)</p></li><li><p>be linked to the meeting and the parent engagement</p></li><li><p>be deduplicated against earlier captures of the same intent</p></li><li><p>carry enough structure to be queryable later by an agent generating a SoW</p></li></ul><p>This is the unglamorous schema work that turns &#8220;transcript + LLM&#8221; into something a delivery team can actually trust. Most teams underestimate how much of this is bespoke and how little of it is solved by a vector store.</p><h2>Layer 2: Contextualize</h2><p>The contextualize layer is where Auctor&#8217;s architectural bet shows up most clearly. 
<a href="https://tercera.io/resources/why-auctor-is-building-the-ai-system-of-action-for-system-integrators/">In Will Sun&#8217;s own words</a>, the very first capability he and his cofounders prototyped &#8212; the one that drew SI leaders in &#8212; was traceability. Not generation. Traceability.</p><p>The mental model he describes: a user story created months into a project should be walkable back to the original requirement, the SoW that scoped it, and the pre-sales conversation where the stakeholder first voiced the need. That walk has to survive consultant turnover, mid-project pod swaps, and the steady loss of &#8220;tribal knowledge&#8221; that afflicts every long engagement.</p><p>There are a few engineering implications worth pulling out:</p><p><strong>The graph is multi-modal.</strong> A node in this graph can be a transcript span, a section of a Word doc, a CRM field, a Jira ticket, a Confluence page, or a Slack message. Edges aren&#8217;t just &#8220;is-related-to&#8221; &#8212; they need to encode causal relationships (this requirement <em>caused</em> this user story to exist) and temporal ones (this requirement was <em>superseded</em> by that decision in last Tuesday&#8217;s call). Few off-the-shelf graph databases handle this cleanly without significant modeling work on top of them.</p><p><strong>Project-scoped retrieval beats global retrieval.</strong> <a href="https://www.getauctor.com/crossfuze-case-study">The Crossfuze case study</a> describes Auctor&#8217;s account- and project-level repositories explicitly: queries are bounded to a defined scope rather than searching across everything the firm has ever ingested. This is a deliberate inversion of the &#8220;one big RAG corpus&#8221; pattern. For implementation work, it&#8217;s almost certainly correct: the consultant answering a question in a SoW review wants context from <em>this</em> engagement, not the closest semantic match across 200 historical projects. 
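</p><p>A minimal sketch of that scoping, assuming a toy in-memory corpus and a word-overlap score standing in for real embedding search. <code>Chunk</code>, <code>retrieve</code>, and the project ids are hypothetical names for illustration, not anything Auctor has published:</p><pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One retrievable piece of context, tagged with its engagement."""
    project_id: str
    text: str


def score(query, text):
    """Toy relevance: number of query words that appear in the text."""
    words = set(query.lower().split())
    return len(words.intersection(text.lower().split()))


def retrieve(chunks, query, project_id, k=3):
    """Filter to one project first, then rank only what survives."""
    in_scope = [c for c in chunks if c.project_id == project_id]
    ranked = sorted(in_scope, key=lambda c: score(query, c.text), reverse=True)
    return ranked[:k]


corpus = [
    Chunk("acme", "client decided to keep the legacy approval workflow"),
    Chunk("globex", "client decided to replace the approval workflow entirely"),
    Chunk("acme", "architect asked for sso on day one"),
]

hits = retrieve(corpus, "approval workflow decided", project_id="acme", k=1)
print(hits[0].text)
</code></pre><p>In a global search the Globex chunk would tie on relevance with the Acme one. Filtering by project before ranking removes that ambiguity outright instead of hoping the ranker resolves it.</p><p>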
Cross-project learning becomes a separate, opt-in surface &#8212; templates, playbooks, codified house standards &#8212; rather than something contaminating live retrieval.</p><p><strong>Audit trail is the API, not a sidecar.</strong> Implementation buyers &#8212; especially in financial services, government, and healthcare &#8212; won&#8217;t trust an autonomous system unless they can ask, of any output, &#8220;what did this come from?&#8221; Bolting an audit log onto a generation pipeline after the fact rarely produces a satisfying answer. Designing the lineage as the primary data structure, with generation as a derived operation, is what makes the audit trail credible. This is the same discipline that production data engineering applies to lineage in dbt or feature stores; it&#8217;s still rare in agent systems.</p><h2>Layer 3: Create</h2><p>The create layer is where Auctor&#8217;s outputs land. <a href="https://www.getauctor.com/product">Their own product page</a> lists the artifact types: rough orders of magnitude, resource plans, statements of work, scopes, solution designs, process flows, user stories, and presentation decks. Each of these is a distinct generation problem with its own template, its own validation rules, and its own downstream sync target.</p><p>The interesting design decision is that generation is bounded by the project graph, not by raw model capability. A SoW draft isn&#8217;t generated from &#8220;what the model knows about SoWs&#8221;; it&#8217;s generated from the requirements, decisions, and constraints already in this engagement&#8217;s graph, with house-style templates from the SI&#8217;s own playbook layered on top. 
<a href="https://www.getauctor.com/crossfuze-case-study">Crossfuze describes this as &#8220;first-pass content creation within clearly defined project contexts,&#8221;</a> explicitly using Auctor for drafts that then go through their normal brand and review process.</p><p>That&#8217;s the right framing for any high-stakes generation task: the model produces a defensible draft, the human still owns approval, and the graph guarantees that nothing in the draft is stranded &#8212; every claim, number, and design decision can be traced to a source already in the system. It&#8217;s also a much better fit for fixed-fee delivery economics than &#8220;AI assistant pinging the consultant for help&#8221; &#8212; because the unit of work is the artifact, not the keystroke.</p><h2>The harness, not the model</h2><p>Sun is explicit on the model question: Auctor builds on frontier foundation models and tunes the system around how those models evolve, <a href="https://tercera.io/resources/why-auctor-is-building-the-ai-system-of-action-for-system-integrators/">working with hundreds of consultants daily to know what works and what doesn&#8217;t</a>. They are not building a foundation model. They are not even, as far as the public material reveals, fine-tuning one in a meaningful way. The bet is that the model is the commodity layer and the SI-specific harness &#8212; the schemas, the project-scoped retrieval, the artifact graph, the integrations, the templates, the governance &#8212; is where compounding value lives.</p><p>This is a defensible bet, and not just for Auctor. The same reasoning applies to most vertical agent companies: every six weeks the underlying model gets cheaper and stronger, and any architectural choice that depends on a specific model&#8217;s quirks decays with it. The architecture that compounds is the one that gets <em>more useful</em> with better models, because the harness was the durable artifact all along. 
The frontier labs themselves have been making versions of this argument in their own engineering writeups: the loop, the tools, the context curation are where engineering effort earns its keep, not the model behavior of any given week.</p><p>The corollary is uncomfortable for some founders: if your moat is mostly model behavior, you don&#8217;t have a moat. You have a temporary advantage on a clock you don&#8217;t control. Auctor&#8217;s choice to plant their flag on the graph instead of the model is, on its face, the more durable bet.</p><h2>Governance is engineering, too</h2><p><a href="https://www.getauctor.com/security">Auctor&#8217;s security page</a> is more interesting than the average vendor compliance recitation, mostly for one detail: zero data retention with upstream AI providers, meaning customer inputs aren&#8217;t stored or logged by the underlying model providers and aren&#8217;t used for model training. For services firms whose customers include financial institutions, government agencies, and Fortune 500s, this is a precondition for sale, not a nice-to-have. The rest is what you&#8217;d expect from a startup chasing enterprise contracts: AWS infrastructure, AES-256 at rest, TLS 1.3 in transit, SSO/SCIM via Okta/Azure AD/Google, SOC 2 Type II, ISO 27001, and regional data residency.</p><p>The governance story matters because it&#8217;s the gating constraint on the whole architectural play. An audit trail is only as trustworthy as the platform&#8217;s ability to demonstrate its handling controls to a procurement team. The system-of-action framing falls apart if the action can&#8217;t be retrospectively justified to a regulator or an internal audit function. 
Sun makes this point explicitly in <a href="https://tercera.io/resources/why-auctor-is-building-the-ai-system-of-action-for-system-integrators/">the Tercera Q&amp;A</a>: action without accountability fails.</p><h2>What&#8217;s unproven</h2><p>Worth being honest about what we don&#8217;t know from the public material:</p><p><strong>The 80% efficiency claim is vendor-cited.</strong> <a href="https://www.getauctor.com/series-a-announcement">Auctor reports</a> &#8220;up to 80% efficiency gains across phases like discovery and design.&#8221; The number comes from the company and the customers it has chosen to highlight; there&#8217;s no independent benchmark, and &#8220;efficiency gain&#8221; is doing a lot of definitional work. Take it as directional, not as a measured productivity figure.</p><p><strong>The architectural details are not public.</strong> Everything above is reverse-engineered from product copy, founder interviews, case studies, and integration lists. We don&#8217;t have a public technical writeup describing the schema, the graph implementation, the retrieval strategy, or the agent loop. There may be &#8212; and probably are &#8212; significant differences between the architecture as described and the architecture as built.</p><p><strong>Implementation work resists templating.</strong> The harder question for any &#8220;system of action&#8221; is whether the work it&#8217;s automating is genuinely templatable at scale. SoWs and user stories sit on a spectrum: the boilerplate scaffolding is highly templatable, the load-bearing scope language often isn&#8217;t. Auctor&#8217;s own framing &#8212; first drafts, with human approval &#8212; implicitly concedes this. The interesting test will be how much of the high-judgment work survives at the human layer five years from now. <a href="https://sequoiacap.com/article/services-the-new-software/">Sequoia&#8217;s framing of &#8220;intelligence vs. 
judgement&#8221;</a> is the right map here.</p><p><strong>Category competition is coming.</strong> &#8220;Agentic operating system for SI work&#8221; is a defensible position today partly because nobody else is positioned exactly there. That window won&#8217;t stay open. Several adjacent categories &#8212; meeting intelligence vendors, services automation tools, project management platforms &#8212; are within a roadmap or two of overlapping capability. The artifact graph is a real moat if it stays project-scoped and integration-rich, but it&#8217;s the kind of moat that needs to keep deepening.</p><h2>What builders should learn</h2><p>Three patterns are worth pulling into your own architecture, regardless of vertical:</p><p><strong>Make lineage the primary data structure.</strong> If you&#8217;re building an agent system in any domain where decisions need to be defensible &#8212; legal, finance, healthcare, regulated B2B &#8212; design the artifact graph first and the generation pipeline second. Walking from any output back to the source it depends on should be a single graph traversal, not a forensic exercise. Most teams do this backward: they build the loop, ship a feature, then bolt on observability when a customer asks why the model said what it said.</p><p><strong>Scope retrieval to the engagement, not the corpus.</strong> Cross-project learning is a different surface from in-project recall. Conflating them produces retrieval that&#8217;s almost-right in a hundred subtle ways and consistently wrong on questions like &#8220;what did <em>this</em> customer decide last Tuesday?&#8221; Project- or account-scoped repositories solve a real problem cheaply.</p><p><strong>Bet on the harness.</strong> If the part of your system that depends on the current state of frontier models is more than a thin layer, your roadmap is exposed to the next model release. 
The durable engineering &#8212; the schemas, the scoping, the integrations, the templates, the lineage &#8212; is what compounds while the model layer keeps shifting underneath.</p><p>These aren&#8217;t novel patterns in isolation. The novel thing is treating them as load-bearing rather than as polish. In a domain that has resisted automation for thirty years, that decision is the architecture.</p><div><hr></div><p><em>Have you seen this pattern &#8212; lineage-first, harness-bet &#8212; in production agent systems outside the SI space? Reply and tell me what you&#8217;re building. I read every response.</em></p><div><hr></div><p><strong>Further reading</strong></p><ul><li><p><a href="https://www.getauctor.com/product">Auctor product overview</a> &#8212; the integration topology and the three-layer framing in the company&#8217;s own words</p></li><li><p><a href="https://tercera.io/resources/why-auctor-is-building-the-ai-system-of-action-for-system-integrators/">Will Sun&#8217;s Q&amp;A with Tercera</a> &#8212; primary-source view on the system-of-action concept and the founding traceability bet</p></li><li><p><a href="https://sequoiacap.com/article/services-the-new-software/">Julien Bek, &#8220;Services: The New Software&#8221;</a> &#8212; the strategic frame Auctor was funded against; useful even if you&#8217;re not in services</p></li><li><p><a href="https://sequoiacap.com/article/partnering-with-auctor/">Sequoia&#8217;s partnership announcement</a> &#8212; Bek&#8217;s investment thesis on Auctor specifically</p></li><li><p><a href="https://www.bcg.com/publications/2024/most-large-scale-tech-programs-fail-how-to-succeed">BCG, &#8220;Most Large-Scale Tech Programs Fail&#8221;</a> &#8212; the independent base rate for project failure that the whole category is sized against</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Inside Mintlify’s Agent Stack]]></title><description><![CDATA[A teardown of the two-harness architecture &#8212; async sandboxes for writes, virtual filesystems for reads &#8212; and what it teaches about wrapping a model in production.]]></description><link>https://theairuntime.com/p/inside-mintlifys-agent-stack</link><guid isPermaLink="false">https://theairuntime.com/p/inside-mintlifys-agent-stack</guid><dc:creator><![CDATA[The AI Runtime]]></dc:creator><pubDate>Wed, 06 May 2026 08:03:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-xTO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p><strong>TL;DR</strong> - Mintlify just <a href="https://www.mintlify.com/blog/series-b">raised $45M at a $500M valuation</a> on the bet that documentation has stopped being something humans read and started being infrastructure that agents query. 
Their own traffic data backs the bet: across 30 days and roughly 790M requests on Mintlify-powered sites, <a href="https://www.mintlify.com/blog/state-of-ai">AI coding agents accounted for 45.3% of traffic versus 45.8% for browsers</a>, with Claude Code alone generating more requests than Chrome on Windows.</p><p>Underneath the bet sits a three-part architecture worth studying. The <strong>write agent</strong> runs inside ephemeral <a href="https://www.mintlify.com/blog/knowledge-management-agent-era">Daytona sandboxes with a headless OpenCode session driven by Opus 4.6</a>, triggered by Slack mentions, dashboard prompts, API calls, or YAML-defined Workflows in your repo. The <strong>read assistant</strong> does the opposite &#8212; it <a href="https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant">skips real sandboxes entirely</a> in favor of ChromaFs, a virtual filesystem layered over their existing Chroma database, taking session creation from roughly 46 seconds to about 100 milliseconds. The <strong>public surface</strong> auto-generates llms.txt, llms-full.txt, and skill.md at the root, <a href="https://www.mintlify.com/library/mintlify-alternatives-what-to-consider-and-why-theres-no-true-substitute">serves clean Markdown when you append </a><code>.md</code> to a page URL, and hosts an MCP server for every docs site it powers.</p><p>The architectural lesson isn&#8217;t that they built a doc agent. It&#8217;s that they built <strong>two</strong> harnesses with deliberately asymmetric constraints &#8212; async writes get full sandboxes, sync reads get a virtual filesystem &#8212; and the asymmetry is what makes the system economical at <a href="https://www.mintlify.com/blog/mintlify-acquires-trieve-to-improve-rag-search-in-documentation">over 23 million queries a month</a>. 
If you&#8217;re wrapping a model around a code repository for any reason, this is the reference implementation to study.</p></div><h2>The 45% problem</h2><p>Start with the data, because the architecture only makes sense once you accept the premise.</p><p>In April 2026, Mintlify&#8217;s co-founder Han Wang published a Cloudflare-header analysis covering 30 days of traffic across all Mintlify-powered docs sites. The headline number: AI coding agents had reached <a href="https://www.mintlify.com/blog/state-of-ai">45.3% of total requests, narrowly behind 45.8% from browsers</a>. The distribution was lopsided. Claude Code alone produced 199.4M requests, ahead of Chrome on Windows at 119.4M. Cursor produced 142.3M. Together those two tools accounted for roughly 96% of identified AI agent traffic. 
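</p><p>Those figures can be cross-checked against each other. A quick sketch, assuming the roughly 790M total and the 45.3% agent share quoted above are the right denominators (the variable names are illustrative):</p><pre><code class="language-python"># Figures as quoted in the article; the 790M total is approximate.
total_requests_m = 790.0
agent_share = 0.453
claude_code_m = 199.4
cursor_m = 142.3

# Requests attributed to identified AI agents, in millions.
agent_requests_m = total_requests_m * agent_share

# Combined volume of the two largest agents and their share of agent traffic.
top_two_m = claude_code_m + cursor_m
top_two_share = top_two_m / agent_requests_m

print(f"identified agent requests: ~{agent_requests_m:.1f}M")
print(f"Claude Code + Cursor: {top_two_m:.1f}M")
print(f"share of agent traffic: {top_two_share:.1%}")
</code></pre><p>With the rounded 790M total this lands at about 95.5%, in line with the roughly 96% figure.</p><p>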
Mintlify itself notes the real share is likely higher, since Codex traffic is invisible to user-agent header analysis and disappears into generic HTTP requests.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-xTO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-xTO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!-xTO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!-xTO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!-xTO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-xTO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:779871,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiengineerweekly.substack.com/i/196074000?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-xTO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!-xTO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!-xTO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!-xTO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51724eea-0d78-4888-98d3-beb3f8cd0d44_1024x559.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Architecture Patterns</em></p><p>If half your readers are agents pulling context to generate code, the design pressure on documentation flips. Browsers want navigation chrome, syntax highlighting, expandable sections. Agents want clean Markdown, exact strings, and stable URLs. The same content has to render correctly to both audiences, and &#8212; critically &#8212; has to <em>stay current</em> as the underlying product ships at agent-swarm speed.</p><p>That second pressure is the one that produced the agent stack. As Mintlify&#8217;s other co-founder Hahnbee Lee frames it, when a chatbot gives a wrong answer it is usually a documentation failure rather than a model failure, because the corpus the model retrieved against is out of date. 
The gap between what your docs say and what your product does compounds quarter over quarter unless something automated keeps the two in sync. Their answer is two distinct agents with two distinct harnesses, plus a public surface that exposes the maintained corpus to every other agent in the ecosystem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xxag!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xxag!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png 424w, https://substackcdn.com/image/fetch/$s_!xxag!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png 848w, https://substackcdn.com/image/fetch/$s_!xxag!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png 1272w, https://substackcdn.com/image/fetch/$s_!xxag!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xxag!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png" width="871" height="579" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:579,&quot;width&quot;:871,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46305,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineerweekly.substack.com/i/196074000?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xxag!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png 424w, https://substackcdn.com/image/fetch/$s_!xxag!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png 848w, https://substackcdn.com/image/fetch/$s_!xxag!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png 1272w, https://substackcdn.com/image/fetch/$s_!xxag!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6e34ec3-4ccb-4678-8d18-9a62e425c387_871x579.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Two harnesses, two latency budgets. The write path optimizes for capability; the read path optimizes for cost-per-conversation.</em></p><div><hr></div><h2>Layer 1 &#8212; The write agent: a sandbox is the whole product</h2><p>Most &#8220;AI doc writer&#8221; features on the market today are roughly one prompt, one model call, one diff. Mintlify&#8217;s write agent is structurally different. When you trigger it &#8212; by <code>@mintlify</code>-ing the bot in Slack, hitting <code>Cmd+I</code> in the dashboard, calling the agent API, or merging a PR that fires a Workflow &#8212; what runs on the other side is a headless OpenCode session driven by Opus 4.6, scoped to a fresh Daytona container that has the docs repo and any context repositories cloned in. 
The sandbox is the unit of work.</p><p>This decision is more load-bearing than it sounds. The Mintlify team is explicit about the reasoning: pointing a stateless model at a codebase produces, in their phrase, &#8220;chaos with a byline&#8221;. The agent needs a real environment to read code, plan changes, and edit files safely &#8212; not an API call decorated with retrieved chunks. So they gave it one. A trigger lands on a job queue, a worker provisions the container, and the result of the run is reported back through GitHub commit checks and the Mintlify dashboard. Inside the container, the agent runs through a fixed pipeline: it pulls in relevant material across the docs and the connected code repos, drafts a multi-step plan if the work calls for one, applies edits while honoring the project&#8217;s writing standards, runs a <a href="https://www.mintlify.com/docs/agent">local Mintlify CLI build to confirm the docs still compile</a>, and opens a pull request &#8212; direct commits to main are not on the menu.</p><p>Two design choices inside that loop are worth pulling out.</p><p><strong>Slack-first, not terminal-first.</strong> The Mintlify agent originally shipped only in Slack and via API, with <a href="https://www.mintlify.com/blog/agent-dashboard">the dashboard surface added later in December 2025</a>. The team&#8217;s stated reason: opening a terminal triggers a <a href="https://www.mintlify.com/blog/we-built-our-coding-agent-for-slack">&#8220;mentally draining switch&#8221;</a> that opening Slack does not, and documentation work is exactly the kind of task people procrastinate on. 
Because the agent lives where the relevant context already lives &#8212; the PR thread that explained the change, the customer Slack message that surfaced the gap &#8212; the trigger surface matches the source of the work.</p><p><strong>Behavior-as-code through </strong><code>AGENTS.md</code><strong>.</strong> The agent reads a config file at <code>.mintlify/AGENTS.md</code> in your repo and appends its contents to its system prompt for every task it runs &#8212; whether the trigger comes from Slack, the dashboard, or the API. The path matters: Mintlify&#8217;s docs explicitly warn that placing the file at the project root exposes it as a public asset under <code>/agents.md</code>, whereas the <code>.mintlify/</code> directory is not served on the docs site. What you put inside is style preferences, code standards, project-specific terminology &#8212; the kind of guidance a senior reviewer would otherwise repeat fifty times a year. It is the same pattern as Anthropic&#8217;s <code>CLAUDE.md</code> or the AGENTS.md spec emerging across the agent tooling space, and it makes agent behavior version-controlled and reviewable.</p><p>The most interesting trigger surface is <strong>Workflows</strong>, where the configuration becomes explicit YAML. A workflow file lives in your repo. The schema looks roughly like this:</p><pre><code><code>---
name: 'Update API reference on backend changes'
on:
  push:
    - repo: 'your-org/backend'
      branch: main
context:
  - repo: 'your-org/docs'
  - repo: 'your-org/openapi-specs'
automerge: false
---

When the backend repo merges a PR, scan the diff for changes to public API
endpoints, request/response schemas, or authentication behavior. Update the
matching API reference pages and code examples. Skip internal refactors.</code></code></pre><p>The structure is a trigger (cron job or push event), a list of context repos to clone in, an automerge flag, and natural-language instructions in markdown. When the trigger fires, the agent evaluates the conditions, runs the task, and either commits directly or opens a PR depending on configuration. Documentation maintenance becomes a downstream event of shipping, not a separate task someone has to remember.</p><p>The whole arrangement maps onto a pattern emerging across serious agent products: give the AI a sandbox, version-control the instructions, keep humans in the review loop, and let the model do the actual work inside well-defined guardrails. The reviewer-on-PRs analogy is doing real work here. The agent is treated like a junior contributor with full repo access &#8212; capable, but reviewed.</p><div><hr></div><h2>Layer 2 &#8212; The read assistant: when a real sandbox is the wrong answer</h2><p>If the write agent shows what it looks like to spend latency to gain capability, the read assistant shows the opposite trade-off &#8212; and it is the more architecturally surprising of the two.</p><p>The read assistant is the chat widget your readers use on a Mintlify-powered docs site. It now serves over thirty thousand conversations a day across hundreds of thousands of users. The natural design &#8212; and the one Mintlify started with &#8212; was the same shape that powers the write agent: spin up a sandbox, clone the docs repo, let the model run real <code>grep</code>, <code>cat</code>, <code>ls</code>, and <code>find</code> against the filesystem.</p><p>That design hit two walls. 
First, latency: <a href="https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant">p90 session boot time, including the GitHub clone and other setup, came in around 46 seconds</a> &#8212; fine for an async write task where someone fires a Slack message and walks to get coffee, fatal for a reader staring at a loading spinner on a docs page. Second, cost. At nearly a million conversations a month, even a minimal sandbox setup at 1 vCPU, 2 GiB RAM, and a five-minute lifetime would have run north of $70,000 a year on Daytona&#8217;s per-second pricing, with longer sessions doubling the bill.</p><p>So the team built <strong>ChromaFs</strong> &#8212; a virtual filesystem that gives the agent the <em>illusion</em> of a real shell, layered over the Chroma database that already stored the docs as embedded chunks. Session creation collapsed from tens of seconds to roughly 100 milliseconds, and because ChromaFs reuses infrastructure they were already paying for, the marginal compute cost per conversation dropped to zero. The implementation runs on top of <code>just-bash</code>, a TypeScript reimplementation of bash from Vercel Labs that exposes a pluggable <code>IFileSystem</code><a href="https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant"> interface</a>. <code>just-bash</code> parses commands, pipes, and flags; ChromaFs translates each underlying filesystem call into a Chroma query.</p><p>The mechanics are worth dwelling on, because they reveal how thoughtful harness design beats brute-force sandboxing.</p><p>The directory tree is bootstrapped from a single gzipped JSON document called <code>__path_tree__</code> stored inside the Chroma collection. On startup, the server fetches and decompresses it into two in-memory structures &#8212; a set of file paths and a map from directories to their children. 
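</p><p>A minimal sketch of that bootstrap step, written in Python rather than the TypeScript the real system runs on. The payload shape (a gzipped JSON array of absolute paths) and the names <code>build_tree</code> and <code>ls</code> are assumptions made for illustration; Mintlify&#8217;s post does not specify them.</p>

```python
import gzip
import json
from collections import defaultdict


def build_tree(blob):
    """Decompress a gzipped JSON list of paths into two lookup structures."""
    paths = json.loads(gzip.decompress(blob).decode("utf-8"))
    files = set(paths)            # membership test for files
    children = defaultdict(set)   # directory -> direct children
    for path in paths:
        parent = "/"
        for part in path.strip("/").split("/"):
            child = parent.rstrip("/") + "/" + part
            children[parent].add(child)
            parent = child
    return files, dict(children)


def ls(children, directory):
    """List a directory from memory, with no database round trip."""
    return sorted(children.get(directory, ()))


# Hypothetical payload standing in for the stored tree document.
blob = gzip.compress(
    json.dumps(["/auth/oauth.mdx", "/auth/keys.mdx", "/index.mdx"]).encode("utf-8")
)
files, children = build_tree(blob)
print(ls(children, "/auth"))
```

<p>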
After that, <code>ls</code>, <code>cd</code>, and <code>find</code> resolve in local memory with zero network calls, and the tree is cached so subsequent sessions for the same site skip the fetch entirely. Per-user access control happens at tree-build time: ChromaFs prunes paths the user can&#8217;t see and applies a matching filter to all subsequent Chroma queries, with the result that <a href="https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant">pruned paths cannot even be referenced by the agent</a>. Reading a page is a chunk-reassembly operation &#8212; <code>cat /auth/oauth.mdx</code> fetches all chunks with the matching slug, sorts them by <code>chunk_index</code>, and joins them into the full page. Writes throw <code>EROFS</code>, making the system stateless by construction.</p><p>The most clever piece is <code>grep</code>. A naive recursive grep over a virtual filesystem would be agonizing &#8212; every file would round-trip to the database. ChromaFs intercepts the grep call, parses flags with <code>yargs-parser</code>, and translates them into a Chroma query (<code>$contains</code> for fixed strings, <code>$regex</code> for patterns) that acts as a coarse filter to identify which files might contain a hit. The matched chunks are bulk-prefetched into a Redis cache, and the rewritten grep is handed back to <code>just-bash</code> for in-memory fine filtering. Large recursive queries finish in milliseconds.</p><p>Sitting beneath ChromaFs in the read path is <strong>Trieve</strong>, the RAG infrastructure company <a href="https://www.mintlify.com/blog/mintlify-acquires-trieve-to-improve-rag-search-in-documentation">Mintlify acquired in July 2025</a>. Trieve had been Mintlify&#8217;s search backbone since before the team finished its Y Combinator batch, and the acquisition brought retrieval ownership in-house at a moment when the assistant was already serving more than 23 million queries a month. 
Trieve&#8217;s stack &#8212; dense vector search, re-ranker models, sub-sentence highlighting, and date recency biasing on a single endpoint &#8212; does the heavy lifting underneath ChromaFs&#8217;s UNIX-style interface. Trieve also <a href="https://www.trieve.ai/blog/trieve-is-being-acquired-by-mintlify">moved to an MIT license as part of the acquisition</a>, so the same retrieval kernel is inspectable on GitHub.</p><p>The pattern in the read assistant is the part most teams underweight. Mintlify&#8217;s team observed that <a href="https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant">agents are converging on filesystems as their primary interface</a>, because <code>grep</code>, <code>cat</code>, <code>ls</code>, and <code>find</code> are sufficient primitives for an agent to reason over arbitrary structured content. Most builders take that observation and reach for a real sandbox. Mintlify took the same observation and asked whether the <em>interface</em> could be virtualized while keeping the <em>primitives</em> real. For their workload, the answer was yes &#8212; and the cost curve in their post (sandbox cost grows linearly with conversation duration; ChromaFs stays flat) is a clean argument for why.</p><div><hr></div><h2>Layer 3 &#8212; The public surface: content negotiation as the unification trick</h2><p>The third layer is the cheapest to describe and the easiest to overlook.</p><p>Every Mintlify-hosted docs site automatically generates a set of agent-readable artifacts at the root: llms.txt, llms-full.txt, and skill.md. The first two are an emerging convention for telling LLMs what content lives on a site and giving them a parseable bulk dump. The third is more interesting. 
As Mintlify describes it, <code>skill.md</code> is the action-layer manifest &#8212; it enumerates not just what the documentation contains but what an agent can actually invoke against the product, with required inputs and operating constraints attached to each capability. It is, in other words, the difference between an agent that can find information and an agent that can take action. Mintlify also exposes the <code>/.well-known/agent-skills</code> and <code>/.well-known/skills</code> paths &#8212;  so any agent that knows the convention can find capabilities without hard-coded paths.</p><p>The unification trick that ties everything together is <strong>content negotiation</strong>. The same URL serves rich HTML to browsers and clean Markdown to agents &#8212; appending <code>.md</code> to any page URL returns a Markdown view of the same content, with no separate agent-facing site to maintain. This avoids the failure mode where teams maintain a &#8220;human site&#8221; and a separate &#8220;AI site&#8221; that drift out of sync; there is only one content store, with two rendering targets selected by the request.</p><p>Finally, every Mintlify site auto-hosts an MCP server, which lets coding agents like Cursor, Claude Code, and Windsurf query current documentation while a task is running. Authentication is supported when the docs site itself is gated &#8212; the MCP server respects whatever auth protocol the docs already use. The architectural significance is that retrieval is no longer something only the docs site itself can do. Every external agent that supports MCP gets a structured handle into your corpus, on the same terms as Mintlify&#8217;s own assistant.</p><div><hr></div><h2>What the architecture teaches</h2><p>A few patterns are general enough to lift out of Mintlify&#8217;s specific case and apply elsewhere.</p><p>First, <strong>the sandbox is the unit of work for write tasks, but the wrong unit for read tasks</strong>. 
Most builders default to one or the other. Mintlify&#8217;s own bill clarifies the trade-off: a sandbox that boots in tens of seconds and costs a fraction of a cent per session is fine for asynchronous PR drafting, and ruinous for a chat widget. If you&#8217;re building both surfaces, expect to want both harnesses.</p><p>Second, <strong>version-controlled, natural-language instructions are the right encoding for agent behavior</strong>. Workflows YAML and <code>AGENTS.md</code> are the same idea applied at different scopes &#8212; one configures a recurring task, the other configures the agent globally. Both live in the repo, both go through code review, both evolve with the project. This is what &#8220;config as code&#8221; looks like when the configured component is a model.</p><p>Third, <strong>virtualizing the agent&#8217;s interface, not its environment, is often the better move</strong>. ChromaFs is the cleanest example: a real grep, a real ls, a real cat &#8212; but resolved against a database, not a disk. The agent doesn&#8217;t need a sandbox, it needs the sandbox&#8217;s API. Once you internalize that, a lot of &#8220;we need a Daytona for this&#8221; becomes &#8220;we need an <code>IFileSystem</code> shim for this,&#8221; with two orders of magnitude less infrastructure.</p><p>Fourth, <strong>content negotiation is the right unification primitive when you&#8217;re serving humans and agents from the same corpus</strong>. Maintaining parallel &#8220;human docs&#8221; and &#8220;AI docs&#8221; is how you guarantee they drift. Same URL, different format, selected by the request &#8212; and the cost of supporting the agent surface drops to near-zero.</p><p>Finally, <strong>harnesses are not edge cases, they&#8217;re the product</strong>. If you remove ChromaFs from the read assistant, the bill blows up. If you remove the sandbox boundary from the write agent, you stop being able to safely run on customer codebases. 
If you remove the auto-generated llms.txt and MCP server, the 45.3% of agent traffic loses its grip on the corpus. The model is doing model work in the middle, but everything around it &#8212; the sandbox, the virtual filesystem, the YAML triggers, the public surface &#8212; is what makes the product trustworthy and economical.</p><div><hr></div><h2>What to do with this</h2><p>Three concrete moves for practitioners building anything adjacent to this space.</p><p>If you operate a documentation site, run it through Mintlify&#8217;s free <a href="https://www.mintlify.com/blog/agent-score">Agent Score tool</a>, which checks twenty-nine signals of agent-readability and tells you where the gaps are. The data is right there: half your traffic is agents you cannot see, and most teams are still building only for browsers. If you&#8217;d rather audit on your own, start by checking whether <code>curl -L https://yourdocs.com/some-page.md</code> returns clean Markdown or a 404 &#8212; that one HTTP request tells you whether you&#8217;re on the agent map at all.</p><p>If you&#8217;re building any agent that needs to read or modify a code repository, start with the harness, not the prompt. Decide your latency budget before you decide your model. If the answer is &#8220;tens of seconds and the agent edits files,&#8221; the Mintlify write agent &#8212; sandbox, headless OpenCode, version-controlled config &#8212; is your reference. If the answer is &#8220;milliseconds and the agent only reads,&#8221; the ChromaFs pattern (virtualize the interface, not the environment) is your reference.</p><p>And if you&#8217;re shipping a product that other agents will need to understand &#8212; an API, an SDK, a developer tool &#8212; treat your documentation as a programmatic interface that happens to also be human-readable. Auto-generate llms.txt and skill.md, expose an MCP server, serve clean Markdown via content negotiation. The asymmetric world Mintlify is betting on already exists. 
The teams whose docs are agent-readable get evaluated. The teams whose docs aren&#8217;t get skipped.</p>]]></content:encoded></item><item><title><![CDATA[How Vertical Agents Self-Improve in Production]]></title><description><![CDATA[Field notes on the harness loop at Harvey, Hippocratic, Anterior, and Azure SRE &#8212; where production failures compound into skill without retraining the model.]]></description><link>https://theairuntime.com/p/how-vertical-agents-self-improve</link><guid isPermaLink="false">https://theairuntime.com/p/how-vertical-agents-self-improve</guid><dc:creator><![CDATA[The AI Runtime]]></dc:creator><pubDate>Sat, 02 May 2026 11:03:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!V7Rg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p><strong>TL;DR</strong> - In regulated verticals &#8212; healthcare, legal, insurance, finance &#8212; the most reliable way to make a deployed agent better is not a new model. 
It is a closed loop that turns production failures into harness updates: prompts, tools, sub-agents, memory files, judge rubrics, routing logic. Harvey ran this loop on twelve legal tasks and moved average success from <a href="https://www.artificiallawyer.com/2026/04/07/harvey-drives-legal-agent-learning-via-harness-engineering/">40.8% to 87.7% with model weights frozen</a>, with <a href="https://x.com/nikogrupen/status/2041166953902203157">complaint drafting going from 2% to 98% rubric coverage</a>. Hippocratic AI reports (vendor-published) clinical accuracy rising <a href="https://hippocraticai.com/polaris-3/">from ~80% pre-Polaris to 99.38% in Polaris 3.0</a> after feeding ~1.85M real patient calls and 307K clinician-reviewed test calls back into the system. Anterior (vendor-published) puts a <a href="https://www.zenml.io/llmops-database/building-scalable-llm-evaluation-systems-for-healthcare-prior-authorization">reference-free LLM-as-judge in front of every prior auth decision</a>, routes only the low-confidence ones to a team of fewer than ten clinicians, and reports 96% F1 at over 100K decisions/day. Microsoft&#8217;s Azure SRE Agent moved its <a href="https://techcommunity.microsoft.com/blog/appsonazureblog/the-agent-that-investigates-itself/4500073">Intent-Met score from 45% to 75% on novel incidents</a> by letting the agent investigate its own bugs and submit PRs against its own codebase. The shared pattern is the same six nodes everywhere: trace &#8594; judge &#8594; cluster &#8594; mutate harness &#8594; gate &#8594; deploy. <strong>If you cannot run that loop, you are shipping a frozen artifact in a moving market.</strong> Start by instrumenting traces and writing one rubric. The judge and the mutation loop come after.</p></div><h2>The frozen-agent problem</h2><p>A vertical agent that ships at 90% accuracy and stays there is not a 90% accurate system. 
It is a 90% accurate system at the moment of deployment, decaying.</p><p>The decay has three sources. <strong>Distribution drift</strong>: real patients ramble, real lawyers redline contracts in non-canonical ways, real claims arrive with new denial codes. <strong>Policy drift</strong>: CMS coverage determinations change, <a href="https://www.healthaffairs.org/doi/10.1377/hlthaff.2025.00897">EU AI Act provisions phase in on staggered enforcement timelines</a>, insurer rulesets get rewritten quarterly. <strong>Long-tail surface area</strong>: the failure modes you didn&#8217;t see in eval are the ones production discovers, one in ten thousand at a time. 
At 100K medical decisions per day, <a href="https://www.zenml.io/llmops-database/building-scalable-llm-evaluation-systems-for-healthcare-prior-authorization">a one-in-ten-thousand subtle hallucination &#8212; &#8220;suspicious for multiple sclerosis&#8221; when the patient has a confirmed MS diagnosis &#8212; fires ten times daily</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V7Rg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V7Rg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!V7Rg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!V7Rg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!V7Rg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V7Rg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:556426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiengineerweekly.substack.com/i/196073139?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V7Rg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!V7Rg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!V7Rg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!V7Rg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5caaf98-dc3c-4fc0-8487-7f9fa24ff038_1024x559.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p><em>Agent Improvement</em></p><p>In low-stakes consumer apps you can absorb that. In a vertical where the cost of a single error is a denied surgery, a missed disclosure schedule, or a regulatory finding, you cannot. So the question that defines vertical agent engineering in 2026 is not &#8220;which model do we use&#8221; &#8212; it is &#8220;how does this agent get better next week than it is today, <em>without</em> a new base model release, and <em>with</em> the audit trail a regulator will demand.&#8221;</p><p>The answer that has emerged across legal, healthcare, insurance, and incident response is the same architecture, sometimes given different names. 
Anthropic&#8217;s engineering team and Viv Trivedy refer to it as <a href="https://addyosmani.com/blog/agent-harness-engineering/">harness engineering</a>. Microsoft frames it as the <a href="https://techcommunity.microsoft.com/blog/appsonazureblog/the-agent-that-investigates-itself/4500073">agent investigating itself</a>. NVIDIA borrows MAPE-K from autonomic computing and <a href="https://arxiv.org/pdf/2510.27051">calls it a data flywheel</a>. LangChain calls it <a href="https://www.langchain.com/conceptual-guides/traces-start-agent-improvement-loop">the agent improvement loop powered by traces</a>. The mechanics are the same.</p><div><hr></div><h2>The shape of the loop</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6iB3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6iB3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png 424w, https://substackcdn.com/image/fetch/$s_!6iB3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png 848w, https://substackcdn.com/image/fetch/$s_!6iB3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6iB3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6iB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png" width="841" height="763" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:763,&quot;width&quot;:841,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33712,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineerweekly.substack.com/i/196073139?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6iB3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png 424w, https://substackcdn.com/image/fetch/$s_!6iB3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png 848w, 
https://substackcdn.com/image/fetch/$s_!6iB3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png 1272w, https://substackcdn.com/image/fetch/$s_!6iB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037db51c-ee9f-4106-a460-95d25164e7cd_841x763.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>The loop</em></p><p>Six nodes. 
Every component carries weight; every break in the chain causes silent degradation.</p><p><strong>Production traces</strong> are the substrate. Without per-step tool calls, model inputs, model outputs, latency, token counts, and final outcomes, none of the downstream work is possible. LangChain&#8217;s formulation is the cleanest: <a href="https://www.langchain.com/conceptual-guides/traces-start-agent-improvement-loop">traces come from staging environments, benchmark runs, local development, and especially from production</a>, and they are the input to every subsequent step. The trace store doubles as the audit trail regulators ask for.</p><p><strong>Evaluation and judging</strong> is where most teams over-rely on offline benchmarks. The shift in 2025&#8211;26 has been toward online evaluators that score every production trace &#8212; typically an LLM-as-judge augmented with deterministic checks (schema validation, citation existence, tool-call shape) and routed human review on a configurable sample. Anterior&#8217;s framing is sharper than most: their judge is <em>reference-free</em>, scoring outputs against guidelines and clinical reasoning rather than a held-out ground truth, because the volume &#8212; over 100K decisions a day &#8212; makes ground truth impossible to maintain.</p><p><strong>Failure clustering</strong> is where the leverage is. A pile of low-scored traces is not actionable. Grouping them by failure pattern &#8212; &#8220;agent missed exhibit B in 30% of due diligence runs,&#8221; &#8220;agent emits &#8216;suspicious for X&#8217; on confirmed-X patients,&#8221; &#8220;agent hits LLM 429s during streaming&#8221; &#8212; turns symptoms into hypotheses. LangChain runs <a href="https://blog.langchain.com/improving-deep-agents-with-harness-engineering/">parallel error-analysis subagents and synthesizes their findings into harness change proposals</a>. 
Microsoft&#8217;s SRE Agent runs <a href="https://techcommunity.microsoft.com/blog/appsonazureblog/the-agent-that-investigates-itself/4500073">a daily monitoring task that searches the last 24 hours of errors, clusters the top hitters, traces each to its root cause, and submits a PR</a>.</p><p><strong>Harness mutation</strong> is the change itself. We will spend a section on the levers that actually move; for now: <em>most of these changes never touch model weights</em>. They edit the system prompt, add a skill or sub-agent, modify a tool definition, append to a memory file, tighten a routing threshold, or rewrite the judge&#8217;s rubric.</p><p><strong>Validation gate</strong> is the safety check on the hill-climb. Every proposed harness change runs against a frozen eval set before it ships, and any regression &#8212; even on a task the change was not targeting &#8212; blocks the merge. Harvey runs this against <a href="https://x.com/nikogrupen/status/2041166953902203157">twelve internal benchmark tasks per iteration</a>; LangChain marks proposed changes that overfit as discarded runs in their iteration log. Without the gate, the loop generates regressions as fast as it generates improvements.</p><p><strong>Deploy</strong> then closes the cycle. The new harness produces new traces; new traces feed new judges; new clusters drive new mutations. The model is the one piece of this picture that does not change between weekly cycles.</p><p>The non-obvious property of this loop is what compounds. As Anterior describes it, the loop creates a <a href="https://www.zenml.io/llmops-database/building-scalable-llm-evaluation-systems-for-healthcare-prior-authorization">virtuous improvement cycle where the evaluator itself gets calibrated against human review, and confidence grades from that calibrated evaluator route which cases need humans next time</a>. The judge improves. The clustering improves. The mutations get more targeted. 
The agent appears to learn &#8212; without a single weight changing.</p><div><hr></div><h2>Case 1: Harvey &#8212; autoresearch and the rubric ceiling</h2><p>The cleanest published demonstration is Harvey&#8217;s recent <a href="https://x.com/nikogrupen/status/2041166953902203157">autoresearch experiment</a>, summarized externally by <a href="https://www.artificiallawyer.com/2026/04/07/harvey-drives-legal-agent-learning-via-harness-engineering/">Artificial Lawyer</a>. Niko Grupen, Head of Applied Research, ran twelve tasks from Harvey&#8217;s internal agent benchmark &#8212; commercial lease review, complaint drafting, tax memos, disclosure schedules, due diligence questionnaires &#8212; through a loop where an outer agent is allowed to edit the inner agent&#8217;s harness based on rubric-graded judge feedback.</p><p>The setup: each task ships with source documents, instructions, and a detailed grading rubric. After an attempt, an LLM judge scores the output against the rubric and produces written feedback on what the agent got right, what it missed, and where its reasoning was wrong. A coding agent reads the judge feedback, clusters the failures, forms a hypothesis about which harness components would help, edits or builds those components &#8212; skills, hooks, scripts, sub-agents, <em>not</em> model weights &#8212; and reruns.</p><p>The result: across all twelve tasks, average success rose from 40.8% to 87.7%. Five of the twelve started in the 2&#8211;7% range. After optimization, seven exceeded 90% and one hit 100%. The complaint drafting task is the most striking &#8212; it <a href="https://x.com/nikogrupen/status/2041166953902203157">moved from 2% rubric coverage to 98% over a handful of iterations, producing a 164-paragraph complaint with a 33-exhibit list</a>.</p><p>Two patterns from Grupen&#8217;s log are worth calling out. First, the early iterations correct basic structural failures &#8212; wrong file types, missing deliverables, weak structure. 
Later iterations show domain-specific expertise emerging: cross-document issue spotting, risk classification, distinguishing genuinely problematic provisions from market-standard distractors. Second, the ceiling is the rubric. &#8220;When the rubric is high quality, the agent can hill-climb surprisingly far.&#8221; When it isn&#8217;t, the loop stalls.</p><p>This generalizes. The same auto-improvement pattern works in a generic coding domain: LangChain&#8217;s deepagents-cli moved <a href="https://blog.langchain.com/improving-deep-agents-with-harness-engineering/">from 52.8% to 66.5% on Terminal Bench 2.0 &#8212; a 13.7-point jump from harness changes alone, with the model fixed at GPT-5.2-Codex</a>. The mechanism is the same trace analyzer skill, parallel error agents, and targeted prompt/tool/middleware changes per iteration.</p><p>The Harvey caveat is real and worth surfacing: this is a vendor-run experiment on twelve tasks; it does not yet generalize to all legal work, and it is bound by the quality of the rubrics Harvey wrote. But the directional finding &#8212; that harness-layer changes can deliver model-upgrade-sized improvements in a regulated domain &#8212; is now hard to dismiss.</p><div><hr></div><h2>Case 2: Hippocratic AI &#8212; clinicians as a learning signal at scale</h2><p>Hippocratic AI&#8217;s Polaris is a different shape of the same loop, scaled to a 22-LLM constellation that handles <a href="https://arxiv.org/pdf/2603.29893">over 10 million real patient calls</a> and a network of 6,234 US-licensed clinicians who review production output.</p><p>The vendor-published trajectory across three model generations: <a href="https://hippocraticai.com/polaris-3/">pre-Polaris baseline ~80%, Polaris 1.0 at 96.79%, Polaris 2.0 at 98.75%, Polaris 3.0 at 99.38% clinical accuracy</a>, validated under their Real-World Evaluation of Large Language Models in Healthcare framework. 
The framework leverages <a href="https://hippocraticai.com/real-world-evaluation-llm/">6,234 US-licensed clinicians (5,969 nurses and 265 physicians) evaluating 307,038 unique calls</a> through a three-tier review process: nurse review first, physician adjudication when needed, structured error categorization in between. Errors flagged at any tier feed back into the next iteration&#8217;s training and harness.</p><p>The subsystem-level numbers tell the more interesting story, because they show what specifically improved between Polaris 2.0 and 3.0 by listening to production:</p><ul><li><p><a href="https://hippocraticai.com/polaris-3/">Health Risk Assessment documentation accuracy: 90.5% &#8594; 98.5%</a></p></li><li><p>Explanation-of-Benefits policy quoting: 86.4% &#8594; 99.4%</p></li><li><p>Complex appointment scheduling error rate: 8% &#8594; 0.5%</p></li><li><p>Background-noise speech recognition error rate: 9.3% &#8594; 2.3%</p></li><li><p>Clarification engine error rate (gracefully handling unclear patient speech): 16.3% &#8594; 2.0%</p></li></ul><p>These aren&#8217;t random improvements. They&#8217;re the long-tail issues that surfaced once 1.85M patient calls had run through Polaris 1.0 and 2.0 and clinicians had flagged categorical failure modes. Speech recognition fails in noisy environments &#8594; train a dedicated background-noise engine. Patients answer HRAs in rambling, context-shifting ways &#8594; ship a &#8220;deep thinking&#8221; model that triple-checks documentation. Policy quotes occasionally drift from source documents &#8594; tighten the harness around source attribution.</p><p>The honest framing: these are vendor-self-published numbers, and there is no independent third party validating Hippocratic AI&#8217;s safety scores. 
What is independently verifiable is the <em>architecture</em> of the feedback loop &#8212; clinician review network, structured error categorization, real-world evidence accumulation across versions &#8212; which is now <a href="https://www.medrxiv.org/content/10.1101/2025.03.17.25324157v1">described in the underlying RWE-LLM paper on medRxiv</a> and is replicable by anyone willing to invest in a comparable review apparatus.</p><div><hr></div><h2>Case 3: Anterior &#8212; judge first, route smartly, validate the validator</h2><p>Anterior <a href="https://www.zenml.io/llmops-database/building-scalable-llm-evaluation-systems-for-healthcare-prior-authorization">runs the same loop in healthcare prior authorization</a>, but with two design choices that are worth studying separately because they generalize beyond healthcare.</p><p>First, reference-free real-time evaluation. Anterior&#8217;s primary system makes a coverage determination by reasoning across unstructured clinical documentation, payer rulesets, and clinical guidelines. A second LLM-as-judge then evaluates the determination against those same guidelines &#8212; without needing a held-out ground truth &#8212; and produces a confidence grade. Reference-free evaluation matters because at 100K+ decisions a day, no organization can maintain a labeled gold set that keeps up with policy drift.</p><p>Second, dynamic case prioritization. The confidence grade combines with contextual factors &#8212; procedure cost, bias risk, historical error rates for that procedure category &#8212; to decide which cases are sent to human clinicians for review. High-confidence cases auto-resolve; low-confidence and high-stakes cases route to a small clinical team. Anterior reports a team of fewer than ten clinical reviewers handling tens of thousands of cases, against a competitor reportedly employing 800+ nurses for comparable review volume. (Caveat: scope of work may differ. 
Take the comparison directionally.)</p><p>The third move is the one most teams miss. Anterior runs alignment metrics between the LLM-judge and the human reviewers on cases that get both, and uses that data to validate &#8212; and continuously recalibrate &#8212; the judge itself. They call this &#8220;validating the validator.&#8221; It is the missing piece in most LLM-judge deployments. Without it, the judge can drift, and you only learn about it when the harness has been mutating against bad signal for weeks.</p><p>Anterior&#8217;s <a href="https://www.anterior.com/insights/ahip-commitment-health-plans">vendor-reported numbers</a>: 99.26% accuracy on automated approvals, against 86% baseline human accuracy, with 76% reduction in human review needed and 74% less time per escalated case. Cross-reference with Anterior&#8217;s <a href="https://arxiv.org/abs/2603.14631">own arXiv paper on fairness evaluation</a>, which reports model error rates across 7,166 human-reviewed cases spanning 27 medical necessity guidelines. Independent validation remains an open need; the 96% F1 figure that has circulated comes from Anterior&#8217;s own talks, not a peer-reviewed audit.</p><p>The architectural lesson generalizes far past healthcare. Any vertical agent operating at scale where ground truth is expensive &#8212; fraud review, AML, KYC, contract triage, claims adjudication, security alert triage &#8212; can adopt the same three-part move: reference-free judge in line, dynamic routing on confidence and stakes, alignment metrics that validate the judge against the humans that exist.</p><div><hr></div><h2>Case 4: Azure SRE Agent &#8212; when the agent debugs itself</h2><p>Microsoft&#8217;s Azure Site Reliability Engineering Agent handles <a href="https://techcommunity.microsoft.com/blog/appsonazureblog/the-agent-that-investigates-itself/4500073">tens of thousands of incidents weekly</a> for internal Microsoft services and external teams. 
The team published a remarkably honest engineering retrospective in March 2026 about how they closed their improvement loop.</p><p>The starting point: incident resolution rates were climbing toward 50% on highly instrumented scenarios &#8212; but the high-performing scenarios all shared a trait. They had been built with heavy human scaffolding: custom response plans, hand-built sub-agents for known failure modes, pre-written log queries exposed as opaque tools. On any new incident class, the agent had nowhere to start. Engineers were reading 50 lower-scored threads a week while the agent handled 10,000 &#8212; debugging at human speed.</p><p>The inversion they made: stop pre-computing the answer space. Instead, give the agent a filesystem as its world (source code, runbooks, query schemas, past investigation notes &#8212; all files; no <code>SearchCodebase</code> API), context hooks that orient it on what it can access, and frugal context management that keeps long investigations sharp. Three architectural bets, in their words. The result: <a href="https://techcommunity.microsoft.com/blog/appsonazureblog/the-agent-that-investigates-itself/4500073">Intent-Met score on novel incidents &#8212; whether the agent&#8217;s investigation actually addressed the root cause as judged by the on-call engineer &#8212; rose from 45% to 75%</a>.</p><p>The closing move is the one to study. They set up a daily monitoring task: the agent searches the last 24 hours for LLM errors &#8212; timeouts, 429s, mid-stream failures, malformed payloads &#8212; clusters the top hitters, traces each to its root cause in its own codebase, and submits a PR. Engineers review before merging. Over two weeks, errors dropped by more than 80%.</p><p>The agent, in other words, became its own debugger. The harness that runs the SRE agent is now updated by the SRE agent itself, gated by human PR review. 
The team&#8217;s framing is the title of their post: &#8220;The agent that investigates itself.&#8221; It is not a metaphor.</p><div><hr></div><h2>What actually changes (the levers)</h2><p>The most under-appreciated property of these loops is <em>what</em> they mutate. Across every case study above, the changes that produced the gains were:</p><p>The <strong>system prompt</strong> and <strong>task instructions</strong>. ILWS, the &#8220;Instruction-Level Weight Shaping&#8221; framework, formalizes this: <a href="https://arxiv.org/pdf/2509.00251">a session-level reflection engine proposes a structured edit to the system prompt &#8212; a knowledge delta &#8212; that is gated, accepted only if a sliding-window quality rating improves with statistical significance, and rolled back otherwise</a>. Most production teams do this informally. Formalizing it gives you reversibility under governance, which regulators ask for.</p><p><strong>Tool definitions and skills</strong>. <a href="https://blog.langchain.com/improving-deep-agents-with-harness-engineering/">LangChain&#8217;s improvement was largely middleware</a>: a <code>LocalContextMiddleware</code> that maps the working directory and onboards the agent into its environment, a <code>LoopDetectionMiddleware</code> that intercepts repeated edits to the same file and forces a plan reconsideration, a <code>PreCompletionChecklistMiddleware</code> that blocks the agent from exiting before it runs a verification pass. None of these are model changes. All are tool-and-hook surface.</p><p><strong>Memory and knowledge files</strong>. Microsoft replaced their RAG-over-past-sessions memory with <a href="https://techcommunity.microsoft.com/blog/appsonazureblog/the-agent-that-investigates-itself/4500073">structured Markdown files the agent reads and writes through its standard tool interface &#8212; overview.md, team.md, logs.md, debugging.md</a>. 
The model navigates memory by following links, not by retrieving via embedding similarity. This is the &#8220;the repo is the schema&#8221; insight. Memory becomes a write-able artifact that future runs read.</p><p><strong>Sub-agents and routing</strong>. Anterior routes by confidence &#215; stakes. Azure SRE spawns parallel sub-agents per hypothesis when a single context is at risk of getting polluted. Hippocratic uses a 21-model supervisory constellation around a primary conversational model. None of these compositions require retraining the underlying weights; they require designing the orchestration layer.</p><p><strong>Judge rubrics</strong>. The Harvey ceiling is the rubric ceiling. The Anterior calibration is the judge alignment with humans. The fastest leverage in most teams&#8217; first improvement loop is not a fancier judge &#8212; it is a better-written rubric and a small humans-vs-judge alignment dataset.</p><p><strong>Fine-tuning the small models in the harness</strong>. Sometimes weights do change, but on the <em>components</em>, not the primary model. NVIDIA NeMo&#8217;s case study on an enterprise data flywheel: a routing model fine-tuned from Llama 3.1 70B down to a Llama 3.1 8B variant achieved <a href="https://arxiv.org/pdf/2510.27051">96% accuracy with a 10&#215; model size reduction and 70% latency improvement</a>. The query rephrasal model gained 3.7% accuracy with a 40% latency cut. The orchestrating LLM was untouched.</p><p>The pattern is consistent: when you map &#8220;improvements shipped&#8221; against &#8220;components that changed&#8221; across these case studies, the primary reasoning model is the <em>least</em> common thing that gets edited. The harness layer carries the weight.</p><div><hr></div><h2>Where these loops break</h2><p>Six failure modes show up repeatedly. 
None are theoretical; each one has burned at least one of the teams in the case studies above.</p><p><strong>Overfitting to recent failures.</strong> Tune the harness only against last week&#8217;s top errors and you regress on tasks the changes weren&#8217;t targeting. LangChain&#8217;s iteration log explicitly marks these as discarded runs. Without a frozen eval set that the validation gate runs <em>every</em> mutation against, you&#8217;ll fix Monday&#8217;s bug and silently break Tuesday&#8217;s working flow.</p><p><strong>Reward hacking against the rubric.</strong> When the agent edits its own harness against an LLM judge&#8217;s scoring, the judge&#8217;s scoring is the optimization target &#8212; including any blind spots in the rubric. Harvey caveats this directly: the improvements track the rubric, and the rubric is human-authored and incomplete. Periodic out-of-distribution evals from a <em>separate</em> judge with a <em>separate</em> rubric catch this.</p><p><strong>Judge drift and validator fragility.</strong> Anterior&#8217;s validate-the-validator move exists because LLM-judges drift, and the drift is silent. If the judge is the substrate for routing, clustering, and mutation decisions, judge drift propagates everywhere. Measuring judge alignment against humans on a rolling sample of cases is the only known fix.</p><p><strong>Memory staleness.</strong> Microsoft flagged this as their unsolved problem: <a href="https://techcommunity.microsoft.com/blog/appsonazureblog/the-agent-that-investigates-itself/4500073">when two sessions write conflicting patterns to debugging.md, the model has to reconcile them; when a service changes behavior, old memory entries become misleading</a>. Timestamps and explicit deprecation help, but no production team has solved this systematically.</p><p><strong>Privacy and regulatory constraints on production data.</strong> Healthcare and finance can&#8217;t freely route production traces into a learning loop the way a generic SaaS product can. 
The TikTok Pay ARIA paper handles this by having the agent <a href="https://arxiv.org/abs/2507.17131">self-identify uncertainty through structured self-dialogue and request targeted explanations from human experts at runtime</a>, keeping learning at test time inside the regulatory boundary. Hippocratic uses synthetic test calls plus consented real-call evidence; Anterior keeps clinician review and AI determination in the same compliance perimeter.</p><p><strong>Compounding errors when the validator itself fails.</strong> A bad judge calibrated against a small alignment set drifts. A bad alignment set lets the judge calibrate against itself. A bad clustering layer groups the wrong failures together. Each layer of the loop is a place errors can go undetected and propagate. The defense is treating every layer as an evaluable artifact &#8212; the judge has a precision/recall, the cluster labels have inter-rater agreement, the harness mutations have a regression budget.</p><p>The seventh failure mode, which is institutional rather than technical: nobody owns the loop. In every case study above, the loop is owned by a named team with a named lead &#8212; Grupen at Harvey, Mukherjee at Hippocratic, Mehta and team at Microsoft. Loops without owners decay quietly.</p><div><hr></div><h2>Build order</h2><p>If you&#8217;re standing up a vertical agent and don&#8217;t yet have this loop, the build order is fixed and the order matters. None of the steps require the next-generation model.</p><p>Start with <strong>traces</strong>. Every tool call, every model input, every model output, every latency, every outcome, with a stable trace ID per session. If you can&#8217;t reconstruct what happened, none of the rest of the loop works. LangSmith, Arize Phoenix, Braintrust, and OpenTelemetry-based stacks all do this; pick one and instrument every call path before anything else.</p><p>Then write <strong>one rubric</strong> for one task. Not a benchmark suite. 
One task that matters, one rubric that an expert in your domain would sign off on. Score 50 production traces against it manually. The rubric you ship will be wrong in instructive ways; the act of writing and applying it surfaces the failure modes you didn&#8217;t know you had.</p><p>Add a <strong>judge</strong> against that rubric. Run it inline on a sample of production. Run it against the 50 you scored manually. Compute alignment. If alignment is below ~70%, the rubric is the problem, not the judge.</p><p>Add the <strong>clustering and mutation step</strong> last. Cluster the lowest-scored traces, propose one harness change, gate it against your offline eval, ship if it passes, measure the production effect. This is one cycle. Run it weekly.</p><p>The model upgrade question takes care of itself once the loop is running. When a better base model ships, you swap it in, rerun the validation gate, and observe whether your harness over-fits to the old model. (Different models reward different harnesses &#8212; <a href="https://blog.langchain.com/improving-deep-agents-with-harness-engineering/">Claude Opus 4.6 scored 59.6% with a harness tuned for GPT-5.2-Codex on Terminal Bench 2.0; the same Claude with its own harness moved several positions</a>.) The harness tax of switching models is real, but it&#8217;s a calibration problem, not a foundational one.</p><p>The reason this matters now and not in twelve months is asymmetry. Vertical agent winners in 2026 will not be the teams with the best zero-shot model. They will be the teams whose deployed agents are quietly compounding skill every week the rest of the market sits frozen. The loop is the moat.</p><p>Build the trace store this week. Write the first rubric next week. 
The rest of it follows.</p>]]></content:encoded></item><item><title><![CDATA[Felix Is a Harness, Not a Model: How Rogo Built an Agent for High Finance]]></title><description><![CDATA[Rogo just raised $160M Series D led by Kleiner Perkins. The architecture behind their Felix agent is what AI engineers should be studying.]]></description><link>https://theairuntime.com/p/felix-is-a-harness-not-a-model-how</link><guid isPermaLink="false">https://theairuntime.com/p/felix-is-a-harness-not-a-model-how</guid><dc:creator><![CDATA[The AI Runtime]]></dc:creator><pubDate>Fri, 01 May 2026 11:03:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dqwt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p><strong>TL;DR</strong> - Rogo serves <a href="https://www.prnewswire.com/news-releases/rogo-raises-160m-series-d-to-scale-the-agentic-platform-for-finance-302756546.html">more than 35,000 professionals at over 250 institutions</a> &#8212; Rothschild &amp; Co, Jefferies, Lazard, Moelis, Nomura &#8212; with an AI agent called Felix that bankers email like a junior analyst and 
get back finished decks, models, and memos. The interesting part is not the model. Rogo&#8217;s <a href="https://rogo.ai/news/gpt-5.5-now-available-in-rogo">own product team calls Felix their &#8220;agent harness&#8221;</a> &#8212; a vertical scaffolding designed to be model-agnostic across GPT 5.5, Claude Opus 4.7, and Gemini. Felix is the playbook for vertical AI: the moat is the harness, the evals, the data integrations, and the deployment model &#8212; not which frontier LLM is wired in this quarter. If you are building a vertical agent, study how Rogo decomposed the problem before you pick a model.</p></div><h2>What Rogo Actually Sells</h2><p>A precision note first: when people say &#8220;banking&#8221; in this conversation, they don&#8217;t mean retail or commercial banking. Rogo sits inside high finance &#8212; investment banking, private equity, hedge funds, equity research, asset management. <a href="https://rogo.ai/felix">Rogo&#8217;s own product page</a> explicitly calls out its three audiences: Banking, Private Markets, Public Markets. The workflows are deal-shaped: pitchbooks, comps, models, memos, CIMs, diligence trackers.</p>
<p>Rogo was founded by <a href="https://rogo.ai/company">Gabriel Stengel and John Willett</a> &#8212; both ex-investment-bankers (Lazard, J.P. Morgan, Barclays) &#8212; with Tumas Rackaitis. That founder profile matters because the company&#8217;s edge is not the LLM; it is the granular, painful familiarity with what a 2 AM CIM revision actually looks like.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dqwt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dqwt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!dqwt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!dqwt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dqwt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dqwt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1085822,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiengineerweekly.substack.com/i/196065605?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dqwt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!dqwt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!dqwt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!dqwt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43d8c23a-23c7-48a9-bc63-ad7fdf07018e_1024x559.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Felix Architecture</em></p><p>Yesterday&#8217;s <a 
href="https://www.kleinerperkins.com/perspectives/rogo-the-ai-platform-for-global-finance/">$160M Series D, led by Kleiner Perkins</a> with participation from Sequoia, Thrive, Khosla, and J.P. Morgan Growth Equity Partners, brings total funding <a href="https://www.prnewswire.com/news-releases/rogo-raises-160m-series-d-to-scale-the-agentic-platform-for-finance-302756546.html">past $300M</a>. The capital is going toward two things that tell you what they actually believe: deeper data integrations and more forward-deployed bankers embedded inside client institutions.</p><h2>Felix Is a Harness, Not a Model</h2><p>The single most useful sentence Rogo has published this year shows up in their <a href="https://rogo.ai/news/gpt-5.5-now-available-in-rogo">GPT 5.5 release note</a>: &#8220;we&#8217;ve begun incorporating GPT 5.5 into our agent harness, Felix.&#8221; Read that twice.</p><p>Felix is not a fine-tuned model. Felix is the <em>harness</em> &#8212; the orchestration scaffold, tool layer, citation system, output formatters, audit trail, and policy controls &#8212; into which Rogo plugs whichever frontier model performs best on their internal benchmark this week. They are explicit that they are model-agnostic across <a href="https://rogo.ai/news/gpt-5.5-now-available-in-rogo">OpenAI, Google, and Anthropic</a>, and TAMradar&#8217;s coverage notes the platform <a href="https://www.tamradar.com/funding-rounds/rogo-series-d-160m">supports GPT 5.5 and Anthropic Opus 4.7</a> concurrently.</p><p>This separation is load-bearing. In the <a href="https://aiengineerweekly.substack.com/p/model-reliability-engineering-who">Model Reliability Engineering</a> frame, the harness is one of the two reliability axes &#8212; the scaffolding you build <em>around</em> the model to make its behavior production-safe. Rogo&#8217;s product team uses the word the same way. 
The implication for builders: when frontier labs ship a 4% improvement on your domain, you swap the engine; when they ship a 40% improvement two years from now, your harness is what survives.</p><p>Here is the rough shape of what&#8217;s inside Felix:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eHei!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eHei!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png 424w, https://substackcdn.com/image/fetch/$s_!eHei!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png 848w, https://substackcdn.com/image/fetch/$s_!eHei!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png 1272w, https://substackcdn.com/image/fetch/$s_!eHei!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eHei!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png" width="840" height="584" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:584,&quot;width&quot;:840,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44263,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineerweekly.substack.com/i/196065605?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eHei!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png 424w, https://substackcdn.com/image/fetch/$s_!eHei!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png 848w, https://substackcdn.com/image/fetch/$s_!eHei!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png 1272w, https://substackcdn.com/image/fetch/$s_!eHei!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0c0af4-e7be-4350-bfec-34d12bcc909b_840x584.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Detail belongs in the prose, not the diagram. Three components below carry the real weight.</p><h2>The Email Interface Is the Real Interface</h2><p>The product surface that ships with Felix is unusual: bankers send Felix an email the same way they would a colleague, get an acknowledgment in under a minute with an ETA, and receive PowerPoint, Excel, Word, and PDF deliverables back when ready. Iteration happens by replying to the email thread.</p><p>This is not a UX gimmick. It tells you something about how the team thinks about adoption. Investment bankers already live in Outlook. Asking them to adopt a new interface is a tax. Email-as-API removes the tax. It also imposes async semantics on the agent: a long-running task with intermediate status, observable state via the inbox, and a clean handoff back to the human reviewer. 
The harness has to absorb that asynchrony &#8212; request queuing, intermediate progress, partial results, source attribution surviving the round-trip &#8212; without leaking it back to the user.</p><p>The output substrate matters too. Felix returns work in Excel, PowerPoint, and Word formatted in the firm&#8217;s own templates and house style. A pitchbook that doesn&#8217;t match house formatting is not 90% done; it is 0% done. Vertical AI rises or falls on output substrate fidelity.</p><h2>The Big Finance Benchmark: Vertical Evals Are the Moat</h2><p>Rogo curates an internal evaluation set called the Big Finance Benchmark &#8212; real financial tasks designed by their ex-finance team. Tasks include valuing companies, benchmarking peers on specific metrics, and building theses across disparate documents. They are explicit that these come from real workflows, not synthetic prompts.</p><p>This is the unsexy infrastructure that compounds. When OpenAI ships GPT 5.6 next quarter, Rogo will know within a day whether it improves CIM drafting on real deals or just moves its MMLU score. That is the kind of judgment a horizontal benchmark cannot give you. Every serious vertical AI company will need its own version of this. If you are building one and you don&#8217;t have a domain-specific eval suite, you are flying without instruments.</p><h2>Workflow Surface: What Felix Actually Does</h2><p>The concrete capabilities Rogo has shipped span deal screening, CIM generation, buyer outreach, data room diligence, comps and models, and pitchbooks and memos. 
Decomposed:</p><ul><li><p><strong>Deal screening.</strong> Filtering thousands of potential targets against thesis criteria.</p></li><li><p><strong>CIM generation.</strong> Drafting Confidential Information Memoranda &#8212; the 50-to-100-page sell-side documents that anchor M&amp;A processes.</p></li><li><p><strong>Buyer outreach.</strong> Generating personalized contact lists and initial communications.</p></li><li><p><strong>Data room diligence.</strong> Synthesizing across the document piles that buyers and bankers wade through.</p></li><li><p><strong>Comps and models.</strong> Building Excel spreadsheets with historical financials and forward forecasts.</p></li><li><p><strong>Pitchbooks and memos.</strong> Decks for a CEO meeting, memos for an investment committee.</p></li></ul><p><a href="https://siliconangle.com/2026/04/29/rogo-raises-160m-speed-financial-analysis-ai-agents/">SiliconANGLE&#8217;s coverage</a> notes that Felix can also offer to keep a report current &#8212; for example, an analyst covering Apple can have the agent re-run the report each time the company reports earnings. Scheduled, recurring agent runs are part of the surface.</p><p>The data substrate behind these tasks is extensive. <a href="https://www.tamradar.com/funding-rounds/rogo-series-d-160m">TAMradar lists integrations</a> with PitchBook, LSEG, Cap IQ, FactSet, Fitch Solutions, and Third Bridge, plus internal CRM and SharePoint connectors. Auditable outputs are positioned for SOC 2, ISO 27001, GDPR, and EU AI Act compliance &#8212; the table-stakes regulatory surface for institutional finance.</p><h2>Sisyphus: The Other Harness</h2><p>The most under-covered part of Rogo&#8217;s stack is a second internal agent called <a href="https://rogo.ai/news/introducing-sisyphus-autonomous-security-for-financial-ai-infrastructure">Sisyphus</a> &#8212; an autonomous offensive-security agent that pen-tests Rogo&#8217;s own infrastructure once or twice a day, calibrated to deployment cadence. 
It runs structured campaigns across authentication abuse, authorization bypass, injection, SSRF, and LLM-specific exploit categories, and it chains findings to validate exploitability rather than just flagging signals.</p><p>Two numbers from Rogo&#8217;s own writeup are worth remembering. One week after a third-party penetration test, Sisyphus identified 18 additional exploitable vulnerabilities in a single afternoon, most chained, all remediated within hours. And on calibration: high-confidence findings now carry a &gt;95% true-positive rate after the team tuned the recon phase and compared the agent&#8217;s triage against their human security team.</p><p>This is the harness for the harness. If your vertical agent platform handles consequential workflows, &#8220;we get pen-tested twice a year&#8221; is not a posture; it is a vulnerability window. Sisyphus is what the security side of vertical AI starts to look like.</p><h2>Forward-Deployed Bankers: The Human Harness</h2><p>Rogo&#8217;s go-to-market is structured around an embedded role they call Forward Deployed Bankers &#8212; ex-bankers from top firms who sit inside client institutions and onboard teams from analyst to managing director. The new capital is funding expansion of this team from New York into London.</p><p>This is not professional services in disguise. It is closer to what Palantir built for defense and intelligence: domain-fluent humans who translate between the workflow and the platform, calibrate the agent&#8217;s outputs to firm-specific style, and surface workflow gaps that become product. They understand model formatting and how a positioning section actually reads. Without them, the harness loses ground truth on what &#8220;good&#8221; looks like inside each firm&#8217;s house style.</p><p>For builders: the lesson is that adoption inside regulated, high-status industries is bottlenecked on trust transfer, not feature parity. 
The forward-deployed model is expensive and it is a moat.</p><h2>What&#8217;s Actually Being Transformed</h2><p>Bankers do not get replaced; their pyramid does. Rogo&#8217;s Series D announcement is explicit that leading firms are &#8220;restructuring workflows, rethinking staffing pyramids, and deploying autonomous agents that work asynchronously across every transaction.&#8221; A managing director at one client described Felix as having tripled team output with no headcount additions. That is the shape of the transformation: same senior judgment layer, compressed junior layer, agent layer doing the asynchronous grunt work, forward-deployed bankers tuning the seams.</p><p>Rogo&#8217;s two recent acquisitions tell you where they are aiming next. <a href="https://techfundingnews.com/rogo-160m-series-d-kleiner-perkins-investment-banking-ai/">Plux AI</a> &#8212; a UK firm tracking complex financial market developments &#8212; adds European market coverage. <a href="https://siliconangle.com/2026/04/29/rogo-raises-160m-speed-financial-analysis-ai-agents/">Offset</a>, an AI agent company whose tech automatically updates financial models when new information arrives, plugs directly into the live-model side of the harness.</p><h2>Five Lessons If You Are Building a Vertical Agent</h2><ol><li><p><strong>The harness is the moat, not the model.</strong> Build it so frontier-model upgrades are a config change, not a rewrite.</p></li><li><p><strong>Domain-specific evals beat horizontal benchmarks.</strong> Curate real tasks from real practitioners. 
Run them every model release.</p></li><li><p><strong>Output substrate must match the destination workflow.</strong> A correct answer in the wrong format is the wrong answer.</p></li><li><p><strong>Forward deployment changes adoption math.</strong> Domain-fluent humans embedded in the customer org are a feature, not overhead.</p></li><li><p><strong>Security needs its own harness.</strong> When agents do consequential work, periodic pen tests leave a window. Continuous adversarial testing is the new floor.</p></li></ol><h2>What to Do This Week</h2><p>Pick one workflow you&#8217;ve watched a domain expert do that you suspect an agent could absorb. Don&#8217;t model it yet. Instead, write down four things: the data sources they pull from, the output format they hand back, the audit trail they leave, and the colleague they email when they get stuck. Those four are your harness specification. The model goes in the middle of that, and you can swap it out next quarter.</p><p>If your current agent prototype only handles one or two of those four, you have not built a harness yet. You have built a wrapper.</p>]]></content:encoded></item></channel></rss>