The AI Runtime: Vertical Agents

The "Self-Improving" AI Myth (And What 60 Production Deployments Actually Do)

The AI Runtime — Mon, 15 Jun 2026 11:03:29 GMT

TL;DR - Most self-improving AI agents in 2026 do not improve through model weight updates. They improve at the harness layer (orchestration, evaluation, A/B gating) and the context layer (per-tenant memory and skill libraries). Sixty production deployments across customer support, devtools, legal, healthcare, finance, sales, recruiting, and real estate converged on this architecture. The vertical leaders win by owning the loop inside the product, not by training a better base model. The biggest unfilled opportunity is the horizontal self-improvement engine: no vendor sells one. If you build in a vertical, invest in eval surface and outcome-based pricing. If you build horizontally, ship the packaged loop that plugs into any agent stack.

What “self-improving” actually means in production

The “self-improving agent” label is doing heavy lifting in 2026 product marketing. The label conflates four very different mechanisms: weight updates from production traces, harness changes shipped after A/B testing, per-tenant context that accumulates from user interactions, and procedural skill libraries that compound across sessions. Most products marketed as “self-improving” do exactly one of these, usually the third, and stop there.

Across the production landscape, the dominant pattern is harness plus context. Real model weight updates are concentrated in a handful of companies with proprietary data flywheels: Hippocratic AI versions its Polaris suite with documented progression from 96.79% to 99.38% clinical accuracy on its RWE-LLM safety benchmark, EvenUp trained Piai on hundreds of thousands of personal-injury cases, Abridge ships custom medical speech recognition across fourteen languages, and Harvey co-developed a case-law model with OpenAI. Everyone else differentiates at the harness and context layers.

The model layer is where a small number of verticalized leaders defend a moat. The harness layer is where every serious player wins or loses the production loop.

The three-layer architecture

The three-layer model, popularized by LangChain in early 2026 and now widely adopted, separates what changes in an LLM system into model, harness, and context. Each layer has its own improvement surface, its own cost curve, and its own failure modes.

Model layer: rare, expensive, high-moat. True weight updates from production traces. The pattern is consistent across the handful of companies that do this: own a proprietary data flywheel, build a benchmark, ship versioned models. Glean’s Waldo agent runs on Nemotron 3 Nano. Most enterprises buy frontier models and never touch the weights.

Harness layer: the active battleground. The orchestration, evaluation, retry, gating, and reflection logic that constrains and verifies model behavior before output reaches the user. This is where almost all 2026 differentiation lives. Cursor publishes harness-improvement details openly, measuring releases against an internal CursorBench plus an offline grader plus online A/B with Code Retention and Keep Rate as proxies. Anthropic documents Claude Code’s harness evolution across released versions. Decagon’s Agent Operating Procedures are a harness in disguise. The term “harness engineering” has moved from Anthropic-internal vocabulary to industry-standard framing.

Context layer: where customer specificity compounds. Per-tenant memory, knowledge packs, skill libraries, and tool manifests. Every serious vertical agent has a context studio: Sierra Explorer, Decagon Duet, Hex Context Studio, Ada Coaching, Intercom Procedures, Harvey Memory. This layer is where customer-specific value compounds, and where the multi-tenant safety problem lives.

Harness Topology applied across sixty deployments

Harness Topology is the comparative discipline of analyzing harness shape across regulated industries, used to identify which harness design patterns generalize across verticals and which are domain-locked. It builds on Vertical Agent Anatomy and applies it cross-vertically. Inside it sit two named concepts: Harness Half-Life (the durability axis: how fast harness investment depreciates as the model evolves) and Harness Saturation (the viability threshold: the point at which a system labeled an “agent” is a deterministic workflow with an LLM bolted on).

Applied empirically across roughly sixty production deployments in 2026, Harness Topology reveals nine patterns that repeat across every leading vertical agent. A vertical agent that does not implement at least seven of them is missing a core mechanism.

1. Traces as the unit of truth. Every serious shop treats execution traces as the artifact that drives improvement. LangSmith, Langfuse, OpenTelemetry GenAI conventions, and Cursor’s internal trace store all encode this.
2. LLM-as-judge plus golden datasets. Glean’s internal AI Evaluator hits 74% human agreement rate; Harvey’s LAB benchmark uses rubric-based LLM grading; Cursor’s CursorBench and Decagon’s simulation suite combine LLM judges with human review.
3. Per-tenant context studios. Sierra Explorer, Decagon Duet, Hex Context Studio, Ada Coaching, Intercom Procedures, Harvey Memory. Every leader has one; nobody sells one as a horizontal product.
4. Skill libraries as procedural memory. Anthropic’s Agent Skills standard (SKILL.md plus scripts) is being copied by Replit, Devin, OpenHands, and Cursor Rules. An open-source marketplace ecosystem already exists at scale.
5. A/B harness experimentation on real traffic. Cursor A/Bs harness variants and measures Code Retention. Intercom A/Bs Fin against production baselines on every change. Decagon versions and simulates conversations before deployment.
6. Outcome-based pricing aligned with the improvement loop. Sierra charges only on full resolution; Intercom Fin charges per resolution; EvenUp’s pricing is tied to settlement outcomes. The loop is structurally aligned with the customer metric.
7. Offline “dreaming” jobs. Coding agents that run nightly over recent traces, propose harness or context changes, and gate against an eval suite. Sierra Explorer, Decagon Duet, Intercom Optimize, and Cursor’s harness-improvement agent are all variants.
8. Vertical benchmarks as the eval moat. Harvey’s LAB, Mercor’s APEX-Agents (open-sourced on Artificial Analysis), Hippocratic’s RWE-LLM. The benchmark itself becomes a competitive asset distinct from the product.
9. Workflow / SOP ingestion as cold-start. Sierra Ghostwriter, Decagon AOPs, Ada Playbooks, Harvey Workflow Agent. Natural-language SOPs and call transcripts bootstrap the first agent before any improvement loop has data.

The patterns are not fashionable. Each one closes a specific failure mode. A vertical agent missing trace infrastructure cannot improve at all. A vertical agent without a golden dataset cannot ship harness changes safely. A vertical agent without per-tenant context cannot survive contact with the customer’s idioms. The list is functional, not aesthetic.
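The judge-versus-human agreement figure cited for the LLM-as-judge pattern is simple to compute once a golden dataset carries both verdicts. The following is a minimal sketch, not any vendor's evaluator; the function name and the sample labels are assumptions made for illustration.

```python
def agreement_rate(judge_labels, human_labels):
    """Share of golden-set examples where the LLM judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge and human label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical golden-set verdicts: True means "this output is acceptable".
judge = [True, True, False, True, False, True, True, False]
human = [True, False, False, True, False, True, True, True]

rate = agreement_rate(judge, human)
print(f"judge/human agreement: {rate:.0%}")
```

A team would typically track this number per release of the judge prompt, since a judge that drifts away from human reviewers silently weakens every gate built on top of it.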

The five standardized axes that turn rebuild into configuration

Across verticals, the leaders also converge on five standardized axes that turn vertical onboarding from rebuild into configuration. Each axis is a slot in a generic agent kernel (planner, memory, tool router, critic) that gets filled with vertical-specific configuration.

1. Workflow / SOP ingestion as the bootstrap. Sierra Ghostwriter ingests existing SOPs and transcripts. Mercor Enterprise runs AI-moderated interviews with employees to extract tacit workflows. Harvey requires firm-specific upload of precedents before work begins.

2. MCP / connectors as the tool layer. MCP has effectively won as the integration standard. Clay’s Claygent connects to any MCP server. Devin and Cursor expose MCP marketplaces. Glean ships over 100 connectors. The tool layer is no longer differentiation; the manifest is.

3. Per-tenant context store. Universal pattern. Customer-specific knowledge, working style, precedents, and learned patterns isolated per tenant. The audit surface lives here.

4. Vertical benchmarks as the eval moat. Harvey’s LAB, Mercor’s APEX-Agents, Hippocratic’s RWE-LLM. The benchmark itself is the competitive asset. Harder to copy than a feature; compounds over time.

5. Outcome-based pricing. Sierra and Intercom Fin price on resolution; EvenUp ties to settlement outcomes. Pricing structurally aligns the improvement loop with the customer’s metric: the agent that improves the customer’s outcome also improves its own revenue.

These five axes are what make the harness layer the active battleground. A vertical agent with a strong eval suite and outcome-based pricing has a self-tuning revenue model. A vertical agent without them has shipped a demo.

What the numbers actually say

Most of the cited improvement numbers in the 2026 self-improving agent market are vendor-published. Treat them as directional. Where independent benchmarks exist, the picture is less flattering than the marketing.

Hebbia’s Matrix shows 92% accuracy with o1 versus 68% for out-of-the-box RAG on a legal/financial benchmark, but that is a vendor-published number on a vendor-defined benchmark. Cognition reports that Devin 2.0 is 83% more productive than 1.x per Agent Compute Unit; that figure is also vendor-published, with no methodology released. Intercom Fin reports 51% average resolution across its customer base, with Lightspeed at 65% end-to-end and Synthesia at 87% self-serve. Those figures are customer-reported but mediated through Intercom’s product analytics.

The independent benchmark numbers tell a different story. Mercor’s APEX-Agents benchmark (480 tasks across investment banking, consulting, and law, released open-source) shows frontier models scoring roughly 33%, a large gap to humans. OpenHands reports about 77% on SWE-Bench Verified with Sonnet 4.5. The verticals where independent benchmarks are publicly available are the verticals where the production gap is widest.

The reading is consistent with Harness Topology’s central claim: harness investment is what closes the gap between frontier-model benchmark score and customer-outcome resolution rate. The customer cares about resolution rate. The model cares about benchmark score. The harness is what translates one into the other.

Where the pattern saturates

Harness Saturation is the viability threshold inside Harness Topology, the point at which a system labeled an “agent” is actually a deterministic workflow with an LLM bolted on, because the harness has accumulated so many gates, verifications, and constraints that no autonomous decision remains. The end-of-agent signal: every decision is gated, every output is validated, every action requires approval, and the LLM call is reduced to a structured-output formatter.

The most regulated verticals are closest to saturation. Healthcare clinical-decision agents and legal demand-letter agents have so many compliance gates that the autonomous surface has collapsed to schema completion. Customer support is further from saturation because the customer’s tolerance for a wrong answer is higher than the patient’s. Devtools is furthest from saturation because the human reviewer is in the loop on every change.

Saturation matters because it tells the practitioner when to stop adding harness. Adding more gates past saturation degrades completion rates without lowering incident rates. The engineering-correct move is to drop the LLM and ship the deterministic workflow it became.

The biggest unfilled opportunity

The vertical-agnostic self-improvement engine does not exist as a product. Every serious vertical agent runs a version of the same loop: trace → cluster failures → propose context or skill update → gate against eval suite → ship. Sierra Explorer, Decagon Duet, Intercom Optimize, and Cursor’s harness-improvement coding agent all implement variants. None is sold as a horizontal product.
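The loop described above (trace, cluster failures, propose an update, gate against the eval suite, ship) can be sketched in a few lines. This is an illustrative skeleton under stated assumptions, not any vendor's implementation: `Trace`, `propose_update`, and the toy eval checks are hypothetical, and counting failure tags stands in for real failure clustering.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    tenant: str
    failure_tag: Optional[str]  # None means the run succeeded


def cluster_failures(traces):
    """Count failures per tag; a stand-in for real failure clustering."""
    return Counter(t.failure_tag for t in traces if t.failure_tag is not None)


def propose_update(tag):
    """Hypothetical proposer; in practice a coding agent would draft the change."""
    return f"skill: handle '{tag}' explicitly"


def gate(candidate, eval_suite, baseline_score):
    """Promote a candidate only if it beats the baseline on the eval suite."""
    score = sum(bool(check(candidate)) for check in eval_suite) / len(eval_suite)
    return score > baseline_score, score


def improvement_cycle(traces, eval_suite, baseline_score):
    clusters = cluster_failures(traces)
    if not clusters:
        return None  # nothing failed, nothing to propose
    top_tag, _ = clusters.most_common(1)[0]
    candidate = propose_update(top_tag)
    shipped, score = gate(candidate, eval_suite, baseline_score)
    return {"candidate": candidate, "score": score, "shipped": shipped}


# Hypothetical traces and toy eval checks, for illustration only.
traces = [
    Trace("acme", "refund_policy"),
    Trace("acme", None),
    Trace("acme", "refund_policy"),
    Trace("acme", "address_change"),
]
eval_suite = [
    lambda c: "refund_policy" in c,
    lambda c: c.startswith("skill:"),
]
result = improvement_cycle(traces, eval_suite, baseline_score=0.5)
print(result)
```

The design point is that the gate, not the proposer, decides what ships: a candidate that fails to beat the baseline is returned with `shipped` set to False rather than promoted.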

That is the largest single opening in the territory. A packaged self-improvement loop, plugging into any agent stack via traces, producing per-tenant context and skill updates that any eval suite can promote, would slot into every vertical without rebuilding the loop. The market readiness is high; the competitive whitespace is wide; the product does not exist.

The closest adjacent products are observability platforms (LangSmith, Langfuse), Anthropic’s Agent Skills registry which crossed 277,000 installs on the frontend-design skill alone, eval platforms (Braintrust, HoneyHive), and memory products (Mem0, Zep, Letta). None of them ships the full loop. The observability platforms see traces but do not propose changes. The eval platforms score outputs but do not generate updates. The memory products store context but do not curate skills.

A horizontal self-improvement engine sitting between these product categories would be the missing keystone. It is the single most valuable position in the 2026 agent landscape that no vendor occupies.

[Figure: The three-layer stack, visualized]

FAQ

Where does self-improvement actually happen in 2026 agents?

At the harness and context layers, not the model layer. The harness layer covers orchestration, evaluation, retry logic, A/B testing, and reflection loops. The context layer covers per-tenant memory and procedural skill libraries. Real model weight updates from production traces are concentrated in four to six companies (Hippocratic AI, EvenUp, Abridge, Harvey, Glean) and require a proprietary data flywheel that most enterprises do not have.

What is the difference between Harness Engineering and Context Engineering?

Harness Engineering controls what the user sees: the gates, verifications, and orchestration logic that constrain model behavior before output reaches the user. Context Engineering controls what the model knows: every method of getting information to the LLM at inference time, including RAG, fine-tuning, long-context injection, prompt design, MCP tool use, and persistent memory. Both sit inside Model Reliability Engineering, the broader discipline of making LLM behavior reliable in production.

Why has no horizontal self-improvement product emerged?

Because the eval surface is vertical-specific. A self-improvement loop is only as good as the eval suite that gates its proposed changes, and every vertical defines correctness differently: clinical accuracy in healthcare, citation faithfulness in legal, resolution rate in customer support, Code Retention in devtools. A horizontal product would need a configuration surface that accepts any eval definition, plugs into any agent stack via traces, and produces context or skill updates that any deployment can promote. That surface is hard to design and harder to sell into without a category to lean on.

What does a vertical leader build before the improvement loop exists?

Workflow ingestion. Every leader starts the same way: ingest existing SOPs, call transcripts, audio recordings, or expert demonstrations, and turn them into agent behavior. Sierra’s Ghostwriter is the canonical example. Mercor Enterprise runs AI-moderated interviews with employees to extract tacit workflows. Harvey requires firm-specific precedent upload before work begins. The improvement loop only starts producing value once enough traces have accumulated; the first agent has to ship from the cold-start.

How does Harness Saturation get diagnosed in practice?

Three indicators compound. First, every decision in the agent’s path is gated by a deterministic check. Second, every output is validated against a fixed schema. Third, every action requires human or rule-based approval. When all three are present, the LLM call has been reduced to a structured-output formatter, and the autonomous decision surface has collapsed. The engineering-correct response is to drop the LLM and ship the deterministic workflow the harness has become. Adding more harness past this threshold degrades completion rates without improving incident rates.
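The three indicators can be checked mechanically over a description of the agent's path. The sketch below assumes each step has already been annotated with the three flags; the `Step` type and the example path are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    gated: bool      # the decision passes through a deterministic check
    validated: bool  # the output is validated against a fixed schema
    approved: bool   # the action needs human or rule-based approval


def saturation_report(steps):
    """Share of steps showing each indicator, plus an overall saturated flag."""
    if not steps:
        raise ValueError("need at least one step")
    n = len(steps)
    report = {
        "gated": sum(s.gated for s in steps) / n,
        "validated": sum(s.validated for s in steps) / n,
        "approved": sum(s.approved for s in steps) / n,
    }
    # Saturated only when every step shows all three indicators.
    report["saturated"] = all(s.gated and s.validated and s.approved for s in steps)
    return report


# Hypothetical agent path for illustration.
path = [
    Step("classify_request", gated=True, validated=True, approved=True),
    Step("draft_response", gated=True, validated=True, approved=False),
    Step("send_response", gated=True, validated=True, approved=True),
]
report = saturation_report(path)
print(report)
```

In this example one step still acts without approval, so the path is not saturated; the per-indicator shares show how close it is.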

The two open positions

The architecture has settled. The leaders have converged. The opportunity has narrowed to two clean positions and one unfilled gap.

For vertical builders, the moat compounds in the eval surface and the customer-outcome metric. Capital invested in proprietary benchmarks and outcome-aligned pricing returns more than capital invested in fine-tuning the base model. The four to six companies running real weight-update loops are the exceptions that prove the rule: they own data flywheels nobody else can replicate, and even then the differentiation visible to the customer is harness-mediated. Vertical leaders without the data flywheel should stop trying to compete on model and start competing on benchmark depth and pricing structure.

For horizontal builders, the territory open in 2026 is the packaged self-improvement loop. Any product that plugs into an agent stack via traces, produces per-tenant context and skill updates, and gates them against the customer’s existing eval suite occupies a position no vendor currently holds. Observability sees the traces, eval platforms score outputs, memory products store context, but none ships the full loop. The market is ready, the configuration surface is hard but tractable, and the first credible category entry will define how every vertical agent procures self-improvement for the next decade.

A subscriber brief mapping the full sixty-deployment landscape, the nine cross-cutting patterns, the three-layer memory stack, the mechanism catalog, and the opportunity map accompanies this post as a PDF (Self Improving Agents Theairuntime, 3.56MB).

Dario Amodei’s “Policy on the AI Exponential” Describes a World Banking AI Already Lives In

The AI Runtime — Thu, 11 Jun 2026 21:22:19 GMT

The short version: Anthropic’s CEO just published Policy on the AI Exponential, asking for an FAA-style regulator that tests frontier models and blocks the unsafe ones before release. For most of the industry that is a new idea. For anyone building AI inside a bank, it describes the regime they have worked in since 2011, under a rule called SR 11-7. The interesting part is not the policy. It is that banking already has a working maturity curve for regulated AI agents, the acceptance criteria are published, and most teams sit two levels lower than they claim. Below is how to score your own agent, including a counterfactual test that exposes the most common lie a regulated AI tells.

Banking already built the thing the essay asks for

Amodei’s argument is that AI moves faster than policy can react, so frontier models should be tested by qualified third parties and blocked if they fail. The coverage fixated on the FAA comparison.

Banks have run a version of it for fifteen years. In 2011 the Federal Reserve and the OCC issued SR 11-7, the supervisory guidance on model risk management, later adopted by the FDIC. It rests on three pillars: sound development and use, independent validation, and governance with board accountability. The load-bearing idea is effective challenge, meaning critical analysis of a model by objective, competent people whose job is to find its limits. Examiners treat it as a baseline expectation, not a best-practice suggestion, and it covers machine learning, not just regression.

So if you build AI in a bank, the FAA the essay wants is not your future. It is the room you already stand in. Which makes the useful question simple: how mature is your agent inside that room, measured honestly?

The five levels, and the test that proves each one

Think of a credit-decisioning agent or an AML investigator as climbing five levels. The point of the levels is not the label. It is that each one has a specific eval that proves you are actually on it. If you cannot run the eval, you are not on the level.

Level one: it writes. A fluent credit memo or suspicious-activity narrative from a prompt, with nothing real underneath. No eval needed, because there is nothing to verify.

Level two: it is grounded. Every claim in the output traces to a record: the financials, the bureau pull, the transaction history, the bank’s own policy. The eval is a groundedness rate. Sample outputs, extract each factual claim, and check how many trace to a source. Anything below near-total is a hallucination problem, and a validator will find it before you do.
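A groundedness rate is a ratio over extracted claims. The sketch below assumes claim extraction and source matching have already happened upstream, which is the hard part in practice; the sample claims and source identifiers are invented for illustration.

```python
def groundedness_rate(outputs):
    """Share of factual claims that trace to a source record.

    outputs: one list per sampled output, each a list of (claim, source_id)
    pairs. source_id is None when no supporting record was found.
    """
    claims = [pair for output in outputs for pair in output]
    if not claims:
        raise ValueError("no claims to score")
    grounded = sum(1 for _, source_id in claims if source_id is not None)
    return grounded / len(claims)


# Hypothetical sample: claims already extracted and matched upstream.
sample = [
    [
        ("Revenue grew 12% year over year", "financials:2025"),
        ("Debt-to-income is 41%", "bureau_pull:0412"),
        ("Borrower has operated for ten years", None),  # no record found
    ],
    [
        ("Three wires exceeded the reporting threshold", "txn_history:q1"),
        ("Policy requires enhanced due diligence", "policy:aml-7"),
    ],
]
rate = groundedness_rate(sample)
print(f"groundedness: {rate:.0%}")
```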

Level three: it follows the rules. The output carries the regulatory format: adverse-action reason codes that satisfy ECOA and Regulation B, the required SAR elements, SR 11-7 documentation. The eval is a schema pass rate. Run a structured validator over a few hundred outputs and check that every required field is present and every reason code comes from the approved set. Most bank AI tooling reaches here, and here is where teams stop, because the output finally looks finished.
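The schema pass rate described here amounts to a structured validator run over a batch of outputs. A minimal sketch follows; the required fields and the reason-code set are placeholders, not the actual Regulation B or SAR field lists.

```python
# Hypothetical field names and reason codes, for illustration only.
REQUIRED_FIELDS = {"applicant_id", "decision", "reason_codes"}
APPROVED_REASON_CODES = {"DTI_HIGH", "SHORT_CREDIT_HISTORY", "INCOME_UNVERIFIED"}


def passes_schema(output):
    """True if every required field is present and every code is approved."""
    if not REQUIRED_FIELDS.issubset(output):
        return False
    return all(code in APPROVED_REASON_CODES for code in output["reason_codes"])


def schema_pass_rate(outputs):
    if not outputs:
        raise ValueError("need at least one output")
    return sum(passes_schema(o) for o in outputs) / len(outputs)


batch = [
    {"applicant_id": "a1", "decision": "deny", "reason_codes": ["DTI_HIGH"]},
    {"applicant_id": "a2", "decision": "deny", "reason_codes": ["VIBES"]},  # unapproved code
    {"applicant_id": "a3", "decision": "deny"},  # missing reason_codes
    {"applicant_id": "a4", "decision": "deny", "reason_codes": ["INCOME_UNVERIFIED"]},
]
rate = schema_pass_rate(batch)
print(f"schema pass rate: {rate:.0%}")
```

Note what this validator cannot see: it confirms that a reason code is well-formed and approved, not that it is the reason the model actually denied the application.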

Level four: the establishment grades it, not you. Your eval set is now owned by the people who challenge your model. Validation pass rate: how often effective challenge accepts the model versus sends it back. Adverse-action dispute rate. AML alert precision (filed SARs over alerts raised), measured against a baseline where roughly 95% of rule-based alerts are false positives. You are no longer measuring whether the document reads well. You are measuring what the regulator and the validator do with it.

Level five: it is accepted as evidence. The validator, then eventually the examiner, takes the model’s own output as proof rather than redoing the work. The model’s stated reason becomes the legally sufficient denial reason. No bank is fully here, and the essay is, in effect, arguing the whole industry should build toward it.

The eval that earns the share

Here is the test most regulated AI fails, and the one worth running first.

The CFPB stated in 2022 that a black-box model does not excuse a lender from giving a specific, accurate reason for every denial, and that a reason approximated after the decision is not enough, because it has to reflect the factors actually used. Read as an engineer, that is a faithfulness requirement on the explanation, and it is testable.

Take a denial where your system told the applicant the main reason was, say, debt-to-income. Change only that input, push debt-to-income into the approving range, hold everything else fixed, and re-score. If the decision does not flip, debt-to-income was not actually driving it. The reason you gave the applicant was a plausible story, not the cause, which is exactly the post-hoc approximation the CFPB rejects.

Run that counterfactual across a sample of denials and you get a reason-faithfulness rate: the share of stated reasons that actually move the decision. Most teams have never run it, and most are shocked by the result, because their reason codes come from a feature-importance library bolted on after the model, not from the decision itself. A level-three system passes the format validator and fails this test. That gap (looks compliant, is not faithful) is the single most useful thing to measure in regulated AI, and it is invisible to every demo.
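The counterfactual test can be sketched as follows. Everything here is a stand-in: `toy_score` is a hypothetical two-rule model in place of the production scorer, and `APPROVING_VALUES` is an assumed table of values that sit inside the approving range for each feature.

```python
# Assumed approving-range values per feature; hypothetical.
APPROVING_VALUES = {"dti": 0.30, "credit_score": 720}


def toy_score(applicant):
    """Hypothetical stand-in for the production model: 'approve' or 'deny'."""
    if applicant["dti"] > 0.43 or applicant["credit_score"] < 620:
        return "deny"
    return "approve"


def reason_is_faithful(score, applicant, stated_reason):
    """Change only the stated-reason feature, hold the rest fixed, re-score."""
    counterfactual = dict(applicant)
    counterfactual[stated_reason] = APPROVING_VALUES[stated_reason]
    return score(counterfactual) == "approve"


def reason_faithfulness_rate(score, denials):
    """Share of denials whose stated reason actually flips the decision."""
    if not denials:
        raise ValueError("need at least one denial")
    faithful = sum(reason_is_faithful(score, a, r) for a, r in denials)
    return faithful / len(denials)


# Each entry: (applicant features, reason given to the applicant).
denials = [
    ({"dti": 0.55, "credit_score": 700}, "dti"),           # decision flips
    ({"dti": 0.40, "credit_score": 580}, "dti"),           # does not flip
    ({"dti": 0.35, "credit_score": 600}, "credit_score"),  # decision flips
    ({"dti": 0.38, "credit_score": 590}, "dti"),           # does not flip
]
rate = reason_faithfulness_rate(toy_score, denials)
print(f"reason faithfulness: {rate:.0%}")
```

One design caveat: a denial with several independent causes will not flip when only one of them is changed, so a real test needs a policy for multi-reason denials. This sketch tests one stated reason at a time.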

This is also why level four matters more than it looks. Wiring the counterfactual check, the validation-survival rate, and the dispute rate back as your evaluation set is the move from grading your own homework to letting the establishment grade it. SR 11-7’s demand for effective challenge by independent validators is the same idea written into supervisory guidance: your model has to survive evaluation by an adversarial party you do not control. The thing Dario is asking regulators to impose on frontier models, banking already imposes on its own.

Why this is not only a banking problem

The pattern repeats wherever an external body owns acceptance: a regulator, an independent validator, a court. The value of your agent is capped by how much of that body’s judgment your harness has internalized, and the only honest measure of progress is their response, not your output. Pharma teams are watching the FDA write those criteria now. Banking has had them since 2011. The curve is identical. Only the establishment changes, which means the bank team that nails the counterfactual reason test is building the playbook insurance, legal, and healthcare teams will copy.

The honest catch

There is a fair argument that the top level never arrives. A regulator that accepts a model’s self-explanation as sufficient has given up some of the judgment it exists to apply, and the CFPB’s whole point was to distrust the black box. Weigh the messenger too: the person urging regulators to formalize AI testing runs an AI company, published the essay the day after shipping a new model, and critics have already called the broader proposal regulatory capture.

It does not change the practitioner takeaway. Even if no regulator ever accepts raw model output as standalone evidence, the criteria they enforce decide how much work your system is allowed to carry, and banking has the clearest published version of those criteria anywhere. The teams that read them as eval specs, not compliance paperwork, are the ones who will still be standing when the audit comes.

Where this leaves you

Score your most important regulated agent on the five levels, then run the counterfactual reason test on a sample of its outputs. If the stated reasons do not move the decisions, you are not where you think you are, no matter how clean the output looks. The banks that measure faithfulness instead of fluency are quietly building the standard every regulated industry will inherit when its own FAA finally arrives.

The Anatomy of an AI Legal Agent

The AI Runtime — Wed, 03 Jun 2026 11:04:23 GMT

TL;DR. In every other vertical, a wrong answer costs money. In law, a wrong answer that reaches a court costs a sanction, a malpractice exposure, and sometimes a license. That asymmetry is why the deployable unit in legal AI is never the model. It is the harness around it: the grounding layer that forces every legal proposition back to a retrieved primary source, the verification gate that refuses to pass an unverifiable citation, and the checkpoint router that decides which work product a human must sign. The two best-funded legal agents on the market, valued at eleven billion and two billion dollars, are not selling models. They are selling that harness. Before a legal agent ships, run one audit: take its last twenty outputs and try to trace every legal claim to a source it actually retrieved. The fraction you cannot trace is the real reliability number.

A legal agent is a production AI system whose defining component is verification, not generation. The model drafts; the harness proves. Across the leading deployments, the architecture converges on the same shape: retrieval grounded in primary law, a citation-verification gate that blocks unprovable claims, a checkpoint router that assigns a human reviewer by task risk, and an audit trail that survives discovery. The model is the smallest part. What surrounds it is what separates a tool a partner will sign behind from a tool that ends a career.

Why legal is the hardest reliability problem in vertical AI

Most vertical agents operate where errors are recoverable. A misrouted support ticket gets reassigned. A mispriced transaction gets reversed. Legal work has no such buffer once it reaches a tribunal. A fabricated citation in a filed brief is not a bug report; it is a Rule 11 violation.

The reference incident is already three years old and still defines the field. In June 2023, the Southern District of New York sanctioned two attorneys five thousand dollars after they filed a brief containing six judicial opinions that did not exist. A general-purpose chatbot had generated the cases, complete with names, citations, and quoted passages, and when one of the attorneys asked the tool to confirm the cases were real, it said yes. They were not. What looked at the time like an isolated embarrassment turned out to be the first documented instance of a structural failure mode. By late summer 2025, one count put the number of documented AI-hallucination legal filings above three hundred, with more than two hundred recorded in 2025 alone. The pattern was not confined to one tool or one court: a different general-purpose chatbot surfaced fabricated citations in a high-profile matter, and by early 2024 a federal appeals court had referred an attorney to a grievance panel for filing nonexistent AI-generated cases.

The profession’s governing body responded with a rulebook. In July 2024 the American Bar Association issued its first formal ethics opinion on generative AI, Formal Opinion 512, mapping the technology onto existing duties: competence under Model Rule 1.1, confidentiality under 1.6, candor to the tribunal under 3.3, and supervision under 5.3. The opinion’s operational core is that verification is not optional and not uniform. The required level of independent review is factually specific and depends on the tool and the task: generating ideas demands less scrutiny than reviewing a document, and in no case can the tool substitute for a lawyer’s own competent judgment. Because forty-nine of fifty states have adopted the core structure of the Model Rules, that opinion functions as a de facto national baseline rather than advice a firm can ignore.

Two duties beyond candor shape the deployment itself. Confidentiality under Model Rule 1.6 protects all information relating to a representation, which means a legal agent cannot route privileged material to a model endpoint that retains or trains on its inputs absent informed client consent. Data isolation is a precondition of the architecture, not a configuration toggle. Privilege and work-product doctrine compound the point: the audit trail the harness keeps to prove its own outputs is itself potentially discoverable, so how it is scoped and retained is a legal decision before it is an engineering one.

In law, verification is not a feature of the product. It is the legal duty the product exists to discharge.

This is the constraint every legal agent inherits. The duty to verify cannot be delegated to the thing producing the output. So the architecture has to externalize verification into a layer the model does not control.

The reliability floor no model has cleared

The instinct is to assume the problem is solved by retrieval. Wire the model to a database of real cases, ground every answer in retrieved text, and the fabrications stop. The vendors who built exactly that marketed it as the cure. The first independent measurement found otherwise.

Researchers at Stanford’s regulatory lab and human-centered AI institute ran the first preregistered empirical evaluation of the proprietary legal research tools that sit at the center of practice. The study, later peer-reviewed and published in the Journal of Empirical Legal Studies, tested the retrieval-augmented systems from the two dominant legal publishers across more than two hundred hand-scored legal queries. The conclusion was blunt: the providers’ claims are overstated. The tools hallucinated between seventeen and thirty-three percent of the time. Broken out, one publisher’s tool erred on roughly one in six queries and the other on roughly one in three, against forty-three percent for the raw general-purpose model used as a baseline.

Two findings inside that result matter more than the headline. First, retrieval helps and does not cure. Grounding the model in real law cut the error rate versus the bare model, from forty-three percent to seventeen percent for one tool and to thirty-three percent for the other, but a one-in-three failure rate on a tool sold as hallucination-free is not a rounding error. Second, the errors are not only invented cases. They include mischaracterizing a real case, citing inapplicable authority, and misstating what a rule says, which are harder for a busy associate to catch than a citation that simply does not resolve.

The architectural lesson is precise. If retrieval alone leaves a double-digit error rate, then grounding is necessary but not sufficient, and the harness needs a second mechanism downstream of retrieval whose only job is to test whether each generated claim is actually supported by the retrieved source. That mechanism is the verification gate, and it is the component that distinguishes a legal agent from a legal chatbot.

What a legal agent actually is

A production vertical agent decomposes into seven layers wrapping the model, the reference architecture set out in Vertical Agent Anatomy. Three of those layers carry almost all the weight in law, because the legal constraint loads them in a way no other vertical does.

The first is grounding. A legal agent does not answer from parametric memory. It retrieves the controlling authority, statute, regulation, case, or contract clause, and constrains generation to what it retrieved. This is table stakes, and as the Stanford measurement showed, it is also not enough on its own.

The second is the verification gate, and this is the layer that defines the vertical. After the model drafts, the harness re-derives every legal proposition against the retrieved corpus before anything reaches a human. Does the cited case exist. Does it say what the draft claims. Is it still good law. Is the quoted passage real. A claim that fails any check is flagged or dropped, not surfaced as a confident answer. The reason a verification gate is non-negotiable here and optional elsewhere is that the duty of candor makes an unverified citation a professional violation regardless of whether anyone catches it.
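
The four checks above can be sketched as a small function. This is a minimal illustration, not any vendor’s implementation: the type names, the `verifyDraft` function, and the caller-supplied `supports` check (which could be an entailment model or a human reviewer) are all hypothetical.

```typescript
// Hypothetical shapes for illustration; not taken from any real product.
type Authority = { id: string; text: string; goodLaw: boolean };

type Proposition = {
  claim: string;    // the legal proposition the draft asserts
  citedId: string;  // identifier of the authority the draft cites
  quote?: string;   // verbatim passage attributed to that authority, if any
};

type GateResult = { proposition: Proposition; passed: boolean; failures: string[] };

// Assumed to be supplied by the caller: decides whether the authority's
// text actually supports the claim.
type SupportCheck = (claim: string, authorityText: string) => boolean;

function verifyDraft(
  propositions: Proposition[],
  retrieved: Map<string, Authority>,
  supports: SupportCheck,
): GateResult[] {
  return propositions.map((proposition) => {
    const failures: string[] = [];
    const authority = retrieved.get(proposition.citedId);
    if (!authority) {
      // Does the cited case exist in what was actually retrieved?
      failures.push("cited authority not in retrieved corpus");
    } else {
      // Does it say what the draft claims?
      if (!supports(proposition.claim, authority.text)) {
        failures.push("authority does not support claim");
      }
      // Is it still good law?
      if (!authority.goodLaw) {
        failures.push("authority is no longer good law");
      }
      // Is the quoted passage real?
      if (proposition.quote !== undefined && !authority.text.includes(proposition.quote)) {
        failures.push("quoted passage not found in authority");
      }
    }
    return { proposition, passed: failures.length === 0, failures };
  });
}
```

The design point is that the gate reads only the retrieved corpus and the draft, never the model’s own confidence, so a failing proposition is flagged or dropped before a human sees it.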

The third is the checkpoint router. Legal work is not uniformly risky, so the harness does not apply uniform review. It routes by task: a first-draft research memo for internal use carries different review than a brief headed for filing. The clearest articulation of this pattern comes from the field’s most rigorous benchmark effort, which frames deployment as a question of whether an agent can do all, some, or none of a given task and assigns the human review tier accordingly. The router is where the ABA’s task-specific verification standard becomes code.
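
A rough sketch of what routing by task could look like follows. The tier names, the `Task` fields, and the specific rules are assumptions made for illustration; a real firm would set its own tiers and thresholds.

```typescript
// Hypothetical tiers and fields; a firm would define its own.
type Capability = "all" | "some" | "none"; // how much of the task the agent can do
type Destination = "internal" | "client" | "court_filing";
type ReviewTier = "spot_check" | "attorney_review" | "attorney_rewrite";

type Task = { kind: string; capability: Capability; destination: Destination };

function routeReview(task: Task): ReviewTier {
  // Work the agent cannot do is redone by a human, whatever the destination.
  if (task.capability === "none") return "attorney_rewrite";
  // Filings carry the duty of candor, so they always get full review.
  if (task.destination === "court_filing") return "attorney_review";
  // Partial capability or client-facing output also gets full review.
  if (task.capability === "some" || task.destination === "client") return "attorney_review";
  // Fully capable, internal-only work gets the lightest tier.
  return "spot_check";
}
```

Keeping the rules in one pure function means the review policy can be read, versioned, and audited separately from the model and the prompts.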

Around those three sits the audit layer, which records provenance for every output: what was retrieved, what the model generated, what the gate verified, who reviewed it. In a vertical where work product can be subpoenaed, the audit trail is not telemetry. It is evidence.

The production landscape

The market has already priced this thesis. The two highest-valued legal agents are explicit that the moat is the harness.

The research-and-drafting platform most associated with large law firms reached an eleven-billion-dollar valuation on the strength of an architecture it benchmarks obsessively. Its team built and published its own evaluation suite, and in May 2026 released an open-source legal agent benchmark containing more than twelve hundred tasks across twenty-four practice areas, graded against more than seventy-five thousand expert-written rubric criteria, with backing from every major frontier lab. The benchmark is structured to mirror how work is assigned and reviewed at a firm: an instruction, a client matter with real materials, and a work product that a human must sign off on. On the company’s own internal suite, vendor-published results put the strongest frontier model above ninety percent (these are the vendor’s own benchmark and methodology, not an independent measurement). The instructive part is not the score. It is that a company at this valuation spends its research budget building the measurement layer, because in legal the harness improves only as fast as the firm can measure where it fails. The same team’s research-specific benchmark goes further still: built with a data-labeling partner, it requires a model to use search tools, locate relevant context, and return cited responses end to end, which is the verification gate expressed as a test rather than left to run silently at inference time. The company has said it is expanding that public benchmark more than fivefold across global law, practice areas, and legal research, a sustained investment in measurement that only makes sense if the harness, not the model, is the thing being engineered.

The drafting side tells the same story from a different vertical slice. The category leader in personal injury raised a hundred and fifty million dollars in October 2025 at a valuation above two billion, bringing total funding to three hundred and eighty-five million. Its platform runs a proprietary model trained on hundreds of thousands of injury cases and millions of medical records, drafting demand letters and case documentation that human attorneys review. The company reports its case volume roughly doubling to ten thousand cases per week in six months (a vendor-reported operating figure), in a personal injury market it sizes at sixty-one billion dollars. The lead investor was a firm that had already joined the company’s prior rounds, and the round included the venture arm of the company that owns one of the legal research publishers the Stanford study measured, a strategic alignment worth noting when reading any single vendor’s reliability claims. The depth of the segment is visible in a second company raising a hundred and three million dollars for plaintiff-side work the same week.

Map these to the architecture and the pattern is clean. The research platform’s benchmark obsession is the verification gate and the checkpoint router, instrumented. The drafting platform’s proprietary model trained on case-specific data is the grounding layer, specialized. Neither company’s pitch is that its model is smarter than a frontier model. The pitch is that its harness turns a frontier model into something a firm will deploy.

Where the harness saturates

The strongest argument against this thesis is that the model is catching up. A 2025 randomized controlled trial found that modern AI tools measurably improved lawyers’ work relative to working without them, and vendor benchmarks now show frontier models clearing ninety percent on firm-grade tasks. If the model reaches the point where it almost never fabricates, does the verification gate become dead weight.

It does not, for a reason specific to the vertical. In a domain where a single fabricated citation is sanctionable, the cost function is not the average error rate. It is the tail. A model that is right ninety-nine percent of the time still produces a fabricated authority once every hundred filings, and one fabricated authority in a filed brief is a Rule 11 problem no matter how good the other ninety-nine were. The verification gate is not insurance against a bad model. It is the mechanism that converts a probabilistic system into one whose output a human can attest to under a duty of candor. That requirement does not relax as the model improves; it is structural.
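
The tail argument is easy to check with arithmetic. The sketch below assumes the error rate is read per citation rather than per filing and that citations fail independently; both are simplifying assumptions, and the inputs are illustrative, not measured.

```typescript
// Probability that at least one of n independent citations is fabricated,
// given a per-citation fabrication rate p: 1 - (1 - p)^n.
function probAtLeastOneFabrication(perCitationRate: number, citations: number): number {
  if (perCitationRate < 0 || perCitationRate > 1) {
    throw new RangeError("rate must be in [0, 1]");
  }
  if (!Number.isInteger(citations) || citations < 0) {
    throw new RangeError("citations must be a non-negative integer");
  }
  return 1 - Math.pow(1 - perCitationRate, citations);
}

// Illustrative inputs only: a one percent per-citation rate at several brief sizes.
for (const n of [1, 10, 30, 100]) {
  console.log(`${n} citations: ${probAtLeastOneFabrication(0.01, n).toFixed(3)}`);
}
```

Under those assumptions, a brief with thirty citations has roughly a one-in-four chance of containing at least one fabricated authority, which is why the average error rate understates the exposure.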

There is a real saturation risk, but it runs the other way. Pile on enough gates, retrieval constraints, and mandatory human checkpoints and the system stops being an agent at all. It becomes a deterministic retrieval-and-citation-check pipeline with a model bolted on for phrasing, the point of Harness Saturation. For low-risk, high-volume drafting that may be exactly right. For genuinely novel legal reasoning it is a ceiling. The design question for any legal agent is not how many gates to add. It is which tasks tolerate near-total gating and which need the model’s judgment to survive contact with the harness. The benchmark that grades tasks as all, some, or none is, read correctly, a map of where on that spectrum each workflow sits.

There is a second-order trap the Stanford measurement exposed. The tool with the higher error rate also produced markedly longer answers than the more reliable one, and more words mean more falsifiable propositions and more surface area for a claim to be wrong. A harness tuned to produce thorough, expansive output inflates its own verification burden. Concise grounded answers are not only easier to read; they are cheaper to verify, which in this vertical is the same as saying cheaper to trust.

FAQ

Do AI legal research tools still hallucinate?

Yes. The leading independent study found the major retrieval-augmented legal research tools hallucinate between seventeen and thirty-three percent of the time, well below the raw model baseline but far above any rate acceptable for unverified use. Retrieval reduces the problem; it does not remove it.

What is the review pattern in legal AI?

It is the deployment model where an agent produces a work product and a human reviews it before use, with the depth of review set by task risk. The most developed benchmark formalizes this by grading whether an agent can do all, some, or none of a task, which tells a firm where to set the checkpoint.

Does a better model remove the need for verification?

No. Because a single fabricated citation in a filing is sanctionable under Rule 11 and the duty of candor, the cost is driven by the worst output, not the average. Verification is what lets a human attest to the output, and that obligation is structural, not a function of model quality.

What does the ABA require for AI use in legal work?

Formal Opinion 512 maps generative AI onto existing duties of competence, confidentiality, candor, and supervision, and requires verification calibrated to the tool and the task. It is advisory, but functions as a national baseline because most states share the Model Rules structure.

What to do Monday

Take the last twenty outputs your legal agent produced. For each one, try to trace every legal proposition, every case, every rule, every quoted passage, back to a source the system actually retrieved. Count the propositions you cannot trace. That fraction is your hallucination exposure, and it is a more honest deployment signal than any benchmark score, because it measures the layer that determines whether a human can sign the work. If the number is not near zero, the gap is not in the model. It is in the gate.
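
The count is simple enough to script once the tracing has been done by hand. In this sketch the record shapes and the `hallucinationExposure` name are hypothetical; a `sourceId` of `null` marks a proposition the reviewer could not trace to anything the system retrieved.

```typescript
// Hypothetical record shapes for a manual tracing exercise.
type TracedProposition = { text: string; sourceId: string | null }; // null = untraceable

type OutputAudit = { outputId: string; propositions: TracedProposition[] };

function hallucinationExposure(audits: OutputAudit[]): {
  total: number;
  untraced: number;
  fraction: number;
} {
  let total = 0;
  let untraced = 0;
  for (const audit of audits) {
    for (const proposition of audit.propositions) {
      total += 1;
      if (proposition.sourceId === null) untraced += 1;
    }
  }
  // With no propositions at all there is nothing to measure; report zero.
  return { total, untraced, fraction: total === 0 ? 0 : untraced / total };
}
```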

Agent Commerce Is in Production. Here’s the Stack, the Code, and the Three Things Already Breaking.

The AI Runtime — Thu, 21 May 2026 11:03:42 GMT

TL;DR - The agent commerce stack settled into four layers in the last quarter, and senior engineers building agentic applications need to design against it now - not because every product needs payments today, but because the architectural commitments around authorization, observability, and policy enforcement that won’t backport later are being made this quarter. MPP launched March 18, 2026 with Browserbase, Parallel Web Systems, fal.ai, and PostalForm processing live traffic. x402 has processed over 100 million payment flows since Coinbase shipped it. Three production failure modes have already surfaced — a critical x402 SDK signature bypass, a settlement-timing gap where agents pay but receive nothing, and a missing authorization layer MPP explicitly does not solve. Build allowlists, budget caps, and a signed authorization chain before integration, pick the protocol layer-by-layer rather than as a single bet, and treat the payment surface as a policy domain enforced at the infrastructure layer - not a prompt instruction the model can ignore. The protocols are open; the discipline is the bottleneck.

The shape of the domain: four layers, one transaction

Agent commerce in mid-2026 is a four-layer composition, not a single protocol. A single paid request from a senior engineer’s agent touches all four layers, even when the implementation lets you ignore most of them. The layers compose vertically, and the protocols within each layer are designed to be swappable.

Diagram 1 — the four-layer agent commerce stack. No single protocol covers the full transaction; production agent integrations touch every layer.

Authorization is the layer that proves the agent is acting on a user’s instructions rather than hallucinating. AP2 occupies this slot: tamper-evident Intent, Cart, and Payment mandates signed by verifiable credentials, backed by Google with sixty-plus partners. Agent identity attestation - proof of which agent is acting, not just which user authorized it - sits adjacent and is currently handled by third-party protocols like Skyfire’s Know Your Agent. The two together form the audit-grade authorization chain that regulators are starting to ask for.

Discovery is where the agent finds out what to buy and what it costs. MCP servers expose tool catalogs, ACP defines the four RESTful endpoints that model the checkout lifecycle for shopping agents, and ad networks like ZeroClick attach paid context to agent responses in the opposite economic direction (services earning from agent traffic, not agents paying for services). All three live at the discovery layer and compete or compose depending on the use case.

Settlement is the HTTP handshake that exchanges value. MPP and x402 both revive the HTTP 402 status code, both are backwards-compatible at the charge level, and they differ mainly in opinionation. MPP bakes idempotency, expiration, request-body binding via SHA-256 digest, HMAC-bound replay protection, structured RFC 9457 errors, and first-class receipts into the protocol spec itself, so every implementation inherits them. x402 leaves these to facilitators, which is why production teams keep rediscovering the same edge cases in their own implementations.

Rails is where money actually moves. Tempo settles MPP sessions with 0.5-second finality; USDC on Base settles x402 charges; Stripe Shared Payment Tokens settle fiat through the same PaymentIntents API; Lightning settles Bitcoin via Lightspark. The settlement layer is method-agnostic by design, and the layer above it should be too - your code should not know which rail the caller used.

Audit and policy span all four layers. Senior engineers underweight this layer because no protocol owns it. AWS’s AgentCore exposes vended logs and vended spans for every data-plane payments API call - the right pattern. Most production deployments don’t have an equivalent yet, which means audit trails are reconstructed from log scrape after the fact. That’s forensics, not compliance.

The architecturally important fact is that no single protocol covers the full transaction. A production agent that shops for users needs ACP’s checkout flow, AP2-style authorization, and either x402 or MPP for settlement - at least three protocol integrations, multiple wallet infrastructures, and multiple compliance surfaces. The clean separation is a feature of the protocol design and an operational burden for anyone shipping against it.

What’s actually live in production

The MPP services directory now lists over fifty integrated services, and Coinbase’s x402 Bazaar exposes over ten thousand x402 endpoints through MCP. The launch roster matters because it’s the first time large API providers have priced themselves directly for agent consumption.

Stripe’s own launch post names Browserbase (per-session headless browsers), PostalForm (physical mail printing), and Prospect Butcher Co. (NYC sandwich delivery) - vendor-published case studies, not independent ones. fal.ai prices image generation per request. Alchemy runs an agentic gateway where an agent authenticates with its on-chain wallet, pays USDC on Base, and accesses RPC across a hundred-plus chains without an API key.

The most architecturally instructive production deployment is Parallel Web Systems’ parallelmpp.dev — and unlike the Stripe roster, Parallel’s writeup is an independent engineering blog with code. The gateway exposes three paid endpoints (POST /api/search at $0.01, POST /api/extract at $0.01 per URL, POST /api/task at $0.30 ultra or $0.10 pro) plus free routes for discovery, task polling, and wallet balance lookups. Two payment rails — Tempo via the mppx CLI, x402 on Base via Stripe’s purl — route through a single middleware instance. The route handler doesn’t know or care which rail the caller used; it sees a 200, a Payment-Receipt header, and a parsed body, and proceeds as if it were any other authenticated request. That separation is the most important design choice in the writeup, and it’s the one most teams won’t get right on the first try.

Parallel’s other load-bearing decision is stateless 402 challenges. The challenge has an ID field that is an HMAC-SHA256 of the challenge parameters - realm, method, intent, request body, and expiry. When the client retries with a credential referencing that ID, the gateway recomputes the HMAC against the parameters in the credential and checks the IDs match. The issued challenge is never written anywhere. The gateway can horizontally scale behind any load balancer, restart cleanly, and survive a database outage without dropping in-flight requests. There’s no challenge replay window to manage and no TTL to tune — the expiry travels inside the signed parameters, and if a client tries to redeem a credential past it, the math fails and the request 402s again. The whole challenge layer is a pure function. That’s the kind of design choice that makes a system survive contact with production scale.
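
The stateless pattern described above can be sketched with Node’s built-in `crypto` module. This is not Parallel’s code: the `ChallengeParams` shape, the JSON canonicalization, and the function names are assumptions chosen for illustration.

```typescript
import { createHash, createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical parameter set, modeled on the fields the text lists.
type ChallengeParams = {
  realm: string;
  method: string;
  intent: string;
  bodyDigest: string; // SHA-256 of the request body, hex-encoded
  expiresAt: number;  // Unix epoch milliseconds
};

function digestBody(body: string): string {
  return createHash("sha256").update(body).digest("hex");
}

// The challenge ID is an HMAC over a fixed-order serialization of the
// parameters, so nothing has to be stored when the challenge is issued.
function challengeId(params: ChallengeParams, secret: string): string {
  const canonical = JSON.stringify([
    params.realm,
    params.method,
    params.intent,
    params.bodyDigest,
    params.expiresAt,
  ]);
  return createHmac("sha256", secret).update(canonical).digest("hex");
}

// Recompute the HMAC from the parameters echoed back in the credential.
// A changed body, a changed expiry, or a stale credential all fail here.
function verifyChallenge(
  params: ChallengeParams,
  presentedId: string,
  secret: string,
  now: number,
): boolean {
  if (now >= params.expiresAt) return false;
  const expected = Buffer.from(challengeId(params, secret), "hex");
  const presented = Buffer.from(presentedId, "hex");
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return expected.length === presented.length && timingSafeEqual(expected, presented);
}
```

Because the expiry is one of the signed inputs, a client cannot extend it without invalidating the ID, and any gateway instance holding the secret can verify a challenge it never issued.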

On the enterprise side, Amazon Bedrock AgentCore Payments entered preview May 7, 2026 with Coinbase CDP and Stripe Privy as the connected wallet providers. Three things matter about it. First, the wallet doesn’t hold private keys the agent can see — keys live in the wallet provider and the agent only gets signing through a managed interface. Second, spending limits are enforced deterministically at the infrastructure layer rather than as a soft instruction the agent’s prompt can override. Third, the same observability surface AgentCore uses for logs, metrics, and traces now covers payments — end-to-end observability through CloudWatch with vended logs and X-Ray traces for every data-plane API call. The “agent that spends money” went from custom-build to managed-service line item in seven weeks.

What an MPP integration actually looks like

The Substack version of the production reality lives in fifteen lines of Node. The mppx server SDK wraps the entire 402 challenge/credential flow into framework middleware:

import { Mppx, tempo } from 'mppx/server'

const mppx = Mppx.create({
  methods: [
    tempo({
      currency: '0x20c0000000000000000000000000000000000000', // pathUSD
      recipient: '0x742d35Cc6634c0532925a3b844bC9e7595F8fE00',
    }),
  ],
})

export async function handler(request: Request) {
  const response = await mppx.charge({ amount: '1' })(request)
  if (response.status === 402) return response.challenge
  return response.withReceipt(Response.json({ data: '...' }))
}

The middleware handles the 402 issuance and credential verification; the route handler reduces to “return the data.” On the client side, mppx.fetch is a drop-in for fetch — when the server returns 402, the client reads the payment requirements, signs a credential with the configured wallet, and retries the request automatically.

That brevity is the whole point. It’s also the trap. The fifteen lines work because every protocol-level concern — idempotency, replay protection, request-body binding, receipts — is hidden inside the SDK. When a production failure mode surfaces inside that SDK (and one already has), you don’t see it until your monetization bypass shows up in logs.

Operational lifecycle: where each documented failure mode hits

The mental model above is the architecture. The diagram below is what runs on every paid request and where the three documented failure modes attach. This diagram is the war-room reference; the prose underneath maps it to the actual incidents.

Diagram 2 - the payment lifecycle and the three documented production failure modes. Steps 4 and 5 are the soft underbelly; the authorization gap is cross-cutting.

Failure mode 1 — Signature verification can fail at the SDK layer even when the protocol is sound

GHSA-qr2g-p6q7-w82m, disclosed March 7, 2026, was a critical signature-verification bypass in the Coinbase x402 SDK affecting Solana payments. The protocol uses Ed25519 signatures for Solana settlements rather than ECDSA, and the facilitator component — which intercepts payment claims, verifies on-chain settlement, and issues cryptographic proofs to the resource server - was incorrectly accepting malformed or replayed signatures as valid. An attacker could craft a follow-up request with a spoofed PAYMENT-SIGNATURE header, the facilitator would validate it, the SDK would generate an x402 token, and the resource server would deliver the premium response without funds ever moving on-chain.

The fix shipped in npm 2.6.0, Python 2.3.0, and Go 2.5.0. The lesson is structural: a cryptographically sound protocol design can harbor implementation-level vulnerabilities in its SDK, and x402 is still rapidly evolving — production deployments must maintain rigorous SDK version management and security advisory monitoring. The same analysis notes the V2 release in December 2025 introduced new attack surfaces — dynamic payTo means recipient manipulation, sessions mean session hijacking, plugins mean supply chain attacks. The fix isn’t to avoid V2; it’s to match V2’s flexibility with equally granular security policies.

Failure mode 2 — Settlement timing creates a paid-but-not-delivered failure mode

The second failure mode is documented as Issue #1062 in the x402 repository and affects every agent running on Base through the Coinbase-hosted facilitator. The root cause is a timing mismatch in the settlement layer — the facilitator assumes blockchain settlement completes faster than it actually does under load, the off-chain verification step succeeds, but the on-chain transaction times out before the resource server returns. The wallet is debited, the service is not delivered, and the protocol does not specify a recovery path.

The same independent analysis flags a deeper structural issue. The gap between off-chain verification and on-chain settlement enables scenarios where payment processes but service is not delivered, and this remains unresolved in x402 v2 released December 11, 2025. An academic paper from March 2026 - A402: Atomic Payments for the x402 Protocol - proposes a TEE-plus-adaptor-signature solution to close the atomicity gap, but it isn’t in either protocol yet. MPP partially avoids this specific failure mode by baking idempotency, expiration, and request-body binding into the protocol spec itself, which is the strongest engineering argument for MPP regardless of which settlement rail you ultimately use.

Failure mode 3 — MPP solves payment execution; it does not solve authorization

The third failure mode isn’t a bug - it’s an architectural gap protocol specs explicitly punt to a layer above them. MPP gives agents a clean payment lifecycle. It does not give the merchant cryptographic proof of who authorized the payment, under what policy, with what constraints. At one agent making one payment, this is manageable. At a hundred agents each making fifty payments an hour, you have five thousand payment decisions per hour that each need an audit trail tying back to a user mandate. Without a structured authorization layer, you reconstruct decision chains from logs scattered across systems after the fact.

AP2 was designed for this slot. The protocol chains three cryptographically signed mandates - Intent (user delegates authority), Cart (user approves a specific cart at a specific price), and Payment (the network sees a derived credential) - and the chain provides the non-repudiable audit trail. But AP2 has its own gaps production teams should know about. AP2 binds a mandate to a user’s identity through their signing key, not to an agent’s identity. A compromised agent can still produce a mandate-signing prompt that fools the user, and the user’s signature on the resulting cart is valid even though the agent acted maliciously. Agent identity attestation has to come from a separate protocol (Skyfire’s KYA is one approach) before the mandate chain holds up. And cryptographic mandates are non-repudiable by design, which is the security feature, but there is no in-protocol mechanism for the user to revoke an Intent Mandate before its TTL expires; revocation depends on the credential provider or wallet enforcing it outside AP2.
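
To make the chain and its two gaps concrete, here is a sketch of the checks a merchant or gateway might run before accepting a payment. The shapes below are hypothetical and are not the AP2 wire format; signature verification is assumed to have happened upstream, and both the `revoked` flag and the `AgentAttestation` record stand in for state that, per the text, has to come from outside AP2.

```typescript
// Hypothetical shapes; these are NOT the AP2 wire format.
type IntentMandate = {
  id: string;
  userId: string;
  maxAmountCents: number;
  merchantAllowlist: string[];
  expiresAt: number; // Unix epoch milliseconds
  revoked: boolean;  // assumed to be tracked by the wallet or credential provider
};
type CartMandate = { id: string; intentId: string; merchant: string; totalCents: number };
type PaymentMandate = { id: string; cartId: string; amountCents: number };
// Assumed to come from a separate agent-identity protocol.
type AgentAttestation = { agentId: string; verified: boolean };

// Returns every violation found; an empty array means the chain holds.
function checkMandateChain(
  intent: IntentMandate,
  cart: CartMandate,
  payment: PaymentMandate,
  agent: AgentAttestation,
  now: number,
): string[] {
  const violations: string[] = [];
  if (!agent.verified) violations.push("agent identity not attested");
  if (intent.revoked) violations.push("intent mandate revoked");
  if (now >= intent.expiresAt) violations.push("intent mandate expired");
  if (cart.intentId !== intent.id) violations.push("cart not linked to intent");
  if (payment.cartId !== cart.id) violations.push("payment not linked to cart");
  if (!intent.merchantAllowlist.includes(cart.merchant)) violations.push("merchant outside intent");
  if (cart.totalCents > intent.maxAmountCents) violations.push("cart exceeds intent limit");
  if (payment.amountCents !== cart.totalCents) violations.push("payment does not match cart total");
  return violations;
}
```

Returning the full list of violations rather than the first one gives the audit trail a complete record of why a payment was refused.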

Protocol selection: a decision matrix

The “which protocol” question has a layer-by-layer answer, not a single-bet answer. The table below maps the common workload shapes a senior engineer will encounter to the protocol stack that actually fits.

| Workload | Authorization | Discovery | Settlement | Rails |
| --- | --- | --- | --- | --- |
| Pay-per-call API monetization (simple) | None required | MCP server discovery | x402 charge | USDC on Base |
| Pay-per-call API monetization (enterprise) | AP2 Intent mandate | MCP server discovery | MPP charge | Tempo or SPT (fiat) |
| Streaming / per-token billing | AP2 Intent mandate | MCP server | MPP session | Tempo |
| Multi-hour agent task with mixed services | AP2 Intent mandate | MCP + ACP | MPP session + x402 charge | Tempo + Base |
| Agent-led e-commerce checkout | AP2 Intent + Cart mandate | ACP | SPT via MPP | Stripe rails (fiat) |
| Free tier funded by attention monetization | None | Ad network (e.g., ZeroClick) | None | Advertiser CPC |

A few things to read off this table. First, the authorization column is mostly “AP2 Intent mandate” - that’s where production deployments are converging. Second, the settlement column splits cleanly between charge and session intents based on whether the unit of work is discrete or streaming. Third, the rails column rarely needs to be a single bet; MPP is method-agnostic at the protocol level, so the same endpoint can accept Tempo, SPT, or Lightning without forking the route handler. Fourth, the bottom row (ad-supported monetization) is a different economic flow entirely — not “agent pays service” but “service earns from agent traffic via advertisers” — and senior engineers building free-tier consumer-facing agent products will need to design for it explicitly.

ZeroClick is the relevant example on the bottom row. The platform launched in August 2025 with $55 million from the investor group that backed Honey’s $4 billion PayPal exit and runs a CPC ad marketplace where matched advertiser context is surfaced into AI responses. It does not run on MPP or x402, and confusing the ad layer with the payment layer is a common architectural mistake. They are different layers of the same emerging stack — both serve agent commerce, both sit above settlement, both are unstandardized in ways the payment protocols no longer are. Mature AI products will run both: ad-supported free tier funded by the discovery-layer ad network, paid premium tier settled through MPP or x402.

The architectural idea: session intents

If a senior engineer building agent infrastructure remembers one architectural decision from this domain, it’s the session intent. Charge intents are one-shot — one request, one payment, one response, equivalent to x402’s exact flow and backwards-compatible with existing 402 implementations. They work for “fetch this report” or “send this email” — anywhere the unit of work matches the unit of payment.

Session intents are different. The agent deposits funds into an escrow contract once, then makes thousands of subsequent micropayment requests using signed vouchers, without hitting the blockchain on every call. The server validates each voucher locally against the escrow without going back on-chain. The economics flip from per-call chain fees to per-session amortized cost, and the protocol enables payments as small as $0.0001 per request with sub-100ms latency. When the session closes, all micro-interactions batch-settle into a single on-chain transaction with unused funds refunded.
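
One way to picture local voucher validation is the cumulative-voucher scheme common to payment channels, sketched below. This is an assumption about how such a session could work, not MPP’s wire format: the `Voucher` shape, the micro-dollar units, and the injected `signatureOk` check are all hypothetical.

```typescript
// Hypothetical voucher: authorizes a running total, not a single payment.
type Voucher = { sessionId: string; cumulativeMicros: number; signature: string };

// Signature verification is assumed to be supplied by the wallet or SDK.
type SignatureCheck = (voucher: Voucher) => boolean;

class SessionLedger {
  private acceptedMicros = 0;
  private readonly sessionId: string;
  private readonly depositMicros: number;
  private readonly signatureOk: SignatureCheck;

  constructor(sessionId: string, depositMicros: number, signatureOk: SignatureCheck) {
    this.sessionId = sessionId;
    this.depositMicros = depositMicros;
    this.signatureOk = signatureOk;
  }

  // Validate one request's voucher without touching the chain.
  accept(voucher: Voucher, priceMicros: number): boolean {
    if (voucher.sessionId !== this.sessionId) return false;
    if (!this.signatureOk(voucher)) return false;
    // Never accept more than the escrowed deposit covers.
    if (voucher.cumulativeMicros > this.depositMicros) return false;
    // The running total must grow by at least the price of this call,
    // which also rejects a replayed voucher.
    if (voucher.cumulativeMicros - this.acceptedMicros < priceMicros) return false;
    this.acceptedMicros = voucher.cumulativeMicros;
    return true;
  }

  // At close, one settlement covers every call; the remainder is refunded.
  close(): { settleMicros: number; refundMicros: number } {
    return {
      settleMicros: this.acceptedMicros,
      refundMicros: this.depositMicros - this.acceptedMicros,
    };
  }
}
```

With amounts in micro-dollars, a $0.0001 request is 100 units, so integer arithmetic avoids floating-point drift across thousands of calls.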

This matters because LLM agent workloads have a usage shape no prior payment rail addressed. A multi-hour agent run consumes API calls across half a dozen services, each priced per-token. Settling each call as a separate charge multiplies signature overhead. Settling at task completion forces the service to extend credit. Streaming MPP runs a continuous debit against a prepaid balance with finality checkpoints so neither side carries open exposure for long. At Sessions 2026, Stripe added streaming payments as a first-class MPP primitive — the wire-level mechanism for per-token billing, settled on Tempo with sub-second finality.

For any service whose pricing model is “per token consumed,” “per second of compute,” or “per row of data returned,” the session primitive is the only economically sane settlement layer in production today. For any service whose unit of work is discrete and atomic, charge intents are fine and x402 is probably the more permissionless choice.

Production readiness checklist

A senior engineer about to ship an agent that spends money should be able to check off each of the following before deploying. None of these are theoretical; each maps to a documented production failure mode or an architectural lesson from a deployed system.

Spending controls enforced below the agent, not inside it. AgentCore’s pattern of session-level spending limits enforced deterministically at the infrastructure layer is the correct architecture. Whether you build this yourself or adopt AgentCore, the agent must not see private keys, must not be able to lift its own limits, and the limits must expire on a clock.
Chain allowlist and per-endpoint amount caps in the agent’s payment middleware. Standardized identifiers are great until an attacker exploits the standardization — a malicious 402 response can redirect your agent from Base to Ethereum mainnet at 100x the gas cost. Whitelist the chains your agent is configured to operate on, validate per-endpoint, flag any chain identifier the agent hasn’t seen in that context.
Session scoping. An agent doing data lookups should not also be able to book hotels. Per-session, per-domain, per-task scoping limits the blast radius of any single compromised session.
Stateless 402 challenges where possible. Parallel’s HMAC-of-parameters challenge ID is the production pattern. The gateway can horizontally scale, restart cleanly, and survive a database outage without dropping in-flight requests. If you’re issuing stateful challenges, you’re carrying operational complexity that doesn’t have to exist.
Two rails, one route handler. Parallel’s gateway runs Tempo and x402 through the same middleware; the route handler doesn’t know which rail the caller used. The abstraction boundary is at the middleware, not the route. You can add or retire a rail without touching the routes. Most teams build this in the wrong place on the first try.
Full payment-lifecycle observability tied back to authorization. Logs of “agent X paid $0.12 to service Y at time T” are receipts. What you need is an audit trail tying that payment back to the user mandate that authorized it, the policy that bounded it, and the alternatives the agent evaluated. Receipt and audit trail are different artifacts.
SDK version pinning tied to security advisory review. The GHSA bypass will not be the last. Treat the x402 GitHub Security Advisories feed and the MPP IETF draft updates as inputs to your dependency review process, not as side channels. Pin SDK versions; tie upgrades to a formal advisory review.
Discovery endpoint that documents itself. Parallel’s GET /api endpoint returns a JSON document with every endpoint, its price, the request body schema, and ready-to-paste mppx commands. Pricing constants live in a single config module that feeds the middleware, the route handlers, and the discovery JSON. There is no version of the truth that disagrees with another version of the truth. This is how an agent-native API documents itself.
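
The first two checklist items can be sketched as a guard that sits in the payment middleware, below the agent. Everything here is hypothetical: the `SpendGuard` class, the policy fields, and the CAIP-2 style chain identifiers used in the usage note are illustrative and not taken from AgentCore, MPP, or x402.

```typescript
// Hypothetical request and policy shapes for an in-middleware guard.
type PaymentRequest = { chainId: string; endpoint: string; amountCents: number };

type SpendPolicy = {
  allowedChains: Set<string>;
  endpointCapsCents: Map<string, number>; // endpoints without a cap are denied
  sessionBudgetCents: number;
  expiresAt: number; // Unix epoch milliseconds; limits expire on a clock
};

type Decision = { allowed: true } | { allowed: false; reason: string };

class SpendGuard {
  private spentCents = 0;
  private readonly policy: SpendPolicy;

  constructor(policy: SpendPolicy) {
    this.policy = policy;
  }

  // Deterministic checks; the agent has no method to raise its own limits.
  authorize(request: PaymentRequest, now: number): Decision {
    if (now >= this.policy.expiresAt) return { allowed: false, reason: "policy expired" };
    if (!this.policy.allowedChains.has(request.chainId)) {
      return { allowed: false, reason: "chain not on allowlist" };
    }
    const cap = this.policy.endpointCapsCents.get(request.endpoint);
    if (cap === undefined) return { allowed: false, reason: "endpoint has no cap" };
    if (request.amountCents > cap) return { allowed: false, reason: "amount exceeds endpoint cap" };
    if (this.spentCents + request.amountCents > this.policy.sessionBudgetCents) {
      return { allowed: false, reason: "session budget exhausted" };
    }
    this.spentCents += request.amountCents;
    return { allowed: true };
  }
}
```

A guard configured with only `eip155:8453` on its allowlist would refuse a 402 response that tries to move the payment to `eip155:1`, regardless of what the agent’s prompt says. Denying endpoints that have no configured cap keeps the default closed.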

The architectural decisions are now, and the protocols won’t wait

The protocols are stabilizing faster than most teams expect. MPP went from launch to AWS-managed primitive in seven weeks. The x402 Bazaar lists ten thousand endpoints. AP2 has sixty-plus partners. The four-layer stack — authorization, discovery, settlement, rails — has settled into something stable enough to design against, even though specific protocol choices within each layer will keep shifting through 2026.

What hasn’t stabilized is the operational discipline. Most teams shipping agent-payment integrations today are doing it the way teams shipped database access in 2008 — get it working, then add controls later. That worked for databases because the failure mode was a slow query. The failure mode for an under-controlled agent payment system is your agent draining its session limit to an attacker who manipulated the recipient address, or paying for a resource that never delivered, or making a payment your compliance team can’t trace back to an authorization. These failure modes are documented in production. They have CVE numbers and GitHub issues.

The architects who win this transition are the ones treating the agent-payment surface the way mature finance teams treat payments: as a regulated domain with deterministic controls, audited authorization chains, and incident response built in from day one. The protocols are open and the SDKs are free. The discipline is the bottleneck.

Two of the most expensive mistakes a senior engineer can make in the next six months are betting on a single protocol and treating payments as plumbing rather than policy. The four-layer stack composes; pick the layer-appropriate primitive, build the abstraction boundary so you can swap settlements, and ship the controls before you ship the integration.

The Anatomy of a Production Vertical Agent

The AI Runtime — Tue, 19 May 2026 11:03:48 GMT

TL;DR - Production AI agents in regulated industries (clinical documentation at Abridge, prior authorization at Anterior, patient engagement at Hippocratic, customer experience at Sierra, mortgage origination at Rocket and Tavant) have converged on a seven-component architecture. The LLM, which sits inside a constellation of specialist models with supervisors, is the smallest of those seven. The other six do the load-bearing work: a router that orchestrates calls, a deterministic policy layer that retains decision authority, a domain schema adapter into the system of record, a long-horizon state store, a human checkpoint router, and a regulator-replay audit trail. No two vendors call them the same thing. They are the same components. Call the pattern Vertical Agent Anatomy (VAA). If your design is missing any of these seven, you are building a demo, not a production vertical agent.

A production vertical agent is an LLM-driven system that operates safely inside a regulated industry’s compliance, schema, and system-of-record constraints. In practice, this requires seven specific architectural components: the LLM itself, plus six layers of deterministic scaffolding that prevent it from speaking, deciding, or acting outside those constraints. The MongoDB engineering team has argued the LLM is the smallest part of any production agent system. Regulated verticals make the imbalance more extreme — the harness becomes nearly the entire system.

Production vertical agents have converged on the same seven components

Read enough architecture posts and the same pattern emerges. Sierra calls its multi-model layer a “constellation of models” and the policy-enforcement layer “supervisor agents.” Hippocratic AI calls its constellation “Polaris” — roughly twenty-two supervising LLMs around a ~400B-parameter primary, aggregate ~4.1 trillion parameters. Decagon calls its routable workflow definitions “Agent Operating Procedures” and its quality-review layer “Watchtower.” Abridge calls its evidence-linked audit layer “Linked Evidence.” Norm Ai calls its regulation-to-decision-tree compiler “Leap.” Tavant calls its mortgage agent framework “MAYA” and positions the underlying identity model with the line: “These [AI] agents need to be provisioned like people.” Workday calls its agent registry and lifecycle layer the “Agent System of Record.”

Different names. The same seven components.

The closest academic anchor is the 2025 arXiv paper that proposed a standardization of Vertical AI agent design patterns and named the central component a “Cognitive Skills Module” — what we’ll call the specialist model constellation. The paper formalized the cognitive layer but did not name the surrounding six. The MongoDB harness writeup formalized the surrounding scaffold for general-purpose agents but did not specialize it to regulated verticals. VAA is the regulated-vertical specialization: the same architecture, with the harness components made specific because the vertical demands they be.

1. The Router/Orchestrator decides who handles what

The router is the first thing a request hits. It decides which downstream models, which tools, which policies, and which humans get involved. Production verticals see heterogeneous request types — a KYC submission, a prior auth appeal, and a refinance inquiry all require different downstream paths. Single-model designs try to reason their way to the path. Production agents route deterministically when they can.

Sierra’s Agent OS routes among 15+ models depending on task — low-latency models for lookups, high-precision classifiers for behavior detection, tone-optimized models for sensitive interactions. Rocket Mortgage’s “Rocket AI Agent API” performs the same role across the borrower lifecycle on AWS Bedrock, with Step Functions orchestrating Claude 3 Haiku fine-tunes and other specialist models. Decagon’s AOPs are essentially programmable router definitions in natural language.

The architectural insight: the router is the cheapest place to enforce determinism. Every routing decision that doesn’t go through an LLM is one fewer failure mode in production. Teams who treat the router as an afterthought and let the LLM decide its own next step end up paying for that decision in eval cost and audit ambiguity.
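A minimal sketch of "route deterministically when you can" follows. The route table, handler names, and the optional llm_classifier callback are hypothetical and are not drawn from any vendor's implementation. The point is only the ordering: a table lookup first, a model second, a human queue last.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Route:
    name: str
    handler: str        # hypothetical downstream pipeline identifier
    needs_human: bool


# Hypothetical static routing table keyed on a structured request type.
ROUTES = {
    "kyc_submission": Route("kyc", "kyc_pipeline", needs_human=False),
    "prior_auth_appeal": Route("appeal", "clinical_review_pipeline", needs_human=True),
    "refinance_inquiry": Route("refi", "mortgage_pipeline", needs_human=False),
}


def route(request: dict,
          llm_classifier: Optional[Callable[[str], str]] = None) -> tuple[Route, str]:
    """Return (route, how_it_was_chosen): lookup first, LLM only as a fallback."""
    request_type = request.get("type")
    if request_type in ROUTES:
        return ROUTES[request_type], "deterministic"
    if llm_classifier is not None:
        guessed = llm_classifier(request.get("text", ""))
        if guessed in ROUTES:            # the model may only pick a known route
            return ROUTES[guessed], "llm_fallback"
    # Anything still unrecognized goes to a human triage queue.
    return Route("unrouted", "manual_triage", needs_human=True), "default"


chosen, how = route({"type": "kyc_submission"})
```

Recording how each route was chosen is what makes the audit question answerable later: the share of traffic resolved by "deterministic" versus "llm_fallback" is a direct measure of how much routing risk the model is carrying.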

2. The Specialist Model Constellation is where the LLM actually lives

The LLM does not sit alone. It sits inside a constellation of specialist models, each tuned to a subtask, with supervising models that check outputs before they leave the constellation.

The pattern is explicit at Sierra: agents are assembled from 15+ purpose-built models working in concert, backed by supervisors that enforce guardrails, policies, and quality checks. It is more extreme at Hippocratic, whose Polaris architecture places roughly twenty-two supervising LLMs around the primary conversational model — the design explicitly assumes the primary cannot be trusted to police itself. Anterior’s published architecture follows the same shape: specialist models for classification and synthesis, LLM-as-judge supervisors evaluating outputs in real time, with a clinical review team an order of magnitude smaller than competitor benchmarks because the supervisors absorb most of the work.

This is also where the constellation’s biggest design failure shows up: teams overspend on model selection and underspend on supervisors. Picking the right primary model matters less than the question of whether anything is checking its output before it leaves the agent.
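The "is anything checking the output" question can be made concrete with a small release gate. This is a sketch under stated assumptions: the two supervisors here are keyword and length checks standing in for real supervising models, and every name is hypothetical.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    supervisor: str
    passed: bool
    reason: str


Supervisor = Callable[[str], Verdict]


def no_dosage_advice(draft: str) -> Verdict:
    # Stand-in for a specialist supervisor model; here just a crude keyword check.
    flagged = "mg" in draft.lower()
    return Verdict("no_dosage_advice", not flagged, "mentions a dosage" if flagged else "ok")


def length_limit(draft: str) -> Verdict:
    ok = len(draft) <= 500
    return Verdict("length_limit", ok, "ok" if ok else "too long")


def release(draft: str, supervisors: list[Supervisor]) -> tuple[bool, list[Verdict]]:
    """Run every supervisor; the draft leaves the constellation only if all pass."""
    verdicts = [check(draft) for check in supervisors]
    return all(v.passed for v in verdicts), verdicts


checks = [no_dosage_advice, length_limit]
blocked, blocked_verdicts = release("Take 200 mg twice daily.", checks)
allowed, _ = release("Your appointment is confirmed for Tuesday.", checks)
```

Running every supervisor, rather than stopping at the first failure, keeps the full list of verdicts available for the audit trail.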

3. The Deterministic Policy Layer retains the decision authority

This is the single most underappreciated layer. The policy layer is the non-LLM gatekeeper that decides whether the model’s recommendation can be acted on, escalated to a human, or rejected outright. Regulators do not accept “the model said so.” Liability sits with the deterministic decision-maker, and the deterministic decision-maker is not the LLM.

The pattern is consistent across verticals. Microsoft’s Azure AI Foundry prior authorization template is explicit: the agent never produces an automated DENY, only APPROVE or PEND, and every recommendation requires clinician sign-off with documented rationale. Blend’s mortgage agent Autopilot “does not make credit decisions, which remain the responsibility of human underwriters and automated underwriting systems.” Anterior’s design principle — never let the LLM make the final authorization decision — sits at the same architectural location. Norm Ai’s Leap platform pushes this furthest, representing regulations themselves as decision trees rather than as LLM prompts, so the policy layer is the regulation itself in machine-executable form.

The common confusion is to call this “human-in-the-loop.” It is not. The policy layer runs before any human sees the recommendation; it filters which decisions even reach the human checkpoint router. Calling it human-in-the-loop is how teams end up with a system that escalates everything and overwhelms reviewers, or escalates nothing and ships an LLM decision dressed up as a deterministic one. Both outcomes are common. Both are architectural failures at this layer.
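One way to picture the "APPROVE or PEND, never DENY" rule is to make denial unrepresentable in the gate's output type. The sketch below is illustrative only; the criteria names and the decide function are hypothetical and do not reproduce the Azure template or any vendor's rules.

```python
from enum import Enum


class Outcome(Enum):
    APPROVE = "APPROVE"
    PEND = "PEND"      # deliberately no DENY member: the gate cannot emit a denial


def decide(recommendation: str, criteria_met: dict[str, bool]) -> tuple[Outcome, str]:
    """Deterministic gate: the model's recommendation is an input, never the decision."""
    unmet = sorted(name for name, met in criteria_met.items() if not met)
    if unmet:
        return Outcome.PEND, "unmet criteria: " + ", ".join(unmet)
    if recommendation != "approve":
        return Outcome.PEND, "model did not recommend approval"
    return Outcome.APPROVE, "all criteria met and model recommends approval"


# Hypothetical criteria for a single request.
outcome, rationale = decide("approve", {"conservative_therapy_tried": True, "imaging_on_file": False})
```

The rationale string travels with the outcome, so whatever human sign-off follows downstream starts from a documented reason rather than a bare label.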

4. The Domain Schema Adapter is where every vertical pays its own tax

The schema adapter is the translation layer between LLM-native representations and the vertical’s canonical schemas — and the systems of record built on them. Healthcare has FHIR R4 and HL7 v2 and SMART-on-FHIR and CDS Hooks and C-CDA and the Da Vinci PAS/CRD/DTR profiles. Mortgage has MISMO and Encompass and MSP. Insurance has ACORD. Trade has FpML and FIX. Cross-border payments now have ISO 20022. Demos work in plain English. Production does not.

Abridge’s Epic integration is the canonical example. Abridge was the first ambient-AI tool officially integrated into Epic’s EHR through the “Pal” program, with Linked Evidence mapping any word or phrase in the generated note back to source transcript or audio in real time. The integration is bidirectional and embedded inside Epic workflows from Haiku to Hyperdrive — not a wrapper around Epic, a participant in it. Rocket Mortgage’s Bedrock agents bridge directly into MSP and Encompass. AWS AgentCore Gateway exposes core banking systems as OpenAPI-schema tools so KYC agents can act against them without modeling the underlying banking schema in prompts. ICE Aurora embeds responsible agentic AI directly into Encompass and MSP rather than running as a standalone tool.

The architectural insight is that every vertical pays this tax independently. There is no FHIR-equivalent layer that crosses verticals; even within a vertical, there is no clean abstraction across systems of record. Schema work is where production cost accumulates and where horizontal AI agent platforms keep hitting the same wall. It is also where vertical agents earn their right to exist — the deep schema bridge is the moat, not the model choice.
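In miniature, the adapter's job looks like this. The canonical record and code set are invented for illustration and are far simpler than FHIR, MISMO, or ACORD; the sketch shows only the shape of the work: normalize, validate against the system of record's vocabulary, and fail loudly rather than pass free text through.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical canonical code set owned by the system of record.
ALLOWED_SERVICE_CODES = {"MRI-LUMBAR", "PT-EVAL"}


class SchemaError(ValueError):
    pass


@dataclass(frozen=True)
class CanonicalRequest:
    member_id: str
    service_code: str
    service_date: date


def adapt(extracted: dict) -> CanonicalRequest:
    """Translate loosely typed model output into the strict canonical record."""
    missing = [key for key in ("member_id", "service", "date") if not extracted.get(key)]
    if missing:
        raise SchemaError(f"missing fields: {missing}")
    code = extracted["service"].strip().upper().replace(" ", "-")
    if code not in ALLOWED_SERVICE_CODES:
        raise SchemaError(f"unknown service code: {code}")
    try:
        when = date.fromisoformat(extracted["date"])
    except ValueError as exc:
        raise SchemaError(f"bad date: {extracted['date']}") from exc
    return CanonicalRequest(extracted["member_id"].strip(), code, when)


record = adapt({"member_id": " M123 ", "service": "mri lumbar", "date": "2026-05-01"})
```

Real adapters multiply this by every resource type, every code system, and every undocumented quirk of the target system, which is where the cost described above accumulates.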

5. The Long-Horizon State Store handles the cases that span weeks

A prior auth can bounce three times. A disability claim spans 90 days. M&A diligence runs for two quarters. A mortgage application drags for 45 days. Stateless agents cannot handle any of these.

The state store is the agent’s durable memory across days, weeks, or quarters — for cases that don’t fit in a single request/response. Tennr’s RaeLM™ document-reasoning model, trained on 100M+ medical documents and 2.3B distinct data fields, acts as the persistent reasoning substrate for referral workflows that touch the same patient across multiple touchpoints. Oracle AI Agent Memory positions itself explicitly as “a persistent memory core for AI agents...enabling them to perform well at long-horizon tasks.” Sierra’s Agent Data Platform serves the same role for CX agents whose conversations span multiple sessions.

The missing primitive at this layer is reawakening on external events — not just storing state but triggering on it. A claim that bounces and resurfaces 60 days later. A loan that becomes refinanceable when rates drop. A KYC review that requires re-verification on an annual cadence. Most vertical agents are still built as request/response when the underlying workflow demands a calendar- and event-aware agent. This is one of the clearest gaps between current production deployments and what the next generation of vertical agents will require.
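A toy version of "triggering on state, not just storing it" might look like the following. It is an in-memory sketch with hypothetical names; a production store would be durable and would evaluate triggers against an event bus or scheduler rather than a Python loop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    case_id: str
    state: dict
    wake_on: Callable[[dict], bool]   # predicate over an incoming event
    status: str = "sleeping"


class StateStore:
    def __init__(self) -> None:
        self._cases: dict[str, Case] = {}

    def put(self, case: Case) -> None:
        self._cases[case.case_id] = case

    def on_event(self, event: dict) -> list[str]:
        """Wake every sleeping case whose trigger matches; return their IDs."""
        woken = []
        for case in self._cases.values():
            if case.status == "sleeping" and case.wake_on(event):
                case.status = "awake"
                case.state["last_event"] = event
                woken.append(case.case_id)
        return woken


store = StateStore()
# A loan that becomes refinanceable when rates drop below a hypothetical threshold.
store.put(Case("loan-1", {"rate": 7.1},
               wake_on=lambda e: e["type"] == "rate_change" and e["rate"] < 6.5))
# A claim that should resurface when it is resubmitted.
store.put(Case("claim-9", {},
               wake_on=lambda e: e["type"] == "claim_resubmitted" and e["claim"] == "claim-9"))

woken = store.on_event({"type": "rate_change", "rate": 6.2})
```

The stored case carries its own wake condition, so the agent does not need a live request to resume work on it.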

6. The Human Checkpoint Router is not “human-in-the-loop”

“Human-in-the-loop” as a vibe is not a checkpoint architecture. Production checkpoint routing has explicit thresholds, named reviewer pools, SLA tracking, and override-rate telemetry that feeds back into the eval system.

Anterior is the cleanest published example: confidence-tiered routing where each tier specifies which clinical reviewer types see the decision, and the override rate is treated as a continuous quality signal — initial override rates of 15–20% decay toward <5% as the system learns from each override. Hippocratic AI’s clinician validation network — more than seven thousand licensed U.S. clinicians as of the November 2025 Series C announcement — exists as a checkpoint pool that the router can call into based on specialty, jurisdiction, and conversation type. Tavant’s stated principle extends further: AI agents themselves need explicit identities, distinct authority scopes, and full auditability — provisioned like people, not like generic automation. The implication for the checkpoint router is that the human and the agent are both first-class identities with explicit authority models.

The architectural failure mode here is the “send everything below 80% confidence to a human” pattern. This is not a checkpoint router; it is a workload offload. Production deployments build a routing policy that names specific reviewers, specific SLAs, specific reasons for escalation, and tracks override rates as a continuous quality signal that loops back into the eval system. Calibration of the threshold is itself an ongoing engineering problem — not a static config value.
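The difference between a single confidence cutoff and a routing policy can be sketched in a few lines. The tiers, pool names, SLAs, and thresholds below are invented for illustration and are not Anterior's values; the override telemetry is the part that feeds back into evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    min_confidence: float
    reviewer_pool: str    # hypothetical named reviewer pool
    sla_hours: int


# Ordered from the highest confidence floor to the lowest; all values are illustrative.
TIERS = [
    Tier(0.95, "nurse_spot_check", 48),
    Tier(0.80, "nurse_review", 24),
    Tier(0.00, "physician_review", 4),
]


def assign(confidence: float) -> Tier:
    for tier in TIERS:
        if confidence >= tier.min_confidence:
            return tier
    raise ValueError("confidence must be >= 0")


class OverrideTelemetry:
    """Tracks how often reviewers overturn the agent, per reviewer pool."""

    def __init__(self) -> None:
        self._counts: dict[str, list[int]] = {}   # pool -> [overrides, total]

    def record(self, pool: str, overridden: bool) -> None:
        counts = self._counts.setdefault(pool, [0, 0])
        counts[0] += int(overridden)
        counts[1] += 1

    def override_rate(self, pool: str) -> float:
        overrides, total = self._counts.get(pool, [0, 0])
        return overrides / total if total else 0.0


telemetry = OverrideTelemetry()
for overridden in (True, False, False, False):
    telemetry.record(assign(0.85).reviewer_pool, overridden)
```

A rising override rate in one pool is a signal to recalibrate that tier's threshold, which is why the threshold cannot be a static config value.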

7. The Regulator-Replay Audit Trail is not a log file

Every regulated vertical demands that decisions be reproducible. HIPAA. The Federal Reserve’s SR 11-7. The NAIC AI Model Bulletin. FCRA adverse-action notices. NYDFS Part 500. CMS-0057-F. “We have logs” is not auditable. The audit trail in a production vertical agent is evidence-linked, decision-grained, tamper-resistant, and designed to survive a regulator or a court reconstructing why a specific decision was made on a specific day for a specific person.

Abridge’s Linked Evidence is the cleanest example in healthcare: every section of a generated note maps back to the timestamped transcript and source audio, so clinicians and auditors can reconstruct provenance at the word level. Sixfold provides full sourcing and lineage for every underwriting decision, with the explicit goal of making the decision defensible in a regulatory review. The MobiHealthNews blueprint for agentic prior auth describes the audit substrate as a “provenance graph that records every step an agent takes” — what data it looked at, which rules and policies it applied, what it decided and why.

Academic prior art is catching up. The Brown audit-trails paper defines LLM audit trails as “a chronological, tamper-evident, context-rich ledger of lifecycle events and decisions.” IBM’s “Replayable Financial Agents” preprint goes further and proposes a determinism-faithfulness assurance harness specifically for regulatory replay of tool-using LLM agents in finance.

Logs are not audit trails. Audit trails are designed for replay by someone who wasn’t in the room. The two are not architecturally equivalent, and treating them as equivalent is one of the most common reasons production vertical agent pilots fail compliance review.
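The tamper-evident property can be illustrated with a hash chain, one common technique for this and not necessarily what any vendor named above uses. The record fields (evidence references, policy version, and so on) are hypothetical; each entry commits to the previous entry's hash, so editing any past record breaks verification.

```python
import hashlib
import json

GENESIS = "0" * 64


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def append(ledger: list[dict], record: dict) -> None:
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)})


def verify(ledger: list[dict]) -> bool:
    """Replay the chain from the start; any edited or reordered entry fails."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger: list[dict] = []
append(ledger, {"case": "pa-17", "step": "recommendation", "decision": "PEND",
                "evidence": ["transcript#t=00:04:12"], "policy_version": "2026-05"})
append(ledger, {"case": "pa-17", "step": "clinician_signoff", "decision": "APPROVE",
                "reviewer": "rn-204"})
intact = verify(ledger)
```

A chain like this covers only tamper evidence. Replay also depends on what each record captures: the inputs seen, the policy version applied, and the evidence links.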

Why this is a framework, not a checklist

The seven components are load-bearing in regulated industries. They are not optional. A general-purpose customer-service chatbot can ship without a deterministic policy layer because nothing it does carries regulatory weight. A KYC agent cannot. A prior auth agent cannot. A mortgage origination agent cannot. The Bessemer vertical-AI thesis (with the caveat that Bessemer is a portfolio investor in Abridge and a number of other vendors cited in this piece) and the broader vertical-AI investor consensus argue that vertical agents win because they reach further into the system of record. That is true, but it is the schema adapter doing that work, not the model. The model is the easy part.

VAA is descriptive, not prescriptive. The components illuminate where production engineering effort actually goes — and where most early deployments under-invest. They are not a scoring rubric. A system with all seven components but a weak schema adapter will still fail in production. A system that nails the schema adapter but treats the policy layer as a confidence threshold will pass demos and fail audits.

The interesting question once the seven components are recognized is comparative: the shape of those components changes dramatically across verticals. A healthcare audit trail looks nothing like a banking audit trail. A mortgage human checkpoint router looks nothing like a claims one. Schema work in insurance is fundamentally unlike schema work in legal. That comparative analysis — how the harness reshapes itself vertical by vertical — is the next piece in this series.

What this means for your build

Three takeaways for engineering teams looking at vertical agent deployments today.

First, audit your design against the seven components before you start measuring model quality. Most teams discover that what they thought was an “agent” is actually four of the seven components glued together with weak supervisor coverage and no replay audit. Measuring model accuracy on that system answers the wrong question.

Second, the schema adapter and the policy layer are where production hours go. Engineering effort that does not touch one of these two components after the first month is engineering effort that is not building a production vertical agent. This is where the Retrofit Tax hides: every legacy schema, every undocumented system-of-record behavior, and every state-by-state policy variation is paid for in engineering hours.

Third, design the audit trail before you design the model. The audit trail constrains everything else — what gets logged, what gets versioned, how decisions are reconstructed, what evidence the policy layer captures on the way through. Most teams design the audit trail last. The result is logs that observability tools can read but regulators cannot.

The LLM is the smallest part. The other six components are the work.

FAQ

What is a vertical agent? A vertical agent is an LLM-driven system designed to operate within a specific industry’s regulatory, schema, and system-of-record constraints — healthcare, banking, insurance, legal, or mortgage. The defining feature is not the model; it is the deterministic scaffolding around the model. The 2025 arXiv standardization paper formalizes the academic definition; production deployments at Anterior, Abridge, Sierra, Decagon, Hippocratic, Harvey, Sixfold, Norm Ai, Tennr, Tavant, Rocket, and Blend instantiate the seven-component shape.

What is the difference between a vertical agent and a general-purpose agent? General-purpose agents have a harness, but the harness is optional in the sense that demos and consumer products can ship without most of it. Vertical agents in regulated industries cannot ship without the harness — every component carries regulatory, schema, or liability weight. The same seven components exist; the difference is that in vertical deployments they are load-bearing rather than nice-to-have.

Can a vertical agent skip any of these components? Not in regulated industries. A vertical agent missing the deterministic policy layer cannot pass model risk management review. One missing the regulator-replay audit trail cannot pass an HHS, OCC, or NAIC audit. One missing the schema adapter cannot reach into the system of record and is effectively a chatbot pretending to be an agent. One missing the human checkpoint router cannot allocate liability cleanly. Each component exists because something specific in the regulated environment requires it.

Is the LLM really the smallest component? By engineering hours, yes. The MongoDB engineering team has made this case for general-purpose agents — the model interaction is a small fraction of the codebase compared to state, governance, orchestration, memory, observability, and evaluation. Regulated verticals push this further. The schema adapter alone often exceeds the entire model-interaction layer in lines of code, ongoing maintenance, and incident frequency. The audit trail and policy layer compound the imbalance.

What is the relationship between VAA and the harness concept? The harness is the broader engineering pattern around any production LLM agent. VAA is the regulated-vertical specialization of that pattern. Components 1, 5, 6, and 7 of VAA correspond closely to harness primitives the MongoDB writeup names (orchestration, memory, governance, observability/eval). Components 3 and 4 — deterministic policy layer and domain schema adapter — are where vertical agents specifically diverge from general-purpose agents.

The Brain Isn’t the LLM: How HockeyStack Built Revenue Agents

The AI Runtime — Tue, 12 May 2026 11:03:53 GMT

TL;DR - HockeyStack closed $50M from Bessemer Venture Partners, Y Combinator, and Uncorrelated Ventures to scale Revenue Agents — autonomous AI agents that work every deal and account 24/7 across new business, prospecting, and expansion. The interesting architectural choice: HockeyStack’s reasoning engine is not a frontier LLM. It is a proprietary ML model called the Blueprint that reverse-engineers each customer’s winning sales process from their event data. The LLM sits downstream as the execution and natural language layer. If you are designing a vertical agent, HockeyStack is the cleanest public example of an “ML brain, LLM executor” architecture — the inverse of what most teams ship.

What HockeyStack Actually Sells

HockeyStack started in 2021 as a B2B revenue analytics and attribution platform — the kind of tool that stitches Salesforce, HubSpot, ad platforms, Gong, and product data into one buyer journey so a CMO can answer “which campaign actually drove pipeline?” The founders — Emir Atlı, Arda Bulut, and Buğra Gündüz, the CEO — dropped out of college in Turkey, went through Y Combinator, and built the company into a Series A attribution vendor.

That is the company HockeyStack used to be. The company it is now is something different.

In April 2026, HockeyStack announced a $50M raise and the launch of “Revenue Agents for the Enterprise.” The pitch: per-deal autonomous agents that monitor every live opportunity against a learned pattern of how the customer’s own top reps win, execute the next-best action, and loop in the human rep when judgment is required. HockeyStack says its customer base includes Fortune 100 revenue teams; named customers include 8x8, AppsFlyer, Outreach, Yext, and Sendoso, with over 300 customers reached in under two years.

This is a category bet: HockeyStack is positioning Revenue Agents as a new product category sitting alongside (or above) attribution, CRM, and revenue intelligence. The bet is architectural, and it is the part worth studying.

The Blueprint Is the Brain

The single most useful sentence on HockeyStack’s site is in their description of the platform: agents follow a “validated, data-grounded process.” Read past the marketing voice and notice what is not being claimed. The agent is not reasoning from first principles each turn. It is not asking an LLM “what should I do next on this deal?” and trusting whatever comes back. It is executing against a blueprint — a learned, structured representation of the customer’s winning sales process.

The Blueprint is HockeyStack’s proprietary ML model. Per their own description, it is built by analyzing every won and lost deal, every touchpoint, and every signal in the customer’s data to surface specific, validated patterns. Each Blueprint is unique to a revenue motion or business unit and updates continuously as new deals close and market conditions shift.

Crucially, the Blueprint is not a fine-tuned LLM. It is described as a machine learning model that continuously learns on new outcomes — an event-chain pattern-mining pipeline trained on the customer’s own deal history. The LLM enters the picture downstream: surfacing tasks in natural language to reps, generating outreach copy, and handling the human-facing surface. The reasoning about what should happen on a deal is the Blueprint’s job.

This inverts the dominant pattern in AI agent products. Most “AI for X” startups treat a frontier LLM as the reasoning engine and bolt on retrieval, tools, and memory around it. HockeyStack treats a domain-specific ML pipeline as the reasoning engine and uses the LLM as the execution and language layer.

Detail belongs in the prose, not the diagram. Three components carry the real weight.

Atlas: The Event-Based Substrate

Most CRMs are record-based: a deal is a row, with fields. HockeyStack’s foundation, called Atlas, is event-based: every interaction is a timestamped event resolved to one identity graph. Per their own product page, Atlas unifies every interaction into a single event-based timeline with full identity resolution — CRM, outreach sequences, call recordings, web activity, and the data warehouse all resolved to one time-stamped source of truth.

This matters because the Blueprint cannot mine winning patterns from flattened CRM fields. As contentgrip’s coverage of the raise observed, many meaningful buyer and seller signals are inherently event-like — web activity, product usage, conversation outcomes, buying-committee changes — and when those signals get flattened into static fields, teams lose the sequence, timing, and causality that define a winning play. An event-based model preserves them.

For builders, the lesson is upstream of agent design: if your reasoning layer needs sequence and causality (and most consequential agent decisions do), your data layer has to preserve them. You cannot retrofit event semantics onto a record-based store after the fact without losing fidelity.
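A two-deal toy example shows what flattening throws away. The event names and deals are invented; the point is that two histories with identical record-style fields can have entirely different sequences.

```python
from collections import Counter

# Hypothetical deal histories as (timestamp, event_type) pairs.
deal_a = [(1, "demo"), (2, "security_review"), (3, "pricing_call")]
deal_b = [(1, "pricing_call"), (2, "demo"), (3, "security_review")]


def flatten(events):
    """Record-style view: one count per event type; order is discarded."""
    return dict(Counter(kind for _, kind in events))


def transitions(events):
    """Event-style view: ordered pairs of consecutive event types."""
    kinds = [kind for _, kind in sorted(events)]
    return list(zip(kinds, kinds[1:]))


same_record_view = flatten(deal_a) == flatten(deal_b)
same_event_view = transitions(deal_a) == transitions(deal_b)
```

In the flattened view the two deals are indistinguishable, while the transition view separates a deal that led with pricing from one that ended with it. Any pattern that depends on that ordering is unrecoverable from the flattened fields.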

Revenue Agents: Per-Deal, Always-On

The agent layer is where the Blueprint gets executed. HockeyStack’s framing: dedicated agents monitor every deal and account, execute the right moves autonomously, and flag risks, with individual Revenue Agents assigned to each deal and account, operating around the clock.

Concrete agent behaviors HockeyStack has shipped, per their agents page: identifying missing stakeholders and triggering outreach to unblock deals, detecting competitor dissatisfaction signals and launching displacement outreach, redistributing account attention based on revenue risk, and identifying when messaging stops converting. Each behavior is an instance of “deal deviates from the Blueprint pattern → agent acts.”

The reps interact with this through a surface called the Rep Cockpit — a daily workspace where agents surface direct tasks with reasoning. Senior leaders get separate Manager views for coaching and pipeline forecasting. This shape — agent surfaces work, human reviews and acts — is the same shape Rogo’s Felix uses with email as the substrate. Different surface, same async-handoff pattern.

HockeyStack also describes a multi-agent orchestration model: one agent retrieves data, another runs analysis, a third validates the output before the user sees it. The validator step is doing real work — it is the guardrail that catches the LLM hallucinating a stakeholder or fabricating an account fact before that error propagates into a rep’s outreach.
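The retrieve, analyze, validate shape can be sketched with three plain functions. This is not HockeyStack's code: the functions, the account data, and the deliberately fabricated recipient are all hypothetical, and the analysis step is a stub standing in for an LLM-backed agent.

```python
def retrieve(account_id: str) -> dict:
    # Stand-in for a data-retrieval agent; returns hypothetical account facts.
    return {"account": account_id, "stakeholders": {"Dana Lee", "Sam Ortiz"}}


def analyze(facts: dict) -> dict:
    # Stand-in for an LLM-backed analysis agent. The second name is deliberately
    # fabricated to simulate a hallucinated stakeholder.
    return {"action": "email", "recipients": ["Dana Lee", "Jordan Poe"]}


def validate(facts: dict, proposal: dict) -> tuple[dict, list[str]]:
    """Keep only recipients grounded in the retrieved facts; report the rest."""
    known = facts["stakeholders"]
    grounded = [name for name in proposal["recipients"] if name in known]
    rejected = [name for name in proposal["recipients"] if name not in known]
    return {**proposal, "recipients": grounded}, rejected


facts = retrieve("acct-42")
checked, rejected = validate(facts, analyze(facts))
```

The validator is deterministic and checks the proposal against retrieved facts rather than asking another model whether the output looks plausible. That grounding is what lets it stop a fabricated name before outreach goes out.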

The Reverse-Engineering Bet

There is a strong claim underneath all of this, and HockeyStack states it plainly: your top performers run plays that live in their heads, and the Blueprint finds and deploys them across your entire team. The bet is that “what your best rep does” is a pattern recoverable from the event stream — not just tribal knowledge.

This is non-obvious. Sales has been resistant to standardization because the tacit-to-explicit conversion loses something. Whether HockeyStack’s pattern mining actually captures what the best reps do, or just captures the surface signals correlated with their wins, is the empirical question that will determine whether this category sticks. As one industry analyst noted in coverage of the raise, enterprises will look for clear proof that an event-based architecture improves forecast accuracy, sales productivity, or expansion conversion, not just that it produces more data. That bar has not yet been independently cleared.

But it is the right bet to be making. If the architecture works, the moat is significant: every customer’s Blueprint is a one-of-one asset trained on their data, is hard to rip out, and gets better as it ingests more deals.

Two Architectures for Vertical Agents

It is worth naming the two patterns explicitly, because they map cleanly onto a choice every vertical-agent builder is now making.

Pattern A — Frontier LLM as brain, harness around it. The reasoning engine is a frontier model. The vertical work is in the harness: tool layer, evals, output formatters, audit trail, data integrations. When a better frontier model ships, you swap the engine. Examples: most agentic platforms today, including the agent harness several finance and legal AI companies have publicly described.

Pattern B — Domain ML as brain, LLM as executor. The reasoning engine is a custom ML pipeline trained on customer data. The LLM handles natural language interfaces, generation, and tool calling. The vertical work is in the data pipeline, the pattern model, and the per-customer training loop. HockeyStack is the clearest public example.

Neither is universally right. Pattern A is faster to ship, benefits automatically from frontier-model gains, and is easier to swap. Pattern B is more defensible if your domain has rich event data and recoverable patterns, and it gives you deterministic behavior the LLM cannot match.

In Model Reliability Engineering terms: Pattern A invests heavily in Harness Engineering. Pattern B invests heavily in Context Engineering, taken to its logical extreme — the context isn’t just retrieved, it’s mined and structured into a deterministic decision pattern before the LLM ever runs.

What’s Actually Being Transformed

Sales orgs do not get replaced; their middle gets compressed. The classic problem HockeyStack is targeting — the best rep closes 2-3x more than the median, and nobody knows why — has been a fixture of sales leadership for thirty years. The traditional response was process documentation, MEDDIC training, and rep shadowing, and it did not close the gap because tacit knowledge resists capture.

If Revenue Agents work as advertised, what changes is not headcount; it is the variance band. New reps execute closer to top-quartile from week one because the agent surfaces the next move. Top reps spend less time on context-stitching (one HockeyStack customer testimonial cites three hours a day of cross-tool data wrangling eliminated, though this is vendor-curated and worth treating as directional rather than benchmarked) and more time on the relationship work that actually requires a human. Managers run pipeline reviews against a model rather than vibes.

The honest caveat: this is the promise. As of April 2026, the public evidence is the customer list, the funding round, and HockeyStack’s own product descriptions. Independent benchmarks of forecast-accuracy lift or expansion-conversion lift do not yet exist publicly. Buyers in this space should ask for them.

Five Lessons If You Are Building a Vertical Agent

Decide which brain you are building. Pattern A and Pattern B are different companies with different moats. Pick deliberately, not by default.
Event-based data preserves causality. Record-based data destroys it. If your agent needs to reason about why something happened, your substrate has to keep the sequence.
The validator agent is doing real work. Multi-agent orchestration with a dedicated check step is a cheap way to cut hallucination risk before output reaches the user.
Per-customer learning is a moat. Per-customer training is hard. A model that gets better as the customer uses it is structurally defensible — but only if you can run that loop without ongoing human curation.
Async surfaces beat new UIs. HockeyStack’s Rep Cockpit and Manager views, like Rogo’s email interface, surface agent work where the user already lives. Adoption follows the path of least friction.
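The second lesson, event-based versus record-based data, is easier to see in code than in prose. What follows is a minimal sketch, not anything HockeyStack has published: the `Event` shape, the field names, and the deal history are all hypothetical. It shows the same data stored two ways and what each can still answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: int        # logical timestamp (hypothetical; any monotonic clock works)
    deal_id: str
    field: str
    value: str


# Event-based substrate: an append-only log keeps every change in order.
events = [
    Event(1, "deal-1", "stage", "discovery"),
    Event(2, "deal-1", "champion", "VP Eng"),
    Event(3, "deal-1", "stage", "proposal"),
    Event(4, "deal-1", "champion", "none"),
    Event(5, "deal-1", "stage", "closed_lost"),
]


def to_record(log):
    """Collapse the log into a record-based view: last write wins per field."""
    record = {}
    for e in sorted(log, key=lambda e: e.ts):
        record[e.field] = e.value
    return record


def what_preceded(log, field, value):
    """Return the events that happened before `field` first took `value`."""
    ordered = sorted(log, key=lambda e: e.ts)
    for i, e in enumerate(ordered):
        if e.field == field and e.value == value:
            return ordered[:i]
    return []


record = to_record(events)
history = what_preceded(events, "stage", "closed_lost")
```

The record view only says the deal is lost and has no champion. The log still shows that the champion disappeared right before the loss, which is the sequence an agent would need to reason about why.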

What to Do This Week

Pick a workflow you have watched a domain expert do — one with rich, structured signals leading up to the decision. Now ask: could a small ML model trained on past instances of this workflow predict the right next action better than an LLM prompted with the same context?

If yes, you have a candidate for Pattern B. The investment is in the data pipeline and the model, not the prompt.

If no — if the signals are sparse, unstructured, or judgment-dominated — you are in Pattern A territory, and your work is in the harness around the frontier model.

The mistake to avoid is the third pattern: a thin LLM wrapper that pretends to be either. That is the architecture that gets disrupted next quarter when the next frontier model ships and removes whatever differentiation the wrapper claimed.

How MIT’s ScienceClaw Runs Hundreds of AI Agents Without a Central Planner

The AI Runtime — Mon, 11 May 2026 11:04:55 GMT

TL;DR - On March 15, 2026, a team led by MIT’s Markus Buehler released ScienceClaw + Infinite, an open-source framework where autonomous AI agents conduct scientific research across a registry of more than 300 interoperable tools. The system is Apache 2.0-licensed and built around a coordination pattern most production multi-agent systems don’t use: there is no central planner. Agents broadcast unsatisfied research needs into a shared index, peer agents pick those needs up via schema-overlap matching, and a component called the ArtifactReactor uses pressure-based scoring to bias the swarm toward high-impact directions. Every computation produces an immutable, content-hashed artifact with explicit parent lineage, accumulating in a directed acyclic graph. The repository is research-grade — five GitHub stars, four contributors, fifty-five commits as of early May 2026 — so this is not a drop-in production system. But the coordination pattern is what to take from it. If you are building multi-agent systems where the planner has become a brittle bottleneck, ScienceClaw shows what plannerless coordination via a typed-artifact substrate looks like in practice. Read the paper, skim the repo, port the patterns.

What ScienceClaw actually is

ScienceClaw + Infinite is an open-source multi-agent framework, released by MIT’s Laboratory for Atomistic and Molecular Mechanics in March 2026, where autonomous AI agents conduct scientific investigations across a catalog of more than 300 tools. Agents coordinate without a central scheduler: they broadcast unmet research needs and peer agents fulfill them through schema-matching on artifact types.

The system has three named components: an extensible registry of scientific skills, an artifact layer that preserves full computational lineage as a directed acyclic graph (DAG), and the Infinite platform — a structured space for agent-based scientific discourse with provenance-aware governance. The stack runs on top of OpenClaw, requires Node.js ≥ 22 and Python ≥ 3.8, and supports multiple LLM backends including Anthropic, OpenAI, and Hugging Face models alongside the default OpenClaw runtime. Once installed, agents run as a 4-hour heartbeat daemon — scienceclaw-heartbeat.service — that periodically scans for sessions to join, needs to fulfill, and findings to validate.

The paper presents four autonomous investigations: peptide design for the somatostatin receptor SSTR2, lightweight impact-resistant ceramic screening, cross-domain resonance bridging biology, materials and music, and formal analogy construction between urban morphology and grain-boundary evolution. The third of those, the cross-domain resonance investigation, produced a concrete output: a de novo Hierarchical Ribbed Membrane Lattice that, when validated with 3D finite-element analysis, resonates at 2.116 kHz and exhibits nine elastic modes in the 2–8 kHz band, which is relevant to acoustic filtering and bio-inspired sensing. Buehler reports that no human directed the cross-domain mapping, the gap identification, or the design generation.

The plannerless coordination loop

Most production multi-agent frameworks are orchestrator-based. A planner LLM decomposes the user’s request into subtasks, assigns them to agents, and either supervises execution or rewires the plan as new information arrives. AutoGen, CrewAI, and most LangGraph patterns sit in this family. The orchestrator is the throat through which all coordination flows.

ScienceClaw inverts this. There is no planner. Coordination emerges from three primitives: typed artifacts produced by every computation, a global index where agents broadcast unsatisfied information needs, and pressure-based scoring that biases attention toward high-impact directions.

The mechanic is straightforward. When an agent produces an artifact (say, a list of candidate peptide sequences), it is wrapped as an immutable, content-addressed record with typed metadata and parent lineage, then dropped into a shared store. When that agent hits a question it cannot answer with its own skills (say, ADMET prediction), it broadcasts the unmet need into the global index. During their own heartbeat cycles, peer agents scan this index via the ArtifactReactor, pick up matching needs, run the fulfilling skill, and post the result as another comment on the same Infinite thread. The outcome is a growing, traceable conversation between agents that never explicitly assigned each other tasks. Schema-overlap matching does the routing: when one agent posts an artifact whose schema is a downstream input for another agent’s skill, the second agent detects the match implicitly.
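To make "immutable, content-addressed record with typed metadata and parent lineage" concrete, here is a minimal sketch. It is not ScienceClaw's actual implementation: the `Artifact` and `ArtifactStore` names, the field layout, the choice of SHA-256, and the placeholder payloads are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Artifact:
    artifact_type: str
    payload: str          # JSON-encoded content
    parents: tuple = ()   # ids of the artifacts this one was derived from

    @property
    def artifact_id(self) -> str:
        # Hash type, payload, and lineage together so any change yields a new id.
        body = json.dumps(
            {"type": self.artifact_type, "payload": self.payload,
             "parents": list(self.parents)},
            sort_keys=True,
        )
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


class ArtifactStore:
    """Hypothetical in-memory shared store keyed by content hash."""

    def __init__(self):
        self._items = {}

    def put(self, artifact: Artifact) -> str:
        aid = artifact.artifact_id
        self._items.setdefault(aid, artifact)  # same content is stored once
        return aid

    def get(self, artifact_id: str) -> Artifact:
        return self._items[artifact_id]

    def __len__(self):
        return len(self._items)


store = ArtifactStore()
seqs_id = store.put(Artifact("peptide_sequences", json.dumps(["SEQ_A", "SEQ_B"])))
admet_id = store.put(
    Artifact("admet_report", json.dumps({"SEQ_A": "pass"}), parents=(seqs_id,))
)
```

Because the id is derived from the content and the parent ids, two agents that produce the same artifact from the same inputs converge on the same record, and a change anywhere upstream changes every id downstream.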

If the pattern feels familiar, that is because it is. This is a modern blackboard architecture — the 1970s-era pattern where multiple knowledge sources read from and write to a shared substrate — re-implemented for typed LLM agents. Buehler describes it categorically as a pullback in category theory: distinct domains (biology, metamaterials, music) become categories of objects, the shared feature space is a functor, and the ArtifactReactor’s schema-overlap matching behaves like the universal object connecting them. That is a fancier way to say agents see each other through types, not orchestration.

Why this matters: where orchestrators break

Orchestrator-based multi-agent systems work well when the work is well-specified, the agent set is small and stable, and the planning context fits. They fall apart in the opposite regime.

As agent counts grow, the planner’s context bloats with state about every agent’s capabilities, current task, intermediate outputs, and dependencies. Plans get longer, the planner’s reasoning gets shallower per step, and small misroutings compound. Adding a new agent means changing the planner’s prompts or fine-tuning. Removing one means dependency repair. The planner becomes the channel through which all coordination passes — and the single point of contention.

Plannerless coordination shifts the harness. Instead of encoding routing in a planner’s prompts, ScienceClaw encodes it in the substrate: typed artifacts, schema matches, and pressure scores. Agents see each other through what they produce and what they need, not through a central agenda. An autonomous mutation layer prunes the expanding artifact DAG to resolve conflicting or redundant workflows, and persistent memory lets agents build on prior epistemic states across cycles. The result is an architecture that scales by addition: contribute an agent, contribute a skill, the swarm reorganizes around it without rewiring.

There is a second consequence worth pulling out. Every computation in ScienceClaw produces an immutable artifact with explicit parent lineage, accumulating in a directed acyclic graph that preserves the full provenance of every discovery. Provenance is what production AI teams typically bolt on as observability — a tracing layer wrapped around an existing system. Here it is the substrate. The DAG is the coordination medium and the audit log. You cannot have one without the other.
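A small sketch of what "the DAG is the audit log" buys you, assuming a hypothetical in-memory mapping from each artifact id to its parent ids (the ids here are made-up labels, not ScienceClaw's). Answering "where did this finding come from?" becomes one traversal.

```python
from collections import deque

# Hypothetical lineage: artifact id -> ids of the artifacts it was derived from.
dag = {
    "raw_query": [],
    "peptide_sequences": ["raw_query"],
    "admet_report": ["peptide_sequences"],
    "finding": ["peptide_sequences", "admet_report"],
}


def lineage(dag, artifact_id):
    """Return every ancestor of `artifact_id`, nearest first, each listed once."""
    seen, order = set(), []
    queue = deque(dag[artifact_id])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # shared ancestors are reported only once
        seen.add(current)
        order.append(current)
        queue.extend(dag[current])
    return order


ancestors = lineage(dag, "finding")
```

No separate tracing layer is involved: the same parent links that let agents build on each other's work are the ones the audit walks.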

How agents actually select tools

The headline question for engineers reading this: how do agents decide which tools to call?

ScienceClaw’s answer is that there is no domain-to-tool routing table. The LLM analyzes the topic and selects three to five skills from the full catalog, with skills auto-discovered from the skills/ directory. The README is explicit: “No hardcoded domain → tool mapping — selection adapts to any research question.” Add a skill folder with a SKILL.md and the catalog picks it up.
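The discovery half of that claim is simple enough to sketch. This is an illustration under assumptions, not the repository's code: the `discover_skills` function is hypothetical, and so is the convention that the first line of a `SKILL.md` is a usable summary. The LLM selection step is deliberately left out.

```python
import tempfile
from pathlib import Path


def discover_skills(root):
    """Return {folder name: first line of SKILL.md} for each skill folder under root."""
    catalog = {}
    for skill_md in sorted(Path(root).glob("*/SKILL.md")):
        lines = skill_md.read_text(encoding="utf-8").strip().splitlines()
        catalog[skill_md.parent.name] = lines[0] if lines else ""
    return catalog


# Demo against a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    skills = Path(tmp) / "skills"
    (skills / "blast").mkdir(parents=True)
    (skills / "blast" / "SKILL.md").write_text(
        "Sequence similarity search\n", encoding="utf-8"
    )
    (skills / "scratch").mkdir()  # no SKILL.md, so it is not a skill
    catalog = discover_skills(skills)
```

The design point is that the catalog is derived from the filesystem on every scan, so adding a skill is a file operation rather than a change to a routing table.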

The catalog spans roughly fifteen tool families covering the working set of a modern computational research lab. Sequence and structural biology are represented by BLAST, UniProt, and PDB; literature by PubMed and ArXiv; cheminformatics by PubChem, ChEMBL, RDKit, and TDC; materials by the Materials Project and NIST WebBook; plus general-purpose web search and data visualization. Each is a thin Python wrapper that exposes a uniform invocation surface. Agents reason about which skills apply, chain them, and produce artifacts at every step.

There is a separate, smaller decision the system makes at the social layer: role assignment. ScienceClaw exposes five roles (investigator, validator, critic, synthesizer, and screener), assigned based on skills and personality during session joining. Investigators explore. Validators independently re-verify findings using different tools. Critics challenge logic and propose alternatives. Synthesizers integrate disagreements. Screeners parallelize high-throughput work. Upvotes and downvotes require structured reasoning and citations; they are evidence-backed, not sentiment. Each finding’s consensus state is recorded as validated, challenged, under review, or disputed rather than forced into unanimity.

This matters for engineers because role-plus-interaction-type is a different shape of coordination than control flow. You are not writing the workflow. You are writing the vocabulary the workflow uses to assemble itself.

The coordination loop, end to end

Figure: The coordination loop

The eight-step coordination loop runs without a central planner. Skill-based discovery, role assignment, and schema matching happen as side effects of the heartbeat — not as orchestrated control flow. The full loop and its four-layer implementation are documented in the README and the paper.

What’s actually shipped — and what to be careful about

The four investigations in the paper are real and worth reading, but the framing matters.

The peptide design investigation targeted SSTR2, a somatostatin receptor with established cancer relevance. The lightweight ceramic work was a screening pipeline. The cross-domain resonance investigation produced the Hierarchical Ribbed Membrane Lattice with the 2.116 kHz primary mode that I mentioned above, and validated the design with finite-element analysis. The urban-morphology-to-grain-boundary work built a formal analogy between two fields with no prior cross-citation. The paper’s core empirical claim is that across these four cases, the framework demonstrates heterogeneous tool chaining, emergent convergence among independently operating agents, and traceable reasoning from raw computation to published finding.

What the paper does not yet show is large-scale cross-institutional coordination. Buehler’s announcement describes ScienceClaw × Infinite as a swarm “across institutions, labs and the world”, and the architecture is built for it: anyone can deploy an agent or contribute a skill, the heartbeat runs 24/7 without a central coordinator. But the four investigations in the paper are produced by Buehler’s MIT team. The cross-institutional layer is a design property, not a demonstrated outcome — at least not yet.

The repo state confirms this is early. Five GitHub stars, four contributors, fifty-five commits at the time of writing. Posting to Infinite requires a minimum of 10 karma, which agents earn through commenting and voting before they can post — a sensible spam guard, but a reminder that the surrounding social layer is also under construction. There are rate limits: one post per 30 minutes, fifty comments per day, two hundred votes per day. This is a research artifact, generously open-sourced, that aligns with the broader DOE Genesis Mission’s stated goal of doubling the productivity and impact of American science within a decade, but it is not a production system.
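For readers who want to mirror those guardrails in their own agent platform, here is one way the numbers could be enforced. This is a sketch under assumptions: the `AgentQuota` class is hypothetical, and treating "per day" as a rolling 24-hour window is my reading, not something the source specifies.

```python
from collections import deque

# (max actions, window in seconds), taken from the limits described above.
LIMITS = {
    "post": (1, 30 * 60),
    "comment": (50, 24 * 3600),
    "vote": (200, 24 * 3600),
}
MIN_KARMA_TO_POST = 10


class AgentQuota:
    """Hypothetical per-agent limiter using rolling windows."""

    def __init__(self, karma=0):
        self.karma = karma
        self._history = {action: deque() for action in LIMITS}

    def allow(self, action, now):
        """Return True and record the action if it is permitted at time `now` (seconds)."""
        if action == "post" and self.karma < MIN_KARMA_TO_POST:
            return False
        max_count, window = LIMITS[action]
        times = self._history[action]
        while times and now - times[0] >= window:
            times.popleft()  # drop actions that have aged out of the window
        if len(times) >= max_count:
            return False
        times.append(now)
        return True


quota = AgentQuota(karma=5)
blocked_by_karma = quota.allow("post", now=0)
quota.karma = 10
first_post = quota.allow("post", now=0)
too_soon = quota.allow("post", now=60)
after_window = quota.allow("post", now=30 * 60)
```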

That framing is also the right way to consume it.

The broader OpenClaw scientific ecosystem this sits inside is itself worth knowing about. A bioRxiv paper from late March 2026 catalogued 91 projects and 2,230 skills across 34 scientific categories in the OpenClaw scientific agent ecosystem, and ScienceClaw is one of the more architecturally distinct entries. The pattern across the ecosystem — skill-based agent design where workflows are expressed as structured Markdown files, lowering the barrier to contribution — is what makes the substrate-driven coordination model viable at all. Agents do not need to know about each other in advance because the skill catalog and the artifact types form a shared language.

What production AI engineers should take from this

The patterns transfer even if the framework does not.

Schema-typed artifacts as a routing primitive. The most portable idea in ScienceClaw is that the type of an artifact is the routing signal. If an agent produces a peptide_sequences artifact, any agent whose SKILL.md declares peptide_sequences as an input can pick it up. That removes a layer of planner reasoning. Production multi-agent systems can adopt this without going fully plannerless: type your intermediate artifacts, expose schemas as inputs and outputs, and let the substrate dispatch.
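A minimal sketch of type-as-routing-signal, assuming a hypothetical `Substrate` class and toy skills. Only the `peptide_sequences` type name comes from the example above; everything else is invented for illustration.

```python
class Substrate:
    """Hypothetical dispatcher: skills declare an input type, artifacts carry one."""

    def __init__(self):
        self._consumers = {}  # artifact type -> list of (skill name, function)

    def register(self, skill_name, input_type, fn):
        self._consumers.setdefault(input_type, []).append((skill_name, fn))

    def publish(self, artifact_type, payload):
        """Run every skill that declared this type as an input; return their outputs."""
        return {
            name: fn(payload)
            for name, fn in self._consumers.get(artifact_type, [])
        }


substrate = Substrate()
# Toy skills standing in for real ones.
substrate.register(
    "length_filter", "peptide_sequences",
    lambda seqs: [s for s in seqs if len(s) <= 5],
)
substrate.register("counter", "peptide_sequences", len)

results = substrate.publish("peptide_sequences", ["SEQ_A", "SEQUENCE_B"])
unrouted = substrate.publish("ceramic_candidates", ["x"])
```

Nothing in `publish` knows which agents exist. Registering a new skill changes the routing without touching a planner prompt, which is the property worth porting.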

Provenance as substrate, not afterthought. Treat the artifact DAG as the source of truth for both coordination and audit. If your current observability is wrapping logs around an opaque LangGraph state, you are paying twice. ScienceClaw’s pattern — content-hashed, immutable, lineage-preserving artifacts dropped into a shared store — gives you a deterministic replay of any investigation, and the cost is mostly upfront design discipline.

Roles plus interaction types as coordination semantics. The investigator/validator/critic/synthesizer split is a coordination pattern, not a UI metaphor. You can implement it on top of any agent framework: tag each agent’s purpose, define a small interaction-type vocabulary (challenge, validate, extend, synthesize, request_help), and write your prompts to respect those roles. You will find that consensus and disagreement become legible in your traces in a way they typically are not.
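Here is a sketch of that vocabulary as code. The role and interaction names come from the text above; the `ALLOWED` policy and the mapping from interactions to a status label are my assumptions, not ScienceClaw's rules.

```python
from enum import Enum


class Role(Enum):
    INVESTIGATOR = "investigator"
    VALIDATOR = "validator"
    CRITIC = "critic"
    SYNTHESIZER = "synthesizer"


class Interaction(Enum):
    CHALLENGE = "challenge"
    VALIDATE = "validate"
    EXTEND = "extend"
    SYNTHESIZE = "synthesize"
    REQUEST_HELP = "request_help"


# Hypothetical policy: which interaction types each role may emit.
ALLOWED = {
    Role.INVESTIGATOR: {Interaction.EXTEND, Interaction.REQUEST_HELP},
    Role.VALIDATOR: {Interaction.VALIDATE, Interaction.REQUEST_HELP},
    Role.CRITIC: {Interaction.CHALLENGE, Interaction.REQUEST_HELP},
    Role.SYNTHESIZER: {Interaction.SYNTHESIZE, Interaction.REQUEST_HELP},
}


def record(trace, agent, role, interaction, target):
    """Append a typed interaction to the trace, rejecting out-of-role moves."""
    if interaction not in ALLOWED[role]:
        raise ValueError(f"{role.value} may not {interaction.value}")
    trace.append({"agent": agent, "role": role,
                  "interaction": interaction, "target": target})


def status(trace, target):
    """Derive a status label for a finding from the interactions recorded on it."""
    kinds = {e["interaction"] for e in trace if e["target"] == target}
    if Interaction.CHALLENGE in kinds and Interaction.VALIDATE in kinds:
        return "disputed"
    if Interaction.CHALLENGE in kinds:
        return "challenged"
    if Interaction.VALIDATE in kinds:
        return "validated"
    return "under review"


trace = []
record(trace, "agent-1", Role.VALIDATOR, Interaction.VALIDATE, "finding-1")
record(trace, "agent-2", Role.CRITIC, Interaction.CHALLENGE, "finding-1")
finding_status = status(trace, "finding-1")
```

Because every entry is typed, disagreement is a query over the trace rather than something you reconstruct from free-text logs.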

Plannerless is not always the answer. Orchestrator-based architectures still win when the workload is bounded, the agent set is small, and latency matters. Plannerless coordination has overhead — the pressure scoring, the schema matching, the heartbeat cadence — and it works best when the work is open-ended and agents can be added or removed dynamically. Apply it where it fits.

If you want to experiment with these patterns without adopting ScienceClaw wholesale, the cheapest path is to add a needs board to your existing system. Let one agent post what it cannot do; let peer agents pick those needs up on their own schedule. You will learn whether plannerless coordination buys anything for your domain in about a week of work.
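A needs board can start as small as this. The sketch is an assumption-laden starting point, not ScienceClaw's index: the `NeedsBoard` class, its method names, and the agent names are hypothetical, and it skips persistence, pressure scoring, and concurrency entirely.

```python
class NeedsBoard:
    """Hypothetical in-memory board where agents post needs and peers claim them."""

    def __init__(self):
        self._needs = []

    def post(self, requester, needed_type, context):
        """Record an unmet need and return its id."""
        need = {
            "id": len(self._needs) + 1,
            "requester": requester,
            "needed_type": needed_type,
            "context": context,
            "claimed_by": None,
            "result": None,
        }
        self._needs.append(need)
        return need["id"]

    def claim(self, agent, can_produce):
        """Claim the oldest open need this agent can satisfy, or return None."""
        for need in self._needs:
            if (need["claimed_by"] is None
                    and need["needed_type"] in can_produce
                    and need["requester"] != agent):
                need["claimed_by"] = agent
                return need
        return None

    def fulfill(self, need_id, result):
        """Attach a result to a previously posted need."""
        need = self._needs[need_id - 1]
        need["result"] = result
        return need


board = NeedsBoard()
need_id = board.post("designer", "admet_report", {"sequences": ["SEQ_A"]})
skipped = board.claim("plotter", {"chart"})            # cannot produce it
claimed = board.claim("chem_agent", {"admet_report"})  # picks it up on its own schedule
board.fulfill(claimed["id"], {"SEQ_A": "pass"})
```

Each agent calls `claim` from its own polling loop. If needs sit unclaimed for long, or the same agent claims everything, that tells you something about whether the pattern fits your domain.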

FAQ

Is ScienceClaw production-ready? No. What exists today is a repository with five GitHub stars and four contributors, an academic paper from March 2026, and a Vercel-deployed Infinite platform. Treat it as a reference architecture and a research artifact, not a runtime you deploy this quarter.

How is it different from CrewAI or other frameworks? Most frameworks use orchestrator-based coordination — a central agent decomposes work and assigns it. ScienceClaw uses plannerless coordination via the ArtifactReactor: agents broadcast unsatisfied needs and peers fulfill them via schema-overlap matching, without any planner assigning tasks. The closest analogue is a 1970s blackboard architecture, modernized for typed-artifact LLM agents.

Can I use Claude as the agent backbone? Yes. The repository documents Anthropic, OpenAI, and Hugging Face as supported LLM backends, with OpenClaw as the default runtime. Setup is via LLM_BACKEND=anthropic and the corresponding API key.

Does it actually produce real scientific results? The paper presents four investigations across peptide design, ceramic screening, cross-domain resonance, and urban-morphology analogy, and one of them produced a finite-element-validated metamaterial design with concrete acoustic properties. Whether those count as “real scientific results” depends on whether you mean novel publishable findings or experiments still pending wet-lab validation. The framework’s contribution is the coordination pattern; the scientific outputs are early demonstrations.

Should I read the paper or the repo first? The paper for the architecture and the experimental results. The repo’s ARCHITECTURE.md and the multi-agent examples in the README for the implementation patterns. Both fit in an afternoon.

Closing

The interesting question is not whether ScienceClaw will become the dominant scientific agent platform. It probably will not, on its own. The interesting question is what production AI engineers should port out of it before someone else does.

Type your artifacts. Make provenance substrate, not observability. Let agents post what they need rather than wait for a planner to figure it out for them. The coordination patterns ScienceClaw demonstrates are old ideas — blackboard architectures, tuple spaces, content-addressable artifacts — applied with discipline to the LLM-agent stack. They were good ideas in 1975 and they remain good ideas now.

If your multi-agent system has a planner that has become the most fragile component in your harness, ScienceClaw is the cleanest open-source reference you can read this month for what the alternative looks like. Read the paper. Skim the repo. Then go look at the planner in your own system and ask what would happen if you replaced it with a needs board, a type system, and a pressure score.

Auctor’s Bet: Traceability Is the Architecture, Not a Feature

The AI Runtime — Sat, 09 May 2026 11:03:49 GMT

TL;DR - Auctor emerged from stealth in April 2026 with $20M led by Sequoia Capital to build an “AI-native system of action” for the messy, ~$500B/yr labor market of enterprise software implementation — the work that actually gets Salesforce, SAP, ServiceNow, or Workday running in a customer’s environment. Reading their public material with an architect’s eye, the interesting choice isn’t the agent loop or the LLM tuning. It’s the bet that artifact lineage is the load-bearing primitive: every user story, SoW, design doc, and Jira ticket is anchored in a graph that walks back to the discovery call that originated it. Frontier models are commodity; the project-scoped artifact graph compounds. If you’re building agentic systems for any domain where decisions accumulate across stakeholders over months — legal, healthcare RCM, B2B sales, regulated change management — study this pattern before you architect your context layer.

A real problem, sized correctly

Enterprise software only creates value when it’s actually deployed, and deployment is overwhelmingly a labor problem, not a software problem. Sequoia’s Julien Bek frames the ratio crisply: every dollar of enterprise software pulls roughly six dollars of services behind it. Across the top ten ecosystems — ServiceNow, Salesforce, SAP, AWS, and the rest — that adds up to about nine million implementation consultants and more than half a trillion dollars in annual labor spend, growing at a double-digit pace.

The work itself is brutal. A single deployment can span hundreds of requirements, dozens of stakeholders, and months of negotiation between what a business says it needs and what the platform can actually do. BCG’s 2024 study of more than 1,000 large-scale tech programs found that more than two-thirds miss their time, budget, or scope targets. Auctor cites its own statistics in the same vein: 50% of projects miss deadlines, and 1 in 6 exceeds budget by more than 200%. These are vendor-cited numbers, but directionally consistent with independent research. The interesting question isn’t whether implementation is broken. It’s whether the brokenness is structural and, if so, where the structural fix actually lives.

The architecture

Auctor’s framing is that implementation work is a context coordination problem, not a productivity problem. In a Q&A with Tercera, CEO Will Sun draws a distinction between three categories of enterprise software:

System of record — the platform that holds data (CRMs, ERPs).
System of work — the platform where work happens (Jira, Confluence, Asana).
System of action — a platform that acts on the data, not just stores or displays it, while preserving the traceability and governance enterprise buyers require.

That last category is where Auctor positions itself. It’s a marketing term, but the technical substance behind it is real: rather than being a chatbot that surfaces documents from your existing systems, the system itself is the substrate where decisions accumulate, artifacts are generated, and downstream tools get synced. The company describes the loop in three layers — Capture, Contextualize, Create — which read like marketing copy until you realize each layer corresponds to a non-trivial engineering surface.

Here is what the artifact graph actually looks like inside an engagement, based on what Auctor has described publicly:

Figure: Artifact graph

The dotted line back to (A) is the actual product. The forward arrows are table stakes — anything with a decent prompt and a Confluence connector can generate a SoW from a meeting transcript today. The dotted line is where the engineering discipline lives.

Layer 1: Capture

The capture layer is the ingest plane. Auctor lists integrations with Google Meet, Microsoft Teams, Zoom, Gong, Outlook, Google Calendar, Slack, Confluence, Google Drive, OneDrive/SharePoint, Salesforce, HubSpot, Jira, Linear, Azure DevOps, Rally, and Certinia. Reading this as a list of features misses the point. Reading it as a topology of where implementation context lives is closer to right.

The non-obvious move here is that real-time meeting transcription is treated as a first-class source, not a bolt-on. Auctor’s own materials describe agents that join discovery and refinement calls, transcribe live, and pull context from past projects to steer the conversation. The Valiantys case study describes this concretely: instead of consultants taking manual notes during fifteen-stakeholder discovery sessions and consolidating afterward, requirements, action items, and meeting summaries are produced as the discussion unfolds.

That sounds modest until you think about what a “captured requirement” has to mean to be useful downstream. It has to:

be timestamped and attributed to the speaker who voiced it
be tagged with the stakeholder role that gave it weight (PMO, architect, exec)
be linked to the meeting and the parent engagement
be deduplicated against earlier captures of the same intent
carry enough structure to be queryable later by an agent generating a SoW
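Those five properties translate fairly directly into a schema. The sketch below is my reading of the list, not Auctor's data model: the `CapturedRequirement` fields and the text-normalizing dedup key are assumptions, and real deduplication of "the same intent" would likely need semantic matching rather than string cleanup.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CapturedRequirement:
    text: str
    speaker: str
    role: str           # stakeholder role, e.g. "PMO", "architect", "exec"
    meeting_id: str
    engagement_id: str
    timestamp: str      # ISO 8601 UTC string, so string order matches time order


def intent_key(req):
    """Crude dedup key: engagement plus lowercased text with punctuation removed."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", req.text.lower())
    return (req.engagement_id, " ".join(cleaned.split()))


def dedupe(reqs):
    """Keep the earliest capture of each intent within an engagement."""
    first = {}
    for r in sorted(reqs, key=lambda r: r.timestamp):
        first.setdefault(intent_key(r), r)
    return list(first.values())


captures = [
    CapturedRequirement("approvals must route to the regional manager",
                        "Dana", "PMO", "mtg-2", "eng-1", "2026-04-08T15:30:00Z"),
    CapturedRequirement("Approvals must route to the regional manager.",
                        "Lee", "exec", "mtg-1", "eng-1", "2026-04-01T10:00:00Z"),
    CapturedRequirement("Export invoices nightly",
                        "Sam", "architect", "mtg-2", "eng-1", "2026-04-08T15:45:00Z"),
]
unique = dedupe(captures)
```

Every field is there so a later query can filter or join on it: by role, by meeting, by engagement, by time.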

This is the unglamorous schema work that turns “transcript + LLM” into something a delivery team can actually trust. Most teams underestimate how much of this is bespoke and how little of it is solved by a vector store.

Layer 2: Contextualize

The contextualize layer is where Auctor’s architectural bet shows up most clearly. In Will Sun’s own words, the very first capability he and his cofounders prototyped — the one that drew SI leaders in — was traceability. Not generation. Traceability.

The mental model he describes: a user story created months into a project should be walkable back to the original requirement, the SoW that scoped it, and the pre-sales conversation where the stakeholder first voiced the need. That walk has to survive consultant turnover, mid-project pod swaps, and the natural decay of “tribal knowledge” that erodes every long engagement.

There are a few engineering implications worth pulling out:

The graph is multi-modal. A node in this graph can be a transcript span, a section of a Word doc, a CRM field, a Jira ticket, a Confluence page, or a Slack message. Edges aren’t just “is-related-to” — they need to encode causal relationships (this requirement caused this user story to exist) and temporal ones (this requirement was superseded by that decision in last Tuesday’s call). Few off-the-shelf graph databases handle this cleanly without significant modeling work above them.
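The difference between "is-related-to" and typed edges shows up as soon as you ask a question that depends on edge meaning. A minimal sketch, assuming a hypothetical two-word edge vocabulary (`caused`, `superseded_by`) and made-up node ids; Auctor has not published its schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str   # "caused" or "superseded_by" (hypothetical vocabulary)


edges = [
    Edge("req-1", "story-7", "caused"),
    Edge("req-1", "req-2", "superseded_by"),   # decided in a later call
    Edge("req-2", "story-9", "caused"),
]


def current_version(node, edges):
    """Follow superseded_by edges until reaching a node nothing has replaced."""
    replaced = {e.src: e.dst for e in edges if e.kind == "superseded_by"}
    seen = set()
    while node in replaced and node not in seen:  # `seen` guards against cycles
        seen.add(node)
        node = replaced[node]
    return node


def stale_outputs(edges):
    """Outputs that were caused by a node which has since been superseded."""
    replaced = {e.src for e in edges if e.kind == "superseded_by"}
    return sorted(
        e.dst for e in edges if e.kind == "caused" and e.src in replaced
    )


latest = current_version("req-1", edges)
stale = stale_outputs(edges)
```

With untyped edges, `story-7` just looks connected to two requirements. With typed edges, it is flagged as built on a requirement that no longer holds.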

Project-scoped retrieval beats global retrieval. The Crossfuze case study describes Auctor’s account- and project-level repositories explicitly: queries are bounded to a defined scope rather than searching across everything the firm has ever ingested. This is a deliberate inversion of the “one big RAG corpus” pattern. For implementation work, it’s almost certainly correct: the consultant answering a question in a SoW review wants context from this engagement, not the closest semantic match across 200 historical projects. Cross-project learning becomes a separate, opt-in surface (templates, playbooks, codified house standards) rather than something contaminating live retrieval.
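The scoping idea fits in a few lines. In this sketch the word-overlap `score` is a toy stand-in for whatever retriever you actually run, and the document shape and project names are hypothetical. The only point is the order of operations: filter by project first, rank second.

```python
def score(query, text):
    """Toy relevance: Jaccard overlap of lowercase word sets."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve(query, docs, project_id, k=3):
    """Rank only the documents that belong to the given project."""
    scoped = [d for d in docs if d["project_id"] == project_id]
    ranked = sorted(scoped, key=lambda d: score(query, d["text"]), reverse=True)
    return ranked[:k]


docs = [
    {"project_id": "acme",
     "text": "approval routing goes to the regional manager"},
    {"project_id": "globex",
     "text": "approval routing decision"},
]

query = "approval routing decision"
global_best = max(docs, key=lambda d: score(query, d["text"]))
scoped_best = retrieve(query, docs, project_id="acme")[0]
```

The global search returns the other customer's document because it is the closer textual match. The scoped search cannot make that mistake, whatever the ranking function does.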

Audit trail is the API, not a sidecar. Implementation buyers — especially in financial services, government, and healthcare — won’t trust an autonomous system unless they can ask, of any output, “what did this come from?” Bolting an audit log onto a generation pipeline after the fact rarely produces a satisfying answer. Designing the lineage as the primary data structure, with generation as a derived operation, is what makes the audit trail credible. This is the same discipline that production data engineering applies to lineage in dbt or feature stores; it’s still rare in agent systems.

Layer 3: Create

The create layer is where Auctor’s outputs land. Their own product page lists the artifact types: rough orders of magnitude, resource plans, statements of work, scopes, solution designs, process flows, user stories, and presentation decks. Each of these is a distinct generation problem with its own template, its own validation rules, and its own downstream sync target.

The interesting design decision is that generation is bounded by the project graph, not by raw model capability. A SoW draft isn’t generated from “what the model knows about SoWs”; it’s generated from the requirements, decisions, and constraints already in this engagement’s graph, with house-style templates from the SI’s own playbook layered on top. Crossfuze describes this as “first-pass content creation within clearly defined project contexts,” explicitly using Auctor for drafts that then go through their normal brand and review process.

That’s the right framing for any high-stakes generation task: the model produces a defensible draft, the human still owns approval, and the graph guarantees that nothing in the draft is stranded — every claim, number, and design decision can be traced to a source already in the system. It’s also a much better fit for fixed-fee delivery economics than “AI assistant pinging the consultant for help” — because the unit of work is the artifact, not the keystroke.
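"Nothing in the draft is stranded" is a checkable property. Here is a sketch of the check, assuming each draft claim carries a list of source ids and the project graph exposes its node ids as a set. The claim shape, the `stranded_claims` helper, and the example ids are hypothetical.

```python
def stranded_claims(draft_claims, graph_node_ids):
    """Return the text of claims with no sources or with sources outside the graph."""
    stranded = []
    for claim in draft_claims:
        sources = claim.get("sources", [])
        if not sources or any(s not in graph_node_ids for s in sources):
            stranded.append(claim["text"])
    return stranded


# Hypothetical node ids already present in this engagement's graph.
graph_node_ids = {"call-1", "req-2", "decision-5"}

draft = [
    {"text": "Approvals route to the regional manager.", "sources": ["req-2"]},
    {"text": "Go-live is planned for Q3.", "sources": []},
    {"text": "Data migration covers 40 objects.", "sources": ["req-99"]},
]

problems = stranded_claims(draft, graph_node_ids)
```

Running a gate like this before a draft reaches a reviewer turns traceability from a promise into a precondition: an unsupported sentence is caught because it has no edge, not because someone happened to notice it.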

The harness, not the model

Sun is explicit on the model question: Auctor builds on frontier foundation models and tunes the system around how those models evolve, working with hundreds of consultants daily to know what works and what doesn’t. They are not building a foundation model. They are not even, as far as the public material reveals, fine-tuning one in a meaningful way. The bet is that the model is the commodity layer and the SI-specific harness — the schemas, the project-scoped retrieval, the artifact graph, the integrations, the templates, the governance — is where compounding value lives.

This is a defensible bet, and not just for Auctor. The same reasoning applies to most vertical agent companies: every six weeks the underlying model gets cheaper and stronger, and any architectural choice that depends on a specific model’s quirks decays with it. The architecture that compounds is the one that gets more useful with better models, because the harness was the durable artifact all along. The frontier labs themselves have been making versions of this argument in their own engineering writeups: the loop, the tools, the context curation are where engineering effort earns its keep, not the model behavior of any given week.

The corollary is uncomfortable for some founders: if your moat is mostly model behavior, you don’t have a moat. You have a temporary advantage on a clock you don’t control. Auctor’s choice to plant their flag on the graph instead of the model is, on its face, the more durable bet.

Governance is engineering, too

Auctor’s security page is more interesting than the average vendor compliance recitation, mostly for one detail: zero data retention with upstream AI providers, meaning customer inputs aren’t stored or logged by the underlying model providers and aren’t used for model training. For services firms whose customers include financial institutions, government agencies, and Fortune 500s, this is a precondition for sale, not a nice-to-have. The rest is what you’d expect from a startup chasing enterprise contracts: AWS infrastructure, AES-256 at rest, TLS 1.3 in transit, SSO/SCIM via Okta/Azure AD/Google, SOC 2 Type II, ISO 27001, and regional data residency.

The governance story matters because it’s the gating constraint on the whole architectural play. An audit trail is only as trustworthy as the platform’s ability to demonstrate its handling controls to a procurement team. The system-of-action framing falls apart if the action can’t be retrospectively justified to a regulator or an internal audit function. Sun makes this point explicitly in the Tercera Q&A: action without accountability fails.

What’s unproven

Worth being honest about what we don’t know from the public material:

The 80% efficiency claim is vendor-cited. Auctor reports “up to 80% efficiency gains across phases like discovery and design.” The number comes from the company and the customers it has chosen to highlight; there’s no independent benchmark, and “efficiency gain” is doing a lot of definitional work. Take it as directional, not as a measured productivity figure.

The architectural details are not public. Everything above is reverse-engineered from product copy, founder interviews, case studies, and integration lists. We don’t have a public technical writeup describing the schema, the graph implementation, the retrieval strategy, or the agent loop. There may be — and probably are — significant differences between the architecture as described and the architecture as built.

Implementation work resists templating. The harder question for any “system of action” is whether the work it’s automating is genuinely templatable at scale. SoWs and user stories sit on a spectrum: the boilerplate scaffolding is highly templatable, the load-bearing scope language often isn’t. Auctor’s own framing — first drafts, with human approval — implicitly concedes this. The interesting test will be how much of the high-judgment work survives at the human layer five years from now. Sequoia’s framing of “intelligence vs. judgement” is the right map here.

Category competition is coming. “Agentic operating system for SI work” is a defensible position today partly because nobody else is positioned exactly there. That window won’t stay open. Several adjacent categories — meeting intelligence vendors, services automation tools, project management platforms — are within a roadmap or two of overlapping capability. The artifact graph is a real moat if it stays project-scoped and integration-rich, but it’s the kind of moat that needs to keep deepening.

What builders should learn

Three patterns are worth pulling into your own architecture, regardless of vertical:

Make lineage the primary data structure. If you’re building an agent system in any domain where decisions need to be defensible — legal, finance, healthcare, regulated B2B — design the artifact graph first and the generation pipeline second. Walking from any output back to the source it depends on should be a single graph traversal, not a forensic exercise. Most teams do this backward: they build the loop, ship a feature, then bolt on observability when a customer asks why the model said what it said.

Scope retrieval to the engagement, not the corpus. Cross-project learning is a different surface from in-project recall. Conflating them produces retrieval that’s almost-right in a hundred subtle ways and consistently wrong on questions like “what did this customer decide last Tuesday?” Project- or account-scoped repositories solve a real problem cheaply.

Bet on the harness. If the part of your system that depends on the current state of frontier models is more than a thin layer, your roadmap is exposed to the next model release. The durable engineering — the schemas, the scoping, the integrations, the templates, the lineage — is what compounds while the model layer keeps shifting underneath.
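To make the lineage-first pattern concrete, here is a minimal Python sketch of an artifact graph where walking from an output back to its sources is one traversal. The Artifact class, the derived_from field, and the sample ids are all hypothetical illustrations; nothing here reflects Auctor's actual schema, which is not public.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """One node in the lineage graph: a source document or a generated output."""
    id: str
    kind: str  # e.g. "transcript", "requirement", "sow_draft" (illustrative)
    derived_from: list[str] = field(default_factory=list)  # ids of direct inputs


def trace_to_sources(artifacts: dict[str, Artifact], output_id: str) -> set[str]:
    """Walk derived_from edges from an output back to its root sources."""
    sources: set[str] = set()
    seen: set[str] = set()
    stack = [output_id]
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue
        seen.add(node_id)
        node = artifacts[node_id]
        if not node.derived_from:  # no inputs, so this is a root source
            sources.add(node_id)
        stack.extend(node.derived_from)
    return sources


# Hypothetical engagement: a draft SoW built from a requirement,
# which was itself extracted from a call and an email.
graph = {a.id: a for a in [
    Artifact("call-1", "transcript"),
    Artifact("email-7", "email"),
    Artifact("req-3", "requirement", ["call-1", "email-7"]),
    Artifact("sow-1", "sow_draft", ["req-3"]),
]}

print(sorted(trace_to_sources(graph, "sow-1")))  # ['call-1', 'email-7']
```

The design point is that lineage edges are written at generation time, so answering "why did the draft say this?" is a lookup rather than a log search.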

These aren’t novel patterns in isolation. The novel thing is treating them as load-bearing rather than as polish. In a domain that has resisted automation for thirty years, that decision is the architecture.

Have you seen this pattern — lineage-first, harness-bet — in production agent systems outside the SI space? Reply and tell me what you’re building. I read every response.

Further reading

Auctor product overview — the integration topology and the three-layer framing in the company’s own words
Will Sun’s Q&A with Tercera — primary-source view on the system-of-action concept and the founding traceability bet
Julien Bek, “Services: The New Software” — the strategic frame Auctor was funded against; useful even if you’re not in services
Sequoia’s partnership announcement — Bek’s investment thesis on Auctor specifically
BCG, “Most Large-Scale Tech Programs Fail” — the independent base rate for project failure that the whole category is sized against

Inside Mintlify’s Agent Stack

The AI Runtime — Wed, 06 May 2026 08:03:50 GMT

TL;DR - Mintlify just raised $45M at a $500M valuation on the bet that documentation has stopped being something humans read and started being infrastructure that agents query. Their own traffic data backs the bet: across 30 days and roughly 790M requests on Mintlify-powered sites, AI coding agents accounted for 45.3% of traffic versus 45.8% for browsers, with Claude Code alone generating more requests than Chrome on Windows.

Underneath the bet sits a three-part architecture worth studying. The write agent runs inside ephemeral Daytona sandboxes with a headless OpenCode session driven by Opus 4.6, triggered by Slack mentions, dashboard prompts, API calls, or YAML-defined Workflows in your repo. The read assistant does the opposite — it skips real sandboxes entirely in favor of ChromaFs, a virtual filesystem layered over their existing Chroma database, taking session creation from roughly 46 seconds to about 100 milliseconds. The public surface auto-generates llms.txt, llms-full.txt, and skill.md at the root, serves clean Markdown when you append .md to a page URL, and hosts an MCP server for every docs site it powers.

The architectural lesson isn’t that they built a doc agent. It’s that they built two harnesses with deliberately asymmetric constraints — async writes get full sandboxes, sync reads get a virtual filesystem — and the asymmetry is what makes the system economical at over 23 million queries a month. If you’re wrapping a model around a code repository for any reason, this is the reference implementation to study.

The 45% problem

Start with the data, because the architecture only makes sense once you accept the premise.

In April 2026, Mintlify’s co-founder Han Wang published a Cloudflare-header analysis covering 30 days of traffic across all Mintlify-powered docs sites. The headline number: AI coding agents had reached 45.3% of total requests, narrowly behind 45.8% from browsers. The distribution was lopsided. Claude Code alone produced 199.4M requests, ahead of Chrome on Windows at 119.4M. Cursor produced 142.3M. Together those two tools accounted for roughly 96% of identified AI agent traffic. Mintlify itself notes the real share is likely higher, since Codex traffic is invisible to user-agent header analysis and disappears into generic HTTP requests.

If half your readers are agents pulling context to generate code, the design pressure on documentation flips. Browsers want navigation chrome, syntax highlighting, expandable sections. Agents want clean Markdown, exact strings, and stable URLs. The same content has to render correctly to both audiences, and — critically — has to stay current as the underlying product ships at agent-swarm speed.

That second pressure is the one that produced the agent stack. As Mintlify’s other co-founder Hahnbee Lee frames it, when a chatbot gives a wrong answer it is usually a documentation failure rather than a model failure, because the corpus the model retrieved against is out of date. The gap between what your docs say and what your product does compounds quarter over quarter unless something automated keeps the two in sync. Their answer is two distinct agents with two distinct harnesses, plus a public surface that exposes the maintained corpus to every other agent in the ecosystem.

Two harnesses, two latency budgets. The write path optimizes for capability; the read path optimizes for cost-per-conversation.

Layer 1 — The write agent: a sandbox is the whole product

Most “AI doc writer” features on the market today are roughly one prompt, one model call, one diff. Mintlify’s write agent is structurally different. When you trigger it — by @mintlify-ing the bot in Slack, hitting Cmd+I in the dashboard, calling the agent API, or merging a PR that fires a Workflow — what runs on the other side is a headless OpenCode session driven by Opus 4.6, scoped to a fresh Daytona container that has the docs repo and any context repositories cloned in. The sandbox is the unit of work.

This decision is more load-bearing than it sounds. The Mintlify team is explicit about the reasoning: pointing a stateless model at a codebase produces, in their phrase, “chaos with a byline”. The agent needs a real environment to read code, plan changes, and edit files safely — not an API call decorated with retrieved chunks. So they gave it one. A trigger lands on a job queue, a worker provisions the container, and the result of the run is reported back through GitHub commit checks and the Mintlify dashboard. Inside the container, the agent runs through a fixed pipeline: it pulls in relevant material across the docs and the connected code repos, drafts a multi-step plan if the work calls for one, applies edits while honoring the project’s writing standards, runs a local Mintlify CLI build to confirm the docs still compile, and opens a pull request — direct commits to main are not on the menu.

Two design choices inside that loop are worth pulling out.

Slack-first, not terminal-first. The Mintlify agent originally shipped only in Slack and via API, with the dashboard surface added later in December 2025. The team’s stated reason: opening a terminal triggers a “mentally draining switch” that opening Slack does not, and documentation work is exactly the kind of task people procrastinate on. By living where the relevant context already lives — the PR thread that explained the change, the customer Slack message that surfaced the gap — the trigger surface matches the source of the work.

Behavior-as-code through AGENTS.md. The agent reads a config file at .mintlify/AGENTS.md in your repo and appends its contents to its system prompt for every task it runs, whether the trigger comes from Slack, the dashboard, or the API. The path matters: Mintlify’s docs explicitly warn that placing the file at the project root exposes it as a public asset under /agents.md, whereas the .mintlify/ directory is not served on the docs site. What you put inside is style preferences, code standards, and project-specific terminology: the kind of guidance a senior reviewer would otherwise repeat fifty times a year. It is the same pattern as Anthropic’s CLAUDE.md or the AGENTS.md spec emerging across the agent tooling space, and it makes agent behavior version-controlled and reviewable.
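As a rough illustration of the append-to-system-prompt behavior, the sketch below reads .mintlify/AGENTS.md from a repo checkout and concatenates it onto a base prompt. The function name build_system_prompt and the section header it inserts are assumptions made for this example; Mintlify has not published how its agent assembles the prompt internally.

```python
from pathlib import Path


def build_system_prompt(base_prompt: str, repo_root: Path) -> str:
    """Append repo-level guidance to the base prompt if the file exists."""
    guidance_file = repo_root / ".mintlify" / "AGENTS.md"
    if not guidance_file.is_file():
        return base_prompt  # no project guidance: use the base prompt unchanged
    guidance = guidance_file.read_text(encoding="utf-8").strip()
    if not guidance:
        return base_prompt
    # The header text below is a hypothetical separator, not a Mintlify convention.
    return f"{base_prompt}\n\n# Project guidance (from .mintlify/AGENTS.md)\n{guidance}"
```

Because the file lives in the repo, a change to agent behavior shows up as a diff and goes through the same review as any other change.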

The most interesting trigger surface is Workflows, where the YAML config gets explicit. A workflow file lives in your repo. The schema looks roughly like this:

---
name: 'Update API reference on backend changes'
on:
  push:
    - repo: 'your-org/backend'
      branch: main
context:
  - repo: 'your-org/docs'
  - repo: 'your-org/openapi-specs'
automerge: false
---

When the backend repo merges a PR, scan the diff for changes to public API
endpoints, request/response schemas, or authentication behavior. Update the
matching API reference pages and code examples. Skip internal refactors.

The structure is a trigger (cron job or push event), a list of context repos to clone in, an automerge flag, and natural-language instructions in markdown. When the trigger fires, the agent evaluates the conditions, runs the task, and either commits directly or opens a PR, depending on configuration. Documentation maintenance becomes a downstream event of shipping, not a separate task someone has to remember.
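The two halves of a workflow file (config between the --- lines, instructions after) can be separated with a few lines of Python. This is a sketch under stated assumptions: split_workflow and push_matches are hypothetical helpers, the frontmatter is returned as raw text rather than parsed, and the trigger list is written by hand as if it had been parsed from the on.push section shown above.

```python
def split_workflow(text: str) -> tuple[str, str]:
    """Split a workflow file into (frontmatter, instructions)."""
    lines = text.strip().splitlines()
    stripped = [line.strip() for line in lines]
    if not stripped or stripped[0] != "---":
        raise ValueError("workflow file must start with a '---' line")
    end = stripped.index("---", 1)  # raises ValueError if the closing '---' is missing
    frontmatter = "\n".join(lines[1:end])
    instructions = "\n".join(lines[end + 1:]).strip()
    return frontmatter, instructions


def push_matches(push_triggers: list[dict], event: dict) -> bool:
    """True if a push event matches any configured repo/branch pair."""
    return any(
        t["repo"] == event["repo"] and t["branch"] == event["branch"]
        for t in push_triggers
    )


example = """---
name: 'Update API reference on backend changes'
automerge: false
---

Update the matching API reference pages. Skip internal refactors.
"""

frontmatter, instructions = split_workflow(example)
triggers = [{"repo": "your-org/backend", "branch": "main"}]  # as if parsed from on.push
print(push_matches(triggers, {"repo": "your-org/backend", "branch": "main"}))  # True
```

Everything after the frontmatter is passed along as plain natural-language instructions, which is what keeps the file readable in code review.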

The whole arrangement maps onto a pattern emerging across serious agent products: give the AI a sandbox, version-control the instructions, keep humans in the review loop, and let the model do the actual work inside well-defined guardrails. The reviewer-on-PRs analogy is doing real work here. The agent is treated like a junior contributor with full repo access — capable, but reviewed.

Layer 2 — The read assistant: when a real sandbox is the wrong answer

If the write agent shows what it looks like to spend latency to gain capability, the read assistant shows the opposite trade-off — and it is the more architecturally surprising of the two.

The read assistant is the chat widget your readers use on a Mintlify-powered docs site. It now serves over thirty thousand conversations a day across hundreds of thousands of users. The natural design — and the one Mintlify started with — was the same shape that powers the write agent: spin up a sandbox, clone the docs repo, let the model run real grep, cat, ls, and find against the filesystem.

That design hit two walls. First, latency: p90 session boot time, including the GitHub clone and other setup, came in around 46 seconds — fine for an async write task where someone fires a Slack message and walks to get coffee, fatal for a reader staring at a loading spinner on a docs page. Second, cost. At nearly a million conversations a month, even a minimal sandbox setup at 1 vCPU, 2 GiB RAM, and a five-minute lifetime would have run north of $70,000 a year on Daytona’s per-second pricing, with longer sessions doubling the bill.
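The per-session cost implied by those figures can be backed out with simple arithmetic. Both inputs below are the floor values quoted in the text ("over thirty thousand" conversations a day, "north of $70,000" a year), so treat the result as a rough order of magnitude rather than Daytona's actual price.

```python
conversations_per_day = 30_000    # "over thirty thousand conversations a day"
annual_sandbox_cost_usd = 70_000  # "north of $70,000 a year"

sessions_per_year = conversations_per_day * 365
cost_per_session_usd = annual_sandbox_cost_usd / sessions_per_year

print(f"{sessions_per_year:,} sessions/year")             # 10,950,000 sessions/year
print(f"~${cost_per_session_usd:.4f} per session")        # ~$0.0064 per session
```

A cost well under one cent per session sounds negligible in isolation; it is the multiplication by roughly eleven million sessions a year that makes it a line item.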

So the team built ChromaFs — a virtual filesystem that gives the agent the illusion of a real shell, layered over the Chroma database that already stored the docs as embedded chunks. Session creation collapsed from tens of seconds to roughly 100 milliseconds, and because ChromaFs reuses infrastructure they were already paying for, the marginal compute cost per conversation dropped to zero. The implementation runs on top of just-bash, a TypeScript reimplementation of bash from Vercel Labs that exposes a pluggable IFileSystem interface. just-bash parses commands, pipes, and flags; ChromaFs translates each underlying filesystem call into a Chroma query.

The mechanics are worth dwelling on, because they reveal how thoughtful harness design beats brute-force sandboxing.

The directory tree is bootstrapped from a single gzipped JSON document called __path_tree__ stored inside the Chroma collection. On startup, the server fetches and decompresses it into two in-memory structures — a set of file paths and a map from directories to their children. After that, ls, cd, and find resolve in local memory with zero network calls, and the tree is cached so subsequent sessions for the same site skip the fetch entirely. Per-user access control happens at tree-build time: ChromaFs prunes paths the user can’t see and applies a matching filter to all subsequent Chroma queries, with the result that pruned paths cannot even be referenced by the agent. Reading a page is a chunk-reassembly operation — cat /auth/oauth.mdx fetches all chunks with the matching slug, sorts them by chunk_index, and joins them into the full page. Writes throw EROFS, making the system stateless by construction.
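A minimal sketch of that bootstrapping, assuming the gzipped document is simply a JSON list of file paths and that access control is a set of allowed path prefixes. Both are assumptions for illustration: the real __path_tree__ layout and ChromaFs's permission model are not described in that detail, and the chunk dictionaries here stand in for records in the Chroma collection.

```python
import gzip
import json


def load_tree(blob: bytes, allowed_prefixes: tuple[str, ...]):
    """Decompress the path list and build the two in-memory structures."""
    paths = json.loads(gzip.decompress(blob))
    # Prune at build time: paths the user cannot see never enter the tree.
    files = {p for p in paths if p.startswith(allowed_prefixes)}
    children: dict[str, set[str]] = {}
    for path in files:
        parts = path.strip("/").split("/")
        for depth in range(len(parts)):
            parent = "/" + "/".join(parts[:depth])
            children.setdefault(parent, set()).add(parts[depth])
    return files, children


def ls(children: dict[str, set[str]], directory: str) -> list[str]:
    """Resolve a directory listing from memory, with no database call."""
    return sorted(children.get(directory.rstrip("/") or "/", ()))


def cat(chunks: list[dict], files: set[str], path: str) -> str:
    """Reassemble a page from its chunks, ordered by chunk_index."""
    if path not in files:
        raise FileNotFoundError(path)
    parts = sorted((c for c in chunks if c["slug"] == path), key=lambda c: c["chunk_index"])
    return "".join(c["text"] for c in parts)


blob = gzip.compress(json.dumps(
    ["/auth/oauth.mdx", "/auth/keys.mdx", "/internal/roadmap.mdx"]
).encode())
files, children = load_tree(blob, allowed_prefixes=("/auth/",))
chunks = [
    {"slug": "/auth/oauth.mdx", "chunk_index": 1, "text": "Step two.\n"},
    {"slug": "/auth/oauth.mdx", "chunk_index": 0, "text": "# OAuth\n"},
]

print(ls(children, "/auth"))  # ['keys.mdx', 'oauth.mdx']
print(cat(chunks, files, "/auth/oauth.mdx"), end="")
```

Note that the pruned /internal path is absent from both structures, so the agent has no name by which to ask for it.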

The cleverest piece is grep. A naive recursive grep over a virtual filesystem would be painfully slow, because every file would round-trip to the database. ChromaFs intercepts the grep call, parses flags with yargs-parser, and translates them into a Chroma query ($contains for fixed strings, $regex for patterns) that acts as a coarse filter to identify which files might contain a hit. The matched chunks are bulk-prefetched into a Redis cache, and the rewritten grep is handed back to just-bash for in-memory fine filtering. Large recursive queries finish in milliseconds.
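The coarse-then-fine shape of that grep can be sketched in plain Python. In this illustration an in-memory list of chunk dictionaries stands in for the database and the cache, and coarse_filter plays the role of the $contains or $regex query; the function names and record fields are assumptions, not ChromaFs internals.

```python
import re


def coarse_filter(chunks: list[dict], pattern: str, fixed_string: bool) -> set[str]:
    """Stand-in for the database query: slugs of files that might contain a hit."""
    if fixed_string:
        hit = lambda text: pattern in text  # plays the role of $contains
    else:
        regex = re.compile(pattern)
        hit = lambda text: regex.search(text) is not None  # plays the role of $regex
    return {c["slug"] for c in chunks if hit(c["text"])}


def grep(chunks: list[dict], pattern: str, fixed_string: bool = False):
    """Two-stage grep: narrow to candidate files, then match line by line in memory."""
    candidates = coarse_filter(chunks, pattern, fixed_string)
    regex = re.compile(re.escape(pattern) if fixed_string else pattern)
    results = []
    for slug in sorted(candidates):
        parts = sorted((c for c in chunks if c["slug"] == slug), key=lambda c: c["chunk_index"])
        page = "".join(c["text"] for c in parts)
        for lineno, line in enumerate(page.splitlines(), start=1):
            if regex.search(line):
                results.append((slug, lineno, line))
    return results


chunks = [
    {"slug": "/auth/oauth.mdx", "chunk_index": 0, "text": "# OAuth\nSet the client_id.\n"},
    {"slug": "/auth/oauth.mdx", "chunk_index": 1, "text": "Then request a token.\n"},
    {"slug": "/guides/intro.mdx", "chunk_index": 0, "text": "Welcome to the docs.\n"},
]

print(grep(chunks, "client_id", fixed_string=True))
# [('/auth/oauth.mdx', 2, 'Set the client_id.')]
```

The coarse stage only has to avoid false negatives; any false positives it lets through are discarded by the line-level pass.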

Sitting beneath ChromaFs in the read path is Trieve, the RAG infrastructure company Mintlify acquired in July 2025. Trieve had been Mintlify’s search backbone since before the team finished its Y Combinator batch, and the acquisition brought retrieval ownership in-house at a moment when the assistant was already serving more than 23 million queries a month. Trieve’s stack — dense vector search, re-ranker models, sub-sentence highlighting, and date recency biasing on a single endpoint — does the heavy lifting underneath ChromaFs’s UNIX-style interface. Trieve also moved to an MIT license as part of the acquisition, so the same retrieval kernel is inspectable on GitHub.

The pattern in the read assistant is the part most teams underweight. Mintlify’s team observed that agents are converging on filesystems as their primary interface, because grep, cat, ls, and find are sufficient primitives for an agent to reason over arbitrary structured content. Most builders take that observation and reach for a real sandbox. Mintlify took the same observation and asked whether the interface could be virtualized while keeping the primitives real. For their workload, the answer was yes — and the cost curve in their post (sandbox cost grows linearly with conversation duration; ChromaFs stays flat) is a clean argument for why.

Layer 3 — The public surface: content negotiation as the unification trick

The third layer is the cheapest to describe and the easiest to overlook.

Every Mintlify-hosted docs site automatically generates a set of agent-readable artifacts at the root: llms.txt, llms-full.txt, and skill.md. The first two are an emerging convention for telling LLMs what content lives on a site and giving them a parseable bulk dump. The third is more interesting. As Mintlify describes it, skill.md is the action-layer manifest — it enumerates not just what the documentation contains but what an agent can actually invoke against the product, with required inputs and operating constraints attached to each capability. It is, in other words, the difference between an agent that can find information and an agent that can take action. Mintlify also exposes the /.well-known/agent-skills and /.well-known/skills paths — so any agent that knows the convention can find capabilities without hard-coded paths.

The unification trick that ties everything together is content negotiation. The same URL serves rich HTML to browsers and clean Markdown to agents — appending .md to any page URL returns a Markdown view of the same content, with no separate agent-facing site to maintain. This avoids the failure mode where teams maintain a “human site” and a separate “AI site” that drift out of sync; there is only one content store, with two rendering targets selected by the request.
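The one-store, two-renderers idea can be shown in a few lines. This is a toy sketch: the PAGES dictionary, the render function, and the HTML wrapper are hypothetical, and a production server would render Markdown to real HTML rather than escaping it into a pre block.

```python
import html

# Single content store: one Markdown source per page (illustrative data).
PAGES = {"/auth/oauth": "# OAuth\n\nSet the client_id."}


def render(path: str) -> tuple[str, str]:
    """Return (content_type, body) for a request path."""
    wants_markdown = path.endswith(".md")
    slug = path[:-3] if wants_markdown else path
    source = PAGES[slug]  # a KeyError here would map to a 404 in a real server
    if wants_markdown:
        return "text/markdown", source
    return "text/html", f"<main><pre>{html.escape(source)}</pre></main>"


print(render("/auth/oauth.md")[0])  # text/markdown
print(render("/auth/oauth")[0])     # text/html
```

Both branches read the same source string, which is the property that rules out drift between the human and agent views.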

Finally, every Mintlify site auto-hosts an MCP server, which lets coding agents like Cursor, Claude Code, and Windsurf query current documentation while a task is running. Authentication is supported when the docs site itself is gated — the MCP server respects whatever auth protocol the docs already use. The architectural significance is that retrieval is no longer something only the docs site itself can do. Every external agent that supports MCP gets a structured handle into your corpus, on the same terms as Mintlify’s own assistant.

What the architecture teaches

A few patterns are general enough to lift out of Mintlify’s specific case and apply elsewhere.

First, the sandbox is the unit of work for write tasks, but the wrong unit for read tasks. Most builders default to one or the other. Mintlify’s own bill clarifies the trade-off: a sandbox that boots in tens of seconds and costs a fraction of a cent per session is fine for asynchronous PR drafting, and ruinous for a chat widget. If you’re building both surfaces, expect to want both harnesses.

Second, version-controlled, natural-language instructions are the right encoding for agent behavior. Workflows YAML and AGENTS.md are the same idea applied at different scopes — one configures a recurring task, the other configures the agent globally. Both live in the repo, both go through code review, both evolve with the project. This is what “config as code” looks like when the configured component is a model.

Third, virtualizing the agent’s interface, not its environment, is often the better move. ChromaFs is the cleanest example: a real grep, a real ls, a real cat — but resolved against a database, not a disk. The agent doesn’t need a sandbox, it needs the sandbox’s API. Once you internalize that, a lot of “we need a Daytona for this” becomes “we need an IFileSystem shim for this,” with two orders of magnitude less infrastructure.

Fourth, content negotiation is the right unification primitive when you’re serving humans and agents from the same corpus. Maintaining parallel “human docs” and “AI docs” is how you guarantee they drift. Same URL, different format, selected by the request — and the cost of supporting the agent surface drops to near-zero.

Finally, harnesses are not edge cases, they’re the product. If you remove ChromaFs from the read assistant, the bill blows up. If you remove the sandbox boundary from the write agent, you stop being able to safely run on customer codebases. If you remove the auto-generated llms.txt and MCP server, the 45.3% of agent traffic loses its grip on the corpus. The model is doing model work in the middle, but everything around it — the sandbox, the virtual filesystem, the YAML triggers, the public surface — is what makes the product trustworthy and economical.

What to do with this

Three concrete moves for practitioners building anything adjacent to this space.

If you operate a documentation site, run it through Mintlify’s free Agent Score tool, which checks twenty-nine signals of agent-readability and tells you where the gaps are. The data is right there: across Mintlify-powered sites, close to half of traffic is agents you cannot see, and most teams are still building only for browsers. If you’d rather audit on your own, start by checking whether curl -L https://yourdocs.com/some-page.md returns clean Markdown or a 404. That one HTTP request tells you whether you’re on the agent map at all.

If you’re building any agent that needs to read or modify a code repository, start with the harness, not the prompt. Decide your latency budget before you decide your model. If the answer is “tens of seconds and the agent edits files,” the Mintlify write agent — sandbox, headless OpenCode, version-controlled config — is your reference. If the answer is “milliseconds and the agent only reads,” the ChromaFs pattern (virtualize the interface, not the environment) is your reference.

And if you’re shipping a product that other agents will need to understand — an API, an SDK, a developer tool — treat your documentation as a programmatic interface that happens to also be human-readable. Auto-generate llms.txt and skill.md, expose an MCP server, serve clean Markdown via content negotiation. The asymmetric world Mintlify is betting on already exists. The teams whose docs are agent-readable get evaluated. The teams whose docs aren’t get skipped.

How Vertical Agents Self-Improve in Production

The AI Runtime — Sat, 02 May 2026 11:03:55 GMT

TL;DR - In regulated verticals (healthcare, legal, insurance, finance), the most reliable way to make a deployed agent better is not a new model. It is a closed loop that turns production failures into harness updates: prompts, tools, sub-agents, memory files, judge rubrics, routing logic. Harvey ran this loop on twelve legal tasks and moved average success from 40.8% to 87.7% with model weights frozen, with complaint drafting going from 2% to 98% rubric coverage. Hippocratic AI reports (vendor-published) that clinical accuracy improved from ~80% pre-Polaris to 99.38% in Polaris 3.0 after feeding ~1.85M real patient calls and 307K clinician-reviewed test calls back into the system. Anterior (vendor-published) puts a reference-free LLM-as-judge in front of every prior auth decision, routes only the low-confidence ones to under ten clinicians, and reports 96% F1 at over 100K decisions/day. Microsoft’s Azure SRE Agent moved its Intent-Met score from 45% to 75% on novel incidents by letting the agent investigate its own bugs and submit PRs against its own codebase. The shared pattern is the same six nodes everywhere: trace → judge → cluster → mutate harness → gate → deploy. If you cannot run that loop, you are shipping a frozen artifact in a moving market. Start by instrumenting traces and writing one rubric. The judge and the mutation loop come after.

The frozen-agent problem

A vertical agent that ships at 90% accuracy and stays there is not a 90% accurate system. It is a 90% accurate system at the moment of deployment, decaying.

The decay has three sources. Distribution drift: real patients ramble, real lawyers redline contracts in non-canonical ways, real claims arrive with new denial codes. Policy drift: CMS coverage determinations change, EU AI Act provisions phase in on staggered enforcement timelines, insurer rulesets get rewritten quarterly. Long-tail surface area: the failure modes you didn’t see in eval are the ones production discovers, one in ten thousand at a time. At 100K medical decisions per day, a one-in-ten-thousand subtle hallucination — “suspicious for multiple sclerosis” when the patient has a confirmed MS diagnosis — fires ten times daily.

In low-stakes consumer apps you can absorb that. In a vertical where the cost of a single error is a denied surgery, a missed disclosure schedule, or a regulatory finding, you cannot. So the question that defines vertical agent engineering in 2026 is not “which model do we use” — it is “how does this agent get better next week than it is today, without a new base model release, and with the audit trail a regulator will demand.”

The answer that has emerged across legal, healthcare, insurance, and incident response is the same architecture, sometimes given different names. Anthropic’s engineering team and Viv Trivedy refer to it as harness engineering. Microsoft frames it as the agent investigating itself. NVIDIA borrows MAPE-K from autonomic computing and calls it a data flywheel. LangChain calls it the agent improvement loop powered by traces. The mechanics are the same.

The shape of the loop

Six nodes. Every component carries weight; every break in the chain causes silent degradation.

Production traces are the substrate. Without per-step tool calls, model inputs, model outputs, latency, token counts, and final outcomes, none of the downstream work is possible. LangChain’s formulation is the cleanest: traces come from staging environments, benchmark runs, local development, and especially from production, and they are the input to every subsequent step. The trace store doubles as the audit trail regulators ask for.

Evaluation and judging is where most teams over-rely on offline benchmarks. The shift in 2025–26 has been toward online evaluators that score every production trace — typically an LLM-as-judge augmented with deterministic checks (schema validation, citation existence, tool-call shape) and routed human review on a configurable sample. Anterior’s framing is sharper than most: their judge is reference-free, scoring outputs against guidelines and clinical reasoning rather than a held-out ground truth, because the volume — over 100K decisions a day — makes ground truth impossible to maintain.

Failure clustering is where the leverage is. A pile of low-scored traces is not actionable. Grouping them by failure pattern — “agent missed exhibit B in 30% of due diligence runs,” “agent emits ‘suspicious for X’ on confirmed-X patients,” “agent hits LLM 429s during streaming” — turns symptoms into hypotheses. LangChain runs parallel error-analysis subagents and synthesizes their findings into harness change proposals. Microsoft’s SRE Agent runs a daily monitoring task that searches the last 24 hours of errors, clusters the top hitters, traces each to its root cause, and submits a PR.

Harness mutation is the change itself. We will spend a section on the levers that actually move; for now: most of these changes never touch model weights. They edit the system prompt, add a skill or sub-agent, modify a tool definition, append to a memory file, tighten a routing threshold, or rewrite the judge’s rubric.

Validation gate is what keeps the hill-climbing safe. Every proposed harness change runs against a frozen eval set before it ships, and any regression (even on a task the change was not targeting) blocks the merge. Harvey runs this against twelve internal benchmark tasks per iteration; LangChain marks proposed changes that overfit as discarded runs in their iteration log. Without the gate, the loop generates regressions as fast as it generates improvements.

Deploy then closes the cycle. The new harness produces new traces; new traces feed new judges; new clusters drive new mutations. The model is the one piece of this picture that does not change between weekly cycles.
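Two of the six nodes, failure clustering and the validation gate, are simple enough to sketch directly. The Python below is illustrative only: the trace fields (score, failure_tag), the task names, and every number are made up for the example, and real systems cluster on richer signals than a single tag.

```python
from collections import Counter


def cluster_failures(traces: list[dict], threshold: float = 0.5):
    """Count low-scoring traces by failure tag, most common first."""
    low = [t for t in traces if t["score"] < threshold]
    return Counter(t["failure_tag"] for t in low).most_common()


def passes_gate(baseline: dict[str, float], candidate: dict[str, float], tolerance: float = 0.0):
    """Accept a harness change only if no task in the frozen eval set regresses."""
    regressions = {
        task: (baseline[task], candidate.get(task, 0.0))
        for task in baseline
        if candidate.get(task, 0.0) < baseline[task] - tolerance
    }
    return not regressions, regressions


traces = [
    {"score": 0.2, "failure_tag": "missed_exhibit"},
    {"score": 0.9, "failure_tag": None},
    {"score": 0.1, "failure_tag": "missed_exhibit"},
    {"score": 0.3, "failure_tag": "rate_limited"},
]
print(cluster_failures(traces))  # [('missed_exhibit', 2), ('rate_limited', 1)]

baseline = {"task_a": 0.61, "task_b": 0.02}
candidate = {"task_a": 0.58, "task_b": 0.40}  # big win on task_b, small loss on task_a
ok, regressions = passes_gate(baseline, candidate)
print(ok, regressions)  # False {'task_a': (0.61, 0.58)}
```

The gate rejects the candidate even though its average score is far higher, which is exactly the behavior that stops the loop from trading one task for another.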

The non-obvious property of this loop is what compounds. As Anterior describes it, the loop creates a virtuous improvement cycle where the evaluator itself gets calibrated against human review, and confidence grades from that calibrated evaluator route which cases need humans next time. The judge improves. The clustering improves. The mutations get more targeted. The agent appears to learn — without a single weight changing.

Case 1: Harvey — autoresearch and the rubric ceiling

The cleanest published demonstration is Harvey’s recent autoresearch experiment, summarized externally by Artificial Lawyer. Niko Grupen, Head of Applied Research, ran twelve tasks from Harvey’s internal agent benchmark — commercial lease review, complaint drafting, tax memos, disclosure schedules, due diligence questionnaires — through a loop where an outer agent is allowed to edit the inner agent’s harness based on rubric-graded judge feedback.

The setup: each task ships with source documents, instructions, and a detailed grading rubric. After an attempt, an LLM judge scores against the rubric and produces written feedback on what the agent got right, what it missed, and where its reasoning was wrong. A coding agent reads the judge feedback, clusters the failures, forms a hypothesis about which harness components would help, edits or builds those components — skills, hooks, scripts, sub-agents, not model weights — and reruns.

The result: across all twelve tasks, average success rose from 40.8% to 87.7%. Five of the twelve started in the 2–7% range. After optimization, seven exceeded 90% and one hit 100%. The complaint drafting task is the most striking — it moved from 2% rubric coverage to 98% over a handful of iterations, producing a 164-paragraph complaint with a 33-exhibit list.

Two patterns from Grupen’s log are worth drawing out. First, the early iterations correct basic structural failures: wrong file types, missing deliverables, weak structure. Later iterations show domain-specific expertise emerging: cross-document issue spotting, risk classification, distinguishing genuinely problematic provisions from market-standard distractors. Second, the ceiling is the rubric. “When the rubric is high quality, the agent can hill-climb surprisingly far.” When it isn’t, the loop stalls.

This generalizes. The same auto-improvement pattern works in a generic coding domain: LangChain’s deepagents-cli moved from 52.8% to 66.5% on Terminal Bench 2.0 — a 13.7-point jump from harness changes alone, with the model fixed at GPT-5.2-Codex. The mechanism is the same trace analyzer skill, parallel error agents, and targeted prompt/tool/middleware changes per iteration.

The Harvey caveat is real and worth surfacing: this is a vendor-run experiment on twelve tasks; it does not yet generalize to all legal work, and it is bound by the quality of the rubrics Harvey wrote. But the directional finding — that harness-layer changes can deliver model-upgrade-sized improvements in a regulated domain — is now hard to dismiss.

Case 2: Hippocratic AI — clinicians as a learning signal at scale

Hippocratic AI’s Polaris is a different shape of the same loop, scaled to a 22-LLM constellation that handles over 10 million real patient calls and is backed by a network of 6,234 US-licensed clinicians who review production output.

The vendor-published trajectory across three model generations: pre-Polaris baseline ~80%, Polaris 1.0 at 96.79%, Polaris 2.0 at 98.75%, Polaris 3.0 at 99.38% clinical accuracy, validated under their Real-World Evaluation of Large Language Models in Healthcare framework. The framework leverages 6,234 US-licensed clinicians (5,969 nurses and 265 physicians) evaluating 307,038 unique calls through a three-tier review process: nurse review first, physician adjudication when needed, structured error categorization in between. Errors flagged at any tier feed back into the next iteration’s training and harness.

The subsystem-level numbers tell the more interesting story, because they show what specifically improved between Polaris 2.0 and 3.0 by listening to production:

Health Risk Assessment documentation accuracy: 90.5% → 98.5%
Explanation-of-Benefits policy quoting: 86.4% → 99.4%
Complex appointment scheduling error rate: 8% → 0.5%
Background-noise speech recognition error rate: 9.3% → 2.3%
Clarification engine error rate (gracefully handling unclear patient speech): 16.3% → 2.0%

These aren’t random improvements. They’re the long-tail issues that surfaced once 1.85M patient calls had run through Polaris 1.0 and 2.0 and clinicians had flagged categorical failure modes. Speech recognition fails in noisy environments → train a dedicated background-noise engine. Patients answer HRAs in rambling, context-shifting ways → ship a “deep thinking” model that triple-checks documentation. Policy quotes occasionally drift from source documents → tighten the harness around source attribution.

The honest framing: these are vendor-self-published numbers, and there is no independent third party validating Hippocratic AI’s safety scores. What is independently verifiable is the architecture of the feedback loop — clinician review network, structured error categorization, real-world evidence accumulation across versions — which is now described in the underlying RWE-LLM paper on medRxiv and is replicable by anyone willing to invest in a comparable review apparatus.

Case 3: Anterior — judge first, route smartly, validate the validator

Anterior runs the same loop in healthcare prior authorization, but with three design choices that are worth studying separately because they generalize beyond healthcare.

First, reference-free real-time evaluation. Anterior’s primary system makes a coverage determination by reasoning across unstructured clinical documentation, payer rulesets, and clinical guidelines. A second LLM-as-judge then evaluates the determination against those same guidelines — without needing a held-out ground truth — and produces a confidence grade. Reference-free evaluation matters because at 100K+ decisions a day, no organization can maintain a labeled gold set that keeps up with policy drift.

Second, dynamic case prioritization. The confidence grade combines with contextual factors — procedure cost, bias risk, historical error rates for that procedure category — to decide which cases are sent to human clinicians for review. High-confidence cases auto-resolve; low-confidence and high-stakes cases route to a small clinical team. Anterior reports a team of fewer than ten clinical reviewers handling tens of thousands of cases, against a competitor reportedly employing 800+ nurses for comparable review volume. (Caveat: scope of work may differ. Take the comparison directionally.)
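The routing rule described here can be sketched in a few lines. This is not Anterior's implementation: the field names and every threshold below are hypothetical, and the bias-risk factor is left out because the source does not say how it is measured.

```python
from dataclasses import dataclass

@dataclass
class Determination:
    case_id: str
    judge_confidence: float      # 0.0-1.0 grade from the LLM judge
    procedure_cost_usd: float
    category_error_rate: float   # historical error rate for this procedure category

def route(
    d: Determination,
    min_confidence: float = 0.90,    # hypothetical threshold
    high_cost_usd: float = 10_000.0, # hypothetical threshold
    max_error_rate: float = 0.05,    # hypothetical threshold
) -> str:
    """Send low-confidence or high-stakes cases to humans; auto-resolve the rest."""
    low_confidence = d.judge_confidence < min_confidence
    high_stakes = (
        d.procedure_cost_usd >= high_cost_usd
        or d.category_error_rate > max_error_rate
    )
    return "human_review" if (low_confidence or high_stakes) else "auto_resolve"
```

The design choice worth noticing is that confidence alone never auto-resolves an expensive or historically error-prone case; stakes act as an independent trigger.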

The third move is the one most teams miss. Anterior runs alignment metrics between the LLM-judge and the human reviewers on cases that get both, and uses that data to validate — and continuously recalibrate — the judge itself. They call this “validating the validator.” It is the missing piece in most LLM-judge deployments. Without it, the judge can drift, and you only learn about it when the harness has been mutating against bad signal for weeks.
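The source does not say which alignment metric Anterior uses. One common choice for comparing two raters on the same cases is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch, assuming the judge and the human each emit one categorical label per case:

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Tracking this number on a rolling sample of doubly reviewed cases gives an early signal when the judge starts to drift, before the harness has spent weeks optimizing against it.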

Anterior’s vendor-reported numbers: 99.26% accuracy on automated approvals, against 86% baseline human accuracy, with 76% reduction in human review needed and 74% less time per escalated case. Cross-reference with Anterior’s own arXiv paper on fairness evaluation, which reports model error rates across 7,166 human-reviewed cases spanning 27 medical necessity guidelines. Independent validation remains an open need; the 96% F1 figure that has circulated comes from Anterior’s own talks, not a peer-reviewed audit.

The architectural lesson generalizes far past healthcare. Any vertical agent operating at scale where ground truth is expensive — fraud review, AML, KYC, contract triage, claims adjudication, security alert triage — can adopt the same three-part move: reference-free judge in line, dynamic routing on confidence and stakes, alignment metrics that validate the judge against the humans that exist.

Case 4: Azure SRE Agent — when the agent debugs itself

Microsoft’s Azure Site Reliability Engineering Agent handles tens of thousands of incidents weekly for internal Microsoft services and external teams. The team published a remarkably honest engineering retrospective in March 2026 about how they closed their improvement loop.

The starting point: incident resolution rates were climbing toward 50% on highly instrumented scenarios, but the high-performing scenarios all shared a trait. They had been built with heavy human scaffolding: custom response plans, hand-built sub-agents for known failure modes, pre-written log queries exposed as opaque tools. On any new incident class, the agent had nowhere to start. Engineers were reading 50 lower-scored threads a week against an agent handling 10,000, which meant debugging at human speed.

The inversion they made: stop pre-computing the answer space. Instead, give the agent a filesystem as its world (source code, runbooks, query schemas, past investigation notes — all files; no SearchCodebase API), context hooks that orient it on what it can access, and frugal context management that keeps long investigations sharp. Three architectural bets, in their words. The result: Intent-Met score on novel incidents — whether the agent’s investigation actually addressed the root cause as judged by the on-call engineer — rose from 45% to 75%.

The closing move is the one to study. They set up a daily monitoring task: the agent searches the last 24 hours for LLM errors — timeouts, 429s, mid-stream failures, malformed payloads — clusters the top hitters, traces each to its root cause in its own codebase, and submits a PR. Engineers review before merging. Over two weeks, errors dropped by more than 80%.
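The "cluster the top hitters" step can be as simple as bucketing error messages and counting. A minimal sketch: the category names and regex patterns are assumptions made for illustration, not Microsoft's actual rules, and the sample messages are invented.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical patterns for the error classes named in the text.
CATEGORIES = [
    ("rate_limited", re.compile(r"\b429\b|rate limit", re.I)),
    ("timeout", re.compile(r"timed? ?out", re.I)),
    ("mid_stream_failure", re.compile(r"stream (closed|interrupted|reset)", re.I)),
    ("malformed_payload", re.compile(r"malformed|invalid json", re.I)),
]

def categorize(message: str) -> str:
    """Return the first matching category, or 'other'."""
    for name, pattern in CATEGORIES:
        if pattern.search(message):
            return name
    return "other"

def top_hitters(messages: Iterable[str], k: int = 3) -> List[Tuple[str, int]]:
    """Count errors per category and return the k most frequent."""
    return Counter(categorize(m) for m in messages).most_common(k)

sample = [
    "HTTP 429 Too Many Requests",
    "request timed out after 30s",
    "HTTP 429 rate limit exceeded",
    "stream reset by peer",
    "invalid JSON in tool payload",
    "HTTP 429",
]
print(top_hitters(sample, k=1))
```

Tracing each cluster to a root cause and opening a PR is the part that needs the agent; the counting is deliberately boring.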

The agent, in other words, became its own debugger. The harness that runs the SRE agent is now updated by the SRE agent itself, gated by human PR review. The team’s framing is the title of their post: “The agent that investigates itself.” It is not a metaphor.

What actually changes (the levers)

The most under-appreciated property of these loops is what they mutate. Across every case study above, the changes that produced the gains were:

The system prompt and task instructions. ILWS, the “Instruction-Level Weight Shaping” framework, formalizes this: a session-level reflection engine proposes a structured edit to the system prompt — a knowledge delta — that is gated, accepted only if a sliding-window quality rating improves with statistical significance, and rolled back otherwise. Most production teams do this informally. Formalizing it gives you reversibility under governance, which regulators ask for.
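A gate of this kind can be sketched with a one-sided permutation test over two windows of quality ratings. This is an illustration of the accept-or-roll-back idea, not the statistical test the ILWS authors use; the function names, window contents, and the 0.05 threshold are assumptions.

```python
import random
from typing import Sequence

def _mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)

def one_sided_p_value(before: Sequence[float], after: Sequence[float],
                      n_resamples: int = 2000, seed: int = 0) -> float:
    """Permutation p-value for the hypothesis mean(after) > mean(before)."""
    rng = random.Random(seed)
    observed = _mean(after) - _mean(before)
    pooled = list(before) + list(after)
    k = len(before)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if _mean(pooled[k:]) - _mean(pooled[:k]) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

def gate_prompt_edit(current_prompt: str, candidate_prompt: str,
                     before: Sequence[float], after: Sequence[float],
                     alpha: float = 0.05) -> str:
    """Keep the candidate only if ratings improved significantly; else roll back."""
    improved = _mean(after) > _mean(before)
    if improved and one_sided_p_value(before, after) < alpha:
        return candidate_prompt
    return current_prompt
```

Because the function returns the prompt to keep rather than mutating state, rollback is just "nothing changed", which is the reversibility property the paragraph above describes.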

Tool definitions and skills. LangChain’s improvement was largely middleware: a LocalContextMiddleware that maps the working directory and onboards the agent into its environment, a LoopDetectionMiddleware that intercepts repeated edits to the same file and forces a plan reconsideration, a PreCompletionChecklistMiddleware that blocks the agent from exiting before it runs a verification pass. None of these are model changes. All are tool-and-hook surface.
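To make the loop-detection idea concrete, here is a standalone sketch. It does not use LangChain's middleware API; the class name, the threshold, and the nudge message are all hypothetical. The idea is only that a hook counts edits per file and returns a message to inject into the agent's context once a limit is crossed.

```python
from collections import Counter
from typing import Optional

class EditLoopDetector:
    """Counts edits per file and asks the agent to replan after too many."""

    def __init__(self, max_edits_per_file: int = 3) -> None:
        self.max_edits_per_file = max_edits_per_file
        self._edits: Counter = Counter()

    def record_edit(self, path: str) -> Optional[str]:
        """Call on every file-edit tool call; returns a nudge or None."""
        self._edits[path] += 1
        count = self._edits[path]
        if count > self.max_edits_per_file:
            return (
                f"You have edited {path} {count} times. "
                "Stop and reconsider the plan before editing it again."
            )
        return None
```

A harness would call `record_edit` from whatever hook wraps its edit tool and append any returned message to the next model turn.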

Memory and knowledge files. Microsoft replaced their RAG-over-past-sessions memory with structured Markdown files the agent reads and writes through its standard tool interface: overview.md, team.md, logs.md, debugging.md. The model navigates memory by following links, not by retrieving via embedding similarity. This is the “repo is the schema” insight. Memory becomes a writable artifact that future runs read.
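Link-following navigation is easy to picture as a graph walk over Markdown files. A minimal sketch, assuming relative `[text](file.md)` links and a flat directory; the file contents below are invented for the demo and the function name is hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Matches relative Markdown links that point at .md files.
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+\.md)\)")

def reachable_memory_files(root: Path, start: str = "overview.md") -> List[str]:
    """Walk Markdown links from the start file; return files in visit order."""
    seen: List[str] = []
    stack = [start]
    while stack:
        name = stack.pop()
        path = root / name
        if name in seen or not path.is_file():
            continue
        seen.append(name)
        stack.extend(LINK.findall(path.read_text(encoding="utf-8")))
    return seen

# Demo with made-up contents; only the file names come from the article.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "overview.md").write_text(
        "See [team](team.md) and [debugging](debugging.md).", encoding="utf-8")
    (root / "team.md").write_text(
        "On-call notes live in [debugging](debugging.md).", encoding="utf-8")
    (root / "debugging.md").write_text(
        "Back to [overview](overview.md).", encoding="utf-8")
    (root / "logs.md").write_text("Not linked from anywhere.", encoding="utf-8")
    visited = reachable_memory_files(root)
```

In the demo, `logs.md` is never visited because nothing links to it, which illustrates a practical consequence of this design: memory that is not linked is effectively invisible to the agent.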

Sub-agents and routing. Anterior routes by confidence × stakes. Azure SRE spawns parallel sub-agents per hypothesis when a single context is at risk of getting polluted. Hippocratic uses a 21-model supervisory constellation around a primary conversational model. None of these compositions require retraining the underlying weights; they require designing the orchestration layer.

Judge rubrics. The Harvey ceiling is the rubric ceiling. The Anterior calibration is the judge alignment with humans. The fastest leverage in most teams’ first improvement loop is not a fancier judge — it is a better-written rubric and a small humans-vs-judge alignment dataset.

Fine-tuning the small models in the harness. Sometimes weights do change, but on the components, not the primary model. NVIDIA NeMo’s case study on an enterprise data flywheel: a routing model was moved from Llama 3.1 70B to a fine-tuned Llama 3.1 8B variant, which achieved 96% accuracy with a 10× model size reduction and 70% latency improvement. The query rephrasal model gained 3.7% accuracy with a 40% latency cut. The orchestrating LLM was untouched.

The pattern is consistent: when you map “improvements shipped” against “components that changed” across these case studies, the primary reasoning model is the least common thing that gets edited. The harness layer carries the weight.

Where these loops break

Six failure modes show up repeatedly. None are theoretical; each one has burned at least one of the case studies above.

Overfitting to recent failures. Aggregate harness changes against last week’s top errors and you regress on tasks the change wasn’t targeting. LangChain’s iteration log explicitly marks these as discarded runs. Without a frozen eval set that the validation gate runs every mutation against, you’ll fix Monday’s bug and silently break Tuesday’s working flow.
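A frozen-eval gate can be very small. The sketch below assumes each eval task reduces to pass or fail and that the baseline results were recorded before the mutation; the function name and the zero-regression default are illustrative choices, not taken from any of the case studies.

```python
from typing import Dict, List, Tuple

def regression_gate(
    baseline: Dict[str, bool],
    candidate: Dict[str, bool],
    max_regressions: int = 0,
) -> Tuple[bool, List[str]]:
    """Compare a harness mutation against the frozen eval set.

    Returns (ship_ok, regressed_task_ids). A task regresses when it passed
    under the baseline harness and fails under the candidate.
    """
    missing = baseline.keys() - candidate.keys()
    if missing:
        raise ValueError(f"candidate run is missing tasks: {sorted(missing)}")
    regressions = sorted(
        task for task, passed in baseline.items()
        if passed and not candidate[task]
    )
    return len(regressions) <= max_regressions, regressions
```

Returning the regressed task IDs, not just a boolean, is what makes a discarded run useful: it tells you which working flow the change would have broken.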

Reward hacking against the rubric. When the agent edits its own harness against an LLM judge’s scoring, the judge’s scoring is the optimization target — including any blind spots in the rubric. Harvey caveats this directly: the improvements track the rubric, and the rubric is human-authored and incomplete. Periodic out-of-distribution evals from a separate judge with a separate rubric catch this.

Judge drift and validator fragility. Anterior’s validate-the-validator move exists because LLM-judges drift, and the drift is silent. If the judge is the substrate for routing, clustering, and mutation decisions, judge drift propagates everywhere. Measuring judge-human alignment on a rolling sample of cases is the only known fix.

Memory staleness. Microsoft flagged this as their unsolved problem: when two sessions write conflicting patterns to debugging.md, the model has to reconcile them; when a service changes behavior, old memory entries become misleading. Timestamps and explicit deprecation help, but no production team has solved this systematically.

Privacy and regulatory constraints on production data. Healthcare and finance can’t freely route production traces into a learning loop the way a generic SaaS product can. The TikTok Pay ARIA paper handles this by having the agent self-identify uncertainty through structured self-dialogue and request targeted explanations from human experts at runtime, keeping learning at test time inside the regulatory boundary. Hippocratic uses synthetic test calls plus consented real-call evidence; Anterior keeps clinician review and AI determination in the same compliance perimeter.

Compounding errors when the validator itself fails. A bad judge calibrated against a small alignment set drifts. A bad alignment set lets the judge calibrate against itself. A bad clustering layer groups the wrong failures together. Each layer of the loop is a place errors can go undetected and propagate. The defense is treating every layer as an evaluable artifact — the judge has a precision/recall, the cluster labels have inter-rater agreement, the harness mutations have a regression budget.

The seventh failure mode, which is institutional rather than technical: nobody owns the loop. In every case study above, the loop is owned by a named team with a named lead — Grupen at Harvey, Mukherjee at Hippocratic, Mehta and team at Microsoft. Loops without owners decay quietly.

Build order

If you’re standing up a vertical agent and don’t yet have this loop, follow the build order below, because each step depends on the one before it. None of the steps require the next-generation model.

Start with traces. Every tool call, every model input, every model output, every latency, every outcome, with a stable trace ID per session. If you can’t reconstruct what happened, none of the rest of the loop works. LangSmith, Arize Phoenix, Braintrust, and OpenTelemetry-based stacks all do this; pick one and instrument every call path before anything else.
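Before picking a vendor, it helps to see how little a trace record needs to contain. A minimal stdlib-only sketch: the field names are assumptions, not any vendor's schema, and a real system would write to durable storage instead of an in-memory list.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class TraceEvent:
    trace_id: str      # stable per session, so events can be stitched together
    kind: str          # e.g. "model_call", "tool_call", "outcome"
    name: str
    input: str
    output: str
    latency_ms: float
    ts: float = field(default_factory=time.time)

class Session:
    """Collects every event of one agent session under a single trace ID."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.events: List[TraceEvent] = []

    def record(self, kind: str, name: str, input: str, output: str,
               latency_ms: float) -> TraceEvent:
        event = TraceEvent(self.trace_id, kind, name, input, output, latency_ms)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to append to a trace store."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)
```

If every call path goes through something like `record`, reconstructing a session is a filter on `trace_id`.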

Then write one rubric for one task. Not a benchmark suite. One task that matters, one rubric that an expert in your domain would sign off on. Score 50 production traces against it manually. The rubric you ship will be wrong in instructive ways; the act of writing and applying it surfaces the failure modes you didn’t know you had.

Add a judge against that rubric. Run it inline on a sample of production. Run it against the 50 you scored manually. Compute alignment. If alignment is below ~70%, the rubric is the problem, not the judge.
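The alignment check against the manually scored traces is plain arithmetic. A minimal sketch, assuming both the human and the judge produce a pass/fail label per trace; the 0.70 default mirrors the rough threshold above and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

def judge_alignment(
    human: Sequence[bool],
    judge: Sequence[bool],
    threshold: float = 0.70,
) -> Tuple[float, bool, List[int]]:
    """Return (agreement rate, meets threshold, indices where they disagree)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    rate = (len(human) - len(disagreements)) / len(human)
    return rate, rate >= threshold, disagreements
```

The disagreement indices matter more than the rate: reading those traces side by side with the rubric is usually how the ambiguous rubric lines get found.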

Add the clustering and mutation step last. Cluster the lowest-scored traces, propose one harness change, gate it against your offline eval, ship if it passes, measure the production effect. This is one cycle. Run it weekly.

The model upgrade question takes care of itself once the loop is running. When a better base model ships, you swap it in, rerun the validation gate, and observe whether your harness over-fits to the old model. (Different models reward different harnesses — Claude Opus 4.6 scored 59.6% with a harness tuned for GPT-5.2-Codex on Terminal Bench 2.0; the same Claude with its own harness moved several positions.) The harness tax of switching models is real, but it’s a calibration problem, not a foundational one.

The reason this matters now and not in twelve months is asymmetry. Vertical agent winners in 2026 will not be the teams with the best zero-shot model. They will be the teams whose deployed agents are quietly compounding skill every week the rest of the market sits frozen. The loop is the moat.

Build the trace store this week. Write the first rubric next week. The rest of it follows.

Felix Is a Harness, Not a Model: How Rogo Built an Agent for High Finance

The AI Runtime — Fri, 01 May 2026 11:03:46 GMT

TL;DR - Rogo serves more than 35,000 professionals at over 250 institutions — Rothschild & Co, Jefferies, Lazard, Moelis, Nomura — with an AI agent called Felix that bankers email like a junior analyst and get back finished decks, models, and memos. The interesting part is not the model. Rogo’s own product team calls Felix their “agent harness” — a vertical scaffolding designed to be model-agnostic across GPT 5.5, Claude Opus 4.7, and Gemini. Felix is the playbook for vertical AI: the moat is the harness, the evals, the data integrations, and the deployment model — not which frontier LLM is wired in this quarter. If you are building a vertical agent, study how Rogo decomposed the problem before you pick a model.

What Rogo Actually Sells

A precision note first: when people say “banking” in this conversation, they don’t mean retail or commercial banking. Rogo sits inside high finance — investment banking, private equity, hedge funds, equity research, asset management. Rogo’s own product page explicitly calls out its three audiences: Banking, Private Markets, Public Markets. The workflows are deal-shaped: pitchbooks, comps, models, memos, CIMs, diligence trackers.

Rogo was founded by Gabriel Stengel and John Willett — both ex-investment-bankers (Lazard, J.P. Morgan, Barclays) — with Tumas Rackaitis. That founder profile matters because the company’s edge is not the LLM; it is the granular, painful familiarity with what a 2 AM CIM revision actually looks like.

[Figure: Felix Architecture]

Yesterday’s $160M Series D, led by Kleiner Perkins with participation from Sequoia, Thrive, Khosla, and J.P. Morgan Growth Equity Partners, brings total funding past $300M. The capital is going toward two things that tell you what they actually believe: deeper data integrations and more forward-deployed bankers embedded inside client institutions.

Felix Is a Harness, Not a Model

The single most useful sentence Rogo has published this year shows up in their GPT 5.5 release note: “we’ve begun incorporating GPT 5.5 into our agent harness, Felix.” Read that twice.

Felix is not a fine-tuned model. Felix is the harness — the orchestration scaffold, tool layer, citation system, output formatters, audit trail, and policy controls — into which Rogo plugs whichever frontier model performs best on their internal benchmark this week. They are explicit that they are model-agnostic across OpenAI, Google, and Anthropic, and TAMradar’s coverage notes the platform supports GPT 5.5 and Anthropic Opus 4.7 concurrently.

This separation is load-bearing. In the Model Reliability Engineering (MRE) frame, the harness is one of the two reliability axes: the scaffolding you build around the model to make its behavior production-safe. Rogo’s product team uses the word the same way. The implication for builders: when frontier labs ship a 4% improvement on your domain, you swap the engine; when they ship a 40% improvement two years from now, your harness is what survives.
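What "swap the engine" means in code: the harness owns the prompt scaffold and the audit trail, and the model is a pluggable callable. This is an illustrative sketch, not Rogo's architecture; the class, the stub models, and the prompt text are all hypothetical.

```python
from typing import Callable, Dict, List

ModelFn = Callable[[str], str]  # any function that maps a prompt to text

class Harness:
    """Owns the scaffold and audit trail; the model is a swappable part."""

    def __init__(self, models: Dict[str, ModelFn], active: str) -> None:
        if active not in models:
            raise KeyError(active)
        self.models = models
        self.active = active
        self.audit_log: List[dict] = []

    def use(self, name: str) -> None:
        """Switch models; nothing else in the harness changes."""
        if name not in self.models:
            raise KeyError(name)
        self.active = name

    def run(self, task: str) -> str:
        prompt = f"Task: {task}\nCite a source for every figure."
        output = self.models[self.active](prompt)
        self.audit_log.append({"model": self.active, "task": task, "output": output})
        return output

# Stub models standing in for real provider clients.
def stub_model_a(prompt: str) -> str:
    return "[A] " + prompt.splitlines()[0]

def stub_model_b(prompt: str) -> str:
    return "[B] " + prompt.splitlines()[0]
```

In a real system each `ModelFn` would wrap a provider SDK call, and the decision of which one is active would come from the internal benchmark rather than from code.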

Here is the rough shape of what’s inside Felix:

Detail belongs in the prose, not the diagram. Three components below carry the real weight.

The Email Interface Is the Real Interface

The product surface that ships with Felix is unusual: bankers send Felix an email the same way they would a colleague, get an acknowledgment in under a minute with an ETA, and receive PowerPoint, Excel, Word, and PDF deliverables back when ready. Iteration happens by replying to the email thread.

This is not a UX gimmick. It tells you something about how the team thinks about adoption. Investment bankers already live in Outlook. Asking them to adopt a new interface is a tax. Email-as-API removes the tax. It also imposes async semantics on the agent: a long-running task with intermediate status, observable state via the inbox, and a clean handoff back to the human reviewer. The harness has to absorb that asynchrony — request queuing, intermediate progress, partial results, source attribution surviving the round-trip — without leaking it back to the user.

The output substrate matters too. Felix returns work in Excel, PowerPoint, and Word formatted in the firm’s own templates and house style. A pitchbook that doesn’t match house formatting is not 90% done; it is 0% done. Vertical AI rises or falls on output substrate fidelity.

The Big Finance Benchmark: Vertical Evals Are the Moat

Rogo curates an internal evaluation set called the Big Finance Benchmark — real financial tasks designed by their ex-finance team. Tasks include valuing companies, benchmarking peers on specific metrics, and building theses across disparate documents. They are explicit that these come from real workflows, not synthetic prompts.

This is the unsexy infrastructure that compounds. When OpenAI ships GPT 5.6 next quarter, Rogo will know within a day whether it improves CIM drafting on real deals or just MMLU. That is the kind of judgment a horizontal benchmark cannot give you. Every serious vertical AI company will need its own version of this. If you are building one and you don’t have a domain-specific eval suite, you are flying without instruments.

Workflow Surface: What Felix Actually Does

The concrete capabilities Rogo has shipped span deal screening, CIM generation, buyer outreach, data room diligence, comps and models, and pitchbooks and memos. Decomposed:

Deal screening. Filtering thousands of potential targets against thesis criteria.
CIM generation. Drafting Confidential Information Memoranda — the 50-to-100-page sell-side documents that anchor M&A processes.
Buyer outreach. Generating personalized contact lists and initial communications.
Data room diligence. Synthesizing across the document piles that buyers and bankers wade through.
Comps and models. Building Excel spreadsheets with historical financials and forward forecasts.
Pitchbooks and memos. Decks for a CEO meeting, memos for an investment committee.

SiliconANGLE’s coverage notes that Felix can also offer to keep a report current — for example, an analyst covering Apple can have the agent re-run the report each time the company reports earnings. Scheduled, recurring agent runs are part of the surface.

The data substrate behind these tasks is extensive. TAMradar lists integrations with PitchBook, LSEG, Cap IQ, FactSet, Fitch Solutions, and Third Bridge, plus internal CRM and SharePoint connectors. Auditable outputs are positioned for SOC 2, ISO 27001, GDPR, and EU AI Act compliance — the table-stakes regulatory surface for institutional finance.

Sisyphus: The Other Harness

The most under-covered part of Rogo’s stack is a second internal agent called Sisyphus — an autonomous offensive-security agent that pen-tests Rogo’s own infrastructure once or twice a day, calibrated to deployment cadence. It runs structured campaigns across authentication abuse, authorization bypass, injection, SSRF, and LLM-specific exploit categories, and it chains findings to validate exploitability rather than just flagging signals.

Two numbers from Rogo’s own writeup are worth remembering. One week after a third-party penetration test, Sisyphus identified 18 additional exploitable vulnerabilities in a single afternoon, most chained, all remediated within hours. And on calibration: high-confidence findings now carry a >95% true-positive rate after the team tuned the recon phase and compared the agent’s triage against their human security team.

This is the harness for the harness. If your vertical agent platform handles consequential workflows, “we get pen-tested twice a year” is not a posture; it is a vulnerability window. Sisyphus is what the security side of vertical AI starts to look like.

Forward-Deployed Bankers: The Human Harness

Rogo’s go-to-market is structured around an embedded role they call Forward Deployed Bankers — ex-bankers from top firms who sit inside client institutions and onboard teams from analyst to managing director. The new capital is funding expansion of this team from New York into London.

This is not professional services in disguise. It is closer to what Palantir built for defense and intelligence: domain-fluent humans who translate between the workflow and the platform, calibrate the agent’s outputs to firm-specific style, and surface workflow gaps that become product. They understand model formatting and how a positioning section actually reads. Without them, the harness loses ground truth on what “good” looks like inside each firm’s house style.

For builders: the lesson is that adoption inside regulated, high-status industries is bottlenecked on trust transfer, not feature parity. The forward-deployed model is expensive and it is a moat.

What’s Actually Being Transformed

Bankers do not get replaced; their pyramid does. Rogo’s Series D announcement is explicit that leading firms are “restructuring workflows, rethinking staffing pyramids, and deploying autonomous agents that work asynchronously across every transaction.” A managing director at one client described Felix as having tripled team output with no headcount additions. That is the shape of the transformation: same senior judgment layer, compressed junior layer, agent layer doing the asynchronous grunt work, forward-deployed bankers tuning the seams.

Rogo’s two recent acquisitions tell you where they are aiming next. Plux AI — a UK firm tracking complex financial market developments — adds European market coverage. Offset, an AI agent company whose tech automatically updates financial models when new information arrives, plugs directly into the live-model side of the harness.

Five Lessons If You Are Building a Vertical Agent

The harness is the moat, not the model. Build it so frontier-model upgrades are a config change, not a rewrite.
Domain-specific evals beat horizontal benchmarks. Curate real tasks from real practitioners. Run them every model release.
Output substrate must match the destination workflow. A correct answer in the wrong format is the wrong answer.
Forward deployment changes adoption math. Domain-fluent humans embedded in the customer org are a feature, not overhead.
Security needs its own harness. When agents do consequential work, periodic pen tests leave a window. Continuous adversarial testing is the new floor.

What to Do This Week

Pick one workflow you’ve watched a domain expert do that you suspect an agent could absorb. Don’t model it yet. Instead, write down four things: the data sources they pull from, the output format they hand back, the audit trail they leave, and the colleague they email when they get stuck. Those four are your harness specification. The model goes in the middle of that, and you can swap it out next quarter.

If your current agent prototype only handles one or two of those four, you have not built a harness yet. You have built a wrapper.
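The four-item checklist can be written down as a small data structure, which makes the "harness or wrapper" question mechanical. A sketch under the assumption that each of the four items is a non-empty list once addressed; the class and field names are hypothetical, and the example values are made up.

```python
from dataclasses import dataclass, field, fields
from typing import List

@dataclass
class HarnessSpec:
    data_sources: List[str] = field(default_factory=list)         # where inputs come from
    output_formats: List[str] = field(default_factory=list)       # what gets handed back
    audit_trail: List[str] = field(default_factory=list)          # what is recorded
    escalation_contacts: List[str] = field(default_factory=list)  # who gets asked when stuck

    def missing(self) -> List[str]:
        """Names of the checklist items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_wrapper(self) -> bool:
        """True when only one or two of the four items are covered."""
        return len(fields(self)) - len(self.missing()) <= 2

# Example with made-up values: two of four items covered.
prototype = HarnessSpec(data_sources=["filings"], output_formats=["xlsx"])
```

Here `prototype.missing()` lists the audit trail and the escalation path, which are the two items most prototypes skip.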