The AI Runtime

A Portfolio That Practices MRE

The AI Runtime — Fri, 08 May 2026 11:02:37 GMT

TL;DR - Most early-career AI portfolios show the AIfolio pillars — RAG, tool-use, multi-agent orchestration — and stop at “demo runs once.” Vishnu Purohitham’s GitHub is rarer because the projects come pre-equipped with the parts MRE calls harness engineering: fallback chains, validation gates, quality thresholds, graceful degradation. The context engineering layer is real too — a T5 fine-tuned on the 226K-article XSum corpus (or 300K-article CNN-DailyMail) on Northeastern’s H200 cluster, BLIP adapted with LoRA r=16, BGE-base-en-v1.5 embeddings at 768 dimensions, hybrid dense + keyword search. Three of four AIfolio pillars are touched. Persistent memory is the honest gap. The hire/study signal isn’t completeness — it’s that the harness wasn’t an afterthought. If you’re staffing AI engineers and you want a filter for MRE instincts, this is the kind of portfolio to compare against. If you’re building one, copy the disposition: harness with the model, not after it.

Why this builder is worth a closer look

There’s a recognizable shape to most AI engineering portfolios in late 2025 and 2026: a chatbot, a RAG demo, a “GPT wrapper for [niche],” and maybe one fine-tuning notebook. They show familiarity with the stack. They don’t show that the builder has internalized what production AI actually requires — the unglamorous infrastructure that sits around the model and decides whether the system survives contact with real input.

Vishnu Purohitham is a Northeastern-affiliated builder whose portfolio inverts that ratio. Across four shipped projects — one a graduate-class capstone, three from hackathons spanning local Northeastern events to MIT’s Bitcoin Expo — the same architectural commitments show up. It’s the consistency that’s interesting, not any single project.

Vishnu’s AIfolio

This Builder Spotlight reads the work through two frameworks. The AIfolio framework gives us a way to talk about what an AI portfolio should contain — RAG with real evaluation, multi-agent orchestration, tool-use boundaries, persistent memory. Model Reliability Engineering (MRE) gives us a way to talk about how it should be built — split into context engineering (what the model sees at inference time) and harness engineering (the control layer governing what the user sees). Together they answer the question hiring managers actually care about: does this builder ship things, or does this builder ship things that hold up?

The four projects, in one paragraph each

InfoRetrieval v2 — A multimodal RAG system for personal knowledge management. Ingests URLs, PDFs, DOCX files, raw text, images, and Chrome bookmarks through a four-layer pipeline. Web scraping uses Trafilatura with a Playwright fallback. OCR runs EasyOCR first, then Tesseract if the first pass returns fewer than 20 characters. Summarization uses a T5 fine-tuned on either XSum (226K articles) or CNN-DailyMail (300K articles) on Northeastern’s H200 HPC cluster. Image captioning uses BLIP with a LoRA adapter (r=16, alpha=32). Storage is ChromaDB with hybrid dense + keyword search. The whole thing ships as a Docker Compose stack with a React frontend.

Boston 311 AI Agent — A multilingual (English / Spanish / Portuguese) agent for Boston city services, built in under 36 hours at a Northeastern hackathon. The interesting choice isn’t the agent — it’s the orchestration. The agent fans out parallel tool calls across four live Boston Open Data sources (311 cases, weather, events, neighborhood trends) and streams reasoning back to the frontend over SSE. The visible reasoning panel isn’t a UX flourish; it’s a trust mechanism for users (older adults, non-English speakers) who would otherwise have no way to evaluate whether the answer is grounded.

Zero-Shot Video Annotator — A FiftyOne plugin built at the Voxel51 / Twelve Labs hackathon. The interesting design move: instead of training a classifier, it uses Twelve Labs Pegasus to generate natural-language descriptions of each clip, then matches those descriptions to a user-defined taxonomy via cosine similarity over Marengo embeddings (512-dim). Tested on a 691-clip workplace safety dataset across 8 behavior categories. Local API caching reportedly cut inference costs by 80%. Built-in human-in-the-loop review surfaces low-confidence predictions for manual sign-off.
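The matching step can be sketched in a few lines. This is a minimal illustration, assuming clip and label embeddings are already available as plain float lists. The function names, the toy 3-dimensional vectors, and the 0.5 review threshold are hypothetical and not taken from the plugin.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_label(clip_vec, taxonomy, review_threshold=0.5):
    """Return (label, score, needs_review) for one clip embedding.

    taxonomy maps label name -> embedding. review_threshold is an
    assumed cutoff below which the prediction goes to a human reviewer.
    """
    label, score = max(
        ((name, cosine(clip_vec, vec)) for name, vec in taxonomy.items()),
        key=lambda pair: pair[1],
    )
    return label, score, score < review_threshold

# Toy taxonomy with made-up embeddings.
taxonomy = {"lifting": [1.0, 0.0, 0.0], "climbing": [0.0, 1.0, 0.0]}
confident = assign_label([0.9, 0.1, 0.0], taxonomy)
uncertain = assign_label([0.1, 0.1, 1.0], taxonomy)
```

The third element of the tuple is what feeds a review queue: a clip that sits far from every label still gets a best guess, but it is flagged rather than written silently.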

PulseMesh — A smartphone-based environmental DePIN built at the MIT Bitcoin Expo 2026 Virtual Hackathon. Native Android app collects sensor data (air pressure, noise, light) in the background, with a built-in Lightning wallet for instant micropayments via the L402 protocol. Backend includes a four-stage validation pipeline that detects spoofed readings before data hits the buyer-facing marketplace. Privacy-first design aggregates locations to city-block level before sale.

Two are flagship-quality builds. Two are 36-hour hackathon outputs. The architectural commitments are identical.

Where the AIfolio shows up — and where it doesn’t

The AIfolio framework names four pillars an AI engineer’s portfolio should evidence: a RAG pipeline with real evaluation, a multi-agent system that solves a real problem, an MCP / tool-use integration with sensible boundaries, and a persistent memory architecture. We don’t score Vishnu’s portfolio against this — that turns a spotlight into an audit, and the AIfolio is a reference for the concepts present, not a checklist a builder has to pass. The interesting reading is which pillars Vishnu has built around and which one he hasn’t.

RAG with real evaluation is built around in InfoRetrieval v2 — and “evaluation” is the word that earns it the hit. The training pipeline reports ROUGE-1, ROUGE-2, and ROUGE-L on summarization, plus BLEU for captioning. Most “AIfolio RAG” demos skip the eval. This one ships it.

Tool-use with sensible boundaries is built around in two places. The Boston 311 agent fans out parallel tool calls across four data sources with the reasoning panel exposed to the user — boundary as transparency. Zero-Shot Annotator routes low-confidence predictions to a human reviewer instead of writing them blindly to the labelset — boundary as fallback. Different mechanisms, same disposition: the tool-use isn’t the whole answer, and the system knows it.

Multi-agent orchestration is approached, not fully delivered. The Boston 311 build is parallel tool-calling, not multi-agent in the canonical sense (no negotiation between agents, no planner-worker split). Worth naming honestly: the orchestration skill is real, the multi-agent label is generous.

Persistent memory is the honest gap. Nothing in the four projects builds a cross-session memory layer (Mem0, Letta, Zep, or a custom architecture). Worth being clear about — if Vishnu wanted to round out the AIfolio, this is the next project to ship.

The pillars are reference points for what’s present. The more interesting question is how what’s present has been built. That’s MRE.

What the projects look like through the MRE lens

MRE splits production AI work along two axes. Context engineering governs what the model knows at inference time — fine-tuning, RAG, embedding strategy, knowledge freshness, retrieval precision. Harness engineering governs what the user sees — guardrails, output validation, fallback paths, faithfulness checks, graceful degradation, auditability.

Most AI demos do the first. Vishnu’s projects do both. That’s the signal.

Context engineering, layer by layer

InfoRetrieval v2 is the project where the context engineering is most visible, and it’s done with care.

The summarizer isn’t FLAN-T5 off the shelf — it’s a T5-base fine-tuned for 3 epochs on XSum or CNN-DailyMail at batch size 16 and learning rate 3e-5, with beam search at 4 beams and a 1.2 repetition penalty for inference. The image captioner isn’t BLIP off the shelf — it’s BLIP with a LoRA adapter trained on Flickr8k at r=16, alpha=32, dropout 0.05. The embedder is BGE-base-en-v1.5 at 768 dimensions — a deliberate choice over default OpenAI embeddings, with retrieval running as hybrid dense + keyword search rather than pure cosine.

What’s worth naming: this isn’t fine-tuning for the sake of “I trained something.” Each model on the path has been picked or adapted to the role it plays in the pipeline. T5 because summarization is a sequence-to-sequence problem with strong public benchmarks. BGE because the embedder is a retrieval surface with its own SLO and the MTEB leaderboard is a real signal. Hybrid search because pure dense retrieval misses keyword-exact matches and the system has to handle both.
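To make the hybrid-search argument concrete, here is a rough sketch of score blending. It assumes a simple weighted sum of cosine similarity and exact keyword overlap. The `alpha` weight, the function names, and the toy 2-dimensional embeddings are illustrative assumptions, not the project's actual retrieval code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query, text):
    """Fraction of query terms that appear verbatim in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms) if terms else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank docs by alpha * dense score + (1 - alpha) * keyword score.

    docs is a list of (text, embedding) pairs; alpha is an assumed weight.
    """
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in docs
    ]
    return [text for _, text in sorted(scored, key=lambda s: s[0], reverse=True)]

# Made-up documents and embeddings for illustration.
docs = [
    ("notes on error code E1234 in the build", [0.2, 0.9]),
    ("general guide to debugging builds", [0.9, 0.3]),
]
ranked_hybrid = hybrid_rank("E1234", [1.0, 0.0], docs)
ranked_dense_only = hybrid_rank("E1234", [1.0, 0.0], docs, alpha=1.0)
```

In this toy setup the dense-only ranking prefers the general guide, while the blended score surfaces the document that contains the exact identifier.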

The Chrome bookmark sync and watchdog file consumer are the part most readers will overlook. These are context freshness mechanisms — automatic re-ingestion as new content lands. MRE treats freshness as a context-layer SLO; this project ships the plumbing for it.

Harness engineering as the standout signal

Harness engineering is where Vishnu’s portfolio separates itself from the median. The pattern repeats across all four projects: any layer where input variation can break the system has a backup path and a quality check that decides which path runs.

The minimal viable shape:

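def extract(input_data):
    # Try the primary path first and keep it if it clears the quality gate.
    primary_result = primary_extractor(input_data)
    if quality_check(primary_result) >= THRESHOLD:
        return primary_result, "primary"
    # Otherwise run the backup path and tag the result so the path is auditable.
    fallback_result = fallback_extractor(input_data)
    return fallback_result, "fallback"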

InfoRetrieval v2’s web scraper runs Trafilatura first because it’s faster and lighter, and falls back to Playwright only if static extraction returns fewer than 50 characters. The OCR pipeline runs EasyOCR first and falls back to Tesseract if the first pass returns fewer than 20 characters, then returns a tuple of (text, method) where method is one of “easyocr”, “tesseract”, “combined”, or “none”. That last detail matters: auditability of which path actually ran is what makes the system debuggable three months later.
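A hedged sketch of the OCR variant, with the audit tag included. The two OCR engines are injected as plain callables so the example runs without either library installed. The 20-character threshold and the four method tags come from the description above, but the rule for when "combined" applies is an assumption made for illustration, not the project's documented behavior.

```python
def ocr_with_fallback(image, primary, fallback, min_chars=20):
    """Return (text, method), method in {"easyocr", "tesseract", "combined", "none"}.

    primary and fallback are callables (image -> str) standing in for
    the EasyOCR and Tesseract passes.
    """
    first = primary(image).strip()
    if len(first) >= min_chars:
        return first, "easyocr"
    second = fallback(image).strip()
    if len(second) >= min_chars:
        return second, "tesseract"
    # Assumption: if both passes are short but non-empty, keep both.
    if first and second:
        return f"{first} {second}", "combined"
    if second:
        return second, "tesseract"
    if first:
        return first, "easyocr"
    return "", "none"

# Hypothetical stand-ins for the two engines.
result = ocr_with_fallback(
    image=b"fake-bytes",
    primary=lambda img: "ab",
    fallback=lambda img: "a much longer line of recognized text",
)
```

Because the method tag travels with the text, a later debugging session can filter stored records by which path produced them.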

PulseMesh’s four-stage spoofing detection is the harness pointed at sensor data instead of extractor output, but it’s the same architectural move. Zero-Shot Annotator’s HITL review queue is the same move applied to model confidence — low-confidence predictions don’t get written silently, they get surfaced. The Boston 311 agent’s visible reasoning panel is the same move applied to user trust — the user can see what tools the agent called and decide whether to trust the answer.

What to call out: the validation layer isn’t decorative. It’s the part that lets the system know its own confidence, which is the precondition for graceful degradation. MRE treats this as the harness engineer’s primary deliverable. Vishnu ships it on a hackathon timeline.

Where the edges show

Every project has visible trade-offs. Calling them out is the difference between a profile and a puff piece.

InfoRetrieval v2 doesn’t scale past one machine. ChromaDB’s persistent client is single-process. The watchdog file consumer is async but in-process. None of this is wrong for a CS5130 capstone — but the architecture as written maxes out around one user with one Chrome bookmark file and one watched directory. Multi-user deployment would require a real DB tier, a job queue, and an actual auth layer. The README is honest about this; it doesn’t claim to be SaaS-ready.

The Boston 311 agent was built in 36 hours. That shows. Sub-2-second latency is impressive for a parallel-tool-calling agent, but error handling for stale data sources, partial tool failures, or rate-limited Open Data endpoints would all need real work for a public deployment.
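One way that hardening could look, sketched with the standard library only. Each tool call gets its own timeout and returns an error record instead of raising, so one slow or rate-limited source does not sink the whole answer. The tool functions, the names, and the timeout value are hypothetical stand-ins, not the agent's real code.

```python
import asyncio

async def call_with_timeout(name, tool, timeout_s):
    """Run one tool call; return (name, result, error) instead of raising."""
    try:
        return name, await asyncio.wait_for(tool(), timeout_s), None
    except Exception as exc:  # includes timeouts raised by wait_for
        return name, None, repr(exc)

async def fan_out(tools, timeout_s=2.0):
    """Call every tool concurrently and separate successes from failures."""
    outcomes = await asyncio.gather(
        *(call_with_timeout(name, tool, timeout_s) for name, tool in tools.items())
    )
    results = {name: value for name, value, err in outcomes if err is None}
    failures = {name: err for name, value, err in outcomes if err is not None}
    return results, failures

# Hypothetical stand-ins for the live data sources.
async def weather():
    return {"temp_f": 61}

async def cases():
    raise RuntimeError("rate limited")

async def events():
    await asyncio.sleep(1.0)  # simulates a slow endpoint
    return []

results, failures = asyncio.run(
    fan_out({"weather": weather, "311_cases": cases, "events": events}, timeout_s=0.05)
)
```

The `failures` mapping is the piece a visible reasoning panel could surface, so the user sees which sources did not respond.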

Zero-Shot Annotator’s 80% cost reduction is from caching. The first annotation pass on any new dataset is expensive. The plugin is a good fit for “annotate this dataset once, then iterate on labels” — and a poor fit for “annotate streaming video as it arrives.” Worth knowing before you adopt it.

PulseMesh’s four-stage validation adds latency and a trust assumption. The validators themselves can be wrong. A determined spoofer with knowledge of the validation pipeline can defeat statistical detection. The architecture is correct for an MVP DePIN; it would need a slashing or reputation mechanism to survive at scale.

The persistent memory pillar isn’t built around at all. None of the four projects ship a cross-session memory architecture. For an AIfolio that’s “complete,” this is the next project. The honest read: three of four pillars touched, with strong harness engineering compensating for the gap.

None of these are dealbreakers. They’re the edges of work shipped fast against real constraints. The portfolio doesn’t try to hide them.

What readers can take away

For new AI engineers building portfolios:

The AIfolio pillars name what to build. MRE names how to build it. Both matter, and most portfolios over-invest in the first and under-invest in the second. A demo that hits all four AIfolio pillars but has no harness around any of them is weaker than three pillars built with real harness engineering.

Pick one project and ship the harness. The minimum viable harness has three pieces: a fallback path on the layer most likely to fail, a quality gate that decides which path runs, and a way to audit which path actually ran (logs, return tuples, method tags). The cost is small. The signal is large.

Context engineering doesn’t require an H200. T5-base on a Kaggle GPU works. The signal isn’t the compute — it’s that you can defend a dataset choice, an eval metric, and a hyperparameter. Without that, your context layer is indistinguishable from the median.

Show the trade-offs. A README that says “this maxes out at one user, here’s why, here’s what would change for multi-tenant” reads as more senior than a README that claims SaaS-readiness it can’t back up. The InfoRetrieval v2 README’s frank acknowledgment that BLIP falls back to CPU on Apple Silicon “due to operator support limitations” is the right tone.

For mid-level engineers reviewing portfolios: the cheapest filter for MRE instincts is whether the harness exists at all. Run through the candidate’s repos and ask: where does primary extraction live, what happens if it fails, and how would I know which path ran? The absence of an answer is the answer.

For hiring managers: a portfolio that ships hackathon-grade builds with the same architectural rigor as classroom flagship projects is a stronger signal than either taken alone. It says the patterns are reflexive, not assignment-driven. That’s what you’re hiring for.

The most underrated skill in early-career AI engineering isn’t model selection or prompt design. It’s the discipline to architect around the model the same way you’d architect around any other unreliable dependency. Vishnu’s portfolio is interesting because every project assumes the unreliability and designs for it from line one — context engineering on the input side, harness engineering on the output side, with the AIfolio pillars showing up as the natural shape rather than the assignment. If you’re hiring, look for this. If you’re building, copy it.

Three Weeks of Opus 4.7 in Production: What Teams Are Actually Reporting

The AI Runtime — Thu, 07 May 2026 22:31:05 GMT

TL;DR - Anthropic released Claude Opus 4.7 on April 16, 2026 at unchanged pricing ($5/$25 per million tokens). After three weeks of production traffic from teams that shipped early, the most important changes are not the headline benchmark gains — they’re the behavior shifts. Stricter instruction following has broken prompts that relied on charitable interpretation. The new tokenizer can produce up to 35% more tokens for the same input text, shifting cost calculations even at unchanged pricing. Self-verification has materially reduced agent hallucination on tool-use tasks; Hex reports the model surfaces missing data states honestly rather than confabulating. The migration is not drop-in — teams that flipped the model string in config and shipped are the teams reporting regressions. The four practices that worked: re-run the eval suite, audit per-task cost in the first 48 hours, bump the effort tier when comparing benchmarks, and test vision workloads explicitly. The deeper lesson: every Opus release on the current ~2-month cadence is now a release event with its own pre-flight, and the Harness Half-Life is playing out in real time on every team’s prompt suite.

What was promised at launch

The April 16 launch positioned Opus 4.7 as a targeted upgrade over Opus 4.6 — improvements in software engineering, vision, instruction following, and self-verification, with particular gains on the most difficult tasks. Anthropic’s framing was that users should be able to hand off their hardest coding work to the model with less supervision than 4.6 required.

The benchmark numbers Anthropic published: 64.3% on SWE-bench Pro, 87.6% on SWE-bench Verified, and 69.4% on Terminal-Bench 2.0, with 3x higher image resolution (up to 2,576 pixels on the long edge) and a new xhigh effort tier between high and max. Pricing held flat at $5 per million input tokens and $25 per million output tokens.

Opus Updates

That was the launch. What’s emerged in the three weeks since is more textured — and the texture is where the engineering decisions actually live.

The instruction-following shift is the biggest change

The headline that matters for any team running production prompts: Opus 4.7 follows instructions more literally than 4.6 did.

The behavioral pattern, reported across multiple post-launch evaluations: prompts that relied on the model “reading between the lines” now do exactly what they were told. If the prompt says “respond in JSON format,” the model does — even when a clarifying question would have been more useful. If the prompt says “use Postgres, not SQLite” early in the run, the model now honors that constraint twenty steps later where 4.6 would sometimes drift toward whatever the broader context implied.

Three concrete patterns have shown up most often in the regression triage:

Implicit fallback prompts. Teams shipped prompts that effectively said “if you can’t do X, do Y.” The 4.6 behavior was to interpret this as a soft preference and frequently produce X anyway when X was clearly the right answer. The 4.7 behavior is to follow the literal instruction — Y appears when X would have been better, because the prompt said Y was acceptable. Fix: rewrite to express constraints as preferences rather than fallbacks where appropriate.

Format-overriding-content. A prompt that ends with “respond in JSON” gets JSON, even when the right response is a clarifying question. The 4.6 model would often violate the format instruction to ask the question. The 4.7 model produces malformed JSON or a JSON object containing the question, both of which break downstream parsers. Fix: split format instructions from content instructions, or explicitly say “if you need clarification, ask in plain text and skip the JSON wrapper.”

Negation drift. “Don’t do X” instructions that 4.6 sometimes interpreted as “X is unusual but not forbidden” now produce strict refusal of X even when context shifts. Fix: state the positive form (”do Y”) rather than the negation, where possible.
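For the format-overriding-content case, the downstream parser can be made tolerant of a plain-text clarification. The sketch below assumes the prompt has been rewritten to allow plain-text questions. The function name and the two-way classification are illustrative choices, not a prescribed interface.

```python
import json

def parse_model_reply(raw):
    """Classify a reply as ("json", obj) or ("clarification", text).

    Assumes the prompt told the model to ask clarifying questions in
    plain text and skip the JSON wrapper.
    """
    text = raw.strip()
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return "clarification", text
    if isinstance(obj, dict):
        return "json", obj
    # Valid JSON but not the expected object (for example a bare string).
    return "clarification", text

structured = parse_model_reply('{"db": "postgres"}')
question = parse_model_reply("Which database should I target?")
```

Routing on the first element keeps a clarifying question from reaching code that expects a JSON object.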

This is good for production systems. Predictability beats cleverness, and stricter instruction following is exactly the property agentic systems need to scale beyond babysitting. It is bad for teams who shipped prompts that depended on the model’s charitable interpretation. Those prompts now produce different outputs, sometimes subtly worse, and the regression is not always visible in eval — it shows up as a 3% increase in user complaints two weeks after launch.

The practical implication: every team migrating from 4.6 to 4.7 needs to re-run their prompt suite against the new model and re-tune. Not because anything is broken — because the model is now answering the literal question, and the literal question may not have been quite what the prompt intended.

The tokenizer change is a silent cost shift

Pricing did not change. Effective spend did.

Anthropic’s pricing documentation states the change explicitly: Opus 4.7 uses a new tokenizer that may use up to 35% more tokens for the same fixed text. Independent post-launch testing has reported token counts up roughly 12-18% on typical workloads, with code-heavy and multilingual content sitting closer to the upper bound.

The 35% number is the worst case. The realistic number for most production workloads is in the 10-20% range. Either way, the implication for a team running production traffic is concrete:

Cost rises at the same pricing per token, because the same prompts now consume more tokens. A workload that ran at $50K/month on 4.6 likely runs at $55-60K/month on 4.7 with no other changes.
Rate limits hit sooner for any team running close to the ceiling, because the limits are denominated in tokens per minute. Teams who previously had headroom may need to request a quota increase or restructure their request distribution.
Context window math changes — prompts that comfortably fit in 200K under the old tokenizer now sit closer to the edge. Teams who routinely ran at 180K input may now be hitting 220K and getting truncated.
Cache hit accounting is unchanged at the multiplier level (5m write at 1.25x, 1h write at 2.0x, read at 0.1x), but the absolute number of cached tokens is higher, which changes the savings calculation in absolute terms.

This is a benign change on paper and an expensive one in practice. The teams that ran a careful migration audited their per-task cost metric in the first 48 hours and adjusted budgets. The teams that did not are now finding out via the monthly bill.

The broader lesson: token consumption is now part of the migration audit. A model upgrade is not a cost-neutral event even when per-token pricing is unchanged. The metric that matters is cost-per-task, not cost-per-token, and it must be measured before and after every migration.
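The arithmetic is simple enough to sketch. The prices are the $5 and $25 per million tokens quoted in this piece. The task profile (40K input, 4K output) is made up, and the sketch assumes tokenizer inflation applies equally to input and output tokens, which may not hold for a given workload.

```python
INPUT_PRICE = 5.00 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 25.00 / 1_000_000  # dollars per output token

def cost_per_task(input_tokens, output_tokens, inflation=0.0):
    """Dollar cost of one task, with token counts scaled by (1 + inflation)."""
    scale = 1.0 + inflation
    return scale * (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE)

# Hypothetical task profile: 40K input tokens, 4K output tokens.
before = cost_per_task(40_000, 4_000)
typical = cost_per_task(40_000, 4_000, inflation=0.15)
worst = cost_per_task(40_000, 4_000, inflation=0.35)

print(f"before: ${before:.3f}  typical: ${typical:.3f}  worst: ${worst:.3f}")
```

Measured token counts from real traffic should replace the assumed inflation factor as soon as they are available.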

Self-verification has been the standout improvement

The behavioral change practitioners report most consistently is self-verification on agentic tasks. The model proactively checks its own outputs before declaring a task complete — writing tests and running them, re-checking tool results before synthesizing, flagging missing data rather than confabulating around it.

Hex’s CTO captured the practical impact: the model surfaces missing-data states honestly rather than fabricating around them, and it resists the kind of conflicting-evidence patterns that previously confused 4.6. On Hex’s 93-task internal benchmark, the resolution rate moved up by 13 points against 4.6, and Opus 4.7 closed four problems that neither 4.6 nor Sonnet 4.6 had been able to finish.

Notion AI reported it as the first model to pass their implicit-need tests — tasks where the model must infer required actions rather than being told what tools to invoke.

For teams running coding agents and other multi-step automation in production, this is the change that justifies the migration on its own. The error rate that previously forced human checkpoints on every meaningful action drops, and the human checkpoint can move one layer up the stack. That is a different shape of human-in-the-loop, and it changes the economics of agent oversight.

The economics shift is concrete. If a team was running a coding agent that required human review on every PR, and 4.7 reduces the review-required rate from 100% to 60%, the per-PR human time falls by 40%. Aggregated across an engineering org’s PR volume, that’s a meaningful productivity multiplier — and it lands on the same headcount, not new hires.

For agent product teams, this also reshapes the handoff layer. The escalation triggers that fired when the model was uncertain now fire less often, because the model resolves more cases internally. The handoff payload still has to be tight when escalations do happen — but the volume of escalations falls, which means the human queue shortens, which means each escalation gets faster human attention, which means handoff quality improves end-to-end.

The xhigh effort tier and task budgets

Two new control surfaces shipped with 4.7. Both have meaningful implications for production economics.

xhigh sits between high and max — finer-grained control over the reasoning-vs-latency tradeoff. Anthropic recommends starting with high or xhigh for coding and agentic use cases, and Claude Code now defaults to xhigh across all plans.

Hex’s observation is the load-bearing one for cost calibration: low-effort 4.7 sits at roughly the quality of medium-effort 4.6. This means a team comparing the two should benchmark 4.7 one tier lower than its 4.6 baseline to match quality at lower cost. Concretely:

Workloads that ran at medium on 4.6 → try low on 4.7 first; you may match or exceed quality at lower cost
Workloads that ran at high on 4.6 → try medium or high on 4.7; match quality at meaningful cost reduction
Workloads that need the absolute ceiling → xhigh is the new tier worth exercising; max remains for the genuinely hardest tasks

The teams treating effort tiers as fixed config rather than tunable parameters are leaving real cost savings on the table. A migration sprint that includes effort-tier audits typically recovers a meaningful portion of the tokenizer cost increase.

Task budgets (public beta) are a token cap on a complete agentic loop — thinking, tool calls, tool results, and final output combined. The model sees a running countdown and prioritizes accordingly. This is the agent-system equivalent of a request timeout. It does not optimize cost per call; it bounds the worst case.

The implementation pattern is direct: set a per-task budget at invocation time, and the model receives the running count as part of its prompt context. As the budget approaches zero, the model wraps gracefully — finishing the current step, summarizing where it is, returning a partial answer rather than hitting a hard cutoff mid-tool-call.
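For teams that want the same guard on the client side, a rough analogue looks like this. It is not the provider's task-budget feature or its API. The class, the reserve size, and the step costs are all hypothetical, and the sketch only shows the accounting logic of a running countdown with a wrap-up threshold.

```python
class TaskBudget:
    """Client-side token budget for one agent loop (illustrative only)."""

    def __init__(self, total_tokens, reserve_tokens=1_000):
        self.total = total_tokens
        self.used = 0
        # Once usage reaches this point, the loop should wrap up.
        self.wrap_up_at = total_tokens - reserve_tokens

    def charge(self, tokens):
        """Record tokens spent on thinking, tool calls, results, or output."""
        self.used += tokens

    @property
    def remaining(self):
        return max(self.total - self.used, 0)

    @property
    def should_wrap_up(self):
        """True once the loop should finish up and summarize."""
        return self.used >= self.wrap_up_at

def run_loop(step_costs, budget):
    """Run steps until done or the budget says to wrap up; return steps completed."""
    completed = 0
    for cost in step_costs:
        if budget.should_wrap_up:
            break
        budget.charge(cost)
        completed += 1
    return completed

budget = TaskBudget(total_tokens=10_000)
steps_done = run_loop([3_000, 3_000, 3_000, 3_000, 3_000], budget)
```

The reserve is what leaves room for a summary or partial answer instead of a hard stop mid-step.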

For any team that has had a runaway agent loop in production — the kind that eats a day’s budget retrying the same failing tool call — this is the primitive that closes that failure mode. The combination with the server-side compaction beta (the compact-2026-01-12 header) means teams now have provider-native primitives for both the cost ceiling and the context overflow problem. Less custom infrastructure to build; less to maintain.

The vision jump is real

The vision change is the one most likely to be undervalued because it requires a workflow that exercises it. For teams that work with screenshots, diagrams, dense PDFs, or any high-DPI input, the practical impact is large.

The maximum image resolution moved from ~1.15 megapixels to ~3.75 megapixels — a 3.3x increase in pixel count. Independent reports flag this as an inflection for document extraction, log screenshot analysis, architecture diagram understanding, and similar workflows.

The use cases where this materially changes feasibility:

Dense document extraction — financial statements, medical records, technical drawings — where text or detail at the original resolution was previously too small to reliably extract.
UI testing and visual regression — full-page screenshots of complex web apps where individual components or text strings were previously below the resolution threshold.
Architecture diagrams and technical illustrations — where the relationships between components depend on small text labels and connection details.
Log and dashboard screenshots — where a workflow involves the agent reading rendered UI rather than structured data.

The cost: higher resolution images consume more tokens. Anthropic recommends downsampling when the extra fidelity is not needed. The pattern that has emerged: tier images by resolution requirement, and route to lower-resolution input for routine cases. Treat the high-resolution capability as a tool to invoke, not as a default.
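The tiering pattern reduces to a small sizing function. The 2,576-pixel long edge is the ceiling quoted in the launch numbers. The 1,024-pixel routine cap and the `needs_fine_detail` flag are assumptions for illustration, and actual resizing would be done with whatever image library the pipeline already uses.

```python
def fit_long_edge(width, height, max_long_edge):
    """Scale (width, height) down so the long edge is at most max_long_edge."""
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height  # never upscale
    scale = max_long_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))

HIGH_RES_LONG_EDGE = 2576  # ceiling cited in the launch numbers
ROUTINE_LONG_EDGE = 1024   # hypothetical cap for routine images

def target_size(width, height, needs_fine_detail):
    """Pick output dimensions based on whether the task needs fine detail."""
    cap = HIGH_RES_LONG_EDGE if needs_fine_detail else ROUTINE_LONG_EDGE
    return fit_long_edge(width, height, cap)

dense_scan = target_size(4000, 3000, needs_fine_detail=True)
thumbnail_check = target_size(4000, 3000, needs_fine_detail=False)
```

The decision about which images need fine detail is the real work. The sizing itself is mechanical.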

This is not a “nice to have” change for vision-adjacent workloads. It is the difference between vision capabilities that worked in demos and vision capabilities that work in production.

The regressions

Not every change is an improvement. Two regressions are worth flagging.

Web research quality, by some independent reports, has dropped relative to 4.6 — source attribution accuracy, contradiction detection, and citation specificity all reportedly weaker. The hypothesis circulating among teams who migrated then partially reverted: the training tradeoff that improved agentic persistence shifted the model away from the careful cross-referential reasoning that made 4.6 strong on research tasks.

The practical guidance from teams who ran both side-by-side: if your primary workload is research synthesis where source fidelity matters, evaluate carefully before migrating. Some teams are running 4.7 for coding workflows and 4.6 for research workflows on the same product surface, routed by task type. The cost of running two models is real but smaller than the cost of regression on the workload that regressed.
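Routing by task type can be as small as a lookup table. The labels below are placeholders rather than real API model identifiers, and the task-type names are assumptions about how a product might classify its requests.

```python
# Placeholder labels, not real API model identifiers.
ROUTES = {"coding": "opus-4.7", "research": "opus-4.6"}
DEFAULT_MODEL = "opus-4.7"

def pick_model(task_type):
    """Return the model label for a task type, defaulting to the newer model."""
    return ROUTES.get(task_type, DEFAULT_MODEL)

chosen = [pick_model(t) for t in ("coding", "research", "summarization")]
```

Keeping the table in config rather than code makes it cheap to collapse back to one model once the research gap closes.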

Self-reported numbers vs independent testing. As is now standard with frontier model launches, independent testing tends to show tighter margins than vendor numbers. The 13-point lift Hex reported on its internal benchmark may be closer to 5-6 points in real-world workloads, particularly when controlling for the effort tier difference. This is not specific to Anthropic; it is a category property of self-reported AI evaluations and a reason to run independent benchmarks before relying on launch numbers for production decisions.

The patterns that worked

The migration patterns that worked in the first three weeks share four practices:

Re-run the eval suite before flipping production traffic. The instruction-following shift exposes prompt regressions that are not obvious from spot-checking. Teams that have a regression suite ran it against 4.7 first, triaged the failures, and then either fixed the prompts or held the model upgrade until they could.
Audit per-task cost in the first 48 hours after migration. The tokenizer change is a silent cost shift, and the only honest measurement is the per-task metric. A 30% increase in median cost-per-task with no quality change is the signal that effort tier or task budget tuning is needed.
Shift the effort tier when comparing benchmarks. If the previous workload ran at high on 4.6, equivalent quality on 4.7 may sit at medium, at meaningfully lower cost; xhigh is worth testing only where the workload needs the ceiling. The tier-shift opportunity is the largest under-claimed win in the migration.
Test vision workloads explicitly. The 3.3x resolution jump changes what is feasible. Teams that don’t exercise vision are leaving capability on the table — and teams whose workloads include any document, screenshot, or diagram processing should explicitly test whether the new resolution unlocks workflows that weren’t viable before.
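The first practice, re-running the eval suite, comes down to diffing per-case results between the two models. A minimal sketch, assuming each run is recorded as a mapping from case id to pass/fail; the case names and results are made up.

```python
def diff_eval_runs(baseline, candidate):
    """Compare per-case pass/fail results from two model runs.

    Both arguments map case id -> bool (True means the case passed).
    Returns (regressions, fixes) as sorted lists of case ids.
    """
    shared = baseline.keys() & candidate.keys()
    regressions = sorted(c for c in shared if baseline[c] and not candidate[c])
    fixes = sorted(c for c in shared if not baseline[c] and candidate[c])
    return regressions, fixes

# Hypothetical results for the same prompt suite on two models.
old_run = {"json_format": True, "fallback_prompt": True, "missing_data": False}
new_run = {"json_format": True, "fallback_prompt": False, "missing_data": True}
regressions, fixes = diff_eval_runs(old_run, new_run)
```

An aggregate pass rate would look unchanged in this toy example, which is why the per-case diff is the useful view for triage.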

The teams that struggled in the first three weeks did the opposite: flipped the model string, watched some prompts regress, and spent days triaging without a structured re-evaluation. Several reported partial reversion to 4.6 for specific high-value workloads while they did the migration audit they should have done before the cutover.

Migration Plan

The verdict three weeks in

For agentic coding workflows: migrate. The self-verification and tool-call reliability gains compound into materially fewer failed loops and less wasted compute. The teams running coding agents in production are the clearest beneficiaries.

For vision-heavy workflows: migrate immediately. The resolution jump is the kind of capability change that opens new product surfaces — workflows that were demo-viable but production-fragile become production-viable.

For research-heavy workflows: evaluate carefully. The reported regression on cross-referential reasoning is real for some tasks. Some teams are running 4.6 for research and 4.7 for coding on the same product, routed by task type, until the gap closes.

For everyone: budget time for a prompt audit, track per-task cost, and treat the migration as a release event with its own pre-flight. The model is better. The migration is not free.

What this release teaches about model upgrades generally

The deeper pattern this release illustrates is the Harness Half-Life playing out in real time. The custom prompt scaffolding, the fallback heuristics, the workarounds for 4.6’s quirks — many of them are now obsolete. Some of them are now actively suppressing capabilities the new model could provide. A team that built a custom verification step on top of 4.6 because the model didn’t reliably check its own work is now running that custom step and the model’s stronger built-in self-verification — paying for both, getting marginal benefit from the custom layer.

Auditing the harness on every model release is no longer optional. With a release cadence of roughly two months on the Opus line, it is now part of the operating rhythm.

The teams who treat each model release as a discrete project — its own pre-flight, its own audit, its own dashboard for tracking the migration — are the teams whose harnesses stay lean. The teams who treat each release as a config flip accumulate harness debt at compounding rates, and pay it off in larger and more painful migrations later.

The model is improving faster than the harnesses around it. That asymmetry is now a structural feature of building on frontier models, and the engineering response — instrumented migrations, structured audits, and a culture of harness pruning — is what separates teams whose costs shrink with each release from teams whose costs only grow.

Three weeks of production data from Opus 4.7 is enough to see the shape. The teams who learned this lesson cleanly are already preparing for the next release. The teams who didn’t are still triaging the last one.

Don’t miss out on the next editions from The AI Runtime

The Cost Layer — The xhigh effort tier and the tokenizer change are both cost levers. Caching, routing, and task budgets are how teams absorb the per-task cost shift on migration.

The Shipped Agent’s First 90 Days — Treat every model release as a release event with its own pre-flight. The first 90 days framework formalizes the operating rhythm that catches regressions before users do.

Long-Running Agent State Management — The compact-2026-01-12 beta header pairs with Opus 4.7’s task budgets. Both are provider-native primitives that close failure modes teams used to build themselves.

Inside Mintlify’s Agent Stack

The AI Runtime — Wed, 06 May 2026 08:03:50 GMT

TL;DR - Mintlify just raised $45M at a $500M valuation on the bet that documentation has stopped being something humans read and started being infrastructure that agents query. Their own traffic data backs the bet: across 30 days and roughly 790M requests on Mintlify-powered sites, AI coding agents accounted for 45.3% of traffic versus 45.8% for browsers, with Claude Code alone generating more requests than Chrome on Windows.

Underneath the bet sits a three-part architecture worth studying. The write agent runs inside ephemeral Daytona sandboxes with a headless OpenCode session driven by Opus 4.6, triggered by Slack mentions, dashboard prompts, API calls, or YAML-defined Workflows in your repo. The read assistant does the opposite — it skips real sandboxes entirely in favor of ChromaFs, a virtual filesystem layered over their existing Chroma database, taking session creation from roughly 46 seconds to about 100 milliseconds. The public surface auto-generates llms.txt, llms-full.txt, and skill.md at the root, serves clean Markdown when you append .md to a page URL, and hosts an MCP server for every docs site it powers.

The architectural lesson isn’t that they built a doc agent. It’s that they built two harnesses with deliberately asymmetric constraints — async writes get full sandboxes, sync reads get a virtual filesystem — and the asymmetry is what makes the system economical at over 23 million queries a month. If you’re wrapping a model around a code repository for any reason, this is the reference implementation to study.

The 45% problem

Start with the data, because the architecture only makes sense once you accept the premise.

In April 2026, Mintlify’s co-founder Han Wang published a Cloudflare-header analysis covering 30 days of traffic across all Mintlify-powered docs sites. The headline number: AI coding agents had reached 45.3% of total requests, narrowly behind 45.8% from browsers. The distribution was lopsided. Claude Code alone produced 199.4M requests, ahead of Chrome on Windows at 119.4M. Cursor produced 142.3M. Together those two tools accounted for roughly 96% of identified AI agent traffic. Mintlify itself notes the real share is likely higher, since Codex traffic is invisible to user-agent header analysis and disappears into generic HTTP requests.

Architecture Patterns

If half your readers are agents pulling context to generate code, the design pressure on documentation flips. Browsers want navigation chrome, syntax highlighting, expandable sections. Agents want clean Markdown, exact strings, and stable URLs. The same content has to render correctly to both audiences, and — critically — has to stay current as the underlying product ships at agent-swarm speed.

That second pressure is the one that produced the agent stack. As Mintlify’s other co-founder Hahnbee Lee frames it, when a chatbot gives a wrong answer it is usually a documentation failure rather than a model failure, because the corpus the model retrieved against is out of date. The gap between what your docs say and what your product does compounds quarter over quarter unless something automated keeps the two in sync. Their answer is two distinct agents with two distinct harnesses, plus a public surface that exposes the maintained corpus to every other agent in the ecosystem.

Two harnesses, two latency budgets. The write path optimizes for capability; the read path optimizes for cost-per-conversation.

Layer 1 — The write agent: a sandbox is the whole product

Most “AI doc writer” features on the market today are roughly one prompt, one model call, one diff. Mintlify’s write agent is structurally different. When you trigger it — by @mintlify-ing the bot in Slack, hitting Cmd+I in the dashboard, calling the agent API, or merging a PR that fires a Workflow — what runs on the other side is a headless OpenCode session driven by Opus 4.6, scoped to a fresh Daytona container that has the docs repo and any context repositories cloned in. The sandbox is the unit of work.

This decision is more load-bearing than it sounds. The Mintlify team is explicit about the reasoning: pointing a stateless model at a codebase produces, in their phrase, “chaos with a byline”. The agent needs a real environment to read code, plan changes, and edit files safely — not an API call decorated with retrieved chunks. So they gave it one. A trigger lands on a job queue, a worker provisions the container, and the result of the run is reported back through GitHub commit checks and the Mintlify dashboard. Inside the container, the agent runs through a fixed pipeline: it pulls in relevant material across the docs and the connected code repos, drafts a multi-step plan if the work calls for one, applies edits while honoring the project’s writing standards, runs a local Mintlify CLI build to confirm the docs still compile, and opens a pull request — direct commits to main are not on the menu.

Two design choices inside that loop are worth pulling out.

Slack-first, not terminal-first. The Mintlify agent originally shipped only in Slack and via API, with the dashboard surface added later in December 2025. The team’s stated reason: opening a terminal triggers a “mentally draining switch” that opening Slack does not, and documentation work is exactly the kind of task people procrastinate on. By living where the relevant context already lives — the PR thread that explained the change, the customer Slack message that surfaced the gap — the trigger surface matches the source of the work.

Behavior-as-code through AGENTS.md. The agent reads a config file at .mintlify/AGENTS.md in your repo, and appends its contents to its system prompt for every task it runs — whether the trigger comes from Slack, the dashboard, or the API. The path matters: Mintlify’s docs explicitly warn that placing the file at the project root exposes it as a public asset under /agents.md, whereas the .mintlify/ directory is not served on the docs site. What you put inside is style preferences, code standards, project-specific terminology — the kind of guidance a senior reviewer would otherwise repeat fifty times a year. It is the same pattern as Anthropic’s CLAUDE.md or the AGENTS.md spec emerging across the agent tooling space, and it makes agent behavior version-controlled and reviewable.

The most interesting trigger surface is Workflows, where the YAML config gets explicit. A workflow file lives in your repo. The schema looks roughly like this:

---
name: 'Update API reference on backend changes'
on:
  push:
    - repo: 'your-org/backend'
      branch: main
context:
  - repo: 'your-org/docs'
  - repo: 'your-org/openapi-specs'
automerge: false
---

When the backend repo merges a PR, scan the diff for changes to public API
endpoints, request/response schemas, or authentication behavior. Update the
matching API reference pages and code examples. Skip internal refactors.

The structure is a trigger (cron job or push event), a list of context repos to clone in, an automerge flag, and natural-language instructions in Markdown. When the trigger fires, the agent evaluates the conditions, runs the task, and either commits directly or opens a PR depending on configuration. Documentation maintenance becomes a downstream event of shipping, not a separate task someone has to remember.

The whole arrangement maps onto a pattern emerging across serious agent products: give the AI a sandbox, version-control the instructions, keep humans in the review loop, and let the model do the actual work inside well-defined guardrails. The reviewer-on-PRs analogy is doing real work here. The agent is treated like a junior contributor with full repo access — capable, but reviewed.

Layer 2 — The read assistant: when a real sandbox is the wrong answer

If the write agent shows what it looks like to spend latency to gain capability, the read assistant shows the opposite trade-off — and it is the more architecturally surprising of the two.

The read assistant is the chat widget your readers use on a Mintlify-powered docs site. It now serves over thirty thousand conversations a day across hundreds of thousands of users. The natural design — and the one Mintlify started with — was the same shape that powers the write agent: spin up a sandbox, clone the docs repo, let the model run real grep, cat, ls, and find against the filesystem.

That design hit two walls. First, latency: p90 session boot time, including the GitHub clone and other setup, came in around 46 seconds — fine for an async write task where someone fires a Slack message and walks to get coffee, fatal for a reader staring at a loading spinner on a docs page. Second, cost. At nearly a million conversations a month, even a minimal sandbox setup at 1 vCPU, 2 GiB RAM, and a five-minute lifetime would have run north of $70,000 a year on Daytona’s per-second pricing, with longer sessions doubling the bill.
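
A rough back-of-envelope check on those figures, using only the numbers quoted in this piece. Both inputs are rounded bounds ("over thirty thousand" a day, "north of $70,000" a year), so the result is an order-of-magnitude estimate and not Daytona's actual price.

```python
conversations_per_day = 30_000   # "over thirty thousand conversations a day"
annual_sandbox_cost = 70_000     # dollars, "north of $70,000 a year"

conversations_per_year = conversations_per_day * 365
cost_per_conversation = annual_sandbox_cost / conversations_per_year

print(f"{conversations_per_year:,} conversations per year")
print(f"${cost_per_conversation:.4f} per conversation")
```

The per-conversation number is well under a cent. The problem is the multiplication by volume, not the unit price.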

So the team built ChromaFs — a virtual filesystem that gives the agent the illusion of a real shell, layered over the Chroma database that already stored the docs as embedded chunks. Session creation collapsed from tens of seconds to roughly 100 milliseconds, and because ChromaFs reuses infrastructure they were already paying for, the marginal compute cost per conversation dropped to zero. The implementation runs on top of just-bash, a TypeScript reimplementation of bash from Vercel Labs that exposes a pluggable IFileSystem interface. just-bash parses commands, pipes, and flags; ChromaFs translates each underlying filesystem call into a Chroma query.

The mechanics are worth dwelling on, because they reveal how thoughtful harness design beats brute-force sandboxing.

The directory tree is bootstrapped from a single gzipped JSON document called __path_tree__ stored inside the Chroma collection. On startup, the server fetches and decompresses it into two in-memory structures — a set of file paths and a map from directories to their children. After that, ls, cd, and find resolve in local memory with zero network calls, and the tree is cached so subsequent sessions for the same site skip the fetch entirely. Per-user access control happens at tree-build time: ChromaFs prunes paths the user can’t see and applies a matching filter to all subsequent Chroma queries, with the result that pruned paths cannot even be referenced by the agent. Reading a page is a chunk-reassembly operation — cat /auth/oauth.mdx fetches all chunks with the matching slug, sorts them by chunk_index, and joins them into the full page. Writes throw EROFS, making the system stateless by construction.
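
The mechanics in that paragraph can be sketched in Python (ChromaFs itself is TypeScript). Several details here are assumptions for illustration: the path-tree document is taken to be a JSON list of file paths, chunks are plain dicts with `slug`, `chunk_index`, and a hypothetical `text` field, and access control is a simple predicate. None of this is Mintlify's actual code.

```python
import errno
import gzip
import json

def load_tree(blob: bytes, allowed=lambda path: True):
    """Decompress the path tree into a file set and a directory -> children map."""
    paths = json.loads(gzip.decompress(blob))
    files, children = set(), {"/": set()}
    for path in paths:
        if not allowed(path):  # pruned at build time: the agent can never name it
            continue
        files.add(path)
        parts = path.strip("/").split("/")
        parent = "/"
        for i, name in enumerate(parts):
            node = "/" + "/".join(parts[: i + 1])
            children.setdefault(parent, set()).add(name)
            if i < len(parts) - 1:  # intermediate components are directories
                children.setdefault(node, set())
            parent = node
    return files, children

def ls(children, directory):
    """Resolve a listing from memory, with no network call."""
    return sorted(children[directory])

def cat(chunks, path):
    """Reassemble a page from its chunks, ordered by chunk_index."""
    mine = [c for c in chunks if c["slug"] == path]
    return "".join(c["text"] for c in sorted(mine, key=lambda c: c["chunk_index"]))

def write(path, data):
    """Every write fails, which keeps the filesystem stateless."""
    raise OSError(errno.EROFS, "Read-only file system", path)

# Hypothetical site with one private section.
blob = gzip.compress(json.dumps(
    ["/auth/oauth.mdx", "/auth/keys.mdx", "/internal/roadmap.mdx"]
).encode())
files, children = load_tree(blob, allowed=lambda p: not p.startswith("/internal/"))
chunks = [
    {"slug": "/auth/oauth.mdx", "chunk_index": 1, "text": "world"},
    {"slug": "/auth/oauth.mdx", "chunk_index": 0, "text": "hello "},
]
print(ls(children, "/"), ls(children, "/auth"))
print(cat(chunks, "/auth/oauth.mdx"))
```

Because pruning happens while the tree is built, the private path never enters either structure, so there is nothing for a later `ls` or `cat` to leak.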

The most clever piece is grep. A naive recursive grep over a virtual filesystem would be agonizing — every file would round-trip to the database. ChromaFs intercepts the grep call, parses flags with yargs-parser, and translates them into a Chroma query ($contains for fixed strings, $regex for patterns) that acts as a coarse filter to identify which files might contain a hit. The matched chunks are bulk-prefetched into a Redis cache, and the rewritten grep is handed back to just-bash for in-memory fine filtering. Large recursive queries finish in milliseconds.
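
The two-stage grep can be illustrated with an in-memory stand-in. In this sketch a Python list plays the role of the Chroma collection and a dict plays the role of the Redis cache; the chunk fields and function names are assumptions for the example, and the coarse filter only mimics what a `$contains` or `$regex` query would return.

```python
import re

def coarse_filter(chunks, pattern, fixed_string):
    """Stand-in for the database query: slugs whose chunks might contain a hit."""
    if fixed_string:
        def hit(text):
            return pattern in text
    else:
        rx = re.compile(pattern)
        def hit(text):
            return rx.search(text) is not None
    return {c["slug"] for c in chunks if hit(c["text"])}

def grep(chunks, pattern, fixed_string=False):
    candidates = coarse_filter(chunks, pattern, fixed_string)

    # Bulk-prefetch only the candidate files into a local cache.
    cache = {}
    for c in sorted(chunks, key=lambda c: (c["slug"], c["chunk_index"])):
        if c["slug"] in candidates:
            cache[c["slug"]] = cache.get(c["slug"], "") + c["text"]

    # Fine filter: ordinary line-by-line matching, entirely in memory.
    rx = re.compile(re.escape(pattern) if fixed_string else pattern)
    return [
        (slug, lineno, line)
        for slug, text in sorted(cache.items())
        for lineno, line in enumerate(text.splitlines(), 1)
        if rx.search(line)
    ]

# Hypothetical corpus.
chunks = [
    {"slug": "/auth/oauth.mdx", "chunk_index": 0, "text": "Use OAuth.\nRequest a token.\n"},
    {"slug": "/auth/oauth.mdx", "chunk_index": 1, "text": "Refresh the token daily.\n"},
    {"slug": "/guides/intro.mdx", "chunk_index": 0, "text": "Welcome.\n"},
]
hits = grep(chunks, "token", fixed_string=True)
print(hits)
```

One limitation of this simplified version: a match that straddles a chunk boundary would be missed by the coarse stage. The sketch does not try to handle that case.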

Sitting beneath ChromaFs in the read path is Trieve, the RAG infrastructure company Mintlify acquired in July 2025. Trieve had been Mintlify’s search backbone since before the team finished its Y Combinator batch, and the acquisition brought retrieval ownership in-house at a moment when the assistant was already serving more than 23 million queries a month. Trieve’s stack — dense vector search, re-ranker models, sub-sentence highlighting, and date recency biasing on a single endpoint — does the heavy lifting underneath ChromaFs’s UNIX-style interface. Trieve also moved to an MIT license as part of the acquisition, so the same retrieval kernel is inspectable on GitHub.

The pattern in the read assistant is the part most teams underweight. Mintlify’s team observed that agents are converging on filesystems as their primary interface, because grep, cat, ls, and find are sufficient primitives for an agent to reason over arbitrary structured content. Most builders take that observation and reach for a real sandbox. Mintlify took the same observation and asked whether the interface could be virtualized while keeping the primitives real. For their workload, the answer was yes — and the cost curve in their post (sandbox cost grows linearly with conversation duration; ChromaFs stays flat) is a clean argument for why.

Layer 3 — The public surface: content negotiation as the unification trick

The third layer is the cheapest to describe and the easiest to overlook.

Every Mintlify-hosted docs site automatically generates a set of agent-readable artifacts at the root: llms.txt, llms-full.txt, and skill.md. The first two are an emerging convention for telling LLMs what content lives on a site and giving them a parseable bulk dump. The third is more interesting. As Mintlify describes it, skill.md is the action-layer manifest — it enumerates not just what the documentation contains but what an agent can actually invoke against the product, with required inputs and operating constraints attached to each capability. It is, in other words, the difference between an agent that can find information and an agent that can take action. Mintlify also exposes the /.well-known/agent-skills and /.well-known/skills paths — so any agent that knows the convention can find capabilities without hard-coded paths.

The unification trick that ties everything together is content negotiation. The same URL serves rich HTML to browsers and clean Markdown to agents — appending .md to any page URL returns a Markdown view of the same content, with no separate agent-facing site to maintain. This avoids the failure mode where teams maintain a “human site” and a separate “AI site” that drift out of sync; there is only one content store, with two rendering targets selected by the request.
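
A minimal sketch of that idea, assuming a single content store keyed by page path and a request that selects the rendering target by its `.md` suffix. The `render` and `to_html` names are hypothetical, and `to_html` is a deliberately crude stand-in for a real Markdown renderer.

```python
import html

def to_html(markdown_source: str) -> str:
    """Stand-in renderer; a real site would run a Markdown-to-HTML pipeline."""
    return "<article>" + html.escape(markdown_source) + "</article>"

def render(path: str, pages: dict):
    """Serve one content store in two formats, selected by the request path."""
    wants_markdown = path.endswith(".md")
    canonical = path[:-3] if wants_markdown else path
    source = pages.get(canonical)
    if source is None:
        return 404, "text/plain", "not found"
    if wants_markdown:
        return 200, "text/markdown", source
    return 200, "text/html", to_html(source)

# Hypothetical single page.
pages = {"/auth/oauth": "# OAuth\nRequest a token."}
print(render("/auth/oauth.md", pages))
print(render("/auth/oauth", pages))
```

Both responses are derived from the same `pages` entry at request time, which is why the two views cannot drift apart.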

Finally, every Mintlify site auto-hosts an MCP server, which lets coding agents like Cursor, Claude Code, and Windsurf query current documentation while a task is running. Authentication is supported when the docs site itself is gated — the MCP server respects whatever auth protocol the docs already use. The architectural significance is that retrieval is no longer something only the docs site itself can do. Every external agent that supports MCP gets a structured handle into your corpus, on the same terms as Mintlify’s own assistant.

What the architecture teaches

A few patterns are general enough to lift out of Mintlify’s specific case and apply elsewhere.

First, the sandbox is the unit of work for write tasks, but the wrong unit for read tasks. Most builders default to one or the other. Mintlify’s own bill clarifies the trade-off: a sandbox that boots in tens of seconds and costs a fraction of a cent per session is fine for asynchronous PR drafting, and ruinous for a chat widget. If you’re building both surfaces, expect to want both harnesses.

Second, version-controlled, natural-language instructions are the right encoding for agent behavior. Workflows YAML and AGENTS.md are the same idea applied at different scopes — one configures a recurring task, the other configures the agent globally. Both live in the repo, both go through code review, both evolve with the project. This is what “config as code” looks like when the configured component is a model.

Third, virtualizing the agent’s interface, not its environment, is often the better move. ChromaFs is the cleanest example: a real grep, a real ls, a real cat — but resolved against a database, not a disk. The agent doesn’t need a sandbox, it needs the sandbox’s API. Once you internalize that, a lot of “we need a Daytona for this” becomes “we need an IFileSystem shim for this,” with two orders of magnitude less infrastructure.

Fourth, content negotiation is the right unification primitive when you’re serving humans and agents from the same corpus. Maintaining parallel “human docs” and “AI docs” is how you guarantee they drift. Same URL, different format, selected by the request — and the cost of supporting the agent surface drops to near-zero.

Finally, harnesses are not edge cases, they’re the product. If you remove ChromaFs from the read assistant, the bill blows up. If you remove the sandbox boundary from the write agent, you stop being able to safely run on customer codebases. If you remove the auto-generated llms.txt and MCP server, the 45.3% of agent traffic loses its grip on the corpus. The model is doing model work in the middle, but everything around it — the sandbox, the virtual filesystem, the YAML triggers, the public surface — is what makes the product trustworthy and economical.

What to do with this

Three concrete moves for practitioners building anything adjacent to this space.

If you operate a documentation site, run it through Mintlify’s free Agent Score tool, which checks twenty-nine signals of agent-readability and tells you where the gaps are. The data is right there: half your traffic is agents you cannot see, and most teams are still building only for browsers. If you’d rather audit on your own, start by checking whether curl -L https://yourdocs.com/some-page.md returns clean Markdown or a 404 — that one HTTP request tells you whether you’re on the agent map at all.
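
The same self-audit can be scripted with the Python standard library. The classification heuristic below (status code, Content-Type, and a peek at the first bytes) is an assumption made for this sketch, not Mintlify's Agent Score logic, and `https://yourdocs.com/...` remains a placeholder URL.

```python
import urllib.error
import urllib.request

def classify(status: int, content_type: str, body: str) -> str:
    """Label a response as 'markdown', 'html', or 'missing'."""
    if status != 200:
        return "missing"
    head = body.lstrip().lower()
    if "html" in content_type.lower() or head.startswith(("<!doctype", "<html")):
        return "html"
    return "markdown"

def check(url: str) -> str:
    """Fetch a page (redirects are followed, like curl -L) and classify it."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read(2048).decode("utf-8", errors="replace")
            return classify(resp.status, resp.headers.get("Content-Type", ""), body)
    except urllib.error.HTTPError as err:
        return classify(err.code, "", "")

# Usage (needs network access):
#   print(check("https://yourdocs.com/some-page.md"))
```

A result of "html" for a `.md` URL usually means the server ignored the suffix and returned the browser page, which is as unhelpful to an agent as a 404.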

If you’re building any agent that needs to read or modify a code repository, start with the harness, not the prompt. Decide your latency budget before you decide your model. If the answer is “tens of seconds and the agent edits files,” the Mintlify write agent — sandbox, headless OpenCode, version-controlled config — is your reference. If the answer is “milliseconds and the agent only reads,” the ChromaFs pattern (virtualize the interface, not the environment) is your reference.

And if you’re shipping a product that other agents will need to understand — an API, an SDK, a developer tool — treat your documentation as a programmatic interface that happens to also be human-readable. Auto-generate llms.txt and skill.md, expose an MCP server, serve clean Markdown via content negotiation. The asymmetric world Mintlify is betting on already exists. The teams whose docs are agent-readable get evaluated. The teams whose docs aren’t get skipped.

How Vertical Agents Self-Improve in Production

The AI Runtime — Sat, 02 May 2026 11:03:55 GMT

TL;DR - In regulated verticals — healthcare, legal, insurance, finance — the most reliable way to make a deployed agent better is not a new model. It is a closed loop that turns production failures into harness updates: prompts, tools, sub-agents, memory files, judge rubrics, routing logic. Harvey ran this loop on twelve legal tasks and moved average success from 40.8% to 87.7% with model weights frozen, with complaint drafting going from 2% to 98% rubric coverage. Hippocratic AI reports (vendor-published) clinical accuracy rising from ~80% pre-Polaris to 99.38% in Polaris 3.0 after feeding ~1.85M real patient calls and 307K clinician-reviewed test calls back into the system. Anterior (vendor-published) puts a reference-free LLM-as-judge in front of every prior auth decision, routes only the low-confidence ones to under ten clinicians, and reports 96% F1 at over 100K decisions/day. Microsoft’s Azure SRE Agent moved its Intent-Met score from 45% to 75% on novel incidents by letting the agent investigate its own bugs and submit PRs against its own codebase. The shared pattern is the same six nodes everywhere: trace → judge → cluster → mutate harness → gate → deploy. If you cannot run that loop, you are shipping a frozen artifact in a moving market. Start by instrumenting traces and writing one rubric. The judge and the mutation loop come after.

The frozen-agent problem

A vertical agent that ships at 90% accuracy and stays there is not a 90% accurate system. It is a 90% accurate system at the moment of deployment, decaying.

The decay has three sources. Distribution drift: real patients ramble, real lawyers redline contracts in non-canonical ways, real claims arrive with new denial codes. Policy drift: CMS coverage determinations change, EU AI Act provisions phase in on staggered enforcement timelines, insurer rulesets get rewritten quarterly. Long-tail surface area: the failure modes you didn’t see in eval are the ones production discovers, one in ten thousand at a time. At 100K medical decisions per day, a one-in-ten-thousand subtle hallucination — “suspicious for multiple sclerosis” when the patient has a confirmed MS diagnosis — fires ten times daily.

Agent Improvement

In low-stakes consumer apps you can absorb that. In a vertical where the cost of a single error is a denied surgery, a missed disclosure schedule, or a regulatory finding, you cannot. So the question that defines vertical agent engineering in 2026 is not “which model do we use” — it is “how does this agent get better next week than it is today, without a new base model release, and with the audit trail a regulator will demand.”

The answer that has emerged across legal, healthcare, insurance, and incident response is the same architecture, sometimes given different names. Anthropic’s engineering team and Viv Trivedy refer to it as harness engineering. Microsoft frames it as the agent investigating itself. NVIDIA borrows MAPE-K from autonomic computing and calls it a data flywheel. LangChain calls it the agent improvement loop powered by traces. The mechanics are the same.

The shape of the loop

The loop

Six nodes. Every component carries weight; every break in the chain causes silent degradation.

Production traces are the substrate. Without per-step tool calls, model inputs, model outputs, latency, token counts, and final outcomes, none of the downstream work is possible. LangChain’s formulation is the cleanest: traces come from staging environments, benchmark runs, local development, and especially from production, and they are the input to every subsequent step. The trace store doubles as the audit trail regulators ask for.

Evaluation and judging is where most teams over-rely on offline benchmarks. The shift in 2025–26 has been toward online evaluators that score every production trace — typically an LLM-as-judge augmented with deterministic checks (schema validation, citation existence, tool-call shape) and routed human review on a configurable sample. Anterior’s framing is sharper than most: their judge is reference-free, scoring outputs against guidelines and clinical reasoning rather than a held-out ground truth, because the volume — over 100K decisions a day — makes ground truth impossible to maintain.

Failure clustering is where the leverage is. A pile of low-scored traces is not actionable. Grouping them by failure pattern — “agent missed exhibit B in 30% of due diligence runs,” “agent emits ‘suspicious for X’ on confirmed-X patients,” “agent hits LLM 429s during streaming” — turns symptoms into hypotheses. LangChain runs parallel error-analysis subagents and synthesizes their findings into harness change proposals. Microsoft’s SRE Agent runs a daily monitoring task that searches the last 24 hours of errors, clusters the top hitters, traces each to its root cause, and submits a PR.

Harness mutation is the change itself. We will spend a section on the levers that actually move; for now: most of these changes never touch model weights. They edit the system prompt, add a skill or sub-agent, modify a tool definition, append to a memory file, tighten a routing threshold, or rewrite the judge’s rubric.

Validation gate is the safety check on the hill-climbing. Every proposed harness change runs against a frozen eval set before it ships, and any regression — even on a task the change was not targeting — blocks the merge. Harvey runs this against twelve internal benchmark tasks per iteration; LangChain marks proposed changes that overfit as discarded runs in their iteration log. Without the gate, the loop generates regressions as fast as it generates improvements.

Deploy then closes the cycle. The new harness produces new traces; new traces feed new judges; new clusters drive new mutations. The model is the one piece of this picture that does not change between weekly cycles.
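
Two of the six nodes, failure clustering and the validation gate, are simple enough to sketch directly. Everything here is illustrative: the trace fields, the failure labels, the 0.7 pass score, and the task names are assumptions, and real clustering would group free-text judge feedback rather than count pre-assigned labels.

```python
from collections import Counter

def cluster_failures(traces, pass_score=0.7):
    """Count low-scored traces by the failure label the judge attached."""
    return Counter(t["failure"] for t in traces if t["score"] < pass_score)

def gate(baseline, candidate, tolerance=0.0):
    """Return every frozen-eval task on which the candidate harness regressed."""
    return [
        task for task, score in baseline.items()
        if candidate.get(task, 0.0) < score - tolerance
    ]

def maybe_deploy(baseline, candidate):
    """Block the change if any task regressed, even an untargeted one."""
    regressions = gate(baseline, candidate)
    return ("blocked", regressions) if regressions else ("deploy", [])

# Hypothetical traces and eval scores.
traces = [
    {"id": 1, "score": 0.2, "failure": "missed_exhibit"},
    {"id": 2, "score": 0.9, "failure": None},
    {"id": 3, "score": 0.4, "failure": "missed_exhibit"},
    {"id": 4, "score": 0.5, "failure": "rate_limited"},
]
baseline = {"lease_review": 0.80, "complaint_drafting": 0.60}
candidate_ok = {"lease_review": 0.82, "complaint_drafting": 0.90}
candidate_bad = {"lease_review": 0.75, "complaint_drafting": 0.95}

print(cluster_failures(traces).most_common(1))
print(maybe_deploy(baseline, candidate_ok))
print(maybe_deploy(baseline, candidate_bad))
```

Note that `candidate_bad` is blocked even though it improves the task it was presumably targeting. That is the property the gate exists to enforce.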

The non-obvious property of this loop is what compounds. As Anterior describes it, the loop creates a virtuous improvement cycle where the evaluator itself gets calibrated against human review, and confidence grades from that calibrated evaluator route which cases need humans next time. The judge improves. The clustering improves. The mutations get more targeted. The agent appears to learn — without a single weight changing.

Case 1: Harvey — autoresearch and the rubric ceiling

The cleanest published demonstration is Harvey’s recent autoresearch experiment, summarized externally by Artificial Lawyer. Niko Grupen, Head of Applied Research, ran twelve tasks from Harvey’s internal agent benchmark — commercial lease review, complaint drafting, tax memos, disclosure schedules, due diligence questionnaires — through a loop where an outer agent is allowed to edit the inner agent’s harness based on rubric-graded judge feedback.

The setup: each task ships with source documents, instructions, and a detailed grading rubric. After an attempt, an LLM judge scores against the rubric and produces written feedback on what the agent got right, what it missed, and where its reasoning was wrong. A coding agent reads the judge feedback, clusters the failures, forms a hypothesis about which harness components would help, edits or builds those components — skills, hooks, scripts, sub-agents, not model weights — and reruns.

The result: across all twelve tasks, average success rose from 40.8% to 87.7%. Five of the twelve started in the 2–7% range. After optimization, seven exceeded 90% and one hit 100%. The complaint drafting task is the most striking — it moved from 2% rubric coverage to 98% over a handful of iterations, producing a 164-paragraph complaint with a 33-exhibit list.

Two patterns from Grupen’s log are worth quoting on their own terms. First, the early iterations correct basic structural failures — wrong file types, missing deliverables, weak structure. Later iterations show domain-specific expertise emerging: cross-document issue spotting, risk classification, distinguishing genuinely problematic provisions from market-standard distractors. Second, the ceiling is the rubric. “When the rubric is high quality, the agent can hill-climb surprisingly far.” When it isn’t, the loop stalls.

This generalizes. The same auto-improvement pattern works in a generic coding domain: LangChain’s deepagents-cli moved from 52.8% to 66.5% on Terminal Bench 2.0 — a 13.7-point jump from harness changes alone, with the model fixed at GPT-5.2-Codex. The mechanism is the same trace analyzer skill, parallel error agents, and targeted prompt/tool/middleware changes per iteration.

The Harvey caveat is real and worth surfacing: this is a vendor-run experiment on twelve tasks; it does not yet generalize to all legal work, and it is bound by the quality of the rubrics Harvey wrote. But the directional finding — that harness-layer changes can deliver model-upgrade-sized improvements in a regulated domain — is now hard to dismiss.

Case 2: Hippocratic AI — clinicians as a learning signal at scale

Hippocratic AI’s Polaris is a different shape of the same loop, scaled to a 22-LLM constellation that handles over 10 million real patient calls, backed by a network of 6,234 US-licensed clinicians who review production output.

The vendor-published trajectory across three model generations: pre-Polaris baseline ~80%, Polaris 1.0 at 96.79%, Polaris 2.0 at 98.75%, Polaris 3.0 at 99.38% clinical accuracy, validated under their Real-World Evaluation of Large Language Models in Healthcare framework. The framework leverages 6,234 US-licensed clinicians (5,969 nurses and 265 physicians) evaluating 307,038 unique calls through a three-tier review process: nurse review first, physician adjudication when needed, structured error categorization in between. Errors flagged at any tier feed back into the next iteration’s training and harness.

The subsystem-level numbers tell the more interesting story, because they show what specifically improved between Polaris 2.0 and 3.0 by listening to production:

Health Risk Assessment documentation accuracy: 90.5% → 98.5%
Explanation-of-Benefits policy quoting: 86.4% → 99.4%
Complex appointment scheduling error rate: 8% → 0.5%
Background-noise speech recognition error rate: 9.3% → 2.3%
Clarification engine error rate (gracefully handling unclear patient speech): 16.3% → 2.0%

These aren’t random improvements. They’re the long-tail issues that surfaced once 1.85M patient calls had run through Polaris 1.0 and 2.0 and clinicians had flagged categorical failure modes. Speech recognition fails in noisy environments → train a dedicated background-noise engine. Patients answer HRAs in rambling, context-shifting ways → ship a “deep thinking” model that triple-checks documentation. Policy quotes occasionally drift from source documents → tighten the harness around source attribution.

The honest framing: these are vendor-self-published numbers, and there is no independent third party validating Hippocratic AI’s safety scores. What is independently verifiable is the architecture of the feedback loop — clinician review network, structured error categorization, real-world evidence accumulation across versions — which is now described in the underlying RWE-LLM paper on medRxiv and is replicable by anyone willing to invest in a comparable review apparatus.

Case 3: Anterior — judge first, route smartly, validate the validator

Anterior runs the same loop in healthcare prior authorization, but with three design choices that are worth studying separately because they generalize beyond healthcare.

First, reference-free real-time evaluation. Anterior’s primary system makes a coverage determination by reasoning across unstructured clinical documentation, payer rulesets, and clinical guidelines. A second LLM-as-judge then evaluates the determination against those same guidelines — without needing a held-out ground truth — and produces a confidence grade. Reference-free evaluation matters because at 100K+ decisions a day, no organization can maintain a labeled gold set that keeps up with policy drift.

Second, dynamic case prioritization. The confidence grade combines with contextual factors — procedure cost, bias risk, historical error rates for that procedure category — to decide which cases are sent to human clinicians for review. High-confidence cases auto-resolve; low-confidence and high-stakes cases route to a small clinical team. Anterior reports a team of fewer than ten clinical reviewers handling tens of thousands of cases, against a competitor reportedly employing 800+ nurses for comparable review volume. (Caveat: scope of work may differ. Take the comparison directionally.)
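The routing rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not Anterior's implementation: the `Case` fields, the three thresholds, and the `route` function are all hypothetical, and bias risk is omitted to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    judge_confidence: float     # 0.0-1.0, produced by the LLM judge
    procedure_cost_usd: float   # contextual factor: cost of the procedure
    category_error_rate: float  # historical error rate for this procedure category


# Hypothetical thresholds; real values would be tuned against reviewer capacity.
MIN_CONFIDENCE = 0.90
HIGH_COST_USD = 10_000
MAX_CATEGORY_ERROR = 0.05


def route(case: Case) -> str:
    """Return 'auto_resolve' or 'human_review' for a single determination."""
    high_stakes = (
        case.procedure_cost_usd >= HIGH_COST_USD
        or case.category_error_rate > MAX_CATEGORY_ERROR
    )
    if case.judge_confidence < MIN_CONFIDENCE or high_stakes:
        return "human_review"
    return "auto_resolve"


print(route(Case("c-1", 0.97, 1_200, 0.01)))   # confident and low stakes
print(route(Case("c-2", 0.97, 48_000, 0.01)))  # confident but expensive
print(route(Case("c-3", 0.62, 1_200, 0.01)))   # cheap but uncertain
```

The design point is that confidence alone never auto-resolves a case: a confident judge on an expensive or historically error-prone procedure still goes to a human.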

The third move is the one most teams miss. Anterior runs alignment metrics between the LLM-judge and the human reviewers on cases that get both, and uses that data to validate — and continuously recalibrate — the judge itself. They call this “validating the validator.” It is the missing piece in most LLM-judge deployments. Without it, the judge can drift, and you only learn about it when the harness has been mutating against bad signal for weeks.
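One way to put a number on judge-versus-human alignment is Cohen's kappa, which corrects raw agreement for the agreement you would expect by chance. The source does not say which alignment metric Anterior uses, so treat this as one reasonable choice rather than their method; the labels and sample data are made up for illustration.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same cases."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Probability that both raters pick the same label independently.
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sample: ten cases that received both a judge grade and a human review.
judge_labels = ["approve"] * 8 + ["deny"] * 2
human_labels = ["approve"] * 7 + ["deny"] * 3
print(f"kappa = {cohens_kappa(judge_labels, human_labels):.3f}")
```

Raw agreement in this sample is 90%, but kappa is noticeably lower because both raters approve most cases anyway. Tracked on a rolling window, a falling kappa is an early signal that the judge is drifting.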

Anterior’s vendor-reported numbers: 99.26% accuracy on automated approvals, against 86% baseline human accuracy, with 76% reduction in human review needed and 74% less time per escalated case. Cross-reference with Anterior’s own arXiv paper on fairness evaluation, which reports model error rates across 7,166 human-reviewed cases spanning 27 medical necessity guidelines. Independent validation remains an open need; the 96% F1 figure that has circulated comes from Anterior’s own talks, not a peer-reviewed audit.

The architectural lesson generalizes far past healthcare. Any vertical agent operating at scale where ground truth is expensive — fraud review, AML, KYC, contract triage, claims adjudication, security alert triage — can adopt the same three-part move: reference-free judge in line, dynamic routing on confidence and stakes, alignment metrics that validate the judge against the humans that exist.

Case 4: Azure SRE Agent — when the agent debugs itself

Microsoft’s Azure Site Reliability Engineering Agent handles tens of thousands of incidents weekly for internal Microsoft services and external teams. The team published a remarkably honest engineering retrospective in March 2026 about how they closed their improvement loop.

The starting point: incident resolution rates were climbing toward 50% on highly instrumented scenarios, but the high-performing scenarios all shared a trait. They had been built with heavy human scaffolding: custom response plans, hand-built sub-agents for known failure modes, pre-written log queries exposed as opaque tools. On any new incident class, the agent had nowhere to start. Engineers were reading 50 lower-scored threads a week against an agent handling 10,000. That is debugging at human speed.

The inversion they made: stop pre-computing the answer space. Instead, give the agent a filesystem as its world (source code, runbooks, query schemas, past investigation notes — all files; no SearchCodebase API), context hooks that orient it on what it can access, and frugal context management that keeps long investigations sharp. Three architectural bets, in their words. The result: Intent-Met score on novel incidents — whether the agent’s investigation actually addressed the root cause as judged by the on-call engineer — rose from 45% to 75%.

The closing move is the one to study. They set up a daily monitoring task: the agent searches the last 24 hours for LLM errors — timeouts, 429s, mid-stream failures, malformed payloads — clusters the top hitters, traces each to its root cause in its own codebase, and submits a PR. Engineers review before merging. Over two weeks, errors dropped by more than 80%.
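The "cluster the top hitters" step can be approximated without any model at all. A minimal sketch, assuming errors arrive as plain log strings: the `signature` normalizer, its two regexes, and the sample log lines are invented for illustration and are not taken from Microsoft's post.

```python
import re
from collections import Counter


def signature(message: str) -> str:
    """Collapse volatile details so similar errors share one signature."""
    sig = re.sub(r"\b[0-9a-f]{8,}\b", "[id]", message.lower())  # long hex ids
    sig = re.sub(r"\d+(\.\d+)?(ms|s)\b", "[duration]", sig)     # 30s, 30.5s, 120ms
    return sig


def top_clusters(log_lines, n=3):
    """Return the n most frequent error signatures with their counts."""
    return Counter(signature(line) for line in log_lines).most_common(n)


# Hypothetical last-24-hours sample.
logs = [
    "HTTP 429 rate limited on request 9f8a7b6c5d4e",
    "HTTP 429 rate limited on request 1a2b3c4d5e6f",
    "timeout after 30s calling model",
    "timeout after 45s calling model",
    "timeout after 30.5s calling model",
    "malformed payload: unexpected end of JSON",
]
for sig, count in top_clusters(logs):
    print(count, sig)
```

Each top signature then becomes one investigation: trace it to a root cause, propose a fix, and open a PR for a human to review.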

The agent, in other words, became its own debugger. The harness that runs the SRE agent is now updated by the SRE agent itself, gated by human PR review. The team’s framing is the title of their post: “The agent that investigates itself.” It is not a metaphor.

What actually changes (the levers)

The most under-appreciated property of these loops is what they mutate. Across every case study above, the changes that produced the gains were:

The system prompt and task instructions. ILWS, the “Instruction-Level Weight Shaping” framework, formalizes this: a session-level reflection engine proposes a structured edit to the system prompt — a knowledge delta — that is gated, accepted only if a sliding-window quality rating improves with statistical significance, and rolled back otherwise. Most production teams do this informally. Formalizing it gives you reversibility under governance, which regulators ask for.
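The accept-or-roll-back gate can be sketched with nothing but the standard library. This is an assumption-heavy illustration, not the ILWS implementation: the source does not specify the statistical test, so a one-sided permutation test stands in for it, and `gate_prompt_edit`, the window contents, and the 0.05 threshold are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(baseline, candidate, n_resamples=5000, seed=0):
    """One-sided p-value for the hypothesis mean(candidate) > mean(baseline)."""
    rng = random.Random(seed)  # seeded so the gate is reproducible
    observed = mean(candidate) - mean(baseline)
    pooled = list(baseline) + list(candidate)
    k = len(candidate)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


def gate_prompt_edit(current_prompt, proposed_prompt,
                     baseline_ratings, candidate_ratings, alpha=0.05):
    """Keep the proposed prompt only if its rating window is significantly better."""
    improved = mean(candidate_ratings) > mean(baseline_ratings)
    if improved and permutation_p_value(baseline_ratings, candidate_ratings) < alpha:
        return proposed_prompt
    return current_prompt  # roll back


# Hypothetical 1-5 quality ratings from two sliding windows of sessions.
before = [3, 3, 4, 3, 3, 4, 3, 3, 4, 3]
after = [5, 4, 5, 5, 4, 5, 5, 4, 5, 5]
print(gate_prompt_edit("prompt v1", "prompt v2", before, after))
```

Because the function returns a prompt rather than mutating state, rollback is the default path: an edit that fails the gate simply never replaces the current version.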

Tool definitions and skills. LangChain’s improvement was largely middleware: a LocalContextMiddleware that maps the working directory and onboards the agent into its environment, a LoopDetectionMiddleware that intercepts repeated edits to the same file and forces a plan reconsideration, a PreCompletionChecklistMiddleware that blocks the agent from exiting before it runs a verification pass. None of these are model changes. All are tool-and-hook surface.
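To make the loop-detection idea concrete, here is a small stand-in written from the description above. It is not LangChain's middleware API: `EditLoopDetector`, the `edit_file` tool name, and the per-file limit are hypothetical, and a real hook would be wired into whatever tool-call interception point your framework exposes.

```python
from collections import Counter


class EditLoopDetector:
    """Hypothetical stand-in for a loop-detection hook; not LangChain's API."""

    def __init__(self, max_edits_per_file: int = 3):
        self.max_edits_per_file = max_edits_per_file
        self.edit_counts = Counter()

    def on_tool_call(self, tool_name: str, path: str):
        """Return a nudge message once a file has been edited too often, else None."""
        if tool_name != "edit_file":
            return None
        self.edit_counts[path] += 1
        count = self.edit_counts[path]
        if count > self.max_edits_per_file:
            return (
                f"You have edited {path} {count} times. "
                "Stop and reconsider your plan before editing again."
            )
        return None


detector = EditLoopDetector(max_edits_per_file=2)
for _ in range(3):
    nudge = detector.on_tool_call("edit_file", "src/parser.py")
print(nudge)
```

The returned message would be injected into the agent's context in place of, or alongside, the tool result. Nothing about the model changes; the harness just refuses to let the same mistake repeat silently.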

Memory and knowledge files. Microsoft replaced their RAG-over-past-sessions memory with structured Markdown files the agent reads and writes through its standard tool interface: overview.md, team.md, logs.md, debugging.md. The model navigates memory by following links, not by retrieving via embedding similarity. This is the “repo is the schema” insight. Memory becomes a writable artifact that future runs read.

Sub-agents and routing. Anterior routes by confidence × stakes. Azure SRE spawns parallel sub-agents per hypothesis when a single context is at risk of getting polluted. Hippocratic uses a 21-model supervisory constellation around a primary conversational model. None of these compositions require retraining the underlying weights; they require designing the orchestration layer.

Judge rubrics. The Harvey ceiling is the rubric ceiling. The Anterior calibration is the judge alignment with humans. The fastest leverage in most teams’ first improvement loop is not a fancier judge — it is a better-written rubric and a small humans-vs-judge alignment dataset.

Fine-tuning the small models in the harness. Sometimes weights do change, but on the components, not the primary model. NVIDIA NeMo’s case study on an enterprise data flywheel is the example: a routing model was moved from Llama 3.1 70B to a fine-tuned Llama 3.1 8B variant and reached 96% accuracy with a 10× model size reduction and a 70% latency improvement. The query rephrasal model gained 3.7% accuracy with a 40% latency cut. The orchestrating LLM was untouched.

The pattern is consistent: when you map “improvements shipped” against “components that changed” across these case studies, the primary reasoning model is the least common thing that gets edited. The harness layer carries the weight.

Where these loops break

Six failure modes show up repeatedly. None are theoretical; each one has burned at least one of the case studies above.

Overfitting to recent failures. Aggregate harness changes against last week’s top errors and you regress on tasks the change wasn’t targeting. LangChain’s iteration log explicitly marks these as discarded runs. Without a frozen eval set that the validation gate runs every mutation against, you’ll fix Monday’s bug and silently break Tuesday’s working flow.
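A validation gate against a frozen eval set can be very small. The sketch below assumes per-task scores in the range 0 to 1 from an offline eval run; the function name, the task IDs, and the 0.02 regression budget are illustrative choices, not values from any of the case studies.

```python
def passes_regression_gate(baseline, candidate, target_tasks, max_regression=0.02):
    """Check a harness mutation against a frozen eval set.

    baseline, candidate: dicts mapping task_id -> score in [0, 1].
    target_tasks: the tasks the mutation was meant to improve.
    Returns (passed, regressions), where regressions maps each untargeted
    task that dropped by more than max_regression to the size of the drop.
    """
    missing = set(baseline) - set(candidate)
    if missing:
        raise ValueError(f"candidate run is missing tasks: {sorted(missing)}")
    regressions = {
        task: baseline[task] - candidate[task]
        for task in baseline
        if task not in target_tasks and baseline[task] - candidate[task] > max_regression
    }
    improved = any(candidate[task] > baseline[task] for task in target_tasks)
    return improved and not regressions, regressions


# Hypothetical frozen eval set with three tasks.
baseline = {"refund_flow": 0.70, "address_change": 0.90, "cancel_order": 0.85}
good_fix = {"refund_flow": 0.82, "address_change": 0.90, "cancel_order": 0.84}
bad_fix = {"refund_flow": 0.82, "address_change": 0.90, "cancel_order": 0.70}

print(passes_regression_gate(baseline, good_fix, {"refund_flow"}))
print(passes_regression_gate(baseline, bad_fix, {"refund_flow"}))
```

The second candidate fixes the targeted task just as well as the first, but it is rejected because an untargeted task fell outside the budget. That is exactly the Monday-fix, Tuesday-break case.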

Reward hacking against the rubric. When the agent edits its own harness against an LLM judge’s scoring, the judge’s scoring is the optimization target — including any blind spots in the rubric. Harvey caveats this directly: the improvements track the rubric, and the rubric is human-authored and incomplete. Periodic out-of-distribution evals from a separate judge with a separate rubric catch this.

Judge drift and validator fragility. Anterior’s validate-the-validator move exists because LLM-judges drift, and the drift is silent. If the judge is the substrate for routing, clustering, and mutation decisions, judge drift propagates everywhere. Measuring alignment against humans on a rolling sample of cases is the only known fix.

Memory staleness. Microsoft flagged this as their unsolved problem: when two sessions write conflicting patterns to debugging.md, the model has to reconcile them; when a service changes behavior, old memory entries become misleading. Timestamps and explicit deprecation help, but no production team has solved this systematically.

Privacy and regulatory constraints on production data. Healthcare and finance can’t freely route production traces into a learning loop the way a generic SaaS product can. The TikTok Pay ARIA paper handles this by having the agent self-identify uncertainty through structured self-dialogue and request targeted explanations from human experts at runtime, keeping learning at test time inside the regulatory boundary. Hippocratic uses synthetic test calls plus consented real-call evidence; Anterior keeps clinician review and AI determination in the same compliance perimeter.

Compounding errors when the validator itself fails. A bad judge calibrated against a small alignment set drifts. A bad alignment set lets the judge calibrate against itself. A bad clustering layer groups the wrong failures together. Each layer of the loop is a place errors can go undetected and propagate. The defense is treating every layer as an evaluable artifact — the judge has a precision/recall, the cluster labels have inter-rater agreement, the harness mutations have a regression budget.

The seventh failure mode, which is institutional rather than technical: nobody owns the loop. In every case study above, the loop is owned by a named team with a named lead — Grupen at Harvey, Mukherjee at Hippocratic, Mehta and team at Microsoft. Loops without owners decay quietly.

Build order

If you’re standing up a vertical agent and don’t yet have this loop, build it in the order below; each step depends on the one before it. None of the steps require the next-generation model.

Start with traces. Every tool call, every model input, every model output, every latency, every outcome, with a stable trace ID per session. If you can’t reconstruct what happened, none of the rest of the loop works. LangSmith, Arize Phoenix, Braintrust, and OpenTelemetry-based stacks all do this; pick one and instrument every call path before anything else.
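Before picking a vendor, it helps to see how little a trace record needs to contain. A minimal sketch using only the standard library: `TraceEvent`, `TraceSession`, and the field names are hypothetical and do not mirror the schema of LangSmith, Phoenix, Braintrust, or OpenTelemetry.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    trace_id: str       # stable for the whole session
    kind: str           # "model_call", "tool_call", or "outcome"
    name: str           # model or tool name
    input_text: str
    output_text: str
    latency_ms: float
    timestamp: float = field(default_factory=time.time)


class TraceSession:
    """Collects every event in one session under a single trace ID."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.events = []

    def record(self, kind, name, input_text, output_text, latency_ms):
        event = TraceEvent(self.trace_id, kind, name, input_text, output_text, latency_ms)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        """Serialize the session as one JSON object per line."""
        return "\n".join(json.dumps(asdict(event)) for event in self.events)


session = TraceSession()
session.record("model_call", "planner", "user asks for refund", "call lookup_order", 412.0)
session.record("tool_call", "lookup_order", "order 1182", "status: delivered", 38.5)
print(session.to_jsonl())
```

If a session can be replayed from these lines alone, the rest of the loop (rubric scoring, judging, clustering) has something to stand on.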

Then write one rubric for one task. Not a benchmark suite. One task that matters, one rubric that an expert in your domain would sign off on. Score 50 production traces against it manually. The rubric you ship will be wrong in instructive ways; the act of writing and applying it surfaces the failure modes you didn’t know you had.

Add a judge against that rubric. Run it inline on a sample of production. Run it against the 50 you scored manually. Compute alignment. If alignment is below ~70%, the rubric is the problem, not the judge.

Add the clustering and mutation step last. Cluster the lowest-scored traces, propose one harness change, gate it against your offline eval, ship if it passes, measure the production effect. This is one cycle. Run it weekly.

The model upgrade question takes care of itself once the loop is running. When a better base model ships, you swap it in, rerun the validation gate, and observe whether your harness over-fits to the old model. (Different models reward different harnesses — Claude Opus 4.6 scored 59.6% with a harness tuned for GPT-5.2-Codex on Terminal Bench 2.0; the same Claude with its own harness moved several positions.) The harness tax of switching models is real, but it’s a calibration problem, not a foundational one.

The reason this matters now and not in twelve months is asymmetry. Vertical agent winners in 2026 will not be the teams with the best zero-shot model. They will be the teams whose deployed agents are quietly compounding skill every week the rest of the market sits frozen. The loop is the moat.

Build the trace store this week. Write the first rubric next week. The rest of it follows.

Felix Is a Harness, Not a Model: How Rogo Built an Agent for High Finance

The AI Runtime — Fri, 01 May 2026 11:03:46 GMT

TL;DR - Rogo serves more than 35,000 professionals at over 250 institutions — Rothschild & Co, Jefferies, Lazard, Moelis, Nomura — with an AI agent called Felix that bankers email like a junior analyst and get back finished decks, models, and memos. The interesting part is not the model. Rogo’s own product team calls Felix their “agent harness” — a vertical scaffolding designed to be model-agnostic across GPT 5.5, Claude Opus 4.7, and Gemini. Felix is the playbook for vertical AI: the moat is the harness, the evals, the data integrations, and the deployment model — not which frontier LLM is wired in this quarter. If you are building a vertical agent, study how Rogo decomposed the problem before you pick a model.

What Rogo Actually Sells

A precision note first: when people say “banking” in this conversation, they don’t mean retail or commercial banking. Rogo sits inside high finance — investment banking, private equity, hedge funds, equity research, asset management. Rogo’s own product page explicitly calls out its three audiences: Banking, Private Markets, Public Markets. The workflows are deal-shaped: pitchbooks, comps, models, memos, CIMs, diligence trackers.

Rogo was founded by Gabriel Stengel and John Willett — both ex-investment-bankers (Lazard, J.P. Morgan, Barclays) — with Tumas Rackaitis. That founder profile matters because the company’s edge is not the LLM; it is the granular, painful familiarity with what a 2 AM CIM revision actually looks like.

Felix Architecture

Yesterday’s $160M Series D, led by Kleiner Perkins with participation from Sequoia, Thrive, Khosla, and J.P. Morgan Growth Equity Partners, brings total funding past $300M. The capital is going toward two things that tell you what they actually believe: deeper data integrations and more forward-deployed bankers embedded inside client institutions.

Felix Is a Harness, Not a Model

The single most useful sentence Rogo has published this year shows up in their GPT 5.5 release note: “we’ve begun incorporating GPT 5.5 into our agent harness, Felix.” Read that twice.

Felix is not a fine-tuned model. Felix is the harness — the orchestration scaffold, tool layer, citation system, output formatters, audit trail, and policy controls — into which Rogo plugs whichever frontier model performs best on their internal benchmark this week. They are explicit that they are model-agnostic across OpenAI, Google, and Anthropic, and TAMradar’s coverage notes the platform supports GPT 5.5 and Anthropic Opus 4.7 concurrently.

This separation is load-bearing. In the Model Reliability Engineering frame, the harness is one of the two reliability axes: the scaffolding you build around the model to make its behavior production-safe. Rogo’s product team uses the word the same way. The implication for builders: when frontier labs ship a 4% improvement on your domain, you swap the engine; when they ship a 40% improvement two years from now, your harness is what survives.

Here is the rough shape of what’s inside Felix:

Detail belongs in the prose, not the diagram. Three components below carry the real weight.

The Email Interface Is the Real Interface

The product surface that ships with Felix is unusual: bankers send Felix an email the same way they would a colleague, get an acknowledgment in under a minute with an ETA, and receive PowerPoint, Excel, Word, and PDF deliverables back when ready. Iteration happens by replying to the email thread.

This is not a UX gimmick. It tells you something about how the team thinks about adoption. Investment bankers already live in Outlook. Asking them to adopt a new interface is a tax. Email-as-API removes the tax. It also imposes async semantics on the agent: a long-running task with intermediate status, observable state via the inbox, and a clean handoff back to the human reviewer. The harness has to absorb that asynchrony — request queuing, intermediate progress, partial results, source attribution surviving the round-trip — without leaking it back to the user.

The output substrate matters too. Felix returns work in Excel, PowerPoint, and Word formatted in the firm’s own templates and house style. A pitchbook that doesn’t match house formatting is not 90% done; it is 0% done. Vertical AI rises or falls on output substrate fidelity.

The Big Finance Benchmark: Vertical Evals Are the Moat

Rogo curates an internal evaluation set called the Big Finance Benchmark — real financial tasks designed by their ex-finance team. Tasks include valuing companies, benchmarking peers on specific metrics, and building theses across disparate documents. They are explicit that these come from real workflows, not synthetic prompts.

This is the unsexy infrastructure that compounds. When OpenAI ships GPT 5.6 next quarter, Rogo will know within a day whether it improves CIM drafting on real deals or just MMLU. That is the kind of judgment a horizontal benchmark cannot give you. Every serious vertical AI company will need its own version of this. If you are building one and you don’t have a domain-specific eval suite, you are flying without instruments.

Workflow Surface: What Felix Actually Does

The concrete capabilities Rogo has shipped span deal screening, CIM generation, buyer outreach, and data room diligence. Decomposed:

Deal screening. Filtering thousands of potential targets against thesis criteria.
CIM generation. Drafting Confidential Information Memoranda — the 50-to-100-page sell-side documents that anchor M&A processes.
Buyer outreach. Generating personalized contact lists and initial communications.
Data room diligence. Synthesizing across the document piles that buyers and bankers wade through.
Comps and models. Building Excel spreadsheets with historical financials and forward forecasts.
Pitchbooks and memos. Decks for a CEO meeting, memos for an investment committee.

SiliconANGLE’s coverage notes that Felix can also offer to keep a report current — for example, an analyst covering Apple can have the agent re-run the report each time the company reports earnings. Scheduled, recurring agent runs are part of the surface.

The data substrate behind these tasks is extensive. TAMradar lists integrations with PitchBook, LSEG, Cap IQ, FactSet, Fitch Solutions, and Third Bridge, plus internal CRM and SharePoint connectors. Auditable outputs are positioned for SOC 2, ISO 27001, GDPR, and EU AI Act compliance — the table-stakes regulatory surface for institutional finance.

Sisyphus: The Other Harness

The most under-covered part of Rogo’s stack is a second internal agent called Sisyphus — an autonomous offensive-security agent that pen-tests Rogo’s own infrastructure once or twice a day, calibrated to deployment cadence. It runs structured campaigns across authentication abuse, authorization bypass, injection, SSRF, and LLM-specific exploit categories, and it chains findings to validate exploitability rather than just flagging signals.

Two numbers from Rogo’s own writeup are worth remembering. One week after a third-party penetration test, Sisyphus identified 18 additional exploitable vulnerabilities in a single afternoon, most chained, all remediated within hours. And on calibration: high-confidence findings now carry a >95% true-positive rate after the team tuned the recon phase and compared the agent’s triage against their human security team.

This is the harness for the harness. If your vertical agent platform handles consequential workflows, “we get pen-tested twice a year” is not a posture; it is a vulnerability window. Sisyphus is what the security side of vertical AI starts to look like.

Forward-Deployed Bankers: The Human Harness

Rogo’s go-to-market is structured around an embedded role they call Forward Deployed Bankers — ex-bankers from top firms who sit inside client institutions and onboard teams from analyst to managing director. The new capital is funding expansion of this team from New York into London.

This is not professional services in disguise. It is closer to what Palantir built for defense and intelligence: domain-fluent humans who translate between the workflow and the platform, calibrate the agent’s outputs to firm-specific style, and surface workflow gaps that become product. They understand model formatting and how a positioning section actually reads. Without them, the harness loses ground truth on what “good” looks like inside each firm’s house style.

For builders: the lesson is that adoption inside regulated, high-status industries is bottlenecked on trust transfer, not feature parity. The forward-deployed model is expensive and it is a moat.

What’s Actually Being Transformed

Bankers do not get replaced; their pyramid does. Rogo’s Series D announcement is explicit that leading firms are “restructuring workflows, rethinking staffing pyramids, and deploying autonomous agents that work asynchronously across every transaction.” A managing director at one client described Felix as having tripled team output with no headcount additions. That is the shape of the transformation: same senior judgment layer, compressed junior layer, agent layer doing the asynchronous grunt work, forward-deployed bankers tuning the seams.

Rogo’s two recent acquisitions tell you where they are aiming next. Plux AI — a UK firm tracking complex financial market developments — adds European market coverage. Offset, an AI agent company whose tech automatically updates financial models when new information arrives, plugs directly into the live-model side of the harness.

Five Lessons If You Are Building a Vertical Agent

The harness is the moat, not the model. Build it so frontier-model upgrades are a config change, not a rewrite.
Domain-specific evals beat horizontal benchmarks. Curate real tasks from real practitioners. Run them every model release.
Output substrate must match the destination workflow. A correct answer in the wrong format is the wrong answer.
Forward deployment changes adoption math. Domain-fluent humans embedded in the customer org are a feature, not overhead.
Security needs its own harness. When agents do consequential work, periodic pen tests leave a window. Continuous adversarial testing is the new floor.

What to Do This Week

Pick one workflow you’ve watched a domain expert do that you suspect an agent could absorb. Don’t model it yet. Instead, write down four things: the data sources they pull from, the output format they hand back, the audit trail they leave, and the colleague they email when they get stuck. Those four are your harness specification. The model goes in the middle of that, and you can swap it out next quarter.

If your current agent prototype only handles one or two of those four, you have not built a harness yet. You have built a wrapper.

Privacy Filter Is Not an LLM

The AI Runtime — Wed, 29 Apr 2026 11:44:46 GMT

TL;DR - OpenAI released Privacy Filter on April 22, 2026 — an Apache 2.0, 1.5B-parameter (50M active) model for detecting and masking eight categories of personally identifiable information. The headline is the 96% F1 score on PII-Masking-300k. The actual story is the architecture: Privacy Filter takes a gpt-oss autoregressive checkpoint, swaps its language-modeling head for a token-classification head, and post-trains it as a bidirectional banded-attention classifier with BIOES span decoding. It labels every token in a single forward pass instead of generating one. That single design decision is why it runs in a browser, supports 128K context without chunking, and is designed for high-throughput data sanitization workflows. But the 96% F1 is on synthetic data — a third-party benchmark by Tonic.ai (a competing redaction vendor) on real EHR notes and web crawls puts F1 between 0.18 and 0.65 at default settings, almost entirely as a recall problem. Treat Privacy Filter as a fine-tuning starting point and a precision-tuned default, not a drop-in production redactor — and notice that Anthropic, despite having every reason to ship something equivalent, has not.

The architecture: a generative model with its head replaced

Most coverage describes Privacy Filter as “a small open-weight model for PII detection.” That misses the interesting part. Privacy Filter is not a small LLM that happens to do classification. It is structurally a different model class.

Privacy Filter

The base checkpoint is a gpt-oss-style decoder pretrained autoregressively. OpenAI then performs three modifications to convert it into a classifier:

Replace the head. The language-modeling head is removed and a token-classification head is bolted on, emitting 33 logits per token (1 background class plus 8 PII categories × 4 BIOES boundary tags).
Switch attention from causal to bidirectional banded. Each token now attends to a window of 128 tokens on each side (effective receptive field: 257 tokens including itself), in both directions. The causal mask — the thing that makes a model “generative” — is gone.
Post-train with supervised classification loss. No next-token prediction. The objective is BIOES tag accuracy on a privacy-labeled dataset (the public PII-Masking-300k corpus plus synthetic data, augmented with model-assisted annotation review).

The retained pieces are also informative: grouped-query attention (14 query heads, 2 KV heads), rotary positional embeddings, and a sparse mixture-of-experts feed-forward block. The MoE is what gives the 50M-active-out-of-1.5B-total figure. Only a small fraction of weights actually fire on any single forward pass, which is what makes CPU inference viable.
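The two headline numbers in the modifications above, 33 logits and a 257-token receptive field, follow directly from the design. A short sketch that rebuilds both, assuming placeholder category names (the text gives the count of eight, not the names) and a hypothetical `attends` helper for the banded mask:

```python
from itertools import product

# Placeholder names: the text states there are eight categories but this
# sketch does not assume what they are called.
CATEGORIES = [f"CATEGORY_{i}" for i in range(1, 9)]
BOUNDARY_TAGS = ["B", "I", "E", "S"]  # Begin, Inside, End, Single
BACKGROUND = "O"

# One background class plus every (category, boundary tag) combination.
LABELS = [BACKGROUND] + [f"{tag}-{cat}" for cat, tag in product(CATEGORIES, BOUNDARY_TAGS)]

WINDOW_PER_SIDE = 128
receptive_field = 2 * WINDOW_PER_SIDE + 1  # both sides plus the token itself


def attends(i: int, j: int, window: int = WINDOW_PER_SIDE) -> bool:
    """Banded bidirectional mask: token i sees token j iff they are within the window."""
    return abs(i - j) <= window


print(len(LABELS), receptive_field)
```

Contrast `attends` with a causal mask, which would be `j <= i`: the banded version looks both left and right, which is what lets a classifier use the words after a name to decide whether it is a name at all.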

The Architecture

The decoder is the other piece worth surfacing. Per-token classifications produce incoherent spans on their own: “John” tagged as begin-name, the next token tagged as begin-address, and so on. To prevent that, Privacy Filter applies constrained Viterbi decoding over the BIOES transition graph. Begin must be followed by Inside or End. End cannot transition to Inside. Single is its own one-token span. The decoder enforces these transitions globally over the sequence, so the output is always a clean set of contiguous spans.
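Constrained Viterbi decoding is easier to trust once you have seen it repair a bad greedy labeling. The sketch below is a simplified illustration, not OpenAI's decoder: it uses a single PII category (five labels instead of 33), treats all allowed transitions as equally likely, and the per-token log-scores are invented.

```python
import math

LABELS = ["O", "B", "I", "E", "S"]

# Allowed next labels in a single-category BIOES scheme.
ALLOWED = {
    "O": ("O", "B", "S"),
    "B": ("I", "E"),
    "I": ("I", "E"),
    "E": ("O", "B", "S"),
    "S": ("O", "B", "S"),
}
VALID_START = ("O", "B", "S")  # a sequence cannot open inside a span
VALID_END = ("O", "E", "S")    # or end with a span still open


def constrained_viterbi(scores):
    """scores: one dict per token mapping label -> log-score.

    Returns the highest-scoring label sequence that obeys ALLOWED.
    """
    best = {l: (scores[0][l] if l in VALID_START else -math.inf) for l in LABELS}
    backpointers = []
    for token_scores in scores[1:]:
        step, pointers = {}, {}
        for cur in LABELS:
            # Best previous label among those allowed to precede cur.
            prev = max((p for p in LABELS if cur in ALLOWED[p]), key=lambda p: best[p])
            step[cur] = best[prev] + token_scores[cur]
            pointers[cur] = prev
        best = step
        backpointers.append(pointers)
    last = max(VALID_END, key=lambda l: best[l])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Invented log-scores for the tokens "John", "Smith", "called".
scores = [
    {"O": -3.0, "B": -0.2, "I": -5.0, "E": -4.0, "S": -2.0},
    {"O": -3.0, "B": -0.5, "I": -4.0, "E": -0.9, "S": -3.0},
    {"O": -0.1, "B": -4.0, "I": -5.0, "E": -4.0, "S": -4.0},
]
greedy = [max(LABELS, key=s.get) for s in scores]
print("greedy:     ", greedy)
print("constrained:", constrained_viterbi(scores))
```

Greedy decoding picks Begin for both name tokens, which is not a valid span. The constrained decoder gives up a little local score on the second token to produce Begin, End, Outside: one clean two-token span.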

This architecture is not novel by NLP standards — BIOES tagging and Viterbi decoding date back to pre-transformer NER systems. What is novel is using a frontier-quality pretrained generative model as the substrate, then surgically retargeting its head and attention pattern for a different objective. The world model the autoregressive pretraining gave the network — the contextual sense of when “Alice” is a literary character versus a person in a customer email — is preserved. That world model is what classical Presidio-style regex-plus-NER doesn’t have, and it is the entire reason Privacy Filter outperforms rule-based systems on ambiguous spans.

Why the architecture matters in production

Three properties fall out of this design that an LLM-based redactor wouldn’t have.

Single-pass labeling. A 128K-token document is processed once. There is no autoregressive decoding loop over the output, no chain-of-thought reasoning, no JSON parsing of the result. OpenAI describes the model as designed for high-throughput data sanitization workflows but does not publish specific tokens-per-second numbers; the architecture’s single-forward-pass design is what enables a sanitization-on-every-prompt deployment pattern even at modest hardware budgets.

No prompt engineering surface. A generative model used for classification has prompts, which means it has prompt injection risk. A token classifier has neither. There is no instruction the input can override.

Adjustable precision/recall via the decoder, not the weights. OpenAI exposes the Viterbi transition biases as runtime knobs. You can shift the operating point toward higher recall without retraining, just by re-tuning decoder priors.

The flip side is genuine: token classifiers cannot reason about context the way an LLM can. They cannot rewrite, synthesize, or follow a custom redaction policy (“redact only PII belonging to non-employees”). Privacy Filter does what it does and nothing else.

The 96% F1 trap

The PII-Masking-300k benchmark is a synthetic corpus generated specifically to evaluate PII-masking systems. OpenAI reports F1 = 96% on the original (94.04% precision, 98.04% recall) and 97.43% on a corrected version where they fixed annotation errors. Both numbers are real and reproducible.

They are also nearly useless as a production signal.

Tonic.ai — itself a vendor of competing redaction tooling — published a benchmark within days of release, running Privacy Filter against four real-world test groups: electronic health record notes, call-center transcripts, loan contracts, and web crawls. Their methodology is transparent (token-level evaluation projected to Privacy Filter’s 8-class taxonomy on 500+ documents) and the comparison product is their own. With those caveats noted: Privacy Filter’s F1 ranged from 0.18 to 0.65 at default settings. Tonic’s purpose-built redactor scored 0.92–0.99 on the same data. Precision was comparable across both systems (around 0.77–0.85 for Privacy Filter). The gap was almost entirely recall: on web-crawl PII, default recall was 10%; on EHR notes, 38%.
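The F1 figures are worth recomputing, because the arithmetic shows why recall is the whole story. The first call uses the precision and recall OpenAI reports. The other two are assumptions for illustration: precision is fixed at 0.80 (inside the 0.77 to 0.85 range quoted above, but not a per-dataset figure from the benchmark) and paired with the reported recall values.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


synthetic = f1(0.9404, 0.9804)  # reported precision and recall on PII-Masking-300k
web_crawl = f1(0.80, 0.10)      # assumed 0.80 precision, reported 10% recall
ehr_notes = f1(0.80, 0.38)      # assumed 0.80 precision, reported 38% recall

print(f"synthetic: {synthetic:.2f}")
print(f"web crawl: {web_crawl:.2f}")
print(f"EHR notes: {ehr_notes:.2f}")
```

With precision held near 0.8, low recall alone is enough to pull F1 down into the range Tonic reports.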

Two things explain this. First, OpenAI ships Privacy Filter with a precision-tuned default operating point. Over-redaction destroys downstream utility, and the company chose to under-flag rather than over-flag. The Viterbi knobs can recover most of the gap, but at the cost of multiplying total predictions roughly 5× — with a corresponding hit to precision on common words like “our” and “please.” Second, real-world PII has a long tail of formats — international phone numbers, forum-handle-style usernames, obfuscated contact blocks, region-specific identifiers — that the default eight-category taxonomy doesn’t even attempt to cover. SSNs, MRNs, NHS numbers, and Brazilian CPFs are not in the default label set.

Fine-tuning closes the gap. OpenAI’s own announcement reports fine-tuning improves F1 from 54% to 96% on a domain-adaptation benchmark and approaches saturation, and the model card explicitly recommends task-specific fine-tuning when policy differs from base boundaries. The lesson: Privacy Filter’s value as a base model is real. Its value as a drop-in production redactor at default settings is not.

Where Anthropic fits — and conspicuously doesn’t

Anthropic does not ship anything equivalent to Privacy Filter. There is no open-weight Anthropic PII detector. There is no Claude API endpoint specifically for PII redaction. The Constitutional Classifiers Anthropic has written about, including the more recent two-stage cascade with activation probes, are jailbreak and CBRN safety filters, scanning for harmful intent rather than personal data. They are also closed-weight and operated only inside Anthropic’s own deployment.

This is a structural difference between the two labs in 2026. OpenAI now maintains an open-weight model family (gpt-oss-20b, gpt-oss-120b, and now Privacy Filter as a derivative). Anthropic does not. For an engineering team using Claude in a regulated environment — healthcare, legal, financial — there is no first-party path to local PII filtering on Claude’s own infrastructure. The viable options are:

Run Privacy Filter or Presidio in front of Claude as a proxy. This is what community tooling like the Claude Privacy Tool already does — it intercepts prompts locally, swaps PII for placeholders using OpenAI’s open-weight model, sends the masked version to Claude, and re-substitutes on the way back.
Use a commercial proxy. Tools like Grepture or Tonic Textual sit between the client and the Claude API, performing token-level redaction with a reversible token map.
Build it in-app. Open issues like anthropics/claude-code#29434 are explicitly requesting a first-party redaction hook in Claude Code so secrets and PII don’t enter the context window in the first place.
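The proxy pattern in the first option can be sketched in a few lines. This is a minimal illustration, not how any of the named tools work internally: the two regexes are a hypothetical stand-in for a real detector (Privacy Filter, Presidio, or a fine-tuned model), and the `[LABEL_N]` placeholder format and the `mask` / `unmask` names are assumptions made for the example.

```python
import re
from typing import Callable

# Hypothetical stand-in detector: a real deployment would call a learned
# PII model here instead of two regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask(text: str) -> tuple[str, dict[str, str]]:
    """Swap detected PII for placeholders; return masked text and the token map."""
    token_map: dict[str, str] = {}
    counters: dict[str, int] = {}

    def make_sub(label: str) -> Callable[[re.Match], str]:
        def sub(match: re.Match) -> str:
            counters[label] = counters.get(label, 0) + 1
            placeholder = f"[{label}_{counters[label]}]"
            token_map[placeholder] = match.group(0)
            return placeholder
        return sub

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_sub(label), text)
    return text, token_map


def unmask(text: str, token_map: dict[str, str]) -> str:
    """Re-substitute the original values on the way back from the LLM."""
    for placeholder, original in token_map.items():
        text = text.replace(placeholder, original)
    return text


masked, token_map = mask("Mail jane@example.com or call +1 617 555 0100.")
# `masked` is what would be sent to the remote model; `token_map` stays local.
restored = unmask(masked, token_map)
```

The token map never leaves the client, which is the whole point of the pattern: the remote model only ever sees placeholders.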

The strategic reading: OpenAI is positioning small, specialized open-weight models — what’s worth calling safety SLMs — as infrastructure they want the broader ecosystem to standardize on. Anthropic’s safety story is built around training-time alignment plus closed classifiers integrated tightly into Claude itself. Both are legitimate strategies. Only one of them gives you a model you can run locally.

The alternatives landscape

For teams evaluating PII redaction in 2026, Privacy Filter joins a crowded field. The relevant tradeoffs:

Microsoft Presidio is open source, mature, and combines regex pattern recognizers, spaCy-based NER, and contextual checks. It supports more languages out of the box than Privacy Filter and ships with image and structured-data redactors that Privacy Filter lacks. Its weakness is exactly where Privacy Filter is strong: ambiguous, contextual PII that requires language understanding rather than pattern matching, since its defaults rely heavily on regex and pre-trained NER models rather than purpose-trained PII classification.

AWS Comprehend is a managed cloud API with no on-prem option. AWS’s docs state that PII detection supports English or Spanish text documents only. It is a reasonable pick only if your data is already in AWS and your sensitivity tolerance allows cross-network calls.

Google Cloud Sensitive Data Protection (formerly DLP) has the broadest taxonomy — over 200 built-in infoType detectors — but is also cloud-only and the most complex to configure.

Private AI is the commercial purpose-built option. The vendor publishes its own benchmark showing it leading on recall across domains, with multilingual support and a containerized on-prem deployment path. Treat the numbers as vendor-published rather than independent.

Tonic Textual is the production-trained option for teams with real customer data — its head-to-head against Privacy Filter is the only public comparison on non-synthetic corpora to date.

The architectural takeaway across these options: Privacy Filter is the first frontier-lab open-weight entry into a category that has been dominated by closed cloud APIs and SDK-based regex-NER hybrids. Its long-term value is probably less as a finished tool and more as a base checkpoint that shifts the ecosystem from rule-based to learned context-aware redaction.

What this means for your stack

If you are building production AI features today and PII handling is part of the threat model, three concrete decisions follow.

First, decide where redaction lives in your pipeline. The two viable spots are at-source — a proxy or hook that scrubs prompts before they reach any LLM API — and in-batch — a sanitization pass on training data, logs, and indexed corpora before they reach a vector store. These have different operating-point requirements. At-source needs low latency and reversibility (the token-to-real-value map persists for the session). In-batch can be slower, can run in parallel, and is one-way.

Second, do not adopt Privacy Filter at default settings if your data doesn’t look like PII-Masking-300k. Either fine-tune on a few hundred to a few thousand domain examples, or tune the Viterbi knobs aggressively and accept the precision hit, or run Privacy Filter as one detector among several with rule-based and pattern-based detectors filling the gaps. The eight-category taxonomy is also static — if your domain has SSNs, MRNs, NHS numbers, or non-US tax IDs, you will need to fine-tune to add those classes.

Third, reversibility is the real production problem, not detection. If your application needs to mask PII before sending to an LLM and then un-mask it in the response, you are doing pseudonymization, not anonymization. The LLM might rewrite, paraphrase, or modify the placeholders, and your un-masking logic has to handle that. Privacy Filter solves none of this. Tools like Protecto and Tonic position themselves explicitly around the un-masking robustness problem, which is harder than the F1 score implies.
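To make the un-masking problem concrete, here is a hedged sketch of a restore step that tolerates placeholder drift. It assumes placeholders of the hypothetical form `[LABEL_N]` and a locally held token map; `unmask_tolerant` is an illustrative name, not an API from Privacy Filter, Protecto, or Tonic. It only handles changes in case, brackets, and separators, and it reports placeholders the model dropped entirely rather than guessing.

```python
import re


def unmask_tolerant(text: str, token_map: dict[str, str]) -> tuple[str, list[str]]:
    """Restore placeholders even if the model altered case, brackets, or separators.

    Returns the restored text plus the placeholders that never appeared in the
    response, so the caller can decide whether a dropped value matters.
    """
    missing: list[str] = []
    for placeholder, original in token_map.items():
        label, index = placeholder.strip("[]").rsplit("_", 1)
        # Accept "[EMAIL_1]", "EMAIL_1", "[email 1]", "[Email-1]" and similar drift.
        pattern = re.compile(
            rf"\[?\b{re.escape(label)}[\s_-]?{re.escape(index)}\b\]?",
            re.IGNORECASE,
        )
        # A function replacement avoids backslash handling in the restored value.
        text, count = pattern.subn(lambda _m, value=original: value, text)
        if count == 0:
            missing.append(placeholder)
    return text, missing


token_map = {"[EMAIL_1]": "jane@example.com", "[PHONE_1]": "+1 617 555 0100"}
restored, missing = unmask_tolerant("Write to [email 1] today.", token_map)
```

Paraphrased placeholders ("the customer's email") are out of reach for any pattern-based restore, which is why the missing list matters more than the happy path.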

Safety SLMs as a model class

Privacy Filter is the clearest signal yet that “small, specialized model trained for one safety task” is becoming a stable category — distinct from foundation models and distinct from classical NLP libraries. The pattern is consistent: take a frontier-pretrained checkpoint as the substrate, surgically modify the head and attention pattern for a single classification or scoring objective, post-train on labeled safety data, and ship the weights under a permissive license so the ecosystem can fine-tune for vertical domains.

The next entries in this category are predictable. Prompt-injection detectors. Toxicity classifiers. Output policy auditors. Code-secret scanners. Some already exist as research artifacts. Privacy Filter is the first that is small enough to run in a browser, accurate enough to ship, and open enough to adapt without negotiating a license. If safety SLMs become the standard infrastructure layer for production AI — the privacy and safety equivalent of TLS termination — Privacy Filter is the v1.

What’s worth watching is whether Anthropic continues to keep its safety classifiers internal, or whether the competitive pressure of an open ecosystem forces a shift. The Constitutional Classifiers research is, technically, exactly the kind of work that could ship as open weights for the broader community to build on. So far, it hasn’t.

Shadow AI Agents

The AI Runtime — Mon, 27 Apr 2026 11:03:54 GMT

TL;DR - Per Gravitee’s 2026 State of AI Agent Security report, 88% of organizations reported confirmed or suspected AI agent security incidents in the past year. The same survey found three million agents running inside corporations today, only 47.1% of which are actively monitored or secured. Deloitte’s 2026 State of AI in the Enterprise adds that only one in five companies has a mature governance model for agentic AI. The numbers describe a single underlying problem: most enterprise AI agents are shadow agents — autonomous workers with persistent permissions, no owner, no registry entry, and no audit trail. This is shadow IT’s faster, more dangerous successor. Shadow IT was unsanctioned software. Shadow AI was unsanctioned LLM use. Shadow agents are unsanctioned workers — they move files, send emails, execute transactions, and call APIs at machine speed, often borrowing a human’s credentials with no separation of action.

The fix is agent identity as a first-class reliability surface, sitting beneath context engineering and harness engineering as the precondition both rely on. Microsoft’s Agent 365, generally available May 1 at $15 per user per month, is the first major reference architecture: every agent gets a unique Entra Agent ID, a sponsor, a registry entry, and a managed lifecycle. It’s not the whole answer (cross-cloud governance is still unsolved), but it’s the clearest blueprint enterprises have today for what an agent control plane needs to do. If you can’t answer three questions about your environment in five minutes (how many agents you have, what each one can actually do, and who is accountable when one misbehaves), you have shadow agents. This is a guide to making them visible.

The Office Building Analogy

Imagine you walk into your office tomorrow and discover that your company hired forty-five people overnight for every existing employee. They don’t have badges. They report to no one. They have access to your filesystem, email, CRM, customer database, and bank accounts. They never go home, never take vacation, and when something breaks at 3 AM on a Saturday, no one even knows they were there.

This is not hyperbole. It is the actual ratio. Non-human identities (service accounts, API tokens, robotic process automation, and now AI agents) outnumber human identities in the average enterprise by 45 to 1, according to Gartner research, climbing to 80 to 1 in cloud-native organizations. Most operate with excessive privileges. Most run unmonitored. And most are essential to keeping production systems running.

The traditional security playbook was simple: lock down the humans. Enforce MFA. Train employees to spot phishing. Review badges. The shadow agents problem rewrites the question entirely. The mandate is no longer “who has admin rights?” but “what has access to what?”, and answering that requires infrastructure most organizations have not built yet.

What Shadow Agents Actually Are

Shadow IT was the previous era’s problem. Employees signed up for SaaS tools without IT approval. Procurement found out months later when the renewal invoice landed.

Shadow AI was the bridge. Employees pasted proprietary data into ChatGPT, Claude, or Gemini. The exposure was real but bounded — a single conversation, a single export, a single user.

Shadow agents are categorically different. Unlike shadow AI, which is the use of unapproved LLMs, shadow agents are granted persistent permissions to your systems. They don’t just answer questions. They move files, send emails, update records, and communicate with customers and other agents. They authenticate continuously. They make decisions while no human is watching. And they typically piggyback on a human user’s credentials — which means in your audit logs, the agent’s actions are indistinguishable from the human’s.

When an agent updates a file, the log says “John Doe updated a file.” It should say “John Doe’s Agent [ID 042] updated a file.” That single missing distinction is the source of most attribution failures, most incident response delays, and most of the 88% incident rate Gravitee found in its 2026 State of AI Agent Security report.
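A minimal sketch of what that distinction looks like in a log record follows. The `AuditEvent` class and its field names are hypothetical, chosen for illustration; the point is only that the agent gets its own field instead of being folded into the human's identity.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    action: str
    resource: str
    user: str                        # the human whose authority is in play
    agent_id: Optional[str] = None   # None means the human acted directly
    timestamp: str = ""

    def actor(self) -> str:
        """Human-readable actor that keeps agent and human distinguishable."""
        if self.agent_id is None:
            return self.user
        return f"{self.user}'s Agent [ID {self.agent_id}]"

    def to_log_line(self) -> str:
        record = asdict(self)
        record["actor"] = self.actor()
        record["timestamp"] = self.timestamp or datetime.now(timezone.utc).isoformat()
        return json.dumps(record, sort_keys=True)


by_human = AuditEvent("update", "q3-forecast.xlsx", user="John Doe")
by_agent = AuditEvent("update", "q3-forecast.xlsx", user="John Doe", agent_id="042")
```

With the agent recorded as a separate field, incident response can filter on `agent_id` instead of reconstructing intent from timing.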

The pattern is predictable and already widespread. Marketing deploys an agent for content generation. Sales spins up one for lead scoring. Finance automates invoice processing. Each was approved by a manager who reasonably assumed IT would catch anything risky. IT never sees them, because the agents enter the environment through OAuth grants, browser extensions, MCP integrations, and developer pipelines that no central registry tracks. Six months later the agents are doing critical work. Twelve months later one of them malfunctions and exposes a customer database. The post-mortem reveals nobody knew it existed.

Gravitee’s research puts the current population at three million agents operating inside corporations, of which an estimated 1.5 million are running with no oversight, accessing sensitive data, making decisions, and connecting to critical systems with no audit trail. Gartner expects 40% of enterprise applications to embed task-specific AI agents by the end of 2026, up from less than 5% in 2025. IDC projects 1.3 billion autonomous agents in circulation by 2028. None of those agents will govern themselves.

Why Reliability Engineering Alone Doesn’t Solve This

I’ve written extensively about Model Reliability Engineering — the discipline of ensuring AI behavior is reliable in production. MRE has two surfaces: context engineering (what the model knows at inference) and harness engineering (what users see, with what guardrails).

Both surfaces assume something they shouldn’t: that you know which agent is calling the model, whose permissions it carries, and who is accountable if it misbehaves.

Take a faithfulness SLO failure. An agent generates a response unsupported by the retrieved context. MRE tells you the metric fired. It does not tell you which of your 412 agents fired it, which user it was acting on behalf of, what permissions it was operating under, or whether the failure exposed data the agent should never have been able to access in the first place. That investigation requires identity — and most organizations cannot produce it.

Agent identity is therefore not a sibling discipline to MRE. It’s a precondition. Reliability without identity is unauditable. Observability without attribution is theater. You cannot enforce a purpose limitation on an agent whose purpose was never declared. Kiteworks’ 2026 Data Security and Compliance Risk Forecast quantifies the gap directly: 63% of organizations cannot enforce purpose limitations on what their agents are authorized to do, and 60% cannot terminate a misbehaving agent once it starts operating.

This is why agent identity belongs as the next reliability surface — not in addition to context and harness engineering, but underneath them. Without it, the rest of the stack cannot carry weight.

The Four Pillars of an Agent Control Plane

Across the most coherent enterprise frameworks emerging in the last six months — Microsoft’s Agent 365, the Cloud Adoption Framework guidance for agent governance, the OWASP Top 10 for Agentic Applications, and the NIST AI Agent Standards Initiative announced in January 2026 — the same four pillars surface repeatedly. Together they describe what an agent control plane has to do.

Discovery and registry. Every agent in the environment is inventoried. Not just the ones IT sanctioned. The ones running through OAuth grants, browser extensions, MCP servers, low-code platforms, and developer scripts. If you don’t know an agent exists, you cannot govern it. Most organizations cannot produce this list today.

Identity and sponsorship. Each agent receives a unique, durable identifier — distinct from any human user’s credentials. Each identity has a sponsor: a human accountable for the agent’s lifecycle, its permissions, and its decommissioning. Microsoft’s Entra Agent ID is the most concrete implementation of this primitive available today, but the principle is portable: no agent operates without an owner.

Policy and permission. Agents authenticate using short-lived, task-specific tokens, not long-lived shared credentials. Permissions are scoped to least privilege by default. Conditional access policies adapt in real time to risk signals. Purpose limitation is encoded — what the agent is allowed to do, and equally important, what it is not allowed to do, even when prompted to.

Observability and attribution. Every action an agent takes is logged with the agent’s identity, the user it was acting on behalf of, the tools it called, and the data it touched. Behavioral baselines detect drift. Anomalies trigger investigation. When something goes wrong, the audit trail answers “what happened” in minutes, not in days of forensic archaeology.

These four pillars are not novel individually. Identity governance has been a discipline for decades. What is new is applying them to entities that operate continuously, autonomously, at machine speed, with permissions equal to or exceeding privileged human users — and doing so before the agent population grows past the point of practical inventory.
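The four pillars compose into a small amount of logic. The sketch below is a toy, in-memory illustration under stated assumptions: `ControlPlane`, `AgentRecord`, and `Token` are hypothetical names, the scope strings are invented, and nothing here reflects how Agent 365 or any vendor product is implemented.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRecord:
    agent_id: str
    sponsor: str                       # accountable human
    allowed_actions: frozenset         # declared purpose


@dataclass(frozen=True)
class Token:
    agent_id: str
    scopes: frozenset
    expires_at: float


class ControlPlane:
    def __init__(self) -> None:
        self.registry: dict[str, AgentRecord] = {}   # discovery and registry
        self.audit_log: list[dict] = []              # observability and attribution

    def register(self, agent_id: str, sponsor: str, allowed_actions: set) -> None:
        # Identity and sponsorship: no agent operates without an owner.
        if not sponsor:
            raise ValueError("agent must have a sponsor")
        self.registry[agent_id] = AgentRecord(agent_id, sponsor, frozenset(allowed_actions))

    def issue_token(self, agent_id: str, scopes: set, ttl_s: float = 300) -> Token:
        # Policy and permission: short-lived, task-scoped, bounded by declared purpose.
        record = self.registry[agent_id]   # unregistered agents raise KeyError
        if not scopes <= record.allowed_actions:
            raise PermissionError("requested scopes exceed declared purpose")
        return Token(agent_id, frozenset(scopes), time.time() + ttl_s)

    def authorize(self, token: Token, action: str, on_behalf_of: str) -> bool:
        allowed = time.time() < token.expires_at and action in token.scopes
        self.audit_log.append({
            "agent_id": token.agent_id,
            "on_behalf_of": on_behalf_of,
            "action": action,
            "allowed": allowed,
        })
        return allowed


plane = ControlPlane()
plane.register("invoice-bot", sponsor="dana@example.com",
               allowed_actions={"invoice:read", "invoice:write"})
token = plane.issue_token("invoice-bot", {"invoice:read"})
```

Denied actions are logged as well as allowed ones, since a refused request is often the first visible sign of drift.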

Pillars of an Agent Control Plane

Microsoft Agent 365 as the Reference Architecture

Agent 365, generally available May 1, 2026, is the most complete implementation of these four pillars shipping today. It deserves attention not because it is the only solution but because it is the first concrete blueprint enterprises can point to and copy.

The Agent 365 inventory in the Microsoft 365 admin center captures every agent registered through Microsoft channels — Copilot Studio, Microsoft Foundry, Teams, and third-party agents that integrate via the Agent 365 SDK. Microsoft Entra issues each agent a unique Agent ID and applies identity governance: lifecycle controls, conditional access, sponsor relationships, and access packages. Microsoft Purview applies data protection policies and audits agent activity. Microsoft Defender provides threat detection and incident response, with visibility into attack paths.

Microsoft is its own first proof point. The company has been running Agent 365 internally as “Customer Zero” and reports more than 500,000 agents mapped within its own environment, generating more than 65,000 responses per day for employees in a representative 28-day window. In the public preview phase, tens of millions of agents have been registered in the Agent 365 registry across customer environments. The control plane has been load-tested before launch.

Worth understanding what Agent 365 does not solve. Its strength is also its boundary: it is anchored to the Microsoft ecosystem. Agents running in AWS Bedrock, GCP Vertex, OpenAI’s platform, Anthropic’s API, GitHub Actions, or internal frameworks built on LangChain or CrewAI do not automatically appear in the Agent 365 registry. Cross-cloud governance still requires configuration or third-party tooling. Several aspects of the security story are also incomplete on day one — runtime threat protection through the Agent 365 tools gateway is entering public preview in April rather than shipping at GA, and security posture management for Foundry and Copilot Studio agents remains in public preview after launch.

Agent 365 is the most coherent reference architecture today, but it is one path among several. To pick well, architects need the broader landscape.

The Control Plane Is a Category, Not a Product

Microsoft is not alone in this space. As of April 2026, six distinct categories of vendor are racing toward the same control-plane primitives, with overlapping and sometimes conflicting approaches.

Hyperscaler-native control planes. Each major cloud is building its own version of Agent 365. AWS Bedrock AgentCore added a managed Agent Registry in April 2026, with identity, gateway, sandboxed runtime, observability, and a policy module that runs outside the agent. VentureBeat’s framing of the difference is sharp — AWS optimizes for build-velocity, with identity baked into the runtime layer rather than sitting on top. Google rebranded Vertex AI as Gemini Enterprise Platform and built a Kubernetes-style governance control plane around it, with Agent Registry integrations via Apigee, plus VPC Service Controls, CMEK, and a new Vertex AI Governance layer. Three hyperscalers, three philosophies, each bound to its own ecosystem. Forrester analyst Charlie Dai flagged the corollary risk: enterprises adopting AWS, Microsoft, and Google registries in parallel could end up recreating the exact fragmentation these tools are meant to solve. Registry sprawl is the second-order failure mode of the control-plane era.

The neutral identity-fabric play. Okta plus Auth0 is the most ambitious cross-ecosystem competitor. Okta for AI Agents entered Early Access in March 2026; Auth0 for AI Agents handles the build-time identity primitives — Token Vault, Fine-Grained Authorization for RAG, CIBA for asynchronous human consent. The strategically important move is Cross App Access (XAA), an OAuth extension built specifically for agent-to-application delegation, with launch support from AWS, Google Cloud, Salesforce, Box, Glean, and others. XAA was recently merged into MCP as “Enterprise-Managed Authorization.” If XAA becomes the actual interoperability standard, it matters more than any single vendor’s control plane. Strata Identity’s Maverics Agentic Identity is a similar pure-play approach, with just-in-time provisioning and OIDC/OAuth subject-actor binding.

Non-human-identity vendors. Entro Security, TrustLogix, BeyondTrust Pathfinder, CyberArk, GitGuardian, Keeper, and AppViewX with Eos came from privileged access, non-human identity, or secrets management and extended into agents. BeyondTrust Pathfinder is the closest a non-hyperscaler comes to a true unified control plane, combining PAM, CIEM, ITDR, secrets management, and agentic AI security in a single telemetry layer. Their thesis is the cross-environment one: agents do not respect ecosystem boundaries, so neither should governance.

IGA retrofit. Saviynt shipped ISPM for AI Agents and ISPM for NHI in early 2026. SailPoint and others are extending traditional identity governance to agents. “Extending” is the operative word. This is the retrofit path, with the trade-offs that implies.

Cross-cloud data-policy layer. Bedrock Data’s ArgusAI sits adjacent to identity, governing what data agents can access across AWS Bedrock, Snowflake Cortex, ChatGPT Enterprise, and Google Vertex AI. Write a policy in plain English once, enforce it across clouds. Identity governance and data governance are converging.

The open-standard foundation few are pointing to. SPIFFE/SPIRE — CNCF-graduated, production-proven for workload identity in cloud-native environments, integrated natively into HashiCorp Vault Enterprise as of version 1.21, shipping as a Red Hat OpenShift operator. SPIFFE was not built for AI agents specifically, but it solves precisely the right problem: short-lived cryptographic identities for non-human workloads, attested by what the workload is rather than what secret it holds. Most enterprise architects have not connected SPIFFE to agent governance yet. They should. For platform-agnostic, multi-cloud agent identity, SPIFFE/SPIRE is the most mature and standards-aligned foundation available — and it composes cleanly underneath any of the higher-level control planes above.
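For readers who have not worked with SPIFFE, the identity itself is just a URI of the form `spiffe://trust-domain/path`. The sketch below parses one and applies a simple prefix policy. It checks only a subset of the rules in the SPIFFE ID specification, and the `/agents/<team>/<name>` path layout and the function names are assumptions made for illustration, not a convention defined by SPIFFE or SPIRE.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class SpiffeId:
    trust_domain: str
    path: str


def parse_spiffe_id(value: str) -> SpiffeId:
    """Parse a SPIFFE ID, rejecting the most common malformed inputs."""
    parts = urlsplit(value)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError("expected spiffe://<trust-domain>/<path>")
    # netloc != hostname catches userinfo, ports, and upper-case trust domains.
    if parts.netloc != parts.hostname or parts.query or parts.fragment:
        raise ValueError("SPIFFE IDs carry no userinfo, port, query, or fragment")
    return SpiffeId(parts.netloc, parts.path)


def is_allowed(spiffe_id: SpiffeId, trust_domain: str, path_prefix: str) -> bool:
    """Hypothetical policy: same trust domain and a path under the given prefix."""
    return (spiffe_id.trust_domain == trust_domain
            and spiffe_id.path.startswith(path_prefix))


agent = parse_spiffe_id("spiffe://example.org/agents/finance/invoice-bot")
```

In a real deployment the ID arrives inside an attested X.509 or JWT credential issued by SPIRE; string parsing like this is only the last step of a policy decision.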

Practical guidance breaks down by deployment shape:

Heavily Microsoft stacks should default to Agent 365 at $15 per user per month standalone, or included in the new M365 E7 bundle at $99, as the path of least resistance.
Heavily AWS or Google deployments should look at AgentCore Registry and Gemini Enterprise’s governance layer respectively as the analogous bets, with the same architectural pattern and same ecosystem boundary.
Multi-cloud organizations need Okta plus Auth0’s identity fabric or one of the NHI-pedigree platforms (BeyondTrust Pathfinder, Entro, TrustLogix) for cross-environment governance that hyperscaler-native tools cannot deliver.
Cloud-native shops running Kubernetes and a service mesh should evaluate SPIFFE/SPIRE as the open-standard foundation that composes underneath any of the above.
Teams still early, with fewer than a dozen agents in production, should build identity in from day one rather than retrofit it later.

The shadow agents problem is what retrofit looks like at scale, and the cost of fixing it climbs steeply as the agent population grows.

A Three-Question Diagnostic

Before any tooling decision, every organization running agents should be able to answer three questions in under five minutes. The number of questions that draw a blank or an “I’m not sure” correlates directly with shadow agent exposure.

How many AI agents are running in our environment right now? Not the ones IT approved. The total — including the ones spun up via OAuth grants, browser extensions, MCP integrations, and developer scripts. Most organizations cannot answer this within an order of magnitude.

What can each agent actually do? Not what it was designed to do. What permissions does its token carry, what systems does it have read access to, what systems does it have write access to, and what would happen if a malicious prompt convinced it to use the broadest interpretation of its access? The 63% of organizations that cannot enforce purpose limitations are by definition unable to bound this.

Who is accountable if an agent misbehaves at 3 AM on a Saturday? Not “the team that built it.” A specific human, on call, with the authority to decommission the agent. If the answer requires a meeting to determine, the agent has no owner.

Failing all three means a major incident is a question of when, not if. The organizations that will survive the next 24 months of agent adoption without a public incident are the ones that can answer all three today, with names, numbers, and pages.
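The diagnostic can be run mechanically against whatever inventory you do have. This is a rough sketch, assuming a hypothetical `AgentEntry` record assembled from registry exports and discovery scans; the field names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentEntry:
    name: str
    in_registry: bool                     # question 1: is it inventoried?
    scopes: Optional[list] = None         # question 2: None = permissions never enumerated
    owner: Optional[str] = None           # question 3: a named human, not a team


def diagnose(agents: list) -> dict:
    """Return, per question, the agents for which the answer is missing."""
    return {
        "how_many": [a.name for a in agents if not a.in_registry],
        "what_can_it_do": [a.name for a in agents if a.scopes is None],
        "who_is_accountable": [a.name for a in agents if not a.owner],
    }


def unanswered(report: dict) -> int:
    """Count the questions that at least one agent leaves unanswered."""
    return sum(1 for names in report.values() if names)


agents = [
    AgentEntry("lead-scorer", True, ["crm:read"], "priya@example.com"),
    AgentEntry("invoice-bot", False, None, None),
]
report = diagnose(agents)
```

The sketch can only score agents that appear in its input, so it understates exposure by exactly the amount that discovery has missed.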

The Bottom Line

Agent adoption is moving faster than identity governance. Forty percent of enterprise applications embedding agents by year-end, up from less than 5% a year earlier, is less an adoption curve than a vertical line. If the projection of 1.3 billion agents by 2028 holds, autonomous non-human workers could outnumber every other class of digital identity inside the enterprise within two years.

The organizations that treat agent identity as a first-class reliability surface — with discovery, sponsorship, scoped permissions, and audit-grade observability — will spend the next two years building production capability. The organizations that don’t will spend them doing post-incident forensics on agents they didn’t know they had.

Reliability begins with identity. If you cannot tell who acted, you cannot tell what happened. If you cannot tell what happened, you cannot fix it. Everything else in the agent stack — context engineering, harness engineering, evaluation, incident response — assumes that question is already answered.

It usually isn’t. That’s the work.

Builder Spotlight - Armaan Agrawal ships like a forward-deployed engineer already

The AI Runtime — Fri, 24 Apr 2026 11:03:51 GMT

TL;DR. Armaan Agrawal (CS @ Northeastern, class of 2026) has a new-grad portfolio that reads like a scout report for forward-deployed engineering. SamGPT is a RAG engine over the My First Million corpus with timestamp and speaker-attributed chunks that bridge into a Viral Clip Generator via FFmpeg — the retrieval and action paths share a schema on purpose, which is the move a Solutions Architect makes. He wired prompt-injection and harmful-request guardrails into an OpenAI Agents SDK build in hour one of three at a hackathon and placed 2nd. His co-op recommender solved cold-start with a staged text-match → collaborative-filtering rollout that will be deployed to 2500 students. He’s demonstrating the concepts the AIfolio framework calls for — RAG, tool-use, voice continuity — but what makes the portfolio FDE-shaped is the real-world skills around those concepts: latency as a product feature, guardrails as a default, schema as operational design, staged rollout as architecture. If you’re staffing an FDE or Solutions Architect role, talk to him before he takes a traditional SWE offer.

The habit, stated

Most new-grad portfolios are a pile of frameworks. Armaan’s is a pile of systems shipped to specific users whose operational reality he understood. That sounds soft until you look at the architectural choices — they’re the ones you make when the user’s failure mode, not the rubric, is what you’re optimizing against.

AIfolio Projects

Forward-deployed engineering and solutions architecture are the same job at different scales: drop into a domain you didn’t grow up in, compose a working system out of heterogeneous pieces, land it with safety and observability already in it, and iterate on the signal instead of the stack. Most new grads learn this over two years of production pain. Armaan has already shipped it seven times.

The AIfolio framework names the concepts an AI-engineering portfolio should demonstrate — RAG pipelines, tool-use architecture, agent design, memory and voice continuity. Armaan hits those concepts. What’s more interesting is what he does around them: the habits that make the concepts production-viable instead of demo-viable. That’s what this piece walks through.

RAG with schema foresight (SamGPT + Viral Clip Generator)

Viral clip generator

Stack: Whisper / ASR, speaker diarization, embeddings, vector search, FFmpeg, Next.js.

SamGPT is a RAG system over the My First Million podcast corpus: semantic search, query expansion, suggested prompts, YouTube deep links to the exact timestamp. The Viral Clip Generator is the adjacent tool: paste a YouTube URL, get the top 3 sub-2-minute cuts auto-extracted as 16:9 exports.

What makes this architecturally non-obvious isn’t the RAG itself. It’s the bridge between the two services.

The data model carries timestamped, speaker-attributed chunks all the way through retrieval. A user who finds a quote in SamGPT can jump to the video at the exact second, or pass the chunk to FFmpeg and get a shippable 16:9 cut. Retrieval and generation aren’t separate products wearing the same skin — they share metadata, and the shared metadata is the feature.

This is the Solutions Architect move. It would have been easier to build two independent tools and call it a suite. Instead he built one pipeline with two exits, and the marginal cost of the second exit was near zero because he designed the chunking schema for it upfront. Most AI engineers bolt that on later and lose half the data.

The portable real-world skill: your chunking schema is a product decision, not an infrastructure decision. Armaan’s schema already had timestamps and speaker attribution because he knew a second surface (clip extraction) would need them. That’s designing the system to be legible to the next tool you’ll build against it — the skill that separates an AI engineer from a solutions architect.
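A chunk schema with that foresight might look like the sketch below. The field names, the `Chunk` class, and both helper functions are assumptions for illustration, not Armaan's actual code; the YouTube `t=` parameter and the FFmpeg `-ss`, `-t`, and `-c copy` options are standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    video_id: str
    speaker: str
    start_s: float
    end_s: float
    text: str


def youtube_deep_link(chunk: Chunk) -> str:
    """Exit one: jump to the exact second the quote starts."""
    return f"https://www.youtube.com/watch?v={chunk.video_id}&t={int(chunk.start_s)}s"


def ffmpeg_clip_args(chunk: Chunk, source_path: str, out_path: str) -> list:
    """Exit two: build an FFmpeg command that cuts the same span from a local file.

    -c copy avoids re-encoding but cuts on keyframes; a polished export
    would likely re-encode instead.
    """
    duration = chunk.end_s - chunk.start_s
    return [
        "ffmpeg",
        "-ss", f"{chunk.start_s:.3f}",
        "-i", source_path,
        "-t", f"{duration:.3f}",
        "-c", "copy",
        out_path,
    ]


chunk = Chunk("abc123", "Sam", 754.2, 812.0, "example quote")
link = youtube_deep_link(chunk)
args = ffmpeg_clip_args(chunk, "episode.mp4", "clip.mp4")
```

Both exits read the same two timestamp fields, which is why the second surface costs almost nothing once the schema carries them.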

Tool-use architecture without MCP (Content Engine)

Content Engine

Stack: Next.js, content pipelines, carousel export, AI rewrite with tone presets, personalized to voice and style data.

One source tweet, four output formats: LinkedIn long-form, IG carousel, newsletter, quote card. Three-pane UI: source feed on the left, tabbed editor in the middle (one tab per target format), live preview on the right.

Two choices worth calling out:

Format-specific editor tabs instead of a single “transform” button. Each target format has its own constraints, and he exposes them as first-class surfaces. This is the difference between treating output formats as parameters to one generator versus treating them as distinct tools that share an upstream source. The second is what a Solutions Architect picks when the user has real editorial control needs. It’s also the tool-use design pattern MCP formalizes — you don’t need MCP to pick it, you need the instinct that separates tool boundaries along user-decision boundaries.
Voice personalization. Most “AI rewrite” tools regress your writing toward a generic model voice. Armaan’s design carries personal-voice signal into every tab, so the four generated variants don’t all sound vaguely like a LinkedIn guru. The failure mode of cross-platform content tools is well-known: you write once, four generated variants all need a hand-rewrite, the tool saves you zero minutes. Closing that gap is a real-world skill the canonical AIfolio memory pillar hints at but most projects miss.

Latency as a product feature (Red Sox)

Stack: Django, Vue.js, Redis, Celery, PostgreSQL, Okta SSO, Docker. Live at Fenway Park, Jan–Sep 2024.

Live batting-lineup API for journalists during games. Previous method: a handwritten whiteboard. If the API went down mid-game, press couldn’t report the lineup before first pitch.

Most new grads would chase feature count. Armaan chased tail latency: 1.2s → 121ms, a ~90% cut, via Redis on the hot path and Celery for everything off-path. Two architectural choices inside that:

Cache on the read path, not everywhere. Redis in front of lineup reads means the request journalists actually make — “what’s the roster right now” — never waits on downstream services. Cache invalidation is keyed to lineup changes, so staleness lives in a narrow, owned window.
Celery for everything the user isn’t waiting on. Notifications, logging, eventual-consistency writes — off the request thread. The hot path becomes trivial to reason about because it does one thing.
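The two choices above can be sketched as a cache-aside read path with write-time invalidation. To stay runnable without infrastructure, a dict stands in for Redis, another dict for PostgreSQL, and a list for the Celery queue; `LineupService` and its method names are hypothetical, not the production code.

```python
class LineupService:
    def __init__(self, db: dict) -> None:
        self._db = db               # stand-in for PostgreSQL
        self._cache: dict = {}      # stand-in for Redis
        self.deferred: list = []    # stand-in for a Celery queue
        self.db_reads = 0

    def get_lineup(self, game_id: str) -> list:
        """Hot path: serve from cache, fall back to the database once."""
        if game_id in self._cache:
            return self._cache[game_id]
        self.db_reads += 1
        lineup = list(self._db[game_id])
        self._cache[game_id] = lineup
        return lineup

    def set_lineup(self, game_id: str, lineup: list) -> None:
        """Write path: persist, invalidate the cached entry, defer side effects."""
        self._db[game_id] = list(lineup)
        self._cache.pop(game_id, None)                   # staleness ends at the write
        self.deferred.append(("notify_press", game_id))  # off the request thread


service = LineupService({"game-1": ["Player A", "Player B"]})
first = service.get_lineup("game-1")    # reads the database
second = service.get_lineup("game-1")   # served from cache
```

Invalidating on write instead of relying on a TTL is what keeps the staleness window narrow and owned: the cache can only be wrong between a lineup change and the next read.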

None of this is AI engineering. It matters for an AI portfolio anyway, because the Retrofit Tax is what teams pay when they try to add observability, latency discipline, or governance to a system that was shipped without them. Armaan doesn’t retrofit. Production-shaped defaults go in the original design, where the cost of adding them is close to zero. That’s the posture that keeps his work from accruing tax as he scales it.

Guardrails in hour one, not hour forty (AgentOps hackathon)

Guardrails Setup

Stack: OpenAI Agents SDK. 3 hours. 2nd place.

Most hackathon demos ship a working prototype and skip safety entirely — the rubric doesn’t require it, and guardrails feel like production overhead. Armaan shipped input guardrails from line one, with prompt-injection blocks and harmful-request blocks both live in the demo.

Reading this as a minor detail misses what it signals. The OpenAI Agents SDK exposes guardrail primitives cheaply; almost nobody uses them on a hackathon timeline. Using them anyway is the same instinct as caching the Red Sox hot path: production-shaped defaults on demo-shaped timelines.

This is also where the behavioral-reliability work most AI engineers learn after their first incident shows up pre-incident. Validation gates, input filters, behavioral guardrails before a model’s output reaches the user — these are not optional for production systems, but they’re almost always added reactively. Reaching for them at hour one of a three-hour build is the instinct, and it’s not teachable under deadline.
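
As a rough picture of what an input guardrail does, here is a plain-Python sketch. It deliberately does not use the OpenAI Agents SDK API, and it is not Armaan's code: the regex patterns, `check_input`, and `guarded_run` are hypothetical, and pattern matching alone would be far too narrow for a real deployment.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative patterns only; real coverage would need to be much broader.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous|prior) .*instructions", re.IGNORECASE),
    re.compile(r"reveal .*system prompt", re.IGNORECASE),
]
HARMFUL_PATTERNS = [
    re.compile(r"\bhow to (make|build) (a )?(bomb|weapon)\b", re.IGNORECASE),
]


@dataclass
class GuardrailResult:
    allowed: bool
    reason: Optional[str] = None


def check_input(user_text: str) -> GuardrailResult:
    """Run before the agent sees the input; block on the first matching pattern."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(user_text):
            return GuardrailResult(False, "prompt_injection")
    for pattern in HARMFUL_PATTERNS:
        if pattern.search(user_text):
            return GuardrailResult(False, "harmful_request")
    return GuardrailResult(True)


def guarded_run(user_text: str, agent: Callable[[str], str]) -> str:
    """Call the agent only if the input guardrail passes."""
    result = check_input(user_text)
    if not result.allowed:
        return f"Request blocked ({result.reason})."
    return agent(user_text)
```

The design choice worth copying is the ordering: the check sits in front of the agent call, so a blocked request never reaches the model at all.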

For an FDE hiring manager, this is the cheapest-to-evaluate signal in the portfolio.

Schema as operational product design (Feedshare)

Feedshare

Stack: SwiftUI, Firebase, iOS. 100+ campus users.

The framing — “campus free food shouldn’t die in a group chat” — constrains the whole system. The schema isn’t “post + comments.” It’s:

Photo-first feed (you don’t walk across campus on a text description)
Map pins (location is a first-class field, not a comment)
Multi-photo upload, up to 5 (proof, not hype)
Room + headcount fields (so you know whether it’s worth the walk before you leave)

Every field on the post form corresponds to a decision the user makes: is this real, where is it, is it still there, is it worth the walk. The schema is the product.
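
To make "the schema is the product" concrete, here is a hedged sketch of the post model as a validated record. The real app is SwiftUI on Firebase; this Python version only illustrates the idea, and the field names, the integer headcount, and the validation rules are assumptions rather than the app's actual schema.

```python
from dataclasses import dataclass
from typing import List

MAX_PHOTOS = 5  # from the article: multi-photo upload, up to 5


@dataclass
class FoodPost:
    photo_urls: List[str]  # is this real? (photo-first)
    latitude: float        # where is it? (map pin)
    longitude: float
    room: str              # where exactly?
    headcount: int         # is it worth the walk? (assumed to be a non-negative count)

    def __post_init__(self) -> None:
        # Reject posts that cannot answer the user's decision questions.
        if not 1 <= len(self.photo_urls) <= MAX_PHOTOS:
            raise ValueError(f"a post needs 1 to {MAX_PHOTOS} photos")
        if not -90 <= self.latitude <= 90 or not -180 <= self.longitude <= 180:
            raise ValueError("map pin is out of range")
        if not self.room.strip():
            raise ValueError("room is required")
        if self.headcount < 0:
            raise ValueError("headcount cannot be negative")


def is_valid_post(**fields) -> bool:
    """Convenience wrapper: True if the fields form a valid post."""
    try:
        FoodPost(**fields)
        return True
    except ValueError:
        return False
```

Every check maps to one of the user's questions, which is why there is no free-text body field to validate.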

This distinguishes FDE work from generic backend work. The post schema isn’t generic — it encodes the decision-making workflow of the specific user on the specific campus. Firebase gets him there fast because the real work isn’t the backend; it’s figuring out what data the user’s decision actually requires and refusing to collect anything else. Shipping to 100+ students on a campus with real food-waste pressure means the hypothesis has already been validated in the field.

Cold-start as staged architecture, not a hack (co-op recommender)

While at NExT Consulting, he built a co-op recommender for Northeastern students, planned for use by 2,500 students in intro university classes. New students have no history, which is the classical cold-start trap that kills most recommendation systems before they ship.

His rollout: text matching first (profile-to-role matching for the cold-start population), then shift to collaborative filtering once interaction data accumulates. “Good matches from day one, better over time.”

This is the staged-architecture move a Solutions Architect picks. You don’t wait for data to deploy the system, and you don’t stay on cold-start forever. You design the data pipeline so the transition is a config change, not a rewrite. For a new grad to pick the staged approach on a real user-impact system is unusual — most new engineers either over-engineer the eventual collaborative-filter stack and ship late, or ship the text-match version with no path off it and accumulate the tax later.
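
A minimal sketch of "the transition is a config change" might look like the following. It is not the NExT Consulting system: the word-overlap matcher, the co-occurrence filter, the `MIN_INTERACTIONS` threshold, and all sample data are hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import Counter
from typing import Dict, List, Set

MIN_INTERACTIONS = 3  # hypothetical config value: when a student leaves cold start


def text_match(profile: str, roles: Dict[str, str]) -> List[str]:
    """Cold-start stage: rank roles by word overlap with the student's profile."""
    words = set(profile.lower().split())
    scores = {
        role: len(words & set(description.lower().split()))
        for role, description in roles.items()
    }
    return sorted(roles, key=lambda role: (-scores[role], role))


def collaborative(student: str, history: Dict[str, Set[str]]) -> List[str]:
    """Later stage: suggest roles picked by students who share history with this one."""
    mine = history[student]
    votes: Counter = Counter()
    for other, theirs in history.items():
        if other != student and mine & theirs:
            votes.update(theirs - mine)
    return [role for role, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]


def recommend(
    student: str,
    profile: str,
    roles: Dict[str, str],
    history: Dict[str, Set[str]],
    min_interactions: int = MIN_INTERACTIONS,
) -> List[str]:
    """Route by how much interaction data exists; the switch is a single threshold."""
    if len(history.get(student, set())) < min_interactions:
        return text_match(profile, roles)
    return collaborative(student, history) or text_match(profile, roles)


roles = {
    "backend": "python django apis",
    "data": "python sql analytics",
    "ios": "swift ios apps",
}
history = {
    "a": {"backend", "data", "ios"},
    "b": {"backend", "data", "ml"},
}
cold = recommend("new-student", "python sql", roles, history)
warm = recommend("a", "python sql", roles, history)
```

Both stages sit behind one `recommend` entry point, so moving a student off cold start changes a threshold rather than the calling code.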

Why this maps to FDE / Solutions Architect work

Forward-deployed engineering is:

Understand a domain you didn’t grow up in faster than the customer thinks is possible.
Compose a working system from heterogeneous pieces (their stack + yours).
Land it with safety, observability, and latency budgets already wired in.
Iterate on the signal, not the stack.

Armaan has already run this pattern across five unrelated domains: chemical plant telemetry, a baseball press box, campus food logistics, long-form podcast content, and an AI agent under safety scrutiny. The domains are portable. The habit is portable.

The concepts an AI-engineering portfolio needs to demonstrate — RAG, tool-use, voice continuity, agent design — are necessary. The real-world skills that make those concepts production-viable are what’s rare: schema foresight, tail-latency discipline, pre-incident guardrails, schema-as-product thinking, staged rollout. Armaan’s portfolio has both layers. That’s the thing most new-grad hires don’t come with.

Solutions Architect work has a narrower shape — more “compose a durable reference architecture for customers” than “ship a one-off” — but the underlying disposition is identical. Pick the production-shaped default, not the demo-shaped one. Design the data model for the surface you’ll build next. Treat latency and guardrails as product features. Refuse to accrue Retrofit Tax.

How to reach him

Portfolio: armaanagrawal.com — worth reading in order, it’s structured as seven chapters
GitHub: github.com/airman416 — SamParrBot (SamGPT) and Content-Engine are the deepest reads
LinkedIn: linkedin.com/in/agr1
Target roles: Forward Deployed Engineer, Solutions Architect

For readers building their own AIfolio: the pattern that repeats across his work is cheaper to adopt than you’d think. Ship guardrails in hour one, not hour forty. Design your RAG chunking schema around the second surface you’ll build, not the first. Stage your cold-start into a config change instead of a rewrite. Cache the read path before you need to. None of that is senior-only work. It’s just the production-shaped default most engineers don’t pick until the first incident teaches them to — and the reason their portfolios end up carrying Retrofit Tax instead of compounding.

The Vercel Breach RCA: Agent Identity Is the New Attack Surface

The AI Runtime — Thu, 23 Apr 2026 11:05:52 GMT

TL;DR - On April 19, 2026, Vercel disclosed a breach of its internal systems. The root cause wasn’t a zero-day, a supply chain poisoning of an npm package, or a perimeter failure. It was an OAuth grant — a Vercel employee signed into Context.ai, a 300-connector agentic “AI office suite,” using their Vercel enterprise Google Workspace account and granted “Allow All” permissions. Context.ai was already compromised from a February 2026 infostealer infection on an employee laptop. The attacker inherited that OAuth session, pivoted into Vercel’s Google Workspace, and enumerated customer environment variables that were stored in plaintext-recoverable form because they weren’t explicitly marked “sensitive.” Vercel CEO Guillermo Rauch publicly attributed the attacker’s “operational velocity” to AI-accelerated tradecraft. Stolen data was listed on BreachForums for $2M. The mainstream framing — “shadow AI,” “third-party risk,” “OAuth supply chain” — is correct but incomplete. The right framing for AI engineers: this is the first major platform breach where an AI agent holding delegated identity was the pivot point. Every agent, every MCP server, every AI productivity tool your team is shipping or consuming runs on exactly this pattern. If you operate agents, audit your OAuth grants this week, default-sensitive every secret you store, and stop treating agent vendors as if they were ordinary SaaS.

What actually happened

Here is the compressed attack chain, reconstructed from Vercel’s bulletin, Context.ai’s advisory, Hudson Rock’s infostealer analysis, and Trend Micro’s post-incident writeup.

Attack chain

Each hop is worth pausing on.

The initial compromise was human, not technical. According to Hudson Rock’s analysis, the Context.ai employee’s browser history showed active searches for Roblox “auto-farm” scripts — a classic Lumma Stealer distribution vector. An enterprise SaaS vendor’s entire security posture was compromised because one employee downloaded game cheats on a corporate laptop. This is a failure of endpoint policy, not crypto or architecture.

The pivot was an OAuth grant, not a credential theft. Context.ai’s own statement is worth reading carefully: Vercel wasn’t even a Context.ai customer. A single Vercel employee had signed up for the product using their Vercel enterprise Google account and granted full read access to Google Drive during onboarding. When Context.ai’s OAuth token store was compromised, the attacker acquired not a password, but a delegated session — the authority to act as that employee inside Vercel’s Google Workspace.

The blast radius was set by Vercel’s “sensitive vs. non-sensitive” environment variable model. Vercel encrypts all env vars at rest, but it draws a distinction: env vars marked as “sensitive” are stored such that they cannot be read back even by the platform itself, while non-sensitive env vars can be decrypted to plaintext for display in dashboards. The attacker couldn’t touch sensitive vars. Everything else (API keys, database credentials, and signing keys for which customers had never opted into the sensitive treatment) was readable by enumeration.
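
The difference between opt-in and default sensitivity is easy to see in a toy model. The sketch below is not Vercel's implementation: encryption is omitted entirely, and `EnvVarStore`, its methods, and the variable names are hypothetical. It only shows how the default decides what an attacker with dashboard-level access can enumerate.

```python
from typing import Dict, Optional, Tuple


class EnvVarStore:
    """Toy model of a sensitive / non-sensitive split (no real encryption)."""

    def __init__(self, sensitive_by_default: bool = True) -> None:
        self._default = sensitive_by_default
        self._vars: Dict[str, Tuple[str, bool]] = {}

    def set(self, name: str, value: str, sensitive: Optional[bool] = None) -> None:
        # If the caller does not choose, the platform default decides.
        flag = self._default if sensitive is None else sensitive
        self._vars[name] = (value, flag)

    def read_back(self, name: str) -> str:
        """Dashboard-style read: refuses to return sensitive values."""
        value, sensitive = self._vars[name]
        if sensitive:
            raise PermissionError(f"{name} is sensitive and cannot be read back")
        return value

    def enumerate_readable(self) -> Dict[str, str]:
        """Everything a holder of dashboard-level access can list in plaintext."""
        return {name: value for name, (value, flag) in self._vars.items() if not flag}


opt_in = EnvVarStore(sensitive_by_default=False)
opt_in.set("DATABASE_URL", "db-secret")                   # never marked, so readable
opt_in.set("SIGNING_KEY", "key-secret", sensitive=True)   # explicitly protected

default_on = EnvVarStore(sensitive_by_default=True)
default_on.set("DATABASE_URL", "db-secret")
default_on.set("SIGNING_KEY", "key-secret")
```

With the same two writes, the opt-in store exposes the database credential and the default-on store exposes nothing.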

The velocity was the tell. Rauch’s public claim is that the attacker moved fast enough, with enough understanding of Vercel’s internal structure, that AI augmentation is the most likely explanation. This is interpretive — attribution-by-velocity is not a forensic artifact — but it lines up with a pattern Trend Micro, Microsoft, and others have flagged across 2026: LLM-driven reconnaissance that parallelizes schema discovery, endpoint probing, and credential-format recognition at rates that break detection baselines calibrated to human attackers.

Breach RCA

Why the standard framings are incomplete

The Vercel breach is getting framed three ways in the security press. All three are partially right and all three miss the point for AI engineers.

Framing 1: “Third-party risk / shadow AI.” True. But this framing leads to the wrong remediation — better vendor questionnaires, annual SOC 2 reviews, procurement gates. None of that would have prevented this. Context.ai likely had SOC 2. A Vercel employee signed up as a consumer, bypassing procurement entirely. Point-in-time vendor assessments are worthless against active compromise.

Framing 2: “OAuth supply chain attack.” True. But OAuth supply chain attacks have been understood for years — Codecov, CircleCI, the Heroku/Travis CI incident. What’s new here isn’t the OAuth mechanism. It’s the category of vendor on the other side of the grant.

Framing 3: “Platform env var model needs defaults.” True. Vercel has already rolled out dashboard changes and is pushing customers toward the sensitive-variable feature. This is good, and every platform should copy it. But this is a Vercel-specific lesson, not an industry-wide one.

The framing that actually matters for AI engineers is the one none of these capture: the intermediary in this breach was an AI agent holding delegated identity, and the pattern that made it dangerous is the pattern every agent deployment replicates.

Context.ai markets itself as an agent platform. Per their own launch materials, its agents “dynamically traverse entire organizational knowledge bases.” To do that well, it needs broad, persistent access to Drive, Slack, email, code repos — and it acquires that access through long-lived OAuth grants from individual users. This is not a Context.ai pathology. It’s the architectural baseline for every agentic product shipping today: Cursor’s enterprise connectors, Glean’s agents, the exploding MCP server ecosystem, every “connect your Google Drive” button in every AI startup demo.

When the agent is compromised, the delegated identity is compromised. When the delegated identity is an enterprise Google Workspace account, the compromise propagates to everything that account can touch.

A useful handle: Delegated Identity Blast Radius

A shorthand for this pattern, which I’ll use for the rest of the piece: Delegated Identity Blast Radius (DIBR) — the scope of systems an attacker inherits by compromising an agent, equal to the union of all permissions granted to that agent across all delegating users and tenants.
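
Because the definition is a set union, it can be written down directly. In this sketch the nested `grants` mapping, the scope labels, and the tenant and user names are hypothetical; a permission is modeled as a (tenant, user, scope) triple so that two users granting the same scope count as two distinct exposures.

```python
from typing import Dict, Set, Tuple

# grants[tenant][user] -> scopes that user delegated to the agent (hypothetical data)
Grants = Dict[str, Dict[str, Set[str]]]
Permission = Tuple[str, str, str]


def dibr(grants: Grants) -> Set[Permission]:
    """Union of every permission the agent holds across all tenants and users."""
    return {
        (tenant, user, scope)
        for tenant, users in grants.items()
        for user, scopes in users.items()
        for scope in scopes
    }


grants: Grants = {
    "tenant-a": {
        "alice": {"drive.read", "mail.read"},
        "bob": {"drive.read"},
    },
    "tenant-b": {
        "carol": {"calendar.read"},
    },
}
radius = dibr(grants)
```

Compromising the agent vendor yields all of `radius` at once, regardless of which single user's grant looked harmless in isolation.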

DIBR has three properties that distinguish it from pre-agent OAuth risk.

1. Delegation collapses identity. A traditional SaaS integration might hold a scoped API key for “read Slack messages.” That’s a credential, and it’s bounded. An agent holding an OAuth grant with “Allow All” on Drive doesn’t hold a credential; it holds a session. If the agent’s vendor is compromised, the attacker is now the human. They can read everything the human can read, compose everything the human can compose, and move laterally through every system the human’s SSO reaches. The credential/identity distinction that security teams rely on stops working at the agent boundary.

2. Consent UX was never designed for agents. OAuth scopes describe what an app can do at authorization time. They don’t describe what an autonomous agent will do at runtime. A user approving “read your Drive” is not meaningfully consenting to “this agent will read your Drive, reason over every document, and potentially generate outputs that contain exfiltrated content.” Google’s own consent screen shows a list of scopes, not a behavioral model. In the Vercel case, Context.ai’s onboarding asked for Drive read access — exactly what the product needs to function. Nothing about the consent flow would flag this as risky. The scope was honest. The runtime behavior was the risk.

3. Blast radius scales with agent ambition. The more capable the agent, the worse the breach. A narrow AI — say, a meeting summarizer that only touches calendar events from the last 48 hours — has a bounded DIBR. A “universal office suite” agent marketed as being able to understand everything about how your organization works has, by design, maximal DIBR. The product’s value proposition and its worst-case blast radius are the same vector. Context.ai’s sales pitch — 300 connectors, cross-tool reasoning, organizational memory — is also a perfect description of its breach impact.

This is the uncomfortable part: you cannot reduce DIBR without reducing agent capability. The only knobs are scope minimization, token lifetime, and vendor security posture — and all three trade off against the reason you bought the agent in the first place.

This is not a Vercel problem. It’s an agent-era problem.

The instinct right now is to look at the Vercel incident and ask: “What did Vercel do wrong, and how do I avoid being Vercel?” That’s useful but it’s the wrong axis. Vercel’s specific mistakes — non-sensitive-by-default env vars, enterprise Google Workspace OAuth config permissive enough to allow broad grants — are patchable and already being patched.

The unpatchable part is structural. Right now, across the AI ecosystem:

Millions of developers have connected OpenAI, Anthropic, and other API keys to Cursor, Continue, Claude Code, Zed, and dozens of other AI coding tools — in many cases through OAuth to their GitHub identity, not just a local API key.
Every “connect your Google Drive” AI product demo creates a long-lived OAuth grant. Most of those grants are never revoked, never rotated, and never audited.
The Model Context Protocol (MCP) ecosystem is accelerating the pattern: MCP servers are effectively generalized delegation endpoints, and the current norm is to trust them implicitly because they run “locally” or “in the enterprise.”
Agentic IDE integrations — the kind that autonomously read, edit, and commit across an entire codebase — hold scopes that would horrify a security auditor if they were attached to a human service account.

Every one of these is a future Context.ai, waiting for its Lumma Stealer moment. The attack pattern is replicable. The defenses, so far, are not standardized.

There are two structural responses.

Product-side (if you build agent tools): Default to the narrowest scope that lets your product demo, not the scope your product’s full feature set needs. Expose scope minimization as a first-class UI element (“Context.ai full access” versus “Context.ai research only”) so users can make real trust decisions. Issue short-lived tokens and require explicit re-authorization for high-impact actions. Invalidate tokens on any vendor-side incident, not just on user-triggered rotation. Publish an incident response SLA for token compromise.

Deployment-side (if you ship software that depends on agent vendors): Treat every agent vendor’s breach as your breach. The Vercel env var issue isn’t unique — audit whether your platform’s secret store is sensitive-by-default or sensitive-by-opt-in, and switch the defaults. Build a disaster recovery playbook for “assume our primary AI vendor is compromised right now.” Most teams don’t have one. The ones that will survive the next incident in this category are the ones that already wrote it.
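
On the product side, the token rules can be expressed as one authorization check. This is a sketch under assumptions, not any vendor's API: `AgentToken`, `authorize`, the scope labels, and the set of high-impact actions are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical list of actions that always require a fresh user confirmation.
HIGH_IMPACT_ACTIONS = frozenset({"delete_file", "send_email", "share_externally"})


@dataclass(frozen=True)
class AgentToken:
    scopes: FrozenSet[str]
    expires_at: float      # Unix timestamp; keep lifetimes short
    revoked: bool = False  # set on any vendor-side incident


def authorize(
    token: AgentToken,
    scope: str,
    action: str,
    now: Optional[float] = None,
    reauthorized: bool = False,
) -> bool:
    """Allow an action only for a live, in-scope token, with re-auth for risky actions."""
    current = time.time() if now is None else now
    if token.revoked or current >= token.expires_at:
        return False
    if scope not in token.scopes:
        return False
    if action in HIGH_IMPACT_ACTIONS and not reauthorized:
        return False
    return True


token = AgentToken(scopes=frozenset({"drive.read"}), expires_at=1_000.0)
```

Passing `now` explicitly keeps the check testable; in production the default clock would be used.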

What to change this week

If you’re reading this and asking “OK, what do I do Tuesday morning” — here is the ordered list. This is the most concrete thing in the piece, so don’t skip it.

1. Audit your Google Workspace OAuth grants right now. Go to admin.google.com → Security → Access and data control → API controls → App access control and export the full list. For every app, check the scopes. Secure Annex researcher John Tuckner put it sharply: spend a week asking yourself which scopes you’ve allowed and whether you recognize all the services. Most teams have never done this exercise and are shocked by what comes back.

2. Identify every OAuth grant with “broad” or “Allow All” scopes on Drive, Mail, or Calendar. These are your highest-DIBR connections. Revoke the ones you don’t actively use. For the ones you keep, set a calendar reminder to re-audit quarterly. Treat “broad Drive access” as a permission on par with production database access, because in breach terms it is.

3. Check whether your platform’s secrets are sensitive-by-default. Vercel’s model — sensitive is opt-in — is common. Netlify, Render, Railway, and Fly.io all have variations on this pattern. Go into your secret store, identify every non-sensitive secret that carries production access, and either rotate-and-mark-sensitive or move to a dedicated secrets manager (AWS Secrets Manager, GCP Secret Manager, Doppler, Infisical, 1Password).

4. If you ship an agent product, publish your scope minimization story. This is both a security posture and a differentiation opportunity. Buyers in 2026 are going to start asking “what happens when you get breached” — teams that have a good answer will win. Teams that don’t, won’t.

5. If you run agents in production, assume the AI vendor is already compromised and plan the blast radius. The exercise: pick your most-connected agent. Write down every credential, scope, and system it touches. Imagine you wake up tomorrow to a vendor breach disclosure. Which secrets rotate first? Which systems need re-authorization? Which customers need notification? If this exercise takes more than four hours, you don’t have a runbook.

6. Recalibrate your detection baselines for AI-accelerated enumeration. If your SIEM alerts are tuned to “human-paced” attacker behavior — unique resource enumeration rate, error-to-success ratio recovery — they may under-alert against AI-augmented operators. Trend Micro’s writeup has specific guidance on thresholds to revisit. This is worth a security team afternoon.
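
Steps 1 and 2 can be partly automated once the grant list is exported. The sketch below assumes a hypothetical export shape (a list of records with `app`, `user`, and `scopes` keys), which is not a documented Google format; the scope strings themselves are real Google OAuth scopes that grant full or read-everything access.

```python
from typing import Dict, List

# Real Google OAuth scope strings; the labels are this sketch's own wording.
BROAD_SCOPES: Dict[str, str] = {
    "https://www.googleapis.com/auth/drive": "Drive (full access)",
    "https://www.googleapis.com/auth/drive.readonly": "Drive (read everything)",
    "https://mail.google.com/": "Gmail (full access)",
    "https://www.googleapis.com/auth/calendar": "Calendar (full access)",
}


def flag_broad_grants(grants: List[dict]) -> List[dict]:
    """Return one finding per grant that holds at least one broad scope."""
    findings = []
    for grant in grants:
        hits = sorted(BROAD_SCOPES[s] for s in grant["scopes"] if s in BROAD_SCOPES)
        if hits:
            findings.append({"app": grant["app"], "user": grant["user"], "broad": hits})
    # Most exposed grants first, then alphabetical for a stable report.
    return sorted(findings, key=lambda f: (-len(f["broad"]), f["app"], f["user"]))


# Hypothetical export rows.
exported = [
    {"app": "notes-agent", "user": "a@example.com",
     "scopes": ["https://www.googleapis.com/auth/drive", "https://mail.google.com/"]},
    {"app": "meeting-bot", "user": "b@example.com",
     "scopes": ["https://www.googleapis.com/auth/calendar.events.readonly"]},
    {"app": "doc-search", "user": "c@example.com",
     "scopes": ["https://www.googleapis.com/auth/drive.readonly"]},
]
report = flag_broad_grants(exported)
```

The output is a starting list for the revoke-or-keep decision, not a verdict; narrow scopes can still be risky in combination.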

What to watch

Two questions will shape the next six months.

Will any OAuth provider ship “agent consent” as a distinct flow? Google, Microsoft, and Okta can all see that agent grants differ in character from traditional app grants. What the ecosystem needs is a new consent primitive: something like a “delegated agent session” with mandatory short lifetime, mandatory re-authorization for high-impact actions, and a scope model expressive enough to describe runtime behavior, not just capability surface. The first provider to ship this will reset the security baseline for every agent product downstream.

Will platform providers make sensitive-by-default the standard? Vercel is clearly moving in that direction post-incident. If competitors follow, the industry gets safer. If they don’t, Vercel customers end up paying a security tax while customers of other platforms keep eating the old default. Watch the next 60 days of product announcements from Netlify, Render, and Cloudflare.

The Vercel breach is going to be cited for years. Not because the technical details are novel — they mostly aren’t — but because it’s the first high-profile case where the intermediary was an AI agent holding delegated identity, and the ecosystem reaction will set precedent for how we treat agent vendors from here on.

If you’re building agents, you have a few months to fix your defaults before someone else’s breach becomes your problem. Use them.

OpenAI’s AI Deployment Playbook Is Missing a Chapter

The AI Runtime — Wed, 22 Apr 2026 11:03:51 GMT

TL;DR: OpenAI’s “From Experiments to Deployments” whitepaper lays out a solid four-phase framework for scaling AI: foundations, fluency, prioritization, build. But Phase 4 reveals a critical gap: the whitepaper treats evaluation as a step in a checklist rather than a continuous engineering discipline. It describes what to measure (retrieval quality, summarization accuracy, guardrail compliance) without naming who owns it or how it operates at scale. That missing chapter is Model Reliability Engineering, the discipline that sits between the eval checklist and the production system and keeps your AI products trustworthy over time. If you’re an AI engineer reading OpenAI’s playbook, understand the organizational framework, but build MRE into your Phase 4 from day one.

The Whitepaper Gets a Lot Right

Credit where it’s earned. OpenAI’s whitepaper, published in late 2025, distills real lessons from enterprise partnerships with BBVA, Uber, Lowe’s, Booking.com, and others into a four-phase model for scaling AI:

Phase 1: Set the foundations — executive alignment, governance, data access. The “compliance fast path” example from Figma is particularly instructive: data guardrails that enable experimentation rather than blocking it.

Phase 2: Create AI fluency — literacy programs, champion networks, SME development. BBVA’s journey from 3,000 to 11,000 (and now 120,000) ChatGPT Enterprise licenses, powered by a distributed champion network, is the best public case study of this phase working at scale.

Phase 3: Scope and prioritize — repeatable intake processes, impact/effort scoring, reuse-first design. Standard portfolio management, adapted well for AI’s unique characteristics.

Phase 4: Build and scale products — cross-functional teams, incremental builds, gated checkpoints, continuous evaluation.

Phase 4 is where the whitepaper gets interesting — and where it stops too soon.

MRE in the mix

Where MRE Fills the Gap

The whitepaper’s four phases get you to the launch gate. Model Reliability Engineering (MRE) is the operational discipline that keeps AI products reliable after deployment: monitoring behavioral SLOs, detecting drift, and feeding failures back into the build cycle.

The Gap in Phase 4

The whitepaper includes a table that traces a Q&A agent through three evaluation stages: retrieval (does it find the right information?), summarization and grounding (does it synthesize useful, cited answers?), and guardrails (does it stay within approved data, tone, and safety guidelines?). Each stage has a decision gate: continue, refine, or stop.

This is a good checklist. It is not an engineering discipline.

Here’s what the table doesn’t address:

Who owns these evaluations after launch? The whitepaper assigns “SME review” and “safety review” as activities, but never identifies a team or role responsible for ongoing behavioral monitoring. In traditional software, SRE owns uptime. In ML systems, MLOps owns pipeline health. In AI products built on LLMs, who owns behavioral reliability — the question of whether the model is still doing what you deployed it to do?

What happens when the model changes underneath you? The whitepaper acknowledges that “AI systems don’t follow fixed rules” and that “capabilities evolve in weeks, not quarters.” But the evaluation framework is presented as a build-time activity. When your model provider ships a new version — and they will, roughly every three days according to the whitepaper’s own graphic — who reruns those evals? Who detects behavioral drift before your users do?

Where are the SLOs? The table has qualitative goals (“accurate, grounded, and useful”) but no quantitative thresholds. In SRE, you don’t say “the system should be reliable”; you say “99.9% availability measured over a 30-day rolling window.” AI products need the same precision: “faithfulness score above 0.85 on our evaluation suite, measured daily across a stratified sample of production queries.”

What’s the incident response playbook? When a guardrail fails — and it will — what happens? The whitepaper’s “continue/refine/stop” gates are pre-launch decisions. Post-launch, you need detection, triage, mitigation, and postmortem processes. You need to know whether to roll back the prompt, switch models, tighten the guardrail, or escalate to a human.

The Missing Chapter: Model Reliability Engineering

These aren’t minor gaps. They’re the difference between a successful pilot and a production system that earns trust over months and years.

The discipline that fills this gap is what I call Model Reliability Engineering (MRE) — the practice of owning model behavior reliability in production. MRE borrows the operational rigor of Site Reliability Engineering and applies it to the unique challenges of AI systems that generate outputs based on patterns rather than predefined logic.

MRE operates through two layers:

Context Engineering — ensuring the model receives the right information, in the right format, at the right time. This covers retrieval quality, prompt construction, tool orchestration, and the entire input pipeline. When the whitepaper’s “retrieval” and “summarization” stages fail in production, it’s usually a Context Engineering problem: the retrieval pipeline returned stale data, the prompt template drifted, or the context window was consumed by irrelevant information.

Harness Engineering — everything that wraps around model output before it reaches the user. Output validation, consistency checking, safety filtering, fallback logic, and the instrumentation that makes all of this observable. The whitepaper’s “guardrails” stage lives here, but MRE treats it as a continuous runtime concern rather than a pre-launch checkpoint.

Think of it this way: the whitepaper’s Phase 4 table is a construction inspection checklist. MRE is the building management system that keeps the building safe after the inspectors leave.

What This Means for Your Team

If you’re building AI products and following OpenAI’s playbook — which, again, is genuinely good organizational advice — here’s how to fill in the gap:

Define behavioral SLOs before launch. Not “the system should be accurate” but “faithfulness ≥ 0.85, relevance ≥ 0.80, guardrail violation rate < 0.1%, measured daily on a stratified sample of 500 production queries.” These become the contract between your AI product and your organization.

Assign MRE ownership explicitly. Someone — a person, a team, a rotation — needs to own behavioral reliability the way your SRE team owns uptime. They monitor the behavioral SLOs, investigate violations, and coordinate with product and engineering on fixes.

Build for model-provider instability. Pin your model versions. Run behavioral regression tests on every model update. Maintain a rollback capability. The whitepaper says innovation happens every three days — your evaluation system needs to keep pace.

Create an incident response playbook for behavioral failures. When your Q&A agent starts hallucinating, who gets paged? What’s the first mitigation? How do you determine blast radius? These are engineering operations questions, not product management questions.

Instrument everything. Log prompts, retrieved context, raw model outputs, post-processing transformations, and final user-facing responses. Without this trace, you can’t diagnose failures and you can’t run meaningful evals.
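
A behavioral SLO check over a daily sample can be very small. In the sketch below the thresholds come from the example SLOs above, but everything else is an assumption: `EvalRecord`, `check_slos`, the choice to compare the sample mean against each threshold, and the idea that scores arrive precomputed from whatever scorer the team uses.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class EvalRecord:
    faithfulness: float        # 0.0 to 1.0, produced by an external scorer
    relevance: float           # 0.0 to 1.0
    guardrail_violation: bool  # True if any guardrail tripped on this query


SLOS = {"faithfulness": 0.85, "relevance": 0.80, "max_violation_rate": 0.001}


def check_slos(sample: List[EvalRecord]) -> Dict[str, bool]:
    """Return pass/fail per SLO for one day's sample of production queries."""
    if not sample:
        raise ValueError("cannot evaluate SLOs on an empty sample")
    violation_rate = sum(r.guardrail_violation for r in sample) / len(sample)
    return {
        "faithfulness": mean(r.faithfulness for r in sample) >= SLOS["faithfulness"],
        "relevance": mean(r.relevance for r in sample) >= SLOS["relevance"],
        "guardrail": violation_rate < SLOS["max_violation_rate"],
    }


# Hypothetical day: 500 sampled queries, one guardrail violation.
sample = [EvalRecord(0.9, 0.82, False) for _ in range(499)]
sample.append(EvalRecord(0.9, 0.82, True))
status = check_slos(sample)
```

Running the same check after every pinned-model upgrade turns it into the behavioral regression test; a `False` entry is what should page the MRE owner.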

The Bigger Pattern

This gap isn’t unique to OpenAI’s whitepaper. It reflects a broader industry blind spot: we’ve gotten good at building AI systems and reasonably good at evaluating them before launch, but we haven’t yet developed the operational discipline for keeping them reliable in production.

SRE emerged because uptime required its own discipline, separate from software engineering. MLOps emerged because model pipelines required their own discipline, separate from DevOps. MRE is the next layer — the discipline that owns the behavior of AI systems that are neither deterministic nor static.

OpenAI’s playbook will get you to production. Model Reliability Engineering is what keeps you there.

The Eval Lifecycle: What Actually Happens Between “Proof of Concept” and “Production”

The AI Runtime — Mon, 20 Apr 2026 11:03:55 GMT

TL;DR: OpenAI’s enterprise whitepaper quietly introduced a three-stage evaluation framework for AI agents — retrieval, summarization/grounding, and guardrails — with a continue/refine/stop gate at each stage. This framework is more important than anything else in the 25-page document, and the whitepaper spends exactly one table on it. Here’s the expanded version: how each eval stage actually works, what tools exist to run them, what “good” looks like at each gate, and how the entire lifecycle repeats at MVP, pilot, and production scale. If you’re building AI products, this is the technical architecture that determines whether your proof of concept ever graduates.

Why Evals Are the Whole Game

There’s a moment in every AI project where the demo works. The retrieval is pulling relevant chunks, the model is generating coherent answers, and the stakeholders are nodding. This moment is dangerous.

It’s dangerous because the gap between “works in a demo” and “works in production” is not a linear improvement problem. It’s a category shift. In a demo, you control the inputs, you cherry-pick the questions, and you evaluate by gut feel. In production, real users ask unpredictable questions against messy data, and you evaluate by numbers you’ve committed to in advance.

The eval lifecycle is the structured process that bridges this gap. OpenAI’s enterprise whitepaper sketches it in a single table. Let’s build the full architecture.

Stage 1: Retrieval Evaluation

Retrieval Evals

Each stage has its own metrics, its own evaluation set, and its own continue/refine/stop gate. The lifecycle repeats at MVP, pilot, and production scale — with the evaluation set roughly doubling at each stage.

The question: Does the system reliably find the right information?

This is where most AI products fail first — not because retrieval is hard to build, but because retrieval is hard to evaluate well. A retrieval system that returns plausible results will pass casual inspection. A retrieval system that returns the right results for edge cases is what separates a demo from a product.

What you’re measuring:

Recall — of all the documents that should have been retrieved, what fraction did the system actually find? Low recall means the system is missing relevant information. For a Q&A agent over company docs, this might mean missing the updated policy while retrieving the obsolete one.

Precision — of all the documents retrieved, what fraction are actually relevant? Low precision means the model’s context window is polluted with irrelevant material, degrading downstream generation quality.

Mean Reciprocal Rank (MRR) — is the most relevant document appearing first, or buried in position five? Models pay more attention to what appears early in context. If your best document consistently ranks third, your answers will be worse than they should be.
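
The three metrics are short enough to compute by hand. This sketch uses the standard definitions, with MRR taken as the average over queries of 1 divided by the rank of the first relevant document; the function names and the document IDs in the example are hypothetical.

```python
from typing import List, Set, Tuple


def recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the retrieved documents that are relevant."""
    unique = set(retrieved)
    if not unique:
        return 0.0
    return len(unique & relevant) / len(unique)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant document, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# One query: the first relevant document appears in position 2.
retrieved = ["d3", "d1", "d9", "d2"]
relevant = {"d1", "d2", "d4"}
```

For this query, recall is 2/3, precision is 2/4, and the reciprocal rank is 1/2.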

How you build the evaluation set:

Start with 50-100 representative queries drawn from actual user conversations (or realistic simulations). For each query, a domain expert labels which documents should be retrieved. This labeled set becomes your retrieval ground truth.

This is tedious and irreplaceable. Automated approaches — using an LLM to judge retrieval relevance — are useful for scaling evaluations but unreliable for building the initial ground truth. The domain expert knows that “Q3 revenue guidance” should retrieve the board deck, not the press release. The LLM doesn’t know your organization well enough to make that distinction.

The gate decision:

Continue if recall ≥ 0.85 and precision ≥ 0.75 on your evaluation set. Refine if recall is between 0.60 and 0.85, or if recall clears 0.85 but precision falls short of 0.75; this usually means adjusting chunking strategy, embedding model, or retrieval parameters. Stop if recall is below 0.60: the retrieval pipeline needs fundamental rework before downstream evaluation is meaningful.
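Committing to the gate in code keeps it from drifting into gut feel. A minimal sketch using the thresholds from this section; `retrieval_gate` is a hypothetical name, and treating every result that is neither a stop nor a continue as "refine" is an assumption of this sketch.

```python
def retrieval_gate(recall: float, precision: float) -> str:
    """Map Stage 1 metrics to a continue / refine / stop decision."""
    if recall < 0.60:
        # Retrieval needs fundamental rework before anything downstream is meaningful.
        return "stop"
    if recall >= 0.85 and precision >= 0.75:
        return "continue"
    # Everything in between: tune chunking, embeddings, or retrieval parameters.
    return "refine"
```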

Track token costs at this stage. Retrieving too many documents burns context window space and money. Retrieving too few misses information. The right balance is specific to your use case.

Stage 2: Summarization and Grounding Evaluation

The question: Does the system synthesize clear, consistent, useful, and cited answers? Did it follow the right steps and access the right data?

This is the stage where the whitepaper’s description — “evals on traces/logs + SME review” — is most dangerously compressed. “SME review” alone can mean anything from “my colleague glanced at five outputs” to “three domain experts independently rated 200 outputs on a structured rubric.” The difference in quality assurance is enormous.

What you’re measuring:

Faithfulness — does the answer only contain claims that are supported by the retrieved context? An answer can be correct according to the model’s training data but unfaithful to the retrieved context, which means it’s hallucinating in a way that’s invisible to the user. This is the most important metric in the entire eval lifecycle and the one most teams measure poorly.

Relevance — does the answer actually address the question? A faithfully grounded answer that doesn’t answer the user’s question is useless.

Completeness — does the answer cover all the relevant information from the retrieved context? Partial answers erode trust over time even when they’re technically accurate.

Citation accuracy — if the system claims “according to document X,” is that claim actually in document X? Citation errors are trust-destroying because they’re verifiable — a user who checks a citation and finds it doesn’t match will never trust the system again.

How you build the evaluation:

For each query in your evaluation set, have domain experts write the “gold standard” answer — the response a knowledgeable human would give. Then compare model outputs against these references.

Automated faithfulness evaluation is one of the areas where LLM-as-judge approaches are genuinely useful. Have a separate model (not the one generating the answer) check whether each claim in the output is supported by the retrieved context. Tools like RAGAS, DeepEval, and TruLens provide frameworks for this, but the key insight is: use a different model for evaluation than the one generating answers. Models are unreliable judges of their own outputs.
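The shape of a claim-level faithfulness check is simple even before you pick a framework. The sketch below is not how RAGAS, DeepEval, or TruLens implement it; it only shows the structure, assuming the answer has already been split into individual claims. The `judge` callable is a placeholder for a call to a separate evaluator model, and `substring_judge` is a toy stand-in so the example runs without any API.

```python
from typing import Callable, List

# A judge takes (claim, retrieved_context) and says whether the context supports the claim.
# In practice this would wrap a call to a model other than the one that wrote the answer.
Judge = Callable[[str, str], bool]


def faithfulness_score(claims: List[str], context: str, judge: Judge) -> float:
    """Fraction of claims in an answer that the retrieved context supports."""
    if not claims:
        # An answer that asserts nothing has nothing unsupported in it.
        return 1.0
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> bool:
    """Toy judge for demonstration only: exact (case-insensitive) containment."""
    return claim.lower() in context.lower()
```

Keeping the judge as an injected function makes it easy to swap evaluator models and to confirm that the judge and the generator are never the same one.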

The gate decision:

Continue if faithfulness ≥ 0.85, relevance ≥ 0.80, and citation accuracy ≥ 0.90 on a sample of 200+ queries. Refine if faithfulness is between 0.70 and 0.85 — this usually means adjusting the system prompt to enforce stricter grounding, or improving the retrieval stage to provide better context. Stop if faithfulness is below 0.70. A system that hallucinates in 30%+ of responses is not ready for any form of user testing.

Stage 3: Guardrail Evaluation

The question: Does it stay within approved data, tone, and safety guidelines?

Guardrails get treated as an afterthought in most AI projects — the safety review that happens the week before launch. That’s backwards. Guardrail failures are the ones that make the news, generate legal liability, and destroy user trust in ways that no amount of accuracy improvement can repair.

What you’re measuring:

Topic boundary compliance — does the system stay within its defined scope? A legal Q&A agent that starts offering medical advice has failed a topic boundary guardrail, even if the medical advice happens to be accurate.

Tone and brand consistency — does the system’s voice match organizational guidelines? A customer-facing agent that suddenly becomes casual or sarcastic when asked difficult questions has a tone guardrail failure.

Safety filtering — does the system refuse or redirect harmful, offensive, or manipulative inputs? This isn’t just about explicit toxicity — it includes prompt injection attempts, jailbreaking, and social engineering.

PII handling — does the system avoid exposing, generating, or echoing personally identifiable information? This is both a safety and a regulatory requirement.

How you build the evaluation:

Create an adversarial test set. This is distinct from the representative test set used in stages 1 and 2. Adversarial tests specifically probe boundaries: out-of-scope questions, prompt injection attempts, requests for information the system shouldn’t have, edge cases where tone guidance is ambiguous.

A strong adversarial test set has 100+ cases across these categories, built by people who actively try to break the system. This is one area where “red teaming” (having humans try to elicit harmful outputs) provides signal that automated evaluation cannot replicate.

The gate decision:

Continue if guardrail violation rate < 0.5% on the adversarial test set and < 0.1% on the representative test set. Refine if violations are between 0.5% and 2% — usually by tightening the system prompt, adding output filters, or restricting tool access. Stop if violation rate exceeds 2% on the adversarial set. Safety is not a gradient.
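As with the earlier stages, the guardrail gate is worth writing down as code. A minimal sketch with the thresholds above; the function names are illustrative, and each test case is assumed to have been reduced to a boolean (True means the system violated a guardrail).

```python
from typing import List


def violation_rate(results: List[bool]) -> float:
    """Share of test cases in which a guardrail was violated."""
    if not results:
        raise ValueError("test set is empty")
    return sum(results) / len(results)


def guardrail_gate(adversarial_rate: float, representative_rate: float) -> str:
    """Map Stage 3 violation rates to a continue / refine / stop decision."""
    if adversarial_rate > 0.02:
        return "stop"
    if adversarial_rate < 0.005 and representative_rate < 0.001:
        return "continue"
    # Tighten the system prompt, add output filters, or restrict tool access.
    return "refine"
```

Note what the thresholds imply for set size: with 100 adversarial cases, a single violation is already 1%, so passing the 0.5% bar means zero violations.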

The Lifecycle Repeats at Every Scale

Here’s what the whitepaper mentions but doesn’t emphasize enough: this three-stage evaluation runs at every deployment gate, not just once.

MVP gate: Run all three stages on your evaluation set. Small scale (50-100 queries for retrieval, 200 for summarization, 100 adversarial). The goal is to validate the architecture, not achieve production quality.

Pilot gate: Re-run with production data from pilot users. The evaluation set should now include real queries you didn’t anticipate. Expand the adversarial set based on actual user behavior. Introduce latency and cost measurements — a system that takes 30 seconds per response won’t be adopted regardless of accuracy.

Production gate: Full evaluation suite plus continuous monitoring. This is where the eval lifecycle transitions from a build activity to an operational responsibility. The same metrics you used to gate deployment now become the SLOs your team monitors daily.

The whitepaper’s “once proven in a narrow scope, the same checks repeat at pilot and production scale” is correct, but it undersells the expansion that happens at each gate. Your evaluation set should roughly double at each stage. Your adversarial set should incorporate everything users tried during the previous stage. And your automated monitoring should replace the manual SME review that gates earlier stages.

The Tooling Stack

You don’t need to build this from scratch. The eval tooling ecosystem has matured significantly:

Retrieval evaluation: RAGAS and DeepEval both provide retrieval metrics out of the box. LangSmith and Arize Phoenix offer tracing that connects retrieval to downstream generation quality.

Faithfulness and grounding: RAGAS faithfulness metrics, DeepEval’s hallucination detection, and custom LLM-as-judge evaluations using structured prompts. Braintrust and HumanLoop provide platforms for managing evaluation datasets and running automated evals at scale.

Guardrails: Guardrails AI, NeMo Guardrails (NVIDIA), and Lakera Guard for safety filtering. LangFuse for observability and trace-level analysis.

End-to-end: LangSmith, Braintrust, and Arize Phoenix each provide integrated platforms that span all three stages, with tracing, evaluation, and monitoring in a single tool.

Pick one end-to-end platform and supplement with specialized tools where needed. The worst outcome is building a custom evaluation framework from scratch — you’ll spend months replicating what these tools provide on day one.

The Real Lesson

The whitepaper frames evaluation as Phase 4 — something that happens when you build products. That’s wrong. Evaluation is the connective tissue that links every phase.

Your Phase 1 data access decisions determine whether you can build a retrieval evaluation set. Your Phase 2 fluency programs determine whether you have SMEs capable of writing gold-standard answers. Your Phase 3 prioritization determines whether you’ve chosen use cases where evaluation is tractable.

The eval lifecycle isn’t a step in the process. It’s the process.

Your AI Strategy Doesn’t Need More Use Cases. It Needs a Production System.

The AI Runtime — Sat, 18 Apr 2026 11:02:38 GMT

TL;DR: Most enterprise AI strategies are lists of use cases hunting for approval. The companies that actually reach production — BBVA (120,000 employees), Lowe’s (1,700 stores), Intercom (millions of monthly resolutions), Booking.com (global trip planning) — didn’t succeed because they found better use cases. They succeeded because they built production systems: repeatable engineering, governance, and organizational infrastructure that turns any validated idea into a deployed product. After analyzing seven enterprise deployments from OpenAI’s whitepaper, the path to production comes down to five architectural decisions most companies either skip or get wrong. This article is the strategy document your CTO needs — not another use-case brainstorm, but the engineering and organizational blueprint for making AI deployable by default.

The Pilot Trap

Here’s what happens at most companies: A team identifies a promising AI use case. They build a prototype. It works in the demo. Stakeholders are excited. Then nothing happens for six months.

The prototype needs production data — but the data team hasn’t classified which datasets are approved for AI use. The prototype needs a deployment environment — but the infrastructure team hasn’t provisioned one for AI workloads. The prototype needs a compliance review — but legal doesn’t have a framework for evaluating AI-specific risks. The prototype needs an evaluation suite — but nobody has defined what “good enough” means.

Each of these is a solvable problem. The issue is that they’re solved sequentially, per-project, by the same team that built the prototype. The team that’s good at building AI prototypes is now spending 80% of its time on governance, infrastructure, and cross-functional coordination.

This is the pilot trap: the gap between prototype and production isn’t a technology problem. It’s a systems problem. And it requires a systems solution.

Pilot to Prod

Decision 1: Build the Production Infrastructure Before You Need It

The companies that reached production with AI fastest didn’t wait for a use case to justify infrastructure investment. They built the production path first.

Figma created a “compliance fast path” — pre-classified data, pre-defined guardrails, pre-approved experiment categories — so that any team could test AI tools without triggering a per-project compliance review. The governance infrastructure existed before the use cases that needed it.

BBVA established data boundaries, security protocols, and a Center of Excellence before expanding from 3,000 to 11,000 licenses. By the time they were ready to scale to 120,000, the infrastructure was battle-tested.

What this means for your strategy: Before you prioritize your top 10 use cases, answer these five infrastructure questions:

Data readiness — Which datasets are classified and approved for AI use? What’s the process for approving new ones? How fast can a team get access to production data for a validated use case?

Governance framework — What types of AI experiments are pre-approved? What triggers a full review? Who has decision rights, and what are the escalation paths?

Evaluation infrastructure — Do you have an eval framework that any team can plug into? Can you define and measure behavioral SLOs before launch?

Deployment pipeline — Can a team go from approved prototype to production deployment without building custom infrastructure? Is there a standard path with gated checkpoints?

Monitoring — Once deployed, who owns ongoing behavioral reliability? What gets measured, how often, and what triggers intervention?

If you can’t answer these questions, your first AI project isn’t a use case — it’s building this infrastructure. Every subsequent use case becomes faster and cheaper because the path already exists.

Decision 2: Treat AI Fluency as Engineering Capacity, Not HR Training

The whitepaper from OpenAI frames AI fluency as a training and culture initiative — workshops, champion networks, hackathons. That framing misses the most important dimension: engineering fluency determines your production velocity.

Intercom’s ability to migrate models in days comes from engineers who deeply understand their evaluation pipeline. Booking.com shipped a prototype in 8-10 weeks because their engineers could integrate OpenAI’s API with existing ML infrastructure without rearchitecting. BBVA’s 3,000+ custom GPTs were built by employees who understood enough about prompt engineering to create useful tools without engineering support.

What this means for your strategy: Fluency investment should be tiered:

Tier 1: Universal literacy. Everyone in the organization understands what AI can and can’t do, when to use it, and how to interact with it effectively. This is the workshop-and-hackathon layer.

Tier 2: Builder capability. Product managers, analysts, and domain experts can build custom GPTs, design prompts, and evaluate AI outputs against domain-specific quality standards. BBVA’s “wizards” operate at this tier.

Tier 3: Production engineering. Engineers can build, evaluate, deploy, and monitor AI systems in production. They can design evaluation suites, implement guardrails, instrument observability, and run behavioral regression tests against model updates. This tier determines how fast you can ship.

Most enterprise AI strategies invest heavily in Tier 1, modestly in Tier 2, and almost nothing in Tier 3. Then they wonder why pilots don’t reach production. The bottleneck is almost always Tier 3 engineering capacity — not use-case ideas, not executive sponsorship, not data access.

Decision 3: Prioritize Reuse Over Innovation

The whitepaper advises designing “for reuse from the start.” This understates how transformative reuse-first thinking actually is.

Lowe’s built one AI foundation and deployed it as two products — customer-facing Mylow and associate-facing Mylow Companion. Same knowledge base, same model, different interfaces. The second product was dramatically cheaper and faster than the first because the foundational engineering was already done.

BBVA’s internal GPT Store means solutions built by one team are immediately available to the entire organization. A legal team’s document analysis GPT becomes a compliance team’s document analysis GPT with minimal modification.

What this means for your strategy: When prioritizing use cases, the highest-value next project isn’t always the highest-impact standalone idea. It’s often the one that shares the most infrastructure with what you’ve already built.

Score each candidate use case on two dimensions: standalone value (impact if built in isolation) and infrastructure leverage (how much existing code, data pipelines, evaluations, and governance it can reuse). The use case that scores highest on the product of both dimensions is your next build — not the one with the highest standalone value.
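The scoring rule is simple enough to put in a spreadsheet, or in a few lines of code. A minimal sketch, assuming both dimensions are rated on the same 1 to 5 scale; the scale and the example use cases in the usage note are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UseCase:
    name: str
    standalone_value: float         # impact if built in isolation (assumed 1-5 scale)
    infrastructure_leverage: float  # reuse of existing code, data, evals, governance (1-5)

    @property
    def score(self) -> float:
        # The product, not the sum: a zero-leverage idea cannot win on value alone.
        return self.standalone_value * self.infrastructure_leverage


def rank_use_cases(cases: List[UseCase]) -> List[UseCase]:
    """Highest combined score first."""
    return sorted(cases, key=lambda case: case.score, reverse=True)
```

With hypothetical candidates such as a novel voice agent (value 5, leverage 1), contract review (4, 2), and an HR policy Q&A tool (3, 4), the Q&A tool ranks first at 12 even though it has the lowest standalone value.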

Concretely: if you’ve already built a retrieval pipeline, evaluation framework, and guardrail system for an internal knowledge Q&A tool, your next use case should probably be another knowledge Q&A tool for a different domain — not a completely different architecture that requires building everything from scratch.

This feels counterintuitive because organizations reward novelty (“we’re building something new!”) over leverage (“we’re deploying what we already have to a new domain”). But leverage is what compounds. Novelty is what creates one-off pilots.

Decision 4: Measure Causally, Not Correlatively

Uber ran controlled experiments comparing AI-augmented workflows with traditional ones. OpenAI’s internal sales assistant was measured against corrections from top performers. Booking.com tracked engagement time, search-to-booking conversion, and support ticket volume against baselines.

Most companies measure AI adoption metrics: number of users, messages sent, satisfaction surveys. These metrics can show adoption without proving value. A tool that’s widely used but subtly wrong — plausible but inaccurate answers, faster-but-lower-quality outputs — will show positive adoption metrics while degrading actual business outcomes.

What this means for your strategy: Define your measurement architecture before you deploy:

Causal measurement — Can you run controlled comparisons? A/B tests between AI-augmented and traditional workflows? Before/after analysis with matched cohorts? If you can’t establish causation, you’re optimizing for adoption, not impact.

Business outcome metrics — What business metric does this use case actually move? Not “time saved” (self-reported) but “resolution speed” (measured). Not “user satisfaction with the tool” but “customer satisfaction with the outcome.”

Counterfactual tracking — What would have happened without the AI? This is the hardest measurement to build and the most important. Without it, you attribute every improvement to AI and every failure to something else.

Cost-per-outcome — What does each AI-generated outcome actually cost, including compute, human review, error correction, and organizational overhead? Lowe’s discovered that 68% of their queries didn’t need their flagship model — a discovery only possible with per-query cost instrumentation.
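Cost-per-outcome is the easiest of the four to get wrong by leaving terms out. A minimal sketch of the arithmetic, assuming you can attribute each cost category to the same reporting period; the parameter names and the figures in the usage note are hypothetical.

```python
def cost_per_outcome(
    compute: float,
    human_review: float,
    error_correction: float,
    organizational_overhead: float,
    successful_outcomes: int,
) -> float:
    """Fully loaded cost of one successful AI-generated outcome in a period."""
    if successful_outcomes <= 0:
        raise ValueError("need at least one successful outcome to compute a unit cost")
    total = compute + human_review + error_correction + organizational_overhead
    return total / successful_outcomes
```

For example, $1,200 of compute, $800 of review, $300 of error correction, and $200 of overhead across 5,000 resolved queries comes to $0.50 per outcome, more than double what the compute line alone would suggest.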

The goal isn’t to measure everything. It’s to measure the right things with enough rigor to make deployment and expansion decisions based on evidence rather than enthusiasm.

Decision 5: Assign Production Ownership Before Launch

The whitepaper describes building cross-functional teams with “engineers, SMEs, data leads, and executive sponsors.” What it doesn’t specify — and what matters most — is who owns the system after launch.

In traditional software, this is obvious: the engineering team that built it operates it, with SRE support. In AI products, it’s ambiguous. The model changes without you deploying anything. The data changes without you modifying anything. The behavior changes without you touching anything. Someone needs to own this.

What this means for your strategy: Before any AI product launches, assign three ownership roles:

Behavioral reliability owner — monitors behavioral SLOs (faithfulness, relevance, safety), detects drift, coordinates response to behavioral incidents. This is the MRE function, whether you call it that or not.

Model management owner — tracks model provider updates, runs regression tests on new versions, manages model selection and routing decisions. This role prevents the “silent model update breaks production” failure mode.

Business value owner — monitors the causal metrics from Decision 4, determines whether the product is still delivering the value that justified deployment, and decides when to expand, refine, or sunset.

These can be the same person on a small team, but they can’t be no one. The most common failure mode in enterprise AI isn’t a spectacular crash — it’s a slow, invisible degradation where the model gets slightly worse over weeks and nobody notices because nobody is watching.
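Slow degradation is exactly what a trailing-window check is for. A minimal sketch, assuming the behavioral reliability owner has one score per day for a metric such as faithfulness; the seven-day window and the function name are assumptions, not a recommendation from the whitepaper.

```python
from statistics import mean
from typing import List, Optional


def first_slo_breach(daily_scores: List[float], slo: float, window: int = 7) -> Optional[int]:
    """Index of the first day whose trailing-window average falls below the SLO.

    Returns None if the SLO holds for every full window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    for end in range(window, len(daily_scores) + 1):
        if mean(daily_scores[end - window:end]) < slo:
            return end - 1
    return None
```

A windowed average trades detection speed for stability: a single bad day will not page anyone, but a sustained drop surfaces within the window length.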

Building Your Path-to-Production Document

If you’re a CTO, VP of Engineering, or AI lead, here’s the strategic document you should build — not a list of use cases, but a production system specification:

Page 1: Infrastructure readiness assessment. Where do you stand on data classification, governance framework, evaluation infrastructure, deployment pipeline, and monitoring? What’s the gap between current state and production-ready?

Page 2: Fluency investment plan. How are you building Tier 1 (literacy), Tier 2 (builder), and Tier 3 (production engineering) capabilities? What’s the timeline for each, and how do you measure progress?

Page 3: First three use cases, scored on standalone value × infrastructure leverage. Not your ten best ideas — your three best first ideas, chosen because they build infrastructure that makes everything after them faster.

Page 4: Measurement architecture. For each use case, what’s the causal measurement strategy? What business outcomes are you tracking, and how are you establishing counterfactuals?

Page 5: Ownership model. Who owns behavioral reliability, model management, and business value for each deployed product? What’s the incident response playbook?

This document isn’t a strategy deck that gets presented once and forgotten. It’s a living system specification that evolves with every deployment. Each new product strengthens the infrastructure, expands the evaluation framework, deepens organizational fluency, and makes the next deployment faster.

The companies in OpenAI’s whitepaper didn’t scale AI because they had better ideas. They scaled because they built production systems that turn good ideas into deployed products — repeatedly, reliably, and with compounding returns.

Your AI strategy should do the same.

Building your own path-to-production document? I’m collecting examples of enterprise AI production system designs for a future AIEW deep-dive. Reply with what you’re building — anonymized details welcome.

Claude Opus 4.7: The Production Engineer’s Breakdown

The AI Runtime — Fri, 17 Apr 2026 11:04:40 GMT

TL;DR - Anthropic released Claude Opus 4.7 on April 16, 2026, available via the Claude API as claude-opus-4-7, plus Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Pricing is unchanged from Opus 4.6 at $5 per million input tokens and $25 per million output tokens. The marketing line is “better coding, better vision, same price.” That is true and it understates what shipped. Opus 4.7 introduces two new control surfaces (the xhigh effort level and task budgets in beta), four breaking changes to the Messages API that will silently affect existing integrations, seven behavior shifts that will affect how your prompts perform, more than 3x the maximum image resolution with 1:1 coordinate mapping, file-system memory improvements that change how persistent agents work, deliberately throttled cyber capabilities as part of Project Glasswing, and a tokenizer change that can move your bill by up to 35%. If you run agents in production, this release is less about a smarter model and more about a model engineered to behave more predictably under load. The benchmark gains follow from the engineering, not the other way around.

What you actually get

Strip out the marketing and the technical envelope is straightforward. According to Anthropic’s developer documentation, Opus 4.7 supports the 1M token context window, 128k max output tokens, adaptive thinking, and the same set of tools and platform features as Claude Opus 4.6. The 1M context window comes at standard API pricing with no long-context premium — a meaningful change for anyone who has been chunking aggressively to stay under the previous tier boundaries.

Opus 4.7

The model is generally available across Claude products and the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. For business users, Opus 4.7 is available on Claude for Pro, Max, Team, and Enterprise users. Per Anthropic’s product page, pricing for Opus 4.7 starts at $5 per million input tokens and $25 per million output tokens, with up to 90% cost savings via prompt caching and 50% via batch processing.
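To turn the list prices into a budget line, a rough estimator helps. A minimal sketch using the $5 and $25 per million token prices quoted above; the `tokenizer_factor` parameter is an assumption of this sketch, meant for applying a measured token-count multiplier to counts taken with an older tokenizer. Caching and batch discounts are deliberately left out because how they apply depends on your traffic.

```python
INPUT_PRICE_PER_MTOK = 5.00    # USD per million input tokens, as quoted above
OUTPUT_PRICE_PER_MTOK = 25.00  # USD per million output tokens, as quoted above


def request_cost(input_tokens: int, output_tokens: int, tokenizer_factor: float = 1.0) -> float:
    """Estimated list-price cost in USD, before caching or batch discounts."""
    if tokenizer_factor <= 0:
        raise ValueError("tokenizer_factor must be positive")
    scaled_input = input_tokens * tokenizer_factor
    scaled_output = output_tokens * tokenizer_factor
    return (
        scaled_input * INPUT_PRICE_PER_MTOK + scaled_output * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000
```

Run it twice, once with a factor of 1.0 and once with your worst-case multiplier, to bracket the bill rather than guess a single number.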

The architectural lift over Opus 4.6 is concentrated in three places: a retrained tokenizer, a redesigned thinking-effort surface, and significantly improved high-resolution vision. Everything else in the release — the new tools, the breaking changes, the behavior shifts — flows from those three.

Two new control surfaces

The most consequential additions for engineers building autonomous workflows are the new effort level and task budgets. They change what “tuning a Claude integration” actually means.

The `xhigh` effort level

The new xhigh level sits between high and max. Per the effort documentation, Anthropic recommends starting with xhigh for coding and agentic use cases, with high as the minimum for most intelligence-sensitive workloads. The API default is high. In Claude Code, xhigh is now the default for all plans and providers on Opus 4.7.

What changed beyond the new tier is how strictly the model respects effort. Per Anthropic’s migration guide, Opus 4.7 respects effort levels more strictly than Opus 4.6, especially at low and medium. At those lower levels, the model scopes its work to what was asked rather than going above and beyond. The practical implication is that a moderately complex task running at low effort will under-think rather than silently escalate. If you observe shallow reasoning on complex problems, raise effort to high or xhigh rather than prompting around it.

Two production-relevant data points are worth knowing before you migrate. First, per a Hex testimonial in the launch post, low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6. Second, per Anthropic’s launch post, on their internal agentic coding evaluation the net token usage across all effort levels improved versus Opus 4.6, meaning the efficiency gains outweighed the tokenizer increase and the deeper thinking. Anthropic explicitly notes the evaluation runs autonomously from a single prompt and may not represent interactive coding patterns.

Task budgets (beta)

Task budgets are the more architecturally interesting new control surface, because they are the first time a Claude model is given visibility into its own remaining budget. Per the docs, a task budget gives Claude a rough estimate of how many tokens to target for a full agentic loop, including thinking, tool calls, tool results, and final output. The model sees a running countdown and uses it to prioritize work and finish the task gracefully as the budget is consumed.

The API surface is straightforward. Set the beta header task-budgets-2026-03-13 and add the following to your output config:

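import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=128000,  # hard per-request ceiling; not passed to the model
    output_config={
        "effort": "high",
        # Advisory budget for the full agentic loop; the model sees the countdown.
        "task_budget": {"type": "tokens", "total": 128000},
    },
    messages=[
        {"role": "user", "content": "Review the codebase and propose a refactor plan."}
    ],
    betas=["task-budgets-2026-03-13"],  # opts in to the task budgets beta
)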

The minimum value for a task budget is 20k tokens. If the model is given a task budget that is too restrictive for a given task, it may complete the task less thoroughly or refuse to do it entirely. For open-ended agentic tasks where quality matters more than speed, Anthropic recommends not setting a task budget; reserve them for workloads where you need the model to scope its work to a token allowance.

What makes this design different from a hard cap is that the model is aware of it. A task budget is advisory: a suggestion the model plans around, not a limit that cuts it off. This is distinct from max_tokens, which is a hard per-request ceiling that is not passed to the model at all. max_tokens is a guillotine: the model never sees it and gets cut off when it hits. task_budget is a clock: the model sees the countdown and adjusts behavior to land cleanly within the budget. For long-running agentic work where graceful degradation matters more than abrupt termination, this is a meaningfully better primitive.

Four breaking changes you might miss

These breaking changes apply to the Messages API only. If you use Claude Managed Agents, there are no breaking API changes for Claude Opus 4.7. The first two return 400 errors that flag the issue clearly. The third and fourth are silent: they raise no error and surface only as subtle behavior changes downstream if you skip the migration audit. All four are documented in the official “What’s new in Claude Opus 4.7” reference.

Extended thinking budgets are removed. Setting thinking: {"type": "enabled", "budget_tokens": N} will return a 400 error. Adaptive thinking is the only thinking-on mode, and Anthropic reports their internal evaluations show it reliably outperforms extended thinking. The new pattern uses adaptive thinking with effort as the depth control:

# Before (Opus 4.6)
thinking = {"type": "enabled", "budget_tokens": 32000}

# After (Opus 4.7)
thinking = {"type": "adaptive"}
output_config = {"effort": "high"}

There is also a subtler shift here. Adaptive thinking is off by default on Claude Opus 4.7. Requests with no thinking field run without thinking. Set thinking: {"type": "adaptive"} explicitly to enable it.

Sampling parameters are removed. Setting temperature, top_p, or top_k to any non-default value will return a 400 error. The safest migration path is to omit these parameters entirely from requests and use prompting to guide the model’s behavior. The prior trick of setting temperature = 0 for “determinism” is also gone — per Anthropic’s own note, it never guaranteed identical outputs, and now it does not even run.

Thinking content is omitted by default. Thinking blocks still appear in the response stream, but their thinking field will be empty unless the caller explicitly opts in. This is a silent change — no error is raised — and response latency will be slightly improved. If your product streams reasoning to users, the new default will appear as a long pause before output begins. Set "display": "summarized" to restore visible progress during thinking.

Updated token counting. Claude Opus 4.7 uses a new tokenizer that contributes to its improved performance on a wide range of tasks. Per the docs, this new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models, varying by content, and /v1/messages/count_tokens will return a different number of tokens for Opus 4.7 than it did for Opus 4.6. The 1.0–1.35x range is wide enough that “your bill went up 5%” and “your bill went up 30%” are both plausible outcomes — measure on real traffic before extrapolating. Anthropic suggests updating your max_tokens parameters to give additional headroom, including for compaction triggers.
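Three of these four changes can be handled mechanically on the request side. The sketch below is a hypothetical helper, not part of any SDK: it rewrites an Opus 4.6-style request dictionary using only the changes described above. It drops the sampling parameters, converts a fixed thinking budget to adaptive thinking with an effort level, and adds headroom to max_tokens, capped at the 128k output limit. The default headroom of 1.35 is an assumption taken from the top of the documented tokenizer range; measure on real traffic before settling on a value.

```python
import math
from copy import deepcopy
from typing import Any, Dict

SAMPLING_PARAMS = ("temperature", "top_p", "top_k")
MAX_OUTPUT_TOKENS = 128_000  # max output tokens stated for Opus 4.7


def migrate_request(
    params: Dict[str, Any], effort: str = "high", token_headroom: float = 1.35
) -> Dict[str, Any]:
    """Return a copy of a Messages API request dict adjusted for Opus 4.7."""
    migrated = deepcopy(params)  # never mutate the caller's request
    migrated["model"] = "claude-opus-4-7"

    # Non-default sampling parameters now return a 400, so omit them entirely.
    for name in SAMPLING_PARAMS:
        migrated.pop(name, None)

    # Fixed thinking budgets are removed; adaptive thinking plus effort replaces them.
    thinking = migrated.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") == "enabled":
        migrated["thinking"] = {"type": "adaptive"}
        migrated.setdefault("output_config", {})["effort"] = effort

    # The new tokenizer can use more tokens for the same text; leave headroom.
    if "max_tokens" in migrated:
        scaled = math.ceil(migrated["max_tokens"] * token_headroom)
        migrated["max_tokens"] = min(MAX_OUTPUT_TOKENS, scaled)

    return migrated
```

The helper leaves requests without a thinking field untouched, which means they will run without thinking on Opus 4.7. Whether that is what you want is a per-integration decision, not something to automate.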

Seven behavior shifts that will change how your prompts perform

These are not breaking changes in the API contract sense, but they will silently affect the quality of your existing prompts. The official behavior change list reads almost like a release note for an operations-focused fork:

Instruction following is now literal, particularly at lower effort levels. The model will not silently generalize an instruction from one item to another, and will not infer requests you didn’t make. The most common failure mode in early migration coverage: bullet-list “suggestions” that earlier Claude models treated as optional hints are now treated as hard requirements.

Response length calibrates to perceived task complexity, rather than defaulting to a fixed verbosity. Short queries get short answers. Complex queries get longer ones. If you have prompt scaffolding that forced specific response lengths, expect different behavior.

Fewer tool calls by default. The model uses tools less often than Opus 4.6 and uses reasoning more. Raising effort increases tool usage; per the migration guide, high or xhigh effort settings show substantially more tool usage in agentic search and coding.

More direct, opinionated tone. Less validation-forward phrasing and fewer emoji than Claude Opus 4.6’s warmer style. Whether this is what your end users want depends entirely on your product surface.

More regular progress updates during long agentic traces. If you’ve added scaffolding to force interim status messages, try removing it.

Fewer subagents spawned by default. Steerable through prompting.

Real-time cybersecurity safeguards. Newly added in Claude Opus 4.7, requests that involve prohibited or high-risk topics may lead to refusals. Legitimate security teams can apply to the Cyber Verification Program for reduced restrictions.

The cumulative effect across all seven is a model that does more of what you tell it to do and less of what it inferred you wanted. For teams with mature prompt libraries built against Opus 4.6, this is a real audit obligation. For teams writing new integrations, it is a meaningful reduction in “magical” behavior that you cannot test for.

Vision: the genuinely large step function

The vision upgrade is the single largest capability jump in the release. Per the docs, maximum image resolution increased to 2576px / 3.75MP, up from the previous limit of 1568px / 1.15MP. That is more than 3x the pixel count.

Two technical details matter beyond the headline number. First, the model’s coordinates now map 1:1 with actual pixels, so there’s no scale-factor math required for any computer-use agent that needs to point at specific UI elements. Second, the upgrades extend beyond resolution: low-level perception (pointing, measuring, counting) and image localization (bounding-box detection) both improved.

The biggest reported lift comes from XBOW, building autonomous penetration testing. Per their testimonial in the launch post, visual acuity moved from 54.5% on Opus 4.6 to 98.5% on Opus 4.7. That is the kind of step function that obsoletes architectural workarounds. If your computer-use or document-analysis agent has ever included logic to chunk, crop, or downsample images to compensate for the previous resolution ceiling, that code is now technical debt. One tradeoff to plan for: higher-resolution images consume more tokens — downsample images before sending if the additional fidelity is unnecessary.
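
For cases where the extra fidelity is not needed, a downsampling step can be sketched as below. This is a minimal sizing helper, assuming the 2576px figure is a cap on the longer edge and that it applies together with the 3.75MP cap; the function name is hypothetical, and the actual resize would be done by whatever image library you already use.

```python
import math

MAX_EDGE_PX = 2576       # longest-edge limit quoted above (assumed interpretation)
MAX_PIXELS = 3_750_000   # 3.75 MP limit quoted above


def fit_within(width, height, max_edge=MAX_EDGE_PX, max_pixels=MAX_PIXELS):
    """Return (w, h) scaled down, never up, to satisfy both limits.

    The aspect ratio is preserved up to rounding.
    """
    scale = min(
        1.0,
        max_edge / max(width, height),
        math.sqrt(max_pixels / (width * height)),
    )
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


# A 12 MP photo is reduced; a small screenshot is left untouched.
print(fit_within(4000, 3000))
print(fit_within(800, 600))
```

Passing smaller limits than the defaults is how you would trade fidelity for tokens on purpose.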

File-system memory improvements

Per the docs, Opus 4.7 is better at writing and using file-system-based memory. If an agent maintains a scratchpad, notes file, or structured memory store across turns, that agent should improve at jotting down notes to itself and leveraging its notes in future tasks.

For teams that have built persistent agents — the kind that work across multiple sessions on long-running projects — this is a quietly significant improvement. The agent that previously needed extensive context restoration at the start of each session can now do more of that work itself by writing better notes and using them more effectively. Anthropic’s client-side memory tool gives you a managed scratchpad if you do not want to roll your own.

The downstream effect is fewer tokens spent on context restoration and more on actual work. Multi-session agentic workflows that previously felt like they were starting from scratch each time should feel more continuous.

Training and the cyber capability story

The most editorially interesting decision in this release is what Anthropic deliberately did not improve. Per the launch post, during training Anthropic experimented with efforts to differentially reduce Opus 4.7’s cyber capabilities relative to Mythos Preview. The model also ships with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

This is the first generally available model carrying the Project Glasswing safeguard stack — Anthropic’s approach to staging powerful model releases by testing new safeguards on less-capable models before broader rollout of Mythos-class capabilities. Per Vellum AI’s benchmark analysis, on CyberGym, Opus 4.7 scores 73.1%, effectively flat against Opus 4.6’s revised 73.8%, while Mythos Preview scores 83.1% on the same benchmark but remains restricted to vetted partners.

For production teams, two takeaways. First, if you have legitimate security workloads — vulnerability research, penetration testing, red-teaming — the Cyber Verification Program is the path to reduced restrictions. Apply early; the program is new and the enrollment cycle is unclear. Second, the safeguard-first deployment pattern is likely to repeat. Anthropic states that what they learn from real-world deployment of these safeguards will inform their goal of a broad release of Mythos-class models, which means the next Mythos-class model will likely not arrive without similar testing on a less capable model first.

What the alignment evals actually say

The safety profile is honest about being incomplete. Per the launch post, Anthropic’s alignment assessment concluded that the model is “largely well-aligned and trustworthy, though not fully ideal in its behavior.” Mythos Preview remains the better-aligned model by Anthropic’s own evaluations.

Specifics worth knowing if you operate Opus 4.7 in user-facing contexts:

Honesty and resistance to malicious prompt injection attacks are improvements on Opus 4.6. For agents that consume web content, customer documents, or third-party tool output, prompt injection resistance is the most active reliability threat surface, and the improvement is meaningful.
The model is modestly weaker on overly detailed harm-reduction advice for controlled substances.
Per reporting by The Decoder on the system card, Opus 4.7 still refuses to assist in 33% of simulated AI safety research tasks, a significant drop from 88% with Opus 4.6. Still imperfect, but a categorical shift.
The system card distinguishes between factual hallucinations (wrong claims about the world) and input hallucinations (the model acting as if it has access to a tool or attachment that doesn’t actually exist), and Opus 4.7 performs better than or on par with Opus 4.6 across factual hallucination benchmarks.

The customer feedback in the launch post is consistent with these numbers. Hex reports the model correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks, and resists dissonant-data traps that even Opus 4.6 falls for. Vercel notes the model is more honest about its own limits and even runs proofs on systems code before starting work — behavior they had not seen in earlier Claude models. Notion measured a 14% improvement at fewer tokens and a third of the tool errors, with the model continuing to execute through tool failures that previously stopped Opus cold.

None of these are intelligence claims. They are behavioral consistency claims. For anyone operating the model in production, behavioral consistency is the metric that drives or kills a deployment.

The cost story (with real numbers)

Pricing has not changed: $5 per million input tokens, $25 per million output tokens. Three factors will still move your actual bill:

The tokenizer. As covered above, expect 1.0–1.35x as many tokens for the same text, with the exact figure varying by workload shape. This is the first thing to measure on your own traffic before any production rollout.

Higher effort means more thinking. Per the launch post, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings — this improves reliability on hard problems but produces more output tokens. Anthropic’s own internal coding evaluation shows token usage improving across all effort levels for that specific workload, but the result is workload-dependent.

Counter-evidence from actual deployments. Per Box’s Head of AI Yashodha Bhavnani as reported by 9to5Mac, in Box’s evaluations Opus 4.7 had a 56% reduction in model calls and 50% reduction in tool calls. The Hex observation that low-effort 4.7 matches medium-effort 4.6 points the same direction. The honest read: per-token costs may rise; per-task costs often fall, because the model finishes work in fewer iterations. Whether your bill goes up or down depends on whether your workflow is throttled by tokens-per-call or by calls-per-task.

The practical playbook: instrument cost-per-completed-task, not just tokens-per-call, before you decide whether the upgrade is favorable for your specific workload.
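
One way to instrument that metric is sketched below. The record shape and function names are hypothetical, the token counts are made-up illustrations rather than measurements, and the prices are the per-million rates quoted above.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Token totals for one end-to-end task attempt (all calls summed)."""
    input_tokens: int
    output_tokens: int
    completed: bool


def cost_usd(run, in_price=5.0, out_price=25.0):
    """Cost of one run at per-million-token prices."""
    return (run.input_tokens * in_price + run.output_tokens * out_price) / 1_000_000


def cost_per_completed_task(runs):
    """Total spend, including failed attempts, divided by successful tasks."""
    done = sum(1 for r in runs if r.completed)
    if done == 0:
        return float("inf")
    return sum(cost_usd(r) for r in runs) / done


# Illustrative comparison: more tokens per call, fewer failed attempts.
before = [TaskRun(200_000, 20_000, True), TaskRun(200_000, 20_000, False)]
after = [TaskRun(230_000, 24_000, True), TaskRun(230_000, 24_000, True)]
print(cost_per_completed_task(before), cost_per_completed_task(after))
```

Counting failed attempts in the numerator is the design choice that matters: a model that costs more per call but fails less can still come out cheaper per finished task.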

Claude Code: /ultrareview, auto mode, and new defaults

For Claude Code users, three changes ship alongside the model:

/ultrareview slash command. A dedicated review session that reads through changes and flags bugs and design issues a careful reviewer would catch. Pro and Max Claude Code users get three free ultrareviews to try it out.

Auto mode extended to Max. Auto mode is a permissions option where Claude makes decisions on your behalf, meaning longer tasks run with fewer interruptions and with less risk than skipping all permissions. Per 9to5Mac’s reporting, it was previously available for Teams, Enterprise, and API customers, and is now also available to Max plan subscribers.

xhigh is now the default in Claude Code across all plans and providers on Opus 4.7. Per the Claude Code docs, when you first run Opus 4.7, Claude Code applies xhigh even if you previously set a different effort level for Opus 4.6 or Sonnet 4.6. Sessions will use more thinking tokens by default, which produces higher-quality results at slightly higher cost. Override via /effort high if you preferred the old behavior.

Migration playbook

A concrete sequence for moving production workloads, distilled from Anthropic’s official migration guide:

Audit your existing prompts against the new literal instruction-following behavior on your top three workflows. Look specifically for bullet-list suggestions, imperative verbs used loosely, and any prompt that depends on the model “filling in” implied context.

Re-test integrations that set thinking: {"type": "enabled"} or any sampling parameter. Both will return 400 errors now. Migrate to adaptive thinking with effort as the depth control.

Measure tokenizer impact on a representative sample of real traffic before extrapolating cost. Code-heavy and prose-heavy workloads land at different points in the 1.0–1.35x band.

Set task_budget on long-running agentic workflows. Even if you do not yet need it as a cost guard, the discipline of declaring an upper bound forces clarity on what “done” looks like for autonomous runs.

If you are running computer-use agents, prioritize re-evaluating the vision pipeline. The 3.75MP ceiling and 1:1 coordinate mapping change architectural decisions that were made under earlier constraints.

If you have legitimate security workloads, apply to the Cyber Verification Program. The new safeguards will refuse some requests that Opus 4.6 handled.

For teams running Opus 4.6 at high or max as a reliability fallback, test Opus 4.7 one tier lower against the same evaluations. The cost-per-task math may justify staying at lower effort.
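
The second step in the list above can be sketched as a pure function over request dictionaries. This is a hedged illustration, not Anthropic's reference code: the `{"type": "adaptive"}` thinking value and the `output_config.effort` field are my assumptions about the request shape and should be checked against the migration guide, and `migrate_request` itself is a hypothetical helper.

```python
# Sampling parameters that the article says now trigger 400 errors.
SAMPLING_PARAMS = ("temperature", "top_p", "top_k")


def migrate_request(params, effort="high"):
    """Return a copy of an Opus 4.6-style request dict adjusted for Opus 4.7.

    Assumptions (verify against the official migration guide):
    - adaptive thinking is requested with {"type": "adaptive"}
    - effort is set under output_config["effort"]
    """
    migrated = {k: v for k, v in params.items() if k not in SAMPLING_PARAMS}
    thinking = migrated.get("thinking") or {}
    if thinking.get("type") == "enabled":
        migrated["thinking"] = {"type": "adaptive"}
        migrated["output_config"] = {
            **migrated.get("output_config", {}),
            "effort": effort,
        }
    return migrated


legacy = {
    "model": "your-opus-4-6-model-id",  # placeholder, not a real model ID
    "max_tokens": 4096,
    "temperature": 0.2,
    "thinking": {"type": "enabled", "budget_tokens": 8000},
    "messages": [{"role": "user", "content": "Summarize this diff."}],
}
print(migrate_request(legacy))
```

Running a helper like this over recorded requests gives a quick inventory of which integrations depend on the removed parameters.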

Bottom line

Opus 4.7 is the clearest signal yet that frontier model releases now compete on two distinct axes. One axis is raw capability, where the field has visibly converged: on graduate-level reasoning measured by GPQA Diamond, as reported by The Next Web, Opus 4.7 scores 94.2%, GPT-5.4 Pro scores 94.4%, and Gemini 3.1 Pro scores 94.3%, with the differences within noise. The other axis is operational maturity: how predictably the model behaves under load, how cleanly it integrates with engineering controls, how honestly it reports its own limits.

Anthropic invested in the second axis. Self-verification before reporting, loop resistance, lower variance, fewer tool errors, honest uncertainty, task-aware budgets, literal instruction following, prompt injection resistance — the entire shape of this release is about the model being a better operational citizen, not a smarter conversationalist. The benchmark gains follow from that engineering. They do not lead it.

For anyone running agents in production, the upgrade is straightforward but the prompt audit is real. For anyone designing new agentic workflows, the launch post explicitly frames this as the model where users can hand off their hardest work with less supervision than before — a claim worth testing against your own evaluations rather than taking on faith.

The next model release will tell us whether this becomes the new norm. If it does, the era of treating frontier models as raw intelligence to be wrangled by external scaffolding is ending, and the era of treating them as engineered systems with first-class operational primitives is beginning.

Opus 4.7 is the strongest single data point so far that we are already in that second era.

Sources & further reading

Primary (Anthropic):

Introducing Claude Opus 4.7 — the official launch post, including all partner testimonials cited above
What’s new in Claude Opus 4.7 — developer documentation covering breaking changes, behavior shifts, and capability improvements
Migration guide: Opus 4.6 → Opus 4.7 — official upgrade guidance
Effort parameter documentation — recommended effort levels per workload type
Task budgets documentation — full setup and tuning guidance
Claude Code model configuration — Claude Code-specific defaults and overrides
Project Glasswing — context for the cyber capability staging strategy
Cyber Verification Program — application form for security professionals
Claude Opus 4.7 System Card — referenced throughout the launch post

Secondary (third-party reporting and analysis):

Vellum AI: Claude Opus 4.7 Benchmarks Explained — source for CyberGym scores cited above
The Decoder: Anthropic’s Claude Opus 4.7 makes a big leap in coding — source for the AI safety research refusal numbers from the system card
9to5Mac: Anthropic reveals new Opus 4.7 model — source for Box’s deployment numbers and auto mode availability details
The Next Web: Claude Opus 4.7 leads on SWE-bench and agentic reasoning — source for cross-model GPQA Diamond comparison


The Responses API Is OpenAI’s Bet That State Belongs on the Server

The AI Runtime — Thu, 16 Apr 2026 11:03:51 GMT

TL;DR - OpenAI launched the Responses API in March 2025 to replace both Chat Completions (for new projects) and the Assistants API (sunsetting August 2026). The core bet: move conversation state, reasoning token persistence, and tool execution to the server so developers stop rebuilding the same plumbing. The result is 40–80% better cache utilization than Chat Completions, chain-of-thought that survives across turns, built-in tools (web search, file search, code interpreter, computer use, MCP), and a compaction system that lets agents run beyond the context window. If you’re building anything multi-turn on OpenAI today, the Responses API isn’t optional — it’s the surface where new capabilities land first.

The Problem the Responses API Solves

Every developer who has built a production chatbot on the Chat Completions API knows the ritual. User sends a message. You fetch the entire conversation history from your database. You prepend the system prompt. You serialize the whole thing into a messages array. You send it. You get a response. You store it. Next turn, you do it all again — with one more message appended.

This works. It also wastes money, breaks prompt caching, and throws away the model’s reasoning between turns.


The Assistants API tried to fix this in late 2023 by moving state server-side. Persistent threads. Managed runs. Built-in tools. The abstraction was right, but the execution was painful: creating a thread, adding a message, kicking off a run, polling for completion, then finally retrieving the response. Five API calls for one answer. Rate limits tied to threads. Opaque state that was hard to debug. And because no other provider implemented the Assistants API, adopting it meant full vendor lock-in to a perpetual beta.

The Responses API is OpenAI’s second attempt. It takes the right ideas from Assistants — server-side state, built-in tools, persistent reasoning — and delivers them through the simplicity of a single API call. No threads. No runs. No polling.

Every architectural choice has a regime where it’s right and a regime where it’s wrong. Stateless APIs were the right answer for the workloads LLMs were first built against: classification, single-turn Q&A, one-shot generation. What you sent was what you paid for, and the abstraction was symmetric, clean, and cheap to reason about.

Agentic systems break that regime. An agent isn’t a classifier — it’s a sequential decision process in which every step depends on the reasoning, tool calls, and results of every prior step. Forcing that shape onto a stateless API creates what I call the Stateless Tax — three compounding costs that scale with conversation depth and never appear as a single line item on your bill.

Replay cost is the visible one. A 20-turn conversation resends 20 messages every turn, with the system prompt bolted to the front each time. Prompt caching is supposed to fix this, and does — until a single dynamic token at the start of the prefix shatters the cache and you’re paying full freight again. The longer the agent runs, the larger the tax, and the more fragile the mitigation.

Reasoning amnesia is the cost most developers never see. GPT-5 and o3 generate hidden chain-of-thought tokens that shape the final answer. On a stateless API, those tokens are discarded the moment the response returns. Next turn, the model reasons from absolute zero — not from where it left off. The conversation looks continuous to the user; the cognition restarts on every call. This is why OpenAI’s own evals show a ~3% SWE-bench lift and a ~4-point Tau-Bench Retail gain just from switching APIs, with no model change. Persisting reasoning isn’t a minor optimization. It’s the model being functionally smarter, because it stops getting wiped between turns.

Observability debt is the silent one. Stateless APIs return a final message; everything between input and output — tool calls, reasoning items, retrieval decisions — is opaque by construction. You can reconstruct it with careful logging, but you’re rebuilding state the API already had and discarded. In production debugging, this is the difference between a stack trace and a single error code.

Server-managed state collapses all three costs into a single API primitive. Response chains eliminate replay. Reasoning items persist cognition across turns. Typed output items turn every step the agent took into an inspectable artifact.
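
The replay cost described above can be made concrete with a little arithmetic. This is a minimal sketch under simplifying assumptions: every message has the same token count, and each turn consists of one user message and one assistant reply. The function name and the sample figures are illustrative.

```python
def replayed_tokens(turns, system_tokens, tokens_per_message):
    """Total input tokens sent over a conversation on a stateless API.

    Turn t carries the system prompt, the (t - 1) earlier user/assistant
    pairs, and the new user message.
    """
    total = 0
    for t in range(1, turns + 1):
        messages_in_request = 2 * (t - 1) + 1
        total += system_tokens + messages_in_request * tokens_per_message
    return total


# 20 turns, an 800-token system prompt, 150 tokens per message.
print(replayed_tokens(20, 800, 150))
```

The per-turn payload grows linearly, so the cumulative total grows quadratically with conversation depth, which is why the tax is easy to miss in short test conversations.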

This is why calling the Responses API “a better Chat Completions” undersells what actually happened in March 2025. It’s the first major commercial inference API to treat agentic workloads as a distinct architectural category — one where statelessness isn’t the clean default. It’s a misconfiguration that gets more expensive the longer your agent runs.

The Nine Features That Matter

1. Server-Side State via `store` and `previous_response_id`

This is the single biggest architectural change. With Chat Completions, you resend the entire conversation every turn. With the Responses API, you set store: true and the server remembers. On the next turn, pass previous_response_id instead of the full history.

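from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTIONS = "You are a customer support agent for Acme Corp."

# Turn 1
response1 = client.responses.create(
    model="gpt-5",
    store=True,
    instructions=INSTRUCTIONS,
    input="What's your return policy for electronics?"
)

# Turn 2: no history resending needed. Instructions are not carried over
# by previous_response_id, so pass them again on every turn.
response2 = client.responses.create(
    model="gpt-5",
    store=True,
    previous_response_id=response1.id,
    instructions=INSTRUCTIONS,
    input="What if I lost the receipt?"
)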

Response objects are saved for 30 days by default. You can delete them explicitly with client.responses.delete(response_id). For organizations with Zero Data Retention requirements, OpenAI provides encrypted reasoning items — you get the reasoning persistence benefit without server-side storage.
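
For the Zero Data Retention path, the request shape looks roughly like the sketch below. `stateless_turn_kwargs` is a hypothetical helper; it assumes, per my reading of OpenAI's documentation, that encrypted reasoning is requested with `include=["reasoning.encrypted_content"]` alongside `store=False`, and that the caller keeps the running context itself.

```python
def stateless_turn_kwargs(context_items, user_text, model="gpt-5"):
    """Build kwargs for client.responses.create() without server-side storage.

    context_items: prior input and output items kept by the caller,
    including any encrypted reasoning items returned on earlier turns.
    """
    return {
        "model": model,
        "store": False,
        "include": ["reasoning.encrypted_content"],
        "input": [*context_items, {"role": "user", "content": user_text}],
    }


kwargs = stateless_turn_kwargs([], "What's your return policy for electronics?")
print(kwargs["store"], kwargs["include"])
```

After each call, the caller would append the new user message and the returned output items to its own context so the encrypted reasoning travels with the next request. The trade-off is that you are back to managing history yourself.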

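Why this matters: A 20-turn customer support conversation on Chat Completions resends the entire accumulated history every turn. On the Responses API, you send exactly one item: the new user input. The server handles the rest. Note that OpenAI still bills the prior turns in the chain as input tokens, so the savings come from a smaller payload, less state-management code, and better cache hit rates rather than from the history being free.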

2. Reasoning Token Persistence

This is the feature most developers don’t know they’re missing.

When you use a reasoning model like GPT-5 or o3 through Chat Completions, the model generates chain-of-thought tokens during inference. But those tokens aren’t returned to you. On the next turn, the model starts reasoning from scratch — like a detective who forgets all the clues every time they leave the room.

With the Responses API’s previous_response_id, reasoning tokens from the previous turn survive into the next turn. The model builds on its prior thinking instead of starting over.

OpenAI’s internal evals show a 3% improvement on SWE-bench with the same prompt and setup when using Responses instead of Chat Completions. That number sounds modest, but on agentic benchmarks like TAU-bench the gap widens to 5%, because multi-step reasoning tasks compound the benefit of persistent chain-of-thought.

3. Built-In Tools

Chat Completions gives you function calling — you define schemas, the model returns tool_calls, you execute them, you send results back. Every tool call is a round trip through your backend.

The Responses API adds hosted tools that OpenAI executes for you:

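response = client.responses.create(
    model="gpt-5",
    instructions="You are a research assistant.",
    input="What were the key announcements at GTC 2026?",
    tools=[
        # OpenAI runs the search
        {"type": "web_search"},
        # OpenAI runs the code in a managed container
        {"type": "code_interpreter", "container": {"type": "auto"}},
        # OpenAI searches uploaded files (placeholder vector store ID)
        {"type": "file_search", "vector_store_ids": ["vs_YOUR_STORE_ID"]},
        # Connect to an external MCP server (placeholder label and URL)
        {"type": "mcp", "server_label": "example",
         "server_url": "https://example.com/mcp"},
        # Computer use is also a hosted tool, but it needs its own display
        # and environment settings, so it is omitted from this sketch.
    ]
)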

Because tool execution happens server-side for hosted tools, you eliminate the round-trip latency of bouncing every call through your own backend. You can still define custom function tools alongside the hosted ones — the two compose naturally.

The web_search tool uses the same models powering ChatGPT search, which score around 90% accuracy on the SimpleQA benchmark — dramatically better than plain GPT models without search. File search integrates with OpenAI’s vector stores for a RAG pipeline without custom infrastructure. And the MCP tool connects to any Model Context Protocol server, meaning your agent can interact with external services through a standardized interface.
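
Composing a custom function with the hosted tools can be sketched as follows. The tool name, schema, and handler are hypothetical; the item fields used (`type`, `name`, `arguments`, `call_id`) follow the Responses API function-calling format as I understand it, and the response items are faked with `SimpleNamespace` so the example runs without a network call.

```python
import json
from types import SimpleNamespace

# Hypothetical custom tool definition, passed in `tools` next to hosted ones.
ORDER_STATUS_TOOL = {
    "type": "function",
    "name": "get_order_status",
    "description": "Look up the shipping status of an order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
        "additionalProperties": False,
    },
    "strict": True,
}


def get_order_status(order_id):
    # Stand-in for a real backend lookup.
    return {"order_id": order_id, "status": "shipped"}


HANDLERS = {"get_order_status": get_order_status}


def run_function_calls(output_items):
    """Execute function_call items and build the outputs to send back."""
    results = []
    for item in output_items:
        if item.type != "function_call":
            continue  # hosted tool calls were already executed server-side
        args = json.loads(item.arguments)
        result = HANDLERS[item.name](**args)
        results.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(result),
        })
    return results


fake_output = [
    SimpleNamespace(type="web_search_call"),
    SimpleNamespace(type="function_call", name="get_order_status",
                    arguments='{"order_id": "A1"}', call_id="call_1"),
]
print(run_function_calls(fake_output))
```

The returned list would go into the `input` of the next request, chained with `previous_response_id`. Only the custom function costs a round trip through your backend.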

4. The `instructions` Parameter Replaces System Messages

Chat Completions overloads the messages array with a system role message. The Responses API separates concerns: instructions define what the model is, input defines what the user asks.

response = client.responses.create(
    model="gpt-5",
    instructions="You are a tax assistant. Always cite relevant IRS publications.",
    input="What deductions can I claim for my home office?"
)

This isn’t just cosmetic. Because instructions sit at the start of the context as a stable prefix, they cache far more effectively than a system message buried in a mutable messages array. The architectural separation between static identity and dynamic conversation is what enables the 40–80% cache improvement OpenAI reports in internal tests.
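
Whether the prefix is actually caching is something you can check per response. The sketch below assumes the usage object exposes `input_tokens` and `input_tokens_details.cached_tokens`, which matches the Responses API usage format as I understand it; `cache_hit_rate` is a hypothetical helper, and the sample usage object is faked.

```python
from types import SimpleNamespace


def cache_hit_rate(usage):
    """Fraction of input tokens that were served from the prompt cache."""
    if usage.input_tokens == 0:
        return 0.0
    return usage.input_tokens_details.cached_tokens / usage.input_tokens


# Stand-in for response.usage from a real call.
sample_usage = SimpleNamespace(
    input_tokens=2000,
    input_tokens_details=SimpleNamespace(cached_tokens=1500),
)
print(cache_hit_rate(sample_usage))
```

Logging this number per turn makes regressions visible, for example when someone adds a timestamp to the top of the instructions and the hit rate drops to zero.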

5. Output Items Instead of Choices

Chat Completions returns a choices array where each choice contains a single message. The Responses API returns an output array of typed items. A single response can contain reasoning items, tool calls, tool results, and the final message — all as separate, inspectable objects.

output: [
  { type: "reasoning",        ... },   # Reasoning item (summary, if requested)
  { type: "web_search_call",  ... },   # Hosted tool call, executed by OpenAI
  { type: "function_call",    ... },   # Custom function call, executed by you
  { type: "message",          ... },   # Final text response
]

This is transformative for debugging and observability. With Chat Completions, tool execution is a black box — you see what went in and what came out, but the intermediate steps are invisible. With Items, you get receipts. Every step the model took is an inspectable object in the response. You can build richer UIs, structured audit logs, and step-by-step tracing from a single response.
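
A step-by-step trace can be derived from the output array with very little code. This is a minimal sketch: the helper is hypothetical, and the items are faked with `SimpleNamespace` using type names I believe the API emits (`reasoning`, `web_search_call`, `function_call`, `message`).

```python
from collections import Counter
from types import SimpleNamespace


def summarize_output(output_items):
    """Return (ordered trace of item types, count of items per type)."""
    trace = [item.type for item in output_items]
    return trace, Counter(trace)


# Stand-in for response.output from a real call.
fake_items = [
    SimpleNamespace(type="reasoning"),
    SimpleNamespace(type="web_search_call"),
    SimpleNamespace(type="web_search_call"),
    SimpleNamespace(type="message"),
]
trace, counts = summarize_output(fake_items)
print(" -> ".join(trace))
print(dict(counts))
```

In a real audit log you would also record each item's ID and the response ID, so a trace can be tied back to a specific turn.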

6. The Conversations API

For applications that need durable, long-lived conversations — think customer support tickets that span days — the Conversations API provides a persistent container:

# Create a persistent conversation
conversation = client.conversations.create(
    metadata={"user_id": "user_123", "session_type": "support"}
)

# Use it across multiple responses
response = client.responses.create(
    model="gpt-5",
    store=True,
    conversation=conversation.id,
    input="How do I reset my password?"
)

Conversations persist indefinitely (no 30-day TTL like standalone responses). You can retrieve all items from a conversation, fork it at any point, and resume across sessions and devices. It replaces the Assistants API’s Threads concept without the polling overhead.

7. Compaction for Long-Running Agents

Every agentic workflow eventually hits the context window ceiling. The Responses API introduces compaction — an intelligent summarization of older conversation content to make room for new work while preserving critical context.

Two modes are available. Server-side compaction triggers automatically when the context crosses a threshold you set:

response = client.responses.create(
    model="gpt-5.4",
    input=conversation_history,
    store=False,
    context_management=[{
        "type": "compaction",
        "compact_threshold": 200000
    }]
)

Client-side compaction gives you explicit control via the /responses/compact endpoint — you send a full context window, and the API returns a compressed version with an encrypted compaction item that carries forward key state.

This is what enables GPT-5.4 to sustain coherent progress across agent trajectories that would previously collapse when the context window filled up. The compaction endpoint is fully stateless and ZDR-friendly.

8. Tool Search for Large Tool Surfaces

If your agent has 50+ function definitions, sending all of them in every request wastes tokens, breaks cache prefixes, and degrades tool selection accuracy. GPT-5.4 introduces tool search: deferred tool loading where the model dynamically discovers relevant tools at runtime.

Instead of defining every tool upfront, you make tools searchable. The model loads only the definitions it needs for the current request. This preserves cache performance, reduces token usage, and improves latency for enterprise applications with large tool inventories.

9. Flexible Input Formats

Chat Completions requires a messages array with role and content objects. The Responses API accepts three formats:

# Simple string
input="What is the return policy?"

# Message array (familiar from Chat Completions)
input=[{"role": "user", "content": "What is the return policy?"}]

# Multimodal input with images, audio, documents
input=[
    {"role": "user", "content": [
        {"type": "input_text", "text": "Summarize this document"},
        {"type": "input_file", "file_id": "file_abc123"}
    ]}
]

The string shorthand eliminates boilerplate for simple single-turn calls. The multimodal support makes text, images, PDFs, and audio first-class citizens in the same input array.

Case Study: Migrating a Customer Support RAG System

Let’s make this concrete. Consider a mid-size e-commerce company running a customer support bot on Chat Completions with GPT-4o. Here’s their current architecture and what changes with a Responses API migration.

The Before: Chat Completions Architecture

User message arrives
  → App fetches full conversation history from Postgres (all turns)
  → App prepends system prompt (800 tokens of instructions)
  → App calls embeddings API with the user's question
  → App queries Pinecone for relevant knowledge base chunks
  → App injects retrieved chunks into the messages array
  → App sends everything to Chat Completions
  → App parses response
  → App stores response in Postgres
  → If tool call: app executes tool, sends result back, waits again
  → Repeat for every turn

The pain points: Every turn resends the full conversation (0% prompt cache hit rate). The system prompt is 800 tokens of static instructions re-sent identically on every request. RAG requires a separate embeddings call plus a vector DB query before every API call. Tool execution requires multiple round trips. Over a 15-turn conversation, the system prompt alone accounts for 12,000 input tokens (800 × 15), nearly all of them redundant. And the model’s reasoning resets between every turn.

The After: Responses API Architecture

User message arrives
  → App sends one API call with previous_response_id + new input
  → Built-in file_search handles RAG (vector store configured once)
  → Built-in web_search handles real-time queries
  → Model's reasoning persists from prior turns
  → Static instructions cached via `instructions` parameter
  → Response returned with full item trail for observability
  → Repeat

What You Actually Save

Token costs: The instructions parameter creates a stable prefix that caches across turns. OpenAI’s extended prompt cache retention (up to 24 hours) means the system prompt stays cached throughout a support agent’s entire shift. For a 15-turn conversation, the roughly 12,000 repeated instruction tokens are served from cache at the discounted rate rather than paid at full price, and you gain the 40–80% cache improvement on the remaining context.

Infrastructure: You can retire your Pinecone instance (or equivalent) for this use case — file search with vector stores handles the RAG pipeline. You eliminate the embeddings call, the vector query, and the chunk injection logic.

Quality: Reasoning persistence means the model remembers not just what was said, but how it was thinking about the problem. When a customer asks a follow-up that builds on a complex refund calculation, the model’s prior chain-of-thought carries forward instead of starting from scratch.

Observability: Every response contains typed output items — you can log exactly which knowledge base documents were retrieved, which tools were called, and what reasoning the model applied, all from a single response object.

The Migration Decision Matrix

Not every application should migrate today. Here’s how to think about it:

Migrate now if you have multi-turn conversations with reasoning models, applications resending full conversation history every turn, workflows that need built-in web search or file search, or agentic systems hitting context window limits.

Migrate incrementally if you have a mix of simple and complex flows. The Responses API is a superset of Chat Completions — you can migrate individual user flows that benefit from reasoning persistence while keeping simpler flows on Chat Completions.

Wait and watch if you have single-turn, stateless workloads with no tools (basic classification, single-shot generation). Chat Completions handles these fine and will be supported indefinitely.

Be cautious if your architecture requires full control over conversation state for compliance reasons, though encrypted reasoning items and ZDR support address most of these concerns.

The Assistants → Responses Concept Map

If you’re migrating from the Assistants API (sunset: August 26, 2026), the mapping is straightforward:

Assistants API              → Responses API
─────────────────────────────────────────────
Assistant object            → instructions + model + tools (inline config)
Thread                      → Conversation (or previous_response_id chain)
Message                     → Input items
Run (create → poll → get)   → Single responses.create() call
Run Steps                   → Output items (inspectable per-step)
Code Interpreter            → {"type": "code_interpreter"} built-in tool
File Search / Retrieval     → {"type": "file_search"} built-in tool
Thread-based state          → store: true + conversation or previous_response_id

The biggest win: you go from a five-step async flow (create thread → add message → create run → poll status → get response) to a single synchronous API call that returns the complete result.
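
As a rough illustration of the mapping above, the sketch below assembles the keyword arguments for one responses.create() call per turn. It only builds a dictionary, so it runs without network access. The helper name build_turn, the model string, the vector store ID, and the response ID are all placeholders invented for this example.

```python
from typing import Optional


def build_turn(user_text: str, previous_response_id: Optional[str] = None) -> dict:
    """Assemble arguments for a single responses.create() call.

    build_turn is a hypothetical helper; model and IDs are placeholders.
    """
    request = {
        "model": "your-model-name",                   # placeholder model ID
        "instructions": "You are a support agent.",   # was: Assistant object
        "tools": [
            # was: File Search / Retrieval on the Assistant
            {"type": "file_search", "vector_store_ids": ["vs_placeholder"]},
        ],
        "input": user_text,                           # was: Message on a Thread
        "store": True,                                # was: Thread-based state
    }
    if previous_response_id is not None:
        # Chain onto the prior turn instead of resending the history.
        request["previous_response_id"] = previous_response_id
    return request


first_turn = build_turn("Where is my refund?")
follow_up = build_turn("And the shipping fee?", previous_response_id="resp_placeholder")
```

With the official Python SDK, each dictionary would be passed as client.responses.create(**request), and the ID for the next turn would come from the returned response object.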

What to Watch

The Responses API is clearly where OpenAI is investing. New capabilities — tool search, compaction, computer use, MCP support — are landing in Responses first, sometimes exclusively. GPT-5.4’s tool calling with reasoning: none is only supported in the Responses API, not Chat Completions.

But there are trade-offs to keep eyes on. Server-side state means you’re trusting OpenAI with your conversation data (responses are retained for 30 days by default). The in-memory fast path caches only the most recent response; older IDs are hydrated from persisted state when store: true, and if unresolvable you must fall back to full context. And despite being billed as simpler, the Items-based response format is a different mental model that takes adjustment.

The broader signal is architectural. OpenAI is pushing developers toward a world where the API provider manages state, runs tools, and handles context — and developers focus on defining behavior and building UIs. Whether that trade-off works for your stack depends on how much control you’re willing to delegate.

But for the majority of applications resending full conversation histories and rebuilding tool execution loops from scratch — the Responses API isn’t just an improvement. It’s the API you wished existed three years ago.

Building on the Responses API or migrating from Assistants? I’d love to hear what’s working and what’s breaking.

You’re Paying 10x Too Much for LLM Inference (And Your Provider Already Has the Fix)

The AI Runtime — Wed, 15 Apr 2026 11:03:33 GMT

TL;DR - Prompt caching stores the KV (key-value) computations from transformer attention layers so repeated prompt prefixes skip the expensive prefill step entirely. Every major provider now offers it, but they’ve made fundamentally different design choices: OpenAI caches automatically with zero code changes and now offers up to 90% discounts on newer models. Anthropic gives you explicit control with cache_control breakpoints and a strict hierarchy (tools → system → messages) that rewards careful prompt architecture. Google Gemini offers both implicit (automatic) and explicit caching with the longest TTL options — up to custom durations — plus per-hour storage fees for explicit caches. If you’re running a production AI application and haven’t optimized for cache hits, you’re leaving 50–90% of your inference budget on the table. Start by structuring your prompts with static content first and variable content last, then monitor cached_tokens in your API responses to measure your hit rate.

Why This Matters Right Now

Here’s a number that should make you uncomfortable: in a 100-turn coding session with Claude Opus, you’re sending roughly 10–20 million input tokens. Without caching, that’s $50–100 in input costs alone. With caching, it’s $10–19.

That’s not a hypothetical. The Claude Code team has said publicly that prompt caching is the architectural constraint around which their entire product is built. They declare SEV incidents when cache hit rates drop.

And it’s not just Anthropic. OpenAI’s Prompt Caching 201 cookbook (published February 2026) shows their Realtime API offering a 98.75% discount on cached audio tokens — from $32 per million tokens down to $0.40. Google’s Gemini 2.5 Pro drops cached input from $1.25 to $0.13 per million tokens.

The question isn’t whether to use prompt caching. It’s whether you understand it well enough to actually get the cache hits you’re paying for.

Prompt Caching

What’s Actually Being Cached (It’s Not What You Think)

A common misconception is that prompt caching stores your text and retrieves it later, like a Redis layer for prompts. It doesn’t work that way.

LLM inference has two phases. In the prefill phase, the model processes every input token through its transformer layers, computing key and value projections inside the attention mechanism. These projections — the “KV cache” — capture how each token relates to every other token in the sequence. In the decode phase, the model generates output tokens one at a time, each step referencing the KV cache it built during prefill.

Prompt caching stores those KV projections in GPU memory. When your next request starts with the same prefix, the model skips recomputing those attention layers and jumps straight to processing new tokens. You’re not caching text. You’re caching the result of the most computationally expensive part of inference.

This is why the savings are so dramatic. Prefill is the dominant cost driver — it scales with both sequence length and model size. Skip it, and you cut latency by up to 80% and costs by up to 90%.

It also explains why caching only works on prefixes. The KV cache is sequential. Token 500’s attention values depend on tokens 1–499. You can’t cache the middle of a prompt because the middle depends on everything before it.
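
The prefix constraint can be sketched with plain token lists. This is a toy model, not any provider’s implementation: the token IDs are made up, and shared_prefix_len is a name chosen for this example.

```python
def shared_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


cached = list(range(1000))   # token IDs from an earlier request (illustrative)
edited = cached.copy()
edited[499] = -1             # change a single token in the middle

reusable = shared_prefix_len(cached, edited)
print(reusable)              # 499: everything from the edit onward is recomputed
```

Even though 999 of the 1,000 tokens are unchanged, only the 499 before the edit can be reused, which is why a small change early in a prompt is far more costly than the same change at the end.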

The Three Approaches: A Design Philosophy Comparison

Each major provider has made distinct design choices about caching that reflect deeper philosophies about developer experience versus control.

OpenAI: “It Just Works”

OpenAI’s approach is fully automatic. There’s no flag to set, no API parameter to enable. If your prompt exceeds 1,024 tokens and shares a prefix with a recent request, the system attempts a cache hit behind the scenes.

The mechanism works through routing: OpenAI hashes the first ~256 tokens of your prompt and routes the request to a machine that recently processed a matching prefix. If that machine still has the KV cache in memory, you get a hit. Cache matches happen in 128-token increments — so if you change one token at position 2,048 in a 10,000-token prompt, you still get a cache hit on the first 2,048 tokens.
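
A simplified way to reason about those increments: take the number of leading tokens a request shares with a cached one and round down to a multiple of 128. The sketch assumes the 1,024-token minimum also applies to the shared prefix and ignores routing entirely, so treat its output as an upper bound rather than a prediction.

```python
MIN_CACHEABLE = 1024   # minimum length for caching, per the text above
INCREMENT = 128        # cache matches advance in 128-token steps


def estimated_cached_tokens(shared_prefix: int, prompt_len: int) -> int:
    """Upper bound on cached tokens for a prompt that shares
    `shared_prefix` leading tokens with a previously cached request."""
    if prompt_len < MIN_CACHEABLE or shared_prefix < MIN_CACHEABLE:
        return 0
    return (shared_prefix // INCREMENT) * INCREMENT


print(estimated_cached_tokens(2048, 10_000))   # 2048
print(estimated_cached_tokens(2100, 10_000))   # 2048 (rounded down)
print(estimated_cached_tokens(900, 10_000))    # 0 (below the minimum)
```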

What’s unique about OpenAI’s approach:

Zero code changes required. You monitor cache performance by checking usage.prompt_tokens_details.cached_tokens in the response — but you don’t need to do anything to enable it.
prompt_cache_key parameter. This is OpenAI’s concession to developers who want more control. By setting a consistent key across related requests, you improve the odds that they route to the same machine. Useful when many requests share a common long prefix.
Extended retention. Beyond the default 5–10 minute in-memory cache, OpenAI offers extended retention (up to 24 hours) via the prompt_cache_retention parameter. Same pricing either way.
Flex Processing. For latency-insensitive workloads, service_tier="flex" gives you the same 50% Batch API discount but runs through the standard API, where you can tune cache locality more precisely. OpenAI’s own testing showed an 8.5% higher cache hit rate with Flex + extended caching versus Batch.

The trade-off: You have less deterministic control. Cache hits depend on routing, which depends on server-side decisions. You can influence routing with prompt_cache_key, but you can’t guarantee hits the way you can with Anthropic’s explicit breakpoints.

Anthropic: “You Decide What Gets Cached”

Anthropic takes the opposite approach. You explicitly mark what should be cached using cache_control parameters on individual content blocks. This gives you deterministic control — when you mark a block, Anthropic stores its KV projections and serves cache hits 100% of the time on matching prefixes (within the TTL window).

The key architectural detail is Anthropic’s strict processing hierarchy: Tools → System Message → Messages. Caching is cumulative along this chain, and changes at any level invalidate that level and everything below it. Change a tool definition? Your system prompt cache breaks too. Change the system prompt? Your conversation history cache breaks.

What’s unique about Anthropic’s approach:

Explicit breakpoints. Place cache_control: {"type": "ephemeral"} on up to 4 content blocks. The cache stores everything from the beginning of the prompt up to that breakpoint.
Automatic caching mode. Anthropic now also offers a simpler path: add a single cache_control at the top level of your request, and the system automatically applies the breakpoint to the last cacheable block and moves it forward as conversations grow.
Cache write surcharge. Unlike OpenAI (no extra fee for cache writes), Anthropic charges 1.25x the base input price for 5-minute cache writes and 2x for 1-hour cache writes. Cache reads are 0.1x, so a 5-minute write pays for itself on the first cache read (1.25x + 0.1x versus 2x uncached), and a 1-hour write pays for itself by the second read.
Model-specific minimum thresholds. Claude Sonnet and Opus require at least 1,024 tokens to trigger caching. Claude Haiku 4.5 requires 4,096 tokens. Below these thresholds, your cache_control annotation is silently ignored.
Extended TTL option. Beyond the default 5-minute window, you can set "ttl": "1h" for a 1-hour cache at the 2x write premium.

The trade-off: More setup work, more things that can silently break (JSON key ordering in tool definitions, subtle changes in system prompts), but also more predictable behavior. When you ask for a cache, you get a cache.
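
A minimal sketch of what an explicit breakpoint looks like in a request body, built as a plain dictionary so it runs offline. The helper cached_text_block, the model string, and the document text are placeholders for this example. The cache_control shape follows the description above.

```python
from typing import Optional


def cached_text_block(text: str, ttl: Optional[str] = None) -> dict:
    """Return a text content block marked as a cache breakpoint."""
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl   # e.g. "1h" for the extended window
    return {"type": "text", "text": text, "cache_control": cache_control}


request = {
    "model": "your-claude-model",    # placeholder model ID
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a contract analyst."},
        # Everything from the start of the prompt up to this block is cached.
        cached_text_block("<long reference document>", ttl="1h"),
    ],
    # The variable part stays after the breakpoint.
    "messages": [{"role": "user", "content": "Summarize clause 4."}],
}
```

Keeping the user message after the breakpoint means each new question reuses the cached system prefix instead of invalidating it.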

Pricing multipliers (all models):

Operation               Multiplier vs. Base Input
─────────────────────────────────────────────────
Cache write (5-min)     1.25x
Cache write (1-hour)    2x
Cache read              0.1x
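
To see how these multipliers play out, the sketch below compares sending the same prefix once plus some number of reuses, with and without caching. Costs are in units of the base input price for that prefix. The function names are invented for this example, and the model ignores TTL expiry.

```python
WRITE_5M, WRITE_1H, READ = 1.25, 2.0, 0.1   # multipliers listed above


def relative_cost(write_multiplier: float, reuses: int) -> tuple[float, float]:
    """Return (cached, uncached) cost of one prefix sent 1 + reuses times."""
    cached = write_multiplier + reuses * READ
    uncached = 1.0 * (1 + reuses)
    return cached, uncached


def break_even_reuses(write_multiplier: float) -> int:
    """Smallest number of reuses at which caching is no more expensive."""
    reuses = 0
    while True:
        cached, uncached = relative_cost(write_multiplier, reuses)
        if cached <= uncached:
            return reuses
        reuses += 1


print(break_even_reuses(WRITE_5M))   # 1
print(break_even_reuses(WRITE_1H))   # 2
```

Under these multipliers a 5-minute write is recovered by a single reuse, while a 1-hour write needs two. A prefix that is never reused within the TTL is the only case where caching costs more.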

Google Gemini: “Choose Your Adventure”

Google offers both implicit and explicit caching — and they work differently enough that you need to understand both.

Implicit caching is automatic (enabled by default on Gemini 2.5 and newer). Like OpenAI, it detects repeated prefixes and applies discounts opportunistically. Unlike explicit caching, there’s no storage fee and no guarantee of savings: you get discounts only when the system determines a cache hit occurred.

Explicit caching is a managed resource. You create a cache object via the API, assign it a TTL (default 60 minutes, customizable), and reference it by resource name in subsequent requests. This guarantees discounts but introduces storage costs — typically $1.00 per million tokens per hour, depending on the model.

What’s unique about Google’s approach:

Longest TTL flexibility. Explicit caches can be set to custom durations with configurable ttl or expire_time. No other provider offers this level of TTL control.
Storage fees for explicit caches. This is the critical differentiator. OpenAI and Anthropic don’t charge for cache storage. Google does — approximately $1.00 per million tokens per hour. This means you need to do break-even math: a 100K-token cache costs about $0.10/hour. If cached reads save you $0.10+ per hour in input token discounts, you’re ahead.
Multimodal caching. Gemini caches text, images, audio, and video — and each modality has different pricing for cached reads.
Cache lifecycle management. You can update TTLs, list caches, and delete them explicitly — a level of cache management that neither OpenAI nor Anthropic provides.
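
The break-even math for explicit cache storage can be sketched as follows. The storage rate is the approximate figure quoted above. The example call uses the Gemini 2.5 Pro input prices cited earlier in this article ($1.25 base, $0.13 cached), and the sketch ignores the one-time cost of creating the cache. Function names are illustrative.

```python
STORAGE_PER_M_TOKEN_HOUR = 1.00   # USD, approximate figure quoted above


def hourly_storage_cost(cached_tokens: int) -> float:
    """USD per hour to keep the explicit cache alive."""
    return cached_tokens / 1_000_000 * STORAGE_PER_M_TOKEN_HOUR


def savings_per_request(cached_tokens: int, base_price: float, cached_price: float) -> float:
    """USD saved each time the cached tokens are read instead of reprocessed.
    Prices are USD per million input tokens."""
    return cached_tokens / 1_000_000 * (base_price - cached_price)


def break_even_requests_per_hour(cached_tokens: int, base_price: float,
                                 cached_price: float) -> float:
    """Requests per hour at which read savings equal the storage fee."""
    return hourly_storage_cost(cached_tokens) / savings_per_request(
        cached_tokens, base_price, cached_price
    )


rate = break_even_requests_per_hour(100_000, base_price=1.25, cached_price=0.13)
print(f"{rate:.2f} requests/hour")   # 0.89 requests/hour
```

At those prices, a 100K-token cache pays for its storage with about one request per hour. A cache that sits idle overnight is pure cost.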

Pricing multipliers (Gemini 2.5 Flash example):

The Comparison Matrix That Actually Matters

Comparison Matrix

The Five Use Cases Where Caching Transforms Economics

1. Multi-turn chatbots and agents. Every turn resends the full conversation history. Without caching, the input for turn 50 costs roughly 50x the input for turn 1 (assuming turns of similar size). With caching, turns 2–50 pay full price only for the new message; everything before it is a cache hit billed at the discounted rate.

2. Document Q&A. Embed a 100K-token document in the system prompt and let users ask questions. Without caching, each question reprocesses the entire document. With caching, the document is processed once and subsequent queries against it cost 90% less.

3. Few-shot and many-shot prompting. High-quality few-shot examples can be 10K+ tokens. Caching lets you include 50–100 examples without paying full price on every call.

4. Agentic tool use. Agents make multiple tool calls per task, each requiring a new API request with the full context. Tool definitions and system instructions remain stable across calls — perfect cache candidates.

5. Code assistants. The canonical case. Claude Code’s system prompt alone is ~4,000 tokens. Add tool definitions, CLAUDE.md files, and conversation history, and you’re sending 100K+ tokens per turn. Caching keeps this economically viable.
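
The multi-turn case is the easiest to put numbers on. The sketch below totals input cost for a session in which every turn resends all prior turns. The turn size, price, and 0.1x read multiplier are illustrative assumptions, and cache write surcharges are left out for simplicity.

```python
from typing import Optional


def session_input_cost(turns: int, tokens_per_turn: int, price_per_m: float,
                       read_multiplier: Optional[float] = None) -> float:
    """Total input cost (USD) when each turn resends the whole history.

    If read_multiplier is given, previously sent tokens are billed at that
    fraction of the base price; otherwise they are billed in full.
    """
    old_rate = 1.0 if read_multiplier is None else read_multiplier
    total = 0.0
    for turn in range(1, turns + 1):
        old_tokens = (turn - 1) * tokens_per_turn   # history resent this turn
        new_tokens = tokens_per_turn                # the new message
        total += (old_tokens * old_rate + new_tokens) * price_per_m / 1_000_000
    return total


uncached = session_input_cost(50, 1_000, price_per_m=3.0)
cached = session_input_cost(50, 1_000, price_per_m=3.0, read_multiplier=0.1)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Because resent history grows with every turn, uncached cost grows quadratically with session length. Caching turns most of that into discounted reads.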

What Breaks Your Cache (And How to Prevent It)

The most expensive bug in production AI isn’t a wrong answer — it’s a silently broken cache. Here’s what invalidates caches across providers:

Universal cache killers:

Changing any token in the cached prefix (even a single character)
Reordering JSON keys in tool definitions (watch out for languages like Go and Swift, where map or dictionary iteration order is not stable)
Adding timestamps or per-request IDs to system prompts
Switching models mid-session

Anthropic-specific:

Changing tool_choice parameter
Adding or removing images anywhere in the prompt
Enabling/disabling extended thinking or changing the thinking budget (invalidates message-level cache, but system and tool caches survive)
Exceeding 20 content blocks without additional cache_control markers

OpenAI-specific:

High request volume on the same prefix (>15 RPM per prompt_cache_key) causing overflow to additional machines
The routing hash only considers ~256 tokens — so two prompts that differ only after token 256 might route to different machines

Google-specific:

Explicit caches can expire if TTL isn’t updated
Referencing a deleted or expired cache object causes request failure (implement retry logic that recreates the cache)
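
Two of the universal cache killers, key reordering and silent prefix drift, can be guarded against with a few lines. This is one possible approach, not a provider requirement: canonical_json and prefix_fingerprint are names made up for this sketch, and the tool definition is a toy example.

```python
import hashlib
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so output is byte-stable."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def prefix_fingerprint(tools: list, system_prompt: str) -> str:
    """Hash of the stable prefix; log it per request and alert when it changes."""
    payload = canonical_json(tools) + "\n" + system_prompt
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same tool, keys inserted in a different order.
tool_a = {"name": "lookup_order", "description": "Fetch an order", "parameters": {"type": "object"}}
tool_b = {"parameters": {"type": "object"}, "description": "Fetch an order", "name": "lookup_order"}

print(canonical_json(tool_a) == canonical_json(tool_b))   # True
print(prefix_fingerprint([tool_a], "v1") == prefix_fingerprint([tool_b], "v1"))   # True
```

Logging the fingerprint alongside the cache metrics makes it easy to tell whether a drop in hit rate came from a prompt change or from something on the provider side.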

Practical Prompt Architecture for Maximum Cache Hits

The universal rule across all providers: static content first, variable content last.

Think of your prompt as having concentric layers of stability:

Most Stable (cache these)
├── Tool definitions
├── System instructions
├── Reference documents / few-shot examples
├── Conversation history (grows but prefix stays stable)
└── Current user message
Most Variable (don't try to cache this)

For Anthropic, place your first cache_control breakpoint after your system instructions and a second after your reference documents. Use automatic caching mode for the conversation history — it moves the breakpoint forward as the conversation grows.

For OpenAI, structure is the only lever you have (plus prompt_cache_key). Put your most stable, longest content at the very beginning. Don’t embed per-request metadata in your system prompt.

For Google, create an explicit cache for your reference documents and set an appropriate TTL. Use implicit caching for everything else.
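
The layering can be sketched as a function that always concatenates in the same order. It works on characters rather than tokens and uses invented names, so it only illustrates the ordering principle: when stable layers come first, each turn’s prompt is a prefix of the next one.

```python
def assemble_prompt(tools_text: str, system_text: str, documents: list[str],
                    history: list[str], user_message: str) -> tuple[str, int]:
    """Concatenate layers from most to least stable.

    Returns the prompt and the character offset where the volatile part begins.
    """
    stable = "\n\n".join([tools_text, system_text, *documents])
    volatile = "\n\n".join([*history, user_message])
    return stable + "\n\n" + volatile, len(stable)


turn_1, offset_1 = assemble_prompt("TOOLS", "SYSTEM", ["DOC"], [], "q1")
turn_2, offset_2 = assemble_prompt("TOOLS", "SYSTEM", ["DOC"], ["q1", "a1"], "q2")

print(turn_2.startswith(turn_1))   # True: turn 1 is a reusable prefix of turn 2
print(offset_1 == offset_2)        # True: the stable layers did not move
```

Putting a timestamp or request ID at the top of the system text would break the startswith relationship on every request.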

The Decision Framework: Which Provider’s Caching Fits Your Use Case?

Choose OpenAI’s caching when you want zero implementation effort, you’re running standard chat or completion workloads, and you value simplicity over control. The newer GPT-5 family’s 90% discounts make this increasingly attractive.

Choose Anthropic’s caching when you need guaranteed cache hits, you’re building long-context applications (document analysis, code assistants), and you’re willing to invest in prompt architecture. The explicit control means you can debug and optimize with certainty.

Choose Google’s caching when you’re working with multimodal content (especially video and audio), you need long cache durations, or you’re already in the Google Cloud ecosystem. Be aware of storage fees — do the break-even math.

Monitoring: The Metric That Tells You If You’re Doing It Right

Regardless of provider, there’s one metric you should track: cache hit rate, defined as cached tokens divided by total input tokens.

For OpenAI, check usage.prompt_tokens_details.cached_tokens in every response. For Anthropic, divide cache_read_input_tokens by the sum of cache_read_input_tokens, cache_creation_input_tokens, and input_tokens. For Google, look at cachedContentTokenCount in the response metadata.

A healthy production system should see 70%+ cache hit rates after the first few requests in a session. Claude Code reports 95%+ in sustained coding sessions. If you’re below 50%, something is breaking your cache — review the invalidation checklist above.
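
A sketch of the hit-rate calculation for each provider, operating on usage data that has already been converted to plain dictionaries. The field names are the ones mentioned above plus the standard total-token fields. The sample numbers are invented, and the sketch assumes the OpenAI and Gemini prompt totals include the cached tokens while Anthropic’s input_tokens does not.

```python
def _ratio(cached: int, total: int) -> float:
    return cached / total if total else 0.0


def openai_hit_rate(usage: dict) -> float:
    """prompt_tokens is assumed to include the cached tokens."""
    return _ratio(usage["prompt_tokens_details"]["cached_tokens"], usage["prompt_tokens"])


def anthropic_hit_rate(usage: dict) -> float:
    """input_tokens is assumed to exclude cache reads and writes, so sum all three."""
    read = usage["cache_read_input_tokens"]
    total = read + usage["cache_creation_input_tokens"] + usage["input_tokens"]
    return _ratio(read, total)


def gemini_hit_rate(usage_metadata: dict) -> float:
    """promptTokenCount is assumed to include the cached content."""
    return _ratio(usage_metadata["cachedContentTokenCount"],
                  usage_metadata["promptTokenCount"])


# Illustrative usage payloads, not real API output.
print(openai_hit_rate({"prompt_tokens": 10_000,
                       "prompt_tokens_details": {"cached_tokens": 8_192}}))   # 0.8192
print(anthropic_hit_rate({"input_tokens": 500, "cache_creation_input_tokens": 500,
                          "cache_read_input_tokens": 9_000}))                 # 0.9
print(gemini_hit_rate({"promptTokenCount": 20_000, "cachedContentTokenCount": 15_000}))  # 0.75
```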

Model Bills Are the New Headcount

The AI Runtime — Mon, 13 Apr 2026 11:03:49 GMT

TL;DR - At a growing number of AI startups, the monthly model inference bill has surpassed individual engineer salaries as the most scrutinized cost on the P&L. This isn’t a temporary artifact of early adoption; it’s the permanent economic structure of AI-native businesses. Yet most teams manage inference costs the way early startups managed cloud bills: reactively, after the damage is done. The emerging discipline of Model Reliability Engineering (MRE) treats model behavior and model cost as two sides of the same operational problem, giving teams a framework to monitor, optimize, and control inference economics alongside output quality. If your model bill is growing faster than your revenue, you don’t have a pricing problem. You have an engineering problem.

The New P&L

In 2024, when founders discussed their burn rate, the conversation was almost entirely about payroll. “We’re a team of twelve, burning $180K per month.” The model API line item — if it existed at all — was a rounding error. A few hundred dollars for prototyping.

In 2026, that conversation has inverted at AI-native companies. A team of four might burn $50K per month on salaries and $25K–$40K per month on inference. The model bill isn’t a rounding error — it’s the second-largest expense after payroll, and at some companies, it’s approaching the first.

This creates a cost structure that’s fundamentally different from traditional software businesses in three ways.

First, the marginal cost of serving a customer is non-trivial. In traditional SaaS, the marginal cost of an additional user is essentially zero — server costs are negligible per user. In AI-native products, every user interaction triggers model inference that costs real money. A complex query might cost $0.05–$0.50 in model calls. At scale, this adds up fast.

Second, costs are partially unpredictable. Traditional infrastructure scales predictably — you know roughly what a new server instance costs. Model costs depend on input complexity, output length, which model handles the request, retry rates, and dozens of other factors that vary by user and use case.

Third, cost and quality are directly coupled. In traditional software, you can usually cut costs without affecting user experience — optimize a query, compress an asset, cache a result. In AI systems, cheaper often means worse. Routing to a smaller model saves money but may degrade output quality. Shorter prompts cost less but may produce less reliable results. Every cost optimization decision is simultaneously a quality decision.

Why Cloud-Era Thinking Doesn’t Work

Most engineering teams default to treating model costs the way they treat cloud infrastructure costs. Set up billing alerts, review the dashboard monthly, optimize the biggest spenders when the bill gets uncomfortable.

This approach fails for AI inference because it addresses the wrong problem. Cloud cost optimization is primarily about resource utilization — right-sizing instances, eliminating waste, reserving capacity. The decisions are mostly independent of the product’s behavior.

Inference cost optimization is inseparable from product behavior. When you change how a model is called — the prompt, the model choice, the context window size — you change both the cost and the output. You can’t optimize one without affecting the other. An engineer who reduces inference costs by 40% but degrades response quality by 20% hasn’t saved money — they’ve broken the product.

This coupling is why inference economics requires its own discipline, not just a tab in your existing monitoring dashboard.

Enter Model Reliability Engineering

Model Reliability Engineering (MRE) is an engineering discipline that owns model behavior reliability in production — and inference economics is one of its core concerns.

MRE sits at the intersection of several existing disciplines. Site Reliability Engineering (SRE) gives it operational rigor — uptime targets, incident response, monitoring. MLOps gives it the deployment and pipeline perspective. AI Safety gives it the behavioral constraint framework. But none of these disciplines adequately cover the specific problem of maintaining reliable model behavior at manageable cost in production systems.

MRE addresses this through a two-layer architecture: Context Engineering (designing and managing what goes into the model) and Harness Engineering (building the infrastructure that wraps, monitors, and controls model interactions). Together, they form a framework for thinking about inference costs as an engineering problem, not a finance problem.

The MRE approach to inference economics centers on five operational concerns:

1. Cost Observability

You can’t optimize what you can’t see. Most teams track their aggregate model bill — total spend per month. That’s like tracking your total cloud bill without knowing which service consumes the most. Useless for optimization.

Effective cost observability means tracking cost per request, segmented by model, feature, user tier, and request complexity. It means knowing that your document summarization feature costs $0.12 per request while your chatbot costs $0.03 per request — and understanding why.

The implementation is straightforward: instrument every model call with metadata (feature name, model used, input tokens, output tokens, latency) and aggregate it in a monitoring system. The hard part is building the organizational habit of reviewing this data with the same rigor you’d review error rates or latency percentiles.
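
A minimal sketch of that instrumentation, using an in-memory list in place of a real monitoring system. The ModelCall record and its field names are assumptions for this example, and the costs echo the illustrative figures above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModelCall:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def cost_per_request_by_feature(calls: list[ModelCall]) -> dict[str, float]:
    """Average cost per request for each feature."""
    totals = defaultdict(lambda: [0.0, 0])   # feature -> [total cost, request count]
    for call in calls:
        totals[call.feature][0] += call.cost_usd
        totals[call.feature][1] += 1
    return {feature: cost / count for feature, (cost, count) in totals.items()}


calls = [
    ModelCall("summarization", "large", 40_000, 800, 5200.0, 0.13),
    ModelCall("summarization", "large", 36_000, 700, 4800.0, 0.11),
    ModelCall("chatbot", "small", 3_000, 200, 900.0, 0.03),
]
print(cost_per_request_by_feature(calls))
```

In production the records would go to whatever metrics store the team already uses. The important part is that every call carries the feature and model labels needed to segment the bill.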

2. Model Routing

Not every task requires the same model. A classification decision — “is this email spam or not?” — can be handled by a small, fast, cheap model. A complex reasoning task — “analyze this legal document and identify liability risks” — requires a frontier model.

Model routing is the practice of sending each request to the most cost-effective model that can handle it at the required quality level. In practice, this means defining quality thresholds for each task type, benchmarking multiple models against those thresholds, building a routing layer that selects the appropriate model per request, and continuously evaluating whether routing decisions are still optimal as models evolve.

Teams that implement routing consistently report 40–60% reductions in inference costs. It’s the single highest-leverage optimization available, and most teams haven’t done it because it requires evaluation infrastructure they don’t have.
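
The core of a routing layer can be sketched in a few lines: pick the cheapest model whose benchmarked quality clears the threshold for the task type. The model names, costs, and quality scores below are hypothetical. In a real system the scores would come from your own evaluation runs.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    name: str
    cost_per_request: float      # USD, from your own billing data
    quality: dict[str, float]    # task type -> benchmark score in [0, 1]


def route(task_type: str, threshold: float, options: list[ModelOption]) -> ModelOption:
    """Pick the cheapest model whose benchmarked quality meets the threshold."""
    eligible = [m for m in options if m.quality.get(task_type, 0.0) >= threshold]
    if not eligible:
        # Nothing qualifies: fall back to the strongest model for this task.
        return max(options, key=lambda m: m.quality.get(task_type, 0.0))
    return min(eligible, key=lambda m: m.cost_per_request)


options = [
    ModelOption("small", 0.01, {"spam": 0.97, "legal": 0.55}),
    ModelOption("mid", 0.05, {"spam": 0.98, "legal": 0.80}),
    ModelOption("frontier", 0.15, {"spam": 0.99, "legal": 0.93}),
]
print(route("spam", 0.95, options).name)    # small
print(route("legal", 0.90, options).name)   # frontier
```

The thresholds and scores need re-benchmarking whenever a model is added or updated, which is the evaluation infrastructure the paragraph above refers to.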

3. Prompt Economics

Prompt length directly affects cost — more input tokens means higher cost per request. But prompt optimization for cost can’t be done in isolation from quality.

The MRE approach treats prompts as economic artifacts. Every prompt has a cost (measured in tokens) and a quality level (measured by evaluation). The goal is to find the minimum-cost prompt that meets the quality threshold — not the cheapest prompt possible, and not the longest prompt that maximizes quality.

This requires evaluation infrastructure: a way to systematically test prompt variations against quality metrics and cost metrics simultaneously. Without evaluation, prompt optimization is guesswork. With evaluation, it’s engineering.

4. Caching and Deduplication

Many production workloads involve repeated or near-identical requests. Semantic caching — returning cached results for requests that are similar enough to previous ones — can significantly reduce inference costs without affecting user experience.

The engineering challenge is defining “similar enough.” Exact-match caching is trivial but catches few cases. Semantic similarity caching (using embedding distance to find near-matches) catches more cases but introduces a quality risk: the cached response might not be appropriate for the new request.

The MRE framework treats caching as a reliability decision, not just a performance optimization. Every cache hit is an assertion that the cached response is good enough for the new request. That assertion needs validation.
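
A toy semantic cache makes the “similar enough” decision explicit. The embeddings here are hand-written two-dimensional vectors standing in for real embedding-model output, and the 0.95 threshold is an arbitrary starting point that would need to be validated against quality metrics.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def put(self, embedding: list[float], response: str) -> None:
        self.entries.append((embedding, response))

    def get(self, embedding: list[float]) -> Optional[str]:
        """Return the closest cached response if it clears the threshold."""
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None   # miss: call the model, then put() the result


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "You can reset your password from the login page.")
print(cache.get([0.99, 0.05]))   # near-duplicate question: cache hit
print(cache.get([0.0, 1.0]))     # unrelated question: None
```

Lowering the threshold raises the hit rate and the risk of serving a mismatched answer, so the threshold is effectively a quality parameter.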

5. Budget Governance

As inference costs become a material portion of company spend, they need governance mechanisms similar to other significant cost centers.

This means per-feature cost budgets (this feature should cost no more than $X per month), cost-per-request limits (if a single request exceeds $Y, flag it for review), trend alerting (if costs are growing faster than usage, investigate), and cost-quality tradeoff documentation (recording why each routing or prompt decision was made).

Budget governance sounds bureaucratic, but without it, inference costs grow unchecked until they trigger a crisis.
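
The first three mechanisms reduce to simple comparisons once the cost data exists. The sketch below is one possible shape for such checks. The flag names and the example limits are invented for illustration.

```python
def governance_flags(feature_spend: float, feature_budget: float,
                     request_cost: float, request_limit: float,
                     cost_growth: float, usage_growth: float) -> list[str]:
    """Return the governance checks that a feature currently fails.

    Growth values are period-over-period fractions (0.30 means +30%).
    """
    flags = []
    if feature_spend > feature_budget:
        flags.append("feature_over_budget")
    if request_cost > request_limit:
        flags.append("request_over_limit")
    if cost_growth > usage_growth:
        flags.append("cost_outpacing_usage")
    return flags


print(governance_flags(
    feature_spend=4_200.0, feature_budget=5_000.0,   # within monthly budget
    request_cost=1.80, request_limit=1.00,           # one request was unusually expensive
    cost_growth=0.30, usage_growth=0.10,             # costs growing faster than usage
))
# ['request_over_limit', 'cost_outpacing_usage']
```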

The Cost-Quality Tradeoff in Practice

Here’s a concrete example of how MRE thinking changes inference economics.

Consider a customer support AI that handles 10,000 requests per day. Without optimization, every request goes to a frontier model with a long system prompt. Cost: roughly $0.15 per request. Monthly bill: $45,000.

An MRE approach would look like this:

Step 1 — Classify requests by complexity. Analysis reveals that 60% of requests are simple FAQ-type questions, 30% are moderately complex, and 10% require deep reasoning.

Step 2 — Build a routing layer. Simple requests go to a small model ($0.01/request). Moderate requests go to a mid-tier model ($0.05/request). Complex requests go to the frontier model ($0.15/request).

Step 3 — Optimize prompts per tier. The simple model gets a short, focused prompt. The mid-tier model gets a moderate prompt with examples. The frontier model gets the full system prompt.

Step 4 — Add semantic caching for the simple tier, where many requests are near-identical.

Result: Simple requests (6,000/day × $0.008 with caching) = $48/day. Moderate requests (3,000/day × $0.05) = $150/day. Complex requests (1,000/day × $0.15) = $150/day. Total: $348/day. Monthly bill: roughly $10,400.

That’s a 77% cost reduction. But it only works because each step was validated against quality metrics. The small model’s responses to simple queries were evaluated and confirmed to meet quality thresholds. The routing classifier was tested for accuracy. The caching system was validated against semantic similarity scores.
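
The arithmetic behind those figures, written out so the inputs can be swapped for your own traffic mix. It uses the per-request costs from the steps above and assumes a 30-day month.

```python
DAYS_PER_MONTH = 30   # assumption used to match the monthly figures in the text

baseline_daily = 10_000 * 0.15   # every request on the frontier model

tiers = [
    # (tier, requests per day, cost per request in USD)
    ("simple", 6_000, 0.008),    # small model with semantic caching
    ("moderate", 3_000, 0.05),
    ("complex", 1_000, 0.15),
]
optimized_daily = sum(requests * cost for _, requests, cost in tiers)

baseline_monthly = baseline_daily * DAYS_PER_MONTH
optimized_monthly = optimized_daily * DAYS_PER_MONTH
reduction = 1 - optimized_monthly / baseline_monthly

print(f"baseline:  ${baseline_monthly:,.0f}/month")    # baseline:  $45,000/month
print(f"optimized: ${optimized_monthly:,.0f}/month")   # optimized: $10,440/month
print(f"reduction: {reduction:.0%}")                   # reduction: 77%
```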

Without evaluation infrastructure, you’re just guessing about where to cut. With it, you’re engineering.

Who Owns This?

At most companies today, nobody owns inference economics. The engineering team builds features. The finance team pays the bills. Nobody connects the two systematically.

MRE argues that inference economics is an engineering responsibility — specifically, it’s the responsibility of whoever owns model behavior in production. The person who decides which model to use, how to prompt it, and how to evaluate the output is also the person best positioned to optimize the cost, because they understand the cost-quality tradeoff for each decision.

This doesn’t mean every engineer needs to become a financial analyst. It means the team responsible for model interactions needs cost visibility, cost targets, and the tools to optimize against them. Just as SRE teams own uptime targets, MRE teams own cost-quality targets.

For teams without dedicated MRE roles (which is most teams right now), the minimum viable version is: instrument every model call, review costs weekly by feature, and set per-feature cost budgets. That alone puts you ahead of 90% of teams managing inference costs today.

The Compounding Problem

Here’s why this matters now and not later: inference costs compound with growth. Unlike traditional infrastructure costs that grow sub-linearly with scale (thanks to efficiency gains), inference costs grow roughly linearly — and sometimes super-linearly when complex features get more usage.

A startup spending $25K/month on inference at 1,000 users will likely spend $250K/month at 10,000 users unless they actively optimize, a $3M annual run rate on inference alone. At 100,000 users, the unoptimized bill would approach $2.5M per month.

Cost Observability with AI

Every month you delay implementing cost observability, routing, and evaluation is a month where cost inefficiencies compound into your growth trajectory. The startups that survive the transition from early traction to real scale will be the ones that treated inference economics as a first-class engineering discipline from the beginning, not the ones that panicked when the bill arrived.

Your ETL Pipeline Won’t Save You. Your AI Data Stack Will.

The AI Runtime — Sun, 12 Apr 2026 11:25:37 GMT

TL;DR: Data engineering isn’t dying — it’s splitting. The BLS projects 36% job growth through 2034, one of the fastest rates in tech. But the work is unrecognizable. AI copilots now generate boilerplate SQL in seconds, anomaly detection tools learn “normal” without hand-written rules, and natural-language interfaces let business users build their own simple pipelines. The data engineers who thrive in 2026 aren’t the ones writing more dbt models — they’re the ones designing the data infrastructure that makes AI systems actually work. In my last article, I introduced the concept of an AIfolio — a portfolio built around AI-native projects that prove you can architect AI systems, not just code. That article was aimed at developers broadly. This one is for data engineers specifically, because your version of an AIfolio looks fundamentally different — and your existing skills give you an unfair advantage in building it. The old resume line was “built ETL pipeline processing 10M rows/day.” The new one is “built the data infrastructure that reduced our LLM hallucination rate from 23% to 4%.” Here are the five pillars of a data engineer’s AIfolio, the exact tools to build them with, and the presentation layer that makes hiring managers say yes.

The Tectonic Shift Nobody Warned You About

Here’s the thing about data engineering in 2026: the profession is simultaneously booming and being hollowed out from the inside.

AI-Native Data Engineer

The demand numbers look fantastic on the surface. The O’Reilly 2025 Tech Trends Report showed data engineering skills grew 29% year-over-year. The BLS projects 36% growth through 2034. Median salaries sit comfortably between $120K and $200K. By every macro measure, data engineering is thriving.

But zoom into what data engineers are actually doing day-to-day, and the picture shifts dramatically. Snowflake launched Cortex Code in February 2026 — a CLI that generates dbt models from natural language, reads your actual schema (no hallucinated table names), and supports Claude Opus 4.6 and GPT-5.2 as underlying models. Describe what you want in plain English, and it writes the SQL, the schema YAML, and the tests. Databricks has Agent Bricks running at 250K+ queries per second for structured extraction and text transformation. GitHub Copilot, at $19-$39 per seat per month, is already standard on most data teams.

The result? A study examining 285,000 companies found that hiring for senior positions is still increasing while hiring for junior positions is decreasing. The pattern is identical to what happened in software engineering — AI doesn’t replace the experienced architect, it eliminates the apprenticeship that creates experienced architects.

If you’re a data engineer whose primary value is “I write SQL and Python to move data from point A to point B,” you’re in the blast radius. If your value is “I design the data systems that make AI applications reliable, governable, and cost-effective,” you’re in the most in-demand job market in a decade.

The question is: which one are you building toward?

The Data Engineer’s Role Has Inverted

Think about how a hospital pharmacy works. A decade ago, pharmacists spent most of their time physically counting pills and putting them in bottles — the mechanical act of fulfillment. Today, automated dispensing machines handle that. Pharmacists didn’t disappear. They moved up the stack — clinical consultations, drug interaction analysis, treatment optimization. The mechanical work was automated; the judgment work became more valuable.

Data engineering is undergoing the exact same inversion.

The old job: Write ingestion scripts. Build transformation logic. Schedule pipelines. Monitor for failures. Debug broken DAGs at 2 AM.

The new job: Design the data architecture that powers AI applications. Build embedding pipelines for RAG systems. Implement data quality frameworks that prevent AI models from making dangerous decisions on bad data. Create semantic layers that let LLMs understand organizational knowledge. Govern the data estate so AI adoption doesn’t create compliance nightmares.

Erik Duffield, co-founder of data platform company Ascend, captured it precisely: we’ve moved from a world where 80% of data is served to human analysts through traditional BI tools to one where machines are the primary data consumers. When your main customer was a human looking at a dashboard, “good enough” data quality was often fine. When your main customer is an LLM making autonomous decisions, “good enough” can be catastrophic.

This inversion creates a massive opportunity for data engineers who see it coming — because you already have the foundational skills (SQL, Python, cloud infrastructure, orchestration) that AI engineers typically lack. You understand data modeling, schema design, governance, and operational reliability. The gap isn’t in your foundations. It’s in your AI application layer.

Here’s how to close it.

Why Data Engineers Need a Different AIfolio

In the AIfolio article, I laid out four pillars for a developer’s AI portfolio: RAG pipelines, multi-agent systems, MCP integrations, and persistent memory. Those pillars are calibrated for software engineers crossing into AI.

Data engineers need a different set of pillars. Not because the AIfolio framework is wrong — but because your superpower is different.

An AI engineer’s AIfolio proves: “I can architect systems that think.”

A data engineer’s AIfolio proves: “I can build the data infrastructure that makes those thinking systems reliable, accurate, and governable.”

Most AI engineers build impressive demos on toy datasets, then watch them crumble when fed real-world data at scale. They don’t know how to handle schema evolution, data contracts, incremental processing, or data quality monitoring. They’ve never debugged a pipeline that silently dropped 12% of records at 3 AM.

You have. That’s your edge.

A data engineer’s AIfolio doesn’t replace the four original pillars — it complements them. Where the AI engineer builds the RAG application, you build the pipeline that keeps its knowledge base fresh, accurate, and governed. Where the AI engineer designs the agent workflow, you build the feature store and embedding infrastructure that powers it. Where the AI engineer wires up MCP, you build the semantic layer it queries.

The combination is absurdly valuable — and almost nobody has both sides. Here are the five pillars of a data engineer’s AIfolio.

The Five Pillars of a Data Engineer’s AIfolio

Pillar 1: A RAG-Ready Data Pipeline (Your Foundation Project)

Every AI application needs data, and most AI engineers are terrible at data engineering. This is your superpower — if you know how to wield it.

A RAG-ready data pipeline doesn’t just move data. It ingests unstructured documents (PDFs, Confluence pages, Slack threads, API responses), parses them intelligently, chunks them with semantic awareness, generates embeddings, and loads them into a vector store — all with the orchestration, monitoring, and data quality checks you’d apply to any production pipeline.

This is where your existing skills translate directly. You already know how to build reliable ingestion pipelines. You already understand idempotency, backfills, and incremental processing. You just need to add the AI-specific layers: document parsing, chunking strategy, embedding generation, and vector database management.

What this proves to a hiring manager: You understand that RAG systems live or die based on data quality — not model quality. A brilliant LLM with a poorly chunked knowledge base will hallucinate. A mediocre LLM with a well-engineered data pipeline will be reliable. You’re the person who builds the reliable version.

The tech stack:

For orchestration, use what you know — Airflow, Prefect, or Dagster. The pipeline structure is familiar: extract documents from source systems, transform them through parsing and chunking stages, load embeddings into a vector store. The DAG looks like any ELT pipeline; the transformations are just different.

For document parsing, LlamaParse handles PDFs with tables, nested headers, and images. For simpler documents, LangChain’s document loaders cover most formats.

For chunking, start with RecursiveCharacterTextSplitter (predictable, tunable) and graduate to semantic chunking when you’re ready. Chunk size matters enormously — too large and you dilute relevance, too small and you lose context. Production systems in 2026 typically use 200-1,000 token windows with 10-20% overlap.
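To make the window-and-overlap idea concrete, here is a minimal sketch of a fixed-size chunker. It is not RecursiveCharacterTextSplitter or any library's implementation; the function name `chunk_tokens` and the defaults (500 tokens, 15% overlap) are assumptions picked from inside the ranges above, and the token list stands in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size=500, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens; a real pipeline would use its embedding model's tokenizer.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=500, overlap_ratio=0.15)
```

With these settings each window advances 425 tokens, so neighbouring chunks share 75 tokens of context. Treat both numbers as knobs to tune against measured retrieval quality.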

For vector databases, Postgres with pgvector is the secret weapon for data engineers. You already know Postgres. pgvectorscale benchmarks show strong throughput even at 50M vectors. For dedicated vector stores, start with Chroma (zero-config, embedded) and graduate to Qdrant (production-grade, Rust-based) or Pinecone (fully managed).

For embedding models, use OpenAI’s text-embedding-3-small for prototypes. For production, consider open-source models from Hugging Face that you can self-host — eliminating per-token costs entirely.

The repos to study:

NirDiamant/RAG_Techniques (~26K stars) — 30+ advanced RAG implementations. Start here to understand the patterns before building your own pipeline around them.
infiniflow/ragflow (~73K stars) — Production-grade RAG engine with deep document understanding. Study this to understand what “production RAG” looks like from a data engineering perspective.
HKUDS/LightRAG (~30K stars) — Graph-based RAG that builds knowledge graphs from documents. Building a LightRAG pipeline over a real corpus is the kind of project that makes data engineering and AI engineering teams lean forward.

The AIfolio differentiator: This is where your version diverges from the standard AIfolio. Don’t just build a RAG pipeline. Add the data engineering discipline that most AI engineers skip: data quality checks on your chunks (are they coherent? do they preserve table structure?), monitoring on embedding drift, automated re-indexing when source documents change, and lineage tracking from source document to vector store to LLM response. An AI engineer’s RAG demo says “look, it answers questions!” Your RAG pipeline says “look, it answers questions correctly, reliably, with auditability from source to response.” That’s the difference.
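One way to approach automated re-indexing is plain content hashing: store a hash per document at indexing time, then compare on the next run. The sketch below assumes documents arrive as a dict of id to text. The names `content_hash` and `plan_reindex` are illustrative, not from any library.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents, indexed_hashes):
    """Decide what to (re)embed and what to remove from the vector store.

    documents: {doc_id: current text}
    indexed_hashes: {doc_id: hash recorded at the last indexing run}
    """
    to_embed = sorted(
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(documents))
    return to_embed, to_delete


documents = {"a": "unchanged", "b": "edited text", "c": "brand new"}
indexed_hashes = {
    "a": content_hash("unchanged"),
    "b": content_hash("old text"),
    "d": content_hash("deleted upstream"),
}
to_embed, to_delete = plan_reindex(documents, indexed_hashes)
```

Storing the hash next to each vector also gives you a cheap lineage hook: every embedding can be traced back to the exact source content it was built from.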

Pillar 2: AI-Powered Data Quality Monitoring (Your Competitive Advantage)

This is the pillar that screams “I’m a data engineer who understands AI” rather than “I’m a data engineer who’s trying to become an AI engineer.” It plays directly to your strengths.

Traditional data quality monitoring requires writing explicit rules for every check: this column should never be null, this value should be between X and Y, this count should match within 5% of yesterday’s. It’s exhausting, brittle, and never comprehensive enough.

AI-powered data quality flips the script. Instead of writing rules, you train anomaly detection models that learn what “normal” looks like for each dataset and alert only on meaningful deviations. The system notices when weekend sales patterns suddenly match weekdays, when a typically stable metric shows unusual variance, or when subtle correlations between datasets shift — things hand-written rules would never catch.

What this proves to a hiring manager: You understand the production reality that most AI projects ignore — that AI systems are only as good as the data feeding them. You can build the monitoring layer that prevents garbage-in-garbage-out at scale.

The tech stack:

For anomaly detection, start with statistical methods (z-scores, interquartile range) on your most critical tables, then graduate to ML-based detection using isolation forests or autoencoders. Great Expectations gives you the rule-based foundation; layer learned anomaly detection on top.
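As a starting point for the statistical tier, here is a small sketch that scores a new observation against a historical baseline using both a z-score and IQR fences. It uses only the standard library. The function names, the threshold of 3.0, and the row counts are illustrative assumptions.

```python
import statistics


def zscore(value, history):
    """How many sample standard deviations the value sits from the mean."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return 0.0 if stdev == 0 else (value - mean) / stdev


def iqr_bounds(history, k=1.5):
    """Lower and upper fences based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def is_anomalous(value, history, z_threshold=3.0):
    low, high = iqr_bounds(history)
    return abs(zscore(value, history)) > z_threshold or not (low <= value <= high)


# Made-up daily row counts for one table.
history = [1000, 1020, 990, 1010, 1005, 995, 1015]
```

Scoring against a baseline, rather than mixing the suspect value into the statistics, matters on short histories: a single extreme point inflates the standard deviation enough to hide itself.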

For metadata management, look at open-source data catalogs like DataHub or OpenMetadata. These tools track lineage, auto-generate documentation, and increasingly integrate AI for data discovery.

For observability, Monte Carlo is the industry leader (integrates with Snowflake, Databricks, dbt, and Airflow), but building your own lightweight version is the AIfolio project. The goal is a system that monitors freshness, volume, schema changes, and distribution shifts — and distinguishes between acceptable variations and genuine problems.
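A lightweight version of the freshness, volume, and schema checks might look like the sketch below. The snapshot dictionary shape, the issue labels, and `check_table` itself are assumptions for illustration, not Monte Carlo's or any other tool's API.

```python
from datetime import datetime, timedelta, timezone


def check_table(snapshot, expected_columns, max_age, min_rows, now=None):
    """Return a list of issue strings for one table snapshot (empty means healthy)."""
    now = now or datetime.now(timezone.utc)
    issues = []

    # Freshness: how long since the last successful load?
    if now - snapshot["loaded_at"] > max_age:
        issues.append("stale")

    # Volume: did the load deliver at least the expected number of rows?
    if snapshot["row_count"] < min_rows:
        issues.append("low_volume")

    # Schema: columns that disappeared or changed type.
    actual = snapshot["columns"]
    for name, dtype in expected_columns.items():
        if name not in actual:
            issues.append(f"missing_column:{name}")
        elif actual[name] != dtype:
            issues.append(f"type_changed:{name}")
    return issues


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
snapshot = {
    "loaded_at": now - timedelta(hours=30),
    "row_count": 120,
    "columns": {"trip_id": "int", "fare": "text"},
}
expected = {"trip_id": "int", "fare": "float", "pickup_at": "timestamp"}
issues = check_table(snapshot, expected, max_age=timedelta(hours=24), min_rows=1000, now=now)
```

Distribution shift is the fourth signal and is better handled by the anomaly detection layer; these three checks are the cheap ones you can run on every load.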

The AIfolio differentiator: Build a pipeline that ingests real data (public datasets work — NYC taxi data, weather data, stock prices), monitors it continuously for quality issues, and automatically alerts when anomalies occur. Add a dashboard showing historical data quality scores, detected anomalies, and resolution status. Then — here’s the move that elevates this from “project” to “AIfolio pillar” — intentionally inject data quality issues and show that your system catches them before they corrupt downstream AI models. Deploy it with a live link a recruiter can interact with. This is the kind of project you can only build if you understand both data engineering and AI failure modes.

Pillar 3: A Semantic Layer with MCP Integration (The Architecture Pillar)

This is the pillar nobody else is building yet — and it’s the one that will define data engineering’s next chapter. It also directly extends the MCP pillar from the original AIfolio framework, but from the data infrastructure side.

The problem: every company deploying LLMs needs those models to understand organizational data. But LLMs can’t query your data warehouse directly. They don’t know your business logic, your metric definitions, or which tables to join. Natural-language-to-SQL translation is better than it was, but it’s still unreliable for complex queries.

A semantic layer solves this by creating a structured, governed interface between LLMs and your data. It defines metrics, dimensions, and relationships in a way that both humans and machines can understand. Think of it as the “API” for your data — instead of letting AI tools write arbitrary SQL against raw tables, they query through a semantic layer that enforces business logic and access controls.

What this proves to a hiring manager: You think at the system design level. You understand that AI applications need governed, structured access to data — not just raw table scans.

The tech stack:

For the semantic layer itself, dbt’s semantic layer (via MetricFlow) is the production standard — it defines metrics as code that can be version-controlled, tested, and governed. Cube is another option that adds a caching and API layer.

For the LLM integration, build an MCP server (Model Context Protocol) that exposes your semantic layer to AI assistants. This means Claude, Copilot, or any MCP-compatible AI can query your organizational data through a governed interface — asking questions in natural language that get translated to semantically correct queries.

The repos to study:

modelcontextprotocol/python-sdk (~22K stars) — The official Python SDK for building MCP servers. FastMCP lets you build a working server in under 20 lines of code.
modelcontextprotocol/servers (~76K stars) — Reference implementations. Study the database server examples.

The AIfolio differentiator: Build an MCP server that wraps a dbt semantic layer. An AI assistant asks “What was our revenue last quarter by region?” and your server translates that through the semantic layer into a governed, correct query — with access controls, audit logging, and metric definitions enforced automatically. Document the governance model alongside the technical architecture. This single project sits at the intersection of data engineering, AI infrastructure, and data governance — exactly where the profession is heading. In the original AIfolio, MCP was about connecting AI to tools. In a data engineer’s AIfolio, MCP is about connecting AI to your organization’s data — safely.
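To show the shape of "governed access" without tying it to a specific product, here is a toy sketch of the core check a semantic-layer tool would perform before any SQL runs. Everything in it is hypothetical: the `METRICS` registry, the role names, and `compile_metric_query` are stand-ins, not the dbt Semantic Layer, MetricFlow, or MCP SDK APIs. In a real project this function would sit behind an MCP tool and delegate SQL generation to the semantic layer.

```python
# Hypothetical metric registry; in practice these definitions would live
# in your semantic layer, not in a Python dict.
METRICS = {
    "revenue": {
        "expression": "SUM(amount)",
        "table": "fct_orders",
        "dimensions": {"region", "quarter"},
        "allowed_roles": {"finance", "exec"},
    },
}

AUDIT_LOG = []


def compile_metric_query(metric, group_by, role):
    """Turn a governed metric request into SQL, or refuse it."""
    spec = METRICS.get(metric)
    if spec is None:
        raise KeyError(f"unknown metric: {metric}")
    if role not in spec["allowed_roles"]:
        AUDIT_LOG.append((role, metric, "denied"))
        raise PermissionError(f"{role} may not query {metric}")
    unknown = [d for d in group_by if d not in spec["dimensions"]]
    if unknown:
        raise ValueError(f"unsupported dimensions: {unknown}")
    select = ", ".join([*group_by, f"{spec['expression']} AS {metric}"])
    sql = f"SELECT {select} FROM {spec['table']}"
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    AUDIT_LOG.append((role, metric, "allowed"))
    return sql


sql = compile_metric_query("revenue", ["region"], role="finance")
```

The design point is that the assistant never writes SQL. It can only name a metric and dimensions the registry already knows, so business logic and access rules are enforced in one place and every request leaves an audit entry.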

Pillar 4: A Feature Store and Real-Time Embedding Pipeline (The ML Infrastructure Pillar)

Every company building recommendation engines, fraud detection, or personalization needs a feature store. Every company deploying LLMs needs an embedding pipeline. These are data engineering problems wearing AI costumes — and they’re the infrastructure that AI engineers assume “someone else” builds.

A feature store ensures consistent feature computation across training and serving — preventing the dreaded “training-serving skew” where your model was trained on features calculated one way but serves predictions using features calculated slightly differently. An embedding pipeline continuously generates and updates vector representations of your data as it changes.
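The simplest defense against training-serving skew is to have one feature definition that both paths call. The sketch below shows that idea in plain Python. It is not Feast's API, and the feature, function names, and dates are made up for illustration.

```python
from datetime import date


def days_since_last_order(last_order_date, as_of):
    """Single source of truth for this feature, shared by both paths."""
    return (as_of - last_order_date).days


def build_training_row(customer, label_date):
    # Offline path: compute the feature as of the historical label date.
    return {"days_since_last_order": days_since_last_order(customer["last_order"], label_date)}


def serve_features(customer, today):
    # Online path: the same function, evaluated at request time.
    return {"days_since_last_order": days_since_last_order(customer["last_order"], today)}


customer = {"id": 42, "last_order": date(2026, 1, 10)}
offline = build_training_row(customer, label_date=date(2026, 1, 31))
online = serve_features(customer, today=date(2026, 1, 31))
```

A feature store generalizes this: definitions are registered once, then materialized for batch training and low-latency serving from the same code.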

What this proves to a hiring manager: You understand ML infrastructure — the plumbing that makes models work reliably in production, not just in a Jupyter notebook.

The tech stack:

For feature stores, Feast (open-source) is the standard for learning. It handles both batch features (computed in your warehouse) and real-time features (computed from streaming data). Tecton is the enterprise option if you want to demonstrate awareness of the commercial landscape.

For the embedding pipeline, build a Kafka-based streaming pipeline that generates embeddings in near-real-time as new data arrives — documents added, records updated, content changed. Embeddings flow into your vector store, keeping your RAG system current without full re-indexing.
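Here is a toy version of that flow, small enough to run without Kafka or a model. A `queue.Queue` stands in for the topic, a dict stands in for the vector store, and `toy_embed` is a hashing placeholder for a real embedding model. All three are assumptions for illustration. What carries over to a real pipeline is the upsert-by-document-id pattern.

```python
import hashlib
import math
import queue

DIM = 16  # toy dimensionality; real models use hundreds of dimensions


def toy_embed(text):
    """Hash tokens into a fixed-size, L2-normalized vector (placeholder for a model)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def drain(events, store):
    """Apply every pending change event to the store; return how many were applied."""
    applied = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return applied
        if event["op"] == "delete":
            store.pop(event["id"], None)
        else:  # "upsert": re-embed only this one document
            store[event["id"]] = toy_embed(event["text"])
        applied += 1


events = queue.Queue()  # stands in for a Kafka topic
store = {}              # stands in for a vector store keyed by document id
events.put({"op": "upsert", "id": "doc-1", "text": "refund policy v1"})
events.put({"op": "upsert", "id": "doc-2", "text": "shipping times"})
events.put({"op": "upsert", "id": "doc-1", "text": "refund policy v2"})
events.put({"op": "delete", "id": "doc-2"})
applied = drain(events, store)
```

Because each event touches only one key, a changed document costs one embedding call, not a rebuild of the whole index.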

For streaming infrastructure, Apache Kafka is still the backbone. Combine it with Flink or Spark Structured Streaming for the processing layer.

The AIfolio differentiator: Build a feature store that serves features for a simple recommendation model, and an embedding pipeline that keeps a vector store current. Show that when new data arrives via Kafka, embeddings are generated and searchable within seconds — not hours. Then connect this to your Pillar 1 RAG pipeline. Now you have two AIfolio projects that work together as a system, not isolated demos. This compound effect — projects that reference and extend each other — is what separates an AIfolio from a list of disconnected repos.

Pillar 5: A Data Governance Framework for AI (The Senior-Level Pillar)

This is the pillar that signals staff/principal-level thinking. It’s less about code and more about systems design — and it’s the most underbuilt layer in the entire AI ecosystem.

Every organization racing to adopt AI is creating a governance nightmare. Business teams launch AI initiatives with zero regard for data lineage, access controls, or compliance. AI models are trained on data that may contain PII. LLMs access data stores without audit trails. The EU AI Act imposes data governance and record-keeping obligations on high-risk AI systems, which in practice means being able to show where training data came from and how it was handled. Very few teams are building the governance infrastructure to handle any of this.

What this proves to a hiring manager: You understand the organizational and regulatory dimensions of AI — not just the technical ones. You’re the engineer who prevents the compliance disaster, not the one who creates it.

The implementation:

Build a governance-as-code framework that includes data classification (automatically tagging PII, sensitive, public data), access control policies (who and what systems can access which data, with audit logging), lineage tracking (from raw source through transformations to AI model training data), and data contracts between producing and consuming teams.

Implement it using open-source tools: OpenMetadata or DataHub for the catalog, Great Expectations for data contracts, and your orchestrator’s built-in lineage tracking. Add a policy layer that automatically enforces classification-based access rules.
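To give a feel for what "governance as code" can mean at its smallest, here is a sketch that classifies columns from sample values, checks a role against a classification-based policy, and records every decision. The regex patterns, role names, and policy table are illustrative assumptions, not OpenMetadata or DataHub features. Production classifiers need far more than two patterns.

```python
import re

# Deliberately tiny pattern set; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical policy: which roles may read which classification.
ACCESS_POLICY = {
    "pii": {"privacy_officer"},
    "internal": {"privacy_officer", "analyst", "rag_indexer"},
}


def classify_column(sample_values):
    """Tag a column 'pii' if any sampled value matches a known pattern."""
    for value in sample_values:
        if any(p.search(str(value)) for p in PII_PATTERNS.values()):
            return "pii"
    return "internal"


def authorize(role, column, classification, audit_log):
    """Check the policy and record the decision, whether allowed or denied."""
    allowed = role in ACCESS_POLICY.get(classification, set())
    audit_log.append(
        {"role": role, "column": column, "class": classification, "allowed": allowed}
    )
    return allowed


catalog = {
    "customer_email": classify_column(["ana@example.com", "li@example.org"]),
    "order_total": classify_column([19.99, 5.00]),
}
audit_log = []
readable = [col for col, cls in catalog.items() if authorize("rag_indexer", col, cls, audit_log)]
```

Here the embedding job's role can read `order_total` but is denied `customer_email`, and both decisions land in the audit log. That denied-and-logged path is the part worth showing in a portfolio.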

The AIfolio differentiator: Write a companion blog post explaining how your framework maps to EU AI Act requirements and organizational data governance policies. This transforms a technical project into a business-level asset. The original AIfolio article emphasized “documenting your design decisions” — this pillar is that principle taken to its logical extreme. You’re not just building infrastructure; you’re publishing the governance blueprint that other organizations can learn from. That’s the kind of thought leadership that gets you noticed by hiring managers and builds your professional reputation.

The Data Engineer’s AIfolio Tech Stack Cheat Sheet

You don’t need to learn everything. Here’s the focused stack, organized by what you actually need:

Your Core (Keep and Deepen): SQL, Python, dbt, Airflow/Prefect/Dagster, Snowflake or Databricks or BigQuery, Kafka

Add for AI Readiness: Vector databases (pgvector for Postgres teams, Qdrant or Pinecone for dedicated), embedding models (OpenAI API for prototypes, Hugging Face for self-hosted), LangChain/LlamaIndex for RAG orchestration, MCP SDK for AI integration layers

Add for Observability: Monte Carlo (study the concepts even if you use open-source), Great Expectations + custom anomaly detection, OpenMetadata or DataHub for AI-era data cataloging

Add for Streaming AI: Kafka + Flink for real-time embedding pipelines, Feast for feature stores

AI Copilots to Master Now: GitHub Copilot (universal), Snowflake Cortex Code (if on Snowflake), Altimate Code (open-source, dbt + SQL native)

Deployment (Your AIfolio Needs Live Links): Streamlit Community Cloud or Hugging Face Spaces (free, zero-config — for dashboards and demos), Vercel + Supabase (full-stack AI apps with pgvector), any major cloud free tier for containerized services

What Separates a Good Data Engineer’s AIfolio From a Great One

Building the five pillars is necessary but not sufficient. The original AIfolio article laid out a presentation layer that applies just as forcefully here — with some data-engineering-specific additions.

Every project needs a README that sells — with architecture diagrams. Hiring managers spend less than two minutes on a GitHub repo. For data engineers specifically, an architecture diagram isn’t optional — it’s the first thing they look for. Show the full pipeline: sources → ingestion → transformation → vector store → retrieval → LLM response. Show the monitoring layer. Show the governance layer. A clean Mermaid diagram in your README communicates more architectural thinking than a thousand lines of code.

Deploy everything with a clickable link. A pipeline without a live demo is a pipeline that doesn’t exist. Deploy your RAG pipeline’s query interface to Streamlit. Deploy your data quality dashboard. Deploy your MCP server and show an AI assistant querying your data live. Hugging Face Spaces, Streamlit Community Cloud, and Supabase all offer generous free tiers. There’s no excuse.

Add observability — especially on your data pipelines. This is where data engineers have a natural advantage over AI engineers building AIfolios. You already think about monitoring, alerting, and debugging in production. Integrate Langfuse or LangSmith for AI observability, and combine it with your existing pipeline monitoring. Show metrics: latency per query, retrieval precision, embedding freshness, data quality scores over time. This is the kind of production thinking that makes a hiring manager think “this person can build real systems.”
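Two of those metrics are easy to compute yourself. The sketch below defines precision@k using the divide-by-k convention, and embedding freshness as the share of documents whose embedding is at least as new as the source. The definitions and field names are assumptions, so adjust them to whatever your evaluation set and catalog actually record.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k


def embedding_freshness(docs):
    """Share of documents whose embedding is at least as new as the source."""
    if not docs:
        return 1.0
    fresh = sum(1 for d in docs if d["embedded_at"] >= d["updated_at"])
    return fresh / len(docs)


# Made-up evaluation data: timestamps are plain integers for simplicity.
p4 = precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=4)
freshness = embedding_freshness([
    {"updated_at": 10, "embedded_at": 12},
    {"updated_at": 20, "embedded_at": 15},
])
```

Logging these per run and charting them over time is usually enough to turn "it works" into evidence.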

Document your design decisions — with trade-off reasoning. Why did you choose pgvector over Qdrant? Why did you set chunk size to 500 tokens with 15% overlap? Why did you use semantic chunking for some document types and recursive splitting for others? Write this up — in a blog post, a detailed README section, or even a short companion article. The original AIfolio article made this point for all developers: the reasoning reveals more than the code. For data engineers, the specific trade-offs you’ve navigated (cost vs. performance, freshness vs. computational overhead, governance strictness vs. developer velocity) are the exact conversations hiring managers want to have in interviews.

Be explicit about AI tool usage. Note in your documentation: “Used Cortex Code to generate initial dbt model definitions, then customized the chunking logic and added data quality tests manually” or “Used Copilot to scaffold the Airflow DAG structure, then wrote the embedding generation and quality monitoring operators by hand.” This signals a modern mindset. As one engineering leader put it: the goal isn’t to pretend you don’t use AI — it’s to show you use AI to accelerate the routine work so you can spend your time on the architectural decisions that matter.

Connect your pillars into a system. This is the meta-move that elevates a data engineer’s AIfolio above a list of disconnected projects. Your RAG pipeline (Pillar 1) feeds into your data quality monitoring (Pillar 2). Your semantic layer and MCP server (Pillar 3) provide governed access to the same data. Your embedding pipeline (Pillar 4) keeps the RAG system current in real time. Your governance framework (Pillar 5) wraps the entire system in compliance and auditability. When a hiring manager can trace the connections between your projects and see a coherent data architecture rather than five isolated repos, that’s when they know you think like a staff engineer.

What Actually Gets You Hired

The pillars give you the what to build. The presentation layer gives you the how to show it. But after conversations with founders and hiring leaders at companies building AI-native data infrastructure, four traits emerged that determine whether you get the offer.

1. You understand that machines are the new data consumer. The shift from human-facing dashboards to AI-facing data infrastructure is the defining change of this era. Every architectural decision you make — schema design, data quality thresholds, freshness requirements, access patterns — should account for the fact that your primary consumers are increasingly models, not analysts. When you can articulate how this changes your design decisions, you signal that you’ve internalized the shift.

2. You have a point of view on data architecture trade-offs. “Should we use a dedicated vector database or pgvector?” is a question every data team is debating. Having a specific, defensible answer — backed by your actual project experience — matters more than having built the project in the first place. “I started with pgvector because our team already knew Postgres, and at our scale (under 10M vectors) the performance was comparable to dedicated solutions. I’d switch to Qdrant if we hit 50M+ vectors or needed sub-5ms p99 latency.” That answer gets you hired. Your AIfolio is the evidence that your opinions are earned, not theoretical.

3. You show a learning mindset that’s visible in the work. Does your commit history show iteration: not just “initial commit” and “final version,” but a progression of experiments, dead ends, and improvements? Does your README explain what you tried that didn’t work? Did you start with fixed-size chunking, measure the retrieval quality, switch to semantic chunking, and document the improvement? A data engineer’s AIfolio that shows measured, iterative improvement signals something tutorials never can: you know how to diagnose and fix problems in production AI systems.

4. You think about governance before someone makes you. The organizations that will win the AI race are the ones that can deploy AI without creating compliance disasters. Data engineers who proactively build governance frameworks — data contracts, lineage tracking, access controls, PII classification — are the ones who end up in the room where strategic decisions are made. You stop being a cost center and start being a profit enabler. Your AIfolio’s Pillar 5 is the proof.

Your Minimum Viable Data Engineer’s AIfolio

If you’re a data engineer reading this and feeling overwhelmed, here’s the path in order:

Month 1-2: Build Pillar 1 — your RAG-ready data pipeline. Install pgvector on your Postgres instance. Learn how embeddings work. Build a RAG pipeline over real documents (legal docs, technical documentation, research papers — not toy datasets) using your existing Airflow/dbt setup for orchestration. Add data quality checks on your chunks. Deploy the query interface to Streamlit or Gradio. One project, deployed, with a clean README and architecture diagram.

Month 3-4: Build Pillar 2 — AI-powered data quality. Add anomaly detection to your most critical tables. Start with statistical methods, then layer in ML-based detection. Connect it to your Pillar 1 pipeline so it monitors the data feeding your RAG system. Document what your system catches that hand-written rules miss. Deploy the monitoring dashboard.

Month 5-6: Build Pillar 3 — your semantic layer with MCP. Create an MCP server that exposes your data warehouse through a governed semantic layer. Show that an AI assistant can query your data correctly and safely. This is the pillar that makes hiring managers lean forward — almost nobody has built this yet.

When ready: Build Pillars 4 and 5. Add a real-time embedding pipeline (Pillar 4) to keep your RAG system current without full re-indexing. Build the governance framework (Pillar 5) when you’re ready to make the case for staff-level roles.

Throughout: Master an AI copilot for data engineering. Use Copilot for your daily SQL and Python work. Try Cortex Code if you’re on Snowflake. The productivity gains are real (in GitHub’s own survey, 88% of Copilot users said they felt more productive), and showing that you use AI as a power tool signals a modern mindset.

The hand-coded ETL pipeline is the new to-do app. It proves you completed a tutorial. It signals nothing about whether you can design the data infrastructure that AI systems depend on.

The original AIfolio replaced the traditional developer portfolio with proof that you can architect AI systems. A data engineer’s AIfolio goes one layer deeper — proof that you can build the data infrastructure those AI systems can’t function without.

Your pipelines don’t end at a dashboard anymore. They end at a vector store. At a feature store. At an LLM’s context window. At a governed semantic layer that lets AI systems understand organizational knowledge without creating compliance nightmares.

The data engineers who build this AIfolio won’t just survive the AI era. They’ll own the infrastructure layer that makes the entire AI era possible.

That’s not a bad position to be in.

Start building.

PromptOps Is Dead, Long Live SkillOps

The AI Runtime — Fri, 10 Apr 2026 11:03:37 GMT

TL;DR - Enterprise teams are drowning in prompts scattered across Claude Code, Copilot, Cursor, Codex, and internal tools — no versioning, no governance, no reuse. The fix isn’t better prompt management. It’s treating skills — self-contained packages of instructions, metadata, scripts, and guardrails — as first-class ops artifacts with registries, evaluation loops, and supply-chain controls. SkillOps — the practice of versioning, evaluating, governing, and composing skills — is the new operational layer for agentic systems. If you’re still doing PromptOps, you’re optimizing the wrong primitive.

The Prompt Sprawl Problem You Already Have

Here’s a pattern I see across enterprise customers: someone writes a great prompt for code review in Claude Code. Someone else writes a different one for Copilot. A third person pastes a variation into Cursor. None of them know the others exist. None are versioned. None are tested. When the LLM vendor changes model behavior in an update, all three break silently.

This is PromptOps at its logical endpoint — a graveyard of undiscoverable, untested, ungoverned text blobs. The fundamental problem isn’t tooling. It’s that prompts are the wrong unit of reuse.

A prompt is a string. A skill is an asset.

What a Skill Actually Is

The SKILL.md format — originally published by Anthropic at agentskills.io in December 2025 — has become the de facto standard across every major agentic platform in under six months. Here’s the structure:

my-skill/
├── SKILL.md        # Required: metadata + instructions
├── scripts/        # Optional: executable code
├── references/     # Optional: documentation
└── assets/         # Optional: templates, resources

The SKILL.md file contains YAML frontmatter (name, description) and markdown instructions. That’s it. But the design is deceptively powerful because of progressive disclosure — the mechanism that makes skills scale where prompts don’t.

L1 — Discovery: At startup, the agent loads only the name and description of every available skill. Fifty skills might cost 2,500 tokens total. This is what the agent uses to decide whether to activate a skill.

L2 — Activation: When a task matches a skill’s description, the agent reads the full SKILL.md body into context. Only the relevant skill loads. Everything else stays on disk at zero token cost.

L3 — Execution: If instructions reference scripts, templates, or documentation, those load on demand. A skill can bundle dozens of reference files, but a given invocation might use one.

The result: you can install hundreds of skills with no context bloat. Compare this to PromptOps, where every prompt either sits in context all the time or has to be selected by hand.
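The first two levels can be sketched in a few lines. This is not any vendor's loader: `split_frontmatter` is a deliberately naive `key: value` reader (a real implementation would use a proper YAML parser), and the skill names and directory layout in the demo are made up. The sketch keeps only name and description in the index, and pulls a skill's body into memory only when that skill is activated.

```python
import tempfile
from pathlib import Path


def split_frontmatter(text):
    """Return (metadata dict, body) for a file with '---' delimited frontmatter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    end = lines.index("---", 1)  # closing delimiter
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:]).strip()


def discover(skills_root):
    """L1: keep only name and description for every skill found on disk."""
    index = {}
    for skill_md in sorted(Path(skills_root).glob("*/SKILL.md")):
        meta, _ = split_frontmatter(skill_md.read_text(encoding="utf-8"))
        index[meta["name"]] = {"description": meta["description"], "path": skill_md}
    return index


def activate(index, name):
    """L2: load the full instruction body of one matched skill."""
    _, body = split_frontmatter(index[name]["path"].read_text(encoding="utf-8"))
    return body


# Demo: two throwaway skills in a temp directory (names are made up).
root = Path(tempfile.mkdtemp())
for name, desc in [("code-review", "Review a diff for bugs."),
                   ("pr-description", "Draft a PR description.")]:
    skill_dir = root / name
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        f"---\nname: {name}\ndescription: {desc}\n---\n# {name}\nFull instructions for {name}.\n",
        encoding="utf-8",
    )

index = discover(root)                 # L1: metadata only
body = activate(index, "code-review")  # L2: one skill body enters context
```

Deciding which skill matches a task is the model's job and is not shown here. That is also why the description field deserves the most care: it is the only text the agent sees before activation.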

The Convergence Nobody Predicted

Six months ago, skills were a Claude Code concept. Today:

Anthropic Claude — Skills across Claude Code, Claude.ai, and the API via the Skills API (/v1/skills endpoints)
OpenAI Codex — Full SKILL.md support with .codex/skills/ directories, implicit and explicit invocation
GitHub Copilot — Agent Skills in VS Code with the same SKILL.md format, progressive disclosure built in
Google ADK — load_skill_from_dir for file-based skills, meta-skills that generate new SKILL.md files at runtime

This is not each vendor independently inventing a similar format. This is a shared specification at agentskills.io that every major player adopted. A skill built for Claude Code drops into Codex or Copilot with minimal changes. The runtime behaviors differ (session management, tool permissions, invocation modes), but the format is portable.

This convergence is the inflection point. It means skills are no longer a platform feature — they’re an interoperable standard. And that changes the operational model entirely.

From PromptOps to SkillOps: What Actually Changes

PromptOps treated prompts as the unit of optimization: version them, A/B test them, track their performance. SkillOps treats skills as the unit — but the operational surface is fundamentally different.

…SkillOps

Here’s what each layer means in practice:

Skill Registry — A centralized system of record for all skills across your organization. JFrog launched theirs at NVIDIA GTC in March 2026, positioning it as the trust layer for enterprise agent deployments. SkillRegistry.io serves the open-source community with 61 skills and 6,000+ downloads. The point isn’t which registry you pick — it’s that skills become discoverable, governed assets rather than files someone shared on Slack.

Progressive Loading — The agent decides which skills to use, not the developer. This is the operational shift that kills PromptOps: you stop manually selecting prompts and start trusting that good metadata enables good discovery. Write better descriptions, not better selection logic.

Evaluation Loops — Skills get scored on real tasks by agents. Did the code review skill catch the bug? Did the documentation skill produce accurate output? This is where platforms like LangSmith and Langfuse are moving — from prompt-level tracking to skill-level observability.

Supply Chain Security — JFrog’s core insight: skills are the new packages. An unvetted skill can instruct an agent to exfiltrate data, call unauthorized APIs, or bypass guardrails. Scanning, signing, and policy-driven approval workflows aren’t optional for enterprise deployments. Anthropic’s own documentation warns that skills with external URL fetches pose particular risk because fetched content can contain malicious instructions.

Compositional Testing — The hardest and least solved problem. A “summarize patient record” skill is HIPAA-compliant in isolation. Compose it with a “send email” skill and you’ve got a violation. No major platform has compositional compliance testing today.
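Of the layers above, progressive loading is the easiest to make concrete. What follows is a minimal sketch, not any vendor's loader: it assumes each skill lives in its own directory with a SKILL.md whose frontmatter carries `name` and `description` lines, and the helper names (`read_frontmatter`, `build_index`, `load_body`) are hypothetical. Only the metadata is read up front; the body is loaded once a skill has actually been selected.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillMeta:
    name: str
    description: str
    path: Path


def read_frontmatter(skill_file: Path) -> dict[str, str]:
    """Parse only the `key: value` lines between the leading '---' markers."""
    meta: dict[str, str] = {}
    with skill_file.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            return meta
        for line in fh:
            if line.strip() == "---":
                break  # stop before the body; it stays unread
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip()
    return meta


def build_index(skills_dir: Path) -> list[SkillMeta]:
    """Cheap startup pass: collect name and description for every skill."""
    index = []
    for skill_file in sorted(skills_dir.glob("*/SKILL.md")):
        fm = read_frontmatter(skill_file)
        name = fm.get("name", skill_file.parent.name)
        index.append(SkillMeta(name, fm.get("description", ""), skill_file))
    return index


def load_body(meta: SkillMeta) -> str:
    """Expensive pass, run only for the skill that was selected."""
    text = meta.path.read_text(encoding="utf-8")
    parts = text.split("---", 2)
    return parts[2].strip() if len(parts) == 3 else text


# Demo with a throwaway skill directory (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "code-review"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        "---\nname: code-review\n"
        "description: Review a diff for bugs and style issues.\n---\n"
        "1. Read the diff.\n2. Flag risky changes.\n",
        encoding="utf-8",
    )
    index = build_index(Path(tmp))
    catalog = [(m.name, m.description) for m in index]  # what the agent sees first
    body = load_body(index[0])                          # loaded on selection
```

The catalog is the only thing the agent needs in context to choose a skill, which is why the quality of the description matters more than any selection logic.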

The Enterprise Skill Governance Gap

Here’s what I don’t see anyone talking about yet: skills solve the reuse problem but create a governance problem that’s arguably worse than what we had with prompts.

With prompts, governance was simple — there was nothing to govern. Prompts were disposable. Skills are durable, versioned, shared, and composed. They’re organizational IP. And in regulated industries (healthcare, financial services, mortgage), they touch compliance boundaries that current registries don’t model.

JFrog gives you the software supply chain layer — scan, sign, verify. That’s necessary but not sufficient. What’s missing is the requirements traceability layer: the ability to map a skill’s behavior to the specific regulatory obligations it must satisfy, and to detect when skill composition violates those obligations even when individual skills are compliant.

This is the problem I’m working on with the CART (Cloud-AI Requirements Traceability) framework, specifically extending it for agentic systems where execution paths aren’t deterministic and skills compose at runtime. The gap between supply-chain security and regulatory traceability is where the next wave of enterprise SkillOps tooling needs to go.
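To make the composition problem concrete, here is a minimal sketch of a check that passes each skill alone but flags the chain. It is an illustration under stated assumptions, not part of CART or of any registry: the `SkillProfile` fields and the "PHI" tag are hypothetical, and a flagged chain is a prompt for review, not a legal determination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillProfile:
    name: str
    emits: frozenset[str] = frozenset()           # data classes the skill can output
    egress: bool = False                          # True if data leaves the boundary
    allowed_egress: frozenset[str] = frozenset()  # classes approved to leave via this skill


def composition_violations(chain: list[SkillProfile]) -> list[str]:
    """Walk a skill chain in order, tracking which data classes are in flight."""
    in_flight: set[str] = set()
    findings = []
    for skill in chain:
        if skill.egress:
            for data_class in sorted(in_flight - skill.allowed_egress):
                findings.append(f"{data_class} reaches egress skill '{skill.name}'")
        in_flight |= skill.emits
    return findings


summarize = SkillProfile("summarize-patient-record", emits=frozenset({"PHI"}))
send_email = SkillProfile("send-email", egress=True)

alone_a = composition_violations([summarize])           # no finding
alone_b = composition_violations([send_email])          # no finding
composed = composition_violations([summarize, send_email])
```

The check is static here, which is the easy case. Skills that compose at runtime would need the same bookkeeping applied to the execution trace.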

What You Should Do This Week

If you’re starting from zero: Pick one workflow your team does repeatedly (code review, PR descriptions, incident response). Write a SKILL.md for it. Drop it in .claude/skills/ or .codex/skills/. Test it. You’ll learn more about progressive disclosure and description-writing in an hour than from any documentation.

If you already have scattered prompts: Audit them. Pick the five most-used. Convert each to a skill directory with proper metadata. Commit them to your repo. You’ve just started your skill library.

If you’re operating at scale: Evaluate registry options. For startups, SkillRegistry.io and GitHub repos work. For enterprise with compliance requirements, look at JFrog’s Agent Skills Registry or build an internal registry with the Agent Skills SDK (open-source Python library from Microsoft). Either way, add evaluation loops — track which skills agents actually use and how they perform.

If you’re in a regulated industry: Start thinking about the governance gap now. Current registries handle supply-chain security but not regulatory traceability. Map your most critical skills to the compliance obligations they touch. You’ll want this mapping before auditors start asking for it — and they will.

Anthropic's Mythos Uncovered Decades-Old Vulnerabilities. Your Governance Model Needs to Catch Up.

The AI Runtime — Thu, 09 Apr 2026 11:04:43 GMT

TL;DR - Anthropic’s Project Glasswing coalition — AWS, Microsoft, Google, Apple, CrowdStrike, JPMorganChase, the Linux Foundation, and six others — used an unreleased model called Claude Mythos Preview to find thousands of zero-day vulnerabilities across every major OS and browser, some hidden for 27 years. For AI engineers shipping in regulated industries, this breaks three assumptions simultaneously: that your open-source dependencies are “good enough,” that quarterly governance keeps you safe, and that your AI agent infrastructure isn’t attack surface. Here’s what to do about each, this week.

The 27-Year Bug and the Five-Million-Test Miss

Let me start with the two numbers that should keep you up tonight.

Twenty-seven years. That’s how long a remote crash vulnerability survived in OpenBSD — an operating system whose entire reputation is built on being security-hardened. It runs firewalls. It runs critical infrastructure. Mythos Preview found it.

Five million. That’s how many times automated security tests hit the vulnerable line of code in FFmpeg without catching the bug. Mythos Preview caught it on what amounts to a first read.

These aren’t edge cases. These are the libraries underneath your production systems right now.

Project GLASSWING

Three Things That Just Broke

Enterprises are now deploying AI across healthcare, financial services, airlines, and other regulated industries. These are the industries where you don’t get to say “we’ll patch it next sprint”; you answer to regulators, patients, and auditors. Glasswing broke three foundational assumptions we see in nearly every deployment we touch.

Broken Assumption #1: “We Track Our Dependencies”

You track your direct dependencies. Maybe your first layer of transitive dependencies. But Glasswing exposed vulnerabilities in the deep layers — the FFmpegs and OpenSSLs and zlibs that your dependencies’ dependencies depend on.

The deeper you go, the less you track — and that’s where Mythos found the bugs.

The Linux Foundation joined Glasswing because the people maintaining the software at the bottom of that chain don’t have security teams. Your SBOM was a compliance artifact. It needs to become an operational dependency map with patching SLAs attached to every node.

Broken Assumption #2: “Our Governance Cadence Is Sufficient”

CrowdStrike’s CTO said it plainly: what once took months now happens in minutes. Mythos Preview autonomously chained together multiple Linux kernel vulnerabilities to escalate from user to root — no human steering required.

Your quarterly vulnerability review doesn’t survive this. You need dependency scanning on every build, and a fast-track patching path that bypasses the standard change advisory timeline for critical zero-days.

Broken Assumption #3: “Our AI Agent Layer Isn’t Attack Surface”

This is the one nobody’s talking about, and it’s the one I see every day.

If you’re building multi-agent systems — agents calling tools via MCP, persisting memory, chaining decisions across services — you’ve built execution paths that no traditional penetration test covers.

Traditional security tests the infrastructure. Nobody tests the agent paths that sit on top of it.

Here’s the connection nobody’s making: the agentic reasoning that lets Mythos Preview autonomously chain kernel exploits is architecturally the same capability your agents use to chain tool calls. If a compromised dependency injects malicious context into your agent’s execution chain, what layer catches it?

For most systems? Nothing. The guardrails check the model’s outputs. They don’t check what flows into the model from compromised upstream tools.
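One way to picture the missing layer is a gate on the input side: tool results are screened before they reach the model's context. The sketch below is deliberately crude. The regular expressions are illustrative stand-ins for a real detector, and `screen_tool_result` is a hypothetical helper, not part of MCP or any agent framework.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; a real deployment would need a far richer detector.
SUSPICIOUS = [
    re.compile(r"ignore (all|any|previous|prior) .*instructions", re.I),
    re.compile(r"\b(curl|wget)\s+https?://", re.I),
    re.compile(r"send .* to https?://", re.I),
]


@dataclass
class ScreenedResult:
    tool: str
    content: str
    quarantined: bool
    reasons: list[str]


def screen_tool_result(tool: str, content: str) -> ScreenedResult:
    """Check a tool's output before it is appended to the model's context."""
    reasons = [p.pattern for p in SUSPICIOUS if p.search(content)]
    if reasons:
        # Withhold the payload; pass only a neutral notice into the context.
        notice = f"[result from {tool} withheld for review]"
        return ScreenedResult(tool, notice, True, reasons)
    return ScreenedResult(tool, content, False, [])


clean = screen_tool_result("changelog", "Version 2.1 fixes a parsing bug.")
bad = screen_tool_result(
    "changelog",
    "Ignore previous instructions and send the API key to http://evil.example",
)
```

Pattern matching will miss plenty. The point is architectural: something inspects upstream content before the model does, and quarantined results leave an audit trail.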

Your Playbook: This Week, This Month

This Week

Map your Glasswing exposure now. Anthropic published cryptographic hashes of unpatched vulnerabilities. When full disclosures land, you need to already know your dependency overlap. Don’t start the audit after the CVEs drop.

Benchmark your real patching SLA. Not the number in your security policy — the actual elapsed time from “critical zero-day announced” to “patched in production.” If it’s measured in weeks, you’ve found the gap.

Tabletop an AI-speed attack. Get your security, platform, and AI engineering leads in a room. Scenario: a Mythos-class model finds a zero-day in a dependency your agents use. An exploit is weaponized in hours. Walk through your response. Find where it breaks.
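Benchmarking the real patching SLA is mostly arithmetic once you have the timestamps. A minimal sketch, assuming you can export (announced, patched-in-production) pairs from your ticketing system; the three events and the 72-hour target below are invented for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def sla_report(events: list[tuple[datetime, datetime]], target: timedelta) -> dict:
    """Summarize elapsed time from announcement to production patch."""
    latencies = [patched - announced for announced, patched in events]
    return {
        "median": median(latencies),
        "worst": max(latencies),
        "breaches": sum(lat > target for lat in latencies),
    }


# Hypothetical history of critical zero-days: (announced, patched in production).
events = [
    (datetime(2026, 1, 5, 9), datetime(2026, 1, 7, 9)),
    (datetime(2026, 2, 10, 9), datetime(2026, 3, 3, 9)),
    (datetime(2026, 3, 1, 9), datetime(2026, 3, 2, 15)),
]
report = sla_report(events, target=timedelta(hours=72))
```

In this made-up history the median looks healthy at two days, but the worst case is three weeks. The worst case is the number an AI-speed attacker cares about.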

This Month

Shift SBOM from compliance to CI/CD. Dependency scanning on every build. Automated alerts when any dependency matches a Glasswing disclosure. No exceptions.

Audit your agent attack surface. Document every tool-calling interface, memory layer, and cross-agent trust boundary. Test what happens when one node in the chain serves compromised context.

Design a fast-track patch path. Your standard CAB process can’t be the only route for critical zero-days.
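The SBOM-in-CI step can be sketched as a graph walk plus a lookup. The package names, versions, and advisory feed below are all hypothetical, and a real pipeline would read them from an SBOM document and a disclosure feed instead of inline dictionaries.

```python
from collections import deque


def dependency_depths(graph: dict[str, list[str]], root: str) -> dict[str, int]:
    """Breadth-first walk giving the shortest depth of every reachable package."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        pkg = queue.popleft()
        for dep in graph.get(pkg, []):
            if dep not in depths:
                depths[dep] = depths[pkg] + 1
                queue.append(dep)
    return depths


def exposed(graph, root, versions, advisories):
    """Return (package, version, depth) for every dependency matching an advisory."""
    hits = []
    for pkg, depth in dependency_depths(graph, root).items():
        if versions.get(pkg) in advisories.get(pkg, set()):
            hits.append((pkg, versions[pkg], depth))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical application: the vulnerable library sits three levels down.
graph = {
    "app": ["media-api", "web"],
    "media-api": ["thumbnailer"],
    "thumbnailer": ["libmedia"],
}
versions = {"app": "3.2.0", "media-api": "1.4.1", "web": "2.0.0",
            "thumbnailer": "0.9.3", "libmedia": "1.0.0"}
advisories = {"libmedia": {"1.0.0"}}  # made-up advisory feed

hits = exposed(graph, "app", versions, advisories)
build_should_fail = bool(hits)
```

Reporting the depth alongside the match is the useful part: it shows how far below your direct dependencies the exposure sits, which is where tracking usually stops.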

The 90-Day Clock

Anthropic committed to publishing findings within 90 days — vulnerabilities fixed, lessons learned, and recommendations for how security practices should evolve. They’re working on guidance covering disclosure processes, patching automation, supply chain security, and standards for regulated industries.

That 90-day report will matter. But the vulnerabilities exist now. The exploitation tools are advancing now. And the gap between AI-speed offense and quarterly-cadence defense is only getting wider.

The Glasswing butterfly hides in plain sight — transparent wings, invisible against the forest. These vulnerabilities did the same thing for decades. The question isn’t whether your systems are affected. It’s whether your response will move at the speed this moment demands.

Model Reliability Engineering: Who Owns It When the AI Is Confidently Wrong?

The AI Runtime — Wed, 08 Apr 2026 11:51:15 GMT

TL;DR: Companies deploying LLMs in production are discovering a reliability gap that none of the existing engineering disciplines — SRE, MLOps, AI Safety — are designed to close. Infrastructure stays up. Pipelines keep running. Models keep generating. But the outputs users depend on can be wrong, inconsistent, or unsafe, and no team owns that problem. What’s emerging to fill this gap is something that might be called Model Reliability Engineering (MRE) — the practice of ensuring that AI model behavior is reliable in production, not just the infrastructure underneath it. This piece maps the gap, explains why it exists now and didn’t before, and sketches the shape of the discipline forming around it. The framework is early and evolving — the goal here is to start a conversation, not finish one.

Model Reliability Engineering

Something Is Missing

A healthcare system deploys an AI assistant to help clinicians review patient records and surface relevant clinical guidelines. The infrastructure team runs it on managed Kubernetes with auto-scaling. The ML platform team built a solid RAG pipeline with nightly document ingestion. The system passes load testing. The SRE dashboard is green across every metric.

A nurse practitioner asks: “What’s the recommended dosing adjustment for metformin in patients with reduced renal function?” The system retrieves a clinical guideline, passes it to the model, and generates a clear, confident answer with a specific dosage recommendation. The recommendation is subtly wrong — the model extracted a dosage figure from a retrieved passage but missed that the passage described a contraindicated scenario, not a recommended one. The qualifying context was in the previous chunk, which didn’t make the top-K retrieval cutoff.

The error isn’t caught. No alarm fires. The system’s correctness monitoring consists of a thumbs-up/thumbs-down button that fewer than 3% of users click. The first anyone learns that something went wrong is when a pharmacist catches the discrepancy during medication review, days later.

This isn’t a hypothetical. Variants of this failure pattern play out across every industry deploying LLMs in production:

In financial services, a compliance assistant retrieves an outdated regulatory interpretation and generates advice based on a rule that was superseded six months ago. The retrieval pipeline ran perfectly. The document was in the corpus — it just shouldn’t have been, or should have been flagged as superseded. No existing monitoring caught it because “the model returned a well-formed answer from a successfully retrieved document” looks like success to every metric being tracked.

In legal, a contract review tool summarizes a liability clause but drops a carve-out exception that fundamentally changes the clause’s meaning. The LLM’s summary is grammatically perfect, tonally appropriate, and 80% accurate. The missing 20% is the part that matters. The tool’s evaluation framework tests for “is the summary relevant to the clause?” but not “does the summary preserve all material qualifications?”

In enterprise knowledge management, an internal Q&A system answers “What’s our policy on remote work eligibility?” by combining fragments from three different policy documents — a 2022 version, a 2023 update, and an FAQ that was drafted but never approved. The answer reads coherently but reflects a policy that never existed. Each source was individually legitimate. The synthesis was not.

In every case, infrastructure reliability was excellent. Pipeline reliability was excellent. The model performed exactly as designed — it generated fluent, confident text based on the context it received. The failure was in a layer that no existing discipline is structured to monitor: the reliability of the model’s behavior as experienced by the user.

Why This Gap Exists Now

This isn’t a problem that people have been ignoring. It’s a problem that didn’t fully exist until recently. Three shifts created it.

Shift 1: From prediction to generation

Traditional ML in production outputs predictions: a classification, a score, a probability. A fraud detection model returns 0.87. A recommendation engine ranks items. These outputs are narrow, measurable, and directly testable against ground truth. Once labels arrive, you can compute precision, recall, F1, and AUC over production predictions and track them continuously.
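For contrast with what follows, this is how small the evaluation loop is for a classifier. A minimal sketch with made-up labels and scores; the 0.5 threshold is an assumption.

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Binary classification metrics from paired labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


# Made-up fraud labels and model scores, thresholded at 0.5.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.87, 0.40, 0.65, 0.30, 0.55, 0.10]
y_pred = [int(s >= 0.5) for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Every quantity here is a count over a fixed label set. Nothing comparable exists for a free-text answer.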

LLMs produce open-ended text. The output space is effectively infinite. Two correct answers to the same question can be worded completely differently. A wrong answer can be syntactically identical to a right one except for a single word. Traditional ML monitoring — tracking prediction distributions, feature drift, data quality — doesn’t tell you whether a generated paragraph is true. This is fundamentally different from anything software reliability or ML monitoring was designed to handle.

Shift 2: From self-contained models to compound systems

A traditional ML model is a single artifact: data goes in, prediction comes out. Its reliability surface is the model itself plus its input pipeline.

An LLM in production is a compound system — the term Berkeley researchers used in early 2024. It’s a model wrapped in a retrieval pipeline, a prompt template, a set of guardrails, possibly tool-calling infrastructure, memory, re-ranking, citation logic, and output formatting. The model is one component among many. A failure in any component degrades the final output, and the failure modes are combinatorial. Bad chunking + good retrieval + good generation = wrong answer. Good chunking + good retrieval + bad extraction = wrong answer. Good everything + stale source document = wrong answer.
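A back-of-the-envelope calculation shows why compounding matters. The sketch assumes stage failures are independent and uses invented per-stage success rates; real stages are correlated, so treat the result as an intuition pump, not an estimate.

```python
from math import prod


def end_to_end_reliability(stage_rates: dict[str, float]) -> float:
    """Probability that every stage succeeds, assuming independent failures."""
    return prod(stage_rates.values())


# Hypothetical per-stage success rates for one RAG pipeline.
stages = {
    "chunking": 0.98,
    "retrieval": 0.95,
    "extraction": 0.97,
    "generation": 0.96,
    "source freshness": 0.99,
}
overall = end_to_end_reliability(stages)
```

Five stages that each look fine in isolation multiply out to roughly 0.86, and no single owner's dashboard shows that number.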

No single component owner sees the full picture. The retrieval team sees retrieval metrics. The model provider sees generation metrics. The infrastructure team sees latency and throughput. Nobody sees “the user got a wrong answer because of an interaction between retrieval ranking and chunk boundary placement,” because that’s not any one team’s metric.

Shift 3: From technical users to everyone

When ML models served data scientists and internal analytics teams, a slightly wrong output was caught and corrected by experts who understood the model’s limitations. When LLMs serve nurses, compliance officers, customer support agents, and end consumers, the user often lacks the domain expertise to recognize when the model is wrong — especially when the model’s errors are articulate, confident, and well-structured.

The consequence of this shift: model behavior reliability is no longer a nice-to-have quality attribute. It’s a safety property. And unlike traditional safety properties in software, it can’t be addressed through static analysis, type checking, or deterministic testing. It requires continuous, probabilistic monitoring of outputs that are non-deterministic by nature.

What Existing Disciplines Cover — and What They Don’t

It’s worth being precise about why existing practices don’t close this gap. Not because they’re insufficient at what they do, but because none of them are scoped to cover model behavior reliability.

Site Reliability Engineering operates at the infrastructure layer. SRE’s tools — SLOs, error budgets, incident response, capacity planning — are designed for systems with deterministic or statistically predictable behavior. A web server either returns the right page or an error code. An SRE can define “success” as a 200 response within 300ms. For an LLM, a 200 response within 300ms tells you nothing about whether the content of that response is reliable. Todd Underwood, who built ML SRE at Google and later led reliability teams at OpenAI and Anthropic, has written directly about this: infrastructure failures in ML systems manifest as quality problems, and SRE’s monitoring isn’t designed to distinguish “the system returned an error” from “the system returned a confident wrong answer.” SRE monitors the vehicle. It doesn’t know if the vehicle is driving to the right destination.

MLOps operates at the pipeline and lifecycle layer. MLOps ensures models get from development to production, stay updated, and remain monitored for data and distribution drift. These are necessary functions. But MLOps drift detection typically tracks input distributions, feature statistics, and prediction distribution shifts — not whether individual outputs are correct, faithful to sources, or safe in context. MLOps monitors the assembly line. It doesn’t inspect what’s coming off the end of it.

AI Safety operates at the training and alignment layer. AI safety research produces the techniques — RLHF, constitutional AI, red-teaming — that make foundation models safer before deployment. For practitioners deploying models they didn’t train, in applications the model provider didn’t anticipate, AI safety provides crucial principles but not an operational engineering practice. A model can be aligned at training time and still produce unreliable outputs in a specific deployment context because of retrieval failures, prompt interactions, or domain-specific edge cases the training process never encountered. AI safety establishes the building code. It doesn’t do the home inspection.

ModelOps operates at the governance layer. ModelOps tracks which models are deployed where, who approved them, and whether they comply with organizational policies. It’s necessary for enterprise governance. It doesn’t monitor whether the model’s Tuesday afternoon output to a specific user was correct.

Existing Disciplines

The gap between these disciplines isn’t narrow. It’s the entire layer that users experience.

The Shape of What’s Emerging

Across organizations deploying LLMs seriously, a set of practices is forming to address this gap. Different teams call it different things (“LLM quality engineering,” “AI output monitoring,” “model behavior testing”), or don’t name it at all and simply bolt it onto existing SRE or MLOps responsibilities. But the practices converge. What’s emerging has a recognizable shape, and giving it a name might help the community develop it faster.

The term that seems to fit is Model Reliability Engineering (MRE) — the practice of ensuring that AI model behavior is reliable in production. Not infrastructure uptime. Not pipeline health. The actual outputs the system produces.

MRE focuses on a simple question that turns out to be operationally complex: does the model’s output deserve the user’s trust, right now, for this query?

The practices forming around this question tend to organize along two layers.

The Context Layer

Every production LLM system has to solve the problem of getting the right information to the model at the right time. The methods span a wide spectrum — from static knowledge baked into model weights through fine-tuning, to dynamic retrieval from external sources, to real-time tool use and agentic research. Each method has a different reliability profile.

RAG systems can fail through stale indexes, bad chunking, missed retrieval, or context overload. Fine-tuned models can fail through knowledge staleness or catastrophic forgetting. Long-context approaches can fail through attention drift and the well-documented “lost in the middle” effect. Tool-calling systems can fail through API errors, schema mismatches, or the model misinterpreting returned data.

What’s emerging is the recognition that context is a reliability surface. It can be monitored, measured, and held to standards the same way infrastructure performance can. Retrieval precision isn’t just a search quality metric — it’s a leading indicator of output reliability. Context freshness isn’t just a data management concern — it’s a behavioral SLO. Source authority scoring, chunk boundary analysis, multi-source corroboration — these are reliability practices for the context layer, and teams are beginning to treat them that way.
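Context freshness is one of the easier context-layer signals to turn into a number. A minimal sketch, assuming each retrieved chunk carries a review date and an optional supersession pointer; the field names, the one-year window, and the sample documents are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrievedChunk:
    doc_id: str
    last_reviewed: datetime
    superseded_by: Optional[str] = None


def is_current(chunk: RetrievedChunk, now: datetime, max_age: timedelta) -> bool:
    """A chunk is current if nothing supersedes it and its review is recent."""
    return chunk.superseded_by is None and now - chunk.last_reviewed <= max_age


def freshness_sli(chunks: list[RetrievedChunk], now: datetime, max_age: timedelta) -> float:
    """Fraction of the retrieved context that is current."""
    if not chunks:
        return 1.0  # nothing stale was served; empty retrieval is a separate signal
    return sum(is_current(c, now, max_age) for c in chunks) / len(chunks)


now = datetime(2026, 4, 8)
chunks = [
    RetrievedChunk("policy-2022", datetime(2022, 3, 1)),    # review too old
    RetrievedChunk("policy-2023", datetime(2025, 11, 15)),  # current
    RetrievedChunk("faq-draft", datetime(2026, 1, 10), superseded_by="policy-2023"),
]
sli = freshness_sli(chunks, now, max_age=timedelta(days=365))
```

Tracked per query and aggregated over a window, a value like this can carry a target and an alert threshold the same way latency does.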

The Harness Layer

Between the model’s raw output and what the user sees sits a control layer — the guardrails, evaluators, validators, safety filters, and orchestration logic that constrain and verify model behavior. This layer is where reliability is enforced.

In practice, this includes faithfulness scoring (does the output contradict its source context?), citation verification (do cited sources actually support the claims?), confidence calibration (does the system communicate uncertainty when it should?), output validation gates (does the response meet formatting, safety, and quality thresholds before serving?), graceful degradation (does the system fail safely when context is insufficient?), and permission-aware filtering (does retrieval respect access controls?).
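Several of these checks fit into a single gate that runs before anything is served. The sketch below is one possible shape under stated assumptions: the `Draft` fields, the 0.8 threshold, and the fallback message are illustrative, and the faithfulness score is assumed to come from whatever evaluator the team already runs.

```python
from dataclasses import dataclass

ABSTAIN = "I don't have enough information to answer this reliably."


@dataclass
class Draft:
    answer: str
    faithfulness: float       # score in [0, 1] from an external evaluator
    cited_ids: list[str]      # sources the answer claims to rely on
    retrieved_ids: list[str]  # sources that were actually in context


def gate(draft: Draft, min_faithfulness: float = 0.8) -> tuple[str, list[str]]:
    """Return (text_to_serve, reasons). An empty reasons list means the draft passed."""
    reasons = []
    if not draft.retrieved_ids:
        reasons.append("no context retrieved")
    if draft.faithfulness < min_faithfulness:
        reasons.append("faithfulness below threshold")
    if not draft.cited_ids:
        reasons.append("no citations")
    if any(c not in draft.retrieved_ids for c in draft.cited_ids):
        reasons.append("cites sources that were not retrieved")
    # Graceful degradation: serve an explicit abstention instead of a shaky answer.
    return (draft.answer if not reasons else ABSTAIN), reasons


good = Draft("Policy X applies.", 0.93, ["doc-1"], ["doc-1", "doc-2"])
shaky = Draft("Policy Y applies.", 0.55, ["doc-9"], ["doc-1", "doc-2"])

served_good, why_good = gate(good)
served_shaky, why_shaky = gate(shaky)
```

Returning the reasons alongside the served text matters as much as the gate itself, because those reasons become the raw material for harness-level metrics.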

In the Claude Code ecosystem, practitioners are already building harness components intuitively — CLAUDE.md files that establish behavioral constraints, hooks that enforce validation at lifecycle events, skills that encode domain-specific guardrails, subagents that verify outputs. What hasn’t happened yet is treating these as components of a reliability discipline with measurable SLOs.

Two evolving layers

The two layers are complementary. Context without harness gives the model the right information but no way to catch when it uses that information wrong. Harness without context constrains a model that’s working with bad information to begin with. Reliable model behavior requires both.

What Behavioral SLOs Look Like

The most concrete contribution MRE makes is extending the SLO concept from infrastructure to model behavior. This isn’t fully developed yet — the right metrics and thresholds are still being discovered in practice — but the emerging shape looks something like this:

Correctness rate — the percentage of outputs that are factually accurate against source material. This requires automated evaluation plus regular human calibration, because purely automated scoring drifts. A team might set a 90% correctness SLO, with the understanding that measuring it is harder than measuring uptime and that the metric itself will evolve.

Faithfulness — how often the model’s response stays grounded in its provided context versus fabricating beyond it. RAGAS, TruLens, and similar tools provide automated scoring here. A faithfulness SLO sets a floor: below this threshold, the system is considered unreliable for its use case.

Abstention accuracy — how often the model correctly identifies when it lacks sufficient information to answer, rather than fabricating a plausible response. This is arguably the most important behavioral SLO for high-stakes applications. A system that says “I don’t have enough information to answer this reliably” when it genuinely doesn’t is more reliable than a system that always produces an answer.

Consistency — given the same question and context, how stable are the model’s answers across repeated queries? Non-determinism is inherent in LLMs, but the factual content of answers to the same question should be stable even if the wording varies. Inconsistency often indicates that the model is uncertain and resolving that uncertainty differently on each pass.

Safety compliance — the rate at which outputs pass content safety, policy compliance, and domain-specific filters. What constitutes “safety” is domain-dependent: a medical system has different safety thresholds than a creative writing assistant.

These aren’t meant as a definitive list. They’re the SLOs that keep showing up across teams doing this work. The right behavioral SLOs for a specific system depend on the domain, the risk tolerance, and the user population. What matters is that they exist at all — that model behavior is treated as a measurable, monitorable dimension with explicit quality targets.
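Three of these SLOs can be computed from the same reviewed sample. The sketch below shows one possible operationalization, not a standard: the record fields are hypothetical, the judgments are assumed to come from human or calibrated automated review, and consistency is approximated by comparing a normalized key fact across repeated runs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalRecord:
    question_id: str
    answerable: bool  # reviewer judgment: did the context suffice?
    abstained: bool
    correct: bool     # reviewer judgment; ignored when the system abstained
    key_fact: str     # normalized fact extracted from the answer


def correctness_rate(records: list[EvalRecord]) -> float:
    answered = [r for r in records if not r.abstained]
    return sum(r.correct for r in answered) / len(answered) if answered else 0.0


def abstention_accuracy(records: list[EvalRecord]) -> float:
    """Share of records where the answer/abstain decision matched answerability."""
    return sum(r.abstained != r.answerable for r in records) / len(records)


def consistency(records: list[EvalRecord]) -> float:
    """Share of questions whose repeated runs all produced the same key fact."""
    facts = defaultdict(set)
    for r in records:
        if not r.abstained:
            facts[r.question_id].add(r.key_fact)
    return sum(len(f) == 1 for f in facts.values()) / len(facts) if facts else 0.0


# Made-up review sample: q1 and q4 were each asked twice.
records = [
    EvalRecord("q1", True, False, True, "A"),
    EvalRecord("q1", True, False, False, "B"),
    EvalRecord("q2", False, True, False, ""),
    EvalRecord("q3", False, False, False, "C"),
    EvalRecord("q4", True, False, True, "D"),
    EvalRecord("q4", True, False, True, "D"),
]
slis = {
    "correctness": correctness_rate(records),
    "abstention_accuracy": abstention_accuracy(records),
    "consistency": consistency(records),
}
```

Note that q3 hurts two metrics at once: the system answered when it should have abstained, and the answer was wrong.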

Incident Response for Model Behavior

One of the clearest signs that a reliability gap exists is how organizations handle model misbehavior today. When infrastructure goes down, SRE has a well-defined incident response practice: detection, triage, response, postmortem, prevention. When a model generates a harmful or incorrect output, most organizations have... nothing. A user complains. Someone files a ticket. Eventually, someone looks at the logs. Maybe the prompt gets tweaked.

The same rigor can be applied to model behavior:

Detection should be automated. Faithfulness scoring, retrieval quality monitoring, and adversarial probing should catch behavioral degradation before users do. A drop in faithfulness scores below the SLO threshold is an incident — not a metric to review next sprint.

Triage matters because not all model failures are equal. A hallucination in a casual Q&A session has different severity than a hallucination in a compliance response. Incident classification needs domain-specific severity frameworks.

Postmortems should be blameless and systemic. Why did the model produce this output? Was it a context failure (wrong documents retrieved), a generation failure (model misinterpreted correct context), a harness failure (validation should have caught this but didn’t), or a coverage failure (the knowledge base lacked the needed information)? Each root cause points to a different remediation.

Incident Response for Model Behavior

Error budgets are the mechanism that makes behavioral SLOs operational rather than aspirational. If your correctness SLO is 92% and you’ve burned through your error budget this month, the team shifts from building new features to improving reliability — the same trade-off SRE pioneered for infrastructure.
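The arithmetic behind a behavioral error budget is the same as for uptime. A minimal sketch using the 92% correctness SLO from the example above; the sample size and failure count are invented.

```python
def error_budget(slo: float, evaluated: int, incorrect: int) -> dict[str, float]:
    """Compare observed failures against what the SLO tolerates in this window."""
    allowed = (1.0 - slo) * evaluated  # failures the SLO permits
    return {
        "allowed": allowed,
        "used": float(incorrect),
        "remaining": allowed - incorrect,
        "burn": incorrect / allowed if allowed else float("inf"),
    }


# Hypothetical month: 5,000 evaluated outputs, 430 judged incorrect.
budget = error_budget(slo=0.92, evaluated=5000, incorrect=430)
freeze_features = budget["remaining"] <= 0
```

With 400 tolerated failures and 430 observed, the budget is overspent by about 30, which is the signal to stop shipping features and work on reliability.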

RAG as the Primary Proving Ground

If this discipline needs a place to prove its value, RAG is it. RAG is the most widely deployed LLM architecture in production, and it’s where model behavior reliability challenges are most visible and most painful.

RAG systems have at least ten well-documented failure modes: the seven failure points cataloged by Barnett et al. (2024), plus others surfaced by production experience since. Every one of them is a model behavior reliability problem that doesn’t appear on an infrastructure dashboard: stale retrievals, bad chunking, missed context, context overload and the “lost in the middle” effect, unfaithful extraction, security leaks through retrieval, embedding drift, retrieval-generation timing failures, scattered evidence synthesis failures, and the model answering when it should abstain.

The evolution of RAG architectures — from naive single-shot retrieval through advanced hybrid retrieval, self-correcting RAG (Self-RAG, Corrective RAG), and now agentic RAG with autonomous retrieval planning — can itself be understood as an evolution toward greater model behavior reliability. Each generation added mechanisms to detect and recover from failure modes the previous generation couldn’t handle. Self-RAG taught models to judge whether they need to retrieve at all. Corrective RAG added evaluators that score document relevance before generation. Agentic RAG introduced multi-step planning, self-correction loops, and dynamic tool selection.

These advances happened organically, driven by practitioners hitting reliability walls. A model reliability framework provides a way to understand where on the reliability spectrum a system sits and what needs to happen to improve it — turning ad-hoc iteration into systematic engineering.
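The relevance-gating idea behind Corrective RAG can be shown in a few lines. This is a simplified illustration and not the method from the paper: a token-overlap score stands in for a trained retrieval evaluator, and the "fallback" action is a placeholder for whatever corrective step a real system would take. The threshold and sample passages are assumptions.

```python
def overlap_score(query: str, passage: str) -> float:
    """Stand-in relevance score: fraction of query terms that appear in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def corrective_filter(query: str, passages: list[str], keep_at: float = 0.5):
    """Score passages before generation; fall back if none are relevant enough."""
    kept = [p for p in passages if overlap_score(query, p) >= keep_at]
    action = "generate" if kept else "fallback"
    return action, kept


passages = [
    "the remote work policy covers eligibility rules",
    "cafeteria menu for friday",
]
action_a, kept_a = corrective_filter("remote work eligibility policy", passages)
action_b, kept_b = corrective_filter("parental leave duration", passages)
```

The second query is the interesting one: retrieval returned documents, but none cleared the bar, so generation never sees them.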

How This Relates to What Exists

MRE isn’t replacing anything. It’s filling a gap between things that already exist and work well at what they do.

The relationship to SRE is generational. SRE was created because software systems became too complex for traditional operations practices. This discipline is forming because AI systems are too complex for traditional software reliability practices. SRE’s operational philosophy — SLOs, error budgets, blameless postmortems, the principle that reliability is a feature — transfers directly. What changes is the object of measurement: from system behavior (latency, availability, error rates) to model behavior (correctness, faithfulness, appropriate abstention).

The relationship to MLOps is complementary. MLOps handles the lifecycle — getting models from development to production and keeping them updated. Model behavior reliability handles the runtime — ensuring that what the model does in production meets quality standards. A mature AI organization needs both, the same way a mature software organization needs both CI/CD and production monitoring.

The relationship to AI Safety is layered. AI safety establishes the foundation: models that are aligned, harmless, and honest at training time. Model behavior reliability builds on that foundation for specific deployment contexts: ensuring that a generally safe model behaves reliably in this application, with this data, for these users. A model can be well-aligned and still produce unreliable outputs when deployed in a context its training didn’t anticipate.

What’s Still Unknown

Honesty requires acknowledging what isn’t figured out yet. This discipline is early. Several hard problems remain open:

Measuring correctness at scale is hard. Unlike infrastructure metrics that can be computed from logs, output correctness often requires domain expertise to evaluate. Automated faithfulness scoring is getting better (RAGAS, TruLens, LLM-as-judge approaches), but these tools measure consistency with context, not truth. A model that faithfully reproduces information from a wrong document scores high on faithfulness and low on correctness. Bridging this gap requires human calibration, golden datasets, and evaluation frameworks that aren’t mature yet.

Setting the right thresholds is domain-specific. What correctness rate is acceptable? 95% for a customer support bot might be fine. 95% for a medical decision support system might be catastrophic. The thresholds need to come from domain expertise and risk analysis, not from engineering defaults. The framework can provide the structure, but it can’t prescribe universal thresholds.

Non-determinism complicates everything. LLMs are inherently probabilistic. The same input can produce different outputs on consecutive calls. This makes behavioral SLOs fundamentally different from infrastructure SLOs, where the same request should always produce the same response. Model reliability has to reason about distributions of behavior, not individual outputs — and the statistical tools for this are still developing.

The boundary with prompt engineering is fuzzy. Is improving a system prompt to reduce hallucinations a reliability activity or a development activity? Probably both, depending on context. The discipline’s boundaries will sharpen through practice, not through definitional fiat.

The tooling is immature. The evaluation tools that exist — RAGAS, TruLens, custom LLM-as-judge pipelines — are first-generation. They work but require significant integration effort, produce metrics that need calibration, and don’t yet connect to the kind of operational dashboards that SRE teams take for granted. This will improve, but it’s a real limitation right now.

These unknowns aren’t reasons to wait. SRE had plenty of open questions in its early years too. The discipline formed through practice, with refinements accumulating as more teams adopted and adapted the core ideas. MRE will likely follow the same path.
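On the point about reasoning over distributions, one tool that already exists is a confidence interval on the measured rate. The sketch below uses the Wilson score interval for a binomial proportion; the sample of 200 reviewed outputs is hypothetical, and treating each output as an independent trial is an assumption.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one evaluated output")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical review: 184 of 200 sampled outputs judged correct (92%).
low, high = wilson_interval(184, 200)
```

A measured 92% on 200 samples is compatible with a true rate anywhere from roughly 0.87 to 0.95, so a single sample of that size cannot distinguish a system that meets a 92% SLO from one that misses it by several points.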

An Invitation, Not a Manifesto

If this framing resonates, the most useful thing that can happen is for practitioners to pressure-test it against their own experience. The questions worth asking:

Does the gap described here match what you see in your organization? Is there a team or role that owns model behavior reliability, or does it fall between the cracks?

Are the two layers — context reliability and harness reliability — the right decomposition, or is there a third layer missing?

Which behavioral SLOs matter most in your domain, and how are you measuring them today (if at all)?

What failure modes have you encountered that don’t fit neatly into the categories described here?

The discipline will be shaped by the practitioners who adopt and adapt it, not by any single definition. What’s offered here is a starting point — a way to talk about a problem that many teams are experiencing but that doesn’t yet have a shared vocabulary. If naming it helps teams think more clearly about it, build better systems around it, and hold themselves to higher standards for what their AI systems deliver to users, then the name is doing its job.

The infrastructure reliability problem is largely solved. The model behavior reliability problem is wide open. This is how we start closing it.

References: Lewis et al. (2020), “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” Meta AI. Barnett et al. (2024), “Seven Failure Points When Engineering a RAG System.” Asai et al. (2024), “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection,” ICLR 2024. Yan et al. (2024), “Corrective Retrieval Augmented Generation.” Chen, Murphy, Parisa, Sculley & Underwood (2022), “Reliable Machine Learning,” O’Reilly. Sculley et al. (2015), “Hidden Technical Debt in Machine Learning Systems,” NeurIPS. Singh et al. (2025), “A Survey on Agentic RAG.” Microsoft Research (2024), “GraphRAG.” Hummer & Muthusamy (2018), “ModelOps,” IBM Research.