State Machines for AI Agents: A field guide from Forward-Deployed Engineering
Agents reason. State machines remember.
A multi-step agent that keeps its progress in the conversation will eventually lose track of where it is. A state machine moves that progress into explicit state outside the model and enforces the order of steps in code. This guide covers when it earns its place and when it is overkill, what it fixes, where it stops, and the patterns you pair it with, drawn from what the work teaches once an agent is carrying real load in production.
When an AI agent loses track of where it is
Picture a support agent handling a refund. Call it Refund-Bot.
The job has five steps:
verify the customer,
pull the order,
confirm the item is returnable,
issue the refund, and
send the confirmation.
In testing it runs clean every time.
In production, a customer messages mid-conversation to add a second item. Refund-Bot has been deciding each next step by re-reading the whole conversation, and the conversation just got longer and messier. It re-reads, decides it has not issued the refund yet, and issues it.
Except it already had, four messages ago. The customer gets refunded twice. Nothing crashed. No error fired. The model simply lost track of which steps it had already completed, because the only record of that was buried in a chat transcript it has to re-interpret on every turn.
The model is fine.
The reasoning is fine.
What breaks is where the agent keeps its progress: in the conversation, which is a terrible place to keep it. It is unstructured. It gets summarized and truncated as it grows. It does not survive a restart. So the agent forgets what it did, repeats steps, picks up values that have gone stale, and double-refunds a customer at scale.
A state machine addresses this directly.
You stop letting the model infer the next step from the transcript. You define the five steps and the legal moves between them, and you keep the progress (which step is done, what each one produced) in explicit state outside the model.
Refund-Bot cannot issue a second refund because the graph will not allow a second transition into the refund step. Order is enforced by code, not by the model remembering.
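A minimal sketch of that idea for Refund-Bot, in plain Python with no framework. The names `Step`, `RefundFlow`, and `IllegalTransition` are illustrative, not from any library, and the flow is assumed to be strictly linear.

```python
from enum import Enum


class Step(Enum):
    VERIFY_CUSTOMER = "verify_customer"
    PULL_ORDER = "pull_order"
    CHECK_RETURNABLE = "check_returnable"
    ISSUE_REFUND = "issue_refund"
    SEND_CONFIRMATION = "send_confirmation"
    DONE = "done"


# Legal moves: each step may only advance to the one listed here.
TRANSITIONS = {
    Step.VERIFY_CUSTOMER: Step.PULL_ORDER,
    Step.PULL_ORDER: Step.CHECK_RETURNABLE,
    Step.CHECK_RETURNABLE: Step.ISSUE_REFUND,
    Step.ISSUE_REFUND: Step.SEND_CONFIRMATION,
    Step.SEND_CONFIRMATION: Step.DONE,
}


class IllegalTransition(Exception):
    """Raised when the proposed step is not the one the flow is waiting on."""


class RefundFlow:
    """Progress lives here, not in the chat transcript."""

    def __init__(self):
        self.current = Step.VERIFY_CUSTOMER
        self.results = {}  # what each completed step produced

    def complete(self, step, result):
        # The model proposes a step; code decides whether it is legal.
        if step not in TRANSITIONS or step is not self.current:
            raise IllegalTransition(
                f"cannot run {step.value}; current step is {self.current.value}"
            )
        self.results[step] = result
        self.current = TRANSITIONS[step]
```

Once `ISSUE_REFUND` has completed, the flow is waiting on `SEND_CONFIRMATION`, so a second `complete(Step.ISSUE_REFUND, ...)` raises `IllegalTransition` however the model reads the transcript.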
What a state machine does not fix is whether Refund-Bot pulled the right order in the first place. It can fetch the wrong customer’s record, read the wrong total off an invoice, or call the refund API with the right shape and the wrong number, and the state machine waves all of it through, because the move was legal. Sequencing and correctness are two different problems. The state machine owns the first one.
The second is still yours to solve, and most of the trouble with production agents comes from assuming the structure handles both.
This is the kind of problem forward-deployed work is made of.
You take a model into a customer’s real environment and own whether it holds up once it is there, past the demo, past the happy path, into the production reality where Refund-Bot double-refunds.
The state machine is one of the first tools you reach for, so it is the right place to start a field guide: what it does, when it earns its complexity, where it stops, and what you build around it.
Why multi-step agents fail in production
The double refund is one shape.
In production the same root cause, progress kept in the conversation, shows up in a handful of recognizable ways. The agent repeats a step because nothing recorded it was done. It carries a value from step one into step five after that value went stale. Two parts of a flow run at once and overwrite each other’s state. The process crashes at step seven and restarts at step one, because the work so far lived only in memory.
A team running a deep-research agent in production reported this exact set: race conditions, stale state, and agents getting stuck with no clear report of where they were.
This is not rare. The first large-scale study of agents in production found teams keeping agents short and supervised on purpose: 68 percent run at most ten steps before handing off to a human, because every additional unsupervised step is another chance to lose the thread.
The narrow, solvable problem underneath all of it: keep an agent’s place in a multi-step job, across crashes and restarts, without trusting the model to remember. That is the job a state machine is built for.
How a state machine fixes it: explicit steps and state
A state machine defines the agent’s world in advance.
You lay out the steps as states, and the legal moves between them as transitions. The model decides what to do within a step. The state machine decides what steps are even possible. An agent inside a well-formed state machine cannot skip an approval, cannot call a tool the current state forbids, and cannot jump to a step the structure does not allow. Illegal actions are not discouraged by a prompt. They are rejected by the architecture.
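One way to picture "a tool the current state forbids" is a per-state allowlist checked in code before any tool runs. This is a sketch under assumptions: the state names and tool names below are hypothetical, and `dispatch` is not part of any framework.

```python
# Hypothetical tool names; each state lists the only tools it may call.
ALLOWED_TOOLS = {
    "verify_customer": {"lookup_customer"},
    "pull_order": {"fetch_order"},
    "issue_refund": {"create_refund"},
}


def dispatch(state, tool_name, tools, **kwargs):
    """Run a tool the model asked for, but only if the current state permits it."""
    allowed = ALLOWED_TOOLS.get(state, set())
    if tool_name not in allowed:
        # Rejected in code; the prompt never gets a vote.
        return {
            "ok": False,
            "error": "tool_not_allowed",
            "state": state,
            "allowed": sorted(allowed),
        }
    return {"ok": True, "result": tools[tool_name](**kwargs)}
```

The model can still ask for `create_refund` while the flow is in `pull_order`; the request just comes back as a refusal instead of a refund.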
The idea is to hold the agent to a defined process before its output reaches anyone. LangGraph is the most widely used tool for this, with production users including Uber, LinkedIn, and Replit. Its core move: lay the agent out as an explicit graph of states and transitions instead of a free-running loop.
The state machine gives you two things a bare loop does not.
The first is a defined path. The graph spells out what can happen and when. Every transition is declared up front, so the agent cannot take a step you did not lay out. An approval gate cannot be skipped, because the structure will not move past it until the gate is satisfied.
The second is saved state.
Because the agent’s state is explicit and stored outside the run, it can be checkpointed: written down at each step so a crash does not lose the work so far.
A checkpoint is just a save point. It lets an agent pause and resume later, but it doesn’t guarantee the work will finish.
When an agent resumes, most frameworks restart the interrupted step from the beginning. That means any model calls or tool calls in that step may run again. If those actions aren’t safe to repeat, resuming can create new bugs.
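A common way to make a replayed step safe is an idempotency key derived from the run and the step, so a second attempt reuses the first result. The sketch below assumes the downstream API honors such keys; `RefundApi` is a stand-in written for this example, not a real payment client.

```python
class RefundApi:
    """Stand-in for a payment API that honors idempotency keys (hypothetical)."""

    def __init__(self):
        self.by_key = {}
        self.calls_executed = 0

    def create_refund(self, idempotency_key, order_id, amount_cents):
        # A repeated key returns the original receipt instead of paying twice.
        if idempotency_key in self.by_key:
            return self.by_key[idempotency_key]
        self.calls_executed += 1
        receipt = {
            "refund_id": f"r-{self.calls_executed}",
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self.by_key[idempotency_key] = receipt
        return receipt


def issue_refund_step(api, run_id, order_id, amount_cents):
    # The key depends only on the run and the step, so a replay reuses it.
    key = f"{run_id}:issue_refund"
    return api.create_refund(key, order_id, amount_cents)
```

If the process dies after the refund call but before the checkpoint is written, the resumed step calls `issue_refund_step` again with the same `run_id` and gets the original receipt back.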
Guaranteeing that a workflow survives crashes and eventually completes is a different problem. That’s what durable execution systems such as Temporal are designed for. They manage the workflow lifecycle and recover from process failures automatically.
In production, teams often use both: the state machine defines what should happen next, while the durable execution layer makes sure the workflow keeps running even if the system crashes.
A defined path solves the repeat-a-step and out-of-order problems: the agent cannot do step five before step four, or do step two twice, because the graph will not allow the move. Saved state solves the stale-value problem and, with a real durable layer underneath, the crash problem: progress lives outside the run, so a restart resumes instead of starting over. Together they fix the lose-the-thread bug the last section described. They also introduce new tradeoffs, which is where most of the real decisions live.
When you actually need a state machine, and when it is overkill
The honest answer is that most agents do not need one, and reaching for it too early is its own failure. The clearest decision rule is: start with a workflow, and add the state machine only when the problem forces it.
If your application is just prompt → tool → response, you don’t need a state machine. You probably don’t need one when there are only a few steps, no branching, and it’s easy to restart if something fails.
The same goes for many so-called “agents” that are really just structured extraction or classification tasks.
In those cases, adding state machines, checkpoints, and orchestration layers creates more complexity than value. You’re paying the operational cost without getting much in return.
It becomes necessary at a specific and recognizable wall. When two or more steps have to coordinate, hand off state, recover from failure, and pause for human approval, the chain abstraction stops holding and the if-statements start multiplying. That is the moment the state machine earns its cost. The other forcing function is duration: a workflow that runs long enough to be interrupted (by a crash, a deploy, an expired session) needs durable state, and durable state is most of what a state machine gives you. The test is not how smart the task is. It is whether losing the work halfway through is expensive.
There is a compounding-math reason the wall is real and not a matter of taste. A pure chain of model calls multiplies its per-step reliability: even at 99 percent per step, a ten-step process succeeds about 90 percent of the time, and the success rate keeps falling with every step added. The state machine does not fix the model’s per-step error rate, but it stops a single failed step from silently corrupting everything downstream, by making the failure an explicit state you can catch, retry, or escalate rather than a wrong value that flows on unnoticed.
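The arithmetic behind that figure, as a short sketch. It assumes the steps fail independently, which is a simplification; `chain_success_rate` is a name made up for this example.

```python
def chain_success_rate(per_step, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


# Ten steps at 99 percent each.
print(round(chain_success_rate(0.99, 10), 3))  # 0.904
```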
One agent or many: the decision that costs the most
A separate architectural choice sits right next to the state-machine one, and getting it wrong is more expensive: whether to split the work across multiple agents.
The strongest case against multiple agents comes from a team that builds coding agents for a living. Their argument is that the moment two agents work in parallel on the same task, you have to share the full context and the full trace between them, not just passed messages, or they make conflicting decisions that surface as broken output. (Our monthly meetup group in Boston fought over this.)
Every action an agent takes carries implicit decisions, and two agents acting at once carry decisions that quietly contradict each other.
Their rule of thumb: keep the writes single-threaded. Let one agent own the actions, and the work stays coherent.
The strongest case for multiple agents comes from the opposite corner. An orchestrator-plus-workers design, one lead agent fanning out to subagents that each work an isolated slice, beat a single agent on a research evaluation by a wide margin in one reported internal test. The catch, stated by the same team: it burned roughly fifteen times the tokens of a normal chat, and that token budget alone explained most of the performance gain. They were also explicit about the limit: tasks where every agent needs the same context, or where the steps depend heavily on each other, are a bad fit for multiple agents.
Multiple agents win only when the task genuinely splits into independent parallel pieces that do not need to share state, and only when you can afford the token multiple.
The instant the agents have to coordinate writes or share context, the coordination failures cost more than the parallelism buys.
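A sketch of the shape both camps can live with: workers read independent slices in parallel, and only the orchestrator writes. `research_slice` is a placeholder for a subagent call, and the thread pool is just one way to fan out; none of this is taken from either team's system.

```python
from concurrent.futures import ThreadPoolExecutor


def research_slice(topic):
    """Worker: reads an isolated slice and returns findings. It never writes shared state."""
    return {"topic": topic, "summary": f"notes on {topic}"}


def run_orchestrator(topics):
    # Fan out: workers run in parallel because their slices are independent.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(research_slice, topics))

    # Fan in: only the orchestrator writes, one finding at a time.
    report = {}
    for finding in findings:
        report[finding["topic"]] = finding["summary"]
    return report
```

If a worker ever needs to see what another worker decided, the slices are not independent, and the single-agent path is the safer default.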
What state-machine agents still get wrong in production
The failure modes that actually show up in production are mundane and they are about state, not intelligence.
The recurring list from teams running these systems: a process crashes mid-workflow and has to re-run from the start because nothing was checkpointed; in-memory state is lost on restart or deploy because the default checkpointer was never swapped for a durable one; agents get stuck without clear reporting; and concurrent steps race each other and leave state stale.
Source: Temporal-based architecture
A concrete, documented case makes this real. Grid Dynamics built a deep-research agent for a Fortune 500 manufacturer that searches across internal databases, shared drives, and repositories, and falls back to the open web with citations when internal data comes up short. Their initial architecture paired a state-machine orchestration layer with a separate store for persistence.
Their own account of what happened next: the system was powerful in concept but brittle in practice, hit an endless stream of race conditions, stale state, and agents getting stuck without clear reporting, and became extremely costly to support with no clear path to reducing that burden.
Their fix was architectural: they moved durability and retry into the orchestration layer itself, so state passed directly between steps instead of being fetched from an external key on every step.
The lesson in their words is that almost every real agent needs the same three things: intelligent state management, the ability to retry a failed step without restarting the whole pipeline, and an architecture that scales.
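To make "retry a failed step without restarting the whole pipeline" concrete, here is a small runner that passes state directly from step to step and retries only the step that failed. It is an illustration of the idea, not Grid Dynamics' implementation, and `run_pipeline` is a hypothetical name.

```python
def run_pipeline(steps, state, max_attempts=3):
    """Run steps in order, retrying only the step that failed."""
    attempts = {}
    for name, fn in steps:
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                # Each step receives the previous state and returns the next one.
                state = fn(state)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    raise RuntimeError(
                        f"step {name!r} failed after {attempt} attempts"
                    ) from exc
    return state, attempts
```

Steps here return a new state instead of mutating the old one, so a step that fails halfway leaves the input to its retry untouched.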
A second case shows the same lesson at a different scale.
Replit launched its coding agent on custom orchestration, then moved it onto a durable-execution engine within a couple of months. The reason was the user experience of failure: an agent that got deep into a task and hit a fatal error lost everything, which is unacceptable when a user has been waiting on a long build. After the move, each agent ran as its own durable workflow, and a cloud-provider degradation that would have caused an incident was absorbed by the durable layer instead.
There is research underneath these anecdotes.
A Berkeley study that hand-annotated more than two hundred traces of failing multi-agent runs sorted the failures into fourteen modes across three buckets: bad system design (including agents repeating steps, losing the conversation history, and not recognizing when to stop), agents misaligned with each other, and missing verification of the work.
Its blunt conclusion is the one that should shape how you build: a better base model will not fix most of these, because they are failures of structure and verification, not of intelligence.
The pattern across all of these is the same: the failure is about lost state, not bad reasoning. The state machine targets lost state, which is why it earns its complexity on a long-running production flow and feels like dead weight on a three-step script.
Who runs state-machine agents in production (LangGraph, Temporal)
The deployed pattern has settled into recognizable layers, and naming who sits where is more useful than another framework.
The state-machine orchestration layer has one widely used tool, LangGraph, which models the agent as an explicit graph of nodes and edges rather than a prompt loop.
It is the layer most teams reach for when they outgrow a chain. But it is not the only place this reasoning lives, and that matters for anyone building on a model provider’s own SDK.
The OpenAI Agents SDK ships sessions, handoffs between agents, and guardrails.
The Claude Agent SDK ships sessions, file-based memory, and context compaction.
Both give you state and control-flow primitives in the agent loop.
Neither gives you crash-proof durable execution on its own, which is why OpenAI’s own Codex agent runs on a separate durable-execution layer underneath. The point is not which library you pick. It is that the same reasoning (explicit state, enforced order, a durable layer when a lost run is expensive) applies whether you are in LangGraph, an SDK, or hand-rolled code.
Underneath, for systems where losing a run is unacceptable, sits a durable-execution layer. The general-purpose durable engines in this tier are battle-tested outside AI first, at companies like Netflix, Stripe, and Snap for ordinary backend workflows, and are now being pulled under agent orchestration.
The emerging production standard for serious systems is a two-layer split: the durable engine handles macro-orchestration and guaranteed completion, while the state-machine layer handles the micro-level agent logic. The Grid Dynamics migration landed on this split.
There is a real cost-and-complexity tension in the stack, and practitioners say it plainly: the heavyweight durable engines can feel like overkill for AI workflows because of their infrastructure overhead, which is why a lighter tier of durable-execution tools aimed at the agent-as-code level has appeared to fill the gap. The choice between them is the same decision rule as before, applied one layer down: take the heavier guarantee only when a lost or duplicated run has real business cost.
One caution that the production reports surface repeatedly: the default in-memory state is a trap. Teams that ship with the non-durable checkpointer in production lose state on restart or deploy, which is the kind of failure that looks fine in every test and only appears the first time a real deploy interrupts a live run.
The state machine gives you durability as an option. It does not force you to turn it on, and forgetting to is a common production wound.
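The difference between the two defaults is easy to see with the standard library alone. The sketch below is a tiny checkpoint store backed by SQLite; it is not LangGraph's checkpointer API, and `SqliteCheckpoints` and `survives_restart` are names invented for this example.

```python
import json
import os
import sqlite3
import tempfile


class SqliteCheckpoints:
    """Tiny durable checkpoint store: one row per run, overwritten at each step."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(run_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, run_id, state):
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints (run_id, state) VALUES (?, ?)",
            (run_id, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, run_id):
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE run_id = ?", (run_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


def survives_restart():
    """Save a checkpoint, drop the connection, and read it back from a fresh one."""
    path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")
    first = SqliteCheckpoints(path)
    first.save("run-1", {"step": "issue_refund", "done": ["verify_customer"]})
    first.conn.close()  # stands in for the process going away
    second = SqliteCheckpoints(path)  # a fresh process opening the same file
    return second.load("run-1")
```

A plain dict holding the same state would pass every test that runs inside one process and come back empty after the first deploy.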
The benchmark that changes how you choose a model for agents
One benchmark result shows how much the structure carries, and it should change how teams spend their model budget.
A reproducible benchmark ran eight different models through the same business workflow, an invoice approval, with a state machine enforcing the legal transitions. The outcome inverts the usual logic of picking the best model you can afford. Seven of the eight models scored a perfect pass rate. The models did not separate on correctness at all. They separated on cost, and the spread was roughly thirtyfold, with the cheapest model matching the most expensive on the actual task.
The reason is the structure. The state machine rejected every illegal move and handed back a structured error each time, so the model corrected course because the environment forced it to, not on its own. The correctness lived in the architecture, not the model.
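A sketch of that feedback loop for the invoice flow. The `policy` argument stands in for the model; the error fields and function names are assumptions made for illustration and are not taken from the benchmark.

```python
LEGAL = {"draft": {"submit"}, "submitted": {"approve"}, "approved": set()}
NEXT_STATE = {"submit": "submitted", "approve": "approved"}


def apply_move(state, move):
    """Return the new state, or a structured error the model can read and act on."""
    if move not in LEGAL[state]:
        error = {
            "error": "illegal_move",
            "state": state,
            "attempted": move,
            "legal_moves": sorted(LEGAL[state]),
        }
        return state, error
    return NEXT_STATE[move], None


def run_invoice(policy, state="draft", max_turns=10):
    """`policy` stands in for the model: it sees the state and the last error."""
    error, rejected = None, 0
    for _ in range(max_turns):
        if not LEGAL[state]:
            break  # terminal state reached
        state, error = apply_move(state, policy(state, error))
        if error is not None:
            rejected += 1
    return state, rejected
```

A policy that always tries `approve` first gets rejected once in `draft`, reads `legal_moves` from the error, submits, and then approves. The run ends in `approved` because the environment refused the shortcut, not because the policy planned well.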
The planning takeaway is concrete: when the structure defines what counts as a legal move, the choice of model stops being the thing that decides whether the process holds. On a tightly constrained task, the expensive model is buying headroom the structure already covers. Spend the model budget where the task is genuinely open-ended, not where the path is already pinned down.
That is also where most analyses stop. The harder and more useful question is what this architecture still cannot do.
The limit of state machines: sequence, not correctness
A state machine enforces which step runs and in what order. It does not validate the data each step produces.
A clean run and a correct result are not the same thing, and the gap between them is where the next class of production bug lives. Scary, isn’t it?
Return to the invoice workflow that scored perfectly. The state machine kept the invoice moving from draft → submitted → approved along a defined sequence, every gate respected, every step logged.
It did nothing about whether the invoice was for the right vendor, in the right amount, read correctly from the right document.
An agent that misreads a purchase order and creates an invoice for the wrong sum will march that wrong invoice through a flawless, fully audited approval. The harness records a clean run. The business takes a loss. The path was legal. The content was wrong.
The ICML research names this directly. The same production study that found teams keeping agents short and supervised also found reliability is the top development challenge, driven by the difficulty of ensuring and evaluating correctness. Sequencing is largely a solved problem now. Checking that each step did the right thing is not.
The gap shows up in three specific places.
The first is the ungoverned input. Everything upstream of the first transition (reading the prompt, extracting fields from a document, deciding which entity a request refers to) happens in free model space before any gate exists. The hallucination that produces a bad input has already happened by the time the state machine sees it. The harness is a clean gate installed on a river of unknown quality, and it certifies what passes without inspecting what the water carries.
The second is compounding error. A per-step error rate that looks fine alone adds up fast across a long chain, because the run only works if every step does. At 95 percent per step, fourteen steps is close to a coin flip. The state machine keeps each transition legal, but it does not stop a small per-step error rate from stacking into a likely per-run failure. Longer runs widen the gap between a legal path and a correct outcome.
The third is the contained-but-not-corrected problem. A useful framing circulating among production teams is to treat the model as an unsafe component inside a deterministic harness. That posture is correct, and it also names the limit: the harness contains the damage a wrong output can do, but containment is not correction. A blocked bad action is good. A bad action that is legal, and therefore not blocked, still ships.
So the tradeoff is clear. A state machine buys you a controlled path and durable state. It does not buy you correct content along that path. Closing that second gap is separate work (content checks, verification steps, human review at the points that matter), and a state machine does not do it for you.
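What a content check next to a transition check can look like, as a sketch. It assumes a purchase order is available as the source of truth; the field names and the `needs_review` status are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PurchaseOrder:
    vendor_id: str
    total_cents: int


@dataclass(frozen=True)
class Invoice:
    vendor_id: str
    total_cents: int
    status: str = "draft"


def content_problems(invoice, po):
    """Compare extracted invoice fields against the purchase order."""
    problems = []
    if invoice.vendor_id != po.vendor_id:
        problems.append("vendor_mismatch")
    if invoice.total_cents != po.total_cents:
        problems.append("total_mismatch")
    return problems


def submit(invoice, po):
    # Sequence check: the move draft -> submitted must be legal.
    if invoice.status != "draft":
        raise ValueError(f"cannot submit from {invoice.status}")
    # Content check: separate work the transition table does not do.
    problems = content_problems(invoice, po)
    if problems:
        return replace(invoice, status="needs_review"), problems
    return replace(invoice, status="submitted"), []
```

Without the second check, an invoice with an extra zero in `total_cents` passes `submit` cleanly, because `draft` to `submitted` is a legal move either way.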
How to evaluate an agent harness before you trust it
Knowing the boundary exists is not enough.
The practical skill is reading a specific setup and finding where its control runs out. Three checks do most of the work.
Map where content flows ungoverned.
Walk the state graph and mark every stretch where data moves between gates without a content check, especially the extraction step before the first transition. That is where the wrong invoice is born, and it is invisible on an architecture diagram that only shows transitions.
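That walk can be mechanized once each edge is annotated with the content checks that run on it. The graph below is a made-up invoice example and the check name is hypothetical; the point is only that an empty list is the thing to look for.

```python
def ungoverned_edges(edges):
    """Return the transitions that carry data forward with no content check attached."""
    return [(src, dst) for src, dst, checks in edges if not checks]


# Hypothetical graph: (from, to, content checks on the data crossing the edge)
INVOICE_GRAPH = [
    ("extract_fields", "draft", []),
    ("draft", "submitted", ["total_matches_po"]),
    ("submitted", "approved", []),
]

for src, dst in ungoverned_edges(INVOICE_GRAPH):
    print(f"no content check on {src} -> {dst}")
```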
Watch for the point where the agent stops being an agent. Add enough rules, validations, and approval steps, and you end up with a deterministic workflow that happens to call a model.
The benchmark where the cheapest model performed as well as the most expensive is a good signal. When the workflow tightly constrains every decision, the model is no longer doing much reasoning. It’s mostly filling in a template.
That may be the right engineering choice. But once you reach that point, it’s worth asking whether the model still belongs in the workflow at all.
Adding more guardrails beyond that often makes the system slower and more complex without making it more reliable.
Separate the model’s contribution from the harness’s. This is the measurement almost everyone skips, and it is the one that tells you the truth. Run the same agent twice, once with the full harness and once with the gates removed or weakened, and compare how often it passes every case on repeated runs. If reliability collapses without the harness, the harness is doing the work, which is the result you want.
If both versions perform the same, the model is doing the real work and the harness is just extra complexity.
Testing only the harnessed version can make any system look better. The real question is whether the harness improves results compared to running without it.
That comparison tells you whether the harness is solving a real problem or just adding maintenance overhead.
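A sketch of that comparison. It assumes you can call the agent with and without the gates as two functions that return pass or fail for a case; `all_pass_rate` and `harness_lift` are names invented here, and the seeded `rng` is passed through only so a simulated agent can be made repeatable.

```python
import random


def all_pass_rate(run_case, cases, repeats, seed=0):
    """Fraction of cases that pass on every one of `repeats` runs."""
    rng = random.Random(seed)
    passed_every_time = 0
    for case in cases:
        if all(run_case(case, rng) for _ in range(repeats)):
            passed_every_time += 1
    return passed_every_time / len(cases)


def harness_lift(with_harness, without_harness, cases, repeats=5, seed=0):
    """Positive lift means the harness is doing work; near zero means it is overhead."""
    gated = all_pass_rate(with_harness, cases, repeats, seed)
    bare = all_pass_rate(without_harness, cases, repeats, seed)
    return {"with_harness": gated, "without_harness": bare, "lift": gated - bare}
```

Counting a case as passed only when it passes on every repeat is deliberate: a case that passes four times out of five is exactly the kind of failure a single run hides.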
How to plan for model changes under your agent
A harness is built around a specific model’s weaknesses.
When the model changes, and it always does, the harness does not automatically still fit. This is the planning failure that catches teams off guard, and it is the difference between a one-time build and a system you can operate for years.
The cost of moving a harness from one model to a newer one is real and has three parts.
There is the orchestration tuned to the old model’s failure modes, the retry patterns and prompt scaffolding that were compensating for problems the new model may not have.
There is the schema that was stable on the old model and produces inconsistent shapes on the new one, breaking everything downstream.
And there is the audit and approval surface that was certified against the old model’s behavior and has to be re-certified for the new one. A harness is not free to carry forward. It depreciates as the model evolves underneath it.
A useful rule is to keep model-specific assumptions separate from the rest of the system. If your workflow is full of special cases for a particular model’s weaknesses, every model upgrade becomes a painful rewrite.
Instead, treat the harness as stable infrastructure and the model as a replaceable component. That way, most upgrades require minimal changes.
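One way to keep those assumptions separate is a per-model profile that the workflow reads but never branches around. The model names, fields, and values below are placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Everything the harness assumes about one model, kept in one place."""
    name: str
    max_retries: int           # compensates for flaky structured output
    needs_json_reminder: bool  # prompt scaffolding some models need


# Placeholder model names and settings.
PROFILES = {
    "model-old": ModelProfile("model-old", max_retries=3, needs_json_reminder=True),
    "model-new": ModelProfile("model-new", max_retries=1, needs_json_reminder=False),
}


def build_prompt(task, profile):
    # The workflow calls this; it never checks a model name itself.
    prompt = f"Task: {task}"
    if profile.needs_json_reminder:
        prompt += "\nRespond with a single JSON object."
    return prompt
```

Swapping models then means adding one profile and re-running the evaluation set, with the workflow code itself left alone.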
This matters because production agents are never finished. Models keep improving, and each new model changes what the harness needs to do.
Teams that plan for this from the start, by isolating model assumptions, maintaining evaluation sets, and measuring what the harness actually contributes, can upgrade models without rebuilding the entire system. Teams that don’t often end up starting over.
Frequently asked questions
What is a state-machine agent?
A state-machine agent is an AI agent whose available actions are limited by an explicit graph of states and transitions, so the model can only take actions the graph permits from its current state. Frameworks like LangGraph implement this by defining steps (nodes) and legal moves between them (edges) that compile into a workflow which rejects illegal actions outright.
Why do cheap AI models match expensive ones inside a state machine?
Because the state machine, not the model, enforces correctness on the constrained task. When illegal moves are structurally rejected and the model gets structured feedback on every attempt, the model’s job narrows to producing valid moves in a small space, which most models do equally well. A benchmark of eight models on the same workflow saw seven score perfectly, separating on cost rather than correctness, with the cheapest matching the most expensive.
What do state machines not handle for AI agents?
Content correctness. A state machine controls the path the agent takes through the process. It does not check whether the data moving along that path is right, so an agent that creates a wrong invoice will move that wrong invoice through a fully legal, fully audited approval. Extraction and interpretation before the first step also run unchecked.
Is LangGraph or Temporal better for production AI agents?
They solve different problems and are often combined. LangGraph handles the state-machine orchestration that defines the agent’s steps, and it checkpoints state so a run can resume. Temporal handles durable execution: it owns the workflow lifecycle and guarantees the run finishes across crashes, replaying coordination logic without re-running completed work. Checkpointing is a save-point you manage; durable execution is a completion guarantee. A common production setup runs the state-machine logic on top of a durable layer to get both the defined path and real crash recovery.
Should you use one agent or multiple agents?
Default to one. Multiple agents help only when the task splits into independent pieces that do not need to share state, and when you can afford a large token multiple (one reported design used roughly fifteen times the tokens of a single agent). The moment agents must coordinate writes or share context, the coordination failures tend to cost more than the parallelism gains, so a single agent on a clear path is the safer starting point.







