How to Build an AI Agent with Error Recovery (2026)

Your agent worked perfectly in dev, then it shipped to production and the support inbox lit up: a Stripe API hiccup orphaned a refund, a malformed JSON response sent the planner into a loop, and a Selenium timeout left the order in an unknown state for six hours. Welcome to the part of agent engineering nobody puts in the demo video.

Definition: AI agent error recovery

A layered set of patterns (retries, self-reflection, fallback chains, circuit breakers, dead letter queues, checkpointing) that lets an autonomous agent survive transient failures, malformed LLM outputs, broken tools, and infinite loops without losing state or silently corrupting data.

TL;DR

Production agents fail constantly. Public reliability data from teams running LangChain, AutoGen, and OpenAI agents in production shows tool-call error rates of 8 to 22 percent and end-to-end task failure rates between 20 and 40 percent on complex multi-step workflows. Layered recovery (3 retries with exponential backoff and jitter, self-reflection passes, fallback tools, circuit breakers, a dead letter queue, and checkpointed state) typically pushes successful completion above 95 percent and cuts pager load by 70 to 80 percent. None of these patterns are optional once you cross from prototype to revenue.

Why Agents Fail in Production

If you only build for the happy path, you will rebuild the agent every Monday. Failures cluster into four buckets, and each one needs a different defense.

LLM output errors. The model returns invalid JSON, hallucinates a tool name that does not exist, omits a required argument, or argues with a schema. These are non-deterministic and the same prompt can pass 99 times and fail on the hundredth call. Strict tool-calling helps but does not eliminate them.

Tool failures. APIs return 429s, 500s, and timeouts. Databases deadlock. Webhooks deliver out of order. A vendor pushes a breaking change at 3 a.m. Most of these are transient and a retry fixes them; some are permanent and a retry just burns your budget.

Environment errors. Browser-use agents lose sessions, file-system tools hit permission errors, MCP servers go offline, model providers throttle you, and credentials expire. The agent often does not know whether the failure is its fault.

Infinite loops. The most expensive failure mode. An agent keeps calling the same broken tool, keeps "reflecting" without improving, or oscillates between two near-identical plans. A loop running an Opus-class model can quietly burn hundreds of dollars before anyone notices.

The recovery stack below addresses each of those buckets at the right layer. For a refresher on agent fundamentals before you wire any of this in, see the complete guide to building AI agents.

Pattern 1: Tool-Level Retries with Exponential Backoff

The first line of defense is the dumbest one and it catches roughly half of all transient failures for free. Wrap every external call in a retry decorator that doubles the wait time between attempts and adds random jitter so concurrent agents do not stampede the API at the same instant.

A reasonable default is three attempts, base delay of one second, exponential factor of two, jitter of plus or minus 25 percent, and a maximum delay cap of 30 seconds. Retry on network errors, timeouts, 429, 502, 503, and 504. Do not retry on 400, 401, 403, 404, or 422 because those are signals the agent did something wrong and another attempt will fail identically while costing money.

In Python the tenacity library handles this in five lines. In TypeScript p-retry does the same. In n8n the built-in retry settings on every HTTP node handle the basics, and you can add a Wait node with {{ $itemIndex * Math.pow(2, $itemIndex) }} for exponential delay between iterations. AWS reliability research shows exponential backoff with jitter cuts retry storms by 60 to 80 percent compared to fixed-interval retries.

Tip

Start with three retries, base delay of one second, and full jitter. That single change typically eliminates 50 to 70 percent of production incidents that would otherwise page you. Increase only if the data tells you to.

Pattern 2: LLM Self-Reflection and Correction

When a tool returns an error, the most powerful move is to feed that error back to the LLM as a tool message and let it try again. LangGraph's ToolNode does this automatically when handle_tool_errors=True. The model sees "Cannot divide by zero" or "Field 'email' is required" in the conversation history and re-issues a corrected call. This self-fixing loop is one of the highest leverage patterns in agent design.

Reflection works best for syntactic and semantic errors the model can actually diagnose: malformed arguments, wrong tool selection, missing parameters, and invalid schema. ICLR 2024 research on intrinsic self-correction is more sober. Models cannot reliably self-correct reasoning errors without an external verification signal. Self-reflection over a flawed self-evaluation can compound the error rather than fix it.

The practical takeaway: use reflection for tool errors and structured output validation, but always pair it with an external check (schema validator, unit test, second model acting as a critic, or a deterministic rule) for anything that touches reasoning correctness. Anthropic's "dreaming" technique formalizes this, replaying action sequences to identify recurring failure patterns and refine future behavior.

Cap reflection at two passes. Beyond that, you are usually paying tokens for the model to convince itself a wrong answer is right.

Pattern 3: Fallback Tool Chains

Some failures are not transient and no amount of retries will help. The primary search API is down, the OCR vendor is rate-limiting your account, the structured-output model is throwing 500s. For these, you want a chain of fallbacks ranked by quality and cost.

The pattern: define each capability as an interface ("search the web", "extract text from PDF", "geocode an address") and register two or three implementations behind it. On primary failure, the orchestration layer catches the exception and calls the next provider with the same arguments. Brave Search falls back to Tavily falls back to a Google CSE wrapper. Claude Sonnet falls back to GPT-4o falls back to an open model on Groq for latency-sensitive paths.

Track which fallback served each request so you can spot when your "primary" is actually broken half the time. LangChain ships with_fallbacks() on Runnables for the model layer, and LangGraph's conditional edges let you route to a different tool node based on the error type in state. In n8n, an Error Trigger workflow plus an If node on $json.error.code does the same job declaratively.

Pattern 4: Circuit Breakers and Step Limits

A circuit breaker stops the agent from hammering a service that is already on fire. Three states: Closed (normal traffic), Open (requests fail fast for a cooldown window), Half-Open (one test request decides whether to close again). Standard thresholds are 5 consecutive failures to trip and a 60-second cooldown.

Without a circuit breaker, your retry policy makes the outage worse. With one, the agent recognizes "this provider is down" within seconds, flips the breaker, and routes everything to the fallback chain or the dead letter queue.

Step limits are the agent-loop equivalent. Hard cap the number of LLM turns per task (15 to 25 is a sensible default), the number of tool calls per turn (3 to 5), and total wall-clock time per session (60 to 300 seconds depending on the workload). When any limit trips, the agent halts, writes its current state to durable storage, and surfaces a "needs human review" event. This is the single most reliable defense against the runaway-cost failure mode.

Warning

Retries amplify cost. Three retries with exponential backoff plus a two-pass reflection loop can multiply your per-task LLM spend by 4x to 8x in a bad-day scenario. Always pair retry logic with hard step limits, a per-task budget cap, and an alert that fires when a single task crosses 2x the median cost.

Pattern 5: Dead Letter Queues for Manual Review

When all retries are exhausted, every fallback has failed, and the circuit is open, the request goes to a dead letter queue (DLQ). The DLQ is not a discard bin. It is durable storage of every undeliverable request with full context: the original input, the conversation history, the tool calls attempted, the errors at each step, the agent's state at the moment of failure, and the timestamp.

You need three things from a DLQ. Replayability: a one-click retry that re-runs the request after you have fixed the upstream issue. Inspectability: a UI or query interface where humans can read the failure context and decide what to do. Alertability: a notification that fires when the DLQ grows beyond a threshold so problems do not silently pile up.

In n8n, the DLQ is usually a Postgres table or a Google Sheet that the Error Trigger workflow writes to, with Slack alerts attached. In a code-first stack, Redis Streams, SQS with a redrive policy, or Temporal's failed-workflow store all work. The DLQ is also where you discover that 3 percent of your "agent failures" are actually upstream data quality problems that no retry would ever fix.

Pattern 6: Checkpointing and Resume

Long-running agents must be able to crash, get redeployed, or have their underlying provider go down for ten minutes and then resume from where they stopped. The mechanism is checkpointing: after every state transition, persist the agent's full state (messages, tool results, scratchpad, current node) to durable storage keyed by a session ID.

LangGraph's built-in checkpointer (SQLite, Postgres, or Redis) does this on every node transition. CrewAI exposes similar persistence. With Temporal or Inngest the durable execution model gives you checkpointing essentially for free: the workflow code is replayed deterministically on resume.

Checkpointing also unlocks human-in-the-loop. The agent reaches a sensitive action (sending money, deleting data, posting to a customer), pauses, writes the checkpoint, and waits for a human approval event before continuing. The same machinery that recovers from a crash also gates risky actions.

Implementation in n8n

You can wire all six patterns into n8n without writing a service. The trade-off is that everything is declarative, which is fast to ship but harder to unit test.

Per-node retries. On every HTTP Request, AI, and Code node, open Settings and enable Retry On Fail. Set Max Tries to 3, Wait Between Tries to 1000ms, and check Continue On Fail so a failed item does not abort the run.
Exponential backoff. Where the API needs more than linear retries, build a small loop: HTTP Request -> If (error) -> Wait {{ Math.pow(2, $itemIndex) * 1000 }} ms -> back to HTTP Request, with a SplitInBatches counter to cap iterations.
Reflection. After an AI Agent node, add an If node that checks for $json.error or a schema-validation failure, route the error back into a second AI Agent call with the original input plus the error message in the prompt, and cap at two retries.
Fallbacks. Use an Error Trigger workflow to catch any node failure. Inside it, route by error type: 429 to a Wait node and re-queue, 5xx to a fallback provider node, schema errors to the DLQ.
Circuit breaker. Store per-API failure counts in a Postgres or Redis node. Before each external call, query the counter; if it is at or above 5 within the last 60 seconds, skip the call and route to fallback or DLQ.
DLQ. Insert failed items into a Postgres table or Google Sheet with full context (input, error, timestamp, workflow ID, execution ID). Add a Slack node that pings a channel when the DLQ exceeds 10 rows in an hour.
Checkpointing. For multi-step agent workflows, write state to a Redis hash after each meaningful step and key it on the execution ID. On retry, the workflow checks Redis first and resumes from the last completed step.

Implementation in LangGraph

LangGraph gives you finer control and is the right choice once your reliability requirements get serious. The whole stack lives in code and is testable.

Define the state. Add fields for errors, attempt_count, last_checkpoint, and any tool-specific status. Errors live in state so conditional edges can route on them.
Wrap tools. Set handle_tool_errors=True on your ToolNode so exceptions become ToolMessage content the LLM can read and react to.
Per-node retry policy. builder.add_node("call_api", call_api, retry_policy=RetryPolicy(max_attempts=3, initial_interval=1.0, backoff_factor=2.0, jitter=True, retry_on=(httpx.HTTPError, TimeoutError))). Use runtime.execution_info.node_attempt inside the node to switch to a fallback implementation on attempt 2 or higher.
Conditional routing on errors. After each node, add a conditional edge that inspects state["errors"] and routes to a reflect node, a fallback node, or END with a "needs human review" status.
Self-reflection node. A node that takes the failed tool call and the error, prompts the LLM to diagnose and re-plan, and returns an updated tool call. Cap with an attempt counter in state and a hard limit of 2.
Circuit breaker. Wrap the external call in a small Python class (or use pybreaker) backed by Redis so the breaker state is shared across replicas. Trip after 5 consecutive failures, cooldown 60 seconds.
Step limit. Use the recursion_limit config on graph.invoke() and a per-iteration counter in state. On exceeding either, route to a human_review node that pauses the graph.
DLQ. On terminal failure, the human_review node writes the full state to Postgres with a status='dlq' flag and emits a Slack or PagerDuty alert.
Checkpointing. Compile the graph with graph.compile(checkpointer=PostgresSaver(...)) and pass a thread_id per session. The graph now resumes automatically after a crash, redeploy, or human approval.

Pair all of this with the practices in how to monitor and debug AI agents so every retry, fallback, and DLQ entry is traceable in your observability stack.

n8n vs LangGraph: Error Handling Capabilities

feature	n8n	langgraph
Per-node retry config	Built-in UI toggle, linear waits	RetryPolicy with exponential backoff and jitter
LLM self-reflection on tool errors	Manual: If node + second AI call	Native via ToolNode handle_tool_errors
Fallback tool chains	Error Trigger workflows + If routing	Conditional edges + with_fallbacks Runnable
Circuit breakers	Custom: Postgres counter + If guard	External lib (pybreaker) + Redis state
Step / recursion limits	Manual SplitInBatches counter	Native recursion_limit config
Dead letter queue	Postgres / Sheet + Slack alert	Custom write from terminal node
Checkpointing and resume	Manual: Redis hash per execution	Native checkpointer (Postgres / SQLite / Redis)
Human-in-the-loop pause	Wait for Webhook node	interrupt() with thread persistence
Time to ship recovery layer	1 to 2 days	1 to 2 weeks
Best fit	Internal ops, low-volume B2B agents	Customer-facing, high-volume, regulated

Frequently Asked Questions

What is a sensible default for max retries on an agent tool call?

Three attempts with exponential backoff (1s, 2s, 4s) plus jitter is the right starting point for almost every external call. It catches the vast majority of transient failures (rate limits, brief outages, network blips) without amplifying cost in pathological scenarios. Increase to 5 only for known-flaky vendors where you have data showing a real win, and always pair with a circuit breaker so a sustained outage does not turn into a sustained retry storm.

How do I prevent an agent from getting stuck in an infinite loop?

Three layers. First, hard cap the agent loop with a recursion or step limit (15 to 25 LLM turns is typical). Second, detect repeated identical tool calls in state and break out when the same call is issued twice in a row with no new information. Third, set a wall-clock timeout per session and a per-task budget cap (in dollars or tokens) that halts the agent and pages a human. Loops are the most expensive failure mode in agent systems, so over-engineer the limits rather than under-engineer them.

Which observability tools should I use to see why my agent is failing?

LangSmith, Langfuse, Helicone, Arize Phoenix, and Braintrust are the leaders in 2026. They capture full traces (every LLM call, every tool call, every retry, every error) and let you replay sessions, compare versions, and alert on failure rate. For a side-by-side comparison see the best AI agent monitoring and observability tools. At minimum you want trace IDs that flow from your application logs through the LLM calls so a Slack alert links directly to the failing trace.

Don't retries and self-reflection just multiply my LLM costs?

Yes, and this is the single biggest gotcha in agent recovery. A two-pass reflection loop on top of three tool retries with exponential backoff can 4x to 8x the per-task spend on a bad day. Defenses: hard step limits, per-task budget caps with automatic halt, alerting on tasks that exceed 2x the median cost, and routing reflection to a cheaper model than the primary planner. Track cost per successful task as a first-class metric, not just total spend.

Does LLM self-reflection actually work, or is it marketing?

It depends on the failure mode. For tool errors, schema validation failures, and malformed outputs, reflection works well because the LLM has a concrete error message to react to. Published results show 10 to 18 point improvements on benchmarks like HumanEval when reflection has an external signal (a failing test, a validator). For pure reasoning errors with no external check, ICLR 2024 research shows intrinsic self-correction is unreliable and can compound mistakes. Rule of thumb: always pair reflection with an external verifier (schema, unit test, critic model, deterministic check). Never trust an agent to grade its own homework on a hard reasoning task.

When should I send a failure to the dead letter queue versus retrying again?

Retry on transient signals (timeouts, 429s, 5xx) up to your max attempts. Send to DLQ immediately on permanent signals (400, 401, 403, 422, schema validation failures, business rule violations) because retrying will fail identically and waste money. Also DLQ anything that exhausts retries, trips the circuit breaker, or hits the step limit. The DLQ is a feature, not a failure: it is the audit trail and the manual-fix queue that lets you keep the main pipeline flowing while humans handle the edge cases.

Closing

Error recovery is not a feature you add at the end. It is the architecture, and every pattern in this article exists because someone shipped without it and got paged at 3 a.m. Start with retries plus exponential backoff (Pattern 1), add tool-error reflection (Pattern 2), then layer in fallbacks, circuit breakers, the DLQ, and checkpointing as your traffic and stakes grow. The agents that survive contact with production are the ones built to fail safely.

Sources: