Zarif Automates

How to Deploy AI Agents to Production

By Zarif | Updated May 2, 2026

The demo works on your laptop. Then you ship it, and by Tuesday you're paying 400 dollars a day in token costs because a tool call started looping at 3am and nobody noticed.

Definition

Deploying an AI agent to production means moving it from a notebook or local script into a hosted, observable, evaluated, and cost-controlled system that can serve real users with predictable reliability and bounded spend.

TL;DR

  • Production agents need 7 layers most prototypes skip: SLAs, infra, persistence, entry points, observability, evals, and a controlled rollout
  • Pick infra by workload shape: Modal or Railway for long-running agents, Vercel or Lambda for short request-response, a managed platform like LangGraph Platform if you don't want to operate any of it
  • Split state into hot (Redis), durable (Postgres), and semantic (pgvector or a dedicated vector DB) instead of jamming everything into one store
  • Observability and evals are not optional. Wire Langfuse, LangSmith, or Braintrust on day one and gate every release on an eval suite
  • Add timeouts, retries with backoff, circuit breakers, and per-user spend caps before you take real traffic. Most production incidents are runaway loops, not bad answers

What Changes When an Agent Leaves Your Laptop

A prototype agent has one user (you), no SLA, no budget cap, and zero observability. Production flips every one of those. You need to know when it broke, how often, for which user, what it cost, and whether the new version is actually better than the old one before you roll it out.

The other thing that changes is the failure mode. Web apps fail loudly with 500 errors. Agents fail silently with confidently wrong answers, infinite tool loops, or quietly degraded quality after a model update. You catch those with evals and traces, not with uptime monitoring.

The seven steps below are the order I deploy in. Skip a step and you will pay for it within the first month of real traffic.

Step 1: Define Your SLAs and Failure Budget Before You Pick Tools

What to do: write down four numbers before touching infrastructure: target p95 latency, target success rate, max cost per request, and max cost per user per day.

Why it matters: every downstream choice falls out of these numbers. A 30-second analyst agent can run on a totally different stack than a 200ms customer-support agent. Without the targets, you'll over-engineer the easy paths and under-engineer the dangerous ones.

A reasonable starting point for an internal tool: p95 under 8 seconds, success above 95 percent, 5 cents per request, 5 dollars per user per day. A customer-facing agent will need tighter numbers. Whatever you pick, write them in a doc, share them with whoever owns the budget, and revisit them after a week of real traffic.
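One way to keep those four numbers honest is to put them in code next to the agent. The sketch below is a minimal illustration, not a prescribed format: `SlaTargets`, `p95`, and `sla_violations` are hypothetical names, the values are the internal-tool starting point from this step, and the per-user daily cap is stored here but not checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTargets:
    """The four Step 1 numbers. Field names are illustrative."""
    p95_latency_s: float
    min_success_rate: float
    max_cost_per_request_usd: float
    max_cost_per_user_per_day_usd: float


# Internal-tool starting point: 8 s p95, 95 percent success, 5 cents, 5 dollars.
INTERNAL_TOOL = SlaTargets(8.0, 0.95, 0.05, 5.00)


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n) in integers
    return ordered[rank - 1]


def sla_violations(
    targets: SlaTargets,
    latencies_s: list[float],
    successes: list[bool],
    costs_usd: list[float],
) -> list[str]:
    """Return the names of the per-request targets that a batch of runs missed."""
    problems = []
    if p95(latencies_s) > targets.p95_latency_s:
        problems.append("p95 latency")
    if sum(successes) / len(successes) < targets.min_success_rate:
        problems.append("success rate")
    if max(costs_usd) > targets.max_cost_per_request_usd:
        problems.append("cost per request")
    return problems
```

Running `sla_violations` over a week of real traffic gives you a concrete list to bring to the revisit.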

Step 2: Pick the Right Hosting Pattern for Your Workload

What to do: match your agent's runtime profile to one of three deployment patterns: short request-response, long-running, or background queue.

Why it matters: agents are not normal web apps. They have unpredictable latency, hold state across tool calls, and sometimes need GPU access. Putting a research agent that usually finishes in 5 minutes behind Vercel's 900-second function limit will work right up until one run goes long and gets cut off mid-task.

The three patterns:

  • Short request-response, under about 60 seconds total: Vercel functions, Cloudflare Workers, or AWS Lambda. Cheap, scales to zero, fine for chat agents that mostly do one or two tool calls.
  • Long-running or GPU-heavy, 1 to 30 minutes: Modal or Railway. Modal's Python-decorator deploy model is the closest thing to "Vercel for backends" right now, and it handles GPU and long timeouts natively.
  • Background or asynchronous, anything over 30 minutes or anything that needs to survive restarts: a queue plus workers (BullMQ on Redis, AWS SQS plus Lambda, or Temporal/Restate for durable execution). This is where the "agent as durable workflow" pattern lives.

If you don't want to make this decision at all, a managed agent platform like LangGraph Platform, Microsoft Foundry Hosted Agents, or Vertex AI Agent Engine will pick for you in exchange for vendor lock-in and higher per-request cost.
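If you do make the decision yourself, the three patterns reduce to a small routing rule. This is a sketch under the thresholds stated in this step (about 60 seconds, and 30 minutes); `HostingPattern` and `pick_pattern` are hypothetical names, and treating "needs a GPU" as an automatic push to the long-running tier is my simplification.

```python
from enum import Enum


class HostingPattern(Enum):
    SHORT_REQUEST_RESPONSE = "short request-response"
    LONG_RUNNING = "long-running"
    BACKGROUND_QUEUE = "background queue"


def pick_pattern(
    expected_runtime_s: float,
    needs_gpu: bool = False,
    must_survive_restarts: bool = False,
) -> HostingPattern:
    """Map a workload's runtime profile to one of the three patterns."""
    # Over 30 minutes, or anything that has to survive restarts: queue plus workers.
    if must_survive_restarts or expected_runtime_s > 30 * 60:
        return HostingPattern.BACKGROUND_QUEUE
    # Longer than about a minute, or GPU-heavy: a long-running host.
    if needs_gpu or expected_runtime_s > 60:
        return HostingPattern.LONG_RUNNING
    # Everything else fits a serverless function.
    return HostingPattern.SHORT_REQUEST_RESPONSE
```

Feed it the worst-case runtime you expect, not the average, since the function limit bites on the slow runs.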

Step 3: Split State Into Hot, Durable, and Semantic Layers

What to do: stop trying to make one database hold everything. Use Redis for active session state, Postgres for the system of record, and pgvector or a dedicated vector store for semantic recall.

Why it matters: a single agent run touches three very different storage workloads. Tool call scratch space and partial results need millisecond reads. Conversation history, user prefs, and audit logs need ACID durability. Past experiences and document chunks need similarity search. One database does at most two of these well.

The pattern that's become standard in 2026 is the hybrid write-through model. Active turn state lives in Redis with a TTL. A background worker flushes completed turns to Postgres and computes embeddings for the chunks worth recalling, storing them in pgvector. When the agent wakes up for a new turn, it hydrates working memory from Redis if the session is hot, falls back to Postgres if it's cold, and pulls episodic context from pgvector by similarity.
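To make the hydrate-and-flush flow concrete, here is a minimal in-memory sketch. It is not a real Redis or Postgres client: `HotStore` is a stand-in for Redis with a TTL, a plain dict stands in for the Postgres table, and the embedding step for pgvector is left out entirely. All class and method names are hypothetical.

```python
import time


class HotStore:
    """In-memory stand-in for Redis: entries expire ttl_s seconds after being set."""

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._data[key] = (self._clock() + self._ttl_s, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired, behave as if it was never there
            return None
        return value


class SessionMemory:
    def __init__(self, hot: HotStore, durable: dict):
        self.hot = hot
        self.durable = durable  # stand-in for Postgres: session_id -> list of turns

    def hydrate(self, session_id):
        """Load working memory: hot store first, durable store as fallback."""
        turns = self.hot.get(session_id)
        if turns is not None:
            return turns, "hot"
        turns = list(self.durable.get(session_id, []))
        self.hot.set(session_id, turns)  # re-warm the session
        return turns, "cold"

    def record_turn(self, session_id, turn):
        turns, _ = self.hydrate(session_id)
        self.hot.set(session_id, turns + [turn])

    def flush(self, session_id):
        """What the background worker does: copy hot turns to the durable store."""
        turns = self.hot.get(session_id)
        if turns is not None:
            self.durable[session_id] = list(turns)
```

Note that `record_turn` hydrates first, so a session that went cold keeps its history when it wakes up instead of starting from an empty list.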

Two specific gotchas. First, Redis with default RDB persistence can lose every write since the last snapshot if it crashes, so never treat it as the system of record. Second, do not store secrets or PII in vector indexes; embed a reference ID instead and look up the actual content from Postgres at retrieval time.

Step 4: Deploy the Entry Points (HTTP, Webhook, Queue)

What to do: expose your agent through whichever entry points your product needs, and put authentication, rate limits, and request validation in front of every one of them.

Why it matters: agents are expensive to run. An unauthenticated webhook that spawns an agent is a denial-of-wallet attack waiting to happen. I have personally seen a 12,000 dollar AWS bill from one exposed endpoint over a long weekend.

A solid baseline:

  • Authentication on every entry point. API key for server-to-server, signed JWT for user sessions, HMAC verification for webhooks
  • Per-user rate limit at the edge (Upstash Redis, Cloudflare, or your gateway)
  • Request schema validation with Zod or Pydantic. Reject malformed payloads before they ever start an LLM call
  • Idempotency keys on any endpoint that can mutate state. Agents retry. Without idempotency, retries cause duplicate writes

Wrap the agent itself in a thin handler that does auth, validates input, generates a trace ID, kicks off the run, and returns. The handler should never contain agent logic.
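A thin webhook handler along those lines might look like the sketch below. It uses only the standard library; `WEBHOOK_SECRET`, `handle_webhook`, the `message` field, and the in-process `_seen` dict are all assumptions for illustration (in production the secret comes from a secret manager and the idempotency record lives in Redis or Postgres).

```python
import hashlib
import hmac
import json
import uuid

WEBHOOK_SECRET = b"replace-me"  # hypothetical placeholder, never hard-code a real secret
_seen = {}  # idempotency key -> earlier response (stand-in for a shared store)


def sign(body: bytes, secret: bytes = WEBHOOK_SECRET) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def handle_webhook(body: bytes, signature: str, idempotency_key: str, start_run) -> dict:
    # 1. Auth: constant-time comparison of the HMAC signature.
    if not hmac.compare_digest(sign(body).encode(), signature.encode()):
        return {"status": 401, "error": "bad signature"}

    # 2. Validation: reject malformed payloads before any LLM call starts.
    try:
        payload = json.loads(body)
    except ValueError:
        return {"status": 400, "error": "body is not valid JSON"}
    message = payload.get("message") if isinstance(payload, dict) else None
    if not isinstance(message, str) or not message.strip():
        return {"status": 400, "error": "'message' must be a non-empty string"}

    # 3. Idempotency: a retried request gets the original response back.
    if idempotency_key in _seen:
        return _seen[idempotency_key]

    # 4. Trace ID, kick off the run, return. No agent logic lives here.
    trace_id = uuid.uuid4().hex
    start_run(message, trace_id)
    response = {"status": 202, "trace_id": trace_id}
    _seen[idempotency_key] = response
    return response
```

`start_run` is whatever enqueues or launches the agent; passing it in keeps the handler free of agent logic and easy to test.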

Step 5: Wire Observability Before You Take Real Traffic

What to do: instrument every LLM call and tool call with traces, costs, latency, and inputs/outputs. Pick one observability platform and standardize on it from day one.

Why it matters: when your agent misbehaves in production, you need to replay the exact run that caused it. Without traces, you're reading customer screenshots and guessing. With them, you click into the trace and see the bad tool argument that started the loop.

The three platforms worth looking at in 2026:

  • Langfuse: open-source, MIT-licensed, self-hostable. Best choice if you care about data residency or want to avoid another SaaS bill
  • LangSmith: tightest integration with LangChain and LangGraph, including state-diff tracing. Default if you're already on that stack
  • Braintrust: eval-first platform with CI/CD deployment blocking. Best if your team treats evals as a primary engineering workflow

Whichever you pick, capture five things on every run: the trace ID, full input, full output, token counts and dollar cost per call, and total wall-clock latency. That's the minimum to debug anything.
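If you want to see what that minimum looks like before wiring in a platform SDK, here is a vendor-neutral sketch. `RunTrace` and `CallRecord` are hypothetical names, not the Langfuse, LangSmith, or Braintrust API, and the per-1k-token prices are passed in by the caller because real prices vary by model.

```python
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CallRecord:
    name: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass
class RunTrace:
    full_input: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    full_output: Optional[str] = None
    calls: list = field(default_factory=list)
    latency_s: Optional[float] = None
    _started: float = field(default_factory=time.perf_counter, repr=False)

    def record_call(self, name, input_tokens, output_tokens, usd_per_1k_in, usd_per_1k_out):
        """Log token counts and dollar cost for one LLM or tool call."""
        cost = input_tokens / 1000 * usd_per_1k_in + output_tokens / 1000 * usd_per_1k_out
        self.calls.append(CallRecord(name, input_tokens, output_tokens, cost))

    def finish(self, output: str) -> None:
        """Store the final output and the total wall-clock latency."""
        self.full_output = output
        self.latency_s = time.perf_counter() - self._started

    @property
    def total_cost_usd(self) -> float:
        return sum(call.cost_usd for call in self.calls)

    def summary(self) -> dict:
        """The five things to capture on every run, as one record."""
        return {
            "trace_id": self.trace_id,
            "full_input": self.full_input,
            "full_output": self.full_output,
            "calls": [asdict(call) for call in self.calls],
            "latency_s": self.latency_s,
        }
```

Whatever platform you standardize on, check that its trace view gives you at least what `summary()` returns here.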

Step 6: Build an Eval Suite and Gate Releases on It

What to do: build a dataset of 30 to 100 real user inputs with expected outcomes, run your agent against it on every change, and block deploys that regress on key metrics.

Why it matters: the failure mode of agents is not a 500 error, it's a quality regression. You change a prompt, fix one bug, and silently break three other behaviors. Without evals you discover this from a customer complaint two weeks later. With evals, the deploy doesn't ship.

A starter eval suite has three tiers:

  • Unit-level checks on tool selection: given input X, does the agent call tool Y with arguments Z?
  • Outcome-level checks on task success: did the final answer match the expected outcome on a graded rubric?
  • End-to-end checks on regressions: rerun the last 30 days of real production traces against the new version and flag any that score lower than before.

Use LLM-as-judge for the graded rubric, but anchor it. Provide the rubric, provide 3 to 5 hand-graded examples in the prompt, and require the judge to output a score plus a reason. Spot-check 10 percent of judge scores manually each week or the rubric drifts.
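Here is one way the anchored judge could be wired up, as a sketch rather than a reference implementation. The prompt wording, the JSON reply format, and the 1 to 5 score range are my assumptions; the function names are hypothetical, and the actual LLM call is left out so the code stays provider-neutral.

```python
import json
import random


def build_judge_prompt(rubric, graded_examples, agent_input, agent_output) -> str:
    """Assemble a judge prompt anchored by the rubric and 3 to 5 hand-graded examples."""
    if not 3 <= len(graded_examples) <= 5:
        raise ValueError("anchor the judge with 3 to 5 hand-graded examples")
    parts = ["You are grading an AI agent's answer.", "", "Rubric:", rubric, ""]
    for i, ex in enumerate(graded_examples, start=1):
        parts += [
            f"Example {i}",
            f"Input: {ex['input']}",
            f"Output: {ex['output']}",
            f"Score: {ex['score']}",
            f"Reason: {ex['reason']}",
            "",
        ]
    parts += [
        "Now grade this run.",
        f"Input: {agent_input}",
        f"Output: {agent_output}",
        'Reply with JSON only: {"score": <integer>, "reason": "<one sentence>"}',
    ]
    return "\n".join(parts)


def parse_judgement(raw: str, min_score: int = 1, max_score: int = 5):
    """Require a score plus a reason; raise ValueError on anything else."""
    try:
        data = json.loads(raw)
        score, reason = data["score"], data["reason"]
    except (KeyError, TypeError) as exc:
        raise ValueError("judge reply must contain 'score' and 'reason'") from exc
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not min_score <= score <= max_score:
        raise ValueError(f"score must be between {min_score} and {max_score}")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")
    return score, reason


def spot_check_sample(judgements, fraction: float = 0.10, seed=None) -> list:
    """Pick roughly 10 percent of judge scores for the weekly manual review."""
    if not judgements:
        return []
    k = max(1, round(len(judgements) * fraction))
    return random.Random(seed).sample(list(judgements), k)
```

Rejecting replies that lack a reason matters more than it looks: the reason is what you read during the weekly spot-check.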

Run the suite in CI. Braintrust, LangSmith, and Langfuse all support CI integration; if you self-host, a GitHub Action that runs the suite and posts results to the PR is enough to start.
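The gate itself can be very small. The sketch below assumes you already have per-case scores for the current baseline and for the candidate build, keyed by a case ID; `regressions` and `gate` are hypothetical names, and a case missing from the candidate results is treated as a regression.

```python
def regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> list:
    """Case IDs where the candidate scores lower than baseline by more than tolerance."""
    return sorted(
        case_id
        for case_id, base_score in baseline.items()
        if candidate.get(case_id, float("-inf")) < base_score - tolerance
    )


def gate(baseline: dict, candidate: dict, tolerance: float = 0.0) -> int:
    """Return a process exit code: 0 lets the deploy through, 1 blocks it."""
    failed = regressions(baseline, candidate, tolerance)
    for case_id in failed:
        new_score = candidate.get(case_id, float("nan"))
        print(f"REGRESSION {case_id}: {baseline[case_id]:.2f} -> {new_score:.2f}")
    return 1 if failed else 0
```

In a CI job you would pass the return value of `gate` to `sys.exit`, so a non-zero code fails the check on the PR.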

Step 7: Roll Out With Retries, Circuit Breakers, and Cost Caps

What to do: deploy the new version alongside the old one, route 5 to 10 percent of traffic to it, monitor for an hour minimum, then expand. Wrap every external call in retries with exponential backoff, circuit breakers, and a hard spend cap per user per day.

Why it matters: most production agent incidents are not bad answers. They are runaway loops, third-party API outages, or one user accidentally running up 8,000 dollars of inference because a prompt injection convinced the agent to call a tool 400 times.

The minimum protection layer:

  • Retries with exponential backoff and jitter on every external API. Cap retries at 3. Do not retry on 4xx errors except 408 and 429
  • Circuit breakers on every tool. After N consecutive failures, open the circuit and short-circuit subsequent calls for a cool-off window. The agent gets an error back and can either work around it or fail gracefully
  • Per-run iteration cap. Most agents should complete in under 20 reasoning steps. Cap at 30 and kill any run that exceeds the cap. Log it and alert on it
  • Per-user daily spend cap, enforced before each LLM call by checking a Redis counter. When the cap is hit, return a friendly error and notify the user
  • Canary deployment. Route a small percentage of traffic to the new version. Compare success rate, latency, and cost-per-request to the baseline. Roll forward only if all three hold or improve
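The first two items in that list fit in a few dozen lines. This is a standard-library sketch, not a replacement for a maintained retry or breaker library: `HttpError` is a hypothetical exception carrying a status code, the 0.5-second base delay, failure threshold, and cool-off window are placeholder values, and retrying on 5xx, timeouts, and connection errors is my reading of "do not retry on 4xx except 408 and 429".

```python
import random
import time


class HttpError(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


class CircuitOpenError(Exception):
    pass


def is_retryable(exc: Exception) -> bool:
    """Retry 5xx, 408, and 429; never retry other 4xx errors."""
    if isinstance(exc, HttpError):
        return exc.status >= 500 or exc.status in (408, 429)
    return isinstance(exc, (TimeoutError, ConnectionError))


def call_with_retries(fn, max_retries: int = 3, base_delay_s: float = 0.5, sleep=time.sleep):
    """Call fn, retrying up to max_retries times with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries or not is_retryable(exc):
                raise
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooloff_s: float = 30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooloff_s = cooloff_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._cooloff_s:
                raise CircuitOpenError("circuit open, skipping call")
            self._opened_at = None  # cool-off over: let one trial call through
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0  # any success resets the consecutive-failure count
        return result
```

A typical composition is one breaker per tool wrapped around the retrying call, for example `breaker.call(lambda: call_with_retries(fetch))`, so a tool that keeps failing after retries stops being called at all during the cool-off.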

Once those are in place, the worst-case incident is a contained one. Without them, the worst case is a five-figure bill.
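The iteration cap and the spend cap are what do the containing, and they belong in the agent loop itself. In this sketch a dict stands in for the Redis counter, `step` is a hypothetical callable that runs one reasoning step and returns `(done, cost_usd)`, and every name is illustrative.

```python
class RunAborted(Exception):
    pass


class SpendCapReached(Exception):
    pass


class SpendGuard:
    """Per-user daily spend counter. A dict stands in for the Redis counter."""

    def __init__(self, daily_cap_usd: float):
        self.daily_cap_usd = daily_cap_usd
        self._spent = {}  # (user_id, day) -> dollars spent so far

    def spent(self, user_id: str, day: str) -> float:
        return self._spent.get((user_id, day), 0.0)

    def check(self, user_id: str, day: str) -> None:
        if self.spent(user_id, day) >= self.daily_cap_usd:
            raise SpendCapReached(f"{user_id} hit the daily cap for {day}")

    def add(self, user_id: str, day: str, usd: float) -> None:
        self._spent[(user_id, day)] = self.spent(user_id, day) + usd


def run_agent(step, user_id: str, day: str, guard: SpendGuard, max_steps: int = 30) -> int:
    """Run step() until it reports done; return the number of steps taken."""
    for n in range(1, max_steps + 1):
        guard.check(user_id, day)      # enforce the cap before each LLM call
        done, cost_usd = step()
        guard.add(user_id, day, cost_usd)
        if done:
            return n
    raise RunAborted(f"run exceeded {max_steps} steps")  # log and alert on this
```

Catch `SpendCapReached` at the handler level to return the friendly error, and treat `RunAborted` as an alertable event rather than a normal failure.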

Warning

Never wire OpenAI or Anthropic keys directly into client code, browser apps, or Slack bots. Always proxy through a server you control where you can enforce auth, rate limits, and spend caps. Leaked API keys are the single most common way founders lose four-figure amounts overnight.

Hosting Comparison: Where to Actually Deploy

Use this as a starting point, then validate against your own latency and cost targets in step 1.

| Platform | Best For | Max Runtime | Starting Price |
| --- | --- | --- | --- |
| Modal | Long-running, GPU, Python-first agents | 24 hours | Pay per second of compute |
| Vercel Functions | Short chat agents next to a Next.js app | 900 seconds (Fluid) | Free tier, then usage-based |
| AWS Lambda | Event-driven agents in AWS-native stacks | 900 seconds | Pay per request and GB-second |
| Railway | Long-running Node/Python agents with WebSockets | No hard limit | 5 dollars per month plus usage |
| LangGraph Platform | Managed deploys for LangGraph agents | Long-running supported | Usage-based, free dev tier |
| Cloudflare Workers | Edge-deployed, low-latency agents | 30 minutes (paid plan) | Free tier, then 5 dollars per month |

If you're starting fresh in 2026 and don't have a strong existing platform preference, Modal for long-running plus Vercel for the user-facing chat surface plus Langfuse for traces is a stack you can be productive on by end of day one.

Putting the Stack Together

A real production deployment looks like this. Vercel or Modal hosts the agent runtime. Postgres on Neon or Supabase holds durable state. Upstash Redis holds session state and rate-limit counters. Langfuse or LangSmith captures every trace. Braintrust runs the eval suite in CI. A canary on 10 percent of traffic guards every deploy. Per-user spend caps and circuit breakers run on every request.

That stack costs roughly 50 to 200 dollars a month at low volume before you add LLM tokens. It will save you from the four-figure mistake on day 14.
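The canary that guards each deploy in this stack can be sketched in two functions. Hashing the user ID so each user consistently lands on the same version is my assumption about how to split traffic, and `in_canary`, `should_roll_forward`, and the metric keys are hypothetical names.

```python
import hashlib


def in_canary(user_id: str, percent: int = 10) -> bool:
    """Deterministically route roughly `percent` of users to the new version."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def should_roll_forward(baseline: dict, canary: dict) -> bool:
    """Roll forward only if success rate, latency, and cost all hold or improve."""
    return (
        canary["success_rate"] >= baseline["success_rate"]
        and canary["p95_latency_s"] <= baseline["p95_latency_s"]
        and canary["cost_per_request_usd"] <= baseline["cost_per_request_usd"]
    )
```

Because the bucket comes from a hash rather than a random draw, a user does not flip between versions mid-conversation, which keeps the comparison against the baseline clean.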

For a deeper look at what your agent should actually do once it's deployed, the complete guide to building AI agents covers the design side. If you're still picking a framework, how to build AI agents with Python walks through the most common starting point. And if you haven't built guardrails yet, how to build AI agent guardrails and safety controls is what you want to read before going live.

What is the difference between deploying an AI agent and deploying a normal web app?

A web app handles short, stateless requests with predictable latency and fails loudly when something breaks. An AI agent runs longer, holds state across tool calls, has unpredictable latency and cost per request, and tends to fail silently with wrong answers or runaway loops. That means production agents need traces, evals, per-user spend caps, and circuit breakers that normal web apps usually skip.

Where should I host an AI agent in production?

Pick by workload shape. Short chat agents under 60 seconds run well on Vercel, Cloudflare Workers, or AWS Lambda. Longer-running agents up to 30 minutes are best on Modal or Railway, where Modal handles GPU and long timeouts natively. Anything truly long-running or that needs to survive restarts belongs in a queue plus worker pattern using Temporal, Restate, or BullMQ.

Do I need a vector database to run an agent in production?

Only if your agent uses semantic recall over past conversations or documents. Many production agents work fine with Postgres for durable state and Redis for session state, and never need a vector store. If you do need semantic search, pgvector inside the same Postgres you already run is the simplest starting point. Reach for a dedicated vector DB like Pinecone or Weaviate only when query latency or scale requires it.

How do I monitor an AI agent in production?

Use a dedicated LLM observability platform: Langfuse if you want open-source and self-hosted, LangSmith if you are on the LangChain or LangGraph stack, or Braintrust if you want eval-driven CI/CD gating. Capture the full input, full output, token cost, latency, and a unique trace ID on every run. Standard application monitoring tools like Datadog or New Relic miss the LLM-specific signals you actually need to debug agents.

How do I stop an AI agent from running up a huge API bill?

Layer four controls. Per-run iteration caps that kill any run that goes past 30 reasoning steps (most runs should finish in under 20). Per-user daily spend caps enforced in Redis before every LLM call. Circuit breakers on tools so a failing API stops being called after a few consecutive errors. Authentication and rate limits on every entry point so unauthenticated requests cannot trigger LLM calls. With those four in place, the worst-case incident is contained instead of catastrophic.

When should I use a managed agent platform versus building my own deployment?

Use a managed platform like LangGraph Platform, Microsoft Foundry Hosted Agents, or Vertex AI Agent Engine when you want to ship fast, do not have a platform team, and are fine with vendor lock-in plus higher per-request cost. Build your own on Modal or Railway plus Postgres plus Langfuse when you have specific requirements around latency, cost, data residency, or custom infrastructure that the managed platforms do not support. Most teams should start managed and migrate later only if the per-request economics force it.

Zarif

Zarif is an AI automation educator helping thousands of professionals and businesses leverage AI tools and workflows to save time, cut costs, and scale operations.