Production AI Agents: From Prototype to Production-Ready Systems
Most teams build their first agent in an afternoon. It works beautifully in the notebook — the LLM calls the right tools, the responses look great, and stakeholders are impressed. Everyone wants to ship it next week.
Then production happens. The API bill hits $87,000 on a $40,000 budget. Response times stretch past ten seconds. A one-word change to a prompt breaks a refund-flow tool-call schema that the 30-example regression suite never hit. Users report incorrect information, and you have no way to know whether it was a bad prompt, a bad retrieval, a bad tool call, or something else.
This is the prototype-to-production gap. This guide closes it.
You’ll learn how to build production AI agents that are reliable, observable, testable, deployable, secure, and cost‑controlled — without diving into theoretical architecture patterns. Every practice here comes from real production deployments at scale.
What Is a Production AI Agent
A production AI agent is an AI system that meets six non‑negotiable criteria:
- Reliable — handles failures gracefully, retries transient errors, and never hangs indefinitely.
- Observable — every LLM call, tool invocation, and decision point is traced and logged.
- Testable — prompt changes are validated against a fixed dataset before promotion.
- Deployable — supports canary releases, blue‑green switches, and instant rollback.
- Secure — authenticates and authorises every request, blocks prompt injection.
- Cost‑controlled — tracks spend per request, caches aggressively, rightsizes models.
A prototype agent works for you. A production agent works for your users, under real load, with real consequences.
Example: A customer‑support agent that routes tickets, retrieves knowledge, generates drafts, and escalates to humans — processing 10,000 requests per day with 99.5% uptime, full audit trails, and a predictable monthly API bill under $5,000.
Why Production Engineering Matters
The shift from prototype to production in the AI world exposes operational challenges that traditional software deployment practices weren’t designed to handle.
Three reasons production AI agents fail where prototypes succeed:
| Challenge | Why it breaks | The cost |
|---|---|---|
| Non‑determinism | The same prompt can yield different responses across runs. An identical input can produce a correct answer, then a hallucination, then a refusal. | You can’t trust unit tests. Your CI pipeline passes, but production fails |
| Evaluation opacity | HTTP 200 tells you nothing about quality. The model might have hallucinated facts, leaked sensitive information, or generated grammatical nonsense. | Your dashboards look green while your users get wrong answers |
| Configuration as code | A prompt change is not a config tweak — it’s an architectural change that can break three other behaviours for every one it fixes. | Silent regressions that only show up on real traffic |
Because of these challenges, building production AI agents requires LLMOps — operational practices that draw from both software engineering (observability, deployment pipelines, security) and machine learning (evaluation frameworks, model optimisation).
Production Agent Lifecycle
The production agent lifecycle is not a linear sequence — it’s a continuous loop. Every stage feeds back into the one before.
Stage breakdown:
- Development — Build the agent locally. Hard‑code prompts, mock tools, use sample inputs.
- Testing / Evaluation — Run the agent against a fixed dataset (ground truth). Measure accuracy, groundedness, latency. Fail the PR if scores drop below threshold.
- Deployment — Progressive rollout: shadow → canary → percentage → full. Each stage answers a different evaluation question.
- Monitoring — Real‑time metrics: success rate, latency per step, cost per request, tool error rate. Alerts on anomalies.
- Iteration — Use production traces to build better evals, refine prompts, and optimise costs.
The rest of this guide focuses on making each stage production‑ready.
Core Production Capabilities
Every production agent system must implement seven core capabilities.
| Capability | What it does | Why you cannot skip it |
|---|---|---|
| Observability | Traces every LLM call, tool invocation, retrieval, and decision point | Without traces, you cannot debug a single production failure |
| Evaluation | Compares agent outputs against ground truth or quality criteria | The only way to know if a prompt change actually improved behaviour |
| Monitoring | Real‑time metrics, alerts, and dashboards | You cannot fix what you do not measure |
| Deployment | Canary releases, blue‑green switches, instant rollback | A bad prompt update should affect 1% of users, not 100% |
| Reliability | Retries, circuit breakers, timeouts, fallbacks | External APIs fail. Your agent must survive |
| Security | Authentication, authorisation, prompt injection defence | OWASP LLM Top 10 #1: prompt injection is the most critical vulnerability |
| Cost Control | Token caching, model routing, per‑route budgets | Production API bills can explode without guardrails |
Let’s examine each in depth.
Observability for AI Agents
Observability is the foundation. You cannot evaluate, debug, or improve what you cannot see. Tracing is the bare minimum an agent needs before you can start improving it.
The Five Things You Must Instrument
| What to trace | Why it matters | What to capture |
|---|---|---|
| Every LLM call | Prompts change behaviour; you need to see what was sent and what came back | Prompt, completion, tokens, cost, latency, model version |
| Every tool invocation | Tool failures are the most common silent failure | Tool name, input parameters, output, error, duration |
| Every retrieval | RAG pipelines degrade silently; retrieval quality directly affects answer quality | Query, retrieved chunks, relevance scores |
| Application metadata | Which user saw which response? Which session? | user_id, session_id, request_id, workflow_id |
| Key decision points | Why did the agent choose tool A over tool B? What was its reasoning? | The agent’s chain‑of‑thought before each action |
Tracing Architecture
Modern observability stacks use OpenTelemetry to collect traces and export them to a backend (Grafana Cloud, Langfuse, Elastic, or self‑hosted Tempo). The OpenAI Agents SDK, LangGraph, and Semantic Kernel can all emit OpenTelemetry‑compatible traces, either natively or through integrations.
What a trace looks like (simplified):
Trace ID: tr_abc123
Span 1: triage_agent
├─ Input: "Refund my last order"
├─ Output: "Call billing_agent"
└─ Duration: 120ms
Span 2: billing_agent (handoff)
├─ LLM call: prompt="Customer requests refund for order #12345"
├─ Tool call: get_order(order_id="12345")
│ ├─ Input: {"order_id": "12345"}
│ ├─ Output: {"status": "delivered", "date": "2025-06-01"}
│ └─ Duration: 45ms
├─ Tool call: create_return(order_id="12345")
│ ├─ Input: {"order_id": "12345", "reason": "defective"}
│ ├─ Output: {"return_id": "R123", "label_url": "..."}
│ └─ Duration: 230ms
└─ Duration: 890ms
With this trace, you can answer: Which tool failed? Why did the agent choose it? How long did each step take? How many tokens did we burn?
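To make the span structure above concrete, here is a minimal standard-library sketch of how nested spans with attributes, durations, and errors might be recorded. The `Tracer` class and its field names are illustrative assumptions, not the OpenTelemetry API; a real deployment would use the OpenTelemetry SDK and an exporter instead of an in-memory list.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects spans in memory; a stand-in for a real trace exporter."""

    def __init__(self):
        self.trace_id = f"tr_{uuid.uuid4().hex[:8]}"
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "trace_id": self.trace_id,
            "name": name,
            "attributes": dict(attributes),
            "error": None,
        }
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["error"] = repr(exc)  # keep the failure on the span
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)  # spans are stored in completion order


tracer = Tracer()
with tracer.span("billing_agent", user_id="u_1"):
    with tracer.span("tool:get_order", order_id="12345") as s:
        s["attributes"]["output"] = {"status": "delivered"}
```

Because each span carries the shared trace ID, the tool name, its inputs, and its duration, questions such as "which tool failed?" reduce to filtering `tracer.spans`.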
Platforms
- LangSmith / Langfuse — purpose‑built for LLM traces, with prompt versioning and cost tracking
- Grafana Cloud + OpenTelemetry — full observability stack for teams already using Grafana
- Elastic — unified logs, metrics, and APM traces with ML anomaly detection
- Self‑hosted — OpenTelemetry Collector + Jaeger/Tempo + Prometheus + Loki
Evaluation Systems
Traditional software has clear success criteria. HTTP 200 means success; 500 means failure. With LLMs, a 200 OK response tells you almost nothing about quality. You need semantic evaluation.
Two Modes of Evaluation
| Mode | When to use | How it works | Tools |
|---|---|---|---|
| Offline (supervised) | Before deploying a prompt or model change | Run agent against a fixed dataset of input/expected‑output pairs. Compute metrics (accuracy, groundedness, etc.) | RAGAS, DeepEval, custom LLM‑as‑judge |
| Online (continuous) | After deployment, on production traffic | Run unsupervised evals on live traces using only the agent’s own context (no ground truth needed). Alert on anomalies | LLM‑as‑judge, custom scoring functions |
Offline Evaluation with LLM‑as‑Judge
LLM‑as‑a‑Judge (LLMaaJ) has emerged as the leading technique to overcome the challenge of building a reference‑based evaluation engine. It shows decent correlation with human judgment and can evaluate properties that traditional metrics miss.
Properties LLMaaJ can evaluate:
- Groundedness — is the output based on facts in the retrieved context?
- Faithfulness — does the output contradict the source?
- Safety — is the content harmful or unsafe?
- Answer relevance — does the response actually answer the question?
Implementation example (pseudo‑code):
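def llm_judge(question: str, answer: str, context: str) -> float:
    """Return a groundedness score in [0, 1]. `llm` and `parse_score` are placeholders."""
    prompt = f"""
You are an unbiased evaluator. Score the following answer on groundedness from 0 to 10.
Question: {question}
Answer: {answer}
Context: {context}
Score (0–10):"""
    response = llm.generate(prompt)  # placeholder client call
    # Normalise to 0–1 so the score is comparable with thresholds such as 0.85
    return parse_score(response) / 10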
Best practices for LLM judges:
- Make evals binary (pass/fail) rather than scored on a range; LLMs are inconsistent raters on numeric scales
- Make evals specific, not generic — “did the agent reference information not in the retrieved docs?” beats “was the response good?”
- Provide examples in the eval prompt, especially edge cases
- Validate that your judge correlates with human judgment (Cohen’s Kappa > 0.6)
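The last point can be checked with a few lines of code. The sketch below computes Cohen's kappa between judge verdicts and human labels on the same examples; the function name and the sample labels are illustrative, not taken from any particular evaluation library.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Agreement between two raters, corrected for chance agreement."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Probability that both raters pick the same label by chance
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail", "pass", "fail"]
human = ["pass", "pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "fail"]
kappa = cohens_kappa(judge, human)
```

In this made-up sample the raters agree on 8 of 10 items and chance agreement is 0.5, so kappa comes out at about 0.6, right at the threshold suggested above.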
Continuous Evaluation (Unsupervised)
Continuous evals run against live production traces. They assess behaviour using only the agent’s own context, so they don’t need a known correct answer.
Common continuous eval patterns:
- PII leakage detection — does the response contain email addresses, phone numbers, or credit cards?
- Refusal rate — does the agent refuse to answer questions it should answer?
- Tool error rate — are tool calls failing more often than usual?
- Latency deviation — is a step suddenly taking much longer?
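As an illustration of the first pattern, a PII check can start as a handful of regular expressions run over every response. The patterns below are deliberately rough assumptions for the sketch (they will produce both false positives and false negatives) and would normally be replaced or backed by a dedicated PII detection service.

```python
import re

# Rough illustrative patterns; tune or replace them for real traffic.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def detect_pii(text: str) -> list:
    """Return the sorted names of all PII categories found in the text."""
    return sorted(
        name for name, pattern in PII_PATTERNS.items() if pattern.search(text)
    )
```

A non-empty result from `detect_pii(response)` would then feed an eval such as `pii_detected`, which can block the response and raise an alert.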
Use a scores table to track evaluation results over time:
| Eval Name | Threshold | Action on breach |
|---|---|---|
| groundedness | < 0.85 | Page on‑call |
| tool_error_rate | > 10% over 5 min | Auto‑rollback |
| pii_detected | any | Block response + alert |
Monitoring Agent Behaviour
Production agents require real‑time monitoring of both traditional infrastructure metrics and agent‑specific behavioural signals.
Agent‑Specific Metrics
| Metric | What it measures | Alert when |
|---|---|---|
| agent_success_rate | Percentage of requests that complete without error or timeout | < 99% over 5 min |
| agent_latency_p95 | Time from request to final response, per workflow type | > 5 seconds |
| tool_error_rate | Percentage of tool calls that return an error | > 5% |
| llm_token_usage | Input + output tokens per request, broken down by model | Sudden spike > 2x baseline |
| cost_per_request | Dollar cost per agent invocation | > $0.05 for simple workflows |
| retry_rate | Percentage of LLM calls or tool calls that are retried | > 10% |
Traditional Infrastructure Metrics
Do not ignore the basics:
| Metric | Tool |
|---|---|
| HTTP request rate, error rate, latency | Prometheus + Grafana |
| CPU, memory, GPU utilisation | Prometheus node exporter |
| Queue length, consumer lag | Broker metrics (Kafka, RabbitMQ) |
| Database connection pool | Database exporter |
Alerting Philosophy
Because agent behaviour is non‑deterministic, alerting needs to adapt to it. Fixed static thresholds often fail.
Better approach: statistical anomaly detection. Track a 7‑day rolling baseline for each metric. Alert when the current value deviates beyond N standard deviations from the baseline.
# Example anomaly detection rule
- name: "latency_anomaly"
  measurement: agent_latency_p95
  baseline_window: 7d
  deviation_threshold: 3σ
  action: on_loose_change
Combine with automated rollback when critical metrics breach thresholds (e.g., false positives exceed a critical limit).
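The rolling-baseline check described in this section can be sketched in a few lines. The function name, the sample values, and the choice of the sample standard deviation are assumptions made for illustration; a production rule would typically run inside the metrics backend rather than in application code.

```python
from statistics import mean, stdev


def is_anomalous(baseline, current, n_sigma=3.0):
    """True if `current` is more than n_sigma standard deviations from the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return current != mu  # perfectly flat baseline: any change is a deviation
    return abs(current - mu) > n_sigma * sigma


# Hypothetical daily p95 latencies (seconds) over the last 7 days
baseline_p95 = [2.1, 2.3, 2.2, 2.4, 2.0, 2.2, 2.3]
```

With this baseline, a reading of 2.4 seconds stays inside the band while 4.0 seconds is flagged, even though both are below a static 5-second threshold.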
Deployment Strategies
Deploying a prompt change is not a config update — it is a partial behaviour change on a non‑deterministic system. The standard software deployment playbook does not apply.
The Four‑Stage Gate
| Stage | Traffic | User impact | Eval question | Promotion gate |
|---|---|---|---|---|
| Shadow | 100% mirrored | Zero | Does the candidate behave wildly differently on real traffic? | Per‑rubric distribution within 1 point of production over 24‑72h |
| Canary | 1–5% live | Tier‑stratified | Is the candidate at least as good with users in the loop? | Containment of false resolution within noise floor |
| Percentage | 10, 25, 50% live | Broader | Are per‑rubric deltas statistically significant on prod data? | Welch’s t‑test p > 0.05 on each rubric vs 7‑day baseline |
| Full | 100% live | All users | Does the candidate hold the line under load with auto‑rollback armed? | Guardrail trip rate, rubric rolling mean, p99 latency hold for 48‑72h |
Key insight: Skipping a stage is the cheap‑and‑fast failure mode. A green check at stage one does not answer stage two’s question.
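One way to implement the canary and percentage stages is deterministic bucketing: hash each user into one of 100 buckets and send the lowest buckets to the candidate. The sketch below assumes a string user ID and a per-rollout salt (`"prompt-v2"` is a made-up name); tier stratification would need additional logic on top.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "prompt-v2") -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def use_candidate(user_id: str, rollout_percent: int, salt: str = "prompt-v2") -> bool:
    """True if this user should be served by the candidate version."""
    return rollout_bucket(user_id, salt) < rollout_percent
```

Because the bucket depends only on the salt and the user ID, a user who saw the candidate at 5% keeps seeing it at 25% and 50%, and setting the percentage to 0 acts as an instant rollback switch.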
Blue‑Green Deployment
Maintain two identical environments (blue = current, green = updated). Route traffic to green only after validation, then switch back to blue instantly if issues arise.
Version Control
- Store prompts externally (not hard‑coded)
- Use explicit versions and environment tags (dev, staging, production)
- Maintain a model registry (MLflow, Weights & Biases) with versioned models
- Test new prompt versions against a regression dataset before promotion
Rollback Gotcha
Semantic caches key responses by the embedding of the user query, not by the prompt version that produced them. When you roll back a prompt change, the cache may continue serving responses generated by the bad prompt for hours. Your rollback plan must include cache invalidation.
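One way to make rollback safe is to include the prompt version in every cache key, so entries written under one version are never served under another. The sketch below uses an exact-match cache for brevity; `VersionedCache` and the version strings are hypothetical, and a semantic cache would apply the same idea by storing the version as a filter alongside each embedding.

```python
import hashlib


def cache_key(prompt_version: str, query: str) -> str:
    """Derive a key from the prompt version and the normalised query text."""
    normalised = " ".join(query.lower().split())
    return hashlib.sha256(f"{prompt_version}\x00{normalised}".encode()).hexdigest()


class VersionedCache:
    """Exact-match response cache partitioned by prompt version."""

    def __init__(self):
        self._store = {}

    def get(self, prompt_version: str, query: str):
        return self._store.get(cache_key(prompt_version, query))

    def put(self, prompt_version: str, query: str, response: str) -> None:
        self._store[cache_key(prompt_version, query)] = response


cache = VersionedCache()
cache.put("v2", "Where is my refund?", "answer generated by the v2 prompt")
```

After a rollback from `v2` to `v1`, lookups use the `v1` partition, so the responses produced by the bad prompt are simply never hit and can expire on their own.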
Reliability Engineering
In AI applications handling millions of LLM calls, failure is routine. Transient timeouts, flaky responses, and latency spikes are everyday events. You need a layered resilience strategy.
Retries (Transient Failures)
Use retries for temporary glitches: network instability, TLS handshake failures, brief provider rate limits, cold starts.
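import asyncio
import random

class TransientError(Exception):
    """Placeholder for errors worth retrying (timeouts, rate limits, connection resets)."""

async def call_with_retry(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus a little jitter
            delay = (base_delay * (2 ** attempt)) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)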
Use exponential backoff to reduce pressure on the provider. Some platforms support Retry-After headers — honour them.
Fallbacks (Provider Diversity)
If your primary model fails, switch to a secondary provider or a smaller (cheaper) model. Use for provider overload, temporary degradation, or cost optimisation.
Caveat: Fallbacks are reactive. The system checks the primary every time, even if it is failing, before routing to the fallback — which adds latency.
Circuit Breakers (Persistent Failures)
Retries and fallbacks try to recover from failures. Circuit breakers prevent a bad situation from spiralling further by automatically cutting off traffic to unhealthy components.
Circuit breaker states:
- Closed — requests flow normally; failures increment a counter.
- Open — after N failures in a time window, all requests fail immediately (fast fail).
- Half‑open — after a timeout, one test request is allowed; if it succeeds, the circuit closes.
Implementation:
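from pybreaker import CircuitBreaker

# Open after 5 consecutive failures; allow a trial call after 30 seconds.
breaker = CircuitBreaker(fail_max=5, reset_timeout=30)

# pybreaker's decorator guards synchronous callables, so `client` is assumed to be
# a synchronous client here; asyncio code needs an asyncio-aware breaker instead.
@breaker
def call_llm(prompt: str) -> str:
    return client.chat.completions.create(...)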
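For asyncio code, the three states listed above can also be sketched by hand. This is a simplified illustration rather than a library API: `AsyncCircuitBreaker` and `CircuitOpenError` are names invented here, it counts consecutive failures instead of failures within a time window, and it does not limit the half-open state to a single concurrent probe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class AsyncCircuitBreaker:
    def __init__(self, fail_max=5, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed probe, or too many failures while closed, (re)opens the circuit
            if self.state == "half-open" or self._failures >= self.fail_max:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the counter
        self._failures = 0
        self._opened_at = None
        return result
```

The injectable `clock` exists only so the timeout behaviour can be tested without sleeping.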
Bulkheads (Concurrency Limits)
Partition resources so that a failure in one agent does not cascade across the system. Each agent workflow gets a dedicated concurrency pool.
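import asyncio

support_pool = asyncio.Semaphore(10)  # one pool per workflow, not a shared default argument

async def with_bulkhead(coro, semaphore: asyncio.Semaphore):
    async with semaphore:
        return await coro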
Idempotency
Ensure that retries do not cause duplicate side effects. Store processed request_ids or workflow_ids with a TTL.
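A minimal in-memory sketch of that idea follows. `IdempotencyStore` is an illustrative name, the store lives in a single process, and it does not guard against two identical requests arriving concurrently; a shared store with an atomic set-if-absent operation would be needed for that.

```python
import time


class IdempotencyStore:
    """Remembers results by request ID for a limited time."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._results = {}  # request_id -> (expires_at, result)

    def run_once(self, request_id, action):
        """Run `action` unless an unexpired result for this request ID exists."""
        now = self._clock()
        cached = self._results.get(request_id)
        if cached is not None and cached[0] > now:
            return cached[1]  # duplicate: return the stored result, no side effect
        result = action()
        self._results[request_id] = (now + self._ttl, result)
        return result
```

Wrapping a state-changing tool call in `run_once` means a retried request with the same ID returns the original result instead of, say, creating a second return label.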
Security in Production Agents
OWASP LLM Top 10 lists prompt injection as the most critical vulnerability in production LLM systems. 73%+ of LLM deployments are vulnerable, with 50–84% attack success rates in real‑world red team evaluations.
Prompt Injection Prevention
What is prompt injection? An attacker supplies malicious text designed to override the instructions given to the assistant by its developer — tricking the agent into ignoring its system prompt, executing unauthorised tool calls, or revealing sensitive data.
Production defences:
| Defence | Implementation | Effectiveness |
|---|---|---|
| Input guardrails | Scan all user inputs for injection patterns before they reach the LLM | Blocks obvious attacks at the edge |
| Type‑directed privilege separation | Convert untrusted data to safe data types with limited scope, eliminating the possibility of injection | Systematic (design‑level) defence |
| Separate guard model | A smaller model runs in front of the primary LLM to detect and block injection attempts | Used by many production systems |
| Output guardrails | Scan all LLM outputs before returning to the user | Catches leaked secrets or inappropriate content |
Implementation example with llm‑injection‑guard:
from llm_injection_guard import PromptScanner, InjectionDetectedError  # exception import path is assumed

scanner = PromptScanner(block_on_detection=True)

def handle_user_input(user_input: str):
    try:
        scanner.scan(user_input)  # raises if an injection pattern is detected
    except InjectionDetectedError as e:
        print(f"Blocked! Threat level: {e.threat_level}")
        return {"error": "Input contains unsafe content"}
    return llm.chat(user_input)  # safe: pass to the LLM
Authentication and Authorisation
- Every agent request must be authenticated (JWT, API key, mTLS)
- Every tool call must be authorised against the caller’s identity
- Agents often require dynamic scope changes during execution — different resources at different times
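The second point is worth a sketch, because the identity used for authorisation should come from the authenticated session and never from arguments the model generated. Everything below (the `ORDERS` data, `execute_tool`, the error type) is hypothetical and only illustrates the shape of the check.

```python
class AuthorisationError(Exception):
    """Raised when a caller tries to access a resource they do not own."""


# Hypothetical order store; a real system would query a database.
ORDERS = {"12345": {"owner": "user_a", "status": "delivered"}}


def get_order(caller_id: str, order_id: str) -> dict:
    order = ORDERS.get(order_id)
    # Same error for "missing" and "not yours", so callers cannot probe for IDs
    if order is None or order["owner"] != caller_id:
        raise AuthorisationError(f"caller may not access order {order_id}")
    return order


TOOLS = {"get_order": get_order}


def execute_tool(caller_id: str, tool_name: str, llm_args: dict):
    """Run a tool with the session identity, ignoring any identity the model supplied."""
    args = {k: v for k, v in llm_args.items() if k != "caller_id"}
    return TOOLS[tool_name](caller_id=caller_id, **args)
```

Even if injected text convinces the model to emit `caller_id="user_a"` in its tool arguments, the dispatcher discards it and the ownership check still runs against the real session.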
Sensitive Data Protection
- Never log full messages containing PII, credentials, or API keys
- Redact fields like password, secret, credit_card, authorization before logging
- Encrypt sensitive payload fields at rest
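Field redaction can be applied as a small recursive pass over the payload before it reaches the logger. The key list mirrors the fields named above; the `[REDACTED]` marker and the function name are assumptions of this sketch.

```python
SENSITIVE_KEYS = {"password", "secret", "credit_card", "authorization"}


def redact(value):
    """Return a copy of a nested dict/list structure with sensitive fields masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


event = {
    "user_id": "u_1",
    "headers": {"Authorization": "Bearer abc123"},
    "steps": [{"tool": "login", "password": "hunter2"}],
}
safe_event = redact(event)
```

Note that this only catches sensitive values stored under known keys; secrets embedded in free text, such as a prompt body, need content scanning as well.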
Compliance
- Most EU AI Act obligations begin to apply in August 2026; organisations deploying high‑risk systems should be prepared to demonstrate robustness and cybersecurity measures, including prompt injection defences
- Maintain immutable audit trails (SHA256‑hashed logs) for SOC2, GDPR, and EU AI Act
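The text does not specify a hashing scheme for the audit trail; one common approach, assumed here, is a hash chain in which every entry includes the SHA‑256 hash of the previous one, so editing or removing an earlier record invalidates everything after it. The function names and entry layout are illustrative.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append(
        {"event": event, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, event)}
    )


def verify(log: list) -> bool:
    """Recompute the chain and report whether every entry is intact."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this makes tampering detectable rather than impossible, so it is usually paired with write-once storage or periodic anchoring of the latest hash somewhere outside the system.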
Security Checklist for Production Agents
- All user inputs are scanned for injection patterns before LLM processing
- All LLM outputs are scanned for secrets and PII before returning to user
- Every request includes authentication (JWT, API key, or mTLS)
- Every tool call includes authorisation checks
- Logs are redacted (no passwords, no PII, no API keys)
- Secrets are stored in a vault (not in code or .env)
- Audit trail captures who called which tool, when, and what was returned
Cost Optimization
Production API bills can explode without guardrails. A platform team’s LLMOps bill once hit $87,000 in March on a $40,000 budget — $52,000 of which was online judge tokens alone.
Eight Levers for Cutting LLM Costs
| Lever | Typical cut | Where it lives |
|---|---|---|
| Distilled judges for online scoring | 30–50% | Eval platform |
| Semantic cache on repetitive routes | 20–40% | Gateway |
| Smaller default model + eval‑gated routing | 30–50% | Gateway |
| Provider prompt caching | 30–60% input | Provider SDK |
| Tail‑based trace sampling per route | 15–30% | OTel collector |
| Tiered trace retention | 10–20% | Storage layer |
| Batched offline evals | 50% | Eval pipeline |
| Per‑route token budgets | bounds spike | Gateway |
Caching Strategies
Semantic cache stores embeddings of previous queries and their responses. For a new query, retrieve the semantically closest cached response and return it — no LLM call. This cuts costs by 20–40% for repetitive routes.
Provider prompt caching — OpenAI, Anthropic, and Gemini offer caching discounts of 30–60% on input tokens when the same prompt prefix is reused across requests.
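The semantic cache lookup described above can be sketched with cosine similarity over stored query vectors. The `toy_embed` function is a stand-in invented for this example (a word count over a tiny fixed vocabulary); a real cache would call an embedding model and use a vector index instead of a linear scan, and the 0.9 threshold is an arbitrary assumption.

```python
import math
import re

VOCAB = ["refund", "order", "password", "reset", "cancel"]


def toy_embed(text: str) -> list:
    """Toy embedding: counts of a few known words. Not a real embedding model."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self._embed = embed
        self._threshold = threshold
        self._entries = []  # list of (query_vector, response)

    def put(self, query: str, response: str) -> None:
        self._entries.append((self._embed(query), response))

    def get(self, query: str):
        """Return the closest cached response if it is similar enough, else None."""
        vector = self._embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self._entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None
```

The threshold is the main design lever: set too low, users receive answers to questions they did not ask; set too high, the hit rate collapses and the cache saves little.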
Model Routing
Use a small, cheap model as the default. Route only to a large, expensive model when necessary.
def route_request(query: str) -> str:
    # Cheap model first
    if is_simple_question(query):
        return small_model.chat(query)
    # Escalate to large model
    return large_model.chat(query)
Applied correctly, this cuts costs by 30–50% while maintaining quality.
Per‑Request Cost Tracking
Capture token counts (input, output, cache_read, cache_creation) and compute cost from a maintained price book.
class CostTracker:
    def __init__(self, metrics):
        self.metrics = metrics  # any client exposing record_cost(cost)
        self.price_book = {"gpt-4o": {"input": 0.005, "output": 0.015}}  # USD per 1K tokens

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        prices = self.price_book[model]
        cost = (input_tokens / 1000) * prices["input"] + \
               (output_tokens / 1000) * prices["output"]
        # Send to metrics system
        self.metrics.record_cost(cost)
        return cost
Budget Controls
Set per‑route token budgets. If a workflow exceeds its budget, the gateway rejects the request or escalates for human review.
budgets:
  - route: "/chat/support"
    max_tokens_per_request: 4000
    max_cost_per_request: 0.02
  - route: "/chat/research"
    max_tokens_per_request: 16000
    max_cost_per_request: 0.10
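Enforcement of a configuration like this can be a single check at the gateway. The sketch below hard-codes the two routes from the example instead of parsing YAML; `Budget`, `BudgetExceeded`, and `enforce_budget` are illustrative names, and rejecting unknown routes is a design assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_tokens_per_request: int
    max_cost_per_request: float


class BudgetExceeded(Exception):
    """Raised when a request would exceed its route's budget."""


BUDGETS = {
    "/chat/support": Budget(max_tokens_per_request=4000, max_cost_per_request=0.02),
    "/chat/research": Budget(max_tokens_per_request=16000, max_cost_per_request=0.10),
}


def enforce_budget(route: str, tokens: int, cost: float) -> None:
    """Raise if the request exceeds the token or cost budget for its route."""
    budget = BUDGETS.get(route)
    if budget is None:
        raise KeyError(f"no budget configured for route {route!r}")
    if tokens > budget.max_tokens_per_request:
        raise BudgetExceeded(f"{route}: {tokens} tokens exceeds {budget.max_tokens_per_request}")
    if cost > budget.max_cost_per_request:
        raise BudgetExceeded(f"{route}: ${cost:.4f} exceeds ${budget.max_cost_per_request:.2f}")
```

Failing closed on routes with no configured budget keeps a newly added endpoint from running unbounded until someone remembers to add it.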
Testing Production Agents
Traditional testing approaches do not apply to non‑deterministic systems. You need a layered test strategy.
Unit Testing
Test the deterministic parts in isolation: tool input validation, prompt formatting, data transformation.
import pytest

def test_tool_input_validation():
    with pytest.raises(ValidationError):
        get_order(order_id="")  # empty ID should fail
Integration Testing
Spin up a real agent instance (in a test container) and verify behaviour against a controlled dataset.
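async def test_agent_on_fixed_dataset():
    agent = Agent()
    for question, expected in test_dataset:
        response = await agent.run(question)
        # The expected answer serves as the judge's reference context
        score = llm_judge(question, response, expected)
        # Threshold assumes llm_judge returns a score normalised to 0–1
        assert score > 0.85, f"score {score:.2f} for {question!r}"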
End‑to‑End Testing
Run a full workflow against a production‑like environment (including real tools with test accounts, but isolated from production data).
Failure Testing (Chaos)
Inject failures to verify retry, circuit breaker, and fallback logic:
from unittest.mock import patch

import pytest

@pytest.mark.asyncio
async def test_circuit_breaker():
    with patch("llm_client.call", side_effect=APITimeoutError):
        # Five consecutive failures should trip the breaker
        for _ in range(5):
            with pytest.raises(APITimeoutError):
                await agent.run("test query")
        # Circuit should now be open
        with pytest.raises(CircuitOpenError):
            await agent.run("test query")
Evaluation Regression Testing
Before deploying a prompt change, run offline evals against your test dataset. Fail the CI pipeline if aggregated scores drop below thresholds.
# .github/workflows/agent-eval.yml
- name: Run offline evals
  run: python evaluate.py --dataset=test_data.json
- name: Check thresholds
  run: |
    python -c "from eval import check; check(accuracy=0.85, groundedness=0.80)"
Production Best Practices
These ten engineering recommendations come from real production deployments.
| Practice | Why it matters | Implementation |
|---|---|---|
| Always log agent decisions | Without traces, you cannot debug a single production failure | Instrument every LLM call, tool call, retrieval, and decision point |
| Track tool usage | Tool failures are the most common silent failure; you need to know which tool is failing | Log tool name, input, output, duration, error, per tool |
| Measure latency per step | A multi‑step agent can have one slow step that kills overall P99 | Record start and end timestamps for each span |
| Use structured outputs | LLMs return free text; parsing fails silently | Use JSON mode or constrained decoding (e.g., Instructor, Outlines) |
| Version prompts and tools | A prompt change can break three other behaviours for every one it fixes | Store prompts externally with version tags; use semantic versioning for tool schemas |
| Implement idempotency | Retries cause duplicate side effects | Store processed request_id in a key‑value store with TTL |
| Set timeouts at every level | An agent that hangs indefinitely kills the caller | Connection timeout (5s), request timeout (30s), workflow timeout (60s) |
| Use circuit breakers | Retries and fallbacks are not enough when a provider is persistently degraded | Trip after N failures in M seconds; probe after timeout |
| Monitor cost per request | API bills explode without guardrails | Capture token counts from every LLM call; compute cost from price book |
| Test failure scenarios | Your agent will fail in production; make sure it fails gracefully | Inject timeouts, API errors, malformed responses in integration tests |
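The layered timeouts in the table can be sketched with `asyncio.wait_for`, with one timeout around each step and another around the whole workflow. The function name and the representation of steps as zero-argument coroutine functions are assumptions of this example; connection-level timeouts would be configured on the HTTP client itself.

```python
import asyncio


async def run_workflow(steps, step_timeout=30.0, workflow_timeout=60.0):
    """Run steps in order, bounding each step and the workflow as a whole."""

    async def run_all():
        results = []
        for step in steps:
            # Per-step bound: one slow LLM or tool call cannot consume the whole budget
            results.append(await asyncio.wait_for(step(), timeout=step_timeout))
        return results

    # Workflow bound: many individually acceptable steps still cannot run forever
    return await asyncio.wait_for(run_all(), timeout=workflow_timeout)
```

Either bound raises `asyncio.TimeoutError`, which the caller can treat as a transient failure for retry or fallback purposes.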
Common Production Mistakes
| Mistake | Why it breaks production | How to fix |
|---|---|---|
| No observability | When a user reports a bad response, you have no way to know whether it was a bad prompt, bad retrieval, bad tool call, or something else | Instrument everything before the first production deploy |
| No evaluation system | You cannot tell if a prompt change made things better or worse | Build an offline eval dataset and run it in CI |
| No fallback strategy | A downstream API outage stops the entire agent | Implement retries, circuit breakers, and fallback models |
| Over‑reliance on LLM output | LLMs are not databases; they hallucinate | Validate outputs against known facts; use tools for factual lookups |
| Ignoring cost | The first production API bill will shock you | Set per‑route budgets; cache aggressively; use small models where possible |
| No guardrails | Prompt injection is the #1 LLM security risk | Scan all user inputs before they reach the LLM |
| Hard‑coded prompts | Every prompt tweak requires a full redeploy; changes risk silent regressions | Store prompts externally with versioning; test before promoting |
| No rollback plan for caches | Roll back the code, but the cache keeps serving bad responses for hours | Invalidate cache on rollback; version cache keys with prompt hash |
Case Study: Production Customer Support Agent
The system: A customer‑support agent that processes 10,000 tickets per day. It routes tickets to specialised agents (billing, technical, returns), retrieves knowledge base articles, and generates draft responses — with human approval for high‑value actions.
Observability Pipeline
Every request is traced with OpenTelemetry. The collector exports to both Langfuse (for developer debugging) and Grafana Tempo (for platform‑level aggregation). Traces capture:
- User ID and session ID for attribution
- Every LLM call (prompt, completion, tokens, model, latency)
- Every tool call (knowledge base search, order lookup, return creation)
- The agent’s reasoning before each action (chain‑of‑thought)
Key dashboard: “Top failing tools” — shows which tool calls are erroring most frequently, segmented by agent type.
Evaluation Loop
Offline evaluation runs on every pull request:
- Dataset: 1,000 labelled support interactions
- Metrics: accuracy (exact match on category), groundedness (LLM‑as‑judge), latency
- CI blocks the PR if accuracy drops below 92% or groundedness below 0.85
Continuous evaluation runs on 1% of production traffic:
- Checks for PII leakage (email addresses, phone numbers)
- Measures refusal rate (is the agent refusing to help valid requests?)
- Alerts on anomaly detection (latency spikes, error rate changes)
Deployment Strategy
Four‑stage gate for every prompt or model change:
- Shadow (24h) — new agent runs in parallel with production; zero user impact. Compare outputs.
- Canary (5% traffic, 6h) — tier‑stratified by user segment. Monitor error rate and satisfaction.
- Percentage (25% → 50% → 100%, 12h each) — expand slowly. Compare rubrics against 7‑day baseline.
- Full — armed with auto‑rollback if error rate exceeds 2% or latency exceeds 5 seconds.
Monitoring Dashboard (Prometheus + Grafana)
Metrics:
- agent_success_rate: 99.2% (target >99%)
- agent_p95_latency: 2.3s (target <4s)
- tool_error_rate_by_tool:
  - knowledge_search: 0.5%
  - get_order: 1.2%
  - create_return: 0.1%
- llm_token_usage_per_request: 1,200 input / 400 output
- cost_per_request: $0.008
Alerts:
- PagerDuty if success rate drops below 99% for 5 minutes
- Slack notification if any tool error rate exceeds 5%
- Daily cost report to finance team
Reliability Mechanisms
- Retries — exponential backoff for transient LLM API errors (3 retries: 1s, 2s, 4s)
- Circuit breaker — trips after 10 failures in 60 seconds; stays open for 30 seconds before probing
- Fallbacks — if primary model (GPT‑4o) fails, route to GPT‑4o‑mini for degraded but working responses
- Timeouts — 30 seconds per LLM call; 60 seconds total workflow
- Idempotency — store request_id in Redis; duplicate requests return the cached response
Security Controls
- All user inputs scanned by llm‑injection‑guard before LLM processing
- Tool calls authorised against caller’s identity (a user cannot look up another user’s order)
- Outputs scanned for PII before being returned
- Full audit trail: every tool call logged with user_id, timestamp, parameters, result
Cost Optimisation
- Semantic cache for frequently asked questions (returns the cached answer when a query’s embedding closely matches a cached one, skipping the LLM call) — avoids an LLM call on 30% of requests
- Model routing — simple questions go to GPT‑4o‑mini ($0.0005 per 1K tokens); complex reasoning goes to GPT‑4o ($0.005 per 1K tokens)
- Per‑request token budget — refuse requests exceeding 4,000 input tokens
- Monthly cost: $2,500 for 300,000 requests ($0.008 per request)
Result
- Uptime: 99.92%
- Accuracy: 94% of customer issues resolved without human escalation
- P95 latency: 2.3 seconds
- Monthly cost: $2,500 (down from $8,000 before optimisation)
- Security: 0 prompt injection breaches in 6 months
Production Readiness Checklist
Use this checklist before deploying any AI agent to production.
Observability
- Every LLM call is traced (prompt, completion, tokens, cost, latency, model version)
- Every tool invocation is traced (name, input, output, error, duration)
- Every retrieval is traced (query, chunks, relevance scores)
- Application metadata is attached (user_id, session_id, workflow_id)
- Key decision points are traced (agent reasoning before each action)
- Traces are exported to a central backend (Langfuse, Grafana, Elastic)
Evaluation
- Offline eval dataset exists (at least 500 labelled examples)
- LLM‑as‑judge evals are calibrated against human judgment
- CI pipeline runs offline evals on every PR; blocks on score drop
- Continuous evals run on production traces (ungroundedness, PII leakage, refusals)
Monitoring
- Success rate, latency (P50/P95/P99), error rate are monitored
- Per‑tool error rates are tracked
- Token usage and cost per request are tracked
- Dashboards exist for development, operations, and business stakeholders
- Alerts are configured with anomaly detection (not static thresholds only)
Deployment
- Staged rollout capability exists (shadow → canary → percentage → full)
- Canary deploys to 1–5% of traffic first
- Blue‑green or rolling updates are configured
- Auto‑rollback triggers on metric breach (error rate, latency, groundedness)
- Cache invalidation is part of rollback plan
Reliability
- Retries with exponential backoff are implemented for transient failures
- Circuit breakers protect against persistent provider failures
- Timeouts are set at every level (connection, request, workflow)
- Fallback strategies exist (secondary model, degraded mode, graceful failure)
- Idempotency is implemented for state‑changing operations
Security
- Input guardrails scan all user inputs for prompt injection
- Output guardrails scan all LLM responses for PII and secrets
- Every request is authenticated (JWT, API key, or mTLS)
- Every tool call is authorised against the caller’s identity
- Logs are redacted (no passwords, no PII, no API keys)
- Secrets are stored in a vault, not in code or .env
- Audit trail exists for all security‑relevant events
Cost Control
- Token usage is tracked per request, per user, per route
- Cost is computed per request (not just estimated)
- Semantic cache is implemented for repetitive queries
- Provider prompt caching is enabled (OpenAI / Anthropic / Gemini)
- Model routing routes simple requests to smaller, cheaper models
- Per‑route token budgets are configured
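Computing cost per request, rather than estimating it, comes down to multiplying the token counts the provider reports by the price of the model that served the call. The sketch below assumes hypothetical model names and hypothetical prices in USD per million tokens; substitute your provider's current rates. `Usage`, `request_cost`, and `within_budget` are illustrative names.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; replace with real provider rates.
PRICES_PER_MTOK = {
    "small-model": {"input": 0.15, "output": 0.60},
    "large-model": {"input": 2.50, "output": 10.00},
}


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int


def request_cost(usage: Usage) -> float:
    """Cost in USD for one request, from reported token counts."""
    price = PRICES_PER_MTOK[usage.model]
    return (usage.input_tokens * price["input"]
            + usage.output_tokens * price["output"]) / 1_000_000


def within_budget(usage: Usage, route_budget_tokens: int) -> bool:
    """True if the request stayed inside its per-route token budget."""
    return usage.input_tokens + usage.output_tokens <= route_budget_tokens
```

Emitting `request_cost` as a metric tagged with user and route gives the per-request, per-user, and per-route breakdowns the checklist asks for.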
Testing
- Unit tests cover deterministic components (tool validation, prompt formatting)
- Integration tests run with real (containerised) agents
- Failure tests simulate timeouts, API errors, malformed responses
- Offline evals run on every pull request
- Load tests validate throughput and latency under peak traffic
Frequently Asked Questions
1. What makes an AI agent production‑ready?
A production agent is reliable (handles failures), observable (every decision is traced), testable (changes validated against a dataset), deployable (canary releases, instant rollback), secure (authenticated, prompt‑injection protected), and cost‑controlled (budgets, caching, model routing).
2. How do you monitor AI agents in production?
Monitor success rate, latency (P95), tool error rates, token usage, and cost per request. Use anomaly detection against a 7‑day rolling baseline; static thresholds alone are a poor fit for non‑deterministic systems.
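A rolling-baseline check can be as simple as comparing the current value with the mean and standard deviation of recent samples. This is a minimal sketch: `is_anomalous` is a hypothetical helper, the baseline is assumed to be a list of samples from the previous seven days, and the three-sigma cutoff is a common default rather than a rule.

```python
from statistics import mean, stdev


def is_anomalous(baseline, current, sigmas=3.0):
    """Flag `current` if it lies more than `sigmas` std devs from the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mu = mean(baseline)
    sd = stdev(baseline)
    if sd == 0:
        return current != mu  # flat baseline: any change is a deviation
    return abs(current - mu) > sigmas * sd
```

For example, with a P95 latency baseline hovering around one second, a reading of 1.05 seconds passes while a reading of 2 seconds is flagged.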
3. How do you evaluate agent performance?
Offline evaluation (supervised) against a labelled dataset. Online evaluation (unsupervised) on production traces with LLM‑as‑judge. Make evals binary and specific. Validate that your judge correlates with human judgment.
4. How do you control cost in production agents?
Cache aggressively (semantic cache + provider prompt caching). Route simple queries to cheap models. Set per‑route token budgets. Track cost per request. Distilled judges can cut evaluation cost by roughly 30–50%.
5. How do you deploy agent updates safely?
Four‑stage gate: shadow (0% user impact) → canary (1–5%) → percentage (10–50%) → full. Each stage answers a different evaluation question. Auto‑rollback on metric breach.
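One way to implement the traffic split behind these stages is deterministic bucketing: hash each user into one of 100 buckets and serve the new version to buckets below the stage's percentage. The sketch below is an assumption about how this could be done, not the only approach. The stage percentages (5 and 25) are picked from within the ranges above, and `bucket` and `serves_new_version` are hypothetical names.

```python
import hashlib

# Percent of users served by the new version at each stage (illustrative values).
STAGES = {"shadow": 0, "canary": 5, "percentage": 25, "full": 100}


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def serves_new_version(user_id: str, stage: str) -> bool:
    """True if this user should get the new version at the given stage."""
    return bucket(user_id) < STAGES[stage]
```

Because the bucket depends only on the user ID, a user who lands in the canary stays on the new version as the rollout widens, which keeps their experience consistent and makes traces easier to compare. In the shadow stage no user is served by the new version; it would run on mirrored traffic instead.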
6. What is the single most important observability practice?
Trace every LLM call, every tool invocation, every retrieval, and every decision point. Without traces, you cannot debug a single production failure.
7. Do I need a circuit breaker for my agent?
Yes, if you make external calls. Retries and fallbacks are reactive; circuit breakers prevent a persistent failure from cascading across the system.
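A circuit breaker can be sketched as a small state machine: closed while calls succeed, open after a run of consecutive failures, and half-open once a cool-down has passed. The class below is a minimal single-threaded illustration. `CircuitBreaker`, `CircuitOpenError`, and the default threshold and timeout are assumptions for the example, not values from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("provider circuit is open")
            # Cool-down elapsed: half-open, allow one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

While the circuit is open, calls fail immediately instead of waiting on a provider that is known to be down, which is what stops the failure from tying up workers across the system. A multi-threaded service would need a lock around the state updates.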
8. How do I handle prompt injection in production?
Scan all user inputs before they reach the LLM (input guardrail). Scan all LLM outputs before returning to the user (output guardrail). Use type‑directed privilege separation for systematic defence.
9. What metrics should I alert on?
Success rate (< 99%), tool error rate (> 5%), latency deviating beyond 3σ from baseline, cost per request spiking, PII detected in outputs.
10. Can I use the same evaluation for offline and online?
Not directly. Offline evals require ground truth; online (continuous) evals use only the agent’s context. Use LLM‑as‑judge for both, but with a different rubric for each.
11. How do I test failure scenarios?
Inject timeouts, API errors, malformed responses, and slow calls in integration tests. Verify that retries, circuit breakers, and fallbacks behave as expected.
12. How do I roll back a bad prompt change?
Revert the prompt version, redeploy, and — critically — invalidate any semantic caches keyed by prompt hash. Caches can serve bad responses for hours after rollback.
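Keying the cache by prompt hash makes the invalidation step a single operation. The sketch below uses an exact-match dictionary as a stand-in for the semantic lookup (a real semantic cache would match on embedding similarity), and `PromptScopedCache` and its methods are hypothetical names.

```python
import hashlib


class PromptScopedCache:
    """Cache entries grouped by a hash of the prompt template that produced them."""

    def __init__(self):
        self._entries = {}  # {prompt_hash: {query: response}}

    @staticmethod
    def prompt_hash(prompt_template: str) -> str:
        return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:16]

    def get(self, prompt_template: str, query: str):
        return self._entries.get(self.prompt_hash(prompt_template), {}).get(query)

    def put(self, prompt_template: str, query: str, response: str) -> None:
        self._entries.setdefault(self.prompt_hash(prompt_template), {})[query] = response

    def invalidate(self, prompt_template: str) -> int:
        """Drop every entry produced by this prompt; return how many were removed."""
        return len(self._entries.pop(self.prompt_hash(prompt_template), {}))
```

During a rollback, calling `invalidate` with the bad prompt removes its responses without touching entries produced by the version being restored.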
13. What is the difference between canary and blue‑green?
Canary deploys to 1–5% of live traffic (gradual expansion). Blue‑green maintains two full environments (blue = current, green = updated) and switches traffic instantly. Use both.
14. How do I set timeouts for LLM calls?
Connection timeout (5 seconds), request timeout (30 seconds), workflow‑level timeout (60 seconds). Never rely on default infinite timeouts.
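The request and workflow levels can be layered with `asyncio.wait_for`, as in the sketch below. The connection timeout is not shown because it belongs in the HTTP client's own configuration. `call_with_timeout` and `run_workflow` are hypothetical helpers, and each step is assumed to be a zero-argument function returning a coroutine.

```python
import asyncio

REQUEST_TIMEOUT_S = 30   # per LLM or tool call
WORKFLOW_TIMEOUT_S = 60  # whole multi-step run


async def call_with_timeout(coro, timeout_s):
    """Await one call, raising asyncio.TimeoutError if it runs too long."""
    return await asyncio.wait_for(coro, timeout=timeout_s)


async def run_workflow(steps, request_timeout_s=REQUEST_TIMEOUT_S,
                       workflow_timeout_s=WORKFLOW_TIMEOUT_S):
    """Run steps in order; each step and the whole run have their own deadline."""
    async def _run():
        results = []
        for step in steps:
            results.append(await call_with_timeout(step(), request_timeout_s))
        return results

    return await asyncio.wait_for(_run(), timeout=workflow_timeout_s)
```

The workflow deadline matters even when every individual call is fast enough: ten calls that each take 25 seconds all pass the request timeout, but the run as a whole is cancelled at 60 seconds.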
15. How do I handle the LLM’s non‑determinism?
You cannot eliminate it, so design around it. Use temperature=0 to reduce variance where possible; it does not guarantee identical outputs. Evaluate on distributions, not single traces. Use statistical anomaly detection for monitoring.
16. Should I use one large model or many small models?
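Prefer several models behind a router over one large model. Use a small model such as GPT‑4o‑mini or GPT‑4.1‑nano for simple tasks, and escalate to GPT‑4o or GPT‑5 only when necessary. Routing can cut costs by roughly 30–50%, depending on the traffic mix.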
17. What is the minimum evaluation dataset size?
Start with 200–500 labelled examples covering edge cases and failure modes, then grow toward the 500 or more that the checklist calls for as you collect production traces with known outcomes.
18. How do I trace a multi‑step agent workflow?
Use OpenTelemetry spans: one root span for the entire workflow, child spans for each LLM call, tool call, and handoff. Carry trace_id across all services.
19. How do I prevent a single user from blowing the API budget?
Per‑user rate limits, per‑request token budgets, and hard monthly quotas at the gateway. Alert when any user approaches 80% of their quota.
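The quota and the 80% alert can be sketched as a small in-memory tracker. In practice this state would live at the gateway in a shared store; `TokenQuota`, its return values, and the in-process dictionary are assumptions made for illustration.

```python
from collections import defaultdict


class TokenQuota:
    """Track monthly token usage per user and enforce a hard limit."""

    def __init__(self, monthly_limit: int, alert_ratio: float = 0.8):
        self.monthly_limit = monthly_limit
        self.alert_ratio = alert_ratio
        self.used = defaultdict(int)

    def charge(self, user_id: str, tokens: int) -> str:
        """Record usage and return 'ok', 'alert', or 'blocked'."""
        if self.used[user_id] + tokens > self.monthly_limit:
            return "blocked"  # hard quota: reject without recording usage
        self.used[user_id] += tokens
        if self.used[user_id] >= self.alert_ratio * self.monthly_limit:
            return "alert"    # at or past 80% of the quota
        return "ok"
```

A blocked request is not charged, so a user sitting at 80% of their quota can still make smaller requests that fit under the limit.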
20. What should I do if my agent’s success rate drops below 99%?
First, check traces for the failing requests. Is it a specific tool failing? A particular prompt pattern? A model version change? Fix, run offline evals, deploy via canary, monitor.
Conclusion
The gap between an agent that works in a notebook and an agent that works in production is wide — but it is not unbridgeable. Production AI agents require observability, evaluation, progressive deployment, reliability engineering, security, and cost control.
- Observability is your foundation — trace everything.
- Evaluation is your truth — offline for changes, online for production.
- Deployment is progressive — shadow → canary → percentage → full.
- Reliability is layered — retries → fallbacks → circuit breakers.
- Security is non‑negotiable — guardrails in, guardrails out.
- Cost is controllable — cache, route, budget.
Every practice in this guide has been tested in real production environments at scale. Apply them, and your agents will be reliable, observable, and ready for users.
Your Next Step
Continue your learning with the production engineering deep dives:
- Agent Evaluation — building offline and online evaluation systems →
- Agent Monitoring — real‑time metrics, dashboards, and alerting →
- Agent Deployment — canary releases, blue‑green, and rollback strategies →
Then explore the protocols and frameworks that power production agents:
- MCP — Model Context Protocol for standardised tool access →
- A2A — Agent‑to‑Agent protocol for agent collaboration →
- Agent Frameworks — LangGraph, CrewAI, Semantic Kernel, OpenAI Agents SDK →
This article is part of the AgentDevPro Handbook — practical, engineering‑focused guides for building production AI agent systems.