
Production AI Agents: From Prototype to Production-Ready Systems

Most teams build their first agent in an afternoon. It works beautifully in the notebook — the LLM calls the right tools, the responses look great, and stakeholders are impressed. Everyone wants to ship it next week.

Then production happens. The API bill hits $87,000 on a $40,000 budget. Response times stretch past ten seconds. A one-word change to a prompt breaks a refund-flow tool-call schema that the 30-example regression suite never hit. Users report incorrect information, and you have no way to know whether it was a bad prompt, a bad retrieval, a bad tool call, or something else.

This is the prototype-to-production gap. This guide closes it.

You’ll learn how to build production AI agents that are reliable, observable, testable, deployable, secure, and cost‑controlled — without diving into theoretical architecture patterns. Every practice here comes from real production deployments at scale.

What Is a Production AI Agent

A production AI agent is an AI system that meets six non‑negotiable criteria:

  • Reliable — handles failures gracefully, retries transient errors, and never hangs indefinitely.
  • Observable — every LLM call, tool invocation, and decision point is traced and logged.
  • Testable — prompt changes are validated against a fixed dataset before promotion.
  • Deployable — supports canary releases, blue‑green switches, and instant rollback.
  • Secure — authenticates and authorises every request, blocks prompt injection.
  • Cost‑controlled — tracks spend per request, caches aggressively, rightsizes models.

A prototype agent works for you. A production agent works for your users, under real load, with real consequences.

Example: A customer‑support agent that routes tickets, retrieves knowledge, generates drafts, and escalates to humans — processing 10,000 requests per day with 99.5% uptime, full audit trails, and a predictable monthly API bill under $5,000.

Why Production Engineering Matters

The shift from prototype to production in the AI world exposes operational challenges that traditional software deployment practices weren’t designed to handle.

Three reasons production AI agents fail where prototypes succeed:

| Challenge | Why it breaks | The cost |
| --- | --- | --- |
| Non-determinism | The same prompt can yield different responses across runs. An identical input can produce a correct answer, then a hallucination, then a refusal. | You can’t trust unit tests. Your CI pipeline passes, but production fails |
| Evaluation opacity | HTTP 200 tells you nothing about quality. The model might have hallucinated facts, leaked sensitive information, or generated grammatical nonsense. | Your dashboards look green while your users get wrong answers |
| Configuration as code | A prompt change is not a config tweak: it’s an architectural change that can break three other behaviours for every one it fixes. | Silent regressions that only show up on real traffic |

Because of these challenges, building production AI agents requires LLMOps — operational practices that draw from both software engineering (observability, deployment pipelines, security) and machine learning (evaluation frameworks, model optimisation).

Production Agent Lifecycle

The production agent lifecycle is not a linear sequence — it’s a continuous loop. Every stage feeds back into the one before.

Stage breakdown:

  1. Development — Build the agent locally. Hard‑code prompts, mock tools, use sample inputs.
  2. Testing / Evaluation — Run the agent against a fixed dataset (ground truth). Measure accuracy, groundedness, latency. Fail the PR if scores drop below threshold.
  3. Deployment — Progressive rollout: shadow → canary → percentage → full. Each stage answers a different evaluation question.
  4. Monitoring — Real‑time metrics: success rate, latency per step, cost per request, tool error rate. Alerts on anomalies.
  5. Iteration — Use production traces to build better evals, refine prompts, and optimise costs.

The rest of this guide focuses on making each stage production‑ready.

Core Production Capabilities

Every production agent system must implement seven core capabilities.

| Capability | What it does | Why you cannot skip it |
| --- | --- | --- |
| Observability | Traces every LLM call, tool invocation, retrieval, and decision point | Without traces, you cannot debug a single production failure |
| Evaluation | Compares agent outputs against ground truth or quality criteria | The only way to know if a prompt change actually improved behaviour |
| Monitoring | Real-time metrics, alerts, and dashboards | You cannot fix what you do not measure |
| Deployment | Canary releases, blue-green switches, instant rollback | A bad prompt update should affect 1% of users, not 100% |
| Reliability | Retries, circuit breakers, timeouts, fallbacks | External APIs fail. Your agent must survive |
| Security | Authentication, authorisation, prompt injection defence | OWASP LLM Top 10 #1: prompt injection is the most critical vulnerability |
| Cost Control | Token caching, model routing, per-route budgets | Production API bills can explode without guardrails |

Let’s examine each in depth.

Observability for AI Agents

Observability is the foundation. You cannot evaluate, debug, or improve what you cannot see. Tracing is the bare minimum an agent needs before you can start improving it.

The Five Things You Must Instrument

| What to trace | Why it matters | What to capture |
| --- | --- | --- |
| Every LLM call | Prompts change behaviour; you need to see what was sent and what came back | Prompt, completion, tokens, cost, latency, model version |
| Every tool invocation | Tool failures are the most common silent failure | Tool name, input parameters, output, error, duration |
| Every retrieval | RAG pipelines degrade silently; retrieval quality directly affects answer quality | Query, retrieved chunks, relevance scores |
| Application metadata | Which user saw which response? Which session? | user_id, session_id, request_id, workflow_id |
| Key decision points | Why did the agent choose tool A over tool B? What was its reasoning? | The agent’s chain-of-thought before each action |

Tracing Architecture

Modern observability stacks use OpenTelemetry to collect traces and export them to a backend (Grafana Cloud, Langfuse, Elastic, or self-hosted Tempo). The OpenAI Agents SDK, LangGraph, and Semantic Kernel can all emit OpenTelemetry-compatible traces, either natively or through instrumentation libraries.

What a trace looks like (simplified):

Trace ID: tr_abc123
Span 1: triage_agent
├─ Input: "Refund my last order"
├─ Output: "Call billing_agent"
└─ Duration: 120ms

Span 2: billing_agent (handoff)
├─ LLM call: prompt="Customer requests refund for order #12345"
├─ Tool call: get_order(order_id="12345")
│ ├─ Input: {"order_id": "12345"}
│ ├─ Output: {"status": "delivered", "date": "2025-06-01"}
│ └─ Duration: 45ms
├─ Tool call: create_return(order_id="12345")
│ ├─ Input: {"order_id": "12345", "reason": "defective"}
│ ├─ Output: {"return_id": "R123", "label_url": "..."}
│ └─ Duration: 230ms
└─ Duration: 890ms

With this trace, you can answer: Which tool failed? Why did the agent choose it? How long did each step take? How many tokens did we burn?
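To make the span structure above concrete, here is a minimal sketch of a span recorder built only on the standard library. It is not the OpenTelemetry API; the `span` helper, the in-memory `SPANS` sink, and the attribute names are assumptions for illustration, and a real system would export to a tracing backend instead.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink; a real system would export to a tracing backend.
SPANS: list = []

@contextmanager
def span(trace_id: str, name: str, **attributes):
    """Record one step of an agent run: name, attributes, status, duration."""
    record = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:8],
        "name": name,
        "attributes": dict(attributes),
        "status": "ok",
    }
    start = time.perf_counter()
    try:
        # Callers can attach outputs to the attributes dict as they appear.
        yield record["attributes"]
    except Exception as exc:
        record["status"] = "error"
        record["attributes"]["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)

# Usage: wrap a tool call so input, output, errors, and duration are captured.
with span("tr_abc123", "tool:get_order", input={"order_id": "12345"}) as attrs:
    attrs["output"] = {"status": "delivered", "date": "2025-06-01"}
```

Because the span is appended in the `finally` clause, a failed tool call still leaves a record with `status` set to `"error"`, which is what lets you answer "which tool failed?" after the fact.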

Platforms

  • LangSmith / Langfuse — purpose‑built for LLM traces, with prompt versioning and cost tracking
  • Grafana Cloud + OpenTelemetry — full observability stack for teams already using Grafana
  • Elastic — unified logs, metrics, and APM traces with ML anomaly detection
  • Self‑hosted — OpenTelemetry Collector + Jaeger/Tempo + Prometheus + Loki

Evaluation Systems

Traditional software has clear success criteria. HTTP 200 means success; 500 means failure. With LLMs, a 200 OK response tells you almost nothing about quality. You need semantic evaluation.

Two Modes of Evaluation

| Mode | When to use | How it works | Tools |
| --- | --- | --- | --- |
| Offline (supervised) | Before deploying a prompt or model change | Run agent against a fixed dataset of input/expected-output pairs. Compute metrics (accuracy, groundedness, etc.) | RAGAS, DeepEval, custom LLM-as-judge |
| Online (continuous) | After deployment, on production traffic | Run unsupervised evals on live traces using only the agent’s own context (no ground truth needed). Alert on anomalies | LLM-as-judge, custom scoring functions |

Offline Evaluation with LLM‑as‑Judge

LLM-as-a-Judge (LLMaaJ) has emerged as the leading technique for scoring outputs when exact comparison against a reference answer is impractical. When calibrated, it correlates reasonably well with human judgment and can evaluate properties that traditional metrics miss.

Properties LLMaaJ can evaluate:

  • Groundedness — is the output based on facts in the retrieved context?
  • Faithfulness — does the output stay consistent with the source, without contradicting it?
  • Safety — is the content harmful or unsafe?
  • Answer relevance — does the response actually answer the question?

Implementation example (pseudo‑code):

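import re

def parse_score(text: str) -> float:
    """Extract the first number in the judge's reply and clamp it to 0-10."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no score found in judge output: {text!r}")
    return min(max(float(match.group()), 0.0), 10.0)

def llm_judge(question: str, answer: str, context: str) -> float:
    """Return a groundedness score normalised to 0-1."""
    prompt = f"""
You are an unbiased evaluator. Score the following answer on groundedness from 0 to 10.

Question: {question}
Answer: {answer}
Context: {context}

Score (0–10):"""

    response = llm.generate(prompt)  # `llm` is a placeholder for your model client
    return parse_score(response) / 10  # normalise so thresholds such as 0.85 apply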

Best practices for LLM judges:

  • Make evals binary (pass/fail) rather than scored on a range; LLMs are inconsistent raters
  • Make evals specific, not generic — “did the agent reference information not in the retrieved docs?” beats “was the response good?”
  • Provide examples in the eval prompt, especially edge cases
  • Validate that your judge correlates with human judgment (Cohen’s Kappa > 0.6)
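The calibration check in the last bullet can be computed directly. Below is a minimal sketch of Cohen's kappa for two lists of labels, one from the LLM judge and one from human reviewers; the sample labels are made up for illustration.

```python
from collections import Counter

def cohens_kappa(judge: list, human: list) -> float:
    """Agreement between two label lists, corrected for chance agreement."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if both raters labelled at random with their own label rates.
    judge_freq, human_freq = Counter(judge), Counter(human)
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail verdicts on the same eight examples.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(judge_labels, human_labels)
```

Here the raters agree on 7 of 8 items (0.875) against a chance agreement of 0.5, giving a kappa of 0.75, which clears the 0.6 bar.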

Continuous Evaluation (Unsupervised)

Continuous evals run against live production traces. They assess behaviour using only the agent’s own context, so they don’t need a known correct answer.

Common continuous eval patterns:

  • PII leakage detection — does the response contain email addresses, phone numbers, or credit cards?
  • Refusal rate — does the agent refuse to answer questions it should answer?
  • Tool error rate — is the tool calling failing more often than usual?
  • Latency deviation — is a step suddenly taking much longer?
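As an illustration of the first pattern, here is a minimal sketch of a regex-based PII check that could run over sampled responses. The patterns are deliberately simple assumptions, not a complete detector; production systems usually need locale-aware rules or a dedicated PII service.

```python
import re

# Illustrative patterns only. Note that they overlap: a long card number
# also looks like a phone number, so treat the result as a set of flags.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def detect_pii(text: str) -> set:
    """Return the kinds of PII whose pattern matches anywhere in the text."""
    return {kind for kind, pattern in PII_PATTERNS.items() if pattern.search(text)}

# Usage: score each sampled response and alert or block when the set is non-empty.
flags = detect_pii("You can reach the customer at jane@example.com")
```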

Use a scores table to track evaluation results over time:

| Eval name | Threshold | Action on breach |
| --- | --- | --- |
| groundedness | < 0.85 | Page on-call |
| tool_error_rate | > 10% over 5 min | Auto-rollback |
| pii_detected | any | Block response + alert |

Monitoring Agent Behaviour

Production agents require real‑time monitoring of both traditional infrastructure metrics and agent‑specific behavioural signals.

Agent‑Specific Metrics

| Metric | What it measures | Alert when |
| --- | --- | --- |
| agent_success_rate | Percentage of requests that complete without error or timeout | < 99% over 5 min |
| agent_latency_p95 | Time from request to final response, per workflow type | > 5 seconds |
| tool_error_rate | Percentage of tool calls that return an error | > 5% |
| llm_token_usage | Input + output tokens per request, broken down by model | Sudden spike > 2x baseline |
| cost_per_request | Dollar cost per agent invocation | > $0.05 for simple workflows |
| retry_rate | Percentage of LLM calls or tool calls that are retried | > 10% |

Traditional Infrastructure Metrics

Do not ignore the basics:

| Metric | Tool |
| --- | --- |
| HTTP request rate, error rate, latency | Prometheus + Grafana |
| CPU, memory, GPU utilisation | Prometheus node exporter |
| Queue length, consumer lag | Broker metrics (Kafka, RabbitMQ) |
| Database connection pool | Database exporter |

Alerting Philosophy

For AI agents, non-deterministic behaviour calls for adaptive alerting. Fixed static thresholds often fail because normal values drift with traffic mix, prompt versions, and model updates.

Better approach: statistical anomaly detection. Track a 7‑day rolling baseline for each metric. Alert when the current value deviates beyond N standard deviations from the baseline.

# Example anomaly detection rule
- name: "latency_anomaly"
  measurement: agent_latency_p95
  baseline_window: 7d
  deviation_threshold:
  action: on_loose_change

Combine with automated rollback when critical metrics breach thresholds (e.g., false positives exceed a critical limit).
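The rolling-baseline check described above reduces to a z-score test. Here is a minimal sketch; the function name, the default of three standard deviations, and the sample latencies are assumptions for illustration.

```python
from statistics import mean, stdev

def is_anomalous(baseline: list, current: float, n_sigma: float = 3.0) -> bool:
    """True when `current` lies more than n_sigma standard deviations from the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current != mu  # flat baseline: any change is a deviation
    return abs(current - mu) / sigma > n_sigma

# Hypothetical p95 latencies (seconds) sampled from the 7-day rolling window.
baseline_p95 = [2.1, 2.3, 2.2, 2.4, 2.0, 2.3, 2.2]
normal = is_anomalous(baseline_p95, 2.35)   # within normal variation
spike = is_anomalous(baseline_p95, 4.0)     # far outside the baseline
```

A z-score assumes the metric is roughly symmetric around its mean; for heavy-tailed metrics such as latency, a percentile- or median-based baseline may behave better.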

Deployment Strategies

Deploying a prompt change is not a config update — it is a partial behaviour change on a non‑deterministic system. The standard software deployment playbook does not apply.

The Four‑Stage Gate

| Stage | Traffic | User impact | Eval question | Promotion gate |
| --- | --- | --- | --- | --- |
| Shadow | 100% mirrored | Zero | Does the candidate behave wildly differently on real traffic? | Per-rubric distribution within 1 point of production over 24-72h |
| Canary | 1–5% live | Tier-stratified | Is the candidate at least as good with users in the loop? | Containment of false resolution within noise floor |
| Percentage | 10, 25, 50% live | Broader | Are per-rubric deltas statistically significant on prod data? | Welch’s t-test p > 0.05 on each rubric vs 7-day baseline |
| Full | 100% live | All users | Does the candidate hold the line under load with auto-rollback armed? | Guardrail trip rate, rubric rolling mean, p99 latency hold for 48-72h |

Key insight: Skipping a stage is the cheap‑and‑fast failure mode. A green check at stage one does not answer stage two’s question.
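For the canary and percentage stages, traffic is usually split deterministically so that a given user stays on the same variant as the rollout widens. The sketch below shows one common way to do that with a hash bucket; the function names and the `salt` value are hypothetical.

```python
import hashlib

def in_rollout(user_id: str, rollout_percent: float, salt: str = "prompt-v2") -> bool:
    """Deterministically decide whether a user is served the candidate."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return bucket < rollout_percent * 100

def variant_for(user_id: str, rollout_percent: float) -> str:
    return "candidate" if in_rollout(user_id, rollout_percent) else "production"

# Usage: a user in the 5% canary remains in the candidate group at 25% and 50%.
assignment = variant_for("user-1234", 5)
```

Changing the salt per experiment reshuffles the buckets, so the same small set of users is not always the first to see every change.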

Blue‑Green Deployment

Maintain two identical environments (blue = current, green = updated). Route traffic to green only after validation; if issues arise, switch back to blue instantly.

Version Control

  • Store prompts externally (not hard‑coded)
  • Use explicit versions and environment tags (dev, staging, production)
  • Maintain a model registry (MLflow, Weights & Biases) with versioned models
  • Test new prompt versions against a regression dataset before promotion

Rollback Gotcha

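Semantic caches key responses by the user query (or its embedding), not by the prompt version that produced them. When you roll back a prompt change, the cache may continue serving responses generated by the bad prompt for hours. Your rollback plan must include cache invalidation, or your cache keys must include a prompt hash.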
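One way to make rollback safe by construction is to namespace every cache entry by a hash of the prompt that produced it. The sketch below uses a plain string key for clarity; the function names and key layout are assumptions, and in a vector store the same idea would typically be a metadata filter on the prompt version.

```python
import hashlib

def prompt_version(system_prompt: str) -> str:
    """Short, stable fingerprint of the prompt text."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]

def cache_key(system_prompt: str, normalized_query: str) -> str:
    """Namespace each cache entry by the prompt that produced it."""
    query_hash = hashlib.sha256(normalized_query.encode("utf-8")).hexdigest()[:16]
    return f"resp:{prompt_version(system_prompt)}:{query_hash}"

good_prompt = "You are a support agent. Cite the knowledge base."
bad_prompt = "You are a support agent. Cite the knowledge base briefly."

key_before = cache_key(good_prompt, "how do i get a refund")
key_during = cache_key(bad_prompt, "how do i get a refund")
key_after_rollback = cache_key(good_prompt, "how do i get a refund")
```

After rollback the key matches the pre-change key again, so entries written under the bad prompt are simply never read and can expire on their own TTL.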

Reliability Engineering

In AI applications handling millions of LLM calls, failure is routine. Transient timeouts, flaky responses, and latency spikes are everyday events. You need a layered resilience strategy.

Retries (Transient Failures)

Use retries for temporary glitches: network instability, TLS handshake failures, brief provider rate limits, cold starts.

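import asyncio
import random

class TransientError(Exception):
    """Placeholder for retryable errors (timeouts, rate limits, dropped connections)."""

async def call_with_retry(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff (1s, 2s, 4s, ...) plus a little jitter
            delay = (base_delay * (2 ** attempt)) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)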

Use exponential backoff to reduce pressure on the provider. Some platforms support Retry-After headers — honour them.

Fallbacks (Provider Diversity)

If your primary model fails, switch to a secondary provider or a smaller (cheaper) model. Use for provider overload, temporary degradation, or cost optimisation.

Caveat: Fallbacks are reactive. The system checks the primary every time, even if it is failing, before routing to the fallback — which adds latency.

Circuit Breakers (Persistent Failures)

Retries and fallbacks try to recover from failures. Circuit breakers prevent a bad situation from spiralling further by automatically cutting off traffic to unhealthy components.

Circuit breaker states:

  • Closed — requests flow normally; failures increment a counter.
  • Open — after N failures in a time window, all requests fail immediately (fast fail).
  • Half‑open — after a timeout, one test request is allowed; if it succeeds, the circuit closes.
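The three states above can be sketched as a small asyncio state machine. This is a simplified illustration, not a library API: it counts consecutive failures rather than failures in a sliding time window, it does not limit concurrent probes in the half-open state, and the class and exception names are hypothetical.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class AsyncCircuitBreaker:
    def __init__(self, fail_max: int = 5, reset_timeout: float = 30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = 0.0
        self.state = "closed"

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half-open"  # timeout elapsed: let one probe through
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures while closed, opens the circuit.
            if self.state == "half-open" or self.failures >= self.fail_max:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Any success resets the counter and closes the circuit.
        self.failures = 0
        self.state = "closed"
        return result
```

A caller would wrap each provider call as `await breaker.call(call_llm, prompt)` and treat `CircuitOpenError` as the signal to use a fallback.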

Implementation:

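from datetime import timedelta

# aiobreaker is an asyncio-native fork of pybreaker. pybreaker's own decorator
# targets synchronous callables, so it would not see failures raised inside a coroutine.
from aiobreaker import CircuitBreaker

breaker = CircuitBreaker(fail_max=5, timeout_duration=timedelta(seconds=30))

@breaker
async def call_llm(prompt: str) -> str:
    response = await client.chat.completions.create(...)
    return response.choices[0].message.content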

Bulkheads (Concurrency Limits)

Partition resources so that a failure in one agent does not cascade across the system. Each agent workflow gets a dedicated concurrency pool.

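import asyncio

support_pool = asyncio.Semaphore(10)  # one dedicated pool per agent workflow

async def with_bulkhead(coro, semaphore: asyncio.Semaphore):
    async with semaphore:  # waits while this workflow's pool is at capacity
        return await coro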

Idempotency

Ensure that retries do not cause duplicate side effects. Store processed request_ids or workflow_ids with a TTL.
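A minimal sketch of that idea follows, using an in-memory dictionary as a stand-in for a key-value store with TTL. The class and method names are hypothetical, and this version is not safe under concurrency; with a shared store you would typically claim the request ID atomically before running the action.

```python
import time

class IdempotencyStore:
    """In-memory stand-in for a key-value store with TTL (for example Redis)."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so tests can control time
        self._results = {}          # request_id -> (stored_at, result)

    def run_once(self, request_id: str, action):
        """Execute `action` at most once per request_id within the TTL."""
        now = self.clock()
        cached = self._results.get(request_id)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]        # duplicate: return the stored result
        result = action()
        self._results[request_id] = (now, result)
        return result

# Usage: a retried request with the same ID does not create a second return.
store = IdempotencyStore(ttl_seconds=600)
created = store.run_once("req-42", lambda: {"return_id": "R123"})
replayed = store.run_once("req-42", lambda: {"return_id": "R999"})
```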

Security in Production Agents

OWASP LLM Top 10 lists prompt injection as the most critical vulnerability in production LLM systems. 73%+ of LLM deployments are vulnerable, with 50–84% attack success rates in real‑world red team evaluations.

Prompt Injection Prevention

What is prompt injection? An attacker supplies malicious text designed to override the instructions given to the assistant by its developer — tricking the agent into ignoring its system prompt, executing unauthorised tool calls, or revealing sensitive data.

Production defences:

| Defence | Implementation | Effectiveness |
| --- | --- | --- |
| Input guardrails | Scan all user inputs for injection patterns before they reach the LLM | Blocks obvious attacks at the edge |
| Type-directed privilege separation | Convert untrusted data to safe data types with limited scope, eliminating the possibility of injection | Systematic (design-level) defence |
| Separate guard model | A smaller model runs in front of the primary LLM to detect and block injection attempts | Used by many production systems |
| Output guardrails | Scan all LLM outputs before returning to the user | Catches leaked secrets or inappropriate content |

Implementation example with llm‑injection‑guard:

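# InjectionDetectedError is assumed to be exported by the same package.
from llm_injection_guard import InjectionDetectedError, PromptScanner

scanner = PromptScanner(block_on_detection=True)

def handle_message(user_input: str) -> dict:
    try:
        scanner.scan(user_input)  # raises if an injection attempt is detected
    except InjectionDetectedError as e:
        print(f"Blocked! Threat level: {e.threat_level}")
        return {"error": "Input contains unsafe content"}
    # Input passed the scan: safe to forward to the LLM
    return {"response": llm.chat(user_input)}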

Authentication and Authorisation

  • Every agent request must be authenticated (JWT, API key, mTLS)
  • Every tool call must be authorised against the caller’s identity
  • Agents often require dynamic scope changes during execution — different resources at different times

Sensitive Data Protection

  • Never log full messages containing PII, credentials, or API keys
  • Redact fields like password, secret, credit_card, authorization before logging
  • Encrypt sensitive payload fields at rest
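The redaction rule above can be applied as a small recursive filter that runs before any event reaches the logger. This is a sketch under stated assumptions: the key list extends the fields named above with `api_key`, and matching is by exact (case-insensitive) key name only.

```python
SENSITIVE_KEYS = {"password", "secret", "credit_card", "authorization", "api_key"}

def redact(value):
    """Return a copy of a JSON-like structure with sensitive fields masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # scalars pass through unchanged

event = {
    "user_id": "u42",
    "headers": {"Authorization": "Bearer abc"},
    "tool_input": [{"credit_card": "4111 1111 1111 1111", "amount": 20}],
}
safe_event = redact(event)
```

Key-based redaction does not catch secrets embedded in free text such as prompts or completions; those need pattern-based scanning as well.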

Compliance

  • Most EU AI Act obligations begin to apply in August 2026. High-risk systems must meet robustness and cybersecurity requirements, which in practice means being able to demonstrate prompt injection defences
  • Maintain immutable audit trails (SHA256-hashed logs) for SOC 2, GDPR, and EU AI Act
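One common reading of "SHA256-hashed logs" is a hash chain, where each entry's hash covers the previous entry's hash so that any later edit is detectable. The sketch below illustrates that interpretation; the entry layout and function names are assumptions, and real deployments would also need write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def _entry_hash(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_entry(log: list, event: dict) -> dict:
    """Append an audit event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, event)}
    log.append(entry)
    return entry

def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True

audit_log = []
append_entry(audit_log, {"user_id": "u42", "tool": "get_order", "order_id": "12345"})
append_entry(audit_log, {"user_id": "u42", "tool": "create_return", "order_id": "12345"})
```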

Security Checklist for Production Agents

  • All user inputs are scanned for injection patterns before LLM processing
  • All LLM outputs are scanned for secrets and PII before returning to user
  • Every request includes authentication (JWT, API key, or mTLS)
  • Every tool call includes authorisation checks
  • Logs are redacted (no passwords, no PII, no API keys)
  • Secrets are stored in a vault (not in code or .env)
  • Audit trail captures who called which tool, when, and what was returned

Cost Optimization

Production API bills can explode without guardrails. A platform team’s LLMOps bill once hit $87,000 in March on a $40,000 budget — $52,000 of which was online judge tokens alone.

Eight Levers for Cutting LLM Costs

| Lever | Typical cut | Where it lives |
| --- | --- | --- |
| Distilled judges for online scoring | 30–50% | Eval platform |
| Semantic cache on repetitive routes | 20–40% | Gateway |
| Smaller default model + eval-gated routing | 30–50% | Gateway |
| Provider prompt caching | 30–60% input | Provider SDK |
| Tail-based trace sampling per route | 15–30% | OTel collector |
| Tiered trace retention | 10–20% | Storage layer |
| Batched offline evals | 50% | Eval pipeline |
| Per-route token budgets | bounds spike | Gateway |

Caching Strategies

Semantic cache stores embeddings of previous queries and their responses. For a new query, it looks up the semantically closest cached entry and, if the similarity clears a threshold, returns that response with no LLM call. This cuts costs by 20–40% for repetitive routes.
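A minimal sketch of a semantic cache lookup follows, using a linear scan and cosine similarity. The `embed` callable, the 0.92 threshold, and the toy embedding are assumptions for illustration; a real deployment would use an embedding model and a vector index.

```python
import math
from typing import Optional

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # callable mapping text to a vector
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (vector, response) pairs

    def get(self, query: str) -> Optional[str]:
        vector = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vector, e[0]), default=None)
        if best is not None and cosine(vector, best[0]) >= self.threshold:
            return best[1]
        return None  # miss: caller falls through to the LLM

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

def toy_embed(text: str) -> list:
    """Stand-in embedding: word counts over a tiny fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in ("refund", "order", "password", "reset")]

cache = SemanticCache(toy_embed)
cache.put("refund my order", "Refunds are processed within 5 days.")
hit = cache.get("order refund please")
miss = cache.get("reset password")
```

The threshold is the main tuning knob: set too low, users receive answers to questions they did not ask; set too high, the hit rate collapses.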

Provider prompt caching — OpenAI, Anthropic, and Gemini offer caching discounts of 30–60% on input tokens when the same prompt prefix is reused across requests.

Model Routing

Use a small, cheap model as the default. Route only to a large, expensive model when necessary.

def route_request(query: str) -> str:
    # Cheap model first
    if is_simple_question(query):
        return small_model.chat(query)
    # Escalate to large model
    return large_model.chat(query)

Applied correctly, this cuts costs by 30–50% while maintaining quality.

Per‑Request Cost Tracking

Capture token counts (input, output, cache_read, cache_creation) and compute cost from a maintained price book.

class CostTracker:
    def __init__(self, metrics):
        self.metrics = metrics  # metrics client exposing record_cost(cost)
        # USD per 1K tokens; example prices, keep in sync with the provider's price list
        self.price_book = {"gpt-4o": {"input": 0.005, "output": 0.015}}

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        prices = self.price_book[model]
        cost = (input_tokens / 1000) * prices["input"] \
             + (output_tokens / 1000) * prices["output"]
        self.metrics.record_cost(cost)  # send to metrics system
        return cost

Budget Controls

Set per‑route token budgets. If a workflow exceeds its budget, the gateway rejects the request or escalates for human review.

budgets:
  - route: "/chat/support"
    max_tokens_per_request: 4000
    max_cost_per_request: 0.02
  - route: "/chat/research"
    max_tokens_per_request: 16000
    max_cost_per_request: 0.10
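A gateway could enforce a configuration like the one above with a check along these lines. This is a sketch: the `RouteBudget` type, `BudgetExceeded` exception, and `check_budget` function are hypothetical names, and the values simply mirror the example budgets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RouteBudget:
    max_tokens_per_request: int
    max_cost_per_request: float

# Mirrors the example budgets above; values are illustrative.
BUDGETS = {
    "/chat/support": RouteBudget(4000, 0.02),
    "/chat/research": RouteBudget(16000, 0.10),
}

class BudgetExceeded(Exception):
    """Raised when a request is over its route's token or cost limit."""

def check_budget(route: str, tokens: int, cost: float) -> None:
    budget = BUDGETS.get(route)
    if budget is None:
        raise KeyError(f"no budget configured for route {route!r}")
    if tokens > budget.max_tokens_per_request:
        raise BudgetExceeded(f"{route}: {tokens} tokens > {budget.max_tokens_per_request}")
    if cost > budget.max_cost_per_request:
        raise BudgetExceeded(f"{route}: ${cost:.4f} > ${budget.max_cost_per_request:.4f}")

# Usage: call before dispatching; catch BudgetExceeded to reject or escalate.
check_budget("/chat/support", tokens=3000, cost=0.01)
```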

Testing Production Agents

Traditional testing approaches do not apply to non‑deterministic systems. You need a layered test strategy.

Unit Testing

Test the deterministic parts in isolation: tool input validation, prompt formatting, data transformation.

def test_tool_input_validation():
    with pytest.raises(ValidationError):
        get_order(order_id="")  # empty ID should fail

Integration Testing

Spin up a real agent instance (in a test container) and verify behaviour against a controlled dataset.

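@pytest.mark.asyncio
async def test_agent_on_fixed_dataset():
    agent = Agent()
    for question, expected in test_dataset:
        response = await agent.run(question)
        # The expected answer serves as the judge's reference context;
        # llm_judge is assumed to return a score normalised to 0-1.
        score = llm_judge(question, response, expected)
        assert score > 0.85, f"low score {score:.2f} for: {question}"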

End‑to‑End Testing

Run a full workflow against a production‑like environment (including real tools with test accounts, but isolated from production data).

Failure Testing (Chaos)

Inject failures to verify retry, circuit breaker, and fallback logic:

from unittest.mock import patch

import pytest

@pytest.mark.asyncio
async def test_circuit_breaker():
    with patch("llm_client.call", side_effect=APITimeoutError):
        for _ in range(5):
            with pytest.raises(APITimeoutError):
                await agent.run("test query")
        # Circuit should now be open
        with pytest.raises(CircuitOpenError):
            await agent.run("test query")

Evaluation Regression Testing

Before deploying a prompt change, run offline evals against your test dataset. Fail the CI pipeline if aggregated scores drop below thresholds.

# .github/workflows/agent-eval.yml
- name: Run offline evals
  run: python evaluate.py --dataset=test_data.json
- name: Check thresholds
  run: |
    python -c "from eval import check; check(accuracy=0.85, groundedness=0.80)"

Production Best Practices

These ten engineering recommendations come from real production deployments.

| Practice | Why it matters | Implementation |
| --- | --- | --- |
| Always log agent decisions | Without traces, you cannot debug a single production failure | Instrument every LLM call, tool call, retrieval, and decision point |
| Track tool usage | Tool failures are the most common silent failure; you need to know which tool is failing | Log tool name, input, output, duration, error, per tool |
| Measure latency per step | A multi-step agent can have one slow step that kills overall P99 | Record start and end timestamps for each span |
| Use structured outputs | LLMs return free text; parsing fails silently | Use JSON mode or constrained decoding (e.g., Instructor, Outlines) |
| Version prompts and tools | A prompt change can break three other behaviours for every one it fixes | Store prompts externally with version tags; use semantic versioning for tool schemas |
| Implement idempotency | Retries cause duplicate side effects | Store processed request_id in a key-value store with TTL |
| Set timeouts at every level | An agent that hangs indefinitely kills the caller | Connection timeout (5s), request timeout (30s), workflow timeout (60s) |
| Use circuit breakers | Retries and fallbacks are not enough when a provider is persistently degraded | Trip after N failures in M seconds; probe after timeout |
| Monitor cost per request | API bills explode without guardrails | Capture token counts from every LLM call; compute cost from price book |
| Test failure scenarios | Your agent will fail in production; make sure it fails gracefully | Inject timeouts, API errors, malformed responses in integration tests |

Common Production Mistakes

| Mistake | Why it breaks production | How to fix |
| --- | --- | --- |
| No observability | When a user reports a bad response, you have no way to know whether it was a bad prompt, bad retrieval, bad tool call, or something else | Instrument everything before the first production deploy |
| No evaluation system | You cannot tell if a prompt change made things better or worse | Build an offline eval dataset and run it in CI |
| No fallback strategy | A downstream API outage stops the entire agent | Implement retries, circuit breakers, and fallback models |
| Over-reliance on LLM output | LLMs are not databases; they hallucinate | Validate outputs against known facts; use tools for factual lookups |
| Ignoring cost | The first production API bill will shock you | Set per-route budgets; cache aggressively; use small models where possible |
| No guardrails | Prompt injection is the #1 LLM security risk | Scan all user inputs before they reach the LLM |
| Hard-coded prompts | Every prompt tweak requires a full redeploy; changes risk silent regressions | Store prompts externally with versioning; test before promoting |
| No rollback plan for caches | Roll back the code, but the cache keeps serving bad responses for hours | Invalidate cache on rollback; version cache keys with prompt hash |

Case Study: Production Customer Support Agent

The system: A customer‑support agent that processes 10,000 tickets per day. It routes tickets to specialised agents (billing, technical, returns), retrieves knowledge base articles, and generates draft responses — with human approval for high‑value actions.

Observability Pipeline

Every request is traced with OpenTelemetry. The collector exports to both Langfuse (for developer debugging) and Grafana Tempo (for platform‑level aggregation). Traces capture:

  • User ID and session ID for attribution
  • Every LLM call (prompt, completion, tokens, model, latency)
  • Every tool call (knowledge base search, order lookup, return creation)
  • The agent’s reasoning before each action (chain‑of‑thought)

Key dashboard: “Top failing tools” — shows which tool calls are erroring most frequently, segmented by agent type.

Evaluation Loop

Offline evaluation runs on every pull request:

  • Dataset: 1,000 labelled support interactions
  • Metrics: accuracy (exact match on category), groundedness (LLM‑as‑judge), latency
  • CI blocks the PR if accuracy drops below 92% or groundedness below 0.85

Continuous evaluation runs on 1% of production traffic:

  • Checks for PII leakage (email addresses, phone numbers)
  • Measures refusal rate (is the agent refusing to help valid requests?)
  • Alerts on anomaly detection (latency spikes, error rate changes)

Deployment Strategy

Four‑stage gate for every prompt or model change:

  1. Shadow (24h) — new agent runs in parallel with production; zero user impact. Compare outputs.
  2. Canary (5% traffic, 6h) — tier‑stratified by user segment. Monitor error rate and satisfaction.
  3. Percentage (25% → 50% → 100%, 12h each) — expand slowly. Compare rubrics against 7‑day baseline.
  4. Full — armed with auto‑rollback if error rate exceeds 2% or latency exceeds 5 seconds.

Monitoring Dashboard (Prometheus + Grafana)

Metrics:
- agent_success_rate: 99.2% (target >99%)
- agent_p95_latency: 2.3s (target <4s)
- tool_error_rate_by_tool:
    knowledge_search: 0.5%
    get_order: 1.2%
    create_return: 0.1%
- llm_token_usage_per_request: 1,200 input / 400 output
- cost_per_request: $0.008

Alerts:

  • PagerDuty if success rate drops below 99% for 5 minutes
  • Slack notification if any tool error rate exceeds 5%
  • Daily cost report to finance team

Reliability Mechanisms

  • Retries — exponential backoff for transient LLM API errors (3 retries: 1s, 2s, 4s)
  • Circuit breaker — trips after 10 failures in 60 seconds; stays open for 30 seconds before probing
  • Fallbacks — if primary model (GPT‑4o) fails, route to GPT‑4o‑mini for degraded but working responses
  • Timeouts — 30 seconds per LLM call; 60 seconds total workflow
  • Idempotency — store request_id in Redis; duplicate requests return cached response

Security Controls

  • All user inputs scanned by llm‑injection‑guard before LLM processing
  • Tool calls authorised against caller’s identity (a user cannot look up another user’s order)
  • Outputs scanned for PII before being returned
  • Full audit trail: every tool call logged with user_id, timestamp, parameters, result

Cost Optimisation

  • Semantic cache for frequently asked questions (returns the cached answer when a query's embedding is close enough to a stored one, skipping the LLM call) — serves 30% of requests from cache
  • Model routing — simple questions go to GPT‑4o‑mini ($0.0005 per 1K tokens); complex reasoning goes to GPT‑4o ($0.005 per 1K tokens)
  • Per‑request token budget — refuse requests exceeding 4,000 input tokens
  • Monthly cost: $2,500 for 300,000 requests ($0.008 per request)

Result

  • Uptime: 99.92%
  • Accuracy: 94% of customer issues resolved without human escalation
  • P95 latency: 2.3 seconds
  • Monthly cost: $2,500 (down from $8,000 before optimisation)
  • Security: 0 prompt injection breaches in 6 months

Production Readiness Checklist

Use this checklist before deploying any AI agent to production.

Observability

  • Every LLM call is traced (prompt, completion, tokens, cost, latency, model version)
  • Every tool invocation is traced (name, input, output, error, duration)
  • Every retrieval is traced (query, chunks, relevance scores)
  • Application metadata is attached (user_id, session_id, workflow_id)
  • Key decision points are traced (agent reasoning before each action)
  • Traces are exported to a central backend (Langfuse, Grafana, Elastic)

Evaluation

  • Offline eval dataset exists (at least 500 labelled examples)
  • LLM‑as‑judge evals are calibrated against human judgment
  • CI pipeline runs offline evals on every PR; blocks on score drop
  • Continuous evals run on production traces (ungroundedness, PII leakage, refusals)

Monitoring

  • Success rate, latency (P50/P95/P99), error rate are monitored
  • Per‑tool error rates are tracked
  • Token usage and cost per request are tracked
  • Dashboards exist for development, operations, and business stakeholders
  • Alerts are configured with anomaly detection (not static thresholds only)

Deployment

  • Staged rollout capability exists (shadow → canary → percentage → full)
  • Canary deploys to 1–5% of traffic first
  • Blue‑green or rolling updates are configured
  • Auto‑rollback triggers on metric breach (error rate, latency, groundedness)
  • Cache invalidation is part of rollback plan

Reliability

  • Retries with exponential backoff are implemented for transient failures
  • Circuit breakers protect against persistent provider failures
  • Timeouts are set at every level (connection, request, workflow)
  • Fallback strategies exist (secondary model, degraded mode, graceful failure)
  • Idempotency is implemented for state‑changing operations

Security

  • Input guardrails scan all user inputs for prompt injection
  • Output guardrails scan all LLM responses for PII and secrets
  • Every request is authenticated (JWT, API key, or mTLS)
  • Every tool call is authorised against the caller’s identity
  • Logs are redacted (no passwords, no PII, no API keys)
  • Secrets are stored in a vault, not in code or .env
  • Audit trail exists for all security‑relevant events

Cost Control

  • Token usage is tracked per request, per user, per route
  • Cost is computed per request (not just estimated)
  • Semantic cache is implemented for repetitive queries
  • Provider prompt caching is enabled (OpenAI / Anthropic / Gemini)
  • Model routing routes simple requests to smaller, cheaper models
  • Per‑route token budgets are configured

Testing

  • Unit tests cover deterministic components (tool validation, prompt formatting)
  • Integration tests run with real (containerised) agents
  • Failure tests simulate timeouts, API errors, malformed responses
  • Offline evals run on every pull request
  • Load tests validate throughput and latency under peak traffic

Frequently Asked Questions

1. What makes an AI agent production‑ready?
A production agent is reliable (handles failures), observable (every decision is traced), testable (changes validated against a dataset), deployable (canary releases, instant rollback), secure (authenticated, prompt‑injection protected), and cost‑controlled (budgets, caching, model routing).

2. How do you monitor AI agents in production?
Monitor success rate, P95 latency, tool error rates, token usage, and cost per request. Alert with anomaly detection against a 7‑day rolling baseline, because static thresholds alone misfire on non‑deterministic systems.

3. How do you evaluate agent performance?
Run offline (supervised) evaluation against a labelled dataset, and online (unsupervised) evaluation on production traces with an LLM‑as‑judge. Make each eval binary and specific, and validate that your judge correlates with human judgment.

4. How do you control cost in production agents?
Cache aggressively (semantic cache + provider prompt caching). Route simple queries to cheap models. Set per‑route token budgets. Track cost per request. Distilled judges can cut evaluation cost by 30–50%.

5. How do you deploy agent updates safely?
Four‑stage gate: shadow (0% user impact) → canary (1–5%) → percentage (10–50%) → full. Each stage answers a different evaluation question. Auto‑rollback on metric breach.
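
One way to picture the gate is as a small decision function that runs at each stage. This is a sketch under stated assumptions: the metric names and limits are hypothetical, and only lower-is-better metrics are handled here (a higher-is-better metric such as groundedness would need the comparison inverted).

```python
STAGES = ["shadow", "canary", "percentage", "full"]


def next_action(stage, metrics, limits):
    """Return ("rollback", breached), ("promote", next_stage) or ("hold", stage).

    metrics and limits map metric name -> value; a metric breaches when it
    exceeds its limit. A metric missing from `metrics` counts as a breach,
    so an unmeasured release is never promoted.
    """
    breached = [name for name, limit in limits.items()
                if metrics.get(name, float("inf")) > limit]
    if breached:
        return ("rollback", breached)
    index = STAGES.index(stage)
    if index == len(STAGES) - 1:
        return ("hold", stage)  # already at full rollout
    return ("promote", STAGES[index + 1])
```

For example, with `limits = {"error_rate": 0.01, "p95_latency_s": 3.0}`, a canary reporting an error rate of 0.05 returns a rollback decision naming `error_rate`.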

6. What is the single most important observability practice?
Trace every LLM call, every tool invocation, every retrieval, and every decision point. Without traces, you cannot debug a single production failure.

7. Do I need a circuit breaker for my agent?
Yes, if you make external calls. Retries and fallbacks are reactive; circuit breakers prevent a persistent failure from cascading across the system.
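
A minimal circuit breaker, sketched under assumptions, might look like this. The class name, the threshold of five consecutive failures, and the 30-second cooldown are illustrative choices, not values from a specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a failing provider.
                raise CircuitOpenError("provider circuit is open")
            # Cooldown elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers catch `CircuitOpenError` and go straight to a fallback (secondary model or degraded mode) rather than retrying.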

8. How do I handle prompt injection in production?
Scan all user inputs before they reach the LLM (input guardrail). Scan all LLM outputs before returning to the user (output guardrail). Use type‑directed privilege separation for systematic defence.

9. What metrics should I alert on?
Alert when success rate falls below 99%, tool error rate exceeds 5%, latency deviates more than 3σ from baseline, cost per request spikes, or PII is detected in outputs.
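
The 3σ latency rule can be sketched with the standard library. The `is_anomalous` helper and the shape of the baseline (a list of recent samples, for instance one P95 value per hour over the last 7 days) are assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(baseline, value, sigmas=3.0):
    """True if value lies more than `sigmas` standard deviations from the
    mean of the baseline samples."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline)
    sd = stdev(baseline)
    if sd == 0:
        return value != mu  # flat baseline: any change is a deviation
    return abs(value - mu) > sigmas * sd
```

Because the baseline rolls forward, the alert adapts to gradual drift while still catching sudden shifts. Note that a simple mean and standard deviation are sensitive to outliers already in the baseline; a median-based estimate is a common alternative.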

10. Can I use the same evaluation for offline and online?
Not with the same rubric. Offline evals require ground truth; online (continuous) evals see only the agent’s context. You can use LLM‑as‑judge for both, but the rubric differs.

11. How do I test failure scenarios?
Inject timeouts, API errors, malformed responses, and slow calls in integration tests. Verify that retries, circuit breakers, and fallbacks behave as expected.

12. How do I roll back a bad prompt change?
Revert the prompt version, redeploy, and — critically — invalidate any semantic caches keyed by prompt hash. Caches can serve bad responses for hours after rollback.
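
Keying cache entries by prompt hash is what makes this invalidation cheap. The sketch below uses an exact-match dictionary as a stand-in for a real semantic cache (which would match on embeddings); the class and function names are hypothetical.

```python
import hashlib


def prompt_version(prompt_template: str) -> str:
    """Short, stable fingerprint of the prompt text."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12]


class PromptScopedCache:
    """Cache whose keys include the prompt fingerprint, so a reverted
    prompt never reads entries written under the bad version."""

    def __init__(self):
        self._store = {}  # (prompt_version, normalized query) -> response

    @staticmethod
    def _key(prompt_template, query):
        normalized = " ".join(query.lower().split())
        return (prompt_version(prompt_template), normalized)

    def get(self, prompt_template, query):
        return self._store.get(self._key(prompt_template, query))

    def put(self, prompt_template, query, response):
        self._store[self._key(prompt_template, query)] = response

    def invalidate(self, prompt_template):
        """Drop every entry written under this prompt; return the count."""
        version = prompt_version(prompt_template)
        stale = [key for key in self._store if key[0] == version]
        for key in stale:
            del self._store[key]
        return len(stale)
```

With this keying, reverting the prompt already stops new reads from hitting bad entries; calling `invalidate` with the bad prompt text then frees the stale entries explicitly.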

13. What is the difference between canary and blue‑green?
Canary deploys to 1–5% of live traffic (gradual expansion). Blue‑green maintains two full environments (blue = current, green = updated) and switches traffic instantly. Use both.

14. How do I set timeouts for LLM calls?
Set a connection timeout (5 seconds), a request timeout (30 seconds), and a workflow‑level timeout (60 seconds). Never rely on default infinite timeouts.
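
The request and workflow levels can be sketched with `asyncio.wait_for`; the connection timeout is normally set on the HTTP client itself and is not shown. The `run_step` and `run_workflow` helpers are hypothetical names, and each step is assumed to be a zero-argument async callable.

```python
import asyncio

REQUEST_TIMEOUT_S = 30   # per LLM or tool call
WORKFLOW_TIMEOUT_S = 60  # whole multi-step workflow


async def run_step(step, timeout_s=REQUEST_TIMEOUT_S):
    """Run one async step, cancelling it if it exceeds timeout_s."""
    return await asyncio.wait_for(step(), timeout=timeout_s)


async def run_workflow(steps, timeout_s=WORKFLOW_TIMEOUT_S,
                       step_timeout_s=REQUEST_TIMEOUT_S):
    """Run steps in order under both a per-step and an overall deadline."""
    async def run_all():
        results = []
        for step in steps:
            results.append(await run_step(step, step_timeout_s))
        return results

    # The outer deadline catches workflows where every step is individually
    # fast enough but the total still runs too long.
    return await asyncio.wait_for(run_all(), timeout=timeout_s)
```

Both levels raise `asyncio.TimeoutError`, which the caller can treat as a transient failure for retry or fallback purposes.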

15. How do I handle the LLM’s non‑determinism?
You cannot eliminate it, so design for it. Use temperature=0 for consistency where possible, keeping in mind that it reduces variance but does not guarantee identical outputs. Evaluate on distributions, not single traces. Use statistical anomaly detection for monitoring.

16. Should I use one large model or many small models?
Many small models with routing. Use GPT‑4o‑mini or GPT‑4.1‑nano for simple tasks. Escalate to GPT‑4o or GPT‑5 only when necessary. This can cut costs by 30–50%.

17. What is the minimum evaluation dataset size?
Start with 200–500 labelled examples covering edge cases and failure modes. Expand as you collect production traces with known outcomes.

18. How do I trace a multi‑step agent workflow?
Use OpenTelemetry spans: one root span for the entire workflow, child spans for each LLM call, tool call, and handoff. Carry trace_id across all services.

19. How do I prevent a single user from blowing the API budget?
Per‑user rate limits, per‑request token budgets, and hard monthly quotas at the gateway. Alert when any user approaches 80% of their quota.
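
A hard monthly quota with an 80% alert can be sketched as an in-memory counter. The `UserQuota` class is a hypothetical illustration; a real gateway would keep these counters in shared storage and reset them each billing period.

```python
class QuotaExceeded(Exception):
    """Raised when a request would push a user past their monthly quota."""


class UserQuota:
    def __init__(self, monthly_token_limit, alert_fraction=0.8):
        self.limit = monthly_token_limit
        self.alert_fraction = alert_fraction
        self.used = {}  # user_id -> tokens consumed this month

    def record(self, user_id, tokens):
        """Charge tokens to the user.

        Returns True once the user has reached the alert threshold.
        Raises QuotaExceeded, without charging, if the request would
        exceed the hard limit.
        """
        current = self.used.get(user_id, 0)
        if current + tokens > self.limit:
            raise QuotaExceeded(user_id)
        self.used[user_id] = current + tokens
        return self.used[user_id] >= self.alert_fraction * self.limit
```

The boolean return lets the caller emit an alert as soon as a user crosses 80% of their quota, before the hard limit starts rejecting requests.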

20. What should I do if my agent’s success rate drops below 99%?
First, check traces for the failing requests. Is it a specific tool failing? A particular prompt pattern? A model version change? Fix, run offline evals, deploy via canary, monitor.

Conclusion

The gap between an agent that works in a notebook and an agent that works in production is wide — but it is not unbridgeable. Production AI agents require observability, evaluation, progressive deployment, reliability engineering, security, and cost control.

  • Observability is your foundation — trace everything.
  • Evaluation is your truth — offline for changes, online for production.
  • Deployment is progressive — shadow → canary → percentage → full.
  • Reliability is layered — retries → fallbacks → circuit breakers.
  • Security is non‑negotiable — guardrails in, guardrails out.
  • Cost is controllable — cache, route, budget.

Every practice in this guide has been tested in real production environments at scale. Apply them, and your agents will be reliable, observable, and ready for users.


This article is part of the AgentDevPro Handbook — practical, engineering‑focused guides for building production AI agent systems.