Agent Memory: Engineering Context Retention for Production AI Agents
What Is Agent Memory
Agent Memory is the capability that enables an AI agent to retain, retrieve, update, and utilise information across interactions, turns, and tasks. Unlike a stateless LLM call, an agent with memory can recall a user’s preference from a conversation three weeks ago, reuse a successful plan from a similar task, or avoid repeating a tool call that already failed.
Memory turns an agent from a stateless function into a system that learns, personalises, and improves over time. It is the difference between “I don’t remember you” and “Welcome back, John – your last order #1234 is still in transit.”
Why Memory Matters
Without memory, every agent interaction starts from zero. The user repeats themselves, the agent repeats the same expensive tool calls, and no personalisation is possible.
| Capability | Without Memory | With Memory |
|---|---|---|
| Personalisation | “What’s your name?” every time. | Remembers user name, preferences, history. |
| Long‑running tasks | Cannot pause and resume. | Checkpoints state, resumes days later. |
| Context retention | Only current turn. | Keeps conversation history across turns. |
| Multi‑step workflows | Must re‑plan from scratch. | Reuses previous steps, avoids duplicate work. |
| Knowledge reuse | Re‑retrieves same information. | Caches retrieved facts, reduces cost. |
Practical example: A customer support agent handling a refund request. Memory stores that the user already verified their identity in turn 1, so turn 2 does not ask again. Memory also remembers that the user has contacted support three times this month – the agent can escalate to a human automatically.
Memory in Agent Architecture
Memory is not a single component. It is a set of storage and retrieval mechanisms that plug into the agent’s runtime at multiple points.
Key interaction points:
- Pre‑LLM – Memory retrieval injects relevant past information into the context.
- During execution – Working memory stores intermediate results (tool outputs).
- Post‑execution – The agent updates short‑term and long‑term memory with new facts.
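The three interaction points can be sketched as a single turn function. This is a minimal illustration, not a framework API: `MemoryHooks`, `run_turn`, and the keyword-overlap retrieval are hypothetical stand-ins, and a real agent would let the LLM decide which tools to call rather than running all of them.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryHooks:
    """Hypothetical holder for long-term facts and per-turn working memory."""
    long_term: list[str] = field(default_factory=list)
    working: dict[str, object] = field(default_factory=dict)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Naive keyword overlap stands in for real semantic retrieval.
        words = set(query.lower().split())
        hits = [m for m in self.long_term if words & set(m.lower().split())]
        return hits[:k]


def run_turn(user_msg: str, memory: MemoryHooks,
             llm: Callable[[str], str],
             tools: dict[str, Callable[[], object]]) -> str:
    # 1. Pre-LLM: inject relevant memories into the prompt.
    recalled = memory.retrieve(user_msg)
    prompt = "Known facts:\n" + "\n".join(recalled) + f"\nUser: {user_msg}"
    # 2. During execution: tool outputs are held in working memory.
    for name, tool in tools.items():
        memory.working[name] = tool()
    answer = llm(prompt + f"\nTool results: {memory.working}")
    # 3. Post-execution: persist what was learned, then clear working memory.
    memory.long_term.append(f"user said: {user_msg}")
    memory.working.clear()
    return answer
```

Passing the LLM and tools in as plain callables keeps the sketch testable with fakes; the memory calls sit at the same three points regardless of which model or store is used.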
Types of Agent Memory
| Memory Type | Lifetime | Content Example | Storage Tech |
|---|---|---|---|
| Working memory | Single turn or step | “Current tool output = {temp: 22°C}” | In‑memory dict, session variable |
| Short‑term memory | Session (minutes to hours) | Last 10 conversation turns | Redis list, SQLite, circular buffer |
| Long‑term memory | Persistent across sessions (weeks to years) | “User prefers email over SMS”, “User’s shipping address” | PostgreSQL, DynamoDB, vector DB |
| Episodic memory | Persistent (specific events) | “On March 15, user asked about refund policy” | Time‑series + embedding |
| Semantic memory | Persistent (facts, knowledge) | “Product X has a 2‑year warranty” | Vector database, knowledge graph |
Working memory – Extremely short‑lived, not persisted. Used for chaining tool calls within a single plan.
Short‑term memory – Also called “conversation buffer”. Stores raw exchange history. Usually bounded by token limit or number of turns.
Long‑term memory – Stores structured facts about the user or domain. Often updated via extraction prompts (e.g., “Extract user preferences from this conversation”).
Episodic memory – Stores complete past interactions as “episodes”. Useful for case‑based reasoning: “When this happened before, we did X.”
Semantic memory – General facts and domain knowledge (company policies, product specs), usually static and typically served through RAG, which is why the two are often conflated. It differs from episodic memory because it is not tied to a specific user interaction.
Agent Memory Lifecycle
Each piece of memory goes through a lifecycle from creation to eventual expiration.
Stage 1: Memory Creation
- Triggered after each agent turn, or explicitly by a tool call.
- Extract facts using a small LLM or rule‑based parser.
- Generate embeddings for semantic retrieval.
- Attach metadata: user ID, session ID, timestamp, type (preference, fact, event), importance score.
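A rule-based version of the creation stage might look like the sketch below. The `MemoryRecord` fields mirror the metadata listed above; the regular expression, the `extract_memories` name, and the default importance of 0.8 are illustrative assumptions. Embedding generation is left out to keep the example self-contained.

```python
import re
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    user_id: str
    session_id: str
    type: str          # "preference", "fact", or "event"
    text: str
    importance: float  # 0.0 to 1.0
    timestamp: float = field(default_factory=time.time)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


# Illustrative rule: "I prefer X" or "I like X" becomes a preference memory.
PREFERENCE_RE = re.compile(r"\bI (?:prefer|like) ([^.,!?]+)", re.IGNORECASE)


def extract_memories(turn: str, user_id: str, session_id: str) -> list[MemoryRecord]:
    records = []
    for match in PREFERENCE_RE.finditer(turn):
        records.append(MemoryRecord(
            user_id=user_id,
            session_id=session_id,
            type="preference",
            text=f"User prefers {match.group(1).strip()}",
            importance=0.8,  # assumed default for explicitly stated preferences
        ))
    return records
```

A small LLM extractor would replace the regular expression in practice, but the record shape and metadata stay the same.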
Stage 2: Memory Storage
- Write to appropriate store (vector DB for semantic, relational for structured).
- Ensure idempotency (avoid duplicate memories for same fact).
- Set initial TTL (e.g., 30 days for short‑term, 1 year for long‑term).
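One way to get idempotent writes is to derive the primary key from the normalised fact text, so a repeated fact refreshes its expiry instead of creating a second row. The sketch below uses SQLite for portability; the table layout and the `memory_key` and `store_memory` names are assumptions for illustration.

```python
import hashlib
import sqlite3
import time

DAY = 86_400  # seconds


def memory_key(user_id: str, text: str) -> str:
    # Normalise case and whitespace so trivial variations map to one key.
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(f"{user_id}|{normalised}".encode()).hexdigest()


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memories ("
        " key TEXT PRIMARY KEY, user_id TEXT, text TEXT,"
        " created_at REAL, expires_at REAL)"
    )
    return conn


def store_memory(conn, user_id, text, ttl_days=30, now=None):
    now = time.time() if now is None else now
    # Upsert: an existing key keeps its row and only gets a new expiry.
    conn.execute(
        "INSERT INTO memories VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT(key) DO UPDATE SET expires_at = excluded.expires_at",
        (memory_key(user_id, text), user_id, text, now, now + ttl_days * DAY),
    )
    conn.commit()
```

Exact-match hashing only catches literal repeats; paraphrased duplicates still need a similarity check at write time.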
Stage 3: Memory Retrieval
- Before each LLM call, the agent queries memory.
- Use hybrid search (vector similarity + keyword + metadata filters).
- Limit to top‑K (e.g., 5–10 memories) to avoid context overflow.
Stage 4: Memory Consolidation (Compaction)
- When short‑term memory exceeds token budget (e.g., >8k tokens), trigger summarisation.
- Use an LLM to summarise older turns into a condensed form.
- Replace raw turns with summary; keep full raw history in cold storage (optional).
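The consolidation step can be sketched as a pure function that takes the summariser as a parameter. The whitespace token count is a crude stand-in for a real tokenizer, and `compact` and `keep_recent` are hypothetical names; in production `summarise` would be an LLM call.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words, not a real tokenizer.
    return len(text.split())


def compact(turns: list[str], budget: int, keep_recent: int,
            summarise: Callable[[list[str]], str]) -> list[str]:
    """Replace older turns with one summary once the token budget is exceeded."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    old, recent = turns[:split], turns[split:]
    summary = summarise(old)  # in practice, an LLM summarisation call
    return [f"[summary] {summary}"] + recent
```

Keeping the most recent turns verbatim preserves exact wording where it matters most, while the summary carries the gist of everything earlier.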
Stage 5: Memory Expiration
- Delete or archive memories past their TTL.
- For long‑term memory, implement soft delete (mark as inactive) rather than hard delete.
- Periodic batch job to remove stale entries.
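A periodic expiry job along these lines could hard-delete short-term rows and soft-delete long-term ones. The schema, the `tier` and `active` columns, and the `expire_memories` name are assumptions for this sketch.

```python
import sqlite3

# Assumed schema: tier is "short" or "long"; active = 0 marks a soft delete.
SCHEMA = (
    "CREATE TABLE memories ("
    " id INTEGER PRIMARY KEY, tier TEXT, expires_at REAL,"
    " active INTEGER DEFAULT 1)"
)


def expire_memories(conn: sqlite3.Connection, now: float) -> tuple[int, int]:
    """Return (rows hard-deleted, rows soft-deleted) for memories past their TTL."""
    deleted = conn.execute(
        "DELETE FROM memories WHERE tier = 'short' AND expires_at <= ?", (now,)
    ).rowcount
    deactivated = conn.execute(
        "UPDATE memories SET active = 0 "
        "WHERE tier = 'long' AND active = 1 AND expires_at <= ?", (now,)
    ).rowcount
    conn.commit()
    return deleted, deactivated
```

Retrieval queries then need an `active = 1` filter so soft-deleted rows never reach the prompt.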
Memory Storage Architectures
Context Window Memory
The simplest form: store memory directly in the LLM context window as text.
Implementation: Append each turn to a list, join as string, prefix to prompt.
Pros: Zero infrastructure, immediate, no retrieval logic.
Cons: Bounded by the context window (history grows until it hits the limit), not persisted across sessions, no selective retrieval.
When to use: Prototypes, single‑turn tasks, sessions that fit in 8k tokens.
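A minimal sketch of context window memory, assuming a word count as a rough token proxy and eviction of the oldest turns once the budget is exceeded. `ConversationBuffer` is a hypothetical name.

```python
from collections import deque


class ConversationBuffer:
    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens
        self.turns = deque()  # (role, text) pairs, oldest first

    @staticmethod
    def _tokens(text: str) -> int:
        return len(text.split())  # rough proxy for a tokenizer

    def _total(self) -> int:
        return sum(self._tokens(text) for _, text in self.turns)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        # Evict the oldest turns until within budget; always keep the newest.
        while len(self.turns) > 1 and self._total() > self.max_tokens:
            self.turns.popleft()

    def as_prompt(self, user_msg: str) -> str:
        # Join the history as text and prefix it to the new message.
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return f"{history}\nuser: {user_msg}"
```

Everything lives in process memory, so the history is lost when the process exits, which is the main limitation noted above.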
Vector Database Memory
Embed each memory chunk and store embeddings. Retrieve by semantic similarity.
Implementation: Use OpenAI text-embedding-3-small or local model (BGE, E5). Store in Pinecone, Weaviate, Qdrant, or pgvector.
Pros: Semantic retrieval, scales to millions of memories, supports filtering.
Cons: Adds embedding cost, tuning effort (chunk size, overlap), and retrieval latency (roughly 100 ms).
When to use: Long‑term memory, RAG, any system where memories are not strictly ordered by time.
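The store-then-search-by-similarity pattern can be shown without any external service. In this sketch a bag-of-words `Counter` stands in for a real embedding model and a Python list stands in for the vector database; `toy_embed` and `VectorMemory` are hypothetical names, and only the cosine ranking and metadata filtering carry over to a real system.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for an embedding model: a sparse bag-of-words vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorMemory:
    def __init__(self):
        self.items = []  # (vector, text, metadata) triples

    def add(self, text: str, **metadata) -> None:
        self.items.append((toy_embed(text), text, metadata))

    def search(self, query: str, k: int = 5, **filters):
        q = toy_embed(query)
        # Apply metadata filters first, then rank survivors by similarity.
        scored = [
            (cosine(q, vec), text)
            for vec, text, meta in self.items
            if all(meta.get(key) == val for key, val in filters.items())
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Filtering by `user_id` before scoring is what keeps one tenant's memories out of another tenant's results.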
Relational Database Memory
Store structured memories in tables: user_id, fact_type, fact_value, timestamp, ttl.
Implementation: PostgreSQL, MySQL.
Pros: Exact queries, ACID, easy updates, joins across user data.
Cons: No semantic search (only exact match or LIKE), requires predefined schema.
When to use: User preferences, settings, facts that are key‑value pairs.
Knowledge Graph Memory
Store memories as nodes (entities) and edges (relations). Example: [User] - [prefers] -> [email_notifications].
Implementation: Neo4j, Amazon Neptune.
Pros: Complex reasoning, relationship traversal, explainable.
Cons: High complexity, non‑trivial to build from raw text.
When to use: Enterprise domains with many relationships (e.g., customer‑product‑region).
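The node-and-edge idea can be illustrated with an in-memory triple store. `TripleStore` and the example relations are hypothetical; a graph database such as Neo4j would replace the dictionary and run the traversal as a query.

```python
from collections import defaultdict


class TripleStore:
    def __init__(self):
        self.edges = defaultdict(list)  # subject -> [(relation, object)]

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.edges[subject].append((relation, obj))

    def objects(self, subject: str, relation: str) -> list[str]:
        return [o for r, o in self.edges.get(subject, []) if r == relation]

    def path(self, start: str, *relations: str) -> list[str]:
        """Follow a chain of relations and return every node reached."""
        frontier = [start]
        for relation in relations:
            frontier = [o for node in frontier for o in self.objects(node, relation)]
        return frontier


graph = TripleStore()
graph.add("alice", "prefers", "email_notifications")
graph.add("alice", "bought", "product_x")
graph.add("product_x", "sold_in", "emea")
regions = graph.path("alice", "bought", "sold_in")  # customer -> product -> region
```

The two-hop `path` call is the kind of relationship traversal that is awkward to express against a flat vector or key-value store.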
Hybrid Memory Architecture
Combine multiple stores: short‑term in Redis, long‑term facts in PostgreSQL, semantic memory in vector DB.
Pros: Best of each world.
Cons: Complex orchestration, consistency challenges.
Memory Retrieval Strategies
| Strategy | Description | When to Use |
|---|---|---|
| Semantic search | Embed query, retrieve by cosine similarity. | “Find memories about product returns” – meaning matters. |
| Similarity search | Same as semantic (often used interchangeably). | General retrieval. |
| Metadata filtering | Filter by user ID, date range, type before vector search. | Multi‑tenant, time‑bounded queries. |
| Hybrid retrieval | Combine vector similarity + keyword BM25 + metadata. | Highest accuracy. |
| Context compression | Retrieve many candidates, then compress (summarise) before injecting. | Very long memories (e.g., 50 pages of chat log). |
| Recency boost | Boost scores of recent memories. | Conversations where recent context matters more. |
| Importance ranking | Use a learned model or heuristic (e.g., user starred message). | High‑value memories. |
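Implementation example (pseudo‑code; embed, vector_db, keyword_search, and merge_results are placeholders for your own components):
```python
from datetime import datetime, timezone


def retrieve_memory(query, user_id, limit=5):
    # Embed the query
    query_vec = embed(query)
    # Vector search with metadata filter; over-fetch so reranking has candidates
    vector_results = vector_db.search(
        vector=query_vec,
        filter={"user_id": user_id, "type": "conversation"},
        limit=limit * 2,
    )
    # Keyword search (if using hybrid)
    keyword_results = keyword_search(query, filter={"user_id": user_id})
    # Merge and rerank
    merged = merge_results(vector_results, keyword_results, weights=[0.7, 0.3])
    # Apply recency boost (decay factor 0.9 per day);
    # mem.timestamp is assumed to be a timezone-aware datetime
    now = datetime.now(timezone.utc)
    for mem in merged:
        days_old = max((now - mem.timestamp).days, 0)
        mem.score = mem.score * (0.9 ** days_old)
    # Re-sort: the decay can change the order produced by the merge
    merged.sort(key=lambda mem: mem.score, reverse=True)
    return merged[:limit]
```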
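The merge step in the example above is left undefined. One possible implementation is weighted score fusion, sketched below under two assumptions: each result carries an `id` and a non-negative `score`, and scores are made comparable by dividing by the best score in each list. The `Memory` dataclass is hypothetical, and other fusion schemes (such as reciprocal rank fusion) are equally valid choices.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass
class Memory:
    id: str
    text: str
    score: float
    timestamp: datetime


def merge_results(vector_results, keyword_results, weights=(0.7, 0.3)):
    """Weighted fusion of two result lists, returned best-first."""
    def normalise(results):
        # Scale scores to [0, 1] by the best score in the list.
        top = max((m.score for m in results), default=0.0)
        return {m.id: (m, m.score / top if top > 0 else 0.0) for m in results}

    vec, kw = normalise(vector_results), normalise(keyword_results)
    merged = []
    for mem_id in vec.keys() | kw.keys():
        mem = (vec.get(mem_id) or kw.get(mem_id))[0]
        vec_score = vec[mem_id][1] if mem_id in vec else 0.0
        kw_score = kw[mem_id][1] if mem_id in kw else 0.0
        fused = weights[0] * vec_score + weights[1] * kw_score
        merged.append(replace(mem, score=fused))  # copy with the fused score
    merged.sort(key=lambda m: m.score, reverse=True)
    return merged
```

A memory found by only one search still appears in the output, scored by that search's weighted contribution alone.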
Memory and RAG: Differences and Relationships
Many engineers confuse agent memory with Retrieval‑Augmented Generation (RAG). They are related but distinct.
| Aspect | RAG | Agent Memory |
|---|---|---|
| Purpose | Ground LLM responses in external knowledge base. | Retain conversation history and user‑specific facts. |
| Data source | Static corpus (documents, FAQs, product specs). | Dynamic interaction history (user messages, agent actions). |
| Retrieval trigger | Always before each LLM call (or on demand). | On every turn, but can be skipped for non‑memory tasks. |
| Updates | Batch re‑indexing. | Every turn (immediate update). |
| Lifetime | Persistent, rarely deleted. | Ephemeral (short‑term) to persistent (long‑term). |
| Use case | “What does the manual say about X?” | “What did the user tell me five minutes ago?” |
How they work together:
- RAG provides domain knowledge (semantic memory).
- Agent memory provides conversation and user‑specific context.
An agent can query both: first retrieve from RAG for factual grounding, then from short‑term memory for conversation history, and from long‑term memory for user preferences. All results are merged into the context.
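Merging the three sources into one context can be as simple as labelled sections. The sketch below assumes each source has already been retrieved as a list of strings; `build_context` and the section titles are illustrative choices.

```python
def build_context(question: str, rag_passages: list[str],
                  recent_turns: list[str], user_facts: list[str]) -> str:
    sections = [
        ("Reference material", rag_passages),  # RAG: domain knowledge
        ("Recent conversation", recent_turns),  # short-term memory
        ("User profile", user_facts),           # long-term memory
    ]
    parts = []
    for title, items in sections:
        if items:  # skip empty sources rather than emitting a bare heading
            parts.append(f"## {title}\n" + "\n".join(f"- {item}" for item in items))
    context = "\n\n".join(parts)
    return f"{context}\n\nQuestion: {question}"
```

Labelling each section tells the model which statements are reference knowledge and which are specific to this user, which helps when the two disagree.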
Common mistake: Using RAG instead of memory for conversation history. RAG is not designed for sequential, rapidly changing data.
Memory and Planning
Planning depends heavily on memory to avoid repeating work and to adapt to past outcomes.
- Memory of past plans – The planner retrieves successful plan templates from long‑term memory for similar goals, reducing LLM calls.
- Memory of tool outputs – Working memory stores results of previous steps, used for conditional branching.
- Memory of failures – Episodic memory records what went wrong (e.g., “tool X timed out at 5pm”). The replanner avoids repeating that tool at the same time.
Example: A research agent's scrape_website call fails because of anti‑bot protection, and the agent stores that episode. Next time, it retrieves the memory and chooses use_playwright instead.
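The failure-avoidance behaviour in this example can be sketched as a lookup over stored episodes. `Episode` and `choose_tool` are hypothetical names, and matching on an exact target string is a simplification; a real system would retrieve similar episodes by embedding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    tool: str
    target: str
    succeeded: bool
    note: str = ""


def choose_tool(target: str, candidates: list[str],
                episodes: list[Episode]) -> Optional[str]:
    """Pick the first candidate tool with no recorded failure on this target."""
    failed = {e.tool for e in episodes if e.target == target and not e.succeeded}
    for tool in candidates:
        if tool not in failed:
            return tool
    return None  # every option has failed before: escalate or replan


episodes = [Episode("scrape_website", "example.com", False, "anti-bot protection")]
next_tool = choose_tool("example.com", ["scrape_website", "use_playwright"], episodes)
```

Returning `None` when every candidate has failed gives the planner an explicit signal to escalate instead of retrying blindly.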
Memory and Tool Calling
Tools both read from and write to memory.
Reading (memory as input to tool selection):
- The LLM chooses a tool based on memory. Example: memory says “user prefers email”, so the agent calls send_email instead of send_sms.
Writing (tool results stored as memory):
- Every tool output should be stored in working memory for immediate use and, if valuable, promoted to long‑term memory.
Example flow:
- Tool get_order_status returns {status: "delayed", eta: "+2 days"}.
- Working memory stores this for the current plan.
- After the turn, the memory extractor promotes the fact “order #123 delayed” to long‑term memory.
- Future queries about that order will retrieve this fact without calling the tool again.
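Implementation pattern (working_memory, long_term_memory, call_tool, and is_stable_result are placeholders):
```python
import json


class ToolExecutor:
    def execute(self, tool_name, params):
        # Build a stable key: hash() raises on dicts and varies between processes.
        # Assumes params is JSON-serialisable.
        cache_key = f"{tool_name}:{json.dumps(params, sort_keys=True)}"
        # Check working memory (current turn) first, then long-term memory;
        # expiry is governed by the TTL set when the entry was written
        if (cached := working_memory.get(cache_key)) is not None:
            return cached
        if (cached := long_term_memory.get(cache_key)) is not None:
            return cached
        result = call_tool(tool_name, params)
        # Store result in working memory for the current turn
        working_memory.set(cache_key, result)
        # Optionally promote to long-term memory if the result is stable
        if is_stable_result(result):
            long_term_memory.set(cache_key, result, ttl=86400)
        return result
```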
Memory Architectures in Popular Frameworks
| Framework | Memory Model | Storage | Strengths | Limitations |
|---|---|---|---|---|
| LangGraph | State + checkpoints + optional vector store. | Any (Redis, Postgres, in‑memory). | Fine‑grained control; checkpointing includes memory. | No built‑in summarisation; you implement. |
| CrewAI | ShortTermMemory (ChromaDB‑backed RAG), LongTermMemory (SQLite), EntityMemory. | ChromaDB, SQLite. | Good out‑of‑box for short/long. | Entity memory is basic; no built‑in compaction. |
| AutoGen | Memory protocol; ListMemory, vector‑backed implementations (e.g., ChromaDBVectorMemory). | In‑memory, Chroma, Redis. | Extensible. | No native checkpointing of memory across sessions. |
| OpenAI Agents SDK | Sessions (conversation history keyed by session ID). | In‑memory or SQLite file (SQLiteSession); other session backends. | Simple. | Stores conversation turns only; no built‑in long‑term fact or semantic memory. |
| Semantic Kernel | MemoryStore interface; VolatileMemoryStore, AzureCognitiveSearch. | Azure Cognitive Search, Chroma, Qdrant. | Strong enterprise integration. | Steep learning curve for custom memory. |
Recommendation: For production, use LangGraph with a persistent checkpoint store (PostgreSQL) for state and a separate vector DB (pgvector or Qdrant) for long‑term semantic memory. Implement memory compaction as a separate node in the graph.
Production Memory Challenges
| Challenge | Description | Mitigation |
|---|---|---|
| Context window limits | Even short‑term memory can exceed 8k–128k tokens. | Summarise older turns; store only recent N turns raw. |
| Retrieval accuracy | Irrelevant memories pollute context, causing hallucinations. | Use hybrid search + reranking; set relevance threshold. |
| Stale memory | Outdated facts (e.g., “user’s old address”). | TTL; versioning; user‑confirm updates. |
| Duplicate memory | Same fact stored multiple times. | Deduplication via embedding similarity + exact match. |
| Privacy concerns | Storing PII in memory across sessions. | Auto‑redact; encrypt at rest; honour delete requests. |
| Cost management | Embedding every turn, storing many memories. | Batch embedding; prune low‑importance memories. |
| Latency | Retrieval adds 50–200ms. | Cache frequent queries; use faster vector DB (Qdrant). |
| Memory poisoning | User injects malicious facts (e.g., “I am admin”). | Validate and sanitise facts before storage. |
Memory Evaluation
You cannot optimise memory without measuring. Track these metrics offline and online.
| Metric | Definition | How to Measure |
|---|---|---|
| Precision@K | % of retrieved memories relevant to the query (top K). | Human annotators; automated with LLM as judge. |
| Recall@K | % of all relevant memories retrieved in top K. | Requires full relevant set (labelled). |
| Memory freshness | Average age of retrieved memories (lower is better for recent tasks). | Compute from timestamps. |
| Context relevance | % of injected memory tokens actually used by the LLM in reasoning. | Analyse attention or log whether memory appears in final answer. |
| User satisfaction | “Did the agent remember your preferences?” (post‑turn survey). | User feedback. |
| Memory overhead | Total memory storage size per user (bytes). | Database metrics. |
| Retrieval latency | p50, p95, p99 latency of memory retrieval. | OpenTelemetry spans. |
Example evaluation dataset:
| Query | Expected relevant memory | Retrieved (top 3) | Precision@3 |
|---|---|---|---|
| “Book a flight to Paris” | preferred_airline=Delta | preferred_airline=Delta, last_trip=London, seat_preference=aisle | 0.33 (only one correct) |
Improvement loop: Low precision → improve embedding model or add metadata filters. Low recall → increase K or reduce similarity threshold.
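Precision@K and Recall@K are straightforward to compute once a labelled set of relevant memories exists. The sketch below follows the definitions in the metrics table and divides precision by K; the function names are illustrative. It reproduces the 0.33 figure from the example dataset.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved memories that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for m in retrieved[:k] if m in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant memories that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


retrieved = ["preferred_airline=Delta", "last_trip=London", "seat_preference=aisle"]
relevant = {"preferred_airline=Delta"}
p3 = precision_at_k(retrieved, relevant, k=3)  # 1 relevant out of 3 retrieved
r3 = recall_at_k(retrieved, relevant, k=3)     # the single relevant memory was found
```

Here precision is low while recall is perfect, which points at filtering or reranking rather than at a larger K.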
Best Practices
- Separate short‑term and long‑term memory – Use different stores and retention policies. Short‑term lives in Redis with a 1‑hour TTL; long‑term lives in PostgreSQL with retention measured in months or years.
- Always include metadata – For each memory, store user_id, session_id, timestamp, type (preference, fact, conversation), and importance (0–1). This enables filtering.
- Implement memory summarisation (compaction) – When short‑term memory exceeds 80% of the context window, trigger an LLM to summarise old turns into a condensed paragraph.
- Use hybrid retrieval – Semantic search alone misses exact keywords. Combine with BM25 or full‑text search.
- Set TTLs aggressively – Most memories become stale after weeks. Default long‑term TTL = 90 days unless explicitly marked as permanent.
- Cache frequent retrievals – If the same query repeats (e.g., “user’s shipping address”), cache the result for the session.
- Validate facts before storing – Use a small LLM to check “Is this fact likely true and non‑contradictory?” before writing to long‑term memory.
- Monitor retrieval quality – Log retrieved_memories and actually_used_in_response to compute relevance.
- Encrypt memory at rest – Especially for PII. Use AES‑256 with per‑user keys.
- Provide a memory reset mechanism – Users should be able to clear their long‑term memory (GDPR, privacy).
- Treat memory as part of state – When you checkpoint an agent’s execution, include the current memory snapshot. Otherwise, replay may be inconsistent.
- Test memory with simulated long conversations – Automatically generate 100‑turn conversations and measure retrieval accuracy and context growth.
Common Memory Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Storing everything | Explosive growth, high cost, slow retrieval. | Prune low‑importance memories; summarise; set TTL. |
| No memory expiration | Old, incorrect facts persist forever. | Always assign TTL; run periodic expiry jobs. |
| Poor chunking strategy | Embedding whole conversation turns mixes unrelated topics. | Chunk at sentence or paragraph level; overlap. |
| Ignoring privacy | Storing sensitive data without consent. | Redact PII before storage; offer memory deletion. |
| Mixing task context with user profiles | Temporary working memory pollutes long‑term memory. | Keep separate stores; promote only explicitly confirmed facts. |
| No memory retrieval for planning | Planner repeats same mistakes. | Include episodic memory in planning phase. |
| Retrieving too many memories | Context overflow, LLM confused. | Limit to K=5–10; compress with summarisation. |
| No fallback when retrieval fails | Agent hallucinates because no memory found. | Return “no relevant memory” as a signal; prompt LLM to ask user. |
Case Study: Enterprise Knowledge Assistant
Scenario: A large company deploys an internal agent for employees to ask questions about HR policies, IT support, and project documentation. The agent needs to remember each employee’s role, past requests, and preferences.
Memory Architecture
Storage layers:
- Short‑term (Redis) – Raw conversation history of the current session. TTL = 1 hour after last message.
- Long‑term structured (PostgreSQL) – Table user_facts: (user_id, fact_key, fact_value, updated_at). Example: (alice, role, "Software Engineer").
- Episodic (Qdrant) – Embeddings of past question‑answer pairs, plus resolution steps. Used to retrieve “When Alice asked about VPN access last month, what did we do?”
- Cache (Redis) – Results of frequent queries like “What is the IT support phone number?” for 1 day.
Retrieval Workflow
For a query: “Alice, I need to request a new laptop – what’s the process?”
- Short‑term memory retrieves last 5 turns (Alice already said her current laptop is slow).
- Long‑term structured fetches role=Software Engineer, department=Engineering.
- Episodic searches past requests: “laptop request process” → returns a previous resolution: “Fill form X, requires manager approval”.
- Cache returns “IT support contact: [email protected]”.
These are merged and injected into the LLM context. The agent answers: “As a Software Engineer, you can request a new laptop via form X (I see you asked about a slow laptop earlier). Your manager will need to approve. Would you like me to open the form?”
Optimisation Strategy
- Importance scoring – Episodic memories that led to successful resolutions get higher importance (boost in retrieval).
- Periodic compaction – Every night, summarise short‑term memories older than 24 hours into a weekly digest, stored in episodic memory.
- Cost control – Embeddings generated only for turns that contain new information (detected by similarity to last 10 memories).
- Privacy – After 90 days, episodic memories are anonymised (user ID removed) and kept for aggregate analytics only.
Result: The assistant resolved 78% of queries without human escalation (up from 52% without memory). p95 retrieval latency: 85 ms.
FAQ
1. Is memory the same as RAG?
No. RAG retrieves from a static knowledge base. Memory stores dynamic, user‑specific, time‑ordered interactions. They complement each other.
2. How much memory should an agent store per user?
For short‑term: last 10–20 turns or 8k tokens. For long‑term: thousands of facts is fine, but retrieve only top 5–10 most relevant per query.
3. When should memories expire?
Short‑term: minutes to hours. Long‑term: 90 days default, but allow permanent for critical preferences (e.g., “user is allergic to peanuts”). Episodic: 30–90 days.
4. Can memory improve planning?
Absolutely. By retrieving past successful plans (episodic memory), the planner can reuse them, saving LLM calls and improving reliability.
5. Which vector database should I choose for agent memory?
Qdrant or pgvector for self‑hosted; Pinecone for managed. Qdrant offers built‑in filtering and fast hybrid search. pgvector is simplest if you already use PostgreSQL.
6. How do I prevent memory poisoning (user injecting false facts)?
Never store unverified facts. Use a guardrail LLM to check fact plausibility. For critical facts (e.g., “I am an admin”), require external authentication.
7. Do I need to embed every memory?
Only if you intend to retrieve it semantically. For structured facts (e.g., role=engineer), use exact key‑value stores, not vectors.
8. How does memory work in multi‑agent systems?
Each agent may have its own memory, plus a shared memory for coordination. Shared memory must be carefully permissioned to avoid conflicts.
9. What is the cost of memory at scale?
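Embedding: roughly $0.00002–$0.0001 per 1k tokens with OpenAI models, depending on the model. Storage: vector DB ~$0.10–$1 per GB per month. For 10k users with 100 facts each, the raw text is on the order of 100 MB; the embeddings add a few GB at 1536 dimensions in float32. Either way, storage cost is small. The bigger costs are retrieval latency and LLM context tokens.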
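A quick back-of-envelope check of these numbers, using $0.0001 per 1k tokens from the answer above. The tokens-per-fact figure and the 1536-dimension float32 vector size are assumptions; adjust them to your embedding model.

```python
users = 10_000
facts_per_user = 100
tokens_per_fact = 25            # assumption: one short sentence per fact
price_per_1k_tokens = 0.0001    # USD, from the answer above
dims, bytes_per_float = 1536, 4  # assumption: 1536-dim float32 embeddings

facts = users * facts_per_user
embedding_cost_usd = facts * tokens_per_fact / 1000 * price_per_1k_tokens
vector_storage_gb = facts * dims * bytes_per_float / 1e9

# Monthly storage cost at the quoted $0.10 to $1 per GB.
monthly_low = vector_storage_gb * 0.10
monthly_high = vector_storage_gb * 1.00

print(f"{facts:,} facts, one-off embedding cost ${embedding_cost_usd:.2f}")
print(f"vectors: {vector_storage_gb:.2f} GB, "
      f"${monthly_low:.2f} to ${monthly_high:.2f} per month")
```

Under these assumptions the one-off embedding bill is a few dollars and vector storage is a few dollars a month, which supports the point that context tokens and latency dominate.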
10. Can I use the same memory store for multiple agents?
Yes, but each agent should use a namespace or filter by agent_id to avoid cross‑agent contamination.
11. How do I handle memory across distributed agent instances?
Use a centralised store (Redis, PostgreSQL) that all instances share. Guard concurrent updates with row‑level locking in PostgreSQL, or with atomic operations and optimistic locking (WATCH/MULTI) in Redis.
12. What is the role of summarisation in memory?
Summarisation (compaction) prevents context from growing unbounded. It condenses many turns into a few sentences, preserving key information while discarding noise.
13. How do I test my memory system?
Create a test suite with long synthetic conversations. After each turn, query the agent and verify that it correctly recalls facts from earlier turns. Measure recall.
14. Is working memory just a variable?
Essentially, yes. But in a distributed agent, working memory must be scoped to the current turn and cleared afterward. Use a per‑request context object.
15. Should I store tool outputs as memory?
Sometimes. Store stable, reusable tool outputs (e.g., “exchange rate USD/EUR = 0.92”) with a short TTL. Do not store transient outputs (e.g., “random joke”).
Continue Your Journey
Memory is foundational to intelligent agents. Next, see how memory interacts with other components:
- Planning – Agent Planning (memory for plan reuse)
- Tool Calling – Tool Calling (caching tool results)
- Lifecycle – Agent Lifecycle (where memory fits in execution)
- Frameworks – LangGraph Guide (implementing memory with checkpoints)
- Evaluation – Agent Evaluation (measuring memory effectiveness)
Or return to the Agent Learning Path to plan your next topic.
This article is part of the AgentDevPro Production Agent Engineering Handbook. Updated for Q2 2026.