Agent Memory: Engineering Context Retention for Production AI Agents

What Is Agent Memory

Agent Memory is the capability that enables an AI agent to retain, retrieve, update, and utilise information across interactions, turns, and tasks. Unlike a stateless LLM call, an agent with memory can recall a user’s preference from a conversation three weeks ago, reuse a successful plan from a similar task, or avoid repeating a tool call that already failed.

Memory turns an agent from a stateless function into a system that learns, personalises, and improves over time. It is the difference between “I don’t remember you” and “Welcome back, John – your last order #1234 is still in transit.”

Why Memory Matters

Without memory, every agent interaction starts from zero. The user repeats themselves, the agent repeats the same expensive tool calls, and no personalisation is possible.

Capability	Without Memory	With Memory
Personalisation	“What’s your name?” every time.	Remembers user name, preferences, history.
Long‑running tasks	Cannot pause and resume.	Checkpoints state, resumes days later.
Context retention	Only current turn.	Keeps conversation history across turns.
Multi‑step workflows	Must re‑plan from scratch.	Reuses previous steps, avoids duplicate work.
Knowledge reuse	Re‑retrieves same information.	Caches retrieved facts, reduces cost.

Practical example: A customer support agent handles a refund request. Short‑term memory records that the user verified their identity in turn 1, so turn 2 does not ask again. Long‑term memory shows that the user has contacted support three times this month, so the agent can escalate to a human automatically.

Memory in Agent Architecture

Memory is not a single component. It is a set of storage and retrieval mechanisms that plug into the agent’s runtime at multiple points.

Key interaction points:

Pre‑LLM – Memory retrieval injects relevant past information into the context.
During execution – Working memory stores intermediate results (tool outputs).
Post‑execution – The agent updates short‑term and long‑term memory with new facts.
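The three interaction points can be sketched as a single turn loop. This is a minimal sketch under stated assumptions: `MemoryStore`, `fake_llm`, and `run_turn` are hypothetical names, a keyword‑overlap lookup stands in for real retrieval, and the tool output is hard‑coded.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy long-term store: a flat list of fact strings (illustrative only)."""
    facts: list[str] = field(default_factory=list)

    def retrieve(self, query: str, limit: int = 3) -> list[str]:
        # Stand-in for real retrieval: keep facts sharing a word with the query.
        words = set(query.lower().split())
        hits = [f for f in self.facts if words & set(f.lower().split())]
        return hits[:limit]

    def update(self, fact: str) -> None:
        if fact not in self.facts:
            self.facts.append(fact)


def fake_llm(prompt: str) -> str:
    """Placeholder for a model call; reports how many memories it received."""
    return f"answer based on {prompt.count('MEMORY:')} memories"


def run_turn(user_msg: str, store: MemoryStore) -> str:
    # Pre-LLM: inject relevant past information into the context.
    recalled = store.retrieve(user_msg)
    prompt = "".join(f"MEMORY: {m}\n" for m in recalled) + f"USER: {user_msg}"

    # During execution: working memory holds intermediate results for this turn only.
    working = {"tool_output": {"temp_c": 22}}
    reply = fake_llm(prompt + f"\nTOOL: {working['tool_output']}")

    # Post-execution: persist new facts learned in this turn.
    store.update(f"user asked: {user_msg}")
    return reply


store = MemoryStore(facts=["user prefers email over SMS"])
first = run_turn("send my receipt by email", store)
```

The working dictionary is discarded when `run_turn` returns, while the store survives across turns, which mirrors the lifetimes described in the next section.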

Types of Agent Memory

Memory Type	Lifetime	Content Example	Storage Tech
Working memory	Single turn or step	“Current tool output = `{temp: 22°C}`”	In‑memory dict, session variable
Short‑term memory	Session (minutes to hours)	Last 10 conversation turns	Redis list, SQLite, circular buffer
Long‑term memory	Persistent across sessions (weeks to years)	“User prefers email over SMS”, “User’s shipping address”	PostgreSQL, DynamoDB, vector DB
Episodic memory	Persistent (specific events)	“On March 15, user asked about refund policy”	Time‑series + embedding
Semantic memory	Persistent (facts, knowledge)	“Product X has a 2‑year warranty”	Vector database, knowledge graph

Working memory – Extremely short‑lived, not persisted. Used for chaining tool calls within a single plan.

Short‑term memory – Also called “conversation buffer”. Stores raw exchange history. Usually bounded by token limit or number of turns.

Long‑term memory – Stores structured facts about the user or domain. Often updated via extraction prompts (e.g., “Extract user preferences from this conversation”).

Episodic memory – Stores complete past interactions as “episodes”. Useful for case‑based reasoning: “When this happened before, we did X.”

Semantic memory – Often confused with RAG: RAG is the retrieval technique, while semantic memory is the largely static knowledge base it reads from (company policies, product specs). It differs from episodic memory because it is not tied to a specific user interaction.

Agent Memory Lifecycle

Each piece of memory goes through a lifecycle from creation to eventual expiration.

Stage 1: Memory Creation

Triggered after each agent turn or explicitly by a tool call.
Extract facts using a small LLM or rule‑based parser.
Generate embeddings for semantic retrieval.
Attach metadata: user ID, session ID, timestamp, type (preference, fact, event), importance score.
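The creation steps above might look like the following. It is a sketch, not a reference implementation: `MemoryRecord`, `extract_memories`, the regular expression, and the fixed importance score of 0.8 are all assumptions, and the embedding step is omitted because it depends on the model you choose.

```python
import re
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemoryRecord:
    user_id: str
    session_id: str
    text: str
    type: str            # "preference", "fact", or "event"
    importance: float    # 0.0 to 1.0
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Hypothetical rule: a sentence of the form "I prefer X" signals a preference.
PREFERENCE_RE = re.compile(r"\bI prefer (.+?)(?:[.!?]|$)", re.IGNORECASE)


def extract_memories(turn: str, user_id: str, session_id: str) -> list[MemoryRecord]:
    """Rule-based extraction; a small LLM extractor could replace this function."""
    records = []
    for match in PREFERENCE_RE.finditer(turn):
        records.append(MemoryRecord(
            user_id=user_id,
            session_id=session_id,
            text=f"User prefers {match.group(1).strip()}",
            type="preference",
            importance=0.8,  # assumed heuristic score
        ))
    return records


records = extract_memories("Thanks. I prefer email over SMS.", "u1", "s1")
```

A rule‑based parser like this is cheap and predictable but only catches phrasings you anticipated; an LLM extractor trades cost for coverage.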

Stage 2: Memory Storage

Write to appropriate store (vector DB for semantic, relational for structured).
Ensure idempotency (avoid duplicate memories for same fact).
Set initial TTL (e.g., 1 hour for short‑term, 90 days for long‑term unless the memory is marked permanent).
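One way to get idempotent writes is to derive the storage key from the user ID and the normalised fact text, so that writing the same fact twice refreshes one entry instead of creating two. The sketch below uses an in‑process dict as the store; `memory_key`, `save_memory`, and the TTL values in `DEFAULT_TTL` are illustrative assumptions.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed defaults for illustration, not prescriptions.
DEFAULT_TTL = {"short_term": timedelta(hours=1), "long_term": timedelta(days=90)}

store: dict[str, dict] = {}


def memory_key(user_id: str, text: str) -> str:
    """Deterministic key: the same normalised fact for the same user maps to one entry."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(f"{user_id}:{normalised}".encode()).hexdigest()


def save_memory(user_id: str, text: str, tier: str,
                now: Optional[datetime] = None) -> bool:
    """Upsert a memory. Returns True if a new entry was created, False if refreshed."""
    now = now or datetime.now(timezone.utc)
    key = memory_key(user_id, text)
    created = key not in store
    store[key] = {
        "user_id": user_id,
        "text": text,
        "tier": tier,
        "expires_at": now + DEFAULT_TTL[tier],  # a rewrite also refreshes the TTL
    }
    return created


first = save_memory("u1", "User prefers email over SMS", "long_term")
second = save_memory("u1", "user prefers  email over sms", "long_term")
```

Exact‑match keys only catch duplicates that differ in case or whitespace. Paraphrased duplicates need a similarity check on embeddings as well.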

Stage 3: Memory Retrieval

Before each LLM call, the agent queries memory.
Use hybrid search (vector similarity + keyword + metadata filters).
Limit to top‑K (e.g., 5–10 memories) to avoid context overflow.

Stage 4: Memory Consolidation (Compaction)

When short‑term memory exceeds token budget (e.g., >8k tokens), trigger summarisation.
Use an LLM to summarise older turns into a condensed form.
Replace raw turns with summary; keep full raw history in cold storage (optional).
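The compaction step can be sketched as follows. The token estimate of four characters per token is a rough assumption, and `naive_summarise` is a placeholder for the LLM call; `compact` and its parameters are hypothetical names.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude estimate (~4 characters per token); use a real tokenizer in practice."""
    return max(1, len(text) // 4)


def compact(turns: list[str], budget: int, keep_recent: int,
            summarise: Callable[[list[str]], str]) -> list[str]:
    """Replace older turns with one summary when the buffer exceeds the budget."""
    total = sum(estimate_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return turns
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return [f"[summary] {summarise(older)}"] + recent


def naive_summarise(older: list[str]) -> str:
    # Placeholder for an LLM call: keeps the first 40 characters of each turn.
    return " | ".join(t[:40] for t in older)


history = [f"turn {i}: " + "x" * 200 for i in range(10)]
compacted = compact(history, budget=300, keep_recent=3, summarise=naive_summarise)
```

Keeping the most recent turns verbatim preserves exact wording where it matters most, while the summary carries only the gist of older context.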

Stage 5: Memory Expiration

Delete or archive memories past their TTL.
For long‑term memory, implement soft delete (mark as inactive) rather than hard delete.
Periodic batch job to remove stale entries.
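A single pass of such an expiry job might look like this, assuming each stored memory carries a tier and an `expires_at` timestamp. `StoredMemory` and `expire` are hypothetical names, and a real job would run the equivalent query against the database rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StoredMemory:
    text: str
    tier: str             # "short_term" or "long_term"
    expires_at: datetime
    active: bool = True


def expire(memories: list[StoredMemory], now: datetime) -> list[StoredMemory]:
    """One pass of the batch job: hard-delete short-term, soft-delete long-term."""
    kept = []
    for mem in memories:
        if mem.expires_at > now:
            kept.append(mem)
        elif mem.tier == "long_term":
            mem.active = False   # soft delete: the row stays for audit or restore
            kept.append(mem)
        # Expired short-term memories are simply dropped (hard delete).
    return kept


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
memories = [
    StoredMemory("last turn", "short_term", now - timedelta(minutes=5)),
    StoredMemory("old address", "long_term", now - timedelta(days=1)),
    StoredMemory("prefers email", "long_term", now + timedelta(days=30)),
]
remaining = expire(memories, now)
```

Retrieval must then filter on `active`, otherwise soft‑deleted memories keep appearing in results.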

Memory Storage Architectures

Context Window Memory

The simplest form: store memory directly in the LLM context window as text.

Implementation: Append each turn to a list, join as string, prefix to prompt.

Pros: Zero infrastructure, immediate, no retrieval logic.
Cons: Bounded by the context limit (history grows until it no longer fits), no persistence across sessions, no selective retrieval.

When to use: Prototypes, single‑turn tasks, sessions that fit in 8k tokens.
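A minimal sketch of this approach, bounding the buffer by number of turns rather than tokens for simplicity. `ContextWindowMemory` is a hypothetical class name.

```python
from collections import deque


class ContextWindowMemory:
    def __init__(self, max_turns: int = 20):
        # deque(maxlen=...) silently drops the oldest turn once the buffer is full.
        self.turns: deque = deque(maxlen=max_turns)

    def add(self, role: str, content: str) -> None:
        self.turns.append(f"{role}: {content}")

    def build_prompt(self, user_msg: str) -> str:
        # Join the stored turns and prefix them to the new message.
        history = "\n".join(self.turns)
        return f"{history}\nuser: {user_msg}" if history else f"user: {user_msg}"


memory = ContextWindowMemory(max_turns=2)
memory.add("user", "My name is John.")
memory.add("assistant", "Nice to meet you, John.")
memory.add("user", "I ordered a laptop.")
prompt = memory.build_prompt("Where is my order?")
```

With `max_turns=2` the first message has already been evicted, so the prompt no longer contains the user's name. That is the selective‑retrieval limitation in practice.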

Vector Database Memory

Embed each memory chunk and store embeddings. Retrieve by semantic similarity.

Implementation: Use OpenAI text-embedding-3-small or local model (BGE, E5). Store in Pinecone, Weaviate, Qdrant, or pgvector.

Pros: Semantic retrieval, scales to millions of memories, supports filtering.
Cons: Incurs embedding cost, needs tuning (chunk size, overlap), adds retrieval latency (~100ms).

When to use: Long‑term memory, RAG, any system where memories are not strictly ordered by time.
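The store‑and‑search pattern can be illustrated without any external service. Note the loud assumption: `toy_embed` is a bag‑of‑words stand‑in, not a real embedding model, so it only matches shared words rather than meaning. `VectorMemory` is a hypothetical class, not the API of Pinecone, Weaviate, Qdrant, or pgvector.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(text.lower().replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorMemory:
    def __init__(self):
        self.items: list[tuple[Counter, dict]] = []

    def add(self, text: str, **metadata) -> None:
        self.items.append((toy_embed(text), {"text": text, **metadata}))

    def search(self, query: str, user_id: str, limit: int = 3) -> list[dict]:
        q = toy_embed(query)
        # Apply the metadata filter first, then rank the survivors by similarity.
        scored = [(cosine(q, vec), meta) for vec, meta in self.items
                  if meta.get("user_id") == user_id]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [meta for score, meta in scored[:limit] if score > 0]


vm = VectorMemory()
vm.add("User asked about the refund policy", user_id="u1")
vm.add("User prefers email over SMS", user_id="u1")
vm.add("Refund issued for order 1234", user_id="u2")
hits = vm.search("refund policy question", user_id="u1", limit=1)
```

Swapping `toy_embed` for a real model and the list for a vector database changes the quality and scale, but the filter‑then‑rank shape stays the same.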

Relational Database Memory

Store structured memories in tables: user_id, fact_type, fact_value, timestamp, ttl.

Implementation: PostgreSQL, MySQL.

Pros: Exact queries, ACID, easy updates, joins across user data.
Cons: No semantic search (only exact match or LIKE), requires predefined schema.

When to use: User preferences, settings, facts that are key‑value pairs.
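As an illustration, the table above can be created and queried with Python's built‑in `sqlite3` module. SQLite is used here only to keep the example self‑contained; the composite primary key, the `ttl_days` column, and the helper names `set_fact` and `get_fact` are assumptions. The upsert syntax shown is also accepted by PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_facts (
        user_id    TEXT NOT NULL,
        fact_type  TEXT NOT NULL,
        fact_value TEXT NOT NULL,
        timestamp  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        ttl_days   INTEGER,
        PRIMARY KEY (user_id, fact_type)
    )
""")


def set_fact(user_id: str, fact_type: str, fact_value: str, ttl_days: int = 90) -> None:
    # Upsert: a new value for the same (user_id, fact_type) replaces the old one.
    conn.execute(
        """INSERT INTO user_facts (user_id, fact_type, fact_value, ttl_days)
           VALUES (?, ?, ?, ?)
           ON CONFLICT (user_id, fact_type)
           DO UPDATE SET fact_value = excluded.fact_value,
                         timestamp  = CURRENT_TIMESTAMP,
                         ttl_days   = excluded.ttl_days""",
        (user_id, fact_type, fact_value, ttl_days),
    )


def get_fact(user_id: str, fact_type: str):
    row = conn.execute(
        "SELECT fact_value FROM user_facts WHERE user_id = ? AND fact_type = ?",
        (user_id, fact_type),
    ).fetchone()
    return row[0] if row else None


set_fact("alice", "contact_channel", "sms")
set_fact("alice", "contact_channel", "email")   # overwrites, does not duplicate
channel = get_fact("alice", "contact_channel")
```

Making `(user_id, fact_type)` the primary key means an updated preference replaces the stale one, which addresses both the duplicate and the stale‑memory problems for key‑value facts.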

Knowledge Graph Memory

Store memories as nodes (entities) and edges (relations). Example: [User] - [prefers] -> [email_notifications].

Implementation: Neo4j, Amazon Neptune.

Pros: Complex reasoning, relationship traversal, explainable.
Cons: High complexity, non‑trivial to build from raw text.

When to use: Enterprise domains with many relationships (e.g., customer‑product‑region).
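The node‑and‑edge model can be sketched with nested dictionaries before committing to a graph database. `GraphMemory` and the relation names `bought` and `sold_in` are hypothetical; Neo4j or Neptune would express the same traversal in their own query languages.

```python
from collections import defaultdict


class GraphMemory:
    def __init__(self):
        # subject -> relation -> set of objects
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.edges[subject][relation].add(obj)

    def objects(self, subject: str, relation: str) -> set:
        # .get() avoids creating empty entries for unknown subjects.
        return set(self.edges.get(subject, {}).get(relation, set()))

    def traverse(self, start: str, *relations: str) -> set:
        """Follow a chain of relations, e.g. customer -> product -> region."""
        frontier = {start}
        for relation in relations:
            frontier = {o for node in frontier for o in self.objects(node, relation)}
        return frontier


graph = GraphMemory()
graph.add("User", "prefers", "email_notifications")
graph.add("User", "bought", "ProductX")
graph.add("ProductX", "sold_in", "EMEA")
regions = graph.traverse("User", "bought", "sold_in")
```

The traversal answers a two‑hop question ("in which regions are the user's products sold?") that would need a join in a relational store and is hard to express as a vector query.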

Hybrid Memory Architecture

Combine multiple stores: short‑term in Redis, long‑term facts in PostgreSQL, semantic memory in vector DB.

Pros: Each store handles the workload it is best suited to.
Cons: Complex orchestration, consistency challenges.

Memory Retrieval Strategies

Strategy	Description	When to Use
Semantic search	Embed query, retrieve by cosine similarity.	“Find memories about product returns” – meaning matters.
Similarity search	Same as semantic (often used interchangeably).	General retrieval.
Metadata filtering	Filter by user ID, date range, type before vector search.	Multi‑tenant, time‑bounded queries.
Hybrid retrieval	Combine vector similarity + keyword BM25 + metadata.	Highest accuracy.
Context compression	Retrieve many candidates, then compress (summarise) before injecting.	Very long memories (e.g., 50 pages of chat log).
Recency boost	Boost scores of recent memories.	Conversations where recent context matters more.
Importance ranking	Use a learned model or heuristic (e.g., user starred message).	High‑value memories.

Implementation example (pseudo‑code):

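from datetime import datetime, timezone

# embed, vector_db, keyword_search and merge_results are placeholders for your
# embedding model, vector store, keyword index and score-fusion helper.
def retrieve_memory(query, user_id, limit=5):
    # Embed the query
    query_vec = embed(query)

    # Vector search with metadata filter; over-fetch so reranking has candidates
    vector_results = vector_db.search(
        vector=query_vec,
        filter={"user_id": user_id, "type": "conversation"},
        limit=limit * 2,
    )

    # Keyword search (if using hybrid)
    keyword_results = keyword_search(query, filter={"user_id": user_id})

    # Merge and rerank
    merged = merge_results(vector_results, keyword_results, weights=[0.7, 0.3])

    # Apply recency boost (decay factor 0.9 per day); timestamps assumed to be
    # timezone-aware UTC
    now = datetime.now(timezone.utc)
    for mem in merged:
        days_old = max((now - mem.timestamp).days, 0)
        mem.score = mem.score * (0.9 ** days_old)

    # Re-sort: the decay can change the ranking produced by the merge
    merged.sort(key=lambda mem: mem.score, reverse=True)
    return merged[:limit]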
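The `merge_results` helper is left undefined in the pseudo‑code. One plausible reading is weighted score fusion, sketched below. The `Hit` type is hypothetical, and the sketch assumes each retriever already returns scores normalised to the range 0 to 1, which raw BM25 scores are not.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    id: str
    score: float   # assumed to be normalised to 0..1 by each retriever


def merge_results(vector_results: list[Hit], keyword_results: list[Hit],
                  weights=(0.7, 0.3)) -> list[Hit]:
    """Weighted fusion: a memory found by both retrievers sums both contributions."""
    combined: dict[str, float] = {}
    for weight, results in zip(weights, (vector_results, keyword_results)):
        for hit in results:
            combined[hit.id] = combined.get(hit.id, 0.0) + weight * hit.score
    merged = [Hit(id=i, score=s) for i, s in combined.items()]
    merged.sort(key=lambda h: h.score, reverse=True)
    return merged


merged = merge_results(
    [Hit("m1", 0.9), Hit("m2", 0.8)],   # vector search results
    [Hit("m2", 1.0), Hit("m3", 0.5)],   # keyword search results
)
```

Here `m2` ranks first because both retrievers found it (0.7 × 0.8 + 0.3 × 1.0 = 0.86), ahead of `m1` at 0.63. For the recency boost, a decay factor of 0.9 per day means a memory that is 7 days old keeps about 48% of its score (0.9^7 ≈ 0.478).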

Memory and RAG: Differences and Relationships

Many engineers confuse agent memory with Retrieval‑Augmented Generation (RAG). They are related but distinct.

Aspect	RAG	Agent Memory
Purpose	Ground LLM responses in external knowledge base.	Retain conversation history and user‑specific facts.
Data source	Static corpus (documents, FAQs, product specs).	Dynamic interaction history (user messages, agent actions).
Retrieval trigger	Always before each LLM call (or on demand).	On every turn, but can be skipped for non‑memory tasks.
Updates	Batch re‑indexing.	Every turn (immediate update).
Lifetime	Persistent, rarely deleted.	Ephemeral (short‑term) to persistent (long‑term).
Use case	“What does the manual say about X?”	“What did the user tell me five minutes ago?”

How they work together:

RAG provides domain knowledge (semantic memory).
Agent memory provides conversation and user‑specific context.

An agent can query both: first retrieve from RAG for factual grounding, then from short‑term memory for conversation history, and from long‑term memory for user preferences. All results are merged into the context.
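Merging the three sources into one context can be as simple as labelling each section, so the model can tell grounded knowledge from user‑specific facts. `build_context` and the section titles are hypothetical, and the inputs are assumed to be already retrieved and ranked.

```python
def build_context(rag_docs: list[str], recent_turns: list[str],
                  user_facts: list[str], query: str) -> str:
    sections = [
        ("Knowledge base", rag_docs),            # RAG: factual grounding
        ("User profile", user_facts),            # long-term memory
        ("Conversation so far", recent_turns),   # short-term memory
    ]
    parts = []
    for title, items in sections:
        if items:  # skip empty sources rather than emitting empty headings
            parts.append(f"## {title}\n" + "\n".join(f"- {item}" for item in items))
    parts.append(f"## Current question\n{query}")
    return "\n\n".join(parts)


context = build_context(
    rag_docs=["Refunds are accepted within 30 days of delivery."],  # made-up policy text
    recent_turns=[],
    user_facts=["User prefers email over SMS"],
    query="Can I get a refund?",
)
```

Keeping the sources in separate labelled sections also makes it easier to log which source the final answer drew on.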

Common mistake: Using a RAG index instead of memory for conversation history. A corpus that is updated by batch re‑indexing is not designed for sequential, rapidly changing data.

Memory and Planning

Planning depends heavily on memory to avoid repeating work and to adapt to past outcomes.

Memory of past plans – The planner retrieves successful plan templates from long‑term memory for similar goals, reducing LLM calls.
Memory of tool outputs – Working memory stores results of previous steps, used for conditional branching.
Memory of failures – Episodic memory records what went wrong (e.g., “tool X timed out at 5pm”). The replanner can then avoid calling that tool at that time of day.

Example: A research agent that failed to call scrape_website due to anti‑bot protection. It stores that episode. Next time, it retrieves that memory and chooses use_playwright instead.
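The research‑agent example can be sketched as a tool selector that consults stored episodes. `Episode` and `choose_tool` are hypothetical names, and matching on an exact target string is a simplification of what would normally be a similarity search over episodic memory.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    tool: str
    target: str
    outcome: str      # "success" or "failure"
    reason: str = ""


def choose_tool(candidates: list[str], target: str, episodes: list[Episode]) -> str:
    """Pick the first candidate tool with no recorded failure for this target."""
    failed = {e.tool for e in episodes
              if e.target == target and e.outcome == "failure"}
    for tool in candidates:
        if tool not in failed:
            return tool
    # Every option failed before: fall back to the first and let the caller handle errors.
    return candidates[0]


episodes = [Episode("scrape_website", "example.com", "failure", "anti-bot protection")]
tool = choose_tool(["scrape_website", "use_playwright"], "example.com", episodes)
other = choose_tool(["scrape_website", "use_playwright"], "other.org", episodes)
```

The failure is scoped to the target, so the cheaper tool is still tried first on sites where it has not failed.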

Memory and Tool Calling

Tools both read from and write to memory.

Reading (memory as input to tool selection):

The LLM chooses a tool based on memory. Example: memory says “user prefers email”, so the agent calls send_email instead of send_sms.

Writing (tool results stored as memory):

Every tool output should be stored in working memory for immediate use and, if valuable, promoted to long‑term memory.

Example flow:

Tool get_order_status returns {status: "delayed", eta: "+2 days"}.
Working memory stores this for the current plan.
After the turn, the memory extractor promotes the fact “order #123 delayed” to long‑term memory.
Future queries about that order will retrieve this fact without calling the tool again, until the stored fact expires.

Implementation pattern:

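import hashlib
import json

# working_memory, long_term_memory, call_tool and is_stable_result are
# placeholders for your own stores and helpers.
class ToolExecutor:
    def execute(self, tool_name, params):
        # Build a deterministic cache key; hash(params) would fail on a dict
        # and is not stable across processes.
        digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
        cache_key = f"{tool_name}:{digest}"

        # Check memory first: this turn's working memory, then long-term
        if (cached := working_memory.get(cache_key)) is not None:
            return cached
        if (cached := long_term_memory.get(cache_key)) is not None:
            return cached

        result = call_tool(tool_name, params)

        # Store result in working memory for current turn
        working_memory.set(cache_key, result)

        # Optionally promote to long-term if result is stable (expires after 24 hours)
        if is_stable_result(result):
            long_term_memory.set(cache_key, result, ttl=86400)

        return result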

Memory Architectures in Popular Frameworks

Framework	Memory Model	Storage	Strengths	Limitations
LangGraph	State + checkpoints + optional vector store.	Any (Redis, Postgres, in‑memory).	Fine‑grained control; checkpointing includes memory.	No built‑in summarisation; you implement.
CrewAI	`ShortTermMemory` (ChromaDB‑backed RAG), `LongTermMemory` (SQLite), `EntityMemory`.	ChromaDB, SQLite.	Good out‑of‑box for short/long.	Entity memory is basic; no built‑in compaction.
AutoGen	`Memory` protocol; `ListMemory`, `ChromaDBVectorMemory`.	In‑memory, Chroma, Redis.	Extensible.	No native checkpointing of memory across sessions.
OpenAI Agents SDK	Session memory (conversation turns), e.g. `SQLiteSession`.	In‑memory by default; can persist to a SQLite file.	Simple.	No built‑in long‑term semantic memory; cross‑session recall is limited to replaying a stored session.
Semantic Kernel	`MemoryStore` interface; `VolatileMemoryStore`, `AzureCognitiveSearch`.	Azure Cognitive Search, Chroma, Qdrant.	Strong enterprise integration.	Steep learning curve for custom memory.

Recommendation: For production, use LangGraph with a persistent checkpoint store (PostgreSQL) for state and a separate vector DB (pgvector or Qdrant) for long‑term semantic memory. Implement memory compaction as a separate node in the graph.

Production Memory Challenges

Challenge	Description	Mitigation
Context window limits	Even short‑term memory can exceed the model’s context window (8k–128k tokens, depending on the model).	Summarise older turns; store only recent N turns raw.
Retrieval accuracy	Irrelevant memories pollute context, causing hallucinations.	Use hybrid search + reranking; set relevance threshold.
Stale memory	Outdated facts (e.g., “user’s old address”).	TTL; versioning; user‑confirm updates.
Duplicate memory	Same fact stored multiple times.	Deduplication via embedding similarity + exact match.
Privacy concerns	Storing PII in memory across sessions.	Auto‑redact; encrypt at rest; honour delete requests.
Cost management	Embedding every turn, storing many memories.	Batch embedding; prune low‑importance memories.
Latency	Retrieval adds 50–200ms.	Cache frequent queries; use faster vector DB (Qdrant).
Memory poisoning	User injects malicious facts (e.g., “I am admin”).	Validate and sanitise facts before storage.

Memory Evaluation

You cannot optimise memory without measuring. Track these metrics offline and online.

Metric	Definition	How to Measure
Precision@K	% of retrieved memories relevant to the query (top K).	Human annotators; automated with LLM as judge.
Recall@K	% of all relevant memories retrieved in top K.	Requires full relevant set (labelled).
Memory freshness	Average age of retrieved memories (lower is better for recent tasks).	Compute from timestamps.
Context relevance	% of injected memory tokens actually used by the LLM in reasoning.	Analyse attention or log whether memory appears in final answer.
User satisfaction	“Did the agent remember your preferences?” (post‑turn survey).	User feedback.
Memory overhead	Total memory storage size per user (bytes).	Database metrics.
Retrieval latency	p50, p95, p99 latency of memory retrieval.	OpenTelemetry spans.

Example evaluation dataset:

Query	Expected relevant memory	Retrieved (top 3)	Precision@3
“Book a flight to Paris”	`preferred_airline=Delta`	`preferred_airline=Delta`, `last_trip=London`, `seat_preference=aisle`	0.33 (only one correct)

Improvement loop: Low precision → improve embedding model or add metadata filters. Low recall → increase K or reduce similarity threshold.
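The two retrieval metrics can be computed directly from a labelled example such as the row above. `precision_at_k` and `recall_at_k` are hypothetical helper names; this version divides precision by the number of results actually returned, which equals K whenever at least K results come back.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved memories that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for m in top_k if m in relevant) / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant memories that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for m in set(retrieved[:k]) if m in relevant) / len(relevant)


retrieved = ["preferred_airline=Delta", "last_trip=London", "seat_preference=aisle"]
relevant = {"preferred_airline=Delta"}
p3 = precision_at_k(retrieved, relevant, k=3)   # 1 relevant out of 3 retrieved
r3 = recall_at_k(retrieved, relevant, k=3)      # the only relevant memory was found
```

For this row precision is one third while recall is 1.0, which points at the first branch of the improvement loop: the retriever finds the right memory but returns too much alongside it.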

Best Practices

Separate short‑term and long‑term memory – Use different stores and retention policies. Short‑term lives in Redis with a 1‑hour TTL; long‑term lives in PostgreSQL with retention measured in months or years.
Always include metadata – For each memory, store user_id, session_id, timestamp, type (preference, fact, conversation), importance (0‑1). Enables filtering.
Implement memory summarisation (compaction) – When short‑term memory exceeds 80% of context window, trigger an LLM to summarise old turns into a condensed paragraph.
Use hybrid retrieval – Semantic search alone misses exact keywords. Combine with BM25 or full‑text search.
Set TTLs aggressively – Most memories become stale after weeks. Default long‑term TTL = 90 days unless explicitly marked as permanent.
Cache frequent retrievals – If the same query repeats (e.g., “user’s shipping address”), cache the result for the session.
Validate facts before storing – Use a small LLM to check “Is this fact likely true and non‑contradictory?” before writing to long‑term memory.
Monitor retrieval quality – Log retrieved_memories and actually_used_in_response to compute relevance.
Encrypt memory at rest – Especially for PII. Use AES‑256 with per‑user keys.
Provide a memory reset mechanism – Users should be able to clear their long‑term memory (GDPR, privacy).
Treat memory as part of state – When you checkpoint an agent’s execution, include the current memory snapshot. Otherwise, replay may be inconsistent.
Test memory with simulated long conversations – Automatically generate 100‑turn conversations and measure retrieval accuracy and context growth.

Common Memory Mistakes

Mistake	Consequence	Fix
Storing everything	Explosive growth, high cost, slow retrieval.	Prune low‑importance memories; summarise; set TTL.
No memory expiration	Old, incorrect facts persist forever.	Always assign TTL; run periodic expiry jobs.
Poor chunking strategy	Embedding whole conversation turns mixes unrelated topics.	Chunk at sentence or paragraph level; overlap.
Ignoring privacy	Storing sensitive data without consent.	Redact PII before storage; offer memory deletion.
Mixing task context with user profiles	Temporary working memory pollutes long‑term memory.	Keep separate stores; promote only explicitly confirmed facts.
No memory retrieval for planning	Planner repeats same mistakes.	Include episodic memory in planning phase.
Retrieving too many memories	Context overflow, LLM confused.	Limit to K=5–10; compress with summarisation.
No fallback when retrieval fails	Agent hallucinates because no memory found.	Return “no relevant memory” as a signal; prompt LLM to ask user.

Case Study: Enterprise Knowledge Assistant

Scenario: A large company deploys an internal agent for employees to ask questions about HR policies, IT support, and project documentation. The agent needs to remember each employee’s role, past requests, and preferences.

Memory Architecture

Storage layers:

Short‑term (Redis) – Raw conversation history of the current session. TTL = 1 hour after last message.
Long‑term structured (PostgreSQL) – Table user_facts: (user_id, fact_key, fact_value, updated_at). Example: (alice, role, "Software Engineer").
Episodic (Qdrant) – Embeddings of past question‑answer pairs, plus resolution steps. Used to retrieve “When Alice asked about VPN access last month, what did we do?”
Cache (Redis) – Results of frequent queries like “What is the IT support phone number?” for 1 day.

Retrieval Workflow

For a query from Alice: “I need to request a new laptop – what’s the process?”

Short‑term memory retrieves last 5 turns (Alice already said her current laptop is slow).
Long‑term structured fetches role=Software Engineer, department=Engineering.
Episodic searches past requests: “laptop request process” → returns a previous resolution: “Fill form X, requires manager approval”.
Cache returns “IT support contact: [email protected]”.

These are merged and injected into the LLM context. The agent answers: “As a Software Engineer, you can request a new laptop via form X (I see you asked about a slow laptop earlier). Your manager will need to approve. Would you like me to open the form?”

Optimisation Strategy

Importance scoring – Episodic memories that led to successful resolutions get higher importance (boost in retrieval).
Periodic compaction – Every night, summarise conversations older than 24 hours into a per‑user weekly digest, stored in episodic memory.
Cost control – Embeddings generated only for turns that contain new information (detected by similarity to last 10 memories).
Privacy – After 90 days, episodic memories are anonymised (user ID removed) and kept for aggregate analytics only.

Result: The assistant resolved 78% of queries without human escalation (up from 52% without memory). Retrieval latency: 85ms at p95.

FAQ

1. Is memory the same as RAG?
No. RAG retrieves from a static knowledge base. Memory stores dynamic, user‑specific, time‑ordered interactions. They complement each other.

2. How much memory should an agent store per user?
For short‑term: last 10–20 turns or 8k tokens. For long‑term: thousands of facts is fine, but retrieve only top 5–10 most relevant per query.

3. When should memories expire?
Short‑term: minutes to hours. Long‑term: 90 days default, but allow permanent for critical preferences (e.g., “user is allergic to peanuts”). Episodic: 30–90 days.

4. Can memory improve planning?
Absolutely. By retrieving past successful plans (episodic memory), the planner can reuse them, saving LLM calls and improving reliability.

5. Which vector database should I choose for agent memory?
Qdrant or pgvector for self‑hosted; Pinecone for managed. Qdrant offers built‑in filtering and fast hybrid search. pgvector is simplest if you already use PostgreSQL.

6. How do I prevent memory poisoning (user injecting false facts)?
Never store unverified facts. Use a guardrail LLM to check fact plausibility. For critical facts (e.g., “I am an admin”), require external authentication.

7. Do I need to embed every memory?
Only if you intend to retrieve it semantically. For structured facts (e.g., role=engineer), use exact key‑value stores, not vectors.

8. How does memory work in multi‑agent systems?
Each agent may have its own memory, plus a shared memory for coordination. Shared memory must be carefully permissioned to avoid conflicts.

9. What is the cost of memory at scale?
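Embedding: roughly $0.00002–$0.0001 per 1k tokens, depending on the OpenAI embedding model. Storage: vector DB ~$0.10–$1 per GB per month. For 10k users with 100 facts each, the raw text is ~100MB; 1,536‑dimensional float32 embeddings add about 6KB per fact, or roughly 6GB in total, which is still inexpensive. The bigger cost is retrieval latency and LLM context tokens.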

10. Can I use the same memory store for multiple agents?
Yes, but each agent should use a namespace or filter by agent_id to avoid cross‑agent contamination.

11. How do I handle memory across distributed agent instances?
Use a centralised store (Redis, PostgreSQL) that all instances share. Serialise concurrent updates, for example with row‑level locking in PostgreSQL or atomic operations in Redis.

12. What is the role of summarisation in memory?
Summarisation (compaction) prevents context from growing unbounded. It condenses many turns into a few sentences, preserving key information while discarding noise.

13. How do I test my memory system?
Create a test suite with long synthetic conversations. After each turn, query the agent and verify that it correctly recalls facts from earlier turns. Measure recall.

14. Is working memory just a variable?
Essentially, yes. But in a distributed agent, working memory must be scoped to the current turn and cleared afterward. Use a per‑request context object.

15. Should I store tool outputs as memory?
Sometimes. Store stable, reusable tool outputs (e.g., “exchange rate USD/EUR = 0.92”) with a short TTL. Do not store transient outputs (e.g., “random joke”).

Continue Your Journey

Memory is foundational to intelligent agents. Next, see how memory interacts with other components:

Planning – Agent Planning (memory for plan reuse)
Tool Calling – Tool Calling (caching tool results)
Lifecycle – Agent Lifecycle (where memory fits in execution)
Frameworks – LangGraph Guide (implementing memory with checkpoints)
Evaluation – Agent Evaluation (measuring memory effectiveness)

Or return to the Agent Learning Path to plan your next topic.

This article is part of the AgentDevPro Production Agent Engineering Handbook. Updated for Q2 2026.
