
AI Agent Memory Architecture: How to Diagram Agent Memory Systems (2026)

A deep dive into AI agent memory architecture — the four memory types (working, episodic, semantic, procedural), storage backends, retrieval strategies, and open-source tools like Mem0, Zep, and LangMem — with prompt templates for generating memory diagrams.

Ryan, Senior AI Engineer

The single biggest gap between a demo AI agent and a production AI agent is memory. A demo agent starts fresh every conversation — impressive in a five-minute walkthrough, useless in a week-long project. A production agent knows what happened yesterday, remembers your preferences, recalls domain facts it learned three months ago, and applies skills it has practiced hundreds of times. That difference is entirely determined by how the agent's memory architecture is designed.

In 2026, memory has become the central engineering problem for long-running agent systems. Teams building customer-support agents, coding assistants, research pipelines, and autonomous workflows all confront the same challenge: the context window is finite, sessions end, and knowledge must persist somewhere outside the model. The frameworks and tooling for solving this — Mem0, Zep, LangMem, custom vector store pipelines — have matured rapidly, but the architectural patterns underlying them are still poorly understood. Diagramming your memory architecture before you build it is the fastest way to surface design conflicts, align your team, and avoid the class of subtle bugs that emerge when retrieval logic and storage logic are not clearly separated.

The four types of agent memory

Cognitive science commonly distinguishes four memory systems in humans: working, episodic, semantic, and procedural. The same taxonomy maps cleanly onto AI agent architectures and provides a useful vocabulary for diagramming what your system actually stores and retrieves.

Working memory (in-context)

Working memory is everything currently loaded into the model's context window: the system prompt, the conversation history, retrieved documents, tool outputs, and the current user message. It is the agent's immediate scratch pad: fast to access because no retrieval round trip is needed, and bounded by the model's context limit (typically 128k–1M tokens in 2026 frontier models).

Working memory is not persistent. When the context window ends, working memory is gone unless something explicitly extracts and stores the relevant pieces. In diagrams, working memory is the content flowing into the model inference box — show it as a labeled buffer fed by all upstream retrieval systems. Annotating token budgets per slot (e.g., "system prompt: 2k tokens, retrieved memories: 8k tokens, history: 4k tokens") makes context pressure visible.
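
The per-slot budgets can also be enforced in code. The following is a minimal sketch, assuming a crude whitespace word count in place of a real tokenizer; the slot names, budgets, and function names are hypothetical and simply mirror the example annotation above.

```python
# Hypothetical per-slot budgets, mirroring the example annotation in the text.
SLOT_BUDGETS = {"system_prompt": 2000, "retrieved_memories": 8000, "history": 4000}


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_slot(items: list[str], budget: int) -> list[str]:
    """Keep items in order until the next one would exceed the slot budget."""
    kept, used = [], 0
    for item in items:
        cost = count_tokens(item)
        if used + cost > budget:
            break
        kept.append(item)
        used += cost
    return kept


def assemble_context(slots: dict[str, list[str]]) -> dict[str, list[str]]:
    """Trim every slot to its own budget; unknown slots get a budget of zero."""
    return {
        name: fit_slot(items, SLOT_BUDGETS.get(name, 0))
        for name, items in slots.items()
    }
```

Giving each slot its own budget means a long conversation history cannot crowd out retrieved memories, which is the kind of context pressure the diagram annotation is meant to expose.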

Episodic memory (specific past experiences)

Episodic memory stores records of specific events — individual conversations, completed tasks, past interactions with a particular user, or errors the agent encountered and recovered from. Unlike working memory, episodic memory persists across sessions and can be retrieved selectively when relevant.

A customer-support agent with episodic memory can recall that this specific user reported a billing issue three weeks ago and received a $10 credit. A coding agent can recall that this specific repository uses a non-standard test runner that tripped it up last Tuesday. Episodic memory is typically stored in a combination of key-value stores (for fast lookup by session or user ID) and vector stores (for semantic similarity retrieval when the agent can't look up by exact key). In diagrams, show episodic memory as a timestamped event log with both an indexed lookup path and a vector search path.
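
The two access paths can be illustrated with a small in-memory store. This is a sketch under stated assumptions: a Python dict stands in for the key-value store, a brute-force cosine scan stands in for a vector index, and the embeddings are assumed to come from an embedding model that is not shown. All class and method names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Episode:
    user_id: str
    timestamp: datetime
    text: str
    embedding: list[float]  # assumed to come from an embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class EpisodicStore:
    episodes: list[Episode] = field(default_factory=list)
    by_user: dict[str, list[Episode]] = field(default_factory=dict)

    def append(self, ep: Episode) -> None:
        """Record an event in the log and in the per-user index."""
        self.episodes.append(ep)
        self.by_user.setdefault(ep.user_id, []).append(ep)

    def lookup(self, user_id: str) -> list[Episode]:
        """Indexed path: exact match on user ID, newest first."""
        return sorted(self.by_user.get(user_id, []), key=lambda e: e.timestamp, reverse=True)

    def search(self, query: list[float], k: int = 3) -> list[Episode]:
        """Vector path: brute-force cosine similarity over all episodes."""
        ranked = sorted(self.episodes, key=lambda e: cosine(query, e.embedding), reverse=True)
        return ranked[:k]
```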

Semantic memory (facts and knowledge)

Semantic memory stores general facts, domain knowledge, and learned beliefs — things that are true independent of any specific event. Examples: a user's stated preferences (“prefers TypeScript over Python”), domain facts the agent has been told (“the API rate limit is 100 requests per minute”), or synthesized knowledge extracted from many past conversations (“this customer segment consistently asks about HIPAA compliance before purchasing”).

Semantic memory is the most powerful long-term store because it accumulates structured knowledge over time. It is typically stored in vector databases (for semantic retrieval) or knowledge graphs (for relational queries). The key design challenge is extraction: how does the agent decide what is worth storing as a semantic memory versus what can be discarded? In diagrams, show an extraction step downstream of the agent's response that classifies and writes important facts to the semantic store.

Procedural memory (how-to knowledge, tools, and skills)

Procedural memory encodes how to do things — workflows, tool usage patterns, decision heuristics, and learned skills. In AI agents, procedural memory is mostly encoded in the system prompt and tool definitions, but it can also be stored externally and retrieved dynamically: a library of reusable prompt chains, a registry of MCP tool schemas, or a collection of few-shot examples that teach the agent how to handle specific task types.

Dynamic procedural memory is an emerging pattern in 2026: rather than hardcoding all tool definitions into every system prompt, agents retrieve only the tools relevant to the current task. This keeps context windows lean and allows new skills to be added without redeploying the agent. In diagrams, show procedural memory as a tool/skill registry with a retrieval arrow into the working memory context assembly layer.
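
As a rough illustration of dynamic tool retrieval, the sketch below ranks tools by tag overlap with the current task. The registry contents, the tag scheme, and the `select_tools` function are all hypothetical; a production system might rank by embedding similarity instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    tags: frozenset[str]


# Hypothetical registry; in practice this might live in a database.
REGISTRY = [
    Tool("run_tests", "Run the project's test suite", frozenset({"coding", "testing"})),
    Tool("search_docs", "Search product documentation", frozenset({"support", "research"})),
    Tool("issue_refund", "Issue a refund to a customer", frozenset({"support", "billing"})),
]


def select_tools(task_tags: set[str], limit: int = 2) -> list[Tool]:
    """Return up to `limit` tools, ranked by how many tags they share with the task."""
    scored = [(len(tool.tags & task_tags), tool) for tool in REGISTRY]
    relevant = [(score, tool) for score, tool in scored if score > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in relevant[:limit]]
```

Only the selected tools' definitions would then be injected into the context, so adding a new entry to the registry does not grow every prompt.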

Agent memory architecture layers

A complete agent memory architecture has three distinct processing layers that must be designed and diagrammed separately: the read path, the write path, and the consolidation process.

The read path (retrieval at the start of each agent turn)

Before the agent model runs, the read path assembles the working memory context from all available memory stores. A typical read path for a production agent runs three lookups in parallel:

  • Episodic lookup: fetch recent conversation summaries for this user/session by ID from a key-value store — millisecond latency, exact match.
  • Semantic retrieval: embed the current query and retrieve the top-k most semantically relevant facts from the vector store — tens of milliseconds, fuzzy match.
  • Procedural lookup: retrieve relevant tool definitions and few-shot examples from the skill registry based on the task category.

The assembled context is then ranked, trimmed to fit the token budget, and injected into the system prompt or as separate context blocks before the model call. In diagrams, show the read path as a parallel fan-out from a “context assembler” node to each memory store, with a merge/rank step before the model inference box.
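
The parallel fan-out can be sketched with `asyncio.gather`. The three lookup coroutines below are placeholders that return canned strings; in a real system each would call a key-value store, a vector store, or a skill registry. The function names are assumptions made for illustration.

```python
import asyncio


async def episodic_lookup(user_id: str) -> list[str]:
    # Placeholder for a key-value fetch by user or session ID.
    return [f"summary of last session for {user_id}"]


async def semantic_retrieval(query: str) -> list[str]:
    # Placeholder for embedding the query and running a top-k vector search.
    return [f"fact related to: {query}"]


async def procedural_lookup(task_category: str) -> list[str]:
    # Placeholder for a skill-registry query.
    return [f"tool schema for {task_category}"]


async def assemble_read_path(user_id: str, query: str, task_category: str) -> dict[str, list[str]]:
    """Query all three stores concurrently and group the results by source."""
    episodic, semantic, procedural = await asyncio.gather(
        episodic_lookup(user_id),
        semantic_retrieval(query),
        procedural_lookup(task_category),
    )
    return {"episodic": episodic, "semantic": semantic, "procedural": procedural}
```

Because the lookups run concurrently, the read path's latency is bounded by the slowest store rather than the sum of all three. Ranking and trimming to the token budget would follow as a separate step.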

The write path (extraction and storage after each agent turn)

After the agent produces a response, the write path extracts valuable information and persists it to the appropriate memory store. This is typically done asynchronously — the user receives the response immediately while the write path runs in the background. The write path includes:

  • Episode recording: append the conversation turn (user message + agent response) to the session log in the episodic store.
  • Fact extraction: run a lightweight extraction model or prompt that identifies new facts, preferences, or corrections stated in the conversation and writes them to the semantic store.
  • Importance scoring: score each candidate memory by novelty and relevance before writing, to avoid polluting the store with redundant or low-value entries.
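
The novelty half of importance scoring can be sketched as a duplicate check before each write. This example assumes token-set (Jaccard) overlap as the similarity measure, which is a simplification; a real write path would more likely compare embeddings and would also score relevance. The function names and the 0.5 threshold are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def novelty(candidate: str, existing: list[str]) -> float:
    """1.0 for a completely new memory, 0.0 for an exact duplicate."""
    return 1.0 - max((jaccard(candidate, m) for m in existing), default=0.0)


def write_if_novel(candidate: str, store: list[str], threshold: float = 0.5) -> bool:
    """Append the candidate only if it differs enough from stored memories."""
    if novelty(candidate, store) >= threshold:
        store.append(candidate)
        return True
    return False
```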

The consolidation process

Consolidation is a background process — analogous to sleep in human memory — that periodically reorganizes and compresses stored memories. Common consolidation operations include summarizing long episode logs into compact session summaries, merging duplicate semantic facts, promoting high-frequency short-term memories to long-term storage, and applying forgetting curves to decay low-importance memories over time. Consolidation is typically scheduled as a batch job (nightly, or after N new episodes). In diagrams, show consolidation as a separate async process with read-modify-write arrows to each memory store, distinct from the agent's real-time read/write paths.

Memory storage backends

Vector stores for semantic search

Vector stores are the backbone of semantic and episodic memory retrieval. They store embedding vectors alongside metadata and support approximate nearest-neighbor search. The leading options in 2026:

  • pgvector — a PostgreSQL extension that adds vector search to an existing Postgres database. Excellent for teams that already run Postgres and want to avoid a separate vector database. Supports HNSW and IVFFlat indexes. Best for moderate scale (up to tens of millions of vectors).
  • Pinecone — a fully managed vector database with a simple API, serverless pricing, and strong performance at large scale. No infrastructure to manage. Best when you need to scale to hundreds of millions of vectors without ops overhead.
  • Qdrant — an open-source vector database with rich filtering, payload indexing, and a Rust core for high throughput. Self-hostable or cloud-managed. Best for teams that need fine-grained filtering on metadata alongside vector search (e.g., “retrieve memories for this user from the last 30 days”).

Key-value stores for episodic lookup

Exact-key lookups, such as fetching a user's session history by user ID or retrieving a conversation by session ID, are best served by key-value stores. Redis is the default choice for hot episodic data (recent sessions, active user state) due to its sub-millisecond latency and TTL support. DynamoDB is a natural fit for serverless architectures on AWS, offering managed scaling and single-digit-millisecond reads when keys are well distributed. For agents that need both fast lookup and durable long-term storage, a common pattern is Redis as a hot cache backed by DynamoDB or Postgres as the source of truth.
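
The hot-cache pattern can be sketched as a read-through cache with a TTL. In this illustration plain dicts stand in for both Redis and the durable store, and the class name, the `misses` counter, and the injectable clock are assumptions made so the behavior is easy to test.

```python
import time
from typing import Callable, Optional


class ReadThroughCache:
    """Hot cache with a TTL in front of a slower, durable source of truth."""

    def __init__(
        self,
        source: dict[str, str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.source = source  # stand-in for DynamoDB or Postgres
        self.ttl = ttl_seconds
        self.clock = clock
        self.hot: dict[str, tuple[float, str]] = {}  # stand-in for Redis
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        entry = self.hot.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]  # cache hit that has not yet expired
        self.misses += 1
        value = self.source.get(key)  # fall back to the source of truth
        if value is not None:
            self.hot[key] = (self.clock() + self.ttl, value)
        return value
```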

Knowledge graphs for relational memory

When semantic memory involves complex relationships between entities — people, organizations, products, events — knowledge graphs outperform flat vector stores for relational queries. Tools like Neo4j or Amazon Neptune let the agent traverse relationships (“what projects has this user worked on, and who else worked on them?”) that a vector store cannot answer efficiently. GraphRAG architectures combine a knowledge graph with a vector store: the graph answers structural relationship queries, the vector store answers semantic similarity queries, and the agent's retrieval layer queries both in parallel.

Prompt templates for agent memory diagrams

Full four-type memory architecture

"AI agent memory architecture with four memory types. Draw a production agent system with a central LLM inference node. On the read path (left side, flowing into the model): (1) Working Memory buffer — assembled from retrieved context, annotated with a 128k token limit; (2) Episodic Memory store — Redis for hot session lookup by user ID, backed by Postgres for long-term session logs; (3) Semantic Memory store — Qdrant vector database, queried by embedding similarity; (4) Procedural Memory store — a tool/skill registry queried by task category. Show a Context Assembler node that fans out to all three long-term stores in parallel, merges results, ranks by relevance, and injects into the Working Memory buffer before the model call. On the write path (right side, flowing out of the model): show an async Extraction Service that reads the agent's response, scores memory candidates for importance, and writes to the Episodic store (session append) and Semantic store (fact upsert). Add a separate Consolidation Service that runs nightly, summarizing episode logs and decaying low-importance semantic memories."

Mem0-based memory layer for a customer support agent

"Customer support agent with Mem0 memory layer. A user sends a support request via a React chat frontend. The request hits a FastAPI backend that calls Mem0's add() and search() APIs. Before each agent turn, search() retrieves the top-5 relevant memories for this user ID — past issues, stated preferences, account facts — and injects them into the system prompt under a 'User Memory' section. The agent (GPT-4o) generates a response. After each turn, add() extracts and stores new facts from the conversation. Mem0 internally uses a vector store (Qdrant) for semantic retrieval and a graph store (Neo4j) for entity relationships. Show the Mem0 service as a box between the FastAPI backend and the memory stores, with labeled API calls: search() on the read path and add() on the write path. Draw the user's memory as a persistent store that grows across sessions."

LangMem procedural memory with dynamic tool retrieval

"LangMem-powered coding agent with dynamic procedural memory. A LangGraph agent node receives a coding task. Before calling the LLM, a Memory Manager node (using LangMem) retrieves: (1) user preferences from in-context storage — language, framework, coding style; (2) relevant procedural memories from the semantic store — past solutions to similar problems, error patterns the agent learned to avoid, tool usage heuristics; (3) the top-3 most relevant tool schemas from a Tool Registry (stored as embeddings in pgvector), so only applicable tools are injected into the context. After the agent responds, LangMem runs background extraction to update procedural memories based on whether the solution worked. Show LangMem as a middleware layer between the LangGraph state graph and the LLM, with arrows for retrieve and store operations to each memory type. Label the background extraction step as async."

Memory management patterns

Memory consolidation

Raw episode logs grow without bound. Consolidation compresses them into durable summaries: a week of daily session logs becomes a single “user profile update,” and a month of product usage episodes becomes a structured preference record. Consolidation is typically a scheduled LLM call that reads a batch of recent episodes and writes a compressed summary to the semantic store, then deletes or archives the raw episodes. The key design decision is frequency: too infrequent and the raw store grows large; too frequent and consolidation costs outweigh the savings.
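
A consolidation job of this shape can be sketched as a batch loop. The `summarize` callable is a stand-in for the scheduled LLM call, and the in-memory lists stand in for the raw episode store, the semantic store, and the archive; all names and the batch size are hypothetical.

```python
from typing import Callable


def consolidate(
    raw_episodes: list[str],
    summaries: list[str],
    archive: list[str],
    summarize: Callable[[list[str]], str],
    batch_size: int = 5,
) -> int:
    """Summarize full batches of raw episodes, then archive them.

    Returns the number of batches processed. Episodes that do not yet
    fill a batch stay in the raw store until the next run.
    """
    batches = 0
    while len(raw_episodes) >= batch_size:
        batch = raw_episodes[:batch_size]
        summaries.append(summarize(batch))  # write the compressed summary
        archive.extend(batch)               # keep the raw episodes recoverable
        del raw_episodes[:batch_size]       # remove them from the hot store
        batches += 1
    return batches
```

Here `batch_size` plays the role of the frequency decision: a larger batch means fewer summarization calls but a larger raw store between runs.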

Forgetting curves and importance scoring

Not all memories are equally valuable over time. Importance scoring assigns each memory a numerical score based on recency, frequency of retrieval, and explicit signals (e.g., the user confirmed this fact was important). A forgetting curve decays importance scores over time for memories that are never retrieved, eventually dropping them below a deletion threshold. This mirrors the Ebbinghaus forgetting curve in human cognition and prevents the semantic store from accumulating stale, low-value entries indefinitely. Mem0 and Zep both implement variants of this pattern; custom implementations typically encode importance as a metadata field on each memory record and run a nightly decay job.
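
One way to implement the decay job is exponential decay with a half-life, which is a common simplification rather than a faithful model of the Ebbinghaus curve. In this sketch the 30-day half-life, the 0.1 deletion threshold, and the record layout are all assumed values chosen for illustration.

```python
import math


def decayed_importance(base: float, days_since_retrieval: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: importance halves every `half_life_days` without a retrieval."""
    return base * math.pow(0.5, days_since_retrieval / half_life_days)


def prune(memories: dict[str, tuple[float, float]], threshold: float = 0.1) -> dict[str, tuple[float, float]]:
    """Keep memories whose decayed importance is at or above the threshold.

    `memories` maps a memory ID to (base importance, days since last retrieval).
    """
    return {
        mem_id: (base, days)
        for mem_id, (base, days) in memories.items()
        if decayed_importance(base, days) >= threshold
    }
```

Resetting `days_since_retrieval` to zero whenever a memory is retrieved gives frequently used memories a longer effective lifetime.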

Memory retrieval strategies

Retrieval strategy determines which memories surface for a given query. The three dominant strategies in production systems are:

  • Recency-weighted similarity: combine vector similarity score with a recency decay factor so that recent memories rank higher than older memories of equal semantic similarity. Best for conversational agents where recent context is almost always more relevant.
  • Importance-weighted retrieval: multiply similarity score by the memory's importance score. Surfaces high-signal memories even when they are only moderately similar to the current query.
  • Hybrid sparse-dense retrieval: combine BM25 keyword search (sparse) with vector similarity (dense) using reciprocal rank fusion. Best for memories that contain specific identifiers, names, or codes that may not be well-represented in embedding space.
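
Two of these strategies reduce to short scoring functions. The sketch below shows a recency-weighted score using an assumed 14-day half-life, and reciprocal rank fusion with the conventional constant k = 60. The inputs are assumed to be ranked lists of memory IDs produced by separate BM25 and vector searches, which are not shown.

```python
def recency_weighted(similarity: float, age_days: float, half_life_days: float = 14.0) -> float:
    """Scale a similarity score by an exponential recency factor."""
    return similarity * 0.5 ** (age_days / half_life_days)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)
```

Reciprocal rank fusion uses only rank positions, so the BM25 and vector scores never need to be normalized onto a common scale.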

Open-source memory solutions

Mem0

Mem0 is one of the most widely adopted open-source agent memory libraries as of mid-2026. It provides a simple add() / search() / get_all() API that abstracts over a vector store (Qdrant, Chroma, or Pinecone) and optionally a graph store (Neo4j) for entity relationships. Mem0 handles extraction, deduplication, and importance scoring internally using a small extraction LLM call after each conversation turn. It supports multi-level memory scoping: per-user, per-agent, and per-session. A managed cloud offering (Mem0 Platform) removes infrastructure management for teams that prefer not to self-host.

Zep

Zep is designed specifically for long-term conversational memory with a strong emphasis on temporal reasoning. It automatically summarizes conversation history using a sliding-window summarizer, extracts structured facts into a knowledge graph, and exposes a retrieval API optimized for chat agents. Zep's key differentiator is its temporal awareness: it tracks when facts were learned and can answer questions like “what did the user say about their budget in the conversation two weeks ago?” without a full vector similarity scan. Best for customer-facing chatbots and support agents that must maintain accurate, time-indexed conversation history at scale.

LangMem

LangMem is LangChain's native memory library, tightly integrated with LangGraph's state management model. It provides three memory scopes: in-context (current thread), cross-thread (shared across sessions for a user), and namespace-scoped (shared across all users for a given agent), all backed by LangGraph's storage layer. LangMem is the natural choice for teams already building on LangGraph who want memory that integrates directly with graph state and checkpointing. Its background extraction runs as a LangGraph node triggered at the end of each turn.

Custom vector database implementations

Teams with specialized requirements often build custom memory layers directly on pgvector or Qdrant. Custom implementations give full control over extraction logic, importance scoring, retrieval ranking, and consolidation schedules. The tradeoff is engineering investment: building a robust extraction pipeline, deduplication logic, and decay system from scratch typically requires 2–4 weeks of focused effort. A pragmatic hybrid is to use Mem0 or Zep for the memory API surface while replacing the default storage backend with your own pgvector or Qdrant instance.

Frequently asked questions

What is the difference between agent memory and RAG?

RAG (retrieval-augmented generation) retrieves from a static external knowledge base — documents, PDFs, database records — that is written by humans and does not change based on agent behavior. Agent memory is dynamic: it is written by the agent itself from lived experience, grows with every interaction, and is scoped to a specific user or session rather than a global corpus. In practice, most production agents use both: RAG for domain knowledge and documentation, and an agent memory layer for user-specific context and learned preferences. Your architecture diagram should show these as two distinct retrieval paths feeding into the context assembler.

How do I decide which memory type to use for a given piece of information?

A practical decision rule: if the information is tied to a specific event or conversation (when it happened matters), store it as episodic memory. If it is a general fact that is true regardless of when it was learned (a user's preferred language, a domain constraint, a product limitation), store it as semantic memory. If it describes how to accomplish a task or use a tool, store it as procedural memory. Working memory is reserved for information that only needs to last for the current agent turn. When in doubt, start with episodic storage and promote frequently accessed facts to semantic memory during consolidation.
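
The decision rule can be written down as a small function. This is only a sketch: the three boolean inputs are hypothetical labels that an extraction step would have to produce, and `tied_to_event=None` is used here to represent the "in doubt" case.

```python
from enum import Enum
from typing import Optional


class MemoryType(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"


def choose_memory_type(
    needed_beyond_turn: bool,
    describes_how_to: bool,
    tied_to_event: Optional[bool],
) -> MemoryType:
    """Apply the decision rule; `tied_to_event=None` means "not sure"."""
    if not needed_beyond_turn:
        return MemoryType.WORKING
    if describes_how_to:
        return MemoryType.PROCEDURAL
    if tied_to_event is None or tied_to_event:
        return MemoryType.EPISODIC  # the default when in doubt
    return MemoryType.SEMANTIC
```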

What tool should I use to diagram my agent memory architecture?

ArchitectureDiagram.ai generates agent memory architecture diagrams from natural language prompts. Describe your memory types, storage backends, retrieval paths, and write-path extraction logic, and the tool produces a clean, shareable architecture diagram exportable as SVG or PNG. The prompt templates above are ready to paste directly into the tool — they are written to produce detailed, production-quality diagrams that distinguish the read path, write path, and consolidation process explicitly.

Related guides: AI agent architecture diagrams, context engineering diagram, RAG architecture diagram, and multi-agent orchestration patterns.
