External Memory Systems

PM: Read in full (20 min)

The Idea

Every LLM API call is stateless. Send a message, get a reply, done. The model has no memory of what you discussed yesterday, last week, or thirty minutes ago in a different browser tab. All it knows is what is in the current context window.

For a one-shot task ("translate this sentence," "summarize this document") that's fine. For anything where continuity matters, it's a fundamental problem:

  • A coding assistant that forgets you prefer functional patterns
  • A customer support bot that makes the user re-explain their plan every call
  • An agent that can't build compound knowledge over multi-day work
  • A personal assistant that doesn't know your name by the third conversation

External memory is the pattern for solving this: store information outside the model, retrieve what's relevant before each inference call, and inject it into the prompt. The model itself is still stateless, but from the user's perspective it appears to remember.

This is different from RAG, which retrieves external knowledge (documents, facts about the world, your product documentation). External memory retrieves state: what this specific user has done, said, and expressed a preference for. A production system often needs both: RAG for knowledge, memory for state.

What It Does and Why You Care

The Four Memory Tiers

Memory in AI applications exists on a spectrum from fast-and-ephemeral to slow-and-persistent. Understanding where each tier applies is the first design decision.

Tier 1: In-context memory

The messages in the current context window. No infrastructure, no retrieval, just the conversation history you're already managing. The limits are context window size and cost: the full history is re-sent on every call, so a 200-turn conversation might cost $0.10 per inference call and will eventually overflow the window. Good for within-session continuity; useless across sessions.
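
One common way to keep in-context memory inside a budget is to drop the oldest turns first. The sketch below is illustrative only: `trim_history` and `estimate_tokens` are hypothetical helpers, and the four-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose combined estimated tokens fit in `budget`."""
    kept: list[dict] = []
    used = 0
    # Walk backwards so the newest turns are considered first.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "user", "content": "c" * 400},       # ~100 tokens
]
trimmed = trim_history(history, budget=250)  # keeps the two newest messages
```

Dropping turns loses whatever they contained, which is the gap the summary tier is meant to cover.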

Tier 2: Conversation summary

Compress old turns into a short summary, store it keyed to the session or user, inject it at the top of the next session's system prompt. Most chat applications use this pattern. "Earlier in our conversation, the user explained they're building a healthcare app and prefer TypeScript" costs 30 tokens instead of 3,000. Good for maintaining thread continuity without storing or re-sending full history.
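
A minimal sketch of the compaction step, assuming a `summarize` callable that wraps an LLM call. Here `fake_summarize` is a stand-in so the example runs without a model, and `compact_session` is a hypothetical helper name.

```python
from typing import Callable


def compact_session(
    messages: list[dict],
    summarize: Callable[[str], str],
    keep_recent: int = 4,
) -> tuple[str, list[dict]]:
    """Summarize all but the last `keep_recent` messages; return (summary, recent)."""
    split = len(messages) - keep_recent
    if split <= 0:
        return "", messages  # nothing old enough to compress
    old, recent = messages[:split], messages[split:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    return summarize(transcript), recent


def fake_summarize(transcript: str) -> str:
    # Stand-in for an LLM call: only reports how many messages were compressed.
    return f"[summary of {len(transcript.splitlines())} earlier messages]"


messages = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
summary, recent = compact_session(messages, fake_summarize, keep_recent=4)
```

The returned summary is what would be stored against the session or user and placed at the top of the next system prompt.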

Tier 3: Key-value facts

Structured, explicit facts about an entity. User preferences, account state, profile data. Stored in Redis or a relational database, retrieved by key (user_id, session_id). Predictable, cheap, and exactly right when you know in advance what to store: "User's preferred language: Python." "Last order: #12345." "Subscription tier: Enterprise." The weakness is that it doesn't handle unstructured or situational context: you need to know in advance which fields to define.

Tier 4: Semantic/episodic memory

Unstructured memories stored as vector embeddings and retrieved by similarity. "The user mentioned once that their team's biggest frustration is deployment." This fact doesn't fit neatly in a field, but it surfaces when anything deployment-related comes up, retrieved by semantic search against the current context. Implemented with a vector database. This is the tier that makes an AI application feel genuinely attentive over time.
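
To make "retrieved by similarity" concrete, here is a toy sketch using cosine similarity. The three-dimensional vectors are hand-written stand-ins for real embeddings, and `recall` is a hypothetical helper; an actual system would call an embedding model and a vector database instead.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# (text, vector) pairs; the vectors are made up for illustration.
memories = [
    ("Team's biggest frustration is deployment", [0.9, 0.1, 0.0]),
    ("User prefers TypeScript", [0.1, 0.9, 0.0]),
    ("User is based in Berlin", [0.0, 0.1, 0.9]),
]


def recall(query_vector: list[float], k: int = 1) -> list[str]:
    """Return the texts of the k memories most similar to the query vector."""
    ranked = sorted(memories, key=lambda m: cosine(query_vector, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Pretend this is the embedding of "our release pipeline keeps failing".
query = [0.8, 0.2, 0.1]
top = recall(query)
```

The query never mentions the word "deployment"; the match comes from the vectors pointing in a similar direction.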

The Read-Write Cycle

All external memory systems implement the same underlying pattern, regardless of how they package it:

Before the LLM call: query the memory store for what's relevant to this context. Inject results into the system prompt.

After the LLM response: extract memorable information from the exchange, store or update it in the memory system.
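
The two halves of the cycle can be sketched as a single turn handler. Everything here is a stand-in chosen so the example runs on its own: `ListMemory` is a toy in-process store, `extract_facts` replaces LLM extraction with a trivial rule, and `fake_llm` replaces the model call.

```python
class ListMemory:
    """Toy in-process store (hypothetical); real systems use a database."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    def search(self, user_id: str, query: str) -> list[str]:
        # Naive relevance: share at least one word longer than three letters.
        words = {w for w in query.lower().split() if len(w) > 3}
        return [
            fact
            for fact in self._facts.get(user_id, [])
            if words & set(fact.lower().split())
        ]

    def add(self, user_id: str, fact: str) -> None:
        self._facts.setdefault(user_id, []).append(fact)


def extract_facts(user_message: str) -> list[str]:
    # Stand-in for LLM extraction: keep first-person statements only.
    return [user_message] if user_message.lower().startswith("i ") else []


def fake_llm(system_prompt: str, user_message: str) -> str:
    # Stand-in for the model call.
    return f"(reply to: {user_message})"


def chat_turn(memory: ListMemory, user_id: str, user_message: str):
    # Read: fetch relevant memories and inject them into the system prompt.
    recalled = memory.search(user_id, user_message)
    memory_block = "\n".join(f"- {fact}" for fact in recalled)
    system_prompt = f"What you know about this user:\n{memory_block}"
    reply = fake_llm(system_prompt, user_message)
    # Write: store anything worth remembering from this exchange.
    for fact in extract_facts(user_message):
        memory.add(user_id, fact)
    return reply, recalled


memory = ListMemory()
_, first = chat_turn(memory, "user-42", "I deploy with Docker")
_, second = chat_turn(memory, "user-42", "How should I speed up Docker builds?")
```

On the first turn nothing is recalled; on the second, the stored Docker fact comes back because the new message mentions Docker.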

The extraction step is where implementations diverge. Simple systems store raw conversation turns. Sophisticated ones use a separate LLM call to pull out facts worth keeping ("User confirmed they're using Next.js 14 with the App Router"), deduplicate against existing memories ("User prefers TypeScript" + "User said to use TypeScript" → one memory), and score by relevance and recency.

Why This Changes What You Can Build
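
One possible way to combine relevance and recency into a single score is to multiply similarity by an exponential decay. The formula and the 30-day half-life below are assumptions for illustration, not how any particular product scores memories.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    similarity: float  # relevance to the current query, between 0 and 1
    age_days: float    # time since the memory was stored or last confirmed


def score(candidate: Candidate, half_life_days: float = 30.0) -> float:
    """Relevance discounted by age: the score halves every `half_life_days`."""
    return candidate.similarity * 0.5 ** (candidate.age_days / half_life_days)


candidates = [
    Candidate("User prefers TypeScript", similarity=0.80, age_days=60),
    Candidate("User is migrating to Go", similarity=0.70, age_days=0),
]
ranked = sorted(candidates, key=score, reverse=True)
```

With these numbers the older memory is discounted to a quarter of its similarity, so the fresher but slightly less similar memory ranks first.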

Without external memory, personalization requires users to constantly re-establish context. With it:

  • Applications accumulate knowledge about users over time without any explicit "user profile" UI
  • Agents can be handed off between sessions without losing intermediate context
  • Preferences stated once carry forward indefinitely
  • Multi-session workflows (research that spans days, an ongoing project with an AI collaborator) become coherent

The capability shift is qualitative, not just incremental. An assistant that remembers is categorically different from one that doesn't.

How to Take Advantage of It

Three approaches, ordered from least to most build effort:

1. Managed Memory Layers

These systems handle the full read-write cycle: extraction, deduplication, storage, and retrieval. You call two functions.

Mem0: a drop-in memory layer. You send conversation turns to m.add() and retrieve relevant context with m.search(). Internally, Mem0 uses an LLM to extract memorable facts from the conversation, runs deduplication against existing memories, stores them with embeddings, and serves them via semantic search. Available as a cloud API or self-hosted with your own vector database and LLM.

from mem0 import Memory

m = Memory()

# After a conversation turn: Mem0 extracts what's worth keeping
m.add(
    messages=[
        {"role": "user", "content": "I'm building a Next.js app with TypeScript"},
        {"role": "assistant", "content": "Great, I'll keep that in mind."},
    ],
    user_id="user-42",
)

# Before the next LLM call: retrieve semantically relevant memories
memories = m.search(query="what stack is the user using?", user_id="user-42")
# Returns entries such as {"memory": "User is building a Next.js app with TypeScript", ...}
# (the exact response shape depends on the Mem0 version)

Zep: conversation history plus long-term memory. Stores the full conversation graph with automatic summarization, extracts named entities and facts into a searchable knowledge graph, and surfaces temporal context ("what did they say about X last month?"). Strong on longitudinal conversations where the sequence of what was said matters, not just the content.

LangMem (LangChain): memory primitives integrated into LangChain's ecosystem. More compositional than Mem0 or Zep (you wire together extractors, stores, and retrievers), which gives more control but requires more glue code. The right choice if you're already deep in the LangChain stack.

Choose a managed layer when you want memory working quickly and don't need to control extraction logic. The managed systems handle the hardest part: deciding what's worth remembering and how to deduplicate it.

2. Self-Managed Semantic Memory

Build the pipeline yourself using a vector database. You control what gets stored, how it's chunked, and how it's retrieved. More work, more control.

The pipeline has two sides:

Write path (after each conversation turn or agent step):

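import json
from datetime import datetime, timezone
from uuid import uuid4

# llm, embed, and vector_db are placeholders for your own clients;
# llm.call is assumed to return the raw response text.

# 1. Extract memorable facts using an LLM
extraction_prompt = """
Extract facts worth remembering from this conversation.
Return a JSON list of strings. Only include non-obvious, user-specific facts.
Conversation: {messages}
"""
raw = llm.call(extraction_prompt.format(messages=messages))
facts = json.loads(raw)  # the prompt asks for a JSON list of strings

# 2. Embed and store each fact
for fact in facts:
    vector = embed(fact)
    vector_db.upsert(
        id=str(uuid4()),
        vector=vector,
        payload={
            "text": fact,
            "user_id": user_id,
            "ts": datetime.now(timezone.utc).isoformat(),
        },
    )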
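
Model output is not guaranteed to be valid JSON, so the extraction result usually needs defensive parsing before anything is stored. A minimal sketch, assuming the model was asked for a JSON list of strings; `parse_facts` is a hypothetical helper, not part of any library.

```python
import json

FENCE = "`" * 3  # Markdown code fence that some models wrap around JSON


def parse_facts(raw: str) -> list[str]:
    """Parse a reply that should be a JSON list of strings; return [] otherwise."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the surrounding backticks and an optional "json" language tag.
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only non-empty strings, trimmed of surrounding whitespace.
    return [item.strip() for item in data if isinstance(item, str) and item.strip()]


clean = parse_facts('["User uses Next.js 14", " User prefers TypeScript "]')
fenced = parse_facts(FENCE + 'json\n["User is on the Enterprise plan"]\n' + FENCE)
broken = parse_facts("Sure! Here are the facts you asked for.")
```

Returning an empty list on bad output means a failed extraction stores nothing instead of storing garbage, which is usually the safer default for a memory store.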

Read path (before each LLM call):

# 1. Embed the current user message
query_vector = embed(current_user_message)

# 2. Retrieve top-K relevant memories for this user
memories = vector_db.search(
    vector=query_vector,
    filter={"user_id": user_id},
    top_k=5,
)

# 3. Inject into the system prompt
memory_block = "\n".join(f"- {m.payload['text']}" for m in memories)
system_prompt = f"What you know about this user:\n{memory_block}\n\n{base_system_prompt}"

The self-managed path is better when you need precise control: custom extraction logic, specific memory formats, integration with an existing data model, or when you want to avoid a managed service dependency.

3. Key-Value Facts (Simplest)

If you know in advance what to track (user preferences, account state, explicit settings), a key-value store is the most predictable and cheapest option. Redis for low-latency reads; Postgres for durability and joins.

import redis

# decode_responses=True makes reads return str instead of bytes
r = redis.Redis(decode_responses=True)

# Write: user stated a preference
r.hset(f"user:{user_id}:prefs", "language", "Python")
r.hset(f"user:{user_id}:prefs", "framework", "FastAPI")

# Read: inject into every call for this user
prefs = r.hgetall(f"user:{user_id}:prefs")
system_prompt = f"User preferences: {prefs}\n\n{base_system_prompt}"

The limitation: you can only retrieve what you explicitly defined a field for. Anything outside the schema gets lost.

Combining Tiers

Production systems typically layer tiers rather than choosing one:

| What to store | Tier | Why |
| --- | --- | --- |
| Current conversation | In-context | Free, always right |
| User account, plan, explicit settings | Key-value | Structured, deterministic |
| "User mentioned they're migrating from Rails" | Semantic | No field exists for this |
| Prior session context | Summary injected at session start | Cheap, preserves thread |

Starting with key-value for the explicit stuff and adding semantic memory once you see what users actually need to have remembered is a reasonable progression.
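
A sketch of how the layered tiers might be assembled into one system prompt. `build_system_prompt` and the section headings are assumptions for illustration; the inputs would come from the key-value store, the stored session summary, and semantic search respectively.

```python
def build_system_prompt(
    base: str,
    prefs: dict[str, str],
    summary: str,
    memories: list[str],
) -> str:
    """Join whichever memory tiers have content, followed by the base prompt."""
    sections: list[str] = []
    if prefs:
        lines = "\n".join(f"- {key}: {value}" for key, value in sorted(prefs.items()))
        sections.append(f"User preferences:\n{lines}")
    if summary:
        sections.append(f"Previous session summary:\n{summary}")
    if memories:
        lines = "\n".join(f"- {memory}" for memory in memories)
        sections.append(f"What you know about this user:\n{lines}")
    sections.append(base)
    return "\n\n".join(sections)


prompt = build_system_prompt(
    base="You are a helpful coding assistant.",
    prefs={"language": "Python"},
    summary="The user is building a healthcare app.",
    memories=["User mentioned they're migrating from Rails"],
)
```

Empty tiers are skipped entirely, so a new user with no stored state simply gets the base prompt.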

Choosing an Approach

| Situation | Approach |
| --- | --- |
| Want memory working this week | Managed (Mem0, Zep) |
| Already on Postgres, structured facts | pgvector + custom pipeline |
| Need to control extraction logic exactly | Self-managed vector DB |
| Simple preferences, account state | Redis or Postgres key-value |
| Prototype / exploring what to remember | Managed first, migrate later |

PM Takeaway

The hard problem with memory systems isn't storage or retrieval; it's deciding what's worth remembering. Storing everything is noisy: the retrieved context clutters the prompt with irrelevant facts. The extraction step (what do we pull from this conversation?) is where most memory implementations fail. If you're evaluating a memory system, the extraction quality (not the retrieval mechanism) is the most important thing to test.
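
One rough way to test extraction quality is to hand-label the facts a conversation should yield and compare them with what the system extracted. The sketch below assumes exact matching after normalization, which is a simplification (real evaluations usually need fuzzy or model-graded matching), and `extraction_scores` is a hypothetical helper.

```python
def normalize(fact: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count as misses."""
    return " ".join(fact.lower().split())


def extraction_scores(expected: list[str], extracted: list[str]) -> dict[str, float]:
    """Precision: share of extracted facts that were wanted. Recall: share of wanted facts found."""
    want = {normalize(fact) for fact in expected}
    got = {normalize(fact) for fact in extracted}
    hits = len(want & got)
    return {
        "precision": hits / len(got) if got else 0.0,
        "recall": hits / len(want) if want else 0.0,
    }


scores = extraction_scores(
    expected=["User uses Next.js 14", "User prefers TypeScript"],
    extracted=["user prefers typescript", "User said hello"],
)
```

Low precision means the store fills with noise; low recall means the assistant forgets things users expect it to remember.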

Going Deeper

The vector search underlying semantic memory uses the same infrastructure as RAG. Vector Databases covers the implementation layer: HNSW indexes, the options (Qdrant, Chroma, pgvector, Pinecone), and operational concerns like embedding model lock-in.

Further Reading

  • Vector Databases: the storage and retrieval layer for semantic memory
  • RAG: retrieval for external knowledge (distinct from memory but uses the same infrastructure)
  • Agents & Tool Use: the agent loop that external memory makes stateful
  • Context Windows: why context is ephemeral and what the limits are