External Memory Systems

PM: Read in full — 20 min

The Idea

Every LLM API call is stateless. Send a message, get a reply, done. The model has no memory of what you discussed yesterday, last week, or thirty minutes ago in a different browser tab. All it knows is what is in the current context window.

For a one-shot task — "translate this sentence," "summarize this document" — that's fine. For anything where continuity matters, it's a fundamental problem:

A coding assistant that forgets you prefer functional patterns
A customer support bot that makes the user re-explain their plan every call
An agent that can't build compound knowledge over multi-day work
A personal assistant that doesn't know your name by the third conversation

External memory is the pattern for solving this: store information outside the model, retrieve what's relevant before each inference call, inject it into the prompt. The model still has no memory — it's still stateless — but from the user's perspective it feels like it does.

This is different from RAG, which retrieves external knowledge (documents, facts about the world, your product documentation). External memory retrieves state — what this specific user has done, said, and expressed preference for. A production system often needs both: RAG for knowledge, memory for state.

What It Does and Why You Care

The Four Memory Tiers

Memory in AI applications exists on a spectrum from fast-and-ephemeral to slow-and-persistent. Understanding where each tier applies is the first design decision.

Tier 1 — In-context memory

The messages in the current context window. No infrastructure, no retrieval — just the conversation history you're already managing. The limit is the context window size and cost: a 200-turn conversation history might cost $0.10 per inference call and eventually overflow. Good for within-session continuity; useless across sessions.

Tier 2 — Conversation summary

Compress old turns into a short summary, store it keyed to the session or user, inject it at the top of the next session's system prompt. Most chat applications use this pattern. "Earlier in our conversation, the user explained they're building a healthcare app and prefer TypeScript" costs 30 tokens instead of 3,000. Good for maintaining thread continuity without storing or re-sending full history.

Tier 3 — Key-value facts

Structured, explicit facts about an entity. User preferences, account state, profile data. Stored in Redis or a relational database, retrieved by key (user_id, session_id). Predictable, cheap, and exactly right when you know in advance what to store: "User's preferred language: Python." "Last order: #12345." "Subscription tier: Enterprise." The weakness is that it doesn't handle unstructured or situational context — you need to know what fields to define.

Tier 4 — Semantic/episodic memory

Unstructured memories stored as vector embeddings and retrieved by similarity. "The user mentioned once that their team's biggest frustration is deployment." This fact doesn't fit neatly in a field — but it surfaces when anything deployment-related comes up, retrieved by semantic search against the current context. Implemented with a vector database. This is the tier that makes an AI application feel genuinely attentive over time.

The Read-Write Cycle

All external memory systems implement the same underlying pattern, regardless of how they package it:

Before the LLM call: query the memory store for what's relevant to this context. Inject results into the system prompt.

After the LLM response: extract memorable information from the exchange, store or update it in the memory system.

The extraction step is where implementations diverge. Simple systems store raw conversation turns. Sophisticated ones use a separate LLM call to pull out facts worth keeping ("User confirmed they're using Next.js 14 with the App Router"), deduplicate against existing memories ("User prefers TypeScript" + "User said to use TypeScript" → one memory), and score by relevance and recency.

Why This Changes What You Can Build

Without external memory, personalization requires users to constantly re-establish context. With it:

Applications accumulate knowledge about users over time without any explicit "user profile" UI
Agents can be handed off between sessions without losing intermediate context
Preferences stated once carry forward indefinitely
Multi-session workflows — research that spans days, an ongoing project with an AI collaborator — become coherent

The capability shift is qualitative, not just incremental. An assistant that remembers is categorically different from one that doesn't.

How to Take Advantage of It

Three approaches, ordered from least to most build effort:

1. Managed Memory Layers

These systems handle the full read-write cycle: extraction, deduplication, storage, and retrieval. You call two functions.

Zep — conversation history plus long-term memory. Stores the full conversation graph with automatic summarization, extracts named entities and facts into a searchable knowledge graph, and surfaces temporal context ("what did they say about X last month?"). Strong on longitudinal conversations where the sequence of what was said matters, not just the content.

LangMem (LangChain) — memory primitives integrated into LangChain's ecosystem. More compositional than Zep — you wire together extractors, stores, and retrievers — which gives more control but requires more glue code. The right choice if you're already deep in the LangChain stack.

Choose a managed layer when you want memory working quickly and don't need to control extraction logic. The managed systems handle the hardest part: deciding what's worth remembering and how to deduplicate it.

2. Self-Managed Semantic Memory

Build the pipeline yourself using a vector database. You control what gets stored, how it's chunked, and how it's retrieved. More work, more control.

The pipeline has two sides:

Write path (after each conversation turn or agent step):

# 1. Extract memorable facts using an LLM
extraction_prompt = """
Extract facts worth remembering from this conversation.
Return a JSON list of strings. Only include non-obvious, user-specific facts.
Conversation: {messages}
"""
facts = llm.call(extraction_prompt.format(messages=messages))

# 2. Embed and store each fact
for fact in facts:
    vector = embed(fact)
    vector_db.upsert(
        id=str(uuid4()),
        vector=vector,
        payload={"text": fact, "user_id": user_id, "ts": now()}
    )

Read path (before each LLM call):

# 1. Embed the current user message
query_vector = embed(current_user_message)

# 2. Retrieve top-K relevant memories for this user
memories = vector_db.search(
    vector=query_vector,
    filter={"user_id": user_id},
    top_k=5
)

# 3. Inject into the system prompt
memory_block = "\n".join(f"- {m.payload['text']}" for m in memories)
system_prompt = f"What you know about this user:\n{memory_block}\n\n{base_system_prompt}"

The self-managed path is better when you need precise control: custom extraction logic, specific memory formats, integration with an existing data model, or when you want to avoid a managed service dependency.

3. Key-Value Facts (Simplest)

If you know in advance what to track — user preferences, account state, explicit settings — a key-value store is the most predictable and cheapest option. Redis for low-latency reads; Postgres for durability and joins.

# Write: user stated a preference
redis.hset(f"user:{user_id}:prefs", "language", "Python")
redis.hset(f"user:{user_id}:prefs", "framework", "FastAPI")

# Read: inject into every call for this user
prefs = redis.hgetall(f"user:{user_id}:prefs")
system_prompt = f"User preferences: {prefs}\n\n{base_system_prompt}"

The limitation: you can only retrieve what you explicitly defined a field for. Anything outside the schema gets lost.

Combining Tiers

Production systems typically layer tiers rather than choosing one:

What to store	Tier	Why
Current conversation	In-context	Free, always right
User account, plan, explicit settings	Key-value	Structured, deterministic
"User mentioned they're migrating from Rails"	Semantic	No field exists for this
Prior session context	Summary injected at session start	Cheap, preserves thread

Starting with key-value for the explicit stuff and adding semantic memory once you see what users actually need to have remembered is a reasonable progression.

Choosing an Approach

Situation	Approach
Want memory working this week	Managed (Zep)
Already on Postgres, structured facts	pgvector + custom pipeline
Need to control extraction logic exactly	Self-managed vector DB
Simple preferences, account state	Redis or Postgres key-value
Prototype / exploring what to remember	Managed first, migrate later

PM Takeaway

The hard problem with memory systems isn't storage or retrieval — it's deciding what's worth remembering. Storing everything is noisy; the retrieved context clutters the prompt with irrelevant facts. The extraction step (what do we pull from this conversation?) is where most memory implementations fail. If you're evaluating a memory system, the extraction quality — not the retrieval mechanism — is the most important thing to test.

Going Deeper

The vector search underlying semantic memory uses the same infrastructure as RAG. Vector Databases covers the implementation layer — HNSW indexes, the options (Qdrant, Chroma, pgvector, Pinecone), and operational concerns like embedding model lock-in.

The Idea​

What It Does and Why You Care​

The Four Memory Tiers​

The Read-Write Cycle​

Why This Changes What You Can Build​

How to Take Advantage of It​

1. Managed Memory Layers​

2. Self-Managed Semantic Memory​

3. Key-Value Facts (Simplest)​

Combining Tiers​

Choosing an Approach​

Further Reading​