AI agent memory is not one thing. It is three layers, each handling a different timescale and a different question. Short-term memory holds what is happening right now. Long-term memory holds what the agent might need to retrieve. Episodic memory holds what the agent has done before. Production-grade agents use all three with explicit handoffs; weak designs collapse them into the LLM context window and break under load.

The architecture borrows directly from cognitive science: short-term memory in humans is bounded and temporary; long-term memory is associative and large; episodic memory is autobiographical (Tulving, 1972). Modern AI agents use the same three-layer split because the same problems show up: bounded working memory, the need for cheap retrieval over a large knowledge base, and the need to remember "what happened last time" without re-experiencing it.

The three layers

The three layers run on different timescales. Short-term memory operates in seconds: the prompt, recent tool outputs, the current step. Long-term memory operates indefinitely: a knowledge base of documents and facts the agent retrieves on demand. Episodic memory operates across tasks: a log of past runs that informs new ones. Each layer answers a different question.

  1. Short-term: what is happening right now? Context window plus per-step state.
  2. Long-term: what does the agent know? Vector store of facts, documents, and policies.
  3. Episodic: what has the agent done before? Log of past tasks, plans, actions, outcomes.

The handoffs matter as much as the layers. Short-term hands off relevant facts to long-term so they survive context-window overflow. Long-term hands relevant retrievals back to short-term at the start of each step. Episodic hands relevant past outcomes to both. The agent designer's job is the wiring, not the layers themselves; the layers are commodity infrastructure in 2026.
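
What explicit wiring can look like, as a minimal sketch: `AgentMemory` and its method names are hypothetical, and plain Python lists stand in for a real vector store and episodic database. The point is that each handoff is an explicit call, not a side effect of the context window.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Three explicit layers; each handoff is a method, not a side effect."""
    short_term: list = field(default_factory=list)  # current context items
    long_term: list = field(default_factory=list)   # stand-in for a vector store
    episodic: list = field(default_factory=list)    # records of past tasks

    def summarise_to_long_term(self, summary: str) -> None:
        """Handoff 1: persist compressed short-term state before it overflows."""
        self.long_term.append(summary)

    def retrieve_to_short_term(self, query: str, k: int = 3) -> None:
        """Handoff 2: pull relevant facts back into context at each step.
        Substring match stands in for vector similarity here."""
        hits = [f for f in self.long_term if query.lower() in f.lower()]
        self.short_term.extend(hits[:k])

    def recall_episodes(self, task: str, k: int = 2) -> list:
        """Handoff 3: surface similar past outcomes to inform planning."""
        return [e for e in self.episodic if task.lower() in e["input"].lower()][:k]

    def log_outcome(self, record: dict) -> None:
        """Handoff 4: write the finished task back to the episodic log."""
        self.episodic.append(record)
```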

Short-term: the context window

Short-term memory in AI agents centres on the LLM context window: the maximum number of tokens the model can attend to in a single inference call. Frontier models in 2026 offer context windows ranging from tens of thousands to over a million tokens. The context window holds the system prompt, the goal, recent tool outputs, and the immediate plan. Everything else has to be retrieved from long-term or episodic memory and re-injected.
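
A concrete sketch of that composition (names illustrative): assembling the window for a single step makes the boundary visible, because anything not passed in here must be re-injected via retrieval or it is gone.

```python
def build_context(system_prompt: str, goal: str, retrieved: list[str],
                  recent_tool_outputs: list[str], plan: str) -> str:
    """Assemble the short-term window for one inference call.
    Everything outside these arguments must come back via retrieval."""
    return "\n\n".join([
        system_prompt,
        f"GOAL: {goal}",
        "RETRIEVED FACTS:\n" + "\n".join(retrieved),
        "RECENT TOOL OUTPUT:\n" + "\n".join(recent_tool_outputs[-3:]),
        f"CURRENT PLAN: {plan}",
    ])
```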

The common mistake is treating the context window as the entire memory system. An agent that runs 50 steps cannot keep all 50 in context; the early steps fall off, and with them the goal context, the tool outputs, and the partial plan. This is the structural cause of stop-after-one-task failure: short-term memory degrades and the agent loses the thread.

The fix is not bigger context windows. The fix is explicit summarisation and handoff to long-term memory. The agent compresses what it has done so far into a structured state that fits, retrieves relevant facts from long-term, and continues. The pattern is sometimes called the "agent state machine"; it appears in most mature agent frameworks (LangChain memory, LlamaIndex agent state, Anthropic's recommendations).
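
A minimal sketch of that compress-and-continue step, assuming an `llm(prompt) -> str` callable and a crude token estimate; frameworks vary in the details but not the shape.

```python
MAX_CONTEXT_TOKENS = 8_000  # assumed budget; set from the model's real limit

def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def compress_if_needed(history: list[str], llm) -> list[str]:
    """Replace the oldest steps with a structured summary once the
    running history nears the budget; keep recent steps verbatim."""
    if sum(estimate_tokens(h) for h in history) < MAX_CONTEXT_TOKENS:
        return history
    old, recent = history[:-5], history[-5:]
    summary = llm(
        "Summarise these agent steps as structured state "
        "(goal, facts learned, actions taken, open questions):\n"
        + "\n".join(old)
    )
    return [f"[state summary] {summary}", *recent]
```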

Long-term: the vector store

Long-term memory in AI agents is implemented as a vector store: a database of embeddings (numerical representations of text or other data) indexed by similarity. Common stores in 2026: Pinecone, Weaviate, Chroma, Milvus, pgvector. The agent queries the store by similarity to retrieve relevant documents or facts and injects them into short-term memory for the current step. This pattern is retrieval-augmented generation (RAG); when the agent decides when and what to retrieve, it becomes "agentic RAG".
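
A sketch of the agentic part, under toy assumptions: a letter-frequency `embed` stands in for a real embedding model, and the agent first asks the LLM whether retrieval is needed at all before ranking by cosine similarity.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy letter-frequency embedding; swap in a real embedding model."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def maybe_retrieve(question: str, docs: list[str], llm, k: int = 3) -> list[str]:
    """Agentic RAG: decide whether to retrieve, then rank by similarity."""
    decision = llm("Does answering this need external documents? "
                   f"Reply YES or NO.\n{question}")
    if "YES" not in decision.upper():
        return []  # the agent chose to skip retrieval this step
    q = embed(question)
    sims = [float(embed(d) @ q) for d in docs]
    order = sorted(range(len(docs)), key=sims.__getitem__, reverse=True)
    return [docs[i] for i in order[:k]]
```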

The benefits are scope and freshness. The vector store can hold gigabytes of data the LLM was never trained on. New documents added to the store are immediately queryable; no fine-tuning required. Both OpenAI's and Anthropic's engineering guidance recommend RAG over fine-tuning for most knowledge-augmentation tasks (OpenAI, Anthropic, retrieved 2026-05-07).

The main failure mode is precision: vector similarity returns documents that look similar in embedding space but are not actually relevant, and the agent then reasons over noise. Mitigations: hybrid search (dense embeddings plus keyword search), reranking (a second model scores retrieved chunks), and metadata filters (only retrieve documents from this date range or this source).
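
The three mitigations compose naturally. A sketch with assumed scoring callables (`dense_score`, `rerank` are placeholders for real models) shows the order of operations: filter cheaply first, fuse dense and keyword scores, then rerank only a shortlist.

```python
def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms present in the chunk (a stand-in for BM25)."""
    terms = set(query.lower().split())
    return sum(t in text.lower() for t in terms) / (len(terms) or 1)

def hybrid_search(query, chunks, dense_score, rerank,
                  source=None, k=5, alpha=0.7):
    # 1. Metadata filter: cheap and exact, applied before any similarity math.
    pool = [c for c in chunks if source is None or c["source"] == source]
    # 2. Hybrid score: weighted fusion of dense similarity and keyword overlap.
    pool.sort(key=lambda c: alpha * dense_score(query, c["text"])
                          + (1 - alpha) * keyword_score(query, c["text"]),
              reverse=True)
    shortlist = pool[: k * 4]  # over-retrieve, then let the reranker decide
    # 3. Rerank: a second, slower model rescores only the shortlist.
    shortlist.sort(key=lambda c: rerank(query, c["text"]), reverse=True)
    return shortlist[:k]
```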

[Figure: Three layers of AI agent memory. Short-term: context window, seconds. Long-term: vector store, indefinite. Episodic: task log, across tasks. Handoffs: summarise, retrieve, recall, log past outcomes. Source: adapted from Tulving 1972 (cognitive memory), modern agent frameworks (LangChain, LlamaIndex), Anthropic engineering.]
The three layers, with the four key handoffs labelled. Production failures concentrate on the handoffs.

Episodic: the task log

Episodic memory in AI agents is the log of past tasks and their outcomes. For each task: the input, the plan, the actions taken, the result, the success or failure label. The agent queries the log by similarity at the start of new tasks: "have I done something like this before? what worked? what failed?" This is the layer that lets agents improve across runs without retraining the underlying model.
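
A single database table is enough to start. This sketch uses sqlite, with a keyword LIKE query standing in for similarity search over embedded task descriptions (the more common production choice); names and schema are illustrative.

```python
import json
import sqlite3

conn = sqlite3.connect("episodes.db")
conn.execute("""CREATE TABLE IF NOT EXISTS episodes (
    id INTEGER PRIMARY KEY, task TEXT, plan TEXT,
    actions TEXT, outcome TEXT, success INTEGER)""")

def log_episode(task: str, plan: str, actions: list, outcome: str,
                success: bool) -> None:
    """Write one finished task to the episodic log."""
    conn.execute(
        "INSERT INTO episodes (task, plan, actions, outcome, success) "
        "VALUES (?, ?, ?, ?, ?)",
        (task, plan, json.dumps(actions), outcome, int(success)))
    conn.commit()

def recall(task_keyword: str, limit: int = 3) -> list:
    """Called at planning time: what worked or failed on similar tasks?"""
    return conn.execute(
        "SELECT task, outcome, success FROM episodes "
        "WHERE task LIKE ? ORDER BY id DESC LIMIT ?",
        (f"%{task_keyword}%", limit)).fetchall()
```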

Episodic memory is the least mature of the three layers in 2026. Most production agents log task data but do not actively retrieve from it during planning. The frameworks are still settling on standard patterns (LangChain's "memory" abstractions, LlamaIndex's "agent state", custom implementations). The opportunity is large: an agent that consistently consults its own history avoids repeating known failure modes.

Episodic memory also feeds the 80-test methodology in reverse: failed tasks become test cases. The methodology assumes the test corpus grows over time as production failures are surfaced and added; episodic memory is the source of those additions.

How memory fails in production

Four common failure modes, each rooted in a different layer or handoff.

  1. Context overflow: the task exceeds the context window and early state is lost.
  2. Retrieval irrelevance: the vector store returns documents that look similar in embedding space but do not help.
  3. Stale episodic memory: logs from old workflows mislead the agent on a current task with different constraints.
  4. Missing handoffs: short-term forgets something that was never written to long-term, or long-term holds a fact that short-term never queried.

Diagnosing memory failures requires per-layer logging. The agent should log what was in context at each step, what it retrieved from long-term, and what episodic context it pulled. Without these logs, "the agent forgot" looks the same regardless of which layer failed. The 80-test methodology category for "partial results" covers this: a partial-result failure usually traces to a handoff problem, not a model problem.
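
A sketch of what per-layer logging can look like: one JSON record per step, one layer per field, so a "the agent forgot" report can be traced to the layer that actually dropped the data. Field names are illustrative.

```python
import json
import time

def log_step(step_id: int, context_snapshot: list[str],
             retrieved_docs: list[str], episodic_hits: list,
             path: str = "memory_trace.jsonl") -> None:
    """Append one structured record per agent step."""
    record = {
        "ts": time.time(),
        "step": step_id,
        "short_term_tokens": sum(len(c) // 4 for c in context_snapshot),
        "short_term_tail": context_snapshot[-3:],   # what was in context
        "long_term_retrieved": retrieved_docs,      # what came back, verbatim
        "episodic_recalled": episodic_hits,         # which past runs were consulted
    }
    with open(path, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
```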

The pragmatic operating advice: design the three layers explicitly from day one, even if early implementations are simple. A vector store with 100 documents is fine for a prototype; a context summarisation strategy can be a single prompt; an episodic log can be a database table. Retrofitting memory architecture into an agent that "just uses context" is a rewrite, not a refactor.

Frequently asked questions

What are the three types of AI agent memory?

Short-term context (the LLM context window for the current task), long-term vector store (embeddings of facts and documents retrieved on demand), and episodic memory (a log of past tasks and their outcomes). Each handles a different timescale and a different question. Production agents typically use all three with explicit handoffs between layers.

Is the LLM context window the same as agent memory?

No. The context window is one component of short-term memory. It is bounded by the model and disappears after the inference call. Long-term and episodic memory persist across tasks and sessions; they live in vector stores or databases external to the LLM. Treating context as the only memory layer is the most common mistake in early agent designs.

How does episodic memory work in AI agents?

Episodic memory logs each task the agent runs: input, plan, actions taken, outcome. The log is queried by similarity at the start of new tasks so the agent can recall what worked or failed last time. Episodic memory is what lets agents improve across runs without retraining the underlying model.

What is a vector store in AI agent memory?

A vector store holds embeddings (numerical representations) of documents, facts, or memories. The agent queries it by similarity to retrieve relevant items. Pinecone, Weaviate, pgvector, and Chroma are common stores. Vector stores enable retrieval-augmented generation (RAG) and are the standard mechanism for long-term memory in agents.

What memory failures cause AI agents to break?

Four common failures: context overflow (the task exceeds the context window), retrieval irrelevance (the vector store returns documents that look similar but are not useful), stale episodic memory (logs from old workflows mislead the agent on the current task), and missing handoffs between layers (short-term forgets, long-term never captured the data).

Three takeaways before you close this tab

  1. Agent memory is three layers on three timescales: the context window (seconds), the vector store (indefinite), and the task log (across tasks). Treating the context window as the whole system is the most common design mistake.
  2. The handoffs matter as much as the layers: summarise short-term state into long-term, retrieve long-term facts back into context, recall episodic outcomes at planning time, and log every outcome. Production failures concentrate on these handoffs.
  3. Design all three layers explicitly from day one, however simple the implementations; retrofitting memory into a context-only agent is a rewrite, not a refactor.

Sources