To build an AI agent with memory, you build two layers and connect them. The first is short-term working memory, the text the agent holds inside its context window during a single run. The second is long-term persistent memory, durable facts stored outside the model in a database or vector store that survive from one run to the next. At run time, the agent retrieves only the relevant few stored facts and loads them into the working layer. Memory is not one big bucket; it is a small, fast layer that lives inside the run and a larger, durable layer that you query into it on demand.

This guide covers both layers in order: what each one is, what belongs in storage and what does not, how retrieval works, how memory relates to the context window, how to keep stored facts accurate rather than stale, and how to scope memory so private data stays private. If you want the underlying concept first, the definition of an AI agent and how AI agents work give the foundation this builds on.

Short-term vs long-term memory

The two kinds of memory solve different problems, and conflating them is the most common reason agent memory goes wrong.

Short-term, or working, memory is everything the agent can see during the current run: the task you gave it, the steps it has taken so far, the outputs of tools it has called, and any intermediate reasoning. It lives inside the context window. It is fast because it is already in front of the model, and it is complete for the task at hand. It is also temporary. When the run ends, it is gone unless you deliberately write part of it to durable storage. Working memory is closely tied to state management: the run's state is what the agent is currently holding in mind.

Long-term, or persistent, memory is what survives across runs. It lives outside the model in a store you control: a relational database, a key-value store, a vector index, or some combination. A fact written there on Monday is available on Friday, on a different task, in a different session. This is the layer that lets an agent recall a user's stated preference, a prior decision, or the outcome of a previous job. Designing this layer well is its own discipline, covered in depth in long-term memory strategies.

The relationship between them is one of supply. Working memory is the desk; long-term memory is the filing cabinet. You do not work out of the cabinet, and you do not store everything on the desk. You pull the right folder onto the desk when you need it and file the useful results back when you are done.
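The desk-and-cabinet relationship can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `WorkingMemory`, `PersistentStore`, and `run_agent` are hypothetical names, a plain dictionary stands in for a real database, and the model call is left as a comment.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkingMemory:
    """Short-term layer: exists only for the duration of one run."""
    task: str
    notes: list = field(default_factory=list)

    def as_prompt(self) -> str:
        # Everything returned here would be placed in the context window.
        return "\n".join([f"Task: {self.task}", *self.notes])


class PersistentStore:
    """Long-term layer: a dict stands in for a database (assumption)."""

    def __init__(self) -> None:
        self._facts = {}

    def write(self, user_id: str, key: str, value: str) -> None:
        self._facts[(user_id, key)] = value

    def read(self, user_id: str, key: str) -> Optional[str]:
        return self._facts.get((user_id, key))


def run_agent(store: PersistentStore, user_id: str, task: str) -> str:
    working = WorkingMemory(task=task)
    # Pull the right folder onto the desk.
    fmt = store.read(user_id, "report_format")
    if fmt is not None:
        working.notes.append(f"Known preference: report_format={fmt}")
    prompt = working.as_prompt()
    # ... a model call using `prompt` would go here ...
    # File a useful result back; the working copy is dropped on return.
    store.write(user_id, "last_task", task)
    return prompt


store = PersistentStore()
store.write("u1", "report_format", "pdf")
first = run_agent(store, "u1", "Summarize Q3 sales")
second = run_agent(store, "u1", "Summarize Q4 sales")
```

Each call to `run_agent` builds a fresh `WorkingMemory`, so nothing carries over between runs except what was written to the store on purpose.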

What to store and what not to store

A persistent store is only as useful as it is selective. The instinct to "remember everything" produces a store full of noise that retrieves badly and leaks risk. Store facts that change the agent's behavior on a future run, and nothing else.

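Good candidates for long-term memory: stable user preferences, account and project identifiers, decisions and their reasons, and summaries of past outcomes.

Things to keep out of long-term memory: raw transcripts, secrets, payment details, and anything you would not want surfaced later.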

A useful test: would this fact, recalled three weeks from now with no other context, help the agent or mislead it? Store the ones that help. The line between a fact worth keeping and live data worth fetching often runs through a database the agent queries directly; connecting an agent to a database covers that side.
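One way to make that selectivity concrete is a small gate in front of the write path. The sketch below is only a rough heuristic under stated assumptions: `should_store`, the 200-character cap, and the two patterns are all hypothetical choices, and a real system would need a more careful check than a pair of regular expressions.

```python
import re

# Hypothetical cap: very long text is more likely a raw transcript than a fact.
MAX_FACT_CHARS = 200

# Hypothetical patterns for content that should never be persisted.
SECRET_PATTERNS = [
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # card-number-like digit runs
    re.compile(r"(?i)\b(password|api[_-]?key|secret|token)\b"),
]


def should_store(fact: str) -> bool:
    """Return True only for short facts with no secret-looking content."""
    if len(fact) > MAX_FACT_CHARS:
        return False
    return not any(p.search(fact) for p in SECRET_PATTERNS)
```

A gate like this will produce false positives (a note about a "token budget" would be rejected), which is usually the safer direction to err in for a persistent store.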

Retrieval at run time

Storing memory is half the system. The other half is getting the right piece back at the right moment. Two retrieval methods cover most needs, and good agents use both.

Keyed retrieval fetches a specific record by an exact identifier. When the agent knows the user ID, the order number, or the project key, it looks up that record directly. This is fast, exact, and cheap. It is the right method whenever the entity is known: load this user's preference profile, fetch this account's settings. There is no guessing involved.
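A keyed lookup can be as plain as a primary-key query. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `memories` table, its columns, and the helper names `load_profile` and `get_memory` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories ("
    " user_id TEXT NOT NULL,"
    " fact_key TEXT NOT NULL,"
    " value TEXT NOT NULL,"
    " PRIMARY KEY (user_id, fact_key))"
)
conn.executemany(
    "INSERT INTO memories (user_id, fact_key, value) VALUES (?, ?, ?)",
    [
        ("u1", "report_format", "pdf"),
        ("u1", "timezone", "Europe/Berlin"),
        ("u2", "report_format", "csv"),
    ],
)


def load_profile(conn, user_id):
    """Fetch every stored fact for one known user as a dict."""
    rows = conn.execute(
        "SELECT fact_key, value FROM memories WHERE user_id = ?",
        (user_id,),
    ).fetchall()
    return dict(rows)


def get_memory(conn, user_id, fact_key):
    """Fetch one fact by its exact key, or None if it is not stored."""
    row = conn.execute(
        "SELECT value FROM memories WHERE user_id = ? AND fact_key = ?",
        (user_id, fact_key),
    ).fetchone()
    return row[0] if row else None
```

Because the lookup goes through the primary key, the result is either the exact record or nothing, which is what makes this method predictable.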

Semantic retrieval handles the case where the agent does not have an exact key, only a need. It embeds the current task as a vector, compares it against the embedded memories in the store, and returns the closest matches by meaning. This is how an agent surfaces "we discussed a similar issue before" without anyone holding the precise reference. It is fuzzy by design, which is its strength and its risk: it finds related context, but it can also return near-misses, so you cap how many results feed in and rank by relevance.
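The embed, compare, and cap steps can be sketched without any external service. Note the main assumption: `embed` below is a toy word-count vector standing in for a real embedding model, so it only matches memories that share words with the query, whereas a real model would also match paraphrases. The names `semantic_search`, `top_k`, and `min_score` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(query, memories, top_k=3, min_score=0.1):
    """Return up to top_k (score, text) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(m)), m) for m in memories]
    # Drop near-misses below the threshold, then rank by relevance.
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


memories = [
    "Invoice export failed because the CSV had a bad date column",
    "User prefers weekly summary reports as PDF",
    "Project Atlas deadline moved to March",
]
hits = semantic_search("CSV export is failing again", memories, top_k=2)
```

Here only the first memory shares any words with the query, so it is the single hit; the other two score zero and are filtered out by `min_score` rather than padded into the result.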

The practical pattern is to run keyed lookups for known entities first, then a semantic pass to surface anything related, then merge, deduplicate, and inject only the top handful into the working layer. Retrieving five sharp memories beats retrieving fifty vague ones, every time. The goal of retrieval is never to load the whole store; it is to choose the smallest set that makes the current task succeed.
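The merge step of that pattern might look like the following sketch. It assumes keyed results arrive as plain strings and semantic results as `(score, text)` pairs; `select_memories` and its `limit` parameter are hypothetical names, and the case-insensitive duplicate check is one simple choice among many.

```python
def select_memories(keyed, semantic, limit=5):
    """Merge keyed and semantic hits, drop duplicates, keep the top few.

    keyed:    list of strings from exact lookups (placed first)
    semantic: list of (score, text) pairs from a similarity search
    """
    selected = []
    seen = set()

    def add(text):
        normalized = text.strip().lower()
        if normalized not in seen:
            seen.add(normalized)
            selected.append(text)

    # Exact lookups are the most trustworthy, so they go in first.
    for text in keyed:
        add(text)
    # Semantic hits follow, highest score first.
    for _score, text in sorted(semantic, key=lambda pair: pair[0], reverse=True):
        add(text)
    return selected[:limit]


keyed_hits = ["Prefers PDF reports"]
semantic_hits = [
    (0.4, "prefers pdf reports"),
    (0.9, "Export failed on bad date column"),
    (0.2, "Atlas deadline moved"),
]
top = select_memories(keyed_hits, semantic_hits, limit=2)
```

With `limit=2`, `top` contains the keyed preference and the highest-scoring semantic hit; the lowercase duplicate of the preference is dropped.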

Memory and the context window

Memory and the context window are easy to confuse and important to separate. The context window is the fixed budget of text the model can read on a single run. Persistent memory is the larger store that sits outside it. Memory exists because the window is finite. If models could read an unlimited history at once, you would need far less of a memory system. They cannot, so you keep durable facts in the store and pull only the relevant few into the window per task.

This reframes a tempting mistake. Faced with a long history, the naive move is to stuff all of it into the prompt. That fills the window with low-value text, raises cost, and degrades quality, because models attend less reliably as the window fills with material that is not relevant to the task. The discipline of retrieving a small relevant slice is the same discipline behind good context window management: the window is scarce, so spend it on what the current step needs and keep the rest in storage.

A clean mental model: long-term memory is where information rests, and the context window is where it works. Retrieval is the bridge that moves the right piece across, briefly, for one run. After the run, the working copy is discarded and only deliberate writes go back to the store.
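Treating the window as a budget can be made explicit when assembling the prompt. In this sketch `count_tokens` is a crude whitespace word count standing in for a real tokenizer, and `build_context` and `budget` are hypothetical names; the memories are assumed to be already ranked, most relevant first.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def build_context(task: str, ranked_memories, budget: int) -> str:
    """Add memories in rank order until the budget would be exceeded."""
    parts = [f"Task: {task}"]
    used = count_tokens(parts[0])
    for memory in ranked_memories:
        line = f"Memory: {memory}"
        cost = count_tokens(line)
        if used + cost > budget:
            break  # the list is ranked, so the rest is less relevant
        parts.append(line)
        used += cost
    return "\n".join(parts)


context = build_context(
    "Draft the weekly report",
    ["User prefers PDF", "Reports go out on Fridays"],
    budget=10,
)
```

With a budget of 10, the task line and the first memory fit (9 words) and the second memory is left in storage, so `context` holds two lines.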

Keeping memory accurate, not stale

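A memory store that only grows is a liability. Old facts that were once true become wrong, contradictions accumulate, and retrieval starts surfacing outdated context. Treat memory as mutable, not append-only. Four practices keep it current: update or replace a record when a fact changes rather than appending a contradictory one, timestamp entries and prefer the most recent on conflict, expire records that have a natural shelf life, and periodically review or summarize the store so superseded facts do not crowd out current ones.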

The principle behind all four is that memory should reflect what is true now, not everything that was ever true. Accuracy beats completeness. A store of fifty current facts serves the agent better than a store of five thousand where half are obsolete.
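Update-in-place, timestamps, and expiry can be sketched with `sqlite3` from the standard library. The schema and the helper names `remember`, `recall`, and `purge_expired` are assumptions for illustration, and the current time is passed in as `now` so the behavior is easy to follow. This version always lets the newest write win; it does not compare timestamps before replacing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories ("
    " user_id TEXT NOT NULL,"
    " fact_key TEXT NOT NULL,"
    " value TEXT NOT NULL,"
    " updated_at REAL NOT NULL,"
    " expires_at REAL,"
    " PRIMARY KEY (user_id, fact_key))"
)


def remember(conn, user_id, fact_key, value, now, ttl_seconds=None):
    """Write a fact, replacing any older version with the same key."""
    expires_at = now + ttl_seconds if ttl_seconds is not None else None
    conn.execute(
        "INSERT OR REPLACE INTO memories VALUES (?, ?, ?, ?, ?)",
        (user_id, fact_key, value, now, expires_at),
    )


def recall(conn, user_id, fact_key, now):
    """Read a fact, ignoring records whose shelf life has passed."""
    row = conn.execute(
        "SELECT value FROM memories"
        " WHERE user_id = ? AND fact_key = ?"
        " AND (expires_at IS NULL OR expires_at > ?)",
        (user_id, fact_key, now),
    ).fetchone()
    return row[0] if row else None


def purge_expired(conn, now):
    """Delete expired records and return how many were removed."""
    cur = conn.execute(
        "DELETE FROM memories WHERE expires_at IS NOT NULL AND expires_at <= ?",
        (now,),
    )
    return cur.rowcount


remember(conn, "u1", "report_format", "pdf", now=1000.0)
remember(conn, "u1", "report_format", "csv", now=2000.0)  # replaces "pdf"
remember(conn, "u1", "active_promo", "SPRING", now=2000.0, ttl_seconds=60)
```

After these writes the store holds one `report_format` record, not two, and `active_promo` stops being returned once `now` passes 2060.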

Privacy and scoping

Because memory persists, it deserves stricter handling than transient working state. The core rule is scoping: every stored memory belongs to a clearly defined boundary, and retrieval never crosses it.

Scope by owner. One user's memories must never surface for another user. In multi-tenant settings, scope by account or organization so retrieval is filtered to the requester's data before any semantic match runs. A scoping bug here is a data leak, not a quality issue, so it is worth enforcing at the storage layer rather than trusting the retrieval step to behave.
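Filtering by owner before any similarity ranking can be sketched as follows. The `Memory` class, `scoped_search`, and the word-overlap score are hypothetical stand-ins; a production store would enforce the owner filter in the database or index query itself rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    owner_id: str
    text: str


def overlap(query: str, text: str) -> float:
    # Toy relevance score: share of query words that appear in the text.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def scoped_search(store, owner_id, query, top_k=3):
    """Rank only the requester's own memories; others are never scored."""
    candidates = [m for m in store if m.owner_id == owner_id]
    ranked = sorted(candidates, key=lambda m: overlap(query, m.text), reverse=True)
    return [m.text for m in ranked[:top_k]]


store = [
    Memory("alice", "billing contact is finance@alice.example"),
    Memory("bob", "billing contact is ap@bob.example"),
]
alice_hits = scoped_search(store, "alice", "billing contact")
```

Both records match the query equally well, but only Alice's is ever a candidate, so a relevance bug cannot turn into a cross-user leak.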

Scope by purpose. Memory gathered for one task should not silently power an unrelated one. If a user shares a fact for a billing workflow, reusing it elsewhere without a clear basis erodes trust. Keeping scopes explicit makes the agent's behavior predictable and auditable.

Minimize and respect deletion. Store the least that does the job, and make stored memory deletable on request. When a user asks to be forgotten, the persistent store is exactly where that data lives, so deletion has to reach it. Designing for deletion from the start is far easier than retrofitting it. Pairing memory scoping with broader agent guardrails keeps both the data and the actions inside the boundaries you intend.
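Purpose scoping and deletion on request can both be expressed as simple constraints on the store. The sketch below again uses `sqlite3`; the `purpose` column and the helpers `recall_for_purpose` and `forget_user` are assumptions, and a real deployment would also need to clear any vector index, cache, or backup that holds copies of the same data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories ("
    " user_id TEXT NOT NULL,"
    " purpose TEXT NOT NULL,"
    " fact TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO memories (user_id, purpose, fact) VALUES (?, ?, ?)",
    [
        ("u1", "billing", "Invoices go to the finance contact"),
        ("u1", "reporting", "Prefers PDF reports"),
        ("u2", "billing", "Pays by bank transfer"),
    ],
)


def recall_for_purpose(conn, user_id, purpose):
    """Return only facts stored for this user and this purpose."""
    rows = conn.execute(
        "SELECT fact FROM memories WHERE user_id = ? AND purpose = ?",
        (user_id, purpose),
    ).fetchall()
    return [row[0] for row in rows]


def forget_user(conn, user_id):
    """Delete everything stored for one user; return the rows removed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("DELETE FROM memories WHERE user_id = ?", (user_id,))
    return cur.rowcount
```

A reporting task that calls `recall_for_purpose(conn, "u1", "reporting")` never sees the billing fact, and `forget_user(conn, "u1")` removes both of that user's records while leaving other users untouched.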

A build order that works

Putting the pieces together, a reliable build order avoids the trap of over-engineering memory before the agent does anything useful.

  1. Start with working memory only. Get the agent doing the task within a single run, holding everything it needs in the context window. Many useful agents never need more than this.
  2. Add a small persistent store for one clear need. Pick the single most valuable thing to remember across runs, often user preferences or a few durable identifiers, and store just that.
  3. Add keyed retrieval first. Fetch records by exact identifier. It is simple, exact, and covers most recall needs without any vector infrastructure.
  4. Add semantic retrieval when fuzzy recall matters. Introduce embeddings and a vector store only once you have a genuine need to find related context without a precise key.
  5. Add the accuracy and scoping rules. Timestamps, update-in-place, expiry, and owner scoping turn a working prototype into something safe to run on real data.

Building in this order means the agent is delivering value at step one and you add memory complexity only where it earns its place. For the broader setup arc, setting up your first AI agent covers the surrounding workflow, and the glossary defines the terms used here. If cost is a concern as memory grows the prompt, optimizing an agent prompt for cost pairs naturally with tight retrieval.

How Gravity handles agent memory

Gravity is an AI agent platform. The two-layer memory design described here is built into the expert-built agents that run on it, so you do not assemble vector stores, write retrieval code, or manage expiry rules yourself. You describe what you want remembered in plain words, "remember my preferred report format and which clients I work with," and the agent keeps those durable facts across runs while pulling only the relevant ones into each task.

Working memory is handled within each run, persistent memory is scoped to your account so nothing crosses between users, and stored facts stay editable and deletable on request. You prompt and run an agent in about 60 seconds; the memory layer is part of the service, not something you build and maintain. Pay per use: $1 equals 1,000 credits, and you only pay when the agent runs.

For builders creating agents for Gravity, the same memory model applies: define what an agent should remember, scope it correctly, and keep it accurate, and the platform handles the storage and retrieval mechanics underneath.

FAQ

What is the difference between short-term and long-term agent memory?

Short-term, or working, memory is what the agent holds inside its context window during a single run: the current task, recent steps, and tool outputs. It disappears when the run ends. Long-term, or persistent, memory is stored outside the model in a database or vector store and survives across runs, so the agent can recall a fact from last week. You build the two layers separately and connect them through retrieval.

What should an AI agent store in long-term memory?

Store durable facts that change the agent's behavior on future runs: stable user preferences, account and project identifiers, decisions and their reasons, and summaries of past outcomes. Do not store raw transcripts, secrets, payment details, or anything you would not want surfaced later. A smaller store of high-value facts retrieves more accurately than a large store of noise.

How does an agent retrieve the right memory at run time?

Two methods cover most cases. Keyed lookup fetches a specific record by an exact identifier, such as a user ID or order number, and is fast and exact. Semantic retrieval embeds the current task and finds stored memories with similar meaning, which handles fuzzy recall. Most agents use keyed lookup for known entities and semantic retrieval to surface related context, then inject only the top matches into the context window.

How is agent memory related to the context window?

The context window is the fixed amount of text the model can read on a single run. Memory is the larger store that lives outside it. Memory exists precisely because the window is finite: you keep durable facts in the store and pull only the relevant few into the window when a task needs them. Loading everything into the window is the mistake memory is designed to prevent.

How do you keep agent memory from going stale?

Treat memory as mutable. When a fact changes, update or replace the record rather than appending a new one, so the agent does not hold two contradictory versions. Timestamp entries, prefer the most recent on conflict, and expire records that have a natural shelf life. Periodically review or summarize the store so old, superseded facts do not crowd out current ones during retrieval.