AI-Agent 白皮书 3 - Context Engineering: Sessions & Memory

Context Engineering: Sessions, Memory

上下文工程:会话与记忆

Authors: Kimberly Milam and Antonio Gulli

作者:Kimberly Milam 和 Antonio Gulli

Introduction

简介

This whitepaper explores the critical role of Sessions and Memory in building stateful, intelligent LLM agents to empower developers to create more powerful, personalized, and persistent AI experiences. To enable Large Language Models (LLMs) to remember, learn, and personalize interactions, developers must dynamically assemble and manage information within their context window—a process known as Context Engineering.

本白皮书探讨了会话(Sessions)和记忆(Memory)在构建有状态、智能的 LLM 智能体中的关键作用,旨在帮助开发者创建更强大、更个性化、更持久的 AI 体验。为了使大型语言模型(LLMs)能够记住、学习并个性化交互,开发者必须在其上下文窗口中动态组装和管理信息——这一过程被称为上下文工程(Context Engineering)。

These core concepts are summarized below:

本白皮书总结了以下核心概念:

• Context Engineering: The process of dynamically assembling and managing information within an LLM’s context window to enable stateful, intelligent agents.

• 上下文工程(Context Engineering): 在 LLM 的上下文窗口中动态组装和管理信息的过程,以实现有状态的智能体。

• Sessions: The container for an entire conversation with an agent, holding the chronological history of the dialogue and the agent’s working memory.

• 会话(Sessions): 与智能体进行完整对话的容器,保存对话的时间顺序历史记录和智能体的工作记忆。

• Memory: The mechanism for long-term persistence, capturing and consolidating key information across multiple sessions to provide a continuous and personalized experience for LLM agents.

• 记忆(Memory): 长期持久化的机制,跨多个会话捕获和整合关键信息,为 LLM 智能体提供连续且个性化的体验。

Context Engineering

上下文工程

LLMs are inherently stateless. Outside of their training data, their reasoning and awareness are confined to the information provided within the “context window” of a single API call. This presents a fundamental problem, as AI agents must be equipped with operating instructions identifying what actions can be taken, the evidential and factual data to reason over, and the immediate conversational information that defines the current task. To build stateful, intelligent agents that can remember, learn, and personalize interactions, developers must construct this context for every turn of a conversation. This dynamic assembly and management of information for an LLM is known as Context Engineering.

LLM 本质上是无状态的。除了其训练数据之外,它们的推理和感知仅限于单次 API 调用的”上下文窗口”中提供的信息。这带来了一个根本性问题,因为 AI 智能体必须配备操作指令(说明可以采取哪些行动)、用于推理的证据和事实数据,以及定义当前任务的即时对话信息。为了构建能够记忆、学习和个性化交互的有状态智能体,开发者必须为对话的每一轮构建这些上下文。这种为 LLM 动态组装和管理信息的过程被称为上下文工程。

Context Engineering represents an evolution from traditional Prompt Engineering. Prompt engineering focuses on crafting optimal, often static, system instructions. Conversely, Context Engineering addresses the entire payload, dynamically constructing a state-aware prompt based on the user, conversation history, and external data. It involves strategically selecting, summarizing, and injecting different types of information to maximize relevance while minimizing noise. External systems—such as RAG databases, session stores, and memory managers—manage much of this context. The agent framework must orchestrate these systems to retrieve and assemble context into the final prompt.

上下文工程代表了从传统提示工程(Prompt Engineering)的演进。提示工程专注于制作最优的、通常是静态的系统指令。相反,上下文工程关注整个负载,根据用户、对话历史和外部数据动态构建状态感知的提示。它涉及战略性地选择、总结和注入不同类型的信息,以最大化相关性同时最小化噪音。外部系统——如 RAG 数据库、会话存储和记忆管理器——管理大部分上下文。智能体框架必须协调这些系统以检索和组装上下文到最终提示中。

Think of Context Engineering as the mise en place for an agent—the crucial step where a chef gathers and prepares all their ingredients before cooking. If you only give a chef the recipe (the prompt), they might produce an okay meal with whatever random ingredients they have. However, if you first ensure they have all the right, high-quality ingredients, specialized tools, and a clear understanding of the presentation style, they can reliably produce an excellent, customized result. The goal of context engineering is to ensure the model has no more and no less than the most relevant information to complete its task.

可以将上下文工程想象成智能体的 mise en place(法语:就位准备)——这是厨师在烹饪前收集和准备所有食材的关键步骤。如果你只给厨师食谱(提示),他们可能会用手边的随机食材做出一顿还可以的饭。然而,如果你首先确保他们拥有所有正确的、高质量的食材、专业工具,以及对呈现风格的清晰理解,他们就能可靠地制作出优秀的定制化成果。上下文工程的目标是确保模型拥有完成任务所需的最相关信息——不多也不少。

Context Engineering governs the assembly of a complex payload that can include a variety of components:

上下文工程管理着复杂负载的组装,可以包含多种组件:

• Context to guide reasoning defines the agent’s fundamental reasoning patterns and available actions, dictating its behavior:

• 引导推理的上下文 定义智能体的基本推理模式和可用操作,决定其行为:

• System Instructions: High-level directives defining the agent’s persona, capabilities, and constraints.

• 系统指令(System Instructions): 定义智能体角色、能力和约束的高层指令。

• Tool Definitions: Schemas for APIs or functions the agent can use to interact with the outside world.

• 工具定义(Tool Definitions): 智能体可用于与外部世界交互的 API 或函数的模式定义。

• Few-Shot Examples: Curated examples that guide the model’s reasoning process via in-context learning.

• 少样本示例(Few-Shot Examples): 通过上下文学习引导模型推理过程的精选示例。

• Evidential & Factual Data is the substantive data the agent reasons over, including pre-existing knowledge and dynamically retrieved information for the specific task; it serves as the ‘evidence’ for the agent’s response:

• 证据与事实数据 是智能体进行推理的实质性数据,包括预先存在的知识和为特定任务动态检索的信息;它作为智能体响应的”证据”:

• Long-Term Memory: Persisted knowledge about the user or topic, gathered across multiple sessions.

• 长期记忆(Long-Term Memory): 跨多个会话收集的关于用户或主题的持久化知识。

• External Knowledge: Information retrieved from databases or documents, often using Retrieval-Augmented Generation (RAG)1.

• 外部知识(External Knowledge): 从数据库或文档检索的信息,通常使用检索增强生成(RAG)¹。

• Tool Outputs: The data or results returned by a tool.

• 工具输出(Tool Outputs): 工具返回的数据或结果。

• Sub-Agent Outputs: The conclusions or results returned by specialized agents that have been delegated a specific sub-task.

• 子智能体输出(Sub-Agent Outputs): 被委派特定子任务的专门智能体返回的结论或结果。

• Artifacts: Non-textual data (e.g., files, images) associated with the user or session.

• 产物(Artifacts): 与用户或会话相关的非文本数据(如文件、图像)。

• Immediate conversational information grounds the agent in the current interaction, defining the immediate task:

• 即时对话信息 将智能体锚定在当前交互中,定义即时任务:

• Conversation History: The turn-by-turn record of the current interaction.

• 对话历史(Conversation History): 当前交互的逐轮记录。

• State / Scratchpad: Temporary, in-progress information or calculations the agent uses for its immediate reasoning process.

• 状态/草稿本(State / Scratchpad): 智能体用于即时推理过程的临时、进行中的信息或计算。

• User’s Prompt: The immediate query to be addressed.

• 用户提示(User’s Prompt): 需要处理的即时查询。

The dynamic construction of context is critical. Memories, for instance, are not static; they must be selectively retrieved and updated as the user interacts with the agent or new data is ingested. Additionally, effective reasoning often relies on in-context learning2 (a process where the LLM learns how to perform tasks from demonstrations in the prompt). In-context learning can be more effective when the agent uses few-shot examples that are relevant to the current task, rather than relying on hardcoded ones. Similarly, external knowledge is retrieved by RAG tools based on the user’s immediate query.

上下文的动态构建至关重要。例如,记忆不是静态的;它们必须在用户与智能体交互或摄入新数据时被选择性地检索和更新。此外,有效的推理通常依赖于上下文学习²(LLM 从提示中的示例学习如何执行任务的过程)。当智能体使用与当前任务相关的少样本示例而非依赖硬编码示例时,上下文学习会更加有效。同样,外部知识是由 RAG 工具根据用户的即时查询检索的。

One of the most critical challenges in building a context-aware agent is managing an ever-growing conversation history. In theory, models with large context windows can handle extensive transcripts; in practice, as the context grows, cost and latency increase. Additionally, models can suffer from “context rot,” a phenomenon where their ability to pay attention to critical information diminishes as context grows. Context Engineering directly addresses this by employing strategies to dynamically mutate the history—such as summarization, selective pruning, or other compaction techniques—to preserve vital information while managing the overall token count, ultimately leading to more robust and personalized AI experiences.

构建上下文感知智能体的最关键挑战之一是管理不断增长的对话历史。理论上,具有大上下文窗口的模型可以处理大量对话记录;但实际上,随着上下文增长,成本和延迟也会增加。此外,模型可能遭受”上下文腐烂(context rot)“——这是一种随着上下文增长,模型关注关键信息的能力下降的现象。上下文工程通过采用动态改变历史的策略来直接解决这个问题——如摘要、选择性修剪或其他压缩技术——以在管理整体 token 数量的同时保留关键信息,最终带来更健壮和个性化的 AI 体验。

This practice manifests as a continuous cycle within the agent’s operational loop for each turn of a conversation:

这种实践表现为智能体操作循环中每轮对话的连续循环:

![][image1]Figure 1. Flow of context management for agents

图 1. 智能体的上下文管理流程

1. Fetch Context: The agent begins by retrieving context—such as user memories, RAG documents, and recent conversation events. For dynamic context retrieval, the agent will use the user query and other metadata to identify what information to retrieve.

1. 获取上下文: 智能体首先检索上下文——如用户记忆、RAG 文档和近期对话事件。对于动态上下文检索,智能体将使用用户查询和其他元数据来识别要检索的信息。

2. Prepare Context: The agent framework dynamically constructs the full prompt for the LLM call. Although individual API calls may be asynchronous, preparing the context is a blocking, “hot-path” process. The agent cannot proceed until the context is ready.

2. 准备上下文: 智能体框架动态构建用于 LLM 调用的完整提示。虽然单个 API 调用可能是异步的,但准备上下文是一个阻塞的”热路径”过程。智能体在上下文准备好之前无法继续。

3. Invoke LLM and Tools: The agent iteratively calls the LLM and any necessary tools until a final response for the user is generated. Tool and model output is appended to the context.

3. 调用 LLM 和工具: 智能体迭代调用 LLM 和任何必要的工具,直到为用户生成最终响应。工具和模型输出被附加到上下文中。

4. Upload Context: New information gathered during the turn is uploaded to persistent storage. This is often a “background” process, allowing the agent to complete execution while memory consolidation or other post-processing occurs asynchronously.

4. 上传上下文: 在该轮中收集的新信息被上传到持久存储。这通常是一个”后台”过程,允许智能体完成执行,同时记忆整合或其他后处理异步进行。
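The four-step cycle above can be sketched as a single turn function. This is a minimal illustration, not a real framework API: the helper names (`build_prompt`, `call_llm`, `MemoryStore`) are hypothetical stand-ins, and in production step 4 would run asynchronously in the background.

以下是上述四步循环的一个最小示意(辅助函数名均为假设的占位,生产环境中第 4 步应异步执行):

```python
# Illustrative stand-ins for real components; all names are hypothetical.
def build_prompt(memories, history, query):
    # 2. Prepare context: assemble the full payload (blocking, hot-path).
    return {"memories": memories, "history": list(history), "query": query}

def call_llm(prompt):
    # 3. Stand-in for the model call; a real agent would also loop over tools.
    return f"echo: {prompt['query']}"

class MemoryStore:
    """Toy memory manager with retrieve/consolidate operations."""
    def __init__(self):
        self.memories = ["user prefers metric units"]
        self.consolidated_turns = 0

    def retrieve(self, query):
        # 1. Fetch context relevant to the query.
        return self.memories

    def consolidate(self, session):
        # 4. Upload context; in production this runs in the background.
        self.consolidated_turns += 1

def run_turn(user_query, session, memory_store):
    memories = memory_store.retrieve(user_query)                    # 1. Fetch
    prompt = build_prompt(memories, session["events"], user_query)  # 2. Prepare
    response = call_llm(prompt)                                     # 3. Invoke
    session["events"].append({"role": "user", "text": user_query})
    session["events"].append({"role": "model", "text": response})
    memory_store.consolidate(session)                               # 4. Upload
    return response

session = {"events": []}
store = MemoryStore()
result = run_turn("What's the weather in Paris?", session, store)
```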

At the heart of this lifecycle are two fundamental components: sessions and memory. A session manages the turn-by-turn state of a single conversation. Memory, in contrast, provides the mechanism for long-term persistence, capturing and consolidating key information across multiple sessions.

这个生命周期的核心是两个基本组件:会话与记忆。会话管理单次对话的逐轮状态。相比之下,记忆提供长期持久化的机制,跨多个会话捕获和整合关键信息。

You can think of a session as the workbench or desk you’re using for a specific project. While you’re working, it’s covered in all the necessary tools, notes, and reference materials. Everything is immediately accessible but also temporary and specific to the task at hand. Once the project is finished, you don’t just shove the entire messy desk into storage. Instead, you begin the process of creating memory, which is like an organized filing cabinet. You review the materials on the desk, discard the rough drafts and redundant notes, and file away only the most critical, finalized documents into labeled folders. This ensures the filing cabinet remains a clean, reliable, and efficient source of truth for all future projects, without being cluttered by the transient chaos of the workbench. This analogy directly mirrors how an effective agent operates: the session serves as the temporary workbench for a single conversation, while the agent’s memory is the meticulously organized filing cabinet, allowing it to recall key information during future interactions.

你可以将会话想象成你为特定项目使用的工作台或书桌。在你工作时,它上面铺满了所有必要的工具、笔记和参考材料。一切都可以即时访问,但也是临时的,专门针对手头的任务。一旦项目完成,你不会直接把整个凌乱的桌子塞进存储室。相反,你开始创建记忆的过程,这就像一个有组织的文件柜。你检查桌上的材料,丢弃草稿和冗余笔记,只将最关键的、最终确定的文档归档到标记好的文件夹中。这确保文件柜保持干净、可靠和高效,成为所有未来项目的真实信息来源,而不会被工作台的临时混乱所堆满。这个类比直接反映了有效智能体的运作方式:会话作为单次对话的临时工作台,而智能体的记忆是精心组织的文件柜,使其能够在未来的交互中回忆关键信息。

Building on this high-level overview of context engineering, we can now explore two core components: sessions and memory, beginning with sessions.

基于上下文工程的这一高层概述,我们现在可以探索两个核心组件:会话和记忆,从会话开始。

Sessions

会话

A foundational element of Context Engineering is the session, which encapsulates the immediate dialogue history and working memory for a single, continuous conversation. Each session is a self-contained record that is tied to a specific user. The session allows the agent to maintain context and provide coherent responses within the bounds of a single conversation. A user can have multiple sessions, but each one functions as a distinct, disconnected log of a specific interaction. Every session contains two key components: the chronological history (events) and the agent’s working memory (state).

上下文工程的基础元素是会话,它封装了单次连续对话的即时对话历史和工作记忆。每个会话都是与特定用户关联的自包含记录。会话使智能体能够在单次对话的范围内维护上下文并提供连贯的响应。用户可以拥有多个会话,但每个会话都作为特定交互的独立、断开的日志。每个会话包含两个关键组件:时间顺序历史(事件)和智能体的工作记忆(状态)。

Events are the building blocks of the conversation. Common types of events include: user input (a message from the user, such as text, audio, or an image), agent response (the agent’s reply to the user), tool call (the agent’s decision to use an external tool or API), and tool output (the data returned from a tool call, which the agent uses to continue its reasoning).

事件是对话的构建块。常见的事件类型包括:用户输入(来自用户的消息,如文本、音频、图像等)、智能体响应(智能体对用户的回复)、工具调用(智能体决定使用外部工具或 API)以及工具输出(从工具调用返回的数据,智能体使用它继续推理)。

Beyond the chat history, a Session often includes a state—a structured “working memory” or scratchpad. This holds temporary, structured data relevant to the current conversation, like what items are in a shopping cart.

除了聊天历史,会话通常还包括一个状态——结构化的”工作记忆”或草稿本。它保存与当前对话相关的临时结构化数据,比如购物车中有哪些商品。

As the conversation progresses, the agent will append additional events to the session. Additionally, it may mutate the state based on logic in the agent.

随着对话的进行,智能体会将额外的事件附加到会话中。此外,它可能根据智能体中的逻辑改变状态。
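As a concrete sketch using plain dictionaries (no particular framework; field names are illustrative), a session pairs an append-only event log with a mutable state scratchpad:

以下是一个用纯字典表示的会话示意(不依赖任何特定框架,字段名仅作说明):

```python
# A minimal session: an append-only event log plus a mutable "state"
# scratchpad. Field names are illustrative, not from any framework.
session = {
    "user_id": "user-123",
    "events": [],   # chronological history of the conversation
    "state": {},    # working memory, e.g. a shopping cart
}

def add_event(session, author, content):
    """Append one event: user input, agent response, tool call, or tool output."""
    session["events"].append({"author": author, "content": content})

add_event(session, "user", "Add two espresso cups to my cart")
add_event(session, "agent", "Done! Your cart has 2 espresso cups.")

# Agent logic may mutate the state alongside the event log.
session["state"]["cart"] = {"espresso_cup": 2}
```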

The structure of the events is analogous to the list of Content objects passed to the Gemini API, where each item with a role and parts represents one turn—or one Event—in the conversation.

事件的结构类似于传递给 Gemini API 的 Content 对象列表,其中每个具有 role 和 parts 的项目代表对话中的一轮——或一个事件。

Python

contents = [
    {
        "role": "user",
        "parts": [{"text": "What is the capital of France?"}]
    },
    {
        "role": "model",
        "parts": [{"text": "The capital of France is Paris."}]
    }
]

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=contents
)

Snippet 1: Example multi-turn call to Gemini

代码片段 1:Gemini 多轮调用示例

A production agent’s execution environment is typically stateless, meaning it retains no information after a request is completed. Consequently, its conversation history must be saved to persistent storage to maintain a continuous user experience. While in-memory storage is suitable for development, production applications should leverage robust databases to reliably store and manage sessions. For example, you can store conversation history in managed solutions like Agent Engine Sessions3.

生产环境中智能体的执行环境通常是无状态的,意味着在请求完成后它不会保留任何信息。因此,其对话历史必须保存到持久存储以维持连续的用户体验。虽然内存存储适合开发,但生产应用应利用健壮的数据库来可靠地存储和管理会话。例如,你可以将对话历史存储在像 Agent Engine Sessions³ 这样的托管解决方案中。
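As a sketch of this separation, a session store can expose a small save/load interface. The in-memory version below is only suitable for development; production would back the same interface with a database or a managed service. The class and method names here are illustrative, not a real API.

以下示意了一个最小的会话存储接口(类名和方法名均为假设;内存实现仅适合开发环境,生产环境应以数据库或托管服务实现同一接口):

```python
# A minimal session-store interface. The in-memory implementation below is
# for development only; production would use durable storage behind the
# same interface. Names are illustrative.
class InMemorySessionStore:
    def __init__(self):
        self._sessions = {}

    def save(self, session_id, session):
        self._sessions[session_id] = session

    def load(self, session_id):
        # A stateless agent runtime calls this at the start of every turn.
        return self._sessions.get(session_id, {"events": [], "state": {}})

store = InMemorySessionStore()
store.save("s1", {"events": [{"author": "user", "content": "hi"}], "state": {}})
restored = store.load("s1")
```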

Variance across frameworks and models

不同框架和模型间的差异

While the core ideas are similar, different agent frameworks implement sessions, events, and state in distinct ways. Agent frameworks are responsible for maintaining the conversation history and state for LLMs, building LLM requests using this context, and parsing and storing the LLM response.

虽然核心理念相似,但不同的智能体框架以不同的方式实现会话、事件和状态。智能体框架负责维护 LLM 的对话历史和状态,使用此上下文构建 LLM 请求,以及解析和存储 LLM 响应。

Agent frameworks act as a universal translator between your code and an LLM. While you, the developer, work with the framework’s consistent, internal data structures for each conversational turn, the framework handles the critical task of converting those structures into the precise format the LLM requires. This abstraction is powerful because it decouples your agent’s logic from the specific LLM you’re using, preventing vendor lock-in.

智能体框架充当你的代码和 LLM 之间的通用翻译器。当你作为开发者使用框架一致的内部数据结构处理每个对话轮次时,框架处理将这些结构转换为 LLM 所需精确格式的关键任务。这种抽象很强大,因为它将你的智能体逻辑与你使用的特定 LLM 解耦,防止供应商锁定。

![][image2]Figure 2: Flow of context management for agents

图 2:智能体的上下文管理流程

Ultimately, the goal is to produce a “request” that the LLM can understand. For Google’s Gemini models, this is a List[Content]. Each Content object is a simple dictionary-like structure containing two keys: role which defines who is speaking (“user” or “model”) and parts which defines the actual content of the message (text, images, tool calls, etc.).

最终目标是生成 LLM 可以理解的”请求”。对于 Google 的 Gemini 模型,这是一个 List[Content]。每个 Content 对象是一个简单的类似字典的结构,包含两个键:role 定义谁在说话(”user” 或 “model”),parts 定义消息的实际内容(文本、图像、工具调用等)。

The framework automatically handles mapping the data from its internal object (e.g., an ADK Event) to the corresponding role and parts in the Content object before making the API call. In essence, the framework provides a stable, internal API for the developer, while managing the complex and varied external APIs of the different LLMs behind the scenes.

框架在进行 API 调用之前自动处理将数据从其内部对象(例如 ADK Event)映射到 Content 对象中相应的 role 和 parts。本质上,框架为开发者提供稳定的内部 API,同时在幕后管理不同 LLM 的复杂且多样的外部 API。
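As a sketch of this translation step, a framework might map its internal events into the role/parts structure described above. The internal event fields (`author`, `text`) are hypothetical here; real frameworks use richer objects and also handle tool calls and other part types.

以下示意了这一转换步骤(内部事件字段 author、text 为假设;真实框架还需处理工具调用等其他类型):

```python
def to_gemini_contents(events):
    """Map hypothetical internal events to the role/parts shape Gemini expects."""
    role_map = {"user": "user", "agent": "model"}
    return [
        {"role": role_map[e["author"]], "parts": [{"text": e["text"]}]}
        for e in events
        if e["author"] in role_map  # tool events would need their own mapping
    ]

events = [
    {"author": "user", "text": "What is the capital of France?"},
    {"author": "agent", "text": "The capital of France is Paris."},
]
contents = to_gemini_contents(events)
```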

ADK uses an explicit Session object that contains a list of Event objects and a separate state object. The Session is like a filing cabinet, with one folder for the conversation history (events) and another for working memory (state).

ADK 使用显式的 Session 对象,其中包含 Event 对象列表和单独的状态对象。会话就像一个文件柜,一个文件夹用于对话历史(事件),另一个用于工作记忆(状态)。

LangGraph doesn’t have a formal “session” object. Instead, the state is the session. This all-encompassing state object holds the conversation history (as a list of Message objects) and all other working data. Unlike the append-only log of a traditional session, LangGraph’s state is mutable. It can be transformed, and strategies like history compaction can alter the record. This is useful for managing long conversations and token limits.

LangGraph 没有正式的“会话”对象。相反,状态就是会话。这个包罗万象的状态对象保存对话历史(作为 Message 对象列表)和所有其他工作数据。与传统会话的只追加日志不同,LangGraph 的状态是可变的。它可以被转换,历史压缩等策略可以改变记录。这对于管理长对话和 token 限制很有用。

Sessions for multi-agent systems

多智能体系统的会话

In a multi-agent system, multiple agents collaborate. Each agent focuses on a smaller, specialized task. For these agents to work together effectively, they must share information. As shown in the diagram below, the system’s architecture defines the communication patterns they use to share information. A central component of this architecture is how the system handles session history—the persistent log of all interactions.

在多智能体系统中,多个智能体协作。每个智能体专注于较小的专门任务。为了使这些智能体有效地协同工作,它们必须共享信息。如下图所示,系统的架构定义了它们用于共享信息的通信模式。该架构的核心组件是系统如何处理会话历史——所有交互的持久日志。

![][image3]
Figure 3: Different multi-agent architectural patterns30

图 3:不同的多智能体架构模式³⁰

Before exploring the architectural patterns for managing this history, it’s crucial to distinguish it from the context sent to an LLM. Think of the session history as the permanent, unabridged transcript of the entire conversation. The context, on the other hand, is the carefully crafted information payload sent to the LLM for a single turn. An agent might construct this context by selecting only a relevant excerpt from the history or by adding special formatting, like a guiding preamble, to steer the model’s response. This section focuses on what information is passed across agents, not necessarily what context is sent to the LLM.

在探索管理这些历史的架构模式之前,区分它与发送给 LLM 的上下文至关重要。将会话历史视为整个对话的永久、完整的记录。另一方面,上下文是为单轮精心制作的发送给 LLM 的信息负载。智能体可能通过只选择历史中的相关摘录或添加特殊格式(如引导性前言)来构建此上下文,以引导模型的响应。本节重点关注跨智能体传递的信息,而不一定是发送给 LLM 的上下文。

Agent frameworks handle session history for multi-agent systems using one of two primary approaches: a shared, unified history where all agents contribute to a single log, or separate, individual histories where each agent maintains its own perspective4. The choice between these two patterns depends on the nature of the task and the desired collaboration style between the agents.

智能体框架使用两种主要方法之一处理多智能体系统的会话历史:共享的统一历史(所有智能体贡献到单个日志),或单独的个体历史(每个智能体维护自己的视角)⁴。这两种模式之间的选择取决于任务的性质和智能体之间期望的协作风格。

For the shared, unified history model, all agents in the system read from and write all events to the same, single conversation history. Every agent’s message, tool call, and observation is appended to one central log in chronological order. This approach is best for tightly coupled, collaborative tasks requiring a single source of truth, such as a multi-step problem-solving process where one agent’s output is the direct input for the next. Even with a shared history, a sub-agent might process the log before passing it to the LLM. For instance, it could filter for a subset of relevant events or add labels to identify which agent generated each event.

对于共享的统一历史模型,系统中的所有智能体都从同一个对话历史中读取并写入所有事件。每个智能体的消息、工具调用和观察都按时间顺序附加到一个中央日志中。这种方法最适合紧密耦合的协作任务,需要单一的真实来源,例如多步骤问题解决过程,其中一个智能体的输出是下一个智能体的直接输入。即使使用共享历史,子智能体也可能在将日志传递给 LLM 之前对其进行处理。例如,它可以过滤相关事件的子集或添加标签以识别每个事件是由哪个智能体生成的。

If you use ADK’s LLM-driven delegation to handoff to sub-agents, all of the intermediary events of the sub-agent would be written to the same session as the root agent5:

如果你使用 ADK 的 LLM 驱动委派来移交给子智能体,子智能体的所有中间事件都将写入与根智能体相同的会话⁵:

Python

from google.adk.agents import LlmAgent

# The sub-agent has access to the Session and writes events to it.
# 子智能体可以访问 Session 并向其写入事件。
sub_agent_1 = LlmAgent(...)

# Optionally, the sub-agent can save the final response text (or structured
# output) to the specified state key.
# 可选地,子智能体可以将最终响应文本(或结构化输出)保存到指定的状态键。
sub_agent_2 = LlmAgent(
    ...,
    output_key="..."
)

# Parent agent.
# 父智能体。
root_agent = LlmAgent(
    ...,
    sub_agents=[sub_agent_1, sub_agent_2]
)

Snippet 2: LLM-driven delegation in ADK, where sub-agents write events to the root agent’s session

代码片段 2:ADK 中的 LLM 驱动委派,子智能体将事件写入根智能体的会话

In the separate, individual histories model, each agent maintains its own private conversation history and functions like a black box to other agents. All internal processes— such as intermediary thoughts, tool use, and reasoning steps—are kept within the agent’s private log and are not visible to others. Communication occurs only through explicit messages, where an agent shares its final output, not its process.

单独的个体历史模型中,每个智能体维护自己的私有对话历史,对其他智能体来说就像一个黑盒。所有内部过程——如中间思考、工具使用和推理步骤——都保存在智能体的私有日志中,对其他人不可见。通信仅通过显式消息进行,智能体分享其最终输出,而非其过程。

This interaction is typically implemented either with Agent-as-a-Tool or via the Agent-to-Agent (A2A) Protocol. With Agent-as-a-Tool, one agent invokes another as if it were a standard tool, passing inputs and receiving a final, self-contained output6. With the Agent-to-Agent (A2A) Protocol, agents use a structured protocol for direct messaging7.

这种交互通常通过实现智能体即工具(Agent-as-a-tool)或使用智能体到智能体(A2A)协议来实现。使用智能体即工具时,一个智能体像调用标准工具一样调用另一个智能体,传递输入并接收最终的自包含输出⁶。使用智能体到智能体(A2A)协议时,智能体使用结构化协议进行直接消息传递⁷。

We’ll explore the A2A protocol in more detail in the next section.

我们将在下一节更详细地探讨 A2A 协议。

Interoperability across multiple agent frameworks

跨多个智能体框架的互操作性

![][image4]Figure 4: A2A communication across multiple agents that use different frameworks

图 4:使用不同框架的多个智能体之间的 A2A 通信

A framework’s use of an internal data representation introduces a critical architectural trade-off for multi-agent systems: the very abstraction that decouples an agent from an LLM also isolates it from agents using other agent frameworks. This isolation is solidified at the persistence layer. The storage model for a Session typically couples the database schema directly to the framework’s internal objects, creating a rigid, relatively non-portable conversation record. Therefore, an agent built with LangGraph cannot natively interpret the distinct Session and Event objects persisted by an ADK-based agent, making seamless task handoffs impossible.

框架使用内部数据表示为多智能体系统引入了关键的架构权衡:将智能体与 LLM 解耦的抽象同时也将其与使用其他智能体框架的智能体隔离开来。这种隔离在持久层得到巩固。Session 的存储模型通常将数据库模式直接耦合到框架的内部对象,创建了一个刚性的、相对不可移植的对话记录。因此,用 LangGraph 构建的智能体无法原生解释基于 ADK 的智能体持久化的不同 SessionEvent 对象,使无缝任务移交变得不可能。

One emerging architectural pattern for coordinating collaboration between these isolated agents is Agent-to-Agent (A2A) communication8. While this pattern enables agents to exchange messages, it fails to address the core problem of sharing rich, contextual state. Each agent’s conversation history is encoded in its framework’s internal schema. As a result, any A2A message containing session events requires a translation layer to be useful.

一种新兴的架构模式是智能体到智能体(A2A)通信⁸,用于协调这些隔离智能体之间的协作。虽然这种模式使智能体能够交换消息,但它未能解决共享丰富上下文状态的核心问题。每个智能体的对话历史都以其框架的内部模式编码。因此,任何包含会话事件的 A2A 消息都需要一个转换层才能有用。

A more robust architectural pattern for interoperability involves abstracting shared knowledge into a framework-agnostic data layer, such as Memory. Unlike a Session store, which preserves raw, framework-specific objects like Events and Messages, a memory layer is designed to hold processed, canonical information. Key information—like summaries, extracted entities, and facts—is extracted from the conversation and is typically stored as strings or dictionaries. The memory layer’s data structures are not coupled to any single framework’s internal data representation, which allows it to serve as a universal, common data layer. This pattern allows heterogeneous agents to achieve true collaborative intelligence by sharing a common cognitive resource without requiring custom translators.

一种更健壮的互操作性架构模式涉及将共享知识抽象到框架无关的数据层,如记忆(Memory)。与保存原始的、框架特定对象(如 Events 和 Messages)的 Session 存储不同,记忆层旨在保存经过处理的规范化信息。关键信息——如摘要、提取的实体和事实——从对话中提取,通常存储为字符串或字典。记忆层的数据结构不与任何单一框架的内部数据表示耦合,这使其能够作为通用的公共数据层。这种模式允许异构智能体通过共享共同的认知资源实现真正的协作智能,而无需自定义转换器。
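A sketch of such a canonical memory record: plain strings and dictionaries that any framework can read or write. The field and function names here are illustrative assumptions, not a real memory-layer API.

以下示意了这种规范化的记忆记录(字段名和函数名均为假设,并非真实的记忆层 API):

```python
# Framework-agnostic memory records: processed facts and summaries stored
# as plain strings/dicts rather than framework-specific objects.
memory_layer = {}

def write_memory(user_id, fact=None, summary=None):
    record = memory_layer.setdefault(user_id, {"facts": [], "summary": ""})
    if fact:
        record["facts"].append(fact)
    if summary:
        record["summary"] = summary

def read_memory(user_id):
    return memory_layer.get(user_id, {"facts": [], "summary": ""})

# One agent (e.g. ADK-based) writes; another (e.g. LangGraph-based) can
# read the same record without any translation layer.
write_memory("user-123", fact="prefers window seats")
write_memory("user-123", summary="Frequent traveler planning a trip to Lisbon.")
```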

Production Considerations for Sessions

会话的生产环境考量

When moving an agent to a production environment, its session management system must evolve from a simple log to a robust, enterprise-grade service. The key considerations fall into three critical areas: security and privacy, data integrity, and performance. A managed session store, like Agent Engine Sessions, is specifically designed to address these production requirements.

将智能体迁移到生产环境时,其会话管理系统必须从简单的日志演变为健壮的企业级服务。关键考虑因素分为三个关键领域:安全和隐私、数据完整性和性能。托管会话存储(如 Agent Engine Sessions)专门设计用于满足这些生产需求。

Security and Privacy

安全和隐私

Protecting the sensitive information contained within a session is a non-negotiable requirement. Strict Isolation is the most critical security principle. A session is owned by a single user, and the system must enforce strict isolation to ensure one user can never access another user’s session data (e.g., via ACLs). Every request to the session store must be authenticated and authorized against the session’s owner.

保护会话中包含的敏感信息是不可协商的要求。严格隔离是最关键的安全原则。会话由单个用户拥有,系统必须强制执行严格隔离,以确保一个用户永远无法访问另一个用户的会话数据(即通过 ACL)。对会话存储的每个请求都必须针对会话所有者进行身份验证和授权。
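A minimal sketch of such an ownership check, assuming a simple dictionary-shaped store (the store shape and field names are hypothetical):

以下是这种所有权检查的最小示意(存储结构和字段名均为假设):

```python
# Illustrative ownership check: every read of a session is authorized
# against the session's owner before any data is returned.
def get_session(store, session_id, requesting_user_id):
    session = store.get(session_id)
    if session is None:
        raise KeyError(f"no such session: {session_id}")
    if session["owner"] != requesting_user_id:
        raise PermissionError("caller is not the session owner")
    return session

store = {"s1": {"owner": "alice", "events": []}}
own_session = get_session(store, "s1", "alice")  # succeeds for the owner
```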

A best practice for handling Personally Identifiable Information (PII) is to redact it before the session data is ever written to storage. This is a fundamental security measure that drastically reduces the risk and “blast radius” of a potential data breach. By ensuring sensitive data is never persisted using tools like Model Armor9, you simplify compliance with privacy regulations like GDPR and CCPA and build user trust.

处理个人身份信息(PII)的最佳实践是在会话数据写入存储之前对其进行脱敏。这是一项基本的安全措施,可大幅降低潜在数据泄露的风险和”爆炸半径”。通过使用 Model Armor⁹ 等工具确保敏感数据永远不会被持久化,你可以简化对 GDPR 和 CCPA 等隐私法规的合规性并建立用户信任。
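The redact-before-write flow can be sketched for a single PII type (email addresses) with a regular expression. This is only an illustration of where redaction sits in the write path; production systems would rely on a dedicated service such as Model Armor rather than hand-rolled patterns.

以下仅以电子邮件为例示意“写入前脱敏”的流程(生产环境应使用 Model Armor 等专用服务,而非手写正则):

```python
import re

# Illustrative pre-write redaction for one PII type (email addresses).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text):
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def save_event(session, author, text):
    # Redact sensitive data before the event is ever persisted.
    session["events"].append({"author": author, "text": redact(text)})

session = {"events": []}
save_event(session, "user", "Email me at jane.doe@example.com please")
```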

Data Integrity and Lifecycle Management

数据完整性和生命周期管理

A production system requires clear rules for how session data is stored and maintained over time. Sessions should not live forever. You can implement a Time-to-Live (TTL) policy to automatically delete inactive sessions, managing storage costs and reducing data management overhead. This requires a clear data retention policy that defines how long sessions should be kept before being archived or permanently deleted.

生产系统需要明确的规则来说明会话数据如何随时间存储和维护。会话不应该永远存在。你可以实施生存时间(TTL)策略来自动删除不活跃的会话,以管理存储成本并减少数据管理开销。这需要一个明确的数据保留策略,定义会话在被存档或永久删除之前应保留多长时间。
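A TTL sweep can be sketched as a periodic job that deletes sessions idle longer than the policy allows. The store shape and field name (`last_active_at`, epoch seconds) are illustrative assumptions; managed session stores typically handle this for you.

TTL 清理可以示意为一个定期任务,删除闲置超过策略时限的会话(存储结构和字段名为假设;托管会话存储通常会自动处理):

```python
import time

# Illustrative TTL sweep: delete sessions idle longer than the TTL.
def purge_expired(sessions, ttl_seconds, now=None):
    now = time.time() if now is None else now
    expired = [sid for sid, s in sessions.items()
               if now - s["last_active_at"] > ttl_seconds]
    for sid in expired:
        del sessions[sid]
    return expired

sessions = {
    "fresh": {"last_active_at": 1_000_000, "events": []},
    "stale": {"last_active_at": 1_000_000 - 90_000, "events": []},
}
# With a 24-hour TTL (86,400 s), only the 90,000-second-idle session expires.
removed = purge_expired(sessions, ttl_seconds=86_400, now=1_000_000)
```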

Additionally, the system must guarantee that operations are appended to the session history in a deterministic order. Maintaining the correct chronological sequence of events is fundamental to the integrity of the conversation log.

此外,系统必须保证操作以确定性顺序附加到会话历史中。保持正确的事件时间顺序对于对话日志的完整性至关重要。

Performance and Scalability

性能和可扩展性

Session data is on the “hot path” of every user interaction, making its performance a primary concern. Reading and writing the session history must be extremely fast to ensure a responsive user experience. Agent runtimes are typically stateless, so the entire session history is retrieved from a central database at the start of every turn, incurring network transfer latency.

会话数据位于每个用户交互的”热路径”上,使其性能成为主要关注点。读取和写入会话历史必须非常快,以确保响应式的用户体验。智能体运行时通常是无状态的,因此在每轮开始时从中央数据库检索整个会话历史,会产生网络传输延迟。

To mitigate latency, it is crucial to reduce the size of the data transferred. A key optimization is to filter or compact the session history before sending it to the agent. For example, you can remove old, irrelevant function call outputs that are no longer needed for the current state of the conversation. The following section details several strategies for compacting history to effectively manage long-context conversations.

为了缓解延迟,减少传输数据的大小至关重要。一个关键的优化是在将会话历史发送给智能体之前对其进行过滤或压缩。例如,你可以删除当前对话状态不再需要的旧的、无关的函数调用输出。以下部分详细介绍了几种压缩历史的策略,以有效管理长上下文对话。
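A minimal sketch of this kind of filter, assuming a simple event shape with a `type` field: drop all but the most recent tool outputs from the payload sent to the model, leaving the persisted session untouched.

以下是这种过滤的最小示意(假设事件带有 type 字段):从发送给模型的负载中删除除最近一次外的所有工具输出,而持久化的会话保持不变。

```python
# Illustrative compaction: keep only the most recent tool outputs in the
# model payload; the stored session history is not modified.
def strip_stale_tool_outputs(events, keep_last=1):
    tool_idxs = [i for i, e in enumerate(events) if e["type"] == "tool_output"]
    stale = set(tool_idxs[:-keep_last]) if keep_last > 0 else set(tool_idxs)
    return [e for i, e in enumerate(events) if i not in stale]

events = [
    {"type": "user", "text": "Weather in Paris?"},
    {"type": "tool_output", "text": "{large JSON forecast}"},
    {"type": "model", "text": "It's sunny, 22°C."},
    {"type": "user", "text": "And in Rome?"},
    {"type": "tool_output", "text": "{large JSON forecast}"},
]
compacted = strip_stale_tool_outputs(events)
```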

Managing long context conversation: tradeoffs and optimizations

管理长上下文对话:权衡与优化

In a simplistic architecture, a session is an immutable log of the conversation between the user and agent. However, as the conversation scales, the conversation’s token usage increases. Modern LLMs can handle long contexts, but limitations exist, especially for latency-sensitive applications10:

在简单的架构中,会话是用户和智能体之间对话的不可变日志。然而,随着对话规模的扩大,对话的 token 使用量会增加。现代 LLM 可以处理长上下文,但存在限制,特别是对于延迟敏感的应用¹⁰:

1. Context Window Limits: Every LLM has a maximum amount of text (context window) it can process at once. If the conversation history exceeds this limit, the API call will fail.

1. 上下文窗口限制: 每个 LLM 都有一次可以处理的最大文本量(上下文窗口)。如果对话历史超过此限制,API 调用将失败。

2. API Costs ($): Most LLM providers charge based on the number of tokens you send and receive. Shorter histories mean fewer tokens and lower costs per turn.

2. API 成本($): 大多数 LLM 提供商根据你发送和接收的 token 数量收费。更短的历史意味着更少的 token 和更低的每轮成本。

3. Latency (Speed): Sending more text to the model takes longer to process, resulting in a slower response time for the user. Compaction keeps the agent feeling quick and responsive.

3. 延迟(速度): 向模型发送更多文本需要更长的处理时间,导致用户的响应时间变慢。压缩使智能体保持快速和响应。

4. Quality: As the number of tokens increases, performance can get worse due to additional noise in the context and autoregressive errors.

4. 质量: 随着 token 数量的增加,由于上下文中的额外噪音和自回归错误,性能可能会变差。

Managing a long conversation with an agent can be compared to a savvy traveler packing a suitcase for a long trip. The suitcase represents the agent’s limited context window, and the clothes and items are the pieces of information from the conversation. If you simply try to stuff everything in, the suitcase becomes too heavy and disorganized, making it difficult to find what you need quickly—like how an overloaded context window increases processing costs and slows down response times. On the other hand, if you pack too little, you risk leaving behind essential items like a passport or a warm coat, compromising the entire trip— like how an agent could lose critical context, leading to irrelevant or incorrect answers. Both the traveler and the agent operate under a similar constraint: success hinges not on how much you can carry, but on carrying only what you need.

管理与智能体的长对话可以比作一个精明的旅行者为长途旅行打包行李箱。行李箱代表智能体有限的上下文窗口,衣服和物品是对话中的信息片段。如果你只是试图把所有东西都塞进去,行李箱会变得太重和杂乱无章,很难快速找到你需要的东西——就像过载的上下文窗口会增加处理成本并减慢响应时间一样。另一方面,如果你打包太少,你可能会冒着遗漏护照或保暖外套等必需品的风险,从而影响整个旅程——就像智能体可能会丢失关键上下文,导致无关或错误的答案。旅行者和智能体都在类似的约束下运作:成功不在于你能携带多少,而在于只携带你需要的东西。

Compaction strategies shrink long conversation histories, condensing dialogue to fit within the model’s context window, reducing API costs and latency. As a conversation gets longer, the history sent to the model with each turn can become too large. Compaction strategies solve this by intelligently trimming the history while trying to preserve the most important context.

压缩策略缩减长对话历史,压缩对话以适应模型的上下文窗口,降低 API 成本和延迟。随着对话变长,每轮发送给模型的历史可能变得太大。压缩策略通过智能修剪历史同时尽量保留最重要的上下文来解决这个问题。

So, how do you know what content to throw out of a Session without losing valuable information? Strategies range from simple truncation to sophisticated compaction:

那么,你如何知道从会话中丢弃什么内容而不丢失有价值的信息?策略从简单的截断到复杂的压缩:

• Keep the last N turns: This is the simplest strategy. The agent only keeps the most recent N turns of the conversation (a “sliding window”) and discards everything older.

• 保留最后 N 轮: 这是最简单的策略。智能体只保留对话中最近的 N 轮(”滑动窗口”)并丢弃所有更旧的内容。

• Token-Based Truncation: Before sending the history to the model, the agent counts the tokens in the messages, starting with the most recent and working backward. It includes as many messages as possible without exceeding a predefined token limit (e.g., 4000 tokens). Everything older is simply cut off.

• 基于 Token 的截断: 在将历史发送到模型之前,智能体计算消息中的 token,从最近的开始向后计算。它包含尽可能多的消息而不超过预定义的 token 限制(例如 4000 个 token)。所有更旧的内容都被简单地截断。

• Recursive Summarization: Older parts of the conversation are replaced by an AI generated summary. As the conversation grows, the agent periodically uses another LLM call to summarize the oldest messages. This summary is then used as a condensed form of the history, often prefixed to the more recent, verbatim messages.

• 递归摘要: 对话的较旧部分被 AI 生成的摘要替换。随着对话的增长,智能体定期使用另一个 LLM 调用来总结最旧的消息。然后将此摘要用作历史的压缩形式,通常作为前缀添加到更近期的逐字消息之前。
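
To make these strategies concrete, here is a minimal, framework-neutral sketch of token-based truncation. The `count_tokens` stand-in (roughly four characters per token) is an assumption for illustration; a real implementation would use the model's tokenizer:

```python
def truncate_history(messages, max_tokens=4000, count_tokens=None):
    """Walk backward from the newest message, keeping as many
    messages as fit within the token budget; older ones are cut off."""
    if count_tokens is None:
        # Crude stand-in for a real tokenizer: ~4 characters per token.
        count_tokens = lambda text: max(1, len(text) // 4)
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break  # everything older than this message is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Because the walk starts from the newest message, the most recent context always survives; only the oldest turns are cut.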

For example, you can keep the last N turns with ADK by using a built-in plugin that limits the context sent to the model. This does not modify the historical events stored in your session storage:

例如,你可以通过使用 ADK 应用的内置插件来保留最后 N 轮,以限制发送给模型的上下文。这不会修改存储在会话存储中的历史事件:

Python

```python
from google.adk.apps import App
from google.adk.plugins.context_filter_plugin import ContextFilterPlugin

app = App(
    name='hello_world_app',
    root_agent=agent,
    plugins=[
        # Keep the last 10 turns and the most recent user query.
        # 保留最后 10 轮和最近的用户查询。
        ContextFilterPlugin(num_invocations_to_keep=10),
    ],
)
```

Snippet 3: Session truncation to only use the last N turns with ADK

代码片段 3:使用 ADK 将会话截断为仅使用最后 N 轮

Given that sophisticated compaction strategies aim to reduce cost and latency, it is critical to perform expensive operations (like recursive summarization) asynchronously in the background and persist the results. “In the background” ensures the client is not kept waiting, and “persistence” ensures that expensive computations are not excessively repeated. Frequently, the agent’s memory manager is responsible for both generating and persisting these recursive summaries. The agent must also keep a record of which events are included in the compacted summary; this prevents the original, more verbose events from being needlessly sent to the LLM.

鉴于复杂的压缩策略旨在降低成本和延迟,在后台异步执行昂贵的操作(如递归摘要)并持久化结果至关重要。”在后台”确保客户端不会等待,”持久化”确保昂贵的计算不会过度重复。通常,智能体的记忆管理器负责生成和持久化这些递归摘要。智能体还必须记录哪些事件包含在压缩摘要中;这可以防止原始的、更冗长的事件被不必要地发送给 LLM。
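
A minimal sketch of this bookkeeping, assuming a hypothetical `summarize` callable (the background LLM call) and events carried as dicts with an `id` field:

```python
from dataclasses import dataclass, field

@dataclass
class CompactionRecord:
    """Persisted output of a background compaction run."""
    summary: str = ""
    covered_event_ids: set = field(default_factory=set)

def compact(events, record, summarize):
    """Run in a background worker: fold old, uncovered events into the
    rolling summary and record their ids so they are never re-sent."""
    uncovered = [e for e in events if e["id"] not in record.covered_event_ids]
    old = uncovered[:-2]  # keep the two most recent turns verbatim
    if old:
        record.summary = summarize(record.summary, old)  # expensive LLM call
        record.covered_event_ids |= {e["id"] for e in old}
    return record

def build_context(events, record):
    """Assemble the model context: the stored summary stands in for
    covered events; only uncovered events are included verbatim."""
    verbatim = [e for e in events if e["id"] not in record.covered_event_ids]
    prefix = []
    if record.summary:
        prefix.append({"role": "system",
                       "content": "Summary of earlier turns: " + record.summary})
    return prefix + verbatim
```

Persisting `covered_event_ids` alongside the summary is what prevents the original, more verbose events from being needlessly re-sent to the LLM.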

Additionally, the agent must decide when compaction is necessary. The trigger mechanism generally falls into a few distinct categories:

此外,智能体必须决定何时需要压缩。触发机制通常分为几个不同的类别:

• Count-Based Triggers (i.e. token size or turn count threshold): The conversation is compacted once the conversation exceeds a certain predefined threshold. This approach is often “good enough” for managing context length.

• 基于计数的触发器(即 token 大小或轮次计数阈值):一旦对话超过某个预定义阈值,对话就会被压缩。这种方法通常”足够好”来管理上下文长度。

• Time-Based Triggers: Compaction is triggered not by the size of the conversation, but by a lack of activity. If a user stops interacting for a set period (e.g., 15 or 30 minutes), the system can run a compaction job in the background.

• 基于时间的触发器: 压缩不是由对话的大小触发,而是由缺乏活动触发。如果用户停止交互一段时间(例如 15 或 30 分钟),系统可以在后台运行压缩作业。

• Event-Based Triggers (i.e. Semantic/Task Completion): The agent decides to trigger compaction when it detects that a specific task, sub-goal, or topic of conversation has concluded.

• 基于事件的触发器(即语义/任务完成):当智能体检测到特定任务、子目标或对话主题已结束时,它会决定触发压缩。
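
In practice these triggers are often combined into a single check; a schematic dispatcher (threshold values are arbitrary examples, not recommendations) might look like:

```python
def should_compact(turn_count, token_count, idle_seconds=0,
                   max_turns=20, max_tokens=6000, max_idle_seconds=1800):
    """Return True when any compaction trigger fires."""
    if turn_count >= max_turns:
        return True                 # count-based: turn threshold
    if token_count >= max_tokens:
        return True                 # count-based: token threshold
    return idle_seconds >= max_idle_seconds  # time-based: inactivity
```

An event-based trigger would add a further check, typically an LLM call asking whether the current task or topic has concluded.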

For example, you can use ADK’s EventsCompactionConfig to trigger LLM-based summarization after a configured number of turns:

例如,你可以使用 ADK 的 EventsCompactionConfig 在配置的轮次后触发基于 LLM 的摘要:

Python

```python
from google.adk.apps import App
from google.adk.apps.app import EventsCompactionConfig

app = App(
    name='hello_world_app',
    root_agent=agent,
    events_compaction_config=EventsCompactionConfig(
        compaction_interval=5,
        overlap_size=1,
    ),
)
```

Snippet 4: Session compaction using summarization with ADK

代码片段 4:使用 ADK 通过摘要进行会话压缩

Memory generation is the broad capability of extracting persistent knowledge from a verbose and noisy data source. In this section, we covered a primary example of extracting information from conversation history: session compaction. Compaction distills the verbatim transcript of an entire conversation, extracting key facts and summaries while discarding conversational filler.

记忆生成是从冗长且嘈杂的数据源中提取持久知识的广泛能力。在本节中,我们介绍了从对话历史中提取信息的主要示例:会话压缩。压缩提炼整个对话的逐字记录,提取关键事实和摘要,同时丢弃对话填充内容。

Building on compaction, the next section will explore memory generation and management more broadly. We will discuss the various ways memories can be created, stored, and retrieved to build an agent’s long-term knowledge.

在压缩的基础上,下一节将更广泛地探讨记忆生成和管理。我们将讨论创建、存储和检索记忆以构建智能体长期知识的各种方式。

Memory

记忆

Memory and Sessions share a deeply symbiotic relationship: sessions are the primary data source for generating memories, and memories are a key strategy for managing the size of a session. A memory is a snapshot of extracted, meaningful information from a conversation or data source. It’s a condensed representation that preserves important context, making it useful for future interactions. Generally, memories are persisted across sessions to provide a continuous and personalized experience.

记忆和会话共享深度共生的关系:会话是生成记忆的主要数据源,而记忆是管理会话大小的关键策略。记忆是从对话或数据源中提取的有意义信息的快照。它是一种浓缩的表示,保留重要上下文,使其对未来的交互有用。通常,记忆跨会话持久化以提供连续且个性化的体验。

As a specialized, decoupled service, a “memory manager” provides the foundation for multi-agent interoperability. Memory managers frequently use framework-agnostic data structures, like simple strings and dictionaries. This allows agents built on different frameworks to connect to a single memory store, enabling the creation of a shared knowledge base that any connected agent can utilize.

作为一个专门的、解耦的服务,“记忆管理器”为多智能体互操作性提供基础。记忆管理器经常使用框架无关的数据结构,如简单的字符串和字典。这允许构建在不同框架上的智能体连接到单个记忆存储,从而创建任何连接的智能体都可以利用的共享知识库。

Note: some frameworks may also refer to Sessions or verbatim conversation as “short-term memory.” For this whitepaper, memories are defined as extracted information, not the raw dialogue of turn-by-turn conversation.

注意:一些框架也可能将会话或逐字对话称为”短期记忆”。对于本白皮书,记忆被定义为提取的信息,而非原始的逐轮对话。

Storing and retrieving memories is crucial for building sophisticated and intelligent agents. A robust memory system transforms a basic chatbot into a truly intelligent agent by unlocking several key capabilities:

存储和检索记忆对于构建复杂和智能的智能体至关重要。健壮的记忆系统通过解锁几个关键能力,将基本的聊天机器人转变为真正智能的智能体:

• Personalization: The most common use case is to remember user preferences, facts, and past interactions to tailor future responses. For example, remembering a user’s favorite sports team or their preferred seat on an airplane creates a more helpful and personal experience.

• 个性化: 最常见的用例是记住用户偏好、事实和过去的交互,以定制未来的响应。例如,记住用户最喜欢的运动队或他们在飞机上的首选座位可以创造更有帮助和个人化的体验。

• Context Window Management: As conversations become longer, the full history can exceed an LLM’s context window. Memory systems can compact this history by creating summaries or extracting key facts, preserving context without sending thousands of tokens in every turn. This reduces both cost and latency.

• 上下文窗口管理: 随着对话变长,完整历史可能超过 LLM 的上下文窗口。记忆系统可以通过创建摘要或提取关键事实来压缩这些历史,在不每轮发送数千个 token 的情况下保留上下文。这可以降低成本和延迟。

• Data Mining and Insight: By analyzing stored memories across many users (in an aggregated, privacy-preserving way), you can extract insights from the noise. For example, a retail chatbot might identify that many users are asking about the return policy for a specific product, flagging a potential issue.

• 数据挖掘和洞察: 通过分析跨多个用户的存储记忆(以聚合的、隐私保护的方式),你可以从噪音中提取洞察。例如,零售聊天机器人可能会识别出许多用户正在询问特定产品的退货政策,标记潜在问题。

• Agent Self-Improvement and Adaptation: The agent learns from previous runs by creating procedural memories about its own performance—recording which strategies, tools, or reasoning paths led to successful outcomes. This enables the agent to build a playbook of effective solutions, allowing it to adapt and improve its problem-solving over time.

• 智能体自我改进和适应: 智能体通过创建关于其自身性能的程序性记忆从之前的运行中学习——记录哪些策略、工具或推理路径导致了成功的结果。这使智能体能够建立有效解决方案的剧本,使其能够随时间调整和改进其问题解决能力。

Creating, storing, and utilizing memory in an AI system is a collaborative process. Each component in the stack—from the end-user to the developer’s code—has a distinct role to play.

在 AI 系统中创建、存储和利用记忆是一个协作过程。堆栈中的每个组件——从最终用户到开发者的代码——都有独特的角色要扮演。

1. The User: Provides the raw source data for memories. In some systems, users may provide memories directly (e.g., via a form).

1. 用户: 提供记忆的原始源数据。在某些系统中,用户可以直接提供记忆(例如通过表单)。

2. The Agent (Developer Logic): Configures how to decide what and when to remember, orchestrating calls to the memory manager. In simple architectures, the developer can implement the logic such that memory is always retrieved and memory generation is always triggered. In more advanced architectures, the developer may implement memory-as-a-tool, where the agent (via LLM) decides when memory should be retrieved or generated.

2. 智能体(开发者逻辑): 配置如何决定记住什么和何时记住,协调对记忆管理器的调用。在简单的架构中,开发者可以实现这样的逻辑:记忆总是被检索,总是被触发生成。在更高级的架构中,开发者可以实现记忆即工具,其中智能体(通过 LLM)决定何时应该检索或生成记忆。

3. The Agent Framework (e.g., ADK, LangGraph): Provides the structure and tools for memory interaction. The framework acts as the plumbing. It defines how the developer’s logic can access conversation history and interact with the memory manager, but it doesn’t manage the long-term storage itself. It also defines how to stuff retrieved memories into the context window.

3. 智能体框架(例如 ADK、LangGraph): 提供记忆交互的结构和工具。框架充当管道。它定义了开发者的逻辑如何访问对话历史和与记忆管理器交互,但它本身不管理长期存储。它还定义了如何将检索到的记忆填充到上下文窗口中。

4. The Session Storage (e.g., Agent Engine Sessions, Spanner, Redis): Stores the turn-by-turn conversation of the Session. The raw dialogue will be ingested into the memory manager in order to generate memories.

4. 会话存储(例如 Agent Engine Sessions、Spanner、Redis): 存储会话的逐轮对话。原始对话将被摄入记忆管理器以生成记忆。

5. The Memory Manager (e.g. Agent Engine Memory Bank, Mem0, Zep): Handles the storage, retrieval, and compaction of memories. The mechanisms to store and retrieve memories depend on what provider is used. This is the specialized service or component that takes the potential memory identified by the agent and handles its entire lifecycle.

5. 记忆管理器(例如 Agent Engine Memory Bank、Mem0、Zep): 处理记忆的存储、检索和压缩。存储和检索记忆的机制取决于使用的提供商。这是一个专门的服务或组件,它获取智能体识别的潜在记忆并处理其整个生命周期。

• Extraction distills the key information from the source data.

• 提取 从源数据中提炼关键信息。

• Consolidation curates memories to merge duplicative entities.

• 整合 策划记忆以合并重复的实体。

• Storage persists the memory to persistent databases.

• 存储 将记忆持久化到持久数据库。

• Retrieval fetches relevant memories to provide context for new interactions.

• 检索 获取相关记忆以为新交互提供上下文。
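
The lifecycle above can be illustrated with a deliberately naive, framework-agnostic manager; keyword matching stands in for the LLM-driven extraction and semantic search a real service would provide:

```python
class ToyMemoryManager:
    """Minimal stand-in for a memory manager's lifecycle:
    extract -> consolidate -> store -> retrieve."""

    def __init__(self):
        self._store = {}  # user_id -> list of memory strings

    def extract(self, dialogue):
        # Extraction: keep only lines that look like stable facts.
        return [line for line in dialogue if "prefer" in line.lower()]

    def consolidate(self, new, existing):
        # Consolidation: drop exact duplicates of existing memories.
        return [m for m in new if m not in existing]

    def store(self, user_id, dialogue):
        # Storage: persist consolidated memories under the user's key.
        existing = self._store.setdefault(user_id, [])
        existing.extend(self.consolidate(self.extract(dialogue), existing))

    def retrieve(self, user_id, query):
        # Retrieval: naive keyword match in place of semantic search.
        return [m for m in self._store.get(user_id, [])
                if any(word in m.lower() for word in query.lower().split())]
```

A production manager replaces every one of these steps with LLM calls, embeddings, and durable storage, but the division of labor is the same.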

![][image5]
Figure 5: The flow of information between sessions, memory, and external knowledge

图 5:会话、记忆和外部知识之间的信息流

The division of responsibilities ensures that the developer can focus on the agent’s unique logic without having to build the complex underlying infrastructure for memory persistence and management. It is important to recognize that a memory manager is an active system, not just a passive vector database. While it uses similarity search for retrieval, its core value lies in its ability to intelligently extract, consolidate, and curate memories over time. Managed memory services, like Agent Engine Memory Bank, handle the entire lifecycle of memory generation and storage, freeing you to focus on your agent’s core logic.

责任的划分确保开发者可以专注于智能体的独特逻辑,而无需构建复杂的底层基础设施来实现记忆持久化和管理。重要的是要认识到记忆管理器是一个主动的系统,而不仅仅是一个被动的向量数据库。虽然它使用相似性搜索进行检索,但其核心价值在于其能够随时间智能地提取、整合和策划记忆。托管记忆服务(如 Agent Engine Memory Bank)处理记忆生成和存储的整个生命周期,让你可以专注于智能体的核心逻辑。

This retrieval capability is also why memory is frequently compared to another key architectural pattern: Retrieval-Augmented Generation (RAG). However, they are built on different architectural principles, as RAG handles static, external data while Memory curates dynamic, user-specific context. They fulfill two distinct and complementary roles: RAG makes an agent an expert on facts, while memory makes it an expert on the user. The following chart breaks down their high-level differences:

这种检索能力也是为什么记忆经常与另一个关键架构模式进行比较的原因:检索增强生成(RAG)。然而,它们建立在不同的架构原则之上,RAG 处理静态的外部数据,而记忆策划动态的、用户特定的上下文。它们履行两个不同且互补的角色:RAG 使智能体成为事实专家,而记忆使其成为用户专家。下表分解了它们的高层差异:

| | RAG Engines | Memory Managers |
|---|---|---|
| Primary Goal | To inject external, factual knowledge into the context. | To create a personalized and stateful experience. The agent remembers facts, adapts to the user over time, and maintains long-running context. |
| Data source | A static, pre-indexed external knowledge base (e.g., PDFs, wikis, documents, APIs). | The dialogue between the user and agent. |
| Isolation Level | Generally shared. The knowledge base is typically a global, read-only resource accessible by all users to ensure consistent, factual answers. | Highly isolated. Memory is almost always scoped per-user to prevent data leaks. |
| Information type | Static, factual, and authoritative. Often contains domain-specific data, product details, or technical documentation. | Dynamic and (generally) user-specific. Memories are derived from conversation, so there’s an inherent level of uncertainty. |
| Write patterns | Batch processing, triggered via an offline, administrative action. | Event-based processing, triggered at some cadence (i.e. every turn or at the end of a session) or via memory-as-a-tool (the agent decides to generate memories). |
| Read patterns | Almost always retrieved “as-a-tool”: fetched when the agent decides that the user’s query requires external information. | Two common patterns: memory-as-a-tool (retrieved when the user’s query requires additional information about the user or some other identity) and static retrieval (memory is always retrieved at the start of each turn). |
| Data Format | A natural-language “chunk”. | A natural-language snippet or a structured profile. |
| Data preparation | Chunking and indexing: source documents are broken into smaller chunks, which are then converted to embeddings and stored for fast lookup. | Extraction and consolidation: key details are extracted from the conversation, ensuring content is not duplicative or contradictory. |

| | RAG 引擎 | 记忆管理器 |
|---|---|---|
| 主要目标 | 将外部的事实性知识注入上下文。 | 创建个性化和有状态的体验。智能体记住事实,随时间适应用户,并维护长期运行的上下文。 |
| 数据源 | 静态的、预索引的外部知识库(如 PDF、wiki、文档、API)。 | 用户和智能体之间的对话。 |
| 隔离级别 | 通常是共享的。知识库通常是一个全局的、只读的资源,所有用户都可以访问,以确保一致的、事实性的答案。 | 高度隔离。记忆几乎总是按用户范围划分以防止数据泄露。 |
| 信息类型 | 静态的、事实性的和权威的。通常包含领域特定的数据、产品详情或技术文档。 | 动态的且(通常)用户特定的。记忆源自对话,因此存在固有的不确定性。 |
| 写入模式 | 批处理,通过离线管理操作触发。 | 基于事件的处理,以某种节奏触发(即每轮或在会话结束时),或通过记忆即工具(智能体决定生成记忆)。 |
| 读取模式 | 几乎总是“作为工具”检索:当智能体决定用户的查询需要外部信息时获取。 | 两种常见模式:记忆即工具(当用户的查询需要关于用户或某些其他身份的额外信息时检索)和静态检索(记忆总是在每轮开始时检索)。 |
| 数据格式 | 自然语言“块”。 | 自然语言片段或结构化配置文件。 |
| 数据准备 | 分块和索引:源文档被分解成更小的块,然后转换为嵌入并存储以供快速查找。 | 提取和整合:从对话中提取关键细节,确保内容不重复或矛盾。 |

Table 1: Comparison of RAG engines and memory managers

表 1:RAG 引擎和记忆管理器的比较

A helpful way to understand the difference is to think of RAG as the agent’s research librarian and a memory manager as its personal assistant.

理解差异的一个有用方法是将 RAG 视为智能体的研究图书管理员,将记忆管理器视为其私人助理。

The research librarian (RAG) works in a vast public library filled with encyclopedias, textbooks, and official documents. When the agent needs an established fact—like a product’s technical specifications or a historical date—it consults the librarian. The librarian retrieves information from this static, shared, and authoritative knowledge base to provide consistent, factual answers. The librarian is an expert on the world’s facts, but they don’t know anything personal about the user asking the question.

研究图书管理员(RAG)在一个充满百科全书、教科书和官方文档的庞大公共图书馆工作。当智能体需要一个已确立的事实——如产品的技术规格或历史日期——它会咨询图书管理员。图书管理员从这个静态的、共享的和权威的知识库中检索信息,以提供一致的、事实性的答案。图书管理员是世界事实的专家,但他们不知道任何关于提问用户的个人信息。

In contrast, the personal assistant (memory) follows the agent and carries a private notebook, recording the details of every interaction with a specific user. This notebook is dynamic and highly isolated, containing personal preferences, past conversations, and evolving goals. When the agent needs to recall a user’s favorite sports team or the context of last week’s project discussion, it turns to the assistant. The assistant’s expertise is not in global facts, but in the user themselves.

相比之下,私人助理(记忆)跟随智能体并携带一本私人笔记本,记录与特定用户每次交互的细节。这本笔记本是动态的且高度隔离的,包含个人偏好、过去的对话和不断发展的目标。当智能体需要回忆用户最喜欢的运动队或上周项目讨论的上下文时,它会求助于助理。助理的专业知识不在于全球事实,而在于用户本身。

Ultimately, a truly intelligent agent needs both. RAG provides it with expert knowledge of the world, while memory provides it with an expert understanding of the user it’s serving.

最终,一个真正智能的智能体两者都需要。RAG 为其提供关于世界的专家知识,而记忆为其提供对它所服务用户的专家理解。

The next section deconstructs the concept of memory by examining its core components: the types of information it stores, the patterns for its organization, the mechanisms for its storage and creation, the strategic definition of its scope, and its handling of multimodal versus textual data.

下一节通过检查记忆的核心组件来解构记忆的概念:它存储的信息类型、其组织模式、其存储和创建的机制、其作用域的战略定义,以及它对多模态与文本数据的处理。

Types of memory

记忆类型

An agent’s memory can be categorized by how the information is stored and how it was captured. These different types of memory work together to create a rich, contextual understanding of a user and their needs. Across all types of memories, the rule stands that memories are descriptive, not predictive.

智能体的记忆可以按信息的存储方式和捕获方式进行分类。这些不同类型的记忆共同作用,创造对用户及其需求的丰富、上下文化的理解。对于所有类型的记忆,规则是记忆是描述性的,而非预测性的。

A “memory” is an atomic piece of context that is returned by the memory manager and used by the agent as context. While the exact schema can vary, a single memory generally consists of two main components: content and metadata.

“记忆”是由记忆管理器返回并被智能体用作上下文的原子上下文片段。虽然确切的模式可能有所不同,但单个记忆通常由两个主要组件组成:内容元数据

Content is the substance of the memory that was extracted from the source data (i.e. the raw dialogue of the session). Crucially, the content is designed to be framework-agnostic, using simple data structures that any agent can easily ingest. The content can either be structured or unstructured data. Structured memories include information typically stored in universal formats like a dictionary or JSON. Its schema is typically defined by the developer, not a specific framework. For example, {"seat_preference": "Window"}. Unstructured memories are natural language descriptions that capture the essence of a longer interaction, event, or topic. For example, “The user prefers a window seat.”

内容是从源数据(即会话的原始对话)中提取的记忆实质。至关重要的是,内容被设计为框架无关的,使用任何智能体都可以轻松摄入的简单数据结构。内容可以是结构化或非结构化数据。结构化记忆包括通常以通用格式(如字典或 JSON)存储的信息。其模式通常由开发者定义,而非特定框架。例如,{"seat_preference": "Window"}非结构化记忆是捕获较长交互、事件或主题本质的自然语言描述。例如,”用户偏好靠窗座位。”

Metadata provides context about the memory, typically stored as a simple string. This can include a unique identifier for the memory, identifiers for the “owner” of the memory, and labels describing the content or data source of the memory.

元数据提供关于记忆的上下文,通常存储为简单字符串。这可以包括记忆的唯一标识符、记忆”所有者”的标识符,以及描述记忆内容或数据源的标签。
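
As a sketch, a single memory might be represented like this (the field names are illustrative, not any particular provider's schema):

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass
class Memory:
    """One atomic memory: framework-agnostic content plus metadata."""
    content: Union[str, dict]                     # unstructured text or structured data
    metadata: dict = field(default_factory=dict)  # id, owner, labels, data source

# Structured content, with a schema chosen by the developer:
structured = Memory(
    content={"seat_preference": "Window"},
    metadata={"memory_id": "m-001", "user_id": "user-123", "topic": "travel"},
)

# Unstructured content, a natural-language insight:
unstructured = Memory(
    content="The user prefers a window seat.",
    metadata={"memory_id": "m-002", "user_id": "user-123", "topic": "travel"},
)
```

Because both variants are plain strings and dictionaries, any agent framework can ingest them without translation.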

Types of information

信息类型

Beyond their basic structure, memories can be classified by the fundamental type of knowledge they represent. This distinction, crucial for understanding how an agent uses memories, separates memory into two primary functional categories derived from cognitive science¹¹: declarative memories (“knowing what”) and procedural memories (“knowing how”).

除了其基本结构之外,记忆还可以按它们所代表的基本知识类型进行分类。这种区分对于理解智能体如何使用记忆至关重要,将记忆分为源自认知科学¹¹的两个主要功能类别:陈述性记忆(*”知道是什么”)和程序性记忆“知道怎么做”*)。

Declarative memory is the agent’s knowledge of facts, figures, and events. It’s all the information that the agent can explicitly state or “declare.” If the memory is an answer to a “what” question, it’s declarative. This category encompasses both general world knowledge (Semantic) and specific user facts (Entity/Episodic).

陈述性记忆是智能体对事实、数据和事件的知识。它是智能体可以明确陈述或”声明”的所有信息。如果记忆是对”什么”问题的回答,它就是陈述性的。这个类别包括一般的世界知识(语义)和特定的用户事实(实体/情景)。

Procedural memory is the agent’s knowledge of skills and workflows. It guides the agent’s actions by demonstrating implicitly how to perform a task correctly. If the memory helps answer a “how” question—like the correct sequence of tool calls to book a trip—it’s procedural.

程序性记忆是智能体对技能和工作流程的知识。它通过隐式地演示如何正确执行任务来指导智能体的行动。如果记忆有助于回答”如何”问题——比如预订旅行的正确工具调用序列——它就是程序性的。

Organization patterns

组织模式

Once a memory is created, the next question is how to organize it. Memory managers typically employ one or more of the following patterns to organize memories: Collections¹², Structured User Profile, or “Rolling Summary”. The patterns define how individual memories relate to each other and to the user.

一旦记忆被创建,下一个问题是如何组织它。记忆管理器通常采用以下一种或多种模式来组织记忆:集合¹²、结构化用户档案或**”滚动摘要”**。这些模式定义了单个记忆如何相互关联以及如何与用户关联。

The collections¹³ pattern organizes content into multiple self-contained, natural language memories for a single user. Each memory is a distinct event, summary, or observation, although there may be multiple memories in the collection for a single high-level topic. Collections allow for storing and searching through a larger, less structured pool of information related to specific goals or topics.

集合¹³模式将内容组织成单个用户的多个自包含的自然语言记忆。每个记忆是一个独特的事件、摘要或观察,尽管集合中可能有多个记忆用于单个高级主题。集合允许存储和搜索与特定目标或主题相关的更大、更少结构化的信息池。

The structured user profile pattern organizes memories as a set of core facts about a user, like a contact card that is continuously updated with new, stable information. It’s designed for quick lookups of essential, factual information like names, preferences, and account details.

结构化用户档案模式将记忆组织为关于用户的一组核心事实,就像一张不断用新的、稳定的信息更新的联系人卡。它旨在快速查找基本的、事实性的信息,如姓名、偏好和账户详情。

Unlike a structured user profile, the “rolling” summary pattern consolidates all information into a single, evolving memory that represents a natural-language summary of the entire user-agent relationship. Instead of creating new, individual memories, the manager continuously updates this one master document. This pattern is frequently used to compact long Sessions, preserving vital information while managing the overall token count.

与结构化用户档案不同,**”滚动”摘要**模式将所有信息整合到一个单一的、不断发展的记忆中,代表整个用户-智能体关系的自然语言摘要。管理器不是创建新的、单独的记忆,而是不断更新这一个主文档。这种模式经常用于压缩长会话,在管理整体 token 数量的同时保留关键信息。
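
The three patterns can be contrasted on the same new fact; this toy sketch uses plain Python structures in place of a memory manager's storage:

```python
# Three organization patterns applied to the same new fact.
new_fact = "User prefers a window seat"

# 1. Collections: append another self-contained memory to a pool.
collection = ["User is planning a trip to Paris"]
collection.append(new_fact)

# 2. Structured user profile: overwrite a key on a contact-card-like dict.
profile = {"name": "Ada", "seat_preference": "Aisle"}
profile["seat_preference"] = "Window"  # stale fact replaced in place

# 3. Rolling summary: rewrite the single master natural-language document.
summary = "Ada is planning a Paris trip."
summary = summary.rstrip(".") + " and prefers a window seat."
```

Collections grow, profiles stay fixed-size per key, and the rolling summary remains a single evolving document; in a real manager the rewrite steps would each be LLM calls.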

Storage architectures

存储架构

Additionally, the storage architecture is a critical decision that determines how quickly and intelligently an agent can retrieve memories. The choice of architecture defines whether the agent excels at finding conceptually similar ideas, understanding structured relationships, or both.

此外,存储架构是一个关键决策,决定了智能体能够多快、多智能地检索记忆。架构的选择决定了智能体是擅长查找概念相似的想法、理解结构化关系,还是两者兼而有之。

Memories are generally stored in vector databases and/or knowledge graphs. Vector databases help find memories that are conceptually similar to the query. Knowledge graphs store memories as a network of entities and their relationships.

记忆通常存储在向量数据库和/或知识图谱中。向量数据库帮助找到与查询概念相似的记忆。知识图谱将记忆存储为实体及其关系的网络。

Vector databases are the most common approach, enabling retrieval based on semantic similarity rather than exact keywords. Memories are converted into embedding vectors, and the database finds the closest conceptual matches to a user’s query. This excels at retrieving unstructured, natural language memories where context and meaning are key (i.e. “atomic facts”¹⁴).

向量数据库是最常见的方法,能够基于语义相似性而非精确关键词进行检索。记忆被转换为嵌入向量,数据库找到与用户查询最接近的概念匹配。这在检索非结构化的自然语言记忆方面表现出色,其中上下文和含义是关键(即”原子事实”¹⁴)。

Knowledge graphs are used to store memories as a network of entities (nodes) and their relationships (edges). Retrieval involves traversing this graph to find direct and indirect connections, allowing the agent to reason about how different facts are linked. It is ideal for structured, relational queries and understanding complex connections within the data (i.e. “knowledge triples”¹⁵).

知识图谱用于将记忆存储为实体(节点)及其关系(边)的网络。检索涉及遍历此图以查找直接和间接连接,使智能体能够推理不同事实是如何链接的。它非常适合结构化的关系查询和理解数据中的复杂连接(即”知识三元组”¹⁵)。

You can also combine both methods into a hybrid approach by enriching a knowledge graph’s structured entities with vector embeddings. This enables the system to perform both relational and semantic searches simultaneously. This provides the structured reasoning of a graph and the nuanced, conceptual search of a vector database, offering the best of both worlds.

你还可以通过用向量嵌入丰富知识图谱的结构化实体,将两种方法结合成混合方法。这使系统能够同时执行关系和语义搜索。这提供了图的结构化推理和向量数据库的细微、概念性搜索,两全其美。
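
The vector-database path can be sketched with a toy embedding: here lowercase character bigrams stand in for a real embedding model, and cosine similarity ranks the candidate memories:

```python
import math

def embed(text):
    """Toy embedding: lowercase character-bigram counts, a crude
    stand-in for a real embedding model."""
    text = text.lower()
    vec = {}
    for a, b in zip(text, text[1:]):
        vec[a + b] = vec.get(a + b, 0) + 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(count * v.get(key, 0) for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query, memories, top_k=2):
    """Return the top_k memories most similar to the query."""
    q = embed(query)
    return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:top_k]
```

A real vector database performs the same ranking with learned embeddings and an approximate-nearest-neighbor index instead of an exhaustive sort.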

Creation mechanisms

创建机制

We can also classify memories by how they were created, including how the information was derived. Explicit memories are created when the user gives a direct command to the agent to remember something (e.g., “Remember my anniversary is October 26th”). On the other hand, implicit memories are created when the agent infers and extracts information from the conversation without a direct command (e.g., “My anniversary is next week. Can you help me find a gift for my partner?”)

我们还可以按记忆的创建方式对其进行分类,包括信息是如何派生的。显式记忆是当用户给智能体一个直接命令来记住某些东西时创建的(例如,”记住我的纪念日是 10 月 26 日”)。另一方面,隐式记忆是当智能体从对话中推断和提取信息而没有直接命令时创建的(例如,”我的纪念日是下周。你能帮我找个给我伴侣的礼物吗?”)

Memories can also be distinguished by whether the memory extraction logic is located internally or externally to the agent framework. Internal memory refers to memory management that is built directly into the agent framework. It’s convenient for getting started but often lacks advanced features. Internal memory can use external storage, but the mechanism for generating memories is internal to the agent.

记忆还可以根据记忆提取逻辑是位于智能体框架内部还是外部来区分。内部记忆指的是直接构建到智能体框架中的记忆管理。它便于入门但通常缺乏高级功能。内部记忆可以使用外部存储,但生成记忆的机制是智能体内部的。

External Memory involves using a separate, specialized service dedicated to memory management (e.g., Agent Engine Memory Bank, Mem0, Zep). The agent framework makes API calls to this external service to store, retrieve, and process memories. This approach provides more sophisticated features like semantic search, entity extraction, and automatic summarization, offloading the complex task of memory management to a purpose-built tool.

外部记忆涉及使用专门用于记忆管理的单独专业服务(例如 Agent Engine Memory Bank、Mem0、Zep)。智能体框架对此外部服务进行 API 调用以存储、检索和处理记忆。这种方法提供更复杂的功能,如语义搜索、实体提取和自动摘要,将记忆管理的复杂任务卸载到专门构建的工具上。

Memory scope

记忆作用域

You also need to consider who or what a memory describes. This has implications on what entity (i.e. a user, session, or application) you use to aggregate and retrieve memories.

你还需要考虑记忆描述的是什么。这对你使用哪个实体(即用户会话应用程序)来聚合和检索记忆有影响。

User-Level scope is the most common implementation, designed to create a continuous, personalized experience for each individual; for example, “the User prefers the middle seat.” Memories are tied to a specific user ID and persist across all their sessions, allowing the agent to build a long-term understanding of their preferences and history.

用户级作用域是最常见的实现,旨在为每个个人创造连续的、个性化的体验;例如,*”用户偏好中间座位。”* 记忆与特定用户 ID 绑定,并在其所有会话中持久存在,使智能体能够对其偏好和历史建立长期理解。

Session-Level scope is designed for the compaction of long conversations; for example, “the User is shopping for tickets between New York and Paris between November 7, 2025 and November 14, 2025. They prefer direct flights and the middle seat”. It creates a persistent record of insights extracted from a single session, allowing an agent to replace the verbose, token-heavy transcript with a concise set of key facts. Crucially, this memory is distinct from the raw session log; it contains only the processed insights from the dialogue, not the dialogue itself, and its context is isolated to that specific session.

会话级作用域旨在压缩长对话;例如,“用户正在购买 2025 年 11 月 7 日至 2025 年 11 月 14 日之间纽约和巴黎之间的机票。他们偏好直飞航班和中间座位”。它创建从单个会话中提取的洞察的持久记录,允许智能体用简洁的关键事实集替换冗长的、token 密集的记录。至关重要的是,这种记忆与原始会话日志不同;它只包含从对话中处理的洞察,而非对话本身,其上下文被隔离到该特定会话。

Application-level scope (or global context) refers to memories accessible by all users of an application; for example, “The codename XYZ refers to the project….” This scope is used to provide shared context, broadcast system-wide information, or establish a baseline of common knowledge. A common use case for application-level memories is procedural memories, which provide “how-to” instructions for the agent; the memories are generally intended to help with the agent’s reasoning for all users. It is critical that these memories are sanitized of all sensitive content to prevent data leaks between users.

应用程序级作用域(或全局上下文)是应用程序所有用户都可以访问的记忆;例如,“代号 XYZ 指的是该项目……”这个作用域用于提供共享上下文、广播系统范围的信息,或建立公共知识的基线。应用程序级记忆的一个常见用例是程序性记忆,它为智能体提供“如何做”的指令;这些记忆通常旨在帮助智能体为所有用户进行推理。至关重要的是,这些记忆必须清除所有敏感内容,以防止用户之间的数据泄露。
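
Scoping is typically enforced at retrieval time by filtering on identifiers attached to each memory. The `scope` dictionaries below are an illustrative shape, not a specific provider's API:

```python
def visible_memories(store, app_name, user_id, session_id):
    """Return memories this request may see: application-level (shared),
    user-level, and this session's own insights."""
    allowed = [
        {"app": app_name},                                          # application-level
        {"app": app_name, "user": user_id},                         # user-level
        {"app": app_name, "user": user_id, "session": session_id},  # session-level
    ]
    return [m for m in store if m["scope"] in allowed]
```

Because the filter requires an exact scope match, one user's memories (or another session's insights) can never leak into the current request.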

Multimodal memory

多模态记忆

“Multimodal memory” is a crucial concept that describes how an agent handles non-textual information, like images, videos, and audio. The key is to distinguish between the data the memory is derived from (its source) and the data the memory is stored as (its content).

“多模态记忆”是一个关键概念,描述智能体如何处理非文本信息,如图像、视频和音频。关键是区分记忆派生自的数据(其来源)和记忆存储为的数据(其内容)。

Memory from a multimodal source is the most common implementation. The agent can process various data types—text, images, audio—but the memory it creates is a textual insight derived from that source. For example, an agent can process a user’s voice memo to create memories. It doesn’t store the audio file itself; instead, it transcribes the audio and creates a textual memory like, “User expressed frustration about the recent shipping delay.”

来自多模态来源的记忆是最常见的实现。智能体可以处理各种数据类型——文本、图像、音频——但它创建的记忆是从该来源派生的文本洞察。例如,智能体可以处理用户的语音备忘录来创建记忆。它不存储音频文件本身;相反,它转录音频并创建文本记忆,如”用户对最近的发货延迟表示沮丧。”

Memory with Multimodal Content is a more advanced approach where the memory itself contains non-textual media. The agent doesn’t just describe the content; it stores the content directly. For example, a user can upload an image and say “Remember this design for our logo.” The agent creates a memory that directly contains the image file, linked to the user’s request.

具有多模态内容的记忆是一种更高级的方法,其中记忆本身包含非文本媒体。智能体不仅描述内容;它直接存储内容。例如,用户可以上传图像并说”记住这个作为我们的标志设计。”智能体创建一个直接包含图像文件的记忆,链接到用户的请求。

Most contemporary memory managers focus on handling multimodal sources while producing textual content. This is because generating and retrieving unstructured binary data like images or audio for a specific memory requires specialized models, algorithms, and infrastructure. It is far simpler to convert all inputs into a common, searchable format: text.

大多数当代记忆管理器专注于处理多模态来源,同时生成文本内容。这是因为为特定记忆生成和检索非结构化二进制数据(如图像或音频)需要专门的模型、算法和基础设施。将所有输入转换为通用的、可搜索的格式:文本,要简单得多。

For example, you can generate memories from multimodal input¹⁶ using Agent Engine Memory Bank. The output memories will be textual insights extracted from the content:

例如,你可以使用 Agent Engine Memory Bank 从多模态输入¹⁶生成记忆。输出记忆将是从内容中提取的文本洞察:

Python

```python
import vertexai
from google.genai import types

client = vertexai.Client(project=..., location=...)

response = client.agent_engines.memories.generate(
    name=agent_engine_name,
    direct_contents_source={
        "events": [
            {
                "content": types.Content(
                    role="user",
                    parts=[
                        types.Part.from_text(
                            "This is context about the multimodal input."
                        ),
                        types.Part.from_bytes(
                            data=CONTENT_AS_BYTES,
                            mime_type=MIME_TYPE
                        ),
                        types.Part.from_uri(
                            file_uri="file/path/to/content",
                            mime_type=MIME_TYPE
                        )
                    ]
                )
            }
        ]
    },
    scope={"user_id": user_id}
)
```

Snippet 5: Example memory generation API call for Agent Engine Memory Bank

代码片段 5:Agent Engine Memory Bank 的记忆生成 API 调用示例

The next section examines the mechanics of memory generation, detailing the two core stages: the extraction of new information from source data, and the subsequent consolidation of that information with the existing memory corpus.

下一节检查记忆生成的机制,详细介绍两个核心阶段:从源数据中提取新信息,以及随后将该信息与现有记忆语料库整合。

Memory Generation: Extraction and Consolidation

记忆生成:提取与整合

Memory generation autonomously transforms raw conversational data into structured, meaningful insights, functioning as an LLM-driven ETL (Extract, Transform, Load) pipeline designed to extract and condense memories. This ETL pipeline is what distinguishes memory managers from RAG engines and traditional databases.

记忆生成自主地将原始对话数据转换为结构化的、有意义的洞察。将其视为一个LLM 驱动的 ETL(提取、转换、加载)管道,旨在提取和压缩记忆。记忆生成的 ETL 管道将记忆管理器与 RAG 引擎和传统数据库区分开来。

Rather than requiring developers to manually specify database operations, a memory manager uses an LLM to intelligently decide when to add, update, or merge memories. This automation is a memory manager’s core strength; it abstracts away the complexity of managing the database contents, chaining together LLM calls, and deploying background services for data processing.

记忆管理器不是要求开发者手动指定数据库操作,而是使用 LLM 智能地决定何时添加、更新或合并记忆。这种自动化是记忆管理器的核心优势;它抽象了管理数据库内容、链接 LLM 调用和部署后台数据处理服务的复杂性。

![][image6]
Figure 6: High-level algorithm of memory generation which extracts memories from new data sources and consolidates them with existing memories

图 6:记忆生成的高级算法,从新数据源提取记忆并将其与现有记忆整合

While the specific algorithms vary by platform (e.g., Agent Engine Memory Bank, Mem0, Zep), the high-level process of memory generation generally follows these four stages:

虽然具体算法因平台而异(例如 Agent Engine Memory Bank、Mem0、Zep),但记忆生成的高级过程通常遵循这四个阶段:

1. Ingestion: The process begins when the client provides a source of raw data, typically a conversation history, to the memory manager.

1. 摄入: 当客户端向记忆管理器提供原始数据源(通常是对话历史)时,过程开始。

2. Extraction & Filtering: The memory manager uses an LLM to extract meaningful content from the source data. The key is that this LLM doesn’t extract everything; it only captures information that fits a predefined topic definition. If the ingested data contains no information that matches these topics, no memory is created.

2. 提取和过滤: 记忆管理器使用 LLM 从源数据中提取有意义的内容。关键是这个 LLM 不会提取所有内容;它只捕获符合预定义主题定义的信息。如果摄入的数据不包含与这些主题匹配的信息,则不会创建记忆。

3. Consolidation: This is the most sophisticated stage, where the memory manager handles conflict resolution and deduplication. It performs a “self-editing” process, using an LLM to compare the newly extracted information with existing memories. To ensure the user’s knowledge base remains coherent, accurate, and evolves over time based on new information, the manager can decide to:

3. 整合: 这是最复杂的阶段,记忆管理器处理冲突解决和去重。它执行”自编辑”过程,使用 LLM 将新提取的信息与现有记忆进行比较。为确保用户的知识库保持连贯、准确,并根据新信息随时间演变,管理器可以决定:

• Merge the new insight into an existing memory.

• 合并 将新洞察合并到现有记忆中。

• Delete an existing memory if it’s now invalidated.

• 删除 如果现有记忆现在无效,则删除它。

• Create an entirely new memory if the topic is novel.

• 创建 如果主题是新颖的,则创建一个全新的记忆。

4. Storage: Finally, the new or updated memory is persisted to a durable storage layer (such as a vector database or knowledge graph) so it can be retrieved in future interactions.

4. 存储: 最后,新的或更新的记忆被持久化到持久存储层(如向量数据库或知识图谱),以便在未来的交互中检索。
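The four stages above can be sketched as a minimal, self-contained pipeline. Note this is an illustrative sketch only: the keyword matcher and exact-match deduplication below stand in for the LLM-driven extraction and consolidation that real memory managers perform.

上述四个阶段可以勾画为一个最小的自包含管道。请注意这只是示意性草图:其中的关键词匹配和精确匹配去重代替了真实记忆管理器中由 LLM 驱动的提取和整合。

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    fact: str
    sources: list = field(default_factory=list)  # provenance: contributing event IDs

def extract(events, topics):
    """Stage 2: keep only utterances matching a topic definition (keyword stub for the LLM)."""
    facts = []
    for event in events:
        if any(k in event["text"].lower() for kws in topics.values() for k in kws):
            facts.append(Memory(fact=event["text"], sources=[event["id"]]))
    return facts

def consolidate(new_facts, store):
    """Stage 3: merge duplicates instead of appending blindly (exact match stands in for an LLM)."""
    for new in new_facts:
        match = next((m for m in store if m.fact == new.fact), None)
        if match:
            match.sources.extend(new.sources)  # merge provenance into the existing memory
        else:
            store.append(new)                  # create an entirely new memory
    return store

# Stage 1 (ingestion) feeds the pipeline; Stage 4 (storage) would persist `store`.
events = [{"id": "e1", "text": "I prefer window seats"},
          {"id": "e2", "text": "hello there"}]
store = consolidate(extract(events, {"preferences": ["prefer"]}), [])
```

Only the utterance matching a topic survives; the greeting produces no memory, mirroring the filtering described in stage 2.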

A managed memory manager, like Agent Engine Memory Bank, fully automates this pipeline. It provides a single, coherent system for turning conversational noise into structured knowledge, allowing developers to focus on agent logic rather than building and maintaining the underlying data infrastructure themselves. For example, triggering memory generation with Memory Bank only requires a simple API call17:

像 Agent Engine Memory Bank 这样的托管记忆管理器完全自动化了这一管道。它提供了一个单一、连贯的系统,将对话噪音转化为结构化知识,使开发者能够专注于智能体逻辑,而不是自己构建和维护底层数据基础设施。例如,使用 Memory Bank 触发记忆生成只需要一个简单的 API 调用¹⁷:

Python

from google.cloud import vertexai

client = vertexai.Client(project=..., location=...)

client.agent_engines.memories.generate(
    name="projects/.../locations/...reasoningEngines/...",
    scope={"user_id": "123"},
    direct_contents_source={
        "events": [...]
    },
    config={
        # Run memory generation in the background.
        "wait_for_completion": False
    }
)

Snippet 6: Generate memories with Agent Engine Memory Bank

代码片段 6:使用 Agent Engine Memory Bank 生成记忆

The process of memory generation can be compared to the work of a diligent gardener tending to a garden. Extraction is like receiving new seeds and saplings (new information from a conversation). The gardener doesn’t just throw them randomly onto the plot. Instead, they perform Consolidation by pulling out weeds (deleting redundant or conflicting data), pruning back overgrown branches to improve the health of existing plants (refining and summarizing existing memories), and then carefully planting the new saplings in the optimal location. This constant, thoughtful curation ensures the garden remains healthy, organized, and continues to flourish over time, rather than becoming an overgrown, unusable mess. This asynchronous process happens in the background, ensuring the garden is always ready for the next visit.

记忆生成的过程可以比作一个勤奋的园丁照料花园的工作。提取就像接收新的种子和幼苗(来自对话的新信息)。园丁不会只是随机地把它们扔到地块上。相反,他们通过拔除杂草(删除冗余或冲突的数据)、修剪过度生长的枝条以改善现有植物的健康(精炼和总结现有记忆),然后仔细地将新幼苗种植在最佳位置来执行整合。这种持续的、深思熟虑的策划确保花园保持健康、有组织,并随时间继续繁荣,而不是变成一团杂乱无章、无法使用的乱草。这个异步过程在后台发生,确保花园随时为下次访问做好准备。

Now, let’s dive into the two key steps of memory generation: extraction and consolidation.

现在,让我们深入了解记忆生成的两个关键步骤:提取和整合。

Deep-dive: Memory Extraction

深入探讨:记忆提取

The goal of memory extraction is to answer the fundamental question: “What information in this conversation is meaningful enough to become a memory?” This is not simple summarization; it is a targeted, intelligent filtering process designed to separate the signal (important facts, preferences, goals) from the noise (pleasantries, filler text).

记忆提取的目标是回答一个基本问题:**”这次对话中哪些信息足够有意义以成为记忆?”** 这不是简单的摘要;它是一个有针对性的、智能的过滤过程,旨在将信号(重要事实、偏好、目标)与噪音(寒暄、填充文本)分开。

“Meaningful” is not a universal concept; it is defined entirely by the agent’s purpose and use case. What a customer support agent needs to remember (e.g., order numbers, technical issues) is fundamentally different from what a personal wellness coach needs to remember (e.g., long-term goals, emotional states). Customizing what information is preserved is therefore the key to creating a truly effective agent.

“有意义”不是一个普遍的概念;它完全由智能体的目的和用例定义。客户支持智能体需要记住的内容(例如订单号、技术问题)与个人健康教练需要记住的内容(例如长期目标、情绪状态)根本不同。因此,自定义保留哪些信息是创建真正有效智能体的关键。

The memory manager’s LLM decides what to extract by following a carefully constructed set of programmatic guardrails and instructions, usually embedded in a complex system prompt. This prompt defines what “meaningful” means by providing the LLM with a set of topic definitions. With schema- and template-based extraction, the LLM is given a predefined JSON schema or a template using LLM features like structured output18; the LLM is instructed to populate the JSON with the corresponding information from the conversation. Alternatively, with natural language topic definitions, the LLM is guided by a simple natural language description of the topic.

记忆管理器的 LLM 通过遵循一组精心构建的程序化护栏和指令来决定提取什么,这些通常嵌入在复杂的系统提示中。此提示通过为 LLM 提供一组主题定义来定义”有意义”的含义。使用模式和基于模板的提取,LLM 被给予一个预定义的 JSON 模式或使用 LLM 功能(如结构化输出¹⁸)的模板;LLM 被指示使用对话中的相应信息构建 JSON。或者,使用自然语言主题定义,LLM 由简单的自然语言主题描述引导。

With few-shot prompting, the LLM is “shown” what information to extract using examples. The prompt includes several examples of input text and the ideal, high-fidelity memory that should be extracted. The LLM learns the desired extraction pattern from the examples, making it highly effective for custom or nuanced topics that are difficult to describe with a schema or a simple definition.

使用少样本提示,LLM 通过示例”展示”要提取的信息。提示包含输入文本的几个示例以及应该提取的理想、高保真记忆。LLM 从示例中学习所需的提取模式,使其对于难以用模式或简单定义描述的自定义或细微主题非常有效。

Most memory managers work out-of-the-box by looking for common topics, such as user preferences, key facts, or goals. Many platforms also allow developers to define their own custom topics, tailoring the extraction process to their specific domain. For example, you can customize what information Agent Engine Memory Bank considers to be meaningful to be persisted by providing your own topic definitions and few-shot examples19:

大多数记忆管理器通过寻找常见主题(如用户偏好、关键事实或目标)开箱即用。许多平台还允许开发者定义自己的自定义主题,根据其特定领域定制提取过程。例如,你可以通过提供自己的主题定义和少样本示例¹⁹来自定义 Agent Engine Memory Bank 认为有意义需要持久化的信息:

Python

from google.genai.types import Content, Part

# See https://cloud.google.com/agent-builder/agent-engine/memory-bank/set-up for more information.
memory_bank_config = {
    "customization_configs": [{
        "memory_topics": [
            {"managed_memory_topic": {"managed_topic_enum": "USER_PERSONAL_INFO"}},
            {
                "custom_memory_topic": {
                    "label": "business_feedback",
                    "description": """Specific user feedback about their experience at the coffee shop. This includes opinions on drinks, food, pastries, ambiance, staff friendliness, service speed, cleanliness, and any suggestions for improvement."""
                }
            }
        ],
        "generate_memories_examples": {
            "conversationSource": {
                "events": [
                    {
                        "content": Content(
                            role="model",
                            parts=[Part(text="Welcome back to The Daily Grind! We'd love to hear your feedback on your visit.")])
                    },
                    {
                        "content": Content(
                            role="user",
                            parts=[Part(text="Hey. The drip coffee was a bit lukewarm today, which was a bummer. Also, the music was way too loud, I could barely hear my friend.")])
                    }
                ]
            },
            "generatedMemories": [
                {"fact": "The user reported that the drip coffee was lukewarm."},
                {"fact": "The user felt the music in the shop was too loud."}
            ]
        }
    }]
}

agent_engine = client.agent_engines.create(
    config={
        "context_spec": {"memory_bank_config": memory_bank_config}
    }
)

Snippet 7: Customizing what information Agent Engine Memory Bank considers meaningful to persist

代码片段 7:自定义 Agent Engine Memory Bank 认为有意义需要持久化的信息

Although memory extraction itself is not “summarization,” the algorithm may incorporate summarization to distill information. To enhance efficiency, many memory managers incorporate a rolling summary of the conversation directly into the memory extraction prompt20. This condensed history provides the necessary context to extract key information from the most recent interactions. It eliminates the need to repeatedly process the full, verbose dialogue with each turn to maintain context.

虽然记忆提取本身不是”摘要”,但算法可能会结合摘要来提炼信息。为了提高效率,许多记忆管理器将对话的滚动摘要直接纳入记忆提取提示²⁰中。这种压缩的历史提供了从最近交互中提取关键信息所需的上下文。它消除了在每轮中重复处理完整、冗长对话以维护上下文的需要。

Once information has been extracted from the data source, the existing corpus of memories must be updated to reflect the new information via consolidation.

一旦从数据源中提取了信息,现有的记忆语料库必须通过整合更新以反映新信息。

Deep-dive: Memory Consolidation

深入探讨:记忆整合

After memories are extracted from the verbose conversation, consolidation integrates the new information into a coherent, accurate, and evolving knowledge base. It is arguably the most sophisticated stage in the memory lifecycle, transforming a simple collection of facts into a curated understanding of the user. Without consolidation, an agent’s memory would quickly become a noisy, contradictory, and unreliable log of every piece of information ever captured. This “self-curation” is typically managed by an LLM and is what elevates a memory manager beyond a simple database.

在从冗长的对话中提取记忆之后,整合应该将新信息集成到一个连贯的、准确的和不断发展的知识库中。它可以说是记忆生命周期中最复杂的阶段,将简单的事实集合转变为对用户的策划理解。没有整合,智能体的记忆会很快变成每一条曾经捕获的信息的嘈杂、矛盾和不可靠的日志。这种”自我策划”通常由 LLM 管理,是将记忆管理器提升到超越简单数据库的关键。

Consolidation addresses fundamental problems arising from conversational data, including:

整合解决了对话数据产生的基本问题,包括:

• Information Duplication: A user might mention the same fact in multiple ways across different conversations (e.g., “I need a flight to NYC” and later “I’m planning a trip to New York”). A simple extraction process would create two redundant memories.

• 信息重复: 用户可能在不同的对话中以多种方式提到同一事实(例如,”我需要飞往纽约的航班”,后来又说”我计划去纽约旅行”)。简单的提取过程会创建两个冗余的记忆。

• Conflicting Information: A user’s state changes over time. Without consolidation, the agent’s memory would contain contradictory facts.

• 冲突信息: 用户的状态会随时间变化。没有整合,智能体的记忆将包含矛盾的事实。

• Information Evolution: A simple fact can become more nuanced. An initial memory that “the user is interested in marketing” might evolve into “the user is leading a marketing project focused on Q4 customer acquisition.”

• 信息演变: 一个简单的事实可以变得更加细致。初始记忆”用户对营销感兴趣”可能演变为”用户正在领导一个专注于第四季度客户获取的营销项目。”

• Memory Relevance Decay: Not all memories remain useful forever. An agent must engage in forgetting—proactively pruning old, stale, or low-confidence memories to keep the knowledge base relevant and efficient. Forgetting can happen by instructing the LLM to defer to newer information during consolidation or through automatic deletion via a time-to-live (TTL).

• 记忆相关性衰减: 并非所有记忆永远有用。智能体必须进行遗忘——主动修剪旧的、过时的或低置信度的记忆,以保持知识库的相关性和效率。遗忘可以通过指示 LLM 在整合过程中优先使用更新的信息,或通过生存时间(TTL)自动删除来实现。

The consolidation process is an LLM-driven workflow that compares newly extracted insights against the user’s existing memories. First, the workflow tries to retrieve existing memories that are similar to the newly extracted memories. These existing memories are candidates for consolidation. If the existing memory is contradicted by the new information, it may be deleted. If it is augmented, it may be updated.

整合过程是一个 LLM 驱动的工作流,将新提取的洞察与用户现有的记忆进行比较。首先,工作流尝试检索与新提取记忆相似的现有记忆。这些现有记忆是整合的候选对象。如果现有记忆被新信息否定,它可能被删除。如果它被增强,它可能被更新。

Second, an LLM is presented with both the existing memories and the new information. Its core task is to analyze them together and identify what operations should be performed. The primary operations include:

其次,LLM 被呈现现有记忆新信息。其核心任务是一起分析它们并确定应执行什么操作。主要操作包括:

• UPDATE: Modify an existing memory with new or corrected information.

• 更新: 用新的或更正的信息修改现有记忆。

• CREATE: If the new insight is entirely novel and unrelated to existing memories, create a new one.

• 创建: 如果新洞察完全新颖且与现有记忆无关,则创建一个新的。

• DELETE / INVALIDATE: If the new information makes an old memory completely irrelevant or incorrect, delete or invalidate it.

• 删除/使无效: 如果新信息使旧记忆完全无关或不正确,则删除或使其无效。

Finally, the memory manager translates the LLM’s decision into a transaction that updates the memory store.

最后,记忆管理器将 LLM 的决策转换为更新记忆存储的事务。
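As a hedged sketch of this workflow, the `decide_operations` function below stands in for the consolidation LLM (which would return the same kind of structured operations), and a plain dict stands in for the memory store:

作为此工作流的示意性草图,下面的 decide_operations 函数代替整合 LLM(后者会返回同类的结构化操作),普通字典代替记忆存储:

```python
def decide_operations(existing, new_fact):
    """Stand-in for the consolidation LLM: emit structured operations.
    A crude same-topic check (shared first word) replaces semantic comparison."""
    for mem_id, fact in existing.items():
        if fact.split()[0] == new_fact.split()[0]:
            return [{"action": "UPDATE", "id": mem_id, "fact": new_fact}]
    return [{"action": "CREATE", "id": f"m{len(existing) + 1}", "fact": new_fact}]

def apply_operations(store, operations):
    """Translate the LLM's decision into transactions against the memory store."""
    for op in operations:
        if op["action"] in ("CREATE", "UPDATE"):
            store[op["id"]] = op["fact"]
        elif op["action"] == "DELETE":
            store.pop(op["id"], None)
    return store

store = {"m1": "Diet: vegetarian"}
store = apply_operations(store, decide_operations(store, "Diet: vegan"))    # UPDATE m1
store = apply_operations(store, decide_operations(store, "Lives in Lyon"))  # CREATE m2
```

Separating the decision step from the transaction step mirrors the architecture described above: the LLM only proposes operations, and the memory manager executes them.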

Memory Provenance

记忆溯源

The classic machine learning axiom of “garbage in, garbage out” is even more critical for LLMs, where the outcome is often “garbage in, confident garbage out.” For an agent to make reliable decisions and for a memory manager to effectively consolidate memories, they must be able to critically evaluate the quality of their own memories. This trustworthiness is derived directly from a memory’s provenance—a detailed record of its origin and history.

经典的机器学习公理*”垃圾进,垃圾出”对于 LLM 来说更加关键,其结果通常是“垃圾进,自信的垃圾出”*。为了让智能体做出可靠的决策,让记忆管理器有效地整合记忆,它们必须能够批判性地评估其自身记忆的质量。这种可信度直接来源于记忆的溯源——其起源和历史的详细记录。

![][image7]
Figure 7: The flow of information between data sources and memories. A single memory can be derived from multiple data sources, and a single data source may contribute to multiple memories.

图 7:数据源和记忆之间的信息流。单个记忆可以从多个数据源派生,单个数据源可能贡献多个记忆。

The process of memory consolidation—merging information from multiple sources into a single, evolving memory—creates the need to track its lineage. As shown in the diagram above, a single memory might be a blend of multiple data sources, and a single source might be segmented into multiple memories.

记忆整合的过程——将来自多个来源的信息合并到一个单一的、不断发展的记忆中——产生了跟踪其血统的需要。如上图所示,单个记忆可能是多个数据源的混合,单个来源可能被分割成多个记忆。

To assess trustworthiness, the agent must track key details for each source, such as its origin (source type) and age (“freshness”). These details are critical for two reasons: they dictate the weight each source has during memory consolidation, and they inform how much the agent should rely on that memory during inference.

为了评估可信度,智能体必须跟踪每个来源的关键细节,如其起源(来源类型)和年龄(”新鲜度”)。这些细节至关重要,原因有两个:它们决定了每个来源在记忆整合过程中的权重,以及它们告知智能体在推理过程中应该多大程度上依赖该记忆。

The source type is one of the most important factors in determining trust. Data sources fall into three main categories:

来源类型是确定信任的最重要因素之一。数据源分为三个主要类别:

• Bootstrapped Data: Information pre-loaded from internal systems, such as a CRM. This high-trust data can be used to initialize a user’s memories to address the cold-start problem, which is the challenge of providing a personalized experience to a user the agent has never interacted with before.

• 引导数据: 从内部系统(如 CRM)预加载的信息。这种高信任度的数据可用于初始化用户的记忆,以解决冷启动问题,即为智能体从未交互过的用户提供个性化体验的挑战。

• User Input: This includes data provided explicitly (e.g., via a form, which is high-trust) or information extracted implicitly from a conversation (which is generally less trustworthy).

• 用户输入: 这包括显式提供的数据(例如通过表单,这是高信任度的)或从对话中隐式提取的信息(这通常不太可信)。

• Tool Output: Data returned from an external tool call. Generating memories from Tool Output is generally discouraged because these memories tend to be brittle and stale, making this source type better suited for short-term caching.

• 工具输出: 从外部工具调用返回的数据。通常不建议从工具输出生成记忆,因为这些记忆往往脆弱且过时,使得这种来源类型更适合短期缓存。

Accounting for memory lineage during memory management

记忆管理中的血统追踪

This dynamic, multi-source approach to memory creates two primary operational challenges when managing memories: conflict resolution and deleting derived data.

这种动态的、多源的记忆方法在管理记忆时产生两个主要的操作挑战:冲突解决删除派生数据

Memory consolidation inevitably leads to conflicts where one data source conflicts with another. A memory’s provenance allows the memory manager to establish a hierarchy of trust for its information sources. When memories from different sources contradict each

记忆整合不可避免地会导致一个数据源与另一个数据源冲突的情况。记忆的溯源允许记忆管理器为其信息源建立信任层次结构。当来自不同来源的记忆相互

other, the agent must use this hierarchy in a conflict resolution strategy. Common strategies include prioritizing the most trusted source, favoring the most recent information, or looking for corroboration across multiple data points.

矛盾时,智能体必须在冲突解决策略中使用此层次结构。常见策略包括优先考虑最受信任的来源、优先使用最新信息,或在多个数据点之间寻找佐证。
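A minimal sketch of such a strategy, assuming an illustrative trust ranking over the source types described earlier:

这种策略的最小草图,假设对前面描述的来源类型使用一个示意性的信任排名:

```python
# Illustrative trust ranking over the source types described above.
TRUST = {"bootstrapped": 3, "explicit_user": 2, "conversation": 1, "tool_output": 0}

def resolve(candidates):
    """Pick the winning fact: most trusted source first, most recent as the tie-breaker."""
    return max(candidates, key=lambda m: (TRUST[m["source_type"]], m["timestamp"]))

conflicting = [
    {"fact": "Lives in Paris", "source_type": "conversation", "timestamp": 2},
    {"fact": "Lives in Lyon",  "source_type": "bootstrapped", "timestamp": 1},  # CRM record
]
winner = resolve(conflicting)
```

Here the older but higher-trust CRM record wins over the newer conversational inference; swapping the key order would implement a recency-first strategy instead.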

Another challenge to managing memories occurs when deleting memories. A memory can be derived from multiple data sources. When a user revokes access to one data source, data derived from that source should also be removed. Deleting every memory “touched” by that source can be overly aggressive. A more precise, though computationally expensive, approach is to regenerate the affected memories from scratch using only the remaining, valid sources.

管理记忆的另一个挑战发生在删除记忆时。一个记忆可以从多个数据源派生。当用户撤销对一个数据源的访问时,从该来源派生的数据也应该被删除。删除该来源”触及”的每个记忆可能过于激进。一种更精确但计算成本更高的方法是仅使用剩余的有效来源从头重新生成受影响的记忆。

Beyond static provenance details, confidence in a memory must evolve. Confidence increases through corroboration, such as when multiple trusted sources provide consistent information. However, an efficient memory system must also actively curate its existing knowledge through memory pruning—a process that identifies and “forgets” memories that are no longer useful. This pruning can be triggered by several factors.

除了静态的溯源详情之外,对记忆的信心必须不断演变。信心通过佐证增加,例如当多个受信任的来源提供一致的信息时。然而,一个高效的记忆系统还必须通过记忆修剪主动策划其现有知识——这是一个识别和”遗忘”不再有用的记忆的过程。这种修剪可以由几个因素触发。

• Time-based Decay: The importance of a memory can decrease over time. A memory about a meeting from two years ago is likely less relevant than one from last week.

• 基于时间的衰减: 记忆的重要性可能会随时间降低。两年前的会议记忆可能不如上周的相关。

• Low Confidence: A memory that was created from a weak inference and was never corroborated by other sources may be pruned.

• 低置信度: 从弱推断创建且从未被其他来源佐证的记忆可能会被修剪。

• Irrelevance: As an agent gains a more sophisticated understanding of a user, it might determine that some older, trivial memories are no longer relevant to the user’s current goals.

• 无关性: 随着智能体对用户获得更复杂的理解,它可能会确定一些较旧的、琐碎的记忆不再与用户当前的目标相关。

By combining a reactive consolidation pipeline with proactive pruning, the memory manager ensures that the agent’s knowledge base is not just a growing log of everything ever said. Instead, it’s a curated, accurate, and relevant understanding of the user.

通过将反应性整合管道与主动修剪相结合,记忆管理器确保智能体的知识库不仅仅是曾经说过的所有内容的不断增长的日志。相反,它是对用户的策划、准确和相关的理解。
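A simple sketch of pruning driven by the first two triggers above (the field names and thresholds are illustrative):

由上述前两个触发因素驱动的修剪的简单草图(字段名和阈值仅为示意):

```python
DAY = 86400  # seconds

def prune(memories, now, max_age_days=365, min_confidence=0.3):
    """Forget memories that are too old or were never corroborated."""
    kept = []
    for m in memories:
        stale = (now - m["created_at"]) / DAY > max_age_days  # time-based decay
        weak = m["confidence"] < min_confidence               # low confidence
        if not (stale or weak):
            kept.append(m)
    return kept

now = 1_000 * DAY
memories = [
    {"fact": "met two years ago", "created_at": 0,         "confidence": 0.9},  # stale
    {"fact": "weak inference",    "created_at": 999 * DAY, "confidence": 0.1},  # uncorroborated
    {"fact": "goal: run a 10k",   "created_at": 990 * DAY, "confidence": 0.8},  # kept
]
kept = prune(memories, now)
```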

Accounting for memory lineage during inference

推理时的血统追踪

In addition to accounting for a memory’s lineage while curating the corpus’s contents, a memory’s trustworthiness should also be considered at inference time. An agent’s confidence in a memory should not be static; it must evolve based on new information and the passage of time. Confidence increases through corroboration, such as when multiple trusted sources provide consistent information. Conversely, confidence decreases (or decays) over time as older memories become stale, and it also drops when contradictory information is introduced. Eventually, the system can “forget” by archiving or deleting low confidence memories. This dynamic confidence score is critical during inference time. Rather than being shown to the user, memories and, if available, their confidence scores are injected into the prompt, enabling the LLM to assess information reliability and make more nuanced decisions.

除了在策划语料库内容时考虑记忆的血统之外,在推理时也应考虑记忆的可信度。智能体对记忆的信心不应该是静态的;它必须基于新信息和时间的推移而演变。信心通过佐证增加,例如当多个受信任的来源提供一致的信息时。相反,随着较旧的记忆变得过时,信心会随时间降低(或衰减),当引入矛盾信息时信心也会下降。最终,系统可以通过存档或删除低置信度的记忆来”遗忘”。这个动态的置信度分数在推理时至关重要。记忆及其置信度分数(如果可用)不是展示给用户,而是注入到提示中,使 LLM 能够评估信息可靠性并做出更细致的决策。

This entire trust framework serves the agent’s internal reasoning process. Memories and their confidence scores are not typically shown to the user directly. Instead, they are injected into the system prompt, allowing the LLM to weigh the evidence, consider the reliability of its information, and ultimately make more nuanced and trustworthy decisions.

整个信任框架服务于智能体的内部推理过程。记忆及其置信度分数通常不会直接展示给用户。相反,它们被注入到系统提示中,允许 LLM 权衡证据,考虑其信息的可靠性,并最终做出更细致和可信的决策。

Triggering memory generation

触发记忆生成

Although memory managers automate memory extraction and consolidation once generation is triggered, the agent must still decide when memory generation should be attempted. This is a critical architectural choice, balancing data freshness against computational cost and latency. This decision is typically managed by the agent’s logic, which can employ several triggering strategies. Memory generation can be initiated based on various events:

虽然记忆管理器在触发生成后自动化记忆提取和整合,但智能体仍必须决定何时应尝试记忆生成。这是一个关键的架构选择,平衡数据新鲜度与计算成本和延迟。此决策通常由智能体的逻辑管理,可以采用多种触发策略。记忆生成可以基于各种事件启动:

• Session Completion: Triggering generation at the end of a multi-turn session.

• 会话完成: 在多轮会话结束时触发生成。

• Turn Cadence: Running the process after a specific number of turns (e.g., every 5 turns).

• 轮次节奏: 在特定轮次后运行该过程(例如每 5 轮)。

• Real-Time: Generating memories after every single turn.

• 实时: 每轮后生成记忆。

• Explicit Command: Activating the process upon a direct user command (e.g., “Remember this”).

• 显式命令: 在直接用户命令时激活该过程(例如,”记住这个”)

The choice of trigger involves a direct tradeoff between cost and fidelity. Frequent generation (e.g., real-time) ensures memories are highly detailed and fresh, capturing every nuance of the conversation. However, this incurs the highest LLM and database costs and can introduce latency if not handled properly. Infrequent generation (e.g., at session completion) is far more cost-effective but risks creating lower-fidelity memories, as the LLM must summarize a much larger block of conversation at once. You also want to be careful that the memory manager is not processing the same events multiple times, as that introduces unnecessary cost.

触发器的选择涉及成本和保真度之间的直接权衡。频繁生成(例如实时)确保记忆高度详细和新鲜,捕捉对话的每个细微差别。然而,这会产生最高的 LLM 和数据库成本,如果处理不当还可能引入延迟。不频繁生成(例如在会话完成时)更具成本效益,但有创建较低保真度记忆的风险,因为 LLM 必须一次总结更大块的对话。你还需要注意记忆管理器不要多次处理相同的事件,因为这会引入不必要的成本。
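The strategies above can be combined in a simple gating function; the names and the default cadence below are illustrative:

上述策略可以组合成一个简单的门控函数;下面的名称和默认轮次节奏仅为示意:

```python
def should_generate(turn_count, session_ended=False, user_command="", cadence=5):
    """Combine the triggering strategies: session completion, explicit command, turn cadence."""
    if session_ended:                        # session completion
        return True
    if "remember" in user_command.lower():   # explicit command, e.g. "Remember this"
        return True
    return turn_count > 0 and turn_count % cadence == 0  # every N turns
```

Real-time triggering is the degenerate case `cadence=1`; raising the cadence trades memory freshness for lower LLM and database cost.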

Memory-as-a-Tool

记忆即工具

A more sophisticated approach is to allow the agent to decide for itself when to create a memory. In this pattern, memory generation is exposed as a tool (e.g., `create_memory`); the tool definition should describe what types of information should be considered meaningful. The agent can then analyze the conversation and autonomously decide to call this tool when it identifies information that is meaningful to persist. This shifts the responsibility for identifying “meaningful information” from the external memory manager to the agent (and thus you as the developer).

一种更复杂的方法是允许智能体自己决定何时创建记忆。在这种模式中,记忆生成作为工具公开(即 create_memory);工具定义应该定义哪些类型的信息应被视为有意义的。然后智能体可以分析对话,并在识别出值得持久化的有意义信息时自主决定调用此工具。这将识别”有意义信息”的责任从外部记忆管理器转移到智能体(因此也是你作为开发者)本身。

For example, you can do this using ADK by packaging your memory generation code into a Tool21 that the agent decides to invoke when it deems the conversation meaningful to persist. You can send the Session to Memory Bank, and Memory Bank will extract and consolidate memories from the conversation history:

例如,你可以通过将记忆生成代码打包到一个工具²¹中来使用 ADK 实现这一点,智能体在认为对话值得持久化时决定调用该工具。你可以将会话发送到 Memory Bank,Memory Bank 将从对话历史中提取和整合记忆:

Python

from google.adk.agents import LlmAgent
from google.adk.memory import VertexAiMemoryBankService
from google.adk.runners import Runner
from google.adk.tools import ToolContext

def generate_memories(tool_context: ToolContext):
    """Triggers memory generation to remember the session."""
    # Option 1: Extract memories from the complete conversation history using the
    # ADK memory service.
    tool_context._invocation_context.memory_service.add_session_to_memory(session)

    # Option 2: Extract memories from the last conversation turn.
    client.agent_engines.memories.generate(
        name="projects/.../locations/...reasoningEngines/...",
        direct_contents_source={
            "events": [
                {"content": tool_context._invocation_context.user_content}
            ]
        },
        scope={
            "user_id": tool_context._invocation_context.user_id,
            "app_name": tool_context._invocation_context.app_name
        },
        # Generate memories in the background
        config={"wait_for_completion": False}
    )
    return {"status": "success"}

agent = LlmAgent(
    ...,
    tools=[generate_memories]
)

runner = Runner(
    agent=agent,
    app_name=APP_NAME,
    session_service=session_service,
    memory_service=VertexAiMemoryBankService(
        agent_engine_id=AGENT_ENGINE_ID,
        project=PROJECT,
        location=LOCATION
    )
)

Snippet 8: ADK agent using a custom tool to trigger memory generation. Memory Bank will extract and consolidate the memories.

代码片段 8:使用自定义工具触发记忆生成的 ADK 智能体。Memory Bank 将提取和整合记忆。

Another approach is to leverage internal memory, where the agent actively decides what to remember from a conversation. In this workflow, the agent is responsible for extracting key information. Optionally, these extracted memories are then sent to Agent Engine Memory Bank to be consolidated with the user’s existing memories22:

另一种方法是利用内部记忆,其中智能体主动决定从对话中记住什么。在此工作流中,智能体负责提取关键信息。可选地,这些提取的记忆然后被发送到 Agent Engine Memory Bank 以与用户的现有记忆整合²²:

Python

def extract_memories(query: str, tool_context: ToolContext):
    """Triggers memory generation to remember information.

    Args:
        query: Meaningful information that should be persisted about the user.
    """
    client.agent_engines.memories.generate(
        name="projects/.../locations/...reasoningEngines/...",
        # The meaningful information is already extracted from the conversation, so we
        # just want to consolidate it with existing memories for the same user.
        direct_memories_source={
            "direct_memories": [{"fact": query}]
        },
        scope={
            "user_id": tool_context._invocation_context.user_id,
            "app_name": tool_context._invocation_context.app_name
        },
        config={"wait_for_completion": False}
    )
    return {"status": "success"}

agent = LlmAgent(
    ...,
    tools=[extract_memories]
)

Snippet 9: ADK agent using a custom tool to extract memories from the conversation and trigger consolidation with Agent Engine Memory Bank. Unlike Snippet 8, the agent is responsible for extracting memories, not Memory Bank.

代码片段 9:使用自定义工具从对话中提取记忆并触发与 Agent Engine Memory Bank 整合的 ADK 智能体。与代码片段 8 不同,智能体负责提取记忆,而不是 Memory Bank。

Background vs. Blocking Operations

后台操作与阻塞操作

Memory generation is an expensive operation requiring LLM calls and database writes. For agents in production, memory generation should almost always be handled asynchronously as a background process23.

记忆生成是一个昂贵的操作,需要 LLM 调用和数据库写入。对于生产中的智能体,记忆生成几乎总是应该作为后台进程异步处理²³。

After an agent sends its response to the user, the memory generation pipeline can run in parallel without blocking the user experience. This decoupling is essential for keeping the agent feeling fast and responsive. A blocking (or synchronous) approach, where the user has to wait for the memory to be written before receiving a response, would create an unacceptably slow and frustrating user experience. This necessitates that memory generation occurs in a service that is architecturally separate from the agent’s core runtime.

在智能体向用户发送响应后,记忆生成管道可以并行运行而不阻塞用户体验。这种解耦对于保持智能体快速和响应感至关重要。阻塞(或同步)方法——用户必须等待记忆写入后才能收到响应——会创造一个令人无法接受的缓慢和令人沮丧的用户体验。这需要记忆生成发生在与智能体核心运行时架构上分离的服务中。
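In production this decoupling usually runs in a separate service or task queue; as a minimal illustration, a background thread can run a stubbed pipeline after the reply has already been produced:

在生产中,这种解耦通常运行在单独的服务或任务队列中;作为最小示意,后台线程可以在回复已经产生后运行一个打桩的管道:

```python
import threading

RESULTS = []  # stand-in for the memory store written by the background pipeline

def generate_memories(history, done: threading.Event):
    """Placeholder for the expensive LLM + database work."""
    RESULTS.append(len(history))  # e.g. number of events processed
    done.set()

def respond(user_message, history):
    reply = f"echo: {user_message}"  # the agent answers the user first
    done = threading.Event()
    # Fire-and-forget: memory generation runs off the critical path.
    threading.Thread(target=generate_memories, args=(history, done), daemon=True).start()
    return reply, done

reply, done = respond("hi", [{"role": "user", "text": "hi"}])
done.wait(timeout=2)  # only this demo waits; the user already has the reply
```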

Memory Retrieval

记忆检索

With a mechanism for memory generation in place, your focus can shift to the critical task of retrieval. An intelligent retrieval strategy is essential for an agent’s performance, encompassing decisions about which memories should be retrieved and when to retrieve them.

有了记忆生成机制,你的关注点可以转向检索的关键任务。智能检索策略对智能体的性能至关重要,包括关于应检索哪些记忆以及何时检索它们的决策。

The strategy for retrieving a memory depends heavily on how memories are organized. For a structured user profile, retrieval is typically a straightforward lookup for the full profile or a specific attribute. For a collection of memories, however, retrieval is a far more complex search problem. The goal is to discover the most pertinent, conceptually related information from a large pool of unstructured or semi-structured data. The strategies discussed in this section are designed to solve this complex retrieval challenge for memory collections.

检索记忆的策略在很大程度上取决于记忆的组织方式。对于结构化用户档案,检索通常是对完整档案或特定属性的直接查找。然而,对于记忆集合,检索是一个复杂得多的搜索问题。目标是从大量非结构化或半结构化数据池中发现最相关的、概念相关的信息。本节讨论的策略旨在解决记忆集合的这种复杂检索挑战。

Memory retrieval searches for the most pertinent memories for the current conversation. An effective retrieval strategy is crucial; providing irrelevant memories can confuse the model and degrade its response, while finding the perfect piece of context can lead to a remarkably intelligent interaction. The core challenge is balancing memory ‘usefulness’ within a strict latency budget.

记忆检索为当前对话搜索最相关的记忆。有效的检索策略至关重要;提供无关记忆可能会混淆模型并降低其响应质量,而找到完美的上下文片段可以带来非常智能的交互。核心挑战是在严格的延迟预算内平衡记忆的”有用性”。

Advanced memory systems go beyond a simple search and score potential memories across multiple dimensions to find the best fit.

高级记忆系统超越简单搜索,跨多个维度对潜在记忆进行评分以找到最佳匹配。

• Relevance (Semantic Similarity): How conceptually related is this memory to the current conversation?

• 相关性(语义相似性): 这个记忆与当前对话的概念相关程度如何?

• Recency (Time-based): How recently was this memory created?

• 新近性(基于时间): 这个记忆是多近创建的?

• Importance (Significance): How critical is this memory overall? Unlike relevance, the “importance” of a memory may be defined at generation-time.

• 重要性(显著性): 这个记忆总体上有多关键?与相关性不同,记忆的”重要性”可能在生成时定义。

Relying solely on vector-based relevance is a common pitfall. Similarity scores can surface memories that are conceptually similar but old or trivial. The most effective strategy is a blended approach that combines the scores from all three dimensions.

仅依赖基于向量的相关性是一个常见陷阱。相似性分数可能会浮现概念相似但旧的或琐碎的记忆。最有效的策略是结合所有三个维度分数的混合方法。
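One way to blend the three dimensions is a weighted score with exponential time decay; the weights and half-life below are illustrative, not prescribed:

混合这三个维度的一种方式是带指数时间衰减的加权分数;下面的权重和半衰期仅为示意,并非规定:

```python
def blended_score(memory, query_similarity, now,
                  w_rel=0.6, w_rec=0.2, w_imp=0.2, half_life_days=30):
    """Weighted mix of relevance, recency, and importance."""
    age_days = (now - memory["created_at"]) / 86400
    recency = 0.5 ** (age_days / half_life_days)  # exponential time decay
    return (w_rel * query_similarity
            + w_rec * recency
            + w_imp * memory["importance"])

now = 400 * 86400
old_similar = {"created_at": 100 * 86400, "importance": 0.1}      # very similar, 300 days old
fresh_important = {"created_at": 399 * 86400, "importance": 0.9}  # less similar, 1 day old
score_a = blended_score(old_similar, query_similarity=0.9, now=now)
score_b = blended_score(fresh_important, query_similarity=0.7, now=now)
```

With these weights, the fresh and important memory outranks the stale one despite its lower similarity, avoiding the pure vector-relevance pitfall described above.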

For applications where accuracy is paramount, retrieval can be refined using approaches like query rewriting, reranking, or specialized retrievers. However, these techniques are computationally expensive and add significant latency, making them unsuitable for most real-time applications. For scenarios where these complex algorithms are necessary and the memories do not quickly become stale, a caching layer can be an effective mitigation. Caching allows the expensive results of a retrieval query to be temporarily stored, bypassing the high latency cost for subsequent identical requests.

对于准确性至上的应用,可以使用查询重写、重新排序或专门的检索器等方法来优化检索。然而,这些技术计算成本高昂,会增加显著延迟,使它们不适合大多数实时应用。对于需要这些复杂算法且记忆不会很快过时的场景,缓存层可以是一个有效的缓解措施。缓存允许临时存储检索查询的昂贵结果,从而绕过后续相同请求的高延迟成本。

With query rewriting, an LLM can be used to improve the search query itself. This can involve rewriting a user’s ambiguous input into a more precise query, or expanding a single query into multiple related ones to capture different facets of a topic. While this significantly improves the quality of the initial search results, it adds the latency of an extra LLM call at the start of the process.

使用查询重写,可以使用 LLM 改进搜索查询本身。这可能涉及将用户的模糊输入重写为更精确的查询,或将单个查询扩展为多个相关查询以捕获主题的不同方面。虽然这显著提高了初始搜索结果的质量,但它在过程开始时增加了额外 LLM 调用的延迟。

With reranking, an initial retrieval fetches a broad set of candidate memories (e.g., the top 50 results) using similarity search. Then, an LLM can re-evaluate and re-rank this smaller set to produce a more accurate final list24.

使用重新排序,初始检索使用相似性搜索获取广泛的候选记忆集(例如前 50 个结果)。然后,LLM 可以重新评估并重新排序这个较小的集合以产生更准确的最终列表²⁴。

Finally, you can train a specialized retriever using fine-tuning. However, this requires access to labeled data and can significantly increase costs.

最后,你可以使用微调训练专门的检索器。然而,这需要访问标记数据,并可能显著增加成本。

Ultimately, the best approach to retrieval starts with better memory generation. Ensuring the memory corpus is high-quality and free of irrelevant information is the most effective way to guarantee that any set of retrieved memories will be helpful.

最终,检索的最佳方法始于更好的记忆生成。确保记忆语料库高质量且没有无关信息是保证任何检索到的记忆集都有帮助的最有效方式。

Timing for retrieval

检索时机

The final architectural decision for retrieval is when to retrieve memories. One approach is proactive retrieval, where memories are automatically loaded at the start of every turn. This ensures context is always available but introduces unnecessary latency for turns that don’t require memory access. Since memories remain static throughout a single turn, they can be efficiently cached to mitigate this performance cost.

检索的最终架构决策是何时检索记忆。一种方法是主动检索,其中记忆在每轮开始时自动加载。这确保上下文始终可用,但为不需要记忆访问的轮次引入了不必要的延迟。由于记忆在单轮中保持静态,它们可以被高效缓存以缓解这种性能成本。

For example, you can implement proactive retrieval in ADK using the built-in PreloadMemoryTool or a custom callback²⁵:

例如,你可以使用内置的 PreloadMemoryTool 或自定义回调²⁵在 ADK 中实现主动检索:

Python

# Option 1: Use the built-in PreloadMemoryTool, which retrieves memories
# with similarity search every turn.
agent = LlmAgent(
    ...,
    tools=[adk.tools.preload_memory_tool.PreloadMemoryTool()],
)

# Option 2: Use a custom callback for more control over how memories
# are retrieved.
def retrieve_memories_callback(callback_context, llm_request):
    user_id = callback_context._invocation_context.user_id
    app_name = callback_context._invocation_context.app_name
    response = client.agent_engines.memories.retrieve(
        name="projects/.../locations/...reasoningEngines/...",
        scope={
            "user_id": user_id,
            "app_name": app_name,
        },
    )
    memories = [f"* {memory.memory.fact}" for memory in list(response)]
    if not memories:
        # No memories to add to the System Instructions.
        return
    # Append formatted memories to the System Instructions.
    llm_request.config.system_instruction += "\nHere is information that you have about the user:\n"
    llm_request.config.system_instruction += "\n".join(memories)

agent = LlmAgent(
    ...,
    before_model_callback=retrieve_memories_callback,
)

Snippet 10: Retrieve memories at the start of every turn with ADK using a built-in tool or custom callback

代码片段 10:使用内置工具或自定义回调在每轮开始时用 ADK 检索记忆

Alternatively, you can use reactive retrieval (“Memory-as-a-Tool”), where the agent is given a tool to query its memory and decides for itself when to retrieve context. This is more efficient and robust, but each retrieval requires an additional LLM call, increasing latency and cost; because memory is retrieved only when necessary, that cost is incurred less frequently. Additionally, the agent may not know whether relevant information exists to be retrieved. This can be mitigated by making the agent aware of the types of memories available (e.g., in the tool’s description if you’re using a custom tool), allowing for a more informed decision on when to query.

或者,你可以使用反应式检索("记忆即工具"),其中智能体被给予一个查询其记忆的工具,自己决定何时检索上下文。这更高效和健壮,但每次检索需要额外的 LLM 调用,增加延迟和成本;由于记忆仅在必要时检索,这种成本的发生频率较低。此外,智能体可能不知道是否存在相关信息可供检索。这可以通过让智能体了解可用的记忆类型来缓解(例如,如果你使用自定义工具,可以在工具的描述中说明),从而在何时查询的问题上做出更明智的决定。

Python

# Option 1: Use the built-in LoadMemoryTool.
agent = LlmAgent(
    ...,
    tools=[adk.tools.load_memory_tool.LoadMemoryTool()],
)

# Option 2: Use a custom tool whose description explains what type of
# information might be available.
def load_memory(query: str, tool_context: ToolContext):
    """Retrieves memories for the user.

    The following types of information may be stored for the user:
    * User preferences, like the user's favorite foods.
    ...
    """
    # Retrieve memories using similarity search.
    response = tool_context.search_memory(query)
    return response.memories

agent = LlmAgent(
    ...,
    tools=[load_memory],
)

Snippet 11: Configure your ADK agent to decide when memories should be retrieved using a built-in or custom tool

代码片段 11:配置你的 ADK 智能体使用内置或自定义工具来决定何时应检索记忆

Inference with Memories

使用记忆进行推理

Once relevant memories have been retrieved, the final step is to strategically place them into the model’s context window. This is a critical process; the placement of memories can significantly influence the LLM’s reasoning, affect operational costs, and ultimately determine the quality of the final answer.

一旦检索到相关记忆,最后一步是将它们战略性地放置到模型的上下文窗口中。这是一个关键过程;记忆的放置可以显著影响 LLM 的推理,影响运营成本,并最终决定最终答案的质量。

Memories are primarily presented by appending them to system instructions or injecting them into conversation history. In practice, a hybrid strategy is often the most effective. Use the system prompt for stable, global memories (like a user profile) that should always be present. Otherwise, use dialogue injection or memory-as-a-tool for transient, episodic memories that are only relevant to the immediate context of the conversation. This balances the need for persistent context with the flexibility of in-the-moment information retrieval.

记忆主要通过附加到系统指令或注入对话历史来呈现。在实践中,混合策略通常是最有效的。对于应始终存在的稳定、全局记忆(如用户档案),使用系统提示。否则,对于仅与对话即时上下文相关的临时、情景记忆,使用对话注入或记忆即工具。这平衡了对持久上下文的需求与即时信息检索的灵活性。

Memories in the System Instructions

系统指令中的记忆

A simple option to use memories for inference is to append memories to the system instructions. This method keeps the conversation history clean by appending retrieved memories directly to the system prompt alongside a preamble, framing them as foundational context for the entire interaction. For example, you can use Jinja to dynamically add memories to your system instructions:

使用记忆进行推理的一个简单选项是将记忆附加到系统指令。这种方法通过将检索到的记忆连同一段前言直接附加到系统提示来保持对话历史的干净,将它们定位为整个交互的基础上下文。例如,你可以使用 Jinja 动态地将记忆添加到你的系统指令中:

Python

from jinja2 import Template

template = Template("""
{{ system_instructions }}

<MEMORIES>
Here is some information about the user:
{% for retrieved_memory in data %}* {{ retrieved_memory.memory.fact }}
{% endfor %}</MEMORIES>
""")

prompt = template.render(
    system_instructions=system_instructions,
    data=retrieved_memories,
)

Snippet 12: Build your system instruction using retrieved memories

代码片段 12:使用检索到的记忆构建你的系统指令

Including memories in the system instructions gives memories high authority, cleanly separates context from dialogue, and is ideal for stable, “global” information like a user profile. However, there is a risk of over-influence, where the agent might try to relate every topic back to the memories in its core instructions, even when inappropriate.

在系统指令中包含记忆赋予记忆高权威性,清晰地将上下文与对话分开,非常适合稳定的”全局”信息如用户档案。然而,存在过度影响的风险,智能体可能会尝试将每个主题都与其核心指令中的记忆联系起来,即使这样做不适当。

This architectural pattern introduces several constraints. First, it requires the agent framework to support dynamic construction of the system prompt before each LLM call; this functionality isn’t always readily supported. Additionally, the pattern is incompatible with “Memory-as-a-Tool” given that the system prompt must be finalized before the LLM can decide to call a memory retrieval tool. Finally, it poorly handles non-textual memories. Most LLMs only accept a text for the system instructions, making it challenging to embed multimodal content like images or audio directly into the prompt.

这种架构模式引入了几个约束。首先,它要求智能体框架支持在每次 LLM 调用之前动态构建系统提示;这种功能并不总是容易支持的。此外,该模式与*”记忆即工具”*不兼容,因为系统提示必须在 LLM 可以决定调用记忆检索工具之前最终确定。最后,它对非文本记忆的处理很差。大多数 LLM 只接受文本作为系统指令,使得将图像或音频等多模态内容直接嵌入提示中变得具有挑战性。

Memories in the Conversation History

对话历史中的记忆

In this approach, retrieved memories are injected directly into the turn-by-turn dialogue. Memories can either be placed before the full conversation history or right before the latest user query.

在这种方法中,检索到的记忆直接注入到逐轮对话中。记忆可以放在完整对话历史之前,也可以放在最新用户查询之前。

However, this method can be noisy, increasing token costs and potentially confusing the model if the retrieved memories are irrelevant. Its primary risk is dialogue injection, where the model might mistakenly treat a memory as something that was actually said in the conversation. You also need to be more careful about the perspective of the memories that you’re injecting into the conversation; for example, if you’re using the “user” role and user level memories, memories should be written in first-person point of view.

然而,这种方法可能很嘈杂,增加 token 成本,如果检索到的记忆无关,可能会混淆模型。其主要风险是对话注入,模型可能会错误地将记忆视为对话中实际说过的内容。你还需要更加注意注入对话的记忆的视角;例如,如果你使用”用户”角色和用户级记忆,记忆应该以第一人称视角书写。
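A sketch of this injection pattern is shown below; the role/text message structure mirrors common chat APIs but is illustrative rather than tied to a specific SDK. Note the memory block is clearly labeled and written in the first person, which helps the model distinguish injected context from actual dialogue.

下面展示了这种注入模式的示例;role/text 消息结构模仿常见聊天 API,仅作示意,不绑定特定 SDK。注意记忆块有明确标签并以第一人称书写,这有助于模型区分注入的上下文和实际对话。

```python
def inject_memories(history, latest_query, memories):
    """Insert first-person memories as labeled context right before the query."""
    contents = list(history)
    if memories:
        block = ("For context, here are things I have told you before:\n"
                 + "\n".join(f"* {m}" for m in memories))
        contents.append({"role": "user", "text": block})
    contents.append({"role": "user", "text": latest_query})
    return contents

history = [{"role": "user", "text": "Hi!"},
           {"role": "model", "text": "Hello!"}]
contents = inject_memories(history, "What should I cook tonight?",
                           ["I am vegetarian."])  # first-person, user-level memory
```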

A special case of injecting memories into the conversation history is retrieving memories via tool calls. The memories will be included directly in the conversation as part of the tool output.

将记忆注入对话历史的一个特殊情况是通过工具调用检索记忆。记忆将作为工具输出的一部分直接包含在对话中。

Python

def load_memory(query: str, tool_context: ToolContext):
    """Loads memories into the conversation history..."""
    response = tool_context.search_memory(query)
    return response.memories

agent = LlmAgent(
    ...,
    tools=[load_memory],
)

Snippet 13: Retrieve memories as a tool, which directly inserts memories into the conversation

代码片段 13:将记忆作为工具检索,直接将记忆插入对话

Procedural memories

程序性记忆

This whitepaper has focused primarily on declarative memories, a concentration that mirrors the current commercial memory landscape. Most memory management platforms are also architected for this declarative approach, excelling at extracting, storing, and retrieving the “what”—facts, history, and user data.

本白皮书主要关注陈述性记忆,这种集中反映了当前商业记忆领域的现状。大多数记忆管理平台也针对这种陈述性方法进行架构设计,擅长提取、存储和检索”是什么”——事实、历史和用户数据。

However, these systems are not designed to manage procedural memories, the mechanism for improving an agent’s workflows and reasoning. Storing the “how” is not an information retrieval problem; it is a reasoning augmentation problem. Managing this “knowing how” requires a completely separate and specialized algorithmic lifecycle, albeit with a similar high-level structure²⁶:

然而,这些系统并非设计用于管理程序性记忆,即改进智能体工作流和推理的机制。存储”如何”不是信息检索问题;它是推理增强问题。管理这种”知道如何”需要一个完全独立和专门的算法生命周期,尽管具有类似的高级结构²⁶:

1. Extraction: Procedural extraction requires specialized prompts designed to distill a reusable strategy or “playbook” from a successful interaction, rather than just capturing a fact or meaningful information.

1. 提取: 程序性提取需要专门设计的提示,从成功的交互中提炼可重用的策略或”剧本”,而不仅仅是捕获事实或有意义的信息。

2. Consolidation: While declarative consolidation merges related facts (the “what”), procedural consolidation curates the workflow itself (the “how”). This is an active logic management process focused on integrating new successful methods with existing “best practices,” patching flawed steps in a known plan, and pruning outdated or ineffective procedures.

2. 整合: 虽然陈述性整合合并相关事实(”是什么”),程序性整合策划工作流本身(”如何”)。这是一个主动的逻辑管理过程,专注于将新的成功方法与现有的”最佳实践”集成,修补已知计划中有缺陷的步骤,以及修剪过时或无效的程序。

3. Retrieval: The goal is not to retrieve data to answer a question, but to retrieve a plan that guides the agent on how to execute a complex task. Therefore, procedural memories may have a different data schema than declarative memories.

3. 检索: 目标不是检索数据来回答问题,而是检索一个指导智能体如何执行复杂任务的计划。因此,程序性记忆可能具有与陈述性记忆不同的数据模式。
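To illustrate how procedural memories might differ in schema, the sketch below stores each procedure as a "playbook" with a task trigger and ordered steps; the field names and the word-overlap matcher are assumptions, and a real system would use embedding-based retrieval.

为说明程序性记忆在数据模式上的差异,下面的示例将每个程序存储为带有任务触发器和有序步骤的"剧本";字段名和词汇重叠匹配器都是假设,真实系统会使用基于嵌入的检索。

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Playbook:
    task: str               # the kind of task this procedure applies to
    steps: List[str]        # the ordered "how"
    success_count: int = 0  # used by consolidation to keep what works

def retrieve_playbook(task_description: str,
                      playbooks: List[Playbook]) -> Optional[Playbook]:
    """Return the plan whose task best matches the current request."""
    def score(p):
        return len(set(task_description.lower().split())
                   & set(p.task.lower().split()))
    best = max(playbooks, key=score, default=None)
    return best if best is not None and score(best) > 0 else None

playbooks = [
    Playbook("book a flight",
             ["ask for dates", "search flights", "confirm before paying"]),
    Playbook("summarize a document",
             ["chunk the text", "summarize chunks", "merge summaries"]),
]
plan = retrieve_playbook("help me book a cheap flight", playbooks)
```

The retrieved steps would then be injected into the prompt as an in-context "playbook" guiding execution.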

This capacity for an agent to ‘self-evolve’ its logic naturally invites a comparison to a common adaptation method: fine-tuning—often via Reinforcement Learning from Human Feedback (RLHF)²⁷. While both processes aim to improve agent behavior, their mechanisms and applications are fundamentally different. Fine-tuning is a relatively slow, offline training process that alters model weights. Procedural memory provides a fast, online adaptation by dynamically injecting the correct “playbook” into the prompt, guiding the agent via in-context learning without requiring any fine-tuning.

智能体"自我进化"其逻辑的这种能力自然地引发了与一种常见适应方法的比较:微调——通常通过人类反馈强化学习(RLHF)²⁷。虽然两个过程都旨在改进智能体行为,但它们的机制和应用根本不同。微调是一个相对缓慢的离线训练过程,会改变模型权重。程序性记忆通过动态将正确的"剧本"注入提示来提供快速的在线适应,通过上下文学习指导智能体而无需任何微调。

Testing and Evaluation

测试与评估

Now that you have a memory-enabled agent, you should validate its behavior via comprehensive quality and evaluation tests. Evaluating an agent’s memory is a multi-layered process: it requires verifying that the agent is remembering the right things (quality), that it can find those memories when needed (retrieval), and that using those memories actually helps it accomplish its goals (task success). While academia focuses on reproducible benchmarks, industry evaluation is centered on how memory directly impacts the performance and usability of a production agent.

现在你有了一个启用记忆的智能体,你应该通过全面的质量和评估测试来验证其行为。评估智能体的记忆是一个多层次的过程。评估需要验证智能体正在记住正确的事情(质量),它能在需要时找到这些记忆(检索),以及使用这些记忆确实帮助它实现目标(任务成功)。虽然学术界专注于可重复的基准测试,但行业评估以记忆如何直接影响生产智能体的性能和可用性为中心。

Memory generation quality metrics evaluate the content of the memories themselves, answering the question: “Is the agent remembering the right things?” This is typically measured by comparing the agent’s generated memories against a manually created “golden set” of ideal memories.

记忆生成质量指标评估记忆本身的内容,回答问题:**”智能体正在记住正确的事情吗?”** 这通常通过将智能体生成的记忆与手动创建的理想记忆”黄金集”进行比较来测量。

• Precision: Of all the memories the agent created, what percentage are accurate and relevant? High precision guards against an “over-eager” memory system that pollutes the knowledge base with irrelevant noise.

• 精确度: 在智能体创建的所有记忆中,有多少百分比是准确和相关的?高精确度防止”过于急切”的记忆系统用无关噪音污染知识库。

• Recall: Of all the relevant facts it should have remembered from the source, what percentage did it capture? High recall ensures the agent doesn’t miss critical information.

• 召回率: 在应该从源中记住的所有相关事实中,它捕获了多少百分比?高召回率确保智能体不会错过关键信息。

• F1-Score: The harmonic mean of precision and recall, providing a single, balanced measure of quality.

• F1 分数: 精确度和召回率的调和平均值,提供单一的、平衡的质量度量。
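These three metrics can be computed directly once generated memories are matched against the golden set; the sketch below uses exact string matching for simplicity, whereas real evaluations typically use an LLM judge for semantic matching.

一旦将生成的记忆与黄金集匹配,这三个指标就可以直接计算;为简单起见,下面的示例使用精确字符串匹配,而真实评估通常使用 LLM 评判员进行语义匹配。

```python
def generation_quality(generated, golden):
    """Score generated memories against a hand-written golden set."""
    true_positives = len(generated & golden)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(golden) if golden else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

golden = {"user is vegetarian", "user lives in Berlin"}
# One correct memory, one noisy memory, one missed fact:
generated = {"user is vegetarian", "user mentioned the weather"}
metrics = generation_quality(generated, golden)
```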

Memory retrieval performance metrics evaluate the agent’s ability to find the right memory at the right time.

记忆检索性能指标评估智能体在正确时间找到正确记忆的能力。

• Recall@K: When a memory is needed, is the correct one found within the top ‘K’ retrieved results? This is the primary measure of a retrieval system’s accuracy.

• Recall@K: 当需要记忆时,正确的记忆是否在前 ‘K’ 个检索结果中找到?这是检索系统准确性的主要度量。

• Latency: Retrieval is on the “hot path” of an agent’s response. The entire retrieval process must execute within a strict latency budget (e.g., under 200ms) to avoid degrading the user experience.

• 延迟: 检索在智能体响应的”热路径”上。整个检索过程必须在严格的延迟预算内执行(例如低于 200 毫秒),以避免降低用户体验。
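Recall@K can be measured with a labeled evaluation set pairing each query with the memory that should be retrieved; the toy word-overlap retriever below is an assumption standing in for similarity search.

Recall@K 可以用带标注的评估集来测量,每个查询与应检索到的记忆配对;下面简单的词汇重叠检索器是代替相似性搜索的假设。

```python
def recall_at_k(cases, retrieve, k=5):
    """Fraction of cases where the expected memory is in the top-k results."""
    hits = sum(1 for query, expected in cases
               if expected in retrieve(query)[:k])
    return hits / len(cases) if cases else 0.0

# Toy retriever standing in for similarity search over the memory corpus.
corpus = ["user is vegetarian", "user lives in Berlin", "user likes jazz"]
def retrieve(query):
    qs = set(query.lower().split())
    return sorted(corpus, key=lambda m: -len(qs & set(m.split())))

cases = [("is the user vegetarian", "user is vegetarian"),
         ("user lives in which city", "user lives in Berlin")]
score = recall_at_k(cases, retrieve, k=1)
```

Latency is measured separately by timing `retrieve` itself against the budget (e.g., under 200ms).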

End-to-End task success metrics are the ultimate test, answering the question: “Does memory actually help the agent perform its job better?” This is measured by evaluating the agent’s performance on downstream tasks using its memory, often with an LLM “judge” comparing the agent’s final output to a golden answer. The judge determines if the agent’s answer was accurate, effectively measuring how well the memory system contributed to the final outcome.

端到端任务成功指标是最终测试,回答问题:”记忆是否真的帮助智能体更好地完成工作?”这通过评估智能体使用其记忆执行下游任务的性能来测量,通常使用 LLM”评判员”将智能体的最终输出与黄金答案进行比较。评判员确定智能体的答案是否准确,有效地测量记忆系统对最终结果的贡献程度。

Evaluation is not a one-time event; it’s an engine for continuous improvement. The metrics above provide the data needed to identify weaknesses and systematically enhance the memory system over time. This iterative process involves establishing a baseline, analyzing failures, tuning the system (e.g., refining prompts, adjusting retrieval algorithms), and re-evaluating to measure the impact of the changes.

评估不是一次性事件;它是持续改进的引擎。上述指标提供了识别弱点和随时间系统性增强记忆系统所需的数据。这个迭代过程包括建立基线、分析失败、调整系统(例如精炼提示、调整检索算法),以及重新评估以测量变化的影响。

While the metrics above focus on quality, production-readiness also depends on performance. For each evaluation area, it is critical to measure the latency of underlying algorithms and their ability to scale under load. Retrieving memories “on the hot-path” may have a strict, sub-second latency budget. Generation and consolidation, while often asynchronous, must have enough throughput to keep up with user demand. Ultimately, a successful memory system must be intelligent, efficient, and robust for real-world use.

虽然上述指标专注于质量,但生产就绪性还取决于性能。对于每个评估领域,测量底层算法的延迟及其在负载下扩展的能力至关重要。在”热路径”上检索记忆可能有严格的亚秒级延迟预算。生成和整合虽然通常是异步的,但必须有足够的吞吐量来跟上用户需求。最终,一个成功的记忆系统必须对真实世界使用而言是智能的、高效的和健壮的。

Production considerations for Memory

记忆的生产环境考量

In addition to performance, transitioning a memory-enabled agent from prototype to production demands a focus on enterprise-grade architectural concerns. This move introduces critical requirements for scalability, resilience, and security. A production-grade system must be designed not only for intelligence but also for enterprise-level robustness.

除了性能之外,将启用记忆的智能体从原型过渡到生产需要关注企业级架构问题。这一转变引入了对可扩展性、弹性和安全性的关键要求。生产级系统必须不仅为智能而设计,还要为企业级健壮性而设计。

To ensure the user experience is never blocked by the computationally expensive process of memory generation, a robust architecture must decouple memory processing from the main application logic. While this is an event-driven pattern, it is typically implemented via direct, non-blocking API calls to a dedicated memory service rather than a self-managed message queue. The flow looks like this:

为确保用户体验永远不会被记忆生成的计算密集型过程阻塞,健壮的架构必须将记忆处理与主应用程序逻辑解耦。虽然这是一个事件驱动的模式,但它通常通过对专用记忆服务的直接非阻塞 API 调用来实现,而不是自管理的消息队列。流程如下:

1. Agent pushes data: After a relevant event (e.g., a session ends), the agent application makes a non-blocking API call to the memory manager, “pushing” the raw source data (like the conversation transcript) to be processed.

1. 智能体推送数据: 在相关事件(例如会话结束)之后,智能体应用程序对记忆管理器进行非阻塞 API 调用,”推送”原始源数据(如对话记录)进行处理。

2. Memory manager processes in the background: The memory manager service immediately acknowledges the request and places the generation task into its own internal, managed queue. It is then solely responsible for the asynchronous heavy lifting: making the necessary LLM calls to extract, consolidate, and format memories. The manager may delay processing the events until a certain period of inactivity elapses.

2. 记忆管理器在后台处理: 记忆管理器服务立即确认请求并将生成任务放入其自己的内部托管队列。然后它独自负责异步的繁重工作:进行必要的 LLM 调用来提取、整合和格式化记忆。管理器可能会延迟处理事件,直到一定的不活动期过去。

3. Memories are persisted: The service writes the final memories—which may be new entries or updates to existing ones—to a dedicated, durable database. For managed memory managers, the storage is built-in.

3. 记忆被持久化: 服务将最终记忆——可能是新条目或对现有条目的更新——写入专用的持久数据库。对于托管记忆管理器,存储是内置的。

4. Agent retrieves memories: The main agent application can then query this memory store directly when it needs to retrieve context for a new user interaction.

4. 智能体检索记忆: 主智能体应用程序然后可以在需要为新用户交互检索上下文时直接查询此记忆存储。

This service-based, non-blocking approach ensures that failures or latency in the memory pipeline do not directly impact the user-facing application, making the system far more resilient. It also informs the choice between online (real-time) generation, which is ideal for conversational freshness, and offline (batch) processing, which is useful for populating the system from historical data.

这种基于服务的非阻塞方法确保记忆管道中的故障或延迟不会直接影响面向用户的应用程序,使系统更加有弹性。它还指导在在线(实时)生成和离线(批处理)处理之间的选择,前者适合对话新鲜度,后者适合从历史数据填充系统。

As an application grows, the memory system must handle high-frequency events without failure. Given concurrent requests, the system must prevent deadlocks or race conditions when multiple events try to modify the same memory. You can mitigate race conditions using transactional database operations or optimistic locking; however, this can introduce queuing or throttling when multiple requests are trying to modify the same memories. A robust message queue is essential to buffer high volumes of events and prevent the memory generation service from being overwhelmed.

随着应用程序的增长,记忆系统必须无故障地处理高频事件。考虑到并发请求,系统必须防止多个事件尝试修改同一记忆时发生死锁或竞争条件。你可以使用事务性数据库操作或乐观锁来缓解竞争条件;然而,当多个请求尝试修改相同记忆时,这可能会引入排队或限流。健壮的消息队列对于缓冲大量事件并防止记忆生成服务不堪重负至关重要。
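Optimistic locking for memory updates can be sketched with a per-record version number: a write succeeds only if the version is unchanged since it was read, and a conflicting writer must re-read and retry. The in-memory store below is illustrative only; a real system would use the equivalent database feature.

记忆更新的乐观锁可以用每条记录的版本号来概括:只有当版本自读取以来未变时写入才会成功,发生冲突的写入者必须重新读取并重试。下面的内存存储仅作示意;真实系统会使用数据库中等价的功能。

```python
class VersionConflict(Exception):
    pass

class MemoryStore:
    def __init__(self):
        self._rows = {}  # memory_id -> (version, fact)

    def read(self, memory_id):
        return self._rows[memory_id]

    def write(self, memory_id, fact, expected_version):
        version, _ = self._rows.get(memory_id, (0, None))
        if version != expected_version:
            # Another writer got here first; caller must re-read and retry.
            raise VersionConflict(f"expected v{expected_version}, found v{version}")
        self._rows[memory_id] = (version + 1, fact)

store = MemoryStore()
store.write("m1", "user is vegetarian", expected_version=0)
version, fact = store.read("m1")                              # now at v1
store.write("m1", "user is vegan", expected_version=version)  # succeeds, v2
```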

The memory service must also be resilient to transient errors (failure handling). If an LLM call fails, the system should use a retry mechanism with exponential backoff and route persistent failures to a dead-letter queue for analysis.

记忆服务还必须对暂时性错误具有弹性(故障处理)。如果 LLM 调用失败,系统应使用带有指数退避的重试机制,并将持久性故障路由到死信队列进行分析。
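This retry-with-backoff and dead-letter-queue flow can be sketched as follows; the delays, attempt count, and flaky handler are illustrative assumptions.

这种带退避的重试和死信队列流程可以概括如下;其中的延迟、尝试次数和易失败处理器都是示意性假设。

```python
import time

dead_letter_queue = []  # persistent failures land here for later analysis

def process_with_retry(task, handler, max_attempts=3, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return handler(task)
        except Exception as exc:
            if attempt == max_attempts - 1:
                dead_letter_queue.append({"task": task, "error": str(exc)})
                return None
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# A flaky handler standing in for an LLM extraction call:
# fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky_handler(task):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient model error")
    return f"memories for {task['session']}"

result = process_with_retry({"session": "s1"}, flaky_handler)
```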

For global applications, the memory manager must use a database with built-in multi-region replication to ensure low latency and high availability. Client-side replication is not feasible because consolidation requires a single, transactionally consistent view of the data to prevent conflicts. Therefore, the memory system must handle replication internally, presenting a single, logical datastore to the developer while ensuring the underlying knowledge base is globally consistent.

对于全球应用程序,记忆管理器必须使用具有内置多区域复制的数据库,以确保低延迟和高可用性。客户端复制是不可行的,因为整合需要数据的单一、事务一致的视图以防止冲突。因此,记忆系统必须在内部处理复制,向开发者呈现单一的逻辑数据存储,同时确保底层知识库全局一致。

Managed memory systems, like Agent Engine Memory Bank, should help you address these production considerations, so that you can focus on the core agent logic.

托管记忆系统(如 Agent Engine Memory Bank)应该帮助你解决这些生产考量,以便你可以专注于核心智能体逻辑。

Privacy and security risks

隐私与安全风险

Memories are derived from and include user data, so they require stringent privacy and security controls. A useful analogy is to think of the system’s memory as a secure corporate archive managed by a professional archivist, whose job is to preserve valuable knowledge while protecting the company.

记忆源自并包含用户数据,因此需要严格的隐私和安全控制。一个有用的类比是将系统的记忆想象成由专业档案管理员管理的安全企业档案馆,其工作是在保护公司的同时保存有价值的知识。

The cardinal rule for this archive is data isolation. Just as an archivist would never mix confidential files from different departments, memory must be strictly isolated at the user or tenant level. An agent serving one user must never have access to the memories of another, enforced using restrictive Access Control Lists (ACLs). Furthermore, users must have programmatic control over their data, with clear options to opt-out of memory generation or request the deletion of all their files from the archive.

这个档案馆的首要规则是数据隔离。就像档案管理员永远不会混合来自不同部门的机密文件一样,记忆必须在用户或租户级别严格隔离。服务一个用户的智能体绝不能访问另一个用户的记忆,通过限制性访问控制列表(ACL)强制执行。此外,用户必须对其数据有程序化控制,有明确的选项来退出记忆生成或请求从档案中删除其所有文件。

Before filing any document, the archivist performs critical security steps. First, they meticulously go through each page to redact sensitive personal information (PII), ensuring knowledge is saved without creating a liability. Second, the archivist is trained to spot and discard forgeries or intentionally misleading documents—a safeguard against memory poisoning²⁸. In the same way, the system must validate and sanitize information before committing it to long-term memory, using safeguards like Model Armor²⁹, to prevent a malicious user from corrupting the agent’s persistent knowledge through prompt injection.

在归档任何文档之前,档案管理员执行关键的安全步骤。首先,他们仔细检查每一页以脱敏敏感的个人信息(PII),确保知识被保存而不会产生责任。其次,档案管理员被训练来识别和丢弃伪造或故意误导的文档——这是防止记忆投毒²⁸的保障。同样,系统必须在将信息提交到长期记忆之前,使用像 Model Armor²⁹ 这样的保障措施验证和清理信息,以防止恶意用户通过提示注入破坏智能体的持久知识。
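A minimal pre-persistence redaction pass might look like the sketch below; the two regex patterns are only illustrative, and production systems should rely on dedicated services such as DLP tooling or Model Armor rather than hand-rolled patterns.

持久化前的最小脱敏处理可能如下面的示例所示;这两个正则模式仅作示意,生产系统应依赖 DLP 工具或 Model Armor 等专用服务,而不是手写的模式。

```python
import re

# Illustrative patterns only: email addresses and phone-like number runs.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s-]{8,}\d"), "<PHONE>"),
]

def redact_pii(text):
    """Replace matched PII spans before the memory is written to storage."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

memory = "Reach me at jane.doe@example.com or +1 555-123-4567 about the order."
sanitized = redact_pii(memory)
```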

Additionally, there is an exfiltration risk if multiple users share the same set of memories, like with procedural memories (which teach an agent how to do something). For example, if a procedural memory from one user is used as an example for another—like sharing a memo company-wide—the archivist must first perform rigorous anonymization to prevent sensitive information from leaking across user boundaries.

此外,如果多个用户共享同一组记忆,如程序性记忆(教智能体如何做某事),则存在泄露风险。例如,如果来自一个用户的程序性记忆被用作另一个用户的示例——就像在全公司范围内共享备忘录——档案管理员必须首先执行严格的匿名化,以防止敏感信息跨用户边界泄露。

Conclusion

结论

This whitepaper has explored the discipline of Context Engineering, focusing on its two central components: Sessions and Memory. The journey from a simple conversational turn to a piece of persistent, actionable intelligence is governed by this practice, which involves dynamically assembling all necessary information—including conversation history, memories, and external knowledge—into the LLM’s context window. This entire process relies on the interplay between two distinct but interconnected systems: the immediate Session and the long-term Memory.

本白皮书探讨了上下文工程的学科,重点关注其两个核心组件:会话与记忆。从简单的对话轮次到持久、可操作的智能的旅程由这一实践管理,它涉及动态组装所有必要的信息——包括对话历史、记忆和外部知识——到 LLM 的上下文窗口中。整个过程依赖于两个不同但相互关联的系统之间的相互作用:即时会话和长期记忆

The Session governs the “now,” acting as a low-latency, chronological container for a single conversation. Its primary challenge is performance and security, requiring low-latency access and strict isolation. To prevent context window overflow and latency, you must use techniques like token-based truncation or recursive summarization to compact content within the Session’s history or a single request payload. Furthermore, security is paramount, mandating PII redaction before session data is persisted.

会话管理"现在",充当单次对话的低延迟、时间顺序容器。其主要挑战是性能和安全性,需要低延迟访问严格隔离。为防止上下文窗口溢出和延迟,你必须使用基于 token 的截断或递归摘要等技术来压缩会话历史或单个请求负载的内容。此外,安全性至关重要,要求在会话数据持久化之前进行 PII 脱敏

Memory is the engine of long-term personalization and the core mechanism for persistence across multiple sessions. It moves beyond RAG (which makes an agent an expert on facts) to make the agent an expert on the user. Memory is an active, LLM-driven ETL pipeline—responsible for extraction, consolidation, and retrieval—that distills the most important information from conversation history. With extraction, the system distills the most critical information into key memory points. Following this, consolidation curates and integrates this new information with the existing corpus, resolving conflicts, and deleting redundant data to ensure a coherent knowledge base. To maintain a snappy user experience, memory generation must run as an asynchronous background process after the agent has responded. By tracking provenance and employing safeguards against risks like memory poisoning, developers can build trusted, adaptive assistants that truly learn and grow with the user.

记忆长期个性化的引擎,也是跨多个会话持久化的核心机制。它超越了 RAG(RAG 使智能体成为事实专家),使智能体成为用户专家。记忆是一个主动的、LLM 驱动的 ETL 管道——负责提取、整合和检索——从对话历史中提炼最重要的信息。通过提取,系统将最关键的信息提炼成关键记忆点。随后,整合策划并将这些新信息与现有语料库集成,解决冲突,删除冗余数据以确保连贯的知识库。为保持快速的用户体验,记忆生成必须在智能体响应后作为异步后台进程运行。通过跟踪溯源并采用针对记忆投毒等风险的保障措施,开发者可以构建真正与用户一起学习和成长的可信、自适应助手。

Endnotes

尾注

  1. https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en

  2. https://arxiv.org/abs/2301.00234

  3. https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/sessions/overview

  4. https://langchain-ai.github.io/langgraph/concepts/multi_agent/#message-passing-between-agents

  5. https://google.github.io/adk-docs/agents/multi-agents/

  6. https://google.github.io/adk-docs/agents/multi-agents/#c-explicit-invocation-agenttool

  7. https://agent2agent.info/docs/concepts/message/

  8. https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/

  9. https://cloud.google.com/security-command-center/docs/model-armor-overview

  10. https://ai.google.dev/gemini-api/docs/long-context#long-context-limitations

  11. https://huggingface.co/blog/Kseniase/memory

  12. https://langchain-ai.github.io/langgraph/concepts/memory/#semantic-memory

  13. https://langchain-ai.github.io/langgraph/concepts/memory/#semantic-memory

  14. https://arxiv.org/pdf/2412.15266

  15. https://arxiv.org/pdf/2412.15266

  16. https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#sample-requests-text-gen-multimodal-prompt

  17. https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/memory-bank/generate-memories

  18. https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output

  19. https://cloud.google.com/agent-builder/agent-engine/memory-bank/set-up#memory-bank-config

  20. https://arxiv.org/html/2504.19413v1

  21. https://google.github.io/adk-docs/tools/#how-agents-use-tools

  22. https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/memory-bank/generate-memories#consolidate-pre-extracted-memories

  23. https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/memory-bank/generate-memories#background-memory-generation

  24. https://arxiv.org/pdf/2503.08026

  25. https://google.github.io/adk-docs/callbacks/

  26. https://arxiv.org/html/2508.06433v2

  27. https://cloud.google.com/blog/products/ai-machine-learning/rlhf-on-google-cloud

  28. https://arxiv.org/pdf/2503.03704

  29. https://cloud.google.com/security-command-center/docs/model-armor-overview

  30. https://cloud.google.com/architecture/choose-design-pattern-agentic-ai-system