Prototype to Production
Authors: Sokratis Kartakis, Gabriela Hernandez Larios, Ran Li, Elia Secchi, and Huang Xia
This whitepaper provides a comprehensive technical guide to the operational life cycle of AI agents, focusing on deployment, scaling, and productionizing. Building on Day 4’s coverage of evaluation and observability, this guide emphasizes how to build the necessary trust to move agents into production through robust CI/CD pipelines and scalable infrastructure. It explores the challenges of transitioning agent-based systems from prototypes to enterprise-grade solutions, with special attention to Agent2Agent (A2A) interoperability. This guide offers practical insights for AI/ML engineers, DevOps professionals, and system architects.
Introduction: From Prototype to Production
You can spin up an AI agent prototype in minutes, maybe even seconds. But turning that clever demo into a trusted, production-grade system that your business can depend on? That’s where the real work begins. Welcome to the “last mile” production gap, where, in our experience with customers, roughly 80% of the effort is spent not on the agent’s core intelligence, but on the infrastructure, security, and validation needed to make it reliable and safe.
Skipping these final steps could cause several problems. For example:
• A customer service agent is tricked into giving products away for free because you forgot to set up the right guardrails.
• A user discovers they can access a confidential internal database through your agent because authentication was improperly configured.
• An agent generates a large consumption bill over the weekend, but no one knows why because you didn’t set up any monitoring.
• A critical agent that worked perfectly yesterday suddenly stops, but your team is scrambling because there was no continuous evaluation in place.
These aren’t just technical problems; they are major business failures. And while principles from DevOps and MLOps provide a critical foundation, they aren’t enough on their own. Deploying agentic systems introduces a new class of challenges that require an evolution in our operational discipline. Unlike traditional ML models, agents are autonomously interactive, stateful, and follow dynamic execution paths.
This creates unique operational headaches that demand specialized strategies:
• Dynamic Tool Orchestration: An agent’s “trajectory” is assembled on the fly as it picks and chooses tools. This requires robust versioning, access control, and observability for a system that behaves differently every time.
• Scalable State Management: Agents can remember things across interactions. Managing session and memory securely and consistently at scale is a complex systems design problem.
• Unpredictable Cost & Latency: An agent can take many different paths to find an answer, making its cost and response time incredibly hard to predict and control without smart budgeting and caching.
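Since an agent may take an unbounded number of steps to answer a single request, one pragmatic control is a per-request budget that halts the loop once a cost ceiling is reached. A minimal sketch of the idea follows; the token prices and dollar limits are illustrative assumptions, not real rates:

```python
# Sketch: cap the cost of a single agent request whose number of
# LLM/tool steps is not known in advance. Prices are illustrative.

class BudgetExceeded(Exception):
    pass

class RequestBudget:
    """Accumulates cost across an agent's dynamic steps."""

    def __init__(self, max_usd):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, tokens, usd_per_1k_tokens):
        self.spent_usd += tokens / 1000 * usd_per_1k_tokens
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} > budget ${self.max_usd:.4f}"
            )

budget = RequestBudget(max_usd=0.05)
budget.charge(tokens=8000, usd_per_1k_tokens=0.005)      # $0.04 so far
try:
    budget.charge(tokens=8000, usd_per_1k_tokens=0.005)  # would reach $0.08
except BudgetExceeded:
    # Here the agent would stop, return partial results, or escalate.
    stopped_early = True
```

In a real system the same guard would typically also bound wall-clock time and the number of tool invocations per request.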
To navigate these challenges successfully, you need a foundation built on three key pillars: Automated Evaluation, Automated Deployment (CI/CD), and Comprehensive Observability.
This whitepaper is your step-by-step playbook for building that foundation and navigating the path to production! We’ll start with the pre-production essentials, showing you how to set up automated CI/CD pipelines and use rigorous evaluation as a critical quality check. From there, we’ll dive into the challenges of running agents in the wild, covering strategies for scaling, performance tuning, and real-time monitoring. Finally, we’ll look ahead to the exciting world of multi-agent systems with the Agent-to-Agent protocol and explore what it takes to get them communicating safely and effectively.
![][image1] Practical Implementation Guide
Throughout this whitepaper, practical examples reference the Google Cloud Platform Agent Starter Pack1—a Python package providing production-ready Generative AI agent templates for Google Cloud. It includes pre-built agents, automated CI/CD setup, Terraform deployment, Vertex AI evaluation integration, and built-in Google Cloud observability. The starter pack demonstrates the concepts discussed here with working code you can deploy in minutes.
People and Process
After all that talk of CI/CD, observability, and dynamic pipelines, why the focus on people and process? Because the best technology in the world is ineffective without the right team to build, manage, and govern it.
That customer service agent isn’t magically prevented from giving away free products; an AI Engineer and a Prompt Engineer design and implement the guardrails. The confidential database isn’t secured by an abstract concept; a Cloud Platform team configures the authentication. Behind every successful, production-grade agent there is a well-orchestrated team of specialists, and in this section, we’ll introduce the key players.
![][image2]
Figure 1: A diagram showing that “Ops” is the intersection of people, processes, and technology
In a traditional MLOps landscape, this involves several key teams:
• Cloud Platform Team: Comprising cloud architects, administrators, and security specialists, this team manages the foundational cloud infrastructure, security, and access control. The team grants engineers and service accounts least-privilege roles, ensuring access only to necessary resources.
• Data Engineering Team: Data engineers and data owners build and maintain the data pipelines, handling ingestion, preparation, and quality standards.
• Data Science and MLOps Team: This includes data scientists who experiment with and train models, and ML engineers who automate the end-to-end ML pipeline (e.g., preprocessing, training, post-processing) at scale using CI/CD. MLOps Engineers support this by building and maintaining the standardized pipeline infrastructure.
• Machine Learning Governance: This centralized function, including product owners and auditors, oversees the ML lifecycle, acting as a repository for artifacts and metrics to ensure compliance, transparency, and accountability.
Generative AI introduces a new layer of complexity and specialized roles to this landscape:
• Prompt Engineers: While this role title is still evolving in the industry, these individuals blend technical skill in crafting prompts with deep domain expertise. They define the right questions and expected answers from a model, though in practice this work may be done by AI Engineers, domain experts, or dedicated specialists depending on the organization’s maturity.
• AI Engineers: They are responsible for scaling GenAI solutions to production, building robust backend systems that incorporate evaluation at scale, guardrails, and RAG/tool integration.
• DevOps/App Developers: These developers build the front-end components and user-friendly interfaces that integrate with the GenAI backend.
The scale and structure of an organization will influence these roles; in smaller companies, individuals may wear multiple hats, while mature organizations will have more specialized teams. Effectively coordinating all these diverse roles is essential for establishing a robust operational foundation and successfully productionizing both traditional ML and generative AI initiatives.
![][image3]
Figure 2: How multiple teams collaborate to operationalize both models and GenAI applications
The Journey to Production
Now that we’ve established the team, we turn to the process. How do we translate the work of all these specialists into a system that is trustworthy, reliable, and ready for users?
The answer lies in a disciplined pre-production process built on a single core principle: Evaluation-Gated Deployment. The idea is simple but powerful: no agent version should reach users without first passing a comprehensive evaluation that proves its quality and safety. This pre-production phase is where we trade manual uncertainty for automated confidence, and it consists of three pillars: a rigorous evaluation process that acts as a quality gate, an automated CI/CD pipeline that enforces it, and safe rollout strategies to de-risk the final step into production.
Evaluation as a Quality Gate
Why do we need a special quality gate for agents? Traditional software tests are insufficient for systems that reason and adapt. Furthermore, evaluating an agent is distinct from evaluating an LLM; it requires assessing not just the final answer, but the entire trajectory of reasoning and actions taken to complete a task. An agent can pass 100 unit tests for its tools but still fail spectacularly by choosing the wrong tool or hallucinating a response. We need to evaluate its behavioral quality, not just its functional correctness. This gate can be implemented in two primary ways:
1. The Manual “Pre-PR” Evaluation: For teams seeking flexibility or just beginning their evaluation journey, the quality gate is enforced through a team process. Before submitting a pull request (PR), the AI Engineer or Prompt Engineer (or whoever is responsible for agent behavior in your organization) runs the evaluation suite locally. The resulting performance report—comparing the new agent against the production baseline—is then linked in the PR description. This makes the evaluation results a mandatory artifact for human review. The reviewer—typically another AI Engineer or the Machine Learning Governor—is now responsible for assessing not just the code, but also the agent’s behavioral changes against guardrail violations and prompt injection vulnerabilities.
2. The Automated In-Pipeline Gate: For mature teams, the evaluation harness—built and maintained by the Data Science and MLOps Team—is integrated directly into the CI/CD pipeline. A failing evaluation automatically blocks the deployment, providing rigid, programmatic enforcement of the quality standards that the Machine Learning Governance team has defined. This approach trades the flexibility of manual review for the consistency of automation. The CI/CD pipeline can be configured to automatically trigger an evaluation job that compares the new agent’s responses against a golden dataset. The deployment is programmatically blocked if key metrics, such as “tool call success rate” or “helpfulness,” fall below a predefined threshold.
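The gate itself can be as simple as comparing a run’s aggregate evaluation metrics against thresholds agreed with the governance team. A minimal sketch, where the metric names and threshold values are illustrative assumptions:

```python
# Sketch: block a release when key evaluation metrics fall below
# agreed thresholds. Metric names and values are illustrative.

THRESHOLDS = {
    "tool_call_success_rate": 0.95,
    "helpfulness": 0.80,
}

def passes_quality_gate(metrics, thresholds):
    """Return (ok, failures); a missing metric counts as a failure."""
    failures = [
        name for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]
    return (not failures, failures)

ok, failures = passes_quality_gate(
    {"tool_call_success_rate": 0.97, "helpfulness": 0.74}, THRESHOLDS
)
# In the manual process, this report is attached to the PR for review;
# in the automated pipeline, `not ok` fails the build and blocks deploy.
```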
Regardless of the method, the principle is the same: no agent proceeds to production without a quality check. We covered the specifics of what to measure and how to build this evaluation harness in our deep dive on Day 4: Agent Quality: Observability, Logging, Tracing, Evaluation, Metrics, which explored everything from crafting a ‘golden dataset’ (a curated, representative set of test cases designed to assess an agent’s intended behavior and guardrail compliance) to implementing LLM-as-a-judge techniques, to finally using a service like Vertex AI Evaluation2 to power evaluation.
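To make “evaluate the trajectory, not just the final answer” concrete, here is a minimal, framework-agnostic sketch that scores whether an agent invoked the expected tools in the expected order; real harnesses such as Vertex AI Evaluation provide much richer trajectory metrics, and the tool names here are invented for illustration:

```python
# Sketch: trajectory checks — did the agent call the right tools,
# in the right order? A real harness also scores the final answer.

def trajectory_exact_match(expected, actual):
    """Strict check: same tools, same order."""
    return expected == actual

def trajectory_recall(expected, actual):
    """Looser check: fraction of expected tool calls that occurred."""
    if not expected:
        return 1.0
    return sum(tool in actual for tool in expected) / len(expected)

expected = ["lookup_order", "check_refund_policy", "issue_refund"]
actual = ["lookup_order", "issue_refund"]  # skipped the policy check!

# Every unit test for each tool could pass, yet this trajectory fails:
assert not trajectory_exact_match(expected, actual)
assert round(trajectory_recall(expected, actual), 2) == 0.67
```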
The Automated CI/CD Pipeline
An AI agent is a composite system, comprising not just source code but also prompts, tool definitions, and configuration files. This complexity introduces significant challenges: how do we ensure a change to a prompt doesn’t degrade the performance of a tool? How do we test the interplay between all these artifacts before they reach users?
The solution is a CI/CD (Continuous Integration/Continuous Deployment) pipeline. It is more than just an automation script; it’s a structured process that helps different people in a team collaborate to manage complexity and ensure quality. It works by testing changes in stages, incrementally building confidence before the agent is released to users.
A robust pipeline is designed as a funnel. It catches errors as early and as cheaply as possible, a practice often called “shifting left.” It separates fast, pre-merge checks from more comprehensive, resource-intensive post-merge deployments. This progressive workflow is typically structured into three distinct phases:
1. Phase 1: Pre-Merge Integration (CI). The pipeline’s first responsibility is to provide rapid feedback to the AI Engineer or Prompt Engineer who has opened a pull request. Triggered automatically, this CI phase acts as a gatekeeper for the main branch. It runs fast checks like unit tests, code linting, and dependency scanning. Crucially, this is the ideal stage to run the agent quality evaluation suite designed by Prompt Engineers. This provides immediate feedback on whether a change improves or degrades the agent’s performance against key scenarios before it is ever merged. By catching issues here, we prevent polluting the main branch. The PR checks configuration template3 generated with the Agent Starter Pack1 (ASP) is a practical example of implementing this phase with Cloud Build.4
2. Phase 2: Post-Merge Validation in Staging (CD). Once a change passes all CI checks—including the performance evaluation—and is merged, the focus shifts from code and performance correctness to the operational readiness of the integrated system. The Continuous Deployment (CD) process, often managed by the MLOps Team, packages the agent and deploys it to a staging environment—a high-fidelity replica of production. Here, more comprehensive, resource-intensive tests are run, such as load testing and integration tests against remote services. This is also the critical phase for internal user testing (often called “dogfooding”), where humans within the company can interact with the agent and provide qualitative feedback before it reaches the end user. This ensures that the agent, as an integrated system, performs reliably and efficiently under production-like conditions before it is considered for release. The staging deployment template5 from ASP shows an example of this deployment.
3. Phase 3: Gated Deployment to Production. After the agent has been thoroughly validated in the staging environment, the final step is deploying to production. This is almost never fully automatic, typically requiring a Product Owner to give the final sign-off, keeping a human in the loop. Upon approval, the exact deployment artifact that was tested and validated in staging is promoted to the production environment. This production deployment template6 generated with ASP shows how this final phase retrieves the validated artifact and deploys it to production with appropriate safeguards.
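The essence of Phase 3, promoting the exact artifact validated in staging and only after an explicit human sign-off, can be sketched as follows. The function and field names are illustrative, not an actual ASP or deployment-tool API; in a real pipeline the image digest would come from your registry and the approval from a release tool:

```python
# Sketch: gated promotion of an immutable, already-validated artifact.
# Names are illustrative; no rebuild happens between staging and prod.

class ApprovalRequired(Exception):
    pass

def promote_to_production(staging_digest, approved_by=None):
    """Promote the staging artifact, byte-for-byte, to production."""
    if not approved_by:
        raise ApprovalRequired("a Product Owner must sign off first")
    return {
        "environment": "production",
        "image_digest": staging_digest,  # same artifact that was tested
        "approved_by": approved_by,
    }

try:
    promote_to_production("sha256:abc123")  # no sign-off yet
except ApprovalRequired:
    blocked = True

release = promote_to_production("sha256:abc123",
                                approved_by="product-owner@example.com")
```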
![][image4]
Figure 3: Different stages of the CI/CD process
Making this three-phase CI/CD workflow possible requires robust automation infrastructure and proper secrets management. This automation is powered by two key technologies:
• Infrastructure as Code (IaC): Tools like Terraform define environments programmatically, ensuring they are identical, repeatable, and version-controlled. For example, this template generated with Agent Starter Pack7 provides Terraform configurations for complete agent infrastructure including Vertex AI, Cloud Run, and BigQuery resources.
• Automated Testing Frameworks: Frameworks like Pytest execute tests and evaluations at each stage, handling agent-specific artifacts like conversation histories, tool invocation logs, and dynamic reasoning traces.
Furthermore, sensitive information like API keys for tools should be managed securely using a service like Secret Manager8 and injected into the agent’s environment at runtime, rather than being hardcoded in the repository.
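At its simplest, injecting secrets at runtime rather than hardcoding them means the agent reads a credential from an environment variable that the platform populates (for example, a Cloud Run service with a Secret Manager binding), and fails fast when it is absent. A minimal sketch, where the variable name is an illustrative assumption:

```python
# Sketch: read a tool credential injected at runtime (e.g., via a
# Secret Manager -> environment-variable binding) instead of keeping
# it in the repository. The variable name is illustrative.
import os

def get_tool_api_key(env_var="WEATHER_TOOL_API_KEY"):
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; check the secret binding for this service"
        )
    return key

os.environ["WEATHER_TOOL_API_KEY"] = "injected-at-deploy-time"  # demo only
assert get_tool_api_key() == "injected-at-deploy-time"
```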
Safe Rollout Strategies
While comprehensive pre-production checks are essential, real-world application inevitably reveals unforeseen issues. Rather than switching 100% of users at once, consider minimizing risk through gradual rollouts with careful monitoring.
Here are four proven patterns that help teams build confidence in their deployments:
• Canary: Start with 1% of users, monitoring for prompt injections and unexpected tool usage. Scale up gradually or roll back instantly.
• Blue-Green: Run two identical production environments. Route traffic to “blue” while deploying to “green,” then switch instantly. If issues emerge, switch back—zero downtime, instant recovery.
• A/B Testing: Compare agent versions on real business metrics for data-driven decisions. This can happen either with internal or external users.
• Feature Flags: Deploy code but control release dynamically, testing new capabilities with select users first.
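Canary releases, A/B tests, and feature flags all share one mechanic: deterministically routing a fraction of users to the new version, so routing is reproducible and instantly reversible. A minimal sketch (version names are illustrative; a real setup would put this decision in a load balancer or flag service):

```python
# Sketch: deterministic canary routing — the same user always lands in
# the same bucket, so behavior is reproducible and rollback is clean.
import hashlib

def in_canary(user_id, percent):
    """Stable hash of the user id, mapped onto buckets 0-99."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def pick_version(user_id, canary_percent):
    return "agent-v2-canary" if in_canary(user_id, canary_percent) else "agent-v1"

# Roughly `percent` of users see the canary, and routing never flip-flops:
share = sum(in_canary(f"user-{i}", 10) for i in range(1000)) / 1000
assert pick_version("user-42", 10) == pick_version("user-42", 10)
```

Dialing `canary_percent` up as monitoring stays healthy, or down to zero on an alert, gives you the gradual rollout and instant rollback described above.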
All these strategies share a foundation: rigorous versioning. Every component—code, prompts, model endpoints, tool schemas, memory structures, even evaluation datasets— must be versioned. When issues arise despite safeguards, this enables instant rollback to a known-good state. See this as your production “undo” button!
You can deploy agents using Agent Engine9 or Cloud Run10, then leverage Cloud Load Balancing11 for traffic management across versions or connect to other microservices. The Agent Starter Pack1 provides ready-to-use templates with GitOps workflows—where every deployment is a git commit, every rollback is a git revert, and your repository becomes the single source of truth for both current state and complete deployment history.
Building Security from the Start
Safe deployment strategies protect you from bugs and outages, but agents face a unique challenge: they can reason and act autonomously. A perfectly deployed agent can still cause harm if it hasn’t been built with proper security and responsibility measures. This requires a comprehensive governance strategy embedded from day one, not added as an afterthought.
Unlike traditional software that follows predetermined paths, agents make decisions. They interpret ambiguous requests, access multiple tools, and maintain memory across sessions. This autonomy creates distinct risks:
• Prompt Injection & Rogue Actions: Malicious users can trick agents into performing unintended actions or bypassing restrictions.
• Data Leakage: Agents might inadvertently expose sensitive information through their responses or tool usage.
• Memory Poisoning: False information stored in an agent’s memory can corrupt all future interactions.
Fortunately, frameworks like Google’s Secure AI Agents approach12 and the Google Secure AI Framework (SAIF)13 address these challenges through three layers of defense:
1. Policy Definition and System Instructions (The Agent’s Constitution): The process begins by defining policies for desired and undesired agent behavior. These are engineered into System Instructions (SIs) that act as the agent’s core constitution.
2. Guardrails, Safeguards, and Filtering (The Enforcement Layer): This layer acts as the hard-stop enforcement mechanism.
• Input Filtering: Use classifiers and services like the Perspective API to analyze prompts and block malicious inputs before they reach the agent.
• Output Filtering: After the agent generates a response, Vertex AI’s built-in safety filters provide a final check for harmful content, PII, or policy violations. For example, before a response is sent to the user, it is passed through Vertex AI’s built-in safety filters14, which can be configured to block outputs containing specific PII, toxic language, or other harmful content.
• Human-in-the-Loop (HITL) Escalation: For high-risk or ambiguous actions, the system must pause and escalate to a human for review and approval.
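The control flow of this enforcement layer—filter inputs and outputs, and pause high-risk actions for human review—can be sketched as follows. This is a toy illustration only: production systems use trained classifiers and Vertex AI’s configurable safety filters rather than regexes, and the action names here are invented:

```python
# Sketch: a toy enforcement layer showing the block/allow/escalate
# control flow. Real deployments use trained classifiers and
# platform safety filters, not regexes; names are illustrative.
import re

HIGH_RISK_ACTIONS = {"issue_refund", "delete_account"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def filter_output(text):
    """Redact obvious PII before the response reaches the user."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def route_action(action):
    """High-risk tool calls pause for human review instead of executing."""
    return "escalate_to_human" if action in HIGH_RISK_ACTIONS else "execute"

assert filter_output("Reach me at jane@example.com") == "Reach me at [REDACTED_EMAIL]"
assert route_action("issue_refund") == "escalate_to_human"
assert route_action("lookup_order") == "execute"
```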
3. Continuous Assurance and Testing: Safety is not a one-time setup. It requires constant evaluation and adaptation.
• Rigorous Evaluation: Any change to the model or its safety systems must trigger a full re-run of a comprehensive evaluation pipeline using Vertex AI Evaluation.
• Dedicated RAI Testing: Rigorously test for specific risks either by creating dedicated datasets or using simulation agents, including Neutral Point of View (NPOV) evaluations and Parity evaluations.
• Proactive Red Teaming: Actively try to break the safety systems through creative manual testing and AI-driven persona-based simulation.
Operations in-Production
Your agent is live. Now the focus shifts from development to a fundamentally different challenge: keeping the system reliable, cost-effective, and safe as it interacts with thousands of users. A traditional service operates on predictable logic. An agent, in contrast, is an autonomous actor. Its ability to follow unexpected reasoning paths means it can exhibit emergent behaviors and accumulate costs without direct oversight.
Managing this autonomy requires a different operational model. Instead of static monitoring, effective teams adopt a continuous loop: Observe the system’s behavior in real-time, Act to maintain performance and safety, and Evolve the agent based on production learnings. This integrated cycle is the core discipline for operating agents successfully in production.
Observe: Your Agent’s Sensory System
To trust and manage an autonomous agent, you must first understand its process. Observability provides this crucial insight, acting as the sensory system for the subsequent “Act” and “Evolve” phases. A robust observability practice is built on three pillars that work together to provide a complete picture of the agent’s behavior:
• Logs: The granular, factual diary of what happened, recording every tool call, error, and decision.
• Traces: The narrative that connects individual logs, revealing the causal path of why an agent took a certain action.
• Metrics: The aggregated report card, summarizing performance, cost, and operational health at scale to show how well the system is performing.
For example, in Google Cloud, this is achieved through the operations suite: a user’s request generates a unique ID in Cloud Trace15 that links the Vertex AI Agent Engine9 invocation, model calls, and tool executions with visible durations. Detailed logs flow to Cloud Logging16, while Cloud Monitoring17 dashboards alert when latency thresholds are exceeded. The Agent Development Kit (ADK)18 provides built-in Cloud Trace integration for automatic instrumentation of agent operations.
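The correlation mechanic underneath this, every log line carrying the request’s trace ID so that logs, traces, and metrics can be joined later, can be sketched with the standard library; in practice the ADK’s Cloud Trace integration does this instrumentation for you, and the event fields below are illustrative:

```python
# Sketch: correlate log lines by carrying a per-request trace id,
# mirroring what Cloud Trace / ADK instrumentation does automatically.
import json
import logging
import uuid

logger = logging.getLogger("agent")

def log_event(trace_id, event, **fields):
    """Emit one structured log line; returned here for illustration."""
    record = {"trace_id": trace_id, "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line

trace_id = uuid.uuid4().hex  # one id per user request
line = log_event(trace_id, "tool_call", tool="lookup_order", latency_ms=182)
assert json.loads(line)["trace_id"] == trace_id
```

Because every model call and tool execution logs the same `trace_id`, a single query in your logging backend reconstructs the full story of one request.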
By implementing these pillars, we move from operating in the dark to having a clear, data-driven view of our agent’s behavior, providing the foundation needed to manage it effectively in production. (For a full discussion of these concepts, see Agent Quality: Observability, Logging, Tracing, Evaluation, Metrics).
Act: The Levers of Operational Control
Observations without action are just expensive dashboards. The “Act” phase is about real-time intervention—the levers you pull to manage the agent’s performance, cost, and safety based on what you observe.
Think of “Act” as the system’s automated reflexes designed to maintain stability in real-time. In contrast, “Evolve”, which will be covered later, is the strategic process of learning from behavior to create a fundamentally better system.
Because an agent is autonomous, you cannot pre-program every possible outcome. Instead, you must build robust mechanisms to influence its behavior in production. These operational levers fall into two primary categories: managing the system’s health and managing its risk.
Managing System Health: Performance, Cost, and Scale
Unlike traditional microservices, an agent’s workload is dynamic and stateful. Managing its health requires a strategy for handling this unpredictability.
• Designing for Scale: The foundation is decoupling the agent’s logic from its state.
• Horizontal Scaling: Design the agent as a stateless, containerized service. With external state, any instance can handle any request, enabling serverless platforms like Cloud Run10 or the managed Vertex AI Agent Engine Runtime9 to scale automatically.
• Asynchronous Processing: For long-running tasks, offload work using event driven patterns. This keeps the agent responsive while complex jobs process in the background. On Google Cloud, for example, a service can publish tasks to Pub/Sub19, which can then trigger a Cloud Run service for asynchronous processing.
• Externalized State Management: Since LLMs are stateless, persisting memory externally is non-negotiable. This highlights a key architectural choice: Vertex AI Agent Engine provides a built-in, durable Session and memory service, while Cloud Run offers the flexibility to integrate directly with databases like AlloyDB20 or Cloud SQL21.
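The decoupling described above can be sketched in a few lines: every replica is stateless because all session state flows through an external store. Here a plain dict stands in for a real backend such as Agent Engine Sessions, AlloyDB, or Cloud SQL; the class and method names are illustrative:

```python
# Sketch: agent replicas stay stateless by reading/writing session
# state through an external store. A dict stands in for the backend.

class SessionStore:
    """Interface every replica talks to; the backend is swappable."""

    def __init__(self):
        self._data = {}

    def append_turn(self, session_id, turn):
        self._data.setdefault(session_id, []).append(turn)

    def history(self, session_id):
        return list(self._data.get(session_id, []))

def handle_request(store, session_id, user_msg):
    """Any instance can serve this: all state comes from the store."""
    store.append_turn(session_id, user_msg)
    return len(store.history(session_id))

store = SessionStore()                           # external to all replicas
handle_request(store, "s1", "hi")                # served by "instance A"
turns = handle_request(store, "s1", "refund?")   # served by "instance B"
assert turns == 2
```

Because no conversation state lives inside a process, the platform can add or remove instances freely, which is exactly what serverless autoscaling assumes.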
• Balancing Competing Goals: Scaling always involves balancing three competing goals: speed, reliability, and cost.
• Speed (Latency): Keep your agent fast by designing it to work in parallel, aggressively caching results, and using smaller, efficient models for routine tasks.
• Reliability (Handling Glitches): Agents must handle temporary failures. When a call fails, automatically retry, ideally with exponential backoff to give the service time to recover. This requires designing “safe-to-retry” (idempotent) tools to prevent bugs like duplicate charges.
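The retry-with-backoff and idempotency patterns can be sketched in a few lines of plain Python. `TransientError`, `charge_card`, and the in-memory idempotency store below are illustrative stand-ins, not a real payments API:

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure (timeout, HTTP 503, ...)."""


_processed = {}  # idempotency key -> result; stands in for a durable store


def charge_card(amount, idempotency_key):
    """An idempotent 'tool': replaying the same key never double-charges."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    result = {"charged": amount}
    _processed[idempotency_key] = result
    return result


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry a flaky call with exponential backoff: 0.5s, 1s, 2s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the failure to the caller
            sleep(base_delay * (2 ** attempt))
```

The combination matters: backoff alone could still double-charge on a retry after a partial success, which is exactly what the idempotency key prevents.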
• Cost: Keep the agent affordable by shortening prompts, using cheaper models for easier tasks, and sending requests in groups (batching).
Managing Risk: The Security Response Playbook
Because an agent can act on its own, you need a playbook for rapid containment. When a threat is detected, the response should follow a clear sequence: contain, triage, and resolve.
The first step is immediate containment. The priority is to stop the harm, typically with a “circuit breaker”—a feature flag to instantly disable the affected tool.
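A circuit breaker of this kind can be as simple as a flag checked by a tool wrapper. The flag store and tool names below are hypothetical; in production the flag would live in a shared configuration or feature-flag service so it takes effect without a redeploy:

```python
DISABLED_TOOLS = set()  # stand-in for a shared feature-flag service


def set_tool_enabled(tool_name, enabled):
    """Flip the 'circuit breaker' for one tool without redeploying."""
    if enabled:
        DISABLED_TOOLS.discard(tool_name)
    else:
        DISABLED_TOOLS.add(tool_name)


def guarded(tool_name, fn):
    """Wrap a tool so the agent gets a safe refusal while it is disabled."""
    def wrapper(*args, **kwargs):
        if tool_name in DISABLED_TOOLS:
            return {"error": f"{tool_name} is temporarily disabled"}
        return fn(*args, **kwargs)
    return wrapper


# A hypothetical high-risk tool, wrapped at registration time.
issue_refund = guarded("issue_refund", lambda amount: {"refunded": amount})
```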
Next is triage. With the threat contained, suspicious requests are routed to a human-in-the-loop (HITL) review queue to investigate the exploit’s scope and impact.
Finally, the focus shifts to a permanent resolution. The team develops a patch—like an updated input filter or system prompt—and deploys it through the automated CI/CD pipeline, ensuring the fix is fully tested before blocking the exploit for good.
Evolve: Learning from Production
While the “Act” phase provides the system’s immediate, tactical reflexes, the “Evolve” phase is about long-term, strategic improvement. It begins by looking at the patterns and trends collected in your observability data and asking a crucial question: “How do we fix the root cause so this problem never happens again?”
This is where you move from reacting to production incidents to proactively making your agent smarter, more efficient, and safer. You turn the raw data from the “Observe” phase into durable improvements in your agent’s architecture, logic, and behavior.
The Engine of Evolution: An Automated Path to Production
An insight from production is only valuable if you can act on it quickly. Observing that 30% of your users fail at a specific task is useless if it takes your team six months to deploy a fix.
This is where the automated CI/CD pipeline you built in pre-production (Section 3) becomes the most critical component of your operational loop. It is the engine that powers rapid evolution. A fast, reliable path to production allows you to close the loop between observation and improvement in hours or days, not weeks or months.
When you identify a potential improvement—whether it’s a refined prompt, a new tool, or an updated safety guardrail—the process should be:
1. Commit the Change: The proposed improvement is committed to your version-controlled repository.
2. Trigger Automation: The commit automatically triggers your CI/CD pipeline.
3. Validate Rigorously: The pipeline runs the full suite of unit tests, security scans, and the agent quality evaluation suite against your updated datasets.
4. Deploy Safely: Once validated, the change is deployed to production using a safe rollout strategy.
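The “Validate Rigorously” step can be pictured as a simple gate function the pipeline runs before deployment. The dataset shape, exact-match scoring, and 0.9 threshold below are illustrative assumptions, not a real evaluation framework:

```python
def evaluation_gate(agent_fn, golden_dataset, threshold=0.9):
    """Return (passed_gate, score) for an agent over a golden dataset.

    agent_fn is any callable mapping an input to an output; real pipelines
    would use graded rubrics rather than exact match.
    """
    passed = sum(
        1 for case in golden_dataset
        if agent_fn(case["input"]) == case["expected"]
    )
    score = passed / len(golden_dataset)
    return score >= threshold, score
```

In CI, a `False` result would fail the build, blocking the deploy step until the regression is fixed or the dataset is deliberately updated.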
This automated workflow transforms evolution from a slow, high-risk manual project into a fast, repeatable, and data-driven process.
The Evolution Workflow: From Insight to Deployed Improvement
1. Analyze Production Data: Identify trends in user behavior, task success rates, and security incidents from production logs.
2. Update Evaluation Datasets: Transform production failures into tomorrow’s test cases, augmenting your golden dataset.
3. Refine and Deploy: Commit improvements to trigger the automated pipeline—whether refining prompts, adding tools, or updating guardrails.
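Step 2, turning a production failure into a test case, is mostly a data-shaping exercise. The log-entry fields (`user_query`, `trace_id`) and the dataset schema below are assumptions for illustration, not a real logging format:

```python
def failure_to_test_case(log_entry):
    """Convert one production failure log into an eval-dataset entry."""
    return {
        "input": log_entry["user_query"],
        "expected_behavior": "no_error",
        "source": "production",
        "trace_id": log_entry["trace_id"],  # keeps the audit link to the incident
    }


def augment_golden_dataset(dataset, new_cases):
    """Append new cases, de-duplicating on the input text."""
    seen = {case["input"] for case in dataset}
    return dataset + [c for c in new_cases if c["input"] not in seen]
```

Each incident thus becomes a permanent regression test: the next pipeline run must pass it before any change reaches production.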
This creates a virtuous cycle where your agent continuously improves with every user interaction.
![][image5] An Evolve Loop in Action
A retail agent’s logs (Observe) show that 15% of users receive an error when asking for ‘similar products.’ The product team Acts by creating a high-priority ticket. The Evolve phase begins: production logs are used to create a new, failing test case for the evaluation dataset. An AI Engineer refines the agent’s prompt and adds a new, more robust tool for similarity search. The change is committed, passes the now updated evaluation suite in the CI/CD pipeline, and is safely rolled out via a canary deployment, resolving the user issue in under 48 hours.
Evolving Security: The Production Feedback Loop
While the foundational security and responsibility framework is established in pre-production (Section 3.4), the work is never truly finished. Security is not a static checklist; it is a dynamic, continuous process of adaptation. The production environment is the ultimate testing ground, and the insights gathered there are essential for hardening your agent against real-world threats.
This is where the Observe → Act → Evolve loop becomes critical for security. The process is a direct extension of the evolution workflow:
1. Observe: Your monitoring and logging systems detect a new threat vector. This could be a novel prompt injection technique that bypasses your current filters, or an unexpected interaction that leads to a minor data leak.
2. Act: The immediate security response team contains the threat (as discussed in Section 4.2).
3. Evolve: This is the crucial step for long-term resilience. The security insight is fed back into your development lifecycle:
• Update Evaluation Datasets: The new prompt injection attack is added as a permanent test case to your evaluation suite.
• Refine Guardrails: A Prompt Engineer or AI Engineer refines the agent’s system prompt, input filters, or tool-use policies to block the new attack vector.
• Automate and Deploy: The engineer commits the change, which triggers the full CI/CD pipeline. The updated agent is rigorously validated against the newly expanded evaluation set and deployed to production, closing the vulnerability.
This creates a powerful feedback loop where every production incident makes your agent stronger and more resilient, transforming your security posture from a defensive stance to one of continuous, proactive improvement.
To learn more about Responsible AI and securing AI Agentic Systems, please consult the whitepaper Google’s Approach for Secure AI Agents12 and the Google Secure AI Framework (SAIF)13.
Beyond Single-Agent Operations
You’ve mastered operating individual agents in production and can ship them at high velocity. But as organizations scale to dozens of specialized agents—each built by different teams with different frameworks—a new challenge emerges: these agents can’t collaborate. The next section explores how standardized protocols can transform these isolated agents into an interoperable ecosystem, unlocking exponential value through agent collaboration.
A2A - Reusability and Standardization
You’ve built dozens of specialized agents across your organization. The customer service team has their support agent. Analytics built a forecasting system. Risk management created fraud detection. But here’s the problem: these agents can’t talk to each other, whether because they were created in different frameworks, in different projects, or on different clouds altogether.
This isolation creates massive inefficiency. Every team rebuilds the same capabilities. Critical insights stay trapped in silos. What you need is interoperability—the ability for any agent to leverage any other agent’s capabilities, regardless of who built it or what framework they used.
To solve this, a principled approach to standardization is required, built on two distinct but complementary protocols. While the Model Context Protocol (MCP22), which we covered in detail in Agent Tools and Interoperability with MCP, provides a universal standard for tool integration, it is not sufficient for the complex, stateful collaboration required between intelligent agents. This is the problem the Agent2Agent (A2A23) protocol, now governed by the Linux Foundation, was designed to solve.
The distinction is critical. When you need a simple, stateless function like fetching weather data or querying a database, you need a tool that speaks MCP. But when you need to delegate a complex goal, such as “analyze last quarter’s customer churn and recommend three intervention strategies,” you need an intelligent partner that can reason, plan, and act autonomously via A2A. In short, MCP lets you say, “Do this specific thing,” while A2A lets you say, “Achieve this complex goal.”
A2A Protocol: From Concept to Implementation
The A2A protocol is designed to break down organizational silos and enable seamless collaboration between agents. Consider a scenario where a fraud detection agent spots suspicious activity. To understand the full context, it needs data from a separate transaction analysis agent. Without A2A, a human analyst must manually bridge this gap—a process that could take hours. With A2A, the agents collaborate automatically, resolving the issue in minutes.
The first step of the collaboration is discovering the right agent to delegate to. This is made possible through Agent Cards,24 which are standardized JSON specifications that act as a business card for each agent. An Agent Card describes what an agent can do, its security requirements, its skills, and how to reach it (url), allowing any other agent in the ecosystem to dynamically discover its peers. See the example Agent Card below:
{
"name": "check_prime_agent",
"version": "1.0.0",
"description": "An agent specialized in checking whether numbers are prime",
"capabilities": {},
"securitySchemes": {
"agent_oauth_2_0": {
"type": "oauth2"
}
},
"defaultInputModes": ["text/plain"],
"defaultOutputModes": ["application/json"],
"skills": [
{
"id": "prime_checking",
"name": "Prime Number Checking",
"description": "Check if numbers are prime using efficient algorithms",
"tags": ["mathematical", "computation", "prime"]
}
],
"url": "http://localhost:8001/a2a/check_prime_agent"
}
Snippet 1: A sample agent card for the check_prime_agent
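On the consuming side, discovery amounts to fetching and parsing a card like Snippet 1. The sketch below validates the fields a client needs; the required-field subset is a simplification of the full A2A Agent Card schema, and the transport (HTTP fetch from the `.well-known` path) is omitted:

```python
import json

REQUIRED_FIELDS = {"name", "description", "skills", "url"}  # minimal subset


def parse_agent_card(raw_json):
    """Parse an Agent Card and extract what a client needs for discovery."""
    card = json.loads(raw_json)
    missing = REQUIRED_FIELDS - card.keys()
    if missing:
        raise ValueError(f"agent card missing fields: {sorted(missing)}")
    return {
        "name": card["name"],
        "url": card["url"],
        "skill_ids": [skill["id"] for skill in card["skills"]],
    }
```

A registry or orchestrator can run this check when cards are registered, rejecting malformed cards before any delegation is attempted.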
Adopting this protocol doesn’t require an architectural overhaul. Frameworks like the ADK simplify this process significantly (docs25). You can make an existing agent A2A-compatible with a single function call, which automatically generates its AgentCard and makes it available on the network.
# Example using ADK: Exposing an agent via A2A
from google.adk.a2a.utils.agent_to_a2a import to_a2a
from google.adk.agents import Agent

# Your existing agent
root_agent = Agent(
    name='hello_world_agent',
    # ... your agent code ...
)

# Make it A2A-compatible
a2a_app = to_a2a(root_agent, port=8001)
# Serve with uvicorn
# uvicorn agent:a2a_app --host localhost --port 8001
# Or serve with Agent Engine
# from vertexai.preview.reasoning_engines import A2aAgent
# from google.adk.a2a.executor.a2a_agent_executor import A2aAgentExecutor
# a2a_agent = A2aAgent(
# agent_executor_builder=lambda: A2aAgentExecutor(agent=root_agent)
# )
Snippet 2: Using the ADK’s to_a2a utility to wrap an existing agent and expose it for A2A communication
Once an agent is exposed, any other agent can consume it by referencing its AgentCard. For example, a customer service agent can now query a remote product catalog agent without needing to know its internal workings.
# Example using ADK: Consuming a remote agent via A2A
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

prime_agent = RemoteA2aAgent(
    name="prime_agent",
    description="Agent that handles checking if numbers are prime.",
    agent_card="http://localhost:8001/a2a/check_prime_agent/.well-known/agent-card.json"
)
Snippet 3: Using the ADK’s RemoteA2aAgent class to connect to and consume a remote agent
This unlocks powerful, hierarchical compositions. A root agent can be configured to orchestrate both a local sub-agent for a simple task and a remote, specialized agent via A2A, creating a more capable system.
# Example using ADK: Hierarchical agent composition
from google.adk.agents import Agent
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

# ADK local sub-agent for dice rolling
roll_agent = Agent(
    name="roll_agent",
    instruction="You are an expert at rolling dice."
)

# ADK remote A2A agent for prime checking
prime_agent = RemoteA2aAgent(
    name="prime_agent",
    agent_card="http://localhost:8001/.well-known/agent-card.json"
)

# ADK root orchestrator combining both
root_agent = Agent(
    name="root_agent",
    instruction="""Delegate rolling dice to roll_agent, prime checking to prime_agent.""",
    sub_agents=[roll_agent, prime_agent]
)
Snippet 4: Using a remote A2A agent (prime_agent) as a sub-agent within a hierarchical agent structure in the ADK
However, enabling this level of autonomous collaboration introduces two non-negotiable technical requirements. First is distributed tracing, where every request carries a unique trace ID, which is essential for debugging and maintaining a coherent audit trail across multiple agents. Second is robust state management. A2A interactions are inherently stateful, requiring a sophisticated persistence layer for tracking progress and ensuring transactional integrity.
A2A is best suited for formal, cross-team integrations that require a durable service contract. For tightly coupled tasks within a single application, lightweight local sub-agents often remain a more efficient choice. As the ecosystem matures, new agents should be built with native support for both protocols, ensuring every new component is immediately discoverable, interoperable, and reusable, compounding the value of the whole system.
How A2A and MCP Work Together
![][image6] Figure 4: A2A and MCP collaboration at a glance
A2A and MCP are not competing standards; they are complementary protocols designed to operate at different levels of abstraction. The distinction depends on what an agent is interacting with. MCP is the domain of tools and resources—primitives with well-defined, structured inputs and outputs, like a calculator or a database API. A2A is the domain of other agents—autonomous systems that can reason, plan, use multiple tools, and maintain state to achieve complex goals.
The most powerful agentic systems use both protocols in a layered architecture. An application might primarily use A2A to orchestrate high-level collaboration between multiple intelligent agents, while each of those agents internally uses MCP to interact with its own specific set of tools and resources.
A practical analogy is an auto repair shop staffed by autonomous AI agents.
1. User-to-Agent (A2A): A customer uses A2A to communicate with the “Shop Manager” agent to describe a high-level problem: “My car is making a rattling noise.”
2. Agent-to-Agent (A2A): The Shop Manager engages in a multi-turn diagnostic conversation and then delegates the task to a specialized “Mechanic” agent, again using A2A.
3. Agent-to-Tool (MCP): The Mechanic agent now needs to perform specific actions. It uses MCP to call its specialized tools: it runs scan_vehicle_for_error_codes() on a diagnostic scanner, queries a repair manual database with get_repair_procedure(), and operates a platform lift with raise_platform().
4. Agent-to-Agent (A2A): After diagnosing the issue, the Mechanic agent determines a part is needed. It uses A2A to communicate with an external “Parts Supplier” agent to inquire about availability and place an order.
In this workflow, A2A facilitates the higher-level, conversational, and task-oriented interactions between the customer, the shop’s agents, and external suppliers. Meanwhile, MCP provides the standardized plumbing that enables the mechanic agent to reliably use its specific, structured tools to do its job.
Registry Architectures: When and How to Build Them
Why do some organizations build registries while others don’t need them? The answer lies in scale and complexity. When you have fifty tools, manual configuration works fine. But when you reach five thousand tools distributed across different teams and environments, you face a discovery problem that demands a systematic solution.
A Tool Registry uses a protocol like MCP to catalog all assets, from functions to APIs. Instead of giving agents access to thousands of tools, you create curated lists, leading to three common patterns:
• Generalist agents: Access the full catalog, trading speed and accuracy for scope.
• Specialist agents: Use predefined subsets for higher performance.
• Dynamic agents: Query the registry at runtime to adapt to new tools.
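The three access patterns can be sketched with an in-memory registry. The tool entries, tags, allowlist, and `tools_for` helper are invented for illustration and do not model a real MCP registry API:

```python
TOOL_REGISTRY = {
    "get_weather": {"tags": ["data"]},
    "check_fraud": {"tags": ["risk"]},
    "forecast_sales": {"tags": ["data", "analytics"]},
}

SPECIALIST_ALLOWLISTS = {"risk_agent": ["check_fraud"]}  # curated per agent


def tools_for(agent_kind, agent_name=None, query_tags=None):
    """Resolve the tool list an agent sees under each access pattern."""
    if agent_kind == "generalist":   # full catalog: scope over speed
        return sorted(TOOL_REGISTRY)
    if agent_kind == "specialist":   # predefined subset: higher performance
        return SPECIALIST_ALLOWLISTS.get(agent_name, [])
    if agent_kind == "dynamic":      # runtime query: adapts to new tools
        return sorted(name for name, meta in TOOL_REGISTRY.items()
                      if set(meta["tags"]) & set(query_tags))
    raise ValueError(f"unknown agent kind: {agent_kind}")
```

The trade-off is visible in the return values: the generalist sees everything but must reason over a larger tool space, while the specialist and dynamic patterns narrow the catalog before the model ever sees it.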
The primary benefit is human discovery—developers can search for existing tools before building duplicates, security teams can audit tool access, and product owners can understand their agents’ capabilities.
An Agent Registry applies the same concept to agents, using formats like A2A’s AgentCards. It helps teams discover and reuse existing agents, reducing redundant work. This also lays the groundwork for automated agent-to-agent delegation, though this remains an emerging pattern.
Registries offer discovery and governance at the cost of maintenance. Consider starting without one and building it only when your ecosystem’s scale demands centralized management.
Decision Framework for Registries
Tool Registry: Build when tool discovery becomes a bottleneck or security requires centralized auditing.
Agent Registry: Build when multiple teams need to discover and reuse specialized agents without tight coupling.
Putting It All Together: The AgentOps Lifecycle
We can now assemble these pillars into a single, cohesive reference architecture! The life cycle begins in the developer’s inner loop—a phase of rapid local testing and prototyping to shape the agent’s core logic. Once a change is ready, it enters the formal pre-production engine, where automated evaluation gates validate its quality and safety against a golden dataset. From there, safe rollouts release it to production, where comprehensive observability captures the real-world data needed to fuel the continuous evolution loop, turning every insight into the next improvement.
For a comprehensive walkthrough of operationalizing AI agents, including evaluation, tool management, CI/CD standardization, and effective architecture designs, watch the AgentOps: Operationalize AI Agents video26 on the official Google Cloud YouTube channel.
![][image7]Figure 5: AgentOps core capabilities, environments, and processes
Conclusion: Bridging the Last Mile with AgentOps
Moving an AI prototype to a production system is an organizational transformation that requires a new operational discipline: AgentOps.
Most agent projects fail in the “last mile” not due to technology, but because the operational complexity of autonomous systems is underestimated. This guide maps the path to bridge that gap. It begins with establishing People and Process as the foundation for governance. Next, a Pre-Production strategy built on evaluation-gated deployment automates high stakes releases. Once live, a continuous Observe → Act → Evolve loop turns every user interaction into a potential insight. Finally, Interoperability protocols scale the system by transforming isolated agents into a collaborative, intelligent ecosystem.
The immediate benefits—like preventing a security breach or enabling a rapid rollback—justify the investment. But the real value is velocity. Mature AgentOps practices allow teams to deploy improvements in hours, not weeks, turning static deployments into continuously evolving products.
Your Path Forward
• If you’re starting out, focus on the fundamentals: build your first evaluation dataset, implement a CI/CD pipeline, and establish comprehensive monitoring. The Agent Starter Pack is a great place to start—it creates a production-ready agent project in minutes with these foundations already built-in.
• If you’re scaling, elevate your practice: automate the feedback loop from production insight to deployed improvement and standardize on interoperable protocols to build a cohesive ecosystem, not just point solutions.
The next frontier is not just building better individual agents, but orchestrating sophisticated multi-agent systems that learn and collaborate. The operational discipline of AgentOps is the foundation that makes this possible.
We hope this playbook empowers you to build the next generation of intelligent, reliable, and trustworthy AI. Bridging the last mile is therefore not the final step in a project, but the first step in creating value!
Endnotes
https://cloud.google.com/vertex-ai/docs/evaluation/introduction
https://github.com/GoogleCloudPlatform/agent-starter-pack/blob/example-agent/example-agent/terraform
https://cloud.google.com/agent-builder/agent-engine/overview
https://cloud.google.com/load-balancing/docs/https/traffic-management
https://research.google/pubs/an-introduction-to-googles-approach-for-secure-ai-agents/
https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-attributes
https://google.github.io/adk-docs/observability/cloud-trace/
https://a2a-protocol.org/latest/specification/#5-agent-discovery-the-agent-card