AI Agent Platforms Benchmark: Claude Managed Agents vs Google Vertex Agent Engine
We benchmarked 4 AI agent platforms across 3 dimensions: task completion (10 coding tasks × 3 runs), harness-specific capabilities (steering, reconnection, long-conversation recall, large-file handling), and cost.
Results of the AI agent platforms benchmark
| Platform | Model | Pass rate | Wall time | Cost | Tokens |
|---|---|---|---|---|---|
| Claude Managed Agents | Claude Sonnet 4.6 | 30/30 (100%) | 1,172s | $2.50 | 93k |
| Vertex AI Agent Engine | Gemini 2.5 Pro | 30/30 (100%) | 1,447s | $1.45 | 159k |
| OpenAI Responses + CI | GPT-5.4 | 27/30 (90%) | 522s | $1.54 | 113k |
| Control (self-hosted) | Claude Sonnet 4.6 | 30/30 (100%) | 794s | $1.96 | 464k |
Claude Managed Agents and Vertex AI Agent Engine both achieve 100% pass rates on the task suite, with Vertex winning on cost ($1.45 vs $2.50). On harness-specific features that only exist in managed platforms, such as mid-stream steering, disconnect/reconnect, and long-conversation compaction, Claude Managed Agents is the most capable, while Vertex Agent Engine matches it on the portable tests (compaction and large-file handling).
Key findings from the task benchmark
- Claude MA and Vertex AE tied on pass rate at 30/30 (100%). Both handle all task types including network tasks (06, 10) that tripped OpenAI.
- OpenAI’s failures stem from its sandbox policy. Tasks 06 (REST API) and 10 (concurrent downloader) both require outbound HTTP. Code Interpreter’s sandbox restricts this, and the two tasks failed 2/3 and 1/3 runs, respectively. GPT-5.4 can write the code, but the sandbox won’t run it reliably.
- Vertex AE is the cheapest at $1.45 total. Claude MA is the most expensive at $2.50, 72% more than Vertex on the same task suite with the same pass rate.
- Vertex AE is the slowest at 1,447s total wall time; managed ADK orchestration adds overhead.
Harness-specific capabilities
We compared two platforms head-to-head on features that only exist because there’s a managed harness.
See the benchmark methodology below.
AI agent platforms
Claude Managed Agents
Anthropic’s Claude Managed Agents provides a hosted agent runtime combining stateful sessions, built-in tool execution, event-based streaming, and automatic compaction for long-running autonomous workloads. The platform differentiates through primitives unavailable in comparable offerings: mid-stream user event injection for in-flight steering, resumable SSE streams for disconnect/reconnect, and native MCP server integration, all delivered as a fully managed service with no infrastructure for developers to provision.1
Pricing is $0.08 per session-hour on top of standard Claude API token costs.
Pros:
- Stateful sessions with mid-stream event injection allow new user messages to steer agents during in-flight execution.
- Disconnect and reconnect support via persistent SSE streams; sessions continue executing server-side across network interruptions and clients can resume event consumption on reconnection.
- Built-in agent toolset bundles bash, file operations (read, write, edit, glob, grep), and web tools (web_fetch, web_search) accessible via a single configuration parameter, eliminating custom tool wiring.
- Native MCP (Model Context Protocol) server integration for custom tool extensions without modifying the agent’s built-in toolset.
Cons:
- Currently in beta; all requests require the managed-agents-2026-04-01 beta header, and behavior may change between releases.
- Claude-only; no model flexibility compared to platforms like AWS Bedrock AgentCore or Northflank that support multiple model providers.
Salesforce Agentforce
Salesforce Agentforce differentiates through native CRM data access via the Atlas Reasoning Engine and pre-built agents for sales, service, marketing, and commerce workflows.2
The platform integrates with MuleSoft Agent Fabric for cross-system orchestration and offers Agentforce 360 for AWS partnerships.
Agentforce serves organizations requiring autonomous customer-facing workflows embedded directly within their existing Salesforce Cloud infrastructure.
Pros:
– Native CRM data access via Atlas Reasoning Engine enables context-aware agent actions.
– Pre-built agents available for sales, service, marketing, and commerce reduce time-to-deployment.
– FedRAMP authorized on Salesforce Government Cloud for regulated industries.
– Foundations free tier includes 200,000 Flex Credits for initial testing.
Cons:
– Cloud-only SaaS with no on-premise deployment option available.
– Limited model agnosticism; defaults to Salesforce-managed models with restricted external provider support.
– Requires existing Salesforce ecosystem investment to realize full value.
Microsoft Copilot Studio
Pros:
– Included with Microsoft 365 Copilot licenses for internal agent use at no additional charge.3
– Real-time voice agents and IVR telephony support for customer service scenarios.
– FedRAMP authorized through Azure Government for public sector deployments.
– Supports OpenAI and Anthropic models, as well as open-source frameworks, within a single build environment.
Cons:
– Limited functionality outside the Microsoft ecosystem; requires Azure or M365 commitment for full capabilities.
– No standalone permanent free tier; requires existing M365 Copilot subscription for included usage.
– Real-time voice AI model hosted in North America only as of April 2026.
Copilot Studio is most cost-effective for organizations already using Microsoft 365, Teams, and Azure, offering employee-facing automation that inherits existing identity, security, and compliance configurations.
Google Agentspace and Vertex AI Agent Builder
Google’s dual offering combines Agentspace for enterprise knowledge management and Vertex AI Agent Builder for low-code development, differentiated by Gemini model integration, Google Workspace cross-product context, and multimodal input support for text, voice, and images.4
The platform provides $300 in free credits for new users and pay-as-you-go pricing for Vertex AI Agent Engine.
Pros:
– $300 free credit for new users enables extensive prototyping without upfront investment.
– On-premise deployment supported via Google Distributed Cloud for regulated environments.
– FedRAMP authorized through Google Cloud.
– Google ADK (Agent Development Kit) supports code-first development in Python, TypeScript, Go, and Java.
Cons:
– Gemini-primary design limits model flexibility compared to fully agnostic platforms.
AWS Bedrock Agents and AgentCore
AWS Bedrock Agents and the newer AgentCore platform provide serverless infrastructure management for enterprise-scale agents, launched at re:Invent 2025.5
Differentiators include pay-as-you-go pricing at $0.0895 per vCPU-hour for AgentCore runtime, provisioned throughput options, and Mem0 as the exclusive memory provider.
Pros:
– FedRAMP High authorized on AWS GovCloud for sensitive workloads.
– Bidirectional streaming supports voice agents with simultaneous speech from user and agent.
– Free tier available for new AWS customers for initial experimentation.
– Access to models from Anthropic, Amazon, Meta, Mistral, and AI21 through Bedrock catalog.
Cons:
– No pre-built domain-specific agent templates; requires building from scratch using SDK.
– No on-premise deployment option; runs exclusively on AWS infrastructure.
– Building agents requires significant API/SDK coding compared to visual builders.
AWS Bedrock serves enterprises requiring scalable, serverless agent infrastructure with deep integration into the AWS ecosystem, offering cost efficiency through granular usage-based billing.
IBM watsonx Orchestrate
IBM watsonx Orchestrate targets regulated enterprises with 150-plus pre-built domain-specific agents for HR, procurement, sales, and finance, alongside Skills Studio for building custom skills.6
The platform offers hybrid cloud and on-premise deployment flexibility through IBM Cloud Pak for Data and Software Hub.
Pros:
– On-premise installation supported via IBM Cloud Pak for Data for data residency requirements.
– 150-plus pre-built agents and tools from IBM and partners, with 80-plus enterprise application integrations including SAP, Salesforce, and Workday.
– FedRAMP authorization expanded in April 2026 for federal deployments.
– True model agnosticism supporting multiple LLM providers without vendor lock-in.
Cons:
– No permanent free tier; requires paid Essentials or Standard subscription for ongoing use.
– Voice and telephony depend on voice configuration in the ADK and integrations with providers such as Deepgram and ElevenLabs; advanced telephony may require additional configuration.
– Complex pricing structure requiring custom quotes for enterprise features.
ServiceNow AI Agents
ServiceNow AI Agents embed directly within the Now Platform, differentiating through native integration with IT, HR, and customer service workflows rather than operating as a standalone platform.
The platform includes AI Control Tower for governance, pre-built agentic workflows for ITSM and HRSD, and a Context Engine connecting policy history to agent actions.7
Pros:
– Inherits existing Now Platform governance, SLA rules, and approval workflows.
– AI Voice Agents support Genesys Cloud, Twilio, and 3CLogic as CCaaS providers.
– AI Web Agents learn from human demonstrations to automate browser-based tasks.
Cons:
– No permanent free tier; new customers receive only 100 free Build Agent calls.
– FedRAMP High authorization for AI Agents, AI Agent Orchestrator, and AI Agent Studio was not confirmed for Government Community Cloud (GCC) customers until March 2026.
– Limited value for organizations not already using ServiceNow for IT or HR service management.
Kore.ai
Kore.ai focuses on enterprise conversational AI with 300-plus pre-built agents, 250-plus enterprise integrations, and a model-agnostic architecture supporting cloud and on-premise deployments.
The platform serves six verticals including banking, healthcare, and retail.8
Pros:
– Native voice infrastructure delivering low-latency global voice interactions.
– Flexible deployment including on-premises and private cloud options.
– Support for multiple LLM providers.
Cons:
– No permanent free tier; offers only $500 in one-time credits for initial testing.
LangGraph
Pros:
– MIT open-source license allows unrestricted commercial use and modification.
– Deterministic workflow control via graph architecture ensures reproducible execution paths.
– LangSmith observability integration provides production monitoring and tracing.
Cons:
– No visual no-code builder; requires Python or JavaScript code to define agent graphs.
– No native voice or telephony integration; requires custom coding for voice channels.
– Steep learning curve for teams unfamiliar with graph-based programming paradigms.
LangGraph suits engineering teams building production-grade agents requiring complex conditional logic, error recovery, and auditability of individual execution steps.
CrewAI
Pros:
– Role-based abstraction mirrors human team structures for intuitive agent coordination.
– Free open-source core with no licensing fees for self-hosted deployments.
– Visual editor and AI copilot available in the free tier for non-technical team members.
Cons:
– No official vendor-maintained template marketplace; relies on community contributions.
– Code-first approach requires Python knowledge for agent creation.
– Enterprise plan pricing is available only on request, which may create budget uncertainty for small teams compared to other open-source options.
CrewAI enables rapid prototyping of role-based agent pipelines, particularly suited for document processing, research workflows, and multi-step content generation tasks.
n8n
n8n operates under a fair-code license (Sustainable Use License), offering 400-plus native app connectors with visual AI nodes and self-hostable infrastructure.
Pros:
– Self-hosted Community Edition includes SSO SAML, LDAP, RBAC, and encrypted secret stores at no cost.
– Native support for LangChain and LlamaIndex within visual workflows.
– Visual workflow editor enables complex automation without coding.
Cons:
– Fair-code license requires paid license for commercial hosting or SaaS products.
– No native voice or telephony node; requires external API integration for voice.
– No FedRAMP authorization confirmed.
n8n bridges traditional workflow automation and AI agents, serving technical business analysts and DevOps teams who require self-hosted deployment for data residency while maintaining visual building capabilities.
Dify
Dify is an open-source LLMOps platform with RAG pipelines, prompt engineering tools, and a model-agnostic architecture.
Pros:
– Self-hosted Community Edition is permanently free with full data control via Docker deployment.
– Visual workflow builder enables complex agent creation without coding.
– Supports hundreds of proprietary and open-source LLMs from dozens of inference providers.
Cons:
– Voice support requires marketplace plugins such as Agora or Tencent RTC; no native PSTN telephony.
– No FedRAMP authorization.
– Cloud Team plan at $159 per month may be costly for small teams.
Dify suits product and operations teams requiring document-aware agents with strong RAG capabilities, particularly those prioritizing data control through self-hosting.
Voiceflow
Voiceflow differentiates as the only major platform treating voice-first agent design as a first-class citizen rather than an add-on, featuring a purpose-built design canvas for both voice and chat agents with sub-500ms latency.
The platform specializes in customer service ticket automation and IVR systems.
Pros:
– Native voice and telephony channels with IVR support and sub-500ms latency.
– Entity extraction capabilities for knowledge base queries.
– Free plan includes 2 agents and 100 monthly AI tokens with no expiration.
– Visual canvas designed specifically for conversational AI workflows.
Cons:
– On-premise deployment only available through custom enterprise agreements.
Voiceflow serves CX and support teams building customer-facing conversational agents that require deployment across voice, chat, and messaging channels from a single design interface.
Relevance AI
Relevance AI offers bring-your-own-LLM (BYOLLM) flexibility with an action-based billing model, allowing non-technical teams to build multi-agent teams through natural language descriptions.
Pros:
– Free tier includes 100 credits per day with no expiration.
– 2,000-plus integrations including HubSpot, Salesforce, Slack, and Gmail.
– True model agnosticism supporting multiple LLM providers.
Cons:
– No self-hosting or on-premise deployment options; cloud-only SaaS.
– No FedRAMP authorization for regulated industries.
– Voice capabilities require integration with Vapi or Twilio rather than native telephony.
Lindy AI
Lindy AI provides various integrations via Pipedream, pre-built agent templates for email triage and scheduling, and phone call agent capabilities through the Gaia voice feature.9
The platform uses a credit-based execution model with a free tier available.
Pros:
– Free tier includes 400 credits per month and 1-million-character knowledge base.
– True model agnosticism and extensive integration library.
Cons:
– On-premise deployment only available through custom enterprise agreements for regulated industries.
Lindy AI is best for individual business users, founders, and operations teams requiring quick automation of email, calendar, and CRM workflows without engineering resources.
Methodology
What does a managed AI agent platform actually deliver over its competitors, and over the alternative of building your own agent harness? The AI tooling space has a persistent blind spot here. “Managed agent” products are routinely compared using the same task-completion scorecards used for raw language models, which conflates two very different things: the model’s ability to generate correct code, and the harness’s ability to run that code reliably in a managed runtime with state, tools, and isolation. We designed this benchmark to separate those signals.
What is a managed agent platform?
We’re benchmarking a specific category: hosted runtimes that bundle LLM inference, agent orchestration, and sandboxed code execution into a single managed service. This is distinct from (1) raw LLM inference APIs, (2) agent orchestration frameworks you host yourself, and (3) compute sandboxes you pair with your own model. The four platforms under test each take a slightly different shape of this bundle:
- Claude Managed Agents (Anthropic): Full managed harness. Agent definitions, sessions, event-based streaming, compaction, and tool execution are all server-side. One of two true competitors in this category.
- Vertex AI Agent Engine (Google): Full managed harness. Deploy an ADK-defined agent to a managed runtime; the deployment hosts the agent state and tool execution. Accessed via the vertexai.agent_engines SDK.
- OpenAI Responses API with Code Interpreter: Adjacent category. Inference API with a built-in Python sandbox tool, but no persistent multi-turn session state or mid-stream steering.
- Control (Claude Messages API with a local tool loop): Included as a baseline. Same model as Claude MA (claude-sonnet-4-6), but we implement the agent loop locally in ~150 lines of Python, with tools (bash, write, read, edit) executing in a per-task tempdir on the benchmark machine. The model is identical and the harness is absent, so any delta between Claude MA and the control is attributable entirely to the harness, not to model capability. A minimal sketch of this loop follows the list.
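To make the control concrete, here is a minimal sketch of a local tool loop of this kind, assuming the official anthropic Python SDK. It is a simplification of the real ~150-line harness: only the bash tool is shown, and error handling is omitted.

```python
import subprocess
import tempfile

import anthropic  # assumes the official Anthropic Python SDK is installed

# Illustrative single-tool loop; the real control also wires write/read/edit tools.
BASH_TOOL = {
    "name": "bash",
    "description": "Run a shell command in the task's working directory.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

def run_control_task(prompt: str, model: str = "claude-sonnet-4-6") -> str:
    client = anthropic.Anthropic()
    workdir = tempfile.mkdtemp(prefix="bench-task-")  # per-task tempdir
    messages = [{"role": "user", "content": prompt}]

    while True:
        response = client.messages.create(
            model=model, max_tokens=4096, tools=[BASH_TOOL], messages=messages
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            # Terminal turn: return the model's final text.
            return "".join(b.text for b in response.content if b.type == "text")

        # Execute every requested tool call locally and feed the results back.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                proc = subprocess.run(
                    block.input["command"], shell=True, cwd=workdir,
                    capture_output=True, text=True, timeout=120,
                )
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": (proc.stdout + proc.stderr)[-8000:],
                })
        messages.append({"role": "user", "content": results})
```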
The task suite
Ten coding tasks spanning three difficulty tiers. Each task has a fixed prompt specifying the deliverable and a verification script encoding deterministic pass/fail criteria. Each task runs three times per platform to measure variance.
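As an illustration of what a deterministic verification script looks like, here is a hypothetical verifier for a task shaped like task 10 (concurrent downloader). The deliverable name, arguments, and checks are invented for this sketch rather than taken from the actual suite.

```python
#!/usr/bin/env python3
"""Hypothetical verifier: exit 0 on pass, non-zero on fail."""
import subprocess
import sys
from pathlib import Path

DELIVERABLE = Path("downloader.py")  # illustrative file name, not from the real suite

def main() -> int:
    if not DELIVERABLE.exists():
        print("FAIL: deliverable missing")
        return 1
    # The deliverable must fetch the listed URLs and write a manifest file.
    proc = subprocess.run(
        [sys.executable, str(DELIVERABLE), "--urls", "urls.txt"],
        capture_output=True, text=True, timeout=300,
    )
    if proc.returncode != 0:
        print(f"FAIL: non-zero exit ({proc.returncode})")
        return 1
    manifest = Path("manifest.json")
    if not manifest.exists() or manifest.stat().st_size == 0:
        print("FAIL: manifest.json missing or empty")
        return 1
    print("PASS")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```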
Harness-specific stress tests
The task suite measures end-to-end correctness. It cannot measure capabilities that exist only because of a managed harness: stateful session persistence, mid-stream steering, connection resumption, automatic context compaction, and managed filesystem artifact handling. For these, we designed two additional test suites.
Suite A: Steering & Interruption
Three tests exercising harness-specific primitives.
A1 starts an agent on a coding task, then after 10 seconds injects a new user event via POST /events that changes the requirements, and verifies by inspecting the container filesystem that the final artifact reflects the new requirement rather than the original.
A2 opens an SSE stream, drops the connection after four events, reconnects, and verifies the session still reaches status_idle.
A3 sends a deliberately contradictory prompt and measures whether the agent asks for clarification or silently picks an interpretation.
Only A3 is portable across platforms. A1’s mid-stream event injection has no direct equivalent on OpenAI Responses (single request/response) or Vertex Agent Engine (session model lacks in-flight message injection). A2’s disconnect/reconnect similarly has no equivalent elsewhere. These are genuine structural advantages of Claude MA’s event-driven session model, not benchmarkable on the alternatives. We ran A1 and A2 on Claude MA only and ran A3 on both Claude MA and Vertex Agent Engine.
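For readers unfamiliar with resumable SSE, the sketch below shows roughly what the A2 protocol looks like from the client side. The base URL, endpoint paths, header names, and event payload shapes are assumptions made for illustration only, not the documented Claude Managed Agents API; only the overall shape (consume four events, drop the connection, resume, wait for status_idle) mirrors the test.

```python
import json
import requests

BASE = "https://api.example.com/v1/sessions"  # hypothetical endpoint, not the real API
HEADERS = {"x-api-key": "...", "anthropic-beta": "managed-agents-2026-04-01"}

def read_events(session_id: str, last_event_id: str | None, limit: int | None):
    """Consume SSE events; return (events, last event id seen)."""
    headers = dict(HEADERS)
    if last_event_id:
        headers["Last-Event-ID"] = last_event_id  # assumed resume semantics
    events, last_id = [], last_event_id
    with requests.get(f"{BASE}/{session_id}/events", headers=headers, stream=True) as r:
        for line in r.iter_lines(decode_unicode=True):
            if line.startswith("id:"):
                last_id = line[3:].strip()
            elif line.startswith("data:"):
                events.append(json.loads(line[5:]))
                if limit and len(events) >= limit:
                    break  # simulate the client dropping the connection
    return events, last_id

def test_a2(session_id: str) -> bool:
    # Phase 1: read four events, then deliberately disconnect.
    _, cursor = read_events(session_id, None, limit=4)
    # Phase 2: reconnect and drain the stream; the session kept running server-side.
    tail, _ = read_events(session_id, cursor, limit=None)
    return any(ev.get("type") == "status_idle" for ev in tail)
```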
Suite B: Compaction & Context
Two tests exercising managed-context features.
B1 plants a unique canary string (a UUID-derived token) in the first turn of a session, runs 23 padding turns of unrelated small coding tasks each producing tool calls and tool results, then asks the agent to recall the canary from memory on the 25th turn with no file lookup allowed. Successful recall after 23 padding turns is evidence that the harness preserves early context through whatever compaction policy it uses.
B2 asks the agent to generate a 50,000-line text file with a buried marker, then answer a question that requires finding the marker. This tests whether the agent can reason about artifacts larger than its context window without attempting to read the whole file.
Both B1 and B2 ran on both Claude MA and Vertex Agent Engine, using the same prompts and protocols.
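Because B1’s protocol is platform-agnostic, it can be sketched without reference to any particular SDK. In the sketch below, send_turn is a hypothetical callable wrapping whichever platform client is under test, and the padding prompts are illustrative rather than the exact ones used.

```python
import uuid
from typing import Callable

# send_turn(prompt) -> assistant reply text; wraps the platform SDK under test.
SendTurn = Callable[[str], str]

PADDING_TASKS = [
    f"Write a one-line Python function that returns {n} squared, then run it."
    for n in range(23)
]  # 23 unrelated small coding turns, each producing tool calls and tool results

def run_b1(send_turn: SendTurn) -> bool:
    canary = f"CANARY-{uuid.uuid4().hex}"
    # Turn 1: plant the canary and tell the agent it will be asked for it later.
    send_turn(f"Remember this token for later; do not write it to any file: {canary}")
    # Turns 2-24: padding turns designed to push the canary past naive truncation.
    for task in PADDING_TASKS:
        send_turn(task)
    # Turn 25: recall from memory only; no file lookup allowed.
    reply = send_turn(
        "Without reading any files, repeat the exact token from the first turn."
    )
    return canary in reply
```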
LLM-as-judge for behavioral scoring
For Suite A3 (contradictions), pass/fail isn’t a deterministic check; “did the agent ask for clarification” is a qualitative judgment about conversational behavior. We use an LLM-as-judge design with three methodological guards:
- The judge model is different from the tested model: Claude Opus 4.6 is the judge to avoid self-evaluation bias.
- Structured rubric with 4 boolean dimensions: The judge returns JSON scoring recognized_contradiction, asked_for_clarification, proceeded_with_assumption, and documented_assumption, plus a one-paragraph reasoning.
- 3-run consistency check: Each judgment is run 3 times. We report per-dimension majority consensus and per-dimension agreement rate. If any dimension’s agreement falls below 67%, the judge is flagged as inconsistent on that dimension, and the result is treated as low confidence.
A keyword heuristic runs in parallel as a sanity check. Divergence between the heuristic and the judge is logged for manual review.
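The consensus and agreement computation is simple enough to show directly. The sketch below assumes three parsed judge JSON outputs per transcript and mirrors the 67% threshold described above.

```python
from collections import Counter

DIMENSIONS = [
    "recognized_contradiction",
    "asked_for_clarification",
    "proceeded_with_assumption",
    "documented_assumption",
]

def consensus(judgments: list[dict]) -> dict:
    """Per-dimension majority vote and agreement rate over 3 judge runs."""
    out = {}
    for dim in DIMENSIONS:
        votes = [bool(j[dim]) for j in judgments]
        majority, count = Counter(votes).most_common(1)[0]
        agreement = count / len(votes)
        out[dim] = {
            "majority": majority,
            "agreement": agreement,
            "low_confidence": agreement < 2 / 3,  # flagged as inconsistent below 67%
        }
    return out

# Example: two of three runs say the agent asked for clarification.
runs = [
    {"recognized_contradiction": True, "asked_for_clarification": True,
     "proceeded_with_assumption": False, "documented_assumption": False},
    {"recognized_contradiction": True, "asked_for_clarification": True,
     "proceeded_with_assumption": False, "documented_assumption": True},
    {"recognized_contradiction": True, "asked_for_clarification": False,
     "proceeded_with_assumption": True, "documented_assumption": False},
]
print(consensus(runs)["asked_for_clarification"])  # majority True, agreement ~0.67
```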
Scoring
For every task run on every platform:
- Pass/fail
- Wall time: Elapsed seconds from sending the prompt to receiving the terminal event (status_idle for Claude MA, task completion for Vertex AE, response completion for OpenAI, tool-loop exit for control).
- Tool call count: Distinct tool invocations. Useful as a behavioral fingerprint; less useful as an efficiency metric because tool granularity differs significantly across platforms.
- Token usage: Parsed from model_request_end events on Claude MA, usage_metadata on Vertex AE, response.usage on OpenAI, per-turn accumulation in the control’s message loop. Broken down into input, output, cache read, and cache creation.
- Cost in USD: Computed from token usage against published per-million-token pricing (input / output / cache read, plus cache write for Claude): claude-sonnet-4-6 at $3 / $15 / $0.30 / $3.75; gpt-5.4 at $2.50 / $15 / $0.25; gemini-2.5-pro at $1.25 / $10 / $0.13. Platform-specific infrastructure fees are added on top: Claude MA’s $0.08/session-hour pro-rated by wall time, OpenAI’s $0.03/container when any tool call occurred, and Vertex AE’s approximately $0.35/hour hosting fee pro-rated by deployment uptime.
Suite A and B results additionally capture session-level metrics (turns, canary recall, judge consensus and agreement).
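To make the cost arithmetic reproducible, here is a minimal sketch of the cost model described above. The rate table mirrors the published prices quoted in this section; the token counts in the example are placeholders rather than benchmark data, and wall time stands in as a simple proxy for Vertex deployment uptime.

```python
# Per-million-token rates: (input, output, cache_read, cache_write) in USD.
RATES = {
    "claude-sonnet-4-6": (3.00, 15.00, 0.30, 3.75),
    "gpt-5.4": (2.50, 15.00, 0.25, None),
    "gemini-2.5-pro": (1.25, 10.00, 0.13, None),
}

def token_cost(model, input_tok, output_tok, cache_read_tok=0, cache_write_tok=0):
    inp, out, cread, cwrite = RATES[model]
    cost = (input_tok * inp + output_tok * out + cache_read_tok * cread) / 1e6
    if cwrite is not None:
        cost += cache_write_tok * cwrite / 1e6
    return cost

def run_cost(model, tokens, wall_seconds, platform):
    """Token cost plus the platform's pro-rated infrastructure fee."""
    cost = token_cost(model, **tokens)
    if platform == "claude_ma":
        cost += 0.08 * wall_seconds / 3600   # $0.08 per session-hour, pro-rated
    elif platform == "openai_ci":
        cost += 0.03                         # $0.03 per container when tools ran
    elif platform == "vertex_ae":
        cost += 0.35 * wall_seconds / 3600   # ~$0.35/hour hosting (wall time as proxy)
    return cost

# Placeholder example, not benchmark data:
example = {"input_tok": 60_000, "output_tok": 8_000, "cache_read_tok": 20_000}
print(f"${run_cost('claude-sonnet-4-6', example, wall_seconds=120, platform='claude_ma'):.4f}")
```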
Fairness considerations and known limitations
Several asymmetries in the setup affect how the numbers should be read; we call them out explicitly:
The control runs tool execution on the benchmark machine with no cloud round-trip. This gives it an unfair wall-time advantage that reflects the absence of network round-trips rather than agent speed. When we observe the control completing tasks ~25% faster than Claude MA on the same model, roughly half of that gap is round-trip asymmetry.
OpenAI Code Interpreter operates in a network-restricted sandbox. Tasks 06 (REST API) and 10 (concurrent downloader) require outbound HTTP, which CI permits only intermittently. OpenAI’s failures on those tasks are sandbox policy failures, not model capability failures. GPT-5.4 can write correct concurrent HTTP code; the platform can’t always run it. Readers should not interpret “OpenAI fails on networking tasks” as a statement about the model.
Gemini 3.1-pro-preview is gated behind project-level preview allowlisting. We attempted to benchmark this model on both the direct Vertex API and the Vertex Agent Engine. Direct API calls returned 404; Agent Engine deployments with the model succeeded at deploy time, but inference calls returned zero events with no error. We fell back to gemini-2.5-pro.
The task suite skews toward short, self-contained problems. A suite of multi-hour refactoring tasks, debugging in unfamiliar codebases, or long-running autonomous workflows would stress the harnesses differently and probably separate the top-tier options more clearly.
We did not measure provisioning latency, cold-start behavior, concurrent-session performance, or rate-limit ceilings. These are important for high-throughput production workloads but were out of scope for this round.
Features common to all AI agent platforms
Every platform in this comparison provides baseline capabilities that define the AI agent category. These common features establish the minimum viable product for agentic automation, while differentiating features determine platform selection.
Multi-agent orchestration: All platforms support multi-agent orchestration, though implementation varies (see individual platform sections above).
Tool use and external integrations: Agents across every platform can call external APIs, databases, and business applications. Prebuilt connector counts range from approximately 50 (Dify) to 2,000+ (Relevance AI), with all platforms supporting custom API definitions.
Persistent memory and context management: Retaining information within sessions (short-term memory) and across sessions (long-term memory) is a standard capability, achieved through vector databases, session objects, or configurable context windows depending on the platform.
Monitoring and observability: Every platform exposes logs, traces, or analytics for inspecting agent execution, tracking token usage and latency, and identifying failures.
Human oversight and approval controls: Mechanisms for human review, approval, or override of agent actions are present across every platform. Examples include n8n’s per-tool approval gates, LangGraph interrupt-and-resume primitives, Bedrock AgentCore policy controls, ServiceNow AI Control Tower, and Lindy’s automatic escalation.
Knowledge base and retrieval-augmented generation (RAG): Grounding agents in custom knowledge through document indexing and retrieval is a baseline capability across the category. Implementations include Dify RAG pipeline, Voiceflow Knowledge Base, Bedrock Knowledge Bases, Vertex AI RAG Engine, and Kore.ai Search AI.
No-code or low-code agent builder interface: Graphical or natural-language interfaces for agent creation are available on every platform. Enterprise platforms offer no-code studios (Agentforce Builder, Copilot Studio, watsonx Orchestrate), while developer frameworks provide companion visual tools (LangGraph Studio, AutoGen Studio, CrewAI Studio).