Multi-agent systems use specialized agents working together to solve complex tasks. A key challenge: does performance degrade as more agents and tools are added, or can orchestration mechanisms handle the growing complexity efficiently?
We benchmarked 5 agentic frameworks across 750 runs with three tasks. We measured latency, token consumption, and orchestration overhead to identify which architectural patterns maintain efficiency under scale and which degrade.
Multi-agent framework benchmark
We tested how token usage and latency scale as agents and tools increase. Across three tasks with the same problem, we progressively expanded agent count and tool availability. For LangChain and LangGraph, we used single-agent setups to observe how their sequential architectures handle the same complexity faced by multi-agent systems.
Swarm fell below 95% accuracy, so we excluded it from the chart. Our benchmark methodology is detailed at the end of this post.
For Swarm in particular, we observed accuracy shifting alongside this complexity, driven by architectural differences rather than model capability.
Accuracy in agentic frameworks can typically be improved through LLM selection or configuration tuning. However, examining the architectural causes of accuracy variations in our benchmark revealed valuable insights. It helped us understand the fundamental design differences between frameworks.
Multi-agent framework benchmark results
CrewAI forces all agents to execute sequentially, causing exponential token growth as each agent’s output compounds into the next agent’s context. This rigidity ensures completeness but creates massive overhead.
Swarm prioritizes speed through stateless routing but suffers progressive accuracy collapse (84% → 0%) as task complexity increases. Without global state tracking, agents terminate prematurely, breaking multi-step chains.
LangChain uses a single-agent “super-agent” with unified context, avoiding coordination overhead entirely. Performance remains efficient until tool library size (100 tools) and reasoning complexity significantly increase latency.
LangGraph matches LangChain’s reliability but adds graph traversal overhead. The state management cost becomes pronounced under high complexity, though it maintains high accuracy.
AutoGen generates high handoff counts through chat-based coordination but uses GroupChatManager to dynamically prune unnecessary agents. This prevents CrewAI’s exponential growth while maintaining high accuracy, though token consumption stays higher than single-agent baselines due to conversation history re-processing.
CrewAI
CrewAI’s role-based sequential pipeline executed every assigned agent without filtering out unnecessary noise agents throughout the process. This architectural characteristic has significant implications for agentic systems where every agent performs a critical function. It ensures that the framework will not skip any expected steps and will continue using every agent rather than making autonomous routing decisions. However, this rigidity comes at a steep cost in terms of resource consumption and latency as task complexity scales.
Exponential resource growth across tasks
From Task 1 to Task 3, we observed continuously compounding token consumption and latency. Latency approximately doubled with each task increase, while token consumption grew at an even more dramatic rate. Agent handoff counts naturally increased in parallel with this scaling.
Why CrewAI consumes more tokens and time
CrewAI’s sequential pipeline aligned naturally with both workflows. In Task 1, the Data Analyst gathered information before the Arbitrator made decisions. In Task 2, this pattern continued with expanded agent roles. CrewAI selected all tools correctly and demonstrated that sequential execution eliminates coordination confusion, with each agent executing its designated tools without routing ambiguity.
However, this natural alignment came with substantial and growing overhead:
The state compounding mechanism:
- Each agent generates a detailed report upon completing its task
- This report is passed to the next agent in the sequence
- By the time the final Arbitrator receives the handoff, it reads a document containing the history and outputs of all previous agents, plus its own system instructions and tool metadata
- The LLM spends significant time reading and re-generating large markdown state objects between tasks
This verbose state management wrapped even small data points in substantial orchestration metadata. CrewAI prioritizes total context awareness over efficiency.
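The compounding described above can be modeled with a few lines of arithmetic. This is a toy sketch of the pattern, not CrewAI's actual internals, and the 800-token report size is an illustrative assumption:

```python
# Toy model of sequential state compounding (not CrewAI internals):
# each agent reads everything produced so far, then appends its own report.

def compounded_input_tokens(report_tokens_per_agent, n_agents):
    """Return the input size each agent must read, in tokens."""
    inputs = []
    accumulated = 0
    for _ in range(n_agents):
        inputs.append(accumulated)               # agent reads all prior reports
        accumulated += report_tokens_per_agent   # then adds its own
    return inputs

# With 5 agents each emitting ~800-token reports (assumed figure), the final
# agent reads 4 * 800 = 3200 tokens of prior state before doing any work.
print(compounded_input_tokens(800, 5))  # [0, 800, 1600, 2400, 3200]
```

Summed over the pipeline, total input grows with the square of the agent count (roughly n(n-1)/2 reports re-read), which is why adding agents to a rigid sequential chain is so expensive.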
Task 2-3 rigidity:
- The framework executed all 5 agents in the pre-defined sequence, even when only a subset were strictly necessary
- This rigidity inflated both token costs and latency while maintaining high accuracy
- The framework’s inability to skip unnecessary agents became increasingly apparent as a fundamental architectural constraint
- Each additional agent compounded the context that subsequent agents must process
Swarm
Swarm’s lightweight routing mechanism demonstrated genuine multi-agent delegation with minimal orchestration overhead. An initial agent gathered necessary context, actively recognized when its job was done, and explicitly handed the session over to a distinct decision-making agent. This stateless architecture prioritized speed and simplicity, achieving performance comparable to single-agent baselines in simple scenarios. However, as task complexity increased, this lightweight approach revealed fundamental scalability limitations where the absence of global state tracking and central orchestration caused progressive accuracy collapse.
Progressive accuracy degradation across tasks
From Task 1 to Task 3, we observed a dramatic collapse in accuracy despite maintaining fast execution speeds. Accuracy dropped from 84% in Task 1 to 22% in Task 2, ultimately reaching 0% in Task 3. This progressive degradation revealed that Swarm’s stateless architecture, optimized for speed in short-burst interactions, is fundamentally unscalable for multi-step reasoning chains.
Why accuracy decreased as complexity increased
Swarm’s lightweight routing kept completion tokens extremely low and latency fast throughout all tasks. The framework operated as a relay race of independent agents with no global state, where each agent made autonomous handoff decisions without central oversight. This approach excels at minimizing token consumption and achieving rapid execution, but comes at a steep cost in reliability and precision as operational persistence requirements increase.
The architectural blind spot:
A post-mortem of the logs revealed an architectural blind spot that could not be resolved through prompt engineering alone. Initially, a prompting oversight (missing transfer instructions) was identified and fixed. However, even with explicit transfer commands, Swarm failed to complete the chain.
Because the framework is stateless and lacks global intent tracking, the first recipient agent (e.g., Finance) would simply issue a conversational acknowledgment (e.g., “Received data, starting financial audit”) and terminate the thread. Swarm interprets any conversational end-of-turn as task completion, leading to the fastest but most hollow results. Without a central orchestrator to maintain task state and ensure all steps execute, the framework cannot distinguish between “agent gave an acknowledgment” and “task is fully complete.” This fundamental limitation means that even genuine multi-agent delegation with explicit handoffs cannot guarantee task completion when the chain extends beyond simple interactions.
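The failure mode above can be sketched as a stateless relay: control only survives a turn if the agent explicitly transfers, and any plain conversational reply ends the run. This is a simplified toy model of the behavior described, not Swarm's actual code:

```python
# Toy model of stateless handoff routing (simplified from the behavior
# described above; not the library's implementation).

def run_chain(agents):
    """Each agent returns either ('transfer', next_idx) or ('say', text).
    With no global task state, any plain reply ends the run."""
    i, transcript = 0, []
    while True:
        kind, payload = agents[i]()
        transcript.append((i, kind, payload))
        if kind == "transfer":
            i = payload            # control moves on; the chain survives
        else:
            return transcript      # conversational turn == "task complete"

# Finance acknowledges instead of transferring, so the Arbitrator never runs.
triage  = lambda: ("transfer", 1)
finance = lambda: ("say", "Received data, starting financial audit")
arbiter = lambda: ("say", "DECISION: refund approved")

log = run_chain([triage, finance, arbiter])
# The final entry is Finance's acknowledgment, not the Arbitrator's decision.
```

Nothing in the loop distinguishes "agent acknowledged" from "task done", which is exactly the blind spot the logs revealed.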
Task 1-2: Growing precision issues
In Task 1, where Swarm demonstrated strong performance with short agent chains, the framework still failed in 16% of runs due to incomplete handoff resolution. By analyzing conversation logs, we found that the Arbitrator successfully made decisions, but Swarm’s output mechanism surfaced the Data Analyst’s intermediate handoff message instead. Users received “I will now transfer this information to the Arbitrator” rather than the actual decision, revealing that dynamic routing systems risk losing final results during agent transitions.
In Task 2, with 5 agent options and 20 tools in scope, precision degraded significantly as the framework’s lightweight, context-poor prompting strategy began to buckle under increased complexity:
- Correct tool selection rate dropped to 40%, a 20-point decline from Task 1
- Agents occasionally called irrelevant tools or issued routing messages where real tool calls were expected
- Agents sometimes attempted tools from the wrong domain or retried failed calls
- Without a central controller, agents lost track of execution state or handed off to roles that sounded conclusive but hadn’t executed necessary tool calls
Task 3: The handoff paradox
Task 3 exposed Swarm’s fundamental architectural limitation with 0% accuracy despite maintaining the fastest execution speed. This complete failure revealed what we term the “Handoff Paradox”: in a 10-agent chain, Swarm requires 100% tool-based transfers at every link, but without a central orchestrator or state graph (like LangGraph), the chain breaks at the first link. While Swarm excels at 1-to-1 handoffs, it collapses in multi-step workflows requiring operational persistence across long chains.
Handoff chain exhaustion:
In Task 1 with 1 handoff, the chain was short enough that the goal remained in context. However, as the chain extended to 9 handoffs in Task 3, cumulative probability of success dropped to zero. Each additional specialist acted as a “leakage point” where a conversational reply could terminate the process before the final Arbitrator was reached. This geometric failure rate demonstrates that stateless routing, while optimized for speed, cannot scale to multi-step reasoning marathons.
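Under an independence assumption, the leakage described above compounds geometrically: a chain of n handoffs completes with probability p^n. The per-link rate below is taken loosely from Task 1's observed success rate, purely for illustration:

```python
# If each handoff independently succeeds with probability p, a chain of
# n handoffs completes with probability p**n (independence is an assumption;
# the per-link rate is illustrative, loosely based on Task 1).

def chain_success(p_link, n_handoffs):
    return p_link ** n_handoffs

p = 0.84  # roughly Task 1's observed single-handoff success rate
for n in (1, 4, 9):
    print(n, round(chain_success(p, n), 3))
```

Even this optimistic model predicts only ~21% success at 9 handoffs; the observed 0% in Task 3 suggests failures were correlated across links, making long stateless chains worse than independence alone would imply.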
LangChain
LangChain executed tasks as a straightforward state machine: receive prompt, evaluate tools, execute, finalize. We configured LangChain as a single-agent executor with zero handoffs and one unique agent throughout all tasks. This unified context approach maintained a single logical entity throughout execution, making zero conversational jumps and executing exactly what each task required without orchestration overhead. The framework’s linear execution model demonstrated that tasks not requiring agent collaboration benefit significantly from avoiding the coordination costs inherent in multi-agent systems.
Efficient scaling until tool entropy threshold
LangChain maintained correct outputs across all three tasks. However, Task 3 revealed the framework’s sensitivity to tool library size and reasoning complexity, with latency increasing noticeably as both dimensions expanded.
Why LangChain remained efficient
Task 1-2: Linear execution advantage
In Task 1, LangChain achieved minimal latency and optimized token usage with correct tool selection precision. The framework avoided getting distracted by coordination mechanics, processing only what was necessary to complete the task. The single-agent architecture meant no agent-to-agent communication overhead, no report generation between steps, and no conversational filler.
In Task 2, we implemented LangChain using a “super-agent” architecture where a single controller had direct access to all 20 tools. By consolidating roles into a single logical entity, the framework bypassed the need for inter-agent data passing, report generation, and conversational filler. This linear execution model ensured that the LLM only processed relevant tool results, avoiding the exponential growth of prompt history seen in multi-agent frameworks.
The unified context architecture meant that the presence of 20 tools in the library created no selection confusion. The single agent processed tool calls sequentially without needing to coordinate or negotiate with other agents, maintaining correct tool selection despite the expanded tool library. Zero handoffs confirmed that no orchestration overhead was introduced as complexity increased.
Tool entropy and reasoning complexity
Task 3 introduced two significant challenges that impacted LangChain’s performance:
Tool entropy:
While Task 1 had 5 tools and Task 2 had 20 tools, Task 3 presented 100 available tools. Because LangChain operates as a single-agent system, every message must include the definitions of all 100 tools in the prompt. This creates two bottlenecks:
- The LLM must evaluate 100 options to select the correct tool, increasing processing time
- The massive prompt size (containing all tool definitions) delays the model’s time-to-first-token, increasing overall latency
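A back-of-envelope calculation shows how the flat tool library dominates prompt size. The ~120 tokens per JSON tool schema and 4 model calls per task are assumptions for illustration, not measurements from the benchmark:

```python
# Back-of-envelope prompt cost of a flat tool library. The per-schema token
# count and calls-per-task are illustrative assumptions, not measurements.

TOKENS_PER_TOOL_DEF = 120  # assumed average size of one JSON tool schema

def tool_prompt_overhead(n_tools, calls_per_task):
    """Every model call re-sends all tool definitions."""
    return n_tools * TOKENS_PER_TOOL_DEF * calls_per_task

for n in (5, 20, 100):
    print(n, tool_prompt_overhead(n, calls_per_task=4))
```

At 100 tools the definitions alone account for tens of thousands of prompt tokens per task, before any actual reasoning happens.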
Reasoning complexity (10 expert roles):
In Task 1 and Task 2, the agent simply acted as an arbitrator making a decision. In Task 3, the agent was instructed to reason through 10 expert perspectives sequentially.
This instruction caused the model to generate significantly longer outputs, with completion tokens increasing substantially compared to Task 2. More generated text directly translates to longer execution time, as the model must produce each token sequentially.
Despite these challenges, LangChain maintained correct outputs and never selected wrong tools. The framework’s simple loop structure (AgentExecutor) processed tool calls and responses without additional architectural overhead, keeping latency increases proportional to the inherent task complexity rather than compounding it with orchestration mechanisms.
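The simple loop structure referenced above can be sketched in a few lines. This is a stubbed toy version of the tool-calling loop that AgentExecutor wraps; the real implementation also handles output parsing, retries, and callbacks:

```python
# Minimal sketch of the tool-calling loop an AgentExecutor wraps
# (stubbed model; the real loop also handles parsing errors, callbacks, etc.).

def agent_loop(model, tools, prompt, max_steps=10):
    history = [("user", prompt)]
    for _ in range(max_steps):
        action = model(history)                  # decide: tool call or answer
        if action["type"] == "final":
            return action["text"]
        result = tools[action["tool"]](**action["args"])
        history.append(("tool", (action["tool"], result)))
    raise RuntimeError("step limit exceeded")

# Stub model: look up the order, then answer (stand-in for the LLM).
def stub_model(history):
    if history[-1][0] == "user":
        return {"type": "tool", "tool": "get_order", "args": {"order_id": 7}}
    return {"type": "final", "text": f"order status: {history[-1][1][1]}"}

tools = {"get_order": lambda order_id: "shipped"}
print(agent_loop(stub_model, tools, "Check order 7"))  # order status: shipped
```

Because there is only one history and one context, every token in it is either the task or a tool result, with no coordination chatter to re-process.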
LangChain’s architectural approach proved that unified context execution can maintain reliability as complexity scales, though performance becomes sensitive to tool library size and reasoning depth. The framework’s ability to produce correct outputs across all tasks while avoiding the token explosion and coordination overhead of multi-agent systems demonstrated the value of linear execution models for tasks not requiring agent collaboration.
LangGraph
As observed in our agentic frameworks benchmark, LangGraph employed a state machine architecture with explicit state transitions and graph-based control flow. We configured LangGraph as a single-agent executor with zero handoffs and one unique agent throughout all tasks. This approach eliminated inter-agent communication entirely while providing structured state management that tracked execution progress through defined nodes and edges. The framework demonstrated that formal state tracking can coexist with unified context execution.
Consistent reliability with graph management overhead
LangGraph produced correct outputs across all three tasks without errors. In Task 1 and Task 2, performance remained nearly identical to LangChain’s linear execution model. However, Task 3 revealed more pronounced latency increases compared to LangChain, exposing the computational cost of graph-based state management under high tool entropy and reasoning complexity.
Why LangGraph matched LangChain
LangGraph’s state graph provided formal control flow without requiring multiple agents. In both tasks, the framework maintained zero handoffs while selecting all tools correctly. The single controller accessed all necessary tools directly, processing each step through state transitions rather than agent handoffs.
The “super-agent” implementation meant the framework never split cognitive load across multiple personas. Tool selection remained precise even with 20 available tools in Task 2, with the agent never calling incorrect or irrelevant tools. The unified context prevented the selection confusion that plagued frameworks relying on agent-to-agent coordination.
Why token consumption matched LangChain
Both frameworks used identical LLM configuration, tool definitions, and system prompts. Unlike multi-agent frameworks (AutoGen, CrewAI) that generate coordination overhead through agent-to-agent conversations and intermediate coordination messages, both single-agent frameworks consolidate all expertise into one model call. Every token spent represents either the input instructions or the direct output, with no intermediate “Agent A spoke to Agent B” overhead. Additionally, both frameworks invoke the same tools in the same sequence to solve the task, receiving identical data from the underlying system, which results in highly similar completion token counts. Token differences between the frameworks were negligible because the LLM performed the same reasoning work in both cases.
Task 3: Graph traversal overhead amplified
Task 3 introduced the same challenges LangChain faced (100 tools and 10-role reasoning complexity), but LangGraph’s graph-based architecture amplified the performance impact:
Tool entropy burden:
Like LangChain, LangGraph must include all 100 tool definitions in every prompt due to its single-agent architecture. The LLM must evaluate the full tool library for each selection, and the massive prompt size delays response generation.
Reasoning complexity:
The instruction to reason through 10 expert perspectives sequentially caused LangGraph to generate significantly longer outputs, just as it did for LangChain. However, LangGraph’s additional overhead became visible here.
Graph management overhead:
While LangChain uses a simple loop structure (AgentExecutor) that calls tools and processes responses, LangGraph traverses an entire graph structure at each step. For every tool call:
- The framework must traverse the complete graph from start to finish
- Message history (State) is updated at every node transition
- The system validates transitions between nodes and maintains state consistency
In Task 1 and Task 2, this overhead was negligible. In Task 3, with 100 tools and complex reasoning requirements, this graph management burden became substantial. The additional latency compared to LangChain directly reflected the cost of maintaining and traversing the state graph structure under high complexity.
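The traversal and per-node state bookkeeping can be sketched as a toy state graph. This is a simplified model of the pattern, not LangGraph's StateGraph, which also compiles the graph, checkpoints, and validates transitions:

```python
# Toy state graph in the spirit of the traversal described above
# (simplified; not LangGraph's actual StateGraph implementation).

def run_graph(nodes, edges, state, entry="start"):
    """nodes: name -> fn(state) -> state; edges: name -> fn(state) -> next."""
    current = entry
    while current != "END":
        state = dict(nodes[current](state))           # state copied per node
        state.setdefault("path", []).append(current)  # transition bookkeeping
        current = edges[current](state)               # routed via an edge fn
    return state

nodes = {
    "start":  lambda s: {**s, "data": "order #7"},
    "decide": lambda s: {**s, "decision": "refund"},
}
edges = {
    "start":  lambda s: "decide",
    "decide": lambda s: "END",
}
final = run_graph(nodes, edges, {})
```

Each step pays for a state copy, a bookkeeping update, and an edge evaluation; negligible per call, but noticeable once Task 3's long reasoning chains multiply the number of transitions.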
Despite this overhead, LangGraph never selected wrong tools, consistently invoking only the necessary functions to complete each step. The framework’s formal state tracking provided structured control flow at the cost of increased processing time.
LangGraph’s architectural approach demonstrated that explicit state management can maintain reliability as complexity scales, though the graph traversal overhead becomes more pronounced under high tool entropy and reasoning complexity. For applications requiring auditability, rollback capabilities, or complex branching logic, this trade-off may be worthwhile. For straightforward sequential execution, LangGraph’s additional structure provides limited value over simpler linear models like LangChain.
AutoGen
AutoGen employed a chat-based architecture where specialized agents collaborate through a UserProxy that manages workflow coordination.
We configured AutoGen using GroupChatManager across all three tasks, enabling dynamic agent selection rather than forcing sequential execution. This architecture demonstrated that intelligent orchestration can achieve multi-agent collaboration without the exponential resource costs of rigid pipelines.
High handoff counts with competitive performance
AutoGen recorded the highest handoff counts across all frameworks. In Task 1, the framework already reached handoff levels that CrewAI only achieved in Task 3 (9 handoffs). This reflected AutoGen’s conversational nature: every interaction between the UserProxy and specialist agents registers as a handoff, even when discussing which tool to call.
However, despite these high handoff counts, AutoGen’s latency remained competitive with sequential frameworks in Task 1 and Task 2. In Task 3, while CrewAI’s framework overhead reached 1.35 million tokens, AutoGen consumed only 56,700 tokens, compared with 13,500 for LangChain and 13,600 for LangGraph.
Why AutoGen consumed more tokens despite its competitive latency
AutoGen consumed substantially more tokens than single-agent baselines, though not reaching the extreme levels of CrewAI’s sequential pipeline. The framework involved multiple conversations between a UserProxy and specialized agents. Every turn in this chat required a full LLM pass that re-processed the entire conversation history to date.
This recursive token accumulation explains why AutoGen’s token consumption remained higher than LangChain and LangGraph even when latency stayed competitive. The chat history grows with each turn, increasing prompt size, but the framework’s GroupChatManager prevents the exponential explosion seen in sequential pipelines by pruning unnecessary agents.
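The recursive accumulation above has a simple shape: turn k re-reads turns 1 through k-1, so cumulative prompt size grows quadratically in the number of turns. The figures below are illustrative, not benchmark measurements:

```python
# Why chat re-processing is expensive: each turn re-sends the whole
# transcript so far. Figures are illustrative, not from the benchmark.

def total_prompt_tokens(turn_tokens, n_turns):
    """Turn k re-reads turns 0..k-1, so cumulative prompt size is quadratic."""
    return sum(turn_tokens * k for k in range(n_turns))

# 12 turns of ~250 tokens each: 3,000 tokens of content, but 16,500 tokens
# of history re-reads -- 5.5x the useful conversation.
print(total_prompt_tokens(250, 12))
```

Pruning agents (and therefore turns) attacks the n_turns term directly, which is why GroupChatManager keeps the curve polynomial rather than letting it explode.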
However, AutoGen consistently selected the correct tools and produced accurate outputs across all tasks without calling irrelevant tools. The conversational overhead meant the framework spent more time coordinating than executing, but this coordination ensured no agent ever lost focus or called wrong tools.
AutoGen’s GroupChatManager
AutoGen’s architectural strength: dynamic agent selection through GroupChatManager. Unlike Task 2’s sequential orchestration, the GroupChat mode allowed the framework to activate only necessary agents from the available pool.
The Manager pruned unnecessary specialists, activating only 5-6 agents from the 10-agent pool. As soon as the Arbitrator found sufficient grounds for a decision, the Manager terminated the loop. This prevented the exponential token growth that would occur if context was forced through every remaining agent sequentially.
This dynamic pruning resulted in substantially lower latency and token consumption compared to CrewAI’s rigid sequential pipeline. While CrewAI forced all 10 agents to execute regardless of necessity, AutoGen’s GroupChat adaptively selected only the agents needed to reach a decision.
Despite coordination overhead, the high handoff count reflected thorough deliberation where agents cross-referenced findings before termination. AutoGen’s ability to switch between sequential and GroupChat modes provides flexibility that rigid architectures lack, demonstrating that chat-based orchestration with intelligent agent selection can scale more efficiently than fixed pipelines for complex multi-agent workflows.
How AutoGen GroupChatManager works:
- At each step, the Manager decides “which agent should speak next?” based on the conversation context
- The framework is not required to execute all agents sequentially
- If sufficient information is gathered early, the Manager can skip unnecessary specialists
- The Manager can terminate the loop as soon as the agent has enough information to make a decision
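The steps above can be sketched as a toy speaker-selection loop. This is a simplified model of the GroupChatManager pattern, not AutoGen's implementation; the agent and selector functions are stand-ins:

```python
# Toy speaker-selection loop (simplified sketch of the GroupChatManager
# pattern; not AutoGen's implementation).

def group_chat(agents, select_speaker, max_turns=20):
    transcript = []
    for _ in range(max_turns):
        name = select_speaker(transcript)   # manager: "who speaks next?"
        msg = agents[name](transcript)
        transcript.append((name, msg))
        if "TERMINATE" in msg:              # explicit exit signal ends the loop
            break
    return transcript

agents = {
    "analyst":    lambda t: "order #7 was shipped late",
    "arbitrator": lambda t: "refund approved. TERMINATE",
}

# Manager skips unneeded specialists: analyst first, then straight to a decision.
def select_speaker(transcript):
    return "analyst" if not transcript else "arbitrator"

log = group_chat(agents, select_speaker)
```

Two design points from the benchmark show up here: the selector can route past irrelevant agents entirely, and without the explicit TERMINATE check the loop would run to max_turns, bleeding tokens.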
The “Please Continue” challenge: AutoGen’s default behavior is to keep conversations alive. For benchmarks, precise termination signals are critical to avoid “token bleed.” We addressed this by ensuring all specialist agents include explicit TERMINATE signals upon task completion.
Manager overhead: Even with GroupChatManager, AutoGen’s internal message state is larger than LangChain’s due to multi-agent orchestration. However, this provides significantly more structured logs and deliberation trails than simpler frameworks.
Note on sequential vs. GroupChat: We ran all tasks using GroupChatManager. In experimental runs with sequential orchestration, we observed token consumption and latency at least doubling compared to GroupChat mode, confirming that dynamic agent selection provides substantial efficiency gains over fixed pipelines.
Multi-agent framework benchmark methodology
Each framework was tested for 50 iterations (N=50) per task.
To eliminate variability in the model’s reasoning process, all frameworks utilized identical LLM configuration. The model used was openai/gpt-5.2 via the OpenRouter API. Temperature was set to 0.0.
No maximum token limit was imposed on the LLM’s responses, allowing frameworks to use as much context as their internal architecture required to solve the task.
Metrics captured include: number of LLM API calls, handoffs between agents, unique agents invoked, tool calls executed, and tool call accuracy. All metrics were logged per iteration and aggregated across the 50-run sample.
We separated raw LLM outputs from the additional overhead introduced by orchestration to measure orchestration efficiency. LLM output tokens represent the actual useful responses generated by the model, while Framework Overhead encompasses the system commands, tool definitions, and conversation histories that the framework must feed to the LLM behind the scenes to obtain those responses.
This metric, calculated by subtracting the output tokens from the total tokens (Total – Output tokens), directly reveals the “management cost” that the framework hides from the user. Thanks to this distinction, we can see which frameworks remain efficient and lean, versus which ones repeatedly load massive data payloads into the LLM for each orchestration step. We based our analysis on framework overhead tokens as our primary efficiency metric.
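The metric reduces to a single subtraction. The token counts below are illustrative placeholders, not figures from the benchmark:

```python
# Framework overhead as defined above: total tokens minus the model's own
# output tokens. The figures here are illustrative, not benchmark data.

def framework_overhead(total_tokens, output_tokens):
    """The 'management cost' hidden from the user by the orchestration layer."""
    return total_tokens - output_tokens

total_tokens, output_tokens = 20_000, 6_000
print(framework_overhead(total_tokens, output_tokens))  # 14000
```

A lean framework keeps this difference close to the cost of the task prompt and tool schemas; a heavy one re-feeds large state payloads on every orchestration step.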
To ensure that frameworks were measured solely on their coordination logic, we synchronized all other variables. This eliminated confounding factors and isolated architectural differences.
Agents were defined in a central file. Each framework’s wrapper injected the exact same persona string into its native parameter: system_message for AutoGen, backstory for CrewAI, system prompts for LangChain/LangGraph, and agent descriptions for Swarm. No framework-specific prompt engineering was applied.
Every framework used the same underlying Python functions. Tool definitions, docstrings, and parameter schemas were standardized. No framework-specific pre-built tools were used. This ensures that tool execution logic is consistent, and only orchestration mechanisms differ.
For every iteration, the “DataCo Smart Supply Chain” dataset was fed to the agents. Ground truth data (shipping status, payment status, profit margins) remained constant across frameworks.
While keeping inputs identical, each framework operated in its native structural mode. We did not force frameworks into unnatural architectures. Instead, we implemented each framework according to its intended design pattern to measure real-world performance.
AutoGen operates as a conversational group chat system. It uses initiate_chats with TERMINATE signals to manage exit conditions. Agents communicate through message passing, with a UserProxy coordinating the workflow.
CrewAI implements a task-based sequential pipeline. It uses Process.sequential where agents execute in a fixed order. Each agent completes its task and generates a report before the next agent begins.
LangChain follows a linear chain architecture. It uses a standard AgentExecutor that wraps the tool-calling loop. The agent executes tools sequentially within a single context.
LangGraph structures execution as a cyclic state graph. It uses StateGraph with nodes representing processing steps and conditional routing edges to determine flow.
Swarm employs handoff-based routines. It uses transfer_to_agent functions to shift control dynamically between agents based on runtime decisions.
The tasks evolved in complexity to pressure-test different orchestration capabilities and failure modes.
Task 1 (2 Agents / 5 Tools): Tests baseline orchestration overhead for a simple workflow requiring order information gathering and refund decision-making.
Task 2 (5 Agents / 20 Tools): Tests routing intelligence under noise. Only 2-3 agents and 3-5 tools are necessary, but 5 agents and 20 tools are available.
Task 3 (10 Agents / 100 Tools): Tests high-entropy filtering and scalability limits. Only 2-3 agents and 3-5 tools are needed, but 10 agents and 100 tools are available, including 98 irrelevant noise tools designed to confuse routing.