Agentic CLI tools are AI coding tools that can create and delete files, run commands, plan, and code entire projects autonomously. We tested the leading tools across 10 real-world web development scenarios to determine which are most reliable.
Agentic CLI benchmark results
Analysis and insights
Codex has the highest overall score (67.7%) and the strongest backend performance (58.5%). Its backend score is nearly 10 percentage points higher than the next-best performer, Kiro CLI (48.7%).
Claude Code has the highest frontend score (95.0%), but its backend score (38.6%) pulls down its overall result (55.5%). This illustrates the main dynamic in the chart: frontend performance is relatively high for several agents, while backend correctness and contract discipline drive most of the ranking separation.
The largest UI-backend gap appears in Claude Code (95.0% frontend vs 38.6% backend). In contrast, Codex combines a high frontend score (89.2%) with the best backend score, which is why it leads overall under the 0.7 backend / 0.3 frontend weighting.
Lower-ranked agents fail for different reasons. Goose scores near zero on both backend (3.1%) and frontend (10.0%), indicating basic execution and completeness issues. Forge and Cline show moderate frontend scores (45.8% and 33.3%) but low backend scores (20.1% and 26.7%), consistent with backend contract and routing problems dominating their outcomes.
Speed vs score & token usage vs score
We evaluated runtime efficiency using average execution time (seconds), effective token usage (input + output), and the combined accuracy score:
Aider occupies the most balanced region of the chart. With a 52.7% combined score, it completes tasks in 257 seconds and consumes 126k tokens. It is the only agent that combines mid-to-high accuracy with relatively low runtime and moderate token usage.
Codex achieves the highest overall score (67.7%) but at higher cost. Its average runtime is 426 seconds and token usage is 258k. The efficiency trade-off appears proportional to its accuracy gain.
Claude Code is the most expensive among top-performing agents. It ranks second in accuracy (55.5%) but requires 745 seconds and 397k tokens. Compared to Aider, Claude consumes over 3x more tokens for a 2.8 percentage point increase in score.
Kiro CLI is the fastest agent, completing in 168 seconds and achieving a 58.1% combined score. However, Kiro did not expose token usage. Instead, we measured credit consumption (46.1 credits). A full efficiency comparison is incomplete for Kiro, but given its credit usage, it’s one of the cheapest.
At the lower end, Goose demonstrates poor efficiency. It consumes 300k tokens and takes 587 seconds while scoring only 5.2%. High token usage does not translate into correctness in this case.
Overall, higher token consumption does not consistently correlate with higher accuracy. Architectural retry behavior and validation strategy appear to influence token usage more than raw problem-solving depth.
You can see our methodology below.
How agentic CLI tools work
Agentic CLI tools are autonomous agents that operate inside the terminal. While most users deploy them for coding tasks, they can execute any workflow that can be performed via shell commands.
These agents typically operate in a loop consisting of three phases:
- Gather context
- Take action
- Verify results
After verification, the agent gathers updated context and repeats the loop until it completes the task or reaches a stopping condition.
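The loop above can be sketched in Python. This is a schematic of the gather-act-verify cycle, not any specific tool's implementation; all names (`gather_context`, `plan_next_action`, `execute`, `verify`) are illustrative.

```python
def run_agent(task, model, tools, max_steps=20):
    """Schematic agentic loop: gather context, act, verify, repeat."""
    context = tools.gather_context(task)              # 1. gather context
    for _ in range(max_steps):
        action = model.plan_next_action(task, context)
        if action is None:                            # model declares the task done
            return "completed"
        result = tools.execute(action)                # 2. take action
        ok = tools.verify(result)                     # 3. verify results
        # Feed the outcome back in as updated context for the next iteration
        context = tools.gather_context(task, last_result=result, verified=ok)
    return "stopped"                                  # stopping condition reached
```

The `max_steps` cap stands in for the stopping conditions real frameworks impose, such as error limits or token budgets.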
The loop is influenced by two sources:
- The human user, who provides the initial task and may interrupt execution
- The model, which performs planning, reasoning, and action selection
The agent framework provides structure around the model. It defines how the model should plan, when it should execute commands, how it should validate results, and which tools are available. These tools may include shell execution, file system access, browser control, computer use, MCP integrations, or reusable “skills.”
Different agent architectures impose different planning strategies, retry policies, and verification logic. Some agents prioritize precision and deeper reasoning at the cost of higher token usage and latency. Others prioritize speed and lower cost with reduced behavioral robustness.
Model intelligence vs agent architecture
Performance differences between agentic CLI tools do not come from a single source. They emerge from two layers: the foundation model and the orchestration framework that wraps around it.
The foundation model determines how well the system understands requirements, plans multi-step tasks, and generates correct code. If the model misinterprets a constraint or produces incorrect logic, no amount of orchestration can fully compensate for that mistake.
The agent architecture, however, determines how that model is used. It decides how context is gathered from the workspace, when shell commands are executed, how outputs are validated, and whether the system retries after failure. These decisions shape runtime behavior, cost, and reliability.
Two agents powered by equally capable models can behave differently. One may aggressively retry after partial failure, consuming more tokens but recovering from early mistakes. Another may terminate quickly after the first inconsistency. One may enforce strict validation before moving forward, while another may continue with unverified assumptions.
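This trade-off can be sketched with a toy retry budget. The interface is invented for illustration; no real agent exposes exactly this shape, but it captures why an aggressive retry policy recovers more tasks at higher token cost.

```python
def run_with_retries(step, max_attempts):
    """Retry a failing step up to max_attempts times.

    `step(attempt)` is any callable returning (ok, tokens_used);
    the names are illustrative, not from a real agent framework.
    """
    total_tokens = 0
    for attempt in range(1, max_attempts + 1):
        ok, tokens = step(attempt)
        total_tokens += tokens
        if ok:
            return True, total_tokens
    return False, total_tokens

# A step that fails twice, then succeeds, costing 1k tokens per try:
flaky = lambda attempt: (attempt >= 3, 1000)

# An aggressive agent recovers at higher token cost:
#   run_with_retries(flaky, max_attempts=5)  -> (True, 3000)
# A fail-fast agent stops early and stays cheap, but fails the task:
#   run_with_retries(flaky, max_attempts=1)  -> (False, 1000)
```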
This benchmark evaluates the complete system. It does not isolate raw model intelligence from orchestration logic. When an agent consumes excessive tokens or fails a backend contract, the cause may lie in planning quality, retry policy, context management, or validation strictness.
Understanding this distinction is essential. High token usage does not necessarily indicate deeper reasoning, and a lower score does not automatically imply weaker underlying model capability. In autonomous environments, architecture and model reasoning interact continuously.
Agent insights
We evaluated agents across 10 tasks. Below, we present a detailed breakdown of Task 6 to illustrate how different agent architectures behave under the same constraints.
Task 6: Helpdesk ticket system (Web)
Task 6 required building a full-stack helpdesk ticket system with:
- Two user roles (customer and agent)
- JWT-based authentication
- Strict status workflow transitions
- Data isolation (404 instead of 403 for cross-user access)
- FastAPI backend
- React/Vue/Svelte + Vite frontend
- Deterministic run commands
The smoke test validated:
- Health check
- Dual-role authentication
- Ticket CRUD operations
- Assignment and replies
- Status transitions
- Role enforcement
- Data isolation
- UI login and post-login behavior
This task stresses state management, auth correctness, REST contract discipline, and frontend-backend integration.
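The two subtlest requirements, strict status transitions and 404-based data isolation, can be sketched as plain functions. The four-state transition table below is an assumed example; the full task spec enumerates the actual states.

```python
# Illustrative status workflow (the real spec defines the full table):
ALLOWED = {
    ("open", "in_progress"),
    ("in_progress", "resolved"),
    ("resolved", "closed"),
    ("resolved", "in_progress"),  # reopen
}

def transition_status(current, new):
    """Return an HTTP-style status code for a requested transition."""
    if (current, new) not in ALLOWED:
        return 400          # invalid transition rejected
    return 200

def fetch_ticket(ticket_id, requester, tickets):
    """Data isolation: cross-user access returns 404, never 403,
    so a client cannot even learn that the ticket exists."""
    ticket = tickets.get(ticket_id)
    if ticket is None or (requester.role == "customer"
                          and ticket["owner"] != requester.name):
        return 404, None
    return 200, ticket
```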
Codex
Installation
Install globally with:
- npm install -g @openai/codex
Alternatively, install globally with Homebrew (macOS/Linux):
- brew install --cask codex
Authentication
After setting up Codex, you can continue with your ChatGPT account or your OpenAI API key. No alternative provider options are available.
Task Report
Codex built a functionally correct system but diverged from the specified REST contract. A method choice reduced strict compliance despite correct business logic.
Backend Behavior
Authentication, ticket CRUD, replies, and status transitions functioned correctly. Role enforcement and data isolation were implemented properly.
The primary issue was HTTP method mismatch. Codex implemented /tickets/{id}/assign and /tickets/{id}/status as PATCH endpoints, while the smoke test required PUT.
Adaptive mode recovered some functionality by attempting alternate methods. Strict mode failed all steps tied to those endpoints.
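Hypothetically, the adaptive fallback could look like the sketch below; `send` is a stand-in for the real HTTP client, and none of these names come from our actual harness.

```python
def call_with_fallback(send, path, preferred="PUT", fallbacks=("PATCH", "POST")):
    """Try the contract-specified method first; on 404/405, try alternatives.

    `send(method, path)` returns an HTTP status code. Adaptive mode
    scores a step as passing if any method succeeds; strict mode would
    only accept `preferred`.
    """
    for method in (preferred, *fallbacks):
        status = send(method, path)
        if status not in (404, 405):
            return method, status
    return None, 404
```

Against Codex's output, `PUT /tickets/{id}/assign` would fail and the `PATCH` fallback would succeed, which is how adaptive mode recovered those steps.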
UI Behavior
The frontend passed all UI validation steps. Login flow and post-login state behaved correctly.
Kiro CLI
Installation
For macOS/Linux/WSL:
- curl -fsSL https://cli.kiro.dev/install | bash
Alternative Linux AppImage (portable option):
- Download: https://desktop-release.q.us-east-1.amazonaws.com/latest/kiro-cli.appimage
Then run:
- chmod +x kiro-cli.appimage && ./kiro-cli.appimage
Authentication
You can continue with your Kiro-Code plan. No provider options available.
Task Report
Kiro produced the fastest and most compact implementation. Status transitions, role enforcement, and data isolation were correctly implemented at the logic level.
However, the same unified-update endpoint design pattern seen in Aider caused six contract failures. A frontend lifecycle issue further reduced the UI score. The system is structurally sound but diverges from the specified API design.
Backend Behavior
Kiro generated a compact full-stack implementation in approximately 97 seconds. The backend consisted of a 324-line main.py file, and the frontend was a 276-line single-file React application. Only 9 files were produced in total. Seed data included 4 sample tickets across different statuses.
Authentication, ticket CRUD, replies, detail view, and data isolation worked correctly. Nine of 16 API steps passed.
The six failing steps correspond to /tickets/{id}/assign and /tickets/{id}/status. Kiro implemented a unified PATCH /tickets/{id} endpoint that updates status, priority, and assignment via JSON body fields. The business logic is correct, but the endpoint structure does not match the expected contract, resulting in 404 responses.
UI Behavior
Backend preflight passed and the frontend started successfully. Vite launched without runtime crashes.
However, the login form did not render. Playwright timed out after 7 seconds waiting for the email input field. Console diagnostics showed a 422 error during initial page load, likely caused by an /auth/me call executed on mount without a valid token. This prevented the login component from rendering and blocked the remaining UI steps.
Claude Code
Installation
For macOS/Linux/WSL, depending on your preferred package manager, you can install Claude Code with either:
- curl -fsSL https://claude.ai/install.sh | bash
- brew install --cask claude-code
Authentication
After setting up Claude Code, you can continue with your Claude Account. No provider options available.
Task Report
Claude Code produced one of the most structured codebases in this task. However, a fundamental JWT validation issue rendered the backend unusable.
This highlights a key distinction in agent evaluation: structural completeness does not compensate for authentication correctness.
It also consumed the highest token volume among agents evaluated in Task 6.
Backend Behavior
Login endpoints returned 200 and issued JWT tokens successfully. However, all subsequent authenticated requests returned 401 “Could not validate credentials.”
The root cause appears to be a mismatch between OAuth2PasswordBearer(tokenUrl="auth/login") and the /auth route prefix. The smoke adapter correctly discovered the login endpoint, but the issued tokens were not accepted by the middleware.
As a result, 13 of 16 backend steps failed.
Additionally, Claude Code implemented a single PATCH /tickets/{id} endpoint for updates instead of dedicated /assign and /status endpoints. However, this design choice became irrelevant due to the auth failure.
UI Behavior
The login form rendered correctly. The form submission returned 200. However, after login, Playwright detected a navigation crash:
“Execution context was destroyed.”
Browser logs showed 401 responses on authenticated API calls, which caused the post-login state to break.
Aider
Installation
If you already have Python 3.8-3.13 installed, install Aider with:
- python -m pip install aider-install
- aider-install
Authentication
Login to your OpenRouter account and authorize, or export your API Key in your environment with:
- export OPENROUTER_API_KEY="sk-or-v1-…"
Task Report
Aider was the fastest and most token-efficient builder. However, its API design diverged from the spec, and the login UI failed to render properly.
Backend Behavior
Authentication, ticket CRUD, replies, detail view, and data isolation were implemented correctly.
Instead of dedicated /assign and /status endpoints, Aider used a unified PUT /tickets/{id} endpoint for all updates. The smoke test expected separate endpoints, causing 404 failures for assignment and status steps.
UI Behavior
The frontend rendered content, but the login form did not appear. Playwright timed out waiting for the email input field. Subsequent UI steps were blocked.
OpenCode
Installation
For macOS/Linux/WSL:
- curl -fsSL https://opencode.ai/install | bash
Install globally with:
- npm i -g opencode-ai
For macOS/Linux, depending on your preferred package manager:
- bun add -g opencode-ai
- brew install anomalyco/tap/opencode
- paru -S opencode
Authentication
Many provider options are available; select your desired provider and authenticate with /connect
Task Report
OpenCode produced the most spec-compliant implementation with a single edge-case deviation. It also consumed the lowest token volume among all agents in this task.
Backend Behavior
Authentication, CRUD operations, replies, assignment, status transitions, role enforcement, and data isolation were implemented correctly.
Both /tickets/{id}/assign and /tickets/{id}/status endpoints were implemented as expected.
The only failing step occurred when the agent attempted to set the status to in_progress after assignment. Since the assignment operation already transitioned the ticket to in_progress, the second transition returned 400 due to strict no-op enforcement.
The backend behavior was logically correct, but the smoke test expected idempotent success for repeated transitions.
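One way to reconcile the two behaviors is to treat a no-op transition as idempotent success rather than an error. This is a sketch, not OpenCode's actual code:

```python
def set_status(ticket, new_status, allowed):
    """Apply a status transition to a ticket dict.

    Returning 200 when current == new_status makes repeated transitions
    idempotent (what the smoke test expected); returning 400 instead is
    the strict no-op enforcement OpenCode implemented.
    """
    if ticket["status"] == new_status:
        return 200                      # idempotent no-op
    if (ticket["status"], new_status) not in allowed:
        return 400                      # genuinely invalid transition
    ticket["status"] = new_status
    return 200
```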
UI Behavior
The frontend passed all 8 validation steps. Login rendered correctly, authentication persisted, and post-login behavior worked as expected.
Forge
Installation
Install globally with:
- npm install -g forgecode
Authentication
Configure your provider credentials interactively with:
- forge provider login
Then choose your provider.
Task Report
A single routing misconfiguration triggered cascading backend failures. The relatively low output token count suggests limited implementation depth.
Backend Behavior
Login succeeded and tokens were issued.
Ticket creation returned 307 redirects instead of 200/201. Because ticket creation failed, subsequent steps referencing $created_ticket.id failed with 422 errors.
The 307 responses likely stem from trailing slash redirection behavior in FastAPI.
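The likely mechanism can be shown with a minimal FastAPI fragment (illustrative, not Forge's generated code): if a route is declared at `/tickets/` but the client calls `/tickets`, FastAPI answers with a 307 redirect to the slashed path, and a smoke test that accepts only 200/201 fails.

```python
from fastapi import FastAPI

# Disabling slash redirects makes the mismatch fail loudly (404)
# instead of silently returning 307s that break strict harnesses.
app = FastAPI(redirect_slashes=False)

@app.post("/tickets", status_code=201)  # no trailing slash, matching the contract
def create_ticket():
    return {"id": 1}
```

Declaring routes exactly as the contract spells them (no trailing slash) avoids the issue entirely.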
/assign and /status endpoints returned 404.
UI Behavior
The frontend served content, but login components failed to render properly due to runtime errors in AuthContext.tsx. Subsequent UI steps were blocked.
Gemini CLI
Installation
Run instantly with:
- npx @google/gemini-cli
Install globally with:
- npm install -g @google/gemini-cli
Install globally with Homebrew (macOS/Linux):
- brew install gemini-cli
Install globally with MacPorts (macOS):
- sudo port install gemini-cli
Install with Anaconda (for restricted environments):
- conda create -y -n gemini_env -c conda-forge nodejs
- conda activate gemini_env
Authentication
Option 1: Login with Google (OAuth login using your Google Account). If a Google Cloud project is required, set it in your shell first:
- export GOOGLE_CLOUD_PROJECT="YOUR_PROJECT_ID"
Then start Gemini.
Option 2: Gemini API key. Set it in your shell:
- export GEMINI_API_KEY="YOUR_API_KEY"
Then start Gemini.
Option 3: Vertex AI. Set both variables in your shell:
- export GOOGLE_API_KEY="YOUR_API_KEY"
- export GOOGLE_GENAI_USE_VERTEXAI=true
Then start Gemini.
Task Report
Gemini CLI produced a strong backend but failed due to frontend toolchain incompatibility. It also consumed the highest token volume among successful backend implementations.
Backend Behavior
Authentication, CRUD, replies, assignment, role enforcement, and data isolation were implemented correctly.
However, the /tickets/{id}/status endpoint was missing entirely, causing all status transition steps to return 404.
UI Behavior
The frontend failed to start. Vite 7.3.1 was installed, which requires Node.js 20.19+, while the test environment runs Node.js 18.18.0. The crypto.hash API required by Vite was unavailable.
As a result, the UI never launched and scored 0/8.
Cline
Installation
Install globally with:
- npm install -g cline
Authentication
Run `cline auth` to sign in with your Cline account or continue with your desired provider.
Task Report
Cline’s error-limit mechanism terminated the build before completion. The backend structure shows correct architectural intent, but route registration issues and incomplete implementation prevented functional validation.
The absence of a frontend and cascading backend failures place this result among the weakest in Task 6.
Backend Behavior
Cline generated a backend with five files: main.py, models.py, schemas.py, auth.py, and database.py, along with a requirements.txt. The structure included proper models, JWT authentication scaffolding, and endpoint stubs.
However, the agent reached its eight-error limit during backend development and terminated before completing the system.
Only login endpoints functioned correctly. Three of 16 API steps passed.
Ticket creation returned 307 redirects instead of 200 or 201, likely due to trailing-slash route mismatches. Because ticket creation failed, $created_ticket.id was never captured. All subsequent steps referencing the ticket ID passed the literal string value, leading to 422 errors.
The /tickets/{id}/assign and /tickets/{id}/status endpoints were not implemented, resulting in 404 responses.
This produced a cascading failure pattern similar to Forge, where an early routing issue invalidated downstream steps.
UI Behavior
The backend started successfully. However, the frontend/ directory was empty and no package.json file existed.
Only the backend preflight step passed. All remaining UI steps were blocked.
Goose
Installation
For macOS/Linux/WSL:
- curl -fsSL https://github.com/block/goose/releases/download/stable/download_cli.sh | bash
- Model: Gemini 3 Pro Preview (via OpenRouter)
- Time: 1,297s
- Tokens: 17k input / 752 output
- API Score: 60%
- UI Score: 0%
Task Report
Goose demonstrated limited self-correction but failed to complete the full-stack requirement. Reliability issues during re-runs raise stability concerns.
Backend Behavior
Authentication, ticket CRUD, replies, detail view, and data isolation worked.
However, /assign and /status endpoints were not implemented, causing 404 responses for all related steps.
In an earlier build, Goose encountered bcrypt compatibility errors, self-corrected by pinning the dependency version, and eventually launched the backend.
A later re-run crashed with a stream decode error after minimal file generation.
UI Behavior
No frontend was created. The frontend directory was empty and no package.json existed. The UI test failed immediately.
AI coding tools
AI coding tools can be grouped into three categories:
- Agentic CLI: Terminal-based tools that generate, edit, and refactor code through prompts and command-line interactions.
- Examples: Aider, Opencode, Claude Code, Codex
- AI code editors: Also known as agentic IDEs, these tools provide a GUI similar to VS Code (most are built on VS Code).
- Examples: Antigravity, Cursor, Kiro Code, Windsurf
- Prompt-to-app builders: Low-code/no-code platforms for building apps with natural language prompts and visual workflows.
- Examples: Bolt, Lovable, v0.dev, Firebase Studio, Dazl
AI code review tools
As AI-generated code becomes more common, code review tools are essential for catching bugs and vulnerabilities. We evaluated the top tools on 309 PRs in our RevEval benchmark.
What can agentic CLI tools do?
Across tools like Claude Code, Gemini CLI, and OpenHands, common capabilities include:
- End-to-end code work: Create and modify files, fix bugs, refactor code, and run tests or linters directly from the terminal.
- Agentic workflows: Perform multi-step tasks such as task chaining, troubleshooting, search, and iterative debugging.
- Git & project management: Review history, resolve merges, manage branches, and create commits or pull requests.
- Command execution & automation: Run shell commands, automate analyses, and translate natural language into complex CLI operations.
- Deep context handling: Operate on full repositories with awareness of dependencies and project structure.
- Model flexibility: Support multiple cloud and, in some cases, local models; some tools allow using your own API key or choosing between plans.
- Sandboxed or controlled access: Offer modes ranging from read-only to full automation, often with isolated environments for safety.
Methodology
We ran agents in a one-shot execution setup to measure their autonomous capabilities without human intervention. Outputs were then scored with our backend and frontend smoke tests, which measure infrastructure readiness and behavioral correctness.
Scores reflect how reliably each agent produced runnable systems and how many functional requirements passed validation.
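The combined score is the 0.7 backend / 0.3 frontend weighting mentioned earlier. For example, plugging in Codex's and Claude Code's per-category scores reproduces their overall results:

```python
def combined_score(backend, frontend, w_backend=0.7, w_frontend=0.3):
    """Weighted overall score used in this benchmark (percent)."""
    return round(w_backend * backend + w_frontend * frontend, 1)

# Codex: 58.5% backend, 89.2% frontend
print(combined_score(58.5, 89.2))   # 67.7
# Claude Code: 38.6% backend, 95.0% frontend
print(combined_score(38.6, 95.0))   # 55.5
```

The heavy backend weighting is why Claude Code's near-perfect frontend score could not overcome its auth failures.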
Model Configuration
We aimed to use Google’s gemini-3-pro-preview due to its large context window, which suits multi-file orchestration and long task prompts. However, some agentic CLIs are tightly coupled to specific providers:
- Claude Code was evaluated using claude-opus-4-5-20251101 via Anthropic’s official API.
- Codex was evaluated using gpt-5.2-codex-medium through OpenAI’s native configuration.
For these agents, alternative model providers are not supported within their current CLI architecture. Each agent was evaluated using its default configuration. We did not tune temperature, retry policies, or reasoning parameters.
Our evaluation goal was to separate and measure:
- Build ability (can the agent produce runnable code?)
- Backend behavior correctness
- Frontend behavior correctness
- Autonomous orchestration reliability
CLI Versions (mid-February 2026)
- Opencode: v1.2.10
- Cline: v3.41
- Aider: v0.86.0
- Gemini CLI: v0.29.0
- Forge: v1.28.0
- Codex: 0.104.0
- Goose: v1.25.0
- Claude Code: v2.1.62
- Kiro CLI: 1.26.0
For evaluation methodology, visit: AI coding benchmark methodology
Read more
For those exploring the broader ecosystem of agentic developer tools, here are our latest benchmarks:
- MCP benchmark: A comparison of the top MCP servers for web access.
- Remote browsers: How emerging browser infrastructure enables AI agents to interact with the web securely.