Agentic CLI tools are AI coding tools that can create and delete files, run commands, plan, and build an entire project. We benchmarked the leading tools across 10 real-world web development scenarios, performing ~600 atomic validation checks per agent and more than 5,000 automated test executions in total, covering backend logic, frontend functionality, and multi-run consistency verification.
Agentic CLI benchmark results
Performance Insights of Agentic CLI tools
Backend correctness is the ranking axis. Every agent scored high on the frontend, ranging from 82% to 100%, so the UI does not distinguish between them. The combined score weights the backend 0.7 and the frontend 0.3.
Among the nine agents that reach Sonnet 4.6 cleanly, Opencode leads at 77.3% backend and 81.6% combined, and does so with some of the lowest token counts in the field, at 55k effective input. Grok is second at 75.4% backend, Claude Code third at 74.9%. This is the orchestration signal: one model, different scaffolding, and a backend range of more than 20 points among the tools that ran cleanly, from Opencode’s 77.3% to Goose’s 55.4%. Aider sits lower still at 32.7%; two of its backends crashed on startup, which is what drags its score down.
Codex and Gemini CLI sit apart because the proxy removed their reasoning budget. Codex is the clearest case: 100% frontend against 52.1% backend. It builds apps that run and serve a UI, then misses the backend contract. Gemini scores 63.7% on backend and is the slowest agent at 1,158 seconds, with its proxy chain dirtier than Codex’s because the auxiliary calls it uses for loop detection and context compression fail when routed through the gateway. Both numbers are floors set by the proxy, not orchestration scores, so they do not rank against the nine.
Token use does not track score. Junie shows the field’s highest effective input at 2.36M for a mid-pack 66.4% backend. Goose spends 956k input for 55.4%. Grok spends 310k for the second-best 75.4%. Opencode spends 55k for the best. Aider is the lightest and fastest at 33k input and 338 seconds, and also last at 32.7% backend, dragged down by two tasks whose backends crashed on startup, reproducible across two fresh builds, and scored 0. Kiro spends about 510k input for 64.2% backend, which, on its Bedrock-backed credit plan, worked out to roughly 46 Kiro credits per 10-task grid.
One result sets up the rest of the article: build rank and behavioral quality are independent. The agent that tops this leaderboard retains nothing after compacting its context, whereas an agent in the middle of it retains everything. The build score measures one thing well. It does not predict the other two.
Speed, token usage, and cost vs score
We evaluated runtime efficiency using average execution time (seconds), effective token usage (input + output), and cost per task (USD), each plotted against the combined accuracy score.
None of the three axes tracks score. Opencode proves it on all three at once: the best combined score (81.6%) at the lowest cost of any capable agent ($1.03 per task), among the fewest tokens, and among the fastest runs. The most accurate tool here is also the most efficient, which inverts the usual accuracy-for-cost trade-off.
Cost per task spans about 40x, from Forge’s $0.18 to Junie’s $7.58, with no relationship to rank. Forge is cheapest because it does the least: that price buys a backend that fails ticket creation outright. At the other end, Junie costs $7.58 for a mid-pack 74.7% combined, and even that is an inflated upper bound. Goose is second-priciest at $3.23 for the field’s lowest clean score (62.5%). Grok and Claude Code, second and third on score, cost $2.03 and $1.83.
A few numbers on these axes are bounds, not exact values. The token axis is effective input (gross input minus cache reads), a clean figure only for the six CLIs that report cache reads (Claude Code, Codex, Cline, Grok, Gemini, Opencode); for the rest it equals everything they sent and reads as an upper bound. Kiro bills Bedrock credits, not dollars, so its $1.72 is token-derived at raw Sonnet rates and reads as a floor; its credit-billed cost is closer to $2.23. Cline’s 64.4% combined reflects four tasks where it hit its error limit before shipping a frontend, each scored zero, the same rule applied to crashed backends.
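The two adjustments described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's accounting code: the function names and the token counts are hypothetical, and applying the 1.3x credit multiplier from the methodology to the token-derived figure is our assumption about how the two Kiro numbers relate.

```python
from typing import Optional


def effective_input(gross_input_tokens: int, cache_read_tokens: Optional[int]) -> int:
    """Gross input minus cache reads.

    When a CLI does not report cache reads (None), the gross figure is
    returned unchanged and should be read as an upper bound.
    """
    if cache_read_tokens is None:
        return gross_input_tokens
    return gross_input_tokens - cache_read_tokens


def credit_billed_cost(token_derived_usd: float, multiplier: float = 1.3) -> float:
    """Scale a token-derived cost by a credit multiplier (assumed relationship)."""
    return token_derived_usd * multiplier


# Hypothetical token counts, for illustration only.
print(effective_input(1_000_000, 700_000))  # CLI that reports cache reads
print(effective_input(500_000, None))       # CLI that does not: upper bound

# Kiro's token-derived floor scaled by the 1.3x multiplier lands near $2.2.
print(round(credit_billed_cost(1.72), 2))
```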
Time and tokens tell the same story the chart shows: neither predicts score. Kiro is among the fastest at 439s and Gemini is the slowest at 1,158s on proxy overhead, and both land mid-pack. Higher spend buys retries and re-validation, not problem-solving depth.
You can see our methodology below.
How agentic CLI tools work
Agentic CLI tools are autonomous agents that operate inside the terminal. While most users deploy them for coding tasks, they can execute any workflow that can be performed via shell commands.
These agents typically operate in a loop consisting of three phases:
- Gather context
- Take action
- Verify results
After verification, the agent gathers updated context and repeats the loop until it completes the task or reaches a stopping condition.
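The loop above can be sketched as a small driver function. This is a schematic under our own assumptions, not any specific CLI's implementation: the three callables, the `LoopResult` type, and the iteration cap used as the stopping condition are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    done: bool                      # True if verification passed
    iterations: int                 # number of loop passes executed
    history: List[str] = field(default_factory=list)


def run_agent_loop(
    gather_context: Callable[[List[str]], str],
    take_action: Callable[[str], str],
    verify: Callable[[str], bool],
    max_iterations: int = 5,
) -> LoopResult:
    """Repeat gather -> act -> verify until verification passes or the cap is hit."""
    history: List[str] = []
    for i in range(1, max_iterations + 1):
        context = gather_context(history)   # phase 1: gather context
        outcome = take_action(context)      # phase 2: take action
        history.append(outcome)
        if verify(outcome):                 # phase 3: verify results
            return LoopResult(True, i, history)
    return LoopResult(False, max_iterations, history)


# Toy run: the "action" only succeeds on the third attempt.
result = run_agent_loop(
    gather_context=lambda history: f"attempt {len(history) + 1}",
    take_action=lambda context: context.upper(),
    verify=lambda outcome: outcome == "ATTEMPT 3",
)
print(result.done, result.iterations)
```

In a real agent the callables would wrap model calls and shell execution; the cap stands in for whatever stopping condition the framework defines.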
The loop is influenced by two sources:
- The human user, who provides the initial task and may interrupt execution
- The model, which performs planning, reasoning, and action selection
The agent framework provides structure around the model. It defines how the model should plan, when it should execute commands, how it should validate results, and which tools are available. These tools may include shell execution, file system access, browser control, computer use, MCP integrations, or reusable “skills.”
Different agent architectures impose different planning strategies, retry policies, and verification logic. Some agents prioritize precision and deeper reasoning at the cost of higher token usage and latency. Others prioritize speed and lower cost with reduced behavioral robustness.
Model intelligence vs agent architecture
Performance differences between agentic CLI tools do not come from a single source. They emerge from two layers: the foundation model and the orchestration framework that wraps around it.
This benchmark tests all agents on the same foundation model: Claude Sonnet 4.6. Any difference in score is therefore a difference in orchestration: how the CLI gathers context, when it executes commands, how it validates output, and whether it retries after failure.
Opencode and Claude Code both use Sonnet 4.6 directly. Opencode scores 77.3% backend; Claude Code scores 74.9%. Two agents, same model, 2.4 percentage point difference in backend correctness. Kiro and Opencode both use Sonnet 4.6. Kiro scores 64.2% backend; Opencode scores 77.3%. The 13-point gap is the CLI’s contribution.
The two observatory benchmarks below push this further. They run the same common-model test on web research and context compaction, where the gaps are not 13 points but the difference between finding the right answer and inventing a wrong one.
Web research grounding
We asked every agent to audit framework documentation: which version introduced a feature, what its current status is, and what changed recently. Every answer had to cite one official source. We ran the probe twice, once on Unity and once on Next.js/React. The facts were selected so the correct answer only exists on a current, published page. Answering from training data produces a confident, wrong answer. We checked one thing: did the agent actually fetch the page it cited?
Four agents have built-in web search. Three of them (Codex, Gemini, Grok) ran on their non-Sonnet native models; the other eight, Claude Code included, ran on Sonnet 4.6.
Four patterns emerged.
- Real live search: Codex, Claude Code, Gemini, and Grok fetch current pages and catch recent changes. Codex was the only agent that reached the developer forum, where the hardest facts live.
- Searches, but lands on old pages: Cline fetched two dozen real documentation pages and still reported a version that had been replaced. The fetches were real; the pages were stale.
- No search, answers from training: Aider does not browse and says so. This is the honest response.
- Fabricated sources: Forge fetched nothing that worked, yet cited 31 sources on the Next.js probe. The cited pages do not exist. Its closing statement: “every cell is sourced from a page actually fetched during this session.”
On the Next.js probe, every other browsing agent grounded nearly all of its citations in pages it had actually fetched. Forge grounded none. The chart stacks each agent’s grounded citations against its fabricated ones, so the honest agents read as full green bars and Forge as a single red one. The chart covers the eight agents with a verifiable per-URL fetch log. Grok (server-side search), Gemini (truncated run), and Aider (no citations) appear in the table above but are excluded here.
Cline and Claude Code both ran on Sonnet 4.6 in this test. Claude Code found and opened the page with the correct answer. Cline did not. Same model, different result.
We scored each answer for factual accuracy, but those scores depend on an answer key currently under review. We are withholding the accuracy tables until the key is finalized.
Context compaction
When a session grows long, the agent compacts its context: it replaces the detailed history with a short summary and discards the originals. We tested whether the summary retains what matters.
We gave each agent approximately 112,000 tokens of documents with 13 invented facts embedded: an on-call PIN, a cloud region, a build tag, and ten more. Invented means the values are unique strings with no presence in training data. The agent read the documents and compacted. We then deleted the source files and asked for all 13 facts. With the files deleted, the only possible source is the compaction summary.
Four agents retained every fact. Three retained none. The three that scored 0 from memory alone had answered 13 of 13 while they could still re-read the files: they were re-reading on each query rather than recalling from the summary. Once the files were gone, they wrote “unknown” rather than guessing.
Goose, Forge, Opencode, and Kiro all run Sonnet 4.6. Kiro retained all 13. The other three retained none. Same model, opposite result.
Opencode ranks first in the build benchmark and retains nothing in compaction. Kiro ranks seventh in the build benchmark and retains everything in compaction. Strong build performance and strong compaction are independent properties.
Four agents fell outside the scope of this test, each for a concrete reason. Cline could not be driven to its compaction threshold. We built an 863,000-token document set and had it read every file, but Cline truncates each tool output to about 2,000 characters, so the documents collapsed to short previews. Its context plateaued at 214,000 tokens, 21% of its one-million-token window, and compaction never fired. We report Cline as not measurable under this protocol rather than estimate a number. Grok has a compaction command, but it read our documents in fragments rather than loading them in full, so there was never a complete context for it to compact. Aider’s summarizer compresses chat turns, not the contents of files added to the session, which is where the facts lived. Junie has no compaction feature.
Agent behaviors on task 6
We evaluated agents across 10 tasks. Below is a detailed breakdown of Task 6 to show how different CLI architectures behave under the same constraints when all run on the same model.
Task 6: Helpdesk ticket system (Web)
Task 6 required building a full-stack helpdesk ticket system with:
- Two user roles (customer and agent)
- JWT-based authentication
- Strict status workflow transitions
- Data isolation (404 instead of 403 for cross-user access)
- FastAPI backend
- React/Vue/Svelte + Vite frontend
- Deterministic run commands
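Two of these requirements, strict status transitions and 404-based data isolation, can be sketched in plain Python. This is only an illustration under assumptions: the status names, the transition table, and the rule that agents may read any ticket are hypothetical, since the task spec itself is not reproduced here.

```python
from typing import Optional

# Hypothetical workflow: each status lists the statuses it may move to.
ALLOWED_TRANSITIONS = {
    "open": {"in_progress"},
    "in_progress": {"resolved"},
    "resolved": {"closed", "in_progress"},
    "closed": set(),
}


def can_transition(current: str, new: str) -> bool:
    """A strict workflow rejects any move not listed in the table."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def ticket_read_status(owner_id: Optional[int], requester_id: int, role: str) -> int:
    """HTTP status for reading a ticket.

    owner_id is None when the ticket does not exist. A customer asking for
    another customer's ticket gets 404, not 403, so the response does not
    reveal that the ticket exists.
    """
    if owner_id is None:
        return 404
    if role == "agent" or owner_id == requester_id:  # assumption: agents see all
        return 200
    return 404


print(can_transition("open", "closed"))       # skipping steps is rejected
print(ticket_read_status(1, 2, "customer"))   # cross-user access
```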
The smoke test validated:
- Health check
- Dual-role authentication
- Ticket CRUD operations
- Assignment and replies
- Status transitions
- Role enforcement
- Data isolation
- UI login and post-login behavior
This task stresses state management, auth correctness, REST contract discipline, and frontend-backend integration. Visit GitHub to see the task details.
On one model, the field split into three groups.
Seven agents scored exactly 60% backend with the identical six failed steps, stable across all three reruns: codex, claude-code, cline, grok, goose, junie, and opencode. Authentication, ticket CRUD, replies, and data isolation all passed. The six failures were all on two routes: `/tickets/{id}/assign` and `/tickets/{id}/status`. These agents implemented a single unified `PATCH /tickets/{id}` endpoint with body fields instead of the spec’s separate routes. The business logic was correct; the REST contract was not. In the prior native-model run on Gemini 3 Pro Preview, Opencode built the separate endpoints and scored 93.3% here. On Sonnet 4.6 it chose the unified design like everyone else.
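The contract mismatch is easy to reproduce with a toy router. The sketch below is not any agent's code: the `compile_route` and `respond` helpers are hypothetical, and the HTTP method on the spec routes is a placeholder, since the article only names the paths. It also simplifies real framework behavior by returning 404 for anything unmatched.

```python
import re
from typing import List, Tuple


def compile_route(template: str) -> re.Pattern:
    """Turn '/tickets/{id}/assign' into a regex with one segment per placeholder."""
    literal_parts = re.split(r"\{[^/}]+\}", template)
    return re.compile("^" + "[^/]+".join(re.escape(p) for p in literal_parts) + "$")


def respond(routes: List[Tuple[str, str]], method: str, path: str) -> int:
    """Return 200 if any registered (method, template) pair matches, else 404."""
    for route_method, template in routes:
        if route_method == method and compile_route(template).match(path):
            return 200
    return 404


# What the spec asked for (method is an assumed placeholder).
SPEC_ROUTES = [("PATCH", "/tickets/{id}/assign"), ("PATCH", "/tickets/{id}/status")]
# What the seven agents built instead.
UNIFIED_ROUTES = [("PATCH", "/tickets/{id}")]

# The smoke test calls the spec paths, so the unified design answers 404
# even though its own endpoint works.
print(respond(UNIFIED_ROUTES, "PATCH", "/tickets/7/assign"))
print(respond(UNIFIED_ROUTES, "PATCH", "/tickets/7"))
print(respond(SPEC_ROUTES, "PATCH", "/tickets/7/assign"))
```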
Three agents scored 13.3%: aider, forge, and gemini-cli. Authentication worked, but ticket creation itself failed, so every step that depended on an existing ticket cascaded to failure.
Kiro scored 24.4%, an average of instability rather than a single failure mode. It passed nine steps on the first run, dropped to two on the second, and on the third the backend never started at all (the health check failed). The other ten agents produced the same result on every rerun.
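The reported percentages can be reproduced with a plain per-run average. This sketch assumes 15 scored steps per run, the count implied by nine passes plus six failures in the 60% cluster, and it does not model whatever the "adaptive" part of the benchmark's metric does.

```python
from typing import List


def avg_step_pass_rate(passed_per_run: List[int], total_steps: int) -> float:
    """Mean of per-run pass rates, as a percentage."""
    rates = [passed / total_steps for passed in passed_per_run]
    return 100 * sum(rates) / len(rates)


TOTAL_STEPS = 15  # assumption: 9 passed + 6 failed in the main cluster

# Kiro: nine steps, then two, then a backend that never started.
print(round(avg_step_pass_rate([9, 2, 0], TOTAL_STEPS), 1))
# Main cluster: nine steps on every rerun.
print(round(avg_step_pass_rate([9, 9, 9], TOTAL_STEPS), 1))
# Aider, Forge, Gemini CLI: two steps on every rerun.
print(round(avg_step_pass_rate([2, 2, 2], TOTAL_STEPS), 1))
```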
The takeaway is the convergence. Seven different CLIs on the same model made the same REST-contract mistake. On this task the model dominates and the orchestration barely matters. The observatory benchmarks above show the reverse: on web research and compaction, the same model produces opposite results depending on the tool.
UI behavior split the 60% cluster. claude-code and cline failed the login step with an identical bug: the frontend called the backend on `localhost:8000` from a `127.0.0.1` origin, and the browser blocked the login request under CORS policy. Both scored 75%. The other five in the cluster rendered and logged in cleanly at 100%.
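Why `localhost` and `127.0.0.1` count as different origins can be shown with the standard library. This is a sketch of the general mechanism, not a diagnosis of the two builds: the frontend port 5173, the `/auth/login` path, and the idea that the backend's allow-list named only the `localhost` origin are assumptions.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    """An origin is the (scheme, host, port) triple; the host is compared as text."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(a: str, b: str) -> bool:
    return origin(a) == origin(b)


def cors_allows(page_origin: str, allowed_origins: List[str]) -> bool:
    """A cross-origin call passes only if the page's origin is on the allow-list."""
    return "*" in allowed_origins or page_origin in allowed_origins


page = "http://127.0.0.1:5173/login"         # hypothetical frontend URL
api = "http://localhost:8000/auth/login"     # hypothetical login endpoint

print(same_origin(page, api))                                       # cross-origin call
print(cors_allows("http://127.0.0.1:5173", ["http://localhost:5173"]))
```

Both names resolve to the same machine, but the browser compares the host strings, so an allow-list that names only one of them blocks requests from the other.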
Codex
Installation
Install globally with:
- npm install -g @openai/codex
Alternatively, install globally with Homebrew (macOS/Linux)
- brew install --cask codex
Authentication
After setting up Codex, you can continue with your ChatGPT Account, or with your OpenAI API Key. No provider options available.
Task Report
Codex built a working system in 454 seconds and landed in the 60% cluster. The business logic was correct; it missed the REST contract on assignment and status, like the rest of the field.
Backend Behavior
Authentication, ticket CRUD, replies, and data isolation passed. The six failures were the assignment and status-transition steps, which targeted `/tickets/{id}/assign` and `/tickets/{id}/status`. Codex routed both through a unified update endpoint, so those calls returned 404. Stable across all three reruns.
UI Behavior
Frontend passed all eight validation steps. Login and post-login state behaved correctly. 100% UI.
Junie
Installation
Junie is available through JetBrains Toolbox or as a standalone CLI:
- curl -fsSL https://junie.jetbrains.com/install | bash
Authentication
Continue with your JetBrains account or generate a JUNIE_API_KEY at junie.jetbrains.com/cli, or export your own API key from Anthropic, OpenAI, Google, or other supported providers. Multiple provider options available.
Task Report
Junie produced a complete full-stack system in 444 seconds and scored 60% backend, in the main cluster. Its effective input on this task is the highest in the field at 1.52M, an uncached upper bound affected by a known caching-accounting bug (see the results table note).
Backend Behavior
Nine of fifteen steps passed: authentication, ticket CRUD, replies, and data isolation. The six failures were the assignment and status-transition steps. Junie handled status and assignment through a unified update endpoint, so the spec’s `/tickets/{id}/assign` and `/tickets/{id}/status` routes returned 404. The transition logic itself was correct. Stable across all three reruns.
UI Behavior
Frontend passed all eight validation steps. 100% UI.
Kiro CLI
Installation
For macOS/Linux/WSL:
- curl -fsSL https://cli.kiro.dev/install | bash
Alternative Linux AppImage (portable option):
- Download: https://desktop-release.q.us-east-1.amazonaws.com/latest/kiro-cli.appimage
Then run:
- chmod +x kiro-cli.appimage && ./kiro-cli.appimage
Authentication
You can continue with your Kiro-Code plan. No provider options available.
Task Report
Kiro is the one agent whose score reflects instability rather than a single design choice. Its 24.4% backend is an average across three reruns that produced three different outcomes. The build itself was sound when it ran; the problem was that it did not run the same way twice.
Backend Behavior
On the first run, Kiro passed nine of fifteen steps, the same profile as the 60% cluster, failing only the assignment and status routes. On the second run it passed two. On the third the backend never came up and even the health check failed. Averaged, this is 24.4%. The instability, not the endpoint design, is what separates Kiro from the cluster here.
UI Behavior
When the backend was up, the frontend passed all eight validation steps. 100% UI. This is a change from the prior run, where the login form failed to render on a 422 at mount.
Claude Code
Installation
On macOS/Linux/WSL, install Claude Code with either of the following, depending on your preferred package manager:
- curl -fsSL https://claude.ai/install.sh | bash
- npm install -g @anthropic-ai/claude-code
Authentication
After setting up Claude Code, you can continue with your Claude Account. No provider options available.
Task Report
Claude Code scored 60% backend in 379 seconds, in the main cluster. This is a marked improvement over the prior run, where a JWT validation bug returned 401 on every authenticated route and failed 13 of 16 steps. In this run the backend worked; the loss was on the UI.
Backend Behavior
Authentication, ticket CRUD, replies, and data isolation passed. The six failures were the assignment and status-transition steps, routed through a unified update endpoint instead of the spec’s separate paths. Stable across all three reruns.
UI Behavior
The login step failed. The frontend called the backend on localhost:8000 while the page was served from a 127.0.0.1 origin, and the browser blocked the login request under CORS policy. Five steps passed, one failed, two were blocked. 75% UI. Cline failed the same way.
Aider
Installation
If you already have Python 3.8-3.13 installed, install aider with:
- python -m pip install aider-install
- aider-install
Authentication
Login to your OpenRouter account and authorize, or export your API Key in your environment with:
- export OPENROUTER_API_KEY="sk-or-v1-…"
Task Report
Aider was the fastest agent at 236 seconds and the lightest, with 1.3k input and 18k output tokens. It also scored 13.3% backend. Authentication worked, but ticket creation failed, and every step that needed an existing ticket failed with it.
Backend Behavior
Two steps passed. The build broke at ticket creation, so the customer and agent ticket lists, replies, assignment, status transitions, and role checks all cascaded to failure. Stable across all three reruns. This is a different failure class from the 60% cluster, which created tickets correctly and only missed the assignment and status routes.
UI Behavior
The login step failed under the same CORS origin mismatch seen in claude-code and cline. Five steps passed, one failed, two blocked. 75% UI.
OpenCode
Installation
For macOS/Linux/WSL:
- curl -fsSL https://opencode.ai/install | bash
Install globally with:
- npm i -g opencode-ai
For macOS/Linux, considering your preferred package manager:
- bun add -g opencode-ai
- brew install anomalyco/tap/opencode
- paru -S opencode
Authentication
Many provider options are available. Select your desired provider and authenticate with /connect.
Task Report
Opencode leads the overall benchmark, but on Task 6 it scored 60% backend, in the main cluster, in 542 seconds. This is the clearest single piece of model evidence in the article. In the prior native-model run on Gemini 3 Pro Preview, Opencode built the spec’s separate endpoints and scored 93.3% here. The same CLI on Sonnet 4.6 chose the unified endpoint and dropped to 60%. The tool did not change; the model did.
Backend Behavior
Authentication, ticket CRUD, replies, and data isolation passed. The six failures were the assignment and status-transition steps, routed through a unified update endpoint. Stable across all three reruns.
UI Behavior
Frontend passed all eight validation steps. 100% UI.
Grok Build
Installation
For macOS/Linux:
- curl -fsSL https://x.ai/cli/install.sh | bash
Authentication
Sign in with your xAI account on first launch, or set an API key for headless use:
- export XAI_API_KEY="xai-…"
Task Report
Grok finished second overall in the build benchmark at 75.4% backend. On Task 6 it scored 60% backend in 433 seconds, in the main cluster. In this run Grok reached Sonnet 4.6 through OpenRouter.
Backend Behavior
Nine of fifteen steps passed: authentication, ticket CRUD, replies, and data isolation. The six failures were the assignment and status-transition steps, which targeted `/tickets/{id}/assign` and `/tickets/{id}/status`. Grok routed both through a unified update endpoint, so those calls and the role checks that depend on them returned 404. Stable across all three reruns.
UI Behavior
Frontend passed all eight validation steps. Login and post-login state behaved correctly. 100% UI.
Forge
Installation
For macOS/Linux/WSL:
- curl -fsSL https://forgecode.dev/cli | sh
Authentication
Configure your provider credentials interactively by:
- forge provider login
And choose your provider.
Task Report
Forge scored 13.3% backend in 844 seconds. Its output token count is the lowest in the field at 1.6k, which points to a shallow implementation. As in the prior run, the build broke at ticket creation and cascaded.
Backend Behavior
Two steps passed. Ticket creation failed, so the ticket lists, replies, assignment, status transitions, and role checks all failed with it. Stable across all three reruns, the same 13.3% profile as aider and gemini-cli.
UI Behavior
The login step failed under the same CORS origin mismatch seen in claude-code, cline, and aider. Five steps passed, one failed, two blocked. 75% UI.
Gemini CLI
Installation
Run instantly:
- npx @google/gemini-cli
Or install globally:
- npm install -g @google/gemini-cli
- brew install gemini-cli
Authentication
Option 1 (Google OAuth): export GOOGLE_CLOUD_PROJECT="YOUR_PROJECT_ID" then start gemini.
Option 2 (API key): export GEMINI_API_KEY="YOUR_API_KEY" then start gemini.
Option 3 (Vertex AI): export GOOGLE_API_KEY and GOOGLE_GENAI_USE_VERTEXAI=true, then start gemini.
Task Report
Gemini CLI scored 13.3% backend in 926 seconds, one of the two slowest agents in the field. Authentication worked, but ticket creation failed and cascaded. Its frontend, which failed entirely in the prior run on a Node 18 versus Vite 7 incompatibility, passed every step this time.
Backend Behavior
Two steps passed. Ticket creation failed, so all dependent steps failed. Stable across all three reruns, the same 13.3% profile as aider and forge.
UI Behavior
Frontend passed all eight validation steps. 100% UI, up from 0% in the prior run. A 401 appeared in the console on an authenticated call, but it did not block the rendered flow.
Cline
Installation
Install globally with:
- npm install -g cline
Authentication
Run `cline auth` to select your Cline account or continue with your desired provider.
Task Report
Cline scored 60% backend in 648 seconds, in the main cluster. This is a large change from the prior run, where its eight-error limit terminated the build early and left an empty frontend. Here it completed the full stack.
Backend Behavior
Authentication, ticket CRUD, replies, and data isolation passed. The six failures were the assignment and status-transition steps, routed through a unified update endpoint. Stable across all three reruns.
UI Behavior
The login step failed under the same CORS origin mismatch seen in claude-code, on a 127.0.0.1 page calling a localhost backend. Five steps passed, one failed, two blocked. 75% UI.
Goose
Installation
For macOS/Linux/WSL:
- curl -fsSL https://github.com/block/goose/releases/download/stable/download_cli.sh | bash
Task Report
Goose scored 60% backend in 553 seconds, in the main cluster, but consumed 1.06M input tokens to get there. It completed the full stack this time, a change from the prior run where the frontend directory was left empty.
Backend Behavior
Authentication, ticket CRUD, replies, and data isolation passed. The six failures were the assignment and status-transition steps, routed through a unified update endpoint. Stable across all three reruns.
UI Behavior
Frontend passed all eight validation steps. 100% UI, up from 0% in the prior run.
AI coding tools
AI coding tools can be grouped into three categories:
- Agentic CLI: Tools for terminal-based development workflows that generate, edit, and refactor code through prompts and command-line interactions.
- Examples: Aider, Junie, Opencode, Claude Code, Codex
- AI code editors: Also known as agentic IDEs, these tools provide a GUI similar to VS Code (most of them are built on VS Code).
- Examples: Antigravity, Cursor, Kiro Code, Windsurf
- Prompt-to-app builders: Low-code/no-code platforms to build apps using natural language prompts and visual workflows.
- Examples: Bolt, Lovable, v0.dev, Firebase Studio, Dazl
AI code review tools
As AI-generated code becomes more common, code review tools are essential for catching bugs and vulnerabilities. We evaluated the top tools on 309 PRs in our RevEval benchmark.
What can agentic CLI tools do?
Across tools like Codex, Junie, Kiro and Claude Code, common capabilities include:
- End-to-end code work: Create and modify files, fix bugs, refactor code, and run tests or linters directly from the terminal.
- Agentic workflows: Perform multi-step tasks such as task chaining, troubleshooting, search, and iterative debugging.
- Git & project management: Review history, resolve merges, manage branches, and create commits or pull requests.
- Command execution & automation: Run shell commands, automate analyses, and translate natural language into complex CLI operations.
- Deep context handling: Operate on full repositories with awareness of dependencies and project structure.
- Model flexibility: Support multiple cloud and, in some cases, local models; some tools allow using your own API key or choosing between plans.
- Sandboxed or controlled access: Offer modes ranging from read-only to full automation, often with isolated environments for safety.
Methodology
A-CODE-CLI Benchmark
We evaluated agents under a one-shot execution setup to measure autonomous capability without human intervention. Agents were then evaluated using backend and frontend smoke tests to measure infrastructure readiness and behavioral correctness.
Model configuration. All 11 agents ran on Claude Sonnet 4.6 (non-reasoning). Two agents required a proxy to reach this model:
- Codex (OpenAI CLI) cannot point at Anthropic models natively. It was routed through a LiteLLM gateway to OpenRouter/Anthropic, with a cache shim restoring prompt caching. The proxy strips reasoning tokens (capability cost) and adds latency.
- Gemini CLI cannot call Anthropic models natively. It was routed through an SSE shim and LiteLLM gateway. Its auxiliary model calls (loop detection, malformed-tool repair, context compression) fail or return invalid content through the proxy, so it ran without its own safety nets.
Forge required a separate proxy to strip extended thinking blocks from responses, which Forge force-enables and which cause 400 errors when echoed back. All other agents used Sonnet 4.6 directly via their native provider configuration or OpenRouter.
The proxy can only handicap codex and gemini-cli, never inflate them. Their scores are conservative.
Junie co-runs a non-overridable GPT-4.1-mini helper alongside its Sonnet 4.6 primary model. It is the only agent with a second model active during the build. Its scores carry a multi-model asterisk.
Claude Code ran via user subscription (OAuth). Kiro ran on Kiro-hosted credits (Bedrock-backed, 1.3x multiplier).
No agent had temperature, retry, or reasoning parameters tuned. Each ran its default configuration.
Scoring. Backend: functional smoke (adaptive_avg_step_pass_rate). Frontend: UI smoke via Playwright. Combined: 0.7 × backend + 0.3 × frontend (for agents with complete UI data). Backend score is the primary ranking axis. Frontend performance saturates across the field.
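The combined formula is a weighted sum, sketched below. The function name is ours, and returning `None` for an agent without complete UI data is an assumption about how that case is handled. The inputs reuse Task 6's backend and UI percentages purely as an illustration; the resulting per-task combined values are not figures reported by the benchmark.

```python
from typing import Optional

BACKEND_WEIGHT = 0.7
FRONTEND_WEIGHT = 0.3


def combined_score(backend: float, frontend: Optional[float]) -> Optional[float]:
    """0.7 x backend + 0.3 x frontend, or None when UI data is incomplete."""
    if frontend is None:
        return None
    return BACKEND_WEIGHT * backend + FRONTEND_WEIGHT * frontend


# Same 60% backend, different UI results: the 25-point UI gap moves
# the combined score by only 7.5 points.
print(round(combined_score(60.0, 100.0), 1))
print(round(combined_score(60.0, 75.0), 1))
```

Because the backend carries 70% of the weight and frontend scores cluster near the top, the combined ranking is driven almost entirely by the backend column.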
Aider t-3 and t-4. Both tasks produced backends that crashed on startup. Confirmed across two fresh builds (same errors: TypeError on class Card in t-3, AmbiguousForeignKeysError on User.auctions in t-4). Scored 0 with a backend_never_ready flag, not excluded.
For evaluation methodology, visit: AI coding benchmark methodology
CLI versions (June 2026 benchmark run)
Versions read from the benchmark VPS boxes. The build run executed June 5 to 8, 2026.
- Claude Code: 2.1.165
- Cline: 3.0.27
- Codex: 0.140.0
- Aider: 0.86.2
- Gemini CLI: 0.26.0
- Forge: 2.13.11
- Goose: 1.37.0
- Grok: 0.2.54
- Junie: 26.06.01 (build 1831.35)
- Kiro CLI: 2.6.1
- Opencode: 1.17.7
Web research grounding methodology
Two probes: a Unity migration audit (probe 2) and a Next.js/React version audit (probe 3). Each asked the agent to report version, status, and timeline for specified framework features and cite one official URL per claim.
Grading used two parallel methods. Ground-truth gating: a claim scores only if the cited URL appears in the agent’s real fetch log AND the fetched page contains the fact, measured against a verified answer key. Behavioral classification: an LLM judge read each agent’s full transcript and assigned it to one of the four behavioral categories. The behavioral classification is the primary output; the scored accuracy tables will publish after the answer key completes its human anchor review.
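The ground-truth gate is a conjunction of two checks, which the following sketch makes explicit. It is an illustration, not the grading script: the `Claim` type, the example URLs, and the use of a plain substring test for "the fetched page contains the fact" are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Claim:
    cited_url: str
    fact: str


def claim_scores(claim: Claim, fetch_log: Set[str], fetched_pages: Dict[str, str]) -> bool:
    """A claim scores only if its URL was really fetched AND that page holds the fact."""
    if claim.cited_url not in fetch_log:
        return False  # cited but never fetched: ungrounded
    page_text = fetched_pages.get(claim.cited_url, "")
    return claim.fact in page_text  # substring match as a stand-in check


# Hypothetical session data.
url = "https://example.com/docs/release-notes"
fetch_log = {url}
fetched_pages = {url: "Feature X was introduced in version 2.0."}

print(claim_scores(Claim(url, "introduced in version 2.0"), fetch_log, fetched_pages))
print(claim_scores(Claim("https://example.com/docs/missing", "introduced in version 2.0"),
                   fetch_log, fetched_pages))
```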
Agents with built-in search (Codex, Gemini, Grok) ran on their native models because the task requires their built-in search capability. The remaining eight ran on Claude Sonnet 4.6. N=1.
Context compaction methodology
Agents received approximately 112,000 tokens of filler documents containing 13 invented infrastructure facts. After the agent read the documents and compacted its context, we deleted the source files before asking any questions. Scoring: exact match against 13 invented values, automated by a grading script with one regex per fact. N=3.
Agents that scored 13/13 with files present and 0/13 with files deleted are classified as re-readers. Agents that scored 13/13 with files deleted are classified as true retainers. File deletion rules out re-reading; invented facts rule out training-data recall.
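A grader along these lines needs one pattern per fact and a rule for the two classes. The sketch below is hypothetical throughout: the fact values, the patterns, and the "unclassified" fallback label are ours, and only three of the thirteen facts are shown.

```python
import re
from typing import Dict

# Hypothetical invented facts, one regex each.
FACT_PATTERNS: Dict[str, str] = {
    "on_call_pin": r"\b4471\b",
    "cloud_region": r"\beu-north-7\b",
    "build_tag": r"\bbuild-2f9c\b",
}


def count_retained(answer: str, fact_patterns: Dict[str, str]) -> int:
    """Number of facts whose pattern appears in the agent's answer."""
    return sum(1 for pattern in fact_patterns.values() if re.search(pattern, answer))


def classify(score_with_files: int, score_without_files: int, total: int) -> str:
    """Apply the two definitions above; anything else is left unclassified."""
    if score_without_files == total:
        return "true retainer"
    if score_with_files == total and score_without_files == 0:
        return "re-reader"
    return "unclassified"


answer = "The on-call PIN is 4471 and the region is eu-north-7. Build tag: unknown."
print(count_retained(answer, FACT_PATTERNS))
print(classify(13, 0, 13))
print(classify(13, 13, 13))
```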
All agents except Codex (GPT-5.5) and Gemini (Gemini 2.5 Pro) ran on Sonnet 4.6. Model used per agent is listed in the results table.
Read more
For those exploring the broader ecosystem of agentic developer tools, here are our latest benchmarks:
- MCP benchmark: A comparison of the top MCP servers for web access.
- Remote browsers: How emerging browser infrastructure enables AI agents to interact with the web securely.
Cite this benchmark
@misc{kaleliolu2026,
author = {Kalelioğlu, Berk and Dilmegani, Cem},
title = {{A-CODE-CLI Bench: Agentic CLI Benchmark}},
year = {2026},
month = jun,
howpublished = {\url{https://aimultiple.com/agentic-cli}},
note = {AIMultiple. Retrieved June 18, 2026}
}
Results and timestamps of 110 data points. Download the data used in this article as a ZIP file containing one CSV file and a README.



