Agentic CLI Tools: Codex vs Claude Code

updated on Feb 27, 2026

Agentic CLI tools are AI coding tools that can create and delete files, run commands, and plan and execute an entire coding project. We tested the leading tools across 10 real-world web development scenarios to determine which is the most reliable.

Agentic CLI benchmark results

Analysis and insights

Codex has the highest overall score (67.7%) and the strongest backend performance (58.5%). Its backend score is nearly 10 percentage points higher than the next-best performer, Kiro CLI (48.7%).

Claude Code has the highest frontend score (95.0%), but its backend score (38.6%) pulls down its overall result (55.5%). This illustrates the main dynamic in the chart: frontend performance is relatively high for several agents, while backend correctness and contract discipline drive most of the ranking separation.

The largest UI-backend gap appears in Claude Code (95.0% frontend vs 38.6% backend). In contrast, Codex combines a high frontend score (89.2%) with the best backend score, which is why it leads overall under the 0.7 backend / 0.3 frontend weighting.
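The weighting can be checked with a few lines of arithmetic. The sketch below applies the 0.7 backend / 0.3 frontend weights to the backend and frontend percentages quoted in this section; the function name and the rounding to one decimal place are assumptions made for illustration, not the benchmark's actual scoring code.

```python
# Weights stated in the article: 0.7 backend, 0.3 frontend.
BACKEND_WEIGHT = 0.7
FRONTEND_WEIGHT = 0.3


def combined_score(backend: float, frontend: float) -> float:
    """Return the weighted overall score, rounded to one decimal place."""
    return round(BACKEND_WEIGHT * backend + FRONTEND_WEIGHT * frontend, 1)


# (backend %, frontend %) pairs quoted in the analysis above.
scores = {
    "Codex": (58.5, 89.2),
    "Claude Code": (38.6, 95.0),
    "Goose": (3.1, 10.0),
}

for agent, (backend, frontend) in scores.items():
    print(f"{agent}: {combined_score(backend, frontend)}%")
```

Because the backend carries more than twice the weight of the frontend, a 20-point backend lead outweighs a 6-point frontend deficit, which matches the Codex and Claude Code ordering.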

Lower-ranked agents fail for different reasons. Goose scores near zero on both backend (3.1%) and frontend (10.0%), indicating basic execution and completeness issues. Forge and Cline show moderate frontend scores (45.8% and 33.3%) but low backend scores (20.1% and 26.7%), consistent with backend contract and routing problems dominating their outcomes.

Speed vs score & token usage vs score

We evaluated runtime efficiency using average execution time (seconds), effective token usage (input + output), and combined accuracy score.

Aider occupies the most balanced region of the chart. With a 52.7% combined score, it completes tasks in 257 seconds and consumes 126k tokens. It is the only agent that combines mid-to-high accuracy with relatively low runtime and moderate token usage.

Codex achieves the highest overall score (67.7%) but at higher cost. Its average runtime is 426 seconds and token usage is 258k. Relative to Aider, that is roughly 1.7x the runtime and 2x the tokens for a 15.0 percentage point gain in score.

Claude Code is the most expensive among top-performing agents. Its combined score (55.5%) trails Codex (67.7%) and Kiro CLI (58.1%), yet it requires 745 seconds and 397k tokens. Compared to Aider, Claude Code consumes over 3x more tokens for a 2.8 percentage point increase in score.

Kiro CLI is the fastest agent, completing in 168 seconds and achieving a 58.1% combined score. However, Kiro did not expose token usage. Instead, we measured credit consumption (46.1 credits). A full efficiency comparison is incomplete for Kiro, but given its credit usage, it’s one of the cheapest.

At the lower end, Goose demonstrates poor efficiency. It consumes 300k tokens and takes 587 seconds while scoring only 5.2%. High token usage does not translate into correctness in this case.

Overall, higher token consumption does not consistently correlate with higher accuracy. Architectural retry behavior and validation strategy appear to influence token usage more than raw problem-solving depth.
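One way to make this comparison concrete is to divide token usage by score. The sketch below uses only the figures quoted in this section; "tokens per point" is an illustrative metric introduced here, not one the benchmark reports, and Kiro CLI is omitted because it did not expose token usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    score: float   # combined accuracy score, percent
    seconds: int   # average execution time
    tokens_k: int  # effective tokens (input + output), in thousands


# Figures quoted in the text above.
runs = {
    "Aider": Run(52.7, 257, 126),
    "Codex": Run(67.7, 426, 258),
    "Claude Code": Run(55.5, 745, 397),
    "Goose": Run(5.2, 587, 300),
}


def tokens_per_point(run: Run) -> float:
    """Thousands of tokens spent per percentage point of score."""
    return run.tokens_k / run.score


# Print agents from most to least token-efficient under this metric.
for name, run in sorted(runs.items(), key=lambda kv: tokens_per_point(kv[1])):
    print(f"{name:12s} {tokens_per_point(run):6.2f}k tokens per point")

# Ratio quoted in the text: Claude Code vs Aider token usage.
claude_vs_aider = runs["Claude Code"].tokens_k / runs["Aider"].tokens_k
```

Under this metric, Goose spends far more tokens per point than any other agent listed, which is consistent with the observation that token volume alone does not predict correctness.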

You can see our methodology below.

How agentic CLI tools work

Agentic CLI tools are autonomous agents that operate inside the terminal. While most users deploy them for coding tasks, they can execute any workflow that can be performed via shell commands.

These agents typically operate in a loop consisting of three phases:

Gather context
Take action
Verify results

After verification, the agent gathers updated context and repeats the loop until it completes the task or reaches a stopping condition.
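The three-phase loop can be sketched in a few lines. This is a minimal illustration, not how any specific tool is implemented: the callback names and the iteration budget used as the stopping condition are assumptions.

```python
from typing import Callable


def run_agent_loop(
    gather_context: Callable[[], str],
    take_action: Callable[[str], str],
    verify: Callable[[str], bool],
    max_iterations: int = 5,
) -> tuple[bool, int]:
    """Repeat gather -> act -> verify until verification passes or the
    iteration budget (the stopping condition here) is exhausted."""
    for iteration in range(1, max_iterations + 1):
        context = gather_context()
        result = take_action(context)
        if verify(result):
            return True, iteration
    return False, max_iterations


# Toy task: the "tests" only pass on the third attempt.
attempts = {"count": 0}


def gather() -> str:
    return f"attempt {attempts['count']}"


def act(context: str) -> str:
    attempts["count"] += 1
    return "tests pass" if attempts["count"] >= 3 else "tests fail"


done, used = run_agent_loop(gather, act, lambda result: result == "tests pass")
print(done, used)
```

The iteration budget is where speed and robustness trade off: a small budget stops early and cheaply, while a large one allows recovery from early mistakes at the price of more tokens.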

The loop is influenced by two sources:

The human user, who provides the initial task and may interrupt execution
The model, which performs planning, reasoning, and action selection

The agent framework provides structure around the model. It defines how the model should plan, when it should execute commands, how it should validate results, and which tools are available. These tools may include shell execution, file system access, browser control, computer use, MCP integrations, or reusable “skills.”

Different agent architectures impose different planning strategies, retry policies, and verification logic. Some agents prioritize precision and deeper reasoning at the cost of higher token usage and latency. Others prioritize speed and lower cost with reduced behavioral robustness.

Model intelligence vs agent architecture

Performance differences between agentic CLI tools do not come from a single source. They emerge from two layers: the foundation model and the orchestration framework that wraps around it.

The foundation model determines how well the system understands requirements, plans multi-step tasks, and generates correct code. If the model misinterprets a constraint or produces incorrect logic, no amount of orchestration can fully compensate for that mistake.

The agent architecture, however, determines how that model is used. It decides how context is gathered from the workspace, when shell commands are executed, how outputs are validated, and whether the system retries after failure. These decisions shape runtime behavior, cost, and reliability.

Two agents powered by equally capable models can behave differently. One may aggressively retry after partial failure, consuming more tokens but recovering from early mistakes. Another may terminate quickly after the first inconsistency. One may enforce strict validation before moving forward, while another may continue with unverified assumptions.

This benchmark evaluates the complete system. It does not isolate raw model intelligence from orchestration logic. When an agent consumes excessive tokens or fails a backend contract, the cause may lie in planning quality, retry policy, context management, or validation strictness.

Understanding this distinction is essential. High token usage does not necessarily indicate deeper reasoning, and a lower score does not automatically imply weaker underlying model capability. In autonomous environments, architecture and model reasoning interact continuously.

Agent insights

We evaluated agents across 10 tasks. Below, we present a detailed breakdown of Task 6 to illustrate how different agent architectures behave under the same constraints.

Task 6: Helpdesk ticket system (Web)

Task 6 required building a full-stack helpdesk ticket system with:

Two user roles (customer and agent)
JWT-based authentication
Strict status workflow transitions
Data isolation (404 instead of 403 for cross-user access)
FastAPI backend
React/Vue/Svelte + Vite frontend
Deterministic run commands

The smoke test validated:

Health check
Dual-role authentication
Ticket CRUD operations
Assignment and replies
Status transitions
Role enforcement
Data isolation
UI login and post-login behavior

This task stresses state management, auth correctness, REST contract discipline, and frontend-backend integration.
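The data isolation requirement is easy to get subtly wrong, so a small sketch may help. It models the decision a GET /tickets/{id} handler would make, using plain Python rather than FastAPI; the in-memory store, the function name, and the assumption that agents may view every ticket are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    id: int
    owner_id: int
    title: str


# Hypothetical in-memory store standing in for a database.
TICKETS = {
    1: Ticket(1, owner_id=10, title="Printer offline"),
    2: Ticket(2, owner_id=20, title="VPN access"),
}


def get_ticket_status_code(ticket_id: int, user_id: int, role: str) -> int:
    """Return the HTTP status a GET /tickets/{id} handler would send."""
    ticket = TICKETS.get(ticket_id)
    if ticket is None:
        return 404
    # Customers see only their own tickets. Answering 404 rather than 403
    # avoids confirming that another user's ticket exists.
    if role == "customer" and ticket.owner_id != user_id:
        return 404
    return 200


print(get_ticket_status_code(2, user_id=10, role="customer"))
```

Returning the same status for "missing" and "not yours" means a customer cannot probe ticket IDs to learn which ones exist.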

Codex

Installation

Install globally with:

npm install -g @openai/codex

Alternatively, install with Homebrew (macOS/Linux):

brew install --cask codex

Authentication

After setting up Codex, you can sign in with your ChatGPT account or with your OpenAI API key. No other provider options are available.

Task Report

Codex built a functionally correct system but diverged from the specified REST contract. A method choice reduced strict compliance despite correct business logic.

Backend Behavior

Authentication, ticket CRUD, replies, and status transitions functioned correctly. Role enforcement and data isolation were implemented properly.

The primary issue was HTTP method mismatch. Codex implemented /tickets/{id}/assign and /tickets/{id}/status as PATCH endpoints, while the smoke test required PUT.

Adaptive mode recovered some functionality by attempting alternate methods. Strict mode failed all steps tied to those endpoints.
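The difference between the two modes can be illustrated with a small sketch. The route table, the stand-in for an HTTP call, and the list of alternate methods are all hypothetical; the benchmark's smoke adapter is not shown in this article.

```python
# Hypothetical route table for the API under test: (method, path) -> status.
# It mirrors the case above: the endpoint exists, but as PATCH rather than PUT.
ROUTES = {("PATCH", "/tickets/1/assign"): 200}


def send(method: str, path: str) -> int:
    """Stand-in for an HTTP call; unknown method/path pairs return 405."""
    return ROUTES.get((method, path), 405)


def run_step(path: str, expected_method: str, adaptive: bool) -> tuple[bool, str]:
    """Try the contract method first; in adaptive mode, try alternates too."""
    methods = [expected_method]
    if adaptive:
        methods += [m for m in ("PUT", "PATCH", "POST") if m != expected_method]
    for method in methods:
        if send(method, path) == 200:
            return True, method
    return False, expected_method


print(run_step("/tickets/1/assign", "PUT", adaptive=False))
print(run_step("/tickets/1/assign", "PUT", adaptive=True))
```

Strict mode measures contract compliance, while adaptive mode measures whether the underlying behavior works once the method mismatch is tolerated.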

UI Behavior

The frontend passed all UI validation steps. Login flow and post-login state behaved correctly.

Kiro CLI

Installation

For macOS/Linux/WSL:

curl -fsSL https://cli.kiro.dev/install | bash

Alternative Linux AppImage (portable option):

Download: https://desktop-release.q.us-east-1.amazonaws.com/latest/kiro-cli.appimage

Then run:

chmod +x kiro-cli.appimage && ./kiro-cli.appimage

Authentication

You can continue with your Kiro-Code plan. No provider options available.

Task Report

Kiro produced the fastest and most compact implementation. Status transitions, role enforcement, and data isolation were correctly implemented at the logic level.

However, the same unified-update endpoint design pattern seen in Aider caused six contract failures. A frontend lifecycle issue further reduced the UI score. The system is structurally sound but diverges from the specified API design.

Backend Behavior

Kiro generated a compact full-stack implementation in approximately 97 seconds. The backend consisted of a 324-line main.py file, and the frontend was a 276-line single-file React application. Only 9 files were produced in total. Seed data included 4 sample tickets across different statuses.

Authentication, ticket CRUD, replies, detail view, and data isolation worked correctly. Nine of 16 API steps passed.

The six failing steps correspond to /tickets/{id}/assign and /tickets/{id}/status. Kiro implemented a unified PATCH /tickets/{id} endpoint that updates status, priority, and assignment via JSON body fields. The business logic is correct, but the endpoint structure does not match the expected contract, resulting in 404 responses.
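The 404 responses follow directly from how path routing works. The sketch below is a simplified route matcher, not FastAPI's actual router; the regular expressions and the PUT method on the dedicated routes are assumptions for illustration.

```python
import re

# Route table resembling the unified-update design described above.
UNIFIED_ROUTES = [("PATCH", re.compile(r"^/tickets/(\d+)$"))]

# Route table resembling the contract the smoke test expected.
# The HTTP method used here is an assumption.
DEDICATED_ROUTES = [
    ("PUT", re.compile(r"^/tickets/(\d+)/assign$")),
    ("PUT", re.compile(r"^/tickets/(\d+)/status$")),
]


def resolve(routes, method: str, path: str) -> int:
    """Return 200 on a match, 405 if only the path matches, else 404."""
    path_matched = False
    for route_method, pattern in routes:
        if pattern.match(path):
            if route_method == method:
                return 200
            path_matched = True
    return 405 if path_matched else 404


print(resolve(UNIFIED_ROUTES, "PUT", "/tickets/7/assign"))
print(resolve(DEDICATED_ROUTES, "PUT", "/tickets/7/assign"))
```

No path in the unified table matches /tickets/7/assign, so the request never reaches the update logic, however correct that logic is.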

UI Behavior

Backend preflight passed and the frontend started successfully. Vite launched without runtime crashes.

However, the login form did not render. Playwright timed out after 7 seconds waiting for the email input field. Console diagnostics showed a 422 error during initial page load, likely caused by an /auth/me call executed on mount without a valid token. This prevented the login component from rendering and blocked the remaining UI steps.

Claude Code

Installation

For macOS/Linux/WSL, you can install Claude Code with either the install script or Homebrew, depending on your preferred package manager:

curl -fsSL https://claude.ai/install.sh | bash
brew install --cask claude-code

Authentication

After setting up Claude Code, you can continue with your Claude Account. No provider options available.

Task Report

Claude Code produced one of the most structured codebases in this task. However, a fundamental JWT validation issue rendered the backend unusable.

This highlights a key distinction in agent evaluation: structural completeness does not compensate for authentication correctness.

It also consumed the highest token volume among agents evaluated in Task 6.

Backend Behavior

Login endpoints returned 200 and issued JWT tokens successfully. However, all subsequent authenticated requests returned 401 “Could not validate credentials.”

The root cause appears to be a mismatch between OAuth2PasswordBearer(tokenUrl="auth/login") and the /auth route prefix. The smoke adapter correctly discovered the login endpoint, but the issued tokens were not accepted by the middleware.

As a result, 13 of 16 backend steps failed.

Additionally, Claude Code implemented a single PATCH /tickets/{id} endpoint for updates instead of dedicated /assign and /status endpoints. However, this design choice became irrelevant due to the auth failure.

UI Behavior

The login form rendered correctly. The form submission returned 200. However, after login, Playwright detected a navigation crash:
“Execution context was destroyed.”

Browser logs showed 401 responses on authenticated API calls, which caused the post-login state to break.

Aider

Installation

If you already have Python 3.8-3.13 installed, first install aider:

python -m pip install aider-install
aider-install

Authentication

export OPENROUTER_API_KEY="sk-or-v1-…"

Task Report

Aider was among the faster and more token-efficient builders. However, its API design diverged from the spec, and the login UI failed to render properly.

Backend Behavior

Authentication, ticket CRUD, replies, detail view, and data isolation were implemented correctly.

Instead of dedicated /assign and /status endpoints, Aider used a unified PUT /tickets/{id} endpoint for all updates. The smoke test expected separate endpoints, causing 404 failures for assignment and status steps.

UI Behavior

The frontend rendered content, but the login form did not appear. Playwright timed out waiting for the email input field. Subsequent UI steps were blocked.

OpenCode

Installation

For macOS/Linux/WSL:

curl -fsSL https://opencode.ai/install | bash

Install globally with:

npm i -g opencode-ai

For macOS/Linux, considering your preferred package manager:

bun add -g opencode-ai
brew install anomalyco/tap/opencode
paru -S opencode

Authentication

Many provider options are available. Select your desired provider and authenticate with /connect.

Task Report

OpenCode produced the most spec-compliant implementation with a single edge-case deviation. It also consumed the lowest token volume among all agents in this task.

Backend Behavior

Authentication, CRUD operations, replies, assignment, status transitions, role enforcement, and data isolation were implemented correctly.

Both /tickets/{id}/assign and /tickets/{id}/status endpoints were implemented as expected.

The only failing step occurred when the agent attempted to set the status to in_progress after assignment. Since the assignment operation already transitioned the ticket to in_progress, the second transition returned 400 due to strict no-op enforcement.

The backend behavior was logically correct, but the smoke test expected idempotent success for repeated transitions.
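The disagreement comes down to how a repeated transition is treated. The sketch below contrasts the two policies; the transition table is hypothetical, since the article does not list the allowed status workflow.

```python
# Hypothetical workflow; the article does not list the allowed transitions.
ALLOWED = {
    "open": {"in_progress"},
    "in_progress": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}


def transition(current: str, target: str, idempotent: bool) -> int:
    """Return the HTTP status for a status-change request."""
    if target == current:
        # Strict no-op enforcement rejects a repeat; an idempotent API accepts it.
        return 200 if idempotent else 400
    return 200 if target in ALLOWED.get(current, set()) else 400


# Assignment has already moved the ticket to in_progress.
print(transition("in_progress", "in_progress", idempotent=False))
print(transition("in_progress", "in_progress", idempotent=True))
```

Both policies reject genuinely invalid jumps; they differ only on the repeat, which is why this counts as an edge-case deviation rather than a logic error.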

UI Behavior

The frontend passed all 8 validation steps. Login rendered correctly, authentication persisted, and post-login behavior worked as expected.

Forge

Installation

For macOS/Linux/WSL:

curl -fsSL https://forgecode.dev/cli | sh

Authentication

Configure your provider credentials interactively by:

forge provider login

And choose your provider.

Task Report

A single routing misconfiguration triggered cascading backend failures. The relatively low output token count suggests limited implementation depth.

Backend Behavior

Ticket creation returned 307 redirects instead of 200/201. Because ticket creation failed, subsequent steps referencing $created_ticket.id failed with 422 errors.

The 307 responses likely stem from trailing slash redirection behavior in FastAPI.

/assign and /status endpoints returned 404.

UI Behavior

The frontend served content, but login components failed to render properly due to runtime errors in AuthContext.tsx. Subsequent UI steps were blocked.

Gemini CLI

Installation

Run instantly with:

npx @google/gemini-cli

Install globally with:

npm install -g @google/gemini-cli

Install globally with Homebrew (macOS/Linux):

brew install gemini-cli

Install globally with MacPorts (macOS):

sudo port install gemini-cli

Install with Anaconda (for restricted environments):

conda create -y -n gemini_env -c conda-forge nodejs
conda activate gemini_env
npm install -g @google/gemini-cli

Authentication

Option 1: Login with Google (OAuth login using your Google Account):

In your shell, set your Google Cloud project:

export GOOGLE_CLOUD_PROJECT="YOUR_PROJECT_ID"

Then, start Gemini.

Option 2: Gemini API Key

In your shell, set your API key:

export GEMINI_API_KEY="YOUR_API_KEY"

Then, start Gemini.

Option 3: Vertex AI

In your shell, set your API key and enable Vertex AI:

export GOOGLE_API_KEY="YOUR_API_KEY"
export GOOGLE_GENAI_USE_VERTEXAI=true

Task Report

Gemini CLI produced a strong backend but failed due to frontend toolchain incompatibility. It also consumed the highest token volume among successful backend implementations.

Backend Behavior

Authentication, CRUD, replies, assignment, role enforcement, and data isolation were implemented correctly.

However, the /tickets/{id}/status endpoint was missing entirely, causing all status transition steps to return 404.

UI Behavior

The frontend failed to start. Vite 7.3.1 was installed, which requires Node.js 20.19+, while the test environment runs Node.js 18.18.0. The crypto.hash API required by Vite was unavailable.

As a result, the UI never launched and scored 0/8.
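A version comparison of the kind that would have flagged this mismatch before the build is sketched below. It is an illustration only: the helper names are assumptions, and only the 20.19 minimum quoted above is checked.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a version string such as 'v18.18.0' into a comparable tuple."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


# Versions quoted in the text: Node.js 18.18.0 against a 20.19 minimum.
print(meets_minimum("18.18.0", "20.19"))
```

Comparing tuples of integers avoids the string-comparison trap in which "9.0" would sort above "20.19".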

Cline

Installation

Install globally with:

npm install -g cline

Authentication

Run `cline auth` to select your Cline account or continue with your desired provider.

Task Report

Cline’s error-limit mechanism terminated the build before completion. The backend structure shows correct architectural intent, but route registration issues and incomplete implementation prevented functional validation.

The absence of a frontend and cascading backend failures place this result among the weakest in Task 6.

Backend Behavior

Cline generated a backend with five files: main.py, models.py, schemas.py, auth.py, and database.py, along with a requirements.txt. The structure included proper models, JWT authentication scaffolding, and endpoint stubs.

However, the agent reached its eight-error limit during backend development and terminated before completing the system.

Only login endpoints functioned correctly. Three of 16 API steps passed.

Ticket creation returned 307 redirects instead of 200 or 201, likely due to trailing-slash route mismatches. Because ticket creation failed, $created_ticket.id was never captured. All subsequent steps referencing the ticket ID sent the unresolved placeholder as a literal string, leading to 422 errors.

The /tickets/{id}/assign and /tickets/{id}/status endpoints were not implemented, resulting in 404 responses.

This produced a cascading failure pattern similar to Forge, where an early routing issue invalidated downstream steps.
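The cascade can be reproduced with a small sketch of placeholder substitution. The placeholder syntax follows the $created_ticket.id form quoted above, but the renderer and the validation helper are hypothetical stand-ins for the smoke runner and for an integer path parameter.

```python
import re

# Matches placeholders of the form $name.field, e.g. $created_ticket.id.
PLACEHOLDER = re.compile(r"\$([a-z_]+)\.([a-z_]+)")


def render(path: str, captured: dict[str, dict]) -> str:
    """Substitute $name.field placeholders; leave unknown ones untouched."""
    def replace(match: re.Match) -> str:
        name, field = match.group(1), match.group(2)
        if name in captured and field in captured[name]:
            return str(captured[name][field])
        return match.group(0)
    return PLACEHOLDER.sub(replace, path)


def ticket_id_status(segment: str) -> int:
    """An integer path parameter rejects non-numeric input with 422."""
    return 200 if segment.isdigit() else 422


template = "/tickets/$created_ticket.id/replies"

# Ticket creation failed, so nothing was captured.
failed = render(template, {})
# Ticket creation succeeded and the ID was captured.
succeeded = render(template, {"created_ticket": {"id": 42}})

print(failed, ticket_id_status(failed.split("/")[2]))
print(succeeded, ticket_id_status(succeeded.split("/")[2]))
```

One failed creation step therefore turns every later ticket-scoped request into a validation error, regardless of whether those endpoints work.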

UI Behavior

The backend started successfully. However, the frontend/ directory was empty and no package.json file existed.

Only the backend preflight step passed. All remaining UI steps were blocked.

Goose

Installation

For macOS/Linux/WSL:

curl -fsSL https://github.com/block/goose/releases/download/stable/download_cli.sh | bash

Task Report

Model: Gemini 3 Pro Preview (via OpenRouter)
Time: 1,297s
Tokens: 17k input / 752 output
API Score: 60%
UI Score: 0%

Goose demonstrated limited self-correction but failed to complete the full-stack requirement. Reliability issues during re-runs raise stability concerns.

Backend Behavior

Authentication, ticket CRUD, replies, detail view, and data isolation worked.

However, /assign and /status endpoints were not implemented, causing 404 responses for all related steps.

In an earlier build, Goose encountered bcrypt compatibility errors, self-corrected by pinning the dependency version, and eventually launched the backend.

A later re-run crashed with a stream decode error after minimal file generation.

UI Behavior

No frontend was created. The frontend directory was empty and no package.json existed. The UI test failed immediately.

AI coding tools

AI coding tools can be grouped into three categories:

Agentic CLI: Tools for terminal-based development workflows that generate, edit, and refactor code through prompts and command-line interactions.
- Examples: Aider, OpenCode, Claude Code, Codex

AI code editors: Also known as agentic IDEs, these tools provide a GUI similar to VS Code (most of them are built on VS Code).
- Examples: Antigravity, Cursor, Kiro Code, Windsurf

Prompt-to-app builders: Low-code/no-code platforms to build apps using natural language prompts and visual workflows.
- Examples: Bolt, Lovable, v0.dev, Firebase Studio, Dazl

AI code review tools

As AI-generated code becomes more common, code review tools are essential for catching bugs and vulnerabilities. We evaluated the top tools on 309 PRs in our RevEval benchmark.

What can agentic CLI tools do?

Across tools like Claude Code, Gemini CLI, and OpenHands, common capabilities include:

End-to-end code work: Create and modify files, fix bugs, refactor code, and run tests or linters directly from the terminal.
Agentic workflows: Perform multi-step tasks such as task chaining, troubleshooting, search, and iterative debugging.
Git & project management: Review history, resolve merges, manage branches, and create commits or pull requests.
Command execution & automation: Run shell commands, automate analyses, and translate natural language into complex CLI operations.
Deep context handling: Operate on full repositories with awareness of dependencies and project structure.
Model flexibility: Support multiple cloud and, in some cases, local models; some tools allow using your own API key or choosing between plans.
Sandboxed or controlled access: Offer modes ranging from read-only to full automation, often with isolated environments for safety.

Methodology

We evaluated agents under a one-shot execution setup to measure their autonomous capabilities without human intervention. Agents were then evaluated using our backend and frontend smoke tests to measure infrastructure readiness and behavioral correctness.

Scores reflect how reliably each agent produced runnable systems and how many functional requirements passed validation.

Model Configuration

We aimed to use Google’s gemini-3-pro-preview due to its large context window, which is suitable for multi-file orchestration and long task prompts. However, some agentic CLIs are tightly coupled to specific providers:

Claude Code was evaluated using claude-opus-4-5-20251101 via Anthropic’s official API.
Codex was evaluated using gpt-5.2-codex-medium through OpenAI’s native configuration.

For these agents, alternative model providers are not supported within their current CLI architecture. Each agent was evaluated using its default configuration. We did not tune temperature, retry policies, or reasoning parameters.

Our evaluation goal was to separate and measure:

Build ability (can the agent produce runnable code?)
Backend behavior correctness
Frontend behavior correctness
Autonomous orchestration reliability

CLI Versions (mid-February 2026)

OpenCode: v1.2.10
Cline: v3.41
Aider: v0.86.0
Gemini CLI: v0.29.0
Forge: v1.28.0
Codex: v0.104.0
Goose: v1.25.0
Claude Code: v2.1.62
Kiro CLI: v1.26.0

For evaluation methodology, visit: AI coding benchmark methodology

For those exploring the broader ecosystem of agentic developer tools, here are our latest benchmarks:

MCP benchmark: A comparison of the top MCP servers for web access.
Remote browsers: How emerging browser infrastructure enables AI agents to interact with the web securely.

Principal Analyst

Cem Dilmegani

Technically reviewed by

Berk Kalelioğlu

AI Researcher
