We benchmarked the top Large Language Models (LLMs) across 10 software development tasks using an agentic CLI tool. We executed ~3,500 automated validation steps per model across both API and UI layers.
A-CODE-LLM Bench results
Each alias ran 3 times across 10 tasks (30 samples per alias, 270 cells per iteration). See the methodology section below for details.
Claude Sonnet 4.6 led the benchmark with an overall score of 0.748, with the base and thinking variants tied to three decimal places. The two Claude Opus 4.6 variants followed at 0.706 and 0.729. Claude Opus 4.8 entered at 0.702, fifth overall, so Anthropic held all five top spots. The highest non-Anthropic result was Gemini 3.5 Flash thinking at 0.625, which ranked sixth and edged out Claude Opus 4.7 (0.61). Kimi K2.7 Code scored 0.611, tying Opus 4.7 and ranking as the second-highest non-Anthropic model behind Gemini. Claude Fable 5 posted 0.688 on the Claude Code harness, reported separately as a disclosed exception rather than ranked in the cohort.
Key takeaways
- Grok 4.3 was the fastest model in the benchmark with an average completion time of 142 seconds.
- Anthropic models are on top of the leaderboard. Sonnet 4.6 leads; Opus 4.6 and Opus 4.8 follow.
- Opus 4.8 recovered from the 4.7 regression.
- Gemini 3.5 Flash was the first non-Anthropic model above 0.60, beating all providers except Anthropic.
- GPT models offer competitive backends but lack frontend stability.
- Reasoning variants moved the score for only two models.
- MiniMax M3 was the cheapest capable model at $0.33 per cell.
- Kimi K2.7 Code tied Opus 4.7 at about half Sonnet 4.6’s cost per cell.
Cost & success comparison
Claude Sonnet 4.6 achieved the highest overall score (0.748) at $1.26-$1.33 per cell. Every Opus variant scored at or below Sonnet for 1.8 to 2.4 times the cost. Among the Opencode cohort, Opus 4.7 was the most expensive at $3.08 per cell, followed by Opus 4.8 at $2.92. Claude Fable 5 ran higher at $4.73 on its separate harness. Kimi K2.7 Code reached 0.611 at $0.70 per cell, about half Sonnet 4.6’s cost and the cheapest model in the 0.60-plus band. Gemini 3.5 Flash thinking matched Sonnet’s cost ($1.30 per cell) for a 0.625 score, undercutting every Opus model it outranked. MiniMax M3 was the cheapest model above 0.50 at $0.33 per cell for 0.583. GPT 5.4 Mini and GPT 5.3 Codex reached 0.57-0.59 overall at $0.41-$0.45 per cell, roughly a third of Sonnet 4.6’s cost.
Task completion time & success comparison
Claude Sonnet 4.6 achieved the highest overall score, with an average completion time of 612 seconds, faster than every Opus variant. Opus 4.7 took 1,562 seconds, more than twice as long, and scored 0.14 below Sonnet 4.6. Opus 4.8 cut that to 1,072 seconds at a 0.702 score, between Opus 4.6 (882 seconds) and Opus 4.7.
Gemini 3.5 Flash thinking was the fastest model above 0.60, completing tasks in 390 seconds at 0.625. GPT 5.5 thinking ran fastest among all models above 0.50, at 276 seconds with a 0.597 score. MiniMax M3 was the slow outlier among mid-pack scorers: 1,684 seconds for 0.583, driven by over-iteration rather than task difficulty. Both Qwen 3.6 Plus variants, the GLM 5.1 base, and the Deepseek V4 Pro base each took over 1,700 seconds and scored below 0.45.
Tool calls per task
Tool-call counts ranged from 18 (Grok 4.3) to 108 (MiniMax M3) under the same harness. OpenAI clustered at 24 to 36, Gemini at 37 to 45, Anthropic at 48 to 70, Qwen at 81 to 82, Kimi K2.7 Code at 84, and MiniMax M3 at the top with 108. Higher counts did not bring higher scores. MiniMax M3 made the most calls and scored mid-pack at 0.583. Grok used the fewest and scored 0.431. Tool volume tracked iteration style rather than capability.
OpenAI’s low call counts come from the `apply_patch` tool, which bundles a full-file diff into a single call. Other families wrote and re-edited files across multiple calls.
Within Anthropic, Sonnet 4.6 made the fewest calls (51 base, 48 thinking) and reached the top overall score. The Opus variants used more (56 to 70), and Opus 4.8 made the most at 70.
LLM performance on a single successful task
No model passed every step of the full benchmark above. To compare cost and speed on equal terms, we ran a simple baseline task that every model can complete: four CRUD endpoints, basic validation, no authentication, and no database.
Cost & lines of code comparison
Costs ran from $0.0066 (MiniMax M2.7) to $0.56 (Gemini 3.5 Flash), an 85x spread on a task whose reference solution is about 50 lines. Gemini 3.5 Flash base was the most expensive in the baseline, above every Anthropic variant, because it wrote 131 lines of code, two to three times the rest of the field.
Lines of code otherwise converged tightly, from 40 (Deepseek V4 Pro) to 64 (MiniMax M3). Code length stopped differentiating models on a trivial task, with Gemini 3.5 Flash base the lone exception. Claude Sonnet 4.6 completed the baseline at $0.20 base and $0.15 thinking, about 15% of its full-benchmark cost per cell. Opus 4.8 ran the baseline at $0.16, level with Sonnet and well below the other Opus variants ($0.33 to $0.38), because it finished in 34 seconds with 6 tool calls, rather than overworking the task. MiniMax M3 ran it for $0.025. For more details on per-token rates, see the LLM Pricing article.
Completion time & token usage
Completion time spanned 34 seconds (Claude Opus 4.8) to 475 seconds (MiniMax M3). Output tokens spanned 787 (Qwen 3.6 Plus thinking) to 6,643 (GPT 5.4 Mini).
Opus 4.8 was the fastest model on the baseline, at 34 seconds and 1,830 output tokens, in sharp contrast to its 1,072-second full-benchmark average. On an easy task, it stopped early; on hard tasks, it iterated. MiniMax M3 was the slowest at 475 seconds, and its slowness was model-level: it took 475 seconds on the trivial task and 1,684 seconds on the full benchmark, so over-iteration, not task difficulty, set its pace. Qwen 3.6 Plus thinking, by contrast, took 67 seconds on the baseline and 1,788 seconds on the full benchmark, a 27x ratio, and GLM 5.1 thinking dropped from 1,615 to 75 seconds, a 22x ratio. Their full-benchmark latency reflected task difficulty rather than a model-level pace constraint.
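The ratios quoted above follow directly from the reported timings. The short sketch below recomputes them; the dictionary layout and variable names are illustrative, and only the second counts come from the text.

```python
# (baseline seconds, full-benchmark seconds) as reported in the paragraphs above
timings = {
    "Claude Opus 4.8": (34, 1072),
    "MiniMax M3": (475, 1684),
    "Qwen 3.6 Plus thinking": (67, 1788),
    "GLM 5.1 thinking": (75, 1615),
}

# Full-benchmark time divided by baseline time for each model
ratios = {name: full / base for name, (base, full) in timings.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.1f}x")
```

A small ratio (MiniMax M3) means the model is slow even on a trivial task, while a large ratio means most of the full-benchmark time is spent on harder work.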
Gemini 3.5 Flash thinking ran the baseline in 97 seconds with 4,023 output tokens, faster and leaner than its base variant (146 seconds, 4,665 tokens). GPT 5.4 Mini generated the most output, with 6,643 tokens, for the same CRUD task. Claude Sonnet 4.6 thinking ran the baseline in 126 seconds at $0.15 with 1,800 output tokens.
What are agentic LLM systems?
Building software is iterative: write code, run it, read errors, fix them, repeat. Agentic AI systems enable LLMs to follow this same cycle. The model operates inside a development environment where it can write files, execute commands, read outputs, and make changes based on what it sees, continuing until the task is complete.
This matters because real applications aren’t single files. They have backends with routes and database models, frontends with components and API calls, configuration files, dependencies, and tests. Making these work together requires iterative testing and refinement, which is exactly what agentic architecture enables.
How it works
The model sits inside a harness with access to a shell, file system, and execution output. When asked to build an application, it writes files incrementally. After each step, the harness shows the model what happened: did the server start, did tests pass, did the linter flag errors? Based on that feedback, the model decides what to write or fix next.
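The write, run, observe cycle described above can be sketched in a few lines. This is not how Opencode or any specific harness is implemented: `run_agent_loop`, the `Action` tuple, and `scripted_policy` (a hard-coded stand-in for the LLM) are all hypothetical names used only for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Hypothetical action type: (relative file path, file contents, command to run)
Action = tuple[str, str, list[str]]


def run_agent_loop(
    next_action: Callable[[str], Optional[Action]],
    workdir: Path,
    max_steps: int = 10,
) -> list[str]:
    """Write a file, run a command, and feed the output back until the policy stops."""
    feedback = ""  # what the "model" sees from the previous step
    transcript = []
    for _ in range(max_steps):
        action = next_action(feedback)
        if action is None:  # the policy decides the task is complete
            break
        rel_path, contents, command = action
        (workdir / rel_path).write_text(contents)
        result = subprocess.run(
            command, cwd=workdir, capture_output=True, text=True, timeout=60
        )
        feedback = f"exit={result.returncode}\n{result.stdout}{result.stderr}"
        transcript.append(feedback)
    return transcript


def scripted_policy(feedback: str) -> Optional[Action]:
    """Stand-in for an LLM: writes a buggy script, then fixes it after seeing the error."""
    if feedback == "":
        return ("app.py", "print(1 / 0)\n", [sys.executable, "app.py"])
    if "ZeroDivisionError" in feedback:
        return ("app.py", "print('ok')\n", [sys.executable, "app.py"])
    return None  # the last run succeeded, so stop


with tempfile.TemporaryDirectory() as tmp:
    steps = run_agent_loop(scripted_policy, Path(tmp))
```

The first step fails with a traceback, the second succeeds, and the policy then stops. A real agent replaces `scripted_policy` with a model call, but the feedback loop has the same shape.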
This differs fundamentally from single-shot generation. In one-shot setups, the model generates an entire codebase blind, with no way to verify if it works. In agentic LLM systems, the model sees the consequences of each action and course-corrects. However, this capability alone isn’t sufficient. The model still needs strong reasoning to implement business logic correctly, which is where performance differences really emerge.
Agentic LLM benchmark methodology
We used Opencode as the agent harness for all models and connected them through OpenRouter, with one exception: Claude Fable 5 ran on the Claude Code CLI on the Claude subscription. Each cell was run 3 times to measure per-cell variance and stabilize the leaderboard. We evaluated their ability to work autonomously on 10 software development tasks (T-1 to T-10), ranging from reservation systems to interactive dashboards. These tasks require agents to manage multi-file projects and deliver functional products.
Execution and orchestration
Every agent and task begins in a clean environment. The instructions are provided as a TASK.md file, and we use a 20-minute heartbeat watchdog for the launch scripts. During this phase, we record exit codes, execution time, and whether the backend and frontend files were created. We also track real-time token usage across input, output, and cached categories.
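A heartbeat watchdog of this kind can be sketched as follows. The sketch assumes the 20-minute limit measures time since the last sign of activity rather than total runtime, and the class and method names are hypothetical; the article does not describe the actual launch scripts.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Flags a run as stalled when no heartbeat arrives within `timeout_s` seconds."""

    def __init__(
        self,
        timeout_s: float = 20 * 60,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_beat = clock()

    def beat(self) -> None:
        """Record activity, for example a new line of agent output."""
        self._last_beat = self._clock()

    def expired(self) -> bool:
        """True when the silence since the last heartbeat exceeds the timeout."""
        return self._clock() - self._last_beat > self.timeout_s


# Usage with a fake clock (seconds), so no real time passes
fake_now = [0.0]
dog = HeartbeatWatchdog(clock=lambda: fake_now[0])

fake_now[0] = 19 * 60
alive_at_19_min = not dog.expired()

dog.beat()
fake_now[0] = 38 * 60  # 19 minutes after the last heartbeat
alive_after_beat = not dog.expired()

fake_now[0] = 60 * 60  # 41 minutes of silence
stalled = dog.expired()
```

Under this assumption a run can last longer than 20 minutes in total, as long as it keeps producing activity.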
Backend validation: We deploy the generated projects in isolated environments to test them against a canonical YAML contract. The validation covers happy path scenarios, error handling (400/403/409), and data consistency.
We test the results in two modes:
- Adaptive mode validates functionality even with differing route names.
- Strict mode requires exact adherence to the contract.
The backend overall score is calculated per cell as:
backend_overall = has_backend × (0.7 × adaptive_pass_rate + 0.3 × strict_pass_rate)
where has_backend is 1 if the cell produced a backend project, 0 otherwise. Adaptive is weighted higher because it measures behavioral correctness; strict adds a penalty for contract drift (renamed routes, substituted status codes, restructured response fields).
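The formula above translates directly into a small function. The function and parameter names mirror the formula; the range check and the example pass rates are illustrative assumptions, not benchmark data.

```python
def backend_overall(
    has_backend: bool, adaptive_pass_rate: float, strict_pass_rate: float
) -> float:
    """backend_overall = has_backend x (0.7 x adaptive + 0.3 x strict)."""
    for rate in (adaptive_pass_rate, strict_pass_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("pass rates must be fractions between 0 and 1")
    return int(has_backend) * (0.7 * adaptive_pass_rate + 0.3 * strict_pass_rate)


# A backend that behaves correctly but drifts from the contract (illustrative rates)
drifted = backend_overall(True, adaptive_pass_rate=0.9, strict_pass_rate=0.5)

# The same rates score zero if no backend project was produced
missing = backend_overall(False, adaptive_pass_rate=0.9, strict_pass_rate=0.5)
```

In the first case the strict term costs the cell 0.12 points relative to its adaptive pass rate, which is the contract-drift penalty described above.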
UI and user scenario testing
We use browser automation to simulate real user flows, including preflights, rendering, and authentication. We verify functional steps such as login submission and post-login behavior to ensure the application runs without crashing.
UI scoring splits eight steps into two groups. Infrastructure steps (backend preflight, frontend render, login form visible, login submit, login 2xx, no runtime crash) measure whether the app runs at all. Behavior steps (post-login auth signal, post-login behavior signal) assess whether the app performs its intended function once running.
ui_score = (behavior_passed / (behavior_passed + behavior_failed)) × (infra_passed / infra_total)
Blocked behavior steps are excluded from the behavior denominator, so a cell is not double-penalized when the app fails to load.
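As a sketch, the UI score can be computed as below. The default of six infrastructure steps comes from the list above. Returning 0.0 when no behavior step could be attempted is an assumption, since the formula is undefined in that case and the article does not say how it is handled.

```python
def ui_score(
    behavior_passed: int,
    behavior_failed: int,
    infra_passed: int,
    infra_total: int = 6,
) -> float:
    """ui_score = behavior pass rate (blocked steps excluded) x infrastructure pass rate."""
    attempted = behavior_passed + behavior_failed  # blocked steps are not counted
    if attempted == 0 or infra_total == 0:
        return 0.0  # assumption: nothing attempted scores zero
    return (behavior_passed / attempted) * (infra_passed / infra_total)


# Everything passes
perfect = ui_score(behavior_passed=2, behavior_failed=0, infra_passed=6)

# Half the infrastructure steps pass and one behavior step is blocked:
# the blocked step is excluded, so only the infrastructure factor lowers the score
half_infra = ui_score(behavior_passed=1, behavior_failed=0, infra_passed=3)

# The app runs, but one of the two behavior steps fails
half_behavior = ui_score(behavior_passed=1, behavior_failed=1, infra_passed=6)
```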
Tokens calculation
Token counts are extracted from the LLM API response. We subtract cached input tokens from total input tokens to get the effective input, which reflects only newly processed tokens. Output tokens are never cached, so they remain unchanged.
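The subtraction described above is simple enough to show directly. The field names here are hypothetical; real provider responses use their own keys, and the numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int          # total input tokens reported for the call
    cached_input_tokens: int   # portion of the input served from cache
    output_tokens: int         # output tokens are never cached

    @property
    def effective_input(self) -> int:
        """Newly processed input tokens: total input minus cached input."""
        return self.input_tokens - self.cached_input_tokens


# Illustrative numbers, not benchmark data
usage = TokenUsage(input_tokens=120_000, cached_input_tokens=90_000, output_tokens=1_800)
effective = usage.effective_input
```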
Final aggregation
The final benchmark score is calculated by combining the results from the previous phases:
Final Score = (0.7 × backend_overall) + (0.3 × ui_score)
We assign a higher weight to the backend because logic failures at the API level often invalidate any success in the frontend.
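A minimal sketch of this aggregation, with illustrative inputs rather than benchmark data:

```python
def final_score(backend_overall: float, ui_score: float) -> float:
    """Final Score = 0.7 x backend_overall + 0.3 x ui_score."""
    return 0.7 * backend_overall + 0.3 * ui_score


# A cell with a strong backend and a frontend that never loads
backend_only = final_score(backend_overall=0.8, ui_score=0.0)

# The reverse case: a working frontend on top of a failing backend
frontend_only = final_score(backend_overall=0.0, ui_score=0.8)
```

With the same component score, the backend-only cell ends up well ahead of the frontend-only cell, which reflects the 70/30 weighting.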
Task example
Task 6: Helpdesk ticket system
Task 6 focuses on developing a complex customer support ecosystem. The primary objective is to build a platform that mediates communication between customers and support agents while strictly enforcing business rules and security boundaries. This task evaluates an agent’s ability to handle multi-user state machines, data isolation, and threaded communication within a full-stack environment.
The task required building a helpdesk system featuring:
- Distinct permissions for Customers (issuing/replying) and Agents (management/resolution).
- A rigid status workflow that prevents illegal transitions and enforces role-specific actions.
- Advanced data isolation where unauthorized resource requests return 404 instead of 403, so the response does not reveal that the resource exists.
- A chronological reply system for seamless agent-customer interaction.
- A FastAPI backend combined with a responsive Vite-powered frontend (React/Vue/Svelte).
- Reproducible setup via specific shell commands for immediate system activation.
You can view the Task 6 documentation on GitHub.
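To make the workflow and isolation requirements concrete, here is a minimal sketch of the kind of check such a backend needs. The status names, the transition table, and the mapping of 409 and 403 to specific failures are all assumptions made for illustration; the actual rules are defined in the Task 6 documentation.

```python
# Hypothetical workflow: (current status, new status) -> roles allowed to make the change
ALLOWED_TRANSITIONS = {
    ("open", "in_progress"): {"agent"},
    ("in_progress", "resolved"): {"agent"},
    ("resolved", "closed"): {"agent", "customer"},
    ("resolved", "open"): {"customer"},  # customer reopens the ticket
}


def change_status(ticket: dict, user: dict, new_status: str) -> int:
    """Return the HTTP status code a handler might send for this request."""
    # Data isolation: a customer asking about someone else's ticket gets 404, not 403
    if user["role"] == "customer" and ticket["owner_id"] != user["id"]:
        return 404
    roles = ALLOWED_TRANSITIONS.get((ticket["status"], new_status))
    if roles is None:
        return 409  # assumed code for an illegal transition
    if user["role"] not in roles:
        return 403  # legal transition, wrong role
    ticket["status"] = new_status
    return 200


ticket = {"owner_id": 1, "status": "open"}
other_customer = change_status(ticket, {"id": 2, "role": "customer"}, "in_progress")
owner_attempt = change_status(ticket, {"id": 1, "role": "customer"}, "in_progress")
agent_attempt = change_status(ticket, {"id": 9, "role": "agent"}, "in_progress")
illegal_jump = change_status(ticket, {"id": 9, "role": "agent"}, "closed")
```

Checking ownership before anything else keeps the 404 response identical for tickets that do not exist and tickets that belong to someone else.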
Cite this benchmark
@misc{kaleliolu2026,
author = {Kalelioğlu, Berk and Dilmegani, Cem},
title = {{A-CODE-LLM Bench: Agentic Coding Benchmark}},
year = {2026},
month = jun,
howpublished = {\url{https://aimultiple.com/agentic-llm}},
note = {AIMultiple. Retrieved June 15, 2026}
}
Results and timestamps of 319 data points. Download the data used in this article as a ZIP file containing 2 CSV files and a README.