We benchmarked the top Large Language Models (LLMs) across 10 software development tasks by using an agentic CLI tool. We executed ~3,500 automated validation steps per model across both API and UI layers.
Agentic LLM benchmark results
Each alias ran 3 times across 10 tasks (30 samples per alias, 230 cells per iteration). See more details on methodology.
Claude Sonnet 4.6 led the benchmark with an overall score of 0.748, with the base and thinking variants tied to three decimal places. Claude Opus 4.6 followed at 0.71-0.73 across its two variants. Claude Opus 4.7 placed fifth at 0.61. No non-Anthropic model scored above 0.60.
Key takeaways
- Grok 4.3 was the cheapest and fastest model in the benchmark. Cost: $0.16 per cell at an average orchestration time of 142 seconds. It shipped 7/10 backends and failed the Kanban board and appointment-booking tasks across every iteration. Fits simple CRUD work with tight cost ceilings.
- Anthropic models are on top of the leaderboard. Sonnet 4.6 and Opus 4.6 (both base and thinking variants) took the top four spots. All four shipped working backends on 10/10 tasks. Their UI rates of 91-96% are the four highest in the benchmark. Sonnet 4.6 outscored Opus 4.6 by 0.04 overall at roughly half the cost ($1.33 vs $2.32 per cell).
- Claude Opus 4.7 regressed compared to Opus 4.6 across all dimensions we measured. The overall score fell by 0.10. Average orchestration time rose from 882 to 1,562 seconds. Cost per cell rose from $2.32 to $3.08. The regression replicated across all 3 iterations, ruling out single-run noise.
- GPT-5 models scored higher on the backend than on the frontend. GPT 5.5, GPT 5.4, GPT 5.4 Mini, and GPT 5.3 Codex ranged from 0.56 to 0.60 overall. Backend scores: 0.54-0.62. UI rates: 53-66%. The same model that shipped a working API often failed to render a usable frontend.
- Reasoning variants rarely moved the score. Seven base/thinking pairs ran in the benchmark. Only GLM 5.1 thinking improved its base overall by more than 0.10 (+0.13). Sonnet 4.6 thinking matched the base mean but raised the per-cell standard deviation by 53% (0.18 vs 0.12). For most providers, the base variant is the safer default.
- GPT 5.4 Mini and GPT 5.3 Codex were the best price-to-score trade. Both shipped working backends on 9/10 tasks at $0.41-0.45 per cell. Sonnet 4.6 costs $1.33 per cell; Opus 4.6 costs $2.32. The mid-tier OpenAI models reached roughly 80% of Sonnet 4.6’s overall performance at one-third the cost.
Cost & success comparison
Claude Sonnet 4.6 achieved the top overall score at $1.26-$1.33 per cell, while Opus variants scored lower at a higher cost. Opus 4.7 was the most expensive at $3.08 per cell. GPT 5.4 Mini and GPT 5.3 Codex reached 0.57-0.59 overall at $0.41-0.45 per cell, roughly a third of Sonnet 4.6’s cost.
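To make the trade-off concrete, the per-cell figures quoted in this article can be turned into a rough score-per-dollar ratio. The snippet below is only an illustration: the helper names `score_per_dollar` and `ranked` are ours, and each quoted range is collapsed to its less favourable end (lower score, higher cost).

```python
# Figures restated from this article: (overall score, cost per cell in USD).
# Ranges are collapsed to the less favourable end (lower score, higher cost).
RESULTS = {
    "Claude Sonnet 4.6": (0.748, 1.33),
    "Claude Opus 4.6": (0.71, 2.32),
    "Claude Opus 4.7": (0.61, 3.08),
    "GPT 5.4 Mini / GPT 5.3 Codex": (0.57, 0.45),
}


def score_per_dollar(score: float, cost: float) -> float:
    """Overall score obtained per dollar spent on one cell."""
    return score / cost


def ranked(results: dict) -> list:
    """Model names sorted from best to worst score-per-dollar."""
    return sorted(
        results, key=lambda name: score_per_dollar(*results[name]), reverse=True
    )


if __name__ == "__main__":
    for name in ranked(RESULTS):
        print(f"{name}: {score_per_dollar(*RESULTS[name]):.2f} score points per dollar")
```

Score per dollar ignores the absolute quality floor a project needs, so it complements the leaderboard rather than replacing it.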
Task completion time & success comparison
Claude Sonnet 4.6 achieved the top overall score, with an average completion time of 612 seconds, faster than Opus variants. Opus 4.7 took 1,562 seconds, more than twice as long, and scored 0.14 below Sonnet 4.6.
GPT 5.5 thinking ran fastest among models above 0.50 overall, completing tasks in 276 seconds with a 0.597 score. GPT 5.4, GPT 5.5, and GPT 5.3 Codex followed at 305-315 seconds with overall scores between 0.56 and 0.59. Both Qwen 3.6 Plus variants, the GLM 5.1 base, and the Deepseek V4 Pro base each took over 1,700 seconds and scored below 0.45.
Tool calls per task
Tool-call counts ranged from 18 to 82 under the same harness. OpenAI clustered at 24-36, Anthropic at 48-61, Qwen at 81-82. Higher counts did not bring higher scores. Qwen used the most calls and scored at the bottom, and Grok used the fewest and scored 0.43 overall.
OpenAI’s low call counts come from the `apply_patch` tool, which bundles a full-file diff into a single call. Other families wrote and re-edited files across multiple calls.
Within Anthropic, Sonnet 4.6 made the fewest calls (51 base, 48 thinking) and reached the top overall score. The Opus variants used more (56-61) and scored lower.
LLM performance on a single successful task
No model passed every step of the benchmark above. To compare cost and speed on equal terms, we ran a simple baseline task that every model can complete: four CRUD endpoints, basic validation, no authentication, and no database. Every model passed all steps.
Cost & lines of code comparison
Costs ranged from $0.0066 (Minimax M2.7) to $0.54 (GPT 5.5), an 82x spread on a single 50-line CRUD API. Claude Sonnet 4.6 completed the baseline at $0.20 base and $0.15 thinking, 15% of its full benchmark cost per cell. Opus variants ran $0.33-$0.38, down from $2.32-$3.08 in the full run. For more details, see the LLM Pricing article.
GPT 5.5 base was the most expensive in the baseline at $0.54, above all Anthropic variants. It generated 5,715 output tokens, about 1.7 times Opus 4.6’s 3,434. On a trivial task, GPT 5.5’s verbosity costs more than Opus’s per-token rate.
Lines of code converged tightly: 40 (Deepseek V4 Pro) to 57 (Claude Opus 4.7). Code length stops differentiating models when the task is trivial.
Completion time & token usage
Completion time spanned 67 seconds (Qwen 3.6 Plus thinking) to 244 seconds (GPT 5.4 Nano). Output tokens spanned 787 to 6,643. Qwen 3.6 Plus thinking was both the fastest and the most concise. GPT 5.4 Mini generated the most output at 6,643 tokens for the same CRUD task.
Qwen 3.6 Plus thinking took 67 seconds on the baseline and 1,788 seconds on the full benchmark, a 27x ratio. GLM 5.1 thinking dropped from 1,615 seconds to 75 seconds, a 22x ratio. Their full-benchmark latency reflects task difficulty, not a model-level pace constraint.
Kimi K2.6 took 237 seconds on the baseline, only about 7x below its 1,603-second full-benchmark average and close to the slowest baseline time. Its latency appears to be at least partly model-level. GPT 5.4 Nano took 244 seconds to complete a trivial task at only $0.008, making it the slowest model in the baseline despite being one of the cheapest.
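The slowdown figures in this section are plain ratios of full-benchmark time to baseline time. The sketch below recomputes them from the numbers quoted in the text; the function name `slowdown` is ours.

```python
# (baseline seconds, full-benchmark average seconds), restated from this article.
TIMINGS = {
    "Qwen 3.6 Plus thinking": (67, 1788),
    "GLM 5.1 thinking": (75, 1615),
    "Kimi K2.6": (237, 1603),
}


def slowdown(baseline_s: float, full_s: float) -> float:
    """How many times longer the full benchmark took than the baseline task."""
    return full_s / baseline_s


if __name__ == "__main__":
    for model, (baseline_s, full_s) in TIMINGS.items():
        print(f"{model}: {slowdown(baseline_s, full_s):.1f}x")
    # Qwen 3.6 Plus thinking: 26.7x
    # GLM 5.1 thinking: 21.5x
    # Kimi K2.6: 6.8x
```

A smaller ratio means more of a model's benchmark latency is already present on a trivial task.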
Claude Sonnet 4.6 thinking ran the baseline in 127 seconds at $0.15 with 1,800 output tokens, faster and cheaper than every other top-5 model from the full benchmark.
What are agentic LLM systems?
Building software is iterative: write code, run it, read errors, fix them, repeat. Agentic AI systems enable LLMs to follow this same cycle. The model operates inside a development environment where it can write files, execute commands, read outputs, and make changes based on what it sees, continuing until the task is complete.
This matters because real applications aren’t single files. They have backends with routes and database models, frontends with components and API calls, configuration files, dependencies, and tests. Making these work together requires iterative testing and refinement, which is exactly what agentic architecture enables.
How it works
The model sits inside a harness with access to a shell, file system, and execution output. When asked to build an application, it writes files incrementally. After each step, the harness shows the model what happened: did the server start, did tests pass, did the linter flag errors? Based on that feedback, the model decides what to write or fix next.
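The feedback cycle described above can be sketched as a small loop. This is a toy illustration, not the harness used in the benchmark: `policy` is a hypothetical stand-in for the model, and `scripted_policy` simply replays a fixed "fail, then fix" sequence so the example runs without any LLM.

```python
import subprocess
import sys
from typing import Callable, List, Optional

# Hypothetical stand-in for the model: given the feedback seen so far, return
# the next command to run (as an argv list), or None when the task is done.
Policy = Callable[[List[str]], Optional[List[str]]]


def run_agent_loop(policy: Policy, max_steps: int = 10) -> List[str]:
    """Run commands proposed by `policy`, feeding each result back to it."""
    history: List[str] = []
    for _ in range(max_steps):
        command = policy(history)
        if command is None:  # the policy considers the task complete
            break
        result = subprocess.run(command, capture_output=True, text=True)
        # The feedback the policy sees next: exit code plus captured output.
        history.append(
            f"exit={result.returncode} "
            f"out={result.stdout.strip()} err={result.stderr.strip()}"
        )
    return history


def scripted_policy(history: List[str]) -> Optional[List[str]]:
    """Toy policy: a check fails once, is 'fixed', and then the loop stops."""
    if not history:
        return [sys.executable, "-c", "import sys; sys.exit(1)"]
    if history[-1].startswith("exit=1"):
        return [sys.executable, "-c", "print('tests passed')"]
    return None


if __name__ == "__main__":
    for line in run_agent_loop(scripted_policy):
        print(line)
```

The `max_steps` cap is a design choice of this sketch; any real harness needs some bound so that a model that never declares completion cannot loop forever.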
This differs fundamentally from single-shot generation. In one-shot setups, the model generates an entire codebase blind, with no way to verify if it works. In agentic LLM systems, the model sees the consequences of each action and course-corrects. However, this capability alone isn’t sufficient. The model still needs strong reasoning to implement business logic correctly, which is where performance differences really emerge.
Agentic LLM benchmark methodology
We used Opencode as the agent harness for all models and connected them through OpenRouter. Each cell was run 3 times to measure per-cell variance and stabilize the leaderboard. We evaluated their ability to work autonomously on 10 software development tasks (T-1 to T-10), ranging from reservation systems to interactive dashboards. These tasks require agents to manage multi-file projects and deliver functional products.
Execution and orchestration
Every agent and task begins in a clean environment. The instructions are provided as a TASK.md file, and we use a 20-minute heartbeat watchdog for the launch scripts. During this phase, we record exit codes, execution time, and whether the backend and frontend files were created. We also track real-time token usage across input, output, and cached categories.
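One way such a watchdog could work is sketched below. It assumes that "heartbeat" means the launch script is killed once it has produced no output for the whole window; the article does not spell this out, and the function name `run_with_heartbeat` and the returned field names are ours.

```python
import subprocess
import sys
import threading
import time


def run_with_heartbeat(argv, heartbeat_s=20 * 60, poll_s=0.05):
    """Run argv and kill it if it prints nothing for heartbeat_s seconds.

    Assumption: any line of output counts as a heartbeat.
    """
    started = time.monotonic()
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    last_beat = {"t": started}

    def pump_output():
        # Drain the pipe so the child never blocks, and record each heartbeat.
        for _line in proc.stdout:
            last_beat["t"] = time.monotonic()

    reader = threading.Thread(target=pump_output, daemon=True)
    reader.start()

    timed_out = False
    while proc.poll() is None:
        if time.monotonic() - last_beat["t"] > heartbeat_s:
            proc.kill()
            timed_out = True
            break
        time.sleep(poll_s)

    proc.wait()
    reader.join(timeout=1)
    return {
        "exit_code": proc.returncode,
        "seconds": time.monotonic() - started,
        "timed_out": timed_out,
    }


if __name__ == "__main__":
    print(run_with_heartbeat([sys.executable, "-c", "print('ready')"]))
```

Recording the exit code and elapsed time in the same call keeps the orchestration metrics tied to the run that produced them.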
Backend validation: We deploy the generated projects in isolated environments to test them against a canonical YAML contract. The validation covers happy path scenarios, error handling (400/403/409), and data consistency.
We test the results in two modes:
- Adaptive mode validates functionality even with differing route names.
- Strict mode requires exact adherence to the contract.
The backend overall score is calculated per cell as:
backend_overall = has_backend × (0.7 × adaptive_pass_rate + 0.3 × strict_pass_rate)
where has_backend is 1 if the cell produced a backend project, 0 otherwise. Adaptive is weighted higher because it measures behavioral correctness; strict adds a penalty for contract drift (renamed routes, substituted status codes, restructured response fields).
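As a worked example, the formula translates directly into a few lines of Python. The function signature and the input validation are our own additions; only the weights and the `has_backend` gate come from the formula above.

```python
def backend_overall(
    has_backend: bool, adaptive_pass_rate: float, strict_pass_rate: float
) -> float:
    """Per-cell backend score: 0.7 x adaptive + 0.3 x strict, gated by has_backend."""
    for rate in (adaptive_pass_rate, strict_pass_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("pass rates must be between 0 and 1")
    if not has_backend:
        return 0.0  # no backend project means no backend credit at all
    return 0.7 * adaptive_pass_rate + 0.3 * strict_pass_rate


if __name__ == "__main__":
    # A cell that behaves correctly on 90% of checks but drifts from the
    # contract on some of them (60% strict pass rate).
    print(round(backend_overall(True, 0.9, 0.6), 2))  # 0.81
```

In this example, contract drift costs the cell 0.09 points relative to a run whose strict pass rate also reached 0.9.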
UI and user scenario testing
We use browser automation to simulate real user flows, including preflights, rendering, and authentication. We verify functional steps such as login submission and post-login behavior to ensure the application runs without crashing.
UI scoring splits the eight step categories into two groups. Infrastructure steps (backend preflight, frontend render, login form visible, login submit, login 2xx, no runtime crash) measure whether the app runs at all. Behavior steps (post-login auth signal, post-login behavior signal) assess whether the app performs its intended function once running.
ui_score = (behavior_passed / (behavior_passed + behavior_failed)) × (infra_passed / infra_total)
Blocked behavior steps are excluded from the behavior denominator, so a cell is not double-penalized when the app fails to load.
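A small sketch of this calculation follows. The function name mirrors the formula, but one detail is an assumption on our part: when every behavior step is blocked the denominator is zero, and the sketch returns 0.0 in that case because the article does not say how that situation is scored.

```python
def ui_score(
    behavior_passed: int, behavior_failed: int, infra_passed: int, infra_total: int
) -> float:
    """Behavior pass rate (blocked steps excluded) scaled by the infra pass rate."""
    if infra_total <= 0:
        raise ValueError("infra_total must be positive")
    attempted = behavior_passed + behavior_failed  # blocked steps are not counted
    # Assumption: with no attempted behavior steps, the behavior rate is 0.0.
    behavior_rate = behavior_passed / attempted if attempted else 0.0
    return behavior_rate * (infra_passed / infra_total)


if __name__ == "__main__":
    # Both behavior steps pass, but one of six infrastructure steps fails.
    print(round(ui_score(2, 0, 5, 6), 3))  # 0.833
    # One behavior step passes and one fails on a fully working app.
    print(ui_score(1, 1, 6, 6))  # 0.5
```

Because the two factors are multiplied, an app that loads poorly is penalized once through the infrastructure rate rather than a second time through blocked behavior steps.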
Tokens calculation
Token counts are extracted from the LLM API response. We subtract cached input tokens from total input tokens to get the effective input, which reflects only newly processed tokens. Output tokens are never cached, so they remain unchanged.
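The subtraction can be illustrated with a small record type. The class and field names below are hypothetical and do not correspond to any particular provider's response schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one run (hypothetical field names)."""

    input_tokens: int  # total input tokens, including cached ones
    cached_input_tokens: int  # input tokens served from the provider's cache
    output_tokens: int  # never cached, reported as-is

    def __post_init__(self):
        if not 0 <= self.cached_input_tokens <= self.input_tokens:
            raise ValueError("cached tokens must be between 0 and input_tokens")

    @property
    def effective_input(self) -> int:
        """Newly processed input tokens: total input minus cached input."""
        return self.input_tokens - self.cached_input_tokens


if __name__ == "__main__":
    usage = TokenUsage(input_tokens=12_000, cached_input_tokens=9_000, output_tokens=800)
    print(usage.effective_input, usage.output_tokens)  # 3000 800
```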
Final aggregation
The final benchmark score is calculated by combining the results from the previous phases:
Final Score = (0.7 × backend_overall) + (0.3 × ui_score)
We assign a higher weight to the backend because logic failures at the API level often invalidate any success in the frontend.
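For a single cell, the aggregation is a weighted sum. The sketch below uses made-up input values purely to show the arithmetic; the function name `final_score` is ours.

```python
def final_score(backend_overall: float, ui_score: float) -> float:
    """Weighted per-cell score: 70% backend, 30% UI."""
    return 0.7 * backend_overall + 0.3 * ui_score


if __name__ == "__main__":
    # Illustrative inputs, not measured results.
    print(round(final_score(0.81, 0.5), 3))  # 0.717
    # With these weights, a perfect UI score contributes at most 0.3.
    print(round(final_score(0.0, 1.0), 3))  # 0.3
```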
Task example
Task 6: Helpdesk ticket system
Task 6 focuses on developing a complex customer support ecosystem. The primary objective is to build a platform that mediates communication between customers and support agents while strictly enforcing business rules and security boundaries. This task evaluates an agent’s ability to handle multi-user state machines, data isolation, and threaded communication within a full-stack environment.
The task required building a helpdesk system featuring:
- Distinct permissions for Customers (issuing/replying) and Agents (management/resolution).
- A rigid status workflow that prevents illegal transitions and enforces role-specific actions.
- Advanced data isolation where unauthorized resource requests return 404 instead of 403, so the response does not reveal whether the resource exists.
- A chronological reply system for seamless agent-customer interaction.
- A FastAPI backend combined with a responsive Vite-powered frontend (React/Vue/Svelte).
- Reproducible setup via specific shell commands for immediate system activation.
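Two of these requirements, the role-aware status workflow and the 404-based isolation, can be sketched in framework-free Python. Everything specific here is hypothetical: the status names, the transition table, the function names, and the choice of 409 and 403 for rejected transitions are illustrative assumptions, and the actual rules are defined in the task documentation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical workflow: (current status, new status) -> roles allowed to do it.
ALLOWED_TRANSITIONS = {
    ("open", "in_progress"): {"agent"},
    ("in_progress", "resolved"): {"agent"},
    ("resolved", "open"): {"customer"},  # a customer reopens a ticket
}


@dataclass
class Ticket:
    id: int
    owner_id: int
    status: str = "open"


def access_status(ticket: Optional[Ticket], user_id: int, role: str) -> int:
    """HTTP status for reading a ticket."""
    if ticket is None:
        return 404
    if role == "agent" or ticket.owner_id == user_id:
        return 200
    # Same answer as for a missing ticket, so existence is not leaked.
    return 404


def transition(ticket: Ticket, new_status: str, role: str) -> int:
    """Apply a status change if the workflow and the caller's role allow it."""
    allowed_roles = ALLOWED_TRANSITIONS.get((ticket.status, new_status))
    if allowed_roles is None:
        return 409  # assumed code for an illegal transition
    if role not in allowed_roles:
        return 403  # assumed code for a legal transition by the wrong role
    ticket.status = new_status
    return 200


if __name__ == "__main__":
    ticket = Ticket(id=1, owner_id=10)
    print(access_status(ticket, user_id=11, role="customer"))  # 404
    print(transition(ticket, "resolved", role="agent"))  # 409
    print(transition(ticket, "in_progress", role="agent"))  # 200
```

Keeping the workflow in a single table makes illegal transitions the default: anything not listed is rejected.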
You can view the Task 6 documentation on GitHub.