A-CODE-LLM Bench: Agentic Coding Benchmark

Berk Kalelioğlu

with

Cem Dilmegani

updated on Jul 23, 2026

We benchmarked the top Large Language Models (LLMs) across 10 software development tasks using an agentic CLI tool. We executed ~3,500 automated validation steps per model across both API and UI layers.

A-CODE-LLM Bench results

Each alias ran 3 times across 10 tasks (30 samples per alias, 400 cells per iteration across 40 aliases). See more details on methodology.

Mid-tier Sonnet beats the flagship Opus. Both Sonnet versions outscore every Opus, including Opus 4.8 (0.702). Anthropic’s most expensive tier is not its best coder.
The top is no longer Anthropic-only: Grok 4.5 (0.732) beats every Opus variant. OpenAI’s new flagship GPT 5.6 Sol posts the company’s best score (0.615), still 0.157 below Sonnet 5, and its higher-compute pro variants mostly land below their own bases (Sol Pro 0.543, Terra Pro 0.568; only Luna Pro improves, 0.603 vs 0.579).
Code-specialists did not win the coding benchmark. GPT 5.3 Codex, OpenAI’s code-tuned variant, scores 0.572, mid-pack and below OpenAI’s own general GPT 5.4 Mini (0.594). Moonshot’s Kimi K2.7 Code is the stronger specialist at 0.611.
Kimi K3, a general model, takes the open-weights lead from Moonshot’s own code specialist: 0.725, sixth overall, 0.114 above K2.7 Code. Its 0.632 backend ranks fifth, above every Opus and GPT alias.
Inkling, Thinking Machines’ first model, debuts at 0.575, third among open-weights entries. Its 0.747 frontend beats every GPT 5.6 alias; the 0.501 backend caps its rank.
No model is reliable on backend: the ceiling is 0.701 (Sonnet 5), so even the winner fails about a third of the business-logic and contract checks. Grok 4.5 comes closest among the July additions at 0.663. Frontend is near-solved among the leaders (0.79 to 0.96), so backend is the open problem, and it sets the ranking. Claude Haiku 4.5 renders fine (0.731), but a 0.277 backend holds it to 0.413.
GPT’s weak spot is the frontend. GPT 5.4 and 5.5 match Opus 4.8 on backend (about 0.6) but score 0.53 to 0.55 on frontend; the GPT 5.6 family lifts that to 0.63 to 0.71, still far below the Sonnet line’s 0.91-plus.

Cost & success comparison

The flagship-priced models are the worst value. Opus 4.7 is the most expensive ($3.08/cell) and scores 0.610, below Sonnet 4.6 at $1.33.
The top charges a steep premium for little gain: Sonnet 5 scores 0.024 above Sonnet 4.6 for 70% more cost per cell.
Grok 4.5 is the new best value: 0.732, within 0.040 of the winner, at $0.46 per cell, compared to Sonnet 5’s $2.23. Inside the GPT 5.6 family price buys nothing: $0.18 (Luna) to $2.76 (Sol Pro) per cell for scores between 0.543 and 0.615, with the cheapest alias outscoring the most expensive.
Kimi K3 buys its sixth-place score at a mid-tier price: $1.47 per cell, near Sonnet 4.6’s $1.33 but 0.023 lower in score, and well under Sonnet 5’s $2.23. Inkling costs $1.64 per cell at list price for 0.575, more than Sonnet 4.6 for a lower score.

Task completion time & success comparison

The top score is among the slowest. Sonnet 5 takes about 30 minutes per task, 3x Sonnet 4.6 for 0.024 more; Sonnet 4.6 gives nearly the same score in a third of the time.
A long run usually signals a stuck model, not a thorough one: the bottom scorers (both Qwen variants, GLM 5.1 base, and DeepSeek V4 Pro) each ran over 1,700 seconds because of over-iteration, for scores below 0.45.
Grok 4.3 was fast because it quit early: 142 seconds and 18 tool calls for 0.431. Grok 4.5 keeps the speed and drops the quitting: about 9 minutes per task, under a third of Sonnet 5’s time, for 0.732.
Kimi K3 sits at the opposite corner: top-tier score, worst speed. It averages about 55 minutes per task, the slowest in the field and roughly double Sonnet 5, for a sixth-place 0.725. Its accuracy is real, but it is the least practical way into the top tier.

Tool calls per task

Tool-call count is not a comparable measure of capability or effort. Sonnet 5 made the most calls (125) and scored highest; MiniMax M3 made 108 for a mid-pack 0.583; Grok 4.5 reached 0.732 on 40; OpenAI’s low 16 to 36 come from apply_patch bundling a whole file into one call. Sol Pro and Terra Pro call a quarter to a third fewer tools than their bases and score less: more reasoning, less execution. Do not rank agents by tool volume.
Two paths reach the same score: Sonnet 5 iterates heavily (125 calls), Sonnet 4.6 barely (about 50), 0.024 apart.

LLM performance on a single successful task

No model passed every step of the full benchmark above. To compare cost and speed on equal terms, we ran a simple baseline task that every model can complete: four CRUD endpoints, basic validation, no authentication, and no database.

Cost & lines of code comparison

Simple tasks cannot rank models, so toy evaluations mislead. On the baseline every model passes, code mostly converges to 40 to 64 lines, and cost drops to cents; differences appear only on long, multi-file work.
The “fast and light” tier was the most expensive here: Gemini 3.5 Flash base wrote 131 lines for the trivial task, two to three times the field, making it the priciest baseline, against its own positioning.
Sonnet 5’s heavy iteration is task-driven, not a habit: 9 calls and $0.09 here versus 125 calls on the benchmark.

See more details in the LLM Pricing article.

Completion time & token usage

Cost predictability splits models in two. Adaptive models spend only when needed (Opus 4.8: 34s baseline, 1,072s benchmark); fixed-pace models run slow and costly even on trivial work (MiniMax M3: 475s baseline, 1,684s benchmark).
Output length is a fixed model trait, ranging nearly 10x for the same task (787 to 7,508 tokens), feeding directly into cost.

What are agentic LLM systems?

Building software is iterative: write code, run it, read errors, fix them, repeat. Agentic AI systems enable LLMs to follow this same cycle. The model operates inside a development environment where it can write files, execute commands, read outputs, and make changes based on what it sees, continuing until the task is complete.

This matters because real applications aren’t single files. They have backends with routes and database models, frontends with components and API calls, configuration files, dependencies, and tests. Making these work together requires iterative testing and refinement, which is exactly what agentic architecture enables.

How it works

The model sits inside a harness with access to a shell, file system, and execution output. When asked to build an application, it writes files incrementally. After each step, the harness shows the model what happened: did the server start, did tests pass, did the linter flag errors? Based on that feedback, the model decides what to write or fix next.

This differs fundamentally from single-shot generation. In one-shot setups, the model generates an entire codebase blind, with no way to verify if it works. In agentic LLM systems, the model sees the consequences of each action and course-corrects. However, this capability alone isn’t sufficient. The model still needs strong reasoning to implement business logic correctly, which is where performance differences really emerge.
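
The write, run, read, fix cycle described above can be sketched as a small loop. This is a minimal illustration, not Opencode's implementation: `agent_loop`, `Feedback`, `propose`, and `execute` are hypothetical names, with `propose` standing in for the model and `execute` for the harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Feedback:
    ok: bool      # did the check (tests, linter, server start) succeed?
    output: str   # text the harness shows back to the model


def agent_loop(
    propose: Callable[[str], str],
    execute: Callable[[str], Feedback],
    task: str,
    max_steps: int = 10,
) -> tuple[bool, int]:
    """Repeat propose -> execute -> read feedback until success or the step budget ends."""
    context = task
    for step in range(1, max_steps + 1):
        action = propose(context)      # the model decides what to write or fix
        feedback = execute(action)     # the harness runs it and captures the result
        if feedback.ok:
            return True, step
        # Feed the observed output back so the next proposal can course-correct.
        context = f"{task}\nLast action: {action}\nOutput: {feedback.output}"
    return False, max_steps


# Stub model and harness (hypothetical): the first attempt fails, the second succeeds.
def fake_propose(context: str) -> str:
    return "fixed" if "Output:" in context else "buggy"


def fake_execute(action: str) -> Feedback:
    return Feedback(ok=(action == "fixed"), output="ok" if action == "fixed" else "AssertionError")


print(agent_loop(fake_propose, fake_execute, "build api"))  # (True, 2)
```

A single-shot setup corresponds to `max_steps=1`: the first proposal is final, whether or not it runs.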

Agentic LLM benchmark methodology

We used Opencode as the agent harness for all models and connected them through OpenRouter, with two exceptions: Claude Fable 5 ran on the Claude Code CLI on the Claude subscription, and Inkling ran through Thinking Machines’ own OpenAI-compatible API because the model is not available on OpenRouter. Each cell was run 3 times to measure per-cell variance and stabilize the leaderboard. We evaluated their ability to work autonomously on 10 software development tasks (T-1 to T-10), ranging from reservation systems to interactive dashboards. These tasks require agents to manage multi-file projects and deliver functional products. The seven aliases added in July 2026 (Grok 4.5 and the GPT 5.6 family: Sol, Terra, Luna, each in base and pro mode) ran on Opencode 1.15.13 at API-default settings, the same as every model here; OpenAI’s launch numbers for Sol use higher-compute modes.¹ Four Sol Pro cells and one Terra Pro cell ended in repeated silent API stream failures rather than model errors and are scored as failures.

Kimi K3 and Inkling were added on July 18 at API-default settings; Inkling ran at its default thinking effort of 0.9. Inkling’s cost column uses Thinking Machines’ list prices of $3.74 per million input tokens and $9.36 per million output tokens; the provider billed about half that under a limited-time launch discount.² Both models ran with a 150-minute per-cell wall instead of the earlier 45 minutes, because Kimi K3’s token stream is slow enough to hit the shorter wall mid-build; completion time is reported separately in the time chart. Kimi K3’s shared upstream capacity returned frequent rate-limit and timeout errors during the run window; affected cells were rerun once against the same wall, the same policy applied to the Sol Pro and Terra Pro stream failures above.

Execution and orchestration

Every agent and task begins in a clean environment. The instructions are provided as a TASK.md file, and we use a 20-minute heartbeat watchdog for the launch scripts. During this phase, we record exit codes, execution time, and whether the backend and frontend files were created. We also track real-time token usage across input, output, and cached categories.
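
A heartbeat watchdog of this kind can be sketched as follows. The source does not say what counts as a heartbeat or how a stalled run is handled, so this is an assumption-laden illustration: `HeartbeatWatchdog`, `beat`, and `expired` are hypothetical names, and only the 20-minute window comes from the text.

```python
import time


class HeartbeatWatchdog:
    """Flags a run as stalled when no heartbeat arrives within `timeout_s` seconds."""

    def __init__(self, timeout_s: float = 20 * 60, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable so the logic can be tested without waiting
        self._last_beat = clock()

    def beat(self) -> None:
        """Record activity, for example a new line of agent output (assumed signal)."""
        self._last_beat = self._clock()

    def expired(self) -> bool:
        """True once the silence since the last beat exceeds the timeout."""
        return self._clock() - self._last_beat > self.timeout_s
```

A launch script would call `beat()` whenever the agent shows activity and stop the run once `expired()` returns true; a monotonic clock keeps the check unaffected by system clock changes.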

Backend validation: We deploy the generated projects in isolated environments to test them against a canonical YAML contract. The validation covers happy path scenarios, error handling (400/403/409), and data consistency.

We test the results in two modes:

Adaptive mode validates functionality even with differing route names.
Strict mode requires exact adherence to the contract.

The backend overall score is calculated per cell as:

backend_overall = has_backend × (0.7 × adaptive_pass_rate + 0.3 × strict_pass_rate)

where has_backend is 1 if the cell produced a backend project, 0 otherwise. Adaptive is weighted higher because it measures behavioral correctness; strict adds a penalty for contract drift (renamed routes, substituted status codes, restructured response fields).
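
As a worked instance of the formula above, a minimal Python sketch follows. The function signature and the example pass rates are illustrative assumptions; only the weights and the has_backend gate come from the text.

```python
def backend_overall(has_backend: bool, adaptive_pass_rate: float, strict_pass_rate: float) -> float:
    """Per-cell backend score: 70% adaptive pass rate, 30% strict, zero without a backend."""
    for name, rate in (("adaptive", adaptive_pass_rate), ("strict", strict_pass_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} pass rate must be within [0, 1], got {rate}")
    return int(has_backend) * (0.7 * adaptive_pass_rate + 0.3 * strict_pass_rate)


# Hypothetical cell: behavior is mostly right, but renamed routes fail half the strict checks.
print(round(backend_overall(True, 0.8, 0.5), 2))   # 0.71
# No backend project produced: the cell scores zero regardless of pass rates.
print(backend_overall(False, 0.8, 0.5))            # 0.0
```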

UI and user scenario testing

We use browser automation to simulate real user flows, including preflights, rendering, and authentication. We verify functional steps such as login submission and post-login behavior to ensure the application runs without crashing.

UI scoring splits eight steps into two groups. Infrastructure steps (backend preflight, frontend render, login form visible, login submit, login 2xx, no runtime crash) measure whether the app runs at all. Behavior steps (post-login auth signal, post-login behavior signal) assess whether the app performs its intended function once running.

ui_score = (behavior_passed / (behavior_passed + behavior_failed)) × (infra_passed / infra_total)

Blocked behavior steps are excluded from the behavior denominator, so a cell is not double-penalized when the app fails to load.
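
The UI formula and the blocked-step rule can be sketched in Python as below. The argument names mirror the formula; the handling of a cell where every behavior step is blocked is an assumption, since the formula is undefined (0/0) in that case and the text does not specify it.

```python
def ui_score(infra_passed: int, infra_total: int, behavior_passed: int, behavior_failed: int) -> float:
    """Behavior pass rate scaled by the infrastructure pass rate.

    Blocked behavior steps are simply not counted in either behavior argument.
    """
    if infra_total <= 0:
        raise ValueError("infra_total must be positive")
    attempted = behavior_passed + behavior_failed
    if attempted == 0:
        # ASSUMPTION: with every behavior step blocked, score the cell as 0.0.
        return 0.0
    return (behavior_passed / attempted) * (infra_passed / infra_total)


# All six infrastructure steps pass, one of two behavior steps passes.
print(ui_score(6, 6, 1, 1))   # 0.5
# Half the infrastructure passes; one behavior step passes and the other is blocked.
print(ui_score(3, 6, 1, 0))   # 0.5
```

In the second call the blocked step does not lower the behavior ratio; the penalty arrives once, through the infrastructure factor.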

Tokens calculation

Token counts are extracted from the LLM API response. We subtract cached input tokens from total input tokens to get the effective input, which reflects only newly processed tokens. Output tokens are never cached, so they remain unchanged.
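
The subtraction can be illustrated with a small sketch. The field names are hypothetical and do not correspond to any specific provider's response schema, and the example counts are invented; the sketch assumes, as the text does, that the reported input total includes cached tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int          # total input tokens reported by the API (assumed to include cached)
    cached_input_tokens: int   # portion of the input served from cache
    output_tokens: int         # output is never cached, so it is used as reported

    @property
    def effective_input(self) -> int:
        """Newly processed input tokens: total input minus cached input."""
        return self.input_tokens - self.cached_input_tokens


usage = TokenUsage(input_tokens=12_000, cached_input_tokens=9_000, output_tokens=800)
print(usage.effective_input, usage.output_tokens)  # 3000 800
```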

Final aggregation

The final benchmark score is calculated by combining the results from the previous phases:

Final Score = (0.7 × backend_overall) + (0.3 × ui_score)

We assign a higher weight to the backend because logic failures at the API level often invalidate any success in the frontend.
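
The weighting can be checked against figures quoted in the results section. The sketch below assumes that the "backend" and "frontend" numbers reported there are the averaged backend_overall and ui_score values; the function name is illustrative.

```python
BACKEND_WEIGHT = 0.7
UI_WEIGHT = 0.3


def final_score(backend_overall: float, ui_score: float) -> float:
    """Weighted combination of the backend and UI scores."""
    return BACKEND_WEIGHT * backend_overall + UI_WEIGHT * ui_score


# Backend and frontend figures quoted in the results section.
print(round(final_score(0.501, 0.747), 3))  # Inkling: 0.575
print(round(final_score(0.277, 0.731), 3))  # Claude Haiku 4.5: 0.413
```

Both results match the overall scores reported for those models, which is consistent with the backend weight driving the ranking: Claude Haiku 4.5's 0.731 frontend contributes only about 0.22 to its total.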

Task example

Task 6: Helpdesk ticket system

Task 6 focuses on developing a complex customer support ecosystem. The primary objective is to build a platform that mediates communication between customers and support agents while strictly enforcing business rules and security boundaries. This task evaluates an agent’s ability to handle multi-user state machines, data isolation, and threaded communication within a full-stack environment.

The task required building a helpdesk system featuring:

Distinct permissions for Customers (issuing/replying) and Agents (management/resolution).
A rigid status workflow that prevents illegal transitions and enforces role-specific actions.
Advanced data isolation where unauthorized resource requests return 404 instead of 403, so that responses do not reveal whether a resource exists.
A chronological reply system for seamless agent-customer interaction.
A FastAPI backend combined with a responsive Vite-powered frontend (React/Vue/Svelte).
Reproducible setup via specific shell commands for immediate system activation.
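
The status workflow and isolation rules in this list can be sketched framework-free, as below. The statuses, the transition table, and the mapping of illegal transitions to 409 and wrong-role actions to 403 are all assumptions made for illustration; the actual rules live in the Task 6 specification. Only the 404-for-foreign-resources rule is taken from the list above.

```python
from enum import Enum


class Role(str, Enum):
    CUSTOMER = "customer"
    AGENT = "agent"


# Hypothetical workflow: (current status, target status) -> roles allowed to make the move.
ALLOWED = {
    ("open", "in_progress"): {Role.AGENT},
    ("in_progress", "resolved"): {Role.AGENT},
    ("resolved", "closed"): {Role.CUSTOMER, Role.AGENT},
    ("resolved", "open"): {Role.CUSTOMER},
}


def transition_status(ticket: dict, user_id: int, role: Role, target: str) -> int:
    """Return the HTTP status code a transition request would produce."""
    # Data isolation: a customer touching someone else's ticket sees 404, not 403,
    # so the response does not confirm that the ticket exists.
    if role is Role.CUSTOMER and ticket["owner_id"] != user_id:
        return 404
    allowed_roles = ALLOWED.get((ticket["status"], target))
    if allowed_roles is None:
        return 409  # assumed code for an illegal transition
    if role not in allowed_roles:
        return 403  # assumed code for a legal transition attempted by the wrong role
    ticket["status"] = target
    return 200


ticket = {"owner_id": 1, "status": "open"}
print(transition_status(ticket, 2, Role.CUSTOMER, "in_progress"))  # 404
print(transition_status(ticket, 1, Role.CUSTOMER, "in_progress"))  # 403
print(transition_status(ticket, 9, Role.AGENT, "closed"))          # 409
print(transition_status(ticket, 9, Role.AGENT, "in_progress"))     # 200
```

Keeping the workflow in a single table makes every legal move explicit, so anything absent from the table is rejected by default.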

You can view the Task 6 documentation on GitHub.

Cite this benchmark

Berk Kalelioğlu and Cem Dilmegani (2026) - "A-CODE-LLM Bench: Agentic Coding Benchmark". Published online at AIMultiple.com. Retrieved July 23, 2026, from: https://aimultiple.com/agentic-llm [Online Resource]

Kalelioğlu, B., & Dilmegani, C. (2026, July 23). A-CODE-LLM Bench: Agentic Coding Benchmark. AIMultiple. https://aimultiple.com/agentic-llm

@misc{kalelioglu2026,
  author       = {Kalelioğlu, Berk and Dilmegani, Cem},
  title        = {{A-CODE-LLM Bench: Agentic Coding Benchmark}},
  year         = {2026},
  month        = jul,
  howpublished = {\url{https://aimultiple.com/agentic-llm}},
  note         = {AIMultiple. Retrieved July 23, 2026}
}

Download all data

Results and timestamps of 429 data points. Download the data used in this article as a ZIP file containing 2 CSV files and a README.

Last updated: July 3, 2026

Reference Links

¹ GPT-5.6: Frontier intelligence that scales with your ambition | OpenAI

² Inkling: Our Open-Weights Model - Thinking Machines Lab

Berk Kalelioğlu

AI Researcher

Berk is an AI Researcher at AIMultiple, focusing on agentic AI systems and language models.

Technically reviewed by

Cem Dilmegani

Principal Analyst

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month. Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple. Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization. He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider. Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.

We follow ethical norms & our process for objectivity. This research does not feature any customers of AIMultiple.
