Intelligence Density of 69 LLMs: Smarter or More Efficient?

Hazal Şimşek
updated on Jun 10, 2026

We tracked 69 LLMs released between February 2023 and May 2026 and collected their scores on 10 public benchmarks to measure intelligence density: a model's capability score divided by the resource it consumes (active parameters, training compute, or inference price).

Intelligence density indexed to 100 in 2023, averaged across all models released each year.

  • X-axis: Year of release.
  • Y-axis: Intelligence density, indexed to 100 in 2023.
  • Each point: Yearly average density across all models released that year.
  • Tooltip on hover: Yearly mean on each benchmark group (Standard, 2023–2024, and Advanced, 2024–2026), which group was used to compute that year’s growth rate, and the percent change from the previous year.

LLM intelligence density results

The intelligence density chart above shows that:

  • A density of 100 in 2023 grew to 888 by 2026, roughly an 8.9× improvement.
  • The flat 2023 to 2024 line reflects offsetting effects across three resource axes (per parameter, per FLOP, per dollar), not stalled progress.
  • Most of the gain came in 2024 and 2025, as smaller and cheaper open-weight models began matching the capability of larger predecessors.

Note that the Standard Benchmarks saturated in late 2024: once frontier models started scoring 90%+, the tests stopped separating strong models from stronger ones. We therefore split the benchmarks into two groups:

  • Standard Benchmarks (2023–2024): MMLU (5-shot), GSM8K (8-shot chain of thought), HellaSwag (10-shot), ARC Challenge (25-shot), HumanEval (0-shot pass@1).
  • Advanced Benchmarks (2024–2026): MMLU-Pro, GPQA Diamond, AIME 2025, LiveCodeBench, SWE-bench Verified.

See the methodology section for the scoring approach and per-resource breakdowns.

LLM capability

The density chart shows efficiency. To put it in context, we also measured raw capability over the same period: how much smarter have these models actually gotten before any resource adjustment?

Capability indexed to 100 in 2023, averaged across all models released each year.

Axes and tooltip work the same as the density chart, with the Y axis tracking raw benchmark performance rather than density.

A capability score of 100 in 2023 grew to 468 by 2026, roughly a 4.7× improvement. Year by year:

  • 2024 (+88%): Rapid progress on Standard Benchmarks before they saturated.
  • 2025 (+105%): Same rapid progress, measured on the harder Advanced Benchmarks that replaced Standard.
  • 2026 (+22%): Slowdown is an early sign that Advanced Benchmarks are starting to saturate too.
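
The cumulative figure follows from compounding the three yearly rates. The sketch below is a minimal illustration of that arithmetic, not the article's own code; the function and variable names are hypothetical. Using the rounded percentages from the list, the index lands at roughly 470, close to the reported 468 (the small difference presumably comes from rounding in the published rates).

```python
# Year-over-year capability growth rates, as reported in the list above.
yearly_growth = {2024: 0.88, 2025: 1.05, 2026: 0.22}


def chain_index(growth_by_year, base=100.0):
    """Compound year-over-year growth rates into a cumulative index."""
    index = {}
    level = base
    for year in sorted(growth_by_year):
        level *= 1.0 + growth_by_year[year]
        index[year] = level
    return index


capability_index = chain_index(yearly_growth)
for year, level in capability_index.items():
    print(year, round(level, 1))
```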

Putting the two charts side by side:

  • Capability grew 4.7× while intelligence density grew 8.9×.
  • Models got smarter and more efficient at the same time. They do not need proportionally more parameters, training compute, or money to run.

Note: The 4.7× capability figure is a credible magnitude. The 8.9× density figure is more fragile (the 2024 Advanced group has fewer data points, which amplifies that year’s growth rate), so treat it as a directional signal rather than a precise multiplier.

Which LLM has the highest intelligence density?

Intelligence density depends on which resource we measure and which benchmark group we score on. The leader changes for every combination.

Density per active parameter (open-weight only)

This view measures how efficiently a model packs capability into its weights. Active parameters matter because Mixture-of-Experts (MoE) models route each token through a fraction of total weights. DeepSeek V3 has 671B total parameters but 37B active per token, and the 37B figure reflects effective serving cost.

Standard benchmarks leaders

  • LLaMA 3.1 8B (Jul 2024): 100.0
  • LLaMA 3 8B (Apr 2024): 85.0
  • LLaMA 4 Maverick (Apr 2025): 55.9
  • Mixtral 8x7B (Dec 2023): 47.4
  • Mistral 7B (Sep 2023): 46.3

Advanced benchmarks leaders

  • Qwen 3.6-35B-A3B (Apr 2026): 100.0 (3B active parameters; score rests on a single benchmark, SWE-bench Verified at 73.4%, so read accordingly)
  • GPT-OSS 20B (Aug 2025): 85.2 (3.6B active parameters)
  • GPT-OSS 120B (Aug 2025): 64.4 (5.1B active parameters)
  • Qwen 3.5 9B (Mar 2026): 35.2 (9B dense parameters)
  • DeepSeek V4 Flash (Apr 2026): 26.0

The Standard Benchmarks ranking has not moved since late 2024 because the benchmarks have saturated, not because progress stopped. For example, GPT-4 scored 90.2 on Standard Benchmarks in early 2023 but only 17.3 on Advanced Benchmarks.

Density per training FLOP (open-weight only)

This view rewards data curation, architecture choices, and training recipes rather than raw spend.

Standard benchmarks leaders

  • Mistral 7B (Sep 2023): 100.0 (still the panel anchor, more than two years on)
  • LLaMA 3 8B (Apr 2024): 40.8
  • LLaMA 1 (Feb 2023): 37.0
  • DeepSeek V2 (May 2024): 32.6
  • Mixtral 8x7B (Dec 2023): 27.1

Advanced benchmarks leaders

  • Qwen 3.5 9B (Mar 2026): 100.0
  • Qwen 3.6-35B-A3B (Apr 2026): 28.4
  • DeepSeek V4 Flash (Apr 2026): 6.4

The Qwen 3.5 9B result is notable: a 9B dense model trained at roughly 1.5 × 10²³ FLOP scores 81.2 on Advanced Benchmarks capability. Its training compute is roughly an order of magnitude lower than that of DeepSeek V4 Pro, but the capability gap is much narrower (81.2 vs 97.8).

Note: per-FLOP density on Standard Benchmarks looks like it is falling over time. This is a saturation artifact, not declining training efficiency. The older benchmarks no longer measure recent gains, so capability scores compress while training compute keeps growing.

Density per dollar of inference

Per-dollar density is the axis where open-weight and closed frontier models can be directly compared, because API pricing is publicly verifiable. We use settled prices for the headline so post-launch cuts are reflected.

Standard benchmarks leaders

  • Mistral Small 3 24B (Jan 2025): 100.0 at $0.10 / $0.30 per million tokens via Mistral’s first-party API
  • Gemini 1.5 Flash (May 2024): 99.3 (highest-scoring closed-weight model after its August 2024 price cut)
  • DeepSeek V2 (May 2024): 66.0 at $0.27 / $1.10 per million tokens
  • DeepSeek V3 (Dec 2024): 30.9
  • DeepSeek R1 (Jan 2025): 17.8

Advanced benchmarks leaders

  • GPT-OSS 120B (Aug 2025): 100.0 (price reference is the OpenRouter cross-provider median, since OpenAI did not host the model on a first-party API)
  • GPT-OSS 20B (Aug 2025): 80.0
  • DeepSeek V4 Flash (Apr 2026): 44.0 at $0.06 / $0.42 per million tokens via DeepSeek’s first-party API
  • DeepSeek V3.2 (Dec 2025): 22.5
  • Mistral Small 3 24B (Jan 2025): 20.9
  • DeepSeek V4 Pro (Apr 2026): 16.0 at $0.54 blended, with 97.8 Advanced Benchmarks capability. That puts V4 Pro within 2.2 capability points of Claude Opus 4.7 at roughly 1/18th the blended price ($0.54 vs $10.00).

The closed-frontier gap has not narrowed

Closed reasoning models from Anthropic, OpenAI, and Google cluster between 0 and 6 on Advanced Benchmarks per-dollar density throughout the panel period. GPT-4 in March 2023 scored 0.0. Claude Opus 4.7 in April 2026 scored 0.9.

Closed labs have cut absolute prices:

  • Claude Opus dropped from $30 to $10 per million tokens between Opus 4 and Opus 4.5.
  • OpenAI’s o3 dropped from $10/$40 launch to $2/$8 settled.
  • Gemini 1.5 Flash dropped roughly 78%.

But the relative gap to open-weight density leaders stayed roughly 30 to 50× across the period. DeepSeek priced near $0.18 per million in 2024 (V2) and near $0.18 per million in 2026 (V4 Flash). The open-weight price floor has not moved, and the closed flagship floor has not moved enough to close the gap. Price competition is happening at the previous-generation tier, where older closed models are kept alive as cheap alternatives, not at the frontier.

The new direction: Agentic benchmarks

The Advanced Benchmarks are starting to lose their discriminating power, just like the Standard suite they replaced. Labs have started to report agentic benchmarks (SWE-Bench Pro, SWE-bench Verified, Terminal-Bench, OSWorld-Verified, Finance Agent) instead of the older reasoning tests. We have started building a panel to track this shift.

For example, the newest frontier model, Claude Opus 4.8, reports its results almost entirely on these agentic benchmarks, with no scores on the reasoning benchmarks our main panel tracks.

SWE-Bench Pro resolve rate by release date, scored on Scale’s SEAL standardized harness. Among plotted models:

  • The top closed-weight performers: Claude Opus 4.5 (45.9%), Claude Sonnet 4.5 (43.6%), and Gemini 3 Pro Preview (43.3%).
  • The top open-weight performers: Qwen3-Coder-480B-A35B (38.7%), Kimi K2 Instruct (27.7%), and Qwen3-235B-A22B (21.4%).

Explore Agentic benchmarks by checking out agentic coding benchmarks, agentic search, and agentic monitoring.

Challenges in comparing agentic performance

Evaluating and comparing agentic benchmark results presents several key difficulties:

  • Focus on capability metrics: We can display only capability scores, because the resource data needed to measure density (parameter counts, training compute, and verified per-model pricing) remains incomplete or undisclosed by closed labs.
  • System-level sensitivity: Benchmark scores measure the entire system, not just the AI model. Because scores are highly dependent on the “scaffolding” (the surrounding software framework), the same model can yield vastly different results depending on the specific agent harness used.
  • Standardization protocol: To ensure a fair comparison, this report exclusively plots SWE-Bench Pro data from a single, standardized environment (Scale’s SEAL leaderboard). Other laboratory-reported scores use different harnesses and are not directly comparable.
  • Exclusion of Claude Opus 4.8: Although Opus 4.8 self-reports a record-high 69.2% on SWE-Bench Pro, it was tested on Anthropic’s internal harness and has not yet been evaluated by Scale’s SEAL. To maintain strict data consistency, we note its score here rather than plotting it.

SWE-Bench Pro from Scale SEAL standardized harness. Other columns are lab-reported on varying harnesses and are not directly comparable across rows. Dashes mean no verified score found.

These three ran on the mini-swe-agent harness, so they are excluded from the scatter chart to keep it comparable.

Intelligence density methodology

Panel coverage

  • Models: 69 LLMs released February 2023 to April 2026. 37 models have Standard Benchmarks scores and 54 have Advanced Benchmarks scores. Models without a complete benchmark group are excluded from that group’s density calculations.
  • Benchmarks: 10 public benchmarks, all lab-reported and traceable to a primary source. We did not re-run any benchmark.
  • Resources: For each model, we divided its capability score by three resources independently:
    • Active parameters: A proxy for memory and serving cost. Restricted to open-weight models, because closed-weight labs (OpenAI, Anthropic, Google) do not disclose parameter counts.
    • Training compute (FLOP): A proxy for training efficiency. Also restricted to open-weight models. FLOP figures are mostly estimates from Epoch AI and should be read as directional.
    • Inference price: The price a buyer actually pays at the API. Capability is divided by blended price per million tokens at a 3:1 input-to-output ratio. Both open and closed models participate, because API pricing is publicly verifiable.
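
A 3:1 input-to-output blend can be read as a weighted average of the two list prices. The sketch below assumes exactly that reading (weights of 3 and 1); the function name is hypothetical and the article does not publish its own code. It uses the o3 launch and settled prices quoted in this article as inputs.

```python
def blended_price(input_price, output_price, input_weight=3.0, output_weight=1.0):
    """Weighted-average USD price per million tokens.

    Assumes the 3:1 ratio means three input tokens for every output token.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# o3 list prices per million tokens (input, output) as quoted in this article.
launch = blended_price(10.00, 40.00)
settled = blended_price(2.00, 8.00)

# With capability held fixed, per-dollar density scales inversely with price.
density_gain = launch / settled
print(launch, settled, density_gain)
```

Under this assumption the repricing alone multiplies o3's raw per-dollar density by the ratio of the two blended prices, which is why the headline uses settled rather than launch pricing.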

Why two benchmark groups?

Models released in 2024 saturated the standard benchmarks AI labs had used since 2020. Once frontier models reach 90% or above on a benchmark, the test no longer separates strong models from stronger ones. Labs responded by introducing harder benchmarks.

We treat this shift as a clean break for two reasons:

  • First, mixing easy and hard benchmarks inside a single capability score would let a saturated score on an easy benchmark mask a low score on a hard one. A model that scores 85 on MMLU-Pro is stronger than one that scores 85 on MMLU. Treating the two as equivalent compresses the score range and rewards older benchmarks.
  • Second, the two suites have different scoring distributions: Standard Benchmarks cluster at the top (most frontier models above 85), while Advanced Benchmarks have wider spread (top models around 90, mid-tier around 50, weaker models below 30). Z-scoring across both at once would distort the variance estimate.

We compute capability for each group independently:

  • Standard Benchmarks (2023–2024): MMLU (5-shot), GSM8K (8-shot chain of thought), HellaSwag (10-shot), ARC Challenge (25-shot), HumanEval (0-shot pass@1).
  • Advanced Benchmarks (2024–2026): MMLU-Pro, GPQA Diamond, AIME 2025, LiveCodeBench, SWE-bench Verified.

A score of 90 on Standard Benchmarks is not equivalent to 90 on Advanced Benchmarks. For any model released from late 2024 onward, Advanced Benchmarks are the meaningful reference.

Capability formula

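For each benchmark b and each model m with a reported score, we compute a z-score:

$$ z_{m,b} = \frac{x_{m,b} - \mu_b}{\sigma_b} $$

where $x_{m,b}$ is the model's reported score on the benchmark, and $\mu_b$ and $\sigma_b$ are the mean and standard deviation of that benchmark's scores across the panel.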

We use the population standard deviation (divides by N, not N−1). This matches the original pandas calculation with ddof=0 and treats the panel as the full population rather than a sample drawn from it.

We then average the available z-scores per model within each benchmark group and min-max rescale to a 0–100 panel-relative score.
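
The capability steps described above (z-score each benchmark, average the available z-scores per model, min-max rescale) can be sketched as follows. This is a minimal standard-library sketch rather than the original pandas code: the function names and the sample scores are hypothetical, and it assumes each benchmark's mean and standard deviation are computed over the models that reported that benchmark.

```python
from statistics import fmean, pstdev


def zscores(column):
    """Z-score one benchmark column; None marks an unreported score."""
    reported = [s for s in column.values() if s is not None]
    mu = fmean(reported)
    sigma = pstdev(reported)  # population standard deviation (divides by N)
    # Note: a benchmark where every model ties would give sigma == 0.
    return {m: None if s is None else (s - mu) / sigma for m, s in column.items()}


def capability(panel):
    """panel: {benchmark: {model: score or None}} -> {model: 0-100 score}."""
    per_benchmark = [zscores(column) for column in panel.values()]
    models = {m for column in panel.values() for m in column}
    mean_z = {}
    for m in models:
        available = [z[m] for z in per_benchmark if z.get(m) is not None]
        if available:  # models with no scores in this group are excluded
            mean_z[m] = fmean(available)
    lo, hi = min(mean_z.values()), max(mean_z.values())
    return {m: 100.0 * (z - lo) / (hi - lo) for m, z in mean_z.items()}


# Made-up scores for three hypothetical models on two benchmarks.
panel = {
    "bench_1": {"model_a": 60.0, "model_b": 70.0, "model_c": 80.0},
    "bench_2": {"model_a": 30.0, "model_b": None, "model_c": 50.0},
}
scores = capability(panel)
print(scores)
```

Here model_b is scored on one benchmark only, which mirrors the N=1 coverage case: its score is valid but rests on less evidence than the others.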

Density formula

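For each density view, the raw density of model m is its capability score divided by the resource it consumes:

$$ \text{density}_{m} = \frac{\text{capability}_{m}}{\text{resource}_{m}} $$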

The denominator changes per view: active parameters in billions, training FLOP divided by 10²⁴, or blended price per million tokens.
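
As a rough illustration of the three views, the sketch below divides capability by the view-specific denominator and then min-max rescales the result inside the panel, as the limitations section describes. The field names, view labels, and sample rows are all hypothetical; only the units (billions of active parameters, FLOP divided by 10²⁴, blended price per million tokens) come from the text.

```python
def denominator(view, row):
    """Return the resource in the units named above (field names are hypothetical)."""
    if view == "per_parameter":
        return row["active_params"] / 1e9   # billions of active parameters
    if view == "per_flop":
        return row["training_flop"] / 1e24  # training FLOP divided by 10^24
    if view == "per_dollar":
        return row["blended_price"]         # USD per million tokens
    raise ValueError(f"unknown view: {view}")


def density(rows, view):
    """Raw capability/resource ratio per model, min-max rescaled to 0-100."""
    raw = {name: r["capability"] / denominator(view, r) for name, r in rows.items()}
    lo, hi = min(raw.values()), max(raw.values())
    return {name: 100.0 * (v - lo) / (hi - lo) for name, v in raw.items()}


# Made-up rows for three hypothetical models.
rows = {
    "small": {"capability": 60.0, "active_params": 3e9,
              "training_flop": 1.5e23, "blended_price": 0.15},
    "medium": {"capability": 80.0, "active_params": 20e9,
               "training_flop": 2e24, "blended_price": 0.50},
    "large": {"capability": 90.0, "active_params": 90e9,
              "training_flop": 9e24, "blended_price": 10.00},
}
per_param = density(rows, "per_parameter")
print(per_param)
```

In this toy panel the least capable model leads on density, which is the same pattern the per-parameter leaderboards show for small models.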

Chained index methodology

The two index charts at the top of the article (Intelligence Density and Capability) show a single-line cumulative growth indexed to 100 in 2023, modeled on the S&P 500 cumulative-return format. We chain year-over-year growth rates across the benchmark transition rather than plotting absolute values.

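Step 1. Compute the yearly mean of the metric for each benchmark group:

$$ \bar{M}_{g,y} = \frac{1}{N_{g,y}} \sum_{m \in S_{g,y}} M_{m,g} $$

where $S_{g,y}$ is the set of models released in year y with a score in group g, $N_{g,y}$ is the size of that set, and $M_{m,g}$ is the metric for model m in group g.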

For the Intelligence Density Index, the metric is the row-wise average of the three density scores per model (Combined Density A or B). For the Capability Index, the metric is the panel-relative capability score (Capability_A or Capability_B).

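Step 2. For each year-to-year transition, compute growth within a single benchmark group. Both years must come from the same group:

$$ G_{y-1 \to y} = \frac{\bar{M}_{g,y}}{\bar{M}_{g,y-1}} $$

where $\bar{M}_{g,y}$ is the yearly mean of the metric for group g in year y.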

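Step 3. Chain multiplicatively from a 2023 baseline of 100:

$$ I_{2023} = 100, \qquad I_{y} = I_{y-1} \times G_{y-1 \to y} $$

where $G_{y-1 \to y}$ is the growth factor for the transition into year y.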

Group selection per transition:

  • 2023 – 2024: Standard Benchmarks (Advanced has 1 model in 2023, too thin to compute a stable mean)
  • 2024 – 2025: Advanced Benchmarks (Standard saturated in late 2024; Advanced has 16 models in 2024 and 27 in 2025)
  • 2025 – 2026: Advanced Benchmarks (Standard not reported for 2026)
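
Putting the three steps and the group selection together, a chained index could be computed roughly as below. This is a sketch under stated assumptions, not the article's pipeline: the row layout, the function names, and all metric values are hypothetical, and only the group chosen for each transition is taken from the list above.

```python
from statistics import fmean

# Benchmark group used for each year-to-year transition (from the list above).
GROUP_FOR_TRANSITION = {
    (2023, 2024): "standard",
    (2024, 2025): "advanced",
    (2025, 2026): "advanced",
}


def yearly_means(rows, group):
    """Step 1: rows are (release_year, group, metric) -> {year: mean metric}."""
    by_year = {}
    for year, row_group, value in rows:
        if row_group == group:
            by_year.setdefault(year, []).append(value)
    return {year: fmean(values) for year, values in by_year.items()}


def chained_index(rows, base_year=2023, base=100.0):
    """Steps 2 and 3: within-group growth, chained from the base year."""
    means = {g: yearly_means(rows, g) for g in set(GROUP_FOR_TRANSITION.values())}
    index = {base_year: base}
    for (prev, curr), group in sorted(GROUP_FOR_TRANSITION.items()):
        growth = means[group][curr] / means[group][prev]
        index[curr] = index[prev] * growth
    return index


# Made-up per-model metric values, one tuple per model and benchmark group.
rows = [
    (2023, "standard", 20.0), (2023, "standard", 30.0),
    (2024, "standard", 50.0),
    (2024, "advanced", 10.0), (2024, "advanced", 30.0),
    (2025, "advanced", 40.0),
    (2026, "advanced", 50.0),
]
index = chained_index(rows)
print(index)
```

Because each growth factor is a ratio within one group, the absolute levels of the Standard and Advanced metrics never need to be comparable; only their year-over-year changes are chained.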

The chained index assumes the underlying populations are comparable across years within a group. Our panel is heterogeneous (different models each year, with different benchmark coverage), so the chained values are best read as a directional summary rather than a precise multiplier.

Core evaluation principles for intelligence density measurement

  • Industry-standard benchmarks: We introduce no custom benchmark.
  • Primary-source sourcing: Every benchmark score is traceable to the model’s release paper, model card, or official announcement. Where the primary source did not report a benchmark, the cell is blank rather than imputed.
  • Coverage flags (N): For each model we report the number of benchmarks contributing to its capability score in each group. Models with N=1 receive a noisier score than models with N=5. The N=1 case applies to Qwen 3.6-35B-A3B (SWE-bench Verified only) and a few 2026 closed flagship releases.

Pricing sources for closed-weight models

Pricing is the most cohort-sensitive part of the index. We use the following hierarchy:

  1. Lab launch announcement (preferred): Price quoted in the lab’s launch post, API documentation, or contemporaneous press coverage.
  2. Artificial Analysis cross-provider median (fallback for weights-only models): Used for three Llama 3.1 rows (8B, 70B, 405B) and the GPT-OSS rows, because no first-party API exists.
  3. Weights-only without coverage: 22 of 69 rows have no per-dollar price recorded. They appear in capability and per-parameter charts but are absent from per-dollar charts.

We maintain two price tracks: launch price (first-party price on the day of release) and settled price (current first-party price, or the last before deprecation). Six models in the panel were repriced after launch, including:

  • Gemini 1.5 Pro: $3.50/$10.50 to $1.25/$5.00 (October 2024)
  • GPT-4o: $5.00/$15.00 to $2.50/$10.00 (October 2024)
  • Gemini 1.5 Flash: $0.35/$1.05 to $0.075/$0.30 (August 2024)
  • Claude 3.5 Haiku: $1.00/$5.00 to $0.80/$4.00 (December 2024)
  • o3: $10.00/$40.00 to $2.00/$8.00 (June 2025)
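
To see how large these repricings are, the percentage cut on each side of the price can be computed directly from the figures listed above. This is an illustrative sketch with hypothetical names; it reports input and output cuts separately rather than applying any blending.

```python
# (input, output) USD per million tokens: launch, then settled, as listed above.
REPRICED = {
    "Gemini 1.5 Pro": ((3.50, 10.50), (1.25, 5.00)),
    "GPT-4o": ((5.00, 15.00), (2.50, 10.00)),
    "Gemini 1.5 Flash": ((0.35, 1.05), (0.075, 0.30)),
    "Claude 3.5 Haiku": ((1.00, 5.00), (0.80, 4.00)),
    "o3": ((10.00, 40.00), (2.00, 8.00)),
}


def percent_cut(before, after):
    """Price reduction as a percentage of the original price."""
    return 100.0 * (before - after) / before


cuts = {
    model: (percent_cut(launch[0], settled[0]), percent_cut(launch[1], settled[1]))
    for model, (launch, settled) in REPRICED.items()
}
for model, (cut_in, cut_out) in cuts.items():
    print(f"{model}: input -{cut_in:.1f}%, output -{cut_out:.1f}%")
```

On these numbers the Gemini 1.5 Flash input price falls by about 78.6%, in line with the "roughly 78%" drop quoted earlier in the article.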

The headline per-dollar density uses settled pricing. For 37 of 43 priced rows, launch and settled prices are identical.

Update cadence

We refresh the index monthly. Each refresh adds new model rows for any frontier or notable release in the prior month, updates pricing where labs have adjusted their rates, backfills newly reported benchmarks for existing rows, and recalculates z-scores across the full panel. New model entries shift the mean and standard deviation of each benchmark, so existing 0–100 scores move with the panel. Historical snapshots are preserved.

Intelligence density assessment limitations

  • Lab-reported optimism: Benchmark scores come from the labs themselves. Independent re-runs typically come in 5 to 13 points lower. Trend lines are directionally informative but not point-precise.
  • Reasoning and non-reasoning modes not separated: Models released in 2025 and 2026 support extended-thinking toggles that can shift math and coding scores by 5 to 15 points. We use default-mode scores where identifiable.
  • Training FLOP figures are mostly estimates: Only a handful of rows have lab-confirmed FLOP. The remainder are extrapolated from disclosed parameters and training tokens, mostly via Epoch AI.
  • Closed-model parameter counts unknown: Per-parameter and per-FLOP density are blank for closed frontier rows. Closed models participate only in per-dollar density.
  • Panel-relative scores: Density scores are min-max rescaled inside the panel. A new model with a higher raw ratio shifts every existing model’s 0–100 score. To compare versions of the panel, use the underlying raw ratios rather than the rescaled scores.
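
The panel-relative caveat is easy to demonstrate: adding one model with a higher raw ratio changes every other model's rescaled score even though their raw ratios are untouched. The sketch below uses made-up raw ratios and hypothetical model names.

```python
def minmax_0_100(raw):
    """Min-max rescale raw ratios to a 0-100 panel-relative score."""
    lo, hi = min(raw.values()), max(raw.values())
    return {m: 100.0 * (v - lo) / (hi - lo) for m, v in raw.items()}


# Made-up raw capability-per-resource ratios for a hypothetical panel.
panel_v1 = {"model_a": 2.0, "model_b": 6.0, "model_c": 10.0}
# The same panel after a new model with a higher raw ratio is added.
panel_v2 = {**panel_v1, "model_d": 18.0}

before = minmax_0_100(panel_v1)
after = minmax_0_100(panel_v2)
print(before)
print(after)
```

In this toy case the former leader drops from 100 to 50 without any change to its raw ratio, which is why raw ratios are the safer basis for comparing panel versions.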

What is intelligence density?

Intelligence density is a new metric for how much knowledge and reasoning capability a model packs into a single unit of memory or compute. It assesses how effectively a model uses each parameter to generate value.

In this framework, the x axis often represents the number of model parameters while the y axis tracks performance on challenging tasks. High density occurs when models demonstrate superior capabilities without the massive computational cost associated with larger models. This suggests that intelligence does not depend solely on the amount of data or the scale of training, but also on how the software and algorithms organize that information.

Previously, the AI community relied on scaling model parameters to reach higher performance, which resulted in massive, resource-heavy systems. The shift in focus prioritizes algorithmic optimization over raw volume to get more intelligence from smaller models.

Why is it important to measure intelligence density?

Efficient systems must optimize inference time and memory usage. Identifying the point where more tokens or additional compute no longer yield significant gains helps the industry understand the upper bounds of current large language models. The major reasons to measure intelligence density can be summarized as:

  • The AI community underestimates the long term overhead of maintaining oversized models in data centers.
  • Value creation in practical applications depends on the ratio of output quality to the power required to produce it.
  • Moore’s Law for hardware cannot sustain the growth of language models without corresponding progress in how we optimize them.
  • Reinforcement learning and improved data distribution allow next-generation tools to reach benchmark scores once reserved for humans.

While larger models still matter for exploring new frontiers, the density of a system determines its actual utility in the real world. This context helps view intelligence as a concentrated outcome of training rather than a byproduct of sheer size.

Cite this research

Hazal Şimşek (2026) - "Intelligence Density of 69 LLMs: Smarter or More Efficient?". Published online at AIMultiple.com. Retrieved June 10, 2026, from: https://aimultiple.com/intelligence-density [Online Resource]

Şimşek, H. (2026, June 10). Intelligence Density of 69 LLMs: Smarter or More Efficient?. AIMultiple. https://aimultiple.com/intelligence-density

@misc{simsek2026,
  author = {Şimşek, Hazal},
  title  = {{Intelligence Density of 69 LLMs: Smarter or More Efficient?}},
  year   = {2026},
  month  = jun,
  howpublished    = {\url{https://aimultiple.com/intelligence-density}},
  note   = {AIMultiple. Retrieved June 10, 2026}
}
Hazal Şimşek
Industry Analyst
Hazal is an industry analyst at AIMultiple, focusing on process mining and IT automation.