200+ Leading AI Benchmarks

Sıla Ermut
updated on Jul 3, 2026

We curated a list of more than 200 AI benchmarks that are not yet saturated, covering LLMs, GPUs, cloud GPUs, AI agents, tabular AI, and cybersecurity.

List of AI benchmarks

| Benchmark | Category | Sub-category | Metric | Last measured (MM-YY) | Frequency | Performance | Price | Latency | Reliability | Contamination resistant | Contamination source |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BenchLM Weighted Score | LLM | Intelligence | Intelligence | 05-26 | Continuous | T | T | F | F | F | benchlm.ai/methodology |
| Humanity's Last Exam | LLM | Reasoning | Reasoning | 05-26 | Continuous | T | F | F | F | T | labs.scale.com/leaderboard/humanitys_last_exam |
| ARC-AGI-2 | LLM | Reasoning | Reasoning | 05-26 | Continuous | T | T | F | F | T | arcprize.org/guide/1 |
| SimpleBench | LLM | Reasoning | Reasoning | 05-26 | Per release | T | F | F | F | T | simple-bench.com |
| CritPt | LLM | Reasoning | Reasoning | 05-26 | Per release | T | F | F | F | T | artificialanalysis.ai |
| FrontierMath | LLM | Math | Math reasoning | 05-26 | Per release | T | F | F | F | T | epoch.ai/frontiermath |
| FrontierMath Tier 4 | LLM | Math | Math reasoning | 05-26 | Per release | T | F | F | F | T | epoch.ai |
| AIME 2025 | LLM | Math | Math | 04-26 | Per release | T | F | F | F | F | matharena.ai |
| AIME 2026 | LLM | Math | Math | 04-26 | Annual | T | F | F | F | T | matharena.ai |
| USAMO 2026 | LLM | Math | Math proof | 03-26 | Annual | T | F | F | F | T | matharena.ai |

Read our methodology to learn how we gathered this list.

Notes on how to read the list:

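The Performance, Price, Latency, and Reliability columns hold boolean flags (T = true, F = false) that indicate which evaluation dimension each benchmark covers. A fifth flag, Contamination resistant, describes data integrity rather than an evaluation dimension. Each flag answers a yes/no question about the benchmark: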

  • Performance (T/F): Does the benchmark evaluate capability or quality, such as output accuracy, task completion, or intelligence? This is marked T for almost all benchmarks, since most of them assess how well a model or system performs. It is marked F only for benchmarks that focus purely on cost or speed and do not evaluate output quality.
  • Price (T/F): Does the benchmark include cost-related factors, such as dollars per token, price per throughput, or cost per task? 
  • Latency (T/F): Does the benchmark measure speed, such as tokens per second, time to first token, throughput, or response time? It is F for benchmarks that assess correctness only, regardless of how long the response takes.
  • Reliability (T/F): Does the benchmark evaluate consistency or dependability, such as variance across runs, stability of success rates, or robustness? This is the least common flag. It is T for benchmarks designed for this purpose, including HAL Reliability, tau-bench/tau2-bench, METR Time Horizons, and several agent benchmarks in which pass-rate consistency is central. It is F for most leaderboards that report a single overall score.
  • Contamination resistant (T/F): Indicates whether the benchmark is designed to reduce the risk of data contamination, where test questions appear in a model’s training data, and the model achieves high scores through recall rather than genuine capability. T means the benchmark has a meaningful defense, such as a hidden holdout, newly generated or rotating questions, monthly refreshes, self-generating items, or competition problems released after a model’s training cutoff. F means the benchmark is a fixed public dataset that has been online for years and may have been absorbed into training corpora. In those cases, high scores should be interpreted with greater caution.

In practice, a row marked T/F/F/F (Performance/Price/Latency/Reliability) represents a pure quality benchmark. By contrast, a benchmark marked T/T/T/F evaluates quality, cost, and speed together. These flags provide a compact taxonomy showing which of the four evaluation axes each benchmark covers.
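The flag patterns described above can be decoded mechanically. The following is a minimal sketch, assuming the flags are written in the order Performance/Price/Latency/Reliability; the `Flags` class and its method names are hypothetical and are not part of the published dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Flags:
    """Hypothetical container for the four evaluation-axis flags of one row."""
    performance: bool
    price: bool
    latency: bool
    reliability: bool

    @classmethod
    def parse(cls, pattern: str) -> "Flags":
        """Parse a pattern such as 'T/F/F/F' (assumed column order)."""
        parts = pattern.strip().split("/")
        if len(parts) != 4 or any(p not in ("T", "F") for p in parts):
            raise ValueError(f"expected four T/F flags, got {pattern!r}")
        return cls(*(p == "T" for p in parts))

    def axes(self) -> List[str]:
        """Return the names of the axes marked T, in column order."""
        names = ("performance", "price", "latency", "reliability")
        return [name for name in names if getattr(self, name)]

    def is_pure_quality(self) -> bool:
        """True when only the Performance flag is set."""
        return self.axes() == ["performance"]


# Patterns taken from the list: most rows are T/F/F/F, ARC-AGI-2 is T/T/F/F.
print(Flags.parse("T/F/F/F").is_pure_quality())
print(Flags.parse("T/T/F/F").axes())
```

Keeping the flags as named booleans rather than a raw string makes it harder to confuse column order when filtering the list, for example to find every benchmark that also reports price.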

Why are some cells blank?

Contamination-resistant & Contamination source: These two fields are usually blank for the same types of rows, especially GPU, cloud GPU, speed, and pricing benchmarks. Contamination resistance is relevant to knowledge and reasoning benchmarks, in which a model might have memorized test questions from the training data. For hardware throughput, latency, or pricing benchmarks, there are no test questions to contaminate, so the field is left empty rather than marked T or F.
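Because a blank cell means "not applicable" rather than "false", anyone processing the list needs three states, not two. The sketch below shows one way to handle that, assuming cells are exported as the strings `T`, `F`, or empty; the function names are hypothetical.

```python
from typing import Optional


def parse_flag(cell: str) -> Optional[bool]:
    """Map a cell to True, False, or None (blank means not applicable)."""
    value = cell.strip().upper()
    if value == "T":
        return True
    if value == "F":
        return False
    if value == "":
        return None
    raise ValueError(f"unexpected flag value: {cell!r}")


def describe_contamination(cell: str) -> str:
    """Turn a 'Contamination resistant' cell into a readable label."""
    flag = parse_flag(cell)
    if flag is None:
        # Hardware, speed, and pricing rows have no test questions to leak.
        return "not applicable"
    return "resistant" if flag else "not resistant"


for cell in ("T", "F", ""):
    print(repr(cell), "->", describe_contamination(cell))
```

Treating blank as `None` instead of `False` avoids counting GPU or pricing benchmarks as contamination-prone when aggregating the column.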

AI benchmarks methodology

We collected the benchmark data through an online research and validation process. The objective was to build a structured list of technology benchmarks that remain useful for comparing current AI systems and infrastructure, covering LLMs, GPUs, cloud GPUs, AI agents, tabular AI, and cybersecurity.

We began by defining the scope of the dataset. We focused on benchmarks that measure model capability, infrastructure performance, cost, latency, reliability, or resistance to contamination. The initial source list included major benchmark and analysis providers such as Artificial Analysis, SemiAnalysis, Vals AI, LMArena, AIMultiple, and Epoch AI, as well as official benchmark websites, GitHub repositories, academic papers, leaderboard pages, and relevant third-party benchmark aggregators.

For each benchmark, we recorded both descriptive and evaluative fields. Descriptive fields capture what the benchmark is, what it measures, which products or models are evaluated, and how frequently it is updated. Evaluative fields classify whether the benchmark measures performance, price, latency, or reliability. We also collected information on benchmark structure and data integrity.

We prioritized primary sources wherever possible. These included official benchmark methodology pages, leaderboard pages, GitHub repositories, benchmark papers, and provider documentation. When a primary source did not provide a specific field, we used reputable secondary sources or aggregators to fill gaps, especially for top scores, current model coverage, and recent measurement dates. Source columns were included throughout the dataset so that the evidence for each value could be traced to its source.
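The primary-first rule with traceable sources can be illustrated in a few lines. This is only a sketch of the idea, not the tooling used for the dataset: the `SourcedValue` type, the `resolve_field` function, and all source names and values below are placeholders invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical shape: source name -> {field name -> recorded value}
SourceMap = Dict[str, Dict[str, str]]


@dataclass(frozen=True)
class SourcedValue:
    value: str
    source: str  # where the value came from, kept for traceability
    tier: str    # "primary" or "secondary"


def resolve_field(field: str, primary: SourceMap,
                  secondary: SourceMap) -> Optional[SourcedValue]:
    """Return the first non-empty value, checking primary sources first."""
    for tier, sources in (("primary", primary), ("secondary", secondary)):
        for source, fields in sources.items():
            value = fields.get(field, "").strip()
            if value:
                return SourcedValue(value, source, tier)
    return None  # leave the cell blank rather than guess


# Placeholder data for illustration only.
primary = {"official-leaderboard": {"update_frequency": "placeholder-frequency"}}
secondary = {"aggregator": {"update_frequency": "other-frequency",
                            "top_score": "placeholder-score"}}

print(resolve_field("update_frequency", primary, secondary))
print(resolve_field("top_score", primary, secondary))
```

Returning the source and tier alongside each value mirrors the role of the source columns: a reader can see whether a value came from a primary source or was filled in from an aggregator.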

Cite this research

Pick the format that matches where you're publishing. Pasting the link version into your CMS preserves the backlink.

Sıla Ermut (2026) - "200+ Leading AI Benchmarks". Published online at AIMultiple.com. Retrieved July 3, 2026, from: https://aimultiple.com/ai-benchmarks [Online Resource]

Ermut, S. (2026, July 3). 200+ Leading AI Benchmarks. AIMultiple. https://aimultiple.com/ai-benchmarks

@misc{ermut2026,
  author = {Ermut, Sıla},
  title  = {{200+ Leading AI Benchmarks}},
  year   = {2026},
  month  = jul,
  howpublished    = {\url{https://aimultiple.com/ai-benchmarks}},
  note   = {AIMultiple. Retrieved July 3, 2026}
}
Sıla Ermut
Industry Analyst
Sıla Ermut is an industry analyst at AIMultiple focused on email marketing and sales videos. She previously worked as a recruiter in project management and consulting firms. Sıla holds a Master of Science degree in Social Psychology and a Bachelor of Arts degree in International Relations.