200+ Leading AI Benchmarks

Sıla Ermut
Updated on July 3, 2026

We curated a list of more than 200 AI benchmarks that are not yet saturated, covering LLMs, GPUs, cloud GPUs, AI agents, tabular AI, and cybersecurity.

List of AI benchmarks

| Benchmark | Category | Sub-category | Metric | Last measured (MM-YY) | Update frequency | Performance | Price | Latency | Reliability | Contamination resistant | Contamination source |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BenchLM Weighted Score | LLM | Intelligence | Intelligence | 05-26 | Continuous | T | T | F | F | F | benchlm.ai/methodology |
| Humanity's Last Exam | LLM | Reasoning | Reasoning | 05-26 | Continuous | T | F | F | F | T | labs.scale.com/leaderboard/humanitys_last_exam |
| ARC-AGI-2 | LLM | Reasoning | Reasoning | 05-26 | Continuous | T | T | F | F | T | arcprize.org/guide/1 |
| SimpleBench | LLM | Reasoning | Reasoning | 05-26 | Per release | T | F | F | F | T | simple-bench.com |
| CritPt | LLM | Reasoning | Reasoning | 05-26 | Per release | T | F | F | F | T | artificialanalysis.ai |
| FrontierMath | LLM | Math | Math reasoning | 05-26 | Per release | T | F | F | F | T | epoch.ai/frontiermath |
| FrontierMath Tier 4 | LLM | Math | Math reasoning | 05-26 | Per release | T | F | F | F | T | epoch.ai |
| AIME 2025 | LLM | Math | Math | 04-26 | Per release | T | F | F | F | F | matharena.ai |
| AIME 2026 | LLM | Math | Math | 04-26 | Annual | T | F | F | F | T | matharena.ai |
| USAMO 2026 | LLM | Math | Math proof | 03-26 | Annual | T | F | F | F | T | matharena.ai |

Read our methodology to learn how we gathered this list.

Notes on how to read the list:

Five columns carry boolean flags (T = true, F = false). The first four indicate which evaluation dimensions a benchmark covers; the fifth indicates whether the benchmark is designed to resist data contamination. Each flag answers a yes/no question about the benchmark:

  • Performance (T/F): Does the benchmark evaluate capability or quality, such as output accuracy, task completion, or intelligence? This is marked T for almost all benchmarks, since most of them assess how well a model or system performs. It is marked F only for benchmarks that focus purely on cost or speed and do not evaluate output quality.
  • Price (T/F): Does the benchmark include cost-related factors, such as dollars per token, price per throughput, or cost per task? 
  • Latency (T/F): Does the benchmark measure speed, such as tokens per second, time to first token, throughput, or response time? It is F for benchmarks that assess correctness only, regardless of how long the response takes.
  • Reliability (T/F): Does the benchmark evaluate consistency or dependability, such as variance across runs, stability of success rates, or robustness? This is the least common flag. It is T for benchmarks designed for this purpose, including HAL Reliability, tau-bench/tau2-bench, METR Time Horizons, and several agent benchmarks in which pass-rate consistency is central. It is F for most leaderboards that report a single overall score.
  • Contamination resistant (T/F): Indicates whether the benchmark is designed to reduce the risk of data contamination, where test questions appear in a model’s training data, and the model achieves high scores through recall rather than genuine capability. T means the benchmark has a meaningful defense, such as a hidden holdout, newly generated or rotating questions, monthly refreshes, self-generating items, or competition problems released after a model’s training cutoff. F means the benchmark is a fixed public dataset that has been online for years and may have been absorbed into training corpora. In those cases, high scores should be interpreted with greater caution.

In practice, a row marked T/F/F/F (Performance/Price/Latency/Reliability) represents a pure quality benchmark. By contrast, a row marked T/T/T/F evaluates quality, cost, and speed together. These four flags provide a compact taxonomy showing which evaluation axes each benchmark covers; the contamination flag is read separately.
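As an illustration only, the four scope flags can be decoded in a few lines of Python. This is a minimal sketch, not part of the published dataset: the `ScopeFlags` class, its method names, and the slash-separated input format are assumptions made for the example, and the flag order is assumed to follow the table columns (Performance, Price, Latency, Reliability).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeFlags:
    """Hypothetical container for the four evaluation-axis flags of one row."""
    performance: bool
    price: bool
    latency: bool
    reliability: bool

    @classmethod
    def parse(cls, text: str) -> "ScopeFlags":
        # Expect exactly four slash-separated T/F values, e.g. "T/F/F/F".
        parts = text.strip().upper().split("/")
        if len(parts) != 4 or any(p not in ("T", "F") for p in parts):
            raise ValueError(f"expected four T/F flags, got {text!r}")
        return cls(*(p == "T" for p in parts))

    def axes(self) -> list[str]:
        # Names of the axes this benchmark covers, in column order.
        names = ("performance", "price", "latency", "reliability")
        return [name for name in names if getattr(self, name)]

    def is_pure_quality(self) -> bool:
        # A pure quality benchmark covers performance and nothing else.
        return self.axes() == ["performance"]


if __name__ == "__main__":
    print(ScopeFlags.parse("T/F/F/F").is_pure_quality())  # Humanity's Last Exam row
    print(ScopeFlags.parse("T/T/F/F").axes())             # ARC-AGI-2 row
```

With the rows shown in the list, Humanity's Last Exam (T/F/F/F) comes out as a pure quality benchmark, while ARC-AGI-2 (T/T/F/F) covers both performance and price.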

Why are some cells blank?

Contamination resistant and Contamination source: These two fields are usually blank for the same types of rows, especially GPU, cloud GPU, speed, and pricing benchmarks. Contamination resistance is relevant to knowledge and reasoning benchmarks, in which a model might have memorized test questions from its training data. For hardware throughput, latency, or pricing benchmarks, there are no test questions to contaminate, so the field is left empty rather than marked T or F.
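When loading the list programmatically, the blank cells mean the contamination flag is effectively three-valued: T, F, or not applicable. The sketch below shows one way to keep that distinction, assuming cells arrive as the strings "T", "F", or an empty string. The function names `parse_flag` and `needs_caution` are hypothetical and not part of the dataset.

```python
from typing import Optional


def parse_flag(cell: str) -> Optional[bool]:
    """Map a table cell to True, False, or None (blank = not applicable)."""
    value = cell.strip().upper()
    if value == "":
        return None  # e.g. GPU or pricing rows with no test questions
    if value == "T":
        return True
    if value == "F":
        return False
    raise ValueError(f"unexpected flag value: {cell!r}")


def needs_caution(contamination_resistant: Optional[bool]) -> bool:
    """True only for an explicit F, i.e. a fixed public dataset.

    A blank (None) is deliberately not treated as F: the question
    does not apply to hardware, latency, or pricing benchmarks.
    """
    return contamination_resistant is False


if __name__ == "__main__":
    for cell in ("T", "F", ""):
        flag = parse_flag(cell)
        print(repr(cell), flag, needs_caution(flag))
```

Treating a blank as `None` rather than `False` avoids flagging hardware or pricing rows as contamination-prone when the question does not apply to them.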

AI benchmarks methodology

We collected the benchmark data through an online research and validation process. The objective was to build a structured list of technology benchmarks that remain useful for comparing current AI systems and infrastructure, covering LLMs, GPUs, cloud GPUs, AI agents, tabular AI, and cybersecurity.

We began by defining the scope of the dataset. We focused on benchmarks that measure model capability, infrastructure performance, cost, latency, reliability, or resistance to contamination. The initial source list included major benchmark and analysis providers such as Artificial Analysis, SemiAnalysis, Vals AI, LMArena, AIMultiple, and Epoch AI, as well as official benchmark websites, GitHub repositories, academic papers, leaderboard pages, and relevant third-party benchmark aggregators.

For each benchmark, we recorded both descriptive and evaluative fields. Descriptive fields capture what the benchmark is, what it measures, which products or models are evaluated, and how frequently it is updated. Evaluative fields classify whether the benchmark measures performance, price, latency, or reliability. We also collected information on benchmark structure and data integrity.

We prioritized primary sources wherever possible. These included official benchmark methodology pages, leaderboard pages, GitHub repositories, benchmark papers, and provider documentation. When a primary source did not provide a specific field, we used reputable secondary sources or aggregators to fill gaps, especially for top scores, current model coverage, and recent measurement dates. Source columns were included throughout the dataset so that the evidence for each value could be traced to its source.
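The primary-first rule with a traceable source can be sketched as follows. This is an illustration under stated assumptions, not the tooling actually used: `SourcedValue`, `resolve_field`, and the example URLs and values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourcedValue:
    """A field value together with where it came from (hypothetical schema)."""
    value: str
    source: str
    is_primary: bool


def resolve_field(candidates: list[SourcedValue]) -> Optional[SourcedValue]:
    """Return the first non-empty primary value, else the first non-empty
    secondary value, else None when no source provides the field."""
    for want_primary in (True, False):
        for candidate in candidates:
            if candidate.is_primary == want_primary and candidate.value.strip():
                return candidate
    return None


if __name__ == "__main__":
    # Hypothetical inputs: the primary page leaves the field empty,
    # so the aggregator's value is used and its source is recorded.
    candidates = [
        SourcedValue("", "example.org/official-leaderboard", is_primary=True),
        SourcedValue("Per release", "example.org/aggregator", is_primary=False),
    ]
    chosen = resolve_field(candidates)
    print(chosen.value, "from", chosen.source)
```

Keeping the source alongside each value is what lets a reader trace any cell back to its evidence, including whether it came from a primary or a secondary source.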

Cite this research

Sıla Ermut (2026) - "200+ Leading AI Benchmarks". Published online at AIMultiple.com. Retrieved July 3, 2026, from: https://aimultiple.com/ai-benchmarks [Online resource]

Ermut, S. (2026, July 3). 200+ Leading AI Benchmarks. AIMultiple. https://aimultiple.com/ai-benchmarks

@misc{ermut2026,
  author = {Ermut, Sıla},
  title  = {{200+ Leading AI Benchmarks}},
  year   = {2026},
  month  = jul,
  howpublished    = {\url{https://aimultiple.com/ai-benchmarks}},
  note   = {AIMultiple. Accessed July 3, 2026}
}
Sıla Ermut
Industry Analyst
Sıla Ermut is an analyst at AIMultiple, specializing in email marketing and sales videos. Previously, she worked as a recruiter at consulting and project management firms. Sıla holds a master's degree in social psychology and a bachelor's degree in international relations.
