200+ Leading AI Benchmarks

updated on Jul 8, 2026

We curated a list of more than 200 AI benchmarks that are not yet saturated, covering LLMs, GPUs, cloud GPUs, AI agents, tabular AI, and cybersecurity.

AI benchmark growth

We found that benchmarking activity was fairly low and steady through 2024–2025, then rose at the beginning of 2026. This reflects the rapid growth of AI systems requiring evaluation, especially as models have become more capable in coding, reasoning, multimodal tasks, agent capabilities, and enterprise use cases.

AI benchmarks by categories

We grouped the AI benchmarks by their main category. LLM benchmarks form the largest category.

AI benchmarks by subcategories

List of AI benchmarks

Benchmark	Category	Sub-category	Metric	Last measured (MM-YY)	Update frequency	Performance	Price	Latency	Reliability	Contamination resistant	Contamination source
BenchLM Weighted Score	LLM	Intelligence	Intelligence	05-26	Continuous	T	T	F	F	F	benchlm.ai/methodology
Humanity's Last Exam	LLM	Reasoning	Reasoning	05-26	Continuous	T	F	F	F	T	labs.scale.com/leaderboard/humanitys_last_exam
ARC-AGI-2	LLM	Reasoning	Reasoning	05-26	Continuous	T	T	F	F	T	arcprize.org/guide/1
SimpleBench	LLM	Reasoning	Reasoning	05-26	Per release	T	F	F	F	T	simple-bench.com
CritPt	LLM	Reasoning	Reasoning	05-26	Per release	T	F	F	F	T	artificialanalysis.ai
FrontierMath	LLM	Math	Math reasoning	05-26	Per release	T	F	F	F	T	epoch.ai/frontiermath
FrontierMath Tier 4	LLM	Math	Math reasoning	05-26	Per release	T	F	F	F	T	epoch.ai
AIME 2025	LLM	Math	Math	04-26	Per release	T	F	F	F	F	matharena.ai
AIME 2026	LLM	Math	Math	04-26	Annual	T	F	F	F	T	matharena.ai
USAMO 2026	LLM	Math	Math proof	03-26	Annual	T	F	F	F	T	matharena.ai

Read our methodology to learn how we gathered this list.

Notes on how to read the list:

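Five columns carry boolean flags (T = true, F = false). The first four indicate which evaluation dimension each benchmark covers, and the fifth indicates whether the benchmark is designed to resist data contamination. Each flag answers a yes/no question about the benchmark’s scope: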

Performance (T/F): Does the benchmark evaluate capability or quality, such as output accuracy, task completion, or intelligence? This is marked T for almost all benchmarks, since most of them assess how well a model or system performs. It is marked F for benchmarks that focus purely on cost or speed and do not evaluate output quality.
Price (T/F): Does the benchmark include cost-related factors, such as dollars per token, price per throughput, or cost per task?
Latency (T/F): Does the benchmark measure speed, such as tokens per second, time to first token, throughput, or response time? It is F for benchmarks that assess correctness, regardless of how long the response takes.
Reliability (T/F): Does the benchmark evaluate consistency or dependability, such as variance across runs, stability of success rates, or robustness? This is the least common flag. It is T for benchmarks designed for this purpose, including HAL Reliability, tau-bench/tau2-bench, METR Time Horizons, and several agent benchmarks in which pass-rate consistency is central. It is F for most leaderboards that report a single overall score.
Contamination resistant (T/F): Indicates whether the benchmark is designed to reduce the risk of data contamination, where test questions appear in a model’s training data, and the model achieves high scores through recall rather than genuine capability. T means the benchmark has a meaningful defense, such as a hidden holdout, newly generated or rotating questions, monthly refreshes, self-generating items, or competition problems released after a model’s training cutoff. F means the benchmark is a fixed public dataset that has been online for years and may have been absorbed into training corpora. In those cases, high scores should be interpreted with greater caution.

In practice, a row marked T/F/F/F for performance, price, latency, and reliability represents a pure quality benchmark. By contrast, a row marked T/T/T/F evaluates quality, cost, and speed together. These four flags provide a compact taxonomy showing which of the four evaluation axes each benchmark covers; the contamination-resistant flag is recorded separately.
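As a rough illustration of how the four evaluation flags can be read programmatically, the sketch below decodes a slash-separated flag string into named axes. The function names and the axis order (performance, price, latency, reliability, taken from the table header) are assumptions made for this example, not part of the published dataset.

```python
from typing import Dict, List

# Axis order assumed from the table header: Performance, Price, Latency, Reliability.
AXES: List[str] = ["performance", "price", "latency", "reliability"]


def decode_axes(flags: str) -> Dict[str, bool]:
    """Map a flag string such as 'T/F/F/F' to a dict of axis name -> covered."""
    parts = [p.strip() for p in flags.split("/")]
    if len(parts) != len(AXES):
        raise ValueError(f"expected {len(AXES)} flags, got {len(parts)}")
    unknown = [p for p in parts if p not in ("T", "F")]
    if unknown:
        raise ValueError(f"flags must be 'T' or 'F', got {unknown}")
    return {axis: p == "T" for axis, p in zip(AXES, parts)}


def covered_axes(flags: str) -> List[str]:
    """Return the names of the axes a benchmark covers, in header order."""
    return [axis for axis, on in decode_axes(flags).items() if on]


print(covered_axes("T/F/F/F"))  # ['performance']
print(covered_axes("T/T/F/F"))  # ['performance', 'price']
```

The second call uses the flags shown for ARC-AGI-2 in the table, which covers performance and price but not latency or reliability.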

Why are some cells blank?

Contamination-resistant & Contamination source: These two fields are usually blank for the same types of rows, especially GPU, cloud GPU, speed, and pricing benchmarks. Contamination resistance is relevant to knowledge and reasoning benchmarks, in which a model might have memorized test questions from the training data. For hardware throughput, latency, or pricing benchmarks, there are no test questions to contaminate, so the field is left empty rather than marked T or F.
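Because a blank cell means "not applicable" rather than false, anyone loading the list may want to keep three states instead of two. The sketch below is one possible way to do that, assuming the list is exported as tab-separated text. The second sample row is a hypothetical placeholder for a hardware benchmark, not an entry from the list.

```python
import csv
import io
from typing import Dict, Optional


def parse_flag(cell: str) -> Optional[bool]:
    """Return True for 'T', False for 'F', and None for a blank cell."""
    cell = cell.strip()
    if cell == "":
        return None  # not applicable, e.g. a hardware or pricing benchmark
    if cell == "T":
        return True
    if cell == "F":
        return False
    raise ValueError(f"unexpected flag value: {cell!r}")


# Two sample rows: ARC-AGI-2 is taken from the table above; the second row is a
# hypothetical GPU benchmark included only to show a blank cell.
sample = (
    "Benchmark\tLatency\tContamination resistant\n"
    "ARC-AGI-2\tF\tT\n"
    "hypothetical-gpu-throughput\tT\t\n"
)

rows = csv.DictReader(io.StringIO(sample), delimiter="\t")
resistance: Dict[str, Optional[bool]] = {
    row["Benchmark"]: parse_flag(row["Contamination resistant"]) for row in rows
}

for name, flag in resistance.items():
    print(name, flag)
```

Keeping None distinct from False avoids counting hardware and pricing benchmarks as contamination-prone when filtering the list.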

AI benchmarks methodology

We collected the benchmark data through an online research and validation process. The objective was to build a structured list of technology benchmarks that remain useful for comparing current AI systems and infrastructure, covering LLMs, GPUs, cloud GPUs, AI agents, tabular AI, and cybersecurity.

We began by defining the scope of the dataset. We focused on benchmarks that measure model capability, infrastructure performance, cost, latency, reliability, or resistance to contamination. The initial source list included major benchmark and analysis providers such as Artificial Analysis, SemiAnalysis, Vals AI, LMArena, AIMultiple, and Epoch AI, as well as official benchmark websites, GitHub repositories, academic papers, leaderboard pages, and relevant third-party benchmark aggregators.

For each benchmark, we recorded both descriptive and evaluative fields. Descriptive fields capture what the benchmark is, what it measures, which products or models are evaluated, and how frequently it is updated. Evaluative fields classify whether the benchmark measures performance, price, latency, or reliability. We also collected information on benchmark structure and data integrity.

We prioritized primary sources wherever possible. These included official benchmark methodology pages, leaderboard pages, GitHub repositories, benchmark papers, and provider documentation. When a primary source did not provide a specific field, we used reputable secondary sources or aggregators to fill gaps, especially for top scores, current model coverage, and recent measurement dates. Source columns were included throughout the dataset so that the evidence for each value could be traced to its source.
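One way to picture the traceability requirement is to store every recorded value next to the source it came from and then list the fields that have no source. The sketch below is only an illustration under that assumption. The class, function, and field names are hypothetical and do not describe the actual dataset schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SourcedValue:
    """A recorded value paired with the source it was taken from."""
    value: Optional[str]
    source: Optional[str] = None


def untraced_fields(record: Dict[str, SourcedValue]) -> List[str]:
    """Return the names of fields that hold a value but cite no source."""
    return sorted(
        name
        for name, entry in record.items()
        if entry.value is not None and not entry.source
    )


# Values taken from the FrontierMath row above; the source for
# update_frequency is left out on purpose to show the check.
record = {
    "contamination_resistant": SourcedValue("T", "epoch.ai/frontiermath"),
    "update_frequency": SourcedValue("Per release"),
}

print(untraced_fields(record))  # ['update_frequency']
```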

Cite this research

Sıla Ermut (2026) - "200+ Leading AI Benchmarks". Published online at AIMultiple.com. Retrieved July 8, 2026, from: https://aimultiple.com/ai-benchmarks [Online Resource]

Ermut, S. (2026, July 8). 200+ Leading AI Benchmarks. AIMultiple. https://aimultiple.com/ai-benchmarks

@misc{ermut2026,
  author       = {Ermut, Sıla},
  title        = {{200+ Leading AI Benchmarks}},
  year         = {2026},
  month        = jul,
  howpublished = {\url{https://aimultiple.com/ai-benchmarks}},
  note         = {AIMultiple. Retrieved July 8, 2026}
}

Sıla Ermut

Industry Analyst

Sıla Ermut is an industry analyst at AIMultiple focused on email marketing and sales videos. She previously worked as a recruiter in project management and consulting firms. Sıla holds a Master of Science degree in Social Psychology and a Bachelor of Arts degree in International Relations.