We curated a list with over 200 AI benchmarks for LLMs, GPUs, cloud GPUs, AI agents, tabular AI, and cybersecurity that are not yet saturated.
List of AI benchmarks
| Benchmark | Category | Sub-category | Metric | Last measured | Update frequency | Performance | Price | Latency | Reliability | Contamination resistant | Contamination source |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BenchLM Weighted Score | LLM | Intelligence | Intelligence | 05-26 | Continuous | T | T | F | F | F | benchlm.ai/methodology |
| Humanity's Last Exam | LLM | Reasoning | Reasoning | 05-26 | Continuous | T | F | F | F | T | labs.scale.com/leaderboard/humanitys_last_exam |
| ARC-AGI-2 | LLM | Reasoning | Reasoning | 05-26 | Continuous | T | T | F | F | T | arcprize.org/guide/1 |
| SimpleBench | LLM | Reasoning | Reasoning | 05-26 | Per release | T | F | F | F | T | simple-bench.com |
| CritPt | LLM | Reasoning | Reasoning | 05-26 | Per release | T | F | F | F | T | artificialanalysis.ai |
| FrontierMath | LLM | Math | Math reasoning | 05-26 | Per release | T | F | F | F | T | epoch.ai/frontiermath |
| FrontierMath Tier 4 | LLM | Math | Math reasoning | 05-26 | Per release | T | F | F | F | T | epoch.ai |
| AIME 2025 | LLM | Math | Math | 04-26 | Per release | T | F | F | F | F | matharena.ai |
| AIME 2026 | LLM | Math | Math | 04-26 | Annual | T | F | F | F | T | matharena.ai |
| USAMO 2026 | LLM | Math | Math proof | 03-26 | Annual | T | F | F | F | T | matharena.ai |
Read our methodology to learn how we gathered this list.
Notes on how to read the list:
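Four columns carry boolean flags (T = true, F = false) that indicate which evaluation dimensions each benchmark covers: Performance, Price, Latency, and Reliability. A fifth flag, Contamination resistant, uses the same notation but describes data integrity rather than an evaluation dimension. Each flag answers a yes/no question about the benchmark: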
- Performance (T/F): Does the benchmark evaluate capability or quality, such as output accuracy, task completion, or intelligence? This is marked T for almost all benchmarks, since most of them assess how well a model or system performs. It is marked F only for benchmarks that focus purely on cost or speed and do not evaluate output quality.
- Price (T/F): Does the benchmark include cost-related factors, such as dollars per token, price per throughput, or cost per task?
- Latency (T/F): Does the benchmark measure speed, such as tokens per second, time to first token, throughput, or response time? It is F for benchmarks that assess correctness only, regardless of how long the response takes.
- Reliability (T/F): Does the benchmark evaluate consistency or dependability, such as variance across runs, stability of success rates, or robustness? This is the least common flag. It is T for benchmarks designed for this purpose, including HAL Reliability, tau-bench/tau2-bench, METR Time Horizons, and several agent benchmarks in which pass-rate consistency is central. It is F for most leaderboards that report a single overall score.
- Contamination resistant (T/F): Is the benchmark designed to reduce the risk of data contamination, in which test questions appear in a model’s training data and the model scores highly through recall rather than genuine capability? T means the benchmark has a meaningful defense, such as a hidden holdout, newly generated or rotating questions, monthly refreshes, self-generating items, or competition problems released after a model’s training cutoff. F means the benchmark is a fixed public dataset that has been online for years and may have been absorbed into training corpora. In those cases, high scores should be interpreted with greater caution.
In practice, a row marked T/F/F/F represents a pure quality benchmark. By contrast, a row marked T/T/T/F evaluates quality, cost, and speed together but not reliability. These flags provide a compact taxonomy showing which of the four evaluation axes (performance, price, latency, reliability) each benchmark covers.
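The flag shorthand can be decoded mechanically. The following is a minimal sketch, assuming the flags are written in the table's column order (performance, price, latency, reliability); the names `AXES`, `decode_flags`, and `covered_axes` are hypothetical helpers for illustration, not part of any published tooling.

```python
# Assumed axis order, matching the table's column order.
AXES = ("performance", "price", "latency", "reliability")


def decode_flags(flags: str) -> dict[str, bool]:
    """Map a 'T/F/F/F'-style string onto the four evaluation axes."""
    parts = flags.strip().upper().split("/")
    if len(parts) != len(AXES) or any(p not in ("T", "F") for p in parts):
        raise ValueError(f"expected {len(AXES)} T/F flags, got {flags!r}")
    return {axis: part == "T" for axis, part in zip(AXES, parts)}


def covered_axes(flags: str) -> list[str]:
    """Return the names of the axes flagged T, in column order."""
    return [axis for axis, on in decode_flags(flags).items() if on]


print(covered_axes("T/F/F/F"))  # ['performance']
print(covered_axes("T/T/T/F"))  # ['performance', 'price', 'latency']
```

Rejecting strings with the wrong number of flags is a deliberate choice in this sketch: it catches shorthand that accidentally includes or omits a column.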
Why are some cells blank?
Contamination-resistant & Contamination source: These two fields are usually blank for the same types of rows, especially GPU, cloud GPU, speed, and pricing benchmarks. Contamination resistance is relevant to knowledge and reasoning benchmarks, in which a model might have memorized test questions from the training data. For hardware throughput, latency, or pricing benchmarks, there are no test questions to contaminate, so the field is left empty rather than marked T or F.
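When loading the list programmatically, one way to preserve this distinction is a three-state value rather than a plain boolean. The sketch below assumes blank cells arrive as empty strings; `parse_contamination_cell` is a hypothetical helper name.

```python
from typing import Optional


def parse_contamination_cell(cell: str) -> Optional[bool]:
    """Return True/False for T/F, and None when the field does not apply."""
    value = cell.strip().upper()
    if value == "":
        return None  # blank: no test questions exist to contaminate
    if value == "T":
        return True
    if value == "F":
        return False
    raise ValueError(f"unexpected contamination flag: {cell!r}")


print(parse_contamination_cell("T"))  # True
print(parse_contamination_cell(""))   # None
```

Keeping `None` separate from `False` avoids counting a GPU throughput benchmark as "not contamination resistant" when filtering or aggregating.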
AI benchmarks methodology
We collected the benchmark data through an online research and validation process. The objective was to build a structured list of technology benchmarks that remain useful for comparing current AI systems and infrastructure, covering LLMs, GPUs, cloud GPUs, AI agents, tabular AI, and cybersecurity.
We began by defining the scope of the dataset. We focused on benchmarks that measure model capability, infrastructure performance, cost, latency, reliability, or resistance to contamination. The initial source list included major benchmark and analysis providers such as Artificial Analysis, SemiAnalysis, Vals AI, LMArena, AIMultiple, and Epoch AI, as well as official benchmark websites, GitHub repositories, academic papers, leaderboard pages, and relevant third-party benchmark aggregators.
For each benchmark, we recorded both descriptive and evaluative fields. Descriptive fields capture what the benchmark is, what it measures, which products or models are evaluated, and how frequently it is updated. Evaluative fields classify whether the benchmark measures performance, price, latency, or reliability. We also collected information on benchmark structure and data integrity.
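As an illustration of that split, the record for one benchmark could be modeled as shown here. This is a sketch only: the class and field names are assumptions that mirror the published table columns, not the actual schema of the dataset, and the two sample records are copied from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Descriptive:
    name: str
    category: str
    sub_category: str
    metric: str
    last_measured: str  # kept as the raw table string, e.g. "05-26"
    frequency: str


@dataclass(frozen=True)
class Evaluative:
    performance: bool
    price: bool
    latency: bool
    reliability: bool


@dataclass(frozen=True)
class BenchmarkRecord:
    descriptive: Descriptive
    evaluative: Evaluative
    contamination_resistant: Optional[bool]  # None when not applicable
    contamination_source: str


records = [
    BenchmarkRecord(
        Descriptive("ARC-AGI-2", "LLM", "Reasoning", "Reasoning", "05-26", "Continuous"),
        Evaluative(performance=True, price=True, latency=False, reliability=False),
        True,
        "arcprize.org/guide/1",
    ),
    BenchmarkRecord(
        Descriptive("AIME 2025", "LLM", "Math", "Math", "04-26", "Per release"),
        Evaluative(performance=True, price=False, latency=False, reliability=False),
        False,
        "matharena.ai",
    ),
]


def covering(items: list[BenchmarkRecord], axis: str) -> list[str]:
    """Names of the benchmarks whose evaluative flag for `axis` is set."""
    return [r.descriptive.name for r in items if getattr(r.evaluative, axis)]


print(covering(records, "price"))  # ['ARC-AGI-2']
```

Separating the two groups of fields makes queries such as "which benchmarks report price?" independent of the descriptive metadata.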
We prioritized primary sources wherever possible. These included official benchmark methodology pages, leaderboard pages, GitHub repositories, benchmark papers, and provider documentation. When a primary source did not provide a specific field, we used reputable secondary sources or aggregators to fill gaps, especially for top scores, current model coverage, and recent measurement dates. Source columns were included throughout the dataset so that the evidence for each value could be traced to its source.
Cite this research
@misc{ermut2026,
author = {Ermut, Sıla},
title = {{200+ Leading AI Benchmarks}},
year = {2026},
month = jul,
howpublished = {\url{https://aimultiple.com/ai-benchmarks}},
note = {AIMultiple. Accessed July 3, 2026}
}