Tabular Models Benchmark: Performance Across 19 Datasets 2026

Berk Kalelioğlu with Cem Dilmegani

updated on Jul 3, 2026

We benchmarked 8 tabular learning models on 19 real-world datasets covering roughly 260,000 samples, with dataset sizes from 435 to 48,800 rows. Every model ran on the same machine with 5-fold cross-validation and identical splits.

Tabular learning models benchmark results

Tabular model Elo ratings across 19 datasets

Each dataset is a round-robin of head-to-head matches between models, decided by the primary metric. Elo aggregates all 483 matches into a single rating per model; a 200-point gap means the higher-rated model wins about 3 out of 4 matches. Details are in the methodology.

TabFM, released by Google Research on June 30, 2026¹, won 15 of the 19 datasets. It sits 218 Elo points above the next model and 538 above the best gradient-boosted tree. No boosted tree won a single dataset.

Elo also handles partial coverage fairly: Mitra is rated only on matches it could play, meaning the 12 datasets within its 10,000-row training cap².

Dataset	Size	Size Bucket	Task	Type	Metric	LR	XGB	LGBM	CAT	TabPFN 3	TabICLv2	TabFM	Mitra
adult	48.8k	large	binary	hybrid	roc_auc	0.9067	0.9274	0.9295	0.9277	0.9183	0.9252	0.9318	N/A
amazon_employee_access	32.8k	large	binary	hybrid	roc_auc	0.8631	0.8555	0.8573	0.8621	0.8508	0.8556	0.8627	N/A
bank-marketing-10pct	4.5k	medium	binary	hybrid	roc_auc	0.8875	0.9018	0.9073	0.9069	0.929	0.9247	0.9334	0.9036
bank-marketing-full	45.2k	large	binary	hybrid	roc_auc	0.9066	0.932	0.9369	0.934	0.945	0.9423	0.9465	N/A
breast-w	700	small	binary	numeric	roc_auc	0.9942	0.9906	0.9927	0.9921	0.9937	0.9942	0.9941	0.9945
california_housing	20.6k	large	regression	numeric	rmse	68881	47254	47488	47885	39403	40406	37381	N/A
compas-two-years	5k	medium	binary	hybrid	roc_auc	0.7286	0.7105	0.7211	0.7341	0.7361	0.7369	0.7362	0.7343
credit-g	1k	small	binary	hybrid	roc_auc	0.7863	0.765	0.7668	0.7826	0.7977	0.7966	0.8088	0.7885
default_credit	30k	large	binary	hybrid	roc_auc	0.7227	0.762	0.7805	0.7733	0.7903	0.7907	0.7934	N/A
diabetes	769	small	binary	numeric	roc_auc	0.8304	0.7823	0.7973	0.8326	0.8391	0.8407	0.8424	0.8345

TabFM’s 15 wins include every large dataset except amazon_employee_access, among them adult (0.9318 ROC-AUC), electricity (0.9936), bank-marketing-full (0.9465), and house_sales (RMSE 97,704).
Its largest margin is vehicle, the one multiclass dataset: 0.9136 macro-F1 against XGBoost’s 0.7492.
The four non-TabFM wins: Logistic Regression on amazon_employee_access, TabICLv2 on compas-two-years, Mitra on breast-w, and a three-way foundation-model tie at a perfect 1.0000 on monks-problems-2.
Logistic Regression’s win is thinner than the table suggests: 0.8631 ROC-AUC against TabFM’s 0.8627 and CatBoost’s 0.8621, with fold-level standard deviations of 0.010 to 0.015. On extreme-cardinality categorical data, the top classical models and TabFM are statistically tied, and the classical case rests on cost rather than accuracy.

One boundary on the headline: TabFM’s lead over all four classical models is statistically significant (Friedman test, then Nemenyi critical difference; every gap clears it). Its lead over TabPFN 3 and TabICLv2 is not settled at 19 datasets: those gaps stay under the critical difference, and the Elo intervals overlap at the edge. Foundation models beating gradient boosting is the statistically backed claim; TabFM leading the foundation pack is a mean-rank result awaiting more datasets. Test details are in the methodology.

Latency of tabular models

GPU pricing uses RunPod’s on-demand B200 rate of $5.89 per GPU-hour as of July 2, 2026³. The classical models ran on CPU (62 cores); each foundation model ran on one B200.
TabFM averaged 173.6 seconds per fold (median 16.9). Its worst dataset, adult at 48,800 rows, took 893 seconds per fold. The full benchmark cost TabFM about $27 in GPU time; TabPFN 3 did the same work for $0.65 and LightGBM for effectively nothing on CPU.
Those numbers exclude TabFM’s cold start. Before its first prediction, TabFM’s JAX backend spends roughly 20 to 25 minutes compiling GPU kernels. A persistent compilation cache removes this on later runs; any fresh deployment pays it once.
TabFM also had the highest setup burden of the eight models. It requires a recent datacenter GPU, and on B200 hardware the default CUDA 12 stack triggers a documented cuBLAS bug that JAX flags as a silent data-corruption risk. One of our runs had to be discarded and repeated on the CUDA 13 stack because of it. No other model had a comparable failure mode.
TabFM moves the cost to inference: every prediction batch re-processes the training data as context. Served once per dataset, this is fine. For repeated scoring on large data, the per-call cost compounds where a trained XGBoost model predicts in microseconds.
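
The cost figures above follow from simple arithmetic on the reported timings. The sketch below is a back-of-envelope check, not the benchmark's own accounting: it assumes total GPU time equals the mean per-fold time multiplied by 5 folds and 19 datasets, and it takes TabPFN 3's 6.6 GPU-minutes total from the model section. The function and variable names are illustrative.

```python
# Back-of-envelope GPU cost from the figures quoted in this article.
# Assumption: total GPU time = mean seconds per fold x folds x datasets.

B200_USD_PER_HOUR = 5.89   # RunPod on-demand B200 rate quoted above
N_FOLDS = 5
N_DATASETS = 19


def gpu_cost_usd(total_seconds: float, usd_per_hour: float = B200_USD_PER_HOUR) -> float:
    """Convert GPU seconds into dollars at an hourly rate."""
    return total_seconds / 3600.0 * usd_per_hour


tabfm_seconds = 173.6 * N_FOLDS * N_DATASETS   # mean per-fold time over all folds
tabpfn3_seconds = 6.6 * 60                     # 6.6 GPU-minutes in total

tabfm_cost = gpu_cost_usd(tabfm_seconds)
tabpfn3_cost = gpu_cost_usd(tabpfn3_seconds)
compute_ratio = tabfm_seconds / tabpfn3_seconds

print(f"TabFM:    ${tabfm_cost:.2f}")
print(f"TabPFN 3: ${tabpfn3_cost:.2f}")
print(f"ratio:    {compute_ratio:.0f}x")
```

Under these assumptions the totals land at about $27 and $0.65, with a compute ratio of roughly 40x. The one-time compilation is left out here, as it is in the fold times.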

Results by dataset size and feature type

Small datasets (up to 1,000 rows): the four foundation models take the top four slots on both numeric and hybrid data. Mitra, built for small data, wins breast-w and ties monks-problems-2 at a perfect score. Logistic Regression holds up on numeric data (91.2% average ROC-AUC) and collapses on hybrid (77.5%).
Medium datasets (over 1,000 and up to 10,000 rows): foundation models lead on both feature types. The hybrid gap is wide: TabFM 85.3% average ROC-AUC against 83.4% for the best classical model, CatBoost.
Large datasets (over 10,000 rows): the old “boosting catches up at scale” pattern is gone. On large numeric data TabFM averages 99.4% against XGBoost’s 97.0%. On large hybrid data TabFM leads with 88.4% against LightGBM’s 87.6%.

Average rank by regime

The rank view adds two things the ROC-AUC averages hide.

TabFM holds the top average rank in all six regimes, including a perfect 8.0 on medium numeric data, where it placed first on every dataset. The chart plots inverted rank, so 8 is best and 1 is worst.
How close the race is depends on the regime. On small data the four foundation models sit within one rank point of each other, so small-data users can pick any of them and lose little accuracy. On large hybrid data the field splits: TabFM at 6.8, second place at 4.4. LightGBM’s third place there (4.2) is the best classical showing in any regime.

Model wins by regime

TabFM takes win points in all six regimes; no other model scores in more than two. The classical models’ single point in the entire grid is Logistic Regression’s win on amazon_employee_access. Mitra’s points all sit in the small-data row, consistent with its design.

Regime still matters, but for choosing how much compute to spend, no longer for choosing which model family wins.

Model-specific observations

TabFM (Google Research, June 2026)

Best mean rank (1.42), 15 outright wins, top model in every regime. Architecturally it is a hybrid: TabICL-style row-and-column attention over raw cells, row compression, then a TabPFN-style in-context transformer over the compressed rows, pretrained on hundreds of millions of synthetic tables. The trade-offs: about 40x TabPFN 3’s compute, a 20-plus-minute first-run compile, a 500-feature and 10-class cap, and datacenter-GPU-only deployment.

TabICLv2 (Inria SODA, February 2026)

Second-best mean rank (2.74) at 3.1 seconds per fold⁴. Adds regression support, which the original TabICL lacked. Completed all 19 datasets, including amazon_employee_access, whose 7,518-value categorical column overwhelms in-context models when it is one-hot encoded. The best accuracy per GPU-second in the benchmark.

TabPFN 3 (Prior Labs)

Third mean rank (3.00), completed everything, 6.6 GPU-minutes total⁵. Row-chunking lets it handle datasets that overflow other in-context models on the same hardware. SAP announced its acquisition of Prior Labs in May 2026⁶, which matters for enterprise buyers evaluating vendor stability.

Mitra (Amazon, via AutoGluon)

A small-data specialist by design: a hard 10,000-row training cap makes it N/A on 7 of 19 datasets. Within its envelope it is competitive: mean rank 4.50 on its 12 datasets and one outright win. Choose it when data is genuinely small and AutoGluon is already in the stack.

Gradient boosting: LightGBM, CatBoost, XGBoost

Still the deployment default, no longer the accuracy leader in any regime. LightGBM was the best classical model (mean rank 5.16). CatBoost remains the safest classical pick on categorical-heavy data. XGBoost finished last on mean rank (6.42) and showed one notable weakness: with native categorical splits on the high-cardinality employee_salaries regression, it scored RMSE 10,894 where LightGBM scored 4,367 on identical inputs. That is a one-dataset result, but worth checking before relying on XGBoost’s categorical mode.

The boosted trees’ case in 2026 is operational: sub-second training, microsecond inference, no GPU, no row caps, mature tooling.

Logistic Regression

Mean rank 6.21, yet it posted the top score on amazon_employee_access, in what is statistically a tie with TabFM and CatBoost. Sparse one-hot plus a linear model remains a strong, nearly free baseline on high-cardinality categorical data. Keep it in every benchmark.

Key takeaways

Foundation models beat gradient-boosted trees on accuracy, with statistical significance, across small, medium, and large tabular data. This holds only when each model receives the categorical input it is designed for; forcing one-hot encoding on in-context models cripples them on high-cardinality data.

TabFM leads on accuracy but costs roughly 40x the compute of TabPFN 3, and about 56x that of TabICLv2 by per-fold time (173.6 versus 3.1 seconds), and it carries the highest setup burden. TabICLv2 at 3 seconds per fold is the value pick. TabPFN 3 is the proven, enterprise-backed middle.

On high-cardinality categorical data, the strongest classical models and TabFM are statistically tied. The classical case is cost, simplicity, and deployment freedom.

Zero-shot means zero training, not zero compute. Score your workload’s prediction volume before choosing TabFM.

What are tabular foundation models?

Traditional tabular models train from scratch on each dataset. Tabular foundation models (TFMs) are pretrained once on large collections of tabular problems, mostly synthetic, and adapt to new datasets at inference time: the training rows go in as context, and predictions come out of a forward pass. No gradient training, no hyperparameter tuning.

TabPFN reframes tabular learning as amortized Bayesian inference: a transformer pretrained on millions of synthetic tasks conditions on the observed dataset to produce predictions⁷. TabPFN 3 raised the limits to 100,000 rows and 2,000 features with native categorical handling.

TabICL extends in-context learning to tables: rows become structured tokens, and the model infers decision rules from context examples⁸. TabICLv2 added regression support in February 2026.

TabFM combines the two approaches: alternating row and column attention over raw cells, compression of each row into a dense vector, then an in-context transformer over the compressed rows. It was trained on hundreds of millions of synthetic datasets generated from structural causal models.

Mitra follows the TabPFN line inside AutoGluon, tuned for small data.

Tabular Models Benchmark Methodology

Models (8): LogisticRegression, XGBoost, LightGBM, CatBoost, TabPFN 3, TabICLv2, TabFM, Mitra.

Datasets (19, OpenML): 15 binary classification, 1 multiclass, 3 regression; 435 to 48,800 rows. Known leakage columns removed.

Evaluation: 5-fold stratified cross-validation (plain CV for regression), seed 42, identical splits for every model. Primary metrics: ROC-AUC (binary), macro-F1 (multiclass), RMSE (regression). Default hyperparameters, no tuning. This favors tuning-free models: a tuned XGBoost would score higher, and tuning budgets are themselves a cost we did not model.
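
One way to get identical splits for every model is to compute the fold indices once per dataset and reuse them. The sketch below uses scikit-learn's `StratifiedKFold` and `KFold`. The `make_splits` helper is hypothetical, and `shuffle=True` is an assumption: the text gives the seed but does not say whether rows were shuffled.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

SEED = 42
N_SPLITS = 5


def make_splits(y, task):
    """Return a list of (train_idx, test_idx) pairs to reuse for every model."""
    y = np.asarray(y)
    if task == "regression":
        # Plain K-fold: a continuous target cannot be stratified directly.
        splitter = KFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)
    else:
        # Stratified folds keep the class balance of y in every fold.
        splitter = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)
    # split() only needs the number of rows from X, so a placeholder is enough.
    placeholder_X = np.zeros((len(y), 1))
    return list(splitter.split(placeholder_X, y))


y_toy = np.array([0] * 60 + [1] * 40)   # toy binary target, 60/40 class balance
splits = make_splits(y_toy, task="binary")

for fold, (train_idx, test_idx) in enumerate(splits):
    print(fold, len(train_idx), len(test_idx), int(y_toy[test_idx].sum()))
```

Materializing the splits as a list, rather than calling the splitter again per model, makes it harder for two models to end up with different folds by accident.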

Categorical handling, per model design: native category dtype for CatBoost, XGBoost, and LightGBM; raw features with categorical indices for TabPFN 3 and TabFM; ordinal encoding for TabICLv2; one-hot plus StandardScaler for Logistic Regression only. No model that supports native categoricals is forced through one-hot. The trade-off: cells differ in input representation by design. Forcing one representation on every model looks fairer but penalizes the models built for a different input.

Environment: one RunPod pod with 2x NVIDIA B200 (183 GB), driver 580, Ubuntu 24.04. Classical models on CPU (62 cores), each foundation model on one B200. PyTorch 2.8.0 cu128 for TabPFN 3, TabICLv2, and Mitra; JAX 0.10.1 with CUDA 13 wheels for TabFM.

Latency: per-fold fit-plus-predict wall time, recorded inside the harness. TabFM’s one-time compilation is excluded from fold times and reported separately.
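
Per-fold wall time of this kind can be captured with a monotonic clock around the fit-plus-predict step. The following is a minimal sketch. `time_fold` and `dummy_fit_predict` are hypothetical names, not the harness's actual functions.

```python
import time
from typing import Callable, Sequence, Tuple


def time_fold(fit_predict: Callable[[], Sequence[float]]) -> Tuple[Sequence[float], float]:
    """Run one fold's fit-plus-predict step and return (predictions, seconds)."""
    start = time.perf_counter()          # monotonic, high-resolution clock
    predictions = fit_predict()
    elapsed = time.perf_counter() - start
    return predictions, elapsed


def dummy_fit_predict():
    # Hypothetical stand-in for model.fit(X_train, y_train) followed by
    # model.predict(X_test); a real harness would call the model here.
    return [0.5] * 10


preds, seconds = time_fold(dummy_fit_predict)
print(f"{len(preds)} predictions in {seconds:.6f} s")
```

A warm-up call before the timed folds is one way to keep a one-time compilation out of the per-fold numbers. Whether the benchmark separated it this way is not stated.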

N/A policy: a cell is N/A only for a documented model limit, with the reason captured in logs. The only N/As in the final data are Mitra’s 7 datasets above its 10,000-row cap.

Rank metrics: within each dataset, models are ranked 1 (best) to k by the primary metric; a model’s mean rank averages these over the datasets it completed. The regime chart shows the inverted rank, k + 1 minus rank, so higher is better and 8 means first among eight. Win points award each dataset’s single point to the best score; exact ties split it equally, so a three-way tie pays 0.33 each.
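
The three rank quantities can be sketched in a few lines, using the breast-w row from the results table as input. One assumption is labeled in the code: tied scores share the average of their rank positions, which the text does not specify. The function names are illustrative.

```python
from typing import Dict


def rank_models(scores: Dict[str, float], higher_is_better: bool = True) -> Dict[str, float]:
    """Rank 1 (best) to k. Assumption: tied scores share their average position."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of models tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2.0   # average of positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def inverted_rank(rank: float, k: int) -> float:
    """k + 1 minus rank, so higher is better and k means first among k."""
    return k + 1 - rank


def win_points(scores: Dict[str, float], higher_is_better: bool = True) -> Dict[str, float]:
    """One point per dataset, split equally among exact ties for the best score."""
    best = max(scores.values()) if higher_is_better else min(scores.values())
    winners = [m for m, s in scores.items() if s == best]
    return {m: (1.0 / len(winners) if m in winners else 0.0) for m in scores}


# breast-w ROC-AUC values from the results table.
breast_w = {"LR": 0.9942, "XGB": 0.9906, "LGBM": 0.9927, "CAT": 0.9921,
            "TabPFN 3": 0.9937, "TabICLv2": 0.9942, "TabFM": 0.9941, "Mitra": 0.9945}

ranks = rank_models(breast_w)
points = win_points(breast_w)
print(ranks["Mitra"], inverted_rank(ranks["Mitra"], k=len(breast_w)), points["Mitra"])
```

For a dataset with an N/A cell, the missing model would be dropped from `scores` before ranking, so k is 7 rather than 8 there.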

Statistics: Friedman test plus Nemenyi critical difference over the 7 full-coverage models (Mitra excluded; ranks are recomputed within these 7, so they sit slightly below the 8-model mean ranks quoted in the model sections). The Friedman test gives chi-squared 72.8, p below 10^-12. The critical difference is CD = q × sqrt(k(k+1) / (6N)), with q = 2.949 at alpha 0.05 for k = 7 models and N = 19 datasets, giving CD = 2.07: two models differ significantly only if their mean ranks are more than 2.07 apart. TabFM’s gaps to the classical models run from 3.47 (LightGBM) to 4.58 (XGBoost), all above the threshold; its gaps to TabICLv2 (1.34) and TabPFN 3 (1.55) are below it. Mean-rank confidence intervals come from a 10,000-sample bootstrap over datasets.
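
The critical difference and the p-value bound can be checked with the standard library alone. The sketch below plugs the quoted values into the CD formula. For the p-value it assumes the usual chi-squared approximation with k − 1 = 6 degrees of freedom and uses the closed-form upper tail that holds for even degrees of freedom. The function names are illustrative.

```python
import math

Q_ALPHA_005_K7 = 2.949   # Nemenyi critical value quoted in the text (alpha 0.05, k = 7)
K_MODELS = 7
N_DATASETS = 19


def nemenyi_cd(q_alpha: float, k: int, n: int) -> float:
    """CD = q * sqrt(k(k+1) / (6N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper tail of a chi-squared distribution; closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


cd = nemenyi_cd(Q_ALPHA_005_K7, K_MODELS, N_DATASETS)
p_value = chi2_sf_even_df(72.8, df=K_MODELS - 1)

# Mean-rank gaps between TabFM and other models, as reported above.
gaps = {"LightGBM": 3.47, "XGBoost": 4.58, "TabICLv2": 1.34, "TabPFN 3": 1.55}
significant = {name: gap > cd for name, gap in gaps.items()}

print(f"CD = {cd:.2f}, p = {p_value:.2e}")
print(significant)
```

With these inputs the gaps to LightGBM and XGBoost clear the threshold, while the gaps to TabICLv2 and TabPFN 3 do not, matching the reading given above.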

Elo ratings: each dataset is a round-robin between all models that completed it, 483 pairwise matches in total. The model with the better primary metric wins a match (score 1); exact ties score 0.5 each.
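
The match list for one dataset can be built as all pairs among the models that completed it. A minimal sketch follows, using the adult row from the results table, where Mitra is N/A. The `dataset_matches` helper and the `Match` tuple layout are illustrative choices, not the benchmark's code.

```python
from itertools import combinations
from math import comb
from typing import Dict, List, Optional, Tuple

Match = Tuple[str, str, float]   # (model_a, model_b, score for model_a)


def dataset_matches(scores: Dict[str, Optional[float]],
                    higher_is_better: bool = True) -> List[Match]:
    """All pairwise matches among models that completed the dataset (None = N/A)."""
    played = {m: s for m, s in scores.items() if s is not None}
    matches = []
    for a, b in combinations(played, 2):
        if played[a] == played[b]:
            result = 0.5                      # exact tie
        elif (played[a] > played[b]) == higher_is_better:
            result = 1.0                      # model a has the better metric
        else:
            result = 0.0
        matches.append((a, b, result))
    return matches


# adult ROC-AUC values from the results table; Mitra is N/A on this dataset.
adult = {"LR": 0.9067, "XGB": 0.9274, "LGBM": 0.9295, "CAT": 0.9277,
         "TabPFN 3": 0.9183, "TabICLv2": 0.9252, "TabFM": 0.9318, "Mitra": None}

adult_matches = dataset_matches(adult)

# 12 datasets with all 8 models plus 7 datasets without Mitra.
total_matches = 12 * comb(8, 2) + 7 * comb(7, 2)
print(len(adult_matches), total_matches)
```

For RMSE datasets the same function would be called with `higher_is_better=False`.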

The formulas use four quantities. R(A) is model A’s current rating, the number the chart reports; every model starts at R = 1000. S(A) is the actual score model A took from a match: 1 for a win, 0.5 for a tie, 0 for a loss. E(A) is the expected score, the win probability the current ratings predict for A before the match is counted. K = 32 is the update step, which caps how much one match can move a rating.

Before a match between models A and B, the expected score is E(A) = 1 / (1 + 10^((R(B) − R(A)) / 400)): when the ratings are equal, it gives 0.5, and it approaches 1 as A’s rating pulls ahead. After the match, the rating updates by R(A) = R(A) + K × (S(A) − E(A)), so A gains when it does better than expected and loses points when it does worse; an upset win over a higher-rated model pays more than a routine win over a weaker one. The update is zero-sum, so the pool average stays at 1000, and a rating of 994 reads as an average model in this field. The 400 divisor sets the scale: at a 200-point gap the expected score is 1 / (1 + 10^(−0.5)) = 0.76, which is where “wins about 3 out of 4 matches” comes from.
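
The two formulas translate directly into code. The sketch below implements the expected score and the update with K = 32 as described. The function names are illustrative, and the opponent's update is written as the mirror image of A's, which is what the zero-sum property implies.

```python
K_FACTOR = 32
START_RATING = 1000.0


def expected_score(r_a: float, r_b: float) -> float:
    """E(A) = 1 / (1 + 10^((R(B) - R(A)) / 400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, s_a: float, k: float = K_FACTOR):
    """Return both new ratings after one match; s_a is 1, 0.5 or 0."""
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta   # zero-sum: B loses exactly what A gains


# Equal ratings give an expected score of 0.5; a 200-point lead gives about 0.76.
print(expected_score(1000, 1000), round(expected_score(1200, 1000), 2))

# An upset win pays more than a routine win.
upset_gain = update(1000, 1200, 1.0)[0] - 1000
routine_gain = update(1200, 1000, 1.0)[0] - 1200
print(round(upset_gain, 1), round(routine_gain, 1))
```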

A single Elo pass depends on match order, and the 19 datasets are themselves a sample. We therefore run 1,000 bootstrap rounds; each round resamples the 19 datasets with replacement, rebuilds the match list, shuffles the match order, and runs Elo from scratch. The published rating is the mean across rounds, and the 95% intervals are the 2.5th and 97.5th percentiles, covering both dataset sampling and ordering noise. Elo is the summary visualization; the Friedman test with the Nemenyi critical difference remains the significance arbiter.
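
The bootstrap procedure can be sketched on a toy input. The example uses three models and three ROC-AUC rows taken from the results table, so the resulting ratings are illustrative only and are not the published ones. The helper names, the seed, and the index-based percentile are assumptions of this sketch.

```python
import random
from itertools import combinations
from statistics import mean

K, START = 32, 1000.0


def run_elo(matches, models):
    """One Elo pass over an ordered match list, starting every model at 1000."""
    ratings = {m: START for m in models}
    for a, b, s_a in matches:
        e_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        delta = K * (s_a - e_a)
        ratings[a] += delta
        ratings[b] -= delta
    return ratings


def matches_for(scores):
    """Round-robin for one dataset; assumes higher is better and no N/A entries."""
    return [(a, b, 1.0 if scores[a] > scores[b] else 0.5 if scores[a] == scores[b] else 0.0)
            for a, b in combinations(scores, 2)]


def bootstrap_elo(datasets, models, rounds=1000, seed=42):
    rng = random.Random(seed)
    samples = {m: [] for m in models}
    for _ in range(rounds):
        resampled = rng.choices(datasets, k=len(datasets))   # with replacement
        matches = [m for d in resampled for m in matches_for(d)]
        rng.shuffle(matches)                                  # randomize match order
        for model, rating in run_elo(matches, models).items():
            samples[model].append(rating)
    summary = {}
    for model, values in samples.items():
        values.sort()
        lo = values[int(0.025 * (len(values) - 1))]   # simple index-based percentile
        hi = values[int(0.975 * (len(values) - 1))]
        summary[model] = (mean(values), lo, hi)
    return summary


# ROC-AUC for credit-g, diabetes and bank-marketing-10pct (from the results table).
toy = [{"LR": 0.7863, "XGB": 0.7650, "TabFM": 0.8088},
       {"LR": 0.8304, "XGB": 0.7823, "TabFM": 0.8424},
       {"LR": 0.8875, "XGB": 0.9018, "TabFM": 0.9334}]
summary = bootstrap_elo(toy, models=["LR", "XGB", "TabFM"])
for model, (avg, lo, hi) in summary.items():
    print(f"{model}: {avg:.0f} [{lo:.0f}, {hi:.0f}]")
```

Because every update is zero-sum, the mean ratings still average 1000 across the pool after bootstrapping.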

Reproducibility: fixed seed 42, identical splits, per-fold results and all merge, statistics, and Elo scripts in the benchmark repository.

Cite this benchmark

Pick the format that matches where you're publishing. Pasting the link version into your CMS preserves the backlink.

Berk Kalelioğlu and Cem Dilmegani (2026) - "Tabular Models Benchmark: Performance Across 19 Datasets 2026". Published online at AIMultiple.com. Retrieved July 3, 2026, from: https://aimultiple.com/tabular-models [Online Resource]

Kalelioğlu, B., & Dilmegani, C. (2026, July 3). Tabular Models Benchmark: Performance Across 19 Datasets 2026. AIMultiple. https://aimultiple.com/tabular-models

@misc{kalelioglu2026,
  author = {Kalelioğlu, Berk and Dilmegani, Cem},
  title  = {{Tabular Models Benchmark: Performance Across 19 Datasets 2026}},
  year   = {2026},
  month  = jul,
  howpublished    = {\url{https://aimultiple.com/tabular-models}},
  note   = {AIMultiple. Retrieved July 3, 2026}
}

Download all data

Results and timestamps of 16 data points. Download the data used in this article as a ZIP file containing 2 CSV files and a README.

Last updated: July 8, 2026

Reference Links

¹ Introducing TabFM: A zero-shot foundation model for tabular data

² AutoGluon Tabular - Foundational Models - AutoGluon 1.5.0 documentation

³ Pricing | Runpod

⁴ [2602.11139] TabICLv2: A better, faster, scalable, and open tabular foundation model

⁵ [2207.01848] TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

⁶ SAP to Acquire Prior Labs | SAP News Center

⁷ [2207.01848] TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

⁸ [2502.05564] TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

We follow ethical norms & our process for objectivity. This research does not feature any customers of AIMultiple.