Tabular Models Benchmark: Performance Across 19 Datasets 2026

Cem Dilmegani
updated on May 22, 2026

We benchmarked 7 widely used tabular learning models across 19 real-world datasets of varying sizes and structures to identify the top-performing model families. Together, the datasets cover ~260,000 samples and more than 250 features, with individual datasets ranging from 435 to nearly 49,000 rows.

Tabular learning models benchmark results

In the chart, the winning model receives 1 point. In case of a draw, the point is shared equally among the tied models. Win rate measures how often a model finishes first within a given regime, providing a stricter view of dominance than average rank.
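
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the function names, the tie tolerance `tol`, and the example scores are all assumptions made up for this sketch.

```python
from collections import defaultdict


def win_points(scores_by_dataset, tol=1e-9):
    """Award 1 point per dataset, split equally among models tied for the best score."""
    points = defaultdict(float)
    for scores in scores_by_dataset.values():
        best = max(scores.values())
        # Every model within `tol` of the best score counts as a winner (assumed tie rule).
        winners = [m for m, s in scores.items() if abs(s - best) <= tol]
        for model in winners:
            points[model] += 1.0 / len(winners)
    return dict(points)


def win_rate(points, n_datasets):
    """Share of the available points each model collected within a regime."""
    return {model: p / n_datasets for model, p in points.items()}


# Made-up scores for illustration only; these are not benchmark results.
example = {
    "dataset_a": {"model_x": 0.91, "model_y": 0.91, "model_z": 0.88},
    "dataset_b": {"model_x": 0.80, "model_y": 0.85, "model_z": 0.79},
}
points = win_points(example)
rates = win_rate(points, len(example))
print(points, rates)
```

Here `model_x` and `model_y` tie on the first dataset and share the point, while `model_y` takes the second dataset outright.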

Different models win under different structural conditions, and the success rate varies with dataset size and feature composition.

In particular:

  • Foundation models are the most successful when the data is limited
  • TabPFN 3 wins both of the large + numeric datasets, outperforming XGBoost and TabICL
  • On large + hybrid datasets:
    • TabPFN 3 takes 3 of 5 datasets, with Logistic Regression (amazon_employee_access) and LightGBM (adult) winning the others
    • Score gaps between the top four models remain narrow (within 0.5 points of ROC-AUC), so model choice still depends on data characteristics

Disclaimer: Feature types are categorized as numeric or hybrid based on the dominant input representation after preprocessing.

How to interpret the dataset mix:

  • Size buckets are small (fewer than 1,000 rows), medium (1,000 to 10,000 rows), and large (more than 10,000 rows); the largest datasets exceed 40,000 rows.
  • Task types include binary classification, multiclass classification, and regression.
  • Feature types reflect practical enterprise data:
    • Numeric: primarily continuous or ordinal variables
    • Hybrid: a mix of numeric and categorical features
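
As a rough sketch, a dataset can be assigned to a regime (size bucket × feature type) like this. The bucket thresholds follow the section headings of this article (<1K, 1K–10K, >10K rows). Two things are assumptions of this sketch: a dataset with exactly 10,000 rows is treated as medium, and any dataset with at least one categorical column is treated as hybrid. The benchmark categorizes feature types by the dominant representation after preprocessing, so its exact rule may differ.

```python
def size_bucket(n_rows):
    """Map a row count to the size buckets used in the article's headings."""
    if n_rows < 1_000:
        return "small"
    if n_rows <= 10_000:  # boundary handling at exactly 10,000 is an assumption
        return "medium"
    return "large"


def feature_type(n_numeric, n_categorical):
    """Assumed rule: any categorical column makes the dataset hybrid."""
    return "hybrid" if n_categorical > 0 else "numeric"


def regime(n_rows, n_numeric, n_categorical):
    """Combine both labels into a (size bucket, feature type) pair."""
    return size_bucket(n_rows), feature_type(n_numeric, n_categorical)


# Hypothetical dataset shapes, chosen only to exercise each branch.
print(regime(435, 16, 0))     # small, numeric
print(regime(5_000, 4, 6))    # medium, hybrid
print(regime(20_000, 3, 5))   # large, hybrid
```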

This variation makes the benchmark well-suited for understanding which model families perform reliably under different conditions.

You can see our methodology below.

High-level results by dataset size and feature type

This section summarizes how models behave across dataset size buckets and feature types, rather than focusing on individual dataset scores.

For each dataset size bucket, the chart reports the average ROC-AUC achieved by each model, separately for numeric and hybrid datasets.
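
The aggregation described above amounts to a group-by-and-average over per-dataset scores. The following sketch assumes a hypothetical list of result records with `bucket`, `feature_type`, `model`, and `roc_auc` fields; the field names and values are invented for illustration and are not benchmark results.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_regime(rows):
    """Average ROC-AUC per (size bucket, feature type, model)."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["bucket"], row["feature_type"], row["model"])
        grouped[key].append(row["roc_auc"])
    return {key: mean(values) for key, values in grouped.items()}


# Made-up records for illustration only.
results = [
    {"bucket": "small", "feature_type": "numeric", "model": "model_x", "roc_auc": 0.90},
    {"bucket": "small", "feature_type": "numeric", "model": "model_x", "roc_auc": 0.80},
    {"bucket": "small", "feature_type": "hybrid", "model": "model_x", "roc_auc": 0.70},
]
averages = mean_score_by_regime(results)
for key, value in sorted(averages.items()):
    print(key, round(value, 3))
```

Averaging within a regime hides per-dataset spread, which is why the article reports win rate and average rank alongside these means.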

Small datasets (<1K rows)

On small datasets, foundation-style tabular models are the most successful.

  • TabPFN 3 and TabICL, the leading tabular foundation models (TFMs), achieve the strongest performance on both numeric and hybrid datasets.
  • The performance gap is especially pronounced on hybrid datasets
  • Logistic regression performs competitively on numeric data, but degrades sharply on hybrid data

When data is scarce, models with strong inductive bias outperform both boosting and neural baselines. In this regime, prior knowledge and learned feature interactions matter more than model capacity.

Medium datasets (1K–10K rows)

On medium-sized datasets, overall performance improves, but structural differences remain.

  • Most models perform strongly on numeric datasets (often exceeding 97% ROC-AUC)
  • Hybrid datasets remain more challenging.
  • The TFMs, TabPFN 3 and TabICL, continue to lead, but their gap to gradient boosting narrows.

Medium-sized datasets represent a transition regime: signal density increases, but inductive bias still provides a measurable advantage, particularly on mixed feature types.

Large datasets (>10K rows)

At scale, performance patterns shift.

  • On large numeric datasets, TabPFN 3 leads, followed by XGBoost and TabICL. TabPFN 3 also wins the california_housing regression task, which is reported in the per-dataset table rather than in this chart.
  • On large + hybrid datasets, performance converges:
    • Differences are smaller, and model choice becomes less obvious

At scale, TabPFN 3 closes the gap that previously favored gradient boosting on numeric data and extends its lead on hybrid data. The one regime where boosting and linear baselines still win is high-cardinality pure-categorical data, as seen on amazon_employee_access.

Average rank by regime

Models are ranked within each regime (dataset size × feature type).
Ranks are normalized so that higher values indicate stronger relative performance, making cross-regime comparisons easier.
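
One way to compute such ranks is sketched below. The article does not state its normalization formula, so this sketch assumes a simple inversion: with n models, the normalized value is n - rank + 1, which gives the best model a value of n and the worst a value of 1. Tied models share the mean of their positions. Function names and scores are hypothetical.

```python
def average_ranks(scores):
    """Rank models by score (1 = best); tied models share the mean of their positions."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover every model tied with the one at position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of the 1-based positions i+1..j+1
        for model, _ in ordered[i:j + 1]:
            ranks[model] = shared
        i = j + 1
    return ranks


def normalized_ranks(scores):
    """Assumed normalization: invert ranks so that higher means better."""
    n = len(scores)
    return {model: n - rank + 1 for model, rank in average_ranks(scores).items()}


# Made-up scores for illustration only.
example_scores = {"model_a": 0.90, "model_b": 0.90, "model_c": 0.80}
print(average_ranks(example_scores))
print(normalized_ranks(example_scores))
```

Per-regime values would then be obtained by averaging these normalized ranks over the datasets in each regime.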

Small datasets

On small datasets, foundation-style models dominate the rankings.

  • TabPFN 3 and TabICL rank first on both the numeric and hybrid datasets, often tied
  • Gradient boosting models consistently rank near the bottom
  • The gap between foundation models and boosting is larger on hybrid data

Average rank highlights the same pattern observed in raw performance: when data is scarce, learned priors and inductive bias outweigh scale-driven optimization.

Medium datasets

On medium-sized datasets, rankings begin to shift.

  • TabPFN 3 and TabICL remain top-ranked across both feature types, with TabPFN 3 holding a small lead
  • CatBoost emerges as a strong third option on hybrid datasets
  • Boosting models improve their relative position compared to the small-data regime

This regime reflects a balance point. Data volume increases, but feature interactions still reward models with stronger inductive bias.

Large datasets

On large datasets, dominance becomes regime-specific.

  • Large + numeric:
    • TabPFN 3 ranks first, with XGBoost and TabICL behind.
  • Large + hybrid:
    • TabPFN 3 takes the top average rank but only by a small margin
    • LightGBM, TabICL, and CatBoost follow within 1 rank point of each other

Average rank shows TabPFN 3 leading in every regime, though gaps narrow on large hybrid data where multiple models cluster within a rank point.
Strong overall rankings often mask sharp performance differences across regimes.

Model-specific observations

This section summarizes where each model class performs well and where it struggles, based on the full set of results.

Tabular foundation models (TFMs): TabPFN 3 and TabICL

Strengths

  • Consistently top-performing on small and medium datasets
  • Particularly strong on hybrid datasets, where categorical structure matters
  • High win rates on small datasets

Limitations

  • Both TFMs cap the number of training rows, so datasets above the cap must be subsampled, whereas gradient boosting can still train on the full data
  • TFMs typically cap at 2,000 features or fewer, which can be limiting on very wide tables, even with native categorical handling
  • TabICL does not support regression, so it cannot be scored on regression datasets

TFMs now cover most regimes well. TabPFN 3 in particular performs strongly across small, medium, and large datasets, with the main remaining weak spot being high-cardinality, purely categorical data.

Gradient boosting models: XGBoost and LightGBM

Strengths

  • Competitive on large datasets
  • Strong and stable performance as data volume increases
  • Remain competitive on hybrid data at scale

Limitations

  • Underperform foundation models on smaller datasets
  • Require careful preprocessing and tuning for categorical-heavy data

Gradient boosting remains a strong baseline across regimes and the practical default for production settings where TFM constraints apply, including license restrictions, regression support gaps, or row/feature caps.

CatBoost

Strengths

  • Among non-foundation models, generally strong on medium and large hybrid datasets
  • Native categorical handling provides consistent gains
  • Rarely performs poorly across regimes

Limitations

  • Rarely the top performer
  • Less dominant on purely numeric datasets

CatBoost is the safest non-foundation choice when categorical features dominate. On high-cardinality, purely categorical data, both Logistic Regression and CatBoost outperform TabPFN 3, with Logistic Regression slightly ahead.

RealMLP

Observations

  • Rarely wins across regimes
  • Often ranks near the bottom, except on a small number of datasets

Generic neural MLPs struggle on tabular data without strong inductive bias, reinforcing a long-standing lesson in applied machine learning. 1

Logistic regression (baseline)

Observations

  • Competitive on small numeric datasets; falls behind on medium and large numeric data
  • Occasionally wins or ranks highly on hybrid datasets
  • Performance degrades sharply when feature interactions dominate

Despite its simplicity, logistic regression remains a meaningful baseline and should not be skipped in tabular benchmarks.

Key takeaways of the tabular learning models benchmark

Across 19 datasets, TabPFN 3 holds the top average rank in every regime we tested: small, medium, and large datasets, on both numeric and hybrid data.

The exception is high-cardinality categorical data, where Logistic Regression and CatBoost still beat TabPFN 3.

For teams choosing a tabular model, TabPFN 3 is now the practical default for most datasets. Gradient boosting stays the strong baseline when the dataset is too large or too wide for TabPFN 3, or when the data structure favors models like CatBoost.

Conceptual foundations of foundation-style tabular models

Foundation-style tabular models aim to generalize across diverse tabular datasets by learning strong priors over table structure, feature interactions, and task behavior, rather than optimizing for a single dataset.

Unlike traditional tabular models, which are trained independently for each dataset, foundation-style approaches are pretrained on large collections of tabular problems and then applied to new datasets through inference-time adaptation.

In this benchmark, TabPFN 3 and TabICL represent two prominent approaches within this paradigm.

Key capabilities of foundation-style tabular models

Foundation-style tabular models typically exhibit the following capabilities:

  • Strong inductive bias: By learning common patterns across many tabular datasets, these models encode assumptions about feature interactions, target distributions, and noise characteristics that generalize well to unseen problems.
  • Unified handling of feature types: Numeric and categorical features are embedded into a shared representation space, allowing the model to reason over mixed-feature tables without extensive manual preprocessing.
  • Inference-time adaptation: Rather than retraining, these models adapt to new datasets using context examples or dataset-level statistics, enabling strong performance under data scarcity.
  • Transfer across tasks: A single pretrained model can perform classification or regression on previously unseen datasets, often with minimal configuration.

These properties give foundation-style models a clear advantage on small and medium-sized datasets, where classical methods do not have enough data to estimate complex feature interactions. Recent releases like TabPFN 3 extend that strength to larger datasets too, through higher row and feature limits and native categorical handling.

TabPFN: Prior-data fitting for tabular prediction

TabPFN (Tabular Prior-Data Fitted Network) reframes tabular learning as a Bayesian inference problem.

Instead of learning parameters for a single dataset, TabPFN is trained on millions of synthetic tabular tasks sampled from a distribution of data-generating processes. During inference, the model effectively performs amortized Bayesian inference, conditioning on the observed dataset to produce predictions.
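
In the notation commonly used for prior-data fitted networks, this idea can be written as follows. The symbols are assumptions introduced here rather than taken from this article: D is the observed dataset, x a test input, y its target, φ a latent data-generating process drawn from the prior, and q_θ the transformer with weights θ. The sketch describes the general approach, not TabPFN 3's specific training setup.

```latex
% Posterior predictive distribution for a test input x given the observed dataset D
p(y \mid x, D) = \int p(y \mid x, \phi)\, p(\phi \mid D)\, \mathrm{d}\phi

% Pretraining objective: datasets are sampled from the prior over generating processes,
% and the network q_theta learns to approximate the posterior predictive directly
\mathcal{L}(\theta) = \mathbb{E}_{D \cup \{(x, y)\} \sim p(\mathcal{D})}
    \left[ -\log q_\theta(y \mid x, D) \right]
```

Minimizing this loss over many sampled datasets pushes q_θ(y | x, D) toward p(y | x, D), so a single forward pass at inference time stands in for the integral over φ. That is the sense in which the inference is amortized.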

Key characteristics of TabPFN include:

  • A transformer architecture that processes entire datasets as context.
  • Training on a wide distribution of synthetic tasks to encode general-purpose priors.
  • Strong performance in low-data regimes without hyperparameter tuning.2

In practice, this design enables TabPFN 3 to outperform traditional boosting methods across small, medium, and large datasets in the benchmark.

TabPFN 3 extends the prior-fitted network approach to handle up to 100,000 training rows and to ingest categorical features natively, two changes that close most of the historical gap between TFMs and gradient boosting at scale.

SAP announced its acquisition of Prior Labs, the research group behind TabPFN, in May 2026 and committed more than €1 billion over four years to operate it as an independent AI research lab.3

TabICL: In-context learning for tabular data

TabICL extends the idea of in-context learning to tabular prediction.

Instead of fitting model parameters, TabICL conditions on examples from the dataset provided directly in the input context. The model learns to infer decision rules from these examples, similar to how large language models perform few-shot learning.

Key aspects of TabICL include:

  • Dataset rows encoded as structured tokens
  • Task adaptation through context examples rather than gradient-based training
  • A single pretrained model capable of handling diverse tabular tasks4

This in-context approach allows TabICL to achieve strong performance on hybrid datasets, especially when feature interactions are complex and labeled data is limited.

TabICL works best on small datasets. On large numeric datasets, it falls behind TabPFN 3 and XGBoost.

Where foundation-style models still lose

The earlier pattern was that foundation models excelled on small data and gradient boosting dominated at scale. TabPFN 3 narrows this gap and now wins or leads on large datasets as well.

The main regime where non-foundation models still win is high-cardinality pure-categorical data, where Logistic Regression and CatBoost outperform TabPFN 3. Teams with such datasets should benchmark gradient boosting and linear baselines alongside foundation models rather than defaulting to a single approach.

Methodology of tabular learning models benchmark

We benchmark 7 ML models on 19 tabular datasets using 5-fold cross-validation (stratified for classification tasks).

Environment: RunPod Cloud Container (Ubuntu 24.04).

Software: CUDA 12.8.1, PyTorch 2.8.0

Models:

  • LogisticRegression – Linear baseline
  • XGBoost – Gradient boosting
  • LightGBM – Gradient boosting
  • CatBoost – Gradient boosting with native categorical support
  • RealMLP – Deep learning (MLP)
  • TabPFN 3 – Transformer-based prior-fitted network
  • TabICL – Transformer-based in-context learning

19 datasets from OpenML:

  • Binary classification: 15 datasets
  • Multiclass classification: 1 dataset
  • Regression: 3 datasets
  • Dataset sizes range from 435 to nearly 49,000 samples.

Evaluation

Cross-Validation

  • 5-fold stratified CV for classification
  • 5-fold CV for regression
  • Same random seed (42) across all experiments
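
To illustrate how a fixed seed yields identical stratified splits for every model, here is a simplified standard-library sketch. It is not the benchmark's code and not scikit-learn's `StratifiedKFold` algorithm: it shuffles the indices of each class with a seeded generator and deals them round-robin into folds. The function name and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Return a fold index (0..n_splits-1) per sample, keeping class ratios similar per fold."""
    rng = random.Random(seed)  # seeded generator: same labels and seed give the same folds
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_splits
    return fold_of


# Made-up imbalanced labels: 80 negatives and 20 positives.
labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels)
for fold in range(5):
    test_idx = [i for i, f in enumerate(folds) if f == fold]
    positives = sum(labels[i] for i in test_idx)
    print(fold, len(test_idx), positives)
```

Computing the fold assignment once and reusing it for all seven models is what keeps the train/test splits identical across models.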

Metrics

Preprocessing

  • Numerical features: StandardScaler
  • Categorical features: One-hot encoding (except CatBoost, which handles categorical features natively)
  • Missing values: Median imputation (numerical), mode imputation (categorical)
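
The preprocessing steps above map naturally onto a scikit-learn pipeline. The sketch below is one plausible arrangement, not the benchmark's actual code: the synthetic data, the column names `income` and `segment`, and the choice of logistic regression as the model are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, made-up data with hypothetical column names; not one of the 19 datasets.
rng = np.random.default_rng(42)
n = 200
income = rng.normal(size=n)
segment = rng.choice(["a", "b", "c"], size=n)
y = ((income + (segment == "a") + rng.normal(scale=0.5, size=n)) > 0).astype(int)
income[::10] = np.nan  # inject missing numeric values after building the target
X = pd.DataFrame({"income": income, "segment": segment})

# Numeric columns: median imputation, then standard scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: mode imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("cat", categorical, ["segment"]),
])

# Default hyperparameters, evaluated with 5-fold stratified CV and seed 42.
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```

Placing the imputers, scaler, and encoder inside the pipeline means they are refit on each training fold, so no statistics from the held-out fold leak into preprocessing.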

Limitations

  • TabPFN 3: ≤2,000 raw features, ≤100,000 training rows. Native categorical handling avoids the one-hot blow-up that constrained earlier versions
  • TabICL: Classification tasks only (no regression support); no scores recorded on the 3 regression datasets in this benchmark
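
A quick arithmetic sketch shows why native categorical handling matters for the feature cap. It assumes full one-hot encoding (one column per category, no dropped level) and uses a made-up table shape; only the 2,000-feature cap comes from the limits listed above.

```python
def one_hot_width(n_numeric, cardinalities):
    """Column count after full one-hot encoding: one column per category level."""
    return n_numeric + sum(cardinalities)


def fits_feature_cap(n_features, cap=2_000):
    """Check a feature count against the 2,000 raw-feature cap."""
    return n_features <= cap


# Hypothetical table: 5 numeric columns and 3 categorical columns,
# one of which has 4,000 distinct values.
n_numeric = 5
cardinalities = [3, 50, 4_000]

raw_width = n_numeric + len(cardinalities)
encoded_width = one_hot_width(n_numeric, cardinalities)

print(raw_width, fits_feature_cap(raw_width))          # 8 raw columns fit the cap
print(encoded_width, fits_feature_cap(encoded_width))  # 4058 one-hot columns do not
```

With native handling, the cap applies to the 8 raw columns. With one-hot encoding, a single high-cardinality column would already exceed it.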

Reproducibility

All experiments use:

  • Fixed random seed: 42
  • Same train/test splits across models
  • Default hyperparameters (no tuning)

Researched by Berk Kalelioğlu, AI Researcher