+100 Datasets for ML & AI Models

updated on Jun 10, 2026

Building or adapting generative AI and conversational AI solutions requires data. You can use existing datasets available on the market or hire a data collection service.

We identified over 100 datasets to train and evaluate machine learning and AI models.

Large Language Models (LLMs) and Agentic AI datasets

Dataset / Benchmark	Description	Free / Paid	Last Update
MMLU (Massive Multitask Language Understanding)	Benchmark for general reasoning and academic knowledge	Free	Ongoing
HumanEval+	Python coding benchmark for generative code	Free	Ongoing
FineWeb	Hugging Face's dataset for LLM pre-training	Free	Ongoing
FineWeb-Edu	Educational subset of FineWeb	Free	Ongoing
BFCL (Berkeley Function Calling Leaderboard)	Continuously updated standard for evaluating tool/function-calling	Free	Ongoing
Superior-Reasoning-SFT	Alibaba-Apsara's Long-CoT reasoning dataset	Free	2026
Terminal-Bench 2.0	89 realistic terminal tasks (file manipulation, system administration, debugging, re-implementing research code)	Free	2026
MMMU (Massive Multi-disciplinary Multimodal Understanding)	Multimodal benchmark (image + text reasoning)	Free	2025
Humanity’s Last Exam (HLE)	Multimodal benchmark to test frontier LLMs beyond MMLU	Free	2025
AI Idea Bench (2025)	Tests LLMs’ ability to synthesize new research ideas	Free (research)	2025

This category includes datasets and benchmarks designed for training and evaluating advanced language and multimodal models. These datasets help assess model capabilities in reasoning, text generation, question answering, and creative tasks.

Large language model benchmarks such as MMLU and GPQA measure general and scientific reasoning.
Multimodal datasets, such as LAION-5B, combine text and images to train models that can handle both formats.
Frontier evaluations, such as Humanity’s Last Exam, ARC-AGI-2, and AI Idea Bench, test models’ creativity, factual accuracy, and adaptability to complex prompts.
Agentic and tool-use benchmarks, such as GAIA, BFCL (Berkeley Function Calling Leaderboard), and ComplexFuncBench, evaluate multi-step reasoning, tool calling, and task completion.
Pre-training corpora, such as FineWeb, Nemotron-CC, and Essential-Web v1.0, provide large-scale token collections for training base models.

AI coding and software engineering datasets

Dataset	Description	Free / Paid	Last Update
CodeNet (IBM)	14M code samples across 50+ languages	Free	Ongoing
HumanEval	Code generation evaluation benchmark	Free	Ongoing
APPS (Code Problems Dataset)	Programming problem-solution pairs	Free	Ongoing
CodeSearchNet	Code + docstring dataset	Free	Ongoing
Terminal-Bench	CLI/terminal tasks for AI agents	Free	2026
The Heap (2025)	Multilingual contamination-free code dataset	Free	2025
Amazon CodeWhisperer Dataset	Proprietary code suggestion dataset	Paid	2025
GitHub Copilot Telemetry Data	Proprietary; used internally for fine-tuning	Paid / Closed	2025
SWE-Bench Pro	1,865 multi-language tasks, contamination-resistant; far harder than SWE-bench Verified	Free (public set)	2025
SWE-bench Multilingual	300 curated tasks from real GitHub pull requests across 42 repositories and 9 programming languages	Free	2025

This category covers datasets for code generation, understanding, debugging, and translation. They are used to build and assess systems that assist programmers or automate software development tasks.

Datasets such as The Heap and MADE-WIC contain multilingual and annotated code for evaluating coding accuracy and technical debt.
HumanEval and APPS provide coding problems with reference solutions to benchmark code-generation quality.
Repository-level and agentic benchmarks, such as SWE-Bench Pro, SWE-Bench Multilingual, SWE-Lancer, and Terminal-Bench 2.0, evaluate models on real GitHub issues and end-to-end software tasks rather than isolated functions.
Proprietary datasets, such as those from Amazon CodeWhisperer and GitHub Copilot, support commercial coding assistants.

Older benchmarks like HumanEval and SWE-bench Verified are now widely considered contaminated or saturated; as a result, contamination-resistant successors such as SWE-Bench Pro have emerged. These datasets enable consistent testing of coding models and support the creation of tools that can analyze or generate software efficiently.
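
One common heuristic for spotting the kind of contamination described above is to measure how many word n-grams of a benchmark item also appear in a training document. The sketch below is an illustrative assumption, not the method used by SWE-Bench Pro or any other benchmark listed here; the function names `ngrams` and `overlap_ratio`, the sample strings, and the choice of n = 3 are all hypothetical.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 3) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical strings, made up for illustration only.
benchmark_item = "def add(a, b): return a + b"
leaked_doc = "tutorial code: def add(a, b): return a + b # simple helper"
clean_doc = "this page describes sorting algorithms in detail"

leaked_score = overlap_ratio(benchmark_item, leaked_doc)
clean_score = overlap_ratio(benchmark_item, clean_doc)
```

A high ratio suggests the benchmark item may have leaked into the training data, while a low ratio does not prove the opposite, since paraphrased or reformatted copies would escape an exact n-gram match.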

Cybersecurity and data security datasets

Dataset	Description	Free / Paid	Last Update
VirusShare / VirusTotal	Malware binaries and metadata	Freemium / Paid	Ongoing
CVE-MITRE Database	Public vulnerability and exploit metadata	Free	Ongoing
CIC-IIoT-2025 (DataSense)	Sensor-based benchmark dataset	Free	2025
Adversarial ML Threat Dataset (AdvBench)	Synthetic attacks (poisoning, evasion)	Free	2025
Defender AI Logs (Microsoft)	Security telemetry data for enterprise AI	Paid	2025
OWASP Top 10 for LLMs 2025	Guidelines/taxonomy for GenAI security	Free	2024
CICIDS2017	Network intrusion detection dataset	Free	2024
TON_IoT	IoT security dataset (network + telemetry logs)	Free	2024
EMBER	Malware feature dataset for static analysis	Free	2023
MalNet	Android malware function call graphs	Free	2021

Cybersecurity datasets provide information for detecting, classifying, and preventing digital threats. They include network traffic logs, malware samples, and vulnerability databases.

CICIDS2017 and TON_IoT are widely used for training intrusion and anomaly detection systems.
EMBER and VirusShare datasets contain labeled malware data for model-based classification.
The CVE-MITRE database provides structured information on known software vulnerabilities.

These datasets support research and model training in cybersecurity, allowing systems to learn from real attack patterns and improve threat identification.

Data, synthetic data, and privacy datasets

Dataset / Platform	Description	Free / Paid	Last Update
Kaggle Datasets	Open data across domains	Free	Ongoing
Google Dataset Search	Search engine for open datasets	Free	Ongoing
Data.gov / Data.gov.uk / EU Open Data Portal	Government data repositories	Free	Ongoing
Mostly AI / Gretel.ai	Synthetic data platforms	Paid	2025
GitHub Datasets List	Library of mixed domain datasets	Free & Paid	2025
Appen	Human-generated datasets for ML	Paid	2025
Telus International	Human + synthetic dataset provider	Paid	2024
Prolific	Human response data for research	Paid	2024
LXT	Crowdsourced data collection	Paid	2024
Hazy (Synthetic Data)	Synthetic structured data for enterprises	Paid	2024

This category includes open and synthetic datasets that help organizations train models while maintaining data privacy and quality. Synthetic data replicates real-world distributions without exposing personal or proprietary information.

Platforms such as Appen, Amazon Mechanical Turk, and Telus International supply human-generated datasets for supervised learning.
Hazy and Gretel.ai generate synthetic structured data for enterprise use.
Open repositories like Kaggle Datasets and Google Dataset Search provide publicly accessible data across multiple domains.

These datasets ensure that machine learning models have access to diverse, representative data while complying with privacy standards.
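
As a rough illustration of synthetic data that replicates a real-world distribution, the sketch below fits a mean and standard deviation to a small numeric column and samples new values from a normal distribution with those parameters. This is a deliberately simple assumption for illustration; it is not how Gretel.ai, Hazy, or any other platform listed above works, and the sample values are made up.

```python
import random
import statistics

# Hypothetical "real" measurements, made up for illustration.
real_values = [52.1, 48.7, 50.3, 49.9, 51.4, 47.8, 50.6, 49.2]

# Summarize the real column with two statistics.
mu = statistics.mean(real_values)
sigma = statistics.stdev(real_values)

# Sample synthetic values from a normal distribution with the same parameters.
rng = random.Random(0)  # fixed seed so the run is reproducible
synthetic_values = [rng.gauss(mu, sigma) for _ in range(1000)]

synthetic_mean = statistics.mean(synthetic_values)
```

The synthetic column has roughly the same mean and spread as the original without copying any individual record, although sampling from fitted statistics does not by itself provide a formal privacy guarantee.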

Domain-specific and industry datasets

Domain	Dataset	Description	Free / Paid	Last Update
Healthcare	MIMIC-IV	ICU patient records (de-identified)	Free (research only)	Ongoing
Healthcare	PhysioNet	Biomedical signals & physiological data	Free	Ongoing
Healthcare	HealthData.gov	U.S. government health datasets	Free	Ongoing
Autonomous Driving	Waymo Open Dataset	Labeled video / LiDAR data	Free (non-commercial)	Ongoing
Autonomous Driving	ApolloScape / KITTI / nuScenes	Road scene perception	Free	Ongoing
Finance / Economics	World Bank / IMF / OECD Open Data	Macroeconomic time series	Free	Ongoing
Education / Language	Common Voice	Crowdsourced speech data	Free	Ongoing
Music / Audio	Free Music Archive (FMA)	Music tracks + metadata	Free	Ongoing
Climate / Sustainability	NASA EarthData / Copernicus	Climate imagery, environmental metrics	Free	Ongoing
Robotics	10Kh-RealOmin-OpenData	GenRobot AI's embodied AI dataset with bimanual manipulation	Free	2026

Domain-specific datasets focus on applications in particular sectors such as healthcare, finance, robotics, and autonomous driving. They provide specialized, labeled data for training models in industry-relevant tasks.

MIMIC-IV and PhysioNet support medical research and healthcare analytics.
Waymo Open Dataset and KITTI are used for computer vision in autonomous vehicles.
Robotics and embodied-AI datasets, such as AGIBOT WORLD 2026, EgoDex, and EgoVerse, provide first-person (egocentric) video and manipulation data for training physical-AI and humanoid systems.
Multimodal medical benchmarks, such as GMAI-MMBench and OmniMedVQA, evaluate clinical visual question answering across many imaging modalities.
World Bank Open Data and OECD datasets provide economic and financial indicators.
Common Voice and Free Music Archive support audio and language model development.

These datasets help organizations and researchers develop models tailored to industry challenges and specific data environments.

What are ML datasets?

A machine learning dataset is a structured data collection specifically gathered and prepared to train machine learning models. These datasets act as examples that help the model learn patterns, extract meaningful features, and make predictions on unseen data.

Depending on the task, the machine learning dataset may consist of various data types, including:

Text data: Used in applications like natural language processing, sentiment analysis, and machine translation.
Image data: Mostly used in computer vision and convolutional neural networks for tasks like handwritten digit recognition or steel plate fault detection.
Audio data: For speech recognition or sound classification tasks.
Video data: For object tracking or real-time video analysis.
Numeric data: Used in regression or classification tasks, sometimes coming from mass spectrometry data or timestamp logs.

Most machine learning projects begin with raw data, which is then labeled or annotated. This labeling helps the machine learning system understand the expected outcome for classification, regression, or other predictive tasks.

A good dataset, often sourced from open, public, or specialized machine learning repositories, can significantly improve model performance.

Why prepare datasets for machine learning?

Preparing and choosing high-quality datasets is one of the most crucial steps in developing artificial intelligence systems. Many organizations recognize that data preparation can make or break their machine learning projects.

The quality of the training data affects how well models generalize to real-world scenarios and how accurately they handle specific problems. There are three key purposes of a machine learning dataset:

To train the model

The training set teaches the model the relationships and patterns within the data. This involves feeding it annotated or labeled data so that it can adjust its parameters and improve its predictions on similar inputs.

To measure model accuracy

After training, the testing dataset (or test set) is used to evaluate the model’s performance. This helps determine how well the model handles unseen data, and whether it’s overfitting to the training set or learning meaningful patterns.

To improve the model post-deployment

Once deployed, machine learning models are often refined using additional collected data, helping them adapt to new conditions or classes. Validation sets, held out from the training data, also help tune hyperparameters and detect overfitting.
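
The three roles described above map onto the usual training, validation, and test split. The sketch below is a minimal example under stated assumptions: the helper name `split_dataset`, the 80/10/10 ratios, and the toy dataset are hypothetical choices, not recommendations from any specific library.

```python
import random


def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle `examples` and split them into train, validation, and test lists."""
    shuffled = list(examples)              # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test


# Hypothetical labeled examples: (feature, label) pairs.
dataset = [(i, i % 2) for i in range(100)]
train_set, val_set, test_set = split_dataset(dataset)
```

Shuffling before splitting avoids ordering effects in the source data, and keeping the test set untouched until the end gives a fairer estimate of performance on unseen data.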

Working with a data partner

Preparing datasets can be resource-intensive, especially when dealing with extensive collections, missing values, or complex annotations. Many organizations handle this process with a data collection or generation service provider.

You can collaborate with a data crowdsourcing platform or company specializing in data science services to create domain-specific datasets, whether you need machine learning datasets for sentiment analysis, text classification, or image-based tasks like identifying one hundred plant species.

Sometimes, data is gathered through web scraping or accessed through tools like Google Dataset Search or open data initiatives.

For specialized needs, such as datasets for deep learning models or computer vision systems, relying on curated public datasets or free datasets can help ensure that the training data covers the necessary range of examples and classes.

Conclusion

Selecting the right dataset is a foundational step in any machine learning or AI project. Whether you opt for human-generated data, machine-generated synthetic data, or freely available open datasets, the key is aligning your data choice with your project’s specific goals and challenges.

High-quality and well-prepared datasets directly influence how effectively a model learns, generalizes, and performs in real-world applications.

Organizations and practitioners can better navigate the complexities of AI development by understanding the types and roles of datasets (training, validation, and test sets) and by exploring the rich ecosystem of available data sources.

Careful attention to data quality, relevance, and diversity ensures models are accurate and adaptable to evolving needs.

FAQs

Where can I find datasets for machine learning?

To find datasets for machine learning, data scientists can explore various data repositories offering diverse datasets, including demographic data, economic and financial data, and public government data. These curated datasets cover a range of applications, such as natural language processing, sentiment analysis, computer vision, and healthcare.

Resources like open datasets, free datasets, and public datasets provide high-quality training data, validation datasets, and test datasets in various data formats like CSV files. Popular sources include government portals, academic institutions, and organizations like the International Monetary Fund, offering extensive collections of datasets for ML projects, predictive models, and deep learning algorithms.
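
Because many of these sources distribute data as CSV files, a quick first check after downloading is to count missing values per column. The sketch below uses only Python's standard `csv` module; the CSV content and column names are hypothetical and stand in for a real downloaded file.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be read from a downloaded file.
csv_text = """age,income,label
34,52000,1
,48000,0
29,,1
"""

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Count empty cells per column to see where values are missing.
missing_counts = {
    column: sum(1 for row in rows if row[column] == "")
    for column in reader.fieldnames
}
```

Columns with many empty cells may need imputation or removal before the data is used for training.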

What makes a good machine learning dataset?

A good machine learning dataset is high-quality and diverse, comes with rich metadata, and suits a specific task such as natural language processing, image classification, or sentiment analysis. Such datasets are often available from public data repositories or open data sources.

Cite this research

Cem Dilmegani and Sıla Ermut (2026) - "+100 Datasets for ML & AI Models". Published online at AIMultiple.com. Retrieved June 10, 2026, from: https://aimultiple.com/datasets-for-ml [Online Resource]

Dilmegani, C., & Ermut, S. (2026, June 10). +100 Datasets for ML & AI Models. AIMultiple. https://aimultiple.com/datasets-for-ml

@misc{dilmegani2026,
  author = {Dilmegani, Cem and Ermut, Sıla},
  title  = {{+100 Datasets for ML & AI Models}},
  year   = {2026},
  month  = jun,
  howpublished    = {\url{https://aimultiple.com/datasets-for-ml}},
  note   = {AIMultiple. Retrieved June 10, 2026}
}

Cem Dilmegani

Principal Analyst

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per Similarweb), including 55% of the Fortune 500, every month.

Cem's work has been cited by leading global publications, including Business Insider, Forbes, and the Washington Post; by global firms like Deloitte and HPE; by NGOs like the World Economic Forum; and by supranational organizations like the European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement at a telco while reporting to the CEO. He also led the commercial growth of the deep tech company Hypatos, which went from zero to 7-digit annual recurring revenue and a 9-digit valuation within 2 years. Cem's work at Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.

Researched by

Sıla Ermut

Industry Analyst

Sıla Ermut is an industry analyst at AIMultiple focused on email marketing and sales videos. She previously worked as a recruiter in project management and consulting firms. Sıla holds a Master of Science degree in Social Psychology and a Bachelor of Arts degree in International Relations.
