List of datasets for machine-learning research

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.^[1] High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce.^[2]^[3]^[4]

Many organizations, including governments, publish and share their datasets, often using common metadata formats (such as Croissant).^[5] The datasets are classified, based on the licenses, into two groups: open data and non-open data.
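The two-way split by license described above can be sketched in a few lines. This is only an illustration: the set of license identifiers treated as open, the function name `classify_by_license`, and the catalog entries are all assumptions made for the example, and real portals apply their own identifiers and their own definition of openness.

```python
# Hypothetical license identifiers treated as "open" in this sketch.
# Real portals use their own identifiers and their own openness criteria.
OPEN_LICENSES = {"cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0", "odc-pddl", "odc-by", "odc-odbl"}


def classify_by_license(datasets):
    """Group dataset names into 'open' and 'non-open' by license identifier."""
    groups = {"open": [], "non-open": []}
    for name, license_id in datasets:
        # Normalize case and whitespace before the lookup.
        key = "open" if license_id.strip().lower() in OPEN_LICENSES else "non-open"
        groups[key].append(name)
    return groups


# Illustrative catalog entries (names and licenses are made up).
catalog = [
    ("city-bike-trips", "CC-BY-4.0"),
    ("vendor-sales", "proprietary"),
    ("tree-census", "CC0-1.0"),
]
groups = classify_by_license(catalog)
print(groups)
```

An unknown identifier falls into the non-open group here, which is a deliberately conservative choice for the sketch.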

The datasets from various governmental bodies are listed in List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as Open API.^{[citation needed]} The portals sort the datasets into various types and subtypes.^{[citation needed]}

Type	Subtypes
Specific category	Finance, Economics, Commerce, Societal, Health, Academy, Sports, Food, Agriculture, Travel, Geospatial, Political, Consumer, Transport, Logistics, Environmental, Real-Estate, Legal, Entertainment, Energy, Hospitality
Scope	Supranational Union, National, Subnational, Municipality, Urban, Rural
Language	Mandarin Chinese, Spanish, English, Arabic, Hindi, Bengali
Type	Tabular, Graph, Text, Image, Sound, Video
Usage	Training, validating, and testing
File-Formats	CSV, JSON, XML, KML, GeoJSON, Shapefile, GML
Licenses	Creative-Commons, GPL, Other Non-Open data licenses
Last-Updated	Last-Hour, Last-Day, Last-Week, Last-Month, Last-Year
File-Size	Minimum, Maximum, Range
Status	Verified, In-Preparation, Deactivated (or Deprecated)
Number of records	100s, 1000s, 10000s, 100000s, Millions
Number of variables	Less than 10, 10s, 100s, 1000s, 10000s
Services	Individual, Aggregation
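The facets in the table above lend themselves to simple filtering. The following is a minimal sketch, not any portal's actual implementation: the `DatasetEntry` record, its field names, the helper functions, and the sample entries are all hypothetical, and the record-count buckets only approximate the ranges listed in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical catalog record with a few of the facets from the table."""
    name: str
    category: str     # e.g. "Health", "Transport"
    file_format: str  # e.g. "CSV", "GeoJSON"
    records: int      # number of records
    status: str       # e.g. "Verified", "Deprecated"


def record_bucket(n):
    """Map a record count to a coarse bucket similar to those in the table."""
    if n >= 1_000_000:
        return "Millions"
    for bound, label in ((100_000, "100000s"), (10_000, "10000s"), (1_000, "1000s"), (100, "100s")):
        if n >= bound:
            return label
    return "Less than 100"


def filter_entries(entries, **facets):
    """Keep entries whose attributes equal every requested facet value."""
    return [e for e in entries if all(getattr(e, k) == v for k, v in facets.items())]


# Made-up entries for illustration only.
entries = [
    DatasetEntry("bus-stops", "Transport", "GeoJSON", 4_200, "Verified"),
    DatasetEntry("clinic-visits", "Health", "CSV", 3_500_000, "Verified"),
    DatasetEntry("old-fares", "Transport", "CSV", 250, "Deprecated"),
]
verified_transport = filter_entries(entries, category="Transport", status="Verified")
print([e.name for e in verified_transport])
print([record_bucket(e.records) for e in entries])
```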

Portal-name	License	List of installations of the portal	Typical usages
Comprehensive Knowledge Archive Network (CKAN)	AGPL	https://ckan.github.io/ckan-instances/ https://github.com/sebneu/ckan_instances/blob/master/instances.csv	Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
DKAN	GPL	https://getdkan.org/community	Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
Dataverse	Apache	https://dataverse.org/installations https://dataverse.org/metrics	Data Management Solution for Research Institutes
DSpace	BSD	https://registry.lyrasis.org/	Data Management Solution for Research Institutes
OpenML	BSD	https://www.openml.org/search?type=data&sort=runs&status=active	Data Management Solution to share datasets, algorithms, and experiment results through APIs.
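Portals of this kind are typically queried over HTTP. As a hedged illustration, the sketch below builds a search URL for CKAN's Action API (`package_search`) and summarizes a response. The portal address `https://portal.example.org` is a placeholder, the helper names are hypothetical, and the sample payload is made up to resemble the general shape of a CKAN response; no network request is made.

```python
from urllib.parse import urlencode


def package_search_url(portal_base, query, rows=5):
    """Build a package_search URL for a CKAN portal (Action API, version 3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_base.rstrip('/')}/api/3/action/package_search?{params}"


def summarize_search(payload):
    """Extract (name, license_id, sorted formats) from a search response."""
    if not payload.get("success"):
        raise ValueError("portal reported an unsuccessful request")
    summary = []
    for pkg in payload["result"]["results"]:
        # Collect distinct, non-empty resource formats for this dataset.
        formats = sorted({r.get("format", "") for r in pkg.get("resources", [])} - {""})
        summary.append((pkg["name"], pkg.get("license_id"), formats))
    return summary


# Made-up response for illustration; field values are not real data.
sample = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {
                "name": "air-quality-2020",
                "license_id": "cc-by",
                "resources": [{"format": "CSV"}, {"format": "JSON"}, {"format": "CSV"}],
            }
        ],
    },
}

url = package_search_url("https://portal.example.org/", "air quality")
summary = summarize_search(sample)
print(url)
print(summary)
```

In practice the URL would be fetched with an HTTP client and the decoded JSON passed to `summarize_search`; other portal software exposes different APIs, so this sketch applies to CKAN-style responses only.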

Academic Torrents	https://academictorrents.com
Amazon Datasets	https://registry.opendata.aws/
Awesome Public Datasets Collection	https://github.com/awesomedata/awesome-public-datasets
data.world	https://data.world/datasets/machine-learning
Datahub – Core Datasets	https://datahub.io/docs/core-data
DataONE	https://www.dataone.org/
DataPortals	https://dataportals.org/
Datasetlist.com	https://www.datasetlist.com
Global Open Data Index – Open Knowledge Foundation	https://okfn.org/
Google Dataset Search	https://datasetsearch.research.google.com/
Hugging Face	https://huggingface.co/docs/datasets/
IBM's Data Asset Exchange	https://developer.ibm.com/exchanges/data/
Jupyter – Tutorial Data	https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html
Kaggle	https://www.kaggle.com/datasets
Machine learning datasets	https://macgence.com/data-sets-and-cataloges/
Major Smart Cities with Open Data	https://rlist.io/l/major-smart-cities-with-open-data-portals
Microsoft Datasets	https://msropendata.com/datasets
Open Data Inception	https://opendatainception.io/
Opendatasoft	https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en
OpenDOAR	https://v2.sherpa.ac.uk/opendoar/
OpenML	https://www.openml.org/search?type=data
Papers with Code	https://paperswithcode.com/datasets
Penn Machine Learning Benchmarks	https://github.com/EpistasisLab/pmlb/tree/master/datasets
Public APIs	https://github.com/public-apis/public-apis
Registry of Open Access Repositories	http://roar.eprints.org/
REgistry of REsearch Data REpositories	https://www.re3data.org/
UCI Machine Learning Repository	https://archive.ics.uci.edu/
Speech Dataset	https://www.shaip.com/offerings/speech-data-catalog/
Visual Data Discovery	https://visualdata.io/discovery

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created (updated)	Reference	Creator
Netflix Prize	Movie ratings on Netflix.		100,480,507 ratings that 480,189 users gave to 17,770 movies	Text, rating	Rating prediction	2006	^[6]	Netflix
Amazon reviews	US product reviews from Amazon.com.	None.	233.1 million	Text	Classification, sentiment analysis	2015 (2018)	^[7]^[8]	McAuley et al.
OpinRank Review Dataset	Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively.	None.	42,230 / ~259,000 respectively	Text	Sentiment analysis, clustering	2011	^[9]^[10]	K. Ganesan et al.
MovieLens	22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.	None.	~ 22M	Text	Regression, clustering, classification	2016	^[11]	GroupLens Research
Yahoo! Music User Ratings of Musical Artists	Over 10M ratings of artists by Yahoo users.	None described.	~ 10M	Text	Clustering, regression	2004	^[12]^[13]	Yahoo!
Car Evaluation Data Set	Car properties and their overall acceptability.	Six categorical features given.	1,728	Text	Classification	1997	^[14]^[15]	M. Bohanec
YouTube Comedy Slam Preference Dataset	User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.	Video metadata given.	1,138,562	Text	Classification	2012	^[16]^[17]	Google
Skytrax User Reviews Dataset	User reviews of airlines, airports, seats, and lounges from Skytrax.	Ratings are fine-grained and include many aspects of airport experience.	41,396	Text	Classification, regression	2015	^[18]	Q. Nguyen
Teaching Assistant Evaluation Dataset	Teaching assistant reviews.	Features of each instance such as class, class size, and instructor are given.	151	Text	Classification	1997	^[19]^[20]	W. Loh et al.
Vietnamese Students' Feedback Corpus (UIT-VSFC)	Students' feedback.	Comments	16,000	Text	Classification	2018	^[21]	Nguyen et al.
Vietnamese Social Media Emotion Corpus (UIT-VSMEC)	Users' Facebook comments.	Comments	6,927	Text	Classification	2019	^[22]	Nguyen et al.
Vietnamese Open-domain Complaint Detection dataset (ViOCD)	Customer product reviews	Comments	5,485	Text	Classification	2021	^[23]	Nguyen et al.
ViHOS: Hate Speech Spans Detection for Vietnamese	Social media texts	Comments	26k spans in 11k comments	Text	Span detection	2021	^[24]	Hoang et al.

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created (updated)	Reference	Creator
NYSK Dataset	English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn.	Filtered and presented in XML format.	10,421	XML, text	Sentiment analysis, topic extraction	2013	^[25]	Dermouche, M. et al.
The Reuters Corpus Volume 1	Large corpus of Reuters news stories in English.	Fine-grained categorization and topic codes.	810,000	Text	Classification, clustering, summarization	2002	^[26]	Reuters
The Reuters Corpus Volume 2	Large corpus of Reuters news stories in multiple languages.	Fine-grained categorization and topic codes.	487,000	Text	Classification, clustering, summarization	2005	^[27]	Reuters
Thomson Reuters Text Research Collection	Large corpus of news stories.	Details not described.	1,800,370	Text	Classification, clustering, summarization	2009	^[28]	T. Rose et al.
Saudi Newspapers Corpus	31,030 Arabic newspaper articles.	Metadata extracted.	31,030	JSON	Summarization, clustering	2015	^[29]	M. Alhagri
RE3D (Relationship and Entity Extraction Evaluation Dataset)	Entity and Relation marked data from various news and government sources. Sponsored by Dstl	Filtered, categorisation using Baleen types	not known	JSON	Classification, Entity and Relation recognition	2017	^[30]	Dstl
Examiner Spam Clickbait Catalogue	Clickbait, spam, crowd-sourced headlines from 2010 to 2015	Publish date and headlines	3,089,781	CSV	Clustering, Events, Sentiment	2016	^[31]	R. Kulkarni
ABC Australia News Corpus	Entire news corpus of ABC Australia from 2003 to 2019	Publish date and headlines	1,186,018	CSV	Clustering, Events, Sentiment	2020	^[32]	R. Kulkarni
Worldwide News – Aggregate of 20K Feeds	One week snapshot of all online headlines in 20+ languages	Publish time, URL and headlines	1,398,431	CSV	Clustering, Events, Language Detection	2018	^[33]	R. Kulkarni
Reuters News Wire Headline	11 Years of timestamped events published on the news-wire	Publish time, Headline Text	16,121,310	CSV	NLP, Computational Linguistics, Events	2018	^[34]	R. Kulkarni
The Irish Times Ireland News Corpus	24 Years of Ireland News from 1996 to 2019	Publish time, Headline Category and Text	1,484,340	CSV	NLP, Computational Linguistics, Events	2020	^[35]	R. Kulkarni
News Headlines Dataset for Sarcasm Detection	High quality dataset with Sarcastic and Non-sarcastic news headlines.	Clean, normalized text	26,709	JSON	NLP, Classification, Linguistics	2018	^[36]	Rishabh Misra

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created (updated)	Reference	Creator
Enron Corpus	Emails from employees at Enron organized into folders.	Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com.	~ 500,000	Text	Network analysis, sentiment analysis	2004 (2015)	^[37]^[38]	Klimt, B. and Y. Yang
Ling-Spam Dataset	Corpus containing both legitimate and spam emails.	Four versions of the corpus, depending on whether a lemmatiser or stop-list was enabled.	2,412 ham, 481 spam	Text	Classification	2000	^[39]^[40]	Androutsopoulos, J. et al.
SMS Spam Collection Dataset	Collected SMS spam messages.	None.	5,574	Text	Classification	2011	^[41]^[42]	T. Almeida et al.
Twenty Newsgroups Dataset	Messages from 20 different newsgroups.	None.	20,000	Text	Natural language processing	1999	^[43]	T. Mitchell et al.
Spambase Dataset	Spam emails.	Many text features extracted.	4,601	Text	Spam detection, classification	1999	^[44]	M. Hopkins et al.

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created (updated)	Reference	Creator
MovieTweetings	Movie rating dataset based on public and well-structured tweets		~710,000	Text	Classification, regression	2018	^[45]	S. Dooms
Twitter100k	Pairs of images and tweets		100,000	Text and Images	Cross-media retrieval	2017	^[46]^[47]	Y. Hu, et al.
Sentiment140	Tweet data from 2009 including original text, time stamp, user and sentiment.	Classified using distant supervision from the presence of emoticons in tweets.	1,578,627	Tweets, comma-separated values	Sentiment analysis	2009	^[48]^[49]	A. Go et al.
ASU Twitter Dataset	Twitter network data, not actual tweets. Shows connections between a large number of users.	None.	11,316,811 users, 85,331,846 connections	Text	Clustering, graph analysis	2009	^[50]^[51]	R. Zafarani et al.
SNAP Social Circles: Twitter Database	Large Twitter network data.	Node features, circles, and ego networks.	1,768,149	Text	Clustering, graph analysis	2012	^[52]^[53]	J. McAuley et al.
Twitter Dataset for Arabic Sentiment Analysis	Arabic tweets.	Samples hand-labeled as positive or negative.	2,000	Text	Classification	2014	^[54]^[55]	N. Abdulla
Buzz in Social Media Dataset	Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites.	Data is windowed so that the user can attempt to predict the events leading up to social media buzz.	140,000	Text	Regression, Classification	2013	^[56]^[57]	F. Kawala et al.
Paraphrase and Semantic Similarity in Twitter (PIT)	This dataset focuses on whether tweets have (almost) same meaning/information or not. Manually labeled.	tokenization, part-of-speech and named entity tagging	18,762	Text	Regression, Classification	2015	^[58]^[59]	Xu et al.
Geoparse Twitter benchmark dataset	This dataset contains tweets during different news events in different countries. Manually labeled location mentions.	location annotations added to JSON metadata	6,386	Tweets, JSON	Classification, Information Extraction	2014	^[60]^[61]	S. E. Middleton et al.
Sarcasm, Perceived and Intended, by Reactive Supervision (SPIRS)	Intended and perceived sarcastic tweets along with their context collected using reactive supervision; an equal number of negative (non-sarcastic) samples		30,000	Tweet IDs, CSV	Classification	2020	^[62]^[63]	B. Shmueli et al.
Dutch Social media collection	This dataset contains COVID-19 tweets made by Dutch speakers or users from the Netherlands. The data has been machine labeled.	Classified for sentiment; tweet text and user description translated to English. Industry mentions are extracted.	271,342	JSONL	Sentiment, multi-label classification, machine translation	2020	^[64]^[65]^[66]	Aaaksh Gupta, CoronaWhy
ReactionGIF dataset	A dataset of 30K tweets and their GIF reactions	Classified for sentiment, reaction, and emotion	30,000	Tweet IDs, JSONL	Classification (sentiment, reaction, emotion)	2021	^[67]^[68]	B. Shmueli et al.
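Several entries in the table above rely on automatic labeling; Sentiment140, for example, is described as classified using distant supervision from emoticons. A minimal sketch of that general idea follows. The emoticon lists, the function name `distant_label`, and the sample tweets are assumptions for illustration and are not taken from the dataset's own tooling.

```python
# Illustrative emoticon lists; the actual lists used by any dataset may differ.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def distant_label(tweet):
    """Return (text_without_emoticons, label), or None if no clear signal."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None  # no emoticon at all, or conflicting emoticons
    text = tweet
    for e in POSITIVE + NEGATIVE:
        # Remove the emoticon so a model cannot simply memorize the label cue.
        text = text.replace(e, "")
    return " ".join(text.split()), ("positive" if has_pos else "negative")


# Made-up tweets for illustration.
samples = [
    "Loving the new phone :)",
    "Missed my train :(",
    "Meeting at noon",
    "Great food :) terrible service :(",
]
labeled = [distant_label(t) for t in samples]
print(labeled)
```

Labels produced this way are noisy by construction, which is why tweets with no emoticon or with mixed signals are skipped rather than guessed.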

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created (updated)	Reference	Creator
NPS Chat Corpus	Posts from age-specific online chat rooms.	Hand privacy masked, tagged for part of speech and dialogue-act.	~ 500,000	XML	NLP, programming, linguistics	2007	^[69]	Forsyth, E., Lin, J., & Martell, C.
Twitter Triple Corpus	A-B-A triples extracted from Twitter.		4,232	Text	NLP	2016	^[70]	Sordoni, A. et al.
UseNet Corpus	UseNet forum postings.	Anonymized e-mails and URLs. Omitted documents with lengths <500 words or >500,000 words, or that were <90% English.	7 billion	Text		2011	^[71]	Shaoul, C., & Westbury C.
NUS SMS Corpus	SMS messages collected between two users, with timing analysis.		~ 10,000	XML	NLP	2011	^[72]	Kan, M.
Reddit All Comments Corpus	All Reddit comments (as of 2015).		~ 1.7 billion	JSON	NLP, research	2015	^[73]	Stuck_In_the_Matrix
Ubuntu Dialogue Corpus	Dialogues extracted from Ubuntu chat stream on IRC.		930 thousand dialogues, 7.1 million utterances	CSV	Dialogue Systems Research	2015	^[74]	Lowe, R. et al.
Dialog State Tracking Challenge	The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems.	Transcription of spoken dialogs with labelling	DSTC2 contains ~3.2k calls; DSTC3 contains ~2.3k calls	JSON	Dialogue state tracking	2014	^[75]	Henderson, Matthew and Thomson, Blaise and Williams, Jason D

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created (updated)	Reference	Creator
FreeLaw	Filtered data from Court Listener, part of the FreeLaw project.	Cleaned and normalized text	4,940,710	JSON	NLP, linguistics	2020	^[76]	T. Hoppe
Pile of Law	Corpus of legal and administrative data	Cleaned, normalized, and privatized	~50,000,000	JSON	NLP, linguistics, sentiment	2022	^[77]^[78]	L. Zheng; N. Guha; B. Anderson; P. Henderson; D. Ho
Caselaw Access Project	All official, book-published state and federal United States case law: every volume or case designated as an official report of decisions by a court within the United States.	Cleaned and normalized text	~10,000	JSON	NLP, linguistics	2022	^[79]	A. Aizman; S. Chapman; J. Cushman; K. Dulin; H. Eidolon; et al.

List of sorting used for datasets

List of open data portals

List of portals suitable for multiple types of applications

List of portals suitable for a specific subtype of applications

Image data

Text data

Reviews

News articles

Messages

Twitter and tweets

Dialogues

Legal

Other text

Sound data

Speech

Music

Other sounds

Signal data

Electrical

Motion-tracking

Other signals

Chemical data

Chemical Reactions with transition states (TS)

OpenReACT-CHON-EFH

Physical data

High-energy physics

Systems

Astronomy

Earth science

Other physical

Biological data

Human

Animal

Fungi

Plant

Microbe

Drug discovery

Anomaly data

Question answering data

Dialog or instruction prompted data

Cybersecurity

Climate and sustainability

Code data

Multivariate data

Financial

Weather

Census

Transit

Internet

Games

Other multivariate

Curated repositories of datasets

See also

References