List of datasets for machine-learning research

From Wikipedia, the free encyclopedia

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.[2][3][4]

Many organizations, including governments, publish and share their datasets. Based on their licenses, datasets are classified as open data or non-open data.

Datasets from various governmental bodies are presented in List of open government data sites. These datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as Open API. They are organized into the types and subtypes listed below.

List of sorting used for datasets

Type | Subtypes
Specific category | Finance, Economics, Commerce, Societal, Health, Academy, Sports, Food, Agriculture, Travel, Geospatial, Political, Consumer, Transport, Logistics, Environmental, Real-Estate, Legal, Entertainment, Energy, Hospitality
Scope | Supranational Union, National, Subnational, Municipality, Urban, Rural
Language | Mandarin Chinese, Spanish, English, Arabic, Hindi, Bengali
Type | Tabular, Graph, Text, Image, Sound, Video
Usage | Training, validating, and testing
File-Formats | CSV, JSON, XML, KML, GeoJSON, Shapefile, GML
Licenses | Creative-Commons, GPL, Other Non-Open data licenses
Last-Updated | Last-Hour, Last-Day, Last-Week, Last-Month, Last-Year
File-Size | Minimum, Maximum, Range
Status | Verified, In-Preparation, Deactivated (or Deprecated)
Number of records | 100s, 1000s, 10000s, 100000s, Millions
Number of variables | Less than 10, 10s, 100s, 1000s, 10000s
Services | Individual, Aggregation
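The "Usage" row above distinguishes data used for training, validating, and testing. As a minimal sketch of what that partition can look like in practice, the following splits a list of records into three disjoint parts. The function name `split_dataset` and the 70/15/15 proportions are illustrative assumptions, not something the listed portals prescribe.

```python
import random


def split_dataset(records, train=0.7, validation=0.15, seed=0):
    """Shuffle records and split them into train, validation, and test parts.

    The test part receives whatever remains after the first two parts.
    """
    if not (0 < train < 1) or not (0 <= validation < 1) or train + validation >= 1:
        raise ValueError("train and validation fractions must sum to less than 1")
    shuffled = list(records)
    # A fixed seed makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    train_part = shuffled[:n_train]
    validation_part = shuffled[n_train:n_train + n_validation]
    test_part = shuffled[n_train + n_validation:]
    return train_part, validation_part, test_part


if __name__ == "__main__":
    parts = split_dataset(range(100))
    print([len(part) for part in parts])
```

Every record lands in exactly one part, so the test part is never seen during training or validation.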

Data portals are classified by their type of license. Portals built on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.

List of open data portals

Portal-name | License | List of installations of the portal | Typical usages
Comprehensive Knowledge Archive Network (CKAN) | AGPL | https://ckan.github.io/ckan-instances/ ; https://github.com/sebneu/ckan_instances/blob/master/instances.csv | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
DKAN | GPL | https://getdkan.org/community | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
Dataverse | Apache | https://dataverse.org/installations ; https://dataverse.org/metrics | Data Management Solution for Research Institutes
DSpace | BSD | https://registry.lyrasis.org/ | Data Management Solution for Research Institutes
OpenML | BSD | https://www.openml.org/search?type=data&sort=runs&status=active | Data Management Solution to share datasets, algorithms, and experiment results through APIs.
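Portals of this kind are usually queried programmatically. As a rough sketch, CKAN exposes an Action API whose `package_search` action returns matching datasets as JSON. The example below only builds the request URL and parses a response body, so it runs offline. The portal address `https://portal.example.org` and the sample response are hypothetical placeholders, and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode, urljoin


def build_search_url(portal_base, query, rows=5):
    """Return a CKAN Action API `package_search` URL for a portal base URL."""
    endpoint = urljoin(portal_base.rstrip("/") + "/", "api/3/action/package_search")
    return endpoint + "?" + urlencode({"q": query, "rows": rows})


def dataset_titles(response_text):
    """Extract dataset titles from a `package_search` JSON response body."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        return []
    return [package.get("title", "") for package in payload["result"]["results"]]


# Hypothetical response body, trimmed to the fields used in this sketch.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [{"title": "Air quality 2020"}, {"title": "Bus stops"}],
    },
})

if __name__ == "__main__":
    print(build_search_url("https://portal.example.org", "air quality"))
    print(dataset_titles(SAMPLE_RESPONSE))
```

To query a live portal, the URL from `build_search_url` can be fetched with any HTTP client and the body passed to `dataset_titles`.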

List of portals suitable for multiple types of applications

Some data portals list a wide variety of dataset subtypes that pertain to many machine learning applications.

Academic Torrents https://academictorrents.com
Amazon Datasets https://registry.opendata.aws/
Awesome Public Datasets Collection https://github.com/awesomedata/awesome-public-datasets
data.world https://data.world/datasets/machine-learning
Datahub – Core Datasets https://datahub.io/docs/core-data
DataONE https://www.dataone.org/
DataPortals https://dataportals.org/
Datasetlist.com https://www.datasetlist.com
Global Open Data Index – Open Knowledge Foundation https://okfn.org/ Archived 25 May 2020 at the Wayback Machine
Google Dataset Search https://datasetsearch.research.google.com/
Hugging Face https://huggingface.co/docs/datasets/
IBM's Data Asset Exchange https://developer.ibm.com/exchanges/data/
Jupyter – Tutorial Data https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html
Kaggle https://www.kaggle.com/datasets
Machine learning datasets https://macgence.com/data-sets-and-cataloges/
Major Smart Cities with Open Data https://rlist.io/l/major-smart-cities-with-open-data-portals
Microsoft Datasets https://msropendata.com/datasets
Open Data Inception https://opendatainception.io/
Opendatasoft https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en
OpenDOAR https://v2.sherpa.ac.uk/opendoar/
OpenML https://www.openml.org/search?type=data
Papers with Code https://paperswithcode.com/datasets
Penn Machine Learning Benchmarks https://github.com/EpistasisLab/pmlb/tree/master/datasets
Public APIs https://github.com/public-apis/public-apis
Registry of Open Access Repositories http://roar.eprints.org/ 
REgistry of REsearch Data REpositories https://www.re3data.org/ 
UCI Machine Learning Repository http://mlr.cs.umass.edu/ml/ Archived 26 June 2020 at the Wayback Machine
Speech Dataset https://www.shaip.com/offerings/speech-data-catalog/
Visual Data Discovery https://visualdata.io/discovery

List of portals suitable for a specific subtype of applications

The data portals which are suitable for a specific subtype of machine learning application are listed in the subsequent sections.

Image data

Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Netflix Prize Movie ratings on Netflix. 100,480,507 ratings that 480,189 users gave to 17,770 movies Text, rating Rating prediction 2006 [5] Netflix
Amazon reviews US product reviews from Amazon.com. None. 233.1 million Text Classification, sentiment analysis 2015 (2018) [6][7] McAuley et al.
OpinRank Review Dataset Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. None. 42,230 / ~259,000 respectively Text Sentiment analysis, clustering 2011 [8][9] K. Ganesan et al.
MovieLens 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. None. ~ 22M Text Regression, clustering, classification 2016 [10] GroupLens Research
Yahoo! Music User Ratings of Musical Artists Over 10M ratings of artists by Yahoo users. None described. ~ 10M Text Clustering, regression 2004 [11][12] Yahoo!
Car Evaluation Data Set Car properties and their overall acceptability. Six categorical features given. 1728 Text Classification 1997 [13][14] M. Bohanec
YouTube Comedy Slam Preference Dataset User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. Video metadata given. 1,138,562 Text Classification 2012 [15][16] Google
Skytrax User Reviews Dataset User reviews of airlines, airports, seats, and lounges from Skytrax. Ratings are fine-grained and include many aspects of airport experience. 41,396 Text Classification, regression 2015 [17] Q. Nguyen
Teaching Assistant Evaluation Dataset Teaching assistant reviews. Features of each instance such as class, class size, and instructor are given. 151 Text Classification 1997 [18][19] W. Loh et al.
Vietnamese Students’ Feedback Corpus (UIT-VSFC) Students’ Feedback. Comments 16,000 Text Classification 2018 [20] Nguyen et al.
Vietnamese Social Media Emotion Corpus (UIT-VSMEC) Users’ Facebook Comments. Comments 6,927 Text Classification 2019 [21] Nguyen et al.
Vietnamese Open-domain Complaint Detection dataset (ViOCD) Customer product reviews Comments 5,485 Text Classification 2021 [22] Nguyen et al.
ViHOS: Hate Speech Spans Detection for Vietnamese Social Media Texts Comments Containing 26k spans on 11k comments Text Span Detection 2021 [23] Hoang et al.

News articles

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
NYSK Dataset English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. Filtered and presented in XML format. 10,421 XML, text Sentiment analysis, topic extraction 2013 [24] Dermouche, M. et al.
The Reuters Corpus Volume 1 Large corpus of Reuters news stories in English. Fine-grain categorization and topic codes. 810,000 Text Classification, clustering, summarization 2002 [25] Reuters
The Reuters Corpus Volume 2 Large corpus of Reuters news stories in multiple languages. Fine-grain categorization and topic codes. 487,000 Text Classification, clustering, summarization 2005 [26] Reuters
Thomson Reuters Text Research Collection Large corpus of news stories. Details not described. 1,800,370 Text Classification, clustering, summarization 2009 [27] T. Rose et al.
Saudi Newspapers Corpus 31,030 Arabic newspaper articles. Metadata extracted. 31,030 JSON Summarization, clustering 2015 [28] M. Alhagri
RE3D (Relationship and Entity Extraction Evaluation Dataset) Entity and Relation marked data from various news and government sources. Sponsored by Dstl Filtered, categorisation using Baleen types not known JSON Classification, Entity and Relation recognition 2017 [29] Dstl
Examiner Spam Clickbait Catalogue Clickbait, spam, crowd-sourced headlines from 2010 to 2015 Publish date and headlines 3,089,781 CSV Clustering, Events, Sentiment 2016 [30] R. Kulkarni
ABC Australia News Corpus Entire news corpus of ABC Australia from 2003 to 2019 Publish date and headlines 1,186,018 CSV Clustering, Events, Sentiment 2020 [31] R. Kulkarni
Worldwide News – Aggregate of 20K Feeds One week snapshot of all online headlines in 20+ languages Publish time, URL and headlines 1,398,431 CSV Clustering, Events, Language Detection 2018 [32] R. Kulkarni
Reuters News Wire Headline 11 Years of timestamped events published on the news-wire Publish time, Headline Text 16,121,310 CSV NLP, Computational Linguistics, Events 2018 [33] R. Kulkarni
The Irish Times Ireland News Corpus 24 Years of Ireland News from 1996 to 2019 Publish time, Headline Category and Text 1,484,340 CSV NLP, Computational Linguistics, Events 2020 [34] R. Kulkarni
News Headlines Dataset for Sarcasm Detection High quality dataset with Sarcastic and Non-sarcastic news headlines. Clean, normalized text 26,709 JSON NLP, Classification, Linguistics 2018 [35] Rishabh Misra

Messages

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Enron Corpus Emails from employees at Enron organized into folders. Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. ~ 500,000 Text Network analysis, sentiment analysis 2004 (2015) [36][37] Klimt, B. and Y. Yang
Ling-Spam Dataset Corpus containing both legitimate and spam emails. Four versions of the corpus involving whether or not a lemmatiser or stop-list was enabled. 2,412 Ham 481 Spam Text Classification 2000 [38][39] Androutsopoulos, J. et al.
SMS Spam Collection Dataset Collected SMS spam messages. None. 5,574 Text Classification 2011 [40][41] T. Almeida et al.
Twenty Newsgroups Dataset Messages from 20 different newsgroups. None. 20,000 Text Natural language processing 1999 [42] T. Mitchell et al.
Spambase Dataset Spam emails. Many text features extracted. 4,601 Text Spam detection, classification 1999 [43] M. Hopkins et al.

Twitter and tweets

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
MovieTweetings Movie rating dataset based on public and well-structured tweets ~710,000 Text Classification, regression 2018 [44] S. Dooms
Twitter100k Pairs of images and tweets 100,000 Text and Images Cross-media retrieval 2017 [45][46] Y. Hu, et al.
Sentiment140 Tweet data from 2009 including original text, time stamp, user and sentiment. Classified using distant supervision from presence of emoticons in tweet. 1,578,627 Tweets, comma-separated values Sentiment analysis 2009 [47][48] A. Go et al.
ASU Twitter Dataset Twitter network data, not actual tweets. Shows connections between a large number of users. None. 11,316,811 users, 85,331,846 connections Text Clustering, graph analysis 2009 [49][50] R. Zafarani et al.
SNAP Social Circles: Twitter Database Large Twitter network data. Node features, circles, and ego networks. 1,768,149 Text Clustering, graph analysis 2012 [51][52] J. McAuley et al.
Twitter Dataset for Arabic Sentiment Analysis Arabic tweets. Samples hand-labeled as positive or negative. 2000 Text Classification 2014 [53][54] N. Abdulla
Buzz in Social Media Dataset Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. Data is windowed so that the user can attempt to predict the events leading up to social media buzz. 140,000 Text Regression, Classification 2013 [55][56] F. Kawala et al.
Paraphrase and Semantic Similarity in Twitter (PIT) This dataset focuses on whether tweets have (almost) same meaning/information or not. Manually labeled. tokenization, part-of-speech and named entity tagging 18,762 Text Regression, Classification 2015 [57][58] Xu et al.
Geoparse Twitter benchmark dataset This dataset contains tweets during different news events in different countries. Manually labeled location mentions. location annotations added to JSON metadata 6,386 Tweets, JSON Classification, Information Extraction 2014 [59][60] S.E. Middleton et al.
Sarcasm, Perceived and Intended, by Reactive Supervision (SPIRS) Intended and perceived sarcastic tweets along with their context collected using reactive supervision; an equal number of negative (non-sarcastic) samples 30,000 Tweet IDs, CSV Classification 2020 [61][62] B. Shmueli et al.
Dutch Social media collection This dataset contains COVID-19 tweets made by Dutch speakers or users from the Netherlands. The data has been machine-labeled for sentiment, with tweet text and user description translated to English. Industry mentions are extracted. 271,342 JSONL Sentiment, multi-label classification, machine translation 2020 [63][64][65] Aaaksh Gupta, CoronaWhy
ReactionGIF dataset A dataset of 30K tweets and their GIF reactions Classified for sentiment, reaction, and emotion 30,000 Tweet IDs, JSONL Classified for sentiment, reaction, and emotion 2021 [66][67] B. Shmueli et al.
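The Sentiment140 entry above describes labels obtained by distant supervision from emoticons rather than by hand annotation. A minimal sketch of that idea follows: a tweet containing only positive emoticons is labeled positive, one containing only negative emoticons is labeled negative, and the emoticons are then stripped so a classifier cannot simply memorize them. The emoticon lists and the function name `label_by_emoticon` are illustrative assumptions and are not taken from the dataset's own tooling.

```python
# Illustrative emoticon lists (an assumption, not the dataset's exact lists).
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def label_by_emoticon(tweet):
    """Return (cleaned_text, label), or None if the tweet gives no clear signal."""
    has_positive = any(emoticon in tweet for emoticon in POSITIVE)
    has_negative = any(emoticon in tweet for emoticon in NEGATIVE)
    # Skip tweets with no emoticon, or with conflicting emoticons.
    if has_positive == has_negative:
        return None
    label = "positive" if has_positive else "negative"
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        text = text.replace(emoticon, "")
    # Collapse the whitespace left behind by the removed emoticons.
    return " ".join(text.split()), label


if __name__ == "__main__":
    for tweet in ["Loving the new album :)", "Missed my flight :(", "Just landed"]:
        print(label_by_emoticon(tweet))
```

Labels produced this way are noisy, which is the usual trade-off of distant supervision: large volumes of labeled data at the cost of some mislabeled examples.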

Dialogues

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
NPS Chat Corpus Posts from age-specific online chat rooms. Hand privacy masked, tagged for part of speech and dialogue-act. ~ 500,000 XML NLP, programming, linguistics 2007 [68] Forsyth, E., Lin, J., & Martell, C.
Twitter Triple Corpus A-B-A triples extracted from Twitter. 4,232 Text NLP 2016 [69] Sordoni, A. et al.
UseNet Corpus UseNet forum postings. Anonymized e-mails and URLs. Omitted documents with lengths <500 words or >500,000 words, or that were <90% English. 7 billion Text 2011 [70] Shaoul, C., & Westbury C.
NUS SMS Corpus SMS messages collected between two users, with timing analysis. ~ 10,000 XML NLP 2011 [71] KAN, M
Reddit All Comments Corpus All Reddit comments (as of 2015). ~ 1.7 billion JSON NLP, research 2015 [72] Stuck_In_the_Matrix
Ubuntu Dialogue Corpus Dialogues extracted from Ubuntu chat stream on IRC. 930 thousand dialogues, 7.1 million utterances CSV Dialogue Systems Research 2015 [73] Lowe, R. et al.
Dialog State Tracking Challenge The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems. Transcription of spoken dialogs with labelling DSTC2 contains ~3.2k calls – DSTC3 contains ~2.3k calls JSON Dialogue state tracking 2014 [74] Henderson, Matthew and Thomson, Blaise and Williams, Jason D
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
FreeLaw Filtered data from Court Listener, part of the FreeLaw project. Cleaned and normalized text 4,940,710 JSON NLP, linguistics 2020 [75] T. Hoppe
Pile of Law Corpus of legal and administrative data Cleaned, normalized, and privatized ~50,000,000 JSON NLP, linguistics, sentiment 2022 [76][77] L. Zheng; N. Guha; B. Anderson; P. Henderson; D. Ho
Caselaw Access Project All official, book-published state and federal United States case law: every volume or case designated as an official report of decisions by a court within the United States. Cleaned and normalized text ~10,000 JSON NLP, linguistics 2022 [78] A. Aizman; S. Chapman; J. Cushman; K. Dulin; H. Eidolon; et al.

Other text

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Web of Science Dataset Hierarchical Datasets for Text Classification None. 46,985 Text Classification, Categorization 2017 [79][80] K. Kowsari et al.
Legal Case Reports Federal Court of Australia cases from 2006 to 2009. None. 4,000 Text Summarization, citation analysis 2012 [81][82] F. Galgani et al.
Blogger Authorship Corpus Blog entries of 19,320 people from blogger.com. Blogger self-provided gender, age, industry, and astrological sign. 681,288 Text Sentiment analysis, summarization, classification 2006 [83][84] J. Schler et al.
Social Structure of Facebook Networks Large dataset of the social structure of Facebook. None. 100 colleges covered Text Network analysis, clustering 2012 [85][86] A. Traud et al.
Dataset for the Machine Comprehension of Text Stories and associated questions for testing comprehension of text. None. 660 Text Natural language processing, machine comprehension 2013 [87][88] M. Richardson et al.
The Penn Treebank Project Naturally occurring text annotated for linguistic structure. Text is parsed into semantic trees. ~ 1M words Text Natural language processing, summarization 1995 [89][90] M. Marcus et al.
DEXTER Dataset Task given is to determine, from features given, which articles are about corporate acquisitions. Features extracted include word stems. Distractor features included. 2600 Text Classification 2008 [91] Reuters
Google Books N-grams N-grams from a very large corpus of books None. 2.2 TB of text Text Classification, clustering, regression 2011 [92][93] Google
Personae Corpus Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. In addition to normal texts, syntactically annotated texts are given. 145 Text Classification, regression 2008 [94][95] K. Luyckx et al.
PushShift Archives of social media websites, including Reddit, Twitter, and Hacker News. Text extracted and normalized from WARCs ~100,000,000 posts JSON NLP, sentiment, linguistics 2022 [96][97] J. Baumgartner
SEC Filings EDGAR | Company Filings Text extracted. csv NLP
CNAE-9 Dataset Categorization task for free text descriptions of Brazilian companies. Word frequency has been extracted. 1080 Text Classification 2012 [98][99] P. Ciarelli et al.
Sentiment Labeled Sentences Dataset 3000 sentiment labeled sentences. Sentiment of each sentence has been hand labeled as positive or negative. 3000 Text Classification, sentiment analysis 2015 [100][101] D. Kotzias
BlogFeedback Dataset Dataset to predict the number of comments a post will receive based on features of that post. Many features of each post extracted. 60,021 Text Regression 2014 [102][103] K. Buza
PubMed Central PubMed® comprises more than 35 million citations for biomedical literature from MEDLINE, life science journals, and online books. None 35 Million Text NLP
USPTO The United States Patent and Trademark Office Text NLP
PhilPapers Open access collection of philosophy publications Text NLP
Book Corpus A popular large-scale text corpus. None Text NLP 2015 [104] Zhu, Yukun, et al.
Stanford Natural Language Inference (SNLI) Corpus Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. Entailment class labels, syntactic parsing by the Stanford PCFG parser 570,000 Text Natural language inference/recognizing textual entailment 2015 [105] S. Bowman et al.
DSL Corpus Collection (DSLCC) A multilingual collection of short excerpts of journalistic texts in similar languages and dialects. None 294,000 phrases Text Discriminating between similar languages 2017 [106] Tan, Liling et al.
Urban Dictionary Dataset Corpus of words, votes and definitions User names anonymised 2,580,925 CSV NLP, Machine comprehension 2016 May [107] Anonymous
T-REx Wikipedia abstracts aligned with Wikidata entities Alignment of Wikidata triples with Wikipedia abstracts 11M aligned triples JSON and NIF NLP, Relation Extraction 2018 [108] H. Elsahar et al.
General Language Understanding Evaluation (GLUE) Benchmark of nine tasks Various ~1M sentences and sentence pairs NLU 2018 [109][110][111] Wang et al.
Contract Understanding Atticus Dataset (CUAD) (formerly known as Atticus Open Contract Dataset (AOK)) Dataset of legal contracts with rich expert annotations ~13,000 labels CSV and PDF Natural language processing, QnA 2021 The Atticus Project
Vietnamese Image Captioning Dataset (UIT-ViIC) Vietnamese Image Captioning Dataset 19,250 captions for 3,850 images CSV and PDF Natural language processing, Computer vision 2020 [112] Lam et al.
Vietnamese Names annotated with Genders (UIT-ViNames) Vietnamese Names annotated with Genders 26,850 Vietnamese full names annotated with genders CSV Natural language processing 2020 [113] To et al.
Vietnamese Constructive and Toxic Speech Detection Dataset (UIT-ViCTSD) Vietnamese Constructive and Toxic Speech Detection Dataset 10,000 Vietnamese users' comments on online newspapers on 10 domains CSV Natural Language Processing 2021 [114] Nguyen et al.
PG-19 A set of books extracted from the Project Gutenberg books library Text Natural Language Processing 2019 Jack W et al.
Deepmind Mathematics Mathematical question and answer pairs. Text Natural Language Processing 2018 [115] D Saxton et al.
Anna's Archive A comprehensive archive of published books and papers None 100,356,641 Text, epub, PDF Natural Language Processing 2024

Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.

Speech

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Switchboard-1 Conversational speech over telephone. 260 hours of speech, from 543 speakers (302 male, 241 female) from across the United States, for around 2,400 two-sided telephone conversations, collected by Texas Instruments in 1990-1991. audio, text transcript, word-level timestamps, phonetic transcriptions speech recognition, phonetic transcription. 1992 (2000) [116][117] NIST
Hub5'00 Conversational speech over telephone. 260 hours of speech, from 543 speakers (302 male, 241 female) from across the United States, for around 2,400 two-sided telephone conversations, at ~3 million words. Collected by Texas Instruments in 1990-1991. audio, text transcript, word-level timestamps, phonetic transcriptions speech recognition, phonetic transcription. The most commonly used test set for this dataset is called "Hub5'00". 1992 (2000) [118][119] NIST
Zero Resource Speech Challenge 2015 Spontaneous speech (English), Read speech (Xitsonga). None, raw WAV files. English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers WAV (audio only) Unsupervised discovery of speech features/subword units/word units 2015 [120][121] Versteegh et al.
Parkinson Speech Dataset Multiple recordings of people with and without Parkinson's Disease. Voice features extracted, disease scored by physician using unified Parkinson's disease rating scale. 1,040 Text Classification, regression 2013 [122][123] B. E. Sakar et al.
Spoken Arabic Digits Spoken Arabic digits from 44 male and 44 female. Time-series of mel-frequency cepstrum coefficients. 8,800 Text Classification 2010 [124][125] M. Bedda et al.
ISOLET Dataset Spoken letter names. Features extracted from sounds. 7797 Text Classification 1994 [126][127] R. Cole et al.
Japanese Vowels Dataset Nine male speakers uttered two Japanese vowels successively. Applied 12-degree linear prediction analysis to it to obtain a discrete-time series with 12 cepstrum coefficients. 640 Text Classification 1999 [128][129] M. Kudo et al.
Parkinson's Telemonitoring Dataset Multiple recordings of people with and without Parkinson's Disease. Sound features extracted. 5875 Text Classification 2009 [130][131] A. Tsanas et al.
TIMIT Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. Speech is lexically and phonemically transcribed. 6300 Text Speech recognition, classification. 1986 [132][133] J. Garofolo et al.
Arabic Speech Corpus A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level. Speech is orthographically and phonetically transcribed with stress marks. ~1900 Text, WAV Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education. 2016 [134] N. Halabi
Common Voice A public domain database of crowdsourced data across a wide range of dialects. Validation by other users. English: 1,118 hours MP3 with corresponding text files Speech recognition 2017 June (2019 December) [135] Mozilla
LJSpeech A single-speaker corpus of English public-domain audiobook recordings, split into short clips at punctuation marks. Quality check, normalized transcription alongside the original. 13,100 CSV, WAV Speech synthesis 2017 [136] Keith Ito, Linda Johnson
Arabic Speech Commands Dataset Collected from 30 contributors and grouped into 40 keywords. Raw WAV files 12,000 WAV, CSV Speech recognition, keyword spotting 2021 [137] Abdulkader Ghandoura

Music

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Geographic Origin of Music Data Set Audio features of music samples from different locations. Audio features extracted using MARSYAS software. 1,059 Text Geographic classification, clustering 2014 [138][139] F. Zhou et al.
Million Song Dataset Audio features from one million different songs. Audio features extracted. 1M Text Classification, clustering 2011 [140][141] T. Bertin-Mahieux et al.
MUSDB18 Multi-track popular music recordings Raw audio 150 MP4, WAV Source Separation 2017 [142] Z. Rafii et al.
Free Music Archive Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. Raw audio and audio features. 106,574 Text, MP3 Classification, recommendation 2017 [143] M. Defferrard et al.
Bach Choral Harmony Dataset Bach chorale chords. Audio features extracted. 5665 Text Classification 2014 [144][145] D. Radicioni et al.

Other sounds

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
UrbanSound Labeled sound recordings of sounds like air conditioners, car horns and children playing. Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. 1,059 Sound (WAV) Classification 2014 [146][147] J. Salamon et al.
AudioSet 10-second sound snippets from YouTube videos, and an ontology of over 500 labels. 128-d PCA'd VGG-ish features every 1 second. 2,084,320 Text (CSV) and TensorFlow Record files Classification 2017 [148] J. Gemmeke et al., Google
Bird Audio Detection challenge Audio from environmental monitoring stations, plus crowdsourced recordings 17,000+ Classification 2016 (2018) [149][150] Queen Mary University and IEEE Signal Processing Society
WSJ0 Hipster Ambient Mixtures Audio from WSJ0 mixed with noise recorded in the San Francisco Bay Area Noise clips matched to WSJ0 clips 28,000 Sound (WAV) Audio source separation 2019 [151] Wichern, G., et al., Whisper and MERL
Clotho 4,981 audio samples of 15 to 30 seconds long, each audio sample having five different captions of eight to 20 words long. 24,905 Sound (WAV) and text (CSV) Automated audio captioning 2020 [152][153] K. Drossos, S. Lipping, and T. Virtanen

Signal data

These datasets contain electrical signal information that requires some form of signal processing for further analysis.

Electrical

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Witty Worm Dataset Dataset detailing the spread of the Witty worm and the infected computers. Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. 55,909 IP addresses Text Classification 2004 [154][155] Center for Applied Internet Data Analysis
Cuff-Less Blood Pressure Estimation Dataset Cleaned vital signals from human patients which can be used to estimate blood pressure. 125 Hz vital signs have been cleaned. 12,000 Text Classification, regression 2015 [156][157] M. Kachuee et al.
Gas Sensor Array Drift Dataset Measurements from 16 chemical sensors utilized in simulations for drift compensation. Extensive number of features given. 13,910 Text Classification 2012 [158][159] A. Vergara
Servo Dataset Data covering the nonlinear relationships observed in a servo-amplifier circuit. Levels of various components as a function of other components are given. 167 Text Regression 1993 [160][161] K. Ullrich
UJIIndoorLoc-Mag Dataset Indoor localization database to test indoor positioning systems. Data is magnetic field based. Train and test splits given. 40,000 Text Classification, regression, clustering 2015 [162][163] D. Rambla et al.
Sensorless Drive Diagnosis Dataset Electrical signals from motors with defective components. Statistical features extracted. 58,508 Text Classification 2015 [164][165] M. Bator

Motion-tracking

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) People performing five standard actions while wearing motion trackers. None. 165,632 Text Classification 2013 [166][167] Pontifical Catholic University of Rio de Janeiro
Gesture Phase Segmentation Dataset Features extracted from video of people doing various gestures. Features extracted aim at studying gesture phase segmentation. 9900 Text Classification, clustering 2014 [168][169] R. Madeo et al.
Vicon Physical Action Data Set Dataset 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. Many parameters recorded by 3D tracker. 3000 Text Classification 2011 [170][171] T. Theodoridis
Daily and Sports Activities Dataset Motor sensor data for 19 daily and sports activities. Many sensors given, no preprocessing done on signals. 9120 Text Classification 2013 [172][173] B. Barshan et al.
Human Activity Recognition Using Smartphones Dataset Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. Actions performed are labeled, all signals preprocessed for noise. 10,299 Text Classification 2012 [174][175] J. Reyes-Ortiz et al.
Australian Sign Language Signs Australian sign language signs captured by motion-tracking gloves. None. 2565 Text Classification 2002 [176][177] M. Kadous
Weight Lifting Exercises monitored with Inertial Measurement Units Five variations of the biceps curl exercise monitored with IMUs. Some statistics calculated from raw data. 39,242 Text Classification 2013 [178][179] W. Ugulino et al.
sEMG for Basic Hand movements Dataset Two databases of surface electromyographic signals of 6 hand movements. None. 3000 Text Classification 2014 [180][181] C. Sapsanis et al.
REALDISP Activity Recognition Dataset Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. None. 1419 Text Classification 2014 [181][182] O. Banos et al.
Heterogeneity Activity Recognition Dataset Data from multiple different smart devices for humans performing various activities. None. 43,930,257 Text Classification, clustering 2015 [183][184] A. Stisen et al.
Indoor User Movement Prediction from RSS Data Temporal wireless network data that can be used to track the movement of people in an office. None. 13,197 Text Classification 2016 [185][186] D. Bacciu
PAMAP2 Physical Activity Monitoring Dataset 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. None. 3,850,505 Text Classification 2012 [187] A. Reiss
OPPORTUNITY Activity Recognition Dataset Benchmark for human activity recognition algorithms using data from wearable, object, and ambient sensors. None. 2551 Text Classification 2012 [188][189] D. Roggen et al.
Real World Activity Recognition Dataset Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. None. 3,150,000 (per sensor) Text Classification 2016 [190] T. Sztyler et al.
Toronto Rehab Stroke Pose Dataset 3D human pose estimates (Kinect) of stroke patients and healthy participants performing a set of tasks using a stroke rehabilitation robot. None. 10 healthy persons and 9 stroke survivors (3500–6000 frames per person) CSV Classification 2017 [191][192][193] E. Dolatabadi et al.
Corpus of Social Touch (CoST) 7805 gesture captures of 14 different social touch gestures performed by 31 subjects. The gestures were performed in three variations: gentle, normal and rough, on a pressure sensor grid wrapped around a mannequin arm. Touch gestures performed are segmented and labeled. 7805 gesture captures CSV Classification 2016 [194][195] M. Jung et al.
Close

Other signals

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Wine Dataset Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. 13 properties of each wine are given 178 Text Classification, regression 1991 [196][197] M. Forina et al.
Combined Cycle Power Plant Data Set Data from various sensors within a power plant running for 6 years. None 9568 Text Regression 2014 [198][199] P. Tufekci et al.
Close

Physical data


Datasets from physical systems.

High-energy physics

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
HIGGS Dataset Monte Carlo simulations of particle accelerator collisions. 28 features of each collision are given. 11M Text Classification 2014 [200][201][202] D. Whiteson
HEPMASS Dataset Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise. 28 features of each collision are given. 10,500,000 Text Classification 2016 [201][202][203] D. Whiteson
Close

Systems

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Yacht Hydrodynamics Dataset Yacht performance based on dimensions. Six features are given for each yacht. 308 Text Regression 2013 [204][205] R. Lopez
Robot Execution Failures Dataset 5 data sets that center around robotic failure to execute common tasks. Integer valued features such as torque and other sensor measurements. 463 Text Classification 1999 [206] L. Seabra et al.
Pittsburgh Bridges Dataset Design description is given in terms of several properties of various bridges. Various bridge features are given. 108 Text Classification 1990 [207][208] Y. Reich et al.
Automobile Dataset Data about automobiles, their insurance risk, and their normalized losses. Car features extracted. 205 Text Regression 1987 [209][210] J. Schlimmer et al.
Auto MPG Dataset MPG data for cars. Eight features of each car given. 398 Text Regression 1993 [211] Carnegie Mellon University
Energy Efficiency Dataset Heating and cooling requirements given as a function of building parameters. Building parameters given. 768 Text Classification, regression 2012 [212][213] A. Xifara et al.
Airfoil Self-Noise Dataset A series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections. Data about frequency, angle of attack, etc., are given. 1503 Text Regression 2014 [214] R. Lopez
Challenger USA Space Shuttle O-Ring Dataset Attempt to predict O-ring problems given past Challenger data. Several features of each flight, such as launch temperature, are given. 23 Text Regression 1993 [215][216] D. Draper et al.
Statlog (Shuttle) Dataset NASA space shuttle datasets. Nine features given. 58,000 Text Classification 2002 [217] NASA
Close

Astronomy

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Volcanoes on Venus – JARtool experiment Dataset Venus images returned by the Magellan spacecraft. Images are labeled by humans. not given Images Classification 1991 [218][219] M. Burl
MAGIC Gamma Telescope Dataset Monte Carlo generated high-energy gamma particle events. Numerous features extracted from the simulations. 19,020 Text Classification 2007 [219][220] R. Bock
Solar Flare Dataset Measurements of the number of certain types of solar flare events occurring in a 24-hour period. Many solar flare-specific features are given. 1389 Text Regression, classification 1989 [221] G. Bradshaw
CAMELS Multifield Dataset 2D maps and 3D grids from thousands of N-body and state-of-the-art hydrodynamic simulations spanning a broad range in the value of the cosmological and astrophysical parameters. Each map and grid has 6 cosmological and astrophysical parameters associated with it. 405,000 2D maps and 405,000 3D grids 2D maps and 3D grids Regression 2021 [222] Francisco Villaescusa-Navarro et al.
Close

Earth science

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Volcanoes of the World Volcanic eruption data for all known volcanic events on earth. Details such as region, subregion, tectonic setting, dominant rock type are given. 1535 Text Regression, classification 2013 [223] E. Venzke et al.
Seismic-bumps Dataset Seismic activities from a coal mine. Seismic activity was classified as hazardous or not. 2584 Text Classification 2013 [224][225] M. Sikora et al.
CAMELS-US Catchment hydrology dataset with hydrometeorological timeseries and various attributes see Reference 671 CSV, Text, Shapefile Regression 2017 [226][227] N. Addor et al. / A. Newman et al.
CAMELS-Chile Catchment hydrology dataset with hydrometeorological timeseries and various attributes see Reference 516 CSV, Text, Shapefile Regression 2018 [228] C. Alvarez-Garreton et al.
CAMELS-Brazil Catchment hydrology dataset with hydrometeorological timeseries and various attributes see Reference 897 CSV, Text, Shapefile Regression 2020 [229] V. Chagas et al.
CAMELS-GB Catchment hydrology dataset with hydrometeorological timeseries and various attributes see Reference 671 CSV, Text, Shapefile Regression 2020 [230] G. Coxon et al.
CAMELS-Australia Catchment hydrology dataset with hydrometeorological timeseries and various attributes see Reference 222 CSV, Text, Shapefile Regression 2021 [231] K. Fowler et al.
LamaH-CE Catchment hydrology dataset with hydrometeorological timeseries and various attributes see Reference 859 CSV, Text, Shapefile Regression 2021 [232] C. Klingler et al.
Close

Other physical

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Concrete Compressive Strength Dataset Dataset of concrete properties and compressive strength. Nine features are given for each sample. 1030 Text Regression 2007 [233][234] I. Yeh
Concrete Slump Test Dataset Concrete slump flow given in terms of properties. Features of concrete given such as fly ash, water, etc. 103 Text Regression 2009 [235][236] I. Yeh
Musk Dataset Predict if a molecule, given the features, will be a musk or a non-musk. 168 features given for each molecule. 6598 Text Classification 1994 [237] Arris Pharmaceutical Corp.
Steel Plates Faults Dataset Steel plates of 7 different types. 27 features given for each sample. 1941 Text Classification 2010 [238] Semeion Research Center
Noble Metal Monometallic Nanoparticles Datasets Processing and structural features of monometallic nanoparticles, labels being formation energy. 85-182 features given for each sample. 425 to 4000 CSV Regression 2017 to 2023 [239][240][241][242][243][244] A. Barnard and G. Opletal
Noble Metal Bimetallic Nanoparticles Datasets Processing and structural features of bimetallic nanoparticles, labels being formation energy. 922 features given for each sample. 138147 to 162770 CSV Regression 2023 [245][246][247][248][249][250][251][252][253][254][255][256] J. Ting et al.
AuPdPt Trimetallic Nanoparticles Dataset Processing and structural features of AuPdPt nanoparticles, labels being formation energy. 1958 features given for each sample. 48136 CSV Regression 2023 [257] K. Lu et al.
Close

Biological data


Datasets from biological systems.

Human

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Age Dataset A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people. Public domain. A five-step method to infer birth and death years, gender, and occupation from community-submitted data to all language versions of the Wikipedia project. 1,223,009 Text Regression, Classification 2022 Paper[258], Dataset[259] Amoradnejad et al.
Synthetic Fundus Dataset[260] Photorealistic retinal images and vessel segmentations. Public domain. 2500 images with 1500*1152 pixels useful for segmentation and classification of veins and arteries on a single background. 2500 Images Classification, Segmentation 2020 [261] C. Valenti et al.
EEG Database Study to examine EEG correlates of genetic predisposition to alcoholism. Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second. 122 Text Classification 1999 [262] H. Begleiter
P300 Interface Dataset Data from nine subjects collected using P300-based brain-computer interface for disabled subjects. Split into four sessions for each subject. MATLAB code given. 1,224 Text Classification 2008 [263][264] U. Hoffman et al.
Heart Disease Data Set Attributes of patients with and without heart disease. 75 attributes given for each patient with some missing values. 303 Text Classification 1988 [265][266] A. Janosi et al.
Breast Cancer Wisconsin (Diagnostic) Dataset Dataset of features of breast masses. Diagnosis by physician is given. 10 features for each sample are given. 569 Text Classification 1995 [267][268] W. Wolberg et al.
National Survey on Drug Use and Health Large scale survey on health and drug use in the United States. None. 55,268 Text Classification, regression 2012 [269] United States Department of Health and Human Services
Lung Cancer Dataset Lung cancer dataset without attribute definitions. 56 features are given for each case. 32 Text Classification 1992 [270][271] Z. Hong et al.
Arrhythmia Dataset Data for a group of patients, of which some have cardiac arrhythmia. 276 features for each instance. 452 Text Classification 1998 [272][273] H. Altay et al.
Diabetes 130-US hospitals for years 1999–2008 Dataset 10 years of readmission data across 130 US hospitals for patients with diabetes. Many features of each readmission are given. 100,000 Text Classification, clustering 2014 [274][275] J. Clore et al.
Diabetic Retinopathy Debrecen Dataset Features extracted from images of eyes with and without diabetic retinopathy. Features extracted and conditions diagnosed. 1151 Text Classification 2014 [276][277] B. Antal et al.
Diabetic Retinopathy Messidor Dataset Methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology (MESSIDOR). Features retinopathy grade and risk of macular edema. 1200 Images, Text Classification, Segmentation 2008 [278][279] Messidor Project
Liver Disorders Dataset Data for people with liver disorders. Seven biological features given for each patient. 345 Text Classification 1990 [280][281] Bupa Medical Research Ltd.
Thyroid Disease Dataset 10 databases of thyroid disease patient data. None. 7200 Text Classification 1987 [282][283] R. Quinlan
Mesothelioma Dataset Mesothelioma patient data. Large number of features, including asbestos exposure, are given. 324 Text Classification 2016 [284][285] A. Tanrikulu et al.
Parkinson's Vision-Based Pose Estimation Dataset 2D human pose estimates of Parkinson's patients performing a variety of tasks. Camera shake has been removed from trajectories. 134 Text Classification, regression 2017 [286][287][288] M. Li et al.
KEGG Metabolic Reaction Network (Undirected) Dataset Network of metabolic pathways. A reaction network and a relation network are given. Detailed features for each network node and pathway are given. 65,554 Text Classification, clustering, regression 2011 [289] M. Naeem et al.
Modified Human Sperm Morphology Analysis Dataset (MHSMA) Human sperm images from 235 patients with male factor infertility, labeled for normal or abnormal sperm acrosome, head, vacuole, and tail. Cropped around single sperm head. Magnification normalized. Training, validation, and test set splits created. 1,540 .npy files Classification 2019 [290][291] S. Javadi and S.A. Mirroshandel
Close
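
The sampling figures in the EEG Database row above (64 electrodes, 256 Hz, 1 second) can be checked with a little arithmetic. The following is a minimal sketch, assuming one trial is exactly one second of signal on every electrode; the function names are illustrative and are not part of the dataset.

```python
def sample_period_ms(rate_hz: float) -> float:
    """Time between consecutive samples, in milliseconds."""
    return 1000.0 / rate_hz


def samples_per_trial(rate_hz: int, seconds: float, channels: int) -> int:
    """Total number of values recorded in one trial across all channels."""
    return int(rate_hz * seconds) * channels


if __name__ == "__main__":
    # 1000 / 256 = 3.90625 ms, which the table rounds to 3.9 ms.
    print(round(sample_period_ms(256), 1))
    # 256 samples per electrode times 64 electrodes.
    print(samples_per_trial(256, 1.0, 64))
```

Under these assumptions each one-second trial holds 16,384 values, which is the per-trial size a loader would need to handle.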

Animal

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Abalone Dataset Physical measurements of Abalone. Weather patterns and location are also given. None. 4177 Text Regression 1995 [292] Marine Research Laboratories – Taroona
Zoo Dataset Artificial dataset covering 7 classes of animals. Animals are classed into 7 categories and features are given for each. 101 Text Classification 1990 [293] R. Forsyth
Demospongiae Dataset Data about marine sponges. 503 sponges in the Demosponge class are described by various features. 503 Text Classification 2010 [294] E. Armengol et al.
Farm animals data PLF data inventory (cows, pigs; location, acceleration, etc.). Labeled datasets. List is constantly updated Text Classification 2020 [295] V. Bloch
Splice-junction Gene Sequences Dataset Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. None. 3190 Text Classification 1992 [271] G. Towell et al.
Mice Protein Expression Dataset Expression levels of 77 proteins measured in the cerebral cortex of mice. None. 1080 Text Classification, Clustering 2015 [296][297] C. Higuera et al.
Close

Fungi

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
UCI Mushroom Dataset Mushroom attributes and classification. Many properties of each mushroom are given. 8124 Text Classification 1987 [298] J. Schlimmer
Secondary Mushroom Dataset Mushroom attributes and classification. Simulated data from larger and more realistic primary mushroom entries. Fully reproducible. 61069 Text Classification 2020 [299][300] D. Wagner et al.
Close

Plant

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Forest Fires Dataset Forest fires and their properties. 13 features of each fire are extracted. 517 Text Regression 2008 [301][302] P. Cortez et al.
Iris Dataset Three types of iris plants are described by 4 different attributes. None. 150 Text Classification 1936 [303][304] R. Fisher
Plant Species Leaves Dataset Sixteen samples of leaf each of one-hundred plant species. Shape descriptor, fine-scale margin, and texture histograms are given. 1600 Text Classification 2012 [305][306] J. Cope et al.
Soybean Dataset Database of diseased soybean plants. 35 features for each plant are given. Plants are classified into 19 categories. 307 Text Classification 1988 [307] R. Michalski et al.
Seeds Dataset Measurements of geometrical properties of kernels belonging to three different varieties of wheat. None. 210 Text Classification, clustering 2012 [308][309] Charytanowicz et al.
Covertype Dataset Data for predicting forest cover type strictly from cartographic variables. Many geographical features given. 581,012 Text Classification 1998 [310][311] J. Blackard et al.
Abscisic Acid Signaling Network Dataset Data for a plant signaling network. Goal is to determine set of rules that governs the network. None. 300 Text Causal-discovery 2008 [312] J. Jenkens et al.
Folio Dataset 20 photos of leaves for each of 32 species. None. 637 Images, text Classification, clustering 2015 [313][314] T. Munisami et al.
Oxford Flower Dataset 17 category dataset of flowers. Train/test splits, labeled images. 1360 Images, text Classification 2006 [315][316] M-E Nilsback et al.
Plant Seedlings Dataset 12 category dataset of plant seedlings. Labelled images, segmented images. 5544 Images Classification, detection 2017 [317] Giselsson et al.
Fruits-360 Database with images of 131 fruits and vegetables. 100x100 pixels, white background. 90483 Images (jpg) Classification 2017–2024 [318] Mihai Oltean
Weed-ID.App Database with 1,025 species, 13,500+ images, and 120,000+ characteristics. Varying size and background. Labeled by PhD botanist. 13,500 Images, text Classification 1999–2024 [319] Richard Old
CottonWeedDet3 Dataset A 3-class weed detection dataset for cotton cropping systems. 3 species of weeds. 848 Images Classification 2022 [320] Rahman et al.
Close

Microbe

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Ecoli Dataset Protein localization sites. Various features of the protein localizations sites are given. 336 Text Classification 1996 [321][322] K. Nakai et al.
MicroMass Dataset Identification of microorganisms from mass-spectrometry data. Various mass spectrometer features. 931 Text Classification 2013 [323][324] P. Mahe et al.
Yeast Dataset Predictions of Cellular localization sites of proteins. Eight features given per instance. 1484 Text Classification 1996 [325][326] K. Nakai et al.
Close

Drug discovery

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Tox21 Dataset Prediction of outcome of biological assays. Chemical descriptors of molecules are given. 12707 Text Classification 2016 [327] A. Mayr et al.
Close

Anomaly data

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Numenta Anomaly Benchmark (NAB) Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. None 50+ files CSV Anomaly detection 2016 (continually updated) [328] Numenta
Skoltech Anomaly Benchmark (SKAB) Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed. There are two markups for Outlier detection (point anomalies) and Changepoint detection (collective anomalies) problems. 30+ files (v0.9) CSV Anomaly detection 2020 (continually updated) [329][330] Iurii D. Katser and Vyacheslav O. Kozitsin
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study Most data files are adapted from UCI Machine Learning Repository data, some are collected from the literature. Treated for missing values, numerical attributes only, different percentages of anomalies, labels. 1000+ files ARFF Anomaly detection 2016 (possibly updated with new datasets and/or results) [331] Campos et al.
Close
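
Benchmarks such as NAB supply ordered, timestamped, single-valued series. As a rough illustration of the anomaly detection task (not the scoring method of any benchmark listed here), the sketch below flags points that lie far from the mean of a trailing window. The window size, threshold, function name, and sample series are assumptions chosen for the example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def rolling_zscore_anomalies(
    values: Sequence[float], window: int = 5, threshold: float = 3.0
) -> List[int]:
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` population standard deviations."""
    flagged: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up series with one obvious spike at index 6.
    series = [10.0, 10.1, 9.9, 10.0, 10.2, 10.1, 25.0, 10.0, 9.8]
    print(rolling_zscore_anomalies(series))
```

A trailing window is used so that each point is judged only against earlier data, which matches the streaming setting these time-series benchmarks target. Real detectors are usually more robust to the spike contaminating later windows.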

Question answering data


This section includes datasets that deal with structured data.

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
DBpedia Neural Question Answering (DBNQA) Dataset A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. 894,499 Question-query pairs Question Answering 2018 [332][333] Hartmann, Soru, and Marx et al.
Vietnamese Question Answering Dataset (UIT-ViQuAD) A large collection of Vietnamese questions for evaluating MRC models. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. 23,074 Question-answer pairs Question Answering 2020 [334] Nguyen et al.
Vietnamese Multiple-Choice Machine Reading Comprehension Corpus (ViMMRC) A collection of Vietnamese multiple-choice questions for evaluating MRC models. This corpus includes 2,783 Vietnamese multiple-choice questions. 2,783 Question-answer pairs Question Answering/Machine Reading Comprehension 2020 [335] Nguyen et al.
Open-Domain Question Answering Goes Conversational via Question Rewriting An end-to-end open-domain question answering dataset. This dataset includes 14,000 conversations with 81,000 question-answer pairs. Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source. Further details are provided in the project's GitHub repository and respective Hugging Face dataset card. Question Answering 2021 [336] Anantha and Vakulenko et al.
UnifiedQA Question-answer data Processed dataset Question Answering 2020 [337] Khashabi et al.
Close
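
The DBNQA entry above describes question-query pairs produced from templates. The sketch below shows the general idea of filling one question template and one SPARQL template with the same entity. The template text, the `<A>` placeholder syntax, and the example entity are assumptions made for illustration and are not taken from the dataset.

```python
from typing import Dict, List, Tuple

# Hypothetical template pair; "<A>" marks the slot to fill (an assumed syntax).
QUESTION_TEMPLATE = "Who is the author of <A>?"
QUERY_TEMPLATE = "SELECT ?x WHERE { <A> dbo:author ?x }"


def instantiate(
    question_tpl: str, query_tpl: str, entities: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Fill both templates once per entity.

    `entities` maps a human-readable label to a resource URI. The label goes
    into the question and the bracketed URI goes into the query.
    """
    pairs = []
    for label, uri in entities.items():
        question = question_tpl.replace("<A>", label)
        query = query_tpl.replace("<A>", f"<{uri}>")
        pairs.append((question, query))
    return pairs


if __name__ == "__main__":
    # Example entity chosen for illustration only.
    example = {"Dune": "http://dbpedia.org/resource/Dune_(novel)"}
    for question, query in instantiate(QUESTION_TEMPLATE, QUERY_TEMPLATE, example):
        print(question)
        print(query)
```

Pairing templates this way lets a small set of hand-checked templates expand into many training pairs, which is one plausible reading of how a corpus of this size is assembled.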

Dialog or instruction prompted data


This section includes datasets that ...

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Taskmaster "The Taskmaster corpus consists of THREE datasets, Taskmaster-1 (TM-1), Taskmaster-2 (TM-2), and Taskmaster-3 (TM-3), comprising over 55,000 spoken and written task-oriented dialogs in over a dozen domains."[338] Taskmaster-1: goal-oriented conversational dataset. It includes 13,215 task-based dialogs comprising six domains. Taskmaster-2: 17,289 dialogs in the seven domains (restaurants, food ordering, movies, hotels, flights, music and sports). Taskmaster-3: 23,757 movie ticketing dialogs. Taskmaster-1 and Taskmaster-2: conversation id, utterances, Instruction id. Taskmaster-3: conversation id, utterances, vertical, scenario, instructions. For further details check the project's GitHub repository or the Hugging Face dataset cards (taskmaster-1, taskmaster-2, taskmaster-3). Dialog/Instruction prompted 2019 [339] Byrne and Krishnamoorthi et al.
DrRepair A labeled dataset for program repair. Pre-processed data Check format details in the project's worksheet. Dialog/Instruction prompted 2020 [340] Michihiro et al.
Natural Instructions v2 Large dataset that covers a wider range of reasoning abilities. Each task consists of input/output and a task definition. Further information is provided in the GitHub repository of the project and the Hugging Face data card. Input/Output and task definition 2022 [341] Wang et al.
LAMBADA "LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word."[342] Information about this dataset's format is available in the HuggingFace dataset card and the project's website. The dataset can be downloaded here, and the rejected data here. 2016 [343] Paperno et al.
FLAN A re-preprocessed version of the FLAN dataset with updates since the original FLAN dataset was released is available in Hugging Face (test data, train data, validation data). The scripts to process the data are available in the GitHub repo mentioned in the paper: https://github.com/google-research/FLAN/tree/main/flan. Another FLAN GitHub repo was created as well. This is the one associated with the dataset card in Hugging Face. 2021 [344] Wei et al.
Close

Cybersecurity

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
MITRE ATT&CK ATT&CK is a globally accessible knowledge base of adversary tactics and techniques. Data can be downloaded from these two GitHub repositories: version 2.1 and version 2.0 [345] MITRE ATT&CK
CAPEC Common Attack Pattern Enumeration and Classification. Data can be downloaded from CAPEC's website: Mechanisms of Attack, Domains of Attack [346] CAPEC
CVE CVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services. Data can be downloaded from: Allitems [347] CVE
CWE Common Weakness Enumeration data. Data can be downloaded from: Software Development, Hardware Design[permanent dead link], Research Concepts [348] CWE
MalwareTextDB Annotated database of malware texts. The GitHub repository of the project contains the data to download. [349] Kiat et al.
USENIX Security Symposium proceedings Collection of security proceedings from USENIX Security Symposium – technical sessions from 1995 to 2022. This data is not pre-processed. 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022. [350] USENIX Security Symposium
APTNotes Collection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data. This data is not pre-processed. The GitHub repository of the project contains a file with links to the data stored in box. Data files can also be downloaded here. [351] APT Notes
arXiv Cryptography and Security papers Collection of articles about cybersecurity. This data is not pre-processed. All articles available here. [352] arXiv
Security eBooks for free Small collection of security eBooks, and security presentations publicly available. This data is not pre-processed. [353][354][355][356][357][358][359][360][361][362][363][364]
National Cyber Security strategy repository Repository of worldwide strategy documents about cybersecurity. This data is not pre-processed. [365]
Cyber Security Natural Language Processing Data about cybersecurity strategies from more than 75 countries. Tokenization, meaningless-frequent words removal. [366] Yanlin Chen, Yunjian Wei, Yifan Yu, Wen Xue, Xianya Qin
APT Reports collection Sample of APT reports, malware, technology, and intelligence collection. Raw and tokenized data available. All data is available in this GitHub repository. [citation needed] blackorbird
Offensive Language Identification Dataset (OLID) Data available in the project's website. Data is also available here. [367] Zampieri et al.
Cyber reports from the National Cyber Security Centre This data is not pre-processed. Threat reports, reports and advisory, news, blog-posts, speeches. Alternate list of reports. [368]
APT reports by Kaspersky This data is not pre-processed. [369]
The cyberwire This data is not pre-processed. Newsletters, podcasts, and stories. [370]
Databreaches news This data is not pre-processed. News, list of news from Aug 2022 to Feb 2023 [371]
Cybernews This data is not pre-processed. News, curated list of news [372]
Bleepingcomputer This data is not pre-processed. News [373]
Therecord This data is not pre-processed. Cybercrime news [374]
Hackread This data is not pre-processed. Hacking news [375]
Securelist This data is not pre-processed. APT reports, archive, DDOS reports, incidents, Kaspersky security bulletin, industrial threats, malware-reports, opinions, publications, research, and SAS. [376]
Stucco project The Stucco project collects data not typically integrated into security systems. This data is not pre-processed. Project's website with data information, Reviewed source with links to data sources [377]
Farsightsecurity Website with technical information, reports, and more about security topics. This data is not pre-processed Technical information, research, reports. [378]
Schneier Website with academic papers about security topics. This data is not pre-processed Papers per category, papers archive by date. [379]
Trendmicro Website with research, news, and perspectives about security topics. This data is not pre-processed. Reviewed list of Trendmicro research, news, and perspectives. [380]
The Hacker News News about cybersecurity topics. This data is not pre-processed data breaches, cyberattacks, vulnerabilities, malware news. [381]
Krebsonsecurity Security news and investigation This data is not pre-processed curated list of news [382]
Mitre Defend Matrix of Defend artifacts json files [383]
Mitre Atlas Mitre Atlas is a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems based on real-world observations. This data is not pre-processed [384]
Mitre Engage MITRE Engage is a framework for planning and discussing adversary engagement operations that empowers you to engage your adversaries and achieve your cybersecurity goals. This data is not pre-processed [385]
Hacking Tutorials This data is not pre-processed [386]
Close

Climate and sustainability

More information Dataset Name, Brief description ...
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
TCFD reports Database of company reports that include TCFD-related disclosures. This data is not pre-processed. Direct link to reports, Curated list of reports [387] TCFD Knowledge Hub
Corporate Social Responsibility Reports A listing of responsibility reports on the internet. This data is not pre-processed Curated list of reports [388] ResponsibilityReports
The Intergovernmental Panel on Climate Change (IPCC) A collection of comprehensive assessment reports about knowledge on climate change, its causes, potential impacts and response options. This data is not pre-processed. Reports, Curated list of reports [389] IPCC
Alliance for Research on Corporate Sustainability This data is not pre-processed Curated list of blog posts [390] ARCS
ESG corpus: Knowledge Hub of the Accounting for Sustainability This data is not pre-processed Guides, case studies, blogs, and reports & surveys. [391] Mehra et al.
CLIMATE-FEVER A dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate change collected on the internet. Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute or do not give enough information to validate the claim, totalling 7,675 claim-evidence pairs.[392] Dataset HF card, and project's GitHub repository. [393] Diggelmann et al.
Climate News dataset A dataset for NLP and climate change media researchers The dataset is made up of a number of data artifacts (JSON, JSONL & CSV text files & SQLite database) Climate news DB, Project's GitHub repository [394] ADGEfficiency
Climatext Climatext is a dataset for sentence-based climate change topic detection. HF dataset [395] University of Zurich
GreenBiz Collection of articles and news about climate and sustainability This data is not pre-processed Curated list of climate articles; curated list of sustainability articles [396]
Top research pre-prints in climate and sustainability List of pre-prints from researchers in the Reuters Hot List This data is not pre-processed Curated list of pre-prints [397] Maurice Tamman
ARCS This data is not pre-processed Curated list of corporate sustainability blogs [398]
GreenBiz Website with articles about climate and sustainability This data is not pre-processed [399] GreenBiz
CSRWIRE This data is not pre-processed Curated list of articles [400] CSRWIRE
CDP Articles about climate, water, and forests This data is not pre-processed [401] CDP
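
The claim-evidence structure described in the CLIMATE-FEVER row above can be illustrated with a small sketch. The class names, field names, and label strings below are hypothetical and chosen for illustration; they are not taken from the dataset's actual schema, and the claims are placeholders rather than real data. The sketch only shows how 1,535 claims with five evidence sentences each flatten into 7,675 claim-evidence pairs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NUM_CLAIMS = 1535          # claim count stated in the table row
EVIDENCE_PER_CLAIM = 5     # evidence sentences per claim stated in the table row


@dataclass
class Evidence:
    sentence: str
    label: str  # assumed label names: "SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO"


@dataclass
class Claim:
    text: str
    evidence: List[Evidence] = field(default_factory=list)


def to_pairs(claims: List[Claim]) -> List[Tuple[str, str, str]]:
    """Flatten claims into (claim, evidence sentence, label) triples."""
    return [(c.text, e.sentence, e.label) for c in claims for e in c.evidence]


# Placeholder claims; a real loader would read these from the published files.
claims = [
    Claim(
        text=f"claim {i}",
        evidence=[
            Evidence(f"sentence {i}-{j}", "NOT_ENOUGH_INFO")
            for j in range(EVIDENCE_PER_CLAIM)
        ],
    )
    for i in range(NUM_CLAIMS)
]
pairs = to_pairs(claims)
print(len(pairs))
```

Flattening to pairs is a common layout for training sentence-pair classifiers, since each pair can carry its own label.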

Code data

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
The Stack A 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. Filtered through license detection and deduplication. 6 TB, 51.76B files (prior to deduplication); 3 TB, 5.28B files (after). 358 programming languages. Parquet Language modeling, autocompletion, program synthesis. 2022 [402][403] D. Kocetkov, R. Li, L. Ben Allal, L. von Werra, H. de Vries
GitHub repositories This data is not pre-processed Curated list of repositories from GitHub: 61 62 63 64 65 66 67 68 69 70 71, 72, 73, 74, 75, 76, 77 101
IBM Public GitHub repositories This data is not pre-processed Curated list of repositories from GitHub
RedHat Public GitHub repositories This data is not pre-processed Curated list of repositories from GitHub
StackExchange Public Archive.org files This data is not pre-processed Curated list of files from Archive.org
Gitlab Public repositories This data is not pre-processed Curated list of repositories from Gitlab: 1 2
Ansible Collections public repositories This data is not pre-processed Curated list of repositories from GitHub.
CodeParrot GitHub Code Dataset This data is not pre-processed Curated list of repositories from Hugging Face: 1 2 3 4 5 6 7 8 9 10
OKD The Community Distribution of Kubernetes that powers Red Hat OpenShift This data is not pre-processed List of GitHub repositories of the project
OpenShift The developer and operations friendly Kubernetes distro List of GitHub repositories of the project
Kubernetes This data is not pre-processed List of GitHub repositories of the project
Red Hat Developer GitHub home of the Red Hat Developer program This data is not pre-processed List of GitHub repositories of the project
Red Hat Workshops This data is not pre-processed List of GitHub repositories of the project
Kubernetes SIGs This data is not pre-processed List of GitHub repositories of the project
Konveyor This data is not pre-processed List of GitHub repositories of the project
RedHat Marketplace This data is not pre-processed List of GitHub repositories of the project
Redhat blog This data is not pre-processed [404]
Kubernetes io This data is not pre-processed [405]
Docs Openshift This data is not pre-processed [406]
cncf io This data is not pre-processed [407]
Kubernetes presentations List of publicly available Kubernetes presentations This data is not pre-processed data link
Red Hat Open Innovation Labs This data is not pre-processed List of GitHub repositories of the project
Red Hat Demos This data is not pre-processed List of GitHub repositories of the project
Red Hat OpenShift Online This data is not pre-processed List of GitHub repositories of the project
Software Collections This data is not pre-processed List of GitHub repositories of the project
Red Hat Insights This data is not pre-processed List of GitHub repositories of the project
Red Hat Government This data is not pre-processed List of GitHub repositories of the project
Red Hat Consulting This data is not pre-processed List of GitHub repositories of the project
Red Hat Communities of Practice This data is not pre-processed List of GitHub repositories of the project
Red Hat Partner Tech This data is not pre-processed List of GitHub repositories of the project
Red Hat Documentation This data is not pre-processed List of GitHub repositories of the project
IBM This data is not pre-processed List of GitHub repositories of the project
IBM Cloud This data is not pre-processed List of GitHub repositories of the project
Build Lab Team This data is not pre-processed List of GitHub repositories of the project
Terraform IBM Modules This data is not pre-processed List of GitHub repositories of the project
Cloud Schematics This data is not pre-processed List of GitHub repositories of the project
OCP Power Demos This data is not pre-processed List of GitHub repositories of the project
IBM App Modernization  This data is not pre-processed List of GitHub repositories of the project
Kubernetes OperatorHub  This data is not pre-processed List of GitHub repositories of the project
Cloud Native Computing Foundation (CNCF)  This data is not pre-processed List of GitHub repositories of the project
Operator Framework This data is not pre-processed List of GitHub repositories of the project [408]
GitHub repositories referenced in artifacthub.io This data is not pre-processed List of GitHub repositories in artifacthub.io
Red Hat Communities of Practice This data is not pre-processed List of GitHub repositories of the project
Red Hat partner This data is not pre-processed List of GitHub repositories of the project
IBM Repositories This data is not pre-processed List of GitHub repositories for the project
Build Lab Team This data is not pre-processed List of GitHub repositories for the project
Operator Framework This data is not pre-processed List of GitHub repositories for the project
GitHub repositories This data is not pre-processed List of GitHub repositories for the project
Red Hat This data is not pre-processed List of GitHub repositories of the project
Kubernetes Patterns This data is not pre-processed List of GitHub repositories of the project
Kubernetes Deployment & Security Patterns This data is not pre-processed List of GitHub repositories of the project
Kubernetes for Full-Stack Developers This data is not pre-processed List of GitHub repositories of the project
Load Balancer Cloudwatch Metrics This data is not pre-processed GitHub repository of the project
Dynatrace This data is not pre-processed
AIOps Challenge 2020 Data This data is not pre-processed GitHub repository of the project
Loghub This data is not pre-processed List of repositories
HTML Pages This data is not pre-processed List of HTML pages
OpenShift ebooks This data is not pre-processed [409]
Kubernetes ebooks This data is not pre-processed Kubernetes Patterns, Kubernetes Deployment, Kubernetes for Full-Stack Developers
Kubernetes for Full-Stack Developers This data is not pre-processed Kubernetes for Full-Stack Developers
List of public and licensed Github repositories This data is not pre-processed List of repositories

Multivariate data

Financial

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Dow Jones Index Weekly data of stocks from the first and second quarters of 2011. Calculated values included, such as percentage change and lags. 750 Comma separated values Classification, regression, Time series 2014 [410][411] M. Brown et al.
Statlog (Australian Credit Approval) Credit card applications either accepted or rejected and attributes about the application. Attribute names are removed as well as identifying information. Factors have been relabeled. 690 Comma separated values Classification 1987 [412][413] R. Quinlan
eBay auction data Auction data from various eBay.com objects over various length auctions Contains all bids, bidderID, bid times, and opening prices. ~ 550 Text Regression, classification 2012 [414][415] G. Shmueli et al.
Statlog (German Credit Data) Binary credit classification into "good" or "bad" with many features. Various financial features of each person are given. 1,000 Text Classification 1994 [416] H. Hofmann
Bank Marketing Dataset Data from a large marketing campaign carried out by a large bank. Many attributes of the clients contacted are given. Whether the client subscribed is also given. 45,211 Text Classification 2012 [417][418] S. Moro et al.
Istanbul Stock Exchange Dataset Several stock indexes tracked for almost two years. None. 536 Text Classification, regression 2013 [419][420] O. Akbilgic
Default of Credit Card Clients Credit default data for Taiwanese creditors. Various features about each account are given. 30,000 Text Classification 2016 [421][422] I. Yeh
StockNet Stock movement prediction from tweets and historical stock prices None Text NLP 2018 [423] Yumo Xu and Shay B. Cohen
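
Derived features such as the percentage change and lags mentioned in the Dow Jones Index row can be computed from a plain price series. The following is a minimal sketch under stated assumptions: the helper names `percent_change` and `lag` and the sample prices are invented for illustration, and the dataset's own derived columns may be defined differently.

```python
from typing import List, Optional


def percent_change(values: List[float]) -> List[Optional[float]]:
    """Week-over-week change in percent; the first entry has no predecessor."""
    out: List[Optional[float]] = [None]
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) * 100.0 / prev)
    return out


def lag(values: List[float], k: int) -> List[Optional[float]]:
    """Shift the series k steps into the future, padding the start with None."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [None] * min(k, len(values)) + list(values[:-k])


# Invented weekly closing prices, not taken from the dataset.
weekly_close = [100.0, 110.0, 99.0]
changes = percent_change(weekly_close)
previous_week = lag(weekly_close, 1)
print(changes, previous_week)
```

Lagged columns let a model that sees one row at a time use earlier weeks as inputs, which is why time-series datasets often ship with them precomputed.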

Weather

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Cloud DataSet Data about 1024 different clouds. Image features extracted. 1024 Text Classification, clustering 1989 [424] P. Collard
El Nino Dataset Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. 12 weather attributes are measured at each buoy. 178080 Text Regression 1999 [425] Pacific Marine Environmental Laboratory
Greenhouse Gas Observing Network Dataset Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. None. 2921 Text Regression 2015 [426] D. Lucas
Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory Continuous air samples in Hawaii, USA. 44 years of records. None. 44 years Text Regression 2001 [427] Mauna Loa Observatory
Ionosphere Dataset Radar data from the ionosphere. Task is to classify into good and bad radar returns. Many radar features given. 351 Text Classification 1989 [283][428] Johns Hopkins University
Ozone Level Detection Dataset Two ground ozone level datasets. Many features given, including weather conditions at time of measurement. 2536 Text Classification 2008 [429][430] K. Zhang et al.

Census

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Adult Dataset Census data from 1994 containing demographic features of adults and their income. Cleaned and anonymized. 48,842 Comma separated values Classification 1996 [431] United States Census Bureau
Census-Income (KDD) Weighted census data from the 1994 and 1995 Current Population Surveys. Split into training and test sets. 299,285 Comma separated values Classification 2000 [432][433] United States Census Bureau
IPUMS Census Database Census data from the Los Angeles and Long Beach areas. None 256,932 Text Classification, regression 1999 [434] IPUMS
US Census Data 1990 Partial data from 1990 US census. Results randomized and useful attributes selected. 2,458,285 Text Classification, regression 1990 [435] United States Census Bureau

Transit

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Bike Sharing Dataset Hourly and daily count of rental bikes in a large city. Many features, including weather, length of trip, etc., are given. 17,389 Text Regression 2013 [436][437] H. Fanaee-T
New York City Taxi Trip Data Trip data for yellow and green taxis in New York City. Gives pick up and drop off locations, fares, and other details of trips. 6 years Text Classification, clustering 2015 [438] New York City Taxi and Limousine Commission
Taxi Service Trajectory ECML PKDD Trajectories of all taxis in a large city. Many features given, including start and stop points. 1,710,671 Text Clustering, causal-discovery 2015 [439][440] M. Ferreira et al.
METR-LA Speed from loop detectors on the highways of Los Angeles County. Average speed in 5-minute timesteps. 7,094,304 from 207 sensors and 34,272 timesteps Comma separated values Regression, Forecasting 2014 [441] Jagadish et al.
PeMS Speed, flow, occupancy and other metrics from loop detectors and other sensors on the freeways of the State of California, U.S.A. Metrics usually aggregated by averaging into 5-minute timesteps. 39,000 individual detectors, each containing years of time series Comma separated values Regression, Forecasting, Nowcasting, Interpolation (updated realtime) [442] California Department of Transportation
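
The METR-LA instance count in the table follows directly from its sensor and timestep counts. The sketch below reproduces that arithmetic and derives the time span implied by 5-minute timesteps. The constant names and the row-major `flat_index` layout are assumptions made for illustration; the published files may order their values differently.

```python
from datetime import timedelta

NUM_SENSORS = 207            # sensors stated in the table row
NUM_TIMESTEPS = 34_272       # timesteps stated in the table row
STEP = timedelta(minutes=5)  # aggregation interval stated in the table row

# One reading per sensor per timestep.
total_readings = NUM_SENSORS * NUM_TIMESTEPS

# Time span covered if the timesteps are contiguous (an assumption).
covered = NUM_TIMESTEPS * STEP
steps_per_day = timedelta(days=1) // STEP


def flat_index(timestep: int, sensor: int) -> int:
    """Position of a reading in a row-major (timestep, sensor) layout."""
    return timestep * NUM_SENSORS + sensor


print(total_readings, covered.days, steps_per_day)
```

Keeping the data as a (timesteps, sensors) grid is convenient for forecasting, because a sliding window over the first axis yields aligned input and target slices for every sensor at once.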

Internet

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Webpages from Common Crawl 2012 Large collection of webpages and how they are connected via hyperlinks None. 3.5B Text clustering, classification 2013 [443] V. Granville
Internet Advertisements Dataset Dataset for predicting if a given image is an advertisement or not. Features encode geometry of ads and phrases occurring in the URL. 3279 Text Classification 1998 [444][445] N. Kushmerick
Internet Usage Dataset General demographics of internet users. None. 10,104 Text Classification, clustering 1999 [446] D. Cook
URL Dataset 120 days of URL data from a large conference. Many features of each URL are given. 2,396,130 Text Classification 2009 [447][448] J. Ma
Phishing Websites Dataset Dataset of phishing websites. Many features of each site are given. 2456 Text Classification 2015 [449] R. Mustafa et al.
Online Retail Dataset Online transactions for a UK online retailer. Details of each transaction given. 541,909 Text Classification, clustering 2015 [450] D. Chen
Freebase Simple Topic Dump Freebase is an online effort to structure all human knowledge. Topics from Freebase have been extracted. large Text Classification, clustering 2011 [451][452] Freebase
Farm Ads Dataset The text of farm ads from websites. Binary approval or disapproval by content owners is given. SVMlight sparse vectors of text words in ads calculated. 4143 Text Classification 2011 [453][454] C. Masterharm et al.
The Pile Assembling several large datasets of diverse and unstructured texts Various (removing HTML and Javascript from websites, removing duplicated sentences) 825 GiB English text JSON Lines[455][456] Natural Language Processing, Text Prediction 2021 [457][455] Gao et al.
OSCAR Large collection of monolingual corpora extracted from web data (Common Crawl dumps) covering 150+ languages Various (filtering, language classification, adult-content detection and other labelling) 3.4 TB English text, 1.4 TB Chinese text, 1.1 TB Russian text, 595 MB German text, 431 MB French text, and data for 150+ languages (figures for version 23.01) JSON Lines[458] Natural Language Processing, Text Prediction 2021 [459][460] Ortiz Suarez, Abadji, Sagot et al.
OpenWebText An open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes. Extracted non-HTML content, deduplicated, and tokenized. 8,013,769 Documents, 38GB Text Natural Language Processing, Text Prediction 2019 [461][462] A. Gokaslan, V. Cohen
ROOTS A well-documented and representative multilingual dataset with the explicit goal of doing good for and by the people whose data was collected. Extracted non-HTML content, cleaned out UI and ads, deduplicated, removed PII, and tokenized. 1.6 TB, 59 languages. Parquet Natural Language Processing, Text Prediction 2022 [463][464] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral, T. Le Scao
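
Several of the text corpora above are listed in the JSON Lines format, which stores one JSON object per line so that very large files can be streamed without loading them whole. The reader below is a minimal sketch using only the Python standard library. The `text` field name and the sample records are assumptions for illustration; each corpus documents its own record schema.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON") from exc


# In-memory stand-in for a corpus file; real use would pass an open file.
sample = io.StringIO(
    '{"text": "first document"}\n'
    "\n"
    '{"text": "second document"}\n'
)
documents = [record["text"] for record in iter_jsonl(sample)]
print(documents)
```

Because the function is a generator, memory use stays proportional to a single record regardless of file size.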

Games

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Poker Hand Dataset 5 card hands from a standard 52 card deck. Attributes of each hand are given, including the Poker hands formed by the cards it contains. 1,025,010 Text Regression, classification 2007 [465] R. Cattral
Connect-4 Dataset Contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced. None. 67,557 Text Classification 1995 [466] J. Tromp
Chess (King-Rook vs. King) Dataset Endgame Database for White King and Rook against Black King. None. 28,056 Text Classification 1994 [467][468] M. Bain et al.
Chess (King-Rook vs. King-Pawn) Dataset King+Rook versus King+Pawn on a7. None. 3196 Text Classification 1989 [469] R. Holte
Tic-Tac-Toe Endgame Dataset Binary classification for win conditions in tic-tac-toe. None. 958 Text Classification 1991 [470] D. Aha
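
The instance count in the Tic-Tac-Toe Endgame row is consistent with enumerating every distinct final board reachable when x moves first. The following is a minimal sketch assuming that reading of the dataset; the function names and the "x"/"o"/"b" cell encoding are illustrative choices, not a description of how the dataset was actually generated.

```python
from typing import Optional, Set, Tuple

Board = Tuple[str, ...]  # 9 cells in row-major order: "x", "o", or "b" (blank)

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board: Board) -> Optional[str]:
    """Return "x" or "o" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "b" and board[a] == board[b] == board[c]:
            return board[a]
    return None


def terminal_boards() -> Set[Board]:
    """Collect every distinct board at which a game started by x ends."""
    found: Set[Board] = set()
    seen: Set[Board] = set()

    def play(board: Board, player: str) -> None:
        if board in seen:
            return
        seen.add(board)
        if winner(board) is not None or "b" not in board:
            found.add(board)  # game over: a win or a full board
            return
        next_player = "o" if player == "x" else "x"
        for i, cell in enumerate(board):
            if cell == "b":
                play(board[:i] + (player,) + board[i + 1:], next_player)

    play(("b",) * 9, "x")
    return found


endgames = terminal_boards()
x_wins = sum(1 for board in endgames if winner(board) == "x")
print(len(endgames), x_wins)
```

Under this enumeration the set contains 958 boards, matching the instance count listed above; labelling each board by whether x has won gives a binary classification target.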

Other multivariate

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Housing Data Set Median home values of Boston with associated home and neighborhood attributes. None. 506 Text Regression 1993 [471] D. Harrison et al.
The Getty Vocabularies Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. None. large Text Classification 2015 [472] Getty Center
Yahoo! Front Page Today Module User Click Log User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. Conjoint analysis with a bilinear model. 45,811,883 user visits Text Regression, clustering 2009 [473][474] Chu et al.
British Oceanographic Data Centre Biological, chemical, physical and geophysical data for oceans. 22K variables tracked. Various. 22K variables, many instances Text Regression, clustering 2015 [475] British Oceanographic Data Centre
Congressional Voting Records Dataset Voting data for all USA representatives on 16 issues. Beyond the raw voting data, various other features are provided. 435 Text Classification 1987 [476] J. Schlimmer
Entree Chicago Recommendation Dataset Record of user interactions with Entree Chicago recommendation system. Each user's usage of the app is recorded in detail. 50,672 Text Regression, recommendation 2000 [477] R. Burke
Insurance Company Benchmark (COIL 2000) Information on customers of an insurance company. Many features of each customer and the services they use. 9,000 Text Regression, classification 2000 [478][479] P. van der Putten
Nursery Dataset Data from applicants to nursery schools. Data about applicant's family and various other factors included. 12,960 Text Classification 1997 [480][481] V. Rajkovic et al.
University Dataset Data describing attributes of a large number of universities. None. 285 Text Clustering, classification 1988 [482] S. Sounders et al.
Blood Transfusion Service Center Dataset Data from blood transfusion service center. Gives data on donors' return rate, frequency, etc. None. 748 Text Classification 2008 [483][484] I. Yeh
Record Linkage Comparison Patterns Dataset Large dataset of records. Task is to link relevant records together. Blocking procedure applied to select only certain record pairs. 5,749,132 Text Classification 2011 [485][486] University of Mainz
Nomao Dataset Nomao collects data about places from many different sources. Task is to detect items that describe the same place. Duplicates labeled. 34,465 Text Classification 2012 [487][488] Nomao Labs
Movie Dataset Data for 10,000 movies. Several features for each movie are given. 10,000 Text Clustering, classification 1999 [489] G. Wiederhold
Open University Learning Analytics Dataset Information about students and their interactions with a virtual learning environment. None. ~ 30,000 Text Classification, clustering, regression 2015 [490][491] J. Kuzilek et al.
Mobile phone records Telecommunications activity and interactions Aggregation per geographical grid cells and every 15 minutes. large Text Classification, Clustering, Regression 2015 [492] G. Barlacchi et al.

Curated repositories of datasets

Because datasets come in myriad formats and can be difficult to use, considerable work has gone into curating datasets and standardizing their formats to make them easier to use for machine learning research.

  • OpenML:[493] Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
  • PMLB:[494] A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
  • Metatext NLP: A community-maintained web repository (https://metatext.io/datasets) containing nearly 1,000 benchmark datasets and growing. Covers many tasks, from classification to question answering, and various languages, including English, Portuguese, and Arabic.
  • Appen: Off The Shelf and Open Source Datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.[495][496]

See also

References
