Datasets
150+ Machine Learning Datasets
Introduction:
Without data there is no machine learning, no AI, and no deep learning. With heavy automation and IoT devices all around us, there is no dearth of data, but using it raises three issues. The first issue is that, because of privacy and security concerns, data is not available to everyone. The second issue is cleaning this data. The third issue is getting complete data that can solve a given business problem. To assemble complete data, you need to collect it from multiple sources and identify the key that connects records (samples) across those sources. This is an expensive and time-consuming step of a data science project. If you want to learn data science, or to solve an existing problem using new methods, you need a benchmarking framework that reports the model metrics (recall, precision, accuracy, etc.) of each new approach (algorithm) against a given dataset. Datasets therefore play a critical role in benchmarking algorithm performance.
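As a rough illustration of the "identify the key to connect records" step, here is a minimal Python sketch using only the standard library. The two toy sources, the field names (`customer_id`, `name`, `annual_spend`), and the helper functions are hypothetical and exist only for this example; real projects usually need fuzzier matching and far more cleaning.

```python
from collections import defaultdict


def join_on_key(sources, key):
    """Merge records from several sources that share the same key value."""
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            if key not in record:
                continue  # a record without the key cannot be linked
            merged[record[key]].update(record)
    return dict(merged)


def complete_records(merged, required_fields):
    """Keep only merged records that contain every required field."""
    return {
        k: rec
        for k, rec in merged.items()
        if all(field in rec for field in required_fields)
    }


# Hypothetical toy data: two sources describing overlapping customers.
crm = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]
billing = [
    {"customer_id": 1, "annual_spend": 1200.0},
    {"customer_id": 3, "annual_spend": 300.0},
]

merged = join_on_key([crm, billing], key="customer_id")
complete = complete_records(merged, ["name", "annual_spend"])
print(complete)
```

Only customer 1 appears in both sources, so it is the only complete record. The other two show how quickly usable data shrinks when sources do not fully overlap.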
Machine learning relies heavily on high-quality datasets for training and evaluation. These datasets serve as the foundation for developing robust and accurate models across various AI domains, including classical machine learning, computer vision, natural language processing (NLP), audio processing, and time series analysis. Access to diverse and comprehensive datasets is crucial for researchers and practitioners to tackle real-world problems and advance the field of machine learning.
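The benchmarking idea from the introduction, comparing each approach by accuracy, precision, and recall on the same dataset, can be sketched as follows. This is a minimal sketch and not a full framework: the toy dataset, the two stand-in "models", and the function names are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall (0.0 when undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def benchmark(models, features, labels):
    """Score every model on the same dataset so results are comparable."""
    return {
        name: metrics(labels, [predict(x) for x in features])
        for name, predict in models.items()
    }


# Hypothetical toy dataset: one numeric feature and a binary label.
features = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]
labels = [0, 0, 1, 1, 1, 0]

# Stand-in "algorithms"; real ones would be trained on separate data.
models = {
    "always_positive": lambda x: 1,
    "threshold_0.5": lambda x: 1 if x >= 0.5 else 0,
}

results = benchmark(models, features, labels)
for name, scores in results.items():
    print(name, scores)
```

The point of the design is that every approach is scored against the same fixed dataset. In practice that dataset should be a held-out test split the models never saw during training.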
In this article, I am publishing a curated collection of datasets sourced from over 150 data sources. These datasets and data sources have been carefully selected to cover a wide range of domains, ensuring their relevance to different machine learning applications. Whether you are working on an NLP/text project, a CV/image project, an audio project, time series forecasting, or classical machine learning, you’ll find valuable datasets to support your research and development efforts. Let’s dive into the world of machine learning datasets and discover the wealth of resources available to fuel your projects. If you dig into each link, you will find hundreds, if not thousands, of datasets under many of them. I hope you benefit from this work.
Note: I collected these links from my Chrome bookmarks and validated each link at the time of writing. If you find a link that does not work, points to the wrong place, or is wrongly described, please help me improve this article by writing to me at hari.prasad @ vedavit-ps .com.
Note: To find image datasets on this page, search for “image”; for speech datasets, search for “speech”.
List of Datasets and Data Sources
Sno. | URL | Description | |
---|---|---|---|
1. | 100+ Interesting Data Sets for Statistics | 100+ Interesting Data Sets for Statistics | |
2. | 100+ Mammography Image Databases | Mammography Image Databases – 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew) | |
3. | 15 amazon datasets on data.world | amazon data on data.world (8 datasets available) | |
4. | 20 Free Big Data Sources | 20 Free Big Data Sources | |
5. | 332 Sport Datasets on data.world | sports data on data.world (338 datasets available) | |
6. | 4000+ Groningen Natural Image Database | Groningen Natural Image Database – 4000+ 1536×1024 (16 bit) calibrated outdoor images (Formats: homebrew) | |
7. | 450+ UCI datasets | ||
8. | 538 Datasets | FiveThirtyEight: data and code related to their articles | |
9. | 538 Datasets Summary GitHub | FiveThirtyEight: data and code related to their articles | |
10. | 57 products datasets on data.world | ||
11. | 622 UCI Archive Dataset | UCI Archive-Machine Learning Repository: Data Sets | |
12. | A Collective list of Free API for Datasets | A collective list of free APIs for use in software and web development. | |
13. | Academic Torrents- Large Research dataset | Academic Torrents: distributed network for sharing large research datasets | |
14. | The Air Freight data set is a ray-traced image sequence | Air Freight - The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160x120 pixels). (Formats: PNG) | |
15. | Air Freight Dataset - Computer Vision | Air Freight – The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160×120 pixels). (Formats: PNG) | |
16. | Allen Institutes Dataset | Datasets – Allen Institute for AI | |
17. | Amazon Datasets | Amazon Web Services Public Data Sets | |
18. | Amsterdam Library of Object Images - ALOI | Amsterdam Library of Object Images – ALOI is a color image collection of one-thousand small objects, recorded for scientific purposes. In order to capture the sensory variation in object recordings, we systematically varied viewing angle, illumination angle, and illumination color for each object, and additionally captured wide-baseline stereo images. We recorded over a hundred images of each object, yielding a total of 110,250 images for the collection. (Formats: png) | |
19. | Annotated face, hand, cardiac & meat images | Annotated face, hand, cardiac & meat images – Most images & annotations are supplemented by various ASM/AAM analyses using the AAM-API. (Formats: bmp,asf) | |
20. | Apigee | Apigee: explore dozens of popular APIs | |
21. | apilist.fun | API List: A public list of free APIs for programmers | |
22. | AT&T Laboratories Cambridge face database - Images | AT&T Laboratories Cambridge face database | |
23. | AVHRR Pathfinder | National Centers for Environmental Information | |
24. | Awesome Deep Learning Database | Densely Sampled View Spheres – Densely sampled view spheres – upper half of the view sphere of two toy objects with 2500 images each. (Formats: tiff) | |
25. | Awesome Public Dataset | ||
26. | Awesome Public Datasets | Awesome Public Datasets: Well-organized and frequently updated | |
27. | B2SHARE | ||
28. | Berkeley Segmentation Dataset 500 | Berkeley Segmentation Dataset 500 | |
29. | Biometric Systems Lab | Biometric Systems Lab – University of Bologna | |
30. | Caltech Image Database | Caltech Image Database – about 20 images – mostly top-down views of small objects and toys. (Formats: GIF) | |
31. | CAVIAR video sequences of mall and public space behavior | CAVIAR video sequences of mall and public space behavior - 90K video frames in 90 sequences of various human activities, with XML ground truth of detection and behavior classification (Formats: MPEG2 & JPEG) | |
32. | CCITT Fax standard images | CCITT Fax standard images – 8 images (Formats: gif) | |
33. | Census of India | ||
34. | Global Terrorism Database (GTD) | ||
35. | CIFAR-10 and CIFAR-100 | CIFAR-10 and CIFAR-100 | |
36. | CMU CIL’s Stereo Data (Image) | CMU CIL’s Stereo Data with Ground Truth – 3 sets of 11 images, including color tiff images with spectroradiometry (Formats: gif, tiff) | |
37. | CMU PIE Database | CMU PIE Database - A database of 41,368 face images of 68 people captured under 13 poses, 43 illumination conditions, and with 4 different expressions. | |
38. | CMU VASC Image Database | CMU VASC Image Database – Images, sequences, stereo pairs (thousands of images) (Formats: Sun Rasterimage) | |
39. | College Scorecard Data | ||
40. | Columbia-Utrecht Reflectance and Texture Database | Columbia-Utrecht Reflectance and Texture Database – Texture and reflectance measurements for over 60 samples of 3D texture, observed with over 200 different combinations of viewing and illumination directions. (Formats: bmp) | |
41. | Computational Colour Constancy Data | Computational Colour Constancy Data - A dataset oriented towards computational color constancy, but useful for computer vision in general. It includes synthetic data, camera sensor data, and over 700 images. (Formats: tiff) | |
42. | Computational Vision Lab | Computational Vision Lab | |
43. | Content-based image retrieval database | Content-based image retrieval database - 11 sets of color images for testing algorithms for content-based retrieval. Most sets have a description file with names of objects in each image. (Formats: jpg) | |
44. | Cricket Data | ||
45. | Crowdanalytics | ||
46. | Crowdflower Dataset | CrowdFlower: interesting datasets created or enhanced by their contributors | |
47. | Crunchbase | Crunchbase: Discover innovative companies and the people behind them | |
48. | CVD Foundation Open Images | Open Images dataset – Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. | |
49. | Data world | ||
50. | Data.lacity.org | DataLA | |
51. | DataCamp | ||
52. | DataInnovation Dataset Blog | Center for Data Innovation: blog posts about interesting, recently-released data sets. | |
53. | Dataverse.org | Dataverse Project: searchable archive of research data | |
54. | DC Open Data Catalog | DC Open Data Catalog / OpenDataDC | |
55. | Deep Fashion - Images | Large-scale Fashion (DeepFashion) Database – Contains over 800,000 diverse fashion images. Each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, bounding box and clothing landmarks | |
56. | Devanagari Handwritten Character Dataset - Images | ||
57. | Donor Choose | Donors Choose: data related to their projects | |
58. | Face and Gesture images and image sequences | Face and Gesture images and image sequences – Several image datasets of faces and gestures that are ground truth annotated for benchmarking http://www.fg-net.org/ | |
59. | FG-NET Facial Aging Database | FG-NET Facial Aging Database – Database contains 1002 face images showing subjects at different ages. (Formats: jpg) | |
60. | Finding Datasets from inside-r.org | inside-R | |
61. | Flickr 30k, Images with Caption | Flickr 30k | |
62. | Flickr 8k, Images with Caption | Flickr 8k | |
63. | Flickr Data 100 Million Yahoo dataset, Images | Flickr Data 100 Million Yahoo dataset | |
64. | FVC2000 Fingerprint Databases, Images | FVC2000 Fingerprint Databases - FVC2000 is the First International Competition for Fingerprint Verification Algorithms. Four fingerprint databases constitute the FVC2000 benchmark (3520 fingerprints in all). | |
65. | Gapminder Data | ||
66. | German Fingerspelling Database | German Fingerspelling Database – The database contains 35 gestures and consists of 1400 image sequences that contain gestures of 20 different persons recorded under non-uniform daylight lighting conditions. http://www-i6.informatik.rwth-aachen.de/~dreuw/database.html | |
67. | Getting Stock Data | ||
68. | Github-DataMeet | Datameet is a community of Data Science enthusiasts. | |
69. | Google House Numbers from street view | ||
70. | Google Scholar | ||
71. | HowStat | HowSTAT! The Cricket Statisticians – Home Page | |
72. | github.com/TheUpShot | The Upshot: data related to their articles | |
73. | Huggingface datasets | ||
74. | Humanitarian Data Exchange | Humanitarian Data Exchange | |
75. | IEEE DataPort | Data Competitions - IEEE DataPort | |
76. | IEN Image Library | IEN Image Library – 1000+ images, mostly outdoor sequences (Formats: raw, ppm) | |
77. | Image Analysis Laboratory | Image Analysis Laboratory – Images obtained from a variety of imaging modalities — raw CFA images, range images and a host of “medical images”. (Formats: homebrew) | |
78. | Image QA | Image QA | |
79. | ImageNet | ImageNet | |
80. | IMDb Top 250 Movies | Ratings and Reviews for New Movies and TV Shows – IMDb | |
81. | IMF-Exchange Rate | IMF-Exchange Rate Archives by Month | |
82. | Indian Govt | ||
83. | Indian Liver Patient Dataset | ||
84. | INRIA | INRIA | |
85. | Institute of Computer Graphics and Vision | Institute of Computer Graphics and Vision | |
86. | Inter University Consortium for Politics & Social | Inter-university Consortium for Political and Social Research | |
87. | Kaggle Datasets | Kaggle provides datasets with their challenges, but each competition has its own rules as to whether the data can be used outside of the scope of the competition. | |
88. | kdnuggets | kdnuggets- Datasets for Data Mining and Data Science | |
89. | Mammography Image Databases | Mammography Image Databases - 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew) | |
90. | Mashape - Explore APIs | Mashape: explore hundreds of APIs | |
91. | Microsoft COCO | Microsoft COCO | |
92. | Million Song Dataset | Million Song Dataset | |
93. | MIT Vision Texture | MIT Vision Texture – Image archive (100+ images) (Formats: ppm) | |
94. | MNIST Handwritten digits | MNIST Handwritten digits | |
95. | NLM HyperDoc Visible Human Project | NLM HyperDoc Visible Human Project - Color, CAT and MRI image samples - over 30 images (Formats: jpeg) | |
96. | NYC Open Data socrata | NYC Open Data | |
97. | OASIS 1 | OASIS-1 (Open Access Series of Imaging Studies) | |
98. | OASIS Brain - Imaging Studies | Cross-Sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults | |
99. | Open Images is a dataset of ~9 million URLs | Open Images dataset - Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. | |
100. | Photometric 3D Surface Texture Database | Photometric 3D Surface Texture Database - This is the first 3D texture database which provides both full real surface rotations and registered photometric stereo data (30 textures, 1680 images). (Formats: TIFF) | |
101. | Pittsburgh Science of Learning | Pittsburgh Science of Learning Center’s DataShop | |
102. | ProPublica Data Store | ProPublica Data Store | |
103. | Python API for Datasets | Python APIs: Python wrappers for many APIs | |
104. | Quandl | Quandl: over 10 million financial, economic, and social datasets | |
105. | R Datasets | Rdatasets: collection of 700+ datasets originally distributed with R packages | |
106. | RapidAPI.com | 25 Free Public APIs for Developers & Free Alternatives List | |
107. | rdatamining.com | RDataMining.com | |
108. | Reddit Dataset | Datasets subreddit: ask for help finding a specific data set, or post your own | |
109. | Reddit Dataset from 2500 subreddits | Reddit Top 2.5 Million: all-time top 1,000 posts from each of the top 2,500 subreddits | |
110. | Reddit Dataset Jeopardy Question | 200,000+ Jeopardy questions | |
111. | research.yahoo.com | ||
112. | Sebastian Raschka | Sebastian Raschka: datasets categorized by format and topic | |
113. | Smartcities Data Govt of India | ||
114. | Stanford Edu Dataset | Stanford Large Network Dataset Collection: graph data | |
115. | Suicide Rates 1985-2016 | Suicide Rates Overview 1985 to 2016 - Kaggle | |
116. | Sunlight Foundation Govt Data | Sunlight Foundation: government-focused data | |
117. | India, Surat City | ||
118. | UP Govt Economics | Directorate of Economics and Statistics UP Govt. | |
119. | UP Smart Cities | ||
120. | Tamilnadu | 37K Resources, 4,134 Catalog, 101 Departments | |
121. | The MIT-CSAIL Database of Objects and Scenes | The MIT-CSAIL Database of Objects and Scenes - Database for testing multiclass object detection and scene recognition algorithms. Over 72,000 images with 2873 annotated frames. More than 50 annotated object classes. (Formats: jpg) | |
122. | Tiny Images 80 Million tiny images | Tiny Images 80 Million tiny images | |
123. | Traffic Image Sequences and ‘Marbled Block’ Sequence | Traffic Image Sequences and ‘Marbled Block’ Sequence - thousands of frames of digitized traffic image sequences as well as the ‘Marbled Block’ sequence (grayscale images) (Formats: GIF) | |
124. | U Oulu wood and knots database | U Oulu wood and knots database - Includes classifications - 1000+ color images (Formats: ppm) | |
125. | UC Irvine Machine Learning Repository | UC Irvine Machine Learning Repository | |
126. | UCI | UC Irvine Machine Learning Repository: datasets specifically designed for machine learning | |
127. | UCI Archive 620+ datasets | ||
128. | UCI-Liver Disorder Datasets | UCI Machine Learning Repository: Liver Disorders Data Set | |
129. | UFO - Geolocation and Time Dataset | UFO reports: geolocated and time-standardized UFO reports for close to a century | |
130. | UK Govt | data.gov.uk | |
131. | University of Oulu Physics-based Face Database | University of Oulu Physics-based Face Database - contains color images of faces under different illuminants and camera calibration conditions as well as skin spectral reflectance measurements of each person. | |
132. | University of Oulu Texture Database | University of Oulu Texture Database - Database of 320 surface textures, each captured under three illuminants, six spatial resolutions and nine rotation angles. A set of test suites is also provided so that texture segmentation, classification, and retrieval algorithms can be tested in a standard manner. (Formats: bmp, ras, xv) | |
133. | US Census Bureau | US Census Bureau | |
134. | US Gov 256K datasets | The Home of the U.S. Government’s Open Data | |
135. | US Govt | data.gov (see also: Project Open Data Dashboard) | |
136. | US Students Universities | ||
137. | USF Range Image Data with Segmentation | USF Range Image Data with Segmentation Ground Truth - 80 image sets (Formats: Sun rasterimage) | |
138. | Vanderbilt edu dataset websites | ||
139. | Vanderbilt edu datasets | ||
140. | VQA | Visual Question Answering | |
141. | Wikipedia Dataset | Wikipedia:Database download – Wikipedia | |
142. | Wiry Object Recognition Database | Wiry Object Recognition Database - Thousands of images of a cart, ladder, stool, bicycle, chairs, and cluttered scenes with ground truth labelings of edges and regions. | |
143. | World Bank Open Data | World Bank Open Data | |
144. | Yale Face Database - 165 images | Yale Face Database - 165 images (15 individuals) with different lighting, expression, and occlusion configurations. | |
145. | Yale Face Database B - 5760 | Yale Face Database B - 5760 single light source images of 10 subjects each seen under 576 viewing conditions (9 poses x 64 illumination conditions). (Formats: PGM) | |
146. | Yelp.com Datasets Challenge | Yelp Dataset Challenge: Yelp reviews, business attributes, users, and more from 10 cities | |
147. | YouTube-8M Dataset | YouTube-8M Dataset - YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities. | |
148. | Aylien News Data API | ||
149. | Aylien Datasets | ||
150. | Finance Datasets on Kaggle | ||
151. | Github philipperemy/financial-news-dataset | ||
152. | FT Markets Data | ||
153. | IMF Datasets | ||
154. | Worldbank Datasets | ||
155. | Reuters Finance News Dataset Title Only | ||
156. | Bloomberg + Reuters Finance News Dataset | ||
157. | Enron Email Dataset | Contains more than 500K emails from over 150 users, most of them senior management of Enron. The data is around 432 MB in size. | |
158. | Chatbot Intents Dataset | The dataset for a chatbot is a JSON file that has different tags like goodbye, greetings, pharmacy_search, hospital_search, etc. Every tag has a list of patterns that a user might type, and the chatbot responds according to the matched pattern. The dataset is perfect for understanding how chatbot data works. | |
159. | Parkinson Dataset | Parkinson dataset contains biomedical measurements, 195 records of people with 23 different attributes. This data is used to differentiate healthy people and people with Parkinson’s disease. | |
160. | Mall Customers Dataset | The Mall Customers dataset holds details about people visiting a mall: customer ID, age, gender, annual income, and spending score. It can be used to gain insights from the data and to divide customers into groups based on their behavior. | |
161. | Google Trends Data Portal | Google trends data can be used to examine and analyze the data visually. We can find out what’s trending and what people are searching for. | |
162. | Recommender Systems and Personalization Datasets | This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains datasets from popular websites, such as Goodreads book reviews, Amazon product reviews, bartending data, and social media data, that are used in building recommender systems. | |
163. | GTSRB (German traffic sign recognition benchmark) Dataset | Build a model using a deep learning framework that classifies traffic signs and also recognizes the bounding box of signs. The traffic sign classification is also useful in autonomous vehicles for identifying signs and then taking appropriate actions. | |
164. | Cityscapes Dataset | It contains high-quality pixel-level annotations of video sequences recorded in the streets of 50 different cities. The dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes. | |
165. | Kinetics Dataset | There are three Kinetics datasets: Kinetics 400, Kinetics 600, and Kinetics 700. This large-scale collection contains URL links to around 650,000 high-quality video clips and can be used to build a model that recognizes human actions. | |
166. | IMDB-Wiki dataset | The IMDB-Wiki dataset is one of the largest open-source datasets of face images with gender and age labels. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images. | |
167. | Color Detection Dataset | The dataset contains a CSV file that has 865 color names with their corresponding RGB (red, green, and blue) values of the color. | |
168. | LibriSpeech Dataset | This dataset contains a large amount of read English speech derived from the LibriVox project. It has 1000 hours of read English speech in various accents. The objective of speech recognition is to automatically identify what is being said in the audio. | |
169. | Breast Histopathology Images Dataset | This dataset contains 277,524 images of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. Of these, 198,738 are IDC-negative and 78,786 are IDC-positive. | |
170. | youtube-8M analytics | ||
171. | Temporal concept localization within video - YouTube-8M, Link2 | The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations. In addition to annotating videos, we would like to temporally localize the entities in the videos, i.e., find out when the entities occur. | |
172. | CodaLab | Hundreds of interesting datasets. | |
173. | Hate Speech Dataset in Devanagari from Kaggle | ||
174. | Stanford Speech Dataset | ||
175. | TED-LIUM corpus release 3 | ||
176. | 40 Open Source Audio Datasets | ||
177. | Microsoft Datasets | ||
178. | 9 Voice Datasets from cmwire | ||
179. | BBC Datasets | Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Class Labels: 5 (business, entertainment, politics, sport, tech) |
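To make the tag-and-pattern structure described for the Chatbot Intents Dataset (entry 158) more concrete, here is a minimal Python sketch. The sample JSON is invented for illustration, and the field names `intents`, `tag`, `patterns`, and `responses` are assumptions about a typical layout, not the confirmed schema of the linked file. The matching rule is plain word overlap rather than a trained model.

```python
import json

# Hypothetical sample; the field names and contents are assumptions.
SAMPLE = """
{"intents": [
  {"tag": "greetings",
   "patterns": ["hello", "hi there"],
   "responses": ["Hello! How can I help?"]},
  {"tag": "goodbye",
   "patterns": ["bye", "see you later"],
   "responses": ["Goodbye!"]},
  {"tag": "pharmacy_search",
   "patterns": ["find a pharmacy", "pharmacy near me"],
   "responses": ["Searching for pharmacies nearby."]}
]}
"""


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def match_intent(message, intents):
    """Return the tag whose patterns share the most words with the message."""
    words = tokenize(message)
    best_tag, best_overlap = None, 0
    for intent in intents:
        for pattern in intent["patterns"]:
            overlap = len(words & tokenize(pattern))
            if overlap > best_overlap:
                best_tag, best_overlap = intent["tag"], overlap
    return best_tag  # None when no pattern shares a word with the message


intents = json.loads(SAMPLE)["intents"]
print(match_intent("is there a pharmacy near me", intents))
print(match_intent("bye for now", intents))
```

A real chatbot would normally train a classifier on the patterns instead of counting shared words, but the data layout it learns from would look much the same.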
Conclusion:
Machine learning datasets play a pivotal role in the development and advancement of various machine learning applications. In this article, we have explored an extensive collection of datasets obtained from more than 150 data sources, encompassing classical machine learning, computer vision, NLP/NLU, audio processing, and time series analysis.
By leveraging these diverse datasets, researchers and practitioners can build more robust and accurate machine learning models. These datasets provide the necessary ingredients for training, testing, and validating models across different domains, enabling the development of intelligent systems that can understand, interpret, and make predictions from complex data.
As the field of machine learning continues to evolve, the availability of high-quality datasets remains crucial. Whether you are embarking on a new project or seeking to enhance your existing models, exploring and utilizing these curated datasets will empower you to push the boundaries of what is possible in machine learning.
Remember, the power of machine learning lies not only in the algorithms and techniques but also in the data that fuels them. Embrace the vast array of datasets at your disposal and embark on exciting journeys of discovery and innovation in the world of machine learning.