
# Datasets

Here is a collection of some resources for ML datasets.

## Dataset Search

- [A Guide to Getting Datasets for Machine Learning in Python](https://machinelearningmastery.com/a-guide-to-getting-datasets-for-machine-learning-in-python/)
- [A Guide to Obtaining Time Series Datasets in Python](https://machinelearningmastery.com/a-guide-to-obtaining-time-series-datasets-in-python/)
- [Google Dataset Search](https://datasetsearch.research.google.com)
- [OpenML Datasets](https://www.openml.org/search?type=data&sort=runs&status=active)
- [HuggingFace Datasets](https://github.com/huggingface/datasets)
- [The Best Data is Free Data using Socrata Open Data API](https://towardsdatascience.com/the-best-data-is-free-data-of-course-b88230b5b47f)
- [Papers with Code](https://paperswithcode.com/)
- [The Complete Collection Of Data Repositories – Part 1](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-1.html)
- [The Complete Collection Of Data Repositories – Part 2](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-2.html)
- [50 Public Sources for Machine Learning Datasets](https://towardsdatascience.com/datasets-for-machine-learning-and-data-science-a27a5d0ba03)
- [50 Open Source Image Datasets for Computer Vision for Every Use Case](https://www.taqadam.io/open-sourse-datasets/)

## Dataset Cheatsheet

Here is a list of public datasets in computer vision, NLP, and more:

[ML Cheatsheet Dataset List](https://ml-cheatsheet.readthedocs.io/en/latest/datasets.html)

## Read Datasets

[Read Datasets with URL in Python](https://towardsdatascience.com/dont-download-read-datasets-with-url-in-python-8245a5eaa919)

A tutorial on reading UCI datasets directly from a URL, without downloading them locally. It explains how to read five different file formats: data, csv, arff, zip, and rar.

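
The idea of reading a dataset straight from a URL can be sketched with the Python standard library alone. This is a minimal illustration, not the tutorial's own code: the helper names `read_csv_from_url` and `read_csv_from_zip_url` are hypothetical, and only the csv and zip cases are covered. UCI `.data` files are often plain comma-separated text, so the CSV helper may work for them as well; arff and rar files need additional parsers that are not shown here.

```python
import csv
import io
import zipfile
from urllib.request import urlopen


def read_csv_from_url(url, encoding="utf-8"):
    """Fetch a CSV file from a URL and return its rows as lists of strings."""
    with urlopen(url) as response:
        text = response.read().decode(encoding)
    # Parse the downloaded text in memory; nothing is written to disk.
    return list(csv.reader(io.StringIO(text)))


def read_csv_from_zip_url(url, member, encoding="utf-8"):
    """Fetch a ZIP archive from a URL and parse one CSV member in memory."""
    with urlopen(url) as response:
        payload = response.read()
    # ZipFile needs a seekable file object, so wrap the bytes in BytesIO.
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        with archive.open(member) as handle:
            text = handle.read().decode(encoding)
    return list(csv.reader(io.StringIO(text)))
```

A call such as `read_csv_from_url("https://example.org/some-dataset.csv")` (a placeholder URL) returns every row, including any header row, as a list of strings. Both helpers load the whole file into memory, so they suit small datasets rather than very large ones.
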
## Working with Large Datasets

- [Memory Usage Tips](./tips/memory_usage.md)
- [A Big Problem with Linear Regression and How to Solve It](https://towardsdatascience.com/robust-regression-23b633e5d6a5)

## Generate Dummy Data

[How to generate dummy data in Python](https://towardsdatascience.com/how-to-generate-dummy-data-in-python-a05bce24a6c6)

How to generate fake data using the Faker library.

----------

## Structured Datasets

### HuggingFace

- [HuggingFace](https://github.com/huggingface/datasets)
- [Forget Complex Traditional Approaches to handle NLP Datasets, HuggingFace Dataset Library](https://medium.com/@arjunkumbakkara/forget-complex-traditional-approaches-to-handle-nlp-datasets-huggingface-dataset-library-is-your-f7445ea79267)

HuggingFace Datasets is a lightweight library providing two main features:

- one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (in 467 languages and dialects!) provided on the HuggingFace Datasets Hub.
- efficient data pre-processing: simple, fast, and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text.

## CV Datasets

[Open Source Datasets for Computer Vision](https://www.kdnuggets.com/2021/08/open-source-datasets-computer-vision.html)

### Medical Image Datasets

- [medical-imaging-datasets](https://github.com/sfikas/medical-imaging-datasets)
- [A Systematic Collection of Medical Image Datasets for Deep Learning](https://arxiv.org/abs/2106.12864)

The paper has three purposes:

1. to provide an up-to-date and complete list that can be used as a universal reference to easily find datasets for clinical image analysis
2. to guide researchers on the methodology to test and evaluate their methods' performance and robustness on relevant datasets
3. to provide a route to relevant algorithms for the relevant medical topics, and to challenge leaderboards

## NLP Datasets

[Datasets for Natural Language Processing](https://machinelearningmastery.com/datasets-natural-language-processing/)

You need datasets to practice on when getting started with deep learning for natural language processing tasks. It is better to use small datasets that you can download quickly and that do not take too long to fit models. It is also helpful to use standard datasets that are well understood and widely used, so that you can compare your results to see if you are making progress.

The post presents a suite of standard datasets for natural language processing tasks that you can use when getting started with deep learning. It is divided into 7 parts:

1. Text Classification
2. Language Modeling
3. Image Captioning
4. Machine Translation
5. Question Answering
6. Speech Recognition
7. Document Summarization

### NLP Open Source Datasets

- [7 Top Open Source Datasets to Train Natural Language Processing (NLP) & Text Models](https://www.kdnuggets.com/2021/11/top-open-source-datasets-nlp.html)
- [20 Open Datasets for Natural Language Processing](https://medium.com/@ODSC/20-open-datasets-for-natural-language-processing-538fbfaf8e38)

One of the first steps you need to take is training your NLP model on datasets. Creating your own dataset is a lot of work and is unnecessary when just starting out. Many open-source datasets are available, but keep in mind that they are not without their problems: you have to deal with bias, incomplete data, and a slew of other concerns when grabbing any old dataset to test on.

There are a couple of places online that do a great job of curating datasets to make it easier to find what you're looking for:

- [Papers With Code](https://paperswithcode.com/datasets) - Nearly 5,000 machine learning datasets are categorized and easy to find.
- [Hugging Face](https://huggingface.co/datasets) - A great site to find datasets focused on audio, text, speech, and other datasets specifically targeting NLP.

In addition, the following is a list of some of the best open-source datasets to start learning NLP:

### [nlp-datasets](https://github.com/niderhoff/nlp-datasets)

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP). Most of the entries are raw unstructured text data; if you are looking for annotated corpora or treebanks, refer to the sources at the bottom of that list.

### [Internet Archive](https://web.archive.org)

Many developers are too quick to choose web crawling as a way to collect data when there are easier, more ethical resources available, such as the Internet Archive. The Internet Archive has been archiving the web for 20 years and has preserved billions of webpages from millions of websites.

### Amazon Reviews and Wikipedia Links

[25 Excellent Machine Learning Open Datasets](https://opendatascience.com/25-excellent-machine-learning-open-datasets/)

The article lists Amazon Reviews and Wikipedia Links for general NLP, and the Stanford Sentiment Treebank and Twitter US Airlines Reviews specifically for sentiment analysis.