
Great Ways to Get Datasets for Machine Learning


You can’t build a machine learning or deep learning model without data. You need a dataset from a particular domain to train a model on it. But it often happens that you can’t find the dataset you want on a particular website. Here are several ways to get the dataset of your choice.

Kaggle


Kaggle is one of the most popular platforms for data scientists. Many companies publish their original data on Kaggle, and many users upload datasets for particular domains themselves. The datasets on Kaggle are free to download.

Besides datasets, Kaggle is also an excellent platform for learning data science. Many programmers share their complete, well-commented Jupyter notebooks on Kaggle, so you can learn which steps they took to solve a particular problem.

Google Dataset Search

Just as Google is a search engine for websites, Google Dataset Search is a search engine for datasets. Type the dataset you want into the search bar, just as you would in Google Search, and you will get results from all the websites that host that dataset.

Google Dataset Search only lists the websites that host a dataset. On some of those websites you can get the dataset for free, while on others you have to pay before you get access to it.

UCI Machine Learning Repository


The UCI Machine Learning Repository is a website that hosts authentic data. Like Kaggle, it lets you download datasets for free.

Compared with Kaggle, UCI has far fewer datasets, but the datasets in UCI contain accurate and rich information. The Iris and Wine datasets, which are very famous in machine learning, come from UCI.

Most of the datasets also come with research papers about their data, so you can dig deeper if you want to learn more about a dataset, which is a plus.
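As a small illustration, here is a minimal sketch of loading the Iris dataset with only the Python standard library. The download URL and the column names are assumptions (the raw file has no header row), so check the dataset's page on the UCI repository if the path has changed. The sample rows in the demo follow the file's comma-separated format.

```python
import csv
import io
import urllib.request

# Assumed location of the raw Iris file; verify it on the UCI dataset page.
IRIS_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assumed column names: the raw file has no header row.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def parse_iris(text):
    """Turn the raw comma-separated text into a list of row dictionaries."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(COLUMNS):
            continue  # skip blank or malformed lines
        *measurements, species = record
        row = dict(zip(COLUMNS[:4], map(float, measurements)))
        row["species"] = species
        rows.append(row)
    return rows


def download_iris(url=IRIS_URL):
    """Fetch the file over HTTP and parse it (needs an internet connection)."""
    with urllib.request.urlopen(url) as response:
        return parse_iris(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Offline demo on two sample rows in the same format as the file.
    sample = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"
    for row in parse_iris(sample):
        print(row)
```

Keeping the parsing separate from the download means you can test the parsing on a few pasted rows before touching the network.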

Azure Open Datasets


Microsoft’s Azure Open Datasets also provides datasets for machine learning and deep learning. You can access each dataset through the Azure API.

The datasets themselves are free. You are only charged for the Azure services you use, so Azure Open Datasets is worth trying out as well.

Open Data on AWS


Like Azure, Amazon’s Open Data on AWS also hosts open datasets that are free to use, in many cases commercially as well. It offers a large variety of datasets.

Downloading files normally requires an AWS account. If you do not have one, don’t worry: you can still download the files using the AWS CLI. Follow this tutorial and you will be able to download files.
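A minimal sketch of that CLI download, wrapped in Python, might look like the following. It assumes the AWS CLI is installed and that the bucket allows anonymous reads; the `--no-sign-request` option tells the CLI to send the request without credentials. The bucket name and file key are hypothetical placeholders, so replace them with the values listed on the dataset's page.

```python
import shutil
import subprocess


def build_download_command(bucket, key, destination="."):
    """Build an `aws s3 cp` command that reads a public bucket anonymously."""
    return [
        "aws", "s3", "cp",
        f"s3://{bucket}/{key}",
        destination,
        "--no-sign-request",  # send the request without AWS credentials
    ]


def download(bucket, key, destination="."):
    """Run the command; raises if the AWS CLI is missing or the copy fails."""
    if shutil.which("aws") is None:
        raise RuntimeError("AWS CLI not found; install it first")
    subprocess.run(build_download_command(bucket, key, destination), check=True)


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only.
    print(" ".join(build_download_command("example-open-data", "path/to/file.csv")))
```

Running the script only prints the command, so you can inspect it before calling `download` for real.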


Datasets community on Reddit


On Reddit, you can find many communities about data science, machine learning, deep learning, computer vision, and more. One of them is the excellent Datasets community (r/datasets).

Many community members post datasets with links. If you need a dataset that you can’t find online, you can request it in a post, and if anyone has that dataset, they will share the link in the comments.

Web Scraping


Many websites, such as e-commerce stores or share-price trackers, contain tons of data but do not offer it as a ready-made dataset. In such cases, web scraping comes to the rescue.

Web scraping is an automated process for collecting the data you need from web pages. You write the code that captures and saves the data only once, then run it whenever you need fresh data.

To do web scraping, you need to learn some Python libraries such as BeautifulSoup and Requests.
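To give a feel for the idea, here is a minimal sketch that pulls prices out of a page. To stay dependency-free it uses the standard library's `html.parser` and `urllib` in place of BeautifulSoup and Requests; the overall flow (download the page, parse the HTML, pick out the elements you want) is the same. The `price` class name and the sample HTML are hypothetical, and you should check a site's terms before scraping it.

```python
from html.parser import HTMLParser
import urllib.request


class PriceParser(HTMLParser):
    """Collect the text of every element that has the class `price`."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Simplification: assumes price elements contain no nested tags
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Parse an HTML string and return the list of price texts."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


def scrape_prices(url):
    """Download a page and extract its prices (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return extract_prices(response.read().decode("utf-8", errors="replace"))


if __name__ == "__main__":
    # Hypothetical page fragment, for illustration only.
    sample = (
        '<ul><li>Pen <span class="price">$1.50</span></li>'
        '<li>Book <span class="price sale">$12.00</span></li></ul>'
    )
    print(extract_prices(sample))
```

Once `extract_prices` works on a saved copy of the page, `scrape_prices` can be run on a schedule to collect fresh data.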

Crowdsourcing

Crowdsourcing is another way to get the data you want. Create a simple form asking for the information you need (with a tool like Google Forms), post it on your social platforms such as LinkedIn, Reddit, Instagram, or YouTube, and ask users to fill it out. You will get data from your audience with very little effort.

Crowdsourcing is one of the simplest manual ways to build a dataset. However, it requires more time for data cleaning, because respondents often type information incorrectly.
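A minimal sketch of that cleaning step, assuming the form answers were exported as CSV, could look like this. The column names `name`, `age`, and `city` and the validation rules are hypothetical; adapt them to your own form.

```python
import csv
import io


def clean_responses(csv_text):
    """Return (clean_rows, rejected_rows) from a CSV export of form answers."""
    clean, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Collapse repeated spaces and normalize capitalization
        name = " ".join((row.get("name") or "").split()).title()
        city = " ".join((row.get("city") or "").split()).title()
        age_text = (row.get("age") or "").strip()
        if name and age_text.isdecimal() and 0 < int(age_text) < 120:
            clean.append({"name": name, "age": int(age_text), "city": city})
        else:
            rejected.append(row)  # keep for manual review instead of guessing
    return clean, rejected


if __name__ == "__main__":
    # Made-up responses with typical typing mistakes.
    sample = (
        "name,age,city\n"
        "  alice  smith ,29, new york\n"
        "BOB,twenty,london\n"
        "carol,34,PARIS\n"
    )
    good, bad = clean_responses(sample)
    print(good)
    print(bad)
```

Rows that fail validation are set aside rather than dropped, so you can correct them by hand or follow up with the respondent.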

So these are the ways to get a dataset for your machine learning or deep learning project. Hope you liked it 🙂
