S S Bhadauria
3 min read · Jan 25, 2021


Hi folks,

This is a software developer bringing you some exciting content about data science. Many of you know that data science is all about handling data with different techniques: some people use tools, some take a coding approach, and many still do it manually, which is the worst way of doing it.

But before analysing or visualising data, you have to collect it from different sources. Many non-technical people collect data through surveys, which is fine to some extent, but let's not go down that path. Most of us coders do it by writing a few lines of code and pulling huge amounts of data from whatever is available online.

So today I am going to take you through some basic Python coding that is very easy to follow, and with which you can grab data online from whatever domain interests you.

To scrape data, you first need to choose a website to pull information from. (Note: scraping data from a site can be illegal and risky. If the owners detect any malicious activity from you on their site, they can take action against you. So be careful, and do not scrape any private data without getting permission from the website owner.)

Choose any editor of your choice; if you want my recommendation, go with Google Colab.

Colab is a highly capable notebook environment that Google provides for data analysis and visualisation; many industry experts use it for their work too.

First, import all the required libraries in Colab.
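A minimal sketch of that setup cell might look like this (Colab already ships with these packages, so the install line is usually optional):

```python
# Install the libraries (usually already available in Colab).
!pip install requests beautifulsoup4 pandas matplotlib

import requests                   # send HTTP requests
from bs4 import BeautifulSoup     # pull data out of HTML
import pandas as pd               # tabular data handling
import re                         # regular expressions
import matplotlib.pyplot as plt   # plotting
```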

These commands make all of the libraries available, and here is what each one is for:

requests → allows you to send HTTP requests.

beautifulsoup → for pulling data out of HTML and XML files.

pandas → a data analysis and manipulation tool.

re → regular expressions; a pattern specifies a set of strings that match it.

matplotlib → for 2D plots of arrays.

Then create a variable url that contains your website link, and fetch all of the text content of that site into a content variable.
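A sketch of that step, using a placeholder URL (swap in the site you actually have permission to scrape):

```python
# Hypothetical URL; replace it with the site you want to scrape.
url = "https://example.com/blog"

# Download the page and keep its raw HTML as text.
content = requests.get(url).text
```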

Now create a dictionary variable where you are going to store all of the links and titles from that site.
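For example, with one list for the link text and one for the URLs (these key names are just an assumption):

```python
# Dictionary with one list for titles and one for links.
data = {"text": [], "links": []}
```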

After that, parse the HTML content with Beautiful Soup and store the result in a variable soup.
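One line does it, using Python's built-in parser:

```python
# Parse the raw HTML into a navigable tree.
soup = BeautifulSoup(content, "html.parser")
```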

Now loop over all the anchor tags, keeping only anchors whose text is longer than one character and whose link starts with http, and skipping any links that contain keywords you don't want to extract. Store each link's text in the dictionary's text list and the link itself in the links list, as in the sketch below.
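A minimal version of that loop might look like this (the skip_words filter is a hypothetical example; use whatever keywords fit your site):

```python
skip_words = ["signin", "login"]  # hypothetical keywords to filter out

for anchor in soup.find_all("a", href=True):
    link = anchor["href"]
    text = anchor.get_text(strip=True)
    # Keep real outbound links with visible text, minus unwanted keywords.
    if len(text) > 1 and link.startswith("http") and not any(w in link for w in skip_words):
        data["text"].append(text)
        data["links"].append(link)
```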

Then turn that information into a table using the pandas library, set the title column as the index, and store the table in a variable blog_list.
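Assuming the dictionary keys from above:

```python
# Build a DataFrame from the dictionary and index it by title.
blog_list = pd.DataFrame(data).set_index("text")
```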

Now you can check this table with print(blog_list), and save it to a CSV file so that you can use it outside Colab with the code below.
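Something like this (the file name is just an example):

```python
print(blog_list)

# Write the table to a CSV file for use outside Colab.
blog_list.to_csv("blog_list.csv")
```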

That's all it takes to scrape data from a website very easily. The details of the logic will vary from site to site, but the basic concept behind it all stays the same.
