Web Scraping Using Python And Beautifulsoup
Python web scraping lets you collect data from online web pages for many different purposes, most commonly data mining for data analysis, data science and machine learning. There are many other use cases for web scraping as well.
If you prefer video, check out this Python web scraping tutorial on YouTube:
Web scraping is a bit of a dark art in the sense that with great power comes great responsibility, so use what you learn in this tutorial only to do ethical scraping. In this Python web scraping tutorial we will scrape the worldometer website for some data on the pandemic.
Then we will do something with that data. Since this is a web scraping tutorial, we will mainly focus on the scraping portion and only touch very lightly on the data processing side.

We will be using a Python library called BeautifulSoup for our web scraping project. It is important to note that BeautifulSoup isn't a silver bullet for web scraping.
It is mainly a wrapper around a parser which makes it more intuitive and simpler to extract data from markup like HTML and XML. If you are looking for something which can help you navigate pages,
and also crawl websites, then BeautifulSoup won't do that on its own. However, there are other options such as Scrapy, and there will be a tutorial on Scrapy later on as well.
Then you can decide which is best for your particular project or use case. Before we can get started, let us install BeautifulSoup. For this we will create a virtual environment.
Whether you are on Windows, Mac or Linux, the procedure is very similar. So here we go:
Python beautifulsoup: installation
Let us create a virtual environment for our project. If you are on Windows, open a PowerShell or cmd prompt; on Mac or Linux, open a terminal. Then execute the following command:
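```bash
# create a virtual environment in a folder called bsenv
python -m venv bsenv
```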
Here bsenv is the folder where our virtual environment will live. You can now activate it with the usual venv activation command for your platform:
Windows:
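```bash
bsenv\Scripts\activate
```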
Linux/Mac:
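```bash
source bsenv/bin/activate
```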
Your virtual environment should now be active, and we can install BeautifulSoup.
Run this to install on linux/mac/windows:
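```bash
# installs BeautifulSoup v4, importable as the bs4 module
pip install beautifulsoup4
```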
To test that BeautifulSoup is installed, start a Python shell and import it like this:
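```python
from bs4 import BeautifulSoup  # no error means the install worked
```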
If that worked, great, you are all set! Let us now start with the most basic example.
Beautifulsoup: HTML page python web scraping / parsing
Here is an HTML example to start with. Go ahead and paste it into your favorite editor and save it as index.html.
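A minimal stand-in page (the values are hypothetical) with a title and a small table of names, which is all the examples below rely on:

```html
<!DOCTYPE html>
<html>
<head>
  <title>My First Website</title>
</head>
<body>
  <table>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>John</td><td>32</td></tr>
    <tr><td>Jane</td><td>29</td></tr>
  </table>
</body>
</html>
```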
Once you have saved that, create a new Python script called scrape.py. Here is the code we are going to use to get some info from our index.html file.
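A sketch of scrape.py, assuming the sample index.html above sits next to the script:

```python
from bs4 import BeautifulSoup

# Read our local HTML file and hand it to BeautifulSoup.
with open('index.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

print(soup.title.text)  # prints the title from our sample page
```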
This very basic bit of code grabs the title tag text from our index.html document and prints it when you run the script.
Let's now try something a little more complicated: grabbing all the tr tags and displaying the names inside them.
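A sketch matching the explanation below, against the hypothetical table in index.html:

```python
from bs4 import BeautifulSoup

with open('index.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Grab every table row, then print the first cell (the name) of each.
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    if not tds:
        continue  # skip rows without td children, e.g. the header row
    print(tds[0].text)
```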
With the sample table above, your output should be each name printed on its own line.
To explain how this code works: we load the HTML with a basic file read in Python, instantiate the BeautifulSoup parser with that HTML data, and then do a find_all on all the tr tags.
We then loop over all the trs and find the td children of each. If a row has no td children (such as the header row), we skip it. Finally we take the first td and print the text contained in it.
So far so good, pretty simple right? Let's now look at getting some data using a request. The URL we will be using is: https://www.worldometers.info/coronavirus/
Python web scraping: From a web page url
To load the HTML from the web page, we will use the requests library in Python and then feed that data to BeautifulSoup. Here is the code.
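A sketch following the step-by-step description below; the table id comes from the page itself, and the rest is one plausible way to wire it up:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML from the page.
html = requests.get('https://www.worldometers.info/coronavirus/').text
soup = BeautifulSoup(html, 'html.parser')

# The main stats table on the page has this id.
data_table = soup.find('table', id='main_table_countries_today')

datalist = []
for tr in data_table.find_all('tr'):
    rowlist = []
    try:
        for td in tr.find_all('td'):
            rowlist.append(td.text.strip())
    except AttributeError:
        continue  # skip rows with missing elements rather than crashing
    if rowlist:
        datalist.append(','.join(rowlist))

print('\n'.join(datalist))
```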
If you run this, it will print out a CSV-style set of data which we can use for some data analysis or feed into some other data pipeline.
Here is how the code works. First we import requests so that we can make web requests from our Python script. We then call requests.get on the URL and take the text of the response,
so that we get the raw HTML data. Next we feed this into our BeautifulSoup object using html.parser. Then we select the table with the id main_table_countries_today.
To find this id yourself, open the page in your Chrome or Firefox browser, head over to the table, then right click and select the inspect option.
Once you have done this you should be able to find the id main_table_countries_today. With that, the data_table variable we just created will contain the table element.
Then we simply do a find_all for all the tr elements in that table. Next we loop over our rows, initializing an empty rowlist for each one to store its data in. After that we open a try/except block, so that if any elements are missing from our data the script won't error out.
The try/except basically skips over rows that raise errors so our script can continue reading the HTML. Inside it we do a find_all for the td tags, then loop over them to add the columns to our rowlist.
Outside that inner loop we join the columns on a comma; outside the row loop we add the row to our datalist; and finally we join the datalist rows to create a CSV-like printout of the data.
If you look at the code, it was pretty simple to gather some basic tabular data. Scraping like this needs to be done carefully, as you don't want to overwhelm anyone's server.
So once again, this is only for educational purposes and what you do with it is at your own risk. For basic scraping of websites which you own yourself, however, you are free to collect data like this.
If you want to learn web scraping, it is best to do as much of it offline as you can, so you can test rapidly without sending a request to a server every time.
For that you can just save a particular web page using your browser and work on that HTML file locally.
Let us look at another example of some web scraping before we end off this tutorial.
Python web scraping: creating a blog post feed
You may want to create a blog post feed, for whatever reason; maybe you are building an application
which needs to retrieve basic data from blog posts for a social sharing website. You can use BeautifulSoup to parse that kind of data.
To show you how this can work, let us scrape some data from the home page of my own blog.
So we will use the url: https://generalistprogrammer.com
For this I have saved the HTML page using my browser.
I suggest you do the same if you want to follow along; please don't send requests to my server from Python. Save the page as gp.html.
Now here is a pro tip: you might hit a really strange error with this code.
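The error comes from reading the saved file with the platform's default encoding, likely with a read along these lines:

```python
from bs4 import BeautifulSoup

# On Windows this read defaults to cp1252, which chokes on UTF-8 bytes.
with open('gp.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

print(soup.prettify())
```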
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1520.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 18403: character maps to <undefined>
If you are getting this error, it is because the file is not being read as UTF-8. To fix this we modify our code to look like this.
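```python
from bs4 import BeautifulSoup

# Force UTF-8 so Windows doesn't fall back to cp1252.
with open('gp.html', encoding='utf8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

print(soup.prettify())
```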
So we basically just added encoding='utf8'. When you run it now, you will get the output of the web page.
Great. You may have noticed I used .prettify() at the end; that just makes the HTML more structured and easier to read. Let's now follow the same procedure as before and use the inspector tool in Chrome / Firefox.
That will open up the inspector tool.
This helps us find the entry point for our scraper. We want to look at all the articles, and you will notice the article tag slightly above the heading tag. So let's start with this bit of code.
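A sketch of that starting point, reusing the UTF-8 read from above:

```python
from bs4 import BeautifulSoup

with open('gp.html', encoding='utf8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Grab the first article on the page so we can study its structure.
article = soup.find('article')
print(article.prettify())
```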
Run that and you should see output containing the HTML for one particular article.
This is a nice example because there is so much for us to collect from this HTML: the title, date published, author, and so on. Let's start off with the title and the summary text of the article, then move on to the date published and the author.
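A sketch of that extraction; blog-entry-summary is the class named in the walkthrough below:

```python
# Title: first <a> inside the article's first <h2>.
title = article.find('h2').find('a').text

# Summary: first <p> inside the div with class blog-entry-summary.
summary = article.find('div', class_='blog-entry-summary').find('p').text

print(title)
print(summary)
```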
To summarize: we look inside our article, find the first h2 tag, then the first a tag, and take its text as our title. For the summary we look for the div element with the class blog-entry-summary, find the first p tag inside it, and take its text to populate our summary.
Pretty simple right? Your output should be the post title followed by its summary.
Nice and simple. Let's now try to get the published date and the author.
Now it becomes a little more complicated. For the author, we first create an author_section variable by finding the li item with the class meta-author.
Next we take that element, get the first a tag and take its text to get our author. Where it gets interesting is the date section, because the date is not wrapped in any predefined tag.
It sits as loose text inside its list item, so we need to do some manipulation. To show you the issue, run this bit of code.
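A sketch of both lookups; meta-author is the class named above, while meta-date is an assumed class name for the list item holding the date text:

```python
# Author: first <a> inside the li with class meta-author.
author_section = article.find('li', class_='meta-author')
author = author_section.find('a').text

# Date: the li holds loose text rather than a dedicated tag.
date_section = article.find('li', class_='meta-date')  # assumed class name
print(author)
print(date_section.text)
```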
You will notice you get the date text prefixed with 'Post published: ' when you run this code.
That is fine if that is how you want it, but I want only the date. To fix that we apply a replace on the string 'Post published: ', ending up with this code and output.
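```python
# Strip the label so only the date itself remains.
date = date_section.text.replace('Post published: ', '')
print(date)
```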
That looks good, right? Now we need to get all the blog post data on that page, which means looping over all the articles. This is very simple: we add this bit at the top.
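```python
for article in soup.find_all('article'):
    title = article.find('h2').find('a').text
    # ...and so on for the summary, author and date, indented one level...
```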
If you run it now, it prints those details for every article on the page. If you missed the part we added at the top, here is the full code again for you to test.
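A consolidated sketch under the same assumptions as above (meta-date is an assumed class name):

```python
from bs4 import BeautifulSoup

with open('gp.html', encoding='utf8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for article in soup.find_all('article'):
    title = article.find('h2').find('a').text
    summary = article.find('div', class_='blog-entry-summary').find('p').text
    author = article.find('li', class_='meta-author').find('a').text
    date = article.find('li', class_='meta-date').text.replace('Post published: ', '')
    print(title)
    print(summary)
    print(author)
    print(date)
```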
So great but what about attributes?
This tutorial has so far shown basic traversal, but what if we want to read some attribute data? Let me show you how, using the same code we have above, we can read another part of that HTML.
I now want to add the URL of the full post to the data we scraped. For that we will need to read one of the href attributes, so let's look at the article HTML to see how we can retrieve it.
Here we have a class blog-entry-readmore, followed by an a tag with the href we are looking for. So let's just add this bit of code to our script.
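```python
# The read-more link lives inside the div with class blog-entry-readmore.
readmore = article.find('div', class_='blog-entry-readmore')
post_url = readmore.find('a')['href']
print(post_url)
```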
Pretty simple: we take the article, find the element by class, traverse to the first a tag and access its attributes using the 'href' key of the dictionary BeautifulSoup builds up for us.
If, for argument's sake, we wanted the title attribute from that same a tag, we could use this code.
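```python
# Attribute access works the same way for any attribute on the tag.
link_title = readmore.find('a')['title']
print(link_title)
```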
So really really easy to get those as well.
Another python web scraping with beautifulsoup example
What about using Python web scraping to keep an eye on your favorite stocks? You can easily do some web scraping for that as well. Here is a snippet of HTML as an example of data you might want to consume.
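A hypothetical stand-in: a simple ticker table with the time, price, percent and change columns the database code below expects (all values made up):

```html
<html>
<body>
  <table>
    <tr><th>Time</th><th>Price</th><th>Percent</th><th>Change</th></tr>
    <tr><td>09:30</td><td>101.50</td><td>0.5%</td><td>+0.50</td></tr>
    <tr><td>10:30</td><td>102.75</td><td>1.2%</td><td>+1.25</td></tr>
    <tr><td>11:30</td><td>101.90</td><td>-0.8%</td><td>-0.85</td></tr>
  </table>
</body>
</html>
```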
Go ahead and copy and paste this into your editor and name it stocks.html. We will now use beautiful soup to perform python web scraping on this data as well.
With this data set we want to do something a little more interesting and save it into a database using Python,
just to show you how you can take your web scraping to the next level. You may one day want to put your data into a database where you can query and interrogate it.
Let us start off by getting a database table working in Python. For this we will use SQLite.
To start, create a dbscrape.py file. The first thing we want to do is create ourselves a table. To do that, start off with this code.
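A sketch of that first version of dbscrape.py:

```python
import sqlite3

# Open (or create) the database file.
conn = sqlite3.connect('stocks.db')
c = conn.cursor()

# One column per field we plan to scrape.
c.execute('CREATE TABLE stocks (time text, price text, percent text, change text)')

conn.commit()
conn.close()
```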
To quickly explain all this: we import sqlite3 and open a connection to a file called stocks.db. If that file doesn't exist yet, the sqlite3 package will create it for us.
Then we open a cursor, basically a pointer into our database. We then execute a CREATE TABLE statement with time, price, percent and change fields.
Finally we commit the change and close our connection.
Simple right? Great!
I like to use a plugin in Visual Studio Code called SQLite explorer by alexcvzz; you can install it from the extensions marketplace if you want to follow along.
With this installed you can view your database simply by right clicking the stocks.db file and choosing to open the database; the table then shows up in the SQLite explorer panel.
You can see how that is useful. Now let us add a little extra logic: we want to drop our table if it already exists, so we can refresh it each time.
To do that, just add this extra line before your create table query.
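```python
c.execute('DROP TABLE IF EXISTS stocks')
```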
Let us create some sample data so we can see how all this will be inserted into our table.
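A sketch with made-up values matching the stocks.html columns:

```python
sample = [
    ('09:30', '101.50', '0.5%', '+0.50'),
    ('10:30', '102.75', '1.2%', '+1.25'),
]
c.executemany('INSERT INTO stocks VALUES (?, ?, ?, ?)', sample)
```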
Here is the full code again in case you missed it.
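A consolidated sketch of dbscrape.py so far (sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect('stocks.db')
c = conn.cursor()

# Recreate the table from scratch on every run.
c.execute('DROP TABLE IF EXISTS stocks')
c.execute('CREATE TABLE stocks (time text, price text, percent text, change text)')

# Made-up sample rows so we can see inserts working.
sample = [
    ('09:30', '101.50', '0.5%', '+0.50'),
    ('10:30', '102.75', '1.2%', '+1.25'),
]
c.executemany('INSERT INTO stocks VALUES (?, ?, ?, ?)', sample)

conn.commit()
conn.close()
```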
If you check that out in your SQLite explorer plugin in Visual Studio Code and click the play button next to the table, you end up with a table showing the inserted rows.
Great, so now we have a working table which can accept data. Let us now scrape our data from the HTML and put it into our SQLite table.
Here is the code to start using python web scraping on our html and putting it into our sqlite table:
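A sketch that follows the steps listed below, using the hypothetical stocks.html from earlier:

```python
import sqlite3
from bs4 import BeautifulSoup

conn = sqlite3.connect('stocks.db')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS stocks')
c.execute('CREATE TABLE stocks (time text, price text, percent text, change text)')

with open('stocks.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for tr in soup.find_all('tr'):
    if tr.find('th'):
        continue  # skip the heading row
    time, price, percent, change = [td.text for td in tr.find_all('td')]
    c.execute('INSERT INTO stocks VALUES (?, ?, ?, ?)',
              (time, price, percent, change))

conn.commit()
conn.close()
```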
So this is how it works.
- Import BeautifulSoup and sqlite3.
- Create our connection and SQLite table.
- Instantiate BeautifulSoup with html.parser for our web scraping.
- Find all the tr elements in our HTML.
- Loop over the tr elements, skipping anything that is a heading row (th tags).
- Within each tr, find all the td elements.
- Assign each column variable from the corresponding td.
- Execute the insert statement with our data.
- Finally, commit our data and close our connection.
Inspect your SQLite table after running this Python script and you should see one row per data row in the HTML.
As simple as that you can combine python web scraping with persistent storage to store your data.
Python web scraping: use cases
I can think of a few use cases for this. Maybe you own an eCommerce store and need to validate prices; you could use scraping to run a daily check that the pricing on your store is still correct.
If you are a researcher you can now collect data from public data feeds to help your research. Scraping when done ethically and for the right reason can have some really useful applications.
You may even want to use this for automated testing if you are a web developer: you create a unit test expecting a certain output on a web page, and if the output is what you expect the test passes, otherwise it fails.
So you could build a script to check that your web pages act in the expected way after a major code change.
You could build a scraper to help automate content creation, for example by collecting excerpts from your blog to use in your YouTube video descriptions.
Really, you just need to use your imagination. You could even build some nice little scripts that log in to your favorite subscription website and tell you when new content has been released.
Other things to read up about
There are quite a few other alternatives for web scraping in Python. You could use the webdrivers which ship with the Selenium library. Selenium has mostly been used for automating the testing of web pages.
However, because it uses real browser functionality it is often quite useful for scraping too, especially when data gets rendered via JavaScript or AJAX and you need your scraper to be more DOM-aware. In those cases Selenium might be the right choice for you.
I will be doing a tutorial on Selenium WebDriver in tutorials to come as well. From my own experience, though, I would not use Selenium for large data sets, since it obviously carries overhead from running a full browser instance.

Then you have Scrapy, another Python scraping library. It carries a little more functionality, which allows you to crawl pages more easily, as it has built-in support for following links, navigation and better HTTP requests.
It also offers better support for cookies and login sessions, which will make your life easier if you are scraping behind a logged-in session. Scrapy can also handle proxies, which sometimes comes in handy.
Another option you may have heard about is MechanicalSoup, a library which works on top of BeautifulSoup and extends it with cookie and session handling,
as well as a way for you to populate forms and submit them. It does not do JavaScript parsing, so it will not be as capable as Selenium with Python, but it can help with web scraping when you need to log in to a website before gathering data.
Web scraping on its own may even be a specialist field. However, it doesn't hurt to have some knowledge for when a nice project comes your way.

So there is a lot for you to learn. Just some final words: if you liked this tutorial and want to read more of my tutorials, you can check some of them out here:
I also create YouTube videos on the main topics of my tutorials; you can check out my channel here: https://www.youtube.com/channel/UC1i4hf14VYxV14h6MsPX0Yw
This article is a detailed explanation of web scraping in Python using BeautifulSoup. The prerequisites for this article are Python and Pandas.
Web scraping also requires a little knowledge of HTML, so if you know it already, great; otherwise don't worry, I'll cover the required HTML topics.
First, we’ll talk about Web Scraping, then we’ll look into the BeautifulSoup, and in the end, we’ll take an example.
Why do we need Web Scraping?
Suppose someone asks you to get the list of the Top 100 Movies along with all the details like year, rating, director, and actors; what would you do?
First, you would search for Top 100 Movies on Google, then open the first link (maybe IMDb) and start copy-pasting the list and the details. That seems like a bad idea. What if you had a script or program that takes the URL of the website and extracts all the required information from it?
Similarly, there might be hundreds of websites with relevant information for you; some have static information and some have changing information, like sports sites, news sites, etc. In today's world, digital information is very important and highly valuable.
Web Scraping provides a way to automate the information extraction from the given website(s).
What is it?
Web scraping is nothing but automated data extraction from websites; after extraction, this data is processed and converted into useful information.
Websites are collections of web pages, where each web page is built using text-based markup languages like HTML and XHTML.
Web pages contain useful information in text form, but they are designed for humans as the end users, so it takes special tooling to automate the information extraction.
A web scraping tool uses the HTML structural elements (div, span, p, a, etc.) and attributes (id, class) of the web page to extract the text information.
Now before moving towards BeautifulSoup, first let’s take a brief look into some HTML basics.
HTML Basics.
In order to extract information, we first need insight into the structure of the web page; this tells us which section of the HTML holds which piece of information.
For a better understanding of a web page's structure, we need to know some HTML basics.
Elements.
An HTML element contains 3 parts: a start tag, some content, and an end tag.
There are several types of elements in HTML, each with a different purpose and usage. Each type of element is uniquely identified by its tag name. Elements can also be nested, as shown below.
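For example:

```html
<!-- start tag, content, end tag; elements can also nest -->
<div>
  <p>This paragraph is nested inside a div.</p>
</div>
```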
- h1: Used for headings; h1 displays the heading in the biggest size, and h2, h3, h4, h5, and h6 are the other heading elements.
- p: Used for paragraphs.
- a: Used to provide hyperlinks.
- div: Defines a division or section.
- span: Used for grouping inline elements.
Attributes.
An HTML element may or may not have attributes. Attributes provide additional information about the element. We will only talk about two of them: class and id.
- Class: The HTML class attribute is used to apply the same styles to all elements with the same class name.
- Id: The id attribute specifies a unique id for an HTML element (the value must be unique within the HTML document).
I have tried to give a brief overview of these HTML components; if you still have doubts or want to explore more, check out w3schools.
Ok, so let’s dive into BeautifulSoup, a beautiful tool that makes web scraping super easy.
BeautifulSoup.
BeautifulSoup is a Python package for parsing HTML and XML documents; it provides Pythonic idioms for iterating, searching, and modifying the parse tree.
It can work with different parsers, such as Python's built-in html.parser, lxml, and html5lib, and provides an API to search based on the structural elements and attributes of HTML and XHTML.
Installation.
Since it's a Python package, the installation is super easy: install it using pip. I am assuming that you are working in a Python 3 environment.
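```bash
pip install beautifulsoup4
```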
Usage.
We'll only cover the most common usage.
Soup.
Before we can parse an HTML document, we have to create a soup of the given document; it is the basis of all the parsing. A soup is created by passing the HTML content to the BeautifulSoup constructor.
The HTML content can be passed in multiple ways, such as passing the file pointer of the HTML document (web page) or passing the HTML content as a string.
Passing the file pointer.
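For example, using the html.parser that ships with Python:

```python
from bs4 import BeautifulSoup

# BeautifulSoup accepts an open file object directly.
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')
```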
Passing HTML content as string.
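```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1>Godfather</h1>', 'html.parser')
```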
Prettify.
Once the soup is ready we can display it in a very nice and clean way using Prettify. Prettify maintains all the structural hierarchy of HTML while displaying it.
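```python
print(soup.prettify())
```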
If you run the above code, you'll get nicely indented HTML output, which helps you get a better understanding of the structure and hierarchy of the HTML.
Find.
Now we are ready to parse the document and extract data from it. find helps us locate HTML elements (div, span, p, a) based on their tag, id, and attributes.
find always returns only the first search result.
Let's take a small piece of HTML content as an example.
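A small hypothetical snippet containing the tag, id and class that the next examples search for:

```python
from bs4 import BeautifulSoup

html = '''
<div class="movie">
  <h1 id="name">Godfather</h1>
  <span class="genre">Crime, Drama</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
```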
Now we will try to extract different pieces of information from the above content using different properties.
Find using tag name.
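```python
print(soup.find('h1').text)  # Godfather
```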
The above code finds the first HTML element with the h1 tag name, and text returns the text part of that element. So the output of the code will be Godfather.
BeautifulSoup provides multiple ways to extract the same information. You can extract the above information using the following method also.
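```python
# A tag name can also be accessed as an attribute of the soup;
# it likewise returns the first matching element.
print(soup.h1.text)  # Godfather
```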
Find using id.
It provides the functionality to search for an element by its id. You can extract the movie name from the above example using its id.
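```python
print(soup.find(id='name').text)  # Godfather
```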
Find using attributes.
You can search an element based on its attributes like class. We’ll extract genre from the above example by its class.
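```python
# class is a reserved word in Python, hence the trailing underscore.
print(soup.find('span', class_='genre').text)  # Crime, Drama
```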
Find all.
In the above section we saw that find returns only the first result; if we want all the elements with a specific property, we can use find_all.
Let’s take another example with multiple movies.
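Again a hypothetical snippet, this time with three movies:

```python
html = '''
<h1>Godfather</h1>
<h1>The Dark Knight</h1>
<h1>12 Angry Men</h1>
'''
soup = BeautifulSoup(html, 'html.parser')
```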
As you can see, all the movies are in h1 heading elements, so let's find all the elements with the h1 tag name. The following code will return all 3 results.
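```python
movies = soup.find_all('h1')
print(movies)  # a list of all three <h1> elements
```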
You might have noticed that we didn't apply text to the results as we did with find; with find_all you have to extract the text from each individual result, for example:
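```python
for movie in movies:
    print(movie.text)
```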
Apart from returning multiple results, find_all is functionality-wise same as find.
We have learned some of the basic APIs of BeautifulSoup, but it provides many other APIs for more complex scenarios, so if you are willing to explore then check out the documentation.
Done with the theory? OK, let's take an example and extract the data from a real website.
For web scraping we need to know the structure of the web page we are dealing with; understanding the HTML structure is the most important part of web scraping, so pay extra attention to the next section.
Analysis of the web page.
From the beginning of the article we have been talking about the Top 100 Movies, so let's take that example and extract the information about the movies from IMDb.
Inspect the web page.
First, open the Top 100 Movies and analyze the HTML structure of the web page. To analyze any web page we can use the inspect feature of the browser. Just do a right-click on the web page and click the inspect option from the list.
This gives us the general HTML layout of the web page, but if you want to see the structure of any particular element then take the cursor over that element on the web page and then do the inspect.
If you move the cursor in the HTML section then you’ll see different highlighted blocks on the web page, which shows the mapping between the HTML element and its corresponding block.
In the above image there is a highlighted block on the web page, and above that block you can see div.lister-item-content, which is the HTML element of that block. Here div is the actual element and lister-item-content is its class attribute.
The highlighted block contains all the information about a particular movie (name, year, genre, etc.). There are 100 such blocks on the web page, one block per movie. If you inspect some of them you'll find that they have similar HTML elements and attributes.
Finding the relevant HTML elements.
Now let's explore the block further: click on div.lister-item-content in the HTML section and you'll see multiple nested elements. These nested elements hold our information, so to find the HTML element for a given piece of information, just move the cursor over that information on the web page and inspect it.
Finding the relevant HTML elements is a fairly easy task, so let's see what we have found. Below is the list of movie info and the corresponding elements.
- Movie Name: Element a under the h3.lister-item-header.
- Year: Element span.lister-item-year text-muted unbold under h3.lister-item-header.
- Runtime: Element span.runtime under p.text-muted text-small.
- Genre: Element span.genre under p.text-muted text-small.
- Rating: Element span.ipl-rating-star__rating.
- Director: Element a under second p.text-muted text-small.
- Stars: Element a under second p.text-muted text-small.
Finding the above information about HTML elements is the most important step of web scraping, so before we move further, make sure you understand each and every part of it.
If you have understood the above sections clearly, then the implementation is going to be a piece of cake. So let's implement web scraping in Python using BeautifulSoup.
Implementation in Python
Note: Not all websites allow web scraping, so please be cautious before you scrape a given website; it might get your IP blocked from accessing that site.
As discussed, we are going to use web scraping to extract the information for the Top 100 Movies, so let's implement it in Python step by step.
Getting the HTML content.
Here we’ll use Python requests module to fetch the HTML content.
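A sketch of the fetch; the list URL is a placeholder for whichever IMDb Top 100 page you are analyzing:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/list/...'  # placeholder for the Top 100 Movies URL
r = requests.get(url)

soup = BeautifulSoup(r.content, 'html.parser')
```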
First, we need the actual URL to fetch the content from; in the sketch above that is on line 4.
On line 5 we send a GET request to the URL, which returns the actual HTML content along with other response-related information like the headers. We can extract the HTML content from the response using r.content.
Once we have the actual HTML content, we create the soup by passing r.content to BeautifulSoup.
Finding the main blocks.
In the Analysis of the web page section we have talked about the main block that has all the movie-related information, and the entire web page has 100 such blocks, one block per movie.
So the first task is to find all the main blocks. We know from the last section that a main block is a div element with the lister-item-content class attribute, so we can use this information to find all the blocks.
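```python
# Each movie's details live in a div with the lister-item-content class.
blocks = soup.find_all('div', class_='lister-item-content')
```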
We have used find_all because we want to fetch all the blocks. Now we iterate through the blocks and extract the information from each one individually.
Extracting information from each block.
This section involves the actual information extraction, so pay extra attention and if you don’t understand any part of it then please refer to the Analysis of the web page section.
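A sketch of the per-block extraction described step by step below, run inside a loop over the blocks found above (the class names come from the earlier analysis):

```python
import re

for block in blocks:
    # Movie name: text of the first <a> in the block.
    name = block.find('a').text

    # Year: strip the enclosing brackets from e.g. '(1972)'.
    year = block.find('span', class_='lister-item-year text-muted unbold').text.strip('()')

    runtime = block.find('span', class_='runtime').text
    genre = block.find('span', class_='genre').text.strip()
    rating = block.find('span', class_='ipl-rating-star__rating').text

    # Directors and stars sit in the second matching <p> (index 1).
    people = block.find_all('p', class_='text-muted text-small')[1].text
    people = people.replace('\n', '').split('|')

    directors = [d.strip() for d in re.split('Director:|Directors:', people[0])[1].split(',')]
    stars = [s.strip() for s in re.split('Star:|Stars:', people[1])[1].split(',')]
```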
Let's understand each part of the above code step by step.
Movie Name: Since the movie name comes under the first 'a' element, we can extract it using find; text is used to extract the text part of it (discussed earlier).
Year: It comes under the span element with the lister-item-year text-muted unbold class. In the next line, we remove the enclosing brackets.
Runtime: It can be found in span element with runtime class.
Genre: We can extract it from the span element with the genre class. Once we have extracted it, we apply a string strip to remove the unwanted spaces.
Rating: It comes under the span element with ipl-rating-star__rating class.
Directors and Stars: All the above parts were easy, but this one is a little tricky. There is more than one 'p' element with the text-muted text-small class, and the information about Directors and Stars comes under the second result; that is why we take the result at index 1 (counting starts at 0).
Next we remove all the unwanted '\n' newlines. As you might have noticed, the list of Director(s) and Star(s) is divided by '|', so we apply a split to divide the text, which gives us the people list. After the split, the first part of people contains the list of Directors and the second part contains the list of Stars.
The names of the directors come after the Director or Directors keyword, so we apply re.split (which allows splitting on multiple delimiters). This split gives a list of two elements, as follows.
For the first movie, the split produces a two-element list: an empty string, followed by the comma-separated director names.
Only the element at index 1 contains the director name(s), which is why we pick that element and then split it on ',' to get the director name(s) as a list. After getting the list, we apply a string strip to clean each name.
We apply exactly the same logic to extract the list of Stars.
Now we have extracted all the required information that you can save in any format you want.
The next section contains the complete code, and in that code we save the information into a Pandas DataFrame.
Complete Code.
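A consolidated sketch under the same assumptions (placeholder URL, class names from the analysis above), saving everything into a Pandas DataFrame:

```python
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/list/...'  # placeholder for the Top 100 Movies URL
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

movies = []
for block in soup.find_all('div', class_='lister-item-content'):
    name = block.find('a').text
    year = block.find('span', class_='lister-item-year text-muted unbold').text.strip('()')
    runtime = block.find('span', class_='runtime').text
    genre = block.find('span', class_='genre').text.strip()
    rating = block.find('span', class_='ipl-rating-star__rating').text

    people = block.find_all('p', class_='text-muted text-small')[1].text.replace('\n', '').split('|')
    directors = [d.strip() for d in re.split('Director:|Directors:', people[0])[1].split(',')]
    stars = [s.strip() for s in re.split('Star:|Stars:', people[1])[1].split(',')]

    movies.append({'name': name, 'year': year, 'runtime': runtime, 'genre': genre,
                   'rating': rating, 'directors': directors, 'stars': stars})

df = pd.DataFrame(movies)
print(df.head())
```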
So this is all about web scraping in Python using BeautifulSoup. If you have any issues or doubts, please let me know in the comments.
Thanks for reading!
