In this article, I will show you how easily you can scrape or crawl data from any website using Python's BeautifulSoup library.
What is web scraping?
Web scraping goes by various names, such as web crawling or web data extraction. It is an essential part of the data collection process.
As a data analyst, you cannot always expect your input data to come from customers. Sometimes, to do some initial analysis, we need to extract data from freely available websites.
Applications of Web Scraping
There are various real-world applications of web crawlers. For example, OpenAI crawled millions of web pages to build the training data for its ChatGPT model.
SEO tools like Ahrefs, SEMrush, and Moz continuously crawl websites to build their keyword databases, using crawlers they usually refer to as bots.
You can build a real-time web scraper to fetch weather data, stock prices, real estate listings, e-commerce prices, and more.
For example, you could build an automated web scraper to continuously monitor the price of a product across different e-commerce websites or platforms. This can help consumers find the best deals.
A great example of this kind of application is Keepa.com. It tracks the prices of products on Amazon and notifies you whenever a price drops.
To build such an application, you need to run your web scraper continually. There are various ways to run Python scripts 24/7. Or you can update the scraped data in Google Sheets from Python.
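As a rough illustration of what such an application might look like, here is a minimal sketch of a periodic price monitor. The fetch_price() function is a hypothetical placeholder (it is not implemented here); a real version would download and parse the product page.

import time

def fetch_price(product_url):
    # Hypothetical placeholder: a real implementation would download
    # the product page and parse the current price out of its HTML.
    return None

def monitor_price(product_url, target_price, interval_seconds=3600):
    # Check the price once per interval and report when it drops
    # to or below the target price. Runs until interrupted.
    while True:
        price = fetch_price(product_url)
        if price is not None and price <= target_price:
            print(f"Price dropped to {price} for {product_url}")
        time.sleep(interval_seconds)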
Why use Python for web scraping?
Python is a popular and powerful programming language. Nowadays, anything from web apps to mobile applications can be built with Python. So if your main application is developed in Python, you can use the same language for your web scraping tasks as well.
Along with that, Python has various libraries like requests, BeautifulSoup, Scrapy, and Selenium that make web scraping much more convenient.
I have worked with various programming languages, but trust me, I found Python to be the best language for web scraping and crawling.
If you are new to Python and want to learn this powerful language by reading books, then this article is for you: 5 Best Book for Learning Python. Or if you want to learn Python quickly, then this Udemy course is for you: Learn Python in 100 days of coding.
Steps to Make a Web Crawler in Python
I have been doing web scraping with Python for a long time and always wanted to write an article about it. Today I finally found some time to show you a really easy technique for web crawling in Python with the BeautifulSoup library.
For a better explanation, I will break the entire process of building a Python web crawler into a few steps:
Step1: Import Libraries
Like all other Python projects, we first need to import the required libraries. For this simple web scraping tutorial project, we only need the requests and BeautifulSoup libraries.
import requests
from bs4 import BeautifulSoup
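If you do not have these two libraries yet, you can install both from PyPI with pip install requests beautifulsoup4 (the BeautifulSoup package is published under the name beautifulsoup4).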
Step2: Define URL
With web scraping, we are planning to extract data from a website available over the internet, right? So first we need to figure out what kind of data our project needs and which website contains that kind of data.
For this example, I am going to extract Amazon review data from the Trustpilot website. Before going to the code, let me tell you that Trustpilot is a popular website that contains customer reviews for lots of service providers. If you need customer review data for an NLP project, this website is worth checking out.
url = 'https://www.trustpilot.com/review/www.amazon.com?page=2'
Step3: Get Response
The next step in building our Python web scraper is to get the response for the URL we defined. In this step, we check whether the website is accessible over the internet (through Python). If it is, we open a connection to that URL programmatically.
# Send an HTTP GET request to the URL
response = requests.get(url)
print(response)
Output:
<Response [200]>
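A status code of 200 means the request succeeded. Before parsing anything, it is a good habit to check this code and fail early on errors. Some websites also block requests that do not look like they come from a browser, so passing a User-Agent header can help; the header value below is only an example.

# Check that the request actually succeeded before parsing anything
print(response.status_code)     # 200 means the page was fetched successfully
response.raise_for_status()     # raises an exception for 4xx/5xx responses

# Optionally send a browser-like User-Agent (example value) in case the
# site rejects the default requests User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'}
response = requests.get(url, headers=headers)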
Step4: Parse HTML Content
Once we make the connection to the website, we can extract all the information from that particular page. The key idea is to download the entire HTML content of the page and then read it. This can be done easily in Python using the BeautifulSoup library.
# Parse the HTML content of the page
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)
Output:
<!DOCTYPE html>
<html>
<head>
...
</head>
<body>
...
</body>
</html>
(The full HTML of the page is printed here; it is truncated above for brevity.)
In the above Python code, we parse the entire HTML content of our desired page and store it in the soup variable. One thing to note is that this soup variable holds the parsed page as a tree of tags (a BeautifulSoup object), which we can search and navigate much like an XML document.
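Once the page is parsed, you can already pull simple pieces of information out of the soup object. As a quick sketch, the tag names below (title, h1, a) exist on most pages, but what they contain depends on the page you actually scraped:

# A few quick ways to poke around the parsed page
print(soup.title.get_text())        # text of the <title> tag
first_heading = soup.find('h1')     # first <h1> tag, or None if absent
if first_heading is not None:
    print(first_heading.get_text())
print(len(soup.find_all('a')))      # number of links on the page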
Step5: Find XPath
So we now have the entire content of that page in the soup object. But we do not need everything on the web page; we only need the customer reviews. To extract a specific portion of a webpage, you locate the elements you want with selector techniques such as XPath or CSS selectors. BeautifulSoup works with CSS selectors and tag or class-name searches, while XPath is typically used with libraries like lxml. In this tutorial, I will first explain the idea of XPath and then pick out the review headings by their class name.
What is XPath?
XPath is a language that helps you find any part of an XML document. XML documents are like trees with branches and leaves.
XPath can tell you the exact path to any branch or leaf. For example, if you want to find the author of the chemistry book in this XML document:
<!-- Example XML document -->
<bookstore>
  <book category="Math">
    <title lang="en">IIT Mathematics</title>
    <author>A Das Gupta</author>
  </book>
  <book category="Chemistry">
    <title lang="en">Inorganic chemistry for JEE</title>
    <author>V K Jaiswal</author>
  </book>
</bookstore>
You can use this XPath expression:
/bookstore/book[@category='Chemistry']/author
This means: start from the root node (/), then go to the “bookstore” node, then go to the “book” node that has an attribute called category with a value of Chemistry ([@category='Chemistry']), then go to the “author” node.
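To make the idea concrete, here is a minimal sketch of evaluating that XPath expression in Python. Note that BeautifulSoup itself does not evaluate XPath; this sketch uses the lxml library instead, purely to illustrate how the expression selects the author node.

from lxml import etree

xml_data = """<bookstore>
  <book category="Math">
    <title lang="en">IIT Mathematics</title>
    <author>A Das Gupta</author>
  </book>
  <book category="Chemistry">
    <title lang="en">Inorganic chemistry for JEE</title>
    <author>V K Jaiswal</author>
  </book>
</bookstore>"""

tree = etree.fromstring(xml_data)
# Select the author of the book whose category attribute is "Chemistry"
authors = tree.xpath("/bookstore/book[@category='Chemistry']/author/text()")
print(authors)   # ['V K Jaiswal']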
Now coming back to our main tutorial. For this example, I want to extract the headings of all reviews. To find the right selector for the headings, hover your mouse cursor over any review heading, then right-click and choose Inspect (inspect element).
In the Elements panel, you will find the class name of the heading element. Copy it; this class name is what we will use to fetch the review headings from this web page.
There are various ways to find XPath expressions and CSS selectors, but I always use this inspect-element technique. It is simple and easy.
Step6: Extract Review Headings
So we found our class name. Let's now scrape all the headings that share it. To do this, use the .find_all() method of the BeautifulSoup package; below is the Python code.
comment_headings = soup.find_all(class_='typography_heading-s__f7029 typography_appearance-default__AAY17')
print(comment_headings)
Output:
[<h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Best Shopping Website Worldwide</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">I have had problems to generate a…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Disgraceful service</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Amazon has now lost its former trust</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">I ordered a box of a dozen Clif Bars</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">The choice is great</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Absolutely shocking customer service is…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Alexa</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Christmas Disaster</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Thank Goodness For Walmart Plus</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Adverts now shown on amazon prime</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">SCAM AND LIES</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">There are many Alternatives</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">I usually don't leave reviews but this…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">DISAPPOINTED IN AMAZON</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Prompt delivery by Royal Mail not…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Delivery attempted but I was at home…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Ordered a book</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Just great!</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">It's bad enough packages never arrive…</h2>]
Since we use the .find_all() method, it extracts every heading on the page that has this class name. If you only want the first matching heading, you can use the soup.find() method instead.
soup.find(class_='typography_heading-s__f7029 typography_appearance-default__AAY17').get_text()
Output:
Absolutely shocking customer service is…
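By the way, since I mentioned CSS selectors earlier: BeautifulSoup also supports them through the .select() method, which is another way to grab the same headings. The class names below are the ones copied from Trustpilot at the time of writing and may change whenever the site updates its markup.

# Alternative: select the same headings with a CSS selector
headings = soup.select('h2.typography_heading-s__f7029.typography_appearance-default__AAY17')
for h in headings:
    print(h.get_text())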
Step7: Convert to String
As you can see in the above output, everything we extracted is still wrapped in HTML tags, but we need those headings as plain strings. We can easily convert those HTML elements to text using the .get_text() method of the BeautifulSoup package.
# Printing all headings in string or text format
for com_head in comment_headings:
    print(com_head.get_text())
Absolutely shocking customer service is…
SCAM AND LIES
Adverts now shown on amazon prime
Thank Goodness For Walmart Plus
There are many Alternatives
I usually don't leave reviews but this…
Delivery attempted but I was at home…
Prompt delivery by Royal Mail not…
Ordered a book
It's bad enough packages never arrive…
I can't rely on Amazon anymore.
Rip off Amazon
Dumped adverts on paying loyal customers
Amazon Customer Service Genuine…
What happened amazon?
Literally the worst shopping experience…
Where do I even begin
Phone Support Does Not Understand English
Never had issues with Amazon
Amazon requires an amendment to enable me to publish reviews once again.
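If you want to keep these headings for later analysis instead of just printing them, you can collect the text into a plain Python list in one line:

# Collect the review headings as plain strings
heading_texts = [h.get_text(strip=True) for h in comment_headings]
print(heading_texts[:5])   # first five headings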
How to Scrape Dynamic Websites in Python
So far in this tutorial, I showed you how to scrape a simple website using BeautifulSoup in Python. Websites like this are called static websites.
Dynamic websites, on the other hand, use JavaScript, AJAX, or other techniques to load content based on user interaction. For example, say you want to scrape data from a social media site like Facebook.
In this case, you only see the data as you scroll down. Along with that, you need to log in to the website, and to see comments you need to click the comment buttons.
These types of websites are called dynamic websites. You cannot scrape dynamic websites like Facebook using just BeautifulSoup and Python.
To scrape a dynamic website (for example, a social media site like Facebook) in Python, you can use the Selenium library, which is what I mostly use.
In my next tutorial, I will share how you can scrape dynamic websites using Selenium in Python. I will not cover that in this article.
That is it for this article. If you have any questions or suggestions, please let me know in the comment section below.