In this article, I will show you how easily you can scrape or crawl data from any website using Python's BeautifulSoup library.
What is web scraping?
Web scraping goes by various names, such as web crawling or web data extraction. It is an essential part of the data collection process.
As a data analyst, you cannot always expect your input data to come from customers. Sometimes, to do some initial analysis, we need to extract data from freely available websites.
Applications of Web Scraping
There are various real-world applications of web crawlers. For example, OpenAI crawled millions of web pages to build the training data for its ChatGPT model.
SEO tools like Ahrefs, SEMrush, and Moz continuously crawl websites to build their keyword databases, using crawlers they usually refer to as bots.
You can build a real-time web scraper to fetch weather data, stock prices, real estate listings, e-commerce prices, and more.
For example, you could build an automated web scraper to continuously monitor the price of a product across different e-commerce websites or platforms. This can help consumers find the best deals.
A great example of this kind of application is Keepa.com. It tracks the prices of products on Amazon and notifies you whenever a price drops.
To build such an application, you need to run your web scraper continually. There are various ways to run Python scripts 24/7. Or you can update the scraped data in Google Sheets from Python.
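As a rough illustration of what such an application might look like, here is a minimal sketch of a periodic price monitor. The fetch_price() function is a hypothetical placeholder (it is not implemented here); a real version would download and parse the product page.

import time

def fetch_price(product_url):
    # Hypothetical placeholder: a real implementation would download
    # the product page and parse the current price out of its HTML.
    return None

def monitor_price(product_url, target_price, interval_seconds=3600):
    # Check the price once per interval and report when it drops
    # to or below the target price. Runs until interrupted.
    while True:
        price = fetch_price(product_url)
        if price is not None and price <= target_price:
            print(f"Price dropped to {price} for {product_url}")
        time.sleep(interval_seconds)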
Why use Python for web scraping?
Python is a popular and powerful programming language. Nowadays, anything from web apps to mobile applications can be built with Python. So if your main application is developed in Python, you can use the same language for your web scraping tasks as well.
Along with that, Python has various libraries like requests, BeautifulSoup, Scrapy, and Selenium that make web scraping much more convenient.
I have worked with various programming languages, but trust me, I found Python to be the best language for web scraping and crawling.
If you are new to Python and want to learn this powerful language by reading books, then this article is for you: 5 Best Book for Learning Python. Or if you want to learn Python quickly, then this Udemy course is for you: Learn Python in 100 days of coding.
Steps to Make a Web Crawler in Python
I have been doing web scraping with Python for a long time and always wanted to write an article about it. Today I finally found some time to show you a really easy technique for web crawling in Python with the BeautifulSoup library.
For a better explanation, I will break the entire process of building a Python web crawler into a few steps:
Step1: Import Libraries
Like all other Python projects, we first need to import the required libraries. For this simple web scraping tutorial project, we only need the requests and BeautifulSoup libraries.
import requests
from bs4 import BeautifulSoup
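If you do not have these two libraries yet, you can install both from PyPI with pip install requests beautifulsoup4 (the BeautifulSoup package is published under the name beautifulsoup4).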
Step2: Define URL
With web scraping, we are planning to extract data from a website available over the internet, right? So first we need to figure out what kind of data our project needs and which website contains that kind of data.
For this example, I am going to extract Amazon review data from the Trustpilot website. Before going to the code, let me tell you that Trustpilot is a popular website that contains customer reviews for lots of service providers. If you need customer review data for an NLP project, this website is worth checking out.
url = 'https://www.trustpilot.com/review/www.amazon.com?page=2'
Step3: Get Response
The next step in building our Python web scraper is to get the response for the URL we defined. In this step, we check whether the website is accessible over the internet (through Python). If it is, we open a connection to that URL programmatically.
# Send an HTTP GET request to the URL
response = requests.get(url)
print(response)
Output:
<Response [200]>
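A status code of 200 means the request succeeded. Before parsing anything, it is a good habit to check this code and fail early on errors. Some websites also block requests that do not look like they come from a browser, so passing a User-Agent header can help; the header value below is only an example.

# Check that the request actually succeeded before parsing anything
print(response.status_code)     # 200 means the page was fetched successfully
response.raise_for_status()     # raises an exception for 4xx/5xx responses

# Optionally send a browser-like User-Agent (example value) in case the
# site rejects the default requests User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'}
response = requests.get(url, headers=headers)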
Step4: Parse HTML Content
Once we make the connection to the website, we can extract all the information from that particular page. The key idea is to download the entire HTML content of the page and then read it. This can be done easily in Python using the BeautifulSoup library.
# Parse the HTML content of the page
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)
Output:
<!DOCTYPE html>
<html>
<head>
...
</head>
<body>
...
</body>
</html>
(The full HTML of the page is printed here; it is truncated above for brevity.)
In the above Python code, we parse the entire HTML content of our desired page and store it in the soup variable. One thing to note is that this soup variable holds the parsed page as a tree of tags (a BeautifulSoup object), which we can search and navigate much like an XML document.
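Once the page is parsed, you can already pull simple pieces of information out of the soup object. As a quick sketch, the tag names below (title, h1, a) exist on most pages, but what they contain depends on the page you actually scraped:

# A few quick ways to poke around the parsed page
print(soup.title.get_text())        # text of the <title> tag
first_heading = soup.find('h1')     # first <h1> tag, or None if absent
if first_heading is not None:
    print(first_heading.get_text())
print(len(soup.find_all('a')))      # number of links on the page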
Step5: Find XPath
So we now have the entire content of that page in the soup object. But we do not need everything on the web page; we only need the customer reviews. To extract a specific portion of a webpage, you locate the elements you want with selector techniques such as XPath or CSS selectors. BeautifulSoup works with CSS selectors and tag or class-name searches, while XPath is typically used with libraries like lxml. In this tutorial, I will first explain the idea of XPath and then pick out the review headings by their class name.
What is XPath?
XPath is a language that helps you find any part of an XML document. XML documents are like trees with branches and leaves.
XPath can tell you the exact path to any branch or leaf. For example, if you want to find the author of the chemistry book in this XML document:
<!-- Example XML document -->
<bookstore>
  <book category="Math">
    <title lang="en">IIT Mathematics</title>
    <author>A Das Gupta</author>
  </book>
  <book category="Chemistry">
    <title lang="en">Inorganic chemistry for JEE</title>
    <author>V K Jaiswal</author>
  </book>
</bookstore>
You can use this XPath expression:
/bookstore/book[@category='Chemistry']/author
This means: start from the root node (/), then go to the “bookstore” node, then go to the “book” node that has an attribute called category with a value of Chemistry ([@category='Chemistry']), then go to the “author” node.
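To make the idea concrete, here is a minimal sketch of evaluating that XPath expression in Python. Note that BeautifulSoup itself does not evaluate XPath; this sketch uses the lxml library instead, purely to illustrate how the expression selects the author node.

from lxml import etree

xml_data = """<bookstore>
  <book category="Math">
    <title lang="en">IIT Mathematics</title>
    <author>A Das Gupta</author>
  </book>
  <book category="Chemistry">
    <title lang="en">Inorganic chemistry for JEE</title>
    <author>V K Jaiswal</author>
  </book>
</bookstore>"""

tree = etree.fromstring(xml_data)
# Select the author of the book whose category attribute is "Chemistry"
authors = tree.xpath("/bookstore/book[@category='Chemistry']/author/text()")
print(authors)   # ['V K Jaiswal']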
Now coming back to our main tutorial. For this example, I want to extract the headings of all reviews. To find the right selector for the headings, hover your mouse cursor over any review heading, then right-click and choose Inspect (inspect element).
In the Elements panel, you will find the class name of the heading element. Copy it; this class name is what we will use to fetch the review headings from this web page.
There are various ways to find XPath expressions and CSS selectors, but I always use this inspect-element technique. It is simple and easy.
Step6: Extract Review Headings
So we found our class name. Let's now scrape all the headings that share it. To do this, use the .find_all() method of the BeautifulSoup package; below is the Python code.
comment_headings = soup.find_all(class_='typography_heading-s__f7029 typography_appearance-default__AAY17')
print(comment_headings)
Output:
[<h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Best Shopping Website Worldwide</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">I have had problems to generate a…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Disgraceful service</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Amazon has now lost its former trust</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">I ordered a box of a dozen Clif Bars</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">The choice is great</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Absolutely shocking customer service is…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Alexa</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Christmas Disaster</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Thank Goodness For Walmart Plus</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Adverts now shown on amazon prime</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">SCAM AND LIES</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">There are many Alternatives</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">I usually don't leave reviews but this…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">DISAPPOINTED IN AMAZON</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Prompt delivery by Royal Mail not…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Delivery attempted but I was at home…</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Ordered a book</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Just great!</h2>, <h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">It's bad enough packages never arrive…</h2>]
Since we use the .find_all() method, it extracts every heading on the page that has this class name. If you only want the first matching heading, you can use the soup.find() method instead.
soup.find(class_='typography_heading-s__f7029 typography_appearance-default__AAY17').get_text()
Output:
Absolutely shocking customer service is…
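By the way, since I mentioned CSS selectors earlier: BeautifulSoup also supports them through the .select() method, which is another way to grab the same headings. The class names below are the ones copied from Trustpilot at the time of writing and may change whenever the site updates its markup.

# Alternative: select the same headings with a CSS selector
headings = soup.select('h2.typography_heading-s__f7029.typography_appearance-default__AAY17')
for h in headings:
    print(h.get_text())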
Step7: Convert to String
As you can see in the above output, everything we extracted is still wrapped in HTML tags, but we need those headings as plain strings. We can easily convert those HTML elements to text using the .get_text() method of the BeautifulSoup package.
# Printing all headings in string or text format
for com_head in comment_headings:
    print(com_head.get_text())
Absolutely shocking customer service is…
SCAM AND LIES
Adverts now shown on amazon prime
Thank Goodness For Walmart Plus
There are many Alternatives
I usually don't leave reviews but this…
Delivery attempted but I was at home…
Prompt delivery by Royal Mail not…
Ordered a book
It's bad enough packages never arrive…
I can't rely on Amazon anymore.
Rip off Amazon
Dumped adverts on paying loyal customers
Amazon Customer Service Genuine…
What happened amazon?
Literally the worst shopping experience…
Where do I even begin
Phone Support Does Not Understand English
Never had issues with Amazon
Amazon requires an amendment to enable me to publish reviews once again.
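If you want to keep these headings for later analysis instead of just printing them, you can collect the text into a plain Python list in one line:

# Collect the review headings as plain strings
heading_texts = [h.get_text(strip=True) for h in comment_headings]
print(heading_texts[:5])   # first five headings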
How to Scrape Dynamic Websites in Python
So far in this tutorial, I showed you how to scrape a simple website using BeautifulSoup in Python. Websites like this are called static websites.
Dynamic websites, on the other hand, use JavaScript, AJAX, or other techniques to load content based on user interaction. For example, say you want to scrape data from a social media site like Facebook.
In this case, you only see the data as you scroll down. Along with that, you need to log in to the website, and to see comments you need to click the comment buttons.
These types of websites are called dynamic websites. You cannot scrape dynamic websites like Facebook using just BeautifulSoup and Python.
To scrape a dynamic website (for example, a social media site like Facebook) in Python, you can use the Selenium library, which is what I mostly use.
In my next tutorial, I will share how you can scrape dynamic websites using Selenium in Python. I will not cover that in this article.
That is it for this article. If you have any questions or suggestions, please let me know in the comment section below.