Learn how easy it is to web scrape in Python — add a superpower to your toolbox with BeautifulSoup.

A quick taste of what's possible: a small BeautifulSoup project that finds the lyrics to any song, for example a Lady Gaga track. The code for that small program is here.

1 — What is the requests library?

The requests library lets you send all kinds of HTTP requests. It is easy to use, with features ranging from passing parameters in URLs to sending custom headers and SSL verification.
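As a minimal sketch of what that looks like (the URL and header value below are placeholders, not part of the examples later in this article):

import requests

# fetch a page, passing a URL parameter and a custom header
response = requests.get(
    'https://example.com',
    params={'q': 'demo'},
    headers={'User-Agent': 'my-scraper'},
)
print(response.status_code)  # e.g. 200 on success
print(response.text[:200])   # first 200 characters of the raw HTML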

2 — What is BeautifulSoup?

BeautifulSoup parses HTML or XML documents and returns a parse tree that you can search and navigate.
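A tiny illustration of what "parse tree" means in practice (the HTML snippet is invented):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="greeting">Hello</p>', 'html.parser')
print(soup.p.string)    # Hello
print(soup.p['class'])  # ['greeting']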

This might raise more questions than it answers. A parser? XML? HTTP requests? Eh?

3 — What is a parser?

In simple terms, parsing is the process of turning data in one form into another form, usually one that is more usable for the task at hand.

Analogy

A sentence is made up of words. A parser works much like someone systematically breaking a sentence down into nouns, verbs, adjectives and so on.

4 — What is the difference between HTML and XML formats?

XML focuses on transferring and storing data, and is often used when writing web applications, whereas HTML focuses on presenting data. Both are MARKUP LANGUAGES, which define how a document looks through annotations, or tags. HTML has predefined tags; XML does not. This means you can create your own tags in XML, which are known as element types.
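For instance, BeautifulSoup can parse an XML snippet with made-up element types. A small sketch; note that the tags are invented for illustration and that parsing XML this way requires the lxml package to be installed:

from bs4 import BeautifulSoup

XML = '<order><item>Flat white</item><price>3.50</price></order>'
xml_soup = BeautifulSoup(XML, 'xml')  # the 'xml' parser needs lxml
print(xml_soup.find('price').string)  # 3.50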

Why use the requests and BeautifulSoup library together?

Both Python packages can be used to get the text from a static webpage. The requests library makes it easy to send (and, where needed, authenticate) an HTTP request without having to log in through a browser, while BeautifulSoup makes it easier to parse through the returned content. This is why the two packages are so often used together.

  • Using soup.prettify() for easily reading HTML (or XML) output (see the sketch below).
  • Easy ways to navigate the soup.
  • Extracting all URLs or text from a webpage:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
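For the first of those bullets, a quick sketch of what prettify() produces on a small snippet (the HTML here is invented for illustration):

demo_soup = BeautifulSoup('<html><body><p>Hi</p></body></html>', 'html.parser')
print(demo_soup.prettify())
# <html>
#  <body>
#   <p>
#    Hi
#   </p>
#  </body>
# </html>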

Conclusion

Requests is for sending HTTP requests, and you can print the response as text; BeautifulSoup is for parsing that text. Together they enable things like:

  • Get the prices of your competitors products.
  • Scrape review sites to see how your customers view you over time.
  • Scrape job sites to never miss a new job posting.
  • Automating mundane tasks: say your company wants to find the phone number of every person based on their email address, or to get a picture based on a list of requirements. This can be automated.
  • Creating a comparison of the best prices on different ecommerce sites.

Two quick examples of using the BeautifulSoup and requests libraries.

Example 1 — Simple example using an HTML string

!pip install beautifulsoup4
from bs4 import BeautifulSoup

SIMPLE_HTML = '''<html><head></head>
<body>
<h1>Title</h1>
<p class="subtitle">Lorem ipsum dolor sit amet. Consectetur edipiscim elit.</p>
<p>Here's another paragraph without a class</p>
<ul>
<li>Garry Barlow</li>
<li>Rolf Harris</li>
<li>Stewart Lee</li>
<li>Jose Mourinho</li>
</ul>
</body>
</html>
'''

Create a variable that instantiates BeautifulSoup.

In [4]:

simple_soup = BeautifulSoup(SIMPLE_HTML, 'html.parser')

Find specific content by tag within the HTML.

print(simple_soup.find('h1').string) #Title

Loop through multiple items

In [6]:

list_items = simple_soup.find_all('li')
list_content = [e.string for e in list_items]  # a list comprehension; a regular for loop works too
print(list_content)
['Garry Barlow', 'Rolf Harris', 'Stewart Lee', 'Jose Mourinho']

Use the class within the element tag to retrieve more specific content

In [7]:

simple_soup.find('p', {'class': 'subtitle'}).string
'Lorem ipsum dolor sit amet. Consectetur edipiscim elit.'

Create functions

Eventually you would want to wrap these in a class of your own (see the sketch after this example).

def find_title():
    print(simple_soup.find('h1').string)

def find_list_items():
    list_items = simple_soup.find_all('li')
    list_content = [e.string for e in list_items]
    print(list_content)

def find_paragraph():
    print(simple_soup.find('p', {'class': 'subtitle'}).string)

def find_other_paragraph():
    paragraphs = simple_soup.find_all('p')
    other_paragraph = [p for p in paragraphs if 'subtitle' not in p.attrs.get('class', [])]
    print(other_paragraph[0].string)

find_title()
find_list_items()
find_paragraph()
find_other_paragraph()

Out [8]:

Title
['Garry Barlow', 'Rolf Harris', 'Stewart Lee', 'Jose Mourinho']
Lorem ipsum dolor sit amet. Consectetur edipiscim elit.
Here's another paragraph without a class
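As mentioned above, you may eventually want a class of your own. A minimal sketch, reusing SIMPLE_HTML from this example; the class name and method names are just one possible design:

from bs4 import BeautifulSoup

class SoupReader:
    """Hypothetical wrapper around the lookups above."""
    def __init__(self, html):
        self.soup = BeautifulSoup(html, 'html.parser')

    def title(self):
        return self.soup.find('h1').string

    def list_items(self):
        return [e.string for e in self.soup.find_all('li')]

    def subtitle(self):
        return self.soup.find('p', {'class': 'subtitle'}).string

reader = SoupReader(SIMPLE_HTML)  # SIMPLE_HTML defined earlier
print(reader.title())       # Title
print(reader.list_items())  # ['Garry Barlow', 'Rolf Harris', 'Stewart Lee', 'Jose Mourinho']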

Example 2 — Using a static website

Let’s use this website as an example: https://uk.trustpilot.com/review/costacoffee.co.uk

from bs4 import BeautifulSoup
import requests
import pandas as pd
import pprint # this is to get nicer outputs
url = 'https://uk.trustpilot.com/review/costacoffee.co.uk'
result = requests.get(url) # make a request for the HTML content using the requests library
soup = BeautifulSoup(result.text, 'html.parser') # use the html parser
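One optional safeguard before parsing: check that the request actually succeeded. A small sketch; the User-Agent header is an assumption, since some sites refuse requests that don't look like they come from a browser:

result = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
result.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
soup = BeautifulSoup(result.text, 'html.parser')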

Get the class of the element on the page that we want to isolate

  • Remember to use Ctrl + Shift + I to open your browser's developer tools.
  • Click on the element you want and note the class identifier.
divTag = soup.find_all("div", {"class": "styles_rating__NPyeH"})
for tag in divTag:
    print(tag.text)
    rating_today = tag.text

Out[11]: 2.0
The data we're collecting from the page:

  • Rating
  • Reviews
  • Sentiment
  • We’ll also add the date using the datetime library.
divTag2 = soup.find_all("div", {"class": "styles_header__yrrqf"})  # get the class identifier
for tag in divTag2:  # loop through the content of this tag
    header = tag.find_all("span")[0]  # note that this is a list
    number_of_reviews = header.text
print(number_of_reviews)

Out[12]: 1,332
divTag3 = soup.find_all("div", {"class": "styles_container__z2XKR"})
sentiment = []
for tag in divTag3:
    paragraph = tag.find_all("p")
    for goods in paragraph:
        sentiment.append(goods.text)
print(sentiment)

['Excellent', '24%', 'Great', '4%', 'Average', '5%', 'Poor', '12%', 'Bad', '55%']

We now have the data, so let’s place it into a dataframe using Pandas and export it as a .csv

In [14]:

from datetime import date

# Get the date for today
today = date.today()
today_formated = today.strftime("%d/%m/%Y")

# Create a dictionary to then turn into a dataframe
# (named row rather than dict, so we don't shadow the built-in)
row = {'Date': today_formated, 'Link': url, 'Score': rating_today, 'Reviews': number_of_reviews}
# loop through the sentiment list and add each label/percentage pair to the dictionary
for i in range(0, 10, 2):
    row[sentiment[i]] = sentiment[i + 1].replace('<', '')  # strip '<' from values like '<1%'
pprint.pprint(row)
{'Average': '5%',
'Bad': '55%',
'Date': '27/06/2022',
'Excellent': '24%',
'Great': '4%',
'Link': 'https://uk.trustpilot.com/review/costacoffee.co.uk',
'Poor': '12%',
'Reviews': '1,332',
'Score': '2.0'}
# Create the dataframe (DataFrame.append was removed in pandas 2.0; build it from the dictionary instead)
df = pd.DataFrame([row])
# Rearrange the columns
cols = ['Date', 'Link', 'Reviews', 'Score', 'Excellent', 'Great', 'Average', 'Poor', 'Bad']
df = df[cols]
df.head()

Exporting to csv

In [21]:

df.to_csv('Simple_Scrape_Example.csv', index=False)
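If you re-run the scrape regularly (say, to track scores over time, as suggested earlier), one option is to append to the same file rather than overwrite it. A sketch, assuming df is the one-row dataframe from above:

import os

file_exists = os.path.isfile('Simple_Scrape_Example.csv')
# append new rows on each run, writing the header only the first time
df.to_csv('Simple_Scrape_Example.csv', mode='a', header=not file_exists, index=False)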

Create a function or class to use for different companies

In [22]:

def webscrape(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')

    divTag = soup.find_all("div", {"class": "styles_rating__NPyeH"})
    for tag in divTag:
        rating_today = tag.text

    divTag2 = soup.find_all("div", {"class": "styles_header__yrrqf"})  # get the class identifier
    for tag in divTag2:  # loop through the content of this tag
        header = tag.find_all("span")[0]  # note that this is a list
        number_of_reviews = header.text

    divTag3 = soup.find_all("div", {"class": "styles_container__z2XKR"})
    sentiment = []
    for tag in divTag3:
        paragraph = tag.find_all("p")
        for goods in paragraph:
            sentiment.append(goods.text)

    return rating_today, number_of_reviews, sentiment

webscrape('https://uk.trustpilot.com/review/cafenero.co.uk')

('2.8', '140', ['Excellent', '33%', 'Great', '11%', 'Average', '4%', 'Poor', '10%', 'Bad', '42%'])
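From here, comparing several companies is one loop away. A sketch using the two pages from this article; any Trustpilot review page with the same class names should work:

urls = ['https://uk.trustpilot.com/review/costacoffee.co.uk',
        'https://uk.trustpilot.com/review/cafenero.co.uk']
rows = []
for u in urls:
    rating, reviews, sentiment = webscrape(u)
    row = {'Link': u, 'Score': rating, 'Reviews': reviews}
    row.update(dict(zip(sentiment[::2], sentiment[1::2])))  # pair each label with its percentage
    rows.append(row)
comparison = pd.DataFrame(rows)
print(comparison)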
