Struggling to find datasets? Web scrape to make the internet your dataset.

Luke Veitch
4 min read · Jul 6, 2021

Web scraping is not the easiest thing to do. It comes with plenty of caveats and hoops you have to jump through. But if you’re tired of finding only ancient datasets, or none at all, then the power to scrape the web is immense. It should be in every data scientist’s toolset, because it lets you harness the perpetual flow of information the internet provides. If you are new to web scraping, this article introduces the concept, pulls together some valuable resources, and ends with a short Python tutorial on clicking an element on a web page.

Web scraping refers to the extraction of data from a website. The data is collected and processed into a format that is more useful to the user, be it a spreadsheet or an API. The process is automated using either a programming language or software tools.

It wasn’t until recently that I discovered that most major companies, institutions and governments scrape the web daily for data on competitors, market trends, social media sentiment analysis and many other domains.

Web scraping overview:

1. Identify your goal: What information do you require and which page(s) do you want the information from?

2. Identify the HTML tags you want (Ctrl+Shift+I opens your browser’s developer tools).

3. Parse the web page.

4. Clean your data.

5. Add it to your database or API.

Optional: Automate the script so that it runs again later to ensure you have the most up-to-date information. This can also be useful if you want to build up some time-series data.

Here are several programs and libraries that allow you to web scrape:

1. Python libraries:

Requests — Simple, easy to use, and intuitive if you know HTTP methods such as GET and POST. This is an essential Python library for web scraping, but it only fetches pages; it needs to be used in conjunction with a parsing library such as lxml or BeautifulSoup.

lxml — A high-performance, blazingly fast, production-quality HTML and XML parsing library.

BeautifulSoup — Perhaps the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents, and it automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

There is a limitation to all the Python libraries we have discussed so far: we cannot easily scrape data from dynamically populated websites. This happens because, sometimes, the data on the page is loaded through JavaScript. In simple words, if the page is not static, the Python libraries mentioned earlier struggle to scrape the data from it. Browser-automation tools such as Selenium, which the tutorial below uses, get around this by driving a real browser so the JavaScript runs before you scrape.

Scrapy — An extremely powerful framework, known as the BOSS of the web scraping libraries. Look into it at your discretion.

2. No prerequisites:

Free resources:

  • Parsehub — This is an intuitive and free tool that is quick to grasp. It has great tutorials and customer service, although it can be quite fiddly.

https://www.parsehub.com/

  • Octoparse — Parsehub’s main competitor and similar to it, but free only for a trial period. The interface is better, though, and I believe it has more features.

https://www.octoparse.com/

  • Grepsr — This is a Chrome extension that allows you to web scrape.

https://www.grepsr.com/

Paid (but inexpensive) resources:

Companies and corporations usually use these services.

Quick tutorial: how to web scrape in Python

You will need:
Chrome
Beautiful Soup
Selenium
Virtual Environment
pandas (optional)

This tutorial was written with Python 3.8.3.

  • Open your command prompt (to feel like a hardcore programmer) and create a virtual environment in Python:

>python -m venv env #create a virtual environment named "env"

>env\Scripts\activate #activate it on Windows
>source env/bin/activate #activate it on Mac

  • Now we need to install Selenium (plus Beautiful Soup and the optional pandas from the list above):

>pip install selenium beautifulsoup4 pandas

  • Now create a file called webscraper.py:

>echo. >webscraper.py #for Windows

>touch webscraper.py #for Mac

  • Open the file you have just created, either manually or from the command line:

>vim webscraper.py #now we can start writing the web scraping code

from selenium import webdriver

## then get the URL you want to scrape

url = "https://en.wikipedia.org/wiki/Dragons%27_Den_(British_TV_programme)"

## you need to download ChromeDriver for this to work (see the next step):

browser = webdriver.Chrome()  # assumes chromedriver is on your PATH or in your working directory
browser.get(url)

  • Then find the ChromeDriver release that matches your version of Chrome (check yours at chrome://version).
  • Next, provide the path to the ChromeDriver by dragging and dropping it into your working directory.
  • Now choose the element you wish to extract: open the developer tools with Ctrl+Shift+I, right-click the element in the HTML panel, and select Copy → Copy XPath.

Go back to your script and paste the XPath you copied into the command below.

browser.find_element_by_xpath('//*[@id="mw-content-text"]/div[1]/table[3]/tbody/tr[4]/td[1]/a').click()

The above script will click on the Peter Jones element in the table on the Dragons’ Den Wikipedia page at the given URL.
