A Practical Intro to Webscraping

Intro

Web scraping is a powerful tool for automated data collection, allowing us to extract information from websites programmatically. In Python, one of the most popular libraries for web scraping is BeautifulSoup. The package is simple to use and flexible in how it handles HTML and XML elements. This practical introduction will guide you through the essentials of web scraping with BeautifulSoup: we will walk through a simple yet effective web scraper that fetches temperature data from a table of global cities on timeanddate.com and stores the results in a CSV file.

Ensure you have Python installed on your computer. If not, you can download it from python.org. Additionally, you'll need to install BeautifulSoup alongside the requests library, plus pandas, which we use at the end to save the results. This can be done using pip, Python's package installer: simply run pip install beautifulsoup4 requests pandas in your command line interface.

It is recommended that you use a Jupyter notebook for this tutorial so you can run the cells one at a time and properly inspect the output. OK, now let's get stuck in and pull in the data.

Getting the Soup

import requests
from bs4 import BeautifulSoup
import pandas as pd


response = requests.get('https://www.timeanddate.com/weather/')

soup = BeautifulSoup(response.text, 'html.parser')
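
The two lines above assume the request succeeds. As a minimal sketch of a more defensive version, the helper below (fetch_soup is a hypothetical name, not part of requests or BeautifulSoup, and the 10-second timeout is an arbitrary choice) adds a timeout and an HTTP status check before parsing:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors."""
    # timeout stops the call from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup('https://www.timeanddate.com/weather/') would then stand in for the requests.get and BeautifulSoup lines above.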

Once we have all the elements pulled into Python, we have some thinking to do. The BeautifulSoup library isn't a magic wand; there is still a lot of work to do in order to get meaningful data out of the page. Upon inspecting the page, we see that the data of interest (city name, temperature) is organized in a table, which implies that each city and its temperature will be found within a <td></td> element representing table data.

Below we use the find_all() method to narrow down our search.

tds = soup.find_all('td')

print(f"there are {len(tds)} table data elements")


'''
there are 564 table data elements

'''

Extracting the Information

Looks like we are making some progress. Now let's print out the first 10 elements to see if we can find some sort of pattern.

for td in tds[:10]:
    print(td)


'''
<td><a href="/weather/ghana/accra">Accra</a><span class="wds" id="p0s"></span></td>
<td class="r" id="p0">Sat 05:47</td>
<td class="r"><img alt="Clear. Warm." height="40" src="//c.tadst.com/gfx/w/svg/wt-13.svg" title="Clear. Warm." width="40"/></td>
<td class="rbi">27 °C</td>
<td><a href="/weather/canada/edmonton">Edmonton</a><span class="wds" id="p47s"></span></td>
<td class="r" id="p47">Fri 22:47</td>
<td class="r"><img alt="Passing clouds. Cold." height="40" src="//c.tadst.com/gfx/w/svg/wt-14.svg" title="Passing clouds. Cold." width="40"/></td>
<td class="rbi">-4 °C</td>
<td><a href="/weather/india/new-delhi">New Delhi</a><span class="wds" id="p94s"></span></td>
<td class="r" id="p94">Sat 11:17</td>

'''
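
To get a feel for what each cell exposes, the sketch below probes a hand-copied, trimmed sample of the first four cells printed above rather than the live page. The names sample_html and sample_soup are illustrative only:

```python
from bs4 import BeautifulSoup

# Hand-copied, trimmed sample of the first four cells (not a live request)
sample_html = (
    "<table><tr>"
    '<td><a href="/weather/ghana/accra">Accra</a><span class="wds" id="p0s"></span></td>'
    '<td class="r" id="p0">Sat 05:47</td>'
    '<td class="r"><img alt="Clear. Warm." title="Clear. Warm."/></td>'
    '<td class="rbi">27&nbsp;°C</td>'
    "</tr></table>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

for cell in sample_soup.find_all("td"):
    has_link = cell.find("a") is not None  # True only for the city cell
    classes = cell.get("class", [])        # list of class names, [] if absent
    text = cell.get_text().strip()         # visible text inside the cell
    print(has_link, classes, repr(text))
```

Only the first cell contains a link, and only the last one carries the "rbi" class, which are the two signals the extraction script relies on.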

Now we are getting even closer: it appears we have found the necessary pattern. It is as follows:

- The presence of a link element <a></a> indicates we have found a new place. The place name sits between the <a> tags, as in <a>Accra</a>, <a>Edmonton</a> and <a>New Delhi</a> above.

- The third <td> after the one containing the <a> tag holds the temperature in degrees Celsius, in a <td> with the class "rbi".
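
One way to check this pattern before writing the full script is to pair each city cell with the next "rbi" cell using BeautifulSoup's sibling navigation. This is a sketch on a hand-written sample in the same shape as the cells above. It assumes the city and temperature cells share a parent <tr>, which the printed output does not confirm:

```python
from bs4 import BeautifulSoup

# Trimmed, hand-written sample: two cities in one row (assumed layout)
sample_html = (
    "<table><tr>"
    '<td><a href="/weather/ghana/accra">Accra</a></td>'
    '<td class="r">Sat 05:47</td>'
    '<td class="r"></td>'
    '<td class="rbi">27&nbsp;°C</td>'
    '<td><a href="/weather/canada/edmonton">Edmonton</a></td>'
    '<td class="r">Fri 22:47</td>'
    '<td class="r"></td>'
    '<td class="rbi">-4&nbsp;°C</td>'
    "</tr></table>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for city_cell in sample_soup.find_all("td"):
    link = city_cell.find("a")
    if link is None:
        continue  # not a city cell
    # first later sibling <td> whose class list contains "rbi"
    temp_cell = city_cell.find_next_sibling("td", class_="rbi")
    if temp_cell is not None:
        pairs.append((link.get_text(), temp_cell.get_text()))

print(pairs)
```

If the pairs line up with what the page shows, the pattern holds for this sample.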

So most of the hard work has now been completed; all we need to do is write a simple script to extract the data. First we create a small helper function to parse the temperature. Note that \xa0 represents a non-breaking space, used in HTML to ensure that the characters on either side stay on the same line.

def extract_temp_as_float(temp):
    return float(temp.split("\xa0")[0])

temps = []
current_city = None

for td in tds:
    # a cell containing a link marks the start of a new city
    if td.find('a'):
        current_city = td.get_text().strip()  # city name

    # a temperature cell ('rbi' class) that follows a city cell
    elif 'rbi' in td.get('class', []) and current_city:
        temp = td.get_text().strip()  # e.g. '27\xa0°C'
        # store the city with its temperature parsed to a float
        temps.append({'city': current_city, 'temp': extract_temp_as_float(temp)})
        current_city = None  # reset until the next city cell


print(temps[:3])

And the first three results are shown below. We will assume all the others are correct, and verify some other data points once we have the file in CSV format.

[{'city': 'Accra', 'temp': 27.0},
 {'city': 'Edmonton', 'temp': -4.0},
 {'city': 'New Delhi', 'temp': 13.0}]
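
The helper extract_temp_as_float assumes every temperature cell looks like '27\xa0°C'. As a hedged alternative, the sketch below uses a regular expression and returns None when no leading number is found. The name parse_temp and the 'N/A' input are hypothetical; whether the site ever shows such a value is an assumption:

```python
import re

# optional minus sign, digits, optional decimal part, at the start of the text
TEMP_PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)")


def parse_temp(text):
    """Return the leading number as a float, or None if there is none."""
    match = TEMP_PATTERN.match(text)
    return float(match.group(1)) if match else None


print(parse_temp("27\xa0°C"))  # 27.0
print(parse_temp("-4 °C"))     # -4.0
print(parse_temp("N/A"))       # None
```

With this variant, the loop could skip or log a row whose temperature comes back as None instead of stopping with a ValueError.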

Saving Data to File

It appears our script has worked. Now all we need to do is save the data as a CSV so we don't lose it.


df = pd.DataFrame(temps)


print(df.head())
print(df.tail())

"""
          city  temp
0        Accra  27.0
1     Edmonton  -4.0
2    New Delhi  13.0
3  Addis Ababa  15.0
4    Frankfurt   9.0
        city  temp
134   Zürich   8.0
135    Dubai  23.0
136  Nairobi  19.0
137   Dublin   5.0
138   Nassau  20.0
"""

And now we save it. The file will be written to your current working directory, and you can open it in Excel or Notepad.

df.to_csv('global_temps.csv', index=False)
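
If you would rather not depend on pandas just for the export, Python's built-in csv module can write the same list of dictionaries. This is a sketch of that alternative, not the approach used above. It writes a two-row stand-in for temps to a temporary folder, and the file name global_temps_sample.csv is made up for the example:

```python
import csv
import os
import tempfile

# Small stand-in for the temps list built earlier (values from the output above)
sample_temps = [
    {"city": "Accra", "temp": 27.0},
    {"city": "Zürich", "temp": 8.0},
]

# Write to a temporary folder so the example does not touch real files
out_path = os.path.join(tempfile.mkdtemp(), "global_temps_sample.csv")

# newline="" is recommended by the csv docs; utf-8 keeps "Zürich" intact
with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["city", "temp"])
    writer.writeheader()
    writer.writerows(sample_temps)

# Read the file back to confirm what was saved
with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

Note that csv.DictReader returns every value as a string, so the temperatures need float() again after reading.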

And that's it: you have created a simple web scraper using BeautifulSoup in Python!
