Reading Html Data with Pandas

by John | December 26, 2020

The best way to show how Pandas can read html data from the internet is through a brief example and a demonstration of why it may be useful. Admittedly, Pandas may not be the best tool for this type of task in comparison with other packages such as BeautifulSoup. However, this article will show how to download tabular data from the internet and show it in a nice plot. 

 

To fully following along with this article you should install ploty by typing the following command:

 

pip install plotly_express==0.4.0

 

Ok, now on to downloading the first dataset. We will use statistics from Wikipedia found here. This can take a bit of trial and error using Pandas read_html method depending on the url. Import Pandas and type the following into your script.

 

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_traffic-related_death_rate'
df = pd.read_html(url)[1]

df.head()

 

That looks like a bit of a mess right? Let's clean it up a bit. 

 

#delete columns occuring after deaths per capita 
df = df.iloc[:,0:3]

#rename the deaths per capita to something easier to work with
df.columns = ['Country','Continent','deaths_per_capita']

# remove rows that aren't countries
df = df[pd.notna(df.Continent)]

##change the deaths column to numeric
df['deaths_per_capita'] = pd.to_numeric(df.deaths_per_capita,
                                              errors='coerce')

#inspect first 5 rows of data
df.head()

 

Executing the script above should return

 

        Country Continent deaths_per_capita
7   Afghanistan      Asia              15.5
8       Albania    Europe              13.6
9       Algeria    Africa              23.8
10      Andorra    Europe               7.6
11       Angola    Africa              26.9

 

Ok that looks much better! 

Since we want to make a nice plot with the data, something similar to what the Ploty docs show in their examples,  we will download the country codes from worldatlas.com as this seems to be the way Ploty identifies countries on their choropleth maps. 

 

The code is as follows:

 

url2 = 'https://www.worldatlas.com/aatlas/ctycodes.htm'

##notice this is [0] as opposed to [1] in previous example
df2 = pd.read_html(url2)[0]

df2.columns = ['Country', 'A2', 'A3', 'Num', 'Dialing_code']

df2.head()

 

The reason we use [0] here instead of [1] is due to the fact that there are often multiple tables on any given url and it takes some trial and error to get the desired data in Python. 

 

The column of interest here is 'A3' so we are going to merge the datasets together on the 'Country' column. The reason that Russia had to be manually set is due to worldatlas.com listing Russia as Russia Federation. 

 

df2.Country[181] = 'Russia'

###inner just takes the values present in both dfs
df3 = pd.merge(df,df2,how='inner',on='Country')

df3.head()

 

So let's get to the best part! 

 

from plotly.offline import plot
import plotly.graph_objs as go


fig = go.Figure(data=go.Choropleth(
    locations = df3['A3'],
    z = df3['deaths_per_capita'],
    text = df3['Country'],
    colorscale = 'hot',
    autocolorscale=False,
    reversescale=True,
    marker_line_color='darkgray',
    marker_line_width=0.5,
    colorbar_title = 'Deaths<br>Per 100,000',  
))

fig.update_layout(
        title_text='Road Deaths Per Capita',
        title_x=0.50)


plot(fig)

 

If all goes as planned after executing the script above, your browser should automatically open a window with an interactive graph that looks like this:

 

road deaths per capita

 

The full code for this article is as follows:

 

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_traffic-related_death_rate'
df = pd.read_html(url)[1]

###delete columns occuring after deaths per capita 
df = df.iloc[:,0:3]

#rename the deaths per capita to something easier to work with
df.columns = ['Country','Continent','deaths_per_capita']

df = df[pd.notna(df.Continent)]

df['deaths_per_capita'] = pd.to_numeric(df.deaths_per_capita,
                                              errors='coerce')


url2 = 'https://www.worldatlas.com/aatlas/ctycodes.htm'

df2 = pd.read_html(url2)[0]

df2.columns = ['Country', 'A2', 'A3', 'Num','Dialing_code']

df2.Country[181] = 'Russia'

df3 = pd.merge(df,df2,how='inner',on='Country')


from plotly.offline import plot
import plotly.graph_objs as go



fig = go.Figure(data=go.Choropleth(
    locations = df3['A3'],
    z = df3['deaths_per_capita'],
    text = df3['Country'],
    colorscale = 'hot',
    autocolorscale=False,
    reversescale=True,
    marker_line_color='darkgray',
    marker_line_width=0.5,
    colorbar_title = 'Deaths<br>Per 100,000',  
))

fig.update_layout(
        title_text='Road Deaths Per Capita',
        title_x=0.50)


plot(fig)

 

Read Next Article


Join the discussion

Share this post with your friends!