Pandas Groupby Method

by Anonymous | December 26, 2020

 

In this article we will discuss the groupby method in Pandas. The best way to understand it is through examples, so this article works through a number of real-life datasets where the method proves useful. The examples used:

  • World Airports Data
  • Temperature Data
  • Premier League Football Data

 

 

 

Example 1: Global Airports

The website openflights.org kindly provides a number of open-source datasets related to global air travel. A short description of the dataset, taken from their website:

"As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe"

 

[Map of airports, train stations and ferry terminals in the OpenFlights database]
Source: https://openflights.org/

 

Each of the red dots above corresponds to a transport terminal; for the purposes of this article we will only be using airports.

We can download the dataset directly into Python from GitHub.

 

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
# the raw file has no header row, so pass the column names directly
df = pd.read_csv(url, header=None,
                 names=['Airport_ID', 'name', 'city', 'country', 'IATA',
                        'ICAO', 'lat', 'long', 'alt', 'timezone', 'DST',
                        'db_time', 'type', 'source'])

 

Let's take a look at some of the columns of interest.

 

df[['country','city', 'name']].head()
            country          city                                         name
0  Papua New Guinea        Madang                               Madang Airport
1  Papua New Guinea   Mount Hagen                 Mount Hagen Kagamuga Airport
2  Papua New Guinea        Nadzab                               Nadzab Airport
3  Papua New Guinea  Port Moresby  Port Moresby Jacksons International Airport
4  Papua New Guinea         Wewak                  Wewak International Airport

 

Say we want to answer two questions:

1) Which country has the most airports?

2) Which city has the most airports? 

 

Question 1

We will determine the answer to these questions using pandas groupby. Essentially we want to count the number of airports in each country and select the top 10, plot their values and inspect the numbers. 

 

Here we group by the 'country' column, as it is our column of interest. The .size() method counts the number of rows (occurrences) in each group.

 

countries = df.groupby('country').size().sort_values(ascending=False)
# plot bar chart
countries.head(10).plot.barh()
# print raw numbers
print(countries.head(10))

 

 

[Bar chart: top 10 countries by number of airports]

out:
country
United States     1512
Canada             430
Australia          334
Brazil             264
Russia             264
Germany            249
China              241
France             217
United Kingdom     167
India              148
dtype: int64
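
As an aside, counting rows per group and sorting them is such a common pattern that pandas offers a shortcut: calling value_counts() on a single column gives the same counts, already sorted in descending order. A minimal sketch:

# shortcut equivalent to groupby('country').size().sort_values(ascending=False)
print(df['country'].value_counts().head(10))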

 

Question 2

 

This time we are interested in the number of airports per city. Therefore we should pass in 'city' to the groupby function as follows:

 

cities = df.groupby('city').size().sort_values(ascending=False)

cities.head(10).plot.barh()
plt.xlabel('# Airports')
plt.title('Top 10 Cities by Number of Airports')
plt.tight_layout()
print(cities.head(10))

 

 

[Bar chart: top 10 cities by number of airports]

 

out:
city
London          9
Columbus        8
San Jose        7
Georgetown      7
Moscow          6
New York        6
Jacksonville    6
Greenville      6
Alexandria      6
Santiago        6
dtype: int64

 

So from the data it seems that the United States is the country with the most airports, and London is the city with the most airports. One caveat: because we grouped on the city name alone, identically named cities in different countries (for example the several Georgetowns and Alexandrias) are counted together.
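
A quick way to check this, sketched below, is to group by both 'country' and 'city' so that identically named cities in different countries are kept separate:

# group on (country, city) pairs instead of the city name alone
cities_by_country = df.groupby(['country', 'city']).size().sort_values(ascending=False)
print(cities_by_country.head(10))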

 

 

Example 2: Weather Data

 

In the next example we will consider weather data from FiveThirtyEight. Thankfully, we can also download this directly into Python from GitHub by executing the following commands.

 

df = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KCLT.csv')

print(df.head(5))
print(df.columns)

 

out:
       date  actual_mean_temp  ...  average_precipitation  record_precipitation
0  2014-7-1                81  ...                   0.10                  5.91
1  2014-7-2                85  ...                   0.10                  1.53
2  2014-7-3                82  ...                   0.11                  2.50
3  2014-7-4                75  ...                   0.10                  2.63
4  2014-7-5                72  ...                   0.10                  1.65
out:
Index(['date', 'actual_mean_temp', 'actual_min_temp', 'actual_max_temp',
       'average_min_temp', 'average_max_temp', 'record_min_temp',
       'record_max_temp', 'record_min_temp_year', 'record_max_temp_year',
       'actual_precipitation', 'average_precipitation',
       'record_precipitation'],
      dtype='object')

 

The data appears to be in daily increments spanning roughly a year across 2014 and 2015. Let's answer two relatively simple questions using the data.

1) What was the hottest month of the year? 

2) Which months had the most rain?

 

Question 1

In order to answer question 1 there are a few steps we must take to get the data into the correct format for further analysis. First, convert the date column into a pandas datetime. Then create a month column: Jan = 1, Feb = 2, ..., Dec = 12.

 

df['date'] = pd.to_datetime(df.date) #convert to pd.datetime
df['month'] = df.date.dt.month # create month column

 

Then group the data by the month column we just created and print the mean temperature per month:

 

month = df.groupby('month')
print(month['actual_mean_temp'].mean())
out:
month
1     40.516129
2     37.714286
3     54.387097
4     62.466667
5     71.935484
6     80.333333
7     77.741935
8     76.580645
9     72.533333
10    63.225806
11    47.533333
12    45.967742
Name: actual_mean_temp, dtype: float64

Plotting the results above:

# plot the average temperature per month
df.groupby('month')['actual_mean_temp'].mean().plot.bar()
plt.ylabel('Temperature')
plt.title('Average Temperature by Month')
plt.tight_layout() 

 

 

[Bar chart: average temperature by month]

 

It looks like June had the highest average temperature for the period of our analysis.
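
Groupby is not limited to a single aggregation. As a minimal sketch, .agg() lets us compute several statistics per month in one call, using the same column as above:

# several aggregations per month in one call
print(df.groupby('month')['actual_mean_temp'].agg(['mean', 'min', 'max']))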

 

 

Question 2

Since we already have our grouped dataframe, we can simply change the column we are interested in. Notice that we can reference a column of a grouped dataframe with grouped['column of interest'], which can be very useful.

 

#printing average rainfall by month
print(month['average_precipitation'].mean())
out:
month
1     0.110000
2     0.118571
3     0.129355
4     0.101333
5     0.102581
6     0.124667
7     0.118710
8     0.136129
9     0.108333
10    0.109677
11    0.104667
12    0.104839

 

Plot the results:

 

month['average_precipitation'].mean().plot.bar()
plt.ylabel('Rainfall')
plt.title('Average Rainfall by Month')
plt.tight_layout() 

 

[Bar chart: average rainfall by month]

 

It looks like August had the highest average daily rainfall. Note that 'average_precipitation' refers to the long-term average for each date rather than the rain that actually fell during the period; 'actual_precipitation' records the latter.
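
If instead we wanted the rain that actually fell in each month of the period, a minimal sketch would sum the 'actual_precipitation' column using the same grouped object:

# total observed rainfall per month over the period
print(month['actual_precipitation'].sum())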

 

 

Example 3: Premier League Football

In the third and final example we will use a dataset from football-data.co.uk, who kindly aggregate a number of football datasets from various leagues. We will use the Premier League results from the 2019/2020 season. We can download the dataset directly from their website using Pandas as follows:

 

df = pd.read_csv('http://www.football-data.co.uk/mmz4281/1920/E0.csv')
print(df)
    Div        Date   Time        HomeTeam  ... MaxCAHH  MaxCAHA  AvgCAHH AvgCAHA
0    E0  09/08/2019  20:00       Liverpool  ...    1.99     2.07     1.90    1.99
1    E0  10/08/2019  12:30        West Ham  ...    2.07     1.98     1.97    1.92
2    E0  10/08/2019  15:00     Bournemouth  ...    2.00     1.96     1.96    1.92
3    E0  10/08/2019  15:00         Burnley  ...    1.90     2.07     1.86    2.02
4    E0  10/08/2019  15:00  Crystal Palace  ...    2.03     2.08     1.96    1.93
..   ..         ...    ...             ...  ...     ...      ...      ...     ...
375  E0  26/07/2020  16:00       Leicester  ...    1.94     2.05     1.86    2.02
376  E0  26/07/2020  16:00        Man City  ...    2.06     1.88     2.02    1.84
377  E0  26/07/2020  16:00       Newcastle  ...    2.03     2.00     1.95    1.92
378  E0  26/07/2020  16:00     Southampton  ...    2.03     1.96     1.98    1.89
379  E0  26/07/2020  16:00        West Ham  ...    1.99     2.00     1.93    1.95

[380 rows x 106 columns]

 

As you can see the dataset has a lot of columns. We are going to use this data to answer two questions:

1) Which team scored the most goals on average whilst playing at home?

2) Which team scored the most goals on average whilst playing away?

 

Question 1

To answer this question we first need to decide which columns are relevant to our problem. Obviously the 'HomeTeam' column is important, as it is what we are trying to analyze. There are two more important columns we will need for this analysis:

FTHG: This column represents Full Time Home Goals

FTAG: This column represents Full Time Away Goals 
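
A quick peek at just these columns (plus 'AwayTeam', which we will need for question 2) helps confirm the layout before grouping:

# preview only the columns relevant to this analysis
print(df[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG']].head())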

 

So we need to group by 'HomeTeam' and then extract the FTHG column described above.

 

hometeams = df.groupby('HomeTeam')

#extract the average full time home goals from grouped df

hometeams['FTHG'].mean()
Out:
HomeTeam
Arsenal             1.894737
Aston Villa         1.157895
Bournemouth         1.157895
Brighton            1.052632
Burnley             1.263158
Chelsea             1.578947
Crystal Palace      0.789474
Everton             1.263158
Leicester           1.842105
Liverpool           2.736842
Man City            3.000000
Man United          2.105263
Newcastle           1.052632
Norwich             1.000000
Sheffield United    1.263158
Southampton         1.105263
Tottenham           1.894737
Watford             1.157895
West Ham            1.578947
Wolves              1.421053
Name: FTHG, dtype: float64

 

The output above gives the average number of goals each team scored per game whilst playing at home. Another interesting fact we can extract from this grouped dataframe is the average number of goals each team conceded whilst playing at home.

 

hometeams['FTAG'].mean()
HomeTeam
Arsenal             1.263158
Aston Villa         1.578947
Bournemouth         1.578947
Brighton            1.421053
Burnley             1.210526
Chelsea             0.842105
Crystal Palace      1.052632
Everton             1.105263
Leicester           0.894737
Liverpool           0.842105
Man City            0.684211
Man United          0.894737
Newcastle           1.105263
Norwich             1.947368
Sheffield United    0.789474
Southampton         1.842105
Tottenham           0.894737
Watford             1.421053
West Ham            1.736842
Wolves              1.000000
Name: FTAG, dtype: float64

 

Obviously, the smaller the number here the better. However, it may not be obvious exactly what is going on. Since we are grouping by home team, the average number of away goals is the number of goals the opposition scored whilst the home team was playing on its own ground.
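
One way to combine the two numbers, sketched below, is the average goal difference per home game (goals scored at home minus goals conceded at home):

# average goal difference per home game for each team
home_goal_diff = hometeams['FTHG'].mean() - hometeams['FTAG'].mean()
print(home_goal_diff.sort_values(ascending=False).head())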

 

Let's visualize the results obtained above.

 

df.groupby('HomeTeam')[['FTHG','FTAG']].mean().plot.barh(figsize=(6,6))
plt.xlabel('Average Goals')
plt.title('Premier League 2019/2020')

[Bar chart: average goals scored (FTHG) and conceded (FTAG) per home team, Premier League 2019/2020]

 

As you would expect, the usual suspects do best in the plot above, and it looks like Man City has the best record. Of course, this doesn't necessarily translate into winning games, or the league for that matter; in fact Liverpool won the title that season.

 

 

Question 2

To answer this question we simply need to change the column we are grouping by as follows:

 

df.groupby('AwayTeam')[['FTHG','FTAG']].mean().plot.barh(figsize=(6,6))
plt.xlabel('Average Goals')
plt.title('Premier League 2019/2020')

[Bar chart: average goals scored (FTAG) and conceded (FTHG) per away team, Premier League 2019/2020]

 

 

Since FTAG now represents goals scored whilst playing away, higher numbers in that column are better (and lower FTHG, goals conceded away, is better). Again it looks like Man City do the best in this regard, while Norwich seem to be doing the worst. As an exercise you could convert these numbers into ratios and plot them again, as sketched below.
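
A minimal sketch of that exercise, assuming we define the ratio as goals scored per goal conceded whilst playing away:

away = df.groupby('AwayTeam')[['FTHG', 'FTAG']].mean()
# FTAG = goals scored away, FTHG = goals conceded away
ratio = (away['FTAG'] / away['FTHG']).sort_values()
ratio.plot.barh(figsize=(6, 6))
plt.xlabel('Away goals scored per goal conceded')
plt.title('Premier League 2019/2020')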

 

Bonus Question

In this question we will make use of the pandas groupby .get_group() method, which can often come in handy. Let's say, for example, that we want to extract the number of goals Liverpool scored per game whilst playing at home. First, set the index to be a pandas datetime so it is more interpretable. Then extract the number of goals Liverpool scored whilst playing at home.

 

# the Date column is in day/month/year format, so parse it with dayfirst=True
df['Date'] = pd.to_datetime(df.Date, dayfirst=True)
df.set_index('Date', inplace=True)

liverpool_HG = df.groupby('HomeTeam')['FTHG'].get_group('Liverpool') 
print(liverpool_HG)
Out:
Date
2019-09-08    4
2019-08-24    3
2019-09-14    3
2019-05-10    2
2019-10-27    2
2019-10-11    3
2019-11-30    2
2019-04-12    5
2019-12-14    2
2019-12-29    1
2020-02-01    2
2020-01-19    2
2020-01-02    4
2020-02-24    3
2020-07-03    2
2020-06-24    4
2020-05-07    2
2020-11-07    1
2020-07-22    5
Name: FTHG, dtype: int64

 

Visualizing the results

 

liverpool_HG.plot.bar()
plt.title('Liverpool Goals Whilst Playing at Home')
plt.ylabel('Goals')

[Bar chart: Liverpool goals scored whilst playing at home, 2019/2020]
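
For comparison, .get_group('Liverpool') selects the same rows as an ordinary boolean filter; the sketch below shows the equivalent selection, which can be handy when only a single group is needed:

# equivalent selection without groupby
liverpool_HG_alt = df.loc[df['HomeTeam'] == 'Liverpool', 'FTHG']
print(liverpool_HG_alt.equals(liverpool_HG))  # the two selections contain the same rows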

 

 

 

