Resampling Pandas Dataframes

by John | December 26, 2020

 

 

Often when doing data analysis it becomes necessary to change the frequency of data. Pandas provides a relatively simple way to do this. 

Take the following example of a business that has daily sales and expenses data for 20 years. Create the example dataframe as follows:

 

import pandas as pd
import numpy as np

df = pd.DataFrame()

df['date'] = pd.date_range(start=pd.datetime(2000,1,1),
              end=pd.datetime(2020,1,1),freq='D') 

df.set_index('date', inplace =True)
df['sales'] = np.random.choice([100,50,200,300,600],len(df))
df['expenses'] = df.sales * np.random.choice([0.4,0.5,0.33,0.66],len(df))

print(df)
out:
            sales  expenses
date                       
2000-01-01    300     150.0
2000-01-02    100      66.0
2000-01-03    200     100.0
2000-01-04    100      33.0
2000-01-05    600     300.0
          ...       ...
2019-12-28    200     100.0
2019-12-29    200     100.0
2019-12-30    300      99.0
2019-12-31    600     240.0
2020-01-01    200      66.0

[7306 rows x 2 columns]

 

The data is currently in daily increments, let's say we wanted to change it to weekly, monthly and annual frequencies. In the context of the example we are using, it seems like a good idea to take the sum of all sales and expenses during a week/month or year. 

 

Note If the following error appears:

 

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'

 

This means your date column is not in the correct format. To fix this use the following command:

 

df['your date column name'] = pd.to_datetime(df['your date column name'])
df.set_index('your date column name', inplace=True)

 

 

Weekly resample by sum

 

df2 = df.resample('W').sum()
print(df2)
out:
            sales  expenses
date                       
2000-01-02    400     216.0
2000-01-09   1500     664.0
2000-01-16   1900     891.5
2000-01-23   1950     983.0
2000-01-30   1950     898.0
          ...       ...
2019-12-08   1400     677.5
2019-12-15   2050     792.0
2019-12-22   1250     740.0
2019-12-29   2150    1281.0
2020-01-05   1100     405.0

[1045 rows x 2 columns]

 

Monthly resample by sum

 

#monthly resample
df3 = df.resample('M').sum()
print(df3)
Out:
            sales  expenses
date                       
2000-01-31   8300    3952.5
2000-02-29   7650    3586.5
2000-03-31   7850    3951.5
2000-04-30   7200    3334.5
2000-05-31   7500    3648.0
          ...       ...
2019-09-30   6650    3254.5
2019-10-31   8650    4203.5
2019-11-30   7900    3879.5
2019-12-31   7850    3862.5
2020-01-31    200      66.0

[241 rows x 2 columns]

 

Annual resample by sum

 

#annual resample
df4 = df.resample('A').sum()
print(df4)
out: 
          sales  expenses
date                       
2000-12-31  87150   40936.0
2001-12-31  93200   44804.0
2002-12-31  95150   46329.5
2003-12-31  92450   43830.5
2004-12-31  89050   41126.0
2005-12-31  91400   42414.5
2006-12-31  93900   44679.5
2007-12-31  94500   45281.0
2008-12-31  98650   47353.5
2009-12-31  88700   41289.5
2010-12-31  95600   43527.0
2011-12-31  99200   46249.5
2012-12-31  95250   44568.0
2013-12-31  90450   42426.0
2014-12-31  91200   42306.5
2015-12-31  93850   43962.0
2016-12-31  91000   42780.0
2017-12-31  93850   42945.5
2018-12-31  88300   40470.0
2019-12-31  93400   44987.5
2020-12-31    200      66.0

 

Notice here that since we only have one value for the year beginning on 2020, the value is much smaller. It is possible to remove this small value with the following command:

 

df4 = df4[:-1] #removes the last entry

 

 

Resampling data by different methods

In the previous example we only used summuation as the resampling method. However, often it is necessary to resample different columns by different methods.

Keeping the example from the previous section, but adding another column called expense ratio =  \(\frac{expenses}{sales}\) to which could possibly be useful to determine how high expenses are over time relative to sales. 

 

df = pd.DataFrame()

df['date'] = pd.date_range(start=pd.datetime(2000,1,1),
              end=pd.datetime(2020,1,1),freq='D') 

df.set_index('date', inplace =True)
df['sales'] = np.random.choice([100,50,200,300,600],len(df))
df['expenses'] = df.sales * np.random.choice([0.4,0.5,0.33,0.66],len(df)) 
df['expense_ratio'] = df.expenses/df.sales
print(df) 
Out: 
           sales  expenses  expense_ratio
date                                      
2000-01-01     50      20.0           0.40
2000-01-02    200     132.0           0.66
2000-01-03    100      50.0           0.50
2000-01-04    300     150.0           0.50
2000-01-05    600     240.0           0.40
          ...       ...            ...
2019-12-28    200      66.0           0.33
2019-12-29    600     240.0           0.40
2019-12-30    300     120.0           0.40
2019-12-31     50      25.0           0.50
2020-01-01    300      99.0           0.33

[7306 rows x 3 columns]

 

Let's say we wanted to resample on a weekly basis by taking the sum of both sales and expenses, but taking the average of the expense ratio. In order to do this we can pass in a dictionary to to Pandas .agg method

 

df2 = df.resample('W').agg({'sales':'sum', 'expenses':'sum', 'expense_ratio': 'mean'})
print(df2)
out:
            sales  expenses  expense_ratio
date                                      
2000-01-02    500     264.0       0.495000
2000-01-09   1350     449.0       0.340000
2000-01-16   1900     921.0       0.484286
2000-01-23   1250     522.0       0.421429
2000-01-30    950     426.0       0.455714
          ...       ...            ...
2019-12-08   2450    1179.0       0.505714
2019-12-15   1800     925.0       0.482857
2019-12-22   1600     835.5       0.552857
2019-12-29   1600     644.5       0.408571
2020-01-05    700     454.0       0.606667

[1045 rows x 3 columns]

 

 


Join the discussion

Share this post with your friends!