Handling Missing Data with Pandas

by John | December 26, 2020

The majority of data-sets in real life are likely to be messy and require cleaning before any analysis can be done. This often takes the form of missing values. The handling of missing values is often highly domain specific: in a survey, for example, the omission of a data-point (a question) can itself convey information. The same is true of credit application / default data-sets. Pandas has a number of useful built-in methods for dealing with missing data effectively. 

 

Types of Missing Data

  • NaN values: This stands for 'Not a Number', which naturally relates to numeric series in which data is missing. 

 

  • NaT values: This stands for 'Not a Time' and relates to missing values in a datetime series. 

 

  • None: Python's built-in None value can also show up in Pandas objects. 

 

  • '?' : This is an example of a symbol found in place of missing values, generally for categorical data. Any symbol could be used, but a number of data-sets on the popular UCI Machine Learning Repository use '?', which is why I use it as the example here. 

 

  • -99999: Sometimes you will find a large negative integer in place of missing values. This is similar to the '?' above, except it is used for numeric series. 

 
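Before any of the NaN-handling methods below can be used, symbolic placeholders like '?' or -99999 usually need converting to real NaN values first. A minimal sketch, using a made-up toy frame (the column names here are purely illustrative):

```python
import numpy as np
import pandas as pd

# a toy frame using the placeholder conventions described above
df = pd.DataFrame({
    'grade': ['A', '?', 'B', '?'],            # '?' marks missing categories
    'income': [52000, -99999, 48000, 61000],  # -99999 marks missing numbers
})

# replace both sentinels with genuine NaN in one pass
df = df.replace({'?': np.nan, -99999: np.nan})

print(df.isna().sum())  # grade: 2, income: 1
```

Once the sentinels are NaN, everything that follows in this article applies to them directly.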

This article will focus on NaN values:

 

Import necessary modules and take the series 'S' as an example:

 

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'


S = pd.Series([1, 2, 3, np.nan, 5, 6])

print(S)

 

Which returns:

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64

 

Which methods can we use to handle this? 

(If you want the new data to replace the old in memory, you can pass inplace=True to the methods listed below, though reassignment such as S = S.dropna() achieves the same thing and is generally preferred.)

 

1) Drop missing values

This is perhaps the simplest way of handling missing data. When working with dataframes, however, it can cause the loss of valuable data, since the default is to drop an entire row for a single missing value.

 

# S.dropna(inplace=True) to replace old series

S.dropna()
Out[143]: 
0    1.0
1    2.0
2    3.0
4    5.0
5    6.0
dtype: float64
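On a DataFrame, dropna() is more destructive than on a Series: by default it drops any row containing at least one NaN. The how and thresh parameters give finer control. A small sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0]})

# default: drop any row containing at least one NaN
print(len(df.dropna()))           # 1: only the fully complete last row survives

# how='all': drop only rows where every value is NaN
print(len(df.dropna(how='all')))  # 2: the middle row goes, the other two stay

# thresh=2: keep only rows with at least 2 non-NaN values
print(len(df.dropna(thresh=2)))   # 1
```

Passing axis=1 applies the same logic column-wise instead of row-wise.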

 

2) Fill missing values with a custom value

 

Say, for instance, you want to replace all NaN values with zeros:

 

S.fillna(0)
Out[144]: 
0    1.0
1    2.0
2    3.0
3    0.0
4    5.0
5    6.0
dtype: float64
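On a DataFrame, fillna() also accepts a dict mapping column names to fill values, so each column can get its own replacement. A quick sketch (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25.0, np.nan],
                   'city': ['Leeds', np.nan]})

# 0 for the numeric column, a label for the categorical one
filled = df.fillna({'age': 0, 'city': 'unknown'})
print(filled)
```

Columns not named in the dict are left untouched.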

 

 

3) Fill NaN values with the mean of the series

 

S.fillna(S.mean())

 

\(3.4 = \frac{1+2+3+ 5 +6}{5}\)

 

Out[145]: 
0    1.0
1    2.0
2    3.0
3    3.4 #mean 
4    5.0
5    6.0
dtype: float64
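The same idea extends to a whole DataFrame: df.fillna(df.mean()) fills each column with that column's own mean, because fillna aligns the Series of means on the column labels. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [10.0, 20.0, np.nan]})

# each column's NaN is filled with that column's mean
filled = df.fillna(df.mean())
print(filled)
# NaN in 'a' becomes 2.0 (mean of 1 and 3); NaN in 'b' becomes 15.0
```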

 

 

4) Back-fill and Forward-fill methods (bfill & ffill)

 

It can take a while to get used to which is which when using these methods! Forward-fill (ffill) propagates the last valid observation forward into the gap, while back-fill (bfill) fills the gap with the next valid observation:

 

 

S.fillna(method='ffill')  # equivalently: S.ffill()
0    1.0
1    2.0
2    3.0
3    3.0
4    5.0
5    6.0
dtype: float64

 

S.fillna(method='bfill')  # equivalently: S.bfill()
Out[150]: 
0    1.0
1    2.0
2    3.0
3    5.0
4    5.0
5    6.0
dtype: float64
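One caveat worth remembering: ffill cannot fill a NaN at the very start of a series (there is nothing earlier to carry forward), and bfill cannot fill one at the very end. A quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

print(s.ffill().tolist())  # [nan, 2.0, 2.0, 4.0, 4.0] - leading NaN remains
print(s.bfill().tolist())  # [2.0, 2.0, 4.0, 4.0, nan] - trailing NaN remains
```

Chaining the two (e.g. s.ffill().bfill()) is a common way to cover both ends.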

 

 

5) Interpolation 

Interpolation estimates a missing data-point from the known values around it; the default method is linear. This is particularly useful when working with time series data. 

 

S.interpolate()

\(\frac{t_{i-1}\ +\ t_{i+1}}{2}\ \ =\ \frac{3\ +5}{2}=4 \)

 

Out[151]: 
0    1.0
1    2.0
2    3.0
3    4.0 #interpolated value
4    5.0
5    6.0
dtype: float64
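By default interpolate() is linear over the positional index. With a DatetimeIndex you can pass method='time' so the estimate respects unevenly spaced timestamps. A sketch with made-up dates:

```python
import numpy as np
import pandas as pd

# three observations, but the gaps are 1 day and then 3 days
idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05'])
s = pd.Series([1.0, np.nan, 5.0], index=idx)

# 'time' weights by the actual gap between timestamps,
# giving 2.0 here (one quarter of the way from 1.0 to 5.0),
# where positional linear interpolation would give the midpoint 3.0
print(s.interpolate(method='time'))
```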

 

 

Example with a DataFrame

 

Take the dataframe below as an example:

 

# use this to get the same random numbers as used in this example
np.random.seed(100)

# generate an array of random integers 100 rows 5 columns
rand = np.random.randint(0, 100, size=(100,5))

# adding some NaN values for illustrative purposes
rand = np.where(rand > 90, np.nan, rand)

# create the data frame
columns = ['col1','col2','col3','col4','col5']

df = pd.DataFrame(rand, columns = columns)

df.head(10)
Out[154]: 
   col1  col2  col3  col4  col5
0   8.0  24.0  67.0  87.0  79.0
1  48.0  10.0   NaN  52.0   NaN
2  53.0  66.0   NaN  14.0  34.0
3  24.0  15.0  60.0  58.0  16.0
4   9.0   NaN  86.0   2.0  27.0
5   4.0  31.0   1.0  13.0  83.0
6   4.0   NaN  59.0  67.0   7.0
7  49.0  47.0  65.0  61.0  14.0
8  55.0  71.0  80.0   2.0   NaN
9  19.0   NaN  63.0  53.0  27.0

 

df.isna().sum() is a useful command to count the number of NaN values column-wise:

 

df.isna().sum()
Out[155]: 
col1     8
col2     9
col3    10
col4    10
col5    13
dtype: int64

 

This is relatively easy to interpret since there are only one hundred rows in the dataframe, but how could we make it easier for larger data-sets? We could calculate the number of NaN values as a percentage of the total number of rows in the dataframe. Let's say we are going to drop any columns with more than 10% NaN values. The code snippet below calculates the percentage of NaN values per column and plots them in a bar chart:

 

# calculate percentage of missing values per column
(df.isna().sum() / len(df)).plot.bar()

# set threshold
threshold = 0.10

# plot horizontal line at the threshold value
plt.axhline(threshold, color='red')
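Rather than reading the columns off the chart, the same threshold can be applied programmatically: df.isna().mean() gives the NaN fraction per column, and boolean indexing picks out the offenders. The sketch below rebuilds the same seeded frame so it is self-contained:

```python
import numpy as np
import pandas as pd

# rebuild the seeded example frame from above
np.random.seed(100)
rand = np.random.randint(0, 100, size=(100, 5))
rand = np.where(rand > 90, np.nan, rand)
df = pd.DataFrame(rand, columns=['col1', 'col2', 'col3', 'col4', 'col5'])

threshold = 0.10

# isna().mean() is the fraction of NaN values per column
to_drop = df.columns[df.isna().mean() > threshold]
print(list(to_drop))  # only col5 (13%) exceeds the 10% threshold here

df = df.drop(columns=to_drop)
```

Note that col3 and col4 sit exactly at 10% and so survive the strict `>` comparison.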

 

Which returns a bar chart of the NaN percentage per column, with a red horizontal line marking the 0.10 threshold. Only col5 rises above it.

 

 

 

 

Ok, so let's drop col5 and use a different method to fill in the NaN values for each of the remaining columns.

 

# delete columns with more than 10% missing
del df['col5']

# fill with mean for column 1
df.col1.fillna(df.col1.mean(), inplace=True)

# fill with median for column 2
df.col2.fillna(df.col2.median(), inplace=True)

# forward fill values for column 3 
df.col3.fillna(method='ffill',inplace=True)

# interpolate for column 4
df.col4.interpolate(method='linear', inplace=True)

df.head(10)
Out[159]: 
   col1  col2  col3  col4
0   8.0  24.0  67.0  87.0
1  48.0  10.0  67.0  52.0
2  53.0  66.0  67.0  14.0
3  24.0  15.0  60.0  58.0
4   9.0  44.0  86.0   2.0
5   4.0  31.0   1.0  13.0
6   4.0  44.0  59.0  67.0
7  49.0  47.0  65.0  61.0
8  55.0  71.0  80.0   2.0
9  19.0  44.0  63.0  53.0

 

Let's double check for missing values:

 

df.isna().sum()
Out[160]: 
col1    0
col2    0
col3    0
col4    0
dtype: int64

 

Cool, now we have a clean data-set. If you want to try the methods above on a real data-set, you can use the following code snippet to download the Titanic data from Vanderbilt University. Good luck!


# titanic dataset url from vanderbilt
url = 'http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv'

df = pd.read_csv(url)

df.head()
Out[162]: 
   pclass  survived  ...   body                        home.dest
0       1         1  ...    NaN                     St Louis, MO
1       1         1  ...    NaN  Montreal, PQ / Chesterville, ON
2       1         0  ...    NaN  Montreal, PQ / Chesterville, ON
3       1         0  ...  135.0  Montreal, PQ / Chesterville, ON
4       1         0  ...    NaN  Montreal, PQ / Chesterville, ON

[5 rows x 14 columns]

 
