Speed up Pandas to_csv and read_csv by up to 170x with 60% Less Memory

In this article, we demonstrate using Feather files in place of the considerably slower CSV format. Switching to Feather can give noticeable speed-ups with hardly any extra work. To follow along with this post, make sure you have pandas installed. You will also need pyarrow, which pandas uses to read and write Feather files.

pip install pyarrow

Some Important Terms

PyArrow is a Python library that is part of the Apache Arrow project. Apache Arrow is a cross-language development platform for in-memory data, designed to improve the efficiency of data interchange between systems.

The key features of PyArrow include:

Efficient Data Structures: It provides efficient data structures (like Arrow tables and arrays) for in-memory computing, which can be used for data exchange between different Python libraries without copying data.

Interoperability: Apache Arrow is designed to support language-independent columnar memory formats for flat and hierarchical data, making it suitable for efficient data interchange among various systems and languages (Python, R, Java, C++, etc.)

Let's kick off with a simple example of writing to a CSV file, which we will borrow from the previous post.

import pandas as pd 
import numpy as np
import time 

df = pd.DataFrame()
df['col1'] = np.random.normal(size=1000000)
df['col2'] = np.random.normal(size=1000000)
df['col3'] = np.random.normal(size=1000000)

Let's not waste any time and get stuck right in, comparing how long it takes to write this dataframe to our current working directory in each format.

df.to_csv() method

%%timeit
df.to_csv('my_data.csv')


'''
13.2 s ± 867 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
'''

df.to_feather() method

%%timeit
df.to_feather('my_data.feather')

'''
58.8 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
'''

Right out of the box, that is a speed-up of more than 200x (it was 170x when I wrote the title, and I don't want to change it again). Either way, we get an incredible speed-up simply by switching our file type to Feather.
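The ratio can be checked directly from the two mean timings reported above. This is just arithmetic on those numbers, and the variable names are mine:

```python
# Mean timings reported by %%timeit above, converted to seconds.
csv_write_s = 13.2
feather_write_s = 58.8e-3

# Speed-up factor: how many times faster the Feather write was.
speedup = csv_write_s / feather_write_s
print(f"to_feather was about {speedup:.0f}x faster than to_csv")
```

Your own numbers will differ depending on your disk, your CPU and your pandas and pyarrow versions.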

pd.read_csv()

Now let's try loading the dataframe we just saved, following best practice by specifying the data type we expect for each column of the CSV file. This saves pandas from having to infer the types itself. Reading the CSV also turns out to be a lot quicker than writing it was.


# Specify the dtype for each column
dtypes = {
    'col1': 'float64',
    'col2': 'float64',
    'col3': 'float64'
}

%%timeit
csvLoad = pd.read_csv('my_data.csv', dtype=dtypes, index_col=False)

'''
1.47 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

'''
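One detail worth knowing about this round trip: to_csv() writes the dataframe index as an unnamed first column by default, so the dataframe read back has an extra "Unnamed: 0" column. Here is a minimal sketch of that behaviour, using an in-memory buffer instead of a file on disk and a tiny made-up dataframe. Passing index=False when writing avoids the extra column.

```python
import io

import pandas as pd

# Tiny illustrative dataframe (values are arbitrary).
small_df = pd.DataFrame({"col1": [0.1, 0.2], "col2": [0.3, 0.4]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
small_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index, index_col=False).columns)
print(cols_with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
small_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
print(cols_without_index)
```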

pd.read_feather()

%%timeit
featherLoad = pd.read_feather('my_data.feather')

'''
66.8 ms ± 7.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
'''

We can speed this up even more by passing dtype_backend='pyarrow' to read_feather(). This argument requires pandas 2.0 or later.

%%timeit
featherLoad = pd.read_feather('my_data.feather', 
                               dtype_backend='pyarrow')

'''
39 ms ± 5.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
'''

The dtype_backend argument means the returned dataframe uses pyarrow-backed data types rather than the default NumPy ones. I haven't had any problems using this method so far, but if you run into weird typing issues, try removing the argument to see whether that solves your problem.

featherLoad = pd.read_feather('my_data.feather', dtype_backend='pyarrow')
print(featherLoad.dtypes)

'''
col1    double[pyarrow]
col2    double[pyarrow]
col3    double[pyarrow]
dtype: object

'''

Our data has been correctly loaded as floating point numbers.

Another important point when using the read_feather() method is that it uses multiple threads by default. If you want to stop it doing this, you need to pass in the following:

featherLoad = pd.read_feather('my_data.feather',
                              use_threads=False, 
                              dtype_backend='pyarrow')

This doesn't seem to cause any noticeable decrease in performance on my computer.

Now let's look at how much space the files take up on disk.

df = pd.DataFrame()
df['col1'] = np.random.normal(size=10000000)
df['col2'] = np.random.normal(size=10000000)
df['col3'] = np.random.normal(size=10000000)

df.to_csv('my_data.csv')

df.to_feather('my_data.feather')

Using Feather makes the file over 60% smaller (229M versus 647M in the listing below). This means we can store more data!

$ ls -lh

-rwxrwxrwx 1 john john 647M Dec 30 07:38 my_data.csv
-rwxrwxrwx 1 john john 229M Dec 30 07:38 my_data.feather

What is the downside to using Feather?

Well, if we want to open our dataframe in something like Microsoft Excel and view our data in a nice interface, that isn't going to work. The same goes for VS Code: with a CSV file we can usually open the folder where the data is stored and look at the last few rows to see whether our data has updated, but we can't do this with Feather files. On the other hand, if the dataframe is of significant size, it is unlikely we could open it in Excel anyway.

A potential fix, for those of us who like to view data in a nice interface, is to write a portion of the data out to CSV and take a look whenever we need to.
