Reading CSV Files with Pandas

by John | December 26, 2020

Getting data into Pandas is the first step in beginning your analysis. This article will show how to load csv files and Excel spreadsheets into Python. I will also show some common errors in loading data into the Python, and provide a run-through of simple solutions that can get you up and running.

 

CSV Files

 

Csv stands for comma-separated values. You can create a csv file to follow along with this article if you don’t have one already.

Paste the following values into your notepad:

 

Date,Sales,Expenses,Location
01/01/2016,1052,673.28,South America
02/01/2016,599,383.36,North America
03/01/2016,1133,725.12,Asia
04/01/2016,810,518.4,Asia
05/01/2016,1133,725.12,North America
06/01/2016,1216,778.24,North America
07/01/2016,1238,792.32,Asia
08/01/2016,1088,696.32,Europe
09/01/2016,1474,943.36,South America
10/01/2016,1307,836.48,North America

 

Now press File > Save As > example.csv make sure to add the .csv extension to the file. Perhaps saving to your Desktop will be easiest if you are new to Python.

 

Now open your Python editor and write the following commands:

 

import pandas as pd

df = pd.read_csv('example.csv') 

df.head()

 

You should see the following output:

 

 

Date

Sales

Expenses

Location

0

01/01/2016

1052

673.28

South America

1

02/01/2016

599

383.36

North America

2

03/01/2016

1133

725.12

Asia

3

04/01/2016

810

518.4

Asia

4

05/01/2016

1133

725.12

North America

 

If you get the following error:

 

FileNotFoundError: [Errno 2] File example.csv does not exist: 'example.csv'

 

This means that your current working directory is different from where the file is located. In Jupyter a quick fix for this is just to click on the Jupyter logo and then navigate to your Desktop (or location the file is stored) and reopen the notebook.

 

Alternatively, you can use the following command:

df = pd.read_csv(r'Desktop\example.csv') 

 

Another example is if the data is located in a folder named  ‘data’ on your desktop:

df = pd.read_csv(r'Desktop\data\example.csv')

 

Sypder makes this particularly easy, all you need to do is click the folder icon on the top right of the screen > Desktop and press Select Folder and navigate to the folder the file is stored in.

 

What if your csv file is not separated by commas? 

For example if you open your csv file with notepad and it looks something like this:

 

Date;Sales;Expenses;Location
01/01/2016;1052;673.28;South America
02/01/2016;599;383.36;North America
03/01/2016;1133;725.12;Asia
04/01/2016;810;518.4;Asia
05/01/2016;1133;725.12;North America
06/01/2016;1216;778.24;North America
07/01/2016;1238;792.32;Asia
08/01/2016;1088;696.32;Europe
09/01/2016;1474;943.36;South America
10/01/2016;1307;836.48;North America

 

If you try to load the data in the way detailed previously you will likely get an error that looks something like this:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 10, saw 2

 

This can be very frustrating!! As the above error can be  confusing to understand. The issue here is the character which separates each data point known as the delimiter. Lucky the folks over at Pandas make fixing this error very straightforward, simply add the following to your script replacing 'semicolon_format' with the desired filename.

 

df = pd.read_csv('semicolon_format.csv',sep=';')

 

Let's take an extreme example to illustrate how the sep option works. Say, for example you have the misfortune of being given a data-set that looks something like this:

 

Date;Sales#Expenses;Location
01/01/2016:1052;673.28:South America
02/01/2016;599-383.36;North America
03/01/2016#1133;725.12:Asia
04/01/2016:810;518.4;Asia
05/01/2016#1133;725.12-North America
06/01/2016;1216;778.24;North America
07/01/2016#1238;792.32;Asia
08/01/2016;1088;696.32;Europe
09/01/2016;1474;943.36#South America
10/01/2016;1307;836.48;North America

 

In the example above there are multiple delimiters :  ;  -  # which looks like it will be difficult to work with. Again Pandas makes this very easy to work around. 

 

df = pd.read_csv('crazy_format.csv',
                         sep=";|:|#|-",
                         engine='python')

 

Notice here that alll you need to do is add each delimiter separated with a | and the data will load in as normal. The engine='python' line in the snippet above isn't strictly necessary, however it avoids a ParserWarning regarding C being unable to support regular expressions. 

 

More options 

Load certain columns

Continuing with the example.csv file from above. If you require only certain columns of the file to be loaded you can specify this in the read_csv method. For example if you want to load only columns 1, 2, & 4 you can use the following command:

 

df = pd.read_csv('example.csv',
                  usecols=[0,1,3],
                   engine='python')

df.head()

 

Returns the following dataframe:

 

  Date Sales Location
0 01/01/2016 1052 South America
1 02/01/2016 599 North America
2 03/01/2016 1133 Asia
3 04/01/2016 810 Asia
4 05/01/2016 1133 North America
5 06/01/2016 1216 North America

 

It may be worth repeating here, that Python/Pandas uses 0 indexing which can often cause confusion. The first column is 0, the second 1, third 2 and so on. 

 

If you had a very large dataset and you want to load in the first 50 columns you can pass the following into the usecols option:

 

cols = [col for col in range(50)]
df = pd.read_csv('your_filename.csv',
                  usecols=cols)

 

 

Set Index Column

Instead of getting the default range index when loading your dataframe into Python you can specify which column to use as the index:

df = pd.read_csv('example.csv',
                  index_col=0)

 

Returns the following dataframe:

 

  Sales Expenses Location
Date      
01/01/2016 1052 673.28 South America
02/01/2016 599 383.36 North America
03/01/2016 1133 725.12 Asia
04/01/2016 810 518.4 Asia
05/01/2016 1133 725.12 North America
06/01/2016 1216 778.24 North America

 

Parse Dates

The index_col option above is often very useful in conjunction with the parse_dates option to automatically parse the date from the csv file and set it as the index as follows:

 

df = pd.read_csv('example.csv',
                  index_col=0,
                  parse_dates=['Date'])

 

You can change the 'Date' column to whatever column your datetime data is named in your data-set.

That looks nice but notice how Pandas parses the dates as 2016-01-01 which corresponds  to year-month-day, what if the dates represented in our example 01/01/2016 corresponds to day/month/year as opposed to the way Pandas is currently interpreting it? 

 

Well yet again Pandas makes this relatively straightforward to deal with. We can use a custom date_parser method as shown below:

 

def custom_dateparser(raw_dates):
    return pd.to_datetime(raw_dates,format='%d/%m/%Y')

df = pd.read_csv('example.csv',
                  index_col=0,
                  parse_dates=['Date'],
                  date_parser=custom_dateparser)

 

Which returns the following dataframe.

 

             Sales  Expenses       Location
Date                                      
2016-01-01   1052    673.28  South America
2016-01-02    599    383.36  North America
2016-01-03   1133    725.12           Asia
2016-01-04    810    518.40           Asia
2016-01-05   1133    725.12  North America

 

For a full list of options for the read_csv method take a look at the documentation here 

 

Read Next Article

 


Join the discussion

Share this post with your friends!