Pandas for SQL Lovers: Handling Nulls read from CSV

In part 3 of the series I covered how to load a CSV file into a Pandas DataFrame. In this post I’ll focus on how to deal with NULL or missing values read from CSV files.

If you work with SQL, you have probably had your share of challenges caused by missing, blank, and NULL values. Let’s see how read_csv helps us manage these troublemakers when we populate a DataFrame from a CSV file.

Let’s start with the following data in a CSV file:

FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney,Crosby,NULL,C,87,8700000,1987-08-07
Carey,Price,Unknown,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,NA,22,,1980-09-26
Henrik,Sedin,VAN,N/A,33,,1980-09-26

I have deliberately provided a variety of values that can be construed as missing values:

  • Sidney Crosby: Team is NULL
  • Carey Price: Team is Unknown
  • Daniel Sedin: Position is NA and Salary is not provided
  • Henrik Sedin: Position is N/A and Salary is not provided

Leave NULL or missing values untouched

If you specify na_filter=False, read_csv will read in all values exactly as they are:


import pandas as pd
players = pd.read_csv('HockeyPlayersNulls.csv', na_filter=False)

returns:

[Image: dataset with missing and NULL values highlighted]
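
Since the screenshot may not be visible, here is a minimal sketch that checks the same behaviour in code. It assumes the sample data is pasted inline and read through io.StringIO rather than from HockeyPlayersNulls.csv, so the csv_text variable is only an illustration:

```python
import io
import pandas as pd

# Inline copy of the sample data, so the sketch runs without a file on disk.
csv_text = """FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney,Crosby,NULL,C,87,8700000,1987-08-07
Carey,Price,Unknown,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,NA,22,,1980-09-26
Henrik,Sedin,VAN,N/A,33,,1980-09-26"""

players = pd.read_csv(io.StringIO(csv_text), na_filter=False)

# Count the cells pandas considers missing; nothing was converted to NaN.
total_missing = players.isna().sum().sum()
print(total_missing)

# The marker strings survive as plain text, and a blank salary is an empty string.
print(players.loc[2, "Team"], players.loc[4, "Position"], repr(players.loc[4, "Salary"]))
```

Note that the blank salaries come back as empty strings here, which is usually not what you want for a numeric column.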

Replace default missing values with NaN

In Pandas, the equivalent of NULL is NaN. By default, read_csv will replace blanks, NULL, NA, and N/A with NaN:


players = pd.read_csv('HockeyPlayersNulls.csv')

returns:

[Image: loaded DataFrame with blanks and missing values replaced by NaN; the word Unknown was not recognized as null]

You can see that most of the ‘missing’ values in my CSV file are replaced by NaN, except the value ‘Unknown’, which was not recognized as a missing value.
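
If you think in SQL, isna() plays roughly the role of IS NULL. The sketch below counts the NaN values per column after a default load. As before, the inline csv_text and io.StringIO stand in for the real file and are assumptions for illustration only:

```python
import io
import pandas as pd

# Inline copy of the sample data, so the sketch runs without a file on disk.
csv_text = """FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney,Crosby,NULL,C,87,8700000,1987-08-07
Carey,Price,Unknown,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,NA,22,,1980-09-26
Henrik,Sedin,VAN,N/A,33,,1980-09-26"""

players = pd.read_csv(io.StringIO(csv_text))

# Number of NaN values in each column, similar to COUNT(*) ... WHERE col IS NULL.
missing_per_column = players.isna().sum()
print(missing_per_column)

# 'Unknown' is not one of the default markers, so it is still a regular string.
print(players.loc[3, "Team"])
```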

How to designate values as missing

If there are values in your data which are not recognized as missing, you can use the na_values parameter to specify values you want treated as missing:


players = pd.read_csv('HockeyPlayersNulls.csv',na_values=['Unknown'])

returns:

[Image: DataFrame populated with missing values replaced by NaN]
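
A quick way to confirm the effect is to count the NaN values in the Team column again. This sketch assumes the same inline sample data as a stand-in for the file. With keep_default_na left at its default, the values passed in na_values are added to the built-in markers rather than replacing them, so Position is still cleaned up as well:

```python
import io
import pandas as pd

# Inline copy of the sample data, so the sketch runs without a file on disk.
csv_text = """FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney,Crosby,NULL,C,87,8700000,1987-08-07
Carey,Price,Unknown,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,NA,22,,1980-09-26
Henrik,Sedin,VAN,N/A,33,,1980-09-26"""

players = pd.read_csv(io.StringIO(csv_text), na_values=["Unknown"])

# Both NULL (a default marker) and Unknown (our addition) are now NaN.
team_missing = players["Team"].isna().sum()
print(team_missing)

# The default markers NA and N/A are still recognized in Position.
position_missing = players["Position"].isna().sum()
print(position_missing)
```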

Handling blank lines

By default, if a blank line is encountered in the CSV file, it is skipped.

So if you have the following file:

FirstName,LastName,Team,Position,JerseyNumber,Salary
Joe,Pavelski,SJ,C,8,6000000
Connor,McDavid,EDM,C,97,925000

Sidney ,Crosby,PIT,C,87,8700000
Carey,Price,MTL,G,31,10500000
Daniel,Sedin,VAN,LW,22,
Henrik,Sedin,VAN,C,33,

You get the following DataFrame:

[Image: DataFrame populated with all rows; there is no row where the blank line in the file appeared]

If you want the blank line to appear, you can specify skip_blank_lines=False:

players = pd.read_csv('HockeyPlayersBlankLines.csv', skip_blank_lines=False)

returns:

[Image: DataFrame containing a row whose values all show as NaN]
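
To compare the two settings side by side, here is a minimal sketch. It assumes the file contents are pasted inline and read through io.StringIO instead of HockeyPlayersBlankLines.csv, and the variable names skipped and kept are just for illustration:

```python
import io
import pandas as pd

# Inline copy of the sample data, including the blank line after Connor McDavid.
csv_text = """FirstName,LastName,Team,Position,JerseyNumber,Salary
Joe,Pavelski,SJ,C,8,6000000
Connor,McDavid,EDM,C,97,925000

Sidney,Crosby,PIT,C,87,8700000
Carey,Price,MTL,G,31,10500000
Daniel,Sedin,VAN,LW,22,
Henrik,Sedin,VAN,C,33,"""

# Default behaviour: the blank line is dropped.
skipped = pd.read_csv(io.StringIO(csv_text))

# skip_blank_lines=False: the blank line becomes a row of NaN values.
kept = pd.read_csv(io.StringIO(csv_text), skip_blank_lines=False)

print(len(skipped), len(kept))

# The third row (position 2) is where the blank line was in the file.
print(kept.iloc[2].isna().all())
```

If you keep the blank rows, remember that a later dropna(how='all') will remove any row whose values are all NaN.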

Next up

We still need to look at how to control data types and how to deal with dates when using read_csv to populate a DataFrame. Stay tuned!