In part 3 of the series I covered how to load a CSV file into a Pandas DataFrame. In this post I’ll focus on how to deal with NULL or missing values read from CSV files.
Also in this series:
- Creating a DataFrame (CREATE TABLE)
- Populating a DataFrame (INSERT)
- How to load a CSV file into a Pandas DataFrame (BULK INSERT)
- Handling Nulls read from CSV
- Join statements
- Select all rows and columns (SELECT * FROM table)
- Select multiple columns (SELECT col1,col2 FROM table)
If you work with SQL you have probably had your share of challenges caused by missing/blank and NULL values. Let’s see how read_csv helps us manage these troublemakers when we populate a DataFrame from a csv file.
Let’s start with the following data in a CSV file:
FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney,Crosby,NULL,C,87,8700000,1987-08-07
Carey,Price,Unknown,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,NA,22,,1980-09-26
Henrik,Sedin,VAN,N/A,33,,1980-09-26
I have deliberately provided a variety of values that can be construed as missing values
- Sidney Crosby: Team is NULL
- Carey Price” Team is Unknown
- Daniel Sedin: Position is NA and salary is not provided
- Henrik Sedin: Position is N/A and salary is not provided
Leave NULL or missing values untouched
If you specify na_filter=false then read_csv will read in all values exactly as they are:
players = pd.read_csv('HockeyPlayersNulls.csv',na_filter=False)
returns:
Replace default missing values with NaN
In Pandas, the equivalent of NULL is NaN. By default, read_csv will replace blanks, NULL, NA, and N/A with NaN:
players = pd.read_csv('HockeyPlayersNulls.csv')
returns:
You can see that most of the ‘missing’ values in my csv files are replaced by NaN, except the value ‘Unknown’ which was not recognized as a missing value.
How to designate values as missing
If there are values in your data which are not recognized as missing, you can use the na_values parameter to specify values you want treated as missing:
players = pd.read_csv('HockeyPlayersNulls.csv',na_values=['Unknown'])
returns:
Handling blank lines
By default if a blank line is encountered in the CSV file, it is skipped.
So if you have the following file:
FirstName,LastName,Team,Position,JerseyNumber,Salary
Joe,Pavelski,SJ,C,8,6000000
Connor,McDavid,EDM,C,97,925000
Sidney ,Crosby,PIT,C,87,8700000
Carey,Price,MTL,G,31,10500000
Daniel,Sedin,VAN,LW,22,
Henrik,Sedin,VAN,C,33,
You get the following DataFrame:
If you want the blank line to appear you can specify skip_blank_lines=False
players = pd.read_csv('HockeyPlayersBlankLines.csv', skip_blank_lines=False)
Next up
We still need to look at how to control datatypes and how to deal with Dates when using read_csv to populate a DataFrame. Stay tuned!
Posted by Pandas for SQL lovers – JOIN statements | HockeyGeekGirl on September 24, 2019 at 2:15 PM
[…] Handling Nulls […]
Posted by Pandas for SQL Lovers – SELECT * FROM table | HockeyGeekGirl on January 7, 2020 at 2:13 PM
[…] Part 4 Handling Nulls read from CSV […]
Posted by Pandas for SQL Lovers – SELECT col1,col2 FROM Table | HockeyGeekGirl on January 13, 2020 at 4:46 PM
[…] Handling Nulls read from CSV […]
Posted by Pandas for SQL lovers Reading a CSV file / BULK INSERT | HockeyGeekGirl on January 13, 2020 at 4:54 PM
[…] Handling Nulls read from CSV […]
Posted by Pandas for SQL Lovers INSERT / Populating a DataFrame | HockeyGeekGirl on January 13, 2020 at 4:54 PM
[…] Handling Nulls read from CSV […]
Posted by Python Pandas for SQL fans: Creating DataFrames | HockeyGeekGirl on January 13, 2020 at 4:56 PM
[…] Handling Nulls read from CSV […]