I have worked with data for years, and I am very familiar with SQL. When I started exploring data science I found a number of my SQL skills translated well into data science. Pandas tends to be the most popular python library for manipulating data for those familiar with SQL, so I figured I would write a series of posts on how to use Pandas written from a SQL perspective!
In this post we will focus on how to create a DataFrame this is the equivalent of creating a table in a database.
Also in this series:
- Creating a DataFrame (CREATE TABLE)
- Populating a DataFrame (INSERT)
- How to load a CSV file into a Pandas DataFrame (BULK INSERT)
- Handling Nulls read from CSV
- Join statements
- Select all rows and columns (SELECT * FROM table)
- Select multiple columns (SELECT col1,col2 FROM table)
Pre-requisites
You need to import the python pandas and numpy libraries to use any of the code examples below:
import pandas as pd
import numpy as np
CREATE TABLE
If you want to query data in a database, you need to create a table. If you want to query data in Pandas, you need to create a DataFrame.
If I want to create a database table to hold information about hockey players I would use the CREATE TABLE statement:
CREATE TABLE players (
first_name VARCHAR(30),
last_name VARCHAR(30),
team VARCHAR(3),
position VARCHAR(2),
jersey_number INT,
salary DECIMAL,
birthdate DATETIME)
Create Pandas DataFrame
To create a DataFrame to hold the same information in pandas, all I need to do is define the column names and create a DataFrame using the column name:
column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
players = pd.DataFrame(columns=column_names)
Creates an empty dataframe:
Empty DataFrame
Columns: [FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate]
Specifying data types
Now you probably noticed I didn’t specify the datatypes for my columns when I created the dataframe. You can display the datatypes for your DataFrame using dtypes:
players.dtypes
which returns:
FirstName object
LastName object
Team object
Position object
JerseyNumber object
Salary object
Birthdate object
Any column with a datatype of string will be listed as object because in Python a string is actually a sequence of characters which does allow you some neat slicing to retrieve substrings.
You can modify the datatypes of the columns after the DataFrame is created using astype:
players = players.astype({'FirstName':str,
'LastName':str,
'Team':str,
'Position':str,
'JerseyNumber':int,
'Salary':float,
'Birthdate':np.datetime64})
Now I have the datatypes:
FirstName object
LastName object
Team object
Position object
JerseyNumber int32
Salary float64
Birthdate datetime64[ns]
Wondering what datatypes are available? Chris Moffitt wrote a nice post summarizing Pandas data types. You can also refer to the official Pandas documentation on dtypes.
If you are wondering, why you can’t specify datatypes for each column when a DataFrame is created, that’s because unlike when you work with database tables, you usually create DataFrames from a dataset and the datatype is inferred from the data. The DataFrame constructor does accept a datatype argument, but you can only use it to specify a datatype to use for all columns in the DataFrame, you cannot specify different datatypes for each column.
Converting all columns to a single datatype
If all your columns will be the same dataype, you can use astype to convert all columns to a new datatype:
column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
players = pd.DataFrame(columns=column_names)
players = players.astype(int)
would give me a DataFrame with every column as an integer:
FirstName int32
LastName int32
Team int32
Position int32
JerseyNumber int32
Salary int32
Birthdate int32
Converting one column to a new datatype
You can also convert a single column in a DataFrame to a new datatype using to_datetime, to_numeric, to timedelta
NOTE: Timedelta is a datatype in python used to specify a gap between two dates and times.
The following code shows how I could use these functions to convert the columns that are not storing strings:
column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
players = pd.DataFrame(columns=column_names)
players['Birthdate']= pd.to_datetime(players['Birthdate'])
players['JerseyNumber']=pd.to_numeric(players['JerseyNumber'],downcast='integer')
players['Salary']=pd.to_numeric(players['Salary'],downcast='float')
Would give me a DataFrame with the data types:
FirstName object
LastName object
Team object
Position object
JerseyNumber int8
Salary float32
Birthdate datetime64[ns]
Note that to_numeric requires an additional parameter downcast to specify the type of numeric datatype required.
Coming next
In the Part 2 you learn the equivalent of the INSERT statement, and learn how to add data to your DataFrame.