Know SQL and trying to learn Python Pandas for data science? Many concepts are the same. In this series I explain Pandas using SQL as a reference point.
In this post I will explain how to populate a Pandas DataFrame.
Also in this series:
- Creating a DataFrame (CREATE TABLE)
- Populating a DataFrame (INSERT)
- How to load a CSV file into a Pandas DataFrame (BULK INSERT)
- Handling Nulls read from CSV
- Join statements
- Select all rows and columns (SELECT * FROM table)
- Select multiple columns (SELECT col1,col2 FROM table)
INSERT INTO
In SQL if you want to insert rows into a table you use the INSERT statement. You can insert a single row
INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11')
OR you can insert multiple rows with a single INSERT statement
INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'),
('Connor','McDavid','EDM','C','97,925000.00,'1997-01-13')
Populating a DataFrame when created
When you create your DataFrame, you can provide data and populate it immediately.
column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)
This creates the following DataFrame
Inferred Datatypes
The advantage to populating the DataFrame when it is created, is that Pandas will infer the datatypes based on the data. If I run the command:
players.dtypes
I can see the DataFrame assigned integer and float datatypes to JerseyNumber and Salary. All the other columns are strings (because strings are stored as a sequence the datatype displayed is object):
FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate object
Explicit Datatypes
If you want BirthDate to be a date, datatype you will need to convert it explicitly. The line of code below uses to_datetime to convert the Birthdate column to a datetime:
players['Birthdate']= pd.to_datetime(players['Birthdate'])
Now the Birthdate column stores the datatype datetime:
FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate datetime64[ns]
You can find more details on how to assign datatypes explicitly in the Part 1 of this series: how to create DataFrames.
Indexes in DataFrames
You may have noticed that I did not have any sort of ‘playerid’ value for the rows I inserted. But you can see a number beside each row. This column is called the index. Pandas will automatically create an index for each row in the DataFrame.
Setting your own column as index
If you want to use your own column for the index, you can use set_index. The example below creates a DataFrame with a PlayerId and then users set_index to make PlayerId the index column.
column_names = ['PlayerId',
'FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [[1,'Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
[2,'Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
[3, 'Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
[4, 'Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data,columns=column_names)
players.set_index(‘PlayerId’, inplace=True)
This produces a DataFrame with PlayerId as the Index column
Using non numeric columns as an index
You are not limited to numeric fields as indexes, you can use any field as your index:
players.set_index('LastName', inplace=True)
Duplicate values in index columns
Unlike an Primary Key in a database, the index on a DataFrame will allow duplicate values. If you decide to use LastName as your index column and you have the Henrik & Daniel Sedin in your DataFrame you will see duplicate indexes.
Adding rows to an existing DataFrame
If you want to add rows to a DataFrame after it is created use append. In the code below let’s recreate the populated DataFrame with the autogenerated index:
column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)
Now you can use append to add a row. A couple of things to be aware of:
- Using append creates a new DataFrame with the added row, if you want to append the row to your existing DataFrame you need to assign the result of the append to your original DataFrame.
- You must specify ignore_index = True when you are providing explicit values to use for the new row being appended
players = players.append({'FirstName':'Erik',
'LastName':'Karlsson',
'Team':'SJ',
'Position':'D',
'JerseyNumber':65,
'Salary':11500000.00,
'Birthdate':'1990-05-31'},
ignore_index=True)
Summary
Now you can add rows to your DataFrame. In upcoming posts we will look at how to populate your DataFrame from a CSV file or from a database table.
Posted by Python Pandas for SQL fans Part 1: Creating DataFrames | HockeyGeekGirl on July 23, 2019 at 9:36 AM
[…] Part 2 shows you How to insert data into a DataFrame. […]
Posted by Pandas for SQL lovers – Part 3 Reading a CSV file / BULK INSERT | HockeyGeekGirl on August 7, 2019 at 12:01 PM
[…] we saw in part 2 of the series populating a DataFrame, every DataFrame has an index column, but if you do not want an index column created, you can […]
Posted by Pandas for SQL Lovers part 4: Handling Nulls read from CSV | HockeyGeekGirl on August 14, 2019 at 1:50 PM
[…] Part 2 Populating a DataFrame (INSERT) […]
Posted by Pandas for SQL lovers – JOIN statements | HockeyGeekGirl on September 24, 2019 at 2:15 PM
[…] Populating a DataFrame […]
Posted by Pandas for SQL Lovers – SELECT * FROM table | HockeyGeekGirl on January 7, 2020 at 2:13 PM
[…] Part 2 Populating a DataFrame (INSERT) […]
Posted by Pandas for SQL Lovers – SELECT col1,col2 FROM Table | HockeyGeekGirl on January 13, 2020 at 4:46 PM
[…] Populating a DataFrame (INSERT) […]