Posts Tagged ‘append’

Pandas for SQL Lovers INSERT / Populating a DataFrame

SQLPanda2Know SQL and trying to learn Python Pandas for data science? Many concepts are the same. In this series I explain Pandas using SQL as a reference point.

In this post I will explain how to populate a Pandas DataFrame.

Also in this series:

INSERT INTO

In SQL if you want to insert rows into a table you use the INSERT statement. You can insert a single row


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11')

OR you can insert multiple rows with a single INSERT statement


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'),
('Connor','McDavid','EDM','C','97,925000.00,'1997-01-13')

Populating a DataFrame when created

When you create your DataFrame, you can provide data and populate it immediately.


column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players  = pd.DataFrame(data, columns=column_names)

This creates the following DataFrame

PlayersDataFrame

Inferred Datatypes

The advantage to populating the DataFrame when it is created, is that Pandas will infer the datatypes based on the data. If I run the command:


players.dtypes

I can see the DataFrame assigned integer and float datatypes to JerseyNumber and Salary. All the other columns are strings (because strings are stored as a sequence the datatype displayed is object):

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate object

Explicit Datatypes

If you want BirthDate to be a date, datatype you will need to convert it explicitly. The line of code below uses to_datetime to convert the Birthdate column to a datetime:

players['Birthdate']= pd.to_datetime(players['Birthdate'])

Now the Birthdate column stores the datatype datetime:

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate datetime64[ns]

You can find more details on how to assign datatypes explicitly in the Part 1 of this series: how to create DataFrames.

Indexes in DataFrames

You may have noticed that I did not have any sort of ‘playerid’ value for the rows I inserted. But you can see a number beside each row.  This column is called the index. Pandas will automatically create an index for each row in the DataFrame.

PlayersDataFrame

Setting your own column as index

If you want to use your own column for the index, you can use set_index. The example below creates a DataFrame with a PlayerId and then users set_index to make PlayerId the index column.

column_names = ['PlayerId',
'FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [[1,'Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
[2,'Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
[3, 'Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
[4, 'Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players  = pd.DataFrame(data,columns=column_names)

players.set_index(‘PlayerId’, inplace=True)

This produces a DataFrame with PlayerId as the Index column

PlayerIdIndex

Using non numeric columns as an index

You are not limited to numeric fields as indexes, you can use any field as your index:

 players.set_index('LastName', inplace=True)

LastNameIndex

Duplicate values in index columns

Unlike an Primary Key in a database, the index on a DataFrame will allow duplicate values. If you decide to use LastName as your index column and you have the Henrik & Daniel Sedin in your DataFrame you will see duplicate indexes.

duplicateIndex

Adding rows to an existing DataFrame

If you want to add rows to a DataFrame after it is created use append. In the code below let’s recreate the populated DataFrame with the autogenerated index:

column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)

PlayersDataFrame

Now you can use append to add a row. A couple of things to be aware of:

  • Using append creates a new DataFrame with the added row, if you want to append the row to your existing DataFrame you need to assign the result of the append to your original DataFrame.
  • You must specify ignore_index = True when you are providing explicit values to use for the new row being appended

players = players.append({'FirstName':'Erik',
'LastName':'Karlsson',
'Team':'SJ',
'Position':'D',
'JerseyNumber':65,
'Salary':11500000.00,
'Birthdate':'1990-05-31'},
ignore_index=True)

AppendedOneRow

Summary

Now you can add rows to your DataFrame. In upcoming posts we will look at how to populate your DataFrame from a CSV file or from a database table.