Posts Tagged ‘sql’

Pandas for SQL Lovers – SELECT col1,col2 FROM Table

In the last post we learned a variety of ways to select all rows and all columns from a Pandas DataFrame. In this post we will learn how to return a specified list or range of columns from a DataFrame.

Also in this series:

Once again we will rely heavily on

  • loc – returns specified rows and columns from a DataFrame by specifying column names
  • iloc – returns specified rows and columns from a DataFrame by specifying column positions

Let’s use a DataFrame called airports with the following values:

index airport_code city state capacity
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000
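
If you want to follow along, here is one way you could build this DataFrame yourself (a minimal sketch using the values from the table above):

import pandas as pd

airports = pd.DataFrame(
    {'airport_code': ['SEA', 'BOS', 'HOU', 'YOW', 'YYZ'],
     'city': ['Seattle', 'Boston', 'Houston', 'Ottawa', 'Toronto'],
     'state': ['WA', 'MA', 'TX', 'ON', 'ON'],
     'capacity': [45000000, 36000000, 55000000, 25000000, 65000000]})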

Let’s start by selecting a single column

SELECT city FROM airports

All the commands below will return the same result

airports['city']
airports.city

loc and iloc accept parameters to specify which rows and columns to return. A value of : indicates all rows or all columns. You can specify : to return all rows, and the column name or position to specify the column you want to return.

airports.loc[:,'city']

If you use iloc, you specify the index position of the column instead of the column name. The index column is not counted as a column and the first column is column 0.

index airport_code city state capacity
Column index 0 1 2 3
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000

If we want to display the city we need to request column 1.

airports.iloc[:,1]

iloc also allows you to use negative numbers to count from the end

index airport_code city state capacity
Reverse Column index -4 -3 -2 -1
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000

We can return the city column by requesting column -3

airports.iloc[:,-3]

All of the commands above return the same output:

0    Seattle
1     Boston
2    Houston
3     Ottawa
4    Toronto
Name: city, dtype: object

What if you want to select multiple columns?

SELECT airport_code, capacity FROM airports

You can specify the list of columns to return

airports[['airport_code','capacity']]
airports.loc[:,['airport_code','capacity']]
airports.iloc[:,[0,3]]
airports.iloc[:,[-4,-1]]

Any of the commands above return the following output:

  airport_code  capacity
0          SEA  45000000
1          BOS  36000000
2          HOU  55000000
3          YOW  25000000
4          YYZ  65000000

SELECT RangeOfColumns FROM airports

iloc provides the ability to return a range of columns

You can specify a range using the syntax lowerBound:upperBound.

airports.iloc[:,0:2]

returns

  airport_code     city
0          SEA  Seattle
1          BOS   Boston
2          HOU  Houston
3          YOW   Ottawa
4          YYZ  Toronto

Wait a second, let’s take another look at our original DataFrame. We said column 0 was the airport code and column 2 was the state. Shouldn’t a range of 0:2 return airport_code, city, and state?

index airport_code city state capacity
Column index 0 1 2 3
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000

When you specify a range, the lower bound value is inclusive and the upper bound value is exclusive, so it returns the columns where:

lowerBound <= column index < upperBound

Which means a range of 0:2 will return columns with an index of 0 and 1. For our DataFrame that returns airport_code and city.

If you specify

airports.iloc[:,1:3]

You get back columns 1 & 2: city and state

      city state
0  Seattle    WA
1   Boston    MA
2  Houston    TX
3   Ottawa    ON
4  Toronto    ON

If you want to return all the columns you would request columns 0 through 4 (even though there is no column 4 in our DataFrame)

airports.iloc[:,0:4]

  airport_code     city state  capacity
0          SEA  Seattle    WA  45000000
1          BOS   Boston    MA  36000000
2          HOU  Houston    TX  55000000
3          YOW   Ottawa    ON  25000000
4          YYZ  Toronto    ON  65000000

As a matter of fact, you could specify a range of 0 to 8 and it would not give you an error; you would see the same output, even though our DataFrame only contains four columns.

airports.iloc[:,0:8]

If this really annoys your inner coder (I want my code to be robust! I want to know how many columns are in the DataFrame! I should never specify a column index that does not exist; that should blow up my code!), you can look up the number of columns in your DataFrame and use that as the upper bound.

len(airports.columns)

Returns the number of columns in the DataFrame

airports.shape

Returns a tuple containing the number of rows and columns

Since we only want the number of columns, we request the second element of the tuple. Indexing starts at 0, so element 0 is the number of rows and element 1 is the number of columns.

airports.shape[1]

Returns the number of columns in the airports DataFrame

We can use these expressions to dynamically return the upper boundary.

airports.iloc[:,0:airports.shape[1]]
airports.iloc[:,0:len(airports.columns)]

Either of these two commands will return all the columns in our DataFrame

[Output: the full airports DataFrame, as shown above]

You can also use negative numbers to specify a range. I have added the corresponding negative column numbers starting from the end of the DataFrame to try and help:

index airport_code city state capacity
Reverse Column index -4 -3 -2 -1
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000
airports.iloc[:,-3:-1]

follows the same upperBound and lowerBound rules for exclusivity

lowerBound <= column index < upperBound

-3 <= columnIndex < -1

So we return columns -3 and -2: city and state.

[Output: the city and state columns, as shown above]

Selecting multiple column ranges

Are you lost yet? If so, don’t panic, I gave you a lot of options, find the one that works for you! For those who are still keeping up here’s another cool trick you can do with iloc.

Try this!

airports.iloc[:,[i for i in range (1,3)]]

It might help if you look at what is returned for

for i in range (1,3):
    print(i)

This returns

1
2

Therefore

airports.iloc[:,[i for i in range (1,3)]]

will return columns 1 and 2: city and state

[Output: the city and state columns]

Why is this useful? Why didn’t I just ask for columns 1 and 2?

Imagine our airports DataFrame had 50 columns and you wanted columns 45-49. You could do this:

airports.iloc[:,[i for i in range (45,50)]]

But wait, we just saw that we can specify a range of columns so isn’t this simpler?

airports.iloc[:,45:50]

Yes, it is, and that will absolutely work, but what if you want column 0, column 1 AND columns 45-49! Using range you can do this:

airports.iloc[:,[0,1]+[i for i in range (45,50)]]

OR if you want columns 10 through 20 and columns 45-49

airports.iloc[:,[i for i in range(10,21)]+[i for i in range (45,50)]]

Using the range allows you to select two or more ranges of columns from a DataFrame.

airports.iloc[:,[i for i in range(10,21)]+[i for i in range (31,39)]+[i for i in range (45,50)]]

Would return columns 10 through 20, columns 31-38 and columns 45-49!
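
As an aside (my addition, not part of the original trick), NumPy ships an indexing helper called r_ that can express the same combination of ranges more compactly, and iloc happily accepts the resulting array:

import numpy as np

# np.r_ concatenates the slice ranges into one array of column positions
airports.iloc[:, np.r_[10:21, 31:39, 45:50]]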

You may not need this trick today, but now you know where to find it when you need it!

Pandas for SQL lovers – JOIN statements

This is the next post in the Pandas for SQL lovers series.

Also in this series:

When you have data stored in two database tables you can use a JOIN statement to write a query to join the data stored across the two tables.

The Pandas merge function provides similar functionality for DataFrames.

Start with a CSV file loaded into a DataFrame containing flight information, which tells you the airport code for the airport where the flight originates.

import pandas as pd
flights_df = pd.read_csv('flight_data_part1.csv')

Flight information including airport codes such as ABQ or DUL

The mapping of airport codes to airport names is stored in a separate file which you load into its own DataFrame.

airport_codes_df = pd.read_csv('AirportCodeList.csv')

List of airport codes and names

You might need to merge the two DataFrames, the same way you might use a JOIN statement across two database tables, so you can analyze data by city as well as airport code.

To accomplish the same functionality with Pandas, use the merge function.

When you call merge you must specify two DataFrames and the columns to use to identify matching rows.

  • The left_on parameter specifies the name of the column in the first DataFrame listed, which appears on the left when you read the command from left to right.
  • The right_on parameter specifies the name of the column in the second DataFrame listed, which appears on the right when you read the command from left to right.

merged_df = pd.merge(flights_df,airport_codes_df,left_on='ORIGIN',right_on='CODE')
print (merged_df[['FL_DATE','OP_CARRIER_FL_NUM','ORIGIN','CODE','CITY']])

Data showing flight information and airport codes and names in a single data frame

By default merge performs an inner join. This means a record is only returned if a match is found in both DataFrames. i.e. if you have an airport code, but no flights for that airport, you will not see a row for that airport code in the merged DataFrame. Equally true, if there is a flight with an airport code that is not listed in the airport codes list, that flight information will not appear in the merged DataFrame.
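
If you prefer to make the default explicit, you can pass how='inner' and get the same result as the merge above; a quick sketch with the same DataFrames:

merged_df = pd.merge(flights_df,airport_codes_df,left_on='ORIGIN',right_on='CODE',how='inner')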

If you want to display all the records from one of the DataFrames, regardless of whether there is a match in the other DataFrame, you must perform an outer join by specifying a value for the how parameter.

  • right – to display all records from the right DataFrame regardless of whether a match is found.
  • left – to display all records from the left DataFrame regardless of whether a match is found.
  • outer – to display all records from both DataFrames regardless of whether a match is found.

When there are no matching records, NaN is used for the missing values.

merged_df = pd.merge(flights_df,extra_airport_codes_df,left_on='ORIGIN',right_on='CODE',how='right')
print (merged_df[['FL_DATE','OP_CARRIER_FL_NUM','ORIGIN','CODE','CITY']])

DataFrame with extra row for airport code with no matching flights

Happy coding!

Pandas for SQL Lovers INSERT / Populating a DataFrame

Know SQL and trying to learn Python Pandas for data science? Many concepts are the same. In this series I explain Pandas using SQL as a reference point.

In this post I will explain how to populate a Pandas DataFrame.

Also in this series:

INSERT INTO

In SQL if you want to insert rows into a table you use the INSERT statement. You can insert a single row


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11')

OR you can insert multiple rows with a single INSERT statement


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'),
('Connor','McDavid','EDM','C',97,925000.00,'1997-01-13')

Populating a DataFrame when created

When you create your DataFrame, you can provide data and populate it immediately.


column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
        ['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
        ['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
        ['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)

This creates the following DataFrame

  FirstName  LastName Team Position  JerseyNumber      Salary   Birthdate
0       Joe  Pavelski   SJ        C             8   6000000.0  1984-07-11
1    Connor   McDavid  EDM        C            97    925000.0  1997-01-13
2    Sidney    Crosby  PIT        C            87   8700000.0  1987-08-07
3     Carey     Price  MTL        G            31  10500000.0  1987-08-16

Inferred Datatypes

The advantage to populating the DataFrame when it is created is that Pandas will infer the datatypes based on the data. If I run the command:


players.dtypes

I can see the DataFrame assigned integer and float datatypes to JerseyNumber and Salary. All the other columns are strings (because strings are stored as a sequence, the datatype displayed is object):

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate object

Explicit Datatypes

If you want Birthdate to be a date datatype, you will need to convert it explicitly. The line of code below uses to_datetime to convert the Birthdate column to a datetime:

players['Birthdate']= pd.to_datetime(players['Birthdate'])

Now the Birthdate column stores the datatype datetime:

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate datetime64[ns]

You can find more details on how to assign datatypes explicitly in Part 1 of this series: how to create DataFrames.

Indexes in DataFrames

You may have noticed that I did not have any sort of 'playerid' value for the rows I inserted, but you can see a number beside each row. This column is called the index. Pandas will automatically create an index for each row in the DataFrame.

[The players DataFrame, with the autogenerated index beside each row]
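
One handy side effect of the index: you can retrieve a row by its index label using loc. A quick sketch, assuming the players DataFrame above:

# Retrieve the row with index label 2 (Sidney Crosby in our data)
print(players.loc[2])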

Setting your own column as index

If you want to use your own column for the index, you can use set_index. The example below creates a DataFrame with a PlayerId and then uses set_index to make PlayerId the index column.

column_names = ['PlayerId',
                'FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
data = [[1,'Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
        [2,'Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
        [3,'Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
        [4,'Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)

players.set_index('PlayerId', inplace=True)

This produces a DataFrame with PlayerId as the Index column

[The players DataFrame with PlayerId as the index column]

Using non numeric columns as an index

You are not limited to numeric fields as indexes; you can use any field as your index:

 players.set_index('LastName', inplace=True)

[The players DataFrame with LastName as the index column]

Duplicate values in index columns

Unlike a primary key in a database, the index on a DataFrame allows duplicate values. If you decide to use LastName as your index column and you have Henrik & Daniel Sedin in your DataFrame, you will see duplicate indexes.

[The players DataFrame showing duplicate index values for the Sedins]
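
To see this for yourself, here is a minimal sketch with the Sedin rows added (the teams, positions, and jersey numbers are just for illustration):

sedins = pd.DataFrame(
    [['Henrik','Sedin','VAN','C',33],
     ['Daniel','Sedin','VAN','LW',22]],
    columns=['FirstName','LastName','Team','Position','JerseyNumber'])
sedins.set_index('LastName', inplace=True)

# Both rows share the index label 'Sedin'; selecting it returns both rows
print(sedins.loc['Sedin'])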

Adding rows to an existing DataFrame

If you want to add rows to a DataFrame after it is created, use append. In the code below let's recreate the populated DataFrame with the autogenerated index:

column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
        ['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
        ['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
        ['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)

[The players DataFrame with the four rows and the autogenerated index]

Now you can use append to add a row. A couple of things to be aware of:

  • Using append creates a new DataFrame with the added row; if you want the row appended to your existing DataFrame, you need to assign the result of the append back to your original DataFrame.
  • You must specify ignore_index=True when you are providing explicit values to use for the new row being appended.

players = players.append({'FirstName':'Erik',
                          'LastName':'Karlsson',
                          'Team':'SJ',
                          'Position':'D',
                          'JerseyNumber':65,
                          'Salary':11500000.00,
                          'Birthdate':'1990-05-31'},
                         ignore_index=True)

[The players DataFrame with the appended Erik Karlsson row]
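
One caveat if you are on a newer version of pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent releases you can get the same result with pd.concat:

new_row = pd.DataFrame([{'FirstName':'Erik',
                         'LastName':'Karlsson',
                         'Team':'SJ',
                         'Position':'D',
                         'JerseyNumber':65,
                         'Salary':11500000.00,
                         'Birthdate':'1990-05-31'}])
players = pd.concat([players, new_row], ignore_index=True)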

Summary

Now you can add rows to your DataFrame. In upcoming posts we will look at how to populate your DataFrame from a CSV file or from a database table.

Python Pandas for SQL fans: Creating DataFrames

I have worked with data for years, and I am very familiar with SQL. When I started exploring data science I found a number of my SQL skills translated well into data science. Pandas tends to be the most popular Python library for manipulating data for those familiar with SQL, so I figured I would write a series of posts on how to use Pandas written from a SQL perspective!

In this post we will focus on how to create a DataFrame, which is the equivalent of creating a table in a database.

Also in this series:

Pre-requisites

You need to import the Python pandas and numpy libraries to use any of the code examples below:

import pandas as pd
import numpy as np

CREATE TABLE

If you want to query data in a database, you need to create a table. If you want to query data in Pandas, you need to create a DataFrame.

If I want to create a database table to hold information about hockey players I would use the CREATE TABLE statement:

CREATE TABLE players (
first_name   VARCHAR(30),
last_name VARCHAR(30),
team VARCHAR(3),
position VARCHAR(2),
jersey_number INT,
salary DECIMAL,
birthdate DATETIME)

Create Pandas DataFrame

To create a DataFrame to hold the same information in pandas, all I need to do is define the column names and create a DataFrame using the column names:

column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
players = pd.DataFrame(columns=column_names)

This creates an empty DataFrame:

Empty DataFrame
Columns: [FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate]

Specifying data types

Now you probably noticed I didn't specify the datatypes for my columns when I created the DataFrame. You can display the datatypes for your DataFrame using dtypes:

players.dtypes

which returns:

FirstName object
LastName object
Team object
Position object
JerseyNumber object
Salary object
Birthdate object

Any column with a datatype of string will be listed as object, because in Python a string is actually a sequence of characters, which allows you some neat slicing to retrieve substrings.
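
For example (my own quick sketch, not from the original post):

last_name = 'Pavelski'
print(last_name[0:4])  # prints 'Pave'
print(last_name[-3:])  # prints 'ski'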

You can modify the datatypes of the columns after the DataFrame is created using astype:

players = players.astype({'FirstName':str,
                          'LastName':str,
                          'Team':str,
                          'Position':str,
                          'JerseyNumber':int,
                          'Salary':float,
                          'Birthdate':np.datetime64})

Now I have the datatypes:

FirstName object
LastName object
Team object
Position object
JerseyNumber int32
Salary float64
Birthdate datetime64[ns]

Wondering what datatypes are available? Chris Moffitt wrote a nice post summarizing Pandas data types. You can also refer to the official Pandas documentation on dtypes.

If you are wondering why you can't specify datatypes for each column when a DataFrame is created, that's because, unlike when you work with database tables, you usually create DataFrames from a dataset and the datatype is inferred from the data. The DataFrame constructor does accept a datatype argument, but you can only use it to specify a datatype to use for all columns in the DataFrame; you cannot specify different datatypes for each column.
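
For example, a quick sketch of the constructor's dtype argument (my own illustration):

# dtype applies to every column in the DataFrame
df = pd.DataFrame({'JerseyNumber': [8, 97], 'Salary': [6000000, 925000]}, dtype=float)
print(df.dtypes)  # both columns are float64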

Converting all columns to a single datatype

If all your columns will be the same datatype, you can use astype to convert all columns to a new datatype:

column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
players = pd.DataFrame(columns=column_names)
players = players.astype(int)

would give me a DataFrame with every column as an integer:

FirstName int32
LastName int32
Team int32
Position int32
JerseyNumber int32
Salary int32
Birthdate int32

Converting one column to a new datatype

You can also convert a single column in a DataFrame to a new datatype using to_datetime, to_numeric, or to_timedelta.

NOTE: Timedelta is a datatype in Python used to represent a gap between two dates or times.
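
Here is a quick sketch of to_timedelta (my own example, not from the original post):

# Convert a string describing a time gap into a Timedelta
gap = pd.to_timedelta('2 days 03:00:00')
print(gap)  # 2 days 03:00:00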

The following code shows how I could use these functions to convert the columns that are not storing strings:

column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
players = pd.DataFrame(columns=column_names)
players['Birthdate'] = pd.to_datetime(players['Birthdate'])
players['JerseyNumber'] = pd.to_numeric(players['JerseyNumber'], downcast='integer')
players['Salary'] = pd.to_numeric(players['Salary'], downcast='float')

Would give me a DataFrame with the data types:

FirstName object
LastName object
Team object
Position object
JerseyNumber int8
Salary float32
Birthdate datetime64[ns]

Note that to_numeric takes an additional downcast parameter to specify the size of numeric datatype required.

Coming next

In Part 2 you learn the equivalent of the INSERT statement, and how to add data to your DataFrame.