Posts Tagged ‘outer join’

Pandas for SQL lovers – JOIN statements

This is panda5the next post in the Pandas for SQL lovers series.

Also in this series:

When you have data stored in two database tables you can use a JOIN statement to write a query to join the data stored across the two tables.

The Pandas merge function provides similar functionality for DataFrames.

Start with a csv file loaded into a DataFrame containing flight information which tells you the airport code for the airport where the flight originates.

import pandas as pd
flights_df = pd.read_csv('flight_data_part1.csv')

Flight information including airport codes such ABQ or DUL

The mapping of airport codes to airport names is stored in a separate file which you load into it’s own DataFrame.

airport_codes_df = pd.read_csv('AirportCodeList.csv')

List of airport codes and names

You might need too merge the two DataFrames the same way you might want to do a JOIN statement across two databases tables so you can analyze data by city as well as airport code.

To accomplish the same functionality with Pandas use the merge function.

When you call merge you must specify two DataFrames and the columns to use to identify matching rows.

  • The left_on parameter specifes the name of the column in the first DataFrame listed, which appears on the left when you read the command from left to right.
  • The right_on parameter specifes the name of the column in the second DataFrame listed, which appears on the right when you read the command from left to right.

merged_df = pd.merge(flights_df,airport_codes_df,left_on='ORIGIN',right_on='CODE')
print (merged_df[['FL_DATE','OP_CARRIER_FL_NUM','ORIGIN','CODE','CITY']])

Data showing flight information and airport codes and names in a single data frame

By default merge performs an inner join. This means a record is only returned if a match is found in both DataFrames. i.e. if you have an airport code, but no flights for that airport, you will not see a row for that airport code in the merged DataFrame. Equally true, if there is a flight with an airport code that is not listed in the airport codes list, that flight information will not appear in the merged DataFrame.

If you want to display all the records from one of the DataFrames regardless of whether there is a match  in the other DataFrame you must perform an outer join by specifying a value for the how parameter.

  • right – to display all records from the right dataframe regardless of whether a match is found.
  • left – to display all records from the left DataFrame regardless of whether a match is found.
  • outer – to display all recrods from btoh DtaFrames regardless of whether a match is found.

When there is no matching records NaN is used for the missing values.

merged_df = pd.merge(flights_df,extra_airport_codes_df,left_on='ORIGIN',right_on='CODE',how='right')
print (merged_df[['FL_DATE','OP_CARRIER_FL_NUM','ORIGIN','CODE','CITY']])

DataFrame with extra row for airport code with no matching flights

Happy coding!