In the last post we learned a variety of ways to select all rows and all columns from a Pandas DataFrame. In this post we will learn how to return a specified list or range of columns from a DataFrame.
Also in this series:
- Creating a DataFrame (CREATE TABLE)
- Populating a DataFrame (INSERT)
- How to load a CSV file into a Pandas DataFrame (BULK INSERT)
- Handling Nulls read from CSV
- Join statements
- Select all rows and columns (SELECT * FROM table)
Once again we will rely heavily on
- loc return specified rows and columns from a DataFrame by specifying column names
- iloc return specified rows and columns from a DataFrame by specifying column positions
Let’s use a DataFrame called airports with the following values:
index | airport_code | city | state | capacity |
---|---|---|---|---|
0 | SEA | Seattle | WA | 45000000 |
1 | BOS | Boston | MA | 36000000 |
2 | HOU | Houston | TX | 55000000 |
3 | YOW | Ottawa | ON | 25000000 |
4 | YYZ | Toronto | ON | 65000000 |
Let’s start by selecting a single column
SELECT city FROM airports
All the commands below will return the same result
airports['city']
airports.city
loc, and iloc accept parameters to specify which rows and columns to return. A value of : indicates al rows or all columns. You can specify : to return all rows, and the column name or position to specify the column you want to return.
airports.loc[:,'city']
If you use iloc, you specify the index position of the column instead of the column name. The index column is not counted as a column and the first column is column 0.
index | airport_code | city | state | capacity |
---|---|---|---|---|
Column index | 0 | 1 | 2 | 3 |
0 | SEA | Seattle | WA | 45000000 |
1 | BOS | Boston | MA | 36000000 |
2 | HOU | Houston | TX | 55000000 |
3 | YOW | Ottawa | ON | 25000000 |
4 | YYZ | Toronto | ON | 65000000 |
If we want to display the city we need to request column 1.
airports.iloc[:,1]
iloc also allows you to use negative numbers to count from the end
index | airport_code | city | state | capacity |
---|---|---|---|---|
Reverse Column index | -4 | -3 | -2 | -1 |
0 | SEA | Seattle | WA | 45000000 |
1 | BOS | Boston | MA | 36000000 |
2 | HOU | Houston | TX | 55000000 |
3 | YOW | Ottawa | ON | 25000000 |
4 | YYZ | Toronto | ON | 65000000 |
We can return the city column by requesting column -3
airports.iloc[:,-3]
Any of the commands above return the same output
What if you want to select multiple columns?
SELECT airport_code, capacity FROM airports
You can specify the list of columns to return
airports[['airport_code','capacity']]
airports.loc[:,['airport_code','capacity']]
airports.iloc[:,[0,3]]
airports.iloc[:,[-4,-1]]
Any of the commands above return the following output:
SELECT RangeOfColumns FROM airports
iloc provides the ability to return a range of columns
You can specify a range using the syntax lowerBound: upperBound.
airports.iloc[:,0:2]
returns
Wait a second, let’s take another look at our original DataFrame. We said column 0 was the airport code and column 2 was the state. Shouldn’t a range of 0:2 return airport_code, city, and state?
index | airport_code | city | state | capacity |
---|---|---|---|---|
Column index | 0 | 1 | 2 | 3 |
0 | SEA | Seattle | WA | 45000000 |
1 | BOS | Boston | MA | 36000000 |
2 | HOU | Houston | TX | 55000000 |
3 | YOW | Ottawa | ON | 25000000 |
4 | YYZ | Toronto | ON | 65000000 |
When you specify a range the lower bound value is inclusive, the upper bound value is exclusive so it returns rows where:
lowerBound<= Column Index position < upperBound
Which means a range of 0 : 2 will return columns with an index of 0 and 1. For our DataFrame that returns airport_code, and city.
If you specify
airports.iloc[:,1:3]
You get back columns 1 & 2: city and state
If you want to return all the columns you would request columns 0 through 4 (even though there is no column 4 in our DataFrame)
airports.iloc[:,0:4]
As a matter of fact you could specify a range of 0 to 8 and it would not give you an error, you would see the same output, even though our DataFrame only contains four columns
airports.iloc[:,0:8]
If this really annoys your inner coder (I want my code to be robust! I want to know how many columns are in the DataFrame, I should never specify a column index that does not exist, that should blow up my code!) You can look up the number of columns in your DataFrame and use that as a parameter for the Upper Bound
len(airports.columns)
Returns the number of columns in the DataFrame
airports.shape
Returns the number of rows and columns
Since we only want the number of columns, we would request the second parameter, since parameter indexing starts at 0. Parameter 0 is the number of rows, Parameter 1 is the number of columns.
airports.shape(1)
Returns the number of columns in the airports DataFrame
We can use these expressions to dynamically return the upper boundary.
airports.iloc[:,0:airports.shape[1]]
airports.iloc[:,0:len(airports.columns)]
Either of these two commands will return all the columns in our DataFrame
You can also use negative numbers to specify a range. I have added the corresponding negative column numbers starting from the end of the DataFrame to try and help:
index | airport_code | city | state | capacity |
---|---|---|---|---|
Reverse Column index | -4 | -3 | -2 | -1 |
0 | SEA | Seattle | WA | 45000000 |
1 | BOS | Boston | MA | 36000000 |
2 | HOU | Houston | TX | 55000000 |
3 | YOW | Ottawa | ON | 25000000 |
4 | YYZ | Toronto | ON | 65000000 |
airports.iloc[:,-3:-1]
follows the same upperBound and lowerBound rules for exclusivity
LowerBound <= column index < UpperBound
-3 <= columnIndex < -1
So we return columns -3 and -2: city and state.
Selecting multiple column ranges
Are you lost yet? If so, don’t panic, I gave you a lot of options, find the one that works for you! For those who are still keeping up here’s another cool trick you can do with iloc.
Try this!
airports.iloc[:,[i for i in range (1,3)]]
It might help if you look at what is returned for
for i in range (1,3):
print(i)
This returns
1
2
Therefore
airports.iloc[:,[i for i in range (1,3)]]
will return columns 1 and 2: city and state
Why is this useful? Why didn’t I just ask for columns 1 and 2?
Imagine our airports DataFrame had 50 columns and you want columns 45-49 you could do this:
airports.iloc[:,[i for i in range (45,50)]]
But wait, we just saw that we can specify a range of columns so isn’t this simpler?
airports.iloc[:,45:50]
Yes, it is, and that will absolutely work, but what if you want column 0, column 1 AND columns 45-59! Using range you can do this:
airports.iloc[:,[0,1]+[i for i in range (45,49)]]
OR if you want columns 10 through 20 and columns 45-49
airports.iloc[:,[i for i in range(10,21)]+[i for i in range (45,50)]]
Using the range allows you to select two or more ranges of columns from a DataFrame.
airports.iloc[:,[i for i in range(10,21)]+[i for i in range (31,39)]+[i for i in range (45,50)]]
Would return columns 10 through 20, columns 31- 38 and columns 45-49!
You may not need this trick today, but now you know where to find it when you need it!