Archive for the ‘Technical’ Category

My next adventure! New job helping kids learn to code

“I think I'm quite ready for another adventure.” – Bilbo Baggins

In a time of pandemic, protests, and riots, it is somewhat surreal to be sharing news that I start a new job today. But in many ways it makes me even more excited about my new role, because underrepresented and underserved communities are among those suffering the most in these difficult times, so I look forward to a chance to try and give them some new hope and new opportunities.

So what exactly is this new job? Well as of June 1, 2020, my job title is Head of Amazon Future Engineer-Canada. I’ll be leading the Amazon Future Engineer program in Canada.

Amazon Future Engineer is a childhood-to-career program aimed to increase access to computer science education for children and young adults from underserved and underrepresented communities.

Those of you who know me read that sentence and said “Well that has Susan written all over it!” Yes it does! I have spent over 20 years of my career in various aspects of technical education:

  • Working as a technical trainer
  • Writing courses, labs
  • Delivering presentations at conferences
  • Working as a technical evangelist/advocate
  • Running workshops, hackathons, and coding competitions
  • Creating live and on-demand video content
  • Promoting and designing certifications

Many of you also know the last 10 years of my career have focused primarily on students. I started out teaching professional developers who were already working and just needed to learn a new technology for their jobs. I was originally assigned to work with students because no one else on the team wanted to do it (do you remember that conversation, Ryan?). Shortly afterwards, my career choices were driven by the opportunity to work with students.

Working with students was the happiest accident to befall my career. I was able to teach students how code and technology opens new doors, new doors that would give them new opportunities, and they ran with it! They say give someone a fish and you feed them for a day, teach someone to fish and you feed them for a lifetime. Teach someone how to be comfortable with coding and technology and you potentially give them a career!

Am I rambling? Probably. It’s just my way of saying that yes, I am excited about my new job: a chance to work with organizations to give those who might otherwise never have the chance the opportunity to discover coding and technology, and with it new career options.

So off I go now; I have my first meeting with the boss in a few hours. In a time when the news is downright depressing and I often feel so powerless, I am going to print out the quote below and put it above my desk, so each morning I’ll be reminded why I chose this job.

“Education is the most powerful weapon which you can use to change the world.” – Nelson Mandela

Wish me luck!

(PS. at this point my geeky brain immediately pictures Heath Ledger as the Joker saying – “and here we go…” but only because I loved that movie, and Heath Ledger’s Joker)

Pandas for SQL Lovers – SELECT col1,col2 FROM Table

In the last post we learned a variety of ways to select all rows and all columns from a Pandas DataFrame. In this post we will learn how to return a specified list or range of columns from a DataFrame.

Also in this series:

Once again we will rely heavily on

  • loc – returns specified rows and columns from a DataFrame by specifying column names
  • iloc – returns specified rows and columns from a DataFrame by specifying column positions

Let’s use a DataFrame called airports with the following values:

index airport_code city state capacity
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000

Let’s start by selecting a single column

SELECT city FROM airports

All the commands below will return the same result

airports['city']
airports.city

loc and iloc accept parameters to specify which rows and columns to return. A value of : indicates all rows or all columns. You can specify : to return all rows, and the column name or position to specify the column you want to return.

airports.loc[:,'city']

If you use iloc, you specify the index position of the column instead of the column name. The index column is not counted as a column and the first column is column 0.

index airport_code city state capacity
Column index 0 1 2 3
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000

If we want to display the city we need to request column 1.

airports.iloc[:,1]

iloc also allows you to use negative numbers to count from the end

index airport_code city state capacity
Reverse Column index -4 -3 -2 -1
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000

We can return the city column by requesting column -3

airports.iloc[:,-3]

Any of the commands above return the same output

Output showing the city column from the airports DataFrame

What if you want to select multiple columns?

SELECT airport_code, capacity FROM airports

You can specify the list of columns to return

airports[['airport_code','capacity']]
airports.loc[:,['airport_code','capacity']]
airports.iloc[:,[0,3]]
airports.iloc[:,[-4,-1]]

Any of the commands above return the following output:

Output showing the airport_code and capacity columns

SELECT RangeOfColumns FROM airports

iloc provides the ability to return a range of columns

You can specify a range using the syntax lowerBound: upperBound.

airports.iloc[:,0:2]

returns

Output showing the first two columns of the airports DataFrame

Wait a second, let’s take another look at our original DataFrame. We said column 0 was the airport code and column 2 was the state. Shouldn’t a range of 0:2 return airport_code, city, and state?

index airport_code city state capacity
Column index 0 1 2 3
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000

When you specify a range the lower bound value is inclusive and the upper bound value is exclusive, so it returns columns where:

lowerBound <= column index position < upperBound

Which means a range of 0 : 2 will return columns with an index of 0 and 1. For our DataFrame that returns airport_code, and city.

If you specify

airports.iloc[:,1:3]

You get back columns 1 & 2: city and state

Output showing the city and state columns

If you want to return all the columns you would request columns 0 through 4 (even though there is no column 4 in our DataFrame)

airports.iloc[:,0:4]

Output showing all columns of the airports DataFrame

As a matter of fact you could specify a range of 0 to 8 and it would not give you an error, you would see the same output, even though our DataFrame only contains four columns

airports.iloc[:,0:8]

If this really annoys your inner coder (I want my code to be robust! I should never specify a column index that does not exist; that should blow up my code!), you can look up the number of columns in your DataFrame and use that as the upper bound.

len(airports.columns)

Returns the number of columns in the DataFrame

airports.shape

Returns the number of rows and columns

Since we only want the number of columns, we request the second element of the tuple returned by shape. Indexing starts at 0, so element 0 is the number of rows and element 1 is the number of columns.

airports.shape[1]

Returns the number of columns in the airports DataFrame

We can use these expressions to dynamically return the upper boundary.

airports.iloc[:,0:airports.shape[1]]
airports.iloc[:,0:len(airports.columns)]

Either of these two commands will return all the columns in our DataFrame

Output showing all columns of the airports DataFrame

You can also use negative numbers to specify a range. I have added the corresponding negative column numbers starting from the end of the DataFrame to try and help:

index airport_code city state capacity
Reverse Column index -4 -3 -2 -1
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000
airports.iloc[:,-3:-1]

follows the same upperBound and lowerBound rules for exclusivity

LowerBound <= column index < UpperBound

-3 <= columnIndex < -1

So we return columns -3 and -2: city and state.

Output showing the city and state columns

 

Selecting multiple column ranges

Are you lost yet? If so, don’t panic, I gave you a lot of options, find the one that works for you! For those who are still keeping up here’s another cool trick you can do with iloc.

Try this!

airports.iloc[:,[i for i in range (1,3)]]

It might help if you look at what is returned for

for i in range(1,3):
    print(i)

This returns

1

2

Therefore

airports.iloc[:,[i for i in range (1,3)]]

will return columns 1 and 2: city and state

Output showing the city and state columns

Why is this useful? Why didn’t I just ask for columns 1 and 2?

Imagine our airports DataFrame had 50 columns and you want columns 45-49 you could do this:

airports.iloc[:,[i for i in range (45,50)]]

But wait, we just saw that we can specify a range of columns so isn’t this simpler?

airports.iloc[:,45:50]

Yes, it is, and that will absolutely work, but what if you want column 0, column 1 AND columns 45-49? Using range you can do this:

airports.iloc[:,[0,1]+[i for i in range(45,50)]]

OR if you want columns 10 through 20 and columns 45-49

airports.iloc[:,[i for i in range(10,21)]+[i for i in range (45,50)]]

Using the range allows you to select two or more ranges of columns from a DataFrame.

airports.iloc[:,[i for i in range(10,21)]+[i for i in range (31,39)]+[i for i in range (45,50)]]

Would return columns 10 through 20, columns 31- 38 and columns 45-49!
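If you find yourself combining several ranges like this, NumPy's np.r_ helper can build the same list of column positions a little more compactly (this is an aside, not from the original post, and assumes numpy is imported as np):

import numpy as np

# np.r_ concatenates slice-style ranges into a single array of positions for iloc
airports.iloc[:, np.r_[10:21, 31:39, 45:50]]

This returns the same columns as the example above: 10 through 20, 31 through 38, and 45 through 49.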

You may not need this trick today, but now you know where to find it when you need it!

Pandas for SQL Lovers – SELECT * FROM table

Earlier in the series, I covered how to load data from a csv file into a pandas DataFrame. In this post I’ll focus on the building block for all other DataFrame queries: SELECT * FROM DataFrame

Also in this series:

So you have your data loaded into a Pandas DataFrame and now you want to run a query against the DataFrame.

There are different methods we can use depending on what we want to do. Let’s start with loc and iloc

  • loc allows you to specify which columns you want by providing the column name
  • iloc allows you to specify which columns you want by providing the column position

Imagine you have a DataFrame called airports with the following values:

index airport_code city state capacity
0 SEA Seattle WA 45000000
1 BOS Boston MA 36000000
2 HOU Houston TX 55000000
3 YOW Ottawa ON 25000000
4 YYZ Toronto ON 65000000

The first query you need is the same one you learn when you first learn SQL

SELECT * FROM airports

All the commands below will return the same result

Just want all the rows and columns? If you are using a Jupyter notebook just type the DataFrame name

airports

To display the values as output from any Python code

print(airports)
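As a small aside (not in the original post), if the DataFrame is large you may only want a quick peek rather than the whole table; head and tail are handy for that:

# Show the first five rows (pass a number for more or fewer, e.g. head(10))
airports.head()

# Show the last five rows
airports.tail()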

loc and iloc accept parameters to specify which rows and columns to return. A value of : indicates all rows or all columns.

airports.loc[:,:]
airports.iloc[:,:]

If you do not specify a value for the columns parameter, loc and iloc return all columns.

airports.loc[:]
airports.iloc[:]

If you know how many rows are in the DataFrame, you could specify rows 0 through to the last row. Be careful though: loc slices by label and includes the end value, while iloc slices by position and excludes it, so for our five-row DataFrame you use 0:4 with loc but 0:5 with iloc.

airports.loc[0:4,:]
airports.loc[0:4]
airports.iloc[0:5,:]
airports.iloc[0:5]

If you don’t know the number of rows in the DataFrame or the number of rows might change, use len to return the number of rows and make your code more robust.

airports.loc[0:len(airports),:]
airports.loc[0:len(airports)]
airports.iloc[0:len(airports),:]
airports.iloc[0:len(airports)]

 

Every command above returns the exact same output:

Output showing all rows and all columns of the airports DataFrame

As is so often the case when writing code, there are lots of different ways to accomplish the same task. Getting familiar with these different techniques will give you a good base to build on as we move on to more advanced queries.

Looking for similar topics and technical posts? See the full list

Pandas for SQL lovers – JOIN statements

This is the next post in the Pandas for SQL lovers series.

Also in this series:

When you have data stored in two database tables you can use a JOIN statement to write a query to join the data stored across the two tables.

The Pandas merge function provides similar functionality for DataFrames.

Start with a csv file loaded into a DataFrame containing flight information which tells you the airport code for the airport where the flight originates.

import pandas as pd
flights_df = pd.read_csv('flight_data_part1.csv')

Flight information including airport codes such as ABQ or DUL

The mapping of airport codes to airport names is stored in a separate file which you load into its own DataFrame.

airport_codes_df = pd.read_csv('AirportCodeList.csv')

List of airport codes and names

You might need to merge the two DataFrames, the same way you might use a JOIN statement across two database tables, so you can analyze data by city as well as airport code.

To accomplish the same functionality with Pandas use the merge function.

When you call merge you must specify two DataFrames and the columns to use to identify matching rows.

  • The left_on parameter specifies the name of the column in the first DataFrame listed, which appears on the left when you read the command from left to right.
  • The right_on parameter specifies the name of the column in the second DataFrame listed, which appears on the right when you read the command from left to right.

merged_df = pd.merge(flights_df,airport_codes_df,left_on='ORIGIN',right_on='CODE')
print (merged_df[['FL_DATE','OP_CARRIER_FL_NUM','ORIGIN','CODE','CITY']])

Data showing flight information and airport codes and names in a single data frame

By default merge performs an inner join. This means a record is only returned if a match is found in both DataFrames. i.e. if you have an airport code, but no flights for that airport, you will not see a row for that airport code in the merged DataFrame. Equally true, if there is a flight with an airport code that is not listed in the airport codes list, that flight information will not appear in the merged DataFrame.

If you want to display all the records from one of the DataFrames regardless of whether there is a match in the other DataFrame, you must perform an outer join by specifying a value for the how parameter.

  • right – to display all records from the right DataFrame regardless of whether a match is found.
  • left – to display all records from the left DataFrame regardless of whether a match is found.
  • outer – to display all records from both DataFrames regardless of whether a match is found.

When there are no matching records, NaN is used for the missing values.

merged_df = pd.merge(flights_df,extra_airport_codes_df,left_on='ORIGIN',right_on='CODE',how='right')
print (merged_df[['FL_DATE','OP_CARRIER_FL_NUM','ORIGIN','CODE','CITY']])

DataFrame with extra row for airport code with no matching flights
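For completeness, here is a hedged sketch (not part of the original example) of a full outer join; merge's optional indicator parameter adds a _merge column showing which DataFrame each row came from:

merged_df = pd.merge(flights_df, airport_codes_df, left_on='ORIGIN', right_on='CODE', how='outer', indicator=True)
# _merge contains 'both', 'left_only', or 'right_only' for each row
print(merged_df[['FL_DATE','ORIGIN','CODE','CITY','_merge']])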

Happy coding!

Pandas for SQL Lovers: Handling Nulls read from CSV

In part 3 of the series I covered how to load a CSV file into a Pandas DataFrame. In this post I’ll focus on how to deal with NULL or missing values read from CSV files.

Also in this series:

If you work with SQL you have probably had your share of challenges caused by missing/blank and NULL values. Let’s see how read_csv helps us manage these troublemakers when we populate a DataFrame from a csv file.

Let’s start with the following data in a CSV file:

FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney,Crosby,NULL,C,87,8700000,1987-08-07
Carey,Price,Unknown,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,NA,22,,1980-09-26
Henrik,Sedin,VAN,N/A,33,,1980-09-26

I have deliberately provided a variety of values that can be construed as missing values

  • Sidney Crosby: Team is NULL
  • Carey Price: Team is Unknown
  • Daniel Sedin: Position is NA and salary is not provided
  • Henrik Sedin: Position is N/A and salary is not provided

Leave NULL or missing values untouched

If you specify na_filter=False then read_csv will read in all values exactly as they are:


players = pd.read_csv('HockeyPlayersNulls.csv',na_filter=False)

returns:
Dataset with missing and null values highlighted

Replace default missing values with NaN

In Pandas, the equivalent of NULL is NaN. By default, read_csv will replace blanks, NULL, NA, and N/A with NaN:


players = pd.read_csv('HockeyPlayersNulls.csv')

returns:

Loaded DataFrame with blanks and missing values replaced by NaN; the word Unknown was not recognized as null

You can see that most of the ‘missing’ values in my csv files are replaced by NaN, except the value ‘Unknown’ which was not recognized as a missing value.

How to designate values as missing

If there are values in your data which are not recognized as missing, you can use the na_values parameter to specify values you want treated as missing:


players = pd.read_csv('HockeyPlayersNulls.csv',na_values=['Unknown'])

returns:

DataFrame populated with missing values replaced by NaN
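As an aside (not shown in the original post), na_values also accepts a dictionary, so you can treat a value as missing in one column without affecting the others:

# Treat 'Unknown' as missing only in the Team column
players = pd.read_csv('HockeyPlayersNulls.csv', na_values={'Team': ['Unknown']})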

Handling blank lines

By default if a blank line is encountered in the CSV file, it is skipped.

So if you have the following file:
FirstName,LastName,Team,Position,JerseyNumber,Salary
Joe,Pavelski,SJ,C,8,6000000
Connor,McDavid,EDM,C,97,925000

Sidney ,Crosby,PIT,C,87,8700000
Carey,Price,MTL,G,31,10500000
Daniel,Sedin,VAN,LW,22,
Henrik,Sedin,VAN,C,33,

You get the following DataFrame:
DataFrame populated with all rows; there is no row where the blank line in the file appeared

If you want the blank line to appear, you can specify skip_blank_lines=False:
players = pd.read_csv('HockeyPlayersBlankLines.csv', skip_blank_lines=False)

returns:

DataFrame containing a row whose values all show as NaN
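Once the data is loaded you will usually want to do something with those NaN values. As a quick, hedged preview (dropna and fillna are standard DataFrame methods, but this is not from the original post):

# Drop rows where every value is NaN (such as the row created from the blank line)
players = players.dropna(how='all')

# Replace missing salaries with 0
players['Salary'] = players['Salary'].fillna(0)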

Next up

We still need to look at how to control datatypes and how to deal with Dates when using read_csv to populate a DataFrame. Stay tuned!

 

Pandas for SQL lovers Reading a CSV file / BULK INSERT

In this post I will explain how to populate a Pandas DataFrame from a CSV file.

Also in this series:

If you have ever used BULK INSERT to read a CSV file into a database table, you are going to fall in love with the functionality for reading a csv file into a Pandas DataFrame!

Let’s start with how to:

  • Read a file
  • Handle error rows
  • Specify the index column
  • Specify column delimiters
  • Specify column headers
  • Select a subset of rows or columns

We load a csv file into a Pandas dataframe using read_csv

read_csv

As usual the first thing we need to do is import the numpy and pandas libraries


import pandas as pd
import numpy as np

I have the following csv file:

FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney ,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,LW,22,,1980-09-26
Henrik,Sedin,VAN,C,33,,1980-09-26

Use the read_csv method to load a comma separated file into a DataFrame:

players = pd.read_csv('HockeyPlayers.csv')

Creates the following DataFrame:

The players DataFrame loaded from HockeyPlayers.csv

A few things to note:

  • Column names were read by default from the first row in the file
  • An index column was added numbering each row
  • The Salary for Daniel & Henrik Sedin was blank in the original file and appears as NaN (we will talk more about handling blanks and nulls in a later post)

Handling invalid rows

By default invalid rows will raise an error, for example if my data has rows with too many values.

I have an extra ‘,’ in Connor McDavid’s name and in Carey Price’s salary:

FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,Mc,David,EDM,C,97,925000,1997-01-13
Sidney ,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10,500,000,1987-08-16
Daniel,Sedin,VAN,LW,22,,1980-09-26
Henrik,Sedin,VAN,C,33,,1980-09-26

If I try to load the data I get an error:

ParserError: Error tokenizing data. C error: Expected 7 fields in line 3, saw 8

If you specify error_bad_lines=False, invalid rows are skipped:


players = pd.read_csv('HockeyPlayersInvalidRows.csv',error_bad_lines=False)

DataFrame loaded with the invalid rows skipped

In Jupyter Notebooks you will see a message informing you lines were skipped:

'Skipping line 3: expected 7 fields, saw 8\nSkipping line 5: expected 7 fields, saw 9

Specifying your own index column

As we saw in part 2 of the series on populating a DataFrame, every DataFrame has an index column. If you do not want an index column created for you, you can specify a column in your csv file to use as the index. The index column does not need to be a numeric value.

NOTE: unlike a PRIMARY KEY in a database table, an index column in a DataFrame can contain duplicate values.


players = pd.read_csv('HockeyPlayers.csv', index_col='LastName')

Would give me a DataFrame that uses LastName as the index column:

DataFrame using LastName as the index column

You can specify a combination of columns to use as your index:


players = pd.read_csv('HockeyPlayers.csv', index_col=['LastName','FirstName'])

Will return the following DataFrame:

DataFrame using LastName and FirstName as a multi-column index

Column delimiters

The default column delimiter is a comma. If you use a different delimiter all the columns are read as a single column.

If I try to read a data file using semi-colons as column delimiters

FirstName;LastName;Team;Position;JerseyNumber;Salary;Birthdate
Joe;Pavelski;SJ;C;8;6000000;1984-07-11
Connor;McDavid;EDM;C;97;925000;1997-01-13
Sidney ;Crosby;PIT;C;87;8700000;1987-08-07
Carey;Price;MTL;G;31;10500000;1987-08-16
Daniel;Sedin;VAN;LW;22;;1980-09-26
Henrik;Sedin;VAN;C;33;;1980-09-26


players = pd.read_csv('HockeyPlayersSemiColonDelimiter.csv')

I get the following DataFrame:

DataFrame where every row was read into a single column because of the wrong delimiter

Use the delimiter parameter to specify the correct delimiter character


players = pd.read_csv('HockeyPlayersSemiColonDelimiter.csv', delimiter=';')

returns the DataFrame:

DataFrame loaded correctly using the semi-colon delimiter

Column names

If your file does not contain a row with column names, the first row by default will be treated as column headers, so you might end up with a DataFrame like this:

DataFrame where the first row of data was treated as column headers

If you specify header=None, columns will be identified by number:


players = pd.read_csv('HockeyPlayersNoHeaderRow.csv', header=None)

DataFrame with columns identified by number

You can specify column names using the names parameter

players = pd.read_csv('HockeyPlayersNoHeaderRow.csv',
names = ['First','Last','TeamCode','Pos','JerseyNbr','Salary','Birthdate'])

Will return the DataFrame:

DataFrame with the specified column names

If your csv file contains column headers you can use the names parameter to rename the columns.


players = pd.read_csv('HockeyPlayers.csv',
names = ['First','Last','TeamCode','Pos','JerseyNbr','Salary','Birthdate'])

But be warned, if the first row of your file contains column names and you specify the names parameter, the first row is read as a row of data:

DataFrame where the original header row was read as a row of data

You need to specify the header parameter and indicate the first row of the file is a header row and not a data row


players = pd.read_csv('HockeyPlayers.csv',
header=0,
names = ['First','Last','TeamCode','Pos','JerseyNbr','Salary','Birthdate'])

Will give you the desired DataFrame:

DataFrame with the renamed columns

Reading a subset of the csv file

You can specify which columns to read from the file with the usecols parameter:


players = pd.read_csv('HockeyPlayers.csv',
usecols=['LastName','FirstName','Team'])

DataFrame containing only the LastName, FirstName, and Team columns

Use skiprows to skip rows at the start of the file. Of course if you skip the row containing the column headers, you need to specify the column names as well.


players = pd.read_csv('HockeyPlayers.csv',
skiprows=3,
names=['FirstName','LastName','Team','Position','JerseyNumber','Salary','BirthDate'])

will give you the DataFrame:

DataFrame with the first rows of the file skipped

Use skipfooter to skip rows at the bottom of the file. If you use skipfooter you must also specify the parameter engine='python'. read_csv supports a C and a Python engine. The C engine is faster, but does not support all the features. The Python engine supports all the features of read_csv.

Now when I use the Python engine, I noticed in Jupyter Notebooks that the first column name gets weird characters appended to it.


players = pd.read_csv('HockeyPlayers.csv',
skipfooter=2,
engine='python')

DataFrame where the first column name contains unexpected characters

This might just be something odd with my file, or with my notebook, but I can fix it easily enough by just specifying my own column names and skipping the row containing the column names:


players = pd.read_csv('HockeyPlayers.csv',
skiprows=1,
skipfooter=2,
engine='python',
names=['FirstName','LastName','Team','Position','JerseyNumber','Salary','BirthDate'])

DataFrame with the footer rows skipped

If you want to read in a specific number of rows you can use the nrows parameter:


players = pd.read_csv('HockeyPlayers.csv', nrows=3)

DataFrame containing only the first three rows

If you have a very large data file you can also read it in chunks using the chunksize parameter and store each chunk separately for analysis or processing.


for playerchunk in pd.read_csv('HockeyPlayers.csv', chunksize=2):
    print(playerchunk.shape)
    print(playerchunk)

Output showing each chunk of rows printed separately
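If you want to process the chunks and end up with a single DataFrame again, a common pattern is to filter each chunk and concatenate the results. A sketch with an illustrative filter condition:

chunks = []
for playerchunk in pd.read_csv('HockeyPlayers.csv', chunksize=2):
    # keep only the rows we care about from each chunk (the condition is just an example)
    chunks.append(playerchunk[playerchunk['Team'] == 'VAN'])
players_van = pd.concat(chunks, ignore_index=True)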

Summary

Okay we’ve covered how to

    • Read a csv file
    • Control behaviour for error rows
    • Specify an index column
    • Specify column names
    • Specify which rows and columns to load

Still to come, we will look at how to use read_csv to control column datatypes, handle nulls, and handle dates. Stay tuned!

Pandas for SQL Lovers INSERT / Populating a DataFrame

Know SQL and trying to learn Python Pandas for data science? Many concepts are the same. In this series I explain Pandas using SQL as a reference point.

In this post I will explain how to populate a Pandas DataFrame.

Also in this series:

INSERT INTO

In SQL if you want to insert rows into a table you use the INSERT statement. You can insert a single row


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11')

OR you can insert multiple rows with a single INSERT statement


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'),
('Connor','McDavid','EDM','C',97,925000.00,'1997-01-13')

Populating a DataFrame when created

When you create your DataFrame, you can provide data and populate it immediately.


column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players  = pd.DataFrame(data, columns=column_names)

This creates the following DataFrame

The players DataFrame populated with four rows

Inferred Datatypes

The advantage to populating the DataFrame when it is created, is that Pandas will infer the datatypes based on the data. If I run the command:


players.dtypes

I can see the DataFrame assigned integer and float datatypes to JerseyNumber and Salary. All the other columns are strings (because strings are stored as a sequence the datatype displayed is object):

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate object

Explicit Datatypes

If you want Birthdate to be a date datatype, you will need to convert it explicitly. The line of code below uses to_datetime to convert the Birthdate column to a datetime:

players['Birthdate']= pd.to_datetime(players['Birthdate'])

Now the Birthdate column stores the datatype datetime:

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate datetime64[ns]

You can find more details on how to assign datatypes explicitly in the Part 1 of this series: how to create DataFrames.

Indexes in DataFrames

You may have noticed that I did not have any sort of ‘playerid’ value for the rows I inserted. But you can see a number beside each row.  This column is called the index. Pandas will automatically create an index for each row in the DataFrame.

The players DataFrame with the autogenerated index

Setting your own column as index

If you want to use your own column for the index, you can use set_index. The example below creates a DataFrame with a PlayerId and then uses set_index to make PlayerId the index column.

column_names = ['PlayerId',
'FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [[1,'Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
[2,'Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
[3, 'Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
[4, 'Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players  = pd.DataFrame(data,columns=column_names)

players.set_index('PlayerId', inplace=True)

This produces a DataFrame with PlayerId as the Index column

DataFrame with PlayerId as the index column

Using non numeric columns as an index

You are not limited to numeric fields as indexes, you can use any field as your index:

 players.set_index('LastName', inplace=True)

DataFrame with LastName as the index column

Duplicate values in index columns

Unlike a PRIMARY KEY in a database, the index on a DataFrame will allow duplicate values. If you decide to use LastName as your index column and you have Henrik and Daniel Sedin in your DataFrame, you will see duplicate index values.

DataFrame showing duplicate Sedin entries in the LastName index
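One practical consequence (a small aside): selecting by a duplicated index label returns every matching row rather than a single row.

# With LastName as the index, both Sedin rows come back
print(players.loc['Sedin'])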

Adding rows to an existing DataFrame

If you want to add rows to a DataFrame after it is created use append. In the code below let’s recreate the populated DataFrame with the autogenerated index:

column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)

The recreated players DataFrame with the autogenerated index

Now you can use append to add a row. A couple of things to be aware of:

  • Using append creates a new DataFrame with the added row; if you want the row added to your existing DataFrame, you need to assign the result of the append back to your original DataFrame.
  • You must specify ignore_index = True when you are providing explicit values to use for the new row being appended

players = players.append({'FirstName':'Erik',
'LastName':'Karlsson',
'Team':'SJ',
'Position':'D',
'JerseyNumber':65,
'Salary':11500000.00,
'Birthdate':'1990-05-31'},
ignore_index=True)

The players DataFrame with the appended row for Erik Karlsson
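Note that newer versions of pandas steer you toward pd.concat instead of append. A hedged alternative sketch that produces the same result:

new_row = pd.DataFrame([{'FirstName':'Erik',
'LastName':'Karlsson',
'Team':'SJ',
'Position':'D',
'JerseyNumber':65,
'Salary':11500000.00,
'Birthdate':'1990-05-31'}])
players = pd.concat([players, new_row], ignore_index=True)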

Summary

Now you can add rows to your DataFrame. In upcoming posts we will look at how to populate your DataFrame from a CSV file or from a database table.

Python Pandas for SQL fans: Creating DataFrames

I have worked with data for years, and I am very familiar with SQL. When I started exploring data science I found a number of my SQL skills translated well into data science. Pandas tends to be the most popular python library for manipulating data for those familiar with SQL, so I figured I would write a series of posts on how to use Pandas written from a SQL perspective!

In this post we will focus on how to create a DataFrame, which is the equivalent of creating a table in a database.

Also in this series:

Pre-requisites

You need to import the python pandas and numpy libraries to use any of the code examples below:

import pandas as pd
import numpy as np

CREATE TABLE

If you want to query data in a database, you need to create a table. If you want to query data in Pandas, you need to create a DataFrame.

If I want to create a database table to hold information about hockey players I would use the CREATE TABLE statement:

CREATE TABLE players (
first_name   VARCHAR(30),
last_name VARCHAR(30),
team VARCHAR(3),
position VARCHAR(2),
jersey_number INT,
salary DECIMAL,
birthdate DATETIME)

Create Pandas DataFrame

To create a DataFrame to hold the same information in pandas, all I need to do is define the column names and create a DataFrame using those column names:

column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
players  = pd.DataFrame(columns=column_names)

Creates an empty dataframe:

Empty DataFrame
Columns: [FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate]

Specifying data types

Now you probably noticed I didn’t specify the datatypes for my columns when I created the dataframe. You can display the datatypes for your DataFrame using dtypes:

players.dtypes

which returns:

FirstName object
LastName object
Team object
Position object
JerseyNumber object
Salary object
Birthdate object

Any column with a datatype of string will be listed as object, because in Python a string is actually a sequence of characters, which allows you some neat slicing to retrieve substrings.
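For example (a small aside, once the DataFrame actually holds data), the .str accessor lets you slice string columns:

# First two characters of each team code, e.g. 'SJ', 'ED', 'PI'
print(players['Team'].str[:2])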

You can modify the datatypes of the columns after the DataFrame is created using astype:

players = players.astype({'FirstName':str,
'LastName':str,
'Team':str,
'Position':str,
'JerseyNumber':int,
'Salary':float,
'Birthdate':np.datetime64})

Now I have the datatypes:

FirstName object
LastName object
Team object
Position object
JerseyNumber int32
Salary float64
Birthdate datetime64[ns]

Wondering what datatypes are available? Chris Moffitt wrote a nice post summarizing Pandas data types. You can also refer to the official Pandas documentation on dtypes.

If you are wondering why you can’t specify a datatype for each column when a DataFrame is created, that’s because, unlike database tables, DataFrames are usually created from a dataset and the datatypes are inferred from the data. The DataFrame constructor does accept a dtype argument, but it applies a single datatype to every column in the DataFrame; you cannot specify different datatypes for each column.
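A quick illustration of that constructor argument (a sketch, not from the original post):

# dtype applies to every column; with no data the columns all end up as object (string)
players = pd.DataFrame(columns=column_names, dtype=str)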

Converting all columns to a single datatype

If all your columns will be the same datatype, you can use astype to convert all columns to a new datatype:

column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
players  = pd.DataFrame(columns=column_names)
players = players.astype(int)

would give me a DataFrame with every column as an integer:

FirstName int32
LastName int32
Team int32
Position int32
JerseyNumber int32
Salary int32
Birthdate int32

Converting one column to a new datatype

You can also convert a single column in a DataFrame to a new datatype using to_datetime, to_numeric, or to_timedelta.

NOTE: Timedelta is a datatype in Python used to represent the difference between two dates or times.

The following code shows how I could use these functions to convert the columns that are not storing strings:

column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
players  = pd.DataFrame(columns=column_names)
players['Birthdate']= pd.to_datetime(players['Birthdate'])
players['JerseyNumber']=pd.to_numeric(players['JerseyNumber'],downcast='integer')
players['Salary']=pd.to_numeric(players['Salary'],downcast='float')

Would give me a DataFrame with the data types:

FirstName object
LastName object
Team object
Position object
JerseyNumber int8
Salary float32
Birthdate datetime64[ns]

Note that to_numeric accepts an optional downcast parameter that converts the result to the smallest numeric datatype that can hold the values.

Coming next

In Part 2 you will learn the equivalent of the INSERT statement and how to add data to your DataFrame.

Object detection with Custom Vision Service

In this post I’ll show you how to train your own model to detect objects in an image.

As many of you know, I am a runner. After I run a race I usually get an email trying to sell me my race photos. This got me thinking about how photographers manage to sort through all the photos.  It’s a great example of leveraging image analysis for a business. You need to find the runner in the photo, locate their bib, and read the number on their bib.

In this post I’ll show you how I was able to use a drag and drop tool to locate the bibs in race photos.

If you want to try it yourself, you need a Microsoft account and an Azure subscription.

Launch the Azure Custom Vision Service.

Custom Vision Service Home Page

Custom Vision Service Home page

Sign in with your Microsoft account, and Create a New Project

Create New Project Custom Vision Service

Create New Project

  • Enter a project name and description.
  • Set the project type to Object Detection
  • Select Create new resource group

Set Project Properties

Set Project Properties

A resource group is like a project folder. Your custom vision service will be stored inside the resource group.

  • Enter a name for your resource group
  • Select your Azure subscription
  • Select a data centre location where you want your resource group stored
  • Select a pricing tier

Create resource group

Create resource group

Select Create resource and then select Create project to create your project.

You will be redirected to the project home page.

You need to upload images to train your model.

Upload images

Upload images

Select Add images and upload at least 15 images containing the object you wish to detect

If you want to try the running bibs example, you can use the images in the zip files in my Analyze Image GitHub \ Object Detection folder. These images were generously provided by some of my running friends.

Once the images are uploaded, you need to tag each image with the object you are training your model to detect. I am going to train my model to identify runners and bibs.

Select the first image you uploaded and it opens in the tagging page

Selecting an image to tag

Image detail view for tagging

Once you have your image open, click on the surface and draw a rectangle around the object you want to detect and then provide a tag to identify that object.

I have drawn a box around the runner in the middle of the photo and their bib and tagged them as runner and bib.

I repeat the same process for the other runners in the picture and for their bibs if they are visible

Tagging a second runner and bib in the photo

When you are done you can close the window and open the next photo for tagging, or you can just select the arrow key to the far right of the Image detail to scroll to the next image to tag.

The finished tagged image

Complete this process for each of your training images.

Now select Train from the top menu bar to train your model.

The Train button in the top menu bar

Select a Training Type

Fast training is great for proof of concept situations like this one. Advanced training allows you to increase the training time (which will consume more cloud credits) and will notify you when training is completed.

Selecting a training type

 

Once training is complete you will see your iteration statistics and you can test it using Quick Test

The Quick Test option

Select Browse local files or provide a URL to an image containing the object you want to detect. The trained model will attempt to identify the regions containing the objects and provide a probability/confidence level for each.

Results of the quick test showing detected regions

My model has successfully identified multiple runners in the photo and the location of the race bib.

The trained model can now be called from code to retrieve the predictions as JSON.  I have an example of how to call a classification model created with this tool in my GitHub.
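If you are curious what that call looks like, here is a minimal sketch using the requests library. The URL, project ID, iteration name, and key are placeholders; copy the real Prediction URL and Prediction-Key from the Custom Vision portal rather than trusting the exact format shown here.

import requests

# Placeholder values - use the Prediction URL and key shown in the Custom Vision portal
prediction_url = 'https://<region>.api.cognitive.microsoft.com/customvision/v3.0/Prediction/<project-id>/detect/iterations/<iteration-name>/image'
headers = {'Prediction-Key': '<your-prediction-key>',
           'Content-Type': 'application/octet-stream'}

with open('race_photo.jpg', 'rb') as image_file:
    response = requests.post(prediction_url, headers=headers, data=image_file)

# Each prediction includes a tag name, a probability, and a bounding box
for prediction in response.json()['predictions']:
    print(prediction['tagName'], prediction['probability'], prediction['boundingBox'])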

 

 

 

Azure Machine Learning Service Visual Interface vs Azure Machine Learning Studio

In this post I’ll compare the Azure Machine Service Visual Interface to Azure Machine Learning Studio.

If you are a fan of Azure Machine Learning Studio this should help you get started with Azure Machine Learning Service Visual Interface and understand some of the differences.

I quite like Azure Machine Learning Studio. It’s a drag and drop tool for training machine learning models and deploying them to a web service hosted on Azure. It has limitations, but it allows you to train, tweak, and retrain machine learning models without writing any code. The tool was deprecated a couple of years ago, which in software terms usually means “we aren’t doing any more work on this tool and don’t recommend continuing to use it.” I think Machine Learning Workbench was supposed to replace it, but that product did not gain much traction in the market and is also deprecated.

So, I was very interested to see the Visual Interface Preview for Azure Machine Learning Service, which looks very familiar to Azure Machine Learning Studio users.

In this post I’ll talk about the difference in

  • Home Screen
  • Launching the tool
  • Datasets
  • Data prep and column selection
  • Running your experiment
  • Deploying your model

Home Screen

At first glance the only notable difference is the appearance of the +New button, and the lack of Notebooks through the interface. You can create Azure notebooks separately, but the preview does not have integrated notebooks.

Home page Azure Machine Learning Service Visual Interface preview

Azure Machine Learning Service Visual Interface (preview)

Home Page Azure Machine Learning Studio

Azure Machine Learning Studio

Because Azure Machine Learning Service Visual Interface is such a long winded product name, I will refer to it as the ML Visual Interface from this point forward in my post. I will refer to Azure Machine Learning Studio as ML Studio.

Launching the tool

The first difference ML Studio users will notice is that it’s a little more work to reach the home page pictured above in ML Visual Interface. With ML Studio you could just go to the ML Studio home page and log in; the Azure back end was abstracted from the user.

To get started with ML Visual Interface you need an Azure subscription and you need to use the Azure portal. From the portal, you create a Machine Learning service workspace. If you want to try it yourself, you can find instructions in the Quickstart: Prepare and visualize data without writing code in Azure Machine Learning.

Creating Azure Machine Learning Service Workspace

Create an Azure Machine Learning service workspace

Once you have created the workspace you launch the visual interface from the menu blade

Launch Visual Interface from menu

Launch Visual Interface

Datasets

Both tools come with a variety of pre-loaded datasets. Both tools allow you to upload your own datasets. I was mildly disappointed that my favorite flight dataset is not pre-loaded in ML Visual interface. I found training a model to predict which flights were late or on time was a great example, since you don’t need to be a data scientist to understand what data might affect whether or not a flight is likely to be late.

I have uploaded the titanic dataset to try and predict which passengers would survive.

As soon as I dragged the dataset to the work area I discovered the first ‘quirk’ of the preview. When I try to visualize my dataset, that option is grayed out. I am going to assume this is just one of the joys of working with a product in preview and will be fixed in a later release. But it is quite irritating, because of course I always want to look at my dataset and make sure it uploaded correctly.

Dataset added to the work area, but the Visualize menu item is grayed out

Adding data set to work area

Data prep and column selection

We still have all the great modules like Select Columns in Dataset, Clean Missing Data, and Edit Metadata.

Unfortunately (and I will also chalk this up to a quirk of the preview that will be fixed in a later release), whatever module you connect first (e.g. Select Columns in Dataset) will not know the names of the columns in your dataset. So you will want to make sure YOU know the column names, because you have to type them in and the interface will not warn you if you enter an invalid column name.

Entering column names manually for Select Columns in Dataset does not catch invalid column names

You can specify invalid column names

Running your experiment

In ML Studio, you were unable to scale up the compute power used to run your experiment even though it was running on cloud resources. This meant it sometimes took a LONG time to run.

One of the great things about the ML Visual Interface is you control the compute power used to run your experiment! The only drawback is you have to create the compute instance before you can run your experiment. Well worth it for the benefits of being able to use more compute power when needed!

The first time you run the experiment you will need to create new compute. I am just using the preconfigured defaults for now, but I could go to the Azure portal and create a new compute target (you will need to learn to do that at some point anyway, because when you are ready to deploy it will require you to create a compute manually)

Creating compute can take 5-15 minutes, so go ahead and grab a coffee while you wait.

Create compute to run your experiment in Azure Machine Learning Service Visual Interface

Create compute to run your experiment

The good news is that after the compute is created and the experiment runs, you can use Visualize to see your modified datasets after each task (though you still can’t visualize the original dataset). Also, any tasks you add to your experiment will now validate column names.

Data visualization after running experiment

Visualization of Data

Training and Scoring your model

Just like ML Studio, you just drag and drop the tasks you need to prepare your data, split, train, score, and evaluate your model. Set the properties in the property window. The user experience for adding modules and setting properties is the same in ML Visual interface as ML Studio.

Finished experiment in Azure Machine Learning Service Visual Interface

Finished Experiment

When you re-run your experiment, you do not need to Create new compute every time. You can select existing and re-use the compute you created for the first run of your experiment. The first time I used the tool I had problems with my previous compute not showing up in the list of existing compute targets (even when I selected Refresh) but when I closed the browser window and re-opened it, it would show up.  Still faster than creating new Compute every time I want to run the experiment.  Today, I did not need to refresh and my compute target was listed as you would expect, so maybe they fixed it (if so well done that was quick!), or maybe it depends on your browser or maybe it’s an intermittent quirk of the preview.

re-use existing compute when rerunning experiment

re-use existing compute

Deploying your model

To deploy, you must first create and run a predictive experiment, just like ML Studio.

One of the other BIG differences in ML Visual Interface is the ability to deploy the model somewhere other than an Azure web service.  You have control. If you know your way around managing and deploying models, the world is your oyster. I have not yet had a chance to try it myself (if you try it before I do, please share your experience). Check out Deploy models with the Azure Machine Learning Service for more instructions and details.

If you just want to deploy it to a web service the way we did in ML Studio, that still works. Possible kudos to the product team for quick work, because when I tried this last week it required me to create a compute target with 12 nodes to deploy. Today I was able to re-use the existing compute I had used to run my experiment which had only 6 nodes.

Requiring 12 nodes was a hassle because my Pay as You Go Azure subscription had a quota limit of 10 nodes. I had to submit a request to increase my CPU quota. To the credit of the Azure support team (no pun intended) The request was processed almost immediately.  I was able to confirm my new quota with the Azure CLI command.

az vm list-usage --location "EastUS"

Quota limit

Quota limit

Just for extra fun, the VM family I requested a quota limit for was not one of the valid VMs supported by the tool, so I had to look up which VM family supported the VMs listed in the error message to request the correct quota increase.

I did not have to do ANY of that stuff with quotas today; I just clicked Deploy web service and re-used the existing compute I created to run my experiment. I am leaving those links and information in the blog post, just in case anyone else runs into it and also so I can remember how to do it if I ever run into that issue again with another service. Blogs make a handy place to keep notes for yourself.

Testing your trained model

If you deploy your trained model as a web service, you can test it and find sample code to copy and paste just like you did with ML Studio.

Select Web Services | Test

Type in your values and see the result.

Testing the trained model in Azure Machine Learning Service Visual Interface

Testing the trained model

This is not a secure way to deploy your model, but it’s great for proof of concept, and testing.

You will see a warning with links to instructions on secure deployment when you open the web service.

Instructions are provided for secure deployment on web service page

Secure deployment

Summary

Fans of Azure Machine Learning Studio are likely to become bigger fans of Azure Machine Learning Service Visual Interface.  Two of the biggest complaints about ML Studio were the inability to scale compute and the inability to deploy models outside of Azure web services. Both of these concerns are addressed with Azure Machine Learning Service Visual Interface.