Model Development#
Supervised algorithms use inputs (independent variables) and labeled outputs (dependent variable, i.e., the “answers”) to create a model that can measure its performance and learn over time. Splitting the data into independent and dependent variables, we have the following:
#Note: we only repeat this step from before because this is a new .ipynb page;
# it only needs to be executed once per file.
#We'll import libraries as needed, but when submitting, having them all at the top is best practice.
import pandas as pd
# Reloading the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']
df = pd.read_csv(url, names = column_names) #read CSV into Python as a dataframe
X = df.drop(columns=['type']) #independent variables
y = df[['type']].copy() #dependent variable
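If you want to verify the split, a quick optional check (a minimal sketch using only the X and y defined above) is to print their shapes: the iris dataset has 150 rows, so X should hold the four measurement columns and y the single type column.
#Optional sanity check on the independent/dependent split
print(X.shape) #expected (150, 4): four measurement columns
print(y.shape) #expected (150, 1): just the 'type' column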
Note
Task 2 part D (Data Product) will focus on the what, how, and why of your model’s development.
Training the Model#
Studying for a test when you have all the answers beforehand will likely yield a good grade. But how well would that grade measure understanding of material outside those answers? Similarly, supervised methods tend to perform well when tested on their training data, but you want your model to perform well on unseen data. So while it’s not required, separating the data used to train the model from the data used to test it (validation) is good practice. Furthermore, it provides content for part D of the documentation.
Fortunately, most libraries have built-in functions for this; here we’ll stick with scikit-learn (aka sklearn) validation utilities. We’ll randomly split the independent (input) and dependent (output, i.e., the answers) variables into training and testing subsets. For now, we’ll keep everything as DataFrames and convert to NumPy arrays later as needed.
import numpy as np
from sklearn.model_selection import train_test_split
#split the variable sets into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333, random_state=41)
#Nice displays are nice but not required.
from IPython.display import display_html
X_train_styler = X_train.head(5).style.set_table_attributes("style='display:inline'").set_caption('Independent variables')
y_train_styler = y_train.head(5).style.set_table_attributes("style='display:inline'").set_caption('Dependent variable')
space = "\xa0" * 10 #space between columns
display_html(X_train_styler._repr_html_()+ space + y_train_styler._repr_html_(), raw=True)
|     | sepal-length | sepal-width | petal-length | petal-width |
| --- | --- | --- | --- | --- |
| 111 | 6.4 | 2.7 | 5.3 | 1.9 |
| 82  | 5.8 | 2.7 | 3.9 | 1.2 |
| 130 | 7.4 | 2.8 | 6.1 | 1.9 |
| 27  | 5.2 | 3.5 | 1.5 | 0.2 |
| 33  | 5.5 | 4.2 | 1.4 | 0.2 |
|     | type |
| --- | --- |
| 111 | Iris-virginica |
| 82  | Iris-versicolor |
| 130 | Iris-virginica |
| 27  | Iris-setosa |
| 33  | Iris-setosa |
Read the docs! By default, train_test_split “randomly” splits the sets. Setting the seed (or state) with random_state makes the split reproducible, so experiments can be repeated and compared. See should you use a random seed?
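If you want to see that reproducibility for yourself, a minimal sketch (reusing the X, y, and settings from above) is to split twice with the same random_state and confirm the selected rows match:
#Two splits with the same random_state pick identical rows
X_tr1, _, _, _ = train_test_split(X, y, test_size=0.333, random_state=41)
X_tr2, _, _, _ = train_test_split(X, y, test_size=0.333, random_state=41)
print(X_tr1.index.equals(X_tr2.index)) #True: the "random" split is reproducible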
We can now train a model using the independent (usually denoted X) and dependent (usually denoted y) variables from the training data. Sklearn has an extensive supervised learning library. Note that many of these models (including SVM) have both classification and regression extensions.
from sklearn import svm
svm_model = svm.SVC(gamma='scale', C=1) #Creates an SVM model object. Note: 'scale' and 1.0 are gamma's and C's respective defaults
svm_model.fit(X_train,y_train)
C:\Users\ashej\.virtualenvs\jupyter-books-WZpnkDri\Lib\site-packages\sklearn\utils\validation.py:1141: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
SVC(C=1)
What’s with the warning? DataConversionWarning: A column-vector y was passed when a 1d array was expected.
Looking at the sklearn.svm.SVC docs for the fit function, a 1d array was expected for y, but we gave it a DataFrame. This is a warning, not an error, and the model appears to work. However, it’s best practice to clean up warnings when possible, and in this case, it’s an easy fix.
y_train_array, y_test_array = y_train['type'].values, y_test['type'].values
svm_model.fit(X_train,y_train_array)
#Note: you could also flatten the DataFrame directly with 'svm_model.fit(X_train, y_train.values.ravel())'
SVC(C=1)
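To make the cause of that warning concrete, you can compare the shape of the DataFrame we originally passed with the 1-d array used in the fix (an optional illustration, not required for the task):
print(y_train.shape)       #e.g. (100, 1): a column vector, which triggers the warning
print(y_train_array.shape) #e.g. (100,): a 1-d array, which is what fit expects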
Applying the Model#
Now we’ve trained the model (without warnings)! What does that mean? Sklearn’s SVM algorithm has fit a decision function that maps a flower’s measurements (the independent variables) to a predicted type (the dependent variable).
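If you’re curious what the fitted object actually stores, you can peek at a few of its attributes. A minimal sketch, assuming the svm_model trained above:
print(svm_model.classes_)   #the class labels the model learned to separate
print(svm_model.n_support_) #number of support vectors chosen for each class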
For example, the Iris at index 82 has the values:
X_train.loc[[82]]
|     | sepal-length | sepal-width | petal-length | petal-width |
| --- | --- | --- | --- | --- |
| 82  | 5.8 | 2.7 | 3.9 | 1.2 |
And svm_model.predict(X_train.loc[[82]]) inputs the flower’s dimensions into the prediction function:
print(svm_model.predict(X_train.loc[[82]]))
#Alternatively: svm_model.predict([[5.8, 2.7, 3.9, 1.2]])
['Iris-versicolor']
Which in this example turns out to be correct:
df.loc[[82]]
|     | sepal-length | sepal-width | petal-length | petal-width | type |
| --- | --- | --- | --- | --- | --- |
| 82  | 5.8 | 2.7 | 3.9 | 1.2 | Iris-versicolor |
Applying the prediction function to the entire dataset, we get a prediction for each flower:
predictions = svm_model.predict(X)
print(predictions)
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-versicolor' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica']
But how good are these predictions? Answering that question is our next step.