Model Development

Supervised algorithms use inputs (independent variables) and labeled outputs (the dependent variable, i.e., the “answers”) to fit a model whose predictions can be checked against those known answers. Splitting the data into independent and dependent variables, we have the following:

#Note: we only repeat this step from before because this is a new .ipynb page.
#   It only needs to be executed once per file.
  
#We'll import libraries as needed, but when submitting, having them all at the top is best practice
import pandas as pd

# Reloading the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']
df = pd.read_csv(url, names = column_names) #read CSV into Python as a dataframe

X = df.drop(columns=['type']) #independent variables
y = df[['type']].copy() #dependent variable

Note

The focus of Task 2 part D Data Product will be the what, how, and why of your model’s development.

Training the Model#

Studying for a test when you have all the answers beforehand will likely yield a good grade. But how well would that grade measure understanding of material outside those answers? Similarly, supervised methods tend to perform well when tested on their training data, but you want your model to perform well on unseen data. So while it’s not required, separating data used to train and test the model (validation) is good practice. Furthermore, it provides content for part D of the documentation.

Fortunately, most libraries have built-in functions for this. Here we’ll stick with the validation tools in scikit-learn (aka sklearn). We’ll randomly split both the independent variables (input values) and the dependent variable (output, i.e., the answers) into training and testing subsets. For now, we’ll keep things as DataFrames, but later convert them to arrays where needed.

import numpy as np
from sklearn.model_selection import train_test_split

#split the variable sets into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333, random_state=41)

Independent variables
	sepal-length	sepal-width	petal-length	petal-width
111	6.400000	2.700000	5.300000	1.900000
82	5.800000	2.700000	3.900000	1.200000
130	7.400000	2.800000	6.100000	1.900000
27	5.200000	3.500000	1.500000	0.200000
33	5.500000	4.200000	1.400000	0.200000

Dependent variable
	type
111	Iris-virginica
82	Iris-versicolor
130	Iris-virginica
27	Iris-setosa
33	Iris-setosa

Read the docs! By default, train_test_split shuffles the data before splitting, so each run can produce a different split. Setting the seed (or state) with random_state makes the split, and therefore the experiment, reproducible. See “Should you use a random seed?”.
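To make the effect of random_state concrete, the sketch below splits the same data twice with the same seed and once with a different one. It is an illustration rather than part of the original notebook: it assumes scikit-learn's bundled copy of the Iris data (load_iris) in place of the CSV so that it runs offline, and the names X_demo, y_demo, split_a, split_b, and split_c as well as the seed 7 are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: sklearn's bundled Iris data stands in for the CSV loaded earlier.
X_demo, y_demo = load_iris(return_X_y=True)

# Same seed twice: the two splits should contain exactly the same rows.
split_a = train_test_split(X_demo, y_demo, test_size=0.333, random_state=41)
split_b = train_test_split(X_demo, y_demo, test_size=0.333, random_state=41)
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# A different (arbitrary) seed shuffles the rows differently.
split_c = train_test_split(X_demo, y_demo, test_size=0.333, random_state=7)
different_seed_matches = np.array_equal(split_a[0], split_c[0])

# Each result is ordered as [X_train, X_test, y_train, y_test].
print("Training rows:", split_a[0].shape[0], "| testing rows:", split_a[1].shape[0])
print("Same seed, same split:", same_seed_matches)
print("Different seed, same split:", different_seed_matches)
```

Fixing the seed is what lets a reader rerun the notebook and get the same training and testing rows; leaving random_state unset is still valid, but results will vary from run to run.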

We can now train a model using the independent (usually denoted X) and dependent variables (usually denoted y) from the training data. Sklearn has an extensive supervised learning library. Note that many of these models (including SVM) have both classification and regression variants.

from sklearn import svm

svm_model = svm.SVC(gamma='scale', C=1) #Creates an SVM model object. Note: 'scale' and 1.0 are the respective defaults for gamma and C
svm_model.fit(X_train,y_train)

C:\Users\ashej\.virtualenvs\jupyter-books-WZpnkDri\Lib\site-packages\sklearn\utils\validation.py:1141: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

SVC(C=1)


What’s with the warning? DataConversionWarning: A column-vector y was passed when a 1d array was expected.

Looking at the sklearn.svm.SVC docs for the fit function, a 1d array was expected for y, but we gave it a single-column DataFrame. This is a warning, not an error, and the model still trains. However, it’s best practice to clean up warnings when possible, and in this case, it’s an easy fix.

y_train_array, y_test_array = y_train['type'].values, y_test['type'].values
svm_model.fit(X_train,y_train_array) 
#Note: you can also use 'svm_model.fit(X_train, y_train.values.ravel())'

SVC(C=1)
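The shapes involved make the cause of the warning easier to see. The sketch below uses a small hand-made DataFrame (hypothetical values and names, not the Iris training set) to compare the column-vector form that triggered the warning with the two 1d forms that fit expects.

```python
import pandas as pd

# Hypothetical three-row label frame standing in for y_train.
y_frame = pd.DataFrame({'type': ['Iris-setosa', 'Iris-virginica', 'Iris-setosa']})

column_vector = y_frame.values           # 2-d: one column, shape (3, 1)
from_series = y_frame['type'].values     # 1-d: select the column first
from_ravel = y_frame.values.ravel()      # 1-d: flatten the column vector

print(column_vector.shape)  # (3, 1)
print(from_series.shape)    # (3,)
print(from_ravel.shape)     # (3,)
```

Selecting the column by name and flattening with ravel() give the same 1d array here, so either is a reasonable way to silence the warning.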


Applying the Model#

Now we’ve trained the model (without warnings)! What does that mean? Fitting sklearn’s SVM algorithm produces a prediction function that maps the independent variables to a predicted class:

\[F_{\text{predict}}(X)=\text{prediction(s)}\]

For example, the Iris at index 82 has the values:

	sepal-length	sepal-width	petal-length	petal-width
82	5.8	2.7	3.9	1.2

And svm_model.predict(X_train.loc[[82]]) inputs the flower dimensions into the prediction function:

\[F_{\text{predict}}(5.8, 2.7, 3.9, 1.2)=\text{Iris-versicolor}\]

print(svm_model.predict(X_train.loc[[82]]))
#Alternatively: svm_model.predict([[5.8, 2.7, 3.9, 1.2]])

['Iris-versicolor']

Which in this example turns out to be correct:

	sepal-length	sepal-width	petal-length	petal-width	type
82	5.8	2.7	3.9	1.2	Iris-versicolor
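The alternative call in the code comment above passes a bare list. In recent scikit-learn versions, a model fitted on a DataFrame may emit a feature-name warning when it is given input without column names. The sketch below shows one way to wrap new measurements in a one-row DataFrame so the names match. It assumes scikit-learn's bundled Iris data (load_iris) renamed to this page's column names, so demo_model is a stand-in rather than the exact svm_model above, its labels read 'versicolor' rather than 'Iris-versicolor', and names such as new_flower are hypothetical.

```python
import pandas as pd
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']

# Assumption: sklearn's bundled Iris data stands in for the CSV used on this page.
iris = load_iris()
X_demo = pd.DataFrame(iris.data, columns=column_names)
y_demo = iris.target_names[iris.target]  # 1-d array of species names

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.333, random_state=41
)

demo_model = svm.SVC(gamma='scale', C=1)
demo_model.fit(X_train_demo, y_train_demo)

# Wrap the new measurements in a one-row DataFrame with the training column names.
new_flower = pd.DataFrame([[5.8, 2.7, 3.9, 1.2]], columns=column_names)
prediction = demo_model.predict(new_flower)

print(prediction)  # one label per input row
```

Keeping the column names on new inputs also guards against accidentally passing the measurements in the wrong order.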

Applying the prediction function to the entire dataset, we get a prediction for each flower:

predictions = svm_model.predict(X)
print(predictions)

['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica']

But how good are these predictions? Answering that question is our next step.
