{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(sup_class_ex:develop)=\n",
"# Model Development\n",
"\n",
"Supervised algorithms use inputs (independent variables) and labeled outputs (dependent variable -the \"answers\") to create a model that can measure its performance and learn over time. Splitting the data into independent and dependent variables, we have the following:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#Note: we only repeat this step from before, because this is a new .ipyb page.\n",
"# it only needs to be executed once per file. \n",
" \n",
"#We'll import libraries as needed, but when submitting, having them all at the top is best practice\n",
"import pandas as pd\n",
"\n",
"# Reloading the dataset\n",
"url = \"https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv\"\n",
"column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']\n",
"df = pd.read_csv(url, names = column_names) #read CSV into Python as a dataframe\n",
"\n",
"X = df.drop(columns=['type']) #indpendent variables\n",
"y = df[['type']].copy() #dependent variables"
]
},
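{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a minimal sketch, not required): the Iris dataset has 150 rows, so `X` should have four feature columns and `y` a single label column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sanity check (sketch): 150 rows; 4 feature columns in X, 1 label column in y\n",
"print(X.shape, y.shape) #expect (150, 4) and (150, 1)"
]
},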
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```{note}\n",
"The focus of [Task 2 part D *Data Product*](task2d:dataproduct) will be the what, how, and why of your model's development.\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(sup_class_ex:develop:train)=\n",
"## Training the Model\n",
"\n",
"```{margin}\n",
"A model can learn the details and noise particular to the training data so well that it doesn't perform well on new data. This is called [*overfitting*](https://en.wikipedia.org/wiki/Overfitting). Overcomplicated non-linear and nonparametric models are more susceptible to this. The term *overtraining* can be used synonymously or to mean too much training causing overfitting. \n",
"```\n",
"\n",
"Studying for a test when you have all the answers beforehand will likely yield a good grade. But how well would that grade measure understanding of material outside those answers? Similarly, supervised methods tend to perform well when tested on their training data, but you want your model to perform well on *unseen* data. So while it's not required, separating data used to train and test the model (validation) is good practice. Furthermore, it provides content for part D of the documentation. \n",
"\n",
"Fortunately, most libraries have built-in functions for this. Here we'll stick with [scikit-learn aka sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) validation processes We'll need to randomly split the data into independent (input values) and dependent (output, i.e., the answers) variables. For now, we'll keep things as DataFrames, but later convert them to 2-d arrays"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"#split the variable sets into training and testing subsets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333, random_state=41)"
]
},
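{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (a minimal sketch, assuming the split above), we can confirm the subset sizes and that a fixed `random_state` reproduces the same \"random\" split:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch: with test_size=0.333, 50 of the 150 rows are held out for testing\n",
"print(len(X_train), len(X_test)) #expect 100 and 50\n",
"\n",
"#Sketch: repeating the split with the same random_state selects identical rows\n",
"X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.333, random_state=41)\n",
"print(X_train.index.equals(X_train2.index)) #True"
]
},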
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" Independents variables\n",
" \n",
" \n",
" | \n",
" sepal-length | \n",
" sepal-width | \n",
" petal-length | \n",
" petal-width | \n",
"
\n",
" \n",
" \n",
" \n",
" 111 | \n",
" 6.400000 | \n",
" 2.700000 | \n",
" 5.300000 | \n",
" 1.900000 | \n",
"
\n",
" \n",
" 82 | \n",
" 5.800000 | \n",
" 2.700000 | \n",
" 3.900000 | \n",
" 1.200000 | \n",
"
\n",
" \n",
" 130 | \n",
" 7.400000 | \n",
" 2.800000 | \n",
" 6.100000 | \n",
" 1.900000 | \n",
"
\n",
" \n",
" 27 | \n",
" 5.200000 | \n",
" 3.500000 | \n",
" 1.500000 | \n",
" 0.200000 | \n",
"
\n",
" \n",
" 33 | \n",
" 5.500000 | \n",
" 4.200000 | \n",
" 1.400000 | \n",
" 0.200000 | \n",
"
\n",
" \n",
"
\n",
" \n",
"\n",
" Dependents variables\n",
" \n",
" \n",
" | \n",
" type | \n",
"
\n",
" \n",
" \n",
" \n",
" 111 | \n",
" Iris-virginica | \n",
"
\n",
" \n",
" 82 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 130 | \n",
" Iris-virginica | \n",
"
\n",
" \n",
" 27 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 33 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#Nice displays are nice but not required. \n",
"from IPython.display import display_html \n",
"X_train_styler = X_train.head(5).style.set_table_attributes(\"style='display:inline'\").set_caption('Independents variables')\n",
"y_train_styler = y_train.head(5).style.set_table_attributes(\"style='display:inline'\").set_caption('Dependents variables')\n",
"space = \"\\xa0\" * 10 #space between columns\n",
"display_html(X_train_styler._repr_html_()+ space + y_train_styler._repr_html_(), raw=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the [docs](sklearn_link)! By default `train_test_split`, [\"randomly\"](https://engineering.mit.edu/engage/ask-an-engineer/can-a-computer-generate-a-truly-random-number/) splits the sets. Setting the seed (or state) with `random_state` controls the experiments. See [should you use a random seed?](https://datascience.stackexchange.com/questions/78109/should-you-use-random-state-or-random-seed-in-machine-learning-models).\n",
"\n",
"We can now train a model using the independent (usually denoted `X`) and dependent variables (usually denoted `y`) from the training data. Sklearn has a deep [supervised learning library](https://scikit-learn.org/stable/supervised_learning.html). Note that many of these models (including SVM) have both classification and regression extensions. "
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\ashej\\.virtualenvs\\jupyter-books-WZpnkDri\\Lib\\site-packages\\sklearn\\utils\\validation.py:1141: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
" y = column_or_1d(y, warn=True)\n"
]
},
{
"data": {
"text/html": [
"SVC(C=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
],
"text/plain": [
"SVC(C=1)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn import svm\n",
"\n",
"svm_model = svm.SVC(gamma='scale', C=1) #Creates a svm model object. Mote, 'scale' and 1.0 are gamma and C's respective defaults \n",
"svm_model.fit(X_train,y_train)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"What's with the warning? `DataConversionWarning: A column-vector y was passed when a 1d array was expected.`\n",
"\n",
"Looking at the `sklearn.svm.SVC` [docs for the `fit` function](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.fit), a 1d array was expected for the `y`, but we gave it a DataFrame. This is a warning -not an error, and the model appears to work. However, it's best practice to clean warnings up when possible, and in this case, it's an easy fix.\n",
"\n",
"```{margin}\n",
"[What's that `gamma` and `C` for?](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html), read the docs!\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"SVC(C=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
],
"text/plain": [
"SVC(C=1)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train_array, y_test_array = y_train['type'].values, y_test['type'].values\n",
"svm_model.fit(X_train,y_train_array) \n",
"#Note: you can also use 'svm_model.fit(X_train,y_train_array.ravel())' "
]
},
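{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As the margin note above hints, `gamma` and `C` are hyperparameters you can experiment with. A minimal sketch (the value of `C` here is arbitrary, purely for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch: hyperparameters are set when constructing the model object.\n",
"#A smaller C softens the margin (more regularization); 0.1 is an arbitrary illustrative value\n",
"svm_alt = svm.SVC(gamma='scale', C=0.1)\n",
"svm_alt.fit(X_train, y_train_array)"
]
},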
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Applying the Model\n",
"\n",
"Now we've trained the model (without warnings)! What does that mean? Sklearn's SVM algorithm creates an equation representing the relationship between the variables. \n",
"\n",
"$$F_{\\text{predict}}(X)=\\text{prediction(s)}$$\n",
"\n",
"For example, the Iris at index `82` has the values:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal-length | \n",
" sepal-width | \n",
" petal-length | \n",
" petal-width | \n",
"
\n",
" \n",
" \n",
" \n",
" 82 | \n",
" 5.8 | \n",
" 2.7 | \n",
" 3.9 | \n",
" 1.2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal-length sepal-width petal-length petal-width\n",
"82 5.8 2.7 3.9 1.2"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.loc[[82]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And `svm_model.predict(X_train.loc[[82]])` inputs the flower dimensions into the prediction function:\n",
"\n",
"$$F_{\\text{predict}}(5.8, 2.7, 3.9, 1.2)=\\text{Iris-versicolor}$$"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Iris-versicolor']\n"
]
}
],
"source": [
"print(svm_model.predict(X_train.loc[[82]]))\n",
"#Alternatively: svm_model.predict([[5.8, 2.7, 3.9, 1.2]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which in this example turns out to be correct:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal-length | \n",
" sepal-width | \n",
" petal-length | \n",
" petal-width | \n",
" type | \n",
"
\n",
" \n",
" \n",
" \n",
" 82 | \n",
" 5.8 | \n",
" 2.7 | \n",
" 3.9 | \n",
" 1.2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal-length sepal-width petal-length petal-width type\n",
"82 5.8 2.7 3.9 1.2 Iris-versicolor"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[[82]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Applying the prediction function to the entire dataset, we get a prediction for each flower:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"output_scroll"
]
},
"outputs": [
{
"data": {
"text/plain": [
"array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', ..., 'Iris-virginica',\n",
" 'Iris-virginica', 'Iris-virginica'], dtype=object)"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions = svm_model.predict(X)\n",
"print(predictions)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"But how good are these predictions? Answering that question is our next step. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "jupyter-books-WZpnkDri",
"language": "python",
"name": "jupyter-books-wzpnkdri"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.1"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "3ff4b9f9a77e43d422b45ad0e34f66a3a995e732d437005df0ccbc0093bddc0e"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}