{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(sup_reg_ex: develop)=\n",
"# Regression Model Development (part 1)\n",
"\n",
"Supervised algorithms use inputs (independent variables) and labeled outputs (dependent variable -the \"answers\") to create a model that can measure its performance and learn over time. Splitting the data into independent and dependent variables, we have the following (again, this will be very similar to the [previous example](sup_class_ex:develop)):"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"#Note: we only repeat this step from before because this is a new .ipynb page;\n",
"# it only needs to be executed once per file.\n",
"\n",
"#We'll import libraries as needed, but when submitting, having them all at the top is best practice\n",
"import pandas as pd\n",
"\n",
"# Reloading the dataset (the raw CSV has no header row, so we supply the column names)\n",
"url = \"https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv\"\n",
"column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']\n",
"df = pd.read_csv(url, names=column_names) #read CSV into Python as a dataframe\n",
"\n",
"#Choosing the variables\n",
"X = df.drop(columns=['sepal-length']) #independent variables\n",
"y = df[['sepal-length']].copy() #dependent variable"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(sup_reg_ex: develop: train)=\n",
"## Train Model\n",
"\n",
"Recall, splitting the data into training and testing sets is not required, but it is good practice. Furthermore, it provides content for part D. As with the [previous example](sup_class_ex:develop), we'll use [scikit-learn aka sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) built-ins for this."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"#split the variable sets into training and testing subsets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333, random_state=41)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" Independents variables\n",
" \n",
" \n",
" | \n",
" sepal-width | \n",
" petal-length | \n",
" petal-width | \n",
" type | \n",
"
\n",
" \n",
" \n",
" \n",
" 111 | \n",
" 2.700000 | \n",
" 5.300000 | \n",
" 1.900000 | \n",
" Iris-virginica | \n",
"
\n",
" \n",
" 82 | \n",
" 2.700000 | \n",
" 3.900000 | \n",
" 1.200000 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 130 | \n",
" 2.800000 | \n",
" 6.100000 | \n",
" 1.900000 | \n",
" Iris-virginica | \n",
"
\n",
" \n",
" 27 | \n",
" 3.500000 | \n",
" 1.500000 | \n",
" 0.200000 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 33 | \n",
" 4.200000 | \n",
" 1.400000 | \n",
" 0.200000 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
"
\n",
" \n",
"\n",
" Dependents variables\n",
" \n",
" \n",
" | \n",
" sepal-length | \n",
"
\n",
" \n",
" \n",
" \n",
" 111 | \n",
" 6.400000 | \n",
"
\n",
" \n",
" 82 | \n",
" 5.800000 | \n",
"
\n",
" \n",
" 130 | \n",
" 7.400000 | \n",
"
\n",
" \n",
" 27 | \n",
" 5.200000 | \n",
"
\n",
" \n",
" 33 | \n",
" 5.500000 | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#Styled, side-by-side displays are nice but not required.\n",
"from IPython.display import display_html\n",
"X_train_styler = X_train.head(5).style.set_table_attributes(\"style='display:inline'\").set_caption('Independent variables')\n",
"y_train_styler = y_train.head(5).style.set_table_attributes(\"style='display:inline'\").set_caption('Dependent variable')\n",
"space = \"\\xa0\" * 10 #space between the two tables\n",
"display_html(X_train_styler._repr_html_() + space + y_train_styler._repr_html_(), raw=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Review sklearn's nice [supervised learning library](https://scikit-learn.org/stable/supervised_learning.html). Note that many of these models have both classification and regression extensions."
]
},
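{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, a quick sketch (these paired imports are standard sklearn, though we won't use them on this page):\n",
"\n",
"```python\n",
"# Several sklearn model families come in classifier/regressor pairs:\n",
"from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor\n",
"from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor\n",
"from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n",
"```\n",
"\n",
"We'll start with the regression workhorse:"
]
},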
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Our data is mostly quantitative and the scatterplots indicate some linear relations between variables. So linear regression isn't a bad place to start. Once we've trained and tested a linear regression model, we'll easily be able to experiment with different algorithms. \n",
"\n",
":::{margin} Is linear regression ML?\n",
"It depends on who you ask. Google \"Is linear regression machine learning?\" and you'll see some interesting (and entertaining) discussion. For the capstone, it applies an algorithm to data so the answer is -yes. \n",
":::"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": [
"scroll-output"
]
},
"outputs": [
{
"ename": "ValueError",
"evalue": "could not convert string to float: 'Iris-virginica'",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[1;32mIn[5], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m linear_reg_model \u001b[39m=\u001b[39m LinearRegression()\n\u001b[1;32m----> 2\u001b[0m linear_reg_model\u001b[39m.\u001b[39;49mfit(X_train,y_train)\n",
"File \u001b[1;32mc:\\Users\\ashej\\.virtualenvs\\jupyter-books-WZpnkDri\\Lib\\site-packages\\sklearn\\linear_model\\_base.py:649\u001b[0m, in \u001b[0;36mLinearRegression.fit\u001b[1;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[0;32m 645\u001b[0m n_jobs_ \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mn_jobs\n\u001b[0;32m 647\u001b[0m accept_sparse \u001b[39m=\u001b[39m \u001b[39mFalse\u001b[39;00m \u001b[39mif\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mpositive \u001b[39melse\u001b[39;00m [\u001b[39m\"\u001b[39m\u001b[39mcsr\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mcsc\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mcoo\u001b[39m\u001b[39m\"\u001b[39m]\n\u001b[1;32m--> 649\u001b[0m X, y \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_validate_data(\n\u001b[0;32m 650\u001b[0m X, y, accept_sparse\u001b[39m=\u001b[39;49maccept_sparse, y_numeric\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m, multi_output\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m\n\u001b[0;32m 651\u001b[0m )\n\u001b[0;32m 653\u001b[0m sample_weight \u001b[39m=\u001b[39m _check_sample_weight(\n\u001b[0;32m 654\u001b[0m sample_weight, X, dtype\u001b[39m=\u001b[39mX\u001b[39m.\u001b[39mdtype, only_non_negative\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m\n\u001b[0;32m 655\u001b[0m )\n\u001b[0;32m 657\u001b[0m X, y, X_offset, y_offset, X_scale \u001b[39m=\u001b[39m _preprocess_data(\n\u001b[0;32m 658\u001b[0m X,\n\u001b[0;32m 659\u001b[0m y,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 662\u001b[0m sample_weight\u001b[39m=\u001b[39msample_weight,\n\u001b[0;32m 663\u001b[0m )\n",
"File \u001b[1;32mc:\\Users\\ashej\\.virtualenvs\\jupyter-books-WZpnkDri\\Lib\\site-packages\\sklearn\\base.py:554\u001b[0m, in \u001b[0;36mBaseEstimator._validate_data\u001b[1;34m(self, X, y, reset, validate_separately, **check_params)\u001b[0m\n\u001b[0;32m 552\u001b[0m y \u001b[39m=\u001b[39m check_array(y, input_name\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39my\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mcheck_y_params)\n\u001b[0;32m 553\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m--> 554\u001b[0m X, y \u001b[39m=\u001b[39m check_X_y(X, y, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mcheck_params)\n\u001b[0;32m 555\u001b[0m out \u001b[39m=\u001b[39m X, y\n\u001b[0;32m 557\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m no_val_X \u001b[39mand\u001b[39;00m check_params\u001b[39m.\u001b[39mget(\u001b[39m\"\u001b[39m\u001b[39mensure_2d\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mTrue\u001b[39;00m):\n",
"File \u001b[1;32mc:\\Users\\ashej\\.virtualenvs\\jupyter-books-WZpnkDri\\Lib\\site-packages\\sklearn\\utils\\validation.py:1104\u001b[0m, in \u001b[0;36mcheck_X_y\u001b[1;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[0m\n\u001b[0;32m 1099\u001b[0m estimator_name \u001b[39m=\u001b[39m _check_estimator_name(estimator)\n\u001b[0;32m 1100\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m 1101\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m{\u001b[39;00mestimator_name\u001b[39m}\u001b[39;00m\u001b[39m requires y to be passed, but the target y is None\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m 1102\u001b[0m )\n\u001b[1;32m-> 1104\u001b[0m X \u001b[39m=\u001b[39m check_array(\n\u001b[0;32m 1105\u001b[0m X,\n\u001b[0;32m 1106\u001b[0m accept_sparse\u001b[39m=\u001b[39;49maccept_sparse,\n\u001b[0;32m 1107\u001b[0m accept_large_sparse\u001b[39m=\u001b[39;49maccept_large_sparse,\n\u001b[0;32m 1108\u001b[0m dtype\u001b[39m=\u001b[39;49mdtype,\n\u001b[0;32m 1109\u001b[0m order\u001b[39m=\u001b[39;49morder,\n\u001b[0;32m 1110\u001b[0m copy\u001b[39m=\u001b[39;49mcopy,\n\u001b[0;32m 1111\u001b[0m force_all_finite\u001b[39m=\u001b[39;49mforce_all_finite,\n\u001b[0;32m 1112\u001b[0m ensure_2d\u001b[39m=\u001b[39;49mensure_2d,\n\u001b[0;32m 1113\u001b[0m allow_nd\u001b[39m=\u001b[39;49mallow_nd,\n\u001b[0;32m 1114\u001b[0m ensure_min_samples\u001b[39m=\u001b[39;49mensure_min_samples,\n\u001b[0;32m 1115\u001b[0m ensure_min_features\u001b[39m=\u001b[39;49mensure_min_features,\n\u001b[0;32m 1116\u001b[0m estimator\u001b[39m=\u001b[39;49mestimator,\n\u001b[0;32m 1117\u001b[0m input_name\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mX\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[0;32m 1118\u001b[0m )\n\u001b[0;32m 1120\u001b[0m y \u001b[39m=\u001b[39m _check_y(y, multi_output\u001b[39m=\u001b[39mmulti_output, y_numeric\u001b[39m=\u001b[39my_numeric, estimator\u001b[39m=\u001b[39mestimator)\n\u001b[0;32m 1122\u001b[0m check_consistent_length(X, y)\n",
"File \u001b[1;32mc:\\Users\\ashej\\.virtualenvs\\jupyter-books-WZpnkDri\\Lib\\site-packages\\sklearn\\utils\\validation.py:877\u001b[0m, in \u001b[0;36mcheck_array\u001b[1;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)\u001b[0m\n\u001b[0;32m 875\u001b[0m array \u001b[39m=\u001b[39m xp\u001b[39m.\u001b[39mastype(array, dtype, copy\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m)\n\u001b[0;32m 876\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m--> 877\u001b[0m array \u001b[39m=\u001b[39m _asarray_with_order(array, order\u001b[39m=\u001b[39;49morder, dtype\u001b[39m=\u001b[39;49mdtype, xp\u001b[39m=\u001b[39;49mxp)\n\u001b[0;32m 878\u001b[0m \u001b[39mexcept\u001b[39;00m ComplexWarning \u001b[39mas\u001b[39;00m complex_warning:\n\u001b[0;32m 879\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m 880\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mComplex data not supported\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m{}\u001b[39;00m\u001b[39m\\n\u001b[39;00m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39mformat(array)\n\u001b[0;32m 881\u001b[0m ) \u001b[39mfrom\u001b[39;00m \u001b[39mcomplex_warning\u001b[39;00m\n",
"File \u001b[1;32mc:\\Users\\ashej\\.virtualenvs\\jupyter-books-WZpnkDri\\Lib\\site-packages\\sklearn\\utils\\_array_api.py:185\u001b[0m, in \u001b[0;36m_asarray_with_order\u001b[1;34m(array, dtype, order, copy, xp)\u001b[0m\n\u001b[0;32m 182\u001b[0m xp, _ \u001b[39m=\u001b[39m get_namespace(array)\n\u001b[0;32m 183\u001b[0m \u001b[39mif\u001b[39;00m xp\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m \u001b[39min\u001b[39;00m {\u001b[39m\"\u001b[39m\u001b[39mnumpy\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mnumpy.array_api\u001b[39m\u001b[39m\"\u001b[39m}:\n\u001b[0;32m 184\u001b[0m \u001b[39m# Use NumPy API to support order\u001b[39;00m\n\u001b[1;32m--> 185\u001b[0m array \u001b[39m=\u001b[39m numpy\u001b[39m.\u001b[39;49masarray(array, order\u001b[39m=\u001b[39;49morder, dtype\u001b[39m=\u001b[39;49mdtype)\n\u001b[0;32m 186\u001b[0m \u001b[39mreturn\u001b[39;00m xp\u001b[39m.\u001b[39masarray(array, copy\u001b[39m=\u001b[39mcopy)\n\u001b[0;32m 187\u001b[0m \u001b[39melse\u001b[39;00m:\n",
"File \u001b[1;32mc:\\Users\\ashej\\.virtualenvs\\jupyter-books-WZpnkDri\\Lib\\site-packages\\pandas\\core\\generic.py:2070\u001b[0m, in \u001b[0;36mNDFrame.__array__\u001b[1;34m(self, dtype)\u001b[0m\n\u001b[0;32m 2069\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m__array__\u001b[39m(\u001b[39mself\u001b[39m, dtype: npt\u001b[39m.\u001b[39mDTypeLike \u001b[39m|\u001b[39m \u001b[39mNone\u001b[39;00m \u001b[39m=\u001b[39m \u001b[39mNone\u001b[39;00m) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m np\u001b[39m.\u001b[39mndarray:\n\u001b[1;32m-> 2070\u001b[0m \u001b[39mreturn\u001b[39;00m np\u001b[39m.\u001b[39;49masarray(\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_values, dtype\u001b[39m=\u001b[39;49mdtype)\n",
"\u001b[1;31mValueError\u001b[0m: could not convert string to float: 'Iris-virginica'"
]
}
],
"source": [
"linear_reg_model = LinearRegression()\n",
"linear_reg_model.fit(X_train,y_train)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"*Error? Wait, what happened!?!* The error produces a lot of output, but the last line makes things clear:\n",
"\n",
"`ValueError: could not convert string to float: 'Iris-virginica'`\n",
"\n",
"The algorithm expected numbers; it does not know what to do with the flower types (strings). So how do we fix this?\n",
"\n",
"(sup_reg_ex: develop: train: categorical_1)=\n",
"## Processing Categorical Data (the easy way)\n",
"\n",
"One way to fix a problem is to avoid it. You are not required to use all the data, only some of it. Sometimes choosing the right variables is the real trick. [Dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) is an important part of data science. Here, those flower types DO matter, and it would be best to include that data, but goal #1 is to get things working. Improving things is step #2 and step #3 and step #4 and ... step $\\# \\infty$.\n",
"\n",
"If we did want to keep the flower types, the standard fix is to encode them as numbers. A minimal sketch (pandas' built-in one-hot encoding; we won't use it here):"
]
},
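{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"# One-hot encode the 'type' column: each flower type becomes its own 0/1 column.\n",
"# (Not executed here; shown only as the alternative to dropping the column.)\n",
"X_train_encoded = pd.get_dummies(X_train, columns=['type'])\n",
"X_test_encoded = pd.get_dummies(X_test, columns=['type'])\n",
"```\n",
"\n",
"For now, just to get things rolling, let's remove the column with the categorical data:"
]
},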
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"scroll-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal-width | \n",
" petal-length | \n",
" petal-width | \n",
"
\n",
" \n",
" \n",
" \n",
" 119 | \n",
" 2.2 | \n",
" 5.0 | \n",
" 1.5 | \n",
"
\n",
" \n",
" 128 | \n",
" 2.8 | \n",
" 5.6 | \n",
" 2.1 | \n",
"
\n",
" \n",
" 135 | \n",
" 3.0 | \n",
" 6.1 | \n",
" 2.3 | \n",
"
\n",
" \n",
" 91 | \n",
" 3.0 | \n",
" 4.6 | \n",
" 1.4 | \n",
"
\n",
" \n",
" 112 | \n",
" 3.0 | \n",
" 5.5 | \n",
" 2.1 | \n",
"
\n",
" \n",
" 71 | \n",
" 2.8 | \n",
" 4.0 | \n",
" 1.3 | \n",
"
\n",
" \n",
" 123 | \n",
" 2.7 | \n",
" 4.9 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 85 | \n",
" 3.4 | \n",
" 4.5 | \n",
" 1.6 | \n",
"
\n",
" \n",
" 147 | \n",
" 3.0 | \n",
" 5.2 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 143 | \n",
" 3.2 | \n",
" 5.9 | \n",
" 2.3 | \n",
"
\n",
" \n",
" 127 | \n",
" 3.0 | \n",
" 4.9 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 39 | \n",
" 3.4 | \n",
" 1.5 | \n",
" 0.2 | \n",
"
\n",
" \n",
" 38 | \n",
" 3.0 | \n",
" 1.3 | \n",
" 0.2 | \n",
"
\n",
" \n",
" 93 | \n",
" 2.3 | \n",
" 3.3 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 23 | \n",
" 3.3 | \n",
" 1.7 | \n",
" 0.5 | \n",
"
\n",
" \n",
" 133 | \n",
" 2.8 | \n",
" 5.1 | \n",
" 1.5 | \n",
"
\n",
" \n",
" 30 | \n",
" 3.1 | \n",
" 1.6 | \n",
" 0.2 | \n",
"
\n",
" \n",
" 83 | \n",
" 2.7 | \n",
" 5.1 | \n",
" 1.6 | \n",
"
\n",
" \n",
" 37 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.1 | \n",
"
\n",
" \n",
" 41 | \n",
" 2.3 | \n",
" 1.3 | \n",
" 0.3 | \n",
"
\n",
" \n",
" 81 | \n",
" 2.4 | \n",
" 3.7 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 120 | \n",
" 3.2 | \n",
" 5.7 | \n",
" 2.3 | \n",
"
\n",
" \n",
" 43 | \n",
" 3.5 | \n",
" 1.6 | \n",
" 0.6 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
"
\n",
" \n",
" 64 | \n",
" 2.9 | \n",
" 3.6 | \n",
" 1.3 | \n",
"
\n",
" \n",
" 62 | \n",
" 2.2 | \n",
" 4.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 56 | \n",
" 3.3 | \n",
" 4.7 | \n",
" 1.6 | \n",
"
\n",
" \n",
" 67 | \n",
" 2.7 | \n",
" 4.1 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 49 | \n",
" 3.3 | \n",
" 1.4 | \n",
" 0.2 | \n",
"
\n",
" \n",
" 63 | \n",
" 2.9 | \n",
" 4.7 | \n",
" 1.4 | \n",
"
\n",
" \n",
" 79 | \n",
" 2.6 | \n",
" 3.5 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 54 | \n",
" 2.8 | \n",
" 4.6 | \n",
" 1.5 | \n",
"
\n",
" \n",
" 106 | \n",
" 2.5 | \n",
" 4.5 | \n",
" 1.7 | \n",
"
\n",
" \n",
" 90 | \n",
" 2.6 | \n",
" 4.4 | \n",
" 1.2 | \n",
"
\n",
" \n",
" 145 | \n",
" 3.0 | \n",
" 5.2 | \n",
" 2.3 | \n",
"
\n",
" \n",
" 14 | \n",
" 4.0 | \n",
" 1.2 | \n",
" 0.2 | \n",
"
\n",
" \n",
" 141 | \n",
" 3.1 | \n",
" 5.1 | \n",
" 2.3 | \n",
"
\n",
" \n",
" 51 | \n",
" 3.2 | \n",
" 4.5 | \n",
" 1.5 | \n",
"
\n",
" \n",
" 139 | \n",
" 3.1 | \n",
" 5.4 | \n",
" 2.1 | \n",
"
\n",
" \n",
" 70 | \n",
" 3.2 | \n",
" 4.8 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 97 | \n",
" 2.9 | \n",
" 4.3 | \n",
" 1.3 | \n",
"
\n",
" \n",
" 55 | \n",
" 2.8 | \n",
" 4.5 | \n",
" 1.3 | \n",
"
\n",
" \n",
" 32 | \n",
" 4.1 | \n",
" 1.5 | \n",
" 0.1 | \n",
"
\n",
" \n",
" 104 | \n",
" 3.0 | \n",
" 5.8 | \n",
" 2.2 | \n",
"
\n",
" \n",
" 136 | \n",
" 3.4 | \n",
" 5.6 | \n",
" 2.4 | \n",
"
\n",
" \n",
" 18 | \n",
" 3.8 | \n",
" 1.7 | \n",
" 0.3 | \n",
"
\n",
" \n",
" 108 | \n",
" 2.5 | \n",
" 5.8 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 98 | \n",
" 2.5 | \n",
" 3.0 | \n",
" 1.1 | \n",
"
\n",
" \n",
" 45 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.3 | \n",
"
\n",
" \n",
" 68 | \n",
" 2.2 | \n",
" 4.5 | \n",
" 1.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal-width petal-length petal-width\n",
"119 2.2 5.0 1.5\n",
"128 2.8 5.6 2.1\n",
"135 3.0 6.1 2.3\n",
"91 3.0 4.6 1.4\n",
"112 3.0 5.5 2.1\n",
"71 2.8 4.0 1.3\n",
"123 2.7 4.9 1.8\n",
"85 3.4 4.5 1.6\n",
"147 3.0 5.2 2.0\n",
"143 3.2 5.9 2.3\n",
"127 3.0 4.9 1.8\n",
"39 3.4 1.5 0.2\n",
"38 3.0 1.3 0.2\n",
"93 2.3 3.3 1.0\n",
"23 3.3 1.7 0.5\n",
"133 2.8 5.1 1.5\n",
"30 3.1 1.6 0.2\n",
"83 2.7 5.1 1.6\n",
"37 3.1 1.5 0.1\n",
"41 2.3 1.3 0.3\n",
"81 2.4 3.7 1.0\n",
"120 3.2 5.7 2.3\n",
"43 3.5 1.6 0.6\n",
"2 3.2 1.3 0.2\n",
"64 2.9 3.6 1.3\n",
"62 2.2 4.0 1.0\n",
"56 3.3 4.7 1.6\n",
"67 2.7 4.1 1.0\n",
"49 3.3 1.4 0.2\n",
"63 2.9 4.7 1.4\n",
"79 2.6 3.5 1.0\n",
"54 2.8 4.6 1.5\n",
"106 2.5 4.5 1.7\n",
"90 2.6 4.4 1.2\n",
"145 3.0 5.2 2.3\n",
"14 4.0 1.2 0.2\n",
"141 3.1 5.1 2.3\n",
"51 3.2 4.5 1.5\n",
"139 3.1 5.4 2.1\n",
"70 3.2 4.8 1.8\n",
"97 2.9 4.3 1.3\n",
"55 2.8 4.5 1.3\n",
"32 4.1 1.5 0.1\n",
"104 3.0 5.8 2.2\n",
"136 3.4 5.6 2.4\n",
"18 3.8 1.7 0.3\n",
"108 2.5 5.8 1.8\n",
"98 2.5 3.0 1.1\n",
"45 3.0 1.4 0.3\n",
"68 2.2 4.5 1.5"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train_no_type = X_train.drop(columns = ['type'])\n",
"X_test_no_type = X_test.drop(columns = ['type'])\n",
"\n",
"X_test_no_type"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now the models will train without error:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
],
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"linear_reg_model.fit(X_train_no_type, y_train)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"And the model can make predictions for an entire set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"scroll-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"array([[5.98770812],\n",
" [6.42220879],\n",
" [6.81026327],\n",
" [6.3038615 ],\n",
" [6.48153317],\n",
" [5.75587752],\n",
" [6.02106191],\n",
" [6.34587216],\n",
" [6.31716812],\n",
" [6.788514 ],\n",
" [6.23165894],\n",
" [5.01764515],\n",
" [4.57470184],\n",
" [5.07393461],\n",
" [4.87302581],\n",
" [6.48997581],\n",
" [4.88812177],\n",
" [6.34092093],\n",
" [4.885904 ],\n",
" [4.00445291],\n",
" [5.46842817],\n",
" [6.62636673],\n",
" [4.85349432],\n",
" [4.71509986],\n",
" [5.50178197],\n",
" [5.57125107],\n",
" [6.43782043],\n",
" [6.00331976],\n",
" [4.86637251],\n",
" [6.31473613],\n",
" [5.44667891],\n",
" [6.08460761],\n",
" [5.63522521],\n",
" [6.01862993],\n",
" [6.08060051],\n",
" [5.19561829],\n",
" [6.06972588],\n",
" [6.28433001],\n",
" [6.47065854],\n",
" [6.29098332],\n",
" [6.06929744],\n",
" [6.16124571],\n",
" [5.58789409],\n",
" [6.64589822],\n",
" [6.60683523],\n",
" [5.3817326 ],\n",
" [6.61032665],\n",
" [4.89225584],\n",
" [4.57691961],\n",
" [5.58233992]])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred_no_type = linear_reg_model.predict(X_test_no_type)\n",
"y_pred_no_type"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Or a single input:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[4.71509986]]\n"
]
}
],
"source": [
"# The model was trained on a dataframe, so it expects dataframe input with matching feature names\n",
"# Recall we removed the flower 'type' column, and we are predicting sepal-length\n",
"column_names_short = ['sepal-width', 'petal-length', 'petal-width']\n",
"\n",
"# Create a one-row dataframe as the input; this avoids a warning about missing feature names.\n",
"# Alternatively, linear_reg_model.predict([[3.2, 1.3, .2]]) also runs, but raises that warning.\n",
"input_df = pd.DataFrame(np.array([[3.2, 1.3, .2]]), columns = column_names_short)\n",
"\n",
"print(linear_reg_model.predict(input_df))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```{note}\n",
"Your model can only predict on data similar to what it was trained with. Since this model was trained with a dataframe, a matching new dataframe, `input_df`, was created to predict a single input. Alternatively, we could have converted the original data to arrays using `ravel()` or `.values` before training (see the [previous example](sup_class_ex:develop:train)); a sketch of that alternative follows.\n",
"```"
]
},
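{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that array-based alternative (not executed here; it would retrain the model on plain NumPy arrays, after which list inputs predict without the feature-name warning):\n",
"\n",
"```python\n",
"# Train on arrays instead of dataframes; .values strips the feature names,\n",
"# and ravel() flattens y to the 1-D shape sklearn prefers.\n",
"array_model = LinearRegression()\n",
"array_model.fit(X_train_no_type.values, y_train.values.ravel())\n",
"print(array_model.predict([[3.2, 1.3, .2]]))  # no warning\n",
"```"
]
},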
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(sup_reg_ex: develop: accuracy)=\n",
"## Accuracy Analysis for Regression (part 1)\n",
"\n",
"Now that the model works, we can work on improving it. But first we'll need metrics so we can tell if we're making progress. As we are trying to predict a continuous number, even the very best model will have errors in almost every prediction (if not, it's almost certainly [overfitted](https://en.wikipedia.org/wiki/Overfitting)). Whereas measuring the success of our classification model was a simple ratio, here we need a way to measure how much the predictions deviate from the actual values. See sklearn's list of [metrics and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html) for regression. \n",
"\n",
"To get things started, we'll use the *mean squared error* (MSE), a popular metric for evaluating regression models; its definition is below. Regression metrics are covered in more depth in the [Regression Accuracy Analysis](sup_reg_ex: develop: accuracy) section. "
]
},
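{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For $n$ test points with actual values $y_i$ and predictions $\\hat{y}_i$:\n",
"\n",
"$$\\mathrm{MSE} = \\frac{1}{n}\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2$$"
]
},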
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.12264182555541722"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"\n",
"mean_squared_error(y_test, y_pred_no_type)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"What does this mean? The closer the MSE is to 0, the better. However, this value is not given in terms of the dependent variable, and MSE values are not comparable across use cases, i.e., comparing an MSE from your project to that of a different model is not comparing \"apples to apples.\" A \"good\" MSE will depend on your data and project needs.\n",
"\n",
"This value can be used to determine if tweaks improve the results. For example, reviewing the [visualizations of this dataset](sup_class_ex:descriptive_methods_and_visualizations), we might expect that the regression coefficients (numbers determining the lines directions) should be positive, and try `LinearRegression(positive = True)`."
]
},
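{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(A quick way to see the learned coefficients, sketched here rather than run: the fitted estimator exposes one slope per feature in `coef_` and the intercept in `intercept_`.)\n",
"\n",
"```python\n",
"# Inspect the fitted slopes; any negative entries are candidates for the constraint below.\n",
"print(linear_reg_model.coef_, linear_reg_model.intercept_)\n",
"```"
]
},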
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.10302108975537984"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"linear_reg_model2 = LinearRegression(positive = True)\n",
"linear_reg_model2.fit(X_train_no_type, y_train)\n",
"y_pred_no_type = linear_reg_model2.predict(X_test_no_type)\n",
"\n",
"mean_squared_error(y_test, y_pred_no_type)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And $0.103 < 0.122$."
]
}
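,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A closing aside (a sketch, reusing the variables above): taking the square root of the MSE gives the *root mean squared error* (RMSE), which IS in the units of the dependent variable, here centimeters of sepal length.\n",
"\n",
"```python\n",
"import numpy as np\n",
"# RMSE is just the square root of MSE, so it shares units with sepal-length (cm)\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred_no_type))\n",
"print(rmse)\n",
"```"
]
}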
],
"metadata": {
"kernelspec": {
"display_name": "jupyter-books-WZpnkDri",
"language": "python",
"name": "jupyter-books-wzpnkdri"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.1"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}