Example: Supervised Regression App#

To predict a numeric value for a feature in the data, use a supervised regression method (not logistic regression, which, despite its name, is a classification method).

For this example, we'll slightly modify the previous one: instead of predicting the categorical type, we'll predict the numeric sepal-length.

# We'll import libraries as needed, but when submitting,
# it's best to have them all at the top.
import pandas as pd

# Load this familiar dataset, attaching column names to the DataFrame
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']
df = pd.read_csv(url, names=column_names)  # read CSV into Python as a DataFrame
pd.options.display.show_dimensions = False  # suppress dimension output

# Preserve the Jupyter preview style (the '...') after applying .style
def display_df(dataframe, column_names, highlighted_col, precision=2):
    pd.set_option("display.precision", precision)
    columns_dict = {name: '...' for name in column_names}
    df2 = (pd.concat([dataframe.iloc[:5, :],
                      pd.DataFrame(index=['...'], data=columns_dict),
                      dataframe.iloc[-5:, :]])
             .style.format(precision=precision)
             .set_properties(subset=[highlighted_col],
                             **{'background-color': 'yellow'}))
    pd.options.display.show_dimensions = True
    display(df2)

# Display df with the highlighted column
display_df(df, column_names, 'sepal-length', 1)
  sepal-length sepal-width petal-length petal-width type
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

The highlighted column, sepal-length, provides the value to predict (the dependent variable), and the non-highlighted columns provide the values from which to make that prediction (the independent variables). The big differences from the previous example are as follows:

  • Data processing (maybe): if we choose to include type as an independent variable, it will need to be converted from categorical data into numbers the model can use (see the sketch after this list).

  • Model development: since we'll be predicting a number, a regression method will be used instead of a classification method.

  • Accuracy metric: instead of a simple percentage, we'll need a measurement of how closely the model's predictions fit the data, e.g., mean squared error.
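
As a rough, minimal sketch of how those three differences might translate into code (assuming df is loaded as in the cell above, and using scikit-learn's LinearRegression, train_test_split, and mean_squared_error purely as illustrative choices, not as the final model):

# A minimal sketch only: LinearRegression and an 80/20 split are illustrative choices,
# assuming df has been loaded as in the cell above.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Data processing: one-hot encode the categorical 'type' column into numeric columns
X = pd.get_dummies(df.drop(columns=['sepal-length']), columns=['type'])
y = df['sepal-length']  # the number we want to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model development: a regression method rather than a classifier
model = LinearRegression()
model.fit(X_train, y_train)

# Accuracy metric: mean squared error measures how far predictions are from the actual values
predictions = model.predict(X_test)
print(mean_squared_error(y_test, predictions))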

Data Exploration and Processing#

Since the data is identical, this step will be much the same as in the previous example; please refer to it there. Focusing on sepal length, we can clearly see patterns:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlogram: plot sepal-length against each of the other features, colored by type
sns.pairplot(df, y_vars=['sepal-length'], hue='type')
plt.show()
[Figure: pairplot of sepal-length against the other features, colored by iris type]
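
To put numbers behind that visual impression, a quick check of how strongly sepal-length correlates with the other numeric columns can help (a minimal sketch, again assuming df from above):

# Correlation of sepal-length with the other numeric columns (sketch; df as above)
numeric_cols = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
print(df[numeric_cols].corr()['sepal-length'])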