import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Supervised Learning and Loss Functions¶
The goal in a supervised learning task is to produce a model that we can use to predict the value of a label (or response) $y$ given values for a set of features (or predictors), $x^{(1)}, x^{(2)}, ..., x^{(p)}$. This model is typically represented by a predict() function.
When we select a learning algorithm, we are effectively selecting a class of models that we will consider for our particular task. This set of allowed models is called our hypothesis space. Our learning algorithm will search through the hypothesis space to find the model that performs best on our training data. To determine which model from the hypothesis space is the "best", we need to provide the learning algorithm with a method of scoring different models. Such a scoring method is called an objective function. An objective function takes a model and a dataset as input, and produces a numerical score as its output. If an objective function is defined in such a way that lower scores are better, then we call it a loss function. The goal of a learning algorithm is to minimize its loss on the training data.
We will illustrate these concepts with the linear regression learning algorithm.
Simple Linear Regression¶
In a simple linear regression task, we have a single feature, x, from which we wish to predict a continuous, real-valued label y.
- Assume that our training set has $n$ observations.
- Denote the observed values of the training feature (predictor) as $x_1, x_2, ..., x_n$.
- Denote the observed values of the label (response variable) as $y_1, y_2, ..., y_n$.
- We assume that our model has the following form: $\large \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
- In the model above, $\hat{\beta}_0$ and $\hat\beta_1$ are model parameters that are learned from the training data.
- $\hat{y}$ represents the predicted value of $y$ given some value of $x$.
Training the Model¶
- Let $b_0$ and $b_1$ be a pair of (not necessarily optimal) parameter values used to define a model $\large \hat{y} = {b}_0 + {b}_1 x$.
- For each training observation $(x_i, y_i)$, let $\large\hat{y}_i = b_0 + b_1 x_i$.
- For each $i$, calculate the error (residual) $\large\hat{e}_i = \hat{y}_i - y_i$.
- The goal of the algorithm is to find the parameter values that minimize the Sum of Squared Errors loss function, given by: $ \large SSE = \sum \hat{e}_i^2 $
- We will denote the optimal parameter values by $\hat{\beta}_0$ and $\hat{\beta}_1$.
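To make the loss concrete, here is a minimal sketch of how the SSE could be computed for a candidate parameter pair $(b_0, b_1)$. The function name and the data values below are hypothetical and are used for illustration only.

import numpy as np

def sse_loss(b0, b1, x, y):
    # Predictions for every training observation under the candidate model.
    y_hat = b0 + b1 * x
    # Residuals, using the same definition as above: e_i = y_hat_i - y_i.
    residuals = y_hat - y
    # Sum of squared errors.
    return np.sum(residuals**2)

# Hypothetical data and candidate parameter values, for illustration only.
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.1, 3.9, 6.2])
print(sse_loss(b0=0.0, b1=2.0, x=x_demo, y=y_demo))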
%matplotlib inline
%run -i snippets/snippet04.py
SLR with Scikit-Learn¶
Scikit-Learn is a library that contains implementations of many machine learning algorithms, as well as useful tools for evaluating models, processing data, and generating synthetic data sets. We will use this package extensively in this class.
In this lesson, we will illustrate how to use Scikit-Learn to construct linear regression models. We begin by recreating the dataset used in the demo above.
# Generate Data
np.random.seed(164)
x = np.random.uniform(low=0, high=10, size=12)
y = 5 + 1.4 * x + np.random.normal(loc=0, scale=2.5, size=12)
# The features need to be stored in a 2D array or DataFrame
X = x.reshape(12,1)
print(X.shape)
print(y.shape)
print("HELLO WORLD")
(12, 1)
(12,)
We will now use the LinearRegression class from Scikit-Learn to create the SLR model.
from sklearn.linear_model import LinearRegression
# We create an instance of the LinearRegression() class.
# Then we call its fit() method.
mod = LinearRegression()
mod.fit(X,y)
# We print the model parameters.
print('Intercept:', mod.intercept_)
print('Slope:', mod.coef_)
Intercept: 7.175724099149425
Slope: [1.10640235]
We will plot the dataset, along with the line that represents the model.
b = mod.intercept_
m = mod.coef_[0]
plt.close()
plt.scatter(x,y)
plt.plot([0,10],[m*0 + b, m*10 + b], c='r')
plt.xlim([0,10])
plt.ylim([0,25])
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Generating Predictions¶
Every Scikit-Learn model comes equipped with a predict method that can be used to generate predictions. We will use this method to find the y values predicted by our model for x=2, x=5, and x=8.
Xnew = np.array([2, 5, 8])
print(Xnew.shape)
(3,)
The predict function requires a 2D array as input, so we will reshape Xnew.
Xnew = Xnew.reshape((3,1))
print(Xnew.shape)
(3, 1)
print(mod.predict(Xnew))
[ 9.38852881 12.70773587 16.02694293]
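As a quick sanity check (a sketch, not part of the original lesson), these values should agree with plugging $x=2$, $x=5$, and $x=8$ directly into the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$:

# Sketch: predict() should match the fitted line evaluated by hand.
print(mod.intercept_ + mod.coef_[0] * Xnew.ravel())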
Scoring the Model¶
We know that the parameter values were selected in order to minimize SSE for the given training data. Let's calculate the SSE for the model we have generated. This will require us to use the predict method built into our model.
y_hat = mod.predict(X)
residuals = y - y_hat
SSE = np.sum(residuals**2)
print('Training SSE:', SSE)
Training SSE: 29.739546986084278
r-Squared¶
When working with linear regression, the optimal parameter values are determined by minimizing the objective function SSE. However, a different value is typically used to assess the quality of a regression model. This alternate scoring method is called the r-squared value. It is defined as follows:
$ \large SST = \sum (y_i - \bar y ) ^2 $
$ \large SSE = \sum \hat{e}_i^2 = \sum (y_i - \hat {y}_i ) ^2 $
$ \large r^2 = 1 - \frac{SSE}{SST}$
Since SST is a constant for the supplied training data, minimizing SSE is equivalent to maximizing $r^2$. The score supplied by $r^2$ has two advantages over SSE:
- The value $r^2$ is "normalized" to always be between 0 and 1. A value close to 1 indicates that our model provides a very good fit for the data.
- The $r^2$ value has a useful interpretation not present with SSE. We can think of $r^2$ as reporting the proportion of the variance in the training labels that has been accounted for by our model.
We will now directly compute the $r^2$ value for our model.
SST = np.sum((y - np.mean(y))**2)
r2 = 1 - SSE / SST
print(r2)
0.7298814238421224
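This value matches the "proportion of variance explained" interpretation mentioned above. Since an OLS fit with an intercept produces residuals with mean zero, the variance of the residuals equals $SSE/n$, so the following sketch should produce the same number:

# Sketch: 1 - Var(residuals)/Var(y) equals 1 - SSE/SST for this fitted model.
print(1 - np.var(residuals) / np.var(y))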
Scikit-Learn LinearRegression objects do not have an r_squared attribute, but they do contain a method called score that can be used to calculate $r^2$.
print(mod.score(X,y))
0.7298814238421224
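The same quantity can also be computed with the r2_score function in sklearn.metrics, which compares the true labels with the model's predictions. A minimal sketch:

# Compute r-squared from the true and predicted labels.
from sklearn.metrics import r2_score
print(r2_score(y, mod.predict(X)))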
Multiple Regression¶
In a multiple linear regression task, we have several features, $X = [x^{(1)}, x^{(2)}, ..., x^{(p)}]$, from which we wish to predict a single continuous, real-valued label y.
- Assume that our training set has $n$ observations.
- Denote the values of the training features for observation number $i$ as $X_i = [x^{(1)}_i, x^{(2)}_i, ..., x^{(p)}_i]$.
- Denote the value of the label for observation $i$ as $y_i$.
- We assume that our model has the following form: $\large \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x^{(1)} + \hat{\beta}_2 x^{(2)} + ... + \hat{\beta}_p x^{(p)}$.
- In the model above, $\hat{\beta}_0$, $\hat{\beta}_1$, ..., $\hat{\beta}_p$ are model parameters that are learned from the training data.
- $\hat{y}$ represents the predicted value of $y$ given some input vector $X = [x^{(1)}, x^{(2)}, ..., x^{(p)}]$.
Training the Model¶
- Let $b_0, b_1, ..., b_p$ be a set of (not necessarily optimal) parameter values used to define a model $\large \hat{y} = {b}_0 + {b}_1 x^{(1)} + {b}_2 x^{(2)} + ... + {b}_p x^{(p)}$.
- For each training observation $(x_i, y_i)$, let $\large\hat{y}_i = {b}_0 + {b}_1 x^{(1)}_i + {b}_2 x^{(2)}_i + ... + {b}_p x^{(p)}_i$.
- For each $i$, calculate the error (residual) $\large\hat{e}_i = \hat{y}_i - y_i$.
- The goal of the algorithm is to find the parameter values that minimize the Sum of Squared Errors objective function, given by: $ \large SSE = \sum \hat{e}_i^2 $
- We will denote the optimal parameter values by $\hat{\beta}_0$, $\hat{\beta}_1$, ..., $\hat{\beta}_p$.
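The multiple regression loss can be written compactly with matrix operations. Below is a minimal NumPy sketch of computing SSE for a candidate intercept $b_0$ and coefficient vector $b$; the function name and example values are hypothetical.

import numpy as np

def mlr_sse(b0, b, X, y):
    # X has shape (n, p) and b has shape (p,), so predictions are b0 + X @ b.
    y_hat = b0 + X @ b
    # Residuals use the same definition as above: e_i = y_hat_i - y_i.
    return np.sum((y_hat - y)**2)

# Hypothetical example with three observations and two features.
X_demo = np.array([[1.0, 2.0],
                   [2.0, 0.5],
                   [3.0, 1.5]])
y_demo = np.array([5.0, 4.0, 7.5])
print(mlr_sse(b0=1.0, b=np.array([1.0, 1.0]), X=X_demo, y=y_demo))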
Example with Synthetic Data¶
We will generate a synthetic dataset to illustrate how multiple regression works.
np.random.seed(1)
n = 200
x1 = np.random.uniform(0, 10, n)
x2 = np.random.uniform(0, 10, n)
y = 7 + 1.3 * x1 + 2.5 * x2 + np.random.normal(0, 3, n)
df = pd.DataFrame({'x1':x1, 'x2':x2, 'y':y})
df.head()
|   | x1 | x2 | y |
|---|---|---|---|
| 0 | 4.170220 | 9.501761 | 38.139430 |
| 1 | 7.203245 | 5.566532 | 30.126983 |
| 2 | 0.001144 | 9.156063 | 27.714854 |
| 3 | 3.023326 | 6.415662 | 24.366173 |
| 4 | 1.467559 | 3.900077 | 18.250087 |
We can use the LinearRegression class from Scikit-Learn to create multiple regression models.
X = np.hstack([x1.reshape(n,1), x2.reshape(n,1)])
print(X.shape)
mlr = LinearRegression()
mlr.fit(X,y)
print('\nModel intercept:', mlr.intercept_)
print('\nModel coefficients:', mlr.coef_)
print('\nr-Squared:', mlr.score(X,y))
(200, 2)
Model intercept: 6.536812277597992
Model coefficients: [1.37737604 2.50605913]
r-Squared: 0.8721337623707668
Example: Predicting Median Home Value¶
In this example, we will be working with the "Boston Housing" dataset. This dataset contains data for 506 census tracts of Boston from the 1970 census.
The dataset contains the following 19 pieces of information for each census tract:
- town - name of town
- tract - census tract
- lon - longitude of census tract
- lat - latitude of census tract
- medv - median value of owner-occupied homes in USD 1000's
- cmedv - corrected median value of owner-occupied homes in USD 1000's
- crim - per capita crime rate by town
- zn - proportion of residential land zoned for lots over 25,000 sq.ft
- indus - proportion of non-retail business acres per town
- chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox - nitric oxides concentration (parts per 10 million)
- rm - average number of rooms per dwelling
- age - proportion of owner-occupied units built prior to 1940
- dis - weighted distances to five Boston employment centres
- rad - index of accessibility to radial highways
- tax - full-value property-tax rate per USD 10,000
- ptratio - pupil-teacher ratio by town
- b - 1000(B - 0.63)^2 where B is the proportion of blacks by town
- lstat - percentage of lower status of the population
We will start by importing the dataset from a text file, and then viewing the first 10 rows.
df = pd.read_csv('data/BostonHousingV2.txt', sep='\t')
df.head(n=10)
|   | town | tract | lon | lat | medv | cmedv | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Nahant | 2011 | -70.9550 | 42.2550 | 24.0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 |
| 1 | Swampscott | 2021 | -70.9500 | 42.2875 | 21.6 | 21.6 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 |
| 2 | Swampscott | 2022 | -70.9360 | 42.2830 | 34.7 | 34.7 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
| 3 | Marblehead | 2031 | -70.9280 | 42.2930 | 33.4 | 33.4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 |
| 4 | Marblehead | 2032 | -70.9220 | 42.2980 | 36.2 | 36.2 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 |
| 5 | Marblehead | 2033 | -70.9165 | 42.3040 | 28.7 | 28.7 | 0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 |
| 6 | Salem | 2041 | -70.9360 | 42.2970 | 22.9 | 22.9 | 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.60 | 12.43 |
| 7 | Salem | 2042 | -70.9375 | 42.3100 | 27.1 | 22.1 | 0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.90 | 19.15 |
| 8 | Salem | 2043 | -70.9330 | 42.3120 | 16.5 | 16.5 | 0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 |
| 9 | Salem | 2044 | -70.9290 | 42.3160 | 18.9 | 18.9 | 0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.10 |
Let's check the dimensions of the DataFrame.
print(df.shape)
(506, 19)
We will use this dataset for a regression task in which we will attempt to predict the label cmedv using the last 13 columns as features. We will now prepare our feature and label arrays.
X = df.iloc[:,6:].values
y = df.iloc[:,5].values
print(X.shape)
print(y.shape)
(506, 13) (506,)
In order to measure how well our model generalizes to new data, we will create a train/test split of our data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape)
print(X_test.shape)
(404, 13) (102, 13)
We will now use the LinearRegression class to create our model. We will use the score method of our trained model to calculate the r-Squared value on our training set, as well as on our test set.
model = LinearRegression()
model.fit(X_train, y_train)
print("Training r-Squared:", model.score(X_train, y_train))
print("Testing r-Squared: ", model.score(X_test, y_test))
Training r-Squared: 0.7341832055169145
Testing r-Squared:  0.7639579157366432
The model object has attributes containing the optimal parameter values that define our model.
np.set_printoptions(precision=6, suppress=True)
print(model.intercept_)
print(model.coef_)
42.761131545077376
[ -0.110315   0.06       0.021576   2.134697 -19.522247   3.075544   0.004304  -1.536018   0.303108  -0.011639  -0.950799   0.0072    -0.551868]
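With 13 coefficients, it can be hard to tell which value corresponds to which feature. As a quick sketch (assuming the feature columns are df.columns[6:], matching how X was constructed above), we can pair each coefficient with its column name:

# Pair each coefficient with the name of the feature it multiplies.
for name, coef in zip(df.columns[6:], model.coef_):
    print(f'{name:>8}: {coef: .6f}')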
Using a Subset of the Available Features¶
It is often undesirable to use all of the features that are potentially available to you in a supervised learning task. A few reasons why you might want to consider using a subset of the available features are:
- Using too many features could result in over-fitting.
- Some features might not be relevant.
- Some features might be redundant.
- Having too many features can have a negative impact on the interpretability of your model.
- It might be expensive or time-consuming to collect values for certain features.
Let's train a second MLR model on this data, but this time using only the features rm, ptratio, and lstat.
df.iloc[:, [11,16,18]].head()
| rm | ptratio | lstat | |
|---|---|---|---|
| 0 | 6.575 | 15.3 | 4.98 |
| 1 | 6.421 | 17.8 | 9.14 |
| 2 | 7.185 | 17.8 | 4.03 |
| 3 | 6.998 | 18.7 | 2.94 |
| 4 | 7.147 | 18.7 | 5.33 |
Xs = df.iloc[:,[11,16,18]].values
print(Xs.shape)
(506, 3)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=1)
model_s = LinearRegression()
model_s.fit(Xs_train, y_train)
print("Training r-Squared:", model_s.score(Xs_train, y_train))
print("Testing r-Squared: ", model_s.score(Xs_test, y_test))
Training r-Squared: 0.6739085856095928
Testing r-Squared:  0.6958714532115466
print(model_s.intercept_)
print(model_s.coef_)
23.92724415645076
[ 3.795969 -0.952999 -0.597414]
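As a closing note (a sketch, not part of the workflow above), the same three feature columns could also be selected by name rather than by integer position, which makes the intent of the code easier to read:

# Select the feature columns by name instead of by position.
Xs_by_name = df[['rm', 'ptratio', 'lstat']].values
print(Xs_by_name.shape)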