Select one of the keywords on the left…

Machine LearningA regression example: linear models

Reading time: ~20 min

A very common approach to estimating the regression function for a particular joint distribution is to assume that it takes a particular form. For example, to perform a linear regression, we posit that r(x) = \beta_0 + \beta_1 x for some constants \beta_0 and \beta_1. To estimate \boldsymbol{\beta} = [\beta_0,\beta_1] from the n observations \{(x_i,y_i)\}_{i=1}^n, we can minimize the empirical mean squared error (also known as the residual sum of squares:

\begin{align*} \operatorname{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. \end{align*}

The line of best fit minimizes the sum of the squares of the lengths of the red segments.

Find the value of \boldsymbol{\beta} which minimizes \operatorname{RSS}(\boldsymbol{\beta}).

Solution. We can write \operatorname{RSS}(\boldsymbol{\beta}) as

\begin{align*}|\mathbf{y} - X\boldsymbol{\beta}|^2,\end{align*}

where \mathbf{y} = [y_1, \ldots, y_n] and

\begin{align*}X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}. \end{align*}

The function \boldsymbol{\beta}\mapsto |\mathbf{y} - X\boldsymbol{\beta}|^2 goes to infinity as \boldsymbol{\beta} \to \infty and is continuous, so it necessarily has at least one global minimum. Differentiating, we get

\begin{align*} \frac{\partial}{\partial \boldsymbol{\beta}}(\mathbf{y} - X\boldsymbol{\beta})'(\mathbf{y} - X\boldsymbol{\beta}) = -2X'(\mathbf{y} - X\boldsymbol{\beta}).\end{align*}

we can solve to find that

\begin{align*}\boldsymbol{\beta} = (X' X)^{-1}X' \mathbf{y}.\end{align*}

is the only critical point. Therefore, it must be the unique global minimizer.

Let's apply this formula to find the linear regression estimate for the exam-scores example in the statistics course.

y = [Y for (X,Y) in observations]
X = [ones(n) [X for (X,Y) in observations]]
β = (X'*X) \ X'*y # (computes (X'X)⁻¹X'y)
scatter(observations, label = "observations")
plot!(xs, [β'*[1,x] for x in xs], label = "empirical risk minimizer")

We can also use linear regression techniques to handle polynomial regression. If we posit that

\begin{align*} r(x) = \beta_0 + \beta_1 X + \beta_2 X^2,\end{align*}

then we find the coefficients \boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2] which minimize the residual sum of squares by writing

\begin{align*}\operatorname{RSS}(\boldsymbol{\beta}) = |y - X\boldsymbol{\beta}|^2,\end{align*}


\begin{align*} X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}.\end{align*}


Perform a quadratic regression on the exam-scores example, and show that its integrated squared difference from the true regression function is lower than that of the linear estimator and the Nadaraya-Watson estimator.


We perform the quadratic regression by doing the same calculation as for the linear regression but with an extra column in X. We approximate the integrated squared error using a Riemann sum as we did for the kernel density estimator. We find that the integrated squared errors for the quadratic estimator, the kernel density estimator, and the linear estimator are 0.61, 1.9, and 9.24 respectively.

y = [Y for (X,Y) in observations]
X = [ones(n) [X for (X,Y) in observations] [X^2 for (X,Y) in observations]]
βq = (X'*X) \ X'*y
quaderr = sum((r(x)-βq'*[1,x,x^2])^2 for x in xs)*step(xs)
print("error for quadratic estimator is $quaderr")
linerr = sum((r(x)-β'*[1,x])^2 for x in xs)*step(xs)
print("error for linear estimator is $linerr")
scatter(observations, label = "observations")
plot!(xs, [βq'*[1,x,x^2] for x in xs], label = "quadratic regression estimator")

The results of this example are telling: the true regression function was quadratic for this example, and assuming quadraticness enabled us to achieve more than three times lower error than for the nonparametric estimator. On the other hand, the assumption that the regression function is linear resulted in significantly greater error. Thus we see that inductive bias is a double-edged sword: it tends to enhance accuracy if the assumptions applied are correct or approximately correct, and it can result in lower accuracy if not.

Despite their simplicity, linear models are quite capable of overfitting. Ridge and Lasso regression address this problem by adding a term to the loss functional which penalizes large coefficients. In this problem, we will explore how these methods behave on a simulated example where the response variable is defined to be a linear combination of some of the feature variables, while other features are nearly linearly dependent.

(a) Execute this cell to see what happens if there is nearly a linear dependence relationship among the features. Note that the loss minimizer is [3, 1, -4, 0, 0, 0, 0, 0, 0, 0], since we are defining the response variable so that it has a conditional expectation given X which is equal to a linear combination of the features with those weights.

using Plots, Random
function groupedbar(β)
    bar([3; 1; -4; zeros(7)], bar_width = 0.4,
        label = "actual coefficients", legend = :topleft)
    bar!((1:length(β)) .+ 0.4, β, bar_width = 0.4,
         label = "coefficients from fit")

n = 10
# generate n-1 features, together with a column of ones:
X = [ones(n) randn(n, n-1)]
# make the last two columns nearly parallel
X[:,10] = 3X[:,9] .+ 0.01randn(n)
# response variable depends on first two features:
y = 3.0 .+ X[:,2] .- 4X[:,3] + 0.1randn(n)
β = X \ y

(b) Execute this cell to see how ridge regression mitigates the problems you observed in (a):

using Optim, LinearAlgebra
λ = 1
o = optimize(β -> sum((y - X*β).^2) + λ*β[2:end]'*β[2:end], ones(10))

(c) Execute this cell to see how lasso regression compares to ridge regression.

λ = 1
o = optimize(β -> sum((y - X*β).^2) + λ*norm(β[2:end],1), ones(10))

Which model does the best job of zeroing out the coefficients that should be zero, in this example?

Solution. With no regularization, the coefficients for the last two features are very large. Ridge and lasso both reduce this problem substantially, although lasso does a better job of getting the coefficients that should be zero very close to zero. An explanation for this phenomenon is discussed in the optimization section of the Multivariable Calculus course.

Bruno Bruno