Select one of the keywords on the left…

Machine LearningIntroduction

Reading time: ~45 min

At the highest level, the goal of statistical learning is to draw conclusions about an unknown probability measure given independent observations drawn from the measure. These observations are called training data. In supervised learning, the unknown measure \mathbb{P} is on a product space \mathcal{X} \times \mathcal{Y}. In other words, each training observation has the form (\mathbf{X}, Y) where \mathbf{X} is an element of \mathcal{X} and \mathbf{Y} is an element of \mathcal{Y}.

We aim to use the training data to predict Y given \mathbf{X}, where (\mathbf{X},Y) denotes a random variable in \mathcal{X} \times \mathcal{Y} with distribution \mathbb{P}.

Suppose that \mathbf{X} = [X_1, X_2], where X_1 is the color of a banana, X_2 is the weight of the banana, and Y is a measure of banana deliciousness. Values of X_1, X_2, and Y are recorded for many bananas, and they are used to predict Y for other bananas whose \mathbf{X} values are known.

Do you expect the prediction function to be more sensitive to changes in X_1 or changes in X_2?

Solution. We would expect X_1 to have more influence on Y, since bananas can taste great whether small or large, while green or dark brown bananas generally taste very bad and yellow bananas mostly taste good.

We call the components of \mathbf{X} features, predictors, or input variables, and we call Y the response variable or output variable.

We categorize supervised learning problems based on the structure of \mathcal{Y}.

A supervised learning problem is a regression problem if Y is quantitative (\mathcal{Y}\subset \mathbb{R}) and a classification problem if \mathcal{Y} is a set of labels.

We can treat the banana problem above as either a regression problem (if deliciousness is a score on a spectrum) or a classification problem (if the bananas are dichotomized as "good" or "bad"). We will see that many models are specific to regression or classification, while others can be adapted to handle either type of problem.

Explain how a machine learning task like image recognition fits into this framework. In other words, describe an appropriate feature space \mathcal{X}, describe in loose terms what the probability measure looks like, and describe in geometric terms what we are trying to accomplish when we define our prediction function.

Solution. We can take \mathcal{X} to be [0,1]^P, where P is the number of pixels in each image., because a grayscale image with P pixels can be thought of as list of 1000 numbers in [0,1], with each component indicating the darkness of a single pixel. Let's suppose P = 10^6, which is a fairly typical image size for modern cameras.

The output set \mathcal{Y} contains labels. For example, if we just have cat images and dog images to classify, then \mathcal{Y} would be {"cat", "dog"}.

The probability mass occupies a very small portion of the million-dimensional cube, since the vast majority of possible images do not represent a cat or a dog. Furthermore, the geometry of the sets where mass is concentrated is presumably quite complex, since the set of changes which can be made to an image to preserve its "catness" or "dogness" is not straightforward to describe. See the figure to the right for a rough visual; that this visualization is necessarily strained, since the 3D cube and the million dimensional cube have very different geometric properties.

Each dimension in this figure represents a pixel darkness, and the red and blue surfaces represent level sets of the conditional probability density functions given the animal genus. Cat images are likely to fall inside the red surfaces, while dog images are more likely to fall inside the blue surfaces. This visualization should be taken with a grain of salt, since real images have many more than three pixels.

Still, one thing we can use this picture to think about is what our goal is when we're coming up with a prediction function: we should carve up the cube so as to separate the regions where there's more cat probability mass from regions where there's more dog probability mass. That way, any image of a cat or dog can be identified as "more likely a cat" or "more likely a dog".

The gray surface separates regions where there's more red probability mass (for cats) from regions where there's more blue probability mass (for dogs). The goal of the prediction function is to compute whether a given point falls on the cat side or the dog side.

The loss functional

To make meaningful and unambiguous statements about a proposed prediction function h: \mathcal{X} \to \mathcal{Y}, we need a rubric by which to assess it. This is customarily done by defining a loss (or risk, or error ) L(h), with the idea that smaller loss is better. We might wish to define L only for h's in a specified class \mathcal{H} of candidate functions. Since L : \mathcal{H} \to \mathbb{R} is defined on a set of functions, we call L the loss functional.

Once a set \mathcal{H} of candidate functions and a loss functional L are specified, a prediction function h \in H must be proposed. This can often be done the lazy way, by defining our prediction function implicitly as the function in \mathcal{H} which minimizes L:

Given a statistical learning problem, a space \mathcal{H} of candidate prediction functions, and a loss functional L: \mathcal{H} \to \mathbb{R}, we define the target function to be \operatorname{argmin}_{h \in \mathcal{H}}L(h).

(Note: argmin means the value which minimizes. So for example, \operatorname{argmin}_{x \in \mathbb{R}} (x-14)^2 is equal to 14.)

Let's look at some common loss functionals. For regression, we often use the mean squared error:

\begin{align*}L(h) = \mathbb{E}[(h(X)-Y)^2]\end{align*}

If \mathcal{H} contains the regression function r(\mathbf{x}) = \mathbb{E}[Y | \mathbf{X} = \mathbf{x}], then r is the target function. This is because the mean of a random variable Y is the constant c which minimizes the expression \mathbb{E}[(Y - c)^2].

Recall the exam scores example from the KDE section of the statistics course in which we know the exact density of the distribution which generates the hours-score pairs for students taking an exam (run this cell):

using LinearAlgebra, Statistics, Roots, Optim, Plots, Random

# the true regression function
r(x) = 2 + 1/50*x*(30-x)
# the true density function
σy = 3/2  
f(x,y) = 3/4000 * 1/√(2π*σy^2) * x*(20-x)*exp(-1/(2σy^2)*(y-r(x))^2)

heatmap(0:0.02:20, -2:0.01:12, f,
        fillcolor = cgrad([:white, :MidnightBlue]),
        ratio = 1, fontfamily = "Palatino",
        size = (600,300), xlims = (0,20), ylims = (0,12),
        xlabel = "hours studied", ylabel = "score")

scatter!([(5,2), (5,4), (7,4), (15,4.5), (18, 4), (10,6.1)],
         markersize = 3, label = "observations")

plot!(0:0.02:20, r, label = "target function",
      legend = :topleft, linewidth = 2)

(a) What must be true of the class \mathcal{H} of candidate functions in order for the target function to be equal to the regression function r?

(b) Suppose we collect six observations, and we get the six data points shown in the plot above. Can the loss value of the prediction function plotted be decreased by lowering its graph a bit?

Solution. (a) The class of candidate functions must contain r. For example, it might consistnt of all quadratic functions, or all differentiable functions. If it only contained linear functions, then the target function would be different from the regression function.

(b) No, the loss value of the prediction function depends on the actual probability measure, not on training observations. If we know the probability measure, then we have the best possible information for making future predictions, and we can write off a handful of training samples as anomolous.

For classification, we consider the misclassification probability

\begin{align*}L(h) = \mathbb{E}\left[\boldsymbol{1}_{\{h(\mathbf{X}) \neq Y\}}\right] = \mathbb{P}(h(\mathbf{X}) \neq Y). \end{align*}

If \mathcal{H} contains G(\mathbf{x}) = \operatorname{argmax}_c\mathbb{P}(Y=c | \mathbf{X} = \mathbf{x}), then G is the target function for this loss functional.

Find the target function for the misclassification loss in the case where \mathcal{X} = \mathbb{R}, \mathcal{Y} = \{0,1\} and the probability mass on \mathcal{X} \times \mathcal{Y} is spread out in the plane according to the one-dimensional density function

\begin{align*}f(x,y) = \begin{cases} \frac{1}{3}\mathbf{1}_{\{x \in [0,2]\}} & \text{if }y = 0 \\ \frac{1}{6}\mathbf{1}_{\{x \in [1,3]\}} & \text{if }y = 1 \\ \end{cases}\end{align*}

plot([(0,0),(2,0)], linewidth = 4, color = :MidnightBlue,
      label = "probability mass", xlims = (-2,5), ylims = (-1,2))
plot!([(1,1),(3,1)], linewidth = 2, color = :MidnightBlue,
      label = "")

Solution. If x is between 0 and 1, the target prediction function will return 0. If x is between 2 and 3, the target prediction function will return 1. If x is between 1 and 2, then the more likely outcome is 0, so r should return 0 for those values as well.

plot([(0,0),(2,0)], linewidth = 4, label = "probability mass",
     xlims = (-2,5), ylims = (-1,2))
plot!([(1,1),(3,1)], linewidth = 2, label = "")
plot!([(0,0),(2,1),(3,1)], seriestype = :steppost,
      label = "target function", linewidth = 2)

Empirical Risk Minimization

Note that neither the mean squared error nor the misclassification probability can be computed directly unless the probability measure \mathbb{P} on \mathcal{X} \times \mathcal{Y} is known. Since the goal of statistical learning is to make inferences about \mathbb{P} when it is not known, we must approximate L (and likewise also the target function h) using the training data.

The most straightforward way to do this is to replace \mathbb{P} with the empirical probability measure associated with the training data \{(\mathbf{X}_i, Y_i)\}_{i=1}^n. This is the probability measure which places \frac{1}{n} units of probability mass at (\mathbf{X}_i, Y_i), for each i from 1 to n. The empirical risk of a candidate function h \in \mathcal{H} is the risk functional evaluated with respect to the empirical measure of the training data.

A learner is a function which takes a set of training data as input and returns a prediction function \widehat{h} as output. A common way to specify a learner is to let \widehat{h} be the empirical risk minimizer (ERM), which is the function in \mathcal{H} which minimizes the empirical risk.

Suppose that \mathcal{X} = [0,1] and \mathcal{Y} = \mathbb{R}, and that the probability measure on \mathcal{X} \times \mathcal{Y} is the one which corresponds to sampling X uniformly from [0,1] and then sampling Y from \mathcal{N}(X/2 + 1, 1).

Let \mathcal{H} be the set of monic polynomials of degree six or less. Given training observations \{(\mathbf{X}_i, Y_i)\}_{i=1}^6, find the risk minimizer and the empirical risk minimizer for the mean squared error.

Solution. The risk minimizer is \mathbb{E}[Y | X], which is given in the problem statement as X/2 + 1. The empirical loss function is obtained by replacing the expectation in the loss functional with an expectation with respect to the empirical measure. So the empirical risk minimizer is the function which minimizes

\begin{align*}\sum_{i=1}^{n} (Y_i - h(\mathbf{X}_i))^2.\end{align*}

Since there is exactly one monic polynomial of degree 6 or less which passes through the points \{(\mathbf{X}_i, Y_i)\}_{i=1}^6, that function is the empirical risk minimizer.

The risk minimizer and empirical risk minimizer can be quite different.

Run this cell to see the true density, the observations, the risk minimizer and the empirical risk minimizer in the same figure:

using Plots, Distributions, Polynomials, Random
X = rand(6)
Y = X/2 .+ 1 .+ randn(6)
p = polyfit(X,Y)
heatmap(0:0.01:1, -4:0.01:4, (x,y) -> pdf(Normal(x/2+1),y), opacity = 0.8, fontfamily = "Palatino",
        color = cgrad([:white, :MidnightBlue]), xlabel = "x", ylabel = "y")
plot!(0:0.01:1, x->p(x), label = "empirical risk minimizer", color = :purple)
plot!(0:0.01:1, x->x/2 + 1, label = "actual risk minimizer", color = :Gold)
scatter!(X, Y, label = "training points", ylims = (-1,4), color = :red)

This example illustrates a phenomenon called . Although the empirical risk is small for the prediction function h we found, smallness of the empirical risk does not imply smallness of the true risk. The difference between empirical risk and the actual value of the risk functional is called generalization error (or test error).

We mitigate overfitting by building inductive bias into the model. Common approaches include

  • using a restrictive class \mathcal{H} of candidate functions,
  • regularizing: adding a term to the loss functional which penalizes model complexity, and
  • cross-validating: proposing a spectrum of candidate functions and selecting the one which performs best on withheld training observations.


  • Which method of introducing inductive bias does linear regression use?
  • Which method did we use for kernel density estimation in the statistics course?


  • Linear regression uses a restrictive class \mathcal{H} of candidate functions. The kind of overfitting that we saw in the polynomial example above is impossible if such curvy functions are excluded from consideration.
  • We used cross-validation. We suggested a family of density estimators \widehat{f}_\lambda parameterized by bandwidth \lambda, and we chose a value of \lambda by withholding a single data point at a time to estimate the integrated squared error.

The Bias-Complexity Tradeoff

Inductive bias can lead to underfitting: relevant relations are missed, so both training and actual error are larger than necessary. The tension between underfitting and overfitting is the bias-complexity (or bias-variance ) tradeoff.

This tension is fundamentally unresolvable, in the sense that all learners are equal on average (!!).

Theorem (No free lunch)
Suppose \mathcal{X} and \mathcal{Y} are finite sets, and let f denote a probability distribution on \mathcal{X} \times \mathcal{Y}. Let D be a collection of n independent observations from f, and let h_1 and h_2 be prediction functions (which associate a prediction h_j(d,\mathbf{x}) \in \mathcal{Y} to each pair (d,\mathbf{x}) where d is a set of training observations and \mathbf{x} \in \mathcal{X}). Consider the cost random variable C_j = (h_j(D,X) - Y)^2 (or C_j = \boldsymbol{1}_{\{h_j(D,X) - Y\}}) for j \in {1,2}.

The average over all distributions f of the distribution of C_1 is equal to the average over all distributions f of the distribution of C_2. (Note that it makes sense to average distributions in this context, because the distributions of C_1 and C_2 are both functions on a finite set.)

The no-free-lunch theorem implies that machine learning is effective in real life only because the learners used by practitioners possess inductive bias which is well-aligned with data-generating processes encountered in machine learning application domains. A learner which is more effective than another on a particular class of problems must perform worse on average on problems which are not in that class.

Cross-validation seems to be free of inductive bias: insisting that an algorithm perform well on withheld observations hardly seems to assume any particular structure to the probability distribution or regression function being estimated. Show that this is not the case.

Solution. The no-free-lunch theorem implies that a learner which employs cross-validation performs exactly the same on average as a learner which employs anti-cross-validation (that is, selection based on the worst performance on withheld observations).

Of course, cross-validation works much better than anti-cross-validation in practice. But for any class of problems where it performs better, it is necessarily worse on the complement of that class. So using cross-validation does implicitly assume that the problem at hand is in a class which is amenable to cross-validation. This is OK in practice because the class of problems machine learning researchers encounter in real life do have some structure that make them different from a distribution chosen uniformly at random from the set of all distributions.

Bruno Bruno