Select one of the keywords on the left…

StatisticsEstimating Joint Densities

Reading time: ~50 min

The following scenario will be our running example throughout this section. We will begin by assuming that we know the joint distribution of two random variables X and Y, and then we will explore what happens when we drop that assumption.

Suppose that X is a random variable which represents the number of hours a student studies for an exam, and Y is a random variable which represents their performance on the exam. Suppose that the joint density of X and Y is given by

\begin{align*} f(x,y) = \frac{3}{4000(3/2)\sqrt{2\pi}}x(20-x)e^{-\frac{1}{2(3/2)^2}\left(y - 2 - \frac{1}{50}x(30-x)\right)^2}.\end{align*}

Given that a student's number of hours of preparation is known to be x, what is the best prediction for Y (as measured by mean squared error)?

Figure 1.1: The probability density function of the joint distribution of (X,Y), where X is the number of hours of studying and Y is the exam score.

Solution. The best guess of a random variable (as measured by mean squared error) is the expectation of the random variable. Likewise, the best guess of Y given \{X=x\} is the conditional expectation of Y given \{X=x\}. Inspecting the density function, we recognize that restricting it to a vertical line at position x gives an expression which is proportional to the Gaussian density centered at 2 + \frac{1}{50}x(30-x). Therefore,

\begin{align*}\mathbb{E}[Y | X=x] = 2 + \frac{1}{50}x(30-x).\end{align*}

A graph of this function is shown in red in the figure below:

Figure 1.2: The conditional expectation of Y given \{X = x\}, as a function of x.

We call r(x) = \mathbb{E}[Y | X = x] the regression function for the problem of predicting Y given the value of X.

Given the distribution of (X,Y), finding the regression function is a probability problem. However, in practice the probability measure is typically only known empirically, that is, by way of a collection of independent observations sampled from the measure. For example, to understand the relationship between exam scores and studying hours, we would record the values (X_i,Y_i) for many past exams i \in \{1,2,\ldots,n\} (see Figure 1.3 below).

Figure 1.3: One thousand independent samples from the joint distribution of (X,Y).

Describe an example of a random experiment where the probability measure may be reasonably inferred without the need for experimentation or data collection. What is the difference between such an experiment and one for which an empirical approach is required?

Solution. There are many random experiments for which it is reasonable to use a measure derived from principle rather than from experimentation. For example, if we are drawing a card from a well-shuffled deck, or drawing a marble from a well-stirred urn, then we would assume that the probability measure is uniform. This assumption would lead to more accurate predictions than if we used an inferred measure based on repeated observations of the experiment. By extension, if we have many random variables whose distributions are known from principle and which are known from principle to be independent, then we know that their average is approximately normally distributed.

The key difference between these examples and those requiring an empirical approach is the use of symmetry. There is no a priori reason to believe that, for example, the possible scores on an exam are equally likely, or that an exam score is a sum of independent random variables.

You can convince yourself that r, and indeed the whole joint distribution of X and Y, is approximately recoverable from the samples shown in Figure 1.3: if we draw a curve roughly through the middle of the point cloud, it would be pretty close to the graph of r. If we shade the region occupied by the point cloud, darker where there are more points and lighter where there are fewer, it will be pretty close to the heat map of the density function.

We will want to sharpen this intuition into a computable algorithm, but the picture gives us reason to be optimistic.

In the context of Figure 1.3, why doesn't it work to approximate the conditional distribution of Y given \{X = x\} using all of the samples which are along the vertical line at position x?

Solution. Because for almost all values of x, there will be no points along the vertical line at position x!

Kernel Density Estimation

The empirical distribution associated with a collection of observations (x,y) from two random variables X and Y is the probability measure which assigns mass \frac{1}{n} to each location (x,y). The previous exercise illustrated shortcomings of using the empirical distribution directly for this problem: the mass is not appropriately spread out in the rectangle. It makes more sense to conclude from the presence of an observation at a point (x,y) that the unknown distribution of (X,Y) has some mass near (x,y), not necessarily exactly at (x,y).

We can capture this idea mathematically by spreading out \frac{1}{n} units of probability mass around each sample (x_i,y_i). We have to choose a function—called the kernel—to specify how the mass is spread out.

There are many kernel functions in common use; we will base our kernel on the tricube function, which is defined by

\begin{align*} D(u) = \left\{ \begin{array}{cl} \frac{70}{81}(1-|u|^3)^3 & \text{if }|u| < 1\\ 0 & \text{if }|u| \geq 1. \end{array}\right. \end{align*}

We define D_\lambda(u) = \frac{1}{\lambda} D\left(\frac{u}{\lambda}\right); The parameter \lambda > 0 can be tuned to adjust how tightly the measure is concentrated at the center. Experiment with the slider below to get a feel for how changing \lambda affects the shape of the graph of D_\lambda.


We define the kernel function K_\lambda as a product of two copies of D_\lambda, one for each coordinate (see Figure 1.4 for a graph):

\begin{align*}K_\lambda(x,y) = D_\lambda(x)D_\lambda(y).\end{align*}

Figure 1.4: The graph of the kernel K_\lambda(x,y) with \lambda = 1.

Is the probability mass more or less tightly concentrated around the origin when \lambda is small?

Solution. Note that K_\lambda(x,y) as soon as either x or y is larger than \lambda. Therefore, all of the probability mass is less than \lambda \sqrt{2} units from the origin. So small \lambda corresponds to tight concentration of the mass near zero.

Now, let's place small piles of probability mass with the shape shown in Figure 1.4 at each sample point. The resulting approximation of the joint density f(x,y) is

\begin{align*}\widehat{f}_\lambda(x,y) = \frac{1}{n}\sum_{i=1}^n K_\lambda(x-X_i,y-Y_i).\end{align*}

Graphs of \widehat{f}_\lambda for various values of \lambda are shown below. We can see that the best match is going to be somewhere between the \lambda \to 0 and \lambda\to\infty extremes (which are too concentrated and too diffuse, respectively).

Figure 1.5: The density estimator \widehat{f}_\lambda for \lambda = 0.25, 1, 3, and 10. The last figure shows the actual density function.

Estimate the best value of \lambda by eyeballing the figure above (the values of \lambda in the first four figures are 0.25, 1, 3, and 10. The last figure shows the actual density).

Solution. In the \lambda=1 picture, it looks like the estimated measure is taking individual points too seriously, since there are lots of points with more mass around them than in the gaps between observations. In the \lambda = 10 picture, the mass looks way too spread out. Among the ones shown, \lambda = 3 appears to be the best.

We can take advantage of our knowledge of the density function to determine which \lambda value works best. In practice, we would not have access to this information, but let's postpone dealing with that issue till the next section.

We measure the difference between f and \widehat{f}_\lambda by evaluating both functions at each point in a square grid and finding the sum of the squares of these differences. This sum is approximately proportional to .

using LinearAlgebra, Statistics, Roots, Optim, Plots, Random
n = 1000 # number of samples to draw

# the true regression function
r(x) = 2 + 1/50*x*(30-x)
# the true density function
σy = 3/2  
f(x,y) = 3/4000 * 1/√(2π*σy^2) * x*(20-x)*exp(-1/(2σy^2)*(y-r(x))^2)

# x values and y values for a grid
xs = 0:1/2^3:20
ys = 0:1/2^3:10

# F(t) is the CDF of X
F(t) = -t^3/4000 + 3t^2/400

"Sample from the distribution of X (inverse CDF trick)"
function sampleX()
    U = rand()

"Sample from joint distribution of X and Y"
function sampleXY(r,σ)
    X = sampleX()
    Y = r(X) + σ*randn()

# Sample n observations
sample = [sampleXY(r,σy) for i=1:n]

D(u) = abs(u) < 1 ? 70/81*(1-abs(u)^3)^3 : 0 # tri-cube function
D(λ,u) = 1/λ*D(u/λ) # scaled tri-cube
K(λ,x,y) = D(λ,x) * D(λ,y) # kernel
kde(λ,x,y,observations) = sum(K(λ,x-Xi,y-Yi) for (Xi,Yi) in observations)/length(observations)
# `optimize` takes functions which accept vector arguments, so we treat
# λ as a one-entry vector
L(λ) = sum((f(x,y) - kde(λ,x,y,sample))^2 for x=xs,y=ys)*step(xs)*step(ys)

# minimize L using the BFGS method
λ_best = optimize(λ->L(first(λ)),[1.0],BFGS())

We find that the best \lambda value comes out to about \lambda = 1.92. This is significantly smaller than our "eyeball" value of 3 from earlier, but not terribly far off.


Since we only know the density function in this case because we're in the controlled environment of an exercise, we need to think about how we could have chosen the optimal \lambda value without that information.

One idea, which we will find is applicable in many statistical contexts, is to reserve one observation and form an estimator which only uses the other n-1 observations. We evaluate this density approximation at the reserved sample point, and we repeat all of these steps for each sample in the data set. If the resulting density value is consistently very small, then our density estimator is being consistently "surprised" by the location of the reserved sample. This suggests that our \lambda value is too large or too small. This idea is called cross-validation.

Experiment with the sliders below to adjust the bandwidth \lambda and the omitted point i to find a value of \lambda you find satisfactory (set i to 7 to include all six points).



The density value at the omitted point is small when \lambda is too small because , and the value is small when \lambda is too large because .

We've decided we want the densities at the omitted points to be not too small, but we haven't specified an objective function to say exactly what we mean by that. It turns out that we can do this in a principled way, starting from the idea of integrated squared error.

Let's aim to minimize the loss (or error, or risk) of our estimator, which we define to be the integrated squared difference between \widehat{f}_\lambda and the true density f. We can write this integral as

\begin{align*}\int (\widehat{f}_\lambda - f)^2 = \int \widehat{f}_\lambda^{2} - 2 \int \widehat{f}_\lambda f + \int f^2,\end{align*}

using . The third term does not involve \widehat{f}, so minimizing \int (\widehat{f}_\lambda - f)^2 is the same as minimizing \int \widehat{f}_\lambda^{2} - 2 \int \widehat{f}_\lambda f, which we will call J(\lambda).

Recalling the expectation formula

\begin{align*}\mathbb{E}[g(X,Y)] = \int g(x,y) f_{X,Y}(x,y) \mathrm{d} x \mathrm{d} y,\end{align*}

we recognize the second term in the top equation as , where (X,Y) is an observation from the true density f which is independent of \widehat{f}_\lambda. To get observations which are independent of the estimator, we will define \widehat{f}_{\lambda}^{(-i)} to be the density estimator obtained using the observations other than the i\text{th} one. Since the observations (X_i,Y_i) are drawn from the joint density of (X,Y) and are assumed to be independent, we can suggest the approximation

\begin{align*}\mathbb{E}[\widehat{f}_\lambda(X,Y)] \approx \frac{1}{n} \sum_{i=1}^n \widehat{f}_{\lambda}^{\:(-i)}(X_i,Y_i),\end{align*}

We call

\begin{align*}\widehat{J}(\lambda) = \int \widehat{f}_\lambda^{2} - \frac{2}{n} \sum_{i=1}^n \widehat{f}_{\lambda}^{\:(-i)}(X_i,Y_i)\end{align*}

the cross-validation loss estimator. Let's find the value of \lambda which minimizes \widehat{J}(\lambda) for the present example:

"Evaluate the summation Σᵢ f⁽⁻ⁱ⁾(Xᵢ,Yᵢ) in J(λ)'s second term"
function kdeCV(λ,i,observations)
    x,y = observations[i]
    newobservations = copy(observation)

# first line approximates ∫f̂², the second line approximates -(2/n)∫f̂f
J(λ) = sum([kde(λ,x,y,observations)^2 for x=xs,y=ys])*step(xs)*step(ys) -
        2/length(observations)*sum(kdeCV(λ,i,observations) for i=1:length(observations))
λ_best_cv = optimize(λ->J(first(λ)),[1.0],BFGS())

The cross-validated minimizing value is \lambda = 1.88. Recall that the true minimizer is \lambda =1.92. So in this case, cross validation gets us quite close to the optimal \lambda value without needing to know the actual density. In fact, the cross-validation estimator performs well in general given enough data, in the following sense:

Theorem (Stone's Theorem)
Suppose that f is a bounded probability density function. Let \widehat{f}^{\mathrm{CV}}_n be the kernel density estimator with bandwidth \lambda obtained by cross-validation, and let \widehat{f}_n be the kernel density estimator with optimal bandwidth \lambda^{\mathrm{min}}_n. Then

\begin{align*} \frac{\int (f - \widehat{f}^{\mathrm{CV}}_n)^2}{ \int (f-\widehat{f}_n)^2} \end{align*}

converges in probability to 1 as n\to\infty. Furthermore, there are constants C_1 and C_2 such that \int (f-\widehat{f}_n)^2\approx C_1n^{-4/5} and \lambda^{\mathrm{min}}_n \approx C_2n^{-1/5} for large n.

Nonparametric Regression

With an estimate of the joint density of X and Y in hand, we can turn to the problem of estimating the regression function r(x) = \mathbb{E}[Y | X = x]. If we restrict the density estimate \widehat{f}_\lambda to the vertical line at position x, we find that the conditional density is

\begin{align*} \widehat{f}_{Y | X = x}(y) = \frac{\displaystyle{\frac{1}{n}\sum_{i=1}^n K_\lambda(x-X_i,y-Y_i)}}{\displaystyle{ \int \frac{1}{n}\sum_{i=1}^n K_\lambda(x-X_i,y-Y_i) \mathrm{d} y}}. \end{align*}

Figure 1.6: A kernel density estimator with 16 observations and \lambda = 1.

So the conditional expectation of Y given \{X = x\} is

\begin{align*} \widehat{r}(x) = \int y \widehat{f}_{Y | X = x}(y) \mathrm{d} y = \frac{\displaystyle{\int y \frac{1}{n}\sum_{i=1}^n K_\lambda(x-X_i,y-Y_i) \mathrm{d} y}}{\displaystyle{ \int \frac{1}{n}\sum_{i=1}^n K_\lambda(x-X_i,y-Y_i) \mathrm{d} y}}. \end{align*}

Let's try to simplify this expression. Looking at Figure 1.6, we can see by the symmetry of the function D that each observation (X_i,Y_i) contributes Y_iD_\lambda(x-X_i) to the numerator of the above equation. To arrive at the same conclusion by manipulating the formula directly, we write

\begin{align*} \int \frac{1}{n}\sum_{i=1}^n K_\lambda(x-X_i,y-Y_i) \mathrm{d} y = \frac{1}{n} \sum_{i=1}^n D_\lambda(x-X_i) \int yD_\lambda(y-Y_i) \mathrm{d} y. \end{align*}

Then \int yD(y-Y_i) \mathrm{d} y is the average of the probability measure with density D centered at Y_i. Since D is symmetric, this average is the center point Y_i.

Applying a similar idea to the denominator (note that instead of Y_i, we get the total mass 1 from integrating D(y-Y_i)), we find that

\begin{align*} \widehat{r}(x) = \frac{\displaystyle{\sum_{i=1}^nD_\lambda(x-X_i)Y_i}}{\displaystyle{\sum_{i=1}^nD_\lambda(x-X_i)}}. \end{align*}

Figure 1.7: The Nadaraya-Watson estimator \widehat{r}.

The graph of this function is shown in Figure 1.7. We see that it matches the graph of r quite closely except near the ends of the interval.

λ = first(λ_best_cv.minimizer)
r̂(x) = sum(D(λ,x-Xi)*Yi for (Xi,Yi) in observations)/sum(D(λ,x-Xi) for (Xi,Yi) in observations)
scatter([x for (x,y) in observations],
        [y for (x,y) in observations],label="observations",
        markersize=2, legend = :bottomright)
plot!(0:0.2:20, r̂, label = L"\hat{r}", ylims = (0,10), linewidth = 3)

The approximate integrated squared error of this estimator is sum((r(x)-r̂(x))^2 for x in xs)*step(xs) = 1.90.

Kernel density estimation is just one approach to estimating a joint density, and the Nadaraya-Watson estimator is just one approach to estimating a regression function. In the machine learning course following this one, we will explore a wide variety of machine learning models which take quite different approaches, and each will have its own strengths and weaknesses.

Bruno Bruno