Statistics: Point Estimation

In the previous section, we discussed the problem of estimating a distribution given a list of independent observations from it. Now we turn to the simpler task of point estimation: estimating a single real-valued feature (such as the mean, variance, or maximum) of a distribution. We begin by formalizing the notion of a real-valued feature of a distribution.

Definition (Statistical functional)
A statistical functional is any function T from the set of distributions to [-\infty,\infty].

For example, if we define T_1(\nu) to be the mean of the distribution \nu, then T_1 is a statistical functional. Similarly, consider the maximum functional T_2(\nu) = F^{-1}(1), where F is the CDF of \nu. To give a more complicated example, we can define T_3(\nu) to be the expected value of the difference between the greatest and least of 10 independent random variables with common distribution \nu. Then T_3 is also a statistical functional.
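
A functional like T_3 rarely has a convenient closed form, but it can be approximated by simulation. The sketch below assumes, purely for illustration, that \nu is the uniform distribution on [0,1]; the names draw, range_of_ten and T3_approx are hypothetical and not part of the course code.

```julia
using Statistics

# Hypothetical sampler for ν; here ν = Unif([0,1]) is an assumption.
draw() = rand()

# Gap between the largest and smallest of 10 independent draws from ν.
function range_of_ten(draw)
    lo, hi = extrema(draw() for _ in 1:10)
    return hi - lo
end

# Monte Carlo approximation of T₃(ν): average the gap over many repetitions.
T3_approx = mean(range_of_ten(draw) for _ in 1:100_000)
```

For the uniform distribution on [0,1], the exact value is 9/11, so the simulated value should land near 0.818.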

Given a statistical functional, our goal will be to use a list of independent observations from \nu to estimate T(\nu).

Definition (Estimator)
An estimator \widehat{\theta} is a random variable which is a function of n i.i.d. random variables.

For example, the random variable X_1 + \cdots + X_n is an estimator, if X_1, \ldots, X_n are independent and identically distributed. Let's develop a general approach for defining estimators. We begin by observing that a large independent sample from a distribution gives us direct information about the CDF of the distribution.

Draw 500 independent observations from an exponential distribution with parameter 1. Plot the function \widehat{F} which maps x to the proportion of observations at or to the left of x on the number line. We call \widehat{F} the empirical CDF. Compare the graph of the empirical CDF to the graph of the CDF of the exponential distribution with parameter 1.

Solution. We can graph \widehat{F} using a step plot:

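using Plots, Distributions
n = 500
xs = range(0, 8, length=100)
# CDF of the exponential distribution with parameter 1
plot(xs, x -> 1 - exp(-x), label = "true CDF", legend = :bottomright)
# the empirical CDF takes the value i/n from the i-th sorted observation onward
plot!(sort(rand(Exponential(1), n)), (1:n)/n,
      seriestype = :steppost, label = "empirical CDF")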
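
The same plot in R with ggplot2:

library(ggplot2)
n <- 500
xvals <- seq(0, 8, length = 100)

ggplot() +
  geom_line(aes(x = xvals, y = 1 - exp(-xvals))) +
  geom_step(aes(x = sort(rexp(n)), y = (1:n)/n))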
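
The empirical CDF can also be written directly as a function rather than only plotted. The following is one possible sketch; the name empirical_cdf and the four sample values are made up for illustration.

```julia
# Empirical CDF of a sample: the returned function maps x to the
# proportion of observations that are less than or equal to x.
empirical_cdf(observations) =
    x -> count(<=(x), observations) / length(observations)

Fhat = empirical_cdf([0.3, 1.2, 0.7, 2.5])
Fhat(1.0)   # two of the four observations are ≤ 1.0, so this is 0.5
```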

This example suggests an idea for estimating \theta = T(\nu): since the unknown distribution \nu is typically close to the measure \widehat{\nu} which places mass \frac{1}{n} at each of the observations, we can build an estimator of T(\nu) by plugging \widehat{\nu} into T.

Definition (Plug-in estimator)
The plug-in estimator of \theta = T(\nu) is \widehat{\theta} = T(\widehat{\nu}).

Find the plug-in estimator of the mean of a distribution. Find the plug-in estimator of the variance.

Solution. The plug-in estimator of the mean is the mean of the empirical distribution, which is the average of the locations of the observations. We call this the sample mean:

\begin{align*}\overline{X} = \frac{X_1 + \cdots + X_n}{n}.\end{align*}

Likewise, the plug-in estimator of the variance is the sample variance

\begin{align*}S^2 = \frac{1}{n}\left( (X_1 - \overline{X})^2 + (X_2 - \overline{X})^2 + \cdots + (X_n - \overline{X})^2\right).\end{align*}
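
As a quick check of these two formulas, the sketch below computes both plug-in estimates for a small made-up sample. The function names plugin_mean and plugin_var are hypothetical; the call var(xs, corrected = false) is the Statistics standard library's version of the same divide-by-n variance.

```julia
using Statistics

# Plug-in estimates: apply the mean and variance functionals to the
# empirical distribution, which puts mass 1/n on each observation.
plugin_mean(xs) = sum(xs) / length(xs)

function plugin_var(xs)
    m = plugin_mean(xs)
    return sum((x - m)^2 for x in xs) / length(xs)
end

xs = [2.0, 4.0, 4.0, 6.0]      # made-up sample for illustration
plugin_mean(xs)                 # 4.0
plugin_var(xs)                  # 2.0
var(xs, corrected = false)      # same value from the Statistics library
```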

Ideally, an estimator \widehat{\theta} is close to \theta with high probability. We will see that we can decompose the question of whether \widehat{\theta} is close to \theta into two sub-questions: is the mean of \widehat{\theta} close to \theta, and is \widehat{\theta} close to its mean with high probability?


Definition (Bias)
The bias of an estimator \widehat{\theta} is

\begin{align*}\mathbb{E}[\widehat{\theta}] - \theta.\end{align*}

An estimator is said to be biased if its bias is nonzero and unbiased if its bias is zero.

Consider the estimator

\begin{align*}\widehat{\theta} = \max(X_1, \ldots, X_n)\end{align*}

of the maximum functional. Assuming that the distribution is described by a density function (in other words, it is continuous rather than discrete), show that \widehat{\theta} is biased.

Solution. If \nu is a continuous distribution, then the probability of the event \{X_i < T(\nu)\} is 1 for all i=1,2,\ldots,n. This implies that \widehat{\theta} < T(\nu) with probability 1. Taking expectation of both sides, we find that \mathbb{E}[\widehat{\theta}] < T(\nu). Therefore, this estimator has negative bias.

We can numerically experiment to approximate the bias of this estimator in a specific instance. For example, if we estimate the maximum of a uniform distribution on [0,1] with the sample maximum of 100 observations, we get a bias of approximately

using Statistics
mean(maximum(rand() for _ in 1:100) - 1 for _ in 1:10_000)

which is about -0.0098. We can visualize these sample maximum estimates with a histogram:

using Plots
histogram([maximum(rand() for _ in 1:100) for _ in 1:10_000],
          label = "sample maximum", xlims = (0,1),
          legend = :topleft)
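
The simulated bias can be compared with an exact calculation. For n independent observations from the uniform distribution on [0,1], the sample maximum is at most x exactly when every observation is at most x, which gives the following worked derivation (a side calculation, not part of the original solution):

```latex
\begin{align*}
\mathbb{P}(\widehat{\theta} \leq x) &= x^n \quad \text{for } 0 \leq x \leq 1, \\
\mathbb{E}[\widehat{\theta}] &= \int_0^1 x \cdot n x^{n-1} \, dx = \frac{n}{n+1}, \\
\mathbb{E}[\widehat{\theta}] - \theta &= \frac{n}{n+1} - 1 = -\frac{1}{n+1}.
\end{align*}
```

With n = 100 the bias is -1/101, or about -0.0099, in line with the simulation.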

Standard Error

Zero or small bias is a desirable property of an estimator: it means that the estimator is accurate on average. The second desirable property of an estimator is for the probability mass of its distribution to be concentrated near its mean:

Definition (Standard error)
The standard error \operatorname{se}(\widehat{\theta}) of an estimator \widehat{\theta} is its standard deviation.

Find the standard error of the sample mean if the distribution \nu has variance \sigma^2.

Solution. We have

\begin{align*}\operatorname{Var}\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) = \frac{1}{n^2}(n\operatorname{Var} X_1) = \frac{\sigma^2}{n}.\end{align*}

Therefore, the standard error is \sigma/\sqrt{n}.

We can see how the standard error decreases with n by computing the sample mean for many independent datasets and plotting the resulting histogram:

n = 100
histogram([mean(rand() for _ in 1:n) for _ in 1:10_000],
          label = "sample mean, $n observations",
          xlims = (0,1), size = (600,400))
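
The formula \sigma/\sqrt{n} can also be checked numerically. The sketch below assumes observations from the uniform distribution on [0,1], whose variance is 1/12; the helper name simulated_se and the choice of 10,000 repetitions are arbitrary.

```julia
using Statistics

# Unif([0,1]) has variance 1/12, so the predicted standard error of the
# sample mean of n observations is σ/√n with σ = 1/√12.
σ = 1 / sqrt(12)

# Standard deviation of the sample mean across many independent datasets.
simulated_se(n) = std([mean(rand(n)) for _ in 1:10_000])

for n in (25, 100, 400)
    println("n = $n: simulated ", round(simulated_se(n), digits = 4),
            ", predicted ", round(σ / sqrt(n), digits = 4))
end
```

Each fourfold increase in n should roughly halve both columns.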

Mean Squared Error

If the expectation of an estimator of \theta is close to \theta and if the estimator is close to its average with high probability, then it makes sense that \widehat{\theta} and \theta are close to each other with high probability. We can measure the discrepancy between \widehat{\theta} and \theta directly by computing their average squared difference:

Definition (Mean squared error)
The mean squared error of an estimator \widehat{\theta} is \mathbb{E}[(\widehat{\theta} - \theta)^2].

As advertised, the mean squared error decomposes as a sum of squared bias and squared standard error:

The mean squared error of an estimator \widehat{\theta} is equal to its variance plus its squared bias:

\begin{align*}\mathbb{E}[(\widehat{\theta} - \mathbb{E}[\widehat{\theta}])^2] + (\mathbb{E}[\widehat{\theta}] - \theta)^2.\end{align*}

Proof. The idea is to add and subtract the mean of \widehat{\theta}. We find that

\begin{align*}\mathbb{E}[(\widehat{\theta} - \theta)^2] &= \mathbb{E}[(\widehat{\theta} - \mathbb{E}[\widehat{\theta}] + \mathbb{E}[\widehat{\theta}] - \theta)^2] \\ &= \mathbb{E}[(\widehat{\theta} - \mathbb{E}[\widehat{\theta}])^2] + 2\mathbb{E}[(\widehat{\theta} - \mathbb{E}[\widehat{\theta}])(\mathbb{E}[\widehat{\theta}] - \theta)] + (\mathbb{E}[\widehat{\theta}] - \theta)^2.\end{align*}

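The middle term is zero by linearity of expectation: the factor \mathbb{E}[\widehat{\theta}] - \theta is a constant, and \mathbb{E}[\widehat{\theta} - \mathbb{E}[\widehat{\theta}]] = 0. The first and third terms represent the variance and squared bias, respectively, of \widehat{\theta}, so this concludes the proof.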
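
The decomposition can be illustrated with simulated estimates. The sketch below reuses the sample-maximum estimator for the uniform distribution on [0,1] (so \theta = 1 is an assumption of the example). The identity holds exactly for the simulated estimates, up to floating-point rounding, provided the variance divides by the number of estimates rather than by one less.

```julia
using Statistics

θ = 1.0                                        # true maximum of Unif([0,1])
estimates = [maximum(rand(100)) for _ in 1:10_000]

mse      = mean((estimates .- θ) .^ 2)         # average squared error
bias     = mean(estimates) - θ
variance = var(estimates, corrected = false)   # divide by the number of estimates

mse, variance + bias^2                         # the two values agree up to rounding
```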

If the bias and standard error of an estimator both converge to 0, then the estimator is consistent:

Definition (Consistent)
An estimator is consistent if \widehat{\theta} converges to \theta in probability as n\to\infty.

Show that the plug-in maximum estimator \widehat{\theta}_n = \max(X_1, \ldots, X_n) of \theta = T(\nu) = F^{-1}(1) is consistent, assuming that the distribution belongs to the parametric family \{\operatorname{Unif}([0,b]) : b > 0\}.

Solution. For 0 < \epsilon < \theta, the probability that \widehat{\theta}_n is more than \epsilon units from \theta is equal to the probability that every observation is less than \theta - \epsilon, which by independence is equal to

\begin{align*}\left(\frac{\theta - \epsilon}{\theta}\right)^n.\end{align*}

This converges to 0 as n \to \infty, since \frac{\theta - \epsilon}{\theta} < 1.
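
A simulation makes the convergence concrete. The sketch below assumes b = \theta = 2 and \epsilon = 0.1 for illustration; miss_rate is a hypothetical helper that estimates the probability of missing \theta by more than \epsilon.

```julia
# Assumed values for illustration: θ = 2 and tolerance ε = 0.1.
θ, ε = 2.0, 0.1

# Fraction of simulated datasets of size n whose sample maximum falls
# more than ε below θ. θ * rand(n) draws n observations from Unif([0,θ]).
miss_rate(n; trials = 10_000) =
    count(maximum(θ * rand(n)) < θ - ε for _ in 1:trials) / trials

for n in (10, 50, 200)
    println("n = $n: simulated ", miss_rate(n),
            ", exact ", ((θ - ε) / θ)^n)
end
```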

The figure below summarizes the four possibilities for combinations of high or low bias and variance.

An estimator of \theta has high or low bias depending on whether its mean is far from or close to \theta. It has high or low variance depending on whether its mass is spread out or concentrated.

Show that the sample variance S^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2 is biased.

Solution. We will perform the calculation for n = 3. It may be generalized to other values of n by replacing 3 with n and 2 with n-1. We have

\begin{align*}\mathbb{E}[S^2] = \frac{1}{3}\mathbb{E}\left[ \left(\frac{2}{3}X_1 - \frac{1}{3}X_2 - \frac{1}{3}X_3\right)^2 + \left(\frac{2}{3}X_2 - \frac{1}{3}X_3 - \frac{1}{3}X_1\right)^2 + \left(\frac{2}{3}X_3 - \frac{1}{3}X_1 - \frac{1}{3}X_2\right)^2\right].\end{align*}

Squaring out each trinomial, we get \frac{4}{9}X_1^2 from the first term and \frac{1}{9}X_1^2 from each of the other two. So altogether the X_1^2 term is \frac{6}{9}X_1^2. By symmetry, the same is true of X_2^2 and X_3^2. For cross-terms, we get -\frac{4}{9}X_1X_2 from the first squared expression, -\frac{4}{9}X_1X_2 from the second, and \frac{2}{9}X_1X_2 from the third. Altogether, we get -\frac{6}{9}X_1X_2. By symmetry, the remaining two terms are -\frac{6}{9}X_1X_3 -\frac{6}{9}X_2X_3.

Recalling that \operatorname{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 for any random variable X, we have \mathbb{E}[X_1^2] = \mu^2 + \sigma^2, where \mu and \sigma are the mean and standard deviation of the distribution of X_1 (and similarly for X_2 and X_3). By independence, each cross-term has expectation \mathbb{E}[X_1X_2] = \mathbb{E}[X_1]\mathbb{E}[X_2] = \mu^2. So we have

\begin{align*}\mathbb{E}[S^2] &= \frac{1}{3}\mathbb{E}\left[\frac{6}{9}(X_1^2 + X_2^2 + X_3^2) - \frac{6}{9}(X_1X_2 +X_1X_3 + X_2X_3)\right] \\ &= \frac{1}{3}\cdot\frac{6}{9}(3(\sigma^2 + \mu^2) - 3\mu^2) = \frac{2}{3}\sigma^2. \end{align*}

If we repeat the above calculation with n in place of 3, we find that the resulting expectation is \frac{n-1}{n}\sigma^2.
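
The factor \frac{n-1}{n} shows up clearly in simulation. The sketch below assumes standard normal observations (so \sigma^2 = 1) and n = 3; the variable name plugin_vars is hypothetical.

```julia
using Statistics

n = 3
# Plug-in sample variance (divide by n) of n standard normal observations,
# computed for many independent datasets.
plugin_vars = [var(randn(n), corrected = false) for _ in 1:100_000]

mean(plugin_vars)   # close to (n - 1)/n = 2/3 rather than σ² = 1
```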

Motivated by this example, we define the unbiased sample variance

\begin{align*}\widehat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2.\end{align*}

Let's revisit the adult height distribution from the first section. We observed the human adult heights shown below (in inches). If we want to approximate the height distribution with a Gaussian, it seems reasonable to estimate \mu and \sigma^2 using the unbiased estimators \widehat{\mu} = \frac{1}{n}(X_1 + \cdots + X_n) and \widehat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\widehat{\mu})^2.

Calculate these estimators for the height data.

heights = [71.54, 66.62, 64.11, 62.72, 68.12,
           69.07, 64.82, 61.92, 68.45, 66.3,
           66.99, 62.2, 61.04, 63.31, 68.94,
           66.27, 66.8, 71.7, 68.93, 66.65,
           71.97, 60.27, 62.81, 70.64, 71.61,
           65.51, 63.1, 66.21, 68.23, 72.32,
           62.29, 63.12, 64.94, 71.89, 65.48,
           63.66, 56.11, 65.63, 61.26, 65.12,
           66.93, 68.51, 67.2, 71.57, 66.65,
           59.77, 61.51, 63.25, 69.12, 64.98]

Solution. Julia's Statistics standard library has functions for this (var divides by n-1 by default):

using Statistics
mean(heights), var(heights)

We could also write our own:

μ = sum(heights)/length(heights)
σ̂² = 1/(length(heights)-1) * sum((h-μ)^2 for h in heights)

In the next section, we will develop an important extension of point estimation which supplies additional information about how accurate we expect our point estimate to be.
