Confidence Intervals

Reading time: ~20 min

It is often of limited use to know the value of an estimator for a given collection of observations, since the single value \widehat{\theta} does not indicate how close we should expect \theta to be to \widehat{\theta}. For example, if a poll estimates that a randomly selected voter has a 46% probability of being a supporter of candidate A and a 42% probability of being a supporter of candidate B, then knowing more about the distributions of the estimators is essential if we want to know how confident we should be that candidate A is actually in the lead. Thus we introduce the idea of a confidence interval.

Definition (Confidence interval)
Consider an unknown probability distribution \nu from which we get n independent observations X_1, \ldots, X_n, and suppose that \theta is the value of some statistical functional of \nu. A confidence interval for \theta is an interval-valued function of the sample data X_1, \ldots, X_n. A confidence interval has confidence level 1-\alpha if it contains \theta with probability at least 1-\alpha.

Example
Consider a distribution \nu of the form \operatorname{Unif}([0,b]), and let T be the maximum functional (so T(\nu) = b). Consider the estimator \widehat{b} = \max(X_1, \ldots, X_{10}) based on 10 observations. Find a 90% confidence interval for b.

Solution. Every observation is at most b, so \widehat{b} \leq b, and we expect b to be a little larger than the largest observation. Therefore we look for a confidence interval of the form (\widehat{b}, \widehat{b}+\text{something}). We'd like to make the interval short so that it's informative, but we can't make it too short or else it won't trap b with the desired probability.

For example, the probability that all 10 observations will be less than 90% of b is (0.9)^{10} \approx 34.9\%. So with probability about 65.1%, the largest observation exceeds 0.9b, and in that case we trap the value of b in the interval (\widehat{b}, \widehat{b}/0.9).

To reach a confidence level of 90%, we can replace 0.9 with a variable k and solve the equation k^{10} = 0.1 to get k \approx 0.794. Thus (\widehat{b}, \widehat{b}/0.794) is the shortest 90% confidence interval of this form.
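A quick Monte Carlo check confirms this result; the sketch below uses an arbitrary true value of b (chosen only for the demo) and verifies that the interval traps b about 90% of the time.

```python
import random

random.seed(0)

b = 7.3              # "unknown" true parameter -- arbitrary choice for the demo
n = 10               # sample size, as in the example
k = 0.1 ** (1 / n)   # solves k**10 = 0.1, approximately 0.794

trials = 100_000
trapped = 0
for _ in range(trials):
    b_hat = max(random.uniform(0, b) for _ in range(n))
    if b_hat < b < b_hat / k:   # does (b_hat, b_hat / k) trap b?
        trapped += 1

print(trapped / trials)  # close to 0.9
```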

This first example was exceptionally amenable to analysis because we can solve exactly for the relevant probabilities. Estimators based on sums of observations are more typical, and in those cases we usually use the normal approximation:

Exercise
Show that if \widehat{\theta} is unbiased and approximately normally distributed, then (\widehat{\theta} - k \operatorname{se}(\widehat{\theta}), \widehat{\theta} + k \operatorname{se}(\widehat{\theta})) is an approximate 1 - 2\Phi(-k) confidence interval, where \Phi is the CDF of the standard normal distribution.

Solution. A normal random variable is within k standard deviations of its mean with probability \Phi(k) - \Phi(-k) = (1-\Phi(-k)) - \Phi(-k) = 1 - 2\Phi(-k). Since the mean of \widehat{\theta} is \theta and its standard deviation is \operatorname{se}(\widehat{\theta}), this implies that (\widehat{\theta} - k \operatorname{se}(\widehat{\theta}), \widehat{\theta} + k \operatorname{se}(\widehat{\theta})) includes \theta with probability approximately 1 - 2\Phi(-k).
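The confidence levels corresponding to a few common values of k can be computed directly from \Phi; here is a small sketch using the error function from Python's standard library (the identity \Phi(x) = \tfrac{1}{2}(1 + \operatorname{erf}(x/\sqrt{2}))).

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# confidence level 1 - 2*Phi(-k) for a few common multipliers k
for k in (1.0, 1.96, 2.58):
    print(f"k = {k}: confidence level {1 - 2 * Phi(-k):.4f}")
```

In particular, k = 1.96 gives a confidence level of about 95%, which is why 1.96 appears in the polling exercise below.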

Exercise
One thousand people are polled, and 462 of them express a preference for candidate A, while 417 express a preference for candidate B. Suppose that the 1000 preferences are chosen independently from a distribution \nu on \{\text{A}, \text{B}, \text{no preference}\} which assigns probability mass m_{\text{A}} and m_{\text{B}} to the first two outcomes. Use the normal approximation to find 95% confidence intervals for the functionals T_A(\nu) = m_{\text{A}} and T_B(\nu) = m_{\text{B}}.

Note: although it is a bit of a cheat, you can approximate m_{\text{A}} with \widehat{m}_{\text{A}} when you calculate the standard error (and similarly for B).

Solution. The standard deviation of a Bernoulli random variable with parameter m_\text{A} is \sqrt{m_\text{A}(1-m_\text{A})}. Therefore, the average of 1000 independent observations from such a distribution is within 1.96\sqrt{m_\text{A}(1-m_\text{A})/1000} units of m_\text{A} (on the number line) with probability about 95%.

Although we don't know the value of m_\text{A} in this expression, we don't lose too much by approximating it with \widehat{m}_\text{A} = 0.462. Making this substitution, we get a confidence interval of 46.2\% \pm 3.1\%. The standard error for B works out to the same value to the nearest tenth of a percent, so we get 41.7\% \pm 3.1\% as a 95% confidence interval for m_\text{B}.
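The arithmetic above can be reproduced in a few lines; this is a sketch of the normal-approximation interval with the plug-in standard error described in the hint.

```python
from math import sqrt

n = 1000
# plug-in proportions from the poll: 462 for A, 417 for B
for name, m_hat in (("A", 462 / n), ("B", 417 / n)):
    # half-width of the 95% interval: 1.96 * sqrt(m_hat * (1 - m_hat) / n)
    half_width = 1.96 * sqrt(m_hat * (1 - m_hat) / n)
    print(f"{name}: {m_hat:.1%} ± {half_width:.1%}")
```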

Warning. In the standard confidence interval framework (as described above), the value \theta of the statistical functional T is not random. Furthermore, the values of our estimators are random, even though they realize concrete real-number values once the data are collected. This is opposite to the way probability questions are usually framed, where we ask, for a given random variable, how much of its probability mass lies in a particular, fixed interval.

One way to avoid the pitfall of thinking of the parameter as random is to speak of the random confidence interval trapping the value of the statistical functional, rather than speaking of the unknown parameter as falling into the given interval.
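This trapping interpretation lends itself to simulation: if we repeatedly draw fresh samples from a distribution with a known proportion and build the interval each time, the random interval should trap the true value about 95% of the time. Here is a sketch; the true proportion 0.46 is an assumption chosen for the demo.

```python
import random
from math import sqrt

random.seed(1)

m_true = 0.46     # hypothetical true proportion (known only to the simulator)
n = 1000          # poll size
trials = 20_000
trapped = 0
for _ in range(trials):
    # draw a fresh poll and build the 95% normal-approximation interval
    m_hat = sum(random.random() < m_true for _ in range(n)) / n
    half = 1.96 * sqrt(m_hat * (1 - m_hat) / n)
    if m_hat - half < m_true < m_hat + half:
        trapped += 1

print(trapped / trials)  # typically close to 0.95
```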

Exercise
Suppose we have a 95% confidence interval [A, B] for \theta = T(\nu). For each of the following statements, determine whether it's true or false.

  1. Given observed values A and B, \theta has a 95% chance of falling within [A,B].

  2. Suppose we have a large number of draws from the distribution, and we progressively update A and B according to the observations we've made so far. Then the sequence of confidence intervals [A,B] contains \theta at least 95% of the time, on average.

Confidence bands

If we are estimating a function-valued feature of \nu rather than a single number (for example, a regression function), then we might want to provide a confidence band which traps the whole graph of the function with specified probability (we'll see an example, the DKW theorem, in the next section).

Definition (Confidence band)
Let I \subset \mathbb{R}, and suppose that T is a function from the set of distributions to the set of real-valued functions on I.

A 1-\alpha confidence band for T(\nu) is a pair of random functions y_{\textrm{min}} and y_{\textrm{max}} from I to \mathbb{R} defined in terms of n independent observations from \nu and having y_{\textrm{min}} \leq T(\nu) \leq y_{\textrm{max}} everywhere on I with probability at least 1-\alpha.
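As a preview of the DKW theorem mentioned above, here is a sketch of one way such a band can be built around the empirical CDF: the DKW inequality guarantees that the empirical CDF is within \varepsilon = \sqrt{\log(2/\alpha)/(2n)} of the true CDF everywhere, with probability at least 1-\alpha. (The sample here is drawn from a standard normal distribution purely for illustration.)

```python
import random
from math import log, sqrt

random.seed(2)

n = 500
alpha = 0.05
data = sorted(random.gauss(0, 1) for _ in range(n))  # example sample

# DKW: the ECDF is within eps of the true CDF everywhere,
# with probability at least 1 - alpha.
eps = sqrt(log(2 / alpha) / (2 * n))

def ecdf(x):
    """Empirical CDF: fraction of observations <= x."""
    return sum(obs <= x for obs in data) / n

def y_min(x):
    return max(ecdf(x) - eps, 0.0)  # clamp to [0, 1]

def y_max(x):
    return min(ecdf(x) + eps, 1.0)

print(f"band half-width eps = {eps:.4f}")
```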
