Select one of the keywords on the left…

ProbabilityExpectation and Variance

Reading time: ~40 min

We often want to distill a random variable's distribution down to a single number. For example, consider the height of an individual selected uniformly at random from a given population. This is a random variable, and communicating its distribution would involve communicating the heights of every person in the population. However, we can summarize the distribution by reporting an average height: we add up the heights of the people in the population and by the number of people.

If the random individual is selected according to some non-uniform probability distribution on the population, then it makes sense to calculate a average rather than a uniform average. The probability-weighted average of the values of a random variable is called its expectation.

The expectation \mathbb{E}[X] (or mean \mu_X) of a random variable X is the probability-weighted average of X:

\begin{align*}\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) m(\omega)\end{align*}

For example, the expected number of heads in two fair coin flips is

\begin{align*}\mathbb{E}[X] = \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 1 + \tfrac{1}{4}\cdot 1 + \tfrac{1}{4}\cdot 0 = 1.\end{align*}

There are two common ways of interpreting expected value.

  • The expectation \mathbb{E}[X] may be thought of as the value of a random game with payout X. According to this interpretation, you should be willing to pay anything less than $1 to play the game where you get a dollar for each head in two fair coin flips. For more than $1 you should be unwilling to play the game, and at $1 you should be indifferent.
  • The second way of thinking about expected value is as a long-run average. If you play the dollar-per-head two-coin-flip game a very large number of times, then your average payout per play is very likely to be close to $1.

We can test this second interpretation out:

Use the expression sum(randint(0,2) + randint(0,2) for _ in range(10**6))/10**6 mean(rand(0:1) + rand(0:1) for k=1:10^6) to play the dollar-per-head two-coin-flip game a million times and calculate the average payout in those million runs.

How close to 1 is the result typically? Choose the best answer.

Around 0.1
Around 0.01
Around 0.0001
Around 0.0000001
from numpy.random import randint
sum(randint(0,2) + randint(0,2) for _ in range(10**6))/10**6
mean(rand(0:1) + rand(0:1) for k=1:10^6)

Solution. Running the code several times, we see that the error is seldom as large as 0.01 or as small as 0.0000001. So the correct answer choice is the third one.

We will see that this second interpretation is actually a theorem in probability, called the law of large numbers. In the meantime, however, this interpretation gives us a useful tool for investigation: if a random variable is easy to simulate, then we can sample from it many times and calculate the average of the resulting samples. This will not give us the expected value exactly, but we can get as close as desired by using sufficiently many samples. This is called the Monte Carlo method of approximating the expectation of a random variable.

Use a Monte Carlo simulation to estimate the expectation of X/Y, where X and Y are independent die rolls.

import numpy as np

Solution. sum(randint(1,7)/randint(1,7) for i in range(10_000_000))/10_000_000 returns approximately 1.43. The actual mean is sum(x/y for x in range(1,7) for y in range(1,7))/36, which is \frac{343}{240} = 1.4291\overline{6}. So we can say that the Monte Carlo result with 10 million trials is quite close to the correct value.

Solution. mean(rand(1:6)/rand(1:6) for i=1:10^8) returns approximately 1.43. The actual mean is mean(x/y for x=1:6, y=1:6), which is \frac{343}{240} = 1.4291\overline{6}, so we can say that the Monte Carlo result with 100 million trials is very close to the correct value.

The following exercise confirms an intuitive fact about expectation: a random variable which is always larger than another has a larger mean. We will state this idea with "larger" replaced by its weak version "at least as large as".

Explain why \mathbb{E}[X] \leq \mathbb{E}[Y] if X(\omega) \leq Y(\omega) for all \omega \in \Omega.

Solution. If X(\omega) \leq Y(\omega) for all \omega \in \Omega, then we have

\begin{align*}\mathbb{E}[X] &= \sum_{\omega \in \Omega}X(\omega)m(\omega) \\\ &\leq \sum_{\omega \in \Omega}Y(\omega)m(\omega) \\\ &= \mathbb{E}[Y].\end{align*}

Expectation and distribution

Although the definition \mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) m(\omega) involves the probability space \Omega, we can also write a formula for expectation in terms of the probability mass function of the distribution of X:

The expectation of a discrete random variable X is equal to

\begin{align*}\mathbb{E}[X] = \sum_{x \in \mathbb{R}} x \mathbb{P}(X = x).\end{align*}

The idea is that the given formula is just a rearrangement of the terms in the definition of expectation. Let's begin by considering an example. Suppose \Omega = \{1,2,3\} with probability mass function m satisfying m(1) = 1/6, m(2) = 2/6, and m(3) = 3/6. Suppose X(1) = 5, X(2) = 5 and X(3)= 7. Then

\begin{align*}\mathbb{E}[X] = \frac{1}{6}\cdot 5 + \frac{2}{6} \cdot 5 + \frac{3}{6}\cdot 7.\end{align*}

We can group the first two terms together to get

\begin{align*}\mathbb{E}[X] = \left(\frac{1}{6} + \frac{2}{6}\right)\cdot 5 + \frac{3}{6}\cdot 7.\end{align*}

This expression is the one we would get if we wrote out

\begin{align*}\sum_{x \in \mathbb{R}} x \mathbb{P}(X = x).\end{align*}

Therefore, we can see that the two sides are the same.

Let's write this idea down in general form. We group terms on the right-hand side in the formula \mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) m(\omega) according to the value of X(\omega):

\begin{align*}\mathbb{E}[X] = \sum_{x\in \mathbb{R}}\sum_{\omega \in \Omega: X(\omega) = x} X(\omega)m(\omega).\end{align*}

Then we can replace X(\omega) with x and pull it out of the inside sum to get

\begin{align*}\mathbb{E}[X] = \sum_{x \in \mathbb{R}} x \sum_{\omega \in \Omega : X(\omega) = x} m(\omega).\end{align*}

Since \sum_{\omega \in \Omega:X(\omega) = x} m(\omega) is equal to \mathbb{P}(X = x), we get

\begin{align*}\mathbb{E}[X] = \sum_{x \in \mathbb{R}} x \mathbb{P}(X = x),\end{align*}

as desired.

The expectation of a random variable need not be finite or even well-defined. Show that the expectation of the random variable which assigns a probability mass of 2^{-n} to the point 2^{n}(for all n \geq 1) is not finite.

Consider a random variable X whose distribution assigns a probability mass of 2^{-|n|-1} to each point 2^n for n \geq1 and a probability mass of 2^{-|n|-1} to -2^n for each n \leq -1. Show that \mathbb{E}[X] is not well-defined. (Note: a sum \sum_{x \in \mathbb{R}} f(x) is not defined if \sum_{x \in \mathbb{R} : f(x) > 0} f(x) and \sum_{x \in \mathbb{R} : f(x) < 0} f(x) are equal to \infty and -\infty, respectively.)

Solution. We multiply the probability mass at each point x by the location x and sum to get

\begin{align*}\sum_{n = 1}^\infty 2^{-n}2^n = \sum_{n=1}^\infty 1 = \infty.\end{align*}

For the second distribution, the positive and negative parts of the are both infinite for the same reason. Therefore, the sum does not make sense and the mean is therefore not well-defined.

We can also work out the expectation of a function of two discrete random variables in terms of their joint distribution.

If f:\mathbb{R}^2 \to \mathbb{R}, and X and Y are discrete random variables defined on the same probability space, then

\begin{align*}\mathbb{E}[f(X,Y)] = \sum_{(x,y)\in \mathbb{R}^2} f(x,y) \mathbb{P}(X=x \text{ and }Y = y).\end{align*}

Proof. We use the same idea we used in the proof of the expectation formula: group terms in the definition of expectation according the value of the pair (X(\omega),Y(\omega)). We get

\begin{align*}\mathbb{E}[f(X,Y)] &= \sum_{\omega \in \Omega}f(X(\omega),Y(\omega)) m(\omega) \\\ &= \sum_{(x,y) \in \mathbb{R}^2} \sum_{\substack{\omega \in \Omega : \\\ X(\omega) = x \text{ and } Y(\omega) = y}} f(X(\omega),Y(\omega)) m(\omega) \\\ &= \sum_{(x,y) \in \mathbb{R}^2} f(x,y) \mathbb{P}(X = x \text{ and } Y = y).\end{align*}

We can use this theorem to show that expectation distributes across multiplication for independent random variables:

Exercise (independence product formula)
Show that \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] if X and Y are independent random variables.

Solution. Using the definition of independence, we have

\begin{align*}\mathbb{E}[XY] &= \sum_{\omega \in \Omega}XY \mathbb{P}(X = x \text{ and } Y = y) \\\ &= \sum_{(x,y) \in \mathbb{R}^2}XY \mathbb{P}(X =x) \mathbb{P}(Y = y) \\\ &= \left(\sum_{x \in \mathbb{R}} x\mathbb{P}(X=x)\right)\left(\sum_{x \in \mathbb{R}} x\mathbb{P}(X=x)\right) \\\ &= \mathbb{E}[X]\mathbb{E}[Y],\end{align*}

as desired.


The expectation of a random variable gives us some coarse information about where on the number line the random variable's probability mass is located. The variance gives us some information about how widely the probability mass is spread around its mean. A random variable whose distribution is highly concentrated about its mean will have a small variance, and a random variable which is likely to be very far from its mean will have a large variance. We define the variance of a random variable X to be the average squared distance from X to its mean:

Definition (Variance)
The variance of a random variable X is defined to be

\begin{align*}\operatorname{Var} X = \mathbb{E}[(X - \mathbb{E}[X])^2].\end{align*}

The standard deviation \sigma(X) of X is the square root of the variance:

\begin{align*}\sigma(X) = \sqrt{\operatorname{Var} X}.\end{align*}

Consider a random variable which is obtained by making a selection from the list

[0.245, 0.874, 0.998, 0.567, 0.482]

uniformly at random. Make a rough estimate of the mean and variance of this random variable just from looking at the number line. Then use Python to calculate the mean and variance exactly to see how close your estimates were.

import numpy as np

Solution. My estimate of the mean and variance are 0.6 and 0.1, respectively. The value 0.6 appears to be around the balance point of the points on the number line, and the squared deviation from 0.6 is very small for a couple of the points and as large as about 0.15 for others.

Calculating the mean exactly using m = mean([0.245, 0.874, 0.998, 0.567, 0.482]), we get a value of 0.6332. Calculating the variance exactly using mean([(a-m)^2 for a in A]) (where A is the array above), we get a value of 0.074. Therefore, my estimate was a little high.

Consider the following game. We begin by picking a number in \{0,\frac{1}{1000}, \frac{2}{1000}, \ldots, \frac{1000}{1000}\} with uniform probability. If that number is less than 1, we pick another number from the same distribution and add it to the first. We repeat this procedure until the running sum exceeds 1. Let X be the random variable whose value is the number of draws needed to end the game. Use a simulation to approximate the expected value and variance of X. Include your code in your answer as well as some discussion of your results.

Tips: rand(0:1000)/1000 returns a sample from the desired distribution. Also, it's a good idea to wrap a single run of the game into a zero-argument function.

import numpy as np

Solution. We define a function run which plays the game once, and we record the result of the game over a million runs. We estimate the mean as the mean of the resulting list, and we estimate the variance using

import numpy as np
def runs_till_over():
    s = 0
    ctr = 0
    while s < 1.0:
        s += np.random.randint(0,1001)/1000
    return ctr

A = [runs_till_over() for _ in range(1_000_000)]
μ = np.mean(A)
var = np.mean((a-μ)**2 for a in A)
function runs_till_over()
    s = 0
    ctr = 0
    while s < 1.0
        s += rand(1:1000)/1000

A = [runs_till_over() for _ in 1:1_000_000]
μ = mean(A)
var = mean((a-μ)^2 for a in A)

We get a mean of about 2.71, and a variance of about 0.77.

We can use linearity of expectation to rewrite the formula for variance in a simpler form:

\begin{align*}\operatorname{Var} X &= \mathbb{E}[X^2 - 2X \mathbb{E}[X] + \mathbb{E}[X]^2] \\\ &= \mathbb{E}[X^2] - 2\mathbb{E}[X\mathbb{E}[X]] + \mathbb{E}[X]^2 \\\ &= \mathbb{E}[X^2] - 2 \mathbb{E}[X]^2 + \mathbb{E}[X]^2 \\\ &= \mathbb{E}[X^2] - \mathbb{E}[X]^2.\end{align*}

We can use this formula to show how variance interacts with linear operations:

Show that variance satisfies the properties

\begin{align*}\left\{\begin{array}{r@{\,}c@{\,}ll} \operatorname{Var}(a X) &=& a^2 \operatorname{Var} X, \\\ \operatorname{Var}(X+Y) &=& \operatorname{Var}(X) + \operatorname{Var}(Y), \end{array}\right.\end{align*}

if a is a real number and X is a random variable, and if X and Y are independent random variables, respectively.

Proof. The first part of the statement follows easily from linearity of expectation

\begin{align*}\operatorname{Var}(aX) &= \mathbb{E}[a^2X^2] - \mathbb{E}[aX]^2\\\ &= a^2\mathbb{E}[X^2] - a^2\mathbb{E}[X]^2 \\\ & = a^2 (\mathbb{E}[X^2] - \mathbb{E}[X]^2)\\\ &= a^2 \operatorname{Var}(X).\end{align*}

Since \mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y] by linearity, we have

\begin{align*}\operatorname{Var}(X + Y) &= \mathbb{E}[(X + Y)^2] - (\mathbb{E}[X] + \mathbb{E}[Y])^2 \\\ &= \mathbb{E}[X^2 + 2XY + Y^2] - \mathbb{E}[X]^2 -2\mathbb{E}[X]\mathbb{E}[Y] -\mathbb{E}[Y]^2.\end{align*}

Rearranging and using linearity of expectation, we get

\begin{align*}\operatorname{Var}(X+Y) &= \mathbb{E}[X^2] - \mathbb{E}[X]^2 + \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 + 2(\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]) \\\ &= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2(\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]).\end{align*}

The desired result follows because if X and Y are independent, \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] by the independence product formula.

Consider the distribution which assigns a probability mass of \frac{c}{n^3} to each integer point n \geq 1, where c is equal to the reciprocal of \sum_{n=1}^\infty \frac{1}{n^3}.

Show that this distribution has a finite mean but not a finite variance.

Solution. Let X be a random variable with this distribution. Then

\begin{align*}\mathbb{E}[X] &= \sum_{n=1}^{\infty} n \cdot \frac{c}{n^3}= c \sum_{n=1}^{\infty} \frac{1}{n^2}.\end{align*}

Since the sum on the right converges by the p-test, it follows that \mathbb{E}[X] is finite. On the other hand,

\begin{align*}\mathbb{E}[X^2] &= \sum_{n=1}^{\infty} n^2 \cdot \frac{c}{n^3}= c \sum_{n=1}^{\infty} \frac{1}{n}.\end{align*}

does not converge because of the harmonic series term. Therefore \operatorname{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 is infinite.

Bruno Bruno