Probability: Common Distributions

Several specific probability distributions arise commonly in real-world applications. Learning the properties of these distributions, and how those properties lead to their frequent appearance, can help us build random models and reason about properties of random systems that involve these distributions. In this section, we will explore several such distributions.

Bernoulli distribution

Suppose we conduct an experiment with exactly two outcomes, which we will encode as 0 and 1. For example, consider the following scenarios:

You flip a coin and it comes up heads (1) or tails (0)
Someone's position on a political issue is either positive (1) or negative (0)
Someone can either be healthy (1) or sick (0)
In an online survey, a user answers either true (1) or false (0)

The distribution of the result of such an experiment is governed by a single parameter $p \in [0,1]$, which is the probability of the outcome encoded as 1. The probability of the other outcome is $1-p$, since one of the two outcomes must occur. It is customary to think of the outcomes 1 and 0 as success and failure, respectively, in which case $p$ may be referred to as the success probability. A sequence of independent Bernoulli random variables with the same success probability $p$ is referred to as a sequence of Bernoulli trials.

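We write $X \sim \text{Bernoulli}(p)$ to mean that $X$ is Bernoulli distributed with success probability $p$. The expected value of a $\text{Bernoulli}(p)$ random variable $X$ is

$$E[X] = (1)(p) + (0)(1-p) = p,$$

and its variance is

$$E[X^2] - E[X]^2 = p - p^2 = p(1-p).$$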

Exercise
Consider a sum $S$ of 10 independent Bernoulli random variables with success probability $p = 0.36$.

Find the mean and variance of $S$.
Find the value of $k \in \{0, 1, \ldots, 10\}$ which maximizes $P(S = k)$. Hint: write down an expression for $P(S = k)$ and then use Julia to find its maximum value.

Solution.

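For $i \in \{1, 2, \ldots, 10\}$ let $X_i \sim \text{Bernoulli}(0.36)$; then $S = X_1 + X_2 + \cdots + X_{10}$. Therefore,

$$E[S] = \sum_{i=1}^{10} E[X_i] = 10 \cdot 0.36 = 3.6$$

and, by independence,

$$\text{Var}(S) = \sum_{i=1}^{10} \text{Var}(X_i) = 10 \cdot 0.36 \cdot (1 - 0.36) = 2.304.$$

Observe that for $k \in \{0, 1, \ldots, 10\}$ we have $S = k$ if and only if there are exactly $k$ successes. Now, there are $\binom{10}{k}$ ways in which we can have $k$ successes, and each of them occurs with probability $(0.36)^k(1-0.36)^{10-k}$, by independence. Therefore

$$P(S = k) = \binom{10}{k}(0.36)^k(1-0.36)^{10-k}.$$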

We can use Julia's built-in binomial function and an array comprehension as follows

maximum([binomial(10, k)*0.36^k*(1 - 0.36)^(10 - k) for k in 0:10])

to find that the value of $k$ that maximizes $P(S = k)$ is $k = 3$ and the maximum is approximately $0.246$.
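The call to maximum reports only the largest probability, not where it occurs. A minimal sketch of one way to recover the maximizing index as well, using Julia's built-in findmax (the names pmf, maxval, idx and kmax are illustrative and not from the original text):

```julia
# Probability mass function of a sum of 10 independent Bernoulli(0.36) variables
pmf = [binomial(10, k) * 0.36^k * (1 - 0.36)^(10 - k) for k in 0:10]

# findmax returns the largest entry together with its (1-based) index
maxval, idx = findmax(pmf)

# The array starts at k = 0, so shift the index down by one
kmax = idx - 1

println((kmax, maxval))
```

The shift by one matters because Julia arrays are 1-indexed while the number of successes starts at 0.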

The binomial distribution

Example
What is the probability of rolling exactly 18 sixes in 100 independent rolls of a fair die?

Solution. There are many ways to roll 18 sixes. We could roll 18 sixes followed by 82 non-sixes, and that happens with probability

$$\left(\frac{1}{6}\right)^{18}\left(\frac{5}{6}\right)^{82}$$

by independence. Similarly, the outcome consisting of 2 non-sixes, then 9 sixes, then 14 non-sixes, then 9 more sixes, and finally 66 non-sixes also has probability $(1/6)^{18}(5/6)^{82}$. In fact, for every choice of 18 positions in which the sixes may fall, there is an outcome with exactly 18 sixes whose probability is $(1/6)^{18}(5/6)^{82}$. Since there are $\binom{100}{18}$ of these outcomes, the probability that one of them occurs is

$$\binom{100}{18}\left(\frac{1}{6}\right)^{18}\left(\frac{5}{6}\right)^{82}.$$

Generally, $n$ independent trials with success probability $p$ will lead to $k$ total successes with probability

$$\binom{n}{k}p^k(1-p)^{n-k}.$$

This distribution is called the binomial distribution and is denoted $\text{Binomial}(n, p)$.
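As a rough numerical companion to the dice example, the sketch below evaluates the binomial formula for 18 sixes in 100 rolls in two ways. It assumes the Distributions package is installed; the variable names are illustrative. Arbitrary-precision integers are used because the binomial coefficient for 100 and 18 does not fit in a 64-bit integer.

```julia
using Distributions

# Direct evaluation; big(100) avoids Int64 overflow in the binomial coefficient
n, k = big(100), 18
p = big(1) / 6
direct = binomial(n, k) * p^k * (1 - p)^(n - k)

# The same quantity from the Distributions package
viapackage = pdf(Binomial(100, 1/6), 18)

println((Float64(direct), viapackage))
```

The two numbers should agree up to floating-point rounding.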

Exercise
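Stirling's approximation allows us to more easily manipulate factorial expressions algebraically. It says that

$$\lim_{n\to\infty}\frac{n!}{(n/e)^n\sqrt{2\pi n}} = 1.$$

Suppose that $n$ is even and that $p = \frac{1}{2}$. Use Stirling's approximation to show that $\sqrt{n}$ times the probability mass assigned to $n/2$ by the distribution $\text{Binomial}(n, p)$ converges to a finite, positive constant as $n \to \infty$. Find the value of this constant.

Solution. Let $p_n = \binom{n}{n/2}2^{-n}$ be the probability mass at $n/2$. Substituting Stirling's approximation for the factorial expressions in $\binom{n}{n/2} = \frac{n!}{((n/2)!)^2}$ tells us that

$$\frac{(n/e)^n\sqrt{2\pi n}}{\left[(n/(2e))^{n/2}\sqrt{\pi n}\right]^2}\cdot\frac{2^{-n}}{p_n} \to 1$$

as $n \to \infty$. Simplifying the big mess of an expression on the left-hand side tells us that $\sqrt{\frac{2}{\pi n}}\Big/p_n \to 1$ as $n \to \infty$. Therefore, $\sqrt{n}\,p_n \to \sqrt{2/\pi}$ as $n \to \infty$.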
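A quick numerical sanity check of this limit, offered as a sketch rather than a proof: it computes $\sqrt{n}$ times the mass that $\text{Binomial}(n, 1/2)$ places on its midpoint $n/2$ for a few even $n$, using arbitrary-precision arithmetic. The helper name scaledmass is hypothetical.

```julia
# √n times the mass that Binomial(n, 1/2) places on n/2, in arbitrary precision
scaledmass(n) = sqrt(big(n)) * binomial(big(n), n ÷ 2) / big(2)^n

for n in (10, 100, 1_000, 10_000)
    println(n, "  ", Float64(scaledmass(n)))
end

println("√(2/π) = ", sqrt(2 / π))
```

The printed values should approach $\sqrt{2/\pi}$ as $n$ grows.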

Geometric distribution

The geometric distribution with parameter $p \in (0,1]$ is the distribution of the index of the first success in a sequence of independent Bernoulli trials with success probability $p$.

The probability that the first success occurs on trial $k$ is equal to the probability that the first $k-1$ trials fail and the $k$th trial succeeds. The probability of this event is $p(1-p)^{k-1}$. Therefore, the probability mass function of the geometric distribution is

$$m(k) = p(1-p)^{k-1}, \qquad k = 1, 2, 3, \ldots$$

Exercise
Use Monte Carlo to find the mean and variance of the geometric distribution with parameter $p = \frac{1}{3}$.

Hint: you can sample from the geometric distribution using the definition: count the number of times you have to run rand(Uniform(0, 1)) until you get a result less than $p$.

Solution. Here's an example solution:

using Statistics, Distributions

function sample_geometric(p)
    k = 1
    while true
        if rand(Uniform(0, 1)) < p
            return k
        else
            k += 1
        end
    end
end

samples = [sample_geometric(1/3) for i=1:1_000_000]

m = mean(samples)
σ² = mean(x^2 for x in samples) - m^2

(m, σ²)

The pair returned by this block is very close to $(3, 6)$, leading us to conjecture that the mean and variance are 3 and 6, respectively.

Note: the superscript of 2 is part of the variable name. You can get this symbol at the Julia prompt using \^2«tab»

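We can use Taylor series to work out exact expressions for the mean and variance. The mean is equal to

$$p + 2p(1-p) + 3p(1-p)^2 + 4p(1-p)^3 + \cdots,$$

and we recognize all the terms except the first as $-p$ times the derivative of

$$(1-p)^2 + (1-p)^3 + (1-p)^4 + \cdots.$$

By the formula for the sum of a geometric series, this expression is equal to

$$\frac{(1-p)^2}{1-(1-p)} = \frac{(1-p)^2}{p},$$

and so the mean of the geometric distribution is

$$p - p\,\frac{d}{dp}\left(\frac{(1-p)^2}{p}\right) = \frac{1}{p}.$$

The variance can be worked out in a similar but more tedious way, and the result is

$$\frac{1-p}{p^2}.$$

These expressions do indeed evaluate to 3 and 6, respectively, when $p = \frac{1}{3}$ is substituted.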
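The series for the mean and variance can also be checked numerically by truncating them. The sketch below assumes a cutoff of 10,000 terms, which is an arbitrary choice; the neglected tail is negligible for $p = 1/3$. The variable names are illustrative.

```julia
p = 1/3
ks = 1:10_000   # truncation point chosen for illustration only

# Geometric mass function p(1-p)^(k-1) at k = 1, 2, ..., 10_000
mass = [p * (1 - p)^(k - 1) for k in ks]

m = sum(k * mass[k] for k in ks)            # truncated series for the mean
v = sum(k^2 * mass[k] for k in ks) - m^2    # truncated series for the variance

println((m, 1 / p))
println((v, (1 - p) / p^2))
```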

Exercise
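Suppose that $X$ is geometric with success probability $\frac{1}{2}$, and consider the random variable $Y = 2^X$. What is the expected value of $Y$?

Solution. The random variable $Y$ is equal to $2^n$ with probability $2^{-n}$, for all positive integers $n$. Therefore, the expected value is

$$2 \cdot \frac{1}{2} + 4 \cdot \frac{1}{4} + 8 \cdot \frac{1}{8} + \cdots = 1 + 1 + 1 + \cdots = \infty.$$

So $Y$ has infinite mean.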

Exercise
Explain why ceil(log(rand())/log(1-p)) returns a random variable whose distribution is geometric with success probability $p$.

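Solution. Let $\lceil \cdot \rceil$ denote the ceiling function on $\mathbb{R}$. The question is asking us to show that if $U$ is uniformly distributed in $[0,1]$, then

$$\left\lceil \frac{\log U}{\log(1-p)} \right\rceil$$

is geometrically distributed with success probability $p$.

This is true because of the inverse cdf trick from an earlier exercise. To show that this is indeed the case, it suffices to show that if $F$ is the cdf of a geometrically distributed random variable with success probability $p$, then the generalized inverse of $F$ is

$$F^{-1}(u) = \left\lceil \frac{\log(1-u)}{\log(1-p)} \right\rceil$$

for all $u \in (0,1)$. Now, let $F$ be the cdf of a geometric random variable $X$ with success probability $p$ and let $\lfloor \cdot \rfloor$ denote the floor function on $\mathbb{R}$. Then, for all $x \geq 1$,

$$\begin{aligned} F(x) &= P(X \leq x) \\ &= \sum_{k=1}^{\lfloor x \rfloor} p(1-p)^{k-1} \\ &= 1 - (1-p)^{\lfloor x \rfloor}, \end{aligned}$$

where the last line follows from evaluating the geometric sum. The jumps in $F$ clearly occur at positive integer values. Therefore, if we let $\mathbb{Z}_+ = \{1, 2, 3, \ldots\}$, we find that the generalized inverse of $F$ is given by

$$F^{-1}(u) = \min\left\{k \in \mathbb{Z}_+ : 1 - (1-p)^k \geq u\right\} = \min\left\{k \in \mathbb{Z}_+ : k \geq \frac{\log(1-u)}{\log(1-p)}\right\}$$

for all $u \in (0,1)$, where the inequality flips in the second expression because $\log(1-p) < 0$. But if $k \in \mathbb{Z}_+$ and $k \geq x$, then $k \geq \lceil x \rceil$, because $\lceil x \rceil$ is the smallest integer that is at least $x$. Therefore, $F^{-1}(u) = \left\lceil \frac{\log(1-u)}{\log(1-p)} \right\rceil$ for all $u \in (0,1)$.

Now if $U$ is uniformly distributed in $[0,1]$, then $1-U$ is also uniformly distributed in $[0,1]$, so $\left\lceil \frac{\log U}{\log(1-p)} \right\rceil$ is indeed geometrically distributed with success probability $p$.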
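The claim can also be checked empirically. The following sketch draws a million samples using the one-line sampler and compares the empirical frequencies with the geometric mass function $p(1-p)^{k-1}$. The choice $p = 1/3$, the seed, and the variable names are assumptions made for illustration.

```julia
using Random, Statistics

Random.seed!(1)  # fixed seed only so that the run is reproducible

p = 1/3
# rand() returning exactly 0.0 would break the conversion to Int, but that is vanishingly rare
samples = [ceil(Int, log(rand()) / log(1 - p)) for _ in 1:1_000_000]

# Empirical mass at k = 1, 2, 3 next to the geometric mass p(1-p)^(k-1)
for k in 1:3
    println(k, "  ", mean(samples .== k), "  ", p * (1 - p)^(k - 1))
end

println("sample mean = ", mean(samples), " (compare 1/p = ", 1 / p, ")")
```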

Exercise
Every time you visit your favorite restaurant, you choose a meal uniformly at random from the 10 available meals. How many visits will it take on average before you've tried all 10 meals?

Hint: try letting $X_k$ be the number of visits from the time you try the $k$th unique meal to the time when you try the $(k+1)$st unique meal.

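Solution. For $k \in \{0, 1, \ldots, 9\}$, let $X_k$ be the number of visits it takes to try the $(k+1)$st unique meal after trying the $k$th unique meal. Then the number of visits it takes to try all the meals is $X_0 + X_1 + \cdots + X_9$. Now, for any positive integer $n$, we have $X_k = n$ if and only if the first $n-1$ of these visits all yielded meals that have already been tried and the $n$th visit yielded a new meal. Because the meals are chosen independently and uniformly at random, we find that

$$P(X_k = n) = \left(\frac{k}{10}\right)^{n-1}\left(1 - \frac{k}{10}\right)$$

for all $k \in \{0, 1, \ldots, 9\}$ and any positive integer $n$. For notational simplicity, let $p_k = 1 - \frac{k}{10}$. Then

$$E[X_k] = \sum_{n=1}^{\infty} n(1-p_k)^{n-1}p_k = p_k\sum_{n=1}^{\infty} n(1-p_k)^{n-1}$$

for all $k \in \{0, 1, \ldots, 9\}$. Now, as we recall from elementary calculus, the term-by-term differentiation theorem gives

$$\sum_{n=1}^{\infty} n x^{n-1} = \frac{d}{dx}\sum_{n=0}^{\infty} x^n = \frac{d}{dx}\left(\frac{1}{1-x}\right) = \frac{1}{(1-x)^2} \quad \text{for } |x| < 1.$$

Therefore,

$$E[X_k] = \frac{p_k}{(1-(1-p_k))^2} = \frac{1}{p_k} = \frac{10}{10-k}$$

for all $k \in \{0, 1, \ldots, 9\}$, and thus

$$E[X_0 + X_1 + \cdots + X_9] = \sum_{k=0}^{9}\frac{10}{10-k} = 10\left(1 + \frac{1}{2} + \cdots + \frac{1}{10}\right) = \frac{7381}{252} \approx 29.29.$$

We find that, on average, about 29 visits are needed to try all the different meals.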
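A simulation offers an independent check on this answer. The sketch below repeatedly simulates restaurant visits until all 10 meals have been tried and compares the average number of visits with the sum $\sum_{k=0}^{9} \frac{10}{10-k}$. The function name visits_to_try_all, the seed, and the number of repetitions are illustrative assumptions.

```julia
using Random, Statistics

Random.seed!(1)  # fixed seed only so that the run is reproducible

# Count visits until all `nmeals` meals have been tried at least once
function visits_to_try_all(nmeals)
    tried = falses(nmeals)
    visits = 0
    while !all(tried)
        tried[rand(1:nmeals)] = true   # pick a meal uniformly at random
        visits += 1
    end
    return visits
end

simulated = mean(visits_to_try_all(10) for _ in 1:100_000)
exact = sum(10 / (10 - k) for k in 0:9)

println((simulated, exact))
```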

Poisson Distribution

The Poisson distribution arises as the number of 1's observed in a large number of low-probability Bernoulli random variables. This situation models a surprising variety of real-world scenarios:

The number of calls received at a call center in a given hour. Each potential caller has a low probability of calling during that particular hour, and there are many potential callers who are acting independently.
The number of meteorites which strike earth in a given year. There are many meteorites which might hit earth, and each one does so with low probability.
The number of mutations on a strand of DNA. Each mutation occurs with low probability, but there are many potential sites for a mutation.
The number of claims filed with an insurance company in a given month. There are many customers, and they file claims independently and with low probability each month.

Exercise

Find the expected value of $S$, where $S$ is a sum of 1000 independent Bernoulli random variables with success probability $p = \frac{3}{1000}$.
Find the probability mass function of $S$. Hint: find an expression representing the probability mass at each $k$ from 0 to 1000, and then use Julia to evaluate it. You will need to define n = big(1000) and p = big(3)/1000 because arbitrary precision arithmetic is required to avoid overflow issues.
Compare your results to the probability mass function $m(k) = \frac{3^k}{k!}e^{-3}$ defined on $\{0, 1, 2, \ldots\}$.

Solution. (i) The expected value of each Bernoulli random variable is $\frac{3}{1000}$, so by linearity of expectation the expected value of $S$ is $1000 \cdot \frac{3}{1000} = 3$.

(ii) Consider all possible length-1000 strings of 0's or 1's. Of these, there are $\binom{1000}{k}$ with $k$ ones and $1000-k$ zeros, and each of those strings has a probability of $p^k(1-p)^{1000-k}$ of being the result of an independent sequence of $\text{Bernoulli}(p)$ random variables (where $p = \frac{3}{1000}$). Therefore, the probability of the event $\{S = k\}$ is $\binom{1000}{k}p^k(1-p)^{1000-k}$. We can obtain a vector of these probabilities as follows:

n = big(1000)
p = big(3)/1000
massfunction = [binomial(n,k)*p^k*(1-p)^(n-k) for k=0:1000]

(iii) We can run [3^big(k)/factorial(big(k))*exp(-3) for k=0:1000] to get the first 1001 values of the given probability mass function. We see that the values are quite similar. The first ten pairs of values are

(0.0495631, 0.0497871)
(0.149137, 0.149361)    
(0.224154, 0.224042)    
(0.224379, 0.224042)    
(0.168284, 0.168031)    
(0.100869, 0.100819)    
(0.0503334, 0.0504094)  
(0.0215065, 0.021604)   
(0.00803259, 0.00810151)
(0.0026641, 0.0027005)

Inspired by this exercise, we make the following definition:

Definition (Poisson distribution)
The Poisson distribution with mean $\lambda$ is the distribution whose probability mass function is

$$m(k) = \frac{\lambda^k}{k!}e^{-\lambda}, \qquad k \in \{0, 1, 2, \ldots\}.$$

[Figure: the probability mass function of a Poisson distribution]

The expression $\frac{\lambda^k}{k!}e^{-\lambda}$ in the definition of the Poisson distribution arises as a limit of the expression

$$\binom{n}{k}\left(\frac{\lambda}{n}\right)^k\left(1 - \frac{\lambda}{n}\right)^{n-k}.$$

In other words, we use a success probability of $\frac{\lambda}{n}$ so that the expected number of successes remains constant as $n \to \infty$.

The connection between the Poisson and Bernoulli random variables may be used to obtain the mean and variance of the Poisson distribution. The average number of successes in $n$ Bernoulli($\lambda/n$) trials is $(n)(\lambda/n) = \lambda$, by linearity of expectation. Therefore, we expect that the mean of a Poisson random variable with parameter $\lambda$ is equal to $\lambda$. Similarly, the variance of the number of successes in $n$ Bernoulli($\lambda/n$) trials is equal to $n\frac{\lambda}{n}\left(1 - \frac{\lambda}{n}\right) = \lambda\left(1 - \frac{\lambda}{n}\right)$. Taking $n \to \infty$, we predict that the variance of a Poisson random variable with parameter $\lambda$ is also equal to $\lambda$. Both of these predictions are accurate:

Theorem
The mean and variance of a Poisson random variable with parameter $\lambda$ are $\lambda$ and $\lambda$, respectively.
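To see the binomial-to-Poisson convergence numerically, one possible sketch (assuming the Distributions package and an illustrative value $\lambda = 3$) prints the largest gap between the $\text{Binomial}(n, \lambda/n)$ and $\text{Poisson}(\lambda)$ mass functions over the first few values of $k$, along with the mean and variance that the package reports for the Poisson distribution:

```julia
using Distributions

λ = 3.0
poisson = Poisson(λ)

# Mean and variance reported by the package
println((mean(poisson), var(poisson)))

# Largest gap between the Binomial(n, λ/n) and Poisson(λ) mass functions on 0:20
for n in (10, 100, 1_000, 10_000)
    gap = maximum(abs(pdf(Binomial(n, λ / n), k) - pdf(poisson, k)) for k in 0:20)
    println(n, "  ", gap)
end
```

The gap should shrink as $n$ grows, in line with the limit described above.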

Exercise
Suppose that the number of typos on a page is a Poisson random variable with mean $\lambda$.

Provide an explanation for why the Poisson distribution might be a good approximation for the distribution of typos on a page.
Find the probability that a particular page is typo-free.

Solution. (i) Typo opportunities on a page convert to actual typos with a small but roughly constant probability, there are quite a few of them, and different typos are (roughly) independent of one another. Thus the number of typos is a sum of many independent Bernoulli random variables, each with a small success probability. (ii) The probability that a Poisson random variable with parameter $\lambda$ is equal to 0 is

$$\frac{\lambda^0}{0!}e^{-\lambda} = e^{-\lambda}.$$

Exponential distribution

The exponential distribution also emerges as a limit involving Bernoulli random variables: imagine placing light bulbs activated by independent $\text{Bernoulli}(\lambda/n)$ random variables at every multiple of $1/n$ on the positive real number line. Consider the position $X$ of the leftmost lit bulb. The probability that it occurs to the right of a point $t > 0$ is equal to the probability that all of the $\lfloor nt \rfloor$ bulbs to the left remain unlit:

$$P(X > t) = \left(1 - \frac{\lambda}{n}\right)^{\lfloor nt \rfloor}.$$

This probability converges to $e^{-\lambda t}$ as $n \to \infty$.
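The convergence can be observed directly by evaluating the probability that all bulbs to the left of $t$ are unlit, $(1 - \lambda/n)^{\lfloor nt \rfloor}$, for increasing $n$. This is only a sketch; the values $\lambda = 2$ and $t = 0.75$ and the helper name unlit are arbitrary choices for illustration.

```julia
λ, t = 2.0, 0.75

# Probability that every bulb to the left of t is unlit when the spacing is 1/n
unlit(n) = (1 - λ / n)^floor(Int, n * t)

for n in (10, 100, 1_000, 100_000)
    println(n, "  ", unlit(n))
end

println("limit: ", exp(-λ * t))
```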

Definition (Exponential distribution)
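Let $\lambda > 0$. The exponential distribution with parameter $\lambda$ is the probability measure on $\mathbb{R}$ which assigns mass $e^{-\lambda t}$ to the interval $(t, \infty)$, for all $t \geq 0$.

Equivalently, the exponential distribution with parameter $\lambda$ is the probability measure whose density is

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{otherwise.} \end{cases}$$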

Exercise
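Find the mean of the exponential distribution with parameter $\lambda$.

Solution. We calculate

$$\int_0^\infty x\lambda e^{-\lambda x}\,dx = \left[-\left(x + \frac{1}{\lambda}\right)e^{-\lambda x}\right]_0^\infty = \frac{1}{\lambda}.$$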

Exercise
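Suppose that $X$ is an exponentially distributed random variable with mean $\frac{1}{\lambda}$, and let $s, t \geq 0$. Show that

$$P(X > s + t \mid X > t) = P(X > s).$$

Solution. Observing that $\{X > s + t\} \cap \{X > t\} = \{X > s + t\}$, we use the definition of conditional probability to get

$$P(X > s + t \mid X > t) = \frac{P(X > s + t)}{P(X > t)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda t}} = e^{-\lambda s} = P(X > s),$$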

as required.
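This memoryless property can be checked numerically with the Distributions package. Note one assumption that is easy to get wrong: Distributions.jl parametrizes Exponential by its scale (that is, its mean) rather than its rate, so a rate of $\lambda$ is passed as 1/λ. The specific values of λ, s and t below are arbitrary.

```julia
using Distributions

λ = 2.0
# Distributions.jl parametrizes Exponential by its scale (the mean), so pass 1/λ
X = Exponential(1 / λ)

s, t = 0.4, 1.1
lhs = ccdf(X, s + t) / ccdf(X, t)   # P(X > s + t | X > t)
rhs = ccdf(X, s)                    # P(X > s)

println((lhs, rhs, exp(-λ * s)))
```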

Cauchy distribution

The Cauchy distribution spreads probability mass way out on the real number line.

Definition (Cauchy distribution)

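The Cauchy distribution is the probability measure on $\mathbb{R}$ whose density function is

$$f(x) = \frac{1}{\pi}\,\frac{1}{1+x^2}.$$

The amount of probability mass assigned by the Cauchy distribution to the interval $(x, \infty)$ is

$$\int_x^\infty \frac{1}{\pi}\,\frac{1}{1+t^2}\,dt = \frac{1}{2} - \frac{1}{\pi}\arctan(x) \approx \frac{1}{\pi x} \quad \text{as } x \to \infty.$$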

This mass goes to 0 so slowly that the Cauchy distribution doesn't even have a well-defined mean, let alone a variance. We say that the Cauchy distribution is heavy-tailed, and we will use it as an example when we want to study the effects of heavy tails on results like the law of large numbers or the central limit theorem.
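To get a feel for how slowly the tail mass decays, the sketch below (assuming the Distributions package) prints the mass that the standard Cauchy distribution assigns to $(x, \infty)$ next to the closed form $\frac{1}{2} - \frac{1}{\pi}\arctan(x)$ and the rough approximation $\frac{1}{\pi x}$, and then prints a standard normal tail for comparison:

```julia
using Distributions

C = Cauchy()  # standard Cauchy: location 0, scale 1

for x in (1.0, 10.0, 100.0, 1_000.0)
    tail = ccdf(C, x)                 # mass on (x, ∞)
    formula = 1/2 - atan(x) / π
    println(x, "  ", tail, "  ", formula, "  ", 1 / (π * x))
end

# For comparison, a standard normal puts far less mass beyond the same point
println(ccdf(Normal(), 10.0))
```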

Exercise
Show that the mean of the Cauchy distribution is not well-defined.

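Solution. Let $X$ be Cauchy-distributed. Then

$$E[X] = \int_{-\infty}^{\infty}\frac{x}{\pi(1+x^2)}\,dx = \frac{1}{2\pi}\left[\ln(1+x^2)\right]_{-\infty}^{\infty}.$$

Therefore $E[X]$ is undefined because the term in the square brackets is undefined.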

Exercise
Choose $\theta$ uniformly at random from the interval $[0, \pi]$ and fire a ray from the origin at angle $\theta$ with respect to the positive $x$-axis. Let $(X, 1)$ be the point where this ray intersects the line $y = 1$. Show that $X$ is Cauchy-distributed.

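Solution. Let $F$ and $f$ be the cdf and pdf of $X$, respectively. We need to show that $f(x) = \frac{1}{\pi(1+x^2)}$. Now, for all $x \neq 0$, a ray from the origin that intersects the line $y = 1$ at $(x, 1)$ has slope $\frac{1}{x}$, giving $\tan\theta = \frac{1}{x}$. If the ray intersects at $(0, 1)$, then $\theta = \frac{\pi}{2}$. Let $g : \mathbb{R} \to [0, \pi]$ be defined by

$$g(x) = \begin{cases} \arctan\left(\frac{1}{x}\right) & \text{if } x > 0 \\ \frac{\pi}{2} & \text{if } x = 0 \\ \pi + \arctan\left(\frac{1}{x}\right) & \text{if } x < 0. \end{cases}$$

Then for all $x \in \mathbb{R}$ we observe that $X \leq x$ if and only if $\theta \geq g(x)$. Therefore,

$$F(x) = P(X \leq x) = P(\theta \geq g(x)).$$

Since $\theta$ is uniformly distributed in $[0, \pi]$, it follows that

$$F(x) = \frac{\pi - g(x)}{\pi} = 1 - \frac{g(x)}{\pi}.$$

Now, by the fundamental theorem of calculus, we know that if $F'(x)$ exists for all $x \in \mathbb{R}$, then $f(x) = F'(x)$. But $F'(x) = -\frac{1}{\pi}g'(x)$. Now, by construction, $g$ is a continuous strictly decreasing function with

$$g'(x) = \frac{1}{1 + (1/x)^2}\cdot\left(-\frac{1}{x^2}\right) = -\frac{1}{1+x^2}$$

for all $x \neq 0$. Since $\lim_{x \to 0} g'(x) = -1$ and $g$ is continuous at 0, it follows that $g'(x) = -\frac{1}{1+x^2}$ for all $x \in \mathbb{R}$, and thus $F'(x)$ exists. Therefore,

$$f(x) = F'(x) = \frac{1}{\pi(1+x^2)},$$

and $X$ is Cauchy-distributed.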
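The ray construction is easy to simulate, which gives an empirical check on the result. In the sketch below, a ray at angle $\theta$ meets the line $y = 1$ at $x = \cos\theta/\sin\theta = \cot\theta$, and the empirical cdf of these intersection points is compared with the Cauchy cdf from the Distributions package. The seed, the sample size, and the evaluation points are arbitrary choices.

```julia
using Random, Statistics, Distributions

Random.seed!(1)  # fixed seed only so that the run is reproducible

# θ uniform on [0, π); the ray meets the line y = 1 at x = cos(θ)/sin(θ)
θ = π .* rand(1_000_000)
xs = cot.(θ)

# Empirical cdf of the intersection points next to the Cauchy cdf
for x in (-2.0, 0.0, 0.5, 3.0)
    println(x, "  ", mean(xs .<= x), "  ", cdf(Cauchy(), x))
end
```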

Normal distribution

Because of the central limit theorem, which we will discuss in the next section, the normal distribution plays a central role in probability and statistics.

Definition (Normal distribution)

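For $\mu \in \mathbb{R}$ and $\sigma > 0$, we define the normal distribution, denoted $\mathcal{N}(\mu, \sigma^2)$, to be the probability measure on $\mathbb{R}$ whose density function is

$$f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}.$$

The standard normal distribution is $\mathcal{N}(0, 1)$.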

Exercise
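Show that if $Z$ is a standard normal random variable and $\sigma > 0$, then the distribution of $\sigma Z + \mu$ is $\mathcal{N}(\mu, \sigma^2)$.

Solution. Let $X = \sigma Z + \mu$, and let $F$ and $f$ be the CDF and PDF of $X$, respectively. We need to show that

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$$

for all $x \in \mathbb{R}$. Now, for any $x \in \mathbb{R}$ we have

$$F(x) = P(\sigma Z + \mu \leq x) = P\left(Z \leq \frac{x-\mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(x-\mu)/\sigma}e^{-t^2/2}\,dt.$$

But $f(x) = F'(x)$ whenever the derivative exists. Therefore, by the fundamental theorem of calculus and the chain rule,

$$f(x) = \frac{d}{dx}\left(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(x-\mu)/\sigma}e^{-t^2/2}\,dt\right) = \frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)},$$

and thus $X \sim \mathcal{N}(\mu, \sigma^2)$.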

Example
In terms of the cumulative distribution function of the standard normal, express the probability that a normally distributed random variable with mean 1 and variance 3 is between 2 and 4.

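Solution. Let's denote by $X$ a normally distributed random variable with mean $1$ and variance $3$. Then $Z = \frac{X - 1}{\sqrt{3}}$ is a standard normal random variable. Furthermore, $X$ is between 2 and 4 if and only if $Z$ is between $\frac{1}{\sqrt{3}}$ and $\frac{3}{\sqrt{3}}$.

Therefore, the desired probability is

$$P\left(\frac{1}{\sqrt{3}} < Z < \frac{3}{\sqrt{3}}\right) = \Phi\left(\frac{3}{\sqrt{3}}\right) - \Phi\left(\frac{1}{\sqrt{3}}\right),$$

where $\Phi$ denotes the cumulative distribution function of the standard normal.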

We can compute this probability in Julia as follows:

using Distributions
Φ(x) = cdf(Normal(0, 1),x)
Φ(3/sqrt(3)) - Φ(1/sqrt(3))

We find that the probability is approximately $0.24$.

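If we sum two independent random variables with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively, then the mean and variance of the resulting sum are $\mu_1 + \mu_2$ and $\sigma_1^2 + \sigma_2^2$. Remarkably, if the random variables being summed are normal, then the sum is also normal:

Theorem
If $X_1$ and $X_2$ are independent normal random variables with distributions $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$, respectively, then the sum $X_1 + X_2$ has distribution $\mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.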

Exercise
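Suppose that $n \geq 1$ and that $Z_1, Z_2, \ldots, Z_n$ are independent standard normal random variables. Find the distribution of $\frac{Z_1 + Z_2 + \cdots + Z_n}{\sqrt{n}}$.

Solution. We know that $Z_1 + Z_2$ is normal with mean 0 and variance $2$. Then $Z_1 + Z_2 + Z_3$, which is a sum of $Z_1 + Z_2$ and $Z_3$, is normal with mean 0 and variance $3$.

Continuing in this way, we find that $Z_1 + Z_2 + \cdots + Z_n$ is a normal random variable with mean 0 and variance $n$. Dividing this random variable by $\sqrt{n}$ divides its variance by $n$, so we find that the distribution of $\frac{Z_1 + Z_2 + \cdots + Z_n}{\sqrt{n}}$ is $\mathcal{N}(0, 1)$.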
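A simulation can make this result concrete. The sketch below forms many independent copies of the sum of $n$ standard normals divided by $\sqrt{n}$ and prints the sample mean, the sample variance, and the fraction of samples at most 1, which for a standard normal should be close to $\Phi(1) \approx 0.8413$. The choice $n = 25$, the seed, and the sample size are arbitrary.

```julia
using Random, Statistics

Random.seed!(1)  # fixed seed only so that the run is reproducible

n = 25
# Each column holds n standard normals; summing a column and dividing by √n gives one sample
samples = vec(sum(randn(n, 100_000); dims=1)) ./ sqrt(n)

println((mean(samples), var(samples)))
println(mean(samples .<= 1.0))   # compare with Φ(1) ≈ 0.8413
```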

The multivariate normal distribution

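If $Z = (Z_1, Z_2, \ldots, Z_n)$ is an independent sequence of standard normal random variables, $A$ is an $m \times n$ matrix of constants, and $\mu$ is an $m \times 1$ vector of constants, then the vector

$$X = AZ + \mu$$

is said to have multivariate normal distribution.

If $\Sigma = AA^T$ is invertible, then the pdf of $X$ is given by

$$f(x) = \frac{1}{\sqrt{\det(2\pi\Sigma)}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right).$$

The figure below shows a graph of this density as well as 1000 samples from the distribution of $AZ + \mu$ for a particular choice of $A$ and $\mu$.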

A graph of a multivariable normal density

Exercise
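Show that the covariance matrix of a multivariate normal random vector $X = AZ + \mu$ is $\Sigma = AA^T$ and that its mean is $\mu$.

Note: you may use the following properties: $\text{Cov}(U + c, V) = \text{Cov}(U, V)$ for any constant $c$ and any random variables $U$ and $V$, and if $M$ is an $m \times n$ random matrix and $B$ is an $r \times m$ matrix of constants, then $E[BM] = BE[M]$.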

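Solution. Before showing that $\text{Cov}(X) = AA^T$, we first make two observations. First, for any $c \in \mathbb{R}$ and two real-valued random variables $U$ and $V$ we have

$$\text{Cov}(U + c, V) = E[(U + c - E[U + c])(V - E[V])] = E[(U - E[U])(V - E[V])] = \text{Cov}(U, V).$$

Second, if $M$ is an $m \times n$ random matrix and $B$ is an $r \times m$ matrix of real constants, then $E[BM] = BE[M]$ because

$$E[BM]_{ij} = E\left[\sum_{k=1}^{m}B_{ik}M_{kj}\right] = \sum_{k=1}^{m}B_{ik}E[M_{kj}] = (BE[M])_{ij}.$$

A similar argument also shows that $E[MC] = E[M]C$ for any $n \times s$ matrix $C$ of real constants.

Now, from the first observation, we deduce that the covariance matrix of $X = AZ + \mu$ is the same as the covariance matrix of $AZ$. By the second observation, we find $E[AZ] = AE[Z] = 0$ because $E[Z] = 0$, and hence $E[X] = E[AZ] + \mu = \mu$. Combining these with the definition of the covariance matrix, as in an earlier exercise, we find that

$$\text{Cov}(X) = E[(AZ)(AZ)^T] = E[AZZ^TA^T] = AE[ZZ^T]A^T.$$

Now, $E[ZZ^T]$ is the covariance matrix of $Z$, which is the identity matrix in $\mathbb{R}^{n \times n}$ because the components of $Z$ are independent standard normals. Therefore, $\text{Cov}(X) = AA^T = \Sigma$, as required.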
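As an empirical illustration of this result, the sketch below samples $AZ + \mu$ many times and compares the sample mean and sample covariance with $\mu$ and $AA^T$. The particular matrix A and vector μ are hypothetical values chosen only for this example; they are not the ones used for the figure in the text.

```julia
using Random, Statistics

Random.seed!(1)  # fixed seed only so that the run is reproducible

# Hypothetical A and μ, chosen only for illustration
A = [1.0 0.5; 0.0 2.0]
μ = [3.0, -1.0]

# Each column of X is one sample of AZ + μ, with Z a pair of independent standard normals
X = A * randn(2, 200_000) .+ μ

println(vec(mean(X; dims=2)))   # compare with μ
println(cov(X; dims=2))         # compare with A * A'
println(A * A')
```

Passing dims=2 tells cov to treat each column as one observation, which matches how the samples are laid out here.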
