Select one of the keywords on the left…

Bayesian Inference and Graphical ModelsMarkov Chain Monte Carlo

Reading time: ~35 min

While conjugate priors are elegant, they aren't always applicable. We might want to use a prior distribution which isn't in a conjugate family, or we might not be able to come up with a conjugate family for a given problem. So more general techniques are called for. Let's look at a simple example where we run into difficulties trying to work entirely with on-paper formulas.

Suppose that we have a prior density for a parameter \theta given by f(\theta) = \frac{1}{\log(4)-1}\log(\theta). Suppose that the data are drawn from the distribution f(x|\theta) \sim \operatorname{Normal}(\theta,1).

Derive a formula for the posterior mean of \theta, given a single observation x from this distribution. Can the formula be evaluated analytically?

Solution. The posterior mean is given by

\begin{align*}\mathbb{E}[\theta|x] &= \frac{\frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\log(4)-1}\int_1^2 \theta e^{-\frac{(x-\theta)^2}{2}}\log(\theta) d\theta} {\frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\log(4)-1}\int_1^2 e^{-\frac{(x-\theta)^2}{2}}\log(\theta) d\theta} \\ &= \frac{\int_1^2\theta e^{-\frac{(x-\theta)^2}{2}}\log(\theta) d\theta} {\int_1^2 e^{-\frac{(x-\theta)^2}{2}}\log(\theta) d\theta}.\end{align*}

The integrals in the numerator and denominator cannot be computed analytically.

We could evaluate these integrals to high precision using standard numerical integration techniques like the trapezoid rule. However we're going to want to use a less accurate but more flexible approach based on Markov chains, because the trapezoid rule and similar approaches become intractable when our data are high-dimensional. For example, a 28 by 28 grayscale image is an element of [0,1]^{728}, which is practically impossible to sample densely—counting just the corners, it contains a set of 2^{728} \approx 1.4 \times 10^{219} points with no point closer than one unit to another.

Markov Chains

A discrete Markov chain is a sequence of random variables whose joint distribution is specified by a diagram like this one:

Transition probabilities for a four-state Markov chain

The value of the initial random variable X_1 is one of the four states (A, B, C, or D in this example), and then the conditional distribution of X_2 given X_1 is determined by the outgoing arrows from X_1. For example, if X_1 = \mathrm{C}, then X_2 will be \mathrm{C} with probability 1/2 or \mathrm{D} with probability \frac{1}{2}. Then the conditional distribution of X_3 given X_2 is determined by the outgoing arrows from X_2, and so on.

Write down the matrix P whose (i,j)\text{th} entry is the probability of transitioning to the j\text{th} state given that the Markov chain is currently at the i\text{th} state.

Solution. The transition matrix P is given by

P = [1/2 1/2  0    0   
     1/2  0  1/2   0   
      0   0  1/2  1/2
      1   0   0    0]

Plot an example trajectory of the Markov chain above starting from state A.

using Plots


We sample the next state from the corresponding row of P:

using Plots, StatsBase
function takestep(i)
   sample(1:4, weights(P[i,:]))
function chain(start, n_steps)
    x = [start]
    for i in 1:n_steps
        push!(x, takestep(x[end]))
P = [1/2 1/2  0    0   
     1/2  0  1/2   0   
      0   0  1/2  1/2
      1   0   0    0]
plot(chain(1, 100), size = (800, 100), legend = false,
     yticks = (1:4, ["A", "B", "C", "D"]))

Show that (P^2)_{1,1} (the top left entry of P^2) yields the probability of being at state 1 in two steps given that you start from state 1.

Solution. Squaring P and looking at the top left entry gives (1/2)(1/2) + (1/2)(1/2) = 1/2, while calculating the probability of being at state 1 in two steps starting from state 1 gives the same result: (1/2)(1/2) for the probability that we stay at A for both steps, plus (1/2)(1/2) for the probability that we go to B and come back to A.

More generally, we can say that the $(i,j)$th entry of P^n is equal to the probability of being at state j after n steps, given that we began in state i.

The animation below shows that happens to the probability of being in each state after n steps, starting from state A.

We see that the probability of being at state j after n steps converges as n\to\infty. We call this collection of limiting probabilities the stationary distribution of the Markov chain. It turns out not to depend on the starting state:

The goal of Markov Chain Monte Carlo is to reverse this procedure: we'll start with a desired probability measure on a set \Omega and come up with a collection of transition probabilities that give that measure as its stationary distribution. Then we can sample from the desired measure by starting somewhere in \Omega and running a long Markov chain with those transition probabilities.


Finding transition probabilities which give rise to a particular stationary distribution might seem like a tall order, but there is a surprisingly general method of achieving it.

Suppose \Omega is the set we're looking to sample from, and f is either a PMF or PDF on \Omega. Given a symmetric transition matrix Q—called the proposal transition matrix—Metropolis-Hastings defines the following Markov chain:

  • Choose the initial state X_0 arbitrarily, and sample X_{\text{prop}} from the distribution Q(X_0, \cdot).
  • Define X_1 to be X_{\text{prop}} with probability \frac{f(X_{\text{prop}})}{f(X_0)} (or 1, if this ratio exceeds 1) and X_0 otherwise.
  • Repeat steps (ii) and (iii) to obtain X_2 from X_1, X_3 from X_2, and so on.

Note: we call \frac{f(X_{\text{prop}})}{f(X_0)} the ratio \alpha(X_0, X_{\text{prop}}), since it determines the probability with which we accept the proposal.

This algorithm is called Metropolis-Hastings. To the extent that Metropolis-Hastings is sometimes difficult to implement to the desired effect, the main difficulties usually have to do with choosing an efficient proposal distribution Q or knowing how long to run the Markov chain until the distribution of the current state is close to the stationary distribution.

Apply Metropolis-Hastings to the four-state Markov chain above, with the stationary distribution [4/9, 2/9, 2/9, 1/9].

Solution. The probability of moving from B given that you start at A is the probability that the proposal suggests B, which is 1/2, multiplied by the probability you accept the proposal, which is \frac{2/9}{4/9} = 1/2. So that entry is 1/4. Likewise, the transition probability from A to D is \frac{1}{8}. Continuing in this way, we fill out the matrix, and it indeed gives rise to the desired stationary distribution:

P = [5/8 1/4 0 1/8
     1/2 0 1/2 0
     0 1/2 1/4 1/4
     1/2 0 1/2 0]

# let's check that each row is close to the
# desired stationary distribution:

You might be thinking that surely we could sample from this distribution in easier ways (like generating a uniform random variable and checking whether it's between 0 and 4/9, or between 4/9 and 4/9 + 2/9, etc.). That's a fair point. Let's look at an example where Metropolis-Hastings actually allows us to do something new.

Suppose we want to sample from a probability measure on the set of length-100 binary strings.

(a) How do we sample from the uniform distribution on such strings?

(b) How do we sample from the distribution whose probability mass function f assigns to each string a mass proportional to the number of 1's (for example, a string with 82 ones would be twice as likely as a string with 41 ones)?

Solution. In neither case do we want to try inverse CDF sampling, because splitting the unit interval into 2^{100} tiny intervals is impractical (even the Float64 system doesn't allow that much resolution throughout the interval). However, we can sample from the measure specified in (a) without any new ideas, since the values in each position are independent under the the uniform measure. So we can just do

rand(0:1, 100)

For (b), we can use Metropolis-Hastings. We choose an initial state somehow, maybe all ones. Then we propose changing the string by, let's say, choosing a random position and flipping the bit in that position. If the number of ones goes up, we accept that switch. If it goes down, we accept it with probability f_{\text{proposed}}/f_{\text{current}} = (H-1)/H, where H is the current number of heads.

function rand_weighted_by_num_ones(n)
    x = fill(1, n)
    n_ones = n
    for i in 1:10^6
        k = rand(1:n)
        if x[k] == 0
            x[k] = 1
            n_ones += 1
            if rand() < (n_ones-1)/n_ones
                x[k] = 0
                n_ones -= 1


The following video provides a visualization for the algorithm that rand_weighted_by_num_ones executes.

Notice here that it's helpful to choose a proposal distribution that suggests changes which can be executed quickly (for example, in this case we just looked at one position at a time rather than several), because we'll have to perform lots of these changes to get close to the stationary distribution.

Bayesian Linear Regression

Finally, let's see how to use MCMC to do Bayesian linear regression.

Consider a linear regression model y = mx + b + \epsilon, where \epsilon is normally distributed with mean 0 and variance 1. We'll treat the parameters m amd b as Bayesian parameters, meaning that they're random variables with some distribution. As priors, let's take normal distributions with mean 0 and variance 100 for both m and b.

Our goal is to sample from the posterior distribution given some observations (x_i, y_i) as i ranges from 1 to n. Let's use kind of an extreme example to make the point:

using Plots, Distributions
x = rand(Uniform(0, 1), 1000)
y_mean = 0.4x .+ 0.2
y = y_mean + rand(Normal(0, 0.05), 1000)
function observations()
    scatter(x, y, ms = 1, msw = 0.2,
                  size = (400, 250), legend = false)

To sample from the posterior, we'll use Metropolis-Hastings on the parameters m and b. That means we'll start at initial values (m_0, b_0) for those variables, then propose an update drawn from q(m_0, b_0). We'll calculate the acceptance ratio for the posterior distribution, move or stay based on that ratio, and so on.

What is the acceptance ratio for the posterior distribution, given the observations?

Solution. Since posterior is proportional to likelihood times prior, we need to compute likelihood ratios and ratios of priors. The latter we'll get by deferring to Julia's pdf, but the former we can do analytically. Given m and b, the probability density evaluated at the observed data is

\begin{align*}\prod_{i=1}^n\exp(-(y_i - mx_i - b)^2/2) = \exp\left(-\sum_{i=1}^n(y_i - mx_i - b)^2/2\right).\end{align*}

Let's code this up. We begin by writing a function to compute the acceptance ratio.

N = Normal(0, 10)
δ(x,y,m,b) = sum((yᵢ - m*xᵢ - b)^2/2 for (xᵢ, yᵢ) in zip(x,y))
function α(x, y, m, b, m_prop, b_prop)
    min(1.0, exp(-δ(x,y,m_prop,b_prop) + δ(x,y,m,b)) *
        pdf(N, m_prop)/pdf(N, m) * pdf(N, b_prop)/pdf(N, b))
function mcmc(n_iterations)
    m, b, σ = 0.0, 0.0, 1.0
    θs = [(m, b)]
    for i in 1:n_iterations
        m_prop, b_prop = m + rand(Normal(0,0.005)), b + rand(Normal(0,0.005))
        if rand() < α(x, y, m, b, m_prop, b_prop)
            m, b = m_prop, b_prop
            push!(θs, (m, b))

Finally, we can visualize the result:

θs = mcmc(3_000)
m, b = θs[end]
plot!(0:1, x-> m*x + b, size = (400, 300), ylims = (-0.5, 1))

If we run this code cell several times, then we'll get slightly different results each time. That's because we're sampling m and b from their posterior distributions, rather than choosing specific "best" values as we did in frequentist statistics. We could use many runs of the Markov chain sampling algorithm to get quantiles for m and b and even see how they're distributed jointly given the observed data:

scatter([mcmc(3000)[end] for _ in 1:1000], size = (400, 400),
         ms = 2, msw = 0.2, title = "posterior distribution",
         xlabel = "m", ylabel = "b", legend = false)

Explain why it makes intuitive sense that the joint posterior distribution of m and b would be negatively correlated.

Solution. If b happens to be on the large side, then the slope needs to be smaller than usual to be reasonably compatible with the data (that is, to cut through the middle of the point cloud). Said another way, m and b are both exceptionally large, then the likelihood for that pair would be extremely small because of the Gaussian factors picking up large vertical differences between the observed points and the line. Similarly, if b is smaller than usual, then m is likely to be larger than usual.

Bruno Bruno