Select one of the keywords on the left…

Machine LearningEnsemble Methods

Reading time: ~25 min

Humans making important decisions often consult a panel of experts, rather than a single expert. While one individual might have biases that affect their ability to make the right decision, the hope is that these idiosyncrasies tend to cancel when aggregated across the group. In this way, the collection of experts might be able to make better decisions than even the best individual expert.

One crucial assumption is that the agents' decision-making processes exhibit some independence. If they're all constrained to approach problems in very similar ways, then the wisdom of the group will closely reflect the wisdom of each individual. This phenomenon is called .

We can follow a similar approach with machine learners: we take a majority vote among an ensemble of them. We can do this with a variety of models (one linear model, one SVM, etc.), but we can also do this with many models of the same type (for example, a collection of many decision trees). If trained deterministically on the same data, the trees would all be the same, so to get better results we will need to randomize either the training process or the data or both.


One common way to do this is to train each decision tree on a sample obtain by sampling the training data with replacement. This is called bootstrap aggregation, or bagging, for short. We can also randomize the training process by randomly choosing a subset of the features to consider for possible splits at each node of the tree. In addition to making the decision tree predictions less correlated, this also has the benefit of making training .

Change DecisionTreeClassifier to RandomForestClassifier in the second cell below to see how using the ensemble method changes the way the model performs. Compare your results to the Bayes classifier (which is in the third cell).

In this cell, we generate example training data.

# generate observations
n = 1000
features = zeros(n,2)
labels = ["" for _ in 1:n]
for i in 1:n
    if rand() < 0.4
        features[i,:] = [2 1; 1 1]*randn(2) + [0, 2]
        labels[i] = "red"
        features[i,:] = randn(2) + [0, 2]
        labels[i] = "blue"

Here we train a model and visualize the resulting classification function:

# train model and plot prediction function
using Plots, Distributions, DecisionTree
model = DecisionTreeClassifier(max_depth=3)
fit!(model, features, labels)
heatmap(-6.0:0.1:6, -6.0:0.1:6,
        (x,y) -> predict(model, [x, y]) == "red",
        color = cgrad([:blue, :red]), opacity = 0.4, colorbar = false)
scatter!(features[:,1], features[:,2], color = labels,
         ratio = 1, size = (400,400), markerstrokewidth = 0.1,
         markersize = 3, opacity = 0.2, legend = false)

We can compare our classifier with the Bayes classifier, which makes the best possible predictions (based on knowledge of the underlying distribution):

N₁ = MvNormal([0,2], [2.0 1; 1 1]*[2 1; 1 1]')
N₂ = MvNormal([1,1], [1.0 0; 0 1])
heatmap(-6.0:0.1:6, -6.0:0.1:6,
        (x,y) -> 0.4pdf(N₁, [x,y]) > 0.6pdf(N₂, [x,y]) ? 1 : 0,
        color = cgrad([:blue, :red]), ratio = 1, opacity = 0.5,
        size = (400,400))


Another way to combine models into a single, stronger model is build them up in sequence, with each model aiming to address the deficiencies of the previous ones. This approach is called boosting. We will discuss two boosting methods: AdaBoost, and gradient boosting.


The core idea of adaptive boosting is to weight observations in the training of each model, based on how well they've been predicted by previous models. Thus points which are more difficult to predict get more attention.

For simplicity, let's consider a binary classification problem with n training observations (\mathbf{X}_1, Y_1), \ldots, (\mathbf{X}_n, Y_n). Let's suppose that the \mathbf{X}_i's are points in the plane, while the Y_i's are elements of \{-1,+1\}.

We begin by training a decision tree on the training data in the usual way, and we call the resulting predictor h_1. We associate h_1 with a value \alpha_1 which indicates h_1's overall effectiveness: it's defined to be \frac{1}{2}\log\left(\frac{1-\epsilon_1}{\epsilon_1}\right), where \epsilon_1 is the proportion of misclassified training data. The relationship between \epsilon and \alpha is shown here (running the cell twice is recommended):

     using Plots, LaTeXStrings
plot(0.01:0.01:0.99, ϵ -> 1/2 * log((1-ϵ)/ϵ), frame = :origin,
     xlabel = L"\epsilon", ylabel = L"\alpha",
     label = L"\alpha = \frac{1}{2}\log\left(\frac{1-\epsilon}{\epsilon}\right)")

So if \epsilon is small, then \alpha is , while if \epsilon is close to 1, then \alpha is .

Next, we come up with new weights for the training observations. We assign the i\text{th} observation a weight which is proportional to \operatorname{e}^{-\alpha_1 Y_i h_1(\mathbf{X}_i)}. In other words, if Y_i h_1(\mathbf{X}_i) = 1, indicating classification, then the observation gets a weight of \operatorname{e}^{-\alpha_1}. Otherwise, the observation gets a weight of \operatorname{e}^{\alpha_1}.

Next, we find a new predictor which is trained on the same training data but with the new weights we just worked out. This requires that we modify the training algorithm to accommodate weights for the observations. For example, we could adjust the CART algorithm by defining the Gini impurity of a weighted set of labeled objects to be the probability of getting different labels when drawing two objects independently with probabilities proportional the objects' weights.

We then define \epsilon_2 to be weighted proportion of misclassified training data, and define h_2's overall effectiveness score of \alpha_2 = \frac{1}{2}\log\left(\frac{1-\epsilon_2}{\epsilon_2}\right). We update the weight associated with each observation i by multiplying the current weight by \operatorname{e}^{-\alpha_2 Y_i h_2(\mathbf{X}_i)}.

Repeating this process T times for some prescribed integer T, we return

\begin{align*}H(x) = \operatorname{sign}\left(\sum_{t=1}^T \alpha_t h_t(\mathbf{x})\right)\end{align*}

as our final predictor. In other words, the vote of each classifier is weighted according to its .

Gradient Boosting

Gradient boosting takes a different approach to boosting: we train each new model on the difference between the training response values and the values predicted by our current model. Since we're looking to subtract response values, gradient boosting is naturally suited to problems.

Since the models are trained to approximate the difference between the response variable and the preceding models, we need to the values returned by the models to produce our final regression estimator. Here's an example of by-hand gradient boosting with three decision trees:

using Plots, DecisionTree
n = 50
X = rand(n,1)
Y = [x - x^2 + 0.05randn() for x in X[:]]

model₁ = DecisionTreeRegressor(max_depth=2)
model₂ = DecisionTreeRegressor(max_depth=2)
model₃ = DecisionTreeRegressor(max_depth=2)

fit!(model₁, X, Y)
Y₁ = predict(model₁, X)
fit!(model₂, X, Y - Y₁)
Y₂ = predict(model₂, X)
fit!(model₃, X, Y - Y₁ - Y₂)

r̂(x) = sum(predict(m, [x]) for m in (model₁, model₂, model₃))
scatter(X[:], Y, label = "training data")
plot!(0:0.001:1, r̂, label = "gradient boosted prediction")

Gradient boosting with decision trees is often an extremely effective method in practice. The library XGBoost implements gradient-boosted decision trees, and it's one of the most commonly used machine learning libraries. It often produces state-of-the-art results despite its relative simplicity.

Bruno Bruno