Select one of the keywords on the left…

The Data Science PipelineModeling

Reading time: ~55 min

A machine learning model is a mathematical description of a relationship between random variables. In this section, we will focus on supervised learning, which is the task of predicting the values of one random variable given the values of others. We call the variable to predict the response variable and the other variables features.

We get information about the joint distribution of the features and response variable in the form of a tidy data frame whose rows represent independent samples from the distribution. These data are called training data.

Machine learning with no machine

Let's warm up with some supervised human learning.

Fill in the missing data in the following data frame. Think of each row as having been randomly sampled from the distribution of a random vector (X,Y) in \mathbb{R}^2.


In the data frame above, not much is clear about the distribution of the random variable X. But it does seem reasonable to speculate that Y = 5 - X based on the available data. Based on this observation, we might choose to use h(x) = 5 - x as our prediction function for the problem of predicting Y based on the value of X.

The situation where one random variable is strictly determined by another is rare in practice. Let's try a trickier version of the previous exercise:

Fill in the missing data in the following data frame. Hint: it might help to make a scatter plot.

import as px
from datagymnasia import show
import pandas as pd
df = pd.DataFrame({'x': [2, 13, 1, 9, 20, 12, 18, 9, -8],
                 'y': [1, 325, 28, 190, 818, 291, 592, 153, 80]})
show(px.scatter(df, x = 'x', y = 'y'))

The data in the table above were actually generated from a particular distribution, and the value of y for the last observation happened to be 118. However, all you can really tell from the training data is that it looks like the value is probably between 0 and 200 or so. Because the conditional variance of Y given X is nonzero, it is not possible to predict Y values exactly given corresponding X values.

As illustrated in this example, when predicting one random variable Y given another random variable X, we accept that we don't have enough information (even with perfect knowledge of the joint distribution of X and Y) to make an exact prediction. Instead, we typically estimate the conditional expectation of Y given X. This is a value we should be able to estimate with an error going to zero as the number of observations goes to infinity and we get a clearer picture of the joint distribution of X and Y.

The human brain is remarkably well-suited to learning tasks like the one in the previous exercise. It's worth reflecting on some of its insights. Which of the following was closer to your mind's implicit model for how the random variables are related?

The advantage of the first model is that it's very . However, (spoiler alert!) the y values for the data points were in fact generated as twice the square of the corresponding x value plus an independent integer selected uniformly at random from \{-80,-79,\ldots,79,80\}. The step of squaring the x value produces some curviness which is not captured by the straight line. We say that this model is underfit: it doesn't match the data well enough because it too simple.

The advantage of the second model is that it passes through of the training points. The disadvantage is that it is clearly taking the locations of these points . If we got more observations, we would would not expect them to continue to fall on the curve. We say that the this model is overfit: it matches the training data well but will not generalize to new observations.

The third model represents a compromise between underfitting and overfitting. It's curvy enough to match the training data reasonably well, but not so well that it will fail to generalize to new observations.

The tension between underfitting and overfitting is called the bias-complexity tradeoff, and it lies at the heart of machine learning.

The following two figures represent two machine learning models for predicting a categorical variable. There are two features, represented by horizontal and vertical position in the plane. The response variable can be either 'red x' or 'blue o'. The prediction function is specified by color: points which fall in the red region are predicted to be red, and points which fall in the blue region are predicted to be blue.

Which model correctly classifies more training points?

Which model is more likely to correctly classify a new observation based on its feature values?

Even with perfect information about the distribution that the data frame rows are sampled from, is it possible to correctly classify new observations with 100% accuracy?

If the response variable is categorical, then the prediction problem is called a problem. If the response variable is quantitative, then the prediction problem is called a problem.


The examples in the previous section hint at some of the disadvantages of the relying on the human brain to make data predictions: our ability to meaningfully visualize features maxes out pretty quickly. We can visualize only features using distinct spatial dimensions, and other aesthetics like color, size, and shape can handle only a few more. Furthermore, important patterns in the data might not be visually salient in the graphical depictions we come up with.

So let's turn to doing machine learning with machines. Python's standard machine learning package is called Scikit-Learn. Let's see how it can be used to replicate the regression task you performed above, starting with linear regression.

Machine learning models in Scikit-Learn are represented as objects whose class reflects the type of estimator. For example, a linear regression model has class LinearRegression. Such objects can be fit to training data using the fit method, which takes two arguments: a two-dimensional array or data frame of features and a one-dimensional array of response values. Once the model has been fit, it can predict response values for a new two-dimensional array of features:

from sklearn.linear_model import LinearRegression
import pandas as pd
df = pd.DataFrame({'x': [2, 13, 1, 9, 20, 12, 18, 9, -8],
                  'y': [1, 325, 28, 190, 818, 291, 592, 153, 80]})
model = LinearRegression()[['x']],df['y'])

This estimate ends up being significantly than the actual response value of 118 for that sample. The linear model underfits the data, and that happens to lead to an overestimate at x = 5.

Next let's use sklearn to fit the same data with a quadratic polynomial. There is no class for polynomial fitting, because polynomial regression is actually a special case of linear regression: we create new features by computing pairwise products of the original features and then fit a linear model using the new data frame. The coefficients for that model provide the quadratic coefficients of best fit for the original features.

In Scikit-Learn, creating new columns for the pairwise products of original columns is achieved using a class called PolynomialFeatures. To actually apply the transformation, we use the fit_transform method:

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
extended_features = poly.fit_transform(df[['x']])

Now we can use this new array to fit a linear model.

model = LinearRegression(),df['y'])

Oops! That didn't work because we need to process the input in the same way we processed the training data before fitting the model.


This model returns a value which is quite close to the conditional expectation of Y given X=5, which is (Hint: look back to where you got a peek behind the curtain to see how the y values were generated).

Sklearn has a solution to the problem of having to apply the same preprocessing steps multiple times (once for training data and again at prediction time). The idea is to bind the preprocessing steps and the machine learning model into a single object. This object is called a pipeline. Let's do the same calculation again:

from sklearn.pipeline import Pipeline
model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                  ('linear', LinearRegression(fit_intercept=False))])[['x']],df['y'])

Much simpler! Note that we turned the off for the LinearRegression model because the first step in the pipeline outputs a constant column (which plays the role of an intercept).

Having used toy data to go through some of the basics of sklearn, let's fit some models on a real data set on home prices in Boston.

Let's begin by looking at the dataset's documentation.

import pydataset
boston ='Boston')'Boston',show_doc=True)

Our goal will be to predict median housing prices (the column 'medv') based on the other columns in the data frame. However, we already have all of these values! If we want to be able to assess how well we are predicting median housing prices, we need to set some of the data aside for testing. It's important that we do this from the outset, because any work we do with the test data has the potential to influence our model in the direction of working better on the data we have than on actual fresh data. The arguments test_size and random_state specify the proportion of data set aside for testing (20%) and a seed for the random number generator to ensure we reproduce the same split if we run the code again.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.drop('medv',axis=1),
                                                test_size = 0.2,
                                                random_state = 42)
linear = LinearRegression(), Y_train)

At this point we'd like to test our dflinear on the testing data we set aside. However, we should be very careful about doing that, because if we repeat that process on many models and select the best, then the testing data have effectively crept into the training loop, and we are no longer confident about how the model will perform on fresh data.

Instead what we'll do is carve out preliminary testing data from our training data. Then we'll follow the same fitting procedure to train a model on remaining training data. We can do this repeatedly to get a better sense for how our model tends to perform when it sees new data. This process is called cross-validation, and Scikit-Learn has built-in methods for it. We'll use a version called k-fold cross-validation which partitions the training data into k subsets called folds and trains the model with k-1 of the folds as training data and the last fold as test data. Each fold takes one turn as the test data, so you get k fits in total.

We supply the model and the training data to the function cross_val_score. We use negative mean squared error for assessing how well the model fits the test data. The mean squared error refers to the average squared difference between predictions and actual response values, and this value is negated since Scikit-Learn's cross-validation is designed for the convention that a higher score is better (whereas un-negated squared error has the property that is better). The cv argument specifies the number of folds.

import numpy as np
from sklearn.model_selection import cross_val_score
scores = cross_val_score(linear, X_train, Y_train,
                         scoring="neg_mean_squared_error", cv=10)
linreg_cv_scores = np.sqrt(-scores)

The average value of the variable we're trying to predict is around , so the model's accuracy is not particularly impressive.

We can swap out the linear model for other sklearn models to see how they compare. For example, let's try decision tree regression, which can account for complex relationships between features by applying branching sequences of rules.

from sklearn.tree import DecisionTreeRegressor
decisiontree = DecisionTreeRegressor()
scores = cross_val_score(decisiontree, X_train, Y_train,
                         scoring="neg_mean_squared_error", cv=10)
tree_cv_scores = np.sqrt(-scores)
import as px
from datagymnasia import show
results = pd.DataFrame(
  {'scores': np.hstack((linreg_cv_scores,
   'model': np.hstack((np.full(10,'linear'),
show(px.histogram(results, x = 'scores', color = 'model', barmode = 'group'))

Let's try one more model: the random forest. As the name suggests, random forests are made up of many decision trees.

from sklearn.ensemble import RandomForestRegressor
randomforest = RandomForestRegressor()
scores = cross_val_score(randomforest, X_train, Y_train,
                         scoring="neg_mean_squared_error", cv=10)
forest_cv_scores = np.sqrt(-scores)

Now let's compare all three models:

results = pd.DataFrame(
    {'scores': np.hstack((linreg_cv_scores,
     'model': np.hstack((np.full(10,'linear'),
show(px.histogram(results, x = 'scores', color = 'model', barmode = 'group'))

According to this bar chart, the best model among these three is the , while the second best is the . Let's check whether that holds up when we finally use the test set:, Y_train)
print(np.mean((linear.predict(X_test) - Y_test)**2)), Y_train)
print(np.mean((decisiontree.predict(X_test) - Y_test)**2)), Y_train)
print(np.mean((randomforest.predict(X_test) - Y_test)**2))

Are these results expected? Discuss.

Hyperparameter tuning

A hyperparameter is a value which impacts the behavior of a model but which is held fixed during the fitting process. For example, if you fit a polynomial of degree n to a set of data, then the coefficients of the polynomial are parameters and n is a hyperparameter.

It's ultimately up to the data scientist to make good choices for hyperparameter values. However, you can automate this process by getting Scikit-Learn to search through a collection of values and return the best ones (selected according to cross-validation results).

The hyperparameters for a random forest include n_estimators (the number of trees in the forest) and max_features (the number of features that each branch in each decision tree is allowed to consider). GridSearchCV is a Scikit-Learn class for performing hyperparameter searches. It does cross-validation for every combination of the supplied parameters (hence grid in the name).

from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
grid_search = GridSearchCV(randomforest, param_grid, cv=5,
                           scoring='neg_mean_squared_error'), Y_train)

Although there's a lot more to know about data preparation and modeling, we've seen some of the basic elements in action: do any necessary preprocessing, prepare the data for modeling using a train-test split, fit a model to the data, assess models using cross-validation, and choose model hyperparameters using a grid search.

As you gain experience, the wrangling, visualization, and modeling phases will intertwine: you wrangle to get a better visualization and to prepare data for the model, you gain modeling insights from visualization, and your preliminary modeling results suggest visualizations to investigate. The process of navigating fluidly between wrangling, visualization, and modeling is called exploratory data analysis.

Bruno Bruno