Gradient Boosting in Python using scikit-learn

Gradient boosting has become a big part of Kaggle competition winners’ toolkits. It was first explored in earnest by Jerome Friedman in the paper Greedy Function Approximation: A Gradient Boosting Machine. In this post, we’ll look at gradient boosting and its use in Python with the scikit-learn library.
Gradient boosting is a boosting ensemble method.
Ensemble machine learning methods are ones in which several predictors are aggregated to produce a final prediction with lower bias and variance than any of the individual predictors.
Ensemble machine learning methods come in two flavors: bagging and boosting.
Bagging is a method in which several predictors are trained independently of one another and then aggregated using an average (majority vote, mode, mean, or weighted mean). Random forests are an example of a bagging algorithm.
Boosting is a technique in which the predictors are trained sequentially (the error of one stage is passed as input to the next stage).
Gradient boosting produces an ensemble of decision trees that are weak decision models on their own. Let’s take a look at how this model works.
We’ll start with some imports.
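The original code isn’t shown here, but a minimal set of imports covering the walkthrough below (matplotlib is only needed for the plots) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
```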

Now we will create a dataset to illustrate how the gradient boosting method works. The label (y) to predict generally increases with the feature variable (x), but there are distinct regions of the data with different distributions.
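The exact dataset isn’t reproduced here; the sketch below builds an illustrative dataset with the same general shape (y broadly increasing with x, with distinct regions), assuming the imports above. The breakpoints and values are made up for illustration:

```python
rng = np.random.RandomState(42)

# One feature, x, on [0, 100); y broadly increases with x but
# different regions behave differently (illustrative data only).
x = np.arange(0, 100, 0.5)
y = np.piecewise(
    x,
    [x < 30, (x >= 30) & (x < 50), x >= 50],
    [lambda v: 0.7 * v,        # low, slowly increasing region
     lambda v: 120 + v,        # jump between 30 and 50
     lambda v: 4 * v + 50],    # steeper region above 50
) + rng.normal(0, 10, size=x.shape)

X = x.reshape(-1, 1)           # scikit-learn expects a 2-D feature matrix

plt.scatter(x, y, s=10)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```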

Boosting

Let’s take a look first at how a linear regression model would fit this data.
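A sketch of that fit, assuming the X, x and y arrays created above:

```python
# Fit an ordinary least-squares line to the data and plot it
linear = LinearRegression()
linear.fit(X, y)
linear_pred = linear.predict(X)

plt.scatter(x, y, s=10, label="data")
plt.plot(x, linear_pred, color="red", label="linear fit")
plt.legend()
plt.show()
```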

Linear regression models aim to minimise the squared error between the prediction and the actual output, and it is clear from the pattern of residuals that the sum of the residual errors is approximately 0:
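To see this, we can compute and plot the residuals of the linear model (again assuming the variables above):

```python
# Residuals: actual minus predicted
residuals = y - linear_pred
print("Sum of residuals:", residuals.sum())  # close to zero for an OLS fit

plt.scatter(x, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```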

It is also clear from this plot that there is a pattern in the residual errors: these are not random errors. We could fit a second model to the error terms from the output of the first model.
This is the idea behind boosting.

Gradient Boosting

Gradient boosting applies a series of decision trees in an ensemble to predict y.
So let’s begin with a Gradient Boosting regression model that has just one estimator and a tree with a depth of only 1:
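A sketch of such a model on the data above; learning_rate=1.0 is an assumption here so that the single tree’s prediction can be read directly off the plot:

```python
# A deliberately weak gradient boosting model: a single depth-1 tree (a stump)
gbr_1 = GradientBoostingRegressor(n_estimators=1, max_depth=1, learning_rate=1.0)
gbr_1.fit(X, y)

plt.scatter(x, y, s=10, label="data")
plt.plot(x, gbr_1.predict(X), color="red", label="1 estimator, depth 1")
plt.legend()
plt.show()
```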

We see that the depth-1 decision tree is split at x = 50, where:

  • If x < 50, y = 56
  • If x >= 50, y = 250

This isn’t the best model, but gradient boosting models aren’t limited to just one estimator and a single tree split. So where do we go from here?

Let’s plot the residuals from the predictions of this model:
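For example, assuming gbr_1 from the sketch above:

```python
# Residuals of the one-estimator model
gbr_1_residuals = y - gbr_1.predict(X)

plt.scatter(x, gbr_1_residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```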

With one estimator, the residuals between 30 and 40 are very high. So what would we expect if we had two estimators and fed the residuals from this first tree into the next tree?

Let’s give this a go:
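A sketch with two estimators, under the same assumptions as before:

```python
# Two estimators: the second tree is fitted to the residuals of the first
gbr_2 = GradientBoostingRegressor(n_estimators=2, max_depth=1, learning_rate=1.0)
gbr_2.fit(X, y)

plt.scatter(x, y, s=10, label="data")
plt.plot(x, gbr_2.predict(X), color="red", label="2 estimators, depth 1")
plt.legend()
plt.show()
```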

Just as we expected, the split for the second tree is made at x = 30, raising the prediction from our first tree and reducing the residual error for the region between 30 and 40.

If we continue to add estimators, we get a closer and closer approximation of the distribution of y:
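For example, a sketch comparing ensembles of increasing size (the estimator counts here are arbitrary):

```python
# Compare ensembles of increasing size, all built from depth-1 trees
for n in (5, 10, 50):
    model = GradientBoostingRegressor(n_estimators=n, max_depth=1, learning_rate=1.0)
    model.fit(X, y)
    plt.plot(x, model.predict(X), label=f"{n} estimators")

plt.scatter(x, y, s=10, color="grey", label="data")
plt.legend()
plt.show()
```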

These models only consider a tree depth of 1 (single split).

Let’s see what happens if we increase the depth of the trees in our ensemble model. Taking our 10-estimator gradient boosting model, let’s increase the tree depth:
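A sketch of this, keeping 10 estimators and increasing max_depth (the value 3 is an arbitrary choice for illustration):

```python
# Keep 10 estimators but allow deeper trees
gbr_deep = GradientBoostingRegressor(n_estimators=10, max_depth=3, learning_rate=1.0)
gbr_deep.fit(X, y)

plt.scatter(x, y, s=10, label="data")
plt.plot(x, gbr_deep.predict(X), color="red", label="10 estimators, depth 3")
plt.legend()
plt.show()
```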

We can see how, by increasing both the number of estimators and the max depth, we get a better estimate of y, but we also start to make the model prone to overfitting.

Hence it is important to use validation splits or cross-validation to make sure we are not overfitting our gradient boosting models.
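For example, a minimal sketch using scikit-learn’s cross_val_score (the parameter values here are arbitrary, and the folds are shuffled because x is sorted in the illustrative dataset above):

```python
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = GradientBoostingRegressor(n_estimators=10, max_depth=3, learning_rate=1.0)

scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
print("Cross-validated MSE:", -scores.mean())
```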
