Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Contents
- Informal introduction
- Algorithm
- Gradient tree boosting
- Size of trees
- Regularization
- Shrinkage
- Stochastic gradient boosting
- Number of observations in leaves
- Penalize complexity of tree
- Usage
- Names
- References
The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman, simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean. The latter two papers introduced the abstract view of boosting algorithms as iterative functional gradient descent algorithms, that is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
Informal introduction
(This section follows the exposition of gradient boosting by Li.)
Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner, in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to "teach" a model F to predict values of the form ŷ = F(x) by minimizing the mean squared error (ŷ − y)², averaged over some training set of actual values of the output variable y.
At each stage m, 1 ≤ m ≤ M, of gradient boosting, it may be assumed that there is some imperfect model F_m (at the outset, a very weak model that just predicts the mean of y in the training set could be used). The gradient boosting algorithm improves on F_m by constructing a new model that adds an estimator h to give a better model, F_{m+1}(x) = F_m(x) + h(x). To find h, the gradient boosting solution starts with the observation that a perfect h would imply
F_{m+1}(x) = F_m(x) + h(x) = y
or, equivalently,
h(x) = y − F_m(x).
Therefore, gradient boosting will fit h to the residual y − F_m(x). As in other boosting variants, each F_{m+1} attempts to correct the errors of its predecessor F_m. A generalization of this idea to loss functions other than squared error (and to classification and ranking problems) follows from the observation that the residuals y − F(x) of a given model are the negative gradients, with respect to F(x), of the squared error loss function ½(y − F(x))². So, gradient boosting is a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.
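To make the residual-fitting idea concrete, here is a minimal sketch (not part of the original exposition) of least-squares gradient boosting with small regression trees as the weak learners. It assumes numpy and scikit-learn are available, and all names and settings are purely illustrative.

```python
# Minimal sketch: each stage fits a shallow regression tree to the current
# residuals y - F_m(x) and adds it to the model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_stages = 50
F = np.full_like(y, y.mean())          # F_0: a very weak model that predicts the mean of y
trees = []
for m in range(n_stages):
    residual = y - F                   # h is fit to the residual y - F_m(x)
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(h)
    F = F + h.predict(X)               # F_{m+1}(x) = F_m(x) + h(x)

print("training MSE:", np.mean((y - F) ** 2))
```

A prediction for a new point would sum the initial mean and the outputs of the stored trees.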
Algorithm
In many supervised learning problems one has an output variable y and a vector of input variables x, connected to each other via a joint probability distribution P(x, y). Using a training set {(x_1, y_1), …, (x_n, y_n)} of known values of x and corresponding values of y, the goal is to find an approximation F̂(x) to a function F*(x) that minimizes the expected value of some specified loss function L(y, F(x)):
F̂ = arg min_F E_{x,y}[L(y, F(x))].
The gradient boosting method assumes a real-valued y and seeks an approximation F̂(x) in the form of a weighted sum of functions h_i(x) from some class ℋ, called base (or weak) learners:
F(x) = Σ_{i=1}^{M} γ_i h_i(x) + const.
In accordance with the empirical risk minimization principle, the method tries to find an approximation F̂(x) that minimizes the average value of the loss function on the training set. It does so by starting with a model consisting of a constant function F_0(x) and incrementally expanding it in a greedy fashion:
F_0(x) = arg min_γ Σ_{i=1}^{n} L(y_i, γ),
F_m(x) = F_{m−1}(x) + arg min_f Σ_{i=1}^{n} L(y_i, F_{m−1}(x_i) + f(x_i)),
where f is restricted to be a function from the class ℋ of base learner functions.
However, the problem of choosing at each step the best f for an arbitrary loss function L is a hard optimization problem in general, and so we'll "cheat" by solving a much easier problem instead.
The idea is to apply a steepest descent step to this minimization problem. If we only cared about predictions at the points of the training set, and f were unrestricted, we'd update the model per the following equations, where we view L(y_i, F(x_i)) as a function of the numbers F(x_i) rather than of the function F:
F_m(x) = F_{m−1}(x) − γ_m Σ_{i=1}^{n} ∇_F L(y_i, F_{m−1}(x_i)),
γ_m = arg min_γ Σ_{i=1}^{n} L(y_i, F_{m−1}(x_i) − γ ∇_F L(y_i, F_{m−1}(x_i))).
But as f must come from a restricted class of functions (that's what allows us to generalize), we'll just choose the one that most closely approximates the gradient of L. Having chosen f, the multiplier γ is then selected using line search just as shown in the second equation above.
In pseudocode, the generic gradient boosting method is:
Input: training set {(x_i, y_i)}_{i=1}^{n}, a differentiable loss function L(y, F(x)), number of iterations M.
1. Initialize the model with a constant value:
   F_0(x) = arg min_γ Σ_{i=1}^{n} L(y_i, γ).
2. For m = 1 to M:
   1. Compute the pseudo-residuals:
      r_{im} = −[∂L(y_i, F(x_i)) / ∂F(x_i)] evaluated at F = F_{m−1}, for i = 1, …, n.
   2. Fit a base learner h_m(x) to the pseudo-residuals, i.e. train it using the training set {(x_i, r_{im})}_{i=1}^{n}.
   3. Compute the multiplier γ_m by solving the one-dimensional optimization problem:
      γ_m = arg min_γ Σ_{i=1}^{n} L(y_i, F_{m−1}(x_i) + γ h_m(x_i)).
   4. Update the model:
      F_m(x) = F_{m−1}(x) + γ_m h_m(x).
3. Output F_M(x).
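The pseudocode translates fairly directly into code. The sketch below is an illustration rather than a reference implementation: it uses a shallow scikit-learn regression tree as the base learner, a crude grid search both for the initial constant and for the line search over γ, and the absolute-error loss purely as an example of an arbitrary differentiable loss. All names are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, neg_gradient, M=100, max_depth=2):
    """Generic gradient boosting following the pseudocode above (illustrative)."""
    # Step 1: initialise with the constant that minimises the total loss
    # (a coarse grid search stands in for the arg min).
    consts = np.linspace(y.min(), y.max(), 200)
    F0 = consts[np.argmin([loss(y, np.full(len(y), c)) for c in consts])]
    F = np.full(len(y), F0, dtype=float)
    stages = []
    for m in range(M):
        r = neg_gradient(y, F)                                     # 2.1 pseudo-residuals
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # 2.2 fit base learner
        hx = h.predict(X)
        gammas = np.linspace(-2.0, 2.0, 81)                        # 2.3 one-dimensional line search
        gamma = gammas[np.argmin([loss(y, F + g * hx) for g in gammas])]
        F = F + gamma * hx                                         # 2.4 update the model
        stages.append((gamma, h))
    return F0, stages

# Example loss: least absolute deviation and its negative gradient.
l1_loss = lambda y, F: np.abs(y - F).sum()
l1_neg_grad = lambda y, F: np.sign(y - F)
# Usage (illustrative): F0, stages = gradient_boost(X, y, l1_loss, l1_neg_grad)
```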
Gradient tree boosting
Gradient boosting is typically used with decision trees (especially CART trees) of a fixed size as base learners. For this special case, Friedman proposes a modification to the gradient boosting method which improves the quality of fit of each base learner.
Generic gradient boosting at the m-th step would fit a decision tree h_m(x) to pseudo-residuals. Let J_m be the number of its leaves. The tree partitions the input space into J_m disjoint regions R_{1m}, …, R_{J_m m} and predicts a constant value in each region. Using the indicator notation, the output of h_m(x) for input x can be written as the sum
h_m(x) = Σ_{j=1}^{J_m} b_{jm} 1_{R_{jm}}(x),
where b_{jm} is the value predicted in the region R_{jm}.
Then the coefficients b_{jm} are multiplied by some value γ_m, chosen using line search so as to minimize the loss function, and the model is updated as follows:
F_m(x) = F_{m−1}(x) + γ_m h_m(x),   γ_m = arg min_γ Σ_{i=1}^{n} L(y_i, F_{m−1}(x_i) + γ h_m(x_i)).
Friedman proposes to modify this algorithm so that it chooses a separate optimal value γ_{jm} for each of the tree's regions, instead of a single γ_m for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients b_{jm} from the tree-fitting procedure can then simply be discarded, and the model update rule becomes:
F_m(x) = F_{m−1}(x) + Σ_{j=1}^{J_m} γ_{jm} 1_{R_{jm}}(x),   γ_{jm} = arg min_γ Σ_{x_i ∈ R_{jm}} L(y_i, F_{m−1}(x_i) + γ).
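As an illustration of the per-region update, the sketch below (hypothetical names, assuming scikit-learn) performs one TreeBoost step for the absolute-error loss L(y, F) = |y − F|, for which the optimal γ_jm in a region is simply the median of the current residuals in that region.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def treeboost_step(X, y, F, max_depth=2):
    """One TreeBoost step for L(y, F) = |y - F|; returns the updated F and the stage."""
    r = np.sign(y - F)                                  # pseudo-residuals for the L1 loss
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
    leaf = tree.apply(X)                                # region index R_jm of each sample
    # Separate optimal gamma_jm per region: the median of the residuals in that leaf.
    gamma = {j: np.median((y - F)[leaf == j]) for j in np.unique(leaf)}
    F_new = F + np.array([gamma[j] for j in leaf])
    return F_new, (tree, gamma)
```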
Size of trees
J, the number of terminal nodes in the trees, is the method's parameter that can be adjusted for a data set at hand. It controls the maximum allowed level of interaction between variables in the model: with J = 2 (decision stumps), no interaction between variables is allowed; with J = 3 the model may include effects of the interaction between up to two variables, and so on. Hastie et al. comment that typically 4 ≤ J ≤ 8 works well for boosting and results are fairly insensitive to the choice of J in this range, that J = 2 is insufficient for many applications, and that J > 10 is unlikely to be required.
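For instance, in scikit-learn's gradient boosting implementation the number of terminal nodes can be capped directly; the particular value below is only an example within the range mentioned above.

```python
# Cap the number of terminal nodes J of each tree via max_leaf_nodes.
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(max_leaf_nodes=6)
```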
Regularization
Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.
One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of trees in the model when the base learner is a decision tree). Increasing M reduces the error on training set, but setting it too high may lead to overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set. Besides controlling M, several other regularization techniques are used.
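One practical way to pick M is sketched below, assuming scikit-learn: train with a generous number of stages and use staged predictions on a held-out validation set to find the iteration with the lowest error. The dataset and settings are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                  random_state=0).fit(X_train, y_train)

# Validation error after 1, 2, ..., M stages; pick the M that minimises it.
val_errors = [np.mean((y_val - pred) ** 2)
              for pred in model.staged_predict(X_val)]
best_M = int(np.argmin(val_errors)) + 1
print("best number of iterations M:", best_M)
```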
Shrinkage
An important part of the gradient boosting method is regularization by shrinkage, which consists in modifying the update rule as follows:
F_m(x) = F_{m−1}(x) + ν · γ_m h_m(x),   0 < ν ≤ 1,
where the parameter ν is called the "learning rate".
Empirically it has been found that using small learning rates (such as ν < 0.1) yields dramatic improvements in models' generalization ability over gradient boosting without shrinking (ν = 1). However, it comes at the price of increased computational time both during training and querying: a lower learning rate requires more iterations.
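The trade-off can be checked empirically with a sketch like the following (illustrative dataset and settings), comparing no shrinkage against a small learning rate with correspondingly more iterations:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# nu = 1.0 is gradient boosting without shrinkage; a small nu needs more stages.
for nu, M in [(1.0, 100), (0.05, 2000)]:
    model = GradientBoostingRegressor(learning_rate=nu, n_estimators=M,
                                      random_state=0).fit(X_tr, y_tr)
    mse = np.mean((y_val - model.predict(X_val)) ** 2)
    print(f"nu={nu:<4} M={M:<5} validation MSE={mse:.1f}")
```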
Stochastic gradient boosting
Soon after the introduction of gradient boosting Friedman proposed a minor modification to the algorithm, motivated by Breiman's bagging method. Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement. Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.
Subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration. Friedman obtained that 0.5 ≤ f ≤ 0.8 leads to good results for small and moderate sized training sets; f is therefore often set to 0.5, meaning that one half of the training set is used to build each base learner.
Also, like in bagging, subsampling allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.
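In scikit-learn, for example, setting subsample below 1 switches on stochastic gradient boosting and records an out-of-bag estimate of the stage-wise improvement; the stopping rule sketched below is only one simple way to use it, and the settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                  subsample=0.5,   # fraction f of the training set per stage
                                  random_state=0).fit(X, y)

# oob_improvement_ holds the improvement in loss on the left-out samples at
# each iteration; stop once its cumulative sum has peaked.
cum_oob = np.cumsum(model.oob_improvement_)
print("OOB-suggested number of iterations:", int(np.argmax(cum_oob)) + 1)
```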
Number of observations in leaves
Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes (this parameter is called n.minobsinnode in the R gbm package). It is used in the tree-building process by ignoring any splits that would lead to nodes containing fewer than this number of training set instances.
Imposing this limit helps to reduce variance in predictions at leaves.
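A rough scikit-learn analogue of this parameter is min_samples_leaf; the correspondence with gbm's n.minobsinnode is approximate, and the chosen value is only an example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
# Splits that would create a terminal node with fewer than 10 training
# instances are not considered.
model = GradientBoostingRegressor(min_samples_leaf=10, random_state=0).fit(X, y)
```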
Penalize complexity of tree
Another useful regularization technique for gradient boosted trees is to penalize the complexity of the learned model. The model complexity can be defined as proportional to the number of leaves in the learned trees. The joint optimization of loss and model complexity corresponds to a post-pruning algorithm that removes branches which fail to reduce the loss by a threshold. Other kinds of regularization, such as an ℓ2 penalty on the leaf values, can also be added to avoid overfitting.
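Libraries such as xgboost expose this kind of penalty directly; the snippet below is a hedged illustration (xgboost is not mentioned in the text above, and the parameter values are arbitrary) of a per-leaf penalty acting as a split threshold together with an ℓ2 penalty on leaf values.

```python
# Illustrative xgboost configuration: gamma penalises each additional leaf
# (a split must reduce the loss by at least this much), and reg_lambda adds
# an L2 penalty on the leaf values.
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=200, gamma=1.0, reg_lambda=1.0)
```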
Usage
Recently, gradient boosting has gained some popularity in the field of learning to rank. The commercial web search engines Yahoo and Yandex use variants of gradient boosting in their machine-learned ranking engines.
Names
The method goes by a variety of names. Friedman introduced his regression technique as a "Gradient Boosting Machine" (GBM). Mason, Baxter et al. described the generalized abstract class of algorithms as "functional gradient boosting".
A popular open-source implementation for R calls it "Generalized Boosting Model". Commercial implementations from Salford Systems use the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked.