AdaBoost

Updated on Sep 24, 2024

Edit

Comment

AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire who won the Gödel Prize in 2003 for their work. It can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (e.g., their error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner.

While every learning algorithm will tend to suit some problem types better than others, and will typically have many different parameters and configurations to be adjusted before achieving optimal performance on a dataset, AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growing algorithm such that later trees tend to focus on harder-to-classify examples.

Overview

Problems in machine learning often suffer from the curse of dimensionality — each sample may consist of a huge number of potential features (for instance, there can be 162,336 Haar features, as used by the Viola–Jones object detection framework, in a 24×24 pixel image window), and evaluating every feature can reduce not only the speed of classifier training and execution, but in fact reduce predictive power, per the Hughes Effect. Unlike neural networks and SVMs, the AdaBoost training process selects only those features known to improve the predictive power of the model, reducing dimensionality and potentially improving execution time as irrelevant features do not need to be computed.

Training

AdaBoost refers to a particular method of training a boosted classifier. A boost classifier is a classifier in the form

F T ( x ) = ∑ t = 1 T f t ( x )

where each f t is a weak learner that takes an object x as input and returns a value indicating the class of the object. For example in the two class problem, the sign of the weak learner output identifies the predicted object class and the absolute value gives the confidence in that classification. Similarly, the T th classifier will be positive if the sample is believed to be in the positive class and negative otherwise.

Each weak learner produces an output, hypothesis h ( x i ) , for each sample in the training set. At each iteration t , a weak learner is selected and assigned a coefficient α t such that the sum training error E t of the resulting t -stage boost classifier is minimized.

E t = ∑ i E [ F t − 1 ( x i ) + α t h ( x i ) ]

Here F t − 1 ( x ) is the boosted classifier that has been built up to the previous stage of training, E ( F ) is some error function and f t ( x ) = α t h ( x ) is the weak learner that is being considered for addition to the final classifier.

Weighting

At each iteration of the training process, a weight w t is assigned to each sample in the training set equal to the current error E ( F t − 1 ( x i ) ) on that sample. These weights can be used to inform the training of the weak learner, for instance, decision trees can be grown that favor splitting sets of samples with high weights.

Derivation

This derivation follows Rojas (2009):

Suppose we have a data set { ( x 1 , y 1 ) , … , ( x N , y N ) } where each item x i has an associated class y i ∈ { − 1 , 1 } , and a set of weak classifiers { k 1 , … , k L } each of which outputs a classification k j ( x i ) ∈ { − 1 , 1 } for each item. After the m − 1 -th iteration our boosted classifier is a linear combination of the weak classifiers of the form:

C ( m − 1 ) ( x i ) = α 1 k 1 ( x i ) + ⋯ + α m − 1 k m − 1 ( x i )

At the m -th iteration we want to extend this to a better boosted classifier by adding a multiple of one of the weak classifiers:

C m ( x i ) = C ( m − 1 ) ( x i ) + α m k m ( x i )

So it remains to determine which weak classifier is the best choice for k m , and what its weight α m should be. We define the total error E of C m to be the sum of its exponential loss on each data point, given as follows:

E = ∑ i = 1 N e − y i C m ( x i )

Letting w i ( 1 ) = 1 and w i ( m ) = e − y i C m − 1 ( x i ) for m > 1 , we have:

E = ∑ i = 1 N w i ( m ) e − y i α m k m ( x i )

We can split this summation between those data points that are correctly classified by k m (so y i k m ( x i ) = 1 ) and those which are misclassified (so y i k m ( x i ) = − 1 ):

E = ∑ y i = k m ( x i ) w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) e α m = ∑ i = 1 N w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) ( e α m − e − α m )

Since the only part of the right-hand side of this equation that depends on k m is ∑ y i ≠ k m ( x i ) w i ( m ) , we see that the k m that minimizes E is the one that minimizes ∑ y i ≠ k m ( x i ) w i ( m ) , i.e. the weak classifier with the lowest weighted error (with weights w i ( m ) = e − y i C m − 1 ( x i ) ).

In order to determine the desired weight α m that minimizes E with the k m that we just determined, we differentiate:

d E d α m = ∑ y i ≠ k m ( x i ) w i ( m ) e α m − ∑ y i = k m ( x i ) w i ( m ) e − α m

Setting this to zero and solving for α m yields:

α m = 1 2 ln ⁡ ( ∑ y i = k m ( x i ) w i ( m ) ∑ y i ≠ k m ( x i ) w i ( m ) )

We calculate the weighted error rate of the weak classifier to be ϵ m = ∑ y i ≠ k m ( x i ) w i ( m ) / ∑ i = 1 N w i ( m ) , so it follows that:

α m = 1 2 ln ⁡ ( 1 − ϵ m ϵ m )

which is the negative logit function multiplied by 0.5.

Thus we have derived the AdaBoost algorithm: At each iteration, choose the classifier k m which minimizes the total weighted error ∑ y i ≠ k m ( x i ) w i ( m ) , use this to calculate the error rate ϵ m = ∑ y i ≠ k m ( x i ) w i ( m ) / ∑ i = 1 N w i ( m ) , use this to calculate the weight α m = 1 2 ln ⁡ ( 1 − ϵ m ϵ m ) , and finally use this to improve the boosted classifier C m − 1 to C m = C ( m − 1 ) + α m k m .

Statistical understanding of boosting

Boosting is a form of linear regression in which the features of each sample x i are the outputs of some weak learner h applied to x i . Specifically, in the case where all weak learners are known a priori, AdaBoost corresponds to a single iteration of the backfitting algorithm in which the smoothing splines are the minimizers of ∑ i = 1 n e − Y i μ ^ ( x i ) + ∞ ∫ x 1 x n μ ^ ″ ( x ) 2 d x , that is: μ ^ i fits an exponential cost function and is linear with respect to the observation. Thus, boosting is seen to be a specific type of linear regression.

While regression tries to fit F ( x ) to y ( x ) as precisely as possible without loss of generalization, typically using least square error E ( f ) = ( y ( x ) − f ( x ) ) 2 , the AdaBoost error function E ( f ) = e − y ( x ) f ( x ) takes into account the fact that only the sign of the final result will be used, thus | F ( x ) | can be far larger than 1 without increasing error. However, the exponential increase in the error for sample x i as − y ( x i ) f ( x i ) increases results in excessive weight being assigned to outliers.

One feature of the choice of exponential error function is that the error of the final additive model is the product of the error of each stage, that is, e ∑ i − y i f ( x i ) = ∏ i e − y i f ( x i ) . Thus it can be seen that the weight update in the AdaBoost algorithm is equivalent to recalculating the error on F t ( x ) after each stage.

There is a lot of flexibility allowed in the choice of loss function. As long as the loss function is monotonic and continuously differentiable, the classifier will always be driven toward purer solutions. Zhang (2004) provides a loss function based on least squares, a modified Huber loss function:

ϕ ( y , f ( x ) ) = { − 4 y f ( x ) if y f ( x ) < − 1 , ( y f ( x ) − 1 ) 2 if − 1 ≤ y f ( x ) ≤ 1 , 0 if y f ( x ) > 1

This function is more well-behaved than LogitBoost for f ( x ) close to 1 or -1, does not penalise ‘overconfident’ predictions ( y f ( x ) > 1 ), unlike unmodified least squares, and only penalises samples misclassified with confidence greater than 1 linearly, as opposed to quadratically or exponentially, and is thus less susceptible to the effects of outliers.

Boosting as gradient descent

Boosting can be seen as minimization of a convex loss function over a convex set of functions. Specifically, the loss being minimized by AdaBoost is the exponential loss ∑ i ϕ ( i , y , f ) = ∑ i e − y i f ( x i ) , whereas LogitBoost performs logistic regression, minimizing ∑ i ϕ ( i , y , f ) = ∑ i ln ⁡ ( 1 + e − y i f ( x i ) ) .

In the gradient descent analogy, the output of the classifier for each training point is considered to be a point ( F t ( x 1 ) , … , F t ( x n ) ) in n-dimensional space, where each axis corresponds to a training sample, each weak learner h ( x ) corresponds to a vector of fixed orientation and length, and the goal is to reach the target point ( y 1 , … , y n ) (or any region where the value of loss function E T ( x 1 , … , x n ) is less than the value at that point), in the least number of steps. Thus AdaBoost algorithms perform either Cauchy (find h ( x ) with the steepest gradient, choose α to minimize test error) or Newton (choose some target point, find α h ( x ) that will bring F t closest to that point) optimization of training error.

Example algorithm (Discrete AdaBoost)

With:

Samples x 1 … x n

Desired outputs y 1 … y n , y ∈ { − 1 , 1 }

Initial weights w 1 , 1 … w n , 1 set to 1 n

Error function E ( f ( x ) , y , i ) = e − y i f ( x i )

Weak learners h : x → [ − 1 , 1 ]

For t in 1 … T :

Choose h t ( x ) :

Find weak learner h t ( x ) that minimizes ϵ t , the weighted sum error for misclassified points ϵ t = ∑ h t ( x i ) ≠ y i i = 1 n w i , t

Choose α t = 1 2 ln ⁡ ( 1 − ϵ t ϵ t )

Add to ensemble:

F t ( x ) = F t − 1 ( x ) + α t h t ( x )

Update weights:

w i , t + 1 = w i , t e − y i α t h t ( x i ) for all i

Renormalize w i , t + 1 such that ∑ i w i , t + 1 = 1

(Note: It can be shown that ∑ h t + 1 ( x i ) = y i w i , t + 1 ∑ h t + 1 ( x i ) ≠ y i w i , t + 1 = ∑ h t ( x i ) = y i w i , t ∑ h t ( x i ) ≠ y i w i , t at every step, which can simplify the calculation of the new weights.)

Choosing αt

α t is chosen as it can be analytically shown to be the minimizer of the exponential error function for Discrete AdaBoost.

Minimize:

∑ i w i e − y i h i α t

Using the convexity of the exponential function, and assuming that ∀ i , h i ∈ [ − 1 , 1 ] we have:

∑ i w i e − y i h i α t ≤ ∑ i ( 1 − y i h i 2 ) w i e α t + ∑ i ( 1 + y i h i 2 ) w i e − α t = ( 1 + ϵ t 2 ) e α t + ( 1 − ϵ t 2 ) e − α t

We then differentiate that expression with respect to α t and set it to zero to find the minimum of the upper bound:

( 1 + ϵ t 2 ) e α t − ( 1 − ϵ t 2 ) e − α t = 0 α t = 1 2 ln ⁡ ( 1 − ϵ t 1 + ϵ t )

Note that this only applies when h i ∈ { − 1 , 1 } , though it can be a good starting guess in other cases, such as when the weak learner is biased ( h ( x ) ∈ { a , b } , a ≠ − b ), has multiple leaves ( h ( x ) ∈ { a , b , … , n } ) or is some other function h ( x ) ∈ R . In such cases the choice of weak learner and coefficient can be condensed to a single step in which f t = α t h t ( x ) is chosen from all possible α , h as the minimizer of ∑ i w i , t e − y i f t ( x i ) by some numerical searching routine.

Real AdaBoost

The output of decision trees is a class probability estimate p ( x ) = P ( y = 1 | x ) , the probability that x is in the positive class. Friedman, Hastie and Tibshirani derive an analytical minimizer for e − y ( F t − 1 ( x ) + f t ( p ( x ) ) ) for some fixed p ( x ) (typically chosen using weighted least squares error):

f t ( x ) = 1 2 ln ⁡ ( x 1 − x ) .

Thus, rather than multiplying the output of the entire tree by some fixed value, each leaf node is changed to output half the logit transform of its previous value.

LogitBoost

LogitBoost represents an application of established logistic regression techniques to the AdaBoost method. Rather than minimizing error with respect to y, weak learners are chosen to minimize the (weighted least-squares) error of f t ( x ) with respect to

z t = y ∗ − p t ( x ) 2 p t ( x ) ( 1 − p t ( x ) ) , where p t ( x ) = e F t − 1 ( x ) e F t − 1 ( x ) + e − F t − 1 ( x ) , w t = p t ( x ) ( 1 − p t ( x ) ) and y ∗ = y + 1 2 .

That is z t is the Newton-Raphson approximation of the minimizer of the log-likelihood error at stage t , and the weak learner f t is chosen as the learner that best approximates z t by weighted least squares.

As p approaches either 1 or 0, the value of p t ( x i ) ( 1 − p t ( x i ) ) becomes very small and the z term, which will be large for misclassified samples, can become numerically unstable, due to machine precision rounding errors. This can be overcome by enforcing some limit on the absolute value of z and the minimum value of w.

Gentle AdaBoost

While previous boosting algorithms choose f t greedily, minimizing the overall test error as much as possible at each step, GentleBoost features a bounded step size. f t is chosen to minimize ∑ i w t , i ( y i − f t ( x i ) ) 2 , and no further coefficient is applied. Thus, in the case where a weak learner exhibits perfect classification performance, GentleBoost will choose f t ( x ) = α t h t ( x ) exactly equal to y , while steepest descent algorithms will try to set α t = ∞ . Empirical observations about the good performance of GentleBoost appear to back up Schapire and Singer's remark that allowing excessively large values of α can lead to poor generalization performance.

Early Termination

A technique for speeding up processing of boosted classifiers, early termination refers to only testing each potential object with as many layers of the final classifier necessary to meet some confidence threshold, speeding up computation for cases where the class of the object can easily be determined. One such scheme is the object detection framework introduced by Viola and Jones: in an application with significantly more negative samples than positive, a cascade of separate boost classifiers is trained, the output of each stage biased such that some acceptably small fraction of positive samples is mislabeled as negative, and all samples marked as negative after each stage are discarded. If 50% of negative samples are filtered out by each stage, only a very small number of objects would pass through the entire classifier, reducing computation effort. This method has since been generalized, with a formula provided for choosing optimal thresholds at each stage to achieve some desired false positive and false negative rate.

In the field of statistics, where AdaBoost is more commonly applied to problems of moderate dimensionality, early stopping is used as a strategy to reduce overfitting. A validation set of samples is separated from the training set, performance of the classifier on the samples used for training is compared to performance on the validation samples, and training is terminated if performance on the validation sample is seen to decrease even as performance on the training set continues to improve.

Totally Corrective Algorithms

For steepest descent versions of AdaBoost, where α t is chosen at each layer t to minimize test error, the next layer added is said to be maximally independent of layer t: it is unlikely that a weak learner t+1 will be chosen that is similar to learner t. However, there remains the possibility that t+1 produces similar information to some other earlier layer. Totally corrective algorithms, such as LPBoost, optimize the value of every coefficient after each step, such that new layers added are always maximally independent of every previous layer. This can be accomplished by backfitting, linear programming or some other method.

Pruning

Pruning refers to the process of removing poorly performing weak classifiers, in order to improve the memory and execution-time cost of the boosted classifier. The simplest methods, which can be particularly effective in conjunction with totally corrective training, are weight- or margin-trimming: when the coefficient, or the contribution to the total test error, of some weak classifier falls below a certain threshold, that classifier is dropped. Margineantu & Dietterich suggest an alternative criterion for trimming: weak classifiers should be selected such that the diversity of the ensemble is maximized. If two weak learners produce very similar outputs, efficiency can be improved by removing one of them and increasing the coefficient of the remaining weak learner.

Implementations

AdaBoost and the Super Bowl of Classifiers - A Tutorial on AdaBoost.

AdaBoost in C++, an implementation of AdaBoost in C++ and boost by Antonio Gulli

bonzaiboost, a fast (and multi-threaded) C++ implementation of multi-class/multi-label Adaboost.MH algorithm over small decision tree (bonsai). It offers several text features extraction facilities.

icsiboost, an open source implementation of Boostexter

JBoost, a site offering a classification and visualization package, implementing AdaBoost among other boosting algorithms.

MATLAB AdaBoost toolbox. Includes Real AdaBoost, Gentle AdaBoost and Modest AdaBoost implementations.

A Matlab Implementation of AdaBoost

Multi-threaded MATLAB-compatible implementation of Boosted Trees

milk for Python implements AdaBoost.

MPBoost++, a C++ implementation of the original AdaBoost.MH algorithm and of an improved variant, the MPBoost algorithm.

multiboost, a fast C++ implementation of multi-class/multi-label/multi-task boosting algorithms. It is based on AdaBoost.MH but also implements popular cascade classifiers and FilterBoost along with a batch of common multi-class base learners (stumps, trees, products, Haar filters).

NPatternRecognizer , a fast machine learning algorithm library written in C#. It contains support vector machine, neural networks, bayes, boost, k-nearest neighbor, decision tree, ..., etc.

OpenCV implementation of several boosting variants

Into contains open source implementations of many AdaBoost and FloatBoost variants in C++.

Mallet Java implementation.

adabag adabag: An R package for binary and multiclass Boosting and Bagging.

Scikit-learn Python implementation.

fertilized forests A general purpose, platform independent, easy to extend decision forest library that supports boosted training based on multiclass AdaBoost.M2, Samme and Samme.R

References

AdaBoost Wikipedia

(Text) CC BY-SA

Contents