AdaBoost
Adaptive boosting based classification algorithm / From Wikipedia, the free encyclopedia
Dear Wikiwand AI, let's keep it short by simply answering these key questions:
Can you list the top facts and stats about AdaBoost?
Summarize this article for a 10 year old
AdaBoost, short for Adaptive Boosting, is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals on the real line.[1][2]
AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. In some problems it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner.
Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown that it can also effectively combine strong base learners (such as deep decision trees), producing an even more accurate model.[3]
Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier.[4][5] When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growing algorithm such that later trees tend to focus on harder-to-classify examples.
AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form
where each is a weak learner that takes an object as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Similarly, the -th classifier is positive if the sample is in a positive class and negative otherwise.
Each weak learner produces an output hypothesis which fixes a prediction for each sample in the training set. At each iteration , a weak learner is selected and assigned a coefficient such that the total training error of the resulting -stage boosted classifier is minimized.
Here is the boosted classifier that has been built up to the previous stage of training and is the weak learner that is being considered for addition to the final classifier.
Weighting
At each iteration of the training process, a weight is assigned to each sample in the training set equal to the current error on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights.
This derivation follows Rojas (2009):[6]
Suppose we have a data set where each item has an associated class , and a set of weak classifiers each of which outputs a classification for each item. After the -th iteration our boosted classifier is a linear combination of the weak classifiers of the form:
where the class will be the sign of . At the -th iteration we want to extend this to a better boosted classifier by adding another weak classifier , with another weight :
So it remains to determine which weak classifier is the best choice for , and what its weight should be. We define the total error of as the sum of its exponential loss on each data point, given as follows:
Letting and for , we have:
We can split this summation between those data points that are correctly classified by (so ) and those that are misclassified (so ):
Since the only part of the right-hand side of this equation that depends on is , we see that the that minimizes is the one that minimizes [assuming that ], i.e. the weak classifier with the lowest weighted error (with weights ).
To determine the desired weight that minimizes with the that we just determined, we differentiate:
Luckily the minimum occurs when setting this to zero, then solving for yields:
because does not depend on
We calculate the weighted error rate of the weak classifier to be , so it follows that:
which is the negative logit function multiplied by 0.5. Due to the convexity of as a function of , this new expression for gives the global minimum of the loss function.
Note: This derivation only applies when , though it can be a good starting guess in other cases, such as when the weak learner is biased (), has multiple leaves () or is some other function .
Thus we have derived the AdaBoost algorithm: At each iteration, choose the classifier , which minimizes the total weighted error , use this to calculate the error rate , use this to calculate the weight , and finally use this to improve the boosted classifier to .