A Quick Summary of Ensemble Learning Strategies
Ensemble learning refers to combining the outputs of several underlying classifiers in various ways, in order to:
- Get more accurate predictions than each model individually.
- Generalize better, thus reducing the risk of overfitting.
Most Kaggle competitions are won by ensemble methods (as of 2018).
The main types of ensemble techniques are:
Simple Voting
- Train each underlying model on the whole training data.
- The output of the ensemble is the class predicted by the majority of the underlying models.
- All underlying models have the same voting power.
- The equivalent technique for regression is Model Averaging (see the sketch below).
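To make the voting idea concrete, here is a minimal sketch using scikit-learn's VotingClassifier; the dataset and the three base models are illustrative choices, not part of any particular recipe:

```python
# Hard (majority) voting: every model is trained on the full
# training set and gets one equal vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # predict the class chosen by the majority
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```

For the regression counterpart, scikit-learn's VotingRegressor averages the base models' predictions instead of taking a vote.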
Weighted Voting
- Train each underlying model on the whole training data.
- The output of the ensemble is again the majority vote of the underlying models (as above).
- But some models have more voting power than others, as in the sketch below.
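The same VotingClassifier sketch extends to weighted voting via its `weights` argument; the weights here are arbitrary, chosen only to show unequal voting power:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

weighted = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
    weights=[2, 1, 1],  # the logistic regression vote counts double
)
weighted.fit(X, y)
```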
Bagging
- Train each underlying model on a different random subset of the training data, sampled with replacement (bootstrapping).
- The output of the ensemble is the average (for regression) or the majority vote (for classification) of the underlying models.
- All underlying models have the same voting power.
- Example: Random Forest is bagging applied to decision trees, with the addition of random feature subsets at each split.
- The objective is to increase generalization power (a sketch follows).
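Here is a sketch of both plain bagging and Random Forest, again with scikit-learn; the hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each tree is trained on a bootstrap sample of the training data
# (drawn with replacement); predictions are combined with equal weight.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,  # sample with replacement
    random_state=0,
)
bagging.fit(X, y)

# Random Forest = bagging of decision trees + random feature
# subsets considered at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
```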
Stacking
- Train each underlying model on the whole training data.
- Train another model (e.g. Logistic Regression) to learn how to best combine the outputs of each underlying model (see the sketch below).
- The objective is to increase model accuracy and generalization power.
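A minimal stacking sketch: scikit-learn's StackingClassifier trains the base models, then fits a logistic regression on their cross-validated predictions. The base models here are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),  # learns how to combine the base outputs
    cv=5,  # base outputs are produced out-of-fold to avoid leakage
)
stack.fit(X, y)
```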
Boosting
- Train a model on the whole training data.
- Then train another model on the errors (residuals) of the current ensemble's predictions.
- Repeat for a fixed number of rounds, or until the error stops improving.
- The objective is to increase accuracy.
- Examples: AdaBoost and XGBoost are variants of boosting (AdaBoost reweights misclassified examples at each round rather than fitting residuals directly).
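To show the residual-fitting idea itself rather than a library call, here is a bare-bones gradient-boosting-style loop for regression. The learning rate, tree depth, and number of rounds are arbitrary choices; in practice one would reach for GradientBoostingRegressor or XGBoost:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

learning_rate = 0.1
prediction = np.zeros_like(y)
trees = []
for _ in range(100):  # fixed number of rounds instead of a convergence test
    residuals = y - prediction          # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)              # the next model fits the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def boosted_predict(X_new):
    """Sum the (scaled) contributions of all trees."""
    return learning_rate * sum(t.predict(X_new) for t in trees)
```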
References
Learn by marketing: Kaggle Competition Analysis
- Ensembles, and XGBoost in particular (a boosting implementation), win by a large margin.
MLWave: Kaggle Ensembling Guide
- A long guide, with tons of examples and Kaggle competitions to try the methods on.
Cross Validated answer to: Bagging, boosting and stacking in machine learning
- Good overview of the pros and cons of bagging and boosting.