A Quick Summary of Ensemble Learning Strategies

Ensemble learning means combining the outputs of several underlying models in various ways, in order to:

  • Get more accurate predictions than each model gives individually.

  • Improve generalization, thus reducing the risk of overfitting.

Most Kaggle competitions are won by ensemble methods (as of 2017).

The main types of ensemble techniques are:

Simple Voting

  • Train each underlying model on the whole training data.

    • The output of the ensemble model is the class predicted by the majority of the underlying models.
    • All underlying models have the same voting power.
  • The equivalent technique for regression is Model Averaging: the ensemble outputs the mean of the underlying models' predictions.
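
As a rough illustration of the two rules above, here is a minimal sketch using only the Python standard library. The function names `majority_vote` and `model_average`, and the toy predictions, are made up for this example; the predictions are assumed to come from models that have already been trained.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Return the label predicted by the most models.

    Ties go to the label that appears first in `predictions`.
    """
    return Counter(predictions).most_common(1)[0][0]

def model_average(predictions):
    """Regression counterpart: the plain mean of the model outputs."""
    return mean(predictions)

# Hypothetical outputs of three already-trained models for one input.
votes = ["spam", "ham", "spam"]
ensemble_label = majority_vote(votes)              # "spam" (2 votes out of 3)
ensemble_value = model_average([2.0, 3.0, 4.0])    # 3.0
```

Two of the three hypothetical models say "spam", so the ensemble does too.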

Weighted Voting

  • Train each underlying model on the whole training data.

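    • The output of the ensemble model is again decided by a vote among the underlying models, but each vote counts in proportion to its model's weight: the class with the largest total weight wins.
    • Models with larger weights therefore have more voting power than others.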
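
A weighted vote can be sketched as follows. The name `weighted_vote` and the weights are illustrative assumptions; how the weights are chosen (for instance from validation accuracy) is outside the scope of this sketch.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight behind it."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Hypothetical outputs of three trained models and their voting power.
votes = ["ham", "spam", "spam"]
weights = [0.6, 0.25, 0.15]

weighted_label = weighted_vote(votes, weights)  # "ham": 0.6 beats 0.25 + 0.15
```

With equal weights the same votes would give "spam", which is the simple voting result.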

Bagging

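  • Draw random subsets of the training data (sampling with replacement) and train one underlying model on each subset.

    • The output of the ensemble model is the average of the underlying models' outputs (or their majority vote, for classification).
    • All underlying models have the same voting power.
  • Example: Random Forest is Bagging applied to Decision Trees, with the addition that each split only considers a random subset of the features.

  • The objective is to increase generalization power by reducing variance.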
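
The bagging recipe above might look like this for a 1-D regression problem. This is only a sketch: the base learner is a one-split "stump" standing in for a full decision tree, and `fit_stump`, `bagging_fit`, `bagging_predict`, `n_models` and `seed` are names invented for the example.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression stump on 1-D data; return a predict function."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not right:  # nothing on the right side: not a real split
            continue
        l_mean, r_mean = mean(left), mean(right)
        error = (sum((y - l_mean) ** 2 for y in left)
                 + sum((y - r_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, l_mean, r_mean)
    if best is None:  # all x values identical: fall back to the mean
        constant = mean(ys)
        return lambda x: constant
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Average the predictions; every model has the same voting power."""
    return mean(model(x) for model in models)

# Toy data: low values for x <= 4, high values for x >= 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
models = bagging_fit(xs, ys)
low, high = bagging_predict(models, 2), bagging_predict(models, 7)
```

Each stump sees a slightly different sample, so individual stumps disagree a little; averaging them smooths out those differences.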

Stacking

  • Train each underlying model on the whole training data.

  • Train another model (e.g. Logistic Regression), often called the meta-model, on the outputs of the underlying models so that it learns how best to combine them.

  • The objective is to increase model accuracy and generalization power.
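
A minimal stacking sketch for regression, following the two steps above. To stay dependency-free it replaces Logistic Regression with a tiny least-squares combiner, which is a substitution made for this example, and all names (`fit_line`, `fit_stump`, `fit_combiner`, `sse`) are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Base model 1: least-squares line through 1-D data."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: y_bar + slope * (x - x_bar)

def fit_stump(xs, ys):
    """Base model 2: best single-split, piecewise-constant fit."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max: nothing to its right
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        l_mean, r_mean = mean(left), mean(right)
        error = (sum((y - l_mean) ** 2 for y in left)
                 + sum((y - r_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, l_mean, r_mean)
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean

def fit_combiner(f1, f2, ys):
    """Meta-model: least-squares weights for w1*f1 + w2*f2 (no intercept)."""
    a11 = sum(p * p for p in f1)
    a12 = sum(p * q for p, q in zip(f1, f2))
    a22 = sum(q * q for q in f2)
    b1 = sum(p * y for p, y in zip(f1, ys))
    b2 = sum(q * y for q, y in zip(f2, ys))
    det = a11 * a22 - a12 * a12  # assumed non-zero: outputs are not collinear
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

def sse(predictions, ys):
    """Sum of squared errors."""
    return sum((p - y) ** 2 for p, y in zip(predictions, ys))

# Toy data: a linear trend with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.2, 3.8, 7.1, 8.0, 8.8, 10.1]

line, stump = fit_line(xs, ys), fit_stump(xs, ys)
line_out = [line(x) for x in xs]
stump_out = [stump(x) for x in xs]

w1, w2 = fit_combiner(line_out, stump_out, ys)
stacked_out = [w1 * p + w2 * q for p, q in zip(line_out, stump_out)]
```

On the training data the stacked model can never do worse than either base model, because the combiner could always put all its weight on one of them. In practice the meta-model is often trained on held-out or cross-validated predictions instead, so that it does not simply reward the base model that overfits the most.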

Boosting

  • Train a model on the whole training data.

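    • Then train a new model to correct the errors of the previous one: gradient boosting fits it to the residuals, while AdaBoost gives more weight to the examples that were misclassified.
    • Repeat for a fixed number of rounds, or until the error stops improving.
    • The output of the ensemble model is the (weighted) sum of the outputs of all the models trained along the way.
  • The objective is to increase the accuracy (mainly by reducing bias).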

  • Examples: AdaBoost and XGBoost are variants of boosting.
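
The residual-fitting loop can be sketched as below. This is a toy version of gradient boosting for squared error, not how AdaBoost or XGBoost are implemented; the names `fit_stump`, `boost_fit`, `n_rounds` and `learning_rate` are assumptions of this example, and a fixed number of rounds stands in for "until convergence".

```python
from statistics import mean

def fit_stump(xs, ys):
    """Weak learner: best single-split, piecewise-constant fit on 1-D data."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max: nothing to its right
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        l_mean, r_mean = mean(left), mean(right)
        error = (sum((y - l_mean) ** 2 for y in left)
                 + sum((y - r_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, l_mean, r_mean)
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean

def boost_fit(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = mean(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Remove the (shrunken) part of the error this stump explains.
        residuals = [r - learning_rate * stump(x)
                     for x, r in zip(xs, residuals)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict

def mse(predict, xs, ys):
    """Mean squared error of a predict function on a dataset."""
    return mean((predict(x) - y) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 2.1, 2.4, 5.0, 5.6, 6.1, 6.4]

boosted = boost_fit(xs, ys)
baseline = lambda x: mean(ys)  # the round-zero model: always predict the mean

boosted_error = mse(boosted, xs, ys)
baseline_error = mse(baseline, xs, ys)
```

Each round only has to explain what the previous rounds left over, which is why many weak models can add up to an accurate one. The `learning_rate` shrinks every correction so that no single stump dominates.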
