Paper Summary: SMOTE: Synthetic Minority Over-sampling Technique

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

They propose a method for preprocessing an imbalanced dataset to improve the performance of classifiers trained on it.

WHY

Because many datasets in the wild are imbalanced - some classes are much more common than others.

Most classifiers tend to give more weight to classes that are more common in the training dataset, skewing results.

HOW

They combine undersampling of the majority (common) classes with oversampling of the minority (rare) classes.

However, they don't just add repeated instances of the rare classes; they create synthetic examples.

For oversampling, they create synthetic examples by looking at each minority-class point's k nearest neighbours within the same class. A new point is generated by selecting one of those neighbours at random and placing the point at a random position along the line segment between the original point and the chosen neighbour.
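
A minimal sketch of that generation step, assuming the minority-class samples are rows of a NumPy array. The function name and parameters are illustrative, and picking the base sample at random is a simplification of the paper's exact loop (which iterates over every minority sample):

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Generate n_synthetic points by interpolating minority samples
    with their k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    # For each point, the indices of its k nearest neighbours (self excluded).
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)            # pick a minority sample
        j = rng.choice(neighbours[i])  # pick one of its k neighbours
        gap = rng.random()             # random interpolation factor in [0, 1)
        # The new point lies on the segment between X_min[i] and X_min[j].
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```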

For undersampling, they just sample a random fraction of the points of the most common class (without replacement).
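
Under the same assumptions, the undersampling step is a short sketch (names are illustrative):

```python
def undersample(X_maj, fraction, rng=None):
    """Keep a random fraction of the majority-class rows,
    sampled without replacement."""
    rng = np.random.default_rng(rng)
    n_keep = int(fraction * len(X_maj))
    keep = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[keep]
```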

CLAIMS

  • The two common ways to deal with imbalanced datasets are assigning distinct costs to training examples and resampling: oversampling rare classes or undersampling very common classes.

  • In general, undersampling common classes works better than oversampling rare classes by simple replication.

  • SMOTE works better than other methods that combine over- and undersampling because its oversampled points are synthetic rather than replicated, which pushes the classifier towards larger, more general decision regions for the minority class.

NOTES

  • Imbalanced datasets are those where the class distribution is very skewed. In other words, some classes appear much more frequently than others.

    • A simple example is fraud detection: fraudulent transactions are much rarer than regular ones.

  • SMOTE is a preprocessing step and therefore classifier-agnostic: it can be paired with any downstream classifier (see the usage sketch after this list).

  • SMOTE is not the first method to combine undersampling and oversampling at the same time.
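
To illustrate the classifier-agnostic point above, here is a usage sketch with the third-party imbalanced-learn and scikit-learn libraries (neither is part of the paper; the dataset and parameters are arbitrary):

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A toy imbalanced dataset: ~95% majority class, ~5% minority class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Resample the training data, then fit any classifier on the result.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```

Because the resampling happens before training, the LogisticRegression here could be swapped for any other estimator.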


References

  • Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
