Paper Summary: Identifying Mislabeled Instances in Classification Datasets


Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

The authors create a tool that estimates, for each example in a labelled dataset, the likelihood that its label is wrong.

WHY

Because it's usually very costly to manually check every example in a labelled dataset for mislabels.

But wrongly labelled examples degrade the quality of any model trained on that data, so it's important to keep their number to a minimum.

HOW

They build a system that estimates the likelihood that a given label is wrong. The objective is then to present a small set of candidate examples for a human expert to check.

They train a single, heavily regularized, dense, fully-connected neural net on the full data. Then they re-score the original dataset, noting down the predicted class, the actual class, and the residual (calculated by an inner product between one-hot-encoded label vectors).

Then they sort the dataset in descending order of residual and select the top X examples, where X is the number of examples you want to re-check manually. A rough sketch of the whole procedure is below.
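Here is a minimal sketch of that ranking procedure as I read it, not the authors' code. I'm assuming the residual boils down to one minus the inner product between the one-hot true label and the predicted probability vector (i.e. one minus the probability the model assigns to the recorded label), and I'm approximating the "heavily regularized" dense net with scikit-learn's MLPClassifier plus a large L2 penalty. The function name rank_suspicious_labels and all hyperparameters are my own.

```python
# Sketch only: rank training examples by how "surprised" a regularized model
# is by their recorded label, and return the most suspicious ones.
import numpy as np
from sklearn.neural_network import MLPClassifier

def rank_suspicious_labels(X, y, top_k=100):
    """Return indices of the top_k examples most likely to be mislabeled."""
    model = MLPClassifier(hidden_layer_sizes=(128, 128),
                          alpha=1.0,           # strong L2 regularization
                          max_iter=500,
                          random_state=0)
    model.fit(X, y)                            # train on the full (noisy) dataset
    proba = model.predict_proba(X)             # re-score the training data

    # residual: 1 minus the probability assigned to the recorded label
    y = np.asarray(y)
    label_idx = np.searchsorted(model.classes_, y)
    residuals = 1.0 - proba[np.arange(len(y)), label_idx]

    # largest residuals first -> hand these to a human expert for review
    return np.argsort(residuals)[::-1][:top_k]
```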

CLAIMS

  • "... the impact of label or class noise is generally higher than feature noise because there is only one label but many features, some of which may be redundant." (Highlight added by myself.)

  • There are 3 approaches to dealing with label noise:

    • Including bad labels and using robust classifiers
    • Removing examples whose labels carry even a slight amount of uncertainty (training only on the remaining examples)
    • Using some heuristic or data-driven method to identify labels with a high likelihood of being wrong, then fixing them

QUOTES

"There are three ways to deal with noisy datasets. First, it is possible to design robust algorithms that can learn even from noisy data. Second, mislabeled instances can be found and automatically removed from the dataset before training. Third, mislabeled instances can be identified and then re-evaluated by a domain expert"

NOTES

  • Classification in the presence of label noise (several types thereof) is apparently a well-researched field; it even has a survey here: Frenay and Verleysen 2013

  • Very interesting approach suggested by Sabzevari et al., 2018: they train an ensemble of multiple random forests on bootstrapped samples of the data, then measure the disagreement ratio between the classifiers for a given example to help identify wrong labels (rough sketch below).
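My own rough sketch of that disagreement idea, not Sabzevari et al.'s actual method: train several random forests on bootstrap samples and flag the examples where the ensemble members most often vote against the recorded label. The function name disagreement_ratio and the ensemble size are assumptions on my part.

```python
# Sketch only: fraction of bootstrap-trained random forests that disagree
# with the recorded label of each example.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

def disagreement_ratio(X, y, n_members=10, random_state=0):
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    votes_against = np.zeros(len(y))
    for _ in range(n_members):
        # bootstrap sample of the (noisy) training data
        Xb, yb = resample(X, y, random_state=rng.randint(1 << 30))
        rf = RandomForestClassifier(n_estimators=100,
                                    random_state=rng.randint(1 << 30))
        rf.fit(Xb, yb)
        votes_against += (rf.predict(X) != y)   # 1 where this member disagrees
    # high ratio = many members dispute the recorded label = candidate mislabel
    return votes_against / n_members
```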

MY 2¢

  • A bit naïve: The solution presented in the paper sounds quite naïve to me. All they do is shuffle the data, train a dense (or convolutional, if working with images) neural net1 on it, score the training set, and retrieve the largest residuals.

    • Other contributions include the observation that the model used must be heavily regularized, plus some specific metrics that may make comparison across strategies easier.
  • Imbalanced datasets: Bad labels are especially problematic in imbalanced datasets (where some labels are much rarer than others), for example in fraud, anomaly detection, credit default, etc.

    • This is because even a handful of wrongly-labelled positive examples can cause a lot of trouble when positives are scarce to begin with.
  • Unreal assumptions: I think that the assumption that label noise is independent of class (i.e. class-independent label noise) is not a good approximation of reality at all. Even when the authors use pairwise label noise, it's not very clear how well that matches reality either (see the toy sketch of both noise models after this list).

  • Data versioning: When one uses these label-correcting procedures in production settings, data versioning is probably crucial; otherwise the team will go crazy trying to figure out which exact version of the dataset each model was trained/validated on.
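To make the noise-model distinction from the "Unreal assumptions" point concrete, here is a toy illustration (my own, not taken from the paper) of the two noise models written as label-transition matrices for a hypothetical 3-class problem with 10% overall noise:

```python
# Toy illustration of class-independent vs. pairwise label noise.
# Row i of each matrix gives P(observed label = j | true label = i).
import numpy as np

n_classes, noise = 3, 0.10  # hypothetical 3-class problem, 10% noise

# Class-independent (uniform) noise: a flipped label lands on any other
# class with equal probability, regardless of the true class.
uniform_T = np.full((n_classes, n_classes), noise / (n_classes - 1))
np.fill_diagonal(uniform_T, 1 - noise)

# Pairwise noise: each class is only ever confused with one specific
# other class (here 0 -> 1, 1 -> 2, 2 -> 0).
pairwise_T = np.eye(n_classes) * (1 - noise)
for c in range(n_classes):
    pairwise_T[c, (c + 1) % n_classes] = noise

print(uniform_T)
print(pairwise_T)
```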


References

1: Albeit heavily regularized