Paper Summary: The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets

Last updated:

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

The authors analyzed how several types of threshold-free performance plots (for binary classification) behave when dealing with imbalanced datasets.

They also did a short analysis of several medical papers to see which measures those articles generally used.

WHY

Imbalanced problems are the norm in many areas of science (medicine, fraud, credit, etc.), so knowing how best to convey the results of models is very important.

HOW

  • 1) The authors generated two sets of sample data, one with balanced targets and the other with imbalanced targets.

  • 2) They then simulated the performance of 3 dummy classifiers (bad, good and excellent) on those datasets, in addition to 2 baseline cases (random behaviour and a perfect classifier).

  • 3) They plotted the results for both datasets and all 5 levels of performance using 4 chart types:

    • ROC (Receiver Operating Characteristic curve)
    • CROC (Concentrated ROC, a variation of ROC that focuses on the top-ranked examples only)
    • CC (Cost Curve, a plot that takes misclassification costs into consideration)
    • PR-Curve (Precision-Recall curve)
  • 4) They analysed the information conveyed by each plot under each scenario, as well as the numerical area under each curve.
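The steps above can be sketched in a few lines of plain Python. This is only a rough illustration, not the authors' actual setup: the Gaussian score distributions, the `separation` values standing in for classifier quality, the dataset sizes, and the helper names (`simulate_scores`, `roc_auc`, `average_precision`) are all my own assumptions. Average precision is used here as a simple stand-in for the area under the PR-curve.

```python
import random

def simulate_scores(n_pos, n_neg, separation, seed=0):
    """Hypothetical classifier: positives ~ N(separation, 1), negatives ~ N(0, 1)."""
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * n_neg
    scores = [rng.gauss(separation, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    return labels, scores

def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives = 0
    total = 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / k
    return total / true_positives

# Assumed sizes: same number of positives, 10x more negatives when imbalanced.
datasets = {"balanced": (1000, 1000), "imbalanced": (1000, 10000)}
# Assumed quality levels: larger separation means a better classifier.
levels = {"random": 0.0, "good": 1.0, "excellent": 2.0}

results = {}
for data_name, (n_pos, n_neg) in datasets.items():
    for level, separation in levels.items():
        labels, scores = simulate_scores(n_pos, n_neg, separation)
        results[(data_name, level)] = (
            roc_auc(labels, scores),
            average_precision(labels, scores),
        )

for key, (auc, ap) in results.items():
    print(key, f"ROC-AUC={auc:.3f}", f"AP={ap:.3f}")
```

Under these assumptions, ROC-AUC should come out roughly the same for the balanced and imbalanced versions of each classifier, while average precision should drop noticeably on the imbalanced data.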

CLAIMS

  • Precision (Y-axis in the PR-curve) is an intuitive measure that is easy for nontechnical stakeholders to understand, and it is useful in imbalanced problems (unlike accuracy, it doesn't hide information).

  • The random baseline for PR-curves moves with the class distribution (it is a horizontal line at the proportion of positives), so in the imbalanced case it gives a better idea of what random performance would look like. [1]

  • With PR-curves, it is easy to see how each classifier performs in so-called early retrieval [2] versus the full case. It is not possible to see this graphically in the other chart types.

  • Neither the ROC-curve, the Concentrated ROC nor the Cost Curve (nor the area under each) is able to differentiate between imbalanced and balanced targets.

    • The PR-Curve (and therefore the area under the curve) is able to show this clearly.
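A small worked example may help show why the ROC axes cannot see the imbalance while precision can. The confusion-matrix counts below are made up for illustration (a hypothetical classifier that catches 80% of positives and wrongly flags 10% of negatives), and `rates` and `baseline_precision` are hypothetical helper names.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # recall / sensitivity, Y-axis of the ROC plot
    fpr = fp / (fp + tn)        # 1 - specificity, X-axis of the ROC plot
    precision = tp / (tp + fp)  # Y-axis of the PR plot
    return tpr, fpr, precision

def baseline_precision(n_pos, n_neg):
    """Precision of a random classifier: simply the proportion of positives."""
    return n_pos / (n_pos + n_neg)

# Same classifier behaviour, two class distributions (illustrative counts).
balanced = rates(tp=800, fn=200, fp=100, tn=900)      # 1000 pos, 1000 neg
imbalanced = rates(tp=800, fn=200, fp=1000, tn=9000)  # 1000 pos, 10000 neg

print("balanced   TPR, FPR, precision:", balanced)
print("imbalanced TPR, FPR, precision:", imbalanced)
print("random baseline, balanced:  ", baseline_precision(1000, 1000))
print("random baseline, imbalanced:", baseline_precision(1000, 10000))
```

TPR and FPR are 0.8 and 0.1 in both cases, so the ROC point does not move. Precision falls from about 0.89 to about 0.44, and the random baseline falls from 0.5 to about 0.09.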

QUOTES

  • Comparing single-threshold vs threshold-free measures:

All of these measures are single-threshold measures, that is, they are defined for individual score thresholds (cutoffs) of a classifier and cannot give an overview of the range of performance with varying thresholds.

While any such threshold, which divides a dataset into positively and negatively predicted classes, can be reasonable in a particular application, it is not obvious how the right threshold value should be chosen.

A powerful solution is to use threshold-free measures such as the ROC and PRC plots.

NOTES

Threshold-free vs single-threshold measures

  • Threshold-free measures such as AUC-ROC and AUC-PR measure the model performance across all thresholds

  • Single-threshold measures such as precision, recall, etc, measure the model performance at a given threshold level
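To make the distinction concrete, here is a minimal sketch with toy labels and scores that I invented for the purpose. `precision_recall_at` is a hypothetical helper: calling it once gives a single-threshold measure, while calling it for every distinct score gives the points of a PR-curve, which is what threshold-free summaries are built from.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when everything scoring >= threshold is flagged."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    # Convention assumed here: precision is 1.0 when nothing is flagged.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Toy data (illustrative only): 4 positives, 6 negatives.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

# Single-threshold view: one (precision, recall) pair at one cutoff.
single = precision_recall_at(labels, scores, 0.65)

# Threshold-free view: one pair per distinct score, i.e. the PR-curve points.
curve = [
    precision_recall_at(labels, scores, t)
    for t in sorted(set(scores), reverse=True)
]

print("at threshold 0.65:", single)
for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```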

ROC vs PR-curve axes

Good for reference:

          ROC                                                 PR-Curve
Y-axis    Sensitivity (i.e. True Positive Rate or TPR)        Precision
X-axis    1 - Specificity (i.e. False Positive Rate or FPR)   Recall (i.e. True Positive Rate or TPR)

MY 2¢

  • I would say that (nearly?) all models that output real-valued scores in binary classification will be used with some sort of policy threshold, which indicates where some action will be taken.

    • Under this assumption, it is downright dangerous to use ROC curves instead of PR-curves, because you are only interested in what happens at the threshold you will be working with!
    • If you use ROC-AUC you can be misled into using a classifier that is better overall but worse at the threshold you are interested in!
  • It's a very well-written article; I would recommend it for novice data scientists to understand the trade-offs and human aspects of the various metrics analyzed, as applied to real-world problems.

    • All measures are built from first principles (TP, TN, FP and FN), with good tables.
    • Also there are graphical interpretations for most explanations and conclusions.
  • A good article to refer people to when they question why you're using AUC-PR rather than AUC-ROC as the measure for a model in an imbalanced-data regime.
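The worry above, that ROC-AUC can favour a classifier that is worse at the threshold you actually care about, can be shown with a toy example. Both rankings below are invented, and `roc_auc_from_ranking` and `precision_at_k` are hypothetical helpers. I also assume the policy is "act on the top 2 scored items", which is just one possible reading of a policy threshold.

```python
def roc_auc_from_ranking(ranking):
    """ROC-AUC for labels ordered from highest to lowest score (1 = positive).

    Counts, for each positive, how many negatives are ranked below it.
    """
    n_pos = sum(ranking)
    n_neg = len(ranking) - n_pos
    negatives_below = n_neg
    correctly_ordered_pairs = 0
    for label in ranking:
        if label == 1:
            correctly_ordered_pairs += negatives_below
        else:
            negatives_below -= 1
    return correctly_ordered_pairs / (n_pos * n_neg)

def precision_at_k(ranking, k):
    """Precision when only the k highest-scored items are flagged."""
    return sum(ranking[:k]) / k

# 4 positives and 16 negatives in both cases (illustrative rankings).
# Classifier A: two negatives at the very top, then all four positives.
classifier_a = [0, 0, 1, 1, 1, 1] + [0] * 14
# Classifier B: two positives at the very top, the other two much further down.
classifier_b = [1, 1] + [0] * 8 + [1, 1] + [0] * 8

for name, ranking in [("A", classifier_a), ("B", classifier_b)]:
    print(
        name,
        f"ROC-AUC={roc_auc_from_ranking(ranking):.3f}",
        f"precision@2={precision_at_k(ranking, 2):.2f}",
    )
```

Classifier A wins on ROC-AUC (0.875 vs 0.75), yet every item it flags at the top-2 cutoff is wrong, while everything B flags is correct.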


1: In the usual ROC plot, the random baseline case (a diagonal line) stays the same no matter the target distribution.

2: Early retrieval refers to the top-ranked results retrieved by the model, i.e. the region of very low recall.

