Highlights of Podcast episode on AUC and Calibrated Models

This is an overview of an earlier (2017) podcast episode by Linear Digressions.

Scores aren't always Probabilities

The score you get from a binary classifier (that outputs a number between 0 and 1) is not necessarily a well-calibrated probability.

This is not always a problem: for many tasks it suffices to have scores that correctly rank the samples, even if they don't actually correspond to probabilities.

Example: a logistic regression binary classifier naturally produces scores that are (approximately) well-calibrated probabilities: if it reports a score of 0.9 for an instance, that instance should turn out to be True about 90% of the time and False about 10% of the time.
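A minimal way to see this for yourself is to print a calibration curve: bucket the test instances by score and compare the mean score in each bucket with the observed positive rate. The sketch below assumes scikit-learn and a synthetic dataset; the min-max scaled SVM decision function is just one example of a score that ranks well but isn't a probability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression: predict_proba tends to be reasonably well calibrated.
lr_scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# An SVM decision function squashed into [0, 1]: a perfectly usable ranking
# score, but not a probability.
svm = SVC().fit(X_train, y_train)
raw = svm.decision_function(X_test)
svm_scores = (raw - raw.min()) / (raw.max() - raw.min())

for name, scores in [("logistic regression", lr_scores), ("scaled SVM", svm_scores)]:
    frac_positive, mean_score = calibration_curve(y_test, scores, n_bins=10)
    print(name)
    for m, f in zip(mean_score, frac_positive):
        print(f"  mean score {m:.2f} -> observed positive rate {f:.2f}")
```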

Calibration vs Discrimination

Calibration: how well the model's output scores match the actual probability of the event. It can be measured by the Hosmer-Lemeshow statistic.
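The Hosmer-Lemeshow statistic isn't built into scikit-learn; below is a minimal NumPy sketch of the common decile-based version (the function name and the equal-sized grouping are my choices, not something from the episode). It compares, group by group, how many positives the scores "promised" with how many were actually observed.

```python
import numpy as np

def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Decile-style Hosmer-Lemeshow statistic; larger values mean worse calibration."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    order = np.argsort(y_prob)                 # sort instances by predicted probability
    groups = np.array_split(order, n_groups)   # roughly equal-sized risk groups
    stat = 0.0
    for g in groups:
        n_g = len(g)
        observed = y_true[g].sum()             # positives actually seen in the group
        expected = y_prob[g].sum()             # positives the scores "promised"
        pi_g = np.clip(expected / n_g, 1e-8, 1 - 1e-8)  # mean predicted probability
        stat += (observed - expected) ** 2 / (n_g * pi_g * (1 - pi_g))
    # Under good calibration this is roughly chi-squared with n_groups - 2 degrees of freedom.
    return stat
```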

Discrimination: how well the model ranks examples, i.e. for every two examples A and B where A is True and B is False, how likely it is that A gets a higher score than B. It can be measured by the AUC.
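AUC has a direct pairwise reading: it is the probability that a randomly chosen True example gets a higher score than a randomly chosen False one (ties counted as one half). A small sketch, assuming NumPy and scikit-learn, with made-up labels and scores, that checks this against roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)            # made-up labels
scores = rng.random(200) + 0.3 * y          # scores loosely correlated with the label

pos, neg = scores[y == 1], scores[y == 0]
# Fraction of (True, False) pairs where the True example gets the higher score,
# counting ties as one half.
pairwise = ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

print(pairwise, roc_auc_score(y, scores))   # the two numbers match
```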

In order to understand how they differ, imagine the following:

You have a model that gives a score of 0.52 to every True instance and 0.51 to every False instance. It will have perfect discrimination (AUC = 1.0) but very poor calibration (the scores bear almost no relation to the actual event probabilities).
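This toy model is easy to reproduce. The sketch below uses scikit-learn's roc_auc_score, and brier_score_loss as a stand-in calibration measure (the episode mentions Hosmer-Lemeshow; the Brier score is just a convenient one-liner here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = np.array([1] * 50 + [0] * 50)
scores = np.array([0.52] * 50 + [0.51] * 50)   # 0.52 for every True, 0.51 for every False

print(roc_auc_score(y_true, scores))     # 1.0  -> perfect discrimination
print(brier_score_loss(y_true, scores))  # ~0.245, barely better than always predicting 0.5 (0.25)

# For comparison, scores that are close to the true probabilities:
print(brier_score_loss(y_true, np.array([0.95] * 50 + [0.05] * 50)))  # 0.0025
```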

Calibration is important in problems where you need to make decisions

In many cases, the output scores of these models are used to drive actions and help people make decisions.

The natural way to do this is to use thresholds, i.e. define a cutoff value for the scores and act on the instances that cross that threshold (see the sketch after the examples below).

Examples:

  • If your model outputs credit default probabilities, it may be the company's policy to contact every customer whose default risk is over 0.6.

  • If your model outputs the risk of heart failure in the next 3 months, doctors (or medical guidelines) may need to act on patients whose risk is above 0.5, to prevent the event from taking place.
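A minimal sketch of the thresholding step for the (hypothetical) credit-default policy above, assuming the model's probabilities are already calibrated; the customer IDs and risk values are made up:

```python
import numpy as np

# Made-up customer IDs and calibrated default probabilities.
customer_ids = np.array([101, 102, 103, 104, 105, 106])
default_risk = np.array([0.12, 0.73, 0.58, 0.91, 0.40, 0.66])

THRESHOLD = 0.6   # the (hypothetical) policy: contact everyone whose risk is over 0.6

to_contact = customer_ids[default_risk > THRESHOLD]
print(to_contact)   # [102 104 106]
```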

If your model's calibration isn't up to scratch, you'll mislead anyone taking action based on its outputs.

AUC is not the metric you want to look at if

  • You want scores you can interpret as probabilities

    AUC may be higher for models that don't output calibrated probabilities.

    In other words, if you want to measure the risk of something happening (heart disease, credit default, etc.), AUC is not the metric for you.

  • Feature selection

    If you want to select features by looking at the AUC of models trained with them, you may be misled.

    This is because adding a feature may barely change the model's discrimination (and therefore its AUC) even though it improves the accuracy of the output probabilities.

  • You want to create stratified groups depending on output scores

    If your model outputs credit default risk scores, one thing you may be asked to do is to group clients into ratings. For example, you might assign credit rating "A" to clients whose default risk is below 10%, "B" to clients whose risk is between 10% and 20%, and so on, down to "H".

    In other words, if you need the actual values of your scores to be right, and not just their relative order, AUC isn't a good metric to help you with that (because it measures discrimination, not calibration).
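Once the scores really are calibrated probabilities, turning them into risk-band ratings is straightforward; the band edges and risk values in this sketch are made up for illustration:

```python
import numpy as np

# Made-up calibrated default probabilities for a handful of clients.
default_risk = np.array([0.03, 0.12, 0.35, 0.08, 0.55, 0.21, 0.72, 0.93])

# Risk bands 10% wide up to 70%; anything above that lands in the last band.
band_edges = np.linspace(0.1, 0.7, 7)      # 0.1, 0.2, ..., 0.7
ratings = np.array(list("ABCDEFGH"))       # one rating per band

assigned = ratings[np.digitize(default_risk, band_edges)]
print(dict(zip(default_risk.tolist(), assigned.tolist())))
# {0.03: 'A', 0.12: 'B', 0.35: 'D', 0.08: 'A', 0.55: 'F', 0.21: 'C', 0.72: 'H', 0.93: 'H'}
```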

