Evaluation Metrics for Ranking problems: Introduction and Examples


WIP Alert: This is a work in progress. Current information is correct, but more content may be added in the future.

Ranking Problems

In many domains, data scientists are asked not just to predict which class (or classes) an example belongs to, but to rank classes according to how likely they are for a particular example.

| Classification | Ranking |
|----------------|---------|
| Order of predictions doesn't matter | Order of predictions does matter |

This is often the case because, in the real world, resources are limited: whoever uses the predictions your model makes has limited time and limited space, so they will likely prioritize the top results.

Some domains where this effect is particularly noticeable:

  • Search engines: Predict which documents match a query on a search engine.

  • Tag suggestion for Tweets: Predict which tags should be assigned to a tweet.

  • Image label prediction: Predict what labels should be suggested for an uploaded picture.

If your machine learning model produces a real-valued score for each of the possible classes, you can turn a classification problem into a ranking problem.

In other words, if you predict scores for a set of examples and you have a ground truth, you can order your predictions from highest to lowest and compare them with the ground truth:

  • Search engines: Do relevant documents appear at the top of the list or down at the bottom?

  • Tag suggestion for Tweets: Are the correct tags predicted with higher scores than the incorrect ones?

  • Image label prediction: Does your system correctly give more weight to correct labels?

In the following sections, we will go over many ways to evaluate ranked predictions with respect to actual values, or ground truth.
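To make the setup concrete before diving into specific metrics, here is a minimal Python sketch of turning real-valued scores into a ranking that can be compared against the ground truth. The document IDs, scores and labels below are made up for illustration; no particular library or dataset is assumed.

```python
# Hypothetical predicted scores and ground-truth relevance for three documents
scores = {"doc_a": 0.9, "doc_b": 0.2, "doc_c": 0.7}
relevant = {"doc_a": True, "doc_b": False, "doc_c": True}

# Turn the classification scores into a ranking: highest score first
ranking = sorted(scores, key=scores.get, reverse=True)

for rank, doc_id in enumerate(ranking, start=1):
    label = "Relevant" if relevant[doc_id] else "Not Relevant"
    print(rank, doc_id, scores[doc_id], label)
```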

Sample dataset - Ground Truth

We will use the following dummy dataset to illustrate examples in this post:

| ID | Actual Relevance | Text |
|----|------------------|------|
| 00 | Relevant | Lorem ipsum dolor sit amet, consectetur adipiscing elit. |
| 01 | Not Relevant | Nulla non semper lorem, id tincidunt nunc. |
| 02 | Relevant | Sed scelerisque volutpat eros nec tincidunt. |
| 03 | Not Relevant | Fusce vel varius erat, vitae elementum lacus. |
| 04 | Not Relevant | Donec vitae lobortis odio. |
| 05 | Relevant | Quisque congue suscipit augue, congue porta est pretium vel. |
| 06 | Relevant | Donec eget enim vel nisl feugiat tincidunt. |
| 07 | Not Relevant | Nunc vitae bibendum tortor. |

This is our sample dataset, with actual relevance values for each document. Documents 00, 02, 05 and 06 are relevant; 01, 03, 04 and 07 are not.

Precision @k

More information: Precision

Precision means: "of all examples I predicted to be TRUE, how many were actually TRUE?"

\(\text{Precision}@k\) ("Precision at \(k\)") is simply Precision evaluated only up to the \(k\)-th prediction, i.e.:

$$ \text{Precision}@k = \frac{\text{true positives} \ @ k}{(\text{true positives} \ @ k) + (\text{false positives} \ @ k)} $$

Predicted relevance scores:

| ID | Predicted Relevance | Actual Relevance |
|----|---------------------|------------------|
| 00 | 0.63 | Relevant |
| 01 | 0.24 | Not Relevant |
| 02 | 0.36 | Relevant |
| 03 | 0.85 | Not Relevant |
| 04 | 0.47 | Not Relevant |
| 05 | 0.71 | Relevant |
| 06 | 0.90 | Relevant |
| 07 | 0.16 | Not Relevant |

Suppose we trained a model and these are the predicted relevance scores for each document.

Predicted relevance scores (sorted):

| k | ID | Predicted Relevance | Actual Relevance | Precision @k |
|---|----|---------------------|------------------|--------------|
| 1 | 06 | 0.90 | Relevant | 1.0 |
| 2 | 03 | 0.85 | Not Relevant | 0.50 |
| 3 | 05 | 0.71 | Relevant | 0.66 |
| 4 | 00 | 0.63 | Relevant | 0.75 |
| 5 | 04 | 0.47 | Not Relevant | 0.60 |
| 6 | 02 | 0.36 | Relevant | 0.66 |
| 7 | 01 | 0.24 | Not Relevant | 0.57 |
| 8 | 07 | 0.16 | Not Relevant | 0.50 |

Same values as before, but ordered from highest to lowest.


Example: Precision @1

I.e. what Precision do I get if I only use the top 1 prediction?

$$ \text{Precision}@1 = \frac{\text{true positives} \ @ 1}{(\text{true positives} \ @ 1) + (\text{false positives} \ @ 1)} $$

$$ \hphantom{\text{Precision}@1} = \frac{\text{true positives considering} \ k=1}{(\text{true positives considering} \ k=1) + (\text{false positives considering} \ k=1)} $$

$$ = \frac{1}{1 + 0} $$

$$ =1 $$

Example: Precision @4

Similarly, \(Precision@4\) only takes into account predictions up to \(k=4\):

$$ \text{Precision}@4 = \frac{\text{true positives} \ @ 4}{\text{(true positives} \ @ 4) + (\text{false positives} \ @ 4)} $$

$$ \hphantom{\text{Precision}@4} = \frac{\text{true positives considering} \ k=4}{(\text{true positives considering} \ k=4) + (\text{false positives considering} \ k=4)} $$

$$ = \frac{3}{3 + 1} $$

$$ =0.75 $$

Example: Precision @8

Finally, \(Precision@8\) is just the precision, since 8 is the total number of predictions:

$$ \text{Precision}@8 = \frac{\text{true positives} \ @ 8}{(\text{true positives} \ @ 8) + (\text{false positives} \ @ 8)} $$

$$ \hphantom{\text{Precision}@8} = \frac{\text{true positives considering} \ k=8}{(\text{true positives considering} \ k=8) + (\text{false positives considering} \ k=8)} $$

$$ = \frac{4}{4 + 4} $$

$$ =0.5 $$
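The examples above can be reproduced with a short helper. This is a minimal sketch; the function name and the list-of-booleans input format are choices made for illustration, not part of any particular library.

```python
def precision_at_k(ranked_relevance, k):
    """Precision@k for a list of booleans ordered from highest to lowest score.

    ranked_relevance[i] is True if the document at rank i+1 is actually relevant.
    """
    top_k = ranked_relevance[:k]
    return sum(top_k) / k

# Ranked relevance for the sample dataset (documents 06, 03, 05, 00, 04, 02, 01, 07)
ranked = [True, False, True, True, False, True, False, False]

print(precision_at_k(ranked, 1))  # 1.0
print(precision_at_k(ranked, 4))  # 0.75
print(precision_at_k(ranked, 8))  # 0.5
```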

Recall @k

More information: Recall

Recall means: "of all examples that were actually TRUE, how many did I predict to be TRUE?"

\(\text{Recall}@k\) ("Recall at \(k\)") is simply Recall evaluated only up to the \(k\)-th prediction, i.e.:

$$ \text{Recall}@k = \frac{\text{true positives} \ @ k}{(\text{true positives} \ @ k) + (\text{false negatives} \ @ k)} $$

Predicted relevance scores (sorted), using the same predictions as before:

| k | ID | Predicted Relevance | Actual Relevance | Recall @k |
|---|----|---------------------|------------------|-----------|
| 1 | 06 | 0.90 | Relevant | 0.25 |
| 2 | 03 | 0.85 | Not Relevant | 0.25 |
| 3 | 05 | 0.71 | Relevant | 0.50 |
| 4 | 00 | 0.63 | Relevant | 0.75 |
| 5 | 04 | 0.47 | Not Relevant | 0.75 |
| 6 | 02 | 0.36 | Relevant | 1.0 |
| 7 | 01 | 0.24 | Not Relevant | 1.0 |
| 8 | 07 | 0.16 | Not Relevant | 1.0 |


Example: Recall @1

I.e. what Recall do I get if I only use the top 1 prediction?

$$ \text{Recall}@1 = \frac{\text{true positives} \ @ 1}{(\text{true positives} \ @ 1) + (\text{false negatives} \ @ 1)} $$

$$ \hphantom{\text{Recall}@1} = \frac{\text{true positives considering} \ k=1}{(\text{true positives considering} \ k=1) + (\text{false negatives considering} \ k=1)} $$

$$ = \frac{1}{1 + 3} $$

$$ =0.25 $$

Example: Recall @4

Similarly, \(Recall@4\) only takes into account predictions up to \(k=4\):

$$ \text{Recall}@4 = \frac{\text{true positives} \ @ 4}{(\text{true positives} \ @ 4) + (\text{false negatives} \ @ 4)} $$

$$ \hphantom{\text{Recall}@4} = \frac{\text{true positives considering} \ k=4}{(\text{true positives considering} \ k=4) + (\text{false negatives considering} \ k=4)} $$

$$ = \frac{3}{3 + 1} $$

$$ =0.75 $$

Example: Recall @8

$$ \text{Recall}@8 = \frac{\text{true positives} \ @ 8}{(\text{true positives} \ @ 8) + (\text{false negatives} \ @ 8)} $$

$$ \hphantom{\text{Recall}@8} = \frac{\text{true positives considering} \ k=8}{(\text{true positives considering} \ k=8) + (\text{false negatives considering} \ k=8)} $$

$$ = \frac{4}{4 + 0} $$

$$ =1 $$
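As with Precision, a minimal sketch (same made-up list-of-booleans format as before) reproduces these values:

```python
def recall_at_k(ranked_relevance, k):
    """Recall@k: fraction of all relevant documents that appear in the top k."""
    total_relevant = sum(ranked_relevance)
    return sum(ranked_relevance[:k]) / total_relevant

# Ranked relevance for the sample dataset (documents 06, 03, 05, 00, 04, 02, 01, 07)
ranked = [True, False, True, True, False, True, False, False]

print(recall_at_k(ranked, 1))  # 0.25
print(recall_at_k(ranked, 4))  # 0.75
print(recall_at_k(ranked, 8))  # 1.0
```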

F1 @k

More information: F1

\(F_1\)-score (alternatively, \(F_1\)-measure) is a combined metric that takes into account both Precision and Recall.

Similarly to \(\text{Precision}@k\) and \(\text{Recall}@k\), \(F_1@k\) is a rank-based metric that can be summarized as follows: "What \(F_1\)-score do I get if I only consider the top \(k\) predictions my model outputs?"

$$ F_1 @k = 2 \cdot \frac{(Precision @k) \cdot (Recall @k) }{(Precision @k) + (Recall @k)} $$

An alternative formulation for \(F_1 @k\) is as follows:

$$ F_1 @k = \frac{2 \cdot (\text{true positives} \ @k)}{2 \cdot (\text{true positives} \ @k ) + (\text{false negatives} \ @k) + (\text{false positives} \ @k) } $$

Predicted relevance scores (sorted), using the same predictions as before:

| k | ID | Predicted Relevance | Actual Relevance | F1 @k |
|---|----|---------------------|------------------|-------|
| 1 | 06 | 0.90 | Relevant | 0.40 |
| 2 | 03 | 0.85 | Not Relevant | 0.33 |
| 3 | 05 | 0.71 | Relevant | 0.57 |
| 4 | 00 | 0.63 | Relevant | 0.75 |
| 5 | 04 | 0.47 | Not Relevant | 0.66 |
| 6 | 02 | 0.36 | Relevant | 0.80 |
| 7 | 01 | 0.24 | Not Relevant | 0.73 |
| 8 | 07 | 0.16 | Not Relevant | 0.66 |


Example: F1 @1

$$ F_1 @1 = \frac{2 \cdot (\text{true positives} \ @1)}{2 \cdot (\text{true positives} \ @1 ) + (\text{false negatives} \ @1) + (\text{false positives} \ @1) } $$

$$ = \frac{2 \cdot (\text{true positives considering} \ k=1)}{2 \cdot (\text{true positives considering} \ k=1 ) + (\text{false negatives considering} \ k=1) + (\text{false positives considering} \ k=1) } $$

$$ = \frac{2 \cdot 1 }{ (2 \cdot 1) + 3 + 0 } $$

$$ = \frac{2}{5} $$

$$ = 0.4 $$

Which is the same result you get if you use the original formula:

$$ F_1 @1 = 2 \cdot \frac{(Precision @1) \cdot (Recall @1) }{(Precision @1) + (Recall @1)} $$

$$ = 2 \cdot \frac{1 \cdot 0.25}{1 + 0.25} $$

$$ = 2 \cdot 0.2 = 0.4 $$

Example: F1 @4

$$ F_1 @4 = \frac{2 \cdot (\text{true positives} \ @4)}{2 \cdot (\text{true positives} \ @4 ) + (\text{false negatives} \ @4) + (\text{false positives} \ @4) } $$

$$ = \frac{2 \cdot (\text{true positives considering} \ k=4)}{2 \cdot (\text{true positives considering} \ k=4 ) + (\text{false negatives considering} \ k=4) + (\text{false positives considering} \ k=4) } $$

$$ = \frac{2 \cdot 3 }{ (2 \cdot 3) + 1 + 1 } $$

$$ = \frac{6}{8} $$

$$ = 0.75 $$

Which is the same result you get if you use the original formula:

$$ F_1 @4 = 2 \cdot \frac{(Precision @4) \cdot (Recall @4) }{(Precision @4) + (Recall @4)} $$

$$ = 2 \cdot \frac{0.75 \cdot 0.75}{0.75 + 0.75} $$

$$ = 2 \cdot \frac{0.5625}{1.5} = 0.75 $$

Example: F1 @8

$$ F_1 @8 = \frac{2 \cdot (\text{true positives} \ @8)}{2 \cdot (\text{true positives} \ @8 ) + (\text{false negatives} \ @8) + (\text{false positives} \ @8) } $$

$$ = \frac{2 \cdot (\text{true positives considering} \ k=8)}{2 \cdot (\text{true positives considering} \ k=8 ) + (\text{false negatives considering} \ k=8) + (\text{false positives considering} \ k=8) } $$

$$ = \frac{2 \cdot 4 }{ (2 \cdot 4) + 0 + 4 } $$

$$ = \frac{8}{12} $$

$$ = 0.666 $$

Which is the same result you get if you use the original formula:

$$ F_1 @8 = 2 \cdot \frac{(Precision @8) \cdot (Recall @8) }{(Precision @8) + (Recall @8)} $$

$$ = 2 \cdot \frac{0.5 \cdot 1}{0.5 + 1} $$

$$ = 2 \cdot \frac{0.5}{1.5} $$

$$ = \frac{1}{1.5} = 0.666 $$
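Combining the two previous helpers gives a minimal sketch of \(F_1@k\) (again, the function name and input format are illustrative choices):

```python
def f1_at_k(ranked_relevance, k):
    """F1@k: harmonic mean of Precision@k and Recall@k."""
    tp = sum(ranked_relevance[:k])        # relevant documents within the top k
    precision = tp / k
    recall = tp / sum(ranked_relevance)   # denominator: all relevant documents
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Ranked relevance for the sample dataset (documents 06, 03, 05, 00, 04, 02, 01, 07)
ranked = [True, False, True, True, False, True, False, False]

print(round(f1_at_k(ranked, 1), 2))  # 0.4
print(round(f1_at_k(ranked, 4), 2))  # 0.75
print(round(f1_at_k(ranked, 8), 2))  # 0.67
```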

AP: Average Precision

AP (Average Precision) is another metric to compare a ranking with a set of relevant/non-relevant items.

One way to explain what AP represents is as follows:

AP is a metric that tells you how concentrated the relevant documents are among the highest-ranked predictions.

Formula

$$ AP = \sum_{k} (\text{Recall}@k - \text{Recall}@(k-1)) \cdot \text{Precision}@k $$

So for each threshold level \(k\), you take the difference between the Recall at the current level and the Recall at the previous threshold, multiply it by the Precision at that level, and then sum the contributions over all thresholds.

Algorithm to calculate AP

You can calculate the AP using the following algorithm:

Figure: algorithm for calculating the AP based on a ranked list of predictions.
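The original figure is not reproduced here, but the procedure it describes can be reconstructed from the step-by-step walkthrough below: keep a running sum of Precision@rank at every rank where a relevant document appears, then divide by the total number of relevant documents. A minimal Python sketch (function and variable names are ours):

```python
def average_precision(ranked_relevance):
    """AP for a list of booleans ordered from highest to lowest predicted score."""
    running_sum = 0.0
    correct_predictions = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:                                  # only relevant ranks contribute
            correct_predictions += 1
            running_sum += correct_predictions / rank    # this is Precision@rank
    if correct_predictions == 0:
        return 0.0
    # divide by the number of relevant documents
    return running_sum / correct_predictions

# Ranked relevance for the sample dataset (documents 06, 03, 05, 00, 04, 02, 01, 07)
ranked = [True, False, True, True, False, True, False, False]
print(round(average_precision(ranked), 2))  # 0.77
```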

Example: AP Calculated using the algorithm

Predicted relevance scores (sorted), using the same predictions as before:

| k | ID | Predicted Relevance | Actual Relevance |
|---|----|---------------------|------------------|
| 1 | 06 | 0.90 | Relevant |
| 2 | 03 | 0.85 | Not Relevant |
| 3 | 05 | 0.71 | Relevant |
| 4 | 00 | 0.63 | Relevant |
| 5 | 04 | 0.47 | Not Relevant |
| 6 | 02 | 0.36 | Relevant |
| 7 | 01 | 0.24 | Not Relevant |
| 8 | 07 | 0.16 | Not Relevant |

Following the algorithm described above, let's go about calculating the AP for our guiding example:

  • At rank 1:

    • \(\text{RunningSum} = 0 + \frac{1}{1} = 1, \text{CorrectPredictions} = 1\)
  • At rank 2:

    • No change. We don't update either the RunningSum or the CorrectPredictions count, since the document at rank 2 is not relevant.
    • In other words, we don't count when there's a wrong prediction.
  • At rank 3:

    • \(\text{RunningSum} = 1 + \frac{2}{3} = 1 + 0.67 = 1.67\)
    • \(\text{CorrectPredictions} = 2\)
  • At rank 4:

    • \(\text{RunningSum} = 1.67 + \frac{3}{4} = 1.67 + 0.75 = 2.42\)
    • \(\text{CorrectPredictions} = 3\)
  • At rank 5:

    • No change, wrong prediction.
  • At rank 6:

    • \(\text{RunningSum} = 2.42 + \frac{4}{6} = 2.42 + 0.67 = 3.08\)
    • \(\text{CorrectPredictions} = 4\)
  • At rank 7:

    • No change, wrong prediction.
  • At rank 8:

    • No change, wrong prediction.

And at the end we divide everything by the number of relevant documents, which is, in this case, equal to the final CorrectPredictions count:

\(AP = \dfrac{\text{RunningSum}}{\text{CorrectPredictions}} \)

\(AP = \dfrac{3.08}{4} \)

\(AP \approx 0.77\)
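If scikit-learn is available, this value can be cross-checked with `sklearn.metrics.average_precision_score`, which takes the binary ground-truth labels and the raw predicted scores in their original (unsorted) order. This is offered as a sanity-check sketch rather than as the exact computation used in this post:

```python
from sklearn.metrics import average_precision_score

# Documents 00..07 from the sample dataset, in their original order
y_true  = [1, 0, 1, 0, 0, 1, 1, 0]                             # 1 = Relevant
y_score = [0.63, 0.24, 0.36, 0.85, 0.47, 0.71, 0.90, 0.16]

print(round(average_precision_score(y_true, y_score), 2))      # expected ~= 0.77
```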

DCG: Discounted Cumulative Gain

TODO

NDCG: Normalized Discounted Cumulative Gain

TODO


What about MAP (Mean Average Precision)?

AP (Average Precision) is a metric that tells you how a single sorted prediction compares with the ground truth. E.g. AP would tell you how correct a single ranking of documents is, with respect to a single query.

But what if you need to know how your model's rankings perform when evaluated on a whole validation set? After all, a model that correctly ranks classes for some examples but not for others is of little use.

This is where MAP (Mean Average Precision) comes in. All you need to do is to sum the AP value for each example in a validation dataset and then divide by the number of examples. In other words, take the mean of the AP over all examples.

So, to summarize:

| AP (Average Precision) | MAP (Mean Average Precision) |
|------------------------|------------------------------|
| Informs you how correct a model's ranked predictions are for a single example. | Informs you how correct a model's ranked predictions are, on average, over a whole validation dataset. It is a simple average of AP over all examples in a validation set. |
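A minimal sketch of MAP, reusing an AP helper like the one sketched earlier (the extra examples in the validation set below are made up for illustration):

```python
def average_precision(ranked_relevance):
    """Same AP helper as sketched in the previous section."""
    running_sum, correct = 0.0, 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            correct += 1
            running_sum += correct / rank
    return running_sum / correct if correct else 0.0

def mean_average_precision(per_example_rankings):
    """MAP: the simple mean of AP over all examples in a validation set."""
    return sum(average_precision(r) for r in per_example_rankings) / len(per_example_rankings)

# Hypothetical validation set: one sorted relevance list per example/query
validation_set = [
    [True, False, True, True, False, True, False, False],  # our guiding example, AP ~= 0.77
    [True, True, False, False],                             # AP = 1.0
    [False, True, False],                                   # AP = 0.5
]
print(round(mean_average_precision(validation_set), 2))      # ~= 0.76
```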

What about AP @k (Average Precision at k)?

Although AP (Average Precision) is not usually presented like this, nothing stops us from calculating AP at each threshold value.

So for all practical purposes, we could calculate \(AP \ @k\) as follows:

Figure: by slightly modifying the algorithm used to calculate the Average Precision, you can derive AP @k, for varying \(k\).
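The original figure is not shown here; the sketch below assumes the modification suggested by the AP @k column in the table that follows, namely truncating the running sum at rank \(k\) and dividing by the number of relevant documents found within the top \(k\):

```python
def average_precision_at_k(ranked_relevance, k):
    """AP@k: the AP algorithm truncated at rank k, dividing by the
    number of relevant documents found within the top k."""
    running_sum = 0.0
    correct_predictions = 0
    for rank, is_relevant in enumerate(ranked_relevance[:k], start=1):
        if is_relevant:
            correct_predictions += 1
            running_sum += correct_predictions / rank
    if correct_predictions == 0:
        return 0.0
    return running_sum / correct_predictions

# Ranked relevance for the sample dataset (documents 06, 03, 05, 00, 04, 02, 01, 07)
ranked = [True, False, True, True, False, True, False, False]
for k in range(1, 9):
    print(k, round(average_precision_at_k(ranked, k), 2))
# prints 1.0, 1.0, 0.83, 0.81, 0.81, 0.77, 0.77, 0.77 for k = 1..8
```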

So, using our old guiding example:

Predicted relevance scores (sorted), using the same predictions as before:

| k | ID | Predicted Relevance | Actual Relevance | AP @k |
|---|----|---------------------|------------------|-------|
| 1 | 06 | 0.90 | Relevant | 1.0 |
| 2 | 03 | 0.85 | Not Relevant | 1.0 |
| 3 | 05 | 0.71 | Relevant | 0.83 |
| 4 | 00 | 0.63 | Relevant | 0.81 |
| 5 | 04 | 0.47 | Not Relevant | 0.81 |
| 6 | 02 | 0.36 | Relevant | 0.77 |
| 7 | 01 | 0.24 | Not Relevant | 0.77 |
| 8 | 07 | 0.16 | Not Relevant | 0.77 |

