Evaluation Metrics for Ranking problems: Introduction and Examples

Last updated:
Evaluation Metrics for Ranking problems: Introduction and Examples
Source
Table of Contents

Ranking Problems

In many domains, data scientists are asked to not just predict what class/classes an example belongs to, but to rank classes according to how likely they are for a particular example.

Classification Ranking
Order of predictions doesn't matter Order of predictions does matter

This is often the case because, in the real world, resources are limited.

This means that whoever will use the predictions your model makes has limited time, limited space. So they will likely prioritize.

Some domains where this effect is particularly noticeable:

  • Search engines: Predict which documents match a query on a search engine.

  • Tag suggestion for Tweets: Predict which tags should be assigned to a tweet.

  • Image label prediction: Predict what labels should be suggested for an uploaded picture.

If your machine learning model produces a real-value for each of the possible classes, you can turn a classification problem into a ranking problem.

In other words, if you predict scores for a set of examples and you have a ground truth, you can order your predictions from highest to lowest and compare them with the ground truth:

  • Search engines: Do relevant documents appear up on the list or down at the bottom?

  • Tag suggestion for Tweets: Are the correct tags predicted with higher score or not?

  • Image label prediction: Does your system correctly give more weight to correct labels?

In the following sections, we will go over many ways to evaluate ranked predictions with respect to actual values, or ground truth.

Sample dataset - Ground Truth

We will use the following dummy dataset to illustrate examples in this post:

ID Actual
Relevance
Text
00Relevant (1.0)Lorem ipsum dolor sit amet, consectetur adipiscing elit.
01Not Relevant (0.0)Nulla non semper lorem, id tincidunt nunc.
02Relevant (1.0)Sed scelerisque volutpat eros nec tincidunt.
03Not Relevant (0.0)Fusce vel varius erat, vitae elementum lacus.
04Not Relevant (0.0)Donec vitae lobortis odio.
05Relevant (1.0)Quisque congue suscipit augue, congue porta est pretium vel.
06Relevant (1.0)Donec eget enim vel nisl feugiat tincidunt.
07Not Relevant (0.0)Nunc vitae bibendum tortor.
This is our sample dataset, with actual values for each document.
Documents 0,2,5 and 6 are relevant; 1,3,4 and 7 are not

Precision @k

More information: Precision

Precision means: "of all examples I predicted to be TRUE, how many were actually TRUE?"

\(Precision\) \(@k\) ("Precision at \(k\)") is simply Precision evaluated only up to the \(k\)-th prediction, i.e.:

$$ \text{Precision}@k = \frac{true \ positives \ @ k}{(true \ positives \ @ k) + (false \ positives \ @ k)} $$

Predicted relevance
scores (Sorted)
k Document
ID
Predicted
Relevance
Actual
Relevance
Precision @k
1060.90Relevant (1.0)1.0
2030.85Not Relevant (0.0)0.50
3050.71Relevant (1.0)0.66
4000.63Relevant (1.0)0.75
5040.47Not Relevant (0.0)0.60
6020.36Relevant (1.0)0.66
7010.24Not Relevant (0.0)0.57
8070.16Not Relevant (0.0)0.50
Same values as before, but ordered
from highest to lowest


Example: Precision @1

I.e. what Precision do I get if I only use the top 1 prediction?

$$ \text{Precision}@1 = \frac{\text{true positives} \ @ 1}{(\text{true positives} \ @ 1) + (\text{false positives} \ @ 1)} $$

$$ \hphantom{\text{Precision}@1} = \frac{\text{true positives considering} \ k=1}{(\text{true positives considering} \ k=1) + \\ (\text{false positives considering} \ k=1)} $$

$$ = \frac{1}{1 + 0} $$

$$ =1 $$

Example: Precision @4

Similarly, \(Precision@4\) only takes into account predictions up to \(k=4\):

$$ \text{Precision}@4 = \frac{\text{true positives} \ @ 4}{\text{(true positives} \ @ 4) + (\text{false positives} \ @ 4)} $$

$$ \hphantom{\text{Precision}@4} = \frac{\text{true positives considering} \ k=4}{(\text{true positives considering} \ k=4) + \\ (\text{false positives considering} \ k=4)} $$

$$ = \frac{3}{3 + 1} $$

$$ =0.75 $$

Example: Precision @8

Finally, \(Precision@8\) is just the precision, since 8 is the total number of predictions:

$$ \text{Precision}@8 = \frac{\text{true positives} \ @ 8}{(\text{true positives} \ @ 8) + (\text{false positives} \ @ 8)} $$

$$ \hphantom{\text{Precision}@8} = \frac{\text{true positives considering} \ k=8}{(\text{true positives considering} \ k=8) + \\ (\text{false positives considering} \ k=8)} $$

$$ = \frac{4}{4 + 4} $$

$$ =0.5 $$

Recall @k

More information: Recall

Recall means: "of all examples that were actually TRUE, how many I predicted to be TRUE?"

\(Recall\) \(@k\) ("Recall at \(k\)") is simply Recall evaluated only up to the \(k\)-th prediction, i.e.:

$$ \text{Recall}@k = \frac{true \ positives \ @ k}{(true \ positives \ @ k) + (false \ negatives \ @ k)} $$

Predicted relevance
scores (Sorted)
k Document
ID
Predicted
Relevance
Actual
Relevance
Recall @k
1060.90Relevant (1.0)0.25
2030.85Not Relevant (0.0)0.25
3050.71Relevant (1.0)0.50
4000.63Relevant (1.0)0.75
5040.47Not Relevant (0.0)0.75
6020.36Relevant (1.0)1.0
7010.24Not Relevant (0.0)1.0
8070.16Not Relevant (0.0)1.0
Same values as before, but
ordered from highest to lowest


Example: Recall @1

I.e. what Recall do I get if I only use the top 1 prediction?

$$ \text{Recall}@1 = \frac{\text{true positives} \ @ 1}{(\text{true positives} \ @ 1) + (\text{false negatives} \ @ 1)} $$

$$ \hphantom{\text{Recall}@1} = \frac{\text{true positives considering} \ k=1}{(\text{true positives considering} \ k=1) + \\ (\text{false negatives considering} \ k=1)} $$

$$ = \frac{1}{1 + 3} $$

$$ =0.25 $$

Example: Recall @4

Similarly, \(Recall@4\) only takes into account predictions up to \(k=4\):

$$ \text{Recall}@4 = \frac{true \ positives \ @ 4}{(true \ positives \ @ 4) + (false \ negatives \ @ 4)} $$

$$ \hphantom{\text{Recall}@4} = \frac{\text{true positives considering} \ k=4}{(\text{true positives considering} \ k=4) + \\ (\text{false negatives considering} \ k=4)} $$

$$ = \frac{3}{3 + 1} $$

$$ =0.75 $$

Example: Recall @8

$$ \text{Recall}@8 = \frac{true \ positives \ @ 8}{(true \ positives \ @ 8) + (false \ negatives \ @ 8)} $$

$$ \hphantom{\text{Recall}@8} = \frac{\text{true positives considering} \ k=8}{(\text{true positives considering} \ k=8) + \\ (\text{false negatives considering} \ k=8)} $$

$$ = \frac{4}{4 + 0} $$

$$ =1 $$

F1 @k

More information: F1

\(F_1\)-score (alternatively, \(F_1\)-Measure), is a mixed metric that takes into account both Precision and Recall.

Similarly to \(\text{Precision}@k\) and \(\text{Recall}@k\), \(F_1@k\) is a rank-based metric that can be summarized as follows: "What \(F_1\)-score do I get if I only consider the top \(k\) predictions my model outputs?"

$$ F_1 @k = 2 \cdot \frac{(Precision @k) \cdot (Recall @k) }{(Precision @k) + (Recall @k)} $$

An alternative formulation for \(F_1 @k\) is as follows:

$$ F_1 @k = \frac{2 \cdot (\text{true positives} \ @k)}{2 \cdot (\text{true positives} \ @k ) + (\text{false negatives} \ @k) + (\text{false positives} \ @k) } $$

Predicted relevance
scores (Sorted)
k Document
ID
Predicted
Relevance
Actual
Relevance
F1 @k
1060.90Relevant (1.0)0.40
2030.85Not Relevant (0.0)0.33
3050.71Relevant (1.0)0.62
4000.63Relevant (1.0)0.75
5040.47Not Relevant (0.0)0.66
6020.36Relevant (1.0)0.80
7010.24Not Relevant (0.0)0.73
8070.16Not Relevant (0.0)0.66
Same values as before, but
ordered from highest to lowest


Example: F1 @1

$$ F_1 @1 = \frac{2 \cdot (\text{true positives} \ @1)}{2 \cdot (\text{true positives} \ @1 ) + (\text{false negatives} \ @1) + (\text{false positives} \ @1) } $$

$$ = \frac{2 \cdot (\text{true positives considering} \ k=1)}{2 \cdot (\text{true positives considering} \ k=1 ) + \\ \, \, \, \, \, \, (\text{false negatives considering} \ k=1) + \\ \, \, \, \, \, \, (\text{false positives considering} \ k=1) } $$

$$ = \frac{2 \cdot 1 }{ (2 \cdot 1) + 3 + 0 } $$

$$ = \frac{2}{5} $$

$$ = 0.4 $$

Which is the same result you get if you use the original formula:

$$ F_1 @1 = 2 \cdot \frac{(Precision @1) \cdot (Recall @1) }{(Precision @1) + (Recall @1)} $$

$$ = 2 \cdot \frac{1 \cdot 0.25}{1 + 0.25} $$

$$ = 2 \cdot 0.2 = 0.4 $$

Example: F1 @4

$$ F_1 @4 = \frac{2 \cdot (\text{true positives} \ @4)}{2 \cdot (\text{true positives} \ @4 ) + (\text{false negatives} \ @4) + (\text{false positives} \ @4) } $$

$$ = \frac{2 \cdot (\text{true positives considering} \ k=4)}{2 \cdot (\text{true positives considering} \ k=4 ) + \\ \, \, \, \, \, \, (\text{false negatives considering} \ k=4) + \\ \, \, \, \, \, \, (\text{false positives considering} \ k=4) } $$

$$ = \frac{2 \cdot 3 }{ (2 \cdot 3) + 1 + 1 } $$

$$ = \frac{6}{8} $$

$$ = 0.75 $$

Which is the same result you get if you use the original formula:

$$ F_1 @4 = 2 \cdot \frac{(Precision @4) \cdot (Recall @4) }{(Precision @4) + (Recall @4)} $$

$$ = 2 \cdot \frac{0.75 \cdot 0.75}{0.75 + 0.75} $$

$$ = 2 \cdot \frac{0.5625}{1.5} = 0.75 $$

Example: F1 @8

$$ \begin{align} A & = B \\ & = C \end{align} $$

$$ F_1 @8 = \frac{2 \cdot (\text{true positives} \ @8)}{2 \cdot (\text{true positives} \ @8 ) + (\text{false negatives} \ @8) + (\text{false positives} \ @8) } $$

$$ = \frac{2 \cdot (\text{true positives considering} \ k=8)}{2 \cdot (\text{true positives considering} \ k=8 ) + \\ \, \, \, \, \, \, (\text{false negatives considering} \ k=8) + \\ \, \, \, \, \, \, (\text{false positives considering} \ k=8) } $$

$$ = \frac{2 \cdot 4 }{ (2 \cdot 4) + 0 + 4 } $$

$$ = \frac{8}{12} $$

$$ = 0.666 $$

Which is the same result you get if you use the original formula:

$$ F_1 @8 = 2 \cdot \frac{(Precision @8) \cdot (Recall @8) }{(Precision @8) + (Recall @8)} $$

$$ = 2 \cdot \frac{0.5 \cdot 1}{0.5 + 1} $$

$$ = 2 \cdot \frac{0.5}{1.5} $$

$$ = \frac{1}{1.5} = 0.666 $$

AP: Average Precision

AP (Average Precision) is another metric to compare a ranking with a set of relevant/non-relevant items.

One way to explain what AP represents is as follows:

AP is a metric that tells you how much of the relevant documents are concentrated in the highest ranked predictions.

Formula

$$ AP = \sum_{K} (Recall @k - Recall @k\text{-}1) \cdot Precision @k $$

So for each threshold level (\(k\)) you take the difference between the Recall at the current level and the Recall at the previous threshold and multiply by the Precision at that level. Then sum the contributions of each.

Algorithm to calculate AP

You can calculate the AP using the following algorithm:

algorithm-for-calculating-average-precision Algorithm for calculating the AP based on a
ranked list of predictions.

Example: AP Calculated using the algorithm

Predicted relevance
scores (Sorted)
k Document
ID
Predicted
Relevance
Actual
Relevance
1060.90Relevant (1.0)
2030.85Not Relevant (0.0)
3050.71Relevant (1.0)
4000.63Relevant (1.0)
5040.47Not Relevant (0.0)
6020.36Relevant (1.0)
7010.24Not Relevant (0.0)
8070.16Not Relevant (0.0)
Same values as before, but
ordered from highest to lowest


Following the algorithm described above, let's go about calculating the AP for our guiding example:

  • At rank 1:

    • \(\text{RunningSum} = 0 + \frac{1}{1} = 1, \text{CorrectPredictions} = 1\)
  • At rank 2:

    • No change. We don't update either the RunningSum or the CorrectPredictions count, since the document at rank 2 is not relevant.
    • In other words, we don't count when there's a wrong prediction.
  • At rank 3:

    • \(\text{RunningSum} = 1 + \frac{2}{3} = 1 + 0.8 = 1.8\)
    • \(\text{CorrectPredictions} = 2\)
  • At rank 4:

    • \(\text{RunningSum} = 1.8 + \frac{3}{4} = 1.8 + 0.75 = 2.55\)
    • \(\text{CorrectPredictions} = 3\)
  • At rank 5:

    • No change, wrong prediction.
  • At rank 6:

    • \(\text{RunningSum} = 2.55 + \frac{4}{6} = 2.55 + 0.66 = 3.22\)
    • \(\text{CorrectPredictions} = 4\)
  • At rank 7:

    • No change, wrong prediction.
  • At rank 8:

    • No change, wrong prediction.

And at the end we divide everything by the number of Relevant Documents which is, in this case, equal to the number of correct predictions:

\(AP = \dfrac{\text{RunningSum}}{\text{CorrectPredictions}} \)

\(AP = \dfrac{3.22}{4} \)

\(AP = 0.80\)

DCG: Discounted Cumulative Gain

One advantage of DCG over other metrics is that it also works if document relevances are a real number. In other words, when each document is not simply relevant/non-relevant (as in the example), but has a relevance score instead.

$$ DCG \ @k = \sum\limits_{i=1}^{k} \frac{2^{rel_i} - 1}{log_2(i+1)} $$

where \(rel_i\) is the relevance of the document at index \(i\). Since we're dealing with binary relevances, \(rel_i\) equals 1 if document \(i\) is relevant and 0 otherwise.

Predicted relevance
scores (Sorted)
k Document
ID
Predicted
Relevance
Actual
Relevance
DCG @k
1060.90Relevant (1.0)1.0
2030.85Not Relevant (0.0)1.0
3050.71Relevant (1.0)1.5
4000.63Relevant (1.0)1.93
5040.47Not Relevant (0.0)1.93
6020.36Relevant (1.0)2.29
7010.24Not Relevant (0.0)2.29
8070.16Not Relevant (0.0)2.29
Same values as before, but
ordered from highest to lowest


NDCG: Normalized Discounted Cumulative Gain

As you can see in the previous section, DCG either goes up with \(k\) or it stays the same. This means that queries that return larger result sets will probably always have higher DCG scores than queries that return small result sets.

A way to make comparison across queries fairer is to normalize the DCG score by the maximum possible DCG at each threshold \(k\).

$$ NDCG \ @k = \dfrac{DCG \ @k}{IDCG \ @k} $$

Where \(IDCG \ @k\) is the best possible value for \(DCG \ @k\), i.e. the value of DCG for the best possible ranking of relevant documents at threshold \(k\), i.e.:

$$ IDCG \ @k = \sum\limits_{i=1}^{relevant \ documents \\ \, \, \, \, \, \, \, \, \, \, \, \, \, \, \, at \ k} \frac{2^{rel_i} - 1}{log_2(i+1)} $$

Predicted relevance
scores (Sorted)
k Document
ID
Predicted
Relevance
Actual
Relevance
DCG @k IDCG @k NDCG @k
1 06 0.90 Relevant (1.0) 1.0 1.0 1.0
2 03 0.85 Not Relevant (0.0) 1.0 1.63 0.61
3 05 0.71 Relevant (1.0) 1.5 2.13 0.70
4 00 0.63 Relevant (1.0) 1.93 2.56 0.75
5 04 0.47 Not Relevant (0.0) 1.93 2.56 0.75
6 02 0.36 Relevant (1.0) 2.29 2.56 0.89
7 01 0.24 Not Relevant (0.0) 2.29 2.56 0.89
8 07 0.16 Not Relevant (0.0) 2.29 2.56 0.89
Same values as before, but
ordered from highest to lowest


What about MAP (Mean Average Precision)?

AP (Average Precision) is a metric that tells you how a single sorted prediction compares with the ground truth. E.g. AP would tell you how correct a single ranking of documents is, with respect to a single query.

But what if you need to know how your model's rankings perform when evaluated on a whole validation set? After all, it is really of no use if your trained model correctly ranks classes for some examples but not for others.

This is where MAP (Mean Average Precision) comes in. All you need to do is to sum the AP value for each example in a validation dataset and then divide by the number of examples. In other words, take the mean of the AP over all examples.

So, to summarize:

AP (Average Precision) MAP (Mean Average Precision)
Informs you how correct a model's ranked
predictions are for a single example
Informs you how correct a model's ranked
predictions are, on average, over a whole validation dataset

It is a simple average of AP over all examples in
a validation set.

What about AP @k (Average Precision at k)?

Although AP (Average Precision) is not usually presented like this, nothing stops us from calculating AP at each threshold value.

So for all practical purposes, we could calculate \(AP \ @k\) as follows:

formula-for-average-precision-at-k By slightly modifying the algorithm to calculate the
Average Precision, you can derive AP @k, for varying k

So, using our old guiding example:

Predicted relevance
scores (Sorted)
k Document
ID
Predicted
Relevance
Actual
Relevance
AP @k
1060.90Relevant (1.0)1.0
2030.85Not Relevant (0.0)1.0
3050.71Relevant (1.0)0.83
4000.63Relevant (1.0)0.81
5040.47Not Relevant (0.0)0.81
6020.36Relevant (1.0)0.77
7010.24Not Relevant (0.0)0.77
8070.16Not Relevant (0.0)0.77
Same values as before, but
ordered from highest to lowest

DCG vs NDCG

NDCG is used when you need to compare the ranking for one result set with another ranking, with potentially less elements, different elements, etc.

You can't do that using DCG because query results may vary in size, unfairly penalizing queries that return long result sets.

NDCG normalizes a DCG score, dividing it by the best possible DCG at each threshold.1

DCG NDCG
(Normalized DCG)
Returns absolute valuesReturns normalized values
Does not allow comparison
between queries
Allows comparison between queries
Cannot be used to gauge the performance
of a ranking model on a whole validation
dataset
Can be used to measure a model across a full
dataset.

Just average NDCG values for each
example in the validation set.


References


1: Also called the \(IDCG_k\) or the ideal or best possible value for DCG at threshold \(k\).

Dialogue & Discussion