Evaluation Metrics for Ranking Problems: Introduction and Examples
- Ranking Problems
- Sample dataset (Ground Truth)
- Precision @k
- Recall @k
- F1 @k
- AP: Average Precision
- DCG: Discounted Cumulative Gain
- NDCG: Normalized Discounted Cumulative Gain
- MAP: Mean Average Precision
- Average Precision @k
- DCG vs NDCG
Ranking Problems
In many domains, data scientists are asked not just to predict which class (or classes) an example belongs to, but to rank classes according to how likely they are for a particular example.
Classification | Ranking |
---|---|
Order of predictions doesn't matter | Order of predictions does matter |
This is often the case because, in the real world, resources are limited.
Whoever uses the predictions your model makes has limited time and limited space, so they will likely prioritize the top results.
Some domains where this effect is particularly noticeable:
Search engines: Predict which documents match a query on a search engine.
Tag suggestion for Tweets: Predict which tags should be assigned to a tweet.
Image label prediction: Predict what labels should be suggested for an uploaded picture.
If your machine learning model produces a real-valued score for each of the possible classes, you can turn a classification problem into a ranking problem.
In other words, if you predict scores for a set of examples and you have a ground truth, you can order your predictions from highest to lowest and compare them with the ground truth:
Search engines: Do relevant documents appear up on the list or down at the bottom?
Tag suggestion for Tweets: Are the correct tags predicted with higher score or not?
Image label prediction: Does your system correctly give more weight to correct labels?
In the following sections, we will go over many ways to evaluate ranked predictions with respect to actual values, or ground truth.
Sample dataset (Ground Truth)
We will use the following dummy dataset to illustrate examples in this post:
ID | Actual Relevance | Text |
---|---|---|
00 | Relevant (1.0) | Lorem ipsum dolor sit amet, consectetur adipiscing elit. |
01 | Not Relevant (0.0) | Nulla non semper lorem, id tincidunt nunc. |
02 | Relevant (1.0) | Sed scelerisque volutpat eros nec tincidunt. |
03 | Not Relevant (0.0) | Fusce vel varius erat, vitae elementum lacus. |
04 | Not Relevant (0.0) | Donec vitae lobortis odio. |
05 | Relevant (1.0) | Quisque congue suscipit augue, congue porta est pretium vel. |
06 | Relevant (1.0) | Donec eget enim vel nisl feugiat tincidunt. |
07 | Not Relevant (0.0) | Nunc vitae bibendum tortor. |
Documents 00, 02, 05, and 06 are relevant; 01, 03, 04, and 07 are not.
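To make the examples in the following sections easy to reproduce, here is a small Python sketch of this dataset (the variable names and dictionary layout are my own choice; the predicted scores are the ones used in the tables below):

```python
# Ground truth relevance for each document (1.0 = Relevant, 0.0 = Not Relevant)
actual = {
    "00": 1.0, "01": 0.0, "02": 1.0, "03": 0.0,
    "04": 0.0, "05": 1.0, "06": 1.0, "07": 0.0,
}

# Scores predicted by a (hypothetical) model, as used in the tables below
predicted = {
    "06": 0.90, "03": 0.85, "05": 0.71, "00": 0.63,
    "04": 0.47, "02": 0.36, "01": 0.24, "07": 0.16,
}

# Rank document IDs from highest to lowest predicted score
ranking = sorted(predicted, key=predicted.get, reverse=True)

# Ground-truth relevance labels in ranked order
relevances = [actual[doc_id] for doc_id in ranking]
print(ranking)     # ['06', '03', '05', '00', '04', '02', '01', '07']
print(relevances)  # [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
```

The later sketches operate directly on this `relevances` list, i.e. the actual relevance labels sorted by predicted score.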
Precision @k
Precision means: "of all examples I predicted to be TRUE, how many were actually TRUE?"
\(Precision\) \(@k\) ("Precision at \(k\)") is simply Precision evaluated only up to the \(k\)-th prediction, i.e.:
$$ \text{Precision}@k = \frac{\text{true positives} \ @k}{(\text{true positives} \ @k) + (\text{false positives} \ @k)} $$
Predictions sorted by score, from highest to lowest:
k | Document ID | Predicted Relevance | Actual Relevance | Precision @k |
---|---|---|---|---|
1 | 06 | 0.90 | Relevant (1.0) | 1.0 |
2 | 03 | 0.85 | Not Relevant (0.0) | 0.50 |
3 | 05 | 0.71 | Relevant (1.0) | 0.66 |
4 | 00 | 0.63 | Relevant (1.0) | 0.75 |
5 | 04 | 0.47 | Not Relevant (0.0) | 0.60 |
6 | 02 | 0.36 | Relevant (1.0) | 0.66 |
7 | 01 | 0.24 | Not Relevant (0.0) | 0.57 |
8 | 07 | 0.16 | Not Relevant (0.0) | 0.50 |
Example: Precision @1
I.e. what Precision do I get if I only use the top 1 prediction?
$$ \text{Precision}@1 = \frac{\text{true positives} \ @ 1}{(\text{true positives} \ @ 1) + (\text{false positives} \ @ 1)} $$
$$ \hphantom{\text{Precision}@1} = \frac{\text{true positives considering} \ k=1}{(\text{true positives considering} \ k=1) + \\ (\text{false positives considering} \ k=1)} $$
$$ = \frac{1}{1 + 0} $$
$$ =1 $$
Example: Precision @4
Similarly, \(Precision@4\) only takes into account predictions up to \(k=4\):
$$ \text{Precision}@4 = \frac{\text{true positives} \ @ 4}{\text{(true positives} \ @ 4) + (\text{false positives} \ @ 4)} $$
$$ \hphantom{\text{Precision}@4} = \frac{\text{true positives considering} \ k=4}{(\text{true positives considering} \ k=4) + \\ (\text{false positives considering} \ k=4)} $$
$$ = \frac{3}{3 + 1} $$
$$ =0.75 $$
Example: Precision @8
Finally, \(Precision@8\) is just the precision, since 8 is the total number of predictions:
$$ \text{Precision}@8 = \frac{\text{true positives} \ @ 8}{(\text{true positives} \ @ 8) + (\text{false positives} \ @ 8)} $$
$$ \hphantom{\text{Precision}@8} = \frac{\text{true positives considering} \ k=8}{(\text{true positives considering} \ k=8) + \\ (\text{false positives considering} \ k=8)} $$
$$ = \frac{4}{4 + 4} $$
$$ =0.5 $$
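As a sanity check for the table above, here is a minimal Python sketch of \(\text{Precision}@k\) (the function name is my own choice), applied to the relevance labels in ranked order:

```python
def precision_at_k(relevances, k):
    """Precision @k: fraction of the top-k ranked documents that are relevant."""
    return sum(relevances[:k]) / k

# Actual relevance labels, ordered by predicted score (highest to lowest)
relevances = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

print(precision_at_k(relevances, 1))  # 1.0
print(precision_at_k(relevances, 4))  # 0.75
print(precision_at_k(relevances, 8))  # 0.5
```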
Recall @k
Recall means: "of all examples that were actually TRUE, how many did I predict to be TRUE?"
\(Recall\) \(@k\) ("Recall at \(k\)") is simply Recall evaluated only up to the \(k\)-th prediction, i.e.:
$$ \text{Recall}@k = \frac{\text{true positives} \ @k}{(\text{true positives} \ @k) + (\text{false negatives} \ @k)} $$
Predictions sorted by score, from highest to lowest:
k | Document ID | Predicted Relevance | Actual Relevance | Recall @k |
---|---|---|---|---|
1 | 06 | 0.90 | Relevant (1.0) | 0.25 |
2 | 03 | 0.85 | Not Relevant (0.0) | 0.25 |
3 | 05 | 0.71 | Relevant (1.0) | 0.50 |
4 | 00 | 0.63 | Relevant (1.0) | 0.75 |
5 | 04 | 0.47 | Not Relevant (0.0) | 0.75 |
6 | 02 | 0.36 | Relevant (1.0) | 1.0 |
7 | 01 | 0.24 | Not Relevant (0.0) | 1.0 |
8 | 07 | 0.16 | Not Relevant (0.0) | 1.0 |
Example: Recall @1
I.e. what Recall do I get if I only use the top 1 prediction?
$$ \text{Recall}@1 = \frac{\text{true positives} \ @ 1}{(\text{true positives} \ @ 1) + (\text{false negatives} \ @ 1)} $$
$$ \hphantom{\text{Recall}@1} = \frac{\text{true positives considering} \ k=1}{(\text{true positives considering} \ k=1) + \\ (\text{false negatives considering} \ k=1)} $$
$$ = \frac{1}{1 + 3} $$
$$ =0.25 $$
Example: Recall @4
Similarly, \(Recall@4\) only takes into account predictions up to \(k=4\):
$$ \text{Recall}@4 = \frac{\text{true positives} \ @ 4}{(\text{true positives} \ @ 4) + (\text{false negatives} \ @ 4)} $$
$$ \hphantom{\text{Recall}@4} = \frac{\text{true positives considering} \ k=4}{(\text{true positives considering} \ k=4) + \\ (\text{false negatives considering} \ k=4)} $$
$$ = \frac{3}{3 + 1} $$
$$ =0.75 $$
Example: Recall @8
$$ \text{Recall}@8 = \frac{\text{true positives} \ @ 8}{(\text{true positives} \ @ 8) + (\text{false negatives} \ @ 8)} $$
$$ \hphantom{\text{Recall}@8} = \frac{\text{true positives considering} \ k=8}{(\text{true positives considering} \ k=8) + \\ (\text{false negatives considering} \ k=8)} $$
$$ = \frac{4}{4 + 0} $$
$$ =1 $$
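The same sanity check for \(\text{Recall}@k\) (again, a minimal sketch with names of my own choosing):

```python
def recall_at_k(relevances, k):
    """Recall @k: fraction of all relevant documents that appear in the top-k."""
    return sum(relevances[:k]) / sum(relevances)

# Actual relevance labels, ordered by predicted score (highest to lowest)
relevances = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

print(recall_at_k(relevances, 1))  # 0.25
print(recall_at_k(relevances, 4))  # 0.75
print(recall_at_k(relevances, 8))  # 1.0
```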
F1 @k
\(F_1\)-score (alternatively, \(F_1\)-measure) is a combined metric that takes into account both Precision and Recall.
Similarly to \(\text{Precision}@k\) and \(\text{Recall}@k\), \(F_1@k\) is a rank-based metric that can be summarized as follows: "What \(F_1\)-score do I get if I only consider the top \(k\) predictions my model outputs?"
$$ F_1 @k = 2 \cdot \frac{(Precision @k) \cdot (Recall @k) }{(Precision @k) + (Recall @k)} $$
An alternative formulation for \(F_1 @k\) is as follows:
$$ F_1 @k = \frac{2 \cdot (\text{true positives} \ @k)}{2 \cdot (\text{true positives} \ @k ) + (\text{false negatives} \ @k) + (\text{false positives} \ @k) } $$
Predictions sorted by score, from highest to lowest:
k | Document ID | Predicted Relevance | Actual Relevance | F1 @k |
---|---|---|---|---|
1 | 06 | 0.90 | Relevant (1.0) | 0.40 |
2 | 03 | 0.85 | Not Relevant (0.0) | 0.33 |
3 | 05 | 0.71 | Relevant (1.0) | 0.62 |
4 | 00 | 0.63 | Relevant (1.0) | 0.75 |
5 | 04 | 0.47 | Not Relevant (0.0) | 0.66 |
6 | 02 | 0.36 | Relevant (1.0) | 0.80 |
7 | 01 | 0.24 | Not Relevant (0.0) | 0.73 |
8 | 07 | 0.16 | Not Relevant (0.0) | 0.66 |
Example: F1 @1
$$ F_1 @1 = \frac{2 \cdot (\text{true positives} \ @1)}{2 \cdot (\text{true positives} \ @1 ) + (\text{false negatives} \ @1) + (\text{false positives} \ @1) } $$
$$ = \frac{2 \cdot (\text{true positives considering} \ k=1)}{2 \cdot (\text{true positives considering} \ k=1 ) + \\ \, \, \, \, \, \, (\text{false negatives considering} \ k=1) + \\ \, \, \, \, \, \, (\text{false positives considering} \ k=1) } $$
$$ = \frac{2 \cdot 1 }{ (2 \cdot 1) + 3 + 0 } $$
$$ = \frac{2}{5} $$
$$ = 0.4 $$
Which is the same result you get if you use the original formula:
$$ F_1 @1 = 2 \cdot \frac{(Precision @1) \cdot (Recall @1) }{(Precision @1) + (Recall @1)} $$
$$ = 2 \cdot \frac{1 \cdot 0.25}{1 + 0.25} $$
$$ = 2 \cdot 0.2 = 0.4 $$
Example: F1 @4
$$ F_1 @4 = \frac{2 \cdot (\text{true positives} \ @4)}{2 \cdot (\text{true positives} \ @4 ) + (\text{false negatives} \ @4) + (\text{false positives} \ @4) } $$
$$ = \frac{2 \cdot (\text{true positives considering} \ k=4)}{2 \cdot (\text{true positives considering} \ k=4 ) + \\ \, \, \, \, \, \, (\text{false negatives considering} \ k=4) + \\ \, \, \, \, \, \, (\text{false positives considering} \ k=4) } $$
$$ = \frac{2 \cdot 3 }{ (2 \cdot 3) + 1 + 1 } $$
$$ = \frac{6}{8} $$
$$ = 0.75 $$
Which is the same result you get if you use the original formula:
$$ F_1 @4 = 2 \cdot \frac{(Precision @4) \cdot (Recall @4) }{(Precision @4) + (Recall @4)} $$
$$ = 2 \cdot \frac{0.75 \cdot 0.75}{0.75 + 0.75} $$
$$ = 2 \cdot \frac{0.5625}{1.5} = 0.75 $$
Example: F1 @8
$$ F_1 @8 = \frac{2 \cdot (\text{true positives} \ @8)}{2 \cdot (\text{true positives} \ @8 ) + (\text{false negatives} \ @8) + (\text{false positives} \ @8) } $$
$$ = \frac{2 \cdot (\text{true positives considering} \ k=8)}{2 \cdot (\text{true positives considering} \ k=8 ) + \\ \, \, \, \, \, \, (\text{false negatives considering} \ k=8) + \\ \, \, \, \, \, \, (\text{false positives considering} \ k=8) } $$
$$ = \frac{2 \cdot 4 }{ (2 \cdot 4) + 0 + 4 } $$
$$ = \frac{8}{12} $$
$$ = 0.666 $$
Which is the same result you get if you use the original formula:
$$ F_1 @8 = 2 \cdot \frac{(Precision @8) \cdot (Recall @8) }{(Precision @8) + (Recall @8)} $$
$$ = 2 \cdot \frac{0.5 \cdot 1}{0.5 + 1} $$
$$ = 2 \cdot \frac{0.5}{1.5} $$
$$ = \frac{1}{1.5} = 0.666 $$
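And a corresponding sketch for \(F_1@k\), combining the two functions above (names are my own):

```python
def f1_at_k(relevances, k):
    """F1 @k: harmonic mean of Precision @k and Recall @k."""
    precision = sum(relevances[:k]) / k
    recall = sum(relevances[:k]) / sum(relevances)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Actual relevance labels, ordered by predicted score (highest to lowest)
relevances = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

print(round(f1_at_k(relevances, 1), 2))  # 0.4
print(round(f1_at_k(relevances, 4), 2))  # 0.75
print(round(f1_at_k(relevances, 8), 2))  # 0.67 (i.e. 0.666...)
```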
AP: Average Precision
AP (Average Precision) is another metric to compare a ranking with a set of relevant/non-relevant items.
One way to explain what AP represents is as follows:
AP is a metric that tells you to what extent the relevant documents are concentrated in the highest-ranked predictions.
Formula
$$ AP = \sum_{k} (\text{Recall}@k - \text{Recall}@(k-1)) \cdot \text{Precision}@k $$
So, for each threshold level \(k\), you take the difference between the Recall at the current level and the Recall at the previous level, multiply it by the Precision at that level, and sum these contributions over all levels.
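Here is a minimal sketch of this formulation (the function name is my own); it yields the same value we will obtain with the algorithm in the next subsection:

```python
def average_precision(relevances):
    """AP as the sum over k of (Recall@k - Recall@(k-1)) * Precision@k."""
    total_relevant = sum(relevances)
    ap, previous_recall = 0.0, 0.0
    for k in range(1, len(relevances) + 1):
        precision_k = sum(relevances[:k]) / k
        recall_k = sum(relevances[:k]) / total_relevant
        ap += (recall_k - previous_recall) * precision_k
        previous_recall = recall_k
    return ap

# Actual relevance labels, ordered by predicted score (highest to lowest)
relevances = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
print(round(average_precision(relevances), 2))  # 0.77
```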
Algorithm to calculate AP
You can calculate the AP using the following algorithm:
Iterate over the ranked list of predictions, keeping a \(\text{RunningSum}\) and a \(\text{CorrectPredictions}\) counter, both starting at zero:
- At each rank \(k\), if the document at that rank is relevant, increment \(\text{CorrectPredictions}\) and add \(\frac{\text{CorrectPredictions}}{k}\) (i.e. the Precision at that rank) to the \(\text{RunningSum}\). If it is not relevant, leave both unchanged.
- At the end, divide the \(\text{RunningSum}\) by the total number of relevant documents to obtain the AP.
Example: AP Calculated using the algorithm
Predictions sorted by score, from highest to lowest:
k | Document ID | Predicted Relevance | Actual Relevance |
---|---|---|---|
1 | 06 | 0.90 | Relevant (1.0) |
2 | 03 | 0.85 | Not Relevant (0.0) |
3 | 05 | 0.71 | Relevant (1.0) |
4 | 00 | 0.63 | Relevant (1.0) |
5 | 04 | 0.47 | Not Relevant (0.0) |
6 | 02 | 0.36 | Relevant (1.0) |
7 | 01 | 0.24 | Not Relevant (0.0) |
8 | 07 | 0.16 | Not Relevant (0.0) |
Following the algorithm described above, let's go about calculating the AP for our guiding example:
At rank 1:
- \(\text{RunningSum} = 0 + \frac{1}{1} = 1, \text{CorrectPredictions} = 1\)
At rank 2:
- No change. We don't update either the RunningSum or the CorrectPredictions count, since the document at rank 2 is not relevant.
- In other words, we don't count when there's a wrong prediction.
At rank 3:
- \(\text{RunningSum} = 1 + \frac{2}{3} = 1 + 0.666 = 1.666\)
- \(\text{CorrectPredictions} = 2\)
At rank 4:
- \(\text{RunningSum} = 1.666 + \frac{3}{4} = 1.666 + 0.75 = 2.416\)
- \(\text{CorrectPredictions} = 3\)
At rank 5:
- No change, wrong prediction.
At rank 6:
- \(\text{RunningSum} = 2.416 + \frac{4}{6} = 2.416 + 0.666 = 3.082\)
- \(\text{CorrectPredictions} = 4\)
At rank 7:
- No change, wrong prediction.
At rank 8:
- No change, wrong prediction.
And at the end we divide the RunningSum by the number of relevant documents, which is, in this case, equal to the final CorrectPredictions count:
\(AP = \dfrac{\text{RunningSum}}{\text{CorrectPredictions}} \)
\(AP = \dfrac{3.082}{4} \)
\(AP = 0.77\)
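The same running-sum algorithm, written out as a short Python sketch (the variable names mirror the walkthrough above):

```python
def average_precision(relevances):
    """AP via the RunningSum / CorrectPredictions algorithm described above."""
    running_sum = 0.0
    correct_predictions = 0
    for rank, relevance in enumerate(relevances, start=1):
        if relevance > 0:  # only relevant documents contribute
            correct_predictions += 1
            running_sum += correct_predictions / rank  # Precision at this rank
    # Divide by the total number of relevant documents
    return running_sum / sum(relevances)

# Actual relevance labels, ordered by predicted score (highest to lowest)
relevances = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
print(round(average_precision(relevances), 2))  # 0.77
```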
DCG: Discounted Cumulative Gain
One advantage of DCG over the other metrics is that it also works when document relevances are real numbers. In other words, it still applies when each document is not simply relevant/non-relevant (as in our example) but has a graded relevance score instead.
$$ DCG \ @k = \sum\limits_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)} $$
where \(rel_i\) is the relevance of the document at index \(i\). Since we're dealing with binary relevances, \(rel_i\) equals 1 if document \(i\) is relevant and 0 otherwise.
Predictions sorted by score, from highest to lowest:
k | Document ID | Predicted Relevance | Actual Relevance | DCG @k |
---|---|---|---|---|
1 | 06 | 0.90 | Relevant (1.0) | 1.0 |
2 | 03 | 0.85 | Not Relevant (0.0) | 1.0 |
3 | 05 | 0.71 | Relevant (1.0) | 1.5 |
4 | 00 | 0.63 | Relevant (1.0) | 1.93 |
5 | 04 | 0.47 | Not Relevant (0.0) | 1.93 |
6 | 02 | 0.36 | Relevant (1.0) | 2.29 |
7 | 01 | 0.24 | Not Relevant (0.0) | 2.29 |
8 | 07 | 0.16 | Not Relevant (0.0) | 2.29 |
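A minimal sketch of \(DCG \ @k\) using the gain and discount from the formula above (the function name is my own):

```python
from math import log2

def dcg_at_k(relevances, k):
    """DCG @k with gain (2^rel - 1) and discount log2(i + 1)."""
    return sum(
        (2 ** rel - 1) / log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )

# Actual relevance labels, ordered by predicted score (highest to lowest)
relevances = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

print(round(dcg_at_k(relevances, 1), 2))  # 1.0
print(round(dcg_at_k(relevances, 4), 2))  # 1.93
print(round(dcg_at_k(relevances, 8), 2))  # 2.29
```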
NDCG: Normalized Discounted Cumulative Gain
As you can see in the previous section, DCG either increases with \(k\) or stays the same. This means that queries returning larger result sets will tend to have higher DCG scores than queries returning smaller result sets.
A way to make comparison across queries fairer is to normalize the DCG score by the maximum possible DCG at each threshold \(k\).
$$ NDCG \ @k = \dfrac{DCG \ @k}{IDCG \ @k} $$
Where \(IDCG \ @k\) is the best possible value for \(DCG \ @k\), i.e. the value of DCG for the best possible ranking of relevant documents at threshold \(k\), i.e.:
$$ IDCG \ @k = \sum\limits_{i=1}^{|REL_k|} \frac{2^{rel_i} - 1}{\log_2(i+1)} $$
where \(REL_k\) is the list of relevant documents (ordered by relevance) up to position \(k\).
Predictions sorted by score, from highest to lowest:
k | Document ID | Predicted Relevance | Actual Relevance | DCG @k | IDCG @k | NDCG @k |
---|---|---|---|---|---|---|
1 | 06 | 0.90 | Relevant (1.0) | 1.0 | 1.0 | 1.0 |
2 | 03 | 0.85 | Not Relevant (0.0) | 1.0 | 1.63 | 0.61 |
3 | 05 | 0.71 | Relevant (1.0) | 1.5 | 2.13 | 0.70 |
4 | 00 | 0.63 | Relevant (1.0) | 1.93 | 2.56 | 0.75 |
5 | 04 | 0.47 | Not Relevant (0.0) | 1.93 | 2.56 | 0.75 |
6 | 02 | 0.36 | Relevant (1.0) | 2.29 | 2.56 | 0.89 |
7 | 01 | 0.24 | Not Relevant (0.0) | 2.29 | 2.56 | 0.89 |
8 | 07 | 0.16 | Not Relevant (0.0) | 2.29 | 2.56 | 0.89 |
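And the normalized version, reusing the DCG sketch above (again, the names are my own):

```python
from math import log2

def dcg_at_k(relevances, k):
    return sum((2 ** rel - 1) / log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """NDCG @k: DCG @k divided by the DCG @k of the ideal ranking (IDCG @k)."""
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Actual relevance labels, ordered by predicted score (highest to lowest)
relevances = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

print(round(ndcg_at_k(relevances, 2), 2))  # 0.61
print(round(ndcg_at_k(relevances, 4), 2))  # 0.75
print(round(ndcg_at_k(relevances, 8), 2))  # 0.89
```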
MAP: Mean Average Precision
What about Mean Average Precision (MAP)?
AP (Average Precision) is a metric that tells you how a single sorted prediction compares with the ground truth. E.g. AP would tell you how correct a single ranking of documents is, with respect to a single query.
But what if you need to know how your model's rankings perform when evaluated on a whole validation set? After all, it is really of no use if your trained model correctly ranks classes for some examples but not for others.
This is where MAP (Mean Average Precision) comes in. All you need to do is to sum the AP value for each example in a validation dataset and then divide by the number of examples. In other words, take the mean of the AP over all examples.
So, to summarize:
AP (Average Precision) | MAP (Mean Average Precision) |
---|---|
Informs you how correct a model's ranked predictions are for a single example | Informs you how correct a model's ranked predictions are, on average, over a whole validation dataset. It is a simple average of AP over all examples in the validation set. |
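A minimal sketch of MAP on top of the AP function from earlier (the second query below is a made-up example, just to show the averaging):

```python
def average_precision(relevances):
    running_sum, correct_predictions = 0.0, 0
    for rank, relevance in enumerate(relevances, start=1):
        if relevance > 0:
            correct_predictions += 1
            running_sum += correct_predictions / rank
    return running_sum / sum(relevances)

def mean_average_precision(rankings):
    """MAP: the mean of AP over the ranked results of every example/query."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# One ranked relevance list per example/query in a (hypothetical) validation set
rankings = [
    [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0],  # the query from this post (AP ≈ 0.77)
    [0.0, 1.0, 1.0, 0.0],                      # a made-up second query (AP ≈ 0.58)
]
print(round(mean_average_precision(rankings), 2))  # 0.68
```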
Average Precision @k
What about AP @k (Average Precision at k)?
Although AP (Average Precision) is not usually presented like this, nothing stops us from calculating AP at each threshold value.
So, for all practical purposes, we can calculate \(AP \ @k\) by running the same algorithm used for the full Average Precision but stopping at rank \(k\): truncate the ranked list at position \(k\) and divide the RunningSum by the number of relevant documents retrieved up to that rank. Doing this for varying \(k\) gives \(AP \ @k\) at each threshold (see the sketch after the table below).
So, using our old guiding example:
Predictions sorted by score, from highest to lowest:
k | Document ID | Predicted Relevance | Actual Relevance | AP @k |
---|---|---|---|---|
1 | 06 | 0.90 | Relevant (1.0) | 1.0 |
2 | 03 | 0.85 | Not Relevant (0.0) | 1.0 |
3 | 05 | 0.71 | Relevant (1.0) | 0.83 |
4 | 00 | 0.63 | Relevant (1.0) | 0.81 |
5 | 04 | 0.47 | Not Relevant (0.0) | 0.81 |
6 | 02 | 0.36 | Relevant (1.0) | 0.77 |
7 | 01 | 0.24 | Not Relevant (0.0) | 0.77 |
8 | 07 | 0.16 | Not Relevant (0.0) | 0.77 |
Note that the value at the end (\(k=8\)) is precisely the same as what we got for the full Average Precision.
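A sketch of this truncated version (names are my own), matching the AP @k column above:

```python
def average_precision_at_k(relevances, k):
    """AP @k: the running-sum algorithm stopped at rank k, divided by the
    number of relevant documents retrieved up to rank k."""
    running_sum, correct_predictions = 0.0, 0
    for rank, relevance in enumerate(relevances[:k], start=1):
        if relevance > 0:
            correct_predictions += 1
            running_sum += correct_predictions / rank
    return running_sum / correct_predictions if correct_predictions > 0 else 0.0

# Actual relevance labels, ordered by predicted score (highest to lowest)
relevances = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

print(round(average_precision_at_k(relevances, 3), 2))  # 0.83
print(round(average_precision_at_k(relevances, 4), 2))  # 0.81
print(round(average_precision_at_k(relevances, 8), 2))  # 0.77
```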
DCG vs NDCG
NDCG is used when you need to compare the ranking for one result set with the ranking for another, potentially with fewer elements, different elements, and so on.
You can't do that using DCG because query results may vary in size, unfairly penalizing queries that return long result sets.
NDCG normalizes a DCG score, dividing it by the best possible DCG at each threshold.1
DCG | NDCG (Normalized DCG) |
---|---|
Returns absolute values | Returns normalized values |
Does not allow comparison between queries | Allows comparison between queries |
Cannot be used to gauge the performance of a ranking model on a whole validation dataset | Can be used to measure a model across a full dataset. Just average NDCG values for each example in the validation set. |
References
Chen et al. 2009: Ranking Measures and Loss Functions in Learning to Rank
- This is interesting because, although we use ranked evaluation metrics, the loss functions we use for training often do not directly optimize those metrics.
- Published at NIPS 2009
1: Also called the \(IDCG_k\) or the ideal or best possible value for DCG at threshold \(k\).