Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to measure effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of documents in a search-engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
Contents
Overview
Two assumptions are made in using DCG and its related measures.
- Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks)
- Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than non-relevant documents.
DCG originates from an earlier, more primitive, measure called Cumulative Gain.
Cumulative Gain
Cumulative Gain (CG) is the predecessor of DCG and does not include the position of a result in the consideration of the usefulness of a result set. In this way, it is the sum of the graded relevance values of all results in a search result list. The CG at a particular rank position
Where
The value computed with the CG function is unaffected by changes in the ordering of search results. That is, moving a highly relevant document
Discounted Cumulative Gain
The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized as the graded relevance value is reduced logarithmically proportional to the position of the result. The discounted CG accumulated at a particular rank position
Previously there has not been any theoretically sound justification for using a logarithmic reduction factor other than the fact that it produces a smooth reduction. But Wang et al. (2013) give theoretical guarantee for using the logarithmic reduction factor in NDCG. The authors show that for every pair of substantially different ranking functions, the NDCG can decide which one is better in a consistent manner.
An alternative formulation of DCG places stronger emphasis on retrieving relevant documents:
The latter formula is commonly used in industry including major web search companies and data science competition platform such as Kaggle.
These two formulations of DCG are the same when the relevance values of documents are binary;
Note that Croft et al. (2010) and Burges et al. (2005) present the second DCG with a log of base e, while both versions of DCG above use a log of base 2. When computing NDCG with the second formulation of DCG, the base of the log does not matter, but the base of the log does affect the value of NDCG for the first formulation. Clearly, the base of the log affects the value of DCG in both formulations.
Normalized DCG
Search result lists vary in length depending on the query. Comparing a search engine's performance from one query to the next cannot be consistently achieved using DCG alone, so the cumulative gain at each position for a chosen value of
where:
and |REL| represents the list of relevant documents (ordered by their relevance) in the corpus up to position p.
The nDCG values for all queries can be averaged to obtain a measure of the average performance of a search engine's ranking algorithm. Note that in a perfect ranking algorithm, the
The main difficulty encountered in using nDCG is the unavailability of an ideal ordering of results when only partial relevance feedback is available.
Example
Presented with a list of documents in response to a search query, an experiment participant is asked to judge the relevance of each document to the query. Each document is to be judged on a scale of 0-3 with 0 meaning not relevant, 3 meaning highly relevant, and 1 and 2 meaning "somewhere in between". For the documents ordered by the ranking algorithm as
the user provides the following relevance scores:
That is: document 1 has a relevance of 3, document 2 has a relevance of 2, etc. The Cumulative Gain of this search result listing is:
Changing the order of any two documents does not affect the CG measure. If
So the
Now a switch of
The performance of this query to another is incomparable in this form since the other query may have more results, resulting in a larger overall DCG which may not necessarily be better. In order to compare, the DCG values must be normalized.
To normalize DCG values, an ideal ordering for the given query is needed. For this example, that ordering would be the monotonically decreasing sort of the relevance judgments provided by the experiment participant, which is:
The DCG of this ideal ordering, or IDCG (Ideal DCG) , is then:
And so the nDCG for this query is given as:
Limitations
- Normalized DCG metric does not penalize for bad documents in the result. For example, if a query returns two results with scores
1 , 1 , 1 and1 , 1 , 1 , 0 respectively, both would be considered equally good even if the latter contains a bad result. One way to take into account this limitation is to use1 − 2 r e l i 2 r e l i − 1 for all others. For example, for the ranking judgmentsE x c e l l e n t , F a i r , B a d one might use numerical scores1 , 0 , − 1 instead of2 , 1 , 0 . - Normalized DCG does not penalize for missing documents in the result. For example, if a query returns two results with scores
1 , 1 , 1 and1 , 1 , 1 , 1 , 1 respectively, both would be considered equally good. One way to take into account this limitation is to enforce fixed set size for the result set and use minimum scores for the missing documents. In previous example, we would use the scores1 , 1 , 1 , 0 , 0 and1 , 1 , 1 , 1 , 1 and quote nDCG as nDCG@5. - Normalized DCG may not be suitable to measure performance of queries that may typically often have several equally good results. This is especially true when this metric is limited to only first few results as it is done in practice. For example, for queries such as "restaurants" nDCG@1 would account for only first result and hence if one result set contains only 1 restaurant from the nearby area while the other contains 5, both would end up having same score even though latter is more comprehensive.