Kneser–Ney smoothing


Kneser–Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories. It is widely considered the most effective method of smoothing due to its use of absolute discounting, which subtracts a fixed value from the count of each observed n-gram and redistributes the freed probability mass to lower-order terms. The approach has been considered equally effective for both higher- and lower-order n-grams.

A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying only on the unigram frequency to predict the frequencies of n-grams leads to skewed results; Kneser–Ney smoothing corrects this by basing the unigram probability on the number of distinct words that can precede it.

Method

Let c(w, w') be the number of occurrences of the word w followed by the word w' in the corpus.

The equation for bigram probabilities is as follows:

p_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1}, w')} + \lambda_{w_{i-1}} \, p_{KN}(w_i)

where the unigram probability p_{KN}(w_i) reflects how likely the word w_i is to appear in an unfamiliar context. It is estimated as the number of distinct words that w_i appears after, divided by the number of distinct pairs of consecutive words in the corpus:

p_{KN}(w_i) = \frac{|\{w' : 0 < c(w', w_i)\}|}{|\{(w', w'') : 0 < c(w', w'')\}|}
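
The following is a minimal sketch of this continuation-count estimate, assuming a toy corpus given as a list of tokens; the function and variable names are illustrative, not from the original text.

from collections import Counter

def continuation_prob(tokens):
    # For each word w, p_KN(w) = (number of distinct words preceding w)
    # divided by the number of distinct bigrams in the corpus.
    distinct_bigrams = set(zip(tokens, tokens[1:]))
    left_contexts = Counter(w2 for (_, w2) in distinct_bigrams)
    total = len(distinct_bigrams)
    return {w: n / total for w, n in left_contexts.items()}

# Toy corpus: "Francisco" occurs twice but only ever follows "San",
# so its continuation probability is just 1 / (number of distinct bigrams).
corpus = "we visited San Francisco and San Francisco welcomed us".split()
print(continuation_prob(corpus)["Francisco"])  # 1/7, roughly 0.14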

Note that p_{KN} is a proper distribution: the values defined above are non-negative and sum to one.

The parameter \delta is a constant discount value subtracted from the count of each n-gram; it usually lies between 0 and 1.

The normalizing constant \lambda_{w_{i-1}} is chosen so that the conditional probabilities p_{KN}(w_i \mid w_{i-1}) sum to one over all w_i. Observe that (provided \delta < 1) each word w_i that occurs at least once after w_{i-1} in the corpus is discounted by exactly the same amount \delta / \sum_{w'} c(w_{i-1}, w'), so the total discount depends linearly on the number of distinct words w_i that can occur after w_{i-1}. This total discount is a budget that can be spread over all p_{KN}(w_i \mid w_{i-1}) in proportion to p_{KN}(w_i). Since the values of p_{KN}(w_i) sum to one, we can simply define \lambda_{w_{i-1}} to be equal to this total discount:

\lambda_{w_{i-1}} = \frac{\delta}{\sum_{w'} c(w_{i-1}, w')} \, |\{w' : 0 < c(w_{i-1}, w')\}|
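
A sketch of the full bigram estimate follows, combining the discounted count, the normalizing constant and the continuation probability above. The discount value, the count tables and all names here are illustrative assumptions, not part of the original text.

from collections import Counter, defaultdict

def kn_bigram_model(tokens, delta=0.75):
    # Count tables needed by the bigram formula.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_totals = defaultdict(int)    # sum over w' of c(w_prev, w')
    context_types = defaultdict(set)     # distinct w' with c(w_prev, w') > 0
    for (w1, w2), c in bigram_counts.items():
        context_totals[w1] += c
        context_types[w1].add(w2)
    # Continuation probability p_KN(w): distinct predecessors / distinct bigram types.
    predecessors = Counter(w2 for (w1, w2) in bigram_counts)
    n_bigram_types = len(bigram_counts)
    p_cont = {w: n / n_bigram_types for w, n in predecessors.items()}

    def p(w_prev, w):
        c = bigram_counts.get((w_prev, w), 0)
        total = context_totals[w_prev]
        lam = delta * len(context_types[w_prev]) / total   # lambda_{w_prev}
        return max(c - delta, 0) / total + lam * p_cont.get(w, 0.0)

    return p

corpus = "we visited San Francisco and San Francisco welcomed us".split()
p = kn_bigram_model(corpus)
print(sum(p("San", w) for w in set(corpus)))  # 1.0 up to floating point

The final print illustrates the point of \lambda_{w_{i-1}}: for a context seen in the corpus, the conditional probabilities over the vocabulary sum to one.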

This equation can be extended to n-grams. Let w_{i-n+1}^{i-1} be the n-1 words before w_i:

p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} + \frac{\delta \, |\{w' : 0 < c(w_{i-n+1}^{i-1}, w')\}|}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} \, p_{KN}(w_i \mid w_{i-n+2}^{i-1})
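
A recursive sketch of this higher-order case is given below. It follows the recursion exactly as written above, with the continuation-count estimate used only at the unigram base case; the count-table layout and all names are illustrative assumptions.

from collections import Counter

def kn_prob(history, word, counts, p_unigram, delta=0.75):
    # Recursive Kneser-Ney estimate p_KN(word | history).
    # counts[k] maps each (k+1)-gram tuple to its corpus count;
    # p_unigram is the continuation unigram distribution used at the base case.
    if not history:
        return p_unigram.get(word, 0.0)
    history = tuple(history)
    order_counts = counts[len(history)]
    followers = [c for ngram, c in order_counts.items() if ngram[:-1] == history]
    total = sum(followers)
    if total == 0:  # history never seen: fall back to the lower-order model
        return kn_prob(history[1:], word, counts, p_unigram, delta)
    c = order_counts.get(history + (word,), 0)
    lam = delta * len(followers) / total
    return max(c - delta, 0) / total + lam * kn_prob(history[1:], word, counts, p_unigram, delta)

# Example with bigram and trigram count tables built from a toy corpus.
corpus = "we visited San Francisco and San Francisco welcomed us".split()
counts = {k: Counter(zip(*[corpus[i:] for i in range(k + 1)])) for k in (1, 2)}
p_unigram = Counter(w for (_, w) in counts[1])
p_unigram = {w: n / len(counts[1]) for w, n in p_unigram.items()}
print(kn_prob(("San",), "Francisco", counts, p_unigram))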


This model uses the concept of absolute-discounting interpolation, which incorporates information from higher- and lower-order language models. The lower-order term contributes more of the overall probability when the count of the higher-order n-gram is zero, and its weight decreases when that count is non-zero.
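
As a concrete reading of this behaviour (a direct consequence of the bigram formula above, not an additional claim): if a bigram was never observed, the discounted term vanishes and the estimate comes entirely from the lower-order model,

c(w_{i-1}, w_i) = 0 \implies p_{KN}(w_i \mid w_{i-1}) = \lambda_{w_{i-1}} \, p_{KN}(w_i).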
