Kneser–Ney smoothing


Kneser–Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories. It is widely considered the most effective method of smoothing due to its use of absolute discounting, which subtracts a fixed value from the count of each observed n-gram and redistributes the freed probability mass to lower-order terms. The approach has been considered equally effective for both higher- and lower-order n-grams.

A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying only on the unigram frequency to predict the frequencies of n-grams leads to skewed results; Kneser–Ney smoothing corrects this by basing the unigram probability on the number of distinct words that can precede it.

Method

Let c(w, w') be the number of occurrences of the word w followed by the word w' in the corpus.

The equation for bigram probabilities is as follows:

p_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1}, w')} + \lambda_{w_{i-1}} \, p_{KN}(w_i)

where the unigram probability p_{KN}(w_i) reflects how likely the word w_i is to appear in an unfamiliar context. It is estimated as the number of distinct words that w_i appears after, divided by the number of distinct pairs of consecutive words in the corpus:

p_{KN}(w_i) = \frac{|\{w' : 0 < c(w', w_i)\}|}{|\{(w', w'') : 0 < c(w', w'')\}|}
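
The following is a minimal sketch of this continuation-count estimate, assuming a toy corpus given as a list of tokens; the function and variable names are illustrative, not from the original text.

from collections import Counter

def continuation_prob(tokens):
    # For each word w, p_KN(w) = (number of distinct words preceding w)
    # divided by the number of distinct bigrams in the corpus.
    distinct_bigrams = set(zip(tokens, tokens[1:]))
    left_contexts = Counter(w2 for (_, w2) in distinct_bigrams)
    total = len(distinct_bigrams)
    return {w: n / total for w, n in left_contexts.items()}

# Toy corpus: "Francisco" occurs twice but only ever follows "San",
# so its continuation probability is just 1 / (number of distinct bigrams).
corpus = "we visited San Francisco and San Francisco welcomed us".split()
print(continuation_prob(corpus)["Francisco"])  # 1/7, roughly 0.14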

Note that p_{KN} is a proper distribution: the values defined above are non-negative and sum to one.

The parameter \delta is a constant discount value subtracted from the count of each n-gram; it usually lies between 0 and 1.

The normalizing constant \lambda_{w_{i-1}} is chosen so that the conditional probabilities p_{KN}(w_i \mid w_{i-1}) sum to one over all w_i. Observe that (provided \delta < 1) each word w_i that occurs at least once after w_{i-1} in the corpus is discounted by exactly the same amount \delta / \sum_{w'} c(w_{i-1}, w'), so the total discount depends linearly on the number of distinct words w_i that can occur after w_{i-1}. This total discount is a budget that can be spread over all p_{KN}(w_i \mid w_{i-1}) in proportion to p_{KN}(w_i). Since the values of p_{KN}(w_i) sum to one, we can simply define \lambda_{w_{i-1}} to be equal to this total discount:

\lambda_{w_{i-1}} = \frac{\delta}{\sum_{w'} c(w_{i-1}, w')} \, |\{w' : 0 < c(w_{i-1}, w')\}|
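
A sketch of the full bigram estimate follows, combining the discounted count, the normalizing constant and the continuation probability above. The discount value, the count tables and all names here are illustrative assumptions, not part of the original text.

from collections import Counter, defaultdict

def kn_bigram_model(tokens, delta=0.75):
    # Count tables needed by the bigram formula.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_totals = defaultdict(int)    # sum over w' of c(w_prev, w')
    context_types = defaultdict(set)     # distinct w' with c(w_prev, w') > 0
    for (w1, w2), c in bigram_counts.items():
        context_totals[w1] += c
        context_types[w1].add(w2)
    # Continuation probability p_KN(w): distinct predecessors / distinct bigram types.
    predecessors = Counter(w2 for (w1, w2) in bigram_counts)
    n_bigram_types = len(bigram_counts)
    p_cont = {w: n / n_bigram_types for w, n in predecessors.items()}

    def p(w_prev, w):
        c = bigram_counts.get((w_prev, w), 0)
        total = context_totals[w_prev]
        lam = delta * len(context_types[w_prev]) / total   # lambda_{w_prev}
        return max(c - delta, 0) / total + lam * p_cont.get(w, 0.0)

    return p

corpus = "we visited San Francisco and San Francisco welcomed us".split()
p = kn_bigram_model(corpus)
print(sum(p("San", w) for w in set(corpus)))  # 1.0 up to floating point

The final print illustrates the point of \lambda_{w_{i-1}}: for a context seen in the corpus, the conditional probabilities over the vocabulary sum to one.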

This equation can be extended to n-grams. Let w_{i-n+1}^{i-1} be the n-1 words before w_i:

p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} + \frac{\delta \, |\{w' : 0 < c(w_{i-n+1}^{i-1}, w')\}|}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} \, p_{KN}(w_i \mid w_{i-n+2}^{i-1})
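
A recursive sketch of this higher-order case is given below. It follows the recursion exactly as written above, with the continuation-count estimate used only at the unigram base case; the count-table layout and all names are illustrative assumptions.

from collections import Counter

def kn_prob(history, word, counts, p_unigram, delta=0.75):
    # Recursive Kneser-Ney estimate p_KN(word | history).
    # counts[k] maps each (k+1)-gram tuple to its corpus count;
    # p_unigram is the continuation unigram distribution used at the base case.
    if not history:
        return p_unigram.get(word, 0.0)
    history = tuple(history)
    order_counts = counts[len(history)]
    followers = [c for ngram, c in order_counts.items() if ngram[:-1] == history]
    total = sum(followers)
    if total == 0:  # history never seen: fall back to the lower-order model
        return kn_prob(history[1:], word, counts, p_unigram, delta)
    c = order_counts.get(history + (word,), 0)
    lam = delta * len(followers) / total
    return max(c - delta, 0) / total + lam * kn_prob(history[1:], word, counts, p_unigram, delta)

# Example with bigram and trigram count tables built from a toy corpus.
corpus = "we visited San Francisco and San Francisco welcomed us".split()
counts = {k: Counter(zip(*[corpus[i:] for i in range(k + 1)])) for k in (1, 2)}
p_unigram = Counter(w for (_, w) in counts[1])
p_unigram = {w: n / len(counts[1]) for w, n in p_unigram.items()}
print(kn_prob(("San",), "Francisco", counts, p_unigram))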


This model uses the concept of absolute-discounting interpolation, which incorporates information from higher- and lower-order language models. The lower-order term contributes more of the overall probability when the count of the higher-order n-gram is zero, and its weight decreases when that count is non-zero.
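
As a concrete reading of this behaviour (a direct consequence of the bigram formula above, not an additional claim): if a bigram was never observed, the discounted term vanishes and the estimate comes entirely from the lower-order model,

c(w_{i-1}, w_i) = 0 \implies p_{KN}(w_i \mid w_{i-1}) = \lambda_{w_{i-1}} \, p_{KN}(w_i).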
