In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function. If $P$ and $Q$ are probability distributions on the real line, such that $P$ is absolutely continuous with respect to $Q$, i.e. $P \ll Q$, and whose first moments exist, then
$$D_{KL}(P \parallel Q) \ge \Psi_Q^*(\mu'_1(P)),$$
where $\Psi_Q^*$ is the rate function, i.e. the convex conjugate of the cumulant-generating function, of $Q$, and $\mu'_1(P)$ is the first moment of $P$.
The Cramér–Rao bound is a corollary of this result.
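As a quick illustration, the inequality can be checked numerically in a case where both sides have closed forms. The sketch below (an illustrative example, not part of the statement) takes $Q$ to be the standard normal, whose rate function is $\Psi_Q^*(x) = x^2/2$, and $P$ a normal distribution with mean $m$ and standard deviation $s$:

import numpy as np

# Illustrative check of Kullback's inequality for Gaussian P and Q = N(0, 1).
# The cumulant-generating function of Q is Psi_Q(t) = t**2 / 2, so its convex
# conjugate (rate function) is Psi_Q*(x) = x**2 / 2.

def kl_gaussian(m, s):                # D_KL( N(m, s^2) || N(0, 1) )
    return 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))

def rate_std_normal(x):               # Psi_Q*(x) for Q = N(0, 1)
    return 0.5 * x**2

for m, s in [(0.3, 1.0), (1.5, 0.7), (-2.0, 2.5)]:
    lhs = kl_gaussian(m, s)           # left side: the divergence
    rhs = rate_std_normal(m)          # right side: rate function at the mean of P
    assert lhs >= rhs - 1e-12
    print(f"m={m:+.1f}, s={s:.1f}:  D_KL = {lhs:.4f} >= {rhs:.4f}")

Equality holds when $s = 1$, i.e. when $P$ lies in the natural exponential family generated by $Q$ (constructed in the proof below).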
To prove the inequality, let $P$ and $Q$ be probability distributions (measures) on the real line whose first moments exist, and such that $P \ll Q$. Consider the natural exponential family of $Q$ given by
$$Q_\theta(A) = \frac{\int_A e^{\theta x}\,Q(dx)}{\int_{-\infty}^{\infty} e^{\theta x}\,Q(dx)} = \frac{1}{M_Q(\theta)} \int_A e^{\theta x}\,Q(dx)$$
for every measurable set $A$, where $M_Q$ is the moment-generating function of $Q$. (Note that $Q_0 = Q$.)
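The tilting operation that defines $Q_\theta$ can be sketched numerically; in the snippet below the discrete support points and weights are arbitrary illustrative choices, not taken from the text:

import numpy as np

# Exponential tilting of an illustrative discrete distribution Q.
x = np.array([0.0, 1.0, 2.0, 3.0])   # support points (illustrative)
q = np.array([0.4, 0.3, 0.2, 0.1])   # Q as a discrete probability vector

def tilt(theta):
    """Q_theta(x) proportional to exp(theta * x) * Q(x), normalized by M_Q(theta)."""
    w = np.exp(theta * x) * q
    return w / w.sum()                # dividing by M_Q(theta) = sum of the weights

print(tilt(0.0))   # recovers Q itself, since Q_0 = Q
print(tilt(1.0))   # positive theta shifts mass toward larger values of x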
Then
$$D_{KL}(P \parallel Q) = D_{KL}(P \parallel Q_\theta) + \int_{\operatorname{supp} P} \left(\log \frac{dQ_\theta}{dQ}\right) dP.$$
By Gibbs' inequality we have $D_{KL}(P \parallel Q_\theta) \ge 0$, so that
$$D_{KL}(P \parallel Q) \ge \int_{\operatorname{supp} P} \left(\log \frac{dQ_\theta}{dQ}\right) dP = \int_{\operatorname{supp} P} \left(\log \frac{e^{\theta x}}{M_Q(\theta)}\right) P(dx).$$
Simplifying the right side, we have, for every real $\theta$ where $M_Q(\theta) < \infty$:
$$D_{KL}(P \parallel Q) \ge \mu'_1(P)\,\theta - \Psi_Q(\theta),$$
where $\mu'_1(P)$ is the first moment, or mean, of $P$, and $\Psi_Q = \log M_Q$ is called the cumulant-generating function. Taking the supremum over $\theta$ completes the process of convex conjugation and yields the rate function:
$$D_{KL}(P \parallel Q) \ge \sup_\theta \left\{\mu'_1(P)\,\theta - \Psi_Q(\theta)\right\} = \Psi_Q^*(\mu'_1(P)),$$
which proves the inequality.
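The defining supremum can also be evaluated numerically. A minimal sketch, assuming $Q$ is the exponential distribution with rate 1 (so $\Psi_Q(t) = -\log(1-t)$ for $t < 1$) and $P$ an exponential distribution with rate $\lambda$; these distributional choices are illustrative, not from the text:

import numpy as np
from scipy.optimize import minimize_scalar

def psi_Q(t):                          # cumulant-generating function of Exp(1)
    return -np.log(1.0 - t) if t < 1.0 else np.inf

def rate_function(mu):                 # Psi_Q*(mu) = sup_t { mu*t - Psi_Q(t) }
    res = minimize_scalar(lambda t: -(mu * t - psi_Q(t)),
                          bounds=(-50.0, 0.999), method="bounded")
    return -res.fun

# P = Exp(lam) has mean 1/lam and D_KL(P || Q) = log(lam) + 1/lam - 1.
for lam in [0.5, 1.0, 3.0]:
    mu = 1.0 / lam
    kl = np.log(lam) + 1.0 / lam - 1.0
    print(f"lam={lam}:  D_KL = {kl:.4f}  >=  Psi_Q*(mu) = {rate_function(mu):.4f}")

Here the bound is attained with equality because each such $P$ belongs to the natural exponential family generated by $Q$.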
To derive the Cramér–Rao bound, let $X_\theta$ be a family of probability distributions on the real line indexed by the real parameter $\theta$ and satisfying certain regularity conditions. Applying Kullback's inequality with $P = X_{\theta+h}$ and $Q = X_\theta$ gives
$$\lim_{h\to 0} \frac{D_{KL}(X_{\theta+h} \parallel X_\theta)}{h^2} \ge \lim_{h\to 0} \frac{\Psi_\theta^*(\mu_{\theta+h})}{h^2},$$
where $\Psi_\theta^*$ is the convex conjugate of the cumulant-generating function of $X_\theta$ and $\mu_{\theta+h}$ is the first moment of $X_{\theta+h}$.
The left side of this inequality can be simplified as follows:
$$\begin{aligned}
\lim_{h\to 0} \frac{D_{KL}(X_{\theta+h} \parallel X_\theta)}{h^2}
&= \lim_{h\to 0} \frac{1}{h^2} \int_{-\infty}^{\infty} \log\left(\frac{dX_{\theta+h}}{dX_\theta}\right) dX_{\theta+h} \\
&= \lim_{h\to 0} \frac{1}{h^2} \int_{-\infty}^{\infty} -\log\left(1 - \left(1 - \frac{dX_\theta}{dX_{\theta+h}}\right)\right) dX_{\theta+h} \\
&= \lim_{h\to 0} \frac{1}{h^2} \int_{-\infty}^{\infty} \left[\left(1 - \frac{dX_\theta}{dX_{\theta+h}}\right) + \frac{1}{2}\left(1 - \frac{dX_\theta}{dX_{\theta+h}}\right)^2 + o\!\left(\left(1 - \frac{dX_\theta}{dX_{\theta+h}}\right)^2\right)\right] dX_{\theta+h} \quad \text{(Taylor expansion of }-\log(1-t)\text{)} \\
&= \lim_{h\to 0} \frac{1}{h^2} \int_{-\infty}^{\infty} \left[\frac{1}{2}\left(1 - \frac{dX_\theta}{dX_{\theta+h}}\right)^2\right] dX_{\theta+h} \\
&= \lim_{h\to 0} \frac{1}{h^2} \int_{-\infty}^{\infty} \left[\frac{1}{2}\left(\frac{dX_{\theta+h} - dX_\theta}{dX_{\theta+h}}\right)^2\right] dX_{\theta+h}
= \frac{1}{2}\,\mathcal{I}_X(\theta),
\end{aligned}$$
which is half the Fisher information of the parameter $\theta$. (The first-order term vanishes because both $X_\theta$ and $X_{\theta+h}$ are probability measures, so $\int \left(1 - \tfrac{dX_\theta}{dX_{\theta+h}}\right) dX_{\theta+h} = 1 - 1 = 0$.)
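This limit can be checked numerically for a concrete family. A minimal sketch, assuming $X_\theta$ is the exponential distribution with rate $\theta$ (an illustrative choice with closed forms for both the divergence and the Fisher information):

import numpy as np

# Left-side limit for the illustrative family X_theta = Exponential(rate=theta):
# D_KL(X_{theta+h} || X_theta) = log((theta+h)/theta) + theta/(theta+h) - 1
# and the Fisher information is I(theta) = 1/theta**2.
theta = 2.0
fisher = 1.0 / theta**2

for h in [1e-1, 1e-2, 1e-3]:
    kl = np.log((theta + h) / theta) + theta / (theta + h) - 1.0
    print(f"h={h:.0e}:  KL/h^2 = {kl / h**2:.6f}   (I(theta)/2 = {fisher / 2:.6f})")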
The right side of the inequality can be developed as follows:
$$\lim_{h\to 0} \frac{\Psi_\theta^*(\mu_{\theta+h})}{h^2} = \lim_{h\to 0} \frac{1}{h^2} \sup_t \left\{\mu_{\theta+h}\,t - \Psi_\theta(t)\right\}.$$
This supremum is attained at a value $t = \tau$ where the first derivative of the cumulant-generating function satisfies $\Psi_\theta'(\tau) = \mu_{\theta+h}$, but we have $\Psi_\theta'(0) = \mu_\theta$, so that
$$\Psi_\theta''(0) = \frac{d\mu_\theta}{d\theta}\,\lim_{h\to 0} \frac{h}{\tau}.$$
Moreover,
$$\lim_{h\to 0} \frac{\Psi_\theta^*(\mu_{\theta+h})}{h^2} = \frac{1}{2\,\Psi_\theta''(0)} \left(\frac{d\mu_\theta}{d\theta}\right)^2 = \frac{1}{2\operatorname{Var}(X_\theta)} \left(\frac{d\mu_\theta}{d\theta}\right)^2,$$
since $\Psi_\theta''(0)$, the second cumulant of $X_\theta$, is its variance.
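This limit, too, can be checked numerically; the sketch below reuses the illustrative exponential-rate family from the previous snippet, for which the rate function has the closed form $\Psi_\theta^*(x) = x\theta - 1 - \log(x\theta)$:

import numpy as np

# Right-side limit for the illustrative family X_theta = Exponential(rate=theta):
# mean mu_theta = 1/theta, variance Var(X_theta) = 1/theta**2.
theta = 2.0
dmu_dtheta = -1.0 / theta**2
var = 1.0 / theta**2
target = dmu_dtheta**2 / (2.0 * var)           # (d mu/d theta)^2 / (2 Var)

def rate(x):                                   # Psi_theta*(x) for this family
    return x * theta - 1.0 - np.log(x * theta)

for h in [1e-1, 1e-2, 1e-3]:
    mu_h = 1.0 / (theta + h)                   # first moment of X_{theta+h}
    print(f"h={h:.0e}:  Psi*/h^2 = {rate(mu_h) / h**2:.6f}   (target {target:.6f})")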
Combining the two limits, we have
$$\frac{1}{2}\,\mathcal{I}_X(\theta) \ge \frac{1}{2\operatorname{Var}(X_\theta)} \left(\frac{d\mu_\theta}{d\theta}\right)^2,$$
which can be rearranged as:
$$\operatorname{Var}(X_\theta) \ge \frac{(d\mu_\theta/d\theta)^2}{\mathcal{I}_X(\theta)},$$
which is the Cramér–Rao bound.
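As a final sketch, the bound can be verified numerically for a family in which it is strict, here a Student-t location family (an illustrative assumption; the Fisher information is computed by numerical integration rather than quoted from a formula):

import numpy as np
from scipy import integrate, stats

# Location family X_theta = theta + T with T Student-t (nu degrees of freedom).
# Then mu_theta = theta, so d mu_theta / d theta = 1 and the bound reads
# Var(X_theta) >= 1 / I(theta).
nu = 5.0
var = stats.t(df=nu).var()                      # Var(X_theta) = nu / (nu - 2)

# Fisher information of the location parameter: I = integral of (f'(x))^2 / f(x),
# with the derivative of the density taken by central differences.
f = stats.t(df=nu).pdf
eps = 1e-5
integrand = lambda x: ((f(x + eps) - f(x - eps)) / (2 * eps))**2 / f(x)
fisher, _ = integrate.quad(integrand, -50, 50)

print(f"Var(X_theta) = {var:.4f}  >=  1/I(theta) = {1.0 / fisher:.4f}")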