In probability theory and statistics, the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad) or total divergence to the average. It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and it is always a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen–Shannon distance.
Consider the set $M_+^1(A)$ of probability distributions, where $A$ is a set provided with some σ-algebra of measurable subsets. In particular we can take $A$ to be a finite or countable set with all subsets being measurable.
The Jensen–Shannon divergence (JSD) $M_+^1(A) \times M_+^1(A) \to [0, \infty)$ is a symmetrized and smoothed version of the Kullback–Leibler divergence $D(P \parallel Q)$. It is defined by
$$\mathrm{JSD}(P \parallel Q) = \frac{1}{2} D(P \parallel M) + \frac{1}{2} D(Q \parallel M)$$

where $M = \frac{1}{2}(P + Q)$.
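As a concrete illustration, here is a minimal numerical sketch for discrete distributions, assuming base-2 logarithms; the helper names `kl_divergence` and `jsd` are hypothetical, not taken from any particular library:

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p == 0 contribute zero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logarithms, so 0 <= jsd(p, q) <= 1."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Example: two distributions over a three-element set.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.1, 0.9]
print(jsd(p, q))   # approximately 0.80, between 0 and 1
```

Note that the mixture $M$ is strictly positive wherever $P$ or $Q$ is, so both Kullback–Leibler terms are finite even when $P$ and $Q$ have disjoint supports.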
A more general definition, allowing for the comparison of more than two probability distributions, is:
$$\mathrm{JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n) = H\!\left(\sum_{i=1}^{n} \pi_i P_i\right) - \sum_{i=1}^{n} \pi_i H(P_i)$$
where $\pi_1, \ldots, \pi_n$ are weights that are selected for the probability distributions $P_1, P_2, \ldots, P_n$, and $H(P)$ is the Shannon entropy for distribution $P$. For the two-distribution case described above, $P_1 = P$, $P_2 = Q$, and $\pi_1 = \pi_2 = \frac{1}{2}$.
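A sketch of this generalized, entropy-based form for discrete distributions might look as follows (again with hypothetical helper names and base-2 logarithms):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(p) in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def generalized_jsd(distributions, weights):
    """H(sum_i pi_i P_i) - sum_i pi_i H(P_i) for discrete distributions."""
    distributions = np.asarray(distributions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mixture = weights @ distributions   # weighted mixture distribution
    return shannon_entropy(mixture) - np.sum(
        [w * shannon_entropy(d) for w, d in zip(weights, distributions)])

# Reduces to the two-distribution JSD when n = 2 and both weights are 1/2.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.1, 0.9]
print(generalized_jsd([p, q], [0.5, 0.5]))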
The Jensen–Shannon divergence is bounded by 1 for two probability distributions, given that one uses the base 2 logarithm.
$$0 \leq \mathrm{JSD}(P \parallel Q) \leq 1$$
For log base e, or ln, which is commonly used in statistical thermodynamics, the upper bound is ln(2):
$$0 \leq \mathrm{JSD}(P \parallel Q) \leq \ln(2)$$
More generally, the Jensen–Shannon divergence is bounded by $\log_2(n)$ for more than two probability distributions, given that one uses the base 2 logarithm:
$$0 \leq \mathrm{JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n) \leq \log_2(n)$$
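As a small sanity check of this bound, $n$ distributions with pairwise disjoint supports and uniform weights attain it exactly; the snippet below is a self-contained sketch of that case:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Each P_i puts all of its mass on a single, distinct outcome, so H(P_i) = 0
# and the uniform mixture has entropy log2(n); the generalized JSD therefore
# equals the bound exactly.
n = 4
distributions = np.eye(n)                 # P_i concentrated on outcome i
weights = np.full(n, 1.0 / n)
mixture = weights @ distributions         # uniform distribution over n outcomes
jsd_value = entropy_bits(mixture) - np.sum(
    [w * entropy_bits(d) for w, d in zip(weights, distributions)])
print(jsd_value, np.log2(n))              # both equal 2.0
```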
The Jensen–Shannon divergence is the mutual information between a random variable $X$ associated to a mixture distribution between $P$ and $Q$ and the binary indicator variable $Z$ that is used to switch between $P$ and $Q$ to produce the mixture. Let $X$ be some abstract function on the underlying set of events that discriminates well between events, and choose the value of $X$ according to $P$ if $Z = 0$ and according to $Q$ if $Z = 1$. That is, we are choosing $X$ according to the probability measure $M = (P + Q)/2$, and its distribution is the mixture distribution. We compute
$$\begin{aligned}
I(X; Z) &= H(X) - H(X \mid Z) \\
&= -\sum M \log M + \frac{1}{2}\left[\sum P \log P + \sum Q \log Q\right] \\
&= -\sum \frac{P}{2} \log M - \sum \frac{Q}{2} \log M + \frac{1}{2}\left[\sum P \log P + \sum Q \log Q\right] \\
&= \frac{1}{2} \sum P \left(\log P - \log M\right) + \frac{1}{2} \sum Q \left(\log Q - \log M\right) \\
&= \mathrm{JSD}(P \parallel Q)
\end{aligned}$$
It follows from the above result that the Jensen–Shannon divergence is bounded by 0 and 1 because mutual information is non-negative and bounded by $H(Z) = 1$. The JSD is not always bounded by 0 and 1: the upper limit of 1 arises here because we are considering the specific case involving the binary variable $Z$.
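The identity $I(X; Z) = \mathrm{JSD}(P \parallel Q)$ can be checked numerically by building the joint distribution of $(X, Z)$ explicitly; this is only a sketch, assuming base-2 logarithms and the hypothetical `jsd()` helper sketched earlier:

```python
import numpy as np

# Numerical check of I(X; Z) = JSD(P || Q) for discrete P and Q,
# with Z ~ Bernoulli(1/2) selecting which distribution X is drawn from.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.1, 0.9])

# Joint distribution of (Z, X): the row for Z = 0 carries P/2,
# the row for Z = 1 carries Q/2.
joint = np.vstack([p / 2, q / 2])
p_x = joint.sum(axis=0)      # marginal of X, i.e. the mixture M
p_z = joint.sum(axis=1)      # marginal of Z, i.e. (1/2, 1/2)

# I(X; Z) = sum over (z, x) of joint * log2(joint / (p_z * p_x)).
mask = joint > 0
mutual_info = np.sum(joint[mask] * np.log2(joint[mask] / np.outer(p_z, p_x)[mask]))
print(mutual_info)           # matches jsd(p, q) up to floating-point error
```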
One can apply the same principle to a joint distribution and the product of its two marginal distributions (in analogy to Kullback–Leibler divergence and mutual information) to measure how reliably one can decide whether a given response comes from the joint distribution or the product distribution, subject to the assumption that these are the only two possibilities.
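A minimal sketch of this idea, assuming a small discrete joint distribution given as a matrix and reusing the hypothetical `jsd()` helper from the earlier sketch:

```python
import numpy as np

# A discrete joint distribution over a pair of variables, as a matrix,
# and the product of its marginals over the same pairs of outcomes.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))

# JSD between the two, treating each (row, column) pair as one outcome:
# values near 0 mean samples from the joint are hard to tell apart from
# samples from the product of the marginals.
print(jsd(joint.ravel(), product.ravel()))
```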
Generalizing from probability distributions to density matrices makes it possible to define the quantum Jensen–Shannon divergence (QJSD). It is defined for a set of density matrices $(\rho_1, \ldots, \rho_n)$ and a probability distribution $\pi = (\pi_1, \ldots, \pi_n)$ as
$$\mathrm{QJSD}(\rho_1, \ldots, \rho_n) = S\!\left(\sum_{i=1}^{n} \pi_i \rho_i\right) - \sum_{i=1}^{n} \pi_i S(\rho_i)$$
where $S(\rho)$ is the von Neumann entropy of $\rho$. This quantity was introduced in quantum information theory, where it is called the Holevo information: it gives the upper bound for the amount of classical information encoded by the quantum states $(\rho_1, \ldots, \rho_n)$ under the prior distribution $\pi$ (see Holevo's theorem). The quantum Jensen–Shannon divergence for $\pi = \left(\frac{1}{2}, \frac{1}{2}\right)$ and two density matrices is a symmetric function, everywhere defined, bounded, and equal to zero only if the two density matrices are the same. It is the square of a metric for pure states, but it is unknown whether the metric property holds in general. The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric.
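A minimal numerical sketch of the two-state, equal-weight case, assuming density matrices represented as NumPy arrays; `von_neumann_entropy` and `qjsd` are hypothetical helper names:

```python
import numpy as np

def von_neumann_entropy(rho):
    """Von Neumann entropy S(rho) = -Tr(rho log2 rho), computed from eigenvalues."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]   # drop numerically-zero eigenvalues
    return -np.sum(eigvals * np.log2(eigvals))

def qjsd(rhos, weights):
    """Quantum JSD: S(sum_i pi_i rho_i) - sum_i pi_i S(rho_i)."""
    mixture = sum(w * r for w, r in zip(weights, rhos))
    return von_neumann_entropy(mixture) - sum(
        w * von_neumann_entropy(r) for w, r in zip(weights, rhos))

# Two pure qubit states, |0><0| and |+><+|, with equal prior weights.
rho_0 = np.array([[1.0, 0.0],
                  [0.0, 0.0]])
rho_plus = np.full((2, 2), 0.5)
print(qjsd([rho_0, rho_plus], [0.5, 0.5]))   # strictly between 0 and 1
```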
The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison, in protein surface comparison, in the social sciences, in the quantitative study of history, and in machine learning.