In statistical classification the Bayes classifier minimizes the probability of misclassification.
Suppose a pair \((X, Y)\) takes values in \(\mathbb{R}^d \times \{1, 2, \dots, K\}\), where \(Y\) is the class label of \(X\). This means that the conditional distribution of \(X\), given that the label \(Y\) takes the value \(r\), is given by
\[ X \mid Y = r \sim P_r \qquad \text{for } r = 1, 2, \dots, K, \]
where "\(\sim\)" means "is distributed as", and where \(P_r\) denotes a probability distribution.
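The generative setup above can be sketched concretely. The following is a minimal simulation of the pair \((X, Y)\) under an assumed model with \(K = 2\): the class label is drawn from illustrative priors, then \(X\) is drawn from the class-conditional distribution \(P_r\), here taken to be Gaussian. All numeric values are hypothetical, not from the text.

```python
import random

# Hypothetical example: K = 2 classes in R^1, with assumed class priors
# P(Y = r) and Gaussian class-conditional distributions P_r.
PRIORS = {1: 0.6, 2: 0.4}   # P(Y = 1), P(Y = 2)  (illustrative values)
MEANS = {1: 0.0, 2: 3.0}    # mean of P_r for each class r (illustrative)
SIGMA = 1.0                 # common standard deviation (illustrative)

def draw_pair(rng: random.Random) -> tuple[float, int]:
    """Draw (X, Y): first the label Y from the priors, then X ~ P_Y."""
    y = 1 if rng.random() < PRIORS[1] else 2
    x = rng.gauss(MEANS[y], SIGMA)
    return x, y

rng = random.Random(0)
sample = [draw_pair(rng) for _ in range(5)]
```

Drawing the label first and then the feature mirrors the factorization of the joint distribution into the prior \(P(Y = r)\) and the conditional \(P_r\).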
A classifier is a rule that assigns to an observation \(X = x\) a guess or estimate of what the unobserved label \(Y = r\) actually was. In theoretical terms, a classifier is a measurable function \(C : \mathbb{R}^d \to \{1, 2, \dots, K\}\), with the interpretation that \(C\) classifies the point \(x\) to the class \(C(x)\). The probability of misclassification, or risk, of a classifier \(C\) is defined as
\[ \mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}. \]
The Bayes classifier is
\[ C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r \mid X = x). \]
In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively, in this case the posterior probability \(\operatorname{P}(Y = r \mid X = x)\). The Bayes classifier is a useful benchmark in statistical classification.
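To make the definition concrete, here is a minimal sketch of the Bayes classifier under an assumed two-class model in which the priors \(\operatorname{P}(Y = r)\) and the class-conditional densities are known exactly (Gaussian, with illustrative parameters); the posterior is obtained from Bayes' theorem.

```python
import math

# Assumed two-class model (all parameters illustrative, not from the text).
PRIORS = {1: 0.6, 2: 0.4}   # P(Y = r)
MEANS = {1: 0.0, 2: 3.0}    # means of the Gaussian class conditionals
SIGMA = 1.0                 # common standard deviation

def density(x: float, r: int) -> float:
    """Class-conditional density of P_r, here N(MEANS[r], SIGMA^2)."""
    z = (x - MEANS[r]) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))

def posterior(r: int, x: float) -> float:
    """P(Y = r | X = x) via Bayes' theorem: prior times density, normalized."""
    joint = {s: PRIORS[s] * density(x, s) for s in PRIORS}
    return joint[r] / sum(joint.values())

def bayes_classifier(x: float) -> int:
    """argmax over r of P(Y = r | X = x)."""
    return max(PRIORS, key=lambda r: posterior(r, x))
```

With these parameters, points near 0 are assigned to class 1 and points near 3 to class 2; the decision boundary lies where the two prior-weighted densities cross. When the model is not known, only the posterior needs to be estimated, since the argmax ignores the normalizing constant.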
The excess risk of a general classifier \(C\) (possibly depending on some training data) is defined as
\[ \mathcal{R}(C) - \mathcal{R}(C^{\text{Bayes}}). \]
Since the Bayes classifier minimizes the risk over all classifiers, this quantity is non-negative, and it is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.
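As a hedged illustration of excess risk, the sketch below estimates \(\mathcal{R}(C)\) by Monte Carlo for two classifiers under an assumed two-class Gaussian model (priors 0.6/0.4, unit-variance Gaussians at means 0 and 3; all values hypothetical): the Bayes rule, which in this equal-variance case reduces to a threshold on \(x\), and a deliberately suboptimal threshold rule.

```python
import math
import random

# Assumed model (illustrative parameters, not from the text).
PRIORS = {1: 0.6, 2: 0.4}
MEANS = {1: 0.0, 2: 3.0}
SIGMA = 1.0

def bayes_classifier(x: float) -> int:
    # For equal-variance Gaussians the posterior argmax reduces to a
    # threshold: solving 0.6*N(x; 0, 1) = 0.4*N(x; 3, 1) gives
    # x = 1.5 + ln(1.5)/3.
    return 1 if x < 1.5 + math.log(1.5) / 3.0 else 2

def naive_classifier(x: float) -> int:
    # A deliberately suboptimal rule: split at the midpoint of the means,
    # ignoring the unequal priors.
    return 1 if x < 1.5 else 2

def estimate_risk(classify, n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of R(C) = P(C(X) != Y)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        y = 1 if rng.random() < PRIORS[1] else 2
        x = rng.gauss(MEANS[y], SIGMA)
        errors += classify(x) != y
    return errors / n

# Estimated excess risk of the naive rule; non-negative up to Monte Carlo noise.
excess = estimate_risk(naive_classifier) - estimate_risk(bayes_classifier)
```

Because both estimates use the same seed, they are evaluated on the same simulated sample, so the estimated excess risk isolates the cost of the suboptimal threshold rather than sampling noise.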