The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. From a mathematical standpoint, Rand index is related to the accuracy, but is applicable even when class labels are not used.
Given a set of
n
elements
S
=
{
o
1
,
…
,
o
n
}
and two partitions of
S
to compare,
X
=
{
X
1
,
…
,
X
r
}
, a partition of R into r subsets, and
Y
=
{
Y
1
,
…
,
Y
s
}
, a partition of S into s subsets, define the following:
a
, the number of pairs of elements in
S
that are in the same subset in
X
and in the same subset in
Y
b
, the number of pairs of elements in
S
that are in different subsets in
X
and in different subsets in
Y
c
, the number of pairs of elements in
S
that are in the same subset in
X
and in different subsets in
Y
d
, the number of pairs of elements in
S
that are in different subsets in
X
and in the same subset in
Y
The Rand index,
R
, is:
R
=
a
+
b
a
+
b
+
c
+
d
=
a
+
b
(
n
2
)
Intuitively,
a
+
b
can be considered as the number of agreements between
X
and
Y
and
c
+
d
as the number of disagreements between
X
and
Y
.
Since the denominator is the total number of pairs, the Rand index represents the frequency of occurrence of agreements over the total pairs, or the probability that
X
and
Y
will agree on a randomly chosen pair.
The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.
In mathematical terms, a, b, c, d are defined as follows:
a
=
|
S
∗
|
, where
S
∗
=
{
(
o
i
,
o
j
)
|
o
i
,
o
j
∈
X
k
,
o
i
,
o
j
∈
Y
l
}
b
=
|
S
∗
|
, where
S
∗
=
{
(
o
i
,
o
j
)
|
o
i
∈
X
k
1
,
o
j
∈
X
k
2
,
o
i
∈
Y
l
1
,
o
j
∈
Y
l
2
}
c
=
|
S
∗
|
, where
S
∗
=
{
(
o
i
,
o
j
)
|
o
i
,
o
j
∈
X
k
,
o
i
∈
Y
l
1
,
o
j
∈
Y
l
2
}
d
=
|
S
∗
|
, where
S
∗
=
{
(
o
i
,
o
j
)
|
o
i
∈
X
k
1
,
o
j
∈
X
k
2
,
o
i
,
o
j
∈
Y
l
}
for some
1
≤
i
,
j
≤
n
,
i
≠
j
,
1
≤
k
,
k
1
,
k
2
≤
r
,
k
1
≠
k
2
,
1
≤
l
,
l
1
,
l
2
≤
s
,
l
1
≠
l
2
Adjusted Rand index
The adjusted Rand index is the corrected-for-chance version of the Rand index. Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.
Given a set
S
of
n
elements, and two groupings or partitions (e.g. clusterings) of these points, namely
X
=
{
X
1
,
X
2
,
…
,
X
r
}
and
Y
=
{
Y
1
,
Y
2
,
…
,
Y
s
}
, the overlap between
X
and
Y
can be summarized in a contingency table
[
n
i
j
]
where each entry
n
i
j
denotes the number of objects in common between
X
i
and
Y
j
:
n
i
j
=
|
X
i
∩
Y
j
|
.
The adjusted form of the Rand Index, the Adjusted Rand Index, is
A
d
j
u
s
t
e
d
I
n
d
e
x
=
I
n
d
e
x
−
E
x
p
e
c
t
e
d
I
n
d
e
x
M
a
x
I
n
d
e
x
−
E
x
p
e
c
t
e
d
I
n
d
e
x
, more specifically
A
R
I
=
∑
i
j
(
n
i
j
2
)
−
[
∑
i
(
a
i
2
)
∑
j
(
b
j
2
)
]
/
(
n
2
)
1
2
[
∑
i
(
a
i
2
)
+
∑
j
(
b
j
2
)
]
−
[
∑
i
(
a
i
2
)
∑
j
(
b
j
2
)
]
/
(
n
2
)
where
n
i
j
,
a
i
,
b
j
are values from the contingency table.