The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. From a mathematical standpoint, Rand index is related to the accuracy, but is applicable even when class labels are not used.
Given a set of n elements S = { o 1 , … , o n } and two partitions of S to compare, X = { X 1 , … , X r } , a partition of R into r subsets, and Y = { Y 1 , … , Y s } , a partition of S into s subsets, define the following:
a , the number of pairs of elements in S that are in the same subset in X and in the same subset in Y b , the number of pairs of elements in S that are in different subsets in X and in different subsets in Y c , the number of pairs of elements in S that are in the same subset in X and in different subsets in Y d , the number of pairs of elements in S that are in different subsets in X and in the same subset in Y The Rand index, R , is:
R = a + b a + b + c + d = a + b ( n 2 ) Intuitively, a + b can be considered as the number of agreements between X and Y and c + d as the number of disagreements between X and Y .
Since the denominator is the total number of pairs, the Rand index represents the frequency of occurrence of agreements over the total pairs, or the probability that X and Y will agree on a randomly chosen pair.
The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.
In mathematical terms, a, b, c, d are defined as follows:
a = | S ∗ | , where S ∗ = { ( o i , o j ) | o i , o j ∈ X k , o i , o j ∈ Y l } b = | S ∗ | , where S ∗ = { ( o i , o j ) | o i ∈ X k 1 , o j ∈ X k 2 , o i ∈ Y l 1 , o j ∈ Y l 2 } c = | S ∗ | , where S ∗ = { ( o i , o j ) | o i , o j ∈ X k , o i ∈ Y l 1 , o j ∈ Y l 2 } d = | S ∗ | , where S ∗ = { ( o i , o j ) | o i ∈ X k 1 , o j ∈ X k 2 , o i , o j ∈ Y l } for some 1 ≤ i , j ≤ n , i ≠ j , 1 ≤ k , k 1 , k 2 ≤ r , k 1 ≠ k 2 , 1 ≤ l , l 1 , l 2 ≤ s , l 1 ≠ l 2
Adjusted Rand index
The adjusted Rand index is the corrected-for-chance version of the Rand index. Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.
Given a set S of n elements, and two groupings or partitions (e.g. clusterings) of these points, namely X = { X 1 , X 2 , … , X r } and Y = { Y 1 , Y 2 , … , Y s } , the overlap between X and Y can be summarized in a contingency table [ n i j ] where each entry n i j denotes the number of objects in common between X i and Y j : n i j = | X i ∩ Y j | .
The adjusted form of the Rand Index, the Adjusted Rand Index, is A d j u s t e d I n d e x = I n d e x − E x p e c t e d I n d e x M a x I n d e x − E x p e c t e d I n d e x , more specifically
A R I = ∑ i j ( n i j 2 ) − [ ∑ i ( a i 2 ) ∑ j ( b j 2 ) ] / ( n 2 ) 1 2 [ ∑ i ( a i 2 ) + ∑ j ( b j 2 ) ] − [ ∑ i ( a i 2 ) ∑ j ( b j 2 ) ] / ( n 2 )
where n i j , a i , b j are values from the contingency table.