Kalpana Kalpana (Editor)

Rand index

Updated on
Edit
Like
Comment
Share on FacebookTweet on TwitterShare on LinkedInShare on Reddit

The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. From a mathematical standpoint, Rand index is related to the accuracy, but is applicable even when class labels are not used.

Contents

Definition

Given a set of n elements S = { o 1 , , o n } and two partitions of S to compare, X = { X 1 , , X r } , a partition of R into r subsets, and Y = { Y 1 , , Y s } , a partition of S into s subsets, define the following:

  • a , the number of pairs of elements in S that are in the same subset in X and in the same subset in Y
  • b , the number of pairs of elements in S that are in different subsets in X and in different subsets in Y
  • c , the number of pairs of elements in S that are in the same subset in X and in different subsets in Y
  • d , the number of pairs of elements in S that are in different subsets in X and in the same subset in Y
  • The Rand index, R , is:

    R = a + b a + b + c + d = a + b ( n 2 )

    Intuitively, a + b can be considered as the number of agreements between X and Y and c + d as the number of disagreements between X and Y .

    Since the denominator is the total number of pairs, the Rand index represents the frequency of occurrence of agreements over the total pairs, or the probability that X and Y will agree on a randomly chosen pair.

    Properties

    The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.

    In mathematical terms, a, b, c, d are defined as follows:

  • a = | S | , where S = { ( o i , o j ) | o i , o j X k , o i , o j Y l }
  • b = | S | , where S = { ( o i , o j ) | o i X k 1 , o j X k 2 , o i Y l 1 , o j Y l 2 }
  • c = | S | , where S = { ( o i , o j ) | o i , o j X k , o i Y l 1 , o j Y l 2 }
  • d = | S | , where S = { ( o i , o j ) | o i X k 1 , o j X k 2 , o i , o j Y l }
  • for some 1 i , j n , i j , 1 k , k 1 , k 2 r , k 1 k 2 , 1 l , l 1 , l 2 s , l 1 l 2

    Adjusted Rand index

    The adjusted Rand index is the corrected-for-chance version of the Rand index. Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.

    The contingency table

    Given a set S of n elements, and two groupings or partitions (e.g. clusterings) of these points, namely X = { X 1 , X 2 , , X r } and Y = { Y 1 , Y 2 , , Y s } , the overlap between X and Y can be summarized in a contingency table [ n i j ] where each entry n i j denotes the number of objects in common between X i and Y j  : n i j = | X i Y j | .

    Definition

    The adjusted form of the Rand Index, the Adjusted Rand Index, is A d j u s t e d I n d e x = I n d e x E x p e c t e d I n d e x M a x I n d e x E x p e c t e d I n d e x , more specifically
    A R I = i j ( n i j 2 ) [ i ( a i 2 ) j ( b j 2 ) ] / ( n 2 ) 1 2 [ i ( a i 2 ) + j ( b j 2 ) ] [ i ( a i 2 ) j ( b j 2 ) ] / ( n 2 )
    where n i j , a i , b j are values from the contingency table.

    References

    Rand index Wikipedia