![]() | ||
In multivariate statistics and the clustering of data, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
Contents
In application to image segmentation, spectral clustering is known as segmentation-based object categorization.
Algorithms
Given an enumerated set of data points, the similarity matrix may be defined as a symmetric matrix
One spectral clustering technique is the normalized cuts algorithm or Shi–Malik algorithm introduced by Jianbo Shi and Jitendra Malik, commonly used for image segmentation. It partitions points into two sets
where
A mathematically equivalent algorithm takes the eigenvector corresponding to the largest eigenvalue of the random walk normalized Laplacian matrix
Another possibility is to use the Laplacian matrix defined as
rather than the symmetric normalized Laplacian matrix.
Partitioning may be done in various ways, such as by computing the median
If the similarity matrix
For large-sized graphs, the second eigenvalue of the (normalized) graph Laplacian matrix is often ill-conditioned, leading to slow convergence of iterative eigenvalue solvers. Preconditioning is a key technology accelerating the convergence, e.g., in the matrix-free LOBPCG method. Spectral clustering has been successfully applied on large graphs by first identifying their community structure, and then clustering communities.
Spectral clustering is closely related to Nonlinear dimensionality reduction, and dimension reduction techniques such as locally-linear embedding can be used to reduce errors from noise or outliers.
Free software to implement spectral clustering is available in large open source projects like Scikit-learn, MLlib for pseudo-eigenvector clustering using the power iteration method, and R.
Relationship with k-means
The kernel k-means problem is an extension of the k-means problem where the input data points are mapped non-linearly into a higher-dimensional feature space via a kernel function
Suppose
such that,
such that
where
This problem can be recast as,
This problem is equivalent to the spectral clustering problem when the identity constraints on
Measures to compare clusterings
Ravi Kannan, Santosh Vempala and Adrian Vetta in the following paper proposed a bicriteria measure to define the quality of a given clustering. They said that a clustering was an (α, ε)-clustering if the conductance of each cluster(in the clustering) was at least α and the weight of the inter-cluster edges was at most ε fraction of the total weight of all the edges in the graph. They also look at two approximation algorithms in the very same paper.