Recently I read a post on reddit/r/machinelearning that talks about intuition behind spectral clustering. I was interested, because all I have used are K-means and Meanshift clusterings before. Thus after writing out test codes in python, I decided to post about it.
K-means and spectral clusterings are some common clustering algorithms (that requires prior knowledge of the number of clusters in the data)
As a gentle reminder, K-means basically is a clustering method where we:
- First randomly select K number of points as the initial means / cluster centers.
- Then calculate distance from each of the data point to each of those centers
- Label the data points according to the closest center.
- Repeat step 2 – 4 for certain number of iterations or until the results converge.
Now K-means algorithm works really well most of the time, especially when the clusters look like the following figure.
However, what if the clusters look like two concentric rings? It’s easy for us to see that we can cluster them as outer and inner rings, but for K-means, most likely we will end up with 2 clusters that centers on left and right side of the data.
Here comes Spectral Clustering.
In a overly reductive way, spectral clustering in a nutshell basically poses this question: “What if we don’t run K-means on the dataset directly, but we find the cluster of the eigenvectors instead?”
Spectral clustering algorithm is as follows:
- First build an adjacency matrix that shows which vertices is ‘connected’ to which other vertices. The connectivity can be represented using euclidian distance, gaussian distance, nearest neighbors, etc. The adjacency matrix will be a symmetric matrix, with the diagonal values = 0.
- Then build a degree matrix out from the adjacency matrix. This matrix basically shows the number vertices that are connected to each vertex.
- Create a Laplacian matrix. Laplacian = Degree – Adjacency. Thus, now each column (or row, since it’s symmetric) shows to which other vertices is the current vertex connected to, and how many of other vertices are connected to it. Wikipedia has a fairly simple and clear example of Laplacian matrix
- Perform eigendecomposition on this Laplacian, to get the eigenvalues and eigenvectors
- The eigenvalues that are equal to 0, actually correspond to the connected components, so we obtain the first K eigenvalues that are 0 (or very close to 0), and also the corresponding eigenvectors
- Now, finally perform K-means clustering on those eigenvectors in order to cluster the dataset.
Below are two sample datasets that K-means algorithm is having hard time with, but could be solved easily using Spectral clustering
You can find sample python code that illustrates the difference between those two clusterings (and to certain extent, the algorithm of spectral clustering) over here: https://github.com/subokita/Sandbox/blob/master/spectral_clustering.py
Please note that the sample code requires at least scikit-learn, numpy, scipy, and matplotlib modules installed.