Spectral Clustering

  • Intro
  • Overview
  • Training a Model
  • Applying a Transformation
  • Conclusions

Information

Primary software used Jupyter Notebook
Course Spectral Clustering
Primary subject AI & ML
Secondary subject Machine Learning
Level Intermediate
Last updated November 27, 2024
Keywords

Spectral Clustering 0/4

Spectral Clustering link copied

Spectral clustering is a technique that uses the eigenvalues of a similarity matrix derived from the data to perform dimensionality reduction before applying a standard clustering algorithm like K-Means. It is particularly effective for identifying clusters in non-convex or complex-shaped datasets by exploiting the connectivity structure of the data in a graph-based representation.

For this tutorial you need to have installed Python, Jupyter notebooks, and some common libraries including Scikit Learn. Please see the following tutorial for more information.

Download the Jupyter notebook here to follow along with the tutorial.

Download SpectralClustering_PYscript_01
application/zip (ZIP, 189 KB)

Spectral Clustering 1/4

Overview link copied

Spectral Clustering uses the eigenvalues of the similarity matrix of data to reduce their dimensionality and thus cluster them in a low dimensional manifold. It relates to graph theory and the concept of identifying groups of nodes based on the edges connecting them. Spectral Clustering is useful when the structure of a cluster is highly non-convex, or when a measure of the center and spread of a cluster does not describe a complete cluster. This happens, for example, when clusters are nested circles on a 2D plane.

Spectral Clustering 2/4

Training a Model link copied

Dataset

For this notebook, we will use a generated dataset in the shape of two intersecting half moons. We can generate and visualize the dataset below.

from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=0.05)

import matplotlib.pyplot as plt
# plot
plt.scatter(
   X[:, 0], X[:, 1],
   c='white', marker='o',
   edgecolor='black', s=50
)
plt.show()
A plot of the data set
A plot of the data set

K-Means

We saw in the previous notebook that K-Means works well for data sets that have a cluster structure that is spherical like the example data set used in that notebook. Now, let’s take a look at a moon-shaped dataset. K-means cannot successfully cluster data in these configurations (even though visually we can see two clusters) because it uses a distance matrix to group points to a centroid. Below we will see what happens when we try to apply K-Means.

from sklearn.cluster import KMeans

clusters = 2

km = KMeans(
    n_clusters=clusters, init='random',
    n_init=16, max_iter=300, 
    tol=1e-04, random_state=0
)
y_km = km.fit_predict(X)
# plot the 3 clusters
colors = ['lightgreen', 'orange', 'lightblue', 'lightcoral', 'gold', 'plum', 'midnightblue', ' indigo', 'sandybrown']
markers = ['s', 'H', 'v', 'p', '^', 'o', 'X', 'd', 'P' ]

for i in range(clusters):
    plt.scatter(
        X[y_km == i, 0], X[y_km == i, 1],
        s=50, c=colors[i],
        marker=markers[i], edgecolor='black',
        label='cluster ' + str(i)
    )

# plot the centroids
plt.scatter(
    km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    s=250, marker='*',
    c='red', edgecolor='black',
    label='centroids'
)
plt.legend(scatterpoints=1)
plt.grid()
plt.show()
A plot showing how the data set is clustered using K-Means clustering.
A plot showing how the data set is clustered using K-Means clustering.

Since we know that k-means does not work, we need another clustering option. This is where spectral clustering comes in. For this example, we will use the generated dataset with the two intersecting moon shapes we just tried to cluster above.

Train a classifier using the SpectralClustering method and visualize it with the half moon dataset.

from sklearn.cluster import SpectralClustering 
from sklearn.preprocessing import StandardScaler, normalize 
from sklearn.decomposition import PCA 
import pandas as pd
# Building the clustering model 
spectral_model_rbf_2 = SpectralClustering(n_clusters = 2, affinity ='nearest_neighbors') 
  
# Training the model and Storing the predicted cluster labels 
labels_rbf_2 = spectral_model_rbf_2.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1],  
           c = labels_rbf_2, edgecolor='white', cmap =plt.cm.winter) 
plt.grid()
plt.show()
A plot showing the clusters when using spectral clustering.
A plot showing the clusters when using spectral clustering.

Spectral Clustering 3/4

Applying a Transformation link copied

Dataset

It is also possible to find other methods to change the dimensionality of the data. First, let’s take a look at a dataset that consists of an inner circle and an outer ring. Spectral Clustering is able to classify the data.

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

#Generate data set
X_1, y_1 = make_circles(n_samples=(300, 100), shuffle=True, noise=0.05, factor=0.1, random_state=3)

# Plot data set
plt.scatter(
   X_1[:, 0], X_1[:, 1],
   c='black', marker='o',
   edgecolor='white', s=50
)
plt.grid()
plt.show()
A plot of the data set
A plot of the data set

Clustering with Spectral Clustering

We can first see the clusters with spectral clustering.

# Building the clustering model 
spectral_model_nn = SpectralClustering(n_clusters = 2, affinity ='nearest_neighbors') 
  
# Training the model and Storing the predicted cluster labels 
labels_nn = spectral_model_nn.fit_predict(X_1)

plt.scatter(X_1[:, 0], X_1[:, 1],  
           c = labels_nn, edgecolor='white', cmap =plt.cm.winter)
plt.grid()
plt.show()
A plot showing the two clusters of data when using spectral clustering
A plot showing the two clusters of data when using spectral clustering

Clustering by Transforming the Data

Alternatively, we can apply a transformation to the data set ourselves to reduce the dimensionality. For the dataset above, we can find the distance to the center (0,0) in this case. This transforms the dataset to one-dimensional space and the data becomes linearly separable. We can then apply K-Means to get the clusters.

import numpy as np
import copy

#Distance to 0,0
X_1_np = np.array(X_1) #points in dataset
dist = (X_1_np[:, 0] ** 2 + X_1_np[:, 1] ** 2) ** 0.5 #get distance to zero
dist_coord = np.array(list(zip(dist, np.zeros(dist.shape))))

#Plot
plt.scatter(dist_coord[:,0], dist_coord[:,1],  
           c = 'black', edgecolor='white', cmap =plt.cm.winter)
plt.grid()
plt.show()
A plot showing the transformed data: plotting the distance to the center
A plot showing the transformed data: plotting the distance to the center
from sklearn.cluster import KMeans

#K-Means with Visualization
km = KMeans(
    n_clusters=2, init='random',
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
    )

y_km = km.fit_predict(dist_coord)


plt.scatter(dist_coord[y_km == 0,0], dist_coord[y_km == 0, 1], 
           label = 'cluster 1', edgecolor='white', cmap =plt.cm.winter) 
plt.scatter(dist_coord[y_km == 1,0], dist_coord[y_km == 1, 1],  
           label = 'cluster 2', edgecolor='white', cmap =plt.cm.winter) 
plt.legend(scatterpoints=1)
plt.grid()
plt.show()
The two clusters of the transformed data
The two clusters of the transformed data
#Apply labels determined by K-Means back to the origional data set

plt.scatter(X_1[y_km == 0,0], X_1[y_km == 0, 1], 
           label = 'cluster 1', edgecolor='white', cmap =plt.cm.winter) 
plt.scatter(X_1[y_km == 1,0], X_1[y_km == 1, 1],  
           label = 'cluster 2', edgecolor='white', cmap =plt.cm.winter) 
plt.legend(scatterpoints=1)
plt.grid()
plt.show()
Assigning the cluster number back to the original data points. The assignment is based on the transformed data clusters.
Assigning the cluster number back to the original data points. The assignment is based on the transformed data clusters.
TASK: Choose two features from the wine dataset which are appropriate for clustering and apply spectral clustering on the data based on those two features.

Don’t forget to normalize the input data using a pipeline like we have done in previous notebooks. Check the SVM notebook for an example.

Spectral Clustering 4/4

Conclusions link copied

  • Spectral Clustering uses the eigenvalues of the similarity matrix of data to reduce the data’s dimensionality and then clusters it in fewer dimensions 
  • It draws inspiration from graph theory and the concept of identifying groups of nodes based on the edges connecting them
  • Spectral Clustering is useful when the structure of a cluster is highly non-convex or when a measure of the center and spread of a cluster does not describe a complete cluster
  • It is possible to use a similar process without implementing spectral clustering by transforming data to a lower dimension and then applying K-Means or Gaussian Mixtures on the linearly separable data