Clustering 2D Data
- Intro
- Generate Dataset
- Run K-Means Clustering
- Visualise the Decision Boundary
- Hyperparameter Tuning
- Cluster Evaluation
- Conclusion
Information
Primary software used | Python
Software version | 1.0 |
Course | Clustering 2D Data |
Primary subject | AI & ML |
Secondary subject | Machine Learning |
Level | Intermediate |
Last updated | November 19, 2024 |
Keywords |
Responsible
Teachers |
Faculty |
Clustering 2D Data
This is an example of clustering 2D data.
This tutorial will walk you through using a Machine Learning algorithm called K-means to group data points according to their similarities. This process, called clustering, is useful when dealing with unlabelled datasets, that is, datasets that lack a predefined target class.
This tutorial demonstrates clustering an artificially generated dataset in 2D. We will explore how the K-means algorithm works, visualise the clusters in the feature space, and uncover the algorithm’s limitations.
After this tutorial, you will understand how to apply the K-means algorithm to group data points in 2D and higher-dimensional feature spaces. You’ll be able to visualise clusters, assess the quality of the clustering process, and use hyperparameter tuning techniques.
Generate Dataset
We will use the sklearn library to generate a 2D point distribution. These points represent our input dataset.
We start by importing the libraries for Machine Learning and data visualisation
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt
We use the sklearn.make_blobs function to produce Gaussian-distributed point clusters. The centers argument sets the number of point clusters we want. We must also specify cluster_std, the standard deviation, which indicates how much the points are scattered within each cluster. The random_state argument is an integer seed that fixes the random generation of the dataset, so the same point distribution is produced on every run.
# generate Gaussian blobs for clustering
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)
Visualise the Dataset
We can use a scatter plot to visualise the point distribution we created on a 2D graph.
Try generating different datasets by changing the cluster_std parameter. For higher values, the points will start overlapping, making it harder to visually recognise distinct clusters.
# plot the generated dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], s=20, c='b')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("data distribution")
plt.show()
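As an illustration, the optional snippet below regenerates the dataset with a larger spread (the value cluster_std=4.0 is just an example choice) so you can see the clusters start to overlap.
# regenerate the dataset with a larger spread (illustrative value, not used in the rest of the tutorial)
X_wide, _ = make_blobs(n_samples=300, centers=3, cluster_std=4.0, random_state=42)
plt.figure(figsize=(8, 6))
plt.scatter(X_wide[:, 0], X_wide[:, 1], s=20, c='b')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Data distribution with cluster_std=4.0")
plt.show()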
Run K-Means Clustering
We will now use the K-Means implementation provided with the sklearn library.
We must make an initial guess about the number of clusters `n_clusters`. Since we have a dataset containing artificially generated point groups following a specific number of Gaussian distributions, we can set `n_clusters` to exactly correspond to the number of point groups.
The `random_state` argument controls the random initialisation of the cluster centroids, so repeated runs produce the same result.
After training K-Means on our data points `X`, we use the predict method to produce a vector `y_kmeans` containing an integer value for each point. This value is the cluster ID assigned to that point.
# apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
print(y_kmeans)
output = [1 1 0 2 1 2 0 2 0 0 0 2 0 0 1 0 1 2 0 0 0 0 2 1 0 1 1 2 2 0 0 0 1 0 1 0 1
2 1 2 2 0 1 2 0 0 1 2 1 2 2 1 1 0 1 2 1 0 2 0 1 2 2 1 1 2 2 1 1 0 2 1 1 0
0 1 1 2 0 2 0 0 1 0 2 1 1 0 2 0 1 0 1 0 0 1 1 0 1 1 2 0 2 0 0 0 0 0 2 1 2
0 0 0 0 2 1 2 1 2 2 2 0 1 1 1 1 0 1 1 0 0 0 0 0 2 2 1 0 1 0 0 1 0 2 2 2 0
2 0 0 1 2 1 0 2 2 1 1 0 0 1 1 1 0 1 2 0 0 0 0 0 2 0 2 2 2 0 2 2 1 0 1 2 2
1 2 0 2 2 1 1 2 1 2 2 2 2 0 1 0 0 2 2 0 2 1 1 2 0 0 1 2 2 1 1 1 1 0 1 1 2
1 1 0 2 1 1 2 0 0 1 0 1 2 2 1 2 1 1 1 2 2 0 1 2 2 2 1 2 1 2 1 2 2 1 2 0 1
0 0 0 1 0 2 2 1 2 2 0 0 2 2 2 1 1 1 0 0 0 2 2 2 2 1 2 1 2 2 1 0 2 2 0 1 0
2 0 1 1]
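If you want a quick summary of this vector, one optional check (not part of the original steps) is to count how many points were assigned to each cluster ID.
# count how many points were assigned to each cluster ID (0, 1 and 2)
print(np.bincount(y_kmeans))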
Plot Clustering Results
We now have two vectors:
- X: the dataset samples
- y_kmeans: the cluster IDs predicted by the model for the dataset samples
We can plot our dataset distribution by assigning to each data point a colour corresponding to one of the three cluster IDs predicted by the K-Means algorithm.
This graph demonstrates that K-Means successfully located the centres of the three distributions characterizing our artificial dataset.
# plot the dataset with the cluster IDs
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='viridis')
plt.title("K-Means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
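To make the claim that K-Means located the centres concrete, you can also print the fitted centroids and overlay them on the plot. scikit-learn exposes them through the cluster_centers_ attribute; the overlay below is an optional sketch, not part of the original tutorial steps.
# print the fitted centroid coordinates
print(kmeans.cluster_centers_)
# optional: overlay the centroids (red crosses) on the clustered data
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("K-Means Clustering with Centroids")
plt.show()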
Note: for a simple 2D problem like this, we just need to visualise the data distribution to check if K-means clustering was successful. As we will see later, numerical techniques for 2D and higher-dimensional datasets allow us to check the cluster quality and even make an initial guess about the number of clusters K-means should look for.
Visualise the Decision Boundary
As with any other Machine Learning method, K-means can make predictions for points that are not part of the original dataset. The algorithm checks the Euclidean distance between any new data point and each cluster’s centre, and assigns a cluster ID based on point proximity.
Here we construct a grid of points and use it as input data. The cluster IDs predicted by K-means are used to visualise the decision boundary, which separates the feature space into distinct regions, each containing points with the same cluster ID.
# generate input data points to plot the decision boundary
xx_range = np.arange(-15, 15, 0.02)
yy_range = np.arange(-15, 15, 0.02)
xx, yy = np.meshgrid(xx_range,yy_range)
# predict on each point
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# plot the decision boundary and the data points
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, cmap='viridis', alpha=0.3)
# plot the data points and color them by their cluster
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', s=20)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("K-means Clustering with Decision Boundaries")
plt.show()
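If you want to verify the Euclidean-distance rule described above, a small optional check is to compute the distances from a new point to each centroid manually and compare the result with predict. The point used here is an arbitrary example.
# manually assign an arbitrary new point to its nearest centroid and compare with predict
new_point = np.array([[0.0, 0.0]])
distances = np.linalg.norm(kmeans.cluster_centers_ - new_point, axis=1)
print("distances to the centroids:", distances)
print("manual assignment (closest centroid):", np.argmin(distances))
print("predict assignment:", kmeans.predict(new_point)[0])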
Hyperparameter Tuning
A key question when using clustering algorithms is: ‘was the initial guess about the number of clusters correct?’ In a 2D feature space, one can visually inspect the clusters and determine if clusters capture the shape of the data distribution. However, as we will see later, this is impossible in the general high-dimensional case.
We will use a numerical technique called the Elbow Method to see how the Within-Cluster Sum of Squares, also called inertia, decreases as the number of clusters increases.
Apply the Elbow Method
We will perform clustering on our dataset several times, each time with a different value of n_clusters, and compute the inertia. We then plot the inertia against the number of clusters and identify the elbow point, which is the point where the decrease in inertia starts to slow down. This point often indicates a good initial choice for n_clusters.
Here the elbow point occurs for n_clusters equal to 3. This is expected, as our dataset consists of exactly three groups of points (see Section ‘Generate Dataset’).
# Calculate the within-cluster sum of squares (inertia) for a range of k values
inertia = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
# Plot the Elbow Method graph
plt.figure(figsize=(8, 5))
plt.plot(k_values, inertia, 'bo-', linewidth=2)
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Sum of Squares (inertia)')
plt.title('Elbow Method for Optimal n_clusters')
plt.xticks(k_values)
plt.show()
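To make “the decrease in inertia starts to slow down” more concrete, you can optionally print the drop in inertia between successive values of n_clusters; the largest drops typically occur before the elbow. This is an extra check, not part of the original tutorial steps.
# print the decrease in inertia between successive numbers of clusters
drops = np.diff(inertia)
for k, drop in zip(k_values[1:], drops):
    print(f"going from {k - 1} to {k} clusters reduces inertia by {-drop:.1f}")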
Cluster Evaluation
To further validate the clustering process, we may want to numerically assess the quality of our clusters.
We will use a metric called the Silhouette Score. It measures cluster quality by comparing how similar each point is to the points in its own cluster versus the points in the nearest other cluster. The score ranges from -1 to 1, with values close to 1 indicating well-separated clusters.
In this case, we have a silhouette score of approximately 0.7, which is larger than 0.5 and thus indicates well-formed clusters.
# Calculate the silhouette score
silhouette_avg = silhouette_score(X, y_kmeans)
print(silhouette_avg)
Output = 0.6996309397540692
To understand the effects of hyperparameter choices on the cluster quality and performance indicators, run K-means on a different dataset. Try generating different point distributions by changing the parameters of make_blobs or changing the n_clusters variable.
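As a starting point for such experiments, the optional sketch below computes the silhouette score for several values of n_clusters on the current dataset, so you can compare the outcome with the Elbow Method.
# compare silhouette scores for different numbers of clusters (at least 2 clusters are required)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"n_clusters={k}: silhouette score = {score:.3f}")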
Conclusion
In this tutorial, you learned about clustering with the K-means algorithm. You learned how to apply K-means to cluster an artificially generated dataset and visualise clusters and decision boundaries in 2D. Additionally, you applied methods to evaluate cluster quality, like the Elbow Method and Silhouette Score, which help identify the optimal number of clusters.
Exercise file
Here you can find the full Python code, which you can open in Google Colab or a similar environment.