What is clustering?
Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different.
Clustering is considered an unsupervised learning method because, unlike in supervised learning, there are no ground-truth labels against which to compare the output of the clustering algorithm and evaluate its performance. The goal is only to investigate the structure of the data by grouping the data points into distinct subgroups.
K-Means Clustering Algorithm: Applications, Types, Demos and Use Cases
Every machine learning engineer wants to achieve accurate predictions with their algorithms. Such learning algorithms are generally broken down into two types: supervised and unsupervised. K-means clustering is one of the unsupervised algorithms, where the available input data does not have a labeled response.
Before diving further into the concepts of clustering, let us check out the topics to be covered in this article:
- Types of clustering
- What is k-means clustering?
- Applications of k-means clustering
- Common distance measure
- How does k-means clustering work?
- K-Means clustering algorithm
- Demo: k-means clustering
- Use Case: color compression
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.
The various types of clustering are:
- Hierarchical clustering
- Partitioning clustering
Hierarchical clustering is further subdivided into:
- Agglomerative clustering
- Divisive clustering
Partitioning clustering is further subdivided into:
- K-Means clustering
- Fuzzy C-Means clustering
What is meant by the K-means algorithm?
K-Means clustering is an unsupervised learning algorithm. Unlike in supervised learning, there is no labeled data for this clustering. K-Means divides objects into clusters so that objects within a cluster are similar to one another and dissimilar to objects belonging to other clusters.
The term 'K' is a number: it tells the system how many clusters to create. For example, K = 2 refers to two clusters. There are ways of finding the best, or optimum, value of K for a given data set.
For a better understanding of k-means, let's take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, giving the runs scored and the wickets taken by each player in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.
We need to create the clusters, as shown below:
Considering the same data set, let us solve the problem using K-Means clustering (taking K = 2).
The first step in k-means clustering is to allocate two centroids randomly (as K = 2). Note that these points can be anywhere, as they are random. They are called centroids, but initially they are not the central points of any part of the data set.
The next step is to determine the distance between each data point and each of the randomly assigned centroids. For every point, the distance to both centroids is measured, and the point is assigned to whichever centroid is closer. You can see the data points attached to the centroids, represented here in blue and yellow.
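The assignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the player figures are made up, the starting centroids are fixed by hand rather than drawn at random so the result is reproducible, Euclidean distance is assumed, and `assign_to_nearest` is a hypothetical helper name.

```python
import math

# Hypothetical (runs, wickets) totals for six players; the values are made up.
players = [(320, 1), (410, 0), (280, 2), (35, 14), (60, 11), (20, 17)]

# Two starting centroids, fixed here for reproducibility (normally random).
centroids = [(300.0, 5.0), (50.0, 10.0)]

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    labels = []
    for point in points:
        # Euclidean distance from this point to every centroid.
        distances = [math.dist(point, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

labels = assign_to_nearest(players, centroids)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With these numbers, the three high-scoring players attach to the first centroid and the three wicket-takers to the second.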
The next step is to determine the actual centroid of each of these two clusters. Each originally random centroid is repositioned to the actual centroid of its cluster, that is, the mean position of the points assigned to it.
This process of calculating the distances and repositioning the centroids continues until we obtain our final clusters. Then the centroid repositioning stops.
As seen above, the centroids no longer need repositioning, which means the algorithm has converged, and we have two clusters, each with its own centroid.
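As a rough sketch of this repositioning step, the snippet below takes a set of points with their cluster labels and moves each centroid to the mean of its members. The data and labels are made up for illustration, and `recompute_centroids` is a hypothetical helper name; returning `None` for an empty cluster is just one possible convention.

```python
# Hypothetical (runs, wickets) data and cluster labels; the values are made up.
players = [(320, 1), (410, 0), (280, 2), (35, 14), (60, 11), (20, 17)]
labels = [0, 0, 0, 1, 1, 1]

def recompute_centroids(points, labels, k):
    """Return the mean point of each cluster (None if a cluster is empty)."""
    centroids = []
    for cluster in range(k):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        if not members:
            centroids.append(None)
            continue
        # Average each coordinate (runs, wickets) over the cluster members.
        centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centroids

new_centroids = recompute_centroids(players, labels, k=2)
print(new_centroids)
```

Here the first centroid moves to roughly (336.7, 1.0) and the second to roughly (38.3, 14.0), the centres of the two groups.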
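Putting the steps together, the whole procedure might look like the sketch below. It rests on several assumptions that are not from the text above: the data is invented, the initial centroids are drawn from the data points themselves (one common choice) using a fixed seed, distance is Euclidean, and convergence is detected when no point changes cluster, which is equivalent to the centroids no longer moving. The function name `k_means` and the `max_iters` and `seed` parameters are hypothetical.

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Plain k-means: assign points, reposition centroids, repeat until stable."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen data points.
    centroids = [tuple(map(float, p)) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centroids stay put
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical (runs, wickets) totals; the values are made up.
players = [(320, 1), (410, 0), (280, 2), (35, 14), (60, 11), (20, 17)]
centroids, labels = k_means(players, k=2)
print(centroids, labels)
```

On this small data set the loop settles within a few iterations, separating the three high-run players from the three high-wicket players. Which cluster is numbered 0 depends on the random start.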