According to the formal definition, K-means clustering is an iterative algorithm that partitions a set of n data points into k clusters, where each point belongs to the cluster with the nearest mean.
In other words, given a group of objects, we partition it into sub-groups on the basis of similarity: each data point is placed in the sub-group whose centroid (mean) it is closest to.
K-means clustering is one of the most popular unsupervised learning algorithms, and it is easy to understand and implement.
The objective of K-means clustering is to minimize the total squared Euclidean distance between each point and the centroid of its cluster. This is known as the intra-cluster variance, and it is minimized using the following squared error function –

J = Σⱼ₌₁ᵏ Σᵢ₌₁ⁿ ‖xᵢ⁽ʲ⁾ − cⱼ‖²

Where J is the objective function, k is the number of clusters, n is the number of data points, cⱼ is the centroid of cluster j, and xᵢ⁽ʲ⁾ is the i-th data point assigned to cluster j.
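To make the objective concrete, here is a minimal sketch (using NumPy, with made-up sample points and cluster assignments) that computes J as the sum of squared distances from each point to its assigned centroid:

```python
import numpy as np

def kmeans_objective(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for j, c in enumerate(centroids):
        members = points[labels == j]          # points assigned to cluster j
        total += np.sum((members - c) ** 2)    # squared distances to centroid j
    return total

# Hypothetical example: 4 points in 2 clusters, centroids at the cluster means.
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])
print(kmeans_objective(points, labels, centroids))  # → 2.25
```

A lower value of J means the points sit closer to their centroids, which is exactly what the algorithm below tries to achieve.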
For each data point x, we measure the Euclidean distance from x to its cluster's centroid. Let us now look at the algorithm for K-means clustering –
- First, we randomly select k points from the data. These k points serve as the initial means (centroids).
- We use the Euclidean distance to assign each data point to the cluster whose centroid is nearest.
- Then we calculate the mean of all the points in each cluster, which gives us that cluster's new centroid.
- We repeat the assignment and update steps iteratively until the cluster assignments no longer change, i.e. the algorithm converges.
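The steps above can be sketched as follows (a minimal NumPy implementation for illustration; the function name, sample points, and the `max_iters` safeguard are my additions, and empty-cluster handling is omitted for brevity):

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k of the data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the nearest centroid
        # (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        # (Note: a cluster left empty would need special handling, skipped here.)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable: the algorithm has converged
        centroids = new_centroids
    return labels, centroids

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels, centroids = kmeans(points, k=2)
print(labels)
```

On these well-separated sample points the algorithm recovers the two natural groups regardless of which points the random initialization picks, but in general K-means only finds a local minimum of J, so the result can depend on the initial centroids.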