The K-means algorithm is a well-known partitional clustering algorithm. Given a set of data points and the required number of clusters k (specified by the user), the algorithm iteratively partitions the data into k clusters based on a distance function. Concretely, given a set of data points x1, …, xn, K-means groups the data into k cohesive clusters. Each cluster has a cluster center, called the centroid, which represents the cluster and is the mean of all the data points that belong to it.
The intuition behind K-means is an iterative procedure that starts by randomly picking k data points as the initial centroids. The distance between each centroid and every data point is then computed, and each data point is assigned to the cluster whose centroid is closest. Once every data point has been assigned to a cluster, the centroids are re-computed from the data points of the newly formed clusters. These steps of cluster assignment and centroid re-computation are repeated until the convergence criterion is met.
We need to implement three methods to perform the following tasks:
- Randomly initializing K centroids
- Assigning data points to K clusters
- Re-computing K centroids
Step 1 is performed at the start of the training process, whereas steps 2 and 3 are repeated in a loop until convergence. So let's get started by implementing the first method, the random initialization of the K centroids.
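Since the original code listing is not reproduced here, the snippet below is an illustrative NumPy sketch of this first step, keeping the method names used in the text (`randperm` here is a thin wrapper around `np.random.permutation`, playing the role of a randperm-style shuffle):

```python
import numpy as np

def randperm(n):
    # Randomly reorder the indices 0..n-1
    return np.random.permutation(n)

def initCentroids(X, K):
    # Pick K distinct rows of X, at random, as the initial centroids
    idx = randperm(X.shape[0])
    return X[idx[:K], :]
```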
The initCentroids method defined above returns K randomly selected data points from X as the initial centroids.
The randperm method randomly reorders the indices of the dataset X.
The getClosestCentroids method computes the Euclidean distance from every data point to each cluster centroid and assigns each data point to the centroid with the lowest distance.
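Under the same assumptions (X as an n×d NumPy array), getClosestCentroids could be sketched as follows; this is an illustrative implementation, not the tutorial's original listing. Note that the squared Euclidean distance is enough here, since it yields the same nearest centroid:

```python
import numpy as np

def getClosestCentroids(X, centroids):
    # Squared Euclidean distance from each point to each centroid,
    # computed for all pairs at once via broadcasting: shape (n, K)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Index of the nearest centroid for every data point
    return np.argmin(dists, axis=1)
```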
The last method we need to implement re-computes the cluster centroids: for each cluster, we compute the mean of all the data points assigned to it and take that mean as the cluster's new centroid.
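A matching sketch of this re-computation step is below (again illustrative; `computeCentroids` is a name I am introducing, as the text does not give one). A small guard handles the edge case of a cluster ending up with no points:

```python
import numpy as np

def computeCentroids(X, idx, K):
    # New centroid of cluster k = mean of the points currently assigned to k
    centroids = np.zeros((K, X.shape[1]))
    for k in range(K):
        members = X[idx == k]
        if len(members):
            centroids[k] = members.mean(axis=0)
        else:
            # Guard: re-seed an empty cluster with a random data point
            centroids[k] = X[np.random.randint(X.shape[0])]
    return centroids
```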
That's it; we now have all the pieces required to implement K-means. The final piece of code (given below) executes the methods defined above until the maximum number of iterations is reached. For the stopping criterion you can use other techniques (e.g., stopping when no data point changes cluster), but for simplicity I am using a maximum number of iterations here. If you are interested in finding out how to select K automatically, check the resources mentioned at the end of the tutorial.
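Putting the pieces together, a self-contained NumPy sketch of the full loop might look like this (illustrative code under the same assumptions as before; `runKMeans` and `max_iters` are names introduced for the example, not from the original tutorial):

```python
import numpy as np

def initCentroids(X, K):
    # K random rows of X as the initial centroids
    idx = np.random.permutation(X.shape[0])
    return X[idx[:K], :]

def getClosestCentroids(X, centroids):
    # Assign each point to its nearest centroid (squared Euclidean distance)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(dists, axis=1)

def computeCentroids(X, idx, K):
    # Mean of the points in each cluster; re-seed empty clusters randomly
    centroids = np.zeros((K, X.shape[1]))
    for k in range(K):
        members = X[idx == k]
        if len(members):
            centroids[k] = members.mean(axis=0)
        else:
            centroids[k] = X[np.random.randint(X.shape[0])]
    return centroids

def runKMeans(X, K, max_iters=10):
    # Step 1: random initialization, then steps 2 and 3 in a loop
    centroids = initCentroids(X, K)
    for _ in range(max_iters):
        idx = getClosestCentroids(X, centroids)
        centroids = computeCentroids(X, idx, K)
    return centroids, idx
```

Running it on two well-separated blobs, e.g. `runKMeans(X, 2)`, returns the K final centroids together with the cluster index of every data point.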
For learning more about K-Means, check the following resources:
- Machine learning course by Andrew Ng
- Chapter 4 of Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data by Bing Liu
- Pham, Duc Truong, Stefan S. Dimov, and C. D. Nguyen. "Selection of K in K-means clustering."