K-Means Clustering using R: Step by Step Guide
This step-by-step guide to the K-means clustering algorithm covers:
Common use cases where K-means is used
Steps to perform K-means clustering
An R example
1. If you are a novice, this tutorial is for you.
2. No experience with any specific topic is required.
3. The reader should have some basic data analysis knowledge (such as descriptive statistics) and some basic knowledge of programming in R.
4. If you want to follow along, you can download the sample data here
Introduction to K-means Clustering
The goal of K-means clustering is to identify observations that are similar and group them into clusters, such that observations within the same cluster have a high degree of association and observations in different clusters have a low degree of association. K-means is an unsupervised machine learning technique, used when the data is not labeled (that is, when the data has no variable that defines the categories or groups). In the name K-means, K represents the number of clusters. A simple way of understanding K-means is that each data point is iteratively assigned to one of the K groups based on its features. The assignment is based on the similarity (closeness) of the data point to the cluster centroid. The results of the K-means clustering algorithm are:
- The centroids of the K clusters, which can be used to label or group new data.
- The label (group) of each data point in the training data.
Clustering helps us identify organically the groups that can be formed from the data, rather than defining groups before looking at the data. The centroid of each cluster is a set of feature values that represents the characteristics of the members of that group. Examining the centroid's feature values lets us qualitatively interpret what kind of group each cluster represents.
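To make the two results concrete, here is a minimal sketch using the `kmeans()` function from base R's `stats` package. The built-in `iris` data set and the choice of K = 3 are assumptions made purely for illustration; they are not the tutorial's sample data.

```r
# Sketch only: iris and K = 3 are illustrative assumptions, not the tutorial's data.
set.seed(42)                      # K-means starts from randomly chosen centroids
features <- scale(iris[, 1:4])    # standardize so no single feature dominates the distance
fit <- kmeans(features, centers = 3, nstart = 25)

fit$centers          # result 1: one row per cluster, the centroid's value on each feature
head(fit$cluster)    # result 2: the cluster label (1 to 3) of each observation
table(fit$cluster)   # how many observations fell into each cluster
```

Reading `fit$centers` row by row is one way to interpret the clusters: a centroid with a large value on a feature suggests a group whose members tend to score high on that feature.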
Common Use Cases
Let us look at when K-means clustering applies. As mentioned earlier, the K-means algorithm is used when we want to find groups/clusters within data that has no explicit labels. In other words, when we don’t know which observations belong to which group, we can use K-means clustering. It can therefore be used either to confirm assumptions about the types of groups that exist within the data, or to identify unknown groups within a complex data set. Once the groups/clusters are established using the old data, we can easily assign new data to the appropriate group.
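Assigning new data to an existing cluster amounts to finding the nearest centroid. Base R's `kmeans()` does not come with a prediction method, so the sketch below defines a hypothetical helper, `assign_cluster()`, for that purpose. The `iris` split into "old" and "new" rows is again an assumption for illustration only.

```r
# Hypothetical helper (not part of base R): label each new row with its nearest centroid.
assign_cluster <- function(new_data, centers) {
  new_data <- as.matrix(new_data)
  # Squared Euclidean distance from every new row to every centroid
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(new_data, 2, centers[k, ])^2)
  })
  d <- matrix(d, nrow = nrow(new_data))   # keep the matrix shape even for a single row
  max.col(-d, ties.method = "first")      # index of the smallest distance in each row
}

set.seed(42)
old_data <- iris[1:140, 1:4]     # data used to establish the clusters
new_data <- iris[141:150, 1:4]   # data that arrives later
fit <- kmeans(old_data, centers = 3, nstart = 25)
new_labels <- assign_cluster(new_data, fit$centers)
new_labels
```

If the old data was standardized before clustering, the new data would need to be transformed with the same centering and scaling values before calling the helper.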
Some common use cases:
- Segmenting customers based on their shopping behavior (Behavioral Segmentation)
- Segmenting users based on their activities (Behavioral Segmentation)
- Creating groups based on sales (Inventory Clustering)
- Grouping images
- Detecting bots or anomalies
- Tracking whether a data point switches between groups over time