Applied Unsupervised Learning with R

Introduction to k-means Clustering

K-means clustering is one of the most basic unsupervised learning algorithms. The algorithm finds natural groupings of data points according to a predefined similarity or distance measure. The distance measure can be any of the following:

  • Euclidean distance
  • Manhattan distance
  • Cosine distance
  • Hamming distance

To understand what a distance measure does, take the example of a set of 12 pens. Six of them are blue and six are red; six of them are ballpoint pens and six are ink pens. If you were to use ink color as the similarity measure, then the six blue pens and the six red pens would end up in different clusters. The blue pens in a cluster could be ink or ballpoint; there's no restriction on that. But if you were to use the type of pen as the similarity measure, then the six ink pens and the six ballpoint pens would end up in different clusters, and it wouldn't matter whether the pens in each cluster are the same color or not.

Euclidean Distance

Euclidean distance is the straight-line distance between any two points. Calculating this distance in two dimensions can be thought of as an extension of the Pythagorean theorem, which you might have studied in school. However, Euclidean distance can be calculated between two points in any n-dimensional space, not just a two-dimensional one. The Euclidean distance between any two points is the square root of the sum of the squares of the differences between their coordinates. An example of the calculation of Euclidean distance is presented here:

Figure 1.5: Representation of Euclidean distance calculation
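The following is a minimal R sketch of this calculation, using two illustrative points, p and q (these names and values are just for this example):

    # Two illustrative points in two-dimensional space
    p<-c(7,4)
    q<-c(5,3)

    # Euclidean distance: square root of the sum of squared coordinate differences
    sqrt(sum((p-q)^2))
    # [1] 2.236068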

In k-means clustering, Euclidean distance is used. One disadvantage of using Euclidean distance is that it loses its meaning when the dimensionality of the data is very high. This is related to a phenomenon known as the curse of dimensionality: when a dataset has many dimensions, the distances between points become large and increasingly similar to one another, which makes them difficult to interpret and visualize.

So, when the dimensionality of the data is very high, we either reduce its dimensionality with principal component analysis, which we're going to study in Chapter 4, Dimension Reduction, or we use cosine similarity.

Manhattan Distance

By definition, Manhattan distance is the distance between two points measured along the axes at right angles:

Figure 1.6: Representation of Manhattan distance

The length of the diagonal line is the Euclidean distance between the two points. Manhattan distance is simply the sum of the absolute differences between the coordinates of the two points. So, the main difference between the two measures is that for Euclidean distance we square the coordinate differences and then take the square root of their sum, whereas for Manhattan distance we directly sum the absolute values of the coordinate differences.
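To see the difference in R, here is a minimal sketch using the same two illustrative points as before (again, p and q are just example values):

    # The same two illustrative points
    p<-c(7,4)
    q<-c(5,3)

    # Manhattan distance: sum of the absolute coordinate differences
    sum(abs(p-q))
    # [1] 3 (the Euclidean distance between the same points is about 2.24)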

Cosine Distance

Cosine similarity between any two points is defined as the cosine of the angle between the two vectors that run from the origin to those points. It can be calculated by dividing the dot product of the two vectors by the product of their magnitudes:

Figure 1.7: Representation of cosine similarity and cosine distance

Cosine distance is defined as (1-cosine similarity).

Cosine distance varies from 0 to 2, whereas cosine similarity varies between -1 and 1. Always remember that cosine similarity is one minus the cosine distance.
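The following is a minimal R sketch of both quantities, using two illustrative vectors, a and b (these names and values are just for this example):

    # Two illustrative vectors
    a<-c(7,4)
    b<-c(5,3)

    # Cosine similarity: dot product divided by the product of the magnitudes
    cosine_similarity<-sum(a*b)/(sqrt(sum(a^2))*sqrt(sum(b^2)))

    # Cosine distance: one minus the cosine similarity
    cosine_distance<-1-cosine_similarity

    cosine_similarity # close to 1, as the two vectors point in almost the same direction
    cosine_distance   # correspondingly close to 0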

Hamming Distance

The Hamming distance is a special type of distance that is used for categorical variables. Given two points of equal dimension, the Hamming distance is defined as the number of coordinates at which they differ. For example, take the two points (0, 1, 1) and (0, 1, 0). Only one value, the last one, differs between these two points, so the Hamming distance between them is 1:

Figure 1.8: Representation of the Hamming distance
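The following is a minimal R sketch of this calculation, using the two points from the example above:

    # The two points from the example above
    x<-c(0,1,1)
    y<-c(0,1,0)

    # Hamming distance: the number of coordinates at which the two points differ
    sum(x!=y)
    # [1] 1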

k-means Clustering Algorithm

K-means clustering is used to find clusters of similar points in a dataset when we have unlabeled data. In this chapter, we're going to use the Iris flowers dataset. This dataset contains measurements of the length and width of the sepals and petals of flowers of different species. With the help of unsupervised learning, we're going to learn how to differentiate between the species without knowing which measurements belong to which species. The following is a scatter plot of our dataset:

Figure 1.9: A scatter plot of the Iris flowers dataset

This is a scatter plot of two variables from the Iris flowers dataset: sepal length and sepal width.
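A plot similar to Figure 1.9 can be produced from the built-in iris dataset with a single call; here is a minimal sketch:

    # Scatter plot of sepal length against sepal width for the built-in iris dataset
    plot(iris$Sepal.Length,iris$Sepal.Width,xlab='Sepal length',ylab='Sepal width')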

If we were to identify clusters in the preceding plot according to the distances between points, we would choose groups that look like bunches of grapes hanging from a tree. You can see that there are two major bunches: one in the top left, and the other made up of the remaining points. The k-means algorithm identifies these "bunches" of grapes.

The following figure shows the same scatter plot, but with the three different species of Iris shown in different colors. These species are taken from the Species column of the original dataset, and are as follows: Iris setosa (shown in green), Iris versicolor (shown in red), and Iris virginica (shown in blue). We're going to see whether we can recover these species by forming our own groupings using clustering:

Figure 1.10: A scatter plot showing different species of Iris flowers dataset

Here is a photo of Iris setosa, which is represented in green in the preceding scatter plot:

Figure 1.11: Iris setosa

The following is a photo of Iris versicolor, which is represented in red in the preceding scatter plot:

Figure 1.12: Iris versicolor

Here is a photo of Iris virginica, which is represented in blue in the preceding scatter plot:

Figure 1.13: Iris virginica

Steps to Implement k-means Clustering

As we saw in the scatter plot in Figure 1.9, each data point represents a flower. We're going to find clusters that identify these species. To do this type of clustering, we're going to use k-means clustering, where k is the number of clusters we want. The following are the steps to perform k-means clustering, which, for simplicity of understanding, we're going to demonstrate with two clusters. We will build up to using three clusters later, in order to try to match the actual species groupings:

  1. Choose any two random coordinates, k1 and k2, on the scatter plot as initial cluster centers.
  2. Calculate the distance of each data point in the scatter plot from coordinates k1 and k2.
  3. Assign each data point to a cluster based on whether it is closer to k1 or k2.
  4. Find the mean coordinates of all points in each cluster and update the values of k1 and k2 to those coordinates respectively.
  5. Start again from Step 2 until the coordinates of k1 and k2 stop moving significantly, or after a certain pre-determined number of iterations of the process.

We're going to demonstrate the preceding algorithm with graphs and code.

Exercise 2: Implementing k-means Clustering on the Iris Dataset

In this exercise, we will implement k-means clustering step by step:

  1. Load the built-in Iris dataset in the iris_data variable:

    iris_data<-iris

  2. Set the color for different species for representation on the scatter plot. This will help us see how the three different species are split between our initial two groupings:

    iris_data$t_color<-'red'

    iris_data$t_color[which(iris_data$Species=='setosa')]<-'green'

    iris_data$t_color[which(iris_data$Species=='virginica')]<-'blue'

  3. Choose any two random clusters' centers to start with:

    k1<-c(7,3)

    k2<-c(5,3)

    Note

    You can try changing the points and see how it affects the final clusters.

  4. Plot the scatter plot along with the centers you chose in the previous step. Pass the sepal length and width of the iris flowers, along with the color, to the plot() function in the first line, and then pass the x and y coordinates of each center to the points() function. Here, pch selects the plotting symbol used for the cluster centers – in this case, 4 is a cross and 5 is a diamond:

    plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$t_color)

    points(k1[1],k1[2],pch=4)

    points(k2[1],k2[2],pch=5)

    The output is as follows:

    Figure 1.14: A scatter plot of the chosen cluster centers

  5. Choose the number of iterations you want. The number of iterations should be such that the centers stop changing significantly after each iteration. In our case, six iterations are sufficient:

    number_of_steps<-6

  6. Initialize the variable that will keep track of the number of iterations in the loop:

    n<-1

  7. Start the while loop to find the final cluster centers:

    while(n<=number_of_steps)

    {

  8. Calculate the distance of each point from the current cluster centers, which is Step 2 in the algorithm. We're calculating the Euclidean distance here using the sqrt function:

    iris_data$distance_to_clust1 <- sqrt((iris_data$Sepal.Length-k1[1])^2+(iris_data$Sepal.Width-k1[2])^2)

    iris_data$distance_to_clust2 <- sqrt((iris_data$Sepal.Length-k2[1])^2+(iris_data$Sepal.Width-k2[2])^2)

  9. Assign each point to the cluster whose center it is closest to, which is Step 3 of the algorithm:

    iris_data$clust_1 <- 1*(iris_data$distance_to_clust1<=iris_data$distance_to_clust2)

    iris_data$clust_2 <- 1*(iris_data$distance_to_clust1>iris_data$distance_to_clust2)

  10. Calculate the new cluster centers by taking the mean x and y coordinates of the points in each cluster (Step 4 in the algorithm) with the mean() function in R:

    k1[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_1==1)])

    k1[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_1==1)])

    k2[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_2==1)])

    k2[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_2==1)])

  11. Update the variable that keeps count of the iterations, so that we carry out Step 5 of the algorithm:

    n<-n+1

    }

  12. Now we're going to overwrite the species colors with new colors to demonstrate the two clusters. So, there will only be two colors on our next scatter plot – one color for cluster 1, and one color for cluster 2:

    iris_data$color<-'red'

    iris_data$color[which(iris_data$clust_2==1)]<-'blue'

  13. Plot the new scatter plot, which contains clusters along with cluster centers:

    plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$color)

    points(k1[1],k1[2],pch=4)

    points(k2[1],k2[2],pch=5)

    The output will be as follows:

Figure 1.15: A scatter plot representing each cluster with a different color

Notice how setosa (which used to be green) has been grouped in the left cluster, while most of the virginica flowers (which were blue) have been grouped into the right cluster. The versicolor flowers (which were red) have been split between the two new clusters.

You have successfully implemented the k-means clustering algorithm to identify two groups of flowers based on their sepal size. Notice how the position of centers has changed after running the algorithm.
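As a quick point of comparison (and not part of this exercise), base R also provides a built-in kmeans() function that produces a similar two-cluster result in a single call. Here is a minimal sketch using the same two columns:

    # k-means with two clusters on the same two columns, using base R's kmeans()
    set.seed(1) # kmeans() chooses random starting centers, so fix the seed for reproducibility
    fit<-kmeans(iris[,c('Sepal.Length','Sepal.Width')],centers=2)

    # Color the points by their cluster assignment and mark the final cluster centers
    plot(iris$Sepal.Length,iris$Sepal.Width,col=fit$cluster)
    points(fit$centers,pch=4)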

In the following activity, we are going to increase the number of clusters to three to see whether we can group the flowers correctly into their three different species.

Activity 1: k-means Clustering with Three Clusters

Write an R program to perform k-means clustering on the Iris dataset using three clusters. In this activity, we're going to perform the following steps:

  1. Choose any three random coordinates, k1, k2, and k3, on the plot as centers.
  2. Calculate the distance of each data point from k1, k2, and k3.
  3. Assign each point to the cluster whose center it is closest to.
  4. Find the mean coordinates of all points in the respective clusters and update the values of k1, k2, and k3 to those coordinates.
  5. Start again from Step 2 until the coordinates of k1, k2, and k3 stop moving significantly, or after 10 iterations of the process.

The outcome of this activity will be a chart with three clusters, as follows:

Figure 1.16: The expected scatter plot for the given cluster centers

You can compare your chart to Figure 1.10 to see how well the clusters match the actual species classifications.

Note

The solution for this activity can be found on page 198.