
Introduction to Hierarchical Clustering

So far, we have shown that hierarchies can be excellent structures for organizing information because they clearly show nested relationships among data points. While this helps us understand the parent/child relationships between items, hierarchies can also be very handy when forming clusters. Expanding on the animal example from the previous section, imagine that you were presented with just two features for each animal: its height (measured from the tip of the nose to the end of the tail) and its weight. Using this information, you then have to recreate a hierarchical structure in order to identify which records in your dataset correspond to dogs and cats, as well as their relative subspecies.

Since you are only given animal heights and weights, you won't be able to deduce the specific names of each species. However, by analyzing the features that you have been provided with, you can develop a structure within the data that serves as an approximation of what animal species exist in your data. This perfectly sets the stage for an unsupervised learning problem that is well solved with hierarchical clustering. In the following plot, you can see the two features that we created on the left, with animal height in the left-hand column and animal weight in the right-hand column. This is then charted on a two-axis plot with the height on the X-axis and the weight on the Y-axis:

Figure 2.4: An example of a two-feature dataset comprising animal height and animal weight

One way to approach hierarchical clustering is to start with each data point as its own cluster and recursively join the most similar clusters together – this is known as agglomerative hierarchical clustering. We will go into more detail about the different ways of approaching hierarchical clustering in the Agglomerative versus Divisive Clustering section.

In the agglomerative hierarchical clustering approach, the concept of data point similarity can be thought of in the same way as in k-means. In k-means, we used the Euclidean distance to calculate the distance from the individual points to the centroids of the expected "k" clusters. In this approach to hierarchical clustering, we will reuse the same distance metric to determine the similarity between the records in our dataset.
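
As a quick refresher, here is a minimal sketch of the Euclidean distance calculation, assuming NumPy and two of the sample points that appear in the walkthrough later in this section:

    import numpy as np

    # Two sample points from the walkthrough later in this section
    point_a = np.array([1, 7])
    point_b = np.array([-5, 9])

    # Euclidean distance: square root of the sum of squared differences
    distance = np.sqrt(np.sum((point_a - point_b) ** 2))
    print(distance)  # approximately 6.32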

Eventually, by recursively grouping individual records from the data with their most similar records, you build a hierarchy from the bottom up. The individual single-member clusters eventually join into one single cluster at the top of the hierarchy.

Steps to Perform Hierarchical Clustering

To understand how agglomerative hierarchical clustering works, we can trace the path of a simple toy program as it merges clusters to form a hierarchy (a minimal code sketch of these steps follows this list):

  1. Given n sample data points, view each point as an individual "cluster" with just that one point as a member (the centroid).
  2. Calculate the pairwise Euclidean distance between the centroids of all the clusters in your data. (The minimum distance between clusters, the maximum distance between clusters, or the average distance between clusters could also be used here; in this example, we use the distance between the two cluster centroids.)
  3. Group the closest clusters/points together.
  4. Repeat Step 2 and Step 3 until you get a single cluster containing all the data in your set.
  5. Plot a dendrogram to show how your data has come together in a hierarchical structure. A dendrogram is simply a diagram that is used to represent a tree structure, showing an arrangement of clusters from top to bottom. We will go into the details of how this may be helpful in the following walkthrough.
  6. Decide what level you want to create the clusters at.
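
The following is a minimal sketch of these steps in Python, assuming NumPy and the four sample points used in the walkthrough that follows; it is purely illustrative and is not the SciPy-based implementation used later in this chapter:

    import numpy as np

    # The four sample points used in the walkthrough that follows
    points = np.array([[1, 7], [-5, 9], [-9, 4], [4, -2]])

    # Step 1: every point starts as its own single-member cluster
    clusters = [[i] for i in range(len(points))]

    while len(clusters) > 1:
        # Step 2: pairwise Euclidean distances between cluster centroids
        centroids = [points[c].mean(axis=0) for c in clusters]
        best_pair, best_dist = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = np.linalg.norm(centroids[i] - centroids[j])
                if dist < best_dist:
                    best_pair, best_dist = (i, j), dist
        # Step 3: merge the two closest clusters and repeat
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        print(f"Merged at distance {best_dist:.2f} -> {clusters}")

Each iteration prints the merge distance and the updated cluster membership, mirroring how Step 2 and Step 3 repeat until a single cluster remains.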

An Example Walkthrough of Hierarchical Clustering

While slightly more complex than k-means, hierarchical clustering is, in fact, quite similar to it from a logistical perspective. Here is a simple example that walks through the preceding steps in slightly more detail:

  1. Given a list of four sample data points, view each point as its own single-member cluster, with the point itself as the centroid and point indices running from 0 to 3:

    Clusters (4): [ (1,7) ], [ (-5,9) ], [ (-9,4) ] , [ (4, -2) ]

    Centroids (4): [ (1,7) ], [ (-5,9) ], [ (-9,4) ] , [ (4, -2) ]

  2. Calculate the pairwise Euclidean distance between the centroids of all the clusters.

    Note

    Refer to the K-means Clustering In-Depth Walkthrough section in Chapter 1, Introduction to Clustering for a refresher on Euclidean distance.

    In the matrix displayed in Figure 2.5, the point indices run from 0 to 3 both horizontally and vertically, showing the distance between the respective points. Notice that the values are mirrored across the diagonal – this happens because the distance from any point A to point B is the same as the distance from B to A, so you only need to worry about the set of numbers on one side of the diagonal:

    Figure 2.5: An array of distances
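
    As a sketch of how this matrix could be reproduced, assuming SciPy's pdist and squareform helpers:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    points = np.array([[1, 7], [-5, 9], [-9, 4], [4, -2]])

    # squareform expands the condensed distances into a symmetric 4 x 4 matrix
    distance_matrix = squareform(pdist(points, metric="euclidean"))
    print(np.round(distance_matrix, 2))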

  3. Group the closest point pairs together.

    In this case, points [1,7] and [-5,9] join into a cluster since they are the closest, with the remaining two points left as single-member clusters:

    Figure 2.6: An array of distances

    Here are the resulting three clusters:

    [ [1,7], [-5,9] ]

    [-9,4]

    [4,-2]

  4. Calculate the mean point between the points of the two-member cluster to find the new centroid:

    mean([ [1,7], [-5,9] ]) = [-2,8]
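
    A quick way to verify this, assuming NumPy, is to take the column-wise mean of the two member points:

    import numpy as np

    # Column-wise mean of the two-member cluster gives the new centroid
    new_centroid = np.mean([[1, 7], [-5, 9]], axis=0)
    print(new_centroid)  # [-2.  8.]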

  5. Combine this new centroid with the two remaining single-member centroids and recalculate the distances:

    Clusters (3):

    [ [1,7], [-5,9] ]

    [-9,4]

    [4,-2]

    Centroids (3):

    [-2,8]

    [-9,4]

    [4,-2]

    Once again, we'll calculate the Euclidean distances between the remaining points and the new centroid:

    Figure 2.7: An array of distances

  6. As shown in the preceding figure, point [-9,4] is the closest to the new centroid [-2,8], so it is added to cluster 1. Now, the cluster list changes to the following:

    Clusters (2):

    [ [1,7], [-5,9], [-9,4] ]

    [4,-2]

  7. With only point [4,-2] left, you can simply add it to cluster 1 to unify all the clusters:

    Clusters (1):

    [ [ [1,7], [-5,9], [-9,4], [4,-2] ] ]

  8. Plot a dendrogram to show the relationship between the points and the clusters:

Figure 2.8: A dendrogram showing the relationship between the points and the clusters

Dendrograms show how data points are similar and will look familiar: they resemble the hierarchical tree structures that we discussed earlier. As with any visualization technique, some information is lost; however, dendrograms can be very helpful when determining how many clusters you want to form. In the preceding example, you can see four potential clusters across the X-axis if each point were its own cluster. As you travel vertically, you can see which points are closest together and can potentially be grouped into their own cluster. For example, in the preceding dendrogram, the points at indices 0 and 1 are the closest and can form their own cluster, while index 2 remains a single-point cluster.
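
As a minimal sketch, the dendrogram for this four-point walkthrough could be reproduced with SciPy's linkage and dendrogram helpers (covered formally in the exercise that follows); the 'centroid' method matches the centroid-distance approach used above:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    points = np.array([[1, 7], [-5, 9], [-9, 4], [4, -2]])

    # Merge history for the four points, using centroid linkage
    Z = linkage(points, method="centroid", metric="euclidean")
    dendrogram(Z)
    plt.xlabel('point index')
    plt.ylabel('distance')
    plt.show()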

Revisiting the previous animal taxonomy example that involved dog and cat species, imagine that you were presented with the following dendrogram:

Figure 2.9: An animal taxonomy dendrogram

If you were just interested in grouping your species dataset into dogs and cats, you could stop clustering at the first level of the grouping. However, if you wanted to group all species into domesticated or non-domesticated animals, you could stop clustering at level two. The great thing about hierarchical clustering and dendrograms is that you can see the entire breakdown of potential clusters to choose from.

Exercise 2.01: Building a Hierarchy

Let's implement the preceding hierarchical clustering approach in Python. With the intuition laid out, we can now explore the process of building a hierarchical cluster with some helper functions provided in SciPy. SciPy (https://www.scipy.org/docs.html) is an open source library that packages functions that are helpful in scientific and technical computing; examples include easy implementations of linear algebra and calculus-related methods. In this exercise, we will specifically be using helpful functions from the cluster module of SciPy. In addition to SciPy, we will be using matplotlib to complete this exercise. Follow these steps to complete this exercise:

  1. Import the required packages, as follows:

    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt
    %matplotlib inline

  2. Generate a random cluster dataset to experiment with. X = coordinate points, y = cluster labels (not needed):

    X, y = make_blobs(n_samples=1000, centers=8, \
                      n_features=2, random_state=800)

  3. Visualize the data, as follows:

    plt.scatter(X[:,0], X[:,1])
    plt.show()

    The output is as follows:

    Figure 2.10: A plot of the dummy data

    After plotting this simple toy example, it should be pretty clear that our dummy data comprises eight clusters.

  4. We can easily generate the distance matrix using the built-in SciPy function, linkage. We will go further into what's happening with the linkage function shortly; however, for now it's good to know that there are pre-built tools that calculate the distances between points:

    # Generate distance matrix with 'linkage' function
    distances = linkage(X, method="centroid", metric="euclidean")
    print(distances)

    The output is as follows:

    Figure 2.11: A matrix of the distances

    If you experiment with different options for the method hyperparameter of the linkage function, you will see how they affect overall performance. Linkage works by calculating the distances between each of the data points (and, as clusters form, between clusters). We will go into specifically what it calculates in the Linkage topic. In the linkage function, we have the option to select both the metric and the method (we will cover this in more detail later).
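
    As a quick, hedged sketch of such an experiment (reusing the X array and the linkage import from the earlier steps):

    # Compare a few linkage methods on the same data; the last column of the
    # final row is the distance at which the final two clusters merge
    for method in ["single", "complete", "average", "centroid", "ward"]:
        Z = linkage(X, method=method, metric="euclidean")
        print(method, Z[-1, 2])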

    After we determine the linkage matrix, we can easily pass it through the dendrogram function provided by SciPy. As the name suggests, the dendrogram function uses the distances calculated in Step 4 to generate a visually clean way of parsing grouped information.

  5. We will be using a custom function to clean up the styling of the original output (note that the function provided in the following snippet is using the base SciPy implementation of the dendrogram, and the only custom code is for cleaning up the visual output):

    # Take normal dendrogram output and stylize in cleaner way
    def annotated_dendrogram(*args, **kwargs):
        # Standard dendrogram from SciPy
        scipy_dendro = dendrogram(*args, truncate_mode='lastp', \
                                  show_contracted=True,\
                                  leaf_rotation=90.)
        plt.title('Blob Data Dendrogram')
        plt.xlabel('cluster size')
        plt.ylabel('distance')
        for i, d, c in zip(scipy_dendro['icoord'], \
                           scipy_dendro['dcoord'], \
                           scipy_dendro['color_list']):
            x = 0.5 * sum(i[1:3])
            y = d[1]
            if y > 10:
                plt.plot(x, y, 'o', c=c)
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5), \
                             textcoords='offset points', \
                             va='top', ha='center')
        return scipy_dendro

    dn = annotated_dendrogram(distances)
    plt.show()

    The output is as follows:

    Figure 2.12: A dendrogram of the distances

    This plot gives us some perspective on the potential breakouts of our data. Based on the distances calculated in the prior steps, it shows one potential cut: around a distance of seven, the data could be split into three separate groups that are distinct enough to stand on their own.

  6. Using this information, we can wrap up our exercise on hierarchical clustering by using the fcluster function from SciPy:

    scipy_clusters = fcluster(distances, 3, criterion="distance")
    plt.scatter(X[:,0], X[:,1], c=scipy_clusters)
    plt.show()

    The fcluster function uses the linkage matrix of distances (the same information that the dendrogram visualizes) to assign our data to a number of flat clusters based on a stated threshold. The number 3 in the preceding example is the distance threshold hyperparameter that you can set: the hierarchy is cut so that points placed in the same cluster were never merged at a distance greater than this value. This hyperparameter can be tuned based on the dataset that you are looking at; however, it is supplied to you as 3 for this exercise. The final output is as follows:

Figure 2.13: A scatter plot of the distances

In the preceding plot, you can see that by using our threshold hyperparameter, we've identified eight distinct clusters. By simply calling a few helper functions provided by SciPy, you can easily implement agglomerative clustering in just a few lines of code. While SciPy does help with many of the intermediate steps, this is still an example that is a bit more verbose than what you will probably see in your regular work. We will cover more streamlined implementations later.
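
As a rough sketch of how this threshold behaves (reusing the distances matrix and the fcluster import from the exercise), you could vary the cut and count the resulting clusters, or ask for a maximum number of clusters directly:

    # Smaller thresholds cut the hierarchy lower and yield more clusters
    for threshold in [1, 3, 9]:
        labels = fcluster(distances, threshold, criterion="distance")
        print(threshold, len(set(labels)))

    # Alternatively, request a maximum number of clusters directly
    eight_clusters = fcluster(distances, 8, criterion="maxclust")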

Note

To access the source code for this specific section, please refer to https://packt.live/2VTRp5K.

You can also run this example online at https://packt.live/2Cdyiww.