The Unsupervised Learning Workshop

Introduction

In previous chapters, we evaluated a number of different approaches to data clustering, including k-means and hierarchical clustering. While k-means is the simplest form of clustering, it is still extremely powerful in the right scenarios. In situations where k-means can't capture the complexity of the dataset, hierarchical clustering proves to be a strong alternative.

One of the key challenges in unsupervised learning is that you are presented with a collection of feature data but no accompanying labels telling you what the target state is. While you may not get a discrete view of the target labels, you can extract some semblance of structure from the data by clustering similar points together and examining what the members of each group have in common. The first approach we covered to achieve this goal of clustering similar data points was k-means. K-means clustering works best for simple data challenges where speed is paramount: assigning each data point to its closest cluster centroid does not require a lot of computational overhead. It struggles, however, as datasets become higher-dimensional, and it is also not ideal if you do not know how many clusters you are looking for. An example we worked with in Chapter 2, Hierarchical Clustering, entailed looking at chemical profiles to determine which wines belonged together in a disorganized shipment. That exercise worked well only because we knew that three wine types had been ordered; k-means would have been far less successful if we had had no idea what the original order contained.
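As a quick refresher, the following is a minimal sketch of k-means with scikit-learn. The synthetic blob data, the random_state values, and the choice of three clusters are illustrative assumptions for this sketch, not the wine data from Chapter 2:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Illustrative synthetic data standing in for a simple, low-dimensional dataset
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# k-means needs the number of clusters up front; here we assume we know it is 3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])               # cluster assignment for the first ten points
print(kmeans.cluster_centers_)   # the three learned centroids

Notice that the number of clusters has to be supplied before fitting, which is exactly the limitation discussed above.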

The second clustering approach we explored was hierarchical clustering. This method can work in two ways: agglomerative or divisive. Agglomerative clustering takes a bottom-up approach, treating each data point as its own cluster and recursively merging clusters according to a linkage criterion. Divisive clustering works in the opposite direction, treating all the data points as one large cluster and recursively breaking it down into smaller clusters. This approach has the benefit of a full view of the entire data distribution, since it evaluates where to split; however, it is typically not used in practice due to its greater computational complexity. Hierarchical clustering is a strong contender when you know little about your data in advance. Using a dendrogram, you can visualize every split in your data and decide after the fact how many clusters make sense. This can be really helpful in your specific use case; however, it also comes at a higher computational cost than k-means.
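The sketch below shows one common way to run agglomerative clustering and draw a dendrogram using SciPy. The synthetic data and the choice of Ward linkage are illustrative assumptions here, not the exact workflow from Chapter 2:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

# Illustrative synthetic data; any numeric feature matrix would work here
X, _ = make_blobs(n_samples=50, centers=3, n_features=2, random_state=42)

# Agglomerative (bottom-up) clustering: every point starts as its own cluster,
# and clusters are merged according to a linkage criterion (Ward's method here)
Z = linkage(X, method='ward')

# The dendrogram lets you inspect every merge and choose a cluster count afterwards
dendrogram(Z)
plt.show()

# Cut the tree into a chosen number of flat clusters once you have seen the dendrogram
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)

Unlike k-means, the number of clusters is chosen after the linkage tree is built, which is what makes the dendrogram so useful when you know nothing about the data beforehand.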

In this chapter, we will cover a clustering approach that serves us best in the sphere of highly complex data: Density-Based Spatial Clustering of Applications with Noise (DBSCAN). This method has a long-standing reputation as a high performer on datasets that contain a lot of densely interspersed data. Let's walk through why it does so well in these use cases.