Hands-On Unsupervised Learning with Python
上QQ阅读APP看书,第一时间看更新

Cluster analysis

Cluster analysis (normally called just clustering) is an example of a task where we want to find out common features among large sets of samples. In this case, we always suppose the existence of a data generating process  and we define the dataset X as:

A clustering algorithm is based on the implicit assumption that samples can be grouped according to their similarities. In particular, given two vectors, a similarity function is defined as the reciprocal or inverse of a metric function. For example, if we are working in a Euclidean space, we have:

In the previous formula, the constant ε has been introduced to avoid division by zero. It's obvious that d(a, c) < d(a, b) ⇒ s(a, c) > s(a, b). Therefore, given a representative of each cluster , we can create the set of assigned vectors considering the rule:

In other words, a cluster contains all those elements whose distance from the representative is minimum compared to all other representatives. This implies that a cluster contains samples whose similarity with the representative is maximal compared to all representatives. Moreover, after the assignment, a sample gains the right to share its feature with the other members of the same cluster.

In fact, one of the most important applications of cluster analysis is trying to increase the homogeneity of samples that are recognized as similar. For example, a recommendation engine could be based on the clustering of the user vectors (containing information about their interests and bought products). Once the groups have been defined, all the elements belonging to the same cluster are considered as similar, hence we are implicitly authorized to share the differences. If user A has bought the product P and rated it positively, we can suggest this item to user B who didn't buy it and the other way around. The process can appear arbitrary, but it turns out to be extremely effective when the number of elements is large and the feature vectors contain many discriminative elements (for example, ratings).