K-means
K-means is the simplest implementation of the principle of maximum separation and maximum internal cohesion. Let's suppose we have a dataset X ∈ ℝM×N (that is, M N-dimensional samples) that we want to split into K clusters, together with a set of K centroids corresponding to the means of the samples assigned to each cluster Kj:
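The display equation that followed appears to have been lost in extraction; a plausible reconstruction, consistent with the notation in the surrounding text, is:

```latex
M = \{\mu_1, \mu_2, \ldots, \mu_K\}, \quad
\mu_j = \frac{1}{|K_j|} \sum_{x_i \in K_j} x_i
```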
The set M and the centroids carry an additional index (as a superscript) indicating the iterative step. Starting from an initial guess M(0), K-means tries to minimize an objective function called inertia (that is, the total intra-cluster squared distance between the samples assigned to a cluster Kj and its centroid μj):
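The inertia formula itself seems to be missing here; a reconstruction consistent with the standard definition and the symbols used in this section (normalization conventions may differ from the original) is:

```latex
S^{(t)} = \sum_{j=1}^{K} \sum_{x_i \in K_j^{(t)}} \left\| x_i - \mu_j^{(t)} \right\|^2
```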
It's easy to understand that S(t) cannot be considered an absolute measure, because its value is strongly influenced by the variance of the samples. However, S(t+1) < S(t) implies that the centroids are moving closer to an optimal position, where the points assigned to a cluster have the smallest possible distance to the corresponding centroid. Hence, the iterative procedure (also known as Lloyd's algorithm) starts by initializing M(0) with random values. The next step is the assignment of each sample xi ∈ X to the cluster whose centroid has the smallest distance from xi:
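The assignment rule that followed seems to have been dropped; it can plausibly be reconstructed, using the same symbols as the surrounding text, as:

```latex
c^{(t)}(x_i) = \operatorname*{argmin}_{j} \left\| x_i - \mu_j^{(t)} \right\|
```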
Once all assignments have been completed, the new centroids are recomputed as arithmetic means:
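The recomputation formula appears to be missing as well; a plausible reconstruction, consistent with the assignment step described above, is:

```latex
\mu_j^{(t+1)} = \frac{1}{\left| K_j^{(t+1)} \right|} \sum_{x_i \in K_j^{(t+1)}} x_i
```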
The procedure is repeated until the centroids stop changing (this also implies a decreasing sequence S(0) > S(1) > ... > S(tend)). As the reader will have noticed, the computational time is highly influenced by the initial guess. If M(0) is very close to M(tend), a few iterations are enough to find the optimal configuration. Conversely, when M(0) is purely random, the probability of an inefficient initial choice is close to 1 (that is, every initial uniform random choice is almost equivalent in terms of computational complexity).
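The whole loop can be sketched in a few lines of NumPy. This is a minimal illustration of Lloyd's algorithm as described above, not a production implementation (the function name `k_means` and the random initialization from the data points are assumptions for the sketch):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Plain Lloyd's algorithm: random initial centroids, then
    alternating assignment and mean-update steps until convergence."""
    rng = np.random.default_rng(seed)
    # Initial guess M(0): K distinct samples drawn at random from X
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each sample
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: recompute each centroid as the arithmetic mean
        # of its assigned samples (empty clusters keep their centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped changing
        centroids = new_centroids
    # Inertia: total intra-cluster squared distance
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia
```

On two well-separated blobs, the returned labels split the samples into the two groups and the inertia is far smaller than the single-cluster value, which is exactly the decreasing-S behavior discussed above.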