Hands-On Unsupervised Learning with Python

Adjusted Mutual Information (AMI) score

The main goal of this score is to evaluate the level of agreement between Ytrue and Ypred without taking label permutations into account. Such an objective can be measured by employing the information theory concept of Mutual Information (MI); in our case, it's defined as:

MI(Y_{true}; Y_{pred}) = \sum_{i} \sum_{j} \frac{n(i, j)}{n} \log \left( \frac{n \cdot n(i, j)}{n_{true}(i) \, n_{pred}(j)} \right)
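To make the definition concrete, the following sketch (using two small, hypothetical labelings, not the dataset discussed in this chapter) computes the double sum over the contingency counts explicitly and checks that it matches scikit-learn's mutual_info_score (which returns MI in nats):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Hypothetical ground truth and prediction (any two labelings work)
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])

n = len(y_true)
mi = 0.0
for i in np.unique(y_true):
    for j in np.unique(y_pred):
        # Contingency count n(i, j) and the marginals n_true(i), n_pred(j)
        n_ij = np.sum((y_true == i) & (y_pred == j))
        if n_ij > 0:
            n_i = np.sum(y_true == i)
            n_j = np.sum(y_pred == j)
            mi += (n_ij / n) * np.log(n * n_ij / (n_i * n_j))

# The explicit sum agrees with scikit-learn's implementation
print(mi, mutual_info_score(y_true, y_pred))
```

Note also that relabeling the clusters (for example, swapping 0 and 1 in y_pred) leaves the value unchanged, which is the permutation invariance mentioned above.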
The functions are the same as previously defined. When MI → 0, n(i, j) → ntrue(i)npred(j)/n, which is equivalent to saying that p(i, j) → ptrue(i)ppred(j). Hence, this condition means that Ytrue and Ypred are statistically independent and there's no agreement. On the other hand, with some simple manipulations, we can rewrite MI as:

MI(Y_{true}; Y_{pred}) = H(Y_{pred}) - H(Y_{pred} | Y_{true})

Hence, as H(Ypred|Ytrue) ≤ H(Ypred), when the knowledge of the ground truth reduces the uncertainty about Ypred, it follows that H(Ypred|Ytrue) → 0 and the MI is maximized. For our purposes, it's preferable to consider a normalized version (bounded between 0 and 1) that is also adjusted for chance (that is, taking into account the possibility that an agreement is due to chance). The AMI score, whose complete derivation is non-trivial and beyond the scope of this book, is defined as:
AMI(Y_{true}, Y_{pred}) = \frac{MI(Y_{true}; Y_{pred}) - E[MI(Y_{true}; Y_{pred})]}{\max(H(Y_{true}), H(Y_{pred})) - E[MI(Y_{true}; Y_{pred})]}

In the previous formula, E[MI] is the expected mutual information computed over all random assignments sharing the same marginal counts.
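As a quick sanity check of the two defining properties of this score (the adjustment for chance and the invariance to label permutations), we can compare the adjusted and unadjusted scores on purely random labelings and on a relabeled perfect assignment (both hypothetical, generated here only for illustration):

```python
import numpy as np
from sklearn.metrics import (adjusted_mutual_info_score,
                             normalized_mutual_info_score)

rng = np.random.RandomState(1000)
y_true = rng.randint(0, 2, size=500)
y_rand = rng.randint(0, 2, size=500)  # assignments due purely to chance

# The unadjusted normalized MI of a random labeling stays slightly positive,
# while the adjusted score drops to approximately 0
print('NMI (random): {}'.format(normalized_mutual_info_score(y_true, y_rand)))
print('AMI (random): {}'.format(adjusted_mutual_info_score(y_true, y_rand)))

# A permuted but otherwise perfect assignment still scores 1
print('AMI (permuted): {}'.format(adjusted_mutual_info_score(y_true, 1 - y_true)))
```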
This value is equal to 0 in the case of the total absence of agreement and equal to 1 when Ytrue and Ypred completely agree (also in the presence of permutations). For the Breast Cancer Wisconsin dataset and K=2, we obtain the following:

from sklearn.metrics import adjusted_mutual_info_score

print('Adj. Mutual info: {}'.format(adjusted_mutual_info_score(kmdff['diagnosis'], kmdff['prediction'])))

The output is as follows:

Adj. Mutual info: 0.42151741598216214

The agreement is moderate and consistent with the other measures. Assuming the presence of permutations and the possibility of chance assignments, Ytrue and Ypred share a medium amount of information because, as we have discussed, K-means correctly assigns all the samples where the probability of overlap is negligible, while it tends to label as benign many malignant samples that lie on the boundary between the two clusters (conversely, it makes no wrong assignments for the benign samples). Without any further indication, this index also suggests checking other clustering algorithms that can manage non-convex clusters, because the lack of shared information is mainly due to the impossibility of capturing complex geometries using standard balls (in particular, in the subspace where the overlap is more significant).
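For instance, on a synthetic dataset with two non-convex (half-moon) clusters, not the Breast Cancer Wisconsin dataset, an algorithm such as Spectral Clustering typically reaches a much higher AMI than K-means. The snippet below is only an illustrative sketch; the parameters (for example, n_neighbors=10 and the noise level) are arbitrary choices:

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_mutual_info_score

# Two interleaved half-moons: non-convex clusters that standard balls cannot capture
X, y = make_moons(n_samples=400, noise=0.05, random_state=1000)

km = KMeans(n_clusters=2, n_init=10, random_state=1000).fit_predict(X)
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=1000).fit_predict(X)

print('K-means AMI: {}'.format(adjusted_mutual_info_score(y, km)))
print('Spectral AMI: {}'.format(adjusted_mutual_info_score(y, sc)))
```

The gap between the two scores reflects exactly the limitation discussed above: K-means partitions the plane with convex regions, while the graph-based affinity lets Spectral Clustering follow the curved cluster shapes.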