Hands-On Unsupervised Learning with Python

Homogeneity score

The homogeneity score is complementary to the completeness score and is based on the assumption that a cluster must contain only samples having the same true label. It is defined as:

h = 1 − H(Ytrue|Ypred) / H(Ytrue)

Analogously to the completeness score, when H(Ytrue|Ypred) → H(Ytrue), the assignments have no impact on the conditional entropy. The uncertainty is therefore not reduced after the clustering (for example, every cluster contains samples belonging to all classes) and h → 0. Conversely, when H(Ytrue|Ypred) → 0, h → 1, because knowledge of the predictions has reduced the uncertainty about the true assignments and the clusters contain almost exclusively samples with the same label. It's important to remember that this score alone is not enough, because it doesn't guarantee that a cluster contains all samples xi ∈ X with the same true label. That's why the homogeneity score is always evaluated together with the completeness score.
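To make the role of the two entropies concrete, the following is a minimal sketch that computes the score directly from label counts using only the standard library. The helper names (entropy, conditional_entropy, and homogeneity) and the toy labels are illustrative assumptions, not part of scikit-learn or of the dataset used in this chapter:

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural logarithm) of a sequence of labels."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def conditional_entropy(y_true, y_pred):
    """H(Ytrue|Ypred): entropy of the true labels inside each cluster,
    weighted by the relative size of the cluster."""
    n = len(y_true)
    clusters = {}
    for t, p in zip(y_true, y_pred):
        clusters.setdefault(p, []).append(t)
    return sum((len(m) / n) * entropy(m) for m in clusters.values())


def homogeneity(y_true, y_pred):
    """h = 1 - H(Ytrue|Ypred) / H(Ytrue)."""
    h_true = entropy(y_true)
    if h_true == 0.0:
        # Only one true class: every cluster is trivially homogeneous
        return 1.0
    return 1.0 - conditional_entropy(y_true, y_pred) / h_true


# Hypothetical toy labels: cluster 0 is pure, cluster 1 mixes both classes
y_true = [0, 0, 0, 1, 1, 1]
y_mixed = [0, 0, 1, 1, 1, 1]
h_mixed = homogeneity(y_true, y_mixed)

# Degenerate assignment: every sample is placed in its own cluster
y_singletons = list(range(len(y_true)))
h_singletons = homogeneity(y_true, y_singletons)

print('Homogeneity (mixed cluster): {}'.format(h_mixed))
print('Homogeneity (singletons): {}'.format(h_singletons))
```

The singleton assignment reaches h = 1 even though it says nothing useful about the class structure, which illustrates why the score has to be read together with the completeness score.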

For the Breast Cancer Wisconsin dataset and K=2, we obtain the following:

from sklearn.metrics import homogeneity_score

print('Homogeneity: {}'.format(homogeneity_score(kmdff['diagnosis'], kmdff['prediction'])))

The corresponding output is as follows:

Homogeneity: 0.42229071246999117

This value (in particular, for K=2) confirms our initial analysis. At least one cluster (the one with the majority of benign samples) is not completely homogeneous, because it contains samples belonging to both classes. However, as the value is not very close to 0, we can be sure that the assignments are partially correct. Considering both values, h and c, we can deduce that K-means is not performing extremely well (probably because of non-convexity), but, at the same time, it is able to correctly separate all those samples whose nearest cluster distance is above a specific threshold. It goes without saying that, with knowledge of the ground truth, we cannot easily accept K-means and we should look for another algorithm that is able to yield both h and c → 1.
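As a self-contained way to inspect h and c side by side, the following sketch reruns the evaluation from scratch. It assumes the copy of the dataset bundled with scikit-learn (load_breast_cancer), raw unscaled features, and an arbitrary random_state, so the printed values may differ from the ones reported above for the kmdff data frame:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import completeness_score, homogeneity_score

# Assumption: the scikit-learn copy of the dataset stands in for the
# data frame used in the text
X, y_true = load_breast_cancer(return_X_y=True)

# K=2 as in the text; n_init and random_state are illustrative choices
km = KMeans(n_clusters=2, n_init=10, random_state=1000)
y_pred = km.fit_predict(X)

h = homogeneity_score(y_true, y_pred)
c = completeness_score(y_true, y_pred)

print('Homogeneity: {}'.format(h))
print('Completeness: {}'.format(c))
```

Both scores are invariant to a permutation of the cluster indexes, so there is no need to align the predicted labels with the true ones before the comparison.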