FAQ

  1. Clustering evaluation scores: Rand index, homogeneity, completeness and V-measure
  • Note that in unsupervised clustering, we do not have ground truth labels that would allow us to compute the accuracy of the algorithm.
  • These measures are intended to assess the quality of a clustering result. They fall into two families: those that compare the results of different clustering techniques, and those that check specific properties of a single clustering result.

Comparing different clustering strategies/techniques

These were not used in the lab to compare algorithms, because we only studied one algorithm.

  • In the lab we tested the Rand index only to show what the libraries offer.
  • What we showed is that the Rand index can be used to measure how closely clusterings obtained by different approaches or criteria coincide.
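
To make the idea of "coincidence" concrete, here is a minimal sketch of the plain (unadjusted) Rand index: the fraction of sample pairs that the two clusterings treat the same way, either together in both or apart in both. The function name rand_index is hypothetical and written only for illustration; it is not the code used in the lab, where a library function such as scikit-learn's metrics.adjusted_rand_score would normally be preferred.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        # A pair agrees if it is grouped together in both clusterings
        # or separated in both clusterings.
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Same grouping with swapped label names: perfect agreement.
print("%.3f" % rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.000
# Everything merged into one cluster: only 2 of the 6 pairs agree.
print("%.3f" % rand_index([0, 0, 1, 1], [0, 0, 0, 0]))  # 0.333
```

Note that the score depends only on how the samples are grouped, not on the label values used for the groups.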

Checking on specific properties of the clustering 

  • A clustering result satisfies a homogeneity criterion if all of its clusters contain only data points which are members of the same original (single) class. 
  • A clustering result satisfies a completeness criterion if all the data points that are members of a given class are elements of the same predicted cluster.

Note that both scores take real values between 0.0 and 1.0, with larger values being desirable.

For the example shown in the lab, we considered two toy clustering sets (i.e., original and predicted) with four samples and two labels.

from sklearn import metrics
print("%.3f" % metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))

Homogeneity is 0 ( 🙁 ) because the single predicted cluster mixes both original classes. The samples of original class 0 (the first and second positions of the first parameter [0, 0, 1, 1]) belong to predicted cluster 0 (see [0, 0, 0, 0]), but so do the samples of original class 1 (the third and fourth positions of [0, 0, 1, 1]). The predicted clustering therefore failed to separate them into two clusters ( 🙁 ).

Then we measured completeness using again two toy clustering sets.

print("%.3f" % metrics.completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))

Completeness is 1 ( 🙂 ) because all the samples of original class 0 (the first and second positions of [0, 0, 1, 1]) go into the same predicted cluster, labelled 1 (see [1, 1, 0, 0]), and all the samples of original class 1 (the third and fourth positions of [0, 0, 1, 1]) go into the same predicted cluster, labelled 0. The label values themselves do not matter, only the grouping. Thus, the predicted clustering put the samples that belonged to the same class into the same cluster ( 🙂 ).
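
To see where these values come from, the sketch below computes homogeneity, completeness and the V-measure (their harmonic mean) from the standard entropy-based definitions, using only the Python standard library. The function names are hypothetical and chosen for illustration; this is not the library's implementation, and in practice the scikit-learn functions used above are the ones to call.

```python
from collections import Counter
from math import log


def entropy(labels):
    """H(labels): entropy of the label distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """H(labels | given): uncertainty left in labels once given is known."""
    n = len(labels)
    joint = Counter(zip(labels, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def homogeneity(classes, clusters):
    """1 when every cluster contains members of a single class."""
    h_classes = entropy(classes)
    if h_classes == 0:  # only one class: trivially homogeneous
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_classes


def completeness(classes, clusters):
    """1 when every class ends up inside a single cluster."""
    # Completeness is homogeneity with the two roles swapped.
    return homogeneity(clusters, classes)


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h, c = homogeneity(classes, clusters), completeness(classes, clusters)
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)


print("%.3f" % homogeneity([0, 0, 1, 1], [0, 0, 0, 0]))   # 0.000
print("%.3f" % completeness([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.000
print("%.3f" % v_measure([0, 0, 1, 1], [1, 1, 0, 0]))     # 1.000
```

A single merged cluster is perfectly complete but not homogeneous at all, which is why the V-measure, combining both, is the more informative single number.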

  2. How to get the index label from a Series. For example, given the DataFrame df defined as follows:

df = pd.DataFrame(data)  # two-dimensional, size-mutable, potentially heterogeneous table with labeled axes
df.columns = [
    "age", "type_employer", "fnlwgt", "education", "education_num",
    "marital", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hr_per_week", "country", "income",
]

Suppose we want the most common age in the sample, together with the number of people who have that age. The following expression counts the number of people of each age.

temp = df.groupby('age').size() # grouping by age

To get the age shared by the largest number of people out of the Series “temp”, use idxmax(), which returns the index label of the largest value:

x = temp.idxmax()  # index label (the age) with the largest count
x, temp.get(x)     # the most common age and how many people have it
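
Since the snippet above depends on the lab's data, here is a small self-contained sketch of the same steps. The six ages are made up purely for illustration and are not taken from the lab's dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the lab's DataFrame.
df = pd.DataFrame({"age": [25, 32, 25, 47, 32, 25]})

counts = df.groupby("age").size()   # Series: index = age, value = number of people
most_common_age = counts.idxmax()   # index label of the largest count
n_people = counts.loc[most_common_age]

print(most_common_age, n_people)    # 25 3
```

The key point is that idxmax() returns the index label (here, the age), whereas max() would return the count itself.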