**Clustering evaluation scores: Rand Index, Homogeneity, Completeness and V-measure**

- Note that in unsupervised clustering, we do not have ground truth labels that would allow us to compute the accuracy of the algorithm.
- These measures are intended to assess the quality of a clustering result. They can be organised into two families: those that compare the results of different clustering techniques, and those that check specific properties of a single clustering result.

#### Comparing different clustering strategies/techniques

(*These were not used in the lab because we only studied one algorithm.*)

- In the lab we tested the Rand index to show the possibilities of the libraries.
- What we showed is that the Rand index can be used to measure the agreement between **different clusterings** obtained by **different approaches or criteria**.
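A minimal sketch of this comparison, assuming scikit-learn's `metrics.rand_score` and `metrics.adjusted_rand_score` (the toy label vectors are illustrative, not from the lab):

```python
from sklearn import metrics

a = [0, 0, 1, 1]  # clustering produced by one approach
b = [1, 1, 0, 0]  # clustering produced by another approach

# The Rand index ignores label names: identical partitions score 1.0,
# even though the cluster labels are swapped.
print(metrics.rand_score(a, b))  # 1.0

# A clustering that splits one pair disagrees on 1 of the 6 sample pairs.
print(metrics.rand_score(a, [0, 0, 1, 2]))  # 5/6 ≈ 0.833

# The adjusted Rand index additionally corrects for chance agreement.
print(metrics.adjusted_rand_score(a, b))  # 1.0
```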

#### Checking on specific properties of the clustering

- A clustering result satisfies a **homogeneity criterion** if all of its clusters contain **only** data points which are members of the **same original (single) class**.
- A clustering result satisfies a **completeness criterion** if **all** the data points that are members of a given class are elements of the **same predicted cluster**.

Note that both scores take real values between 0.0 and 1.0, **larger values being desirable.**

For the example shown in the lab, we considered **two** **toy clustering sets** (i.e., original and predicted) with four samples and two labels.

from sklearn import metrics # provides the clustering scores used below

print("%.3f" % metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))

Homogeneity is 0 ( **🙁** ) because the samples from original class 0 (those in the first and second positions of the first parameter, [**0, 0**, 1, 1]) fall into predicted cluster 0 (see [**0, 0**, 0, 0]), yet the samples from original class 1 (those in the third and fourth positions, [0, 0, **1, 1**]) also fall into predicted cluster 0 ([0, 0, **0, 0**]). So the predicted clustering could not organise them into two clusters ( **🙁** ).
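As a cross-check (not part of the lab handout): the same degenerate single-cluster prediction is perfectly *complete*, because every original class ends up inside one predicted cluster. This illustrates that the two criteria are independent:

```python
from sklearn import metrics

# The single-cluster prediction that scored 0 on homogeneity
# scores 1 on completeness: each class sits entirely inside one cluster.
print("%.3f" % metrics.completeness_score([0, 0, 1, 1], [0, 0, 0, 0]))  # 1.000
```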

Then we measured completeness using again two toy clustering sets.

print("%.3f" % metrics.completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))

Completeness is 1 ( 🙂 ) because all the samples from the original class 0 (those in the first and second positions, [**0, 0,** 1, 1]) go into the same predicted cluster with label 1 (see [**1, 1**, 0, 0]). Also, all the samples from the original class 1 (those in the third and fourth positions, [0, 0, **1, 1**]) go into the same predicted cluster with label 0 (see [1, 1, **0, 0**]). Thus, the second clustering result could group the samples that belonged to the same class in the same cluster ( **🙂** ).
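The V-measure named in the title is the harmonic mean of homogeneity and completeness. A short sketch on the same toy sets, assuming scikit-learn's `metrics.v_measure_score`:

```python
from sklearn import metrics

# Perfectly homogeneous AND complete (labels merely swapped): V-measure = 1.
print("%.3f" % metrics.v_measure_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.000

# Homogeneity 0 drags the harmonic mean to 0, despite completeness 1.
print("%.3f" % metrics.v_measure_score([0, 0, 1, 1], [0, 0, 0, 0]))  # 0.000
```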

**2. How to get the index from a Series. For example, given the DataFrame `df` defined as follows:**

df = pd.DataFrame(data) # Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes

df.columns = ['age', 'type_employer', 'fnlwgt', 'education', 'education_num', 'marital', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hr_per_week', 'country', 'income']

Get the index corresponding to the most popular age within a group of people, i.e., the age shared by the largest number of them. The following expression counts the number of people per age:

temp = df.groupby('age').size() # grouping by age

To extract from the Series `temp` the most common age in the sample, together with how many people have it:

x = temp.idxmax() # index (age) with the largest count

x, temp.get(x) # the most common age and how many people have it
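Putting the steps together as a runnable sketch (the `data` here is a hypothetical toy sample, not the lab's census dataset):

```python
import pandas as pd

# Hypothetical toy sample; the lab used the census columns listed above.
data = {"age": [25, 32, 25, 40, 25, 32]}
df = pd.DataFrame(data)

temp = df.groupby("age").size()  # Series mapping each age to its head count
x = temp.idxmax()                # index (age) with the largest count
print(x, temp.get(x))            # → 25 3
```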