**Clustering measuring scores: Rand Index, Homogeneity, Completeness and V-measure**

- Note that in unsupervised clustering, we do not have ground truth labels that would allow us to compute the accuracy of the algorithm.
- These measures assess the quality of a clustering result. They fall into two families: those that compare the results of different clustering techniques, and those that check specific properties of a single clustering result.

#### Comparing different clustering strategies/techniques

(*These were not used in the lab because we only studied one algorithm*).

- In the lab we tested the Rand index to show what the libraries offer.
- What we showed is that the Rand index can be used to measure the agreement between **different clusterings** obtained by **different approaches or criteria**.
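As an illustration of comparing two clusterings, here is a minimal sketch using scikit-learn. It assumes a scikit-learn version that provides `metrics.rand_score` (0.24 or later); the label lists are toy data chosen for this example, not the lab's data.

```python
from sklearn import metrics

# Three toy clusterings of the same four samples (illustrative data only).
labels_a = [0, 0, 1, 1]
labels_b = [1, 1, 0, 0]   # same grouping as labels_a, cluster names swapped
labels_c = [0, 0, 0, 0]   # everything in a single cluster

# The Rand index is the fraction of sample pairs on which two clusterings agree.
same_grouping = metrics.rand_score(labels_a, labels_b)
one_big_cluster = metrics.rand_score(labels_a, labels_c)

# The adjusted Rand index corrects the score for chance agreement.
adjusted = metrics.adjusted_rand_score(labels_a, labels_c)

print("%.3f %.3f %.3f" % (same_grouping, one_big_cluster, adjusted))
```

The score does not depend on the cluster names: `labels_a` and `labels_b` describe the same grouping and therefore agree on every pair.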

#### Checking on specific properties of the clustering

- A clustering result satisfies the **homogeneity criterion** if all of its clusters contain **only** data points that are members of the **same original (single) class**.
- A clustering result satisfies the **completeness criterion** if **all** the data points that are members of a given class are elements of the **same predicted cluster**.

Note that both scores take real values between 0.0 and 1.0, **larger values being desirable.**

For the example shown in the lab, we considered **two** **toy clustering sets** (i.e., original and predicted) with four samples and two labels.

print("%.3f" % metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))

In the ground-truth classification (the first argument) there are two clusters, labelled 0 and 1 respectively. The data points G1, G2, G3, G4, named according to their position in the list, are distributed across the two clusters:

Cluster 0 – G1, G2 : Cluster 1 – G3, G4

The predicted classification (the second argument) puts all the data points into a single cluster:

Cluster 0: G1, G2, G3, G4

Homogeneity is 0 ( **🙁** ) because the samples with original label 0 (those in the first and second positions of the first parameter, [**0, 0**, 1, 1]) belong to predicted cluster 0 (see [**0, 0**, 0, 0]), and the samples with original label 1 (those in the third and fourth positions of the first parameter, [0, 0, **1, 1**]) also belong to predicted cluster 0 (see [0, 0, **0, 0**]). So the predicted clustering could not separate them into two clusters ( **🙁** ). According to the definition of homogeneity, cluster 0 may only contain elements of the same original class (i.e., cluster), **yet G3, G4 do not belong to the same class as G1, G2.**
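The reasoning above can be checked by listing which original classes end up in each predicted cluster. This is a minimal sketch in plain Python; the helper name `classes_per_cluster` is hypothetical and not part of scikit-learn, and it only tests whether the clustering is perfectly homogeneous, it does not compute the score itself.

```python
from collections import defaultdict

def classes_per_cluster(labels_true, labels_pred):
    """Map each predicted cluster to the set of original classes it contains."""
    members = defaultdict(set)
    for true_label, pred_label in zip(labels_true, labels_pred):
        members[pred_label].add(true_label)
    return dict(members)

# Same toy example as above: two original classes, one predicted cluster.
mixed = classes_per_cluster([0, 0, 1, 1], [0, 0, 0, 0])

# A clustering is perfectly homogeneous only if every cluster holds one class.
perfectly_homogeneous = all(len(classes) == 1 for classes in mixed.values())

print(mixed, perfectly_homogeneous)
```

Predicted cluster 0 contains both original classes, so the check fails, which matches the homogeneity score of 0.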

Then we measured completeness using again two toy clustering sets.

print(metrics.completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))

Completeness is 1 ( 🙂 ) because all the samples with original label 0 (those in the first and second positions, [**0, 0,** 1, 1]) go into the same predicted cluster, labelled 1 (see [**1, 1**, 0, 0]). Likewise, all the samples with original label 1 (those in the third and fourth positions, [0, 0, **1, 1**]) go into the same predicted cluster, labelled 0 (see [1, 1, **0, 0**]). Thus, the second clustering result grouped the samples that belonged to the same class into the same cluster ( **🙂** ).

What about **homogeneity**? It is also 1, because G1 and G2 are grouped together in cluster 1 and G3 and G4 in cluster 0, so every predicted cluster contains members of a single class, which satisfies the definition.

Non-perfect labelings that further split classes into more clusters can be perfectly homogeneous:

print(metrics.homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3]))

Homogeneity = 1 because every cluster contains elements of a single class.
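The three toy examples can be run side by side, together with the V-measure named in the title, which scikit-learn computes as the harmonic mean of homogeneity and completeness. A minimal sketch, assuming `metrics` refers to `sklearn.metrics`; the dictionary keys are descriptive names invented for this example.

```python
from sklearn import metrics

truth = [0, 0, 1, 1]

# The three predicted labelings discussed above (keys are illustrative names).
examples = {
    "one big cluster": [0, 0, 0, 0],
    "labels swapped": [1, 1, 0, 0],
    "every point alone": [0, 1, 2, 3],
}

scores = {}
for name, predicted in examples.items():
    # Returns the tuple (homogeneity, completeness, v_measure).
    h, c, v = metrics.homogeneity_completeness_v_measure(truth, predicted)
    scores[name] = (h, c, v)
    print("%s: homogeneity=%.3f completeness=%.3f v_measure=%.3f" % (name, h, c, v))
```

Splitting every point into its own cluster keeps homogeneity at 1 but halves completeness, so the V-measure penalises it; putting everything into one cluster gives completeness 1 but homogeneity 0, so the V-measure is 0.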

**2. Explaining the meaning of rank**

totalSum.rank(ascending = False, method = 'dense')

**Parameters:**

**axis:** 0 or 'index' to rank along the rows, 1 or 'columns' to rank along the columns.

**method:** Takes a string ('average', 'min', 'max', 'first', 'dense') that tells pandas how to rank tied values. The default is 'average', which assigns the average of the ranks to the tied values.

**numeric_only:** Takes a boolean value; for DataFrames, only numeric columns are ranked if it is True.

**na_option:** Takes one of three strings ('keep', 'top', 'bottom') to set the position of null values, if any, in the passed Series.

**ascending:** Boolean value; ranks in ascending order if True.

**pct:** Boolean value; returns the ranks as percentiles if True.

**Return type:** Series with the rank of every element of the caller Series.

N.B. `method = 'dense'` behaves like DENSE_RANK in SQL. DENSE_RANK computes the **rank** of a row in an ordered group of rows and returns the **rank** as a NUMBER. The **ranks** are consecutive integers beginning with 1, with no gaps after ties. The largest **rank** value is the number of unique values returned by the query.
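To see what `method = 'dense'` changes, the sketch below ranks a small Series that contains a tie using three of the methods. The Series `total_sum` and its values are a hypothetical stand-in for `totalSum`, whose real contents are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for totalSum: two entries tie for the largest value.
total_sum = pd.Series([10, 30, 30, 20], index=["a", "b", "c", "d"])

# Descending ranks: the largest value gets rank 1.
dense = total_sum.rank(ascending=False, method="dense")      # no gap after the tie
minimum = total_sum.rank(ascending=False, method="min")      # tie takes the lowest rank, then a gap
average = total_sum.rank(ascending=False, method="average")  # tie shares the mean rank

print(pd.DataFrame({"value": total_sum, "dense": dense,
                    "min": minimum, "average": average}))
```

With 'dense' the ranks are 3, 1, 1, 2, so the largest rank equals the number of distinct values; with 'min' they are 4, 1, 1, 3, and with 'average' they are 4, 1.5, 1.5, 3.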

**3. How to get the index from a Series. For example, given the DataFrame df defined as follows**

df = pd.DataFrame(data) # Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes

df.columns = ['age', 'type_employer', 'fnlwgt', 'education', "education_num","marital", "occupation", "relationship", "race","sex", "capital_gain", "capital_loss", "hr_per_week","country","income"]

The goal is to find the most common age in the sample, that is, the index label of the largest group. The following expression counts the number of people of each age; the result is a Series whose index is the age and whose values are the counts.

temp = df.groupby('age').size() # grouping by age

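To get from the Series `temp` the most common age in the sample, together with the number of people of that age:

x = temp.idxmax() # index label (age) with the largest count

x, temp.get(x) # (most common age, number of people of that age)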
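The same steps can be tried end to end on a small table. A minimal sketch; the toy DataFrame below is invented for illustration and only carries the `age` column, not the full set of columns defined above.

```python
import pandas as pd

# Hypothetical toy data; only the 'age' column matters for this example.
df = pd.DataFrame({"age": [25, 30, 25, 40, 25, 30]})

# Series whose index is the age and whose values are the group sizes.
temp = df.groupby("age").size()

# idxmax returns the index label of the largest value, not its position.
most_common_age = temp.idxmax()
count = temp.get(most_common_age)

print(most_common_age, count)  # 25 3
```

If several ages share the largest count, `idxmax` returns the first one in index order.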