FAQ

  1. Clustering evaluation scores: Rand index, homogeneity, completeness and V-measure
  • Note that in unsupervised clustering we usually do not have ground-truth labels, so we cannot simply compute the accuracy of the algorithm.
  • These measures are intended to measure the quality of a clustering result. They fall into two families: those that compare the results of different clustering techniques, and those that check specific properties of a clustering result.

Comparing different clustering strategies/techniques

These were not used for an actual comparison in the lab because we only studied one algorithm.

  • In the lab we tried the Rand index to show what the libraries offer.
  • What we showed is that the Rand index can be used to measure the agreement between clusterings obtained with different approaches or criteria.
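To make the idea of "agreement" concrete, here is a minimal from-scratch sketch of the (unadjusted) Rand index: the fraction of sample pairs that both clusterings treat the same way, either together in both or apart in both. The function name rand_index is hypothetical and only for illustration; recent scikit-learn versions provide metrics.rand_score and the chance-corrected metrics.adjusted_rand_score for real use.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = 0
    for i, j in pairs:
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        # A pair is an agreement when both clusterings put the two samples
        # together, or both keep them apart.
        if same_in_a == same_in_b:
            agreements += 1
    return agreements / len(pairs)


# Same partition, label names swapped: all 6 pairs agree.
ri_swapped = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Everything merged into one cluster: only 2 of the 6 pairs agree.
ri_merged = rand_index([0, 0, 1, 1], [0, 0, 0, 0])
print("%.3f %.3f" % (ri_swapped, ri_merged))
```

Note that the score depends only on how samples are grouped, not on the label names, which is why the swapped labelling still scores 1.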

Checking on specific properties of the clustering 

  • A clustering result satisfies a homogeneity criterion if all of its clusters contain only data points which are members of the same original (single) class. 
  • A clustering result satisfies a completeness criterion if all the data points that are members of a given class are elements of the same predicted cluster.

Note that both scores take real values between 0.0 and 1.0, larger values being desirable.

For the example shown in the lab, we considered two toy clustering sets (i.e., original and predicted) with four samples and two labels.

from sklearn import metrics
print("%.3f" % metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))

In the ground-truth classification there are two clusters, labelled 0 and 1. The data points G1, G2, G3, G4 (named after their position in the list) are distributed over the two clusters:

Cluster 0 – G1, G2 : Cluster 1 – G3, G4

The predicted classification organises the data points all in only one cluster:

Cluster 0: G1, G2, G3, G4

Homogeneity is 0 ( 🙁 ). The samples of original cluster 0 (those in the first and second positions of the first parameter [0, 0, 1, 1]) belong to predicted cluster 0 (see [0, 0, 0, 0]). Yet the samples of original cluster 1 (those in the third and fourth positions of [0, 0, 1, 1]) also belong to predicted cluster 0, so the prediction failed to separate them into two clusters. According to the definition of homogeneity, cluster 0 may only contain elements of a single original class (i.e., cluster), yet G3 and G4 do not belong to the same class as G1 and G2.

Then we measured completeness using again two toy clustering sets.

print(metrics.completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))

Completeness is 1 ( 🙂 ) because all the samples from the original cluster with label 0 (those in the first and second positions of [0, 0, 1, 1]) go into the same predicted cluster, with label 1 (see [1, 1, 0, 0]). Likewise, all the samples from the original cluster with label 1 (those in the third and fourth positions of [0, 0, 1, 1]) go into the same predicted cluster, with label 0 (see [1, 1, 0, 0]). Thus the second clustering result grouped the samples that belong to the same class into the same cluster.

What about homogeneity? It is also 1, because G1 and G2 are grouped together in cluster 1 and G3 and G4 in cluster 0, which satisfies the definition.

Non-perfect labelings that further split classes into more clusters can be perfectly homogeneous:

print(metrics.homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3]))

Homogeneity = 1 because every cluster contains elements of a single class.
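The V-measure mentioned in the title combines the two scores: with default settings it is the harmonic mean of homogeneity and completeness. The following is a from-scratch sketch of the entropy-based definitions, written only to show what the library calls compute. The helper names are hypothetical, and it is an illustration rather than the scikit-learn implementation; in practice use metrics.homogeneity_score, metrics.completeness_score and metrics.v_measure_score.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels), natural logarithm."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """Conditional entropy H(labels | given)."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity(classes, clusters):
    # 1 - H(classes | clusters) / H(classes); defined as 1 when H(classes) is 0.
    h_classes = entropy(classes)
    if h_classes == 0:
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_classes


def completeness(classes, clusters):
    # Completeness is homogeneity with the two roles swapped.
    return homogeneity(clusters, classes)


def v_measure(classes, clusters):
    h, c = homogeneity(classes, clusters), completeness(classes, clusters)
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)


truth = [0, 0, 1, 1]
# The three predicted labelings used in the examples above.
for predicted in ([0, 0, 0, 0], [1, 1, 0, 0], [0, 1, 2, 3]):
    print(predicted,
          "h=%.3f" % homogeneity(truth, predicted),
          "c=%.3f" % completeness(truth, predicted),
          "v=%.3f" % v_measure(truth, predicted))
```

The over-split labelling [0, 1, 2, 3] is perfectly homogeneous but only half complete, so its V-measure falls between the two scores.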

2. Explaining the meaning of rank

totalSum.rank(ascending = False, method = 'dense')

Parameters:
axis: 0 or ‘index’ to rank along rows, 1 or ‘columns’ to rank along columns.
method: String (‘average’, ‘min’, ‘max’, ‘first’, ‘dense’) that tells pandas how to rank equal values. The default is ‘average’, which assigns tied values the average of their ranks.
numeric_only: Boolean; for DataFrames, only numeric columns are ranked if it is True.
na_option: String (‘keep’, ‘top’, ‘bottom’) that sets how null values, if any, are ranked.
ascending: Boolean; elements are ranked in ascending order if True.
pct: Boolean; ranks are returned in percentile form if True.

Return type: Series with the rank of every element of the calling Series.

N.B. The ‘dense’ method behaves like DENSE_RANK in SQL. DENSE_RANK computes the rank of a row in an ordered group of rows and returns the rank as a number. The ranks are consecutive integers beginning with 1, and the largest rank value is the number of unique values returned by the query.
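The difference between the tie-handling methods is easiest to see on a small example. The sketch below uses made-up values (the Series total_sum here is hypothetical, not the lab's totalSum) and ranks them in descending order with three of the methods listed above.

```python
import pandas as pd

# Hypothetical totals; the value 20 appears twice to create a tie.
total_sum = pd.Series([10, 20, 20, 5], index=["a", "b", "c", "d"])

# 'dense': tied values share a rank and the next rank follows without a gap.
dense = total_sum.rank(ascending=False, method="dense")
# 'average' (the default): tied values get the mean of the ranks they occupy.
average = total_sum.rank(ascending=False, method="average")
# 'min': tied values get the lowest rank they occupy, leaving a gap afterwards.
minimum = total_sum.rank(ascending=False, method="min")

print(pd.DataFrame({"value": total_sum, "dense": dense,
                    "average": average, "min": minimum}))
```

With method = 'dense' the largest rank (3 here) equals the number of distinct values, as described in the note above.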

3. How to get the index from a Series. For example, given the DataFrame df defined as follows:

df = pd.DataFrame(data) # Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes
df.columns = ['age', 'type_employer', 'fnlwgt', 'education', 
"education_num","marital", "occupation", "relationship", "race","sex",
"capital_gain", "capital_loss", "hr_per_week","country","income"]

The goal is to find the most common age within a group of people. The following expression counts the number of people of each age.

temp = df.groupby('age').size() # grouping by age

To get the most common age in the sample out of the Series “temp”, use idxmax(), which returns the index label of the largest value:

x = temp.idxmax()  # index label (the age) with the largest count
x, temp.get(x)  # the age and the number of people of that age
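Since the snippet above depends on the lab's data, here is a self-contained sketch of the same steps on a tiny made-up DataFrame (the ages are invented for illustration and only the 'age' column is included).

```python
import pandas as pd

# Hypothetical sample with a single 'age' column.
df = pd.DataFrame({"age": [25, 30, 25, 40, 25, 30]})

# One row per distinct age; the value is the number of people of that age.
temp = df.groupby("age").size()

# idxmax() returns the index label (the age) holding the largest count.
most_common_age = temp.idxmax()
# get() looks the count up by that label.
count = temp.get(most_common_age)

print(most_common_age, count)
```

If several ages tie for the largest count, idxmax() returns the first one in index order.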