# Unsupervised learning comparing clustering methods

### Objective

Discuss different techniques for unsupervised learning and will focus on several clustering techniques.

• Consider basic concepts like distance and similarity, taxonomy of clustering techniques and goodness of clustering quality.
• Explore three basic clustering techniques, namely, K-means, spectral clustering and hierarchical clustering.
• Illustrate the use of clustering techniques on a real problem: defining groups of countries according to their economic expenditure on education.

### 2. Clustering

Partition unlabeled examples into disjoint subsets of clusters, such that:

• Examples within a cluster are similar (high intra-class similarity).
• Examples in different clusters are different (low inter-class similarity).
• It can help in discovering new categories in an unsupervised manner (no sample category labels provided).

Clustering will help us to analyse and get insight of the data, but the quality of the partition depends on the application and the analyst.

#### 2.1 Similarity and distance

The notion of similarity is a tough one, however we can use the notion
of distance as a surrogate. The most well-known instantiations of this metric are:

• Euclidean distance.
• Manhattan distance.
• Max-distance.

#### 2.2 What is a good clustering?

The Rand index or Rand measure (named after William M. Rand) is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the  adjusted Rand index. From a mathematical standpoint, Rand index is related to the accuracy, but is applicable even when class labels are not used.

Given a set of n elements :math:S = {o1, …, on} and two partitions of S to compare, X = {X1, …, Xr}, a partition of S into r subsets, and Y = {Y1, …, Ys}, a partition of S into s subsets. Let us use the annotations as follows:

• a, the number of pairs of elements in S that are in the same set in X and in the same set in Y
• b, the number of pairs of elements in S that are in different sets in X and in different sets in Y
• c, the number of pairs of elements in S that are in the same set in X and in different sets in Y
• d, the number of pairs of elements in S that are in different sets in X and in the same set in Y.

The Rand index, R is: The adjusted Rand index, AR is: A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of the same original (a single) class.

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same automatic cluster.

Both scores have positive values between 0.0 and 1.0, larger values being desirable.

```import matplotlib.pylab as plt
```
```from sklearn import metrics
metrics.homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0])```
`print("%.3f" % metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 1, 2]))`
`print("%.3f" % metrics.homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3]))`
```print("%.3f" % metrics.homogeneity_score([0, 0, 1, 1], [0, 1, 0, 1]))
print("%.3f" % metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))
print (metrics.completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))
print(metrics.completeness_score([0, 0, 1, 1], [0, 0, 0, 0]))
print(metrics.completeness_score([0, 1, 2, 3], [0, 0, 1, 1]))
print(metrics.completeness_score([0, 0, 1, 1], [0, 1, 0, 1]))
print(metrics.completeness_score([0, 0, 0, 0], [0, 1, 2, 3]))```

V-measure  is the harmonic mean between homogeneity and completeness:

v = 2 * (homogeneity * completeness) / (homogeneity + completeness)

Note that this metric is not dependent of the absolute values of the labels: a permutation of the class or cluster label values will not change the score value in any way. Moreover, the metric is symmetric with respect to switching between the predicted and the original cluster label.

This can be useful to measure the agreement of two independent label assignments strategies on the same dataset when the real ground truth is not known. If class members are completely split across different clusters, the assignment is totally incomplete, hence the V-measure is null.

Perfect labelings are both homogeneous and complete, hence have score 1.0:

```print (metrics.v_measure_score([0, 0, 1, 1], [0, 0, 1, 1]))
print (metrics.v_measure_score([0, 0, 1, 1], [1, 1, 0, 0]))```

Question: Labelings that assign all classes members to the same clusters are: ___________, but not _____________:

```print("%.3f" % metrics.completeness_score([0, 1, 2, 3], [0, 0, 0, 0]))
print("%.3f" % metrics.homogeneity_score([0, 1, 2, 3], [0, 0, 0, 0]))
print("%.3f" % metrics.v_measure_score([0, 1, 2, 3], [0, 0, 0, 0]))
print("%.3f" % metrics.v_measure_score([0, 0, 1, 2], [0, 0, 1, 1]))
print("%.3f" % metrics.v_measure_score([0, 1, 2, 3], [0, 0, 1, 1]))```

Question: Labelings that have pure clusters with members coming from the same classes are ________________ but un-necessary splits harm ____________________ and thus penalise V-measure as well:

```print("%.3f" % metrics.v_measure_score([0, 0, 1, 1], [0, 0, 1, 2]))
print("%.3f" % metrics.v_measure_score([0, 0, 1, 1], [0, 1, 2, 3]))```

If classes members are completely split across different clusters, the assignment is totally incomplete, hence the V-Measure is null:

`print("%.3f" % metrics.v_measure_score([0, 0, 0, 0], [0, 1, 2, 3]))`

Question: Clusters that include samples from totally different classes totally destroy the _______________________  of the labelling, hence:

`print("%.3f" % metrics.v_measure_score([0, 0, 1, 1], [0, 0, 0, 0]))`

#### Advantages

• Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.
• Intuitive interpretation: clustering with bad V-measure can be qualitatively analyzed in terms of homogeneity and completeness to better feel what ‘kind’ of mistakes is done by the assignment.
• No assumption is made on the cluster structure: can be used to compare clustering algorithms such as K-means which assumes isotropic blob shapes with results of spectral clustering algorithms which can find cluster with “folded” shapes.
##### Drawbacks
• The previously introduced metrics are not normalised with regards to random labelling. This means that depending on the number of samples, clusters and ground truth classes, a completely random labelling will not always yield the same values for homogeneity, completeness and hence V-measure. In particular random labelling will not yield zero scores especially when the number of clusters is large.
• This problem can safely be ignored when the number of samples is high i.e. more than a thousand and the number of clusters is less than 10.
• These metrics require the knowledge of the ground truth classes  while almost never available in practice or require manual assignment by human annotators (as in the supervised learning setting).

And if we do not have ground truth?

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is: where b is the distance between a sample and the nearest cluster that the sample is not part of.

• If the Silhouette s(i) is close to 0, it means that the sample is on the border of its cluster and the closest one from the rest of the dataset clusters.
• A negative value means that the sample is closer to the neighbor cluster.
• The average of the Silhouette coefficients of all samples of a given cluster defines the “goodness” of the cluster.
• The average of the Silhouette coefficients of all clusters gives idea of the quality of the clustering result.

Note that the Silhouette coefficient only makes sense when the number of labels predicted is less than the number of samples clustered.

• The score is bounded between -1 and +1.
• The score is higher when clusters are dense and well separated, which
relates to a standard concept of a cluster.

### 2.3 Clustering techniques: how to group samples?

There are two big families of clustering techniques:

• Partitional algorithms: Start with a random partition and refine it iteratively.
• Hierarchical algorithms: Agglomerative (bottom-up), top-down.
• Partitional algorithms. They can be divided in two branches:
• Hard partition algorithms, such as K-means, assign a unique cluster value to each element in the feature space.
• Soft partition algorithms, such as Mixture of Gaussians, can be viewed as density estimators and assign a confidence or probability to each point in the space.

In order to build our intuition about clustering, we will start with the simplest, but one of the most famous methods: K-means.

#### 2.3.1 K-means algorithm

##### Algorithm:
1. Initialise the value K of desirable clusters.
2. Initialise the K cluster centres, e.g. randomly.
3. Decide the class memberships of the N data samples by assigning them to the
nearest cluster centroids (e.g. the center of gravity or mean).
4. Re-estimate the K cluster centres, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to 3.

Let us see this in action:

```import numpy as np

#Create some data
MAXN=40
X = np.concatenate([1.25*np.random.randn(MAXN,2), 5+1.5*np.random.randn(MAXN,2)])
X = np.concatenate([X,[8,3]+1.2*np.random.randn(MAXN,2)])
X.shape```
```#Just for visualization purposes, create the labels of the 3 distributions
y = np.concatenate([np.ones((MAXN,1)),2*np.ones((MAXN,1))])
y = np.concatenate([y,3*np.ones((MAXN,1))])

plt.subplot(1,2,1)
plt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')
plt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b')
plt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g')
plt.title('Data as were generated')

plt.subplot(1,2,2)
plt.scatter(X[:,0],X[:,1],color='r')
plt.title('Data as the algorithm sees them')

plt.savefig("files/ch07/sample.png",dpi=300, bbox_inches='tight')

from sklearn import cluster

K=3 # Assuming to be 3 clusters!

clf = cluster.KMeans(init='random', n_clusters=K)
clf.fit(X)```

Each clustering algorithm comes in two variants:

• a class, that implements the fit method to learn the clusters on train data,
• a predict function, that, given test data, returns an array of integer labels corresponding to the different clusters.

For the class, the labels over the training data can be found in the labels attribute.

```print (clf.labels_) # or
print (clf.predict(X)) # equivalent```
```print (X[(y==1).ravel(),0]) #numpy.ravel() returns a flattened array
print (X[(y==1).ravel(),1])```
```plt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')
plt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b')
plt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g')

fig = plt.gcf()
fig.set_size_inches((6,5))```
```x = np.linspace(-5,15,200)
XX,YY = np.meshgrid(x,x)
sz=XX.shape
data=np.c_[XX.ravel(),YY.ravel()]
# c_ translates slice objects to concatenation along the second axis.```
```Z=clf.predict(data) # returns the labels of the data
print (Z)```

Questions: How many “misclusterings” do we have?

```# Visualize space partition
plt.imshow(Z.reshape(sz), interpolation='bilinear', origin='lower',
extent=(-5,15,-5,15),alpha=0.3, vmin=0, vmax=K-1)
plt.title('Space partitions', size=14)
plt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')
plt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b')
plt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g')

fig = plt.gcf()
fig.set_size_inches((6,5))

plt.savefig("files/ch07/samples3.png",dpi=300, bbox_inches='tight')```
```clf = cluster.KMeans(n_clusters=K, random_state=0)
#initialize the k-means clustering
clf.fit(X) #run the k-means clustering

data=np.c_[XX.ravel(),YY.ravel()]
Z=clf.predict(data) # returns the clustering labels of the data```

Visualising true labels by coloured points and space tessellation:

```plt.title('Final result of K-means', size=14)

plt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')
plt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b')
plt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g')

plt.imshow(Z.reshape(sz), interpolation='bilinear', origin='lower',
extent=(-5,15,-5,15),alpha=0.3, vmin=0, vmax=K-1)

x = np.linspace(-5,15,200)
XX,YY = np.meshgrid(x,x)
fig = plt.gcf()
fig.set_size_inches((6,5))

plt.savefig("files/ch07/randscore.png",dpi=300, bbox_inches='tight')```
```clf = cluster.KMeans(init='random', n_clusters=K, random_state=0)
#initialize the k-means clustering
clf.fit(X) #run the k-means clustering
Zx=clf.predict(X)

plt.subplot(1,3,1)
plt.title('Original labels', size=14)
plt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')
plt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b') # b
plt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g') # g
fig = plt.gcf()
fig.set_size_inches((12,3))

plt.subplot(1,3,2)
plt.title('Data without labels', size=14)
plt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')
plt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='r') # b
plt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='r') # g
fig = plt.gcf()
fig.set_size_inches((12,3))

plt.subplot(1,3,3)
plt.title('Clustering labels', size=14)
plt.scatter(X[(Zx==1).ravel(),0],X[(Zx==1).ravel(),1],color='r')
plt.scatter(X[(Zx==2).ravel(),0],X[(Zx==2).ravel(),1],color='b')
plt.scatter(X[(Zx==0).ravel(),0],X[(Zx==0).ravel(),1],color='g')
fig = plt.gcf()
fig.set_size_inches((12,3))```

The K-means algorithm clusters data by trying to separate samples in n groups of equal variance. In other words, the K-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean of the samples in the cluster. The means are commonly called the cluster “centroids”.

Question: Shall the centroids belong to the original set of points?

The K-means algorithm aims to choose centroids minimising a criterion known as the inertia or within-cluster sum-of-squares: Inertia or the within-cluster sum of squares criterion, can be recognised as a measure of how internally coherent clusters are. Several issues should be taken into account:

• Inertia responds poorly to elongated clusters, or manifolds with irregular shapes.
• Given enough time, K-means will always converge.
• This algorithm requires the number of clusters to be specified.
• It scales well to large number of samples and has been used across a large range of application areas in many different fields.

The computation is often done several times, with different initializations of the centroids. One method to help address this issue is the k-means++ initialization scheme, which has been implemented in scikit-learn (use the init=’kmeans++’ parameter). This initialises the centroids to be (generally) distant from each other, leading to provably better results than random initialisation. Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings:

##### Summary
• (+) Select good seeds using a heuristic (e.g. seeds with large distance among them).
• (+) Try out multiple starting points.
• (+) Initialize with the results of another method.
• (-) Tends to look for spherical clusters.
• (-) Prone to local minima stabilization.
```from sklearn import metrics

clf = cluster.KMeans(n_clusters=K, init='k-means++', random_state=0,
max_iter=300, n_init=10)
#initialize the k-means clustering
clf.fit(X) #run the k-means clustering

print ('Final evaluation of the clustering:')

print('Inertia: %.2f' % clf.inertia_)

print('Adjusted_rand_score %.2f' % metrics.adjusted_rand_score(y.ravel(),
clf.labels_))

print('Homogeneity %.2f' % metrics.homogeneity_score(y.ravel(),
clf.labels_))

print('Completeness %.2f' % metrics.completeness_score(y.ravel(),
clf.labels_))

print('V_measure %.2f' % metrics.v_measure_score(y.ravel(), clf.labels_))

print('Silhouette %.2f' % metrics.silhouette_score(X, clf.labels_,
metric='euclidean'))

clf1 = cluster.KMeans(n_clusters=K, init='random', random_state=0,
max_iter=2, n_init=2)
#initialize the k-means clustering
clf1.fit(X) #run the k-means clustering

print ('Final evaluation of the clustering:')

print ('Inertia: %.2f' % clf1.inertia_)

print ('Adjusted_rand_score %.2f' % metrics.adjusted_rand_score(y.ravel(),
clf1.labels_))

print ('Homogeneity %.2f' % metrics.homogeneity_score(y.ravel(),
clf1.labels_))

print ('Completeness %.2f' % metrics.completeness_score(y.ravel(),
clf1.labels_))

print ('V_measure %.2f' % metrics.v_measure_score(y.ravel(),
clf1.labels_))

print ('Silhouette %.2f' % metrics.silhouette_score(X, clf1.labels_,
metric='euclidean'))```

#### 2.3.2 Spectral clustering

Spectral clustering refers to a family of methods that use spectral techniques. Specifically, these techniques are related to the eigen-decomposition of an affinity or similarity matrix and attempt to solve the problem of clustering according to connectivity. Let us consider an ideal similarity matrix of two clear sets.

Let us denote the similarity matrix, S, as the matrix Sij = s(xi; xj) gives the similarity between observations xi and xj. The simplest similarity measure is the Euclidean distance. We expect that when two points are from different clusters, they are far away. However, it might also happen that two points from the same cluster are also far away, but still belong to the same cluster as soon as there is a sequence of points from the same cluster that form a path between them.

An easy way to do it is by applying a k-nearest neighbour to build a graph connecting just the k closest dataset points. Using the affinity matrix, clustering is considered as a graph-partition problem, where connected graph components are interpreted as clusters.

The graph constructed by the spectral clustering must be partitioned such that edges connecting different clusters should have low weighs, and edges within the same cluster must have high values. Additionally to the affinity matrix, we need a degree matrix D so that each diagonal value is the degree of the respective graph node and all other positions are zero. Finally, we compute the unnormalized graph Laplacian (U = D A).

If we assume that there are k clusters, the next step is to find the k smallest eigenvectors, without considering the trivial constant eigenvector. Each row of the matrix formed by the k smallest eigenvectors of the Laplacian matrix defines a transformation for data xi. Thus, in this transformed space, we can apply the K-means clustering in order to find the final clusters. If we do not know in advance the number of clusters k, we can look for sudden change in the sorted
eigenvalues of the matrix, U and keep the smallest ones.

Let us illustrate it on some examples with non Gaussian distribution. Scikit-learn has a library to generate datasets with different shapes like moons, blobs, etc.

```from sklearn.datasets.samples_generator import make_blobs,make_moons

X, labels_true = make_blobs(n_samples=1000, centers=3, cluster_std=[1.7,1.7,1.7])

plt.subplot(1,2,1)
plt.scatter(X[:, 0], X[:, 1], c='r', marker='o',s=20)
plt.axis('equal')
plt.title('Compact clusters',size=14)

[Xmoons, ymoons] = make_moons(n_samples=300, noise=.05)
plt.subplot(1,2,2)
plt.scatter(Xmoons[:, 0], Xmoons[:, 1], c='r', marker='o',s=20)
plt.axis('equal')
plt.title('Connectivity-base clusters', size=14)
fig = plt.gcf()
fig.set_size_inches((11,6))

Let us apply the Spectral clustering to the two moons toy problem.```
```from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics import euclidean_distances

colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)

# normalize dataset for easier parameter selection
X = StandardScaler().fit_transform(Xmoons)

# Compute distances
distances = euclidean_distances(Xmoons)

spectral = cluster.SpectralClustering(n_clusters=2,
affinity="nearest_neighbors")

spectral.fit(Xmoons)
y_pred = spectral.labels_.astype(np.int)```

Following, we visualise the spectral clustering results:

```plt.subplot(1,2,1)
plt.scatter(Xmoons[:, 0], Xmoons[:, 1], c='r', marker='o',s=20)
plt.axis('equal')

plt.subplot(1,2,2)
plt.scatter(Xmoons[y_pred==0, 0], Xmoons[y_pred==0, 1], c='b',
marker='o',s=20)
plt.scatter(Xmoons[y_pred==1, 0], Xmoons[y_pred==1, 1], c='y',
marker='o',s=20)
plt.axis('equal')
fig=plt.gcf()
fig.set_size_inches((11,6))
plt.scatter(Xmoons[:, 0], Xmoons[:, 1], color=colors[y_pred].tolist(),
s=10)```

Let us compare the result to the K-means:

```plt.subplot(1,2,1)
plt.scatter(Xmoons[:, 0], Xmoons[:, 1], c='r', marker='o',s=20)
plt.axis('equal')

# Cluster using k-means
clf = cluster.KMeans(n_clusters=2,init='k-means++')
clf.fit(Xmoons)
y_pred=clf.predict(Xmoons)

# Visualize k-means result
plt.subplot(1,2,2)
plt.scatter(Xmoons[y_pred==0, 0], Xmoons[y_pred==0, 1], c='b', marker='o',s=20)
plt.scatter(Xmoons[y_pred==1, 0], Xmoons[y_pred==1, 1], c='y', marker='o',s=20)
plt.axis('equal')
fig=plt.gcf()
fig.set_size_inches((11,6))```

Note that since K-means looks for spheric clusters, it is unable to separate the two moon data, in contrast to the spectral clustering.

#### 2.3.3 Hierarchical clustering

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. This is a nice tool, because of its interpretability. The result of the technique is a tree showing the similarity among the samples.

Partitioning is computed by selecting a cut on that tree at a certain level. The tree is usually called dendrogram.

##### Bottom-Up agglomerative clustering sketch of algorithm
1. Starts with each sample data in a separate cluster.
2. Then, repeatedly joins the closest pair of clusters.
3. Until there is only one cluster.

The history of merging forms a binary tree or hierarchy.

##### Top-Down divisive clustering sketch of algorithm
1. Starting with all the data in a single cluster.
2. Consider every possible way to divide the cluster into two. Choose the best division.
3. Recursively operate on both sides.

The Agglomerative Clustering performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together.

##### Defining the similarity of two clusters:

The linkage criterion determines the metric used for the merge strategy:

• Maximum or complete linkage minimises the maximum distance between observations of pairs of clusters. Based on the similarity of the two least similar members, it will give tight spherical clusters.
• Average linkage averages similarity between members i.e. minimises the average of the distances between all observations of pairs of clusters.
• Single linkage works on the similarity of two most similar members. It can create chain effects, such as follow the nearest neighbour.
• Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.

Agglomerative Clustering can also scale to large number of samples when it is used jointly with a connectivity matrix, but is computationally expensive when no connectivity constraints are added between samples: it considers at each step all the possible merges.

Let us illustrate how the different linkages work with an example. Let us generate three clusters as follows:

```MAXN1 =500
MAXN2 =400
MAXN3 =300
X1 = np.concatenate ([2.25*np.random.randn(MAXN1,2),4+1.7* np.random.randn (MAXN2 ,2)])
X1 = np.concatenate ([X1,[8,3]+1.9* np.random.randn(MAXN3 ,2)])
y1 = np.concatenate ([ np.ones ((MAXN1,1)),2* np.ones((MAXN2,1))])
y1 = np.concatenate ([y1,3* np.ones((MAXN3,1))]).ravel()
y1 = np.int_(y1)
labels_y1=['+','*','o']
colors=['r','g', 'b']```

Let us apply agglomerative clustering using the different linkages:

```import time

from sklearn.cluster import AgglomerativeClustering

for linkage in ('ward', 'complete', 'average'):
clustering = AgglomerativeClustering(linkage=linkage,n_clusters =3)
clustering.fit(X1)

x_min , x_max = np.min (X1, axis =0) , np.max (X1,axis =0)
X1 = (X1 - x_min ) / ( x_max - x_min )
fig = plt.figure ()
fig.set_size_inches((5,5))
for i in range (X1.shape ) :
plt.text(X1[i,0],X1[i,1],labels_y1[y1[i]-1],color=colors[y1[i]-1])
plt.title ("%s linkage" % linkage,size =20)
plt.tight_layout()
plt.savefig("files/ch07/%slinkage.png" % linkage,dpi=300, bbox_inches='tight')

plt.show()```

Agglomerative clustering has a “rich get richer” behaviour that leads to uneven cluster sizes. In this regard, complete linkage is the worst strategy, and Ward gives the most regular sizes. However, the affinity cannot be varied with Ward, thus for non Euclidean metrics, average linkage is a good alternative. Let us illustrate the performance on some other datasets with more complex distributions:

```from sklearn import cluster, datasets
from sklearn.cluster import AgglomerativeClustering

[X1, y1] = datasets.make_circles(n_samples=600, factor=.5, noise=.05)
[X2, y2] = datasets.make_circles(n_samples=600, factor=.5, noise=.15)

n_clusters=4

for X in [X1,X2]:
plt.figure(figsize=(12, 4))
for index, linkage in enumerate(('average', 'complete', 'ward')):

plt.subplot(1, 3, index + 1)
model = AgglomerativeClustering(linkage=linkage,n_clusters=n_clusters)
model.fit(X)
plt.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap=plt.cm.nipy_spectral)
plt.title('%s linkage' % linkage,fontdict=dict(verticalalignment='top'))
plt.axis('equal')
plt.axis('off')

plt.show()```

#### 2.3.4 Adding connectivity constraints

An interesting aspect of Agglomerative Clustering is that connectivity constraints can be added to this algorithm (only adjacent clusters can be merged together), through a connectivity matrix that defines for each sample the neighbouring samples following a given structure of the data. For instance, in the swiss-roll example below, the connectivity constraints forbid the merging of points that are not adjacent on the swiss roll, and thus avoid forming clusters that extend across overlapping folds of the roll.

These constraints are useful to impose a certain local structure, but they also make the algorithm faster, especially when the number of the samples is high. The connectivity constraints are imposed via a connectivity matrix: a scipy sparse matrix that has elements only at the intersection of a row and a column with indices of the dataset that should be connected.

This matrix can be constructed from a-priori information: for instance, you may wish to cluster web pages by only merging pages with a link pointing from one to another.

It can also be learned from the data, for instance using sklearn.neighbors.kneighbors_graph to restrict merging to nearest neighbours as in this example, or using sklearn.feature_extraction.image.grid_to_graph to enable only merging of neigh-boring pixels on an image.

This example shows the effect of imposing a connectivity graph to capture local structure in the data. The graph is simply the graph of 20 nearest neighbours. Two consequences of imposing a connectivity can be seen. First clustering with a connectivity matrix is much faster. Second, when using a connectivity matrix, average and complete linkage are unstable and tend to create a few clusters that grow very quickly. Indeed, average and complete linkage fight this percolation behaviour by considering all the distances between two clusters when merging them.

The connectivity graph breaks this mechanism. This effect is more pronounced for very sparse graphs (try decreasing the number of neighbours in kneighbors_graph) and with complete linkage. In particular, having a very small number of neighbours in the graph, imposes a geometry that is close to that of single linkage, which is well known to have this percolation instability.

```[X1, y1] = datasets.make_circles(n_samples=600, factor=.5, noise=.07)
[X2, y2] = datasets.make_circles(n_samples=600, factor=.5, noise=.1)

for X in [X1,X2]:
knn_graph = kneighbors_graph(X, 5)
for connectivity in (None, knn_graph):
plt.figure(figsize=(12, 4))
for index, linkage in enumerate(('average', 'complete', 'ward')):

plt.subplot(1, 3, index + 1)
model = AgglomerativeClustering(linkage=linkage,n_clusters=n_clusters,
connectivity=connectivity)
model.fit(X)
plt.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap=plt.cm.nipy_spectral)
plt.title('%s linkage' % linkage,fontdict=dict(verticalalignment='top'))
plt.axis('equal')
plt.axis('off')
if connectivity is None:
plt.suptitle('Without connectivity', size=20)
else: plt.suptitle('With connectivity', size=20)
# plt.suptitle('Without connectivity: %r' % connectivity is None, size=20)
plt.show()```

#### 2.3.5 Comparing different clustering algorithms on toy datasets

Following, we compare the different clustering algorithms, discussed so far. Let us generate 4 cases of different samples configurations:

• separated circular groups of data,
• uniform data distribution,
• embedded circles,
• non-flat configurations of data.

To this purpose, scikit-learn has predefined functions like: datasets.make_blobs(), datasets.make_circles(), datasets.make_moons(), etc. The code provided by scikit-learn for
generating the testing samples distributions is as follows:

```from sklearn import cluster, datasets

np.random.seed(0)

# Generate datasets.
n_samples = 1500
no_structure = np.random.rand(n_samples, 2), None
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8, centers=4)
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)

colors = np.array([x for x in 'cmykbgrcmykbgrcmykbgrcmykbgr'])
colors = np.hstack([colors] * 20)
```
```uniform = np.random.rand(n_samples, 2), None

plt.figure(figsize=(12, 10))
plt.subplots_adjust(left=.001, right=.999, bottom=.01, top=.99, wspace=.05, hspace=.01)
plot_num = 1

for i, n_clrs in enumerate([2,4]):
dataset=uniform
i_dataset=0
X, y = dataset
X = StandardScaler().fit_transform(X)

connectivity = kneighbors_graph(X, n_neighbors=10)
connectivity = 0.5 * (connectivity + connectivity.T)

distances = euclidean_distances(X)

kmeans = cluster.KMeans(n_clusters=n_clrs)
spectral_clustering = cluster.SpectralClustering(n_clusters=n_clrs,
#eigen_solver='arpack',
affinity="nearest_neighbors")
average_agglomerative = cluster.AgglomerativeClustering(linkage="average",connectivity=connectivity,
affinity="cityblock", n_clusters=n_clrs)
ward_agglomerative = cluster.AgglomerativeClustering(n_clusters=n_clrs,
linkage='ward', connectivity=connectivity)

for method, algr in [('KMeans', kmeans),('Spectral Clst.', spectral_clustering),
('Average Aggl. Clst.', average_agglomerative),
('Ward Aggl. Clst.', ward_agglomerative)]:
algr.fit(X)
y_pred = algr.labels_.astype(np.int)

plt.subplot(4, 4, plot_num)
if i == 0:
plt.title(method, size=15)
plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)

plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.xticks(())
plt.yticks(())
plot_num += 1
plt.savefig("files/ch07/uniform.png",dpi=300, bbox_inches='tight')
plt.show()```
```spheres = datasets.make_blobs(n_samples=n_samples, random_state=3, centers=4)

plt.figure(figsize=(12, 10))
plt.subplots_adjust(left=.001, right=.999, bottom=.01, top=.99, wspace=.05, hspace=.01)
plot_num = 1

for i, n_clrs in enumerate([4,2]):
dataset=spheres
i_dataset=0
X, y = dataset
X = StandardScaler().fit_transform(X)

connectivity = kneighbors_graph(X, n_neighbors=10)
connectivity = 0.5 * (connectivity + connectivity.T)

distances = euclidean_distances(X)

kmeans = cluster.KMeans(n_clusters=n_clrs)
spectral_clustering = cluster.SpectralClustering(n_clusters=n_clrs,
#eigen_solver='arpack',
affinity="nearest_neighbors")
average_agglomerative = cluster.AgglomerativeClustering(linkage="average",connectivity=connectivity,
affinity="cityblock", n_clusters=n_clrs)
ward_agglomerative = cluster.AgglomerativeClustering(n_clusters=n_clrs,
linkage='ward', connectivity=connectivity)

for method, algr in [('KMeans', kmeans),('Spectral Clst.', spectral_clustering),
('Average Aggl. Clst.', average_agglomerative),
('Ward Aggl. Clst.', ward_agglomerative)]:
algr.fit(X)
y_pred = algr.labels_.astype(np.int)

plt.subplot(4, 4, plot_num)
if i == 0:
plt.title(method, size=15)
plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)

plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.xticks(())
plt.yticks(())
plot_num += 1
plt.savefig("files/ch07/blobs.png",dpi=300, bbox_inches='tight')
plt.show()```
```moons = datasets.make_moons(n_samples=n_samples, noise=.08)

plt.figure(figsize=(12, 10))
plt.subplots_adjust(left=.001, right=.999, bottom=.01, top=.99, wspace=.05, hspace=.01)
plot_num = 1

for i, n_clrs in enumerate([2,4]):
dataset=moons
X, y = dataset
X = StandardScaler().fit_transform(X)

connectivity = kneighbors_graph(X, n_neighbors=10)
connectivity = 0.5 * (connectivity + connectivity.T)
distances = euclidean_distances(X)

kmeans = cluster.KMeans(n_clusters=n_clrs)
spectral_clustering = cluster.SpectralClustering(n_clusters=n_clrs, affinity="nearest_neighbors")
average_agglomerative = cluster.AgglomerativeClustering(linkage="average",connectivity=connectivity,
affinity="cityblock", n_clusters=n_clrs)
ward_agglomerative = cluster.AgglomerativeClustering(n_clusters=n_clrs,
linkage='ward', connectivity=connectivity)

for method, algr in [('KMeans', kmeans),('Spectral Clst.', spectral_clustering),
('Average Aggl. Clst.', average_agglomerative),
('Ward Aggl. Clst.', ward_agglomerative)]:
algr.fit(X)
y_pred = algr.labels_.astype(np.int)

plt.subplot(4, 4, plot_num)
if i == 0:
plt.title(method, size=15)
plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)

plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.xticks(())
plt.yticks(())
plot_num += 1
plt.savefig("files/ch07/moons.png" , bbox_inches='tight')
plt.show()```

The code provided by scikit-learn for comparing the different clustering techniques when k=4 is as follows:

```plt.figure(figsize=(11, 9.5))
plt.subplots_adjust(left=.001, right=.999, bottom=.001, top=.96, wspace=.05, hspace=.01)
plot_num = 1

for i_dataset, dataset in enumerate([blobs,
no_structure, noisy_circles, noisy_moons]):
X, y = dataset
X = StandardScaler().fit_transform(X)

connectivity = kneighbors_graph(X, n_neighbors=10)
connectivity = 0.5 * (connectivity + connectivity.T)

distances = euclidean_distances(X)

means = cluster.KMeans(n_clusters=4)
spectral = cluster.SpectralClustering(n_clusters=3, eigen_solver='arpack',
affinity="nearest_neighbors")
average_linkage = cluster.AgglomerativeClustering(linkage="average",
affinity="cityblock", n_clusters=3,
connectivity=connectivity)
ward = cluster.AgglomerativeClustering(n_clusters=3,
linkage='ward', connectivity=connectivity)

for name, algorithm in [('KMeans', means),('SpectrClust', spectral),
('AgglomClust (average)', average_linkage),
('AgglomClust (ward)', ward)]:
algorithm.fit(X)
if hasattr(algorithm, 'labels_'):
y_pred = algorithm.labels_.astype(np.int)
else:
y_pred = algorithm.predict(X)

plt.subplot(4, 4, plot_num)
if i_dataset == 0:
plt.title(name, size=18)
plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)

plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.xticks(())
plt.yticks(())
plot_num += 1

plt.show()```

The code provided by scikit-learn for comparing the different clustering techniques when k=2 is as follows:

```plt.figure(figsize=(11, 9.5))
plt.subplots_adjust(left=.001, right=.999, bottom=.001, top=.96, wspace=.05, hspace=.01)

plot_num = 1
for i_dataset, dataset in enumerate([blobs, no_structure, noisy_circles,
noisy_moons]):
X, y = dataset
X = StandardScaler().fit_transform(X)

connectivity = kneighbors_graph(X, n_neighbors=10)
connectivity = 0.5 * (connectivity + connectivity.T)

distances = euclidean_distances(X)

means = cluster.KMeans(n_clusters=2)
spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack',
affinity="nearest_neighbors")
average_linkage = cluster.AgglomerativeClustering(linkage="average",
affinity="cityblock", n_clusters=2,
connectivity=connectivity)
complete_linkage = cluster.AgglomerativeClustering(linkage="complete",
affinity="cityblock", n_clusters=2,
connectivity=connectivity)
ward_linkage = cluster.AgglomerativeClustering(n_clusters=2,
linkage='ward', connectivity=connectivity)

for name, algorithm in [
('KMeans', means),
('AgglomClust (average)', average_linkage),
('AgglomClust (average)', complete_linkage),
('AgglomClust (ward)', ward_linkage),
('SpectrClust', spectral)
]:
algorithm.fit(X)
if hasattr(algorithm, 'labels_'):
y_pred = algorithm.labels_.astype(np.int)
else:
y_pred = algorithm.predict(X)

plt.subplot(4, 5, plot_num)
plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)

plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.xticks(())
plt.yticks(())
plot_num += 1

plt.show()```

### 3. CASE STUDY: EUROSTAT data analysis

Eurostat is the home of the `European Commission data

http://ec.europa.eu/eurostat

Eurostat’s main role is to process and publish comparable statistical information at European level. Data in Eurostat is provided by each member state. Eurostat’s re-use policy is free re-use of its data, both for non-commercial and commercial purposes (with some minor exceptions).

#### Applying clustering to analyze countries according to their education resourses

In order to illustrate the clustering on a real dataset, we will analyze the indicators on education finance data among the European member states, provided by the Eurostat data bank2. The data is organized by year (TIME): [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
2011] and country (GEO): [‘Albania’, ‘Austria’, ‘Belgium’, ‘Bulgaria’, ‘Croatia’, ‘Cyprus’, ‘Czech Republic’, ‘Denmark’, ‘Estonia’, ‘Euro area (13 countries)’, ‘Euro area (15 countries)’, ‘European Union (25 countries)’, ‘European Union (27 countries)’, ‘Finland’, ‘Former Yugoslav Republic of Macedonia, the’, ‘France’, ‘Germany (until 1990 former territory of the FRG)’, ‘Greece’, ‘Hungary’, ‘Iceland’, ‘Ireland’, ‘Italy’, ‘Japan’, ‘Latvia’, ‘Liechtenstein’, ‘Lithuania’, ‘Luxembourg’, ‘Malta’, ‘Netherlands’, ‘Norway’, ‘Poland’, ‘Portugal’, ‘Romania’, ‘Slovakia’, ‘Slovenia’, ‘Spain’, ‘Sweden’, ‘Switzerland’, ‘Turkey’, ‘United Kingdom’, ‘United States’]. Twelve indicators (INDICED) on education finance with their values (Value) are given like:-

1. Expenditure on educational institutions from private sources as % of Gross Domestic Product (GDP), for all levels of education combined;
2. Expenditure on educational institutions from public sources as % of GDP, for all levels of government combined,
3. Expenditure on educational institutions from public sources as % of total public expenditure, for all levels of education combined,
4. Public subsidies to the private sector as % of GDP, for all levels of education combined,
5. Public subsidies to the private sector as % of total public expenditure, for all levels of education combined, etc. We can store in a table the 12 indicators for a given year (e.g. 2010).

Let us start having a look at the data.

```#Read and check the dataset downloaded from the EuroStat

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn import cluster

edu=pd.read_csv('./files/ch07/educ_figdp_1_Data.csv',na_values=':')
edu.head()```
`edu.tail()`

Data in CSV and databases are often organised in what is called stacked or record formats. In our case for each year (“TIME“) and country (“GEO“) of the EU as well as some reference countries such as Japan and United States, we have twelve indicators (“INDIC_ED“) on education finance with their values (“Value“). Let us reshape the table into a feature vector style data set.

To the process of reshaping stacked data into a table is sometimes called pivoting.

```#Pivot table in order to get a nice feature vector representation with dual indexing by TIME and GEO
pivedu=pd.pivot_table(edu, values='Value', index=['TIME', 'GEO'], columns=['INDIC_ED'])
pivedu.head()```
```print ('Let us check the two indices:\n')
print ('\nPrimary index (TIME): \n' + str(pivedu.index.levels.tolist()))
print ('\nSecondary index (GEO): \n' + str(pivedu.index.levels.tolist()))```

Observe that we have ten years information on these indicators, and as expected we have all members of the European Union with some aggregates and control/reference countries. For the sake of simplicity, let us focus on values on year 2010.

```#Extract 2010 set of values
edu2010=pivedu.ix
edu2010.head()```

Let us clean and store the names of the features and the countries.

```#Store column names and clear them for better handling. Do the same with countries
edu2010 = edu2010.rename(index={'Euro area (13 countries)': 'EU13',
'Euro area (15 countries)': 'EU15',
'European Union (25 countries)': 'EU25',
'European Union (27 countries)': 'EU27',
'Former Yugoslav Republic of Macedonia, the': 'Macedonia',
'Germany (until 1990 former territory of the FRG)': 'Germany'
})
features = edu2010.columns.tolist()

countries = edu2010.index.tolist()

edu2010.columns=range(12)
edu2010.head()```

As we can observe, this is not a clean data set, there are missing values. Some countries may not collect or have access to some indicators and there are countries without any indicators. Let us display this effect.

```#Check what is going on in the NaN data
nan_countries=np.sum(np.where(edu2010.isnull(),1,0),axis=1)
plt.bar(np.arange(nan_countries.shape),nan_countries)
plt.xticks(np.arange(nan_countries.shape),countries,rotation=90,horizontalalignment='left',
fontsize=12)
fig = plt.gcf()
fig.set_size_inches((12,5))```

We do not have info on Albania, Macedonia and Greece. And very limited info from Liechtenstein, Luxembourg and Turkey. So let us work without them. Now let us check the features.

```#Remove non info countries
wrk_countries = nan_countries<4

educlean=edu2010.ix[wrk_countries] #.ix - Construct an open mesh from multiple sequences.

#Let us check the features we have
na_features = np.sum(np.where(educlean.isnull(),1,0),axis=0)
print (na_features)

plt.bar(np.arange(na_features.shape),na_features)
plt.xticks(fontsize=12)
fig = plt.gcf()
fig.set_size_inches((8,4))```

There are four features with missing data. At this point we can proceed in two ways:

• Fill in the features with some non-informative, non-biasing data.
• Drop the features with missing values.

If we have many features and only a few have missing values then it is not much harmful to drop them. However, if missing values are spread across the features, we have to eventually deal with them. In our case, both options seem reasonable, so we will proceed with both at the same time.

```#Option A fills those features with some value, at risk of extracting wrong information
#Constant filling : edufill0=educlean.fillna(0)
edufill=educlean.fillna(educlean.mean())
print ('Filled in data shape: ' + str(edufill.shape))

#Option B drops those features
edudrop=educlean.dropna(axis=1)
#dropna: Return object with labels on given axis omitted where alternately any or
# all of the data are missing
print ('Drop data shape: ' + str(edudrop.shape))```

In the fill-in option, we have decided to fill the data with the mean value of the feature. This will not bias the distribution of the feature, though it has consequences in the interpretation of the results.

Let us now apply a K-means clustering technique on this data in order to partition the countries according to their investment in education and check their profiles.

```scaler = StandardScaler() #Standardize features by removing the mean and scaling to unit variance

X_train_fill = edufill.values
X_train_fill = scaler.fit_transform(X_train_fill)

clf = cluster.KMeans(init='k-means++', n_clusters=3, random_state=42)

clf.fit(X_train_fill) #Compute k-means clustering.

y_pred_fill = clf.predict(X_train_fill)
#Predict the closest cluster each sample in X belongs to.

idx=y_pred_fill.argsort()```

Let us visualise the result of the K-means clustering:

```plt.plot(np.arange(35),y_pred_fill[idx],'ro')
wrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]

plt.xticks(np.arange(len(wrk_countries_names)),[wrk_countries_names[i] for i in idx],
rotation=90,horizontalalignment='left',fontsize=12)
plt.title('Using filled in data', size=15)
plt.yticks([0,1,2])
fig = plt.gcf()

fig.set_size_inches((12,5))```

Let us apply the clustering on the dataset with dropped missing values:

```X_train_drop = edudrop.values
X_train_drop = scaler.fit_transform(X_train_drop)

clf.fit(X_train_drop) #Compute k-means clustering.
y_pred_drop = clf.predict(X_train_drop) #Predict the closest cluster of each sample in X.```
```idx=y_pred_drop.argsort()
plt.plot(np.arange(35),y_pred_drop[idx],'ro')
wrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]

plt.xticks(np.arange(len(wrk_countries_names)),[wrk_countries_names[i] for i in idx],
rotation=90,horizontalalignment='left',fontsize=12)
plt.title('Using dropped missing values data',size=15)
fig = plt.gcf()
plt.yticks([0,1,2])
fig.set_size_inches((12,5))```

We have sorted the data for better visualization. At a simple glance we can see that both partitions can be different. We can better check this effect plotting the clusters values of one technique against the other.

```plt.plot(y_pred_drop+0.2*np.random.rand(35),y_pred_fill+0.2*np.random.rand(35),'bo')
plt.xlabel('Predicted clusters for the filled in dataset.')
plt.ylabel('Predicted clusters for the dropped missing values dataset.')
plt.title('Correlations')
plt.xticks([0,1,2])
plt.yticks([0,1,2])
plt.savefig("files/ch07/correlationkmeans.png",dpi=300, bbox_inches='tight')```

Well, looking at both methods, both may yield the same results, but not necessarily always. This is mainly due to two aspects: the random initialisation of the k-means clustering and the fact that each method works in a different space (dropped data vs. filled-in data).

Let us check the list of countries in both methods. Note that we should not consider the cluster value, since it is irrelevant.

```print ('Cluster 0: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_fill)
if item==0]))
print ('Cluster 0: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_drop)
if item==0]))
print ('\n')
print ('Cluster 1: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_fill)
if item==1]))
print ('Cluster 1: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_drop)
if item==1]))
print ('\n')
print ('Cluster 2: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_fill)
if item==2]))
print ('Cluster 2: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_drop)
if item==2]))
print ('\n')```

Let us check the profile of the clusters by looking at the centroids:

```width=0.3
p1 = plt.bar(np.arange(8),scaler.inverse_transform(clf.cluster_centers_),width,color='b')
# Scale back the data to the original representation
p2 = plt.bar(np.arange(8)+width,scaler.inverse_transform(clf.cluster_centers_),
width,color='yellow')
p0 = plt.bar(np.arange(8)+2*width,scaler.inverse_transform(clf.cluster_centers_),
width,color='r')

plt.legend( (p0, p1, p2), ('Cluster 0', 'Cluster 1', 'Cluster 2') ,loc=9)
plt.xticks(np.arange(8) + 0.5, np.arange(8),size=12)
plt.yticks(size=12)
plt.xlabel('Economical indicators')
plt.ylabel('Average expanditure')
fig = plt.gcf()

plt.savefig("files/ch07/clusterexpenditure.png",dpi=300, bbox_inches='tight')```

It looks like cluster “1“ spends more on education while cluster “0“ is the one with less resources on education. What about Spain?

Let us refine a little bit more cluster “0“ and check how close are members from this cluster to cluster “1“. This may give us a hint on a possible ordering.

```from scipy.spatial import distance
p = distance.cdist(X_train_drop[y_pred_drop==0,:],[clf.cluster_centers_],'euclidean')
#the distance of the elements of cluster 0 to the center of cluster 1

fx = np.vectorize(np.int)

plt.plot(np.arange(p.shape),
fx(p)
)

wrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]
zero_countries_names = [wrk_countries_names[i] for i,item in enumerate(y_pred_drop)
if item==0]
plt.xticks(np.arange(len(zero_countries_names)),zero_countries_names,rotation=90,
horizontalalignment='left',fontsize=12)```

Well, it seems that Spain belongs to cluster “0“, it is the closest to change to a policy in the lines of the other clusters.

Additionally, we can also check the distance to the centroid of cluster “0“.

```from scipy.spatial import distance
p = distance.cdist(X_train_drop[y_pred_drop==0,:],[clf.cluster_centers_],'euclidean')
pown = distance.cdist(X_train_drop[y_pred_drop==0,:],[clf.cluster_centers_],'euclidean')

width=0.45
p0=plt.plot(np.arange(p.shape),fx(p),width)
p1=plt.plot(np.arange(p.shape)+width,fx(pown),width,color = 'red')

wrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]
zero_countries_names = [wrk_countries_names[i] for i,item in enumerate(y_pred_drop)
if item==0]
plt.xticks(np.arange(len(zero_countries_names)),zero_countries_names,rotation=90,
horizontalalignment='left',fontsize=12)
plt.legend( (p0, p1), ('d -> 1', 'd -> 0') ,loc=1)
plt.savefig("files/ch07/dist2cluster01.png",dpi=300, bbox_inches='tight')```

Let us redo the clustering with K=4 and see what we can conclude.

```X_train = edudrop.values
clf = cluster.KMeans(init='k-means++', n_clusters=4, random_state=0)
clf.fit(X_train)
y_pred = clf.predict(X_train)

idx=y_pred.argsort()
plt.plot(np.arange(35),y_pred[idx],'ro')
wrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]

plt.xticks(np.arange(len(wrk_countries_names)),[wrk_countries_names[i] for i in idx],rotation=90,
horizontalalignment='left',fontsize=12)
plt.title('Using drop features',size=15)
plt.yticks([0,1,2,3])
fig = plt.gcf()
fig.set_size_inches((12,5))```
```width=0.2
p0 = plt.bar(np.arange(8)+1*width,clf.cluster_centers_,width,color='r')
p1 = plt.bar(np.arange(8),clf.cluster_centers_,width,color='b')
p2 = plt.bar(np.arange(8)+3*width,clf.cluster_centers_,width,color='yellow')
p3 = plt.bar(np.arange(8)+2*width,clf.cluster_centers_,width,color='pink')

plt.legend( (p0, p1, p2, p3), ('Cluster 0', 'Cluster 1', 'Cluster 2',
'Cluster 3') ,loc=9)
plt.xticks(np.arange(8) + 0.5, np.arange(8),size=12)
plt.yticks(size=12)
plt.xlabel('Economical indicator')
plt.ylabel('Average expenditure')
fig = plt.gcf()
fig.set_size_inches((12,5))
plt.savefig("files/ch07/distances4clusters.png",dpi=300, bbox_inches='tight')```

Spain is still in cluster “0“. But as we observed in our previous clustering it was very close to changing cluster. This time cluster “0“ includes the averages values for the EU members. Just for the sake of completeness, let us write down the name of the countries in the clusters.

```print ('Cluster 0: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred) if item==0]))

print ('Cluster 1: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred) if item==1]))

print ('Cluster 2: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred) if item==2]))

print ('Cluster 3: \n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred) if item==3]))

#Save data for future use.
import pickle
ofname = open('edu2010.pkl', 'wb')
s = pickle.dump([edu2010, wrk_countries_names,y_pred ],ofname)
ofname.close()```

We can repeat the process using the alternative clustering techniques and compare their results. Let us first apply the spectral clustering. The corresponding code will be:

```from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics import euclidean_distances

X = StandardScaler().fit_transform(edudrop.values)

distances = euclidean_distances(edudrop.values)

spectral = cluster.SpectralClustering(n_clusters=4, affinity="nearest_neighbors")
spectral.fit(edudrop.values)

y_pred = spectral.labels_.astype(np.int)```

If we visualise the results:

```idx=y_pred.argsort()

plt.plot(np.arange(35),y_pred[idx],'ro')
wrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]

plt.xticks(np.arange(len(wrk_countries_names)),[wrk_countries_names[i]
for i in idx],rotation=90,horizontalalignment='left',fontsize=12)

plt.yticks([0,1,2,3])

plt.title('Applying Spectral Clustering on the drop features',size=15)
fig = plt.gcf()
fig.set_size_inches((12,5))```

Note that in general, the spectral clustering intends to obtain more balanced clusters. In this way, the predicted cluster 1 merges the cluster 2 and 3 of the K-means clustering, cluster 2 corresponds to the cluster 1 of the K-means clustering, cluster 0 mainly goes to cluster 2, and clusters 3 corresponds to cluster 0 of the K-means.

Applying the agglomerative clustering, we obtain not only the different clusters, but also we can see how different clusters are obtained. This, in some way it is giving us information on which are the pairs of countries and clusters that are most similar. The corresponding code that applies the agglomerative clustering is:

```X_train = edudrop.values
dist = pdist(X_train,'euclidean')
linkage_matrix = linkage(dist,method = 'complete');
plt.figure() # we need a tall figure
fig = plt.gcf()
fig.set_size_inches((12,12))
dendrogram(linkage_matrix, orientation="right", color_threshold = 4,labels = wrk_countries_names, leaf_font_size=20);

plt.savefig("files/ch07/ACCountires.png",dpi=300, bbox_inches='tight')
plt.show()

#plt.tight_layout() # fixes margins```

In scikit-learn, the parameter color_threshold colors all the descendent links below a cluster node k the same color if k is the first node below the color threshold. All links connecting nodes with distances greater than or equal to the threshold are colored blue. Thus, if we use color threshold = 3, the obtained clusters are as follows:

• Cluster 0: [‘Cyprus’, ‘Denmark’, ‘Iceland’]
• Cluster 1: [‘Bulgaria’, ‘Croatia’, ‘Czech Republic’, ‘Italy’,
‘Japan’, ‘Romania’, ‘Slovakia’]
• Cluster 2: [‘Belgium’, ‘Finland’, ‘Ireland’, ‘Malta’, ‘Norway’,
‘Sweden’]
• Cluster 3: [‘Austria’, ‘Estonia’, ‘EU13’, ‘EU15’, ‘EU25’, ‘EU27’,
‘France’, ‘Germany’, ‘Hungary’, ‘Latvia’, ‘Lithuania’, ‘Netherlands’,
‘Poland’, ‘Portugal’, ‘Slovenia’, ‘Spain’, ‘Switzerland’, ‘United
Kingdom’, ‘United States’]

Note that they correspond in high degree to the clusters obtained by the K-means (except permutation of clusters labels that is irrelevant). The figure shows the construction of the clusters using the complete linkage agglomerative clustering. Different cuts at different levels of the dendrogram allow to obtain different number of clusters. As a summary, let us compare the results of the three approaches of clustering. We cannot expect that the results coincide since different approaches are based on different criteria to construct the clusters. Still, we can observe that in this case K-means and the agglomerative approaches gave the same results (up to a permutation of the number of cluster that is irrelevant), meanwhile the spectral clustering gave more evenly distributed clusters. It fused cluster 0 and 2 of the agglomerative clustering in cluster 1, and split cluster 3 of agglomerative clustering in clusters 0 and 3 of it. Note that these results can change when using different distance between data.

### References

This notebook was created by Petia Radeva and Oriol Pujol Vila.