{"id":184,"date":"2018-10-17T12:34:59","date_gmt":"2018-10-17T12:34:59","guid":{"rendered":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/?page_id=184"},"modified":"2021-10-17T16:49:06","modified_gmt":"2021-10-17T16:49:06","slug":"a-step-forward-for-discovering-knowledge-using-unsupervised-learning","status":"publish","type":"page","link":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/hands-on\/a-step-forward-for-discovering-knowledge-using-unsupervised-learning\/","title":{"rendered":"A step forward for discovering knowledge using unsupervised learning"},"content":{"rendered":"<h3>Objective<\/h3>\n<p>Discuss different techniques for unsupervised learning, with a focus on several clustering techniques.<\/p>\n<ul>\n<li>Consider basic concepts such as distance and similarity, the taxonomy of clustering techniques, and measures of clustering quality.<\/li>\n<li>Explore three basic clustering techniques, namely K-means, spectral clustering and hierarchical clustering.<\/li>\n<li>Illustrate the use of clustering techniques on a real problem: defining groups of countries according to their economic expenditure on education.<\/li>\n<\/ul>\n<h3>2. 
Clustering<\/h3>\n<p>Partition unlabeled examples into disjoint subsets, called clusters, such that:<\/p>\n<ul>\n<li>Examples within a cluster are similar (high intra-cluster similarity).<\/li>\n<li>Examples in different clusters are different (low inter-cluster similarity).<\/li>\n<\/ul>\n<p>Clustering can help in discovering new categories in an unsupervised manner (no sample category labels are provided). It helps us analyse the data and gain insight into it, but the quality of the partition depends on the application and the analyst.<\/p>\n<h4>2.1 Similarity and distance<\/h4>\n<p>The notion of similarity is hard to pin down; however, we can use the notion of distance as a surrogate. The best-known distance metrics are:<\/p>\n<ul>\n<li>Euclidean distance.<\/li>\n<li>Manhattan distance.<\/li>\n<li>Max-distance (also known as Chebyshev distance).<\/li>\n<\/ul>\n<h4>2.2 What is a good clustering?<\/h4>\n<p>The <em><strong>Rand index<\/strong><\/em> or <em><strong>Rand measure<\/strong><\/em> (named after William M. Rand) is a measure of the similarity between two data clusterings. A form of the Rand index that is adjusted for the chance grouping of elements may also be defined; this is the <em><strong>adjusted Rand index<\/strong><\/em>. 
From a mathematical standpoint, the Rand index is related to accuracy, but it is applicable even when class labels are not used.<\/p>\n<p>Consider a set of n elements S = {o<sub>1<\/sub>, &#8230;, o<sub>n<\/sub>} and two partitions of S to compare: X = {X<sub>1<\/sub>, &#8230;, X<sub>r<\/sub>}, a partition of S into r subsets, and Y = {Y<sub>1<\/sub>, &#8230;, Y<sub>s<\/sub>}, a partition of S into s subsets. Let us define:<\/p>\n<ul>\n<li>a, the number of pairs of elements in S that are in the same set in X and in the same set in Y<\/li>\n<li>b, the number of pairs of elements in S that are in different sets in X and in different sets in Y<\/li>\n<li>c, the number of pairs of elements in S that are in the same set in X and in different sets in Y<\/li>\n<li>d, the number of pairs of elements in S that are in different sets in X and in the same set in Y.<\/li>\n<\/ul>\n<p>The <em><strong>Rand index<\/strong><\/em>, R, is:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-172 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.30.51.png\" alt=\"\" width=\"224\" height=\"58\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.30.51.png 348w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.30.51-300x78.png 300w\" sizes=\"auto, (max-width: 224px) 100vw, 224px\" \/><\/p>\n<p>The <em><strong>adjusted Rand index<\/strong><\/em>, AR, is:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-173 aligncenter\" 
src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.33.03.png\" alt=\"\" width=\"435\" height=\"64\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.33.03.png 692w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.33.03-300x44.png 300w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.33.03-624x92.png 624w\" sizes=\"auto, (max-width: 435px) 100vw, 435px\" \/><\/p>\n<p>A clustering result satisfies <em>homogeneity<\/em>\u00a0if all of its clusters\u00a0contain only data points which are members of the same original (a\u00a0single) class.<\/p>\n<p>A clustering result satisfies <em>completeness<\/em> if all the data points\u00a0that are members of a given class are elements of the same automatic\u00a0cluster.<\/p>\n<p>Both scores have positive values between 0.0 and 1.0, larger values\u00a0being desirable.<\/p>\n<pre>import matplotlib.pylab as plt<\/pre>\n<pre>from sklearn import metrics\nmetrics.homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0])<\/pre>\n<pre>print(\"%.3f\" % metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 1, 2]))<\/pre>\n<pre>print(\"%.3f\" % metrics.homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3]))<\/pre>\n<pre>print(\"%.3f\" % metrics.homogeneity_score([0, 0, 1, 1], [0, 1, 0, 1]))\nprint(\"%.3f\" % metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))\nprint (metrics.completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))\nprint(metrics.completeness_score([0, 0, 1, 1], [0, 0, 0, 0]))\nprint(metrics.completeness_score([0, 1, 2, 3], [0, 0, 1, 1]))\nprint(metrics.completeness_score([0, 0, 1, 1], [0, 1, 0, 1]))\nprint(metrics.completeness_score([0, 0, 0, 0], [0, 1, 2, 
3]))<\/pre>\n<p>The <em><strong>V-measure<\/strong><\/em> is the harmonic mean of homogeneity and completeness:<\/p>\n<p style=\"text-align: center;\">v = 2 * (homogeneity * completeness) \/ (homogeneity + completeness)<\/p>\n<p>Note that this metric does not depend on the absolute values of the labels: a permutation of the class or cluster label values will not change the score in any way. Moreover, the metric is symmetric with respect to switching between the predicted and the original cluster labels.<\/p>\n<p>This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.<\/p>\n<p>Perfect labelings are both homogeneous and complete, hence have a score of 1.0:<\/p>\n<pre>print(metrics.v_measure_score([0, 0, 1, 1], [0, 0, 1, 1]))\nprint(metrics.v_measure_score([0, 0, 1, 1], [1, 1, 0, 0]))<\/pre>\n<p><span style=\"color: #0000ff;\"><strong>Question: Labelings that assign all class members to the same clusters are: ___________, but not _____________:<\/strong><\/span><\/p>\n<pre>print(\"%.3f\" % metrics.completeness_score([0, 1, 2, 3], [0, 0, 0, 0]))\nprint(\"%.3f\" % metrics.homogeneity_score([0, 1, 2, 3], [0, 0, 0, 0]))\nprint(\"%.3f\" % metrics.v_measure_score([0, 1, 2, 3], [0, 0, 0, 0]))\nprint(\"%.3f\" % metrics.v_measure_score([0, 0, 1, 2], [0, 0, 1, 1]))\nprint(\"%.3f\" % metrics.v_measure_score([0, 1, 2, 3], [0, 0, 1, 1]))<\/pre>\n<p><span style=\"color: #0000ff;\"><strong>Question: Labelings that have pure clusters with members coming from the same classes are ________________ but unnecessary splits harm ____________________ and thus penalise the V-measure as well:<\/strong><\/span><\/p>\n<pre>print(\"%.3f\" % metrics.v_measure_score([0, 0, 1, 1], [0, 0, 1, 2]))\nprint(\"%.3f\" 
% metrics.v_measure_score([0, 0, 1, 1], [0, 1, 2, 3]))<\/pre>\n<p>If class members are completely split across different clusters, the assignment is totally incomplete, hence the V-measure is zero:<\/p>\n<pre>print(\"%.3f\" % metrics.v_measure_score([0, 0, 0, 0], [0, 1, 2, 3]))<\/pre>\n<p><span style=\"color: #0000ff;\"><strong>Question: Clusters that include samples from totally different classes totally destroy the _______________________ of the labelling, hence:<\/strong><\/span><\/p>\n<pre>print(\"%.3f\" % metrics.v_measure_score([0, 0, 1, 1], [0, 0, 0, 0]))<\/pre>\n<h4>Advantages<\/h4>\n<ul>\n<li>Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.<\/li>\n<li>Intuitive interpretation: a clustering with a bad V-measure can be qualitatively analysed in terms of homogeneity and completeness to get a better feel for what \u2018kind\u2019 of mistakes the assignment makes.<\/li>\n<li>No assumption is made on the cluster structure: it can be used to compare clustering algorithms such as K-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with \u201cfolded\u201d shapes.<\/li>\n<\/ul>\n<h4>Drawbacks<\/h4>\n<ul>\n<li>The previously introduced metrics are <em>not normalised with regard to random labelling.<\/em> This means that, depending on the number of samples, clusters and ground truth classes, a completely random labelling will not always yield the same values for homogeneity, completeness and hence V-measure. In particular, random labelling will not yield zero scores, especially when the number of clusters is large.<\/li>\n<li>This problem can safely be ignored when <em>the number of samples<\/em> is high, i.e. 
more than a thousand, and the number of clusters is less than 10.<\/li>\n<li>These metrics require <em>knowledge of the ground truth classes<\/em>, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).<\/li>\n<\/ul>\n<p>And if we do not have ground truth?<\/p>\n<p>The <em><strong>Silhouette Coefficient<\/strong><\/em> is calculated using the <em>mean intra-cluster distance<\/em> (a) and the <em>mean nearest-cluster distance<\/em> (b) for each sample. The <em><strong>Silhouette Coefficient<\/strong><\/em> for a sample is:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-176 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.43.37.png\" alt=\"\" width=\"202\" height=\"46\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.43.37.png 370w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.43.37-300x68.png 300w\" sizes=\"auto, (max-width: 202px) 100vw, 202px\" \/><\/p>\n<p>where b is the mean distance between a sample and the points of the nearest cluster that the sample is not part of.<\/p>\n<ul>\n<li>If the Silhouette s(i) is close to 0, the sample lies on the border between its own cluster and the closest neighbouring cluster.<\/li>\n<li>A negative value means that the sample is closer to the neighbouring cluster than to its own.<\/li>\n<li>The average of the Silhouette coefficients of all samples of a given cluster defines the \u201cgoodness\u201d of the cluster.<\/li>\n<li>The <em><strong>average of the Silhouette<\/strong> <\/em>coefficients of all clusters gives an idea of the quality of the clustering 
result.<\/li>\n<\/ul>\n<p>Note that the Silhouette coefficient only makes sense when the number of labels predicted is less than the number of samples clustered.<\/p>\n<ul>\n<li>The score is bounded between -1 and +1.<\/li>\n<li>The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.<\/li>\n<\/ul>\n<h3>2.3 Clustering techniques: how to group samples?<\/h3>\n<p>There are two big families of clustering techniques:<\/p>\n<ul>\n<li>Partitional algorithms: start with a random partition and refine it iteratively. They can be divided into two branches:\n<ul>\n<li>Hard partition algorithms, such as <em><strong>K-means<\/strong><\/em>, assign a unique cluster value to each element in the feature space.<\/li>\n<li>Soft partition algorithms, such as <em><strong>Mixture of Gaussians<\/strong><\/em>, can be viewed as density estimators and assign a confidence or probability to each point in the space.<\/li>\n<\/ul>\n<\/li>\n<li>Hierarchical algorithms: agglomerative (bottom-up) or divisive (top-down).<\/li>\n<\/ul>\n<p>In order to build our intuition about clustering, we will start with the simplest, yet one of the most famous, methods:<strong> K-means<\/strong>.<\/p>\n<h4>2.3.1 K-means algorithm<\/h4>\n<h5>Algorithm:<\/h5>\n<ol>\n<li>Choose the number K of desired clusters.<\/li>\n<li>Initialise the K cluster centres, e.g. randomly.<\/li>\n<li>Decide the cluster memberships of the N data samples by assigning each of them to the nearest cluster centroid (e.g. the center of gravity or mean).<\/li>\n<li>Re-estimate the K cluster centres, assuming the memberships found above are correct.<\/li>\n<li>If none of the N objects changed membership in the last iteration, exit. 
Otherwise go to 3.<\/li>\n<\/ol>\n<p>Let us see this in action:<\/p>\n<pre>import numpy as np\n\n#Create some data\nMAXN=40\nX = np.concatenate([1.25*np.random.randn(MAXN,2), 5+1.5*np.random.randn(MAXN,2)])\nX = np.concatenate([X,[8,3]+1.2*np.random.randn(MAXN,2)])\nX.shape<\/pre>\n<pre>#Just for visualization purposes, create the labels of the 3 distributions\ny = np.concatenate([np.ones((MAXN,1)),2*np.ones((MAXN,1))])\ny = np.concatenate([y,3*np.ones((MAXN,1))])\n\nplt.subplot(1,2,1)\nplt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')\nplt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b')\nplt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g')\nplt.title('Data as were generated')\n\nplt.subplot(1,2,2)\nplt.scatter(X[:,0],X[:,1],color='r')\nplt.title('Data as the algorithm sees them')\n\nfrom sklearn import cluster\n\nK=3 # Assuming to be 3 clusters!\n\nclf = cluster.KMeans(init='random', n_clusters=K)\nclf.fit(X)<\/pre>\n<p>Each clustering algorithm comes in two variants:<\/p>\n<ul>\n<li>a class, that\u00a0implements the <em>fit<\/em>\u00a0method to learn the clusters on train data,<\/li>\n<li>a <em>predict<\/em>\u00a0function, that, given test data, returns an array of integer\u00a0labels corresponding to the different clusters.<\/li>\n<\/ul>\n<p>For the class, the\u00a0labels over the training data can be found in the labels attribute.<\/p>\n<pre>print (clf.labels_) # or\nprint (clf.predict(X)) # equivalent<\/pre>\n<pre>print (X[(y==1).ravel(),0]) #numpy.ravel() returns a flattened array\nprint (X[(y==1).ravel(),1])<\/pre>\n<pre>plt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')\nplt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b')\nplt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g')\n\nfig = plt.gcf()\nfig.set_size_inches((6,5))<\/pre>\n<pre>x = np.linspace(-5,15,200)\nXX,YY = np.meshgrid(x,x)\nsz=XX.shape\ndata=np.c_[XX.ravel(),YY.ravel()]\n# c_ translates slice objects to concatenation along the 
second axis.<\/pre>\n<pre>Z=clf.predict(data) # returns the labels of the data\nprint (Z)<\/pre>\n<p><span style=\"color: #0000ff;\"><strong>Questions: How many &#8220;misclusterings&#8221; do we have?<\/strong><\/span><\/p>\n<pre># Visualize space partition\nplt.imshow(Z.reshape(sz), interpolation='bilinear', origin='lower',\nextent=(-5,15,-5,15),alpha=0.3, vmin=0, vmax=K-1)\nplt.title('Space partitions', size=14)\nplt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')\nplt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b')\nplt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g')\n\nfig = plt.gcf()\nfig.set_size_inches((6,5))\n<\/pre>\n<pre>clf = cluster.KMeans(n_clusters=K, random_state=0)\n#initialize the k-means clustering\nclf.fit(X) #run the k-means clustering\n\ndata=np.c_[XX.ravel(),YY.ravel()]\nZ=clf.predict(data) # returns the clustering labels of the data<\/pre>\n<p>Visualising true labels by coloured points and space tessellation:<\/p>\n<pre>plt.title('Final result of K-means', size=14)\n\nplt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')\nplt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b')\nplt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g')\n\nplt.imshow(Z.reshape(sz), interpolation='bilinear', origin='lower',\nextent=(-5,15,-5,15),alpha=0.3, vmin=0, vmax=K-1)\n\nx = np.linspace(-5,15,200)\nXX,YY = np.meshgrid(x,x)\nfig = plt.gcf()\nfig.set_size_inches((6,5))\n<\/pre>\n<pre>clf = cluster.KMeans(init='random', n_clusters=K, random_state=0)\n#initialize the k-means clustering\nclf.fit(X) #run the k-means clustering\nZx=clf.predict(X)\n\nplt.subplot(1,3,1)\nplt.title('Original labels', size=14)\nplt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')\nplt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='b') # b\nplt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='g') # g\nfig = plt.gcf()\nfig.set_size_inches((12,3))\n\nplt.subplot(1,3,2)\nplt.title('Data without labels', 
size=14)\nplt.scatter(X[(y==1).ravel(),0],X[(y==1).ravel(),1],color='r')\nplt.scatter(X[(y==2).ravel(),0],X[(y==2).ravel(),1],color='r') # b\nplt.scatter(X[(y==3).ravel(),0],X[(y==3).ravel(),1],color='r') # g\nfig = plt.gcf()\nfig.set_size_inches((12,3))\n\nplt.subplot(1,3,3)\nplt.title('Clustering labels', size=14)\nplt.scatter(X[(Zx==1).ravel(),0],X[(Zx==1).ravel(),1],color='r')\nplt.scatter(X[(Zx==2).ravel(),0],X[(Zx==2).ravel(),1],color='b')\nplt.scatter(X[(Zx==0).ravel(),0],X[(Zx==0).ravel(),1],color='g')\nfig = plt.gcf()\nfig.set_size_inches((12,3))<\/pre>\n<p>The K-means algorithm clusters data by trying to separate samples in n groups of equal variance. In other words, the K-means\u00a0algorithm divides a set of N samples X into K disjoint clusters C, each\u00a0described by the mean of the samples in the cluster. The means are\u00a0commonly called the cluster \u201c<em><strong>centroids<\/strong><\/em>\u201d.<\/p>\n<p><strong><span style=\"color: #0000ff;\">Question: Shall the centroids belong to the original set of points?<\/span><\/strong><\/p>\n<p>The K-means algorithm aims to choose centroids minimising a criterion\u00a0known as the <em><strong>i<\/strong><strong>nertia<\/strong><\/em>\u00a0or <em><strong>within-cluster<\/strong><\/em>\u00a0sum-of-squares:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-177 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.54.16.png\" alt=\"\" width=\"284\" height=\"70\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.54.16.png 478w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/10\/Capture-d\u2019\u00e9cran-2018-10-17-\u00e0-13.54.16-300x74.png 300w\" sizes=\"auto, (max-width: 284px) 100vw, 284px\" \/><\/p>\n<p>Inertia or the 
within-cluster sum of squares criterion can be recognised as a measure of how internally coherent clusters are. Several issues should be taken into account:<\/p>\n<ul>\n<li>Inertia responds poorly to elongated clusters, or manifolds with irregular shapes.<\/li>\n<li>Given enough time, K-means will always converge, although possibly to a local minimum that depends on the initialisation of the centroids.<\/li>\n<li>This algorithm requires the number of clusters to be specified.<\/li>\n<li>It scales well to a large number of samples and has been used across a large range of application areas in many different fields.<\/li>\n<\/ul>\n<p>Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings. For this reason, the computation is often done several times, with different initialisations of the <em>centroids<\/em>. One method to help address this issue is the <em><strong>k-means++<\/strong><\/em> initialisation scheme, which is implemented in scikit-learn (use the init='k-means++' parameter). This initialises the centroids to be (generally) distant from each other, leading to provably better results than random initialisation.<\/p>\n<h5>Summary<\/h5>\n<ul>\n<li>(+) Select good seeds using a heuristic (e.g. 
seeds with large\u00a0distance among them).<\/li>\n<li>(+) Try out multiple starting points.<\/li>\n<li>(+) Initialize with the results of another method.<\/li>\n<li>(-) Tends to look for spherical clusters.<\/li>\n<li>(-) Prone to local minima stabilization.<\/li>\n<\/ul>\n<pre>from sklearn import metrics\n\nclf = cluster.KMeans(n_clusters=K, init='k-means++', random_state=0,\nmax_iter=300, n_init=10)\n#initialize the k-means clustering\nclf.fit(X) #run the k-means clustering\n\nprint ('Final evaluation of the clustering:')\n\nprint('Inertia: %.2f' % clf.inertia_)\n\nprint('Adjusted_rand_score %.2f' % metrics.adjusted_rand_score(y.ravel(),\nclf.labels_))\n\nprint('Homogeneity %.2f' % metrics.homogeneity_score(y.ravel(),\nclf.labels_))\n\nprint('Completeness %.2f' % metrics.completeness_score(y.ravel(),\nclf.labels_))\n\nprint('V_measure %.2f' % metrics.v_measure_score(y.ravel(), clf.labels_))\n\nprint('Silhouette %.2f' % metrics.silhouette_score(X, clf.labels_,\nmetric='euclidean'))\n\nclf1 = cluster.KMeans(n_clusters=K, init='random', random_state=0,\nmax_iter=2, n_init=2)\n#initialize the k-means clustering\nclf1.fit(X) #run the k-means clustering\n\nprint ('Final evaluation of the clustering:')\n\nprint ('Inertia: %.2f' % clf1.inertia_)\n\nprint ('Adjusted_rand_score %.2f' % metrics.adjusted_rand_score(y.ravel(),\nclf1.labels_))\n\nprint ('Homogeneity %.2f' % metrics.homogeneity_score(y.ravel(),\nclf1.labels_))\n\nprint ('Completeness %.2f' % metrics.completeness_score(y.ravel(),\nclf1.labels_))\n\nprint ('V_measure %.2f' % metrics.v_measure_score(y.ravel(),\nclf1.labels_))\n\nprint ('Silhouette %.2f' % metrics.silhouette_score(X, clf1.labels_,\nmetric='euclidean'))<\/pre>\n<h3>3. 
CASE STUDY: EUROSTAT data analysis<\/h3>\n<p>Eurostat is the home of European Commission data:<\/p>\n<p style=\"text-align: center;\">http:\/\/ec.europa.eu\/eurostat<\/p>\n<p>Eurostat\u2019s main role is to process and publish comparable statistical information at European level. Data in Eurostat is provided by each member state. Eurostat&#8217;s policy allows free re-use of its data, both for non-commercial and commercial purposes (with some minor exceptions).<\/p>\n<h4>Applying clustering to analyze countries according to their education resources<\/h4>\n<p>In order to illustrate clustering on a real dataset, we will analyze the indicators on education finance among the European member states, provided by the Eurostat data bank. The data is organized by year (TIME): [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011] and country (GEO): [&#8216;Albania&#8217;, &#8216;Austria&#8217;, &#8216;Belgium&#8217;, &#8216;Bulgaria&#8217;, &#8216;Croatia&#8217;, &#8216;Cyprus&#8217;, &#8216;Czech Republic&#8217;, &#8216;Denmark&#8217;, &#8216;Estonia&#8217;, &#8216;Euro area (13 countries)&#8217;, &#8216;Euro area (15 countries)&#8217;, &#8216;European Union (25 countries)&#8217;, &#8216;European Union (27 countries)&#8217;, &#8216;Finland&#8217;, &#8216;Former Yugoslav Republic of Macedonia, the&#8217;, &#8216;France&#8217;, &#8216;Germany (until 1990 former territory of the FRG)&#8217;, &#8216;Greece&#8217;, &#8216;Hungary&#8217;, &#8216;Iceland&#8217;, &#8216;Ireland&#8217;, &#8216;Italy&#8217;, &#8216;Japan&#8217;, &#8216;Latvia&#8217;, &#8216;Liechtenstein&#8217;, &#8216;Lithuania&#8217;, &#8216;Luxembourg&#8217;, &#8216;Malta&#8217;, &#8216;Netherlands&#8217;, &#8216;Norway&#8217;, &#8216;Poland&#8217;, &#8216;Portugal&#8217;, &#8216;Romania&#8217;, &#8216;Slovakia&#8217;, &#8216;Slovenia&#8217;, &#8216;Spain&#8217;, &#8216;Sweden&#8217;, 
&#8216;Switzerland&#8217;, &#8216;Turkey&#8217;, &#8216;United Kingdom&#8217;, &#8216;United States&#8217;]. Twelve indicators (INDIC_ED) on education finance are given with their values (Value), for example:<\/p>\n<ol>\n<li>Expenditure on educational institutions from private sources as % of Gross Domestic Product (GDP), for all levels of education combined;<\/li>\n<li>Expenditure on educational institutions from public sources as % of GDP, for all levels of government combined;<\/li>\n<li>Expenditure on educational institutions from public sources as % of total public expenditure, for all levels of education combined;<\/li>\n<li>Public subsidies to the private sector as % of GDP, for all levels of education combined;<\/li>\n<li>Public subsidies to the private sector as % of total public expenditure, for all levels of education combined; etc.<\/li>\n<\/ol>\n<p>We can store the 12 indicators for a given year (e.g. 2010) in a table.<\/p>\n<p>Let us start by having a look at the data.<\/p>\n<pre>#Read and check the dataset downloaded from Eurostat\n\nimport pandas as pd\nimport numpy as np\n\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn import cluster\n\nedu=pd.read_csv('.\/files\/ch07\/educ_figdp_1_Data.csv',na_values=':')\nedu.head()<\/pre>\n<pre>edu.tail()<\/pre>\n<p>Data in CSV files and databases are often organised in what are called <em>stacked<\/em> or <em>record<\/em> formats. In our case, for each year (TIME) and country (GEO) of the EU, as well as some reference countries such as Japan and the United States, we have twelve indicators (INDIC_ED) on education finance with their values (Value). 
Let us reshape the table into a feature vector style data set.<\/p>\n<p>The process of reshaping stacked data into a table is sometimes called <em>pivoting<\/em>.<\/p>\n<pre>#Pivot table in order to get a nice feature vector representation with dual indexing by TIME and GEO\npivedu=pd.pivot_table(edu, values='Value', index=['TIME', 'GEO'], columns=['INDIC_ED'])\npivedu.head()<\/pre>\n<pre>print ('Let us check the two indices:\\n')\nprint ('\\nPrimary index (TIME): \\n' + str(pivedu.index.levels[0].tolist()))\nprint ('\\nSecondary index (GEO): \\n' + str(pivedu.index.levels[1].tolist()))<\/pre>\n<p>Observe that we have ten years of information on these indicators and, as expected, we have all members of the European Union with some aggregates and control\/reference countries. For the sake of simplicity, let us focus on the values for the year 2010.<\/p>\n<pre>#Extract 2010 set of values\nedu2010=pivedu.loc[2010]\nedu2010.head()<\/pre>\n<p>Let us clean and store the names of the features and the countries.<\/p>\n<pre>#Store column names and clear them for better handling. Do the same with countries\nedu2010 = edu2010.rename(index={'Euro area (13 countries)': 'EU13',\n'Euro area (15 countries)': 'EU15',\n'European Union (25 countries)': 'EU25',\n'European Union (27 countries)': 'EU27',\n'Former Yugoslav Republic of Macedonia, the': 'Macedonia',\n'Germany (until 1990 former territory of the FRG)': 'Germany'\n})\nfeatures = edu2010.columns.tolist()\n\ncountries = edu2010.index.tolist()\n\nedu2010.columns=range(12)\nedu2010.head()<\/pre>\n<p>As we can observe, this is not a clean data set: there are missing values. Some countries may not collect or have access to some indicators, and there are countries without any indicators. 
Let us display this effect.<\/p>\n<pre>#Check what is going on in the NaN data\nnan_countries=np.sum(np.where(edu2010.isnull(),1,0),axis=1)\nplt.bar(np.arange(nan_countries.shape[0]),nan_countries)\nplt.xticks(np.arange(nan_countries.shape[0]),countries,rotation=90,horizontalalignment='left',\nfontsize=12)\nfig = plt.gcf()\nfig.set_size_inches((12,5))<\/pre>\n<p>We have no information on Albania, Macedonia and Greece, and very limited information on Liechtenstein, Luxembourg and Turkey, so let us work without them. Now let us check the features.<\/p>\n<pre>#Remove non info countries\nwrk_countries = nan_countries&lt;4\n\neduclean=edu2010.loc[wrk_countries] #boolean mask keeps the countries with fewer than 4 missing values (.ix was removed in pandas 1.0)\n\n#Let us check the features we have\nna_features = np.sum(np.where(educlean.isnull(),1,0),axis=0)\nprint (na_features)\n\nplt.bar(np.arange(na_features.shape[0]),na_features)\nplt.xticks(fontsize=12)\nfig = plt.gcf()\nfig.set_size_inches((8,4))<\/pre>\n<p>There are four features with missing data. At this point we can proceed in two ways:<\/p>\n<ul>\n<li>Fill in the features with some non-informative, non-biasing data.<\/li>\n<li>Drop the features with missing values.<\/li>\n<\/ul>\n<p>If we have many features and only a few have missing values, then it does little harm to drop them. However, if missing values are spread across the features, we eventually have to deal with them. 
In our case,\u00a0both options seem reasonable, so we will proceed with both at the same\u00a0time.<\/p>\n<pre>#Option A fills those features with some value, at risk of extracting wrong information\n#Constant filling : edufill0=educlean.fillna(0)\nedufill=educlean.fillna(educlean.mean())\nprint ('Filled in data shape: ' + str(edufill.shape))\n\n#Option B drops those features\nedudrop=educlean.dropna(axis=1)\n#dropna: Return object with labels on given axis omitted where alternately any or\n# all of the data are missing\nprint ('Drop data shape: ' + str(edudrop.shape))<\/pre>\n<p>In the fill-in option, we have decided to fill the data with the mean\u00a0value of the feature. This will not bias the distribution of the\u00a0feature, though it has consequences in the interpretation of the\u00a0results.<\/p>\n<p>Let us now apply a K-means clustering technique on this data in order to\u00a0partition the countries according to their investment in education and\u00a0check their profiles.<\/p>\n<pre>scaler = StandardScaler() #Standardize features by removing the mean and scaling to unit variance\n\nX_train_fill = edufill.values\nX_train_fill = scaler.fit_transform(X_train_fill)\n\nclf = cluster.KMeans(init='k-means++', n_clusters=3, random_state=42)\n\nclf.fit(X_train_fill) #Compute k-means clustering.\n\ny_pred_fill = clf.predict(X_train_fill)\n#Predict the closest cluster each sample in X belongs to.\n\nidx=y_pred_fill.argsort()<\/pre>\n<p>Let us visualise the result of the K-means clustering:<\/p>\n<pre>plt.plot(np.arange(35),y_pred_fill[idx],'ro')\nwrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]\n\nplt.xticks(np.arange(len(wrk_countries_names)),[wrk_countries_names[i] for i in idx],\nrotation=90,horizontalalignment='left',fontsize=12)\nplt.title('Using filled in data', size=15)\nplt.yticks([0,1,2])\nfig = plt.gcf()\n\nfig.set_size_inches((12,5))<\/pre>\n<p>Let us apply the clustering on the dataset with dropped missing 
values:<\/p>\n<pre>X_train_drop = edudrop.values\nX_train_drop = scaler.fit_transform(X_train_drop)\n\nclf.fit(X_train_drop) #Compute k-means clustering.\ny_pred_drop = clf.predict(X_train_drop) #Predict the closest cluster of each sample in X.<\/pre>\n<pre>idx=y_pred_drop.argsort()\nplt.plot(np.arange(len(y_pred_drop)),y_pred_drop[idx],'ro')\nwrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]\n\nplt.xticks(np.arange(len(wrk_countries_names)),[wrk_countries_names[i] for i in idx],\nrotation=90,horizontalalignment='left',fontsize=12)\nplt.title('Using dropped missing values data',size=15)\nfig = plt.gcf()\nplt.yticks([0,1,2])\nfig.set_size_inches((12,5))<\/pre>\n<p>We have sorted the data for better visualization. At a glance we can see that the two partitions can be different. We can check this effect better by plotting the cluster values of one technique against the other.<\/p>\n<pre>#Add a small random jitter so that overlapping points remain visible\nn = len(y_pred_drop)\nplt.plot(y_pred_drop+0.2*np.random.rand(n),y_pred_fill+0.2*np.random.rand(n),'bo')\nplt.xlabel('Predicted clusters for the dropped missing values dataset.')\nplt.ylabel('Predicted clusters for the filled in dataset.')\nplt.title('Correlations')\nplt.xticks([0,1,2])\nplt.yticks([0,1,2])\n<\/pre>\n<p>Well, looking at both methods, both may yield the same results, but not necessarily always. This is mainly due to two aspects: the random initialisation of the k-means clustering and the fact that each method works in a different space (dropped data vs. filled-in data).<\/p>\n<p>Let us check the list of countries in both methods. 
Note that we should\u00a0not consider the cluster label itself, since it is arbitrary.<\/p>\n<pre>#For each cluster, the first list comes from the filled-in data and the second from the dropped data\nprint ('Cluster 0: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_fill)\nif item==0]))\nprint ('Cluster 0: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_drop)\nif item==0]))\nprint ('\\n')\nprint ('Cluster 1: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_fill)\nif item==1]))\nprint ('Cluster 1: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_drop)\nif item==1]))\nprint ('\\n')\nprint ('Cluster 2: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_fill)\nif item==2]))\nprint ('Cluster 2: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred_drop)\nif item==2]))\nprint ('\\n')<\/pre>\n<p>Let us check the profile of the clusters by looking at the centroids:<\/p>\n<pre>width=0.3\n# Scale the centroids back to the original representation\ncenters = scaler.inverse_transform(clf.cluster_centers_)\np1 = plt.bar(np.arange(8),centers[1],width,color='b')\np2 = plt.bar(np.arange(8)+width,centers[2],width,color='yellow')\np0 = plt.bar(np.arange(8)+2*width,centers[0],width,color='r')\n\nplt.legend( (p0[0], p1[0], p2[0]), ('Cluster 0', 'Cluster 1', 'Cluster 2') ,loc=9)\nplt.xticks(np.arange(8) + 0.5, np.arange(8),size=12)\nplt.yticks(size=12)\nplt.xlabel('Economic indicators')\nplt.ylabel('Average expenditure')\nfig = plt.gcf()<\/pre>\n<p>It looks like cluster &#8220;1&#8221; spends more on education, while cluster &#8220;0&#8221;\u00a0is the one that devotes fewer resources to education. What about Spain?<\/p>\n<p>Let us look a little closer at cluster &#8220;0&#8221; and check how close its\u00a0members are to cluster &#8220;1&#8221;. 
This may give us a hint on a\u00a0possible ordering.<\/p>\n<pre>from scipy.spatial import distance\np = distance.cdist(X_train_drop[y_pred_drop==0,:],[clf.cluster_centers_[1]],'euclidean')\n#the distance of the elements of cluster 0 to the center of cluster 1\n\nfx = np.vectorize(int) #truncate the distances to integers (np.int no longer exists in recent NumPy)\n\nplt.plot(np.arange(p.shape[0]),\nfx(p)\n)\n\nwrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]\nzero_countries_names = [wrk_countries_names[i] for i,item in enumerate(y_pred_drop)\nif item==0]\nplt.xticks(np.arange(len(zero_countries_names)),zero_countries_names,rotation=90,\nhorizontalalignment='left',fontsize=12)<\/pre>\n<p>It seems that Spain belongs to cluster &#8220;0&#8221;, although it is the closest to\u00a0changing to a policy along the lines of the other clusters.<\/p>\n<p>Additionally, we can also check the distance to the centroid of cluster\u00a0&#8220;0&#8221;.<\/p>\n<pre>from scipy.spatial import distance\np = distance.cdist(X_train_drop[y_pred_drop==0,:],[clf.cluster_centers_[1]],'euclidean')\npown = distance.cdist(X_train_drop[y_pred_drop==0,:],[clf.cluster_centers_[0]],'euclidean')\n\n#Side-by-side bars: distance to cluster 1 vs. distance to the own cluster 0\nwidth=0.45\np0=plt.bar(np.arange(p.shape[0]),fx(p).ravel(),width)\np1=plt.bar(np.arange(p.shape[0])+width,fx(pown).ravel(),width,color = 'red')\n\nwrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]\nzero_countries_names = [wrk_countries_names[i] for i,item in enumerate(y_pred_drop)\nif item==0]\nplt.xticks(np.arange(len(zero_countries_names)),zero_countries_names,rotation=90,\nhorizontalalignment='left',fontsize=12)\nplt.legend( (p0[0], p1[0]), ('d -&gt; 1', 'd -&gt; 0') ,loc=1)<\/pre>\n<p>Let us redo the clustering with K=4 and see what we can\u00a0conclude.<\/p>\n<pre>#Here the clustering runs on the unscaled data, so the centroids are already in the original units\nX_train = edudrop.values\nclf = cluster.KMeans(init='k-means++', n_clusters=4, random_state=0)\nclf.fit(X_train)\ny_pred = clf.predict(X_train)\n\nidx=y_pred.argsort()\nplt.plot(np.arange(35),y_pred[idx],'ro')\nwrk_countries_names = [countries[i] for i,item in 
enumerate(wrk_countries) if item ]\n\nplt.xticks(np.arange(len(wrk_countries_names)),[wrk_countries_names[i] for i in idx],rotation=90,\nhorizontalalignment='left',fontsize=12)\nplt.title('Using drop features',size=15)\nplt.yticks([0,1,2,3])\nfig = plt.gcf()\nfig.set_size_inches((12,5))<\/pre>\n<pre>width=0.2\np0 = plt.bar(np.arange(8)+1*width,clf.cluster_centers_[0],width,color='r')\np1 = plt.bar(np.arange(8),clf.cluster_centers_[1],width,color='b')\np2 = plt.bar(np.arange(8)+3*width,clf.cluster_centers_[2],width,color='yellow')\np3 = plt.bar(np.arange(8)+2*width,clf.cluster_centers_[3],width,color='pink')\n\nplt.legend( (p0[0], p1[0], p2[0], p3[0]), ('Cluster 0', 'Cluster 1', 'Cluster 2',\n'Cluster 3') ,loc=9)\nplt.xticks(np.arange(8) + 0.5, np.arange(8),size=12)\nplt.yticks(size=12)\nplt.xlabel('Economic indicator')\nplt.ylabel('Average expenditure')\nfig = plt.gcf()\nfig.set_size_inches((12,5))<\/pre>\n<p>Spain is still in cluster &#8220;0&#8221;, but as we observed in our previous\u00a0clustering, it was very close to changing cluster. This time cluster\u00a0&#8220;0&#8221; includes the average values for the EU members. Just for the sake\u00a0of completeness, let us write down the names of the countries in the\u00a0clusters.<\/p>\n<pre>print ('Cluster 0: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred) if item==0]))\n\nprint ('Cluster 1: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred) if item==1]))\n\nprint ('Cluster 2: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred) if item==2]))\n\nprint ('Cluster 3: \\n' + str([wrk_countries_names[i] for i,item in enumerate(y_pred) if item==3]))\n\n#Save data for future use.\nimport pickle\nwith open('edu2010.pkl', 'wb') as ofname:\n    pickle.dump([edu2010, wrk_countries_names,y_pred ],ofname)<\/pre>\n<p>We can repeat the process using the alternative clustering techniques\u00a0and compare their results. 
Let us first apply spectral clustering.\u00a0The corresponding code is:<\/p>\n<pre>from scipy.cluster.hierarchy import linkage, dendrogram\nfrom scipy.spatial.distance import pdist\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.neighbors import kneighbors_graph\nfrom sklearn.metrics import euclidean_distances\n\nX = StandardScaler().fit_transform(edudrop.values)\n\ndistances = euclidean_distances(edudrop.values)\n\nspectral = cluster.SpectralClustering(n_clusters=4, affinity=\"nearest_neighbors\")\nspectral.fit(edudrop.values)\n\ny_pred = spectral.labels_.astype(int)<\/pre>\n<p>If we visualise the results:<\/p>\n<pre>idx=y_pred.argsort()\n\nplt.plot(np.arange(35),y_pred[idx],'ro')\nwrk_countries_names = [countries[i] for i,item in enumerate(wrk_countries) if item ]\n\nplt.xticks(np.arange(len(wrk_countries_names)),[wrk_countries_names[i]\nfor i in idx],rotation=90,horizontalalignment='left',fontsize=12)\n\nplt.yticks([0,1,2,3])\n\nplt.title('Applying Spectral Clustering on the drop features',size=15)\nfig = plt.gcf()\nfig.set_size_inches((12,5))<\/pre>\n<p>Note that, in general, spectral clustering tends to produce more\u00a0balanced clusters. Here, the predicted cluster 1 merges\u00a0clusters 2 and 3 of the K-means clustering, cluster 2 corresponds to\u00a0cluster 1 of the K-means clustering, cluster 0 mainly goes to cluster 2,\u00a0and cluster 3 corresponds to cluster 0 of the K-means clustering.<\/p>\n<p>With agglomerative clustering, we obtain not only the\u00a0clusters but also the way in which they are built. This\u00a0tells us which pairs of\u00a0countries and clusters are most similar. 
The corresponding code\u00a0that applies the agglomerative clustering is:<\/p>\n<pre>X_train = edudrop.values\ndist = pdist(X_train,'euclidean')\nlinkage_matrix = linkage(dist,method = 'complete');\nplt.figure() # we need a tall figure\nfig = plt.gcf()\nfig.set_size_inches((12,12))\ndendrogram(linkage_matrix, orientation=\"right\", color_threshold = 4,labels = wrk_countries_names, leaf_font_size=20);\n\nplt.show()<\/pre>\n<p>In SciPy's dendrogram function, the parameter color_threshold colors all the\u00a0descendant links below a cluster node k the same color if k is the first\u00a0node below the color threshold. All links connecting nodes with\u00a0distances greater than or equal to the threshold are colored blue. Thus,\u00a0if we use color threshold = 3, the obtained clusters are as follows:<\/p>\n<ul>\n<li>Cluster 0: [&#8216;Cyprus&#8217;, &#8216;Denmark&#8217;, &#8216;Iceland&#8217;]<\/li>\n<li>Cluster 1: [&#8216;Bulgaria&#8217;, &#8216;Croatia&#8217;, &#8216;Czech Republic&#8217;, &#8216;Italy&#8217;,<br \/>\n&#8216;Japan&#8217;, &#8216;Romania&#8217;, &#8216;Slovakia&#8217;]<\/li>\n<li>Cluster 2: [&#8216;Belgium&#8217;, &#8216;Finland&#8217;, &#8216;Ireland&#8217;, &#8216;Malta&#8217;, &#8216;Norway&#8217;,<br \/>\n&#8216;Sweden&#8217;]<\/li>\n<li>Cluster 3: [&#8216;Austria&#8217;, &#8216;Estonia&#8217;, &#8216;EU13&#8217;, &#8216;EU15&#8217;, &#8216;EU25&#8217;, &#8216;EU27&#8217;,<br \/>\n&#8216;France&#8217;, &#8216;Germany&#8217;, &#8216;Hungary&#8217;, &#8216;Latvia&#8217;, &#8216;Lithuania&#8217;, &#8216;Netherlands&#8217;,<br \/>\n&#8216;Poland&#8217;, &#8216;Portugal&#8217;, &#8216;Slovenia&#8217;, &#8216;Spain&#8217;, &#8216;Switzerland&#8217;, &#8216;United<br \/>\nKingdom&#8217;, &#8216;United States&#8217;]<\/li>\n<\/ul>\n<p>Note that they correspond to a high degree to the clusters obtained by\u00a0K-means (up to a permutation of the cluster labels, which is irrelevant). 
The\u00a0figure shows the construction of the clusters using complete-linkage\u00a0agglomerative clustering. Cutting the\u00a0dendrogram at different levels yields different numbers of clusters.<\/p>\n<p>As a summary,\u00a0let us compare the results of the three clustering approaches. We\u00a0cannot expect the results to coincide, since different approaches are\u00a0based on different criteria to construct the clusters. Still, we can\u00a0observe that in this case K-means and the agglomerative approach gave\u00a0the same results (up to a permutation of the cluster labels, which is\u00a0irrelevant), whereas the spectral clustering gave more evenly\u00a0distributed clusters. It fused clusters 0 and 2 of the agglomerative\u00a0clustering into its cluster 1, and split cluster 3 of the agglomerative clustering\u00a0into its clusters 0 and 3. Note that these results can change when using\u00a0a different distance between data points.<\/p>\n<h3>References<\/h3>\n<p>This notebook was created by Petia\u00a0Radeva and Oriol Pujol\u00a0Vila.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Objective Discuss different techniques for unsupervised&nbsp;learning and will focus on several clustering techniques. Consider&nbsp;basic concepts like distance and similarity, taxonomy of&nbsp;clustering techniques and goodness of clustering quality. Explore three basic clustering techniques, namely, K-means, spectral&nbsp;clustering and hierarchical clustering. 
Illustrate&nbsp;the use of clustering techniques on a real problem: defining groups of&nbsp;countries according to their economic expenditure [&hellip;]<\/p>\n","protected":false},"author":11,"featured_media":0,"parent":11,"menu_order":5,"comment_status":"closed","ping_status":"closed","template":"page-templates\/full-width.php","meta":{"footnotes":""},"class_list":["post-184","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/184","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/comments?post=184"}],"version-history":[{"count":7,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/184\/revisions"}],"predecessor-version":[{"id":439,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/184\/revisions\/439"}],"up":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/11"}],"wp:attachment":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/media?parent=184"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}