{"id":56,"date":"2018-09-29T19:28:05","date_gmt":"2018-09-29T19:28:05","guid":{"rendered":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/?page_id=56"},"modified":"2022-02-15T11:53:07","modified_gmt":"2022-02-15T11:53:07","slug":"exploring-data-collections-using-descriptive-statistics","status":"publish","type":"page","link":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/hands-on\/exploring-data-collections-using-descriptive-statistics\/","title":{"rendered":"Exploring data collections using descriptive statistics"},"content":{"rendered":"<h2>1.\u00a0\u00a0\u00a0Objective<\/h2>\n<p>Apply descriptive statistics to explore data collections and compute quantitative descriptions. Recall that descriptive statistics applies the concepts, measures, and terms that are used to describe the basic features of the samples in a study. Together with simple graphics, these measures form the basis of quantitative analysis of data. To describe the sample data and to be able to infer any conclusion, we should go through several steps: data preparation and descriptive analytics.<\/p>\n<h2>2.\u00a0\u00a0\u00a0Material<\/h2>\n<ul>\n<li>Jupyter Notebook Server<\/li>\n<li>Python libraries<\/li>\n<li>Data collection: https:\/\/archive.ics.uci.edu\/ml\/datasets\/Adult.<\/li>\n<\/ul>\n<h2>3.\u00a0\u00a0\u00a0To Do and To Hand In<\/h2>\n<ul>\n<li>Go through the exercise and execute the functions that help you explore the content of the data collection.<\/li>\n<li><strong>In markdown cells in your notebook\u00a0<\/strong>report the questions and answers obtained. Write at the end the research questions that the exploration answered to, and your interpretation for answering them according to the statistical results.<\/li>\n<li>Download your notebook store it in your personal public space and share the URL <a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1X0b_mVOf1ot8dkB3nh-f6YLg0VIpA-TaplSyjIyoWy4\/edit?usp=sharing\">here<\/a> in the column HO-2<\/li>\n<\/ul>\n<h2>4.\u00a0\u00a0\u00a0Playing with the Education and training data<\/h2>\n<p>Let us consider a public database called the \u201cAdult\u201d dataset, hosted on the UCI\u2019s Machine Learning Repository. It contains approximately 32,000 observations concerning different financial parameters related to the US population: age, sex, marital (marital status of the individual), country, income (Boolean variable: whether the person makes more than $50,000 per annum), education (the highest level of education achieved by the individual), occupation, capital gain, etc.<\/p>\n<p>We will show that we can explore the data by asking questions like: <em>\u201cAre men more likely to become high-income professionals than women, i.e., to receive an income of over $50,000 per annum?\u201d<\/em><\/p>\n<h3>4.1 \u00a0Getting acquainted with data<\/h3>\n<p>First, let us read the data:<\/p>\n<pre>file = open('files\/adult.data', 'r')<\/pre>\n<p>Checking the data, we obtain:<\/p>\n<pre>def chr_int(a):\n    if a.isdigit():\n        return int(a)\n    else:\n        return 0\n                \ndata=[]\nfor line in file:\n     data1=line.split(', ')\n     if len(data1)==15:\n        data.append([chr_int(data1[0]),data1[1],chr_int(data1[2]),data1[3],chr_int(data1[4]),data1[5],data1[6],\\\n            data1[7],data1[8],data1[9],chr_int(data1[10]),chr_int(data1[11]),chr_int(data1[12]),data1[13],\\\n            data1[14]])<\/pre>\n<pre>print (data[1:2])<\/pre>\n<ol>\n<li><strong>What is the obtained result? What did you ask for in the previous command? Explain.<\/strong><\/li>\n<\/ol>\n<p>One of the easiest ways to manage data in Python is by using the DataFramestructure, defined in the <em>Pandas <\/em>library, which is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes:<\/p>\n<pre><strong><span style=\"color: #999999;\">%matplotlib inline<\/span><\/strong>\nimport pandas as pd\n\ndf = pd.DataFrame(data) #  Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes \n\ndf.columns = ['age', 'type_employer', 'fnlwgt', 'education', \n                \"education_num\",\"marital\", \"occupation\", \"relationship\", \"race\",\"sex\",\n                \"capital_gain\", \"capital_loss\", \"hr_per_week\",\"country\",\"income\"]\ndf.head()<\/pre>\n<ol start=\"2\">\n<li><strong>Describe an explain the result.<\/strong><\/li>\n<\/ol>\n<pre>df.tail()<\/pre>\n<ol start=\"3\">\n<li><strong>Describe and explain the result. Compare with the previous one.<\/strong><\/li>\n<\/ol>\n<p>The command shape gives exactly the number of data samples (in rows, in this case) and features (in columns):<\/p>\n<pre>df.shape<\/pre>\n<ol start=\"4\">\n<li><strong>Describe an explain the result.<\/strong><\/li>\n<\/ol>\n<pre>counts = df.groupby('country').size()\n\nprint (counts) \n<\/pre>\n<p><strong>5. How many items are there for USA? and for Mexico?<\/strong><\/p>\n<pre>counts = df.groupby('age').size() # grouping by age\nprint (counts)<\/pre>\n<p><strong>6. What is the age of the most represented people?<\/strong><\/p>\n<p>The first row shows the number of samples with unknown country, followed by the number of samples corresponding to the first countries in the dataset.<\/p>\n<p>Let us split people according to their gender into two groups: men and women.<\/p>\n<pre>ml = df[(df.sex == 'Male')] # grouping by sex\nml.shape\nml1 = df[(df.sex == 'Male')&amp;(df.income=='&gt;50K\\n')]\nml1.shape<\/pre>\n<p>If we focus on high-income professionals separated by sex, we can do:<\/p>\n<pre>fm =df[(df.sex == 'Female')]\nfm.shape<\/pre>\n<pre>fm1 =df[(df.sex == 'Female')&amp;(df.income=='&gt;50K\\n')]\nfm1.shape\n<\/pre>\n<pre>df1=df[(df.income=='&gt;50K\\n')]\n\nprint ('The rate of people with high income is: ', int(len(df1)\/float(len(df))*100), '%.' )\nprint ('The rate of men with high income is: ', int(len(ml1)\/float(len(ml))*100), '%.' )\nprint ('The rate of women with high income is: ', int(len(fm1)\/float(len(fm))*100), '%.' )<\/pre>\n<h2>4.2 Explanatory Data Analysis<\/h2>\n<p>The data that come from performing a particular measurement on all the subjects in a sample represent our observations for a single characteristic like country, age, education, etc.<\/p>\n<p>Objective: to visualize and summarize the sample distribution, thereby allowing us to make tentative assumptions about the population distribution.<\/p>\n<h3>4.2.1 Summarizing the Data<\/h3>\n<p>The data in can be <em>categorical<\/em>for which a simple tabulation of the frequency of each category is the best non-graphical exploration for data analysis. For example, we can ask <em>what is the proportion of high- income professionals in our database:<\/em><\/p>\n<pre>df1=df[(df.income=='&gt;50K\\n')]\n\nprint ('The rate of people with high income is: ', int(len(df1)\/float(len(df))*100), '%.' )\nprint ('The rate of men with high income is: ', int(len(ml1)\/float(len(ml))*100), '%.' )\nprint ('The rate of women with high income is: ', int(len(fm1)\/float(len(fm))*100), '%.' )<\/pre>\n<p><strong>7. Describe an explain the result.<\/strong><\/p>\n<p><em>\u00a0<\/em><em>Quantitative<\/em>: exploratory data analysis is a way to make preliminary assessments about the population distribution of the variable. The characteristics of the population distribution of a quantitative variable are its <em>mean<\/em>, <em>deviation<\/em>, <em>histograms<\/em>,<em>outliers<\/em>, etc.<\/p>\n<h4>4.2.1.1 Mean<\/h4>\n<p>In our case, we can consider what the average age of men and women samples in our dataset would be in terms of their mean:<\/p>\n<pre>print ('The average age of men is: ', ml['age'].mean(), '.' )\nprint ('The average age of women is: ', fm['age'].mean(), '.')<\/pre>\n<pre>print ('The average age of high-income men is: ', ml1['age'].mean(), '.' )\nprint ('The average age of high-income women is: ', fm1['age'].mean(), '.')<\/pre>\n<p><strong>8. Describe an explain the result.<\/strong><\/p>\n<p><em>Comment:<\/em>Later, we will work with both concepts: the population mean and the sample mean. We should not confuse them! The first is the mean of samples taken from the population; the second, the mean of the whole population.<\/p>\n<h4>4.2.1.2 Sample Variance<\/h4>\n<p>Usually, mean is not a sufficient descriptor of the data, we can do a little better with two numbers: mean and\u00a0<strong>variance<\/strong>:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-65 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/Sans-titre.png\" alt=\"\" width=\"141\" height=\"50\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/Sans-titre.png 370w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/Sans-titre-300x106.png 300w\" sizes=\"auto, (max-width: 141px) 100vw, 141px\" \/><\/p>\n<p><strong>Variance<\/strong>\u00a0\u03c32\u00a0describes the\u00a0<em>spread<\/em>\u00a0of data. The term\u00a0(xi\u2212\u03bc)\u00a0is called the\u00a0<em>deviation from the mean<\/em>, so variance is the mean squared deviation.<\/p>\n<p>The square root of variance,\u00a0\u03c3, is called the\u00a0<strong>standard deviation<\/strong>. We define standard deviation because variance is hard to interpret (in the case the units are grams, the variance is in grams squared).\u00a0Let us compute the mean and the variance of hours per week men and women in our dataset work:<\/p>\n<pre>ml_mu = ml['age'].mean()\nfm_mu = fm['age'].mean()\nml_var = ml['age'].var()\nfm_var = fm['age'].var()\nml_std = ml['age'].std()\nfm_std = fm['age'].std()\n\nprint ('Statistics of age for men: mu:', ml_mu, 'var:', ml_var, 'std:', ml_std)\nprint ('Statistics of age for women: mu:', fm_mu, 'var:', fm_var, 'std:', fm_std)<\/pre>\n<pre>ml_mu_hr = ml['hr_per_week'].mean()\nfm_mu_hr = fm['hr_per_week'].mean()\nml_var_hr = ml['hr_per_week'].var()\nfm_var_hr = fm['hr_per_week'].var()\nml_std_hr = ml['hr_per_week'].std()\nfm_std_hr = fm['hr_per_week'].std()\n\nprint ('Statistics of hours per week for men: mu:', ml_mu_hr, 'var:', ml_var_hr, 'std:', ml_std_hr)\nprint ('Statistics\u00a0of hours per week for women: mu:', fm_mu_hr, 'var:', fm_var_hr, 'std:', fm_std_hr)<\/pre>\n<p><strong>9. Describe an explain the result.<\/strong><\/p>\n<h4>4.2.1.3 Sample Median<\/h4>\n<p>What will happen if in the sample set there is an error with a value very different from the rest? For example, considering hours worked per week, it would normally be in a range between 20 and 80; but what would happen if by mistake there was a value of 1000? An item of data that is significantly different from the rest of the data is called an <em>outlier<\/em>. In this case, the mean, \u03bc, will be drastically changed towards the outlier. One solution to this drawback is offered by the statistical <em>median<\/em>, \u03bc12, which is an order statistic giving the middle value of a sample. In this case, all the values are ordered by their magnitude and the median is defined as the value that is in the middle of the ordered list. Hence, it is a value that is much more robust in the face of outliers.<\/p>\n<p>Let us see, the median age of working men and women in our dataset and the median age of high-income men and women:<\/p>\n<pre>ml_median= ml['age'].median()\nfm_median= fm['age'].median()\n\nprint (\"Median age per men and women: \", ml_median, fm_median)<\/pre>\n<pre>ml_median_age= ml1['age'].median()\nfm_median_age= fm1['age'].median()\n\nprint (\"Median age per men and women with high-income: \", ml_median_age, fm_median_age)<\/pre>\n<pre>ml_median_hr= ml['hr_per_week'].median()\nfm_median_hr= fm['hr_per_week'].median()\nprint (\"Median hours per week per men and women: \", ml_median_hr, fm_median_hr)<\/pre>\n<p><strong>10. Describe an explain the result.<\/strong><\/p>\n<h4>4.2.1.4 Quantiles and Percentiles<\/h4>\n<p>Order the sample\u00a0{x<sub>i<\/sub>}, then find\u00a0x<sub>p<\/sub>\u00a0so that it divides the data into two parts where:<\/p>\n<ul>\n<li>a fraction\u00a0p\u00a0of the data values are less than or equal to\u00a0x<sub>p<\/sub>\u00a0and<\/li>\n<li>the remaining fraction\u00a0(1\u2212p)\u00a0are greater than\u00a0x<sub>p<\/sub>.<\/li>\n<\/ul>\n<p>That value\u00a0x<sub>p<\/sub>\u00a0is the pth-quantile, or 100\u00d7pth percentile.<\/p>\n<p><strong>5-number summary<\/strong>:\u00a0x<sub>min<\/sub>,Q<sub>1<\/sub>,Q<sub>2<\/sub>,Q<sub>3<\/sub>,x<sub>max<\/sub>, where\u00a0Q<sub>1<\/sub>\u00a0is the 25\u00d7pth percentile,\u00a0Q<sub>2<\/sub>\u00a0is the 50\u00d7pth percentile and\u00a0Q<sub>3<\/sub>\u00a0is the 75\u00d7pth percentile.<\/p>\n<h4>4.2.1.5 Data distributions<\/h4>\n<p>We can have a look at the data distribution, which describes how often each value appears (i.e., what is its frequency).<\/p>\n<p>The most common representation of a distribution is a <em>histogram<\/em>, which is a graph that shows the frequency of each value. Let us show the age of working men and women separately.<\/p>\n<pre>import matplotlib.pyplot as plt\nml_age=ml['age']\nml_age.hist(histtype='stepfilled', bins=20)<\/pre>\n<p><strong>10. Show the graphics and an explain the result.<\/strong><\/p>\n<pre>fm_age=fm['age']\nfm_age.hist(histtype='stepfilled', bins=10)\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('Female samples',fontsize=15)\nplt.show()<\/pre>\n<p><strong>11. Show the graphics and an explain the result.<\/strong><\/p>\n<p>We can normalize the frequencies of the histogram by dividing\/normalizing by <em>n<\/em>, the number of samples. The normalized histogram is called the <em>Probability Mass Function <\/em>(PMF).<\/p>\n<pre>import seaborn as sns\nfm_age.hist(histtype='stepfilled', alpha=.5, bins=20)   # default number of bins = 10\nml_age.hist(histtype='stepfilled', alpha=.5, color=sns.desaturate(\"indianred\", .75), bins=10)\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('Samples',fontsize=15)\nplt.show()<\/pre>\n<p><strong>12. Show the graphics and an explain the result.<\/strong><\/p>\n<p>The <em>Cumulative Distribution Function <\/em>(CDF), or just distribution function, describes the probability that a real-valued random variable <em>X <\/em>with a given proba- bility distribution will be found to have a value less than or equal to <em>x<\/em>. Let us show the CDF of age distribution for both men and women.<\/p>\n<pre>fm_age.hist(histtype='stepfilled', alpha=.5, bins=20)   # default number of bins = 10\nml_age.hist(histtype='stepfilled', alpha=.5, color=sns.desaturate(\"indianred\", .75), bins=10)\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('PMF',fontsize=15)\nplt.show()<\/pre>\n<p><strong>13. Show the graphics and an explain the result.<\/strong><\/p>\n<h3>4.2.2 Data distributions<\/h3>\n<p>Summarizing can be dangerous: very different data can be described by the same statistics. It must be validated by inspecting the data.<\/p>\n<p>We can look at the\u00a0<strong>data distribution<\/strong>, which describes how often (frequency) each value appears.<\/p>\n<p>We can normalize the frequencies of the histogram by dividing\/normalizing by\u00a0n, the number of samples. The normalized histogram is called\u00a0<strong>Probability Mass Function (PMF)<\/strong>.<\/p>\n<p>Let&#8217;s visualize and compare the MPF of male and female age in our example:<\/p>\n<pre>ml_age.hist(histtype='stepfilled', bins=20)\n\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('Probability',fontsize=15)\nplt.show()\n<\/pre>\n<p><strong>14. Show the graphics and an explain the result.<\/strong><\/p>\n<pre>fm_age.hist(histtype='stepfilled', bins=20)\n\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('Probability',fontsize=15)\nplt.show()<\/pre>\n<p><strong>15. Show the graphics and an explain the result.<\/strong><\/p>\n<p>The\u00a0<strong>cumulative distribution function (CDF)<\/strong>, or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found to have a value less than or equal to x. For our example, the CDFs will be:<\/p>\n<pre>ml_age.hist(histtype='step', cumulative=True, linewidth=3.5, bins=20)\n\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('CDF',fontsize=15)\nplt.show()<\/pre>\n<p><strong>16. Show the graphics and an explain the result.<\/strong><\/p>\n<pre>fm_age.hist(histtype='step', cumulative=True, linewidth=3.5, bins=20)\n\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('CDF',fontsize=15)\nplt.show()<\/pre>\n<p><strong>17. Show the graphics and an explain the result.<\/strong><\/p>\n<pre>ml_age.hist(bins=10, histtype='stepfilled', alpha=.5)   # default number of bins = 10\nfm_age.hist(bins=10, histtype='stepfilled', alpha=.5, color=sns.desaturate(\"indianred\", .75))\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('Probability',fontsize=15)\nplt.show()<\/pre>\n<p><strong>18. Show the graphics and an explain the result.<\/strong><\/p>\n<pre>ml_age.hist(histtype='step', cumulative=True,  linewidth=3.5, bins=20)\nfm_age.hist(histtype='step', cumulative=True,  linewidth=3.5, bins=20, color=sns.desaturate(\"indianred\", .75))\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('CDF',fontsize=15)\nplt.show()<\/pre>\n<p><strong>19. Show the graphics and an explain the result.<\/strong><\/p>\n<pre>print (\"The mean sample difference is \", ml_age.mean() - fm_age.mean())<\/pre>\n<p><strong>20. Explain the result.<\/strong><\/p>\n<h3>4.2.3 Outliers<\/h3>\n<p><strong>Ouliers<\/strong>\u00a0are data samples with a value that is far from the central tendency.<\/p>\n<p>We can find outliers by:<\/p>\n<ul>\n<li>Computing samples that are\u00a0<em>far<\/em>\u00a0from the median.<\/li>\n<li>Computing samples whose value\u00a0<em>exceeds the mean<\/em>\u00a0by 2 or 3 standard deviations.<\/li>\n<\/ul>\n<p>This expression will return a series of boolean values that you can then index the series by:<\/p>\n<pre>df['age'].median()<\/pre>\n<h4>4.2.3.1 Outliner Treatment<\/h4>\n<p>For example, in our case, we are interested in the age statistics of men versus women with high incomes and we can see that in our dataset, the minimum age is 17 years and the maximum is 90 years. We can consider that some of these samples are due to errors or are not representable. Applying the domain knowledge, we focus on the median age (37, in our case) up to 72 and down to 22 years old, and we consider the rest as outliers.<\/p>\n<p>Let&#8217;s see how many outliers we can detect in our example:<\/p>\n<pre>len(df[(df.income == '&gt;50K\\n') &amp; (df['age'] &lt; df['age'].median() - 15)])<\/pre>\n<pre>len(df[(df.income == '&gt;50K\\n') &amp; (df['age'] &gt; df['age'].median() + 35)])<\/pre>\n<p>If we think that outliers correspond to errors, an option is to trim the data by discarting the highest and lowest values.<\/p>\n<pre>df2 = df.drop(df.index[(df.income=='&gt;50K\\n') &amp; (df['age']&gt;df['age'].median() +35) &amp; (df['age'] &gt; df['age'].median()-15)])\n\ndf2.shape<\/pre>\n<pre>ml1_age=ml1['age']\nfm1_age=fm1['age']<\/pre>\n<pre>ml2_age = ml1_age.drop(ml1_age.index[(ml1_age &gt;df['age'].median()+35) &amp; (ml1_age&gt;df['age'].median() - 15)])\n\nfm2_age = fm1_age.drop(fm1_age.index[(fm1_age &gt; df['age'].median()+35) &amp; (fm1_age &gt; df['age'].median()- 15)])<\/pre>\n<pre>mu2ml = ml2_age.mean()\nstd2ml = ml2_age.std()\nmd2ml = ml2_age.median()\n\n# Computing the mean, std, median, min and max for the high-income male population\n\nprint (\"Men statistics: Mean:\", mu2ml, \"Std:\", std2ml, \"Median:\", md2ml, \"Min:\", ml2_age.min(), \"Max:\",ml2_age.max())<\/pre>\n<pre>mu3ml = fm2_age.mean()\nstd3ml = fm2_age.std()\nmd3ml = fm2_age.median()\n\n# Computing the mean, std, median, min and max for the high-income female population\nprint (\"Women statistics: Mean:\", mu2ml, \"Std:\", std2ml, \"Median:\", md2ml, \"Min:\", fm2_age.min(), \"Max:\",fm2_age.max())<\/pre>\n<pre>print ('The mean difference with outliers is: %4.2f.'% (ml_age.mean() - fm_age.mean()))\nprint (\"The mean difference without outliers is: %4.2f.\"% (ml2_age.mean() - fm2_age.mean()))<\/pre>\n<p>Let us compare visually the age distributions before and after removing the outliers:<\/p>\n<pre>plt.figure(figsize=(13.4,5))\n\ndf.age[(df.income == '&gt;50K\\n')].plot(alpha=.25, color='blue')\ndf2.age[(df2.income == '&gt;50K\\n')].plot(alpha=.45,color='red')\n\nplt.ylabel('Age')\nplt.xlabel('Samples')<\/pre>\n<p>Let us see what is happening near the mode:<\/p>\n<pre><strong>import <\/strong>numpy <strong>as <\/strong>np\n\ncountx,divisionx = np.histogram(ml2_age)\ncounty,divisiony = np.histogram(fm2_age)<\/pre>\n<pre><strong>import <\/strong>matplotlib.pyplot <strong>as <\/strong>plt\n\nval = [(divisionx[i]+divisionx[i+1])\/2 <strong>for <\/strong>i <strong>in <\/strong>range(len(divisionx)-1)]\n\nplt.plot(val, countx-county,'o-')\nplt.title('Differences in promoting men vs. women')\nplt.xlabel('Age',fontsize=15)\nplt.ylabel('Differences',fontsize=15)\nplt.show()<\/pre>\n<p>There is still some evidence for our hypothesis!<\/p>\n<pre>print (\"Remember:\\n We have the following mean values for men, women and the difference:\\nOriginally: \", ml_age.mean(), fm_age.mean(),  ml_age.mean()- fm_age.mean()) # The difference between the mean values of male and female populations.)\nprint (\"For high-income: \", ml1_age.mean(), fm1_age.mean(), ml1_age.mean()- fm1_age.mean()) # The difference between the mean values of male and female populations.)\nprint (\"After cleaning: \", ml2_age.mean(), fm2_age.mean(), ml2_age.mean()- fm2_age.mean()) # The difference between the mean values of male and female populations.)\n\nprint (\"\\nThe same for the median:\")\nprint (ml_age.median(), fm_age.median(), ml_age.median()- fm_age.median()) # The difference between the mean values of male and female populations.)\nprint (ml1_age.median(), fm1_age.median(), ml1_age.median()- fm1_age.median()) # The difference between the mean values of male and female populations.)\nprint (ml2_age.median(), fm2_age.median(), ml2_age.median()- fm2_age.median()), # The difference between the mean values of male and female populations.)<\/pre>\n<h4>4.2.1.7 Measuring Asymmetry<\/h4>\n<p>For univariate data, the formula for <em>skewness <\/em>is a statistic that measures the asymmetry of the set of <em>n <\/em>data samples, <em>xi <\/em>:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-75 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/standardDeviation.png\" alt=\"\" width=\"178\" height=\"73\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/standardDeviation.png 358w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/standardDeviation-300x123.png 300w\" sizes=\"auto, (max-width: 178px) 100vw, 178px\" \/><\/p>\n<p>where \u03bcis the mean, \u03c3is the standard deviation, and <em>n<\/em>is the number of data points. The numerator is the mean squared deviation (or variance) and the denominator the mean cubed deviation.<\/p>\n<p>Negative deviation indicates that the distribution \u201cskews left\u201d (it extends further to the left than to the right). One can easily see that the skewness for a normal distribution is zero, and any symmetric data must have a skewness of zero. Note that skewness can be affected by outliers! A simpler alternative is to look at the relationship between the mean \u03bcand the median \u03bc<sub>1\/2<\/sub>.<\/p>\n<pre>def skewness(x):\n    res=0\n    m=x.mean()\n    s=x.std()\n    for i in x:\n        res+=(i-m)*(i-m)*(i-m)\n    res\/=(len(x)*s*s*s)\n    return res\n\nprint (\"The skewness of the male population is:\", skewness(ml2_age))\nprint (\"The skewness of the female population is:\", skewness(fm2_age))<\/pre>\n<ol start=\"21\">\n<li><strong>Explain the result<\/strong><\/li>\n<\/ol>\n<p>The <strong>Pearson\u2019s median skewness coefficient <\/strong>is a more robust alternative to the skewness coefficient and is defined as follows:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-77 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/skewness.png\" alt=\"\" width=\"213\" height=\"49\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/skewness.png 304w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/skewness-300x69.png 300w\" sizes=\"auto, (max-width: 213px) 100vw, 213px\" \/><\/p>\n<p>There are many other definitions for skewness that will not be discussed here. In our case, if we check the Pearson\u2019s skewness coefficient for both men and women, we can see that the difference between them actually increases:<\/p>\n<pre>def pearson(x):\n    return 3*(x.mean()-x.median())\/x.std()\n\nprint (\"The Pearson's coefficient of the male population is:\", pearson(ml2_age))\nprint (\"The Pearson's coefficient of the female population is:\", pearson(fm2_age))<\/pre>\n<h4>4.2.1.8 Relative Risk<\/h4>\n<p>Let&#8217;s say that a person is &#8220;early&#8221; promoted if he\/she is promoted before the age of 41, &#8220;on time&#8221; if he\/she is promoted of age 41, 42, 43 or 44, and &#8220;late&#8221; promoted if he\/she is ascended to get income bigger than 50K after being 44 years old. Let us compute the probability of being early, on time and late promoted for men and women:<\/p>\n<pre>ml1 = df[(df.sex == 'Male')&amp;(df.income=='&gt;50K\\n')]\n\nml2 = ml1.drop(ml1.index[(ml1['age']&gt;df['age'].median() +35)&amp;(ml1['age']&gt; df['age'].median()- 15)])\n\nfm2 = fm1.drop(fm1.index[(fm1['age']&gt; df['age'].median() + 35)&amp; (fm1['age']&gt; df['age'].median() - 15)])\n\nprint (ml2.shape, fm2.shape)<\/pre>\n<pre>print (\"Men grouped in 3 categories:\")\nprint (\"Young:\",int(round(100*len(ml2_age[ml2_age&lt;41])\/float(len(ml2_age.index)))),\"%.\")\nprint (\"Elder:\", int(round(100*len(ml2_age[ml2_age &gt;44])\/float(len(ml2_age.index)))),\"%.\")\nprint (\"Average age:\", int(round(100*len(ml2_age[(ml2_age&gt;40) &amp; (ml2_age&lt; 45)])\/float(len(ml2_age.index)))),\"%.\")<\/pre>\n<pre>print (\"Women grouped in 3 categories:\")\nprint (\"Young:\",int(round(100*len(fm2_age[fm2_age &lt;41])\/float(len(fm2_age.index)))),\"%.\")\nprint (\"Elder:\", int(round(100*len(fm2_age[fm2_age &gt;44])\/float(len(fm2_age.index)))),\"%.\")\nprint (\"Average age:\", int(round(100*len(fm2_age[(fm2_age&gt;40) &amp; (fm2_age&lt; 45)])\/float(len(fm2_age.index)))),\"%.\")<\/pre>\n<p>The\u00a0<strong>relative risk<\/strong>\u00a0is the ratio of two probabilities. In order to get the relative risk \u00a0of early promotion, we need to consider the fraction of both probabilities.<\/p>\n<pre>print (\"The male mean:\", ml2_age.mean())\nprint (\"The female mean:\", fm2_age.mean())<\/pre>\n<pre>ml2_young = len(ml2_age[(ml2_age&lt;41)])\/float(len(ml2_age.index))\nfm2_young  = len(fm2_age[(fm2_age&lt;41)])\/float(len(fm2_age.index))\nprint (\"The relative risk of female early promotion is: \", 100*(1-ml2_young\/fm2_young))<\/pre>\n<pre>ml2_elder = len(ml2_age[(ml2_age&gt;44)])\/float(len(ml2_age.index))\nfm2_elder  = len(fm2_age[(fm2_age&gt;44)])\/float(len(fm2_age.index))\nprint (\"The relative risk of male late promotion is: \", 100*ml2_elder\/fm2_elder)<\/pre>\n<h4>Discussion<\/h4>\n<p>After exploring the data, we obtained some apparent effects that support our initial assumptions. For example, the mean age for men in our dataset is 39.4 years; while for women, is 36.8 years. When analyzing the high-income salaries, the mean age for men increased to 44.6 years; while for women, increased to 42.1 years. When the data were cleaned from outliers, we obtained mean age for high-income men: 44.3, and for women: 41.8. Moreover, histograms and other statistics show the skewness of the data and the fact that women used to be promoted a little bit earlier than men, in general.<\/p>\n<h3>4.2.5 Continuous Distribution<\/h3>\n<p>The distributions we have considered up to now are based on empirical observations and thus are called <em>empirical distributions<\/em>. As an alternative, we may be interested in considering distributions that are defined by a continuous function and are called <em>continuous distributions<\/em>.<\/p>\n<h4>4.2.5.1 Exponential distribution<\/h4>\n<p>The CDF of the exponential distribution is:<\/p>\n<p style=\"text-align: center;\">CDF(x)=1\u2212exp<sup>\u2212<\/sup><sup>\u03bb<\/sup><sup>x<\/sup><\/p>\n<p>And its PDF is:<\/p>\n<p style=\"text-align: center;\">PDF(x)=\u03bbexp<sup>\u2212<\/sup><sup>\u03bb<\/sup><sup>x<\/sup><\/p>\n<p>The parameter\u00a0\u03bb\u00a0determines the shape of the distribution, the mean of the distribution is\u00a01\/\u03bb\u00a0and its variance is\u00a01\/\u03bb<sup>2<\/sup>. The median is\u00a0ln(2)\/\u03bb.<\/p>\n<pre>l = 3\nx=np.arange(0,2.5,0.1)\ny= 1- np.exp(-l*x)\n\nplt.plot(x,y,'-')\nplt.title('Exponential CDF: $\\lambda$ =%.2f'% l ,fontsize=15)\nplt.xlabel('x',fontsize=15)\nplt.ylabel('CDF',fontsize=15)\nplt.show()<\/pre>\n<pre>from __future__ import division\nimport scipy.stats as stats\n\nl = 3\nx=np.arange(0,2.5,0.1)\ny= l * np.exp(-l*x)\n\nplt.plot(x,y,'-')\nplt.title('Exponential PDF: $\\lambda$ =%.2f'% l, fontsize=15)\nplt.xlabel('x', fontsize=15)\nplt.ylabel('PDF', fontsize=15)\nplt.show()<\/pre>\n<p><strong>\u00a0<\/strong>There are a lot of real world events that can be described with this distribution.<\/p>\n<ul>\n<li>The time until a radioactive particle decays,<\/li>\n<li>The time it takes before your next telephone call,<\/li>\n<li>The time until default (on payment to company debt holders) in reduced form credit risk modeling.<\/li>\n<\/ul>\n<p>The random variable\u00a0X\u00a0of the life lengths of some batteries is associated with a probability density function of the form:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-78 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/PDF.png\" alt=\"\" width=\"196\" height=\"57\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/PDF.png 370w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/PDF-300x88.png 300w\" sizes=\"auto, (max-width: 196px) 100vw, 196px\" \/><\/p>\n<pre>l = 0.25\n\nx=np.arange(0,25,0.1)\ny= l * np.exp(-l*x)\n\nplt.plot(x,y,'-')\nplt.title('Exponential: $\\lambda$ =%.2f' %l ,fontsize=15)\nplt.xlabel('x',fontsize=15)\nplt.ylabel('PDF',fontsize=15)\nplt.show()<\/pre>\n<h4><span lang=\"EN-US\">4.2.5.2 Normal distribution<\/span><\/h4>\n<p><span lang=\"EN-US\">The<span class=\"apple-converted-space\">\u00a0<\/span><strong>normal, or Gaussian distribution<\/strong><span class=\"apple-converted-space\">\u00a0<\/span>is the most used one because it describes a lot of phenomena and because it is amenable for analysis.<span class=\"apple-converted-space\">\u00a0<\/span><\/span><\/p>\n<p><span lang=\"EN-US\">Its CDF has no closed-form expression and its more common representation is the PDF:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-79 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/gaussian.png\" alt=\"\" width=\"251\" height=\"56\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/gaussian.png 447w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/gaussian-300x67.png 300w\" sizes=\"auto, (max-width: 251px) 100vw, 251px\" \/><\/p>\n<pre>u=6 # mean\ns=2 # standard deviation\n\nx=np.arange(0,15,0.1)\n\ny=(1\/(np.sqrt(2*np.pi*s*s)))*np.exp(-(((x-u)**2)\/(2*s*s)))\n\nplt.plot(x,y,'-')\nplt.title('Gaussian PDF: $\\mu$=%.1f, $\\sigma$=%.1f'%(u,s),fontsize=15)\nplt.xlabel('x',fontsize=15)\nplt.ylabel('Probability density',fontsize=15)\nplt.show()<\/pre>\n<p>Examples:<\/p>\n<ul>\n<li>Measures of size of living tissue (length, height, skin area, weight);<\/li>\n<li>The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth; presumably the thickness of tree bark also falls under this category;<\/li>\n<li>Certain physiological measurements, such as blood pressure of adult humans.<\/li>\n<\/ul>\n<h4>4.2.5.3 Central limit theorem<\/h4>\n<p>The normal distribution is also important, because it is involved in the Central Limit Theorem:<\/p>\n<p>Take the mean of\u00a0n\u00a0random samples from ANY arbitrary distribution with a\u00a0well-defined\u00a0standard deviation\u00a0\u03c3\u00a0and mean\u00a0\u03bc. As\u00a0n\u00a0gets bigger the\u00a0<strong>distribution of the sample mean<\/strong>\u00a0will always converge to a Gaussian (normal) distribution with mean\u00a0\u03bc\u00a0and standard deviation\u00a0\u03c3n\u221a.<\/p>\n<p>Colloquially speaking, the theorem states the distribution of an average tends to be normal, even when the distribution from which the average is computed is decidedly non-normal. This explains the ubiquity of the Gaussian distribution in science and statistics.<\/p>\n<h4>Example: Uniform Distribution<\/h4>\n<p>The uniform distribution is obviously non-normal. Let us call it the\u00a0parent\u00a0distribution.<\/p>\n<p>To compute an average, two samples are drawn (n=2), at random, from the parent distribution and averaged. Then another sample of two is drawn and another value of the average computed. This process is repeated, over and over, and averages of two are computed.<\/p>\n<p>Repeatedly taking more elements (n=3,4&#8230;) from the parent distribution, and computing the averages, produces a normal probability density.<\/p>\n<pre>fig, ax = plt.subplots(1, 4, sharey=True, squeeze=True, figsize=(14, 5))\nx = np.linspace(0, 1, 100)\nfor i in range(4):\n    f = np.mean(np.random.random((10000, i+1)), 1)\n    m, s = np.mean(f), np.std(f, ddof=1)\n    fn = (1\/(s*np.sqrt(2*np.pi)))*np.exp(-(x-m)**2\/(2*s**2))  # normal pdf            \n    ax[i].hist(f, 40, color=[0, 0.2, .8, .6]) \n    ax[i].set_title('n=%d' %(i+1))\n    ax[i].plot(x, fn, color=[1, 0, 0, .6], linewidth=5)\nplt.suptitle('Demonstration of the central limit theorem for a uniform distribution', y=1.05)\nplt.show()<\/pre>\n<h3>4.2.6\u00a0\u00a0\u00a0\u00a0\u00a0 Kernel Density<\/h3>\n<p>In many real problems, we may not be interested in the parameters of a particular distribution of data, but just a continuous representation of the data. In this case, we should estimate the distribution non-parametrically (i.e., making no assumptions about the form of the underlying distribution) using kernel density estimation.<\/p>\n<pre>from scipy.stats.distributions import norm\n\n# Some random data\ny = np.random.random(15) * 10\nx = np.linspace(0, 10, 100)\n\nx1 = np.random.normal(-1, 2, 15) # parameters: (loc=0.0, scale=1.0, size=None)\nx2 = np.random.normal(6, 3, 10)\ny = np.r_[x1, x2] # r_ Translates slice objects to concatenation along the first axis.\nx = np.linspace(min(y), max(y), 100)\n\n# Smoothing parameter\ns = 0.4\n\n# Calculate the kernels\nkernels = np.transpose([norm.pdf(x, yi, s) for yi in y])\n\nplt.plot(x, kernels, 'k:')\nplt.plot(x, kernels.sum(1), 'r')\nplt.plot(y, np.zeros(len(y)), 'go', ms=10)<\/pre>\n<p>Let us imagine that we have a set of data measurements without knowing their distribution and we need to estimate the continuous representation of their distribution. In this case, we can consider a Gaussian kernel to generate the density around the data. Let us consider a set of random data generated by a bimodal normal distribution. If we consider a Gaussian kernel around the data, the sum of those kernels can give us a continuous function that when normalized would approximate the density of the distribution:<\/p>\n<pre><strong>from <\/strong>scipy.stats <strong>import <\/strong>kde\n\nx1 = np.random.normal(-1, 0.5, 15)\n\n# parameters: (loc=0.0, scale=1.0, size=None)\n\nx2 = np.random.normal(6, 1, 10)\ny = np.r_[x1, x2]\n\n# r_ Translates slice objects to concatenation along the first axis.\n\nx = np.linspace(min(y), max(y), 100)\ns = 0.4   # Smoothing parameter\n\nkernels = np.transpose([norm.pdf(x, yi, s) for yi in y])\n\n# Calculate the kernels\ndensity = kde.gaussian_kde(y)\n\nplt.plot(x, kernels, 'k:')\nplt.plot(x, kernels.sum(1), 'r')\nplt.plot(y, np.zeros(len(y)), 'bo', ms=10)<\/pre>\n<ol start=\"21\">\n<li><strong>What does the figure shows?<\/strong><\/li>\n<\/ol>\n<pre>xgrid = np.linspace(x.min(), x.max(), 200)\nplt.hist(y, bins=28)\nplt.plot(xgrid, density(xgrid), 'r-')<\/pre>\n<p>SciPy implements a Gaussian KDE that automatically chooses an appropriate bandwidth. Let&#8217;s create a bi-modal distribution of data that is not easily summarized by a parametric distribution:<\/p>\n<pre># Create a bi-modal distribution with a mixture of Normals.\n\nx1 = np.random.normal(-1, 2, 15) # parameters: (loc=0.0, scale=1.0, size=None)\nx2 = np.random.normal(6, 3, 10)\n\n# Append by row\nx = np.r_[x1, x2]\n\n# r_ Translates slice objects to concatenation along the first axis.\nplt.hist(x, bins=18)<\/pre>\n<p>In fact, the library SciPy3 implements a Gaussian kernel density estimation that automatically chooses the appropriate bandwidth parameter for the kernel. Thus, the final construction of the density estimate will be obtained by:<\/p>\n<pre>density = kde.gaussian_kde(x)\nxgrid = np.linspace(x.min(), x.max(), 200)\nplt.hist(x, bins=18)\nplt.plot(xgrid, density(xgrid), 'r-')<\/pre>\n<h3>4.2\u00a0\u00a0\u00a0\u00a0 Estimation<\/h3>\n<p>An important aspect when working with statistical data is being able to use estimates to approximate the values of unknown parameters of the dataset. In this section, we will review different kinds of estimators (estimated mean, variance, standard score, etc.).<\/p>\n<pre>x = np.random.normal(0.0, 1.0, 10000)\na = plt.hist(x,50)<\/pre>\n<p><strong>Definition:<\/strong>\u00a0<em>Estimation<\/em>\u00a0is the process of inferring the parameters (e.g. mean) of a distribution from a statistic of samples drown from a population.<\/p>\n<p>For example: What is the estimated mean\u00a0\u03bc\u0302\u00a0of the following normal data?<\/p>\n<p>We can use our definition of empirical mean:<\/p>\n<pre>print ('The empirical mean of the sample is ', x.mean())<\/pre>\n<h3>4.3.1 Sample and Estimated Mean, Variance and Standard Scores<\/h3>\n<p>In continuation, we will deal with point estimators that are single numerical estimates of parameters of a population.<\/p>\n<h4>4.3.1.1 Mean<\/h4>\n<p>Let us assume that we know that our data are coming from a normal distribution and the random samples drawn are as follows:<\/p>\n<p style=\"text-align: center;\">{0.33, \u22121.76, 2.34, 0.56, 0.89}.<\/p>\n<p>The question is can we guess the mean \u03bc of the distribution? One approximation is given by the sample mean, x \u0304. This process is called estimation and the statistic (e.g., the sample mean) is called an estimator. In our case, the sample mean is 0.472, and it seems a logical choice to represent the mean of the distribution. It is not so evident if we add a sample with a value of \u2212465. In this case, the sample mean will be \u221277.11, which does not look like the mean of the distribution. The reason is due to the fact that the last value seems to be an outlier compared to the rest of the sample. In order to avoid this effect, we can try first to remove outliers and then to estimate the mean; or we can use the sample median as an estimator of the mean of the distribution. If there are no outliers, the sample mean x \u0304 minimizes the following mean squared error:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-81 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/MSE.png\" alt=\"\" width=\"204\" height=\"65\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/MSE.png 364w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/MSE-300x96.png 300w\" sizes=\"auto, (max-width: 204px) 100vw, 204px\" \/><\/p>\n<p>where n is the number of times we estimate the mean. Let us compute the MSE of a set of random data:<\/p>\n<pre>NTs=200\nmu=0.0\nvar=1.0\nerr = 0.0\nNPs=1000\nfor i in range(NTs):\n    x = np.random.normal(mu, var, NPs)\n    err += (x.mean()-mu)**2\n\nprint ('MSE: ', err\/NTs)<\/pre>\n<ol start=\"21\">\n<li><strong>What do you obtained as result?<\/strong><\/li>\n<\/ol>\n<h4>4.3.1.2 Variance<\/h4>\n<p>We can also estimate the variance with:<\/p>\n<p style=\"text-align: center;\">\u03c3\u0302\u00a0<sup>2<\/sup>=1n\u2211<sub>i<\/sub>(x<sub>i<\/sub>\u2212\u03bc)<sup>2<\/sup><\/p>\n<p>This estimator works for large samples, but it is biased for small samples. We can use this one:<\/p>\n<p style=\"text-align: center;\">\u03c3\u0302\u00a0<sup>2<\/sup><sup>n<\/sup><sup>\u2212<\/sup><sup>1<\/sup>=1n\u22121\u2211<sub>i<\/sub>(x<sub>i<\/sub>\u2212\u03bc)<sup>2<\/sup><\/p>\n<h4>4.3.1.3 Standard scores<\/h4>\n<p style=\"text-align: center;\">z<sub>i<\/sub>=x<sub>i<\/sub>\u2212\u03bc\/\u03c3<\/p>\n<p>This measure is dimensionless and its distribution has mean 0 and variance 1.<\/p>\n<p>It inherits the &#8220;shape&#8221; of\u00a0X: if it is normally distributed, so is\u00a0Z. If\u00a0X\u00a0is skewed, so is\u00a0Z.<\/p>\n<h4>4.3.1.4 Covariance<\/h4>\n<p><strong>Covariance<\/strong>\u00a0is a measure of the tendency of two variables to vary together.<\/p>\n<p>If we have two series\u00a0X\u00a0and\u00a0Y\u00a0with\u00a0X={x<sub>i<\/sub>}\u00a0and\u00a0Y={y<sub>i<\/sub>}, and they vary together, their deviations\u00a0x<sub>i<\/sub>\u2212\u03bc<sub>X<\/sub>\u00a0and\u00a0y<sub>i<\/sub>\u2212\u03bc<sub>Y<\/sub>\u00a0tend to have the same sign.<\/p>\n<p>If we multiply them together, the product is positive, when the deviations have the same sign, and negative, when they have the opposite sign. So, adding up the products gives a measure of the tendency to vary together.<\/p>\n<p>Covariance is the mean of the products:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-82 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/cov.png\" alt=\"\" width=\"287\" height=\"63\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/cov.png 414w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/cov-300x66.png 300w\" sizes=\"auto, (max-width: 287px) 100vw, 287px\" \/><\/p>\n<p>where\u00a0n\u00a0is the length of the two series.<\/p>\n<p>It is a measure that is difficult to interpret.<\/p>\n<pre>def Cov(X, Y):\n    def _get_dvis(V):\n        return [v - np.mean(V) for v in V]\n    dxis = _get_dvis(X)\n    dyis = _get_dvis(Y)\n    return np.sum([x * y for x, y in zip(dxis, dyis)])\/len(X)\n\n\nX = [5, -1, 3.3, 2.7, 12.2]\nX= np.array(X)\nY = [10, 12, 8, 9, 11]\n\nprint (\"Cov(X, X) = %.2f\" % Cov(X, X))\nprint (\"Var(X) = %.2f\" % np.var(X))\n\nprint (\"Cov(X, Y) = %.2f\" % Cov(X, Y))<\/pre>\n<p>Let us create some examples of positive and negative correlations like those showing the relations of stock market with respect to the economic growth or the gasoline prices with respect to the world oil production:<\/p>\n<pre>MAXN=100\nMAXN=40\n\nX=np.array([[1,9],[3, 2], [5,3],[5.5,4],[6,4],[6.5,4],[7,3.5],[7.5,3.8],[8,4],\n[8.5,4],[9,4.5],[9.5,7],[10,9],[10.5,11],[11,11.5],[11.5,12],[12,12],[12.5,12],[13,10]])<\/pre>\n<pre>plt.subplot(1,2,1)\nplt.scatter(X[:,0],X[:,1],color='b',s=120, linewidths=2,zorder=10)\nplt.xlabel('Economic growth(T)',fontsize=15)\nplt.ylabel('Stock market returns(T)',fontsize=15)\nplt.gcf().set_size_inches((20,6))<\/pre>\n<pre>X=np.array([[1,8],[2, 7], [3,6],[4,8],[5,8],[6,7],[7,7],[8,5],[9,5],[10,6],[11,4],[12,5],[13,3],[14,2],[15,2],[16,1]])\n\nplt.subplot(1,2,1)\nplt.scatter(X[:,0],X[:,1],color='b',s=120, linewidths=2,zorder=10)\nplt.xlabel('World Oil Production(T)',fontsize=15)\nplt.ylabel('Gasoline prices(T)',fontsize=15)\nplt.gcf().set_size_inches((20,6))<\/pre>\n<h4>4.3.1.5 Pearson\u2019s correlation<\/h4>\n<p>Shall we consider the variance? An alternative is to divide the deviations by\u00a0\u03c3, which yields standard scores, and compute the product of standard scores:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-83 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/pearsonpi.png\" alt=\"\" width=\"207\" height=\"57\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/pearsonpi.png 316w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/pearsonpi-300x83.png 300w\" sizes=\"auto, (max-width: 207px) 100vw, 207px\" \/><\/p>\n<p>The mean of these products is:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-84 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/ropearson.png\" alt=\"\" width=\"272\" height=\"54\" srcset=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/ropearson.png 449w, http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/ropearson-300x59.png 300w\" sizes=\"auto, (max-width: 272px) 100vw, 272px\" \/><\/p>\n<p>Or we can rewrite\u00a0\u03c1\u00a0by factoring out\u00a0\u03c3X\u00a0and\u00a0\u03c3Y:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-85 aligncenter\" src=\"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-content\/uploads\/sites\/42\/2018\/09\/rocov.png\" alt=\"\" width=\"171\" height=\"64\" \/><\/p>\n<pre>def Corr(X, Y):\n    assert len(X) == len(Y)\n    return Cov(X, Y) \/ np.prod([np.std(V) for V in [X, Y]])\n\nprint (\"Corr(X, X) = %.5f\" % Corr(X, X))\n\nY=np.random.random(len(X))\n\nprint (\"Corr(X, Y) = %.5f\" % Corr(X, Y))<\/pre>\n<p>When\u00a0\u03c1=0, we cannot say that there is no relationship between the variables!<\/p>\n<p>Pearson&#8217;s coefficient only measures\u00a0<strong>linear<\/strong>\u00a0correlations!<\/p>\n<h4>4.3.2.6 Spearsman\u2019s rank correlation<\/h4>\n<p>Pearson\u2019s correlation works well if the relationship between variables is linear and if the variables are roughly normal. But it is not robust in the presence of\u00a0<strong>outliers<\/strong>.<\/p>\n<p>Spearman\u2019s rank correlation is an alternative that mitigates the effect of outliers and skewed distributions. To compute Spearman\u2019s correlation, we have to compute the rank of each value, which is its index in the sorted sample.<\/p>\n<p>For example, in the sample {7, 1, 2, 5} the rank of the value 5 is 3, because it appears third if we sort the elements.<\/p>\n<p>Then, we compute the Pearson\u2019s correlation,\u00a0<strong>but for the ranks<\/strong>.<\/p>\n<pre>def list2rank(l):\n    #l is a list of numbers\n    # returns a list of 1-based index; mean when multiple instances\n    return [np.mean([i+1 for i, sorted_el in enumerate(sorted(l)) if sorted_el == el]) for el in l]\n\nl = [7, 1, 2, 5]\nprint (\"ranks: \", list2rank(l))\n\ndef spearmanRank(X, Y):\n    # X and Y are same-length lists\n    print (list2rank(X) )\n    print (list2rank(Y))\n    return Corr(list2rank(X), list2rank(Y))\n\nX = [10, 20, 30, 40, 1000]\nY = [-70, -1000, -50, -10, -20]\nplt.plot(X,'ro')\nplt.plot(Y,'go')\n\nprint (\"Pearson rank coefficient: %.2f\" % Corr(X, Y))\nprint (\"Spearman rank coefficient: %.2f\" % spearmanRank(X, Y))<\/pre>\n<pre><strong style=\"color: #444444; font-family: 'Open Sans', Helvetica, Arial, sans-serif; font-size: 1rem;\">Exercise:<\/strong><span style=\"color: #444444; font-family: 'Open Sans', Helvetica, Arial, sans-serif; font-size: 1rem;\">\u00a0Obtain for the Anscombe's quartet [2] given in the figures bellow, the different estimators (mean, variance, covariance for each pair, Pearson's correlation and Spearman's rank correlation.<\/span><\/pre>\n<pre>X=np.array([[10.0, 8.04,10.0, 9.14, 10.0, 7.46, 8.0, 6.58],\n[8.0,6.95, 8.0, 8.14, 8.0, 6.77, 8.0, 5.76],\n[13.0,7.58,13.0,8.74,13.0,12.74,8.0,7.71],\n[9.0,8.81,9.0,8.77,9.0,7.11,8.0,8.84],\n[11.0,8.33,11.0,9.26,11.0,7.81,8.0,8.47],\n[14.0,9.96,14.0,8.10,14.0,8.84,8.0,7.04],\n[6.0,7.24,6.0,6.13,6.0,6.08,8.0,5.25],\n[4.0,4.26,4.0,3.10,4.0,5.39,19.0,12.50],\n[12.0,10.84,12.0,9.13,12.0,8.15,8.0,5.56],\n[7.0,4.82,7.0,7.26,7.0,6.42,8.0,7.91],\n[5.0,5.68,5.0,4.74,5.0,5.73,8.0,6.89]])<\/pre>\n<pre>plt.subplot(2,2,1)\nplt.scatter(X[:,0],X[:,1],color='r',s=120, linewidths=2,zorder=10)\nplt.xlabel('x1',fontsize=15)\nplt.ylabel('y1',fontsize=15)<\/pre>\n<pre>plt.subplot(2,2,2)\nplt.scatter(X[:,2],X[:,3],color='r',s=120, linewidths=2,zorder=10)\nplt.xlabel('x1',fontsize=15)\nplt.ylabel('y1',fontsize=15)\nplt.subplot(2,2,3)\nplt.scatter(X[:,4],X[:,5],color='r',s=120, linewidths=2,zorder=10)\nplt.xlabel('x1',fontsize=15)\nplt.ylabel('y1',fontsize=15)<\/pre>\n<pre>plt.subplot(2,2,4)\nplt.scatter(X[:,6],X[:,7],color='r',s=120, linewidths=2,zorder=10)\nplt.xlabel('x1',fontsize=15)\nplt.ylabel('y1',fontsize=15)\nplt.gcf().set_size_inches((10,10))<\/pre>\n<h1>Main references<\/h1>\n<p>[1]\u00a0<em>Think Stats: Probability and Statistics for Programmers<\/em>, by Allen B. Downey, published by O&#8217;Reilly Media.\u00a0<a href=\"http:\/\/www.greenteapress.com\/thinkstats\/\">http:\/\/www.greenteapress.com\/thinkstats\/<\/a><\/p>\n<p>[2] Anscombe&#8217;s quartet,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Anscombe%27s_quartet\">https:\/\/en.wikipedia.org\/wiki\/Anscombe%27s_quartet<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>1.&nbsp;&nbsp;&nbsp;Objective Apply descriptive statistics to explore data collections and compute quantitative descriptions. Recall that descriptive statistics applies the concepts, measures, and terms that are used to describe the basic features of the samples in a study. Together with simple graphics, these measures form the basis of quantitative analysis of data. To describe the sample data [&hellip;]<\/p>\n","protected":false},"author":11,"featured_media":0,"parent":11,"menu_order":4,"comment_status":"closed","ping_status":"closed","template":"page-templates\/full-width.php","meta":{"footnotes":""},"class_list":["post-56","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/56","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/comments?post=56"}],"version-history":[{"count":30,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/56\/revisions"}],"predecessor-version":[{"id":444,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/56\/revisions\/444"}],"up":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/11"}],"wp:attachment":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/media?parent=56"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}