{"id":283,"date":"2019-01-18T05:41:20","date_gmt":"2019-01-18T05:41:20","guid":{"rendered":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/?page_id=283"},"modified":"2020-01-22T18:53:13","modified_gmt":"2020-01-22T18:53:13","slug":"statistical-inference","status":"publish","type":"page","link":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/statistical-inference\/","title":{"rendered":"Statistical Inference"},"content":{"rendered":"<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>In this notebook we will see how to make inferences about a population. To this end we will explore the relationship between sample parameters and population parameters, and we will propose some methods to assess the quality of parameter estimates obtained from a sample.<\/p>\n<h2 id=\"Data-description\">Data description<\/h2>\n<p>Let&#8217;s consider a dataset of accidents in Barcelona in 2013. This dataset can be downloaded from the OpenDataBCN website (<a href=\"http:\/\/opendata.bcn.cat\/\">http:\/\/opendata.bcn.cat\/<\/a>), Barcelona&#8217;s City Hall open data service. Each record in the dataset represents an accident by a series of features: weekday, hour, address, number of dead and injured people, etc. 
This dataset will represent our population: the set of all reported traffic accidents in Barcelona during 2013.<\/p>\n<p>In <a href=\"https:\/\/dieguico.cartodb.com\/viz\/50b06d8c-13ab-11e5-8619-0e4fddd5de28\/public_map\">https:\/\/dieguico.cartodb.com\/viz\/50b06d8c-13ab-11e5-8619-0e4fddd5de28\/public_map<\/a> you can visualize a map of accidents in the city of Barcelona by hour of day, and by day of week.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"kn\">import<\/span> <span class=\"nn\">matplotlib.pylab<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">plt<\/span>\n<span class=\"kn\">from<\/span> <span class=\"nn\">matplotlib<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">cm<\/span>\n<span class=\"kn\">import<\/span> <span class=\"nn\">math<\/span>\n<span class=\"kn\">import<\/span> <span class=\"nn\">pandas<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">pd<\/span>\n<span class=\"kn\">import<\/span> <span class=\"nn\">numpy<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">np<\/span>\n<span class=\"kn\">import<\/span> <span class=\"nn\">random<\/span>\n\n<span class=\"o\">%<\/span><span class=\"k\">matplotlib<\/span> inline \n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">style<\/span><span class=\"o\">.<\/span><span class=\"n\">use<\/span><span class=\"p\">(<\/span><span class=\"s1\">'seaborn-whitegrid'<\/span><span class=\"p\">)<\/span>\n<span class=\"c1\"># plt.rc('text', usetex=False)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">rc<\/span><span class=\"p\">(<\/span><span class=\"s1\">'font'<\/span><span class=\"p\">,<\/span> <span class=\"n\">family<\/span><span class=\"o\">=<\/span><span 
class=\"s1\">'times'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">rc<\/span><span class=\"p\">(<\/span><span class=\"s1\">'xtick'<\/span><span class=\"p\">,<\/span> <span class=\"n\">labelsize<\/span><span class=\"o\">=<\/span><span class=\"mi\">10<\/span><span class=\"p\">)<\/span> \n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">rc<\/span><span class=\"p\">(<\/span><span class=\"s1\">'ytick'<\/span><span class=\"p\">,<\/span> <span class=\"n\">labelsize<\/span><span class=\"o\">=<\/span><span class=\"mi\">10<\/span><span class=\"p\">)<\/span> \n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">rc<\/span><span class=\"p\">(<\/span><span class=\"s1\">'font'<\/span><span class=\"p\">,<\/span> <span class=\"n\">size<\/span><span class=\"o\">=<\/span><span class=\"mi\">12<\/span><span class=\"p\">)<\/span> \n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">data<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"o\">.<\/span><span class=\"n\">read_csv<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"files\/ch04\/ACCIDENTS_GU_BCN_2013.csv\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">encoding<\/span><span class=\"o\">=<\/span><span class=\"s1\">'latin-1'<\/span><span class=\"p\">)<\/span>\n<span class=\"k\">print<\/span> (<span class=\"n\">data<\/span><span class=\"o\">.<\/span><span class=\"n\">columns)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing 
rendered_html\">\n<p>We will create a new data column which is the date and a list with the number of accidents for every day of the year:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"c1\">#Create a new column which is the date<\/span>\n<span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Date'<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"s1\">'2013-'<\/span><span class=\"o\">+<\/span><span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Mes de any'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">apply<\/span><span class=\"p\">(<\/span><span class=\"k\">lambda<\/span> <span class=\"n\">x<\/span> <span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"p\">))<\/span> <span class=\"o\">+<\/span> <span class=\"s1\">'-'<\/span> <span class=\"o\">+<\/span>  <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Dia de mes'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">apply<\/span><span class=\"p\">(<\/span><span class=\"k\">lambda<\/span> <span class=\"n\">x<\/span> <span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Date'<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"o\">.<\/span><span class=\"n\">to_datetime<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Date'<\/span><span class=\"p\">])<\/span>\n<span 
class=\"n\">accidents<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"o\">.<\/span><span class=\"n\">groupby<\/span><span class=\"p\">([<\/span><span class=\"s1\">'Date'<\/span><span class=\"p\">])<\/span><span class=\"o\">.<\/span><span class=\"n\">size<\/span><span class=\"p\">()<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s2\">\"Mean:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accidents<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Let&#8217;s suppose that we are interested in describing the number of daily traffic accidents (<strong>accident rate<\/strong>) in the streets of Barcelona during 2013. In order to get a first idea of the data, we can plot the number of accidents for each day of 2013:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">fig<\/span><span class=\"p\">,<\/span> <span class=\"n\">ax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">subplots<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">figsize<\/span><span class=\"o\">=<\/span><span class=\"p\">(<\/span><span class=\"mi\">12<\/span><span class=\"p\">,<\/span> <span class=\"mi\">4<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">ylabel<\/span><span 
class=\"p\">(<\/span><span class=\"s1\">'Number of accidents'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">xlabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Day'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">plot<\/span><span class=\"p\">(<\/span><span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">365<\/span><span class=\"p\">),<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">(<\/span><span class=\"n\">accidents<\/span><span class=\"p\">),<\/span> <span class=\"s1\">'b-+'<\/span><span class=\"p\">,<\/span> <span class=\"n\">lw<\/span><span class=\"o\">=<\/span><span class=\"mf\">0.7<\/span><span class=\"p\">,<\/span> <span class=\"n\">alpha<\/span><span class=\"o\">=<\/span><span class=\"mf\">0.7<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">plot<\/span><span class=\"p\">(<\/span><span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">365<\/span><span class=\"p\">),<\/span> <span class=\"p\">[<\/span><span class=\"n\">accidents<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">()]<\/span><span class=\"o\">*<\/span><span class=\"mi\">365<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'r-'<\/span><span class=\"p\">,<\/span> <span class=\"n\">lw<\/span><span class=\"o\">=<\/span><span class=\"mf\">0.7<\/span><span class=\"p\">,<\/span> <span class=\"n\">alpha<\/span><span class=\"o\">=<\/span><span class=\"mf\">0.9<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">show<\/span><span 
class=\"p\">()<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Alternatively, we can plot the distribution of our variable of interest: the daily number of accidents.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">fig<\/span><span class=\"p\">,<\/span> <span class=\"n\">ax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">subplots<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">figsize<\/span><span class=\"o\">=<\/span><span class=\"p\">(<\/span><span class=\"mi\">12<\/span><span class=\"p\">,<\/span> <span class=\"mi\">3<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">ylabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Frequency'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">xlabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Number of accidents'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">hist<\/span><span class=\"p\">(<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">(<\/span><span class=\"n\">accidents<\/span><span class=\"p\">),<\/span> <span class=\"n\">bins<\/span><span class=\"o\">=<\/span><span class=\"mi\">20<\/span><span class=\"p\">)<\/span>\n<span 
class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">axvline<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"o\">=<\/span><span class=\"n\">accidents<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(),<\/span> <span class=\"n\">ymin<\/span><span class=\"o\">=<\/span><span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"n\">ymax<\/span><span class=\"o\">=<\/span><span class=\"mi\">40<\/span><span class=\"p\">,<\/span> <span class=\"n\">color<\/span><span class=\"o\">=<\/span><span class=\"p\">[<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">])<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">savefig<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"bootmean.png\"<\/span><span class=\"p\">,<\/span><span class=\"n\">dpi<\/span><span class=\"o\">=<\/span><span class=\"mi\">300<\/span><span class=\"p\">,<\/span> <span class=\"n\">bbox_inches<\/span><span class=\"o\">=<\/span><span class=\"s1\">'tight'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">show<\/span><span class=\"p\">()<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>If we have access to the whole <em>population<\/em>, the computation of the <strong>accident rate<\/strong> in 2013 is a simple operation: the total number of accidents divided by 365. 
To describe how much the daily counts vary around this value, we can also compute the standard deviation.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"k\">print<\/span> (<span class=\"s2\">\"Mean:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accidents<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(),<\/span> <span class=\"s2\">\"; STD:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accidents<\/span><span class=\"o\">.<\/span><span class=\"n\">std<\/span><span class=\"p\">())<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>But now, let&#8217;s suppose that we only have access to a limited part of the data (the <em>sample<\/em>): the number of accidents during <em>some days<\/em> of 2013. 
Can we still give an approximation (an <em>estimate<\/em>) of this population mean?<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"Variability-in-estimates.\">Variability in estimates.<\/h2>\n<p>Estimates generally vary from one sample to another, and this sampling variation means that our estimate may be close to the parameter, but it will not be exactly equal to it.<\/p>\n<p>This can be easily checked by generating 10 different samples (each composed of 25% of the population) from our population and computing their accident rate estimates:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">df<\/span> <span class=\"o\">=<\/span> <span class=\"n\">accidents<\/span><span class=\"o\">.<\/span><span class=\"n\">to_frame<\/span><span class=\"p\">()<\/span>\n<span class=\"n\">m<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[]<\/span>\n\n<span class=\"k\">for<\/span> <span class=\"n\">i<\/span> <span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"mi\">10<\/span><span class=\"p\">):<\/span>\n    <span class=\"n\">df<\/span><span class=\"p\">[<\/span><span class=\"s1\">'for_testing'<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"bp\">False<\/span>\n    <span class=\"c1\"># get a 25% sample <\/span>\n    <span class=\"n\">sampled_ids<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">random<\/span><span class=\"o\">.<\/span><span class=\"n\">choice<\/span><span class=\"p\">(<\/span><span class=\"n\">df<\/span><span 
class=\"o\">.<\/span><span class=\"n\">index<\/span><span class=\"p\">,<\/span>\n                                   <span class=\"n\">size<\/span><span class=\"o\">=<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">int64<\/span><span class=\"p\">(<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">ceil<\/span><span class=\"p\">(<\/span><span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">index<\/span><span class=\"o\">.<\/span><span class=\"n\">size<\/span> <span class=\"o\">*<\/span> <span class=\"mf\">0.25<\/span><span class=\"p\">)),<\/span>\n                                   <span class=\"n\">replace<\/span><span class=\"o\">=<\/span><span class=\"bp\">False<\/span><span class=\"p\">)<\/span>\n    <span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">loc<\/span><span class=\"p\">[<\/span><span class=\"n\">sampled_ids<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'for_testing'<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"bp\">True<\/span>\n    <span class=\"n\">accidents_sample<\/span> <span class=\"o\">=<\/span> <span class=\"n\">df<\/span><span class=\"p\">[<\/span><span class=\"n\">df<\/span><span class=\"p\">[<\/span><span class=\"s1\">'for_testing'<\/span><span class=\"p\">]<\/span> <span class=\"o\">==<\/span> <span class=\"bp\">True<\/span><span class=\"p\">]<\/span>\n    <span class=\"n\">m<\/span><span class=\"o\">.<\/span><span class=\"n\">append<\/span><span class=\"p\">(<\/span><span class=\"n\">accidents_sample<\/span><span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\n    <span class=\"k\">print<\/span>  (<span class=\"s1\">'Sample '<\/span><span class=\"o\">+<\/span><span class=\"nb\">str<\/span><span class=\"p\">(<\/span><span class=\"n\">i<\/span><span class=\"p\">)<\/span><span 
class=\"o\">+<\/span><span class=\"s1\">': Mean'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'<\/span><span class=\"si\">%.2f<\/span><span class=\"s1\">'<\/span> <span class=\"o\">%<\/span> <span class=\"n\">accidents_sample<\/span><span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">fig<\/span><span class=\"p\">,<\/span> <span class=\"n\">ax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">subplots<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">figsize<\/span><span class=\"o\">=<\/span><span class=\"p\">(<\/span><span class=\"mi\">12<\/span><span class=\"p\">,<\/span> <span class=\"mi\">2<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">x<\/span> <span class=\"o\">=<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"mi\">10<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">step<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"p\">,<\/span><span class=\"n\">m<\/span><span class=\"p\">,<\/span> <span class=\"n\">where<\/span><span class=\"o\">=<\/span><span class=\"s1\">'mid'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">set_ylabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Mean'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span 
class=\"o\">.<\/span><span class=\"n\">set_xlabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Sample'<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Accident rate estimates can range from 24 accidents per day to 27 accidents per day, depending on the sample. How can we give a single value for the estimate?<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"Sampling-distribution-of-point-estimates\">Sampling distribution of point estimates<\/h3>\n<p>The most intuitive way to give a value for the estimate is simply to take the <em>sample mean<\/em>. The sample mean is a point estimate of the population mean. 
If we can only choose one value to estimate the population mean, this is our best guess.<\/p>\n<p>Let&#8217;s compute the sample means for a set of 10000 samples, each one composed of 200 days drawn with replacement:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">autumn<\/span><span class=\"p\">()<\/span>\n\n<span class=\"c1\"># population<\/span>\n<span class=\"n\">df<\/span> <span class=\"o\">=<\/span> <span class=\"n\">accidents<\/span><span class=\"o\">.<\/span><span class=\"n\">to_frame<\/span><span class=\"p\">()<\/span>    \n<span class=\"n\">N_test<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">10000<\/span>              \n<span class=\"n\">elements<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">200<\/span>             \n\n<span class=\"c1\"># mean array of samples<\/span>\n<span class=\"n\">means<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">]<\/span> <span class=\"o\">*<\/span> <span class=\"n\">N_test<\/span>             \n\n<span class=\"c1\"># sample generation<\/span>\n<span class=\"k\">for<\/span> <span class=\"n\">i<\/span> <span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"n\">N_test<\/span><span class=\"p\">):<\/span>          \n    <span class=\"n\">rows<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">random<\/span><span class=\"o\">.<\/span><span class=\"n\">choice<\/span><span class=\"p\">(<\/span><span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">index<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"p\">,<\/span> <span 
class=\"n\">elements<\/span><span class=\"p\">)<\/span>\n    <span class=\"n\">sampled_df<\/span> <span class=\"o\">=<\/span> <span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">loc<\/span><span class=\"p\">[<\/span><span class=\"n\">rows<\/span><span class=\"p\">]<\/span>\n    <span class=\"n\">means<\/span><span class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">sampled_df<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">()<\/span>\n    \n<span class=\"n\">fig<\/span><span class=\"p\">,<\/span> <span class=\"n\">ax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">subplots<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">figsize<\/span><span class=\"o\">=<\/span><span class=\"p\">(<\/span><span class=\"mi\">12<\/span><span class=\"p\">,<\/span><span class=\"mi\">3<\/span><span class=\"p\">))<\/span>\n\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">hist<\/span><span class=\"p\">(<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">(<\/span><span class=\"n\">means<\/span><span class=\"p\">),<\/span><span class=\"n\">bins<\/span><span class=\"o\">=<\/span><span class=\"mi\">50<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">ylabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Frequency'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">xlabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Sample mean value'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">axvline<\/span><span 
class=\"p\">(<\/span><span class=\"n\">x<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">(<\/span><span class=\"n\">means<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(),<\/span> \n           <span class=\"n\">ymin<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> \n           <span class=\"n\">ymax<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">700<\/span><span class=\"p\">,<\/span> \n           <span class=\"n\">color<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">])<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">savefig<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"empiricalmean.png\"<\/span><span class=\"p\">,<\/span><span class=\"n\">dpi<\/span><span class=\"o\">=<\/span><span class=\"mi\">300<\/span><span class=\"p\">,<\/span> <span class=\"n\">bbox_inches<\/span><span class=\"o\">=<\/span><span class=\"s1\">'tight'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">show<\/span><span class=\"p\">()<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">set_cmap<\/span><span class=\"p\">(<\/span><span class=\"n\">cmap<\/span><span class=\"o\">=<\/span><span class=\"n\">cm<\/span><span class=\"o\">.<\/span><span class=\"n\">Pastel2<\/span><span class=\"p\">)<\/span>\n\n<span class=\"k\">print<\/span> (<span class=\"s2\">\"Sample mean:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">(<\/span><span class=\"n\">means<\/span><span class=\"p\">)<\/span><span 
class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>This is the <strong>sampling distribution of the mean<\/strong>. From it we could estimate the most probable value of the mean and also its standard deviation, but in the real world we will not have access to this distribution, because we usually observe only a single sample!<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"Standard-error-of-the-mean\">Standard error of the mean<\/h3>\n<p>Now let&#8217;s suppose that we have only one sample of the population. As commented before, the mean estimate from that sample may be close, but it will not be exactly equal to our parameter of interest (that can only be computed if we have access to the full population). For this reason it is interesting to measure its variability with respect to the sampling process. 
To this end we can use the <em>standard error of the mean<\/em>.<\/p>\n<p>It can be mathematically shown that given <span id=\"MathJax-Element-1-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;n&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-1\" class=\"math\"><span id=\"MathJax-Span-2\" class=\"mrow\"><span id=\"MathJax-Span-3\" class=\"mi\">n<\/span><\/span><\/span><\/span> independent observations <span id=\"MathJax-Element-2-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mo fence=&quot;false&quot; stretchy=&quot;false&quot;&gt;{&lt;\/mo&gt;&lt;msub&gt;&lt;mi&gt;x&lt;\/mi&gt;&lt;mi&gt;i&lt;\/mi&gt;&lt;\/msub&gt;&lt;msub&gt;&lt;mo fence=&quot;false&quot; stretchy=&quot;false&quot;&gt;}&lt;\/mo&gt;&lt;mrow 
class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mi&gt;i&lt;\/mi&gt;&lt;mo&gt;=&lt;\/mo&gt;&lt;mn&gt;1&lt;\/mn&gt;&lt;mo&gt;,&lt;\/mo&gt;&lt;mo&gt;.&lt;\/mo&gt;&lt;mo&gt;.&lt;\/mo&gt;&lt;mo&gt;,&lt;\/mo&gt;&lt;mi&gt;n&lt;\/mi&gt;&lt;\/mrow&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-4\" class=\"math\"><span id=\"MathJax-Span-5\" class=\"mrow\"><span id=\"MathJax-Span-6\" class=\"mo\">{<\/span><span id=\"MathJax-Span-7\" class=\"msubsup\"><span id=\"MathJax-Span-8\" class=\"mi\">x<\/span><span id=\"MathJax-Span-9\" class=\"mi\">i<\/span><\/span><span id=\"MathJax-Span-10\" class=\"msubsup\"><span id=\"MathJax-Span-11\" class=\"mo\">}<\/span><span id=\"MathJax-Span-12\" class=\"texatom\"><span id=\"MathJax-Span-13\" class=\"mrow\"><span id=\"MathJax-Span-14\" class=\"mi\">i<\/span><span id=\"MathJax-Span-15\" class=\"mo\">=<\/span><span id=\"MathJax-Span-16\" class=\"mn\">1<\/span><span id=\"MathJax-Span-17\" class=\"mo\">,<\/span><span id=\"MathJax-Span-18\" class=\"mo\">.<\/span><span id=\"MathJax-Span-19\" class=\"mo\">.<\/span><span id=\"MathJax-Span-20\" class=\"mo\">,<\/span><span id=\"MathJax-Span-21\" class=\"mi\">n<\/span><\/span><\/span><\/span><\/span><\/span><\/span> from a population with a standard deviation <span id=\"MathJax-Element-3-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;&amp;#x03C3;&lt;\/mi&gt;&lt;mi&gt;x&lt;\/mi&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-22\" 
class=\"math\"><span id=\"MathJax-Span-23\" class=\"mrow\"><span id=\"MathJax-Span-24\" class=\"msubsup\"><span id=\"MathJax-Span-25\" class=\"mi\">\u03c3<\/span><span id=\"MathJax-Span-26\" class=\"mi\">x<\/span><\/span><\/span><\/span><\/span>, the standard deviation of the sample mean <span id=\"MathJax-Element-4-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;&amp;#x03C3;&lt;\/mi&gt;&lt;mrow class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mrow class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mover&gt;&lt;mi&gt;x&lt;\/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;&amp;#x00AF;&lt;\/mo&gt;&lt;\/mover&gt;&lt;\/mrow&gt;&lt;\/mrow&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-27\" class=\"math\"><span id=\"MathJax-Span-28\" class=\"mrow\"><span id=\"MathJax-Span-29\" class=\"msubsup\"><span id=\"MathJax-Span-30\" class=\"mi\">\u03c3<\/span><span id=\"MathJax-Span-31\" class=\"texatom\"><span id=\"MathJax-Span-32\" class=\"mrow\"><span id=\"MathJax-Span-33\" class=\"texatom\"><span id=\"MathJax-Span-34\" class=\"mrow\"><span id=\"MathJax-Span-35\" class=\"munderover\"><span id=\"MathJax-Span-36\" class=\"mi\">x<\/span><span id=\"MathJax-Span-37\" class=\"mo\">\u00af<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>, or <strong>standard error<\/strong> is:<\/p>\n<div class=\"MathJax_Display\"><span id=\"MathJax-Element-5-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: 
inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: center; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot; display=&quot;block&quot;&gt;&lt;mi&gt;S&lt;\/mi&gt;&lt;mi&gt;E&lt;\/mi&gt;&lt;mo&gt;=&lt;\/mo&gt;&lt;mfrac&gt;&lt;msub&gt;&lt;mi&gt;&amp;#x03C3;&lt;\/mi&gt;&lt;mrow class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mi&gt;x&lt;\/mi&gt;&lt;\/mrow&gt;&lt;\/msub&gt;&lt;msqrt&gt;&lt;mi&gt;n&lt;\/mi&gt;&lt;\/msqrt&gt;&lt;\/mfrac&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-38\" class=\"math\"><span id=\"MathJax-Span-39\" class=\"mrow\"><span id=\"MathJax-Span-40\" class=\"mi\">S<\/span><span id=\"MathJax-Span-41\" class=\"mi\">E<\/span><span id=\"MathJax-Span-42\" class=\"mo\">=<\/span><span id=\"MathJax-Span-43\" class=\"mfrac\"><span id=\"MathJax-Span-44\" class=\"msubsup\"><span id=\"MathJax-Span-45\" class=\"mi\">\u03c3<\/span><span id=\"MathJax-Span-46\" class=\"texatom\"><span id=\"MathJax-Span-47\" class=\"mrow\"><span id=\"MathJax-Span-48\" class=\"mi\">x<\/span><\/span><\/span><\/span><span id=\"MathJax-Span-49\" class=\"msqrt\"><span id=\"MathJax-Span-50\" class=\"mrow\"><span id=\"MathJax-Span-51\" class=\"mi\">n<\/span><\/span>\u203e\u221a<\/span><\/span><\/span><\/span><\/span><\/div>\n<p>This allows us <strong>to estimate the standard deviation of the sample mean<\/strong> even if we cannot perform the simulation process (e.g. because we have no access to the population). 
Usually, <span id=\"MathJax-Element-6-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;&amp;#x03C3;&lt;\/mi&gt;&lt;mi&gt;x&lt;\/mi&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-52\" class=\"math\"><span id=\"MathJax-Span-53\" class=\"mrow\"><span id=\"MathJax-Span-54\" class=\"msubsup\"><span id=\"MathJax-Span-55\" class=\"mi\">\u03c3<\/span><span id=\"MathJax-Span-56\" class=\"mi\">x<\/span><\/span><\/span><\/span><\/span> is not known and is replaced by its empirical estimate (which is sufficiently good if <span id=\"MathJax-Element-7-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;merror&gt;&lt;mtext&gt;n&amp;gt;30&lt;\/mtext&gt;&lt;\/merror&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-57\" class=\"math\" aria-hidden=\"true\"><span id=\"MathJax-Span-58\" class=\"noError\">n&gt;30<\/span><\/span><\/span> and the population distribution is not 
skewed):<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">rows<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">random<\/span><span class=\"o\">.<\/span><span class=\"n\">choice<\/span><span class=\"p\">(<\/span><span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">index<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"p\">,<\/span> <span class=\"mi\">200<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">sampled_df<\/span> <span class=\"o\">=<\/span> <span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">loc<\/span><span class=\"p\">[<\/span><span class=\"n\">rows<\/span><span class=\"p\">]<\/span>\n<span class=\"n\">est_sigma_mean<\/span> <span class=\"o\">=<\/span> <span class=\"n\">sampled_df<\/span><span class=\"o\">.<\/span><span class=\"n\">std<\/span><span class=\"p\">()<\/span><span class=\"o\">\/<\/span><span class=\"n\">math<\/span><span class=\"o\">.<\/span><span class=\"n\">sqrt<\/span><span class=\"p\">(<\/span><span class=\"mi\">200<\/span><span class=\"p\">)<\/span>\n\n<span class=\"k\">print<\/span> (<span class=\"s1\">'Direct estimation of SE from one sample of 200 elements:'<\/span><span class=\"p\">,<\/span> \\\n       <span class=\"n\">est_sigma_mean<\/span><span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">])<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s1\">'Estimation of the SE by simulating 10000 samples of 200 elements:'<\/span><span class=\"p\">,<\/span>  \\\n       <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">(<\/span><span class=\"n\">means<\/span><span 
class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">std<\/span><span class=\"p\">())<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>We could also be interested in quantifying the variability of other estimates (median, standard deviation, etc.), but unlike the sample mean, these estimates have no simple formula for their standard error.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>From now on, let&#8217;s consider the whole accidents dataset as a sample from a hypothetical population (this is the most common situation when analyzing real data!).<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"Bootstrapping-the-standard-error-of-the-mean.\">Bootstrapping the standard error of the mean<\/h3>\n<p>A modern alternative to the traditional approach to statistical inference is the <em>bootstrapping method<\/em>. 
In the bootstrap, we draw <span id=\"MathJax-Element-8-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;N&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-59\" class=\"math\"><span id=\"MathJax-Span-60\" class=\"mrow\"><span id=\"MathJax-Span-61\" class=\"mi\">N<\/span><\/span><\/span><\/span> observations with replacement from the original data to create a bootstrap sample or resample. Then, we calculate the mean of this resample. By repeating this process a large number of times we can build a good approximation of the sampling distribution of the mean.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"k\">def<\/span> <span class=\"nf\">meanBootstrap<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">,<\/span><span class=\"n\">numberb<\/span><span class=\"p\">):<\/span>\n    <span class=\"kn\">import<\/span> <span class=\"nn\">numpy<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">np<\/span>\n    <span class=\"n\">x<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">]<\/span><span class=\"o\">*<\/span><span class=\"n\">numberb<\/span>\n    <span class=\"k\">for<\/span> <span class=\"n\">i<\/span> 
<span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"n\">numberb<\/span><span class=\"p\">):<\/span>\n        <span class=\"n\">sample<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">X<\/span><span class=\"p\">[<\/span><span class=\"n\">_<\/span><span class=\"p\">]<\/span> <span class=\"k\">for<\/span> <span class=\"n\">_<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">random<\/span><span class=\"o\">.<\/span><span class=\"n\">randint<\/span><span class=\"p\">(<\/span><span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">),<\/span> <span class=\"n\">size<\/span><span class=\"o\">=<\/span><span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">))]<\/span>\n        <span class=\"n\">x<\/span><span class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(<\/span><span class=\"n\">sample<\/span><span class=\"p\">)<\/span>\n    <span class=\"k\">return<\/span> <span class=\"n\">x<\/span>\n\n<span class=\"n\">m<\/span> <span class=\"o\">=<\/span> <span class=\"n\">meanBootstrap<\/span><span class=\"p\">(<\/span><span class=\"n\">accidents<\/span><span class=\"p\">,<\/span> <span class=\"mi\">10000<\/span><span class=\"p\">)<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s2\">\"Mean estimate:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(<\/span><span class=\"n\">m<\/span><span class=\"p\">))<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt 
input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">fig<\/span><span class=\"p\">,<\/span> <span class=\"n\">ax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">subplots<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">figsize<\/span><span class=\"o\">=<\/span><span class=\"p\">(<\/span><span class=\"mi\">12<\/span><span class=\"p\">,<\/span> <span class=\"mi\">3<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">ylabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Frequency'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">xlabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Sample mean value'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">hist<\/span><span class=\"p\">(<\/span><span class=\"n\">m<\/span><span class=\"p\">,<\/span> \n         <span class=\"n\">bins<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">50<\/span><span class=\"p\">,<\/span> \n         <span class=\"n\">density<\/span> <span class=\"o\">=<\/span> <span class=\"bp\">True<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">axvline<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(<\/span><span class=\"n\">m<\/span><span class=\"p\">),<\/span> \n           <span class=\"n\">ymin<\/span> <span class=\"o\">=<\/span> <span class=\"mf\">0.0<\/span><span class=\"p\">,<\/span> \n           <span class=\"n\">ymax<\/span> 
<span class=\"o\">=<\/span> <span class=\"mf\">1.0<\/span><span class=\"p\">,<\/span> \n           <span class=\"n\">color<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">])<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>The bootstrapping method can be applied to other simple estimates such as the median or the variance:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"k\">def<\/span> <span class=\"nf\">medBootstrap<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">,<\/span><span class=\"n\">numberb<\/span><span class=\"p\">):<\/span>\n    <span class=\"kn\">import<\/span> <span class=\"nn\">numpy<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">np<\/span>\n    <span class=\"n\">x<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">]<\/span><span class=\"o\">*<\/span><span class=\"n\">numberb<\/span>\n    <span class=\"k\">for<\/span> <span class=\"n\">i<\/span> <span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"n\">numberb<\/span><span class=\"p\">):<\/span>\n        <span class=\"n\">sample<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">X<\/span><span class=\"p\">[<\/span><span class=\"n\">_<\/span><span class=\"p\">]<\/span> <span class=\"k\">for<\/span> <span 
class=\"n\">_<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">random<\/span><span class=\"o\">.<\/span><span class=\"n\">randint<\/span><span class=\"p\">(<\/span><span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">),<\/span> <span class=\"n\">size<\/span><span class=\"o\">=<\/span><span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">))]<\/span>\n        <span class=\"n\">x<\/span><span class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">median<\/span><span class=\"p\">(<\/span><span class=\"n\">sample<\/span><span class=\"p\">)<\/span>\n    <span class=\"k\">return<\/span> <span class=\"n\">x<\/span>\n\n<span class=\"n\">med<\/span> <span class=\"o\">=<\/span> <span class=\"n\">medBootstrap<\/span><span class=\"p\">(<\/span><span class=\"n\">accidents<\/span><span class=\"p\">,<\/span> <span class=\"mi\">10000<\/span><span class=\"p\">)<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s2\">\"Median estimate:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(<\/span><span class=\"n\">med<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">fig<\/span><span class=\"p\">,<\/span> <span class=\"n\">ax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">subplots<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">figsize<\/span><span class=\"o\">=<\/span><span class=\"p\">(<\/span><span class=\"mi\">12<\/span><span class=\"p\">,<\/span> <span class=\"mi\">3<\/span><span class=\"p\">))<\/span>\n<span 
class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">hist<\/span><span class=\"p\">(<\/span><span class=\"n\">med<\/span><span class=\"p\">,<\/span> <span class=\"n\">bins<\/span><span class=\"o\">=<\/span><span class=\"mi\">5<\/span><span class=\"p\">,<\/span> <span class=\"n\">density<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">ylabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Frequency'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">xlabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Sample median value'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">axvline<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">(<\/span><span class=\"n\">med<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(),<\/span> \n           <span class=\"n\">ymin<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> \n           <span class=\"n\">ymax<\/span> <span class=\"o\">=<\/span> <span class=\"mf\">1.0<\/span><span class=\"p\">,<\/span> \n           <span class=\"n\">color<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">])<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 
id=\"Confidence-intervals.\">Confidence intervals<\/h2>\n<p>A point estimate provides a single plausible value for a parameter. However, as we have seen, a point estimate is rarely perfect; usually there is some error in the estimate. That is why we have proposed using the standard error as a measure of its variability.<\/p>\n<p>As an alternative, the next logical step is to provide a <strong>plausible range of values<\/strong> for the parameter. A plausible range of values for the population parameter is called a <strong>confidence interval<\/strong>.<\/p>\n<p>We will base the definition of a confidence interval on two ideas:<\/p>\n<ul>\n<li>Our point estimate is the most plausible value of the parameter, so it makes sense to build the confidence interval around the point estimate.<\/li>\n<li>The plausibility of a range of values can be defined from the sampling distribution of the estimate.<\/li>\n<\/ul>\n<p>In order to define an interval, we can make use of a well-known result from probability that applies to normal distributions: roughly 95% of the time our estimate will be within 1.96 standard errors of the true mean of the distribution.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">m<\/span> <span class=\"o\">=<\/span> <span class=\"n\">accidents<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">()<\/span>\n<span class=\"n\">se<\/span> <span class=\"o\">=<\/span> <span class=\"n\">accidents<\/span><span class=\"o\">.<\/span><span class=\"n\">std<\/span><span class=\"p\">()<\/span><span class=\"o\">\/<\/span><span class=\"n\">math<\/span><span class=\"o\">.<\/span><span class=\"n\">sqrt<\/span><span class=\"p\">(<\/span><span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span 
class=\"n\">accidents<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">ci<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">m<\/span> <span class=\"o\">-<\/span> <span class=\"n\">se<\/span><span class=\"o\">*<\/span><span class=\"mf\">1.96<\/span><span class=\"p\">,<\/span> <span class=\"n\">m<\/span> <span class=\"o\">+<\/span> <span class=\"n\">se<\/span><span class=\"o\">*<\/span><span class=\"mf\">1.96<\/span><span class=\"p\">]<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s2\">\"Confidence interval:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">ci)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>This is how we would compute a 95% confidence interval of the sample mean by using bootstrapping:<\/p>\n<ol>\n<li>Repeat the following steps a large number <span id=\"MathJax-Element-9-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;M&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-62\" class=\"math\"><span id=\"MathJax-Span-63\" class=\"mrow\"><span id=\"MathJax-Span-64\" class=\"mi\">M<\/span><\/span><\/span><\/span> of times:\n<ul>\n<li>Draw <span id=\"MathJax-Element-10-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; 
font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;N&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-65\" class=\"math\"><span id=\"MathJax-Span-66\" class=\"mrow\"><span id=\"MathJax-Span-67\" class=\"mi\">N<\/span><\/span><\/span><\/span> observations with replacement from the original data to create a bootstrap sample or resample;<\/li>\n<li>Calculate the mean for the resample.<\/li>\n<\/ul>\n<\/li>\n<li>Calculate the <strong>mean<\/strong> of your <span id=\"MathJax-Element-11-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;M&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-68\" class=\"math\"><span id=\"MathJax-Span-69\" class=\"mrow\"><span id=\"MathJax-Span-70\" class=\"mi\">M<\/span><\/span><\/span><\/span> values of the sample statistic. 
This process gives you a \u201cbootstrapped\u201d estimate of the sample statistic.<\/li>\n<li>Calculate the <strong>standard deviation<\/strong> of your <span id=\"MathJax-Element-12-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;M&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-71\" class=\"math\"><span id=\"MathJax-Span-72\" class=\"mrow\"><span id=\"MathJax-Span-73\" class=\"mi\">M<\/span><\/span><\/span><\/span> values of the sample statistic. 
This process gives you a \u201cbootstrapped\u201d estimate of the <strong>SE<\/strong> of the sample statistic.<\/li>\n<li>Obtain the 2.5th and 97.5th percentiles of your <span id=\"MathJax-Element-13-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;M&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-74\" class=\"math\"><span id=\"MathJax-Span-75\" class=\"mrow\"><span id=\"MathJax-Span-76\" class=\"mi\">M<\/span><\/span><\/span><\/span> values of the sample statistic; these are the bounds of the 95% confidence interval.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">m<\/span> <span class=\"o\">=<\/span> <span class=\"n\">meanBootstrap<\/span><span class=\"p\">(<\/span><span class=\"n\">accidents<\/span><span class=\"p\">,<\/span> <span class=\"mi\">10000<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">sample_mean<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(<\/span><span class=\"n\">m<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">sample_se<\/span> <span class=\"o\">=<\/span>  <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">std<\/span><span class=\"p\">(<\/span><span class=\"n\">m<\/span><span 
class=\"p\">)<\/span>\n\n<span class=\"k\">print<\/span> (<span class=\"s2\">\"Mean estimate:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">sample_mean)<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s2\">\"SE of the estimate:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">sample_se)<\/span>\n\n<span class=\"n\">ci<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">percentile<\/span><span class=\"p\">(<\/span><span class=\"n\">m<\/span><span class=\"p\">,<\/span><span class=\"mf\">2.5<\/span><span class=\"p\">),<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">percentile<\/span><span class=\"p\">(<\/span><span class=\"n\">m<\/span><span class=\"p\">,<\/span><span class=\"mf\">97.5<\/span><span class=\"p\">)]<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s2\">\"Confidence interval:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">ci)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"Waht-is-the-real-meaning-of-CI?\">What is the real meaning of CI?<\/h3>\n<p>The real meaning of &#8220;confidence&#8221; is not evident and it must be understood from the point of view of the generating process.<\/p>\n<p>Suppose we take many (infinite) samples from a population and build a 95% confidence interval from each sample. 
Then about 95% of those intervals would contain the actual parameter.<\/p>\n<p>This can be easily shown by simulating a large number of samples and checking how many intervals contain the true parameter:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">df<\/span> <span class=\"o\">=<\/span> <span class=\"n\">accidents<\/span>   \n\n<span class=\"n\">n<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">100<\/span>                                               <span class=\"c1\"># number of observations<\/span>\n<span class=\"n\">N_test<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">100<\/span>                                          <span class=\"c1\"># number of samples with n observations<\/span>\n<span class=\"n\">means<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">([<\/span><span class=\"mf\">0.0<\/span><span class=\"p\">]<\/span> <span class=\"o\">*<\/span> <span class=\"n\">N_test<\/span><span class=\"p\">)<\/span>                      <span class=\"c1\"># samples' mean<\/span>\n<span class=\"n\">s<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">([<\/span><span class=\"mf\">0.0<\/span><span class=\"p\">]<\/span> <span class=\"o\">*<\/span> <span class=\"n\">N_test<\/span><span class=\"p\">)<\/span>                          <span class=\"c1\"># samples' std<\/span>\n<span class=\"n\">ci<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">([[<\/span><span class=\"mf\">0.0<\/span><span class=\"p\">,<\/span><span 
class=\"mf\">0.0<\/span><span class=\"p\">]]<\/span> <span class=\"o\">*<\/span> <span class=\"n\">N_test<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">tm<\/span> <span class=\"o\">=<\/span> <span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">()<\/span>                                        <span class=\"c1\"># \"true\" mean<\/span>\n\n<span class=\"k\">for<\/span> <span class=\"n\">i<\/span> <span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"n\">N_test<\/span><span class=\"p\">):<\/span>                               <span class=\"c1\"># sample generation and CI computation<\/span>\n    <span class=\"n\">rows<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">random<\/span><span class=\"o\">.<\/span><span class=\"n\">choice<\/span><span class=\"p\">(<\/span><span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">index<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"p\">,<\/span> <span class=\"n\">n<\/span><span class=\"p\">)<\/span>\n    <span class=\"n\">sampled_df<\/span> <span class=\"o\">=<\/span> <span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">loc<\/span><span class=\"p\">[<\/span><span class=\"n\">rows<\/span><span class=\"p\">]<\/span>\n    <span class=\"n\">means<\/span><span class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">sampled_df<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">()<\/span>\n    <span class=\"n\">s<\/span><span class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">sampled_df<\/span><span class=\"o\">.<\/span><span class=\"n\">std<\/span><span class=\"p\">()<\/span>\n    <span class=\"n\">ci<\/span><span 
class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">means<\/span><span class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span> <span class=\"o\">+<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">([<\/span><span class=\"o\">-<\/span><span class=\"n\">s<\/span><span class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span> <span class=\"o\">*<\/span><span class=\"mf\">1.96<\/span><span class=\"o\">\/<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">sqrt<\/span><span class=\"p\">(<\/span><span class=\"n\">n<\/span><span class=\"p\">),<\/span> <span class=\"n\">s<\/span><span class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span><span class=\"o\">*<\/span><span class=\"mf\">1.96<\/span><span class=\"o\">\/<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">sqrt<\/span><span class=\"p\">(<\/span><span class=\"n\">n<\/span><span class=\"p\">)])<\/span>    \n\n<span class=\"n\">out1<\/span> <span class=\"o\">=<\/span> <span class=\"n\">ci<\/span><span class=\"p\">[:,<\/span><span class=\"mi\">0<\/span><span class=\"p\">]<\/span> <span class=\"o\">&gt;<\/span> <span class=\"n\">tm<\/span>                                   <span class=\"c1\"># CI that do not contain the \"true\" mean<\/span>\n<span class=\"n\">out2<\/span> <span class=\"o\">=<\/span> <span class=\"n\">ci<\/span><span class=\"p\">[:,<\/span><span class=\"mi\">1<\/span><span class=\"p\">]<\/span> <span class=\"o\">&lt;<\/span> <span class=\"n\">tm<\/span>\n\n<span class=\"n\">fig<\/span><span class=\"p\">,<\/span> <span class=\"n\">ax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">subplots<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span 
class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">figsize<\/span><span class=\"o\">=<\/span><span class=\"p\">(<\/span><span class=\"mi\">12<\/span><span class=\"p\">,<\/span> <span class=\"mi\">5<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">ind<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">arange<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">N_test<\/span><span class=\"o\">+<\/span><span class=\"mi\">1<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">axhline<\/span><span class=\"p\">(<\/span><span class=\"n\">y<\/span> <span class=\"o\">=<\/span> <span class=\"n\">tm<\/span><span class=\"p\">,<\/span> \n           <span class=\"n\">xmin<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> \n           <span class=\"n\">xmax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">N_test<\/span><span class=\"o\">+<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> \n           <span class=\"n\">color<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">])<\/span>\n<span class=\"n\">ci<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">transpose<\/span><span class=\"p\">(<\/span><span class=\"n\">ci<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">plot<\/span><span class=\"p\">([<\/span><span class=\"n\">ind<\/span><span class=\"p\">,<\/span><span class=\"n\">ind<\/span><span class=\"p\">],<\/span> \n        <span class=\"n\">ci<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">color<\/span> <span 
class=\"o\">=<\/span> <span class=\"s1\">'0.75'<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">marker<\/span> <span class=\"o\">=<\/span> <span class=\"s1\">'_'<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">ms<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">linewidth<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">3<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">plot<\/span><span class=\"p\">([<\/span><span class=\"n\">ind<\/span><span class=\"p\">[<\/span><span class=\"n\">out1<\/span><span class=\"p\">],<\/span><span class=\"n\">ind<\/span><span class=\"p\">[<\/span><span class=\"n\">out1<\/span><span class=\"p\">]],<\/span> \n        <span class=\"n\">ci<\/span><span class=\"p\">[:,<\/span> <span class=\"n\">out1<\/span><span class=\"p\">],<\/span> \n        <span class=\"n\">color<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mf\">0.8<\/span><span class=\"p\">],<\/span> \n        <span class=\"n\">marker<\/span> <span class=\"o\">=<\/span> <span class=\"s1\">'_'<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">ms<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">linewidth<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">3<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">plot<\/span><span class=\"p\">([<\/span><span class=\"n\">ind<\/span><span class=\"p\">[<\/span><span class=\"n\">out2<\/span><span class=\"p\">],<\/span><span class=\"n\">ind<\/span><span class=\"p\">[<\/span><span class=\"n\">out2<\/span><span class=\"p\">]],<\/span> \n        
<span class=\"n\">ci<\/span><span class=\"p\">[:,<\/span> <span class=\"n\">out2<\/span><span class=\"p\">],<\/span> \n        <span class=\"n\">color<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"mf\">0.8<\/span><span class=\"p\">],<\/span> \n        <span class=\"n\">marker<\/span> <span class=\"o\">=<\/span> <span class=\"s1\">'_'<\/span><span class=\"p\">,<\/span>\n        <span class=\"n\">ms<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">0<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">linewidth<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">3<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">plot<\/span><span class=\"p\">(<\/span><span class=\"n\">ind<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">means<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">color<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">,<\/span> <span class=\"o\">.<\/span><span class=\"mi\">8<\/span><span class=\"p\">,<\/span> <span class=\"o\">.<\/span><span class=\"mi\">2<\/span><span class=\"p\">,<\/span> <span class=\"o\">.<\/span><span class=\"mi\">8<\/span><span class=\"p\">],<\/span> \n        <span class=\"n\">marker<\/span> <span class=\"o\">=<\/span> <span class=\"s1\">'.'<\/span><span class=\"p\">,<\/span>\n        <span class=\"n\">ms<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">10<\/span><span class=\"p\">,<\/span> \n        <span class=\"n\">linestyle<\/span> <span class=\"o\">=<\/span> <span class=\"s1\">''<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">set_ylabel<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"Confidence 
interval for the samples' mean estimate\"<\/span><span class=\"p\">,<\/span>\n              <span class=\"n\">fontsize<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">12<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">ax<\/span><span class=\"o\">.<\/span><span class=\"n\">set_xlabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Samples (with <\/span><span class=\"si\">%d<\/span><span class=\"s1\"> observations). '<\/span>  <span class=\"o\">%<\/span><span class=\"k\">n<\/span>, \n              <span class=\"n\">fontsize<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">12<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">savefig<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"confidence.png\"<\/span><span class=\"p\">,<\/span>\n            <span class=\"n\">dpi<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">300<\/span><span class=\"p\">,<\/span> \n            <span class=\"n\">bbox_inches<\/span> <span class=\"o\">=<\/span> <span class=\"s1\">'tight'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">show<\/span><span class=\"p\">()<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"Hypothesis-testing\">Hypothesis testing<\/h2>\n<p>Giving a measure of the variability of our estimates is one way of producing a statistical proposition about the population, but not the only one. 
R.A. Fisher (1890-1962) proposed an alternative, known as <em>hypothesis testing<\/em>, that is based on the concept of <em>statistical significance<\/em>.<\/p>\n<p>Let&#8217;s suppose that a deeper analysis of traffic accidents in Barcelona reveals a difference between 2010 and 2013. 
Of course, the difference could be caused only by chance, because of the variability of both estimates. But it could also be the case that traffic conditions are very different in Barcelona during these two periods and, because of this, data from these two periods can be considered as belonging to two different populations. Then, the relevant question is: Are the observed effects real or not?<\/p>\n<p>The process of determining the statistical significance of an effect is called <strong>hypothesis testing<\/strong>. This process starts by simplifying the options into two competing hypotheses:<\/p>\n<ul>\n<li><span id=\"MathJax-Element-14-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;H&lt;\/mi&gt;&lt;mn&gt;0&lt;\/mn&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-77\" class=\"math\"><span id=\"MathJax-Span-78\" class=\"mrow\"><span id=\"MathJax-Span-79\" class=\"msubsup\"><span id=\"MathJax-Span-80\" class=\"mi\">H<\/span><span id=\"MathJax-Span-81\" class=\"mn\">0<\/span><\/span><\/span><\/span><\/span>: The mean number of daily traffic accidents is the same in 2013 and 2010 (there is only one population, one true mean, and 2010 and 2013 are just different samples from the same population).<\/li>\n<li><span id=\"MathJax-Element-15-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; 
text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;H&lt;\/mi&gt;&lt;mi&gt;A&lt;\/mi&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-82\" class=\"math\"><span id=\"MathJax-Span-83\" class=\"mrow\"><span id=\"MathJax-Span-84\" class=\"msubsup\"><span id=\"MathJax-Span-85\" class=\"mi\">H<\/span><span id=\"MathJax-Span-86\" class=\"mi\">A<\/span><\/span><\/span><\/span><\/span>: The mean number of daily traffic accidents for 2010 and for 2013 is different (2010 and 2013 are two samples from two different populations).<\/li>\n<\/ul>\n<p>We call <span id=\"MathJax-Element-16-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;H&lt;\/mi&gt;&lt;mn&gt;0&lt;\/mn&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-87\" class=\"math\"><span id=\"MathJax-Span-88\" class=\"mrow\"><span id=\"MathJax-Span-89\" class=\"msubsup\"><span id=\"MathJax-Span-90\" class=\"mi\">H<\/span><span id=\"MathJax-Span-91\" class=\"mn\">0<\/span><\/span><\/span><\/span><\/span> the <em>null hypothesis<\/em> and it represents a skeptical point of view: the effect we 
have observed is due to chance (due to the specific sample bias).<\/p>\n<p><span id=\"MathJax-Element-17-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;H&lt;\/mi&gt;&lt;mi&gt;A&lt;\/mi&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-92\" class=\"math\"><span id=\"MathJax-Span-93\" class=\"mrow\"><span id=\"MathJax-Span-94\" class=\"msubsup\"><span id=\"MathJax-Span-95\" class=\"mi\">H<\/span><span id=\"MathJax-Span-96\" class=\"mi\">A<\/span><\/span><\/span><\/span><\/span> is the <em>alternative hypothesis<\/em> and it represents the other point of view: the effect is real.<\/p>\n<p>The general rule of frequentist hypothesis testing is: We will not discard <span id=\"MathJax-Element-18-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;H&lt;\/mi&gt;&lt;mn&gt;0&lt;\/mn&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-97\" class=\"math\"><span 
id=\"MathJax-Span-98\" class=\"mrow\"><span id=\"MathJax-Span-99\" class=\"msubsup\"><span id=\"MathJax-Span-100\" class=\"mi\">H<\/span><span id=\"MathJax-Span-101\" class=\"mn\">0<\/span><\/span><\/span><\/span><\/span> (and hence we will not consider <span id=\"MathJax-Element-19-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;H&lt;\/mi&gt;&lt;mi&gt;A&lt;\/mi&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-102\" class=\"math\"><span id=\"MathJax-Span-103\" class=\"mrow\"><span id=\"MathJax-Span-104\" class=\"msubsup\"><span id=\"MathJax-Span-105\" class=\"mi\">H<\/span><span id=\"MathJax-Span-106\" class=\"mi\">A<\/span><\/span><\/span><\/span><\/span>) unless the observed effect is implausible under <span id=\"MathJax-Element-20-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;msub&gt;&lt;mi&gt;H&lt;\/mi&gt;&lt;mn&gt;0&lt;\/mn&gt;&lt;\/msub&gt;&lt;\/math&gt;\"><span 
id=\"MathJax-Span-107\" class=\"math\"><span id=\"MathJax-Span-108\" class=\"mrow\"><span id=\"MathJax-Span-109\" class=\"msubsup\"><span id=\"MathJax-Span-110\" class=\"mi\">H<\/span><span id=\"MathJax-Span-111\" class=\"mn\">0<\/span><\/span><\/span><\/span><\/span>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"Testing-hypotheses-using-confidence-intervals.\">Testing hypotheses using confidence intervals.<\/h3>\n<p>We can use the concept represented by confidence intervals to measure the plausibility of a hypothesis.<\/p>\n<p>We can illustrate the evaluation of the hypotheses setup by comparing the mean rate of traffic accidents in Barcelona during 2010 and 2013 using a point estimate from the 2013 sample:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">data<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"o\">.<\/span><span class=\"n\">read_csv<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"files\/ch04\/ACCIDENTS_GU_BCN_2010.csv\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">encoding<\/span><span class=\"o\">=<\/span><span class=\"s1\">'latin-1'<\/span><span class=\"p\">)<\/span>\n<span class=\"c1\">#Create a new column which is the date<\/span>\n<span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Date'<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Dia de mes'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">apply<\/span><span 
class=\"p\">(<\/span><span class=\"k\">lambda<\/span> <span class=\"n\">x<\/span> <span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"p\">))<\/span> <span class=\"o\">+<\/span> <span class=\"s1\">'-'<\/span> <span class=\"o\">+<\/span>  \\\n               <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Mes de any'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">apply<\/span><span class=\"p\">(<\/span><span class=\"k\">lambda<\/span> <span class=\"n\">x<\/span> <span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">data2<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Date'<\/span><span class=\"p\">]<\/span>\n<span class=\"n\">counts2010<\/span> <span class=\"o\">=<\/span><span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Date'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">value_counts<\/span><span class=\"p\">()<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s1\">'2010: Mean'<\/span><span class=\"p\">,<\/span> <span class=\"n\">counts2010<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\n\n<span class=\"n\">data<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"o\">.<\/span><span class=\"n\">read_csv<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"files\/ch04\/ACCIDENTS_GU_BCN_2013.csv\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">encoding<\/span><span class=\"o\">=<\/span><span class=\"s1\">'latin-1'<\/span><span class=\"p\">)<\/span>\n<span class=\"c1\">#Create a new column which is the date<\/span>\n<span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Date'<\/span><span 
class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Dia de mes'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">apply<\/span><span class=\"p\">(<\/span><span class=\"k\">lambda<\/span> <span class=\"n\">x<\/span> <span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"p\">))<\/span> <span class=\"o\">+<\/span> <span class=\"s1\">'-'<\/span> <span class=\"o\">+<\/span>  \\\n               <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Mes de any'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">apply<\/span><span class=\"p\">(<\/span><span class=\"k\">lambda<\/span> <span class=\"n\">x<\/span> <span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">data2<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Date'<\/span><span class=\"p\">]<\/span>\n<span class=\"n\">counts2013<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Date'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">value_counts<\/span><span class=\"p\">()<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s1\">'2013: Mean'<\/span><span class=\"p\">,<\/span> <span class=\"n\">counts2013<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>This estimate suggests that during 2013 the mean rate of traffic accidents in Barcelona 
<strong>was higher<\/strong> than in 2010. But is this effect statistically significant?<\/p>\n<p>Based on our sample, the 95% confidence interval for the mean rate of traffic accidents in Barcelona during 2013 can be calculated as:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">n<\/span> <span class=\"o\">=<\/span> <span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">counts2013<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">mean<\/span> <span class=\"o\">=<\/span> <span class=\"n\">counts2013<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">()<\/span>\n<span class=\"n\">s<\/span> <span class=\"o\">=<\/span> <span class=\"n\">counts2013<\/span><span class=\"o\">.<\/span><span class=\"n\">std<\/span><span class=\"p\">()<\/span>\n<span class=\"n\">ci<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">mean<\/span> <span class=\"o\">-<\/span> <span class=\"n\">s<\/span><span class=\"o\">*<\/span><span class=\"mf\">1.96<\/span><span class=\"o\">\/<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">sqrt<\/span><span class=\"p\">(<\/span><span class=\"n\">n<\/span><span class=\"p\">),<\/span>  <span class=\"n\">mean<\/span> <span class=\"o\">+<\/span> <span class=\"n\">s<\/span><span class=\"o\">*<\/span><span class=\"mf\">1.96<\/span><span class=\"o\">\/<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">sqrt<\/span><span class=\"p\">(<\/span><span class=\"n\">n<\/span><span class=\"p\">)]<\/span> \n<span class=\"k\">print<\/span> (<span class=\"s1\">'2010 accident rate estimate:'<\/span><span class=\"p\">,<\/span> <span class=\"n\">counts2010<\/span><span class=\"o\">.<\/span><span 
class=\"n\">mean<\/span><span class=\"p\">())<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s1\">'2013 accident rate estimate:'<\/span><span class=\"p\">,<\/span> <span class=\"n\">counts2013<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s1\">'CI for 2013:'<\/span><span class=\"p\">,<\/span><span class=\"n\">ci)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Because the 2010 accident rate estimate does not fall within the range of plausible values for 2013, we say the alternative hypothesis cannot be discarded. That is, it cannot be ruled out that during 2013 the mean rate of traffic accidents in Barcelona was higher than during 2010.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"Testing-hypotheses-using-P-values.\">Testing hypotheses using P-values.<\/h3>\n<p>A more advanced notion of statistical significance was developed by R.A. Fisher in the 1920&#8217;s when looking for a test to decide whether variation in crop yields was due to some specific intervention or merely random factors beyond experimental control. 
Fisher first assumed that fertilizer caused no difference (null hypothesis) and then calculated <span id=\"MathJax-Element-21-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;P&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-112\" class=\"math\"><span id=\"MathJax-Span-113\" class=\"mrow\"><span id=\"MathJax-Span-114\" class=\"mi\">P<\/span><\/span><\/span><\/span>, the probability that an observed yield in a fertilized field would occur if fertilizer had no real effect. This probability is called the p-value.<\/p>\n<p>The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true.<\/p>\n<p>To apply a hypothesis test to our problem, the first step is to quantify the size of the apparent effect by choosing a test statistic. 
In our case, the apparent effect is a difference in accident rates, so a natural choice for the test statistic is the difference in means between the two periods:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">m<\/span> <span class=\"o\">=<\/span> <span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">counts2010<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">n<\/span> <span class=\"o\">=<\/span> <span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">counts2013<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">p<\/span> <span class=\"o\">=<\/span> <span class=\"p\">(<\/span><span class=\"n\">counts2013<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">()<\/span> <span class=\"o\">-<\/span> <span class=\"n\">counts2010<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s1\">'m:'<\/span><span class=\"p\">,<\/span><span class=\"n\">m<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'n:'<\/span><span class=\"p\">,<\/span> <span class=\"n\">n)<\/span>\n<span class=\"k\">print<\/span> (<span class=\"s1\">'mean difference: '<\/span><span class=\"p\">,<\/span> <span class=\"n\">p)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>The second step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real. 
In our case the null hypothesis is that there is no difference between the two periods. The alternative hypothesis is that during 2013 the mean rate of traffic accidents in Barcelona was higher than during 2010.<\/p>\n<p>The third step is to compute a p-value, which is the probability of seeing the apparent effect if the null hypothesis is true. In our case, we compute the observed difference in means, then estimate the probability of seeing a difference as big or bigger under the null hypothesis.<\/p>\n<p>Usually, if P is less than 0.05 (the chance of a fluke is less than 5%), the result is declared statistically significant.<\/p>\n<p>To approximate the p-value, we can use the following procedure:<\/p>\n<ol>\n<li>Pool the distributions, generate two samples with size <span id=\"MathJax-Element-22-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;n&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-115\" class=\"math\"><span id=\"MathJax-Span-116\" class=\"mrow\"><span id=\"MathJax-Span-117\" class=\"mi\">n<\/span><\/span><\/span><\/span> and compute the difference in their means.<\/li>\n<li>Repeat many times: generate two new samples with size <span id=\"MathJax-Element-23-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; 
white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;n&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-118\" class=\"math\"><span id=\"MathJax-Span-119\" class=\"mrow\"><span id=\"MathJax-Span-120\" class=\"mi\">n<\/span><\/span><\/span><\/span> and compute the difference in the mean.<\/li>\n<li>Count how many differences are larger than the observed one<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">x<\/span> <span class=\"o\">=<\/span> <span class=\"n\">counts2010<\/span>\n<span class=\"n\">y<\/span> <span class=\"o\">=<\/span> <span class=\"n\">counts2013<\/span>\n<span class=\"n\">pool<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">concatenate<\/span><span class=\"p\">([<\/span><span class=\"n\">x<\/span><span class=\"p\">,<\/span><span class=\"n\">y<\/span><span class=\"p\">])<\/span>\n<span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">random<\/span><span class=\"o\">.<\/span><span class=\"n\">shuffle<\/span><span class=\"p\">(<\/span><span class=\"n\">pool<\/span><span class=\"p\">)<\/span>\n\n<span class=\"n\">fig<\/span><span class=\"p\">,<\/span> <span class=\"n\">ax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">subplots<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">figsize<\/span><span 
class=\"o\">=<\/span><span class=\"p\">(<\/span><span class=\"mi\">12<\/span><span class=\"p\">,<\/span> <span class=\"mi\">3<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">hist<\/span><span class=\"p\">(<\/span><span class=\"n\">pool<\/span><span class=\"p\">,<\/span> \n         <span class=\"n\">bins<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">25<\/span><span class=\"p\">,<\/span> \n         <span class=\"n\">density<\/span> <span class=\"o\">=<\/span> <span class=\"bp\">True<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">ylabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Frequency'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">xlabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Number of accidents'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">title<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"Pooled distribution\"<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">N<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">10000<\/span> <span class=\"c1\"># number of samples<\/span>\n<span class=\"n\">diff<\/span> <span class=\"o\">=<\/span> [i for i in range(N)]\n<span class=\"k\">for<\/span> <span class=\"n\">i<\/span> <span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"n\">N<\/span><span class=\"p\">):<\/span>\n    <span class=\"n\">p1<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span 
class=\"n\">random<\/span><span class=\"o\">.<\/span><span class=\"n\">choice<\/span><span class=\"p\">(<\/span><span class=\"n\">pool<\/span><span class=\"p\">)<\/span> <span class=\"k\">for<\/span> <span class=\"n\">_<\/span> <span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"n\">n<\/span><span class=\"p\">)]<\/span>\n    <span class=\"n\">p2<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">random<\/span><span class=\"o\">.<\/span><span class=\"n\">choice<\/span><span class=\"p\">(<\/span><span class=\"n\">pool<\/span><span class=\"p\">)<\/span> <span class=\"k\">for<\/span> <span class=\"n\">_<\/span> <span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"n\">n<\/span><span class=\"p\">)]<\/span>\n    <span class=\"n\">diff<\/span><span class=\"p\">[<\/span><span class=\"n\">i<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"p\">(<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(<\/span><span class=\"n\">p1<\/span><span class=\"p\">)<\/span><span class=\"o\">-<\/span><span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">(<\/span><span class=\"n\">p2<\/span><span class=\"p\">))<\/span>\n\n<span class=\"n\">fig<\/span><span class=\"p\">,<\/span> <span class=\"n\">ax<\/span> <span class=\"o\">=<\/span> <span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">subplots<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">figsize<\/span><span class=\"o\">=<\/span><span class=\"p\">(<\/span><span class=\"mi\">12<\/span><span class=\"p\">,<\/span> <span class=\"mi\">3<\/span><span class=\"p\">))<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span 
class=\"n\">hist<\/span><span class=\"p\">(<\/span><span class=\"n\">diff<\/span><span class=\"p\">,<\/span> \n         <span class=\"n\">bins<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">50<\/span><span class=\"p\">,<\/span> \n         <span class=\"n\">density<\/span> <span class=\"o\">=<\/span> <span class=\"bp\">True<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">ylabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Frequency'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">xlabel<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Difference in the mean'<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"c1\"># counting how many differences are larger than the observed one<\/span>\n<span class=\"n\">diff2<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">array<\/span><span class=\"p\">(<\/span><span class=\"n\">diff<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">w1<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\">.<\/span><span class=\"n\">where<\/span><span class=\"p\">(<\/span><span class=\"n\">diff2<\/span> <span class=\"o\">&gt;<\/span> <span class=\"n\">p<\/span><span class=\"p\">)[<\/span><span class=\"mi\">0<\/span><span class=\"p\">]<\/span>      \n<span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">w1<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div 
class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>If there are <span id=\"MathJax-Element-24-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;k&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-121\" class=\"math\"><span id=\"MathJax-Span-122\" class=\"mrow\"><span id=\"MathJax-Span-123\" class=\"mi\">k<\/span><\/span><\/span><\/span> sample pairs where the difference in mean is as big as or bigger than the observed difference, the p-value is approximately <span id=\"MathJax-Element-25-Frame\" class=\"MathJax\" style=\"box-sizing: border-box; display: inline-table; font-style: normal; font-weight: normal; line-height: normal; font-size: 14px; text-indent: 0px; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;\" tabindex=\"0\" role=\"presentation\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mi&gt;k&lt;\/mi&gt;&lt;mrow class=&quot;MJX-TeXAtom-ORD&quot;&gt;&lt;mo&gt;\/&lt;\/mo&gt;&lt;\/mrow&gt;&lt;mi&gt;N&lt;\/mi&gt;&lt;\/math&gt;\"><span id=\"MathJax-Span-124\" class=\"math\"><span id=\"MathJax-Span-125\" class=\"mrow\"><span id=\"MathJax-Span-126\" class=\"mi\">k<\/span><span id=\"MathJax-Span-127\" class=\"texatom\"><span 
id=\"MathJax-Span-128\" class=\"mrow\"><span id=\"MathJax-Span-129\" class=\"mo\">\/<\/span><\/span><\/span><span id=\"MathJax-Span-130\" class=\"mi\">N<\/span><\/span><\/span><\/span>. In our case:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[\u00a0]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"k\">print<\/span> (<span class=\"s1\">'p-value (Simulation)='<\/span><span class=\"p\">,<\/span> <span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">w1<\/span><span class=\"p\">)<\/span><span class=\"o\">\/<\/span><span class=\"nb\">float<\/span><span class=\"p\">(<\/span><span class=\"n\">N<\/span><span class=\"p\">),<\/span> <span class=\"s1\">'('<\/span><span class=\"p\">,<\/span> <span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">w1<\/span><span class=\"p\">)<\/span><span class=\"o\">\/<\/span><span class=\"nb\">float<\/span><span class=\"p\">(<\/span><span class=\"n\">N<\/span><span class=\"p\">)<\/span><span class=\"o\">*<\/span><span class=\"mi\">100<\/span> <span class=\"p\">,<\/span><span class=\"s1\">'%)'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'Difference ='<\/span><span class=\"p\">,<\/span> <span class=\"n\">p)<\/span>\n<span class=\"k\">if<\/span> <span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">w1<\/span><span class=\"p\">)<\/span><span class=\"o\">\/<\/span><span class=\"nb\">float<\/span><span class=\"p\">(<\/span><span class=\"n\">N<\/span><span class=\"p\">)<\/span><span class=\"o\">&lt;<\/span><span class=\"mf\">0.05<\/span><span class=\"p\">:<\/span>\n    <span class=\"k\">print<\/span> (<span class=\"s1\">'The effect is likely')<\/span>\n<span class=\"k\">else<\/span><span class=\"p\">:<\/span>\n    <span class=\"k\">print<\/span> (<span class=\"s1\">'The effect is not 
likely')<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>We have defined the effect as a difference in mean as big as or bigger than the observed difference, taking into account the sign. A test like this is called <em>one-sided<\/em>.<\/p>\n<p>If the relevant question is whether accident rates are different, then it makes sense to test the absolute difference in means. This kind of test is called <em>two-sided<\/em> because it counts both sides of the distribution of differences.<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>In this notebook we will see how to infer predictions about a population. To this end we will explore the relationship between sample parameters and population parameters and we will propose some methods to assess the quality of parameter estimates of a sample. Data description Let&rsquo;s consider a dataset of accidents in Barcelona in 2013. 
[&hellip;]<\/p>\n","protected":false},"author":11,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"page-templates\/full-width.php","meta":{"footnotes":""},"class_list":["post-283","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/283","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/comments?post=283"}],"version-history":[{"count":6,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/283\/revisions"}],"predecessor-version":[{"id":409,"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/pages\/283\/revisions\/409"}],"wp:attachment":[{"href":"http:\/\/vargas-solar.com\/data-centric-smart-everything\/wp-json\/wp\/v2\/media?parent=283"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}