{"id":80,"date":"2016-12-02T18:40:24","date_gmt":"2016-12-02T18:40:24","guid":{"rendered":"http:\/\/vargas-solar.com\/bigdata-visualisation\/?page_id=80"},"modified":"2016-12-05T16:27:26","modified_gmt":"2016-12-05T16:27:26","slug":"analyzing-large-data-collections-with-apache-pig","status":"publish","type":"page","link":"http:\/\/vargas-solar.com\/bigdata-visualisation\/hands-on\/analyzing-large-data-collections-with-apache-pig\/","title":{"rendered":"Analyzing Large Data Collections with Apache Pig"},"content":{"rendered":"<p>&nbsp;<\/p>\n<h3 style=\"text-align: left;\">Objective<\/h3>\n<p style=\"text-align: left;\">The objective of this exercise is to (i) show the use of Pig Latin, the dataflow language of Apache Pig, for analyzing large data collections, and (ii) apply visualization techniques to obtain an aggregated and comprehensive view of the content of a data collection. For this purpose, you will work with a sample of the Neubot Data Collection, which comprises <a href=\"http:\/\/www.neubot.org\">Neubot<\/a> measurements (e.g., download\/upload speed tests) performed by different users, in different places and using different internet providers.<\/p>\n<p>The exercise has two parts: one done in class and one done at home.<\/p>\n<ul>\n<li>In class you will gain practical experience with Pig, so that you become familiar with the kinds of operations and the approach used to explore the content of data collections.<\/li>\n<li>At home you will use visualization tools to observe and understand the results you obtained, and adjust your queries to obtain an understandable view of the data collection.\n<ol>\n<li>You will send a report explaining which tool you chose, the visualization pattern(s) you used and the one that seemed best suited in each case. 
If you adjust a query, describe and explain the process through which you obtained a final result that seems reasonable.<\/li>\n<li>Send your report no later than the morning of Tuesday 6<sup>th<\/sup> December.<\/li>\n<\/ol>\n<\/li>\n<\/ul>\n<h3 style=\"text-align: left;\">Requirements<\/h3>\n<ul style=\"text-align: left;\">\n<li><a href=\"http:\/\/pig.apache.org\/\">Apache Pig<\/a>\u00a00.14.0 (or greater)<\/li>\n<li><a href=\"http:\/\/java.sun.com\/javase\/downloads\/index.jsp\">Java<\/a>\u00a07 (or greater)<\/li>\n<li><a href=\"https:\/\/1drv.ms\/u\/s!AuTcHvCCLBZLmUS9RkItkE7ctw3W\">Neubot Data Collection<\/a><\/li>\n<li>NeubotTestsUDFs.jar (set of user-defined functions required for this exercise)<\/li>\n<\/ul>\n<h3 style=\"text-align: left;\">Neubot Data Collection<\/h3>\n<p style=\"text-align: left;\">The <a href=\"https:\/\/www.dropbox.com\/s\/cmkotgo3c1v94zn\/NeubotTests.zip?dl=0\">Neubot Data Collection<\/a> is a data set containing network tests (<em>e.g., upload\/download speed over HTTP or BitTorrent<\/em>). 
Each test is described by the following fields:<\/p>\n<table class=\" alignleft\" style=\"height: 798px;\" width=\"627\">\n<tbody>\n<tr>\n<td width=\"161\"><strong>Name<\/strong><\/td>\n<td><strong>Description<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>client_address<\/em><\/td>\n<td>User IP address (IPv4 or IPv6).<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>client_country<\/em><\/td>\n<td>Country where the test was conducted.<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>client_provider<\/em><\/td>\n<td>Name of the user&#8217;s internet provider.<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>connect_time<\/em><\/td>\n<td>Number of seconds elapsed between the reception of the first and last packet (<em>Round-Trip Time<\/em>).<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>download_speed<\/em><\/td>\n<td>Download speed (bytes\/sec).<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>neubot_version<\/em><\/td>\n<td>Neubot version used for this test.<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>platform<\/em><\/td>\n<td>User&#8217;s operating system.<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>remote_address<\/em><\/td>\n<td>IP address (IPv4 or IPv6) of the server used for this test.<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>test_name<\/em><\/td>\n<td>Test type (e.g., <em>speedtest, bittorrent, dash<\/em>).<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>timestamp<\/em><\/td>\n<td>Time at which the test was performed, measured as the number of seconds elapsed since 1 January 1970 (cf. 
<a href=\"https:\/\/en.wikipedia.org\/wiki\/Unix_time\">UNIX timestamp<\/a>).<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>upload_speed<\/em><\/td>\n<td>Upload speed (bytes\/sec).<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>latency<\/em><\/td>\n<td>Delay between the sending and the reception of a control packet.<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>uuid<\/em><\/td>\n<td>User ID (generated automatically by Neubot during installation).<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>asnum<\/em><\/td>\n<td>Internet provider&#8217;s ID.<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>region<\/em><\/td>\n<td>Region of the country in which the test was performed (if known).<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>city<\/em><\/td>\n<td>Name of the city.<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>hour<\/em><\/td>\n<td rowspan=\"3\">Hour\/Month\/Year of the test (derived from timestamp).<\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>month<\/em><\/td>\n<\/tr>\n<tr>\n<td width=\"161\"><em>year<\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3 style=\"text-align: left;\">Running\u00a0Apache Pig<\/h3>\n<p style=\"text-align: left;\">Apache Pig consists of a <strong>data flow\u00a0language<\/strong> (Pig Latin), an interpreter and a compiler that produces <em>sequences of Map-Reduce<\/em> programs for analyzing large data sets on parallel infrastructures (e.g.,\u00a0Hadoop). Some of the benefits of using\u00a0Pig Latin are:<\/p>\n<ul style=\"text-align: left;\">\n<li><strong>Ease of programming<\/strong>. It is trivial to achieve parallel execution of simple, &#8220;embarrassingly parallel&#8221; data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.<\/li>\n<li><strong>Optimization opportunities<\/strong>. 
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.<\/li>\n<li><strong>Extensibility<\/strong>. Users can create their own functions to do special-purpose processing.<\/li>\n<\/ul>\n<p style=\"text-align: left;\">Although Pig Latin programs are intended to be executed on a cluster, you can run some tests on your local machine. To do so, open a terminal and type the following commands to open GRUNT (Pig&#8217;s interactive interpreter) in <strong>local mode<\/strong>:<\/p>\n<pre># Move to the folder containing the exercise material\r\ncd ~\/hands-on\/pig\r\n\r\n# Execute Pig in Local mode\r\npig -x local<\/pre>\n<h3 style=\"text-align: left;\">Example of a Pig Program<\/h3>\n<p style=\"text-align: left;\">The following program illustrates how to use Apache Pig to process the Neubot data collection. In particular, it illustrates how to:<\/p>\n<ol style=\"text-align: left;\">\n<li>Define a schema that describes the structure of your data.<\/li>\n<li>Filter the data based on some criterion (e.g., keep only the speedtest tests).<\/li>\n<li>Project a subset of the attributes of the data collection (e.g., keep just the names of the cities where the tests were conducted).<\/li>\n<li>Display results on the screen.<\/li>\n<li>Store results on the filesystem.<\/li>\n<\/ol>\n<p style=\"text-align: left;\">You can run this program by copying and pasting it into GRUNT.<\/p>\n<p style=\"text-align: left;\"><strong>Note:<\/strong> <em>Modify the paths to the NeubotTests data collection and NeubotTestsUDFs.jar if necessary.<\/em><\/p>\n<pre>REGISTER NeubotTestsUDFs.jar;\r\nDEFINE\u00a0\u00a0 IPtoNumber convert.IpToNumber();\r\nDEFINE\u00a0\u00a0 NumberToIP convert.NumberToIp();\r\n\r\nNeubotTests = LOAD 'NeubotTests' using PigStorage(',') as 
(\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 client_address: chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 client_country: chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lon: float,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lat: float,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 client_provider: chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 mlabservername:\u00a0 chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 connect_time:\u00a0\u00a0\u00a0 float,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 download_speed:\u00a0 float,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 neubot_version:\u00a0 float,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 platform:\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 remote_address:\u00a0 chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 test_name:\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 timestamp:\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 long,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 upload_speed:\u00a0\u00a0\u00a0 
float,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 latency:\u00a0 float,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 uuid:\u00a0\u00a0\u00a0\u00a0 chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 asnum:\u00a0\u00a0\u00a0 chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 region:\u00a0\u00a0 chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 city:\u00a0\u00a0\u00a0\u00a0 chararray,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 hour:\u00a0\u00a0\u00a0\u00a0 int,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 month:\u00a0\u00a0\u00a0 int,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 year:\u00a0\u00a0\u00a0\u00a0 int,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 weekday:\u00a0 int,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 day:\u00a0\u00a0\u00a0\u00a0\u00a0 int,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 filedate: chararray\r\n);\r\n\r\n--\r\n-- Keep only the 'speedtest' tests\r\n--     <strong>@<\/strong> means \"<em>use previous result<\/em>\" \r\nTests = FILTER @ BY (test_name matches '.*speedtest.*');\r\n\r\n--\r\n-- Cities where the tests were conducted\r\n--\r\nCities = FOREACH @ GENERATE city;\r\nCities = DISTINCT @;\r\nCities = ORDER @ BY city;\r\n\r\n--\r\n-- Display the results contained in 'Cities'\r\n--\r\nDUMP @;\r\n\r\n--\r\n-- Store the results in the folder 'SpeedTests'\r\n--\r\nSTORE 
@ INTO 'SpeedTests';<\/pre>\n<h3 style=\"text-align: left;\">TO DO<\/h3>\n<p style=\"text-align: left;\">Define a data flow using Pig\u00a0that\u00a0answers each of\u00a0these queries:<\/p>\n<ol>\n<li style=\"text-align: left;\">Filter the speedtests conducted in Barcelona or Madrid. Then list the internet providers operating in those cities.<\/li>\n<li style=\"text-align: left;\">List the names and the IP ranges of the internet providers located in Barcelona. For this you need to use the IPtoNumber user-defined function (cf. NeubotTestsUDFs.jar).<\/li>\n<li style=\"text-align: left;\">Group the speedtests by the user&#8217;s network infrastructure (e.g., 3G\/4G vs ADSL). For this you can assume some maximum bandwidth (e.g., 21 Mb\/s for ADSL).<\/li>\n<li style=\"text-align: left;\">Find the user who performed the largest number of tests. For this user, produce a table showing the evolution of her\/his download\/upload speeds.<\/li>\n<\/ol>\n<h3 style=\"text-align: left;\">Resources<\/h3>\n<ul>\n<li><a href=\"http:\/\/pig.apache.org\/docs\/r0.15.0\/basic.html\">Pig\u00a0documentation<\/a><\/li>\n<li><a href=\"http:\/\/vargas-solar.com\/bigdata-visualisation\/answers\/\">Answers<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; Objective The objective of this exercise is to (i) show the use of Apache Pig Latin, a dataflow language for analyzing large data collections; (ii)&nbsp;&nbsp;apply visualization techniques to&nbsp;have&nbsp;an aggregated and comprehensive view of the data collection&nbsp;content.&nbsp;For this purpose, you will work with a sample of the Neubot Data Collection, a data collection comprising Neubot 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":17,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-80","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/vargas-solar.com\/bigdata-visualisation\/wp-json\/wp\/v2\/pages\/80","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/vargas-solar.com\/bigdata-visualisation\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/vargas-solar.com\/bigdata-visualisation\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/bigdata-visualisation\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/bigdata-visualisation\/wp-json\/wp\/v2\/comments?post=80"}],"version-history":[{"count":4,"href":"http:\/\/vargas-solar.com\/bigdata-visualisation\/wp-json\/wp\/v2\/pages\/80\/revisions"}],"predecessor-version":[{"id":89,"href":"http:\/\/vargas-solar.com\/bigdata-visualisation\/wp-json\/wp\/v2\/pages\/80\/revisions\/89"}],"up":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/bigdata-visualisation\/wp-json\/wp\/v2\/pages\/17"}],"wp:attachment":[{"href":"http:\/\/vargas-solar.com\/bigdata-visualisation\/wp-json\/wp\/v2\/media?parent=80"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}