Objective
The objective of this exercise is to (i) show the use of Apache Pig Latin, a dataflow language for analyzing large data collections; and (ii) apply visualization techniques to obtain an aggregated and comprehensive view of the content of a data collection. For this purpose, you will work with a sample of the Neubot Data Collection, a data set comprising Neubot measurements (e.g., download/upload speed tests) performed by different users, in different places, and with different internet providers.
The exercise will be performed in two parts to be done in class and at home.
- In class you will gain practical experience using Pig, so that you become familiar with the kinds of operations and the approach to adopt for exploring the content of data collections.
- At home you will use visualization tools to observe and understand the results you obtained, adjusting your queries to produce an understandable view of the data collections.
- You will send a report explaining which tool you chose, the visualization pattern(s) you used, and the one that seemed best suited in each case. If you adjusted a query, describe and explain the process through which you obtained a final result that seems reasonable.
- Send your report no later than the morning of Tuesday, December 6th.
Requirements
- Apache Pig 0.14.0 (or greater)
- Java 7 (or greater)
- Neubot Data Collection
- NeubotTestsUDFs.jar (set of user defined functions required for this exercise)
Neubot Data Collection
The Neubot Data Collection is a data set containing network tests (e.g., upload/download speed over HTTP or BitTorrent). Each test is composed of the following information:
| Name | Description |
| --- | --- |
| client_address | User IP address (IPv4 or IPv6). |
| client_country | Country where the test was conducted. |
| client_provider | Name of the user's internet provider. |
| connect_time | Number of seconds elapsed between the reception of the first and the last packet (round-trip time). |
| download_speed | Download speed (bytes/sec). |
| neubot_version | Neubot version used for this test. |
| platform | User operating system. |
| remote_address | IP address (IPv4 or IPv6) of the server used for this test. |
| test_name | Test type (e.g., speedtest, bittorrent, dash). |
| timestamp | Time at which the test was performed, measured as the number of seconds elapsed since 1/01/1970 (cf. UNIX timestamp). |
| upload_speed | Upload speed (bytes/sec). |
| latency | Delay between the sending and the reception of a control packet. |
| uuid | User ID (generated automatically by Neubot during installation). |
| asnum | Internet provider's ID. |
| region | Country region in which the test was performed (if known). |
| city | Name of the city. |
| hour | Hour of the test (derived from timestamp). |
| month | Month of the test (derived from timestamp). |
| year | Year of the test (derived from timestamp). |
Running Apache Pig
Apache Pig is a data flow language (Pig Latin), an interpreter and a compiler that produces sequences of Map-Reduce programs for analyzing large data sets in parallel infrastructures (e.g., Hadoop). Some of the benefits of using Pig Latin are:
- Ease of programming. It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
Although Pig Latin programs are intended to be executed on a cluster, you can run some tests on your local machine. To do so, open a terminal and type the following instructions to start GRUNT (Pig's interactive interpreter) in local mode:
```shell
# Move to the folder containing the exercise material
cd ~/hands-on/pig

# Execute Pig in local mode
pig -x local
```
Example of a Pig Program
The following program illustrates how to use Apache Pig for processing the Neubot data collection. In particular, it illustrates how to:
- Define a schema that describes the structure of your data.
- Filter the data based on some criteria (i.e., keep only speedtests).
- Project a subset of the attributes of the data collection (e.g., keep just the names of the cities where the tests were conducted).
- Display results on the screen.
- Store results on the filesystem.
You can run this program by copy/pasting it in GRUNT.
*** Note: Modify the PATHs to the NeubotTests data collection and NeubotTestsUDFs.jar if necessary.
```
REGISTER NeubotTestsUDFs.jar;

DEFINE IPtoNumber convert.IpToNumber();
DEFINE NumberToIP convert.NumberToIp();

NeubotTests = LOAD 'NeubotTests' USING PigStorage(',') AS (
    client_address: chararray,
    client_country: chararray,
    lon: float,
    lat: float,
    client_provider: chararray,
    mlabservername: chararray,
    connect_time: float,
    download_speed: float,
    neubot_version: float,
    platform: chararray,
    remote_address: chararray,
    test_name: chararray,
    timestamp: long,
    upload_speed: float,
    latency: float,
    uuid: chararray,
    asnum: chararray,
    region: chararray,
    city: chararray,
    hour: int,
    month: int,
    year: int,
    weekday: int,
    day: int,
    filedate: chararray
);

--
-- Keep only the 'speedtests'
-- @ means "use the previous result"
--
Tests = FILTER @ BY (test_name matches '.*speedtest.*');

--
-- Cities where the tests were conducted
--
Cities = FOREACH @ GENERATE city;
Cities = DISTINCT @;
Cities = ORDER @ BY city;

--
-- Display the results contained in 'Cities'
--
DUMP @;

--
-- Store the results in the folder 'SpeedTests'
--
STORE @ INTO 'SpeedTests';
```
TO DO
Define a data flow using Pig that answers each of these queries:
- Filter the speedtests conducted in Barcelona or Madrid. Then list the internet providers operating in those cities.
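  A possible starting point for this query, assuming the NeubotTests relation has been loaded with the schema of the example program (relation names are illustrative):

  ```
  -- Speedtests conducted in Barcelona or Madrid
  BcnMad = FILTER NeubotTests BY (test_name matches '.*speedtest.*')
                              AND (city == 'Barcelona' OR city == 'Madrid');

  -- Distinct internet providers per city
  Providers = FOREACH BcnMad GENERATE city, client_provider;
  Providers = DISTINCT Providers;
  DUMP Providers;
  ```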
- List the names and the IP ranges of the internet providers located in Barcelona. For this you need to use the IPtoNumber user defined function (cf. NeubotTestsUDFs.jar).
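  One way to approach this is to convert each client address to a number and take the minimum and maximum per provider. This sketch assumes IPtoNumber has been DEFINEd as in the example program and returns a numeric value (check its exact signature in NeubotTestsUDFs.jar):

  ```
  Bcn   = FILTER NeubotTests BY city == 'Barcelona';
  Nums  = FOREACH Bcn GENERATE client_provider, IPtoNumber(client_address) AS ipnum;
  ByPrv = GROUP Nums BY client_provider;

  -- Approximate each provider's IP range as [min, max] of observed addresses
  Ranges = FOREACH ByPrv GENERATE group AS provider,
                                  MIN(Nums.ipnum) AS range_start,
                                  MAX(Nums.ipnum) AS range_end;
  DUMP Ranges;
  ```

  The NumberToIP function can be applied to the bounds if you want to display them back in dotted notation.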
- Group the speedtest based on the user network infrastructure (e.g., 3G/4G vs ADSL). For this you can assume some max bandwidth (e.g., 21Mb/sec for ADSL).
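  A possible sketch using Pig's conditional operator (the 21 Mb/s threshold and the labels are assumptions; remember that download_speed is in bytes/sec):

  ```
  Speedtests = FILTER NeubotTests BY (test_name matches '.*speedtest.*');

  -- Label each test by inferred infrastructure: at most 21 Mb/s -> ADSL
  Labeled = FOREACH Speedtests GENERATE *,
            ((download_speed <= 21.0 * 1000000.0 / 8.0) ? 'ADSL' : '3G/4G') AS infrastructure;

  ByInfra = GROUP Labeled BY infrastructure;
  Stats   = FOREACH ByInfra GENERATE group AS infrastructure, COUNT(Labeled) AS nb_tests;
  DUMP Stats;
  ```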
- Find the user who performed the maximum number of tests. For this user, produce a table showing the evolution of her/his download/upload speeds.
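  The last query can be sketched as a group/count followed by a join back to the original data (relation and output names are illustrative):

  ```
  Speedtests = FILTER NeubotTests BY (test_name matches '.*speedtest.*');

  -- Count the tests per user and keep the most active one
  ByUser  = GROUP Speedtests BY uuid;
  Counts  = FOREACH ByUser GENERATE group AS uid, COUNT(Speedtests) AS nb_tests;
  Ordered = ORDER Counts BY nb_tests DESC;
  TopUser = LIMIT Ordered 1;

  -- Evolution of this user's download/upload speeds over time
  Joined    = JOIN Speedtests BY uuid, TopUser BY uid;
  Evolution = FOREACH Joined GENERATE Speedtests::timestamp      AS ts,
                                      Speedtests::download_speed AS dl,
                                      Speedtests::upload_speed   AS ul;
  Evolution = ORDER Evolution BY ts;
  STORE Evolution INTO 'TopUserEvolution';
  ```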