{"id":222,"date":"2024-04-16T16:25:47","date_gmt":"2024-04-16T16:25:47","guid":{"rendered":"http:\/\/vargas-solar.com\/bigdata-engineering\/?page_id=222"},"modified":"2024-04-16T18:54:38","modified_gmt":"2024-04-16T18:54:38","slug":"spark-programming-in-baby-steps","status":"publish","type":"page","link":"http:\/\/vargas-solar.com\/bigdata-engineering\/spark-programming-in-baby-steps\/","title":{"rendered":"Spark programming in baby steps"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><strong>Objective<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Acquire insight about the principles of programming under a dataflow based parallel model using SPARK.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Material<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Baby steps Lab: <a href=\"https:\/\/github.com\/gevargas\/bigdata-management\/blob\/master\/Intro_Spark.ipynb\">github.com\/gevargas\/bigdata-management\/blob\/master\/Intro_Spark.ipynb<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Spark in Python documentation <a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/getting_started\/index.html\">spark.apache.org\/docs\/latest\/api\/python\/getting_started\/index.html<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>To Do<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Get together into groups, choose to answer the questions of group 1 ou 2.  Propose according to the case a pseudo-Python-like code for the programming data processing examples, or draw the execution plan of the corresponding code.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Exercises<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><strong><em>Group <\/em>1<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Intent: Given a text file extract the lines containing the term \u201cSPARK\u201d<\/li>\n<\/ol>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li>Intent: Given a set of integers (e.g., {4,16, 25, 36}) , compute their square root<\/li>\n<\/ol>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\">\n<li>Consider the following piece of code in Scala. Draw the initial RDD, the application of the transformation operation and the new RDD. Then draw its corresponding DAG.<\/li>\n<\/ol>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<pre class=\"wp-block-code\"><code>var linesRDD = sc.parallelize( Array(\"this is a dog\", \"named jerry\"))\ndef toWords(line:String):Array&#91;String]= line.split(\" \")\nvar wordsRDD = linesRDD.flatMap(toWords)\nwordsRDD.collect()<\/code><\/pre>\n<\/div><\/div>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<ol class=\"wp-block-list\" start=\"4\">\n<li>Consider the following Scala code<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>val arr = 1 to 10000\nval nums = sc.parallelize(arr)\ndef multiplyByTwo(x:Int) = Array(x*2)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If you call:&nbsp; <strong>multiplyByTwo(5)<\/strong> which is the output? <\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"5\">\n<li>The transformation flatMap can have different interpretations. How does it work in the following Scala code?<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><img loading=\"lazy\" decoding=\"async\" width=\"624\" height=\"184\" src=\"https:\/\/lh7-us.googleusercontent.com\/zG7m4lmCkAGP3IAWTe6phVrJPv_8Z7JP1idfUAe9s-KlmyYlw5QqJvq-7ZQzgWJmBBfaYpLQLNZUSA6NGgUtvPTp02qE0-6xmRSkXBR2RXhdsv7eFh0TmtOA0UqkvkHGqw-34qPhH5vkQAGXK7TuFQ\"><\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"6\">\n<li>In the following Scala code (take brings only few elements to the driver) what is the output?<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>var a = sc.parallelize(Array(1,2,3, 4, 5 , 6, 7));&nbsp;\nvar localarray = a.take(4);\nlocalarray<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Draw the RDD and process when applying take.<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"7\">\n<li>Consider the following Scala code<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>var a = sc.parallelize(Array(1,2,3, 4, 5 , 6, 7), 3);&nbsp;\nvar mycount = a.count();\n\u27a2 mycount<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">What does the variable <strong>mycount<\/strong> contain? Considering that the parameter value 3 in the <strong>sc. parallelize() <\/strong>call refers to the number of nodes participating in the execution of the program. Draw the way the array values are fragmented across the 3 nodes.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"8\">\n<li>If you did not quite get how reduce() works, consider the following example:<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/4O4qjKyuo4gnyekIkxuRUi06rvBAdD4y8zF848SqWAde_8cTlfHttM3ByBPltKepRqR6gVTLA-lUp16U_QEdTvBte8A1X_5oWnMJE_Dkx8wj3uhK7jAnjsveYbIpiS0j118ZHsZkXx-p-7PcuZJYuw\" alt=\"\"\/><\/figure>\n\n\n\n<ol class=\"wp-block-list\" start=\"9\">\n<li>Inspired in the way we computed summation using reduce, can we compute the average using reduce()?<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>var seq = sc.parallelize(Array(3.0, 7, 13, 16, 19))\ndef avg(x: Double, y:Double):Double = {return (x+y)\/2}\nvar total = seq.reduce(avg);<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Ouput: Double = _____________<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Draw the process as in question (7) to show the way the previous code works. Is the code right or wrong? Give a Mathematical intuition comparing the code sum and avg and draw conclusions.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong><em>Group 2<\/em><\/strong><\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Intent: Given a set of textual files filter those that contain the term \u201cDataflow\u201d at least once<\/li>\n\n\n\n<li>Intent: Given a set of integers (e.g., {4,16, 25, 36}) , compute their square root<\/li>\n\n\n\n<li>Consider the following piece of code in Scala. Draw the initial RDD, the application of the transformation operation and the new RDD. Then draw its corresponding DAG.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>var linesRDD = sc.parallelize( Array(\"this is a dog\", \"named jerry\"))\ndef toWords(line:String):Array&#91;String]= line.split(\" \")\nvar wordsRDD1 = linesRDD.map(toWords)\nwordsRDD1.collect()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Output RDD : <\/strong>__________________<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"4\">\n<li>Consider the following Scala code<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>val arr = 1 to 10000\nval nums = sc.parallelize(arr)\ndef multiplyByTwo(x:Int) = Array(x*2)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If you call for an array [1,2,3,4,5]&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>var dbls = nums.flatMap(multiplyByTwo);\ndbls.take(5)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Which is the output?&nbsp;<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"5\">\n<li>The transformation flatMap can have different interpretations. How does it work in the following Scala code?<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/zG7m4lmCkAGP3IAWTe6phVrJPv_8Z7JP1idfUAe9s-KlmyYlw5QqJvq-7ZQzgWJmBBfaYpLQLNZUSA6NGgUtvPTp02qE0-6xmRSkXBR2RXhdsv7eFh0TmtOA0UqkvkHGqw-34qPhH5vkQAGXK7TuFQ\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Deduce and illustrate your guess about the possible uses of the transformation flatMap.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since flatMap is useful for transforming a length N RDD into a length M RDD, its main uses come from situations when we don\u2019t want to return an RDD of the same length.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"6\">\n<li>In the following Scala code (union does not remove duplicates) what is the output?<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>var a = sc.parallelize(Array('1','2','3'));\nvar b = sc.parallelize(Array('A','B','C'));\nvar c=a.union(b)\nc.collect();<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Draw the input and out RDD and then the DAG.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"7\">\n<li>Consider the following Scala code. What is the result? How can you assess that the proposed result is correct? (Hint: give the formula)<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/rSi3bDTEcQ-3YoQJuxulPT-2NWzNZCU8lSuwKgF1cCe6I28RBACTxaAh3Y5oOTgBeaQKD6zFNS_L10jk-DYaWipjlxofQuXHxBTfpt4zvGMD0VmBQUdjdPP47hw_i_tATRw4t2WfdbRj5OPOitejCQ\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"8\">\n<li>If you didn&#8217;t quite get how reduce() works, consider the following example:<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/4O4qjKyuo4gnyekIkxuRUi06rvBAdD4y8zF848SqWAde_8cTlfHttM3ByBPltKepRqR6gVTLA-lUp16U_QEdTvBte8A1X_5oWnMJE_Dkx8wj3uhK7jAnjsveYbIpiS0j118ZHsZkXx-p-7PcuZJYuw\" alt=\"\"\/><\/figure>\n\n\n\n<ol class=\"wp-block-list\" start=\"9\">\n<li>Inspired in the way we computed summation using reduce, can we compute the average using reduce()?<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>var seq = sc.parallelize(Array(3.0, 7, 13, 16, 19))\ndef avg(x: Double, y:Double):Double = {return (x+y)\/2}\nvar total = seq.reduce(avg);<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Ouput: Double = __________________<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Draw the process as in question (7) to show the way the previous code works.&nbsp;<br>Is the code right or wrong?&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Objective Acquire insight about the principles of programming under a dataflow based parallel model using SPARK. Material Baby steps Lab: github.com\/gevargas\/bigdata-management\/blob\/master\/Intro_Spark.ipynb Spark in Python documentation spark.apache.org\/docs\/latest\/api\/python\/getting_started\/index.html To Do Get together into groups, choose to answer the questions of group 1 ou 2. Propose according to the case a pseudo-Python-like code for the programming data processing [&hellip;]<\/p>\n","protected":false},"author":11,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-222","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/vargas-solar.com\/bigdata-engineering\/wp-json\/wp\/v2\/pages\/222","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/vargas-solar.com\/bigdata-engineering\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/vargas-solar.com\/bigdata-engineering\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/bigdata-engineering\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/bigdata-engineering\/wp-json\/wp\/v2\/comments?post=222"}],"version-history":[{"count":9,"href":"http:\/\/vargas-solar.com\/bigdata-engineering\/wp-json\/wp\/v2\/pages\/222\/revisions"}],"predecessor-version":[{"id":238,"href":"http:\/\/vargas-solar.com\/bigdata-engineering\/wp-json\/wp\/v2\/pages\/222\/revisions\/238"}],"wp:attachment":[{"href":"http:\/\/vargas-solar.com\/bigdata-engineering\/wp-json\/wp\/v2\/media?parent=222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}