Spark programming in baby steps

Objective

Gain insight into the principles of programming under a dataflow-based parallel model using Spark.

Material

Baby steps Lab: github.com/gevargas/bigdata-management/blob/master/Intro_Spark.ipynb

Spark in Python documentation: spark.apache.org/docs/latest/api/python/getting_started/index.html

To Do

Get together in groups and choose to answer the questions of Group 1 or Group 2. Depending on the exercise, propose pseudo-Python-like code for the data processing example, or draw the execution plan of the given code.

Exercises

Group 1

  1. Intent: Given a text file, extract the lines containing the term “SPARK”
  2. Intent: Given a set of integers (e.g., {4, 16, 25, 36}), compute their square roots
  3. Consider the following piece of code in Scala. Draw the initial RDD, the application of the transformation operation and the new RDD. Then draw its corresponding DAG.
var linesRDD = sc.parallelize( Array("this is a dog", "named jerry"))
def toWords(line:String):Array[String]= line.split(" ")
var wordsRDD = linesRDD.flatMap(toWords)
wordsRDD.collect()
  4. Consider the following Scala code:
val arr = 1 to 10000
val nums = sc.parallelize(arr)
def multiplyByTwo(x:Int) = Array(x*2)

If you call multiplyByTwo(5), what is the output?

  5. The transformation flatMap can have different interpretations. How does it work in the following Scala code?

  6. In the following Scala code (take brings only a few elements to the driver), what is the output?
var a = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7));
var localarray = a.take(4);
localarray

Draw the RDD and the process that takes place when take is applied.

  7. Consider the following Scala code:
var a = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7), 3);
var mycount = a.count();
mycount

What does the variable mycount contain? The second argument of sc.parallelize() (here 3) sets the number of partitions into which the array is split; assume that each partition is processed by a different node, so that 3 nodes participate in the execution of the program. Draw the way the array values are fragmented across the 3 nodes.
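The idea behind counting a partitioned collection can be sketched locally in plain Python, without Spark. The helper `slice_evenly`, the data, and the number of partitions below are hypothetical and deliberately differ from the exercise. The boundary rule is an assumption modelled on how the Scala `parallelize` slices a local collection into contiguous chunks; the actual placement of elements may vary with the Spark version and API.

```python
def slice_evenly(values, num_slices):
    """Split a list into num_slices contiguous chunks of near-equal size.

    Assumption: chunk i covers indices [i*n // num_slices, (i+1)*n // num_slices).
    """
    n = len(values)
    return [values[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


# Hypothetical data: 10 elements spread over 4 partitions.
data = list(range(10, 20))
partitions = slice_evenly(data, 4)

# Each partition is counted independently (in Spark, by the task that owns it);
# the partial counts are then added together on the driver.
partial_counts = [len(p) for p in partitions]
total = sum(partial_counts)

print(partitions)      # → [[10, 11], [12, 13, 14], [15, 16], [17, 18, 19]]
print(partial_counts)  # → [2, 3, 2, 3]
print(total)           # → 10
```

The same two-step pattern (work per partition, then combine on the driver) is what the drawing for this question should make visible.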

  8. If you did not quite get how reduce() works, consider the following example:
  9. Inspired by the way we computed the summation using reduce(), can we compute the average using reduce()?
var seq = sc.parallelize(Array(3.0, 7, 13, 16, 19))
def avg(x: Double, y:Double):Double = {return (x+y)/2}
var total = seq.reduce(avg);

Output: Double = _____________

Draw the process as in question (7) to show how the previous code works. Is the code right or wrong? Give a mathematical intuition by comparing the sum and avg code, and draw conclusions.
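As a hint for the comparison, Spark's reduce() expects a binary operator that is commutative and associative, because each partition is reduced on its own and the partial results are combined afterwards. The plain-Python sketch below simulates that two-step process without Spark. The helper `partitioned_reduce`, the data, and both partition layouts are hypothetical, and subtraction is used only as a stand-in for an operator that lacks these properties.

```python
from functools import reduce


def partitioned_reduce(partitions, op):
    """Reduce each non-empty partition locally, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


def add(x, y):
    return x + y


def subtract(x, y):
    return x - y


# The same four values, split across partitions in two hypothetical ways.
layout_a = [[10, 4], [3, 2]]
layout_b = [[10], [4, 3, 2]]

# Addition is associative and commutative: the layout does not matter.
print(partitioned_reduce(layout_a, add), partitioned_reduce(layout_b, add))            # → 19 19

# Subtraction is neither: the result depends on how the data was partitioned.
print(partitioned_reduce(layout_a, subtract), partitioned_reduce(layout_b, subtract))  # → 5 11
```

Running the same check with the avg function from the exercise, on a few different layouts, is one way to support the drawing with evidence.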

Group 2

  1. Intent: Given a set of text files, keep those that contain the term “Dataflow” at least once
  2. Intent: Given a set of integers (e.g., {4, 16, 25, 36}), compute their square roots
  3. Consider the following piece of code in Scala. Draw the initial RDD, the application of the transformation operation and the new RDD. Then draw its corresponding DAG.
var linesRDD = sc.parallelize( Array("this is a dog", "named jerry"))
def toWords(line:String):Array[String]= line.split(" ")
var wordsRDD1 = linesRDD.map(toWords)
wordsRDD1.collect()

Output RDD: __________________

  4. Consider the following Scala code:
val arr = 1 to 10000
val nums = sc.parallelize(arr)
def multiplyByTwo(x:Int) = Array(x*2)

If you then run the following (the first five elements of nums are [1, 2, 3, 4, 5]):

var dbls = nums.flatMap(multiplyByTwo);
dbls.take(5)

What is the output?

  5. The transformation flatMap can have different interpretations. How does it work in the following Scala code?

Deduce and illustrate your guess about the possible uses of the transformation flatMap.

Since flatMap can transform an RDD of length N into an RDD of length M, it is mainly useful in situations where the output RDD should not have the same length as the input RDD.
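The change in length from N to M can be illustrated locally in plain Python, without Spark. The helpers `local_map` and `local_flat_map` and the sample data are hypothetical; they only mimic the semantics of the map and flatMap transformations on an ordinary list.

```python
from itertools import chain


def local_map(func, items):
    """One output element per input element (length is preserved)."""
    return [func(x) for x in items]


def local_flat_map(func, items):
    """func returns a sequence per element; the sequences are concatenated."""
    return list(chain.from_iterable(func(x) for x in items))


def to_words(sentence):
    return sentence.split()


def keep_if_even(x):
    # Returning an empty list drops the element; a one-element list keeps it.
    return [x] if x % 2 == 0 else []


sentences = ["big data", "", "dataflow model rocks"]  # N = 3

print(local_map(to_words, sentences))       # → [['big', 'data'], [], ['dataflow', 'model', 'rocks']]
print(local_flat_map(to_words, sentences))  # → ['big', 'data', 'dataflow', 'model', 'rocks']  (M = 5 > N)

numbers = [1, 2, 3, 4]                      # N = 4
print(local_flat_map(keep_if_even, numbers))  # → [2, 4]  (M = 2 < N)
```

The two calls to `local_flat_map` show both directions: the output can be longer than the input (splitting into words) or shorter (dropping elements by returning an empty sequence).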

  6. In the following Scala code (union does not remove duplicates), what is the output?
var a = sc.parallelize(Array('1','2','3'));
var b = sc.parallelize(Array('A','B','C'));
var c=a.union(b)
c.collect();

Draw the input and output RDDs, and then the DAG.

  7. Consider the following Scala code. What is the result? How can you assess that the proposed result is correct? (Hint: give the formula)

  8. If you didn’t quite get how reduce() works, consider the following example:
  9. Inspired by the way we computed the summation using reduce(), can we compute the average using reduce()?
var seq = sc.parallelize(Array(3.0, 7, 13, 16, 19))
def avg(x: Double, y:Double):Double = {return (x+y)/2}
var total = seq.reduce(avg);

Output: Double = __________________

Draw the process as in question (7) to show the way the previous code works. 
Is the code right or wrong?