Executing Map-Reduce Programs on Hadoop Environments

Objective

The general objective of this exercise is to take the first steps in using a Hadoop environment to execute map-reduce programs written in Python. This first exercise will show how to install a single-node Hadoop setup on Google Colab and how to implement and run a map-reduce program.

Material

Description

The main steps of the exercise are very simple. At first, the exercise does not run on a cluster but on the single CPU allocated by default by Google Cloud. This helps to concentrate on the way the map and reduce functions are specified and on how a program is designed following the map-reduce model.
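To make the model concrete, the three phases (map, shuffle/sort, reduce) can be sketched in plain Python for the word-count case. This is only an illustration of the model, not the lab's actual mrjob code; the function names and sample input are made up.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: sum all the counts collected for one word."""
    return (word, sum(counts))

def run_word_count(lines):
    # Apply the mapper to every input line.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort: group the pairs by key, as the Hadoop framework
    # does between the map and reduce phases.
    pairs.sort(key=itemgetter(0))
    # Apply the reducer to each group of values sharing the same key.
    return [reducer(word, (c for _, c in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(run_word_count(["the cat sat", "the dog sat"]))
# → [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```

On a real cluster the mapper and reducer run on different machines and the framework performs the shuffle; here everything runs in one process, which is exactly why this first version of the lab can run on a single CPU.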

To Do and To Hand In

  • Propose a UML component diagram of the Hadoop environment installed on Colab.
  • Propose a UML component diagram of the two map-reduce word-count programs tested in the lab.
  • Explain how the first example, implementing a grep operation with a regular expression, is executed.
  • Explain how the "count words" program is executed in the example.
  • What is the role of Google Drive in these examples?
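As a starting point for the grep question above: a grep operation fits the model as a map-only job, in which the mapper emits only the input that matches a regular expression and no reduce phase is needed. A hypothetical plain-Python sketch (the pattern and sample lines are assumptions, not taken from the lab):

```python
import re

# Example pattern in the style of Hadoop's classic grep demo,
# which extracts configuration keys starting with "dfs".
PATTERN = re.compile(r"dfs[a-z.]+")

def grep_mapper(line):
    """Map-only phase: emit every substring matching the pattern."""
    for match in PATTERN.findall(line):
        yield match

lines = ["dfs.replication=3", "mapreduce.job.maps=2", "dfs.blocksize=128m"]
matches = [m for line in lines for m in grep_mapper(line)]
print(matches)
# → ['dfs.replication', 'dfs.blocksize']
```

In the full Hadoop version of this example a reducer can additionally count how often each match occurs, which turns grep into a special case of word count.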

Follow the steps in the Lab: MapReduce in Python using mrjob to execute a map-reduce job on a Hadoop cluster deployed on Google Cloud.
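A useful property of mrjob is that the same script can be run either locally or against the cluster, selected by the `-r` runner option. The file name and HDFS paths below are assumptions for illustration, not the lab's exact values:

```shell
# Test the job logic locally first, with no Hadoop involved:
python wordcount.py input.txt

# Run the same script on the Hadoop cluster, reading from
# and writing to HDFS:
python wordcount.py -r hadoop hdfs:///user/input.txt \
    --output-dir hdfs:///user/output
```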

  • Propose a UML component diagram of the Hadoop environment configured and tested on Google Cloud.
  • Propose a UML component diagram of the two map-reduce programs tested in this version of the lab.
  • What is the role of the cloud in this version of the lab?