Executing Map-Reduce Programs on Hadoop Environments

Objective

The general objective of this exercise is to take the first steps in using a Hadoop environment to execute map-reduce programs written in Python. This first exercise will show how to install a single-node Hadoop setup on Google Colab and how to implement and run a map-reduce program.

Material

Description

The main steps of the exercise are very simple. At first, the exercise does not run on a cluster but on the single CPU allocated by default by Google Cloud. This helps to concentrate on the way the map and reduce functions are specified and on how a program is designed following the map-reduce model.
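To make the model concrete, the three phases (map, shuffle/sort, reduce) can be sketched in plain Python for the word-count case. This is only an illustration of the model, not the lab's actual mrjob code; the function names and sample input are made up.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: sum all the counts collected for one word."""
    return (word, sum(counts))

def run_word_count(lines):
    # Apply the mapper to every input line.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort: group the pairs by key, as the Hadoop framework
    # does between the map and reduce phases.
    pairs.sort(key=itemgetter(0))
    # Apply the reducer to each group of values sharing the same key.
    return [reducer(word, (c for _, c in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(run_word_count(["the cat sat", "the dog sat"]))
# → [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```

On a real cluster the mapper and reducer run on different machines and the framework performs the shuffle; here everything runs in one process, which is exactly why this first version of the lab can run on a single CPU.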

To Do and To Hand In

  • Propose a UML component diagram of the Hadoop environment installed on Colab.
  • Propose a UML component diagram of the two map-reduce word-count programs tested in the lab.
  • Explain how the first example, implementing a grep operation with a regular expression, is executed.
  • Explain how the "count words" program is executed in the example.
  • What is the role of Google Drive in these examples?
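As a starting point for the grep question above: a grep operation fits the model as a map-only job, in which the mapper emits only the input that matches a regular expression and no reduce phase is needed. A hypothetical plain-Python sketch (the pattern and sample lines are assumptions, not taken from the lab):

```python
import re

# Example pattern in the style of Hadoop's classic grep demo,
# which extracts configuration keys starting with "dfs".
PATTERN = re.compile(r"dfs[a-z.]+")

def grep_mapper(line):
    """Map-only phase: emit every substring matching the pattern."""
    for match in PATTERN.findall(line):
        yield match

lines = ["dfs.replication=3", "mapreduce.job.maps=2", "dfs.blocksize=128m"]
matches = [m for line in lines for m in grep_mapper(line)]
print(matches)
# → ['dfs.replication', 'dfs.blocksize']
```

In the full Hadoop version of this example a reducer can additionally count how often each match occurs, which turns grep into a special case of word count.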

Follow the steps in the Lab: MapReduce in Python using mrjob to execute a map-reduce job on a Hadoop cluster deployed on Google Cloud.
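A useful property of mrjob is that the same script can be run either locally or against the cluster, selected by the `-r` runner option. The file name and HDFS paths below are assumptions for illustration, not the lab's exact values:

```shell
# Test the job logic locally first, with no Hadoop involved:
python wordcount.py input.txt

# Run the same script on the Hadoop cluster, reading from
# and writing to HDFS:
python wordcount.py -r hadoop hdfs:///user/input.txt \
    --output-dir hdfs:///user/output
```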

  • Propose a UML component diagram of the Hadoop environment configured and tested on Google Cloud.
  • Propose a UML component diagram of the two map-reduce programs tested in this version of the lab.
  • What is the role of the cloud in this version of the lab?