Objective
The general objective of this exercise is to take the first steps in using a Hadoop environment for executing map-reduce programs written in Python. This first exercise will show how to install a single-node Hadoop setup on Google Colab, and how to implement and run a map-reduce program.
Material
- Google Colab account
- https://github.com/gevargas/bigdata-management/blob/master/Intro_Hadoop.ipynb
- Lab: MapReduce in Python using mrjob
Description
The main steps of the exercise are simple. At first, the exercise does not run on a cluster but on the single CPU allocated by default by Google Colab. This helps to concentrate on the way the map and reduce functions are specified, and on how a program is designed following the map-reduce model.
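The map-reduce model described above can be illustrated with a short, self-contained sketch. The code below is not the lab's mrjob program; it is a plain-Python simulation of the three phases (map, shuffle/group-by-key, reduce) for a "count words" job, with all names chosen for illustration:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum the partial counts received for one word
    yield word, sum(counts)

def run_mapreduce(lines):
    # Shuffle phase: group the mapper's output by key before reducing
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    result = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            result[out_key] = out_value
    return result
```

On a real Hadoop cluster the shuffle phase is performed by the framework across machines; here it is simulated in memory so the structure of the two user-written functions stands out.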
To Do and To Hand In
- Propose a UML component diagram of the Hadoop environment installed on Colab.
- Propose a UML component diagram of the two map-reduce "count words" programs tested in the lab.
- Explain how the first example, which implements a grep operation with a regular expression, is executed.
- Explain how the "count words" program is executed in the example.
- What is the role of Google Drive in these examples?
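As a hint for the grep question: in the map-reduce model, grep is a map-only job, since filtering lines against a regular expression needs no aggregation. The sketch below is an illustrative plain-Python simulation, not the lab's actual code; the pattern and helper names are assumptions:

```python
import re

def grep_mapper(pattern, line):
    # Map phase: emit the line only if it matches the regular expression.
    # No reduce phase is needed: the matching lines are the final output.
    if pattern.search(line):
        yield None, line

def run_grep(lines, pattern_text):
    # Simulate the map-only job over an in-memory list of input lines
    pattern = re.compile(pattern_text)
    return [value for line in lines for _, value in grep_mapper(pattern, line)]
```

Comparing this map-only structure with the map-and-reduce structure of "count words" is a good way to answer the two "explain how it is executed" questions above.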
Follow the steps in the Lab: MapReduce in Python using mrjob to execute a map-reduce job on a Hadoop cluster deployed on Google Cloud.
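A useful property of mrjob in this step is that the same program can be run locally or on the cluster just by changing the runner selected with the `-r` option. The commands below are a sketch; the script and input file names are hypothetical, and the Hadoop runner assumes mrjob has been configured for the cluster as in the lab:

```shell
# Hypothetical file names; the lab notebook defines its own.
# Run the job locally in a single Python process (default inline runner):
python word_count.py input.txt

# Run the same, unchanged job on the Hadoop cluster:
python word_count.py -r hadoop hdfs:///user/student/input.txt
```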
- Propose a UML component diagram of the Hadoop environment configured and tested on Google Cloud.
- Propose a UML component diagram of the two map-reduce programs tested in this version of the lab.
- What is the role of the cloud in this version of the lab?