CONTENT

Date / Content / Resources
D-1
By mail: Welcome Instructions for practicals
Important: Well-being, D&I and Evaluation
8 Apr
Introduction to Big Data [slides]
Topics:
1. Datafication, the 5Vs model of Big Data, history of platforms
2. Big Data enabling architectures
— Evolutive overview
— Cloud Computing
— As-a-Service model (IaaS, PaaS, SaaS)
— Pay-as-you-go economic model
— Global regions & zones

Labs:
a. How to: Create & configure an EC2 virtual machine
AWS Pricing Calculator (to check your VM's monthly cost)
b. Case study: Urban Computing (desk exercise)
Videos:
What is Big Data? What is Cloud Computing?
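A minimal sketch of the pay-as-you-go arithmetic behind the Pricing Calculator lab, assuming a flat hourly rate and roughly 730 hours per month. The function name `monthly_cost` and the rate of 0.01 USD/hour are hypothetical and are not actual AWS prices.

```python
HOURS_PER_MONTH = 730  # approximation: 8760 hours per year / 12 months


def monthly_cost(hourly_rate_usd: float, hours_used: float = HOURS_PER_MONTH) -> float:
    """Pay-as-you-go: the bill is proportional to the hours the VM actually runs."""
    return round(hourly_rate_usd * hours_used, 2)


# Hypothetical rate, for illustration only.
always_on = monthly_cost(0.01)               # VM left running all month
lab_only = monthly_cost(0.01, hours_used=20)  # VM started only during labs
print(always_on, lab_only)
```

Stopping the VM outside lab hours is what makes the second figure so much smaller than the first; storage and data transfer are ignored in this sketch.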
9 Apr
Distributed Storage (slides1, slides2, slides3)
Topics:
— Preamble: storage and management requirements: hot vs cold data
1. From distributed file systems to cluster-based stores: NoSQL systems
2. Data management guarantees: CAP model
3. Polystores: polyglot persistence solutions

Labs:
a. Case study: Amazon S3 (object stores)
— How to: Store and Retrieve a File with Amazon S3
— Lab: Playing with Amazon S3 
b. HOMEWORK: Case study NoSQL (desk exercise)
Mynet Polystore
Videos:
– Understanding Object Storage, Buckets, and S3
– Platform Overview – Data & Storage
10 Apr
Visit to the CC-IN2P3 Computing Centre
See location on map

Note:
ID required!
Backpacks are forbidden (you can securely store them at CPE)

Read before the visit:
Ch 4. Data Center Basics: Building, Power, and Cooling (annotated pdf)
– Book chapter: The Datacenter as a Computer (see resources)
Our House Is On Fire: The Climate Emergency and Computing’s Responsibility
CC-IN2P3 virtual tours:
Computing Center
Museum
15 Apr
Zooming in on Distributed File Systems & Big Data Processing Platform (slides)
Topics:
1. Architecture and general principle
2. Fault tolerance

Labs:
a. Lab: Creating a Hadoop Cluster using Google Cloud
– Git repository: gitlab.com/oaidel/cpe
Readings:
– HDFS Architecture (annotated)
16 Apr
Processing Big Data: control flow vs data flow solutions (1/2) [slides]
Topics:
1. Programming model: map-reduce
2. Integrating map-reduce into control and data flow solutions
3. Control flow data processing
— Program definition & execution
— Control flow execution environments
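To make the map-reduce programming model concrete, here is a minimal single-process sketch of word count in plain Python. It only illustrates the map, shuffle, and reduce steps; it is not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big clusters"])
print(counts)
```

On a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; here everything runs sequentially in one process.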

Labs:
a. Case Study: Hadoop Ecosystem
Create & configure Cloud9 XL environment
Lab: Sharding Data Collections with MongoDB
Question to think about: How would you design a sharding strategy for graphs?
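As a starting point for the sharding lab, the sketch below shows the general idea of hash-based sharding: each document key is hashed and assigned to one of a fixed number of shards. This is a generic illustration in plain Python, not the MongoDB API; `shard_for` and `distribute` are hypothetical helper names.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(keys, num_shards):
    """Assign every key to exactly one shard."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


user_ids = [f"user-{i}" for i in range(1000)]
shards = distribute(user_ids, num_shards=4)
print({shard: len(members) for shard, members in shards.items()})
```

Hashing spreads keys evenly but ignores any relationship between them, which is why graphs are harder: two connected vertices may land on different shards, so every traversal of that edge crosses the network.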
Readings:
– MapReduce: Simplified Data Processing on Large Clusters
Videos: What is Hadoop?
17 Apr
Processing Big Data: control flow vs data flow solutions (2/2) (slides)
Topics:
4. Data flow data processing
— Program definition & execution
— Processing & data management operators
— Program execution general principle:
* Execution DAG
* Lazy evaluation
— Data flow execution environments
Data Engineering Wrap Up (slides)
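The sketch below illustrates lazy evaluation in a data flow program: transformations only record steps in an execution plan, and nothing runs until an action asks for the result. It is a toy model in plain Python, not Spark code; the class `LazyPipeline` and its methods are hypothetical, and the plan here is a simple linear chain, the simplest possible case of an execution DAG.

```python
class LazyPipeline:
    """Records transformations; runs them only when collect() is called."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)  # the recorded execution plan

    def map(self, fn):
        # Transformation: extend the plan, do not touch the data.
        return LazyPipeline(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: extend the plan, do not touch the data.
        return LazyPipeline(self.source, self.ops + (("filter", pred),))

    def collect(self):
        # Action: execute the whole plan and materialize the result.
        data = iter(self.source)
        for kind, fn in self.ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def square(x):
    calls.append(x)  # record that the function really ran
    return x * x


pipeline = LazyPipeline(range(5)).map(square).filter(lambda v: v % 2 == 0)
nothing_ran_yet = (calls == [])  # True: no element has been processed so far
result = pipeline.collect()      # the action triggers execution
print(nothing_ran_yet, result)
```

Deferring execution until the action gives the engine a chance to see the full plan first, which is what allows real data flow systems to optimize and schedule it across a cluster.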

Labs:
a. Case Study: Parallel processing with Hadoop and Spark (here) – very long desk and practical exercise!
Readings:
– Apache Spark: A Unified Engine for Big Data Processing (annotated)
18 Apr
Study time