CONTENT | Big Data Engineering

Date	Content	Resources
D-1	By mail: Welcome Instructions for practicals	Important: Well-being, D&I and Evaluation
8 Apr	Introduction to Big Data [slides] Topics: 1. Datafication, the 5Vs model of Big Data, platforms history 2. Big Data enabling architectures — Evolution overview — Cloud Computing — As-a-Service model (IaaS, PaaS, SaaS) — Pay-as-you-go economic model — Global regions & zones Labs: a. How to: Create & configure an EC2 virtual machine on AWS — Pricing Calculator (to check your VM's monthly cost) b. Case study: Urban Computing (desk exercise)	Videos: What is Big Data? What is Cloud Computing?
9 Apr	Distributed Storage (slides1, slides2, slides3) Topics: — Preamble: storage and management requirements: hot vs cold data 1. From distributed file systems to cluster-based stores: NoSQL systems 2. Data management guarantees: CAP model 3. Polystores: polyglot persistence solutions Labs: a. Case study: Amazon S3 (object stores) — How to: Store and Retrieve a File with Amazon S3 — Lab: Playing with Amazon S3 b. HOMEWORK: Case study NoSQL (desk exercise) — Mynet Polystore	Videos: – Understanding Object Storage, Buckets, and S3 – Platform Overview – Data & Storage
10 Apr	Visit CC-IN2P3 Computing Centre See location on map Note: – ID required! – Backpacks are forbidden (you can securely store them at CPE) Read before the visit: – Ch 4. Data Center Basics: Building, Power, and Cooling (annotated pdf) — Book chapter: The Datacenter as a Computer (see resources) – Our House Is On Fire: The Climate Emergency and Computing’s Responsibility	CC-IN2P3 virtual tours: – Computing Center – Museum
15 Apr	Zooming in on Distributed File Systems & Big Data Processing Platform (slides) Topics: 1. Architecture and general principle 2. Fault tolerance Labs: a. Lab: Creating a Hadoop Cluster using Google Cloud — Git repository: gitlab.com/oaidel/cpe	Readings: – HDFS Architecture (annotated)
16 Apr	Processing Big Data: control flow vs data flow solutions (1/2) [slides] Topics: 1. Programming model: map-reduce 2. Integrating map-reduce into control and data flow solutions 3. Control flow data processing — Program definition & execution — Control flow execution environments Labs: a. Case Study: Hadoop Ecosystem Create & configure Cloud9 XL environment Lab: Sharding Data Collections with MongoDB Question to think about: How would you design a sharding strategy for graphs?	Readings: – MapReduce: Simplified Data Processing on Large Clusters Videos: What is Hadoop?
17 Apr	Processing Big Data: control flow vs data flow solutions (2/2) (slides) Topics: 4. Data flow data processing — Program definition & execution — Processing & data management operators — Program execution general principle: * Execution DAG * Lazy evaluation — Data flow execution environments Data Engineering Wrap Up (slides) Labs: a. Case Study: Parallel processing with Hadoop and Spark (here) – *very long desk and practical exercise!*	Readings: – Apache Spark: A Unified Engine for Big Data Processing (annotated)
18 Apr	Study time