Date | Content | Resources |
---|---|---|
D-1 | By mail: Welcome, Instructions for practicals | Important: Well-being, D&I and Evaluation |
8 Apr | Introduction to Big Data (slides) Topics: 1. Datafication, the 5Vs model of Big Data, history of platforms 2. Big Data enabling architectures – Evolutionary overview – Cloud computing – As-a-Service model (IaaS, PaaS, SaaS) – Pay-as-you-go economic model – Global regions & zones Labs: a. How to: Create & configure an EC2 virtual machine – AWS Pricing Calculator (to check your VM's monthly cost) b. Case study: Urban Computing (desk exercise) | Videos: What is Big Data? What is Cloud Computing? |
9 Apr | Distributed Storage (slides1, slides2, slides3) Topics: – Preamble: storage and management requirements (hot vs cold data) 1. From distributed file systems to cluster-based stores: NoSQL systems 2. Data management guarantees: the CAP model 3. Polystores: polyglot persistence solutions Labs: a. Case study: Amazon S3 (object stores) – How to: Store and Retrieve a File with Amazon S3 – Lab: Playing with Amazon S3 b. Homework: Case study NoSQL (desk exercise) – Mynet Polystore | Videos: – Understanding Object Storage, Buckets, and S3 – Platform Overview – Data & Storage |
10 Apr | Visit to the CC-IN2P3 Computing Centre (see location on map) Note: – ID required! – Backpacks are forbidden (you can securely store them at CPE) Read before the visit: – Ch 4. Data Center Basics: Building, Power, and Cooling (annotated pdf), a chapter of the book The Datacenter as a Computer (see resources) – Our House Is On Fire: The Climate Emergency and Computing’s Responsibility | CC-IN2P3 virtual tours: – Computing Center – Museum |
15 Apr | Zooming in on Distributed File Systems & Big Data Processing Platforms (slides) Topics: 1. Architecture and general principles 2. Fault tolerance Labs: a. Lab: Creating a Hadoop Cluster using Google Cloud – Git repository: gitlab.com/oaidel/cpe | Readings: – HDFS Architecture (annotated) |
16 Apr | Processing Big Data: control flow vs data flow solutions (1/2) (slides) Topics: 1. Programming model: map-reduce 2. Integrating map-reduce into control and data flow solutions 3. Control flow data processing – Program definition & execution – Control flow execution environments Labs: a. Case Study: Hadoop Ecosystem – Create & configure a Cloud9 XL environment – Lab: Sharding Data Collections with MongoDB – Question to think about: What sharding strategy would you propose for graphs? | Readings: – MapReduce: Simplified Data Processing on Large Clusters Videos: What is Hadoop? |
17 Apr | Processing Big Data: control flow vs data flow solutions (2/2) (slides) Topics: 4. Data flow data processing – Program definition & execution – Processing & data management operators – General principle of program execution: execution DAG, lazy evaluation – Data flow execution environments Data Engineering Wrap-Up (slides) Labs: a. Case Study: Parallel processing with Hadoop and Spark (here), a very long desk and practical exercise! | Readings: – Apache Spark: A Unified Engine for Big Data Processing (annotated) |
18 Apr | Study time | |