Complete syllabus here: LIS-4112
- Definition and reference architecture [slides] [YouTube]
- Use of storage and distributed computing resources
- Data Labs and first experience with (big) data processing pipelines
- [Environment] [YouTube]
- Getting started with the data science ecosystem [HO-1]
- Exploring data collections using descriptive statistics [HO-2]
- Analysing graphs [HO-3]
- Data distribution: distributed file systems (e.g., HDFS), NoSQL/NewSQL
- Parallel programming models (MapReduce) and execution environments (e.g., Hadoop, Spark) [YouTube]
- Hadoop: Exercise-1
- Spark: Exercise-2
- Data Labs and first experience with (big) data processing pipelines
- Virtualization techniques: hypervisors vs. containers, distributed resource brokers
- DevOps Introduction: Virtual machines for running Spark programs on MS Azure [slides] [Exercise] [YouTube]
- Componentization: playing with Docker [slides] [Exercise-1] [Exercise-2] [YouTube] [YouTube]
- Cloud virtual machines for data science (big data analytics) [slides] [YouTube]