Instructions for Hands-On Exercises

In the 2024 edition of the ICT-Big Data course, you must perform 5 hands-on exercises (two of which have been partially completed in the Lab sessions).

You can work alone or in groups of 3-4 people. The larger the group, the higher the expected quality of the answers (complete, sound, and with critical thinking applied to your discussion).

The specifications below cover:

What must you or your team produce?
How should you prepare it?
What must you hand in, and when?

This work counts for 40% of the grade of the ICT: Big Data part of the course by Genoveva Vargas-Solar.

1. When?

Deadline: 16th December 2024, 12:00 CET (firm deadline)

2. Work to do

2.1 Exercise 1 Dataset quantitative profile and cleaning: The tabular data structure and its operators

Description of the tasks of the exercise [HO-1Bis]

To hand in: Share a notebook from Kaggle (no PDFs or other types of documents will be accepted), including:

The whole set of instructions (executed)
At the bottom of the exercise notebook:
- Insert three markdown cells with the interpretations of the results, explaining to what extent the results answer the research question.
- A mind map (JPEG/PNG figure) associated with each family of operations that can be applied to a table (discussed in class) and the corresponding Python instructions that implement them. Use a markdown cell to insert your figure.
- A pipeline (JPEG/PNG figure) describing the logic flow implemented in the notebook to answer the research question (which steps implemented in the notebook lead to an answer to the research question?). Use a markdown cell to insert your figure.

N.B. To draw the figures, use a tool like Google Drawings, or draw them by hand and take a photo. Ensure the quality is good enough for a person to read and understand them.
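As a starting point for the mind map of table operations and their Python instructions, here is a minimal, library-free sketch. The toy table, its column names, and the families shown (profiling, cleaning, selection, projection, aggregation) are illustrative assumptions, not necessarily the families discussed in class. In the exercise notebook the same ideas would more likely be expressed with a dataframe library.

```python
from statistics import mean

# Hypothetical toy table: each dict is one row (record), each key one column.
rows = [
    {"city": "Grenoble", "year": 2023, "energy_kwh": 120.0},
    {"city": "Grenoble", "year": 2024, "energy_kwh": None},  # missing value
    {"city": "Lyon", "year": 2023, "energy_kwh": 200.0},
    {"city": "Lyon", "year": 2024, "energy_kwh": 220.0},
]

# Profiling: count the missing values in each column.
missing = {col: sum(r[col] is None for r in rows) for col in rows[0]}

# Cleaning: drop the rows whose measurement is missing.
clean = [r for r in rows if r["energy_kwh"] is not None]

# Selection: keep only the rows that satisfy a condition.
recent = [r for r in clean if r["year"] == 2024]

# Projection: keep only some of the columns.
projected = [{"city": r["city"], "energy_kwh": r["energy_kwh"]} for r in clean]

# Aggregation: group the rows by city and average the measurement.
groups = {}
for r in clean:
    groups.setdefault(r["city"], []).append(r["energy_kwh"])
avg_by_city = {city: mean(values) for city, values in groups.items()}

print(missing)
print(avg_by_city)
```

Each commented step corresponds to one candidate branch of the mind map, with the instruction that implements it attached as a leaf.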

2.2 Exercise 2 Tracking Outliers using Unsupervised Learning (classification)

Description of the tasks of the exercise [GIST]
See the FAQ for questions about the metrics introduced at the beginning of the exercise.
Some additional details that can help can be found here

To hand in: Share a notebook from Colab to genoveva.vargas@gmail.com (no PDFs or other types of documents will be accepted), including:

The whole set of instructions (executed), beginning with a markdown cell stating the research question that guides the experiment.
At the bottom of the exercise notebook, add three markdown cells with:
- The interpretation of the results, explaining to what extent they answer the research question.
- A mind map (JPEG/PNG figure) presenting the principle of clustering and how clusters are defined:
  - What do records/items in the data collection represent?
  - What is their relationship with the notion of vector in an n-dimensional space?
  - How are clusters recognized?
  - How is a clustering result assessed?
  - What role do scores introduced in the exercise play in a clustering result?
  - Why must we perform several iterations until we find a “final result”?
- A pipeline (JPEG/PNG figure) describing the logic flow implemented in the notebook to answer the research question (which steps implemented in the notebook lead to an answer to the research question?). Use a markdown cell to insert your figure.

N.B. To draw the figures, use a tool like Google Drawings, or draw them by hand and take a photo. Ensure the quality is good enough for a person to read and understand them.
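The mind-map questions on vectors, cluster recognition, scores, and iterations can be made concrete with a small sketch. It assumes a k-means-style algorithm and uses inertia (the within-cluster sum of squared distances) as the score. Both are assumptions, since the exercise notebook may rely on another algorithm and on other metrics. The points and initial centroids are made-up toy values.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record (vector) joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no record changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

def inertia(points, labels, centroids):
    """Score: sum of squared distances between each point and its centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

# Hypothetical records seen as vectors in a 2-dimensional space.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
          (9.0, 1.0)]  # the last record sits far from both groups

labels2, centroids2 = kmeans(points, [(1.0, 1.0), (5.0, 5.0)])
labels3, centroids3 = kmeans(points, [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)])
inertia2 = inertia(points, labels2, centroids2)
inertia3 = inertia(points, labels3, centroids3)

# Outlier candidate: the record farthest from its own centroid (k = 2 run).
distances = [math.dist(p, centroids2[l]) for p, l in zip(points, labels2)]
outlier = max(range(len(points)), key=distances.__getitem__)

print(labels2, outlier)
print(inertia2, inertia3)
```

Two kinds of iteration appear here: the inner loop repeats until assignments stabilize, and the experiment is rerun with a different number of clusters to compare scores. Inertia tends to decrease as the number of clusters grows, so a lower value alone does not prove a better clustering.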

2.3 Exercise 3 The quality of data: Observing Bias

Description of the tasks of the exercise [HO3: GIST]

To hand in: Share a notebook from Colab to genoveva.vargas@gmail.com (no PDFs or other types of documents will be accepted), including:

The answers to the actions suggested in the notebook, along with an explanation of what you did in a markdown cell.
A mind map (JPEG/PNG figure) of how to address bias: first measuring it in data collections, then “fixing” it during the preparation and sampling phases.
Choose another variable to protect and reproduce the same pipeline for a different “protected group”.
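To illustrate what measuring bias for a protected group can look like, here is a minimal sketch that compares positive-outcome rates between two groups and then repeats the measurement for a second protected variable. The records, the column names (`gender`, `age_group`, `hired`), and the disparate impact ratio as the metric are all assumptions chosen for illustration. The exercise notebook may use different data and metrics.

```python
def positive_rate(records, attribute, value, outcome="hired"):
    """Share of records in the given group that received the positive outcome."""
    group = [r for r in records if r[attribute] == value]
    return sum(r[outcome] for r in group) / len(group)

def disparate_impact(records, attribute, protected, reference, outcome="hired"):
    """Ratio of positive rates: protected group over reference group.
    A value of 1.0 means both groups receive the outcome at the same rate."""
    return (positive_rate(records, attribute, protected, outcome)
            / positive_rate(records, attribute, reference, outcome))

# Hypothetical data collection (made-up values).
records = [
    {"gender": "F", "age_group": "young", "hired": 1},
    {"gender": "F", "age_group": "young", "hired": 0},
    {"gender": "F", "age_group": "senior", "hired": 0},
    {"gender": "F", "age_group": "senior", "hired": 0},
    {"gender": "M", "age_group": "young", "hired": 1},
    {"gender": "M", "age_group": "young", "hired": 1},
    {"gender": "M", "age_group": "senior", "hired": 1},
    {"gender": "M", "age_group": "senior", "hired": 1},
]

# First protected variable: gender, with "F" as the protected group.
di_gender = disparate_impact(records, "gender", protected="F", reference="M")

# Same pipeline, different protected variable: age group.
di_age = disparate_impact(records, "age_group", protected="senior", reference="young")

print(di_gender, di_age)
```

Because the protected attribute is a parameter, switching to another “protected group” only changes the arguments, not the pipeline.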

2.4 Exercise 4 Predicting Events

Create a copy of the exercise [HO5: GIST] and run the example on Kaggle/Colab (note that the data are private).
Draw a figure representing the experiment pipeline presented in the notebook.
- Show the preparation phases and the aspects to look for when preparing the dataset for prediction with logistic regression.
- Show the phases required for using logistic regression (you can refer to the code snippets in the notebook devoted to this purpose).
- Show the assessment and interpretation phases (construction of the confusion matrix).
Include the figure at the end of the notebook in a Markdown cell; do not hesitate to describe it in natural language in a separate Markdown cell.
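For the assessment phase, the following sketch shows how a confusion matrix can be built from the predictions of a logistic model. The weights, bias, test features, and labels are hypothetical values picked for illustration. No training is performed here, and the notebook's own snippets remain the reference for fitting the model.

```python
import math

def predict_proba(x, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the features."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts

# Hypothetical parameters (assumed already fitted) and a toy one-feature test set.
weights, bias = [1.5], -3.0
X_test = [[0.5], [1.0], [2.5], [3.0], [1.8], [2.2]]
y_test = [0, 0, 1, 1, 1, 0]

# Prediction: threshold the estimated probability at 0.5.
y_pred = [1 if predict_proba(x, weights, bias) >= 0.5 else 0 for x in X_test]

# Assessment: confusion matrix and the metrics derived from it.
cm = confusion_matrix(y_test, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_test)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])

print(cm)
print(accuracy, precision, recall)
```

The three commented stages (prediction, confusion matrix, derived metrics) map onto boxes that the pipeline figure can show after the preparation and training phases.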

2.5 Exercise 5 Modelling knowledge with graphs (Extra Work)

Run the example of the exercise [HO5: K-Notebook] on Kaggle (note that the data are private).
See the explanation here
Propose yet another mind map about:
- The type of graph built in HO5
- The families of operations applied to graphs
- The visualization techniques used for visualizing the graphs
Design (DO NOT PROGRAM) an example, including:
- A research question stated in natural language (assume you have a data collection).
- A description of the data collection you will have as input.
- A figure with a pipeline where a graph is used to model smart cities or smart energy problems.
  - Show the phases for building a graph that models the studied phenomenon.
  - Show the phases in which graph operations are used to answer the research question.
  - Include the phases that will implement the assessment strategy.

Hand in a PDF document with the assignment results here

Data Centric Smart Experiments

Turning reality phenomena into data thanks to the Big Data trend

ENSE3 ICT BD Lab