Context
Tabular datasets are very common. A table is defined by a series of (implicitly) typed attributes, which form its structure (schema), and it contains a series of records that conform to this structure. Different data exploration libraries provide a data type representing the notion of a table, together with sets of operators for manipulating tables. Even if the general principle is similar across tabular data models, the properties of the tabular data structures and operators vary.
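As a minimal sketch of this idea in Pandas, the snippet below builds a small table from a list of records and inspects its schema. The column names and values are invented for illustration only (they are not taken from HO-1); the point is that each attribute gets a type inferred from the data, without being declared explicitly.

```python
import pandas as pd

# Hypothetical toy records: names and values are made up for illustration.
records = [
    {"country": "A", "year": 2001, "spending": 5.0},
    {"country": "B", "year": 2001, "spending": 4.0},
    {"country": "C", "year": 2002, "spending": 6.5},
]

# A DataFrame is the Pandas data type representing a table.
table = pd.DataFrame(records)

# The schema: one (implicitly inferred) type per attribute.
schema = table.dtypes

# The size of the table: number of records and number of attributes.
n_records, n_attributes = table.shape

print(schema)
print(n_records, n_attributes)
```

Other libraries expose the same notion with different properties, for example explicit schemas or different rules for inferring types.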
Objective
This exercise aims to get you acquainted with the tabular data structure and its associated operators, and to try a concrete implementation provided by the Python library Pandas. By applying a series of operators to tabular data, it is possible to explore its content, profile its mathematical properties and answer research questions of the type “What happened?”. Through this exploration, we can assess the quality of a data set in terms of missing and null values and the statistical distribution of the values in each column. We can then decide to “clean” the data set before answering “What happened?”-type research questions.
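The kind of exploration described here can be sketched with a few Pandas operators. The table below is a hypothetical toy example (invented column names and values, not the HO-1 data), and dropping incomplete rows is only one possible cleaning strategy among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with some missing values (NaN).
table = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "year": [2001, 2002, 2001, 2002, 2001],
    "spending": [5.0, np.nan, 4.0, 6.0, np.nan],
})

# Data quality: count the missing values in each column.
missing_per_column = table.isna().sum()

# Statistical profile of a numeric column (count, mean, std, quartiles...).
summary = table["spending"].describe()

# One possible cleaning step: drop the records with no spending value.
cleaned = table.dropna(subset=["spending"])

# A simple "What happened?" query: mean spending per country.
mean_by_country = cleaned.groupby("country")["spending"].mean()

print(missing_per_column)
print(summary)
print(mean_by_country)
```

Note that country "C" disappears from the final result because its only record had a missing value, which shows how a cleaning decision can affect the answer to the research question.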
Material
- Exercise HO-1, a version with R, can be found here [K-Notebook in R]
- Explanation on the whiteboard about the table data structure
To Do
- Organise into groups of 3 or 4 people. You can also decide to work alone, even if it is less fun.
- Have a look at the experiment proposed in HO-1
- Create a notebook in Kaggle according to in-class instructions and test-learn the tasks implemented in HO-1
- Draw a mind map relating the table manipulation operators introduced in class to the corresponding operators provided by the Pandas library that you discovered while testing HO-1
- Draw a pipeline that describes the series of tasks implemented in HO-1. The pipeline is intended to answer the question, “How similar were EU countries in their investment in education throughout the first decade of the 21st century?”
- Propose an interpretation of the final result in the exercise (interpret the plots) and provide a critical view of their pertinence concerning the research question.
To Hand In
- Add the names of your group members and the program you are enrolled in to your notebook.
- Add the mind map drawing, the pipeline drawing, and your interpretation to your notebook.
- Please share your notebook with the professor USING KAGGLE.