- Define the notion of “Datification”. In what way is it a revolution with respect to smart environments?
- Describe the characteristics of data-centric sciences. What role does data play in them? What are the two components that make them a new generation of experimental sciences?
- Define the notion of Big Data. In your opinion, how does this notion open new challenges for data management?
- Give five properties that characterise Big Data. Explain how each of them makes managing data challenging.
- In your domain of expertise, how does Big Data open novel possibilities or problems/challenges?
- When multi-databases are used for storing data collections, what are the challenges related to query rewriting in such a setting? Design an example of data collections stemming from a smart building or a smart city quarter that can be stored in different databases and then queried in the spirit of distributed queries.
Data science issues
- Describe the general methodology of data science. What is its objective?
- What is a Web IDE? What does IDE stand for? What is a notebook?
- Give a general description of a Data Science virtual machine.
- Give the general functional architecture showing how Azure Notebooks communicates with GitHub and with the Python interpreter in the setting used for experimenting in the lab sessions.
Defining a tabular view of a data collection
- What is a DataFrame? Define a DataFrame that shows the energy consumption readings of home appliances while they are in use, according to the following schema:
<applianceName, initialdate, initialhour, finaldate, finalhour, consumedWatts>
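As a starting point for the question above, here is a minimal sketch of such a DataFrame, assuming Pandas is available. The appliance names, dates and wattage values are invented for illustration; only the column names come from the schema.

```python
import pandas as pd

# Hypothetical readings: every value below is invented for illustration.
# Only the column names follow the schema given in the question.
readings = pd.DataFrame(
    {
        "applianceName": ["oven", "fridge", "washing machine"],
        "initialdate": ["2024-01-10", "2024-01-10", "2024-01-11"],
        "initialhour": ["18:00", "00:00", "09:30"],
        "finaldate": ["2024-01-10", "2024-01-10", "2024-01-11"],
        "finalhour": ["19:15", "23:59", "10:45"],
        "consumedWatts": [2400.0, 1500.0, 900.0],
    }
)

print(readings.shape)   # (number of rows, number of columns)
print(readings.dtypes)  # one type per column
```

Dates and hours are kept as plain strings here; converting them with `pd.to_datetime` is a possible refinement.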
Manipulating data
- Consider the operations that can be applied on top of tabular data structures, like projection (retrieving a subset of columns/attributes), selection (retrieving a subset of records) and filter (retrieving a subset of records given a condition). Which operators does Pandas provide to implement these operations on a DataFrame? What is the result type? Give examples, in particular of how null values can be filtered.
- Which aggregation functions can be applied to DataFrames, and what is the role of the parameters axis and inplace often used together with these functions?
- What is the form of the expressions for adding columns to a DataFrame? And rows? How can rows or columns be deleted?
- How can default values be added to attributes containing missing or null values?
- Give an example of the use of the groupby() method applied on a DataFrame.
- How are the manipulation operators associated with DataFrames useful for implementing Data Science processes?
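The operations mentioned in the questions above can be sketched as follows. This is a minimal illustration, assuming Pandas and NumPy are installed; the data and the derived column name `consumedKW` are invented, and other Pandas operators could serve equally well.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value (NaN), invented for illustration.
df = pd.DataFrame(
    {
        "applianceName": ["oven", "fridge", "oven", "fridge"],
        "consumedWatts": [2400.0, np.nan, 1800.0, 1500.0],
    }
)

projection = df[["applianceName"]]           # list of columns -> DataFrame
column = df["consumedWatts"]                 # single column -> Series
selection = df.iloc[0:2]                     # rows by position -> DataFrame
filtered = df[df["consumedWatts"] > 1600]    # boolean mask -> DataFrame
non_null = df[df["consumedWatts"].notna()]   # keep rows without a null value

# Default value for missing entries (returns a new DataFrame).
filled = df.fillna({"consumedWatts": 0.0})

# Adding and deleting a column, deleting a row by index label.
with_kw = df.assign(consumedKW=df["consumedWatts"] / 1000)
without_col = with_kw.drop(columns=["consumedKW"])
without_row = df.drop(index=[1])

# Grouping and aggregating: total consumption per appliance (NaN is skipped).
totals = df.groupby("applianceName")["consumedWatts"].sum()
print(totals)
```

None of these calls modifies `df`; each returns a new object, which is the default behaviour when `inplace` is not set.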
Descriptive Statistics
- What is the role of descriptive statistics with regard to the analysis of data collections?
- What type of questions can be answered using descriptive statistics? Which mathematical tools are used for that?
- Which methods does Pandas provide for getting acquainted with the content of data collections in a quantitative manner?
- How is the shape attribute used for analysing data in a DataFrame?
- What issues have to be considered in order to be able to apply statistics to raw data collections?
- What role does generating graphics play when applying descriptive statistics to analyse data?
- Which strategies are used for dealing with dirty data when applying descriptive statistics functions?
- Why can it be important to know the distribution of the values of a given attribute in a data analytics process?
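A few of the Pandas facilities these questions refer to can be sketched as follows, assuming Pandas and NumPy are installed. The data is invented for illustration, and dropping missing values is only one of several possible strategies for dirty data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing reading, invented for illustration.
df = pd.DataFrame(
    {
        "applianceName": ["oven", "fridge", "oven", "kettle", "fridge"],
        "consumedWatts": [2400.0, 1500.0, np.nan, 2000.0, 1400.0],
    }
)

n_rows, n_cols = df.shape                     # size of the collection
missing = df.isna().sum()                     # missing values per column
summary = df["consumedWatts"].describe()      # count, mean, std, min, quartiles, max
counts = df["applianceName"].value_counts()   # distribution of a categorical attribute

# One possible strategy for dirty data: ignore missing values explicitly.
clean_mean = df["consumedWatts"].dropna().mean()

print(summary)
print(counts)
```

A histogram of `consumedWatts` (for example with `df["consumedWatts"].plot.hist()`, which requires matplotlib) would complement these figures visually.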
Unsupervised learning
- What is unsupervised learning? Explain its general principle.
- What type of questions can unsupervised learning methods answer? Give examples or use cases.
- Describe the general principle of the K-Means clustering algorithm.
- Explain which measures can be used for assessing the result of applying this algorithm to data.
- What is the role of visualising the results of the K-Means algorithm applied to a data collection?
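To make the K-Means questions concrete, here is a minimal sketch of the usual iteration (assign each point to its nearest centroid, then recompute centroids), written with NumPy only. The function names `assign_labels` and `kmeans`, the random initialisation and the toy points are assumptions made for illustration; a library implementation such as scikit-learn's would normally be used in practice.

```python
import numpy as np


def assign_labels(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Illustrative K-Means: random initial centroids, then iterate."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign_labels(points, centroids)
        # Recompute each centroid as the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array(
            [
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ]
        )
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = assign_labels(points, centroids)
    # Inertia: sum of squared distances of points to their assigned centroid.
    inertia = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia


# Toy data invented for illustration: two well-separated groups.
points = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
)
labels, centroids, inertia = kmeans(points, k=2)
print(labels, inertia)
```

Inertia is one measure for assessing the result; comparing it across several values of `k`, or plotting the points coloured by label, are common complements.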
Inferential statistics
- Explain the principle of linear regression and give an example.
- What can linear regression be used for?
- Which criteria must the data meet for linear regression to be applicable?
- Define a pipeline that gives the general steps to be implemented to solve a prediction problem using linear regression.
- Which scores are used for assessing linear regression results?
- What does it mean to bootstrap the standard error of the mean?
- What are confidence intervals and p-values?
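Several of the questions above can be explored with a short sketch, assuming NumPy is installed. It fits a straight line by least squares, computes two common scores (R² and mean squared error), and then estimates the standard error of the mean by bootstrap resampling together with a percentile confidence interval. The data, the seed and the number of resamples are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Hypothetical data invented for illustration: hours of use (x), consumption (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0])

# Linear regression: least-squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# Scores: coefficient of determination (R^2) and mean squared error.
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
mse = np.mean((y - predicted) ** 2)

# Bootstrap: resample y with replacement many times and record each mean.
n_resamples = 2000
boot_means = np.array(
    [rng.choice(y, size=len(y), replace=True).mean() for _ in range(n_resamples)]
)
bootstrap_sem = boot_means.std(ddof=1)                    # spread of resampled means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # 95% percentile interval

print(slope, intercept, r_squared, mse)
print(bootstrap_sem, ci_low, ci_high)
```

The percentile interval is only one way of building a bootstrap confidence interval; with so few observations the estimates are rough and serve only to illustrate the procedure.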