 Define the notion of “Datification”. In what way is it a revolution with respect to smart environments?
 Define the characteristics of data-centric sciences. What role does data play in them? What are the two components that make them a new generation of experimental sciences?
 Define the notion of Big Data. In your opinion, how does this notion open new challenges for data management?
 Give 5 properties that characterise Big Data. Explain in what way each of them makes managing data challenging.
 In your domain of expertise, how does Big Data open novel possibilities or raise new problems and challenges?
 When multidatabases are used for storing data collections, what are the challenges related to query rewriting in such a setting? Design an example of data collections stemming from a smart building or a smart city quarter that can be stored in different databases and are then queried in the spirit of distributed queries.

Data science issues
 Describe the general methodology of data science. What is its objective?
 What is a Web IDE? What does IDE stand for? What is a notebook?
 Give a general description of a Data Science virtual machine.
 Give the general functional architecture showing how Azure Notebooks communicates with GitHub and with the Python interpreter in the setting used for experimenting in the lab sessions.
Defining a tabular view of a data collection
 What is a DataFrame? Define a DataFrame that shows the energy consumption readings of home appliances while they are in use, according to the following schema:
<applianceName, initialdate, initialhour, finaldate, finalhour, consumedWatts>
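One possible starting point for this question is sketched below. It assumes pandas is installed; the appliance names and readings are invented for illustration and are not taken from the lab material.

```python
import pandas as pd

# Hypothetical readings following the schema above; all values are made up.
readings = pd.DataFrame(
    {
        "applianceName": ["oven", "fridge", "washer"],
        "initialdate": ["2024-01-05", "2024-01-05", "2024-01-06"],
        "initialhour": ["18:00", "00:00", "09:30"],
        "finaldate": ["2024-01-05", "2024-01-05", "2024-01-06"],
        "finalhour": ["19:15", "23:59", "10:45"],
        "consumedWatts": [2400.0, 150.0, 500.0],
    }
)

# Number of (rows, columns) and the type inferred for each column.
print(readings.shape)
print(readings.dtypes)
```

Dates and hours are kept as strings here for simplicity; converting them to datetime values would be a natural next step before any time-based analysis.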
Manipulating data
 Consider the operations that can be applied to tabular data structures: projection (retrieving a subset of columns/attributes), selection (retrieving a subset of records) and filter (retrieving a subset of records given a condition). Which operators does Pandas provide to implement these operations on a DataFrame? What is the result type? Give examples, in particular of the way null values can be filtered.
 Which aggregation functions can be applied to DataFrames, and what is the role of the parameters axis and inplace that often appear together with these functions?
 What form do the expressions for adding columns to a DataFrame take? And for adding rows? How can rows or columns be deleted?
 How can default values be added to attributes containing missing or null values?
 Give an example of the use of the groupby() method applied to a DataFrame.
 How are the manipulation operators associated with DataFrames related to, and useful for, implementing Data Science processes?
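The manipulation questions above can be explored with a short sketch such as the following. It assumes pandas is installed and uses an invented two-column DataFrame; it shows one way of writing each operation, not the only one.

```python
import pandas as pd

# Hypothetical data with one missing reading.
df = pd.DataFrame(
    {
        "applianceName": ["oven", "fridge", "oven", "washer"],
        "consumedWatts": [2400.0, 150.0, None, 500.0],
    }
)

projection = df[["applianceName"]]          # subset of columns, still a DataFrame
selection = df.iloc[0:2]                    # first two records, by position
filtered = df[df["consumedWatts"] > 200]    # records satisfying a condition
non_null = df[df["consumedWatts"].notna()]  # keep only records with a reading

filled = df.fillna({"consumedWatts": 0.0})             # default for missing values
filled["consumedKW"] = filled["consumedWatts"] / 1000  # add a derived column
trimmed = filled.drop(columns=["consumedKW"])          # delete a column (new object)

# axis=0 aggregates down each column; missing values are skipped by default.
column_means = df.mean(axis=0, numeric_only=True)

# Total consumption per appliance.
per_appliance = df.groupby("applianceName")["consumedWatts"].sum()
print(per_appliance)
```

Most of these calls return a new object and leave `df` unchanged, which is the behaviour the `inplace` parameter is meant to alter on the methods that accept it.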
Descriptive Statistics
 What is the role of descriptive statistics with regard to the analysis of data collections?
 What type of questions can be answered using descriptive statistics? Which mathematical tools are used for that?
 Which methods does Pandas provide for getting acquainted with the content of data collections in a quantitative manner?
 How is the shape attribute used for analysing data in a DataFrame?
 What issues have to be considered in order to be able to apply statistics to raw data collections?
 What role does generating graphics play when applying descriptive statistics to analyse data?
 Which strategies are used for dealing with dirty data when applying descriptive statistics functions?
 Why can it be important to know the distribution of the values of a given attribute in a data analytics process?
Unsupervised learning
 What is unsupervised learning? Explain its general principle.
 What type of questions can unsupervised learning methods answer? Give examples or use cases.
 Describe the general principle of the KMeans clustering algorithm.
 Explain which measures can be used for assessing the result of applying such an algorithm to data.
 What is the role of visualising the results of the KMeans algorithm applied to a data collection?
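To make the KMeans questions above concrete, here is a minimal plain-Python sketch of the assignment and update steps, together with the within-cluster sum of squared distances (often called inertia) as one possible quality measure. The points and the initial centroids are invented, and the function names `kmeans` and `inertia` are chosen for this illustration only; in the lab setting a library implementation would more likely be used.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Plain k-means: alternate assignment and centroid update steps."""
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

def inertia(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(
        math.dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster
    )

# Hypothetical 2D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(inertia(centroids, clusters))
```

Fixed initial centroids keep the run deterministic; real implementations usually pick them at random and repeat the procedure several times.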
Inferential statistics
 Explain the principle of linear regression and give an example.
 What can linear regression be used for?
 Which criteria must the data satisfy in order to decide whether linear regression can be applied or not?
 Define a pipeline that gives the general steps to be implemented to solve a prediction problem using linear regression.
 What are the scores used for assessing linear regression results?
 What does it mean to bootstrap the standard error of the mean?
 What are confidence intervals and p-values?
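As a hint for the bootstrap question above, the sketch below resamples a small invented sample with replacement, computes the mean of each resample, and takes the standard deviation of those means as an estimate of the standard error. It uses only the Python standard library; the function name `bootstrap_sem`, the sample values and the number of resamples are assumptions made for this illustration.

```python
import random
import statistics

def bootstrap_sem(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(sample)
    means = [
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    ]
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)

# Hypothetical measurements, made up for illustration.
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7]

sem_boot = bootstrap_sem(sample)
# Classical formula for comparison: sample standard deviation / sqrt(n).
sem_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(sem_boot, sem_formula)
```

The two estimates should be of similar size; the percentiles of the resampled means can also serve as a rough confidence interval for the mean.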