CONTROL QUESTIONS | Cloud Computing and Big Data

Cloud Computing

Define the cloud.
Describe the reference architecture.
Is the cloud a service-based architecture? Explain why or why not.
What are the three cloud provision models?
What does the cloud consist of? Can the cloud be considered to rely on a back end? If so, how is that back end composed?
What is the role of the cloud in the construction of computing systems?
When deploying applications on the cloud, which decision-making criteria should be considered (refer to the edge, the fog, and the cloud)?
How are cloud services delivered to be used as development environments?
What is the role of virtualization and containerization in the context of the cloud?
Explain the notion of DevOps. How is it related to the cloud? (Use concepts such as production line, deployment, maintenance, self-contained, and resource allocation in your answer.)
Describe the principle of virtualization vs. containerization (do not forget to refer to trapping OS calls, confinement, fault tolerance, and properties). Enumerate the strong and weak points of both techniques.
What techniques are used for managing and coordinating virtualized solutions? Why are they important? What is the general principle of these coordination techniques? How are they related to service level agreements?
Explain the importance of resource consumption monitoring and its association with the pay-as-you-go model and service level agreements.

Big Data

Define the notion of “datafication”. In which way is it a revolution with respect to smart environments?
Describe the characteristics of data-centric sciences. What is the role of data in them? What are the two components that make them a new generation of experimental sciences?
Define the notion of Big Data. In your opinion, how does this notion open new challenges for data management?
Give five properties that characterise Big Data. Explain in which way they make managing data challenging.
In the case of your domain of expertise, how does Big Data open novel possibilities or problems/challenges?
In terms of multi-databases used for storing data collections, what are the challenges related to query rewriting in such a setting? Design an example of data collections stemming from a smart building or a smart city quarter that can be stored in different databases and then queried in the spirit of distributed queries.
Describe the general architecture and functional principle of HDFS. How does it differ from a centralized file system?
Describe the general principle of the NoSQL movement.
What is the CAP model?
What is the impact of having different data models in the NoSQL systems on data querying?
What are the implications of having data residing in memory and performing explicit persistence in NoSQL systems?
What is the relationship between clusters and NoSQL systems?
Explain the principle and benefits of data sharding. How can you assess a data sharding proposal?
What is polyglot persistence? What are its implications for data querying? What would be the role and challenges of providing an indexing solution in this context?
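
As a possible aid for the sharding question above, here is a minimal sketch of hash-based sharding together with one simple way to assess a proposal: measuring how evenly keys spread across shards. The function name `shard_for`, the `sensor-<i>` keys, and the choice of four shards are hypothetical and chosen only for illustration. Real systems may use range-based or directory-based sharding instead.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard number using a stable hash (illustrative)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


NUM_SHARDS = 4  # assumed cluster size, for illustration only

# Hypothetical record keys, e.g. sensor identifiers in a smart building.
keys = [f"sensor-{i}" for i in range(1000)]

# Count how many keys land on each shard.
load = Counter(shard_for(key, NUM_SHARDS) for key in keys)

# One possible assessment metric: heaviest shard relative to a perfectly
# even spread (1.0 means perfectly balanced, larger means more skew).
imbalance = max(load.values()) / (len(keys) / NUM_SHARDS)

print(dict(load))
print(round(imbalance, 3))
```

Balance is only one criterion. A fuller assessment would also look at how many shards a typical query must touch and how much data moves when a shard is added or removed.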

Data science issues

Describe the general methodology of data science. What is its objective?
What is a Web IDE? What does IDE stand for? What is a notebook?
Give a general description of a Data Science virtual machine.
Give the general functional architecture showing how Azure Notebooks communicates with GitHub and with the Python interpreter in the setting used for experiments in the lab sessions.

Defining a tabular view of a data collection

What is a DataFrame? Define a DataFrame that holds readings of the energy consumed by home appliances while they are in use, according to the following schema:

<applianceName, initialdate, initialhour, finaldate, finalhour, consumedWatts>
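
As one possible starting point for this exercise, the sketch below builds a small Pandas DataFrame with the schema above. The appliance names and readings are invented for illustration, and storing dates and hours as plain strings is an assumption rather than something the schema requires.

```python
import pandas as pd

# Invented sample readings that follow the schema
# <applianceName, initialdate, initialhour, finaldate, finalhour, consumedWatts>.
readings = pd.DataFrame(
    {
        "applianceName": ["oven", "washing machine", "kettle"],
        "initialdate": ["2021-03-01", "2021-03-01", "2021-03-02"],
        "initialhour": ["18:00", "09:30", "07:15"],
        "finaldate": ["2021-03-01", "2021-03-01", "2021-03-02"],
        "finalhour": ["19:00", "10:45", "07:20"],
        "consumedWatts": [2000.0, 500.0, 1500.0],
    }
)

# Each row is one usage record; each column is one attribute of the schema.
print(readings)
print(readings.dtypes)
```

Converting the date and hour columns to a single timestamp type (for example with `pd.to_datetime`) would make duration computations easier, at the cost of departing from the literal schema.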

Manipulating data

Consider the operations that can be applied to tabular data structures: projection (retrieving a subset of columns/attributes), selection (retrieving a subset of records), and filter (retrieving a subset of records that satisfy a condition). Which operators does Pandas provide to implement these operations on a DataFrame? What is the result type of each? Give examples, in particular of how null values can be filtered.
Which aggregation functions can be applied to DataFrames?
What is the form of the expressions for adding columns to a DataFrame? And rows? How can rows or columns be deleted?
How can default values be added to attributes containing missing or null values?
Give an example of the use of the groupby() method applied on a DataFrame.
How are the manipulation operators associated with DataFrames related to, and useful for, implementing data science processes?
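
The manipulation operations listed in this section can be sketched as follows. This is a minimal illustration on an invented two-column DataFrame, not a complete answer: the column names reuse the appliance schema, the values are made up, and Pandas offers other equivalent ways to express most of these steps.

```python
import numpy as np
import pandas as pd

# Invented data; one reading is deliberately missing (NaN).
df = pd.DataFrame(
    {
        "applianceName": ["oven", "kettle", "oven", "fridge"],
        "consumedWatts": [2000.0, 1500.0, np.nan, 150.0],
    }
)

# Projection: a list of column names returns a DataFrame,
# a single column name returns a Series.
names_and_watts = df[["applianceName", "consumedWatts"]]
watts = df["consumedWatts"]

# Selection by position: the first two records (a DataFrame).
first_two = df.iloc[0:2]

# Filter with a boolean condition (rows with NaN compare as False).
heavy = df[df["consumedWatts"] > 1000]

# Filtering null values: two equivalent ways to keep rows that have a reading.
with_reading = df[df["consumedWatts"].notna()]
same_rows = df.dropna(subset=["consumedWatts"])

# Default value for missing readings.
filled = df.fillna({"consumedWatts": 0.0})

# Adding a derived column, then deleting it again.
df_kw = df.assign(consumedKW=df["consumedWatts"] / 1000)
df_no_kw = df_kw.drop(columns=["consumedKW"])

# Grouping and aggregation: total consumption per appliance (NaN is skipped).
totals = df.groupby("applianceName")["consumedWatts"].sum()
print(totals)
```

None of these calls modifies `df` in place; each returns a new object, which makes it easy to chain steps in a data preparation pipeline.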

Descriptive Statistics

What is the role of descriptive statistics with regard to the analysis of data collections?
What type of questions can be answered using descriptive statistics? Which mathematical tools are used for that?
Which methods does Pandas provide for getting acquainted with the content of data collections in a quantitative manner?
How is the shape attribute used for analysing data in a DataFrame?
What issues must be addressed to be able to apply statistics to raw data collections?
What role does generating graphics play in applying descriptive statistics to data analysis?
Which strategies are used for dealing with dirty data when applying descriptive statistics functions?
Why can it be important to know the distribution of the values of a given attribute in a data analytics process?
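
Several of the questions in this section can be explored with a few Pandas calls. The sketch below uses an invented `consumedWatts` column with one missing value, and dropping that row is just one of several possible strategies for dirty data (filling with a default is another).

```python
import numpy as np
import pandas as pd

# Invented readings with one missing value.
watts = pd.DataFrame({"consumedWatts": [100.0, 200.0, np.nan, 300.0, 400.0]})

# shape is an attribute, not a method: (number of rows, number of columns).
rows, cols = watts.shape

# Count missing values before computing statistics.
missing = watts["consumedWatts"].isna().sum()

# One strategy for dirty data: drop incomplete rows.
clean = watts.dropna()

# Quantitative overview: count, mean, std, min, quartiles, max.
summary = clean["consumedWatts"].describe()
mean = clean["consumedWatts"].mean()
median = clean["consumedWatts"].median()

# A rough view of the distribution: number of values per interval.
counts = clean["consumedWatts"].value_counts(bins=2, sort=False)

print(rows, cols, missing)
print(summary)
print(counts)
```

A histogram of the same column, drawn with a plotting library, would show the distribution graphically instead of as interval counts.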

Parallel programming models and engines

What is the difference between a parallel programming model and a parallel data processing engine?
Define, in general terms, the concepts of data flow and control flow. How can these flows be used to express parallel computing, i.e., where can parallelism be observed within a control flow and within a data flow?
Which data management aspects must be considered in control flow and data flow parallel programming models?
What is a parallel data processing engine? Which are the main functions provided regarding the execution of parallel programs? How do they rely on persistence support?
Explain the principle of the MapReduce programming paradigm. What are the types of the input and output parameters of the map and reduce functions?
What are mappers and reducers?
What is an execution plan? What is an execution DAG in Spark?
What is the general architecture of the Spark engine?
How are RDDs related to HDFS?
Explain the properties of RDDs and their role in the execution of parallel programs.
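
To make the questions on map and reduce more concrete, here is a minimal single-process sketch of the programming model, using word counting as the example. It is not how Hadoop or Spark are implemented: a real engine runs many mappers and reducers in parallel on different nodes and handles the shuffle, scheduling, and fault tolerance. The names `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical.

```python
from collections import defaultdict


def map_fn(_key, line):
    """map: (k1, v1) -> list of (k2, v2). Here: (offset, line) -> [(word, 1), ...]."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """reduce: (k2, list of v2) -> (k2, v3). Here: (word, [1, 1, ...]) -> (word, total)."""
    return word, sum(counts)


def map_reduce(records):
    # Map phase: each input record is processed independently,
    # which is where a real engine would run mappers in parallel.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: each key is reduced independently,
    # which is where reducers could run in parallel.
    return dict(reduce_fn(key, values) for key, values in groups.items())


lines = [(0, "big data on the cloud"), (1, "the cloud scales")]
word_counts = map_reduce(lines)
print(word_counts)
```

The independence of records in the map phase and of keys in the reduce phase is what allows an engine to distribute both phases across a cluster.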

Cloud Computing and Big Data

How are cloud computing and Big Data related?
How can cloud services be composed into an integrated set of resources for preparing a parallel programming environment using a specific engine? Enumerate the types of services to compose, and then explain how you can deploy your solution on the cloud.
How does the elasticity of the cloud contribute to Big Data processing?
What are the resource allocation requirements of Big Data management and processing?
What are the challenges of preparing the right cloud environments for deploying and running large-scale workloads that process Big Data?
Describe families of Big Data-driven applications that call for cloud environments to run at scale.