After the lively discussions prompted by the emergence of the buzzword “Big Data”, industry and academia seem to have reached a shared understanding of the properties of data (volume, velocity, variety, veracity and value), the resources and “know-how” it requires, and the opportunities it opens. Indeed, new applications promising fundamental changes in society, industry and science include face recognition, machine translation, digital assistants, self-driving cars, ad serving, chatbots, personalised healthcare, smart industry and more.
The first lesson of the era of “Big Data” is that it is possible to access and exploit representative “samples” of available data collections, because the resources needed to store them and to run resource-greedy processing tasks on them are now available. The second lesson is that computer science and mathematics must build synergies with other sciences to exploit this newly available “value”. The consequence is the emergence of “new” data-centric sciences: data science, digital humanities, social data science, network science and computational science. These sciences, with their new requirements and challenges, call for revisiting from new perspectives the foundations of databases, artificial intelligence and the other disciplines used to address them.
This novel, multidisciplinary and data-centric scientific movement promises new and not yet imagined applications that rely on massive amounts of evolving data, which must be cleaned, integrated and analysed for modelling purposes. Yet data management issues are not usually perceived as central.
Objective:
This course explores the key challenges and opportunities for data management in this new scientific world, and discusses how data-centric artificial intelligence supported by high performance computing (HPC) can best contribute to these exciting domains.
The objective is to introduce the data processing and analysis challenges raised by data-centric sciences, and to apply data processing and analysis techniques on parallel platforms.