ABOUT

Huge collections of heterogeneous data have become the backbone of scientific, analytic and forecasting processes. By combining simulation, computer vision and machine learning with data science techniques, it is possible to compute mathematical models that help understand and predict phenomena. To achieve this ambitious objective, data must go through complex and repetitive processing and analysis pipelines, known as data science pipelines.

The enactment of data science pipelines must balance the delivery of different types of services: (i) hardware (computing, storage and memory), (ii) communication (bandwidth and reliability) and scheduling, and (iii) greedy analytics and mining with high in-memory and computing-cycle requirements. Current data science environments (e.g. the Microsoft ML environment) have focused mainly on the efficient provision of the computing resources required to process data with greedy analytics algorithms. Beyond the execution of such tasks using parallel models and their associated technology, data management remains an open and key issue. How should data be distributed and replicated across CPU/GPU farms to ensure its availability to the parallel processes that consume it? How should data be organized (loaded and indexed) in main memory to perform efficient data processing and analytics at scale?
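To make the data-distribution question concrete, the following minimal sketch partitions a dataset into chunks and processes them in parallel across CPU workers using only Python's standard library. It is an illustration of the problem rather than a technique taught in the lecture: the chunking strategy, the worker count and the process_chunk task are hypothetical placeholders.

    from multiprocessing import Pool

    def process_chunk(chunk):
        # Placeholder analytics task: a simple per-chunk aggregate.
        return sum(chunk) / len(chunk)

    def partition(data, n_parts):
        # Naive equal-size partitioning. A real pipeline must also decide
        # where each partition lives (which node, CPU or GPU) and how many
        # replicas to keep for availability.
        size = (len(data) + n_parts - 1) // n_parts
        return [data[i:i + size] for i in range(0, len(data), size)]

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = partition(data, n_parts=8)
        with Pool(processes=8) as pool:
            # Each worker receives a copy of its chunk; this data movement
            # cost is exactly the distribution issue raised above.
            results = pool.map(process_chunk, chunks)
        print(results)

Even in this toy setting, the cost of copying chunks to workers dominates once data no longer fits in one machine's memory, which is what motivates studying distribution, replication and in-memory organization strategies at scale.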

This lecture introduces data science pipelines and guides their development, with the aim of studying efficient enactment strategies, exploring problems beyond known analytics scales, and contributing to continuous, online data-centric science experiments.