Towards new DBMS architectures: HADAS discussion

Inspired by Michael Stonebraker's keynote at EPFL (http://slideshot.epfl.ch/play/suri_stonebraker), we are opening a discussion about DBMS architectures in the HADAS group.

The objective of the exercise is to work together on the topic and identify the impact of old, new, and coming architectures on the research of the group and on the particular projects of its members. To that end, some weeks ago we all prepared as follows:

1. Everyone watched the video of M. Stonebraker's keynote.
2. We read papers about new perspectives on DBMS architectures by M. Carey, M. Stonebraker, and G. Weikum.
3. Everyone prepared answers to the following questions:

  • Do you really feel that we have done everything wrong in our domain? Are there good points you want to support?
  • Does the evolution of DBMS architectures have an impact on your own research? How? Give concrete examples of problems and approaches to illustrate.

The discussion in the following posts shows our live reactions to this topic.

LATAM Faculty Summit 2012@ Riviera Maya, Mexico

Three main trending topics were addressed at this summit (http://research.microsoft.com/en-US/events/latamfacsum2012/agenda.aspx): (1) human-device interaction, (2) data and information management from different perspectives, and (3) applications. Here are comments on two lectures, full of wisdom, with promising and exciting visions of database research:

Information Management via CrowdSourcing, Hector Garcia-Molina, Professor, Stanford University, United States

Following the wisdom of the crowd, connected and available in all sorts of forums and social networks, it is possible to ask people to help answer questions (i.e., queries): Which is the best seafood restaurant in Riviera Maya? People ready to perform simple tasks, sometimes for a few cents, can help look for an answer, for instance by providing their opinion, information, or collective knowledge.

The very simple principle for answering this question is to ask the crowd and then apply a strategy for classifying the opinions and deciding which answer is acceptable. Another possibility is to combine the crowd’s answers with other information coming from classic databases, search engines, or other data providers and then build an answer (e.g., a ranked set of recommendations).
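The "ask the crowd, then classify the opinions" principle can be illustrated with a minimal sketch. It is not the method presented in the lecture: it assumes a plain majority vote with an agreement threshold, and the function names, the threshold parameter, and the restaurant names are all hypothetical.

```python
from collections import Counter


def normalize(answers):
    """Trim and lowercase free-text answers, dropping empty ones."""
    return [a.strip().lower() for a in answers if a and a.strip()]


def accepted_answer(answers, min_agreement=0.5):
    """Return the most frequent answer if its share of the votes is
    strictly greater than min_agreement (an assumed threshold), else None."""
    votes = normalize(answers)
    if not votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count / len(votes) > min_agreement else None


def ranked_recommendations(answers):
    """Return (answer, vote count) pairs, most voted first, ties by name."""
    counts = Counter(normalize(answers))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical crowd replies to "Which is the best seafood restaurant?"
replies = ["La Perla", "la perla ", "El Faro", "La Perla"]
best = accepted_answer(replies)
ranking = ranked_recommendations(replies)
```

A real strategy would also weigh the provenance of each reply and merge the ranking with data from other providers; the sketch only covers the voting step.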

Interesting research challenges must be addressed if query evaluation is “crowdsourced”: (a) the classification of the answers, for instance by considering their provenance, and the generation of opinion trends; (b) the combination with information from more classic data providers, for instance by redefining some classic relational operators (e.g., join) or maybe defining new ones; (c) answering queries efficiently (query optimisation) even though answers can arrive continuously over long periods of time; and certainly others, if the problem is studied in its full complexity.

Data-Intensive Discoveries in Science: the Fourth Paradigm
 Alex Szalay, Alumni Centennial Professor, Department of Physics and Astronomy, The Johns Hopkins University, United States

Long before the emergence of the buzzwords “Big Data” and “Open Data”, data experts met scientists and decided to build huge databases and make them available to the community through front ends. Maybe the best-known examples are the SDSS project (http://skyserver.sdss.org/) and the WorldWide Telescope (http://www.worldwidetelescope.org/).

Thanks to these ambitious projects, the notion of the Internet scientist emerged, and such scientists started making discoveries by exploiting these databases[1]. The surprise is that non-scientists also started accessing these data as a hobby or for learning purposes. The project, initially led by Jim Gray, has grown and today touches other areas, such as biology, cancer research, the social and human sciences, and even HPC observation (!). The scientific community, old and young researchers alike, certainly has a role to play in populating and exploiting these democratized deposits of potential knowledge.


[1] The term “Internet scientist” is borrowed from A. Szalay, who used it in his keynote presentation.

Data management challenges@ DB group HADAS

Efficient and distributed service based data management

Data management consists of a set of processes for querying, organizing, indexing, and replicating data on storage media (disk, memory, cache) to enable the exploitation of information (e.g., analysis, aggregation) and to guarantee data integrity. We identify four research challenges:

(1) Design and programming of algorithms for optimized querying, data access and retrieval, and data organization on storage media;
(2) Building data management systems by designing efficient processes that implement the data management algorithms of (1);
(3) Implementation of data management systems on top of target platforms (e.g., service-based platforms);
(4) Deployment of data management processes on target execution platforms (e.g., P2P, cloud).

The research done by the members[1] of the HADAS group working on data management issues concerns points (1), (2), and (3)[2]. There are important results on point (4) through the projects e-CLOUDSS (http://e-cloudss.imag.fr), redSHINE (http://redshine.imag.fr), and CLEVER (http://clever.imag.fr).

We address the challenges introduced by the design of algorithms for managing data (1). In particular, we propose algorithms for computing “hybrid” query plans and for query optimisation using machine learning and operations research techniques (see the projects UBIQUEST, http://ubiquest.imag.fr/, and OPTIMACS, http://optimacs.imag.fr).

We also address the design of algorithms and protocols for managing storage media (cache and disk) and for composing event flows, thereby observing how resources are used when implementing data management processes.

Major results: a service-based hybrid query model, an algorithm for evaluating queries by coordinating services, the HYPATIA prototype[3], service-based query evaluation over continuous and on-demand services[4], the MQLiST language, and a cache model for mashups.

Students participating in these research problems: Carlos Manuel López Enríquez, Lourdes A. Martínez Medina, Mohamad Othman Abdallah, Juan Carlos Castrejón, Esteban Gutiérrez, Epal Njanem Orléant.

The construction and implementation of data management systems (2, 3) is an important and re-emerging challenge (3)[5]. We use coordination models (workflows) to define querying, optimization, and data storage services. This approach stems from the BPM community, and we use it to describe data management processes. Our contribution is a process-oriented approach to database research, rather than a contribution to the software engineering domain, which proposes services as architecture units (see for example the research done in the ADELE group at LIG, http://www-adele.imag.fr/).

Once data management is provided as a coordination of services, it is necessary to ensure its reliability and the integrity of the data it manages. We propose systems that ensure these properties, together with a policy model and its associated language for defining and enforcing them within data management services. We have defined policies for specific properties such as exception handling, atomicity, security, and persistence. We have also proposed an environment called Pi-SODM for building service coordinations.

We also address dynamic service substitution in service coordinations. Substitution takes into account functional and non-functional properties (response time, reliability, availability, cost), and it serves to implement recovery strategies that reinforce some integrity and reliability properties.
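As an illustration only, substitution driven by functional and non-functional properties could be sketched as follows. This is not the group's algorithm: the `ServiceProfile` type, the weighted-sum score, the weights, and the service names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProfile:
    """Hypothetical description of a service in a coordination."""
    name: str
    operation: str           # functional property: what the service computes
    response_time_ms: float  # non-functional properties follow
    reliability: float       # fraction of successful calls, in [0, 1]
    availability: float      # fraction of uptime, in [0, 1]
    cost: float              # price per call, arbitrary unit


def score(service, weights):
    """Assumed weighted sum: reward reliability and availability,
    penalise response time (in seconds) and cost."""
    return (weights["reliability"] * service.reliability
            + weights["availability"] * service.availability
            - weights["response_time"] * service.response_time_ms / 1000.0
            - weights["cost"] * service.cost)


def select_substitute(failed, candidates, weights):
    """Pick the best-scoring candidate offering the same operation as the
    failed service, or None if no functionally compatible service exists."""
    compatible = [c for c in candidates
                  if c.operation == failed.operation and c.name != failed.name]
    if not compatible:
        return None
    return max(compatible, key=lambda c: score(c, weights))


# Hypothetical services and weights
failed = ServiceProfile("geo-a", "geocode", 200, 0.90, 0.90, 0.10)
candidates = [
    ServiceProfile("geo-b", "geocode", 300, 0.99, 0.99, 0.20),
    ServiceProfile("geo-c", "geocode", 100, 0.80, 0.70, 0.05),
    ServiceProfile("route-a", "route", 50, 0.99, 0.99, 0.00),
]
weights = {"reliability": 1.0, "availability": 1.0,
           "response_time": 1.0, "cost": 1.0}
substitute = select_substitute(failed, candidates, weights)
```

Functional compatibility is reduced here to an equal operation name; a recovery strategy would call `select_substitute` when a service in the coordination fails and retry the step with the returned service.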

Major results: Policy based reliable service coordination model and its associated language; atomicity, persistence, security policies for service coordinations[6]; Pi-SODM environment.

Students participating in these research problems: Javier A. Espinosa Oviedo, Placido A. Souza Neto, Christiane Kamdem

We have some results concerning the deployment of data management services on large-scale target execution environments (4, 1). The PhD theses of Juan Carlos Castrejón, Esteban Gutiérrez, and Epal Orléant propose, respectively, algorithms for storing, replicating, and observing the distribution of resources on the cloud.

 


[1] Genoveva Vargas Solar, Christine Collet, Christophe Bobineau, Noha Ibrahim.

[2] For example, the BP-GYO algorithm in the PhD thesis of Victor Cuevas Vicenttín (cf. the COOPIS publications).

[3] Victor Cuevas-Vicenttin, Genoveva Vargas-Solar, Christine Collet, Evaluating Hybrid Queries through Service Coordination in HYPATIA, In Proceedings of the 15th International Conference on Extending Database Technology (EDBT), Berlin, Germany, 2012

[4] Víctor Cuevas-Vicenttín, Christine Collet, Genoveva Vargas-Solar, Noha Ibrahim and Christophe Bobineau, Coordinating services for accessing and processing data in dynamic environments, In Proceedings of the OTM 2010 Conferences, COOPIS 2010, LNCS, 2010

[5] Ionut Subasu, Patrick Ziegler, Klaus R. Dittrich, Towards Service-Based Data Management Systems, BTW Workshops 2007: 296-306

Michael J. Carey, Inside “Big Data Management”: Ogres, Onions, or Parfaits?,  In Proceedings of the 15th International Conference on Extending Database Technology (EDBT), Berlin, Germany, 2012

http://www.systems.ethz.ch/

[6] Javier-A. Espinosa-Oviedo, Genoveva Vargas-Solar, José-Luis Zechinelli-Martini and Christine Collet. Policy driven services coordination for building social networks based applications. In Proc. of the 8th International Conference on Services Computing (SCC’11), Work-in-Progress Track, Washington, DC, USA, 2011.

P.A. Souza Neto, M.A. Musicante, G. Vargas-Solar, and J.L. Zechinelli-Martini, PEWS-CT: Adding Contract Support to a Web Service Composition Language. LTPD 2010, 4th Workshop on Languages and Tools for Multithreaded, Parallel and Distributed Programming. Salvador, Bahia, Brazil, 2010