Towards new DBMS architectures: HADAS discussion

Inspired by Michael Stonebraker's keynote at EPFL (http://slideshot.epfl.ch/play/suri_stonebraker), we are opening a discussion about DBMS architectures in the HADAS group.

The objective of the exercise is to work together on the topic and identify the impact of old, new, and upcoming architectures on the group's research and on the members' individual projects. To prepare, some weeks ago we all did the following:

1. Everyone watched the video of M. Stonebraker's keynote
2. We read papers on new perspectives on DBMS architectures by M. Carey, M. Stonebraker, and G. Weikum
3. Everyone prepared answers to the following questions:

  • Do you really feel that we have done everything wrong in our domain? Are there good points you want to support?
  • Does the evolution of DBMS architectures have an impact on your own research? How? Give concrete examples of problems and approaches to illustrate.

The posts that follow show, live, our reactions to this topic.

Data, information and their value: a DB community challenge

Data is all around us, and it seems that, when properly processed, it can lead to valuable information. The data management industry giants must see value in the data that emerges from social networks, public storage systems, and devices, since they are investing in harvesting and storing all this gold hidden behind mountains of gigabytes. They are also investing in public front ends that provide information out of these raw data collections (cf. the SkyServer and ChronoZoom projects). I was impressed by the Google front end able to provide real-time information about the elections in Mexico under a geographic, OLAP-like metaphor. Users could navigate along different aggregation granularities organized geographically: from ballot places in neighborhoods up to a global view! The same should be possible for following other political processes, such as the elections in Egypt and the USA.
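As an aside, the roll-up navigation described above is easy to picture in code. Below is a minimal sketch, in Python, of aggregating vote counts from ballot places up to district and state granularities; all place names and numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical vote counts for one candidate, keyed by
# (state, district, ballot_place); all names and numbers are invented.
votes = {
    ("Jalisco", "Guadalajara-1", "BP-001"): 312,
    ("Jalisco", "Guadalajara-1", "BP-002"): 287,
    ("Jalisco", "Zapopan-2", "BP-003"): 501,
    ("Yucatan", "Merida-1", "BP-004"): 198,
}

def roll_up(level):
    """Aggregate votes at a coarser geographic granularity:
    level=1 groups by state, level=2 by (state, district)."""
    totals = defaultdict(int)
    for key, n in votes.items():
        totals[key[:level]] += n
    return dict(totals)

print(roll_up(2))           # district-level view
print(roll_up(1))           # state-level view
print(sum(votes.values()))  # the single, global figure
```

A real OLAP front end would of course precompute such aggregates and attach geographic coordinates, but the lattice of granularities is the same idea.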

Huge data collections and decision-making apps, prêt-à-porter and free, with nice properties such as freshness, provenance, and availability. Can other companies, governments, and communities still catch the boat and start investing in data harvesting, storage, and delivery? Is there still room for new, amazingly simple and useful ideas? Of course there is! I believe there is even room for doing business through research and development. The important thing is not to build alternatives to the apps that are already there, but to look for the hidden markets that are still open.

New forms of data provision, such as data marketplaces, are starting to gain popularity and momentum. If you have a nice, curated data collection that you are willing to sell, you can offer it in a marketplace under a specific economic model. Curated data can carry added-value properties such as quality, provenance, and reliability, which affect the economics of the data. Database tools that help build and maintain data collections with such properties are still missing. Indeed, building databases and putting them online remains challenging; ongoing Big Data programs are a testimony to it.

The demand is there, the funding is there, and the expertise is there too: the DB community has fundamental research and R&D opportunities for making new decision-making, data analysis, and data mining applications run.

Is the cloud all about XXL?

This entry is inspired by two cloud advertisements (see http://www.youtube.com/watch?v=8aCYZ3gXfy8&feature=youtube_gdata_player and http://www.youtube.com/watch?v=YsVGDUkWTHI), which surprised me in how this notion is being instilled in both specialist and non-specialized audiences.

Most technical and somewhat academic definitions of the cloud include the idea of unlimited access to resources, implying somehow the idea of "big size". Indeed, some communities tend to define the cloud as if it were all about XXL matters, that is, scaling data processing and storage, exploiting computing resources through map-reduce models, load balancing, and high performance computing techniques. Yet, I believe that it is important to go beyond size and define the cloud through other key features. For instance, long data life and flexible persistence, continuous availability of resources, and sharing are three key features that cloud vendors seem to put forward for the general public ("grand public").

The cloud as an infrastructure is of course about exciting computer science challenges, but it can also be used simply as an execution platform, with the ambition of avoiding the burden of managing platform services such as a DBMS, an IDE, or a Web server, and instead using these services online for as long and as much as users need them. There is room for a transparent use of the cloud where all the benefits of having unlimited resources are there to be exploited and where people can concentrate on fulfilling a specific requirement, be it building an application or uploading personal photographs that can be instantly accessed from different devices.

In any case, I believe that technology users, in the different roles they play (online gamers, developers, social contacts, company assets, …), should try the experience of being cloud users and get a first taste of "ubiquitous" access to technological services in both XXL and XXS contexts.

LATAM Faculty Summit 2012 @ Riviera Maya, Mexico

Three main trending topics were addressed at this summit (http://research.microsoft.com/en-US/events/latamfacsum2012/agenda.aspx): (1) human-device interaction, (2) data and information management from different perspectives, and (3) applications. Here are comments on two lectures, full of wisdom, with promising and exciting visions of database research:

Information Management via CrowdSourcing, Hector Garcia-Molina, Professor, Stanford University, United States

Following the wisdom of the crowd, connected and available in all sorts of forums and social networks, it is possible to ask people to contribute answers to questions (i.e., queries): Which is the best seafood restaurant in Riviera Maya? People ready to perform simple tasks, sometimes for a few cents, can participate in answering such a question, for instance by providing their opinion, information, or collective knowledge.

The very simple principle for answering this question is to ask the crowd and then apply a strategy for classifying the opinions and deciding which is the acceptable answer. Another possibility is to combine the crowd's answers with other information coming from classic databases, search engines, or other data providers, and then build an answer (e.g., a ranked set of recommendations).

Interesting research challenges must be addressed if query evaluation is "crowdsourced": (a) the classification of answers, for instance considering their provenance, and the generation of opinion trends; (b) the combination with information from more classic data providers, for instance by redefining some classic relational operators (e.g., join) or perhaps defining new ones; (c) answering queries efficiently (query optimization) despite the fact that answers can arrive continuously over long periods of time; and certainly others, if the problem is studied in its full complexity…
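To make challenge (b) concrete, here is a minimal sketch, in Python, of blending crowd votes with ratings from a classic data source into a ranked answer. The restaurant names, votes, ratings, and the blending weight are all hypothetical; a real system would also weigh worker reliability, provenance, and freshness.

```python
from collections import Counter

# Hypothetical crowd answers to "best seafood restaurant in Riviera Maya?"
crowd_answers = ["El Pescador", "La Playa", "El Pescador", "Mar Azul",
                 "El Pescador", "La Playa"]

# Hypothetical ratings (0..5) from a classic database or search engine
db_ratings = {"El Pescador": 4.2, "La Playa": 4.6, "Mar Azul": 3.9}

def ranked_recommendations(alpha=0.5):
    """Blend the normalized crowd vote share with the normalized
    DB rating; alpha controls the weight given to the crowd."""
    votes = Counter(crowd_answers)
    total = sum(votes.values())
    scores = {}
    for name in set(votes) | set(db_ratings):
        crowd_score = votes.get(name, 0) / total
        db_score = db_ratings.get(name, 0.0) / 5.0
        scores[name] = alpha * crowd_score + (1 - alpha) * db_score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for name, score in ranked_recommendations():
    print(f"{name}: {score:.2f}")
```

The alpha parameter is one naive way to arbitrate between the crowd and the curated source; choosing and adapting such weights, and doing so while answers keep arriving, is precisely where the research challenges above begin.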

Data-Intensive Discoveries in Science: the Fourth Paradigm, Alex Szalay, Alumni Centennial Professor, Department of Physics and Astronomy, The Johns Hopkins University, United States

Long before the emergence of the buzzwords "Big Data" and "Open Data", data experts met scientists and decided to build huge databases and make them available to the community through front ends. Perhaps the best-known examples are the SDSS SkyServer project (http://skyserver.sdss.org/) and the WorldWide Telescope (http://www.worldwidetelescope.org/).

Thanks to these ambitious projects, the notion of the Internet scientist emerged: people started making discoveries by exploiting these databases[1]. The surprise is that non-scientists also started accessing these data as a hobby or for learning purposes. The project, initially led by Jim Gray, has grown and today touches other areas, such as biology, cancer research, the social and human sciences, and even HPC observation (!). The scientific community, senior and young researchers alike, certainly has a role to play in populating and exploiting these democratized deposits of potential knowledge.


[1] The term "Internet scientist" is borrowed from A. Szalay, who used it in his keynote presentation.

Going back to Cloud Futures 2012 @ Berkeley, California


I recently attended the third edition of the Microsoft Research workshop Cloud Futures, held at the University of California, Berkeley (http://research.microsoft.com/en-us/events/cloudfutures2012/). I keep in mind three relevant moments from the workshop:

  1. The papers concerning data management issues and how they are addressed in the context of the cloud. In particular, in the talk by Bill Howe, University of Washington[1], I appreciated the parallel drawn between Maslow's pyramid and data management issues: how they are presented and addressed today, and how he thinks they should be addressed.

The second idea discussed in this talk concerns the fact that scientists, who are not computer science experts, spend a significant share of their research time designing data processing algorithms that correspond to operators and/or expressions of classic SQL. People tend to reinvent DBMS functions to deal with their data management requirements. This can be further discussed, and it was indeed debated by the workshop's audience (Joseph Hellerstein, Manager of Big Science, Google). The feeling is that not everything can be done through SQL expressions (of course I agree), and that there is room and pertinence for algorithms that cannot be infused into query languages. Yet, for the aspects concerning data management (sharing, storing, querying), the idea is to understand why solutions are continuously reinvented, and to figure out methodologies well adapted to scientific applications and scientists' needs.

I believe that the call is for tools that clearly expose the way data are processed. It is necessary to exhibit the workflows used for dealing with data, starting from raw data collections and leading to collections that can be used and shared by scientists for performing analysis processes that, in turn, lead to new data. These workflows provide support for arguing for the validity of the scientific results and conclusions produced through this data processing. The role of the cloud resides in the possibility of sharing data (e.g., data markets) and processes. The other call is for ease in defining these workflows. Users and programmers must be able to express their data processing workflows by themselves, without requiring the presence of experienced programmers, as is the case today. Languages like Microsoft's LINQ and Yahoo's Pig Latin, which combine imperative expressions with SQL-like ones for dealing with data, are interesting tools to explore for defining cloud-aware data processing tasks[2]. A small sketch of this style follows.
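To illustrate that style, here is a small Python sketch that expresses a filter-group-aggregate step declaratively, roughly the query SELECT station, AVG(temp) FROM readings WHERE temp > 19 GROUP BY station, instead of reinventing it as an ad hoc loop. The sensor readings are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean

# Hypothetical sensor readings: (station, temperature)
readings = [("st-1", 21.5), ("st-2", 19.0), ("st-1", 23.1),
            ("st-2", 18.4), ("st-3", 30.2)]

# Roughly: SELECT station, AVG(temp) FROM readings
#          WHERE temp > 19 GROUP BY station
filtered = sorted((r for r in readings if r[1] > 19.0), key=itemgetter(0))
averages = {
    station: mean(t for _, t in group)
    for station, group in groupby(filtered, key=itemgetter(0))
}
print(averages)  # {'st-1': 22.3, 'st-3': 30.2}
```

The point is not this particular snippet but the register: a scientist states what to select, filter, and aggregate, and the operators come from the language rather than being re-implemented for every dataset.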

  2. The second relevant scientific component of the workshop was the papers on map-reduce, a model that needs to be studied in depth, especially for revisiting data processing algorithms. First, I kept in mind the map-reduce models summarized in the paper by Geoffrey Fox, Indiana University[3]. The second interesting work is the analysis presented in the Stanford University paper on the limitations of map-reduce[4], which examines algorithms that need special design under the map-reduce model (e.g., the join relational operator; see the sketch after the next paragraph).

I would associate with this paper M. Carey's EDBT 2012 paper, Inside "Big Data Management": Ogres, Onions, or Parfaits?[5]. I have the feeling that this model needs further study and that, in the context of the cloud, the study should be coupled with an economic model that can guide the analysis of its consumption of time and computing resources, and of its monetary cost.
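For concreteness, here is a minimal in-memory sketch, in Python, of the reduce-side (repartition) join, the canonical example of an operator needing special design under map-reduce. The relations R(a, b) and S(b, c) and their tuples are invented; a real framework would distribute the map, shuffle, and reduce phases across machines, which is exactly where the time, resource, and cost trade-offs mentioned above appear.

```python
from collections import defaultdict

# Hypothetical relations: R(a, b) joined with S(b, c) on attribute b
R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
S = [("b1", "c1"), ("b2", "c2"), ("b2", "c3")]

def map_phase():
    """Emit (join_key, (relation_tag, payload)) pairs for both inputs."""
    for a, b in R:
        yield b, ("R", a)
    for b, c in S:
        yield b, ("S", c)

def shuffle(pairs):
    """Group map output by key, as the framework's shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine every R tuple with every S tuple sharing the key."""
    r_side = [v for tag, v in values if tag == "R"]
    s_side = [v for tag, v in values if tag == "S"]
    for a in r_side:
        for c in s_side:
            yield (a, key, c)

for key, values in shuffle(map_phase()).items():
    for row in reduce_phase(key, values):
        print(row)
```

Tagging each tuple with its relation of origin and buffering one side in the reducer is precisely the kind of redesign, absent from single-node join algorithms, that the limits-of-map-reduce analysis is about.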

  3. The requirements of managing and processing huge amounts of data have led to the emergence of the Big Data movement, which is expanding throughout the scientific community. The panel[6] Big Data on Campus: Addressing the Challenges and Opportunities Across Domains (referring to the UC Berkeley campus) left some ideas to think about. The most important is that big data is a multidisciplinary issue that spans the natural, social, and human sciences, which have different requirements: cleaning, building, and designing databases; running simulations on huge data collections; discovering models; capitalizing on results. Thus, dealing with big data requires scientists to acquire and develop information management abilities and techniques, computer scientists to propose adapted tools, and engineering support to provide the necessary plumbing for deploying these tools on platforms that make them available to the community while respecting specific quality-of-service requirements. At UC Berkeley, Big Data on Campus is a program that touches education and research, and it is a collective effort.



[1] Bill Howe, Advancing Declarative Query for Data-Intensive Science in the Cloud, in Proceedings of the Cloud Futures Workshop 2012, USA.

[2] The literature has shown that data cleaning operations can be expressed in SQL-like languages. Some may remember, for instance, the AJAX framework proposed by Helena Galhardas (see H. Galhardas, Data Cleaning and Transformation, in Generative and Transformational Techniques in Software Engineering, Lecture Notes in Computer Science, vol. 4143, 2006, pp. 327-343). Here is a list of approaches to data cleaning: http://paul.rutgers.edu/~weiz/readinglist.html

Other approaches aiming to integrate data mining operations into relational DBMSs, such as the work of Professor Elena Baralis at Politecnico di Torino (http://dbdmg.polito.it/twiki/bin/view/Public/ElenaBaralis), are examples of this movement, which led to data mining cartridges in commercial systems like Oracle.

[3] Geoffrey Fox, Dennis Gannon, Programming Paradigms for Technical Computing on Clouds and Supercomputers, in Proceedings of the Cloud Futures Workshop 2012, USA.

[4] Semih Salihoglu, Foto Afrati, Anish Das Sarma, Jeffrey D. Ullman, Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, in Proceedings of the Cloud Futures Workshop 2012, USA.

[5] Michael J. Carey, Inside “Big Data Management”: Ogres, Onions, or Parfaits? In EDBT Electronic Conference Proceedings, March 26-30, 2012, Berlin, Germany (http://www.edbt.org/Proceedings/2012-Berlin/edbt_toc.html)

[6] Panel Session | Big Data on Campus: Addressing the Challenges and Opportunities Across Domains, Speakers: Cathryn Carson, D-lab; AnnaLee Saxenian, UC Berkeley; Arie Shoshani, Lawrence Berkeley Laboratory

Chair: Michael Franklin