LATAM Faculty Summit 2012@ Riviera Maya, Mexico

Three main topics were addressed at this summit: (1) human–device interaction, (2) data and information management from different perspectives, and (3) applications. Here are comments on two lectures, full of wisdom, with promising and exciting visions of database research:

Information Management via CrowdSourcing, Hector Garcia-Molina, Professor, Stanford University, United States

Following the wisdom of the crowd, connected and available in all sorts of forums and social networks, it is possible to ask people to contribute answers to questions (i.e., queries): which is the best seafood restaurant in Riviera Maya? People ready to perform simple tasks – sometimes for a few cents – can participate in answering such a question, for instance by providing their opinion, information, or collective knowledge.

The very simple principle for answering this question is to ask the crowd and then apply a strategy for classifying the opinions and deciding which is the acceptable answer. Another possibility is to combine the crowd's answers with other information coming from classic databases, search engines, or other data providers and then build an answer (e.g., a ranked set of recommendations).
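As a toy illustration of the "ask the crowd, then classify" strategy, the sketch below aggregates hypothetical worker answers by weighted majority vote. The workers, weights, and restaurant names are invented for the example; a real system would derive weights from worker reliability.

```python
from collections import defaultdict

def crowd_answer(votes, weights=None):
    """Pick the answer with the highest (optionally weighted) vote count.

    votes: list of (worker_id, answer) pairs collected from the crowd.
    weights: optional dict giving each worker a reliability weight.
    Returns the winning answer and its share of the total vote mass.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for worker, answer in votes:
        scores[answer] += weights.get(worker, 1.0)  # unknown workers weigh 1
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())

# Hypothetical crowd responses to "best seafood restaurant in Riviera Maya?"
votes = [("w1", "La Perla"), ("w2", "La Perla"), ("w3", "El Faro")]
answer, confidence = crowd_answer(votes, weights={"w1": 2.0})
```

Here worker `w1` is trusted twice as much as the others, so "La Perla" wins with a 0.75 share of the weighted votes.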

Interesting research challenges must be addressed if query evaluation is "crowdsourced": (a) classifying the answers, for instance by considering their provenance, and generating opinion trends; (b) combining them with information from more classic data providers, for instance by redefining some classic relational operators (e.g., join) or perhaps defining new ones; (c) answering queries efficiently (query optimisation) despite the fact that answers can arrive continuously over long periods of time; and certainly others, if the problem is studied in its whole complexity.
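To make challenge (b) concrete, here is a minimal sketch of a hypothetical "crowd join" operator: a conventional relation is joined with opinions streaming in from workers, aggregated on the fly, and the result is ranked. The function name, schema, and data are all invented for illustration.

```python
def crowd_join(restaurants, crowd_ratings, min_votes=2):
    """Join a classic relation with aggregated crowd opinions.

    restaurants: list of dicts coming from a conventional database.
    crowd_ratings: iterable of (restaurant_name, rating) pairs from workers.
    Only restaurants with at least `min_votes` opinions are kept,
    and the joined result is ranked by average rating.
    """
    agg = {}
    for name, rating in crowd_ratings:
        total, count = agg.get(name, (0.0, 0))
        agg[name] = (total + rating, count + 1)
    joined = []
    for row in restaurants:
        total, count = agg.get(row["name"], (0.0, 0))
        if count >= min_votes:
            joined.append({**row, "avg_rating": total / count, "votes": count})
    return sorted(joined, key=lambda r: r["avg_rating"], reverse=True)

restaurants = [{"name": "La Perla", "cuisine": "seafood"},
               {"name": "El Faro", "cuisine": "seafood"}]
ratings = [("La Perla", 5), ("La Perla", 4), ("El Faro", 3), ("El Faro", 5)]
ranked = crowd_join(restaurants, ratings)
```

The `min_votes` threshold hints at challenge (c): since opinions keep arriving, the operator must decide when an answer is "good enough" to emit.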

Data-Intensive Discoveries in Science: the Fourth Paradigm
Alex Szalay, Alumni Centennial Professor, Department of Physics and Astronomy, The Johns Hopkins University, United States

Long before the emergence of the buzzwords "Big Data" and "Open Data", data experts met scientists and decided to build huge databases and make them available to the community through front ends. Perhaps the best-known examples are the SDSS project and the WorldWide Telescope.

Thanks to these ambitious projects, the notion of the Internet scientist emerged: researchers started making discoveries by exploiting these databases[1]. The surprise is that non-scientists also started accessing these data as a hobby or for learning purposes. The project, initially led by Jim Gray, has grown and today touches other areas, such as biology, cancer research, the social and human sciences, and even HPC observation (!). The scientific community, old and young researchers alike, certainly has a role to play in populating and exploiting these democratized deposits of potential knowledge.

[1] The term Internet scientist is borrowed from A. Szalay, who used it in his keynote presentation.

Going back to Cloud Futures 2012 @ Berkeley, California


I recently attended the third edition of the Microsoft Research workshop Cloud Futures at UC Berkeley. I bear in mind three relevant moments in the workshop:

  1. The papers concerning data management issues and how they are addressed in the context of the cloud. In particular, in the talk by Bill Howe, University of Washington[1], I appreciated the parallel drawn between Maslow’s pyramid and data management issues: how they are presented and addressed today, and how he thinks they should be addressed.

The second idea discussed in this paper concerns the fact that scientists – non-experts in computer science – spend a significant share of their research time designing data processing algorithms that correspond to operators and/or expressions of classic SQL. People tend to reinvent DBMS functions to deal with their data management requirements. This can be further discussed, and was debated by the workshop’s audience (Joseph Hellerstein, Manager, Big Science, Google). The feeling is that not everything can be done through SQL expressions – of course I agree – and that there is room and pertinence for algorithms that cannot be folded into query languages. Yet, for the aspects concerning data management – sharing, storing, querying – the idea is to understand why solutions are continuously reinvented, and to figure out methodologies well adapted to the needs of scientific applications and scientists.
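The pattern above can be shown in a few lines: a hand-rolled grouping loop of the kind scientists often re-implement, next to the single declarative SQL expression that computes the same thing. Python's built-in sqlite3 stands in for a DBMS, and the sensor readings are invented.

```python
import sqlite3

readings = [("sensorA", 10.0), ("sensorA", 14.0), ("sensorB", 7.0)]

# Hand-rolled "GROUP BY ... AVG": the kind of code that reinvents a DBMS operator
sums = {}
for sensor, value in readings:
    total, count = sums.get(sensor, (0.0, 0))
    sums[sensor] = (total + value, count + 1)
by_hand = {s: t / c for s, (t, c) in sums.items()}

# The same computation as one declarative SQL expression
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
by_sql = dict(conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"))
```

Both produce identical averages per sensor; the declarative version also gets the DBMS's optimizer, indexing, and sharing for free.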

I believe the call is for tools that clearly expose the way data are processed. It is necessary to exhibit the workflows used for dealing with data, starting from raw data collections and leading to collections that scientists can use and share for performing analysis processes – which in turn lead to new data. These workflows provide support for arguing for the validity of the scientific results and conclusions produced through this data processing. The role of the cloud lies in the possibility of sharing data (e.g., data markets) and processes. The other call is for ease in defining these workflows. Users and programmers must be able to express their data processing workflows by themselves, without requiring the presence of experienced programmers, as is the case today. Languages like Microsoft’s LINQ and Yahoo’s Pig Latin, which combine imperative expressions with SQL-like ones for dealing with data, are interesting tools to explore for defining cloud-aware data processing tasks[2].
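To give a rough flavour of that LINQ / Pig Latin style – imperative code interleaved with declarative, SQL-like steps – here is a toy pipeline written with Python generators. The dataset and field names are invented; each step mirrors a Pig Latin operator named in the comments.

```python
# A Pig-Latin-flavoured pipeline: load -> filter -> project -> group/count,
# each stage a small declarative step over the previous one.
records = [
    {"species": "coral", "site": "reef1", "depth": 12.0},
    {"species": "coral", "site": "reef2", "depth": 30.0},
    {"species": "algae", "site": "reef1", "depth": 5.0},
]

shallow = (r for r in records if r["depth"] < 20.0)       # FILTER ... BY
projected = ((r["species"], r["site"]) for r in shallow)  # FOREACH ... GENERATE

counts = {}                                               # GROUP ... + COUNT
for species, _site in projected:
    counts[species] = counts.get(species, 0) + 1
```

The point is that a scientist can read each stage as a sentence about the data, while the runtime (here trivially, in Pig massively parallel) decides how to execute it.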

  2. The second relevant scientific component of the workshop was the papers on MapReduce, a model that needs to be fully studied, especially for revisiting data processing algorithms. First, I kept in mind the MapReduce models summarized in the paper by Geoffrey Fox, Indiana University[3]. The second interesting work is the analysis presented in the Stanford University paper on the limitations of MapReduce[4], which examines algorithms that need special design under the MapReduce model (e.g., the relational join operator).

I would associate with this paper M. Carey’s EDBT 2012 paper Inside “Big Data Management”: Ogres, Onions, or Parfaits?[5]. I have the feeling that there is a need to study this model and, in the context of the cloud, to couple the study with an economic model that can guide the analysis of its consumption of time and computing resources and its economic cost.
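To see why join needs special design under MapReduce, here is a minimal pure-Python sketch of the classic reduce-side join: each relation's tuples are mapped to (key, value) pairs tagged by origin, the shuffle groups them by join key, and the reducer crosses the two sides within each group. The relations and names are invented for the example.

```python
from collections import defaultdict
from itertools import product

def reduce_side_join(left, right):
    """Reduce-side equi-join: tuples are grouped by join key (the shuffle),
    then each reducer pairs values coming from the two relations."""
    shuffled = defaultdict(lambda: ([], []))
    for key, value in left:                  # "map" phase, relation L
        shuffled[key][0].append(value)
    for key, value in right:                 # "map" phase, relation R
        shuffled[key][1].append(value)
    joined = []
    for key, (lvals, rvals) in shuffled.items():  # "reduce" phase
        for lv, rv in product(lvals, rvals):
            joined.append((key, lv, rv))
    return joined

users = [(1, "ana"), (2, "luis")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
result = reduce_side_join(users, orders)
```

The cost issue the papers raise is visible even here: the entire smaller relation is replicated through the shuffle, and skewed keys concentrate work (and, in a pay-per-use cloud, money) on a few reducers.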

  3. The requirements of managing and processing huge amounts of data have led to the emergence of the Big Data movement, which is expanding throughout the scientific community. The panel[6] Big Data on Campus: Addressing the Challenges and Opportunities Across Domains (referring to the UC Berkeley campus) left some ideas to think about. The most important is that big data is a multidisciplinary issue that includes the natural, social, and human sciences, each with different requirements: cleaning, building, and designing databases; running simulations on huge data collections; discovering models; capitalizing on results. Thus, dealing with big data requires scientists to acquire and develop information management abilities and techniques, computer scientists to propose adapted tools, and engineering support to provide the necessary plumbing for deploying these tools on platforms that make them available to the community while respecting specific quality-of-service requirements. At UC Berkeley, big data on campus is a program that touches education and research, and it is a collective action.


[1] Bill Howe, Advancing Declarative Query for Data-Intensive Science in the Cloud, In proceedings of the Cloud Futures Workshop 2012, USA

[2] The literature has shown that data cleaning operations can be expressed by SQL-like languages. Some may remember, for instance, the AJAX framework proposed by Helena Galhardas (see H. Galhardas, Data Cleaning and Transformation, Generative and Transformational Techniques in Software Engineering, Lecture Notes in Computer Science, Volume 4143, 2006, pp. 327–343).

Other approaches aiming to integrate data mining operations into relational DBMSs, like that of Professor Elena Baralis at Politecnico di Torino, are examples of this movement, which led to data mining cartridges in commercial systems like Oracle.

[3] Geoffrey Fox, Dennis Gannon, Programming Paradigms for Technical Computing on Clouds and Supercomputers, in Proceedings of the Cloud Futures Workshop 2012, USA

[4] Semih Salihoglu, Foto Afrati, Anish Das Sarma, Jeffrey D. Ullman, Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, in Proceedings of the Cloud Futures Workshop 2012, USA

[5] Michael J. Carey, Inside “Big Data Management”: Ogres, Onions, or Parfaits? In EDBT Electronic Conference Proceedings, March 26-30, 2012, Berlin, Germany

[6] Panel Session | Big Data on Campus: Addressing the Challenges and Opportunities Across Domains, Speakers: Cathryn Carson, D-lab; AnnaLee Saxenian, UC Berkeley; Arie Shoshani, Lawrence Berkeley Laboratory

Chair: Michael Franklin