cloud – Genoveva Vargas-Solar

I recently attended the third edition of the Microsoft Research workshop Cloud Futures at the University of California, Berkeley (http://research.microsoft.com/en-us/events/cloudfutures2012/). Three moments of the workshop stayed with me:

1. The papers on data management issues and how they are addressed in the context of the cloud. In particular, in the talk by Bill Howe, University of Washington[1], I appreciated the parallel drawn between Maslow’s pyramid and data management issues: how they are presented and addressed today, and how he thinks they should be addressed.

A second idea discussed in his paper is that scientists who are not computer science experts spend a significant share of their research time designing data processing algorithms that correspond to operators and/or expressions of classic SQL. People tend to reinvent DBMS functions to meet their data management requirements. This point is open to discussion, and it was debated by the workshop’s audience (Joseph Hellerstein, Manager, Big Science, Google). The feeling is that not everything can be done through SQL expressions (I agree, of course), and that there is room, and a need, for algorithms that cannot be expressed within query languages. Yet, for the aspects concerning data management (sharing, storing, querying), the goal is to understand why solutions are continuously reinvented, and to work out methodologies well suited to the needs of scientific applications and of scientists.
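To make the point about reinvented DBMS functions concrete, here is a minimal sketch in Python. The readings, the station names and both function names are invented for illustration and do not come from the talk. The sketch computes a per-station average twice: once with a hand-written loop, and once with a single SQL GROUP BY query run through the standard sqlite3 module.

```python
import sqlite3
from collections import defaultdict

# Hypothetical measurements: (station, temperature) pairs invented for illustration.
readings = [("A", 10.0), ("B", 20.0), ("A", 14.0), ("B", 22.0), ("C", 5.0)]


def mean_per_station_by_hand(rows):
    """Hand-written grouping and averaging, the kind of code that gets rewritten again and again."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for station, temperature in rows:
        sums[station] += temperature
        counts[station] += 1
    return {station: sums[station] / counts[station] for station in sums}


def mean_per_station_with_sql(rows):
    """The same computation expressed as one GROUP BY query on an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (station TEXT, temperature REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    result = dict(conn.execute(
        "SELECT station, AVG(temperature) FROM readings GROUP BY station"))
    conn.close()
    return result


by_hand = mean_per_station_by_hand(readings)
with_sql = mean_per_station_with_sql(readings)
```

Both variables hold the same mapping from station to average. The declarative version says what is wanted and leaves the how to the engine, which is the argument for not rewriting such operators by hand.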

I believe that the first call is for tools that clearly expose the way data are processed. It is necessary to exhibit the workflows used for dealing with data, starting from raw data collections and leading to collections that scientists can use and share for performing analysis processes that, in turn, lead to new data. These workflows provide support for arguing the validity of the scientific results and conclusions produced through data processing. The role of the cloud lies in the possibility of sharing data (e.g., data markets) and processes. The other call is for ease in defining these workflows. Users and programmers must be able to express their data processing workflows by themselves, without requiring the presence of experienced programmers, as is the case today. Languages like Microsoft’s LINQ and Yahoo’s Pig Latin, which combine imperative expressions with SQL-like ones for dealing with data, are interesting tools to explore for defining cloud-aware data processing tasks[2].
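As a rough illustration of what exposing a workflow can mean, the following Python sketch chains two cleaning steps and records, for each step, its name and the number of rows before and after. The raw collection, the step functions and run_workflow are all hypothetical names chosen for this example. They are not taken from LINQ, Pig Latin or any workshop paper.

```python
# Hypothetical raw collection, invented for illustration.
raw = [{"id": 1, "value": "3.5"}, {"id": 2, "value": ""}, {"id": 3, "value": "4.5"}]


def drop_missing(rows):
    """Cleaning step: discard rows whose value is empty."""
    return [row for row in rows if row["value"] != ""]


def to_float(rows):
    """Transformation step: convert the textual value into a number."""
    return [{**row, "value": float(row["value"])} for row in rows]


def run_workflow(rows, steps):
    """Apply each step in order and log (step name, rows in, rows out)."""
    log = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        log.append((step.__name__, before, len(rows)))
    return rows, log


clean, log = run_workflow(raw, [drop_missing, to_float])
```

The log is the part that matters here: it is a small, shareable trace of how the derived collection was obtained from the raw one, which is what a reader needs to judge the validity of results built on it.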

2. The second relevant scientific component of the workshop was the set of papers on MapReduce, a model that needs to be studied thoroughly, especially for revisiting data processing algorithms. First, I noted the MapReduce models summarized in the paper by Geoffrey Fox, Indiana University[3]. The second interesting work is the Stanford University paper on the limitations of MapReduce[4], which analyses algorithms that need special design under the MapReduce model (e.g., the relational join operator).

I would associate with this paper M. Carey’s EDBT 2012 paper Inside “Big Data Management”: Ogres, Onions, or Parfaits?[5]. I have the feeling that this model needs to be studied and that, in the context of the cloud, the study should be coupled with an economic model that can guide the analysis of its consumption of time and computing resources, and of its economic cost.
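The join operator mentioned above is a good example of an algorithm that has to be redesigned for MapReduce. The sketch below simulates one common design, a reduce-side join, in plain Python on a single machine. It is not the algorithm analysed in the Stanford paper. The relations R and S and all function names are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical relations, invented for illustration: R(key, a) and S(key, b).
R = [(1, "r1"), (2, "r2"), (2, "r3")]
S = [(2, "s1"), (3, "s2"), (2, "s4")]


def map_phase(relation, tag):
    """Emit (join key, (relation tag, value)) so the reducer can tell the sources apart."""
    for key, value in relation:
        yield key, (tag, value)


def shuffle(pairs):
    """Group mapper output by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(key, tagged_values):
    """Pair every R value with every S value that shares the key."""
    r_values = [value for tag, value in tagged_values if tag == "R"]
    s_values = [value for tag, value in tagged_values if tag == "S"]
    for r in r_values:
        for s in s_values:
            yield key, r, s


def reduce_side_join(r, s):
    """Run map, shuffle and reduce in sequence and return the joined tuples, sorted."""
    pairs = list(map_phase(r, "R")) + list(map_phase(s, "S"))
    joined = []
    for key, tagged_values in shuffle(pairs).items():
        joined.extend(reduce_phase(key, tagged_values))
    return sorted(joined)


joined = reduce_side_join(R, S)
```

In this design every tuple of both relations crosses the shuffle, and all tuples sharing a key end up in the same reduce call. That is exactly where questions of time, resource consumption and cost arise, since a frequent key concentrates the work on one reducer.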

3. The requirements of managing and processing huge amounts of data have led to the emergence of the Big Data movement, which is expanding throughout the scientific community. The panel[6] Big Data on Campus: Addressing the Challenges and Opportunities Across Domains (referring to the UC Berkeley campus) left some ideas to think about. The most important is that big data is a multidisciplinary issue that spans the natural, social and human sciences. These sciences have different requirements: cleaning, building and designing databases, running simulations on huge data collections, discovering models, capitalizing on results. Thus, dealing with big data requires scientists to acquire and develop information management abilities and techniques, computer scientists to propose adapted tools, and engineering support to provide the plumbing needed to deploy these tools on platforms that make them available to the community while respecting specific quality of service requirements. At UC Berkeley, Big Data on Campus is a program that touches education and research programs, and it is a collective action.

[1] Bill Howe, Advancing Declarative Query for Data-Intensive Science in the Cloud, In proceedings of the Cloud Futures Workshop 2012, USA

[2] The literature has shown that data cleaning operations can be expressed in SQL-like languages. Some may remember, for instance, the AJaX framework proposed by Helena Galhardas (see H. Galhardas, Data Cleaning and Transformation, Generative and Transformational Techniques in Software Engineering, Lecture Notes in Computer Science, 2006, Volume 4143/2006, 327-343). A list of approaches to data cleaning is available at http://paul.rutgers.edu/~weiz/readinglist.html

Other approaches that aim to integrate data mining operations into relational DBMSs, such as the work of Professor Elena Baralis at Politecnico di Torino (http://dbdmg.polito.it/twiki/bin/view/Public/ElenaBaralis), are examples of this movement, which led to data mining cartridges in commercial systems like Oracle.

[3] Geoffrey Fox, Dennis Gannon, Programming Paradigms for Technical Computing on Clouds and Supercomputers, In proceedings of the Cloud Futures Workshop 2012, USA

[4] Semih Salihoglu, Foto Afrati, Anish Das Sarma, Jeffrey D. Ullman, Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, In proceedings of the Cloud Futures Workshop 2012, USA

[5] Michael J. Carey, Inside “Big Data Management”: Ogres, Onions, or Parfaits? In EDBT Electronic Conference Proceedings, March 26-30, 2012, Berlin, Germany (http://www.edbt.org/Proceedings/2012-Berlin/edbt_toc.html)

[6] Panel Session | Big Data on Campus: Addressing the Challenges and Opportunities Across Domains, Speakers: Cathryn Carson, D-lab; AnnaLee Saxenian, UC Berkeley; Arie Shoshani, Lawrence Berkeley Laboratory; Chair: Michael Franklin