This entry is inspired by two advertisements for the cloud (see http://www.youtube.com/watch?v=8aCYZ3gXfy8&feature=youtube_gdata_player and http://www.youtube.com/watch?v=YsVGDUkWTHI), which surprised me by the way the notion is being instilled in both specialist and general, non-specialized audiences.
Most technical and more or less academic definitions of the cloud include the idea of unlimited access to resources, which in turn implies “big size”. Indeed, some communities tend to define the cloud as if it were all about XXL matters, that is, scaling data processing and storage, exploiting computing resources through map reduce models, load balancing, and high performance computing techniques. Yet I believe it is important to go beyond size and define the cloud using other key features. For instance, long-lived data with flexible persistency, continuous availability of resources, and sharing are three key factors that cloud vendors seem to use when addressing the general public (“grand public”).
The cloud as an infrastructure is of course about exciting computer science challenges, but it can also be used simply as an execution platform, with the ambition of avoiding the burden of managing other platform services such as a DBMS, an IDE or a Web server, and instead using these services “on-line” for as long and as much as users need them. There is room for a transparent use of the cloud, where all the benefits of having unlimited resources are there to be exploited and where people can concentrate on fulfilling a specific requirement, such as building an application or uploading personal photographs that can be instantly accessed from different devices.
Anyway, I believe that technology users, in the different roles they play (online players, developers, social contacts, company assets, …), should try the experience of being cloud users and get a first taste of “ubiquitous” access to technological services in both XXL and XXS contexts.
I recently attended the third edition of the Microsoft Research workshop Cloud Futures at the University of California, Berkeley (http://research.microsoft.com/en-us/events/cloudfutures2012/). Three moments of the workshop stay with me:
- The papers concerning data management issues and how they are addressed in the context of the cloud. In particular, in the talk by Bill Howe, University of Washington, I appreciated the parallel drawn between Maslow’s pyramid and data management issues: how they are presented and addressed today, and how he thinks they should be addressed.
The second idea discussed in this paper concerns the fact that scientists who are not experts in computer science spend a significant percentage of their research time designing data processing algorithms that correspond to operators and/or expressions of classic SQL. People tend to reinvent DBMS functions to deal with their data management requirements. This point is open to discussion, and it was debated by the workshop’s audience (Joseph Hellerstein, Manager Big Science Google). The feeling is that not everything can be done through SQL expressions (of course I agree), and that there is room for, and relevance in, algorithms that cannot be folded into query languages. Yet, for the aspects concerning data management (sharing, storing, querying), the idea is to understand why solutions are continuously reinvented, and to try to figure out methodologies well adapted to the needs of scientific applications and of scientists.
I believe that the call is for tools that clearly expose the way data are processed. It is necessary to exhibit the workflows used for dealing with data, starting from raw data collections and leading to collections that can be used and shared by scientists for performing analysis processes, which in turn lead to new data. These workflows provide support for arguing the validity of the scientific results and conclusions produced through this data processing task. The role of the cloud resides in the possibility of sharing data (e.g., data markets) and processes. The other call is for ease in defining these workflows. Users and programmers must be able to express their data processing workflows by themselves without requiring the presence of experienced programmers, as is the case today. Languages like MS LINQ and Yahoo’s Pig Latin, which combine imperative expressions with SQL-like ones for dealing with data, are interesting tools to explore in the definition of cloud-aware data processing tasks.
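To make the point about reinvented DBMS functions concrete, here is a minimal sketch in Python. It is not taken from Bill Howe's talk: the station and temperature readings are invented for illustration, and SQLite (via the standard `sqlite3` module) merely stands in for any SQL engine. The first function is the kind of ad hoc grouping loop that tends to be rewritten by hand; the second expresses the same computation as a single declarative `GROUP BY` query.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw observations: (station, temperature in Celsius).
READINGS = [
    ("A", 10.0),
    ("A", 14.0),
    ("B", 20.0),
    ("B", 22.0),
    ("B", 24.0),
]


def mean_by_station_handwritten(readings):
    """Ad hoc grouping and averaging, written out as an explicit loop."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for station, temperature in readings:
        totals[station] += temperature
        counts[station] += 1
    return {station: totals[station] / counts[station] for station in totals}


def mean_by_station_sql(readings):
    """The same computation expressed as a declarative GROUP BY query."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE reading (station TEXT, temperature REAL)")
        connection.executemany("INSERT INTO reading VALUES (?, ?)", readings)
        rows = connection.execute(
            "SELECT station, AVG(temperature) FROM reading GROUP BY station"
        ).fetchall()
    finally:
        connection.close()
    return dict(rows)


if __name__ == "__main__":
    print(mean_by_station_handwritten(READINGS))
    print(mean_by_station_sql(READINGS))
```

Both functions return the same mapping from station to mean temperature. The declarative version also states, in one readable line, exactly how the derived data were obtained, which is the kind of exposure of the processing workflow argued for above.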
- The second relevant scientific component of the workshop was the set of papers on map reduce, a model that needs to be fully studied, especially for revisiting data processing algorithms. First, I kept in mind the map reduce models summarized in the paper by Geoffrey Fox, Indiana University. The second interesting work is the paper from Stanford University on the limitations of map reduce, which analyses algorithms that need a special design under the map reduce model (e.g., the relational join operator).
I would associate with this paper M. Carey’s EDBT 2012 paper Inside “Big Data Management”: Ogres, Onions, or Parfaits?. I have the feeling that there is a need to study this model and, in the context of the cloud, to couple the study with an economic model that can guide the analysis of its consumption of time and computing resources and of its economic cost.
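As an illustration of why the join needs a dedicated design under map reduce, here is a minimal in-memory sketch in Python of the commonly used reduce-side (repartition) join. It is not the analysis from the Stanford paper: the relations `R_ROWS` and `S_ROWS`, the function names, and the single-process "shuffle" are assumptions made for illustration. The count of shuffled pairs returned alongside the result is only a crude proxy for the kind of quantity a cost model might consider.

```python
from collections import defaultdict

# Hypothetical relations: R(author_id, name) and S(author_id, item).
R_ROWS = [(1, "alice"), (2, "bob")]
S_ROWS = [(1, "paper"), (1, "talk"), (3, "poster")]


def map_phase(relation_name, rows, key_index):
    """Emit (join_key, (relation_name, row)) pairs, tagging each row with its origin."""
    for row in rows:
        yield row[key_index], (relation_name, row)


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Combine every R row with every S row that arrived under the same key."""
    left = [row for name, row in values if name == "R"]
    right = [row for name, row in values if name == "S"]
    for left_row in left:
        for right_row in right:
            yield left_row + right_row


def map_reduce_join(r_rows, s_rows, r_key=0, s_key=0):
    """Return (joined rows, number of key-value pairs sent through the shuffle)."""
    pairs = list(map_phase("R", r_rows, r_key)) + list(map_phase("S", s_rows, s_key))
    groups = shuffle(pairs)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_phase(groups[key]))
    return joined, len(pairs)


if __name__ == "__main__":
    rows, shuffled = map_reduce_join(R_ROWS, S_ROWS)
    print(rows)
    print("pairs shuffled:", shuffled)
```

Note that every input row crosses the shuffle, including rows that never find a partner (here, `bob` and `poster`). This is one reason the communication cost of such algorithms, and hence their monetary cost on a pay-per-use platform, deserves study.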
- The requirements of managing and processing huge amounts of data have led to the emergence of the Big Data movement, which is expanding throughout the scientific community. The panel Big Data on Campus: Addressing the Challenges and Opportunities Across Domains (referring to the U. Berkeley campus) left some ideas to think about. The most important is that big data is a multidisciplinary issue that includes the natural, social and human sciences. These disciplines have different requirements: cleaning, building and designing databases, running simulations on huge data collections, discovering models, capitalizing on results. Thus, dealing with big data requires scientists to acquire and develop information management abilities and techniques, and computer scientists to propose adapted tools and the engineering support that provides the necessary plumbing for deploying these tools on platforms that make them available to the community, while respecting specific quality of service requirements. At U. Berkeley, big data on campus is a program that touches both education and research programs, and it is a collective action.
 Bill Howe, Advancing Declarative Query for Data-Intensive Science in the Cloud, In proceedings of the Cloud Futures Workshop 2012, USA
 The literature has shown that data cleaning operations can be expressed in SQL-like languages. Some may remember, for instance, the AJaX framework proposed by Helena Galhardas (see H. Galhardas, Data Cleaning and Transformation, Generative and transformational techniques in software engineering, Lecture Notes in Computer Science, 2006, Volume 4143/2006, 327-343). Here is a list of approaches to data cleaning: http://paul.rutgers.edu/~weiz/readinglist.html
Other approaches that aim to integrate data mining operations into relational DBMS, like the work of Professor Elena Baralis at Politecnico di Torino (http://dbdmg.polito.it/twiki/bin/view/Public/ElenaBaralis), are examples of this movement, which led to data mining cartridges in commercial systems like Oracle.
 Geoffrey Fox, Dennis Gannon, Programming Paradigms for Technical Computing on Clouds and Supercomputers, In proceedings of the Cloud Futures Workshop 2012, USA
 Semih Salihoglu, Foto Afrati, Anish Das Sarma, Jeffrey D. Ullman, Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, In proceedings of the Cloud Futures Workshop 2012, USA
 Panel Session | Big Data on Campus: Addressing the Challenges and Opportunities Across Domains, Speakers: Cathryn Carson, D-lab; AnnaLee Saxenian, UC Berkeley; Arie Shoshani, Lawrence Berkeley Laboratory
Chair: Michael Franklin