1 Understanding the Cloud Computing Landscape
1.7 Discussion
As cloud computing technology continues to emerge, more cloud systems are developed and new concepts are introduced. A fundamental understanding of the extent to which cloud computing inherits its concepts from other computing areas and models is therefore important, both to understand the landscape of this novel computing field and to define its potential and limitations. Such comprehension will facilitate further maturation of the area by enabling novel systems to be put in context and evaluated in the light of existing systems. In particular, an ontological, model-based approach encourages new systems to be compared and contrasted with existing ones, so that their novel aspects are identified more effectively. We contend that this approach will lead to more creative and effective cloud systems and to novel usage scenarios of the cloud. With this in mind, our approach has been to determine the different layers and components that constitute the cloud, and to study their characteristics in light of their dependency on other computing fields and models.
An ontology of cloud computing allows a better understanding of the interrelations between the different cloud components. It enables the composition of new systems from existing components, and the recomposition of current systems from other cloud components, to obtain desirable features such as extensibility, flexibility, and availability, or simply optimization and better cost efficiency. We also postulate that understanding the different components of the cloud allows system engineers and researchers to deal with hard technological challenges. For example, comprehending the relationship between different cloud systems can highlight opportunities to design interoperable systems that span different cloud offerings and provide higher-availability guarantees. Although high availability is one of the fundamental design features of every cloud offering, failures are not uncommon. Highly available cloud applications can be constructed, for example, by deploying them on two competing cloud offerings, e.g., Google’s App Engine [19] and Amazon’s EC2 [8]. Even if one of the two clouds fails, the other cloud will continue to support the availability of the applications. In brief, understanding the cloud components may enable creative solutions to common cloud system problems, such as availability, application migration between cloud offerings, and system resilience.

Furthermore, such understanding conveys the potential of meeting higher-level implementation concepts through interoperability between different systems. For example, the high-availability requirement may be met by formulating an inter-cloud protocol that enables migration and load balancing between cloud systems. Resilience in the cloud can likewise be met through concepts of self-healing and autonomic computing. The broad objective of this classification is to attain a better understanding of cloud computing, to define key issues in current systems, and to highlight some of the research topics that need to be addressed in such systems.
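The two-cloud availability idea discussed above can be illustrated with a minimal sketch. It assumes the same application has been deployed on two independent cloud offerings and that a client tries them in order; the names `CloudUnavailable`, `call_with_failover`, `primary`, and `secondary` are hypothetical and do not correspond to any real provider API.

```python
from typing import Callable, Sequence


class CloudUnavailable(Exception):
    """Raised by a (hypothetical) deployment when its cloud cannot serve a request."""


def call_with_failover(request: str,
                       deployments: Sequence[Callable[[str], str]]) -> str:
    """Try each deployment in order and return the first successful response."""
    errors = []
    for handle in deployments:
        try:
            return handle(request)
        except CloudUnavailable as exc:
            # Remember the failure and fall through to the next cloud.
            errors.append(str(exc))
    raise CloudUnavailable(f"all {len(deployments)} deployments failed: {errors}")


# Stand-ins for the same application deployed on two different clouds.
def primary(request: str) -> str:
    raise CloudUnavailable("primary cloud is down")


def secondary(request: str) -> str:
    return f"secondary served {request}"


result = call_with_failover("GET /status", [primary, secondary])
print(result)
```

In this sketch the failover logic lives in the client. An inter-cloud protocol of the kind mentioned in the text would move that responsibility into the cloud systems themselves.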
An ontology can benefit not only the research community but also the educational effort of teaching cloud computing concepts to students and to new developers of cloud applications. Understanding the implications of developing cloud applications against one cloud layer versus another will equip developers with the knowledge to make informed decisions about their applications’ expected time-to-market, programming productivity, scaling flexibility, and performance bottlenecks. In this regard, an ontology can facilitate the adoption of cloud computing and its evolution. Toward the end goal of a thorough comprehension of the field of cloud computing, we have introduced in this chapter three contemporary cloud computing classifications that present cloud systems and their organization at different levels of detail.
References
1. J. Hofstader. Communications as a service. http://msdn.microsoft.com/en-us/library/bb896003.aspx
2. Apex: Salesforce on-demand programming language and framework. http://developer.force.com/
3. J. Appavoo, V. Uhlig, and A. Waterland. Project kittyhawk: Building a global-scale computer: Blue Gene/P as a generic computing platform. SIGOPS Oper. Syst. Rev., 42(1):77–84, 2008.
4. M. Chau, Z. Huang, J. Qin, Y. Zhou, and H. Chen. Building a scientific knowledge web portal: The nanoport experience. Decis. Support Syst., 42(2):1216–1238, 2006.
5. N. Chohan, C. Bunch, S. Pang, C. Krintz, N. Mostafa, S. Soman, and R. Wolski. AppScale: Scalable and Open AppEngine application development and deployment. Technical Report TR-2009-02, University of California, Santa Barbara, CA, 2009.
6. M. Christie and S. Marru. The LEAD portal: A teragrid gateway and application service architecture: Research articles. Concurr. Comput. Pract. Exp., 19(6):767–781, 2007.
7. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), San Francisco, CA, pp. 137–150, 2004.
8. Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2/
9. EMC Managed Storage Service. http://www.emc.com/
10. Enomalism elastic computing infrastructure. http://www.enomaly.com
11. A. Hanemann et al. PerfSONAR: A service oriented architecture for multi-domain network monitoring. In B. Benatallah et al., editors, ICSOC, Amsterdam, the Netherlands, Lecture Notes in Computer Science, vol. 3826, pp. 241–254. Springer, Berlin, Germany, 2005.
12. R. Wolski et al. Grid resource allocation and control using computational economies. In F. Berman, G. Fox, and A. J. G. Hey, editors, Grid Computing: Making the Global Infrastructure a Reality, pp. 747–772. John Wiley & Sons, Chichester, U.K., 2003.
13. W. Johnston et al. Network communication as a service-oriented capability. In L. Grandinetti, editor, High Performance Computing and Grids in Action, Advances in Parallel Computing, vol. 16, IOS Press, Amsterdam, the Netherlands, March 2008.
14. Eucalyptus. http://eucalyptus.cs.ucsb.edu/
15. I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Int. J. Supercomput. Appl., 11(2):115–128, 1997.
16. D. Gannon et al. Building grid portal applications from a web-service component architecture. Proc. IEEE (Special Issue on Grid Computing), 93(3):551–563, March 2005.
17. D. Gannon, B. Plale, M. Christie, Y. Huang, S. Jensen, N. Liu, S. Marru, S. Pallickara, S. Perera, and S. Shirasuna. Building grid portals for e-science: A service oriented architecture. High Performance Computing and Grids in Action. IOS Press, Amsterdam, the Netherlands, 2007.
18. GoGrid Cloud Center API. http://www.gogrid.com/how-it-works/gogrid-API.php
19. Google App Engine. http://code.google.com/appengine
20. Google Apps. http://www.google.com/apps/business/index.html
21. Hadoop. http://hadoop.apache.org/
22. C. Hoff. Christofer Hoff blog: Rational survivability. http://rationalsecurity.typepad.com/blog/
23. K. L. Jackson. An ontology for tactical cloud computing. http://kevinljackson.blogspot.com/
24. M. Crandell. Defogging cloud computing: A taxonomy, June 16, 2008. http://refresh.gigaom.com/2008/06/16/defogging-cloud-computing-a-taxonomy/
25. Microsoft Connected Service Framework. http://www.microsoft.com/serviceproviders/solutions/connectedservicesframework.mspx
26. Microsoft Azure. http://www.microsoft.com/azure
27. M. Stanley. IBM ink utility computing deal. http://news.cnet.com/2100-7339-5200970.html
28. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In SIGMOD ’08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, pp. 1099–1110, 2008. ACM, New York.
29. Preboot Execution Environment (PXE) Specifications, Intel Technical Report, September 1999.
30. R. W. Anderson. Cloud services continuum, July 3, 2008. http://et.cairene.net/2008/07/03/cloud-services-continuum/
31. R. W. Anderson. The cloud services stack and infrastructure, July 28, 2008. http://et.cairene.net/2008/07/28/the-cloud-services-stack-infrastructure/
32. Amazon Simple Storage Service. http://aws.amazon.com/s3/
33. Salesforce Customer Relationships Management (CRM) system. http://www.salesforce.com/
34. T. Severiens. Physics portals basing on distributed databases. In IuK, Trier, Germany, 2001.
35. P. Smrž and V. Nováček. Ontology acquisition for automatic building of scientific portals. In J. Wiedermann, G. Tel, J. Pokorný, M. Bieliková, and J. Stuller, editors, SOFSEM 2006: Theory and Practice of Computer Science: 32nd Conference on Current Trends in Theory and Practice of Computer Science, pp. 493–500. Springer Verlag, Berlin/Heidelberg, Germany, 2006.
36. D. Thain, T. Tannenbaum, and M. Livny. Distributed Computing in Practice: The Condor Experience. Concurrency and Computation: Practice and Experience, 17(2–4):323–356, 2005.
37. Das U-Boot: The universal boot loader. http://www.denx.de/wiki/U-Boot/WebHome
38. Virtual Workspaces Science Clouds. http://workspace.globus.org/clouds/
Science Gateways: Harnessing Clouds and Software Services for Science
Nancy Wilkins-Diehr, Chaitan Baru, Dennis Gannon, Kate Keahey, John McGee, Marlon Pierce, Rich Wolski, and Wenjun Wu
Contents
2.1 Science Gateways—Background and Motivation
2.2 Clouds and Software Services
2.3 Science Clouds, Public and Private
2.3.1 Eucalyptus—Open-Source IaaS
2.3.2 Engineering Challenge
2.3.3 Eucalyptus Architecture
2.3.4 User Experience
2.3.5 Notes from the Private Cloud
2.3.6 Leveraging the Ecosystem
2.3.7 Future Growth
2.1 Science Gateways—Background and Motivation
Nancy Wilkins-Diehr
The pursuit of science has evolved over hundreds of years from the development of the scientific method to the use of empirical methods. This evolution continues today at an increasingly rapid pace. Scientific pursuit has always been marked by advances in technology. Increasingly powerful microscopes and telescopes have led to new discoveries and theories; access to sensor data improves the ability to analyze and monitor events and to understand complex phenomena, such as climate change; and advances in sequencing technologies will very soon result in personalized medicine.
The evolution of science with technology continues today as well. The 1970s and 1980s saw the significant development of computational power. Computer simulations came to be considered a third pillar of science in addition to theory and experiment. One of the biggest impacts in modern times has been the release of the Mosaic browser in 1993. This ushered in the modern information age and an explosion of knowledge sharing not seen since the invention of the printing press. The impact on science has been tremendous, but we contend that the extent of this impact is just beginning. The availability of digital data continues to grow, and access and sharing mechanisms continue to evolve very quickly. Early Web 3.0 ideas are outlining how we move from information sharing on social Web sites and wikis to programmatic data sharing via standards (Resource Description Framework) and database queries (SPARQL query language) [1].
In the 1990s, scientists were beginning to develop and rely heavily on the Internet and communication technologies. The National Center for Biotechnology Information’s BLAST server provided scientists with an early sequence alignment tool that made use of remote computing capabilities [2]. Queries and results were exchanged via e-mail. This service was later made available on the Web and continues to operate today.
2.4 Cloud Computing for Science
2.4.1 Nimbus Goals and Architecture
2.4.2 Science Clouds Applications
2.4.2.1 Nimbus Helps Meet STAR Production Demands
2.4.2.2 Building a Cloud Computing Ecosystem with CernVM
2.4.2.3 CloudBLAST: Creating a Distributed Cloud Platform
2.5 Gadgets and OpenSocial Containers
2.6 Architecture of an SaaS Science Gateway
2.7 Dynamic Provisioning of Large-Scale Scientific Datasets
2.7.1 Science Gateways for Data
2.7.2 Cloud Computing and Data
2.8 Future Directions
References
In 1995, the headline was “International Protein Data Bank Enhanced by Computer Browser” [3]. The Protein Data Bank (PDB), first established in 1971, is the worldwide repository for three-dimensional structure data of biological macromolecules. Over time, technology developments have changed many aspects of the PDB. Structures are determined by different methods and much more quickly: the number of new structures per year increased by nearly three orders of magnitude from 1976 to 2008. The expectations of the community have changed as well. Text files including structure descriptions were originally available for download via ftp. Today the PDB features sophisticated data mining and visualization capabilities, as well as references to PubMed articles and structure reports [4].
A report from a 1998 workshop entitled Impact of Advances in Computing and Communications Technologies on Chemical Science and Technology [5] takes an early look at the impact of computing and communications technology on science. The authors point out that before the advent of the Internet, the practice of chemistry research had remained largely unchanged. They saw the Internet improving access to scarce instruments and removing the constraints of time and distance previously imposed on potential collaborators. They believed these advances would fundamentally change both the types of scientific problems that can be tackled (the best minds can be brought to bear on the most challenging problems) and the very way in which these problems are addressed. They were accurate in their assessment.
Against this backdrop, the TeraGrid Science Gateway program was initiated in 2003. Previously, supercomputers were accessed by a small number of users who were members of elite research groups. TeraGrid architects recognized that the impact of high-end resources could be greatly increased if they could be coupled onto the back end of existing web portals being developed prolifically by scientists.
Today, gateways span disciplines and provide very diverse capabilities to researchers. The Social Informatics Data Grid (SIDGrid) provides access to multimodal data (voice, video, images, text, numerical) collected at multiple timescales. SIDGrid users are able to explore, annotate, share, and mine expensive data sets with specialized analysis tools. Computationally intensive tasks include media transcoding, pitch analysis of audio tracks, and fMRI image analysis. Researchers use SIDGrid without having to be aware of the computing resources that perform these calculations behind the scenes. PolarGrid provides access to and analysis of ice sheet measurement data collected in Antarctica. Linked Environments for Atmospheric Discovery (LEAD) will allow researchers to launch tornado simulations on demand if incoming radar data display certain characteristics. The Asteroseismic Modeling Portal is ingesting data from NASA’s Kepler satellite mission, which was launched in March 2009. The portal allows researchers to determine the size, position, and age of a star by running intensive simulations that use the observed oscillation modes from satellite data as input. In all of these examples, the gateway interfaces allow scientists to focus on their work while providing the required computing power behind the scenes.
Technology continues to evolve with increasing rapidity. In 2009, cloud computing and “Software as a Service” (SaaS) were examples of virtualized access to high-end resources that enable science. This chapter highlights several activities in these areas, with a focus on the scientific application of the technologies. First, an overview of cloud computing and SaaS is presented. Next, two approaches to cloud deployment (Eucalyptus and Nimbus) are described in some detail. Examples of scientific applications using virtualized services are provided throughout. Finally, several detailed science examples are featured. Scientists can run sequence alignment codes from an iGoogle web page via gadgets provided by the Open Life Sciences Gateway. They have 120 different bioinformatics packages at their fingertips through the RENCI science portal. In both examples, software is offered truly as a service. The back-end high-performance and high-throughput computing, which makes the most rigorous computations possible, is completely hidden from the scientist. The final project looks at data subsetting and database distribution using clouds, with high-resolution topographic data as a driver. Future directions in all areas are summarized at the conclusion of the chapter.
2.2 Clouds and Software Services
Dennis Gannon
The term “cloud computing” means using a remote data center to manage scalable, reliable, on-demand access to applications. The concept has its origins in the early transformation of the World Wide Web from a loose network of simple web servers into a searchable collection of over 100 million Web sites and 25 billion pages of text. The challenge was to build such a searchable index of the Web and to make it usable and completely reliable for tens of thousands of concurrent users. This required massive parallelism to handle user requests and massive parallelism to sort through all that data. It also required both data and computational redundancy to assure the level of reliability demanded by users. To solve this problem, the web search industry had to build a grid of data centers that today have more computing power than our largest supercomputers. The scientists and engineers who were working on improving search relevance algorithms or mining the Web for critical data needed to use these same massively parallel data centers because that is where the data was stored. The most common algorithms they used often followed the “MapReduce” [6] parallel programming pattern. They shared algorithms and designs for distributed, replicated data structures and developed technology that made it simple for any engineer to define a MapReduce application and “upload it to the cloud” to run. Google was the first to use this expression and publicize the idea. Yahoo later released an open-source version of a similar MapReduce framework called Hadoop [7]. Microsoft has a more general technology based on the same concepts called Dryad/LINQ.
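The MapReduce pattern mentioned above can be sketched in a few lines. The example below is a single-process illustration of the three conceptual phases (map, shuffle, reduce) applied to word counting; it is not the Google or Hadoop implementation, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `map_reduce` are chosen here purely for illustration. In a real deployment the map and reduce calls would be distributed across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run the three phases in sequence over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = map_reduce(["the cloud", "the web and the cloud"])
print(counts)
```

Because each map call depends only on its own document and each reduce call only on its own key, both phases can be run in parallel, which is what makes the pattern attractive for the massively parallel data centers described in the text.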
A programming model has evolved that allows a developer to design an application on a desktop and then push it to a data center for deployment and execution. Google has released App Engine, which allows a programmer to build a Python program that accesses the Google distributed cloud storage when pushed to the cloud. Microsoft has introduced Azure, which allows developers to build highly scalable parallel cloud web services. Together, these software frameworks for building applications are referred to as Platform as a Service (PaaS) models for cloud computing.
If we take a closer look at the data center system architecture that lies at the heart of systems like Azure, we see another model of cloud computing based on the use of machine virtualization technology. The most transparent example of this is the Amazon EC2 [8] and S3 [9] clouds. The idea here is very simple. The application developer is given a machine OS image to load with applications and data. The developer hands this loaded image back to EC2 and it is run in a virtual machine (VM) in the Amazon data center. The critical point is that the image may be replicated across multiple VMs so that the application it contains may scale with user demand. The developer is only charged for the resources actually used. In this chapter, we describe several significant variations on this “Infrastructure as a Service” (IaaS) concept.
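The scaling and pay-per-use properties of the IaaS model can be made concrete with a small worked example. The sketch below assumes that each VM replica serves a fixed number of requests per hour and that billing is per VM-hour. Both are simplifying assumptions made here, and the capacity and price figures are invented for illustration, not taken from any provider.

```python
import math


def replicas_needed(requests_per_hour, capacity_per_vm):
    """Number of VM replicas of the image needed for one hour of demand.

    At least one replica is kept running so the application stays reachable.
    """
    return max(1, math.ceil(requests_per_hour / capacity_per_vm))


def usage_cost(hourly_demand, capacity_per_vm, price_per_vm_hour):
    """Total charge when billing covers only the VM-hours actually used."""
    vm_hours = sum(replicas_needed(demand, capacity_per_vm)
                   for demand in hourly_demand)
    return vm_hours * price_per_vm_hour


# Hypothetical demand over three hours, with a surge in the second hour.
demand = [50, 450, 120]
plan = [replicas_needed(d, capacity_per_vm=100) for d in demand]
cost = usage_cost(demand, capacity_per_vm=100, price_per_vm_hour=0.10)
print(plan, cost)
```

Under these assumptions the surge hour is served by five replicas, while the quiet hours are billed for only one or two. A fixed installation sized for the peak would pay for five machines in every hour.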
While IaaS and PaaS form the foundation of the cloud technologies, what the majority of users see is the application on their desktop or phone. The client application may be a web browser or an application that is connected to a set of services running in the cloud. Together, the application and the associated cloud services are often referred to as SaaS. There are many examples. Social networks provide both web and phone clients for their SaaS cloud application. Collaboration and virtual reality are provided in the cloud by Second Life. Photo sharing tools that allow users to upload, store, and tag images are now common features shipped with new phones and cameras. Microsoft’s Live Mesh is a cloud-based software service that allows the files and applications on your PC, laptop, and Mac to be synchronized.
Science gateways are tools that allow scientists to conduct data analysis and simulation studies by using the resources of a remote supercomputer rather than a remote data center. They share many of the scalability and reliability requirements of SaaS tools, but they have the additional requirement that the back-end services must be able to conduct substantial computational analyses that require architectural features not supported by large data centers.
Supercomputers and data centers are very similar in many respects: they are both built from large racks of servers connected by a network. The primary difference is that the network of a supercomputer is designed for extremely low latency messaging to support the peak utilization of each central processing unit (CPU).
Data centers are designed to maximize application bandwidth to remote users and are seldom run at peak processor utilization so that they can accommodate surges in demand. Data center applications are also designed to be continuously running services that never fail and always deliver the same fast response no matter how large