Computing for DOE’s
Cloud computing is gaining traction in the commercial world, with companies like Amazon, Google,
and Yahoo offering pay-to-play cycles to help organizations meet cyclical demands for extra
computing power. But can such an approach also meet the computing and data storage demands
of the nation’s scientific community?
A new $32 million program funded by the American Recovery and Reinvestment Act through the U.S. Department of Energy (DOE) will examine cloud computing as a cost-effective and energy-efficient computing paradigm for mid-range science users to accelerate discoveries in a variety of disciplines, including analysis of scientific datasets in biology, climate change, and physics.
DOE is a world leader in providing high-performance computing resources for science. The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL) supports the high-end computing needs of over 3,000 DOE Office of Science researchers, and the Leadership Computing Facilities at Argonne and Oak Ridge National Laboratories serve the largest-scale computing projects across the broader science community through the Innovative and Novel Computational Impact on Theory and Experiment Program (INCITE). The focus of these facilities is on providing access to some of the world’s most powerful supercomputing systems that are specifically designed for high-end scientific computing.
Interestingly, some of the science demands for DOE computing resources do not require the scale of these well-balanced petascale machines. A great deal of computational science today is conducted on personal laptops or desktop computers or on small private computing clusters set up by individual researchers or small collaborations at their home institution. Local clusters have also been ideal for researchers who co-design complex problem-solving software infrastructures for the platforms in addition to running their simulations. Users with computational needs that fall between desktop and petascale systems are often referred to as “mid-range” and are the target for Magellan cloud projects.
In the past, mid-range users were enticed to set up their own purpose-built clusters for developing codes, running custom software, or solving computationally inexpensive problems because hardware has been relatively cheap. However, the costs of ownership, including ever-rising energy bills, space constraints for hardware, ongoing software maintenance, security, operations, and a variety of other expenses, are forcing mid-range researchers and their funders to look for more cost-efficient alternatives. Some experts suspect that cloud computing may be a viable solution.
Cloud computing refers to a flexible model for on-demand access to a shared pool of configurable computing resources (such as networks, servers, storage, applications, services, and software) that can be easily provisioned as needed. Cloud computing centralizes the resources to gain efficiency of scale and permit scientists to scale up to solve larger science problems while still allowing the system software to be configured as needed for individual application requirements. To test cloud computing for scientific capability, NERSC and the Argonne Leadership Computing Facility (ALCF) will install similar mid-range computing hardware, but will offer different computing environments (figure 1). The systems will create a cloud test bed that scientists can use for their computations while also testing the effectiveness of cloud computing for their particular research problems.
Since the project is exploratory, it has been named Magellan in honor of the Portuguese explorer who led the first effort to sail around the globe. It is also named for the Magellanic Clouds, which are the two closest galaxies to our Milky Way and visible from the Southern Hemisphere.
What is Cloud Computing?
In the report “Above the Clouds: A Berkeley View of Cloud Computing” (see Further Reading), a team of luminaries from the Electrical Engineering and Computer Sciences Department at the University of California–Berkeley noted that cloud computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services. The services themselves have long been referred to as software as a service (SaaS). The datacenter hardware and software is referred to as a cloud.
When a cloud is made available in a pay-as-you-go manner to the general public, it is a public cloud; the service being sold is utility computing. Current examples of public utility computing include Amazon Web Services (AWS), Google AppEngine, and Microsoft Azure. As a successful example, Elastic Compute Cloud (EC2) from AWS sells 1.0 GHz x86 ISA slices, or instances, for $0.10 per hour, and a new instance can be added in two to five minutes. An instance is the allocated memory and collection of processes running on the server. Meanwhile, Amazon’s Simple Storage Service (S3) charges $0.12 to $0.15 per gigabyte-month, with additional bandwidth charges of $0.10 to $0.15 per gigabyte to move data into and out of AWS over the Internet.
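The pay-as-you-go prices quoted above can be combined into a rough cost estimate. The following is a minimal Python sketch, not an official pricing tool: the rates are the figures given in this article, while the workload (100 instances for 24 hours, 500 GB stored for one month, 500 GB transferred) and the names `CloudRates` and `estimate_cost` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudRates:
    """Prices quoted in the article, in US dollars."""
    instance_hour: float = 0.10               # EC2, per instance-hour
    storage_gb_month: tuple = (0.12, 0.15)    # S3, low and high rate
    transfer_gb: tuple = (0.10, 0.15)         # bandwidth in/out, low and high rate


def estimate_cost(instances, hours, stored_gb, months, moved_gb, rates=CloudRates()):
    """Return a (low, high) dollar estimate for a hypothetical workload."""
    compute = instances * hours * rates.instance_hour
    low = (compute
           + stored_gb * months * rates.storage_gb_month[0]
           + moved_gb * rates.transfer_gb[0])
    high = (compute
            + stored_gb * months * rates.storage_gb_month[1]
            + moved_gb * rates.transfer_gb[1])
    return low, high


if __name__ == "__main__":
    # Hypothetical workload: 100 instances for one day, 500 GB stored and moved.
    low, high = estimate_cost(instances=100, hours=24,
                              stored_gb=500, months=1, moved_gb=500)
    print(f"${low:,.2f} to ${high:,.2f}")  # prints "$350.00 to $390.00"
```

In this example the compute charge ($240) dominates, but for data-heavy workloads the storage and transfer terms grow with the dataset size rather than with the run time.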
The advantages of SaaS to both end users and service providers are well understood. Service providers enjoy greatly simplified software installation and maintenance and centralized control over versioning; end users can access the service “anytime, anywhere,” share data and collaborate more easily, and keep their data stored safely in the infrastructure. Cloud computing does not change these arguments, but it does give more application providers the choice of deploying their product as SaaS without provisioning a datacenter: just as the emergence of semiconductor foundries gave chip companies the opportunity to design and sell chips without owning a semiconductor fabrication plant, cloud computing allows deploying SaaS, and scaling on demand, without building or provisioning a datacenter.
Figure 1. Cloud control. The Magellan management and network control racks at NERSC. To test cloud computing for scientific capability, NERSC and the Argonne Leadership Computing Facility (ALCF) installed purpose-built test beds for running scientific applications on the IBM iDataplex cluster.
Mid-Range Users on a Cloud
Realizing that not all research applications require petascale computing power, the Magellan project will explore several areas:
● Understanding which science applications and user communities are best suited for cloud computing (sidebar “Metagenomics on a Cloud?”)
● Understanding the deployment and support issues required to build large science clouds. Is it cost effective and practical to operate science clouds? How could commercial clouds be leveraged?
● How does existing cloud software meet the needs of science, and could extending or enhancing current cloud software improve utility?
● How well does cloud computing support data-intensive scientific applications?
● What are the challenges to addressing security in a virtualized cloud environment?
By installing the Magellan systems (sidebar “Magellan Hardware”) at two of DOE’s leading computing centers, the project will leverage staff experience and expertise as users put the cloud systems through their paces. The Magellan test bed will consist of cluster hardware built on IBM’s iDataplex chassis and based on Intel’s Nehalem CPUs and QDR InfiniBand interconnect (figure 3). Total computer performance across both sites will be on the order of 100 teraflop/s.
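The “on the order of 100 teraflop/s” total can be cross-checked against the per-site figures given in the Magellan Hardware sidebar (720 nodes and 61.5 teraflop/s peak at NERSC, 504 nodes and 43 teraflop/s peak at Argonne). The Python sketch below is a back-of-the-envelope calculation only. Eight cores per node follows from the sidebar’s node and core counts; the 2.67 GHz clock speed and the 4 double-precision floating-point operations per core per cycle are assumptions that are not stated in the article.

```python
def peak_teraflops(nodes, cores_per_node=8, clock_ghz=2.67, flops_per_cycle=4):
    """Theoretical peak in teraflop/s.

    cores_per_node comes from the sidebar (5,760 cores / 720 nodes);
    clock_ghz and flops_per_cycle are assumptions, not article figures.
    """
    gigaflops = nodes * cores_per_node * clock_ghz * flops_per_cycle
    return gigaflops / 1000.0


if __name__ == "__main__":
    nersc = peak_teraflops(720)
    alcf = peak_teraflops(504)
    print(f"NERSC: {nersc:.1f} TF, ALCF: {alcf:.1f} TF, total: {nersc + alcf:.1f} TF")
```

Under these assumptions the two sites come out at roughly 61.5 and 43.1 teraflop/s, which agrees with the quoted figures and sums to a little over 100 teraflop/s.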
Magellan Hardware
This purpose-built test bed for running scientific applications will be built on the IBM iDataplex chassis. Based on InfiniBand technology, the system will offer high density with front-access cabling and will be liquid-cooled using rear-door heat exchangers (figure 2). Total computer performance across both sites will be on the order of 100 teraflop/s. The NERSC portion of the system will include:
● 61.5 teraflop/s peak performance
● 720 compute nodes (5,760 cores) with Intel Nehalem quad-core processors
● 21.1 TB DDR3 memory
● QDR InfiniBand fabric
Meanwhile, the Argonne system will have:
● 43 teraflop/s peak performance
● 504 compute nodes (4,032 cores) with Intel Nehalem quad-core processors
● 12 TB memory
● QDR InfiniBand fabric
R. Kaltschmidt, LBNL
Figure 2. Staying cool. By building the Magellan test bed at NERSC on IBM’s iDataplex chassis, the facility can take advantage of the machine’s innovative half-depth design and liquid-cooled door, reduce cooling costs by as much as half, and reduce floor space requirements by 30%. The orange tubes in the picture will carry coolant to chill the system.
Figure 3. Magellan systems at both NERSC and the ALCF will be built using QDR InfiniBand fabric like the one pictured here.
Researchers at ALCF and NERSC will look into the Eucalyptus toolkit, an open-source package that is compatible with Amazon Web Services, as a potential tool for allocating Linux virtual machine images. In addition, the teams researching Magellan’s suitability will also investigate the performance of Apache’s Hadoop and Google’s MapReduce, two related software frameworks that deal with large distributed datasets. Currently, one of the challenges in building a private cloud is the lack of software standards. Although these frameworks are not widely supported at traditional supercomputing facilities, large distributed datasets are a common feature of many scientific codes and are natural fits for cloud computing. The team will also be experimenting with other commercial cloud offerings such as those from Amazon, Google, and Microsoft.
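For readers unfamiliar with the MapReduce programming model that Hadoop implements, the following minimal Python sketch shows its three steps (map, shuffle, reduce) running in a single process. It is an illustration of the model only and does not use the Hadoop API. The function names and the example task, counting 3-letter subsequences in short DNA reads, are hypothetical.

```python
from collections import defaultdict
from itertools import chain

K = 3  # window length for the illustrative task


def map_phase(record):
    """Map: emit one (key, 1) pair per K-letter window in a record."""
    return [(record[i:i + K], 1) for i in range(len(record) - K + 1)]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce over an in-memory list of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(run_job(["ACGT", "CGTA"]))  # {'ACG': 1, 'CGT': 2, 'GTA': 1}
```

Because each record is mapped independently and each key is reduced independently, a real framework can spread both phases across many nodes holding different parts of a large dataset.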
By making Magellan available to a wide range of DOE science users, the researchers will be able to analyze the suitability of a cloud model across the broad spectrum of the DOE science workload. They will also use performance-monitoring software to analyze what kinds of science applications are being run on the system and how well they perform on a cloud. The science users will play a key role in this evaluation as they bring a very broad scientific workload into the equation and will help the researchers learn which features are important to the scientific community.
Data Storage and Networking
To address the challenge of analyzing the massive amounts of data being produced by scientific instruments ranging from powerful telescopes photographing the Universe to gene sequencers unraveling the genetic code of life, the Magellan test bed will also provide a storage cloud with a little over a petabyte of capacity.
Metagenomics on a Cloud?
One goal of the Magellan project is to understand which science applications and user communities are best suited for cloud computing, but some DOE researchers have already given public clouds a whirl.
For example, Jared Wilkening, a software developer at Argonne National Laboratory, recently tested the feasibility of employing Amazon EC2 to run a BLAST-based metagenomics application. Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. By identifying and understanding bacterial species based on sequence similarity, some researchers hope to put microbial communities to work mitigating global warming and cleaning up toxic waste sites, among other tasks.
BLAST is the community standard for sequence comparison. It enables researchers to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.
Wilkening notes that BLAST-based codes, like the one he used on Amazon EC2, are perfect for cloud computing because they require little internal synchronization and therefore do not rely on high-performance interconnects. Nevertheless, the study’s conclusion was that Amazon is significantly more expensive than locally owned clusters, due mainly to EC2’s inferior CPU hardware and the premium cost associated with on-demand access, although increased demand for compute-intensive workloads could change that.
Wilkening’s paper was published in Cluster 2009, and slides are available at:
Figure 4. Metagenomics is the study of genetic material recovered directly from environmental samples.
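The “Metagenomics on a Cloud?” sidebar describes BLAST as comparing a query sequence against a library and reporting the library sequences that resemble the query above a threshold. The Python sketch below illustrates that query/library/threshold pattern with a deliberately simple score, the fraction of the query’s 3-letter subsequences that also occur in a library sequence. This is not BLAST’s algorithm, which uses seeded alignment heuristics and statistical scoring; the function names, example sequences and threshold are all hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all k-letter subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(query, subject, k=3):
    """Toy score: fraction of the query's k-mers that also occur in subject."""
    q = kmers(query, k)
    if not q:
        return 0.0
    return len(q & kmers(subject, k)) / len(q)


def search(query, library, threshold=0.5, k=3):
    """Return (name, score) for library entries scoring at or above threshold."""
    hits = [(name, similarity(query, seq, k)) for name, seq in library.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])


if __name__ == "__main__":
    library = {"a": "TTACGTACGG", "b": "ACGTTT", "c": "GGGGGG"}
    print(search("ACGTAC", library))  # [('a', 1.0), ('b', 0.5)]
```

Each library sequence is scored independently of the others, so the work can be split across machines with no communication until the results are collected. That independence is the property Wilkening credits for making BLAST-based codes a good fit for clouds without high-performance interconnects.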
The NERSC Global File (NGF) system will provide most storage needs for projects running on the NERSC portion of the Magellan system. Approximately 1 PB of storage and 25 gigabits per second (Gbps) of bandwidth have been added to support use by the test bed. Archival storage needs will be satisfied by NERSC’s High Performance Storage System (HPSS) archive, which is being increased by 15 PB in capacity. Meanwhile, the Magellan system at ALCF will have 250 TB of local disk storage on the compute nodes and an additional 25 TB of global disk storage on the GPFS system.
NERSC will make the Magellan storage available to science communities using a set of servers and software called Science Gateways, as well as experiment with Flash memory technology to provide fast random access storage for some of the more data-intensive problems. Approximately 10 TB will be deployed in NGF for high-bandwidth, low-latency storage class and metadata acceleration. Around 16 TB will be deployed as local SSD in one SU for data analytics, local read-only data and local temporary storage. Approximately 2 TB will be deployed in HPSS. The ALCF will provide active storage, using Hadoop over PVFS, on approximately 100 compute/storage nodes. This active storage will increase the capacity of the ALCF Magellan system by approximately 30 TF of compute power, along with approximately 500 TB of local disk storage and 10 TB of local SSD.
The NERSC and ALCF facilities will be linked by a groundbreaking 100 Gbps network, developed by DOE’s Energy Sciences Network (ESnet) with funding from the American Recovery and Reinvestment Act. Such high bandwidth will facilitate rapid transfer of data between geographically dispersed clouds and enable scientists to use available computing resources regardless of location.
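To give a sense of what a 100 Gbps link means for datasets of the size discussed here, the following Python sketch computes an idealized transfer time. It assumes decimal units (1 PB = 10^15 bytes) and, by default, a fully utilized link with no protocol overhead. The `efficiency` parameter and the function name are hypothetical additions for illustration; real transfers would be slower.

```python
PB = 10**15  # bytes in a petabyte, assuming decimal units


def transfer_hours(data_bytes, link_gbps, efficiency=1.0):
    """Idealized time in hours to move data_bytes over a link of link_gbps.

    efficiency is an assumed fraction of the link actually used (0 to 1).
    """
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    for gbps in (10, 100):
        print(f"1 PB at {gbps} Gbps: {transfer_hours(PB, gbps):.1f} hours")
```

Under these assumptions, moving 1 PB takes about 22 hours at 100 Gbps, compared with more than nine days at 10 Gbps.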
The Magellan program will run for two years, and the initial clusters will be installed in the next few months. At NERSC, installation was slated to begin in November 2009, with early users getting access in December. The NERSC system (figures 5 and 6) was slated to go into production use in mid-January 2010. At ALCF, installation was planned to begin in January 2010, with early users gaining access in February and the system opening up for full access in March.
Contributors: Horst Simon, Kathy Yelick, Jeff Broughton, Brent Draney, Jon Bashor, David Paul, and Linda Vu from NERSC at LBNL; Pete Beckman, Susan Coghlan, and Eleanor Taylor from ALCF at Argonne National Laboratory.
Further Reading
Above the Clouds: A Berkeley View of Cloud Computing
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
Figure 5. Main system console for Magellan at NERSC.
Figure 6. Networking. When completed, the Magellan system at NERSC will be interconnected using QDR InfiniBand, 10 Gbps Ethernet, multiple 1 Gbps Ethernet, and 8 Gbps Fibre Channel SAN.