Architecting Scientific Data Systems in the Cloud
2.4 Comparing Cloud Computing Implementations Across Different Scientific Data Systems
2.4.2 Cloud Computing Architecture for Lunar Mapping and Mission Program
LMMP (Lunar Mapping and Modeling Portal) is tasked by NASA’s Human Exploration and Operations Mission Directorate (HEOMD) to develop a single, common, consistent, uniform, and intuitive NASA web portal for users to access lunar mapping and modeling products, tools, and data. It provides valuable information to scientists, researchers, and engineers in the planning of future lunar missions, aiding in tasks such as evaluating and selecting potential landing sites, identifying locations to place rovers and other assets, and developing computer systems to navigate the surface and perform advanced scientific analysis. It has a secondary purpose as a tool to support the dissemination of lunar data to future missions and projects, international partners, commercial entities, education, and the general public. The data sources for LMMP range from historical missions, such as Apollo, Lunar Orbiter, Clementine, Lunar Prospector, and Earth-based observations, to the recent Lunar Reconnaissance Orbiter (LRO) and the Lunar Crater Observation and Sensing Satellite (LCROSS) missions. LMMP currently houses over 2 TB of lunar data, including image mosaics, digital elevation maps, mineralogy maps, lighting models, gravity models, and thermal models.
LMMP contains both publicly released data sets and private, embargoed information. When new data is received from spacecraft, it is typically embargoed for several months so that principal investigators and sponsoring scientists have the opportunity to perform research and publish findings before the data is disseminated to other scientists and the general public. Although it stores historical products and these new products in the same infrastructure, LMMP is required to segregate and control access to sensitive data. This was a major consideration in utilizing cloud storage and computing. Our solution was to place only publicly released data sets on the cloud while keeping sensitive data at JPL, protected behind a login-based security model. We were nevertheless able to integrate the cloud-hosted and JPL-hosted data sets in one unified user interface, so that users were not specifically aware of the source of the data, yet would notice more data sets available to them once they logged in.
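The placement and visibility policy described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `DataSet` class, the `host` property, the `visible_datasets` function, and the data set names are hypothetical and are not taken from the LMMP code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataSet:
    """Hypothetical catalog entry; the field names are illustrative only."""
    name: str
    embargoed: bool

    @property
    def host(self) -> str:
        # Policy from the text: only publicly released data goes to the cloud;
        # embargoed data stays at JPL behind the login-based security model.
        return "jpl" if self.embargoed else "cloud"


def visible_datasets(catalog: List[DataSet], logged_in: bool) -> List[DataSet]:
    """Return one merged listing, hiding embargoed entries from anonymous users."""
    return [d for d in catalog if logged_in or not d.embargoed]


# Placeholder entries, not real LMMP product names.
catalog = [DataSet("released_mosaic", embargoed=False),
           DataSet("new_elevation_map", embargoed=True)]
anonymous_view = visible_datasets(catalog, logged_in=False)
authenticated_view = visible_datasets(catalog, logged_in=True)
```

The listing the user sees carries no indication of where each entry is hosted; the `host` value would matter only to the server that resolves a request.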
Once we identified the data that we could put on the cloud, we looked into how we could utilize cloud services for our needs (Fig. 2.5).
2.4.2.1 Data Processing
One of the first uses of cloud computing in LMMP was for creating image tiles from large image mosaics. LMMP archives images and digital elevation maps that are several gigapixels in resolution and gigabytes in size. A global image mosaic from the Clementine spacecraft has a resolution of 92,160 × 46,080 pixels (about 4.2 gigapixels). A mosaic from the Apollo 15 metric camera has a file size of 3.3 GB. While modern desktop computers, with some effort, can display these images as a whole in their native resolutions, older machines and resource-constrained devices such as mobile phones and tablets do not have the processing power to render the images. If the image is not stored on a local file system, the image data has to be transferred from a remote machine, which adds a dependency on network bandwidth. To work around these limitations, the image has to be broken up into smaller parts. This process is called image tiling, or building a tile pyramid. The goal is to create small image tiles, typically a few hundred pixels high and wide (512 × 512 in LMMP), and to send those tiles to the client as needed, depending on the location and the zoom level that the client is viewing. This method of presenting image data is very common in mapping software and is used in many commercial mapping packages.
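To make the "send tiles as needed" idea concrete, the following minimal sketch computes which tiles overlap a client's viewport at a given pyramid level. The 512-pixel tile size comes from the text; the function name, the (column, row) indexing from the top-left corner, and the pixel-coordinate viewport are assumptions made for illustration, not a description of LMMP's actual tile addressing.

```python
from typing import List, Tuple

TILE_SIZE = 512  # tile edge in pixels, as stated for LMMP


def visible_tiles(view_x: int, view_y: int, view_w: int, view_h: int,
                  level_w: int, level_h: int,
                  tile_size: int = TILE_SIZE) -> List[Tuple[int, int]]:
    """Return (column, row) indices of the tiles that overlap a viewport.

    The viewport is given in pixel coordinates of the pyramid level being
    viewed; level_w and level_h are that level's pixel dimensions.
    """
    # Clamp the viewport to the image bounds at this level.
    x0, y0 = max(0, view_x), max(0, view_y)
    x1 = min(level_w, view_x + view_w)
    y1 = min(level_h, view_y + view_h)
    if x0 >= x1 or y0 >= y1:
        return []  # viewport lies entirely outside the image

    first_col, last_col = x0 // tile_size, (x1 - 1) // tile_size
    first_row, last_row = y0 // tile_size, (y1 - 1) // tile_size
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


# A 1024 x 768 viewport at offset (1000, 500) on the full-resolution
# Clementine mosaic (92,160 x 46,080 pixels).
needed = visible_tiles(1000, 500, 1024, 768, 92160, 46080)
```

For this viewport only 9 tiles are needed, out of the 180 × 90 = 16,200 tiles that cover the full-resolution level, which is why tiling removes the need to move the whole mosaic over the network.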
The process of tiling is straightforward but computationally intensive. It involves subsetting a portion of the original image and performing a bicubic resizing of the subset to a standard tile size. At the lowest level of this tile pyramid, the tiles are at (or very close to) a 1:1 mapping to the original image. At each subsequent level, the tiles are regenerated at ¼ the size of the previous level until the last level, where the image is contained in exactly one tile. Although the process appears sequential, image tiling is highly parallelizable, and cloud computing enabled us to distribute this image processing across multiple machines.
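The shape of such a pyramid can be worked out with simple arithmetic. The sketch below assumes that "¼ size" means halving both the width and the height at each level, and that partial tiles at the right and bottom edges count as tiles; the function name and the rounding rule are illustrative choices, not details confirmed by the text.

```python
from typing import List, Tuple


def pyramid_levels(width: int, height: int,
                   tile_size: int = 512) -> List[Tuple[int, int, int, int]]:
    """List (level_width, level_height, tile_columns, tile_rows) per level.

    Starts at full resolution and halves both dimensions (one quarter of the
    pixels) until the whole image fits in a single tile.
    """
    levels = []
    w, h = width, height
    while True:
        cols = (w + tile_size - 1) // tile_size  # ceiling division
        rows = (h + tile_size - 1) // tile_size
        levels.append((w, h, cols, rows))
        if cols == 1 and rows == 1:
            break
        # Assumption: round odd dimensions up when halving.
        w = max(1, (w + 1) // 2)
        h = max(1, (h + 1) // 2)
    return levels


# The Clementine global mosaic from the text: 92,160 x 46,080 pixels.
levels = pyramid_levels(92160, 46080)
total_tiles = sum(cols * rows for _, _, cols, rows in levels)
```

Under these assumptions the Clementine mosaic yields 9 levels, from 180 × 90 tiles at full resolution down to a single tile at 360 × 180 pixels, for 21,660 tiles in total. Each tile can be produced independently of the others, which is what makes the work easy to distribute.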
With the help of the Hadoop framework and the scalability of cloud computing [2], we were able to process and tile large amounts of data in a relatively short amount of time. We set up and configured Hadoop on several Amazon EC2 nodes and experimented to find the number and type of EC2 machines that would yield the best performance. We tested a cluster of 20 large instances against a cluster of 4 Cluster Compute Instances. Because of the nature of our image processing and the nuances of the Hadoop framework, large amounts of binary data were transferred between the nodes in the cluster. Since the Cluster Compute Instances were tuned for network performance, the smaller cluster yielded better performance and completed the image tiling process faster. However, both solutions performed significantly better than the previous solution of processing the image on a single local machine. The cost for processing each image was only a few dollars.
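The reason tiling distributes well is that every tile is an independent job. The sketch below illustrates that structure on a single machine, using Python's standard `concurrent.futures` thread pool as a stand-in for the Hadoop cluster; it is not the Hadoop job used by LMMP. The toy image (a list of pixel rows) and the function names are assumptions, and the bicubic resizing step is omitted to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Image = List[List[int]]  # rows of pixel values; a toy stand-in for a mosaic


def cut_tile(image: Image, col: int, row: int, tile_size: int) -> Image:
    """Subset one tile; tiles on the right and bottom edges may be smaller."""
    top, left = row * tile_size, col * tile_size
    return [line[left:left + tile_size] for line in image[top:top + tile_size]]


def tile_in_parallel(image: Image, tile_size: int,
                     workers: int = 4) -> Dict[Tuple[int, int], Image]:
    """Cut every tile of one pyramid level, one independent job per tile."""
    height, width = len(image), len(image[0])
    cols = (width + tile_size - 1) // tile_size
    rows = (height + tile_size - 1) // tile_size
    jobs = [(c, r) for r in range(rows) for c in range(cols)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No job reads another job's output, so order of execution is free.
        tiles = pool.map(lambda job: cut_tile(image, job[0], job[1], tile_size),
                         jobs)
        return dict(zip(jobs, tiles))


# Toy 6 x 5 "image" whose pixel value encodes its position (10 * y + x).
toy = [[10 * y + x for x in range(6)] for y in range(5)]
tiles = tile_in_parallel(toy, tile_size=4)
```

In a real cluster each job would also carry its pixel data to a worker node, which is consistent with the observation above that network performance between nodes dominated the comparison.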
2.4.2.2 Data Storage, Transfer, and Hosting
In addition to image processing, LMMP heavily utilizes cloud storage and hosting for serving publicly available data. The project is required to support at least 400 concurrent users. Users must be able to access all 2 TB of the data currently stored in the system as well as perform real-time data visualization and image manipulation. The cloud allows LMMP to scale up when demand for its tools and data is high and to scale down to minimize cost when demand is low.
Upon receiving data from the data providers, LMMP uses Apache OODT to ingest information about the product. Once the data has been properly catalogued, it can be stored in its original form for on-demand processing, such as rendering 3D animations, or can be tiled and served as static data such as the image tiles mentioned previously. Depending on the type of data, different cloud storage schemes are utilized.
LMMP stores its public data in Amazon’s bucket store infrastructure, S3, as well as its elastic block storage, EBS, which is attached to an EC2 instance. We choose between the two based on their features. Files stored on S3 are primarily static files that will not be used in subsequent processing, such as the individual tiles generated from an image mosaic. S3 files are accessed via URLs, which makes them easily accessible to users online. In addition, we can use Amazon CloudFront to automatically host the data at different data centers around the world to reduce download times for our international users. However, S3 files are not suitable for dynamic computations within EC2. Data sets that will be used to dynamically generate products are therefore stored in EBS. EBS is attached to EC2 instances and is readily accessible to processes using standard file I/O functions.
Moving large amounts of data between JPL and the cloud has been a challenge, and we have developed a specific tool that enables higher throughput by utilizing parallel connections to cloud systems. Rather than transferring files through a single connection, we can transmit multiple files concurrently, and even transmit parts of a file in parallel and assemble the pieces at the destination. We used the partial GET feature of the HTTP protocol and an S3-specific feature for multi-part uploads to maximize bandwidth utilization.
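The parallel-transfer idea can be sketched as follows. This is not the JPL tool itself: `byte_ranges`, `parallel_fetch`, and the `fetch_range` callable are hypothetical names, and the example reads from an in-memory buffer so that it runs without a network. In a real download, `fetch_range` would issue an HTTP request carrying a `Range: bytes=start-end` header, whose byte positions are inclusive at both ends.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def byte_ranges(total_size: int, part_size: int) -> List[Tuple[int, int]]:
    """Split a file into inclusive (start, end) byte ranges, HTTP Range style."""
    return [(start, min(start + part_size, total_size) - 1)
            for start in range(0, total_size, part_size)]


def parallel_fetch(fetch_range: Callable[[int, int], bytes],
                   total_size: int, part_size: int,
                   connections: int = 4) -> bytes:
    """Fetch all parts over several concurrent connections and reassemble."""
    ranges = byte_ranges(total_size, part_size)
    with ThreadPoolExecutor(max_workers=connections) as pool:
        # map() returns results in request order, so joining them rebuilds
        # the file even if the parts finish out of order.
        parts = list(pool.map(lambda r: fetch_range(r[0], r[1]), ranges))
    return b"".join(parts)


# Stand-in for a remote file: 2,560 bytes held in memory.
source = bytes(range(256)) * 10


def fetch_from_memory(start: int, end: int) -> bytes:
    """Hypothetical fetcher; a real one would send 'Range: bytes=start-end'."""
    return source[start:end + 1]


copy = parallel_fetch(fetch_from_memory, len(source), part_size=1000)
```

The same range computation applies in the upload direction, where each range would become one part of a multi-part upload. Part size and connection count are tuning parameters that would have to be chosen by measurement.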
The portal operates on a multi-tiered, heterogeneous architecture consisting of Apache HTTPD, Apache Tomcat Application Server, ESRI ArcGIS Server, and a JPL internal map server. The HTTPD server, running on EC2, distributes the requests to the other servers and returns the responses to the client. The Tomcat server, also running on EC2, performs catalog lookups, 3D animation rendering, sun angle lookups, and dynamic image subsetting. The ESRI ArcGIS server handles the distribution of some images as well as nomenclature data. This server runs on a Windows instance on EC2. Finally, the JPL map server hosts other images and runs within JPL. This architecture, coupled with the scalability of cloud computing, gives fine-grained flexibility in choosing which capacity to scale. If users are rendering multiple 3D animations, multiple Tomcat servers can be instantiated to meet the demand. If users are requesting more image data, we can instantiate more Windows instances and run multiple ArcGIS servers. We can also run these servers on larger virtual machine instances to handle larger loads.
2.4.2.3 Conclusions
LMMP is an operational example of how a project can utilize cloud computing resources not only for data processing but also as an efficient means for the storage and presentation of the data. The ability to start virtual machines as needed allows for the quick setup and teardown of distributed computing environments, especially with the help of frameworks such as Hadoop. The virtual machine software image can be instantiated on numerous compute “sizes” to emphasize memory capacity, CPU power, and/or network connectivity to maximize algorithm performance while decreasing cost. The cloud also provides a convenient platform to host and serve data to customers, whether they are scientists or the general public. The project can optimize its resource utilization based on actual demand. This frees up resources and allows the project to focus more on its primary goal of providing unique and interesting data.