and Analysis Tool Author: Kern Date: 22 February 2013
NRAO Doc. #: Version: 1.0
A New Data Visualization and Analysis Tool
PREPARED BY ORGANIZATION DATE
Jeff Kern NRAO 22 Feb, 2013
Change Record
VERSION DATE REASON
TABLE OF CONTENTS
1 Introduction
2 Requirements and Design Considerations
2.1 Data Storage and Access
2.2 Remote Access
2.3 2-D Rendering
2.4 Volume Rendering
2.5 Analysis Tool Integration
2.6 Virtual Observatory Integration
2.7 Linked Views
2.8 Graphics Processor Units (GPUs)
3 Existing Tools and Projects
3.1 CASA’s Viewer
3.2 Other Astronomical Viewers
3.3 Other Data Visualization Packages
3.4 Current Projects
4 Proposed Architecture
1
INTRODUCTION
The CASA viewer is an adequate tool for investigation and analysis of current data produced by ALMA and VLA. Under the current plan, incremental improvements will be made to support the evolving needs of the community. Some issues can easily be addressed in the current design; others are much more difficult to accommodate. An alternative to incremental improvement would be to initiate a project to design and implement the next generation viewer.
It is an appropriate time to consider this initiative: data sets and products from ALMA, VLA, and GBT’s new VEGAS spectrometer will continue to grow in size, pushing the capabilities of the current viewer and analysis framework. The next generation of low frequency arrays will have similar issues, and providing an efficient viewer / analysis capability would be a natural topic for collaboration. Finally, the rising use of integral field units (IFUs) in optical astronomy expands the audience for a next generation cube viewer. In fact, the CASA viewer has already been the focus of a successful collaboration with ESO to add the functionality needed to support their MUSE instrument.
This document describes some of the scientific and technical considerations, and a sketch of a possible architecture, for such a project.
2
REQUIREMENTS AND DESIGN CONSIDERATIONS
At the highest level the requirement for a viewer / analysis tool is quite simple: facilitate the intuitive exploration of data, and provide facilities for extraction of quantitative scientific data. Behind this simple statement is a wide range of detailed requirements, desirable behaviors, and implementation details. In this section I do not attempt to define all of the requirements for a new viewer / analysis framework; rather, I highlight many of the capabilities that should influence the design of a new system.
The fundamental problem posed by data from the new generation of interferometers is the volume of data. VLA and ALMA are both capable of producing tera-pixel (10,000^3) scale cubes in routine observing modes. No astronomical viewer that I am aware of is prepared to analyze cubes of this size efficiently. Although it is certainly possible to segment the data and do piecewise analysis, embracing the complexity of these systems and looking at them in a holistic way will undoubtedly allow new forms of analysis.
Perhaps the most obvious effect of the increase in volume is on data access. Loading an entire cube into memory for manipulation is no longer possible, so efficient input / output operations are required. A corollary is that not all end-users will be able to write their own algorithms to analyze the data: simple serial access and single-processor techniques are too slow to allow efficient data exploration.
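To make the scale concrete, the short sketch below estimates the raw footprint of a tera-pixel cube. The pixel size of 4 bytes (32-bit floating point) is an assumption for illustration and is not a figure taken from the telescopes' data formats; the function name is likewise hypothetical.

```python
def cube_size_bytes(nx, ny, nchan, bytes_per_pixel=4):
    """Return the raw size in bytes of an nx x ny x nchan cube.

    bytes_per_pixel=4 assumes 32-bit floating point pixels (an assumption
    made for this sketch, not a value stated in this document).
    """
    return nx * ny * nchan * bytes_per_pixel


if __name__ == "__main__":
    size = cube_size_bytes(10_000, 10_000, 10_000)
    # 10^12 pixels at 4 bytes each
    print(f"{size / 1e12:.1f} TB")
```

Under that assumption a single 10,000^3 cube occupies about 4 TB before any overhead, which is why whole-cube loading is ruled out above.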
One of the key facts about the output of ALMA and VLA is not just that there is more data, but that there is more interesting data. While a patient user might be willing to study each channel of a 256-channel cube on an individual basis, investigating thousands of channels this way is not practical (particularly when many channels are empty). The tools must assist the user in identifying interesting aspects of the data and placing those data in context so the scientist can see connections, while maintaining the flexibility to allow users to explore data in new and unforeseen ways.
2.1
Data Storage and Access
Once a data cube has reached sufficient size that it is no longer practical to hold it in memory, the storage method on disk becomes a primary driver of the speed with which a user can access the data. A study funded under the ALMA development program is investigating options for image storage; although that study is not yet complete, I expect that two key capabilities identified will be:
• Random Access: It shall be possible to access subsets of the data, selected along any axis (or combination thereof) without incurring the expense of reading all data.
• Concurrent Access: It shall be possible for one or more operations to be applied to the data simultaneously.
Although solid state disks (SSDs) are increasing in size and decreasing in cost, large SSDs are not yet a practical solution for data storage, so traditional spinning disks must still be considered. These are inherently block devices, and true random access cannot be supported efficiently. A technique of tiling the data on disk is therefore required to support pseudo-random access in which the excess I/O incurred at any stage is limited.
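The effect of tiling on excess I/O can be sketched as follows. This is a minimal illustration, assuming a cube stored as fixed-size rectangular tiles that must each be read whole; the function names, the (x, y, channel) ordering, and the tile shapes are hypothetical and do not describe any particular storage format.

```python
from itertools import product


def tiles_for_region(start, stop, tile_shape):
    """Return indices of all tiles overlapping the half-open region [start, stop).

    start, stop and tile_shape are (x, y, channel) tuples in pixels.
    """
    ranges = [
        range(lo // t, (hi - 1) // t + 1)
        for lo, hi, t in zip(start, stop, tile_shape)
    ]
    return list(product(*ranges))


def read_amplification(start, stop, tile_shape):
    """Ratio of pixels read from disk (whole tiles) to pixels requested."""
    pixels_read = len(tiles_for_region(start, stop, tile_shape))
    for t in tile_shape:
        pixels_read *= t
    pixels_wanted = 1
    for lo, hi in zip(start, stop):
        pixels_wanted *= hi - lo
    return pixels_read / pixels_wanted


if __name__ == "__main__":
    tile = (64, 64, 64)
    # A full spectrum through one pixel of a 10,000-channel cube.
    spectrum = ((100, 100, 0), (101, 101, 10_000))
    # A single full channel plane of a 10,000 x 10,000 pixel cube.
    plane = ((0, 0, 5), (10_000, 10_000, 6))
    for name, (start, stop) in (("spectrum", spectrum), ("plane", plane)):
        n = len(tiles_for_region(start, stop, tile))
        print(name, n, "tiles, amplification", read_amplification(start, stop, tile))
```

With cubic tiles neither access pattern is free, but neither requires reading the whole cube; the tile shape is the knob that trades excess I/O between access along different axes.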
Modern systems suitable for data analysis have multiple processors or cores; to make use of the capabilities of these systems, the image storage method must support concurrent access to the data by multiple threads of execution.
One issue that may need to be addressed is the storage of multiple cached versions of an image. These could be at different resolutions, or simply arranged differently on disk to optimize for different kinds of access. Handling the creation (when warranted) of these versions and the selection among them would need to be done “under the hood” and out of view of the end user.
Although the viewer should be optimized to support a high efficiency storage format as discussed above, astronomy’s workhorse format of FITS must continue to be supported, as should the current CASA image format. Another issue to be determined is whether these formats are handled natively or are first converted to the internal format of the viewer.
2.2
Remote Access
Cloud computing is very much in vogue these days, and it is tempting to dismiss remote access to astronomical data as a fad which will pass. There are, however, several compelling reasons to consider enabling it:
• Co-Location of Data: Current viewer architectures require either moving the data to the users or moving the user to the data. Both are expensive in terms of time and money.
• Centralized Hardware: Allowing users remote access for visualization permits access to and investigation of the data from systems which lack the power or storage to do such processing on their own (such as laptops, and even tablets).
• Convenience: Being able to remotely access and view data simplifies collaboration, reduces time spent transferring data, and brings astronomical software to the level of convenience expected in today’s computing environment.
Throughout the remainder of this document I will assume a client-server architecture where separate software systems (which may or may not be co-located) are responsible for the generation of images (the server) and the display of and interaction with these images (the client). This design will be described in more detail in section 4.
2.3
2-D Rendering
Most current astronomical viewers support 2-D rendering, displaying projections and slices of the cube along various axes. This is likely to continue to be the primary tool for quantitative investigation of data, due to the limitations of modern technology for display and particularly selection of volume rendered data.
A few features should be added to the usual list of 2-D capabilities:
• Image size control: Transporting images which contain more pixels than the viewing device can support is a waste of bandwidth, particularly if the image is being viewed remotely. The resolution of the image should be matched to the use, and the capability of zooming to native resolution supported.
• Image prefetching: As the user traverses the data, the next image to be loaded can usually be anticipated. Whenever possible these images should be prepared and cached to decrease latency between images displayed to the user.
• Image Interleaving: The image is first delivered to the client as a low-resolution image. As more tiles are delivered the image resolution will increase.
These capabilities are made more vital by the desire to support remote data access.
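As an illustration of image size control, the sketch below picks an integer decimation factor that makes an image fit a viewport and then block-averages the pixels. It is a minimal sketch under stated assumptions: the image is a plain 2-D list of numbers, block averaging is only one possible resampling choice, and the function names are hypothetical.

```python
import math


def decimation_factor(image_shape, viewport_shape):
    """Smallest integer factor that makes the image fit inside the viewport.

    Both shapes are (rows, columns) in pixels.
    """
    ny, nx = image_shape
    vy, vx = viewport_shape
    return max(1, math.ceil(ny / vy), math.ceil(nx / vx))


def block_average(image, factor):
    """Downsample a 2-D list of numbers by averaging factor x factor blocks.

    Partial blocks at the right and bottom edges are averaged over the
    pixels that actually exist.
    """
    ny, nx = len(image), len(image[0])
    out = []
    for y0 in range(0, ny, factor):
        row = []
        for x0 in range(0, nx, factor):
            block = [
                image[y][x]
                for y in range(y0, min(y0 + factor, ny))
                for x in range(x0, min(x0 + factor, nx))
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


if __name__ == "__main__":
    # A 10,000 x 10,000 pixel plane shown on a 1080 x 1920 display.
    print(decimation_factor((10_000, 10_000), (1080, 1920)))
```

In this example only about one pixel in a hundred needs to cross the network until the user zooms in, at which point a smaller factor (down to 1 for native resolution) would be requested.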
2.4
Volume Rendering
Volume rendering of data is controversial: proponents claim that it makes investigating data more intuitive, while opponents cite the difficulty of data selection and of transitioning to quantitative analysis.
Given infinite resources, volume rendering should be part of any future analysis and visualization package. In practice, it should not be one of NRAO’s highest priorities. Any future analysis and visualization framework should be designed to allow inclusion of a volume-rendering component, but the implementation should be deferred. This would be a very good candidate for external collaboration, as it is not vital to NRAO but would still be a useful contribution to the package.
2.5
Analysis Tool Integration
Rendering the data is not sufficient for the next generation visualization and analysis tool. The tool will need to help steer the user to interesting aspects of the data, and extract quantitative information from the data. Historically this is the portion of the system that the larger community will contribute to; supporting a plug-in mechanism to allow this contribution is a worthwhile requirement.
Provided that we support remote access to data (section 2.2) there are two distinct classes of analysis tools:
• Server Side Tools: Have access to the full data set and high performance processing. This should be most of the actual analysis tools. The disadvantage of server side tools is that supporting user-developed plug-ins is more difficult; effort will need to be devoted to supporting this at the design stage.
• Client Side Tools: Will be more responsive but will not necessarily have access to the full resolution data, and will often be running on a system with much less computing power. These tools should be very lightweight.
As an example, Gaussian fitting to an extracted spectrum should take place on the client side (as it only accesses a single spectrum), while doing that same fit on every pixel in the map would need to be a server side tool (since it needs to access all of the data in the cube).
We should also strive not to embed the analysis capabilities into the visualization tool directly but rather to interface to other libraries where possible. Obviously CASA libraries provide a basic set of functionality, but it may be possible to interface other packages (provided the formats are compatible), and certainly it should be possible to develop non-CASA reductions against the data access specification. User libraries should be supported through some form of plug-in framework. This is very straightforward on the client side, but more thought needs to be given to how it could be supported on the server side of the system.
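One way to picture such a plug-in framework is a registry in which each tool declares whether it runs on the client or the server. The sketch below is only an illustration: the registry, the decorator, and the tool names are hypothetical, and a simple peak finder stands in for the Gaussian fit mentioned above so that the example needs no fitting library.

```python
from dataclasses import dataclass
from typing import Callable, Dict

CLIENT = "client"
SERVER = "server"


@dataclass
class Tool:
    name: str
    side: str
    run: Callable


# Hypothetical global registry of analysis tools, keyed by tool name.
_REGISTRY: Dict[str, Tool] = {}


def register_tool(name, side):
    """Decorator that registers a function as a client or server side tool."""
    if side not in (CLIENT, SERVER):
        raise ValueError(f"side must be {CLIENT!r} or {SERVER!r}")

    def decorator(func):
        _REGISTRY[name] = Tool(name, side, func)
        return func

    return decorator


def tools_for(side):
    """Names of all registered tools that run on the given side."""
    return sorted(t.name for t in _REGISTRY.values() if t.side == side)


@register_tool("spectrum_peak", side=CLIENT)
def spectrum_peak(spectrum):
    """Lightweight tool: channel index and value of the spectrum maximum."""
    channel = max(range(len(spectrum)), key=spectrum.__getitem__)
    return channel, spectrum[channel]


@register_tool("peak_map", side=SERVER)
def peak_map(cube):
    """Full-cube tool: apply spectrum_peak to every pixel of cube[y][x][chan]."""
    return [[spectrum_peak(spec) for spec in row] for row in cube]
```

The same per-spectrum function serves both cases; what changes is where it runs and how much data it touches, which is the distinction drawn between client side and server side tools.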
2.6
Virtual Observatory Integration
Although community acceptance of Virtual Observatory (VO) protocols has been slow, the next generation visualization tool should support the relevant protocols. The most important is Simple Image Access (SIA) Version 2, which has not yet been finalized, although other protocols will certainly be relevant.
A level of pragmatism should be applied to the adoption of VO standards. For new interfaces, compliance with these standards should be a goal. However, internal interfaces and data structures should not be rewritten to comply with the VO standards. It may be necessary to write a set of conversion layers that translate between formats.
2.7
Linked Views
Typical astronomical image analysis tasks tend to look at data sets in many ways simultaneously (for instance spectral and spatial views, or a histogram of intensity values). Having these various views coordinated decreases cognitive load and allows the astronomer to focus on the scientific content.
This tool should support the Simple Application Messaging Protocol (SAMP), defined by the VO, at least to allow interaction with other VO-enabled tools. One issue to be considered is whether SAMP is sufficient to support the internal communication of the tool or whether other communication protocols need to be investigated.
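The coordination between views can be pictured as a publish/subscribe pattern. The sketch below is an in-process illustration only and is not SAMP: the hub class, the view classes, and the "cursor.moved" topic name are all hypothetical.

```python
class ViewHub:
    """Minimal in-process publish/subscribe hub linking several views."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, sender, **payload):
        for callback in self._subscribers.get(topic, []):
            # Skip the sender so a view does not react to its own message.
            if getattr(callback, "__self__", None) is not sender:
                callback(**payload)


class SpectrumView:
    """Tracks the pixel whose spectrum should be shown."""

    def __init__(self, hub):
        self.position = None
        hub.subscribe("cursor.moved", self.on_cursor_moved)

    def on_cursor_moved(self, x, y):
        self.position = (x, y)


class ImageView:
    """Spatial view that announces cursor movement to the other views."""

    def __init__(self, hub):
        self.hub = hub
        self.position = None
        hub.subscribe("cursor.moved", self.on_cursor_moved)

    def on_cursor_moved(self, x, y):
        self.position = (x, y)

    def move_cursor(self, x, y):
        self.position = (x, y)
        self.hub.publish("cursor.moved", sender=self, x=x, y=y)


if __name__ == "__main__":
    hub = ViewHub()
    image, spectrum = ImageView(hub), SpectrumView(hub)
    image.move_cursor(3, 4)
    print(spectrum.position)
```

Whether messages of this kind travel over SAMP or over a lighter internal channel is exactly the open question raised above.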
2.8
Graphics Processor Units (GPUs)
Graphical processors provide massively parallel processing capabilities, but at the cost of increased coding complexity and variation in platform. There are several places in a visualization and analysis package where GPUs could be used:
• Volume Rendering: Most volume rendering applications will benefit from the use of specialized hardware. This is a client side use of the GPU, and thus a wide range of hardware should be supported. Above, I argue that NRAO should defer volume rendering. When this becomes a priority the issue of GPU usage should be revisited.
• 2-D Display: Many images produced by modern telescopes are larger than can be displayed (at full resolution) on most screens. Exploring these images can be done by panning and zooming, and at least in principle GPUs could assist with this. However, most systems are capable of doing this without the specialized hardware of a GPU.
• Image Generation: Unlike the previous two applications, using a GPU as part of the image generation process would be a server side application. The basic rendering of data to an image probably does not require GPU acceleration, although some analysis operations may benefit from these capabilities. GPU use could be enabled on a per-tool basis and would not need to extend throughout the system.
Based on the above considerations, my conclusion is that GPUs are not required (although they might be a nice addition) for the core objectives of a new visualization and analysis package. To simplify the design and implementation, I suggest that although they should not be designed out, their use is not a requirement.
3
EXISTING TOOLS AND PROJECTS
The astronomy community has developed many tools over the years to assist in visualization and analysis of data. Many of the aspects discussed above are present in some viewers, although I am aware of no existing system which addresses all of the above concerns.
3.1
CASA’s Viewer
The CASA viewer is currently meeting the needs of its users, and is probably able to deal with images coming from ALMA and VLA in the coming years. Architecturally it has some of the features described in section 2, such as attempting to separate the GUI “client” from a server-like kernel, and linking views between various tools. On the other hand, remote access to data and VO interaction are not currently supported.
There are a number of other flaws in the current implementation of the viewer. Tool integration is not separable and is in fact distributed throughout the design. The connection between data access and data rendering is not clear, meaning that only a single processor is working on a problem at a time. Finally, the GUI itself has evolved over time, and certainly does not conform to more modern GUI design patterns.
For all that is wrong with the CASA viewer, it is still a strong choice for modern cube analysis. For example, the European Southern Observatory (ESO) did a survey in 2009-2010 to identify a viewer package to support their new Multi Unit Spectroscopic Explorer (MUSE) instrument. The outcome of this purely technical survey was that the two viewers best suited to their needs were a prototype plugin to the Aladin viewer and the CASA viewer. In the end ESO decided that extending the capabilities of the CASA viewer to support optical observations was the correct course.
3.2
Other Astronomical Viewers
Listed below are a few of the other common viewers and analysis tools currently in use in the astronomy community. Although they all have their strong points none of them are suitable as the basis for a next generation tool.
• AIPS TV: The longstanding workhorse of the radio astronomy community, the AIPS viewer continues to be used. However, it was designed for a different era and does not address many of the items above.
• AIPS++ Viewer: Although still used by some scientists, the AIPS++ viewer is no longer supported, having been deprecated in favor of the current CASA viewer.
• Karma Package: Karma is a toolkit for inter-process communications, authentication, encryption, graphics display, user interface and manipulating the Karma network data structure. Its viewer component, kvis, is in some ways the most advanced. Unfortunately the package is no longer supported.
• SAOImage DS9: Although the DS9 package has many of the features described above, its support for data cubes is not sufficient for the needs of the NRAO telescopes.
3.3
Other Data Visualization Packages
There are of course other scientific viewers outside of astronomy that could be investigated. This is particularly attractive for volume rendering, where substantial resources have gone into development of such applications. However, the track record for adoption of this type of technology in astronomy has been poor.
My opinion is that for the core goals of a new NRAO viewer (remote access, VO compatibility, 2-D rendering, and analysis integration) the most cost-effective path forward is not to attempt to modify an existing package to support our use case. On the other hand, for volume rendering applications, if and when we decide to address them, modification of an existing package should be evaluated as a path forward.
3.4
Current Projects
There are currently two projects funded under the ALMA development program looking at visualization related issues:
• Unleashing Large Dataset Science (Lead Institution: U. of Maryland)
New analysis tools that afford easy, extensive, and creative access to large data cubes (1,000 X 1,000 pixels and larger) will enhance proposers’ ability to use ALMA efficiently for targeted science, and facilitate fundamental discovery science. This software design study officially commenced in late summer 2012 and concludes in September 2013.
• A Visualization Portal for ALMA Data (Lead Institution: U. of British Columbia)
This study will develop a web-based visualization portal that will enable multiple astronomers to simultaneously and collaboratively interrogate/explore a one terabyte; this work builds upon technology developed by the CyberSKA cyberinfrastructure project. The visualization tool design is in process. This software design study officially commenced in early fall 2012 and concludes in June 2013.
Both of these projects are very relevant to the future of visualization and analysis at NRAO, although it is not clear what future they have beyond the current study phase. The results of these studies will be evaluated as they conclude and incorporated into the design and implementation of the visualization package. It is likely that both projects will apply for further funding for an implementation phase; NRAO could opt not to pursue this type of development itself and instead focus on ensuring compatibility with the tools they develop.
4
PROPOSED ARCHITECTURE
This section presents a sketch of a possible architecture for a new visualization and analysis tool, addressing the issues presented above. If NRAO decides to proceed with the development of a new analysis and visualization tool, the first step would be to do a more formal requirements capture, design, and costing exercise.
The proposed system has three tiers:
• Data Access Layer (DAL): Responsible for presenting (a selected subset of) the data to the processing engine.
• Processing Engine: Responsible for execution of various tools on the data and sending the display data to the client.
• Client layer: Responsible for interactive display of the data produced by the processing engine and execution of lightweight processing.
These tiers can be deployed in several configurations:
• Stand Alone Systems: We must still support the case of a scientist working with their private data at their home institution. In this case all three components will be executing on the same system, with the DAL serving data off of the local disk.
• Remote Access of Observatory Data: Here the Data Access Layer and Processing Engine are running at the observatory with the client connecting remotely.
• Remote Processing of Data: Some users will develop sophisticated analysis systems that will run on their home system, only using the DAL on the observatory system.
When components are co-located, effort should be made to optimize performance. Following VO protocols where appropriate and available will allow a “mix and match” approach with other tools and data sources.
I propose that NRAO undertake to ensure all three layers are implemented for the cases of interest to the observatory. We would exclude the volume rendering case from this initial development, and develop two client applications: a GUI client to support interactive exploration of data and a text client to support pipelines and production of publication-quality images. Interaction with the ALMA development studies will need to be defined to avoid conflict of interest and duplication of effort.
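The division of responsibility between the three tiers can be sketched in a few classes. This is a minimal illustration under stated assumptions, not a design: the class and method names are hypothetical, the cube is a small in-memory nested list indexed as cube[chan][y][x], and direct method calls stand in for whatever network transport would separate the tiers in a remote deployment.

```python
class DataAccessLayer:
    """Presents a selected subset of a cube stored as cube[chan][y][x]."""

    def __init__(self, cube):
        self._cube = cube

    def plane(self, channel):
        return self._cube[channel]

    def spectrum(self, x, y):
        return [plane[y][x] for plane in self._cube]


class ProcessingEngine:
    """Runs tools on data from the DAL and returns display-ready results."""

    def __init__(self, dal):
        self._dal = dal

    def render_plane(self, channel):
        """Scale one channel plane linearly to integer grey levels 0..255."""
        plane = self._dal.plane(channel)
        lo = min(min(row) for row in plane)
        hi = max(max(row) for row in plane)
        span = (hi - lo) or 1.0  # avoid dividing by zero on a flat plane
        return [[round(255 * (v - lo) / span) for v in row] for row in plane]


class Client:
    """Displays what the engine produces; here, as rows of text."""

    def __init__(self, engine):
        self._engine = engine

    def show(self, channel):
        image = self._engine.render_plane(channel)
        return "\n".join(" ".join(f"{v:3d}" for v in row) for row in image)


if __name__ == "__main__":
    # Stand-alone case: all three tiers live in one process on one machine.
    cube = [[[0.0, 10.0], [4.0, 2.0]]]
    client = Client(ProcessingEngine(DataAccessLayer(cube)))
    print(client.show(0))
```

In the stand-alone case all three objects live in one process, as here; in the remote cases the same interfaces would be reached across a network boundary placed either between client and engine or between engine and DAL.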
5
CONCLUSION
The current CASA viewer is deficient for the future in several ways; however, it is currently the preferred option for radio astronomical data investigation. The capabilities of the CASA viewer are being incrementally improved, and will continue to evolve over the lifetime of the CASA project. Virtual Observatory capabilities will be added to the current viewer as resources allow, but will be add-on capabilities rather than designed in from the beginning. Currently the CASA project allocates approximately 2 FTEs to the improvement and maintenance of the viewer. These developers are warning that the current implementation is becoming increasingly difficult to maintain, and that more effort will need to be allocated to maintenance than to implementation in the near future.
I am unaware of any funded initiative to implement the next generation astronomical cube viewer. There are studies through the ALMA development program that could lead to implementation of some of these capabilities, but no concrete project. The increasing use of integral field units in optical astronomy may lead to development of a suitable tool by the optical astronomy community, but again I know of no funded effort.
The current CASA viewer represents approximately 15 years of effort and evolution, with at least three different viewers being produced. Without a more detailed set of requirements and design, estimating the effort required to implement the above sketch is difficult. A comparable level of effort (about 15 FTE-years) seems plausible, preferably delivered by a team of 4 individuals working full time for 3-4 years (12-16 FTE-years). If NRAO decides that this is a direction we wish to pursue, the immediate first effort is requirements capture, detailed design, and costing.
NRAO has a choice to make. We can remain with the status quo; maintain the current level of effort and a viewer that is (arguably) sufficient for the current needs of our community. The alternative is to invest in producing the next generation viewer, provide the core capabilities that the community requires, and organize collaborations to deliver additional features.