CASA Analysis and Visualization


Synthesis

Current Status

General Goals and Challenges

Immediate Goals

Harnessing Community Development

Synthesis

We summarize capabilities and challenges facing CASA visualization and image analysis: immediate needs, large data with spectral axes covering many lines, and limited resources for development. We describe the path forward that these concerns suggest: one where basic functionality is added over each six-month development cycle, focusing on cube exploration, basic data access and manipulation, and high quality output. We describe the interface with community development, which has two components: (1) providing easy access to the data for scientist-scale development and interface with existing tools and packages, including VO servers, and (2) serving as a functional, immediately available bridge until external projects (ALMA development or otherwise) reach maturity.

Current Status

CASA includes an interactive data browser, the CASA viewer, and a suite of core image analysis tools wrapped into ~15 analysis-oriented tasks that address basic image processing needs.

The viewer can be run as a standalone application or from within CASA. It can be scripted from the shell, but the set of features accessible while scripting remains limited. The viewer reads CASA image files and FITS files. It registers images on loading and provides the ability to explore data in the following ways:

1. Scroll among channel maps, zoom, and pan to move through an image cube.

2. Examine spectra of individual lines of sight or averaged over regions.

3. Identify spectral lines by comparing to the Splatalogue database.

4. Interactively fit a spectral line.

5. Interactively specify or load from disk regions of interest for statistical analysis, spectral analysis, or use as a mask.

6. Calculate image statistics from regions of interest.

7. Calculate moment maps from image cubes on the fly.

8. Compare multiple cubes via contours, blinking, or multiple panels.
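To make the moment-map item concrete, the following is a minimal sketch of how the zeroth and first moments of a cube can be computed with NumPy. It is not the viewer's implementation. The function name `moment_maps`, the `[channel, y, x]` axis order, the uniform channel spacing, and the simple intensity threshold are all assumptions made for illustration.

```python
import numpy as np


def moment_maps(cube, velocity, threshold=0.0):
    """Return (moment0, moment1) for a cube indexed as [channel, y, x].

    cube      : 3-D array of intensities (hypothetical layout)
    velocity  : 1-D array of channel velocities, assumed uniformly spaced
    threshold : pixels at or below this value are excluded from the sums
    """
    cube = np.asarray(cube, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    dv = abs(velocity[1] - velocity[0])  # channel width

    # Zero out pixels that do not exceed the threshold.
    clipped = np.where(cube > threshold, cube, 0.0)
    total = clipped.sum(axis=0)

    # Moment 0: integrated intensity along the spectral axis.
    moment0 = total * dv

    # Moment 1: intensity-weighted mean velocity, NaN where no signal remains.
    weighted = (clipped * velocity[:, None, None]).sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        moment1 = np.where(total > 0, weighted / total, np.nan)
    return moment0, moment1
```

In practice the threshold would usually come from a noise estimate, and masking choices strongly affect the first moment, so the simple clip here should be read as a placeholder.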

A separate viewer mode, currently accessible only via the CLEAN task, allows the user to interactively construct or edit a “mask” image from the union of hand-specified regions. That mode also gives the user control over the CLEAN loop and the two capabilities can be used hand in hand. Extensions with the next release will include histograms of pixel values for use with noise estimation, color table manipulation, etc.; the ability to calculate position-velocity cuts; and continuing improvements to the integration with the splatalogue spectral line database and general interface. The user friendliness of these capabilities still needs to be improved over the course of future cycles, but the viewer already represents a best-in-class cube viewer in many ways.

Many of the current image analysis tasks focus on manipulation of image cubes to allow the user to produce their desired final image product. These include tasks to:

1. “Collapse” a data cube to an image in several ways.

2. Regrid a cube to a new astrometric grid or ordering.

3. Smooth a data cube spatially (but not yet spectrally).

4. Extract regions of interest to a python array or a new sub image.

5. Manipulate masks for moment creation and deconvolution.

6. Perform a large set of pixelwise mathematical operations on an image.

7. Import and export data from FITS (to CASA images).

There is also a limited set of more directly analysis focused tasks that allow one to:

1. Fit Gaussian sources to an image.

2. Fit a spectrum with line profiles or polynomials.

3. Derive statistics from a region of interest.

Current development is mainly carried out by two developers with support from the original viewer developer. One focuses mainly on enhancements to the viewer and exposing analysis functionality graphically. The other focuses on adding image analysis functionality and exposing this via shell-‐level tasks.

Qualitatively, the state of the CASA viewer is comparable to that of the karma and ds9 packages. Unlike karma, the viewer is being actively developed. The cube browsing features (including integration with splatalogue) exceed those of ds9, but overall ds9’s image browsing capabilities (color table and coordinate manipulation, interactive analysis) exceed those of the viewer. The integration of the viewer with the CASA image format and thus with the data reduction/imaging loop represents a unique, currently irreplaceable aspect of the viewer.


General Goals and Challenges

The most conservative goals for the viewer and analysis packages are to allow ALMA and VLA users to:

1. Interactively explore their data, especially data cubes. Examine spectra, movies, derive image statistics, identify spectral lines. Compare multiple lines, identify unknown lines.

2. Manipulate their cubes and gain easy access to whatever form or part of the data that they want. This includes smoothing along the spatial and frequency axes, regridding, subcube extraction/slicing, extraction of moment maps, and extraction of arbitrary position-velocity slices and export to FITS or other data structures appropriate for further analysis.

3. Visualize the data in a way that is appropriate for publication. Produce high quality, labeled images (at least) and plots and export to appropriate format.

This is the absolute minimum bar for CASA’s visualization and analysis: tools to look at the data produced by the NRAO telescopes and export the desired form of that data to the field-standard format. This list does not include model fitting (species fitting, rotation curves, density profiles), novel data browsing techniques (like wireframe surfaces or 3-d rotation), source extraction (clumpfinding) and characterization, or other advanced analysis.

Key challenges to meet these basic goals are:

1. The Spectral Axis. To state the obvious: robust handling of a data cube (rather than just an image) is a key requirement and strength of the CASA viewer, compared to, say, ds9. This means that things like three-dimensional mask creation for imaging, position-velocity cuts, spectral browsing, and moment creation are core functionality. These require careful coordinate handling, one reason that non-astronomy software is not a trivial solution.

2. Large Data Sets. A single field ALMA observation can currently produce a cube with natural size ~250x250x4000 per baseband (x4 per data set). Mosaics increase the spatial dimensions up to as many as a few thousand pixels, while spectral averaging and/or considering only a single line at a time reduces the spectral dimension substantially. Realistically, data cubes with hundreds of elements along each spectral and spatial dimension and sizes of several hundreds of MBs are now commonplace. Data cubes with several thousand elements along one or both axes and sizes of several GBs are no longer exceptional. In the near future, cubes with sizes of several tens of GBs and thousands of elements along both spatial and spectral axes have the potential to become commonplace. This is especially true if the ALMA pipeline makes the decision to archive imaging of entire spectral windows.

3. Multiple Lines. The large spectral coverage and sensitivity of ALMA and the VLA mean that these large data sets now often cover multiple spectral lines in a single data set, often including unexpected lines that need to be identified. The viewer and analysis software needs to allow for ready line identification, extraction of single line data sets, and easy comparison among different lines within a data cube or drawn from several data cubes.

4. Immediate need: ALMA and the VLA are working now. Both NRAO facilities are now producing these large, complex data cubes and the software needs to be in place for image inspection to evaluate reduction, tune the imaging (requiring noise estimates and CLEAN mask creation), and create the output to be used in scientific analysis. CASA’s viewer already plays an indispensable role in this process for both telescopes.

5. Limited Resources. CASA has two developers devoting most of their time to this topic, one focused on graphical user interfaces for data inspection and one focused on analysis code (a third expert developer assists but is focused elsewhere). These developers attempt to deploy professional code that has undergone some quality assurance on six month cycles.

6. Complement the community. The community has developed a large array of tools to analyze spectroscopic and image radio data over the years. This largely does not include the kind of basic visualization and infrastructure routines discussed above, but it does include tools for model fitting and other science analysis. These are often not adequate to extract full scientific return from rich ALMA or VLA data, but the community can be expected to develop new tools as they are presented with new data. These “end tools” also do not represent widespread points of failure if they are not maintained.
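The cube sizes quoted under “Large Data Sets” can be checked with simple arithmetic. The sketch below assumes 4 bytes per pixel (single-precision values), which the text does not state. The 2000x2000 mosaic is an illustrative choice for “a few thousand pixels”, not a measured case. The helper name `cube_size_bytes` is hypothetical.

```python
def cube_size_bytes(nx, ny, nchan, bytes_per_pixel=4, n_cubes=1):
    """Uncompressed size of n_cubes image cubes of nx * ny * nchan pixels.

    bytes_per_pixel=4 is an assumption (single-precision floats).
    """
    return nx * ny * nchan * bytes_per_pixel * n_cubes


# One baseband of a single-field observation: ~250 x 250 x 4000 pixels.
single_baseband = cube_size_bytes(250, 250, 4000)

# All four basebands of the same data set.
full_data_set = cube_size_bytes(250, 250, 4000, n_cubes=4)

# Illustrative mosaic: 2000 x 2000 spatial pixels, full spectral resolution.
mosaic_baseband = cube_size_bytes(2000, 2000, 4000)

for label, size in [
    ("single baseband", single_baseband),
    ("four basebands", full_data_set),
    ("mosaic baseband", mosaic_baseband),
]:
    print(f"{label}: {size / 1e9:.0f} GB")
```

Under these assumptions a single baseband is about 1 GB, a full data set about 4 GB, and one full-resolution mosaic baseband about 64 GB, consistent with the “several GBs” and “several tens of GBs” regimes described above.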

These concerns naturally suggest that in the short term, CASA and NRAO adopt an “evolutionary, not revolutionary” approach to visualization and analysis. Given the scarce resources within the CASA project and the very clear, very immediate need for these tools (they are actively used by the ALMA and VLA projects every day) it makes sense for CASA to focus its efforts on providing continuous improvement in the basic exploration and data manipulation tools. These are unlikely to be developed by the community to a standard of robustness, user-friendliness, and adaptability that NRAO could adopt and maintain and they are essential to move our data to science-ready state. In parallel, CASA is making an effort to expose data to users in as many ways as possible to facilitate ready integration with scientist-developed tools.

This focus on improving existing tools and adding basic functionality complements work being pursued by ALMA Development Studies, which are exploring the creation of entirely new server-side environments and new analysis packages to explore highly multi-dimensional data. This functionality may eventually be key to make the most out of rich ALMA and VLA data sets, but does not appear to be best pursued by a small software development team in the context of the six-month development cycle and does not appear to be at all imminent (if it ever materializes). The sensible vision for CASA is to continue to make conservative, widely useful, well-tested additions and serve as a mainstay package until some more ambitious next-generation package is ready for deployment. This ensures that the community has access to a well-maintained, full-featured, immediately available data exploration tool for the years that it will take to develop the viewer’s successor.

Immediate Goals

The practice for the last year (since the addition of the second developer) has been to keep the priorities above in mind and identify pressing needs that can be addressed during a 6-month CASA development cycle, drawing inspiration from other data browsing programs like karma, ds9, GAIA, GILDAS, and AIPS/AIPS++. The last few cycles have seen the following non-exhaustive list of additions:

1. A spectral line browser linked to the main viewer that can be used to examine spectra for points or regions and navigate the cube.

2. Integration with the splatalogue database for shell-level and interactive line identification.

3. A rich, robust region system exposed through the GUI.

4. Improved registration of data cubes comparing different transitions.

5. The addition of interactive moment creation and performance enhancements to moment creation.

6. New tasks for subcube extraction and better exposure of the data directly to python (with full coordinate information).

These exemplify the “incremental” approach described above. They are all general use functionality that can be developed, tested, and deployed during a six month development cycle. They improve the exploration infrastructure for almost all users and allow easier access to the data for specialized analysis.

Following this model, the current development cycle has seen improved browsing of cube values via pixel histograms, which are also being coupled to moment map creation and the color table. At the task/shell level, CASA has added the ability to extract position-velocity cuts, which has been one of the most requested features for several cycles. Priorities for the next cycle have not been set yet, but will almost certainly include exposing these cuts via a graphical interface inside the viewer and the ability to browse intensity profiles along cuts or user-defined paths.


1. Improved GUI exposures for subcube and position-velocity slice extraction and the addition of new viewer windows to explore position-velocity slices (analogous to the spectrum browser).

2. A general version of the GUI mask creation utility available through CLEAN.

3. Task-level exposure of basic masking/thresholding capabilities for use in clean mask creation.

4. Adding the ability to spectrally smooth images (in addition to the current spatial smoothing).

5. Improvements to the interface to register images on a common coordinate frame and compare different lines within a data set.

6. Added functionality for users to manipulate their data directly.
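As an illustration of the spectral smoothing item in the list above, the following minimal sketch smooths a cube along its spectral axis with a three-channel Hanning kernel using NumPy. It is not a CASA task. The function name `smooth_spectral`, the `[channel, y, x]` axis order, and the choice of kernel are assumptions made for this example.

```python
import numpy as np


def smooth_spectral(cube, kernel=(0.25, 0.5, 0.25)):
    """Smooth a cube indexed as [channel, y, x] along the channel axis.

    The default kernel is a three-channel Hanning window. Edges are handled
    as if the spectrum were padded with zeros, so the first and last
    channels are attenuated. The cube should have at least as many channels
    as the kernel has elements.
    """
    cube = np.asarray(cube, dtype=float)
    k = np.asarray(kernel, dtype=float)
    k = k / k.sum()  # normalize so that total flux is preserved

    # Convolve every line of sight independently along axis 0.
    return np.apply_along_axis(
        lambda spectrum: np.convolve(spectrum, k, mode="same"), 0, cube
    )
```

A production version would also need to propagate masks and update the spectral resolution recorded in the image metadata, which this sketch ignores.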

In addition to these goals, there are three major areas that can be expected to become main focus items during upcoming development cycles:

1. Production of publication quality plots.

2. The ability to script viewer functionality from the command line.

3. Exposure of the viewer to scientist-designed code.

The ability to produce publication quality plots has been a long-standing target and the viewer has seen a continuous stream of minor cosmetic improvements. This can be expected to continue and could rise to a main focus of development depending on community feedback. That is, the viewer makes plots now. As the demand for better plots exceeds the demand for other basic functionality, development focus on this item will increase.

The viewer has basic scripting capability via the IMVIEW task, but the exposure of new functionality to command line calls has lagged and will require a substantial investment of developer time. Eventually, this is key to interface the visualization capabilities of the viewer with automated scripting. This level of reproducibility is especially important for the publication quality plot angle: imagine wanting to change one aspect of a plot during revision of a paper or (a very common occurrence) to cycle through a survey using the viewer.

Finally, if CASA development focuses on infrastructure and basic functionality then the viewer needs to be exposed to scientist developers in a way that makes it possible for them to interface their code with it in real time. A basic example would be spectrum fitting or overplotting of models on a spectrum. The splatalogue team has developed a test case along these lines, working out an LTE calculation that predicts relative line strengths for specified species. In general, the issue is to define how the viewer would interact with user-contributed code. A closely related issue is the interface of the viewer and CASA tasks with VO servers to allow the integration of cube exploration with existing databases.


The relative priority of these three broader goals still has to be weighed at each cycle by gauging community need [2]. Broadly, along with basic cube exploration and manipulation these represent the core goals of the “evolution” approach. In an ideal world, scripting and exposure of a viewer interface might allow next-generation analysis approaches developed by the community to meld with CASA.

Harnessing Community Development

We have tried to emphasize that an incremental approach within CASA makes the most sense to slowly build up viewer and analysis capabilities. This ensures that limited developer effort aids the largest possible fraction of the user base. In this picture, community development (or development at NRAO outside the main CASA context, see the Kern white paper) has a critical role to play. This has three main aspects: (1) development studies outside CASA aiming for the “next big thing” in data cube analysis, (2) interfacing with existing infrastructure like SciPy, NumPy, Matplotlib, and AstroPy to avoid duplicate development, (3) creating an environment (both in CASA and in the community) that fosters small-scale scientist development and sharing of code.

We have already discussed the role of ALMA development programs in fostering the “next big thing” for analysis, e.g., server-side analysis, high-dimensional analysis, unique interfaces, interface with large databases (though basic interface with VO-compliant servers is definitely within the scope of the 6-month cycle development).

The second point is that a large amount of infrastructure for plotting, fitting data, even advanced statistical calculations and image processing exists already. CASA has already taken advantage of the matplotlib / pylab plotting infrastructure (e.g., in PLOTANTS, PLOTCAL, PLOTBANDPASS, and the AnalysisUtils packages). These are now standard astronomical analysis tools with a good pedigree. Rather than invest large amounts of developer effort in duplicating them inside CASA, it makes most sense to expose the data in ways that allow users to interface with these tools. This is already well underway via the toolkit and tasks like IMVAL but will require continued refinement and documentation/examples. A key area to watch will be the development of AstroPy, a community project (led by staff members at STScI and MPIA) to develop a set of astronomical python modules that complement SciPy and fill a similar niche to Goddard’s IDL libraries.

Specialized, small-scale community development is just as critical. Under the framework that we describe it will be left up to the community to develop sub-field specific applications (e.g., an ammonia spectrum fitter, a rotation curve fitter, etc.).

[2] The recent ALMA User Survey provides some help, but the North American responses were not very focused on analysis, probably because the survey preceded the availability of archive data or many deliveries.


NRAO and CASA have two critical roles to play here. First, CASA needs to provide the cleanest and easiest possible access to the data. This means both reading and writing, easy coordinate access, and p-v cut and spectrum extraction to FITS and python data structures in automated ways. These goals have already been detailed above. Second, for maximum impact NRAO should directly foster sharing of community code without adopting responsibility for maintenance. This represents mostly scientist, rather than developer, effort. Some movement in this direction has already occurred in the establishment of the (so far lightly used) NRAO forums and the addition of contributed code areas to the CASA guides. Over the next year, a goal of both CASA scientific staff and NAASC will be to work out the right approach to collect and distribute community analysis code in a way that complements CASA without adding to the already strained developer load.
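As a rough illustration of what “easy coordinate access” and spectrum extraction to python data structures could look like on the user side, the sketch below averages a rectangular region of a NumPy cube into a spectrum and attaches a velocity axis. It does not use any CASA interface. The function names, the `[channel, y, x]` axis order, the linear frequency axis described by FITS-style `crval`, `cdelt`, and `crpix` values, and the radio velocity convention are all assumptions made for this example.

```python
import numpy as np

C_KMS = 299792.458  # speed of light in km/s


def channel_frequencies(nchan, crval, cdelt, crpix):
    """Frequency of each channel on a linear axis (crpix is 1-based, as in FITS)."""
    pixel = np.arange(nchan) + 1.0
    return crval + (pixel - crpix) * cdelt


def radio_velocity(freq, rest_freq):
    """Radio-convention velocity in km/s: v = c * (f0 - f) / f0."""
    freq = np.asarray(freq, dtype=float)
    return C_KMS * (rest_freq - freq) / rest_freq


def region_spectrum(cube, y0, y1, x0, x1):
    """Mean spectrum over the pixel box [y0:y1, x0:x1] of a [channel, y, x] cube."""
    return np.asarray(cube, dtype=float)[:, y0:y1, x0:x1].mean(axis=(1, 2))
```

With these pieces a user could pair `radio_velocity(channel_frequencies(...), rest_freq)` with `region_spectrum(...)` and hand the two arrays to any fitting or plotting package.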