AstroGrid : powering the Virtual Observatory

Andy Lawrence

Institute for Astronomy, University of Edinburgh, Royal Observatory,

Blackford Hill, Edinburgh EH9 3HJ, Scotland, UK

On behalf of the AstroGrid consortium

http://www.astrogrid.org

Proc. SPIE, 4846, in press, 2002.

ABSTRACT

AstroGrid is the UK’s contribution to the world-wide drive towards a Virtual Observatory (VO). I describe the project, its relation to other VO projects and other e-Science projects, and its current status. I then examine the concepts and science drivers behind the Virtual Observatory and the Grid, and the technical challenges which we face. The conception of the VO we arrive at is not one of a software monolith, but rather one of a framework which enables data centres to provide competing and co-operating data services, and which enables software providers to offer a variety of compatible analysis and visualisation tools and user interfaces. The first priority of the VO projects worldwide is to provide the infrastructure which will enable such creative diversity. AstroGrid is however also a consortium of data centres, which will pool resources within this framework, and we expect to develop an early working implementation of immediate use to astronomers.

Keywords: Databases, Grid, web services, datamining

1. INTRODUCTION

This paper is in four parts. First I describe the AstroGrid project itself, and its relationship to other endeavours. Next, I summarise in turn the concepts behind the Virtual Observatory and the Grid, and how they relate to each other. I then examine the key technical issues we need to solve and the obstacles we need to overcome in order to build the Virtual Observatory. Finally I return to the AstroGrid project itself and summarise its current status.

2. THE ASTROGRID PROJECT

2.1. AstroGrid

AstroGrid is the UK’s contribution to the world-wide drive towards the Virtual Observatory ideal. Along with AVO and US-NVO, it is one of the three major up and running funded VO programmes. It is also part of the UK’s “e-science” programme, and is primarily funded as a Grid Technology development project, but with the aim of implementing an early working system as soon as possible. The AstroGrid consortium comprises (in alphabetical order) the Universities of Belfast, Cambridge, Edinburgh, Leicester, London (MSSL), and Manchester (Jodrell Bank), along with the Rutherford Appleton Lab (RAL) of CLRC. It is primarily a consortium of the main UK astronomical data centres. This is not a coincidence, as our take on the VO concept is one of enhanced and collaborative services offered by data centres. However the consortium also contains expertise in astronomical client-side software. The consortium came together during 1999 and 2000 as a response to PPARC’s LTSR (Long Term Science Review) process, which highlighted the importance of IT initiatives in astronomy, and especially the challenge of new large databases. A “White Paper” was debated by the PPARC advisory system and the UK community in Autumn/Winter 2000. This concentrated initially on optical, IR, and X-ray astronomy, but as a result of this debate we widened both the consortium and our remit to include radio astronomy, solar physics, and solar-terrestrial (space plasma) physics. A formal funding proposal was submitted in April 2001 and the project formally started in September 2001, with a plan to undertake a one-year Phase A study followed by a two year Phase B implementation.

2.2. The UK e-Science Programme

The “e-science” programme is a multi-disciplinary initiative aimed at transforming the IT infrastructure of UK science, spending approximately £120M ($200M) over three years. It has a special emphasis on Grid technology, but aims more generally at developing new styles of internet-enabled science. Across science there are repeated themes - data integration, large data volume challenges, resource discovery, distributed computing, improved visualisation, data throughput intensive analysis, and collaboration across the internet. Much of the UK e-science money is allocated to individual research councils to fund application projects. The biggest of these (£20M over three years) is GridPP, the UK contribution to the CERN LHC Grid computing project, which in turn is part of the EU DataGrid project. Analysing the huge volume of LHC data requires distributed computing across Europe. Astronomy receives £5M, of which roughly £3.5M is being spent by AstroGrid and the remainder by a variety of other projects. Other major projects are MyGrid (Bio-informatics) and NERC data grid (Earth Sciences). There is also a core programme aimed at developing a generic infrastructure and supporting the applications programmes. This funds a Grid Support Team, a National e-Science Centre (NeSC) in Edinburgh, and a number of regional e-science centres. The core programme runs training sessions and a series of workshops, but has also spawned some development projects, the most important of which is OGSA-DAI (Open Grid Services Architecture - Database Access Infrastructure), a collaboration with IBM Hursley. AstroGrid and MyGrid have been designated “early adopters” of OGSA-DAI products, and are already working with that team.

2.3. The international context

The AstroGrid consortium is a partner in the EU funded European AVO project (Astrophysical Virtual Observatory). The work programmes are not identical but intersect. Formally AVO funds three staff positions which we see as part of the AstroGrid team, and an equivalent number of PPARC funded FTEs (full time equivalents) spread across a larger number of individuals are at the disposal of AVO. We have carefully aligned workpackages to make this feasible. The main distinction between AstroGrid and AVO is timing. AstroGrid is committed to early implementation, even if this requires limited functionality, whereas AVO is an R&D programme aimed at a follow-on project to construct a full scale working facility. To AVO then, AstroGrid is both a technology development programme, and a very large pilot programme.

AstroGrid also has good working relations with the US-NVO project. Since Autumn 2001, the three main funded projects (US-NVO, AVO, and AstroGrid) have had several joint workshops and a regular sequence of leadership-level telecons. At a conference in June 2002 in Germany we created an “International Virtual Observatory Alliance”, committed ourselves to a joint roadmap, and invited other newer VO projects (for example in Germany, Russia, and Australia) to join in. This body has regular telecons and is developing a joint web site.

2.4. Goals of AstroGrid

Informally, AstroGrid is driven by four slogans. (i) The archive is the sky. (ii) Everybody can be a power user. (iii) Shift the results not the data. (iv) A supercomputer on your desk. The reason for each of these should become clear in sections 2 and 3. I should note that the first of these slogans was originated by Alex Szalay in promotional talks about the US NVO project. The goals of AstroGrid put forward in the April 2001 proposal are as follows. (i) A working datagrid for key UK databases. This includes the necessary machines, middleware, and user interface. (ii) Tools for simultaneous browsing of multiple archives, including improved visualisation techniques. (iii) Tools for on-line data analysis - complex queries, statistics, model fitting, cluster analysis, outlier selection, PCA, etc. (iv) A system for uploading user algorithms. (v) A resource discovery method. A year in, these goals largely stand, but have been sharpened and limited, as we discover where the work lies. A revised set of goals is presented at the end of this paper.

3. THE VIRTUAL OBSERVATORY

3.1. General Science Drivers

There is no single “killer application”, but rather a very general drive towards improvement in three main areas. (i) More ambitious kinds of on-line analysis - not just data download, but complex queries, and the ability to analyse data in situ, such as panning across large images, making N-D parameter plots and rotating them, fitting curves etc. This is part of a trend towards on-line working and standardised tools, but remote analysis is also required by the expected data explosion in astronomy. (ii) Multi-archive science - collecting all the multiwavelength data on a particular object, cross-matching optical and IR catalogues, or finding the ground-based radar measurements that correspond to the delayed effect of a coronal mass ejection observed with a solar telescope. (iii) Data intensive science - searching for rare objects, calculating correlation functions, or finding clumps in multiparameter space, for billions of objects. All the above can be done now - but slowly and awkwardly. The VO should make such techniques easier, faster, and more transparent. In doing so of course, it should increase the volume and quality of science. In particular, data intensive science is currently the province of scientists who dedicate their time to such techniques, but we should make such analyses easy enough that everybody can be a power user.

3.2. Collectivisation and Democratisation

The VO can be seen as part of an inexorable long term trend in astronomy towards communal organisation. The first step was the development of “common-user” or “facility class” instruments. The instrument is already built, documented, and robust, liberating the astronomer to think about collecting data. Next came communally developed data reduction packages, such as Starlink, IRAF, or Midas - one no longer had to hack one’s own Fortran. Then came the first on-line archives. Next came a development that has transformed our daily working lives - the availability of push-button on-line information resources, such as Simbad, ADS, and astro-ph. Right now we seem to be in the middle of a trend towards collectivising the collection of data, with large consortium survey projects such as SDSS, 2dFGRS, and UKIDSS. What are the obvious next steps in this process? The first is interoperability of archives - making them all speak the same language so one can make joint queries. The second is automation of resource discovery - not needing to know the URLs of dozens of different web pages. The third is facility class analysis tools. Right now we wouldn’t dream of coding our own data reduction routines, but do expect to write our own code for correlation functions, principal component analysis, etc. Such things should become modular and standard - and available on-line attached to the data. Just as with all the previous stages of communal development, although the evolution seems at first to be in a collectivist direction, the actual effect is to empower the individual. With all these tools, you don’t need to be in Cambridge or Caltech - one’s resources are nearly as good in Sheffield or Florida. This democratisation of science is a key driving force in nearly all e-science projects.

3.3. The Virtual Observatory Concept

The VO vision can be summed up as the desire to make all archives speak the same language - all searchable and analysable by the same tools, all data sources accessible through a common interface, all data held in distributed databases that appear as one. Much astronomical research will be done by “observing” this virtual sky. The archive becomes the sky.

4. THE GRID

The “Grid” concept originally referred to computational grids, i.e. distributed sets of diverse computers co-operating on a calculation. The term “Grid” is an analogy with the electrical power grid. This involves huge power stations and a complex infrastructure, but the user doesn’t need to know - you just plug your hairdryer into the socket and power flows. Likewise the vision is that from your desktop you submit a calculation to “the Grid”, and don’t even need to know where your job is running - hence the next slogan: a supercomputer on your desktop. The relevance of strict computational grids to astronomical data analysis is debatable, but the term “Grid” has expanded to refer to a general sense of transparent access to distributed resources, and a sense of collaboration and sharing. The resources which are shared could be storage, documents, software, CPU cycles, data, expertise, etc. The sharing concerned is usually taken to mean not just an attitude of “help yourself”, but a commitment to organised management of resources by a community and/or putting in place mechanisms to ease that sharing. The history of computing can be seen as a gradual evolution towards the grid concept. First came the physical networks, and the protocol stacks, to enable us to pass messages between computers. Next came the World Wide Web, providing transparent sharing of documents. Then came computational grids enabling shared CPU. A new concept is that of a datagrid, making possible transparent access to databases. This is close to the Virtual Observatory concept, but to truly reach this ideal, we believe what we need is a service grid. This involves not just open access to data sources, but also standardised formats and standardised services, i.e. operations on the data. Beyond this, the Grid community talks of information grids, knowledge grids, and Virtual Organisations.

5. HOW DO WE GET THERE ?

5.1. Sociology and Technology

Many of the obstacles we need to overcome are sociological more than technological. The first priority is to accelerate the drive towards standards. We need standards for both data formats and metadata, and to express these in forms compatible with modern information technology. (One of the early joint accomplishments of US-NVO, AVO, and AstroGrid was VOTable, an XML standard for astronomical table data). We need a standardised way of expressing provenance. This can mean “processing history” as in a pipeline system, but in a VO context it can also mean “where did this come from, and who touched it on the way to me?”. Most interestingly, we need to standardise ways of expressing semantics, to aid software reasoning with returned quantities. As a minimum, it means an agreed vocabulary, so that applications using returned data know for example whether a column labelled “RMAG” is a Johnson magnitude or a Sloan magnitude, and whether the value is Vega-based or an AB magnitude. The system of “Uniform Column Descriptors” (UCDs) developed at CDS Strasbourg is an excellent start. Beyond simple vocabularies, we need what is referred to in computer science circles as an “ontology”, expressing the relationships between terms in a taxonomy as a series of connected trees.
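The “RMAG” ambiguity can be made concrete with a small sketch. The descriptor labels, class names, and the two example archives below are invented for illustration; they are not actual UCDs and are not part of any VO standard. The point is only that software should resolve meaning through a shared vocabulary rather than through the local column name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BandDescriptor:
    """Meaning attached to a magnitude column (illustrative fields only)."""
    system: str      # photometric system, e.g. "Johnson" or "Sloan"
    band: str        # filter name within that system
    zero_point: str  # "Vega" or "AB"


# Hypothetical agreed vocabulary: descriptor label -> meaning.
# These labels are made up for this sketch; they are NOT real UCDs.
VOCABULARY = {
    "phot.johnson.R.vega": BandDescriptor("Johnson", "R", "Vega"),
    "phot.sloan.r.ab": BandDescriptor("Sloan", "r", "AB"),
}


@dataclass(frozen=True)
class Column:
    name: str        # local column name chosen by the data centre
    descriptor: str  # label drawn from the shared vocabulary


def describe(column: Column) -> BandDescriptor:
    """Resolve a column's meaning through the vocabulary, not its name."""
    try:
        return VOCABULARY[column.descriptor]
    except KeyError:
        raise ValueError(
            f"column {column.name!r} uses unknown descriptor "
            f"{column.descriptor!r}"
        ) from None


def comparable(a: Column, b: Column) -> bool:
    """Magnitudes compare directly only if system, band and zero point agree."""
    return describe(a) == describe(b)


# Two archives both call their column "RMAG" but mean different things.
archive_a = Column("RMAG", "phot.johnson.R.vega")
archive_b = Column("RMAG", "phot.sloan.r.ab")
print(comparable(archive_a, archive_b))  # the names match, the meanings do not
```

A flat dictionary like this is only a vocabulary. An ontology in the sense described above would also record relationships between the terms, for example that both descriptors are kinds of optical magnitude.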

Many of the advances we need are in the area of internet technology. In a later section we look at the emerging technology itself. Here we examine the requirements. We need protocols for publishing and exchanging data. The data exchange I am thinking of is between data centres and users, made available as a data service - not just a manual download, but automated exchange between programmes being run by a user and databases run by the data centre. This may involve a variety of client programmes, and a variety of database systems, which shouldn’t need to know about each other - hence the need for standardised exchange methods which isolate the local details. But user programmes need to know what data services are available, so the next requirement is for a registry to which data centres can publish their services. Ideally the ontology used can also be published so that this can be dynamic rather than fixed. The next technical requirement is a community-agreed standardised method for transmitting identity, authentication, and authorisation, so that the VO will be a “single sign on” system. One doesn’t want a trawl of the world’s archives for data on some particular object to stop and return seventeen times to ask the user if she is allowed to have this data and if so can she give a password please? Finally, to achieve the Grid part of the VO vision, we need methods for managing distributed heterogeneous resources.
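The registry requirement can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not a design for the real registry: the class names, capability strings, provider names, and example.org endpoints are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServiceRecord:
    provider: str            # data centre publishing the service
    endpoint: str            # where the service is reached (placeholder URL)
    capabilities: frozenset  # what the service offers, in agreed terms


@dataclass
class Registry:
    """Toy registry: data centres publish, user programmes discover."""
    _records: list = field(default_factory=list)

    def publish(self, record: ServiceRecord) -> None:
        self._records.append(record)

    def discover(self, *required: str) -> list:
        """Return every service offering all of the required capabilities."""
        wanted = set(required)
        return [r for r in self._records if wanted <= r.capabilities]


registry = Registry()
registry.publish(ServiceRecord(
    "centre-a", "http://centre-a.example.org/query",
    frozenset({"cone-search", "optical"})))
registry.publish(ServiceRecord(
    "centre-b", "http://centre-b.example.org/query",
    frozenset({"cone-search", "infrared", "cross-match"})))

# The client never hard-codes which centre holds infrared data.
for service in registry.discover("cone-search", "infrared"):
    print(service.provider, service.endpoint)
```

The client asks for capabilities, not for a named archive, which is what removes the need to know dozens of URLs in advance.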

5.2. Bottlenecks and Grid geometry

Two bottlenecks determine the structure of the system we need to develop. The first is the limitation on I/O between disk and CPU. Some problems are limited by seek time, which has changed hardly at all over many years. Other massive search problems proceed more or less at streaming bandwidth, but even this has grown more slowly than Moore’s Law. Meanwhile the astronomical databases are growing ever bigger. Multi-TB databases are now the norm; the UKIDSS IR survey will produce 100TB within a few years, VISTA will produce that much every year by 2006, and LSST may one day make another order of magnitude leap. Simple minded serial searches of such databases would take months, and more complex analyses such as correlation functions, even longer. The solution is partly in better algorithms, caching, and so on, but sometimes only brute force will do - in other words, parallelism. We need database supercomputers for search and analysis. This trend is already well underway, with PC clusters becoming the norm. Users will not have their own clusters. Powerful search and analysis engines will be provided by major data centres.
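A back-of-envelope calculation shows why a serial scan takes months and why parallelism helps. The 100TB figure is the one quoted above; the sustained read rate of 20 MB/s per disk and the cluster size of 100 nodes are assumptions chosen for illustration, and the even split across nodes is an idealisation that ignores seek time and coordination overheads.

```python
SECONDS_PER_DAY = 86_400


def scan_days(database_bytes: float, bytes_per_second: float,
              n_nodes: int = 1) -> float:
    """Days to read the whole database once, split evenly over n_nodes disks."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return database_bytes / (bytes_per_second * n_nodes) / SECONDS_PER_DAY


DATABASE_BYTES = 100e12  # 100 TB, the UKIDSS figure quoted in the text
STREAM_RATE = 20e6       # ASSUMPTION: 20 MB/s sustained read per disk
CLUSTER_NODES = 100      # ASSUMPTION: size of a data-centre PC cluster

serial = scan_days(DATABASE_BYTES, STREAM_RATE)
parallel = scan_days(DATABASE_BYTES, STREAM_RATE, CLUSTER_NODES)

print(f"serial scan:   {serial:.1f} days")
print(f"{CLUSTER_NODES}-node scan: {parallel * 24:.1f} hours")
```

With these assumed numbers the serial scan comes to roughly 58 days, consistent with the “months” quoted above, while the idealised 100-node scan finishes in well under a day.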

The second bottleneck is network bandwidth. Nominal bandwidth is improving all the time, but true effective bandwidth is not typically limited by fibre capacity, but by end-point CPUs and firewalls, and is unlikely to be good enough to make download of very large databases feasible. Even if one had the patience to download huge datasets, most users would not be able to store them. The answer is that the analysis one desires should be performed on the data in situ, and only the final answers brought back to the user. This leads to our next slogan, shift the results not the data.
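The difference between shipping data and shipping results can be illustrated with a toy data service. Everything here is hypothetical: the `DataService` class, its two methods, and the synthetic catalogue stand in for a real data centre and its holdings, and JSON is used only as a convenient wire format for counting bytes.

```python
import json
import random


def make_catalogue(n_rows: int, seed: int = 0) -> list:
    """Synthetic catalogue of objects with random magnitudes (illustration only)."""
    rng = random.Random(seed)
    return [{"id": i, "mag": rng.uniform(10.0, 25.0)} for i in range(n_rows)]


class DataService:
    """Stands in for a data centre: the catalogue stays inside this object."""

    def __init__(self, rows: list) -> None:
        self._rows = rows

    def download_all(self) -> bytes:
        """Old style: ship the whole catalogue to the user."""
        return json.dumps(self._rows).encode()

    def count_brighter_than(self, mag_limit: float) -> bytes:
        """Service style: run the analysis in situ, ship only the answer."""
        n = sum(1 for row in self._rows if row["mag"] < mag_limit)
        return json.dumps({"count": n}).encode()


service = DataService(make_catalogue(10_000))
full = service.download_all()
answer = service.count_brighter_than(12.0)

print(f"bytes shipped for full download: {len(full)}")
print(f"bytes shipped for in-situ query: {len(answer)}")
```

The size of the full download grows with the catalogue, while the size of the answer does not, which is the whole argument for moving the computation to the data.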

Put together with the earlier conclusion regarding database supercomputers, the net effect is that data centres have to raise their game. Rather than offering simple download and documentation, as is normal today, they have to offer search and analysis services for the data they hold. This requires both powerful compute capacity, and the development of facility class analysis code.

This emphasis on data services determines the geometry of the VO. We do not want one centralised superarchive, because the latest and biggest archives will always be developed at a number of competing centres, where the relevant scientific expertise resides. Neither do we want a completely democratic and anarchic P2P network, like Napster. Some computational and pipelining problems can benefit from the Seti-at-home approach, but this is not usually relevant to the world of data services that we envisage. Neither should the VO be a pre-planned pyramidal hierarchy, like the LHC Grid. Rather, the structure is one of a modest number of service providers and a large number of users. It is pretty much like the commercial model. Some of the use is open and unplanned. Some is registered use, or even allocated and costed. Users have a variety of access rights. The VO framework allows data centres to publish their services in a standard way, and user software can then combine multiple services from a variety of centres as required. The data centres are free to compete, but consortia of data centres can pool their resources and present a common interface, and direct queries flexibly as required. (This is the grid-like aspect of the VO).

5.3. Two Rivers

Most of the technology we need is being created outside the world of astronomy, and our job is to assess and select the technology, and then to further develop and deploy it. There are two separate fast flowing rivers, and until recently it looked like we had to decide which to jump into. The first river is the academic computer science world, centred around the Grid concept and especially the Globus project. The drive has been for many years towards truly co-operative distributed computing, and the key concerns have been remote log on (of processes, not humans); identity, authentication and authorisation; and resource management. The second river is the commercial/W3C world. The emphasis here has been on the problem of exchange of data (business to business), and of service description and publication. Enormous advances have been made in the last few years, bundled as the idea of “web services” - the use of XML to describe data, exchange of data in SOAP wrappers, and service description in a standardised language called WSDL.
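To show what “exchange of data in SOAP wrappers” looks like in practice, here is a minimal sketch that builds and reads back a SOAP 1.1 envelope using only the Python standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the `ConeSearch` operation, and its parameter names are hypothetical and do not describe any actual AstroGrid or VO interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:catalogue-service"           # hypothetical namespace


def build_request(operation: str, **params) -> bytes:
    """Wrap an operation call and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def parse_request(message: bytes):
    """Recover the operation name and parameters on the service side."""
    envelope = ET.fromstring(message)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    call = body[0]  # first child of Body is the operation element
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


# Hypothetical request: all sources within 0.1 degrees of a sky position.
message = build_request("ConeSearch", ra=180.0, dec=-1.5, radius=0.1)
print(message.decode("utf-8"))
print(parse_request(message))
```

Neither side needs to know how the other stores or processes the data, only the agreed message format. In a full web service the operation and its parameters would in turn be described in WSDL so that clients can discover them.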

AstroGrid has been assessing the relevance of these technologies to the VO problem over the last year. They are both obviously extremely relevant and important. There are however some problems with Globus. Firstly, it is a work in progress, not a robust product. Next, it is not designed with a services-user structure in mind. Finally, it supports only flat file data transfer, as opposed to structured databases. The web services concept seems particularly close to our conception of a service based VO, but there are also limitations here. Standard web services are one-to-one and stateless, whereas we need to compose multiple services, using some kind of lifetime management. XML is very verbose - we need a standardised compact way of linking bulk binary data. There is no generally accepted authentication/authorisation solution. (There are various commercial digital certificate solutions, but none of them do what we want, for example authorising communities, and allowing access delegation). Next, it looks like we need a purpose designed astronomical registry - there is a commercial registry, UDDI, but for a variety of reasons it doesn’t match well to astronomical needs. The final limitation of standard web services is that the meanings of returned quantities are undefined - there is no agreed vocabulary, let alone ontology. There is unlikely ever to be a universal vocabulary or ontology, but rather solutions will be found in each domain. This is partly a sociological issue, as each community needs to find agreement. The technological issue is to find a standardised way of publishing and discovering meaning dynamically. This is very much the drift of the “semantic web” work in the W3C of course, but the technology required is not here yet.

The two streams - the Grid world, and the web services world - are now attempting to merge, with the idea of “grid services”. A project called Open Grid Services Architecture (OGSA) has been formed involving industry heavyweights as well as the Globus team at Argonne. One of the key elements is to build into the toolkit methods for accessing structured databases on the Grid. This has spawned a sub-project centred in the UK known as OGSA-DAI (OGSA Database Access Infrastructure). This is led by the National e-Science Centre in Edinburgh and has a strong involvement with IBM Hursley. Two application projects - AstroGrid and MyGrid - have been designated early adopters of OGSA-DAI products.

5.4. Who builds what ?

It seems worth elaborating on who needs to do what to arrive at the Brave New World. The vision I have described is not one of a software monolith, but rather one of an infrastructure that allows data centres and application writers to provide competing services and tools. However, such competing entities in fact are likely to pool resources to provide the best facilities for astronomers, and AstroGrid certainly intends to do so.

First, the VO projects worldwide need to develop the infrastructure. This means we have to evolve agreed standards, to build an infrastructural toolkit for data exchange, and to build and operate a registry of data services. The standards need to be absolutely universal; the toolkit needs to be nearly universal; but there could in principle be multiple registries.

With that infrastructure in place, data centres can maintain data in whatever format they like, using whatever database engines they like, but will use a standard toolkit to write data services available for their data. They then publish their database metadata, and a description of their services, to one or more astronomical registries. Likewise, with the infrastructure in place, astronomical software providers can write tools that understand the data services - SED constructors, multi-image browsers, datamining algorithms etc. They can also write many possible user interfaces to the structure.

AstroGrid is a consortium of data centres, not just a project to develop the infrastructure. As such, we will write the first data services for key UK databases; we will write an example point-of-entry user interface; we will adapt or write a few simple tools; and we will provide some central resource - for one or more data warehouses, some analysis CPU, and storage for a concept we refer to as MySpace, a kind of Grid scratch space. We will also collaborate in a grid-like manner, scheduling, routing and optimising queries between the machines pooled by the consortium.

If the VO structure becomes as pervasive as we expect, then there are also implications for observatories and instrument builders. They will need of course to output “VO ready” data, but they will also need to decide whether to offer their own data services, or to farm out to data centres. Finally, it would be good to close the loop between archived data and collection of data - when you trawl around for data on your favourite object and find a key piece of information isn’t there, wouldn’t it be nice for an observing application form to pop up?

6. STATUS OF ASTROGRID

AstroGrid was set up with an initial R&D focussed Phase A, to be followed, subject to review, by an implementation centred Phase B. At the time of writing, we have just completed a successful Phase A review and are about to hire new staff and start a construction phase beginning in January 2003. As part of this process we wrote a substantial Phase A Report, which is openly available and can be found on our web pages at http://www.astrogrid.org. The bulk of the Phase A work centred around three areas - technology assessment reports, requirements analysis, and architecture development. The latter two areas were undertaken fairly formally with UML and use-case analysis, as we are following the Unified Process software development rationale. In addition we have undertaken several small software demonstration projects - in user interfaces, in authentication and authorisation, in ontology, and as part of the AVO science demos. Finally, we have set up several collaborative web sites, meant to encourage an open project approach - from the main astrogrid web page one can reach a news site and a forum site, which anybody from outside as well as inside the project can view and also post to (if registered), and a wiki site which is a network of directly editable web pages meant for collaborative document and software development. We welcome all to interact with these sites.

As a result of the Phase A review we formulated revised project goals. These are our Scientific Aims:

• to improve the quality, efficiency, ease, speed, and cost-effectiveness of on-line astronomical research

• to make comparison and integration of data from diverse sources seamless and transparent

• to remove data analysis barriers to interdisciplinary research

• to make science involving manipulation of large datasets as easy and as powerful as possible

These are our top-level practical goals:

• to develop, with our IVOA partners, standards for data, metadata, data exchange, and provenance

• to develop a software infrastructure for data services

• to establish a physical grid of resources shared by AstroGrid and key data centres

• to construct and maintain an AstroGrid Service and Resource Registry

• to implement a working Virtual Observatory system based around key UK databases

• to provide a user interface to that system