
An information system with data pulling and full data lineage is essential for the handling and visualization of large astronomical datasets (Chapter 1). These concepts are best introduced by discussing how they are incorporated in Astro-WISE.

The Astro-WISE consortium has designed a new paradigm and has implemented a fully scalable information system to overcome the huge information avalanche produced by wide-field astronomical surveys (Mwebaze et al., 2009; McFarland, 2010). This is achieved by capturing in a generic way the reality of end-to-end survey operations into a conceptual data model. The data model is translated into hierarchical classes using an object model that maps all links between dependencies. These objects are stored in the database, which links all data products to their progenitors (dependencies), creating a full data lineage of the entire processing chain. This object model is the blueprint of Astro-WISE.

Astro-WISE uses the advantages of Object-Oriented Programming (OOP) to process data in the simplest and most powerful ways. In essence, it turns the objects that represent conventional astronomical science products into OOP objects, called process targets. Each process target has associated processing parameters, which are configurable parameters that guide the processing of that target. Each of these process target instances knows how to process itself to create the data product it represents. The processing itself can be done on a distributed cluster, and the results are stored either on a data server or in the database.

The most distinctive aspect of Astro-WISE is its ability to process data based on the final desired result to an arbitrary depth. This data pulling is the heart of Astro-WISE and is called target processing. Contrary to the typical case of forward chaining, the Astro-WISE database links allow the dependency chain to be examined from the intended process target all the way back to the raw data. A target's dependencies are checked to see whether the target is up to date; if there is a newer dependency, or if the target does not exist, the target is (re)created.
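The backward chaining described above can be illustrated with a minimal sketch. It is not the Astro-WISE implementation: the class `ProcessTarget`, its fields `made_at` and `is_raw`, and the function `pull` are hypothetical names introduced here, and a simple logical counter stands in for real creation timestamps.

```python
import itertools
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical logical clock; a real system would use creation dates.
_clock = itertools.count(1)


def tick() -> int:
    """Return the next logical time."""
    return next(_clock)


@dataclass
class ProcessTarget:
    """Hypothetical stand-in for a process target with its dependencies."""
    name: str
    dependencies: List["ProcessTarget"] = field(default_factory=list)
    made_at: Optional[int] = None  # None means the target does not exist yet
    is_raw: bool = False           # raw data is never (re)created


def pull(target: ProcessTarget, log: List[str]) -> ProcessTarget:
    """Start at the requested target and walk back towards the raw data.

    Dependencies are brought up to date first. The target itself is
    (re)created only if it does not exist or a dependency is newer.
    """
    if target.is_raw:
        return target
    for dep in target.dependencies:
        pull(dep, log)
    newest = max((dep.made_at or 0 for dep in target.dependencies), default=0)
    if target.made_at is None or newest > target.made_at:
        target.made_at = tick()   # placeholder for the actual processing
        log.append(target.name)   # record what was (re)created
    return target
```

In this sketch, pulling a catalog that depends on a reduced frame, which in turn depends on a raw frame, creates the reduced frame and then the catalog. Pulling the same catalog again does nothing, and replacing the raw frame with a newer one causes both derived targets to be recreated on the next pull.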

2.1.1 A Functional Approach to Catalogs as Process Targets

In this chapter we take data pulling to the extreme with the design of process targets for catalog data. Such a catalog object, which we call a Source Collection, represents a collection of sources with attributes (or parameters) that quantify physical properties.

The full data lineage allows any target to be processed at any time for any reason, since the process parameters unambiguously define how to do so. Ultimately, this means that it is not necessary to process a target completely, or at all. In a sense, this turns the Object-Oriented approach into a Functional one: a Process Target can be seen not only as a representation of the result itself, but also as a representation of the operation that is used to derive the science product. The actual processing of the object and storing of the result is then optional.
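The dual view of a Process Target, as a result and as the operation that derives it, can be sketched as follows. This is an illustration under stated assumptions rather than the design presented in this chapter: the class `SourceCollection` below and its methods `select`, `sources`, `make`, and `is_processed` are hypothetical, and sources are represented as plain dictionaries of attribute values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Source = Dict[str, float]  # hypothetical: one source as attribute -> value


@dataclass
class SourceCollection:
    """Hypothetical catalog object defined by its lineage, not its contents."""
    parent: Optional["SourceCollection"] = None
    predicate: Optional[Callable[[Source], bool]] = None
    stored: Optional[List[Source]] = None  # catalog data, if processed

    def select(self, predicate: Callable[[Source], bool]) -> "SourceCollection":
        """Describe a new collection by its lineage; nothing is evaluated."""
        return SourceCollection(parent=self, predicate=predicate)

    def is_processed(self) -> bool:
        """True if the catalog data has been derived and stored."""
        return self.stored is not None

    def sources(self) -> List[Source]:
        """Evaluate the lineage on demand without storing the result."""
        if self.stored is not None:
            return self.stored
        if self.parent is None or self.predicate is None:
            raise ValueError("collection has neither catalog data nor lineage")
        return [s for s in self.parent.sources() if self.predicate(s)]

    def make(self) -> "SourceCollection":
        """Optionally process the collection and keep the catalog data."""
        self.stored = self.sources()
        return self


# Usage: only the base catalog holds data; the derived ones are pure lineage.
base = SourceCollection(stored=[{"mag": 18.0}, {"mag": 21.5}, {"mag": 23.0}])
bright = base.select(lambda s: s["mag"] < 22.0)
very_bright = bright.select(lambda s: s["mag"] < 20.0)
```

Here `very_bright` can be used as a dependency even though neither it nor `bright` has been processed. Calling `very_bright.sources()` evaluates the chain on demand, and `make()` is only needed if the result should be stored.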

These two viewpoints are equivalent and interchangeable, and the contributions in this work stem from this dual perspective (section 2.3):

(1) We allow Source Collections to be created, and used as a dependency for other Process Targets, by specifying their data lineage, without requiring them to be processed.

(2) Dependency graphs of Source Collections are created automatically through data pulling. These mechanisms create new Source Collections in a way that maximizes their reusability for future pulling requests.

(3) We present a novel way to process only the part of a Source Collection that is required for the last Process Target in a dependency graph. This is done by using the power of backward chaining to temporarily optimize the dependency graph.

(4) We present a novel algorithm to determine the logical relationships between catalogs from their data lineage directly. This is required because the exact set of sources that a catalog represents might not be evaluated. This algorithm is used to find Source Collections and for the optimization of dependency graphs.

(5) The methods to calculate new attributes from existing attributes are decoupled from their application. This offers scientists flexibility in implementing their own methods while reinforcing the principles of data pulling.

(6) The catalog objects and data pulling mechanisms are designed to be used in query-driven visualization. This allows abstract data pulling and minimizes the processing required to create the visualized datasets.

2.1.2 Astro-WISE and other information systems

We have incorporated our catalog process targets in Astro-WISE, but the concepts should be applicable to other information systems as well. Therefore we refer to our particular implementation only when it influenced our design choices or when the implementation deviates from the design presented here. We first describe how Astro-WISE handles processing and storage, because these aspects recur throughout the chapter.

There are two ways to store information in Astro-WISE: all pixel data is stored as files on the data server, and everything beyond pixels is considered metadata and is stored in the database (Mwebaze et al., 2009). Catalog objects are stored entirely in the database: both the information about how to process the object and the result of such processing are considered metadata. To prevent ambiguity, we try to use more specific terms than 'data' and 'metadata' in this thesis. We use the term catalog data to refer to the result of processing a Source Collection (section 2.2.1).

Most processing of Process Targets can be performed either on the workstation of the scientist or on the distributed computing unit. Source Collections can be processed either on the scientist's workstation or on the database, depending on their class and on scalability or interactivity requirements (section 2.3.4). There is currently no explicit functionality to process Source Collections on the distributed computing unit. Most data pulling in Astro-WISE is done through the web-based Target Processor. Source Collections can be pulled on the interactive awe-prompt (chapter 3) or through query-driven visualization (chapter 5); they are currently not integrated with the Target Processor.