6.2 Object Model
6.2.3 Data-Analysis Model for PlugIns
A modular analysis system that allows different analysis algorithms to be combined flexibly is a key component of a microarray analysis system. Data structures representing the functional elements of an analysis process have therefore been included in the data model. The data model does not represent specific algorithms or implementations; it represents generic functions operating on the MAGE data structures.
The representation of analysis functionality is divided into three classes which correspond to program code in the backend of EMMA2:
Functions are the basic building blocks of the analysis process. A function represents the transformation algorithm implemented in a programming language. A function can be implemented in R, in Perl, or as an arbitrary executable program. Functions can have parameters that control their behavior and need to be set by the user. The data types of the parameters are also stored in the database. So-called import and export functions form a special class of functions that serve as a means to store the results of an analysis step in the database.
Tools can be seen as parameterized functions with defined values for the parameters. As the user needs to set actual values for the parameters of a function, these values have to be stored in the database. There is a special type of tool, which serves as a means to combine several analysis methods in a single consecutive execution. These tools are termed 'Queue' within the model (in the presentation to the user they are called 'pipelines', after an assembly pipeline). Queues may contain a sequence of other tools and also other queues, which are then executed consecutively. A queue is independent of the actual data set but often restricted to a specific data type, which is determined by the first function in the pipeline.
Jobs are a combination of tools and actual data sets to be analyzed. A job can be executed via a scheduling mechanism that dispatches jobs to a multi-host compute cluster. Jobs reference the designated input data and the output they produce. They can be assigned to an experiment if appropriate.
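The relationship between functions, tools, queues and jobs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EMMA2 persistence classes: all attribute names, the `flatten` method and the example names are hypothetical, and `Queue` is written as a separate class for brevity although the model describes it as a special type of tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Function:
    """A transformation algorithm plus the declared data types of its parameters."""
    name: str
    language: str  # e.g. "R", "Perl" or "executable"
    parameter_types: Dict[str, type] = field(default_factory=dict)


@dataclass
class Tool:
    """A function together with concrete values for its parameters."""
    function: Function
    parameter_values: Dict[str, object] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject values for undeclared parameters or of the wrong data type.
        for key, value in self.parameter_values.items():
            expected = self.function.parameter_types.get(key)
            if expected is None:
                raise ValueError(f"unknown parameter: {key}")
            if not isinstance(value, expected):
                raise TypeError(f"parameter {key} must be {expected.__name__}")


@dataclass
class Queue:
    """A pipeline: tools or other queues that are executed consecutively."""
    name: str
    steps: List[Union[Tool, "Queue"]] = field(default_factory=list)

    def flatten(self) -> List[Tool]:
        """Return the tools in execution order, expanding nested queues."""
        tools: List[Tool] = []
        for step in self.steps:
            tools.extend(step.flatten() if isinstance(step, Queue) else [step])
        return tools


@dataclass
class Job:
    """A tool or queue bound to an actual input data set."""
    tool: Union[Tool, Queue]
    input_data: str  # placeholder for a reference to a stored data set
    output_data: List[str] = field(default_factory=list)
```

A nested queue such as `Queue("outer", [Queue("inner", [t1]), t2])` flattens to `[t1, t2]`, which mirrors the statement that queues contained in a queue are executed consecutively.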
Three different types of functions can be specified, which serve different purposes:
General analysis functions are R or Perl functions or binary executables performing computations.
Export functions retrieve additional annotation information from MAGE-OM, BRIDGE or a web service. This can be, for example, pathway information from GenDB.
Writer functions put back data into the MAGE-OM. They can also be used to store images or graphics generated within a pipeline.
Importer functions put data into the database, like reading in raw-data files or array-layout definitions. Unlike writer functions, they do not require data from a previous computation.
Data structures were designed to represent the analysis modules within the database. The corresponding classes serve to set up pipelines and the parameters of the contained functions. An overview of the extension classes that supplement MAGE-OM is given as a UML diagram in Figure 6.3.
The ability to add new functions to the system is a key feature of the data analysis system. It allows functions to be added whose behavior was not known during the implementation phase. To avoid meaningless pipelines, a type system has been included in the model. An example of the application of the type system is as follows: a normalization function operates on the measured data that have been imported from the image analysis software; applying a normalization function to already normalized data or to the results of a significance test would not be sensible. Consequently, the creation of such pipeline configurations has to be prevented.
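The effect of such a type system can be illustrated with a minimal Python sketch that rejects a pipeline as soon as one stage cannot consume the output of its predecessor. The labels `measured`, `normalized` and `statistics`, the `Stage` tuple and `check_pipeline` are hypothetical names chosen for this illustration; they are not the types used by EMMA2.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    name: str
    accepts: str   # label of the data the stage consumes
    produces: str  # label of the data the stage emits


def check_pipeline(stages: List[Stage]) -> None:
    """Raise TypeError at the first stage whose input does not match the previous output."""
    for previous, current in zip(stages, stages[1:]):
        if previous.produces != current.accepts:
            raise TypeError(
                f"{current.name} expects {current.accepts} data, "
                f"but {previous.name} produces {previous.produces} data"
            )


normalize = Stage("normalization", accepts="measured", produces="normalized")
significance = Stage("significance test", accepts="normalized", produces="statistics")

# Normalization followed by a significance test is accepted.
check_pipeline([normalize, significance])
```

Calling `check_pipeline([normalize, normalize])` or `check_pipeline([significance, normalize])` raises a `TypeError`, corresponding to the two meaningless configurations named in the text.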
The first step in defining a type system is to identify the types. From a computational perspective, the data type of microarray data is represented by multidimensional arrays of numeric and factorial values. The atomic data types and the dimensionality of the arrays do not provide a suitable type system for functions, as they do not provide a logical classification of the data types. A numeric array of a given size might be the result of many different analysis algorithms.
Figure 6.3: Simplified UML diagram of the supplementary persistence classes of the analysis model. The diagram depicts the core classes Job, Tool, and Function together with derived classes. The Observation class is introduced to store all resulting data that do not fit in MAGE-OM. The diagram is simplified for clarity and readability: only the most important subclasses are depicted, and not all associations are named or shown with their cardinalities.
In particular, the dimension indicating the number of spots, genes or other design elements of a microarray should be ignored by the type system, as a function such as a normalization function should be applicable to datasets regardless of the specific array design used. On the other hand, the choice of a normalization function might reflect the technological platform of the array, because multi-channel and one-channel microarrays require different processing. In addition, different image quantification software packages produce diverse quantitation types.
The type system has to be based on a higher level of logical annotation, because MAGE-OM plus its extensions provides the software's complete view of the world. Any analysis function has to operate on the data structures that exist within the model. Therefore, the possible types of the analysis functions must reflect the data structures in MAGE as closely as possible.
Four classes representing steps of the data analysis serve as input and output of data analysis functions. These classes are descendants of the BioAssayData class and are used to determine the basic data type of a function in the analysis model:
PhysicalBioAssayData (PBAD) represent data measured with hardware equipment like scanners. The only possible data type found with PBAD is images. Image analysis is currently not incorporated into EMMA pipelines but carried out by external applications. The presence of the PBAD data type provides the possibility to include image analysis directly into a pipeline.
MeasuredBioAssayData (MBAD) are data resulting directly from an image quantification software. MBAD consist of tabular data containing raw intensities and quality statistics for each feature on the array. These data have to be further processed by normalization.
DerivedBioAssayData (DBAD) represent the output of a transformation process. A transformation process takes MeasuredBioAssayData or DerivedBioAssayData as input and creates one or more numerical datasets of type DBAD as output. Normalization is an example of a transformation from MBAD to DBAD. Other functions, such as significance tests, operate on normalized data of type DBAD and yield a table of significance statistics for each gene, which is also of type DBAD.
BioAssayDataClusters (BADC) are the results of a higher-level analysis that provides a grouping of individual design elements or of microarrays on the basis of MeasuredBioAssayData or DerivedBioAssayData. A typical function is a cluster analysis algorithm calculating a grouping of data into clusters. The results of a classification algorithm producing a mapping of design elements into disjoint classes may also be represented by this data type.
The BioAssayData classes are not complete, as specific analysis tasks require more sophisticated data types than found in MAGE; images, clickable maps, files, and lists of gene names are the most frequently required examples. The Observation class hierarchy is introduced to provide such supplementary data types. It also makes it possible to derive further subclasses for new analysis methods later without affecting the core MAGE-OM classes.
There are more preconditions that can help to classify the type of a data matrix, in particular the type of its rows. As an example, the rows in a quantification table from an image analysis software represent measured values for each individual spot on the microarray. Likewise, normalization produces normalized intensities or intensity ratios for each spot, represented by the Feature class in MAGE-OM. Individual spots may be seen as repeated measurements for a common polymer of nucleotides physically present on the arrays, called Reporter. Reporters may be further grouped into a logical sequence representing a genomic region. A common example is a gene represented by different oligonucleotide sequences, which are subsequences of its coding region. These logical sequences are called CompositeSequence in MAGE terminology.
A normalization function operates solely on the Feature level of the data matrix, whereas a function that computes an expected expression value, e.g. the mean, over all replicates for a sequence operates on Features, and its output is based on the Reporter or CompositeSequence assignment of the Features. In conclusion, the DesignElement type of a function may be one of Feature, Reporter, CompositeSequence or Any. The type Any may be applicable to functions that are indifferent to the type of design element present, such as plotting functions.
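The change of DesignElement level described above, from Features to Reporters, can be made concrete with a small Python sketch that averages replicate spots. The function name, the identifiers and the mapping from Features to Reporters are hypothetical; in the real system this assignment would come from the array design stored in MAGE-OM.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def summarize_by_reporter(
    feature_values: List[Tuple[str, float]],
    feature_to_reporter: Dict[str, str],
) -> Dict[str, float]:
    """Average the values of all Features that belong to the same Reporter."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for feature_id, value in feature_values:
        grouped[feature_to_reporter[feature_id]].append(value)
    # Input rows are Features; output rows are Reporters.
    return {reporter: mean(values) for reporter, values in grouped.items()}


values = [("f1", 1.0), ("f2", 3.0), ("f3", 5.0)]
layout = {"f1": "r1", "f2": "r1", "f3": "r2"}
print(summarize_by_reporter(values, layout))  # {'r1': 2.0, 'r2': 5.0}
```

The same grouping step applied with a mapping from Reporters to CompositeSequences would yield one value per logical sequence.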
The third categorization of data in the analysis process relies on the data types of the columns in a data matrix, called QuantitationType. In MAGE it is mandatory to assign a QuantitationTypeDimension to a data set, defining the type of measurements found in the columns. For a measured data set, the QuantitationTypes correspond to the column headers of the quantification table. Examples of quantitation types in measured data are the foreground intensity and the background intensity of the estimated spot intensities. The names of the quantitation types may vary between different image quantification software packages and analysis functions, and are not known to the implementation of the analysis system. Therefore, the quantitation types have to be specified at run time for each function present in the database.
In summary, the type of each function in EMMA2 is defined by a tuple of the input and output types, such that each function constitutes a mapping:
$$f : (B_i, D_i, Q_i) \longmapsto (B_o, D_o, Q_o) \tag{6.1}$$
where B ∈ {BAD, PBAD, MBAD, DBAD, BADC} is the class of the BioAssayData in MAGE, D ∈ {Feature, Reporter, CompositeSequence, Any} the type of DesignElement and Q the QuantitationTypeDimension in MAGE. The type of a function is modeled in the supplementary class
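The mapping in Equation (6.1) can be written down as a pair of (B, D, Q) triples, together with a check of whether the output of one function is acceptable as the input of another. The following Python sketch is an illustration only: the class and function names are hypothetical, the quantitation type names are invented examples, and the matching rules (BAD accepts all of its descendants, Any accepts every design element, and all required quantitation types must be present) are assumptions, not rules stated in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class BioAssayDataClass(Enum):
    BAD = "BioAssayData"
    PBAD = "PhysicalBioAssayData"
    MBAD = "MeasuredBioAssayData"
    DBAD = "DerivedBioAssayData"
    BADC = "BioAssayDataClusters"


class DesignElement(Enum):
    FEATURE = "Feature"
    REPORTER = "Reporter"
    COMPOSITE_SEQUENCE = "CompositeSequence"
    ANY = "Any"


@dataclass(frozen=True)
class DataType:
    """One (B, D, Q) triple."""
    bio_assay_data: BioAssayDataClass
    design_element: DesignElement
    quantitation_types: FrozenSet[str]


@dataclass(frozen=True)
class FunctionType:
    """The mapping (B_i, D_i, Q_i) -> (B_o, D_o, Q_o) of one function."""
    input: DataType
    output: DataType


def accepts(required: DataType, offered: DataType) -> bool:
    """Return True if data of type `offered` may be passed where `required` is expected."""
    # Assumption: BAD, the common base class, matches any of its descendants.
    class_ok = required.bio_assay_data in (BioAssayDataClass.BAD, offered.bio_assay_data)
    # Assumption: Any matches every design element type.
    element_ok = required.design_element in (DesignElement.ANY, offered.design_element)
    # Assumption: every required quantitation type must be among the offered columns.
    columns_ok = required.quantitation_types <= offered.quantitation_types
    return class_ok and element_ok and columns_ok


B, D = BioAssayDataClass, DesignElement
normalize = FunctionType(
    DataType(B.MBAD, D.FEATURE, frozenset({"foreground", "background"})),
    DataType(B.DBAD, D.FEATURE, frozenset({"log_ratio"})),
)
significance = FunctionType(
    DataType(B.DBAD, D.FEATURE, frozenset({"log_ratio"})),
    DataType(B.DBAD, D.FEATURE, frozenset({"p_value"})),
)
```

With these definitions, `accepts(significance.input, normalize.output)` is `True`, whereas `accepts(normalize.input, normalize.output)` is `False`, so normalization cannot be applied to already normalized data.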
Chapter 7 Implementation
In the previous chapters, the required functionality of EMMA2 has been specified, and from that the structural components and their interactions have been derived. The next step is to assemble them into an operational piece of software, using programming languages, code libraries and database management systems. A prototype version of the software has been built from scratch to refine and improve the specification and design.
An overview of the component structure of EMMA2 is given in Figure 7.1. The implementation and adaptation of these data structures, algorithms and novel visualization methods are explained in this chapter.
7.1 Choice of Core Development Tools
Many development environments, database tools, and programming languages are available to support the process of implementing software. Major criteria for selecting these tools are their reliability and efficiency; another is free availability, to guarantee that the system can be distributed and extended by everyone within the academic community. For microarrays, additional problems such as dealing with very large and noisy datasets and very complex data structures need to be addressed.