Ready for semantic integration? - Developing tools for biological data integration

Chapter 5 Developing tools for biological data integration

5.5. Ready for semantic integration?

The current implementation of PLAN handles data integration at the purely syntactic level, through the definition of executable workflows. Nevertheless, it has been designed to work within a global infrastructure supporting scientific workflows, whose architecture is described in [164]. This section highlights some of the requirements and difficulties involved in creating such a system.

5.5.1.a. Data semantics

As explained before, scientific workflows can be defined as data-oriented. Therefore we should carefully consider the semantics of the information being handled in any analytical flow. Each data node should have a semantic type and a syntactic type.

Semantic types can be defined at two different levels. The basic semantic type of a data node is the type defined in the context of its data source (either database or application), i.e. the local semantic type. This local semantic type should therefore be independent of any particular use of the data. A second layer of semantics can be established at the abstract workflow level, i.e. at the level of the application ontology. This application ontology is usually built around particular objectives (the intended user analysis), and can consequently vary between workflows.

How are the mappings of local semantic types to global semantic types (in the application domain ontology) done?

Local semantic types can be mapped to corresponding categories in the application ontology. This mapping is referred to as contextualization in [168]. E.g. Swiss-Prot keywords (syntax: swissprot-keyword, data type: string) can be semantically typed using the corresponding ‘categories’ defined by the UniProt Knowledgebase [169]. In this way ‘helicase’ and ‘hydrolase’ are categorised as “molecular function, enzyme”; ‘hereditary haemolytic anemia’ and ‘Alzheimer’s disease’ are categorised as “disease”; ‘SH3 domain’ and ‘Zinc-finger’ are categorised as “domain”; etc. The Swiss-Prot keyword categories provide a local semantic characterization that can be imported as such or mapped to corresponding terms in the application ontology.
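The contextualization step described above can be sketched as two lookup tables. The keyword categories are the ones quoted in the text; the application ontology terms (the `app:` names) are invented for illustration and would differ from one workflow to another.

```python
# Local semantic categories for Swiss-Prot keywords (examples quoted in the text).
LOCAL_CATEGORY = {
    "helicase": "molecular function, enzyme",
    "hydrolase": "molecular function, enzyme",
    "hereditary haemolytic anemia": "disease",
    "Alzheimer's disease": "disease",
    "SH3 domain": "domain",
    "Zinc-finger": "domain",
}

# Hypothetical application ontology terms: assumptions, not part of any real ontology.
APPLICATION_TERM = {
    "molecular function, enzyme": "app:EnzymaticActivity",
    "disease": "app:Pathology",
    "domain": "app:StructuralDomain",
}


def contextualize(keyword):
    """Return (local semantic type, application ontology term) for a keyword.

    Unknown keywords yield (None, None); a known local category without a
    mapping in the application ontology yields (category, None).
    """
    local = LOCAL_CATEGORY.get(keyword)
    if local is None:
        return None, None
    return local, APPLICATION_TERM.get(local)
```

Keeping the two tables separate mirrors the two semantic layers: the first belongs to the data source, while the second can be replaced for each application without touching the first.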

It is worth noting that local semantic types should be used for constraining connections (by semantic type checking), while global semantics will normally be used for building application domain rules that may not be present in the underlying data sources (databases or methods).
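A minimal sketch of semantic type checking on a connection follows. The `DataNode` class, the `can_connect` rule (both types must match exactly) and the node names are assumptions made for illustration; they are not taken from PLAN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    name: str
    semantic_type: str   # local semantic type, defined by the data source
    syntactic_type: str  # concrete format in which the value is expressed


def can_connect(source, target):
    """Allow a connection only if both semantic and syntactic types agree.

    Exact equality is an assumed rule; a real system could also accept
    subtypes or insert a conversion step.
    """
    return (
        source.semantic_type == target.semantic_type
        and source.syntactic_type == target.syntactic_type
    )


keyword_out = DataNode("keywords", "swissprot-keyword", "string")
keyword_in = DataNode("filter_term", "swissprot-keyword", "string")
sequence_in = DataNode("query", "protein-sequence", "string")
```

Here `keyword_out` and `sequence_in` share the syntactic type `string`, yet the connection is rejected because their semantic types differ, which is the situation that purely syntactic checking cannot detect.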

Local semantic types should be linked to their corresponding syntactic type(s) (a process known as ontological grounding). Some data sources can provide more than one syntactic type for each semantic class. This is usually the case for integrated data collections, such as InterPro. E.g. (see Figure 11): the unified InterPro (IPR001623) ‘Heat shock protein DnaJ, N-terminal’ domain is syntactically expressed through five signatures: (PF00226) DnaJ in Pfam; (PS00636) DNAJ_1 and (PS50076) DNAJ_2 in Prosite; (SM00271) DnaJ in Smart; (SSF46565) DnaJ_N in Superfamily.
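The grounding of this InterPro entry can be written down as a small data structure. The accessions and names are those listed above; the `Signature` class and the two helper functions are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    accession: str
    name: str
    database: str  # member database that defines the signature syntax


# One semantic class (InterPro entry) grounded in five syntactic signatures.
GROUNDING = {
    "IPR001623": (
        Signature("PF00226", "DnaJ", "Pfam"),
        Signature("PS00636", "DNAJ_1", "Prosite"),
        Signature("PS50076", "DNAJ_2", "Prosite"),
        Signature("SM00271", "DnaJ", "Smart"),
        Signature("SSF46565", "DnaJ_N", "Superfamily"),
    ),
}


def syntactic_types(interpro_id):
    """Sorted list of member databases that ground one InterPro entry."""
    return sorted({s.database for s in GROUNDING.get(interpro_id, ())})


def semantic_class(accession):
    """Reverse lookup: the InterPro entry a signature accession belongs to."""
    for interpro_id, signatures in GROUNDING.items():
        if any(s.accession == accession for s in signatures):
            return interpro_id
    return None
```

The reverse lookup corresponds to recognising a family or domain from a matched signature, while the forward lookup lists the databases from which the actual pattern syntax would have to be fetched.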

Although a sequence search against InterPro (through the InterProScan service) returns a list of matched InterPro families/domains, these families/domains are recognised through the mapping of underlying sequence signatures (InterProScan is in fact a query to a multidatabase using local query mechanisms, namely: BlastProDom, FPrintScan, HMMPIR, HMMPfam, HMMSmart, HMMTigr, ProfileScan, ScanRegExp, SuperFamily). While InterPro offers an integrated, consolidated view of protein families and domains and a whole analysis of the Swiss-Prot/TrEMBL sequence databases, it does not provide a unified syntax for non-instantiated sequence patterns (nor even a ‘local’ copy of them). The syntax and data of sequence patterns (or signatures) should be obtained from the underlying databases.

Figure 11: Illustration of some of the domain signatures (syntax types) corresponding to the InterPro ‘Heat shock protein DnaJ, N-terminal’ domain.

5.5.1.b. Task semantics

Although there is a great diversity, and a still growing number, of specialised applications in molecular biology, I will analyse just one widely used bioinformatics application in order to show the complexity of designing and handling a semantic layer around tasks. The analysis is done in the context of the two workflows used as motivating examples in [114].

The executable task to consider is the BLAST search against a protein sequence collection [22] (it can be invoked as a local application, an HTTP request or even a Web service). The abstract task can be described as “perform a protein-protein sequence similarity search”. Such an abstraction allows the definition of more than one semantic-to-syntax (or abstract-to-executable) mapping for the “protein-protein similarity search”. E.g. instead of using the BLAST algorithm, a FASTA search could be used as an alternative method.
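The one-to-many relation between an abstract task and its executable alternatives can be sketched as a registry. The registry layout and the `resolve` function are assumptions for illustration; only the task name and the two methods come from the text.

```python
# Abstract task -> executable alternatives, in order of preference.
ABSTRACT_TO_EXECUTABLE = {
    "protein-protein similarity search": ["BLAST", "FASTA"],
}


def resolve(abstract_task, preferred=None):
    """Pick an executable method for an abstract task.

    Returns the preferred method if it is registered for the task,
    otherwise the first registered alternative.
    """
    candidates = ABSTRACT_TO_EXECUTABLE.get(abstract_task, [])
    if not candidates:
        raise KeyError(f"no executable task registered for {abstract_task!r}")
    if preferred is None:
        return candidates[0]
    if preferred not in candidates:
        raise ValueError(f"{preferred!r} does not implement {abstract_task!r}")
    return preferred
```

A workflow written against the abstract task name can then switch from BLAST to FASTA by changing the `preferred` argument only.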

What is more, the semantics of an abstract task are in fact modulated by its parameter and data-in nodes as well as the conditions established upon data-out nodes. In the two example workflows, the “protein-protein similarity search” was in fact used for at least four different purposes:

Feature mapping: transfer of sequence annotations (in this case a sequence domain) to three-dimensional structures, through the evaluation of corresponding aligned segments. Search sequence collection: PDB (contains structures).

Group sequences by ‘protein’: sequences corresponding to the same protein are recognised. Search sequence collection: Swiss-Prot (representing the known non-redundant protein sequence space).

Find numerous distant homologues: distant homologues are defined as those with sequence identity in the range 30-70%; ‘numerous’ means at least 10. Search sequence collection: Swiss-Prot.

Does it correspond to a full wild-type protein?: whether the query is a full (non-fragment), wild-type (non-mutant) protein is evaluated through a search in Swiss-Prot (as containing full, wild-type sequences). Full is assessed by comparing query and hit sequence lengths; wild-type as having 100% sequence identity.

Therefore, a “canonical” abstract task such as ‘protein-protein similarity search’ can in fact represent a great number of higher-level tasks, depending on the context in which it is used (its context being the sum of its data-in and parameter nodes and the evaluation of its data-out nodes).
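Two of the four purposes listed above differ only in the condition applied to the search output, which can be sketched as follows. The `Hit` record is hypothetical; treating the 30-70% bounds as inclusive and ‘full’ as equal query and hit lengths are assumptions, since the text does not fix either detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    identifier: str
    identity: float  # percent sequence identity of the alignment
    length: int      # length of the hit sequence


def distant_homologues(hits, low=30.0, high=70.0):
    """Hits whose identity lies within [low, high] (bounds assumed inclusive)."""
    return [h for h in hits if low <= h.identity <= high]


def has_numerous_distant_homologues(hits, minimum=10):
    """'Numerous' is taken as at least `minimum` distant homologues."""
    return len(distant_homologues(hits)) >= minimum


def is_full_wild_type(query_length, hits):
    """True if some hit is 100% identical and has the same length as the query."""
    return any(h.identity == 100.0 and h.length == query_length for h in hits)
```

Both predicates consume the same list of hits from the same search against Swiss-Prot; only the evaluation of the data-out node changes the meaning of the task.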

The creation of hierarchical task structures, from executable workflows up to the definition of more abstract tasks, should therefore consider not only the canonical semantics of executable workflows, but also their intended use. This contextual use of a particular executable task results in different abstract functions due to:

Fixation of parameters (e.g. perform a search in a particular database). The fixed parameter is not lifted to the higher-level task.

Constraining data-in semantic types (e.g. query sequence of proteins with known three-dimensional structure). In this case, syntactic types remain unchanged, although the semantic type attached to the data-in node changes.

Evaluation of data-out nodes (e.g. filtering output information upon a condition, such as 100% sequence identity).

Selection of subsets of data-out information (e.g. keeping only hit unique identifiers).

For the evaluation and selection of subsets in data-out nodes some declarative query mechanism on syntactic data is needed (such as the one provided by PLAN). This will involve changes in data-out syntactic as well as semantic types.

Therefore any abstract-as-view definition (i.e. the definition of an abstract task in terms of underlying executable tasks) will generally imply a redefinition of the syntax and semantics of data-in and data-out, and a subset of underlying parameters.
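An abstract-as-view definition can be sketched as a wrapper that fixes parameters, evaluates the output and selects a subset of it. Everything below is hypothetical: `similarity_search` is a stand-in that returns canned hits with invented identifiers rather than running a real search, and the constraint on data-in semantic types is not modelled.

```python
from functools import partial


def similarity_search(query, database, max_hits):
    """Stand-in for an executable task; returns canned (identifier, identity) hits."""
    canned = {
        "PDB": [("1ABC", 100.0), ("2XYZ", 62.0)],
        "Swiss-Prot": [("P00001", 100.0), ("P00002", 41.0)],
    }
    return canned[database][:max_hits]


def make_view(task, fixed, evaluate, select):
    """Build a higher-level task from an executable one.

    fixed    -- parameters bound here and not exposed by the view
    evaluate -- condition applied to each data-out item
    select   -- projection keeping only part of each item
    """
    bound = partial(task, **fixed)

    def view(query):
        return [select(hit) for hit in bound(query) if evaluate(hit)]

    return view


# View: "find structures identical to the query", returning identifiers only.
find_identical_structures = make_view(
    similarity_search,
    fixed={"database": "PDB", "max_hits": 50},
    evaluate=lambda hit: hit[1] == 100.0,
    select=lambda hit: hit[0],
)
```

The view takes a single query argument and returns a list of identifiers, so both its parameters and its data-out type differ from those of the underlying task, as the paragraph above states.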

5.6. Discussion

Our approach to integrative data analysis in molecular biology can be classified as “process-oriented” (or business process integration, as described by Linthicum [39]), involving both information (data) and methods (applications or services). In process-oriented integration, relationships between data sources are built on demand. Thus, there is no need to design and provide a universal integrated view of the component sources.

In most cases, users know which information sources and applications are relevant to a particular analysis. Although they might not be aware of the exact schema of every relevant data source, they are experts in the content of the data source, as well as in the nature, semantics and quality of the data. They also know the connections they want to build between data, as well as the data transformation applications required to establish those connections.

it is possible to support multiple semantic mappings (one per user or use), that might not be anticipated by the system integrator,

users can exploit their knowledge on data sources to specify workflows achieving a good performance,

in some cases, transparency of data location (at the level of data collection) might not be desirable, due to trust and data quality issues,

procedural languages are easier to learn (and closer to the way analyses are made),

there is no need to maintain a pre-defined integrated schema over the available sources. Several integrations are possible, corresponding to user applications and demands.

The main strength of PLAN is that it significantly reduces the complexity of integrating information from multiple sources. To do so, PLAN combines a declarative query language with the additional power of a procedural instruction set, using a uniform and easy-to-manipulate XML format. Information can be kept in its original location and accessed only at run time. Only a resource catalogue defining the access mechanisms and properties of the data is required. PLAN can be easily extended: additional data sources are incorporated by registering them in the catalogue. Sources available in the catalogue can be used together seamlessly in a computational workflow. The use of a declarative query language allows filtering operations on data, as well as any other complex queries provided by XQuery. Custom user-defined functions can easily be added for use in the query language.

Workflows are not the only process-oriented paradigm for biological data integration. For example, an alternative approach is the one followed by HyBrow (Hypothesis Browser) [115], a tool for designing hypotheses and evaluating them for consistency with existing knowledge. In this case, the processes modelled are not analytical workflows, but correspond to the traditional scientific method of working around hypotheses.

All of that had been a form of syntax, a way of ordering reality perhaps no less arbitrary than the alphabetical one.

Juan José Millás, “El orden alfabético”

Discussion

Software projects are, in many cases, only a part of a wider project. I have worked in scientific research environments, where creativity and innovation are essential and permeate other aspects of the project. In such a context, software design and development are difficult to control and to restrict to strict methodologies. This is partially due to the evolving nature of research activities, which forces technological developments to change with the definition of the project itself. In my experience, the best development methodologies for research projects involving the creation of databases or software applications are those that follow an incremental approach and, very importantly, work with prototypes.

Development of systems to support the preservation and organization of scientific archives is not just a technological endeavour. It is a research mission involving inventiveness as well as group dynamics and culture. Close interactions with data producers in the designing phases are essential, but also difficult. Furthermore, considerable knowledge and understanding of the specific research field is needed to design and implement an appropriate software system.

In addition to the integration challenges found in many business environments, integration in molecular biology poses an additional level of complexity due to the nature of biological data. These particularities have to be taken into account in order to avoid developing systems that are technically elegant but lack the functionality needed in practical applications. Care should also be taken with solutions originally created to fulfil short-term goals for specific purposes, which may not be scalable or maintainable in the future because they were designed as crafted products.

The complexity of the subject at hand clearly demands interdisciplinary work. As noted in [15]: “Orchestrating fruitful interdisciplinary research across biology and data management is not easy. Lack of sufficient interaction between biologists and data management researchers can easily lead to attempts to reinvent well-known data management technologies by bioinformaticists, or sterile pursuits of irrelevant (or misunderstood) problems by data management researchers. For fastest progress in the biological sciences, we must encourage both the development of content for biological databases as well as data management technology for managing this content”.

Previous and ongoing initiatives in molecular biology to facilitate integrative data analysis can be grouped into those aiming to provide better means of data source interoperability, and those developing generic software systems. Among the latter, two trends are worth noting. First, there is an increasing awareness of sharing not only information but also applications, with recent developments around service-oriented approaches (motivated by Web service technologies). Second, the semantic paradigm is also gaining acceptance (once again in parallel with the creation of the “semantic Web”).

At this time it is worth questioning whether these two directions are appropriate. Service-oriented approaches are suitable for sharing methods and/or algorithms, but they fall short if they do not take into account the need to also share information. Consequently, they should be complemented with means of data integration when data standardization is not guaranteed, as in many applications in molecular biology. Thus, it seems that process-oriented solutions are more appropriate, as they consider both data and method integration. Among these, workflow and hypothesis-building paradigms seem to fit smoothly with applications in molecular biology.

The second question to answer is whether molecular biology (and related scientific domains) is ready for transparent semantic interoperability. My answer is “not yet”. Data semantics in molecular biology are either not well known and/or not properly specified. For semantic interoperability to be real and ubiquitous in biology, clear specification of semantics must happen. Although some formal ontologies are emerging, they are normally used for annotation purposes, not for describing data models (with exceptions as noted in chapter 2).

In addition, because of the complexity of molecular biology data, the granularity of the data stored in databases may not be the appropriate granularity for representing biological information in a given application. Thus, the process of defining and establishing semantic correspondences among data sources and application domains will require the intensive use of semantic transformations. Furthermore, most computational biology applications are designed for the discovery of new information from biological data. While applications in an “operational mode” can be formalized and created around predefined semantic models, applications working in a “discovery mode” are less suited to expressing and sharing semantic conceptualizations.

Conclusions

After some years of research in this field, I have reached the following conclusions. A general insight is that, in spite of the conceptual complexity of the biological data integration problem, the main bottlenecks are still found at very practical and technical levels. There are three reasons for this. First, the search mechanisms of many databases have poor selectivity. Second, a significant amount of data is represented using complex data types. These complex data usually lack the native operations and standard interfaces that are normally available for more common data types. Third, a wide range of data operations and transformations is needed because of the lack of standards.

As of today, there is no single best solution for providing systems that facilitate integrative biological data analysis. The work presented in this thesis has illustrated that:

As a first step it is essential to create publicly accessible biological databases. Our contribution to this aim has been the establishment of the Electron Microscopy Database (EMD) as the world-wide public archive to store structural data obtained by 3D-EM.

There is a need to develop new databases providing infrastructures to support access to heterogeneous data. This is the case of work on the BioImage database, designed to store and manage multidimensional images of biological specimens obtained from various classes of microscopy techniques.

It is necessary to create federated infrastructures to relate data collections. In the case of macromolecular structural data this has been achieved by establishing correspondences between the atomic models stored in the Protein Data Bank (PDB) and three-dimensional maps in the EMD.

Consolidated access to biological information can be accomplished through the creation of integrated data models underlying data warehouses. Along these lines, we extended the Macromolecular Structure Database (MSD) to contain electron microscopy data.

An important aspect of ensuring better interoperability is to supply appropriate means of data citation, helping to provide reliable mechanisms for data provenance tracking. These mechanisms are essential when creating derived data infrastructures, such as the FEMME database built on the analysis of EMD data.

Integrative data analysis will benefit from the development of generic systems for sharing processes suitable for molecular biology research, such as our proposal for the construction of computational workflows with PLAN.

Finally, standardization in well-delimited areas of research will enhance the interoperability of software platforms, as well as the exchange of data. Along these lines, we have launched an initiative towards the establishment of common conventions enabling data interchange within the 3D-EM field.

Conclusiones

After some years of research in this field, a number of conclusions can be drawn. Perhaps the most evident is that, despite the conceptual complexity of the biological data integration problem, the main bottlenecks are found at practical levels. First, many databases provide search mechanisms with limited selectivity. Second, a significant amount of data is represented with complex data types, which lack the native operations and standard interfaces normally available for more common data types. Finally, a wide range of operations and transformations on the data is needed, largely because of the scarcity of standards.

As of today, there is no single winning technological solution among those that enable the integrated analysis of biological data. The work carried out in this thesis illustrates the following conclusions:

As a starting point, it is essential to create publicly accessible biological databases. Our contribution towards this goal has been the establishment of the Electron Microscopy Database (EMD) as a world-wide public archive to store structural data obtained by three-dimensional electron microscopy.

It is necessary to develop new databases that provide infrastructures for access to heterogeneous data. This is the case of BioImage, designed for the storage and management of multidimensional images of biological specimens obtained with various microscopy techniques.

It is necessary to create federated infrastructures that make it possible to relate different data collections. In the case of structural data of biological macromolecules, this has been carried out by establishing the necessary correspondences between the atomic models

In document Universidad Autónoma de Madrid. Escuela Politécnica Superior. Departamento de Ingeniería Informática (Page 86-118)