In this section, we describe the key approaches to integrating bioinformatics tools from distributed locations. Multi-agent systems have been used as a technology to integrate heterogeneous data and tools, while grid systems have been used to integrate tools and data for execution as distributed workflows. In addition, cooperative systems have been proposed to allow bioinformatics researchers to share their tools and data. We consider each in turn below.
3.4.1 Multi-agent Systems Applications
Agent-based systems are one of the technologies that can be used to help in solving problems related to biological data generated by genome projects. Distributed, heterogeneous, and dynamic environments, as with the biological domain, are commonly the target domains of agent-based applications. Thus, some key problems of bioinformatics research, like integrating information that is distributed in remote, heterogeneous biological databases over the Internet, and keeping track of existing and updated bioinformatics software and data, make the agent approach very suitable if we view each distributed bioinformatics site, tool or data provider and user as agents. However, the idea of applying agents to tackling key issues of bioinformatics research is still very new and, as a consequence, there are many problems to be investigated.
Nevertheless, some work has already been done in the development of multi-agent system tools for use in the prediction of protein secondary structure (Armano et al, 2005), disease gene discovery (Williams et al, 2001), and automatic data integration (Karasavvas et al, 2002). In particular, the pioneering applications of agents in bioinformatics are described by Bryson et al (2000) and Decker et al (2002), with a focus on data integration and genome annotation. These are described in the next sections.
3.4.1.1 GeneWeaver
GeneWeaver (Bryson et al, 2000) is a multi-agent system designed to tackle problems relating to the integration of genome analysis and structure prediction tools. It is argued that the distributed, heterogeneous, dynamic character of biological information, together with the existence of several types of analysis and prediction programs to be applied to this information, points to the suitability of an agent approach. Here, the multi-agent system comprises a community of agents with distinct functionalities that work together to automate the annotation of genomic data. Agent functionalities are determined according to the tasks that need to be accomplished during the annotation process.
There are five types of agents in the GeneWeaver community: broker agents, primary database agents, non-redundant database agents, calculation agents, and genome agents. The broker agent is responsible for storing information about all the agents in the community (such as their location, supported communication methods, and abilities). Primary database agents are in charge of managing primary sequence databases like Swiss-Prot, PDB, and PIR. Similar to primary database agents, genome agents and non-redundant database agents are also responsible for managing genome information, the main difference being that genome agents are responsible for controlling information about the genome for a particular organism. Finally, calculation agents encapsulate existing software applications used to analyse biological data, so that each program becomes an independent agent in the GeneWeaver community.
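To make the broker's role concrete, the following is a minimal sketch of a registry that stores each agent's location, communication methods, and abilities, and answers look-ups by ability. It is an illustration only, not GeneWeaver's implementation: the class names, field names, and the example agent names are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRecord:
    """What a broker stores about one registered agent (hypothetical fields)."""
    name: str
    location: str             # e.g. a host:port string where the agent listens
    comm_methods: frozenset   # supported communication methods
    abilities: frozenset      # tasks the agent is able to perform


class Broker:
    """Toy registry: agents register themselves, others query by ability."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # A later registration under the same name replaces the earlier one.
        self._records[record.name] = record

    def deregister(self, name):
        # Removing an unknown agent is a no-op.
        self._records.pop(name, None)

    def find_by_ability(self, ability):
        """Return all registered agents offering `ability`, sorted by name."""
        hits = [r for r in self._records.values() if ability in r.abilities]
        return sorted(hits, key=lambda r: r.name)


# Hypothetical community with one database agent and one calculation agent.
broker = Broker()
broker.register(AgentRecord("primary-db", "host-a:9001",
                            frozenset({"tcp"}), frozenset({"fetch_sequence"})))
broker.register(AgentRecord("calc-similarity", "host-b:9002",
                            frozenset({"tcp"}), frozenset({"similarity_search"})))
```

Under these assumptions, a genome agent that needs a similarity search would ask the broker for agents with that ability and then contact the returned location directly, so no agent needs hard-coded knowledge of any other.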
Agents communicate with each other within the GeneWeaver community using a specific language based on KQML, the BioAgent Language (BAL). BAL messages contain language and ontology fields to help agents understand the content of the message. The meta-data, data, and query expressions in the content field are represented by the BioAgent Content Language (BACL). Also, two ontology sets are defined: the BioAgent Meta Ontology (BAMO), which defines different types of meta-data and their meanings, and the BioAgent Data Ontology (BADO), which defines the data types employed.
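The role of the language and ontology fields can be illustrated with a generic KQML-style message. The sketch below uses standard KQML parameter names (sender, receiver, language, ontology, content); it does not reproduce the actual BAL syntax, which is not given here, and the agent names and content string are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KqmlStyleMessage:
    performative: str   # KQML speech act, e.g. "ask-one" or "tell"
    sender: str
    receiver: str
    language: str       # names the content language, e.g. "BACL"
    ontology: str       # names the ontology, e.g. "BADO" or "BAMO"
    content: str        # interpreted only via the language and ontology fields

    def to_sexpr(self):
        """Render the message in KQML's keyword-parameter s-expression form."""
        return (
            f"({self.performative}"
            f" :sender {self.sender}"
            f" :receiver {self.receiver}"
            f" :language {self.language}"
            f" :ontology {self.ontology}"
            f' :content "{self.content}")'
        )


def can_interpret(msg, known_languages, known_ontologies):
    """An agent can process the content only if it knows both declared names."""
    return msg.language in known_languages and msg.ontology in known_ontologies


# Placeholder message from a genome agent to a calculation agent.
msg = KqmlStyleMessage("ask-one", "genome-agent", "calc-agent",
                       "BACL", "BADO", "placeholder query")
```

The point of the design is that the receiver can decide, before parsing the content, whether it understands the declared content language and ontology, and decline the message otherwise.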
GeneWeaver does not introduce new methods or techniques for performing any task related to genomic data annotation, but organises and manages existing ones so that they can operate in a more flexible and more effective way.
3.4.1.2 BioMAS
Decker et al (2002) present a multi-agent system for automated genomic annotation. Their BioMAS system is an extension of previous work (Decker et al, 2001) on automated annotation and database storage of sequencing data for the herpesviruses, which was expanded to a more generic system that can be used for studying more organisms. The new system also includes extensions for functional annotation, Expressed Sequence Tags (EST)8 processing and metabolic pathway reasoning.
8 An Expressed Sequence Tag is a small sequence from an expressed gene, and acts as a physical marker
The system is composed of four overlapping multi-agent subsystems: basic sequence annotation, query processing, functional annotation, and EST processing. The functions of the basic sequence annotation and query processing subsystems are, respectively, to integrate remote gene sequence annotations from various sources, and to allow complex queries on local databases via a web interface. The functional annotation subsystem is in charge of assisting the user to make functional annotations of each gene in a sequenced
genome, by using Gene Ontology (GO)9 (The Gene Ontology Consortium, 2000) for
annotating gene function. The EST processing subsystem was designed to support the use of expressed sequence tags as input data in the annotation process, in addition to complete sequences of nucleotides or proteins.
There are three types of agents in the system: information extraction agents, task agents, and interface agents. The first group of agents is responsible for wrapping public databases like Genbank, Swiss-Prot, PSort and ProDomain. Agents in the second group are divided into: domain-specific agents, which include annotation agents, responsible for guiding the annotation process, and sequence source processing agents, responsible for checking the consistency of sequence format; and domain-independent task agents, which include proxy and matchmaker agents, responsible for facilitating the communication within the system. Interface agents are responsible for helping the user to add new sequences to the local knowledge base, and to query complete annotated knowledge bases.
3.4.2 Grid Applications: myGrid
Moreau et al (2002) describe some possible uses of agent technologies in an e-Science
Grid project with a focus on bioinformatics, myGrid (Goble et al, 2003). This project
aims to provide a distributed environment that supports the construction of in silico experiments, which are represented by workflows, and can be stored, shared and managed according to user preferences. Other complementary features include the notification to the user of relevant information related to their experiments, and the provision of assistance for less skilled users to manage their experiments.
myGrid has a service-oriented architecture, and provides support for users to create, discover and execute workflows. Services and workflows have semantic descriptions, indicating their functionality, the types of input they require, and the types of output
they produce (Lord et al, 2003). User discovery of workflows and services is achieved
via semantic services (McIlraith et al, 2001), which use matching algorithms to search
through semantic descriptions for services or workflows compatible with the user query (e.g., the user's preferences and goals). As a result of the discovery process, the user is presented with a list of available services from which they can choose.
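The discovery step described above can be sketched as matching a query against service descriptions. The example below is a deliberately simplified stand-in: it compares plain type names as sets, whereas myGrid's semantic services reason over richer semantic descriptions. The service names, type names, and the `discover` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    functionality: str
    inputs: frozenset    # types of input the service requires
    outputs: frozenset   # types of output the service produces


def discover(services, have, want, functionality=None):
    """Return services runnable with the inputs in `have` that produce all of `want`.

    If `functionality` is given, only services declaring it are considered.
    """
    have, want = frozenset(have), frozenset(want)
    matches = []
    for s in services:
        if functionality is not None and s.functionality != functionality:
            continue
        # The service needs nothing beyond what the user has,
        # and produces everything the user asked for.
        if s.inputs <= have and want <= s.outputs:
            matches.append(s)
    return sorted(matches, key=lambda s: s.name)


# Hypothetical registry of two described services.
services = [
    ServiceDescription("similarity-search", "search",
                       frozenset({"protein_sequence"}),
                       frozenset({"alignment", "score"})),
    ServiceDescription("translate", "conversion",
                       frozenset({"dna_sequence"}),
                       frozenset({"protein_sequence"})),
]
candidates = discover(services, have={"protein_sequence"}, want={"alignment"})
```

The returned list plays the role of the list of available services presented to the user, who makes the final choice.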
The use of agents in this bioinformatics grid aims at addressing a common problem in bioinformatics research, the constant change in resources available to the bioscientist (i.e., their continuous appearance, disappearance, or change without prior notification). Agents are seen as an appropriate technology to tackle this problem since they provide an abstraction for the design of scalable systems, as well as the means to implement aspects like personalisation, communication, and negotiation within the grid environment.
Two types of agents have been defined to act in the grid: a user agent and a broker agent. The user agent is responsible for representing the user within the myGrid system, which includes providing the user’s personal preferences for other parts of the system, and mediating the communication between grid services and the user. Negotiation within myGrid takes place on the basis of preferred quality of service for service providers and service users, in the context of notification support. The agent responsible for managing these negotiations is the quality of service broker, which negotiates on behalf of each service user that wants to receive notifications of a specified quality, and then returns a final proposal.
Following the argument of Foster et al (2004) on the mutual benefits of combining grid and agent-based systems, distributed bioinformatics applications may also be improved by joining grid and agent-based technologies. This would enable in silico experiments to be conducted and controlled in a more flexible way in both individual and collaborative work.
3.4.3 Cooperative Applications
Bioinformatics researchers are discovering the advantages of cooperative research, in which different types of information and tools are exchanged in order to improve individual or global results. Here, unique data sets are created in individual laboratories, and not published on public database sites. However, they could be shared with a worldwide community if provided with the right tools to support cooperation. The systems described in the previous sections are mostly concerned with integrating heterogeneous data and tools, or combining remote data and tools for execution in distributed workflows, but they do not address cooperation explicitly (since the tools and data that are integrated or combined typically belong to the same individual or group).
In an effort to provide such a cooperation support tool, Overbeek et al (2004) present
a peer-to-peer environment for genome annotation, SEED, which allows researchers to combine publicly available genomic data with individual, non-public data exchanged with other researchers to form an integrated and distributed curated database of genomic data. Each SEED instance has a copy of this integrated database and is a self-contained genome annotation system that allows multiple users to access, update, and extend the annotation database. To support cooperative work, the SEED system uses a peer-to-peer synchronisation facility that permits information sharing between SEED instances. Cooperation members are known (i.e., access is not anonymous), and have the option of choosing whether to participate in an annotation team. Although the system provides support for data exchange, it does not address the problem of selecting between different SEED users; instead, it assumes that the user selects candidate SEED instances from a registry and sends data requests to them. Also, it is not clear how annotation groups are formed, whether by finding users with related interests, or by other criteria.
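As an illustration of what synchronisation between two instances might involve, the sketch below merges two annotation stores by keeping, for each gene, the most recent annotation. This last-writer-wins rule is our assumption for the purpose of the example; the actual SEED synchronisation mechanism is not described here, and the `Annotation` fields and `synchronise` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_id: str
    function: str
    curator: str     # cooperation members are known, so edits are attributed
    timestamp: int   # logical clock; a larger value means a more recent edit


def synchronise(local, remote):
    """Merge two {gene_id: Annotation} stores without mutating either.

    For each gene the newer annotation wins (assumed last-writer-wins rule);
    on a timestamp tie the local annotation is kept.
    """
    merged = dict(local)
    for gene_id, incoming in remote.items():
        current = merged.get(gene_id)
        if current is None or incoming.timestamp > current.timestamp:
            merged[gene_id] = incoming
    return merged


# Two hypothetical instances with one overlapping gene.
instance_a = {
    "gene1": Annotation("gene1", "hypothetical protein", "alice", 1),
    "gene2": Annotation("gene2", "kinase", "alice", 5),
}
instance_b = {
    "gene1": Annotation("gene1", "transporter", "bob", 3),
    "gene3": Annotation("gene3", "polymerase", "bob", 2),
}
merged = synchronise(instance_a, instance_b)
```

Even in this simplified form, the merge says nothing about which peers to synchronise with, which is the selection problem noted above.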
A different approach to supporting cooperative research that uses a web services solution
is presented by Gao et al (2005), who develop a microarray10 data-mining system that
uses web services in drug discovery. The system is implemented by wrapping data
processing modules and databases into web services, integrating them, and providing a portal through which the user can select and aggregate services. A limitation of this approach is the lack of support for the automatic use of services, which is assumed to be carried out by the user, and for the analysis of the quality of the provided services.
Given this overview of bioinformatics tools and key approaches to integrating and sharing bioinformatics tools and data, in the next sections we describe the application scenario, which we take as a case study throughout the thesis.