Representation of function: The next step


Article in Gene · June 1997

Gene 191 (1997) GC1–GC9 · Published by Elsevier Science B.V. · DOI: 10.1016/S0378-1119(96)00854-2


R. Overbeek a, N. Larsen b, W. Smith a, N. Maltsev a, E. Selkov c

a Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Ave., Argonne, IL 60439-4815, USA

b Center for Microbial Ecology, Gilter Hall, Room 294, Michigan State University, East Lansing, MI 48824, USA

c Institute of Theoretical Biophysics, Russian Academy of Sciences, Puschino, Moscow Region, Russia

Received 14 August 1996; accepted 5 November 1996

1. Introduction

The research community now has access to a number of complete genomes [1–3], and several more have already been completed or are nearing completion. The incredible successes of the past eighteen months have set the stage for one of the most exciting scientific quests of the twentieth century: to characterize the functions of all the genes in these phylogenetically diverse microbial genomes, and then to characterize the dynamic behavior of these complex systems in terms of these basic components. This task certainly will not be completed by the turn of the century, but it will be well advanced. The analysis growing out of these microbial genomes ultimately will be seen as the foundation used to understand the more complex issues presented by more complex organisms. Our grasp of many of the fundamental mechanisms of life will emerge from the wealth of data produced by sequencing a phylogenetically diverse set of microbes.

The community is now trying to organize the newly available data to support the characterization of function. Clearly, the detailed experiments required to identify the specific functions of the biological components will be extensive. What general strategies should be employed, and what data processing and computational support could be used to reduce the effort?

A basic step in documenting the function of systems is the hierarchical decomposition of components. The overall system is decomposed into a set of interconnected components, these components are further decomposed into sets of interconnected subcomponents, and so forth. The actual breakdown of a functional unit into interconnecting subcomponents cannot always be done either uniquely or precisely. The reality of the situation often involves thousands of entities with complex interconnections and dependencies. The hierarchical decomposition, however, is of critical importance in comprehending complex systems.

We believe that a hierarchical decomposition into interconnected subcomponents is a good abstraction for working with genomic sequence data for organisms. At the lowest level, enzymes and other protein complexes are formed by aggregating several polypeptides. At a somewhat higher level, enzymes group conceptually into a pathway that transforms a set of inputs to a set of outputs. Pathways may themselves be aggregated into larger subsystems. We acknowledge, however, that a hierarchical decomposition, while conveying real insights, also leaves out essential aspects. It is a means for classifying the parts. It does not convey the dynamic interplay of the parts as the whole system moves from one state into other states with completely different operating characteristics.

The hierarchical decomposition we are discussing attempts to classify components based on functional groupings. Specifically, parts with quite different characteristics are grouped together if they participate in a single complex that plays a coherent functional role. Other meaningful hierarchical clusterings also exist. The most significant is the hierarchical clustering based on evolutionary history of the components. Thus, understanding the relationships of existing genes that encode polypeptides in the context of "protein families" (that is, in terms of their evolution from common ancestral sequences) must certainly be one of our strongest tools for obtaining an accurate picture of biological systems. The hierarchy based on homology complements the hierarchy based on functional systems: each reveals key features of the system, and each plays an important role.

One possible, oversimplified view of "functional characterization of genes" is to fill out the two hierarchical representations: one hierarchy decomposing an organism into increasingly lower-level subcomponents, and a second hierarchy capturing the evolution of thousands of distinct components and functional roles from a relatively small ancestral set. Certainly, access to complete and accurate versions of both hierarchies would offer a wonderful starting point for comprehension of biological systems. Indeed, we argue for precisely this notion as a reasonable short-range objective. However, it would be folly to suppose that constructing representations of these two hierarchies would capture a functional characterization of a genome; anything worthy of being called a functional characterization must ultimately model a dynamic system. This article focuses on the more modest goal of representing and deriving the two hierarchies as they apply to a growing set of sequenced genomes.

2. The decomposition based on functional subsystems

Several steps are required to understand our current view of how to develop and maintain a hierarchical representation of functional roles:

1. First, one should understand what a functional overview would look like for a single organism (for reference, see http://www.cme.msu.edu/WIT/; footnote 1). Such a functional breakdown would decompose functions into basic subcomponents, corresponding to genes or gene products.

2. Then, we advocate representing the lower-level subcomponents (for example, at the level corresponding to a metabolic pathway) using diagrams that sketch the interactions of the included yet lower-level components (see Fig. 1). This step is analogous to what is seen in an auto parts store, where lowest-level modules are depicted with diagrams to show the interrelationships of the detailed parts. Such diagrams sketch the essence of how the parts combine to achieve the function corresponding to the subtree.

3. One must then consider what would be produced by creating a detailed representation of functionality for a number of organisms. The result would have many sections of significant overlap; that is, many aspects of the distinct functional representations would be very similar. The question naturally arises:

Could we form a single encompassing overview from which we could project out the precise individual overviews?

The advantage of such a unified framework becomes apparent when one thinks of comparing the functionality present in sets of organisms. On the other hand, the generalized overview obviously would contain numerous sections that would never appear together in any single organism.

4. Finally, one must consider the scope of the effort required to develop and maintain such an overview in the presence of rapid advances in genetic, biochemical, and molecular characterization of the organisms.

1 http://www.cme.msu.edu/WIT/

Fig. 1. Arginine biosynthesis (metabolic pathway graphics created by Evgeni Selkov as part of the Enzyme and Metabolic Pathways Database project).

In the remainder of this section we clarify each of these steps.

2.1. The hierarchical representation of function for specific organisms

Several efforts have been made to develop taxonomies of function for specific organisms. The best known is, perhaps, the one of Escherichia coli developed by Monica Riley [4] and used as the basis for the classifications in Haemophilus influenzae and Mycoplasma genitalium. This overview represented a major effort to build a thorough hierarchical representation that went down to individual genes. It is worth noting that Riley and Karp also played an early role in exploring the issue of how to represent metabolic activity. Among the other efforts to develop taxonomies, we mention the functional overview produced by Michael Ashburner for Drosophila.

Such a hierarchical representation includes nodes that specify functions. Then, specific genes (and, by implication, the corresponding polypeptides in the case of coding regions) are attached to nodes. In many cases, the genes will be attached to internal nodes, not just leaves. Thus, a gene with an imprecisely known function (like "ABC transport protein") will be attached to an internal node; as progress is made in completing the characterization of such genes, these genes will be moved toward the leaves.

A number of issues need to be dealt with explicitly. Let us begin with the issues relating to lower levels of the hierarchical decomposition, specifically with those relating to enzymes (many of the same issues arise in classifying noncatalytic proteins; however, the central difficulties are revealed vividly in the context of enzymes). First, let us simply list some relevant issues relating to categorizing enzymes:

1. The functional description of enzymes can be based largely on the substantial efforts that have gone into development of the enzyme nomenclature, where EC numbers explicitly characterize functional categories. However, EC numbers often identify a class of related functions rather than a specific function (e.g., alcohol dehydrogenase, EC 1.1.1.1, characterizes a class of enzymes that catalyze reactions of the form Alcohol + NAD+ = Aldehyde or Ketone + NADH rather than a specific reaction).

2. A single aggregate (e.g., the pyruvate dehydrogenase complex) may include a number of distinct enzymatic subaggregates (each composed of one or more polypeptides).

3. A single polypeptide may carry out multiple enzymatic functions, each described by a distinct EC number (e.g., the polypeptide P17547 performs three distinct enzymatic functions characterized by EC 1.1.1.1, EC 1.2.1.10, and pyruvate-formate-lyase deactivase activity).

4. Enzymes can be classified into "forms", where each form represents a distinct abstract "version" of the enzyme as it occurs in one or more organisms. Distinct forms often contain differing numbers of subunits or different prosthetic groups, occur in different cellular locations, and so forth (see Fig. 2).

How, then, should one develop a hierarchical decomposition, when so many complexities manifest themselves at even the lowest level?

Arguably, any functional decomposition is the imposition of an abstraction (by, one hopes, an expert capable of retaining the significant aspects of reality while discarding irrelevant details). However, we believe that our approach is useful. In the case of enzymes, we proceed roughly as follows:

1. We create an entry in the outline for each EC number.

2. That entry is subdivided into alternative forms, when they are known.

3. Beneath each form, we have distinct entries for each subunit, unless there is only one.

4. A single enzyme may occur at multiple locations in the overall decomposition.

Fig. 2. Prototype functional breakdown.
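The four outline rules above can be illustrated with a minimal Python sketch. It is not the WIT data model; the `Node` class, the `enzyme_entry` and `render` helpers, and the form and subunit names are all hypothetical, chosen only to show how an EC entry, its forms, and their subunits could nest, and how one entry can be attached at more than one place.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in the functional outline (hypothetical structure)."""
    label: str
    children: list[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        self.children.append(child)
        return child


def enzyme_entry(ec_number: str, forms: dict[str, list[str]]) -> Node:
    """Build the outline entry for one EC number (rule 1).

    `forms` maps a form name to its subunits (rule 2). A form with a
    single subunit gets no separate subunit entries (rule 3).
    """
    entry = Node(f"EC {ec_number}")
    for form_name, subunits in forms.items():
        form = entry.add(Node(form_name))
        if len(subunits) > 1:
            for subunit in subunits:
                form.add(Node(subunit))
    return entry


def render(node: Node, depth: int = 0) -> list[str]:
    """Return the outline as indented lines, one per node."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Illustrative data only: the forms and subunits are invented.
adh = enzyme_entry("1.1.1.1", {
    "form A": ["subunit alpha", "subunit beta"],
    "form B": ["single polypeptide"],
})

# Rule 4: the same entry may be attached at several points in the outline.
energy = Node("Energy metabolism", [adh])
fermentation = Node("Fermentation", [adh])

print("\n".join(render(energy)))
```

Sharing one `Node` object between two parents is one simple way to keep the repeated occurrences of an enzyme identical; a real system might instead store node identifiers.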

A prototype of this breakdown is shown on the Web at the URL given in Section 2. One central point to be noted about an overall functional overview is that the lower levels of the hierarchy are the least controversial. Thus, the grouping of the subunits of an enzyme is natural. The grouping of the enzymes within a commonly accepted pathway is similarly clear. However, as one moves up the hierarchy, the ambiguity in characterizing large functional components also becomes obvious. There are alternative, meaningful ways to group subsystems, depending on one’s background and perspective.

2.2. The use of function diagrams

At the lowest levels, a function diagram can have great utility in depicting the interactions of the genes within the subtree rooted at a node. Metabolic pathways are a common form of such a function diagram. The function diagrams included in the release of the metabolic pathways of EMP [4] include a variety of forms of data intended to aid the user in visualizing the interactions of the functional roles included in each diagram.

This use of low-level diagrams rather than a simple expansion of the lowest levels of the hierarchy has analogs within the representation of the hierarchy of function in the case of automobiles. In an auto manual or the catalogs for auto parts, it is common practice to offer detailed drawings of the spatial connections of parts in lowest-level diagrams. Such visual abstractions, containing notes relating to what is known of the interactions of the components, are precisely what is needed by those attempting to work with autos. Similarly, the use of function diagrams in the context of biological systems converts a list of genes into a far more valuable presentation of the functional role of the genes; examination of a metabolic pathway, complete with enzyme numbers and notes on location, conveys a substantial amount of information that is missing from just the list of genes that code for subunits of the enzymes.

Evgeni Selkov maintains a growing collection of such drawings, including over 2000 depicting metabolic pathways from over 1700 organisms. Our intent is to systematically extend the collection to cover the lower-level functions (most commonly, pathways) for the sequenced genomes.

2.3. The notion of a single generalized hierarchical representation of function

Once one has seen the utility of a functional overview, complete with detailed function diagrams corresponding to the lower-level nodes in the hierarchy, it is natural to try to build such representations for numerous organisms. During the initial stages in which these disparate attempts are made somewhat independently, researchers will borrow sections in common, add sections reflecting functionality that was not included for previously analyzed organisms, and gradually produce a set of functional overviews that share common themes but include numerous unique subsections (often reflecting the particular expertise of the group compiling the overview, along with the aspects of each organism that make it unusually interesting).

The utility of a single coherent overview from which the unique parts can be projected (as strict subsets) follows from two central points:

1. Maintenance of a set of specific overviews in the presence of rapid advances in biochemical, genetic, and molecular characterization of the organisms becomes increasingly difficult as the set of organisms expands. It now seems clear that functional overviews of hundreds of organisms will need to be maintained in order to support efforts to utilize the growing wealth of sequence data.

2. Comparative analysis of organisms will require the ability to effectively portray functional correspondences. This ability is achieved more easily by using a single coherent overview rather than by moving between numerous similar but substantially different functional overviews.

However, generating a single, coherent overview that would allow accurate projections for life forms as divergent as bacterial parasites, plants, archaea, and mammals is an extremely demanding task. Ultimately, it will be achieved only through collaborations, numerous discussions, and occasional compromises.

2.4. Maintaining the generalized hierarchical representation

The generation and maintenance of a generalized overview require an incremental process that might be reasonably depicted as follows:

1. An initial, more-or-less-complete overview must be generated. This first attempt will undoubtedly be skewed to include the functionality identified in the first sequenced organisms.

2. This initial attempt must be subjected to criticism, analysis, and expansion by members from the research communities focused on the growing set of sequenced organisms.

3. Functional diagrams must be generated to capture the growing body of research (and the huge existing body of biochemical data). Efforts should focus initially on the smaller, completely sequenced organisms. Already we have complete genomes for bacterial, archaeal, and eukaryotic organisms. By the end of 1996 we will have a collection of complete genomes spanning a huge range of metabolic and functional diversity. We believe that the initial set of approximately two thousand drawings depicting metabolic functions that Selkov has made available [6] can be used very effectively, although it is clear that the collection will need to be continually expanded and corrected.

4. The function diagrams must be carefully and accurately connected to the generalized overview. This is a far from trivial task. The collection of diagrams in some cases includes numerous variants of a single function, reflecting differences among organisms. Each function diagram must be connected to a set of organisms.

5. The set of functional roles depicted in the diagrams constitutes the "function dictionary". All known sequences that implement these functional roles for specific organisms must be connected to their function or functions. Many sequences will not be characterized with the precision required to connect them to a function; these must be noted; and if more detail is supplied at some later date, they must then be connected to the appropriate functional roles.

The simple articulation of these tasks may make them appear deceptively straightforward. Each step, however, involves substantial effort. What exists at the end of these steps is an initial functional hierarchy that will need constant maintenance. Our current status may be summarized as follows:

There is a single functional overview in the form of an outline (i.e., a tree; see http://www.mcs.anl.gov/home/compbio/overview.html; footnote 2). The overview contains 1781 nodes, including 1367 "leaves", which are lowest-level nodes that must eventually be connected to function diagrams. At this time, 926 have been connected to one or more function diagrams.

We have approximately 2000 function diagrams. Of these, 1728 have been connected to leaves in the functional overview (that is, to the 926 leaves of the overview mentioned above). These diagrams have been connected to over 400 distinct organisms. In some cases, we have assigned pathways to phylogenetic classes (that is, to higher-level nodes in the phylogenetic tree). Note that a single pathway often connects to a number of organisms. Here is a summary of the most heavily classified phylogenetic groupings:

No. of Pathways   Phylogenetic Class
702               Escherichia coli
666               Salmonella typhimurium
565               Haemophilus influenzae
521               Homo sapiens
512               Rodentia
462               Mycoplasma capricolum
432               Aves
414               Bacillus subtilis
375               Ascomycotina
373               Sulfolobus solfataricus
341               Teleostei
309               Mycoplasma genitalium
249               Embryophyta
161               Fungi imperfecti
135               Pseudomonas
122               Rattus norvegicus
104               Archaea
98                Saccharomyces cerevisiae

While we consider this an excellent start, it is just the beginning. From our work on the initially sequenced genomes, we know that many more function diagrams will be needed to adequately capture known biological functions and that these will need to be carefully attached to specific organisms. Below, we will consider tools to support this effort.

The function diagrams include 1472 distinct "functional roles". However, we have currently identified 11,216 distinct functions corresponding to proteins within the Swiss-Prot Sequence Data Bank. This number does not include "hypothetical proteins" but does include a number of categories that represent imprecise functions (i.e., functions that are too imprecisely specified to allow meaningful inclusion in a function diagram). We have connected 16,308 of the entries in Swiss-Prot to the 1472 functional roles in the diagrams.
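The status figures quoted above can be turned into simple coverage ratios. The short Python sketch below does only that arithmetic; the variable names are ours, the diagram total is the text's "approximately 2000", and the source does not state whether the 1472 diagram roles are a subset of the 11,216 Swiss-Prot functions, so the last ratio is only a rough comparison of scale.

```python
# Counts taken from the status summary in the text.
nodes = 1781
leaves = 1367
leaves_with_diagrams = 926
diagrams = 2000             # "approximately 2000" in the text
diagrams_connected = 1728
roles_in_diagrams = 1472
swissprot_functions = 11216
swissprot_entries_connected = 16308

leaf_coverage = leaves_with_diagrams / leaves
diagram_coverage = diagrams_connected / diagrams
entries_per_role = swissprot_entries_connected / roles_in_diagrams
# Rough comparison only: the two sets are not stated to be nested.
role_ratio = roles_in_diagrams / swissprot_functions

print(f"leaves connected to a diagram:   {leaf_coverage:.1%}")
print(f"diagrams connected to a leaf:    {diagram_coverage:.1%}")
print(f"Swiss-Prot entries per role:     {entries_per_role:.1f}")
print(f"diagram roles / known functions: {role_ratio:.1%}")
```

Read this way, roughly two thirds of the leaves already have at least one diagram, while the diagram roles are about an eighth as numerous as the functions identified in Swiss-Prot.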

2 http://www.mcs.anl.gov/home/compbio/overview.html

2.5. Allowing a restricted class of tailored overviews

In the preceding sections, we have defended the development of a single generalized hierarchical decomposition of function to be used for all genomes. Yet, it is important that one understand the reasons why curators desire tailored overviews:

Many sections of the generalized form will be specific to a limited portion of the phylogenetic tree. As we gain a larger sampling of genomes, we might well find that very significant portions of the proteins in many organisms fall in these categories. Just as eukaryotic regulatory mechanisms might seem superfluous to an analysis of bacteria, so too will methanogenesis seem an unnecessary complication to those working on Drosophila.

Perhaps more important, the higher-level contents and organization of the tree often reflect a specific goal or perspective. There are many distinct and reasonable ways to organize the higher levels; in fact, one often wishes to have multiple outlines to reflect distinct purposes (for example, it might be reasonable to use a characterization by "tissue" or "organelle" at the highest level in some cases, while normally distinctions of this sort are made at lower levels).

Some functional units perform several distinct roles. For example, the TCA cycle normally plays a central role in energy metabolism while playing a secondary role in amino acid biosynthesis. By itself, this situation is not a problem, since the topic can easily be inserted at both places in the functional overview. However, in some organisms, the central role of the TCA cycle is amino acid biosynthesis, and it plays no serious role in energy metabolism. In such cases, one wishes the topic to appear only in the appropriate slot.

While we consider it important that there be a single generalized functional overview that is consistently maintained, one can support capabilities that alleviate the problems mentioned above. We can easily allow a subtree to exist at multiple points within the overview, as long as the distinct instances are identical; that is, there is no reason to force a topic to occur at only one point in the overview.

We can also allow a far more dramatic extension: the ability to develop alternative trees that differ only at the upper levels. It is important that the reader understand that the cost of maintaining a functional overview occurs largely at the lower levels (where each node has an associated list of pathways and Swiss Protein Data Bank sequences). We can develop alternative trees that are easily derived from the single maintained tree. To be more precise, consider an alternative tree such that each of the leaves corresponds to nodes contained in the generalized overview. Such a tree may be thought of as a tailored overview in which the leaves indicate subtrees from the generalized overview. This alternative tree can be used for presentation purposes without compromising efforts to maintain a consistent generalized overview. Indeed, one could construct a number of such trees to be used in different contexts.
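One way to picture a tailored overview is as a small tree whose leaves are references into the maintained generalized tree. The Python sketch below expands such a tree for display; the dictionary layout, the node identifiers, and the `expand_tailored` and `expand_general` functions are hypothetical and serve only to show that the upper levels can differ while the lower levels stay shared.

```python
# Generalized overview: node id -> (label, child ids). Contents are invented.
GENERALIZED = {
    "root": ("Overview", ["energy", "aa"]),
    "energy": ("Energy metabolism", ["tca"]),
    "aa": ("Amino acid biosynthesis", ["arg"]),
    "tca": ("TCA cycle", []),
    "arg": ("Arginine biosynthesis", []),
}

# Tailored overview: nested (label, children) tuples. A child that is a
# plain string names a node of the generalized overview.
TAILORED = ("Tailored overview", [
    ("Biosynthesis", ["arg", "tca"]),
])


def expand_general(node_id, depth):
    """Render a subtree of the generalized overview as indented lines."""
    label, children = GENERALIZED[node_id]
    lines = ["  " * depth + label]
    for child in children:
        lines += expand_general(child, depth + 1)
    return lines


def expand_tailored(node, depth=0):
    """Render a tailored tree, splicing in the maintained subtrees."""
    if isinstance(node, str):  # leaf: reference into the generalized tree
        return expand_general(node, depth)
    label, children = node
    lines = ["  " * depth + label]
    for child in children:
        lines += expand_tailored(child, depth + 1)
    return lines


print("\n".join(expand_tailored(TAILORED)))
```

In this example the TCA cycle appears only under biosynthesis, as in the organisms mentioned above where it plays no serious role in energy metabolism, yet the subtree itself is still read from the single maintained tree.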

3. Clusterings based on historical origins of function

In the preceding section we discussed the construction of a hierarchy that captures the functional components and subcomponents of organisms. In the introduction we pointed out that other hierarchical clusterings of sequence data would also play significant roles in our attempts to comprehend biological systems. It is clear that multiple sequence alignments form one of the most useful tools needed to explore the evolution of function. From such alignments, we construct trees in an attempt to gain insight into the historical origins of the sequences. When alignments have been constructed from sequences demonstrating substantial similarity, the sequences usually do come from common ancestral sequences, and the resulting trees do offer useful estimates of historical events.

It is true that as we attempt to understand sequences that have diverged more radically, the relationships cease to be strictly hierarchical. One finds domains that can reasonably be considered homologous, while the overall sequences cannot be globally aligned. That is, the clarification of the origin of existing sequences from ancestral sequence data will probably introduce "operations" in which a sequence is composed from multiple parent sequences. In this sense, it is probably wrong to speak of a hierarchical representation of the origins of current sequence data. A more complex representation ultimately will be required.

However, we are still at the very initial stages of our exploration of the evolutionary history of existing sequence data. At this point, the modest goal of developing a collection of multiple sequence alignments in which sequences are globally aligned is reasonable. These can be used to study the variances in implementation of a specific function, and in some cases the correspondences between a set of closely related functions. The benefits that can be derived from such alignments are well known and have already profoundly affected our understanding of biological systems. During the past few years, a number of collections of protein alignments have emerged. These may be broken roughly into two categories:

1. Global alignments of closely related sequences

2. Local alignments and motifs used to develop more sensitive tools for gaining insight into possible functions for uncharacterized sequence data

The goals for developing these collections are obviously closely related. Each class of alignments can be used to produce clues guiding an experienced investigator in a search for understanding the role of a specific sequence and how that role is achieved.

The system that we describe here, called WIT (What Is There?), includes links to alignments generated by others, as well as some generated by ourselves. We emphasize that we are building directly upon the collection developed and distributed by the team at Baylor College of Medicine (BCM) [7]. When we display what is known about a given functional role (such as an enzyme), we offer access to a collection of alignments. The vast majority of those alignments are acquired directly from BCM.

We believe that the groups developing alignments will inevitably face the difficult issues relating to tracing the distant relationships. Our efforts to support comparative analysis in WIT focus on establishing links between sequences accessible via our functional overview and the alignments, motifs, and other data on protein families that will become available.

4. The evolution of function

The metabolic reconstructions of currently existing organisms ultimately must be viewed in the context of a phylogenetic tree in which models of the functionality present at each node exist. That is, to comprehend the functionality present in any single organism or set of organisms, one will need access to reasonable estimates of the functionality present in each of the nodes of the phylogenetic tree from the universal ancestor down to each of the currently existing organisms being studied.

Characterization and representation of the functionality of ancestral nodes are now becoming possible. We now have access to at least one complete genome from each of the three major domains of life, and many more will become available shortly. The basic approach can be summarized as follows:

1. Begin with a phylogenetic tree that represents our best estimate of the evolutionary origins of existing organisms. We believe this to be the tree produced by the Ribosomal Database Project [8].

2. Form alignments in which all sequences within a single alignment are believed to play the same functional role (i.e., do not include in the same alignment homologous sequences that have diverged to the point where they no longer have the same functional role). Exclude sequences corresponding to known instances of horizontal transfer (e.g., mitochondrial and chloroplast sequences in eukaryotes).

3. Each alignment is a set of sequences from existing organisms, and the set of sequences all have the same assigned functional role. Attach this functional role to each of the leaves in the phylogenetic tree corresponding to an organism in the alignment. Also attach the functional role to the most recent common ancestor (MRCA) of the set of organisms, as well as to each node between the MRCA and a leaf that has been assigned the functional role.

4. Once the phylogenetic tree has been labeled with functional roles in Step 3, the functionality present at many nodes will start to clarify. However, for a number of reasons, functional roles that clearly were present at some ancestral nodes will not yet be assigned. In the most trivial case, this may be because sequences corresponding to the role have not yet been identified for specific organisms. Hence, it will be necessary to "fill in the gaps" using the same intellectual steps required to develop models for specific organisms (see above).

The development of an emerging model of the evolution of function is certainly one of the central tasks we will attempt to support during the coming years.
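Step 3 of the approach above, attaching a functional role to the leaves, their MRCA, and every node in between, can be sketched in a few lines of Python. The child-to-parent dictionary, the heavily simplified topology, and the function names `path_to_root`, `mrca`, and `label_role` are assumptions for illustration; the tree is not the Ribosomal Database Project tree.

```python
# Simplified phylogenetic tree as child -> parent (illustrative only).
PARENT = {
    "Bacteria": "Root", "Archaea": "Root",
    "Proteobacteria": "Bacteria", "Firmicutes": "Bacteria",
    "E. coli": "Proteobacteria", "H. influenzae": "Proteobacteria",
    "B. subtilis": "Firmicutes", "M. jannaschii": "Archaea",
}


def path_to_root(node):
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def mrca(leaves):
    """Most recent common ancestor of a non-empty list of leaves."""
    paths = [path_to_root(leaf) for leaf in leaves]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first leaf, the first shared node is the MRCA.
    return next(node for node in paths[0] if node in common)


def label_role(leaves):
    """Nodes that receive the role: leaves, MRCA, and all nodes between."""
    top = mrca(leaves)
    labeled = set()
    for leaf in leaves:
        for node in path_to_root(leaf):
            labeled.add(node)
            if node == top:
                break
    return labeled


# Organisms whose sequences appear in one (hypothetical) alignment.
print(sorted(label_role(["E. coli", "B. subtilis"])))
```

Nodes above the MRCA are left unlabeled here; deciding whether a role was present deeper in the tree is part of the gap-filling judgment described in Step 4 and is not automated by this sketch.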

5. A software environment to support functional reconstructions

5.1. The WIT system

The creation and maintenance of models of function for sequenced organisms represent one of the most exciting challenges for the next few years. The process of developing, evaluating, and gradually refining models of function for organisms requires convenient access to a number of types of data. Our system, called WIT, supports two central functions:

1. It allows an international community of curators to develop and maintain metabolic reconstructions for organisms cooperatively.

2. It allows a larger community of users access to the models, presenting the current version for any desired organism embedded within a functional overview.

The initial release of WIT can be found at http://www.cme.msu.edu/WIT (footnote 3), which is located at the Center for Microbial Ecology at Michigan State University. It includes current, but not complete, metabolic reconstructions for Haemophilus influenzae, Mycoplasma genitalium, Saccharomyces cerevisiae, and about 130 other unicellular organisms. We are offering access to the system to the community to allow other researchers to participate in the development of these models.

3 http://www.cme.msu.edu/WIT

The metabolic reconstruction for an organism is composed of two tables (or relations). The first is a list of the function diagrams that are asserted to represent functionality present in the organism (or, less precisely, a list of the metabolic pathways believed to be present in the organism). The second is simply a relation with two columns

[sequence identifier, functional role]

which asserts that the designated functional role corresponds to the specified sequence. These two tables, together with the global data items (the functional overview, the table connecting lowest-level nodes to function diagrams, and the table connecting function diagrams to functional roles), will be what is needed to capture functional hierarchies for the sequenced organisms.
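To make the two-table description concrete, here is a minimal Python sketch that joins the per-organism relations with the global tables to recover which sequences fill which roles under each leaf of the overview. All table contents, identifiers, and the `reconstruct` function are hypothetical; the sketch assumes simple in-memory dictionaries rather than whatever storage WIT actually uses.

```python
# Global data items shared by all organisms (contents are invented).
LEAF_TO_DIAGRAMS = {
    "Arginine biosynthesis": ["arg_pathway_v1", "arg_pathway_v2"],
}
DIAGRAM_TO_ROLES = {
    "arg_pathway_v1": ["role A", "role B"],
    "arg_pathway_v2": ["role A", "role C"],
}

# Per-organism metabolic reconstruction: the two relations.
asserted_diagrams = {"arg_pathway_v1"}
sequence_roles = [            # [sequence identifier, functional role]
    ("seq001", "role A"),
    ("seq002", "role B"),
    ("seq003", "role X"),     # role not used by any asserted diagram
]


def reconstruct(leaf_to_diagrams, diagram_to_roles, asserted, seq_roles):
    """Return leaf -> diagram -> role -> [sequence ids] for one organism."""
    by_role = {}
    for seq_id, role in seq_roles:
        by_role.setdefault(role, []).append(seq_id)

    result = {}
    for leaf, diagrams in leaf_to_diagrams.items():
        for diagram in diagrams:
            if diagram in asserted:
                roles = {role: by_role.get(role, [])
                         for role in diagram_to_roles[diagram]}
                result.setdefault(leaf, {})[diagram] = roles
    return result


hierarchy = reconstruct(LEAF_TO_DIAGRAMS, DIAGRAM_TO_ROLES,
                        asserted_diagrams, sequence_roles)
print(hierarchy)
```

A role that maps to an empty list marks a diagram step with no sequence yet assigned in that organism, which is the kind of gap a curator would want surfaced.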

WIT allows a curator to access a wide variety of biological data. The system will propose pathways for which evidence exists, will summarize the evidence, will support annotations of decisions, and so forth. We have constructed the system to allow multiple interacting curators, each developing unique metabolic reconstructions for a single organism and each being able to create and maintain annotations for specific decisions.

The majority of users of WIT simply will wish access to current metabolic reconstructions as they emerge. Users can now access current models, along with the evidence and annotations supporting specific decisions.

5.2. PUMA: a system for presenting metabolic reconstructions

The current WIT system was developed after experimentation with a prototype system called PUMA. PUMA was created to demonstrate the utility of interconnecting a functional overview, a rich collection of metabolic pathways, a phylogenetic framework, and a set of alignments. It was an ambitious prototype, and we learned a great deal about how best to establish access to a representation of function. We made it available on the Web in 1995, and a number of other sites on the Web have now established links to it.

The metabolic models displayed in PUMA were of necessity generated semi-automatically. This approach allowed us to portray approximate models of hundreds of organisms. The tools that we developed did attempt accurate classifications, and we believe that the system clarified a number of central design issues. However, the derivation and maintenance of accurate metabolic models require an environment in which curators can impose specific decisions based on their experience. As it became clear that the actual curation of metabolic models was the central issue, and that many of the presentation problems had adequately been solved in the context of PUMA, we decided to architect a new system. WIT should be viewed as the successor of PUMA, absorbing the useful technology developed in that effort and extending it to support curation of metabolic reconstructions.

The presentation services prototyped in PUMA have been integrated into WIT, allowing broad access to metabolic reconstructions as soon as they are developed, or even as they are developed. A concerted effort will be made to generalize a number of the concepts from PUMA (e.g., the concept of "enzyme" will be generalized to "functional role"), and substantial effort is being expended to make the system maintainable. Clearly, there will be rapid additions and changes to almost all aspects of the data: the DNA sequence data, the protein sequence data, the function diagrams, and the metabolic reconstructions. It is imperative that the system function smoothly, involving minimal human intervention. With WIT we are attempting to define suggested interfaces that will remain stable and to produce curated metabolic reconstructions to which others can easily link.

5.3. Maintenance of the basic data structures

The WIT system maintains curated metabolic reconstructions. The underlying data required to support a reconstruction for a specific organism can be briefly summarized:

1. The system must maintain a current list of DNA fragments. These are normally a set of entries for an archival database. For genomes that are specifically curated, we start with the curated set of DNA contigs (in the best case, a set of one or more complete chromosomes).

2. The system maintains a set of "potential coding sequences", which are regions of the DNA contigs that might possibly code for proteins. We intentionally include a superset of the actual set of coding sequences, at least for those genomes that have not been subjected to extensive analysis.

3. A set of similarities between potential coding sequences and all entries in the Swiss Protein Data Bank, as well as a set of similarities against all complete genomes, is also computed and maintained. This is a computationally intensive task that we consider in more detail below.

4. Finally, we maintain the current, curated metabolic reconstruction that is a list of the functional roles asserted for each potential coding sequence and a single list of function diagrams that are believed to be applicable for the organism.

On a periodic basis (currently, once or twice a month), the archival databanks are examined to determine which entries are available for the set of organisms being maintained (currently, data for about 140 organisms is present in the system). The sets of DNA contigs and potential coding sequences are updated, and similarities relating to the new entries are computed. Each potential coding sequence is given two "names": one includes the contig id and the location of the subsequence in the contig, and the other is a "common name". The common name is a Swiss-Prot accession number, if the coding sequence corresponds to a Swiss-Prot entry; otherwise, it can be any arbitrary id that is appropriate (e.g., we use TIGR ids for the genomes they maintain). Each update cycle involves making adjustments to the set of potential coding sequences and names of potential coding sequences; when common names are used, the curators need not be concerned with these details (in any event, the set of asserted functional roles and pathways is updated automatically).
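The two-name scheme can be illustrated with a short sketch. It is written under stated assumptions rather than taken from WIT: the `contig_start_end` format of the positional name, the field names, the fallback to the positional name when no other id exists, and all identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class PotentialCodingSequence:
    contig_id: str
    start: int
    end: int
    swissprot_accession: Optional[str] = None  # set when a Swiss-Prot entry exists
    other_id: Optional[str] = None             # any other appropriate id

    @property
    def positional_name(self) -> str:
        """Name built from the contig id and the location in the contig.
        The underscore-separated format is an assumption of this sketch."""
        return f"{self.contig_id}_{self.start}_{self.end}"

    @property
    def common_name(self) -> str:
        """Swiss-Prot accession if available, otherwise another id.
        Falling back to the positional name is our own choice."""
        return self.swissprot_accession or self.other_id or self.positional_name


def index_by_common_name(
    sequences: Iterable[PotentialCodingSequence],
) -> Dict[str, PotentialCodingSequence]:
    """Build the lookup that lets assertions keyed by common name
    be resolved to the current coordinates after an update cycle."""
    return {seq.common_name: seq for seq in sequences}
```

If an update cycle shifts the coordinates of a sequence, its positional name changes while its common name does not, so role assertions keyed by common name remain valid and can be resolved against the rebuilt index.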

5.4. Processing computational tasks on a network of heterogeneous processors

The computation of similarities between protein sequences, the search for patterns in either protein or DNA sequences, the computation of alignments, and the computation of trees are all tasks that need to be executed to maintain the data structures of WIT or to support user requests. They are all computationally intensive tasks, and the volume of such requests can be substantial. Hence, we have built a system to dispatch these tasks to a network of workstations and a large supercomputing system (an IBM SP containing 128 nodes).

The dispatcher is a general tool that allows a user to request the execution of serial programs. The dispatcher accepts execution requests, locates workstations or supercomputer nodes to execute the programs on, starts programs, and monitors the execution of the programs. Currently, the user must inform the dispatcher of which workstations are candidates for executing tasks on and schedule nodes on the IBM SP so that the dispatcher can use them. In the near future, the dispatcher will interact with a resource management system to automatically allocate and deallocate computational resources. Resources will be allocated when there are a large number of tasks to execute, and resources will be deallocated when there are a small number of tasks to execute. The initial resource management system that will be used is the scheduler for the Argonne IBM SP supercomputer. The next resource management system that will be used is the Globus [9] system. Globus will manage a large number of computational resources at many different sites. Using these resources should greatly decrease the amount of time needed to maintain WIT data structures and support user requests.
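The accept, locate, start, and monitor cycle of such a dispatcher can be sketched in a few lines. This is a simplified illustration, not the dispatcher described here: the class and function names are hypothetical, every "worker" is simulated by a local process rather than a remote workstation or SP node, and the thresholds in `workers_wanted` are invented to show one possible allocation policy.

```python
import subprocess
import sys
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    argv: List[str]                   # command line of a serial program
    worker: Optional[str] = None      # worker that executed the task
    returncode: Optional[int] = None  # filled in when the program exits
    output: str = ""


class Dispatcher:
    def __init__(self, workers: List[str]) -> None:
        if not workers:
            raise ValueError("at least one candidate worker is required")
        self.idle = deque(workers)    # user-supplied candidate workers
        self.pending = deque()        # accepted but not yet started
        self.running = {}             # worker name -> (task, process)

    def submit(self, argv: List[str]) -> Task:
        """Accept an execution request."""
        task = Task(list(argv))
        self.pending.append(task)
        return task

    def run(self) -> None:
        """Start tasks on idle workers and monitor them until all finish."""
        while self.pending or self.running:
            while self.pending and self.idle:
                task = self.pending.popleft()
                task.worker = self.idle.popleft()
                # Assumption: a local process stands in for a remote launch.
                proc = subprocess.Popen(task.argv, stdout=subprocess.PIPE,
                                        text=True)
                self.running[task.worker] = (task, proc)
            # Wait for the oldest running task; the others keep running.
            worker, (task, proc) = next(iter(self.running.items()))
            task.output, _ = proc.communicate()
            task.returncode = proc.returncode
            del self.running[worker]
            self.idle.append(worker)


def workers_wanted(pending: int, tasks_per_worker: int = 10,
                   min_workers: int = 1, max_workers: int = 128) -> int:
    """Hypothetical policy: request more workers when many tasks are
    queued and release them when few remain (ceiling division, clamped)."""
    needed = -(-pending // tasks_per_worker)
    return max(min_workers, min(max_workers, needed))
```

A caller would create `Dispatcher(["ws1", "ws2"])`, submit command lines such as `[sys.executable, "-c", "print(1)"]`, and call `run()`; a resource manager could periodically compare `workers_wanted(len(dispatcher.pending))` with the number of workers currently held.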

6. Summary

The availability of numerous complete genomes, including a phylogenetically diverse subset, will have dramatic implications for our ability to analyze genetic sequence data. It has become critical that we elevate the goals of basic sequence analysis to include a representation of higher-level functional components and that we develop a framework for the comparative analysis of these components. During the next few years we will have the opportunity to develop functional reconstructions of organisms that classify the distinct functional components. This is a necessary step that will need to be taken to support the eventual modeling of these functional systems.

We have developed the system WIT to help support those wishing to develop functional reconstructions. As a by-product, we have developed initial versions of a "functional overview" and a collection of function diagrams [5]. Neither of these is a final product; rather, they are initial efforts that will be substantially extended and improved over the coming few years. They form the necessary background to WIT, a system that we offer to support the analysis, encoding, and presentation of higher-level function in genomes.

Acknowledgements

This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Computational and Technology Research, U.S. Department of Energy, under Contract W-31-109-Eng-38.

References

[1] Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A. and Merrick, J.M. et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512.

[2] Fraser, C.M., Gocayne, J.D., White, O., Adams, M.D., Clayton, R.A., Fleischmann, R.D., Bult, C.J., Kerlavage, A.R., Sutton, G. and Kelley, J.M. et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270, 397–403.

[3] Oliver, S.G. (1996) From DNA sequence to biological function. Nature 379, 597–600.

[4] Riley, M. and Space, D.B. (1996) Genes and proteins of Escherichia coli (GenProtEc). Nucleic Acids Res. 24, 40.

[5] Karp, P. and Riley, M. (1993) Representations of Metabolic Knowledge. In: Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, edited by L. Hunter, D. Searls, J. Shavlik. AAI Publishers, Menlo Park, CA.

[6] Selkov, E., Basmanova, S., Gaasterland, T., Goryanin, I., Gretchkin, Y., Maltsev, N., Nenashev, V., Overbeek, R., Panushkina, E., Pronevitch, L., Selkov Jr., E. and Yunus, I. (1996) The metabolic pathway collection from EMP: the enzymes and metabolic pathways database. Nucleic Acids Res. 24, 26–29.

[7] Smith, R.F., Wiese, B.A., Wojzynski, M.K., Davison, D.B. and Worley, K.C. (1996) BCM Search Launcher: an integrated interface to molecular biology data base search and analysis services available on the world wide web. Genome Res. 6, 454–462.

[8] Maidak, B.L., Olsen, G.J., Larsen, N., Overbeek, R., McCaughey, M.J. and Woese, C.R. (1996) The Ribosomal Database Project (RDP). Nucleic Acids Res. 24, 82–85.

[9] Foster, I. and Kesselman, C. (1996) Globus. A Metacomputing Infrastructure Toolkit. Proceedings of the Workshop on Environments

Ž .