Motif Knowledge Base Based on Deductive Object-Oriented Database Language


MAKOTO HIROSAWA†, REIKO TANAKA‡, MASATO ISHIKAWA§
[email protected] [email protected] [email protected]

†ex-ICOT, presently with Hitachi System Development Laboratory, 1099 Ohzenji, Asao, Kawasaki, 215, JAPAN

‡Information and Mathematical Science Laboratory, Inc. (IMS), presently at ICOT

§Institute for New Generation Computer Technology (ICOT), 1-4-28 Mita, Minato, 108 JAPAN

Abstract

The representation of biological concepts in a knowledge base is important for enabling a machine or a non-specialist in biology to understand and analyze genetic information. In our previous study, we examined the representation of biological knowledge in general, and of knowledge related to protein motifs in particular, with the goal of discovering new motifs.

In this paper, the requirements for the representation of biological knowledge are first listed. Then, solutions to these requirements are stated. Finally, the representation of biological knowledge on motifs in the Deductive Object-Oriented Database language QUIXOTE is shown. The knowledge base is built on Prosite, a representative motif database.


1 Introduction

The analysis of the genes of living organisms is an essential technology for deciphering biological phenomena. Today, with advances in biotechnology, the number of genes that have been identified but not yet analyzed is rapidly increasing. To enable the automatic analysis of large numbers of genes and to extract biological information from them, the introduction of knowledge-based analysis is necessary, in addition to the development of high-speed computers and fast analysis algorithms. This is partly because the quality of analysis without biological knowledge is not high enough, and partly because the time required for the analysis can be reduced remarkably by the introduction of biological knowledge.

Nowadays, data about genetic information is stored in databases. But these databases have been constructed with little consideration of their application to knowledge engineering, so it is difficult to use them to create a system that enables high-level knowledge processing. To build such a system, an effective representation of biological knowledge is necessary.

The representation of biological knowledge has been studied by several researchers [HTanaka 1993][Kazic et al. 1993]. They succeeded in representing simple, small amounts of knowledge. But their representations cannot necessarily be extended to practical problems.

We selected motif discovery as a practical problem through which to study the representation of biological knowledge. There were two reasons for our selection. One was that the level of complexity of biological knowledge about motifs is adequate for the study of the representation of biological knowledge. Its complexity is low enough for non-specialists in biology to study the representation without becoming overwhelmed by unfamiliar science. Still, its complexity is high enough to cover the essential concepts of biology. We thought that the understanding gained through this study would also help us in the representation of other areas of biological knowledge.

The second reason for our selection was our familiarity with the motif discovery problem. We have already developed a motif discovery system [Hirosawa et al. 1993]. But the representation in that system was not particularly elaborate. The logic programming language KL1, in which the system itself is written, was used to express biological knowledge. But, because KL1 is essentially a programming language rather than a knowledge representation language, its effectiveness for representation is limited.

This time we opted to use the DOOD (Deductive Object-Oriented Database) language QUIXOTE [Yasukawa et al. 1992] to represent biological knowledge. We thought that the object-oriented feature of QUIXOTE would be useful for the representation and that its deductive feature would be suitable for interfacing with an inference machine.

2 Biological Knowledge Base

In the first subsection, the requirements for a biological knowledge base are described and the solutions satisfying them, using a Deductive Object-Oriented Knowledge Base, are stated briefly. Then, after an overview of our Biological Knowledge Base, its description in QUIXOTE is presented.

2.1 Requirements for a Biological Knowledge Base

If we want a knowledge base to be useful for both an inference machine and non-specialists in biology, there are several requirements that the biological knowledge base must satisfy. The requirements and their solutions are listed below.

• Requirement 1 Knowledge of different reliabilities must co-exist in the knowledge base. But knowledge must be processed differently according to its reliability.

Solution 1 This can be realized by exploiting the module concept of QUIXOTE. We divide knowledge into modules according to the reliability of the knowledge. When we want to discover a small number of reliable motifs, we should consult only those modules whose reliability is high. When we want to discover many motifs at the expense of reliability, we should also consult those modules whose reliability is low.

• Requirement 2 The system must manage public knowledge bases and knowledge bases belonging to researchers differently. Also, the system must manage the knowledge bases belonging to different individuals separately.

Solution 2 This, too, can be realized by exploiting the module concept of QUIXOTE. Suppose there are three modules, Public*, User1 and User2. When User1 wants to consult biological knowledge, he or she should consult Public* and User1. When User2 wants to consult biological knowledge, he or she should consult Public* and User2.

• Requirement 3 In biology, Person 1 may regard a motif C as being a motif of enzyme A, while Person 2 regards the same motif as being a motif having function B. This often happens in biology. It would appear to be a contradiction, but actually is not. It arises because enzyme A has function B, and the portion of the protein around motif C has function B. The two views differ because of the different interests of different researchers. In this case, Person 1 is interested in the class of the discovered protein and Person 2 is interested in its function.

Solution 3 This kind of complexity cannot be represented by a conventional hierarchical representation. But it can easily be represented with multiple inheritance, an object-oriented feature of QUIXOTE, if the relations between biological concepts are untangled and properly described. To make the description, not only knowledge of DOOD but also knowledge of biology is required. We have both.

• Requirement 4 Biologists use their biological knowledge to understand a database. But a machine or a non-biologist has difficulty understanding a database, or cannot understand it perfectly, because of a lack of such knowledge. For example, they would find it essentially impossible to understand the information given in database comments, which are written in natural language. Because the information in comments is sometimes extremely important, this kind of information must be understandable by both the machine and non-specialists.

Solution 4 This cannot be done automatically. It can be done only by reading and understanding the database and then representing the knowledge. But this tedious and time-consuming job is made easier if the object-oriented concept is kept in mind.
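
Solution 2 can be pictured with a small Python sketch. This is not QUIXOTE code; it only mimics the idea that each user consults the public module plus his or her own. The motif names stored in User1 and User2 are hypothetical placeholders.

```python
# Hypothetical module contents; the module names mirror the
# Public*, User1, User2 example in Solution 2.
MODULES = {
    "Public*": {"protein-kinase-general"},
    "User1": {"user1-candidate-motif"},   # placeholder entry
    "User2": {"user2-candidate-motif"},   # placeholder entry
}


def visible_motifs(user_module):
    """Return the motifs a user consults: Public* plus the user's own module."""
    return MODULES["Public*"] | MODULES[user_module]


print(sorted(visible_motifs("User1")))
```

Under this sketch, knowledge stored by User2 never appears in the view of User1, while public knowledge is shared by both.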

2.2 Overview of Biological Knowledge Base

We constructed our Biological Knowledge Base using QUIXOTE. The Biological Knowledge Base contains knowledge on motifs and related knowledge.


2.2.1 Representation using the module feature, corresponding to Requirements 1 and 2 and their solutions

There are three modules in the knowledge base: Prosite*, User and Discovery. The system consults Prosite* and User when it attempts to discover motifs, and it registers any discovered motifs in Discovery. Separate handling of the motifs is realized by the module feature of QUIXOTE.

In the system, Prosite* is more reliable than User. In Prosite*, the motifs in Prosite [Bairoch 1991], a representative motif database, and related knowledge are represented in a hierarchical, multiple inheritance scheme. With the use of the multiple inheritance scheme, biological concepts are represented more properly (in the original Prosite, motifs are represented in a purely hierarchical scheme).

User is a module of the Biological Knowledge Base in which motifs collected by the user are stored. In some cases, motifs in User may be less reliable than those in Prosite*. To represent biological concepts in User and Discovery, the same scheme as that used in Prosite* is used.

When users want to deduce a small amount of reliable information from the knowledge base, they should select only the Prosite* module. When, however, they want to deduce more information at the expense of reliability, they should select the User module as well.
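
The module selection described above can be illustrated with a minimal Python sketch. It is not QUIXOTE code; the function names `consult` and `register_discovery` and the entry stored in User are assumptions made for illustration only.

```python
# The three modules of the knowledge base. The entry in "User" is a
# hypothetical placeholder; "Discovery" starts empty.
KNOWLEDGE_BASE = {
    "Prosite*": ["protein-kinase-general"],
    "User": ["user-collected-motif"],
    "Discovery": [],
}


def consult(include_user=False):
    """Return (module, motif) pairs from the selected modules.

    Prosite* is always consulted; User is added only when more
    information is wanted at the expense of reliability.
    """
    modules = ["Prosite*"] + (["User"] if include_user else [])
    return [(m, motif) for m in modules for motif in KNOWLEDGE_BASE[m]]


def register_discovery(motif):
    """Register a newly discovered motif in the Discovery module."""
    KNOWLEDGE_BASE["Discovery"].append(motif)


print(consult())                   # reliable knowledge only
print(consult(include_user=True))  # more knowledge, lower reliability
```

Keeping discovered motifs in their own module means that they never mix with the knowledge consulted during discovery unless the user chooses to move them.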

Figure 1

2.2.2 Representation of complex biological concepts, corresponding to Requirement 3 and its solution

To illustrate the representation of complex concepts in the Biological Knowledge Base in more detail, a portion of the knowledge in Prosite* is visualized in Figure 1. In the figure, concepts and motifs belonging to kinaseGroup and domain-general are shown. KinaseGroup is a concept related to a class of proteins, and domain-general is related to the functions of proteins. Because the two concepts are not independent, there is cross-linking between them.

KinaseGroup, which transfers phosphate, is classified into four groups, one of which contains protein-kinase. Furthermore, protein-kinase is classified into tyrosine kinase (ty-kinase) and serine/threonine kinase (s-th-kinase). The three classes (protein kinase, tyrosine kinase and serine/threonine kinase) did not exist in the original Prosite. We introduce these three classes because, although they are not necessary for biologists to understand the knowledge base, they are necessary for non-specialists and the machine to understand it. But the introduction of these concepts can be made to appear canceled. If users do not want the introduced class concepts (e.g. tyrosine kinase), the system handles knowledge as if these classes did not exist. This can be realized by using the method feature of QUIXOTE.

For some classes of protein, the corresponding motif is registered. Protein kinase has a motif named protein-kinase-general. Its pattern is "[LIV]-G-x-G-x-[FYM]-[SG]-x-V". Here, the expression [SG] signifies S or G, and 'x' signifies any amino acid. Tyrosine kinase has a motif named protein-kinase-signatures-tyrosine, while serine/threonine kinase has a motif named protein-kinases-serine/threonine.
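
To make the pattern notation concrete, the following Python sketch translates the protein-kinase-general pattern into a regular expression and matches it against a sequence. The helper name `prosite_to_regex` and the test sequence are invented for illustration, and the sketch handles only the notation explained above (bracketed alternatives, single letters and 'x'), not the full Prosite syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a regular expression.

    Elements are separated by '-'. 'x' stands for any amino acid and is
    mapped to '.'; bracketed sets such as [SG] and single letters are
    already valid regular expression syntax and are kept as they are.
    """
    parts = []
    for element in pattern.split("-"):
        parts.append("." if element == "x" else element)
    return "".join(parts)


PATTERN = "[LIV]-G-x-G-x-[FYM]-[SG]-x-V"
regex = re.compile(prosite_to_regex(PATTERN))

# A made-up amino acid sequence, used only to exercise the pattern.
match = regex.search("AAALGAGAFGAVKK")
print(regex.pattern)
print(match.group(), match.start())
```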

As a motif of tyrosine kinase or serine/threonine kinase, we can refer to the motif belonging to protein kinase in addition to the class's own motif. This can be done by using the method feature of QUIXOTE.

Also, as a motif of protein kinase, we can refer to the motif named "ATP/GTP-binding site motif A (P-loop)", which belongs to binding-domain, by using the method feature of QUIXOTE. In the original Prosite, the information about cross-linking was described in natural language as a comment. We read this implicit information and expressed it explicitly so that it can be read both by ourselves and by the machine.
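
The two kinds of reference described above (to the motif of the upper class, and to a motif of a cross-linked class) can be sketched in Python as follows. This is not the QUIXOTE method itself; the dictionaries and the function `motifs_of` are illustrative, and the choice that a subclass also sees the cross-linked motif of its upper class is an assumption of this sketch.

```python
# Class hierarchy below protein-kinase, as described in the text.
PARENT = {
    "ty-kinase": "protein-kinase",
    "s-th-kinase": "protein-kinase",
}

# Motifs registered directly for each class.
OWN_MOTIFS = {
    "protein-kinase": ["protein-kinase-general"],
    "ty-kinase": ["protein-kinase-signatures-tyrosine"],
    "s-th-kinase": ["protein-kinases-serine/threonine"],
    "binding-domain": ["ATP/GTP-binding site motif A (P-loop)"],
}

# Cross-links to classes outside the hierarchy (the otherMotif idea).
OTHER_MOTIF = {"protein-kinase": ["binding-domain"]}


def motifs_of(cls):
    """Collect a class's own motifs, then those reachable via its upper classes."""
    found = []
    while cls is not None:
        found += OWN_MOTIFS.get(cls, [])
        for linked in OTHER_MOTIF.get(cls, []):
            found += OWN_MOTIFS.get(linked, [])
        cls = PARENT.get(cls)  # None once the top of the sketch is reached
    return found


print(motifs_of("ty-kinase"))
```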

2.3 Description by DOOD

The QUIXOTE description of the knowledge illustrated in Figure 1 is shown in Figure 2 (class structure) and in Figure 3 (motif entries). In Figure 2, A >= B means that B is a sub-concept of A.

Figure 2

There are four motif entries in Figure 3. Among them, the upper three belong to Prosite* and the lowest belongs to User. The module is specified by the name placed before the double semicolon. We can specify Prosite*, User, or Discovery as the module.

In the top entry (protein kinase) there is the attribute otherMotif. This attribute enables the system to also refer to the motif belonging to another class (binding-domain). The procedure for referencing is also written using the method feature of QUIXOTE. This multiple inheritance is one of the features of DOOD.

Protein kinase does not appear to have a value for the attribute acceptor in Figure 3. But it actually does have one, because alcohol-group is inherited as the value from its upper class ec-2-7-1. There are two sub-groups of protein kinase, tyrosine kinase and serine/threonine kinase, which have tyrosine and serine or threonine (expressed by acceptor + = {serine, threonine}) as their attribute values. In this case, the inherited attribute value is overwritten. From a biological point of view, the attribute value becomes more precise, because tyrosine, serine and threonine are members of the alcohol group. This is another feature of DOOD.
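
The inheritance and overwriting of the acceptor attribute can be sketched in Python as follows. This is only an illustration of the behavior described above, not QUIXOTE syntax. The function `acceptor_of` is hypothetical, and ec-2-7-1 is treated as the direct upper class of protein-kinase for simplicity.

```python
# Upper class of each class (simplified: ec-2-7-1 is taken here as the
# direct upper class of protein-kinase).
PARENT = {
    "protein-kinase": "ec-2-7-1",
    "ty-kinase": "protein-kinase",
    "s-th-kinase": "protein-kinase",
}

# Acceptor values stated explicitly for a class. protein-kinase has no
# entry of its own, so its value is inherited.
ACCEPTOR = {
    "ec-2-7-1": {"alcohol-group"},
    "ty-kinase": {"tyrosine"},
    "s-th-kinase": {"serine", "threonine"},
}


def acceptor_of(cls):
    """Return the nearest acceptor value, searching upward from cls.

    A value stated on a class overwrites the one it would otherwise
    inherit from an upper class.
    """
    while cls is not None:
        if cls in ACCEPTOR:
            return ACCEPTOR[cls]
        cls = PARENT.get(cls)
    return None


print(acceptor_of("protein-kinase"))  # inherited from ec-2-7-1
print(acceptor_of("ty-kinase"))       # overwritten by the subclass
```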

Using this information, the acceptor of a protein can be determined more precisely if we find a motif that belongs to a deeper level of the hierarchy. Identification as tyrosine kinase or serine/threonine kinase gives us more information than identification as protein kinase, the upper concept of the two.

Figure 3

3 Discussion

Knowledge Base

The Biological Knowledge Base in the system satisfies the four requirements stated in Section 2.1. Requirement 1, the management of multiple levels of reliability, is satisfied by the module feature of QUIXOTE. Requirement 2, the management of multiple knowledge bases having different owners, does not arise in the current system. But it, too, can be satisfied by the module feature of QUIXOTE.

Requirement 3, the correct representation of biological concepts, is satisfied by multiple inheritance, the overwriting of inherited values and the use of methods, which are not special features of QUIXOTE but standard features of a Deductive Object-Oriented Database. Requirement 4 is satisfied by our effort to read and understand the biology.

Motif Discovery

Motif discovery using the previous representation [Hirosawa et al. 1993a] and motif discovery using the current representation [Hirosawa et al. 1993b] are compared.

The motif discovered using the current biological knowledge base is the same as the motif discovered using the previous biological knowledge base. The discovered motif is that of protein kinase.

The difference is the time required for the discovery. The previous system could not recognize that ATP/GTP-binding site motif A (P-loop), which belongs to the binding domain, is also a motif of protein kinase, whereas the present system recognizes this with the help of multiple inheritance. This information about the motif makes the present system faster than the previous one. This shows that proper representation of knowledge can contribute to speeding up genetic analysis.


Others

The module feature of QUIXOTE, used to satisfy Requirements 1 and 2, is not indispensable to our system; it can be replaced by another means of description that does not use the feature. But we think that the module feature will prove useful when the size of the biological knowledge base increases.

4 Conclusion

A biological knowledge base on motifs was constructed using the DOOD (Deductive Object-Oriented Database) language QUIXOTE. In the biological knowledge base, entangled concepts of biology were described with the multiple inheritance and non-monotonic features of DOOD (more precisely, its OO features). Because knowledge of motifs comprises an essential aspect of biological knowledge, we think the framework of representation in this research is also applicable to the representation of biological knowledge in other domains.

The D (Deductive) feature of DOOD is also suitable for interfacing with the inference module of the system. Consequently, DOOD is suitable for representing a biological knowledge base that is part of a genetic analysis system.

5 Acknowledgment

The authors thank Prof. Fuchi, ex-director of ICOT, for introducing us to knowledge engineering and providing us with the research environment, and also Dr. Uchida, director of ICOT, for introducing us to genetic information processing. We also thank the QUIXOTE group at ICOT, especially H. Tanaka, for their assistance.

Finally, the first author gives special thanks to N. Mizuhara and M. Kunimoto at the System development Lab. of Hitachi Ltd. for providing him with the computer environment needed to write the paper.

References

[Bairoch 1991] Bairoch, A. Prosite: A dictionary of protein site and pattern: User manual, Release 7.00, May 1991.

[Hirosawa et al. 1993a] Hirosawa, M., Hoshiwa, M., Ishikawa, M. Protein Multiple Sequence Alignment using Knowledge. Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, Vol. 1, pp. 803-812, 1993.

[Hirosawa et al. 1993b] Hirosawa, M., Tanaka, R., Ishikawa, M. Application of Deductive Object-Oriented Knowledge Base to Genetic Information Processing. Proceedings of the International Symposium on Next Generation Database Systems and Their Applications, 1993.

[HTanaka 1993] Tanaka, H. A Private Knowledge Base for Molecular Biological Research. Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, Vol. 1, pp. 803-812, 1993.

[Kazic 1993] Kazic, T. Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, Vol. 1, pp. 803-812, 1993.

[Yasukawa et al. 1992] Yasukawa, H. et al. Objects, Properties, and Modules in Quixote. Proc. of FGCS92, pp y89-112, 1992.