Predictive genomic and metabolomic analysis for the standardization of enzyme data



REVIEW


Masaaki Kotera*, Susumu Goto, Minoru Kanehisa

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan

Received 2 September 2011; accepted 4 November 2013; Available online 26 March 2014

KEYWORDS

KEGG; Reaction Class (RCLASS); KEGG Orthology (KO); E-zyme; PathPred; Chemical annotation

Abstract

The IUBMB's Enzyme List is a valuable library of individual experimental facts on enzyme activities, and it provides the standard classification and nomenclature of enzymes. Empirical knowledge about the relationships between enzyme protein sequences (or structures) and their functions (the capability of catalyzing chemical reactions) has been accumulating in the public literature and databases. This knowledge provides a complementary approach to standardizing and organizing enzyme data, i.e., predicting the possible enzymes, reactions and metabolites that remain to be identified experimentally. Thus, we suggest that enzymes need to be classified based on the evidence and the different perspectives obtained from various experimental works. The KEGG (Kyoto Encyclopedia of Genes and Genomes) database describes enzymes from many different viewpoints, including the IUBMB's enzyme nomenclature/classification (EC numbers), the similarity groups of enzyme reactions (KEGG Reaction Class; RCLASS) based solely on chemical structure transformation patterns, and the similarity groups of enzyme genes (KEGG Orthology; KO) based on orthologous groups that can be mapped to the KEGG PATHWAY and BRITE functional hierarchies. In addition to the EC numbers established by the IUBMB, some unique identifiers were introduced to the KEGG database. R, RP and RC numbers are given to distinguish reactions, reactant pairs and RCLASS entries, respectively. Genes, including enzyme genes, have their own ID numbers in specific organisms, and they are classified into ortholog groups that are identified by K numbers. In this review, we explain the concept and methodology of this formulation with some concrete examples. We propose that it would be beneficial to create a standard classification scheme that deals with both experimentally identified and theoretically predicted enzymes.

http://dx.doi.org/10.1016/j.pisc.2014.02.003

2213-0209 © 2014 The Authors. Published by Elsevier GmbH. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/).

☆ This article is part of a special issue entitled “Reporting Enzymology Data – STRENDA Recommendations and Beyond”. Copyright by Beilstein-Institut.

* Corresponding author at: Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan.


Contents

Introduction

KEGG RCLASS

Metabolic reconstruction/chemical annotation

Independent IDs for the enzyme classification in many perspectives

Understanding of structure–function relationships of enzymes

Conflict of interest statement

Acknowledgments

References

Introduction

Human genome research and the subsequent post-genome research have provided both systematic analyses and a broad definition of the “genome”, i.e., the substances coded by the genome, such as gene expression, polymorphism and the proteome. DNA, RNA and proteins are synthesized based on genetic codes and template-based duplication, transcription and translation. On the other hand, the chemical structures of metabolites such as glycans, lipids and terpenoids are not determined by such templates, but by biosynthetic pathways. Kanehisa et al. named the genomes and the chemical activity that exist within all species the ‘genomic space’ and the ‘chemical space’, respectively (Kanehisa et al., 2010). A firm grasp of the ‘genomic space’ is valuable when screening for disease genes, drug targets, etc.; likewise, a grasp of the ‘chemical space’ provides insight when screening for imaging probes, drug leads, etc. The field of molecular biology has spread to omics-level research (genomics, proteomics, metabolomics, etc.), and is continually expanding to study whole families of organisms. For instance, next-generation sequencers are expected to be powerful enough to analyze environmental genomes, a field also referred to as “metagenomics” (Schloss and Handelsman, 2003; Handelsman, 2004; Riesenfeld et al., 2004; Tringe et al., 2005). Similarly, high-throughput mass spectrometry and NMR enable the study of metabolomics at the family, order or class level, which can be referred to as “meta-metabolomics” (Raes and Bork, 2008; Turnbaugh and Gordon, 2008; Auld and Acker, 2014; Monasterio, 2014).

Genome analysis has become routine, and individual repositories of genes are being constructed for all known living organisms. In contrast, repositories of the chemical substances that exist in, or affect, individual living organisms are in their infancy and are not well established; much less is known about the interrelationships between the genomic and chemical spaces. To bridge this gap, it is essential to establish robust methodology to predict chemical substances from genomic data and vice versa. Enzymes are the important bridge between the genome and chemical biosynthesis. An enzyme, amylase, was first identified in 1833 by Payen and Persoz (1833). At that time, it was not known that many enzymes are made of proteins. It was in 1926 that Sumner showed that an enzyme, urease, is in fact a protein (for this work he won the 1946 Nobel Prize in Chemistry). Sanger and Tuppy (1951a, 1951b) published a method to determine amino acid sequences in 1951. After that, many more enzymes were identified, and there arose the need for a systematic enzyme nomenclature. The International Union of Biochemistry and Molecular Biology (IUBMB) established the Enzyme List in 1961 for this exact purpose (Tipton and Boyce, 2000). This was before the establishment of the Atlas of Protein Sequence and Structure in 1972 (Dayhoff, 1972) and the prototype of the GenBank database in 1979 (Goad, 1987). It has now become relatively easy to obtain nucleic acid sequences, and it has become mandatory to determine the nucleic acid or amino acid sequence of an enzyme and register it in the GenBank database prior to publishing an original paper discussing that enzyme.

Since then, information on genes, protein sequences and structures has been proliferating, creating huge databases that are connected worldwide. These include the amino acid sequence databases PIR (Protein Information Resource) (Barker et al., 1999), Swiss-Prot (Bairoch and Boeckmann, 1991), Entrez Protein (Marchler-Bauer et al., 2002) and PRF (Nishikawa et al., 1993); the protein function databases PROSITE (Bairoch, 1991), Pfam (Sonnhammer et al., 1997), InterPro (Apweiler et al., 2001), GenomeNet Motif (Kanehisa et al., 2002) and ExPASy ENZYME (Bairoch, 2000); the protein structure databases PDB (Bernstein et al., 1977), SCOP (Murzin et al., 1995), CATH (Orengo et al., 1997) and FSSP (Holm and Sander, 1994); and the integrated databases at NCBI (National Center for Biotechnology Information), EBI (European Bioinformatics Institute), SIB (Swiss Institute of Bioinformatics) and GenomeNet. Due to the recent successful development of high-throughput measurement techniques, the rate of biological data accumulation has become even faster, vastly exceeding the knowledge capacity of the human mind.

The IUBMB's Enzyme List (EC numbers) classifies enzymes based on published experimental data and provides extremely useful information regarding experimental evidence. The Enzyme List classifies enzymes hierarchically: the levels up to the sub-subclass (the third number) form a systematic classification of enzyme-catalyzed reactions. The fourth number of the Enzyme List is a serial number given to an experimentally observed (and published) enzyme with details of the reaction


number record is linked to the PubMed ID, enabling easy access to the original paper. There are currently two types of EC numbers: official EC numbers and unofficial EC numbers. The first represents biochemical knowledge organized by the IUBMB–IUPAC Biochemical Nomenclature Committee. The second is used in genome annotation to identify enzyme genes (and enzymes); these numbers are organized not by the Biochemical Nomenclature Committee but by the annotators of databases including KEGG (Kanehisa et al., 2010), based on sequence similarity. KEGG once used EC numbers as the primary identifiers of enzymes, but no longer does so, for reasons that will be discussed later.
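The hierarchical layout of EC numbers described in this paragraph can be illustrated with a short Python sketch. It is not part of the original article: the names `ECNumber`, `parse_ec` and `same_sub_subclass` are hypothetical helpers introduced here, and the sketch only assumes that an EC number is written as four dot-separated integers.

```python
from typing import NamedTuple


class ECNumber(NamedTuple):
    """Four hierarchical levels of an EC number (illustrative helper)."""
    ec_class: int
    subclass: int
    sub_subclass: int
    serial: int


def parse_ec(text: str) -> ECNumber:
    """Parse a string such as 'EC 2.7.1.1' or '2.7.1.1' into four integers."""
    cleaned = text.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].strip()
    parts = cleaned.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {text!r}")
    return ECNumber(*(int(p) for p in parts))


def same_sub_subclass(a: str, b: str) -> bool:
    """True when two EC numbers share the first three (systematic) levels."""
    return parse_ec(a)[:3] == parse_ec(b)[:3]


# Hexokinase and glucokinase differ only in the serial (fourth) number.
print(parse_ec("EC 2.7.1.1"))
print(same_sub_subclass("EC 2.7.1.1", "EC 2.7.1.2"))
```

Comparing only the first three fields mirrors the statement that the levels up to the sub-subclass are systematic, whereas the fourth field is a serial number.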

Enzyme functions are highly dependent on the structures of the enzyme proteins. Like any other protein, an enzyme is synthesized in the ribosome using the nucleic acid sequence of a gene as its template; therefore enzyme structures are the products of evolution. Evolutionarily close enzymes have similar motifs and form a group of enzymes. In homologous proteins, even if the proteins are not similar as a whole, the regions associated with common functions or structural restrictions, motifs and specific functions all tend to be well preserved. Some empirical knowledge has been becoming clear through the development of structural biology and site-directed mutagenesis. Site-directed mutagenesis studies have been performed since the 1980s to change enzyme functions (Carter, 1986) through a trial and error process. Because a protein's X-ray crystal structure is still difficult to obtain reliably, there have been many attempts to predict enzyme structure and function from amino acid sequences. The prediction of enzyme function is based on empirical knowledge stored in databases rather than on physical and chemical theory and calculation. Since the end of the 20th century, many genomes from various species have been profiled, enabling their comparison, and the comparison of enzymes between species to find conserved and non-conserved regions. This gives a hint as to which amino acid residues undergo evolutionary change. Thanks to improvements in computational power and data storage, it is becoming possible to design protein structures from amino acid sequences and to predict their function through computer simulation and bioinformatics tools. There have been some successful attempts to manipulate enzyme functions; in 2008 Ghirlanda et al. published on the computational enzyme design of Kemp elimination catalysts (Ghirlanda, 2008). Nevertheless, manipulation of enzyme function still remains a trial and error process. Although some examples of the structure–function relationships of enzymes are known, they are limited to a small set of enzymes.

We have been tackling the genome-scale metabolic pathway reconstruction problem, i.e., constructing a systematic assembly and organization of all the metabolism of a given organism, while focusing on the relationships between the genomic and chemical spaces. Through this process, the KEGG Orthology (KO) database was created, which defines orthologous gene groups that are located at the same place in a biological pathway and functional hierarchy, offering strategies to predict the possible set of chemical structures from the genomic information of a given organism. At first, we accumulated knowledge about glycosyltransferase groups for the post-translational modification of proteins. These enzyme genes were grouped based on orthology and substrate specificity, and were successfully applied to predict the possible glycan structures for a given genome or transcriptome (Hashimoto et al., 2009; Kawano et al., 2005). The same strategy was applied to polyketides, non-ribosomal peptides and polyunsaturated fatty acids (Minowa et al., 2007; Hashimoto et al., 2008).

This strategy was successful because orthology and substrate specificity correlated well. In order for this strategy to be applied to other enzyme groups, it is necessary to organize the enzyme classification so that enzyme function can be deduced from protein sequences, or vice versa, for a wide range of enzymes. We recently established the KEGG Reaction Class (RCLASS) based on the similarity of reactions in terms of the reaction motifs or the RDM chemical transformation patterns. RCLASS makes it possible to develop a method to predict what kind of enzymes can catalyze putative reactions, and also a method to predict the metabolic fate of given compounds. We plan to use these two methodologies to connect the chemical and genomic spaces, while continuing work to refine enzyme classifications for predictive genomic and metabolomic analyses. This would add a genomic perspective to enzyme data standardization, enhancing the utility of the IUBMB's Enzyme List.

KEGG RCLASS

Our primary goal in the development of RCLASS is to extend the EC classification so that it also covers putative reactions that are not yet well characterized. High-throughput measurement techniques hint at the existence of considerable numbers of orphan metabolites, i.e., compounds that are known to be present in living organisms but whose synthetic/degradation pathways are unknown (Kotera et al., 2008). In order to identify the enzyme proteins involved in these pathways, it is essential to characterize or classify the putative reaction equations, which are often incomplete. In principle, the official EC numbers cannot be used for this purpose because their assignment requires confirmed experimental evidence of enzyme activity and a complete reaction equation. In order to describe the relationships between putative reactions and putative enzyme proteins (or genes), it is essential to develop an enzyme classification scheme that is applicable not only to confirmed reactions with complete equations, but also to putative reactions, even if the equations are incomplete.

Finding possible enzyme reactions from metabolomic data naturally starts with a pair of compounds (which we refer to as a “reactant pair”) corresponding to a reaction equation, which is not always a complete reaction equation (Kotera et al., 2004). The possible chemical transformation between the compounds can be obtained by comparing the two chemical structures. Technically, chemical compounds are represented as graph structures, where the edges represent chemical bonds and the nodes represent atoms annotated with functional group information. In order to distinguish the functional groups and microenvironments of atoms, five atom species (C, N, O, S and P) are classified into the 68 KEGG atom types (Hattori et al., 2003) (such as “N1a” for an amino group in Figure 1). As a result of graph comparison, the matched subgraph corresponds to the atom group conserved in the enzymatic reaction, and the unmatched subgraph of each compound corresponds to the eliminated or the added atom groups. The boundary area between the conserved and the non-conserved subgraphs can be regarded as the reaction center on which the putative enzyme acts. In this way, the RDM chemical transformation patterns are extracted computationally from a reactant pair (Kotera et al., 2004; Hattori and Kotera, 2011). The RDM pattern is represented as a string of KEGG atom types, and describes a chemical bond that is generated or eliminated in a reaction.
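The idea of locating a reaction center at the boundary between matched and unmatched subgraphs can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the published RDM algorithm: the atom mapping between the two compounds is assumed to be given rather than computed, the function name `reaction_centers` is hypothetical, and apart from “N1a” (quoted in the text for an amino group) the atom type labels are placeholders.

```python
from typing import Dict, List, Set, Tuple

# A compound is a dict: atom id -> (atom type label, set of neighbouring atom ids).
Compound = Dict[str, Tuple[str, Set[str]]]


def reaction_centers(substrate: Compound,
                     product: Compound,
                     mapping: Dict[str, str]) -> List[dict]:
    """Return conserved atoms that border an unmatched (lost or gained) atom.

    `mapping` pairs each conserved substrate atom with its product atom and is
    assumed to be given; computing it by graph matching is out of scope here.
    """
    mapped_product_atoms = set(mapping.values())
    centers = []
    for s_atom, p_atom in mapping.items():
        lost = sorted(n for n in substrate[s_atom][1] if n not in mapping)
        gained = sorted(n for n in product[p_atom][1]
                        if n not in mapped_product_atoms)
        if lost or gained:
            centers.append({
                "center": (substrate[s_atom][0], product[p_atom][0]),
                "lost": [substrate[n][0] for n in lost],
                "gained": [product[n][0] for n in gained],
            })
    return centers


# Toy reactant pair: an amino group on a carbon is replaced by a hydroxyl group.
# "N1a" is the label quoted in the text; the other labels are placeholders.
substrate = {
    "c1": ("C1b", {"c2", "n1"}),
    "c2": ("C1a", {"c1"}),
    "n1": ("N1a", {"c1"}),
}
product = {
    "c1": ("C1b", {"c2", "o1"}),
    "c2": ("C1a", {"c1"}),
    "o1": ("O1a", {"c1"}),
}
print(reaction_centers(substrate, product, {"c1": "c1", "c2": "c2"}))
```

The sketch reports only the atom types around each boundary atom; it does not build the string representation of an RDM pattern.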

We defined RCLASS entries, each of which represents a set of chemical transformations found in a substrate–product pair (reactant pair). Each RCLASS entry was given an identification number (RC number). An RCLASS entry may consist of multiple RDM patterns when more than one chemical bond is generated or eliminated. As of August 2011, 2345 RCLASS entries had been defined from 38,241 RPAIR entries in the KEGG database. RCLASS entries have graphics representing the common chemical transformations that occur in a defined set of reactant pairs (Figure 2), where reaction centers and their vicinities are emphasized by KEGG atom types and colors. The directions are decided according to the alphabetical order of the RDM patterns, and the orientations of the chemical structures are decided manually so that similar RCLASS graphics are drawn in the same orientation whenever possible. This makes it easier for the user to understand the chemical structure transformation, as well as to compare different reaction types. RCLASS classifies reactions based solely on the chemical transformations of reactions on metabolic pathways and is independent of any other information, such as the range of substrate specificity and the amino acid sequence.
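Grouping reactant pairs that share the same set of transformation patterns can be sketched as below. This is only an illustration of the grouping idea, not the procedure used to build KEGG RCLASS: the identifiers and pattern strings are placeholders rather than real KEGG entries, and `pattern_key` and `group_reactant_pairs` are hypothetical names. Sorting the patterns gives an order-independent key, loosely echoing the alphabetical ordering mentioned in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pattern_key(rdm_patterns: Iterable[str]) -> Tuple[str, ...]:
    """Order-independent key: the sorted, de-duplicated pattern strings."""
    return tuple(sorted(set(rdm_patterns)))


def group_reactant_pairs(pairs: Dict[str, List[str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Group reactant-pair ids that share exactly the same set of patterns."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for pair_id in sorted(pairs):
        groups[pattern_key(pairs[pair_id])].append(pair_id)
    return dict(groups)


# Placeholder ids and pattern strings; they are not real KEGG entries.
toy_pairs = {
    "pair_1": ["pattern_A"],
    "pair_2": ["pattern_B", "pattern_A"],
    "pair_3": ["pattern_A", "pattern_B"],
    "pair_4": ["pattern_A"],
}
for key, members in group_reactant_pairs(toy_pairs).items():
    print(key, members)
```

Because the key depends only on the patterns, the grouping uses no information about substrate specificity or sequence, which is the independence emphasized in the paragraph above.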

The relationships among the various entities related to enzymes are as follows. The basic information on these classifications is taken from the IUBMB Enzyme List (EC numbers). Reactions taken from the IUBMB Enzyme List and other literature are given identification numbers (R numbers) and are stored in KEGG REACTION, followed by the addition of information on confirmed source organisms, pathways, and orthologue groups of enzyme genes. Substrate–product pairs (reactant pairs) are defined for enzyme reactions (Figure 3) and are stored in the RPAIR database, together with the calculated RDM chemical structure transformation patterns. In general, a reaction (R number) consists of multiple reactant pairs (RP numbers).

Figure 2 Reactant pairs and reaction classes.

Figure 3 Reaction equations and reactant pairs.


Metabolic reconstruction/chemical annotation

RCLASS is proposed to be beneficial in linking metabolomics to genomics, as well as in analyzing the conserved consecutive reaction patterns in the evolution of metabolic pathways. We surveyed the frequently appearing RDM patterns specific to the 11 categories of KEGG metabolic pathways, discovered some patterns specific to particular categories, especially biodegradation pathways, and thus developed a method to predict biodegradation pathways in bacteria (Oh et al., 2007). Such a method for predicting metabolic fate is based on the extraction of biological meaning from chemical structure, which is referred to as chemical annotation (Dry et al., 2000; Chen et al., 2005; Kanehisa et al., 2008).

Metabolic network reconstruction and annotation can be classified into three ideal and hierarchically ranked sets of conditions; if the first conditions can be met, then the second and third ones are not required. Similarly, if the second set of conditions can be met, then the third is not needed, though the first would then need to be revisited. The first conditions specify that when a metabolic pathway is well characterized with experimentally confirmed enzymes and reactions in at least one organism, genome-based and pathway-based annotations are applicable. The second conditions stipulate that when certain reactions are known to take place but the enzymes responsible for these reactions have not been identified in any organism, prediction of the enzymes from the proposed reactions is necessary in order to reach the first set of conditions. The third conditions apply when some metabolites are known to exist but the reactions producing or degrading them have not been identified; predictions of these reactions are then necessary to move back to the second set of conditions.
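The three ranked sets of conditions amount to a simple decision rule, which the following Python sketch spells out. The names `AnnotationCondition` and `classify_case` are hypothetical, and reducing each case to two yes/no flags is a simplifying assumption made here for illustration.

```python
from enum import Enum


class AnnotationCondition(Enum):
    FIRST = "genome-based and pathway-based annotation"
    SECOND = "predict enzymes from the proposed reaction"
    THIRD = "predict reactions for the metabolite"


def classify_case(reaction_known: bool, enzyme_known: bool) -> AnnotationCondition:
    """Pick the set of conditions that applies to one pathway step."""
    if reaction_known and enzyme_known:
        return AnnotationCondition.FIRST
    if reaction_known:
        return AnnotationCondition.SECOND
    # Only the metabolite is known; a reaction must be predicted first.
    return AnnotationCondition.THIRD


for flags in [(True, True), (True, False), (False, False)]:
    print(flags, classify_case(*flags).value)
```

Resolving a THIRD case yields a proposed reaction, which turns it into a SECOND case, and resolving that yields a candidate enzyme, which leads back to the FIRST case.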

We defined reference pathways to cope with the first set of annotation conditions and designed KEGG PATHWAY and BRITE so that they generally do not focus on a specific organism, but are applicable to all organisms. Reference pathways are defined as combined pathways that are present in a number of organisms and on which there is a consensus among many published papers. Figure 4 describes the difference between a species-specific pathway and a reference pathway, and the relationships among the various IDs. In the reference pathway, rectangles and circles represent gene products (mostly proteins) and other molecules (mostly metabolites), respectively. This graphic is one of the reference pathways for which no organism has been specified. When the user selects to view a reference pathway, the colored rectangles indicate the links to the corresponding orthologue (KO) entries, enzyme classifications or reactions. When the user specifies an organism, the colored rectangles indicate the links to the corresponding KEGG GENES pages, which means that the specified organism possesses the corresponding genes or proteins in its genome. White rectangles indicate that there are no genes annotated to the corresponding function. Note that this does not necessarily mean that the organism lacks the corresponding genes; it is possible that they have not been identified yet. Manually defined KO entries (groups of orthologous genes) are the basic components of the systems information, i.e., PATHWAY network diagrams and BRITE functional classifications. Continuous refinement of reference pathways and orthologue information is the key to maintaining the quality of this procedure.

We designed the E-zyme tool (Kotera et al., 2004) in response to the second set of annotation conditions, the practical situation where the user wants to identify enzymes (enzyme genes, proteins or reaction mechanisms) from only a partial reaction equation. The user can input any compound pair and obtain candidate EC classifications, which provide a ‘clue’ for identifying the enzyme genes or proteins. This requires a library of RDM chemical transformation patterns calculated in advance, which is compared with the query transformation pattern, resulting in a list of possible EC classifications with specific scores. Recently, we made a significant improvement to E-zyme, in which a more elaborate voting scheme and an EC-RDM profile-based scoring system are applied to achieve higher coverage with a higher accuracy rate (Yamanishi et al., 2009).
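The general idea of matching a query pattern against a precomputed library and ranking EC classifications by votes can be sketched as follows. This is a deliberately simplified illustration, not the E-zyme scoring scheme: it uses exact pattern matches and plain vote fractions, the function name `rank_ec_candidates` is hypothetical, and the library contents are made-up placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rank_ec_candidates(query_patterns: Iterable[str],
                       library: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Rank EC sub-subclasses by the fraction of votes from matching patterns.

    `library` maps a pattern string to the EC sub-subclasses of the known
    reactions in which that pattern was observed (one entry per observation).
    """
    votes: Counter = Counter()
    for pattern in query_patterns:
        for ec_sub_subclass in library.get(pattern, []):
            votes[ec_sub_subclass] += 1
    total = sum(votes.values())
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [(ec, count / total) for ec, count in ranked]


# Placeholder library: the pattern names and their EC associations are invented.
toy_library = {
    "pattern_A": ["2.7.1", "2.7.1", "3.1.3"],
    "pattern_B": ["1.1.1"],
}
print(rank_ec_candidates(["pattern_A"], toy_library))
print(rank_ec_candidates(["pattern_unknown"], toy_library))
```

A query with no matching pattern returns an empty list, which corresponds to the case where the library offers no clue for the enzyme.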

The third set of annotation conditions, where the user obtains the chemical structure of a metabolite for which the biosynthesis/biodegradation pathway is unknown, has also been tackled using RDM patterns (Oh et al., 2007), as an extension of the E-zyme approach. We recently developed a new web-based server named PathPred (Moriya et al., 2010) for predicting the metabolic fate of a given chemical compound, based on the RCLASS entries conserved in particular types of pathways. This server provides plausible reactions and transformed compounds, and displays all predicted reaction pathways in a tree-shaped graph (Figure 5a). The suggested pathway includes the steps with the plausible EC numbers, which are predicted by E-zyme (Figure 5b). The user can choose the type of pathway according to their purpose, either the biodegradation of xenobiotics in bacteria or the biosynthesis of secondary metabolites in plants; each utilizes a different characteristic subset of the RDM patterns. In the first step, the query compound structure is compared with those in the selected metabolic category. In the second step, possible RDM patterns on the query compound are selected from the RDM pattern library, based on the structurally similar compounds containing the corresponding RDM patterns, with the use of the SIMCOMP program (Hattori et al., 2003, 9). The third step is to obtain the plausible products according to the selected RDM patterns. The generated products become the next query compounds and the prediction is iterated if possible. Optionally, if already known, the final compound in the biodegradation or the initial compound in the biosynthesis can be specified (bi-directional prediction).
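The iterative scheme described above, in which each generated product becomes the next query, can be sketched as a depth-limited tree expansion. This is not the PathPred implementation: the structure comparison and pattern selection steps are hidden behind a caller-supplied `transform` function, compounds are plain strings, and the names `predict_pathways` and `toy_rules` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (pattern label, product compound)


def predict_pathways(query: str,
                     transform: Callable[[str], List[Step]],
                     max_depth: int = 3,
                     target: Optional[str] = None) -> List[List[str]]:
    """Enumerate compound paths starting from `query`.

    `transform(compound)` stands in for steps one to three of the text and
    returns (pattern, product) candidates. If `target` is given, only paths
    that reach it are kept.
    """
    paths: List[List[str]] = []

    def extend(path: List[str], depth: int) -> None:
        current = path[-1]
        if target is not None and current == target:
            paths.append(path)
            return
        steps = transform(current) if depth < max_depth else []
        # Skip products already on the path to avoid cycles.
        products = [product for _, product in steps if product not in path]
        if not products:
            if target is None:
                paths.append(path)
            return
        for product in products:
            extend(path + [product], depth + 1)

    extend([query], 0)
    return paths


# Placeholder transformation rules between made-up compounds.
toy_rules = {
    "A": [("pattern_1", "B"), ("pattern_2", "C")],
    "B": [("pattern_3", "D")],
}


def toy_transform(compound: str) -> List[Step]:
    return toy_rules.get(compound, [])


print(predict_pathways("A", toy_transform))
print(predict_pathways("A", toy_transform, target="D"))
```

The collected paths correspond to the branches of a tree-shaped result, and the optional `target` argument only filters for paths that reach a known end compound, which is a much weaker device than the bi-directional prediction described in the text.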

Independent IDs for the enzyme classification in many perspectives

As an expansion of our study on reconstructing metabolic pathways based on chemical structures, we have been trying to predict the genes responsible for predicted reactions, based on the relationships between metabolite chemical structures and protein sequences. The key to achieving this is the classification of enzymes from both genomic and metabolomic points of view.

There are many ways to classify enzymes. Enzymes in the IUBMB's Enzyme List are systematically classified according to the chemical structures of their substrates, products and co-factors, as well as reaction selectivity and substrate specificity, which are inalienably related to Enzyme Nomenclature. Enzymes can also be classified based on enzyme proteins, such as their amino acid sequences and 3D structures. Other factors that can be used to group enzymes include the location in the pathway (i.e., biological function) and the location in the cell. For example, enzymes are classified into membrane-bound enzymes and soluble enzymes. The membrane-bound enzymes can be further classified into the buried type (such as receptor proteins), the transmembrane type (such as channels, transporters and ATP synthases) and the membrane adhesion type (such as hydrogenases).

KEGG has a sub-database named BRITE, a collection of hierarchical classifications representing many different types of relationships in biological systems from various aspects. BRITE enables the user to search for any term of interest, including enzymes, in many classifications at a time. EC numbers (IUBMB Enzyme List), RC numbers (KEGG RCLASS) and K numbers (KEGG Orthology; KO) are the three main identifiers that classify enzymes, and all are available in KEGG BRITE. KO is a collection of groups of orthologous genes that are regarded as sharing a common function and the same evolutionary origin; in other words, an orthology group corresponds to a functional unit located in the same place in a reference pathway and a phylogenetic tree. KO entries are generated in the process of genome annotation, and a KO entry in principle corresponds to more than one gene derived from more than one organism. In order to cope with the increasing number of complete genomes, the genome-based annotation is now performed automatically (except for a selected number of reference organisms), with continuous efforts to manually improve the pathway-based, cross-species annotation.

For predictive genomic and metabolomic analyses, it is essential to organize knowledge about the relationships between enzyme structures (including amino acid sequences and 3D structures) and enzyme functions. The classification of amino acid sequences and 3D structures of proteins is performed by both manual annotation and automatic calculation. Both have advantages and disadvantages: the former is generally high in quality but low in speed, and vice versa for the latter. Thus many databases apply large-scale calculations followed by manual inspection. We propose that RCLASS contributes to the large-scale calculation of a reaction classification that efficiently integrates the genomic and chemical spaces.

The strength of our approach lies in the independence of the reaction classification from the classification of enzyme genes, enzyme proteins and enzyme nomenclature. Due to this independence, it has become possible to cover all reactions while taking into account the differences among orthologous proteins in the range of substrate specificity, co-factor requirements, multistep reactions, multi-functional enzymes, etc. For example, the enzymes EC 2.7.1.1 and EC 2.7.1.2 are defined as hexokinase and glucokinase, respectively. The former enzyme takes a broad range of molecules as substrates, catalyzing many reactions (ATP + D-hexose = ADP + D-hexose 6-phosphate). One of them (ATP + D-glucose = ADP + D-glucose 6-phosphate) is catalyzed by the latter, which has a stricter substrate specificity. In KEGG, these two are regarded as the same type of reaction in terms of their RCLASS entries, and are grouped into three orthologue groups: the orthologue group K00844 is assigned to the former, and the two orthologue groups K12407 and K00845 are assigned to the latter. In another example, there are three glyceraldehyde-3-phosphate (GAP) dehydrogenases with different cofactor requirements. EC 1.2.1.12 requires NAD+, EC 1.2.1.13 requires NADP+, and EC 1.2.1.59 can use either NAD+ or NADP+. In KEGG, these three are also regarded as the same type of reaction in terms of the RCLASS entries involved, and are grouped into four orthologue groups: K00134 and K10705 for EC 1.2.1.12, K05298 for EC 1.2.1.13 and K00150 for EC 1.2.1.59.

Many enzymes are multi-functional. In this case, we give multiple EC, R, RP and RC numbers to the corresponding K number. For example, bisphosphoglycerate mutase is given the orthology identifier K01837; three EC numbers, 5.4.2.1, 5.4.2.4 and 3.1.3.13; three R numbers, R01518 (2-phospho-D-glycerate = 3-phospho-D-glycerate), R01662 (3-phospho-D-glyceroyl phosphate = 2,3-bisphospho-D-glycerate) and R01516 (2,3-bisphospho-D-glycerate + H2O = 3-phospho-D-glycerate + orthophosphate); and the corresponding RP and RC numbers. There is another known enzyme named phosphoglycerate mutase, which has a narrower substrate specificity (catalyzing only R01518) and is given the orthology identifier K01834.
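The one-to-many links between a K number and its EC and R numbers can be represented with a small lookup table, as in the Python sketch below. The identifiers are those quoted in the paragraph above; the dictionary layout and the helper names `orthologs_for_reaction` and `is_multifunctional` are assumptions made for illustration, and the EC list of K01834 is left empty because the text does not state it.

```python
from typing import Dict, List

ORTHOLOGY: Dict[str, Dict[str, List[str]]] = {
    "K01837": {  # bisphosphoglycerate mutase
        "ec": ["5.4.2.1", "5.4.2.4", "3.1.3.13"],
        "reactions": ["R01518", "R01662", "R01516"],
    },
    "K01834": {  # phosphoglycerate mutase
        "ec": [],  # not stated in the text, so left empty here
        "reactions": ["R01518"],
    },
}


def orthologs_for_reaction(reaction_id: str) -> List[str]:
    """Return the K numbers linked to a given R number, sorted by id."""
    return sorted(k for k, entry in ORTHOLOGY.items()
                  if reaction_id in entry["reactions"])


def is_multifunctional(k_number: str) -> bool:
    """A K number linked to more than one reaction is treated as multi-functional."""
    return len(ORTHOLOGY[k_number]["reactions"]) > 1


print(orthologs_for_reaction("R01518"))
print(is_multifunctional("K01837"), is_multifunctional("K01834"))
```

Keeping reactions and orthologue groups under separate identifiers lets one reaction point to several K numbers and one K number to several reactions, without forcing either into the other's hierarchy.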

There are many cases where an enzyme is involved in the catalysis of a complex series of reaction steps. For example, fatty acid biosynthesis involves many enzyme complexes; only acetyl-CoA carboxylase is a separate enzyme. To make matters more complicated, the complexes differ depending on taxonomy. The animal-type fatty acid synthase (EC 2.3.1.85) consists of a single polypeptide, which is given the identifier K00665. The fungal type (EC 2.3.1.86) consists of two subunits (K00667 and K00668). The bacterial type is separated into at least two proteins (K11533 and K11628), of which the latter has EC 2.3.1.111 but the former does not have any official EC number. There are many other complicated examples; EC 1.2.7.1 (pyruvate synthase) forms an enzyme complex consisting of four peptides, porA, porB, porD and porG. We gave them the identifiers K00169, K00170, K00171 and K00172, respectively, and linked each to EC 1.2.7.1.

EC numbers classify enzymes by function; therefore a single EC number can contain many different sequences. As a result, some EC numbers have become highly variable in terms of their reaction patterns and sequence families. The former type of EC numbers, catalyzing many different reactions, includes cytochrome P450 (EC 1.14.14.1), glutathione transferase (EC 2.5.1.18), monoamine oxidase (EC 1.4.3.4), enoyl-CoA hydratase (EC 4.2.1.17), alcohol dehydrogenase (EC 1.1.1.1), fatty acid synthase in animals and yeast (EC 2.3.1.85 and EC 2.3.1.86, respectively), aldehyde dehydrogenase (EC 1.2.1.3), PTS enzyme II (EC 2.7.1.69), acyl-CoA dehydrogenase (EC 1.3.99.3) and 3-hydroxyacyl-CoA dehydrogenase (EC 1.1.1.35). The latter type of EC numbers, involving many different orthologues computationally generated from KEGG GENES, includes NADH dehydrogenase (EC 1.6.5.3), ATP synthase (EC 3.6.3.14), DNA polymerase (EC 2.7.7.7), serine/threonine protein kinase (EC 2.7.11.1), peptidylprolyl isomerase (EC 5.2.1.8), PTS enzyme II (EC 2.7.1.69), enoyl-CoA hydratase (EC 4.2.1.17), RNA polymerase (EC 2.7.7.6), DNA-methyltransferase (EC 2.1.1.72), two-component histidine kinase (EC 2.7.13.3) and ubiquitin-conjugating enzyme (EC 6.3.2.19). Many enzymes in the latter group are large protein complexes, members of large paralogue groups, or enzymes acting on macromolecules (such as DNA, RNA and proteins).


Understanding of structure–function relationships of enzymes

The Enzyme List was started before amino acid sequence data began to accumulate. Researchers who find new enzymes are encouraged to contact the IUBMB to report them. When a new enzyme is registered, the required information is mostly reaction based: proposed sub-subclass, accepted name, synonyms, reaction catalyzed, cofactor requirements, brief comment on specificity, other comments and references. Many EC numbers are not used for genome annotation because of the lack of sequence information. Among the 4150 EC numbers, 1454 (35%) do not correspond to any sequence data in either KEGG or UniProt. Some of these EC numbers were determined before the establishment of GenBank, but others were determined afterwards, and their sequence information nevertheless remains unregistered. We suggest that the Enzyme List should include more information about enzyme proteins, such as a sequence database identifier (if available), source organism name and taxonomy identifier (if available). At the same time, there should be a clear distinction between the original EC number given to an experimentally characterized enzyme in a specific organism and the deduced EC numbers given to other organisms based on sequence similarity. This would facilitate stronger links between the genomic and metabolomic information, and greatly enhance the utility of the Enzyme List. The quality of genome and chemical annotation determines the quality of theoretically and experimentally reconstructed biological networks, which in turn contribute to various studies on human health, environmental biology, etc. As the number of published studies containing experimental evidence of new enzymes, metabolites and gene interactions continues to increase, it becomes vital to attach quantitative and qualitative descriptors to genome annotations and to integrate them into existing databases.
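As a rough illustration of the suggestion above, an Enzyme List entry carrying protein-level information and an explicit evidence flag might be sketched as follows. The class and field names, the placeholder EC number "0.0.0.0" and the placeholder sequence identifiers are all hypothetical; they do not reflect any existing IUBMB or KEGG schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Evidence(Enum):
    EXPERIMENTAL = "experimental"  # EC number given to a characterized enzyme
    DEDUCED = "deduced"            # EC number transferred by sequence similarity


@dataclass(frozen=True)
class EnzymeEntry:
    ec_number: str
    evidence: Evidence
    organism: Optional[str] = None      # source organism name, if known
    taxonomy_id: Optional[int] = None   # taxonomy identifier, if available
    sequence_id: Optional[str] = None   # sequence database identifier, if available


def fraction_without_sequence(entries):
    """Fraction of distinct EC numbers with no entry carrying a sequence identifier."""
    all_ecs = {e.ec_number for e in entries}
    with_seq = {e.ec_number for e in entries if e.sequence_id is not None}
    return len(all_ecs - with_seq) / len(all_ecs) if all_ecs else 0.0


# Hypothetical entries with placeholder identifiers, for illustration only.
entries = [
    EnzymeEntry("1.2.7.1", Evidence.EXPERIMENTAL, sequence_id="SEQ_PLACEHOLDER_1"),
    EnzymeEntry("1.2.7.1", Evidence.DEDUCED, sequence_id="SEQ_PLACEHOLDER_2"),
    EnzymeEntry("0.0.0.0", Evidence.EXPERIMENTAL),  # placeholder EC, no sequence
]

# The proportion quoted in the text: 1454 of 4150 EC numbers lack sequence data.
percent_quoted = round(100 * 1454 / 4150)
```

Keeping the evidence flag on each entry, rather than on the EC number itself, lets one EC number hold both the original experimental assignment and any deduced assignments in other organisms.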

Conflict of interest statement

None of the authors have any conflict of interest.

Acknowledgments

The computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University. The KEGG project is supported by the Institute for Bioinformatics Research and Development of the Japan Science and Technology Agency, and a grant-in-aid for scientific research on the priority area ‘Comprehensive Genomics’ from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References

Acker, Michael, Auld, Douglas, 2014. Considerations for the design and reporting of enzyme assays in high-throughput screening applications. Perspect. Sci. 1, 56–73.

Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D.R., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F., Sigrist, C.J.A., Zdobnov, E.M., 2001. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29 (1), 37–40.

Bairoch, A., 2000. The ENZYME database in 2000. Nucleic Acids Res. 28 (1), 304–305.

Bairoch, A., 1991. PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. 19 (Suppl), 2241–2245.

Bairoch, A., Boeckmann, B., 1991. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 19 (Suppl), 2247–2249.

Barker, W.C., Garavelli, J.S., Huang, H., McGarvey, P.B., Orcutt, B.C., Srinivasarao, G.Y., Xiao, C., Yeh, L.L., Ledley, R.S., Janda, J.F., Pfeiffer, F., Mewes, H., Tsugita, A., Wu, C., 1999. The Protein Information Resource (PIR). Nucleic Acids Res. 28 (1), 41–44.

Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., Tasumi, M., 1977. The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112 (3), 535–542.

Carter, P., 1986. Site-directed mutagenesis. Biochem. J. 237 (1), 1–7.

Chen, J., Swamidass, S.J., Dou, Y., Bruand, J., Baldi, P., 2005. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 21 (22), 4133–4139.

Dayhoff, M.O., 1973. Atlas of Protein Sequence and Structure [1972]: Supplement. National Biomedical Research Foundation.

Dry, S., McCarthy, S., Harris, T., 2000. Structural genomics in the biotechnology sector. Nat. Struct. Mol. Biol. 7, 946–949.

Ghirlanda, G., 2008. Old enzymes, new tricks. Nature 453, 164–166.

Goad, W., 1987. Computational tools for using and analyzing DNA sequences. Somatic Cell Mol. Genet. 13 (4), 391–395.

Handelsman, J., 2004. Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev. 68 (4), 669–685.

Hashimoto, K., Tokimatsu, T., Kawano, S., Yoshizawa, A.C., Okuda, S., Goto, S., Kanehisa, M., 2009. Comprehensive analysis of glycosyltransferases in eukaryotic genomes for structural and functional characterization of glycans. Carbohydr. Res. 344, 881–887.

Hashimoto, K., Yoshizawa, A.C., Okuda, S., Kuma, K., Goto, S., Kanehisa, M., 2008. The repertoire of desaturases and elongases reveals fatty acid variations in 56 eukaryotic genomes. J. Lipid Res. 49, 183–191.

Hattori, M., Kotera, M., 2011. Chemoinformatics on metabolic pathways: attaching biochemical information on putative enzymatic reactions. In: Lodhi, H., Yamanishi, Y. (Eds.), Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques. IGI Global, pp. 318–339.

Hattori, M., Okuno, Y., Goto, S., Kanehisa, M., 2003. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J. Am. Chem. Soc. 125, 11853–11865.

Holm, L., Sander, C., 1994. The FSSP database of structurally aligned protein fold families. Nucleic Acids Res. 22 (17), 3600–3609.

Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., Hirakawa, M., 2010. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D355–D360.

Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., Yamanishi, Y., 2008. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 36, D480–D484.

Kanehisa, M., Goto, S., Kawashima, S., Nakaya, A., 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30 (1), 42–46.


Kawano, S., Hashimoto, K., Miyama, T., Goto, S., Kanehisa, M., 2005. Prediction of glycan structures from gene expression data based on glycosyltransferase reactions. Bioinformatics 21, 3976–3982.

Kotera, M., McDonald, A.G., Boyce, S., Tipton, K.F., 2008. Eliciting possible reaction equations and metabolic pathways involving orphan metabolites. J. Chem. Inf. Model. 48 (12), 2335–2349.

Kotera, M., Okuno, Y., Hattori, M., Goto, S., Kanehisa, M., 2004. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J. Am. Chem. Soc. 126, 16487–16498.

Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs, A.R., Lanczycki, C.J., Liebert, C.A., Liu, C., Madej, T., Marchler, G.H., Mazumder, R., Nikolskaya, A.N., Panchenko, A.R., Rao, B.S., Shoemaker, B.A., Simonyan, V., Song, J.S., Thiessen, P.A., Vasudevan, S., Wang, Y., Yamashita, R.A., Yin, J.J., Bryant, S.H., 2002. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 31 (1), 383–387.

Monasterio, Octavio, 2014. Nomenclature for the applications of nuclear magnetic resonance to the study of enzymes. Perspect. Sci. 1, 88–97.

Minowa, Y., Araki, M., Kanehisa, M., 2007. Comprehensive analysis of distinctive polyketide and nonribosomal peptide structural motifs encoded in microbial genomes. J. Mol. Biol. 368, 1500–1517.

Moriya, Y., Shigemizu, D., Hattori, M., Tokimatsu, T., Kotera, M., Goto, S., Kanehisa, M., 2010. PathPred: an enzyme-catalyzed metabolic pathway prediction server. Nucleic Acids Res. 38, W138–W143.

Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247 (4), 536–540.

Nishikawa, K., Ishino, S., Takenaka, H., Norioka, N., Hirai, T., Yao, T., Seto, Y., 1993. Constructing a protein mutant database. Protein Eng. 7 (5), 733.

Oh, M., Yamada, T., Hattori, M., Goto, S., Kanehisa, M., 2007. Systematic analysis of enzyme-catalyzed reaction patterns and prediction of microbial biodegradation pathways. J. Chem. Inf. Model. 47, 1702–1712.

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thornton, J.M., 1997. CATH – a hierarchic classification of protein domain structures. Structure 5 (8), 1093–1109.

Payen, A., Persoz, J.-F., 1833. Mémoire sur la diastase, les principaux produits de ses réactions et leurs applications aux arts industriels. Ann. Chim. Phys. 53, 73–92 (2nd series).

Raes, J., Bork, P., 2008. Molecular eco-systems biology: towards an understanding of community function. Nat. Rev. Microbiol. 6, 693–699.

Riesenfeld, C.S., Schloss, P.D., Handelsman, J., 2004. Metagenomics: genomic analysis of microbial communities. Ann. Rev. Genet. 38, 525–552.

Sanger, F., Tuppy, H., 1951a. The amino-acid sequence in the phenylalanyl chain of insulin. 1. The identification of lower peptides from partial hydrolysates. Biochem. J. 49 (4), 463–481.

Sanger, F., Tuppy, H., 1951b. The amino-acid sequence in the phenylalanyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates. Biochem. J. 49 (4), 481–490.

Schloss, P.D., Handelsman, J., 2003. Biotechnological prospects from metagenomics. Curr. Opin. Biotechnol. 14 (3), 303–310.

Sonnhammer, E.L.L., Eddy, S.R., Durbin, R., 1997. Pfam: a comprehensive database of protein domain families based on seed alignments. PROTEINS: Struct. Funct. Genet. 28, 405–420.

Tipton, K., Boyce, S., 2000. History of the enzyme nomenclature system. Bioinformatics 16 (1), 34–40.

Tringe, S.G., von Mering, C., Kobayashi, A., Salamov, A.A., Chen, K., Chang, H.W., Podar, M., Short, J.M., Mathur, E.J., Detter, J. C., Bork, P., Hugenholtz, P., Rubin, E.M., 2005. Comparative metagenomics of microbial communities. Science 308 (5721), 554–557.

Turnbaugh, P.J., Gordon, J.I., 2008. An invitation to the marriage of metagenomics and metabolomics. Cell 134 (5), 708–713.

Yamanishi, Y., Hattori, M., Kotera, M., Goto, S., Kanehisa, M., 2009. E-zyme: predicting potential EC numbers from the chemical transformation pattern of substrate–product pairs. Bioinformatics 25, i79–i86.