www.elsevier.com/locate/pisc Available online at www.sciencedirect.com
REVIEW
Predictive genomic and metabolomic analysis
for the standardization of enzyme data
$
Masaaki Kotera
n, Susumu Goto, Minoru Kanehisa
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan
Received 2 September 2011; accepted 4 November 2013; Available online 26 March 2014
KEYWORDS
KEGG; Reaction Class (RCLASS);
KEGG Orthology (KO); E-zyme;
PathPred;
Chemical annotation
Abstract
The IUBMB's Enzyme List gives a valuable library of the individual experimental facts on enzyme activities, providing the standard classification and nomenclature of enzymes. Empirical knowl-edge about the relationships between the enzyme protein sequences (or structures) and their functions (the capability of catalyzing chemical reactions) has been accumulating in public literatures and databases. This provides a complementary approach to standardize and organize enzyme data, i.e., predicting the possible enzymes, reactions and metabolites that remain to be identified experimentally. Thus, we suggest the necessity of classifying enzymes based on the evidence and different perspectives obtained from various experimental works. The KEGG (Kyoto Encyclopedia of Genes and Genomes) database describes enzymes from many different viewpoints including; the IUBMB's enzyme nomenclature/classification (EC numbers), the similarity group of enzyme reactions (KEGG Reaction Class; RCLASS) based solely on the chemical structure transformation patterns, and the similarity groups of enzyme genes (KEGG Orthology; KO) based on the orthologous groups that can be mapped to the KEGG PATHWAY and BRITE functional hierarchy. Some unique identifiers were additionally introduced to the KEGG database other than the EC numbers established by IUBMB. R, RP and RC numbers are given to distinguish reactions, reactant pairs and RCLASS, respectively. Genes, including enzyme genes, have their own ID numbers in specific organisms, and they are classified into ortholog groups that are identified by K numbers. In this review, we explain the concept and methodology of this formulation with some concrete example cases. We propose it beneficial to create a standard classification scheme that deals with both experimentally identified and theoretically predicted enzymes.
http://dx.doi.org/10.1016/j.pisc.2014.02.003
2213-0209&2014 The Authors. Published by Elsevier GmbH. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/).
☆This article is part of a special issue entitled“Reporting Enzymology Data– STRENDA Recommendations and Beyond”. Copyright by Beilstein-Institut.
nCorresponding author at: Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152–8550, Japan.
&2014 The Authors. Published by Elsevier GmbH. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/).
Contents
Introduction . . . 25
KEGG RCLASS . . . 26
Metabolic reconstruction/chemical annotation . . . 28
Independent IDs for the enzyme classification in many perspectives . . . 29
Understanding of structure–function relationships of enzymes . . . 31
Conflict of interest statement. . . 31
Acknowledgments . . . 31
References . . . 31
Introduction
Human genome research and subsequent post-genome research
provided both the systematic analyses and wide definition of
the“genome”,i.e., the substances coded by the genome, such
as gene expression, polymorphism and proteome. DNA, RNA and proteins are synthesized based on genetic codes and template-based duplication, transcription and translation. On the other hand, chemical structures of metabolites such as glycans, lipids and terpenoids are not designed by such templates, but by biosynthetic pathways. Kanehisa et al. named the genome and
chemical activity that exists within all species as ‘genomic
space’ and ‘chemical space’ respectively (Kanehisa et al.,
2010). A firm grasp of the‘genomic space’becomes valuable
when screening for disease genes, drug targets,etc., likewise, a
grasp of the‘chemical space’provides insight when screening
imaging probes, drug leadsetc. Thefield of molecular biology
has spread to omics-level research (genomics, proteomics,
metabolomics, etc.), and is continually expanding to study
whole families of organisms. For instance, the next-generation sequencer is expected to be powerful enough to analyze
environmental genomics, also referred to as“metagenomics”
(Schloss and Handelsman, 2003;Handelsman, 2004;Riesenfeld et al., 2004; Tringe et al., 2005). Similarly, high-throughput mass spectrometry and NMR enable the user to study metabo-lomics at a family, order or class level, which can be referred to
as “meta-metabolomics” (Raes and Bork, 2008; Turnbaugh
and Gordon, 2008; Auld and Acker, 2014; Monasterio, 2014).
Genome analysis has become routine, and individual repositories of genes are being constructed for all known living organisms. Conversely, repositories of the chemical substances that exist in, or affect individual living organisms are in their infant stages and are not well established; much less is known about the interrelationships that exist between the genomic and chemical spaces. To bridge this gap, it becomes essential to establish robust methodology to pre-dict chemical substances from genomic data and vice versa. Enzymes are the important bridge between the genome and
chemical biosynthesis. An enzyme, amylase, wasfirst identified
in 1833 byPayen and Persoz (1833). At that time, it was not
known that many enzymes are made of proteins. It was in 1926 when Sumner showed that an enzyme, urease, is in fact a
protein (for this work he won the 1946 Nobel Prize in
Chemistry). Sanger and Tuppy (1951a, 1951b) published a
method to determine amino-acid sequences in 1951. After
that, many more enzymes were identified, and there arose the
need for systematic enzyme nomenclature. International Union of Biochemistry and Molecular Biology (IUBMB) established the
Enzyme List in 1961 for this exact purpose (Tipton and Boyce,
2000). This was before the establishment of the Atlas of Protein
Sequence and Structure in 1972 (Dayhoff, 1972) and the
prototype of the GenBank database in 1979 (Goad, 1987). It
has now become relatively easy to obtain nucleic acid sequences, and it has become mandatory to determine nucleic acid or amino acid sequences for an enzyme and register them in the GenBank database prior to publishing an original paper discussing said enzyme.
Since then, information on genes, protein sequences and structures have been proliferating, creating huge databases that are connected worldwide, such as the amino acid sequence
databases PIR (Protein Information Resources) (Barker et al.,
1999), Swiss-Prot (Bairoch and Boeckmann, 1991), Entrez
Protein (Marchler-Bauer et al., 2002), and PRF (Nishikawa
et al., 1993), and the protein function databases PROSITE (Bairoch, 1991), Pfam (Sonnhammer et al., 1997), InterPro (Apweiler et al., 2001), GenomeNet Motif (Kanehisa et al.,
2002) and ExPASy ENZYME (Bairoch, 2000), and the protein
structure databases PDB (Bernstein et al., 1977), SCOP (Murzin
et al., 1995), CATH (Orengo et al., 1997), FSSP (Holm and Sander, 1994), and the integrated databases at NCBI (National Center of Biotechnology Information), EBI (European Bioinfor-matics Institute), SIB (Swiss Institute of BioinforBioinfor-matics), and GenomeNet. Due to the recent successful development of high-throughput measurement techniques, the rate of biological data accumulation has become even faster, vastly exceeding the knowledge capacity of the human mind.
The IUBMB's Enzyme List (EC numbers) classifies enzymes
based on published experimental data and provides extremely useful information regarding experimental evidence. The
Enzyme List classifies enzymes hierarchically; where up to
the sub-subclass (the third number) is a systematic classifi
ca-tion of enzyme-catalyzed reacca-tions. The fourth number of the Enzyme List is a serial number given to an experimentally observed (and published) enzyme with details of the reaction
number record is linked to the PubMed ID, enabling easy access to the original paper. There are currently two types of
EC numbers; official EC numbers and unofficial EC numbers.
The first is the representation of biochemical knowledge
organized by the IUBMB–IUPAC Biochemical Nomenclature
Committee. The second is for genome annotation to identify enzyme genes (and enzymes), which are not organized by the Biochemical Nomenclature Committee, but by the annotators
of databases including KEGG (Kanehisa et al., 2010), based on
sequence similarity. KEGG once used EC numbers as primary
identifiers of enzymes, but not anymore, due to reasons that
will be discussed later.
Enzyme functions are highly dependent on the enzyme's protein structures. Like any other proteins, enzymes are also synthesized in the ribosome using the nucleic acid sequences of genes as their templates, therefore their structures are the products of evolution. Evolutionally close enzymes have similar motifs, and form a group of enzymes. In homologous proteins, even if the proteins are not similar as a whole, the regions of common functions or structural
restrictions, motifs and specific functions all tend to be
preserved well. Some empirical knowledge has been becom-ing clear through the development of structural biology and site-directed mutagenesis. The site-directed mutagenesis studies have been performed since 1980s to change enzyme
functions (Carter, 1986), through a trial and error process.
Because a proteins X-ray crystal structure is still difficult to
stably obtain, there have been many attempts to predict enzyme structure and function from amino acid sequences. The prediction of enzyme function is based on empirical knowledge stored in databases rather than through physical and chemical theory and calculation. Since the end of the 20th century, many genomes from various species have been
profiled, enabling their comparison, and the comparison of
inter-species enzymes to find conserved and non-conserved
regions. This gives a hint as to which amino acid residues undergo evolutionary change. Thanks to improvements in computational ability and data storage, it is becoming possible to design protein structures from amino acid sequences and to predict their function through computer simulation and bioin-formatics tools. There have been some successful attempts to manipulate enzyme functions; in 2008 Ghirlanda et al. pub-lished on the computational enzyme design of Kemp elimination
catalysts (Ghirlanda, 2008). Nevertheless, manipulation of
enzyme function still remains a trial and error process. Although
some examples of the structure–function relationships of
enzymes are known, it is only limited to a small set of enzymes. We have been tackling the genome-scale metabolic path-way reconstruction problem, constructing a systematic assembly and organization of all the metabolism of a given organism, while focusing on the relationships between genomic and chemical spaces. Through this process, the KEGG Orthology (KO) database was created, mapping ortho-logue gene groups that at the same place in regard to biological pathway and functional hierarchy, offering stra-tegies to predict the possible set of chemical structures
from the genomic information of a given organism. Atfirst,
we accumulated knowledge about glycosyltransferase
groups for the post-translational modification of proteins.
These enzyme genes were grouped based on the orthology
and substrate specificity, and were successfully applied to
predict the possible glycan structures for a given genome or
transcriptome (Hashimoto et al., 2009; Kawano et al.,
2005). The same strategy was applied to polyketides,
non-ribosomal peptides and polyunsaturated fatty acids (Minowa
et al., 2007;Hashimoto et al., 2008).
This strategy was successful because the orthology and
the substrate specificity correlated well. In order for this
strategy to be applied to other enzyme groups, it is
necessary to organize the enzyme classification so that
enzyme function can be deduced from protein sequences, or vice versa, for a wide range of enzymes. We recently established the KEGG Reaction Class (RCLASS) based on the similarity of reactions in terms of the reaction motifs or the RDM chemical transformation patterns. RCLASS enables users to develop a method to predict what kind of enzymes can catalyze putative reactions, and also a method to predict the metabolic fate of given compounds. We plan to use these two methodologies to connect chemical and genomic spaces, whilst simultaneously continuing work to
refine enzyme classifications for predictive genomic and
metabolomic analyses. This would add a genomic perspec-tive to the enzyme data standardization, enhancing the utility of IUBMB's Enzyme List.
KEGG RCLASS
Our primary goal in the development of RCLASS is to extend
the EC classification so that it also covers putative reactions
that are not yet well characterized. High-throughput mea-surement techniques hint at the existence of considerable
numbers of orphan metabolites, i.e., compounds that are
known to be present in living organisms but whose
syn-thetic/degradation pathways are unknown (Kotera et al.,
2008). In order to identify the enzyme proteins involved in
these pathways, it is essential to characterize or classify the putative reaction equations that are often incomplete. In
principle, the official EC numbers cannot be used for this
purpose because their assignment requires confirmed
experimental evidence of enzyme activity and a complete reaction equation. In order to describe the relationships between putative reactions and putative enzyme proteins
(or genes), it is essential to develop an enzyme classification
scheme that is applicable not only for the confirmed
reactions with complete equations, but also for the putative reactions, even if the equations are incomplete.
Finding possible enzyme reactions from metabolomic data naturally starts with a pair of compounds (which we
refer to as a“reactant pair”) corresponding to a reaction
equation, not always a complete reaction equation (Kotera
et al., 2004). Possible chemical transformation within the compounds can be obtained by comparing the two chemical structures. Technically, chemical compounds are repre-sented as graph structures, where the edges represent chemical bonds, and the nodes represent atoms attached with functional group information. In order to distinguish
functional groups and microenvironments of atoms, five
atom species (C, N, O, S and P) are classified into the 68
KEGG atom types (Hattori et al., 2003) (such as“N1a”for an
amino group inFigure 1). As a result of graph comparison,
the matched subgraph corresponds to the conserved atom group under the enzymatic reaction, and the unmatched sub-graph of each compound corresponds to the eliminated
or the added atom groups. The boundary area between the conserved and the non-conserved sub-graphs can be regarded as the reaction center on which the putative enzyme acts. In such a way, the RDM chemical transforma-tion patterns are extracted from a reactant pair in the
computational manner (Kotera et al., 2004; Hattori and
Kotera, 2011). The RDM pattern is represented with a string of the KEGG Atom Types, and describes a chemical bond that is generated or eliminated in a reaction.
We defined the RCLASS entries that represent a set of
chemical transformations found in a Substrate–product pair
(reactant pair). Each RCLASS entry was given identification
numbers (RC numbers). An RCLASS entry may consist of multi-ple RDM patterns when more than one chemical bond is generated or eliminated. So far 2345 RCLASS entries were
defined out of 38,241 RPAIR entries in the KEGG database
(August 2011). RCLASS entries have graphics representing the
common chemical transformations that occur in a defined set of
reactant pairs (Figure 2), where reaction centers and their
vicinities are emphasized in the KEGG by atom types and colors. The directions are decided according to the alphabetical order of the RDM patterns, and the orientations of the chemical structures are decided manually so that the similar RCLASS graphics are drawn in the same orientation whenever possible. Therefore it has become easier for the user to understand the chemical structure transformation, as well as to compare
different reaction types. RCLASS classifies reactions based
solely on chemical transformation of reactions on metabolic pathways and are independent from any other information such
as the range of substrate specificity and amino acid sequence.
The relationships among many instances related to enzymes
are as follows. The basic information on these classifications is
taken from the IUBMB enzyme list (EC numbers). Reactions taken from the IUBMB enzyme list and other literatures are
given identification numbers (R numbers) and are stored in
KEGG REACTION followed by the addition of confirmed source
organisms information, pathway information, and orthologue
groups of enzyme genes. Substrate–product pairs (reactant
pairs) are defined for enzyme reactions (Figure 3) and are
stored in the RPAIR database, together with the calculation of the RDM chemical structure transformation patterns. In gen-eral, a reaction (R numbers) consists of multiple reactant pairs (RP numbers).
Figure 2 Reactant pairs and reaction classes.
Figure 3 Reaction equations and reactant pairs.
Metabolic reconstruction/chemical annotation
RCLASS is proposed to be beneficial in linking metabolomics
to genomics, as well as to analyze the conserved consecu-tive reaction patterns in the evolution of metabolic path-ways. We surveyed the frequently appearing RDM patterns
specific for the 11 categories of KEGG metabolic pathways,
and then discovered some specific patterns within the
categories, especially biodegradation pathways, and thus developed a method to predict biodegradation pathway by
bacteria (Oh et al., 2007). Such a method for predicting
metabolic fate is based on the extraction of biological meaning from chemical structure, which is referred to as
chemical annotation (Dry et al., 2000;Chen et al., 2005;
Kanehisa et al., 2008).
Metabolic network reconstruction and annotation can be
classified into three ideal and hierarchically ranked sets of
conditions; if thefirst conditions can be accomplished, then the
second and third ones are not required. Similarly, if the second
set of conditions can be achieved, then the third
is not needed, though thefirst would then need to be revisited.
Thefirst conditions specify that when a metabolic pathway is
well characterized with experimentally confirmed enzymes and
reactions in at least one organism, genome-based and pathway-based annotations are applicable. The second conditions stipu-late that when certain reactions are known to take place but
the enzymes in charge of these reactions are not identified in
any organisms, prediction of the enzymes from proposed
reactions is necessary in order to achieve thefirst conditional
set. The third condition is when some metabolites are known to exist but the reactions producing or degrading them are not
identified, then predictions of these reactions are necessary to
move back to the second conditional stepsetc.
We defined reference pathways to cope with thefirst set of
annotation conditions and designed the KEGG PATHWAY and
BRITE so that they generally do not focus on a specific
organism, but are designed in a general way to be applicable
to all organisms. Reference pathways are defined as the
combined pathways that are present in a number of organisms and there exists a consensus among many published papers.
Figure 4 describes the difference between a species-specific pathway and a reference pathway, and the relationships among various IDs. In the reference pathway, rectangles and circles represent gene products (mostly proteins) and other
molecules (mostly metabolites), respectively. This graphic is one of the reference pathways for which no organism has been
specified. When the user selects to view a reference pathway,
the colored rectangles indicate the links to the corresponding
orthologue (KO) entries, enzyme classifications or reactions.
When the user specifies an organism, the colored rectangles
indicate the links to the corresponding KEGG GENE pages,
which indicates the specified organism possesses the
corre-sponding genes or proteins in the genome. White rectangles indicate that there are no genes annotated to the correspond-ing function. Note that this does not necessarily mean the organism does not really have the corresponding genes. It is
possible that the corresponding genes have not been identified
yet. Manually defined KO entries (groups of orthologous genes)
are the basic components of the systems information, i.e.,
PATHWAY network diagrams and BRITE functional classifi
ca-tions. Continuous refinement of reference pathways and
orthologue information is the key to maintain the quality of this procedure.
We designed the E-zyme tool (Kotera et al., 2004) in
response to the second set of annotation conditions, the practical situation where the user wants to identify enzymes (enzyme genes, proteins or reaction mechanisms) from only a partial reaction equation. The user can input any compound
pairs, and obtain the candidate EC classifications, generating a
‘clue’to identify the enzyme genes or proteins. This needs the
library of the RDM chemical transformation patterns calculated in advance, which is compared with the query transformation
pattern, resulting in a list of possible EC classifications with
specific scores. Recently, we have done a significant
improve-ment in this E-zyme, where a more complicated voting scheme
and EC-RDM profile based scoring system is applied to achieve
higher coverage with a higher accuracy rate (Yamanishi et al.,
2009).
The third set of annotation conditions, where the user obtains a chemical structure of a metabolite for which the biosynthesis/biodegradation pathway is unknown, has also been
tackled using RDM patterns (Oh et al., 2007), as an extension of
the E-zyme approach. We recently developed a new web-based
server named PathPred (Moriya et al., 2010) for predicting the
metabolic fate of a given chemical compound, based on the conserved RCLASS depending on the types of pathways. This server provides plausible reactions and transformed com-pounds, and displays all predicted reaction pathways in a
tree-shaped graph (Figure 5a). The suggested pathway includes
the steps with the plausible EC numbers, which are predicted
by E-zyme (Figure 5b). The user can choose the type of pathway
according to their purpose, the biodegradation of xenobiotics in bacteria and the biosynthesis of secondary metabolites in plants, which utilizes different characteristic subsets of the
RDM patterns. In thefirst step, the query compound structure is
compared with those in the selected metabolic category. In the second step, possible RDM patterns on the query compound are selected from the RDM pattern library based on the structurally similar compounds containing the corresponding RDM patterns
with the use of the SIMCOMP program (Hattori et al., 2003, 9).
The third step is to obtain the plausible products according to the selected RDM patterns. The generated products become the next query compound and the prediction is iterated if possible.
Optionally, if already known, the final compound in the
biodegradation or the initial compound in the biosynthesis can
be specified, (bi-directional prediction).
Independent IDs for the enzyme classi
fi
cation
in many perspectives
As an expansion of our study to reconstruct metabolic pathway based on chemical structures, we have been trying to predict accompanying genes for predicted reactions based on the relationships between metabolite chemical structures and protein sequences. The key to archive this is
the classification of enzymes from both genomic and
meta-bolomic points of view.
There are many ways to classify enzymes. Enzymes in the
IUBMB's Enzyme List are systematically classified according
to the chemical structures of their substrates and products,
and co-factors, as well as reaction selectivity and substrate
specificity, which are inalienably related toEnzyme
Nomen-clature. Enzymes can also be classified based on enzyme proteins, such as the amino acid sequences and the 3D structure of proteins. Other factors that can group enzymes
include the location in the pathway (i.e., biological
func-tions), and the location of the cells. Enzymes are classified
into membrane-bound enzymes and soluble enzymes. The
membrane-bound enzymes can be further classified into
buried type (such as receptor proteins), transmembrane type (such as channel, transporter, ATP syntheses) and membrane adhesion type (such as hydrogenases).
KEGG has a sub-database named BRITE, the collection of
hierarchical classifications representing many different types
of relationships in biological systems according to various aspects. BRITE enables the user to search any terms of interest,
including enzymes, from many classifications at a time. EC
numbers (IUBMB Enzyme List), RC numbers (KEGG RCLASS) and
K numbers (KEGG Orthology; KO) are the three main identifiers
that classify enzymes, and all are available in KEGG BRITE. KO is a collection of the groups of orthologous genes that are regarded to share common function and the same evolutional origin, in other words, an orthology corresponds to a functional unit located in the same place in a reference pathway and phylogenetic tree. KO entries are generated in the process of genome annotation, and a KO entry in principle corresponds to more than one gene derived from more than one organism. In order to cope with an increasing number of complete genomes, the genome-based annota-tion is now automatically performed (except for a selected number of reference organisms) with continuous efforts to
manually improve the pathway-based, cross-species
annotation.
For predictive genomic and metabolomic analyses, it is essential to organize knowledge about the relationships
between enzyme structures (including amino acid
sequences, and 3D structures) and enzyme functions. The process to classify amino acid sequences and 3D structures of proteins is performed by both manual annotations and automatic calculations. Both ways have advantages and disadvantages: the former is generally high in quality but
low in speed, and vise versa for the latter. Thus many
databases apply the large-scale calculations followed by manual inspections. We propose that RCLASS contributes to
the large-scale calculation of reaction classification that
efficiently integrates genomic and chemical spaces.
The strength of our approach lies on the independence of
reaction classification from the classification of enzyme genes,
enzyme proteins and enzyme nomenclature. Due to this independence, it has become possible to cover all reactions by considering the differences among orthologous proteins in
the range of substrate specificity, co-factor requirements,
multistep reactions, multi-functional enzymesetc. For
exam-ple, the enzymes EC 2.7.1.1 and EC 2.7.1.2 are defined as
hexokinase and glucokinase, respectively. The former enzyme takes a broad range of molecules as substrates, catalyzing
many reactions (ATP+D-hexose=ADP+D-hexose 6-phosphate).
One of them (ATP+D-glucose=ADP+D-glucose 6-phosphate) is
catalyzed by the latter, with stricter substrate specificity.
In KEGG, these two are regarded as the same type of reaction in terms of their RCLASS entries, and are grouped into three orthologue groups: an orthologue group K00844 is
assigned to the former, and two orthologue groups K12407 and K00845 are assigned to the latter. In another example, there are three glyceraldehyde-3-phosphate (GAP)
dehy-drogenases with different cofactor requirements. EC
1.2.1.12 requires NAD+, EC 1.2.1.13 requires NADP+, and
EC 1.2.1.59 requires either NAD(P)+. In KEGG, these three
are also regarded as the same type of reaction in terms of the RCLASS entries involved, and are grouped into four orthologue groups: K00134 and K10705 for EC 1.2.1.12, K05298 for EC 1.2.1.13 and K00150 for EC 1.2.1.59.
Many enzymes are multi-functional. In this case, we give multiple EC, R, RP and RC numbers to the corresponding K number. For example, bisphosphoglycerate mutase is given an orthology K01837, three EC numbers 5.4.2.1, 5.4.2.4 and 3.1.3.13, three R numbers R01518
(2-phospho-D-glycerate=3-phospho-D-glycerate), R01662 (3-phospho-D
-glycerol phosphate=2,3-bisphospho-D-glycerate) and R01516
(2,3-bisphospho-D-glycerate+H2O=3-phospho-D-glycerate+
orthophosphate) and the corresponding RP and RC numbers. There is another known enzyme named phosphoglycerate
mutase, which has narrower substrate specificity (only
catalyz-ing R01518), which is given orthology identification K01834.
There are many cases where an enzyme is involved the catalysis of a complex series of reaction steps. For example, fatty acid biosynthesis contains many enzyme complexes, only acetyl CoA carboxylase is a separate enzyme. To make matters more complicated, the complexes are different dependent on taxonomy. Animal-type fatty acid synthase (EC 2.3.1.85)
con-sists of a polypeptide, given identification K00665. Fungi type
(EC 2.3.1.86) consists of two subunits (K00667 and K00668). Bacterial type is separated into at least two proteins (K11533 and K11628), of which the latter has EC 2.3.1.111 but the
former does not have any official EC number. There are many
other complicated examples; EC 1.2.7.1 (pyruvate synthase) forms an enzyme complex consisting of four peptides porA,
porB, porD and porG. We gave them identifiers K00169, K00170,
K00171 and K00172, respectively, and link each to EC 1.2.7.1. EC numbers classify enzymes by function; therefore they contain many different sequences. As a result, some EC numbers have become highly variable in terms of their reaction patterns and sequence families. The former type of EC numbers, catalyzing many different reactions, include cytochrome P450 (EC 1.14.14.1), glutathionine transferase (EC 2.5.1.18), monoamine oxidase (EC 1.4.3.4), enoyl-CoA
hydratase (EC 4.2.1.17), alcohol dehydrogenase (EC
1.1.1.1), fatty acid synthase in animal and yeast (EC 2.3.1.85 and 86, respectively), aldehyde dehydrogenase (EC 1.2.1.3), PTS enzyme II (EC 2.7.1.69), acyl-CoA dehy-drogenase (EC 1.3.99.3) and 3-hydroxyacyl-CoA dehydro-genase (EC 1.1.1.35). The latter type of EC numbers, involving many different orthologues computationally gen-erated from KEGG GENES, include NADH dehydrogenase (EC 1.6.5.3), ATP synthase (EC 3.6.3.14), DNA polymerase (EC 2.7.7.7), serine/threonine protein kinase (EC 2.7.11.1), peptidylprolyl isomerase (EC 5.2.1.8), PTS enzyme II (2.7.1.69), enoyl-CoA hydratase (EC 4.2.1.17), RNA poly-merase (EC 2.7.7.6), DNA-methyltransferase (2.1.1.72), two-component histidine kinase (2.7.13.3), and ubiquitin-conjugating enzyme (EC 6.3.2.19). Many enzymes in the latter cases are large protein complexes, or large paralogue groups, and enzymes acting on macromolecules (such as DNA, RNA and proteins).
Understanding of structure
–
function
relationships of enzymes
The Enzyme List began before the accumulation of amino acid
sequence data. Researchers who find new enzymes are
encouraged to contact to IUBMB to report them. When registering new enzymes, required information is mostly reaction based, such as proposed sub-subclass, accepted name, synonyms, reaction catalyzed, co-factor requirements,
brief comment on specificity, other comments and references.
There are many EC numbers that are not used for genome annotation because of the lack of sequence information. Among the 4150 EC numbers, 1454 (35%) do not correspond to any sequence data in KEGG nor UniProt. Some of these EC numbers were determined before the establishment of Gen-Bank, but other EC numbers were determined after that, although the sequence information remains unregistered for some reason. We suggest that the Enzyme List should include more information about enzyme proteins, such as a sequence
database identifier (if any is available), source organism name
and taxonomy identifier (if any is available). At the same time,
there should be a clear distinction between the original EC number given to an experimentally characterized enzyme in a
specific organism and the deduced EC numbers given to other
organisms based on sequence similarity. This would facilitate stronger links between the genomic and metabolomic infor-mation, and greatly enhance the utility of the Enzyme List. The quality of genome and chemical annotation determines the quality of theoretically and experimentally reconstructed biological networks, which in turn contribute to various studies
on human health, environmental biology,etc. As the number
of published experimental evidences increases, it is becoming more important to attach quantitative and qualitative descrip-tors to genome annotations and databases. As the number of published studies containing experimental evidence of new enzymes, metabolites and gene interactions is continually increasing, it becomes vital to attach quantitative and qualitative descriptors to genome annotations and integrate them into existing databases
Con
fl
ict of interest statement
None of the authors have any conflict of interest.
Acknowledgments
The computational resources were provided by the Bioinfor-matics Center, Institute for Chemical Research, Kyoto University. The KEGG project is supported by the Institute for Bioinformatics Research and Development of the Japan
Science and Technology Agency, and a grant-in-aid for scientific
research on the priority area‘Comprehensive Genomics’from
the Ministry of Education, Culture, Sports, Science and Techno-logy of Japan.
References
Acker, Michael, Auld, Douglas, 2014. Considerations for the design and reporting of enzyme assays in high-throughput screening applications. Perspect. Sci. 1, 56–73.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D.R., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F., Sigrist, C.J.A., Zdobnov, E.M., 2001. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29 (1), 37–40.
Bairoch, A., 2000. The ENZYME database in 2000. Nucleic Acids Res. 28 (1), 304–305.
Bairoch, A., 1991. PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. 19 (Suppl), 2241–2245.
Bairoch, A., Boeckmann, B., 1991. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 19 (Suppl), 2247–2249. Barker, W.C., Garavelli, J.S., Huang, H., McGarvey, P.B., Orcutt, B.
C., Srinivasarao, G.Y., Xiao, C., Yeh, L.L., Ledley, R.S., Janda, J. F., Pfeiffer, F., Mewes, H., Tsugita, A., Wu, C., 1999. The Protein Information Resource (PIR). Nucleic Acids Res. 28 (1), 41–44. Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F., Brice,
M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., Tasumi, M., 1977. The protein data bank: a computer-based archivalfile for macromolecular structures. J. Mol. Biol. 112 (3), 535–542. Carter, P., 1986. Site-directed mutagenesis. Biochem. J. 237 (1),
1–7.
Chen, J., Swamidass, S.J., Dou, Y., Bruand, J., Baldi, P., 2005. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 21 (22), 4133–4139. Dayhoff, M.O., 1973. Atlas of Protein Sequence and Structure [1972]: Supplement. National Biomedical Research Foundation. Dry, S., McCarthy, S., Harris, T., 2000. Structural genomics in the
biotechnology sector. Nat. Struct. Mol. Biol. 7, 946–949. Ghirlanda, G., 2008. Old enzymes, new tricks. Nature 453, 164–166. Goad, W., 1987. Computational tools for using and analyzing DNA
sequences. Somatic Cell Mol. Genet. 13 (4), 391–395.
Handelsman, J., 2004. Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev. 68 (4), 669–685.
Hashimoto, K., Tokimatsu, T., Kawano, S., Yoshizawa, A.C., Okuda, S., Goto, S., Kanehisa, M., 2009. Comprehensive analysis of glycosyl-transferases in eukaryotic genomes for structural and functional characterization of glycans. Carbohydr. Res. 344, 881–887. Hashimoto, K., Yoshizawa, A.C., Okuda, S., Kuma, K., Goto, S.,
Kanehisa, M., 2008. The repertoire of desaturases and elongases reveals fatty acid variations in 56 eukaryotic genomes. J. Lipid Res. 49, 183–191.
Hattori, M., Kotera, M., 2011. Chemoinformatics on metabolic pathways: attaching biochemical information on putative enzy-matic reactions. In: Lodhi, H., Yamanishi, Y. (Eds.), Chemoinfor-matics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques. IGI Glo-bal, pp. 318–339.
Hattori, M., Okuno, Y., Goto, S., Kanehisa, M., 2003. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J. Am. Chem. Soc. 125, 11853–11865.
Holm, L., Sander, C., 1994. The FSSP database of structurally aligned protein fold families. Nucleic Acids Res. 22 (17), 3600–3609.
Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., Hirakawa, M., 2010. KEGG for representation and analysis of molecular net-works involving diseases and drugs. Nucleic Acids Res. 38, D355–D360.
Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., Yamanishi, Y., 2008. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 36, D480–D484.
Kanehisa, M., Goto, S., Kawashima, S., Nakaya, A., 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30 (1), 42–46.
Kawano, S., Hashimoto, K., Miyama, T., Goto, S., Kanehisa, M., 2005. Prediction of glycan structures from gene expression data based on glycosyltransferase reactions. Bioinformatics 21, 3976–3982. Kotera, M., McDonald, A.G., Boyce, S., Tipton, K.F., 2008. Eliciting
possible reaction equations and metabolic pathways involving orphan metabolites. J. Chem. Inf. Model. 48 (12), 2335–2349. Kotera, M., Okuno, Y., Hattori, M., Goto, S., Kanehisa, M., 2004.
Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J. Am. Chem. Soc. 126, 16487–16498.
Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs, A. R., Lanczycki, C.J., Liebert, C.A., Liu, C., Madej, T., Marchler, G.H., Mazumder, R., Nikolskaya, A.N., Panchenko, A.R., Rao, B. S., Shoemaker, B.A., Simonyan, V., Song, J.S., Thiessen, P.A., Vasudevan, S., Wang, Y., Yamashita, R.A., Yin, J.J., Bryant, S. H., 2002. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 31 (1), 383–387.
Monasterio, Octavio, 2014. Nomenclature for the applications of nuclear magnetic resonance to the study of enzymes. Perspect. Sci. 1, 88–97.
Minowa, Y., Araki, M., Kanehisa, M., 2007. Comprehensive analysis of distinctive polyketide and nonribosomal peptide structural motifs encoded in microbial genomes. J. Mol. Biol. 368, 1500–1517.
Moriya, Y., Shigemizu, D., Hattori, M., Tokimatsu, T., Kotera, M., Goto, S., Kanehisa, M., 2010. PathPred: an enzyme-catalyzed metabolic pathway prediction server. Nucleic Acids Res. 38, W138–W143.
Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 1995. SCOP: a structural classification of proteins database for the investiga-tion of sequences and structures. J. Mol. Biol. 247 (4), 536–540. Nishikawa, K., Ishino, S., Takenaka, H., Norioka, N., Hirai, T., Yao, T., Seto, Y., 1993. Constructing a protein mutant database. Protein Eng. 7 (5), 733.
Oh, M., Yamada, T., Hattori, M., Goto, S., Kanehisa, M., 2007. Systematic analysis of enzyme-catalyzed reaction patterns and
prediction of microbial biodegradation pathways. J. Chem. Inf. Model. 47, 1702–1712.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thornton, J.M., 1997. CATH – a hierarchic classification of protein domain structures. Structure 5 (8), 1093–1109. Payen, A., Persoz, J.-F., 1833. Mémoire sur la diastase, les
principaux produits de ses réactions et leurs applications aux arts industriels. Ann. Chim. Phys. 53, 73–92(2nd series). Raes, J., Bork, P., 2008. Molecular eco-systems biology: towards an
understanding of community function. Nat. Rev. Microbiol. 6, 693–699.
Riesenfeld, C.S., Schloss, P.D., Handelsman, J., 2004. METAGE-NOMICS: genomic analysis of microbial communities. Ann. Rev. Genet. 38, 525–552.
Sanger, F., Tuppy, H., 1951a. The amino-acid sequence in the phenylalanyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochem. J. 49 (4), 463–481. Sanger, F., Tuppy, H., 1951b. The amino-acid sequence in the phenylalanyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates. Biochem. J. 49 (4), 481–490. Schloss, P.D., Handelsman, J., 2003. Biotechnological prospects
from metagenomics. Curr. Opin. Biotechnol. 14 (3), 303–310. Sonnhammer, E.L.L., Eddy, S.R., Durbin, R., 1997. Pfam: a
com-prehensive database of protein domain families based on seed alignments. PROTEINS: Struct. Funct. Genet. 28, 405–420. Tipton, K., Boyce, S., 2000. History of the enzyme nomenclature
system. Bioinformatics 16 (1), 34–40.
Tringe, S.G., von Mering, C., Kobayashi, A., Salamov, A.A., Chen, K., Chang, H.W., Podar, M., Short, J.M., Mathur, E.J., Detter, J. C., Bork, P., Hugenholtz, P., Rubin, E.M., 2005. Comparative metagenomics of microbial communities. Science 308 (5721), 554–557.
Turnbaugh, P.J., Gordon, J.I., 2008. An invitation to the marriage of metagenomics and metabolomics. Cell 134 (5), 708–713. Yamanishi, Y., Hattori, M., Kotera, M., Goto, S., Kanehisa, M., 2009.
E-zyme: predicting potential EC numbers from the chemical transformation pattern of substrate–product pairs. Bioinfor-matics 25, i79–i86.