3.4 Protein Properties Detection
3.4.1 Molecular Function
Understanding the role of mutations, in particular their contribution to diseases like cancer, requires identifying their impact on molecular functions. Causative mutations can drive cancers by activating or inactivating a protein function.
They can promote cancer progression through resistance to drugs or, according to a recent study, through switching of
functions [RAS11]. Consider the following example on the structural and
L145Q and V157F are classified as β-sandwich mutants and are located on
β-strands S3 and S4, respectively (Figure 1B). Nearly a third of the cancer
mutations are in the β-sandwich region, but individual mutations occur at
a much lower rate than hot-spot mutations.12 β sandwich mutants retain
sequence-specific DNA binding at lower temperatures, but many are not
functional at physiological temperature.a
aExcerpt from Structural effects of the L145Q, V157F, and R282W cancer-
associated mutations in the p53 DNA-binding core domain, PMID: 21561095
The above example expresses the impact of the two mutations, L145Q and V157F, on the molecular function sequence-specific DNA binding (Gene Ontology ID: 0043565).
Detecting the functional impact of mutations has drawn attention not only in cancer research but also in resequencing efforts.
Consider another study example on the functions and structures of
Peroxiredoxins9 and their mutants:
The C83S mutant exhibited similar peroxidase activity to the wild type,
which is exclusively dimeric, in the Trx/Trx reductase system.a
aExcerpt from Dimer-oligomer interconversion of wild-type and mutant
rat 2-Cys peroxiredoxin: disulfide formation at dimer-dimer interfaces is not essential for decamerization, PMID: 17974571
The above example shows that the C83S mutation did not affect the
peroxidase activity (GO ID: 0004601) of the protein. To detect molecular
functions, we use the concepts presented by the Gene Ontology [ABB+00].
We generate an RDF representation of molecular functions from a download of the Gene Ontology. The Gene Ontology is provided in OBO-XML
format, where each node is one entry (Figure 12). We first check that the entry belongs to the molecular function namespace, then extract its name, GO ID, and synonyms.
Using this information, we generate our RDF file. So that further information can be retrieved, molecular functions are
identified by their Gene Ontology ID (Figure 13). Each triple has the form C1 rdfs:subClassOf C2, where rdfs:subClassOf is an instance of
rdf:Property and states that C1, here the Gene Ontology ID, is an instance of rdfs:Class and a subclass of C2, the rdfs:Class instance “molecular function”. The resulting RDF is used for gazetteering.
<term>
  <id>GO:0000014</id>
  <name>single-stranded DNA specific endodeoxyribonuclease activity</name>
  <namespace>molecular_function</namespace>
  <def>
    <defstr>Catalysis of the hydrolysis of ester linkages within a single-stranded deoxyribonucleic acid molecule</defstr>
    <dbxref>
      <acc>mah</acc>
      <dbname>GOC</dbname>
    </dbxref>
  </def>
  <synonym scope="exact">
    <synonym_text>ssDNA-specific endodeoxyribonuclease activity</synonym_text>
    <dbxref>
      <acc>mah</acc>
      <dbname>GOC</dbname>
    </dbxref>
  </synonym>
  <is_a>GO:0004520</is_a>
</term>
Figure 12: Example for a Molecular Function encoded in OBO-XML format: single-stranded DNA specific endodeoxyribonuclease activity (GO ID:0000014)
<rdf:Description rdf:about="http://www.semanticsoftware.info/molecular_function#GO:0000014">
  <rdfs:label>ssDNA-specific endodeoxyribonuclease activity</rdfs:label>
  <rdfs:label>single-stranded DNA specific endodeoxyribonuclease activity</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://www.semanticsoftware.info/molecular_function#MolecularFunction"/>
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
</rdf:Description>
Figure 13: Example for a Molecular Function encoded in RDF: single-
stranded DNA specific endodeoxyribonuclease activity (GO ID:0000014)
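The conversion from an OBO-XML entry to an RDF description can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names are hypothetical, the element names assume the underscore-separated spelling of the OBO-XML export (molecular_function, synonym_text), and the base URI is an assumption read off Figure 13.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumed namespace, read off Figure 13; adjust to the real gazetteer namespace.
BASE = "http://www.semanticsoftware.info/molecular_function#"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"


def term_to_rdf(term):
    """Return an rdf:Description string for a molecular function term, else None."""
    if term.findtext("namespace") != "molecular_function":
        return None  # skip biological_process and cellular_component entries
    go_id = term.findtext("id")
    # Synonyms and the primary name all become rdfs:label values.
    labels = [s.text for s in term.findall("synonym/synonym_text")]
    labels.append(term.findtext("name"))
    about = quoteattr(BASE + go_id)
    parent = quoteattr(BASE + "MolecularFunction")
    lines = [f"<rdf:Description rdf:about={about}>"]
    lines += [f"  <rdfs:label>{escape(label)}</rdfs:label>" for label in labels]
    lines.append(f"  <rdfs:subClassOf rdf:resource={parent}/>")
    lines.append(f"  <rdf:type rdf:resource={quoteattr(OWL_CLASS)}/>")
    lines.append("</rdf:Description>")
    return "\n".join(lines)


def obo_xml_to_rdf(xml_text):
    """Convert every molecular function term in an OBO-XML string."""
    root = ET.fromstring(xml_text)
    descriptions = []
    for term in root.iter("term"):
        rdf = term_to_rdf(term)
        if rdf is not None:
            descriptions.append(rdf)
    return descriptions
```

Every label produced this way becomes a gazetteer entry, so a match on either the name or a synonym resolves to the same GO ID.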
3.4.2 Kinetic Constants
Depending on their interests, enzyme and protein engineers apply recombinant DNA technology to improve enzyme kinetic values and stability or to identify the roles of residues. Consider a study on the role of Asn107 in human L-xylulose reductase (XR):
To examine the role of Asn107 in the catalytic mechanism of human XR, mutant forms (N107D and N107L) were prepared. The two mutations
increased Km for the substrate (>26-fold) and Kd for NADPH (95-fold), but
only the N107L mutation significantly decreased kcat value.a
aExcerpt from Crystal structure of human L-xylulose reductase holoen-
zyme: probing the role of Asn107 with site-directed mutagenesis, PMID:
15103634
Here, the two mutations, N107D and N107L, affect three kinetic values of the protein: the Michaelis-Menten constant (Km), the turnover number (kcat), and the dissociation constant (Kd). To capture such kinetic properties, we manually compiled a list of them from the scientific literature. This list is by no means exhaustive. Property synonyms add complexity to later tasks, in which the relations are extracted
and validated against the ontology (see Section 3.5.6). A simple RDF
schema allows us to deal with different term representations of a concept
and to resolve all aliases of the same concept. Figure 14 shows protein
property terms that collapse to the same concept. A triple is defined as C1 rdfs:subClassOf C2, where C2 is an instance of rdfs:Class, “ProteinProperty”. The rdfs:label is an instance of rdf:Property whose
rdfs:domain is rdfs:Resource and whose rdfs:range is rdfs:Literal.
<rdf:Description rdf:about="&prop;DissociationConstant">
  <rdfs:label>dissociation constant</rdfs:label>
  <rdfs:label>affinity</rdfs:label>
  <rdfs:label>Kd-value</rdfs:label>
  <rdfs:label>Kd</rdfs:label>
  <rdfs:subClassOf rdf:resource="&prop;ProteinProperty"/>
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
</rdf:Description>
Figure 14: Example for a kinetic property encoded in RDF: Dissociation
Constant
Normalizing all aliases to a single representation is also helpful when populating the output ontology. Consider Half-life as an example:
it can appear in different variations (Table 9). All these variations are
represented as labels in the aforementioned RDF, so if any of them matches, the mention is normalized to Half-life.
Table 9: Different representations of “Half-life”
t0.5        (t1/2)      half-life
half-lives  half life   half lives
halflife    halflives
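The normalization of aliases to one concept can be sketched as a simple lookup. This is an illustrative sketch only: the dictionary stands in for the labels stored in the RDF schema, the function names are hypothetical, and matching case-insensitively with collapsed whitespace is an assumption that the text does not state.

```python
import re

# Stand-in for the rdfs:label values of each concept (Figure 14 and Table 9).
# "t1/2" is listed without the parentheses shown in Table 9.
ALIASES = {
    "Half-life": ["t0.5", "t1/2", "half-life", "half-lives", "half life",
                  "half lives", "halflife", "halflives"],
    "DissociationConstant": ["dissociation constant", "affinity", "Kd-value", "Kd"],
}


def build_lookup(aliases):
    """Invert concept -> labels into a lower-cased label -> concept table."""
    return {label.lower(): concept
            for concept, labels in aliases.items()
            for label in labels}


LOOKUP = build_lookup(ALIASES)


def normalize(mention):
    """Return the concept for a mention, or None if no label matches."""
    key = re.sub(r"\s+", " ", mention.strip()).lower()  # assumed matching policy
    return LOOKUP.get(key)
```

For instance, normalize("half lives") and normalize("halflife") both resolve to Half-life, so the output ontology receives a single concept regardless of the surface form.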
3.4.3 Kinetic Values
Knowing the magnitude of the kinetic parameters affected by mutations enables biologists to better compare the impacts of the mutations they are interested in.
As an example, consider this bio-engineering study that was conducted on quinoprotein Glucose Dehydrogenase to improve the thermal stability of the enzyme:
The halflife at 55°C of Ser415Cys (183 min) was approx 36-fold greater
than that of the wild-type enzyme (5 min) and 4-fold greater than that of
the Ser231Lys variant (40 min).a
aExcerpt from Stabilization of quaternary structure of water-soluble quino-
protein glucose dehydrogenase, PMID: 12746550
The Ser residue at position 415 was chosen for constructing different variants of the enzyme, which were compared with the S231K variant. Determining which variant yields the most thermostable enzyme requires extracting the magnitudes. The half-life of S415C was measured as 183 min and that of
S231K as 40 min, and the measured half-lives of the two
mutations are also compared to that of the wild-type enzyme.
The magnitudes of kinetic parameters are expressed as signed numbers, decimals, or ranges of values for a single parameter, as shown in the following example:
For the three residues in the BE–αF loop, Q137M caused increases of
2-4-fold in Km values for the three substrates, and L143F showed 8-fold
increase in the Km value for only diacetyl, but the effect of the H146L
mutation on the kinetic constants was small.a
aExcerpt from Identification of amino acid residues involved in sub-
strate recognition of L-xylulose reductase by site-directed mutagenesis, PMID:
Since the existing generic tokeniser can only detect digits, we developed a simple tokeniser to capture possible representations of magnitudes. The pattern used in the tokeniser is as follows:
(DASH)* (DIGIT)+ ((DASH | CONNECTOR_PUNCTUATION | MATH_SYMBOL) (DIGIT)+)*
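As an illustration, the pattern can be approximated with a regular expression. The sketch below is not the tokeniser developed in this work: it assumes that DASH, CONNECTOR_PUNCTUATION and MATH_SYMBOL correspond to the Unicode general categories Pd, Pc and Sm, restricts the character classes to code points below U+3000, and uses hypothetical names throughout.

```python
import re
import unicodedata


def chars_of(*categories, limit=0x3000):
    """Escaped characters below `limit` whose Unicode general category is listed."""
    return "".join(
        re.escape(chr(cp))
        for cp in range(limit)
        if unicodedata.category(chr(cp)) in categories
    )


# Assumed mapping of the token classes to Unicode general categories.
DASH = chars_of("Pd")
SEPARATOR = chars_of("Pd", "Pc", "Sm")

# (DASH)* (DIGIT)+ ((DASH | CONNECTOR_PUNCTUATION | MATH_SYMBOL) (DIGIT)+)*
MAGNITUDE = re.compile(rf"[{DASH}]*\d+(?:[{SEPARATOR}]\d+)*")


def find_magnitudes(text):
    """Return all substrings of `text` that match the magnitude pattern."""
    return [m.group(0) for m in MAGNITUDE.finditer(text)]
```

Applied to "increases of 2-4-fold in Km", the sketch yields the single token 2-4, because the second hyphen is not followed by a digit.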
However, more complex mentions of ranges cannot be captured by the developed tokeniser and need further analysis. As illustrated in the following example, value ranges can be expressed in various formats:
Example #1:
In agreement with the lack of involvement of ionizable groups in the oxidative half-reaction is the observation that no pKa values in the range
between 7.0 and 11.0 were detected in the pH profile of the kcat/Koxygen
value with the H99N enzyme.a
aExcerpt from Contribution of flavin covalent linkage with histidine 99 to
the reaction catalyzed by choline oxidase, PMID: 19398559
Example #2:
Catalytic efficiency with NADP(H) decreased from 1,500 to 10,000-fold for
G223D/T224I and from 100 to 1,400-fold for T224I/H225N.a
aExcerpt from Complete reversal of coenzyme specificity by concerted
mutation of three consecutive residues in alcohol dehydrogenase, PMID:
12902331
Example #3:
Independent evidence supporting the lack of commitments to catalysis in the Asn-99 variant enzyme is the lack of perturbation in the kinetic pKa
value of 8.4 ± 0.1 determined in the pH profile for the kcat/Km value when
choline is substituted with 1,2-[2H4]choline as substrate.a
aExcerpt from Contribution of flavin covalent linkage with histidine 99 to
the reaction catalyzed by choline oxidase, PMID: 19398559
To ensure that we extract the reported ranges of magnitudes, we collected possible range representations from the literature and expressed
them through grammar rules (Table 10). After detecting all possible values,
we check which values express a physical quantity using the patterns
Table 10: Kinetic Value Extraction Patterns
Priority  Pattern
1         (VALUE) ((CONJUNCTION) (VALUE))*
1         (NUMBER_WORDS) ((CONJUNCTION) (NUMBER_WORDS))*
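The first rule in Table 10 can be sketched with a regular expression. This is a simplified illustration rather than the grammar used in this work: the definition of VALUE (digits with optional thousands separators and a decimal part) and the list of conjunctions are assumptions, and the number-word rule is omitted.

```python
import re

# Assumed definitions; the actual grammar operates on tokeniser annotations.
VALUE = r"\d+(?:,\d{3})*(?:\.\d+)?"
CONJUNCTION = r"to|and|or|±|-"

# (VALUE) ((CONJUNCTION) (VALUE))*
RANGE = re.compile(rf"{VALUE}(?:\s*(?:{CONJUNCTION})\s*{VALUE})*")


def find_value_ranges(text):
    """Return single values and conjunction-joined value ranges found in `text`."""
    return [m.group(0) for m in RANGE.finditer(text)]
```

On the three examples above, this sketch returns "7.0 and 11.0", "1,500 to 10,000" and "8.4 ± 0.1" as single range mentions.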
3.4.4 Units of Measurement
Units of measurement are expressed in various formats, in mass or molar concentration (e.g., mg/ml or mmol/l), in different systems (e.g., unit, katal)
and different scales (e.g., mM, μM and nM). Finding how a magnitude is
measured requires detecting units of measurement. Consider the following example:
Catalytic efficiency with NADP(H) dropped dramatically in the double mutants, G223D/T224I and T224I/H225N, and in the triple mutant,
G223D/T224I/H225N (kcat/KmNADPH = 760mm-1 min-1), as compared
with the wild-type enzyme (kcat/KmNADPH = 133,330 mm-1 min-1).a
aExcerpt from Complete reversal of coenzyme specificity by concerted
mutation of three consecutive residues in alcohol dehydrogenase, PMID:
12902331
The magnitude of catalytic efficiency is measured in mm-1 min-1. Using the same approach as for kinetic properties, the list of units of measurement was collected from the literature and encoded in an RDF
schema (Figure 15). The RDF schema is limited to one subclass hierarchy
and assigns the units of measurement to their corresponding concepts in the OWL-DL ontology.
Consider the unit of measurement per second in Figure 15:
the same concept is encoded as PerSecond in the OWL-DL ontology (see
Section 3.5.1). If any of the representations of per second is detected in the
document, the class PerSecond is assigned to it, which facilitates the ontology population step.
3.4.5 Physical Quantities
Having extracted all possible kinetic values, we can use the information about the units of measurement to extract physical quantities.
<rdf:Description rdf:about="http://info.semanticsoftware/unitOfMeasurement#PerSecond">
  <rdfs:label>/sec</rdfs:label>
  <rdfs:label>s-1</rdfs:label>
  <rdfs:label>sec-1</rdfs:label>
  <rdfs:label>sec-1</rdfs:label>
  <rdfs:label>per sec</rdfs:label>
  <rdfs:label>per s</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://info.semanticsoftware/unitOfMeasurement#UnitOfMeasurement"/>
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
</rdf:Description>
Figure 15: Example for a unit of measurement encoded in RDF: Per second
Table 11: Physical Quantity Extraction Patterns
Priority  Pattern
2         (PKA) (TOKEN)[2,3] (VALUE) ((CONJUNCTION) (VALUE))?
2         (PHUNIT) (VALUE)
1         ((VALUE) (UNIT)) ((CONJUNCTION) ((VALUE) (UNIT)))?
Usually, units of measurement follow values, except for a few with no specific units of measurement, such as pH. More succinctly:
physical quantity = value + unit of measurement
After reviewing the literature, we designed a set of patterns to capture
physical quantities (see Table 11).
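The core of the priority 1 rule, a value followed by a unit, can be sketched as follows. This is a hedged illustration and not the grammar used in this work: the unit list is a small hypothetical stand-in for the RDF unit gazetteer, VALUE is an assumed numeric pattern, and allowing a hyphen between value and unit (as in 36-fold) is an assumption.

```python
import re

# Hypothetical stand-in for the unit labels held in the RDF schema.
UNITS = ["min-1", "sec-1", "s-1", "mM", "μM", "nM", "min", "fold", "%"]

VALUE = r"\d+(?:,\d{3})*(?:\.\d+)?"  # assumed numeric pattern
# Try longer labels first so that "min-1" is preferred over "min".
UNIT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# physical quantity = value + unit of measurement
QUANTITY = re.compile(rf"({VALUE})[\s-]*({UNIT})(?![A-Za-z0-9])")


def find_physical_quantities(text):
    """Return (value, unit) pairs for every value directly followed by a known unit."""
    return [(m.group(1), m.group(2)) for m in QUANTITY.finditer(text)]
```

Numbers that are not followed by a known unit, such as the residue position in Ser415Cys, are left out, which is how the unit information filters the previously detected values.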
3.5 Impact Analysis
Mutations are considered as sources of species evolution. Some result in beneficial changes, while others have detrimental effects. It is important to not only find impacts, but also to mark the origin mutation and altered protein properties for further analysis. A system capable of analyzing mutation impacts requires information from many entities.
Impact analysis consists of the following steps:
1. Finding impact expressions.
2. Finding mutations or mutation keywords.
3. Identifying the polarity of the impact to detect advantageous and disadvantageous impacts.
4. Grounding the impacts to mutations to find which mutations lead to a specific impact.
5. Finding the affected protein properties.
6. Finding the magnitude of the effect to help bio-engineers compare the effects and find the most favorable mutations.
For the first step, we use ontology-based gazetteering, with the help of the morphological analyzer, to capture term variations. Using
heuristics (see Section 3.5.4), we attempt to ground the impacts to the
detected mutations. Possible kinetic values are found using a custom
tokeniser and validated by rules (see Section 3.4.3). The magnitude of
an impact is detected through heuristics and validated against the domain
ontology (see Section 3.5.6). The last task solved by the system is to find
the protein properties changed by a mutation, which is also done through
additional heuristics (see Section 3.5.5). The overall workflow of the impact
extraction is shown in Figure 16.
[Figure 16 diagram components: Impact Detection, Impact Extraction, MeasuredWith Relation Detection, ImpactOn Relation Detection, Impact Grounding]
Figure 16: Impact Extraction Overview
3.5.1 Impact Ontology
Our Mutation Impact Ontology is an extension to the Mutation Miner
Table 12: Mutation Impact Concepts in the Mutation Miner Ontology
Object Property       Domain             Range              Description
hasProperty           Protein            ProteinProperty    Which protein the protein property belongs to
impactOn              MutImpact          ProteinProperty    Identifies the protein property affected by a mutation
measuredWith          ProteinProperty    UnitOfMeasurement  Holds between protein property and the corresponding unit of measurement
mutationMutImpactRel  Mutation           MutImpact          Associates an impact with a mutation
Datatype Property     Domain             Range              Description
physicalQuantity      UnitOfMeasurement  value              Identifies the magnitude of a mutation effect on the protein property
with them (Figure 17). The use of the impact ontology facilitates advanced
queries and impact extraction. The Mutation Impact Ontology contains information about several elements: Text elements, biological entities and entity relations. We extended the ontology with new classes, such as
MichaelisMentenConstant, SpecificActivity, and MaximalVelocity (Figure 18). Our ontology has a rich set of relationships between the concepts. Main concepts modeling impacts on a semantic level are:
Mutation: An alteration of a gene that gives rise to a different offspring.
UnitOfMeasurement: A standard for measuring the physical quantity.
MutationImpact: The expansion of an impact can be presented as a bifurcating tree: each bifurcating node represents a mutation effect on protein properties, whether the impact is measurable or not.
ProteinProperty: A class for protein properties, which subsumes kinetic
properties, protein function, and protein stability.
Information about the effect of mutations on proteins can be modeled at different granularity levels. For example, the effect can be on the structure, which consequently can affect various properties of the proteins. For a finer level of granularity, we represent all these relations.
The relations between these entities, expressed as OWL object properties,
are listed in Table 12.
Each kinetic property is measured with specific units of measurement, for example Michaelis Menten Constant is measured with units such as per
second, per minute, etc. However, in interpreting the mutation impacts, not
Figure 17: Impact Ontology
can be used. For example, the measured values of the affected protein property are compared with the measured values of the wild type or other mutated protein properties, and specified by percent, fold or orders of
magnitude. We decided to establish some restrictions on the units of
measurement with which each kinetic property is measured, as well as the ratio measurement units. These restrictions are encoded in the ontology
based on global standards (SI10), where kinetic properties are measured by
specific units of measurement. These constraints are encoded as possible value fillers for the measuredWith slot of a specific protein property. For instance, Km can be measured with fold, per second and per minute, etc.
(Figure 19).
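The effect of these restrictions can be sketched as a lookup of permitted fillers for the measuredWith slot. This is a minimal illustration, not the OWL-DL encoding itself: the dictionary and function names are hypothetical, and the only entry shown mirrors the example fillers named in the text, which are not an exhaustive list.

```python
# Hypothetical table of permitted measuredWith fillers per protein property.
# The single entry mirrors the example given in the text; it is not exhaustive.
ALLOWED_UNITS = {
    "MichaelisMentenConstant": {"Fold", "PerSecond", "PerMinute"},
}


def is_valid_measurement(protein_property, unit, allowed=ALLOWED_UNITS):
    """Check whether `unit` is a permitted measuredWith filler for the property."""
    return unit in allowed.get(protein_property, set())
```

A candidate relation whose unit is not among the permitted fillers, or whose property is unknown to the table, is rejected by this check.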
We also defined a datatype property for protein properties, called physicalQuantity, referring to the value and the unit of measurement found in
the text.