
3.4 Protein Properties Detection

3.4.1 Molecular Function

Understanding the role of mutations, in particular their contribution to diseases like cancer, requires identifying their impact on molecular functions. Causative mutations can drive cancers by activating or inactivating a protein function. They can also promote cancer progression through resistance to drugs or, according to a recent study, through switching of functions [RAS11]. Consider the following example on the structural and

L145Q and V157F are classified as β-sandwich mutants and are located on β-strands S3 and S4, respectively (Figure 1B). Nearly a third of the cancer mutations are in the β-sandwich region, but individual mutations occur at a much lower rate than hot-spot mutations.12 β sandwich mutants retain sequence-specific DNA binding at lower temperatures, but many are not functional at physiological temperature. [a]

[a] Excerpt from Structural effects of the L145Q, V157F, and R282W cancer-associated mutations in the p53 DNA-binding core domain, PMID: 21561095

The above example describes the impact of two mutations, L145Q and V157F, on the molecular function "sequence-specific DNA binding" (Gene Ontology ID: 0043565).

Detecting the functional impact of mutations has drawn attention not only in cancer research, but is also an important matter in re-sequencing efforts.

Consider another example, from a study on the functions and structures of peroxiredoxins and their mutants:

The C83S mutant exhibited similar peroxidase activity to the wild type, which is exclusively dimeric, in the Trx/Trx reductase system. [a]

[a] Excerpt from Dimer-oligomer interconversion of wild-type and mutant rat 2-Cys peroxiredoxin: disulfide formation at dimer-dimer interfaces is not essential for decamerization, PMID: 17974571

The above example shows that the C83S mutation did not affect the peroxidase activity (GO ID: 0004601) of the protein. To detect molecular functions, we use the concepts defined in the Gene Ontology [ABB+00].

We generate an RDF representation of molecular functions from a download of the Gene Ontology. The Gene Ontology is provided in OBO-XML format, where each node is one entry (Figure 12). We first select the entries in the molecular function namespace, then extract the name, the GO ID, and the synonyms of each entry. Using this information, we generate our RDF file. So that further information can be retrieved later, molecular functions are identified by their Gene Ontology ID (Figure 13). Each triple has the form C1 rdfs:subClassOf C2, where rdfs:subClassOf is an instance of rdf:Property. The triple states that C1, here the Gene Ontology ID, is an instance of rdfs:Class and a subclass of C2, which is also an instance of rdfs:Class, here "molecular function". The resulting RDF is used for gazetteering.

<term>
<name>single-stranded DNA specific endodeoxyribonuclease activity</name>
<namespace>molecular_function</namespace>
<def>
<defstr>Catalysis of the hydrolysis of ester linkages within a single-stranded deoxyribonucleic acid molecule
<dbxref>
</def>
<synonym_text>ssDNA-specific endodeoxyribonuclease activity</synonym_text>
<dbxref>
</synonym>

Figure 12: Example for a Molecular Function encoded in OBO-XML format: single-stranded DNA specific endodeoxyribonuclease activity (GO ID: 0000014)

<rdf:Description rdf:about="http://www.semanticsoftware.info/ molecular function #GO:0000014">
<rdfs:label>ssDNA-specific endodeoxyribonuclease activity</rdfs:label>
<rdfs:label>single-stranded DNA specific endodeoxyribonuclease activity</rdfs:label>
</rdf:Description>

Figure 13: Example for a Molecular Function encoded in RDF: single-stranded DNA specific endodeoxyribonuclease activity (GO ID: 0000014)
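The conversion from OBO-XML entries to RDF descriptions described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual converter: the sample input is a simplified, well-formed stand-in for the Gene Ontology download, and the element names `id` and `synonym_text`, the `obo` wrapper element, the `BASE` namespace, and the `MolecularFunction` class URI are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Simplified, hypothetical OBO-XML sample (element layout is assumed).
SAMPLE = """<obo>
  <term>
    <id>GO:0000014</id>
    <name>single-stranded DNA specific endodeoxyribonuclease activity</name>
    <namespace>molecular_function</namespace>
    <synonym><synonym_text>ssDNA-specific endodeoxyribonuclease activity</synonym_text></synonym>
  </term>
  <term>
    <id>GO:0000001</id>
    <name>mitochondrion inheritance</name>
    <namespace>biological_process</namespace>
  </term>
</obo>"""

BASE = "http://example.org/molecular_function#"  # placeholder namespace


def terms_to_rdf(obo_xml):
    """Return one RDF/XML description per molecular function term."""
    root = ET.fromstring(obo_xml)
    blocks = []
    for term in root.iter("term"):
        # Keep only entries from the molecular function namespace.
        if term.findtext("namespace") != "molecular_function":
            continue
        go_id = term.findtext("id")
        # The name and every synonym become labels of the same concept.
        labels = [term.findtext("name")]
        labels += [s.text for s in term.iter("synonym_text")]
        lines = [f'<rdf:Description rdf:about="{BASE}{go_id}">']
        lines.append(f'  <rdfs:subClassOf rdf:resource="{BASE}MolecularFunction"/>')
        lines += [f"  <rdfs:label>{escape(label)}</rdfs:label>" for label in labels]
        lines.append("</rdf:Description>")
        blocks.append("\n".join(lines))
    return blocks


if __name__ == "__main__":
    for block in terms_to_rdf(SAMPLE):
        print(block)
```

Only the first sample term is emitted, since the second belongs to a different namespace. Emitting the name and its synonyms as sibling labels is what lets a gazetteer match any of them and still resolve to the same GO ID.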

3.4.2 Kinetic Constants

Depending on their interests, enzyme and protein engineers apply recombinant DNA technology to improve enzyme kinetic values and stability or to identify the roles of residues. Consider a study on the role of Asn107 in human L-xylulose reductase (XR):

To examine the role of Asn107 in the catalytic mechanism of human XR, mutant forms (N107D and N107L) were prepared. The two mutations increased Km for the substrate (>26-fold) and Kd for NADPH (95-fold), but only the N107L mutation significantly decreased kcat value. [a]

[a] Excerpt from Crystal structure of human L-xylulose reductase holoenzyme: probing the role of Asn107 with site-directed mutagenesis, PMID: 15103634

Here, the two prepared mutations, N107D and N107L, affect three kinetic values of the protein: the Michaelis-Menten constant (Km), the turnover number (kcat), and the dissociation constant (Kd). To capture such kinetic properties, we manually compiled a list of them from the scientific literature. This list is by no means exhaustive. Property synonyms add complexity to later tasks, in which the relations are extracted and validated against the ontology (see Section 3.5.6). A simple RDF schema allows us to deal with different term representations of a concept and to resolve all aliases of the same concept. Figure 14 shows protein property terms that collapse to the same concept. A triple is defined as C1 rdfs:subClassOf C2, where C2 is an instance of rdfs:Class, "ProteinProperty". rdfs:label is an instance of rdf:Property whose rdfs:domain is rdfs:Resource and whose rdfs:range is rdfs:Literal.

<rdf:Description rdf:about="&prop;DissociationConstant">
<rdfs:label>dissociation constant</rdfs:label>
<rdfs:label>affinity</rdfs:label>
<rdfs:label>Kd-value</rdfs:label>
<rdfs:label>Kd</rdfs:label>

Figure 14: Example for a kinetic property encoded in RDF: Dissociation Constant

Normalizing all aliases to a single representation is also helpful when populating the output ontology. Consider Half-life as an example: it can appear in different variations (Table 9). All these variations are represented as labels in the aforementioned RDF, so if any of them matches, the mention is normalized to Half-life.

Table 9: Different representations of "Half-life"

t0.5          (t1/2)       half-life
half-lives    half life    half lives
halflife      halflives
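The alias normalization described above can be illustrated with a small lookup sketch. It assumes the labels have already been read from the RDF into a dictionary. The aliases are taken from Table 9 and Figure 14, while the function names and the matching strategy (longest alias first, case-insensitive, bounded by non-alphanumeric characters) are assumptions made for this example rather than the gazetteer actually used.

```python
import re

# Canonical concept -> surface forms (from Table 9 and Figure 14).
LABELS = {
    "Half-life": ["t0.5", "t1/2", "half-life", "half-lives", "half life",
                  "half lives", "halflife", "halflives"],
    "DissociationConstant": ["dissociation constant", "affinity",
                             "Kd-value", "Kd"],
}


def build_gazetteer(labels):
    """Build an alias lookup table and one regex covering all aliases."""
    lookup = {alias.lower(): concept
              for concept, aliases in labels.items()
              for alias in aliases}
    # Longest aliases first, so "Kd-value" wins over "Kd".
    alternation = "|".join(
        re.escape(alias) for alias in sorted(lookup, key=len, reverse=True))
    pattern = re.compile(
        rf"(?<![A-Za-z0-9])(?:{alternation})(?![A-Za-z0-9])", re.IGNORECASE)
    return lookup, pattern


def normalize_mentions(text, lookup, pattern):
    """Return (surface form, canonical concept) pairs found in the text."""
    return [(m.group(0), lookup[m.group(0).lower()])
            for m in pattern.finditer(text)]


if __name__ == "__main__":
    lookup, pattern = build_gazetteer(LABELS)
    sentence = "The halflife of S415C was longer, and Kd for NADPH increased."
    print(normalize_mentions(sentence, lookup, pattern))
```

Each detected mention keeps its surface form but carries the canonical concept, which is the piece of information needed later when the output ontology is populated.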

3.4.3 Kinetic Values

Knowing the magnitude of the kinetic parameters affected by mutations enables biologists to better compare the impacts of the mutations they are interested in.

As an example, consider this bio-engineering study that was conducted on quinoprotein Glucose Dehydrogenase to improve the thermal stability of the enzyme:

The halflife at 55°C of Ser415Cys (183 min) was approx 36-fold greater than that of the wild-type enzyme (5 min) and 4-fold greater than that of the Ser231Lys variant (40 min). [a]

[a] Excerpt from Stabilization of quaternary structure of water-soluble quinoprotein glucose dehydrogenase, PMID: 12746550

The Ser residue at position 415 was chosen for constructing different variants of the enzyme, which were compared with the S231K variant. Determining which variant yields the most thermostable enzyme requires extracting the magnitudes. The half-life of S415C was measured as 183 min and that of S231K as 40 min, and both half-lives are also compared to that of the wild-type enzyme.

The magnitudes of kinetic parameters are expressed in signed numbers, decimals and ranges of values for a single parameter as shown in the following example:

For the three residues in the BE–αF loop, Q137M caused increases of 2-4-fold in Km values for the three substrates, and L143F showed 8-fold increase in the Km value for only diacetyl, but the effect of the H146L mutation on the kinetic constants was small. [a]

[a] Excerpt from Identification of amino acid residues involved in substrate recognition of L-xylulose reductase by site-directed mutagenesis, PMID:

Since the existing generic tokeniser can only detect digits, we developed a simple tokeniser to capture possible representations of magnitudes. The pattern used in the tokeniser is as follows:

(DASH)* (DIGIT)+ ((DASH | CONNECTOR_PUNCTUATION | MATH_SYMBOL) (DIGIT)+)*
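One way to read the pattern above is as a scan over Unicode character categories. The sketch below is an illustrative reimplementation in Python, not the tokeniser used in the system. It assumes that DASH, CONNECTOR PUNCTUATION, MATH SYMBOL, and DIGIT correspond to the Unicode general categories Pd, Pc, Sm, and Nd, and the function name is hypothetical.

```python
import unicodedata


def _is_digit(ch):
    return unicodedata.category(ch) == "Nd"


def _is_dash(ch):
    return unicodedata.category(ch) == "Pd"


def _is_joiner(ch):
    # DASH | CONNECTOR PUNCTUATION | MATH SYMBOL (assumed category mapping)
    return unicodedata.category(ch) in {"Pd", "Pc", "Sm"}


def find_magnitudes(text):
    """Return all substrings matching
    (DASH)* (DIGIT)+ ((DASH|CONNECTOR|MATH) (DIGIT)+)*"""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        j = i
        while j < n and _is_dash(text[j]):        # (DASH)*
            j += 1
        end = j
        while end < n and _is_digit(text[end]):   # (DIGIT)+
            end += 1
        if end == j:                              # no digit here, move on
            i = j + 1
            continue
        # ((DASH|CONNECTOR|MATH) (DIGIT)+)*: a joiner is consumed only
        # when a digit follows it directly.
        while (end + 1 < n and _is_joiner(text[end])
               and _is_digit(text[end + 1])):
            end += 1
            while end < n and _is_digit(text[end]):
                end += 1
        tokens.append(text[i:end])
        i = end
    return tokens


if __name__ == "__main__":
    print(find_magnitudes("increases of 2-4-fold in Km values"))
```

Applied to raw text, such a scan also picks up the digits inside mutation names such as Q137M, so in practice it would have to be combined with the mutation annotations or run on pre-tokenised input.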

However, more complex mentions of ranges cannot be captured by the developed tokeniser and need further analysis. As illustrated in the following example, value ranges can be expressed in various formats:

Example #1:

In agreement with the lack of involvement of ionizable groups in the oxidative half-reaction is the observation that no pKa values in the range between 7.0 and 11.0 were detected in the pH profile of the kcat/Koxygen value with the H99N enzyme. [a]

[a] Excerpt from Contribution of flavin covalent linkage with histidine 99 to the reaction catalyzed by choline oxidase, PMID: 19398559

Example #2:

Catalytic efficiency with NADP(H) decreased from 1,500 to 10,000-fold for G223D/T224I and from 100 to 1,400-fold for T224I/H225N. [a]

[a] Excerpt from Complete reversal of coenzyme specificity by concerted mutation of three consecutive residues in alcohol dehydrogenase, PMID: 12902331

Example #3:

Independent evidence supporting the lack of commitments to catalysis in the Asn-99 variant enzyme is the lack of perturbation in the kinetic pKa value of 8.4 ± 0.1 determined in the pH profile for the kcat/Km value when choline is substituted with 1,2-[2H4]choline as substrate. [a]

[a] Excerpt from Contribution of flavin covalent linkage with histidine 99 to the reaction catalyzed by choline oxidase, PMID: 19398559

To ensure that we extract the reported ranges of magnitudes, we collected possible range representations from the literature and expressed

them through grammar rules (Table 10). After detecting all possible values,

we check which values express a physical quantity using the patterns

Table 10: Kinetic Value Extraction Patterns

Priority    Pattern
1           (VALUE) ((CONJUNCTION) (VALUE))*
1           (NUMBER WORDS) ((CONJUNCTION) (NUMBER WORDS))*
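The first pattern of Table 10 can be approximated with a regular expression, as in the following sketch. It is only an illustration under stated assumptions: the definition of VALUE (optionally signed integers or decimals, with optional thousands separators) and the list of conjunctions ("to", "and", "or", "±", and dashes) are guesses based on the three examples above, not the grammar rules used in the system.

```python
import re

# VALUE: a number with thousands separators, or a plain integer/decimal.
VALUE = r"(?:[-+]?\d{1,3}(?:,\d{3})+(?:\.\d+)?|[-+]?\d+(?:\.\d+)?)"
# CONJUNCTION: hypothetical list of connectors between two values.
CONJUNCTION = r"(?:to|and|or|±|-|–)"

# (VALUE) ((CONJUNCTION) (VALUE))*
VALUE_RANGE = re.compile(rf"{VALUE}(?:\s*{CONJUNCTION}\s*{VALUE})*")


def find_value_ranges(text):
    """Return every single value or conjunction-linked group of values."""
    return VALUE_RANGE.findall(text)


if __name__ == "__main__":
    for sentence in (
        "decreased from 1,500 to 10,000-fold",
        "pKa values in the range between 7.0 and 11.0 were detected",
        "value of 8.4 ± 0.1 determined",
    ):
        print(find_value_ranges(sentence))
```

A trailing conjunction is not consumed unless another value follows it, which is why the hyphen in "10,000-fold" stays outside the match.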

3.4.4 Units of Measurement

Units of measurement are expressed in various formats, in mass or molar concentration (e.g., mg/ml or mmol/l), in different systems (e.g., unit, katal)

and different scales (e.g., mM, μM and nM). Finding how a magnitude is

measured requires detecting units of measurement. Consider the following example:

Catalytic efficiency with NADP(H) dropped dramatically in the double mutants, G223D/T224I and T224I/H225N, and in the triple mutant, G223D/T224I/H225N (kcat/KmNADPH = 760 mm-1 min-1), as compared with the wild-type enzyme (kcat/KmNADPH = 133,330 mm-1 min-1). [a]

[a] Excerpt from Complete reversal of coenzyme specificity by concerted mutation of three consecutive residues in alcohol dehydrogenase, PMID: 12902331

The magnitude of catalytic efficiency is measured in mm-1 min-1. Using the same approach as for kinetic properties, the list of units of measurement was collected from the literature and encoded in an RDF schema (Figure 15). The RDF schema is limited to one subclass hierarchy and assigns the units of measurement to their corresponding concept in the OWL-DL ontology. Consider the unit of measurement per second in Figure 15: the same concept is encoded as PerSecond in the OWL-DL ontology (see Section 3.5.1). If any of the representations of per second is detected in the document, the class PerSecond is assigned to it, which facilitates the ontology population step.

3.4.5 Physical Quantities

Now that we have extracted all possible kinetic values, we can use the information about the units of measurement to extract physical quantities.

</rdf:Description>

Figure 15: Example for a unit of measurement encoded in RDF: Per second

Table 11: Physical Quantity Extraction Patterns

Priority    Pattern
2           (PKA) (TOKEN)[2,3] (VALUE) ((CONJUNCTION) (VALUE))?
2           (PHUNIT) (VALUE)
1           ((VALUE) (UNIT)) ((CONJUNCTION) ((VALUE) (UNIT)))?

Usually, units of measurement follow values, except for a few quantities that have no specific unit of measurement, such as pH. More succinctly:

physical quantity = value + unit of measurement

After reviewing the literature, we designed a set of patterns to capture physical quantities (see Table 11).
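The rule "physical quantity = value + unit of measurement" can be sketched with two small regular expressions, one for the value-unit pattern and one for the pH pattern. This is a simplified illustration: the unit list is a hypothetical subset, the VALUE definition is an assumption, and the priorities and the PKA pattern of Table 11 are not modelled.

```python
import re

# Hypothetical subset of unit surface forms; the real list comes from the RDF.
UNITS = ["min", "s", "h", "mM", "μM", "nM", "fold", "%"]

VALUE = r"[-+]?\d+(?:\.\d+)?"
# Longest units first, so "min" is tried before shorter alternatives.
UNIT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# (VALUE) (UNIT): an optional space or hyphen may separate value and unit,
# and the unit must not run into a following letter.
VALUE_UNIT = re.compile(rf"({VALUE})\s*-?\s*({UNIT})(?![A-Za-z])")
# (PHUNIT) (VALUE): the unit precedes the value.
PH_VALUE = re.compile(rf"\bpH\s*({VALUE})")


def find_physical_quantities(text):
    """Return (value, unit) pairs: value-unit matches first, then pH matches."""
    found = [(float(v), u) for v, u in VALUE_UNIT.findall(text)]
    found += [(float(v), "pH") for v in PH_VALUE.findall(text)]
    return found


if __name__ == "__main__":
    sentence = ("Ser415Cys (183 min) was approx 36-fold greater "
                "than the wild-type enzyme (5 min)")
    print(find_physical_quantities(sentence))
```

Values without a recognised unit, such as the temperature in "55°C" or the residue number in Ser415Cys, are skipped, which is how the unit information helps decide which candidate values express a physical quantity.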

3.5 Impact Analysis

Mutations are considered a source of species evolution. Some result in beneficial changes, while others have detrimental effects. It is important not only to find impacts, but also to mark the originating mutation and the altered protein properties for further analysis. A system capable of analyzing mutation impacts requires information from many entities.

Impact analysis consists of the following steps:

1. Finding impact expressions.

2. Finding mutations or mutation keywords.

3. Identifying the polarity of the impact to detect advantageous and disadvantageous impacts.

4. Grounding the impacts to mutations to find which mutations lead to a specific impact.

5. Finding the affected protein properties.

6. Finding the magnitude of the effect to help bio-engineers compare the effects and find the most favorable mutations.

For the first step, we use ontology-based gazetteering, with the help of the morphological analyzer, to capture term variations. Using heuristics (see Section 3.5.4), we attempt to ground the impacts to the detected mutations. Possible kinetic values are found using a custom tokeniser and validated by rules (see Section 3.4.3). The magnitude of an impact is detected through heuristics and validated against the domain ontology (see Section 3.5.6). The last task solved by the system is to find the protein properties changed by a mutation, which is also done through additional heuristics (see Section 3.5.5). The overall workflow of the impact extraction is shown in Figure 16.

[Figure 16: Impact Extraction Overview. Components shown: Impact Detection, Impact Extraction, MeasuredWith Relation Detection, ImpactOn Relation Detection, Impact Grounding]

3.5.1 Impact Ontology

Our Mutation Impact Ontology is an extension to the Mutation Miner

Table 12: Mutation Impact Concepts in the Mutation Miner Ontology

Object Property         Domain              Range               Description
hasProperty             Protein             ProteinProperty     Which protein the protein property belongs to
impactOn                MutImpact           ProteinProperty     Identifies the protein property affected by a mutation
measuredWith            ProteinProperty     UnitOfMeasurement   Holds between protein property and the corresponding unit of measurement
mutationMutImpactRel    Mutation            MutImpact           Associates an impact with a mutation

Datatype Property       Domain              Range               Description
physicalQuantity        UnitOfMeasurement   value               Identifies the magnitude of a mutation effect on the protein property

with them (Figure 17). The use of the impact ontology facilitates advanced

queries and impact extraction. The Mutation Impact Ontology contains information about several elements: text elements, biological entities, and entity relations. We extended the ontology with new classes, such as MichaelisMentenConstant, SpecificActivity, and MaximalVelocity (Figure 18). Our ontology has a rich set of relationships between the concepts. The main concepts modeling impacts on a semantic level are:

Mutation: An alteration of a gene that results in a different offspring.

UnitOfMeasurement: A standard for measuring a physical quantity.

MutationImpact: The expansion of an impact can be presented as a bifurcating tree: each bifurcating node represents a mutation effect on protein properties, whether the impact is measurable or not.

ProteinProperty: A class for protein properties, which subsumes kinetic properties, protein function, and protein stability.

Information about the effect of mutations on proteins can be modeled at different granularity levels. For example, the effect can be on the structure, which consequently can affect various properties of the proteins. For a ﬁner level of granularity, we represent all these relations.

The relations between these entities, expressed as OWL object properties,

are listed in Table 12.

Each kinetic property is measured with speciﬁc units of measurement, for example Michaelis Menten Constant is measured with units such as per

second, per minute, etc. However, in interpreting the mutation impacts, not

Figure 17: Impact Ontology

can be used. For example, the measured values of the affected protein property are compared with the measured values of the wild type or of other mutated protein properties, and are specified in percent, fold, or orders of magnitude. We decided to establish restrictions on the units of measurement with which each kinetic property is measured, as well as on the ratio measurement units. These restrictions are encoded in the ontology based on global standards (SI), where kinetic properties are measured by specific units of measurement. These constraints are encoded as possible value fillers for the measuredWith slot of a specific protein property. For instance, Km can be measured with fold, per second, per minute, etc. (Figure 19).
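The effect of such measuredWith restrictions can be illustrated with a small validation sketch. The ontology itself encodes them as OWL restrictions. The dictionary below is only a stand-in that reuses the Km example from the text, and the names `Fold`, `PerMinute`, `Minute`, `PhysicalQuantity`, and `is_valid_measurement` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalQuantity:
    value: float
    unit: str  # name of a UnitOfMeasurement class, e.g. "PerSecond"


# Allowed value fillers of measuredWith per protein property
# (illustrative subset based on the Km example in the text).
MEASURED_WITH = {
    "MichaelisMentenConstant": {"Fold", "PerSecond", "PerMinute"},
}


def is_valid_measurement(protein_property, quantity):
    """Check a detected quantity against the measuredWith restriction."""
    return quantity.unit in MEASURED_WITH.get(protein_property, set())


if __name__ == "__main__":
    km_change = PhysicalQuantity(26.0, "Fold")
    print(is_valid_measurement("MichaelisMentenConstant", km_change))
```

A quantity whose unit is not among the allowed fillers for the property is rejected, which mirrors how a candidate relation between a value and a protein property would be validated against the ontology.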

We also defined a datatype property for protein properties, called physicalQuantity, which refers to the value and the unit of measurement found in the text.

In document Automated Extraction of Protein Mutation Impacts from the Biomedical Literature (Page 47-57)