
3.5 Comparing Saint-annotated models with pre-existing annotation

3.5.2 Annotating stripped models

The first comparison illustrates the benefits of searching based on the id or name attributes of an SBML species element. The stripped versions of the telomere model and the p53-Mdm2 model, both of which contained a number of useful species, were annotated by Saint and then compared against their originals. Even within a single model, some names and identifiers can be useful to Saint while others are not.
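As an illustration of the kind of input Saint has to work with in a stripped model, the following minimal sketch collects a query term for each SBML species, using the name attribute where present and falling back to the id attribute otherwise. The SBML fragment, the function name species_query_terms and the name-before-id preference are assumptions made for this example; they are not taken from the Saint source code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified SBML fragment (not one of the thesis models).
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="stripped_example">
    <listOfSpecies>
      <species id="Cdc13" compartment="nucleus"/>
      <species id="Rad9I" compartment="nucleus"/>
      <species id="s3" name="Mec1" compartment="nucleus"/>
    </listOfSpecies>
  </model>
</sbml>"""


def species_query_terms(sbml_text):
    """Return (species id, query term) pairs for every species element.

    Assumption for this sketch: the name attribute is used as the query
    term when it is present, otherwise the id attribute is used.
    """
    root = ET.fromstring(sbml_text)
    terms = []
    for element in root.iter():
        # Match on the local tag name so any SBML level/version namespace works.
        if element.tag.endswith("}species"):
            species_id = element.get("id")
            name = element.get("name")
            terms.append((species_id, name if name else species_id))
    return terms


for species_id, term in species_query_terms(SBML):
    print(species_id, "->", term)
```

In this sketch the identifier "s3" contributes nothing useful on its own, whereas "Cdc13" can be used directly as a search term, which mirrors the distinction between useful and non-useful species made above.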

11. http://saint-annotate.svn.sourceforge.net/viewvc/saint-annotate/trunk/docs/

3.5.2.1 Annotation of the stripped telomere model

Basic information on the annotation of both the original and the re-annotated telomere model is available in Table 3.1, together with a summary of the comparison between the two models. A detailed listing of the modifications made to the model is available in Table 3.2.

Species                                        55
Species with any annotation†                   51/55
Useful species                                 15
Species with names                             0
Species with SBO terms                         0
Saint-supported URIs in useful species         14/15
Species with names after re-annotation         14
Species with SBO terms after re-annotation     15
URIs after re-annotation                       163
URIs recovered from useful species             14/14

Table 3.1: Basic information on the annotation present in the original version of the telomere model and in bold for the re-annotated version. This model had no species name attributes, and therefore Saint relies on the id attribute. There were also no SBO terms in the original model. Although there are 15 useful species, only 14 contain either a Saint-supported URI, a name or an SBO term; the fifteenth (the “RPA” species) has a useful identifier but contains only a PIRSF URI unsupported by Saint. †Any annotation refers to MIRIAM URIs, names and SBO terms.

The total number of Saint-supported URIs has increased from 14 to 163, with further addition or subtraction of URIs possible if required by the modeller. In addition, Saint recovered all 14 of the URIs from the useful species. The original model contained no SBO terms or species names, and Saint has added 15 SBO terms and 14 species names, which is almost complete coverage of the useful species.

The species “ssDNA” is an example of incorrect annotation by Saint due to a lack of available information. “ssDNA” is difficult for Saint to filter: it is not a protein, but can be used as a phrase within protein names. Currently, the Saint user must spot these errors and remove the suggested annotation using the Saint interface prior to downloading the model. The one useful species which was not completely annotated with an SBO term, species name and URIs was “RPA”. A large amount of annotation was suggested by Saint, but all except the SBO term was manually rejected. In the original annotation, the “RPA” annotation is a single URI which points to the PIR family (PIRSF) “replication protein A, RPA70 subunit” (footnote 12). This URI is marked as having an “isVersionOf” relationship to the RPA species, rather than as an exact “is” match, as the PIRSF URI is linked to mouse, and not yeast. A search for an equivalent for yeast in UniProtKB shows a number of yeast subunits. In theory, an appropriate choice would be made by the modeller. In PIRSF002091, the equivalent yeast UniProtKB accession is listed as P22336, which is “Replication factor A protein 1”, a.k.a. “RFA1”. This leads to the conclusion that the species in the model was named after the mouse protein and not the yeast protein. As “RFA” is not a name that matches the original ID (“RPA”), there is no way to retrieve the correct information from the data sources.

In conclusion, in comparing the annotation on the useful species for this model, all Saint-supported URIs were recovered, and new URIs, SBO terms and species names were added. A detailed listing of all new and recovered annotation present in the useful species is shown in Table 3.2.

SPECIES ID   MATCH                        NEW
Cdc13        urn:miriam:uniprot:P32797    CDC13, SBO:0000245
Rad17        urn:miriam:uniprot:P48581    RAD17, SBO:0000245
Rad24        urn:miriam:uniprot:P32641    RAD24, SBO:0000245
RPA          -                            SBO:0000245
Mec1         urn:miriam:uniprot:P38111    MEC1, SBO:0000245
Exo1I        urn:miriam:uniprot:P39875    EXO1, SBO:0000245
Exo1A        urn:miriam:uniprot:P39875    EXO1, SBO:0000245
Rad9I        urn:miriam:uniprot:P14737    RAD9, SBO:0000245
Rad9A        urn:miriam:uniprot:P14737    RAD9, SBO:0000245
Rad53I       urn:miriam:uniprot:P22216    RAD53, SBO:0000245
Rad53A       urn:miriam:uniprot:P22216    RAD53, SBO:0000245
Chk1I        urn:miriam:uniprot:P38147    CHK1, SBO:0000245
Chk1A        urn:miriam:uniprot:P38147    CHK1, SBO:0000245
Dun1I        urn:miriam:uniprot:P39009    DUN1, SBO:0000245
Dun1A        urn:miriam:uniprot:P39009    DUN1, SBO:0000245

Table 3.2: A detailed breakdown of recovered (in the “MATCH” column) and new (in the “NEW” column) annotation by Saint for the telomere model. Only names, SBO terms and Saint-supported URIs are shown in this table. Additionally, the large number of new GO terms added as MIRIAM URIs is omitted for brevity.
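The split between recovered and new annotation reported in Table 3.2 can be expressed as a pair of set operations per species. The sketch below is illustrative only: the function name compare_annotation and the choice to represent names as "name:…" strings are assumptions for this example, and only three species from Table 3.2 are included.

```python
def compare_annotation(original, reannotated):
    """Classify each species' annotation as recovered, new or lost.

    Both arguments map a species id to a set of annotation strings
    (Saint-supported URIs, SBO terms and names).
    """
    report = {}
    for species_id, original_items in original.items():
        new_items = reannotated.get(species_id, set())
        report[species_id] = {
            "recovered": original_items & new_items,  # the MATCH column
            "new": new_items - original_items,        # the NEW column
            "lost": original_items - new_items,       # present before, absent after
        }
    return report


# Values taken from Table 3.2. RPA's only original URI was a PIRSF URI,
# which Saint does not support, so it is left out here as in the table.
original = {
    "Cdc13": {"urn:miriam:uniprot:P32797"},
    "Mec1": {"urn:miriam:uniprot:P38111"},
    "RPA": set(),
}
reannotated = {
    "Cdc13": {"urn:miriam:uniprot:P32797", "SBO:0000245", "name:CDC13"},
    "Mec1": {"urn:miriam:uniprot:P38111", "SBO:0000245", "name:MEC1"},
    "RPA": {"SBO:0000245"},
}

report = compare_annotation(original, reannotated)
for species_id, result in report.items():
    print(species_id, sorted(result["recovered"]), sorted(result["new"]))
```

Summing the sizes of the "recovered" sets over all useful species would give the numerator of the "URIs recovered from useful species" figure in Table 3.1, provided that only Saint-supported URIs are placed in the original sets.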

3.5.2.2 Annotation of the stripped p53-Mdm2 model

Basic information on the annotation of both the original and the re-annotated p53-Mdm2 model is presented in Table 3.3, together with a summary of the comparison between the two models. A detailed listing of the modifications made to the model is available in Table 3.4. The total number of Saint-supported URIs has increased from 7 to 89, and of the seven Saint-supported URIs present in the original useful species, five were recovered. The other two URIs were from related species which were incorrectly identified, and are described below. The 84 remaining URIs added to the model are GO URIs. The original model contained only two SBO terms, both of which were recovered. Additionally, two new SBO terms were added. Finally, there were no species names in the original model, and two new names were added with Saint.

In contrast to the telomere model, the p53-Mdm2 model contains species whose id attribute was created as a mixture of two protein names, e.g. “Mdm2_p53”. Of the six useful species, four have identifiers formed in this way. For each of these four, Saint suggested annotation based around both of their associated proteins. However, as a composite of two proteins will not necessarily behave in the same way as its constituents, all annotation except for SBO terms and UniProtKB URIs was manually removed. An expert modeller would be able to correctly choose the appropriate name and GO terms. The UniProtKB URIs were kept, but with a relationship of hasPart.
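The handling of composite identifiers described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name plan_qualifiers, the underscore separator and the default of “is” for single-protein identifiers are choices made for this example, and in the work described here the qualifier was changed manually rather than automatically.

```python
def plan_qualifiers(species_id, separator="_"):
    """Split a species id into constituent protein names and pair each
    with a qualifier.

    Assumption for this sketch: an id with several parts is a complex,
    so each constituent is linked with hasPart; a single-part id is
    linked with is.
    """
    parts = [part for part in species_id.split(separator) if part]
    qualifier = "is" if len(parts) == 1 else "hasPart"
    return [(qualifier, part) for part in parts]


for species_id in ("Mdm2_p53", "ARF_Mdm2", "ARF"):
    print(species_id, plan_qualifiers(species_id))
```

Each constituent name would then be searched for separately, and any UniProtKB URI found for it attached to the composite species with the hasPart qualifier.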

Two species, “ARF” and “ARF_Mdm2”, were incorrectly identified. In the original model, both species have an exact “is” match with UniProtKB Q8N726, whose names and synonyms are “Cyclin-dependent kinase inhibitor 2A, isoform 4”, “p14ARF” and “p19ARF”. The annotation procedure by Saint only had the id to use, however, as the MIRIAM URIs were removed in the stripped models. As such, the correct entry could not be returned due to the mismatch of the query term with the UniProtKB names. However, a number of alternative suggestions were provided by Saint.
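The effect of such a mismatch can be illustrated with a small sketch. The way Saint and its underlying data sources match query terms is not described in this section, so the case-insensitive exact comparison used here is purely an assumption; the accessions, names and synonyms are those quoted in the text for the “RPA” and “ARF” species.

```python
# Names and synonyms as quoted in the text for the two UniProtKB entries.
RECORDS = {
    "P22336": ["Replication factor A protein 1", "RFA1"],
    "Q8N726": [
        "Cyclin-dependent kinase inhibitor 2A, isoform 4",
        "p14ARF",
        "p19ARF",
    ],
}


def exact_name_hits(query, records):
    """Return accessions with a name or synonym equal to the query.

    Assumption: matching is case-insensitive and exact. This is a
    simplification for illustration, not Saint's actual search logic.
    """
    wanted = query.strip().lower()
    return [
        accession
        for accession, names in records.items()
        if any(wanted == name.lower() for name in names)
    ]


print(exact_name_hits("RPA", RECORDS))   # species id from the telomere model
print(exact_name_hits("ARF", RECORDS))   # species id from the p53-Mdm2 model
print(exact_name_hits("RFA1", RECORDS))  # the yeast name, for comparison
```

Under this assumption neither species id retrieves its intended entry, whereas a query using the name recorded in UniProtKB does.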

In document Enhancing systems biology models through semantic data integration (Page 106-109)