

3.2 Methods

3.2.4 Context Models for Concept Disambiguation

The previous section shows that a maximum entropy classifier trained on high-quality training data performs best on the task of sense disambiguation. This section shows how Context Models are used to disambiguate concept candidates.

Walkthrough of recognition pipeline

The Concept Recognition, configured with PubMed as the document source and the Gene Ontology as the source of defined concepts, can be executed as follows.

Document Retrieval. The documents to be annotated are PubMed abstracts. The article with the PMID 17963238 is fetched because the user searched for “Tim10”. Along with this article, 44 other articles are fetched as well. The following steps are executed for each article.

Terminology Processing. The concepts to be identified are from the Gene Ontology (GO). One concept defined in the Gene Ontology is DNA binding. The terminology of GO is preprocessed as described in section 3.2. The label of DNA binding is tokenized into two words: ’DNA’ and ’binding’. The word ’DNA’ is an abbreviation and has no stemming variants. The word ’binding’ has the stemming variants ’bind’, ’binds’ and ’binding’. Both words are very frequent in the Gene Ontology: 203 other concepts contain the word ’DNA’ in the label and 540 other concepts contain the word ’binding’ in the label. A mention of one of these words therefore indicates, but does not ensure, that the text refers to the concept DNA binding.
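The label statistics above can be illustrated with a minimal sketch. It assumes a toy terminology of four labels with made-up identifiers (`c1` to `c4`); the function names are hypothetical and do not reflect the actual preprocessing of section 3.2.

```python
from collections import defaultdict

# Toy stand-in for the terminology; identifiers and label selection are
# illustrative assumptions, the real Gene Ontology is far larger.
LABELS = {
    "c1": "DNA binding",
    "c2": "protein binding",
    "c3": "DNA replication",
    "c4": "zinc ion binding",
}


def tokenize(label):
    """Split a concept label into lower-cased words."""
    return [word.lower() for word in label.split()]


def build_word_index(labels):
    """Map every label word to the set of concepts whose label contains it."""
    index = defaultdict(set)
    for concept_id, label in labels.items():
        for word in tokenize(label):
            index[word].add(concept_id)
    return index


def other_concepts_sharing(word, concept_id, index):
    """Count the concepts, other than concept_id, whose label contains word."""
    return len(index.get(word.lower(), set()) - {concept_id})


index = build_word_index(LABELS)
for word in tokenize(LABELS["c1"]):
    print(word, other_concepts_sharing(word, "c1", index))
```

The higher this count is for a label word, the less a single mention of that word says about the concept, which is why a separate disambiguation step is needed.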

Input Text Processing. The document attributes of the citation 17963238 are fetched. The citation has the title “Zinc binding of Tim10: Evidence for existence of an unstructured binding intermediate for a zinc finger protein”. It was published in the journal Proteins in 2007. The abstract and the title are segmented into sentences and words. Four content scopes are defined: word, sentence, paragraph and citation. The scope “citation” includes the journal name and the publication date.

The title and the sentences of the abstract are scanned for concept candidates. One candidate is the concept DNA binding, because the sentence “Next, the stopped-flow fluorescence technique was used to investigate the kinetic process of the binding reaction.” contains a stemming variant of the word ’binding’ from the label of the concept.
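The candidate scan can be sketched as follows. This is an illustration under assumptions: `crude_stem` is a deliberately simple stand-in for a real stemmer such as the Porter Stemmer, and `find_candidates` is a hypothetical helper, not the implementation used here.

```python
import re


def crude_stem(word):
    """Rough stand-in for a real stemmer: strips 'ing' or a plural 's'."""
    w = word.lower()
    for suffix in ("ing", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def find_candidates(sentence, label):
    """Return (start, end, word) for each word sharing a stem with the label."""
    label_stems = {crude_stem(word) for word in label.split()}
    hits = []
    for match in re.finditer(r"[A-Za-z0-9]+", sentence):
        if crude_stem(match.group()) in label_stems:
            hits.append((match.start(), match.end(), match.group()))
    return hits


SENTENCE = ("Next, the stopped-flow fluorescence technique was used to "
            "investigate the kinetic process of the binding reaction.")
hits = find_candidates(SENTENCE, "DNA binding")
```

Each hit carries the character range of the matching word, which is the text range that would later be linked to the concept if the disambiguation accepts the candidate.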

Sense Disambiguation. The curation database contains 251 positive and 163 negative training examples. This allows the meaning of ’binding’ to be distinguished correctly in 82.9% of the cases (see table 3.13). The trained models learned which features of a citation indicate the meaning DNA binding and which do not.

Table 3.10 shows examples of features which can be extracted from the citation. For example, in the scope “title” the stemmed word phrases ’zinc finger’, ’protein’ and ’unstructur bind’ can be identified. The content of the scope “title” is the same for all words in the abstract. In contrast, the scope “sentence” differs for words in different sentences of the abstract. In this case the word ’binding’ is contained in a sentence with stemmed word phrases like ’stoppedflow fluoresc techniqu’, ’investig’ and ’kinet’ (compare with table 3.10). The cooccurrence of such word phrases is used as binary features if the model was trained on the scope “sentence”. Other scopes found in the example include “paragraph”, “journal”, “year” and “citation”. Note that the scope “citation” includes all other scopes.

Binary feature                  Scope      Type
zinc finger                     title      word phrase
protein                         title      word phrase
unstructur bind                 title      word phrase
· · ·                                      word phrase
stoppedflow fluoresc techniqu   sentence   word phrase
investig                        sentence   word phrase
kinet process                   sentence   word phrase
· · ·                                      word phrase
degre of fold                   paragraph  word phrase
zinc concentr                   paragraph  word phrase
· · ·                                      word phrase
Proteins                        journal    title
2000-2010                       year       number

Table 3.10: Examples of binary features extracted from the PubMed abstract below. The word phrase features were stemmed using the Porter Stemmer algorithm. The journal title is used as it is. The years are grouped into decades, so the example falls into the decade 2000-2010.

TITLE: Zinc binding of Tim10: Evidence for existence of an unstructured binding intermediate for a zinc finger protein.

ABSTRACT: ... Comparison between the results of CD and fluorescence studies showed that the zinc-binding reaction is not a simple one-step process. It involves formation of a binding intermediate that is structurally as unfolded as the apoTim10; subsequently, a degree of folding is induced at increased zinc concentrations in the final complex. Next, the stopped-flow fluorescence technique was used to investigate the kinetic process of the binding reaction. Data analysis shows that the reaction has a single kinetic phase at a low free Zn(2+) concentration (approximately 1 nM), and a double kinetic phase at a high free Zn(2+) concentration...

Journal: Proteins 2007. (c) 2007 Wiley-Liss, Inc. (PMID: 17963238)
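The scoped binary features of table 3.10 can be represented as (scope, type, value) triples. The sketch below is an assumption about one convenient representation; the function names are hypothetical and the phrases are copied from the table instead of being computed by a stemmer.

```python
def decade(year):
    """Group a publication year into its decade, e.g. 2007 -> '2000-2010'."""
    start = year - year % 10
    return f"{start}-{start + 10}"


def scoped_features(title, sentence, paragraph, journal, year):
    """Collect binary features as (scope, type, value) triples."""
    features = set()
    features.update(("title", "word phrase", p) for p in title)
    features.update(("sentence", "word phrase", p) for p in sentence)
    features.update(("paragraph", "word phrase", p) for p in paragraph)
    features.add(("journal", "title", journal))
    features.add(("year", "number", decade(year)))
    return features


# Stemmed phrases taken from table 3.10 for the occurrence of 'binding'
# in the stopped-flow sentence of PMID 17963238.
features = scoped_features(
    title=["zinc finger", "protein", "unstructur bind"],
    sentence=["stoppedflow fluoresc techniqu", "investig", "kinet process"],
    paragraph=["degre of fold", "zinc concentr"],
    journal="Proteins",
    year=2007,
)
```

Because the scope is part of the feature, the same phrase can carry a different weight in the title than in the sentence. A feature is binary in the sense that it is either present in this set or absent.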

The outcome of the disambiguation depends on how the model was trained and which threshold was set. The confidence value is computed as described in section 3.2.2. If the confidence value is higher than the threshold, the text range of the word ’binding’ is marked up with a link to the concept DNA binding. In the given example it is correct to link the word ’binding’ to the meaning DNA binding in the Gene Ontology, because zinc fingers are motifs in DNA- and RNA-binding proteins whose amino acids are folded into a single structural unit around a zinc atom.
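The threshold decision can be sketched as follows. The confidence formula of section 3.2.2 is not repeated here, so the sketch assumes the standard two-class maximum entropy form, which is a logistic function of the summed feature weights. The weights, the `bias` parameter and the function names are hypothetical.

```python
import math


def confidence(active_features, weights, bias=0.0):
    """Two-class maximum entropy confidence (assumed form, see lead-in)."""
    score = bias + sum(weights.get(f, 0.0) for f in active_features)
    return 1.0 / (1.0 + math.exp(-score))


def should_link(active_features, weights, threshold=0.5):
    """Link the candidate only if the confidence exceeds the threshold."""
    return confidence(active_features, weights) > threshold


# Hypothetical weights, not values learned by the actual models.
weights = {"zinc finger": 1.2, "questionnair": -2.0}
print(should_link({"zinc finger", "protein"}, weights))
print(should_link({"questionnair"}, weights))
```

Features without a learned weight contribute nothing to the score. Raising the threshold trades recall for precision.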

Using Context Models to predict relevant articles

The idea here was to use Context Models to predict whether PubMed citations are relevant for curators of the Mouse Genome Informatics (MGI) database. To further evaluate the performance of Context Models and to determine which features are learned, a context model was trained on data provided by the MGI database. The algorithm had access only to abstracts, titles, journal names, authors, etc., but not to full text. Afterwards the learned features were examined.

Essentially, the algorithm learned the characteristics of relevant articles from positive and negative examples. Obviously, these involved a lot of mouse terminology, but there were also less obvious hints for good and bad articles. A list of 5000 abstracts of manually curated full text articles was used to train a model to predict whether an article is relevant for MGI’s research on mouse. The resulting models are large: they comprise some 80,000 tokens, each of which is scored for its relevance or irrelevance.

Experiment 1: From the ca. 120,000 PMIDs listed in the MGI corpus, 5000 IDs across all years were randomly selected as a positive data set. As a negative data set, 5000 out of the 2,000,000 most recent PMIDs were randomly selected. The positive and negative data sets were both randomly split into a training set (85%) and a test set (15%). This yields 4250 positive and 4250 negative training examples, and 750 positive and 750 negative test examples.

The learning algorithm learned from the training data set (8500 articles) and was evaluated on the test set (1500 articles). Training and test sets were completely independent; they do not overlap.
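A random 85%/15% split of this kind can be sketched as follows. The function name, the fixed seed and the integer stand-ins for PMIDs are assumptions made for illustration.

```python
import random


def split(items, train_fraction=0.85, seed=0):
    """Shuffle items and cut them into disjoint training and test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer stand-ins for the 5000 positive and 5000 negative PMIDs.
positives = range(5000)
negatives = range(5000, 10000)

pos_train, pos_test = split(positives)
neg_train, neg_test = split(negatives)
print(len(pos_train) + len(neg_train), len(pos_test) + len(neg_test))
```

Since each list is cut at a single position after shuffling, no article can end up in both the training and the test set. Repeating the experiment with different seeds gives the averaged figures.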

The algorithm achieved a success rate of 94.1% (precision 91%, recall 97.5%). These are average values, since the above generation of training and test data was repeated multiple times. For each of the 80,000 tokens the algorithm learned how relevant it is. Some features gave strong negative hints:
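The reported figures can be checked for mutual consistency. The sketch assumes that “success rate” means accuracy on the balanced test set of 750 positive and 750 negative examples. The counts it derives are implied averages, not measured values, so they need not be whole numbers.

```python
def implied_counts(precision, recall, n_pos, n_neg):
    """Derive the confusion-matrix counts implied by precision and recall."""
    tp = recall * n_pos                    # recall = tp / n_pos
    fp = tp * (1 - precision) / precision  # precision = tp / (tp + fp)
    fn = n_pos - tp
    tn = n_neg - fp
    return tp, fp, fn, tn


tp, fp, fn, tn = implied_counts(precision=0.91, recall=0.975,
                                n_pos=750, n_neg=750)
accuracy = (tp + tn) / (750 + 750)
print(f"implied accuracy: {accuracy:.1%}")
```

Under these assumptions the implied accuracy is about 93.9%, which is close to the reported 94.1%. A small gap is expected because the reported numbers are rounded averages over several runs.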

• tokens such as “health”, “manage”, “work”, “questionnair”, “physician”, “sample”, “bioethic”, “speech”, “risk factor”

• journals such as “J. Bacteriol”, “Plant Physiol.”, “Phys. Rev. Lett.”, “Br. Dent. J.”, “J. Anat. Physiol.”, “Mod. Healthc.”, “Bioorg. Med. Chem. Lett.”

• years such as 1900-1910, 1920-1930

Other features gave strong positive hints:

• tokens comprising all kinds of variants and combinations of mouse, but also knockout, +/+, disrupt gene, human nomenclatur, b cell, control type, erythroid

• journals such as Gene, Genet Res, Neuron, Eur J Immunol, Cytogenet Cell Genet, Genomics

As mentioned, the model used 80,000 tokens overall. Together, terminology and meta information such as journal names form an accurate fingerprint of what MGI’s curators annotate and what they do not.

Experiment 2: A further question was how well the trained model rejects abstracts which mention mouse, mice, or murine but which are not in the list of papers previously selected by curators. This was a difficult task, as these negative examples were much closer to the positive ones.

The classification of papers with the learned model gave a confidence value from 0% to 100%, and in the above analysis an article was labeled positive if it had a confidence score >50%. The question was what these values look like for the negative examples with mouse in the abstract. The results show that 1260 predictions out of 4657 (about 27%) were below 50%, i.e. they were correctly labeled as negative examples.

On the other hand, the 5000 positive articles from the MGI PMID list had an average confidence score of 87%, and only 250 out of 5000, i.e. 5%, were erroneously classified as negative.

Experiment 3: Finally, the question was how many of the 120,251 PMIDs previously manually curated by MGI’s curators that do not have “mouse”, “mice” or “murine” in the abstract or title were correctly identified nonetheless.

As a result, 8807 out of 120,251 articles did not mention “mouse”, “mice” or “murine” in the abstract or title, and 8233 of them had a confidence >=50% and were thus correctly identified, i.e. roughly 94% (8233/8807 ≈ 93.5%) of the articles without the mouse keywords were identified nonetheless.
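The proportions reported for Experiments 2 and 3 follow directly from the counts given in the text. The short calculation below only restates those counts; the variable names are chosen here for illustration.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Experiment 2: hard negatives (mouse keywords present) correctly rejected.
exp2_rejected = percent(1260, 4657)
# Experiment 2: known positives wrongly scored below the threshold.
exp2_missed = percent(250, 5000)
# Experiment 3: curated articles without mouse keywords, and the share found.
keyword_free_share = percent(8807, 120251)
exp3_found = percent(8233, 8807)

for name, value in [("exp2_rejected", exp2_rejected),
                    ("exp2_missed", exp2_missed),
                    ("keyword_free_share", keyword_free_share),
                    ("exp3_found", exp3_found)]:
    print(f"{name}: {value:.1f}%")
```

The calculation shows that only about 27% of the hard negatives were rejected, that about 7% of the curated corpus lacks the mouse keywords, and that about 93.5% of those keyword-free articles were still found, which the text reports as 94%.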

Summary. The Maximum Entropy models take as binary features the unigrams, bigrams and trigrams of words in the sentences and in the title. Additionally, the journal title was used as a feature. Table 3.10 shows how the context of the word “binding” is computed in a PubMed abstract. Note that the journal title and the title words are treated separately from phrases in the abstract text. This allows title words to become more important than words from the abstract text if the training data indicates this. Journal title and publication year may help to distinguish vocabulary that is used differently in various research areas and over time. Manual MeSH annotations were not used as features, although they would obviously support the disambiguation task greatly. The aim was to keep the models independent of manual preprocessing of the documents.
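The unigram, bigram and trigram features can be generated with a few lines of code. The sketch assumes the tokens are already stemmed and that stop-word handling happens elsewhere; `ngrams` is a hypothetical helper name.

```python
def ngrams(tokens, max_n=3):
    """Return the set of all word n-grams of length 1 to max_n."""
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features


# Stemmed tokens as they might appear in the title scope of the example.
title_features = ngrams(["zinc", "finger", "protein"])
print(sorted(title_features))
```

Returning a set makes the features binary: a phrase that occurs several times in a scope is recorded only once.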

From the experiments it can be concluded that (1) a context model is very good at finding articles that are of interest for MGI’s research (very high recall of 97%); (2) when evaluated on random articles it also has very high precision (91%); (3) for very difficult cases, where mouse is in the abstract but the article is nonetheless not relevant, the precision is not as good; (4) the context model also identified relevant articles that do not contain the keywords “mouse”, “mice” or “murine” (94% of the previously manually curated articles lacking these keywords); (5) since the classification gives a confidence score, it is possible to retrieve articles, sort them by confidence and thus ensure that the most important ones come first.
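Point (5) can be illustrated with a minimal ranking sketch. The identifiers and confidence values are invented for illustration, and `rank_by_confidence` is a hypothetical helper.

```python
def rank_by_confidence(scored, threshold=0.5):
    """Keep citations at or above the threshold, highest confidence first."""
    kept = [(cid, conf) for cid, conf in scored if conf >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented (identifier, confidence) pairs standing in for scored citations.
scored = [("A", 0.62), ("B", 0.97), ("C", 0.31), ("D", 0.88)]
ranked = rank_by_confidence(scored)
print(ranked)
```

A curator working through such a list from the top sees the most likely relevant citations first and can stop when the confidence becomes too low to be useful.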

To summarize the advantages of the Context Models:

• Context Models outperform approaches relying on the terminology’s structure

• Context Models make use of textual contexts of concepts, e.g. words not defined in the ontology, as well as meta-information, e.g. journal title or publication years, in a single model

• Context Models need only few high-quality training examples

• Context Models can be computed very efficiently

• Context Models are robust against varying granularity of ontologies

• Context Models can be used to rank citations by interest for a group of researchers

The disadvantages of the Context Models are:

• Each ambiguous concept needs separate training

• The training data must be of same/similar scope, e.g. abstracts vs. full texts

• The quality of training data must be high