4.2.3 Annotating documents
Problem:
the words in structured texts hold information but are not linked to meaning
The words and phrases in structured text need to be linked to meaning in order to enable reasoning. A human reader interprets the words and grammar of a text to form a meaningful mental model, taking all available contextual information into account. To process text automatically, explicit semantic links need to be established to the lexical units found in the sentences. Listing 4.1 shows a PubMed citation structured into document attributes, the title and the abstract. The abstract is segmented into sentences. A tagger identified “purines”, “HIV-1” and “RNA” as lexical tokens. The other tokens were not marked up in this example. Additionally, “Dimerization Initiation Site” was identified as the long form of the abbreviation “DIS”. Besides words and phrases, two author names, a journal name and an ID were marked up.
None of these information units is linked to a meaningful conceptualization. A biologist is able to interpret the value of the PMID tag as the unique identifier of a literature citation at www.pubmed.org. Readers of the journal named “Biopolymers” will recognize the name of a printed journal with the unique ISSN 0006-3525, published since 1963 in the United States. Researchers with a biomedical background associate “purines” with chemical compounds, “HIV-1” with a type of virus and “RNA” with a macromolecule encoding genetic information. Furthermore, colleagues of the authors will recognize the authors’ names and associate the publication with the correct individuals. However, PubMed contains entries from a person named “Sponer J” in the Czech Republic, Germany and the US. Without further contextual knowledge it is not possible to determine whether these entries refer to the same individual or to different persons.
<PubMedCitation>
  <PMID>18412127</PMID>
  <Journal>Biopolymers</Journal>
  <Content>
    <Title>Conformational transitions of flanking <t>purines</t>
      in <t>HIV-1</t> <t>RNA</t> DIS kissing complexes</Title>
    <Abstract>
      <Sentence>Dimerization of <t>HIV-1</t> genomic <t>RNA</t>
        is initiated by kissing loop interactions at the
        <lf>Dimerization Initiation Site</lf> (<abbr>DIS</abbr>).
      </Sentence>
      <Sentence>...</Sentence>
      <Sentence>...</Sentence>
    </Abstract>
  </Content>
  <Author>Sponer J</Author>
  <Author>Réblová K</Author>
</PubMedCitation>
Listing 4.1: A structured PubMed citation represented in a simple XML format. Meta-information is stored as children of the document’s root node. The content contains the document title and the abstract. The abstract is segmented into sentences. As illustrating examples, some phrases are marked up: “purines”, “HIV-1” and “RNA” were marked up as tokens, and “Dimerization Initiation Site” was identified as the long form of the abbreviation “DIS”.
Goal:
enable reasoning over document content
This shows that further reasoning is not possible without linking the information in the text to meaningful concepts. The objective was to link phrases in PubMed abstracts and the meta-information of the publication to well-defined concepts in order to enable reasoning over the content.
Reasoning is the process of making inferences from a body of information. Given the information that “HIV-1” is a specific type of RNA virus, which is provided in the background knowledge, and that the citation in the above example refers to “HIV-1”, it is reasonable to conclude that the article mentions a virus. Interestingly, neither the text of the PubMed entry nor the attached keywords refer to the concept virus at all. Without the background knowledge it is not possible to conclude automatically that the article mentions a virus. Other conclusions could be: the journal with the ISSN 0006-3525 publishes research about RNA viruses; a person in the Czech Republic named K. Réblová does research on HIV; a person named J. Sponer publishes work together with K. Réblová.
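The inference described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The dictionary IS_A is a hypothetical fragment of background knowledge, and the function names ancestors and mentions are chosen only for illustration.

```python
# Hypothetical fragment of the background knowledge: concept -> direct parents (is-a).
IS_A = {
    "HIV-1": ["RNA Viruses"],
    "RNA Viruses": ["Viruses"],
    "purines": ["Chemical Compounds"],
}


def ancestors(concept, is_a=IS_A):
    """Return all concepts reachable from `concept` via transitive is-a links."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mentions(annotations, concept, is_a=IS_A):
    """True if any annotation is `concept` itself or one of its descendants."""
    return any(a == concept or concept in ancestors(a, is_a) for a in annotations)


# Annotations of the citation in Listing 4.1, once linked to concepts.
citation_annotations = ["purines", "HIV-1", "RNA"]
print(mentions(citation_annotations, "Viruses"))
```

The concept Viruses never occurs in the annotation list itself. It is reached only through the is-a links, which is the step that requires background knowledge.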
Solution:
semantic markups, links to background knowledge
Document annotation is the process of establishing associations between semantic concepts and documents or parts of documents. Semantic concepts can include ontology concepts as well as author identities, dates, Wikipedia entries, protein or gene names and geographic information. A document annotator may wrap a complex concept recognition algorithm, as described in chapter 3, which takes a structured text and returns the identified concepts. It may also simply translate a document attribute such as the publication date into a computable form, or map internal identifiers, such as journal names, to external identifiers such as ISSN numbers. In this work document annotators were implemented as taggers. A document annotator does not insert new nodes into the structured text but adds attributes of the type ontology concept to phrases, sentences, paragraphs or the document’s root.
Problem:
concept recognition is imperfect and produces wrong/missing annotations
Chapter 3 specifically discusses the open problem of Concept Recognition. One outcome of this research is that the current approaches are imperfect, leading to wrong and missing annotations. Three types of errors can be distinguished: (1) An annotation can be too general, missing the specific meaning in the context. This occurs, for example, when the recognition algorithm misses the correct boundaries of an entity in the text. (2) An annotation can be too specific. This occurs when the recognition does not penalize missing substantiation in the context. (3) The recognition pipeline can miss a concept altogether if the lexical form is not recognized or the disambiguation model fails to classify the concept correctly because of missing training data.
Goal:
make correct statements over large numbers of annotated documents
For a domain expert it is easy to detect these errors by reading the abstract text. However, a typical PubMed query easily returns hundreds of citations, so inspecting the annotations manually is impractical. The goal was to make correct statements about the large number of documents returned for a typical PubMed query, even though the returned documents contain wrong and missing annotations.
The fact that a typical PubMed query returns large sets of citations can be exploited. Well-known biomedical aspects are recurrently stated in several publications. Dooyeweerd [69] suggests an aspectual analysis of text: “To undertake textual analysis, go through phrase by phrase or even word by word, on the grounds that each phrase or word is often there by deliberate human choice, whether that choice is conscious or not. (If a phrase gets in from sheer habit of using and repeating that phrase, then it is perhaps meaningless, and can be ignored.)”
Solution:
exploit recurrent statements in large citation databases
The idea is to group textual phrases in scientific publications into aspects. Textual phrases are statements about facts, theories or hypotheses referring to known concepts. It is assumed that frequently repeated statements are more likely to reflect common knowledge and less likely to be artifacts of an imperfect text analysis. Conversely, infrequent statements are more likely to reflect less assured ideas and more likely to result from text-mining artifacts. Recently made statements have the potential to reflect new insights.
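The frequency assumption can be made concrete with a small Python sketch. The text does not specify how a statement is represented, so the sketch assumes, purely for illustration, that a statement is an unordered pair of concepts annotated in the same sentence and that its support is the number of distinct documents containing it. The document ids, the annotations and the threshold min_docs are invented.

```python
from collections import defaultdict
from itertools import combinations


def statement_support(sentences):
    """Map each concept pair to the set of documents in which it co-occurs.

    `sentences` is an iterable of (document_id, concept_ids) tuples.
    """
    support = defaultdict(set)
    for doc_id, concepts in sentences:
        # Sort and deduplicate so that each pair has one canonical form.
        for pair in combinations(sorted(set(concepts)), 2):
            support[pair].add(doc_id)
    return support


def split_by_support(support, min_docs):
    """Separate recurrent statements from infrequent ones."""
    recurrent = {pair for pair, docs in support.items() if len(docs) >= min_docs}
    infrequent = set(support) - recurrent
    return recurrent, infrequent


# Toy input with invented document ids and annotations.
sentences = [
    (1, ["HIV-1", "RNA"]),
    (2, ["HIV-1", "RNA"]),
    (3, ["RNA", "HIV-1", "purines"]),
    (4, ["purines", "Tuberculosis"]),
]
recurrent, infrequent = split_by_support(statement_support(sentences), min_docs=2)
```

In this toy input only the pair of HIV-1 and RNA is supported by more than one document. The remaining pairs occur once and would be treated with more caution, either as less assured ideas or as possible text-mining artifacts.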
A Lucene index was created containing 91,046,321 sentences and titles of PubMed citations of the last four years. Along with each sentence or title, the document ID, the sentence number and the concept IDs of each annotation and of its ancestors were stored. This resulted in a file-based index of 64.1 GB.
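Storing each annotation together with its ancestors can be illustrated with a toy in-memory index in Python. This is a sketch only: a plain dictionary stands in for Lucene, the class SentenceIndex and its methods are hypothetical, each concept is assumed to have a single parent, and a query is assumed to require every given keyword.

```python
from collections import defaultdict

# Hypothetical is-a links (concept -> parent); one parent per concept for brevity.
PARENT = {
    "Tuberculosis": "Mycobacterium Infections",
    "Mycobacterium Infections": "Actinomycetales Infections",
}


def with_ancestors(concept):
    """Yield the concept followed by all of its ancestors."""
    while concept is not None:
        yield concept
        concept = PARENT.get(concept)


class SentenceIndex:
    """Toy inverted index: concept id -> list of (doc_id, sentence_no, text)."""

    def __init__(self):
        self._by_concept = defaultdict(list)

    def add(self, doc_id, sentence_no, text, annotations):
        # Store the sentence once under every annotation and every ancestor.
        expanded = {c for a in annotations for c in with_ancestors(a)}
        for concept in expanded:
            self._by_concept[concept].append((doc_id, sentence_no, text))

    def search(self, concept, keywords=()):
        """Iterate over sentences stored under `concept` that contain all keywords."""
        for entry in self._by_concept.get(concept, []):
            text = entry[2].lower()
            if all(k.lower() in text for k in keywords):
                yield entry


# Invented example sentence and document id.
index = SentenceIndex()
index.add(101, 0, "Tuberculosis remains a major cause of death.", ["Tuberculosis"])
hits = list(index.search("Actinomycetales Infections", keywords=["death"]))
```

Because the ancestors are written at indexing time, a query for a general concept is a single lookup. The price is that every stored entry depends on the ontology as it was when the index was built.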
The index allows queries of the following type: return an iterator over all sentences in citations that contain one or more keywords and are annotated with a given concept or one of its descendants. The matched concept may thus be an annotation itself or an ancestor of an annotation. This schema assumes that a document which is relevant for a concept is also relevant for a more general concept. Suppose a document mentions the disease Tuberculosis. Following the assumption, the same document is also relevant for Bacterial Infections. The background knowledge contains the following chain of concepts, ordered from general to specific. Each concept stands in an is-a relation to the concept preceding it, and the relation is transitive:
Mesh; Diseases; Bacterial Infections and Mycoses; Bacterial Infections; Gram-Positive Bacterial Infections; Actinomycetales Infections; Mycobacterium Infections; Tuberculosis
This implies: Tuberculosis is a Mycobacterium Infection is a ... is a Disease.
Problem:
changing background knowledge requires re-indexing
All explicit and implicit semantic links are stored in this Lucene index. This makes querying very efficient but makes re-indexing necessary whenever the ontology changes. Another disadvantage of this schema is the large number of index documents resulting from the segmentation of each document into many sentences. A complete re-indexing of many documents is a resource-intensive task. The following section investigates the use of databases to query large sets of annotated documents.
Goal:
reasoning over millions of documents and growing background knowledge
PubMed comprises more than 18 million citations. Every day up to 5000 documents are added or revised. The biomedical knowledge stored in MeSH and the Gene Ontology grows as well: the Gene Ontology project updates its databases on a weekly basis. The re-indexing of all PubMed citations takes several days on a cluster of ten machines. The goal was to reason with changing background knowledge of tens of thousands of concepts over a growing number of millions of documents.
Solution:
design of a relational database for storing annotations
A relational database was designed to allow efficient reasoning over millions of stand-off annotations. Documents were queried separately from the annotations. Annotations were made per abstract, not per sentence. The main database table maps document ids to concept ids (an additional column holds the annotation id as a foreign key). An index was created over the document id column. A second table holds, for each annotation, the confidence and the text range of the annotation. The database can be queried as follows: suppose a biologist is interested in documents mentioning diseases. The background knowledge, which is also stored in the database, is queried for the descendant concepts of Diseases. The result is a temporary table comprising 6527 concept ids. The annotations table is then joined with this temporary table, which represents the diseases branch of the ontology. The result is a table of document ids of citations mentioning the concept Diseases or one of its descendants. The result also contains the references to the annotation details, which can be used to highlight the textual evidence generated by the Concept Recognition algorithm.
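The table layout and the query pattern can be sketched in Python. The sketch rests on several assumptions: it uses the sqlite3 module from the standard library instead of a MySQL server, all table and column names are hypothetical, the rows are invented, and the descendants are collected with a recursive query because the text does not say how the temporary table was filled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_parent (child TEXT, parent TEXT);  -- is-a links
CREATE TABLE annotation (doc_id INTEGER, concept_id TEXT, annotation_id INTEGER);
CREATE TABLE annotation_detail (annotation_id INTEGER PRIMARY KEY,
                                confidence REAL, start_pos INTEGER, end_pos INTEGER);
CREATE INDEX idx_annotation_doc ON annotation (doc_id);
CREATE TEMP TABLE branch (concept_id TEXT PRIMARY KEY);
""")

# Invented toy rows.
conn.executemany("INSERT INTO concept_parent VALUES (?, ?)",
                 [("Bacterial Infections", "Diseases"),
                  ("Tuberculosis", "Bacterial Infections")])
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)",
                 [(1, "Tuberculosis", 10), (2, "RNA", 11)])
conn.executemany("INSERT INTO annotation_detail VALUES (?, ?, ?, ?)",
                 [(10, 0.9, 0, 12), (11, 0.8, 5, 8)])


def documents_mentioning(conn, concept):
    """Return annotations whose concept is `concept` or one of its descendants."""
    # Step 1: fill the temporary table with the ontology branch below `concept`.
    conn.execute("DELETE FROM branch")
    conn.execute("""
        INSERT INTO branch
        WITH RECURSIVE sub(concept_id) AS (
            SELECT ?
            UNION
            SELECT cp.child FROM concept_parent AS cp
            JOIN sub ON cp.parent = sub.concept_id
        )
        SELECT concept_id FROM sub
    """, (concept,))
    # Step 2: join the annotations with the branch and attach the details.
    return conn.execute("""
        SELECT a.doc_id, a.concept_id, d.confidence, d.start_pos, d.end_pos
        FROM annotation AS a
        JOIN branch AS b ON a.concept_id = b.concept_id
        JOIN annotation_detail AS d ON d.annotation_id = a.annotation_id
        ORDER BY a.doc_id
    """).fetchall()


rows = documents_mentioning(conn, "Diseases")
```

In contrast to an index that stores ancestors with every entry, the is-a links live in their own table here. A change to the ontology therefore only touches concept_parent and leaves the annotation rows unchanged.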
Currently the document table contains 17,870,689 entries, and the annotation table contains 412 million entries for GO and MeSH, comprising 46,432 ontology concepts. The database was implemented using MySQL Server 5.0 on a machine with 16 GB of memory. The index size exceeds 14 GB. The main purpose of this database is the efficient retrieval of all semantic markups in a set of document ids. This does not include information about concepts which are related to the involved concepts. The following section describes the processing step necessary to create a graph connecting all these concepts.