entity names in UMLS, as this collection is amenable to a string-similarity-based approach. We observe, however, that this collection is still incomplete. We augment the collection with the missing variations, namely the plural forms of existing nouns as well as the full set of conjugations of existing verbs.
High throughput is achieved by using MinHash [19] as the key ingredient. MinHash is a variant of the Locality Sensitive Hashing (LSH) [22] algorithm that transforms a dictionary lookup into a hash lookup that succeeds with high probability. In terms of linguistic preprocessing, the method requires at most part-of-speech tagging, which is fast, and avoids costlier steps such as dependency parsing.
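To make the idea of turning a dictionary lookup into a hash lookup concrete, the following is a minimal sketch of MinHash with LSH banding. It is an illustration only and not the implementation described in this thesis: the character-trigram shingling, the signature length `NUM_HASHES`, the band count `BANDS`, and all function names are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64  # signature length (assumed for illustration)
BANDS = 16       # number of LSH bands; 64 / 16 = 4 rows per band (assumed)


def shingles(text, n=3):
    """Return the set of character n-grams of a lower-cased, padded string."""
    padded = f"  {text.lower()}  "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_HASHES))


def band_keys(signature):
    """Split a signature into bands; each (band id, band values) pair is a hash key."""
    rows = NUM_HASHES // BANDS
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(BANDS)]


def build_index(names):
    """Map every band key to the dictionary names that produce it."""
    index = defaultdict(set)
    for name in names:
        for key in band_keys(minhash(name)):
            index[key].add(name)
    return index


def lookup(index, mention):
    """Return dictionary names that share at least one band with the mention."""
    candidates = set()
    for key in band_keys(minhash(mention)):
        candidates |= index.get(key, set())
    return candidates


# Usage: index a toy dictionary once, then look up text mentions by hashing.
index = build_index(["ferric oxide", "hemoglobin"])
print(lookup(index, "Ferric Oxide"))
```

A mention whose shingle set is identical to that of a dictionary name (here, differing only in case) is always retrieved. A near variant is retrieved only with some probability, which grows with the Jaccard similarity of the two shingle sets and depends on the chosen number of bands and rows.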
Together with judicious selection of a subset of the UMLS dictionary and simple heuristics for selecting which text mentions to look up, the method achieves up to 83% precision and 78% coverage under a strict rating scheme that penalizes failure in Word Sense Disambiguation (WSD), at a throughput of 1,720 PubMed abstracts or 175 Web pages per minute. The resulting code has been released as open-source software and has made it possible to process large corpora in other biomedical text mining work [39, 40, 41].
4.2 Related Work
Biomedical ER for a Single Sub-domain
ER in the biomedical domain often focuses on a specific sub-domain. Proteins and genes are the most popular sub-domains, and the BioCreative initiative has been driving the BioNLP community with various gene mention recognition [189, 233] and normalization tasks [67, 115, 126]. Out of this large body of work, a number of software tools are publicly available, of which Gimli [20] and ABNER [177] are two notable examples. Since gene and protein names are written in a highly specific but non-standardized manner, recognizing their text mentions continues to be a research challenge. As recently as 2016, Sheikhshab et al. [179] proposed a graph-based method that leverages a two-word window around a gene name in order to improve precision. Recognizing protein-centric entities such as sequence variants has been studied as well [220].
Chemicals are another popular sub-domain for ER because chemical names are written in a completely different but equally specific and non-standardized manner. Chemical names are a mixture of established names (e.g. ferric oxide), chemical elements and their symbols (Fe2O3), and other established word parts such as prefixes and suffixes, all combined into multi-word or long formulaic expressions (Amylo-(1,4,6)-transglycosylase). For chemicals, orthographic features are therefore an important ingredient for recognizing entities, as evidenced in two existing works [9, 103]. Because chemical names are so variable, no dictionary can hope to list all possible names exhaustively. As a result, existing works [9, 166, 212] leverage a dictionary as a starting point and refine the intermediate results with more sophisticated machinery such as Conditional Random Fields (CRFs). There has also been an effort [95] to draw upon multiple methods and combine their results in an ensemble manner.
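As an illustration of what orthographic features for chemical names can look like, the sketch below derives a few surface cues and a compressed word shape from a single token. The feature names and the particular selection are assumptions made for this example; they are not taken from the cited works [9, 103].

```python
import re


def word_shape(token):
    """Map characters to classes (A, a, 0) and collapse repeated classes."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def orthographic_features(token):
    """Surface-level cues that a sequence labeler such as a CRF could consume."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_inner_upper": any(c.isupper() for c in token[1:]),
        "has_hyphen": "-" in token,
        "has_bracket": any(c in "()[]{}" for c in token),
        "has_comma": "," in token,
        "shape": word_shape(token),
    }


# Usage: a chemical formula and an ordinary word yield clearly different features.
print(orthographic_features("Fe2O3"))
print(orthographic_features("ferric"))
```

Under these assumptions, a formula such as Fe2O3 is marked by digits and an inner capital letter, whereas an ordinary lower-case word triggers none of the cues, which is the kind of contrast a statistical tagger can exploit.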
Besides recognizing proteins, genes, and chemicals, there are also works focusing on other sub-domains such as anatomical parts [151], cell lines [87], diseases [106, 138, 171, 240], drugs [96], malignancies [86], organisms and species [51, 132], as well as entities related to a single, highly specific biological system (bacterial type IV secretion system) [3].
Biomedical ER for Multiple Sub-domains
A number of existing approaches [91, 135, 176, 180, 191, 208, 235] are in principle applicable to all entity types. In practice, however, these approaches evaluate their performance using one dominant gold standard, the GENIA corpus [90]. This corpus contains annotated text mentions belonging to 6 entity types: proteins, DNAs, RNAs, chemicals, cells, and cell lines. As a result, how well these approaches generalize beyond these 6 entity types remains to be studied. A review by Funk et al. [48] provides a detailed analysis of further ER approaches targeting these entity types, using the larger and more recent CRAFT corpus [7].
Biomedical ER methods that truly tackle all sub-domains are relatively scarce. The seminal work by Frantzi et al. [47] proposes the C-value / NC-value method to recognize multi-word terms in an unsupervised manner. BANNER [100], a CRF-based method that deliberately forgoes a dictionary, is another milestone contribution that became a building block for later, larger systems. For general-purpose biomedical ER, however, MetaMap remains the most widely used software tool and is regarded as the de facto standard. Alternatives such as the BioPortal API, MaxMatcher [238], and NOBLE [209] exist to tackle any text genre, while cTakes [173]⁴ is specifically designed to tackle clinical text. A survey by Neves and Leser [139] offers a comprehensive overview of entity annotation tools. The League Table [160] was an effort to supply an online platform to compare and benchmark different annotation tools against multiple gold standards; this service seems to have been decommissioned. At the time of writing of this thesis, BeCalm⁵ has just begun as a new initiative to provide an annotation metaserver.
Dictionary Construction and Enrichment
Apart from the task of looking up entities, there have also been efforts to enrich the dictionary upon which lookups are performed. One such contribution is BioLexicon [204], a linguistic resource that catalogs over 2.2 million lexical entries, featuring entity names as well as words specific to the biomedical domain in all their lexical variations. Other sub-domain-specific efforts include enriching a dictionary for abbreviations [141], and constructing dictionaries from scratch for chemical names [66] and human phenotypes [31, 93].
⁴ Latest version available at ctakes.apache.org
5