CHAPTER 2: BACKGROUND
2.7 Unsupervised Approaches to Relation Extraction
In this section, we will review a few of the most recent unsupervised approaches to relation extraction.
Hasegawa et al. (2004) presented an unsupervised approach for discovering relations among named entities in newspaper text. Their approach clusters named-entity pairs according to the similarity of the context words intervening between them. The relation discovery process rests on the assumption that pairs of named entities co-occurring in similar contexts can be grouped together in a cluster. After NER, two named entities are considered to co-occur if they appear within the same sentence and are separated by at most N intervening words. A vector space model and the cosine similarity measure were employed to calculate the similarity between the sets of contexts of named-entity pairs. The approach used a maximum of 5 context words between named entities and a frequency threshold of 30 co-occurrences for each named-entity pair. The approach achieved good precision and recall, but one drawback is that, because of the high frequency threshold, the system was unable to discover some valuable relations.
15http://webdocs.cs.ualberta.ca/~lindek/minipar.htm
16www.connexor.com/software/syntax/
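The co-occurrence filter and the context similarity described for Hasegawa et al. (2004) can be illustrated with a minimal sketch. It is not the authors' implementation: the input format, the function names, the pairing of adjacent entities only, and the use of raw term counts as vector weights are all assumptions made for illustration, and the frequency threshold is lowered from 30 to 2 so that a toy corpus produces output. The sketch stops at pairwise similarity and does not perform the clustering step.

```python
import math
from collections import Counter, defaultdict

MAX_CONTEXT_WORDS = 5    # maximum intervening words, as reported for the approach
MIN_PAIR_FREQUENCY = 2   # toy value for this example; the reported threshold is 30


def collect_contexts(sentences):
    """Build a bag of intervening words for each named-entity pair.

    Each sentence is a (tokens, entities) tuple; entities is a list of
    (start, end, name) token spans from some NER step (hypothetical format).
    """
    contexts = defaultdict(Counter)
    frequency = Counter()
    for tokens, entities in sentences:
        ordered = sorted(entities)
        # Assumption: only adjacent entities within a sentence are paired.
        for (_, end_a, name_a), (start_b, _, name_b) in zip(ordered, ordered[1:]):
            between = tokens[end_a:start_b]
            if len(between) <= MAX_CONTEXT_WORDS:
                pair = (name_a, name_b)
                contexts[pair].update(word.lower() for word in between)
                frequency[pair] += 1
    # Drop pairs that co-occur too rarely.
    return {pair: bag for pair, bag in contexts.items()
            if frequency[pair] >= MIN_PAIR_FREQUENCY}


def cosine(a, b):
    """Cosine similarity between two bags of words (raw counts as weights)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented toy sentences; entity names are placeholders.
sentences = [
    (["Acme", "acquired", "Bolt", "yesterday"], [(0, 1, "Acme"), (2, 3, "Bolt")]),
    (["Acme", "has", "acquired", "Bolt"], [(0, 1, "Acme"), (3, 4, "Bolt")]),
    (["Core", "acquired", "Dyne"], [(0, 1, "Core"), (2, 3, "Dyne")]),
    (["Core", "finally", "acquired", "Dyne"], [(0, 1, "Core"), (3, 4, "Dyne")]),
    (["Echo", "visited", "Fargo"], [(0, 1, "Echo"), (2, 3, "Fargo")]),
]

contexts = collect_contexts(sentences)
similarity = cosine(contexts[("Acme", "Bolt")], contexts[("Core", "Dyne")])
print(sorted(contexts), round(similarity, 2))
```

In this toy run the pair that occurs only once is discarded by the frequency threshold, which mirrors the drawback noted above: a high threshold removes infrequent pairs even when they express a valid relation.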
Sekine (2006) and Shinyama and Sekine (2006) presented two unsupervised approaches to IE known as ‘On-demand IE’ and ‘Pre-emptive IE’ respectively. The basic motivation behind both approaches was to identify the most salient relations in documents and to extract information on user demand by employing unsupervised learning methods. The on-demand IE system (Sekine, 2006) extracts salient relations from text based on a user query and builds tables from these extracted relations using paraphrase discovery technology. The system makes use of advances in pattern discovery, paraphrase discovery and extended NE tagging. It retrieves relevant documents from a newspaper corpus based on the user query and then applies a PoS tagger, a dependency analyser and an extended NE tagger to extract patterns from those documents. The extracted patterns are then grouped into sets of similar patterns by applying paraphrase recognition. A table is created for each pattern set that contains more than two patterns. Shinyama and Sekine (2006), in pre-emptive IE, apply NER, coreference resolution and parsing to a newspaper corpus in order to extract relations between NEs. The approach uses unrestricted relation discovery to discover all possible relations in the texts and present them as tables. In unrestricted relation discovery, relations that appear repeatedly in a corpus are extracted automatically, without human intervention. The extracted relations are grouped into pattern tables of NE pairs expressing the same relation, with clustering used to group semantically similar relations.
Etzioni et al. (2008) presented an unsupervised approach to RE that uses the Web as a corpus. Their approach used a corpus of 9 million web pages to automatically extract all relations between noun phrases. The main contribution is an open RE system known as TEXTRUNNER, which consists of three key modules: a self-supervised learner, a single-pass extractor and a redundancy-based assessor. The self-supervised learner produces a classifier from a small sample corpus without any hand-tagged data. This classifier labels candidate extractions as ‘trustworthy’ or not. The single-pass extractor makes a single pass over the whole corpus to extract tuples for all possible relations. These tuples are then sent to the classifier, and only those labelled as trustworthy are kept. The redundancy-based assessor assigns a probability score to each trustworthy tuple based on a probabilistic model of redundancy in text (Downey et al., 2005). The experimental results reported in the paper show that TEXTRUNNER achieves a 33% relative error reduction for a comparable number of extractions when compared with the state-of-the-art Web RE system KNOWITALL (Etzioni et al., 2005). TEXTRUNNER also achieved higher precision than KNOWITALL.
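The flow of data through the three modules can be sketched as follows. This is only an illustration of the pipeline shape, not TEXTRUNNER itself: the chunked input format and all function names are hypothetical, the learned classifier is replaced by a fixed length heuristic, and the redundancy model of Downey et al. (2005) is replaced by a simple noisy-or score with an assumed per-mention precision.

```python
from collections import Counter


def extract_candidates(chunked_sentence):
    """Yield (arg1, relation, arg2) for adjacent noun phrases with text between.

    chunked_sentence is a list of (text, tag) pairs where tag is "NP" for a
    noun phrase and "O" otherwise (hypothetical input format).
    """
    np_positions = [i for i, (_, tag) in enumerate(chunked_sentence) if tag == "NP"]
    for left, right in zip(np_positions, np_positions[1:]):
        relation = " ".join(text for text, _ in chunked_sentence[left + 1:right])
        if relation:
            yield (chunked_sentence[left][0], relation, chunked_sentence[right][0])


def is_trustworthy(candidate, max_relation_tokens=4):
    """Stand-in for the self-supervised classifier: accept short relation phrases."""
    _, relation, _ = candidate
    return len(relation.split()) <= max_relation_tokens


def assess(counts, per_mention_precision=0.6):
    """Stand-in assessor: noisy-or over the number of times a tuple was seen."""
    return {tup: 1 - (1 - per_mention_precision) ** k for tup, k in counts.items()}


def run_pipeline(corpus):
    counts = Counter()
    for sentence in corpus:                      # single pass over the corpus
        for candidate in extract_candidates(sentence):
            if is_trustworthy(candidate):        # keep only trusted tuples
                counts[candidate] += 1
    return assess(counts)                        # score by redundancy


# Invented toy corpus.
corpus = [
    [("Edison", "NP"), ("invented", "O"), ("the phonograph", "NP")],
    [("Edison", "NP"), ("invented", "O"), ("the phonograph", "NP")],
    [("Edison", "NP"), ("who", "O"), ("was", "O"), ("not", "O"),
     ("actually", "O"), ("born", "O"), ("in", "O"), ("Ohio", "NP")],
]

scores = run_pipeline(corpus)
for tup, score in scores.items():
    print(tup, round(score, 2))
```

The point of the sketch is the ordering: extraction and filtering happen together in one pass, and scoring is deferred until counts over the whole corpus are available, so a tuple seen many times receives a higher score than one seen once.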
Eichler et al. (2008) presented an unsupervised RE system (IDEX) which automatically extracts information regarding an input topic provided by the user. Documents relevant to the given topic are retrieved, and the relations extracted from them are clustered in an unsupervised way. IDEX employs LingPipe18 for sentence boundary detection, NER and coreference resolution. Only sentences that contain at least two NEs are considered for relation extraction. These selected sentences are then parsed using the Stanford parser19. IDEX then extracts all verb relations, i.e. for each verb its subject(s), object(s), preposition(s) with arguments and auxiliary verb(s), and keeps only those verb relations in which at least the subject or the object is an NE. Extracted relations are grouped into relation clusters based on their similarity. For the experiments, IDEX used the Berlin Central Station corpus, which comprises 1068 web pages downloaded from Google and consists of 55255 sentences; 10773 relation instances were automatically extracted from those sentences and clustered. The system produced 306 clusters, of which 121 were deemed consistent (i.e. all instances in the cluster express similar relations), 35 partly consistent and 69 not consistent.
18http://alias-i.com/lingpipe/
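The two filtering rules described for IDEX (at least two NEs per sentence, and at least one NE among a verb's subjects or objects) can be sketched as below. The `VerbRelation` class, the function names and the input format are hypothetical and chosen for illustration; the sketch assumes NER and parsing have already been run by external tools and does not reproduce IDEX's own data structures or its clustering step.

```python
from dataclasses import dataclass, field


@dataclass
class VerbRelation:
    """A verb with its subject and object arguments (simplified, hypothetical)."""
    verb: str
    subjects: list = field(default_factory=list)
    objects: list = field(default_factory=list)


def has_enough_entities(entities, minimum=2):
    """Sentence-level filter: require at least `minimum` named entities."""
    return len(entities) >= minimum


def involves_entity(relation, entities):
    """Relation-level filter: a subject or an object must be a named entity."""
    return any(arg in entities for arg in relation.subjects + relation.objects)


def filter_relations(parsed_sentences):
    """Apply both filters to (entities, relations) pairs, one per sentence."""
    kept = []
    for entities, relations in parsed_sentences:
        if not has_enough_entities(entities):
            continue
        kept.extend(r for r in relations if involves_entity(r, entities))
    return kept


# Invented toy input: a set of NE strings and the verb relations per sentence.
parsed_sentences = [
    ({"Deutsche Bahn", "Berlin"},
     [VerbRelation("opened", ["Deutsche Bahn"], ["the station"]),
      VerbRelation("arrived", ["trains"], ["the platform"])]),
    ({"Berlin"},
     [VerbRelation("has", ["Berlin"], ["a new station"])]),
]

kept = filter_relations(parsed_sentences)
print([relation.verb for relation in kept])
```

Note that the second sentence is skipped entirely even though its relation has an NE subject, because the sentence-level filter is applied first.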