recall, but it is unclear how the approach would perform in a less controlled setting.
Plank et al. (2014) also exploit links in tweets for part-of-speech tagging of tweets, but instead of restricting themselves to particular websites and collecting labels, they use links to retrieve richer linguistic information from the linked websites. Crucially, linked websites are only used during training and are not required during testing. During training, words appearing in the tweet are aligned with words on linked websites, so that more context for those words is available. The tag most frequently assigned to those words on the website is then projected to the occurrence of the word in the tweet. They call their method “not-so-distant supervision” and indeed, the method is only vaguely related to distantly supervised relation extraction. The general method of acquiring additional information from linked websites is a strategy also used for Twitter-based entity linking (Gorrell et al., 2015) and could also be used for relation extraction from social media or Web data.
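The projection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `project_tags`, the alignment by exact lowercased string match, and the toy tags are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def project_tags(tweet_tokens, tagged_page_tokens):
    """Project the most frequent website tag of each word onto the tweet.

    tweet_tokens: list of str, the tokens of the tweet.
    tagged_page_tokens: list of (word, tag) pairs from the linked website,
        assumed to have been tagged beforehand by some existing tagger.
    Returns a list of (token, tag) pairs; the tag is None when the word does
    not occur on the linked page, so nothing can be projected.
    """
    # Count how often each (lowercased) word receives each tag on the page.
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_page_tokens:
        tag_counts[word.lower()][tag] += 1

    projected = []
    for token in tweet_tokens:
        counts = tag_counts.get(token.lower())
        # Assumption: alignment is exact string match on lowercased tokens.
        tag = counts.most_common(1)[0][0] if counts else None
        projected.append((token, tag))
    return projected


tweet = ["Apple", "shares", "rise"]
page = [("Apple", "NOUN"), ("shares", "NOUN"), ("shares", "VERB"),
        ("shares", "NOUN"), ("fell", "VERB")]
print(project_tags(tweet, page))
# [('Apple', 'NOUN'), ('shares', 'NOUN'), ('rise', None)]
```

Words without a match on the linked page receive no projected tag, which mirrors the point that the linked website only adds context where the tweet and the page overlap.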
Fan et al. (2015) propose to use a variant of distant supervision with Freebase for entity linking. Entity linking approaches are typically trained with Wikipedia. In Wikipedia, text is annotated with links, most of which point to named entities. These annotations can then be used to train models for entity linking. The idea of Fan et al. (2015) is to achieve something similar with Freebase instead of Wikipedia. To do so, they make use of the Freebase property /common/topic/topic_equivalent_webpage to collect Web pages which are known to be about specific entities. Whenever they find an entity’s name on those Web pages, they annotate it with the entity’s Freebase ID. This can be used to train an entity linking approach with Freebase as a background knowledge base and with the linked Web pages, in addition to Wikipedia pages, as text. It would be interesting to see how useful the topic-related Web pages alone would be for training distantly supervised relation extractors. In particular, the authors do not discuss how many Freebase entries have such linked Web pages, so it would be interesting to study how many entities have them and how useful those pages are for training relation extractors.
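The annotation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Fan et al. (2015): the function name `annotate_page`, the matching by exact name on word boundaries, and the identifier `ENTITY_ID_PLACEHOLDER` are hypothetical.

```python
import re


def annotate_page(page_text, entity_names, entity_id):
    """Annotate every occurrence of an entity's name with its identifier.

    page_text: text of a Web page known to be about the entity.
    entity_names: names or aliases of that entity.
    entity_id: knowledge base identifier to attach to each mention.
    Returns a list of (start, end, surface_form, entity_id) tuples.
    """
    # Longer names come first so "Ada Lovelace" is preferred over "Lovelace".
    names = sorted({n for n in entity_names if n}, key=len, reverse=True)
    if not names:
        return []
    # Assumption: names begin and end with word characters, so \b applies.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
    return [(m.start(), m.end(), m.group(0), entity_id)
            for m in pattern.finditer(page_text)]


text = "Ada Lovelace was a mathematician. Lovelace wrote notes."
for mention in annotate_page(text, ["Ada Lovelace", "Lovelace"],
                             "ENTITY_ID_PLACEHOLDER"):
    print(mention)
```

The resulting spans play the same role as link annotations in Wikipedia text: each mention is paired with a knowledge base identifier and can serve as a training example for entity linking.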
2.5 Limitations of Current Approaches
Most distant supervision approaches (Section 2.4.2) use the same distant supervision paradigm for creating training as well as test corpora, with the exception of Roller and Stevenson (2014), who try to map relations to existing gold standard corpora and then reuse those. What existing approaches do not evaluate is relation extraction across sentences using coreference resolution, as annotated, for example, in gold standards such as the ACE 2005 Multilingual Corpus. Further, approaches which perform instance-level extraction for knowledge base population focus on relation extraction and leave validation of extractions, a popular task as part of the TAC KBP challenges, to future work.
Section 2.4.3 explains that the distant supervision assumption can cause incorrect labelling and summarises different methods for improving on that. At-least-one models make the assumption that at least one of the sentences in which an entity pair is mentioned is a positive training example. The suitability of the context for a relation is learned in concert with the suitability of an entity pair for a relation. However, this assumption can fail, which then leads to low performance.
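The difference between the plain distant supervision assumption and the at-least-one assumption can be illustrated with a toy sketch. It only illustrates the two labelling assumptions; the models discussed in Section 2.4.3 learn the sentence-level decision jointly within a graphical model rather than with a fixed scoring function. All function names and the scoring function `toy_score` are hypothetical.

```python
from collections import defaultdict


def group_into_bags(mentions):
    """Group sentences by the entity pair they mention."""
    bags = defaultdict(list)
    for sentence, subj, obj in mentions:
        bags[(subj, obj)].append(sentence)
    return dict(bags)


def label_distant(bags, kb_pairs):
    """Plain distant supervision: every sentence of a known pair is positive."""
    return {pair: list(sents) for pair, sents in bags.items()
            if pair in kb_pairs}


def label_at_least_one(bags, kb_pairs, score):
    """At-least-one: only the best-scoring sentence of a known pair is
    treated as the positive example; the others are left undecided."""
    return {pair: max(sents, key=score) for pair, sents in bags.items()
            if pair in kb_pairs}


def toy_score(sentence):
    # Hypothetical stand-in for a learned sentence-level model.
    return 1.0 if "born in" in sentence else 0.0


mentions = [
    ("Obama was born in Hawaii.", "Obama", "Hawaii"),
    ("Obama visited Hawaii last week.", "Obama", "Hawaii"),
]
kb_pairs = {("Obama", "Hawaii")}  # toy knowledge base entry for a birthplace relation
bags = group_into_bags(mentions)
print(label_distant(bags, kb_pairs))
print(label_at_least_one(bags, kb_pairs, toy_score))
```

Under the plain assumption the second sentence becomes a false positive, whereas the at-least-one variant avoids it. If no sentence in the bag actually expresses the relation, the at-least-one variant still selects one, which is the failure case mentioned above.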
In addition, performing inference in the context of learning the graphical models can be very expensive. Pattern-based models do not have the at-least-one restriction. However, they still rely on expensive graphical models and, in addition, they rely on expressing relations in terms of patterns. Combining data labelled with the distant supervision assumption with manually labelled data is sensible for some use cases where such data already exists (e.g. for the TAC KBP challenges); however, for most scenarios, additional manually annotated training data is not available. Approaches addressing the problem of false negatives do so by avoiding the use of negative training data; by incorporating latent variables and performing inference over the search space; or by using pseudo-relevance feedback. The best-performing approach of the ones discussed (Min et al., 2013) only shows minor improvements over MIML, an approach addressing the problem of false positives, but is computationally expensive. Moreover, the problem might be specific to the relations selected and also depends on how negative training data is selected. The research in this thesis therefore only focuses on the problem of false positives. Approaches combining distant supervision with supervised data using the ACE or KBP 2011 corpus show an improvement over both training distantly supervised RE models alone and training supervised models alone. However, the reason for this is not uncovered: is it because currently available hand-labelled RE corpora are small and can still benefit from more training data, even if it is noisy, or does the training data happen to be complementary? Future work still needs to uncover the relationship between the size of manually labelled RE data and the usefulness of additional automatically generated RE data.
Section 2.4.4 shows that most distant supervision methods use Stanford NERC for pre-processing. There are two methods which use fine-grained NERC for distant supervision (Ling and Weld, 2012; Liu et al., 2014) and show promising results. However, both of them rely on Wikipedia for generating training data by specifically exploiting the anchor text of entity mentions and Wikipedia categories to map to Freebase types. For NE types not annotated in such a resource, or for testing documents which are not very similar in style to Wikipedia articles and would thus be considered out of genre, this approach would not be suitable. Overall, research on fine-grained NERC for distant supervision shows promise, but still leaves much room for future work. Most importantly, existing distant supervision methods view NERC as a preprocessing step. Such a pipeline architecture can lead to errors made at an earlier stage of the pipeline (e.g. NERC) being propagated to a later stage of the pipeline (e.g. RE). Future work could focus on jointly learning models for those tasks, thus learning dependencies between the stages.
2.6 Summary
Distant supervision, a relation extraction method that uses relations defined in a background knowledge base to automatically label training data, has become a popular research area since 2009. Research efforts have mostly focused on improving automatic labelling to reduce false positives and false negatives (Section 2.4.3), and there has been some work on improving NERC for distant supervision (Section 2.4.4), and on integrating distant supervision with Open IE (Riedel et al., 2013). Distant supervision approaches further differ with respect to what knowledge base
and corpus they use: most approaches use Freebase and either Wikipedia or the New York Times corpus, and a handful use the YAGO knowledge base or biomedical knowledge bases and biomedical corpora (Section 2.4.1). Distant supervision is used for either sentence-level or instance-level extraction, and some approaches try to reuse gold standards as test corpora, whereas most perform a held-out evaluation (Section 2.4.2). Applications of distant supervision include semantic role labelling, Twitter tagging, parsing, classifying YouTube labels and entity linking (Section 2.4.5), which further demonstrate the usefulness of distant supervision.