3.4.2 NERC for Distant Supervision
Lessons learnt from those experiments are then applied to investigate how NERC can be improved for distant supervision on the Web genre. Research in this area is currently limited; most work on distant supervision focuses on reducing noise in heuristic labelling (which is also studied in this thesis, see Section 3.3).
Previous studies on improving NERC for distant supervision hypothesise that the main problem is that Stanford NERC produces coarse-grained NE labels, which are not always a good fit for relation types. They therefore propose training a NERC with fine-grained NE types (Ling and Weld, 2012; Liu et al., 2014) using Wikipedia. However, such an approach only works if additional annotated NERC training data is available for that genre. Although their fine-grained NERC, FIGER, might also perform better than Stanford NERC in out-of-genre scenarios, a drop in performance would still be expected compared to using NERC training data for the same genre.
The aim of the research described in Chapter 6 is to jointly train a NERC and a relation extractor using only the training data automatically annotated with the distant supervision assumption. Traditionally, NLP tasks use a pipeline architecture, where models for different parts of the pipeline (e.g. NEC, RE) are trained separately. However, this ignores the fact that there are dependencies between the different tasks. In addition, an error made at an early stage of the pipeline is propagated to the later stages. Such errors can be reduced by jointly learning models for the different stages, since the dependencies between the tasks are then learned. Methods explored for this in the context of natural language processing include integer linear programming (Roth and Yih, 2004, 2007; Galanis et al., 2012) and Markov logic networks (Domingos et al., 2008; Riedel et al., 2009). Ideally, all possible dependencies between the tasks would be explored by performing full inference over the search space. However, this is computationally very expensive. A cheaper method is to explore only those parts of the search space which are likely to be relevant. One way of doing this is with imitation learning.
The joint approach proposed in this thesis therefore uses the structured prediction method imitation learning (Ross et al., 2011). The approach is compared against a pipeline approach with both Stanford NERC and FIGER for the NEC component and a subsequent distantly supervised RE. The hypothesis is that, for some of the relations, a joint approach with imitation learning outperforms a distant supervision approach with supervised NERC as a preprocessing step. Those relations are the ones between “non-standard” NE types such as “album”, which do not correspond directly to an NE type the supervised NERC is trained for.
3.5 Training and Feature Extraction
3.5.1 Training
Distant supervision approaches use a variety of learning methods, ranging from simple classifiers such as SVMs or MaxEnt models to tensor models or RNNs. For the experiments on selecting training samples (Section 3.3), a simple MaxEnt classifier is used so that results can be compared against Mintz et al. (2009) as a baseline.
For the experiments on joint named entity recognition and relation extraction, the imitation learning algorithm DAgger (Ross et al., 2011) is used. The aim is to study whether the same distantly labelled training data can successfully be used to train two models, a named entity classifier and a relation classifier, that together outperform an approach which uses a supervised NEC and a distantly supervised RE, as described in Section 3.4.
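The general shape of the DAgger loop can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the majority-vote policy, the state representation (token plus previously predicted label), the halving schedule for beta, and the use of gold labels as the expert are all assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict


def train_policy(dataset):
    """Toy learner (assumption): map each state to its most frequent expert action."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {state: c.most_common(1)[0][0] for state, c in votes.items()}
    return lambda state: table.get(state, "O")


def dagger(sentences, n_iterations=3, beta0=1.0, seed=0):
    """Aggregate expert-labelled states visited by a mixture of expert and learned policy."""
    rng = random.Random(seed)
    dataset = []                      # aggregated over all iterations
    policy = lambda state: "O"        # initial policy, unused while beta == 1
    for i in range(n_iterations):
        beta = beta0 * (0.5 ** i)     # probability of following the expert
        for tokens, gold in sentences:
            prev = "<s>"
            for token, gold_label in zip(tokens, gold):
                state = (token, prev)
                # The expert (here: the gold label) labels every visited state.
                dataset.append((state, gold_label))
                # The next state depends on the action actually taken.
                action = gold_label if rng.random() < beta else policy(state)
                prev = action
        policy = train_policy(dataset)
    return policy


def tag(policy, tokens):
    """Apply the learned policy left to right, feeding back its own predictions."""
    labels, prev = [], "<s>"
    for token in tokens:
        prev = policy((token, prev))
        labels.append(prev)
    return labels


if __name__ == "__main__":
    data = [(["Abbey", "Road", "is", "an", "album"],
             ["ALBUM", "ALBUM", "O", "O", "O"])]
    print(tag(dagger(data), data[0][0]))
```

The point of the aggregation step is that later iterations collect training examples from states the learned policy itself reaches, rather than only from states on the expert trajectory.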
3.5.2 Feature Extraction
Most distant supervision approaches use standard relation extraction features, such as the context around the relation candidate, the words between the subject and object candidate and the dependency path between the subject and object candidate (Mintz et al., 2009; Hoffmann et al., 2011).
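As a rough illustration of the shallow features named above, the following sketch extracts the words between the two candidates and a context window around them. The function name, the feature naming scheme and the window size are hypothetical, and dependency path features are omitted because they would require a parser.

```python
def relation_features(tokens, subj_span, obj_span, window=2):
    """Shallow features for one relation candidate.

    Spans are (start, end) token offsets with an exclusive end.
    """
    (s_start, s_end), (o_start, o_end) = subj_span, obj_span
    subj_first = s_start < o_start
    # End of the first argument and start of the second one.
    first_end, second_start = (s_end, o_start) if subj_first else (o_end, s_start)
    left_edge, right_edge = min(s_start, o_start), max(s_end, o_end)

    between = tokens[first_end:second_start]
    left = tokens[max(0, left_edge - window):left_edge]
    right = tokens[right_edge:right_edge + window]

    features = ["between=" + "_".join(between)]                 # lexicalised, sparse
    features += ["bow_between=" + w.lower() for w in between]   # bag of words, frequent
    features += ["left=" + "_".join(left), "right=" + "_".join(right)]
    features.append("subj_first=" + str(subj_first))
    return features


if __name__ == "__main__":
    sentence = "The Beatles released Abbey Road in 1969 .".split()
    print(relation_features(sentence, (0, 2), (3, 5)))
```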
For the experiments in Chapter 4, the relation features proposed in Mintz et al. (2009) are used to keep results comparable. In Chapter 6, feature selection is then studied in more detail. In particular, the goal is to study whether low-precision high-frequency features such as bag of words features, high-precision low-frequency features such as lexicalised dependency paths, or a mix of the two leads to the highest performance. Results reported in Mintz et al. (2009) suggest that there is very little difference between the performance of shallow features such as bag of words features and semantic features such as dependency features. However, for a multi-stage learning approach with NEC followed by RE, it is plausible that results could be different. The second stage (RE) is only reached if the first stage (NEC) indicates that the NEs are of the correct types. Therefore, it might be beneficial for the first stage to have high recall to make sure relevant RE candidates are not discarded. For the second stage, it might then be important to have high precision to make the correct prediction.
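The interaction between the two stages can be made concrete with a small sketch. The thresholding scheme, the field names and the scores below are invented for illustration only; they are not results or settings from this thesis.

```python
def two_stage_extract(candidates, nec_threshold, re_threshold):
    """Return ids of candidates accepted by NEC first and then by RE."""
    extracted = []
    for cand in candidates:
        # Stage 1 (NEC): candidates rejected here never reach the RE stage.
        if cand["nec_score"] < nec_threshold:
            continue
        # Stage 2 (RE): a strict threshold keeps the final predictions precise.
        if cand["re_score"] >= re_threshold:
            extracted.append(cand["id"])
    return extracted


if __name__ == "__main__":
    # Hypothetical classifier confidences for three relation candidates.
    candidates = [
        {"id": "a", "nec_score": 0.9, "re_score": 0.8},
        {"id": "b", "nec_score": 0.4, "re_score": 0.9},
        {"id": "c", "nec_score": 0.5, "re_score": 0.3},
    ]
    print(two_stage_extract(candidates, nec_threshold=0.7, re_threshold=0.7))
    print(two_stage_extract(candidates, nec_threshold=0.3, re_threshold=0.7))
```

With the stricter first stage, candidate "b" is discarded before RE can score it, even though RE would have accepted it; relaxing the NEC threshold recovers it while the RE threshold still filters out "c".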
Another research goal is to study whether Web features can help NERC performance for RE with imitation learning. Although Web pages have been used for information extraction, this has so far not been studied. The use of Web features is typically limited to information extraction from semi-structured data such as lists and tables (Dalvi et al., 2012; Wang et al., 2012b; Shen et al., 2012) or to research on using Wikipedia as a corpus for named entity linking (Bunescu and Paşca, 2006; Han et al., 2011). Those studies indicate that HTML markup on Web pages helps to improve performance of semi-structured information extraction. In the case of named entity linking, Web pages with links are useful because they provide a corpus annotated with references to a knowledge base, which can then be used for learning to link named entities in text to a knowledge base.
As mentioned before (Section 3.2.2), Wikipedia is a curated corpus, and conclusions reached on the basis of studies of information extraction from Wikipedia might not hold for information extraction from Web pages in general. Specifically, in Wikipedia, links in articles almost always point to other articles, which are in turn often NEs. On general Web pages, many links are links to other websites and this assumption cannot be made.
The research goal is therefore to study whether features extracted from HTML markup, such as links, text in bold or italics, or lists, can help improve NERC performance. Both local features (from the same mention) and global features (from the same Web page) are studied.
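One possible way to derive such local and global markup features is sketched below using Python's standard `html.parser` module. The tag-to-feature mapping, the class and function names, and the simple substring matching of mentions are assumptions for illustration, not the feature extraction used in this thesis.

```python
from html.parser import HTMLParser

# Assumed mapping from HTML tags to markup feature names.
MARKUP_TAGS = {"a": "link", "b": "bold", "strong": "bold",
               "i": "italic", "em": "italic", "li": "list"}


class MarkupCollector(HTMLParser):
    """Record each text segment together with the markup enclosing it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.segments = []  # list of (text, frozenset of markup kinds)

    def handle_starttag(self, tag, attrs):
        if tag in MARKUP_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened tag of this name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        text = data.strip()
        if text:
            kinds = frozenset(MARKUP_TAGS[t] for t in self.open_tags)
            self.segments.append((text, kinds))


def markup_features(html, mention, occurrence=0):
    """Local features for one occurrence of a mention, global ones for the page."""
    collector = MarkupCollector()
    collector.feed(html)
    collector.close()
    hits = [kinds for text, kinds in collector.segments if mention in text]
    if not hits:
        return {"local": [], "global": []}
    return {
        "local": sorted(hits[occurrence]),              # markup on this mention
        "global": sorted(frozenset().union(*hits)),     # markup on any mention
    }


if __name__ == "__main__":
    page = ("<p><b>Abbey Road</b> is an album.</p>"
            "<ul><li><a href='x'>Abbey Road</a></li></ul>"
            "<p>Abbey Road was recorded in London.</p>")
    print(markup_features(page, "Abbey Road", occurrence=2))
```

In this sketch the third occurrence of the mention carries no markup of its own, so its local feature set is empty, while the global features still record that the same string appears in bold, as a link and in a list elsewhere on the page.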