Knowledge Extraction: The LARC Framework
5.4 LARC: A General View
There is a fundamental difference between our learning framework LARC and similar frameworks that perform SRL, presented in Section 5.3.2. Existing SRL approaches make use of the readily available corpora of FrameNet or PropBank, in which hundreds of thousands of sentences have been manually annotated with semantic roles by linguists. That is, such approaches do not need to consider how the training data are acquired.
When working with text outside the scope of such available corpora, acquiring training data becomes a major concern. Therefore, our whole approach is built around this concern, with the explicit intention of keeping the number of manually annotated instances as low as possible.
The architecture of the LARC framework is shown in Figure 5.5. The number in each box indicates the order of execution within LARC. In the following, we briefly describe the functionality of each LARC component. More details on each of these components will be given in Section 5.5.
5 http://www.cis.upenn.edu/~bsnyder3/cgi-bin/search.cgi
Figure 5.5: LARC Architecture
[Figure placeholder. Components shown: Corpus, Tagging, Parsing, Tree Representation, Feature Creation, Corpus Statistics and Clustering, Selection & Annotation, Bootstrap Initialization, Learning Algorithm, and Active Learning; the boxes are numbered 1 to 8 in order of execution.]
Tagging: The purpose of tagging is to obtain part-of-speech tags and stemming information (if available) for the words of the documents. By assigning a syntactic category (e.g., verb, noun, adjective, or pronoun) to each word, tags provide useful knowledge for processing text. Based on such knowledge, it can sometimes be sufficient to retain only the nouns and verbs of a document for text representation. Alternatively, in the context of our event-oriented representation, one can replace a whole sentence (or clause) by the event type (or frame name) to which the corresponding verb belongs.
All in all, tagging is a text processing step that cannot be omitted.
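The noun-and-verb filtering mentioned above can be sketched as follows. This is a minimal illustration that assumes the tagger output is available as (word, tag, stem) triples; the tag names `NOUN`, `VERB`, and `DET`, the sample sentence, and the function name `content_stems` are assumptions made for the example, not the tag set or code used in LARC.

```python
# Hypothetical tagger output: (word, part-of-speech tag, stem) triples.
tagged = [
    ("The", "DET", "the"),
    ("measurements", "NOUN", "measurement"),
    ("showed", "VERB", "show"),
    ("no", "DET", "no"),
    ("weaknesses", "NOUN", "weakness"),
]


def content_stems(tagged_tokens, keep=("NOUN", "VERB")):
    """Keep only the stems of tokens whose tag is listed in `keep`."""
    return [stem for _word, tag, stem in tagged_tokens if tag in keep]


print(content_stems(tagged))
```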
Parsing: The purpose of syntactic parsing is to uncover inherent syntactic relations among different components of a sentence. A parser will divide the sentence into many phrases, e.g., NP (noun phrase), VP (verb phrase), or PP (prepositional phrase), and will show how these phrases relate to one another. Because we expect that knowledge roles will be generally expressed by phrases and not by individual words, the parsing step is very important for the learning approach, since it produces the phrases that will be labeled with knowledge roles during classification. Furthermore, parsing information is a good source of knowledge for creating informative features for the learning process.
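To make the idea of phrases as labeling candidates concrete, the sketch below reads a bracketed parse and lists its NP, VP, and PP constituents. The bracketed input format, the simplification that words hang directly under phrase nodes, and the names `parse_brackets`, `words`, and `phrases` are assumptions for illustration; an actual parser may deliver its output in a different form.

```python
def parse_brackets(text):
    """Read a bracketed parse such as "(S (NP the test) (VP failed))"
    into nested lists of the form [label, child, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                          # skip "("
        node = [tokens[pos]]              # phrase label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())       # nested phrase
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                          # skip ")"
        return node

    return read()


def words(node):
    """All words below a node, from left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in words(child)]


def phrases(node, wanted=("NP", "VP", "PP")):
    """Yield (label, text) for every phrase whose label is in `wanted`."""
    if isinstance(node, str):
        return
    if node[0] in wanted:
        yield node[0], " ".join(words(node))
    for child in node[1:]:
        yield from phrases(child, wanted)


tree = parse_brackets("(S (NP the insulation) (VP shows (NP signs (PP of aging))))")
for label, text in phrases(tree):
    print(label, "->", text)
```

Each pair produced here corresponds to one candidate phrase that a later classification step could label with a knowledge role.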
Tree Representation: The tagging and parsing information produced in the previous two steps cannot be used directly because of formatting differences. Thus, it is advisable to normalize and conflate their outputs by creating a tree data structure, in which nodes store several pieces of data (part-of-speech, stem, grammatical function, phrase type, etc.) that are useful for feature creation. A tree data structure is chosen because parsing output naturally has a tree structure. For an example, refer to Figure 5.8.
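One possible shape for such a tree is sketched below. The class name `Node`, its field names, and the hand-built sample sentence are assumptions made for illustration; the node contents described in Section 5.5 may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One tree node: leaves carry word-level data, inner nodes phrase-level data."""
    label: str                       # phrase type (inner node) or part-of-speech (leaf)
    word: Optional[str] = None       # surface form, leaves only
    stem: Optional[str] = None       # from the tagger, leaves only
    function: Optional[str] = None   # grammatical function, if the parser provides one
    children: List["Node"] = field(default_factory=list)

    def leaves(self):
        """Return the leaf nodes below this node, from left to right."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def text(self):
        """Return the words covered by this node as one string."""
        return " ".join(leaf.word for leaf in self.leaves())


# Hand-built example; in practice the nodes would be filled from tagger and parser output.
subject = Node("NP", function="SUBJ", children=[
    Node("DET", word="the", stem="the"),
    Node("NOUN", word="windings", stem="winding"),
])
predicate = Node("VP", children=[Node("VERB", word="failed", stem="fail")])
sentence = Node("S", children=[subject, predicate])

print(sentence.text())
print([leaf.stem for leaf in sentence.leaves()])
```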
Feature Creation: A learning approach needs learning instances. In Machine Learning, instances are commonly represented as vectors of feature values. When the available data in their original form do not have a structure of feature-value pairs, two things are needed: (1) designing a set of features that contains as much information as possible for representing the concepts to be learned; (2) assigning to every instance its corresponding feature values. In the process of feature design, feature functions are created, so that the second step of value assignment can take place automatically. Thus, during LARC execution, the Feature Creation component will use its feature functions to calculate feature values from the given input data.
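The split between feature design and automatic value assignment can be illustrated with a short sketch. The four feature functions and the dictionary keys of the candidate phrase are hypothetical and chosen only for this example; they are not the feature set of LARC.

```python
# Each feature function maps one candidate phrase to one feature value.
# The candidate is a plain dict here; its keys are assumptions made for this sketch.
def phrase_type(c):
    return c["label"]


def head_stem(c):
    return c["head_stem"]


def position_to_verb(c):
    return "before" if c["start"] < c["verb_index"] else "after"


def length_in_words(c):
    return c["end"] - c["start"]


# Step (1), feature design: the ordered list of feature functions.
FEATURE_FUNCTIONS = [phrase_type, head_stem, position_to_verb, length_in_words]


def to_vector(candidate):
    """Step (2), value assignment: apply every feature function in a fixed order."""
    return [f(candidate) for f in FEATURE_FUNCTIONS]


candidate = {"label": "NP", "head_stem": "winding", "start": 0, "end": 2, "verb_index": 2}
print(to_vector(candidate))
```

Keeping the functions in one ordered list means that adding a feature only requires writing one more function and appending it.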
Corpus Statistics and Clustering: In order to guide the process of selecting the most informative instances for manual annotation, we need statistics from the available text. A useful statistic is, for instance, the distribution of verbs in the corpus. Furthermore, we can cluster together sentences that display the same syntactic structure, in order to simplify the subsequent annotation process.
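Both ideas can be sketched in a few lines. The toy corpus and tag names are assumptions, and grouping sentences by identical tag sequences is only a crude stand-in for "the same syntactic structure"; it is not the clustering procedure of LARC.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: each sentence is a list of (stem, tag) pairs.
corpus = [
    [("winding", "NOUN"), ("show", "VERB"), ("damage", "NOUN")],
    [("test", "NOUN"), ("show", "VERB"), ("weakness", "NOUN")],
    [("value", "NOUN"), ("lie", "VERB"), ("in", "PREP"), ("range", "NOUN")],
]

# Distribution of verbs in the corpus.
verb_counts = Counter(stem for sent in corpus for stem, tag in sent if tag == "VERB")

# Group sentence indices by their tag sequence (the assumed structural signature).
clusters = defaultdict(list)
for index, sent in enumerate(corpus):
    signature = tuple(tag for _stem, tag in sent)
    clusters[signature].append(index)

print(verb_counts.most_common())
print(dict(clusters))
```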
Selection and Annotation: The learning algorithm needs instances that already have a value for the class feature, so that it can learn to recognize the class based on the other feature values. Up to this point, the feature vectors created in the Feature Creation step do not contain a value for the class. That is, it is not known whether an instance is the verbalization of a knowledge role or not. The annotation step provides this information by requiring a human user (the oracle) to assign knowledge roles to some sentences that are selected automatically by the selection step. The selection step needs a strategy for selecting the most informative instances, so that the number of instances that the user needs to annotate manually remains low. The selection strategy we use is part of our active learning strategy, discussed in Section 5.5.6; however, one can think of other selection strategies for this step that suit the nature of the available data.
Bootstrap Initialization: A benefit of having a corpus with inherent redundancy (that is, the same type of information is conveyed again and again, although using different wording) is that one can use this characteristic to bootstrap the initialization process. That is, if a human user assigns labels to some sentences manually, and the corpus has other sentences structurally similar to those, then it is possible to spread labels to these unannotated sentences, in order to increase the size of the training set.
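The label-spreading idea can be sketched as follows, under the simplifying assumption that "structurally similar" means an identical sequence of phrase types. The role names `OBJECT` and `FINDING`, the sentence identifiers, and the function name `spread_labels` are hypothetical.

```python
def spread_labels(annotated, unannotated):
    """Copy role labels from annotated sentences to unannotated sentences
    that share the same structural signature."""
    by_signature = {}
    for signature, roles in annotated:
        by_signature.setdefault(signature, roles)  # keep the first annotation seen
    extended = []
    for sentence_id, signature in unannotated:
        if signature in by_signature:
            extended.append((sentence_id, by_signature[signature]))
    return extended


# One manually annotated sentence: a signature plus one role (or None) per phrase.
annotated = [(("NP", "VP", "NP"), ("OBJECT", None, "FINDING"))]

# Unannotated sentences, each with an identifier and a signature.
unannotated = [
    ("s17", ("NP", "VP", "NP")),
    ("s18", ("NP", "VP", "PP")),
]

print(spread_labels(annotated, unannotated))
```

Only the sentence with a matching signature receives labels; the other one stays unannotated and remains available for later selection.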
Learning Algorithms: In a traditional, passive learning setting, a learning system learns from the set of training instances supplied from outside. In contrast, in an active learning setting, the learning system has control over the set of instances that can be used for training. How does the learning system decide which instances to choose for training? Two approaches have crystallized over many years of research: uncertainty sampling [Cohn et al., 1994] and query by committee [Argamon-Engelson and Dagan, 1999].
In the uncertainty sampling approach, the learner asks an oracle (the human annotator) to provide labels for instances on which it is uncertain. In the query by committee approach, several learners learn independently on the same set of instances. The decision to ask for new labels is then based on the disagreement set, that is, the set of instances to which the learners have assigned different labels.
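The two selection criteria can be contrasted in a small sketch. It assumes that a learner exposes class probabilities per instance and uses the lowest top-class probability as the uncertainty measure, which is one common choice among several; the instance names, probabilities, and labels are invented for the example.

```python
def uncertainty_sample(probabilities, k):
    """Pick the k instances whose most probable class has the lowest probability."""
    ranked = sorted(probabilities, key=lambda i: max(probabilities[i].values()))
    return ranked[:k]


def disagreement_set(committee_labels):
    """Return the instances to which the committee members assigned different labels."""
    return [i for i, labels in committee_labels.items() if len(set(labels)) > 1]


# Hypothetical class probabilities from a single learner.
probabilities = {
    "a": {"role": 0.95, "none": 0.05},
    "b": {"role": 0.55, "none": 0.45},
    "c": {"role": 0.20, "none": 0.80},
}

# Hypothetical labels from a committee of three learners.
committee_labels = {
    "a": ["role", "role", "role"],
    "b": ["role", "none", "role"],
    "c": ["none", "none", "none"],
}

print(uncertainty_sample(probabilities, 1))
print(disagreement_set(committee_labels))
```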
The common way of performing active learning in Machine Learning is to intervene directly in the internals of the learning algorithms. Such an approach makes it difficult for practitioners outside this field to use active learning in real-world problems. Therefore, for the LARC framework, we have chosen an alternative approach. We implement the query-by-committee approach using three off-the-shelf learning algorithms. The active learning strategy is then a separate module that takes into account the results of the learners, without interfering with their internal implementation.
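The separation between learners and selection strategy can be sketched as follows. The two toy learner classes stand in for off-the-shelf algorithms and are not the three learners used in LARC; the `fit`/`predict` interface, the function name `select_queries`, and the sample data are assumptions for illustration. The selection function touches the learners only through that interface.

```python
from collections import Counter


class MajorityLearner:
    """Toy stand-in for an off-the-shelf learner: always predicts the most frequent label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]

    def predict(self, X):
        return [self.label for _ in X]


class FeatureLookupLearner:
    """Toy stand-in: remembers the last label seen for each value of one feature."""

    def __init__(self, index, default):
        self.index, self.default = index, default

    def fit(self, X, y):
        self.table = {x[self.index]: label for x, label in zip(X, y)}

    def predict(self, X):
        return [self.table.get(x[self.index], self.default) for x in X]


def select_queries(learners, X_train, y_train, X_pool):
    """Train each learner through its public interface only and return the
    indices of pool instances on which the predictions disagree."""
    predictions = []
    for learner in learners:
        learner.fit(X_train, y_train)
        predictions.append(learner.predict(X_pool))
    return [i for i, votes in enumerate(zip(*predictions)) if len(set(votes)) > 1]


X_train = [("NP", "before"), ("PP", "after"), ("NP", "after")]
y_train = ["role", "none", "role"]
X_pool = [("NP", "before"), ("PP", "before"), ("VP", "after")]

committee = [MajorityLearner(), FeatureLookupLearner(0, "none"), FeatureLookupLearner(1, "none")]
print(select_queries(committee, X_train, y_train, X_pool))
```

Because `select_queries` never inspects a learner's internal state, any classifier offering the same two methods could be swapped in without changing the selection code.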