2.2 A Novel Approach for Knowledge Base Construction with Conditional Ran-

2.2.4 Feature Classes

The choice of features is an essential part of any IE system, and much of a system's performance depends on how accurately the features are defined. Due to their discriminative nature, CRFs can incorporate many rich and highly dependent features. One of the strengths of CRFs is that they soften hard rules, i.e. each feature is assigned a weight that is estimated from the training data by maximum likelihood instead of acting as a hard constraint. This section describes the general characteristics of the features used in Text2SemRel; Section 2.4.1 provides detailed information about their implementation when we discuss our experiments.

Text2SemRel's features can be roughly divided into local features, context features, and features based on external knowledge. They can further be grouped into features designed primarily for NER and features designed primarily for SRE. For instance, SRE often requires reasoning over contextual clues located quite far away from an entity, which is why the context features are expected to be more helpful for SRE. Note that all features except the external knowledge features are derived from the training data.

Local Features. These are extracted from a single token or from parts of a single token. Many IE systems use a fairly standard feature set that varies only slightly from system to system. Local features are expected to be more meaningful for NER. Text2SemRel essentially uses the following:

• Word Token: Word tokens are the simplest features, but may already be indicative of a specific label.

• Orthography: Entities often share common orthography, e.g. they are often capitalized or consist of digits. A complete list of orthographic features is shown in Table 2.1.

• Word Shape: This feature is a normalization in which a word is reduced to its shape. For example, two different words that have the same length and are both capitalized have the same word shape. The word shape of 'Angela' is 'Xxxxxx'. A further option is to collapse this normalization so that runs of adjacent identical shape characters are merged into a single one ('Xx' in the example above).

• Character N-gram: Character N-grams are defined for a specific window length N and are the consecutive substrings of N characters of a given word. The character 3-grams of 'Angela' are 'Ang', 'nge', 'gel', and 'ela'.

• Prefix: Prefix features are consecutive substrings of N characters of a given word, with the additional constraint that they must start at the beginning of the word. The prefix of length N = 3 of the word 'kinase' is 'kin'.

• Suffix: Suffix features are consecutive substrings of N characters of a given word, with the additional constraint that they must end with the last character of the word. The suffix of length N = 3 of the word 'kinase' is 'ase'.

• Role Feature: A special feature designed for SRE that indicates that a specific entity class has been recognized in the input sequence. Every token of the input sequence that is predicted to be a member of an entity class is assigned this feature. This information comes from an NER system; Text2SemRel uses a CRF for NER.
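The string-based local features above can be illustrated with a minimal Python sketch. It is not the Text2SemRel implementation; the function names are hypothetical, and mapping digits to 'd' in the word shape is an assumption that the text does not specify.

```python
import re


def word_shape(token: str, collapse: bool = False) -> str:
    """Normalize a token to its shape: uppercase -> 'X', lowercase -> 'x'.

    Assumption (not from the text): digits map to 'd' and all other
    characters are kept unchanged.
    """
    chars = []
    for ch in token:
        if ch.isupper():
            chars.append("X")
        elif ch.islower():
            chars.append("x")
        elif ch.isdigit():
            chars.append("d")
        else:
            chars.append(ch)
    shape = "".join(chars)
    if collapse:
        # Merge runs of identical shape characters, e.g. 'Xxxxxx' -> 'Xx'.
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape


def char_ngrams(token: str, n: int) -> list:
    """All consecutive substrings of n characters; empty if the token is shorter than n."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def prefix(token: str, n: int) -> str:
    """The first n characters of the token (n > 0)."""
    return token[:n]


def suffix(token: str, n: int) -> str:
    """The last n characters of the token (n > 0)."""
    return token[-n:]


print(word_shape("Angela"))                 # Xxxxxx
print(word_shape("Angela", collapse=True))  # Xx
print(char_ngrams("Angela", 3))             # ['Ang', 'nge', 'gel', 'ela']
print(prefix("kinase", 3), suffix("kinase", 3))  # kin ase
```

In a CRF, each returned string would typically be turned into a binary indicator feature for the token, for example by prefixing it with the name of its feature class.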

External Knowledge Features. Incorporating external knowledge into an IE system often improves performance and is an essential part of many IE systems (see e.g. [114, 157]). Text2SemRel uses two different types of external knowledge features: (i) dictionaries of entities and (ii) keyword lists that are indicative of specific relations. While the accuracy of these dictionaries is usually very good, their coverage is typically quite low. The idea behind dictionary matching features is that if some subsequence of the input token sequence matches a dictionary entry, the entity class of the corresponding dictionary should become more likely. Even though many resources are available on the web, dictionaries are domain-dependent and can sometimes be difficult to obtain. Some recent work tries to generate dictionaries automatically from large collections of unlabeled text (see e.g. [114, 77]).
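The dictionary matching idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the matching procedure used in Text2SemRel: the function name, the `DICT=` feature naming, case-insensitive exact matching, the `max_len` limit, and the toy dictionaries are all hypothetical.

```python
def dictionary_match_features(tokens, dictionaries, max_len=5):
    """Return one set of dictionary features per token.

    `dictionaries` maps a class name to a set of entries, where each entry
    is a tuple of lower-cased tokens. A token receives the feature
    'DICT=<class>' if it is part of a token subsequence (of at most
    `max_len` tokens) that exactly matches an entry of that dictionary.
    """
    features = [set() for _ in tokens]
    lowered = [t.lower() for t in tokens]  # assumption: case-insensitive match
    for name, entries in dictionaries.items():
        for start in range(len(tokens)):
            last_end = min(start + max_len, len(tokens))
            for end in range(start + 1, last_end + 1):
                if tuple(lowered[start:end]) in entries:
                    for i in range(start, end):
                        features[i].add(f"DICT={name}")
    return features


# Toy dictionaries, invented for illustration only.
dictionaries = {
    "PERSON": {("angela", "merkel")},
    "PROTEIN": {("kinase",)},
}
tokens = ["Angela", "Merkel", "studied", "a", "kinase"]
for token, feats in zip(tokens, dictionary_match_features(tokens, dictionaries)):
    print(token, sorted(feats))
```

Keyword lists that are indicative of specific relations could be matched in the same way, with one list per relation instead of one dictionary per entity class.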

Context Features. These features extract characteristics from the surroundings of the current word x_i. CRFs can ask arbitrary questions about the input token sequence x, which makes it straightforward to incorporate context features into the model. Context features are the crucial factor in handling relation extraction with CRFs, since a relationship between two entities is most often expressed in the vicinity of one of the participating entities. Recent research indicates that binary relationships can often be expressed with a small number of lexico-syntactic patterns [15]. In the future, Text2SemRel could build on these results and use such lexico-syntactic patterns as additional features.

• Conjunction: The input for the conjunction feature is an interval [−N; N] and a set of other feature classes for which conjunctions are to be extracted. A conjunction feature takes a feature at a specific position inside the specified interval and combines it with position-specific information. For instance, assume the interval is [−2; 2] and conjunctions are specified for the prefix and suffix feature classes, each with length 3. Assume further that the current word is 'Merkel' and the previous word is 'Angela'. For position −1 we then get one conjunction feature for the suffix 'ela' at position −1 and one conjunction feature for the prefix 'Ang' at position −1. Note that the conjunction feature is expected to be helpful for both NER and SRE.

• Entity Neighborhood: A special feature for SRE that takes as input a predicted entity, which is assumed to be correct, an interval [−N; N], and a set of other feature classes for which the entity neighborhood is to be considered. The neighborhood feature checks whether features of the specified classes can be found in the specified interval around the predicted entity.

• Window: The window feature takes the same input as the conjunction feature. In contrast to the conjunction feature, however, no position-specific information is extracted, only the information that a given feature occurs somewhere in the window.

• Negation: The input for the negation feature is a set of feature classes. The feature is active if none of the specified feature classes can be found in the input sequence.
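The difference between the conjunction, window, and negation features can be made concrete with a small Python sketch. It rests on assumptions that are not taken from Text2SemRel: base features are represented as strings of the form 'CLASS=value', one set per token, and the function names and the '@' naming scheme are hypothetical. Offset 0 is included because the interval [−N; N] contains it.

```python
def feature_class(feature):
    """Return the class part of a 'CLASS=value' feature string."""
    return feature.split("=", 1)[0]


def _selected_in_window(base, i, n, classes):
    """Yield (offset, feature) for selected features at positions i-n .. i+n."""
    for offset in range(-n, n + 1):
        j = i + offset
        if 0 <= j < len(base):  # skip positions outside the sentence
            for feat in base[j]:
                if feature_class(feat) in classes:
                    yield offset, feat


def conjunction_features(base, i, n, classes):
    """Selected features combined with their offset relative to token i."""
    return {f"{feat}@{offset}"
            for offset, feat in _selected_in_window(base, i, n, classes)}


def window_features(base, i, n, classes):
    """Selected features that occur anywhere in the window, without position."""
    return {f"{feat}@WINDOW"
            for _, feat in _selected_in_window(base, i, n, classes)}


def negation_feature(base, classes):
    """True if none of the given feature classes occurs in the whole sequence."""
    present = {feature_class(f) for feats in base for f in feats}
    return present.isdisjoint(classes)


# Base features for the tokens 'Angela', 'Merkel' (prefix and suffix, length 3).
base = [{"PREFIX=Ang", "SUFFIX=ela"}, {"PREFIX=Mer", "SUFFIX=kel"}]

# Current word is 'Merkel' (index 1), interval [-2; 2].
print(sorted(conjunction_features(base, 1, 2, {"PREFIX", "SUFFIX"})))
print(sorted(window_features(base, 1, 2, {"PREFIX"})))
print(negation_feature(base, {"DICT"}))
```

For the example from the conjunction bullet, the first call yields, among others, 'PREFIX=Ang@-1' and 'SUFFIX=ela@-1', while the window variant only records that a prefix feature occurs somewhere in the window. The entity neighborhood feature is left out of the sketch because it additionally depends on the output of the NER step.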

In document Bundschus, Markus (2010): From Text to Knowledge: Bridging the Gap with Probabilistic Graphical Models. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 46-48)