8.3 The Extraction Workflow

8.3.6 Concept Extraction

The result of the Semantic Pattern Extraction is a set of two or three fragments. The first fragment contains the subject(s) and is therefore called the subject fragment. The second and third fragments are the first-level and second-level object fragments, respectively. In the following step, the concepts are extracted from these fragments in order to build the semantic relations.

To find the concepts that participate in a semantic relation, two further FSMs are used, one to parse the subject fragment and one to parse the object fragments. In many cases, the concepts appear directly to the left and right of a semantic pattern, which might suggest that concept extraction can be carried out simply by searching for the first nouns to the left and right of the pattern. However, there are many examples in which such a naive assumption fails and a much more complex approach to concept extraction is necessary. Two examples are:

1. A wardrobe, also known as an armoire from the French, is a standing closet.65

2. Column or pillar in architecture and structural engineering is a structural element.66

In the first example, a relation like <French, is-a, closet> would be concluded, and in the second example relations like <architecture, is-a, structural element>. Both relations are erroneous, because the correct relations would be <wardrobe, is-a, closet> and <{column, pillar}, is-a, structural element>.
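The failure of the naive assumption on the first example can be reproduced with a few lines of Python. This is only an illustrative sketch: the part-of-speech tags are assigned by hand (they are assumptions, not tagger output), and `naive_extract` is a hypothetical helper, not part of the described approach.

```python
# Hand-tagged tokens for example 1; the tags are assumptions for illustration,
# not the output of a real POS tagger.
SENTENCE = [
    ("A", "DET"), ("wardrobe", "N"), (",", "PUNCT"), ("also", "ADV"),
    ("known", "V"), ("as", "PREP"), ("an", "DET"), ("armoire", "N"),
    ("from", "PREP"), ("the", "DET"), ("French", "N"), (",", "PUNCT"),
    ("is", "V"), ("a", "DET"), ("standing", "ADJ"), ("closet", "N"),
    (".", "PUNCT"),
]


def naive_extract(tagged, pattern_start, pattern_end):
    """Return the nearest noun left and right of the pattern span [start, end)."""
    left = next((w for w, t in reversed(tagged[:pattern_start]) if t == "N"), None)
    right = next((w for w, t in tagged[pattern_end:] if t == "N"), None)
    return left, right


# The pattern "is a" occupies token positions 12 and 13.
subject, obj = naive_extract(SENTENCE, 12, 14)
print(f"<{subject}, is-a, {obj}>")  # prints "<French, is-a, closet>"
```

The nearest noun to the left of the pattern is "French", not "wardrobe", which is exactly the erroneous relation discussed above.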

The first example contains an additional apposition about the origin of the word wardrobe. Such side notes about origin, grammar or pronunciation occur frequently and must be taken into account when parsing a subject fragment, as erroneous relations may otherwise result.

The second example contains a so-called field reference, which specifies the field or domain to which the article refers. Though there is no relation between a field reference and the semantic patterns, there is a relation between a field reference and the subject(s). The field reference suggests that the subjects belong to the specific field or domain and thus constitutes a part-of relation. The Wikipedia approach is able to extract such relations; in the second example, it would extract <{column, pillar}, part-of, {architecture, structural engineering}>.
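How the extracted terms could be assembled into relations can be sketched as follows. The function name `build_relations` and the tuple representation are assumptions made for illustration; the source does not specify the data structures that are actually used.

```python
def build_relations(subjects, objects, fields, pattern_type="is-a"):
    """Combine extracted terms into (subjects, type, targets) triples.

    The relation between subjects and objects takes the type of the
    semantic pattern; field references yield an extra part-of relation.
    """
    relations = []
    if subjects and objects:
        relations.append((tuple(subjects), pattern_type, tuple(objects)))
    if subjects and fields:
        relations.append((tuple(subjects), "part-of", tuple(fields)))
    return relations


# Terms taken from the second example sentence (column or pillar).
relations = build_relations(
    subjects=["column", "pillar"],
    objects=["structural element"],
    fields=["architecture", "structural engineering"],
)
for rel in relations:
    print(rel)
```

Without subjects the function returns an empty list, which mirrors the rule that no relation can be built when no subject is found.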

65http://en.wikipedia.org/wiki/Wardrobe (January 2014)

The FSMs used for the fragment parsing are more complex than the FSMs used to find semantic patterns, and they are used in a different way. While the pattern FSMs were entered somewhere in the sentence whenever a specific anchor term was found, these FSMs are entered at the very first word of the fragment. The final state, which has to be reached, marks the end of the fragment. This means that the FSM is applied from the first to the last word of the fragment, not only to specific parts.

There exist special states in the FSMs indicating that a subject (or an object, respectively) or a field reference was found. If such a state is reached, the respective terms are extracted and temporarily stored. Once the subject and object fragment FSMs have reached the final state, the extraction process is finished, as in this case at least one subject and one object were extracted. If no subject or object could be extracted, no relations can be built and the article is discarded.

Figure 8.6: A simplified version of the subject fragment FSM.

Fig. 8.6 shows a simplified finite state machine to process subject fragments, i.e., the sentence fragment appearing before the first pattern. In this fragment, the field references and subjects have to be extracted. Similar to the pattern FSMs, there are different initial states (red color). Besides, there are the two special states 4 and 7. State 4 handles the subjects, which can be nouns (N) or noun compounds (NC). Analogously, state 7 handles field references, which can also be nouns or noun compounds. State 13 is the final state, which can be entered once the end of the fragment (i.e., the beginning of the semantic pattern) is reached.

To illustrate the FSM, consider the following example:

A spindle, in furniture and architecture, is a cylindrically, symmetric shaft.67

67http://en.wikipedia.org/wiki/Spindle_(furniture) (January 2014)

The subject fragment is obviously "A spindle, in furniture and architecture". The first word "A" indicates that the FSM is entered at state 3. The following word, "spindle", is a noun, allowing a transition to state 4. As this state handles the sentence subjects, spindle is considered a subject. The following word "in" allows a transition to state 1. The next word is "furniture", which is a noun and thus leads to state 7. State 7 handles field references, so furniture is considered a field reference. The next word, "and", leads to state 8 and the final word "architecture" leads back to state 7. Therefore, architecture is again considered a field reference. Now, the end of the fragment is reached and there is a transition to the final state 13. Thus, the fragment was successfully parsed and one subject (spindle) and two field references were extracted (furniture, architecture).
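The walkthrough above can be traced with a small table-driven FSM. This is a minimal sketch under stated assumptions: it encodes only the transitions named in the walkthrough (the complete FSM of Fig. 8.6 has more), the POS tags are assigned by hand, punctuation is simply skipped, and the names `ENTRY`, `TRANSITIONS` and `parse_subject_fragment` are hypothetical.

```python
# Entry state depends on the first token; only the determiner entry
# mentioned in the walkthrough is modeled here (assumption).
ENTRY = {"DET": 3}

# (state, symbol) -> next state; limited to the transitions in the walkthrough.
TRANSITIONS = {
    (3, "N"): 4,      # noun after determiner: subject
    (4, "in"): 1,     # "in" introduces a field reference
    (1, "N"): 7,      # noun after "in": field reference
    (7, "and"): 8,    # conjunction between field references
    (8, "N"): 7,      # further field reference
    (7, "END"): 13,   # end of fragment: final state
}
SUBJECT_STATE, FIELD_STATE, FINAL_STATE = 4, 7, 13


def parse_subject_fragment(tagged):
    """Return (subjects, fields), or None if the final state is not reached."""
    tokens = [(w, t) for w, t in tagged if t != "PUNCT"]
    if not tokens or tokens[0][1] not in ENTRY:
        return None
    state = ENTRY[tokens[0][1]]
    subjects, fields = [], []
    for word, tag in tokens[1:] + [("", "END")]:
        # Nouns and noun compounds share one symbol; other words match literally.
        if tag in ("N", "NC"):
            symbol = "N"
        elif tag == "END":
            symbol = "END"
        else:
            symbol = word.lower()
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return None  # no valid transition: fragment cannot be parsed
        if state == SUBJECT_STATE:
            subjects.append(word)
        elif state == FIELD_STATE:
            fields.append(word)
    return (subjects, fields) if state == FINAL_STATE else None


fragment = [
    ("A", "DET"), ("spindle", "N"), (",", "PUNCT"), ("in", "PREP"),
    ("furniture", "N"), ("and", "CONJ"), ("architecture", "N"),
]
print(parse_subject_fragment(fragment))
# prints (['spindle'], ['furniture', 'architecture'])
```

A fragment for which no transition exists returns `None`, which corresponds to a fragment that cannot be parsed.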

Since there are many special cases that can occur in such definition sentences, the actually implemented subject fragment FSM is much more complex. It consists of about 20 states, 40 transitions and many additional conditions to decide whether a transition can be used or not. Object fragment processing works analogously to subject fragment parsing, though the object fragment FSM is less complex.

Parentheses

Parentheses play an important role in the sentence fragment parsing. Consider the following three examples:

1. A burl (American English) or bur or burr (used in all non-US English speaking countries) is a tree growth...68

2. A curio (or curio cabinet) is a predominantly glass cabinet...69

3. Countertop (also counter top, counter, benchtop, (British English) worktop, or (Australian English) kitchen bench) usually refers to a horizontal worksurface...70

In the first example, the parentheses contain additional information about grammar, etymology, word usage and the like, which can severely impair the sentence parsing. As a consequence, one might decide to remove parenthesis expressions entirely. However, in the second example, the parenthesis expression contains a synonym (curio cabinet is a synonym of curio), which would not be extracted if this expression were removed. Finally, in the third example, there is a complex parenthesis expression containing both valuable synonyms and obstructive usage information.

For this reason, deleting parenthesis expressions can be disadvantageous, because they can contain important information. On the other hand, parenthesis expressions can become so complex that the fragment parser cannot process the sentence and is forced to discard the whole fragment (which means that no relations are extracted). Besides this, nonsensical information can be extracted if parentheses are not treated carefully.

68http://en.wikipedia.org/wiki/Burl (January 2014)

69http://en.wikipedia.org/wiki/Curio_cabinet (January 2014)

Assuming that parentheses often contain a synonym expression to a given word, the approach might consider American English to be a synonym of burl in the first example and conclude <burl, equal, American English>.

In the Wikipedia relation extraction approach, parenthesis expressions are handled as follows: First, the fragments are processed without touching the parentheses. A small list of key words is used to distinguish nouns like American English or French from actual concept names. If the FSM does not reach the final state, the parentheses are replaced by commas and the processing is tried again. There is also a configuration in which parentheses are turned into an apposition, e.g., "A car (automobile) is a..." is turned into "A car, or automobile, is a...", which can be easily processed by the FSM. Finally, if the fragment still cannot be successfully parsed, the parenthesis expression is removed. In this case, synonymous terms can be missed by the approach, but the approach may be able to successfully parse the sentence after all.
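The fallback sequence described above can be sketched as follows. All names are hypothetical, and `parse` stands for any fragment parser that returns `None` on failure. The key word list is omitted, and the regular expression handles only non-nested parentheses, so the third example (Countertop) would need additional treatment.

```python
import re

# Matches one non-nested parenthesis expression plus preceding whitespace.
PAREN = re.compile(r"\s*\(([^()]*)\)")


def parens_to_commas(fragment):
    """'A car (automobile) is a' -> 'A car, automobile, is a'"""
    return PAREN.sub(r", \1,", fragment)


def parens_to_apposition(fragment):
    """'A car (automobile) is a' -> 'A car, or automobile, is a'"""
    return PAREN.sub(r", or \1,", fragment)


def strip_parens(fragment):
    """'A car (automobile) is a' -> 'A car is a'"""
    return PAREN.sub("", fragment)


def parse_with_fallbacks(fragment, parse, use_apposition=False):
    """Try the untouched fragment, then a rewritten one, then without parentheses."""
    rewrite = parens_to_apposition if use_apposition else parens_to_commas
    for candidate in (fragment, rewrite(fragment), strip_parens(fragment)):
        result = parse(candidate)
        if result is not None:
            return result
    return None  # fragment is discarded, no relations are extracted


print(parens_to_apposition("A car (automobile) is a"))
# prints "A car, or automobile, is a"
```

Trying the rewrites in this order keeps potential synonyms for as long as possible; the parenthesis content is dropped only as a last resort.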