3. Software for Corpus Linguistics
3.2 Corpus development and analysis tools
3.2.2 The USAS semantic tagger
The UCREL semantic analysis system (USAS) accepts as input text which has been tagged for parts of speech using the CLAWS4 POS tagger. The tagged text is fed into the main semantic analysis program (SEMTAG), which assigns semantic tags representing the general sense field of words from a lexicon of single words and a list of multi-word combinations, called templates (e.g. ‘as a rule’). These are updated as new texts are analysed (Rayson and Wilson, 1996). Currently, the lexicon contains nearly 37,000 words and the template list contains over 16,000 multi-word units. Items not contained in the lexicon or template list are assigned a special tag, Z99. Figure 3.2 is an example of semantic word tagging, taken from a library system requirements definition document.
It_Z8 is_Z5 anticipated_X2.6+ that_Z5 the_Z5 system_X4.2
will_T1.1.3 be_Z5 administered_A9- by_Z5 the_Z5 Library_Q4.1/H1
,_PUNC but_Z5 this_Z8 will_T1.1.3 not_Z6 always_N6+++
be_the_case_A5.2+[i9.3 ._PUNC
Figure 3.2 An example of lexical semantic tagging
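The word_TAG layout of Figure 3.2 can be read back into (word, tag) pairs with a few lines of code. The following is a minimal sketch rather than part of USAS: the helper name split_tagged_token is hypothetical, and it assumes that a tag never contains an underscore, so that the last underscore in a token always separates the word (or multi-word unit) from its tag.

```python
def split_tagged_token(token):
    """Split 'word_TAG' on the LAST underscore.

    Multi-word units such as 'be_the_case' keep their inner underscores.
    Assumption: the tag itself never contains an underscore.
    """
    word, _, tag = token.rpartition("_")
    return word, tag


# First line of Figure 3.2
line = "It_Z8 is_Z5 anticipated_X2.6+ that_Z5 the_Z5 system_X4.2"
pairs = [split_tagged_token(token) for token in line.split()]
```

Splitting on the last underscore rather than the first is what keeps a template such as be_the_case together as a single unit.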
The semantic tags are composed of:
1. an upper case letter indicating general discourse field.
2. a digit indicating a first subdivision of the field.
3. (optionally) a decimal point followed by a further digit to indicate a finer subdivision.
4. (optionally) one or more ‘pluses’ or ‘minuses’ to indicate a positive or negative position on a semantic scale.
5. (optionally) a slash followed by a second tag to indicate clear double membership of categories.
6. (optionally) a left square bracket followed by ‘i’ to indicate a semantic template (multi-word unit).
For example, A5.2+ indicates a word in the category ‘general and abstract words’ (A), the subcategory ‘evaluation’ (A5), the sub-subcategory ‘true and false’ (A5.2), and ‘true’ as opposed to ‘false’ (A5.2+). Likewise, Q4.1/H1 belongs to the category ‘communication’ (Q), subcategory ‘the media’ (Q4), and refers to ‘books’ (Q4.1), as well as ‘kinds of houses and buildings’ (H1) (see note 31).
The semantic annotation is designed to apply to open-class or ‘content’ words. Words belonging to closed classes (such as prepositions, conjunctions, and pronouns), as well as proper nouns, are marked by a tag with an initial Z.
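The tag composition described above can be illustrated by decomposing a tag with a regular expression. This is a sketch under stated assumptions, not the SEMTAG implementation: the pattern and the function parse_tag are hypothetical, the pattern allows several digits and several decimal subdivisions because tags such as Z99 and T1.1.3 occur in the text and in Figure 3.2, and the characters following ‘[i’ are kept verbatim since their meaning is not described here.

```python
import re

# Hypothetical pattern inferred from the tag description and the Figure 3.2 examples.
TAG_RE = re.compile(
    r"^(?P<field>[A-Z])"                  # upper case letter: discourse field
    r"(?P<division>\d+)"                  # first subdivision
    r"(?P<subdivisions>(?:\.\d+)*)"       # optional finer subdivisions
    r"(?P<scale>\++|-+)?"                 # optional pluses or minuses
    r"(?:/(?P<second>[A-Z]\d+(?:\.\d+)*(?:\++|-+)?))?"  # optional second tag
    r"(?:\[i(?P<template>[\d.]+))?$"      # optional template marker
)


def parse_tag(tag):
    """Return the components of a semantic tag, or None if it does not match."""
    match = TAG_RE.match(tag)
    if match is None:
        return None  # e.g. PUNC, which is not a semantic tag
    return {
        "field": match.group("field"),
        "division": match.group("division"),
        "subdivisions": [d for d in match.group("subdivisions").split(".") if d],
        "scale": match.group("scale") or "",
        "second_tag": match.group("second"),
        "template": match.group("template"),
        # Tags with an initial Z mark closed-class words, proper nouns and unmatched items
        "is_z": match.group("field") == "Z",
    }
```

For instance, parse_tag("Q4.1/H1") separates the primary tag Q4.1 from the second tag H1, while parse_tag("N6+++") records a scale of three pluses.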
Note 31: A full tagset for the USAS tagger can be found online at http://www.comp.lancs.ac.uk/ucrel/usas/
As in the case of grammatical tagging, the task subdivides broadly into two phases. Phase I (tag assignment) attaches a set of potential semantic tags to each lexical unit. Phase II (tag disambiguation) selects the contextually appropriate semantic tag from the set provided by Phase I. SEMTAG makes use of seven major techniques or sources of information in Phase II:
1. POS tag. Some senses can be eliminated by prior POS tagging. For example, consider the word spring. There is a lexicon entry for spring which specifies firstly the possibility of a noun tag or a verb tag, and secondly the possibility that the noun may have the ‘coil’ sense or the ‘season’ sense. Given this sample lexicon entry, the POS tagger, by choosing the noun tag, eliminates the verb sense (‘to jump’). Hence the semantic tagger’s task is simplified to choosing between the ‘season’ and ‘coil’ senses:

word form    POS tag    semantic tag
spring       noun       [season sense] [coil sense]
spring       verb       [jump sense]
2. General likelihood ranking for single-word and template tags. In the lexicon and template list, senses are ranked in terms of frequency, although at present such ranking is derived from limited or unverified sources such as frequency-based dictionaries, past tagging experience and intuition. For example, green referring to ‘colour’ is generally more frequent than green meaning ‘inexperienced’.
3. Overlapping template resolution. Normally, semantic multi-word units take priority over single-word tagging, but in some cases a set of templates will produce overlapping candidate taggings for the same set of words. A set of heuristics is applied to enable the most likely template to be treated as the preferred one for tag assignment. The heuristics take account of the length and span of the idioms and of how much of a template is matched in each case.
4. Domain of discourse. Knowledge of the current domain or topic of discourse is used to alter the rank ordering of semantic tags in the lexicon and template list for a particular domain. Consider the adjective battered, to which three candidate tags can be assigned: ‘Violence’ (e.g. battered wife), ‘Judgement of Appearance’ (e.g. battered car), and ‘Food’ (e.g. battered cod). If the topic of conversation is known to be food, then the likelihood of the ‘Food’ semantic tag is automatically raised, at the expense of the other two tags.
5. Text-based disambiguation. It has been claimed (Gale et al., 1992), on the basis of corpus analysis, that to a very large extent a word keeps the same meaning throughout a text. For example, if a text on one occasion uses bank in the sense of ‘side of a river’, all other occurrences of bank are likely to have that same sense. In SEMTAG, this method works together with technique 4.
6. Contextual rules. The template mechanism is also used in identifying regular contexts in which a word is constrained to occur in a particular sense. Consider the meaning of the noun account: if it occurs in a sequence such as NP's account of NP it almost certainly means ‘narrative explanation’, whereas if it occurs in a financial context, in such collocations as savings account or the balance of … account it almost certainly has the meaning of a ‘bank account’.
7. Local probabilistic disambiguation. It is generally supposed that the correct semantic tag for a given word is substantially determined by the local surrounding context. To return to the example of account: if this noun occurs in the company of words such as financial, bank, overdrawn, money, there is little doubt that the financial meaning is the correct one. However, we could identify the surrounding context not only in terms of (a) the words themselves, but also in terms of (b) their grammatical tags, (c) their semantic tags, or (d) some combination of (a)-(c). This method is still under development in SEMTAG. Future work includes experimentation, using a training corpus and a test corpus, to determine what weight to give each of these contextual factors in selecting the correct semantic tag for a given word or word class. Other factors which need to be determined are discussed in Garside and Rayson (1997).
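To make the interaction of some of these techniques more concrete, the following minimal sketch combines techniques 1, 2, 4 and 5 on the spring and battered examples. It is an illustration under stated assumptions and not the SEMTAG implementation: the lexicon, the sense labels (placeholders rather than real USAS tags), the ordering of the battered senses, the function names, and the order in which the techniques are applied are all hypothetical. Only the use of Z99 for items missing from the lexicon is taken from the description above.

```python
# Toy lexicon: (word, POS) -> candidate senses, most likely first (technique 2).
# All entries and sense labels are hypothetical placeholders.
LEXICON = {
    ("spring", "noun"): ["SEASON", "COIL"],
    ("spring", "verb"): ["JUMP"],
    ("battered", "adj"): ["VIOLENCE", "APPEARANCE", "FOOD"],
}

# Hypothetical mapping from a known domain to the senses it favours (technique 4).
DOMAIN_SENSES = {"food": {"FOOD"}}


def assign_candidates(word, pos):
    """Phase I: the POS tag selects the lexicon entry, which discards the
    senses belonging to other parts of speech (technique 1)."""
    return list(LEXICON.get((word, pos), ["Z99"]))


def choose_sense(word, candidates, domain=None, established=None):
    """Phase II: select one candidate sense."""
    established = established or {}
    # Technique 5: reuse a sense already fixed for this word elsewhere in the text.
    if established.get(word) in candidates:
        return established[word]
    # Technique 4: move domain-favoured senses to the front. sorted() is stable,
    # so the other candidates keep their lexicon ranking (technique 2).
    favoured = DOMAIN_SENSES.get(domain, set())
    ranked = sorted(candidates, key=lambda sense: sense not in favoured)
    return ranked[0]
```

With no further information, choose_sense falls back on the lexicon ranking; passing domain="food" promotes the ‘Food’ sense of battered, and passing an established sense for a word overrides both.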
After automatic tag assignment has been carried out, manual post-editing can take place, if desired, to ensure that each word and idiom carries the correct semantic classification. An additional program using template analysis techniques (see section 3.3) can then mark important lexical relations (e.g. negation, modifier + adjective, and adjective + noun combinations).