Chapter 4: The Development and the Linguistic Analysis of the Corpus
4.2 The Linguistic Analysis and the Semantic Classification of the Corpus
4.2.2 Using WordNet for Semantic Tagging
4.2.2.1 Using Semantic Classes in WordNet
As described in the literature review, classifying words by their semantic classes supports NLP: it extends the task of NER and, at the same time, provides a partial disambiguation step for WSD. A knowledge resource such as the WordNet lexicon has been used extensively to retrieve useful information such as the meanings of words, lexical relations between words or word forms, and semantic relations between word meanings. The studies highlighted in the literature survey show the importance of the WordNet lexicographer classes (e.g. body, communication, possession, animal, change etc) for the automatic semantic classification of words in text, in which each word is assigned to a WordNet semantic class. Consequently, the lexicographer classes in WordNet have the potential to be useful in this study for the semantic analysis of objectives: the target words that are commonly used in creating SMART objectives can be classified semantically by identifying an appropriate semantic class for each of them. This will support the applied machine learning method in learning the rules that ensure that the objectives are “specific”.
A further reason for performing a semantic classification of the corpus before applying a machine learning method to learn the grammar rules is the study by Bouillon et al. (2001), which uses ILP to extract related noun-verb pairs from the MATRA-corpus. Their study shows that the semantic tagging (classification) of the corpus improves the quality of the rules generalised by the ILP method. In particular, they compared the use of an ILP method (i.e. PROGOL) on a POS tagged version of the MATRA-corpus and on a semantically tagged version of the same corpus. The experimental results show that the rules learned by the ILP method from the semantically tagged corpus are better than those learned from the POS tagged one. They argue that the semantic tagging increased the level of generalisation of the learned rules and revealed some semantic properties of the words surrounding a semantically related noun-verb pair in the corpus. In contrast, the learning applied to the POS tagged corpus led to poorer contextual information and to less generalised rules, which needed more linguistic features to extract the relevant noun-verb pairs.
The following paragraphs describe how the WordNet lexicographer class labels (e.g. possession, change, social etc) are used to determine the semantic class to which each target word in the objectives belongs.
In this research, the text of the objective sentences in the developed corpus has been syntactically and semantically analysed. The results showed that certain categories of words are used in writing SMART objectives. With regard to the definition of the SMART approach, an objective is considered “specific” if it is well-defined and specifies clearly what is to be achieved (Hurd et al., 2008). The definition of the “specific” factor in the SMART approach also implies that the objective should describe the action taken for achieving it by using an action verb (e.g. increase, achieve, boost, reduce etc) (Rouillard, 2003). Furthermore, the objective should state what it applies to, by specifying the domain of application in the objective (e.g. the objective pertains to sales, costs, profits etc) (Carliner, 1998). Consider the following example of a SMART objective which appears in the created corpus:
“By the end of next year, PC sales will increase by 8%.”
The word “increase” in the above objective sentence is an action verb that describes the action taken for achieving this objective, whilst the word “sales” represents the domain of application for the above objective. These two features (action verb and the domain of application for the objective) are required in an objective sentence in order to be “specific”. The POS tagger in GATE provides the grammatical categories for each word in the above sentence and tags the word “increase” as a verb and the word “sales” as a noun.
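The two required features can be sketched as a simple check over POS-tagged tokens. This is a minimal illustration only, not the procedure used in this study: the Penn Treebank-style tags below are written by hand rather than produced by GATE, the two word lists contain only the examples quoted above, and the function names `specific_features` and `looks_specific` are hypothetical.

```python
# Hand-written (token, POS) pairs for the example objective. This is an
# assumption standing in for the output of a POS tagger such as GATE's.
TAGGED_OBJECTIVE = [
    ("By", "IN"), ("the", "DT"), ("end", "NN"), ("of", "IN"),
    ("next", "JJ"), ("year", "NN"), (",", ","), ("PC", "NN"),
    ("sales", "NNS"), ("will", "MD"), ("increase", "VB"),
    ("by", "IN"), ("8", "CD"), ("%", "NN"),
]

# Illustrative word lists built only from the examples given in the text.
ACTION_VERBS = {"increase", "achieve", "boost", "reduce"}
DOMAIN_NOUNS = {"sales", "costs", "profits"}


def specific_features(tagged_tokens):
    """Return (action verbs, domain nouns) found in a POS-tagged sentence."""
    verbs = [tok for tok, pos in tagged_tokens
             if pos.startswith("VB") and tok.lower() in ACTION_VERBS]
    nouns = [tok for tok, pos in tagged_tokens
             if pos.startswith("NN") and tok.lower() in DOMAIN_NOUNS]
    return verbs, nouns


def looks_specific(tagged_tokens):
    """True if the sentence has at least one action verb and one domain noun."""
    verbs, nouns = specific_features(tagged_tokens)
    return bool(verbs) and bool(nouns)


if __name__ == "__main__":
    print(specific_features(TAGGED_OBJECTIVE))
    print(looks_specific(TAGGED_OBJECTIVE))
```

The sketch matches surface forms only; inflected forms such as “increased” would need lemmatisation before the lookup.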
Based on WordNet, a semantic classification has been built for the semantic tagging of the corpus of objectives. The most generic lexicographer semantic classes have been chosen to classify the target words (action verbs and the common nouns that represent the objectives’ domain) which are commonly used to write SMART objectives, and a tagset has been defined accordingly for the main POS categories (nouns and verbs) of the target words in the corpus. In particular, all possible WordNet semantic class labels were first assigned to the target nouns and verbs in the objective sentences, and the semantic categories that are irrelevant to the given corpus were then discarded. As a result, 5 verb semantic classes have been chosen to categorise the target verbs that appear in the objective sentences, while only 1 noun semantic class has been selected to classify the target common nouns. Figure 8 presents the defined tagset of the most generic WordNet semantic classes for the target common nouns and verbs which occur in the corpus of objective sentences; this tagset is used later for the semantic tagging of the target words in this corpus.
Figure 8 The Main WordNet Semantic Classes of the Target Nouns and Verbs in the Objectives
To automatically identify the semantic classes of the target words in the objective sentences, automatic methods in NLP could be used to access the WordNet database and then retrieve the most appropriate semantic class for each target word in a given context.
The idea is then to compare the defined tagset of semantic classes for the target nouns and verbs in the objective sentences with the results obtained by the applied automatic NLP methods, in order to check how well these methods disambiguate the target words in the objectives semantically.
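Such a comparison can be sketched as a simple accuracy computation over the target words. The sketch below is illustrative only: the function name `tagging_accuracy`, the (sentence index, word) keys and the class labels in the toy data are assumptions made for the example, and they are not the tagset of Figure 8 or results from this study.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of gold-annotated target words whose predicted class matches.

    Both arguments map a (sentence_index, word) key to a semantic class
    label. A target word missing from `predicted` counts as an error.
    """
    if not gold:
        raise ValueError("gold annotation is empty")
    correct = sum(1 for key, label in gold.items()
                  if predicted.get(key) == label)
    return correct / len(gold)


# Toy data with hypothetical labels, for illustration only.
gold_tags = {
    (0, "increase"): "verb.change",
    (0, "sales"): "noun.possession",
}
system_tags = {
    (0, "increase"): "verb.change",
    (0, "sales"): "noun.act",
}

if __name__ == "__main__":
    print(tagging_accuracy(gold_tags, system_tags))
```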
The following two sections describe two open-source systems which have been applied to access the WordNet database and retrieve the required semantic information for the target words in the objectives. The first system identifies the senses of all words in the objectives based on the context in which they occur. Since the first system does not provide the semantic classes of the words in the text, a second system has been applied to perform this task based on WordNet: it uses the disambiguation results obtained from the first system to annotate the target words in the objectives with the defined tagset of verb and noun semantic classes.
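The second step can be sketched under one assumption: that the disambiguation step yields WordNet sense keys of the form lemma%ss_type:lex_filenum:lex_id:head_word:head_id. The lex_filenum field then indexes the 45 WordNet lexicographer files directly. The sketch below is not the second system itself; `PLACEHOLDER_TAGSET` is a hypothetical stand-in for the tagset of Figure 8, and the sense keys in the usage example only illustrate the format.

```python
# The 45 WordNet lexicographer files, indexed by file number.
LEXNAMES = (
    "adj.all", "adj.pert", "adv.all", "noun.Tops", "noun.act",
    "noun.animal", "noun.artifact", "noun.attribute", "noun.body",
    "noun.cognition", "noun.communication", "noun.event", "noun.feeling",
    "noun.food", "noun.group", "noun.location", "noun.motive",
    "noun.object", "noun.person", "noun.phenomenon", "noun.plant",
    "noun.possession", "noun.process", "noun.quantity", "noun.relation",
    "noun.shape", "noun.state", "noun.substance", "noun.time",
    "verb.body", "verb.change", "verb.cognition", "verb.communication",
    "verb.competition", "verb.consumption", "verb.contact",
    "verb.creation", "verb.emotion", "verb.motion", "verb.perception",
    "verb.possession", "verb.social", "verb.stative", "verb.weather",
    "adj.ppl",
)

# Hypothetical stand-in for the tagset of Figure 8 (not the real tagset).
PLACEHOLDER_TAGSET = {
    "verb.change", "verb.possession", "verb.social", "noun.possession",
}


def lexname_from_sense_key(sense_key):
    """Map a WordNet sense key to its lexicographer class name."""
    _lemma, fields = sense_key.split("%", 1)
    lex_filenum = int(fields.split(":")[1])
    return LEXNAMES[lex_filenum]


def annotate(sense_key, tagset=PLACEHOLDER_TAGSET):
    """Return the semantic class if it belongs to the tagset, else None."""
    lexname = lexname_from_sense_key(sense_key)
    return lexname if lexname in tagset else None


if __name__ == "__main__":
    # Illustrative sense keys; only their format matters here.
    for key in ("increase%2:30:00::", "sale%1:04:00::"):
        print(key, lexname_from_sense_key(key), annotate(key))
```

A word whose disambiguated sense falls outside the tagset is left unannotated here, which mirrors the discarding of irrelevant semantic categories described earlier.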