Text Standardisation by Synonym Mapping
2. Retrieve the documents in which the query terms appear in the same sentences ensuring that the potential representative and the extension have been used with
the same respective parts-of-speech as the target word and extension have, in the target sentence.
3. Repeat the procedure for all potential representatives.
4. Choose the potential representative with the largest number N of returned docu-ments. If N ≥ 1, then the potential representative can be used to replace the target word, in the target sentence.
The Word Sense Disambiguation process is illustrated in Figure 3.3. The target sen-tence contains the target phrase and the extension. The target word in the target phrase is replaced by one of the potential representatives to create a potentially standardised phrase. The potentially standardised phrase and the extension are used to generate a Google query. The target sentence and all google returned sentences in which the poten-tially standardised phrase and the extension appear together, are parsed using the Link Grammer Parser. Each Google sentence parse is compared to the parse of the target sentence. If there are Google sentences in which the potential representative and the ex-tension have the same Parts of Speech as the target word and the exex-tension have in the target sentence, then disambiguation takes place.
The requirement that the potential representative and the extension appear in retrieved documents with the same respective parts-of-speech as the target word and extension do in the target sentence, is a good contributing factor towards ascertaining context. Thus finding a document in which this condition is fulfilled is taken to be evidence that the potentially standardised phrase is a sensible one. The assumption is that if there is a document in which someone has used the potential representative in the same way as the
3.2. Harmonising the Text 51 target word, then the potential representative can be used in place of the target word in this sentence, without altering the meaning.
Target Sentence
Figure 3.3: The Word Sense Disambiguation Tool
The internet is used as a very large corpus where we assume that if the potentially standardised phrase is sensible, the probability of getting a document in which it appears in the same sentence as the extension and used in the same context as the target sentence, tends to 1. If no document is returned after the above operation, we have no evidence that replacing the target word with that particular potential representative will not alter the meaning of the target sentence and we thus do not replace the target word with the representative word. The level of confidence of the disambiguation is dependent on the number of returned documents in which the potentially standardised phrase is used in the same context as the target phrase. It is also dependent upon the length of the target phrase – the bigger the phrase, the more confident we are of the disambiguation. It is worth noting that although the algorithm is aimed at resolving ambiguity of single words, the use of target phrases that are not single words results in the disambiguation of some longer phrases.
Consider again the sentence He found it difficult to open the front door. Suppose that we wish to determine if front should be replaced with movement in this sentence.
Figure 3.4 shows a parse of the target sentence. In this case, the noun phrase the front door will be the target phrase and the extension will be open. In order to determine if
3.3. Evaluation 52 movement is a synonym for front in this sentence, we replace front with movement in the target phrase changing it from the front door to the movement door. This potentially standardised phrase and the extension are used to create a Google query. Google returns a document on Rugby news in which the query words appear in the same sentence: “And during the season, the movement door remains open.” The parse of this sentence is shown in Figure 3.5.
he found.p it difficult.a to open.v the front.n door.n . Constituent tree:
(S (NP He) (VP found
(NP it)
(ADJP difficult (S (VP to
(VP open
(NP the front door)))))) .)
Figure 3.4: A Parse of the Sentence “He found it difficult to open the front door.”
Examination of the parses for target and Google-returned sentences shows that, open in the potentially standardised phrase returned in the Google document is not used in the same context as that in our target phrase. This is evidenced by the different parts-of-speech the extension, open, exhibits in the different sentences. In the target phrase, the extension is used as a verb and in the Google sentence, it is used as an adjective. Since no other documents are returned where the context is the same as that of our target word, we conclude that in this instance, front cannot be replaced by movement and still retain the meaning of the original sentence.
3.3 Evaluation
We are only interested in harmonising sentences where we can obtain accurate disambigua-tion. Thus our focus is on the correctness of the harmonised sentences and not necessarily on the overall effectiveness of the disambiguation tool. The Word Sense Disambiguation algorithm has been tested on documents in the SmartHouse domain. Ten sentences con-taining 14 polysemous words were harmonized and presented to a domain expert and to
3.3. Evaluation 53 and during the season.n , the movement.n door.n remains.v open.a . Constituent tree:
(S And
(PP during
(NP the season)) ,
(S (NP the movement door) (VP remains
(ADJP open))) .)
Figure 3.5: A Parse of the Sentence “And during the season, the movement door remains open.”
a native speaker of English. The sentences chosen were those that were deemed to have sufficiently changed from their original versions. Since it is sometimes impossible even for humans to pick the right sense of a word from WordNet, we thought asking our two test experts to do the same would not be reasonable. Instead, we placed the representative word right next to the target word in the text and asked them whether replacing the target word with the representative word would alter the meaning of the sentence. The target and representative words were highlighted, each with different colors to make the task easy. Figure 3.6 shows 5 of the harmonised sentences put to our experts. The target words in the original sentences and the representative words in the harmonised sentences are shown in bold for clarity.
All disambiguations were deemed correct by both experts i.e., replacing words with their representatives did not alter the meaning of the sentences. In 4 instances, the native English speaker thought that although the meaning of the sentence had not been altered, the representative word was as good as the word it was replacing while in the other cases, she preferred the representative word. This shows that it is possible to harmonise text and not loose important information, which is what we set out to do. The sentences presented to the experts were those that our tool was able to harmonise so this was not a test of the overall performance of the Word Sense Disambiguation tool but rather, of the correctness of those words that were used to replace others during text harmonisation.
Important domain terms such as wheelchair or mobility problem, will usually appear as nouns or noun phrases. Furthermore, in natural language, knowledge entities are usually
3.4. Chapter Summary 54
…because of her poor sight relied on listening to the commentary rather than watching the action.
…because of her poor vision relied on listening to the commentary rather than watching the action.
Mr. RC was at risk from fire and scalding because he was unaware of the special risks.
Mr. RC was at risk from fire and scalding because he was unaware of the special dangers.
He had difficulty sleeping, constantly waking throughout the night.
He had trouble sleeping, constantly waking throughout the night.
She relied on care worker to close the windows.
She relied on care staff to close her windows.
She had a tendency to leave the home, leaving the carer behind.
She had a tendency to leave the house, leaving the carer behind.
Harmonised Sentence Original Sentence
Figure 3.6: Sample of Original and Harmonised Sentences
related using verbs and so verbs or verb phrases are also likely to be important. Conse-quently, the key phrase extraction task is aimed at identifying important nouns (and noun phrases) and important actions. Thus in order to reduce on computational effort, only nouns and verbs are considered for harmonisation. Figure 3.7 shows the transformation in the original text of the wandering sub-problem if Figure 1.2. (a) is the original text, (b) the lemmatised version, and (c) is the harmonised text for the sub-problem. The harmonised terms are in bold for purposes of clarity.
3.4 Chapter Summary
This chapter has presented an unsupervised approach to Word Sense Disambiguation.
WordNet is used to assign each polysemous word with a sense that is applicable in the domain. The Word Sense Disambiguation algorithm makes use of syntactic information of the sentence in which a target word appears and information from Google, to assign the correct sense to the target word, in the context in which it appears. This approach has potential applications for ontology learning and general knowledge modeling. The texts returned by Google provide evidence that the disambiguated phrase is plausible in the English language. Although automated, the technique neither relies on statistical information of the word in the document collection nor does it require the use of
hand-3.4. Chapter Summary 55 (a) Original Text
Wandering:
Mrs M had a tendency to leave the house and wander off. She had recently wandered away from the house in the middle of the night. Her husband had found her outside, dressed only in a thin nightgown, trying to open the side door to a neighbor’s home...
(b) Lemmatised Text
wandering mrs m had a tendency to leave the house and wander off. she had recently wander away from the house in the middle of the night. her husband had found her outside, dress only in a thin nightgown, try to open the side door to a neighbor home...
(c) Harmonised Text
wandering mrs m had a tendency to leave the home and wander off. she had recently wander away from the home in the middle of the night. her husband had found her outside, dress only in a thin nightgown, try to open the side door to a neighbor home...
Figure 3.7: Harmonisation of the wandering sub-problem
tagged data.