
Text Standardisation by Synonym Mapping

3.1 Creating Domain-specific Synonym Mappings

In order for a machine to make any reasonable attempt at interpreting the text it reads, it is important that it is able to recognise the different meanings of a given word or phrase.

For this, it needs to have access to knowledge of the different synonyms a given word can take in the domain. A common approach to determining semantically related words is to determine their distributional similarities. The assumption here is that similar words will appear in similar contexts. Thus comparing contexts of two words can help to determine if the words are semantically related or not. The techniques presented by van der Plas & Bouma (2005) and Curran & Moens (2002) are good examples of such approaches. They make use of grammatical relations to define the context of a given target word.

Figure 3.1: Text Standardisation and Query Interpretation

Latent Semantic Indexing (Deerwester et al. 1990) is another technique that has been explored by many researchers wishing to determine similarity between terms. In Latent Semantic Indexing, the documents provide the term’s context, for it is assumed that semantically related words tend to appear in the same documents. Massie, Wiratunga, Donati & Vicari (2007) also rely on the thesis that similar words will distribute similarly.
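The idea that documents can serve as a term’s context can be sketched as follows. This is a minimal illustration and not Latent Semantic Indexing itself: it compares raw term-document count vectors by cosine similarity, whereas Latent Semantic Indexing would first project the term-document matrix into a lower-dimensional space by singular value decomposition. The toy documents and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy document collection (hypothetical data, for illustration only).
docs = [
    "the family sold the house and left their home",
    "a house becomes a home when a family lives in it",
    "the engine failed during the test flight",
    "the test showed the engine was faulty",
]

def term_document_vectors(documents):
    """Map each term to its vector of per-document occurrence counts."""
    counts = [Counter(doc.split()) for doc in documents]
    vocabulary = set().union(*counts)
    return {term: [c[term] for c in counts] for term in vocabulary}

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ranked_neighbours(target, vectors):
    """Rank all other terms by similarity to the target, most similar first."""
    scores = [
        (cosine(vectors[target], vec), term)
        for term, vec in vectors.items()
        if term != target
    ]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

vectors = term_document_vectors(docs)
neighbours = ranked_neighbours("house", vectors)
```

In this toy collection, house and home occur in the same documents and so score higher than house and engine, which never co-occur. As the surrounding text notes, such a ranking does not by itself separate synonyms from other related words: family scores as highly as home here.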

The commonality among these approaches is that they provide ranked lists of semantically related words where one would expect synonymous words to appear at the top and hyponyms¹ and hypernyms² at the lower ranks, since the similarity between a word and its synonym should be higher than that between the word and, say, its hypernym.

Unfortunately, it is not always the case that the ranked lists reflect these different similarities. van der Plas & Tiedemann (2006) obtain better scores for precision and recall when they use parallel corpora, where documents are translated into multiple languages, than when they use monolingual corpora. However, these authors also report “unsatisfactory” scores for precision. In order to ensure the reliability of synonym mappings, we use WordNet (Fellbaum 1998), a lexical knowledge base that was created by hand, to look up the synonyms of the various polysemous words in our document collection.

¹ A hyponym is the subordinate of a given word.

² A hypernym is the superordinate of a given word.

3.1.1 WordNet

In WordNet, synonymous nouns, verbs, adjectives and adverbs are grouped into synsets, where a synset represents a lexical concept. Synsets are interrelated by means of conceptual and lexical relations. Synsets are organised according to concept-superconcept relationships called hyponymy, thus forming a concept hierarchy. For example, the noun mango is a hyponym (subordinate) of the noun fruit, and fruit is a hypernym (superordinate) of mango. WordNet also has links representing other relations between words, such as antonymy and meronymy. Using WordNet, we obtain clusters of words with similar meaning, whereby a word representing the cluster can be used to replace the others in text while retaining the meaning of the original sentence.
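The concept hierarchy described above can be pictured with a small data structure. The sketch below is a toy model and not WordNet’s own interface or data: the Synset class, its fields and the two hand-written entries are assumptions, and each synset is given at most one hypernym for simplicity, whereas a WordNet synset may have several.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Synset:
    """A toy lexical concept: synonymous lemmas plus a link to a superordinate concept."""
    lemmas: List[str]
    pos: str
    hypernym: Optional["Synset"] = None

    def hypernym_chain(self):
        """Return the superordinate synsets, nearest first."""
        chain, node = [], self.hypernym
        while node is not None:
            chain.append(node)
            node = node.hypernym
        return chain

def is_hyponym_of(child, ancestor):
    """True if `ancestor` lies somewhere above `child` in the hierarchy."""
    return any(s is ancestor for s in child.hypernym_chain())

# Hand-written toy entries mirroring the mango/fruit example; not WordNet content.
fruit = Synset(["fruit"], "noun")
mango = Synset(["mango"], "noun", hypernym=fruit)
```

With these entries, is_hyponym_of(mango, fruit) holds while is_hyponym_of(fruit, mango) does not, which reflects the direction of the hyponymy relation.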

3.1.2 Creating the Clusters

We cluster words according to their synonymy relationships; words that are synonyms in a given context are clustered together. In order to determine the context, we need the word’s part-of-speech and the different meanings or senses of each word. A CLAWS part-of-speech tagger (Marshall 1987) was used to obtain the part-of-speech of each word.

If two words with the same part-of-speech appear in the same WordNet synset, they are deemed to be synonymous in a particular context, and are thus clustered together. We consider only strict synonyms, i.e. words that belong to the same WordNet synset. We therefore build clusters of synonymous words and, out of each cluster, choose the word that appears most frequently in our texts as the representative for the other words in the cluster. The representative word will replace all the other cluster words in the text during harmonisation. If more than one word in a cluster has the highest frequency, a representative is chosen at random from among them. To illustrate, suppose that the nouns house and home appear in the document collection with frequencies 2 and 3 respectively.

Figure 3.2 shows one sense for each of the two nouns as given by WordNet. House has 12 senses in all and home has 9.


Sense 4 of house

family, household, house, home, menage -- (a social unit living together; "he moved his family to Virginia"; "It was a good Christian household"; "I waited until the whole house was asleep"; "the teacher asked how many people made up his home")

=> unit, social unit -- (an organization regarded as part of a larger social group; "the coach said the offensive unit did a good job"; "after the battle the soldier had trouble rejoining his unit")

Sense 7 of home

family, household, house, home, menage -- (a social unit living together; "he moved his family to Virginia"; "It was a good Christian household"; "I waited until the whole house was asleep"; "the teacher asked how many people made up his home")

=> unit, social unit -- (an organization regarded as part of a larger social group; "the coach said the offensive unit did a good job"; "after the battle the soldier had trouble rejoining his unit")

Figure 3.2: Sample WordNet Senses for House and Home

House is synonymous with home in sense 4 of house, and home is synonymous with house in sense 7 of home. Since home is more frequent in the document collection, it is made the representative of the synset in which house appears. The assumption is that the most frequently used version of a word is also the most preferred in the domain. We only consider the frequency of the word in the particular part-of-speech, because a word could be used frequently as, say, a noun but rarely as a verb. If we used the overall frequency, a word that is rarely used in the relevant part-of-speech might be chosen to represent a cluster. Linguistically, this should not pose a problem since all the text will eventually be mapped onto a common conceptual model. However, we want the conceptual model to comprise words that are fairly common in the domain, so that it is easy for the user to comprehend.
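The selection of a representative can be sketched as below, using the noun frequencies from the house and home example (2 and 3). This is a minimal sketch under stated assumptions: the tagged token list, the dictionary layout of the synset and the function name are hypothetical, and the verb occurrences of house are invented purely to show why the count is restricted to one part-of-speech.

```python
import random
from collections import Counter

# Hypothetical POS-tagged tokens from a document collection: (word, part-of-speech).
tagged_tokens = [
    ("house", "noun"), ("house", "noun"),
    ("home", "noun"), ("home", "noun"), ("home", "noun"),
    # Invented verb uses of "house", to contrast with the overall frequency.
    ("house", "verb"), ("house", "verb"), ("house", "verb"), ("house", "verb"),
]

# Toy stand-in for the shared noun synset shown in Figure 3.2.
synset = {"pos": "noun", "lemmas": ["family", "household", "house", "home", "menage"]}

def choose_representative(synset, tagged_tokens, rng=random):
    """Pick the synset member most frequent in the given part-of-speech."""
    freq = Counter(word for word, pos in tagged_tokens if pos == synset["pos"])
    observed = [w for w in synset["lemmas"] if freq[w] > 0]
    if not observed:
        return None  # no member of the synset occurs in the collection
    best = max(freq[w] for w in observed)
    tied = [w for w in observed if freq[w] == best]
    return rng.choice(tied)  # ties are broken at random, as described in the text

representative = choose_representative(synset, tagged_tokens)
overall_most_frequent = Counter(w for w, _ in tagged_tokens).most_common(1)[0][0]
```

Counting only noun occurrences selects home. Counting all occurrences would favour house in this invented sample, even though house is the less frequent noun.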

Furthermore, to ensure that words from unseen text will be correctly interpreted, all other synonyms of the representative word for this sense are included in the cluster. Thus this process selects the WordNet synsets that are relevant to the domain and uses the most frequent word in each synset as its representative. The representative word is used in place of every other word in its synset whenever that word appears in one of our documents or in a query.
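The resulting substitution step might look like the following sketch. It assumes the representative home has already been chosen for the synset in Figure 3.2 and that incoming text is already part-of-speech tagged. The function names, tag labels and sample query are hypothetical. The lookup is keyed on word and part-of-speech only, so it does not attempt to decide which sense of a word is intended.

```python
def build_mapping(synset_lemmas, pos, representative):
    """Map every lemma of a selected synset, seen or unseen, to its representative."""
    return {(lemma, pos): representative for lemma in synset_lemmas}

def harmonise(tagged_tokens, mapping):
    """Replace each (word, pos) token that has a mapping; leave the others unchanged."""
    return [mapping.get((word, pos), word) for word, pos in tagged_tokens]

# All members of the synset are mapped, including words absent from the collection.
mapping = build_mapping(
    ["family", "household", "house", "home", "menage"], "noun", "home"
)

# Hypothetical tagged query containing a synonym not seen in the collection.
query = [
    ("the", "det"), ("whole", "adj"), ("household", "noun"),
    ("was", "verb"), ("asleep", "adj"),
]
standardised = harmonise(query, mapping)
```

Because household is a member of the synset, it is replaced by home even though it never occurred in the example collection, while a verb use of house would be left untouched.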