Chapter 4 Semantic Augmentation of User Generated Content

4.2 The ViewS Semantic Augmentation Pipeline

4.2.2 Enrichment

The purpose of surface form enrichment is to extend the surface form with additional linguistically and semantically related terms. The enriched output increases the probability of mapping a text term to the ontology entities.

The enrichment process also uses the WordNet lexical database. For a given term and POS tag, WordNet defines a structure of senses called a lemma. Each lemma comprises a set of senses for the term, organised into a set of synonym sets (known as synsets).

Sense Detection and Mapping.

WordNet semantically classifies each synset into lexical categories, e.g. noun.animal and verb.motion are categories that depict nouns related to animals and verbs related to motion respectively. A set of lexical categories is selected according to their relevance to the domain for which we want to model viewpoints. For example, for the domain of IC and social signals, verb.emotion is relevant but not noun.animal.

To obtain a finer-grained semantic classification that directs the linguistic and semantic enrichment at word level, an upper ontology is used to further filter out irrelevant linguistic data. The Suggested Upper Merged Ontology (SUMO) [101] is selected. SUMO offers two main advantages: (a) it covers a wide range of aspects, e.g. communication, people and physical elements, which is important for the generality of the approach; and (b) it provides direct mappings of ontology entities (including concepts, individuals and predicates) to WordNet synsets [102]. Other upper ontologies that could be used include DOLCE [103], which also provides an alignment with WordNet. However, that alignment is based on an early version and considers only the top level of WordNet.

Mapping operators between a SUMO entity and a WordNet synset include: equivalence, subsuming and instance mapping. It is then possible to examine word senses (in synsets) from the text and link them to the appropriate domain-specific SUMO entities (a similar method has been followed in [104, 105]). For example, in the music domain, the WordNet term "song" has an equivalent mapping with the SUMO concept "MakingVocalMusic" in the sense of "the act of singing". Hence "MakingVocalMusic" can be used to enrich the surface form "song" from the specific synset.

[Figure labels: surface form (SF), exact text tokens (ET), stemmed tokens (ST)]

For a given token (ET/ST) or a multi-word term (MWT) in the surface form (SF), its sense synsets SS are filtered, based on the pre-selected set of semantic lexical categories, to keep only the relevant senses, forming SS1: SS1 ⊆ SS. The senses in SS1 are then filtered further to keep only those covered by the relevant SUMO mappings, forming SS2: SS2 ⊆ SS1 ⊆ SS. SS2 is used for the surface form enrichment.
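The two filtering stages can be illustrated with a minimal Python sketch. It is not the ViewS implementation: the `Sense` record, the `filter_senses` helper, the lexical category labels and the entities named `ExampleEntityA`/`ExampleEntityB` are hypothetical stand-ins. Only the mapping of "song" (the act of singing) to "MakingVocalMusic" comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    """One candidate sense (synset) of a surface-form token (hypothetical record)."""
    gloss: str
    lexical_category: str   # WordNet lexical category label (illustrative values)
    sumo_entity: str        # SUMO entity the synset is mapped to
    sumo_operator: str      # "equivalence", "subsuming" or "instance"


def filter_senses(ss, categories, sumo_entities):
    """Return (SS1, SS2) with SS2 ⊆ SS1 ⊆ SS."""
    # Stage 1: keep senses whose lexical category was pre-selected for the domain.
    ss1 = [s for s in ss if s.lexical_category in categories]
    # Stage 2: of those, keep senses mapped to a relevant SUMO entity.
    ss2 = [s for s in ss1 if s.sumo_entity in sumo_entities]
    return ss1, ss2


# Toy senses for the token "song"; categories and the last two mappings are made up.
SS = [
    Sense("the act of singing", "noun.act", "MakingVocalMusic", "equivalence"),
    Sense("a short musical composition", "noun.communication", "ExampleEntityB", "subsuming"),
    Sense("a very small sum", "noun.possession", "ExampleEntityA", "subsuming"),
]

SS1, SS2 = filter_senses(
    SS,
    categories={"noun.act", "noun.communication"},
    sumo_entities={"MakingVocalMusic"},
)
```

In this toy run SS1 keeps the first two senses and SS2 keeps only the "act of singing" sense, which is the one that would be passed on to enrichment.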

Enrichment Types.

With the resulting sense synsets, SS2, four types of enrichment (Figure 4.3) are conducted using one of two methods: (i) semantically enhanced linguistics, to retrieve lexical derivations, synonyms and antonyms; and (ii) corpus-based statistical measurements, to retrieve similar words.

Figure 4.3 Elements of the enriched text surface form (ESF).

(i) For lexical derivations, synonyms and antonyms. Words in SS2 are used to query WordNet for lexical derivations, synonyms and antonyms. For each result set, the whole synset is exploited (i.e. the lexical derivatives of a word are again organised as a synset) and checked for relevance using the sense detection and mapping described above. Antonyms are qualified with a negation attribute. MWT elements from SF are also used to query WordNet and match keywords, and are then enriched in the same way.
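A minimal sketch of how such enrichment records might be represented is given below. It assumes a toy in-memory lexicon in place of WordNet, and the `Enrichment` record, the `enrich` function and the `is_relevant` callback (standing in for the sense detection and mapping check) are hypothetical names, not part of ViewS or WordNet. The kind labels DRV, SNM and ANT follow the labels of Figure 4.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enrichment:
    """One term added to the enriched surface form (hypothetical record)."""
    term: str
    kind: str              # "DRV" (derivation), "SNM" (synonym) or "ANT" (antonym)
    negated: bool = False  # antonyms carry the negation attribute


# Toy stand-in for WordNet lookups; the entries are illustrative only.
TOY_LEXICON = {
    "happy": {
        "DRV": ["happiness", "happily"],
        "SNM": ["glad"],
        "ANT": ["unhappy"],
    },
}


def enrich(word, lexicon, is_relevant=lambda term: True):
    """Collect derivations, synonyms and antonyms that pass the relevance check."""
    entries = lexicon.get(word, {})
    result = []
    for kind in ("DRV", "SNM", "ANT"):
        for term in entries.get(kind, []):
            if is_relevant(term):
                # Only antonyms are flagged as negated.
                result.append(Enrichment(term, kind, negated=(kind == "ANT")))
    return result


enriched = enrich("happy", TOY_LEXICON)
```

Keeping the negation flag on the record, rather than dropping antonyms, lets a later annotation step still match the antonym against the ontology while knowing that the polarity is reversed.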

(ii) For similar words. To enrich the surface form with similar words, DISCO [106] is used; it retrieves similar words from English language corpora using techniques based on statistical distributions. This enables contextually related terms (terms that co-occur in text) to be retrieved, increasing the probability that such terms are found in the ontologies for semantic annotation. DISCO is used with the Wikipedia corpus, as it provides multi-disciplinary collective knowledge (compared with PubMed, which is medicine-oriented, or the British National Corpus, which is significantly smaller than the Wikipedia corpus provided with the tool; see footnote 17).

[Figure 4.3 labels: enriched surface form (ESF), lexical derivations (DRV), synonyms (SNM), antonyms (ANT)]
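The idea of retrieving words by their statistical distribution can be illustrated with a small, self-contained sketch. It does not use DISCO and does not reproduce DISCO's own similarity measure: cosine similarity over raw co-occurrence counts is swapped in as a simple stand-in, and the corpus, window size and function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_words(word, vectors, top_n=3):
    """Return the top_n (word, score) pairs most similar to `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Tiny made-up corpus: "song" and "tune" occur in identical contexts.
corpus = [
    "she sings a song loudly",
    "she sings a tune loudly",
    "he eats an apple slowly",
]
vectors = cooccurrence_vectors(corpus)
ranking = similar_words("song", vectors)
```

Here "tune" ranks first for "song" because both words share the same context words, while "apple" shares none and scores zero.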

Figure 4.4 presents the pseudocode for the 'similar-word enrichment' algorithm used with DISCO for a given keyword in SS2. The input to the algorithm includes: a keyword (in), the relevant senses from WordNet for this keyword (in.senses) and the number of senses (in.senses.count). With this input, the DISCO API is queried and returns a set of similar words (out) together with their similarity scores (Sim(out_word)). At this stage, a threshold on the similarity value is applied. For each similar word (out_word) that passes the declared threshold, WordNet is queried to retrieve its senses (out_word.senses) for every possible POS tag. Each of these senses (out_word.sense) is matched against the input senses (in.senses) using a weighted score over the following parameters: (a) the lexical category of the sense from WordNet, (b) the SUMO mapping entity, and (c) the SUMO mapping operator (one of equivalence, subsuming or instance).

The threshold values SIM_THRESHOLD and SENSE_SCORE_THRESHOLD, as well as the constant scoring values MAX_SENSE_SCORE, LEX_SCORE, SUMO_SCORE and SUMO_OP_SCORE, can be set manually by the experimenter for comparisons. The process includes querying DISCO with words related to the selected domain and dimensions, retrieving the results, and checking the possible senses of the results against the sense detection and sense mapping filters discussed earlier. When the resulting words match the selected domain, the experimenter retrieves the similarity scores and tunes the threshold accordingly (see footnote 18).

Footnote 17: http://www.linguatools.de/disco/disco-download_en.html

Footnote 18: Following the described process, for social signals (see Section 4.4) the threshold is set to 0.7.

in                 //the word used to query DISCO, in its base form
in.senses          //the senses of the word used to query DISCO
in.senses.count    //the number of senses
out                //the set of resulting words
Sim(out_word)      //the similarity score for an output word

set SIM_THRESHOLD
set MAX_SENSE_SCORE
set LEX_SCORE
set SUMO_SCORE
set SUMO_OP_SCORE
set SENSE_SCORE_THRESHOLD

FOR each out_word in out
    IF Sim(out_word) ≤ SIM_THRESHOLD THEN
        EXCLUDE out_word;
    ELSE
        out_word.senses = Extract possible senses from WordNet;
        //including all the possible syntactic roles, i.e. noun, verb, adverb and adjective
        FOR each out_word.sense in out_word.senses
            //check if the word sense is in context
            maximum_score = MAX_SENSE_SCORE * in.senses.count;
            current_score = 0;
            FOR each in.sense in in.senses
                IF out_word.sense.lexical_category = in.sense.lexical_category THEN
                    current_score += LEX_SCORE;
                IF out_word.sense.SUMO_concept = in.sense.SUMO_concept THEN
                    current_score += SUMO_SCORE;
                IF out_word.sense.SUMO_operator = in.sense.SUMO_operator THEN
                    current_score += SUMO_OP_SCORE;
            IF current_score / maximum_score ≥ SENSE_SCORE_THRESHOLD THEN
                INCLUDE out_word.sense

Figure 4.4 The 'similar-word enrichment' algorithm used with DISCO.
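The pseudocode in Figure 4.4 can be rendered as a short runnable Python sketch under stated assumptions. DISCO and WordNet are replaced by plain dictionaries, and the constant values are illustrative rather than the values used in ViewS; 0.7 is taken from footnote 18 on the assumption that it refers to the similarity threshold. The `Sense` record, the function names and `ExampleEntity` are hypothetical, and the guard for an empty sense list is an addition not present in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    lexical_category: str
    sumo_concept: str
    sumo_operator: str   # "equivalence", "subsuming" or "instance"


# Illustrative constants; in the text these are set manually by the experimenter.
SIM_THRESHOLD = 0.7
LEX_SCORE = 1.0
SUMO_SCORE = 2.0
SUMO_OP_SCORE = 1.0
MAX_SENSE_SCORE = LEX_SCORE + SUMO_SCORE + SUMO_OP_SCORE
SENSE_SCORE_THRESHOLD = 0.5


def sense_score(candidate, in_senses):
    """Normalised weighted match of one candidate sense against all input senses."""
    if not in_senses:
        return 0.0  # guard added here; the figure does not cover this case
    maximum_score = MAX_SENSE_SCORE * len(in_senses)
    current_score = 0.0
    for in_sense in in_senses:
        if candidate.lexical_category == in_sense.lexical_category:
            current_score += LEX_SCORE
        if candidate.sumo_concept == in_sense.sumo_concept:
            current_score += SUMO_SCORE
        if candidate.sumo_operator == in_sense.sumo_operator:
            current_score += SUMO_OP_SCORE
    return current_score / maximum_score


def similar_word_enrichment(in_senses, similar, lookup_senses):
    """similar: word -> similarity score; lookup_senses: word -> list of Sense."""
    included = {}
    for out_word, sim in similar.items():
        if sim <= SIM_THRESHOLD:
            continue  # EXCLUDE out_word
        senses = [s for s in lookup_senses(out_word)
                  if sense_score(s, in_senses) >= SENSE_SCORE_THRESHOLD]
        if senses:
            included[out_word] = senses  # INCLUDE the senses that are in context
    return included


# Toy stand-ins for the DISCO result set and the WordNet sense lookup.
in_senses = [Sense("noun.act", "MakingVocalMusic", "equivalence")]
toy_similar = {"singing": 0.82, "melody": 0.75, "table": 0.40}
toy_wordnet = {
    "singing": [Sense("noun.act", "MakingVocalMusic", "subsuming"),
                Sense("adj.all", "ExampleEntity", "subsuming")],
    "melody": [Sense("noun.communication", "ExampleEntity", "equivalence")],
    "table": [Sense("noun.act", "MakingVocalMusic", "equivalence")],
}
result = similar_word_enrichment(in_senses, toy_similar, lambda w: toy_wordnet.get(w, []))
```

With these toy values, "table" is dropped by the similarity threshold, "melody" passes it but matches only on the mapping operator (score 0.25), and "singing" keeps its first sense, which matches on lexical category and SUMO concept (score 0.75).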
