Evaluation Methodologies and Corpora - Implementation of ESA and Evaluation Methodology

3.3 Implementation of ESA and Evaluation Methodology

3.3.2 Evaluation Methodologies and Corpora

Semantic relatedness has been applied to a range of scenarios in related work. Depending on the use case of the respective approach, different evaluation designs and evaluation corpora are appropriate.

In the following subsections, different corpora are presented whose properties are relevant for the given scenario:

Term–based Relatedness: In ELWMS.KOM, tags usually consist of one to two terms (on average 1.73 terms), and the majority of tags are nouns. Therefore, a corpus that is comparable to the tags in ELWMS.KOM should consist of single terms (cf. subsection 3.3.2) or very short multi–term documents (cf. subsection 3.3.2). Another observation made on the tags of ELWMS.KOM is that some tags represent a generic knowledge domain (e.g. “e–learning”), whereas other tags are very specific and denote a clearly delimited concept (e.g. “mobile social search”). This should be reflected in the specificity of the corpora’s terminology. Therefore, several corpora with different properties have been used to examine the applicability of semantic relatedness.

Figure 3.4: Process of creating a semantic interpreter from Wikipedia articles and deriving a semantic interpretation vector i_esa. (The figure shows the preprocessing steps applied to the n articles of the reference corpus: 1. tokenization, 2. stemming, 3. calculation of tf-idf, 4. normalizing. The resulting n×|terms| semantic interpretation matrix M is applied to the |terms|×1 vector of a document d1, yielding the n×1 semantic interpretation vector i_esa, which is compared with that of a document d2.)

Document–based Relatedness: Snippets in ELWMS.KOM are usually short and contain on average 120 terms. A corpus that is used to measure the quality of a semantic relatedness approach should thus approximately reflect this document size. Further, the learning intention of users is an important aspect in the choice of documents. Since, to the knowledge of the author, no existing corpus reflects these requirements, the novel semantic corpus Gr282 is presented (cf. subsection 3.3.2), which conforms to this specification.

⁹ Available from http://dumps.wikimedia.org/, retrieved 2011-01-11

Multilingual Tags and Documents: As shown in section 3.1.2, users of ELWMS.KOM in the examined academic environment often collected resources and created tags in different languages. Thus, corpora should be used that allow evaluating a cross–language semantic relatedness approach for term–based and document–based relatedness in addition to the monolingual approaches.

The following subsections give an overview of the selected corpora and present the applied evaluation methodologies.

Relatedness of Term–Term Pairings

In scenarios applying semantic relatedness to word sense disambiguation [70, 148, 84], the common methodology for evaluating an approach is to compare human judgements of the relatedness of a set of term pairs with the ratings the respective algorithm has calculated. The focus of this evaluation approach is not the absolute value of the respective ratings but rather their order (ranking) in comparison to the human ratings. Therefore, a rank correlation measure is usually applied, for example Spearman’s rank correlation coefficient (also called Spearman’s ρ). This measure can be applied to two k–sized lists x and y containing pairwise values. It determines how well the relationship between the values can be approximated by a monotonic function. First, for each pair of values x_i and y_i, the ranks rank(x_i) and rank(y_i) are determined. In case of equal relatedness values (called ties), the average of the respective ranks is assigned. Especially the lower relatedness boundary of 0.0 is likely to occur several times, e.g. when no term overlap exists. After the ties have been handled, Spearman’s rank correlation coefficient ρ is defined [141] as

$$\rho = 1 - \frac{6 \sum_{i=1}^{k} \mathit{diff}_i^{\,2}}{k(k^2 - 1)} \qquad (3.4)$$

where diff_i is the difference between the ranks of x_i and y_i and k is the size of the samples.
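The computation described above can be sketched as follows. This is a minimal illustration of equation 3.4 with tie-averaged ranks, not the implementation used for the evaluations; the function names `average_ranks` and `spearman_rho` and the example ratings are assumptions made for this sketch. Note that with many ties, equation 3.4 only approximates the correlation of the ranks.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run of equal values (a tie group).
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho following equation 3.4 (x and y are lists of size k)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length k >= 2")
    k = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_sq_diff = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_sq_diff / (k * (k * k - 1))


# Hypothetical human ratings vs. algorithm ratings for four term pairs;
# the two 0.0 scores form a tie and share the rank 1.5.
human = [3.9, 3.1, 0.4, 0.0]
system = [0.8, 0.9, 0.0, 0.0]
print(spearman_rho(human, system))  # 0.75
```

Only the order of the values matters: rescaling the algorithm's ratings by any strictly increasing function leaves the result unchanged.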

The significance of the difference between two correlations can be determined using t_diff [69]. It is used to check whether the correlations of the variable pairs (x, y) and (z, y) differ significantly. It is defined as:

$$t_{\mathit{diff}} = (\rho_{xy} - \rho_{zy}) \sqrt{\frac{(k - 3)(1 + \rho_{xz})}{2\,(1 - \rho_{xy}^2 - \rho_{xz}^2 - \rho_{zy}^2 + 2\rho_{xy}\rho_{xz}\rho_{zy})}} \qquad (3.5)$$

The resulting values for t_diff are compared with the critical values of the t–distribution.
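Equation 3.5 can be evaluated directly once the three correlations are known. The following is a minimal sketch; the function name `t_diff` and the example values are assumptions for illustration. In the usual formulation of this test, the statistic is compared against a t–distribution with k − 3 degrees of freedom, which is an assumption not stated in the text above.

```python
import math


def t_diff(rho_xy, rho_zy, rho_xz, k):
    """Test statistic of equation 3.5 for two correlations sharing the variable y.

    rho_xy, rho_zy: correlations of two approaches x and z with the human ratings y
    rho_xz:         correlation between the two approaches
    k:              number of rated pairs (sample size)
    """
    if k <= 3:
        raise ValueError("k must be larger than 3")
    denominator = 2 * (1 - rho_xy ** 2 - rho_xz ** 2 - rho_zy ** 2
                       + 2 * rho_xy * rho_xz * rho_zy)
    if denominator <= 0:
        raise ValueError("the three correlations are not jointly consistent")
    return (rho_xy - rho_zy) * math.sqrt((k - 3) * (1 + rho_xz) / denominator)


# Hypothetical example: two approaches evaluated on a dataset with 65 pairs.
t = t_diff(rho_xy=0.8, rho_zy=0.6, rho_xz=0.5, k=65)
print(round(t, 2))
```

The statistic is zero when both approaches correlate equally with the human ratings, and its sign indicates which of the two correlations is larger.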

Datasets for Monolingual Evaluations

There are several datasets that have been used in monolingual semantic similarity evaluations, most notably the Rubenstein and Goodenough [164] similarity dataset (called Rub65) encompassing 65 pairs of nouns that have been rated by 51 humans for their similarity. There is a German translation of this dataset Gur65 [84] with 24 human raters, which unfortunately does not exactly correspond to the English version (cf. table A.1 in appendix A.1). For monolingual settings, these inconsistencies are not relevant, but for evaluating multilingual approaches this matters.

For determining semantic relatedness, the German dataset Gur350¹⁰ provides 350 term pairs with their respective relatedness values given by 8 subjects (cf. table A.2 in appendix A.1). In contrast to Gur65, this dataset contains not only nouns but also verbs, adjectives and adverbs. Further, named entities are included in this dataset, e.g. Benedikt, VW and Opel. This makes Gur350 a more challenging dataset that is not fully applicable to semantic relatedness approaches that use formal ontologies, e.g. WordNet.

Dataset for Multilingual Evaluations

The multilingual dataset Schm280 [172] is adapted from the English WordSim353 dataset created by Finkelstein et al. [70] (cf. table A.3 in appendix A.1). It contains 280 English noun pairs together with their German equivalents, translated by up to 12 participants; each pair carries a relatedness value rated by at least 13 subjects.

Relatedness of Query Term–Document Pairings

A common scenario for semantic relatedness is the task of finding a related document for a given query term, for example in IR settings [203]. Usually, evaluations for such a scenario do not use rated term pairs but rather a query that has to be mapped to the correct document. This task is much more demanding for the respective approach, as it must be able to calculate a semantic relatedness measure between a single term and a possibly multi–term document. Some approaches presented in section 3.2 do not support taking multiple terms into account (e.g. [186]).

A corpus that is commonly used to evaluate such a setting is a “word choice problem” corpus, i.e. one with a rare term as query, for which the correct description has to be selected from a set of multiple possibilities. A data set that is often applied here is the TOEFL corpus [113], which consists of 80 query terms, each with a selection of four possible synonyms. Usually, this corpus is used to evaluate semantic similarity approaches, but it has been applied to semantic relatedness as well [203]. A German equivalent is the Reader’s Digest Word Puzzle corpus¹¹ (RDWP984). It was obtained from the 2001 to 2005 editions of the German Reader’s Digest Magazine and contains 984 multiple choice questions, each consisting of a query term and four possible answers in the form of a single term or a short definition (cf. table A.4 in appendix A.1). This dataset contains highly domain–specific, rare terminology and is therefore a challenging corpus for determining whether a semantic relatedness approach provides good coverage of the terminology.

¹⁰ No publication known, available at http://www.ukp.tu-darmstadt.de/data/semantic-relatedness/, retrieved 2011-02-25

The quality of a semantic interpreter with a reduced article set is indicated by two different values: Coverage and Accuracy. Accuracy can be further differentiated into Local Accuracy and Global Accuracy.

Coverage denotes the ratio of the number of queries for which ESA is able to calculate a result to the total number of queries. As ESA maps documents to concepts according to the term overlap with the concepts’ articles, it is crucial that the terms used in the documents are reflected in the semantic interpreter. A query is considered covered by the semantic interpreter if any relatedness can be calculated between the query term and the descriptions. Thus, Coverage indicates to what extent the semantic interpreter contains the terminology that is needed in order to generate a result.

Local Accuracy is the ratio of the number of queries answered correctly by ESA to the number of covered queries. For each covered query, the candidate answer that is most related to the query is taken as ESA’s answer and scored as correct if it matches the expected answer. Thus, Local Accuracy represents the quality of the evaluation without taking into account queries that could not be answered due to terms missing in the semantic interpreter.

Global Accuracy is the ratio of the number of queries correctly answered by ESA to the total number of queries. In contrast to Local Accuracy, it does take into account the accuracy loss resulting from non–covered queries and thus represents a quality measure reflecting a real–world setting.
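The three measures can be sketched as follows for a word choice corpus. This is a minimal illustration under stated assumptions: `relatedness` is a hypothetical callable returning a score for a query and a candidate answer, and a query counts as covered when at least one candidate receives a score above 0.0. The function name, the question format and the toy scores are not taken from the thesis.

```python
def evaluate_word_choice(questions, relatedness):
    """Compute Coverage, Local Accuracy and Global Accuracy.

    questions:   list of (query, candidates, correct_index) with non-empty candidates
    relatedness: callable (query, candidate) -> float; 0.0 is assumed to mean
                 that no relatedness could be calculated
    """
    total = len(questions)
    covered = 0
    correct = 0
    for query, candidates, correct_index in questions:
        scores = [relatedness(query, candidate) for candidate in candidates]
        if max(scores) <= 0.0:
            continue  # not covered: no relatedness to any candidate
        covered += 1
        # The candidate with the highest score is taken as the answer.
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == correct_index:
            correct += 1
    return {
        "coverage": covered / total if total else 0.0,
        "local_accuracy": correct / covered if covered else 0.0,
        "global_accuracy": correct / total if total else 0.0,
    }


# Toy relatedness scores (hypothetical); missing pairs score 0.0.
toy_scores = {("a", "x"): 0.9, ("a", "y"): 0.1,
              ("b", "x"): 0.2, ("b", "y"): 0.7,
              ("d", "x"): 0.3}
toy_questions = [("a", ["x", "y"], 0), ("b", ["x", "y"], 0),
                 ("c", ["x", "y"], 1), ("d", ["x", "y"], 0)]
result = evaluate_word_choice(toy_questions,
                              lambda q, c: toy_scores.get((q, c), 0.0))
print(result)
```

In this toy run three of four queries are covered and two are answered correctly, so Global Accuracy equals the product of Coverage and Local Accuracy.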

Relatedness of Document–Document Pairings

Comparing a query document to a set of other documents and finding the most related match is a typical task in Information Retrieval. For evaluating such an approach, a methodology is employed that is also used to evaluate search engine rankings [47]. Basically, a semantic relatedness value is calculated between each document d_q ∈ D and all d_i ∈ D \ {d_q}. The result is a list that is ranked by decreasing relatedness. d_q and a compared document d_k at rank k are defined to be semantically related (i.e. r_k = 1) if they cover the same or similar concepts. If documents are semantically related, they are allocated to the same semantic group D_q (equation 3.6).

$$r_k = \begin{cases} 1 & \text{if } d_q \text{ and } d_k \in D_q \\ 0 & \text{otherwise} \end{cases} \qquad (3.6)$$

Further, precision at rank and recall at rank (equations 3.7 and 3.8) are used to calculate the average precision (equation 3.9) over one relatedness comparison for different recall values.

$$\mathit{precision}(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i \qquad (3.7)$$

$$\mathit{recall}(k) = \frac{1}{|D_q|} \sum_{1 \le i \le k} r_i \qquad (3.8)$$

$$\mathit{average\ precision} = \frac{1}{|D_q|} \sum_{1 \le k \le |D|} r_k \cdot \mathit{precision}(k) \qquad (3.9)$$
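Equations 3.7 to 3.9 translate almost literally into code. The following is a minimal sketch; the function names and the example ranking are assumptions, and `group_size` stands for |D_q|, taken here to be the number of related documents that can appear in the ranked list.

```python
def precision_at(relevance, k):
    """Equation 3.7: fraction of the top k results that are related."""
    return sum(relevance[:k]) / k


def recall_at(relevance, k, group_size):
    """Equation 3.8: fraction of the semantic group found in the top k results."""
    return sum(relevance[:k]) / group_size


def average_precision(relevance, group_size):
    """Equation 3.9: precision(k) summed over the ranks of related documents."""
    total = sum(r_k * precision_at(relevance, k)
                for k, r_k in enumerate(relevance, start=1))
    return total / group_size


# Hypothetical ranking for one query document: r_k = 1 marks a related
# document at rank k. Two related documents exist, found at ranks 1 and 3.
ranking = [1, 0, 1, 0, 0]
print(precision_at(ranking, 2))       # 0.5
print(recall_at(ranking, 3, 2))       # 1.0
print(average_precision(ranking, 2))  # (1/1 + 2/3) / 2
```

Only ranks holding a related document contribute to the sum in equation 3.9, so pushing related documents towards the top of the list increases the average precision.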

Depending on the properties of the evaluation corpus used, there are two different ways of presenting the results. If one document corresponds to exactly one other document, a top–k [8] approach can be used. In top–k, a precision value is given for the k highest ranked results for each document: e.g. if the corresponding document is returned at rank 2, the top–1 result is 0.0, whereas the top–5 result is 1.0. As the overall result, the average over the top–k results of all documents is given. As there is only one relevant document, recall is always either 0.0 (relevant document not in the top–k result set) or 1.0 (relevant document in the top–k result set).
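A top–k evaluation of this kind can be sketched as follows. The function names and example values are assumptions for illustration; in particular, the treatment of ties in `rank_of_match` (documents with an equal score do not push the match down) is a choice made for this sketch and is not specified in the text.

```python
def rank_of_match(scores, match_index):
    """1-based rank of the corresponding document in a list of relatedness scores.

    Assumption: only strictly higher scores outrank the match (optimistic ties).
    """
    target = scores[match_index]
    return 1 + sum(1 for score in scores if score > target)


def top_k_score(ranks, k):
    """Average top-k result over all query documents.

    ranks: for each query document, the 1-based rank at which its single
           corresponding document was returned.
    """
    hits = sum(1 for rank in ranks if rank <= k)
    return hits / len(ranks)


# Hypothetical ranks of the corresponding document for three queries.
ranks = [1, 2, 7]
print(top_k_score(ranks, 1))   # one of three matches is at rank 1
print(top_k_score(ranks, 5))   # two of three matches are within the top 5
print(top_k_score(ranks, 10))  # all matches are within the top 10
```

Because each query has exactly one relevant document, the per-query value is either 0.0 or 1.0, and the average can be read as the share of queries whose match appears among the first k results.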

Alternatively, if one document corresponds to a set of other documents, top–k is not reasonably applicable. Here, all pair–wise comparisons are averaged and the average precision is plotted against interpolated recall, resulting in a so–called precision–recall diagram [47]. A precision–recall diagram represents a graph of the trade–off between precision and recall. For summarizing the quality of such a diagram numerically, two measures are usually given: Break Even Point (BEP) and Mean Average Precision (MAP). The BEP [200] represents the point where precision equals recall (as shown in the plots given in section 3.6.4, this is where the interpolated precision–recall curve crosses f(r) = r, i.e. the angle bisector of the first quadrant). The MAP is the mean of the average precision values that have been computed for all documents.
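Both summary measures can be sketched per query document as follows. This is a simplified illustration under stated assumptions: the break-even value is taken at rank |D_q|, where precision(k) and recall(k) coincide by definition, rather than from the interpolated, averaged precision–recall curve described above. The function names and example rankings are hypothetical.

```python
def average_precision(relevance, group_size):
    """Average precision of one ranked list (r_k = 1 marks a related document)."""
    total = 0.0
    found = 0
    for k, r_k in enumerate(relevance, start=1):
        if r_k:
            found += 1
            total += found / k  # precision(k) at a rank holding a related document
    return total / group_size


def mean_average_precision(runs):
    """Mean of the average precision values over all query documents.

    runs: list of (relevance, group_size) pairs, one per query document.
    """
    return sum(average_precision(rel, size) for rel, size in runs) / len(runs)


def break_even_point(relevance, group_size):
    """Uninterpolated break-even value: precision at rank group_size.

    At k = group_size, precision(k) and recall(k) share the same numerator
    and the same denominator, so both values are equal.
    """
    return sum(relevance[:group_size]) / group_size


# Two hypothetical query documents with semantic groups of size 2 and 1.
runs = [([1, 0, 1, 0, 0], 2), ([0, 1, 0, 0, 0], 1)]
print(mean_average_precision(runs))
print([break_even_point(rel, size) for rel, size in runs])
```

A perfect ranking yields 1.0 for both measures, which makes them convenient single-number summaries when semantic groups differ in size.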

Dataset for Monolingual Evaluation

For scenarios that enable the functionality of document recommendation, an evaluation corpus was needed that meets several requirements:

• The evaluation corpus should consist of German documents, as the focus of this research is based on the German Wikipedia.

• Documents in the evaluation corpus should conform to the snippet definition given in section 3.1.1, i.e. a majority of documents should contain between 20 and 200 terms.

• Documents in the evaluation corpus should honour the scenario of RBL with web resources. That is, they should contain a narrow scope of topics and be basically appropriate to meet specific information needs.

• Documents should contain different topics and have different scopes, i.e. should not only represent narrow factual knowledge but also contain opinions and overview information, making it a challenging task for semantic relatedness approaches.

Thus, a novel semantic corpus called Gr282 (cf. table A.5 in appendix A.1) has been built in a user study. Eight participants (2 female and 6 male; four students of Information Science, two students of Educational Science and two research assistants) were asked to research answers to a catalogue of ten questions (for a full listing see appendix A.2) using only fragments of web resources. For each question, they were to find five snippets that (partially) contained the answer, using one of four different search engines (Google¹², Yahoo!¹³, Bing¹⁴ and Ask¹⁵) in order to ensure diversity of the web resources found. Further, they were asked to restrict the snippets’ length to 20 to 200 terms. This was not a strict requirement, though: if needed, the participants were allowed to collect larger web resource fragments.

In order to conform to the fourth requirement named above, the questions were formulated such that five different types of questions were asked, with each type featuring two questions. The following question types were identified as relevant for the given scenario:

• Opinions, e.g. “Is the term Dark Ages justified?”

• Facts, e.g. “What is the FTAA?”

• Related snippets to a common topic, e.g. “Find examples for internet slang!”

• Homonyms, e.g. “What are Puma, Jaguar, Panther, Tiger and Leopard?”

• Broad topics, e.g. “Find information about the evolution of man!”

| | Gr282 |
|---|---|
| Size of corpus | 282 documents |
| Average length of snippets | 95.21 terms |
| Minimum length | 5 terms |
| Maximum length | 756 terms |
| Standard deviation | 71.31 terms |

Table 3.4: Short descriptive summary of novel corpus Gr282

After the answers had been collected, duplicate answers and answers from the same sources were discarded. Finally, the evaluation corpus consisted of 282 snippets (a short summary is given in table 3.4) that were labelled with their question types and manually split into groups of different semantic concepts. Because, as expected, the snippets collected for homonyms and broad topics turned out to have different meanings, several semantic groups could be formed for some questions (cf. appendix A.2). For example, for the question that asks about the meaning of “Puma, Jaguar, Panther, Tiger and Leopard” there are different correct answers. First, they are all names of big cats. Second, they are all project names of versions of Apple’s operating system OS X. Third, they are all common names of tanks (however, none of the study participants answered with this option). Thus, this question spans three semantic groups.

In the evaluation, an IR task was executed with the expectation of getting all semantically related snippets (i.e. all snippets in the same semantic group) before all semantically unrelated snippets. As the semantic groups have different sizes, a top–k evaluation is not applicable. Therefore, in this thesis, precision–recall diagrams as well as BEP and MAP are given as the results of the Gr282 evaluation.

¹² http://www.google.de/, retrieved 2009-10-02
¹³ http://de.yahoo.com/, retrieved 2009-10-02
¹⁴ http://www.bing.com/, retrieved 2009-10-02
¹⁵ http://de.ask.com/, retrieved 2009-10-02

Dataset for Multilingual Evaluation

The Europarl corpus [105] is a multilingual collection of sentence–aligned protocols of proceedings from the European Parliament in eleven languages. The documents consist of full, grammatically correct sentences in natural language, grouped in approximately 4,000 chapters¹⁶ and translated by professional translators. A challenge for cross–lingual relatedness approaches is the occurrence of many named entities (e.g. speakers) and the variability of the translations.

| Europarl300 | English | German |
|---|---|---|
| Size of corpus | 300 parallel documents | 300 parallel documents |
| Average length of snippets | 28.08 terms | 26.75 terms |
| Minimum length | 4 terms | 4 terms |
| Maximum length | 111 terms | 111 terms |
| Standard deviation | 17.24 terms | 15.77 terms |

Table 3.5: Short descriptive summary of corpus Europarl300

Due to computational constraints (for this cross–lingual evaluation, each document has to be compared to all documents in the parallel language, resulting in n² comparisons), only a subset of the Europarl corpus was taken, containing 300 parallel documents in German and English, referred to in the following as Europarl300 (cf. table 3.5 and table A.6 in appendix A.1). This subset consists of the first 300 documents of the Europarl test data of the second Workshop on Statistical Machine Translation 2007¹⁷. Because each document has exactly one corresponding document in the other language, a top–k evaluation is applicable in this scenario.

In document Semantic and Structural Analysis of Web–based Learning Resources - Supporting Self–directed Resource–based Learning (Page 49-55)