Chapter 4. The structure of the mental lexicon as defined by patterns of word cooccurrence

4.1 A similarity-based semantic space

4.1.1 Elements and parameters of the semantic hyperspace

4.1.1.1 The corpus

The size of the corpus affects the robustness of the cooccurrence-based representations. Larger corpora produce vector representations that are less affected by the noise that arises from limited corpus size. Patel, Bullinaria and Levy (1998) and Curran (2004) found that increasing the size of the corpus improved their results, even for very large corpora (Curran used a two-billion-word corpus). Several hyperspace studies in English use (subsets of) large corpora such as the British National Corpus (BNC, around 90 million written and 10 million transcribed spoken words) or USENET (a corpus of around 170 million words of newsgroup text).

On the subject of spoken versus text corpora, McDonald (2000) gives three reasons why speech is better than text. First, speech is the primary environment for language acquisition. (I would add that speech is the primary source of human communication.) Second, the smaller type:token ratio of speech provides a more reliable source of contextual information and thus allows the construction of denser vectors. Third, the results he obtained with the spoken subset of the BNC (around 10 million words) fitted isolated word recognition data better than text subsets of the BNC of similar size.

The chosen corpus may be lemmatised or otherwise prepared before counting the cooccurrences (e.g. McDonald, 2000). Lemmatisation removes all morphology and leaves only word stems, which changes the information carried by the vectors and eliminates possible morphology-based clusters in the hyperspace. Annotated corpora can be used to disambiguate between homophones in the counts, refining the quality of the vectors. For example, McDonald (2000) and Monaghan and Christiansen (2004) both took information about the syntactic category of words from the CELEX database; Curran (2004) marked up the corpus with sentence splitting, tokenization and part-of-speech tagging.

4.1.1.2 The context

Context window methods count occurrences of a number of context words within a window of a number of words before and/or after the target word.

The target words are the nodes in the semantic space. More target words mean a more complete space.

The main variables in the context are the window size (how many words around the target word are considered) and shape (are the context-words to be counted to the left, to the right of the target, or both) and the number and choice of context words that are included in the calculation of the vector components.
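The counting procedure described above can be sketched as follows. This is a minimal illustration and not the procedure of any of the studies cited: the function names `window_counts` and `to_vector`, the toy sentence and the three context words are all assumptions made for the example. The left and right contexts are counted separately, so the window shape is kept as a parameter of the representation.

```python
from collections import Counter

def window_counts(tokens, target, context_words, size=2):
    """Count context words within `size` tokens to the left and to the
    right of every occurrence of `target` (hypothetical helper)."""
    context = set(context_words)
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Left window: up to `size` tokens before the target.
        for w in tokens[max(0, i - size):i]:
            if w in context:
                left[w] += 1
        # Right window: up to `size` tokens after the target.
        for w in tokens[i + 1:i + 1 + size]:
            if w in context:
                right[w] += 1
    return left, right

def to_vector(left, right, context_words):
    """Concatenate left and right counts, left components first."""
    return [left[w] for w in context_words] + [right[w] for w in context_words]

# Toy corpus and context word set, invented for illustration only.
tokens = "the dog chased the cat and the cat chased the mouse".split()
context_words = ["the", "chased", "and"]

dog_vector = to_vector(*window_counts(tokens, "dog", context_words), context_words)
cat_vector = to_vector(*window_counts(tokens, "cat", context_words), context_words)
```

With three context words and separate left and right counts, each target word is located by a six-component vector. Summing or averaging the two halves instead would give a three-component vector.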

The window extends over a number of words or characters to the left and/or to the right of the target word. Some studies employ large windows of around 500 words (Yarowsky, 1992; Beeferman, 1998), but this makes the calculations computationally expensive. Others use small windows both for syntactic and semantic categorisation tasks: Finch and Chater (1996), two words to either side; Lowe and McDonald (2000), five words to either side; McDonald (2000), up to 10-20 words to either side; Curran (2004), combinations of 1-3 words to either side (finding the best results for one word to each side and with two words to the left). Patel, Bullinaria and Levy (1998) searched the parameter space in an attempt to optimize the window size and shape against two evaluation criteria: the ratio of mean Euclidean distances between semantically related and unrelated words, and a measure of syntactic categorisation. They found that the best results were obtained by counting the left and right contexts separately (as two components of the vector), using window sizes between two and 16 words. However, Levy, Bullinaria and Patel (1998), using different criteria for the optimisation of the parameter space - semantic and syntactic categorisation and synonym choice - found that the best results were obtained by averaging the contents of the left and right windows with window sizes between one and seven words.

Monaghan, Chater and Christiansen (in press) used a window of one word to the left (the preceding word only) for a noun-verb discrimination task (carried out using both distributional and phonological clues). Mintz (2003) developed a different form of window called ‘frame’ consisting of a pair of words that occur separated by one intervening word, e.g. ‘a _ of’. He showed that frequently occurring frames accurately predicted the syntactic category of the intervening word. Monaghan and Christiansen (2004) compared Mintz’s method with Monaghan, Chater and Christiansen’s (in press) preceding word window and found that while the frames had a higher accuracy for noun-verb classification, the preceding word window classified a much higher proportion of words.
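The idea of a frame can be illustrated with a short sketch. It is not Mintz's implementation: the function name `frame_fillers`, the toy utterance and the frequency threshold of two are assumptions made for the example, and the sketch ignores utterance boundaries for simplicity.

```python
from collections import defaultdict

def frame_fillers(tokens):
    """Map each frame (preceding word, following word) to the list of
    words found in the intervening slot (hypothetical helper)."""
    frames = defaultdict(list)
    # Slide over every run of three consecutive tokens: a X b.
    for a, x, b in zip(tokens, tokens[1:], tokens[2:]):
        frames[(a, b)].append(x)
    return frames

# Toy input, invented for illustration only.
tokens = "you want to eat it you have to see it".split()
frames = frame_fillers(tokens)

# Keep only frames that occur at least twice (an arbitrary threshold).
frequent = {frame: words for frame, words in frames.items() if len(words) >= 2}
```

In this toy example the two frames that recur, 'you _ to' and 'to _ it', both collect only verbs in the intervening slot.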

The number of context words determines the dimensionality of the space. It is usually a few hundred: Finch and Chater (1992) used 150 context words; Lund and Burgess (1996) used 200, and claimed that adding more context words did not alter the results; Lowe and McDonald (2000) used 536 context words; McDonald (2000) used 446 context words.

The choice of context words defines the type of information that the space captures. Some studies simply select the most common words in the corpus (Finch and Chater, 1992; Redington, Chater and Finch, 1998), while others remove from that set a series of very frequent uninformative words such as prepositions, conjunctions, determiners and pronouns, which they claim are so ubiquitous that they do not help in judging semantic similarity (Lowe and McDonald, 2000; McDonald, 2000; Jarmasz, 2003). Yet other studies add extra constraints to the context word set. For example, McDonald (2000) and Lowe and McDonald (2000) chose the most reliable context words, that is, those that produced the most consistent cooccurrence patterns across a number of sub-corpora. However, Levy and Bullinaria (2001) found that adding the most frequent words in the corpus (mostly functors) to Lowe and McDonald's reliable context words significantly boosted the results in a semantic categorisation task. A context word set consisting mainly of function words also seems to help categorise words syntactically (Finch and Chater, 1992; Redington, Chater and Finch, 1998).
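The two simplest selection strategies mentioned above, taking the most frequent words and taking the most frequent words after removing a list of function words, can be sketched as follows. The function name `select_context_words`, the toy corpus and the small stop list are assumptions made for the example. The reliability criterion of Lowe and McDonald (2000) is not implemented here.

```python
from collections import Counter

def select_context_words(tokens, n, stop_words=frozenset()):
    """Return the n most frequent words that are not in stop_words.
    Ties are broken alphabetically so the result is deterministic."""
    counts = Counter(t for t in tokens if t not in stop_words)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

# Toy corpus and stop list, invented for illustration only.
tokens = "the cat sat on the mat and the dog sat on the cat".split()
function_words = frozenset({"the", "on", "and"})

most_frequent = select_context_words(tokens, 3)
content_only = select_context_words(tokens, 3, stop_words=function_words)
```

The first set is dominated by function words, the kind of set associated above with syntactic categorisation. The second contains only content words.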

To sum up, syntactic categorisation tends to be best achieved with very small windows and functors in the context word set, and semantic categorisation, with larger windows and content words in the context word set.

4.1.1.3 Metrics of similarity

Vector space models of the semantic lexicon assume that semantically similar words tend to occur in similar contexts. This section reviews the most commonly used methods to measure similarity between word context vectors.

Among the geometric similarity metrics (illustrated in Figure 4.2) are the Euclidean distance, which is the straight-line distance between the two points located by the vectors in the space, and the City Block (also called Manhattan or taxicab) distance, so called because it measures the route from A to B in a grid-like geometry such as the streets and avenues of Manhattan: in straight perpendicular lines, turning at the corners. The City Block and Euclidean distance metrics are sensitive to vector length, but this problem can be overcome by measuring similarity as the cosine of the angle between the two position vectors. The cosine focuses on the difference between the directions of the vectors (see Figure 4.2) and is not sensitive to vector length, which makes it appropriate for comparing words of different frequency. It is, however, sensitive to vector sparseness, so it should be used to compare vectors of similar sparseness.

Figure 4.2. Three geometrical similarity measures between points A and B: the City Block distance is CB1 + CB2; the Euclidean distance is D; the cosine distance is the cosine of angle α.
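The three geometric measures can be written out directly from their definitions. The sketch below uses plain Python lists as vectors. The function names and the two example vectors are assumptions chosen so that one vector is exactly twice the other, which shows the difference in sensitivity to vector length.

```python
import math

def euclidean(x, y):
    """Straight-line distance between the points located by x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    """Sum of the absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Two vectors with the same direction but different lengths.
a = [3.0, 4.0]
b = [6.0, 8.0]

d_euclidean = euclidean(a, b)    # grows with the difference in length
d_city_block = city_block(a, b)  # grows with the difference in length
similarity = cosine(a, b)        # depends only on direction
```

Here the Euclidean and City Block distances are non-zero even though the two vectors point in the same direction, whereas the cosine takes its maximum value of 1.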

Other metrics commonly used in information retrieval are the Dice metric (also used to measure phonological similarity, see § 3.2.1), which is twice the ratio between shared attributes and the total number of attributes for each target word, and the Jaccard metric, which compares the number of common attributes with the number of unique attributes for each pair of targets.
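In their simplest, set-based form, the two coefficients can be sketched as below. The assumption here is that each target word is represented by the set of attributes (for instance, context words) it occurs with at least once. The weighted variants used in the literature are not shown, and the attribute sets are invented for illustration.

```python
def dice(a, b):
    """Twice the number of shared attributes divided by the total
    number of attributes of the two words."""
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Number of shared attributes divided by the number of distinct
    attributes of the pair."""
    return len(a & b) / len(a | b)

# Toy attribute sets, invented for illustration only.
attributes_x = {"eat", "drink", "cook", "buy"}
attributes_y = {"eat", "drink", "pour"}

dice_xy = dice(attributes_x, attributes_y)        # 2 * 2 / (4 + 3)
jaccard_xy = jaccard(attributes_x, attributes_y)  # 2 / 5
```

Both coefficients range from 0 (no shared attributes) to 1 (identical attribute sets). Neither is defined when both sets are empty.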

Similarity coefficients have also been used in Internet search engines (e.g. Tudhope & Taylor, 1996). Information-theory metrics include the Kullback-Leibler divergence (or relative entropy) and Hellinger distance, both of which quantify the differences between two probability distributions.
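Both information-theoretic measures operate on probability distributions, so cooccurrence counts must first be normalised to sum to one. The sketch below follows the standard textbook definitions. The function names, the use of the natural logarithm and the toy counts are assumptions made for the example.

```python
import math

def normalise(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence of q from p, in nats. It is infinite
    when q assigns zero probability to an event that p allows."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def hellinger(p, q):
    """Hellinger distance, bounded between 0 and 1."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

# Toy cooccurrence counts, invented for illustration only.
p = normalise([2, 2])
q = normalise([1, 3])

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
h_pq = hellinger(p, q)
```

The Kullback-Leibler divergence is not symmetric (`kl_pq` and `kl_qp` differ) and becomes infinite for sparse vectors with non-matching zeros, whereas the Hellinger distance is symmetric and always finite.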

Curran (2004) compared the behaviour of most of the metrics explained above, plus several variants including weight functions designed to assign a higher value to context words that are more indicative of the meaning. He found that Dice and Jaccard performed best in a semantic task. Levy, Bullinaria and Patel (1998) compared the Euclidean, City Block, cosine, Hellinger and Kullback-Leibler metrics and found that the last two (the information-theoretic metrics) performed best in semantic tasks.

In the studies presented in the rest of this thesis I measure the similarity between the cooccurrence vectors of two words as the cosine of the angle they form (following McDonald, 2000). The cosine of the angle between the vectors locating words x and y is calculated as follows (for vectors defined by n components):

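\[
\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}
\]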

Following the same logic as the analysis of phonological similarity, the aspects of the lexicon where semantic similarity is more easily detected must correspond to the more salient structural parameters of the representational space of the semantic lexicon. In cooccurrence statistics methods, the parameters are distributional cooccurrence patterns of words. Different types of words play different parts in defining the semantic space. Section 4.2 explores a semantic hyperspace representation of the Spanish lexicon generated with cooccurrence statistics. In particular it examines the role of syntactic category (focusing on nouns and verbs), of semantics proper and of gender in the organization of the semantic hyperspace.
