LM Experimentation for Function Word Probability

6. Model Implementation

6.5 LM Experimentation for Function Word Probability

The objective of this section is to implement an LM over the entire vocabulary of the training data set and to apply it to estimating the probability of function words in any given test sentence (this model is referred to as 𝐿𝑀1). The implementation of 𝐿𝑀2, which is used to estimate content word probability, is described later. Both models are subsequently applied to sentiment prediction.

The methodology involves implementing a baseline Kneser-Ney (KN) smoothed model, which is tuned by modifying the discount and the lower-order bigram and unigram estimates. Other smoothed LMs are implemented and measured against this baseline. The performance of the models is compared intrinsically by calculating perplexity (see Appendix 6 for a description of perplexity). As part of the implementation, two vital issues associated with LMs are addressed: out-of-vocabulary words and cutoffs.
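The perplexity comparison can be illustrated with a minimal sketch. It assumes a model has already assigned a probability to each token of a test sequence; the function name `perplexity` and the toy input are illustrative and are not taken from the thesis implementation.

```python
import math

def perplexity(word_probs):
    """Perplexity of a token sequence, given the model probability of each token."""
    if not word_probs:
        raise ValueError("need at least one token probability")
    # The average negative log2-probability per token is the cross-entropy in bits.
    cross_entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    # Perplexity is 2 raised to the cross-entropy.
    return 2 ** cross_entropy

# A model that is uniformly uncertain among four words has perplexity 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

A lower perplexity on held-out text indicates that the model assigns higher probability to that text, which is the basis on which the smoothed models are compared.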

6.5.1 Out of vocabulary (OOV) words

One of the challenges in language modeling involves out-of-vocabulary (OOV) words. These are words that appear in the test data but not in the training set; because the training set carries no information about them, their probabilities cannot be estimated directly. Usually, low-frequency words in the training set (words with a frequency of 1 or 2) are also treated as OOV words, because they do not provide enough information to make reasonable estimates.

Figure 19: Snippet of Model Process Portraying the development of LM1 and LM2

This thesis makes the open vocabulary assumption, which means that the model accounts for unseen (OOV) words. The reasoning is that new words (formal or colloquial) and expressions are constantly being added to the English vocabulary. Traditional approaches to dealing with OOV words typically involve estimating pseudo-word probabilities via the following steps:

1. Pre-selecting a fixed vocabulary of words.

2. Identifying and converting words in the training set that are not in the fixed vocabulary to the pseudo-word ‘UNK’⁴³ (Bell et al., 1990; Jurafsky and Martin, 2009).

3. Finally, estimating the probability of ‘UNK’ like any other word.
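The three steps above can be sketched as follows. This is a toy illustration under stated assumptions: the fixed vocabulary, the example sentences and the helper names `map_to_fixed_vocab` and `unigram_probs` are hypothetical, and plain maximum-likelihood unigram estimates stand in for whatever estimator a real LM would use.

```python
from collections import Counter

UNK = "UNK"

def map_to_fixed_vocab(sentences, vocab):
    """Step 2: replace every token outside the fixed vocabulary with UNK."""
    return [[tok if tok in vocab else UNK for tok in sent] for sent in sentences]

def unigram_probs(sentences):
    """Step 3: maximum-likelihood unigram estimates, with UNK counted like any word."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

train = [["the", "party", "won"], ["the", "vote", "split"]]
vocab = {"the", "party", "vote"}  # step 1: hypothetical pre-selected vocabulary
mapped = map_to_fixed_vocab(train, vocab)
probs = unigram_probs(mapped)
```

In this toy corpus two of the six tokens fall outside the vocabulary, so ‘UNK’ receives one third of the unigram probability mass.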

The problem with this approach is expressed in two questions: what are the criteria for compiling the fixed vocabulary of words, and how is the size of this fixed vocabulary determined, given that the vocabulary of the corpus may be larger? A second approach entails replacing just the first occurrence of every word type in the training data with ‘UNK’ (Bell et al., 1990; Jurafsky and Martin, 2009). A third approach involves estimating the probabilities for the most common 𝑘 words, while all others are mapped to the token ‘UNK’ (Manning and Schutze, 1999).

This work proposes a methodology that uses clues and patterns in the corpus to identify OOV words. The aim is to gather enough probability information both for unseen words that are likely to occur with high frequency in the test set and for low-frequency unseen words. This methodology is based partly on research by Muller and Schutze (2011), who showed that OOV words are typically short words, names, acronyms and words containing special characters. Therefore, the first instance of high-frequency words of these types in the corpus can be converted to ‘UNK’. Because these words occur so often, the assumption is that little probability mass is lost. To this end, the following substitutions are made in the training corpus:

• Returning to the original training text (i.e. the unmodified, unprepared text), mentions of alphanumeric text, locations, names and organizations are counted. For instances that occur more than 5 times⁴⁴ in the text, the very first instance is replaced with ‘UNK’ in the mirror sentence. This accounts for low-frequency content words that are typically discarded in training but are likely to occur in the test set. For instance, in this implementation, an important word, ‘C4-logistics’, occurred just twice in the Conservative-EU test set.

• The first occurrence of other high-frequency content words (words with a count greater than 15⁴⁵) is replaced with ‘UNK’. Since these are high-frequency words, modifying just one occurrence of each is not expected to have a considerable effect on the probability estimation. In making this modification, some probability weight is borrowed from the high-frequency word and assigned to the unknown-word representation ‘UNK’. Examples of such words and their frequencies in the Conservative-EU training set include ‘administration’ (152 times), ‘government’ (1974 times) and ‘party’ (468 times).
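The two substitution rules above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the thesis code: it works on a single tokenized corpus rather than on an original text and its mirror sentences, the predicate `is_entity_like` is a hypothetical stand-in for the detection of alphanumeric text, names, locations and organizations, and every other token is treated as a content word.

```python
from collections import Counter

UNK = "UNK"

def substitute_first_occurrences(sentences, is_entity_like,
                                 entity_min=5, content_min=15):
    """Replace the first occurrence of each sufficiently frequent word type with UNK."""
    counts = Counter(tok for sent in sentences for tok in sent)
    replaced = set()  # word types whose first occurrence has already been replaced
    out = []
    for sent in sentences:
        new_sent = []
        for tok in sent:
            # Entity-like types use the lower threshold (5), other types the higher (15).
            threshold = entity_min if is_entity_like(tok) else content_min
            if counts[tok] > threshold and tok not in replaced:
                replaced.add(tok)
                new_sent.append(UNK)
            else:
                new_sent.append(tok)
        out.append(new_sent)
    return out

# Toy corpus: 'government' occurs 16 times, 'C4-logistics' 6 times, 'rare' once.
corpus = [["government"]] * 16 + [["C4-logistics"]] * 6 + [["rare"]]
has_digit = lambda tok: any(ch.isdigit() for ch in tok)  # crude entity-like test
result = substitute_first_occurrences(corpus, has_digit)
```

Only one occurrence of each qualifying type is converted, so the counts of ‘government’ and ‘C4-logistics’ each drop by one while ‘UNK’ gains two occurrences; ‘rare’ is left untouched.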

In the next section, cutoffs are considered.

6.5.2 Cutoffs

Cutoffs are a way of restricting the size of the LM by cutting off, or ignoring, infrequent ngrams. “The count below which the ngrams are discarded is called cutoffs” (Clarkson and Rosenfeld, 1997, p. 1). While cutoffs generally reduce the size of the LM, they have also been shown to slightly reduce the performance of the model (Chen and Goodman, 1998; Clarkson and Rosenfeld, 1997). Thus, the decision about applying cutoffs involves weighing the cost of a very large model against the slight loss in performance incurred from cutoffs.
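A minimal sketch of count cutoffs is given below. It assumes the convention that a cutoff of k discards ngrams seen k or fewer times, so that a cutoff triple such as 0-0-1 keeps all unigrams and bigrams and drops trigrams seen only once. The helper names and the toy corpus are illustrative.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count all ngrams of order n in a list of tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

def apply_cutoff(counts, cutoff):
    """Keep only ngrams whose count exceeds the cutoff (a cutoff of 0 keeps everything)."""
    return Counter({ng: c for ng, c in counts.items() if c > cutoff})

corpus = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
cutoffs = (0, 0, 1)  # unigram, bigram and trigram cutoffs
pruned = {n: apply_cutoff(ngram_counts(corpus, n), cutoffs[n - 1])
          for n in (1, 2, 3)}
```

Here the singleton trigram ('a', 'b', 'd') is discarded while every unigram and bigram is retained, which shows how cutoffs shrink the model at the cost of some information.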

In deciding, the experiments of Chen and Goodman (1998), which compared the effect of cutoffs on several smoothed trigram models on a Wall Street Journal (WSJ) corpus, were considered. Their findings were chosen because they provide results over a reasonable range of training set sizes (from 100 sentences to over a million). They show that for KN models, cutoffs lead to a loss in performance, as seen in Figure 20⁴⁶, where cross-entropy⁴⁷ is greater for 0-1-1 and 0-0-2 cutoffs and increases with corpus size up to 100000 sentences. To this end, 0-0-1 cutoffs were applied in this implementation.

⁴⁴ This number is actually a function of the training set. In the case study, 5 is used. Other data types might require less.

⁴⁵ This number is actually a function of the number of named entities in the training set. In the case

Figure 20: Comparing performance of trigram models with KN smoothing with cutoffs and KN smoothing without cutoffs (Source: Chen and Goodman, 1998).

Following smoothing, cutoffs and the determination of OOV words, an interpolated KN language model (𝐿𝑀1) was implemented for each value holder’s training corpus using the SRILM toolkit. This concludes the implementation of 𝐿𝑀1.
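For reference, the general form of an interpolated KN trigram estimate can be written as below. This is the standard textbook formulation with a single discount D, given here as an assumption; the exact variant and discount values used for 𝐿𝑀1 are not specified in this section. The symbol c(·) denotes a training count and N_{1+}(w_{i-2} w_{i-1} •) the number of distinct word types that follow the context.

```latex
P_{KN}(w_i \mid w_{i-2} w_{i-1})
  = \frac{\max\bigl(c(w_{i-2} w_{i-1} w_i) - D,\, 0\bigr)}{\sum_{w} c(w_{i-2} w_{i-1} w)}
  + \lambda(w_{i-2} w_{i-1}) \, P_{KN}(w_i \mid w_{i-1}),
\qquad
\lambda(w_{i-2} w_{i-1})
  = \frac{D}{\sum_{w} c(w_{i-2} w_{i-1} w)} \, N_{1+}(w_{i-2} w_{i-1} \bullet)
```

The discount D and the lower-order bigram and unigram distributions, which in KN smoothing are built from continuation counts rather than raw counts, are the quantities that tuning the baseline would adjust.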

In document Formalization and modeling of human values for recipient sentiment prediction (Page 100-103)