Equation 9 – Validate split words from token

[Figure 40 content. Urdu dictionary (excerpt): …, نا, دفتر, نادرا, …
Split on first NJ: one match from the Urdu dictionary with rc = 7, therefore these tokens are rejected.
Split on second NJ: no match from the Urdu dictionary.
Split on third NJ: no match from the Urdu dictionary.
Split on fourth NJ: two matches from the Urdu dictionary with rc = 0, therefore these two words are accepted.]

Figure 40 – Unrecognised token processing

When the split words of the token form valid words, and the number of words found is above 0 and less than or equal to the npw (Equation 9), the words found are added to the utterance and sent forward for the engine to process.

The word segmentation algorithm exploits the non-joiner (NJ) characters of the Urdu language, which can be used to identify possible word segmentation boundaries. Nevertheless, non-joiner characters are not a reliable indicator of word boundaries, as they can also appear in the middle of a word. Word segmentation of utterances therefore poses some challenges, such as over segmentation of words (Rashid and Latif, 2012). This has been reduced in this algorithm through the use of the two Urdu dictionaries that are also used in the predictive text component (section 7.3).

The first dictionary is a domain specific dictionary comprising 786 frequently used domain specific words that were derived from the log file of the first evaluation (see section 7.8 for further details on the domain specific Urdu dictionary). The word frequencies in this dictionary are calculated by the word frequency component (see section 7.5 for further details). The second dictionary is a general Urdu dictionary comprising 2430 of the most frequently used ligatures, extracted from a 19.3 million word Urdu corpus gathered from a wide range of domains and compiled by the Centre for Language Engineering, Pakistan (Engineering, 2014).

Because the domain specific dictionary contains only the most frequently used words related to the domain of UMAIR, it is smaller and more focused. Segmented words are compared against this dictionary first: utterances are more likely to contain domain specific words, so checking it first reduces processing time, and giving its words precedence over the general Urdu dictionary helps to avoid over segmentation.
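The lookup order described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not UMAIR's implementation: the function name `lookup_word`, the choice of a dict (word to frequency) for the domain specific dictionary and a set for the general dictionary, and the sample entries and frequency values are all hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical sample data. The dictionaries described in the text hold
# 786 and 2430 entries; the frequency values below are invented.
DOMAIN_DICTIONARY: Dict[str, int] = {"شناختی": 12, "پاسپورٹ": 40}
GENERAL_DICTIONARY: Set[str] = {"شنا", "دفتر"}


def lookup_word(
    word: str,
    domain: Dict[str, int] = DOMAIN_DICTIONARY,
    general: Set[str] = GENERAL_DICTIONARY,
) -> Optional[str]:
    """Return which dictionary recognises `word`, checking the domain one first."""
    if word in domain:       # smaller, domain specific dictionary takes precedence
        return "domain"
    if word in general:      # fall back to the general Urdu dictionary
        return "general"
    return None              # not a valid word in either dictionary
```

Checking the smaller dictionary first means that most domain utterances are resolved without consulting the general dictionary at all.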

An example of over segmentation is the word شناختی (identification), which contains a non-joiner character (ا) within its ligature. The word can be split at this character to form شنا (define), which is also a valid word. The frequently used word dictionary mitigates this issue, because words found in it take precedence over words found in the general Urdu dictionary, and شناختی is found in the frequently used Urdu dictionary as it is frequently used in the domain of UMAIR. Moreover, if the word were not in the frequently used dictionary and شنا were found in the general dictionary, the remainder of the ligature would be ختی, which has no meaning in Urdu. The algorithm is therefore programmed to reject both parts of the word (as the rc = 3) and to continue processing the unrecognised token until all segments form valid words that leave no remaining characters. Another step taken to avoid over segmentation is that longer words found through segmentation take precedence over shorter words. Thus, where a word can be segmented and both parts form valid words, the algorithm is programmed to use the whole word, not the two segmented words.

Once all the characters in the unrecognised token form valid words, these words are sent forward as valid tokens of the original utterance to be processed by the PM engine. The pre-processing steps of segmentation and validation ensure that non-segmented tokens are captured and processed, so that only valid tokens are sent forward. This also maximises the probability of finding a strong match to the utterance from the scripted patterns.
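The validation rule described above (accept a split only when every segment is a valid word and no characters remain, and prefer longer words) can be illustrated with a short Python sketch. It is a simplified reading of the text, not the thesis algorithm itself: the set `NON_JOINERS` is an assumed list of Urdu non-joiner characters, the function name `segment` and the backtracking strategy are hypothetical, and the npw bound of Equation 9 is not modelled.

```python
from typing import List, Optional, Set

# Assumed set of non-joiner characters (letters that do not connect to the
# following letter). The thesis does not list them in this section.
NON_JOINERS = {"ا", "آ", "د", "ڈ", "ذ", "ر", "ڑ", "ز", "ژ", "و", "ے"}


def segment(token: str, dictionary: Set[str]) -> Optional[List[str]]:
    """Split `token` into dictionary words with no remaining characters (rc = 0).

    Returns None when no such split exists, in which case the partial
    matches are rejected.
    """
    if not token:
        return []
    if token in dictionary:
        return [token]                      # the whole word takes precedence
    # Candidate boundaries lie just after each non-joiner character.
    cuts = [i + 1 for i, ch in enumerate(token[:-1]) if ch in NON_JOINERS]
    for cut in reversed(cuts):              # try the longest prefix first
        prefix = token[:cut]
        if prefix not in dictionary:
            continue
        rest = segment(token[cut:], dictionary)
        if rest is not None:                # remainder fully covered, so rc = 0
            return [prefix] + rest
    return None                             # some characters always remain
```

With a dictionary containing نا, دفتر and نادرا, the call `segment("نادرادفتر", ...)` returns the two words نادرا and دفتر. With a dictionary containing only شنا, the call `segment("شناختی", ...)` returns `None`, because the remainder ختی is not a valid word.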

Table 13 illustrates some examples of how the algorithm pre-processes utterances to ensure consistent word segmentation, so that all tokenised words form valid words. The example utterances are taken from the log file of the first UMAIR evaluation and are among those that the first prototype failed to recognise. The engine of the first UMAIR prototype failed to recognise these utterances because they contained instances where the user opted not to leave a space after the non-joiner characters (highlighted in red). A native Urdu reader has no problem distinguishing the word boundaries; however, for a PM engine that relies on consistent white space to tokenise words, this caused major problems.

UMAIR 1: without word segmentation component
UMAIR 2: with word segmentation component

Utterance: Where is the local nadra office?
UMAIR 1: قريبی نادرادفتر کہاں ہے؟
UMAIR 2: قريبی نادرا_دفتر کہاں ہے؟

Utterance: Which form do I have to fill in for a new passport?
UMAIR 1: کونسافارم بھرناہوگانياپاسپورٹ کےلئے؟
UMAIR 2: کونسا_فارم بھرنا_ہوگا_نيا_پاسپورٹ کے_لئے؟

Utterance: How much is a new passport?
UMAIR 1: ايک نئےپاسپورٹ کتناکا ہے؟
UMAIR 2: ايک نئے_پاسپورٹ کتنا_کا ہے؟

Utterance: I am not from Pakistan
UMAIR 1: ميں پاکستان سےنہيں ہوں
UMAIR 2: ميں پاکستان سے_نہيں ہوں

Table 13 – Utterances before and after being processed by word segmentation algorithm

Table 13 shows how the word segmentation algorithm processed the inconsistently segmented user utterances to ensure that all the tokenised words formed valid words. The green spaces in the UMAIR 2 column (rendered here as underscores) highlight where the algorithm segmented the tokens to form valid words.

7.5 Word frequency component

Word frequencies are used in many practical applications of statistical natural language processing, such as keyword-based document retrieval (Altmann et al., 2009). The word frequency component was added to UMAIR’s architecture so that UMAIR can learn and adjust the word frequency values in the domain specific dictionary according to the data stored in the log file. The word frequencies work with the word segmentation and predictive text components, both of which use dictionaries to mitigate Urdu language specific issues, so that these components can operate more intelligently and effectively. The predictive text component (see section 7.3) uses them to offer more appropriate suggestions based on the frequencies of words used in previous utterances. The word segmentation algorithm (see section 7.4) uses them to resolve over segmentation and cases where a token can be segmented in more than one way; in these cases the words with the higher frequencies take precedence over less frequently used words.
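As an illustration of how frequencies might resolve a token that can be segmented in more than one way, the following Python sketch ranks candidate segmentations. The scoring rule is an assumption made for this example (fewer, longer words first, in line with section 7.4, then the higher total frequency); the section does not specify the exact rule, and the name `choose_segmentation` and the placeholder words are hypothetical.

```python
from typing import Dict, List, Tuple


def choose_segmentation(
    candidates: List[List[str]], frequency: Dict[str, int]
) -> List[str]:
    """Pick one segmentation from several valid candidates.

    Assumed rule: prefer fewer (longer) words, then the candidate whose
    words have the higher total frequency. Unknown words count as 0.
    """
    def score(words: List[str]) -> Tuple[int, int]:
        total = sum(frequency.get(word, 0) for word in words)
        return (-len(words), total)

    return max(candidates, key=score)


# Placeholder words and invented frequencies, for illustration only.
example_frequency = {"ab": 1, "a": 50, "b": 50, "c": 5, "d": 2}
best = choose_segmentation([["a", "b"], ["c", "d"]], example_frequency)
```

Here `best` is `["a", "b"]`, since both candidates contain two words and the first has the higher total frequency.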

The original word frequency values were calculated from the knowledge captured in the log file and stored in the database. The log file stores anonymous data on all the user utterances that are processed by UMAIR’s engine. The values are also updated at the end of each conversation: the log file records of the conversation are automatically scanned, and all valid words used by the user during the conversation are captured and used to update the frequency values stored in the database. This data is then used by the word frequency component to calculate and adjust the word frequency dictionary, in order to offer intelligent suggestions through the predictive text feature and to improve the word segmentation algorithm.

The word frequency component uses the Bag of Words (BOW) technique (Boulis and Ostendorf, 2005) to calculate the word frequency (see Equation 10 – Bag of Words Frequency Equation). Bag-of-words retrieval models represent queries and documents as unordered sets of terms; this strategy is based on an independence assumption. Bag-of-words models have been shown to be simple and effective (Choi et al., 2014). In the bag-of-words representation, a document is represented by a vector of the counts of the words that appear in it. Depending on the classification method, the bag-of-words vector can be normalized and scaled (Boulis and Ostendorf, 2005).
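A minimal Python sketch of bag-of-words counting, as it might apply to logged utterances, is given below. It is not the thesis implementation: the function names, the whitespace tokenisation, the in-memory `Counter` standing in for the database, and the sample words are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def bag_of_words(utterances: Iterable[str], vocabulary: Set[str]) -> Counter:
    """Count how often each valid word occurs, ignoring word order."""
    counts: Counter = Counter()
    for utterance in utterances:
        counts.update(word for word in utterance.split() if word in vocabulary)
    return counts


def update_frequencies(stored: Counter, conversation: Counter) -> Counter:
    """Add the counts from one conversation to the stored frequencies."""
    stored.update(conversation)     # Counter.update adds counts, it does not replace
    return stored


def normalise(counts: Counter) -> Dict[str, float]:
    """Scale counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}
```

For example, counting the words پاسپورٹ and دفتر over the utterances of one conversation and passing the result to `update_frequencies` increases their stored values by the number of times each was used.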

The ranking functions associated with bag-of-words retrieval models often consist of term frequency (Metzler, 2008). In addition to using words as indexing terms, it is assumed that the ordering of the words does not matter, as this implementation is only concerned with calculating word frequencies, not with word or sentiment classification. Utterances therefore no longer have to be represented as sequences; instead they can be represented as a bag of words. This representation is equivalent to an attribute-value representation as used in machine learning. Each distinct word is a feature, and the number of times the word occurs in the log file/temporal memory is its value. This is represented by the following

In document Methodology and algorithms for Urdu language processing in a conversational agent (Page 133-138)