Equation 7 Updated edit distance algorithm

The edit distance is used as the edge weight of each token in the edge weight matrix (see section 4.5.3 for an explanation of edge weights), which is then used to find the final match strength between the user utterance and the database patterns. By adapting the edit distance algorithm to tolerate common spelling variations, UMAIR's engine can calculate and assign more accurate edge weights to the tokenised words, reducing the negative impact that these groups of commonly mistaken characters have on the final similarity calculation. The edit distance component of the WOW algorithm is now specifically tailored to address one of the language challenges unique to Urdu, making the similarity calculation more robust and accurate.
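The idea of tolerating commonly confused characters can be sketched as a Levenshtein distance whose substitution cost drops to zero inside a group of interchangeable letters. This is an illustration only and not the formulation of Equation 7: the letter groups in CONFUSABLE_GROUPS, the choice of a zero cost, and all function names are assumptions made for the example.

```python
# Hypothetical groups of Urdu letters that are often interchanged in spelling.
# The groups actually used by UMAIR are not listed in this section.
CONFUSABLE_GROUPS = [set("سصث"), set("زذضظ"), set("تط"), set("ہح")]


def substitution_cost(a, b, groups=CONFUSABLE_GROUPS):
    """Return 0 for identical or same-group characters, otherwise 1."""
    if a == b:
        return 0
    if any(a in group and b in group for group in groups):
        return 0  # assumed: an in-group swap counts as a free spelling variation
    return 1


def edit_distance(source, target, groups=CONFUSABLE_GROUPS):
    """Levenshtein distance with a relaxed substitution cost."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,       # deletion
                current[j - 1] + 1,    # insertion
                previous[j - 1] + substitution_cost(s_char, t_char, groups),
            ))
        previous = current
    return previous[-1]
```

Under these assumptions, two spellings that differ only by letters from the same group receive a distance of 0, so the corresponding edge weight is not penalised for that variation, while any other substitution still costs 1.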

To further reduce the impact of spelling mistakes on the PM/similarity engine, additional techniques that UMAIR could adopt to help users while they type utterances were explored. The resulting feature is outlined in the following section.

7.3 Predictive text

The predictive text feature was added to UMAIR based on research into Human Computer Interaction (HCI) and methods of reducing spelling errors from the user perspective. The literature shows that in a text dialogue system a predictive text feature can aid users with spelling and reduce spelling errors while they type utterances (Akram et al., 2014; Mora-Cortes et al., 2014; Kaufmann et al., 2012). Therefore, it was decided to implement a predictive text feature in the architecture/UI of UMAIR to address the negative impact spelling errors have on the similarity calculation. The predictive text feature uses an Urdu dictionary of 786 words created from the log file of the first UMAIR evaluation. The user utterances from the first evaluation were collated and checked for spelling errors. Once all the words were validated, they were stored in the knowledge database as the dictionary for the predictive text component.

The predictive text feature is initiated when the user types the first letters of the intended word. All words from the attached dictionary that share the same first letters are activated, and the most frequently used word among them (see section 7.5, word frequency component) is presented to the user. The predictive text feature uses the word frequency component to make intelligent suggestions based on previous knowledge of user utterances. The suggested word is presented to the user in a lighter font colour within the input textbox. The user can then either continue typing the intended word, which further narrows the list of activated words, or accept the predicted word as soon as it appears in the textbox by pressing the left arrow key on the keyboard. An example is illustrated in Figure 36.

In the example in Figure 36 the user is typing the Urdu word for ID card, which starts with the Urdu letter ش, and the predictive text system offers the suggested word شناختی based on the past frequency of this word's usage with the system. One of the main causes of unrecognised utterances in the first evaluation was spelling errors made by the user, which resulted in the engine failing to recognise the affected word when processing the user utterance. The predictive text feature is implemented to reduce the number of spelling errors that occur during the interaction by aiding users while they type utterances into the system. To date the work on Urdu predictive text is very limited, and to the researcher's knowledge this is the first Urdu predictive text system implemented on a non-mobile device.
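The lookup described in this section (activate every dictionary word that shares the typed prefix, then offer the most frequent one) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name suggest, the WORD_FREQUENCIES table, and the counts in it are hypothetical and are not taken from UMAIR's dictionary or log file.

```python
# Hypothetical word-frequency table; the real dictionary holds 786 words
# and its counts come from the word frequency component (section 7.5).
WORD_FREQUENCIES = {
    "شناختی": 42,
    "شکریہ": 17,
    "شہر": 5,
    "کارڈ": 30,
}


def suggest(prefix, frequencies):
    """Return the most frequent word starting with prefix, or None."""
    if not prefix:
        return None  # nothing typed yet, so nothing to predict
    activated = [word for word in frequencies if word.startswith(prefix)]
    if not activated:
        return None  # no dictionary word shares the typed letters
    # Ties are resolved in favour of the word listed first (an assumption).
    return max(activated, key=frequencies.get)


# Typing the first letter activates every word that begins with it.
first_suggestion = suggest("ش", WORD_FREQUENCIES)
```

Each further keystroke lengthens the prefix and therefore shrinks the activated list, which matches the narrowing behaviour described above. A linear scan is adequate for a dictionary of a few hundred words; a trie would be the usual choice for a much larger vocabulary.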

7.4 Word segmentation algorithm/component

Inconsistent word segmentation is an issue specific to the Urdu language. The magnitude of its impact on CAs was only brought to light through the end user evaluation of the first UMAIR prototype. As discussed in chapter 3, section 3.8, due to the morphological features of the Urdu language, the use of a space to separate words is in certain cases entirely optional when writing.

This feature of the Urdu language had severe detrimental effects on the PM/similarity engine of UMAIR, as the process of PM requires the tokenisation of the utterance into its individual words, which are then processed by the engine. The evaluation results showed that where users had the option not to leave a space (i.e. when the word ends in a non-joiner character), most users took advantage of this language feature and did not insert a space between words.

An example of this is illustrated in Figure 37, where a problematic utterance taken from the log file of the first evaluation (it translates to “I need a new ID Card”) is shown in both its forms, i.e. with and without consistent spacing. The positions where white space separates words are marked explicitly in the figure.

Inconsistent use of white space
Utterance: مجھےنیاشناختی_کارڈ_چاہئے
Tokenisation result: مجھےنیاشناختی | کارڈ | چاہئے

Consistent use of white space
Utterance: مجھے_نیا_شناختی_کارڈ_چاہئے
Tokenisation result: مجھے | نیا | شناختی | کارڈ | چاہئے

Figure 37 - Inconsistent and consistent word spacing

In cases such as the example illustrated in Figure 37, the engine tried to perform pattern matching on a single token containing all three words. The engine failed to recognise that token, which negatively affected the whole similarity calculation and reduced the knowledge in the utterance available to the engine for pattern matching. It was evident that this word segmentation issue had to be tackled in order to increase the effectiveness and robustness of UMAIR's engine, which relies on the user utterance being correctly segmented in order to perform PM and the similarity calculation effectively.

Research identified two possible options for mitigating this issue. The first option was to amend the scripts so that the scripted patterns included the inconsistently segmented versions of the patterns. The second option was to research and develop a new component that could insert spaces into un-segmented or inconsistently spaced user utterances, segmenting them into valid words in real time before the utterance tokens were sent forward for processing by UMAIR's engine.

The first option, although feasible, was not the best option, as it would further increase the workload of the scripter and add complexity to the scripting process, because all possible variations of the utterance with and without consistent segmentation would have to be scripted. In light of this, a new Urdu word segmentation algorithm was researched, developed and implemented in UMAIR’s architecture. It pre-processes the user utterances to ensure that the individual words of each utterance are consistently segmented, thus allowing UMAIR’s engine to process the text without the hindrance of inconsistent word segmentation.

The general process that the word segmentation algorithm follows in order to segment an utterance containing an unrecognised token is illustrated in Figure 38.

Figure 38 - Word segmentation process flow

[Figure 38 flowchart; the connections between nodes are not recoverable from the source. Process nodes: Tokenised utterance words; Validate words with dictionary; Add all valid tokens to utterance list, and forward unrecognised tokens for word segmentation; Segment NJ characters in the token; Validate words with dictionary; Split on next NJ; Process tokenised utterance. Decision nodes (Yes/No): All tokens form valid words; Token has NJ; Token has more than 1 NJ; Segments form valid words.]

The word segmentation algorithm can be defined as follows: let nj be the number of non-joiners, i.e. the total count of non-joiner (NJ) characters in the token. The value of nj is used to measure the number of potential words (npw) in the token, which is calculated through Equation 8. The value of npw is the number of potential words that could be in the unrecognised token.

𝑛𝑝𝑤 = ∑(𝑛𝑗 + 1)
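The segmentation idea (count the non-joiners, then split after non-joiner characters and keep only splits whose segments are valid dictionary words) can be sketched as follows. This is a minimal sketch and not the thesis implementation: the NON_JOINERS set, the sample WORDS, the function names, and the recursive search order are assumptions made for the example.

```python
# Assumed set of Urdu non-joiner letters (letters that do not connect to
# the following letter); the exact set used by UMAIR is not given here.
NON_JOINERS = set("اآدڈذرڑزژوے")

# Hypothetical dictionary, taken from the Figure 37 example utterance.
WORDS = ["مجھے", "نیا", "شناختی", "کارڈ", "چاہئے"]
DICTIONARY = set(WORDS)


def potential_words(token, non_joiners=NON_JOINERS):
    """npw = nj + 1, where nj counts the non-joiner characters in token."""
    nj = sum(1 for char in token if char in non_joiners)
    return nj + 1


def segment_token(token, dictionary, non_joiners=NON_JOINERS):
    """Split token into dictionary words, cutting only after non-joiners.

    Returns the list of words, or None when no valid segmentation exists.
    """
    if token in dictionary:
        return [token]
    for index, char in enumerate(token[:-1]):
        if char not in non_joiners:
            continue
        head = token[: index + 1]
        if head in dictionary:
            tail = segment_token(token[index + 1:], dictionary, non_joiners)
            if tail is not None:
                return [head] + tail
    return None


def segment_utterance(utterance, dictionary, non_joiners=NON_JOINERS):
    """Tokenise on white space, then re-segment unrecognised tokens."""
    result = []
    for token in utterance.split():
        parts = segment_token(token, dictionary, non_joiners)
        # Tokens that cannot be segmented are passed through unchanged.
        result.extend(parts if parts is not None else [token])
    return result
```

For the fused token from Figure 37, which contains three non-joiner characters, potential_words gives an upper bound of four words, and segment_token recovers the three dictionary words it actually contains. The value of npw is only a bound: a non-joiner can also occur in the middle of a valid word, which is why each candidate split is checked against the dictionary.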
