Model A: Greedy Selection - Count-based Segmentation for MCWs

4.3 Count-based Segmentation for MCWs

4.3.1 Model A: Greedy Selection

In the first proposed model we use subunits (blocks) that are larger than characters. The segmentation model takes a string (word) as its input and finds the longest frequent substring within it. A frequent substring is a substring which occurs at least θ times in the training corpus, where θ is a parameter of the model. We refer to the longest frequent substring as main. If there are substrings before or after main, the segmentation function is applied to them too, so the procedure recursively finds the longest frequent substring at each step. At the end, any remainders that are not frequent are decomposed into characters. The pseudo-code of the model is given in Algorithm 2.

Algorithm 2 CountSegmentation (InputS, Start, End, θ)

procedure CountSegmentation
    ▷ Start: index of the first character of the span of InputS to segment.
    ▷ End: index of the last character of the span of InputS to segment.
    (main, index) = find(InputS, Start, End, θ)
    ▷ main: the longest frequent substring within InputS[Start..End].
    ▷ index: index of main’s first character.
    if main != null then
        Store main
        L = length(main)
        CountSegmentation(InputS, Start, index-1, θ)
        CountSegmentation(InputS, index+L, End, θ)
    else
        decompose all remainders into characters

The process can be viewed as an optimization problem: finding segmentation boundaries which try to jointly maximize the length and frequency criteria. For a given input string of length L, there can be L(L+1)/2 character n-grams, e.g. for the sequence S=abcd with L = 4, the list of character n-grams is {a, b, c, d, ab, bc, cd, abc, bcd, abcd}. Some of these character n-grams are frequent according to the statistics of the training corpus. At each step of the segmentation process, we select the frequent n-gram that is longest among the frequent alternatives.
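The candidate set described above can be enumerated directly. The following is a minimal sketch, not part of the original model description; the function name char_ngrams is hypothetical and introduced only for illustration.

```python
def char_ngrams(s):
    """Return every contiguous substring (character n-gram) of s, shortest first."""
    return [s[i:i + n]
            for n in range(1, len(s) + 1)       # n-gram length
            for i in range(len(s) - n + 1)]     # start position


grams = char_ngrams("abcd")
print(grams)  # ['a', 'b', 'c', 'd', 'ab', 'bc', 'cd', 'abc', 'bcd', 'abcd']
print(len(grams) == 4 * (4 + 1) // 2)  # True
```

The list counts n-gram occurrences by position, so a string with repeated characters still yields L(L+1)/2 entries, some of which may be identical.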

In the character-level decomposition, the idea is to segment strings into their basic subunits, which are characters (the alphabet set of the corpus). Different neural architectures are then used in an attempt to reveal the relations between related characters (Kim et al., 2016). This is a bottom-up approach, going from a character-level representation to a morpheme- and word-level representation. The intuition behind our model is partly the same, with some distinctive differences. We believe that the set of basic elements of a corpus is not limited to its characters. If a sequence of consecutive characters occurs frequently, then it can be considered a basic element. Accordingly, we introduce new atomic blocks (instead of just characters) which include one or more characters, and all words can be transformed/decomposed through these blocks.

To clarify the process we use a dummy example. For a given sequence s=‘aabcxy’ with two frequent substrings ‘ab’ and ‘xy’, the word-level segmentation keeps s as it is and the character-level model blindly maps s to ‘a.a.b.c.x.y’ regardless of any other criterion. In contrast, according to the count-based segmentation model, if there are frequent substrings in s, they could be substantial units for other strings and should be treated as atomic units, similar to characters, in which case the final segmentation should be ‘a.ab.c.xy’. Within ‘ab’, the substrings ‘a’, ‘b’ and ‘ab’ are all frequent, but since ‘ab’ is the longest, it is the one selected from that part of s.
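The recursion of Algorithm 2 applied to the dummy example above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the names find_main and count_segmentation and the counts in toy_counts are hypothetical, string slices replace the Start/End indices, and ties between equally long frequent substrings are broken by higher count and then by leftmost position.

```python
def find_main(s, counts, theta):
    """Return (main, index): the longest substring of s whose count is >= theta.

    Assumed tie-breaking: higher count first, then leftmost position.
    Returns (None, -1) when no substring of s is frequent.
    """
    for n in range(len(s), 0, -1):              # try the longest n-grams first
        best, best_idx = None, -1
        for i in range(len(s) - n + 1):
            gram = s[i:i + n]
            c = counts.get(gram, 0)
            if c >= theta and (best is None or c > counts[best]):
                best, best_idx = gram, i
        if best is not None:
            return best, best_idx
    return None, -1


def count_segmentation(s, counts, theta):
    """Recursively split s into frequent blocks; other remainders become characters."""
    if not s:
        return []
    main, idx = find_main(s, counts, theta)
    if main is None:
        return list(s)                          # decompose the remainder into characters
    left = count_segmentation(s[:idx], counts, theta)
    right = count_segmentation(s[idx + len(main):], counts, theta)
    return left + [main] + right


# Hypothetical counts chosen so that 'a', 'b', 'ab' and 'xy' are frequent for theta = 5.
toy_counts = {"a": 40, "b": 30, "ab": 12, "xy": 9}
print(".".join(count_segmentation("aabcxy", toy_counts, theta=5)))  # a.ab.c.xy
```

Scanning lengths from longest to shortest makes the greedy choice explicit: a shorter block is only considered once no longer frequent block exists in the current span.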

We also give a real-world example of this procedure. Figure 4.2 shows the count-based segmentation process for a complex Farsi word taken from our training corpus. The input word is ‘prdrãmdtrynhã’, meaning ‘the people with the highest salary’. Based on statistics of the Farsi training corpus (see Section 4.5), the longest frequent substring is ‘ãmd’, which is a 3-gram constituent. This means that there is no frequent n-gram with n > 3 and that ‘ãmd’ has the highest frequency among the frequent 3-grams. ‘ãmd’ is separated and the segmentation model is applied to its preceding substring (L-string) and following substring (R-string). Each substring is treated as a new input for the model, with its own main, L-string and R-string. The model is applied recursively until all frequent substrings (‘ãmd’, ‘dr’, ‘tr’ and ‘hã’) are separated. Two substrings then remain, namely ‘pr’ and ‘yn’. These two substrings are not frequent in our setting, so they are decomposed into characters. The final decomposition produced by the proposed model is: ‘prdrãmdtrynhã’ → ‘p.r.dr.ãmd.tr.y.n.hã’.

Figure 4.2: The process of segmenting ‘prdrãmdtrynhã’. main for each node is its offspring in the middle. The nodes before and after main are L-string and R-string, respectively. Dotted and solid lines indicate the character-level and morpheme-level decompositions. Final states are illustrated with double lines.

The segmentation model benefits from the advantages of the word-level, morpheme-level and character-level models. If the surface form of the input string is frequent, the model does not decompose it and uses the original form. If it is a rare sequence, it is decomposed into characters, and if the string is neither frequent nor rare it is segmented into sets of characters. Each set may contain one or more characters, which means that the model uses a hybrid segmentation (as in the aforementioned Farsi example).

In Model A and all other models the frequency of a subunit is computed using the training corpus (not the lexicon/vocabulary), in which we consider all occurrences of the subunit. We do not cross word boundaries for collecting frequency information, i.e. words are separated from each other (space-to-space), segmented into all possible character n-grams and the number of occurrences is counted for each n-gram.
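The counting scheme described above can be sketched as follows. This is a minimal illustration under assumptions: the function name ngram_counts and the toy corpus are hypothetical, and tokens are assumed to be delimited by whitespace.

```python
from collections import Counter


def ngram_counts(corpus_lines):
    """Count every character n-gram inside each whitespace-delimited token."""
    counts = Counter()
    for line in corpus_lines:
        for word in line.split():               # space-to-space: never cross word boundaries
            for n in range(1, len(word) + 1):
                for i in range(len(word) - n + 1):
                    counts[word[i:i + n]] += 1  # every occurrence in the corpus is counted
    return counts


corpus = ["ab abc", "ab"]                       # hypothetical toy corpus
counts = ngram_counts(corpus)
print(counts["ab"])   # 3
print(counts["b a"])  # 0
```

Because the loop runs over running text rather than a vocabulary, a word that appears many times contributes its n-grams once per occurrence, and no n-gram spanning two words is ever counted.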

In document Machine translation of morphologically rich languages using deep neural networks (Page 100-103)