
4.3 Count-based Segmentation for MCWs

4.5.1 Discussion

In this section we examine two properties of the model in order to analyze it from different perspectives. The first issue is the size of the training corpora. Recently, Józefowicz et al. (2016) trained huge neural networks with different architectures for language modeling. They trained deep models on very large corpora (the one billion word benchmark⁶ (Chelba et al., 2013)) using massive hardware resources (the best performance was reported for a model trained on 32 Tesla K40 GPUs for 3 weeks). According to their comparisons, the architecture proposed by Kim et al. (2016) provides state-of-the-art results. As our model is an extension of that architecture, it is expected to perform better, but this cannot be claimed without evaluating the model in practice: in large-scale experimental settings the model might behave differently. We cannot afford the setting and resources described in Józefowicz et al. (2016), but we are able to design a comparable experiment. To evaluate the performance of our (best) models in the presence of large vocabularies, we use two large corpora, one Farsi and one German. Information about the corpora and the experimental results are reported in Table 4.6.

           Data                               PPL
Language   V      C     MC     mc    Mdp      CLM   CLMC   WordCLMdp
German     637k   319   9122   1     71k      342   340    387
Farsi      336k   132   6513   1.5   38k      290   280    305

Table 4.6: Experimental results for large datasets. mc is the external parameter of Model C. dp stands for dynamic programming and indicates the tokens extracted via Model D.
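The PPL columns above report perplexity. As a reminder of how that figure is usually obtained, the sketch below computes perplexity as the exponential of the average negative log-likelihood per predicted token. The function name `perplexity` is illustrative, and the assumption that the average is taken per token with natural logarithms is ours; the text does not state the normalization unit it uses.

```python
import math


def perplexity(log_probs):
    """Perplexity from the natural-log probabilities a model assigned
    to each target token (assumed normalization: per token)."""
    if not log_probs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood over all predicted tokens.
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)


# Sanity check: a model that spreads probability uniformly over 340
# choices at every step has a perplexity of (approximately) 340.
uniform = [math.log(1 / 340)] * 1000
print(perplexity(uniform))
```

Read this way, a perplexity of 340 means the model is, on average, as uncertain as if it were choosing uniformly among 340 equally likely continuations.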

The German corpus was built from the Europarl-v7⁷ collection (Koehn, 2005): we selected 1.9 million sentences from the German side of the English–German corpus. For Farsi, we selected the same number of sentences from the Hamshahri⁸ collection (AleAhmad et al., 2009), a standard dataset of monolingual sentences that is frequently used in Farsi language processing research.

Table 4.6 summarizes the advantages and disadvantages of the different approaches. The main advantage of the character-based model is its ability to handle very large datasets: it maps everything onto a limited set of characters, which facilitates the language-modeling task. However, the level of granularity provided by the character-based model is not optimal, so we use our tunable models. When words are segmented with Model C the neural language model performs better, but it can be hard to set the best value for the external parameter of Model C. There is also another model, namely WordCLMdp, for which we do not need to set an external parameter. Training this model is 15 hours faster than training CLMC, but its performance is slightly worse. We can summarize the properties of these approaches as follows: i) CLM performs well for large datasets, but character-level segmentation is not the best segmentation scheme; ii) CLMC is more precise than CLM, but it requires a complicated training procedure; iii) finally, WordCLMdp is fast and has no external parameter, but it may fail to perform equally well on large datasets. However, its performance is not far from that of CLMC.

⁷ http://www.statmt.org/wmt16/translation-task.html

[Figure: perplexity (y-axis, 110 to 130) plotted against |M| (x-axis, 50 to 2,800).]

Figure 4.7: Impact of different θ values on PPL and |M|.

The second issue is the segmentation scheme itself. As previously mentioned, we think that the improvement gained by character-level models is not only due to the segmentation model. In CLM, for example, the highway module directly affects word representations; Kim et al. (2016) discussed this matter extensively and provided several examples. We believe that the strength of our model relies heavily on the chosen segmentation technique rather than on the neural architecture. Our performance does not depend on highway layers as much as that of CLM does: if we add a layer to, or remove a layer from, the highway module in the default configuration (in Model C), the final perplexity stays almost unchanged, whereas if we do the same with CharCLM, its perplexity changes by 10 to 20 points. Therefore, our segmentation scheme by its very nature is able to provide richer and more representative information.
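For readers unfamiliar with the highway module referred to above, the following is a minimal sketch of a highway layer in its standard form, y = t ⊙ g(W_H x + b_H) + (1 − t) ⊙ x with gate t = σ(W_T x + b_T). It is written in plain Python for self-containment and is not the implementation used in our experiments; the function names, the choice of ReLU for g, and the list-based parameter layout are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def highway(x, W_h, b_h, W_t, b_t):
    """One highway layer: the gate t mixes a transformed copy of x
    with x itself (the carry path)."""
    h = [max(0.0, v) for v in affine(W_h, b_h, x)]  # candidate, ReLU assumed
    t = [sigmoid(v) for v in affine(W_t, b_t, x)]   # transform gate in (0, 1)
    return [t_i * h_i + (1.0 - t_i) * x_i for t_i, h_i, x_i in zip(t, h, x)]


def highway_stack(x, layers):
    """Apply several highway layers in sequence; `layers` is a list of
    (W_h, b_h, W_t, b_t) tuples, so adding or removing a layer means
    adding or removing one tuple."""
    for W_h, b_h, W_t, b_t in layers:
        x = highway(x, W_h, b_h, W_t, b_t)
    return x
```

With a strongly negative gate bias the layer passes its input through almost unchanged, and with a strongly positive one it applies the full transformation, which is why the depth of the stack can matter a great deal for one model and very little for another.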

In our models the final output is highly influenced by the external parameters (θ and m), so it is necessary to find their best values to obtain the optimal output. However, even with arbitrarily chosen values such as m = 1 or θ = 500, our models provide results comparable to state-of-the-art models. We designed an experiment to show how these parameters affect the number of blocks and the model's performance. In this experiment we studied Model A (on Farsi) with different θ values. As Figure 4.7 shows, the best perplexity achieved by the model is 110 at θ = 500, which generates a basic set of 1460 blocks. This means that with the given frequency threshold, all words in the corpus can be encoded by 1460 atomic units. When the threshold is increased the model starts to behave like CLM, because the basic units become smaller and the basic set moves closer to the alphabet. For example, with θ = 10,000 the basic set includes 124 blocks and the perplexity is 120. We see the opposite trend when θ is decreased. When the threshold is set below 500 the basic set is no longer optimal and the performance of the NLM degrades, because with smaller θ values the chance of selecting longer blocks is higher and the basic set becomes too sparse. For example, with θ = 200 the size of the basic set is 2795, which results in a perplexity of 112, worse than 110. As θ decreases further this negative trend continues down to θ = 100, below which the quality of the model drops drastically.
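The selection of θ described above amounts to a small sweep: train with each candidate threshold, record the size of the basic set and the resulting perplexity, and keep the threshold with the lowest perplexity. The sketch below illustrates this using only the three operating points quoted in the text; the helper name `best_threshold` and the tuple layout are hypothetical, and the remaining points of Figure 4.7 are not reproduced.

```python
# (theta, |M|, perplexity) triples quoted in the text for Model A on Farsi.
reported = [
    (200, 2795, 112),
    (500, 1460, 110),
    (10_000, 124, 120),
]


def best_threshold(results):
    """Return the (theta, basic-set size, perplexity) triple with the
    lowest perplexity among the evaluated thresholds."""
    return min(results, key=lambda r: r[2])


theta, num_blocks, ppl = best_threshold(reported)
print(theta, num_blocks, ppl)  # 500 1460 110
```

Note that the sweep is not monotone in |M|: both the smallest basic set (124 blocks) and the largest (2795 blocks) give a higher perplexity than the intermediate one.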