Cross-Entropy
5.6 Text Mining Toolkit
5.6.2 Results: IPA Representation
The first representation I examine is IPA transcription. Each character is a single phoneme.
Language identification
Using samples of text from Philippians 2, the correct language for each test string was identified reliably (i.e. in 100% of cases) for test strings of 26 characters or longer. Below this threshold, mis-identification increases exponentially as test strings become shorter: first Spanish is identified as Greek, then German is also identified as Dutch, and then the errors scatter more broadly (see Figure 5.3).
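As an illustration of identification by lowest cross-entropy, the following minimal sketch scores a test string against several models and picks the best-scoring one. The character bigram model with add-one smoothing, the function names and the toy training strings are all assumptions made for this example; the toolkit's own modelling method is not described in this section and may well differ.

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Count character bigrams and their left-hand contexts (assumed model)."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return bigrams, contexts, set(text)

def cross_entropy(test, model):
    """Mean bits per character of `test` under `model` (add-one smoothing)."""
    bigrams, contexts, alphabet = model
    vocab = len(alphabet) + 1  # one extra bucket for unseen symbols
    total_bits = 0.0
    for prev, cur in zip(test, test[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab)
        total_bits -= math.log2(p)
    return total_bits / (len(test) - 1)

def identify(test, models):
    """Return the label of the model giving the lowest cross-entropy."""
    return min(models, key=lambda label: cross_entropy(test, models[label]))

# Hypothetical toy "languages", not real transcriptions.
models = {
    "lang_a": train_bigram_model("abababababab"),
    "lang_b": train_bigram_model("aabbaabbaabb"),
}
print(identify("ababab", models))
```

With longer test strings the per-character scores of the competing models separate more clearly, which is the behaviour the length threshold above describes.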
The best-fit curve has the equation $\text{Percentage correct} \approx 100\,(1 - 1.3e^{-0.41L})$, where $L$ is the length of the test string. Therefore, to achieve 100% accuracy using 100 test strings in 99% of experiments, a test string of length $L \geq 34$ is required. A test string of length 500 (used hereafter) has an identification error rate of less than 1 in $10^{87}$. I therefore confirm that the language of the test string can be reliably identified, and that the length threshold for doing so is consistent, as per Hypotheses 1 and 2 in Subsection 5.4.1.
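The fitted curve can be evaluated directly. The sketch below uses only the two fitted constants quoted above (1.3 and 0.41); the function names are illustrative, and treating the errors of the 100 test strings as independent is an assumption of this example, not something stated in the text.

```python
import math

def error_rate(length, scale=1.3, rate=0.41):
    """Fitted per-string probability of mis-identification at this length."""
    return scale * math.exp(-rate * length)

def percent_correct(length):
    """Fitted percentage of test strings identified correctly."""
    return 100.0 * (1.0 - error_rate(length))

def prob_all_correct(length, n_strings=100):
    """Chance that all n_strings are correct, assuming independent errors."""
    return (1.0 - error_rate(length)) ** n_strings

for length in (10, 26, 34, 500):
    print(length, percent_correct(length), error_rate(length),
          prob_all_correct(length))
```

The curve is only meaningful where the fitted error rate is below 1, that is, for strings longer than about one character.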
Figure 5.3: Percentage of test strings correctly identified by length
Language interaction as a predictor of cross-entropy
We can reliably identify the language of a test string of a given length transcribed phonemically in the IPA. The cross-entropy of a test string with a model based on its own language is therefore consistently ranked lower than its cross-entropy with models of other languages. But are the mean cross-entropies of non-identical language pairs distinguishable from one another?
Applying a one-way ANOVA to the cross-entropy of an ordered pairing⁵ of languages for test strings of length 500±5 characters, I find that there is an effect size of $\eta^2 = 0.87$ (see Table 5.9). That is, the proportion of the variance in cross-entropy that can be explained by the combination of the test language and the model language is 87%. The proportion of the variance which is residual, explained neither by this combination nor by the test language or model language independently, is <0.5%. Ordered pairings of languages are a reliable predictor of cross-entropy, and so further investigation of Kullback-Leibler divergence is worth pursuing.
Figure 5.4 shows the distributions of cross-entropy for test strings of length 500±5 characters and 150±5 characters. There were approximately six test strings and 21 test strings per language, respectively.
⁵ E.g. ‘Dutch Spanish’ refers to a Dutch test string modelled using Spanish, which, as discussed, is not necessarily the same as ‘Spanish Dutch’, a Spanish test string modelled using Dutch.
Figure 5.4: Cross-entropy ranking of IPA transcriptions, for test strings of length 150 and 500 characters. Each point corresponds to a single test string. Also shown are the mean, hinges at the first and third quartiles, and whiskers extending to the minimum/maximum values that are no further than 1.5 times the inter-quartile range from the hinges.
Factor                                        Deg. of    Sum of     Mean      F-ratio   Pr(>F)         η²
                                              freedom    Squares    Square
Language of test string                             6      0.398    0.06639     267.6   < 2 × 10^-16   0.031
Language of model                                   6      1.283    0.21384     862     < 2 × 10^-16   0.099
Language of test string × Language of model        36     11.278    0.31328    1262.9   < 2 × 10^-16   0.866
Residuals                                         245      0.061    0.00025
Table 5.9: Factors contributing to variance in cross-entropy of IPA transcriptions, for test strings of length 500 characters.
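The effect sizes in Table 5.9 can be recomputed from the sums of squares, since each η² is that factor's sum of squares divided by the total. The short sketch below uses only the values printed in the table; the dictionary keys are labels chosen for this example.

```python
# Sums of squares as printed in Table 5.9.
sums_of_squares = {
    "test string": 0.398,
    "model": 1.283,
    "test string x model": 11.278,
    "residuals": 0.061,
}

total = sum(sums_of_squares.values())

# Proportion of total variance attributed to each row of the table.
eta_squared = {name: ss / total for name, ss in sums_of_squares.items()}

for name, value in eta_squared.items():
    print(f"{name}: {value:.3f}")
```

The residual share computed this way is the "<0.5%" figure quoted in the text.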
Consistency of Kullback-Leibler divergence
Having established that cross-entropy depends significantly on the combination of two languages, we turn to the Kullback-Leibler divergence.
Figure 5.5 shows symmetric Kullback-Leibler divergences for all language pairs. These are calculated by pairwise means of the Kullback-Leibler divergences for a test string of language A modelled with B, and for B modelled with A, and normalised using the same constant as in the prototype (i.e. 8.54) to give values between 0 and 1.
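The symmetrisation and normalisation just described can be sketched as follows. The normalising constant 8.54 is the one quoted above; the function name, the dictionary layout and the two directed divergence values are hypothetical, chosen only to show the arithmetic.

```python
NORMALISER = 8.54  # constant quoted in the text, carried over from the prototype

def symmetric_kl(directed, lang_a, lang_b, normaliser=NORMALISER):
    """Mean of the two directed divergences, scaled by the normaliser."""
    forward = directed[(lang_a, lang_b)]   # test string in A, model of B
    backward = directed[(lang_b, lang_a)]  # test string in B, model of A
    return (forward + backward) / 2.0 / normaliser

# Hypothetical directed divergences, for illustration only.
directed = {
    ("Dutch", "Spanish"): 3.0,
    ("Spanish", "Dutch"): 2.0,
}

print(symmetric_kl(directed, "Dutch", "Spanish"))
```

By construction the result does not depend on the order of the two languages, which is what allows a single value per unordered pair in Figure 5.5.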
The robustness of this ranking was tested using 10-fold cross-validation. The data were randomly divided into 10 sets. Each set in turn was treated as a test set, with the remaining 90% of data points forming a training set. The training sets were modelled using a random decision forest, and the resulting predictions compared to the relevant test set. The mean error was 0.016, the 99th percentile was 0.053, and the maximum was 0.085. For comparison, the values obtained for these languages range from 0.16 to 0.63, so at the 99th percentile the Kullback-Leibler divergences obtained from an IPA representation are accurate to within ±11% of that range. For the purposes of categorical comparison, these language pairs could therefore be divided into five non-overlapping categories (see Table 5.10).
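The fold structure of this procedure can be sketched in a few lines. To keep the example self-contained, a per-pair mean predictor is swapped in for the random decision forest used in the study, and the records are synthetic values generated purely for illustration; neither reproduces the actual data or the reported errors.

```python
import random
from collections import defaultdict
from statistics import mean

def k_fold_indices(n_items, k, seed=0):
    """Shuffle the indices and deal them into k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(records, k=10, seed=0):
    """Return the absolute prediction error for every held-out record."""
    errors = []
    for fold in k_fold_indices(len(records), k, seed):
        held_out = set(fold)
        training = [rec for i, rec in enumerate(records) if i not in held_out]
        by_pair = defaultdict(list)
        for pair, value in training:
            by_pair[pair].append(value)
        overall = mean(value for _, value in training)
        for i in fold:
            pair, value = records[i]
            # Stand-in predictor: mean of the training values for this pair.
            predicted = mean(by_pair[pair]) if pair in by_pair else overall
            errors.append(abs(predicted - value))
    return errors

# Synthetic records (pair label, divergence); hypothetical values only.
rng = random.Random(1)
records = [(pair, centre + rng.gauss(0.0, 0.01))
           for pair, centre in (("A-B", 0.2), ("A-C", 0.4), ("B-C", 0.6))
           for _ in range(20)]

errors = cross_validate(records)
print(mean(errors), max(errors))
```

Each record is held out exactly once, so the list of errors has one entry per data point and its mean, upper percentile and maximum can be summarised as in the text.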
Considering Hypothesis 3 (Subsection 5.4.1), that the Kullback-Leibler divergences are consistently ranked, we see that this is false when the 42 language pairings are ordered as distinct items. However, we can reject the null hypothesis that language pairings have no effect on rankings, since five distinct categories can be observed.
Table 5.10: Language pairs categorised by symmetric Kullback-Leibler divergence
Asymmetry of Kullback-Leibler divergence
The Kullback-Leibler divergence of IPA representations is not symmetrical (see Figure 5.7). The cross-entropy of test language A modelled by language B is significantly different from B modelled by A in all cases. However, this asymmetry varies in magnitude (Table 5.11), depending on the language of the test string and of the model.
We can therefore reject Hypothesis 4 (Subsection 5.4.1), that the Kullback-Leibler divergence is symmetrical for all language pairs.
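One way to ask whether the divergences for a language pair and for its inverse could have been drawn from the same distribution is an exact two-sample permutation test on the difference of means. This is a sketch under assumptions: the test actually used for Table 5.11 is not named in this section, and the sample values below are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(xs, ys):
    """Exact two-sided permutation test on the difference of sample means."""
    pooled = list(xs) + list(ys)
    n_x, n_y = len(xs), len(ys)
    observed = abs(mean(xs) - mean(ys))
    pooled_sum = sum(pooled)
    hits = 0
    splits = 0
    # Enumerate every way of relabelling n_x of the pooled values as "x".
    for chosen in combinations(range(len(pooled)), n_x):
        x_sum = sum(pooled[i] for i in chosen)
        diff = abs(x_sum / n_x - (pooled_sum - x_sum) / n_y)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
        splits += 1
    return hits / splits

# Hypothetical divergences for a pair and its inverse (six strings each).
pair = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
inverse = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5]
print(permutation_p_value(pair, inverse))
```

With six values per group there are only 924 relabellings, so full enumeration is cheap; larger samples would call for random resampling instead.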
Predictability per language
Returning to the ANOVA of cross-entropy (Table 5.9), we see that the language of the test string, the language of the model and their combination are all significant factors ($p < 2 \times 10^{-16}$). I therefore reject the null hypothesis that all languages are equally segmentally predictable when represented with IPA characters. Of the three factors, the language of the test string has the smallest impact ($\eta^2 = 0.03$), the language of the model has a larger impact ($\eta^2 = 0.10$), and the combination of the two has by far the largest effect size ($\eta^2 = 0.87$).
Figure 5.5: Symmetric Kullback-Leibler divergence of IPA representation
Figure 5.6: Visualisation of mean symmetric Kullback-Leibler divergence, IPA transcription. (Dereeper et al., 2008; Felsenstein, 1989)
Test strings in Spanish have the lowest entropy (see Table 5.12). For example, the average Portuguese test string of a given length requires 17% more bits than the average Spanish test string of the same length. This implies that there is more segmental information in a Portuguese phrase than in a Spanish phrase with the same number of segments, and so on for other pairs.
The models for German and Dutch result in better compression, on average, than the models for Spanish and Greek (see Table 5.13). Test strings encoded with a Greek model require 13% more bits, averaged across all test languages, than the same test strings encoded with a German model.
The predictability per language across all four representations under examination is compared in Subsection 5.6.6 on page 153.
Figure 5.7: Kullback-Leibler divergence of language pairs and their inverse, ordered by the mean of the two, which is marked with a vertical line; IPA representation
Language pair Inverse Probability
Table 5.11: Probability that KL distances of language pairs and of their inverses were drawn from the same distribution
Entropy % increase
Table 5.12: GLM: Contribution to cross-entropy by language of test string; IPA representation
Table 5.13: GLM: Contribution to cross-entropy by language of model; IPA representation