5.6 Text Mining Toolkit

5.6.7 Segments to features

So far, I have examined all features in combination for each representation, effectively using segmental representations with differing levels of phonetic detail and inventory overlap between languages. The end result of this is that all four representations give rise to similar language distances, because there is very little variation in the resulting segment inventories. It is therefore not possible to use these language distance hypotheses as predictions of the representational theories to be tested and compared. This section therefore discusses the predictions made using individual features/elements from the three non-segmental representations.

For each feature/element of each representation, I calculated cross-entropy and Kullback-Leibler divergence as I did for feature bundles in Subsections 5.6.2 to 5.6.5.
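For reference, the two quantities can be written in a standard per-character form. This is a textbook formulation under stated assumptions, not a restatement of the estimator used in the earlier subsections: the symbols x_1 ... x_n (a test string of n characters), q_M (the probability assigned by the model trained on language M) and x_A (a test string drawn from language A) are notational assumptions introduced here.

```latex
% Per-character cross-entropy of a test string under the model for language M
H(x; M) = -\frac{1}{n} \sum_{i=1}^{n} \log_2 q_M(x_i \mid x_1 \dots x_{i-1})

% Divergence of language B from language A, estimated as the excess
% cross-entropy of an A test string under the B model
D(A \parallel B) \approx H(x_A; M_B) - H(x_A; M_A)
```

On this reading, a divergence near zero means that the model for B predicts the A test string almost as well as the model for A itself does.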

Static SPE-style features

As discussed in Subsection 5.2.2, a string representing a single binary feature would look like so:

For individual static features, the maximum entropy is approximately 1.6 bits per character (log2 3 ≈ 1.585), assuming three potential states per character.
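The figure of 1.6 can be checked directly: the entropy of a source is largest when all states are equally likely, in which case it equals the base-2 logarithm of the number of states. The function name below is an illustrative assumption.

```python
import math


def max_entropy_bits(n_states: int) -> float:
    """Entropy in bits of a uniform distribution over n_states symbols."""
    if n_states < 1:
        raise ValueError("n_states must be a positive integer")
    return math.log2(n_states)


# Three potential states per character, as assumed in the text.
print(round(max_entropy_bits(3), 3))  # 1.585
```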

Individual features require more input data than combinations: with test strings of 500 characters, the model with the lowest entropy was not always the model of the test string's own language. I therefore used longer test strings, of 800 characters, which did reliably return the correct language.
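The identification criterion described here, choosing the language whose model gives the test string the lowest entropy, can be sketched as follows. This is a minimal illustration and not the models used in the thesis: it substitutes a unigram character model with add-one smoothing, and the alphabet "+-0", the language labels and the training strings are invented for the example.

```python
import math
from collections import Counter


def train_unigram(text, alphabet):
    """Add-one smoothed unigram probabilities over a fixed alphabet."""
    counts = Counter(text)
    total = len(text) + len(alphabet)
    return {ch: (counts[ch] + 1) / total for ch in alphabet}


def cross_entropy(test, model):
    """Mean negative log2 probability per character of test under model."""
    return -sum(math.log2(model[ch]) for ch in test) / len(test)


def identify(test, models):
    """Return the label of the model with the lowest cross-entropy."""
    return min(models, key=lambda lang: cross_entropy(test, models[lang]))


# Invented training data: one three-state feature string per language.
ALPHABET = "+-0"
models = {
    "lang_a": train_unigram("++-0" * 50, ALPHABET),
    "lang_b": train_unigram("--+0" * 50, ALPHABET),
}

print(identify("++-0++-0+", models))  # lang_a
```

With a model this simple, short test strings are easily misattributed, which mirrors the observation that individual features need longer test strings than feature combinations.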

Running 10-fold cross validation, the stability of Kullback-Leibler divergence calculated using individual features ranges from slightly more stable than combinations (for [round] and [tap]), to much less stable - see Table 5.28.
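The final column of Table 5.28 gives error as a percentage of range. The tabulated figures appear consistent with dividing the 99th percentile error by the range of values; this reading is inferred from the numbers and is not stated in the text. A short check on five rows copied from the table:

```python
# Rows copied from Table 5.28:
# feature -> (range of values, 99th percentile error, tabulated percentage)
ROWS = {
    "round": (1.79, 0.16, 9),
    "tap": (1.47, 0.13, 9),
    "anterior": (1.16, 0.13, 11),
    "tense": (1.16, 0.17, 15),
    "low": (0.68, 0.13, 19),
}


def error_as_percent_of_range(p99_error, value_range):
    """Assumed definition: 99th percentile error relative to the range."""
    return 100.0 * p99_error / value_range


for name, (value_range, p99_error, tabulated) in ROWS.items():
    computed = error_as_percent_of_range(p99_error, value_range)
    print(f"{name}: computed {computed:.1f}%, tabulated {tabulated}%")
```

Because the table reports its inputs to two decimal places, a recomputed percentage can differ from the tabulated one by about a point.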

Language-specific SPE-style features

Turning to the language-specific binary features, even with the longest available test string of 900 characters, the language of the test string cannot be reliably identified with single features.

When the features were combined into laryngeal, place, and manner bundles, test strings of over 400 characters could be reliably identified by language. The manner bundle provided the most stable ranking, and the laryngeal bundle the least, reflecting the number of individual features which combine to produce them. (See Table 5.29 for results from 800 character test strings.)

Table 5.28: Kullback-Leibler values and error for individual static features

Feature              Mean value  Range of values  Mean error  99th percentile error  Max error  Error as % of range
round                0.21        1.79             0.046       0.16                   0.18       9%
tap                  0.25        1.47             0.036       0.13                   0.14       9%
anterior             0.33        1.16             0.042       0.13                   0.17       11%
consonantal          0.32        1.12             0.044       0.13                   0.19       12%
labiodental          0.19        1.13             0.041       0.14                   0.18       12%
voice                0.25        0.90             0.037       0.11                   0.13       12%
spread glottis       0.20        1.20             0.041       0.15                   0.18       13%
constricted glottis  0.17        0.97             0.030       0.13                   0.15       13%
distributed          0.32        0.89             0.039       0.12                   0.17       13%
implosive            0.24        1.02             0.039       0.13                   0.15       13%
lateral              0.19        1.15             0.050       0.15                   0.19       13%
syllable             0.33        1.06             0.048       0.15                   0.16       14%
sonorous             0.26        0.91             0.042       0.13                   0.16       14%
delayed release      0.29        0.84             0.038       0.12                   0.16       14%
strident             0.29        0.85             0.041       0.12                   0.16       14%
continuant           0.27        0.93             0.043       0.13                   0.17       14%
dorsal               0.30        0.91             0.043       0.13                   0.15       14%
coronal              0.26        0.89             0.042       0.13                   0.15       14%
tense                0.52        1.16             0.049       0.17                   0.20       15%
trill                0.22        0.87             0.036       0.13                   0.15       15%
labial               0.20        0.76             0.038       0.12                   0.16       16%
approximant          0.30        0.90             0.046       0.15                   0.17       17%
long                 0.18        1.09             0.039       0.19                   0.23       17%
back                 0.36        0.85             0.041       0.15                   0.17       18%
front                0.39        0.75             0.042       0.13                   0.18       18%
nasal                0.21        0.89             0.053       0.16                   0.20       18%
high                 0.36        0.73             0.044       0.13                   0.18       18%
low                  0.33        0.68             0.046       0.13                   0.16       19%

Whilst each individual bundle gives a stable ranking, these rankings are not correlated between bundles (Figure 5.21). This means that it is possible to contrast the differing effects of different features on language distance. However, combining the bundles into average results does not result in reliable language distance calculations. This could potentially be mitigated by using longer test strings - the segmental tests run previously used strings with lengths an order of magnitude greater than the threshold for language identification.

Figure 5.21: Kullback-Leibler divergence of language-specific SPE-style feature bundles, ordered by manner

Table 5.29: Kullback-Leibler values and error for laryngeal, place and manner for language-specific SPE-style features

Feature    Mean  Range  Mean error  99th percentile error  Maximum error  Percentage error
Manner     1.61  3.25   0.05        0.16                   0.20           5%
Place      1.65  2.37   0.06        0.18                   0.22           8%
Laryngeal  0.33  1.09   0.05        0.15                   0.18           13%

Elements

Firstly, I examined strings of individual elements, with headed and unheaded elements treated separately:

As with the language-specific binary features, individual elements cannot reliably identify the language of the test string with test strings of 900 characters or under. I therefore combined presence with headedness into a single character, such that each composite character has three possible states: absent, unheaded or headed. These states are entirely independent, such that, for example, a pattern involving headed A will not aid in identifying the same pattern in a different language which refers to unheaded A instead. However, this approach does help identify patterns in which the headedness of an element has predictive power for neighbouring unheaded versions, or vice versa.
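The three-state encoding can be illustrated with a short sketch. The state symbols "0", "1" and "2", the function name and the segment decompositions are all hypothetical and chosen only for the example; they are not the element assignments used in the thesis.

```python
# Symbols for the three states of a composite character (assumed notation).
ABSENT, UNHEADED, HEADED = "0", "1", "2"

# Hypothetical decompositions for illustration only:
# segment -> {element: True if headed, False if unheaded}
DECOMPOSITION = {
    "a": {"A": True},
    "i": {"I": True},
    "u": {"U": True},
    "e": {"I": True, "A": False},
    "o": {"U": True, "A": False},
}


def encode_element(segments, element, table=DECOMPOSITION):
    """Return one character per segment giving the state of `element`."""
    out = []
    for seg in segments:
        headed = table[seg].get(element)  # None when the element is absent
        if headed is None:
            out.append(ABSENT)
        else:
            out.append(HEADED if headed else UNHEADED)
    return "".join(out)


print(encode_element("aeio", "A"))  # 2101
```

Because the three symbols are unrelated characters, a model trained on such strings treats headed and unheaded occurrences as distinct events, which matches the independence of states described above.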

Using these bundles as characters, test strings of 800 characters or more are reliably identified. (With test strings of length 700 characters, one of the 28 English test strings was misidentified as Dutch.)

There is a similar range of stability in elements as in static binary features, with most individual elements being similar in stability to, or slightly less stable than, their combination into a single bundle, which had an error rate of 13%. (See Table 5.30.) H gives more stable results than combining all elements, perhaps because of the stark division of languages in this sample into voicing and aspirating languages, with no confounds from tone or breathy voice.

Table 5.30: Kullback-Leibler values and error for Elements

Element      Mean  Range  Mean error  99th percentile error  Maximum error  Percentage error
H            0.42  1.60   0.04        0.14                   0.17           8%
I            0.36  1.41   0.05        0.18                   0.25           13%
A            0.30  1.08   0.04        0.15                   0.17           14%
ʔ            0.22  0.94   0.04        0.13                   0.16           14%
Syllabicity  0.29  0.73   0.04        0.12                   0.15           16%
L            0.30  1.53   0.05        0.25                   0.32           17%
U            0.18  0.60   0.04        0.13                   0.15           22%

As well as varying in their stability, individual elements vary in their predictability (see Table 5.31). For example, |A| is approximately 1.5 times more predictable than |A| on average, and |H| 20% more predictable than |I|. This implies that the functional load is different between different elements. An interesting avenue of future research would be to compare these findings between different languages, and look for phonetic or phonological phenomena which correlate with differences in modelled predictability. Likewise, such calculations could be used to compare the implications for information transfer of different analytical decisions in element assignment.

In document Measuring phonological distance between languages (Page 155-159)