BPEmb: Byte Pair Embeddings in 275 Languages

One of the main advantages of BPE is that it is applicable to any sequence of symbols. In particular, it can be applied to text, regardless of language.10 We use this advantage to train subword segmentation models in many languages, then employ these models to segment large text corpora into subwords, and finally train subword embeddings, which we publish for general use. This yields a collection of Byte Pair Embeddings in 275 languages, which we call BPEmb.11 In more detail, the procedure consists of the following steps:

1. Text corpus. To enable learning good BPE models and embeddings, we require a large corpus of texts. We use Wikipedia as our corpus and extract plain article texts with WikiExtract.12 After removing Wikipedia language editions with very little content, we obtain article texts in 275 languages.

2. Preprocessing. Two preprocessing steps aim to improve BPE model training. We lowercase all characters, since we expect that sentence-initial capitalization, title case, capitalization of nouns, and other case variations are not relevant for subword segmentation. Similarly, we replace all digits with 0 to prevent the BPE model from making irrelevant distinctions between individual numbers.

10_{Whether it is meaningful to apply BPE to any language, that is, whether the algorithm learns meaningful subword segmentations in a given language, depends on the properties of the language and the availability of training data. Also see the limitations discussed in section 5.4.3.}

11_{https://github.com/bheinzerling/bpemb}

3. BPE model training. Having prepared texts in 275 languages, we now learn BPE models using SentencePiece.13 A priori, it is not clear how the number of BPE merge operations should be set. Hence, we train different models with varying numbers of merge operations and evaluate the impact of this hyper-parameter later. Specifically, we train BPE models with 1000, 3000, 5000, 10000, 25000, 50000, 100000 and 200000 merge operations.

4. BPE subword segmentation. By applying the BPE models trained in the previous step to the texts in our training corpus, namely, Wikipedia editions in a given language, we obtain subword-segmented texts, as shown in Figure 5.5 on page 115.

5. Subword embedding training. Finally, we use off-the-shelf software, namely GloVe,14 to train subword embeddings for each language and each number of merge operations. Since it is not clear what the best embedding dimensionality is, we train embeddings with various dimensions, leaving another hyper-parameter setting to empirical evaluation.
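Steps 2 to 4 can be illustrated with a minimal sketch. It is not the pipeline used here: the toy BPE learner below stands in for SentencePiece and only mirrors the general idea of repeatedly merging the most frequent adjacent symbol pair. The function names preprocess, learn_bpe, apply_merge, and segment are hypothetical, and splitting words on whitespace is a simplifying assumption that does not hold for languages written without spaces.

```python
import re
from collections import Counter


def preprocess(text):
    """Step 2: lowercase all characters and replace every digit with 0."""
    return re.sub(r"\d", "0", text.lower())


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` by its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Step 3 (toy version): learn up to `num_merges` merge operations.

    Assumption: words are whitespace-separated; "_" marks the word start.
    """
    words = Counter(tuple("_" + w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:  # every word is already a single symbol
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[apply_merge(symbols, best)] += freq
        words = merged
    return merges


def segment(word, merges):
    """Step 4: segment a word by replaying the learned merges in order."""
    symbols = tuple("_" + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)
```

In this sketch, num_merges plays the role of the merge-operation hyper-parameter from step 3: a larger value leaves fewer and longer subwords per word, so calling segment with a longer prefix of the learned merge list never yields more pieces than a shorter prefix.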

To assess BPEmb, that is, the subword embeddings trained according to this procedure, we first perform a qualitative evaluation, before comparing it with other subword representations in a quantitative evaluation in the next section.

5.3.1 Qualitative Evaluation

We first analyze the subword segmentations induced by the BPE models we trained on Wikipedia as described above. Table 5.7 on the next page shows subword segmentations of our English running example myxomatosis and its translations into German, Polish, and Japanese. We observe that, for this particular word, our trained BPE models yield reasonable segmentations in three of these languages. However, it also becomes apparent that there is no single best number of merge operations that yields good segmentations across languages, as the arguably best segmentations emerge after 10000 BPE merges for English, 50000 for German, and 100000 for Japanese.
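To make the segmentation notation concrete, the following sketch maps a segmentation back to its surface form and counts its subwords. It assumes that a leading "_" marks the start of a word, as in the segmentations reported for this example (SentencePiece itself uses the character U+2581 for this purpose). The function name detokenize is hypothetical; the three English segmentations are copied from the reported results for 1000, 10000, and 200000 merge operations.

```python
def detokenize(subwords, marker="_"):
    """Concatenate subwords and turn word-start markers back into spaces.

    Assumption: `marker` never occurs as a literal character in the text.
    """
    return "".join(subwords).replace(marker, " ").strip()


# English segmentations of "myxomatosis", keyed by number of merge operations.
english = {
    1000: "_m y x om at os is",
    10000: "_my x om at osis",
    200000: "_myx omatosis",
}

for merges, segmentation in english.items():
    pieces = segmentation.split()
    print(merges, len(pieces), detokenize(pieces))
```

Every segmentation concatenates back to the same word; only the number and length of the pieces change with the number of merge operations.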

Next, we qualitatively assess our trained subword embeddings by inspecting nearest neighbors in the embedding space. Figure 5.6 on page 119 shows the nearest neighbors of the embedding of the English morpheme osis (“disease”, “abnormal state”). We find several words and subwords with similar or related meanings:

13_{https://github.com/google/sentencepiece}

Language   Merge operations   Subword segmentation

English    1000               _m y x om at os is
           3000               _my x om at os is
           5000               _my x om at os is
           10000              _my x om at osis
           25000              _my x omat osis
           50000              _my x omat osis
           100000             _myx omatosis
           200000             _myx omatosis

German     1000               m y x om at o se
           3000               _m y x om at o se
           5000               _my x om at ose
           10000              _my x om at ose
           25000              _my x om at ose
           50000              _my x omat ose
           100000             _my x omat ose
           200000             _myx omatose

Polish     1000               _m y k so m ato za
           3000               _my k so m ato za
           5000               _my k so m ato za
           10000              _my k so m ato za
           25000              _my k so mato za
           50000              _my k so mato za
           100000             _my kso mato za
           200000             _my kso mato za

Japanese   5000               _ 兎粘液腫
           10000              _ 兎粘液腫
           25000              _ 兎粘液腫
           50000              _ 兎粘液腫
           100000             _ 兎粘液腫
           200000             _ 兎粘液腫

TABLE 5.7: Learned subword segmentations for different numbers of merge operations. In English, BPE yields reasonable subword segments of the word myxomatosis, which, after 10000 merge operations, include the two morphemes omat (“tumor”) and osis (“sickness”). Similarly, the segmentation of the German Myxomatose includes omat and ose, the German equivalent of the English osis. For the Polish Myksomatoza, our trained models fail to find any morphemes, producing mato instead of omat and za instead of oza, the Polish equivalent of osis. In Japanese, the disease has the name 兎粘液腫, consisting of the characters for “rabbit”, “sticky”, “fluid”, and “tumor” (recall that myxomatosis is a tumorous disease in rabbits). The BPE model trained on the Japanese edition of Wikipedia finds the correct segmentation into 兎 (“rabbit”), 粘液 (“mucus”), and 腫 (“tumor”).

FIGURE 5.6: BPE embeddings most similar to the subword osis. t-SNE projection (Maaten and Hinton, 2008) with http://projector.tensorflow.org.

• itis: a suffix indicating disease, occurring, for example, in bronchitis;

• disease, diseases;

• _tum: a character trigram occurring in the word tumor;

• _inf: a character trigram occurring in the word inflammation; and

• related words such as _symptoms, _patients, _chronic.

The fact that the nearest neighbors include the embedding of the subword _diagn is an interesting case: On the one hand, the concatenation with osis yields _diagnosis, which is closely related to the topic of sickness. On the other hand, the subword segmentation implied here is wrong, since the word diagnosis originates from the Ancient Greek diá (“through”, “apart”) and gnōsis (“knowledge”).
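The nearest-neighbor inspection used in this analysis can be sketched as a ranking by cosine similarity. This is an illustrative assumption about the similarity measure, not a description of the tooling behind Figure 5.6. The function names cosine and nearest_neighbors are hypothetical, and the three-dimensional vectors are invented toy values, not actual BPEmb embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=3):
    """Return the k subwords whose vectors are most similar to that of `query`."""
    q = embeddings[query]
    scored = [
        (cosine(q, vec), subword)
        for subword, vec in embeddings.items()
        if subword != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [subword for _, subword in scored[:k]]


# Invented toy vectors for illustration only; NOT real BPEmb values.
toy_embeddings = {
    "osis": [0.9, 0.1, 0.0],
    "itis": [0.8, 0.2, 0.1],
    "_diagn": [0.7, 0.3, 0.0],
    "shire": [0.0, 0.1, 0.9],
}

print(nearest_neighbors("osis", toy_embeddings, k=2))
```

With real embeddings, the same ranking would be computed over the full subword vocabulary of one language and one number of merge operations.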

shire (English)   ingen (German)   ose (German)

ington            lingen           krank
_england          hausen           _erkrank
ford              hofen            itis
_wales            heim             _behandlung
outh              bach             _krankheit
_kent             sheim            hy
bridge            weiler           fekt
well              dorf             pt
_scotland         _bad             apie
orth              berg             _krank

TABLE 5.8: Examples of subword similarities. Shown are subwords most similar to the English morpheme shire, which is commonly found in place names in the United Kingdom; subwords similar to the German morpheme ingen, which is common in German place names; and subwords similar to the morpheme ose, the German equivalent of the English osis.

Table 5.8 shows more examples of similar subwords. The first example is the English suffix shire, which occurs in place names like Leicestershire, Berkshire, or Yorkshire. Among its most similar subwords, we find similar suffixes:

• ington, which occurs, for example, in Kensington or Islington;

• ford, which occurs, for example, in Stratford;

• outh, which occurs, for example, in Plymouth; and

• bridge, which occurs, for example, in Cambridge.

The list also includes related words: _england, _wales, _kent, and _scotland. For the German morpheme ingen, which is common in names of German cities and villages, all similar subwords commonly occur as word-final morphemes15 in place names, with the exception of the word-initial _bad (“bath”), among whose many word senses is one indicating spa towns, for example in Bad Säckingen. Subwords similar to the German ose (“osis”) include subwords with similar meaning, such as krank (“sick”), _erkrank (word stem, “to become sick”), and itis, as well as related words such as _behandlung (“treatment”).

In the next section, we compare BPEmb and other subword approaches.

15_{Strictly speaking, the subword sheim is not a morpheme but the concatenation of an epenthetic}

In document Aspects of Coherence for Entity Analysis (Page 130-135)