arxiv: v1 [cs.cl] 1 Jun 2021


SHUOWEN-JIEZI: Linguistically Informed Tokenizers for Chinese Language Model Pretraining

Chenglei Si1∗, Zhengyan Zhang2,3,4∗, Yingfa Chen2,3,4∗, Fanchao Qi2,3,4, Xiaozhi Wang2,3,4, Zhiyuan Liu2,3,4†, Maosong Sun2,3,4

1 University of Maryland, College Park, MD, USA

2 Department of Computer Science and Technology, Tsinghua University, Beijing, China

3 Institute for Artificial Intelligence, Tsinghua University, Beijing, China

4 State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China

[email protected], [email protected]

Abstract

Conventional tokenization methods for Chinese pretrained language models (PLMs) treat each character as an indivisible token (Devlin et al., 2019), which ignores the characteristics of the Chinese writing system. In this work, we comprehensively study the influence of three main factors on Chinese tokenization for PLMs: pronunciation, glyph (i.e., shape), and word boundary. Correspondingly, we propose three kinds of tokenizers: 1) SHUOWEN (说文, meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (解字, meaning Solve Character), the glyph-based tokenizers; 3) word segmented tokenizers, the tokenizers with Chinese word segmentation. To empirically compare the effectiveness of the studied tokenizers, we pretrain BERT-style language models with them and evaluate the models on various downstream NLU tasks. We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers, while Chinese word segmentation shows no benefit as a preprocessing step. Moreover, the proposed SHUOWEN and JIEZI tokenizers exhibit significantly better robustness in handling noisy texts. The code and pretrained models will be publicly released to facilitate linguistically informed Chinese NLP.¹

1 Introduction

Large-scale Transformer-based pretrained language models (PLMs) (Devlin et al., 2019; Lan et al., 2020; Clark et al., 2020; Ma et al., 2020; He et al., 2021, inter alia) have achieved great success in recent years and attracted wide research interest, in which tokenization plays a fundamental role.

¹ Please refer to Appendix A.1 for the historical meaning of SHUOWEN-JIEZI.

∗ Equal contribution.

† Corresponding author email: [email protected]

Unfortunately, current tokenization methods are developed primarily for English (Bostrom and Durrett, 2020). Almost all current PLMs adopt sub-word tokenization methods originating from machine translation, such as Byte-Pair Encoding (Sennrich et al., 2016), WordPiece (Schuster and Nakajima, 2012; Devlin et al., 2019), and SentencePiece based on the unigram language model (Kudo and Richardson, 2018). While the idea of sub-word tokenization is intuitive and effective for morphologically rich synthetic languages, this is not the case for Chinese.

We believe that it is crucial to develop tailored techniques for languages beyond English because there can be huge differences between languages (Bender, 2019, #BenderRule). Towards this end, we devote this work to analysing three unique linguistic characteristics of the Chinese writing system compared to English: 1) The Chinese writing system is morphemic (Hill, 2016), which means that Chinese characters poorly reflect pronunciation, so conventional character-based tokenization misses much phonological information. 2) Modern Chinese words basically do not undergo morphological alternations (Packard, 2000), which renders sub-word tokenization inapplicable. However, Chinese characters are mainly logograms, which means that their glyphs, composed of strokes and radicals, also contain rich semantic information (Wu et al., 2019). 3) In Chinese writing, there is no natural word boundary like the space in English writing. Although it is possible to inject word boundaries via Chinese Word Segmentation (CWS), there is no study on how this works for Chinese PLMs.

Targeting these three factors, we explore three corresponding tokenization strategies: 1) A pronunciation-based tokenizer family called SHUOWEN, which first romanizes the Chinese characters based on their pronunciations, and then constructs the vocabulary from the romanized scripts using the unigram language model (Kudo and Richardson, 2018). 2) A glyph-based tokenizer family called JIEZI, which decomposes characters into combinations of Chinese strokes or radicals, and then constructs the vocabulary from the stroke or radical sequences using the unigram language model. 3) A word segmented tokenizer family, which first uses a Chinese word segmenter to segment Chinese texts into words, and then constructs the vocabulary from the segmented word sequences using the unigram language model.

We pretrain BERT-style PLMs using the proposed tokenizers from scratch and evaluate the resultant models on various downstream tasks. Through comprehensive evaluation on ten Chinese NLU tasks, we find that our pronunciation-based (SHUOWEN) and glyph-based (JIEZI) tokenizers outperform conventional single-character tokenizers in most tasks. Furthermore, as they have the unique advantage of learning the meanings of complex characters through the composition of simpler sub-characters, they are naturally more robust in handling noisy input. Surprisingly, we find that Chinese Word Segmentation (CWS) has no benefit for Chinese language model pretraining.

Our work suggests that linguistically informed techniques based on the characteristics of different languages need more attention. We will release the code, pretrained models, and the SHUOWEN-JIEZI tokenizers to serve as a better foundation for future research on Chinese PLMs.

2 Related Work

Chinese PLM. Several previous works have explored techniques to improve Chinese language model pretraining. Zhu (2020) and Zhang et al. (2021) expanded the BERT vocabulary with Chinese words in addition to single characters and incorporated them in the pretraining objectives. Xiao et al. (2021) and Cui et al. (2019) considered coarse-grained information through masking n-grams and whole words during masked language modeling pretraining. Diao et al. (2020) incorporated word-level information via superimposing the character and word embeddings. Lai et al. (2021) incorporated Chinese word lattice structures in the pretraining.

Linguistically Informed Techniques for Chinese. CWS is a common preprocessing step for Chinese NLP tasks (Li and Sun, 2009). Meng et al. (2019) empirically analysed whether CWS is helpful for downstream Chinese NLP tasks before the PLM era and found that in many cases the answer is negative. We examine the impact of CWS for PLMs instead. Wu et al. (2019) incorporated glyph information of Chinese characters through adding extra encoders to encode the images of Chinese characters and then combining them with the character embeddings. We do not intend to fuse in additional information from sources like images; instead, all of our proposed tokenization methods are drop-in replacements for the existing single-character tokenizers, without adding any extra layers or parameters. Tan et al. (2018) explored converting Chinese text into Wubi sequences that represent character glyph information for the task of machine translation.

3 Method

In this section, we introduce our proposed tokenization methods.

3.1 SHUOWEN: Pronunciation-based Tokenizers

The Chinese writing system is morphemic (Hill, 2016) and barely conveys phonological information. However, the pronunciation of Chinese characters also reveals semantic patterns (Duanmu, 2007) and has long been widely used as the basis of input methods in China (e.g., pinyin). In order to capture such information, we propose a pronunciation-based tokenizer named SHUOWEN.

On raw Chinese input texts (e.g., 魑魅魍魉), SHUOWEN performs the following steps:

1. Romanize the text using Chinese transliteration systems. In this work, we explore two different transliteration methods: pinyin and zhuyin (i.e., bopomofo). Pinyin uses the Latin alphabet and four different tones (¯, ´, ˇ, `) to romanize the pronunciations of characters, e.g., 魑魅魍魉 → Chi¯ Mei` Wangˇ Liangˇ. Zhuyin, on the other hand, uses a set of self-invented characters and the same four tones to transcribe the characters, e.g., 魑魅魍魉 → ㄔ ㄇㄟ` ㄨㄤˇ ㄌㄧㄤˇ. Note that in zhuyin, the first tone mark (¯) is usually omitted.

2. Insert a special separation symbol (+) after each character's romanized sequence, e.g., Chi¯+Mei`+Wangˇ+Liangˇ+, ㄔ+ㄇㄟ`+ㄨㄤˇ+ㄌㄧㄤˇ+. This prevents the romanized sequences of different characters from being mixed together, especially when there are no tone markers to split them in zhuyin.

3. Different Chinese characters often have the same pronunciation. For disambiguation, we append different indices after the romanized sequences of homophonic characters, which gives a one-to-one (biunique) mapping between each Chinese character and its romanized sequence, e.g., Chi¯33+Mei`24+Wangˇ25+Liangˇ13+, ㄔ10+ㄇㄟ`3+ㄨㄤˇ6+ㄌㄧㄤˇ1+.

4. Apply a unigram language model (ULM) as in Kudo and Richardson (2018) on the romanized sequences to build the final vocabulary.

We do not set any constraint on the vocabulary other than the vocabulary size. The resultant vocabulary contains tokens corresponding to flexible combinations of characters and sub-characters.

3.2 JIEZI: Glyph-based Tokenizers

The shapes of Chinese characters contain rich semantic information and can help NLP models (Cao et al., 2018). For example, most Chinese characters can be broken down into semantically meaningful radicals. Characters that share common radicals often share related semantic information: for instance, the four characters ‘魑魅魍魉’ all share the same radical ‘鬼’ (meaning ghost), and their meanings are indeed related to ghosts and monsters.³

³ Interestingly, the word ‘魑魅魍魉’ is in fact a Chinese idiom, which is now often used to refer to bad people who are like monsters.

However, the prevailing tokenization method for Chinese treats each Chinese character as a separate token and hence prevents the model from learning the shared semantics of characters with common radicals. In order to solve this problem, we propose the glyph-based tokenizer JIEZI, which performs the following steps on raw Chinese input (e.g., 魑魅魍魉):

1. Convert each character into a stroke or radical sequence. To convert into stroke sequences, we use the Latin alphabet to represent the basic strokes and convert the characters based on the standard stroke orders, e.g., 魑 → pszhshpzznnhpnzsszshn; 魅 → pszhshpzznhhspn. To convert into radical sequences, we adopt three existing glyph-based Chinese input methods: Wubi, Zhengma, and Cangjie. These methods group strokes together in different ways to form radicals or stroke combinations, and then represent characters with them. We use the Latin alphabet to represent these radicals or stroke combinations, e.g., 魑魅魍魉 → Wubi: rqcc rqci rqcn rqcw; Zhengma: njlz njbk njld njoo; Cangjie: hiyub hijd hibtv himob.

2. Similar to the pronunciation-based tokenizers, we add the same separation symbol after each character, and also add the disambiguation indices for characters whose stroke sequences are identical (e.g., 人 (people) and 八 (eight)).

In the converted sequences, we can see how common radicals naturally appear as shared prefixes (e.g., rqc in Wubi, nj in Zhengma, and hi in Cangjie). Please refer to Appendix A.2 for a detailed explanation of the differences between the input methods involved.
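The preprocessing shared by SHUOWEN and JIEZI (encode each character, append the separator, and index colliding codes) can be sketched as follows. This is a minimal illustration, not the released implementation: the function names are hypothetical, the code tables are toy subsets built from the examples in the text, and it assumes that only colliding codes receive an index.

```python
from collections import defaultdict

SEP = "+"  # separation symbol appended after each character's code


def build_indexed_codes(char_to_code):
    """Append an index to every code that is shared by several characters."""
    groups = defaultdict(list)
    for char, code in char_to_code.items():
        groups[code].append(char)
    indexed = {}
    for code, chars in groups.items():
        for i, char in enumerate(sorted(chars), start=1):
            # Assumption: only colliding codes receive an index.
            indexed[char] = f"{code}{i}" if len(chars) > 1 else code
    return indexed


def encode(text, indexed):
    """Map each known character to its code followed by the separator."""
    return "".join(indexed[ch] + SEP if ch in indexed else ch for ch in text)


def decode(encoded, indexed):
    """Invert encode() for text made only of known characters."""
    inverse = {code: char for char, code in indexed.items()}
    return "".join(inverse[code] for code in encoded.split(SEP)[:-1])


# Toy Wubi table taken from the example in the text.
wubi = build_indexed_codes({"魑": "rqcc", "魅": "rqci", "魍": "rqcn", "魉": "rqcw"})
print(encode("魑魅魍魉", wubi))  # rqcc+rqci+rqcn+rqcw+

# Toy pinyin table: 真 and 针 are homophones, so they receive indices.
pinyin = build_indexed_codes({"真": "zhen¯", "针": "zhen¯", "魅": "mei`"})
print(encode("真针魅", pinyin))  # zhen¯1+zhen¯2+mei`+
```

Because every character maps to a distinct indexed code, `decode(encode(text, table), table)` recovers the original text, which is the one-to-one property the disambiguation indices are meant to guarantee.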

3.3 Word Segmented Tokenizer

Chinese Word Segmentation (CWS) is a common technique to split Chinese text into a sequence of Chinese words. The resultant segmented words sometimes provide better granularity for downstream tasks (e.g., Chang et al., 2008). However, the impact of CWS is unclear in the context of pretraining, especially how it interplays with statistical approaches like BPE and unigram LM. Hence, we study word segmented tokenizers that perform the following process on raw Chinese input, e.g., “这篇论文有意思。” (this paper is interesting):

1. We use a state-of-the-art segmenter, THULAC (Li and Sun, 2009), to segment the sentence into a sequence of words joined by spaces, e.g., ‘这篇\论文\有意思\。’ (We use \ to indicate a blank space for easier reading.)

2. We directly apply ULM on these space-joined sequences to construct the vocabulary.

| Dataset | Task | MaxLen | Batch | Epoch | #Train | #Dev | #Test | Domain |
|---|---|---|---|---|---|---|---|---|
| TNEWS | DC | 256 | 32 | 6 | 53.4K | 10K | 10K | News |
| IFLYTEK | DC | 256 | 32 | 6 | 12.1K | 2.6K | 2.6K | App Description |
| BQ | SPM | 256 | 32 | 6 | 100K | 10K | 10K | Bank Service |
| THUCNEWS | DC | 256 | 32 | 6 | 669K | 83.6K | 83.6K | News |
| CLUEWSC | WSC | 256 | 32 | 24 | 1.2K | 0.3K | 0.3K | Literature |
| AFQMC | SPM | 256 | 32 | 6 | 34.3K | 4.3K | 3.9K | Financial |
| CSL | SPM | 256 | 32 | 6 | 20K | 3K | 3K | Academic Papers |
| OCNLI | SPM | 256 | 32 | 6 | 45.4K | 5K | 3K | Mixed |
| CHID | MRC | 96 | 24 | 6 | 519K | 57.8K | 23K | Mixed |
| C3 | MRC | 512 | 24 | 6 | 12K | 3.8K | 3.9K | Mixed |

Table 1: Hyper-parameters and statistics of different datasets. DC: document classification. SPM: sentence pair matching (including natural language inference). WSC: Winograd Schema Challenge. MRC: machine reading comprehension.

In other words, CWS is used as a preprocessing step on the training corpora when we build the vocabulary using the above-mentioned methods. When we perform the actual pretraining and finetuning, we also perform CWS as preprocessing before tokenization using the word segmented tokenizer.
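The CWS preprocessing step of Section 3.3 can be sketched as below. The paper uses THULAC as the segmenter; to keep the example self-contained, this sketch swaps in a toy forward-maximum-matching segmenter over a hand-written lexicon, so the function names, the lexicon, and the matching strategy are all illustrative assumptions rather than the authors' pipeline.

```python
def max_match_segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching; a toy stand-in for a real segmenter."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if j - i == 1 or text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
    return words


def cws_preprocess(text, lexicon):
    """Join the segmented words with spaces before vocabulary construction."""
    return " ".join(max_match_segment(text, lexicon))


lexicon = {"这篇", "论文", "有意思"}  # hypothetical toy lexicon
print(cws_preprocess("这篇论文有意思。", lexicon))  # 这篇 论文 有意思 。
```

The space-joined output is what would then be passed to the vocabulary construction step, and the same preprocessing would be applied to pretraining and finetuning inputs.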

4 Experiment

4.1 Baselines

We use two strong baseline tokenizers in this work. The first one is the conventional single-character tokenizer as used in BERT-Chinese and many other follow-up Chinese PLMs (e.g., Cui et al., 2019, 2020). We name this tokenizer BERT-Chinese as it originated from the Chinese version of BERT.

For the second baseline, we directly apply SentencePiece with unigram LM on the raw Chinese corpus to generate the vocabulary. As a result, the vocabulary contains both single characters and words (i.e., character combinations). This approach resembles the vocabulary of some recent multi-granularity Chinese PLM variants such as AMBERT (Zhang et al., 2021) and Lattice-BERT (Lai et al., 2021). Unlike them, we do not add any new model designs or pretraining objectives, but instead use the original BERT architecture and masked LM objective. We name this baseline SP-ULM.

To ensure a fair comparison, we set the same vocabulary size of 22675 for all tokenizers. We use the same training corpus to train all the tokenizers. We use the SentencePiece library’s unigram LM implementation to train the tokenizers.
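To illustrate how a unigram LM vocabulary is applied at tokenization time, the following minimal sketch finds the highest-probability segmentation of a Wubi-encoded sequence with Viterbi search. The vocabulary and its log-probabilities are invented for illustration, and `viterbi_segment` is a hypothetical helper; it is not SentencePiece's implementation, which additionally learns the vocabulary from the corpus.

```python
import math


def viterbi_segment(seq, logp):
    """Return the most probable segmentation of seq under unigram log-probs."""
    n = len(seq)
    max_len = max(len(tok) for tok in logp)
    best = [-math.inf] * (n + 1)  # best[i]: best score for seq[:i]
    back = [0] * (n + 1)          # back[i]: start of the last token in seq[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            tok = seq[start:end]
            if tok in logp and best[start] + logp[tok] > best[end]:
                best[end] = best[start] + logp[tok]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("sequence cannot be covered by the vocabulary")
    tokens, end = [], n
    while end > 0:
        tokens.append(seq[back[end]:end])
        end = back[end]
    return tokens[::-1]


# Toy vocabulary with made-up log-probabilities (assumption for illustration).
toy_logp = {"rqc": -2.0, "c+": -3.0, "i+": -3.0, "rqcc+": -6.0,
            "r": -5.0, "q": -5.0, "c": -5.0, "i": -5.0, "+": -4.0}
print(viterbi_segment("rqcc+rqci+", toy_logp))  # ['rqc', 'c+', 'rqc', 'i+']
```

In this toy setting the shared radical prefix `rqc` ends up as its own token, which mirrors the idea that the learned vocabulary can mix whole characters with sub-character pieces.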

In order to evaluate the effectiveness of the tokenizers, we pretrain a BERT model using each tokenizer and compare their performance on downstream tasks. When pretraining the BERT models using each tokenizer, we use the same pretraining corpus and the same set of hyper-parameters for all models being compared. Notably, we also re-pretrain the BERT model using the BERT-Chinese tokenizer on our pretraining corpus instead of just loading from existing checkpoints to ensure that all baselines and proposed methods are directly comparable. Since our proposed tokenizers are direct drop-in replacements for the baseline tokenizers, they do not incur any extra parameters. As a result, all the models being compared have the same number of parameters, allowing for a truly apples-to-apples comparison.

4.2 Datasets

We evaluate the trained models with different tokenization methods on a total of ten downstream datasets, including single-sentence tasks, sentence-pair tasks, as well as reading comprehension tasks. We briefly introduce each dataset below and present the dataset statistics in Table 1.

TNEWS (Co., 2018) is a news title classification dataset containing 15 classes. We use the split as released in Xu et al. (2020).

IFLYTEK (Co., 2019) is a long text classification dataset containing 119 classes. The task is to classify mobile applications into corresponding categories given their description.

BQ (Chen et al., 2018) is a sentence-pair question matching dataset extracted from an online bank customer service log. The goal is to evaluate whether two questions are semantically identical or can be answered by the same answer.


| Size | Tokenizer | TNEWS | IFLY | THUC | BQ | WSC | AFQMC | CSL | OCNLI | CHID | C3 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6-layer | BERT-Chinese | 64.10 | 57.77 | 96.97 | 81.98 | 62.39 | 68.95 | 82.60 | 68.46 | 72.33 | 53.51 | 70.91 |
| 6-layer | SP-ULM | 64.26 | 55.44 | 97.09 | 81.52 | 62.06 | 69.88 | 83.16 | 68.98 | 72.77 | 51.73 | 70.69 (-0.22) |
| 6-layer | JIEZI-CangJie | 63.86 | 59.51 | 97.04 | 81.59 | 63.27 | 70.47 | 82.91 | 69.03 | 72.73 | 52.67 | 71.31 (+0.40) |
| 6-layer | JIEZI-Stroke | 63.81 | 58.74 | 96.87 | 81.55 | 62.94 | 69.66 | 82.44 | 68.02 | 72.21 | 53.35 | 70.96 (+0.05) |
| 6-layer | JIEZI-Zhengma | 63.96 | 58.74 | 96.99 | 82.27 | 61.95 | 69.86 | 83.46 | 68.56 | 72.12 | 54.91 | 71.28 (+0.37) |
| 6-layer | JIEZI-Wubi | 64.91 | 59.39 | 97.03 | 81.41 | 62.72 | 69.14 | 82.60 | 69.12 | 72.02 | 53.99 | 71.16 (+0.25) |
| 6-layer | SHUOWEN-Pinyin | 63.58 | 59.55 | 97.04 | 81.65 | 63.60 | 68.60 | 82.66 | 67.93 | 72.81 | 53.02 | 71.04 (+0.13) |
| 6-layer | SHUOWEN-Zhuyin | 64.11 | 59.16 | 97.01 | 81.64 | 63.93 | 68.53 | 82.86 | 69.39 | 71.48 | 54.59 | 71.27 (+0.36) |
| 12-layer | BERT-Chinese | 65.07 | 58.01 | 97.05 | 82.33 | 73.14 | 71.04 | 83.90 | 70.19 | 76.61 | 55.90 | 73.32 |
| 12-layer | SP-ULM | 65.01 | 58.98 | 97.20 | 82.99 | 73.36 | 70.93 | 83.45 | 70.46 | 77.28 | 57.70 | 73.74 (+0.42) |
| 12-layer | JIEZI-CangJie | 64.26 | 60.29 | 97.15 | 83.48 | 71.16 | 71.48 | 83.68 | 71.50 | 76.82 | 57.99 | 73.78 (+0.46) |
| 12-layer | JIEZI-Stroke | 65.11 | 59.75 | 97.09 | 82.88 | 70.72 | 71.64 | 83.63 | 70.03 | 77.45 | 59.68 | 73.53 (+0.21) |
| 12-layer | JIEZI-Zhengma | 64.51 | 60.78 | 97.14 | 83.15 | 72.15 | 70.76 | 83.68 | 71.22 | 76.72 | 57.49 | 73.76 (+0.44) |
| 12-layer | JIEZI-Wubi | 64.47 | 60.05 | 97.16 | 82.76 | 72.70 | 72.00 | 83.62 | 70.77 | 76.34 | 58.31 | 73.82 (+0.50) |
| 12-layer | SHUOWEN-Pinyin | 64.50 | 60.40 | 97.17 | 83.13 | 70.18 | 71.37 | 84.12 | 71.97 | 76.11 | 58.05 | 73.70 (+0.38) |
| 12-layer | SHUOWEN-Zhuyin | 64.50 | 59.98 | 97.09 | 82.99 | 73.03 | 71.83 | 83.82 | 71.74 | 76.74 | 57.23 | 73.90 (+0.58) |

Table 2: Results for standard evaluation. Best result on each dataset of each model size is boldfaced. The numbers in brackets in the last column indicate the average difference compared to the BERT-Chinese baseline.

| Tokenizer | TNEWS | IFLYTEK | CLUEWSC | AFQMC | CSL | OCNLI | C3 | AVG |
|---|---|---|---|---|---|---|---|---|
| SP-ULM | 64.26 | 55.44 | 62.06 | 69.88 | 83.16 | 68.98 | 51.73 | 65.07 |
| SP-ULM + CWS | 64.26 | 54.15 | 63.05 | 69.62 | 82.87 | 68.64 | 51.77 | 64.91 |
| JIEZI-Wubi | 64.91 | 59.39 | 62.72 | 69.14 | 82.60 | 69.12 | 53.99 | 65.98 |
| JIEZI-Wubi + CWS | 63.66 | 59.22 | 63.16 | 68.65 | 82.21 | 68.81 | 52.76 | 65.50 |
| SHUOWEN-Zhuyin | 64.11 | 59.16 | 63.93 | 68.53 | 82.86 | 69.39 | 54.59 | 66.08 |
| SHUOWEN-Zhuyin + CWS | 63.37 | 57.24 | 62.83 | 68.94 | 82.12 | 68.69 | 51.48 | 64.95 |

Table 3: Results of models trained with Word Segmented Tokenization. All models are 6-layer.

CLUEWSC (Xu et al., 2020) is a coreference resolution dataset in the format of the Winograd Schema Challenge. The task is to determine whether the given noun and pronoun in the sentence co-refer.

AFQMC is the Ant Financial Question Matching Corpus for the question matching task, which aims to predict whether two sentences are semantically similar.

CSL is the Chinese Scientific Literature dataset extracted from academic papers. Given an abstract and some keywords, the goal is to determine whether they belong to the same paper. It is formatted as a sentence-pair matching task.

OCNLI (Hu et al., 2020) is a natural language inference dataset. The task is to determine whether the relationship between the hypothesis and premise is entailment, neutral, or contradiction.

CHID (Zheng et al., 2019) is a cloze-style reading comprehension dataset. Given contexts where some idioms are masked, the task is to select the appropriate idiom from a list of candidates.

C3 (Sun et al., 2019) is a multiple-choice machine reading comprehension dataset. The goal is to choose the correct answer for some questions given a context.


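| Dataset | Tokenizer | clean | 15% | 30% | 45% | 60% |
|---|---|---|---|---|---|---|
| TNEWS | BERT-Chinese | 63.99 | 62.18 | 60.49 | 57.64 | 53.97 |
| TNEWS | SP-ULM | 63.99 | 62.88 | 60.79 | 59.20 | 55.42 |
| TNEWS | JIEZI-Wubi | 64.05 | 62.80 | 62.56 | 62.71 | 62.81 |
| OCNLI | BERT-Chinese | 68.07 | 62.60 | 56.83 | 51.37 | 46.00 |
| OCNLI | SP-ULM | 69.10 | 64.00 | 56.57 | 52.43 | 46.73 |
| OCNLI | JIEZI-Wubi | 68.43 | 67.10 | 65.47 | 65.40 | 64.80 |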

Table 4: Results for noisy evaluation with glyph noises.

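| Dataset | Tokenizer | clean | 15% | 30% | 45% | 60% |
|---|---|---|---|---|---|---|
| TNEWS | BERT-Chinese | 63.99 | 60.87 | 57.70 | 52.21 | 45.51 |
| TNEWS | SP-ULM | 63.99 | 61.52 | 58.60 | 53.30 | 46.81 |
| TNEWS | SHUOWEN-Pinyin | 63.35 | 60.60 | 57.45 | 53.61 | 49.29 |
| TNEWS | SHUOWEN-Zhuyin | 63.79 | 60.99 | 57.69 | 53.15 | 48.98 |
| OCNLI | BERT-Chinese | 68.07 | 61.73 | 54.50 | 49.97 | 44.80 |
| OCNLI | SP-ULM | 69.10 | 62.23 | 54.77 | 50.33 | 45.70 |
| OCNLI | SHUOWEN-Pinyin | 67.83 | 61.00 | 54.33 | 49.87 | 45.10 |
| OCNLI | SHUOWEN-Zhuyin | 69.63 | 60.77 | 54.77 | 50.67 | 47.70 |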

Table 5: Results for noisy evaluation with phonology noises.

seeds.

4.4 Experiment Setup: Noisy Evaluation

Apart from evaluating on the standard benchmarks, we also evaluate in a noisy setting to illustrate the advantage of our proposed tokenization methods in handling noisy inputs. Specifically, we inject two types of synthetic noise into both the training and test data in order to test whether the models can learn from noisy training data and also perform robustly on noisy test data. We vary the ratio of noise in the data to examine the impact. The two types of noise we inject are:

• Glyph-based noise: we replace original characters with other characters that have similar glyphs but different semantic meanings (e.g., 壁 (wall) and 璧 (jade)). Specifically, we obtain a substitution candidate list for each character, where the candidates are selected so that they share at least one common radical with the original character. Then, we randomly sample a certain ratio r% of the original characters and, for each of them, randomly sample a substitution character from its candidate list. This simulates the common noise that arises when people use glyph-based input methods, where these similar characters could be chosen since their input encodings are similar.

• Pronunciation-based noise: we replace original characters with other characters that have the same pronunciation but different semantic meanings (e.g., 真 (real) and 针 (needle)). Specifically, we obtain a substitution candidate list for each character, where all the candidates have the same pronunciation as the original character. Then, similarly, we randomly sample a certain ratio r% of the original characters and, for each of them, randomly sample a substitution character from its candidate list. This simulates the common noise that arises when users use pronunciation-based input methods, where the input encodings of these characters and their substitutions are the same.


For substitution pairs that qualify as both glyph-based noises and pronunciation-based noises (e.g., 快 and 块 share a radical and also have the same pronunciation), we keep them in both types of noises.
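The substitution procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: `inject_noise` and the candidate tables are hypothetical, and the ratio is taken over the characters that have at least one substitution candidate, a detail the text does not specify.

```python
import random


def inject_noise(text, candidates, ratio, seed=0):
    """Substitute a fraction `ratio` of the replaceable characters in text."""
    rng = random.Random(seed)
    chars = list(text)
    # Assumption: only characters with a non-empty candidate list are eligible.
    positions = [i for i, ch in enumerate(chars) if candidates.get(ch)]
    k = round(ratio * len(positions))
    for i in rng.sample(positions, k):
        chars[i] = rng.choice(candidates[chars[i]])
    return "".join(chars)


# Toy candidate lists built from the examples in the text.
glyph_candidates = {"壁": ["璧"], "快": ["块"]}
pron_candidates = {"真": ["针"], "快": ["块"]}

print(inject_noise("壁快", glyph_candidates, ratio=1.0))  # 璧块
print(inject_noise("真的很快", pron_candidates, ratio=0.5))
```

The same function serves both noise types; only the candidate table changes, and a pair such as 快 and 块 simply appears in both tables.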

Intuitively, our SHUOWEN tokenizer could be robust to pronunciation-based noises and our JIEZI tokenizer could be robust to glyph-based noises because the substitution characters share similar pronunciation or glyph components with the original characters, which may be captured by our tokenizers.

This noisy setup is reflective of real-life use cases where user queries often contain such noises. Since most Chinese people use either glyph-based input methods (e.g., wubi) or pronunciation-based input methods (e.g., pinyin, zhuyin), such mistyped characters can be very common. This highlights the potential impact of our work.

5 Results

5.1 Standard Evaluation

We compare the results of the baseline tokenizers (BERT-Chinese, SP-ULM) with our proposed JIEZI (including four variants: JIEZI-Cangjie, JIEZI-Stroke, JIEZI-Zhengma, JIEZI-Wubi) and SHUOWEN (including two variants: SHUOWEN-Pinyin and SHUOWEN-Zhuyin) tokenizers in Table 2.

Despite some variations across datasets, we observe that in terms of the average score over the ten datasets, all of our proposed tokenizers outperform the BERT-Chinese baseline. Notably, for the 6-layer model size, our JIEZI-Cangjie tokenizer improves over the BERT-Chinese tokenizer by an average of 0.40 points; for the 12-layer model size, our SHUOWEN-Zhuyin tokenizer achieves an average improvement of 0.58 points over the BERT-Chinese tokenizer.
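As a check on how the last column of Table 2 is read, the snippet below recomputes the 6-layer averages for BERT-Chinese and JIEZI-Cangjie from the per-dataset scores. It assumes that AVG is the unweighted mean over the ten datasets and that the bracketed number is the difference between the two rounded averages; the variable names are illustrative.

```python
# Per-dataset scores copied from the 6-layer rows of Table 2, in column order:
# TNEWS, IFLY, THUC, BQ, WSC, AFQMC, CSL, OCNLI, CHID, C3.
bert_chinese = [64.10, 57.77, 96.97, 81.98, 62.39, 68.95, 82.60, 68.46, 72.33, 53.51]
jiezi_cangjie = [63.86, 59.51, 97.04, 81.59, 63.27, 70.47, 82.91, 69.03, 72.73, 52.67]


def average(scores):
    """Unweighted mean over datasets (assumed definition of AVG)."""
    return sum(scores) / len(scores)


avg_bert = round(average(bert_chinese), 2)
avg_cangjie = round(average(jiezi_cangjie), 2)
delta = avg_cangjie - avg_bert

print(f"{avg_bert:.2f} {avg_cangjie:.2f} {delta:+.2f}")  # 70.91 71.31 +0.40
```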

These results indicate that on standard benchmarks, our proposed tokenizers can match or outperform the existing tokenizers for Chinese.

On the other hand, we examine the impact of CWS by comparing three tokenizers with their word segmented counterparts in Table 3. We can see that adding CWS as a preprocessing step actually slightly decreases the average performance on downstream tasks.

5.2 Noisy Evaluation

We perform noisy evaluation on two datasets: TNEWS and OCNLI. For glyph-based noises, we compare the baselines BERT-Chinese and SP-ULM with our JIEZI-Wubi. The results are presented in Table 4. We observe that the advantage of JIEZI grows particularly large as the noise ratio increases. For example, when 60% of the characters are substituted, JIEZI-Wubi still performs close to its original performance, while the baselines suffer large drops in performance. On OCNLI, the gap can be as large as 18 points in accuracy.

For pronunciation-based noises, we compare the baselines BERT-Chinese and SP-ULM with our SHUOWEN-Pinyin and SHUOWEN-Zhuyin. The results are shown in Table 5. Unlike the case of glyph-based noises, we observe that the advantage of our SHUOWEN tokenizers is not so significant compared to the baselines. One potential reason is that there are many Chinese characters with the same pronunciation. Unlike the semantic meanings of radicals, which can be rather consistent across different characters, phoneme combinations can have vastly different meanings across different characters (i.e., characters may be pronounced the same but have totally different semantic meanings), which makes it difficult to learn these different semantic meanings in the pronunciation-based token embeddings.

6 Conclusion


References

Emily Bender. 2019. The #BenderRule: On Naming the Languages We Study and Why It Matters. The Gradient.

Kaj Bostrom and Greg Durrett. 2020. Byte Pair Encoding is Suboptimal for Language Model Pretraining. In Findings of EMNLP.

Shaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Li. 2018. cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information. In AAAI.

Pi-Chuan Chang, Michel Galley, and Christopher D. Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In WMT@ACL.

Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, and Buzhou Tang. 2018. The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification. In EMNLP.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR.

IFLYTEK Co. 2019. IFLYTEK: A multiple categories Chinese text classifier.

TouTiao Co. 2018. TNEWS Dataset.

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting Pre-Trained Models for Chinese Natural Language Processing. In Findings of EMNLP.

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-Training with Whole Word Masking for Chinese BERT. arXiv, abs/1906.08101.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.

Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yonggang Wang. 2020. ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations. In Findings of EMNLP.

San Duanmu. 2007. The Phonology of Standard Chinese. Oxford University Press.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. In ICLR.

Archibald A Hill. 2016. The typology of writing systems. In Papers in linguistics in honor of Leon Dostert, pages 92–99. De Gruyter Mouton.

Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kübler, and Lawrence S. Moss. 2020. OCNLI: Original Chinese Natural Language Inference. In Findings of EMNLP.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In EMNLP.

Yuxuan Lai, Yijia Liu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2021. Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR.

Jingyang Li and Maosong Sun. 2007. Scalable Term Selection for Text Categorization. In CoNLL.

Zhongguo Li and Maosong Sun. 2009. Punctuation as Implicit Annotations for Chinese Word Segmentation. Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv, abs/1907.11692.

Yuxian Meng, Xiaoya Li, Xiaofei Sun, Qinghong Han, Arianna Yuan, and Jiwei Li. 2019. Is Word Segmentation Necessary for Deep Learning of Chinese Representations? In ACL.

Jerome L. Packard. 2000. The Morphology of Chinese: A Linguistic and Cognitive Approach. Cambridge University Press.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. ICASSP.

Rico Sennrich, B. Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. arXiv, abs/1508.07909.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension. TACL.

Mi Xue Tan, Yuhuang Hu, Nikola I. Nikolov, and Richard H. R. Hahnloser. 2018. wubi2en: Character-level Chinese-English Translation through ASCII Encoding. In WMT.

Wei Wu, Yuxian Meng, F. Wang, Qinghong Han, Muyu Li, Xiaoya Li, Jie Mei, Ping Nie, Xiaofei Sun, and Jiwei Li. 2019. Glyce: Glyph-vectors for Chinese Character Representations. In NeurIPS.

Dongling Xiao, Yukun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding. In NAACL.

Liang Xu, Hai Hu, Xuanwei Zhang, Chenjie Cao, Lu Li, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan. 2020. CLUE: A Chinese Language Understanding Evaluation Benchmark. In COLING.

Xinsong Zhang, Pengshuai Li, and Hang Li. 2021. AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization. In Findings of ACL.

Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, and Maosong Sun. 2020. CPM: A Large-scale Generative Chinese Pre-trained Language Model. arXiv, abs/2012.00413.

Chujie Zheng, Minlie Huang, and Aixin Sun. 2019. ChID: A Large-scale Chinese IDiom Dataset for Cloze Test. In ACL.


A Appendix

A.1 What is SHUOWEN-JIEZI?

SHUOWEN-JIEZI (‘说文解字’) is an ancient Chinese dictionary from the Han dynasty. It was the first dictionary to analyze the structure of Chinese characters and to give the rationale behind them, and also the first dictionary to use Chinese radicals to organise the sections.

The literal meanings of SHUOWEN and JIEZI correspond nicely to the core intuition behind our pronunciation- and glyph-based tokenizers. We chose this name for our methods to pay tribute to the ancient wisdom of our ancestors.

A.2 Input Methods

Current Chinese input methods for computers can be categorized into pronunciation-based and glyph-based. Both kinds of methods encode each Chinese character into a sequence of units from a smaller alphabet (e.g., the Latin alphabet), but differ in what the units represent. In pronunciation-based input methods, each unit usually represents a phoneme, while in glyph-based methods, one unit or a group of units generally represents a radical or stroke composition. Chinese characters have a standardized stroke order, which can be taken into account by glyph-based methods. In almost all commonly used input methods, there exist different characters that are encoded into the same sequence, in which case the solution is usually to list all matching characters and let the user select the correct one. We briefly introduce each input method used in the paper below.

Cangjie (‘仓颉’) and Wubi (‘五笔’) are two similar glyph-based input methods. They map keys on the QWERTY keyboard to fundamental radicals or combinations of strokes that are combined to represent the shape of entire characters. They sometimes disregard stroke order in favor of combinations that are visually more similar to the character. The main differences between them are the key mapping and the rules for breaking down characters into fundamental components.

Zhengma (‘郑码’), a glyph-based method, is similar to Cangjie and Wubi. It maps each Latin letter to fundamental radicals, which are combined into entire characters. But Zhengma differs from Cangjie and Wubi in that it strictly follows stroke order.

Stroke (‘笔画’), a glyph-based method more commonly used on mobile phones or numerical keypads, maps five keys to five basic types of strokes. The user presses the corresponding keys according to the characters’ stroke order.

Pinyin (‘拼音’) input method, a pronunciation-based input method, is the most widely used input method among Chinese speakers. It is based on the Hanyu Pinyin (‘汉语拼音’, meaning Chinese Sound-Spelling) romanization system for Chinese. Each Chinese monophthong is mapped to one or two Latin letters.