Part-of-speech Sequences and Distribution in a Learner Corpus of English

(1)

Part-of-speech Sequences and Distribution in a

Learner Corpus of English

Rebecca H. Shih

*

, John Y. Chiang

+

and F. Tien

+

*

_{Department of Foreign Languages and Literature}

+

_{Department of Computer Science and Engineering,}

National Sun Yat-sen University

Page 171 ~ 177

Proceedings of

Research on Computational Linguistics

Conference XIII (ROCLING XIII)

(2)

Part-of-speech Sequences and Distribution

in a Learner Corpus of English

Rebecca H. Shih*, John Y. Chiang+ and F. Tien+

*_{Department of Foreign Languages and Literature} +_{Department of Computer Science and Engineering}

National Sun Yat-sen University, Kaohsiung,Taiwan, R.O.C. E-mail: [email protected]

Abstract

Computer learner corpora have been widely used by SLA/EFL specialists since mid 1990s to gain better insights into authentic learner language. The work presented in this paper examines the inter-language of Taiwanese learners of English from a part-of-speech sequence perspective. Two pre-tagged corpora (one learner corpus and one native corpus) are involved in this work. The experimental results indicate that there are more than one third of eligible POS trigrams that are never practiced by the Taiwanese learners in their writing and the learners have stronger preference than native speakers in using pronouns, especially right after punctuations, verbs and conjunctions.

1. Introduction

(3)

(4)

(5)

BNC TLCE overlap

100 200 300 400 500 600 700 800 900 ₁₀₀₀ ₁₁₀₀ ₁₂₀₀ ₁₃₀₀ ₁₄₀₀ ₁₅₀₀ ₁₆₀₀

Rank of trigrams

number of overlapping trigrams

(6)

(7)

4. Discussions and future work

The results of the preliminary experiments above show that there are more than one third of BNC trigrams that the learners never practice in their writing, whereas there are 4.5% of TLCE trigrams which do not appear in the BNC’s. It is intended to believe that this small proportion of TLCE trigrams is contributed from the learner’s writing errors. However, increasing the size of the native speaker corpus to observe any changes in the distribution of the trigrams will clarify the findings. It is also worth looking into those BNC trigrams that the learners do not know or are not aware of, and then isolating those with high frequency for the pedagogical purpose.

The experimental results also suggest that the learners use pronouns excessively in their writing and that they have stronger preference than native speakers in using pronouns right after punctuations, verbs and conjunctions but less preference after prepositions and nouns. Pronouns often appear in the informal register, and as the corpus is composed of college students’ compositions as well as their weekly journals, the informality of the journals may contribute partly to their excessive use of pronouns. So, it is desirable in the next stage of the work to divide the learner corpus in terms of its different registers and compare their POS distributions with the native speaker corpus.

Acknowledgements

(8)