Part-of-speech Sequences and Distribution in a Learner Corpus of English



Proceedings of Research on Computational Linguistics Conference XIII (ROCLING XIII), pages 171-177


Rebecca H. Shih*, John Y. Chiang+ and F. Tien+

*Department of Foreign Languages and Literature
+Department of Computer Science and Engineering

National Sun Yat-sen University, Kaohsiung, Taiwan, R.O.C.
E-mail: [email protected]

Abstract

Computer learner corpora have been widely used by SLA/EFL specialists since the mid-1990s to gain better insight into authentic learner language. The work presented in this paper examines the interlanguage of Taiwanese learners of English from a part-of-speech sequence perspective. Two pre-tagged corpora, one learner corpus and one native speaker corpus, are involved in this work. The experimental results indicate that more than one third of the eligible POS trigrams are never used by the Taiwanese learners in their writing, and that the learners show a stronger preference than native speakers for pronouns, especially right after punctuation, verbs and conjunctions.

1. Introduction


[Figure: number of overlapping trigrams plotted against rank of trigrams (100 to 1600); legend: BNC, TLCE, overlap]


4. Discussions and future work

The results of the preliminary experiments above show that more than one third of the BNC trigrams are never used by the learners in their writing, whereas 4.5% of the TLCE trigrams do not appear in the BNC. It is tempting to believe that this small proportion of TLCE trigrams stems from the learners' writing errors. However, increasing the size of the native speaker corpus and observing any changes in the distribution of the trigrams would clarify this finding. It is also worth looking into the BNC trigrams that the learners do not know or are not aware of, and then isolating those with high frequency for pedagogical purposes.
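The two coverage figures above are comparisons between sets of distinct POS trigram types. A minimal sketch of such a comparison follows. It assumes each corpus is available as a list of sentences already reduced to POS tags, and that trigrams are collected within sentence boundaries; whether the study counted them this way is not stated here. The toy tag sequences and the function names are illustrative only, not the authors' data or procedure.

```python
from typing import Iterable, List, Set, Tuple

Trigram = Tuple[str, str, str]


def pos_trigrams(sentences: Iterable[List[str]]) -> Set[Trigram]:
    """Collect the distinct POS trigram types found within each sentence."""
    types: Set[Trigram] = set()
    for tags in sentences:
        for i in range(len(tags) - 2):
            types.add((tags[i], tags[i + 1], tags[i + 2]))
    return types


def unshared_fraction(source: Set[Trigram], other: Set[Trigram]) -> float:
    """Fraction of trigram types in `source` that never occur in `other`."""
    if not source:
        return 0.0
    return len(source - other) / len(source)


# Toy, made-up tag sequences standing in for the native and learner corpora.
native = [
    ["DET", "ADJ", "NOUN", "VERB", "PREP", "DET", "NOUN"],
    ["PRON", "VERB", "DET", "NOUN", "PUNC"],
]
learner = [
    ["PRON", "VERB", "DET", "NOUN", "PUNC"],
    ["PRON", "VERB", "PRON", "PUNC"],
]

native_types = pos_trigrams(native)
learner_types = pos_trigrams(learner)

# Share of native trigram types the learners never produce, and the reverse.
unused_by_learners = unshared_fraction(native_types, learner_types)
learner_only = unshared_fraction(learner_types, native_types)

print(f"native trigram types unused by learners: {unused_by_learners:.1%}")
print(f"learner trigram types absent from native corpus: {learner_only:.1%}")
```

The comparison is over types rather than tokens, so a trigram that occurs once counts the same as one that occurs thousands of times. Isolating high-frequency unused trigrams, as suggested above, would additionally require frequency counts.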

The experimental results also suggest that the learners use pronouns excessively in their writing. Compared with native speakers, they show a stronger preference for pronouns right after punctuation, verbs and conjunctions, but a weaker preference after prepositions and nouns. Pronouns often appear in the informal register. As the corpus is composed of college students' compositions as well as their weekly journals, the informality of the journals may partly account for the excessive use of pronouns. It is therefore desirable, in the next stage of the work, to divide the learner corpus by register and compare the POS distribution of each part with that of the native speaker corpus.
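One way to quantify the positional preference described above is to compute, for each preceding POS category, the share of immediately following tokens that are pronouns. The sketch below assumes tagged text given as (word, tag) pairs with a coarse, hypothetical tagset (PRON, VERB, PUNC and so on). It does not reproduce the tagset or the figures used in the study.

```python
from collections import Counter
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]


def pronoun_rate_after(tokens: Tagged, pronoun_tag: str = "PRON") -> Dict[str, float]:
    """For each tag, the proportion of immediately following tokens that are pronouns."""
    followed = Counter()          # how often each tag is followed by any token
    followed_by_pron = Counter()  # how often each tag is followed by a pronoun
    for (_, prev_tag), (_, next_tag) in zip(tokens, tokens[1:]):
        followed[prev_tag] += 1
        if next_tag == pronoun_tag:
            followed_by_pron[prev_tag] += 1
    return {tag: followed_by_pron[tag] / n for tag, n in followed.items()}


# Made-up sample sentence with a hypothetical coarse tagset.
sample = [
    ("I", "PRON"), ("think", "VERB"), ("it", "PRON"), ("rains", "VERB"),
    (",", "PUNC"), ("and", "CONJ"), ("we", "PRON"), ("stay", "VERB"),
    ("at", "PREP"), ("home", "NOUN"), (".", "PUNC"),
]

rates = pronoun_rate_after(sample)
for tag, rate in sorted(rates.items()):
    print(f"{tag}: {rate:.2f}")
```

Applying the same function to the learner corpus and to the native speaker corpus, and comparing the rates tag by tag, would show after which categories the two groups diverge. Running it separately on compositions and journals would support the register comparison proposed above.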

Acknowledgements

Figure

Table 2: the number of POS trigrams in the corpora
Figure 2: POS distribution
