CHAPTER 2 BACKGROUND FOR THE STUDY
2.3 Natural Language Processing
As a field of scientific inquiry and engineering, Natural Language Processing (NLP) emerged in the 1940s. It gained attention because of the increasing importance of fast, accurate translation after World War II and the emergence of technologies that could automate such translation with computers (Jones, 1994). At first, the driving assumption underlying NLP was that “the grammar of English can be defined and used to parse any portion of the language” and “contextual knowledge can be stored and used for automated elucidation of meaning” (Martinez, 2010, p. 352). However, after researchers devised complex sets of formal rules that were supposed to capture human linguistic competence, it soon became clear that the creativity and ambiguity of natural language could not be completely covered by such formal rule models.
Starting in the late 1980s, the statistical approach to NLP (StatNLP) has aimed to detect common patterns in language by using stochastic, probabilistic, and statistical methods, because language does not always follow rules that can be easily formalized. The use of machine learning in StatNLP algorithms has increased the feasibility of automated language processing. Instead of relying on hand-coded rules, this approach allows formal linguistic generalizations to be derived automatically from a given tagged or untagged corpus. Several NLP tasks have been approached with this method, such as parsing, part-of-speech tagging, machine translation, automatic summarization, and text classification (Manning & Schütze, 1999).
One of the classical problems frequently addressed with StatNLP is the text classification task. In this task, a computational system is built to assign one or more categories to input texts. Common examples include determining the gender of a text's author, identifying the attitudes (or sentiments) expressed in movie reviews, and filtering spam emails. The text classification task can be performed with three approaches: supervised, semi-supervised, and unsupervised. All three approaches rely on a corpus (called the ‘training corpus’); they differ in whether the training corpus is fully tagged, partially tagged, or not tagged (Baharudin, Lee, & Khan, 2010). ‘Tagging’ here refers to annotation relevant for the purposes of the classification task and is usually completed manually on the training corpus. For example, a training corpus for the spam classification task will contain emails judged as ‘spam’ or ‘non-spam’ by human experts.
A popular and powerful approach to supervised machine learning that is frequently used in text classification is the Support Vector Machine, or SVM (Cortes & Vapnik, 1995; Cristianini & Shawe-Taylor, 2000; Joachims, 1998; Vapnik, 1998). The SVM “is an abstract learning machine which will learn from a training data set and attempt to generalize and make correct predictions to novel data” (Campbell & Ying, 2011, p. 1). Based on the training data set, the SVM is trained to make binary decisions (whether or not a particular item belongs to the category of interest) for new data. The SVM works by mapping data points from the training data set into a multi-dimensional feature space and identifying a boundary (called ‘the hyperplane’) that separates the category of interest from the remaining items in the data set with the widest possible gap, or margin. Among all the mapped data points, the support vectors are the data points nearest to this hyperplane. The SVM discriminates new data and assigns categories to them based on the position of the new data points in relation to this hyperplane (Kecman, 2005). As with most machine learning approaches, when SVMs are used in practice, each object (data point) is represented by a vector of quantifiable ‘features’, or relevant parameters that meaningfully describe the object.
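The geometry described above can be illustrated with a minimal sketch. It does not train an SVM. Instead, it assumes a toy two-dimensional data set and a hand-specified hyperplane (for these particular points, the vertical line x1 = 2 happens to be the maximum-margin separator) and shows how new points are classified by their position relative to the hyperplane and how the support vectors are the training points nearest to it. The data, the names `TRAIN`, `W`, and `B`, and the helper functions are all hypothetical and are not taken from any of the cited studies.

```python
import math

# Toy 2-D training set (assumed for illustration): (feature vector, label).
# Label +1 marks the category of interest, -1 marks everything else.
TRAIN = [
    ((3.0, 1.0), +1), ((3.0, -1.0), +1), ((4.0, 0.0), +1),
    ((1.0, 1.0), -1), ((1.0, -1.0), -1), ((0.0, 0.0), -1),
]

# Assumed hyperplane w.x + b = 0. For this toy set the widest-margin
# separator is the line x1 = 2, i.e. w = (1, 0) and b = -2.
W = (1.0, 0.0)
B = -2.0


def signed_distance(x, w=W, b=B):
    """Signed Euclidean distance from point x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def classify(x):
    """Binary decision: +1 if x lies on the positive side, otherwise -1."""
    return 1 if signed_distance(x) >= 0 else -1


def support_vectors(train, tol=1e-9):
    """Training points whose distance to the hyperplane equals the minimum."""
    margin = min(abs(signed_distance(x)) for x, _ in train)
    return [x for x, _ in train
            if abs(abs(signed_distance(x)) - margin) <= tol]


if __name__ == "__main__":
    print("support vectors:", support_vectors(TRAIN))
    print("new point (2.5, 0.3) ->", classify((2.5, 0.3)))
    print("new point (0.5, 2.0) ->", classify((0.5, 2.0)))
```

In an actual SVM the weights and the offset are learned from the training data by an optimization procedure, and only the support vectors end up determining where the hyperplane lies.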
Joachims (1998) presented four theoretical arguments for why SVMs should work better than other approaches to text classification tasks. First, text data have a “high dimensional input space,” and SVMs can deal with the large number of features that is typical in NLP. Second, text has “few irrelevant features,” so a good classifier should combine many features rather than discard them, which SVMs are able to do. Third, SVMs cope well with sparse data, that is, with document vectors in which only a few entries are non-zero. Finally, “most text categorization problems are linearly separable,” which matches the kind of decision boundary that SVMs are designed to find (pp. 139-140).
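To make the notion of a high-dimensional feature space concrete, the following minimal sketch builds a bag-of-words representation in which every distinct word is one feature. The three example sentences, the whitespace tokenization, and the names `DOCS`, `build_vocabulary`, and `to_sparse_vector` are assumptions made for illustration only. Even in this tiny example, each document uses only a fraction of the available features; in a realistic corpus the vocabulary runs to many thousands of words while each text still contains only a handful of them.

```python
from collections import Counter

# Hypothetical example sentences; real corpora are far larger.
DOCS = [
    "this paper proposes a new method",
    "previous studies have ignored this problem",
    "the results support the proposed method",
]


def build_vocabulary(docs):
    """Map each distinct token to a column index (one feature per word)."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(vocab)}


def to_sparse_vector(doc, vocab):
    """Represent a document as {column index: count}, non-zeros only."""
    counts = Counter(tok for tok in doc.split() if tok in vocab)
    return {vocab[tok]: n for tok, n in counts.items()}


VOCAB = build_vocabulary(DOCS)
VECTORS = [to_sparse_vector(d, VOCAB) for d in DOCS]

if __name__ == "__main__":
    print("number of features:", len(VOCAB))
    for doc, vec in zip(DOCS, VECTORS):
        print(f"{len(vec)} non-zero features: {doc!r}")
```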
In applied linguistics, StatNLP-based text classification techniques have been applied to genre analysis. For instance, Pendar and Cotos (2008) used an SVM classifier to categorize sentences in the Introduction sections of RAs into moves from the CARS schema (Swales, 1990). Their training corpus contained 401 Introduction sections from 20 disciplines, totaling 267,029 words and yielding 11,149 sentences as data points. They used unigrams, bigrams, and trigrams as the input features for their SVM classifier. Their results suggested that the unigram models led to the best recall and the trigram models to the best precision. Bigrams might increase recall, but at the same time decrease precision because there were more frequent bigrams than unigrams or trigrams. The reported accuracy of classification was between 60% and 80%.
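The n-gram features and the evaluation measures mentioned above can be sketched as follows. This is an illustration under stated assumptions, not a reconstruction of the procedure used by Pendar and Cotos (2008): the example sentence, the whitespace tokenization, the move labels "M1" to "M3", and the gold and predicted label sequences are all invented for the example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences (n=1 unigrams, n=2 bigrams...)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def precision_recall(gold, predicted, label):
    """Precision and recall for one category, given parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # correct hits
    fp = sum(1 for g, p in pairs if g != label and p == label)  # false alarms
    fn = sum(1 for g, p in pairs if g == label and p != label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical sentence, tokenized on whitespace.
    tokens = "this study examines the role".split()
    for n in (1, 2, 3):
        print(n, ngrams(tokens, n))

    # Hypothetical move labels for five sentences (human vs. classifier).
    gold = ["M1", "M1", "M2", "M3", "M2"]
    predicted = ["M1", "M2", "M2", "M3", "M1"]
    for move in ("M1", "M2", "M3"):
        print(move, precision_recall(gold, predicted, move))
```

Precision is the share of sentences assigned to a move that truly belong to it, whereas recall is the share of sentences truly belonging to a move that the classifier found, which is why the two measures can favor different feature sets.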
Going one step further, Cotos’ (2010) group applied the SVM classifier to non-native graduate students’ writing in Introduction sections. They then extended this technique to the Methods, Results, and Discussion/Conclusion sections of RAs in the Research Writing Tutor (RWT) project, a computer-assisted language learning (CALL) tool that automatically identifies rhetorical moves in student writing (Cotos, 2014a).
The scholarship reviewed in this section demonstrates that, firstly, there are solid theoretical justifications for the appropriateness of the SVM as a supervised StatNLP approach to the text classification task; secondly, previous studies successfully employed the text classification task within genre-analysis research, specifically dealing with the genre of RAs; finally, well-performing classifiers that can automatically classify the rhetorical moves in learner writing based on manually annotated training corpora can be used as the basis for the development of CALL tools. Therefore, one of the goals of this dissertation was to train an SVM classifier to categorize sentences in RA abstracts following the initial corpus collection and genre analysis research.
A potential drawback of using StatNLP classifiers trained on corpora of professional writing to predict (i.e., to automatically classify) the rhetorical moves in learners’ writing is that the classifier looks for certain linguistic features that might not be present in learners’ writing because the learner is not yet proficient in this particular genre. Therefore, an important direction for research informing the design of an effective pedagogical intervention for the purposes of this study is a more in-depth analysis of the linguistic features that are associated with rhetorical moves in the target genre. Section 2.4 discusses the area of inquiry that focuses on Formulaic Language analysis and presents an argument for its importance in the scope of this study.