4 THE RELATIONSHIP BETWEEN SYNTACTIC COMPLEXITY AND
4.1 Method
TAASSC calculates four types of indices. The first type includes the indices of syntactic complexity that are included in Lu’s (2010, 2011) Syntactic Complexity Analyzer (SCA) and addresses research question 1a. The second type comprises fine-grained indices of clausal complexity and addresses research question 2a. The third type comprises fine-grained indices of phrasal complexity and addresses research question 3a. The fourth and final type comprises frequency-based indices of syntactic sophistication and addresses research question 4a. See Chapter 3 for an in-depth description of all indices included in TAASSC.
4.1.2 Writing proficiency corpus
The written proficiency corpus consists of argumentative essays written as part of the Test of English as a Foreign Language (TOEFL). The essays are responses to two independent prompts (240 texts each) that ask test-takers to compose an essay that asserts and defends an opinion on a particular topic based on life experience (see Table 4.1). Test-takers are given 30 minutes to complete the writing task and are expected to produce at least 300 words. See Table 4.2 for an overview of this corpus.
Table 4.1 Writing prompts for independent essays in TOEFL public use dataset
Test Form Prompt Instructions
1 Do you agree or disagree with the following statement? It is more important to choose to study subjects you are interested in than to choose subjects to prepare for a job or career. Use specific reasons and examples to support your answer.
2 Do you agree or disagree with the following statement? In today's world, the ability to cooperate well with others is far more important than it was in the past. Use specific reasons and examples to support your answer.
Table 4.2 Overview of writing proficiency corpus
Prompt   N     Number of Words   Mean Score   Standard Deviation
1        240   77,238            3.83         0.86
2        240   74,252            3.47         0.91
Each essay was given a score on a 5-point scale by at least two raters trained by ETS. If the scores given by the raters differed by 1 point or less, the scores were averaged. If any two scores differed by more than 1 point, a third rater adjudicated the score. Scores range from 1.0 to 5.0 in 0.5-point intervals. The holistic rating scale included descriptors related to the completion of the task, organization, development of ideas, coherence, word use, and syntax. See Table 4.3 for the score descriptors for low and high proficiency essays. See Appendix A for the complete rating scale.
Table 4.3 Abbreviated TOEFL rubric for independent writing tasks
Score Descriptors
5 An essay at this level largely accomplishes all of the following:
  effectively addresses the topic and task
  is well organized and well developed, using clearly appropriate explanations, exemplifications, and/or details
  displays unity, progression, and coherence
  displays consistent facility in the use of language, demonstrating syntactic variety, appropriate word choice, and idiomaticity, though it may have minor lexical or grammatical errors
2 An essay at this level may reveal one or more of the following weaknesses:
  limited development in response to the topic and task
  inadequate organization or connection of ideas
  inappropriate or insufficient exemplifications, explanations, or details to support or illustrate generalizations in response to the task
  a noticeably inappropriate choice of words or word forms
  an accumulation of errors in sentence structure and/or usage
4.1.3 Statistical analysis
To determine how writing quality differs across writing levels in TOEFL independent essays, a multiple linear regression analysis was conducted for each index type³. First, normality was checked using the visualization component of the WEKA statistical package (Hall et al., 2009). Any variables that were not normally distributed were discarded. In most cases, discarded variables represented syntactic features that occurred extremely rarely in the data (and therefore were not candidates for transformation), such as indirect objects and relative clauses. Pearson correlations were then computed for the remaining variables to determine whether they were meaningfully correlated with holistic essay score. Any variables that did not reach an absolute correlation value of |r| ≥ .100 with holistic essay score (which represents the threshold for a “small” effect [Cohen, 1988]) were removed from further consideration. Next, the remaining variables were checked for multicollinearity to ensure that the final model consisted only of unique indices and that multicollinear indices did not exaggerate the results of the multiple regression analysis (Tabachnick & Fidell, 2014). For each pair of variables with an absolute correlation value of |r| ≥ .700, only the variable with the highest correlation with holistic score was kept (Crossley, Salsbury, & McNamara, 2012).
³ This study examines whether linear relationships exist between linguistic features and language proficiency. That linguistic development may not be strictly linear and is likely affected by a number of factors (e.g., is a complex adaptive system [Larsen-Freeman & Cameron, 2008]) is acknowledged. Linear analyses are used in order to find simple explanations, which may serve as a starting point for future analyses of factors which mitigate variability in language learning.
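The correlation threshold and multicollinearity checks described above can be illustrated with a minimal Python sketch. It is not the procedure as run in WEKA; the function names (pearson_r, select_indices) and the toy data are hypothetical, and the greedy ordering (strongest correlate first) is one possible reading of the pairwise rule.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(ss_x * ss_y)
    return cov / denom if denom else 0.0  # a constant variable gets r = 0


def select_indices(features, scores, min_r=0.100, max_collinear=0.700):
    """Screen indices by |r| with score, then prune collinear pairs.

    features: dict mapping an index name to its list of values (hypothetical layout).
    Returns the retained index names, strongest correlate first.
    """
    # Step 1: keep indices with at least a small correlation with score.
    with_score = {name: abs(pearson_r(vals, scores)) for name, vals in features.items()}
    candidates = [name for name, r in with_score.items() if r >= min_r]

    # Step 2 (assumed greedy reading): walk from strongest to weakest and drop
    # any index that is collinear with an index that has already been kept.
    candidates.sort(key=lambda name: with_score[name], reverse=True)
    kept = []
    for name in candidates:
        collinear = any(
            abs(pearson_r(features[name], features[other])) >= max_collinear
            for other in kept
        )
        if not collinear:
            kept.append(name)
    return kept


# Toy data (invented for illustration only).
scores = [1, 2, 3, 4, 5, 6]
features = {
    "a": [1, 2, 3, 4, 5, 6],    # strongly related to score
    "b": [1, 2, 3, 4, 5, 7],    # collinear with "a", slightly weaker
    "c": [1, -1, 0, 0, -1, 1],  # uncorrelated with score
    "d": [2, 1, 3, 1, 3, 2],    # small correlation, not collinear with "a"
}
print(select_indices(features, scores))
```

In this toy example, "c" falls below the .100 threshold and "b" is dropped because it is collinear with the stronger index "a".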
The remaining variables were entered into a 10-fold cross-validation multiple regression using the WEKA statistical package. Ten-fold cross-validation is a method designed to avoid overfitting a statistical or machine-learning model (Witten & Frank, 2005). In a 10-fold cross-validation multiple regression, the dataset is randomly divided into ten sections (called “folds”). A stepwise multiple regression is conducted using nine of the ten folds to train a statistical model, which is then tested on the remaining fold. This procedure is repeated nine more times until all of the folds have served as the test set. Finally, the ten models are averaged. After the 10-fold multiple regression was conducted, a follow-up regression using the averaged model was conducted in SPSS on the entire dataset. The next step in the statistical analysis was to determine how generalizable the model was across topics by comparing the multiple regression models between prompts using a Fisher r to z transformation. This analysis tests whether the difference between two correlation values is due to chance (Dunn & Clark, 1969).
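As a rough illustration of the two procedures described above, the following Python sketch shows how a dataset can be divided into ten folds and how two correlations can be compared with a Fisher r to z transformation. The function names are hypothetical, the fold assignment is not WEKA's, and the z test shown assumes that the two correlations come from independent samples (here, essays written to different prompts) and reports a two-tailed p value.

```python
import math
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train, test) index lists so that each fold serves once as the test set."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def fisher_z_test(r1, n1, r2, n2):
    """Compare two independent correlations; returns (z, two-tailed p)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)     # Fisher r to z transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))      # two-tailed normal probability
    return z, p


# With 480 essays, each of the ten test folds holds 48 essays.
for train, test in k_fold_splits(480):
    assert len(train) == 432 and len(test) == 48
```

A call such as fisher_z_test(r_prompt1, 240, r_prompt2, 240) would then indicate whether model fit differs between the two prompts, where r_prompt1 and r_prompt2 stand in for the model correlations obtained on each prompt.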
The accuracy of the model was also evaluated by calculating the exact and adjacent matches between the actual holistic score and the score predicted by the model. This is a common way to evaluate the accuracy of automatic essay scoring algorithms (Shermis & Burstein, 2003). Exact matches include predicted scores that match the actual score, while predicted and actual scores are considered to be adjacent matches when they differ by no more than a prescribed number of points. For all analyses in this study, predicted scores are considered to be adjacent matches if they are within 1 point of the actual score. To facilitate this evaluation, all scores were rounded to the nearest whole number (Shermis & Burstein, 2003). Kappa statistics were also calculated on the rounded scores to estimate the strength of agreement between the actual scores and the predicted scores (Landis & Koch, 1977).
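The agreement measures described above can be sketched in Python as follows. Several details are assumptions rather than facts from this study: scores ending in .5 are rounded up, the adjacent proportion counts exact matches as well, and the kappa shown is unweighted Cohen's kappa. All function names and example scores are hypothetical.

```python
import math
from collections import Counter


def round_half_up(score):
    """Round to the nearest whole number, with .5 rounded up (assumed rule)."""
    return math.floor(score + 0.5)


def agreement(actual, predicted, tolerance=1):
    """Return (exact, adjacent) proportions computed on rounded scores.

    The adjacent proportion includes exact matches (assumption).
    """
    a = [round_half_up(s) for s in actual]
    p = [round_half_up(s) for s in predicted]
    n = len(a)
    exact = sum(x == y for x, y in zip(a, p)) / n
    adjacent = sum(abs(x - y) <= tolerance for x, y in zip(a, p)) / n
    return exact, adjacent


def cohen_kappa(actual, predicted):
    """Unweighted Cohen's kappa on rounded scores."""
    a = [round_half_up(s) for s in actual]
    p = [round_half_up(s) for s in predicted]
    n = len(a)
    observed = sum(x == y for x, y in zip(a, p)) / n
    count_a, count_p = Counter(a), Counter(p)
    # Chance agreement: product of the marginal proportions for each score level.
    expected = sum(count_a[c] * count_p[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sets of scores use a single, identical level
    return (observed - expected) / (1.0 - expected)


# Invented scores for illustration only.
actual_scores = [3.5, 4.0, 2.5, 5.0]
predicted_scores = [3.4, 3.2, 4.6, 5.0]
exact, adjacent = agreement(actual_scores, predicted_scores)
kappa = cohen_kappa(actual_scores, predicted_scores)
```

Python's built-in round() rounds halves to the nearest even number, so a separate helper is used here to make the rounding rule explicit.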