Cross Validation in Training Set - Improving Essay Scoring with Argumentation Features

10.3 Improving Essay Scoring with Argumentation Features

10.3.1 Cross Validation in Training Set

Our current study continues to use 38 argumentation features and compare two argument mining systems ArgS and ArgN as described in Chapter 9. 38 argumentation features are grouped in 5 sets: argument component (AC), argumentative label of components (CL), sequence of argument components (i.e., argument flow, AF), argumentative relation (RL), and argument typology (TS). Essays in the data set are first segmented into argument component by using adACI model (Chapter 8). The two argument mining systems are then employed to label argument components and identify support relations. Finally, argumentation features are extracted from argument mining output. Number of predicted argument components and support relations are shown in Table40. T-test results show that statistical values of ArgS and ArgN are all significantly different with p < 0.0001.

Given two sets of baseline features and 5 sets of argumentation features, we evaluate different combinations. AES models are trained using Logistic Regression algorithm in Weka (Hall et al., 2009), and evaluated in 10-fold cross validation. Essay scoring performance are

Feature set κ qwk

Length 0.440† 0.567†

Content 0.453† 0.582†

Length + Content (Base) 0.475 0.599

ArgS ArgN ArgS ArgN

AC 0.341_† 0.469_† Base + AC 0.474 0.599 CL 0.119_† 0.118_† 0.215_† 0.211_† Base + CL 0.475 0.482* 0.599 0.605* AF 0.058_† 0.042_† 0.141_† 0.091_† Base + AF 0.477 0.476 0.602 0.601 RL 0.029_† 0.054_† 0.058_† 0.092_† Base + RL 0.466_† 0.478 0.592_† 0.602 TS 0.015_† 0.000_† 0.029_† 0.000_† Base + TS 0.475 0.470 0.600 0.595 ARG 0.346_† 0.364_† 0.481_† 0.494_† ARG + Base (All) 0.480 0.486* 0.604 0.611*

All – AC 0.475 0.487* 0.599 0.610*

All – CL 0.477 0.484* 0.602 0.608*

All – AF 0.480 0.480 0.603 0.604

All – RL 0.481 0.481 0.605 0.606*

All – TS 0.485 0.487* 0.608 0.611*

Table 41: 10-fold cross validation performance of essay score prediction of base and argumentation features. ARG denotes all argumentation features.

shown in Table41. Best values are highlighted in bold. Symbols * and_{† indicate significantly} higher and lower than Base values (p < 0.05), respectively.

As shown in the middle part of the table, 4 of 5 argumentation feature sets derived from ArgS do not gain improvement for the base model when adding them individually to the base model. Only AF set is effective in that combining them with base features yielded higher κ and qwk.

With regard to ArgN, 3 of 5 argumentation feature sets (i.e., CL, AF and RL) helped improve the base model, and the improvement by CL features are significant. Significant improvements are obtained when combining argumentation features. The best combination includes base features with all argumentation features except argument structure typology (TS) features. This provides an evidence that argumentation features indeed help improve essay score prediction, and the performance increase is more significant when the argument mining output is more accurate.

Considering each set of argumentation features individually, we can see that they almost cannot predict essay score when used alone except AC features which yielded κ = 0.341 and qwk = 0.469. Interestingly, although AC features returned the highest performance among argumentation features, those do not help improve base performance at all. In fact, count features in AC correlate moderately to strongly with the corresponding Length features. For example, Pearson’s correlation tests for number of argumentative sentences vs. number of sentences, number of words in argument components vs. number of words returned r > 0.8, p = 0. Thus, we hypothesize that AC features do not provide more predictive information than those captured in Length features.

The least effective argumentation features are TS features in that ArgN’s TS features had κ and qwk almost zeros. Adding TS features extracted by either ArgS or ArgN to the baseline both degraded performance of the base model. As shown in the bottom part of the table, we achieved the best κ and qwk by removing ArgN’s TS features from all features.

Despite the low performance by each argumentation feature set, the performance increase by combining those argumentation features with the base AES model confirms our hypothesis of the improvement benefit of argumentation features. The improvements are significant when the base model is enhanced with ArgN’s features. This makes us believe that base features such as length statistics and writing mechanics grade essays at a coarse grain and argumentation features help further refine the prediction.

Feature set κ qwk

ArgS ArgN ArgS ArgN

Length + Count (Base) 0.463 0.591

ARG 0.346_† 0.361_† 0.481_† 0.492_† ARG + Base (All) 0.470 0.475* 0.597 0.600* Base + AC 0.471* 0.471* 0.596 0.596 Base + CL 0.467 0.473* 0.594 0.600* Base + AF 0.465 0.469* 0.594 0.597* Base + RL 0.464 0.466 0.592 0.593 Base + TS 0.465 0.465 0.593 0.592 All – AC 0.468 0.474* 0.595 0.600* All – CL 0.466 0.474* 0.595 0.600* All – AF 0.470 0.474* 0.597 0.601* All – RL 0.470 0.469 0.597 0.595 All – TS 0.472* 0.476* 0.599* 0.601*

Table 42: Cross-prompt performance of essay score prediction of base and argumentation features.

The above finding is also confirmed in Table 42 where we conduct cross-prompt validation. In each run, we hold essays of a prompt as a test data and use essays of the 7 remaining prompts for training the models. Similarly to 10-fold cross validation results, adding argumentation features improve cross-prompt AES performance, and the improvements are significant (p < 0.05) for features extracted by ArgN. For ArgS’s features, a significant improvement is obtained when removing TS features from the complete set. Not using TS features also helps obtain the best κ = 0.476 and qwk = 0.601 for ArgN. Overall, AES results by adding ArgN features are better than adding ArgS features most of the times.

Cross-prompt experiment is considered a more difficult evaluation because test and training essays are of different writing topics. The AES improvements by adding argumentation features are revealed more clearly in the cross-prompt setting which demonstrates the topic- independent advantage of argumentation features. In the next chapter, we further study this aspect of argumentation features in cross-domain AES.

Our experiments did not do an exhaustive feature selection, but aimed for evaluating argumentation features by groups to get an insight of how possible outputs of argument mining help improve AES. When comparing results in 10-fold cross validation and cross- prompt validation, a finding is that the best combination of argumentation features is _{AC, CL, RL, AF_{} and it is true for both ArgS and ArgN. Argument typology features (TS)} perform the worst when used alone, and give the lowest (or second lowest) performance when adding to the base model. Although adding TS features still gains improvement for the base AES model (by a small amount), we hypothesize that the value of typology features is restricted by the low performance of argumentative relation mining. In future work, we plan to improve argumentative relation mining with joint prediction and study if relation-based features (i.e., RL and TS) can be more effective.

In document Context-aware Argument Mining and Its Applications in Education (Page 141-145)