IDENTIFICATION
Stab and Gurevych(2017) encodes argument components using BIO tagset that every token in the essay has either B, I or O label depending on whether it is at the beginning, inside, or outside the argument component. Figure16shows an example argumentative sentence with BIO labels assigned to its tokens. The authors used Conditional Random Field algorithm (Lafferty et al., 2001) to learn a sequence labeling model. We adapt their model to use in our argument mining system. For each tokens, we extract following features as proposed in the prior study (Stab and Gurevych, 2017):
• Structural features
– Token position: token present in first or last paragraph; token is first or last token in sentence; relative and absolute token position in document, paragraph and sentence. – Punctuation: token precedes or follows any punctuation, full stop, comma and semi-
Class Training set Test set Argument Components
Major Claim 598 (12%) 153 (12%)
Claim 1202 (25%) 304 (24%)
Premise 3023 (63%) 809 (64%)
In-paragraph Argument Component Pairs
Linked 3023 (18%) 809 (16%)
Not-linked 14227 (82%) 4113 (84%) In- and Cross-paragraph Argumentative Relations
Support 3820 (90%) 1021 (92%)
Attack 405 (10%) 92 (8%)
Table 22: Class distributions in training and test sets of corpus Persuasive2.
– Position of covering sentence: absolute and relative position of the token’s covering sentence in the document and paragraph.
• Syntactic features
– Part-of-speech: the token’s part-of-speech.
– Lowest common ancestor (LCA): length of the path to the LCA with the following and preceding token in the parse tree normalized by the depth of the tree.
– LCA types: constituent types of two LCA of the current token with its preceding and following tokens.
• Lexico-syntactic features: for each token t, extract its uppermost node n in the parse tree with the lexical head t. A lexico-syntactic feature is defined as the combination of t and the constituent type of n. We also consider the child node of n in the path to t and its right sibling, and extract their lexico-syntactic features (Soricut and Marcu, 2003). • Probability feature: we compute the maximal conditional probability of the current token
It ’s true that technology and computers do make their jobs easier but it cannot definitely replace them .
B I I I I I I I I I I I O B I I I I O
Figure 16: Tokens with BIO tagset.
ti being the beginning of an argument component given its preceding tokens:
maxn∈{1,2,3}P (tagset(ti) = B|ti−n, ..., ti−1)
The probability is estimated as a division of the number of times the preceding tokens precede a token ti with tag B by the total number of occurrences of the preceding tokens
in the training data.
We add six features derived from argument and domain word lexicon:
• AD features: two boolean features indicate whether the current token is an argument word or a domain word; four boolean features indicate whether the preceding or following tokens are argument or domain words.
Performances of argument component identification (ACI) on the test set are reported in Table 23. The first row shows reported results inStab and Gurevych (2017). The two next rows present results of our implementation of Stab and Gurevych(2017), and our improved version with AD features (referred to as adACI). Our implementation of Stab & Gurevych’s model obtained close results to what is reported in their paper. Our improved version with AD features yielded the best performance.
To further evaluate the impact of AD features in the task, we conduct 10-fold cross val- idation on training set and report results in Table 24. T-tests show that all performance improvement by AD features are significant with p < 0.01. Given this result, we integrate improved ACI model into our argument mining system to solve argument component identi- fication.
Model F1 Prec. Recl. F1:B F1:I F1:O Human upper bound 0.886 0.887 0.885 0.821 0.941 0.892 ACI byStab and Gurevych 0.867 0.873 0.861 0.809 0.934 0.857 ACI (our implementation) 0.865 0.868 0.862 0.799 0.938 0.859
adACI 0.872 0.877 0.868 0.814 0.939 0.863
Table 23: Argument component identification performance on the test set. Corpus: Persua- sive2.
Model F1 Prec. Recl. F1:B F1:I F1:O
ACI 0.850 0.853 0.848 0.778 0.931 0.841 adACI 0.856* 0.859* 0.854* 0.791* 0.932* 0.844*
Table 24: 10-fold cross validation performance of argument component identification in the training set. Corpus: Persuasive2.
8.3 ARGUMENT COMPONENT CLASSIFICATION
The next component in our argument mining system aims at classifying each argument component as MajorClaim, Claim, or Premise. This problem setting is more practical than the classification problem we solved in Chapters 4and 5. While the old 4-way classification also considered non-argumentative sentences, this 3-way classification works on the output of the argument component identification step, and thus can skip non-argumentative sentences. For this task, Stab and Gurevych (2017) had significantly improved their first model in 2014bwith novel features. The most notable change that Stab and Gurevych made to their model was that they used dependency triples rather then production rules. It has been shown in our prior study that dependency triples are more effective then production rules for this classification task. Besides, they expanded their discourse marker set and included output
ADw4 ADw4 + Prob. ADw4 + Embedding Accuracy 0.820 0.845* 0.818 Kappa 0.656 0.704* 0.650 Macro F1 0.792 0.825* 0.788 Macro Precision 0.795 0.829* 0.795 Macro Recall 0.790 0.821* 0.783 F1:MajorClaim 0.866 0.896* 0.862 F1:Claim 0.628 0.681* 0.618 F1:Premise 0.883 0.898* 0.884
Table 25: 10-fold cross validation performance of ACC models in the training set. Corpus: Persuasive2.
of a discourse parser (Lin et al.,2014). Finally, the authors proposed two novel feature sets. The probability features are the conditional probabilities of the current component C having the argumentative label t in {MajorClaim, Claim, Premise} given the sequence of preceding tokens p:
P (label(C) = t|p)
Conditional probabilities P are estimated from the training data.
The second new feature set is based on the pre-trained Word2Vec word embedding (Mikolov et al., 2013). They summed vectors of words in the argument component and preceding tokens within the sentence. The summation vector was then added to the feature space.
While their new lexical features, e.g., dependency parse and discourse relations, over- lap with our features, their probability and embedding features are novel. However, their experiments showed that adding probability and embedding features yielded very little per- formance improvement, i.e., less than 0.5%. Moreover, because the conditional probability of argumentative labels are estimated in the training set, we worry that relying on probabil-
ity features may over-fit with training data and degrade performance on unseen test data. Therefore, we experiment with adding these feature sets to our proposed model.
8.3.1 Experiment Results: Cross Validation in Training Set
For argument component classification, we employ the ADw4 model, and train the model using LibSVM (Chang and Lin, 2011). As shown in Table 25, adding probability features significantly improves performance for ADw4, but adding embedding features does not. In fact, ADw4 + Embedding performs worse than our original model. However, the high performance of ADw4 + Probability is expected because the probabilities are computed in the training set.
8.3.2 Experiment Results: Performance in Test Set
Argument component classification performances on the test set are shown in Table26. Base column reports performance of the improved base classifier, and ILP column presents perfor- mance of the ILP-based joint prediction (Stab and Gurevych,2017). ILP model exploits the mutual information between argument components and argumentative relations to optimize the prediction. As a result, ILP obtained remarkably better performance than Base.
Comparing our proposed models, adding probability or embedding features do not im- prove performance of ADw4. While ADw4 + Probability yielded the best performance in the training set, its performance is the lowest in the test set. This suggests that probability features might cause over-fitting. Although embedding features allow a much more efficient representation than bag-of-words, a simple usage like adding word vectors to feature space seems to not help. However, given all successes of word embeddings in many different NLP tasks, we plan to explore more advanced usage of word embeddings in argument mining in the future.
Comparing ADw4 with Stab and Gurevych’s results, our model achieved significantly higher performance than their base classifier. Despite the fact that ADw4 does not have any information from argumentative relation prediction, its F1 score is comparable to ILP’s F1, and it even could predict MajorClaim better than the ILP model did. This makes us
Base ILP ADw4 ADw4+Prob. ADw4+Embd. Accuracy - - 0.848 0.839 0.841 Kappa - - 0.702 0.684 0.687 F1 0.794 0.826 0.825 0.814 0.816 Precision - - 0.831 0.825 0.825 Recall - - 0.822 0.805 0.810 F1:MajorClaim 0.891 0.891 0.910 0.901 0.906 F1:Claim 0.611 0.682 0.667 0.646 0.648 F1:Premise 0.879 0.903 0.900 0.896 0.896
Table 26: Test performance of ACC models. Corpus: Persuasive2.
believe that we can further improve state-of-the-art performance when implementing joint prediction from our base classifier.
Given the above results, we integrate ADw4 into our argument mining system to solve argument component classification.