5.3 SVM-based Metadiscourse Tagging
5.4.3 Feature Combinations
In this section, the results of different features and feature combinations using the MDT-SVM model are reported. For the purposes of analysis, the features are partitioned into three groups: n-grams of words, lemmas and POS tags; positional, length and distance features; and prosodic cues. In addition, the effect of using ASR output is reported for some of the feature combinations that performed best on the reference transcriptions.
It is important to note that significance was tested by evaluating each experiment using 50-fold cross-validation and then computing the F1-scores. In particular, a t-test was used to check whether the result of each experiment differed significantly from the best result obtained. A t-test is commonly used to determine whether the means of two groups differ significantly (Zimmerman, 1997). For instance, in Table 5.5 the best result is obtained using lexical trigram (LEX+TGM) features, as indicated in bold face. The experiments that give the best results for the other features, such as POS+TGM or LEM+TGM, are then compared with the best result across all features (i.e. LEX+TGM).
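The comparison described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments. It assumes both systems are scored on the same folds, so a paired t-test is used (the text only says "t-test"). The function name `paired_t_statistic` and the fold scores are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on per-fold scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two folds")
    sd = stdev(diffs)  # sample standard deviation of the per-fold differences
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Made-up F1-scores for five folds (assumption: the real setup uses 50 folds).
best = [0.42, 0.40, 0.44, 0.41, 0.43]
other = [0.40, 0.39, 0.41, 0.40, 0.40]
t, df = paired_t_statistic(best, other)
```

The resulting statistic would then be compared with the critical value of the t distribution for `df` degrees of freedom at the chosen significance level, taken from a table or a statistics library.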
Feature        Physics                Economics              Overall
               P      R      F        P      R      F        P      R      F
LEX*           47.76  33.14  39.13    52.22  38.30  44.19    49.99  35.72  41.66
LEX+LEM        46.21  34.15  39.28    51.26  39.56  44.66*   48.74  36.86  41.97
LEX+POS        45.12  36.14  40.13    47.83  40.65  43.95    46.48  38.39  42.04
LEM+POS        44.42  36.18  39.88    44.63  37.30  40.64    44.53  36.74  40.26
LEX+LEM+POS    46.66  39.32  42.68    48.00  42.43  45.04    47.33  40.88  43.86
Table 5.6: Results of using a combination of n-grams of words (LEX), lemmas (LEM) and POS tags (POS). Bold face denotes significant results and * denotes insignificant difference.
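As a check on how the columns of Table 5.6 relate, F can be recomputed from P and R as their harmonic mean, F = 2PR / (P + R). The Overall columns appear to equal the unweighted mean of the Physics and Economics values. That is an inference from the reported numbers, not something the text states. The helper names below are illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean across disciplines (assumed definition of 'Overall')."""
    return sum(values) / len(values)


# LEX row of Table 5.6
physics_f = f_score(47.76, 33.14)          # reported as 39.13
economics_f = f_score(52.22, 38.30)        # reported as 44.19
overall_f = macro_average([39.13, 44.19])  # reported as 41.66
```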
N-grams of Word, Lemma and POS
The first experimental setting tested was the use of n-grams of words, lemmas and POS tags. Unigrams, bigrams, trigrams and a combination of these were tried. It is important to note that the bigram features include the unigram features, and the trigram features include both unigrams and bigrams. Table 5.5 reports the results for each feature set and discipline.
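The cumulative definition of the n-gram features (bigram features contain the unigrams, trigram features contain both) can be illustrated with a short sketch. The function name `cumulative_ngrams` and the example sentence are hypothetical. The same function would apply to lemma or POS sequences.

```python
def cumulative_ngrams(tokens, max_n):
    """Return all n-grams of order 1..max_n as space-joined strings."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


tokens = ["today", "we", "will", "discuss"]
bigram_features = cumulative_ngrams(tokens, 2)   # unigrams + bigrams
trigram_features = cumulative_ngrams(tokens, 3)  # unigrams + bigrams + trigrams
```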
In general, the results show that using syntactic (POS) features alone gives lower performance than the other n-gram features in both disciplines, while word n-grams give the best results (average F1-score 41.66%). Among all the textual n-gram features (POS, lemma or word), trigram features give the best results compared with unigrams and bigrams; this observation was consistent across both disciplines. Lemma and word trigrams also perform approximately the same in both disciplines; the difference is insignificant, as indicated by * in Table 5.5. For example, for Physics lectures the model gives the same F1-score, 39.13%, with either feature. Similarly, for Economics lectures the F1-scores were 44.14% and 44.19% when lemma and word trigrams were used, respectively.
The previous experiments were inconclusive regarding the use of n-gram features to classify metadiscourse tags, since word and lemma features gave roughly the same results. Further investigation is needed to establish whether this similarity arises because the two features represent the same information or because they complement each other. It would also be useful to know whether including the syntactic features adds any value to these lexical combinations. To test these assumptions, Table 5.6 shows the results of experiments combining the trigrams of words, lemmas and POS tags. Combining all three features significantly improved the overall F score of the MDT-SVM model to 43.86%, compared with 41.66% and 41.64% when using only word and lemma trigrams, respectively, as shown in Table 5.5. Another important consideration is the difference between the two disciplines: classifying metadiscourse in Economics lectures gives better results than in Physics lectures. For instance, with the best combination of features (n-grams of words, lemmas and POS tags) the F score for Physics lectures was 42.68%, compared with 45.04% for Economics lectures. This is despite the fact that the total number of metadiscourse tag occurrences in Physics is higher than in Economics. This may indicate that the expressions used in Economics lectures are less variable than those in Physics lectures.

Feature                                 Physics                Economics              Overall
                                        P      R      F        P      R      F        P      R      F
LEX+LEM+POS                             46.66  39.32  42.68    48.00  42.43  45.04*   47.33  40.88  43.86
LEX+LEM+POS+Length                      47.58  34.02  39.67    52.79  40.06  45.55    50.19  37.04  42.61
LEX+LEM+POS+Position                    44.74  37.42  40.75    49.89  39.62  44.17    47.32  38.52  42.46
LEX+LEM+POS+Distance                    31.69  29.84  30.74    47.65  33.88  39.60    39.67  31.86  35.17
LEX+LEM+POS+Length+Position+Distance    43.77  25.04  31.86    45.42  33.32  38.44    44.59  29.18  35.15
Table 5.7: Results of using positional information (Length, Position, and Distance), along with other features including lexical (LEX), lemma (LEM), and Part-of-Speech Tags (POS). Bold face denotes significant results and * denotes insignificant difference.
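One way to picture the LEX+LEM+POS combination is to extract trigram features from each view of the sentence and merge them into a single feature set. The sketch below is an assumption about how such a combination could be built, not the implementation used in the experiments. In particular, the view prefixes (`LEX:`, `LEM:`, `POS:`), which keep a word and an identical lemma as separate features, and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, max_n, prefix):
    """Count n-grams of order 1..max_n, namespaced by a view prefix."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[prefix + ":" + " ".join(tokens[i:i + n])] += 1
    return counts


def combine_views(words, lemmas, pos_tags, max_n=3):
    """Merge word, lemma and POS n-gram counts into one feature dictionary."""
    combined = Counter()
    combined.update(ngram_counts(words, max_n, "LEX"))
    combined.update(ngram_counts(lemmas, max_n, "LEM"))
    combined.update(ngram_counts(pos_tags, max_n, "POS"))
    return combined


words = ["we", "discussed", "forces"]
lemmas = ["we", "discuss", "force"]
pos_tags = ["PRP", "VBD", "NNS"]
features = combine_views(words, lemmas, pos_tags)
```

With the prefixes, a classifier can weight the word "we" and the lemma "we" independently, which is one way the two views could complement rather than duplicate each other.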
Positional, Length and Distance
This section reports experiments using features that capture some aspects of discourse structure. These features are: the length of the sentence, the position of the sentence in the lecture, and the distance between the sentence currently being classified and the last occurrence of a metadiscourse tag.
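The three features can be made concrete with a small sketch. The exact definitions are assumptions made for illustration: length is counted in tokens, position is normalised to the range 0 to 1, distance is counted in sentences, and -1 marks sentences with no earlier metadiscourse tag. The function name is hypothetical, and at classification time the flags would have to come from earlier predictions rather than gold labels.

```python
def positional_features(sentences, is_metadiscourse):
    """Compute length, position and distance features for each sentence.

    sentences: list of token lists, in lecture order.
    is_metadiscourse: list of booleans, True if the sentence carries a tag.
    """
    total = len(sentences)
    rows = []
    last_tagged = None  # index of the most recent tagged sentence, if any
    for idx, tokens in enumerate(sentences):
        rows.append({
            "length": len(tokens),
            "position": idx / (total - 1) if total > 1 else 0.0,
            "distance": idx - last_tagged if last_tagged is not None else -1,
        })
        if is_metadiscourse[idx]:
            last_tagged = idx
    return rows


sentences = [
    ["today", "we", "start"],
    ["force", "is", "mass", "times", "acceleration"],
    ["so"],
    ["to", "summarise"],
]
flags = [True, False, False, True]
rows = positional_features(sentences, flags)
```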
Table 5.7 shows the results of adding these positional features, individually and together, to the best combination of n-grams of words, lemmas and POS tags from the previous section. The results indicate that most of the positional features do not improve classification performance. Among them, the length feature achieved the best results, particularly for Economics lectures, where adding length information to the n-gram features raises the F score from 45.04% to 45.55%; this improvement is not statistically significant, however. For Physics lectures, in contrast, the F score decreased from 42.68%, when only the n-grams of words, lemmas and POS tags were used, to 39.67% when length information was added. The remaining features, namely position, distance and the combination of all three, do not give significant improvements. In general, the small and inconsistent improvement from these features may indicate that they do not generalise as well as the n-gram features for the metadiscourse tagging task.
Feature          Physics                Economics              Overall
                 P      R      F        P      R      F        P      R      F
LEX+LEM+POS      46.66  39.32  42.68    48.00  42.43  45.04    47.33  40.88  43.86
LEX+POS+F0       46.64  40.92  43.59    47.90  43.62  45.66    47.27  42.27  44.63
LEX+POS+PD       46.16  41.35  43.62    49.28  45.17  47.14    47.72  43.26  45.38
LEX+POS+F0+PD    46.09  42.25  44.09    50.31  47.36  48.79    48.20  44.81  46.44
Table 5.8: Results for adding prosodic features (F0, PD) to reference transcriptions. Bold face denotes significant results.
Prosodic Cues
The final set of experiments considered the inclusion of prosodic cues, namely pitch-based (F0) features and pause duration (PD). Table 5.8 presents the results of these experiments, first individually, then combined. Pause duration had a greater effect on the results than F0, and this is consistent across both disciplines. In Physics lectures the F score improved from 42.68% to 43.62%; similarly, it increased from 45.04% to 47.14% for Economics lectures. This can be attributed to the fact that pause duration captures boundary information between words, which may serve as an indication of metadiscourse instances. For example, lecturers often pause after saying something important (EMP tag) or when they introduce the topic of the lecture (INT), to allow the students to absorb the information just given. This may hold for most of the metadiscourse tags, as the expressions used to signal the function of each tag share a main purpose: to engage the students during the lecture. In addition, combining the two prosodic features gives a statistically significant improvement in both disciplines, with an overall F score of 46.44%. In general, the inclusion of prosodic features was found to have more impact than positional and length features for the task of metadiscourse tagging.
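As a rough illustration of the two kinds of prosodic cue, the sketch below derives a pause duration from time-aligned boundaries and summarises a pitch track. Everything here is an assumption: the text does not specify how pauses were measured or which pitch statistics were used, so the function names, the choice of mean and range for F0, and the convention that unvoiced frames are stored as 0 are all hypothetical.

```python
from statistics import mean


def pause_after(segment_end, next_segment_start):
    """Silence in seconds between the end of one segment and the start of the next."""
    return max(0.0, next_segment_start - segment_end)


def f0_summary(f0_track):
    """Mean and range of F0 in Hz over voiced frames (unvoiced frames are 0)."""
    voiced = [f for f in f0_track if f > 0]
    if not voiced:
        return {"f0_mean": 0.0, "f0_range": 0.0}
    return {"f0_mean": mean(voiced), "f0_range": max(voiced) - min(voiced)}


# Made-up timings (seconds) and pitch values (Hz) for one sentence.
pause = pause_after(12.40, 13.15)
f0 = f0_summary([0.0, 180.0, 200.0, 220.0, 0.0])
```

Values such as `pause`, `f0["f0_mean"]` and `f0["f0_range"]` could then be appended to the n-gram feature vector of the sentence.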