3.4 Models Integration
4.1.2 Features
To apply machine learning algorithms, the training instances must be represented by a set of numerical features. A good feature set represents the instances in a way that lets the learning algorithms find patterns in the data that can be used to classify instances according to the desired target labels. For the task of Argumentative Sentence Detection, the training instances are the sentences of a text document, each of which should be classified as an argumentative or a non-argumentative sentence. Each sentence is represented with a set of features at the lexical, syntactic and semantic levels:
• N-Gram: contiguous sequence of 1 to N tokens from a given sentence. This feature was used as a baseline to compare with more specific features. We encode the presence of unigrams, bigrams, and trigrams in the sentence (N = 1, N = 2 and N = 3, respectively);
Figure 4.2: Opinion article annotated with one argument
• Word couples: all possible combinations of word pairs within a sentence. With this feature, we expect to retrieve pairs of words that capture argumentative reasoning without necessarily being adjacent to each other (e.g. “Concluo [...] porque [...]” (“I conclude [...] because [...]”), “Se [...] então [...]” (“If [...] then [...]”)). Such pairs typically occur together in the same sentence and are often associated with argumentative content. Because the words in a pair need not be adjacent, this feature increases the feature space substantially. For this reason, we also ran experiments with a cleaned corpus, in which all punctuation marks, numbers and nouns were removed;
• Argumentative keywords: set of clue words that directly indicate the structure of the argument. These words are strong indicators of argumentative content. A set K of argumentative keywords typically found in argumentative text written in Portuguese was manually compiled, based on the work presented in [Coh84]. K contains a total of 51 keywords (e.g. “logo” (“thus”), “pois” (“because”), “portanto” (“therefore”)). This feature is encoded as a binary feature: if the sentence contains at least one word that belongs to K, the feature is set to 1; otherwise, it is set to 0 (e.g. for the underlined sentence shown in Figure 4.2, this feature is set to 1 due to the presence of the word “pois”);
• Text statistics:
– Absolute Position: absolute position of the current sentence in the document from which it was extracted (e.g. 3 for the underlined sentence in Figure 4.2);
– Average Word Length: words used in argumentative sentences might have different characteristics from words used in non-argumentative sentences. This feature explores whether this difference shows in the average length of the words (e.g. 4.0 for the underlined sentence in Figure 4.2);
– Number of punctuation marks: argumentative sentences may contain more punctuation marks than non-argumentative sentences (e.g. 1 for the underlined sentence in Figure 4.2);
– Sentence Length: number of words in the current sentence (e.g. for the underlined sen-
• Adverbs: some adverbs can signal argumentative content (e.g. “então” (“so”), “sempre” (“always”), “mas” (“but”), amongst others);
• Modal Auxiliary: words indicating the level of necessity, which are usually found in some types of arguments (e.g. “poder” (“can”), “dever” (“must”), “ter” (“have”), amongst others);
• Verb tense: changes in verb tense can often be found in argumentative contexts, for instance when arguing about something in the present supported by premises that occurred in the past. Given a sentence s_i, we explored changes in verb tense that occur within s_i and between s_i and the surrounding sentences, s_{i-1} and s_{i+1} (e.g. in sentence (b) from Figure 4.1, the change in verb tense between “appeared” and “had gone” indicates a sequence of events which, in some situations, is associated with argumentative content). A window of 3 sentences (current, previous and next) was considered for this feature, under the assumption that ADUs must occur in sequential spans of text and, therefore, analyzing sentences outside this neighborhood is not necessary. When analyzing changes in verb tense between different sentences, we consider the verbs that are closest to each other. A change in verb tense between s_{i-1} and s_i occurs if the last verb not in the infinitive form in s_{i-1} has a different tense than the first verb not in the infinitive form in s_i. A change in verb tense between s_i and s_{i+1} occurs if the first verb not in the infinitive form in s_{i+1} has a different tense than the last verb not in the infinitive form in s_i. The information on verb tenses is obtained from the part-of-speech tagger Citius Tagger [GG15], which assigns each verb one of the following tense categories: Present, Imperfect, Future, Past or Conditional;
• Domain words repetition: arguments have to be about something and, therefore, repetitions of domain words or the presence of similar domain words are expected in different components of the argument. For this feature, repetitions of nouns, named entities, verbs and adjectives were considered. All punctuation marks and discourse markers were removed in the cleaning process. Given a sentence s_i, we explored word repetitions occurring within s_i and between s_i and the surrounding sentences, s_{i-1} and s_{i+1}. A window of 3 sentences (current, previous and next) was considered, for the same reason given for the Verb tense feature. Using a word embedding model generated for the Portuguese language [ARPS13], we calculate the similarity between two words as the cosine similarity between the feature vectors that represent them. We calculate the similarity between each pair of words occurring in s_i, between pairs of words in s_i and s_{i-1}, and between pairs of words in s_i and s_{i+1}, separately. For each of the three cases, the similarity score of the most similar pair of words is encoded directly as a feature (e.g. for the underlined sentence from Figure 4.2, this feature should capture the similarity between the words “átomo”, “elemento” and “neutrões”, which are similar words related to the topic of the argument; in the sentences from Figure 4.1, the most similar words are “book” and “publisher”, which are also related to the topic of the argument being presented).
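Some of the simpler features in the list above can be illustrated with a short sketch. It is not the implementation used in the thesis: the tokenizer is a plain regular expression, KEYWORDS holds only the three keywords quoted in the text (not the full set of 51), the function names are hypothetical, and returning 0 when a sentence has no non-infinitive verb is an assumption. The verb tenses are assumed to have been extracted beforehand by a part-of-speech tagger.

```python
import re

# Illustrative subset of the keyword set K: only the three keywords quoted in
# the text, not the 51 keywords compiled for the thesis.
KEYWORDS = {"logo", "pois", "portanto"}


def tokenize(sentence):
    """Lowercased word tokens (a simplification, not the thesis tokenizer)."""
    return re.findall(r"\w+", sentence.lower())


def keyword_feature(sentence):
    """Return 1 if the sentence contains at least one keyword from K, else 0."""
    return int(any(token in KEYWORDS for token in tokenize(sentence)))


def text_statistics(sentence, position):
    """Text statistics for one sentence; `position` is its index in the document."""
    words = tokenize(sentence)
    average_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "absolute_position": position,
        "average_word_length": average_length,
        # Every character that is neither a word character nor whitespace.
        "punctuation_marks": len(re.findall(r"[^\w\s]", sentence)),
        "sentence_length": len(words),
    }


def tense_change_with_previous(prev_tenses, curr_tenses):
    """Compare the last verb of s_{i-1} with the first verb of s_i.

    Both arguments list the tenses of the non-infinitive verbs in sentence
    order. Returning 0 when either list is empty is an assumption.
    """
    if not prev_tenses or not curr_tenses:
        return 0
    return int(prev_tenses[-1] != curr_tenses[0])


def tense_change_with_next(curr_tenses, next_tenses):
    """Compare the last verb of s_i with the first verb of s_{i+1}."""
    if not curr_tenses or not next_tenses:
        return 0
    return int(curr_tenses[-1] != next_tenses[0])
```

Under these assumptions, keyword_feature("Isto é verdade, pois foi provado.") yields 1 because of “pois”, and tense_change_with_next(["Present", "Past"], ["Present"]) yields 1 because the closest verbs differ in tense.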
To scale and normalize the feature set, tf-idf weighting was applied to each group of features based on a vocabulary of words (N-Gram, Word couples, Modal auxiliary, Adverbs), and all numerical features were scaled to the range between 0 and 1 using the MinMaxScaler method provided by scikit-learn [PVG+11].
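The two scaling steps can be sketched with scikit-learn as follows. This is an illustration rather than the thesis pipeline: the sentences and numerical values are invented, the vectorizer uses its default tokenization (which drops single-character tokens), and only the N-Gram group is shown for the vocabulary-based features.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Toy sentences, invented for illustration only.
sentences = [
    "Concluo que sim porque os dados o mostram",
    "O livro foi publicado ontem",
    "Se chover então ficamos em casa",
]

# Vocabulary-based features: tf-idf weighted unigrams, bigrams and trigrams.
ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_ngrams = ngram_vectorizer.fit_transform(sentences)  # sparse, one row per sentence

# Numerical features (hypothetical values): absolute position, sentence length.
X_numeric = np.array([[1.0, 8.0], [2.0, 5.0], [3.0, 6.0]])

# Rescale each numerical column independently to the range [0, 1].
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_numeric)
```

Each column of X_scaled then has minimum 0 and maximum 1, so the numerical features share a range comparable to that of the tf-idf weights.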
The tf-idf representation, short for term frequency-inverse document frequency, is a weighting scheme commonly used to scale features based on a vocabulary of words. Term frequency (tf) measures the raw frequency of a term in a document (i.e. the number of times that a term t occurs in a document). Inverse document frequency (idf) measures how much information the term provides, that is, whether the term is common (low idf score) or rare (high idf score) across all documents. Combining both measures yields the tf-idf measure. A high tf-idf value is reached by a high term frequency in a given document and a low document frequency of the term in the whole collection of documents.
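For reference, one common formulation of this weighting is given below. The text does not state which variant was used, so the logarithmic idf and the symbols N (number of documents in the collection) and df(t) (number of documents containing the term t) are assumptions; implementations such as the one in scikit-learn add smoothing terms to this basic form.

```latex
\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

In this form, a term that occurs in every document has idf(t) = log 1 = 0, which matches the description of common terms receiving a low idf score.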
As previously described, in the Domain words repetition feature we exploit a distributed representation of words (word embeddings). These representations map each word in a dictionary to a feature vector in a high-dimensional space, without human intervention, by observing the usage of words in large (non-annotated) corpora. This real-valued vector representation tries to place words with similar meanings close to each other, based on the occurrences of these words in a corpus. From these representations, interesting properties can be explored, such as semantic and syntactic similarities. In the experiments presented in this thesis, we used a model provided by the tool Polyglot, in which a neural network architecture was trained on Portuguese Wikipedia articles. A full description of the tool can be found in [ARPS13]. To obtain a score indicating the similarity between two words, we compute the cosine similarity between the vectors that represent them in the high-dimensional space.
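The similarity computation behind the Domain words repetition feature can be sketched as follows. The three-dimensional vectors in EMBEDDINGS are invented for illustration and are not taken from the Polyglot model; the function names are hypothetical, and skipping words without a vector and returning 0.0 when no pair can be scored are assumptions.

```python
import math
from itertools import combinations, product

# Hypothetical 3-dimensional vectors; real embeddings have many more dimensions.
EMBEDDINGS = {
    "átomo": [0.9, 0.1, 0.0],
    "elemento": [0.8, 0.2, 0.1],
    "livro": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_similarity(pairs):
    """Score of the most similar pair; words without a vector are skipped."""
    scores = [
        cosine_similarity(EMBEDDINGS[a], EMBEDDINGS[b])
        for a, b in pairs
        if a in EMBEDDINGS and b in EMBEDDINGS
    ]
    return max(scores, default=0.0)


def similarity_within(words):
    """Most similar pair of words inside one sentence."""
    return max_similarity(combinations(words, 2))


def similarity_between(words_a, words_b):
    """Most similar pair with one word taken from each of two sentences."""
    return max_similarity(product(words_a, words_b))
```

With these toy vectors, similarity_within(["átomo", "elemento", "livro"]) is determined by the pair “átomo” and “elemento”, while an exact repetition across two sentences gives a score of 1 up to floating-point rounding, so the same feature covers both repeated and merely similar domain words.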