4. App Review Classification: Easy over Hard
4.3. Classification Models
This section describes the classification models designed to answer research questions RQ1-A and RQ1-B. First, we explain in detail the textual features used to train two types of MaxEnt models for review sentence classification: one uses simple word n-gram features (also called BoW), and the other exploits complex linguistic features. Then, we explain the architecture of the CNN model used to classify the same set of review sentences.
4.3.1. Word N-Grams (BoW)
Word n-grams, often called BoW, is a straightforward feature extraction method that returns every contiguous sequence of n words from a given review sentence. For instance, the 1- and 2-grams of the review sentence “plz fix this feature” are ’plz’, ’fix’, ’this’, ’feature’, ’plz fix’, ’fix this’, and ’this feature’. In this method, a dictionary is first built by extracting all contiguous sequences of n words from the training corpus. Then, a feature matrix is maintained in which each row represents a review sentence and stores the frequency of each n-gram in that sentence. The method does not require any external linguistic tool, which makes it very appealing to practitioners.
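The dictionary and feature matrix described above can be sketched in a few lines of plain Python. This is only an illustration, not the implementation used in the study: the whitespace tokenizer, the lowercasing, and the function names `word_ngrams` and `build_feature_matrix` are assumptions made for the example.

```python
from collections import Counter

def word_ngrams(sentence, n_min=1, n_max=2):
    """Return all contiguous word n-grams for n in [n_min, n_max]."""
    # Assumption: lowercase + whitespace tokenization stands in for a real tokenizer.
    tokens = sentence.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def build_feature_matrix(sentences, n_min=1, n_max=2):
    """Build the n-gram dictionary and one frequency row per sentence."""
    per_sentence = [Counter(word_ngrams(s, n_min, n_max)) for s in sentences]
    vocabulary = sorted(set().union(*per_sentence))
    # Counter returns 0 for n-grams that do not occur in a sentence.
    matrix = [[counts[g] for g in vocabulary] for counts in per_sentence]
    return vocabulary, matrix

vocabulary, matrix = build_feature_matrix(["plz fix this feature", "great feature"])
print(word_ngrams("plz fix this feature"))
# ['plz', 'fix', 'this', 'feature', 'plz fix', 'fix this', 'this feature']
```

Each row of `matrix` lines up with `vocabulary`, so it can be passed directly to any classifier that accepts count vectors.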
BoW features are useful for characterizing review sentences by sentence type. For instance, the words “awesome” and “great” mostly appear in review sentences of type praise, while “bug”, “crash”, and “plz fix” appear in review sentences where users mention a bug in an app. Maalej and Nabil [45] used BoW features to classify full review texts into categories such as feature request and bug report. We use the same features to classify reviews at the sentence level instead. A full review contains more information, but we believe that review sentences are more specific and contain enough lexical information to be classified correctly.
4.3.2. Character N-Grams (BoC)
Analogous to BoW, character n-grams (i.e., BoC) are all sequences of n consecutive characters in a review sentence, taken after whitespace is removed. For example, the character 3-grams for the sentence “The UI is Ok” are ’The’, ’heU’, ’eUI’, ’UIi’, ’Iis’, ’isO’, and ’sOk’. In previous studies [18], BoC features have been used successfully in applications such as malicious code detection and duplicate bug report detection.
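A minimal sketch of character n-gram extraction follows. It assumes, in line with the example above, that whitespace is removed before the sliding window is applied and that case is preserved. The function name `char_ngrams` is illustrative.

```python
def char_ngrams(sentence, n=3):
    """Return all character n-grams of a sentence after removing whitespace."""
    text = "".join(sentence.split())
    # A window of length n slides one character at a time over the text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("The UI is Ok"))
# ['The', 'heU', 'eUI', 'UIi', 'Iis', 'isO', 'sOk']
```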
4.3.3. Linguistic Features
To train a MaxEnt model with rich linguistic features, we extracted the same set of linguistic features used in the study of Gu and Kim [18]. Their set also includes the BoC features explained in Section 4.3.2.
Linguistic features can be useful for classifying review sentences into their types (see Section 5) because review sentences in each category often follow a distinct structural pattern. For instance, sentences of type feature evaluation, such as “The search (NOUN) works pretty nice (ADJECTIVE)” or “It’s perfect (ADJECTIVE) for storing notes (NOUN)”, follow a pattern that differs from that of sentence type feature request, such as “please add (VERB) look up feature (NOUN)” or “it could (MODAL) be (VERB) improved by adding more themes (NOUN)”.
In the following paragraphs, we explain the linguistic features used in our study:
a) Part of Speech (POS). A POS tagger marks up the type of each word in a sentence. It also takes into account the context (i.e., the relationship with adjacent and surrounding words) in which a word appears. For example, the POS tags for the sentence “The user interface is elegant” are “DETERMINER NOUN NOUN VERB ADJECTIVE”. For this study, we extracted the PTB POS tags² with the NLTK³ library. All POS tags extracted from a review sentence are concatenated and used as a feature for review classification.
b) Constituency Parse Tree. A constituency parse tree represents the grammatical structure of a sentence. Figure 5 shows the constituency parse tree for a sample review sentence, generated using the Stanford CoreNLP library⁴. The parse tree shows that the sentence node (S) is composed of a noun phrase (NP) and a verb phrase (VP), and that the VP contains an adjective phrase (ADJP). We traversed the parse tree in breadth-first order, and the labels of the non-terminal nodes among the first five nodes visited are concatenated and used as a feature.
Figure 5. Constituency parse tree for a review sentence “the user interface is not very elegant”. The feature extracted from this tree is “ROOT-S-NP-VP-DT-NN” [74].
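The breadth-first extraction can be sketched as follows. The tree below is a hand-written approximation of the parse in Figure 5, not actual CoreNLP output, and `parse_tree_feature` is a hypothetical helper. Note that the feature string in the caption contains six labels when ROOT is counted, so the number of labels is exposed as a parameter (`max_labels`) rather than fixed.

```python
from collections import deque

# Assumption: hand-written parse of "the user interface is not very elegant".
# Non-terminals are (label, children) tuples; words are plain strings.
TREE = ("ROOT", [
    ("S", [
        ("NP", [("DT", ["the"]), ("NN", ["user"]), ("NN", ["interface"])]),
        ("VP", [("VBZ", ["is"]), ("RB", ["not"]),
                ("ADJP", [("RB", ["very"]), ("JJ", ["elegant"])])]),
    ]),
])

def parse_tree_feature(tree, max_labels=6):
    """Concatenate the first max_labels non-terminal labels in breadth-first order."""
    labels = []
    queue = deque([tree])
    while queue and len(labels) < max_labels:
        node = queue.popleft()
        if isinstance(node, str):  # terminal node (a word): no label to record
            continue
        label, children = node
        labels.append(label)
        queue.extend(children)
    return "-".join(labels)

print(parse_tree_feature(TREE))  # ROOT-S-NP-VP-DT-NN
```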
c) Semantic Dependency Graph (SDG). An SDG is a directed graph that shows the dependency relations between the words in a sentence [18]. Nodes in the graph represent words labeled with POS tags, and edges represent dependency relations between words. Figure 6 shows the dependency graph of a sample sentence, generated using the spaCy⁵ library. The word ’is’ is the ROOT node of the sentence, as it has no incoming edges. The root has three dependents with the following relations: a nominal subject (nsubj) ’interface’, a negation modifier (neg) ’not’, and an adjectival complement (acomp) ’elegant’. The child node ’interface’ has two children: a determiner (det) ’the’ and a noun compound modifier (nn) ’user’. To extract the textual feature, the SDG is traversed in breadth-first order, and the dependency relations labeling the edges and the POS tags of the words in the nodes are concatenated. Leaf nodes that are not directly connected to the ROOT node are ignored. For example, the textual feature extracted from the SDG of the sentence shown in Figure 6 is “VBZ-nsubj-NN-neg-ADV-acomp-JJ”.

² https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
³ http://www.nltk.org/
⁴ https://stanfordnlp.github.io/CoreNLP/
⁵ https://spacy.io/
Figure 6. Semantic Dependency Graph of the sample review sentence “the user interface is not elegant”. The feature extracted from this SDG is “VBZ-nsubj-NN-neg-ADV-acomp-JJ” [74].
d) Trunk Word. The trunk word feature is simply the root word of an SDG. For instance, the trunk word of the sentence “The user interface is not elegant” is ’is’.
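The SDG feature and the trunk word can both be read off a dependency graph with one breadth-first pass. The sketch below is illustrative only: the graph is written by hand to mirror Figure 6 (including its tag labels) instead of being produced by spaCy, and the tuple layout and the name `sdg_features` are assumptions.

```python
from collections import deque

# Assumption: hand-written graph for "the user interface is not elegant".
# Each entry: (word, POS tag, index of head word or None for the root, relation).
SENTENCE = [
    ("the", "DT", 2, "det"),
    ("user", "NN", 2, "nn"),
    ("interface", "NN", 3, "nsubj"),
    ("is", "VBZ", None, "ROOT"),
    ("not", "ADV", 3, "neg"),
    ("elegant", "JJ", 3, "acomp"),
]

def sdg_features(tokens):
    """Return (SDG feature string, trunk word) for one dependency graph."""
    children = {i: [] for i in range(len(tokens))}
    root = None
    for i, (_, _, head, _) in enumerate(tokens):
        if head is None:
            root = i
        else:
            children[head].append(i)
    parts = [tokens[root][1]]            # start with the POS tag of the root
    queue = deque(children[root])
    while queue:
        i = queue.popleft()
        _, pos, head, relation = tokens[i]
        if not children[i] and head != root:
            continue                     # leaf not directly attached to the root
        parts += [relation, pos]
        queue.extend(children[i])
    return "-".join(parts), tokens[root][0]

feature, trunk_word = sdg_features(SENTENCE)
print(feature)     # VBZ-nsubj-NN-neg-ADV-acomp-JJ
print(trunk_word)  # is
```

Indexing nodes by position instead of by word keeps the sketch correct when a word occurs more than once in a sentence.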
4.3.4. Convolutional Neural Networks (CNNs)
CNN-based classification models have shown encouraging results on various textual classification tasks [6, 33]. We adopt the CNN architecture proposed by Kim [33] to classify review sentences.
The architecture of the model is illustrated in Figure 7. The first layer of the network embeds words into low-dimensional vectors. The second layer performs convolutions over the embedded word vectors using multiple filter sizes. The outputs of these convolutions are max-pooled into a long feature vector in the third layer. The fourth layer is a dense layer with dropout applied. Finally, the results are classified using a softmax layer. For more details see Section b).
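To make the data flow through these layers concrete, the following dependency-free sketch runs a single forward pass. It is not the training code used in the study. The parameter layout, the ReLU activation, the toy dimensions, the window sizes, and the names `init_params` and `cnn_forward` are all assumptions chosen for illustration, and the sentence is assumed to be at least as long as the largest filter window (padding is omitted).

```python
import math
import random

def conv_max_pool(embedded, filt, bias):
    """Slide one filter over the sentence and keep its largest ReLU activation."""
    window = len(filt)
    activations = []
    for start in range(len(embedded) - window + 1):
        total = bias
        for offset in range(window):
            total += sum(w * x for w, x in zip(filt[offset], embedded[start + offset]))
        activations.append(max(0.0, total))   # ReLU
    return max(activations)                   # max-over-time pooling

def softmax(scores):
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def init_params(vocab_size, dim, window_sizes, filters_per_size, n_classes, seed=0):
    """Randomly initialize toy parameters (illustrative values only)."""
    rng = random.Random(seed)
    vec = lambda n: [rng.uniform(-0.1, 0.1) for _ in range(n)]
    filters = [([vec(dim) for _ in range(w)], 0.0)
               for w in window_sizes for _ in range(filters_per_size)]
    return {"embeddings": [vec(dim) for _ in range(vocab_size)],
            "filters": filters,
            "dense_w": [vec(len(filters)) for _ in range(n_classes)],
            "dense_b": [0.0] * n_classes}

def cnn_forward(token_ids, params, dropout_rate=0.0, rng=None):
    # Embedding layer: look up one vector per word.
    embedded = [params["embeddings"][i] for i in token_ids]
    # Convolution + pooling: one pooled value per filter, concatenated.
    pooled = [conv_max_pool(embedded, f, b) for f, b in params["filters"]]
    # Dropout (inverted scaling), only meaningful during training.
    if dropout_rate > 0.0:
        rng = rng or random.Random()
        keep = 1.0 - dropout_rate
        pooled = [p / keep if rng.random() < keep else 0.0 for p in pooled]
    # Dense layer followed by softmax over the sentence types.
    scores = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(params["dense_w"], params["dense_b"])]
    return softmax(scores)

params = init_params(vocab_size=20, dim=8, window_sizes=(3, 4, 5),
                     filters_per_size=2, n_classes=4)
probs = cnn_forward([1, 5, 7, 2, 9, 3], params)
print(len(probs), round(sum(probs), 6))
```

In practice a deep learning framework would replace these loops. The sketch only shows how multiple filter sizes feed one concatenated feature vector.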
Since neural network models have a large number of trainable parameters, they typically require large training sets to learn properly. However, when the available training sets are not very large, as is the case in this study, initializing a CNN-based model with pre-trained word embedding vectors obtained from an unsupervised neural language model might help to improve model performance [33, 79].
Therefore, we train CNN models both with and without pre-trained word embeddings to assess the effect of using externally trained word vectors for classifying app review sentences. We use the 300-dimensional Word2Vec embeddings [50] trained on 100 billion words from Google News.⁶

Figure 7. CNN model architecture for sentence classification (figure taken from [33]).
The words that are absent in the vocabulary of pre-trained embeddings are initialized randomly. In particular, we experiment with three different models:
• CNN (rand): The CNN model in which all word vectors in the embedding layer are randomly initialized and then modified during training.
• CNN (static): The CNN model is initialized with the pre-trained word vectors, and all word vectors, including the randomly initialized ones, are kept static and are not updated during training.
• CNN (non-static): Same as CNN (static) but the pre-trained vectors are fine-tuned during model training for our classification task.
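The three variants differ only in how the embedding layer is initialized and whether it is updated during training. A minimal sketch of that distinction is given below. The helper `build_embedding_layer`, the uniform initialization range, and the two-dimensional toy vectors are assumptions for illustration. The study itself uses 300-dimensional Word2Vec vectors.

```python
import random

def build_embedding_layer(vocab, pretrained, dim, mode, seed=0):
    """Return (embedding matrix, trainable flag) for one of the three variants.

    vocab:      list of words in the training corpus
    pretrained: dict mapping word -> pre-trained vector (toy stand-in for Word2Vec)
    mode:       "rand", "static", or "non-static"
    """
    if mode not in {"rand", "static", "non-static"}:
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    use_pretrained = mode != "rand"
    matrix = []
    for word in vocab:
        if use_pretrained and word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            # Words missing from the pre-trained vocabulary (or every word in
            # "rand" mode) are initialized randomly; the range is an assumption.
            matrix.append([rng.uniform(-0.25, 0.25) for _ in range(dim)])
    # Only the "static" variant freezes the embedding layer during training.
    trainable = mode != "static"
    return matrix, trainable

vocab = ["plz", "fix", "crash"]
pretrained = {"fix": [0.1, 0.2], "crash": [0.3, 0.4]}
for mode in ("rand", "static", "non-static"):
    matrix, trainable = build_embedding_layer(vocab, pretrained, 2, mode)
    print(mode, trainable)
```

In a framework-based implementation, the `trainable` flag would correspond to freezing or unfreezing the embedding weights.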