3.3 Results

3.3.1 Findings in Selected Publications

3.3.1.1 Which approaches exist to extract software feature-relevant

This section investigates existing approaches to extract software feature-relevant information from natural language software engineering artifacts and aims to answer Rq#1. In the following, each approach found in the Slr is described in detail.

P1 (Bakar et al., 2016) describes an approach to extract features from online software reviews. In a first step, they identify similar documents by means of the Fuzzy C-Means (Fcm) clustering algorithm, where each data point has a degree of membership in every cluster. A data point which is located closer to a cluster's centroid has a stronger membership in that cluster (Bezdek et al., 1984). In a second step, they extract bigrams and trigrams (which represent features) based on PoS patterns from each cluster. Finally, the approach groups the extracted features into clusters of similar features by means of a modified Word Overlap Metric.
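The membership idea behind Fcm can be illustrated with a minimal sketch. It assumes fixed, hypothetical centroids and shows only the standard membership formula, not the full iterative algorithm and not the configuration used in P1. The function name `fcm_memberships` and the fuzzifier value `m=2.0` are assumptions made for illustration.

```python
import math

def fcm_memberships(point, centroids, m=2.0):
    """Return the fuzzy membership of `point` in each cluster.

    Uses the standard Fuzzy C-Means membership formula
    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)),
    where d_j is the Euclidean distance from the point to centroid j.
    """
    distances = [math.dist(point, c) for c in centroids]
    # A point that coincides with a centroid belongs fully to that cluster.
    if 0.0 in distances:
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]

# Hypothetical cluster centers; the point lies much closer to the first one.
centroids = [(0.0, 0.0), (10.0, 0.0)]
memberships = fcm_memberships((2.0, 0.0), centroids)
```

The memberships of a point sum to one, and the cluster whose centroid is nearer receives the larger share.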

P2 (Guzman and Maalej, 2014) provides an approach to automatically identify application features mentioned in user reviews. They assume that nouns, verbs, and adjectives are most likely used to describe features (in contrast to, e.g., adverbs, numbers, or quantifiers). From the words which are tagged as noun, verb, or adjective, the approach extracts features by means of a collocation algorithm. A collocation algorithm in the context of text mining identifies words which co-occur unusually often (see, e.g., Aggarwal and Zhai, 2012a). The approach ultimately considers only collocations that consist of fewer than three words and appear in at least three reviews. The authors then use Latent Dirichlet Allocation (Lda) to group extracted features that tend to co-occur in the same reviews and, furthermore, to assign topics to each review. Lda is a probabilistic topic modeling algorithm which automatically discovers the topics that a document contains (Blei et al., 2003).
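The filtering step can be sketched as follows. This is a simplified illustration, not the implementation from P2: the input is assumed to be already PoS-tagged, the tag names are hypothetical, and a plain count of adjacent word pairs stands in for the statistical association measure that a real collocation algorithm would apply.

```python
from collections import Counter

# Assumed tag names for nouns, verbs, and adjectives.
KEPT_TAGS = {"NOUN", "VERB", "ADJ"}

def candidate_features(tagged_reviews, min_reviews=3):
    """Return two-word candidates that occur in at least `min_reviews` reviews."""
    review_counts = Counter()
    for review in tagged_reviews:
        # Keep only words tagged as noun, verb, or adjective.
        words = [word.lower() for word, tag in review if tag in KEPT_TAGS]
        # A set ensures that each pair is counted once per review.
        pairs = set(zip(words, words[1:]))
        review_counts.update(pairs)
    return sorted(pair for pair, n in review_counts.items() if n >= min_reviews)

# Hypothetical, hand-tagged reviews.
reviews = [
    [("I", "PRON"), ("love", "VERB"), ("the", "DET"),
     ("photo", "NOUN"), ("upload", "NOUN")],
    [("Photo", "NOUN"), ("upload", "NOUN"), ("is", "AUX"), ("slow", "ADJ")],
    [("please", "INTJ"), ("fix", "VERB"), ("photo", "NOUN"), ("upload", "NOUN")],
    [("great", "ADJ"), ("app", "NOUN")],
]
features = candidate_features(reviews)
```

In this toy data only the pair "photo upload" appears in three reviews, so it is the only candidate that survives the filter.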

P3 (Hariri et al., 2013) provides an approach to extract common software features across products from online product listings. To this end, they use an incremental diffusive clustering (Idc) algorithm that incrementally identifies features based on a voting schema using distance metrics.

The approach from P4 (Johann et al., 2017) extracts high-level features (describing essential functional capabilities) from app descriptions and app reviews. They use predefined PoS patterns (e.g., Vb Nn Nn) to extract feature candidates. Furthermore, they simplify sentences which include enumerations and conjunctions in order to provide atomic sentences for feature extraction. The feature candidates extracted from an app description are then matched against the feature candidates extracted from the related user reviews by means of a binary text similarity function on different levels of granularity. First, the similarity function determines the similarity between the feature candidates based on single word matching. Second, the approach uses WordNet to consider synonym sets of captured words (e.g., photo and image). Finally, to compensate for a possible difference in the number of words, the approach utilizes cosine similarity.
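The three matching levels can be sketched loosely as follows. The sketch makes several assumptions that are not taken from P4: a small hand-made synonym table stands in for WordNet, the function names are hypothetical, and the cosine threshold of 0.7 is an arbitrary illustrative value.

```python
import math
from collections import Counter

# Toy stand-in for WordNet synonym sets (assumption for illustration).
SYNONYMS = {"photo": "image", "picture": "image"}

def normalize(words):
    """Map each word to a canonical representative of its synonym set."""
    return [SYNONYMS.get(word, word) for word in words]

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[word] * vb[word] for word in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def features_match(description_feature, review_feature, threshold=0.7):
    """Binary decision: do the two feature candidates describe the same feature?"""
    raw_a = description_feature.lower().split()
    raw_b = review_feature.lower().split()
    if raw_a == raw_b:          # level 1: word-by-word match
        return True
    norm_a, norm_b = normalize(raw_a), normalize(raw_b)
    if norm_a == norm_b:        # level 2: match after synonym substitution
        return True
    # Level 3: tolerate a different number of words via cosine similarity.
    return cosine(norm_a, norm_b) >= threshold

same = features_match("share photo", "share image")
longer = features_match("share photo", "share image quickly")
different = features_match("share photo", "delete account")
```

With these inputs, the first pair matches on the synonym level, the second only on the cosine level, and the third not at all.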

P5 (John, 2010) proposes Cave, a fully manual approach to identify software feature-relevant information in Nl user documentation. To this end, they determine a set of patterns (e.g., section headings typically contain features; repeated words or phrases can be domains or subdomains) which help to locate software feature-relevant information.

The approach from P6 (Khan et al., 2014) identifies product features in customer reviews by means of syntactic patterns. They extract “base noun phrases”, linking “verb based noun phrases”, and “preposition based noun phrases” that represent product features by means of different syntactic patterns.

CHAPTER 3. SOFTWARE FEATURE EXTRACTION FROM NATURAL LANGUAGE TEXT: STATE OF THE ART

P7 (Li et al., 2015) provides an approach to automatically extract requirements for scientific software from available Nl knowledge sources like user manuals and project reports. They use a combination of syntactic (PoS) and lexical patterns (e.g., method of {Nn | Np}) as well as a gazetteer¹ in order to identify and extract requirement candidates of different Drums (Domain specific Requirements Modeling for Scientists) types. The Drums types define core requirement types (e.g., data definition, interface, process) in the Drums model. Similar to P2, the authors utilize the Lda topic modeling algorithm to group the extracted Drums requirement candidates. The Lda algorithm performs the clustering task based on bi- and trigrams which are determined beforehand from the requirement candidates by means of a collocation algorithm. Finally, the Lda algorithm computes topics (features) for each requirements cluster.
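How a lexical pattern and a gazetteer lookup might work together can be sketched as follows. The sketch rests on assumptions that are not confirmed by P7: the pattern "method of {Nn | Np}" is read as the words "method of" followed by a token tagged Nn or Np, the input is assumed to be already PoS-tagged, and the gazetteer entries and function names are hypothetical.

```python
# Hypothetical domain-specific dictionary (gazetteer).
GAZETTEER = {"finite element", "mesh", "solver"}

def match_method_of(tagged):
    """Find 'method of <word>' where <word> is tagged NN or NP (assumed reading)."""
    hits = []
    for i in range(len(tagged) - 2):
        (w1, _), (w2, _), (w3, tag3) = tagged[i:i + 3]
        if w1.lower() == "method" and w2.lower() == "of" and tag3 in {"NN", "NP"}:
            hits.append(f"{w1} {w2} {w3}".lower())
    return hits

def gazetteer_hits(tagged, max_len=2):
    """Return gazetteer terms of up to `max_len` words that occur in the sentence."""
    words = [word.lower() for word, _ in tagged]
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in GAZETTEER:
                found.add(phrase)
    return sorted(found)

# Hypothetical, hand-tagged sentence.
sentence = [("The", "DT"), ("method", "NN"), ("of", "IN"),
            ("interpolation", "NN"), ("refines", "VB"),
            ("the", "DT"), ("mesh", "NN")]
pattern_candidates = match_method_of(sentence)
domain_terms = gazetteer_hits(sentence)
```

Matching whole tokens rather than substrings avoids accidental hits inside longer words.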

P8 (Slankas and Williams, 2013) describes an approach to automatically extract non-functional requirements (Nfr) from unconstrained Nl documentation. The approach utilizes a k-nearest neighbor (k-NN) classifier. The k-NN classifier is a supervised algorithm which classifies a new object based on which previously classified objects (i.e., training objects) are closest to the new object. The majority class among these k nearest neighbors is assigned as the class label of the new object (Aggarwal and Zhai, 2012b). The closeness between objects is determined by means of a distance metric (e.g., Euclidean distance for numerical attributes). The authors use a modified version of the Levenshtein distance (see Levenshtein, 1966). They evaluated the classifier against others (e.g., Smo, Nb) and report that Smo performed better than k-NN.
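The k-NN principle can be sketched with the plain Levenshtein distance. This is not the modified distance used in P8: the sketch assumes that the distance is computed over word sequences, and the training sentences, labels, and function names are hypothetical.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (item_a != item_b),  # substitution
            ))
        previous = current
    return previous[-1]

def knn_classify(sentence, training, k=3):
    """Label `sentence` with the majority class of its k nearest training sentences."""
    tokens = sentence.lower().split()
    nearest = sorted(
        training,
        key=lambda item: levenshtein(tokens, item[0].lower().split()),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training objects: (sentence, class label).
training = [
    ("the system shall encrypt all passwords", "security"),
    ("the system shall encrypt stored data", "security"),
    ("the system shall respond within two seconds", "performance"),
    ("the page shall load within two seconds", "performance"),
    ("the system shall log every failed login", "security"),
]
label = knn_classify("the system shall encrypt all backups", training)
```

Two of the three nearest training sentences carry the label "security", so the new sentence receives that label.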

P9 (Yu et al., 2013) proposes an approach to mine and recommend software features across multiple software web repositories like sourceforge.net. In that context, they created a hierarchical repository of software features (Hesa). Hesa contains high-level features which are described by feature elements. A feature element is defined as a “[...] raw description of a feature which can indicate a functional characteristic or concept of the software product” (Yu et al., 2013). A feature element corresponds to a sentence of an online software profile. The feature elements are clustered by an extended Lda algorithm into flat clusters. Finally, an improved Agglomerative Hierarchical Clustering algorithm (iAHC) transforms the features (and corresponding feature elements) into a hierarchical (semantic-based) feature structure. An Agglomerative Hierarchical Clustering algorithm is a bottom-up clustering method which determines hierarchical clusters (e.g., clusters have sub-clusters) based on distance metrics in an iterative process.
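The bottom-up principle can be sketched with a plain agglomerative procedure. This is not the iAHC variant from P9: single linkage and the Jaccard distance over word sets are assumptions chosen for brevity, and the feature names are hypothetical.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Distance between two word sets: 1 minus their Jaccard similarity."""
    return 1.0 - len(a & b) / len(a | b)

def single_linkage(c1, c2):
    """Distance between two clusters: their closest pair of members."""
    return min(jaccard_distance(a, b) for a in c1[1] for b in c2[1])

def agglomerate(features):
    """Repeatedly merge the two closest clusters; return the nested hierarchy."""
    # Each cluster is (tree of names, list of member word sets).
    clusters = [(name, [words]) for name, words in features.items()]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical features, each represented by its set of words.
features = {
    "send email": {"send", "email"},
    "send message": {"send", "message"},
    "edit photo": {"edit", "photo"},
    "crop photo": {"crop", "photo"},
}
hierarchy = agglomerate(features)
```

The result is a nested tuple in which the two "send" features form one sub-cluster and the two "photo" features form another, before both are joined at the root.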

¹ A gazetteer in the context of text mining is a domain-specific dictionary used to identify domain-specific terms.

P10 (Zhan and Li, 2010) provides an approach which mines product features in product reviews by means of their nominal semantic structure. Initially, PoS tags are used in syntactic patterns to determine noun fragments in the product reviews. Starting from the noun fragments and their semantic dependencies, the approach determines potentially relevant non-nominal semantic neighbors that can be either adjectives or verb predicates. The combination of a noun fragment and a non-nominal semantic neighbor represents a product feature. As a last step, the authors apply co-clustering to the product features in order to determine fine-grained product feature clusters.

3.3.1.2 Which software feature-relevant entities are extracted? Which types

In document Retrospective Semi-automated Software Feature Extraction from Natural Language User Manuals (Page 46-49)