Machine Learning - Towards a holistic framework for software artefact consistency management

8.2.1 Basic Concepts

Mitchell defines machine learning as a field concerned with the construction of "computer programs that automatically improve with experience" [191]. Machine learning allows the discovery of knowledge from data by devising algorithms that draw inspiration from a number of fields. Such areas include artificial intelligence, probability and statistics, computational complexity, information theory, psychology and neurobiology, control theory, and philosophy [191]. The impact of these fields is manifested in the core ideas behind machine learning algorithms and models. For example, neural networks are modelled on the biological brain, and Bayesian network learning is based on principles originating in probability and statistics.

Machine learning algorithms have proven useful in a substantial number of application domains. One example is data mining, where the aim is to discover implicit correlations and novel patterns in large-scale data [192]. Other areas include speech recognition, computer vision, and robot control.

Machine learning problems can be categorised into various groups, such as classification, regression or clustering problems. The aim of both classification and regression is to predict a target (output) based on some predictors (inputs) [193]. However, the two differ in the type of the target. While the target in classification is a nominal variable, in regression it is numeric. Classification, under which the approach presented here falls, is introduced in detail in Subsection 8.2.4. The main premise of clustering is to assign observations into groups based on some similarity. A notable clustering method is the K-means algorithm, which aims to find a user-specified number of clusters represented by their centroids [192].
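To make the idea of clusters represented by their centroids concrete, the following is a minimal sketch of K-means in its common formulation (Lloyd's algorithm). It is illustrative only and is not part of the approach presented in this work. The function name `k_means`, the sample points, and the choice of seeding the centroids with the first k points are assumptions made for the example.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Lloyd's algorithm: alternate an assignment step and a centroid update step."""
    # Assumption: the first k points seed the centroids (deterministic, for illustration).
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical observations forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
```

Here the user-specified number of clusters is `k=2`, and each returned centroid is the mean of the observations assigned to it.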

Depending on the learning approach, four main types of machine learning scenarios can be differentiated [194]. Supervised learning involves the use of labelled instances, that is, the algorithm is provided with a training set that contains the desired output values. On the other hand, in unsupervised learning the training data does not contain the desired outputs, whereas semi-supervised learning may involve a few desired outputs. Finally, in reinforcement learning the algorithm learns through trial and error. The work presented here falls under the area of supervised learning.

8.2.2 Relevant Machine Learning Usage Scenarios

Machine learning has been applied in a number of software development and software maintenance problems; as Zhang points out, requirements engineering, rapid prototyping, component reuse, cost/effort prediction, defect prediction, test oracle generation, validation, reverse engineering and change impact prediction are just a few areas that can benefit from the potential machine learning techniques offer [195] [196]. However, due to the data requirements of such techniques, one of the hindering factors of applying machine learning algorithms is the availability and accessibility of relevant software engineering specific data from software projects [197].

In the field of traceability, a number of solutions rely on machine learning techniques to complement other automated trace generation techniques and to improve their results. A few examples include the Multi-strategy Learning approach to recover trace links between Java programs and Use Case elements [198], and a custom classification algorithm to improve the quality of traces between regulatory code and product level requirements [199]. Additionally, work has been done to evaluate the applicability and performance of clustering in automated tracing [200], to combine the Vector Space Model with Regular Expressions, Key Phrases and Clustering using a modified K-means algorithm to automatically recover links between text documents and source code [201], and to investigate the use of clustering to improve tracing between high-level requirements and low-level design elements [202]. Finally, reinforcement learning has been used to identify common textual segments between documents and to suggest links between them [203]. These solutions focus on specific artefacts and on automating tracing between these representations. In comparison, the approach presented in this work applies supervised learning to establish trace links between heterogeneous artefacts and thereby aims to provide a more generic solution applicable in different development scenarios.

8.2.3 Motivation to Use Machine Learning

Automatically creating trace links is a complex problem. Inter-artefact relationships cannot simply be inferred from a set of rules describing correlations between artefact elements without imposing very restrictive practices on developers, such as strict naming conventions, or manually creating mappings between artefacts. The heterogeneity of artefacts and artefact elements, which differ in their naming, structure and abstraction levels, exacerbates this complexity. It cannot be guaranteed that software projects follow standardised coding practices such as naming conventions, which means all aspects of artefacts can be variable. The complexity of this problem makes a simple heuristic approach unlikely to succeed. Thus, a way of capturing and leveraging the fundamental complexity of the interactions between artefacts is required, and machine learning is particularly suited to modelling complex non-linear spaces.

8.2.4 Traceability Creation as a Classification Problem

The premise of our approach is that establishing trace links can be thought of as a binary classification problem. That is, a pair of source and target artefact elements can be assigned to one of two categories, related or unrelated, based on existing and already categorised pairs. As described by Domingos [204], a classifier is a system which, given a vector of feature values, outputs a single discrete value called the class. The problem, specifically in the context of classification, can be defined as approximating a boolean-valued function from training examples, i.e. from examples labelled as members and non-members of a class. Each instance X, a pair of source and target artefact elements, is represented by attributes (selected features, which are discussed in Section 8.5). The target concept, that is, whether or not a trace link exists for X, can be denoted by:

$$c : X \to \{0, 1\}, \quad \text{where } c(X) = 1 \tag{8.1}$$

if there is a link between source and target, and

$$c : X \to \{0, 1\}, \quad \text{where } c(X) = 0 \tag{8.2}$$

if there is no link between source and target.
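As an illustration of how an instance X and its label c(X) might be represented, the sketch below builds a tiny labelled dataset from pairs of artefact element names. Everything in it is hypothetical: the element names, the set `known_links`, and the single feature `name_overlap` (a Jaccard similarity over the words in the two names) are assumptions made for the example, not the features selected in Section 8.5.

```python
import re

def tokens(name):
    """Split an identifier such as 'OrderService' or 'order_service' into lower-case words."""
    return {t.lower() for t in re.findall(r"[A-Za-z][a-z]*", name)}

def name_overlap(source, target):
    """Jaccard similarity of the two names' word sets (a hypothetical feature)."""
    a, b = tokens(source), tokens(target)
    return len(a & b) / len(a | b) if a | b else 0.0

def c(pair, known_links):
    """Target concept: 1 if a trace link exists for the pair, 0 otherwise."""
    return 1 if pair in known_links else 0

# Hypothetical, already categorised pairs of (source element, target element).
known_links = {("PlaceOrder", "OrderService")}
pairs = [("PlaceOrder", "OrderService"), ("PlaceOrder", "InvoicePrinter")]

# Each instance X is a feature vector; each label is c(X).
dataset = [((name_overlap(s, t),), c((s, t), known_links)) for s, t in pairs]
```

The first pair shares the word "order" and is labelled 1, while the second shares no words and is labelled 0. A realistic feature vector would contain several attributes rather than one.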

The learner is presented with negative (c(X) = 0) and positive (c(X) = 1) examples, and the aim of the classification is to find an estimate h such that h(X) = c(X). The outcome of the learning process is successful if, following the approximation of the target function over training examples, the approximation on unobserved examples yields sufficiently accurate results [191]. In the next few sections, the methodology for data collection, preparation and feature selection is outlined, followed by model selection, training and an evaluation of results.
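The learning task described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the model used in this work: logistic regression trained by batch gradient descent stands in for whichever classifier is eventually selected, and the two-dimensional feature vectors, labels and hyperparameters are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=2000, learning_rate=0.5):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(examples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in examples:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(score) - label
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - learning_rate * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(examples)
    return weights, bias

def h(features, weights, bias):
    """Hypothesis h(X): 1 if a trace link is predicted for the pair, 0 otherwise."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0

# Hypothetical training examples: (feature vector of a pair, c(X)).
training = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.9), 1),
            ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.2), 0)]
# Hypothetical unobserved examples, held out to check how well h approximates c.
held_out = [((0.85, 0.9), 1), ((0.95, 0.85), 1), ((0.15, 0.1), 0), ((0.05, 0.15), 0)]

weights, bias = train(training)
correct = sum(h(x, weights, bias) == label for x, label in held_out)
accuracy = correct / len(held_out)
```

Measuring `accuracy` on `held_out` rather than on `training` reflects the success criterion above: the estimate must agree with the target concept on examples the learner has not seen.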
