A prerequisite of successfully implementing machine learning algorithms in the ACM framework is the availability and accessibility of relevant software engineering data for training and evaluation. This is often a challenge due to the nature of software projects; data from proprietary products is typically unavailable [197] and artefacts other than source code are not always available or complete for open source systems. The experiments presented in this work utilise data from six different open source systems hosted in online repositories.
8.3.1 Criteria for Candidate System Selection
The primary criteria for selecting candidates include:
Artefacts available in repository. The main criterion was the availability of a variety of artefacts to represent different combinations of traceability scenarios. The prevalence of each artefact type in practice was also considered. According to a survey, the most widely used non-source-code artefact is the UML class diagram, followed by sequence diagrams and use cases [205]. An examination of a number of online repositories revealed similar patterns.
Implementation language: Java. The framework currently handles Java source code, and therefore the search was limited to systems implemented in this language.
System size. Systems of varying sizes were selected for experiments. Smaller systems are easier to comprehend and allow the establishment of trace links across the entire system instead of having to focus on individual components to manage complexity. Conversely, larger systems may offer more complicated links of different types between artefacts.
Satisfying the artefact availability criterion proved challenging, since only a small proportion of systems provide documentation such as requirement or architecture specifications. To maximise the chance of finding candidate systems, an extensive search was carried out on popular source code repositories using a list of available hosts [206]. Of the listed repositories, candidate systems were found on GitHub¹, SourceForge² and Google Code³. A further challenge is that these hosts provide non-uniform metrics for comparing project size. GitHub, for example, does not disclose lines of code metrics; therefore, where such information is not available, the metric was calculated. A brief summary of the candidate systems follows.

¹ https://github.com/
² https://sourceforge.net/
³ https://code.google.com/
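The text does not state how the lines of code metric was calculated when a host did not report it. As an illustration only, the following minimal sketch counts non-blank, non-comment lines in all `.java` files under a checkout directory. The function name `count_java_loc` and the counting rules are assumptions made for this example, not the procedure used in this work.

```python
from pathlib import Path


def count_java_loc(root: str) -> int:
    """Count non-blank, non-comment lines in all .java files under root.

    Simplification (assumption): a line that starts a block comment, or lies
    inside one, is skipped entirely, even if it also contains code.
    """
    total = 0
    for path in Path(root).rglob("*.java"):
        in_block_comment = False
        text = path.read_text(encoding="utf-8", errors="ignore")
        for raw in text.splitlines():
            line = raw.strip()
            if in_block_comment:
                # Stay inside the comment until the closing marker appears.
                if "*/" in line:
                    in_block_comment = False
                continue
            if not line or line.startswith("//"):
                continue
            if line.startswith("/*"):
                # The comment continues unless it closes on the same line.
                in_block_comment = "*/" not in line
                continue
            total += 1
    return total
```

Different counting rules (physical lines, or lines including comments) would give different totals, so figures obtained this way are only comparable with each other when the same rule is applied to every system.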
8.3.2 Candidate Systems
Table 8.1 provides a summary of the functionality, origin and size metrics of the selected systems.
Micro Mouse Simulator (MMS)⁴ is a micro-mouse maze editor and simulator that leverages various maze-solving algorithms. It is implemented in Java and Python. MMS provides Java source code and UML class diagram artefacts.
JGAP⁵ is a Java framework for solving problems by applying evolutionary principles. JGAP offers extensive documentation and approximately 1400 test cases, which makes it a suitable candidate for extracting source code and unit test artefacts.
Neo4j⁶, the popular graph database, was selected because of the size of its codebase and because it provides Java source code, unit test and module view architecture artefacts.
Myrobotlab⁷ is a framework for robotics and creative machine control providing services for machine vision, speech recognition, servo control, GUI control and microcontroller communication. Since Myrobotlab offers extensive documentation in the form of architectural diagrams, as well as some test cases covering certain areas of its functionality, it provides data for setting up architecture-source code and unit test-source code links.
The Java Binary Block Parser (JBBP)⁸ is a framework for parsing binary block data in Java, supporting various data types. JBBP was selected due to the variety of artefacts it contains: most Java classes are covered by test cases, and the system also allowed the extraction of a use case diagram, providing another dimension to the artefact data used in trace link establishment.
Finally, Titan⁹ is an open source distributed graph database designed to support complex and real-time traversal queries on large graphs and concurrent transactions. The project provides test cases, Java source code, as well as a conceptual architecture artefact for extraction.
The various metrics provided by the repositories, such as lines of code, number of contributors and number of commits, allow the sizes of the candidate systems to be compared. It is concluded that MMS and JBBP represent one end of the spectrum, characterised by a smaller size; JGAP and MyRobotLab are larger systems, followed by Titan, while Neo4j is the largest of the candidate systems.

⁴ https://code.google.com/p/maze-solver/
⁵ http://jgap.sourceforge.net
⁶ https://github.com/neo4j
⁷ https://github.com/MyRobotLab/myrobotlab
⁸ https://github.com/raydac/java-binary-block-parser
⁹ https://github.com/thinkaurelius/titan

| System | Description | Source Repository | Lines of Code (LOC) | Number of Contributors / Commits |
|---|---|---|---|---|
| MazeSolver | Micro-mouse maze editor | Google Code | 9223 | 4 / 139 |
| JGAP | Java framework for Genetic Algorithms | SourceForge | 57200 | - |
| Neo4j | Graph database | GitHub | 152139 | 118 / 34995 |
| MyRobotLab | Java framework for robotics | GitHub | 133247 | 11 / 665 |
| Java Binary Parser | Java binary block data parser | GitHub | 27677 | 1 / 194 |
| Titan | Distributed graph database | GitHub | 107792 | 32 / 4422 |

Table 8.1: Comparison of candidate systems.

Artefact types: UML Use Case Diagram; Module View Architecture Diagram; Conceptual Architecture Diagram; UML Class Diagram; UML Sequence Diagram; Java Source Code; JUnit tests
Systems:
MazeSolver X X
JGAP X X
Neo4j X X X
Myrobotlab X X X
Java Binary Block Parser X X X
Titan X X X

Table 8.2: Extracted artefacts.
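As a small illustration of the size comparison, the sketch below orders the candidate systems by the lines of code figures reported in Table 8.1. The helper name `rank_by_loc` is hypothetical, and lines of code is only one of the metrics considered in the text; ranking by contributors or commits may give a different order.

```python
# Lines of code per system, copied from Table 8.1.
LOC = {
    "MazeSolver": 9223,
    "JGAP": 57200,
    "Neo4j": 152139,
    "MyRobotLab": 133247,
    "Java Binary Parser": 27677,
    "Titan": 107792,
}


def rank_by_loc(loc: dict[str, int]) -> list[str]:
    """Return system names ordered from smallest to largest LOC."""
    return sorted(loc, key=loc.get)


if __name__ == "__main__":
    for name in rank_by_loc(LOC):
        print(f"{name}: {LOC[name]}")
```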
Table 8.2 shows the types of artefacts extracted from the systems. It can be seen that Java source code was available in all repositories and most repositories allowed the extraction of unit test artefacts, while every other artefact type was found only in single repositories.
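To make the notion of traceability scenarios concrete, the following minimal sketch lists, for each system, the artefact pairs between which trace links could be set up. The artefact sets are transcribed from the system descriptions above. Two things are assumptions of this example: the description of Myrobotlab does not name its diagram type, so the generic label "Architecture diagram" is used, and every non-source artefact is paired with Java source code, which the text states explicitly only for Myrobotlab. The names `ARTEFACTS`, `SOURCE` and `link_scenarios` are hypothetical.

```python
SOURCE = "Java source code"

# Artefact types per system, taken from the system descriptions in the text.
# "Architecture diagram" for Myrobotlab is a generic label (assumption).
ARTEFACTS = {
    "MazeSolver": {"UML class diagram", SOURCE},
    "JGAP": {"JUnit tests", SOURCE},
    "Neo4j": {"Module view architecture diagram", "JUnit tests", SOURCE},
    "Myrobotlab": {"Architecture diagram", "JUnit tests", SOURCE},
    "Java Binary Block Parser": {"UML use case diagram", "JUnit tests", SOURCE},
    "Titan": {"Conceptual architecture diagram", "JUnit tests", SOURCE},
}


def link_scenarios(artefacts: set[str]) -> list[tuple[str, str]]:
    """Pair each non-source artefact type of one system with source code."""
    return [(a, SOURCE) for a in sorted(artefacts) if a != SOURCE]


if __name__ == "__main__":
    for system, artefacts in ARTEFACTS.items():
        for left, right in link_scenarios(artefacts):
            print(f"{system}: {left} <-> {right}")
```

Keeping the artefact sets as plain data makes it straightforward to add a system or to pair artefacts differently, for example linking diagrams to unit tests, without changing the function.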