1.8 Future Work

1.8.1 Extending the Lindholmen data set

In this section, we present our views on what the Lindholmen data set should grow into. These views are largely driven by the goal of building CoSARI (presented in paper E). Accordingly, the Lindholmen data set can be enriched by increasing its quantity (e.g. by adding more models), its quality (e.g. by curating the existing data set), or both. We discuss some directions in more detail next.

Adding more models and design documents. Currently, the Lindholmen data set contains UML models stored in what we found to be the most common formats: .uml, .xmi, and image formats (.jpg, .png, .bmp). In general, UML models may be stored in many other formats, e.g. formats that are specific to UML editors, such as .plantuml, .argo, .dia and .ump. One way to extend our corpus is to include more of these formats. In our approach, we looked for individual files that represent UML models. Another direction is to search in files that contain UML models alongside other information, such as Word (doc(x)), PDF, HTML and PowerPoint (ppt) documents, among others. To capture these types of models and design documents, the data collection process (described in paper B) needs to be extended to accept more tool-specific and document-specific formats. This will require the creation of new tools and techniques that automatically extract models and/or design knowledge from such formats.

Similarly, software models that do not conform to the UML standard can be added to the data set. These can be models written in UML-like languages, such as SysML or Capella, or models from other model-based approaches, such as Simulink.
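As an illustration of the first step of such an extended collection process, the following minimal sketch groups candidate files by extension. The three groups reuse the formats listed above; the function name and the group labels are our own assumptions for this sketch and are not part of the tooling described in paper B. An extension match only yields a candidate: an image or a PDF still has to be checked for actual model content.

```python
from pathlib import PurePosixPath

# Formats already covered by the data set, as listed in the text above.
CURRENT_FORMATS = {".uml", ".xmi", ".jpg", ".png", ".bmp"}
# Tool-specific formats named as possible extensions.
TOOL_SPECIFIC_FORMATS = {".plantuml", ".argo", ".dia", ".ump"}
# Document formats that may embed models next to other information.
DOCUMENT_FORMATS = {".doc", ".docx", ".pdf", ".html", ".ppt"}


def classify_candidate(path: str) -> str:
    """Return a coarse group label (hypothetical) for a repository file path."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in CURRENT_FORMATS:
        return "current"
    if suffix in TOOL_SPECIFIC_FORMATS:
        return "tool-specific"
    if suffix in DOCUMENT_FORMATS:
        return "document"
    return "other"


if __name__ == "__main__":
    for example in ["docs/design/Model.XMI", "arch/classes.plantuml",
                    "docs/architecture.pdf", "src/main.py"]:
        print(example, "->", classify_candidate(example))
```

Files in the "tool-specific" and "document" groups are the ones for which new extraction tools would be needed.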

Adding models from industrial cases. Currently, the Lindholmen data set contains UML models from OSS projects. While some open source projects are driven by companies, we could not identify these in our data set. In the future, the Lindholmen data set could be extended by involving more industrial cases. Given that many companies use Git as their versioning system, the technical process of collecting data from a company's Git repositories can be carried out in the same manner as our approach for GitHub. The main challenge here will be to convince companies to share their designs publicly. Governments could possibly play an exemplary role here. Studying these cases will allow us to compare how UML is used in OSS and industrial settings.

Adding more software development artifacts. One main reason why we collected UML class diagrams is that they are a commonly used representation for software architecture and designs. However, software designs may be represented in other notations, and it could be interesting to compare these and their effectiveness in software projects.

Besides models and design documents, various other software development artifacts can be collected in order to understand the contexts in which software modeling practices are used. Currently, the Lindholmen data set contains only descriptive (meta-)data of the projects in which UML models were found. Other software development artifacts, such as source code, issues, mailing lists and wiki documents, could also be collected. Clearly, this requires extra effort to collect and curate relevant data outside the Lindholmen data set. For example, in paper F, we collected "issues" directly from GitHub via the GitHub API as a means of operationalizing the defect-proneness of software projects.
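To make this kind of collection concrete, the sketch below retrieves issues through the GitHub REST API endpoint `GET /repos/{owner}/{repo}/issues`. It is a minimal illustration and not the procedure used in paper F: the function names and the `max_pages` parameter are assumptions of this sketch, requests are unauthenticated (and therefore subject to low rate limits), and error handling is omitted. Note that this endpoint also returns pull requests, which carry a `pull_request` key and are filtered out here.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def issues_url(owner, repo, page=1, per_page=100, state="all"):
    """Build the URL for one page of a repository's issues."""
    query = urllib.parse.urlencode(
        {"state": state, "per_page": per_page, "page": page})
    return f"{API_ROOT}/repos/{owner}/{repo}/issues?{query}"


def drop_pull_requests(items):
    """Keep only real issues; pull requests carry a 'pull_request' key."""
    return [item for item in items if "pull_request" not in item]


def fetch_issues(owner, repo, max_pages=1):
    """Fetch up to max_pages pages of issues (hypothetical helper, no auth)."""
    issues = []
    for page in range(1, max_pages + 1):
        request = urllib.request.Request(
            issues_url(owner, repo, page),
            headers={"Accept": "application/vnd.github+json"})
        with urllib.request.urlopen(request) as response:
            batch = json.load(response)
        if not batch:  # an empty page means there is nothing left to read
            break
        issues.extend(drop_pull_requests(batch))
    return issues
```

A call such as `fetch_issues(owner, repo, max_pages=5)` returns a list of dictionaries. Which fields of these are used to measure defect-proneness is a study design decision that this sketch does not cover.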

We foresee several challenges in collecting various software development artifacts: i) crawling large volumes of data is not always technically easy given limited computational resources; ii) many of the interesting artifacts exist outside of GitHub repositories, so collecting them requires extra effort in building tools and cleaning noisy data; and iii) including more types of artifacts further increases the number of file representations and formats that need to be supported. To address these issues, we specifically call for joint efforts by multiple research teams. One way to collaborate could be for each team to take responsibility for collecting and maintaining one (or two) types of artifact(s) in the corpus.

More and more data curation. Data curation is an important activity for keeping the metadata current and for making data more accessible, easier to find, more descriptive, and more relevant. Data curation becomes even more critical given the expected increase in the amount of data in the future. In conducting this Ph.D. study, we learned that the data set can be curated both manually and automatically. While manual curation approaches are often time-consuming, automated curation approaches require careful validation to ensure adequate accuracy. In the future, we can explore hybrid approaches in which knowledge about the objects to be curated is used to improve the performance of automated curation [76, 77]. For example, we can probably improve the performance of the classification models presented in paper D by using knowledge about the performance of the classification features.

In addition, data curation can benefit from adding annotations to the data set. In particular, annotations can be made at the project and model levels. For example, at the project level, annotations such as “project license”, “business domain”, the “goals of the project when using models” (for design or documentation), and the “general impacts of using UML” can be employed. At the model level, annotations on the layout style of the UML model, the tool that was used to generate the model, the general role of the model, the quality of the model, etc. could be very beneficial.
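The two annotation levels described above could, for instance, be represented as simple records. The following is only a sketch under our own assumptions: the class names, field names and the example values are hypothetical and do not describe an existing schema of the Lindholmen data set. All annotation fields are optional, since annotations would typically be filled in incrementally.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ModelAnnotation:
    """Model-level annotations (field names are illustrative assumptions)."""
    model_path: str
    layout_style: Optional[str] = None
    generating_tool: Optional[str] = None
    role: Optional[str] = None
    quality: Optional[str] = None


@dataclass
class ProjectAnnotation:
    """Project-level annotations plus the annotated models of the project."""
    project: str
    license: Optional[str] = None
    business_domain: Optional[str] = None
    modeling_goal: Optional[str] = None  # e.g. "design" or "documentation"
    uml_impact: Optional[str] = None
    models: List[ModelAnnotation] = field(default_factory=list)


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    project = ProjectAnnotation(project="example-owner/example-repo",
                                modeling_goal="design")
    project.models.append(ModelAnnotation(model_path="docs/model.uml"))
    print(asdict(project))  # nested dictionary, ready to be serialized
```

Keeping model annotations nested under their project preserves the link between a model and the context in which it was found.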

In document Empowering Empirical Research in Software Design: Construction and Studies on a Large-Scale Corpus of UML Models (Page 60-62)