Mining Software Repositories

CHAPTER 2 CRITICAL LITERATURE REVIEW

2.2 Related Work

2.2.6 Mining Software Repositories

Many software development activities and stages are now performed using dedicated tools. For instance, source code is mostly managed with version control systems like Git; code review and testing rely on code review tools like Gerrit and issue tracking systems like Bugzilla; and mailing systems or IRC are used for communication in various work environments. Hassan and Xie (2010) group the repositories behind such tools into three categories in MSR: Historical (such as source control or bug repositories), Runtime (such as deployment logs), and Code (such as Google Code) repositories. These tools provide ample valuable data that help us better understand software projects, for example development-related behaviour, defect prediction, and code change recommendation. Open source software projects give researchers access to a variety of software repositories, especially those of large software systems, which have so far been the center of attention for many studies in this domain (Zimmermann, 2007; Moura et al., 2015; Tian et al., 2012). MSR researchers analyze these rich data sources to derive practical and applicable facts about software systems and projects, and thereby support software projects with recommendations and guidelines based on this information. In this thesis, we applied MSR techniques to historical repositories (like issue reports and mailing lists) in open source projects to extract affects from the recorded discussions and communications of developers.

However, one of the major challenges in MSR research, and in our study as well, is finding the links between different kinds of repositories, because there is not necessarily any standard or enforced practice that pushes developers to link them precisely, for instance to link a bug issue to a commit in source control. In our study, finding the link between issue reports and reviews of patches was one of our challenges too. We tried to address it by using suitable data sets from open source ecosystems like OpenStack and Eclipse, as these ecosystems are among the leaders in pushing developers to clearly link code changes to issues and reviews. In addition, almost every well-known tool is now integrated with GitHub, so this type of problem should be mitigated. However, this is part of a critical problem that MSR research may suffer from, called “systematic bias”, which impacts both the building of prediction models and the generalizability of hypotheses (Bird et al., 2009b). Systematic bias happens when the distribution of data is not fully random and balanced, i.e., the data is not representative of the population. For example, Bird et al. (2009b) showed that, for the bug-fix data sets normally used in bug prediction analysis in MSR, it is probable that only some developers submit bug reports, such as the developers of a fragile component, so the data set is not representative of all components.
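To make the linking problem concrete, the following is a minimal sketch of a common heuristic in MSR studies: scanning commit messages for issue identifiers. It is not the procedure used in this thesis. The regular expression, the function names, and the sample data are all assumptions made for illustration. The second function reports, per component, the fraction of issues that could be linked, which is one simple way to spot the kind of uneven coverage that leads to systematic bias.

```python
import re
from collections import defaultdict

# Hypothetical pattern (an assumption, not a project convention): matches
# references such as "bug 101", "issue #7", or "Closes-Bug: #102".
ISSUE_REF = re.compile(r"\b(?:bug|issue)s?\b[\s:]*#?(\d+)", re.IGNORECASE)


def link_commits_to_issues(commits, known_issue_ids):
    """Return {issue_id: [commit_sha, ...]} for references to known issues."""
    links = defaultdict(list)
    for sha, message in commits:
        for match in ISSUE_REF.finditer(message):
            issue_id = int(match.group(1))
            # Ignore numbers that do not correspond to a tracked issue.
            if issue_id in known_issue_ids:
                links[issue_id].append(sha)
    return dict(links)


def linked_ratio_by_component(issues, links):
    """Fraction of issues per component that have at least one linked commit."""
    total = defaultdict(int)
    linked = defaultdict(int)
    for issue_id, component in issues.items():
        total[component] += 1
        if issue_id in links:
            linked[component] += 1
    return {component: linked[component] / total[component] for component in total}


# Illustrative data only: (commit hash, commit message) pairs and issue -> component.
commits = [
    ("a1", "Fix crash on startup (bug 101)"),
    ("b2", "Closes-Bug: #102"),
    ("c3", "Refactor logging"),
    ("d4", "Follow-up for bug 101"),
]
issues = {101: "ui", 102: "core", 103: "core", 104: "core"}

links = link_commits_to_issues(commits, set(issues))
ratios = linked_ratio_by_component(issues, links)
print(links)
print(ratios)
```

If the linked fraction differs strongly between components, a model trained only on the linked issues would over-represent some components, which is the situation described above.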

Recently, Bird et al. (2015) illustrated best practices in MSR using the experiences of leading data scientists. They discuss different sorts of data sources such as code reviews, app stores, and log files. Hassan and Xie (2010) see the future of MSR as Software Intelligence (SI), which, like business intelligence, will support software practitioners, including owners and developers, in the decision-making process through fact-based support systems. Decisions such as when to release a software system or which parts of the system should be tested can then be based on recent and pertinent information offered by SI. Because this process builds on a well-studied science, it is likely to prevent the waste of large resources and to avoid expensive costs. They believe that recent advances in MSR are promising for the realization of SI in the near future.

Based on our literature review, we discussed current research on the impact of human affects in software engineering, which applies different techniques including biometric approaches. Our research aims to understand the importance of affect-related metrics obtained from various repositories that record communications among developers during the software engineering process. To study their importance, we considered an important criterion, “software quality”, based on the defect rate and the time to fix defects. Moreover, knowing the emotional state of the development team helps the manager create an environment capable of combating the effects of "bad" emotions. Thus, training the development team in stress management, communication, and assertiveness will improve the coping ability of the RE.

CHAPTER 3 RESEARCH PROCESS AND ORGANIZATION OF THE THESIS

This chapter presents the methodology of the research process and the structure of this dissertation. The thesis focuses on the three parts mentioned in the research hypothesis, as well as one preliminary part: 0) affect measurement in the software engineering context, 1) analyzing the link between affect-related factors and quality of work, 2) analyzing the link between affect-related factors and the time taken by work, and 3) investigating one possible solution to deal with conflicts in open source projects.

Our work tries to show the importance of developers' affects in the software development process and to help practitioners measure affect-related factors from software artifacts. Chapter 4 analyzes the presence and evolution of sentiments in developer and user mailing lists, and evaluates the application of one tool in the software engineering process. Chapters 5 and 6 investigate the link between affect-related factors and the quality of the resulting work, and the time of code reviewing and issue resolution, respectively. Finally, Chapter 7 analyzes the characteristics of one possible solution adopted in open source projects to deal with negative affects rooted in diversity that may cause debates, conflicts, and battles among participants.

3.1 Investigating the Presence and Evolution of Sentiment in Mailing Lists (Tourani et al., 2014)

Before studying the link between affect and quality/time in later chapters, we first need to analyze whether affect exists in software engineering communication. In particular, we wanted to explore the use of sentiment mining tools to identify such affects. For this study, we chose the mailing lists of two mature and successful open source projects, Tomcat and Ant, to extract the sentiment of developers and users during their communications, as mailing lists are one of the most popular media for discussion in open source software projects. This study followed up on our previous study (Murgia et al., 2014), which confirmed the existence of emotions in issue reports and also the feasibility of automated emotion mining from issue reports.

Using the state-of-the-art tool SentiStrength, we set out to identify extremely positive or negative feelings in emails, then manually evaluated these emails to understand their topics. Although we showed the presence and evolution of sentiment in the studied open source mailing lists, we also observed noise in the sentiment measures (because of the special way of sampling). To further explore this noise, section 8.2 evaluates and compares SentiStrength to one other cutting-edge sentiment mining tool.
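The selection step described above can be sketched as follows, under stated assumptions. SentiStrength reports a positive score from 1 to 5 and a negative score from -1 to -5 for a text. The threshold of 4, the record layout, and the sample scores below are hypothetical and are not taken from the study. The sketch only filters emails that were already scored, and does not call SentiStrength itself.

```python
from typing import NamedTuple


class ScoredEmail(NamedTuple):
    email_id: str
    positive: int  # 1 (no positive sentiment) to 5 (extremely positive)
    negative: int  # -1 (no negative sentiment) to -5 (extremely negative)


def select_extreme(emails, threshold=4):
    """Return (very_positive, very_negative) emails for manual inspection.

    The threshold is an illustrative assumption; an email may appear in both
    lists when it carries strong positive and strong negative sentiment.
    """
    very_positive = [e for e in emails if e.positive >= threshold]
    very_negative = [e for e in emails if e.negative <= -threshold]
    return very_positive, very_negative


# Made-up scores for four emails.
emails = [
    ScoredEmail("m1", 5, -1),
    ScoredEmail("m2", 1, -4),
    ScoredEmail("m3", 2, -2),
    ScoredEmail("m4", 4, -5),
]

very_positive, very_negative = select_extreme(emails)
print([e.email_id for e in very_positive])
print([e.email_id for e in very_negative])
```

Keeping the two scores separate, rather than summing them, preserves emails with mixed sentiment, which are plausible candidates for the manual review step.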

3.2 Investigating the Link Between Affect-Related Factors with Quality and
