Issue Classification and Misclassification: Theoretical Underpinnings

Open information environments (OIEs) are environments in which new sources and uses

of information emerge and where the users of information, while having access to

different sources of information, often have no control over them (Parsons and Wand,

2014). The stakeholders in OIEs are usually information contributors (sources),

information consumers (users) and OIE sponsors (Parsons and Wand, 2014). OSS issue

repositories are OIEs in which the sources of information (issue data) are the developers

and non-developer users who report issues, and the users are the specific project

developers/maintainers who use the issue data for implementation purposes. The sponsors

could be the management team of web-based OSS environments (e.g., GitHub) and

organizations interested in OSS development (e.g., Red Hat), who may pay their

employees to participate in OSS development.

Parsons and Wand identify accommodating semantic diversity and

ensuring information quality as key requirements for OIEs to be successful. Different

users and sources are likely to have different views/interpretations of the information

generated in OIEs; these different views can lead to different meanings being assigned to

the data generated or data available, and views can change over time (Parsons and Wand,

2014). OIEs therefore need

mechanisms in place for accommodating the different and evolving views of sources and

users. The different contributors of OSS issue data (non-developer users and developers)

and users of this information (project developers, maintainers, and the interested

readers) come from diverse backgrounds (e.g., different organizations and nationalities),

and are geographically distributed (e.g., Crowston et al., 2012) and, therefore, can be

expected to possess diverse views about the issues concerning OSS projects. For example,

individuals from different countries could have very different views about user interfaces

(Schmid, 2014).

Individuals can observe some characteristics of objects in a domain and form their

individual perceptions about what they have observed; they could form different

conceptualizations about the same characteristics of an object (diverse views), and these

conceptualizations could even change over time for an individual (evolving views)

(Lukyanenko and Parsons, 2015). Issue data submitted by an individual issue reporter can

be viewed as information about some phenomena in the application domain of the

particular OSS project. Multiple issue reporters could observe the same issue, form

diverse perceptions about it, and report their individual descriptions about the observed

issue. This could result in issue information from different reporters being perceived as

duplicate by the project developers/maintainers. An example of duplicate issue

information is provided below:

Three different submitted bug reports perceived as duplicates of each other by project developers (source: OpenOffice Bugzilla, https://bz.apache.org/ooo)

[1] “The rows are way too small (in fact I can’t see a thing). I had to upsize the fonts to 22 to get a

decent view of the sheet.”

[2] “The default row height is set to 0.0 for all cells when first starting. I have been unable to find a

[...]”

[3] “Just installed 1.0 on redhat 7.2 with KDE 2.2. Open up a new spreadsheet. The rows are invisibly

tiny. I select all rows with ctrl-a, then go to menu format/row/height and the height is showing as 0.03 cm with the default checkbox checked on. I enter 0.5 cm in the edit box and the default checkbox turns itself of. I close the dialog and the rows are now large enough to type in. Bug: The rows should not open so small. Where did the default of 0.03 come from?”

In the above example, the three bug reports were submitted by different reporters

who observed the same issue, formed different individual conceptualizations of it (diverse

views) and reported their individual descriptions about the issue. Bettenburg et al. report

that OSS developers may not perceive duplicate issue information as a serious problem;

instead, duplicates may add useful information about an issue (Bettenburg et al., 2008). Hence,

diverse individual descriptions about the same issue could potentially help enrich the

issue information content. Since different issue reporters may make observations about

some phenomenon related to an OSS project in different ways, some may make rich

observations/mental visualizations and subsequently provide rich, detailed issue

information, while others may end up providing incomplete or incorrect issue information

as perceived by the project developers/maintainers. Incompleteness and incorrectness are

commonly occurring problems with the issue data in OSS issue repositories (Bettenburg

et al., 2008). In the above example on duplicate bug reports, it can be seen that the third

bug report has detailed information content, whereas the first bug report has limited

information content, potentially illustrating the differences in the mental

conceptualizations of their reporters at the time of reporting.

To support diverse and evolving views, that is, facilitate semantic diversity, a

desired property of OIE applications is that they should allow capturing and storing

information independently of any a priori classification (Parsons and

Wand, 2014; Lukyanenko and Parsons, 2015). Classification is a human-dependent

activity and can be greatly influenced by human characteristics such as experience and

knowledge (Lukyanenko et al., 2014b). The same thing may be classified differently by different individuals or the same thing may be classified differently by an individual at

different times (Lukyanenko et al., 2014b). For example, one individual may classify a passport as an identity document while another individual may classify it as a travel

document (Lukyanenko et al., 2014b). A priori classification presented in any IS artifact reflects fixed views that cannot easily accommodate the multiple and rapidly evolving

views that are commonplace in OIEs (Parsons and Wand, 2014). Fixed views imposed by

a priori classification can bias user-generated content to the views of a limited set of

contributors and prevent the inclusion of views of others (Lukyanenko and Parsons,

2015). This is because individuals can widely differ in their conceptualizations of objects

in some domain and individual conceptualizations can vary over time as well. As a result,

the views of many potential contributors may not match with the limited view that an a

priori classification imposes (Lukyanenko and Parsons, 2015). When information

contributors are unfamiliar with the classes presented by an information system artifact to

them, the result is a forced choice which does not match with the perceptions of the

information contributors (Lukyanenko et al., 2014b). This can have negative impacts such as lower quality of contributed information and information loss (Lukyanenko et al.,

2014b). As an example in the context of OSS issue repositories, comment 27 (Table 38) points out issues that are edge cases; for example, issues that are both a bug and an

enhancement. Other combinations are possible, such as bug-documentation or bug-[...]. The a priori classification imposed

by an issue gathering interface such as Bugzilla cannot accommodate such diverse cases and

may result in loss of such information.
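
The information loss described above can be illustrated with a minimal sketch. It assumes a hypothetical data model, not Bugzilla's actual schema: `IssueType`, `MultiLabelIssue`, `SingleTypeIssue` and `force_single_type` are names introduced here purely for illustration. The sketch shows what may happen when an edge-case issue that is both a bug and an enhancement must be stored under exactly one a priori class.

```python
from dataclasses import dataclass, field
from enum import Enum


class IssueType(Enum):
    """Hypothetical a priori classes; not the actual Bugzilla field values."""
    BUG = "bug"
    ENHANCEMENT = "enhancement"
    DOCUMENTATION = "documentation"


@dataclass
class MultiLabelIssue:
    """An issue as the reporter may perceive it: zero or more classes."""
    description: str
    labels: set = field(default_factory=set)


@dataclass
class SingleTypeIssue:
    """An issue as a single-choice type field stores it: exactly one class."""
    description: str
    issue_type: IssueType


def force_single_type(issue, chosen):
    """Store the issue under one class and report which classes were dropped."""
    if chosen not in issue.labels:
        raise ValueError("chosen type must be one of the issue's labels")
    dropped = issue.labels - {chosen}
    return SingleTypeIssue(issue.description, chosen), dropped


# Illustrative edge case (made-up description): both a bug and an enhancement.
edge_case = MultiLabelIssue(
    "hypothetical report that is both a defect and a feature request",
    {IssueType.BUG, IssueType.ENHANCEMENT},
)
stored, dropped = force_single_type(edge_case, IssueType.BUG)
```

In this sketch, `dropped` holds the classes that the single-choice field could not record, which is one simple way to think about the loss of information for edge cases.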

In OSS issue repositories, this would mean that the issue reporters should be able

to specify their issue information as it is in their minds without having to worry about

assigning it to some a priori classes/labels that an issue gathering interface provides.

In other words, OSS issue gathering interfaces should capture issue information from

reporters without imposing the need to assign specific class labels to them while creating

and submitting issues. This can clearly support the diverse views of many different issue

reporters distributed across the globe. For example, consider the label issue type in the

Bugzilla issue reporting interface. If a reporter chooses enhancement as the type of his/her

issue, in order to be certain it is indeed an enhancement, he or she needs to be certain that

the requested characteristic is not already in the software which would mean having a

good knowledge of the current functionalities and characteristics of the software. This is

often unlikely to be the case. Consider another label, priority (severity is

similar to this). Prioritization of requirements often involves groups of requirements and

stakeholders; for example, high priority would mean a requirement is likely to be

implemented much before several other requirements or that it is more important in

comparison to several other requirements to a group of stakeholders (Firesmith, 2004). A

lone issue reporter submitting a single issue at some time point may not have a very good

idea of priority and severity of the issue he or she is submitting. This is also indicated by

the following comment of a responding OSS developer in the second survey: #1:

“Reporters are very rarely able to accurately decide priority, severity or any of the other

[...]

established with a quick back and forth with the reporter.” Hence, by asking the issue reporter to assign such labels to their issue description, issue reporting interfaces such as

that of Bugzilla appear not to accommodate well the diversity in the views of issue

reporters (e.g., those issue reporters who do not have enough knowledge to provide all

labels). As a result, many issue reporters may provide incorrect labels (e.g., see developer

comment 18, Table 37), and the issue description gets stored along with those incorrect

labels. Thus, enforcing a priori classification at the time of creation of issues is a potential

contributor to misclassification.

On the other hand, many OSS issue gathering interfaces (e.g., GitHub) provide a

simple interface that seeks to capture just the issue description from the issue reporter.

The issue reporters do not need to add any labels or classes to their issue information and

the issue information gets stored independent of any classes/labels. In GitHub, only

project developers/maintainers can assign labels to the submitted issues

(https://help.github.com/articles/creating-an-issue/). Thus, the decision makers (project

developers/maintainers) can infer the labels/classes (e.g., whether an enhancement,

feature request or a bug) for a particular issue from the issue description itself, provided

sufficient information has been provided in the description (cf. Lukyanenko et al., 2014)

and the issue reporters are not forced to classify/label their issues at the time of creation

of their issues.
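
The class-independent approach described above can be sketched as a small data model. This is an illustrative assumption rather than GitHub's actual implementation or API: `Issue`, `IssueTracker`, `submit`, `assign_label` and `remove_label` are hypothetical names. The sketch stores the reporter's description without any label and lets only project maintainers attach or revise labels afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """An issue captured independently of any class/label."""
    reporter: str
    description: str
    # Maps each label to the maintainer who assigned it; empty at submission.
    labels: dict = field(default_factory=dict)


class IssueTracker:
    """Hypothetical tracker: reporters describe, maintainers classify."""

    def __init__(self, maintainers):
        self.maintainers = set(maintainers)
        self.issues = []

    def submit(self, reporter, description):
        """Store the description as given; no label is requested."""
        issue = Issue(reporter, description)
        self.issues.append(issue)
        return issue

    def _require_maintainer(self, user):
        if user not in self.maintainers:
            raise PermissionError(f"{user} is not a project maintainer")

    def assign_label(self, issue, label, by):
        """Attach a label inferred from the description (maintainers only)."""
        self._require_maintainer(by)
        issue.labels[label] = by

    def remove_label(self, issue, label, by):
        """Revise an earlier classification (maintainers only)."""
        self._require_maintainer(by)
        issue.labels.pop(label, None)
```

Because labels live outside the description and can be added or removed later, the same stored issue can carry several classes at once and its classification can change as views evolve, without touching what the reporter wrote.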

GitHub and Bugzilla issue gathering interfaces represent two popular, but

different, ways of capturing and storing issue information from reporters in the OSS domain.

[...] these interfaces are widespread and need improvement. Specifically, the true goal is getting the problem fixed. Therefore both interfaces would benefit by having a prominent area to accelerate any fix….” Google Code, GitLab and CodePlex are examples of OSS development environments that use an issue reporting interface (shown in Appendix 4)

similar to that of GitHub whereas Jira is an example that is similar to Bugzilla.

Differences in how an information system captures information from contributors

can influence the quality of that information; for example, putting restrictions on

contributors can result in information loss (Lukyanenko et al., 2014). Therefore, it is

important to investigate how the two different approaches to issue data gathering in OSS

issue repositories may affect the information quality of issue data. This becomes even

more important considering the recent demands to GitHub management from some

GitHub developers for a complex issue reporting interface similar to that of Bugzilla.3 In this third phase of the research, I take a qualitative approach to explore how the two

different issue gathering approaches in OSS development may contribute to the

misclassification problem (an information quality problem with OSS issue data) and what

can be done at the interface level for mitigating the misclassification problem.

The next section describes in greater detail the research methodology.

3 See, for example: https://github.com/dear-github/dear-github/issues/59; https://github.com/dear-github/dear-github/issues/72

In document Understanding and improving requirements discovery in open source software development: an initial exploration (Page 163-170)