Refining Low Quality Review Queue - Holistic recommender systems for software engineering

5.5 Discussion

5.5.3 Refining Low Quality Review Queue

Following and extending the discussion on the quality function, we can see that popularity metrics and readability metrics are the most useful metric sets to refine the low quality review queue. Readability metrics are eﬀective alone and in combination with popularity metrics (Table 5.17), thus connecting readability to good quality posts, and both metric sets complement the simple textual features currently in use at Stack Overflow. Thus, we recommend to take into account popularity and readability metrics in the quality assessment of posts. For example, while an automatic refinement system could be long-term goal, a simple recommendation system could be more eﬀective in the short term to help reviewers to discard posts that score high in popularity and readability metrics, or review first posts that score particularly low.

5.6 Threats to Validity

Construct Validity

Threats to construct validity are concerned with whether what one measures is what one intends to measure. In our case, there could be several reasons why the considered quality of the questions is incorrect. We rely on the judgement of the SO crowd to differentiate the quality of questions, which is a potentially error-prone process. In fact, the perceived quality might be different from person to person, and might be based on different definitions. This issue is alleviated by the fact that we also manually examined more than 100 questions not only to get insights on their features, but also to verify whether the choices made by the users were reasonable. Moreover, Stack Overflow relies on the same criteria.

Another issue regarding quality is the definition of the class A in our dataset (Table 5.1): ‘very good’ questions. We defined ‘very good’ questions as those with a Stack Overflow score higher than 7. We chose this threshold to obtain balanced datasets, and as a reasonable trade-oﬀ between choosing a too high value (which would have only included questions regarding trending topics) and a very low one (which would have included questions only inspected by few users).

Finally, the choice of balanced dataset could have impacted the results of the machine learning models. Nevertheless, this should not be a problem when enough training data is available. Statistical Conclusion

Threats to statistical conclusions concern the fact that the data is enough to support claims. We considered statistically significant samples in our experiments; this was possible because we relied on the crowd assessment and not on manual inspections of questions.

External Validity

External validity threats are concerned with the generalizability of results. The approaches we tried may show diﬀerent results when applied to a Q&A website other than Stack Overflow. To alleviate this issue, we chose to include questions related to any valid topic in the technical forum, thus including a very large population. An evaluation of our approach that involves other Q&A websites could measure the eﬀect of this threat.

5.7 Conclusions 89

5.7 Conclusions

We devised, implemented and illustrated an approach to classify question quality, and in the same way understand what fundamentally influences and characterizes it. We devised three sets of metrics that capture both textual features of a question and the reputation of the user who asked it. Together with these metrics, we constructed two datasets from a Stack Overflow data dump which captured what the community identifies as a question’s quality. We began by devising two types of quality for a question in Stack Overflow (i.e., ‘Good Quality’, ‘Bad Quality’) and we then extended our definition to four level of quality (i.e., ‘Very Good’, ‘Good’, ‘Bad’, ‘Very Bad’), by imposing some empirical thresholds, based on data balancing.

We also conducted a twofold empirical study aimed at classifying Stack Overflow questions’ quality and understand how the metrics we devised influence it. In the first part, we used a machine learning algorithm to infer decision trees. In the second part, we used genetic algorithms to learn linear quality functions that describe a question’s quality. We encoded quality as a function classifying bad quality questions with negative values, and high quality questions with positive values, thus representing diﬀerent shades of quality where the extremes represent very bad and very good levels. By analyzing the tails of such quality functions, and in particular intersections of them, we were able to reach prediction results that can be beneficial for an automatic quality classification approach, and confirm that popularity metrics are the best predictors, and identify which specific metrics strongly influence good or bad quality.

Reflections

This chapter shows how the estimation of the quality of a development artifact, like a Stack Overflow discussion, requires a heterogeneous analysis of the information that goes beyond the textual part of the artifact.

Even when focusing on the narrative of the artifacts, the simple analysis of the terms, or characters, as performed by the Stack Overflow team, poorly performs if compared to more elaborated analysis of the narrative, like readability indexes, or to complementary information like popularity metrics.

It is worth noting that according to the studies performed in this chapter, the best results are obtained by combining diﬀerent type of information together. For example, the combination of readability and popularity metrics led to better performance, thus highlighting once again that the heterogeneity of the information can be exploited to improve current approaches.

In the next part we focus on extracting and modeling heterogeneous data from development artifacts, breaking the boundaries imposed by the narrative, and discovering heterogeneous elements like code elements immersed in it. We will see how this novel modeling phase steps out from the canonical, textual oriented, analyses of development artifacts, and how it can be leveraged to devise applications which preserve the multifaceted nature of the information.

Part

III

Parsing and Modeling

Unstructured Data

6

Automated Multi-Language Parsing and Modeling of

Software Engineering Artifacts

Extracting and analyzing non-textual elements in development artifacts is a non-trivial task. Indeed, code fragments unavoidably overlap with narrative, causing potential undesired matches. For example, many words in natural language can be valid identifiers for a programming language (e.g., valid type names, identifiers).

Oﬀ-the-shelf parsers like the Eclipse JDT are not designed to extract code elements from the narrative, and even if the code is isolated (e.g., in a <code>tag), such parsers generally fail to parse incomplete code fragments into a meaningful and correct abstract syntax tree (AST). Another challenge concerns the presence of multiple languages like JSON and XML, or structured elements like stack traces. Oﬀ-the-shelf parsers do not even consider these options. A viable way to overcome these limitations is to build an island grammar [Moo01, BCLM11], which is able to separate constructs of interest (i.e., islands) from the uninteresting constructs (i.e., water). We leverage island grammars to construct an analyzer capable of disentangling structured (e.g., Java, JSON) and semi-structured contents (e.g., Stack Traces) from the narrative.

This chapter describes an approach to automatically model the artifact contents through a Heterogeneous Abstract Syntax Tree (H-AST) that represents, in a unique structure, both textual fragments, interchange formats and languages of an artifact, thus enabling semantic modeling of its contents.

Structure of the Chapter

In Section 6.1 we present the foundations of our approach and its challenges. In Section 6.2 we investigate the two research questions, we describe a technique based on automated random testing to evaluate its eﬀectiveness, and we apply it in the context of Stack Overflow discussions. In Section 6.3 we conclude the chapter.

6.1 Multilingual Island Grammar

Our approach targets both web resources (e.g., Stack Overflow discussions) and non-web artifacts. The former are normally represented and stored as HTML documents or fragments. Often the

<code>elements do not contain syntactically valid source code. Therefore, we treat the contents of <code>elements as document text, rendering the whole HTML and considering the resulting

text. Our approach can also be applied to non-web resources like plain text emails, since it works on the document’s representation.

Given this context, we extend an approach based on island grammars [Moo01, BCLM11] to disentangle code constructs from the natural language narrative. In the literature, island grammars have been used to extract Java constructs of interest from mailing lists [BCLM11]. We extend it to enable the modeling of multi-language code fragments (i.e., JSON, XML, and stack traces) supporting both complete (e.g., class and method declarations) and partial con- structs (e.g., variable declarations, statements), even when they are interleaved with narrative. Furthermore, we support incomplete constructs like method declarations without return type or without body.

We proceed to describe the basic features of our approach, how we leverage it to extract and model Java fragments, and how we extended it to support multi-lingual contents.

In document Holistic recommender systems for software engineering (Page 108-114)