• No results found

Finally, the discussions related to tasks 11, 12, 16, 17, 18, 24, 28, 30, and 36 have not been modified since July 2013. Thus, in this case too, the Prompter recommendations have been strongly influenced by the different results returned by the search engines in the two time periods.

Summary of RQ4. Although the recommendations provided by Prompter one year after Study I changed in 78% of cases, its performance did not deviate strongly: just 24% (against the previous 21%) of the newly recommended Stack Overflow discussions were classified by participants as not related to the task at hand. Manual analysis suggests that changes in the search engines, together with the volatility of the information exploited by Prompter, are the main reasons for the 78% of different recommendations after just one year.

4.6 Threats to Validity

Construct Validity

Threats to construct validity are related to the relationship between theory and observation. In Study I and in its replication, such threats are mainly due to (i) the fact that we mimic the code being written by a user by providing Prompter with a partially complete class, and (ii) the fact that users provide their evaluations on a Likert scale. Concerning the former, we made sure such classes were neither too detailed nor too empty, so that they represent realistic situations in which Prompter could be used. Concerning the latter, the Likert scale is a standardized evaluation scale used to collect participants’ feedback.

Study II overcomes the limitations of Study I mentioned above. In Study II, threats to construct validity are due to how we measured task completeness. We could have used a test suite to measure completeness in an objective manner. However, code inspection allows us to evaluate partial implementations as well. The use of checklists and multiple independent evaluators limited bias and subjectivity.

Internal Validity

Threats to internal validity are related to factors that could have influenced the results. For Study I, one factor to be considered is the participants’ knowledge (not known a priori) of the APIs used in each task. The availability of multiple participants with different degrees of experience mitigates this threat. Students taking part in our evaluation were not graded based on the task outcome, and we asked participants not to use other sources of information during the task, e.g., as a point of comparison for the provided discussion. In Study II, to limit the effect of participants’ skills and experience, we pre-assessed participants and used this information to assign them to the four groups. We also analyzed to what extent the usefulness of Prompter depends on the particular task.

For the replication of Study I, confounding factors could have influenced the results of both RQ3 (different recommendation rankings) and RQ4 (developers’ assessment). Specifically, as far as the ranking is concerned, we cannot exclude that the different position (or the complete disappearance) of a question in the search-engine rankings depends on changes or optimizations in the search engines themselves. Nevertheless, we believe this is one of the factors that affect the volatility of recommenders’ results, and one that cannot be controlled.

Concerning RQ4, different subjects gave a different evaluation for the same recommendation (already assessed in Study I). Participants could have judged the same recommendation as very relevant in Study I and not relevant at all in the replication, and vice versa. This can happen because of the large differences in experience among participants. To verify whether such a situation could have occurred, we statistically compared, using two-tailed Mann-Whitney tests, the ratings given to the 29 recommendations by participants in Study I and by participants in its replication. Results indicate the presence of a significant difference only for tasks 11 (p-value=0.03, median old study=4, median new study=3), 27 (p-value=0.0001, median old study=5, median new study=3), and 31 (p-value=0.02, median old study=4, median new study=2).

Conclusion Validity

For Study I we report descriptive statistics and violin plots of the collected results, along with participants’ feedback, while for its replication we use, whenever possible, appropriate statistical procedures, namely Wilcoxon paired tests and Cliff’s d effect size measures. For Study II, we used distribution-free tests (Wilcoxon, Mann-Whitney, and permutation tests) and effect size measures (Cliff’s d), which are suitable for limited data sets such as ours. Whenever multiple tests are performed on the same data, we adjust p-values using Holm’s procedure [Hol79].
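As a rough illustration of two of the procedures named above, the following sketch computes Cliff's d from its pairwise definition and applies Holm's step-down adjustment to a list of p-values. It is a minimal standard-library sketch, not the analysis scripts used in the studies; the function names and all ratings and p-values are invented for illustration.

```python
def cliffs_delta(xs, ys):
    """Cliff's d: P(x > y) - P(x < y), estimated over all (x, y) pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values, from the smallest to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Illustrative Likert ratings (invented, not data from the studies).
old_ratings = [4, 5, 4, 3, 5]
new_ratings = [3, 3, 2, 4, 3]
d = cliffs_delta(old_ratings, new_ratings)

# Illustrative raw p-values (invented, not results from the studies).
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

Cliff's d ranges from -1 to 1, with 0 indicating complete overlap between the two samples, which is why it pairs naturally with rank-based tests on ordinal ratings.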

External Validity

Threats to external validity concern the generalizability of our findings. In terms of participants, the study involved both professionals and students with different degrees of experience. We claim the study provides good coverage of the potential categories of users, although further studies with more participants are desirable. In terms of objects, we selected 37 tasks that differ in nature and in the technical knowledge they require. However, we cannot exclude that our results depend on the particular choice of tasks. For Study II, although we selected both students and industrial developers, it is worthwhile to replicate the study with a larger number of participants. Furthermore, Prompter was only evaluated on two tasks, which are not sufficiently representative of the tasks that developers perform. We believe that Study I achieves better external validity, whereas Study II achieves better construct validity.

Finally, concerning the Study I replication, it is possible that the different rankings and evaluations obtained for the recommendations pertinent to the 29 tasks depend on these particular cases. In other words, there might be tasks (e.g., those related to emerging technologies) for which recommendations are more "volatile", while other tasks (e.g., those related to consolidated programming practices, such as the use of the Java SDK) can be relatively more stable. Therefore, further studies are needed to confirm or contradict the results obtained in this study.

4.7 Conclusions

We have presented an approach to turn an IDE into the developer’s programming prompter. The approach is based on (1) automatically capturing the code context in the IDE, (2) retrieving documents from Stack Overflow, (3) ranking the discussions according to a ranking model, and (4) suggesting them to the developer when (and only if) the approach is sufficiently confident in them. We implemented our approach in Prompter, a tool embodying the ideal behavior a recommender should have: a silent observer of the developer that only intervenes when it deems itself to have a relevant enough suggestion, and that does not force the developer to invoke it but is always available in case the developer needs it. Through a quantitative study (Study I), performed via an online survey, we showed that the Prompter ranking model is effective in identifying the right discussions given a code snippet to analyze.
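The four steps of the approach can be pictured with the following minimal sketch. It is not Prompter's implementation: the `recommend` function, the `Discussion` type, the toy retrieval and ranking functions, and the threshold values are all hypothetical stand-ins chosen only to show the "suggest only above a confidence threshold" behavior.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Discussion:
    title: str
    score: float  # ranking-model score, assumed here to lie in [0, 1]


def recommend(code_context: str,
              retrieve: Callable[[str], List[str]],
              rank: Callable[[str, str], float],
              threshold: float) -> Optional[Discussion]:
    """Return the best discussion, or None when confidence is too low."""
    # Step 2: retrieve candidate discussions for the captured context.
    titles = retrieve(code_context)
    # Step 3: score every candidate with the ranking model.
    candidates = [Discussion(t, rank(code_context, t)) for t in titles]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d.score)
    # Step 4: stay silent unless the score reaches the threshold.
    return best if best.score >= threshold else None


def toy_retrieve(context: str) -> List[str]:
    # Hypothetical stand-in for a web search restricted to Stack Overflow.
    return ["How to read a file in Java", "Sorting a list in Python"]


def toy_rank(context: str, title: str) -> float:
    # Hypothetical stand-in for the ranking model: Jaccard token overlap.
    a, b = set(context.lower().split()), set(title.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


# Step 1 (capturing the context) is simulated by a fixed string.
suggestion = recommend("read file java", toy_retrieve, toy_rank, threshold=0.3)
silent = recommend("read file java", toy_retrieve, toy_rank, threshold=0.9)
```

With the lower threshold the sketch suggests the first discussion; with the higher one it returns nothing, mirroring a recommender that stays silent when it is not confident enough.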

In a second study (Study II) we evaluated Prompter during maintenance and development tasks. We showed that, from a quantitative point of view, Prompter significantly helps developers in completing the experiment tasks and that, from a qualitative point of view, developers appreciated its features and usability.

We also replicated Study I one year after the original experiment. Surprisingly, the results showed that, starting from the same code snippets, Prompter’s recommendations changed in 78% of cases due to the volatility of the information it mines from the web. Despite this, the new recommendations were still related to the task at hand in most cases. However, the results of the replication clearly highlighted that recommenders built on top of information mined from the web may experience strong changes in their behavior over time. As a consequence, replicating empirical studies aimed at evaluating such tools and techniques could be infeasible.

Reflections

This chapter presented an approach that does not depend solely on pure information retrieval. The ranking model described in Section 4.2.4 takes into account several aspects of the information. The textual similarity between a Stack Overflow discussion and the source code in the IDE is complemented by other types of similarity concerning code (e.g., method names and type names) and by community information unrelated to the source code (e.g., user reputation).

Even though this model takes some steps towards a holistic interpretation of the information, its implementation is still reductionist. Indeed, the elements composing the model are just weighted and combined in a linear function, forcing the overall ranking model to be a sum of similarities. Implicitly, this model imposes a fixed structure that an artifact needs to satisfy, i.e., in a Stack Overflow discussion, the code snippets need to match the same types and the same methods, while the narrative parts must use the same words as the code in the IDE.

In addition, the model is reductionist in the way it combines the information. According to Table 4.2, some of the weights assigned to metrics in the model are equal to zero (i.e., Code Similarity, API Types Similarity, Accepted Answer Score), thus excluding such metrics from the computation. The configuration reported in Table 4.2 is just one of the possible local maxima that can be obtained in the training phase. Other model configurations, not reported in the chapter, exhibit different weight distributions, sometimes eliminating some of the zero values. In other words, one configuration of the model might work better for a subset of the artifacts used in the training phase, while another configuration might better fit a different subset of the same set of artifacts.
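The effect of zero weights in a linear model can be made concrete with a small sketch. The feature keys and the non-zero weight values below are hypothetical and are not the values of Table 4.2; only the three zero-weighted metrics are taken from the text.

```python
def linear_score(features, weights):
    """Weighted sum of similarity features (a linear ranking model)."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


# Hypothetical configuration: non-zero values are invented for illustration.
weights = {
    "textual_similarity": 0.5,       # assumed value
    "method_names_similarity": 0.3,  # assumed value
    "user_reputation": 0.2,          # assumed value
    "code_similarity": 0.0,          # zero, as stated for Table 4.2
    "api_types_similarity": 0.0,     # zero, as stated for Table 4.2
    "accepted_answer_score": 0.0,    # zero, as stated for Table 4.2
}

# Two discussions that differ only in a zero-weighted metric.
discussion_a = {"textual_similarity": 0.8, "code_similarity": 0.1}
discussion_b = {"textual_similarity": 0.8, "code_similarity": 0.9}

score_a = linear_score(discussion_a, weights)
score_b = linear_score(discussion_b, weights)
```

Both discussions receive the same score even though their code similarity differs widely, which is the sense in which a zero weight removes a type of information from the ranking.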

The current ranking model highlights the need for a heterogeneous view of the information to assess the quality of an artifact. The current implementation treats the information in a reductionist way by excluding certain types of information that might play a prominent role for certain artifacts. In the next chapter we show how the heterogeneity of the information can further increase when it comes to evaluating the quality of the narrative of a Stack Overflow discussion.

5 Improving Low Quality Stack Overflow Post Detection

In the two previous chapters, Q&A websites like Stack Overflow played a prominent role as a source of information for developers. However, the quality of the contents provided by Q&A websites varies, and ranges “from high-quality questions and answers to low-quality, sometimes abusive content [, thus making] the tasks of filtering and ranking more complex than in other domains” [ACD+08]. In Stack Overflow, the task of keeping up the quality of questions is left to the crowd: poor quality posts are identified by a selected subset of users in the community (i.e., moderators) who have the right to close and delete questions.

As reported by Correa et al. [CS14], around 80% of the questions take at least 1 month to receive a delete vote, and approximately 14% receive 3 delete votes before actually being deleted. This latency in the deletion process is a symptom of the amount of effort required from moderators to guarantee a satisfactory level of quality in Stack Overflow.

In this chapter we propose an approach to automate the filtering process. We have devised a quality predictor that helps moderators identify poor-quality questions at creation time, thus reducing the review time. To do so, we have investigated the concept of quality for Stack Overflow questions and developed a classification approach.

Structure of the Chapter

In Section 5.1 we describe the Stack Overflow review queue process. In Section 5.2 we discuss how we construct the datasets we use for our analysis. In Section 5.3 we present the metrics that we use to construct our classifier. In Section 5.4 we then present our classifier and the results we obtain. In Section 5.5 we discuss our findings. In Section 5.6 we discuss the threats to validity, and we draw our conclusions in Section 5.7.

5.1 The Stack Overflow Review Queue Process

Low quality posts in Stack Overflow are identified through a review queue system managed by moderators (a restricted set of users with enough reputation to unlock specific privileges¹). Stack Overflow has 7 review queues:²

1. Late Answers: Answers which were posted much later than the question.

2. First Posts: First posts for users.

3. Low Quality Posts: Posts automatically determined to be of low quality based on several system criteria that generate a post quality score, or voted as such by users.

4. Close/Reopen Votes: Questions with active close votes or close flags show up in the close queue, and questions with active reopen votes, as well as questions which have been edited after closing, appear in the reopen queue.

5. Suggested Edits: Users without enough reputation to edit have their edits placed in this queue.

6. Community Eval: On the 60th day of beta, and every 90 days after that, this queue is filled with a set of posts which may be rated as “Excellent”, “Satisfactory”, or “Needs Improvement”.

¹ http://stackoverflow.com/help/privileges/
² http://meta.stackexchange.com/questions/161390/

When a post in a review queue receives 3 delete votes from moderators, it is deleted from Stack Overflow.³ The post remains visible to its author and to users with a reputation score above 10,000. A deleted post can be undeleted by moderators if and only if it receives 3 undelete votes.
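The deletion rule just described can be modeled in a few lines. This is only a sketch of the rule as stated in the text, not Stack Overflow's actual implementation; the `Post` class, its method names, and the resetting of vote counters after a state change are assumptions made for illustration.

```python
from dataclasses import dataclass

DELETE_VOTES_NEEDED = 3
UNDELETE_VOTES_NEEDED = 3
VISIBILITY_REPUTATION = 10_000


@dataclass
class Post:
    author: str
    delete_votes: int = 0
    undelete_votes: int = 0
    deleted: bool = False

    def vote_delete(self) -> None:
        if self.deleted:
            return
        self.delete_votes += 1
        if self.delete_votes >= DELETE_VOTES_NEEDED:
            self.deleted = True
            self.undelete_votes = 0  # assumption: counters restart

    def vote_undelete(self) -> None:
        if not self.deleted:
            return
        self.undelete_votes += 1
        if self.undelete_votes >= UNDELETE_VOTES_NEEDED:
            self.deleted = False
            self.delete_votes = 0  # assumption: counters restart

    def visible_to(self, user: str, reputation: int) -> bool:
        # Deleted posts stay visible to the author and to users
        # whose reputation is above the 10,000 threshold.
        if not self.deleted:
            return True
        return user == self.author or reputation > VISIBILITY_REPUTATION


post = Post(author="alice")
for _ in range(3):
    post.vote_delete()
```

After the third delete vote, the post in this sketch is hidden from ordinary users but still visible to "alice" and to any user whose reputation exceeds 10,000.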

The queue of direct interest to us is the Low Quality Posts queue, since it contains posts that have been automatically determined to be of low quality, using several system criteria that generate a post quality score, or that have been manually flagged by users. We focus on improving the efficiency of the Low Quality Posts review queue. In particular, we propose an approach to refine the queue by removing misclassified (i.e., good quality) posts while retaining the bad quality posts in the review queue.
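The intended refinement can be sketched as a filter over the queue. The predictor below is a placeholder that reads a precomputed, hypothetical `low_quality_score` field; it is not the classifier developed in this chapter, and the threshold value and post identifiers are likewise assumptions.

```python
def refine_queue(queue, predict_low_quality, threshold=0.5):
    """Split a review queue into posts to keep and posts to release.

    Posts whose predicted low-quality score reaches the threshold stay
    in the queue; the others are treated as misclassified good posts.
    """
    kept, released = [], []
    for post in queue:
        if predict_low_quality(post) >= threshold:
            kept.append(post)
        else:
            released.append(post)
    return kept, released


def toy_predictor(post):
    # Placeholder: returns a hypothetical precomputed score in [0, 1].
    return post["low_quality_score"]


# Invented queue entries, for illustration only.
queue = [
    {"id": 1, "low_quality_score": 0.9},
    {"id": 2, "low_quality_score": 0.2},
    {"id": 3, "low_quality_score": 0.6},
]
kept, released = refine_queue(queue, toy_predictor)
```

In this toy run, posts 1 and 3 remain in the queue for review, while post 2 is released as a likely good-quality post.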