Stack Overflow as Source of Information - Holistic recommender systems for software engineering

from the two (i.e., editing history, code contents, and recent browsing). Codetrail makes the web browser an additional IDE view, which enables features such as bookmarking web resources in the source code from both the IDE and the web browser, and automatic documentation browsing in the web browser by synchronizing with the element currently under the cursor in the IDE. A similar approach is proposed by Hartmann and Dhillon [HDC11] in HyperSource, an augmented IDE that associates browsing histories with source code edits. Their approach takes advantage of both the web browser and the IDE to track the history of visited pages and code edits, establish a link between the two, and mark the web resources directly in the code editor where the code change was performed. These marks are interactive and allow the developer to review the browsing history concerning a specific change.

Sawadsky et al. [SMJ13] followed a similar idea and developed Reverb, a tool that extends both the Eclipse IDE and the Chrome⁷ web browser. Their approach monitors the web pages visited in the web browser and indexes them with Apache Lucene⁸; it also tracks interactions in the IDE to determine which code element is currently displayed. When the displayed element changes, Reverb queries the Lucene index, retrieves a list of visited pages, and shows them directly in the IDE.
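The index-then-query loop described for Reverb can be illustrated with a minimal sketch. It is not Reverb's implementation: a toy in-memory inverted index stands in for Apache Lucene, and the names `VisitedPageIndex`, `add_page`, `query`, and `tokenize`, as well as the ranking by number of shared tokens, are assumptions made only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens, also breaking camelCase identifiers."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]


class VisitedPageIndex:
    """Toy inverted index over visited pages (a stand-in for a Lucene index)."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of URLs containing it

    def add_page(self, url, text):
        """Index a page when the browser reports that it was visited."""
        for token in tokenize(text):
            self._postings[token].add(url)

    def query(self, code_element):
        """Rank visited pages by how many tokens of the code element they contain."""
        hits = defaultdict(int)
        for token in set(tokenize(code_element)):
            for url in self._postings.get(token, ()):
                hits[url] += 1
        # Highest overlap first; ties broken alphabetically for determinism.
        return sorted(hits, key=lambda u: (-hits[u], u))


index = VisitedPageIndex()
index.add_page("https://example.org/a",
               "How to read a file line by line with BufferedReader")
index.add_page("https://example.org/b", "Parsing JSON in Java")

# Simulates the IDE reporting that the displayed element changed.
print(index.query("BufferedReader.readLine"))
```

In this sketch the query is built from the identifier of the displayed element alone; a real tool would likely use richer context and a proper relevance model.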

2.4 Stack Overflow as Source of Information

Among the available online resources, Q&A services provide developers with the infrastructure to exchange knowledge in the form of questions and answers [AZBA08]. Developers ask questions and receive answers regarding their issues from people who are not part of the same project, which amounts to crowdsourcing a task. Even though researchers pointed out that Q&A services could not provide high-level technical answers [NAA09, MMM+11], these services are filling “archives with millions of entries that contribute to the body of knowledge in software development” and they often become a substitute for the official project documentation [TBS11].

The impact of Stack Overflow on the way developers exchange and transfer knowledge [VSDF14] captured the attention of researchers, who raised a set of questions concerning the impact of this Q&A website on software engineering practices and tools [STvDC10]. For example, Storey et al. [SSC+14] found that while traditional channels (i.e., mailing lists, face-to-face communication) are still considered crucial, social media like Stack Overflow “have led to yet another paradigm shift in software development, with highly tuned participatory development cultures contributing to crowdsourced content”.

Stack Overflow relies on a very active community that asks and discusses a considerable number of questions daily, with an answer rate above 90% and a median answer time of only 11 minutes [MMM+11]. At the time of writing, according to the Stack Exchange Data Explorer⁹, Stack Overflow counts more than 6 million users, who have asked more than 12 million questions. This critical mass of information makes Stack Overflow an ideal resource for RSSEs.

⁷ https://www.google.com/chrome/
⁸ http://lucene.apache.org/

Stack Overflow can be leveraged as a source of information for RSSEs to automate the identification of relevant help online. For example, Cordeiro et al. [CAG12] proposed to process the contextual information of stack traces to retrieve pertinent Stack Overflow discussions and help developers in the IDE when a runtime error occurs. Their approach relies on the HTML tagging of the discussions to identify code blocks, and analyzes the code with a combination of regular expressions and Eclipse JDT, aimed at identifying stack traces and Java code, respectively. Another example is Dora [KDSH12], a tool integrated into the Visual Studio IDE¹⁰ that automatically queries and analyzes online discussions (e.g., Stack Overflow, Codeguru, Bytes, Daniweb, Dev Shed) to locate relevant solutions to programming problems. Dora searches for discussions by using the search engine provided by a website like Stack Overflow, and evaluates the quality of the retrieved discussions by relying on a model based on community-related features (e.g., number of replies, resolved answer).
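The idea of relying on HTML tagging to isolate code blocks, and on regular expressions to recognize stack traces, can be sketched as follows. This is only an approximation of the step described for Cordeiro et al.: the Eclipse JDT part is omitted, and the two patterns and the helper names `extract_code_blocks` and `classify_block` are assumptions rather than the expressions used in the original work.

```python
import html
import re

# Content of <code> elements as they appear in the HTML body of a post.
CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL | re.IGNORECASE)

# A typical Java stack frame, e.g. "at com.example.Foo.bar(Foo.java:42)".
STACK_FRAME = re.compile(
    r"^\s*at\s+[\w.$<>]+\((?:[\w$]+\.java:\d+|Native Method|Unknown Source)\)\s*$",
    re.MULTILINE,
)


def extract_code_blocks(post_html):
    """Return the unescaped text of every <code> element in a post body."""
    return [html.unescape(block) for block in CODE_BLOCK.findall(post_html)]


def classify_block(block):
    """Label a block 'stack_trace' if it contains at least one Java stack frame."""
    return "stack_trace" if STACK_FRAME.search(block) else "code"


post = (
    "<p>I get this:</p><pre><code>java.lang.NullPointerException\n"
    "    at com.example.Foo.bar(Foo.java:42)\n</code></pre>"
    "<p>Code:</p><pre><code>String s = null;\ns.length();</code></pre>"
)
for block in extract_code_blocks(post):
    print(classify_block(block))
```

A regular expression over HTML is fragile for nested or attributed tags; it is used here only because the text mentions regular expressions, and an HTML parser would be the more robust choice.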

Mixing Stack Overflow community features with textual features to retrieve relevant help is a goal targeted by several approaches that build their own search engine instead of reusing existing ones. For example, Campos et al. [CdSdAM16] devised a retrieval approach that considers pairs composed of a question and an answer, and evaluates them by combining three aspects: (1) the score of Apache Lucene, (2) the score received from the Stack Overflow community, and (3) a score that determines the how-to nature of a pair. Similarly, Zagalsky et al. [ZBY12] presented Example Overflow¹¹, a search engine for JavaScript code samples that allows the developer to retrieve samples for the jQuery¹² library according to a textual query. Example Overflow uses a score function to estimate the overall quality of code samples; the function mixes the score given by Apache Lucene with the community scores assigned to each part of the discussion (i.e., title, question, answers, code) from which the sample is taken.
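How a textual relevance score and a community score might be mixed can be shown with a small sketch. The cited approaches do not specify their formulas in this text, so everything here is an assumption: the min-max normalization, the linear combination, the default weight of 0.7, and the names `Candidate`, `min_max`, and `rank`.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    text_score: float     # relevance reported by the text search engine
    community_score: int  # e.g. votes received by the post; may be negative


def min_max(values):
    """Rescale values to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank(candidates, text_weight=0.7):
    """Order candidates by a weighted sum of normalized text and community scores."""
    if not candidates:
        return []
    text = min_max([c.text_score for c in candidates])
    community = min_max([c.community_score for c in candidates])
    combined = [text_weight * t + (1 - text_weight) * s
                for t, s in zip(text, community)]
    order = sorted(range(len(candidates)), key=lambda i: -combined[i])
    return [candidates[i] for i in order]


results = [
    Candidate("A", text_score=2.0, community_score=1),
    Candidate("B", text_score=1.8, community_score=50),
    Candidate("C", text_score=0.5, community_score=3),
]
print([c.title for c in rank(results)])
```

Normalizing both scores first keeps an unbounded vote count from dominating the bounded-range text score; with `text_weight=1.0` the ranking falls back to pure textual relevance.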

Stack Overflow discussions are also used as a resource to enrich existing documentation. For instance, Subramanian et al. [SIH14] presented Baker, a tool that augments API documentation (i.e., Javadoc) with code samples taken from Stack Overflow, and vice versa. Their approach employs the Eclipse JDT parser to reconstruct a partial AST of the code sample found between <code> tags to identify fully qualified names that pertain to API usages. The fully qualified names are then used to dynamically inject code samples into the API documentation and to link the documentation with the corresponding Stack Overflow discussion, thus favoring navigation between the two resources. A similar approach was devised by Treude and Robillard [TR16]. Their approach enriches current API documentation with summarized information taken from Stack Overflow discussions to describe usages of classes and methods. They employ regular expressions to identify code elements within HTML, and select relevant sentences by using SISE (Supervised Insight Sentence Extractor), a summarization approach that considers several factors concerning community aspects (e.g., user reputation, score, favorites, views), textual aspects (e.g., part-of-speech tags), and code-related aspects (e.g., API elements in the sentence) to build an extractive summary. The summary can thus be injected into the API documentation to provide additional information concerning real usages. Differently from the previous approaches, Wong et al. [WYT13] enrich software documentation by automatically generating comments for source code. Their approach, called AutoComment, leverages and refines the code descriptions found in Stack Overflow, identifies related code elements in source code by using code clone detection techniques, and generates comments accordingly.
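The regular-expression step mentioned for Treude and Robillard, i.e., spotting code elements in the sentences of a discussion, can be sketched as below. This does not reproduce SISE, which is a supervised model over many features; it only shows how candidate sentences mentioning an API type could be found. The two patterns, the sentence splitter, and the names `code_elements` and `sentences_mentioning` are assumptions for illustration.

```python
import re

# Heuristic patterns (assumptions): CamelCase type names with at least two
# "humps", and method-call-like tokens such as "append()".
CAMEL_CASE = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")
METHOD_CALL = re.compile(r"\b[a-z]\w*\(\)")

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def code_elements(sentence):
    """Return the sorted set of code-like tokens found in a sentence."""
    found = set(CAMEL_CASE.findall(sentence)) | set(METHOD_CALL.findall(sentence))
    return sorted(found)


def sentences_mentioning(text, api_type):
    """Return the sentences of a post that mention the given API type."""
    sentences = SENTENCE_END.split(text.strip())
    return [s for s in sentences if api_type in code_elements(s)]


answer = (
    "I had the same problem. "
    "Use StringBuilder instead of String when calling append() in a loop. "
    "It was much faster."
)
print(sentences_mentioning(answer, "StringBuilder"))
```

Single-word type names such as `String` are deliberately not matched by the CamelCase pattern, since they are hard to tell apart from ordinary capitalized words without further context.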

RSSEs for Stack Overflow were also designed to assess the quality of posts. Indeed, the quality of Stack Overflow posts has many implications for developers. On the one hand, the Stack Overflow community aims to maintain a certain level of quality in its posts, and lets the crowd judge and filter out low-quality posts. On the other hand, an automated approach can help the crowd identify such low-quality posts faster, and help developers avoid them when looking for help on Stack Overflow. For example, Correa and Sureka [CS13, CS14] devised an approach to automatically identify questions to be deleted or closed. Their predictive models estimate the quality of a question at creation time, and use a set of features concerning user

¹⁰ https://www.visualstudio.com
¹¹ http://www.exampleoverflow.net
¹² https://jquery.com
