6.3 Conclusion
We presented an automated approach to parse and model software engineering artifacts. We devised a multilingual island grammar to isolate and model constructs of interest from four different languages, namely Java (with full support for version 8), XML, JSON, and stack traces, through a full-fledged H-AST. We leveraged random testing to evaluate the robustness of the language grammars and their modeling capabilities in isolation. We compared our approach with an existing competing technique for extracting simple Java constructs, and we evaluated the island parser and the model reconstruction in a concrete setting, namely Stack Overflow discussions.
Reflections
Going beyond the boundaries imposed by the textual representation and modeling the contents in an H-AST gives a new perspective on how to manipulate the data of development artifacts. Indeed, with an H-AST the structure is preserved, and a development artifact can be navigated, or even have its contents transformed or modified. This additional level of abstraction opens up new possibilities to build novel tools and novel analyses. Some examples of these applications are presented in the next chapter.
7 Applications and Reusability
A fundamental aspect of any mining approach involving Stack Overflow posts concerns the intrinsic heterogeneity of the data: Posts are composed of both unstructured fragments representing the natural language part of the discussion, and structured fragments (e.g., Java code, XML, JSON), both co-existing in the same artifact. The pure extraction of constructs of interest from natural language leaves a conceptual “hole” in the process. For example, an analysis can focus on identifying relationships between XML configuration files (e.g., for the Android platform) and code samples, but after the extraction of structured fragments, the data still comes in the form of text. The multilingual grammar devised in the previous chapter goes beyond the plain textual representation of a development artifact, resulting in the H-AST model produced by the parser. This output model keeps track of the structure of the structured fragments within an artifact, filling the “hole” left by the absence of a model.
The next step is thus to model the extracted information and discover the semantic links among these elements in order to perform the specific analysis. This step requires an additional abstraction layer on top of the information.
In this chapter we present StORMeD, a dataset modeling more than 800k Stack Overflow discussions concerning Java, where the contents are further modeled with a meta-information system. We discuss how this additional modeling of the information provided by StORMeD can be leveraged and reused to build analyses and tools.
Structure of the Chapter
In Section 7.1 we describe StORMeD and its meta-information model. In Section 7.2 we present an exploratory study to discover usages of sun.misc.Unsafe in Stack Overflow by using StORMeD. In Section 7.3 we describe a tool to sanitize untagged code elements in Stack Overflow. In Section 7.4 we conclude the chapter.
7.1 StORMeD: Stack Overflow Ready Made Data
In this section we present the first application of our island parser. We fully exploit the H-AST as the basic building block of a meta-information model of the contents provided by Stack Overflow discussions. The meta-information model is embedded in a full-fledged structural model of Stack Overflow discussions, which also preserves the human tagging of the contents. We created StORMeD, a dataset counting more than 800k discussions concerning the Java programming language, enabling reusability of the Stack Overflow data.
7.1.1 The Artifact Model
Figure 7.1 shows the artifact model for a Stack Overflow discussion. Three different colors are used to highlight the structure of the document, reaching the structural detail of the contents.
[Figure 7.1: class diagram. Classes: StackOverflowArtifact, StackOverflowElement, StackOverflowPost, Question, Answer, Comment, User, InformationUnit, NaturalLanguageTaggedUnit, CodeTaggedUnit, ASTNode (JavaASTNode, StackTraceASTNode, JsonASTNode, XmlASTNode), MetaInformation. Meta-information kinds: Types, Type Declarations, Variables Declarations, Method Declarations, Method Invocations, Identifiers, JSON Members, XML Elements, Natural Language, Text Readability, Code Readability.]
Figure 7.1. Object Model for a Stack Overflow discussion.
The object model depicted in Figure 7.1 follows a top-down decomposition of the artifact. The orange part represents the structure of the artifact itself: A Stack Overflow discussion is composed of a set of posts that can be questions and answers, and each post has an owner (i.e., a user) and a set of comments. The green part concerns the human tagging performed on the contents. In this case, CodeTaggedUnit represents the places where users highlight non-textual elements by using the <code> tag, while NaturalLanguageTaggedUnit represents all other HTML tagging in the body of a post. The blue part models the meta-information contained in the contents. Each type of meta-information describes a specific type of information concerning code and natural language.
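The decomposition described above can be sketched as a small set of Python classes. This is only an illustrative sketch and not the actual StORMeD implementation: the class names follow Figure 7.1, while the field names (`owner`, `units`, `comments`, `raw_text`, `ast_node`, `meta`), the choice to let comments carry information units, and the helper `units_of_type` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Optional, Type


@dataclass
class MetaInformation:
    """One piece of meta-information, e.g. kind="Types" (field names are assumed)."""
    kind: str
    value: Any


@dataclass
class InformationUnit:
    """A tagged span of a post body; ast_node stands in for the H-AST node."""
    raw_text: str
    ast_node: Optional[Any] = None
    meta: List[MetaInformation] = field(default_factory=list)


class CodeTaggedUnit(InformationUnit):
    """Contents tagged as <code> at the top level."""


class NaturalLanguageTaggedUnit(InformationUnit):
    """Contents carrying any other top-level HTML tagging."""


@dataclass
class User:
    name: str


@dataclass
class StackOverflowElement:
    """Common base of posts and comments (assumed to hold owner and units)."""
    owner: User
    units: List[InformationUnit] = field(default_factory=list)


class Comment(StackOverflowElement):
    pass


@dataclass
class StackOverflowPost(StackOverflowElement):
    comments: List[Comment] = field(default_factory=list)


class Question(StackOverflowPost):
    pass


class Answer(StackOverflowPost):
    pass


@dataclass
class StackOverflowArtifact:
    question: Question
    answers: List[Answer] = field(default_factory=list)


def units_of_type(artifact: StackOverflowArtifact,
                  unit_type: Type[InformationUnit]) -> Iterator[InformationUnit]:
    """Hypothetical helper: walk posts, then their comments, yielding matching units."""
    for post in [artifact.question, *artifact.answers]:
        for element in [post, *post.comments]:
            for unit in element.units:
                if isinstance(unit, unit_type):
                    yield unit
```

With such a structure, an analysis can select only the units of one kind (for example, all code tagged units of a discussion) without re-parsing the post bodies.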
I am migrating from xml based spring configuration to "class" based configuration using the corresponding @Configuration annotation. I came across the following problem: I want to create a new bean, which has a reference to another (service) bean. Therefore I autowired this class to set this reference during bean creation. My configuration class looks as follows:
@Configuration
@ComponentScan(basePackages = {"com.akme"})
public class ApplicationContext {
    @Resource
    private StorageManagerBean storageManagerBean;

    @Bean(name = "/storageManager")
    public HessianServiceExporter storageManager() {
        HessianServiceExporter hessianServiceExporter = new HessianServiceExporter();
        hessianServiceExporter.setServiceInterface(StorageManager.class);
        hessianServiceExporter.setService(storageManagerBean);
        return hessianServiceExporter;
    }
}
But this doesn't work, because the causes a
BeanNotOfRequiredTypeException exception during startup.
Bean named 'storageManagerBean' must be of type [com.akme.StorageManagerBean], but was actually of type [com.sun.proxy.$Proxy20]
The StorageManagerBean is annotated with an @Service annotation. And the xml based configuration worked as expected:
<bean name="/storageManager"
      class="org.springframework.remoting.caucho.HessianServiceExporter">
    <property name="service" ref="storageManagerBean"/>
    <property name="serviceInterface" value="com.akme.StorageManager"/>
</bean>
[HTML tagging shown alongside the post, in order: <p>, <code>, <p>, <code>, <p>, <p>, <code>]
Figure 7.2. Example of Stack Overflow question with HTML tagging.
7.1.2 Preserving the human tagging
A user on Stack Overflow can create a post by using a subset of the HTML language. In this subset, the user can make use of tags like <code> to highlight code snippets in a discussion. In Section 6.2.3 we analyzed and estimated the agreement of the tagging performed by users. The analysis highlighted that contents tagged as <code> at the top level might not contain code at all, and that the “textual” part (i.e., not tagged as <code> at the top level) might contain untagged code elements.
Figure 7.2 shows the same conversation¹ discussed in Chapter 6 (see Figure 6.2) with the actual HTML tagging on the left side. By performing this tagging, the user lets the reader know where the candidate structured fragments are, thereby differentiating the nature of two separate parts of the contents and two different types of information units. To keep track of the human tagging, we generalize the tagged contents to two different types of information units:
Natural Language Tagged Unit: Whatever is not tagged as <code> at the top level, like textual decorations (e.g., <b>, <hr>), lists (e.g., <ol>, <ul>), and paragraphs (i.e., <p>).
Code Tagged Unit: All contents tagged as <code> at the top level.
These information unit types expose an H-AST node providing the parsed contents. In doing so, the model takes care of mistagged contents, thus allowing potential analyses to take this aspect into account.
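As an illustration of the distinction between the two unit types, the following sketch splits a post body by its top-level HTML elements using Python's standard `html.parser` module. It is not the StORMeD implementation. The names `TopLevelSplitter`, `split_units`, `CODE_TAGS`, and `VOID_TAGS` are hypothetical, and treating a top-level `<pre>` as code in addition to `<code>` is an assumption of this sketch. Text outside any element is dropped, and nesting is tracked by naive depth counting, so malformed HTML is not handled.

```python
from html.parser import HTMLParser
from typing import List, Tuple

CODE_TAGS = {"code", "pre"}      # assumption: top-level tags treated as code
VOID_TAGS = {"hr", "br", "img"}  # elements that have no closing tag

Unit = Tuple[str, str, str]      # (kind, top-level tag, text content)


class TopLevelSplitter(HTMLParser):
    """Groups the text of a post body by its top-level HTML elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.units: List[Unit] = []
        self._depth = 0            # current element nesting depth
        self._tag = ""             # tag of the open top-level element
        self._text: List[str] = []

    def _emit(self, tag: str, text: str) -> None:
        kind = "code" if tag in CODE_TAGS else "natural_language"
        self.units.append((kind, tag, text))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self._depth == 0:   # e.g. a top-level <hr>
                self._emit(tag, "")
            return
        if self._depth == 0:       # a new top-level element begins
            self._tag, self._text = tag, []
        self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:       # the top-level element is complete
            self._emit(self._tag, "".join(self._text))

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)


def split_units(html: str) -> List[Unit]:
    parser = TopLevelSplitter()
    parser.feed(html)
    parser.close()
    return parser.units
```

Note that an inline `<code>` nested inside a `<p>` stays part of the natural language unit, since only the top-level tag decides the unit type. Whether such a unit actually contains code is exactly what the parsed H-AST node is meant to reveal.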
7.1.3 The meta-information Model
According to the object model depicted in Figure 7.1, every information unit carries a set of meta-information. The meta-information model enables the decoration of information units with, for example, the result of an analysis or the traversal of the H-AST. We provide the following pre-computed meta-information:
Types: the set of Java types mentioned in a unit, including qualified types (reference types) and primitive types (e.g., int, double);
Type Declarators: all the H-AST nodes matching a Java type declaration, including classes, interfaces, and enumerators;
Variable Declarators: all the H-AST nodes matching a variable (or field) declarator;
Method Invocations: all the H-AST nodes matching a method invocation;
Method Declarators: all the H-AST nodes matching a method declarator;
Code Identifiers: all the H-AST nodes matching an identifier;
JSON: all the H-AST nodes matching a JSON member declaration, i.e., a pair composed of a member name and a declaration (e.g., object, string, number, boolean);
XML: all the H-AST nodes matching an XML tag, both single tags with no children (i.e., <tag/>, <tag>) and composed tags with children;
Natural Language: the term frequency (tf) vector that can be used to calculate, for example, textual similarities. The tf vector is generated using Apache Lucene². We split text on case change, on digits, and on symbols, we convert it to lower case, we remove stop words, and we apply the snowball stemmer³ to the obtained terms;
Sentiment Analysis: the overall sentiment value of the information unit from a purely textual point of view. This type of meta-information should not be applied to structured units, since sentiment analysis on source code (e.g., Java, XML) would lead to nonsensical results;
² http://lucene.apache.org
³ http://snowball.tartarus.org/