WHAT ARE DEVELOPERS TALKING ABOUT?
AN ANALYSIS OF STACK OVERFLOW DATA
1. Abstract
We implemented a methodology to analyze the textual content of Stack Overflow discussions. We used latent Dirichlet allocation (LDA), a statistical topic modeling technique, to automatically discover the main topics present in developer discussions. We analyzed the discovered topics, as well as their relationships and trends over time, to gain insights into the development community.
2. Topic Modelling
2.1 Topic Model - LDA
A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, we can expect particular words to appear in the document more or less frequently. Latent Dirichlet allocation (LDA) is currently one of the most common topic models in use. LDA is a generative model: it explains sets of observations (the words in each document) through unobserved groups (the topics), which account for why some parts of the data are similar.
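The generative story behind LDA can be made concrete with a toy sketch. The vocabulary, the two topics, and the Dirichlet parameter below are hypothetical values chosen for illustration; this is not the inference procedure that MALLET runs, only the forward process that LDA assumes produced the documents.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Pick an index according to a discrete probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)      # per-document topic mixture
    words = []
    for _ in range(length):
        z = sample_index(theta, rng)          # latent topic for this word
        w = sample_index(topic_word[z], rng)  # word drawn from that topic
        words.append(vocab[w])
    return theta, words

rng = random.Random(0)
vocab = ["java", "class", "jquery", "css"]  # toy vocabulary (hypothetical)
# Two toy topics: the first favours Java words, the second web words.
topic_word = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Topic modeling runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's mixture theta.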
2.2 MALLET
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. It includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.
3. Stack Overflow Data Set
In this section, we discuss the relevance of Stack Overflow data and the organization of the data dump.
3.1 Stack Overflow
In recent years, stackoverflow.com has become a major source of information for the developer community. This Q&A site is popular among software developers, and the discussions on it reflect the current usage and popularity of technologies.
3.2 Data Set
Stack Overflow data is publicly available under a Creative Commons license. The data set is organised into five XML documents: badges.xml, comments.xml, posts.xml, users.xml and votes.xml. We were particularly interested in posts.xml, which contains the post information (questions and answers with tags). We analysed a data set spanning 3 years, from July 2008 to June 2011. The posts.xml file for these three years was around 10 GB, which made parsing and processing it a challenge.
4. Method Overview
Figure 1 depicts the phases of data processing discussed in this section.
Figure 1: Overall working of project
4.1 Data extraction and pre-processing.
posts.xml was parsed using the SAX XML parser in Python, and the content of each post (title, tags and body) was written to plain text files. The majority of the posts are coding-related discussions and hence contain code snippets. We removed all code snippets (if, while, etc.) from the posts and used the remaining text. The post content is stored in HTML format, so HTML tags were also removed to obtain the actual text of the post.
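A minimal sketch of this extraction step follows. It assumes that each post is a `row` element carrying `Title`, `Tags` and `Body` attributes and that code snippets sit inside `<code>` or `<pre>` tags; the class and function names are hypothetical, and the authors' actual script may differ.

```python
import xml.sax
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text from an HTML fragment, skipping <code> and <pre> blocks."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("code", "pre"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("code", "pre") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def strip_html_and_code(body):
    """Return the plain text of a post body without tags or code snippets."""
    parser = TextExtractor()
    parser.feed(body)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

class PostHandler(xml.sax.ContentHandler):
    """Stream over <row .../> elements without loading the whole file."""
    def __init__(self):
        super().__init__()
        self.posts = []

    def startElement(self, name, attrs):
        if name != "row":
            return
        title = attrs.get("Title", "")
        tags = attrs.get("Tags", "").replace("<", " ").replace(">", " ")
        body = strip_html_and_code(attrs.get("Body", ""))
        self.posts.append(" ".join((title + " " + tags + " " + body).split()))

# Tiny stand-in for posts.xml (assumed layout, made-up content).
sample = b'''<posts>
  <row Id="1" PostTypeId="1" Title="Parse XML" Tags="&lt;python&gt;&lt;xml&gt;"
       Body="&lt;p&gt;How do I parse?&lt;/p&gt;&lt;pre&gt;&lt;code&gt;while x: pass&lt;/code&gt;&lt;/pre&gt;" />
</posts>'''
handler = PostHandler()
xml.sax.parseString(sample, handler)
print(handler.posts)
```

For a file of around 10 GB, the handler would write each cleaned post to a text file instead of keeping a list in memory. The streaming SAX interface is what keeps memory use flat.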
4.2 Topic Modeling
The text files generated by data extraction and pre-processing were then fed to the topic modeling component of MALLET. By default, this package takes one input text file and performs topic modeling over it. We modified the package to process a large number of files and to discover the files automatically, given a directory name.
The stop-word list was also extended incrementally with more technical stop words to reduce noise.
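The two adjustments described above, file discovery and an extended stop-word list, can be illustrated outside MALLET. The authors made these changes inside the Java package, so the Python sketch below only shows the idea; the stop-word sets and function names are hypothetical.

```python
import os
import tempfile

# Both lists are illustrative; the report does not give the actual lists.
DEFAULT_STOP_WORDS = {"the", "a", "is", "to", "in", "of", "and", "how", "i"}
TECHNICAL_STOP_WORDS = {"code", "error", "problem", "file", "using"}

def discover_text_files(directory):
    """Return all .txt files under a directory, in sorted order."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(".txt"):
                found.append(os.path.join(root, name))
    return sorted(found)

def remove_stop_words(text, stop_words):
    """Lower-case, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in stop_words]

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "post1.txt"), "w", encoding="utf-8") as f:
        f.write("How to fix error in hibernate mapping using annotations")
    stop_words = DEFAULT_STOP_WORDS | TECHNICAL_STOP_WORDS
    files = discover_text_files(tmp)
    with open(files[0], encoding="utf-8") as f:
        tokens = remove_stop_words(f.read(), stop_words)

print(tokens)
```

Keeping the technical stop words in a separate set makes it easy to grow the list incrementally, as the text describes, and to rerun the model after each change.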
4.3 Post processing
Topic modeling was performed over quarterly data to generate the trends discussed in section 6.3, and over the posts related to the most popular topics to generate the results discussed in section 5.3.
5.1 Motivation
We investigate whether some topics are related to other topics in terms of questions and answers. This can help us identify closely-coupled topics, where questions in one topic tend to generate answers in seemingly unrelated topics. Moreover, this can help point out the cross-cutting areas of concerns for developers across different topics: problems so common that they span across multiple domains. For instance, if many questions regarding both mobile application development and web development generate answers related to user interfaces, it hints that user interface development is a cross-cutting concern faced by developers across two different platforms.
5.2. Solution
The steps involved are:
● Finding the top K topics
● Generating mappings (topic to question posts, and question posts to answer posts)
● Getting all answer posts for each of the top K topics
● Running topic modeling for each topic over the answer posts from the previous step
● Presenting the collected data in a comprehensible manner using data visualization
5.3. Results
The entire space represents Stack Overflow. Each outer circle represents a topic, and the size of the circle is proportional to the popularity of that topic. Each topic in turn contains nested circles that represent the topics triggered from it, again sized in proportion to their popularity. We post-processed the results to remove a few obvious topics in each category that are not of interest to our research question. For instance, the Java topic generated topics such as data structures and library usage, but any language is bound to trigger activity in such areas, so we filtered them out at the post-processing stage.
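The steps of section 5.2 and the nested-circle layout described here can be sketched as a small data transformation. The sketch assumes each question and each answer has already been assigned a single dominant topic, which simplifies the per-topic LDA runs the authors describe. All identifiers and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def top_k_topics(question_topic, k):
    """Return the k most frequent question topics, most popular first."""
    counts = Counter(question_topic.values())
    return [topic for topic, _ in counts.most_common(k)]

def build_hierarchy(question_topic, answer_parent, answer_topic, k, obvious=()):
    """Build a nested structure: top topics, each with the topics it triggered."""
    counts = Counter(question_topic.values())
    top = top_k_topics(question_topic, k)
    triggered = defaultdict(Counter)
    for answer_id, question_id in answer_parent.items():
        topic = question_topic.get(question_id)
        if topic not in top:
            continue
        sub_topic = answer_topic[answer_id]
        if sub_topic not in obvious:  # post-processing filter
            triggered[topic][sub_topic] += 1
    return {
        "name": "Stack Overflow",
        "children": [
            {
                "name": topic,
                "size": counts[topic],
                "children": [
                    {"name": sub, "size": n}
                    for sub, n in triggered[topic].most_common()
                ],
            }
            for topic in top
        ],
    }

# Toy mappings (made-up ids and topics, for illustration only).
question_topic = {1: "java", 2: "java", 3: "java",
                  4: "ruby-on-rails", 5: "ruby-on-rails", 6: "css"}
answer_parent = {10: 1, 11: 1, 12: 2, 13: 3, 14: 4, 15: 5, 16: 6}
answer_topic = {10: "hibernate", 11: "sql", 12: "hibernate",
                13: "data structures", 14: "git", 15: "git", 16: "layout"}

hierarchy = build_hierarchy(question_topic, answer_parent, answer_topic,
                            k=2, obvious=("data structures",))
print(hierarchy)
```

A nested name/size/children structure of this kind is a common input for circle-packing visualizations, where each `size` value would drive the radius of a circle.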
5.4. Analysis
Some of the results shown above are surprising and very informative. Consider the most popular topic, Java. Java has triggered activity in areas such as Hibernate (an ORM tool) and SQL. A business analyst could use this information to work out statistics such as the most sought-after ORM tool used with Java or the most used backend database with Java. One surprising finding is the prominence of GitHub in the discussions triggered by Ruby on Rails. GitHub is known to have been gaining popularity in recent times, but this analysis shows that much of that activity was triggered by Ruby on Rails. This analysis therefore gives a broad view of the relations between topics and an insight into the activity triggered in cross-cutting areas of concern.
Figure 2: Visualization of the results
6. How does developer interest change over time?
6.1. Motivation and Research Question
By analyzing the rise and fall of interest in different topics, product developers can assess the relative popularity of their products. This also helps in identifying marketing and research opportunities and trends. For example, if interest in the .NET Framework topic is rising while interest in the Java topic is dropping, then companies, book publishers, and researchers might want to direct their attention to .NET problems and challenges. Trend analysis also helps in reasoning about the rise or fall of certain topics in developer discussions.
6.2 Solution
We divided the entire dataset into chunks of fixed time frames with each chunk covering posts over 3 months. Hence, we got 12 partitions over the entire data set covering 3 years in all. Topic modeling is performed for each chunk separately to find the trending topics.
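The partitioning into 3-month chunks can be sketched as follows. The start date follows the data set description (July 2008); the function names and the sample posts are hypothetical.

```python
from collections import defaultdict
from datetime import date

START = date(2008, 7, 1)  # first month covered by the data set, per the text

def quarter_index(created, start=START):
    """Map a creation date to a 0-based 3-month chunk counted from `start`."""
    months = (created.year - start.year) * 12 + (created.month - start.month)
    return months // 3

def partition_posts(posts):
    """Group (post_id, creation_date) pairs by 3-month chunk."""
    chunks = defaultdict(list)
    for post_id, created in posts:
        chunks[quarter_index(created)].append(post_id)
    return dict(chunks)

# Made-up posts for illustration.
posts = [(1, date(2008, 7, 15)), (2, date(2008, 9, 30)),
         (3, date(2008, 10, 1)), (4, date(2011, 6, 30))]
chunks = partition_posts(posts)
print(chunks)
```

With this indexing, July 2008 to June 2011 yields chunk indices 0 through 11, matching the 12 partitions mentioned above.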
We also wish to analyze the temporal trends of topics. To do so, we define the impact of a topic z_k in time period m as
\[ \mathrm{impact}(z_k, m) = \frac{1}{|D(m)|} \sum_{d_i \in D(m)} \theta(d_i, z_k) \]
where D(m) is the set of all posts in the 3-month period m and θ(d_i, z_k) is the topic score of z_k for the document d_i. The impact metric measures the relative proportion of posts related to that topic compared to the other topics in that time frame. The statistics collected in this way are plotted in two dimensions, with the impact of each topic shown against time in the figures below.
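Reading the impact metric as the average topic score θ(di, zk) over the posts in D(m), which is an assumption based on the description above, it can be computed in a few lines. The function name and the toy scores are hypothetical.

```python
def impact(topic, period_posts, theta):
    """Mean topic score of `topic` over the posts of one time period.

    `theta` maps a post id to a dict of topic scores for that post;
    `period_posts` is the collection of post ids in the period, D(m).
    """
    if not period_posts:
        return 0.0
    total = sum(theta[post].get(topic, 0.0) for post in period_posts)
    return total / len(period_posts)

# Toy topic scores for three posts in one period (made-up values).
theta = {
    1: {"java": 0.75, "sql": 0.25},
    2: {"java": 0.25, "sql": 0.75},
    3: {"java": 0.5, "sql": 0.5},
}
java_impact = impact("java", [1, 2, 3], theta)
print(java_impact)
```

Since each post's topic scores sum to 1, the impacts of all topics in a period also sum to 1, which makes values comparable across periods of different size.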
We grouped the topics into meaningful categories to make the comparison more comprehensible. This gave us 4 comparisons, each covering a different set of topics.
Category 1 – Programming Languages:
Java, C++, Python
Category 2 – Web Technologies:
JavaScript, PHP, Ruby on Rails, Django and HTML/CSS
Category 3 – Application Development
iPhone application development and Android application development
Category 4 – General Trend
Web Technologies, Server side Technologies and Mobile application development
The last category is a more general comparison, in which we combined several topics to give a holistic idea of which layer of the stack is trending most among developers. Server-side technologies include the .NET Framework and MySQL; web technologies include PHP, JavaScript, Ruby on Rails, HTML/CSS and Django; and mobile technologies include iPhone application development and Android application development. This analysis gives an overall picture of the general trend among developers: whether they are more interested in server-side development, web development or mobile application development.
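Combining topics into domains can be sketched as a simple aggregation. The report does not say how the topic values were combined, so summing the per-topic impact is an assumption here; the impact values are made up, while the domain membership follows the text.

```python
DOMAINS = {
    "Server-side technologies": [".NET Framework", "MySQL"],
    "Web technologies": ["PHP", "JavaScript", "Ruby on Rails", "HTML/CSS", "Django"],
    "Mobile technologies": ["iPhone application development",
                            "Android application development"],
}

def domain_impact(topic_impact, domains=DOMAINS):
    """Sum per-topic impact values into one value per domain (assumed rule)."""
    return {
        domain: sum(topic_impact.get(topic, 0.0) for topic in topics)
        for domain, topics in domains.items()
    }

# Made-up impact values for one 3-month period.
topic_impact = {"PHP": 0.125, "JavaScript": 0.25, "MySQL": 0.125,
                "Android application development": 0.0625}
by_domain = domain_impact(topic_impact)
print(by_domain)
```

Applying this to each 3-month period gives one series per domain, which is the kind of data a domain-level trend plot needs.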
Figure 3: Languages
Legend: Java, green; C++, blue; Python, orange; PHP, light green
Figure 6: Technology Domains
Legend: Web Technologies, yellow; Server Side Technologies, green; Mobile Technologies, blue
6.3. Results
The graph above compares web technologies with the other technology domains over the 3-year time frame, with each point representing a 3-month period. Web technologies are clearly the most popular of the technology domains, remaining on top during most of the quarters. The analysis thus gives a good comparison of the popularity of various technologies among developers. It also helps us to reason about the highs and lows of a particular technology, as explained in the next section.
6.4. Real time events
During this trend analysis, some technologies surfaced as trending at particular points in time. These spikes can be related to real-world events:
1. The iPhone OS 2.0 SDK was released in March 2008, which led to iPhone Application Development trending in Apr-June 2008.
2. Rails version 2.3 (with major changes) was released in March 2009, leading to Ruby on Rails surfacing in the trends in Apr-June 2009.
3. An Adobe Flex version was released in March 2010, and Flex started trending in Apr-June 2010.
7. Challenges faced and Future Work
One of the challenges we faced was the size of the data. The posts.xml file was 10 GB, which took a lot of processing time.
Another challenge was that MALLET does not remove technical stop words from the data. These are technical words that are general in nature and do not help in topic modeling. To remove them, we added explicit code with our own list of technical stop words.
A further challenge was wrongly tagged questions. On Stack Overflow, the person who asks a question has to tag it with keywords related to the question, and questions can end up with the wrong tags. Wrongly tagged questions create noise that is hard to eliminate.
MALLET only gives us the set of keywords for each topic; it does not give a name for the topic those keywords describe. We therefore had to go through the keywords of each topic manually and name it accordingly. This process was arduous and time-consuming. In addition, a few topics had keywords that were general in nature, which made it difficult to give them a specific name.
As future work we would like to extend our work to compare trends of specific technologies, and how interests in related/competing technologies differ over time.
8. Conclusion
In this project, we implement a methodology to discover and quantify the topics and trends in Stack Overflow, a popular Q&A website with millions of active users. Our methodology is based on LDA, a widely-applied statistical topic model, which discovers topics from the textual content of Stack Overflow. We use various metrics to quantify the topics and their changes over time, which allows us to gain insight into the discussions in Stack Overflow.
Our analysis provides an approximation of the wants and needs of the contemporary developer. It can also be used by the Stack Overflow team to better understand the content generated by its users.
Knowing what topics are present, and which are popular at any given time, could help in the moderation of the website.
9. Source Code
https://github.com/nsivabalan/treaso
10. References
[1] Anton Barua, Stephen W. Thomas, and Ahmed E. Hassan, "What are developers talking about? An analysis of topics and trends in Stack Overflow", Empirical Software Engineering, 2012
[2] MALLET http://mallet.cs.umass.edu/