
Over the last few years, data retrieved from community question answering (CQA) sites have been commonly used by the research community. The availability of content, user information, social network structure, and some forms of manual assessment of content through voting and best reply selection make these collections useful for many research problems. One such widely used CQA site is StackOverflow². In StackOverflow, the focus is on technical topics such as programming languages and environments, algorithms, and operating systems. Users can post questions, answer questions, or leave comments on both questions and answers. In most CQA sites, a question and its corresponding answers form a thread, and they are displayed together in the user interface. An example StackOverflow question thread with the question and an answer is presented in Figure 4.2.

As can be seen from the example in Figure 4.2, a question consists of a title, body, and tags, shown within the blue boxes. The title contains several important keywords and gives the main idea of the information need. The body is the longest field and explains the information need in detail. The tags do not describe the specific information need of the question but consist of several words or phrases chosen by the asker to categorize the question. An answer, on the other hand, consists of only a body field. Both questions and answers can receive comments from other users in order to either make or ask for a clarification. The average length of these fields is presented in Table 4.3.

Questions, answers, and comments in StackOverflow can receive up or down votes from other users depending on the quality, necessity, or accuracy of the post. Askers can also select an answer as the best, which is shown with a green check mark next to the answer. Furthermore, all posts are associated with their authors and are timestamped.

A public data dump of StackOverflow is used for the experiments in this thesis. This collection contains all the posts (questions and answers) and the comments made on these posts until May 2014. Statistics of this collection are provided in Table 4.4.

Routing questions to users who can answer them accurately and ranking replies based on the corresponding responders' question-specific expertise are two widely studied tasks that require estimating users' expertise for a given question. In this thesis, the proposed expert finding approaches are also applied to and tested on these tasks. Prior work used different experimental methodologies and evaluation metrics for these tasks. In the rest of this chapter, these methodologies are explained first, and improvements are proposed where necessary.

² http://stackoverflow.com/

Figure 4.2: An example question and answer from StackOverflow.

Field                   Ave. Length
Question Body                 94.00
Question Title                 8.51
Question Tag                   2.95
Answer Body                   60.94
Question Comment Body         29.68
Answer Comment Body           30.74

Table 4.3: Average length of fields in the StackOverflow collection.

# Questions          7,214,697
# Answers           12,609,623
# Comments          29,226,344
# Askers             1,328,026
# Responders           869,243
# Commenters         1,055,930
# Active Users       1,721,952

Table 4.4: Statistics of the StackOverflow collection.

4.2.1 The Question Routing Task

For a given question, the question routing task returns a ranked list of users based on their relevance to the question. For this task, the top 1000 expert candidates³ are retrieved for each question from among the 869K responders of the site. All, some, or none of the actual responders of the question may appear within these 1000 candidates.

In an ideal environment, this task could be evaluated by routing questions to the identified expert candidates and then manually assessing the accuracy of their answers. However, due to the lack of such extended manual assessments, the available data from CQA sites are used for evaluation. For a given question, all the authors of its replies are treated as relevant, while all other retrieved users, who did not post an answer to that question, are treated as irrelevant. This binary evaluation scheme, despite its flaws, has been commonly used in previous research on the question routing task.
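
To make this scheme concrete, the following Python sketch labels a ranked candidate list against the set of actual responders of a question; the function and the toy data are illustrative and not part of the thesis.

```python
from typing import List, Set

def binary_relevance(ranked_candidates: List[str],
                     actual_responders: Set[str]) -> List[int]:
    """Label each retrieved candidate as relevant (1) if they actually
    answered the question, and irrelevant (0) otherwise."""
    return [1 if user in actual_responders else 0 for user in ranked_candidates]

# Hypothetical example: a truncated ranked list, two of whom answered.
ranked = ["u42", "u7", "u301", "u19"]
responders = {"u7", "u19"}                   # authors of the question's replies
print(binary_relevance(ranked, responders))  # [0, 1, 0, 1]
```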

Average scores obtained with this evaluation scheme are normally lower than the average scores reported in other expert retrieval research [10], due to incomplete assessments. In this task, all of the highly ranked users may have the necessary knowledge and background to answer the question, but only those who actually answered it are considered relevant, while all others are assumed to be irrelevant. In StackOverflow, the average reply count is 3.2, which means that for most questions there are only a couple of relevant users among the 1000 retrieved users. With such a low number of relevant instances, it is hard to determine whether the proposed approaches provide any significant improvements.

In order to decrease the effects of incomplete assessments, questions with 15 replies were selected during test set construction, so that questions have more users assessed as relevant on average. For each tag count from 1 to 5, 50 questions were randomly selected, yielding a total of 250 questions for the question routing experiments.
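
A minimal sketch of this sampling step, assuming a hypothetical list of question records with `num_tags` and `num_replies` fields (the 15-reply condition is interpreted as a minimum here; these names and the helper are illustrative, not the thesis code):

```python
import random
from typing import Dict, List

def sample_routing_test_set(questions: List[Dict], per_tag_count: int = 50,
                            min_replies: int = 15, seed: int = 0) -> List[Dict]:
    """Randomly pick `per_tag_count` questions for each tag count 1..5,
    restricted to questions with at least `min_replies` replies."""
    rng = random.Random(seed)
    selected = []
    for tag_count in range(1, 6):
        pool = [q for q in questions
                if q["num_tags"] == tag_count and q["num_replies"] >= min_replies]
        selected.extend(rng.sample(pool, per_tag_count))
    return selected  # 5 * 50 = 250 questions
```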

The success of the question routing task depends on at least one of the highly ranked expert candidates actually answering the question; therefore, performance is reported with early precision metrics like Precision@(5, 10, 20). Furthermore, the Mean Reciprocal Rank (MRR) metric is used to analyze the rank of the first identified expert candidate who can answer the question. MRR is calculated as follows:

MRR = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{expRank(q)}

where expRank(q) is the rank of the first (top ranked) expert candidate for question q, and |Q| is the number of questions.
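
A small illustrative implementation of these routing metrics over the binary relevance labels described above (not the thesis code):

```python
from typing import List

def precision_at_k(relevance: List[int], k: int) -> float:
    """Fraction of the top-k retrieved candidates who actually answered."""
    return sum(relevance[:k]) / k

def mean_reciprocal_rank(relevance_per_question: List[List[int]]) -> float:
    """MRR = (1/|Q|) * sum over questions of 1 / rank of the first relevant
    candidate. Questions with no relevant candidate retrieved contribute 0."""
    total = 0.0
    for relevance in relevance_per_question:
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(relevance_per_question)
```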

Another metric, proposed by Chang and Pal [21], is Matching Set Count (MSC@n), which reports the fraction of questions for which at least one of the users ranked in the top n provided an answer. The intuition behind this metric is to analyze what ratio of the questions would be answered if each question were routed to its top n ranked candidates.

MSC@n is calculated as follows:

MSC@n = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\big[ \{\text{Retrieved Users}(q)\}_n \cap \{\text{Expert Users}(q)\}_n \neq \emptyset \big] \qquad (4.5)

where 1[cond] is an indicator function that equals 1 if cond is true and 0 otherwise.
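
An illustrative computation of MSC@n from the retrieved rankings and the sets of actual responders (the data structures are hypothetical placeholders):

```python
from typing import Dict, List, Set

def msc_at_n(retrieved: Dict[str, List[str]],
             experts: Dict[str, Set[str]], n: int) -> float:
    """Fraction of questions for which at least one of the top-n retrieved
    candidates is an actual responder of that question."""
    hits = 0
    for qid, ranked_users in retrieved.items():
        if set(ranked_users[:n]) & experts[qid]:
            hits += 1
    return hits / len(retrieved)
```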

³ The TREC 2005 Enterprise Track [26] also asked for the top ranked 1000 expert candidates to be retrieved in its expert finding task.

The NDCG metric is also used to measure overall performance and to give some sense of the relative ranking of responders, based on the votes their replies received, which are used as graded relevance assessment scores for the responders.
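
A minimal NDCG sketch, assuming the votes received by each ranked responder's reply are used directly as gains with a standard log2 discount; the exact gain and discount formulation used in the thesis is not specified here:

```python
import math
from typing import List

def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_gains: List[float]) -> float:
    """NDCG: DCG of the produced ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

# Hypothetical example: votes received by the replies of the ranked responders.
print(ndcg([3, 0, 7, 1]))  # < 1.0, since the 7-vote reply is not ranked first
```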

4.2.2 The Reply Ranking Task

The experimental setting of the reply ranking task is rather different from that of question routing. In this setting, the aim is to rank the responders of a question based on their question-specific expertise. A use case of this task arises when information seekers are unsure about the replies received for their question. If no feedback from other users is available regarding the accuracy of the provided answers, knowing the question-specific expertise of the responders can be useful to identify the best answer, or to rank the answers based on their authors' expertise. Therefore, for this task, expertise needs to be estimated only for the responders of the given question, not for any other users.

This ranked list of responders is evaluated with respect to the votes their replies received. An example StackOverflow question with replies ranked by the votes they received is presented in Figure 4.3. In this example, the second answer received more votes than the first; however, it is ranked after the first one because it was not accepted as the best answer by the asker. Previous research on reply ranking in CQA sites directly used these received votes as graded relevance assessment values. Even though these available assessment values are very practical for evaluation purposes, they may not always reflect the correct assessment of the content, due to possible temporal or presentation bias introduced by the CQA system during the voting process. These possible biases and their effects on experimental evaluation are analyzed, and a less biased test set construction approach is proposed, in the next section.

This proposed approach was used to construct the test set for the reply ranking task. Questions with exactly 5 replies⁴ were chosen in order to see the effects of different approaches more clearly as the relative ranking of these 5 responders changes. Similar to the question routing task, for each tag count from 1 to 5, 50 questions were chosen randomly from the bias-free question collection. Overall, a total of 250 questions were chosen for the reply ranking experiments.

These questions were selected from among questions whose replies all received positive vote totals. Some replies receive negative feedback from users, most likely because they are wrong; questions with such replies were not selected for the test set. Furthermore, only questions in which the most voted reply was also accepted as the best reply by the asker were chosen, in order to make sure that both the asker and the other users agree on which reply is the best.
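
A minimal sketch of these selection criteria, assuming a hypothetical question record that stores per-reply vote totals and an accepted-answer flag (field names are illustrative):

```python
from typing import Dict, List

def eligible_for_reply_ranking(question: Dict) -> bool:
    """Keep questions with exactly 5 replies, all with positive vote totals,
    whose most voted reply is also the asker-accepted answer."""
    replies: List[Dict] = question["replies"]  # each: {"votes": int, "accepted": bool}
    if len(replies) != 5:
        return False
    if any(r["votes"] <= 0 for r in replies):
        return False
    most_voted = max(replies, key=lambda r: r["votes"])
    return most_voted["accepted"]
```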

Due to the graded relevance values of the votes, the NDCG metric is used to evaluate the performance. The best answer prediction (BAP) measure, which is 1 if the top-ranked user's reply received the highest vote and 0 otherwise, is also used.
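
An illustrative BAP computation, using the votes of the ranked responders' replies as in the NDCG sketch above (not the thesis code):

```python
from typing import List

def best_answer_prediction(ranked_votes: List[int]) -> int:
    """1 if the reply of the top-ranked responder received the highest vote
    among all replies to the question, 0 otherwise."""
    return 1 if ranked_votes[0] == max(ranked_votes) else 0

# The per-question values would typically be averaged over the test questions.
print(best_answer_prediction([7, 3, 1, 0, 2]))  # 1
```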