
rq 9: Does the same user behaviour correspond to the same outcome of success across the different communities?

5.5 successful user behaviour

5.5.3 Knowledge Creation Communities

The two success criteria for Knowledge Creation communities, namely informativity and complexity (Tables 23 and 24), are essentially content quality metrics.

Hence it is not surprising that they largely depend on content features, such as the length of original articles, edits and titles, as well as the number of URLs in original posts and responses. Interestingly, the length of original articles and edits has a detrimental effect on informativity, whereas the presence of URLs has a positive effect. This is likely the case because URLs contain more content words (e.g. nouns, verbs and adjectives) than function words (e.g. articles, pronouns and conjunctions). Complexity is equally affected by the amount of content and URLs, i.e. the longer the articles and revisions, the more complex they are. That also applies to the Title length, where longer titles seem to hint at complex content as well. Our feature additions of Edit distance and Content length change barely contribute to the prediction results, with small importance values where they are included at all.

Normalised feature importance for informativity in Wikipedia

Regression                              | Classification
Score Feature                           | Feature                         Score
100   Original post length              | URLs in original posts          100
97.5  URLs in responses                 | URLs in responses               94.1
85.0  Title length                      | Title length                    85.3
83.2  Response length ratio             | URLs in posts                   76.9
78.7  URLs in original posts            | Posts per user variation        61.3
76.0  Thread length variation           | Response length                 57.0
70.3  Response length                   | Response length ratio           45.5
67.3  Content length                    | Posts per day                   42.1
66.5  Posts per day                     | Out-degree variation            36.2
65.7  Community age                     | Original post length variation  33.5
63.8  Unique users per thread           | Original post length            32.4
58.4  Original post length ratio        | Community age                   29.9
56.0  URLs in posts                     | Seed post proportion            23.5
53.9  Content length variation          | In-degree variation             15.1
52.4  Number of original posts          | Original post proportion        13.6
51.1  Community growth                  | User churn                      9.1
50.0  Original post proportion          | Content length                  5.0
46.0  Original post length variation    | Response time variation         0
44.7  Response time variation           |
42.5  Posts per user variation          |

Table 23: Top 20 most important features and their normalised importance scores for the success criterion informativity.
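As a reading aid for the normalised importance scores in Table 23, the following minimal sketch shows one common way to bring raw importance values onto a 0 to 100 scale and to pick the top-ranked features. It assumes min-max scaling, which the text does not confirm as the normalisation actually used; the function names and the raw values are hypothetical and serve illustration only.

```python
def normalise_importance(raw):
    """Scale raw importance scores to the range 0-100 (min-max scaling).

    ASSUMPTION: min-max scaling is one plausible normalisation; the source
    does not state which normalisation was actually applied.
    """
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # All features are equally important: avoid dividing by zero.
        return {name: 100.0 for name in raw}
    return {name: round(100.0 * (value - lo) / (hi - lo), 1)
            for name, value in raw.items()}


def top_k(scores, k=20):
    """Return the k highest-scoring (feature, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical raw importances, for illustration only (not thesis data).
raw = {"feature_a": 8.0, "feature_b": 5.0, "feature_c": 2.0}
normalised = normalise_importance(raw)
ranked = top_k(normalised, k=2)
print(ranked)
```

Under this scaling the strongest feature always receives 100 and the weakest feature of the full set receives 0, so a score only expresses importance relative to the other features of the same model.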

Beyond the content features, the prediction algorithm has found that some user activity and interaction features are useful additions for predicting success of Knowledge Creation communities. In particular, Posts per day (i.e. original articles + edits) and Original posts per day (i.e. original articles only), as well as the variations of the posts and responses (i.e. edits) per user, were helpful user activity features for predicting both informativity and complexity. The user interaction features that the two success criteria share are the variations of the in- and out-degrees, as well as the Unique users per thread.

However, we also see a difference in user interaction features between informativity and complexity. Whereas, for example, the Response time variation occurs among the additional predictors for informativity, complexity profits more from the predictive power of seed post features, namely Number of seed posts, Number of non-seed posts and Seed post proportion. Note that many of the named user activity and interaction features are good additions to the very strong content features but not strong predictors by themselves. As we see in Table 12 in Chapter 4, they often have little or no correlation to the Knowledge Creation success criteria on the whole data, but they may help the prediction algorithm to further separate portions of the data that have already been divided into “successful” and “unsuccessful” by the content features.

Normalised feature importance for complexity in Wikipedia

Regression                              | Classification
Score Feature                           | Feature                         Score
100   URLs in posts                     | URLs in original posts          100
98.4  URLs in original posts            | URLs in responses               60.6
82.8  Original post length ratio        | URLs in posts                   53.4
68.8  Responses per user variation      | Original posts per day          44.5
68.1  Response length ratio             | Posts per day                   41.8
67.1  In-degree variation               | Number of non-seed posts        26.4
65.2  URLs in responses                 | Original post length            21.0
60.0  Ratio of URLs in responses        | Title length                    20.5
60.0  Community age                     | Content length                  16.6
52.0  Number of non-seed posts          | Unique users per thread         7.3
49.4  Posts per day                     | Edit distance variation         0.5
45.4  Out-degree variation              | Seed post proportion            0
44.4  Unique users per thread variation |
43.0  Unique users per thread           |
41.5  VIP churn                         |
40.9  Community size                    |
39.5  Seed post proportion              |
37.0  Number of seed posts              |
34.1  Posts per user variation          |
31.9  Original post length              |

Table 24: Top 20 most important features and their normalised importance scores for the success criterion complexity.

From the user attraction & retention category, Community age and Community growth are helpful additions, although they do not affect community success in the same way. The growth of the community seems to have a slight beneficial effect on informativity, whereas the age has the opposite effect. A fast-growing community shows that more and more users are joining in order to either create high-quality content or improve existing articles. The age of the community, on the other hand, is correlated with content length: older communities tend to have longer articles, which is bad for complexity and in some cases for informativity as well, as we already established. Furthermore, VIP churn is slightly correlated with complexity, which could indicate that the creation of convoluted and overly complex content was either the cause or the result of active users leaving the community.

5.6 Summary

In this chapter, we employed a machine learning approach to determine the successful user behaviour for the three community types Q&A, Life & Health and Knowledge Creation. The first question we asked in relation to this approach is: Can we predict community success with good accuracy, given the information about user behaviour that is captured by the collected features (Research Question 7)? We achieved good results across all three community types, with accuracies of 0.67 – 0.95 (F1 score) and 0.73 – 1.0 (AUC). The lowest AUC score was achieved by the question solving delay tsolved in the SAP Community Network, which suggests a considerable amount of noise for that metric in that data. With the generally good accuracy results, the prediction approach shows a strong relation between certain aspects of user behaviour and the various success criteria. This positively answers Research Question 7 and allows us to determine the successful user behaviour with high certainty in many cases.
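For readers less familiar with the two measures reported above, the following sketch computes the F1 score and the AUC from first principles. It is not the evaluation code behind the reported numbers; the labels, scores and the 0.5 decision threshold are hypothetical, chosen only to make the definitions concrete.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example is scored above a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = "successful") and classifier scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

f1 = f1_score(y_true, y_pred)
area = auc(y_true, scores)
print(round(f1, 3), round(area, 3))
```

The F1 score depends on the chosen decision threshold, whereas the AUC summarises the ranking quality over all thresholds, which is why the two ranges can differ for the same model.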

The second question we asked was: Which user behaviour contributes to community success, i.e. which combination of user behaviour features reflects the individual success criteria (Research Question 8)? We found that there are some obvious traits of user behaviour that have an impact on the individual success criteria, such as the effect of the answer rate for Q&A communities, content features for Knowledge Creation communities, and user interaction features for Life & Health communities. These likely are observations that a community manager will intuitively look out for, but there are also less obvious aspects of user behaviour that a community manager should not disregard.

In the case of Stack Exchange communities, for example, well-written questions (i.e. original posts) and question titles influence the question-solving performance, and too many references in answers should be avoided. Similarly, the Life & Health communities in Boards.ie do not thrive with too many URLs in replies, and long responses also have a negative effect on the prediction of the frequency of repliedTo user posts. On the other hand, the replySentiment is only affected by content length, and benefits from longer responses. For Knowledge Creation communities, we found that concise and frequently updated articles are beneficial for the content quality. Additionally, we noticed that user action and interaction features, such as the number of unique contributors per article, improve the prediction of informativity and complexity of the created content over pure content features. For example, while frequent edits are beneficial for the content quality in Knowledge Creation communities, the number of editors is also important: we found that having too many editors is related to a decrease in content quality. That summarises our findings regarding Research Question 8.

Finally, we asked: Does the same user behaviour correspond to the same outcome of success across the different communities (Research Question 9)? Our intention was to investigate the distinctly successful user behaviour in the different community types, but also to study the generalisability between communities of the same type. Intuitively, we expected certain differences, for example that some user behaviour is more related to user-interaction based success criteria (e.g. connectedness), while other behaviour is more related to content-based success criteria (e.g. informativity). The results show that this is indeed the case. We already established that, for example, user interaction features are among the strong predictors for Q&A performance (solved) and the social connectedness in Life & Health communities, whereas some content features are important for the content quality in Knowledge Creation communities.

In order to investigate whether the same user behaviour features are related to success in communities of the same type, we compared the two Q&A platforms Stack Exchange and SAP Community Network. What we found is that there is a clear overlap of impactful features with only minor differences. Particularly the two solved variables exhibit similar relations to certain user behaviour features, which indicates that the fundamental dynamics that lead to solved questions are the same in the two Q&A sites. The most important difference is that content features are far less related to question-solving performance in the SAP Community Network compared to the Stack Exchange communities.

The two tsolved variables are not quite as similar between Stack Exchange and the SAP Community Network. However, that might be caused by the noise in SCN’s tsolved. Therefore, the comparison between the two tsolved variables is inconclusive. In summary, to answer Research Question 9, we observed that different user behaviour factors are important for the various success criteria across the different types of communities, but within the same type of community (i.e. the two Q&A platforms) we found a considerable overlap of successful user behaviour features.

This chapter and the presented findings complement and advance the existing academic research through the scope of the analysis carried out: First, by defining and measuring additional user behaviour metrics, we present a more exhaustive study of the different aspects of user behaviour and their relation to community success. This is further highlighted by our efforts to compare the relation of all these metrics on three different community types, which enables us to analyse and evaluate community success and the related user behaviour on a bigger scale than has been done before. Some of the existing articles either proposed success-related user behaviour features without any evaluation [Pre01; WL11; You13], used simple analysis to investigate a small number of features [JNM11] or examined members’ and managers’ perception of success in surveys [WBE97; LSK03; LSK04].

Others introduced the use of machine learning (ML) to predict certain aspects of communities that may contribute to success. That includes changes in user activity from the composition of user roles [ARA11; RA12; RFA+13], group stability from group characteristics including size, activity and structure of video game guilds and DBLP co-authorship networks [PLG13], and productivity in Wikipedia articles from user turnover (i.e. churn) [QSC14]. Our approach focuses on community success criteria that are specific to the type of community (e.g. solved questions in Q&A communities and content quality in Wikipedia communities) and we use machine learning as a means to identify the aspects (i.e. features) of user behaviour that are related to the various community success criteria. We do this on a bigger scale than before, by covering a wide range of user behaviour features and by comparing three types of communities with the same methodology.

The ML approach also allows us to detect features that are only important in the presence of other features, but not by themselves. For example, the correlation analysis in Chapter 4 showed that the Number of non-seed posts has no correlation with the tsolved success criterion in Stack Exchange, but the learning algorithm ranked it in the top 5 (regression) and top 10 (classification) of the most important features because it gained importance in the presence of other features. Using ML, we also discovered relations that were perhaps not expected, such as the negative impact of too many URLs in answers and too many contributors in Wikipedia articles. In summary, we showed that the machine learning approach is not only suitable for the scale of our analysis, but can also reveal new and previously unknown links between user behaviour and community success, and therefore benefits the task of community success analysis.
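The observation that a feature can be uncorrelated with a success criterion and still be a valuable predictor in combination with others can be made concrete with a deliberately extreme toy example. The sketch below uses constructed XOR data (not data from any of the studied communities), and the combined decision rule is written by hand rather than learned; the names x1, x2 and y are hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_single_feature_accuracy(x, y):
    """Best accuracy achievable from one binary feature alone,
    i.e. predicting either y = x or y = 1 - x."""
    same = sum(1 for a, b in zip(x, y) if a == b) / len(y)
    return max(same, 1 - same)


# Toy data: the target is 1 exactly when the two features differ (XOR).
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [0, 1, 1, 0]

r1 = pearson(x1, y)  # no linear relation between x1 and y
r2 = pearson(x2, y)  # no linear relation between x2 and y
acc_single = best_single_feature_accuracy(x1, y)
# Hand-written rule that uses both features together.
acc_pair = sum(1 for a, b, t in zip(x1, x2, y) if (a ^ b) == t) / len(y)

print(r1, r2, acc_single, acc_pair)
```

Here each feature on its own is no better than guessing, while the two together determine the target completely. Real features rarely behave this cleanly, but tree-based and other non-linear learners can exploit weaker versions of the same effect, which a pairwise correlation analysis cannot reveal.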

6 AGENT-BASED SIMULATION OF USER NEED AND ENGAGEMENT IN ONLINE COMMUNITIES³⁸

In the previous chapters, we defined success criteria of various types of online communities, and evaluated the user behaviour that contributes to success via correlation analysis and machine learning. In doing that, we followed a black box approach, where we measured each community’s characteristics such as growth and user activity, and compared them to the different levels of the success criteria. In other words, we determined the co-occurrence between certain user behaviour traits and the success criteria, without explicitly knowing how they are connected internally, and therefore without being able to establish the cause-and-effect relation between input and output. For example, we observed a positive relation between having many answers per question and the performance in Q&A communities, whereas user churn has a negative relation, and community size has no relation whatsoever. We often assumed a direction of effect, i.e. many answers increase the probability of solving a question, and users who abandon the community remove their power of solving questions from it.

However plausible these assumptions may be, the black box approaches of correlation analysis and machine learning do not provide proof of that causal relation or any community-internal dynamics in general. However, understanding how these internal user interaction dynamics cause the resulting outcome for the community as a whole is crucial for studying success of online communities.

In this chapter, we propose an agent-based model of online communities to further investigate these dynamics of user behaviour and the effect on the survivability and success of the community. Computational and non-computational models have seen usage in the online community literature to explain certain phenomena, such as community sustainability [But01], sociability [AZ09], and user motivation [RK14] (see Chapter 2 for more details). These models offer a cost-effective and unobtrusive way to gain insights into the emerging patterns and dynamics of user interactions. That is especially important when such insights cannot be obtained purely by analysing the recorded data, and when interventions on live communities would disrupt their functionality or cause bias by users who are aware of the experiment.

38 This chapter is based on our work on the Q&A community model published as [AH14].

The purpose of simulating user interactions in online communities based on the proposed model is manifold: First, it will allow us to show some of the causal relations that we suspected in earlier chapters, such as the impact of active responders on various community success criteria, and on the other hand, the irrelevance of community size and general user activity. Second, we will be able to investigate the conditions under which online communities will thrive or perish. For example, how many active users are needed to ensure community survival? Finally, with the simulation, we can estimate to what degree external events will likely affect the community survival. An important issue for community managers is how robust their community is in the presence of malicious users and external attacks, and with the simulation, we can address this issue in a controlled environment without causing harm to a live community.

In this chapter, we describe and validate the proposed model, thereby addressing the following research questions:

rq 10: Can we encode the basic elements of user behaviour in different online