RQ 9: Does the same user behaviour correspond to the same outcome of success across the different communities?
5.2 Similarity between user behaviour features
We expect some of the described features to be very similar to each other because they capture similar user behaviour, albeit from slightly different perspectives.
For example, in some of our data sets, the features Number of posts and Number of users are extremely highly correlated (0.99) with Posts per day and Users per day, respectively. The reason for that particular correlation is that the number of posts and users per day directly affects the total number of posts and users.
In order to assess the similarity between the features, we cluster the user behaviour features on all four data sets (Stack Exchange, SAP Community Network, Boards.ie and the Simple English Wikipedia) based on their pair-wise correlation distance (complete clustering on the Euclidean distance of Pearson correlations).
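The clustering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: we assume that each feature is represented by its row of the Pearson correlation matrix, that the distance between two features is the Euclidean distance between those rows, and that clusters are merged with complete linkage. The toy values and the helper names (`pearson`, `correlation_profiles`, `complete_linkage`) are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_profiles(features):
    """Map each feature to its row of the pair-wise correlation matrix."""
    names = list(features)
    return {a: [pearson(features[a], features[b]) for b in names] for a in names}


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering; cluster distance = largest member distance."""
    clusters = [frozenset([name]) for name in profiles]

    def dist(c1, c2):
        return max(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Merge the two clusters that are closest under complete linkage.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical toy data: one value per community for each feature.
features = {
    "Number of posts": [120, 340, 90, 560, 210],
    "Posts per day": [1.2, 3.5, 0.9, 5.4, 2.2],
    "Thread length": [8, 3, 6, 5, 9],
    "Unique users per thread": [6, 2, 5, 4, 7],
}
clusters = complete_linkage(correlation_profiles(features), n_clusters=2)
```

On this toy data, the two posting activity features end up in one cluster and the two thread features in the other. Stopping at a fixed number of clusters is a simplification; a dendrogram would record every merge instead.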
That provides us with a way to visually inspect the similarity between the features, as shown in Figure 7. Here, we only depict the similarity clusters for the Boards.ie data as an example, but similar clusters can be found in the other community types as well, whose feature similarity dendrograms are shown in Appendix A.
We also exclude the standard deviation features in this analysis as they are often highly correlated with their mean counterpart and/or with other standard deviations of similar features. For example, in the Stack Exchange communities, the standard deviation of Response time has a correlation of 0.94 with Response time, and the standard deviations of both Thread length and Unique users per thread have an inter-correlation of 0.99 (the same as the two features themselves). The full feature similarity dendrograms for all four data sets including the standard deviation features can be found in Appendix A.
On a closer look, we notice that many feature similarities are shared across all our community types. A common theme is a big similarity cluster of features that are related to posting activity, mainly centred around the number of posts, original posts and seed posts, but also including the number of responders and the resulting size of the community (blue box in Figure 7). Also closely related are the growth of the community and the number of posts per day (purple box), as fast growing communities produce posts frequently. User and VIP churn, which are naturally closely related to each other (cyan box), are not strongly correlated to the growth and size of the community. That means that the intuitive assumption that bigger communities automatically have a higher ratio of leaving users does not hold.
Figure 7: Feature similarity in the Boards.ie communities, measured as the Pearson correlation and grouped by hierarchical agglomerative clustering. The colours indicate clusters of similar features.
In the category of user interaction features, there is a naturally strong correlation between Unique users per thread and Thread length (yellow box): the more users participate in a thread, the more posts the thread contains. The exception is discussion threads in which a handful of people bounce ideas back and forth and create many more posts than there are participating users; however, these situations are not prevalent. We observe a similar correlation between the Responder proportion and the Response effort (green box), which demonstrates again that a high number of participants translates to a high number of contributions per user, from the perspective of responses to original posts.
The following pairs of features show how much the responses contribute to the overall post characteristics. The number of posts per user (orange box) and the overall content length of posts (red box) are strongly correlated to the number of responses per user and the content length of the responses, respectively. This is the case because the responses make up the majority of posts. To put this in numbers, in our data, the community average of responses per original post ranges from 1.5 (Stack Exchange) to 8.7 (Boards.ie). The most extreme case is Wikipedia, where some communities have an average of more than 50 responses per original post. In three out of the four community platforms, Information spread also correlates to the number of posts and responses per user (orange box). It is defined as the average in-degree of a user node in the interaction graph (see Section 5.1), and it is expected to increase when the number of responses goes up.
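To illustrate why Information spread is expected to grow with the number of responses, the average in-degree can be sketched as below. The exact graph construction is given in Section 5.1 and is not repeated here, so this sketch rests on assumptions: every response adds one directed edge from the responder to the author of the original post, repeated responses count as separate edges, and the average is taken over all users that appear in the graph. The function name `information_spread` and the user names are hypothetical.

```python
from collections import Counter


def information_spread(replies):
    """Average in-degree over all users in the interaction graph.

    `replies` is an iterable of (responder, original_poster) pairs.
    Assumption: each pair is one directed edge responder -> original_poster.
    """
    in_degree = Counter()
    users = set()
    for responder, poster in replies:
        users.update((responder, poster))
        in_degree[poster] += 1
    if not users:
        return 0.0
    return sum(in_degree.values()) / len(users)


# Hypothetical toy community with three users.
replies = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
spread = information_spread(replies)

# Two additional responses among the same users raise the average in-degree.
more_replies = replies + [("carol", "bob"), ("bob", "alice")]
spread_more = information_spread(more_replies)
```

Under these assumptions the measure equals the number of responses divided by the number of users, which makes its correlation with the number of responses per user plausible.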
The role of responses is also apparent in other content features, where the number of URLs in responses correlates highly with the number of URLs in all posts (magenta box). The two features that are unique to our Wikipedia data, namely Content length change and Edit distance, also show correlation to each other (see Appendix A). That is expected as well, since the Levenshtein edit distance includes the change of content length. Moreover, this indicates that the revisions in the Simple English Wikipedia are usually connected with an addition of extra content, rather than rephrasing or fixing of typographical errors.
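The relation between the two Wikipedia features can be illustrated with a small sketch. The Levenshtein distance between two texts is never smaller than the absolute difference of their lengths, because every extra character requires at least one insertion or deletion. The implementation below is a standard dynamic programming version written for illustration; it is not the code used to compute the Edit distance feature, and the example sentences are hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical revisions of one sentence.
old = "Dublin is a city."
added = "Dublin is a city in Ireland."  # revision that adds content
rephrased = "Dublin is a town."         # revision that rephrases

for new in (added, rephrased):
    length_change = abs(len(new) - len(old))
    distance = levenshtein(old, new)
    print(f"length change={length_change}, edit distance={distance}")
```

For the revision that only adds content, the edit distance equals the length change, whereas the rephrasing has a length change of zero but a positive edit distance. If most revisions are additions, the two features move together, which is consistent with the observed correlation.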
The feature that is the least similar to the others is Title length, which can be found near the root of the similarity trees. That means that the information given in the title, i.e. its length, is not affected by and does not affect any other user behaviour. A second feature that is dissimilar to most others is Response time. While it has some correlation to the age of the community, because old communities can receive responses to original posts that were created long ago, all other factors are relatively unaffected by the response time. Intuitively, we would assume that posts with long content might cause a long response time, but the time of writing a post is negligible compared to the time delay between original post and response, which can commonly take hours or even days.
In summary, there are 8 (9 in Wikipedia) clusters of user behaviour features that are very similar across all types of communities that we investigate. This helps us understand which user behaviour is closely related and possibly inter-dependent. A high inter-dependency between features would not be useful for the prediction task, as similar features hold the same information. For that reason, part of the feature selection step (Section 5.3.1) is to remove features that are highly correlated.
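One simple way to remove highly correlated features is sketched below. This is only an illustration of the idea and not necessarily the procedure of Section 5.3.1: we assume a greedy pass that keeps a feature only if its absolute Pearson correlation with every feature kept so far stays below a threshold. The threshold of 0.9, the helper names and the toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.9):
    """Greedily keep features that are not highly correlated with kept ones."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: one value per community for each feature.
features = {
    "Number of posts": [120, 340, 90, 560, 210],
    "Posts per day": [1.2, 3.5, 0.9, 5.4, 2.2],
    "Thread length": [8, 3, 6, 5, 9],
    "Unique users per thread": [6, 2, 5, 4, 7],
}
selected = drop_correlated(features)
```

On the toy data, only one feature from each highly correlated pair remains. The result of such a greedy pass depends on the order in which the features are visited, so a real selection step would need a principled rule for deciding which feature of a pair to keep.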