
RQ 9: Does the same user behaviour correspond to the same outcome of success across the different communities?

5.1 Additional user behaviour features

The user behaviour factors we collected in Chapter 4 are the basis for the features we use in our prediction approach. However, that set of features misses some important aspects. For example, some of these features depend heavily on the size of the community, such as the raw number of posts and seed posts. A bigger community naturally has more posts and seed posts, but this problem can be avoided by normalising the features. Hence, we extend the list of features to cover more aspects of user behaviour that might be relevant to community success, giving the prediction algorithm a wider pool of features to choose from.

Most notably, we include the responses per user, the information spread in the community, as well as the standard deviation and normalisation of some user behaviour features, which to date have not been considered in the literature. The standard deviation of user behaviour can indicate particular risks in community functionality.

For example, a high standard deviation of posts per user indicates an imbalanced distribution of community activity, which is arguably less sustainable for a community than a load that is more equally distributed among all users. In the former case, the drop-out of a single highly active participant can be devastating for the community. In general, a high standard deviation indicates an imbalanced distribution of workload, which bears the risk of instability.

The normalisation of some factors allows us to look at user behaviour in relative terms, e.g. the proportion of original posts or the response length ratio, rather than in absolute numbers that depend on the size of the community and other factors. Further, we split content-related factors into title, original post and response to investigate the effect of content features at a finer granularity.

As discussed in Section 4.1 in the previous chapter, we extract the features from data fields that are readily available in most community platforms, including the ID, title, content and time stamp of posts, as well as the user ID and the user-user interactions. In the following, we describe our feature additions in the three categories User activity, User interaction, and Content creation, and we list the standard deviation features in Section 5.1.4.

We recall the notation of common elements in communities that we introduced in Chapter 4 (page 97), where $U_c$ is the set of users and $P_c$ the set of all posts in community $c \in C$. The posts are further divided into original posts $OP_c$ and responses $R_c$ ($P_c = OP_c \cup R_c$). Below, we formalise each user behaviour feature $f(c)$ on community $c$.

5.1.1 User activity

• Original post proportion: The number of original posts in relation to all posts is a normalisation of the Number of original posts from Section 4.1. A high value might indicate a lack of responses. With $OP_c$ and $P_c$ as the sets of original posts and all posts (including original posts and responses) in community $c$, respectively:

$$f_{\text{opProportion}}(c) = \frac{|OP_c|}{|P_c|}$$

• Responder proportion: The number of responding users in relation to all users. A high proportion of responders indicates good user engagement. With $U_c$ as the set of users and $R_c$ as the set of responses in community $c$:

$$f_{\text{responderProportion}}(c) = \frac{|\{u : u \in U_c \wedge u \text{ authored any } r \in R_c\}|}{|U_c|}$$

• Responses per user: Average number of responses per user; a community can only thrive when its participants interact sufficiently. With $U_c$ as the set of users and $R_c$ as the set of responses in community $c$:

$$f_{\text{responsesPerUser}}(c) = \frac{\sum_{u \in U_c} |\{r : r \in R_c \wedge r \text{ was authored by } u\}|}{|U_c|}$$
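The three user activity features above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the `Post` record, its field names and the convention that `parent_id is None` marks an original post are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Post:
    post_id: int
    user_id: str
    parent_id: Optional[int] = None  # assumption: None marks an original post


def activity_features(posts, users):
    """Compute the three user activity features for one community.

    Assumes that `posts` and `users` are both non-empty.
    """
    users = set(users)
    originals = [p for p in posts if p.parent_id is None]
    responses = [p for p in posts if p.parent_id is not None]
    # Users from U_c who authored at least one response.
    responders = {r.user_id for r in responses if r.user_id in users}
    # Sum over all users of their individual response counts.
    responses_by_users = sum(1 for r in responses if r.user_id in users)
    return {
        "opProportion": len(originals) / len(posts),
        "responderProportion": len(responders) / len(users),
        "responsesPerUser": responses_by_users / len(users),
    }


# Toy community: four users, two threads, three responses.
toy_users = ["a", "b", "c", "d"]
toy_posts = [
    Post(1, "a"),
    Post(2, "b", parent_id=1),
    Post(3, "c", parent_id=1),
    Post(4, "b"),
    Post(5, "a", parent_id=4),
]
activity = activity_features(toy_posts, toy_users)
# opProportion = 2/5 = 0.4, responderProportion = 3/4 = 0.75,
# responsesPerUser = 3/4 = 0.75
```

In this toy community, user d never posts but still counts in the denominators, which is why both per-user features stay below 1.0.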

5.1.2 User interaction

• Seed post proportion: Proportion of original posts that receive responses from other people, i.e. the number of non-ignored threads in relation to all threads. It reflects the user response engagement, but in contrast to Original post proportion, it ignores the actual number of responses.

With $f_{\text{seedPosts}}(c)$ as the number of seed posts (see page 101 in Chapter 4) and $OP_c$ as the set of original posts in community $c$:

$$f_{\text{seedPostProportion}}(c) = \frac{f_{\text{seedPosts}}(c)}{|OP_c|}$$

• Information spread: Measures the average degree in the user graph, e.g. as used by Radicchi et al. in the context of community detection [RCC+04]. The user graph is built by creating links between users who participate in the same thread or contribute to the same Wikipedia article. A high average degree indicates that many users are involved in knowledge sharing.

For community $c$, the user graph $G_c = (V_c, E_c)$ consists of vertices ($V_c$) that represent the users from $U_c$ and of undirected, unweighted edges ($E_c$) that represent the user response interactions. The information spread (average degree) of the user graph is then defined as:

$$f_{\text{informationSpread}}(c) = \frac{2 \cdot |E_c|}{|V_c|}$$
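As a rough illustration of the two interaction features, the following Python sketch derives them from a flat list of posts. The tuple layout `(thread_id, user_id, is_original)` is hypothetical, and the sketch assumes that an edge connects every pair of distinct users who post in the same thread, which is one possible reading of the graph construction described above.

```python
from collections import defaultdict
from itertools import combinations


def interaction_features(posts, users):
    """posts: iterable of (thread_id, user_id, is_original) tuples
    (a hypothetical layout chosen for this example)."""
    starter = {}                      # thread_id -> author of the original post
    participants = defaultdict(set)   # thread_id -> all users posting in it
    for thread_id, user_id, is_original in posts:
        participants[thread_id].add(user_id)
        if is_original:
            starter[thread_id] = user_id

    # A seed post is an original post answered by at least one other user.
    seed_posts = sum(
        1 for thread_id, author in starter.items()
        if participants[thread_id] - {author}
    )

    # Undirected, unweighted edges: one per pair of co-participating users.
    edges = set()
    for members in participants.values():
        edges.update(combinations(sorted(members), 2))

    return {
        "seedPostProportion": seed_posts / len(starter),
        # Average degree; isolated users still count as vertices.
        "informationSpread": 2 * len(edges) / len(set(users)),
    }


toy_posts = [
    (1, "a", True), (1, "b", False), (1, "c", False),  # answered by others
    (2, "b", True), (2, "b", False),                   # only a self-reply
    (3, "d", True),                                    # ignored thread
]
interaction = interaction_features(toy_posts, ["a", "b", "c", "d"])
# One of three threads is a seed post; the graph has 3 edges over 4 vertices,
# so the average degree is 2 * 3 / 4 = 1.5.
```

Sorting the members before pairing them makes each undirected edge appear once, regardless of the order in which users posted.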

5.1.3 Content creation

In the literature, content features are largely ignored as success factors, except for the general content length and included links to other resources. Here, we add several more fine-grained aspects of content creation:

• Original post length: The content length in Section 4.1 does not account for differences between original posts and responses. For example, in Q&A communities, shorter questions could indicate that they are easy to solve, which could motivate more users to participate. With $W_{op}$ as the set of words in an original post $op \in OP_c$ (separated by whitespace characters) in community $c$:

$$f_{\text{opLength}}(c) = \frac{\sum_{op \in OP_c} |W_{op}|}{|OP_c|}$$

• Original post length ratio: Original post length in relation to the community's overall average content length. A value smaller than 1.0 means that original posts are shorter than responses. With $f_{\text{contentLength}}(c)$ as the average content length of community $c$ (see Content length on page 103) and $f_{\text{opLength}}(c)$ as the community's average original post length as above:

$$f_{\text{opLengthRatio}}(c) = \frac{f_{\text{opLength}}(c)}{f_{\text{contentLength}}(c)}$$

• Response length: Short responses, on the other hand, might not contain enough information and might be less valuable than longer ones. With $W_r$ as the set of words in response $r \in R_c$ (separated by whitespace characters) in community $c$:

$$f_{\text{responseLength}}(c) = \frac{\sum_{r \in R_c} |W_r|}{|R_c|}$$

• Response length ratio: Response length in relation to the community's overall average content length, similar to Original post length ratio but from the perspective of responses. With $f_{\text{contentLength}}(c)$ as the average content length of community $c$ (see Content length on page 103) and $f_{\text{responseLength}}(c)$ as the community's average response length as above:

$$f_{\text{responseLengthRatio}}(c) = \frac{f_{\text{responseLength}}(c)}{f_{\text{contentLength}}(c)}$$

• Title length: A clear and precise title that includes the necessary key terms to grasp the context of the discussion or article could attract other users' interest and foster interaction. Only original posts have titles, ergo $W_{t,op}$ is

• URLs in original posts: Similar to URLs in posts in Section 4.1, original posts that contain links to internal or external sources show that the poster researched the issue, which could potentially increase the chance of receiving good responses. With $URL_{op}$ as the set of URLs in an original post $op \in OP_c$ in community $c$, including internal links such as wikilinks (see footnote 35):

$$f_{\text{urlsInOPs}}(c) = \frac{\sum_{op \in OP_c} |URL_{op}|}{|OP_c|}$$

• Ratio of URLs in original posts: References in original posts normalised by the number of URLs in all posts. Analogous to Original post length ratio, a value smaller than 1.0 means that there are fewer URLs in original posts than in responses. Using $f_{\text{urlsInOPs}}(c)$ as defined above and $f_{\text{urlsInPosts}}(c)$ as defined on page 103, the ratio of URLs in original posts in community $c$ reads as follows:

$$f_{\text{opURLratio}}(c) = \frac{f_{\text{urlsInOPs}}(c)}{f_{\text{urlsInPosts}}(c)}$$

• URLs in responses: The average number of references per response. References can indicate that responses provide additional and valuable information. With $URL_r$ as the set of URLs in response $r \in R_c$ in community $c$, including internal links such as wikilinks:

$$f_{\text{urlsInResponses}}(c) = \frac{\sum_{r \in R_c} |URL_r|}{|R_c|}$$

• Ratio of URLs in responses: References in responses normalised by the average in all posts in community $c$, similar to Ratio of URLs in original posts but from the perspective of responses. Using $f_{\text{urlsInResponses}}(c)$ and $f_{\text{urlsInPosts}}(c)$ as defined above and on page 103, respectively:

$$f_{\text{responseURLratio}}(c) = \frac{f_{\text{urlsInResponses}}(c)}{f_{\text{urlsInPosts}}(c)}$$

35 Links encoded in double brackets [[...]], see https://en.wikipedia.org/wiki/Help:Link.

• Content length change (only Wikipedia): Difference in content length between revisions and the original article. Articles that receive additional content might show that the community is actively participating in content creation. With $op \in OP_c$ as an original article and $r \in R_c$ as an edit or revision made in response to it in community $c$, we define $W_{op}$ and $W_r$ as the sets of words of the original article and the responding revision, respectively:

$$f_{\text{contentLengthChange}}(c) = \frac{\sum_{r \in R_c} \{\, |W_r| - |W_{op}| : r \text{ revises } op \in OP_c \,\}}{|R_c|}$$

• Edit distance (only Wikipedia): The Levenshtein edit distance [Lev66] between revisions and the original article indicates that contributors put effort into improving existing articles. With $lev_{op,r}$ as the Levenshtein edit distance between an original article $op \in OP_c$ and its corresponding revision $r \in R_c$ in community $c$:

$$f_{\text{editDistance}}(c) = \frac{\sum_{r \in R_c} \{\, lev_{op,r} : r \text{ revises } op \in OP_c \,\}}{|R_c|}$$
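To make the length and edit distance features more concrete, here is a small Python sketch. It rests on several assumptions that the definitions above leave open: word counts are taken as the number of whitespace-separated tokens (repeated words are counted), the overall content length is the mean word count over all posts, and the Levenshtein distance is computed on characters. All function names are illustrative.

```python
def word_count(text):
    # Whitespace-separated tokens, as in the definitions of W_op and W_r.
    return len(text.split())


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def length_features(original_posts, responses):
    """Both arguments are non-empty lists of post texts."""
    op_len = mean(word_count(t) for t in original_posts)
    resp_len = mean(word_count(t) for t in responses)
    # Assumption: overall content length is the mean over all posts.
    content_len = mean(word_count(t) for t in original_posts + responses)
    return {
        "opLength": op_len,
        "responseLength": resp_len,
        "opLengthRatio": op_len / content_len,
        "responseLengthRatio": resp_len / content_len,
    }


def levenshtein(a, b):
    """Character-level edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def edit_distance_feature(revision_pairs):
    """revision_pairs: list of (original_text, revised_text) tuples."""
    return mean(levenshtein(op, r) for op, r in revision_pairs)


lengths = length_features(
    ["how do I sort a list", "help"],
    ["use sorted", "call list.sort() in place please", "ok"],
)
avg_edit = edit_distance_feature([("kitten", "sitting"), ("abc", "abc")])
```

Here the original posts average 3.5 words against an overall average of 3.0, so the original post length ratio is above 1.0 and the response length ratio is below it.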

5.1.4 Standard Deviation Features

A high standard deviation in the measured variables often indicates an imbalanced distribution of workload among the community members, which should be avoided. For example, that situation can arise when a few members of the community carry all the contributions while the rest might be free-loading. The highly active contributors are likely to become dissatisfied with that imbalance and stop contributing, leaving the community to perish. A community manager might wish to avoid that.

Below is a complete list of standard deviation features we add to the pool of available features for the prediction algorithm to choose from. They all measure the standard deviation of existing features described earlier. For example, in a tiny community five users might write 2, 1, 3, 2 and 4 posts each. Their average number of posts is 2.4 and the standard deviation of posts per user is approximately 1.02. With $x_i \in X$ as the individual observations (e.g. how many posts each user $x_i$ wrote) and $\bar{x}$ as their average, we define the standard deviation $\sigma$ as follows:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{|X|} (x_i - \bar{x})^2}{|X|}}$$
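The worked example with five users can be checked with a short Python sketch. The function name `population_std` is illustrative. Because the definition divides by $|X|$, it corresponds to the population standard deviation (`statistics.pstdev` in the Python standard library) rather than the sample standard deviation, which divides by $|X| - 1$.

```python
import math
import statistics


def population_std(xs):
    """Standard deviation with |X| in the denominator, as defined above."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))


posts_per_user = [2, 1, 3, 2, 4]
avg = sum(posts_per_user) / len(posts_per_user)   # 2.4
sigma = population_std(posts_per_user)            # sqrt(1.04), about 1.02

# The standard library's population variant gives the same value.
same_as_stdlib = math.isclose(sigma, statistics.pstdev(posts_per_user))
```

The sample variant `statistics.stdev` would return about 1.14 for the same data, so the choice of denominator matters for small communities.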

The individual standard deviation features cover the following aspects:

• User churn variation

• VIP churn variation

• Posts per user variation

• Responses per user variation

• Thread length variation

• Unique users per thread variation

• Response effort variation

• Response time variation

• In-degree variation

• Out-degree variation

• Content length variation

• Original post length variation

• Response length variation

• Title length variation

• Content length change variation (only Wikipedia)

• Edit distance variation (only Wikipedia)