Innovation Prediction - Language change and evolution in Online Social Networks

7.6 Experiments

7.6.1 Innovation Prediction

Figure 7.6.1 shows the diffusion time of innovations at two granularities: time measured in days (Figure 7.6.1a) and in weeks (Figure 7.6.1b). The majority of innovation diffusions (between nodes) happen within the first five days of exposure. However, within the Reddit comment network, there appear to be regular peaks at 7 and 14 days, potentially indicating an underlying process in users' access patterns of Reddit.

Figure 7.6.1: Innovation diffusion frequencies by (a) day and (b) week
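The binning behind such a figure can be sketched as follows. This is a minimal illustration, assuming each diffusion between two nodes is available as a pair of timestamps (exposure, adoption); the function name `diffusion_delays` and the input layout are hypothetical and not taken from the thesis.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def diffusion_delays(
    exposures: Iterable[Tuple[datetime, datetime]], unit_days: int = 1
) -> Counter:
    """Count diffusions per delay bucket.

    Each item is (exposed_at, adopted_at), with adopted_at >= exposed_at
    (an assumption of this sketch). unit_days=1 gives daily buckets,
    unit_days=7 gives weekly buckets.
    """
    counts: Counter = Counter()
    for exposed_at, adopted_at in exposures:
        delay = adopted_at - exposed_at
        # Whole days elapsed, grouped into buckets of unit_days.
        counts[delay.days // unit_days] += 1
    return counts


# Toy data: two adoptions after one day, one after 7 days, one after 14 days.
sample = [
    (datetime(2015, 1, 1), datetime(2015, 1, 2)),
    (datetime(2015, 1, 3), datetime(2015, 1, 4)),
    (datetime(2015, 1, 1), datetime(2015, 1, 8)),
    (datetime(2015, 1, 1), datetime(2015, 1, 15)),
]
daily = diffusion_delays(sample, unit_days=1)
weekly = diffusion_delays(sample, unit_days=7)
```

With daily buckets, weekly access habits would show up as counts at 7 and 14; with weekly buckets the same diffusions fall into buckets 1 and 2.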

Algorithm 7.4: Evaluating - Static Time

Data: Each mapper receives a complete innovation diffusion in the form of [userid, time], with the docid being the name of the diffusion.

Result: List of quads that represent (row, column family, column, value), which can then be loaded into HBase.

    Class Mapper(docid a, diffusion d)
        n ← network
        results_table ← [(user, measure, value, performed)]
        for row u, t_u ∈ d do
            if u ∈ results_table then
                results_table[u] ← performed
            else
                results_table[u] ← 0, innovator
            end
            for v ∈ S_out(u) do
                p_v,u ← set
                if v ∉ results_table then
                    results_table[v] ← p_v,u, not performed
                else
                    results_table[v] ← update(results_table[v], p_v,u)
                end
            end
        end
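The static-time evaluation can be sketched in Python as below. This is an illustrative reading of the algorithm rather than the thesis implementation: the `influence` lookup, the status strings, and in particular the update rule (combining exposures as 1 - (1 - old)(1 - new), and only while the user has not yet adopted) are assumptions, since the algorithm leaves `update` unspecified.

```python
from typing import Dict, Iterable, List, Tuple

Result = Tuple[float, str]  # (accumulated value, status)


def evaluate_static(
    diffusion: Iterable[Tuple[str, float]],
    out_neighbours: Dict[str, List[str]],
    influence: Dict[Tuple[str, str], float],
) -> Dict[str, Result]:
    """Walk one diffusion in time order and accumulate exposure per user.

    diffusion: (user, time) pairs sorted by time.
    out_neighbours: hypothetical adjacency map, user -> users they can expose.
    influence: hypothetical map (source, target) -> influence probability.
    """
    results: Dict[str, Result] = {}
    for u, _time in diffusion:
        if u in results:
            # u was exposed earlier and has now adopted.
            results[u] = (results[u][0], "performed")
        else:
            # u adopted without any recorded exposure.
            results[u] = (0.0, "innovator")
        for v in out_neighbours.get(u, []):
            p = influence.get((u, v), 0.0)
            if v not in results:
                results[v] = (p, "not performed")
            elif results[v][1] == "not performed":
                # Assumed update rule: independent exposures.
                old = results[v][0]
                results[v] = (1.0 - (1.0 - old) * (1.0 - p), "not performed")
    return results


table = evaluate_static(
    diffusion=[("a", 0), ("b", 1)],
    out_neighbours={"a": ["b", "c"], "b": ["c"]},
    influence={("a", "b"): 0.5, ("a", "c"): 0.2, ("b", "c"): 0.5},
)
```

In this toy run, `a` is an innovator, `b` adopts after an exposure of 0.5, and `c` is exposed twice without adopting.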

threshold (see section 7.3), we only focus on users who have been exposed to an innovation, and not on users who have used it without being exposed. This is because an innovation's adoption is predicted from exposure by other nodes; if there is no exposure, we are not able to predict the adoption. As stated earlier, this prediction challenge is a binary classification, with adoption being predicted if the joint probability i_u(o) is greater than the activation threshold (i_u(o) > σ).

To learn each global activation threshold σ, we use ROC analysis. ROC analysis models the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) by varying the threshold σ, with the aim of finding the threshold that yields the largest number of true positives and true negatives, and with a higher AUC indicating a more accurate model. Figure 7.6.2 shows the ROC curves for each of the four networks, across the three time modes and the four measures of influence. As we can see, the results across all datasets are varied, with the static models proposed in section 7.4.1 being able to predict with AUC highs of 0.92, though there is no discernible difference between the four variations of modelling influence. The introduction of decay functions appears to have reduced the accuracy across all models, resulting in some performing worse than a random model (AUC = 0.5).
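The threshold sweep described above can be sketched in plain Python. This is a minimal illustration, assuming one joint-probability score and one binary adoption label per exposed user; the function names are hypothetical, and the selection rule (maximise true positives plus true negatives, predicting adoption when the score is strictly greater than σ) follows the description in the text rather than any code from the thesis.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float, float]]:
    """Return (sigma, FPR, TPR) for each candidate threshold.

    Assumes labels contain at least one 1 and at least one 0.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Highest threshold first, so the curve runs from (0, 0) to (1, 1).
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for sigma in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > sigma and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > sigma and y == 0)
        points.append((sigma, fp / neg, tp / pos))
    return points


def auc(points: List[Tuple[float, float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (_, x0, y0), (_, x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Threshold with the most correct predictions (TP + TN)."""
    best_sigma, best_correct = float("-inf"), -1
    for sigma in sorted(set(scores), reverse=True) + [float("-inf")]:
        correct = sum(
            1 for s, y in zip(scores, labels) if (s > sigma) == (y == 1)
        )
        if correct > best_correct:
            best_sigma, best_correct = sigma, correct
    return best_sigma


scores = [0.9, 0.8, 0.4, 0.3, 0.2]  # toy joint probabilities
labels = [1, 1, 0, 1, 0]            # 1 = adopted, 0 = did not adopt
curve = roc_points(scores, labels)
area = auc(curve)
sigma = best_threshold(scores, labels)
```

The quadratic sweep is kept for readability; for large score sets a sorted single pass, or a library routine, would be the more practical choice.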

Algorithm 7.5: Evaluating - Continuous Time

Data: Each mapper receives a complete innovation diffusion in the form of [userid, time], with the docid being the name of the diffusion.

Result: List of quads that represent (row, column family, column, value), which can then be loaded into HBase.

    Class Mapper(docid a, diffusion d)
        n ← network
        results_table ← {user: (value, performed, time)}
        for row u, t_u ∈ d do
            if u ∈ results_table then
                results_table[u] ← 0, performed, t_u
            else
                results_table[u] ← 0, innovator, t_u
            end
            for v ∈ S_out(u) do
                if v ∉ results_table then
                    results_table[v] ← 0, Never
                end
            end
        end
        for (user u, value p_u, performed i_u, time t_u) ∈ results_table do
            sorted_parents ← {time: (node, value, performed)}
            for (user v, value p_v, performed i_v, time t_v) ∈ results_table do
                if i_v ≠ Never and e_u,v ∈ n then
                    sorted_parents[t_v] ← (v, 0, i_v)
                end
            end
            for (time, node, value, performed) ∈ sorted_parents do
                compute p_v,u
            end
            tmp ← {time: (user, value)}
            for dt ∈ sorted_parents do
                tmp[dt] ← dt
                intermediate ← 0
                for dt2 ∈ tmp do
                    minutes ← dt2 − dt
                    power ← minutes / τ_v,u
                    intermediate ← update(intermediate, value ∗ e^power)
                end
            end

Algorithm 7.6: Evaluating - Discrete Time

Data: Each mapper receives a complete innovation diffusion in the form of [userid, time], with the docid being the name of the diffusion.

Result: List of quads that represent (row, column family, column, value), which can then be loaded into HBase.

    Class Mapper(docid a, diffusion d)
        n ← network
        results_table ← {user: (value, performed, time)}
        for row u, t_u ∈ d do
            if u ∈ results_table then
                results_table[u] ← 0, performed, t_u
            else
                results_table[u] ← 0, innovator, t_u
            end
            for v ∈ S_out(u) do
                if v ∉ results_table then
                    results_table[v] ← 0, Never
                end
            end
        end

Figure 7.6.3: Twitter geo and mention network values for the number of propagations between users: (a) O_v2u, (b) O_v&u

Table 7.4: AUC values for each given POS tag

Tag  Reddit Comment      Reddit Traversal    Twitter Mention     Twitter Geo
     Bernoulli Jaccard   Bernoulli Jaccard   Bernoulli Jaccard   Bernoulli Jaccard
!    0.945814  0.932778  0.919580  0.919668  0.596826  0.391674  0.821373  0.754288
#    0.988482  0.985960  0.992614  0.986222  -         -         -         -
,    -         -         0.926367  0.916496  -         -         -         -
A    0.974176  0.966811  0.941658  0.933533  0.372159  0.390327  0.828381  0.781653
D    0.916667  0.655914  -         -         -         -         0.710258  0.639674
E    -         -         0.997986  0.994964  -         -         -         -
G    0.949570  0.946256  0.956844  0.954782  0.815499  0.712872  0.890735  0.852988
L    0.961226  0.905481  0.970527  0.971248  -         -         0.769028  0.687771
N    0.919589  0.914614  0.923128  0.919163  0.606992  0.544365  0.847362  0.806809
O    0.945832  0.943943  0.902753  0.901661  -         -         0.801335  0.767615
P    0.797396  0.812736  -         -         -         -         0.836548  0.822249
R    0.960906  0.959497  0.936667  0.926402  0.426421  0.436959  0.817372  0.758002
V    0.950552  0.945546  0.943985  0.941717  0.635784  0.637408  0.819181  0.767474
^    0.914667  0.910982  0.928660  0.924448  0.615903  0.567143  0.857183  0.819170

from the sparsity of the network and the number of propagations (O_v2u), which is significantly smaller than that of, for example, the Twitter geo network (as visualised in Figure 7.6.3). However, the overall reduction in accuracy could be due to external unobserved processes that affect language adoption. This could be in line with [86], who stated that influence/exposure comes not only from local connections but also from the community as a whole.

Across all four networks, as the time models become more complex, the AUC decreases, though not to a large extent. This could be for a number of reasons, including the existence of external pressure that influences users' language. Additionally, the windowing strategy (τ_v,u) computes the decay of influence as happening only between the two users, whereas the influence to adopt an innovation may decay in relation to all users who have applied pressure; thus, τ_v,u may be too conservative, reducing the o_u(i) too quickly when it is a function of all active neighbours.
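To make the effect of a per-pair time constant concrete, the following is a minimal sketch, assuming an exponential decay of each neighbour's influence and a joint probability that treats neighbours as independent. The function names, the sign convention of the exponent, and the combination rule are assumptions for illustration and are not taken from the equations in section 7.4.

```python
import math
from typing import Iterable, Tuple


def decayed_influence(p0: float, t_parent: float, t_now: float, tau: float) -> float:
    """Influence of one neighbour at t_now, decayed since it adopted at t_parent."""
    return p0 * math.exp(-(t_now - t_parent) / tau)


def joint_probability(
    parents: Iterable[Tuple[float, float, float]], t_now: float
) -> float:
    """Combine decayed influences as 1 - prod(1 - p).

    parents: (initial influence, adoption time, time constant tau) per neighbour.
    Neighbours who adopt after t_now exert no influence yet.
    """
    remaining = 1.0
    for p0, t_parent, tau in parents:
        if t_parent <= t_now:
            remaining *= 1.0 - decayed_influence(p0, t_parent, t_now, tau)
    return 1.0 - remaining


one_parent = [(0.5, 0.0, 10.0)]
two_parents = [(0.5, 0.0, 10.0), (0.5, 0.0, 10.0)]

at_exposure = joint_probability(one_parent, 0.0)     # no decay yet
after_one_tau = joint_probability(one_parent, 10.0)  # decayed by e^-1
both_active = joint_probability(two_parents, 0.0)
```

Under these assumptions each neighbour's contribution shrinks on its own clock, so a small τ for one pair lowers the joint value even while other neighbours remain active, which is the behaviour the paragraph above questions.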

As stated in section 7.3, innovations can be classified into different function sets (POS tags). Table 7.4 shows the AUC values for the two non-credit models (equations 7.4.1 and 7.4.2) in each of the four networks. Across the board, the values in Table 7.4 show high AUC, with noticeable improvement in the Twitter comment network. However, several values for the Twitter mention network are still less accurate than a random baseline (AUC < 0.5). Again, this could be due to the sparsity within the network. When looking at the function classes, we can see that abbreviations (G) and verbs (V) appear to diffuse in the most predictable manner, potentially indicating that the words that describe the action of a user are more likely to be adopted. This can be seen in Chapter 6 with the growth of words such as vape and vaping. This can also be seen for acronyms such as cyw and sjw.

Unlike the static and discrete time models, there is only one maximum joint probability i_u(o) for an individual user when using the continuous time model (equation 7.4.9). This is the point at which there is the greatest amount of influence on the given user to adopt a term. To assess whether the time of adoption is near the time at which there was the greatest amount of influence, we compute the difference between the time at which a user adopted a term and the time at which they experienced the greatest influence. This can be seen in Figure 7.6.4: values < 0 indicate that the individual used the innovation before the maximum influence, whereas values > 0 indicate that the term was used after the maximum. There is a distinctive spike around 0, indicating that the maximum falls on the same day the innovation was used. However, the long tail to the right of the peak potentially indicates that a large proportion of users delay their first usage. This long tail could also indicate that the model is decaying the joint influence too quickly, or by the wrong proportions.

Figure 7.6.4: Time difference from global maximum of user adoption
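The signed difference plotted in such a figure can be sketched as follows. This is a minimal illustration, assuming the joint influence on a user has already been evaluated at a set of timestamps; the function name `days_from_peak` and the dictionary layout are hypothetical.

```python
from datetime import datetime
from typing import Dict


def days_from_peak(influence: Dict[datetime, float], adopted_at: datetime) -> float:
    """Signed gap in days between adoption and the time of maximum influence.

    Negative: the user adopted before the influence peaked.
    Positive: the user adopted after the peak.
    """
    # Timestamp at which the sampled influence is largest.
    peak_time = max(influence, key=influence.get)
    return (adopted_at - peak_time).total_seconds() / 86400.0


# Toy influence samples for one user; the peak is on 3 January.
samples = {
    datetime(2015, 1, 1): 0.2,
    datetime(2015, 1, 3): 0.7,
    datetime(2015, 1, 5): 0.4,
}
late = days_from_peak(samples, datetime(2015, 1, 4))
early = days_from_peak(samples, datetime(2015, 1, 1))
```

Collecting this value over all adopting users and plotting its distribution would give a spike near 0 when adoption coincides with peak influence, and a right-hand tail when users delay.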
