Dataset - On the Promotion of the Social Web Intelligence

4.3.1 Data Collection

To collect and build a social network from Twitter, we first selected a random user by generating a random Twitter ID. We ensured that this initial user's account is public, because the social contacts of protected accounts are inaccessible through the Twitter API, which makes it impossible to expand the network from a protected user node. After this user was selected, we iteratively built a network of users in a Breadth First Search (BFS) manner. Given that our approach exploits preference locality, we focused only on reciprocated relations instead of the asymmetric follow or friend relation. Reciprocated relations are expected to indicate a stronger relationship between the two users, and they distinguish the social network section of the Twitter-sphere from its information network [24, 38]. As we focus only on these mutual contacts, from now on, whenever we use the word contact, we refer to the social contacts of the focal user with reciprocated relations.

Before adding each public user to the network, we retrieve and calculate a set of metadata about the user. We first compute the percentage of the focal user's contacts that are protected. For instance, if a user has 100 social contacts among which 20 have protected their accounts, the user is assigned the value of 20%. This percentage, called the privacy ratio, is a primary metric for our further analysis. In addition, we collect Twitter profile attributes (e.g., location and tweet count) and the latest 500 tweets published by the user as node metadata. Once the augmented user node is added to the network, we check if the new node has a reciprocated relationship with any of the existing nodes and add the corresponding edges. This process is repeated with a new public user pulled from the BFS queue. Figure 4.1 shows an overview of our data collection process.
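The collection loop described above can be sketched as follows. This is a minimal illustration and not the authors' crawler: the `ACCOUNTS` and `PROTECTED` structures are hypothetical in-memory stand-ins for what the Twitter API would return, and the names `mutual_contacts`, `privacy_ratio`, and `build_network` are assumptions introduced here. Reciprocated contacts are taken to be the intersection of a public user's friends and followers lists, and only the privacy ratio is stored as node metadata.

```python
from collections import deque

# Hypothetical toy data: friends (accounts followed) and followers of each
# PUBLIC account. Protected accounts expose no lists, so they have no entry.
ACCOUNTS = {
    "a": {"friends": {"b", "c", "d", "x"}, "followers": {"b", "c", "d"}},
    "b": {"friends": {"a", "c"}, "followers": {"a", "c"}},
    "c": {"friends": {"a", "b", "p"}, "followers": {"a", "b", "p"}},
}
PROTECTED = {"d", "p"}


def mutual_contacts(user):
    """Contacts with a reciprocated relation (followed and following back)."""
    account = ACCOUNTS[user]
    return account["friends"] & account["followers"]


def privacy_ratio(contacts):
    """Percentage of the given contacts whose accounts are protected."""
    if not contacts:
        return 0.0
    return 100.0 * len(contacts & PROTECTED) / len(contacts)


def build_network(seed, max_nodes=100):
    """Breadth-first expansion over reciprocated relations from a public seed."""
    if seed in PROTECTED:
        raise ValueError("the seed account must be public")
    nodes = {}               # user -> metadata (here only the privacy ratio)
    edges = set()            # undirected edges stored as sorted tuples
    queue = deque([seed])
    seen = {seed}
    while queue and len(nodes) < max_nodes:
        user = queue.popleft()
        contacts = mutual_contacts(user)
        nodes[user] = {"privacy_ratio": privacy_ratio(contacts)}
        # Connect the new node to mutual contacts already in the network.
        for other in contacts:
            if other in nodes:
                edges.add(tuple(sorted((user, other))))
        # Only public contacts can be expanded further.
        for other in sorted(contacts):
            if other not in seen and other not in PROTECTED:
                seen.add(other)
                queue.append(other)
    return nodes, edges


nodes, edges = build_network("a")
```

In this toy graph, protected contacts still count towards a user's privacy ratio even though they are never added as nodes, which mirrors the distinction between public nodes and protected contacts in the text.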

Figure 4.2: The degree distribution across all users in the network on a log-log scale. The distribution is shown for all of the social contacts (a), the public contacts (b), as well as the protected ones (c).

It should be noted that users with fewer than 10 tweets or fewer than 30 followers/friends are considered inactive and thus are not included in the data collection process. In addition, verified users and users with more than 1K followers/friends are excluded since they often represent brands and celebrities and are not from the general public. By following this approach, we collected a total of 23,320 public user nodes and 6,489,419 tweets published by these users.
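The exclusion criteria can be expressed as a small predicate. The sketch below is illustrative only: the `Profile` class and its field names are hypothetical, 1K is read as 1,000, and applying the "followers/friends" thresholds to each of the two counts separately is an assumption, since the text does not say how the two counts are combined.

```python
from dataclasses import dataclass

MIN_TWEETS = 10          # fewer tweets than this: considered inactive
MIN_CONNECTIONS = 30     # fewer followers or friends than this: inactive
MAX_CONNECTIONS = 1000   # more than this: likely a brand or celebrity


@dataclass
class Profile:
    """Hypothetical subset of profile attributes used for filtering."""
    tweet_count: int
    followers: int
    friends: int
    verified: bool = False


def is_eligible(profile):
    """Return True if the account passes the activity and popularity filters."""
    if profile.verified:
        return False
    if profile.tweet_count < MIN_TWEETS:
        return False
    # Assumption: both counts must reach the lower threshold ...
    if min(profile.followers, profile.friends) < MIN_CONNECTIONS:
        return False
    # ... and neither count may exceed the upper threshold.
    if max(profile.followers, profile.friends) > MAX_CONNECTIONS:
        return False
    return True
```

A crawler would apply such a check before enqueuing a user, so that excluded accounts are never expanded.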

4.3.2 Descriptive Analysis

In this dataset, each Twitter account is mutually connected to an average of 86 contacts. Among these neighbours, an average of 76 are public and 10 are protected. In addition, each user is associated with an average of 339 tweets. We can obtain some insight into the network structure by examining the degree distributions. Figure 4.2 (a) shows the degree distribution for all of the mutual contacts across all users on a log-log scale. A heavy tail can be seen in the graph, resembling a power-law distribution. Similarly, the degree distribution for public contacts shown in Figure 4.2 (b) exhibits a heavy tail. The same applies to the degree distribution for protected contacts, though to a larger extent compared to the other two (see Figure 4.2 (c)).
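A heavy tail of this kind is commonly summarised by a power-law exponent. The sketch below is one standard way to estimate it, assuming a known lower cut-off `xmin`: the continuous maximum-likelihood estimator α = 1 + n / Σ ln(x_i / xmin), together with the Kolmogorov-Smirnov distance between the empirical and the fitted cumulative distributions. The source does not state which estimator or software was used, so the function names and the synthetic data here are assumptions. Degrees are integers, so the continuous estimator is only an approximation for real degree data.

```python
import math
import random


def fit_power_law(values, xmin):
    """Continuous MLE of the exponent and KS distance for the tail x >= xmin."""
    tail = sorted(v for v in values if v >= xmin)
    n = len(tail)
    log_sum = sum(math.log(v / xmin) for v in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need values above xmin to estimate the exponent")
    alpha = 1.0 + n / log_sum
    ks = 0.0
    for i, v in enumerate(tail):
        # Fitted CDF of a power law with density proportional to x^(-alpha).
        model_cdf = 1.0 - (v / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at v.
        ks = max(ks, abs(model_cdf - i / n), abs((i + 1) / n - model_cdf))
    return alpha, ks


def sample_power_law(alpha, xmin, n, rng):
    """Draw n synthetic values by inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


rng = random.Random(0)
sample = sample_power_law(alpha=2.5, xmin=1.0, n=20000, rng=rng)
alpha_hat, ks_distance = fit_power_law(sample, xmin=1.0)
print(f"estimated alpha = {alpha_hat:.2f}, KS distance = {ks_distance:.4f}")
```

The KS distance alone is not a p-value. A goodness-of-fit p-value is usually obtained by comparing this distance with the distances measured on many synthetic samples drawn from the fitted model.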

We also attempted to fit each of the three degree distributions to a power law distribution, P(x) ∼ x^(−α). The fitting yielded α values of 2.49, 2.98, and 1.79 for all, public, and protected accounts, respectively. For all three distributions, the Kolmogorov-Smirnov (KS) test does not reject the power-law hypothesis (p > 0.05), so the power law can indeed be a good fit. Power law distributions are commonly observed in social networks, though it is interesting to observe the same trend even after filtering out the users with more than 1K friends/followers (as described in Section 4.3.1).

Figure 4.3: Correlation of the number of public contacts and the number of protected contacts.

Figure 4.4: Correlation of users’ privacy ratio and the average privacy ratio for all of their contacts.

Figure 4.3 shows the relationship between the numbers of public and protected contacts across all users. Not surprisingly, these two metrics are positively correlated, indicating that as the number of public contacts increases, so does the number of protected contacts. However, as the linear regression line shows, the number of public contacts grows faster than the number of protected ones. Finally, as a first step towards verifying that preferences are localized in the context of privacy, we calculated the correlation between the privacy ratio of each node and the average privacy ratio of its contacts. The analysis of users who have at least 10 mutual contacts in the network (about 7,000 users) showed a strong positive correlation between the two variables (Spearman ρ = +0.89). This correlation is also apparent in Figure 4.4, wherein the orange line is the linear regression line fitted to the data. This result indicates that either users’ privacy behaviour is influenced by their close social contacts, or individuals with similar privacy behaviour tend to cluster together in social networks. In either case, this finding implies the great potential of collaborative filtering approaches for privacy preference prediction.
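The neighbourhood correlation reported above can be reproduced in outline as follows. This is a sketch under stated assumptions rather than the authors' analysis code: the toy `ratio` and `contacts` dictionaries are invented for illustration, the function names are hypothetical, and the minimum of 10 in-network contacts is lowered to 2 so that the small example is not empty. Spearman's ρ is computed as the Pearson correlation of average ranks.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def neighbourhood_pairs(ratio, contacts, min_contacts=10):
    """Pair each user's privacy ratio with the mean ratio of their contacts."""
    own, neighbours = [], []
    for user, nbrs in contacts.items():
        known = [ratio[v] for v in nbrs if v in ratio]
        if len(known) >= min_contacts:
            own.append(ratio[user])
            neighbours.append(sum(known) / len(known))
    return own, neighbours


# Hypothetical toy network: two clusters with low and high privacy ratios.
ratio = {"a": 10, "b": 12, "c": 11, "d": 40, "e": 45, "f": 42}
contacts = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "d": ["e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
own, neighbours = neighbourhood_pairs(ratio, contacts, min_contacts=2)
rho = spearman(own, neighbours)
print(f"Spearman rho = {rho:.3f}")
```

A positive ρ here only shows that users with similar privacy ratios are connected. As in the text, it cannot by itself separate social influence from clustering of like-minded users.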
