Analysis of Profile Attributes - On the Promotion of the Social Web Intelligence

In our analysis, the geo-enabled attribute, default profile, and default image are all binary attributes and are studied as binary variables. Similarly, the numeric attributes of the favorite count, the tweet count, the follower/friend count, as well as the list count are analyzed as is.

Based on the declared name in the Twitter account, we created a binary and a numeric attribute. We matched the account name against a dictionary of English names to check whether any part of the declared name is a person’s name in the dictionary. We also counted the number of parts of the account name that appear in the dictionary. For example, an account name with only the first name matched has the value 1, whereas an account name with both the first and the last name in the dictionary has the value 2. For the Twitter account’s username, we checked whether it contains the declared name of the user. For the description, URL, and location attributes, we simply checked whether the corresponding piece of information is provided by the user.
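As an illustration of the name-based attributes described above, the following is a minimal sketch and not the implementation used in the study. The small name set, the function names, and the rule that a username "contains" the declared name when it includes the name with whitespace removed are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for the dictionary of English names;
# the source does not say which dictionary was used.
NAME_DICTIONARY = {"john", "mary", "smith", "emma", "brown"}


def name_features(declared_name, names=NAME_DICTIONARY):
    """Return (has_name, name_count) for a declared account name."""
    # Split the declared name into alphabetic parts, ignoring case.
    parts = re.findall(r"[a-z]+", declared_name.lower())
    name_count = sum(1 for part in parts if part in names)
    return name_count > 0, name_count


def username_has_name(username, declared_name):
    """Assumed rule: the username contains the declared name without spaces."""
    compact = "".join(declared_name.lower().split())
    return bool(compact) and compact in username.lower()
```

Under these assumptions, an account named "John Smith" yields a name count of 2, while "John Xyzzy" yields 1, matching the example values given in the text.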

3.4. Analysis of Profile Attributes

Finally, we used a linguistic analysis tool to study the accounts’ profile descriptions and understand how users with different privacy settings describe themselves on Twitter. The results for the surface-based profile features are provided in Section 3.4.1, while Section 3.4.2 explains the results of the linguistic analysis of the profile descriptions.

3.4.1 Surface-based Profile Features

Table 3.2 presents the selected binary features, along with the percentage of protected and public accounts for which each attribute holds. Although the Chi-Square test results suggest statistical significance for all the features, the effect sizes suggest that only three features differ in practice between the public and the protected accounts: has location, is geo-enabled, and is default profile. We calculated the effect size using Cramér’s V and followed the convention to interpret the value [2]: a Cramér’s V of at least .1 is needed to indicate a practically significant effect. As shown in the table, a larger percentage of protected accounts have enabled geo-tagging and have provided information for the location attribute. In addition, more protected accounts than public accounts have changed their default profile settings.
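To make the effect-size calculation concrete, the sketch below computes the Chi-Square statistic and Cramér’s V for a 2x2 table (account type by binary attribute). It is an illustration only: the function names are ours, and the counts in the usage note are hypothetical, since the raw counts behind Table 3.2 are not given here.

```python
import math


def chi_square_2x2(table):
    """Pearson Chi-Square statistic (no continuity correction) for a 2x2 table.

    `table` is [[a, b], [c, d]]: rows are account types, columns are
    attribute present / absent. Returns (chi2, n).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    """Cramér's V; for a 2x2 table min(rows - 1, cols - 1) = 1."""
    chi2, n = chi_square_2x2(table)
    return math.sqrt(chi2 / n)
```

For example, with hypothetical counts of 647 out of 1,000 protected accounts and 490 out of 1,000 public accounts providing a location, `cramers_v([[647, 353], [490, 510]])` is roughly 0.16, above the .1 threshold. The sketch assumes every expected cell count is non-zero.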

Table 3.3 provides the average values of our numeric features in the two types of accounts. We calculated the effect size using Cohen’s d and followed the convention to interpret the value [2]. In our study context, a feature’s Cohen’s d needs to be at least .2 for the feature to be considered practically useful in distinguishing the protected and public accounts. Although the t-test results suggest statistical significance for all the features, the effect sizes suggest that only the tweet count has a practically different value in the public versus the protected accounts. The results show that, on average, protected accounts tweet more often, with an effect between small and medium (d=.29) (see Table 3.3). Protected accounts also appear to have a larger number of favorite tweets, although the effect is very small (d=.09).
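For reference, Cohen’s d for two groups can be computed as the difference of the group means divided by a pooled standard deviation. The sketch below uses the common pooled-variance form; whether the study used exactly this variant is an assumption, and the function name is ours.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Cohen's d with a pooled standard deviation (sample variances)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # uses the n - 1 denominator
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd
```

The sign of the result only indicates which group has the larger mean; the magnitude is what is compared against the .2 threshold. Each sample needs at least two values and the pooled standard deviation must be non-zero.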

In general, the results are interesting and contrary to what we expected before the analysis. For example, we anticipated that because the protected accounts represent a more private or more privacy-aware population, they would be less likely to enable the location tracking feature, to change the default profile theme, or even to tweet often. These findings, however, indicate otherwise.

3.4.2 Profile Descriptions: A Closer Look

Chapter 3. Privacy and Profile Attributes in Twitter

Table 3.2: Analysis of binary profile attributes of protected and public accounts.

Binary Attributes     %Protected   %Public   Effect Size
Has Name              71.17        68.86     0.02
Username Has Name     3.01         3.45      0.01
Has Description       56.79        51.69     0.05
Has URL               15.01        16.78     0.02
Has Location          64.70        49.05     0.15
Is Geo-enabled        39.78        25.59     0.15
Is Default Profile    33.26        71.17     0.38
Is Default Image      6.43         8.80      0.04

Table 3.3: Analysis of numeric profile attributes of protected and public accounts.

Numeric Attributes    Protected    Public    Effect Size
Favorite Count        189.32       115.43    0.09
Tweet Count           1389.16      384.55    0.29
Follower Count        80.71        166.78    0.03
Friend Count          255.78       242.76    0.03
List Count            1.01         0.93      0.0006
Name Count            1.10         1.07      0.04

As explained earlier, Twitter users can provide up to 160 characters in the description field. In our set of CNN followers, almost 500K protected accounts and roughly 500K public accounts have descriptions. We used Linguistic Inquiry and Word Count (LIWC 2015) to analyze the language categories in these descriptions. The LIWC program processes each text file word by word and compares each word against a pre-built dictionary to detect the LIWC categories that the word belongs to. After processing all the words in the text, LIWC calculates and outputs the percentage of words in each LIWC category. Before conducting the linguistic analysis with LIWC, we applied the following pre-processing steps to the descriptions:

• removed HTML characters

• replaced apostrophe elisions (e.g., I’m -> I am)

• replaced URLs with the word “url”

• replaced emoticons with their corresponding meanings (e.g., :) -> smile)

• removed punctuation marks

• replaced user handles with the word “mention”
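The pre-processing steps listed above can be sketched as follows. This is a hedged illustration, not the pipeline used in the study: the lookup tables are tiny hypothetical examples, HTML entities are decoded rather than deleted, text is lower-cased when an elision is expanded, and the steps are ordered so that URLs, handles, and emoticons are replaced before punctuation is stripped (otherwise the "@" and ":)" markers would already be gone).

```python
import html
import re
import string

# Hypothetical, deliberately small lookup tables; the source does not list
# the elisions or emoticons it handled.
ELISIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}
EMOTICONS = {":)": "smile", ":(": "sad"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(description):
    """Apply the listed pre-processing steps to one profile description."""
    text = html.unescape(description)        # decode entities such as &amp;
    text = text.replace("\u2019", "'")        # normalise curly apostrophes
    text = URL_RE.sub(" url ", text)          # URLs -> "url"
    text = MENTION_RE.sub(" mention ", text)  # user handles -> "mention"
    for emoticon, meaning in EMOTICONS.items():
        text = text.replace(emoticon, f" {meaning} ")
    # Expand elisions token by token (only exact, case-insensitive matches).
    words = [ELISIONS.get(word.lower(), word) for word in text.split()]
    text = " ".join(words)
    text = text.translate(PUNCTUATION_TABLE)  # remove ASCII punctuation
    return " ".join(text.split())             # collapse extra whitespace
```

A real pipeline would need much larger elision and emoticon tables and would have to decide how to treat non-ASCII punctuation, which `string.punctuation` does not cover.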

The LIWC dictionary is structured in a hierarchical format, wherein each category may encompass several sub-categories. Details about these categories can be found on the LIWC website. Since the users’ profile descriptions are usually very short (commonly between 8 and 10 words), the percentages provided by LIWC are often very small for the majority of the categories. Therefore, we focused only on the higher-level categories that are at the top of the LIWC hierarchy.

Table 3.4: LIWC categories and their corresponding percentage for protected and public descriptions.

LIWC Category         Protected    Public    Effect Size
Function Words        37.51        33.68     0.15
Affect                8.68         7.50      0.10
Social Processes      11.08        10.75     0.02
Cognitive Processes   7.86         6.71      0.09
Drives and Needs      11.01        11.38     0.02
Relativity            10.03        10.23     0.0006
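The dictionary-based counting that LIWC performs can be illustrated with a minimal sketch. It is not LIWC itself: the category word lists below are hypothetical and tiny, and the real LIWC 2015 dictionary is far larger and hierarchical. The sketch only shows how per-category percentages arise and why they are coarse for texts of 8 to 10 words.

```python
from collections import Counter

# Hypothetical mini-dictionary for illustration only.
CATEGORY_WORDS = {
    "function": {"i", "am", "a", "the", "of", "and"},
    "affect": {"love", "happy", "sad"},
    "social": {"friend", "family", "mom"},
}


def category_percentages(text, categories=CATEGORY_WORDS):
    """Percentage of words in `text` that fall into each category."""
    words = text.lower().split()
    counts = Counter()
    for word in words:
        # A word may belong to more than one category.
        for category, vocabulary in categories.items():
            if word in vocabulary:
                counts[category] += 1
    total = len(words)
    return {
        category: (100.0 * counts[category] / total if total else 0.0)
        for category in categories
    }
```

With a five-word description, a single matching word already accounts for 20% of the text, so percentages for fine-grained sub-categories are mostly zero or jump in large steps.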

Table 3.4 provides these categories as well as their corresponding percentages for the protected and public accounts. Here, we dropped those LIWC categories that had less than 5% of matching words in the entire corpus of descriptions. In addition, LIWC outputs a set of summary dimensions along with the percentage of their matching words. Table 3.5 provides the summary variables deemed relevant and their corresponding values for the two sets of accounts. A t-test was performed for these categories, and the effect size was measured by Cohen’s d. All the categories have statistically significant differences between the protected and the public accounts, but these differences are small according to Cohen’s d (see Table 3.4). It is still interesting to note that the protected accounts have a larger percentage of function words and affect words. Like our findings on the surface-based attributes, this is contrary to our prior expectation.

In addition to the LIWC main categories, an analysis of the summary dimensions shows that the descriptions of protected accounts contain a smaller percentage of long words (i.e., words with six or more letters). They use fewer words representing analytical thinking and clout. However, they have a higher percentage of words that bear emotional tone and authenticity. The differences are statistically significant based on the t-test results, but are not practically significant according to Cohen’s d (see Table 3.5).


Table 3.5: LIWC summary variables and their corresponding values for protected and public descriptions.

Summary Dimension     Protected    Public    Effect Size
Six Letter Words      22.73        26.69     0.14
Analytical Thinking   75.64        84.04     0.15
Clout                 66.83        72.99     0.10
Emotional Tone        98.58        96.55     0.05
Authentic             28.89        21.24     0.13

In document On the Promotion of the Social Web Intelligence (Page 47-51)