Observable Attributes - User Analysis - On the Promotion of the Social Web Intelligence

4.5 User Analysis

4.5.1 Observable Attributes

Profile Attributes

A subset of the available profile attributes is deemed relevant for our purpose and is selected for further analysis. This list is shown below, along with a brief description of each attribute. Please note that the descriptions are adapted from the Twitter API specification³.

• Description: A piece of text users provide to describe their account.

• URL: A URL provided by the user in association with their profile.

• Location: The user-defined location for this account’s profile.

• Geo-enabled: When true, indicates that the user has enabled the possibility of geo-tagging their Tweets.

• Default Image: When true, indicates that the user has not uploaded their picture and a default avatar is used instead.

• Default Profile: When true, indicates that the user has not altered the theme or background of their user profile.

• Favorite Count: The number of tweets this user has favorited in the account’s lifetime.

• Tweet Count: The number of tweets issued by the user.

• Follower Count: The number of followers this account currently has.

• Friend Count: The number of users this account is following.


We employed a simplified binary attribute to study user descriptions, URLs, and locations. The value of this attribute is “1” if the corresponding piece of information is provided by the user and “0” otherwise. The geo-enabled, default profile, and default image attributes are already binary and are studied as binary variables. Similarly, the numeric attributes (favorite count, tweet count, follower count, and friend count) are analyzed as is. To study how the profile attributes of users with different privacy ratios are configured, we calculated the correlations between the profile attributes and the privacy ratio across all the users.
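The encoding and correlation step described above can be sketched as follows. This is a minimal illustration that uses only the Python standard library; the helper names (`has_value`, `average_ranks`, `spearman`) and the toy user records are assumptions made for the example and are not taken from the original study.

```python
from statistics import mean

def has_value(field):
    """Binary attribute: 1 if the user provided the field, 0 otherwise."""
    return 1 if field and field.strip() else 0

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy users: (profile URL or None, privacy ratio)
users = [("http://a.example", 0.10), (None, 0.60), ("", 0.45),
         ("http://b.example", 0.20), (None, 0.80), ("http://c.example", 0.30)]
has_url = [has_value(url) for url, _ in users]
ratios = [ratio for _, ratio in users]
rho = spearman(has_url, ratios)  # negative for this toy data
```

Because a binary attribute produces many tied ranks, the sketch assigns average ranks to ties before computing the coefficient.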

Twitter profile attributes are among the few data points that are visible to the public and accessible through the Twitter API regardless of the user's privacy settings. Therefore, we retrieved the profile attributes of protected and public accounts and compared them directly. Such a supplementary experiment allows us to understand whether the profile attributes of Twitter users with varying privacy settings are configured differently. In addition, by comparing these findings with the analysis of public users with different privacy ratios, we can investigate whether protected accounts behave similarly to the public accounts located in more private neighbourhoods (regarding their profile attribute configuration). Finding similarities across the two sets can then contribute towards confirming that privacy attributes are localized.

To reliably compare the profile attributes of users with protected and public settings, we need a set of users in which the privacy attitude-behaviour dichotomy is minimized. Our earlier analysis of various follower sets associated with famous Twitter accounts [14] indicates that the followers of “CNN Breaking News” are a good candidate set, because the proportion of protected followers among all followers of “CNN Breaking News” has been shown to be considerably higher than in the other follower sets and than the average proportion on Twitter [14]. Therefore, we collected the profile attributes of 1M public and 1M protected accounts from the CNN follower set and compared their profile features. In the collection process, we ensured that inactive accounts, brands, and celebrities were excluded. The details of the data collection process can be found in [14].

Table 4.1 presents the analysis of the selected binary features. The correlations between these binary attributes and the privacy ratio of users are small for almost all the features. However, the differences between the protected and public accounts are statistically significant for all of the variables and practically significant for three of them. Interestingly, with one exception (Has Description), the results show similar behaviour for users with a high privacy ratio and those who have chosen to have protected accounts. In particular, the geo-enabled attribute has a relatively higher positive correlation with the privacy ratio and is more commonly used by users with protected accounts (both statistically and practically significant). In addition, protected accounts provide external URLs less often than public ones. Similarly, the negative correlation between Has URL and the privacy ratio indicates that users located within more private neighbourhoods tend to provide URL information less often than their counterparts.

Binary Attribute       Privacy Ratio (Spearman ρ)    Privacy Setting (Chi-Square P-value)
Has Description        -0.07                         +++
Has URL                -0.17                         - - -
*Has Location          -0.06                         +++
*Is Geo-enabled        +0.11                         +++
*Is Default Profile    -0.07                         - - -
Is Default Image       -0.03                         - - -

Table 4.1: Analysis of binary profile attributes and privacy-related features. ++ for P < .05 and +++ for P < .005 when the values are greater for protected accounts, whereas - - and - - - are used when the values are smaller for protected accounts. The variables for which at least a small effect size (Cramer’s V > 0.1) is observed are marked with an asterisk.
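To make the two significance notions used for Table 4.1 concrete, the following sketch computes a chi-square test and Cramer's V for a single binary attribute. The contingency counts are invented for illustration only and do not come from the CNN follower set; the function names are likewise assumptions. For a 2x2 table the test has one degree of freedom, so the P-value can be obtained from the complementary error function without an external statistics library.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, n

def p_value_1dof(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

def cramers_v_2x2(chi2, n):
    """Cramer's V; for a 2x2 table min(rows - 1, cols - 1) equals 1."""
    return math.sqrt(chi2 / n)

# Hypothetical counts: rows = protected / public, columns = geo-enabled yes / no
table = [[620, 380],
         [450, 550]]
chi2, n = chi_square_2x2(table)
p = p_value_1dof(chi2)
v = cramers_v_2x2(chi2, n)
```

With samples as large as those in this study, very small differences become statistically significant, which is why an effect size such as Cramer's V is reported alongside the P-value.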

Table 4.2 provides the findings for the numeric features. A positive correlation is observed across all the variables; however, the correlation coefficient is relatively higher for tweet count and favorite count. This positive correlation indicates that users with a high privacy ratio tend to tweet and favorite tweets more often. The results of the direct comparison of protected and public accounts are in line with the correlation coefficients, since protected users have significantly larger tweet, favorite, friend, and follower counts. The difference for the tweet count feature is also practically significant.
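The comparison of a numeric attribute between the two groups can be illustrated with a short sketch. The text does not state which t-test variant was used, so Welch's statistic is shown here as an assumption, together with Cohen's d based on the pooled standard deviation. The sample values are hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def welch_t(a, b):
    """Welch's t statistic, which does not assume equal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical tweet counts for a handful of accounts in each group
protected_counts = [120, 340, 95, 410, 220, 180]
public_counts = [80, 150, 60, 200, 130, 90]

d = cohens_d(protected_counts, public_counts)
t = welch_t(protected_counts, public_counts)
```

A positive `d` above 0.2 would correspond to the "at least a small effect size" criterion used for the asterisks in Table 4.2.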

Our direct analysis of protected and public accounts shows that protected accounts voluntarily reveal more information about themselves (e.g., geo-tags) and participate more actively in the network (e.g., tweet count). One possible explanation is that since users with protected accounts are aware that their data are private, they feel secure in this environment. On the other hand, users who consciously keep the public setting are utilizing a different strategy, such as self-censoring, to protect their privacy. Among the profile attributes, the presence of an external URL shows a different pattern: public accounts and users within public neighbourhoods seem to provide URLs more often. Despite our effort to exclude brands and celebrities, this difference can be attributed to professional Twitter accounts. For instance, artists who are on Twitter to promote their art often provide an external URL to their portfolio. Interestingly, our correlation findings reveal that public users with a large percentage of protected contacts behave similarly to the protected contacts. The implications of this finding are twofold. First, it indicates the existence of locality for privacy preferences. Second, this result implies that such users feel secure in the environment in terms of sharing their information, yet they are more privacy-concerned than the other public accounts. It is thus expected that they are more likely to feel invaded by targeted advertising and marketing messages.

Numeric Attribute    Privacy Ratio (Spearman ρ)    Privacy Setting (T-test P-value)
Favorite Count       +0.27                         +++
*Tweet Count         +0.30                         +++
Follower Count       +0.15                         +++
Friend Count         +0.02                         +++

Table 4.2: Analysis of numeric profile attributes and privacy-related features. ++ for P < .05 and +++ for P < .005 when the values are greater for protected accounts, whereas - - and - - - are used when the values are smaller for protected accounts. The variables for which at least a small effect size (Cohen’s d > 0.2) is observed are marked with an asterisk.

Language Use of the Content

Natural language has been shown to be a reflection and a mediator of internal states [26]. Our words can reveal personality, emotional states and feelings, attention patterns, thought, and social situations [26, 11]. Therefore, a variety of automated content analysis techniques have been developed to measure such psychometric properties from natural language. These methods range from the use of predefined dictionaries and taxonomies such as LIWC to complex computational algorithms that often utilize data mining and machine learning methods.

LIWC dictionaries are capable of providing a broad range of social and psychological insights from language. Hence, we used LIWC to analyze the language of tweets and to examine the links between a set of linguistic indicators and users’ privacy behaviour. LIWC has a processing component that examines a text file word by word. Each word is compared against the built-in dictionaries. Given that LIWC dictionaries are structured hierarchically, the processing component then determines which LIWC categories or sub-categories the word belongs to. Once all the words are processed, LIWC outputs, for each category, the percentage of the words in the text that belong to it. In addition, a set of LIWC variables are measured independently of the dictionaries and are referred to as summary variables. These variables include four non-transparent language variables (analytical thinking, clout, authenticity, and emotional tone) and general descriptive features of the text (words per sentence and the percentage of words that are longer than six letters). The definition and examples of each of these categories can be found at the LIWC website⁴.
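The word-by-word dictionary lookup described above can be illustrated with a toy sketch. LIWC itself is a proprietary tool, so the mini-dictionary, the category labels, and the function name below are purely hypothetical stand-ins. The hierarchy is represented by listing a word's parent category together with its sub-category.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: word -> categories, parent categories included
TOY_DICTIONARY = {
    "angry": ["negemo", "anger"],
    "hate": ["negemo", "anger"],
    "sad": ["negemo", "sad"],
    "i": ["pronoun", "i"],
    "we": ["pronoun"],
}

def category_percentages(text, dictionary=TOY_DICTIONARY):
    """Return, per category, the percentage of words in the text that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {}
    counts = Counter()
    for word in words:
        for category in dictionary.get(word, []):
            counts[category] += 1
    return {category: 100.0 * n / len(words) for category, n in counts.items()}

result = category_percentages("I hate rain and I am sad")
```

In this example two of the seven words fall under the hypothetical `negemo` category, so its score is roughly 28.6 percent. The sketch matches whole words only and does not cover the summary variables, which are computed independently of the dictionaries.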

The tweet sets published by each user are first cleaned and pre-processed. For instance, elisions are expanded (e.g., I’m → I am), URLs and Twitter mentions are replaced with specific tokens, and emoticons are replaced by their corresponding meaning (e.g., “:)” → smile). Then all of the collected tweets published by a user are treated as a single document and given to LIWC for analysis. The percentages calculated by LIWC are then studied in terms of their correlations with the privacy ratio. Table 4.3 summarizes the correlation results for the LIWC categories and summary variables, ranked by correlation strength. The table only includes the variables whose correlation coefficient is beyond a certain threshold (ρ > 0.20 and P < 0.005).
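A minimal sketch of this cleaning step is given below. The elision and emoticon tables, the placeholder tokens `<URL>` and `<MENTION>`, and the function names are assumptions made for illustration; the lists actually used in the study are not given in the text.

```python
import re

# Hypothetical lookup tables; real lists would be much longer
ELISIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot", "it's": "it is"}
EMOTICONS = {":)": "smile", ":(": "sad", ":D": "laugh"}

def preprocess(tweet):
    """Clean one tweet: normalize apostrophes, mask URLs and mentions,
    spell out emoticons, and expand elisions."""
    text = tweet.replace("’", "'")
    text = re.sub(r"https?://\S+", "<URL>", text)
    text = re.sub(r"@\w+", "<MENTION>", text)
    for emoticon, meaning in EMOTICONS.items():
        text = text.replace(emoticon, f" {meaning} ")
    tokens = [ELISIONS.get(token.lower(), token) for token in text.split()]
    return " ".join(tokens)

def user_document(tweets):
    """Join all cleaned tweets of one user into a single document."""
    return "\n".join(preprocess(tweet) for tweet in tweets)

cleaned = preprocess("I’m happy :) thanks @bob see http://t.co/x")
```

URLs are masked before emoticons are replaced so that the colon in "http://" is never mistaken for part of an emoticon. Elisions followed directly by punctuation (for example "I'm,") are not expanded by this simple token lookup.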

The majority of the LIWC features that are positively correlated with the privacy ratio can be associated with private content according to societal consensus. Examples of these features include the use of swear words, the expression of anger and anxiety, and sexual topics. In addition, a positive correlation is observed between the use of “I” and the privacy ratio. Personal pronouns mainly appear in narratives and tweets that describe personal events, feelings, opinions, etc. Similarly, the use of the past tense is more often observed in more private neighbourhoods.

In LIWC, the analytical thinking feature captures the degree to which a piece of text represents formal, logical, and hierarchical thinking. Analytical thinking is negatively correlated with the privacy ratio. This finding may be attributed to professional accounts (e.g., those belonging to artists, politicians, or athletes), which are often located within public neighbourhoods and may normally use formal and logical language.

The following are two example tweets from timelines that scored high on analytical thinking:

• The staff on the Woodman ready to go our last cruise of the season.

• Up to 40,000 cardiac arrests occur each year in Canada. Without treatment, most of these cardiac arrests will result in death. Learn CPR.

A relevant category among the LIWC outputs is called authenticity, which captures the degree to which the language is more honest, personal, and disclosing. Even though the correlation is below our threshold and thus is not included in the Table (ρ = +0.22), the authenticity of the language is shown to be positively correlated with the privacy ratio.

LIWC Feature           Privacy Ratio (Spearman ρ)
Swear Words            +0.40
Anger                  +0.35
Negative Emotions      +0.34
Body                   +0.32
Negations              +0.31
Adverbs                +0.30
Sexual                 +0.29
Sad                    +0.27
Analytical Thinking    -0.26
FocusPast              +0.26
Interrogative          +0.26
Feel                   +0.26
I                      +0.26
Pronoun                +0.25
Anxiety                +0.25

Table 4.3: Correlation analysis of the LIWC categories and the privacy ratios.

Again, this finding shows that people located within private neighbourhoods are likely to be privacy-concerned, but they are probably privacy-unaware and thus publish sensitive information about themselves. Below are two example tweets from timelines that scored high on authenticity:

• It’s going to be one of those days where everything that can go wrong, will go wrong.

• First night with the new roomie. Watching The A-Team! #bradleycooper

As discussed earlier, tweets published by protected accounts are inaccessible, making it impossible to compare protected content with its public counterpart. However, there exists an accessible component that can represent the linguistic characteristics of protected accounts: profile descriptions. When configuring their profile attributes, Twitter users can provide up to 160 characters in the description field. In our set of CNN followers, almost 500K of the protected accounts and roughly 500K of the public accounts have descriptions. We used LIWC to analyze the language categories in these descriptions. Since profile descriptions are often very short (commonly between 8 and 10 words), the percentages provided by LIWC are very small for the majority of the categories. Therefore, we focused only on the higher-level categories at the top of the LIWC hierarchy as well as the summary variables. Table 4.4 summarizes the results.

LIWC Feature           Privacy Ratio (Spearman ρ)    Privacy Setting (T-test P-value)
Analytical Thinking    -0.26                         - - -
Authentic              +0.22                         - - -
Clout                  -0.15                         +++
Function Words         +0.23                         +++
Affect Words           +0.13                         +++
Social Processes       +0.10                         +++
Cognitive Processes    +0.23                         +++
Drivers and Needs      -0.03                         - - -

Table 4.4: Analysis of the linguistic indicators and the privacy-related features.

The two underlying datasets represent two different sets of users that are collected in different manners from Twitter. In addition, the language content of tweets and descriptions is provided for different purposes. However, we still see the same behaviour from protected accounts and the public accounts that are connected to a large percentage of protected neighbours in terms of their language use (see Tables 4.3 and 4.4). For instance, authenticity, the use of function words, affect words, social processes, and cognitive processes are positively correlated with the privacy ratio. Likewise, they are observed more often in the profile descriptions of the protected accounts. On the other hand, analytical thinking, clout, and drivers and needs are negatively correlated with the privacy ratio and are observed less often in protected accounts. Again, such similarities signal the presence of locality for users’ privacy behaviour. Therefore, features that are specific to public accounts that are connected to a large number of private contacts can be considered the features that characterize privacy-concerned users.

Tweet Sentiment

LIWC captures the percentage of words that belong to different sentiment-related categories (e.g., positive and negative words, anger, and anxiety) in user tweet sets. In addition to these LIWC categories, we took advantage of a lexical resource called SentiWordNet [3] to analyze tweet sentiments and privacy features from a different perspective. In SentiWordNet, each word is given three sentiment scores: positivity, negativity, and objectivity. We first cleaned and tokenized each tweet. The tokens are then POS tagged and stemmed. The resulting token-POS tag pair is matched against SentiWordNet. Finally, the scores retrieved from SentiWordNet are aggregated over all the tokens in the tweet to generate an overall tweet sentiment score. It should be noted that whenever a negation is observed, the inverted score of the following token is taken into account. Once the tweet sentiment scores were determined, we calculated the ratios of positive and negative tweets to the total number of tweets. Based on the analysis, the ratio of positive tweets has a small negative correlation with the privacy ratio (ρ = −0.12), while the ratio of negative tweets is positively correlated with the privacy ratio (ρ = +0.26). The number of negations observed in the tweets is also positively correlated with the privacy ratio (ρ = +0.28). Due to the inaccessibility of tweets published by protected accounts, a direct comparison of protected and public accounts in terms of the ratio of tweets with different sentiment labels is not possible.
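The scoring procedure can be sketched as below. To keep the example self-contained, a tiny hypothetical lexicon stands in for SentiWordNet, and the tokens are assumed to be already POS tagged and stemmed. The text does not specify how the scores are aggregated, so summing positivity minus negativity per token is an assumption, as are the negation word list and the function names.

```python
# Hypothetical stand-in for SentiWordNet: (word, POS) -> (positivity, negativity)
TOY_LEXICON = {
    ("good", "a"): (0.75, 0.0),
    ("bad", "a"): (0.0, 0.625),
    ("love", "v"): (0.5, 0.0),
    ("hate", "v"): (0.0, 0.75),
}
NEGATIONS = {"not", "no", "never"}  # assumed negation cues

def tweet_score(tagged_tokens):
    """Sum (positivity - negativity) over tokens; a negation inverts
    the score of the token that follows it."""
    score, negate = 0.0, False
    for word, pos in tagged_tokens:
        if word in NEGATIONS:
            negate = True
            continue
        positivity, negativity = TOY_LEXICON.get((word, pos), (0.0, 0.0))
        value = positivity - negativity
        score += -value if negate else value
        negate = False
    return score

def sentiment_ratios(tagged_tweets):
    """Ratios of positive and negative tweets to the total number of tweets."""
    scores = [tweet_score(tweet) for tweet in tagged_tweets]
    positive = sum(s > 0 for s in scores) / len(scores)
    negative = sum(s < 0 for s in scores) / len(scores)
    return positive, negative

sample = [
    [("love", "v"), ("it", "n")],
    [("not", "r"), ("good", "a")],
    [("ok", "a")],
]
positive_ratio, negative_ratio = sentiment_ratios(sample)
```

In the sample, "not good" receives a negative score because the negation flips the positive score of "good", so one of the three tweets counts as positive and one as negative.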

Communication Behaviour

We employed a set of simple variables to characterize users’ communication behaviour from their timelines. In Section 4.5.1, we observed a positive correlation between users’ tweet count and their privacy ratios. In addition to the frequency of tweeting, we examined the average tweet length and found a negative correlation with the privacy ratio (ρ = −0.27). Published tweets can either be retweets from other accounts or new tweets coming from the account of focus. Note that retweets are excluded from the tweet length analysis. The correlation between the ratio of retweets to the total number of tweets and the privacy ratio is very small (ρ = +0.04), and the ratio of new tweets is also positively correlated with the privacy ratio (ρ = +0.15). In addition, users tweet to interact with each other and engage in conversations. The ratio of these conversational tweets also shows a positive correlation with the privacy ratio (ρ = +0.22). Another potentially interesting variable to investigate is the use of URLs when tweeting. We expect professional and non-personal accounts to publish news and events more frequently, which are often linked with URLs. As expected, the ratio of tweets with URLs is found to be negatively correlated with the privacy ratio (ρ = −0.22). We found negligible correlations between the hashtag usage patterns and the privacy ratio. As with tweet sentiment, a direct comparison of protected and public accounts is impossible due to the privacy restrictions associated with protected accounts.
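The per-user variables discussed in this section could be derived from a timeline roughly as follows. The record layout (dictionaries with the hypothetical keys `text` and `is_retweet`) is an assumption, and so is the rule for detecting conversational tweets (a non-retweet that starts with a mention), since the text does not state how they were identified.

```python
import re
from statistics import mean

def communication_features(tweets):
    """tweets: list of dicts with the hypothetical keys 'text' and 'is_retweet'."""
    total = len(tweets)
    originals = [t for t in tweets if not t["is_retweet"]]
    return {
        "retweet_ratio": (total - len(originals)) / total,
        # Assumed rule: a reply is a non-retweet that starts with a mention
        "conversation_ratio": sum(t["text"].startswith("@") for t in originals) / total,
        "url_ratio": sum(bool(re.search(r"https?://", t["text"])) for t in tweets) / total,
        # Retweets are excluded from the length analysis, as in the text
        "avg_length": mean(len(t["text"]) for t in originals) if originals else 0.0,
    }

timeline = [
    {"text": "RT @a: hello", "is_retweet": True},
    {"text": "@bob thanks!", "is_retweet": False},
    {"text": "Read this http://t.co/x", "is_retweet": False},
    {"text": "good morning", "is_retweet": False},
]
features = communication_features(timeline)
```

Each resulting value would then be correlated with the privacy ratio across users, in the same way as the profile attributes.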

4.5.2 Latent Attributes

Method

To discover a set of latent attributes that are of interest to privacy-concerned users, we transformed each user node in the network into a set of attributes. For each attribute node, we then calculated the ratio of protected contacts to the total number of contacts. The resulting network allows us to understand which features attract privacy-concerned users and which ones are more often observed in public neighbourhoods. Figure 4.5 shows this transformation procedure for a sample network in which public nodes are shown in blue and protected ones in red. Suppose that we have three users in the network: U1, U2, and U3. The nodes with dotted borders are the social contacts of the users that are not yet added to the network but are counted in the metadata calculation process and added to the BFS queue (see Section 4.3.1). As Figure 4.5 (a) shows, U1 is linked with three social contacts, among which two are public and one is protected; therefore, it is given the privacy ratio of θ_u1 = 33%. Similarly, U2 and U3 are given the values of θ_u2 = 50% and θ_u3 = 25%,
