5.2 Algorithms for identifying trust in Twitter
5.2.1 Betweenness Centrality
We have already discussed that in large social networks such as Twitter, not all nodes can be regarded as equal. For instance, removing a node from a network has an impact that depends on which node is removed. If the node lies at a "dead-end" [41], its removal may have no consequences, whereas removing a "cut-vertex" (see bridges/interconnectors in Chapter 3) may cause the network's components to break apart [42], [43]. In SNA, the problem of determining the degree of "centrality" of the different agents as a function of their position within the network was studied in [44], [45], and several quantities have been defined to quantify it. One might assume that the centrality rank of a node is proportional to its connectivity. This assumption is wrong, because centrality is in general not determined by connectivity. Connectivity is a purely local quantity and does not provide all the information needed to assess the importance of a node in the network. Indeed, an agent may not have a high node degree, yet its removal may be fatal because it links together different parts of the network. "A good measure of the centrality of a node has thus to incorporate a more global information such as its role played in the existence of paths between any two given nodes in the network" [41].
centrality ranking is computed. In particular, betweenness centrality counts the fraction of shortest paths going through a given node.
More precisely, the betweenness centrality of a node u is given by [44], [45]

$$g(u) = \sum_{s \neq t \neq u} \frac{\sigma_{st}(u)}{\sigma_{st}},$$

where $\sigma_{st}$ is the total number of shortest paths from node s to node t and $\sigma_{st}(u)$ is the number of shortest paths from s to t that go through node u. The quotient $\sigma_{st}(u)/\sigma_{st}$ is defined as $\mu_{st}(u)$
and is called the pair dependency [46]. The betweenness centrality g scales in proportion to the number of pairs of nodes $s \neq t \neq u$, so some authors normalize it by $(N - 1)(N - 2)/2$ to obtain a value in the interval [0, 1], where N is the number of nodes in the giant component of the network discussed in a previous chapter. A high centrality value indicates that a node can reach others on short paths or that it lies on many short paths. If a node with a high betweenness centrality value is removed from the graph, two situations may arise. In the first, the paths between many pairs of nodes are lengthened. In the second, less desirable one, the node is a cut-vertex [42], [43] and its removal splits the graph into new, smaller components. This effect was used, for instance, in [47] to discover communities in large networks iteratively.

There are also other centrality metrics based on the shortest paths that link pairs of nodes, namely stress, closeness, and graph centrality; they are described in [44], [45]. The basic pseudo-code that a programmer should consult before implementing the betweenness centrality algorithm is the following:
In the work of Ulrik Brandes [46], the above algorithm requires $O(n + m)$ space and runs in $O(nm)$ and $O(nm + n^2 \log n)$ time on unweighted and weighted networks, respectively, where n is the number of nodes and m is the number of links.
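As an illustration of these bounds, the following is a minimal Python sketch of the unweighted case of Brandes' algorithm [46]. The function name, the adjacency-dictionary input, and the `undirected` flag are assumptions made for this illustration and are not taken from the prototype; weighted graphs would need Dijkstra's algorithm in place of the breadth-first search.

```python
from collections import deque


def betweenness_centrality(adj, undirected=True):
    """Unnormalized betweenness centrality for an unweighted graph.

    adj maps every node to an iterable of its neighbours (assumed input
    format). Every neighbour must itself be a key of adj, and an undirected
    graph must list each edge in both directions.
    """
    g = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []  # nodes in order of non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # w is discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                g[w] += delta[w]
    if undirected:
        # Each unordered pair (s, t) was visited from both endpoints.
        g = {v: value / 2.0 for v, value in g.items()}
    return g


# Star graph: node 0 is the centre, nodes 1 to 4 are leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
scores = betweenness_centrality(star)
```

For this star graph with N = 5 nodes, the centre lies on the single shortest path between every pair of leaves, so its score is (N - 1)(N - 2)/2 = 6, which normalizes to 1, while every leaf scores 0.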
As a next step, it is crucial to clarify why we need betweenness centrality as one of our trust metrics for the Twitter social network graph. We claim that a node in a central position within a network may act as a router of the information flow. A node in such a position may contribute to the different topics discussed within the sub-networks that it links together, which may raise his/her trustworthiness and grow his/her trust-map. Figure 5.1, together with the following example, illustrates the role of betweenness centrality in trust.
For the purpose of the example, let us assume that nodes 34 and 3 are managers who represent both football and basketball athletes and try to find the best deal for their clients. Let us further assume that the white sub-network represents the football industry and the gray sub-network represents the basketball industry. We can see that although node 34 has a central role within the football industry (white sub-network), he/she is some hops away from the basketball industry and may therefore have to rely on other nodes, who may be competitors, to achieve his/her goals.

Figure 5.1: Another example of Betweenness Centrality (source: http://www.slideshare.net/ereteog/social-network-analysis-5800120)
In contrast, node 9 has access to both the football and the basketball information flow, which could give him/her a strategic advantage over his/her competitors. This could raise his/her reputation and thus his/her trustworthiness among his/her clients.
5.2.2 HITS
Another important metric that we will use is a variation of the HITS ranker algorithm. HITS stands for Hyperlink-Induced Topic Search and is also known as the Hubs and Authorities algorithm. It is a link analysis algorithm that rates web pages; it was introduced by Jon Kleinberg and was a forerunner of the famous PageRank algorithm that Google uses to rank web pages. "The idea behind Hubs and Authorities has its roots in a particular insight into the creation of web pages when the Internet was originally forming. This idea relies on the fact that, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that it held, but were used as compilations of a broad catalogue of information that led users directly to other authoritative pages" [48]. "In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs" [49].

For this reason, the HITS algorithm assigns two scores to each page: "its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages" [48]. To show why the HITS algorithm was chosen to rank Twitter users, let us consider an example that introduces the concept behind HITS: how it could be used to rank scientific journals [49].
Many different methods have been proposed to evaluate the significance of published academic work. Garfield introduced the so-called impact factor [50]. The argument made in this context is that what matters most is not how many citations a journal or an article receives but how important the citing sources are. "In other words, it is better to receive citations from an important journal than from an unimportant one." [48]
In a similar manner, we use HITS to rank nodes in the Twitter social network. This is explained in detail in the implementation and evaluation part of this work, but we give an introduction here. First, we built a weight for each node out of its tweets: we counted the retweets and mentions that each node received during its time in the Twitter network and used the aggregated counts as weights. Then we applied the HITS algorithm to the weighted graph and ranked the nodes again. As with journal citations, a node may have many incoming connections while the sum of the weights of those connections is not high. Conversely, a node may be pointed to by the most trustworthy and trustful nodes in terms of retweets and mentions, which could enhance his/her trustworthiness over time.
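The weighting step can be sketched in Python as follows. The input format (a list of `(source, target, kind)` tuples), the function names, and the equal weighting of retweets and mentions are assumptions made for illustration; the text does not specify how the two counts are combined in the prototype.

```python
from collections import Counter


def build_edge_weights(interactions):
    """Aggregate retweet and mention events into weighted directed edges.

    interactions is an iterable of (source, target, kind) tuples (assumed
    format), where source is the account that retweeted or mentioned target
    and kind is "retweet" or "mention". Self-interactions are ignored.
    Returns a dict mapping (source, target) to the number of interactions.
    """
    weights = Counter()
    for source, target, kind in interactions:
        if kind in ("retweet", "mention") and source != target:
            weights[(source, target)] += 1
    return dict(weights)


def received_totals(weights):
    """Sum the weights of the incoming edges of every target node."""
    totals = Counter()
    for (_, target), weight in weights.items():
        totals[target] += weight
    return dict(totals)


# Hypothetical interaction log used only to exercise the functions.
events = [
    ("alice", "carol", "retweet"),
    ("alice", "carol", "mention"),
    ("bob", "carol", "retweet"),
    ("carol", "bob", "mention"),
]
edge_weights = build_edge_weights(events)
incoming = received_totals(edge_weights)
```

In this toy log, "carol" receives three interactions over two incoming edges and "bob" receives one, which shows how two nodes with a similar number of incoming connections can still differ in total incoming weight.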
We have justified why we regard the HITS algorithm as an important mechanism for measuring users' trust. We now explain how the algorithm works.²
"In the HITS algorithm, as it operates on websites, the first step is to aggregate the results of the search query. Then, authority and hub values are defined in terms of one another in a mutual recursion. An authority value is computed as the sum of the scaled hub values that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. The algorithm performs a sequence of iterations, each consisting of two basic steps:

The first step is the Authority Update: update each node's Authority score to be equal to the sum of the Hub scores of each node that points to it. This means that a node is given a high authority score by being linked to by nodes that are recognized as Hubs for information.

The second step is the Hub Update: update each node's Hub score to be equal to the sum of the Authority scores of each node that it points to. This means that a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.

The Hub score and Authority score for a node are calculated as follows. Start with each node having a hub score and an authority score of 1. Then run the Authority Update step described above, followed by the Hub Update step. Next, normalize the values by dividing each Hub score by the square root of the sum of the squares of all Hub scores, and dividing each Authority score by the square root of the sum of the squares of all Authority scores".³ As noted, HITS is iterative, so the procedure is repeated from the second step as many times as necessary.

HITS is similar to PageRank in that it relies on iterations based on the linkage of documents on the web. However, it has some major differences:
• It is executed at query time, not at indexing time, with the associated hit on performance that accompanies query-time processing. Thus, the hub and authority scores assigned to a page are query-specific.⁴
• It computes two scores per document, hub and authority, as opposed to a single score.⁵ An overall HITS score is then assigned to each node.
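For reference, the Authority Update, the Hub Update, and the normalization described earlier in this section can be written compactly as follows. The symbols $a(p)$, $h(p)$, and $E$ (the set of directed links) are notation introduced here for illustration, and the normalization shown is scaling to unit Euclidean length, as in Kleinberg's formulation.

$$a(p) \leftarrow \sum_{q : (q, p) \in E} h(q), \qquad h(p) \leftarrow \sum_{q : (p, q) \in E} a(q),$$

$$a(p) \leftarrow \frac{a(p)}{\sqrt{\sum_{r} a(r)^2}}, \qquad h(p) \leftarrow \frac{h(p)}{\sqrt{\sum_{r} h(r)^2}}.$$

Both scores start at 1 for every node, and the four assignments are repeated until the scores stabilize or a fixed number of iterations is reached.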
This is the pseudo-code of the variant used in this work. Its main difference from the standard HITS algorithm is that it is applied to a weighted graph that takes into account the frequency of mentions and retweets for a specific node:
³ http://www.en.wikipedia.org/
⁴ http://matalon.org/search-algorithms/
⁵ http://matalon.org/search-algorithms/
Figure 5.2: This is the HITS algorithm’s pseudocode
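A weighted variant of this kind can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the prototype's code: the function names, the edge-weight dictionary format, the fixed iteration count, and the choice to multiply each contribution by the edge weight are all assumptions.

```python
import math


def _normalize(scores):
    """Scale the scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    if norm == 0:
        return dict(scores)
    return {node: value / norm for node, value in scores.items()}


def weighted_hits(weights, iterations=50):
    """Hub and authority scores on a weighted directed graph.

    weights maps (source, target) to a positive edge weight, for example the
    number of retweets and mentions from source to target (assumed format).
    Returns two dicts: (hub_scores, authority_scores).
    """
    nodes = {node for edge in weights for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: weighted sum of the hub scores pointing in.
        new_auth = {node: 0.0 for node in nodes}
        for (source, target), weight in weights.items():
            new_auth[target] += weight * hub[source]
        # Hub update: weighted sum of the updated authority scores pointed to.
        new_hub = {node: 0.0 for node in nodes}
        for (source, target), weight in weights.items():
            new_hub[source] += weight * new_auth[target]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


# Hypothetical weights: "a" interacted with "c" three times, "b" once.
example_weights = {("a", "c"): 3, ("b", "c"): 1}
hub_scores, auth_scores = weighted_hits(example_weights)
```

In this example "c" is the only authority, and "a" receives a higher hub score than "b" because its link to "c" carries three times the weight. A fixed iteration count keeps the sketch short; a convergence test on the change in scores would be a natural alternative.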
5.3 Summary
In this chapter we explored the different parameters that affect a Twitter user's trust and trustworthiness, such as node degree, position within the network, and interactions in terms of retweets and mentions. To rank users, we introduced two main algorithms, betweenness centrality and the HITS ranker, and analyzed their features and structure.
Design
So far we have introduced trust in general and disproved the statement that trust cannot emerge in online environments. Next, we explored the role of trust in online communities and then discussed online social networks: we examined their structure and several methods for analyzing them. The following step was to use some of those methods to perform a network analysis of the Twitter social network, where we observed some interesting features that we presented through charts. The final step was to identify the trust factors and the algorithms needed to build our trust mechanisms. Based on these findings, we designed and implemented a prototype that helped us quantify and observe how trust emerges between the users within our Twitter graph. In this section we present the requirements that our prototype should fulfill and its basic architecture, followed by a class diagram and a sequence diagram.