• No results found

1.4 Overview of Dissertation Organization

2.1.4 Graph-based Approaches

In graph-based models, entities, like candidates and documents, are represented as nodes, and relations and interactions among these entities are represented as edges connecting the nodes. These graphs consist of either all the documents and expert candidates within the whole collection or they are created by using only topic-relevant documents and candidates.

2.1.4.1 Infinite Random Walk Model

Serdyukov et al. [84] proposed a multi-step relevance propagation algorithm that works on topic-relevant expertise graphs. An example topic-relevant expertise graph is provided in Figure 2.2. In this bipartite graph, the edges are in between documents and expert candidates.

Figure 2.2: A sample bipartite expertise graph.

Unlike one step propagation in Model 2 and voting models, this model proposes propagating expertise multiple times within the expertise graph based on the intuition that this is how users seek expertise. Given an expertise graph with document-document, candidate-candidate and document-candidate links, an expert seeker may do the following actions in order to find expertise.

At any time, randomly (1) read a document or (2) consult with a candidate.

After reading a document, (1) read one of the linked documents or (2) consult with one of the associated candidates.

After consulting with a candidate, (1) read one of the associated documents or (2) consult with one of the connected candidates.

Based on the assumption that documents and expert candidates of the same knowledge are located close to each other in expertise graphs, Serdyukov et al. [84] proposed random walk algorithms for estimating expertise. In infinite random walk model, the user visits document and expert candidate nodes over and over again. During this infinite walk, in order to keep

candidate in close proximity to relevant documents and candidates, jump transitions are intro-duced. According to the model, the probability of jumping to a document node is equal to document’s probability of being relevant to the query (Equation 2.14). Similarly, the jump prob-ability to an expert candidate node is equal to the probprob-ability of randomly choosing a retrieved document associated with the candidate (Equation 2.15). Following equations provide a formal representation for this model

whereλ is the probability that at any step the user makes a jump instead of following an outgoing link. n(e, Top) is the number of associated documents of expert candidate e within the retrieved

|Top| number of most relevant documents.

Graphs used for infinite random walk model are not strictly bipartite due to the probability of a jump to any node, but they still ignore the transitions between same type of entities like documents to documents or candidates to candidates. Users may follow document links to reach other related documents, or they can be directed to other expert candidates through their friends’

or co-workers’ references. Such additional links can be introduced to graphs depending on the availability of data. Serdyukov et al. [84] used documents’ links and organizational structure to extend their expertise graph, and applied infinite random walk approach by updating Equations 2.16 and 2.17 in the following way:

Pi(d)= λPjump(d)+ (1 − λ)

where Nd0is the number of outgoing document links from the document d0and Ne0is the number of outgoing candidate links from the expert candidate e0.

In addition to the infinite random walk model, finite random walk approach, which assumes that a user performs some predefined number of actions during expertise search instead of a non-stop process, has also been proposed.

These random walk approaches have been compared against Balog’s Model 2 [8] and Votes approach from Voting models [61] and shown to outperform these approaches on TREC’s enter-prise collection. However, their performances have not been compared to other more advance voting models. This comparison is an analysis that will be performed in the following chapters of this thesis.

2.1.4.2 Authority-based Approaches

In addition to directly using the user created content, the underlying social network structure between candidates is a valuable source for network-based methods in order to identify more in-fluential topic-specific experts. Therefore, previous research also worked on estimating centrality and authority of users in the graph with or without using the query by applying network-based estimation approaches like PageRank [15] or HITS [45]. Before getting into details of these network-based expertise estimation approaches, which are referred to as authority-based ap-proaches, and their application to user interaction (authority) networks, we initially describe the PageRank and HITS in order to provide some necessary background.

PageRank [15] estimates the probability of a random surfer reaching a web page. In a random walk, the surfer starts from a random page and moves to another page by either following one of the outgoing links (if there is one) or jumping to a random page. Over time, pages that are visited more are more important due to having incoming links from other important web pages.

Similarly, in a user authority network, users who have been linked from other important users should be important. Based on this intuition PageRank is applied to user authority networks as shown:

where PR(u) is the PageRank score of user u, N is the number of users, IL(u) is the set of users with an incoming link to user u, PR(ui) is the PageRank score of user ui, |OL(ui)| is the number of users user uihas an outgoing link to, and d is the damping factor for random jumps.

HITS [45] approach divides topic-related web pages into two categories; the ones that can be considered as authoritative source of information for the topic, referred to as authority pages, and the others called hub pages which contain a list of links to the topic-related authoritative pages.

A good hub page points to many good authority pages, while a good authority page is linked to by many good hub pages. Similarly, for users, a good hub user knows and refers to many good authorities on a specific topic, and a good authority user is one that has been referred to by many good hub users regarding the particular topic. The authority score of user u is equal to the sum of the hub scores of users that point to user u, and hub score is equal to the sum of the authority scores of users that user u points to. Hub and authority scores of users are calculated as follows:

Auth(u) = X

where Auth(u) is the authority and Hub(u) is the hub score of user u. IL(u) is the set of users with an incoming link to user u, while OL(u) is the set of users with an outgoing link from user u.

Previous work applied these PageRank or HITS algorithms to user authority networks con-structed from user interactions. As different social media platforms are introduced over time, new types of interactions emerge between users which can be used to estimate authorities. Lap-pas et al. [95] provided a survey of authority-based expert finding approaches used on social networks.

Campbell et al. [20] extracted senders and receivers from topic relevant emails and built expertise graphs. They showed that applying a modified version of the HITS algorithm over these graphs improves the precision with the cost of a lower recall compared to using a simple

content-based approach. Part of the email dataset used in this paper was also used by Dom et al. [28] in order to test HITS, PageRank and other graph-based ranking algorithms’ effect on expertise ranking. According to their results PageRank performed much better than other algorithms, including HITS. Chen et al. [22] also applied PageRank and HITS to the email communication network of TREC W3C collection.

HITS and PageRank like algorithms were also applied to CQA sites with the aim of identifying expertise. Jurczyk and Agichtein [42, 43] applied HITS approach to question answer communities in order to discover authorities. Zhang et al. [104] applied several network-based algorithms to online help-seeking communities. The authors proposed ExpertiseRank, a variation of PageRank algorithm which is calculated as follows,

where ER(u) is the ExpertiseRank score of user u, n is the number of users user u answered questions from, C(ui) is the number of users replied questions from user uiand d is the damping factor for random jumps. Zhang et al. [104] also tested HITS algorithms and z-score measure which combines a user’s asking and answering patterns. Their experiments revealed different results than Campbell et al. [20] and Dom et al. [28]. In their dataset, z-score, the simpler method, performed better than ExpertiseRank and HITS. After performing several simulations, the authors confirmed that the performance of graph-based algorithms highly depend on the structure of the network. For instance, they identified that their ExpertiseRank approach works better in networks when expertise of askers’ and responders’ are correlated, in other words responders are more selective on choosing challenging questions that they can answer.

One-step propagation approaches are also explored. Fu et al. [34] started with a set of experts in order to identify experts by using the associations between them. Their estimated expertise score were dependent on the expertise of the associated seed expert and the degree of association between the corresponding users. Associations like co-occurrence within a document or within an email communication (existence in any one of the from, to, cc fields) were used.

They applied one-step propagation from seeds to other associated users, therefore their approach can be considered as a voting based approach which uses the co-occurrence with topic-specific experts, instead of co-occurrence in topic-specific documents.

Several prior work also used an initially estimated expertise as prior score and propagate it in graphs to identify other users. Zhang et al. [103] propagated expertise in a heterogeneous aca-demic collaboration network in order to identify experts. An initial expertise score is calculated with profile-based approach and then a belief propagation model [31] was applied to estimate final expertise scores as shown: corresponding propagation coefficient. In an earlier version [53], the authors also applied this approach to a homogeneous academic network. This approach is similar to other propagation algorithms with an earlier estimated expertise used as the initial (prior) score before iterations.

Lu et al. [60] also applied the same algorithm to CQA answering graphs which do not only

have the direct answering edges but also the latent edges which were obtained with similarity between askers’ and responders’ profiles.

Similarly, Karimzadehgan et al. [44] tried to identify experts within an organization, by propagating expertise from users with profiles to users who don’t have profiles. After initial expertise score was estimated for a user with profile, the organizational hierarchy was used to propagate this expertise to other employees who are working closely together with the particular identified expert candidate (such as his managers or peers etc.). This approach is similar to graph smoothing as initially estimated expertise scores were locally smoothed with their neighbors scores as follows: applying this algorithm for direct neighbors, the second and third degree neighbors were also experimented.

Jiao et al. [39] proposed a modified PageRank approach for online communities, in order to overcome a possible expert spamming issue caused by mutually referencing activities within a small group of users. With the proposed Weighting Reference Relationship algorithm, the transition probabilities are re-weighted by decreasing the weights of direct or indirect back-link references within a certain radius.

Seo and Croft [83] proposed two graph construction approaches. In post-based approach the individual post’s nodes are linked to associated users’ nodes. In thread-based approach, instead of using individual documents for posts, a thread was constructed by using all the posts, and graphs are constructed by linking these threads to associated users. Applying these approaches to emails and online discussion forums within TREC enterprise collection showed that thread-based graphs are more effective due to capturing a better understanding of the posts together with other posts within the thread.

Finding topic-specific experts in Twitter is also important in terms of identifying influential and authoritative users to follow. For this purpose, Weng et al. [96] proposed TwitterRank, an extension of PageRank, to identify topic-sensitive influential twitterers. The proposed algorithm uses both the topical similarity between users and the link structure (following-follower network) to identify topical authorities in microblogs. This algorithm is different than PageRank in that transitions between users are weighted with the topic specific similarity of users which has been calculated from estimated topic models. Pal and Counts [70] also proposed a non-graphical approach for estimating topical authorities in microblogs. The authors applied probabilistic clustering over a set of user features. The users within the cluster were later ranked to create a final ranking of authorities. Compared to link-based authority estimation approaches, their algorithm shown to be more efficient in terms of running time.

Additionally Noll et al. [68] proposed a HITS-like algorithm, called SPEAR, for ranking experts in a collaborative tagging system. Their method assumed that there is a mutual re-inforcement relationship between user’s expertise and the quality of tagged documents, and experts (discoverers) are the ones who finds out useful resources before other users (followers) do. In this setting, experts act as hubs and receive hub scores as expertise score, and documents act as authorities and receive authority scores as quality score.

Authority estimation approaches were also applied to author networks constructed from citations. Liu et al. [57] proposed AuthorRank, a weighted version of PageRank approach where

the weights are based on co-authorship frequency between authors. Deng et al. [27] extended the AuthorRank approach by using community information. They estimated AuthorRank for each community and scores from communities that are similar to a given query are combined to estimate the final score.

SmallBlue (aka. IBM Atlas) [29, 54, 55] is a social networking application designed to lo-cate knowledgeable members and communities within an organization. This system combined two prior works: ExpertiseNets [92] and CommunityNet [91]. ExpertiseNet are dynamic graphs constructed for each user to represent particular user’s expertise profile. In these graphs, expo-nential random graph model [88] was used to describe the relational information of expertise areas and the dynamic actor-oriented models [89] were used to describe the temporal evaluation of these expertise areas. CommunityNet graphs were also built for each user in order to capture the context-dependent and temporal evaluation information from email communications of the particular user. The authors proposed Content-Time-Relation algorithm which is an incremental LDA approach to model the contextual, relational and temporal information together. SmallBlue system used the network structure in order determine the shortest path from expert seeker to identified expert.

More details on other prior work on applying authority-based approaches to community question answer sites are given in Section 2.2.2. Among these authority estimation algorithms, there is not a clear winner. Their effects seem to depend on the environment and network structure. However, in general, approaches that are more topic-dependent work better than ones that are topic-independent.