• No results found

6.2 Scoring Model

6.4.1 Social-Context Configuration

We evaluate the effectiveness of our scoring model and the efficiency of the CON

-TEXTMERGEalgorithm using a social-context configuration on three datasets retrieved from three real world social tagging networks (see Section 3.2).

• Delicious.com: The dataset that we harvested by crawling the social bookmark-ing service Delicious.com contains a total of12, 389 users, 175, 754 bookmarks with2, 781, 096 tags, and 152, 306 friendship connections.

• Flickr.com: For our experiments on the photo hosting network Flickr.com, we obtained a dataset with a total of52, 347 users, 10, 000, 000 images annotated with29, 111, 183 tags, and 1, 293, 777 friendship connections. In addition to ex-plicit friends defined by the user (which rarely happens in Flickr.com), we also considered two users as friends if one of them commented a photo uploaded by the other.

• LibraryThing.com: By crawling the social book cataloguing service Library-Thing.com, we retrieved a third dataset with a total of9, 986 users, 6, 453, 605 books with14, 295, 693 tags, and 17, 317 friendship edges. As users in Library-Thing.com typically use tags that consist of multiple terms, we extracted the terms from the tags and considered the set of terms used by a user for a book as if the book was tagged with these terms. The friendship relation is here again defined in a broader way, and consists of explicit friends and users marked as having interesting libraries.

For detailed information on the social tagging networks Delicious.com, Flickr.com and LibraryThing.com, see Section 3.2.

Relevance Assessments

Finding a good set of queries and relevant results for them is not an easy task. Even though there has been some work on evaluating queries in social networks, most no-tably by Bao et al. [19] who use DMOZ categories as global ground truth, such methods don’t fit our user-centric search task. Here, it is not sufficient to consider global rele-vance measures, as the notion of relerele-vance is highly subjective and dependent on the initiator of the query and her personal context. For example, a photo of a person may only be relevant if she is known to the query initiator. However, when we execute our queries in the context of a user taken from our collections, it is not possible to ask this user to assess the subjective relevance of a result item.

To solve this problem, we propose two independent evaluation methods, a user-specific ground truthand a user study with manual relevance assessments for selected queries.

• User-Specific ground truth. For a given query qU = {t0, . . . , tn−1} from a user U , we consider as user-specific ground truth the union of all documents which were tagged with all query tags by U or any of her direct social friends.

Our decision to consider also documents of friends as part of the ground truth is based on the fact that these are also documents that the user has got direct access to and it is likely that the user has seen, and agreed with the tags assigned.

Since documents on the ground truth set contain tags from the query, we eval-uate the queries on a residual collection where query tags by U and her friends are removed as they are known to lead towards relevant results. Note that this introduces a penalty for the social-context configuration of our algorithm as the transitive friends with highest friendship strengths cannot contribute relevant re-sults by definition. Queries for the ground truth were randomly selected from tag pairs with medium frequency in the corpus (between1, 000 and 2, 000), the query initiator was chosen randomly among users that have previously used the query tags and have at least one friend in the collection. This process yielded150 queries for Delicious.com and184 queries for LibraryThing.com; note that this method of identifying ground truth cannot be applied to the Flickr.com dataset because by the nature of corresponding social tagging network, there is almost no overlap in the photos that users tag.

• User Study. Our user study is done on the LibraryThing.com and Flickr.com dataset. For Delicious.com we found the manual relevance assessment much more difficult and a too time consuming task for people not knowing the marked web pages because of the high effort of browsing through many book-marks, navigating to the associated websites and reading and understanding their contents in order to hopefully get an idea of the interests of a query user and to know if a resulting bookmark might meet or not meet those interests. For our user study on the LibraryThing.com dataset, we asked five colleagues to register with LibraryThing.com, tag at least20 books there, and choose some friends among other users. They then suggested28 queries related to the books they tagged. For Flickr.com, we collected40 queries from other colleagues and randomly selected a (fictitious) query initiator among the users in our crawl from Flickr.com who used all query tags at least once on the same photo. We then ran the queries with different configurations of our algorithm and pooled the results. In the assess-ment phase, a volunteer (which was the query initiator for LibraryThing.com)

was shown, in addition to the results for the query from the pool, the documents (i.e. books or photos) from the query initiator that contain at least one of the query tags in order to understand the personal context of the query initiator. In this way, we try to overcome the problem of subjectively assessing result qual-ities with the eyes of the query initiator. The participant then marks each result as highly relevant, relevant, or non-relevant in the context of the query initiator without knowing which configuration generated the result.

Retrieval Effectiveness

Results for the User Studies.The results from the user study are shown in Table 6.

For both the Flickr.com and the LibraryThing.com dataset set the NDCG improved by increasing α, but decreased when setting α to0. This shows that while the semantic as-pect is more important than the social asas-pect for these specific datasets, the social com-ponent helps improving the search result, in particular for Flickr.com. This can also be seen with the precision[10] in Table 7, which for Flickr.com starts at0.50 for α = 1.0, increases to0.54 for α = 0.1 and drops to 0.40 for α = 0.0.

Results for the User-Specific Ground Truth.The ground truth based experiments (Table 8) show very similar results. Again, result quality improves when decreasing α, but drops when ignoring the social aspect and setting α = 0. For Delicious.com the social aspect seems to be more important than for the other datasets, here the optimal value is α = 0.2. Overall search effectiveness clearly benefits from integrating social scores.

Without Tag Expansion

α 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.7 0.2 0.1 0.0

Flickr 0.39 0.42 0.41 0.41 0.41 0.41 0.41 0.41 0.41 0.41 0.36 LibraryThing 0.61 0.65 0.65 0.66 0.67 0.66 0.66 0.68 0.70 0.72 0.71

With Tag Expansion

α 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.7 0.2 0.1 0.0

Flickr 0.42 0.40 0.40 0.40 0.40 0.39 0.39 0.40 0.40 0.40 0.36 LibraryThing 0.61 0.63 0.64 0.65 0.65 0.65 0.65 0.67 0.69 0.72 0.71

Table 6: NDCG[10] for varying α, manual assessments

Without Tag Expansion

α 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.7 0.2 0.1 0.0

Flickr 0.50 0.53 0.53 0.53 0.53 0.53 0.54 0.54 0.54 0.54 0.40 LibraryThing 0.65 0.67 0.68 0.69 0.69 0.69 0.79 0.70 0.72 0.75 0.75

With Tag Expansion

α 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.7 0.2 0.1 0.0

Flickr 0.54 0.53 0.53 0.53 0.53 0.52 0.52 0.53 0.53 0.52 0.41 LibraryThing 0.64 0.65 0.67 0.68 0.68 0.68 0.68 0.69 0.72 0.75 0.74

Table 7: Precision[10] for varying α values, manual assessments

Without Tag Expansion

α 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.7 0.2 0.1 0.0

Delicious 0.15 0.25 0.27 0.33 0.34 0.35 0.35 0.37 0.39 0.39 0.36 LibraryThing 0.29 0.42 0.49 0.54 0.53 0.55 0.56 0.59 0.60 0.63 0.62

With Tag Expansion

α 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.7 0.2 0.1 0.0

Delicious 0.16 0.25 0.28 0.36 0.36 0.36 0.36 0.39 0.40 0.37 0.36 LibraryThing 0.30 0.41 0.46 0.53 0.52 0.53 0.55 0.57 0.59 0.61 0.60

Table 8: Precision[10] for varying α values, ground truth experiments Retrieval Efficiency

Another main concern in this work has been the efficiency and scalability of the query processing. To assess the efficiency of our CONTEXTMERGEalgorithm, we evaluated the two sets of queries used for the ground-truth based evaluation on the Library-Thing.com and Delicious.com datasets, and in addition a set of 164 queries on the Flickr.comdataset that were computed similarly to the others. The algorithm was im-plemented in Java, with the index lists stored in an Oracle 10g database. The experi-ments were done on an Windows-based server machine with four single-core Opteron CPUs and 16GB of main memory. We measured wall-clock runtimes and abstract cost in terms of disk accesses, where the cost for a random access was 100 and the cost for a sequential access 1. We compared our algorithm with a standard join-then-sort algorithm that reads all index lists involved in the query execution into memory, uses an in-memory hash join to combine entries for the same document, and finally does an in-memory sort of the candidate set to compute the top-k results.

For each collection, we performed a large variety of experiments to explore the space of possible values of α, thresholds for maximal tag expansion, and conjunctive vs. disjunctive evaluation. However, we limit the discussion to selected results with conjunctive evaluation since results with disjunctive evaluation generally followed the same trends. Figure 6 depicts the average abstract cost per query for the three col-lections and selected values of α, evaluated with CONTEXTMERGE and the baseline without tag expansion. It is evident that our highly efficient top-k algorithm has got at most half the cost of the baseline algorithm for most values of α, and shows even higher savings for the LibraryThing.com collection. Table 9 shows some additional details for the experiments on LibraryThing.com; it is evident from the table that wall-clock run-time of our algorithm is also at least50% better than that of the baseline (which is also the case for the other two collections).

The Figure 7 shows the abstract cost for the same setup but now with tag expansion up to a limit of10 similar tags. Note that the bars for the LibraryThing.com baseline experiments have been cut at350, 000. Here, the effectiveness of our dynamic tag ex-pansion is clearly evident, as it saves factors of3 to 5 compared to the baseline method which needs to completely scan the lists of all10 related tags for each query tag. Ta-ble 9 again shows details for some experiments on LibraryThing.com; here, our highly efficient method manages to reduce runtime by up to an order of magnitude over the baseline. The column avg.#exp shows the average number of similar tags considered per query. Whereas the baseline methods needs to consider all tags, our self-throttling expansion technique requires only very few similar tags.

Figure 6: Average Execution Cost without tag expansion

Figure 7: Average Execution Cost with tag expansion

Configuration time[s] avg.#SA avg.#SA avg.#SA to avg.#RA #avg.

overall to DOCS USERDOCS overall #exp alpha = 1.0

CONTEXTMERGE 0.70 70,588 0 70,588 65 0

CONTEXTMERGEw/ tag exp. 1.37 126,772 0 126,772 194 7

Baseline 1.43 165,352 0 165,352 0 0

Baseline w/ tag exp. 6.10 67,0405 0 67,0405 0 20

alpha = 0.5

CONTEXTMERGE 0.68 76,012 21,834 54,178 23 0

CONTEXTMERGEw/ tag exp. 0.84 78,808 22,063 56,745 64 2

Baseline 2.0 248,093 82,742 165,351 0 0

Baseline w/ tag exp. 9.92 1,118,554 448,149 670,405 0 20 alpha = 0.0

CONTEXTMERGE 0.14 11,341 11,341 0 1 0

CONTEXTMERGEw/ tag exp. 0.21 12,223 12,223 0 9 1

Baseline 0.59 82742 82,742 0 0 0

Baseline w/ tag exp. 3.43 448,149 448,149 0 0 20

Table 9: Efficiency details for LibraryThing with conjunctive evaluation 6.4.2 Full-Context Configuration

With a full-context configuration, our CONTEXTMERGEalgorithm considers the spir-itual, social and global friendship strengths (see Definition 6.2, 6.5, 6.7) of users with respect to a query initiator for computing scores of documents as defined by our scoring model in Section 6.2.

Retrieval Effectiveness

To study the effectiveness of our fully context-specific scoring model, taking social and spiritual friendship relations into account, we performed again experiments with data extracted from partial crawls of the social book cataloguing service LibraryThing.com.

We concentrated on LibraryThing.com only, because it is the most interesting social tagging network among the ones introduced in Section 3.2, i.e. LibraryThing.com, De-licious.com and Flickr.com. We found from the previous experiments in Section 5.4 and 6.4.1 that the social aspects in Delicious.com to be rather marginal, as most book-marked pages are of fairly high quality anyway; so a user does not benefit from her friends’ recommendations more than from the overall community. Moreover, manual relevance assessment from the viewpoint of an unknown user in Delicious.com is dif-ficult as described in Section 6.4.1. Flickr.com, on the other hand, has recently grown so much that the tagging quality seems to be gradually degrading, also due the fact that only the owner of a photo or her friends usually can provide tags since, by the nature of photography and the photo hosting service Flickr.com in particular, there is almost no overlap in photo collections of users. Hence, mostly only the owner of a photo indeed annotates a photo with tags and these are sometimes relatively unspecific and given to an entire series of photos (e.g. vacation July 2007) instead of photos be-ing individually tagged. LibraryThbe-ing.com, in contrast, features intensive taggbe-ing of a quality-controlled set of items, namely, published books, and its users have built up rich social relations. Finally, books are a matter of subjective taste, so that social relations do indeed have got high potential value. You trust your friends’ taste, not necessarily their “technical” expertise.

For our studies, we extracted a dataset from the LibraryThing.com website includ-ing 11, 717 users who together own or have read 1, 289, 128 distinct books with a total of14, 738, 646 tagging events (including same tags for the same book by differ-ent users), and17, 915 social friendship relations (see Definition 3.1). For the latter, we used the LibraryThing.com notion of friends (where users mutually agree on

be-user 1 user 2 user 3

thailand travel web learning time traveler

asia guide travel mountain climbing leonardo vinci technology enhanced

multimedia social software mystery magic stephanie plum religion irony humor search engines

religion god world information retrieval sf nebula winner challenge theory probability statistics fantasy politics imagination fantasy

sci-ence

database system fantasy dragaera

drama story novel transaction management sf nuclear war

magic fantasy data mining fantasy malazan

india philosophy software development fantasy story

novel family life science fiction future

Table 10: Queries of the user study

ing friends) and the notion of referring to an “interesting library” (see Section 3.2.3).

The users included6 users from our institute who have been contributing to Library-Thing.com for an extended time period and have made various social connections.

These6 users ran queries on the dataset and assessed the quality of the results in a user study.

Note that such human assessment is indispensable for this kind of experiments, and in our setting it was crucial that a query result was assessed by the same user who posed the query. Altogether, our6 test users ran 49 queries, shown in Table 10.

Query results were computed for a variety of values of α and β, with and without tag expansion. The results from all runs for the same query were pooled; all of them together were shown to the corresponding user in random order (in a browser-based GUI), and the user assessed the quality of each result by assigning one of three possi-ble ratings:

H

Results that the user already knew, that is, books that she has got in her own library, are always discarded.

The Tables 11 and 12 show the precision and NDCG values for different choices of the configuration parameters α and β, without any form of tag expansion. These are micro-averaged results over all test users. Values printed in boldface are results that were significantly better than the baseline case (α= β = 0) according to a statistical t-test with test level0.1 [109].

The results show that both social (increasing α) and spiritual (increasing β) pro-cessing can improve the result quality. This holds for each of these two directions individually, and the combined effect is even better with a typical maximum at α= 0.2 and β = 0.8. It may seem that the improvements, for example, from an NDCG value of0.546 for the baseline to 0.581 for the best case, are not impressive. However, one has to keep in mind that differences in such effectiveness measures generally tend to be small in IR experiments as opposed to efficiency differences (e.g. response times) in the DB literature; we emphasise that the gains are statistically significant [109]. More-over, it is worth pointing out that for some individual users (i.e. micro-averaging over the queries of one user only) or for individual queries the gains are higher. For ex-ample, Tables 13 and 14 show the results for user2 and user5, aggregated over their individual queries issued and evaluated by the users themselves. Here the NDCG value increased from the baseline value of0.67 to 0.73 for α = 0.2, β = 0.8, or from 0.51 to0.72 for α = 0.5, β = 0.5, respectively.

As anecdotic evidence, the query “science illusion magic” posed by user2 strongly benefited from the user’s social relations: with global scoring alone, many good results

H

were missed; with spiritual scoring alone, the results drifted towards a big “Harry Pot-ter” cluster which was not what the user wanted; only the combination of social and spiritual similarity gave the excellent results that the user appreciated (which included novels such as “Prestige”, “Labyrinths”, “Invisible Cities”).

Table 15 shows the NDCG results with tag expansion enabled, aggregated over all49 queries of the user study. Across the entire query mix of all users, the tag expan-sion did not achieve significant improvements over the results without tag expanexpan-sion but again, for individual users such as user2 there were noticeable gains. For example, the query “Yakuza” created the expansion tags “Cosa Nostra”, “Triads”, and “night-club” (among the top-5 expansions); the first and second expansion could have been expected (and created also by an ontology-based method), but the third expansion re-ally reflected tag co-occurrences and implicitly the contents of the kinds of novels that the user wished to discover.

From a cognitive viewpoint, it is desirable that query results in social tagging net-works exhibit some diversity. For interesting and surprising discoveries, you want to benefit from the natural diversity of cultures and tastes in your social network. (Even computer geeks should have some friends who are not in the IT business or in

com-HH

Table 15: NDCG[10] with up to 5 expansions per tag

puter science.) Using only spiritual similarities among users would bear the risk of not exploring items widely enough. To study this hypothesis, we also measured a no-tion of diversity among the relevant results of a given query (rating1 or 2 for pre-cision, rating2 for strong precision). We computed the pairwise cosine dissimilarity (i.e.1− cosine similarity) for the results (inspired by, but not directly related to notions of result incoherence and query ambiguity [41]). To this end, each result document d was viewed as a feature vector or frequency distribution over tags, with the frequency of tag t set to the total number of taggings with t assigned to d aggregated over all users. In most cases, these diversity measures were maximal at α values around0.5;

for non-social recommendations with α= 0 the diversity was much lower. This clearly shows the importance of social relations in diversity-aware recommendations.

6.4.3 Lessons Learnt and Open Issues

We have developed a comprehensive framework for socially enhanced search, ranking, and recommendation. Our experimental evaluation exhibits interesting results and in-dicates the potential of exploiting social-tagging information for scoring and ranking.

However, the results reveal mixed insights, and thus also underline the need for further investigating this line of research.

The combination of social and spiritual scoring nicely improved the results of cer-tain queries or users, but also led to result degradation in other cases. On average, there is a significant gain but it is not as impressive as one could have hoped for. It seems that categorising queries and identifying the query types that can benefit from social and spiritual relations is the key to a robust solution that would choose non-zero values

The combination of social and spiritual scoring nicely improved the results of cer-tain queries or users, but also led to result degradation in other cases. On average, there is a significant gain but it is not as impressive as one could have hoped for. It seems that categorising queries and identifying the query types that can benefit from social and spiritual relations is the key to a robust solution that would choose non-zero values