
Comparison with the Baseline Algorithm

In this section, we present the results of a comparison of PointRank and the baseline algorithm. We generated 50 total and 50 partial rankings (with half of the pairs incomparable). Based on the results of the experiments in Section 4.2, we set ina, minni, maxni, pt, and pvalue to 5, 4, 15, 0.5, and 0.2, respectively. The main parameter of the baseline algorithms is n, the number of assignments generated for each pairwise relevance question. We show the results for n=40, 70, and 100.
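As a rough illustration of the baseline's main parameter, the sketch below collects n simulated answers for one pairwise relevance question and takes the majority. The worker model (each worker picks the truly preferred PoI with probability `reliability`) and the function names are assumptions made for this example; how the baseline turns pairwise winners into a final ranking is not described in this section and is not shown.

```python
import random
from collections import Counter


def simulate_answers(preferred, other, n, reliability, rng):
    """Simulate n worker answers to one pairwise question (hypothetical worker model).

    Each simulated worker returns the truly preferred PoI with probability
    `reliability` and the other PoI otherwise.
    """
    return [preferred if rng.random() < reliability else other for _ in range(n)]


def majority_vote(answers):
    """Return the most frequent answer, or None for an empty list or a tie."""
    if not answers:
        return None
    counts = Counter(answers).most_common(2)
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the simulation is repeatable
    for n in (40, 70, 100):  # the values of n used in the comparison
        answers = simulate_answers("poi_1", "poi_2", n, reliability=0.7, rng=rng)
        print(n, majority_vote(answers))
```

Because every question always receives n answers, the total number of assignments of such a baseline grows with the number of questions regardless of how quickly the workers agree.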

Two main factors may affect the performance of the algorithms: the number of PoIs and the worker reliability. We vary these two parameters and report the Kendall tau distance, the average number of assignments, and the average number of inconsistencies for both algorithms.
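For readers unfamiliar with the first metric, the following sketch computes the Kendall tau distance between two total rankings of the same items as the number of item pairs that the rankings order differently. Normalizing by the number of pairs is an assumption suggested by the value ranges in the figures; partial rankings, which the experiments also use, need a generalized measure and are not handled here.

```python
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b, normalize=True):
    """Kendall tau distance between two total rankings of the same items.

    Counts the pairs of items whose relative order differs between the two
    rankings. With normalize=True the count is divided by the number of pairs,
    giving a value between 0 (identical) and 1 (reversed).
    """
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if (len(ranking_a) != len(ranking_b)
            or len(pos_a) != len(ranking_a)
            or pos_a.keys() != pos_b.keys()):
        raise ValueError("rankings must contain the same items, without duplicates")

    discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is discordant when the two rankings disagree on its order.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
            discordant += 1

    if not normalize:
        return discordant
    n = len(ranking_a)
    total_pairs = n * (n - 1) // 2
    return discordant / total_pairs if total_pairs else 0.0
```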

Number of Places

Figures A.10 and A.11 show the performance of the algorithms when the number of PoIs is varied. PointRank achieves a lower Kendall tau distance than the baseline algorithm with 40 assignments while using nearly the same number of assignments. Figure A.10 also shows that the performance of our method is the same regardless of the number of PoIs. Figure A.12 shows that the baseline algorithm with 40 assignments causes more inconsistencies than PointRank.

Worker Reliability

In this section, the performance of PointRank and the baseline algorithm is reported for different worker reliability settings. The ability to produce the correct ranking even with less reliable workers is crucial for a crowdsourcing method, since one cannot always be sure about the reliability of crowd workers.

Fig. A.10: Kendall Tau Distance vs Number of PoIs (plot omitted; x-axis: Number of PoIs, 5 to 20; y-axis: Kendall Tau Distance, 0 to 0.0035; series: PointRank, Baseline (n=40), Baseline (n=70), Baseline (n=100))

Fig. A.11: Number of Assignments vs Number of PoIs (plot omitted; x-axis: Number of PoIs, 5 to 20; y-axis: Number of Assignments, 0 to 20000; series: PointRank, Baseline (n=40), Baseline (n=70), Baseline (n=100))

Figures A.13 and A.14 show that PointRank produces better rankings than the baseline algorithm with 40 assignments while using the same number of assignments. It can also be seen that the Kendall tau distance of PointRank decreases when the worker reliability increases. Figure A.14 also shows that the number of assignments in our algorithm decreases when the worker reliability increases. In other words, our algorithm can tune the number of assignments according to the worker reliability. This is an expected outcome since we use a statistical significance test to check for consensus. When the workers are highly reliable, the algorithm stops assigning questions early.
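The early-stopping behaviour described above can be sketched as follows. This is not the PointRank implementation: it assumes a one-degree-of-freedom chi-square goodness-of-fit test of the two vote counts against an even split, and it borrows the defaults 4, 15, and 0.2 from the minni, maxni, and pvalue settings on the assumption that they denote the minimum number of assignments, the maximum number of assignments, and the significance threshold. The function names and the `ask_worker` callback are hypothetical.

```python
import math


def chi_square_p_value(votes_a, votes_b):
    """p-value of a chi-square test (1 degree of freedom) against a 50/50 split."""
    total = votes_a + votes_b
    if total == 0:
        return 1.0
    expected = total / 2
    statistic = ((votes_a - expected) ** 2 + (votes_b - expected) ** 2) / expected
    # For 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(statistic / 2))


def collect_until_consensus(ask_worker, min_assignments=4,
                            max_assignments=15, p_threshold=0.2):
    """Ask workers one at a time and stop once the votes differ significantly.

    ask_worker is a callable returning "a" or "b" (hypothetical interface).
    Returns (leading option, assignments used, whether consensus was reached).
    """
    votes = {"a": 0, "b": 0}
    used = 0
    consensus = False
    while used < max_assignments:
        votes[ask_worker()] += 1
        used += 1
        if (used >= min_assignments
                and chi_square_p_value(votes["a"], votes["b"]) < p_threshold):
            consensus = True
            break
    # Ties are resolved in favour of "a" purely to keep the sketch simple.
    leader = "a" if votes["a"] >= votes["b"] else "b"
    return leader, used, consensus
```

With workers who always agree, the loop stops at the minimum number of assignments; with workers who split almost evenly, it runs to the maximum. This matches the qualitative trend reported for Figure A.14.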


Fig. A.12: Number of Inconsistencies vs Number of PoIs (plot omitted; x-axis: Number of PoIs, 5 to 20; y-axis: Number of Inconsistencies, 0 to 300; series: PointRank, Baseline (n=40), Baseline (n=70), Baseline (n=100))

Fig. A.13: Kendall Tau Distance vs Worker Reliability (plot omitted; x-axis: Worker Reliability, 0.5 to 0.95; y-axis: Kendall Tau Distance, 0 to 0.5; series: PointRank, Baseline (n=40), Baseline (n=70), Baseline (n=100))

Figure A.15 shows that PointRank does not cause any inconsistencies even for the lower worker reliability values.

5 Conclusion

A spatial keyword query takes keywords and a user location as arguments and returns nearby points of interest that are relevant to the query keywords.

Such queries rely fundamentally on ranking functions. We propose the PointRank model that enables evaluation of the quality of such ranking functions.

Fig. A.14: Number of Assignments vs Worker Reliability (plot omitted; x-axis: Worker Reliability, 0.5 to 0.95; y-axis: Number of Assignments, 0 to 5000; series: PointRank, Baseline (n=40), Baseline (n=70), Baseline (n=100))

Fig. A.15: Number of Inconsistencies vs Worker Reliability (plot omitted; x-axis: Worker Reliability, 0.5 to 0.95; y-axis: Number of Inconsistencies, 0 to 7000; series: PointRank, Baseline (n=40), Baseline (n=70), Baseline (n=100))

PointRank synthesizes answers to crowdsourced pairwise relevance questions to rank a set of points of interest. The resulting rankings can then be used to assess the rankings produced by ranking functions. Using an innovative evaluation methodology, we evaluate the quality of the synthesized rankings achieved by PointRank, showing that PointRank is capable of producing better rankings than an approach based on majority voting.

The proposed algorithm represents a step towards the evaluation of ranking functions for spatial keyword queries. As future work, it is of interest to use the proposed model to study hypotheses about spatial keyword queries. For example, by making use of the model, it is possible to study the effect of the types of keywords in a query. A user querying for "furniture" may be willing to travel farther than a user querying for "burger", which means that the weight assigned to the distance should be different for different keywords. It is also possible to study more advanced ranking functions.

For instance, to accommodate ranking functions that take into account user context such as gender and age, it is of interest to ensure that workers who evaluate answers satisfy the context assumed in the answers.

References

[1] (2015, Jun.) Google annual search statistics. [Online]. Available: http://www.statisticbrain.com/google-searches/

[2] G. Sterling. (2015, May) It’s official: Google says more searches now on mobile than on desktop. [Online]. Available: http://searchengineland.com/its-official-google-says-more-searches-now-on-mobile-than-on-desktop-220369

[3] G. Sterling. (2010, Nov.) Microsoft: 53 percent of mobile searches have local intent. [Online]. Available: http://searchengineland.com/microsoft-53-percent-of-mobile-searches-have-local-intent-55556

[4] X. Cao, L. Chen, G. Cong, C. S. Jensen, Q. Qu, A. Skovsgaard, D. Wu, and M. L. Yiu, “Spatial keyword querying,” in Proceedings of the 31st International Conference on Conceptual Modeling (ER 2012). Springer, 2012, pp. 16–29.

[5] J. Howe, Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business, 1st ed. Crown Publishing Group, 2008.

[6] J. Stoyanovich, M. Jacob, and X. Gong, “Analyzing crowd rankings,” in Proceedings of the 18th International Workshop on Web and Databases (WebDB ’15). ACM, 2015, pp. 41–47.

[7] J. Yi, R. Jin, S. Jain, and A. Jain, “Inferring users’ preferences from crowdsourced pairwise comparisons: A matrix completion approach,” in Proceedings of the First AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2013), 2013, pp. 207–215.

[8] X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz, “Pairwise ranking aggregation in a crowdsourced setting,” in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM ’13). ACM, 2013, pp. 193–202.


[9] J. Urbano, J. Morato, M. Marrero, and D. Martín, “Crowdsourcing preference judgments for evaluation of music similarity tasks,” in Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), 2010, pp. 9–16.

[10] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “Crowddb: Answering queries with crowdsourcing,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD ’11). ACM, 2011, pp. 61–72.

[11] A. Parameswaran and N. Polyzotis, “Answering queries using humans, algorithms and databases,” in Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR ’11), 2011, pp. 160–166.

[12] A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller, “Crowdsourced databases: Query processing with people,” in Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR ’11), 2011, pp. 211–214.

[13] T. Yan, V. Kumar, and D. Ganesan, “Crowdsearch: Exploiting crowds for accurate real-time image search on mobile phones,” in Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services (MobiSys ’10). ACM, 2010, pp. 77–90.

[14] A. Bozzon, M. Brambilla, and S. Ceri, “Answering search queries with crowdsearcher,” in Proceedings of the 21st International Conference on World Wide Web (WWW ’12). ACM, 2012, pp. 1009–1018.

[15] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller, “Human-powered sorts and joins,” Proc. VLDB Endow., vol. 5, no. 1, pp. 13–24, 2011.

[16] A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom, “Crowdscreen: Algorithms for filtering data with humans,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD ’12). ACM, 2012, pp. 361–372.

[17] S. Guo, A. Parameswaran, and H. Garcia-Molina, “So who won?: Dynamic max discovery with the crowd,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD ’12). ACM, 2012, pp. 385–396.

[18] S. B. Davidson, S. Khanna, T. Milo, and S. Roy, “Using the crowd for top-k and group-by queries,” in Proceedings of the 16th International Conference on Database Theory (ICDT ’13). ACM, 2013, pp. 225–236.


[19] O. Alonso and R. Baeza-Yates, “Design and implementation of relevance assessments using crowdsourcing,” in Proceedings of the 33rd European Conference on Information Retrieval (ECIR 2011). Springer, 2011, pp. 153–164.

[20] R. Blanco, H. Halpin, D. M. Herzig, P. Mika, J. Pound, H. S. Thompson, and T. Tran Duc, “Repeatable and reliable search system evaluation using crowdsourcing,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11). ACM, 2011, pp. 923–932.

[21] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, “Crowdsourcing for book search evaluation: Impact of hit design on comparative system ranking,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11). ACM, 2011, pp. 205–214.

[22] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952.

[23] R. Schumacker and S. Tomek, “Chi-square test,” in Understanding Statistics Using R, 2013, pp. 169–175.

[24] R. Fagin, R. Kumar, M. Mahdian, D. Sivakumar, and E. Vee, “Comparing partial rankings,” SIAM Journal on Discrete Mathematics, vol. 20, pp. 47–58, 2004.


Paper B

CrowdRankEval: A Ranking Function Evaluation Framework for Spatial Keyword Queries

Ilkcan Keles, Christian S. Jensen, Simonas Šaltenis

The paper has been published in the 17th IEEE International Conference on Mobile Data Management (MDM ’16), pp. 353–356, 2016. DOI: 10.1109/MDM.2016.62

Abstract

We demonstrate CrowdRankEval, a novel framework for the evaluation of ranking functions for top-k spatial keyword queries. The framework enables researchers to study hypotheses regarding ranking functions. CrowdRankEval uses crowdsourcing for synthesizing results to top-k queries and is able to visualize the results and to compare them to the results obtained from ranking functions, thus offering insight into the ranking functions.

© 2016 IEEE. Reprinted, with permission, from Ilkcan Keles, Christian S. Jensen, and Simonas Šaltenis, CrowdRankEval: A Ranking Function Evaluation Framework for Spatial Keyword Queries, 17th IEEE International Conference on Mobile Data Management (MDM 2016), 2016.

The layout has been revised.


1 Introduction

Location-based services are gaining in importance with the increase in the use of mobile, geo-positioned devices and in the amount of geo-tagged web content. One core function of location-based services is top-k spatial keyword querying. Such top-k queries take a user location, keywords, and k as arguments and return a ranked list of k points of interest (PoIs) according to a ranking function [1]. Most ranking functions are a linear combination of the textual relevance of the PoIs to the query keywords and the spatial proximity of the PoIs to the query location, i.e., of the form rank(o, q) = α · sp(q.loc, o.loc) + (1 − α) · tr(q.keywords, o.doc), where α, o, and q are the weighting parameter, the PoI, and the query, respectively, and sp and tr are the spatial proximity function and the textual relevance function, respectively. However, existing studies provide little or no empirical evidence of the quality of the ranking functions. We believe that the lack of means of evaluating ranking functions is a major obstacle to the goal of developing high-quality and advanced ranking functions.
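To make the linear combination concrete, the sketch below evaluates rank(o, q) for a few PoIs and returns the top-k. The paper does not fix sp and tr here, so the definitions used (proximity as one minus the Euclidean distance scaled by an assumed maximum distance `max_dist`, relevance as the fraction of query keywords found in the PoI document) are placeholder assumptions, as are the class and function names.

```python
import math
from dataclasses import dataclass


@dataclass
class PoI:
    name: str
    loc: tuple   # (x, y) coordinates
    doc: str     # textual description of the PoI


@dataclass
class Query:
    loc: tuple
    keywords: list


def sp(q_loc, o_loc, max_dist):
    """Assumed spatial proximity in [0, 1]: 1 at the query location, 0 at max_dist or beyond."""
    return 1.0 - min(math.dist(q_loc, o_loc), max_dist) / max_dist


def tr(keywords, doc):
    """Assumed textual relevance in [0, 1]: fraction of query keywords present in the document."""
    if not keywords:
        return 0.0
    tokens = set(doc.lower().split())
    return sum(1 for kw in keywords if kw.lower() in tokens) / len(keywords)


def rank(o, q, alpha, max_dist):
    """rank(o, q) = alpha * sp(q.loc, o.loc) + (1 - alpha) * tr(q.keywords, o.doc)."""
    return alpha * sp(q.loc, o.loc, max_dist) + (1 - alpha) * tr(q.keywords, o.doc)


def top_k(pois, q, k, alpha, max_dist):
    """Return the k PoIs with the highest rank value, best first."""
    return sorted(pois, key=lambda o: rank(o, q, alpha, max_dist), reverse=True)[:k]
```

Setting alpha close to 1 favours nearby PoIs, while setting it close to 0 favours textually relevant ones; choosing this weight is exactly the kind of decision the evaluation framework is meant to inform.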

The evaluation of ranking functions requires user feedback since there is no mathematical formulation of the best results of top-k spatial keyword queries. In fact, the best result is the one that users prefer. In this setting, the evaluation of a ranking function refers to the comparison of the ranking function with user preferences. The evaluation results show which ranking function performs better according to the user feedback. To be able to obtain feedback on the ranking functions, we use crowdsourcing [2].

We demonstrate a ranking function evaluation framework called CrowdRankEval for top-k spatial keyword queries. The workflow of the framework is presented in Figure B.1. The framework is designed to enable researchers to evaluate the ranking functions used for spatial top-k queries. The user must choose a small set of queries on which the ranking functions are evaluated. The user must also supply a query result for each query and ranking function. Given this input, the framework synthesizes a ranking of the PoIs contained in the supplied query results for each query by asking pairwise relevance questions to the crowdsourcing workers. A pairwise relevance question contains a query and a pair of PoIs, and it asks a worker which of the two PoIs is more relevant to the query. Upon completion of the crowdsourcing, the framework synthesizes and displays a ranking as well as a comparison between the synthesized ranking and the rankings produced by the ranking functions. Thereby, the framework offers insight into how well the ranking functions perform.
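The input described above can be pictured with a small sketch that pools the PoIs from the query results supplied for one query and enumerates the candidate pairwise relevance questions. The data layout (a dictionary from ranking function name to a list of PoI identifiers) and the function name are assumptions for illustration; the framework may well ask only a subset of these pairs.

```python
from itertools import combinations


def build_pairwise_questions(query_id, results_by_function):
    """Enumerate candidate pairwise relevance questions for one query.

    results_by_function maps a ranking function name to the list of PoI ids
    it returned for the query (hypothetical input format).
    """
    pois = []
    seen = set()
    for result in results_by_function.values():
        for poi in result:
            if poi not in seen:  # keep each PoI once, in first-seen order
                seen.add(poi)
                pois.append(poi)
    return [
        {"query": query_id, "poi_a": a, "poi_b": b}
        for a, b in combinations(pois, 2)
    ]


if __name__ == "__main__":
    questions = build_pairwise_questions(
        "Q1", {"alpha=0.3": ["p1", "p2"], "alpha=0.7": ["p2", "p3"]}
    )
    for question in questions:
        print(f"Which is more relevant to {question['query']}: "
              f"{question['poi_a']} or {question['poi_b']}?")
```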

The framework has four modules: the user interface module, the data preparation module, the PointRank module, and the evaluation module. The user interface module is the entry point of a user to the framework. The data preparation module allows a user to upload data. It also preprocesses the data to be able to perform the evaluation. The PointRank module provides an implementation of the PointRank algorithm that synthesizes the rankings for the queries using crowdsourcing [3]. Our framework is built to use CrowdFlower¹ as the crowdsourcing platform. Finally, the evaluation module is responsible for performing the evaluation of the ranking functions and for visualizing the synthesized rankings and the results of the evaluation.

Fig. B.1: Workflow of CrowdRankEval (diagram omitted; it includes a sample pairwise relevance question, "Which is more relevant to Q1?")

In summary, the framework to be demonstrated makes the following contributions:

• The framework is able to evaluate ranking functions for top-k spatial keyword queries, and to visualize the results.

• The framework provides an implementation of the PointRank algorithm to synthesize rankings using crowdsourcing.

• The framework is built to connect with CrowdFlower to publish crowdsourcing tasks.

The rest of the paper is arranged as follows: Section 2 presents the framework and gives detailed information regarding the modules, and Section 3 presents the workflow and the demonstration details. Section 4 concludes the paper.

¹ http://www.crowdflower.com/


2 The CrowdRankEval Framework

The basic building block of the framework is the experiment. An experiment is defined by a set of queries and ranking functions. To create an experiment, the user must also specify the parameters used by the PointRank algorithm. To enable the framework to evaluate the ranking functions, the user must also upload the corresponding query results. After uploading the results for all queries, the user is able to start the evaluation task. As an example use case, the framework can be used to determine the best weighting parameter (α) for the ranking function given in Section 1 for a specific set of query keywords. To do so, the researcher must upload the queries and the results obtained when using the ranking function with different weighting parameters. Based on the evaluation results, the researcher can then decide on the best weighting parameter.

The framework is developed using JavaScript, HTML, CSS, and PHP. We also used vis.js² and Highcharts³ for visualization purposes.