Effectiveness Study - Spatial Keyword Querying: Ranking Evaluation and Efficient Query Processi

In order to evaluate the performance of our proposal, we execute the following steps. We define a set of categories C = {C_1, C_2, ..., C_g}, where each category C_i is a set of queries C_i = {Que_1, Que_2, ..., Que_{M_i}}. Suppose that we have a maximum budget and run the entire process (Figure C.2) for each category C_i using M_i queries. The result for category C_i can be represented as an M_i-dimensional vector v = (v_1, v_2, ..., v_{M_i}), where v_j is the score of ranking function f₁ for the j-th query, as described in Section 6.1.

3 http://www.crowdflower.com

Paper C.

Query: cheap hotel

1. Do you think "Hotel Hvide Hus (Vesterbro 2, 9000 Aalborg, Denmark)" is better than "First Hotel Europa (Vesterbro 12C, 9000 Aalborg, Denmark)" for the given query?

[Link: Click here to see the query location and the two points of interest on Google Maps] [Answer option: Yes]

Enter your confidence in your answer. [Select one]

Fig. C.4: An Example in CrowdFlower

Table C.4: Parameter Settings

Parameter                                   Value
Total budget                                $50
Total # of binary questions                 1183
Max # of workers per binary question        31
Max # of binary questions per query         21
Average # of workers per binary question    15
Average # of binary questions per query     13
Average cost per binary question            $0.003

However, a very high budget is economically infeasible. Assuming a particular budget b, we run the entire process (Figure C.2) for each category C_i using M_i queries. This gives a result that can be represented as another M_i-dimensional vector v_b for ranking function f₁. We use cosine similarity (CS) to compare v_b and v, since CS is widely used and suits our needs. The more similar the two vectors are, the better the result with budget b is, because budget b then achieves a performance comparable to what is achieved with the maximum budget.
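The CS comparison can be illustrated with a minimal sketch. It uses the standard definition of cosine similarity; the function name and the score values are hypothetical and serve only as an example, not as data from the experiments.

```python
import math


def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical per-query scores of one ranking function for one category:
# v_max is obtained with the maximum budget, v_b with a smaller budget b.
v_max = [0.9, 0.8, 0.7, 0.6]
v_b = [0.8, 0.8, 0.6, 0.7]

cs = cosine_similarity(v_b, v_max)
print(f"CS = {cs:.4f}")
```

A CS close to 1 means that the scores obtained with budget b point in nearly the same direction as those obtained with the maximum budget.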

Effect of training queries. We study how the training queries affect the performance of the proposed approach. We vary the number of training queries from 5 to 20. The results in Figure C.5a show that the performance improves as the number of training queries increases. Further, the experimental results support the aforementioned assumption on the settings.

Effect of budget. Figure C.5b plots the CS against the budget, which is varied in the range [10, 50]. In this experiment, we use voting based on dynamic confidence (VD in Section 4.1). We set the number of workers per binary question to 15 and the cost per binary question to $0.003. The figure shows that the proposed approach is affected by the budget: a larger budget yields better effectiveness. Furthermore, once the budget exceeds 40, the CS of the proposed approach is quite good and remains stable. This indicates that our approach is inexpensive yet effective.

6. Empirical Studies

Fig. C.5: Effects of Training Query Number and Budget. (a) CS versus the number of training queries; (b) CS versus the budget.

Fig. C.6: Effects of Voting Methods and Matrix Factorization. (a) CS versus the number of binary questions for the voting methods; (b) CS versus the number of binary questions, with MF and without MF.

Effect of voting methods. Next, we compare the three voting methods (cf. Section 4.1). In this experiment, we set the number of workers per binary question to 15, the cost per binary question to $0.003, and dc_ij to 0.5. Figure C.6a plots the CS against the number of binary questions, which is varied in the range [1, 21]. We can see that voting method VD outperforms the other two methods. Specifically, VD increases the average CS by 3.63% and 2.55% compared to MV and VC, respectively. In contrast, MV delivers the worst performance. The reason is that MV assumes that all workers have the same quality and treats the answers from all workers equally. This is not realistic, because workers differ in quality due to their different backgrounds and experiences. VD, on the other hand, models workers' quality dynamically for different questions and achieves the best evaluation performance. VC's performance lies in between, since it captures workers' different confidences but does not allow workers to vary their confidence from question to question.
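To illustrate why weighting answers can change the outcome of a vote, the following minimal sketch contrasts an unweighted majority vote with a generic confidence-weighted vote. The exact definitions of MV, VC and VD are given in Section 4.1 and are not reproduced here; the function names, the weighting rule and the sample answers are assumptions made for illustration only.

```python
def majority_vote(answers):
    """Unweighted vote over boolean answers (True = 'yes').

    Every worker counts equally; a tie resolves to False.
    """
    yes = sum(1 for answer in answers if answer)
    return yes > len(answers) - yes


def confidence_weighted_vote(weighted_answers):
    """Vote over (answer, confidence) pairs with confidence in [0, 1].

    Illustrative assumption: each answer contributes its confidence as
    weight. A fixed confidence per worker gives a constant-confidence
    style of voting; a confidence that differs per question gives a
    dynamic-confidence style of voting.
    """
    yes_weight = sum(conf for answer, conf in weighted_answers if answer)
    no_weight = sum(conf for answer, conf in weighted_answers if not answer)
    return yes_weight > no_weight


# Hypothetical answers of three workers to one binary question.
answers = [(True, 0.9), (False, 0.3), (False, 0.4)]

unweighted = majority_vote([answer for answer, _ in answers])
weighted = confidence_weighted_vote(answers)
print(unweighted, weighted)
```

In this hypothetical case the two confident-less "no" answers win the unweighted vote, while the single high-confidence "yes" wins the weighted vote.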

Effect of matrix factorization. We also evaluate the effect of matrix factorization (MF). To do so, we implemented two versions of our approach, with and without MF. The results are shown in Figure C.6b, where the x-axis represents the number of binary questions and the y-axis represents the CS. We can see that the method with MF is clearly better than the one without MF. Through MF, we are able to determine answers for the binary questions that we do not crowdsource due to the budget constraint. This way, we recover more information that is used to evaluate the ranking functions.
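The idea of inferring unobserved answers through MF can be sketched as follows. This is a generic low-rank factorization trained with stochastic gradient descent, not necessarily the formulation used in our approach; the matrix layout, the hyperparameters and all names are assumptions made for illustration.

```python
import random


def factorize(observed, n_rows, n_cols, rank=2, lr=0.05, reg=0.01,
              epochs=500, seed=0):
    """Fit factors P (n_rows x rank) and Q (n_cols x rank) so that
    the dot product of P[i] and Q[j] approximates observed[(i, j)]."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            err = value - sum(P[i][k] * Q[j][k] for k in range(rank))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on the regularized squared error.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, i, j):
    """Estimated value of cell (i, j), observed or not."""
    return sum(p * q for p, q in zip(P[i], Q[j]))


def squared_error(observed, P, Q):
    """Sum of squared errors over the observed cells only."""
    return sum((v - predict(P, Q, i, j)) ** 2
               for (i, j), v in observed.items())


# Hypothetical 3 x 3 matrix of crowd answers (1.0 = 'yes', 0.0 = 'no');
# cells missing from the dict were not crowdsourced.
observed = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 1.0,
            (1, 2): 1.0, (2, 1): 0.0, (2, 2): 1.0}

P0, Q0 = factorize(observed, 3, 3, epochs=0)  # untrained starting point
P, Q = factorize(observed, 3, 3)              # trained factors
print(squared_error(observed, P0, Q0), squared_error(observed, P, Q))
print(predict(P, Q, 0, 2))  # estimate for a cell that was not crowdsourced
```

The estimate for an unobserved cell is derived from the factors learned on the observed cells, which is what allows answers to be recovered without spending additional budget.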

Fig. C.7: Effects of Entropy Threshold and Worker Number. (a) CS versus the entropy threshold; (b) CS versus the number of workers for each question.

Effect of entropy threshold. In this experiment, we study the effect of different entropy thresholds used for σ in Algorithm C.1. We set the number of workers per binary question to 15, the number of binary questions per query to 13, and the cost per binary question to $0.003. The results of varying σ in the range [0.1, 0.9] are shown in Figure C.7a. Clearly, the entropy threshold has a significant impact on the CS. As the entropy threshold increases, the CS of the proposed approach also increases, but beyond some point (0.7 in the experiment) the CS remains stable. When the entropy threshold increases, object pairs are more likely to be included in crowdsourcing, which in turn tends to exploit more human intelligence in the evaluation. In the remaining experiments, we use an entropy threshold of 0.7.

Effect of the number of workers. We also study whether the number of workers for each question affects the performance of the proposed approach. We vary the number from 5 to 30. Figure C.7b shows that the proposed approach is affected by the number of workers per question. As more workers are employed for a question, more human intelligence is exploited, which tends to help achieve a better evaluation. On the other hand, after some point, using more workers does not further improve the evaluation. In this experiment, for instance, when the number of workers per question exceeds 20, the CS of the proposed approach changes only slightly.
