
To evaluate the performance of our proposal, we proceed as follows. We define a set of categories C = {C1, C2, ..., Cg}, where each Ci is a set of queries Ci = {Que1, Que2, ..., QueMi}. Suppose that we have a maximum budget and run the entire process (Figure C.2) for each category Ci using Mi queries. The result for category Ci can be represented as an Mi-dimensional vector v = (v1, v2, ..., vMi), where each component is a score of ranking function f1 as described in Section 6.1.

3 http://www.crowdflower.com

Paper C.

Query: cheap hotel

1. Do you think "Hotel Hvide Hus (Vesterbro 2, 9000 Aalborg, Denmark)" is better than "First Hotel Europa (Vesterbro 12C, 9000 Aalborg, Denmark)" for the given query?

Click here to see the query location and the two points of interest on Google Maps

Yes

No

Enter your confidence in your answer.

Select one

Fig. C.4: An Example in CrowdFlower

Table C.4: Parameter Settings

Parameter                                      Value
Total budget                                   $50
Total # of binary questions                    1183
Max # of workers per binary question           31
Max # of binary questions per query            21
Average # of workers per binary question       15
Average # of binary questions per query        13
Average cost per binary question               $0.003

However, a very high budget is economically infeasible. Assuming a particular budget b, we run the entire process (Figure C.2) for each category Ci using Mi queries. This gives a result that can be represented as another Mi-dimensional vector vb for ranking function f1. We use cosine similarity (CS) to compare vb with the vector v obtained with the maximum budget, since CS is widely used and suits our needs. The more similar the two vectors are, the better the result with budget b, because budget b then achieves performance comparable to that achieved with the maximum budget.
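The comparison above can be illustrated with a minimal sketch of cosine similarity between two score vectors. The function name and the example scores are hypothetical and are not taken from the experiments; the sketch only assumes that both vectors have the same dimension Mi and that neither is the zero vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical scores of f1 on four queries of one category.
v_max = [0.9, 0.7, 0.8, 0.6]   # assumed result with the maximum budget
v_b = [0.8, 0.7, 0.6, 0.6]     # assumed result with a smaller budget b

cs = cosine_similarity(v_b, v_max)
print(f"CS = {cs:.3f}")
```

A CS close to 1 means that the scores obtained with budget b point in nearly the same direction as those obtained with the maximum budget.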

Effect of training queries. We study how the training queries affect the performance of the proposed approach. We vary the number of training queries from 5 to 20. The results in Figure C.5a show that the performance improves as the number of training queries increases. Further, the experimental results support the aforementioned assumption on the settings.

Effect of budget. Figure C.5b plots the CS against the budget, which is varied in the range [10, 50]. In this experiment, we use voting based on dynamic confidence (VD in Section 4.1). We set the number of workers per binary question to 15 and the cost per binary question to $0.003. The figure shows that the proposed approach is affected by the budget: a larger budget yields better effectiveness. Furthermore, when the budget exceeds 40, the CS of the proposed approach is quite good and remains steady. This indicates that our approach is inexpensive yet effective.

6. Empirical Studies

Fig. C.5: Effects of Training Query Number and Budget (x-axis: the number of training queries; y-axis: CS)

Fig. C.6: Effects of Voting Methods and Matrix Factorization. (a) Voting methods; (b) with and without MF. Both panels: x-axis, the number of binary questions; y-axis, CS.

Effect of voting methods. Next, we compare the three voting methods (cf. Section 4.1). In this experiment, we set the number of workers per binary question to 15, the cost per binary question to $0.003, and dcij to 0.5. Figure C.6a plots the CS against the number of binary questions, which is varied in the range [1, 21]. We can see that voting method VD outperforms the other two methods. Specifically, VD increases the average CS by 3.63% and 2.55% compared to MV and VC, respectively. On the other hand, MV delivers the worst performance. The reason is that MV assumes that all workers have the same quality, so the answers from all workers are treated equally. This is not realistic, because workers are of different quality due to their different backgrounds and experiences. In contrast, VD models workers' quality dynamically for different questions and has the best evaluation performance. It is also noteworthy that VC's performance is in between, since it is able to capture workers' different confidences, but does not allow workers to vary their confidence from question to question.
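To illustrate why treating all answers equally can differ from taking confidence into account, the following is a minimal sketch that contrasts plain majority voting with a simple confidence-weighted vote. It is not the definition of VC or VD from Section 4.1; the function names, the weighting rule, the tie handling, and the example answers are all assumptions made for illustration.

```python
def majority_vote(answers):
    """Plain majority voting: every worker counts equally.

    `answers` is a list of (answer, confidence) pairs; the confidence is
    ignored here. Ties are resolved as "No" (an arbitrary choice).
    """
    yes = sum(1 for answer, _ in answers if answer)
    return yes * 2 > len(answers)


def confidence_weighted_vote(answers):
    """Illustrative weighted voting: each answer counts with its confidence.

    This is a hypothetical stand-in, not the VC or VD method of the paper.
    Ties are resolved as "No" (an arbitrary choice).
    """
    yes_weight = sum(conf for answer, conf in answers if answer)
    no_weight = sum(conf for answer, conf in answers if not answer)
    return yes_weight > no_weight


# Hypothetical answers to one binary question: (answer, confidence in [0, 1]).
answers = [(True, 0.9), (False, 0.3), (False, 0.4)]

print("majority vote:", majority_vote(answers))
print("confidence-weighted vote:", confidence_weighted_vote(answers))
```

In this made-up example, two unsure workers outvote one confident worker under majority voting, whereas the weighted vote follows the confident worker.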

Effect of matrix factorization. We also evaluate the effect of matrix factorization (MF). To do so, we implemented two versions of our approach, with and without MF. The results are shown in Figure C.6b, where the x-axis represents the number of binary questions and the y-axis represents the CS. We can see that the method with MF is clearly better than the one without MF. Through MF, we are able to determine answers for the binary questions that we do not crowdsource due to the budget constraint. This way, we recover more information that is used to evaluate the ranking functions.
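The idea of estimating answers that were never crowdsourced can be sketched with a generic low-rank matrix factorization trained by stochastic gradient descent on the observed entries only. This is a minimal sketch under assumptions: the meaning of rows and columns, the rank, the learning rate, the regularization weight, and the data are all hypothetical, and the MF formulation actually used by the approach is not reproduced here.

```python
import random


def factorize(observed, n_rows, n_cols, rank=2, epochs=2000,
              lr=0.05, reg=0.01, seed=0):
    """Fit P (n_rows x rank) and Q (n_cols x rank) so that the dot product
    of P[i] and Q[j] approximates each observed entry observed[(i, j)]."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            err = value - sum(P[i][k] * Q[j][k] for k in range(rank))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on squared error with L2 regularization.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, i, j):
    """Estimated value of entry (i, j), observed or not."""
    return sum(p * q for p, q in zip(P[i], Q[j]))


# Hypothetical 3 x 4 matrix of aggregated answers in [0, 1];
# entries (1, 2) and (2, 0) were not crowdsourced and are missing.
observed = {
    (0, 0): 0.90, (0, 1): 0.60, (0, 2): 0.20, (0, 3): 0.80,
    (1, 0): 0.45, (1, 1): 0.30, (1, 3): 0.40,
    (2, 1): 0.48, (2, 2): 0.16, (2, 3): 0.64,
}

P, Q = factorize(observed, n_rows=3, n_cols=4)
for cell in [(1, 2), (2, 0)]:
    print(cell, round(predict(P, Q, *cell), 2))
```

Only the observed entries drive the updates, so the estimates for missing entries come entirely from the low-rank structure learned from the rest of the matrix.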

Fig. C.7: Effects of Entropy Threshold and Worker Number. (a) x-axis, entropy threshold (0.1 to 0.9); (b) x-axis, the number of workers for each question. Both panels: y-axis, CS.

Effect of entropy threshold. In this experiment, we study the effect of different entropy thresholds used for σ in Algorithm C.1. We set the number of workers per binary question to 15, the number of binary questions per query to 13, and the cost per binary question to $0.003. The results of varying σ in the range [0.1, 0.9] are shown in Figure C.7a. Clearly, the entropy threshold has a significant impact on the CS. When the entropy threshold increases, the CS of the proposed approach also increases, but beyond some point (0.7 in this experiment), the CS remains stable. When the entropy threshold increases, object pairs are more likely to be included in crowdsourcing, which in turn tends to exploit more human intelligence in the evaluation. In the remaining experiments, we use an entropy threshold of 0.7.

Effect of the number of workers. We also study whether the number of workers for each question affects the performance of the proposed approach. We vary the number from 5 to 30. Figure C.7b shows that the proposed approach is affected by the number of workers per question. As more workers are employed for a question, more human intelligence is exploited, which tends to help achieve a better evaluation. On the other hand, after some point, using more workers does not further improve the evaluation. In this experiment, for instance, when the number of workers per question exceeds 20, the CS of the proposed approach changes only slightly.

Fig. C.8: Effect of Category (x-axis: categories C1 to C5; y-axis: CS; series: f1 and f2)

Performance for different categories. Next, we investigate whether the query keywords used affect performance. We classify the query keywords into five categories: "Accommodation" (C1), "Education" (C2), "Tourist" (C3), "Food" (C4), and "Shop" (C5). The experimental results are shown in Figure C.8. Clearly, different categories result in different evaluation performance. Category C3 is of special interest, as its evaluation result is opposite to those of the other four categories. A possible reason is that tourists typically visit places according to their interests, so that textual similarity is more important than spatial location. In this situation, f2 works better because it gives more weight to the textual part in the ranking. For the other four categories, the query location is more important than the textual similarity, and f1 works better as it considers the spatial aspect to be more important in the ranking.


Fig. C.9: The Time Cost of Voting Methods and With/Without MF. (a) Voting methods; (b) with and without MF. Both panels: x-axis, the number of binary questions; y-axis, seconds.