In Barlacchi et al. [2014b], we showed that IR techniques can improve clue retrieval, but our approach was limited to providing a better ranking of clues, whereas CP solvers require answers to fill the puzzle squares. In other words, given the list of similar clues retrieved by an IR system, a clue aggregation step and a further reranking process are needed to provide the list of answer candidates to the solver. More specifically, the clues in the rank generate a set of possible answers. A straightforward way to sort the answers is to consider the rank of their associated clues as a vote. However, this is subject to the problem that answers are relevant to the query with different probabilities. Even the LTR algorithm score is not calibrated on the entire list of clues and answers.

In this section, we study techniques for the aggregation of answers and their reranking, with the goal of solving the above problem. First of all, we calibrate the scores from the LTR model using logistic regression (LGR). This way, the voting approach uses calibrated probabilities and improves on previous results. Secondly, we combine the evidence provided by the clues associated with the same answer: we define a representation of each answer based on aggregate features extracted from corresponding clues, e.g., their average, maximum and minimum reranking score. We experiment with this new answer representation with LGR and SVMrank [Joachims, 2002] models. Thirdly, we present an updated dataset for clue retrieval: this time it contains 2,131,034 clues and associated answers. We carried out two sets of experiments on two main tasks: (i) clue reranking, which focuses on improving the rank of clues similar to a target clue; and (ii) answer reranking, which targets the list of aggregated clue answers.

Our findings demonstrate that (i) the search engine greatly improves on DB methods for clue reranking, i.e., BM25 improves on the SQL query by 6 absolute percentage points; (ii) kernel-based rerankers improve on SQL by more than 15 absolute percentage points; and (iii) our answer aggregation procedure improves the Recall (Precision) at rank 1 by an additional 2 absolute points over the best results.

As a starting point, we adopt the reranking framework applied to CPs described in Barlacchi et al. [2014b] and in the previous sections. In this work, we extend the feature sets for capturing the degrees of similarity between clues. In addition to the DKPro Similarity (DKP) features described in Section 4.4.6, we include the following features.

iKernels features (iK). A set of similarity features that take into account syntactic information captured by n-grams and by kernels:

Syntactic similarities. Several cosine similarity measures are computed on n-grams (with n = 1, 2, 3, 4) of word lemmas and POS tags.

Kernel similarities. Computed using (i) string kernels applied to clues and (ii) tree kernels applied to structural representations.

WebCrow features (WC). We included the similarity measures computed on the clue pairs by WebCrow and the search engine as features.

Lucene Score. BM25 score of the target candidate.

Clue distance. It quantifies how dissimilar the input clue and the retrieved clue are, using a formula mainly based on the well-known Levenshtein distance.
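To make the surface-level features above concrete, the following is a minimal sketch of n-gram cosine similarities and an edit-distance feature for a clue pair. It assumes that lemmatisation or POS tagging has already been done upstream, so the inputs are plain token lists. The function names are hypothetical, and the plain Levenshtein distance is only a stand-in for the WebCrow clue distance, whose exact formula is not given here.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine_ngram_similarity(tokens_a, tokens_b, n):
    """Cosine similarity between the n-gram count vectors of two token lists."""
    counts_a = Counter(ngrams(tokens_a, n))
    counts_b = Counter(ngrams(tokens_b, n))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[g] * counts_b[g] for g in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # One of the clues is too short to contain any n-gram of this order.
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_features(tokens_a, tokens_b):
    """Hypothetical feature dictionary for one (target clue, retrieved clue) pair."""
    feats = {
        f"cos_{n}gram": cosine_ngram_similarity(tokens_a, tokens_b, n)
        for n in (1, 2, 3, 4)
    }
    # Assumption: raw edit distance on the joined tokens, not the WebCrow formula.
    feats["edit_distance"] = levenshtein(" ".join(tokens_a), " ".join(tokens_b))
    return feats
```

Calling `similarity_features` once on lemma sequences and once on POS-tag sequences would give two families of cosine features per clue pair.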

4.6.1 Aggregation Models for Answer Reranking

Subsets of clues retrieved by the search engine may share the same answers. Since the reranker associates a score with each clue, a strategy to combine such scores is needed. In the following sections, we aggregate the evidence given by clues associated with the same answer, and we extract relevant features. We designed two different strategies: (i) the first applies LGR to the reranker scores, obtaining probabilities that can be aggregated for each unique answer; (ii) the second uses an answer candidate representation that contains features derived from all the clues associated with it, i.e., features aggregated using standard operators such as average, min, and max.

Logistic Regression Model. The search engine and the reranker do not output probabilities for clues and answers. In contrast, the LGR outputs can be interpreted as probabilities. The latter, learned using additional features, are more effective for aggregation. We apply the following formula to obtain a single score for each unique answer candidate:

$$\mathrm{Score}(G) = \frac{1}{n} \sum_{c \in G} \frac{P_{LR}(y = 1 \mid \vec{x}_c)}{\mathrm{rank}_c}$$

where $c$ is the answer candidate, $G$ is the set of clue answers equal to $c$, and $n$ is the size of the answer candidate list. $\vec{x}_c$ is the feature vector associated with $c \in G$, and $y \in \{0, 1\}$ is the binary class label ($y = 1$ when $c$ is the correct answer). $\mathrm{rank}_c$ is the rank assigned by the reranker to the word $c$. We divide each probability by the rank of the answer candidate to reduce the contribution of bottom candidates. The conditional probability computed by the linear model is the following:

$$P_{LR}(y = 1 \mid \vec{x}_c) = \frac{1}{1 + e^{-y \vec{w}^{T} \vec{x}_c}}$$

where $\vec{w} \in \mathbb{R}^n$ is a weight vector [Yu et al., 2011].
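The rank-discounted aggregation of calibrated probabilities described above can be sketched as follows. This is only an illustration under stated assumptions: the candidate list is given in reranker order (rank 1 first), the weight vector has already been learned, and the names `lr_probability` and `aggregate_scores` are hypothetical.

```python
from collections import defaultdict
from math import exp


def lr_probability(weights, features):
    """Logistic probability of the positive class, 1 / (1 + exp(-w^T x))."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-z))


def aggregate_scores(candidates, weights):
    """Compute one score per unique answer.

    candidates: list of (answer, feature_vector) pairs in reranker order,
                so the first element has rank 1 (an assumption of this sketch).
    weights:    learned logistic regression weight vector.
    """
    n = len(candidates)  # size of the answer candidate list
    totals = defaultdict(float)
    for rank, (answer, features) in enumerate(candidates, start=1):
        # Each occurrence votes with its probability, discounted by its rank.
        totals[answer] += lr_probability(weights, features) / rank
    return {answer: total / n for answer, total in totals.items()}
```

Sorting the returned dictionary by value in decreasing order, for example with `sorted(scores, key=scores.get, reverse=True)`, gives the reranked list of unique answers.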

Learning to rank aggregated answers. We apply SVMrank to the sets of clues associated with the same answer candidate. Aggregated features for each group are computed by averaging the features of the individual clues, i.e., those fed to the first reranker. We call these features FV. Additionally, we compute the sum and the average of the scores, the maximum score, the minimum score and the term frequency of the word in the CPDB Dataset. We call them AVG. Then, we model the occurrences of the answer instance in the list by means of positional features: we use n features, where n is the size of our candidate list (i.e., 10). Each feature corresponds to the position of the answer instance in the list. We call them POS.
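As an illustration of the three feature groups, the sketch below builds the vector for one answer candidate from the ranked list of retrieved clues. It rests on several assumptions that the text does not state: each clue is a dictionary with hypothetical keys `answer`, `features` and `score`, the term frequency is supplied by the caller, and the positional features are encoded as binary indicators (one per list position).

```python
def answer_representation(clues, answer, term_frequency, list_size=10):
    """Build the FV + AVG + POS vector for one answer candidate.

    clues:          ranked list of dicts with keys 'answer', 'features', 'score'
                    (hypothetical layout chosen for this sketch).
    answer:         the answer candidate to represent.
    term_frequency: frequency of the answer word in the clue database.
    list_size:      size of the candidate list (10 in the text).
    """
    group = [(pos, clue) for pos, clue in enumerate(clues[:list_size])
             if clue["answer"] == answer]
    if not group:
        raise ValueError(f"answer {answer!r} does not occur in the candidate list")

    vectors = [clue["features"] for _, clue in group]
    scores = [clue["score"] for _, clue in group]

    # FV: element-wise average of the clue-level feature vectors.
    fv = [sum(column) / len(column) for column in zip(*vectors)]

    # AVG: sum, average, maximum and minimum score, plus the term frequency.
    avg = [sum(scores), sum(scores) / len(scores), max(scores), min(scores),
           float(term_frequency)]

    # POS: one indicator per list position (assumed binary encoding).
    pos = [0.0] * list_size
    for position, _ in group:
        pos[position] = 1.0

    return fv + avg + pos
```

One such vector per unique answer, grouped by target clue, would form the input to a pairwise ranker such as SVMrank.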

| Model | MRR | SUC@1 | SUC@5 |
|---|---|---|---|
| WebCrow (WC) | 64.65 | 57.14 | 74.98 |
| BM25 | 75.17 | 63.78 | 90.40 |
| RR (iK) | 78.01 | 67.34 | 92.32 |
| RR (iK+DKP) | 80.89 | 71.62 | 93.14 |
| RR (iK+DKP+WC) | 81.70 | 72.50 | 94.02 |

Table 4.6: Similar Clue Reranking.
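For readers who want to reproduce the columns of the table, the following sketch computes the two kinds of measure on a toy input. It assumes that MRR is the standard mean reciprocal rank and that SUC@k denotes success at k, i.e., the fraction of target clues with at least one correct item among the top k. The data layout (one list of booleans per target clue, in ranked order) and the function names are hypothetical.

```python
def mean_reciprocal_rank(rankings):
    """Mean of 1 / (rank of the first correct item); 0 when none is correct."""
    total = 0.0
    for relevance in rankings:
        for rank, is_correct in enumerate(relevance, start=1):
            if is_correct:
                total += 1.0 / rank
                break
    return total / len(rankings)


def success_at_k(rankings, k):
    """Fraction of rankings with at least one correct item in the top k."""
    hits = sum(1 for relevance in rankings if any(relevance[:k]))
    return hits / len(rankings)
```

Both functions return values in [0, 1]; multiplying by 100 gives figures on the same scale as the table.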