4.1.2 Query Representation Problem
Optimising query representation has been a dominant problem in the TVM of IR research. EML applications have been used to modify the query representation by adding or removing terms, or by learning better weights for the existing query terms, while taking relevance judgements into account (such as user clicks on retrieved relevant documents). This problem derives from the Ide dec-hi method and the other mathematical methods (Salton and Buckley, 1997) for improving query representation for future user searches. The modified query term/weight vector is used to retrieve more relevant documents than the original user query. These improved query representations were saved in the IR system as modified representations, in order to improve future searches using the same set of queries. However, this approach only improved IR system effectiveness for the trained set of queries. As a result, the EML application has to be re-run for every new query added to the IR system in order to learn a representation for that query. This type of problem is divided into three groups (Cordon et al., 2003), which are as follows:
• Query Weight Learning (Weight Selection). This problem area is primarily about adapting the weights of the query terms to the most appropriate values, so that the query retrieves the relevant documents at the top of the search list and the irrelevant documents at the bottom.
• Query Term Selection. This problem area is primarily about adapting the set of query terms by adding terms (with their weights) taken from the relevant documents and removing terms associated with the irrelevant documents, so that the query retrieves the relevant documents at the top of the search list and the irrelevant documents at the bottom.
• Mixed Term and Weight Selection. This problem area combines query weight learning and term selection to obtain a better saved query representation in the IR system.
The following sections will discuss some of the approaches used to tackle these problems.
Hill-Climbing Using Okapi Approach
Hill-climbing is a local search algorithm used for identifying better solutions in the solution search space, here by improving query vector representations (Talbi, 2009; Robertson, Walker, Hancock-Beaulieu, Jones and Gatford, 1995; Robertson et al., 1996). The hill-climbing technique starts with an initial solution, which is the original query. It then searches the neighbouring solutions by adapting the original query for better IR effectiveness over several evolving iterations using a mutation procedure. In TREC-3 (Robertson, Walker, Hancock-Beaulieu, Jones and Gatford, 1995), the hill-climbing approach was used to optimise query representation via a term selection method. The experimental settings of this approach were as follows:
1. The weights of the terms in the term-set were static: they were the Okapi term-weights in the relevant documents corresponding to the query. This means that each term selected from a relevant document to be added or removed had a fixed weight, namely its weight in that relevant document.
2. The terms in the term-set were ordered as a list in descending order of their Okapi term-weights. The list was updated in every evolving iteration of hill-climbing using the index terms of the test collection.
3. The three top-weighted terms from each relevant document in the collection were used to build the query's (topic's) term-set.
4. Each term was considered only once during the iterations that evolved the new query vector.
5. After the first top three terms, the next term in the term-set was added to the query. The fitness value was then calculated as the average precision of the top 1000 documents retrieved.
6. The above steps were repeated until a stopping criterion was satisfied.
The stopping criterion could be one of the following thresholds:
• The maximum number of terms in the term-set: max-terms = 30 terms.
• The maximum number of successive iterations that produced a worse offspring query representation: max-bads = 8 successive iterations.
• The maximum total number of terms in each query: MaxTerms = 150 terms in the chromosome representation.
• The maximum runtime: max-time = 1 or 2 hours per evolved query (topic).
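The procedure and stopping criteria above can be illustrated with a minimal Python sketch. It is not the original Okapi implementation: the function name hill_climb_terms, the dictionary-based query, and the fitness callback are hypothetical, max-terms is interpreted here as the number of candidate terms tried, and the runtime limit is omitted. Each candidate term is tried once, in descending weight order, and is kept only if the fitness improves.

```python
from typing import Callable, Dict, List, Tuple

Query = Dict[str, float]  # term -> weight (hypothetical representation)


def hill_climb_terms(
    initial_query: Query,
    term_set: List[Tuple[str, float]],
    fitness: Callable[[Query], float],
    max_terms: int = 30,
    max_bads: int = 8,
) -> Query:
    """Greedy term selection: keep a change only if the fitness improves."""
    # Candidates are visited once each, highest fixed weight first.
    candidates = sorted(term_set, key=lambda tw: tw[1], reverse=True)
    best_query = dict(initial_query)
    best_score = fitness(best_query)
    bads = 0  # successive non-improving trials

    for term, weight in candidates[:max_terms]:
        if bads >= max_bads:
            break  # stopping criterion: too many successive bad trials
        trial = dict(best_query)
        if term in trial:
            del trial[term]       # mutation: remove a term already present
        else:
            trial[term] = weight  # mutation: add the term with its fixed weight
        score = fitness(trial)
        if score > best_score:
            best_query, best_score = trial, score
            bads = 0
        else:
            bads += 1
    return best_query
```

In a real system the fitness callback would run the trial query against the collection and return, for example, the average precision of the top 1000 retrieved documents; that retrieval step is what dominates the runtime.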
Robertson et al. (1996) tried to reduce the computational runtime of the above approach by using additional constraints. The constraints used were as follows:
1. Terms were sorted in descending order of their Retrieval Selection Value (RSV) (Robertson, 1991).
2. The maximum number of terms in the selection process was 200.
3. Terms were added or removed one by one within a specified computational runtime limit.
This technique was applied to TREC Disks 1, 2 and 3 with 50 queries. Robertson et al. compared their technique against the Okapi-BM25 approach using average precision, precision at the top 5, 30 and 100 documents retrieved, and recall as evaluation metrics. In their results, the proposed technique outperformed Okapi-BM25 on all evaluation metrics. The computational runtime of this technique was 34 minutes per 48 documents used. However, no comparison of computational runtime between Okapi-BM25 and the proposed approach was reported.
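For reference, the precision-based metrics mentioned above can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the cited experiments: the function names are hypothetical, relevance is assumed to be binary, and average precision is normalised by the total number of relevant documents, which is the common TREC convention.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def average_precision(
    ranked: Sequence[str], relevant: Set[str], cutoff: int = 1000
) -> float:
    """Mean of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document's rank
    return total / len(relevant)


ranking = ["d1", "d2", "d3", "d4"]
judged_relevant = {"d1", "d3"}
p_at_2 = precision_at_k(ranking, judged_relevant, 2)
ap = average_precision(ranking, judged_relevant)
```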
Simulated Annealing Using Okapi Approach
The hill-climbing technique often gets stuck in local optima and consumes too much runtime moving from a local optimum to the global one. Several remedies have been proposed for this issue, and Simulated Annealing (SA) is one of them. SA is a local search technique that is similar to hill-climbing but accepts a worse solution in the evolving iterations under specific circumstances. The SA technique is inspired by the metal annealing process in physics (Talbi, 2009). This technique has been used for optimising query representations using term and weight selection (Walker et al., 1997). The term-weighting function used in the document and query representations is Okapi. This approach employs two methods: a simple SA and a mild SA. In simple SA, the query representations are optimised by searching for the most appropriate query representations, and query representations with lower average precision values are accepted with a probability that depends on a temperature (T). This procedure is generally used to let the term selection procedure escape from local optima towards the global optimum. A worse score (average precision) is accepted with probability exp(−(best score − new score)/T). Unfortunately, when tested, this approach produced disappointing results compared to the hill-climbing approach in the previous studies (Robertson, Walker, Hancock-Beaulieu, Jones and Gatford, 1995; Robertson et al., 1996). This is because the computational runtime was higher. It appeared that the annealing process was over-fitting the terms and there was no deterministic re-weighting process, which consumed too much runtime. The second method is a mild simulated annealing approach, in which a deterministic re-weighting process is combined with the SA mechanism. This approach gave a noticeable improvement in IR effectiveness. Walker et al. compared simple and mild SA using average precision, precision at the top 5, 10, 15, 20 and 30 documents retrieved, and recall as evaluation metrics. They ran their experiments on the TREC-5 routing test collection. The mild SA outperformed simple SA in all cases. Computational runtime is not reported in their paper.
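The acceptance rule exp(−(best score − new score)/T) can be sketched in Python as follows. This is a minimal illustration rather than the implementation of Walker et al.; the function names and the rng parameter are hypothetical, and the cooling schedule that lowers T over time is left out.

```python
import math
import random


def acceptance_probability(best_score: float, new_score: float,
                           temperature: float) -> float:
    """Probability of accepting a candidate under the SA rule."""
    if new_score >= best_score:
        return 1.0  # improvements (or ties) are always accepted
    if temperature <= 0:
        return 0.0  # at zero temperature SA degenerates to hill-climbing
    return math.exp(-(best_score - new_score) / temperature)


def accept(best_score: float, new_score: float, temperature: float,
           rng=random) -> bool:
    """Draw a uniform number in [0, 1) and compare it with the probability."""
    return rng.random() < acceptance_probability(
        best_score, new_score, temperature
    )
```

A high temperature makes the probability approach 1 even for much worse queries, while a low temperature makes it approach 0, so the search behaves increasingly like hill-climbing as T decreases.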
Genetic Algorithm for Term and Weight Selection
In this approach, a Genetic Algorithm (GA) is used to evolve an optimised query representation. Several research efforts have applied GA to this problem (Cordon et al., 2003). Radwan et al. (2006) proposed a new fitness function for evolving better query representations, which minimises the difference between the query vectors and their corresponding relevant documents while maximising the difference between the query vectors and their top-30 irrelevant documents retrieved by a VSM model based on the TF-IDF weighting scheme. The results were compared with non-evolving IR approaches and with a GA that used cosine similarity as its fitness function. The findings showed that the new fitness function outperformed the GA approach using cosine similarity as a fitness function. The main features of this investigation were as follows:
• Three test collections (CISI, CACM and NPL) were used to demonstrate the results.
• The selection mechanism was a roulette wheel selection.
• The probabilities of crossover and mutation were Pc = 0.8 and Pm = 0.7.
• The approach was a mixture of learning the most appropriate weight and term selection methods.
No runtime figures or runtime comparison were reported for the proposed approach.
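Roulette wheel selection, the selection mechanism listed above, can be sketched as follows. This is a generic illustration and not the code of Radwan et al.; the function name is hypothetical and fitness values are assumed to be non-negative. Each individual is chosen with probability proportional to its fitness.

```python
import random
from typing import Sequence


def roulette_select(fitnesses: Sequence[float], rng: random.Random) -> int:
    """Return the index of one individual, chosen in proportion to fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate population: fall back to a uniform choice.
        return rng.randrange(len(fitnesses))
    pick = rng.random() * total  # a point on the wheel, in [0, total)
    running = 0.0
    for index, value in enumerate(fitnesses):
        running += value
        if pick < running:
            return index
    return len(fitnesses) - 1  # guard against floating-point rounding


rng = random.Random(42)
counts = [0, 0]
for _ in range(2000):
    counts[roulette_select([1.0, 3.0], rng)] += 1
```

In a full GA, two parents selected this way would be recombined with probability Pc and the offspring mutated with probability Pm before the fitness function is evaluated again.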