Answers to Research Questions

This section summarises the answers to the main research questions presented in Chapter 1. The questions and their answers are as follows:

1. Is there a limitation in applying Evolutionary and Machine Learning (EML) techniques to IR systems that use the TVM (Bag-of-Words) representation? What is the need for mathematical, non-learning term-weighting schemes? What is the importance of relevance judgements and relevance labels in evolutionary and machine learning approaches?

Yes, there is a limitation on using EML techniques at the initial stage of an IR system. This limitation applies to partially judged and non-judged test collections. The reason is that fitness and loss functions use the relevance labels to check the quality of the candidate solutions in every learning iteration of any EML technique. Thus, the absence of relevance judgements at the initial stage of an IR system restricts the use of EML techniques and their ability to optimise IR accuracy (effectiveness). In Chapter 5, this limitation of applying EML was identified in test collections that are well known in the IR community. In addition, TF-ATO was proposed as a new mathematical term-weighting scheme that is more efficient than the well-known TF-IDF weighting scheme.
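The dependence of a fitness function on relevance labels can be sketched as follows. This is a minimal illustration, assuming average precision as the fitness measure; the function name, document identifiers and judgements are hypothetical and are not taken from the thesis.

```python
from typing import Optional, Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Optional[Set[str]]) -> float:
    """Average precision of one ranked list, used here as a toy fitness value."""
    if relevant is None:
        # No relevance judgements: a candidate solution cannot be scored.
        raise ValueError("relevance labels are required to evaluate a ranking")
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


ranking = ["d1", "d2", "d3"]      # hypothetical system output for one query
judged_relevant = {"d1", "d3"}    # hypothetical relevance judgements
ap = average_precision(ranking, judged_relevant)
```

Without judgements the function has nothing to compare the ranking against, which is the situation a mathematical scheme such as TF-IDF or TF-ATO avoids because it needs no labels.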

2. What is the limitation of static collection characteristics in different IR weighting functions for Term Vector Models? How does this parameter affect dynamic variation in the test collection?

Static test collection characteristics have been used in various Mathematical Models (MM, such as TF-ATO and TF-IDF) and EML term-weighting schemes, such as TF-IDF, Okapi, and term-weighting schemes evolved with genetic programming and local search techniques. They usually use the number of documents in the collection as a parameter. However, this parameter should be a variable that changes as documents are added to and removed from the collection. High variation in the test collection size has a negative effect on the IR system. This impact is illustrated by the dynamic variation experiments using TF-IDF and TF-ATO in Chapter 5. Only a very large dynamic variation in the test collection affects IR accuracy.
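The role of the collection size as a parameter can be illustrated with the textbook inverse document frequency, log(N / df). This is a sketch under stated assumptions: the collection sizes and the document frequency are hypothetical, and the document frequency is held fixed while N grows, which a real collection would not guarantee.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Textbook inverse document frequency, log(N / df)."""
    return math.log(n_docs / doc_freq)


DOC_FREQ = 10  # hypothetical: the term occurs in 10 documents throughout

idf_base = idf(1_000, DOC_FREQ)      # collection size at indexing time
idf_small = idf(1_100, DOC_FREQ)     # 10% more documents
idf_large = idf(100_000, DOC_FREQ)   # 100 times more documents

# Relative change of the weight compared with the value stored at indexing time.
drift_small = (idf_small - idf_base) / idf_base
drift_large = (idf_large - idf_base) / idf_base
```

In this toy setting a 10% growth of the collection shifts the weight by about 2%, while a hundredfold growth doubles it, so weights computed with a stale N only become badly wrong under very large variation.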

3. What is the impact of the pre-processing procedure (stop-word removal) on term-weighting functions?

Stop-word removal has a positive impact on the IR system. Not using stop-word removal has a stronger negative effect on TF-ATO than on TF-IDF. The reason is that the TF-IDF weighting scheme in effect removes some stop-words from the collection: a word (term) has a TF-IDF weight of 0 when it occurs at least once in every document in the collection. Furthermore, the term-weighting schemes used for creating TREC pooled collections are usually TF-IDF or its variations. This causes another bias in the collections related to the chosen term-weighting scheme. However, TF-ATO performs better with the Discriminative Approach (DA), which can be considered a collection-domain form of stop-word removal. The DA increases the accuracy (effectiveness) with either TF-IDF or TF-ATO because of its capability to remove some noisy keywords. Chapter 5 demonstrated the impact of stop-word removal and the DA on increasing the performance of IR systems.
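The zero-weight property of TF-IDF mentioned above can be checked on a tiny example. This is a minimal sketch assuming raw term frequency multiplied by log(N / df); the three-document collection is hypothetical.

```python
import math
from collections import Counter

docs = {  # hypothetical three-document collection
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "the bird sang".split(),
}

n_docs = len(docs)
# Document frequency: in how many documents each term occurs at least once.
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))


def tf_idf(term: str, tokens: list) -> float:
    """Raw term frequency times log(N / df)."""
    return tokens.count(term) * math.log(n_docs / doc_freq[term])


weights = {
    doc_id: {term: tf_idf(term, tokens) for term in set(tokens)}
    for doc_id, tokens in docs.items()
}
```

Here "the" occurs in every document, so its weight is 0 everywhere regardless of how often it is repeated, whereas "cat" (two of three documents) keeps a positive weight. A scheme without a collection-wide factor of this kind has no such built-in filter.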

4. What is the importance of EML techniques in IR for overcoming the impact of pre-processing (stop-word removal and stemming) when creating an effective IR system?

Usually, using a stop-word list or a stemmer from a different topic domain has a negative effect on the performance of the IR system. This is because some discriminative terms that indicate the topic of the document are removed or stemmed. Previous studies showed that EML techniques can adjust the similarity matching between the relevant documents and the queries. Consequently, EML techniques can improve the accuracy of the IR system using the relevance judgements. Chapter 4 states the capability of EML techniques to improve IR systems.

5. What are the limitations of applying EML techniques on IR systems for FVM representation (Bag-of-Features)?

Previous studies did not specify fixed settings for evaluating the performance and effectiveness of various EML techniques. Furthermore, some EML techniques, such as RankNET and ListNET among others, did not consider the over-fitting and under-fitting problems in the sampling process in each learning iteration. Moreover, pairwise approaches are limited in producing an accurate ranking model for graded relevance labels. There is also a limitation in creating FVM datasets and using EML techniques at the initial stage of IR systems. In addition, most EML techniques consume a large amount of computational runtime. These issues motivated the proposal of ES-Rank as an effective EML technique for the Learning to Rank problem in IR.
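One way to read the pairwise limitation for graded labels is sketched below. It assumes a basic pairwise formulation that keeps only the preference order between two documents; specific algorithms may weight pairs differently, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def preference_pairs(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (i, j) index pairs where document i is labelled more relevant than j."""
    return [
        (i, j)
        for i in range(len(labels))
        for j in range(len(labels))
        if labels[i] > labels[j]
    ]


pairs_small_gap = preference_pairs([1, 0])  # relevant vs. non-relevant
pairs_large_gap = preference_pairs([2, 0])  # highly relevant vs. non-relevant
```

Both label sets reduce to the same single preference, so under this assumption the size of the grade difference is not visible to the learner.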

6. How can adaptive (1+1)-evolutionary techniques be used to improve the IR system with the smallest problem size and the lowest computational time?

The (1+1)-evolutionary techniques are similar to population-based EML techniques in that they improve IR accuracy based on the relevance judgements. They use one parent chromosome and one offspring (child) chromosome to evolve a better solution, and so use less memory than population-based EML techniques. Chapters 6, 7 and 8 proposed the (1+1)-Evolutionary Gradient Strategy and the (1+1)-Evolutionary Strategy to optimise IR systems with various novel methods. To the best of my knowledge, these methods have not been used before in the IR research literature. These techniques outperformed TF-IDF, Okapi and fourteen EML techniques in the TVM and FVM approaches.
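The one-parent, one-child loop can be sketched generically as follows. This is not the ES-Rank algorithm from the thesis: the Gaussian mutation, the fixed step size, the pairwise-accuracy fitness and the toy data are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def pairwise_accuracy(weights: Sequence[float],
                      features: Sequence[Sequence[float]],
                      labels: Sequence[int]) -> float:
    """Toy fitness: fraction of labelled preferences that the linear scores respect."""
    scores = [sum(w * x for w, x in zip(weights, row)) for row in features]
    correct = 0
    total = 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                total += 1
                if scores[i] > scores[j]:
                    correct += 1
    return correct / total if total else 0.0


def one_plus_one_es(features: Sequence[Sequence[float]],
                    labels: Sequence[int],
                    initial: Sequence[float],
                    generations: int = 200,
                    step: float = 0.5,
                    seed: int = 0) -> Tuple[List[float], float, List[float]]:
    """Generic (1+1) loop: keep a single parent, replace it only if the child is no worse."""
    rng = random.Random(seed)
    parent = list(initial)
    parent_fit = pairwise_accuracy(parent, features, labels)
    history = [parent_fit]
    for _ in range(generations):
        # Create the single child by adding Gaussian noise to every weight of the parent.
        child = [w + rng.gauss(0.0, step) for w in parent]
        child_fit = pairwise_accuracy(child, features, labels)
        if child_fit >= parent_fit:
            parent, parent_fit = child, child_fit
        history.append(parent_fit)
    return parent, parent_fit, history


# Hypothetical query: five documents, two features each, graded labels 0 to 2.
FEATURES = [[0.9, 0.2], [0.7, 0.1], [0.4, 0.5], [0.2, 0.3], [0.1, 0.8]]
LABELS = [2, 2, 1, 0, 0]

best, best_fit, history = one_plus_one_es(FEATURES, LABELS, initial=[0.0, 0.0])
```

Only two weight vectors are held at any time, which is where the memory saving over a population comes from, and the recorded fitness can never decrease because a worse child is discarded. The `initial` argument is the place where a zero vector or a model produced by another learner would be supplied.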

7. What is the importance of the initialisation procedure in the (1+1)-Evolutionary Strategy technique?

Chapter 7 showed the importance of the initialisation procedure in ES-Rank. An appropriate initialisation procedure improves the performance and accuracy of ES-Rank. In Chapter 7, zero values and the ranking models produced by linear regression and by a support vector machine were used as initialisation values in ES-Rank. The best performance and the best accuracy were produced with linear regression as the initialisation procedure.

8. Can the (1+1)-Evolutionary Strategy improve a user-simulation click ranking model?

Yes. The linear ranking model from the Dependent Click Model (DCM) was used as an initialisation procedure in the ES-Rank application. We called this technique ES-Click. Chapter 8 illustrated the ability of ES-Rank (the (1+1)-Evolutionary Strategy) to improve the DCM model on both the training and testing datasets.

In document Evolutionary algorithms and machine learning techniques for information retrieval (Page 181-184)