DISCUSSION - The Ensemble MESH-Term Query Expansion Models Using Multiple LDA Topic Models and

5.1 The LDA Model Evaluation

LDA models can be built with different numbers of topics. How many topics are appropriate? Although the number of topics depends on the purpose of the research, it is generally decided using metrics such as perplexity or coherence. Generating an LDA model with many topics can be costly if the dataset is large: it may take several days and require a large amount of memory (RAM). For instance, in this study it took around 20 days to generate an LDA model with 4000 topics, so a cluster with many CPUs was used to generate 40 LDA models.

5.1.1 The Number of Topics in LDA for IR – Perplexity

The relationship between model fit and IR performance is one concern of this study. The best K (the number of topics) according to a model-fit measure might also be the most effective for selecting words for QE, and so improve infAP and infNDCG. Perplexity was measured to evaluate the fit of LDA models with different numbers of topics. The LDA models were generated from the training dataset (80% of the data), and their perplexity was compared on the validation dataset, a randomly selected 20% of the documents.
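As a minimal sketch of how held-out perplexity can be computed, the function below takes the exponential of the negative mean log-likelihood per token. It assumes the per-document topic proportions (`doc_topic`) for the held-out documents have already been inferred and that the topic-word probabilities (`topic_word`) come from the trained model; these names, and the toy data in the usage note, are illustrative and are not taken from the study's implementation.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    """Held-out perplexity: exp(-(sum of log p(w|d)) / number of tokens).

    docs       -- list of documents, each a list of word ids
    doc_topic  -- doc_topic[d][k] = p(topic k | document d), assumed pre-inferred
    topic_word -- topic_word[k][w] = p(word w | topic k)
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, words in enumerate(docs):
        for w in words:
            # Mixture likelihood of one token: sum over topics of p(k|d) * p(w|k)
            p_w = sum(theta * topic_word[k][w]
                      for k, theta in enumerate(doc_topic[d]))
            log_likelihood += math.log(p_w)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)
```

With a toy model in which every topic is uniform over a four-word vocabulary, for example `perplexity([[0, 1, 2, 3]], [[0.5, 0.5]], [[0.25] * 4, [0.25] * 4])`, the result equals the vocabulary size, which is the usual sanity check: a lower value means the model predicts the held-out words better than uniform guessing.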

Wei and Croft (2006) compared the retrieval results on 242,918 Associated Press newswire documents (1988-90) for LDA models with different numbers of topics (K) in terms of AP (average precision). The LDA model with K=800 showed the best average precision. Meanwhile, in Liu and Croft’s research (2004), the best number of K was 2000 in the cluster-based retrieval using hierarchical agglomerative clustering algorithms for both datasets (Associated Press newswire 1988 – 90: 242,918

Even though perplexity is a common measure for deciding the best number of topics (K) for an LDA model, there is no clear conclusion about how perplexity relates to IR performance when LDA topic words are used for QE. To examine the relationship between perplexity and the IR measures (infAP and infNDCG), perplexity was calculated for the LDA models with different numbers of topics (Figure 33). A randomly selected 80% of the dataset was used as the training set and the remaining 20% as the test set. The best K, with the lowest perplexity (76.074), was 10. The mean infAP and infNDCG scores of the LDA model with 10 topics were 0.0199 and 0.1637 for the top-1 retrieved document (default TP threshold = 0.01), and 0.0209 and 0.1806 with thresholds for TP (0.08) and TP*WP (0.03). Compared with the other LDA models (Table 7 and Table 17), these mean infAP and infNDCG scores were not high. Overall, LDA models with a relatively large number of topics showed better infAP and infNDCG scores.

Figure 33. The perplexity for LDA models with different numbers of topics

5.2 Classifier Performance

A classifier played a critical role in identifying relevant words for QE. Relevant features and appropriate parameters (the number of layers and nodes, iterations, batch size, etc.), as well as enough data, determine the performance of a classifier. Developing a decent classifier is an iterative process of adjusting parameter values and testing performance on validation sets. Several issues arose in constructing the classifiers that affected infAP and infNDCG.

5.2.1 Overfit

Generally, adding layers and nodes helps to increase accuracy on a training set; however, this does not guarantee better scores on validation and test sets (overfitting). Overall, the ANN classifiers with many layers and nodes showed high accuracy on the training datasets but not on the validation sets (Table 4 & 5), which implies overfitting. The appropriate number of layers and nodes should be decided by testing accuracy on the validation sets. Dropout (Hinton, Krizhevsky, Sutskever, & Srivastava, 2019) and early stopping (Yao, Rosasco, & Caponnetto, 2007) are applicable techniques for preventing overfitting when training classifiers. Dropout is a regularization technique that randomly omits a fraction of the units (nodes) at each training step, so that the network cannot rely too heavily on any particular units. An early stopping rule limits the number of training iterations: if performance on the validation set stops improving, the training process stops.
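The early stopping rule described above can be sketched as a patience-based check on validation scores. This is only an illustration under stated assumptions: the sequence `val_scores` stands in for the validation accuracy a real training loop would produce after each epoch, and the function name and the `patience` parameter are hypothetical, not part of the study's code.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return (best_epoch, best_score) under a patience-based stopping rule.

    val_scores -- per-epoch validation scores (higher is better); in a real
                  training loop each value would be computed after one epoch
    patience   -- number of consecutive non-improving epochs tolerated
    """
    best_score = float("-inf")
    best_epoch = -1
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # Improvement: remember this epoch and reset the counter
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_score
```

In practice the model weights saved at `best_epoch` would be restored, so later epochs that only improve training accuracy do not affect the final classifier.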

5.2.2 Imbalanced classification

Another problem is skewed classification in binary classification. The binary classifiers classified most words into the negative word group. Although negative words were only about three times as frequent as positive words, most classifiers assigned 90% of the words in the validation sets to the negative word group; the exception was one classifier with 3 layers and 700 nodes per layer.

F1 and AUC scores on the validation sets were calculated to compensate for this weakness of the accuracy measure. Classifiers with more than 3 layers showed relatively high F1 and AUC scores (Table 4 & 5). To mitigate the effect of imbalanced classification, the predicted probability of a specific (positive/negative) class, rather than the output class (label), was used for weighting a word score.
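The sketch below illustrates the two ideas in this paragraph. The first function computes F1 from confusion-matrix counts, which shows why a classifier that labels every word as negative can have reasonable accuracy yet an F1 of zero. The second weights a word score by the predicted probability of the positive class instead of the hard label. The multiplicative weighting, the function names, and the example words are assumptions made for illustration; this section does not specify the exact weighting formula used in the study.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weight_by_probability(base_scores, positive_probs):
    """Scale each word's score by the classifier's positive-class probability.

    Assumed form: score * p(positive); a hard label would instead keep or
    discard the word outright.
    """
    return {word: base_scores[word] * positive_probs[word]
            for word in base_scores}
```

For example, with 25 positive and 75 negative words, predicting "negative" for everything gives 75% accuracy but `f1_score(tp=0, fp=0, fn=25)` is 0.0, whereas probability weighting still lets a word with a moderate positive probability contribute to the expanded query.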

Even though ANN classifiers have generally shown good performance, classifiers based on other algorithms, such as SVM, decision tree, naïve Bayes, logistic regression, or k-means, may outperform an ANN classifier. As an example, an SVM classifier was compared with an ANN classifier in Appendix I.

Classifiers other than ANNs might be more effective when combined with LDA models. Some classifiers might be more effective for filtering; others might be more effective for ranking. Combining different types of classifiers could lead to the best ensemble QE model.

5.3 A Cost-effective IR System

Normally, greater cost or investment results in better performance. In practice, however, a reasonable level of input cost must be considered, because in many cases, once IR performance exceeds a certain threshold, more input units are needed to achieve the same amount of improvement. Unless very high performance is required, an IR system should be compact, efficient, and well-performing, designed with reasonable cost and effort.

5.3.1 The number of vocabulary words

Document representation has a large impact not only on IR performance but also on the cost of implementing an IR system. In this study, MeSH terms, comprising 24,883 n-gram words, were used to represent a document. Some MeSH terms appear very rarely and others very frequently. Such terms might be ignored for pre-processing efficiency if the collection is very large. MeSH terms that rarely appear might not be influential in IR, and MeSH terms that occur frequently are likely to be general terms, which may not be critical in IR.
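One common way to drop very rare and very frequent terms is to filter by document frequency. The following is a minimal sketch of that idea, not the pre-processing used in this study (which kept all MeSH terms); the function name, the thresholds `min_df` and `max_df_ratio`, and the toy documents are hypothetical.

```python
from collections import Counter


def filter_terms_by_doc_frequency(docs, min_df=2, max_df_ratio=0.5):
    """Keep terms that occur in at least `min_df` documents and in at most
    `max_df_ratio` of all documents.

    docs -- list of documents, each a list of MeSH terms
    Returns (filtered_docs, kept_terms).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term once per document
    kept = {term for term, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df_ratio}
    filtered = [[t for t in terms if t in kept] for terms in docs]
    return filtered, kept
```

With four toy documents that all carry "Humans", a setting such as `max_df_ratio=0.75` removes that term as too general, while `min_df=2` removes terms seen in only one document. Suitable thresholds would have to be tuned against IR performance on the actual collection.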

MeSH descriptors include a list of Check Tags that are very general (e.g. “Humans”). Check Tags are mostly used for filtering search results. Although Check Tags were not removed in this study, they could be removed to improve both effectiveness and efficiency.

5.3.2 The number of topic models and classifiers

In designing ensemble QE models, the number of models is as important as the quality of the models; both affect IR performance. Even if the topic models or classifiers are homogeneous, QE using more topic models and classifiers would be expected to yield better performance. However, when resources are limited, reasonable numbers of LDA topic models and classifiers should be decided according to how much IR performance improves per unit of input cost. The complexity of an IR system also affects its speed and maintenance: the more complicated the system, the more resources it requires and the slower it becomes. The reasonable numbers of topic models and classifiers will differ across domains.
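The idea of choosing the number of models by improvement per unit of cost can be sketched as a simple marginal-gain rule. Everything here is an assumption for illustration: the function name, a constant `cost_per_model`, the threshold `min_gain_per_cost`, and the example scores, which are invented placeholders and not results from this study.

```python
def choose_model_count(scores, cost_per_model=1.0, min_gain_per_cost=0.01):
    """Pick the largest ensemble size whose last added model still pays off.

    scores[i] is the IR score (e.g. infNDCG) obtained with i + 1 models.
    Models are added while the score gain per unit of cost stays at or above
    `min_gain_per_cost`; the search stops at the first model that falls short.
    """
    chosen = 1
    for n in range(1, len(scores)):
        gain = scores[n] - scores[n - 1]
        if gain / cost_per_model < min_gain_per_cost:
            break  # diminishing returns: this extra model is not worth its cost
        chosen = n + 1
    return chosen
```

For placeholder scores such as `[0.10, 0.14, 0.16, 0.165, 0.166]`, the rule stops after three models because the fourth adds less than the threshold. A real cost model would likely also account for training time, memory, and retrieval latency rather than a single constant per model.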
