Multi-label Text Classification - Topic Models for Semantically Annotated Documents

3.2 Topic Models for Semantically Annotated Documents

3.3.3 Multi-label Text Classification

To further validate the predictive power of the models presented here, we apply our generative method to three challenging multi-label classification problems and compare it with state-of-the-art classification algorithms. First, the TC model is benchmarked on two independent PubMed corpora against (i) a multi-label naive Bayes classifier, (ii) a method currently used by the National Library of Medicine (NLM), and (iii) a state-of-the-art multi-label Support Vector Machine (SVM). The comparison shows encouraging results. Having established the predictive power of the TC model against state-of-the-art methods, we then compare the TC model with the UTC model on the CiteULike data set in a personalized tag recommendation task.

Results PubMed Corpora

In this setting, we prune each MeSH descriptor to the first level of each taxonomy subbranch, resulting in 108 unique MeSH concepts. For example, if a document is indexed with Muscular Disorders, Atrophic [C10.668.550], the concept is pruned to Nervous System Diseases [C10]. The task is therefore to assign at least one of the 108 classes to an unseen PubMed abstract. Note that, from a machine learning point of view, this is a challenging multi-label classification problem with 108 classes, comparable to other state-of-the-art text classification problems such as the Reuters text classification task [121], where the number of classes is approximately the same. In the pruned setting of our task, we have on average 9.6 (random 50K corpus) and 10.5 (genetics-related corpus) pruned MeSH labels per document.
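The pruning step can be illustrated with a minimal sketch, assuming that each MeSH descriptor is available as a dot-separated tree number such as C10.668.550. The function names and the example tree numbers are illustrative assumptions and are not taken from the original implementation.

```python
def prune_tree_number(tree_number: str) -> str:
    """Return the first-level branch of a MeSH tree number.

    Example: 'C10.668.550' is pruned to 'C10'.
    """
    return tree_number.split(".", 1)[0]


def prune_document_labels(tree_numbers):
    """Map all tree numbers of one document to their unique first-level branches."""
    return sorted({prune_tree_number(tn) for tn in tree_numbers})


# Illustrative annotations of a single document (hypothetical example data).
doc_labels = ["C10.668.550", "C10.228", "C05.651"]
print(prune_document_labels(doc_labels))  # ['C05', 'C10']
```

Because several descriptors of a document can fall into the same first-level branch, the pruned label set is usually smaller than the original annotation.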

In what follows, we first describe the benchmark methods and then present the results for the genetics-related corpus and the random 50K corpus. The Topic-Concept model is benchmarked against a method currently used by the NLM [104], which we refer to as centroid profiling, a multi-label naive Bayes classifier, and a multi-label SVM. For both data sets and all methods, 5-fold cross-validation was conducted.

Figure 3.3: Meta data annotation perplexity on test sets, plotted against the number of topics. (a) Perplexity on the PubMed corpus (random 50K) for LDA, Link-LDA, and the Topic-Concept model. (b) Perplexity on the CiteULike corpus for LDA, Link-LDA, the Topic-Concept model, and the User-Topic-Concept model. Note that a lower perplexity indicates a better annotation quality.

Centroid Profiling In [104], classification is tackled by computing, for each word token $w_i$ and each class label $l_m$ in a training corpus, a term frequency measure

$TF_{i,m} = w_{i,l_m} / \sum_{m'=1}^{|L|} w_{i,l_{m'}}$,

where $|L|$ equals the total number of concepts. Thus, $TF_{i,m}$ measures the number of times a specific word $w_i$ co-occurs with the class label $l_m$, normalized by the total number of times the word $w_i$ occurs. As a consequence, each word token in the training corpus can be represented by a profile consisting of the term frequency distribution over all $|L|$ classes. When indexing a new unseen document, the centroid over all profiles for the word tokens in the test resource is computed. This centroid represents the ranking of all class labels for the test resource. This method was chosen because it is currently used by the NLM in a classification task to predict so-called journal descriptors [104].
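The profiling and ranking steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the NLM implementation: the function names and the toy data are hypothetical, the denominator follows the formula (the sum of co-occurrence counts over all labels), and tokens that were not seen in training are simply ignored.

```python
from collections import Counter, defaultdict


def train_profiles(docs):
    """Build word profiles from (tokens, labels) pairs.

    Returns a dict: word -> {label: TF}, where TF is the co-occurrence count
    of the word with the label, normalized over all labels.
    """
    cooc = defaultdict(Counter)  # cooc[word][label] = co-occurrence count
    for tokens, labels in docs:
        for w in tokens:
            for l in labels:
                cooc[w][l] += 1
    profiles = {}
    for w, counts in cooc.items():
        total = sum(counts.values())
        profiles[w] = {l: c / total for l, c in counts.items()}
    return profiles


def rank_labels(tokens, profiles):
    """Average the profiles of all known tokens (the centroid) and rank labels."""
    known = [w for w in tokens if w in profiles]
    centroid = Counter()
    for w in known:
        for l, tf in profiles[w].items():
            centroid[l] += tf / len(known)
    # Highest score first; ties are broken alphabetically by label.
    return sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy corpus with two first-level labels.
train = [
    (["muscle", "atrophy"], ["C10"]),
    (["muscle", "gene"], ["C10", "G05"]),
    (["gene", "mutation"], ["G05"]),
]
profiles = train_profiles(train)
ranking = rank_labels(["muscle", "atrophy"], profiles)
print(ranking[0][0])  # C10
```

Since every word profile sums to one, the centroid is itself a distribution over the labels, which makes the scores of different test documents comparable.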

Naive Bayes (NB) NB classifiers are a very successful class of algorithms for learning to classify text documents [136]. For the multi-label NB classifier, we assumed a bag-of-words representation, as for the Topic-Concept model, and trained one classifier for each of the 108 labels. We used the popular multinomial model for naive Bayes [130].
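A per-label multinomial NB classifier of this kind can be sketched as follows. The class and function names, the Laplace smoothing parameter alpha, the smoothed class prior, and the toy data are assumptions made for illustration; the source does not specify these details.

```python
import math
from collections import Counter


class BinaryMultinomialNB:
    """Multinomial naive Bayes for one label (label present vs. absent)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing (assumed value)

    def fit(self, docs, targets, vocab):
        self.vocab = vocab
        self.log_prior, self.log_lik = {}, {}
        for c in (True, False):
            class_docs = [d for d, y in zip(docs, targets) if y == c]
            # Smoothed prior so that an empty class never yields log(0).
            self.log_prior[c] = math.log((len(class_docs) + 1) / (len(docs) + 2))
            counts = Counter(w for d in class_docs for w in d)
            total = sum(counts.values()) + self.alpha * len(vocab)
            self.log_lik[c] = {
                w: math.log((counts[w] + self.alpha) / total) for w in vocab
            }
        return self

    def predict(self, tokens):
        score = {
            c: self.log_prior[c]
            + sum(self.log_lik[c][w] for w in tokens if w in self.vocab)
            for c in (True, False)
        }
        return score[True] > score[False]


def fit_multilabel_nb(docs, label_sets, labels):
    """Train one binary classifier per label on the same bag-of-words data."""
    vocab = {w for d in docs for w in d}
    return {
        l: BinaryMultinomialNB().fit(docs, [l in s for s in label_sets], vocab)
        for l in labels
    }


def predict_labels(models, tokens):
    """Return all labels whose binary classifier accepts the document."""
    return sorted(l for l, m in models.items() if m.predict(tokens))


# Hypothetical toy data.
docs = [["muscle", "atrophy"], ["muscle", "nerve"], ["gene", "mutation"], ["gene", "dna"]]
label_sets = [{"C10"}, {"C10"}, {"G05"}, {"G05"}]
models = fit_multilabel_nb(docs, label_sets, ["C10", "G05"])
print(predict_labels(models, ["muscle", "nerve"]))  # ['C10']
```

Each binary classifier yields a hard decision, which is one reason why such classifiers do not directly provide a ranked list of labels.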

Multi-Label Support Vector Machine (SVM) The multi-label SVM setting was implemented according to [121]. In this setting, a linear kernel is used, and the SVM is adapted to the multi-label setting with the popular so-called binary method. This setting produced very competitive results on a large-scale text classification task on the RCV1 Reuters corpus [121]. LIBLINEAR, a part of the LIBSVM package [46], is used for the implementation. Two different weighting schemes are evaluated: term frequency (Tf) and cosine-normalized term frequency-inverse document frequency (Tf-Idf).
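The two weighting schemes and the binary decomposition can be sketched as follows. This illustration covers only the feature weighting and the construction of per-label targets, not the SVM training itself. The idf variant log(N / df) is an assumption, since the exact formula is not stated here, and all function names and data are hypothetical.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Raw term frequency weighting: word -> count."""
    return dict(Counter(tokens))


def tfidf_cosine_vectors(docs):
    """Cosine-normalized Tf-Idf vectors, with idf = log(N / df) (assumed variant)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequencies
    vectors = []
    for d in docs:
        weights = {w: tf * math.log(n / df[w]) for w, tf in Counter(d).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        # Leave all-zero vectors untouched to avoid division by zero.
        vectors.append({w: v / norm for w, v in weights.items()} if norm > 0 else weights)
    return vectors


def binary_targets(label_sets, labels):
    """Binary method: one +1/-1 target vector per label, one entry per document."""
    return {l: [1 if l in s else -1 for s in label_sets] for l in labels}


# Hypothetical toy data.
docs = [["muscle", "muscle", "atrophy"], ["muscle", "gene"], ["gene", "mutation"]]
label_sets = [{"C10"}, {"C10", "G05"}, {"G05"}]

vectors = tfidf_cosine_vectors(docs)
targets = binary_targets(label_sets, ["C10", "G05"])
print(targets["C10"])  # [1, 1, -1]
```

Each target vector, together with the weighted document vectors, defines one independent binary problem that a linear SVM can be trained on.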

Prediction with the TC Model Section 3.3.1 describes the training procedure. In the TC model, the prediction of semantic annotations for unseen resources is formulated as follows: based on the word-topic and concept-topic count matrices learned from an independent data set, the likelihood of a concept label $l_{rj}$ given the test resource $r$ is

$p(l_{rj} \mid r) = \sum_t p(l_{rj} \mid t)\, p(t \mid r)$.

The first probability in the sum, $p(l_{rj} \mid t)$, is given by the learned topic-concept distribution (see Equation 3.14). The mixture of topics for the resource, $p(t \mid r)$, is estimated by drawing a topic for each word token in the test resource based on the learned word-topic distribution $p(w \mid t)$. The Gibbs sampling is repeated a small number of times (i = 5). This online sampling strategy showed good performance for topic model inference on streaming document collections [191]. In contrast to the typical application of topic models in text classification problems, where LDA is usually used for dimensionality reduction [28], the TC model directly predicts a ranked list of concept recommendations.
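The prediction step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the original implementation: the sampling conditional is the common collapsed-Gibbs form with fixed word-topic probabilities, the smoothing parameter alpha is hypothetical, and all names and toy distributions are invented for the example. Only the number of sweeps (5) is taken from the text.

```python
import random


def infer_topic_mixture(tokens, p_w_given_t, n_topics, alpha=0.1, sweeps=5, seed=0):
    """Estimate p(t|r) for a test resource with a few Gibbs sweeps.

    p_w_given_t maps each word to a list of p(w|t), one entry per topic.
    """
    rng = random.Random(seed)
    tokens = [w for w in tokens if w in p_w_given_t]  # skip unknown words
    z = [rng.randrange(n_topics) for _ in tokens]  # random initial topics
    counts = [0] * n_topics
    for t in z:
        counts[t] += 1
    for _ in range(sweeps):
        for i, w in enumerate(tokens):
            counts[z[i]] -= 1  # remove the current assignment of token i
            weights = [p_w_given_t[w][t] * (counts[t] + alpha) for t in range(n_topics)]
            z[i] = rng.choices(range(n_topics), weights=weights)[0]
            counts[z[i]] += 1
    total = len(tokens) + n_topics * alpha
    return [(counts[t] + alpha) / total for t in range(n_topics)]


def rank_concepts(p_t_given_r, p_l_given_t):
    """Compute p(l|r) = sum_t p(l|t) p(t|r) and rank the concepts."""
    scores = {}
    for t, p_t in enumerate(p_t_given_r):
        for l, p in p_l_given_t[t].items():
            scores[l] = scores.get(l, 0.0) + p * p_t
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical learned distributions for two topics.
p_w_given_t = {
    "muscle": [0.5, 0.0],
    "atrophy": [0.5, 0.0],
    "gene": [0.0, 0.6],
    "mutation": [0.0, 0.4],
}
p_l_given_t = [{"C10": 0.9, "G05": 0.1}, {"C10": 0.2, "G05": 0.8}]

mixture = infer_topic_mixture(["muscle", "atrophy", "muscle"], p_w_given_t, n_topics=2)
ranking = rank_concepts(mixture, p_l_given_t)
print(ranking[0][0])  # C10
```

In this toy example all test words have zero probability under the second topic, so every token is assigned to the first topic after one sweep and the result does not depend on the random seed.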

Evaluation Measure We are particularly interested in evaluating the classification task in a user-centered or semi-automatic scenario, where we want to recommend a set of concepts for a specific resource (e.g., a human indexer gets recommendations of MeSH terms for a PubMed abstract). In principle, the recommendation task can be considered a ranking task, and well-known information retrieval measures can be used for the evaluation. Indeed, the topic models proposed here return a ranked list of semantic annotations and could therefore be evaluated naturally in a ranking scenario. Unfortunately, it is not straightforward to obtain rankings for classifiers such as SVMs or NB. Thus, we decided to follow the evaluation of [86] and use the well-known F-measure. As in [86], we average the effectiveness of the classifiers over documents rather than over categories. In addition, we weight recall over precision and use the F2-macro measure, because it reflects that human indexers will accept some inappropriate recommendations as long as the major
