3.2 Topic Models for Semantically Annotated Documents
3.3.1 Experimental Setup
Three corpora with a large number of semantic annotations are used to evaluate the models. Two corpora originate from the biomedical domain, where resources or documents are annotated with concepts from a terminological ontology. These data sets contain no user information, so only the TC model is applied to the biomedical corpora. The third corpus is derived from the collaborative tagging system CiteULike⁷, where resources are annotated with tags assigned by users. Here both models can be applied.
PubMed Corpora
Two large PubMed corpora previously generated by [141, 142] were used in the experiments. Table 3.2 summarizes the corpus statistics. The first data set is a collection of PubMed abstracts randomly selected from the MEDLINE 2006 baseline database provided by the National Library of Medicine (NLM). Word tokens from title and abstract were stemmed with a standard Porter stemmer [150], and stop words were removed using the PubMed stopword list⁸. Additionally, word stems occurring fewer than five times in the corpus were filtered out. The collection consists of |R| = 50,000 abstracts with a total of 2,369,616 word mentions (|W| = 22,531 unique words) and 470,101 concept annotations (|L| = 17,716 unique MeSH main headings). We refer to MeSH as a terminological ontology, in which relations are partially described as subtype-supertype relations and concepts are described by concept labels or synonyms [20]. Note that no filter criterion was defined for the MeSH vocabulary.
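The preprocessing described above can be sketched as follows. This is a minimal illustration and not the original pipeline: the function name `preprocess`, the lowercasing step, and the order "stop word removal before stemming" are assumptions, and the `stem` argument stands in for a Porter stemmer (the default is an identity function so that the sketch runs without external libraries).

```python
from collections import Counter


def preprocess(docs, stop_words, stem=lambda tok: tok, min_count=5):
    """Remove stop words, stem tokens, and drop rare stems.

    docs       -- list of token lists (title plus abstract per document)
    stop_words -- set of lowercase stop words
    stem       -- stemming function; assumed to be a Porter stemmer in practice
    min_count  -- minimum corpus-wide frequency for a stem to be kept
    """
    # Lowercase, drop stop words, then stem each remaining token.
    stemmed = [
        [stem(tok.lower()) for tok in doc if tok.lower() not in stop_words]
        for doc in docs
    ]
    # Count stem occurrences over the whole corpus, not per document.
    counts = Counter(s for doc in stemmed for s in doc)
    # Keep only stems that occur at least `min_count` times.
    return [[s for s in doc if counts[s] >= min_count] for doc in stemmed]
```

With `min_count=5` this corresponds to the threshold stated in the text; a real stemmer can be passed in through the `stem` parameter.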
The second data set contains |R| = 84,076 PubMed abstracts with a total of 4,293,992 word mentions (|W| = 31,684 unique word stems) and 912,231 semantic annotations (|L| = 18,350 unique MeSH main headings). The same filtering steps were applied as described above. This corpus is composed of genetics-related abstracts from the MEDLINE 2005 baseline corpus. The bias towards genetics-related abstracts results from applying genetics-related filtering strategies with NLM's Journal Descriptor Indexing Tool [141]. See [141, 142] for more information about both corpora. In the following, the data sets are referred to as the random 50K data set and the genetics-related data set, respectively. For the qualitative evaluation, the larger genetics-related corpus with all 18,350 unique MeSH main headings was used (see Section 3.3.5).
While the TC model can handle the large number of concepts provided by the MeSH vocabulary, it is difficult to apply benchmark methods such as a Support Vector Machine or a Naive Bayes classifier to a multi-label classification task of this size. Therefore, we prune each MeSH descriptor to the first level of each taxonomy subbranch, resulting in 108 unique MeSH concepts (see Section 3.3.3). In this pruned setting, each document has on average 9.6 (random 50K) or 10.5 (genetics-related) pruned MeSH labels.
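One possible reading of this pruning step is sketched below. It assumes that each MeSH descriptor is identified by one or more dot-separated tree numbers (for example `C04.557.337`) and that "first level" means the part before the first dot. Both the representation and the function name `prune_to_first_level` are assumptions made for illustration, not details taken from the experiments.

```python
def prune_to_first_level(tree_numbers):
    """Map MeSH tree numbers to their first-level prefixes.

    tree_numbers -- iterable of dot-separated tree numbers for one document
    Returns the sorted list of distinct first-level prefixes.
    """
    # Everything before the first dot is treated as the first taxonomy level;
    # a tree number without a dot is already at that level.
    return sorted({tn.split(".", 1)[0] for tn in tree_numbers})
```

Because several annotations of a document can collapse onto the same prefix, the result is deduplicated, which is why the pruned label count per document is smaller than the raw annotation count.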
Training Details. Parameters for the Topic-Concept model were estimated by averaging samples from ten randomly seeded runs (S = 10), each running over 100 iterations, with an initial burn-in phase of 500 iterations (resulting in a total of 1,500 iterations). We chose 500 burn-in iterations because the log likelihood was observed to flatten by that point. The training time ranged from ten to fifteen hours, depending on the size of the data set, the number of MeSH concepts used, and the predefined number of topics (run on a standard Linux PC with a 2.4 GHz Opteron dual-core processor). Instead of estimating the hyperparameters α, β and γ, we fix them to 50/T, 0.001 and 1/C, respectively, in each of the experiments. Here, C denotes the size of the vocabulary of the semantic annotations. Consequently, we use symmetric Dirichlet distributions throughout all experiments. The values were chosen according to [175, 89]. We trained the topic models with the number of topics set to T = 200, T = 300, T = 400 and T = 600 to show that performance is not very sensitive to this parameter as long as the number of topics is reasonably high. In addition, models with T = 10, T = 50 and T = 100 were trained for the perplexity evaluation in Section 3.3.2.
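To make the fixed hyperparameter values concrete, the following sketch computes them for a given number of topics T and annotation vocabulary size C. The function name `dirichlet_hyperparameters` is hypothetical; only the three formulas (50/T, 0.001 and 1/C) are taken from the text.

```python
def dirichlet_hyperparameters(num_topics, num_concepts):
    """Return (alpha, beta, gamma) for symmetric Dirichlet priors.

    num_topics   -- predefined number of topics T
    num_concepts -- size C of the semantic annotation vocabulary
    """
    alpha = 50.0 / num_topics    # depends on the number of topics
    beta = 0.001                 # constant across all experiments
    gamma = 1.0 / num_concepts   # depends on the annotation vocabulary size
    return alpha, beta, gamma


if __name__ == "__main__":
    # Example with the random 50K data set, where C = 17,716 MeSH headings.
    for T in (200, 300, 400, 600):
        print(T, dirichlet_hyperparameters(T, 17716))
```

Note that α shrinks as T grows, so models with more topics use a sparser prior.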
CiteULike Data Set
CiteULike is a social bookmarking system, or collaborative tagging system, that allows researchers to manage their scientific reference articles. Researchers upload references they are interested in and assign tags to them. Semantic annotations therefore come in the form of noisy tags. CiteULike provides data snapshots of who posted what, as well as linkout data, on its web page⁹. The linkout data provides information about the origin of a resource (e.g. that a certain article comes from the Science URL). To obtain the content of the resources, i.e. the titles and abstracts, one needs to install several plugins provided by CiteULike. The data snapshot used in our experiments is from November 13th, 2008. We restricted the data to a reasonably high number of users, |U| = 1,393, and required for the generation of the training set that each resource be cited by at least three users. This ensures that the training data set forms a "dense" fraction of the resources. Note that we do not use such a restriction for the test set (see next paragraph). In addition, every word token and every tag had to occur at least five times in the training data. Word tokens from title and abstract were stemmed with a standard Porter stemmer [150], and stop words were removed using a standard stop word list¹⁰. Table 3.2 summarizes the corpus statistics. In total, our training data set originates from |P| = 64,159 posts. This comprises |R| = 18,638 resources, 1,161,794 words (|W| = 14,489 unique words), 125,808 semantic annotations in the form of tags (|L| = 3,411 unique tags) and 18,628 user mentions (|U| = 1,393 unique users). On average, each user uses 32 unique tags. The maximum number of unique tag labels for a specific user is 279. The average number of tag assignments per resource for a single user is three. The user IDs, resource IDs and tags are provided as supplementary data¹¹.
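The two training-set restrictions on posts and tags can be sketched as below. The post representation as `(user, resource, tags)` tuples, the function name `filter_training_posts`, and the order of the two filters (resource density first, tag frequency second) are assumptions for illustration; the text only states the thresholds.

```python
from collections import Counter, defaultdict


def filter_training_posts(posts, min_users=3, min_tag_count=5):
    """Keep posts on 'dense' resources and drop rare tags.

    posts -- list of (user, resource, tags) tuples, tags being a list of str
    """
    # Collect the distinct users that posted each resource.
    users_per_resource = defaultdict(set)
    for user, resource, _tags in posts:
        users_per_resource[resource].add(user)

    # Keep only posts whose resource was posted by at least `min_users` users.
    dense = [
        (u, r, tags) for u, r, tags in posts
        if len(users_per_resource[r]) >= min_users
    ]

    # Count tag occurrences in the remaining posts and drop rare tags.
    tag_counts = Counter(t for _u, _r, tags in dense for t in tags)
    return [
        (u, r, [t for t in tags if tag_counts[t] >= min_tag_count])
        for u, r, tags in dense
    ]
```

The word-token threshold mentioned in the text would be applied to titles and abstracts in the same way as the tag threshold here.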
Test Set for Tag Recommendation. We evaluate the proposed models in a personalized tag recommendation task (see Section 3.3.3). The only restriction for the test set was that a resource had to be posted by a user previously seen in the training set. The same applies to tags. The independent test set consists of 15,000 posts.
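A sketch of this test-set restriction follows. The text does not say whether a post with unseen tags is discarded entirely or only loses those tags; the version below assumes the latter and drops a post only when no known tag remains. The function name `restrict_test_posts` and the `(user, resource, tags)` tuple layout are likewise assumptions.

```python
def restrict_test_posts(test_posts, train_users, train_tags):
    """Keep test posts by known users, restricted to known tags.

    test_posts  -- list of (user, resource, tags) tuples
    train_users -- set of user ids seen in the training set
    train_tags  -- set of tags seen in the training set
    """
    kept = []
    for user, resource, tags in test_posts:
        # Skip posts from users that never occur in the training data.
        if user not in train_users:
            continue
        # Assumption: unseen tags are removed rather than the whole post.
        known = [t for t in tags if t in train_tags]
        if known:
            kept.append((user, resource, known))
    return kept
```

Resources themselves are not filtered, which matches the statement that the density requirement applies only to the training set.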
⁹ http://www.citeulike.org/faq/data.adp
¹⁰ http://ir.dcs.gla.ac.uk/resources.html
¹¹
Training Details. Parameters were estimated by averaging samples from ten randomly seeded runs, each running over 100 iterations, with an initial burn-in phase of 500 iterations for the TC model and 1,500 iterations for the UTC model. This results in a total of 1,500 and 2,500 iterations, respectively. Again, we chose the number of burn-in iterations by observing a flattening of the log likelihood. Overcoming the burn-in phase took longer for the UTC model, since a user-topic distribution for each user u has to be estimated as well. Instead of estimating the hyperparameters α, β and γ, we fix them to 50/T, 0.001 and 1/C, respectively, in each of the experiments (C represents the number of unique tags in the corpus). The values were chosen according to [175, 89]. We trained the topic models with the number of topics set to T = 200, T = 300 and T = 400 to show that performance is not very sensitive to this parameter as long as the number of topics is reasonably high. In addition, models with T = 10, T = 50 and T = 100 were trained for the perplexity evaluation in Section 3.3.2.
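The iteration totals stated for the two models are consistent with a simple schedule of burn-in plus ten sampling phases of 100 iterations each. The helper below only reproduces that arithmetic; the function name `total_iterations` and its parameter names are hypothetical and not part of the original setup.

```python
def total_iterations(burn_in, num_samples=10, iterations_per_sample=100):
    """Total sampler iterations: burn-in plus all sampling phases."""
    return burn_in + num_samples * iterations_per_sample


if __name__ == "__main__":
    print("TC model: ", total_iterations(500))    # 500 + 10 * 100
    print("UTC model:", total_iterations(1500))   # 1500 + 10 * 100
```

Under this reading, the only difference between the two schedules is the length of the burn-in phase.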