
3.2 Topic Models for Semantically Annotated Documents

3.2.2 Classical Latent Dirichlet Allocation Model

This section introduces the classical LDA model [28] and discusses how LDA can be applied to semantically annotated document collections. LDA is a fully generative model of document collections, based on the idea that a document or resource is generated by a mixture of topics. Each word in a resource $r$ is therefore drawn from a specific topic; in other words, each word $w_{ri}$ in a resource has a topic assignment $z_{ri}$. Each resource is associated with a multinomial distribution over topics. The generation of a resource $r$ is a three-step process. First, for each resource, a distribution over topics is sampled from a Dirichlet distribution. Second, for each word in the resource, a single topic is chosen according to the resource-specific topic distribution. Finally, a word is sampled from a multinomial distribution over words specific to the sampled topic. Algorithm 4 depicts this generative process more formally. $\theta_r$ denotes the multinomial distribution over topics for the specific resource $r$, with $\sum_{t=1}^{|T|} \theta_{tr} = 1$. The matrix $\Theta$ of size $|T| \times |D|$ stores all topic probabilities given the resources. Similarly, $\phi_t$ denotes the multinomial distribution over words associated with topic $t$, with $\sum_{w=1}^{|W|} \phi_{wt} = 1$. All probabilities of words given the topics are stored in $\Phi$, which is of size $|W| \times |T|$. In Figure 3.1 (Left), the generative process of the LDA model is shown in plate notation. Observed variables are shown in grey. Using the independence assumptions implied by the graphical structure, the joint distribution of a resource $r$ factorizes as follows:

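\[
p(w_r, z_r, \theta_r, \Phi \mid \alpha, \beta) = p(\Phi \mid \beta)\, p(\theta_r \mid \alpha) \prod_{i=1}^{N_r} p(w_{ri} \mid \phi_{z_{ri}})\, p(z_{ri} \mid \theta_r). \tag{3.3}
\]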
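The three-step generative process described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the fixed number of words per resource, and the use of symmetric Dirichlet priors with scalar `alpha` and `beta` are assumptions made for the example.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_resources, num_topics, vocab_size, words_per_resource,
                    alpha, beta, seed=0):
    """Generate word ids and topic assignments following the LDA generative story.

    Assumption for this sketch: every resource has the same length N_r.
    """
    rng = random.Random(seed)
    # phi[t] is the word distribution of topic t, drawn from Dir(beta).
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    corpus, assignments, thetas = [], [], []
    for _ in range(num_resources):
        # Step 1: draw the resource-specific topic distribution theta_r ~ Dir(alpha).
        theta = sample_dirichlet(alpha, num_topics, rng)
        words, topics = [], []
        for _ in range(words_per_resource):
            # Step 2: choose a topic z_ri according to theta_r.
            z = rng.choices(range(num_topics), weights=theta)[0]
            # Step 3: draw the word w_ri from the word distribution of topic z.
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            topics.append(z)
            words.append(w)
        corpus.append(words)
        assignments.append(topics)
        thetas.append(theta)
    return corpus, assignments, thetas, phi
```

Calling `generate_corpus(3, 2, 5, 8, alpha=0.5, beta=0.5)` returns three resources of eight word ids each, together with the latent topic assignments and distributions that inference would later have to recover.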

Learning the Parameters of the Classical LDA from Textual Collections

In practice, only the resources and their words are observed, so the task is to recover the underlying topic structure. Exact inference in LDA is intractable [28], so approximate inference algorithms are needed. In particular, the task is to infer the word-topic assignment $z_{ri}$ for each word $i$, the resource-topic distribution $\theta_r$ for each resource $r$, and the corpus-level distribution over words $\phi_t$ for each topic $t$. The original LDA paper solves this problem by applying mean-field variational methods [28], but other solutions have been proposed as well, such as Gibbs sampling [175] or expectation propagation [68]. Empirical and theoretical comparisons between the various approximation methods used in the context of LDA are a current research issue [140, 9]. In this chapter, we follow the work of [175] and use Gibbs sampling for estimating the desired parameters. Gibbs sampling, a special form of Markov Chain Monte Carlo (MCMC) methods, is an approximate inference method for high-dimensional models. The general idea of MCMC methods is to model a high-dimensional distribution by the stationary behavior of a Markov chain, i. e. each sample of the target distribution is drawn based on the previous state, once the method has overcome the 'burn-in phase'. In Gibbs sampling, each dimension of the target distribution is sampled in turn, conditioned on all other dimensions (see [125] for more details). In LDA, the quantities we are interested in are the topic assignments $z_{ri}$ for each word $w_{ri}$ in resource $r$. All other required variables can be derived from these word-topic assignments. Conditioned on the set of words in a corpus and the given hyperparameters $\alpha$ and $\beta$, we would like to sample the topic assignment $z_{ri}$ for an individual word $w_{ri}$. The update equation from which the Gibbs sampler draws the hidden variable can be written as follows:

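\[
p(z_{ri} = t \mid w_{ri} = w, Z_{-ri}, W_{-ri}, \alpha, \beta) \propto \frac{p(W, Z \mid \alpha, \beta)}{p(W_{-ri}, Z_{-ri} \mid \alpha, \beta)} \tag{3.4}
\]
\[
\propto \frac{C^{WT}_{wt,-ri} + \beta}{\sum_{w'} C^{WT}_{w't,-ri} + |W|\beta} \cdot \frac{C^{TR}_{tr,-ri} + \alpha}{\sum_{t'} C^{TR}_{t'r,-ri} + |T|\alpha}, \tag{3.5}
\]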

where the index $-ri$ indicates that the current position $i$ in resource $r$ is excluded. Deriving Equation 3.5 from Equation 3.4 requires computing the joint distribution $p(W, Z \mid \alpha, \beta)$, which can be done by marginalizing out $\Theta$ and $\Phi$. The interested reader is referred to [96] for a detailed derivation of these quantities. Here, $z_{ri} = t$ represents the assignment of topic $t$ to the $i$th word in a resource, $w_{ri} = w$ represents the observation that the $i$th word is the word $w$, and $Z_{-ri}$ represents all topic assignments except that of the $i$th word. Furthermore, $C^{WT}_{wt,-ri}$ is the number of times word $w$ is assigned to topic $t$, not including the current instance $w_{ri}$, and $C^{TR}_{tr,-ri}$ is the number of times topic $t$ has occurred in resource $r$, not including the current instance. The Dirichlet priors play the role of pseudo-counts, assigning non-zero probabilities to topic assignments whose observed counts are zero. $\alpha$ controls the sparsity of the document-specific topic distributions: the larger $\alpha$ is chosen, the more topics will be involved in the generation of a document. In a similar manner, $\beta$ controls the sparsity of topics: the larger $\beta$ is, the more words will receive substantial probability mass in a specific topic.
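To make the update in Equation 3.5 concrete, the following sketch evaluates the normalized conditional distribution over topics for a single word. It is an illustration under stated assumptions: the function name and the list-of-lists count layout are hypothetical, and the caller is assumed to have already removed the current word instance from all counts.

```python
def topic_conditional(w, r, word_topic, topic_resource, topic_totals,
                      resource_totals, alpha, beta):
    """Return the normalized p(z_ri = t | ...) of Equation 3.5 for all topics t.

    word_topic[w][t]      -- C^{WT}: times word w is assigned to topic t
    topic_resource[t][r]  -- C^{TR}: times topic t occurs in resource r
    topic_totals[t]       -- sum over words of word_topic[.][t]
    resource_totals[r]    -- sum over topics of topic_resource[.][r]
    All counts must already exclude the current word instance (index -ri).
    """
    num_topics = len(topic_totals)
    vocab_size = len(word_topic)
    weights = []
    for t in range(num_topics):
        # First factor: how strongly word w is associated with topic t.
        word_term = (word_topic[w][t] + beta) / (topic_totals[t] + vocab_size * beta)
        # Second factor: how prevalent topic t is in resource r.
        topic_term = (topic_resource[t][r] + alpha) / (resource_totals[r] + num_topics * alpha)
        weights.append(word_term * topic_term)
    norm = sum(weights)
    return [x / norm for x in weights]
```

Because both hyperparameters are strictly positive, every topic keeps a non-zero probability even when its counts are zero, which is the pseudo-count effect described above.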

Given the topic assignments $z_{ri}$ estimated by the Gibbs sampling procedure, we can now compute the posterior distributions of the multinomial parameters $\Theta$ and $\Phi$, using the fact that the Dirichlet distribution is conjugate to the multinomial:

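\[
p(\theta_r \mid Z, \alpha) \sim \mathrm{Dir}(\theta_r;\, C^{TR}_{\cdot r} + \alpha) \tag{3.6}
\]
\[
p(\phi_t \mid W, Z, \beta) \sim \mathrm{Dir}(\phi_t;\, C^{WT}_{\cdot t} + \beta) \tag{3.7}
\]

with $C^{TR}_{\cdot r}$ being the vector of topic counts for the resource $r$ and $C^{WT}_{\cdot t}$ being the vector of word counts for topic $t$. Using the expected value of a single component of a Dirichlet distribution, we get:

\[
E[\theta_{tr}] = \frac{C^{TR}_{tr} + \alpha}{\sum_{t'=1}^{|T|} C^{TR}_{t'r} + |T|\alpha} \tag{3.8}
\]
\[
E[\phi_{wt}] = \frac{C^{WT}_{wt} + \beta}{\sum_{w'=1}^{|W|} C^{WT}_{w't} + |W|\beta}. \tag{3.9}
\]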
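The point estimates in Equations 3.8 and 3.9 amount to smoothing and normalizing the count matrices. A minimal sketch follows; the function names and the nested-list layout ($|T| \times |R|$ for the topic-resource counts, $|W| \times |T|$ for the word-topic counts) are assumptions chosen for illustration.

```python
def estimate_theta(topic_resource, alpha):
    """Equation 3.8: expected topic probabilities per resource (|T| x |R|)."""
    num_topics = len(topic_resource)
    num_resources = len(topic_resource[0])
    theta = [[0.0] * num_resources for _ in range(num_topics)]
    for r in range(num_resources):
        denom = sum(topic_resource[t][r] for t in range(num_topics)) + num_topics * alpha
        for t in range(num_topics):
            theta[t][r] = (topic_resource[t][r] + alpha) / denom
    return theta


def estimate_phi(word_topic, beta):
    """Equation 3.9: expected word probabilities per topic (|W| x |T|)."""
    vocab_size = len(word_topic)
    num_topics = len(word_topic[0])
    phi = [[0.0] * num_topics for _ in range(vocab_size)]
    for t in range(num_topics):
        denom = sum(word_topic[w][t] for w in range(vocab_size)) + vocab_size * beta
        for w in range(vocab_size):
            phi[w][t] = (word_topic[w][t] + beta) / denom
    return phi
```

For example, a resource with topic counts 3 and 1 and $\alpha = 1$ yields $E[\theta] = (4/6, 2/6)$: the prior pulls the estimate toward the uniform distribution.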

Gibbs Sampling Procedure. Given the update Equation 3.5, the Gibbs sampler proceeds as follows. For each document and for each word in a document, the Gibbs sampler starts with a random initialization of the topic assignments $z_{ri}$. The quantities of interest, i. e. the document-topic counts, the sum of document-topic counts, the word-topic counts and the sum of word-topic counts, are updated accordingly. After the initialization, the Gibbs sampler enters the burn-in phase. In this phase the sampler continues to iterate over all words in all documents, but now draws samples from $p(z_{ri} = t \mid w_{ri} = w, Z_{-ri}, W_{-ri}, \alpha, \beta)$. To do so, the counts associated with the word currently under consideration have to be decremented before the sampling step is conducted. After the new topic assignment for that word has been drawn, the respective counts are incremented again. The burn-in phase is repeated until a convergence criterion is reached. In this work we use the log-likelihood of the word observations given the topic assignments and continue until we observe a flattening of the log-likelihood. After the sampler has overcome the burn-in phase, the Markov chain is assumed to have reached the target distribution from which we want to draw samples. We define a sampling lag variable, lag, such that a new sample is recorded every lag Gibbs sampling iterations. We repeat this step a sufficient number of times (ten times in our setting) and average over the samples $S$. Spacing the samples in this way reduces the correlation between them. We can then compute the model parameters with the help of Equation 3.8 and Equation 3.9.
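The procedure above can be sketched end to end as follows. This is a simplified illustration rather than the implementation used in this work. In particular, the burn-in runs for a fixed number of sweeps instead of stopping when the log-likelihood flattens; the log-likelihood is only recorded so that it can be inspected. The function name, its parameters and their defaults are assumptions, and the log-likelihood expression is the standard form of $\log p(W \mid Z, \beta)$ with $\Phi$ marginalized out under a symmetric prior.

```python
import math
import random


def gibbs_lda(corpus, num_topics, vocab_size, alpha, beta,
              burn_in=50, lag=5, num_samples=10, seed=0):
    """Collapsed Gibbs sampling for LDA; returns averaged theta, phi and a trace."""
    rng = random.Random(seed)
    num_resources = len(corpus)
    word_topic = [[0] * num_topics for _ in range(vocab_size)]         # C^{WT}
    topic_resource = [[0] * num_resources for _ in range(num_topics)]  # C^{TR}
    topic_totals = [0] * num_topics                                    # sum_w C^{WT}
    assignments = []

    def update_counts(w, t, r, delta):
        word_topic[w][t] += delta
        topic_resource[t][r] += delta
        topic_totals[t] += delta

    # Random initialization of all topic assignments z_ri.
    for r, words in enumerate(corpus):
        assignments.append([rng.randrange(num_topics) for _ in words])
        for w, t in zip(words, assignments[r]):
            update_counts(w, t, r, +1)

    def sweep():
        for r, words in enumerate(corpus):
            for i, w in enumerate(words):
                update_counts(w, assignments[r][i], r, -1)  # exclude current word
                # Equation 3.5; the denominator of the resource term does not
                # depend on t, so it is dropped before normalization.
                weights = [
                    (word_topic[w][t] + beta) / (topic_totals[t] + vocab_size * beta)
                    * (topic_resource[t][r] + alpha)
                    for t in range(num_topics)
                ]
                t_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[r][i] = t_new
                update_counts(w, t_new, r, +1)

    def log_likelihood():
        # log p(W | Z, beta) with Phi integrated out.
        total = num_topics * (math.lgamma(vocab_size * beta)
                              - vocab_size * math.lgamma(beta))
        for t in range(num_topics):
            total += sum(math.lgamma(word_topic[w][t] + beta) for w in range(vocab_size))
            total -= math.lgamma(topic_totals[t] + vocab_size * beta)
        return total

    trace = []
    for _ in range(burn_in):  # assumption: fixed-length burn-in
        sweep()
        trace.append(log_likelihood())

    theta = [[0.0] * num_resources for _ in range(num_topics)]
    phi = [[0.0] * num_topics for _ in range(vocab_size)]
    for _ in range(num_samples):
        for _ in range(lag):  # skip lag sweeps between recorded samples
            sweep()
        for r, words in enumerate(corpus):
            denom = len(words) + num_topics * alpha
            for t in range(num_topics):  # Equation 3.8, averaged over samples
                theta[t][r] += (topic_resource[t][r] + alpha) / denom / num_samples
        for t in range(num_topics):
            denom = topic_totals[t] + vocab_size * beta
            for w in range(vocab_size):  # Equation 3.9, averaged over samples
                phi[w][t] += (word_topic[w][t] + beta) / denom / num_samples
    return theta, phi, trace
```

Plotting `trace` against the sweep index is one way to check visually whether the log-likelihood has flattened. Averaging $\Phi$ and $\Theta$ over samples presumes that topic indices keep the same meaning from one sample to the next, which is more plausible when the samples come from a single chain, as here.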

LDA in Semantically Annotated Document Collections

After having introduced the basic concepts of the LDA model, we discuss how the classical LDA model can be applied to semantically annotated document collections. In general, applying the model to such collections is not straightforward, because LDA assumes that the words of a resource or document originate from a single vocabulary. In annotated document collections, however, there are essentially two different vocabularies: the first originates from the content of the resources R, while the second originates from the annotations or labels L. Classical LDA provides no way to model the correspondences between the content of a resource and its annotations.

To apply the classical LDA model in this setting, we have two options. The first option treats the annotations as observed words and simply merges the two vocabularies. This results in topics that are mixtures of annotations and words, with no principled way to distinguish between the two. The second option is to treat the annotations as the content of the resources. Thus, a resource r is not represented as a bag-of-words but is modeled as a bag-of-concepts. This ensures that the vocabularies are not mixed. Indeed, this is done in related work on tag recommendation, see e. g. [116, 94]. However, this approach comes with several drawbacks. First, a number of important discovery tasks, such as the extraction of statistical relationships between the content of a resource and its concepts, cannot be solved. Second, for recommending concepts for resources, an initial set of concepts has to be given in advance in order to estimate the most likely concepts. If a document has no initial seed of concepts, which will often be the case in real-world applications, there is no way to estimate concepts. Our language-model-based evaluation in Section 3.3.2 confirms these drawbacks. Thus, more advanced models that are able to model semantically annotated documents in a more principled way are necessary.
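The two preprocessing options discussed above differ only in how the input to LDA is assembled. The sketch below illustrates both; the dictionary layout with `words` and `annotations` keys, the function names, and the toy resources are hypothetical and serve only as an example.

```python
def merged_vocabulary_corpus(resources):
    """Option 1: treat annotations as additional observed words."""
    return [res["words"] + res["annotations"] for res in resources]


def bag_of_concepts_corpus(resources):
    """Option 2: represent each resource by its annotations only."""
    return [list(res["annotations"]) for res in resources]


def build_index(corpus):
    """Map tokens to integer ids in order of first appearance."""
    vocab = {}
    encoded = []
    for tokens in corpus:
        encoded.append([vocab.setdefault(tok, len(vocab)) for tok in tokens])
    return encoded, vocab


# Toy resources (invented for illustration); the third one has no annotations.
resources = [
    {"words": ["protein", "binds", "receptor"], "annotations": ["Proteins", "Receptors"]},
    {"words": ["gene", "protein"], "annotations": ["Genes"]},
    {"words": ["cell"], "annotations": []},
]

merged_ids, merged_vocab = build_index(merged_vocabulary_corpus(resources))
concept_ids, concept_vocab = build_index(bag_of_concepts_corpus(resources))
```

In the merged corpus, words and annotations share one id space, so a learned topic cannot tell them apart. In the bag-of-concepts corpus, the third resource is empty, which mirrors the second drawback: without seed concepts there is nothing from which to estimate a topic distribution.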