
3.2 Topic Models for Semantically Annotated Documents

3.2.3 The Topic-Concept Model

In the last section we saw that applying classical LDA to semantically annotated document collections comes with difficulties: either we use a mixed vocabulary consisting of both the words of the resource and its semantic annotations, or we use only the semantic annotations as input and thus ignore the content of the resources themselves. In this section, we present a new topic model, the Topic-Concept model (TC model) [35, 138], which models the content of the resources and their semantic annotations in a principled way. The TC model extends the basic LDA framework by including, in addition to the generation of words, the generation of concepts for the resource. The generation of a resource is first modeled as in the classical LDA model, and then the process of indexing the resource with one or several concepts is modeled. The generative process captures the notion that a human indexer or annotator first collects the topics of the resource and afterwards assigns concepts to the resource based on the identified topics. The assigned concepts are thus conditional on the topics that are present in the resource. Note that the concepts can emerge from a single topic of the resource, but also from several topics. Algorithm 5 summarizes this generative process.

In addition to the three steps needed in LDA for generating a resource, two further steps are introduced to model the process of indexing the resource with semantic annotations. After the generation of the words has been modeled, an index $i \sim \mathrm{Uniform}(1, \ldots, N_r)$ is sampled uniformly, and the topic assignment $z_{ri}$ of word $i$ is chosen as the topic assignment $\tilde{z}_{rj}$ for the concept $j$. Thus, the topic assignment $\tilde{z}_{rj} = z_{ri}$ for the concept $j$ in resource $r$ is based on $\mathbf{z}_r$, the topic assignments of the words in resource $r$. Finally, each concept label $l_{rj}$ in $r$ is sampled from a multinomial distribution $\Gamma_{\tilde{z}_{rj}}$ over concepts specific to the sampled topic.
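The generative steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation behind the model: the function names, the dictionary layout, the symmetric hyperparameter defaults, and the toy sizes are all assumptions made for the example.

```python
import random

def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(n_topics, n_words, n_concepts, doc_lengths, label_counts,
                    alpha=0.5, beta=0.5, gamma=0.5, seed=0):
    """Hypothetical helper: doc_lengths[r] = N_r, label_counts[r] = M_r."""
    rng = random.Random(seed)
    # Per-topic distributions over words (phi_t) and over concepts (Gamma_t).
    phi = [sample_dirichlet(beta, n_words, rng) for _ in range(n_topics)]
    gam = [sample_dirichlet(gamma, n_concepts, rng) for _ in range(n_topics)]
    corpus = []
    for n_r, m_r in zip(doc_lengths, label_counts):
        theta = sample_dirichlet(alpha, n_topics, rng)
        # Word part: identical to classical LDA.
        z = [sample_categorical(theta, rng) for _ in range(n_r)]
        words = [sample_categorical(phi[t], rng) for t in z]
        # Concept part: copy the topic of a uniformly chosen word,
        # then draw the label from that topic's concept distribution.
        z_tilde = [z[rng.randrange(n_r)] for _ in range(m_r)]
        labels = [sample_categorical(gam[t], rng) for t in z_tilde]
        corpus.append({"words": words, "z": z,
                       "labels": labels, "z_tilde": z_tilde})
    return corpus
```

Because each concept topic is copied from a word of the same resource, a concept can never be assigned a topic that no word of that resource uses.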

In addition to the matrices $\Theta$ and $\Phi$ introduced in the classical LDA model, the matrix $\Gamma$ of size $|L| \times |T|$ stores the probabilities of concepts given the topics, where $\Gamma_t$ is the multinomial distribution over concepts given a topic $t$, with $\sum_{l=1}^{|L|} \Gamma_{lt} = 1$.

Algorithm 5: Generative process for the Topic-Concept model.

    foreach topic $t = 1 \ldots |T|$ do
        sample $\phi_t \sim \mathrm{Dirichlet}(\beta)$
        sample $\Gamma_t \sim \mathrm{Dirichlet}(\gamma)$
    end
    foreach resource $r = 1 \ldots |R|$ do
        sample $\theta_r \sim \mathrm{Dirichlet}(\alpha)$
        foreach word $w_{ri}$, $i = 1 \ldots N_r$ in resource $r$ do
            sample a topic $z_{ri} \sim \mathrm{Mult}(\theta_r)$
            sample a word $w_{ri} \sim \mathrm{Mult}(\phi_{z_{ri}})$
        end
        foreach label $l_{rj}$, $j = 1 \ldots M_r$ in resource $r$ do
            sample an index $i \sim \mathrm{Uniform}(1, \ldots, N_r)$
            set $\tilde{z}_{rj} \leftarrow z_{ri}$
            sample a label $l_{rj} \sim \mathrm{Mult}(\Gamma_{\tilde{z}_{rj}})$
        end
    end

For the $M_r$ concepts of a resource $r$, the probability of generating a concept $l_{rj}$ is specified as:

$$
p(l_{rj}) = \sum_{t=1}^{T} p(l_{rj} \mid \tilde{z}_{rj} = t)\, p(\tilde{z}_{rj} = t \mid \mathbf{z}_r), \tag{3.10}
$$

where $\tilde{z}_{rj} = t$ denotes the assignment of topic $t$ to the $j$th concept, and $p(l_{rj} \mid \tilde{z}_{rj} = t)$ is given by the concept-topic distribution $\Gamma_{\tilde{z}_{rj}}$. It is important to note that selecting the topic assignment for a concept uniformly from the topic assignments $\mathbf{z}_r$ of the words in the resource, i.e. $p(\tilde{z}_{rj} \mid \mathbf{z}_r) = \mathrm{Unif}(z_1, z_2, \ldots, z_{N_r})$, leads to a coupling between both generative components.

In this way an analogy between the word-topic representation and the concept-topic representation is created. The basic idea of coupling $\Theta$ and $\Gamma$ has previously been applied successfully to modeling images and their captions [27]. The generative process of the Topic-Concept model is thus similar to the Correspondence LDA model proposed in [27], with the difference that the Topic-Concept model imitates the generation of documents and their subsequent annotation, while [27] models the dependency between image regions and captions. As a consequence, [27] uses a multivariate Gaussian distribution to model image regions, while the TC model uses a multinomial distribution to sample words given a topic. The right-hand side of Figure 3.1 depicts the generative process in plate notation. Observed variables are shown in gray. Using the independence assumptions implied by the graphical structure, the joint distribution of a semantically annotated


Figure 3.1: Plate notation of standard LDA and the TC model. Graphical models in plate notation with observed (gray circles) and latent variables (white circles). Left: standard LDA. Right: Topic-Concept (TC) model.

resource $r$ factorizes as follows:

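$$
p(\mathbf{w}_r, \mathbf{z}_r, \mathbf{l}_r, \tilde{\mathbf{z}}_r, \theta_r, \Phi, \Gamma) = p(\Phi \mid \beta) \left[ \prod_{i=1}^{N_r} p(w_{ri} \mid \phi_{z_{ri}})\, p(z_{ri} \mid \theta_r) \right] p(\theta_r \mid \alpha) \cdot p(\Gamma \mid \gamma) \prod_{j=1}^{M_r} p(l_{rj} \mid \Gamma_{\tilde{z}_{rj}})\, p(\tilde{z}_{rj} \mid \mathbf{z}_r). \tag{3.11}
$$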

With the TC model, we have introduced a probabilistic topic model that models semantically annotated documents in a principled way. The sampling of the topic representation for the concepts is coupled to the word-topic representation, which leads to corresponding topic representations in both spaces.
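To make the coupling concrete, the marginal concept probability of Equation 3.10 can be evaluated directly: since the concept topic is copied from a uniformly chosen word, $p(\tilde{z}_{rj} = t \mid \mathbf{z}_r)$ is simply the fraction of words in $r$ assigned to $t$. The sketch below assumes a hypothetical function name and toy values; `gamma_matrix[l][t]` follows the $|L| \times |T|$ layout of $\Gamma$.

```python
from collections import Counter

def concept_probability(label, z_r, gamma_matrix):
    """Evaluate p(l_rj) as in Equation 3.10 for one concept id.

    z_r          : topic assignments of the words in resource r
    gamma_matrix : gamma_matrix[l][t] = p(concept l | topic t)
    """
    n_r = len(z_r)
    topic_counts = Counter(z_r)  # words per topic in this resource
    # Topics that no word uses contribute nothing to the sum.
    return sum(gamma_matrix[label][t] * count / n_r
               for t, count in topic_counts.items())

# Toy values (made up for illustration): 3 concepts, 2 topics.
gamma_matrix = [[0.7, 0.1],
                [0.2, 0.1],
                [0.1, 0.8]]
z_r = [0, 0, 0, 1]  # three words from topic 0, one from topic 1
p = concept_probability(2, z_r, gamma_matrix)
```

For these toy values, concept 2 receives $0.1 \cdot 3/4 + 0.8 \cdot 1/4 = 0.275$: although topic 1 strongly favors the concept, only a quarter of the words in the resource support that topic.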

Learning the Parameters of the Topic-Concept Model from Text Collections

In the TC model, the quantities of interest are the topic assignments $z_{ri}$ for a word $w_{ri}$ as well as the topic assignments $\tilde{z}_{rj}$ for a concept $l_{rj}$ in resource $r$. The first part of the generative process is similar to the classical LDA model (see Algorithm 5); therefore, the first update equation is similar to Equation 3.5. In the second part, we want to sample topic assignments $\tilde{z}_{rj}$ for a concept $l_{rj}$ conditioned on the set of topic assignments for the labels in the whole corpus, the set of topic assignments made for the words, and the hyperparameter $\gamma$. The update equation from which the Gibbs sampler draws the hidden variable is

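$$
\begin{aligned}
p(\tilde{z}_{rj} = t \mid l_{rj} = l, \tilde{Z}_{-rj}, L_{-rj}, Z, \gamma)
&\propto \frac{p(L, \tilde{Z} \mid Z, \gamma)}{p(L_{-rj}, \tilde{Z}_{-rj} \mid Z, \gamma)} \\
&\propto \frac{p(L \mid \tilde{Z}, \gamma)}{p(L_{-rj} \mid \tilde{Z}_{-rj}, \gamma)} \cdot \frac{p(\tilde{Z} \mid Z)}{p(\tilde{Z}_{-rj} \mid Z)} \\
&\propto \frac{C^{LT}_{lt,-rj} + \gamma}{\sum_{l'} C^{LT}_{l't,-rj} + |L|\gamma} \cdot \frac{C^{TR}_{tr}}{N_r}.
\end{aligned}
\tag{3.12}
$$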

This is obtained by applying the chain rule and the independence assumptions implied by the model. The second factor in Equation 3.12 originates from drawing the topic assignment for concept $j$ from $\mathrm{Uniform}(1, \ldots, N_r)$: the probability of picking topic $t$ equals $C^{TR}_{tr}/N_r$, the fraction of words in resource $r$ assigned to $t$. Given the sampled topic assignments $\tilde{z}_{rj}$ for a concept $l_{rj}$, we can now compute the posterior for $\Gamma$ by again using the fact that the Dirichlet distribution is conjugate to the multinomial:

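$$
p(\Gamma_t \mid \tilde{Z}, \gamma) \sim \mathrm{Dir}(\Gamma_t;\, C^{LT}_{\cdot t} + \gamma), \tag{3.13}
$$

where $C^{LT}_{\cdot t}$ is the count vector of labels for the given topic $t$. The expected value of the Dirichlet distribution yields:

$$
E[\Gamma_{lt}] = \frac{C^{LT}_{lt} + \gamma}{\sum_{l'=1}^{|L|} C^{LT}_{l't} + |L|\gamma}. \tag{3.14}
$$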

Together with Equations 3.8 and 3.9, we obtain all parameters needed in the TC model.
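As a small illustration of Equation 3.14, the point estimate of $\Gamma$ is obtained by smoothing and normalizing each topic column of the label-topic count matrix. The function name and the nested-list layout `label_topic_counts[l][t]` are assumptions made for this sketch.

```python
def estimate_gamma(label_topic_counts, gamma):
    """Posterior mean of Gamma (Equation 3.14).

    label_topic_counts[l][t] : number of times concept l is assigned to topic t
    gamma                    : symmetric Dirichlet hyperparameter (> 0)
    Returns a matrix in the same |L| x |T| layout; each column sums to 1.
    """
    n_labels = len(label_topic_counts)
    n_topics = len(label_topic_counts[0])
    estimate = [[0.0] * n_topics for _ in range(n_labels)]
    for t in range(n_topics):
        column_total = sum(label_topic_counts[l][t] for l in range(n_labels))
        denominator = column_total + n_labels * gamma
        for l in range(n_labels):
            estimate[l][t] = (label_topic_counts[l][t] + gamma) / denominator
    return estimate
```

With counts `[[3, 0], [1, 2]]` and `gamma = 0.5`, the first topic column becomes $(3.5/5,\ 1.5/5) = (0.7,\ 0.3)$. The smoothing term also keeps concepts that were never assigned to a topic at a non-zero probability.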

Gibbs Sampling Procedure

As in the Gibbs sampling procedure for LDA, there are three stages: the initialization, the burn-in phase, and the sampling phase. A single Gibbs sampling iteration first samples all topic assignments for the words according to Equation 3.5 and then all topic assignments for the concepts according to Equation 3.12. Note that the counts associated with the word currently under consideration are decremented before its new topic assignment is sampled. Likewise, the counts associated with a concept are decremented before its new topic assignment is sampled.
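The concept part of one Gibbs iteration might look as follows for a single resource. This is a sketch under stated assumptions rather than the sampler used for the experiments: the function name, argument names, and in-place update of nested lists are hypothetical, the word topics are treated as fixed during the sweep, and the word update of Equation 3.5 is omitted.

```python
import random

def resample_concept_topics(labels, z_tilde, z_words, c_lt, gamma, rng):
    """One sweep over the concepts of a single resource r (Equation 3.12).

    labels[j]  : concept id l_rj
    z_tilde[j] : current topic of concept j (updated in place)
    z_words    : topic assignments of the words in r (held fixed here)
    c_lt[l][t] : corpus-wide label-topic counts (updated in place)
    """
    n_labels, n_topics = len(c_lt), len(c_lt[0])
    n_r = len(z_words)
    # Number of words in r assigned to each topic.
    c_tr = [z_words.count(t) for t in range(n_topics)]
    for j, l in enumerate(labels):
        # Decrement the counts for the current assignment.
        c_lt[l][z_tilde[j]] -= 1
        weights = []
        for t in range(n_topics):
            column_total = sum(c_lt[k][t] for k in range(n_labels))
            label_term = (c_lt[l][t] + gamma) / (column_total + n_labels * gamma)
            weights.append(label_term * c_tr[t] / n_r)
        # rng.choices normalizes the unnormalized weights internally.
        new_t = rng.choices(range(n_topics), weights=weights, k=1)[0]
        z_tilde[j] = new_t
        # Increment the counts for the new assignment.
        c_lt[l][new_t] += 1
```

Topics that no word of the resource uses receive weight zero through the factor `c_tr[t] / n_r`, so the sweep preserves the coupling between word topics and concept topics.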