
Topic Models and Applications in Argument Mining

2.4.1 Latent Dirichlet Allocation Topic Model

The principal idea in topic models is that documents are mixtures of topics, where a topic is a probability distribution over words (Blei et al., 2003; Hofmann, 1999, 2001; Steyvers and Griffiths, 2007; Griffiths and Steyvers, 2004; Blei, 2012). Hofmann (1999, 2001) introduced Probabilistic Latent Semantic Analysis (PLSA), which decomposes the joint probability of observing a term w and a document d using a latent variable z that represents latent topics, where w and d are independent given z, and

$$P(w, d) = P(d)\,P(w \mid d), \qquad P(w \mid d) = \sum_{z} P(w \mid z)\,P(z \mid d)$$
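To make the decomposition concrete, the following minimal sketch computes P(w|d) as a mixture over topics. The two topics, the three-word vocabulary, and all probability values are made-up assumptions for illustration; they do not come from the cited work.

```python
# Toy PLSA-style mixture: P(w|d) = sum_z P(w|z) * P(z|d).
# All names and numbers below are illustrative assumptions.

# P(w|z): one distribution over the vocabulary per topic.
p_w_given_z = {
    "sports": {"ball": 0.6, "game": 0.3, "vote": 0.1},
    "politics": {"ball": 0.1, "game": 0.2, "vote": 0.7},
}

# P(z|d): the topic mixture of a single document d.
p_z_given_d = {"sports": 0.75, "politics": 0.25}


def word_given_doc(word, p_w_given_z, p_z_given_d):
    """Marginalize the latent topic z out of P(w|z) P(z|d)."""
    return sum(p_w_given_z[z][word] * p_z_given_d[z] for z in p_z_given_d)


p_w_given_d = {
    w: word_given_doc(w, p_w_given_z, p_z_given_d)
    for w in ["ball", "game", "vote"]
}
print(p_w_given_d)
```

Because each P(w|z) and P(z|d) is itself a proper distribution, the resulting P(w|d) also sums to one over the vocabulary.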

Figure 5 illustrates the plate diagram of PLSA. Document d and word w are observed, so they are represented by shaded nodes. Plates indicate repetition: the outer plate represents documents, and the inner plate represents the repeated choices of topics and words within a document. PLSA assumes that a topic z is a distribution over a vocabulary of fixed size V, but does not explicitly specify this distribution. The model also assumes that a document d consists of multiple topics, but the distribution over that fixed number of topics is not specified either. Therefore, in PLSA both topics and documents are represented as generic multinomial distributions, i.e., lists of numbers. Because PLSA does not define a generative process for its topic distribution, the model has several problems: the number of parameters increases linearly with the size of the training corpus, and there is no way to assign probability to unseen documents (Blei et al., 2003).

[Figure 6 graphic: graphical model of smoothed LDA from Blei et al. (2003), with nodes α, θ, z, w, β, η and plates M, N, k.]

Figure 6: Latent Dirichlet Allocation

Blei et al. (2003) extended PLSA by introducing a Dirichlet prior over the topic distribution and named the resulting generative model Latent Dirichlet Allocation (LDA). The model’s graphical representation is shown in Figure 6. The generative process for each document d in a corpus D is as follows (Blei et al., 2003):

1. Choose the number of words N in the document: N ∼ Poisson(ξ).

2. Choose a topic mixture θ for the document, i.e., a multinomial distribution over a fixed set of k topics, by sampling from a Dirichlet distribution: θ ∼ Dir(α).

3. Generate each word wi in the document by:

a. Choosing a topic zi from the multinomial distribution sampled above: zi ∼ Multinomial(θ).

b. Choosing a word wi from p(wi|zi, β).
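The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation; the helper names, the toy vocabulary, and the values of α, β, and ξ are all assumptions made for the example.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(rng, alpha, beta, vocab, xi):
    """Generate one document following steps 1 to 3 above."""
    n_words = sample_poisson(rng, xi)       # step 1: N ~ Poisson(xi)
    theta = sample_dirichlet(rng, alpha)    # step 2: theta ~ Dir(alpha)
    topics, words = [], []
    for _ in range(n_words):                # step 3
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # 3a: topic
        w = rng.choices(vocab, weights=beta[z])[0]            # 3b: word
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical toy setting: 2 topics over a 4-word vocabulary.
rng = random.Random(0)
vocab = ["claim", "evidence", "bear", "labrador"]
alpha = [0.5, 0.5]
beta = [[0.45, 0.45, 0.05, 0.05],   # row i is p(word | topic i)
        [0.05, 0.05, 0.45, 0.45]]
theta, topics, words = generate_document(rng, alpha, beta, vocab, xi=8)
print(theta, topics, words)
```

Here β is held fixed for simplicity; in the smoothed model each row of β would itself be drawn from a Dirichlet distribution.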

In this setting, the dimensionality k of the Dirichlet distribution (i.e., the dimensionality of the topic variable z) is given and fixed. Alpha is a k-dimensional parameter vector with components αi > 0. Beta is a k × V matrix of word probabilities given topics, where βij = p(wj = 1|zi = 1) and V is the vocabulary size. Each row of β is drawn independently from a Dirichlet distribution with a symmetric parameter vector, i.e., a vector whose components are all equal to η.
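For reference, the generative process corresponds to the following joint distribution over the topic mixture θ, the topic assignments z, and the words w of a single document, as given by Blei et al. (2003) for the basic model with β treated as a fixed parameter:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$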

Along with the number of topics k, two hyper-parameters need to be set before running LDA: α, the document-topic prior parameter, and β, the word-topic prior parameter (the Dirichlet parameter written η above). A simple implementation of LDA uses symmetric Dirichlet priors, in which all components of the parameter vector are equal. However, it has been shown that an asymmetric α performs better than a symmetric one, while an asymmetric β offers little benefit over a symmetric prior (Wallach et al., 2009). As a general intuition about the magnitudes of α and β, higher α values mean that documents have more similar topic contents, and higher β values result in topics with more similar word contents.
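The intuition about the magnitude of α can be checked with a small simulation. The sketch below draws topic mixtures from symmetric Dirichlet priors and reports the average weight of the dominant topic; the concentration values 0.1 and 10, the number of topics, and the helper names are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def mean_largest_share(rng, concentration, k=5, n_docs=500):
    """Average weight of the dominant topic across sampled mixtures."""
    total = 0.0
    for _ in range(n_docs):
        theta = sample_dirichlet(rng, [concentration] * k)
        total += max(theta)
    return total / n_docs


rng = random.Random(0)
sparse = mean_largest_share(rng, 0.1)    # low alpha: one topic dominates
smooth = mean_largest_share(rng, 10.0)   # high alpha: mixtures near uniform
print(f"alpha=0.1 -> {sparse:.2f}, alpha=10 -> {smooth:.2f}")
```

With a high concentration every document's mixture is pulled toward the uniform distribution, so documents resemble one another in topic content; with a low concentration each document concentrates on a few topics.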

Given a set of documents, different algorithms have been proposed to learn the document-topic and word-topic probabilities, including variational expectation maximization (Blei et al., 2003) and collapsed Gibbs sampling (Griffiths and Steyvers, 2004). Extensions to LDA have also been proposed, including hierarchical LDA (Teh et al., 2005) and supervised LDA (Mcauliffe and Blei, 2008).

2.4.2 LDA Topic Models in Argument Mining

LDA topic models have been recognized as a useful tool for analyzing large collections of free-text documents. Applications of LDA to natural language processing can be found in a wide variety of areas, such as entity analysis (Newman et al., 2006), multi-document summarization (Haghighi and Vanderwende, 2009), and word-sense disambiguation (Boyd-Graber et al., 2007). In opinion mining and sentiment analysis, LDA topic models have been used successfully to separate topic and opinion words (Mei et al., 2007; Lin and He, 2009; Zhao et al., 2010; Jo and Oh, 2011). However, LDA has received only limited study in argument mining.

Madnani et al. (2012) were the first to propose separating shell language, e.g., “The argument states that”, from the language that specifies claims and evidence, e.g., “based on the result of the recent research, there probably were grizzly bears in Labrador.” Du et al. (2014) built on the idea of HMM-LDA (Griffiths et al., 2005) and developed an unsupervised topic model, called the Shell Topic Model, to separate shell phrases from topical content. Their model rests on two assumptions. The first is that each word in the document is associated with a status variable that indicates whether the word has a shell, topic, or function status. Each status generates words using a multinomial distribution, which in turn is sampled from a Dirichlet prior. The second is that there are transition probabilities between statuses, which follow a multinomial distribution.
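The two assumptions can be pictured with a simplified generative sketch: a status chain with transition probabilities, where each status emits a word from its own distribution. This is not the Shell Topic Model of Du et al. (2014) itself. The transition and emission values are invented for illustration and fixed by hand rather than sampled from Dirichlet priors, and the topic status is collapsed to a single word distribution.

```python
import random

STATUSES = ["shell", "topic", "function"]

# Hypothetical transition probabilities P(next status | current status).
transitions = {
    "shell": {"shell": 0.6, "topic": 0.2, "function": 0.2},
    "topic": {"shell": 0.1, "topic": 0.6, "function": 0.3},
    "function": {"shell": 0.3, "topic": 0.5, "function": 0.2},
}

# Hypothetical per-status word distributions (fixed here; the model
# described above would draw them from Dirichlet priors).
emissions = {
    "shell": {"argument": 0.4, "states": 0.3, "claims": 0.3},
    "topic": {"bears": 0.5, "labrador": 0.3, "research": 0.2},
    "function": {"the": 0.5, "that": 0.3, "of": 0.2},
}


def draw(rng, distribution):
    """Sample one key of a {outcome: probability} dictionary."""
    items = list(distribution)
    return rng.choices(items, weights=[distribution[x] for x in items])[0]


def generate_sequence(rng, length, start="function"):
    """Walk the status chain and emit one word per step."""
    status, sequence = start, []
    for _ in range(length):
        status = draw(rng, transitions[status])
        sequence.append((status, draw(rng, emissions[status])))
    return sequence


rng = random.Random(0)
for status, word in generate_sequence(rng, 8):
    print(f"{word:10s} <- {status}")
```

Inference in such a model runs in the opposite direction: given only the words, it recovers the most plausible status for each one.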

In document zoning, the problem is to recognize the information structure of documents in order to support information extraction and to organize factual information from the documents (Teufel and Moens, 2002). Varga et al. (2012) adapted the LDA topic model to document zone classification (e.g., introduction, method, results) under the assumptions that a document is a mixture of zones and a zone is a probability distribution over words. The authors also proposed a special zone, the background zone, which contains words common to different zone types, e.g., “use” and “determine”. Thus, the generative process involves a decision on whether a word is sampled from the background zone or from one of the regular zones.
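The background-versus-zone decision can be illustrated with a per-word switch. The sketch below is a simplified assumption about how such a switch could work, not the model of Varga et al. (2012): the switch probability, the zone mixture, and the word distributions are all invented for the example.

```python
import random

# Hypothetical word distributions; all values are illustrative assumptions.
background = {"use": 0.5, "determine": 0.5}
zones = {
    "method": {"protocol": 0.6, "sample": 0.4},
    "results": {"significant": 0.7, "increase": 0.3},
}
zone_mixture = {"method": 0.5, "results": 0.5}  # document-level zone mixture
p_background = 0.3                              # assumed switch probability


def draw(rng, distribution):
    """Sample one key of a {outcome: probability} dictionary."""
    items = list(distribution)
    return rng.choices(items, weights=[distribution[x] for x in items])[0]


def generate_word(rng):
    """Return (source, word), where source is 'background' or a zone name."""
    if rng.random() < p_background:
        # Switch on: the word comes from the shared background zone.
        return "background", draw(rng, background)
    # Switch off: pick a regular zone, then a word from that zone.
    zone = draw(rng, zone_mixture)
    return zone, draw(rng, zones[zone])


rng = random.Random(0)
print([generate_word(rng) for _ in range(6)])
```

Routing common words through the background zone keeps them from diluting the word distributions of the regular zones.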

While also adapting the LDA topic model to document zoning, Séaghdha and Teufel (2014) relied on the intuition that the rhetorical language used in a document is independent of its topic. Their proposed model assumes that each word is generated either from an LDA-style topic model (capturing the subject matter of the document) or from a distribution associated with the rhetorical category, i.e., zone type, of the sentence (capturing conventional language). The resulting model combines a hidden Markov process and a “switching variable” mechanism with the original LDA. Their experiments showed that features derived from the output of the topic model, e.g., zone index, yielded a significant improvement to a feature-based model.

In this thesis, we hypothesize that argumentative text can be separated into argument words and domain words, and that the extracted vocabularies of argument and domain words can be used to improve argument mining models. However, rather than modifying LDA, we use the original LDA topic model to parse the texts and then post-process its output to extract argument and domain words.