Structured Models - About the exploration of data mining techniques using structured features f

2.4 Models

2.4.2 Structured Models

The models presented in the previous sections all make their decisions for each example independently. That is, the decision is conditioned only on the particular example. For some learning tasks, however, the examples may depend on each other, and so may their labels. Structured models like Conditional Random Fields [Lafferty et al., 2001] and Structured Support Vector Machines [Tsochantaridis et al., 2004] respect these dependencies: they predict the label of an example not only from its feature assignments but also from the predictions already made, or still to be made, for other examples. A recent approach by [Fernandes and Brefeld, 2011] uses partially annotated sequences to train a transductive perceptron. We present Conditional Random Fields in detail in the following paragraph because we use these models in several experiments in this work.

Conditional Random Fields

Conditional Random Fields (CRF) [Lafferty et al., 2001] are a special form of Markov Random Fields (MRF) [Kindermann and Snell, 1980].

Definition 13. An MRF is a set of random variables $Z$ with respect to an undirected acyclic graph $G = (V, E)$ containing vertices $v \in V$ and edges $e \in E$, where $Z$ is indexed by $V$ ($Z_v \in Z \; \forall v \in V$). $Z$ has to satisfy the Markovian property.

The variables in $Z$ are connected by an undirected graph. It follows that the conditional probability of a variable $Z_v$ cannot be calculated from a single predecessor, because the edges have no direction. In addition, not just one but several variables $Z_{v'}$ can be connected to $Z_v$. The Markovian property (see equation (2.20)) states that the probability of a variable $Z_v$ is conditioned only on its neighbors:

$$p(Z_v \mid Z_{V \setminus \{v\}}) = p(Z_v \mid Z_{ne(v)}) \tag{2.20}$$

where ne(·) is the set of neighbors of a certain node.

A clique (see Definition 14) is a subset of nodes where each node is connected to every other node of the subset.

Definition 14. A clique is a set $Y \subseteq V$ where $(y, y') \in E \; \forall y, y' \in Y$ with $y \neq y'$.
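Definition 14 can be checked mechanically. The following minimal Python sketch tests whether a set of nodes is a clique; the function name `is_clique` and the three-node chain are illustrative assumptions and do not come from this work.

```python
from itertools import combinations


def is_clique(nodes, edges):
    """Return True if every pair of distinct nodes is joined by an edge."""
    # Store edges as unordered pairs because the graph is undirected.
    undirected = {frozenset(edge) for edge in edges}
    return all(frozenset(pair) in undirected for pair in combinations(nodes, 2))


# Hypothetical chain graph y1 - y2 - y3.
chain_edges = [("y1", "y2"), ("y2", "y3")]

print(is_clique({"y1", "y2"}, chain_edges))        # two neighboring nodes
print(is_clique({"y1", "y2", "y3"}, chain_edges))  # y1 and y3 are not connected
```

In a chain, only single nodes and pairs of neighboring nodes pass this test.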

In the case of machine learning and Data Mining the random variables correspond to the labels of particular examples. It follows that the prediction process has to work on the complete set of variables (and hence labels) instead of predicting one label at a time independently. Let $z$ be a certain assignment of the set of vertices $V$. Potential functions are used to calculate the probability of an assignment $z$.

Definition 15. A potential function $\phi_{c_i}(\cdot) : A_{c_i} \to \mathbb{R}^{+}$ converts the assignments of a clique $c_i$ into real-valued positive numbers. The set of all potential functions is $\phi$.

The probability of an assignment $z$ can now be calculated by clique factorization as in equation (2.21).

$$p(z) = \frac{1}{Z} \prod_{\phi_c \in \phi} \phi_c(z_{c_1}, \dots, z_{c_{|c|}}) \tag{2.21}$$

$$Z = \sum_{z'} \prod_{\phi_c \in \phi} \phi_c(z'_{c_1}, \dots, z'_{c_{|c|}}) \tag{2.22}$$

where $z_{c_1}, \dots, z_{c_{|c|}}$ is the particular clique assignment of $z$ for the potential function $\phi_c$. $Z$ is a normalization factor, which is needed because the result of a potential function (see Definition 15) is a positive real number and the product has to be normalized to a probability. $Z$ is obtained by evaluating the product of all potential functions for every possible assignment $z'$ and summing the results.
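To make the factorization concrete, the following Python sketch computes $p(z)$ by brute force for a chain of three binary variables. The label set, the potential function `phi_pair`, and its values are assumptions chosen for illustration only. The normalization factor is computed by summing the unnormalized scores of all assignments, which is feasible only for very small graphs.

```python
from itertools import product

LABELS = ("A", "B")          # hypothetical label alphabet
CLIQUES = [(0, 1), (1, 2)]   # cliques of the chain z0 - z1 - z2


def phi_pair(a, b):
    """Hypothetical positive potential that favors equal neighboring labels."""
    return 2.0 if a == b else 1.0


def unnormalized(z):
    """Product of all clique potentials for the assignment z."""
    score = 1.0
    for i, j in CLIQUES:
        score *= phi_pair(z[i], z[j])
    return score


def partition():
    """Normalization factor: sum of the unnormalized scores of all assignments."""
    return sum(unnormalized(z) for z in product(LABELS, repeat=3))


def probability(z):
    return unnormalized(z) / partition()


for z in product(LABELS, repeat=3):
    print(z, probability(z))
```

Assignments in which neighboring variables agree receive a higher probability, and the probabilities of all eight assignments sum to one.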

The crucial part is the creation or selection of the potential functions to be used. The potential functions should model the observations made by visiting the training set. We will focus on this creation later on.

Definition 16. A Conditional Random Field is an MRF that is conditioned on observations $x$.

Conditional Random Fields (CRF) focus especially on the construction of the potential functions using observations (examples) $x$.

Describing the handling of arbitrary graphs would go beyond the scope of this work. We focus on the very simple graphs used for NER (see Section 3.2). In NER the examples and their labels are ordered sequentially. Additionally, sentences form building blocks. Each of these blocks can be seen as a CRF (MRF), where the labels $y_1, \dots, y_n$ are the random variables and the examples $x_1, \dots, x_n$ are used to build the potential functions. Potential functions are defined on cliques. In a sequence, every two neighboring nodes form a clique, and each pair of neighboring nodes is an edge in the set $E$. The potential functions on these edges are denoted by $f(\cdot)$, as shown in equation (2.23). In addition, [Lafferty et al., 2001] present potential functions not only for edges but also for every single node. These functions are denoted by $g(\cdot)$, and they are shown in equation (2.24).

$$f(y, y' \mid x) \quad \exists (y, y') \in E \tag{2.23}$$

$$g(y \mid x) \quad \exists y \in V \tag{2.24}$$

For simplicity, we assume that both kinds of potential functions are contained in the set of potential functions $\phi$. When the random variables form a sequence, the probability in equation (2.21) can be rewritten as in equation (2.25).

$$p(z) = \frac{1}{Z} \prod_{\phi_c \in \phi} \sum_{i=1}^{n} \phi_c(y_i, y_{i+1}) \tag{2.25}$$

Following the Hammersley-Clifford theorem [Hammersley and Clifford, 1971], equation (2.25) can be transformed into equation (2.26) if the graph structuring the variables is a tree.

$$p(z) = \frac{1}{Z} \exp\left[ \sum_{\phi_c \in \phi} \sum_{i=1}^{n} \phi_c(y_i, y_{i+1}) \right] \tag{2.26}$$

Afterwards we separate the potential functions into the state features $g(v \in V)$ and the transition features $f(e \in E)$, which yields the following equation:

$$p(z) = \frac{1}{Z} \exp\left[ \sum_{f_c \in F} \sum_{i=1}^{n} f_c(y_i, y_{i+1}) + \sum_{g_c \in G} \sum_{i=1}^{n} g_c(y_i) \right] \tag{2.27}$$

The potential functions $f(\cdot)$ and $g(\cdot)$ are extracted from the training set. $F$ represents the set of potential functions $f(\cdot)$ and $G$ the set of potential functions $g(\cdot)$. Transition features $f(\cdot)$ are defined in Definition 17 and state features $g(\cdot)$ in Definition 18. The potential functions can be created from the attributes of the training set. Each potential function gets a weight that is adjusted during training. Together, the weights of all potential functions form the CRF model $\theta$. The probability of a particular assignment $z$ under $\theta$ is calculated by:

$$p_\theta(z) = \frac{1}{Z} \exp\left[ \sum_{f_c \in F} \sum_{i=1}^{n} \lambda_c f_c(y_i, y_{i+1}) + \sum_{g_c \in G} \sum_{i=1}^{n} \mu_c g_c(y_i) \right] \tag{2.28}$$
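As an illustration of equation (2.28), the following Python sketch evaluates the weighted feature sum and the resulting probability for a two-word sentence. It is a minimal sketch under stated assumptions: the label set, the two binary features (written in the spirit of Definitions 17 and 18), and the weights are hypothetical, and the normalization factor is obtained by enumerating all label sequences. Transition features are summed only over positions that have a successor.

```python
import math
from itertools import product

LABELS = ("LOC", "O")  # hypothetical label set


def g_loc_germany(y, x, i):
    """Hypothetical state feature: label LOC on the word 'Germany'."""
    return 1.0 if y == "LOC" and x[i] == "Germany" else 0.0


def f_o_loc_after_in(y, y_next, x, i):
    """Hypothetical transition feature: O followed by LOC when word i is 'in'."""
    return 1.0 if y == "O" and y_next == "LOC" and x[i] == "in" else 0.0


STATE_FEATURES = [(1.5, g_loc_germany)]          # pairs (mu_c, g_c), weights assumed
TRANSITION_FEATURES = [(0.5, f_o_loc_after_in)]  # pairs (lambda_c, f_c), weights assumed


def score(y, x):
    """Weighted feature sum inside exp[...] of equation (2.28)."""
    total = 0.0
    for i in range(len(x)):
        total += sum(mu * g(y[i], x, i) for mu, g in STATE_FEATURES)
    for i in range(len(x) - 1):  # only positions that have a successor
        total += sum(lam * f(y[i], y[i + 1], x, i) for lam, f in TRANSITION_FEATURES)
    return total


def log_partition(x):
    """Logarithm of the normalization factor, by enumerating all label sequences."""
    return math.log(sum(math.exp(score(y, x)) for y in product(LABELS, repeat=len(x))))


def probability(y, x):
    return math.exp(score(y, x) - log_partition(x))


x = ("in", "Germany")
for y in product(LABELS, repeat=len(x)):
    print(y, round(probability(y, x), 3))
```

With these assumed weights, the label sequence (O, LOC) activates both features and therefore receives the highest probability.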


Definition 17. The transition features are defined by $f(e \in E, x, i) = b(x, i)$ if $y \in e$ and $y' \in e$ fulfill particular conditions. These conditions could be, for example, that $y$ = class A and $y'$ = class B. $b(x, i) = 1$ if an attribute of the $i$-th example $x_i$ has a certain value, for instance, if the $i$-th word of a sentence is Germany. Otherwise, $b(x, i) = 0$.

Definition 18. The state features are defined by $g(v \in V, x, i) = b(x, i)$ if $y = v$ fulfills a particular condition. This condition could be, for example, that $y$ = class A. $b(x, i) = 1$ if an attribute of the $i$-th example $x_i$ has a certain value, for instance, if the $i$-th word of a sentence is Berlin. Otherwise, $b(x, i) = 0$.

During the training phase of a CRF the parameters $\lambda_i$ and $\mu_i$ are adjusted so that the potential functions fit the training set as well as possible. This fitting is done by maximizing the log-likelihood function [Fisher, 1997]. The logarithms of the conditional probabilities of all subsets (for instance, sentences) of the training set are summed up to build the log-likelihood (see equation (2.29)).

$$L(\theta) = \sum_{S \subseteq T} \log p(y_S \mid x_S) \tag{2.29}$$

The values of $\theta$ ($\lambda_i$ and $\mu_i$) have to be changed until the log-likelihood function reaches its maximum. Several techniques can be used for this. [Lafferty et al., 2001] present two approaches based on iterative scaling [Della Pietra et al., 1997]. [Wallach, 2002] has shown that numerical optimization techniques find the optimal setting for $\theta$ faster than iterative scaling. An often used numerical optimization technique is L-BFGS [Nocedal, 1980]. L-BFGS is a quasi-Newton optimization technique whose optimization step is based on a second-order Taylor approximation of the function to be optimized. Such a step needs the first and the second derivative. Because calculating the second derivative, the Hessian, is computationally expensive, L-BFGS approximates it from a limited history of recent gradients and parameter updates. [Vishwanathan et al., 2006] have shown that stochastic gradient methods in general, and stochastic meta-descent (SMD) in particular, are more efficient for CRF training than L-BFGS. State-of-the-art implementations of CRFs use general-purpose computation on graphics processing units for a parallel and therefore very efficient computation [Piatkowski, 2011, Piatkowski and Morik, 2011]. We used such an implementation for the experiments presented in Section 3.3.
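The training principle can be sketched on a toy problem. The following Python sketch maximizes the log-likelihood of equation (2.29) with plain batch gradient ascent, which we use here only as a simple stand-in for L-BFGS or SMD. The two training sentences, the two binary features, the step size, and the number of steps are assumptions made for illustration. The gradient of the log-likelihood is the difference between observed and expected feature counts, and the expectation is computed by enumerating all label sequences, which is feasible only for very short sentences.

```python
import math
from itertools import product

LABELS = ("LOC", "O")  # hypothetical label set


def features(y, x):
    """Counts of two hypothetical binary features for a labeled sentence."""
    # State feature: label LOC on a capitalized word.
    state = sum(1.0 for i in range(len(x)) if y[i] == "LOC" and x[i][0].isupper())
    # Transition feature: label O followed by label LOC.
    trans = sum(1.0 for i in range(len(x) - 1) if y[i] == "O" and y[i + 1] == "LOC")
    return (state, trans)


def score(theta, y, x):
    return sum(w * c for w, c in zip(theta, features(y, x)))


def log_partition(theta, x):
    return math.log(sum(math.exp(score(theta, y, x))
                        for y in product(LABELS, repeat=len(x))))


def log_likelihood(theta, data):
    """Equation (2.29): sum of log p(y_S | x_S) over all sentences S."""
    return sum(score(theta, y, x) - log_partition(theta, x) for x, y in data)


def gradient(theta, data):
    """Observed minus expected feature counts, summed over all sentences."""
    grad = [0.0] * len(theta)
    for x, y in data:
        log_z = log_partition(theta, x)
        expected = [0.0] * len(theta)
        for cand in product(LABELS, repeat=len(x)):
            p = math.exp(score(theta, cand, x) - log_z)
            for k, count in enumerate(features(cand, x)):
                expected[k] += p * count
        for k, count in enumerate(features(y, x)):
            grad[k] += count - expected[k]
    return grad


def train(data, steps=200, learning_rate=0.5):
    """Plain gradient ascent on the log-likelihood, starting from zero weights."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        grad = gradient(theta, data)
        theta = [w + learning_rate * g for w, g in zip(theta, grad)]
    return theta


# Hypothetical training set: pairs of (words, labels).
DATA = [
    (("in", "Germany"), ("O", "LOC")),
    (("Berlin", "is", "big"), ("LOC", "O", "O")),
]

theta = train(DATA)
print(log_likelihood([0.0, 0.0], DATA), log_likelihood(theta, DATA))
```

The second printed value is larger than the first, that is, the trained weights explain the toy training set better than the zero initialization.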

In document About the exploration of data mining techniques using structured features for information extraction (Page 44-47)