Structural Kernels - Structural Kernels and Neural Network Models for Question Answering System

Structural kernels have been successfully applied in many domains and problems, since they allow the user to incorporate some knowledge about the structure of the data in the kernel similarity computation.

This is important especially for objects whose overall similarity is a function of the similarity of their subparts. In this section, we introduce general kernels which operate on strings and trees.

16 Support Vector Machines and Kernel Methods

2.4.1 String Kernel

The String Kernel (SK) [Lodhi et al., 2002] computes the similarity between two strings $s_1$ and $s_2$ by counting the number of subsequences that are shared between them. Some symbols in the strings may be skipped, allowing the contribution of skipgrams to the final similarity. The SK is defined by the following equation:

$$K_{SK}(s_1, s_2) = \sum_{u \in \Sigma^*} \phi_u(s_1) \cdot \phi_u(s_2) = \sum_{u \in \Sigma^*} \sum_{\vec{I}_1 : u = s_1[\vec{I}_1]} \sum_{\vec{I}_2 : u = s_2[\vec{I}_2]} \lambda^{d(\vec{I}_1) + d(\vec{I}_2)} \tag{2.12}$$

Here, $\Sigma^* = \bigcup_{n=0}^{\infty} \Sigma^n$ is the set of all possible strings, $\vec{I}_1$ and $\vec{I}_2$ are the two sequences of indexes $\vec{I} = (i_1, ..., i_{|u|})$, with $1 \le i_1 < ... < i_{|u|} \le |s|$, such that $u = s_{i_1} .. s_{i_{|u|}}$, $d(\vec{I}) = i_{|u|} - i_1 + 1$ and $\lambda \in [0, 1]$ is a decay factor. This means that each index sequence selects $|u|$ strictly increasing positions of $s$, so $u$ is at most as long as $s$. $d(\vec{I})$ is the length of the span of $s$ between the first and last selected characters of $u$.
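To make eq. 2.12 concrete, the following is a minimal brute-force sketch in Python. It is not the dynamic-programming algorithm of Lodhi et al.; it simply enumerates index sequences. It assumes the sum over $u$ is truncated to subsequences of length 1 to `max_len` (a hypothetical parameter, as are the function names), and its cost grows exponentially, so it is only meant for very short strings.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_features(s, lam, max_len):
    """Return {u: phi_u(s)} for all subsequences u with 1 <= |u| <= max_len.

    phi_u(s) sums lam ** d(I) over every index sequence I that spells u in s,
    where d(I) = i_last - i_first + 1 (the span covered in s).
    """
    phi = defaultdict(float)
    for n in range(1, max_len + 1):
        for idx in combinations(range(len(s)), n):  # strictly increasing indexes
            u = "".join(s[i] for i in idx)
            span = idx[-1] - idx[0] + 1
            phi[u] += lam ** span
    return phi


def string_kernel(s1, s2, lam=0.5, max_len=3):
    """Truncated version of eq. 2.12: dot product of the two feature maps."""
    phi1 = subsequence_features(s1, lam, max_len)
    phi2 = subsequence_features(s2, lam, max_len)
    return sum(v * phi2[u] for u, v in phi1.items() if u in phi2)


if __name__ == "__main__":
    # Shared subsequences of "ab" with itself: "a", "b" (span 1), "ab" (span 2).
    print(string_kernel("ab", "ab", lam=0.5, max_len=2))  # 0.5625
```

With $\lambda = 0.5$, the subsequences "a" and "b" each contribute $0.5 \cdot 0.5$ and "ab" contributes $0.25 \cdot 0.25$, giving 0.5625.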

2.4.2 Convolution Tree Kernels

Tree kernels are powerful functions that detect the similarity of tree structures by counting the number of common subparts. The difference between tree kernels is in how the tree fragments are generated.

Convolution Tree Kernels (TKs) compute the number of substructures shared by two trees $T_1$ and $T_2$ without explicitly enumerating all the possible tree fragments, which would be a very expensive operation.

Let $\mathcal{T} = \{t_1, ..., t_{|\mathcal{T}|}\}$ be the set of all possible trees in the space of structures, and $\chi_i(n)$ an indicator function, which is equal to 1 if the target $t_i$ is rooted at a node $n$, and equal to 0 otherwise. We can define a general tree kernel over $T_1$ and $T_2$ as:

$$K_{TK}(T_1, T_2) = \sum_{n_1 \in N_{T_1}} \sum_{n_2 \in N_{T_2}} \Delta(n_1, n_2) \tag{2.13}$$

$N_{T_1}$ and $N_{T_2}$ are the sets of nodes of the $T_1$ and $T_2$ trees, and

$$\Delta(n_1, n_2) = \sum_{i=1}^{|\mathcal{T}|} \chi_i(n_1) \chi_i(n_2) \tag{2.14}$$

computes the number of common tree fragments rooted at the $n_1$ and $n_2$ nodes. In general, the specification of eq. 2.14 determines the TK expressivity. Indeed, how the fragments are defined and obtained gives rise to a number of different tree kernels.


Syntactic Tree Kernel

The Syntactic Tree Kernel (STK) [Collins and Duffy, 2002] evaluates the number of common substructures as follows:

1. if the productions at $n_1$ and $n_2$ are different then $\Delta(n_1, n_2) = 0$;

2. if the productions at $n_1$ and $n_2$ are the same, and $n_1$ and $n_2$ have only leaf children (they are pre-terminals) then $\Delta(n_1, n_2) = 1$;

3. if the productions at $n_1$ and $n_2$ are the same, and $n_1$ and $n_2$ are not pre-terminals then:

$$\Delta(n_1, n_2) = \prod_{j=1}^{nc(n_1)} \left(1 + \Delta(c_{n_1}^j, c_{n_2}^j)\right) \tag{2.15}$$

where $nc(\cdot)$ is a function that returns the number of children of the argument node, and $c_n^j$ is the $j$-th child of the node $n$. A decay factor $\lambda$ can be introduced by modifying the steps (2) and (3) as follows¹:

2. $\Delta(n_1, n_2) = \lambda$;

3. $\Delta(n_1, n_2) = \lambda \prod_{j=1}^{nc(n_1)} \left(1 + \Delta(c_{n_1}^j, c_{n_2}^j)\right)$

The running time of STK is $O(|N_{T_1}| \times |N_{T_2}|)$, but in practice the computational complexity tends to be linear, i.e. $O(|N_{T_1}| + |N_{T_2}|)$, for syntactic trees [Moschitti and Zanzotto, 2007]. The main observation on the STK is that the production rules of the grammar used to generate the tree will not be broken, i.e., children of a node are not separated.
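The three STK steps, together with the decay factor and the kernel-space normalization mentioned in the footnote, can be sketched as follows. This is a naive recursive sketch without memoization, so it does not reach the running time quoted above. The `Node` class and all function names are hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical tree node: a label plus an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def nodes(t: Node) -> Iterator[Node]:
    """Yield every node of the tree rooted at t."""
    yield t
    for c in t.children:
        yield from nodes(c)


def production(n: Node):
    """Grammar production at n: its label and the labels of its children."""
    return (n.label, tuple(c.label for c in n.children))


def delta_stk(n1: Node, n2: Node, lam: float = 1.0) -> float:
    if not n1.children or not n2.children:
        return 0.0  # leaves carry no production, so they root no STK fragment
    if production(n1) != production(n2):
        return 0.0  # step 1
    if all(not c.children for c in n1.children + n2.children):
        return lam  # step 2: both nodes are pre-terminals
    result = lam    # step 3: product over aligned children
    for c1, c2 in zip(n1.children, n2.children):
        result *= 1.0 + delta_stk(c1, c2, lam)
    return result


def tree_kernel(t1: Node, t2: Node, lam: float = 1.0) -> float:
    """Eq. 2.13: sum Delta over all node pairs."""
    return sum(delta_stk(a, b, lam) for a in nodes(t1) for b in nodes(t2))


def normalized_tree_kernel(t1: Node, t2: Node, lam: float = 1.0) -> float:
    """Normalization in the kernel space, giving a score between 0 and 1."""
    denom = math.sqrt(tree_kernel(t1, t1, lam) * tree_kernel(t2, t2, lam))
    return tree_kernel(t1, t2, lam) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    dog = Node("NP", [Node("D", [Node("the")]), Node("N", [Node("dog")])])
    cat = Node("NP", [Node("D", [Node("the")]), Node("N", [Node("cat")])])
    print(tree_kernel(dog, cat))             # 3.0
    print(normalized_tree_kernel(dog, cat))  # 0.5
```

For the two toy noun phrases with $\lambda = 1$, the three shared fragments are [D the], [NP D N] and [NP [D the] N]. No shared fragment separates the children D and N of NP.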

Partial Tree Kernel

The Partial Tree Kernel (PTK) differs from the STK in the fact that it allows the children of a node to be separated to form tree fragments. This means that the production rules of the grammar generating the tree can be broken. PTK produces a greater number of fragments, and therefore it is more expressive. The ∆ function of the PTK is the following:

1. if the node labels of $n_1$ and $n_2$ are different then $\Delta(n_1, n_2) = 0$; else

2. $\Delta(n_1, n_2) = 1 + \sum_{\vec{I}_1, \vec{I}_2, l(\vec{I}_1) = l(\vec{I}_2)} \prod_{j=1}^{l(\vec{I}_1)} \Delta(c_{n_1}(\vec{I}_{1j}), c_{n_2}(\vec{I}_{2j}))$

where $\vec{I}_1 = \langle h_1, h_2, h_3, .. \rangle$ and $\vec{I}_2 = \langle k_1, k_2, k_3, .. \rangle$ are sequences of indexes synchronized with the ordered child sequences $c_{n_1}$ of $n_1$ and $c_{n_2}$ of $n_2$. $\vec{I}_{1j}$ and $\vec{I}_{2j}$ index the $j$-th child in the sequence, and $l(\cdot)$ returns the length of the index list, and therefore the number of children selected.

¹ For a similarity score between 0 and 1, a normalization in the kernel space, i.e., $\frac{TK(T_1, T_2)}{\sqrt{TK(T_1, T_1) \times TK(T_2, T_2)}}$, is applied.

In the following formulation, we add two decay factors: $\mu$ accounting for the tree depth, and $\lambda$ for the length of the child subsequences with respect to the original sequence, which accounts also for gaps:

$$\Delta(n_1, n_2) = \mu \left( \lambda^2 + \sum_{\vec{I}_1, \vec{I}_2, l(\vec{I}_1) = l(\vec{I}_2)} \lambda^{d(\vec{I}_1) + d(\vec{I}_2)} \prod_{j=1}^{l(\vec{I}_1)} \Delta(c_{n_1}(\vec{I}_{1j}), c_{n_2}(\vec{I}_{2j})) \right) \tag{2.16}$$

where $d(\vec{I}_1) = \vec{I}_{1l(\vec{I}_1)} - \vec{I}_{11} + 1$ and $d(\vec{I}_2) = \vec{I}_{2l(\vec{I}_2)} - \vec{I}_{21} + 1$. The decay factors penalize larger and deeper trees, and children that are far away from each other.
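Eq. 2.16 can be read directly as a recursion over pairs of equally long child subsequences. The sketch below enumerates those subsequences explicitly, so it is exponential in the number of children and only illustrates the definition. It is not an efficient PTK algorithm. The `Node` class, the function names and the default decay values are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical tree node: a label plus an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def nodes(t: Node) -> Iterator[Node]:
    yield t
    for c in t.children:
        yield from nodes(c)


def delta_ptk(n1: Node, n2: Node, mu: float = 0.4, lam: float = 0.4) -> float:
    """Naive evaluation of eq. 2.16 for the node pair (n1, n2)."""
    if n1.label != n2.label:
        return 0.0
    total = lam ** 2
    c1, c2 = n1.children, n2.children
    for p in range(1, min(len(c1), len(c2)) + 1):   # common subsequence length
        for i1 in combinations(range(len(c1)), p):  # index sequence I1
            d1 = i1[-1] - i1[0] + 1                 # d(I1): span including gaps
            for i2 in combinations(range(len(c2)), p):
                d2 = i2[-1] - i2[0] + 1             # d(I2)
                prod = 1.0
                for a, b in zip(i1, i2):
                    prod *= delta_ptk(c1[a], c2[b], mu, lam)
                    if prod == 0.0:
                        break                       # one mismatch zeroes the product
                total += lam ** (d1 + d2) * prod
    return mu * total


def ptk(t1: Node, t2: Node, mu: float = 0.4, lam: float = 0.4) -> float:
    """Eq. 2.13 with the PTK Delta: sum over all node pairs."""
    return sum(delta_ptk(a, b, mu, lam) for a in nodes(t1) for b in nodes(t2))


if __name__ == "__main__":
    t1 = Node("A", [Node("b"), Node("c"), Node("d")])
    t2 = Node("A", [Node("b"), Node("d")])
    # With mu = lam = 1 no decay is applied. The result counts the constant
    # term (1), the pairs b/b and d/d (2), and the pair <b, d>/<b, d> (1).
    print(delta_ptk(t1, t2, mu=1.0, lam=1.0))  # 4.0
```

The last term in the example pairs the children b and d of the first tree, which are separated by c, with the adjacent b and d of the second tree. This is exactly the kind of fragment the STK cannot produce.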

Smoothed Partial Tree Kernel

The Smoothed Partial Tree Kernel (SPTK) extends the PTK and enables the computation of lexical similarities. The ∆ function is defined as follows:

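1. if $n_1$ and $n_2$ are leaves then $\Delta_\sigma(n_1, n_2) = \mu \lambda \sigma(n_1, n_2)$; else

2. $\Delta_\sigma(n_1, n_2) = \mu \sigma(n_1, n_2) \times \left( \lambda^2 + \sum_{\vec{I}_1, \vec{I}_2, l(\vec{I}_1) = l(\vec{I}_2)} \lambda^{d(\vec{I}_1) + d(\vec{I}_2)} \prod_{j=1}^{l(\vec{I}_1)} \Delta_\sigma(c_{n_1}(\vec{I}_{1j}), c_{n_2}(\vec{I}_{2j})) \right)$,

where $\sigma$ is any similarity between nodes, e.g., between their lexical labels, and $\mu, \lambda \in [0, 1]$ are two decay factors.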

SPTK has been shown to be rather efficient in practice [Croce et al., 2011, 2012].
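The two SPTK cases can be sketched with the same naive enumeration of child subsequences, now weighted by a node similarity $\sigma$. This is an illustrative sketch only and not the efficient algorithm of Croce et al. The `Node` class, the `toy_sigma` function and its hand-written similarity table are hypothetical. In practice $\sigma$ would typically come from a lexical resource or word vectors.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical tree node: a label plus an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical lexical similarities, made up for illustration.
TOY_SIMILARITY = {frozenset(("dog", "cat")): 0.5}


def toy_sigma(n1: Node, n2: Node) -> float:
    """1 for identical labels, a table lookup otherwise, 0 if unknown."""
    if n1.label == n2.label:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((n1.label, n2.label)), 0.0)


def delta_sptk(n1: Node, n2: Node, sigma: Callable[[Node, Node], float],
               mu: float = 0.4, lam: float = 0.4) -> float:
    sim = sigma(n1, n2)
    if sim == 0.0:
        return 0.0  # both cases are multiplied by sigma, so the result is 0
    if not n1.children and not n2.children:
        return mu * lam * sim                       # case 1: two leaves
    total = lam ** 2                                # case 2
    c1, c2 = n1.children, n2.children
    for p in range(1, min(len(c1), len(c2)) + 1):
        for i1 in combinations(range(len(c1)), p):
            d1 = i1[-1] - i1[0] + 1
            for i2 in combinations(range(len(c2)), p):
                d2 = i2[-1] - i2[0] + 1
                prod = 1.0
                for a, b in zip(i1, i2):
                    prod *= delta_sptk(c1[a], c2[b], sigma, mu, lam)
                    if prod == 0.0:
                        break
                total += lam ** (d1 + d2) * prod
    return mu * sim * total


if __name__ == "__main__":
    n_dog = Node("N", [Node("dog")])
    n_cat = Node("N", [Node("cat")])
    # Non-zero although the words differ, because toy_sigma("dog", "cat") = 0.5.
    print(delta_sptk(n_dog, n_cat, toy_sigma, mu=0.5, lam=0.5))  # 0.140625
```

Under the PTK the same pair would only match at the N node, since the leaves "dog" and "cat" have different labels. Here the leaf pair contributes in proportion to its similarity.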

STKb and SubTree Kernels

We briefly mention two additional tree kernels which are variations on the STK.

The STK$_b$ kernel extends STK by allowing leaf nodes in the feature space. Since most of the time leaves in our trees are words, we adopted the subscript $b$ to indicate the bag-of-words.

The Subtree Kernel (SbtK) is one of the simplest tree kernels in terms of expressive power. The reason is that it only generates complete subtrees, namely every tree fragment rooted at a given node n contains all the descendants of n.
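Because every SbtK fragment is a complete subtree, two nodes share a fragment only when the subtrees rooted at them are identical. One way to sketch this is to serialize each complete subtree and count matching serializations. The sketch assumes that single leaves also count as complete subtrees and that labels contain no spaces or parentheses. The `Node` class and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: a label plus an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def subtree_signatures(t: Node) -> Counter:
    """Count the bracketed serialization of the complete subtree at each node."""
    counts: Counter = Counter()

    def visit(n: Node) -> str:
        sig = "(" + n.label + "".join(" " + visit(c) for c in n.children) + ")"
        counts[sig] += 1
        return sig

    visit(t)
    return counts


def subtree_kernel(t1: Node, t2: Node) -> int:
    """Number of node pairs (n1, n2) rooting identical complete subtrees."""
    c1, c2 = subtree_signatures(t1), subtree_signatures(t2)
    return sum(count * c2[sig] for sig, count in c1.items())


if __name__ == "__main__":
    dog = Node("NP", [Node("D", [Node("the")]), Node("N", [Node("dog")])])
    cat = Node("NP", [Node("D", [Node("the")]), Node("N", [Node("cat")])])
    # Shared complete subtrees: (D (the)) and the leaf (the).
    print(subtree_kernel(dog, cat))  # 2
```

The NP nodes do not match here, since their complete subtrees differ in one word. This illustrates why the SbtK is less expressive than the STK, which would still match the partial fragment [NP D N].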


In document Structural Kernels and Neural Network Models for Question Answering Systems (Page 31-35)