Completing Hankel Matrices via Convex Optimization

The main obstruction in leveraging the spectral method for learning general weighted automata from labeled examples is that of the missing entries in the Hankel matrix. When learning stochastic WFA, the statistics used by the spectral method are essentially the probabilities assigned by the target distribution to each string in PS. As we have already seen, increasingly large samples yield uniformly convergent estimates for these probabilities. Thus, it can be safely assumed that the probability of any string from PS not present in the sample is zero. When learning arbitrary WFA, however, the value assigned by the target to an unseen string is unknown. Furthermore, one cannot expect that a sample will contain the values of the target function for all the strings in PS. This simple observation raises the question of whether it is possible at all to apply the spectral method in a setting with missing data, or, alternatively, whether there is a principled way to “estimate” this missing information and then apply the spectral method.

It turns out that the second of these approaches can be naturally formulated as a constrained matrix completion problem. When applying the spectral method, the (approximate) values of the target on B = (P, S) are arranged in the matrix H ∈ RP×S_{. Thus, the main difference between the stochastic and}

non-stochastic setting can be restated as follows: when learning stochastic WFA, unknown entries of H can be filled in with zeros, while in the general setting there is a priori no straightforward method to fill in the missing values. In this section we describe a matrix completion algorithm for solving this last problem. In particular, since H is a Hankel matrix whose entries must satisfy some equality constraints, it turns out that the problem of learning general WFA from labeled examples leads to what we call the Hankel matrix completionproblem. This is essentially a matrix completion problem where entries of valid hypotheses need to satisfy the set of equalities imposed by the basis B. We will show how the problem of Hankel matrix completion can be solved with a convex optimization algorithm. In fact, many existing approaches to matrix completion are also based on convex optimization algorithms [CT10; CP10; Rec11; FSSS11]. Furthermore, since the set of valid hypotheses for our constrained matrix completion problem is convex, many of these algorithms could also be modified to solve the Hankel matrix completion problem. We choose to propose our own algorithm here because it is the one we will use on the analysis given in Section 9.3.

We begin by introducing some notation. Throughout this chapter we shall assume that B = (P, S) is a p-closed basis with λ ∈ P ∪ S. For any basis B, we denote by HB the vector space of functions RPS

whose dimension is the dimension of B, defined as |PS|. We will simply write H instead of HB when the

basis B is clear from the context. The Hankel matrix H ∈ RP×S associated to a function h ∈ H is the matrix whose entries are defined by H(u, v) = h(uv) for all u ∈ P and v ∈ S. Note that the mapping h 7→ H is linear. In fact, H is isomorphic to the vector space formed by all |P| × |S| real Hankel matrices and we can thus write by identification

It is clear from this characterization that H is a convex set because it is a subset of a convex space defined by equality constraints. In particular, a matrix in H contains |P||S| coefficients with |PS| degrees of freedom, and the dependencies can be specified as a set of equalities of the form H(u1, v1) = H(u2, v2)

when u1v1 = u2v2. We will use both characterizations of H indistinctly from now on. Also, note that

different orderings of P and S may result in different sets of matrices. For convenience, we will assume for all that follows an arbitrary fixed ordering, since the choice of that order has no effect on our results. Matrix norms extend naturally to norms in H. For any 1 ≤ p ≤ ∞, the Hankel–Schatten p-norm on H is defined as khkp = kHkp. It is straightforward to verify that khkp is a norm by the linearity of

h 7→ H. In particular, this implies that the function k · kp: H → R is convex. In the case p = 2, it can

be seen that khk2

2= hh, hi_H, with the inner product on H defined by

hh, h0i_H= X

x∈PS

cxh(x)h0(x) ,

where cx= |{(u, v) ∈ P × S : x = uv}| is the number of possible decompositions of x into a prefix in P

and a suffix in S.

We now outline our algorithm HMC+SM for learning general WFA from labeled examples based on combining matrix completion with the spectral method. As input, the algorithm takes a sample S = (z1_{, . . . , z}m_{) containing m examples z}i _{= (x}i_{, y}i_{) ∈ Σ}?_{× R, 1 ≤ i ≤ m, drawn i.i.d. from some}

distribution D over Σ?_{× R. There are three parameters a user can specify to control the behavior of}

the algorithm: a basis B = (P, S) of Σ?_{, a regularization parameter τ > 0, and the desired number of}

states n in the hypothesis. The output returned by HMC+SM is a WFA AS with n states that computes

a function fAS: Σ

→ R. The algorithm works in two stages. In the first stage, a constrained matrix completion algorithm with input S and regularization parameter τ is used to compute a Hankel matrix HS ∈ HB. In the second stage, the spectral method is applied to HS to compute a WFA AS with n

states. The spectral method is just a direct application of the equations given in Section 5.3 by observing that we can write H>

S = [H>λ, H>σ1, . . . , H

σ|Σ|], with hλ,S and hP,λ contained in Hλ because we assumed

λ ∈ P ∪ S. In the rest of this section we describe the Hankel matrix completion algorithm in detail. It is easy to see that in fact HMC+SM defines a whole family of algorithms. In particular, by combining the spectral method with any algorithm for solving the Hankel matrix completion problem, one can derive a new algorithm for learning WFA. We shall now describe a particular family of Hankel matrix completion algorithms. This family will be parametrized by a number 1 ≤ p ≤ ∞ and a convex loss ` : R × R → R+.

Given a basis B = (P, S) of Σ?_{and a sample S over Σ}?

×R, our algorithm solves a convex optimization problem and returns a matrix HS ∈ HB. We give two equivalent descriptions of this optimization, one in

terms of functions h : PS → R, and another in terms of Hankel matrices H ∈ RP×S. While the former is perhaps conceptually simpler, the latter is easier to implement within the existing frameworks of convex optimization.

We will denote by eS the subsample of S containing all the examples z = (x, y) ∈ S such that x ∈ PS. We write m to denote the size of e_e S. For any 1 ≤ p ≤ ∞ and any convex loss function ` : R × R → R+,

we define the objective function FS over H as follows:

FS(h) = τ N (h) + bR_S_e(h) = τ khk2p+ 1 e m X (x,y)∈ eS `(h(x), y) ,

where τ > 0 is a regularization parameter. FS is clearly a convex function by the convexity of k · kp

and `. Our algorithm seeks to minimize this loss function over the finite-dimensional vector space H and returns a function hS satisfying

hS ∈ argmin h∈H

FS(h) . (HMC-h)

To define an equivalent optimization over the matrix version of H, we introduce the following notation. For each string x ∈ PS, fix a pair of coordinate vectors (ux, vx) ∈ RP × RS such that u>xHvx= H(x)

and a suffix v ∈ S, and such that uv = x. Now, abusing our previous notation, we define the following loss function over matrices:

FS(H) = τ N (H) + bR_S_e(H) = τ kHk2p+ 1 e m X (x,y)∈ eS `(u>xHvx, y) .

This is a convex function defined over the space of all |P| × |S| matrices. Optimizing FS over the convex

set of Hankel matrices H leads to an algorithm equivalent to (HMC-h): HS ∈ argmin

H∈H

FS(H) . (HMC-H)

We note here that our approach shares some common aspects with some previous work in matrix completion. The fact that there may not be a true underlying Hankel matrix makes it somewhat close to the agnostic setting in [FSSS11], where matrix completion is also applied under arbitrary distributions. Nonetheless, it is also possible to consider other learning frameworks for WFA where algorithms for exact matrix completion [CT10; Rec11] or noisy matrix completion [CP10] may be useful. Furthermore, since most algorithms in the literature of matrix completion are based on convex optimization problems, it is likely that most of them can be adapted to solve constrained matrix completions problems such as the one we discuss here.

In document Learning finite-state machines: statistical and algorithmic aspects (Page 145-147)