Consistent Learning via Local Loss Optimization

In this section we present a learning algorithm for WFA based on solving an optimization problem. In particular, the algorithm minimizes a loss function defined in terms of observed data about the target WFA. Casting learning algorithms as optimization problems is a common approach in machine learning, which has led to very successful classes of algorithms based on principles like empirical and structural risk minimization. In a sense that will soon become clear, our algorithm can be thought of as a reinterpretation of the spectral method in terms of loss minimization. Though not entirely practical due to the quadratic nature of the loss and the constraints present in the optimization problem, this point of view on the spectral method yields insights that are valuable for designing new classes of algorithms. This will be the subject of the following sections.

In spirit, our algorithm is similar to the spectral method in the sense that in order to learn a function $f : \Sigma^\star \to \mathbb{R}$ of finite rank, the algorithm infers a WFA using (approximate) information from a sub-block of $H_f$. The sub-block used by the algorithm is defined in terms of a set of prefixes $P$ and a set of suffixes $S$.

Throughout this section we assume that $f$ is fixed and has rank $r$, and that a basis $(P, S)$ of $f$ is given. We will describe our algorithm under the hypothesis that the sub-blocks $H$ and $\{H_\sigma\}_{\sigma \in \Sigma}$ of $H_f$ are known exactly. It is trivial to modify the algorithm to work in the case when only approximations $\hat{H}$ and $\{\hat{H}_\sigma\}_{\sigma \in \Sigma}$ of the Hankel sub-blocks are known.

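Given $1 \le n \le s = |S|$ we define the local loss function $\ell_n(X, \beta_\infty, \{B_\sigma\})$ on variables $X \in \mathbb{R}^{s \times n}$, $\beta_\infty \in \mathbb{R}^n$ and $B_\sigma \in \mathbb{R}^{n \times n}$ for $\sigma \in \Sigma$ as:

$$\ell_n = \|H X \beta_\infty - h_{P,\lambda}\|_2^2 + \sum_{\sigma \in \Sigma} \|H X B_\sigma - H_\sigma X\|_F^2 \qquad (8.1)$$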
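As a concrete reading of (8.1), the following NumPy sketch evaluates the local loss for given blocks. The function name `local_loss`, the dictionary layout used for the operators, and the toy 2x2 matrices are assumptions made for illustration only; they do not come from the text.

```python
import numpy as np

def local_loss(H, H_sigma, h_P, X, beta, B):
    """Evaluate the local loss (8.1).

    H       : (p, s) Hankel sub-block
    H_sigma : dict mapping each symbol to its (p, s) block H_sigma
    h_P     : (p,) vector h_{P,lambda}
    X       : (s, n) matrix; beta : (n,) vector
    B       : dict mapping each symbol to an (n, n) matrix B_sigma
    """
    HX = H @ X
    # Squared Euclidean norm of H X beta - h_{P,lambda}
    loss = np.sum((HX @ beta - h_P) ** 2)
    # Squared Frobenius norms of H X B_sigma - H_sigma X, one per symbol
    for sym, Hs in H_sigma.items():
        loss += np.sum((HX @ B[sym] - Hs @ X) ** 2)
    return float(loss)

# Toy blocks (hypothetical): with H = X = I, choosing beta = h_P and
# B_sigma = H_sigma makes both residuals vanish.
H = np.eye(2)
H_sigma = {"a": np.array([[0.5, 0.2], [0.1, 0.3]])}
h_P = np.array([1.0, 0.5])
print(local_loss(H, H_sigma, h_P, np.eye(2), h_P, H_sigma))
```

Note that the function does not enforce the orthogonality constraint on X; it only evaluates the objective.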

Our learning algorithm is a constrained minimization of this local loss, which we call spectral optimization (SO):

$$\min_{X, \beta_\infty, \{B_\sigma\}} \ell_n(X, \beta_\infty, \{B_\sigma\}) \quad \text{s.t.} \quad X^\top X = I \qquad \text{(SO)}$$

Intuitively, this optimization tries to jointly solve the optimizations solved by the SVD and the pseudo-inverse in the spectral method based on Lemma 5.2.1. In particular, as with the SVD-based method, it can be shown that (SO) is consistent whenever $n \ge r$ and $B = (P, S)$ is complete. This shows that the algorithms are in some sense equivalent, and thus provides a novel interpretation of spectral learning algorithms as minimizing a loss function on local data. Here "local" is meant in contrast to any algorithm based on maximum likelihood, in the sense that only examples contained in the basis are considered in the loss function.

Theorem 8.2.1. Suppose $n \ge r$ and $B$ is a complete basis. Then, for any optimal solution $(X^*, \beta_\infty^*, \{B_\sigma^*\})$ to problem (SO), the weighted automaton $B^* = \langle h_{\lambda,S}^\top X^*, \beta_\infty^*, \{B_\sigma^*\} \rangle$ satisfies $f = f_{B^*}$.

The proof of this theorem is given in Section 8.2.1. Though the proof is relatively simple in the case n = r, it turns out that the case n > r is much more delicate – unlike in the SVD-based method, where the same proof applies to all n ≥ r.

Of course, if $H$ and $\{H_\sigma\}$ are not fully known, but approximations $\hat{H}$ and $\{\hat{H}_\sigma\}$ are given to the algorithm, we can still minimize the empirical local loss $\hat{\ell}_n$ and build a WFA from the solution using the same method of Theorem 8.2.1.

Despite its consistency, in general the optimization (SO) is not algorithmically tractable because its objective function is quadratic but not positive semidefinite and the constraint on $X$ is not convex. Nonetheless, the proof of Theorem 8.2.1 shows that when $H$ and $\{H_\sigma\}$ are known exactly, the SVD method can be used to efficiently compute an optimal solution of (SO). Furthermore, the SVD method can be regarded as an approximate solver for (SO) with an empirical loss function $\hat{\ell}_n$, as follows. First find an $\hat{X}$ satisfying the constraints using the SVD of $\hat{H}$, and then compute $\hat{\beta}_\infty$ and $\{\hat{B}_\sigma\}$ by minimizing the loss (8.1) with $\hat{X}$ fixed; note that in this case the optimization turns out to be convex. This is just one iteration of a general heuristic known as alternating minimization, which can be applied to quadratic objective functions of two variables where the objective is convex when one of the two variables is held fixed. In this case the SVD provides a clever, though costly, initialization for the variable that is fixed during the first iteration.
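The two-step procedure described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names, the dictionary layout, and the synthetic noisy blocks are all hypothetical, and `np.linalg.lstsq` is used to solve the least-squares subproblems that remain once X is fixed.

```python
import numpy as np

def local_loss(H, H_sigma, h_P, X, beta, B):
    """Local loss (8.1), with H_sigma and B given as dicts keyed by symbol."""
    HX = H @ X
    loss = np.sum((HX @ beta - h_P) ** 2)
    for sym, Hs in H_sigma.items():
        loss += np.sum((HX @ B[sym] - Hs @ X) ** 2)
    return float(loss)

def svd_then_least_squares(H_hat, H_sigma_hat, h_P_hat, n):
    """Fix X from the SVD of H_hat, then solve for beta and each B_sigma."""
    _, _, Vt = np.linalg.svd(H_hat)        # full SVD; rows of Vt are right singular vectors
    X = Vt[:n].T                           # top-n right singular vectors, so X^T X = I
    HX = H_hat @ X
    # With X fixed, each term of the loss is an ordinary least-squares problem.
    beta = np.linalg.lstsq(HX, h_P_hat, rcond=None)[0]
    B = {a: np.linalg.lstsq(HX, Hs @ X, rcond=None)[0]
         for a, Hs in H_sigma_hat.items()}
    return X, beta, B

# Synthetic noisy blocks (purely illustrative, not real Hankel estimates).
rng = np.random.default_rng(0)
p, s, r = 6, 5, 2
P, S = rng.normal(size=(p, r)), rng.normal(size=(r, s))
noisy = lambda M: M + 0.01 * rng.normal(size=M.shape)
H_hat = noisy(P @ S)
H_sigma_hat = {a: noisy(P @ rng.normal(size=(r, r)) @ S) for a in "ab"}
h_P_hat = noisy(P @ rng.normal(size=r))

X_hat, beta_hat, B_hat = svd_then_least_squares(H_hat, H_sigma_hat, h_P_hat, n=r)
print(local_loss(H_hat, H_sigma_hat, h_P_hat, X_hat, beta_hat, B_hat))
```

Further iterations of alternating minimization would re-optimize X with the other variables fixed, subject to the orthogonality constraint; that step is not shown here.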

From the perspective of an optimization algorithm, the bounds given in Section 5.4 for the distance between operators recovered by the spectral method with full and approximate data can be restated as a sensitivity analysis of the optimization solved by the algorithm given here. In fact, a similar analysis can be carried out for (SO), though we shall not pursue this direction here.

8.2.1 Proof of Theorem 8.2.1

The following technical result will be used in the proof.

Lemma 8.2.2. Let $f : \Sigma^\star \to \mathbb{R}$ be a function of finite rank $r$ and suppose that $(P, S)$ is a complete basis for $f$. Then the matrix $H_\Sigma = [H_{\sigma_1} \cdots H_{\sigma_{|\Sigma|}}]$ has rank $r$.

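Proof. Let $A = \langle \alpha_0, \alpha_\infty, \{A_\sigma\} \rangle$ be a minimal automaton for $f$ and denote by $H = PS$ the rank factorization induced by $A$. Note that we have $H_\Sigma = P [A_{\sigma_1} S, \ldots, A_{\sigma_{|\Sigma|}} S]$. Thus, it is enough to show that $\mathrm{rank}([A_{\sigma_1} S \cdots A_{\sigma_{|\Sigma|}} S]) = r$. Since each column of $S$ is of the form $s_v = A_v \alpha_\infty$ for some suffix $v \in S$, and $\mathrm{rank}(S) = r$, we have $\mathbb{R}^r = \mathrm{span}(\{s_v\}_{v \in S}) \subseteq \cup_{\sigma \in \Sigma} \mathrm{range}(A_\sigma)$. Now, since for all $\sigma \in \Sigma$ one has $\mathrm{span}(\{A_\sigma s_v\}_{v \in S}) = \mathrm{range}(A_\sigma)$, we see that the columns of $[A_{\sigma_1} S, \ldots, A_{\sigma_{|\Sigma|}} S]$ span $\cup_{\sigma \in \Sigma} \mathrm{range}(A_\sigma) = \mathbb{R}^r$. It follows that $\mathrm{rank}(H_\Sigma) = r$.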

The proof of Theorem 8.2.1 is divided into the following four claims.

Claim 8.1. The optimal value of problem (SO) is zero.

Let $H = U \Lambda V^\top$ be a full SVD of $H$ and write $V_n \in \mathbb{R}^{s \times n}$ for the $n$ right singular vectors corresponding to the $n$ largest singular values (some of which might be zero since $n \ge r$). Note that we have $V_n^\top V_n = I$ and $H V_n V_n^\top = H$. Now we check that $\ell_n(V_n, (H V_n)^+ h_{P,\lambda}, \{(H V_n)^+ H_\sigma V_n\}) = 0$. Recall that $H V_n (H V_n)^+$ acts as an identity in the space spanned by the columns of $H V_n$. Thus, writing $h_{P,\lambda} = H V_n V_n^\top e_\lambda$ we see that $(H V_n)(H V_n)^+ h_{P,\lambda} = h_{P,\lambda}$. Furthermore, since $\mathrm{rank}([H, H_\sigma]) = \mathrm{rank}(H)$, we have $H V_n (H V_n)^+ H_\sigma V_n = H_\sigma V_n$. This verifies the claim.

Claim 8.2. For any $n \ge r$, $A_n = \langle h_{\lambda,S}^\top V_n, (H V_n)^+ h_{P,\lambda}, \{(H V_n)^+ H_\sigma V_n\} \rangle$ satisfies $f_{A_n} = f$.

We will show that for any $n > r$ one has $f_{A_n} = f_{A_r}$. Then the claim will follow from Lemma 5.2.1, since $A_r$ is the WFA corresponding to the rank factorization $H = (H V_r)(V_r^\top)$. Write $\Pi_n = [I_r, 0] \in \mathbb{R}^{r \times n}$. Since any singular vector in $V_n$ which is not in $V_r$ is orthogonal to the row space of $H$ and $H_\sigma$, by construction we have $\Pi_n A_n \Pi_n^\top = A_r$. Now we consider the factorization $H = P_n S_n$ induced by $A_n$. If we show that the rows of $P_n$ lie in the span of the rows of $H V_n$, then $P_n \Pi_n^\top \Pi_n = P_n$ and Lemma 5.2.3 tells us that $f_{A_n} = f_{A_r}$. Write $A_n = \langle \alpha_0, \alpha_\infty, \{A_\sigma\} \rangle$. We prove by induction on $|u|$ that $\alpha_u^\top = \alpha_0^\top A_u$ lies in the span of the rows of $H V_n$, which we denote by $\mathcal{H}$. This will imply the claim about the rows of $P_n$. For $|u| = 0$ we trivially have $\alpha_0^\top = h_{\lambda,S}^\top V_n \in \mathcal{H}$. Furthermore, by the induction hypothesis we have $\alpha_u^\top = \gamma_u^\top H V_n$ for some $\gamma_u \in \mathbb{R}^p$. Thus we get $\alpha_{u\sigma}^\top = \alpha_u^\top A_\sigma = \gamma_u^\top H V_n (H V_n)^+ H_\sigma V_n = \gamma_u^\top H_\sigma V_n \in \mathcal{H}$.
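Claim 8.2 can be checked numerically on a small example. The sketch below builds exact Hankel blocks from a hypothetical 2-state WFA over {a, b} (its weights are arbitrary illustrative values), constructs the automaton from the top-n right singular vectors for n = r and for n > r, and compares its values with f on all words up to length 4. The helper names and the choice of basis (all words of length at most 2) are assumptions; `np.linalg.lstsq` with an explicit `rcond` is used to apply the pseudo-inverse so that numerically zero singular values are truncated.

```python
import itertools
import numpy as np

# Hypothetical 2-state WFA over {a, b}, chosen only for illustration.
ALPHA0 = np.array([1.0, 0.0])
ALPHA_INF = np.array([1.0, 0.5])
OPS = {"a": np.array([[0.5, 0.2], [0.1, 0.3]]),
       "b": np.array([[0.2, 0.4], [0.3, 0.1]])}

def wfa_value(alpha0, alpha_inf, ops, word):
    """Compute alpha0^T A_{x_1} ... A_{x_t} alpha_inf for word = x_1 ... x_t."""
    v = alpha0
    for sym in word:
        v = v @ ops[sym]
    return float(v @ alpha_inf)

def f(word):
    """Target function computed by the hypothetical WFA."""
    return wfa_value(ALPHA0, ALPHA_INF, OPS, word)

def words_up_to(k, alphabet="ab"):
    """All words of length <= k, starting with the empty word."""
    return ["".join(w) for t in range(k + 1)
            for w in itertools.product(alphabet, repeat=t)]

def spectral_wfa(n, basis_len=2, alphabet="ab", rcond=1e-10):
    """Build A_n from exact Hankel blocks with P = S = words_up_to(basis_len)."""
    P = S = words_up_to(basis_len, alphabet)
    H = np.array([[f(u + v) for v in S] for u in P])
    h_P = np.array([f(u) for u in P])      # h_{P,lambda}
    h_S = np.array([f(v) for v in S])      # h_{lambda,S}
    _, _, Vt = np.linalg.svd(H)            # full SVD, Vt is |S| x |S|
    V = Vt[:n].T                           # V_n
    HV = H @ V
    # Minimum-norm least squares with a cutoff equals multiplying by (H V_n)^+.
    pinv_apply = lambda M: np.linalg.lstsq(HV, M, rcond=rcond)[0]
    ops = {}
    for a in alphabet:
        H_a = np.array([[f(u + a + v) for v in S] for u in P])
        ops[a] = pinv_apply(H_a @ V)
    return h_S @ V, pinv_apply(h_P), ops

for n in (2, 4):                           # n = r and n > r
    a0, ainf, ops = spectral_wfa(n)
    err = max(abs(wfa_value(a0, ainf, ops, w) - f(w)) for w in words_up_to(4))
    print(n, err)
```

The printed errors should be at the level of floating-point round-off for both values of n, in line with the claim that the extra singular vectors do not change the computed function.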

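Claim 8.3. Let $(X^*, \beta_\infty^*, \{B_\sigma^*\})$ be an optimal solution to (SO) and define $B' = \langle h_{\lambda,S}^\top X^*, \beta'_\infty, \{B'_\sigma\} \rangle$, where $B'_\sigma = (H X^*)^+ H_\sigma X^*$ and $\beta'_\infty = (H X^*)^+ h_{P,\lambda}$. Then $f_{B^*} = f_{B'}$.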

Since the optimal value of the objective is zero, we must have $H X^* B_\sigma^* = H_\sigma X^*$ and $H X^* \beta_\infty^* = h_{P,\lambda}$. Thus, by a property of the Moore–Penrose pseudo-inverse we must have that $B_\sigma^* = B'_\sigma + C'_\sigma$ and $\beta_\infty^* = \beta'_\infty + \gamma'$, where $C'_\sigma = (I - (H X^*)^+ H X^*) C_\sigma$ and $\gamma' = (I - (H X^*)^+ H X^*) \gamma$ for some arbitrary $C_\sigma \in \mathbb{R}^{n \times n}$ and $\gamma \in \mathbb{R}^n$. Now the claim follows from arguments similar to the ones used above, showing by induction on the length of $u$ that $(\beta_u^*)^\top = h_{\lambda,S}^\top X^* B_u^*$ lies in the span of the rows of $H X^*$.
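The pseudo-inverse property used in this argument is that every solution of a consistent linear system A B = M has the form A^+ M + (I - A^+ A) C for some matrix C. A short NumPy check on random matrices is given below; the matrices are hypothetical stand-ins (A for H X*, M for H_sigma X*), and the helper `pinv` is an assumption of this sketch that computes the pseudo-inverse through minimum-norm least squares with an explicit cutoff.

```python
import numpy as np

def pinv(A, rcond=1e-10):
    """Moore-Penrose pseudo-inverse via minimum-norm least squares."""
    return np.linalg.lstsq(A, np.eye(A.shape[0]), rcond=rcond)[0]

rng = np.random.default_rng(0)
# Rank-2 matrix of size 5 x 4: a rank-deficient stand-in for H X*.
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))
# Consistent right-hand side: a stand-in for H_sigma X*.
M = A @ rng.normal(size=(4, 4))

A_pinv = pinv(A)
B_prime = A_pinv @ M                               # minimum-norm solution B'
C = rng.normal(size=(4, 4))                        # arbitrary matrix C
B_star = B_prime + (np.eye(4) - A_pinv @ A) @ C    # another exact solution

print(np.allclose(A @ B_prime, M), np.allclose(A @ B_star, M))
```

Both products reproduce M because A (I - A^+ A) = 0, so the added term is invisible to the residual even though it changes the solution itself.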

Claim 8.4. With notation from previous claims, we have $f_{B'} = f_{A_n}$.

In the first place, note that Lemma 8.2.2 and the equation $H X^* [B'_{\sigma_1}, \ldots, B'_{\sigma_{|\Sigma|}}] = [H_{\sigma_1} X^*, \ldots, H_{\sigma_{|\Sigma|}} X^*]$ necessarily imply $H X^* (X^*)^\top = H$. Using this property, we can show that $B' = N A_n M$ with $N = (X^*)^\top V_n$ and $M = V_n^\top X^*$. Thus, the claim follows from Lemma 5.2.3, where the condition $P_n M N = P_n$ can be verified via an induction argument on the length of prefixes.

Summarizing, the consistency of the algorithm given by (SO) follows from the chain of equalities $f_{B^*} = f_{B'} = f_{A_n} = f$.
