General Perspective on Distributionally Learnable Classes


Ryo Yoshinaka Kyoto University, Japan [email protected]

Abstract

Several algorithms have been proposed to learn different subclasses of context-free grammars based on the idea generically called distributional learning. Those techniques have been applied to many formalisms richer than context-free grammars, such as multiple context-free grammars, simple context-free tree grammars and others. The learning algorithms for those different formalisms are actually quite similar to each other. In this paper we give a uniform view of those algorithms.

1 Introduction

Approaches based on the idea generically called distributional learning have achieved great success in the algorithmic learning of various subclasses of context-free grammars (CFGs) (Clark, 2010c; Yoshinaka, 2012). Those techniques are applied to richer formalisms as well. The formalisms studied so far include multiple CFGs (Yoshinaka, 2011a), simple context-free tree grammars (CFTGs) (Kasprzik and Yoshinaka, 2011), second-order abstract categorial grammars (Yoshinaka and Kanazawa, 2011), parallel multiple CFGs (Clark and Yoshinaka, 2014), conjunctive grammars (Yoshinaka, 2015) and others. The goal of this paper is to present a uniform view of those algorithms.

Every grammar formalism for which distributional learning techniques have been proposed so far generates its languages through context-free derivation trees, whose nodes are labeled by production rules. The formalism and grammar rules determine how a context-free derivation tree $\tau$ is mapped to a derived object $\tilde{\tau} = d$. A context-free derivation tree $\tau$ can be decomposed into a subtree $\sigma$ and a tree-context $\chi$ so that $\tau = \chi[\sigma]$. The subtree determines a substructure $s = \tilde{\sigma}$ of $d$ and the tree-context determines a contextual structure $c = \tilde{\chi}$ in which the substructure is plugged to form the derived object $d = c \odot s$, where we represent the plugging operation by $\odot$. In the CFG case, $c$ is a string pair $\langle l, r \rangle$, $s$ is a string $u$, and $\langle l, r \rangle \odot u = lur$, which may correspond to a derivation $I \Rightarrow^* lXr \Rightarrow^* lur$ where $I$ is the initial symbol and $X$ is a nonterminal symbol. In richer formalisms those substructures and contexts may have richer structures, like tuples of strings or $\lambda$-terms. A learner does not know how a given example $d$ is derived by the hidden grammar behind the observed examples. A learner based on distributional learning simply tries all the possible decompositions of a positive example into two arbitrary parts $c'$ and $s'$ such that $d = c' \odot s'$, where some grammar may derive $d$ through a derivation tree $\tau' = \chi'[\sigma']$ with $\tilde{\chi}' = c'$ and $\tilde{\sigma}' = s'$. Based on observations of the relation between substructures and contexts collected from given examples, a hypothesis grammar is computed. We call the properties of grammars with which distributional learning approaches work distributional properties.
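As a small illustration of the decomposition step in the CFG case, the following Python sketch enumerates every way to split a string $d$ into a context $\langle l, r \rangle$ and a substring $u$ with $lur = d$. The names `decompositions` and `plug` are hypothetical helpers introduced here for illustration only; they do not come from the paper.

```python
def decompositions(d):
    """Yield every (context, substring) pair ((l, r), u) with l + u + r == d."""
    n = len(d)
    for i in range(n + 1):
        for j in range(i, n + 1):
            yield (d[:i], d[j:]), d[i:j]


def plug(context, u):
    """Plugging operation for strings: <l, r> applied to u gives l u r."""
    l, r = context
    return l + u + r


# All decompositions of a two-letter example.
pairs = list(decompositions("ab"))
```

A string of length $n$ has $(n+1)(n+2)/2$ such decompositions, so in the string case this enumeration is polynomial in the size of the example.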

This paper first formally defines grammar formalisms based on context-free derivation trees. We then show that grammars with different distributional properties are learnable by standard distributional learning techniques if the formalism satisfies some conditions, which include polynomial-time decomposability of objects into contexts and substructures. In addition, we discuss cases where we cannot enumerate all of the possible contexts and substructures.

2 Σ-grammars

There are a number of ways to represent a language, a subset of an object set $\mathbb{O}_*$, whose elements are typically strings or trees, although anything encodable is eligible. The formalisms this paper discusses generate objects in $\mathbb{O}_*$ through context-free derivation trees $\tau$, which are mapped to an element $d \in \mathbb{O}_*$ in a uniform way. The map is inductively defined and computed. Each derivation subtree $\tau'$ of $\tau$ also determines an object, which we call a substructure of $d$. A substructure is not necessarily a member of $\mathbb{O}_*$. For example, nonterminal symbols of multiple CFGs (Seki et al., 1991) derive $n$-tuples of strings, where the value $n$ is unique to each nonterminal, while the languages generated by multiple CFGs are still simply string sets. A generalization of the CFG formalism is specified by the kinds of objects that each nonterminal generates and the admissible operations over those objects.

Let $\mathbb{O}$ be a set of objects, which are identified with their codes of finite length. We have a set $\Omega$ of finite representations $O$ which are interpreted as subsets $\mathbb{O}_O$ of $\mathbb{O}$ through an effective procedure. By a sort we flexibly refer to $O \in \Omega$ or $\mathbb{O}_O \subseteq \mathbb{O}$. We also have an indexed family of computable functions from tuples of objects of some sorts to objects of some sort. Let $\mathbb{F}$ be a set of function names or function indices $f$, which represent functions $\tilde{f}$. By $\mathbb{O}_1 \times \dots \times \mathbb{O}_n \to \mathbb{O}_0$ we denote the set of functions whose domain is $\mathbb{O}_1 \times \dots \times \mathbb{O}_n$ and whose codomain is $\mathbb{O}_0$. By $\mathbb{F}_{O_0,O_1,\dots,O_n}$ we denote the set of function names $f \in \mathbb{F}$ with $\tilde{f} \in \mathbb{O}_{O_1} \times \dots \times \mathbb{O}_{O_n} \to \mathbb{O}_{O_0}$. We assume that the domain sorts $O_1, \dots, O_n$ and the codomain sort $O_0$ are easily computed from $f$. We specify a class of grammars by a triple, which we call a signature, $\Sigma = \langle \Omega, \mathbb{F}, O_* \rangle$, where $O_* \in \Omega$ is a special sort of objects. We write $\mathbb{O}_*$ for $\mathbb{O}_{O_*}$.

A context-free $\Sigma$-grammar ($\Sigma$-grammar for short) is a tuple $G = \langle N, \sigma, F, P, I \rangle$ where $N$ is a finite set of nonterminal symbols, $I \subseteq N$ is a set of initial symbols, $\sigma \in N \to \Omega$ is a sort assignment on nonterminals such that $\sigma(X) = O_*$ for all $X \in I$, $F \subseteq \mathbb{F}$ is a finite set of function names, and $P$ is a finite set of production rules, which are elements of $N \times F \times N^*$. Each production rule is denoted as

$$X_0 \leftarrow f\langle X_1, \dots, X_n \rangle$$

where $X_0, \dots, X_n \in N$ and $f \in \mathbb{F}_{O_0,O_1,\dots,O_n}$ for $\sigma(X_i) = O_i$. For each $O \in \Omega$, $N_O = \sigma^{-1}(O) \subseteq N$ is the set of $O$-nonterminals, which are assigned the sort $O$. By $\mathbb{G}(\Sigma)$ we denote the class of $\Sigma$-grammars.

A $\Sigma$-grammar defines its language via derivation trees, which are recursively defined as follows.

• If $\tau_i$ are $X_i$-derivation trees for $i = 1, \dots, n$ and $\rho$ is a rule of the form $X_0 \leftarrow f\langle X_1, \dots, X_n \rangle$, then the term $\tau_0 = \rho[\tau_1, \dots, \tau_n]$ is an $X_0$-derivation tree. Its yield $\tilde{\tau}_0$ is $\tilde{f}(\tilde{\tau}_1, \dots, \tilde{\tau}_n) \in \mathbb{O}_{\sigma(X_0)}$ where $\tilde{\tau}_i$ is the yield of $\tau_i$.

The case where $n = 0$ gives the base of this recursive definition. An $X$-derivation tree is complete if $X \in I$. The yield of any $X$-derivation tree is called an $X$-substructure. By $\mathcal{S}(G, X)$ we denote the set of $X$-substructures. The language of $G$ is $\mathcal{L}(G) = \bigcup_{X \in I} \mathcal{S}(G, X)$, which we call a $\Sigma$-language. In other words, $\mathcal{L}(G)$ is the set of the yields of complete derivation trees. The class of $\Sigma$-languages is denoted by $\mathbb{L}(\Sigma)$.
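The recursive definition of the yield can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: objects are strings, each function name is replaced directly by a Python callable, and the classes `Rule` and `Tree` as well as the two example rules are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Rule:
    lhs: str                      # nonterminal X0
    func: Callable[..., str]      # the function named by f
    rhs: Tuple[str, ...] = ()     # nonterminals X1, ..., Xn


@dataclass
class Tree:
    rule: Rule
    children: Tuple["Tree", ...] = ()


def yield_of(tree):
    """Yield of rho[tau_1, ..., tau_n]: apply f to the yields of the subtrees."""
    return tree.rule.func(*(yield_of(t) for t in tree.children))


# Two illustrative rules for {a^n b^n | n >= 1}: a nullary base rule and a unary wrapping rule.
base = Rule("S", lambda: "ab")
wrap = Rule("S", lambda v: "a" + v + "b", ("S",))
tree = Tree(wrap, (Tree(wrap, (Tree(base),)),))
```

Here the nullary rule plays the role of the base case $n = 0$, and `yield_of(tree)` evaluates the term bottom-up.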

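Distributional learning is concerned with what $X$-derivation contexts represent. An $X$-derivation context is obtained by replacing an occurrence of an $X$-derivation tree in a complete derivation tree by a special symbol $\Box_{\sigma(X)}$. Accordingly the yield $\tilde{\chi}$ of an $X$-derivation context $\chi$ should be a finite representation of a function that gives $\widetilde{\chi[\tau]}$ when applied to $\tilde{\tau}$ for any $X$-derivation tree $\tau$. We assume to have a set $\mathbb{E}_O$ of representations of functions from $\mathbb{O}_O$ to $\mathbb{O}_*$ for $O \in \Omega$ to which the yields of derivation contexts belong.

• $\Box_X$ is an $X$-derivation context for all $X \in I$ and its yield $\Box_{O_*} \in \mathbb{E}_{O_*}$ represents the identity function on $\mathbb{O}_*$,

• For an $X$-derivation context $\chi_0$, a rule $\rho = X \leftarrow f\langle X_1, \dots, X_n \rangle$ and $X_i$-derivation trees $\tau_i$ for $i \in \{1, \dots, n\} - \{j\}$, the term $\chi$ obtained by replacing $\Box_X$ in $\chi_0$ by $\rho[\tau_1, \dots, \tau_{j-1}, \Box_{X_j}, \tau_{j+1}, \dots, \tau_n]$ is an $X_j$-derivation context. Its yield, which is denoted as

$$\tilde{\chi} = \tilde{\chi}_0 \odot \tilde{f}(\tilde{\tau}_1, \dots, \tilde{\tau}_{j-1}, \Box_{\sigma(X_j)}, \tilde{\tau}_{j+1}, \dots, \tilde{\tau}_n),$$

represents the function $\phi \in \mathbb{O}_{\sigma(X_j)} \to \mathbb{O}_*$ such that for all $s \in \mathbb{O}_{\sigma(X_j)}$,

$$\phi(s) = \phi_0(\tilde{f}(\tilde{\tau}_1, \dots, \tilde{\tau}_{j-1}, s, \tilde{\tau}_{j+1}, \dots, \tilde{\tau}_n)),$$

where $\phi_0$ is the function represented by $\tilde{\chi}_0$.

The yield of any $X$-derivation context is called an $X$-context. By $\mathcal{C}(G, X)$ we denote the set of $X$-contexts. For $c \in \mathcal{C}(G, X)$ and $s \in \mathcal{S}(G, X)$, $c \odot s$ is the result of the application of the function represented by $c$ to $s$.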

3 Context-substructure relation

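By $\mathbb{S}$ and $\mathbb{C}$ we denote the sets of substructures and contexts, respectively, which can be obtained by some grammar in $\mathbb{G}(\Sigma)$:

$$\mathbb{S} = \bigcup_{O \in \Omega} \mathbb{S}_O \quad\text{and}\quad \mathbb{C} = \bigcup_{O \in \Omega} \mathbb{C}_O$$

where

$$\mathbb{S}_O = \bigcup \{\, \mathcal{S}(G, X) \mid X \text{ is an } O\text{-nonterminal of some } G \in \mathbb{G}(\Sigma) \,\},$$

$$\mathbb{C}_O = \bigcup \{\, \mathcal{C}(G, X) \mid X \text{ is an } O\text{-nonterminal of some } G \in \mathbb{G}(\Sigma) \,\}.$$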

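We write $\mathbb{S}_*$ for $\mathbb{S}_{O_*}$. Note that the above definition is relative to $\Sigma$. Even if $\mathbb{O}_{O_1} = \mathbb{O}_{O_2}$ for different $O_1, O_2 \in \Omega$, it can be the case that $\mathbb{S}_{O_1} \neq \mathbb{S}_{O_2}$ and $\mathbb{C}_{O_1} \neq \mathbb{C}_{O_2}$. Though usually $\mathbb{O}_O$ has a definition independent of $\Sigma$, it is possible to specify $\mathbb{O}_O$ in terms of the signature so that $\mathbb{S}_O = \mathbb{O}_O$. Clearly if $s \in \mathbb{S}_O$ and $c \in \mathbb{C}_O$, then there is a grammar $G \in \mathbb{G}(\Sigma)$ generating $c \odot s$ using a nonterminal of sort $O$. Therefore, $c \odot s$ is well defined for any $s \in \mathbb{S}_O$ and $c \in \mathbb{C}_O$ without specifying a particular $\Sigma$-grammar. Similarly $c \odot \tilde{f}(s_1, \dots, s_{j-1}, \Box_{O_j}, s_{j+1}, \dots, s_n)$ is well defined for any $c \in \mathbb{C}_{O_0}$, $f \in \mathbb{F}_{O_0,\dots,O_n}$ and $s_i \in \mathbb{S}_{O_i}$. This operation is generalized to sets $S \subseteq \mathbb{S}_O$ and $C \subseteq \mathbb{C}_O$ in the straightforward way, like $C \odot S = \{\, c \odot s \mid c \in C \text{ and } s \in S \,\}$.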

Hereafter, whenever we write $c \odot s$ and $\tilde{f}(s_1, \dots, s_n)$, we assume they are well-formed. That is, the domains of the functions represented by $c$ and $f$ match the sorts to which $s$ and $s_1, \dots, s_n$ belong, respectively. Accordingly we drop the subscript $O$ from $\Box_O$ and write $\tilde{f}(s_1, \dots, s_{j-1}, \Box, s_{j+1}, \dots, s_n)$. When we have a substructure set $S$, we assume $S \subseteq \mathbb{S}_O$ for some $O \in \Omega$. We often identify $s$ with $\{s\}$ unless confusion arises. Also we assume $\mathbb{S}_O \neq \emptyset$ for all $O \in \Omega$. The same assumptions apply to contexts.

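We are interested in whether the composition $c \odot s$ belongs to a concerned language $L \in \mathbb{L}(\Sigma)$. Clark (2010b) has introduced syntactic concept lattices to analyze the context-substring relation on string languages and particularly to design a distributional learning algorithm for CFLs. Generalizing his discussion, we define an $O$-concept lattice $\mathcal{B}^O(L)$ of a language $L \subseteq \mathbb{O}_*$ for respective sorts $O \in \Omega$. Assuming $L$ and $O$ understood from the context, let us write

$$S^\ddagger = \{\, c \in \mathbb{C}_O \mid c \odot S \subseteq L \,\},$$

$$C^\dagger = \{\, s \in \mathbb{S}_O \mid C \odot s \subseteq L \,\}$$

for $S \subseteq \mathbb{S}_O$ and $C \subseteq \mathbb{C}_O$. We write $S^\dagger$ for $S^{\ddagger\dagger}$ and $C^\ddagger$ for $C^{\dagger\ddagger}$.

We call a pair $\langle S, C \rangle \subseteq \mathbb{S}_O \times \mathbb{C}_O$ a concept iff $S^\ddagger = C$ and $C^\dagger = S$. For any $S \subseteq \mathbb{S}_O$ and $C \subseteq \mathbb{C}_O$, $\langle S^\dagger, S^\ddagger \rangle$ and $\langle C^\dagger, C^\ddagger \rangle$ are concepts. We call them the concepts induced by $S$ and $C$, respectively. For two concepts $\langle S_1, C_1 \rangle$ and $\langle S_2, C_2 \rangle$ in $\mathcal{B}^O(L)$, we write $\langle S_1, C_1 \rangle \le^O_L \langle S_2, C_2 \rangle$ if $S_1 \subseteq S_2$, which is equivalent to $C_2 \subseteq C_1$. With this partial order, $\mathcal{B}^O(L)$ is a complete lattice.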
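To make the two polar operations concrete, here is a minimal Python sketch for the string case. It rests on assumptions that are not part of the definitions above: the infinite sets $\mathbb{C}_O$ and $\mathbb{S}_O$ are replaced by small finite candidate pools, and membership in $L$ is given by a hypothetical predicate `in_anbn` for $\{a^n b^n \mid n \ge 1\}$. All function names are illustrative.

```python
def in_anbn(w):
    """Membership predicate for the example language {a^n b^n | n >= 1}."""
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n


def plug(c, s):
    l, r = c
    return l + s + r


def contexts_accepting(S, context_pool, member):
    """Finite approximation of the ddagger polar: contexts c in the pool with c (.) S inside L."""
    return {c for c in context_pool if all(member(plug(c, s)) for s in S)}


def substructures_accepted(C, substructure_pool, member):
    """Finite approximation of the dagger polar: substructures s in the pool with C (.) s inside L."""
    return {s for s in substructure_pool if all(member(plug(c, s)) for c in C)}


context_pool = {("", ""), ("a", "b"), ("a", ""), ("", "b")}
substructure_pool = {"", "a", "b", "ab", "aabb"}

S_polar = contexts_accepting({"ab"}, context_pool, in_anbn)
closure = substructures_accepted(S_polar, substructure_pool, in_anbn)
```

Within these pools, the pair `(closure, S_polar)` behaves like the concept induced by $\{ab\}$: applying the two maps again leaves both sets unchanged.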

We can introduce a partial order to substructure sets based on the concepts that they induce. Let us write $S_1 \le^O_L S_2$ if $S_2^\ddagger \subseteq S_1^\ddagger$. The relation represents the substitutability of $S_1$ for $S_2$.

Lemma 1. The following three are equivalent for $S, T \subseteq \mathbb{S}_O$:

• $S \le^O_L T$,

• $c \odot T \subseteq L$ implies $c \odot S \subseteq L$ for all $c \in \mathbb{C}_O$,

• $T^\ddagger \odot S \subseteq L$.

If $S_i \le^{O_i}_L T_i$ for $i = 1, \dots, n$, then for any $f \in \mathbb{F}_{O_0,O_1,\dots,O_n}$, we have $f(S_1, \dots, S_n) \le^{O_0}_L f(T_1, \dots, T_n)$.

4 Conditions to be distributionally learnable

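Distributional learning algorithms decompose examples $d \in \mathbb{S}_*$ into contexts $c \in \mathbb{C}_O$ and substructures $s \in \mathbb{S}_O$ so that $c \odot s = d$. Then a primal approach uses substructures or sets of substructures as nonterminals of a conjecture grammar. We want each nonterminal $[[S]]$ indexed by $S \subseteq \mathbb{S}_O$ to satisfy $\mathcal{S}(G, [[S]]) = S^\dagger$. On the other hand, a dual approach uses contexts or sets of contexts as nonterminals, where the semantics of the nonterminal is $\mathcal{S}(G, [[C]]) = C^\dagger$. For an object $d \in \mathbb{S}_*$, $O \in \Omega$ and $\vec{O} = (O_0, \dots, O_n)$ with $O_i \in \Omega$, we define

$$\mathbb{S}_O|_d = \{\, s \in \mathbb{S}_O \mid c \odot s = d \text{ for some } c \in \mathbb{C}_O \,\},$$

$$\mathbb{C}_O|_d = \{\, c \in \mathbb{C}_O \mid c \odot s = d \text{ for some } s \in \mathbb{S}_O \,\},$$

$$\mathbb{F}_{\vec{O}}|_d = \{\, f \in \mathbb{F}_{\vec{O}} \mid c \odot f(s_1, \dots, s_n) = d \text{ for some } c \in \mathbb{C}_{O_0}|_d \text{ and } s_i \in \mathbb{S}_{O_i}|_d \,\},$$

$\mathbb{S}|_D = \bigcup_{d \in D,\, O \in \Omega} \mathbb{S}_O|_d$, $\mathbb{C}|_D = \bigcup_{d \in D,\, O \in \Omega} \mathbb{C}_O|_d$ and $\mathbb{F}|_D = \bigcup_{d \in D,\, \vec{O} \in \Omega^*} \mathbb{F}_{\vec{O}}|_d$ for $D \subseteq \mathbb{O}_*$. Let $\Omega|_D = \bigcup_{d \in D} \Omega|_d$ for $\Omega|_d = \{\, O \mid \mathbb{S}_O|_d \neq \emptyset \,\}$.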

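We require $\mathbb{G}(\Sigma)$ to be a tractable formalism such that composition and decomposition can be done efficiently.

Assumption 1. There are polynomial-time algorithms which

• decide whether $s \in \mathbb{S}_O$ from $s \in \mathbb{S}$ and $O \in \Omega$,

• compute $\tilde{f}(s_1, \dots, s_n)$ from $s_i \in \mathbb{S}_{O_i}$ and $f \in \mathbb{F}_{O_0,O_1,\dots,O_n}$,

• decide whether $c \in \mathbb{C}_O$ from $c \in \mathbb{C}$ and $O \in \Omega$,

• compute $c \odot s$ from $c \in \mathbb{C}_O$ and $s \in \mathbb{S}_O$ for any $O \in \Omega$,

• decide whether $s \in \mathcal{L}(G)$ from $s \in \mathbb{S}_*$ and $G \in \mathbb{G}(\Sigma)$.

Assumption 2. There is $p \in \mathbb{N}$ such that the arity of every $f \in \mathbb{F}$ is at most $p$.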

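Assumption 3. There are polynomial-time algorithms that compute $\mathrm{SUB}(d)$, $\mathrm{CON}(d)$ and $\mathrm{FUN}(d)$ from $d \in \mathbb{S}_*$ such that $\mathbb{S}|_d \subseteq \mathrm{SUB}(d) \subseteq \mathbb{S}$, $\mathbb{C}|_d \subseteq \mathrm{CON}(d) \subseteq \mathbb{C}$ and $\mathbb{F}|_d \subseteq \mathrm{FUN}(d) \subseteq \mathbb{F}$.

Actually by Assumptions 2 and 3, one can derive the polynomial-time uniform membership decidability. Moreover, it is easy to filter out nonmembers of $\mathbb{S}|_d$, $\mathbb{C}|_d$ and $\mathbb{F}|_d$ from $\mathrm{SUB}(d)$, $\mathrm{CON}(d)$ and $\mathrm{FUN}(d)$, respectively, but it is not necessary. Assumption 3 implies that $|\Omega|_d|$ is polynomially bounded, since $O \in \Omega|_d$ iff $\mathbb{F}_{O,O_1,\dots,O_n} \neq \emptyset$ for some $O_1, \dots, O_n$.

We write $\mathrm{SUB}_O(D) = \mathrm{SUB}(D) \cap \mathbb{S}_O$, $\mathrm{CON}_O(D) = \mathrm{CON}(D) \cap \mathbb{C}_O$ and $\mathrm{FUN}_{\vec{O}}(D) = \mathrm{FUN}(D) \cap \mathbb{F}_{\vec{O}}$.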

It is often the case that elements of $\Omega$ represent pairwise disjoint sets. Actually for any signature $\Sigma$, one can find $\Sigma' = \langle \Omega', \mathbb{F}', O_* \rangle$ that satisfies this condition such that $\mathbb{L}(\Sigma) = \mathbb{L}(\Sigma')$. Let $\Omega' = \{\, O' \mid O \in \Omega \,\} \cup \{O_*\}$ and $\mathbb{O}_{O'} = \mathbb{O}_O \times \{O\}$ for each $O \in \Omega - \{O_*\}$. For $f \in \mathbb{F}_{O_0,\dots,O_n}$ with $\tilde{f}(s_1, \dots, s_n) = s_0$, we have $f' \in \mathbb{F}'_{O'_0,\dots,O'_n}$ with $\tilde{f}'(s'_1, \dots, s'_n) = s'_0$ where $s'_i = (s_i, O_i)$ if $O_i \in \Omega - \{O_*\}$ and $s'_i = s_i$ if $O_i = O_*$. Clearly every $\Sigma$-grammar has an equivalent $\Sigma'$-grammar. Moreover, this makes it clear that from $s \in \mathbb{S}$ one can immediately specify the unique sort $O' \in \Omega'$ such that $s \in \mathbb{O}_{O'}$. Similarly we may assume that each $c \in \mathbb{C}$ has a unique $O \in \Omega$ such that $c \in \mathbb{C}_O$ and that finding that $O$ is a trivial task. Hereafter we work under this assumption. By $O$ and $f$ we mean $\mathbb{O}_O$ and $\tilde{f}$ for notational convenience.

Example 1. A right regular grammar over an alphabet $\Delta$ is a $\Sigma_{\mathrm{reg}}$-grammar for $\Sigma_{\mathrm{reg}} = \langle \{\Delta^*\}, \mathbb{F}, \Delta^* \rangle$. $\mathbb{F}$ has nullary functions which are members of $\Delta \cup \{\varepsilon\}$ and unary functions $f_a$ with $f_a(w) = aw$ for all $w \in \Delta^*$, for $a \in \Delta$. Clearly the class of right regular grammars satisfies Assumptions 1, 2 and 3.

Example 2. A CFG is a $\Sigma_{\mathrm{cfg}}$-grammar for $\Sigma_{\mathrm{cfg}} = \langle \{\Delta^*\}, \mathbb{F}, \Delta^* \rangle$ where each $f \in \mathbb{F}$ is represented as an $(n+1)$-dimensional vector of strings $\langle u_0, \dots, u_n \rangle$ such that $f(v_1, \dots, v_n) = u_0 v_1 u_1 \dots v_n u_n$ for all $v_i \in \Delta^*$. The class of CFGs itself satisfies Assumption 1 but not Assumptions 2 and 3, since we have no limit on $n$. But several normal forms fulfill Assumptions 2 and 3.

Example 3. Let $\Omega = \{O_1, O_2, \dots\}$ where $O_m$ denotes the set of $m$-tuples of strings. Linear context-free rewriting systems, equivalent to non-deleting multiple CFGs, are $\Sigma_{\mathrm{mcfg}}$-grammars where $O_* = O_1 = \Delta^*$ and every $f \in \mathbb{F}_{O_{m_0},O_{m_1},\dots,O_{m_n}}$ concatenates strings $u_{i,j}$ occurring in an input $\langle \langle u_{1,1}, \dots, u_{1,m_1} \rangle, \dots, \langle u_{n,1}, \dots, u_{n,m_n} \rangle \rangle$ in some way to form an $m_0$-tuple of strings. The uniform membership problem for this class is PSPACE-complete (Kaji et al., 1992). There are infinitely many ways to decompose a string $d$ into substructures and contexts, as $O_m \in \Omega|_d$ for all $m$. Assumptions 1 and 3 will be fulfilled when we restrict admissible functions so that $\mathbb{F}_{O_{m_0},\dots,O_{m_n}} \neq \emptyset$ only if $n \le p$ and $m_i \le q$ for all $i$.

As is the case for multiple CFGs, Assumption 2 is often needed to make the uniform membership problem solvable in polynomial time (Assumption 1).

5 Learning models

Learning algorithms in this paper work under three different learning models.

A positive presentation (text) of a language $L_* \subseteq \mathbb{O}_*$ is an infinite sequence $d_1, d_2, \dots \in \mathbb{O}_*$ such that $L_* = \{\, d_i \mid i \ge 1 \,\}$. In the framework of identification in the limit from positive data, a learner is given a positive presentation of the language $L_* = \mathcal{L}(G_*)$ of the target grammar $G_*$ and each time a new example $d_i$ is given, it outputs a grammar $G_i$ computed from $d_1, \dots, d_i$. We say that a learning algorithm $\mathcal{A}$ identifies $G_*$ in the limit from positive data if for any positive presentation $d_1, d_2, \dots$ of $\mathcal{L}(G_*)$, there is an integer $n$ such that $G_n = G_m$ for all $m \ge n$ and $\mathcal{L}(G_n) = \mathcal{L}(G_*)$. We say that $\mathcal{A}$ identifies a class $\mathbb{G}$ of grammars in the limit from positive data iff $\mathcal{A}$ identifies all $G \in \mathbb{G}$ in the limit from positive data.

We say that $\mathcal{A}$ identifies a class $\mathbb{G}$ of grammars in the limit from positive data and membership queries when we allow $\mathcal{A}$ to ask membership queries (MQs) to an oracle when it computes a hypothesis grammar. An instance of an MQ is an object $d \in \mathbb{O}_*$ and the oracle answers whether $d \in L_*$ in constant time.

The third model is learning with a minimally adequate teacher (MAT). A learner is not given a positive presentation but it may ask equivalence queries (EQs) to an oracle in addition to MQs. An instance of an EQ is a grammar $G$. If $\mathcal{L}(G) = L_*$, the oracle answers “Congratulations!” and the learning process ends. Otherwise, the oracle returns a counterexample $d \in (L_* - \mathcal{L}(G)) \cup (\mathcal{L}(G) - L_*)$, which is called positive if $d \in L_* - \mathcal{L}(G)$ and negative if $d \in \mathcal{L}(G) - L_*$.

When we have an oracle, the learning task itself is trivial unless we show some favorable property on the learning efficiency.

6 Learnable subclasses

This section presents how $\Sigma$-grammars with distributional properties can be learned. Note that all of those properties are relative to $\Sigma$. We assume that $\Sigma$-grammars $G_* = \langle N_*, \sigma_*, F_*, P_*, I_* \rangle$ in this section have no useless nonterminals or functions. That is, $\mathcal{S}(G_*, X) \neq \emptyset$ and $\mathcal{C}(G_*, X) \neq \emptyset$ for all $X \in N_*$, and every $f \in F_*$ appears in some rule in $P_*$.

6.1 Substitutable Languages

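Definition 1 (Clark and Eyraud (2007)). A language $L \in \mathbb{L}(\Sigma)$ is said to be substitutable if for any $O \in \Omega$, $c_1, c_2 \in \mathbb{C}_O$ and $s_1, s_2 \in \mathbb{S}_O$,

$$c_1 \odot s_1,\ c_1 \odot s_2,\ c_2 \odot s_1 \in L \text{ implies } c_2 \odot s_2 \in L.$$

The definition can be rephrased as follows:

$$s_1^\ddagger \cap s_2^\ddagger \neq \emptyset \text{ implies } s_1 \equiv_L s_2,$$

where $s_1 \equiv_L s_2$ means that both $s_1 \le^O_L s_2$ and $s_2 \le^O_L s_1$ hold.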
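The rephrased condition can be checked mechanically when the whole language is a finite set of strings. The sketch below is only an illustration under that assumption (real target languages are usually infinite), and it follows the common string-case convention of considering nonempty substrings only; the function names are hypothetical.

```python
from itertools import combinations


def substrings(language):
    """All nonempty substrings occurring in some member of the language."""
    return {w[i:j] for w in language
            for i in range(len(w)) for j in range(i + 1, len(w) + 1)}


def contexts_of(sub, language):
    """All contexts (l, r) such that l + sub + r belongs to the language."""
    return {(w[:i], w[i + len(sub):]) for w in language
            for i in range(len(w) - len(sub) + 1) if w.startswith(sub, i)}


def is_substitutable(language):
    """Check: whenever two substrings share a context, their context sets coincide."""
    ctx = {s: contexts_of(s, language) for s in substrings(language)}
    return all(ctx[s1] == ctx[s2]
               for s1, s2 in combinations(ctx, 2)
               if ctx[s1] & ctx[s2])
```

For instance, $\{ab, aab\}$ fails the check because $a$ and $aa$ share the context $\langle \varepsilon, b \rangle$, yet $a$ also occurs in the context $\langle a, b \rangle$ while $aaab$ is not in the set.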

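Example 4. Yoshinaka (2008) has proposed a learning algorithm for $k,l$-substitutable CFLs, which satisfy the following property:

$$x_1 u y_1 v z_1,\ x_1 u y_2 v z_1,\ x_2 u y_1 v z_2 \in L \implies x_2 u y_2 v z_2 \in L$$

for any $x_i, y_i, z_i \in \Delta^*$, $u \in \Delta^k$ and $v \in \Delta^l$. We define a signature $\Sigma_{k,l} = \langle \Omega_{k,l}, \mathbb{F}_{k,l}, O_* \rangle$ as follows. Let $\Omega_{k,l} = \{O_*\} \cup \{\, O_{u,v} \mid u \in \Delta^k \text{ and } v \in \Delta^l \,\} \cup \{\, O_u \mid u \in \Delta^{<k+l} \,\}$, where $O_* = \Delta^*$, $O_{u,v} = \{\, \overline{uwv} \mid w \in \Delta^* \,\}$ and $O_u = \{\overline{u}\}$. Here we put overlines to make elements of $\Omega$ pairwise disjoint. Let $\mathbb{F}_{k,l} = \{\, +_{\alpha,\beta} \mid \alpha, \beta \in \Omega - \{\Delta^*\} \,\} \cup \{\, \Box^{O_*}_{\alpha} \mid \alpha \in \Omega - \{O_*\} \,\} \cup \Delta$. The binary function $+_{\alpha,\beta}$ concatenates two strings from sorts $\alpha$ and $\beta$ and gives the right sort in $\Omega - \{O_*\}$. For example, $+_{\alpha,\beta} \in \mathbb{F}$ with $\alpha = O_{u,v}$, $\beta = O_w$ has codomain $O_{u,x}$ where $x$ is the suffix of $vw$ of length $l$. The unary operation $\Box^{O_*}_{\alpha} \in \mathbb{F}_{O_*,\alpha}$ simply removes the overline and “promotes” $\overline{u} \in \alpha$ to $u \in O_*$. $\Delta$ consists of the nullary functions giving a single letter from $\Delta$. It is not hard to see that every CFG has an equivalent $\Sigma_{k,l}$-grammar. Note that $O_*$-nonterminals never occur on the right hand side of a rule in a $\Sigma_{k,l}$-grammar. Hence $\mathbb{C}_{O_*}$ is just the singleton $\{\Box_{O_*}\}$ such that $\Box_{O_*} \odot u = u$ for all $u \in \Delta^*$, whereas $\mathbb{C}_\alpha \neq \mathbb{C}_{O_*}$ contains arbitrary pairs of strings $\langle l, r \rangle \in \Delta^* \times \Delta^*$ such that $\langle l, r \rangle \odot \overline{u} = lur$ for any $\overline{u} \in \alpha$. The $\Sigma_{k,l}$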

Theorem 1. The class of substitutable $\Sigma$-languages is identifiable in the limit from positive data.

The theorem follows from Lemmas 2 and 3 below. From a finite set $D$ of positive examples, Algorithm 1 computes the grammar $\mathrm{SUBST}_P(D) = \langle N, \sigma, F, P, I \rangle$ defined as follows:

• $N_O = \{\, [[s]] \mid s \in \mathrm{SUB}_O(D) \,\}$ for $O \in \Omega|_D$,

• $I = \{\, [[s]] \mid s \in D \,\}$,

• $F = \mathrm{FUN}(D)$,

• $P$ consists of the rules of the form

$$[[s_0]] \leftarrow f\langle [[s_1]], \dots, [[s_n]] \rangle$$

where $f \in \mathrm{FUN}_{O_0,\dots,O_n}(D)$ for $[[s_i]] \in N_{O_i}$ if there is $c \in \mathrm{CON}_{O_0}(D)$ such that

$$c \odot s_0,\ c \odot f(s_1, \dots, s_n) \in D.$$

Since we assume elements of $\Omega$ are pairwise disjoint, each $[[s]]$ belongs to a unique sort. Otherwise, each nonterminal should be tagged with a sort like $[[s, O]]$.
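For intuition, the construction of $\mathrm{SUBST}_P(D)$ can be sketched in Python for the string case. This is an illustrative sketch under assumptions not fixed by the text above: there is a single sort, the only functions are binary concatenation and the nullary single-letter functions, substructures are nonempty substrings, and the brute-force search makes no attempt at efficiency. All names are hypothetical.

```python
def substrings(w):
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}


def contexts(w):
    return {(w[:i], w[j:]) for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def subst_primal(D):
    """Return (nonterminals, initial symbols, binary rules, lexical rules) from examples D."""
    D = set(D)
    subs = set().union(*(substrings(d) for d in D))
    cons = set().union(*(contexts(d) for d in D))

    def shares_context(s0, rhs):
        # Some context c with both c (.) s0 and c (.) rhs among the examples.
        return any(l + s0 + r in D and l + rhs + r in D for (l, r) in cons)

    # [[s0]] <- +<[[s1]], [[s2]]> where + is concatenation.
    binary = {(s0, s1, s2) for s0 in subs for s1 in subs for s2 in subs
              if shares_context(s0, s1 + s2)}
    # [[s0]] <- a for a single letter a.
    lexical = {(s0, a) for s0 in subs for a in subs
               if len(a) == 1 and shares_context(s0, a)}
    return subs, set(D), binary, lexical


nonterminals, initial, binary, lexical = subst_primal({"ab", "aabb"})
```

With the examples $ab$ and $aabb$, the rule $[[ab]] \leftarrow +\langle [[a]], [[abb]] \rangle$ is produced because both $ab$ and $aabb$ occur in the empty context, which is exactly the generalization step licensed by substitutability.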

Algorithm 1 Learning substitutable $\Sigma$-grammars
Data: A positive presentation $d_1, d_2, \dots$
Result: A sequence of grammars $G_1, G_2, \dots$
let $\hat{G}$ be a grammar such that $\mathcal{L}(\hat{G}) = \emptyset$;
for $n = 1, 2, \dots$ do
    let $D = \{d_1, \dots, d_n\}$;
    if $D \nsubseteq \mathcal{L}(\hat{G})$ then
        let $\hat{G} = \mathrm{SUBST}_P(D)$;
    end if
    output $\hat{G}$ as $G_n$;
end for

An alternative way to construct a grammar is to use contexts rather than substructures for nonterminals. One can replace $\mathrm{SUBST}_P(D)$ in the algorithm by $\mathrm{SUBST}_D(D)$, which is defined as follows.

• $N_O = \{\, [[c]] \mid c \in \mathrm{CON}_O(D) \,\}$ for $O \in \Omega|_D$,

• $I = \{[[\Box_{O_*}]]\}$,

• $F = \mathrm{FUN}(D)$,

• $P$ consists of the rules of the form

$$[[c_0]] \leftarrow f\langle [[c_1]], \dots, [[c_n]] \rangle$$

where $f \in \mathrm{FUN}_{O_0,\dots,O_n}(D)$ for $[[c_i]] \in N_{O_i}$ if there are $s_i \in \mathrm{SUB}_{O_i}(D)$ such that

$$c_i \odot s_i \in D \text{ for all } i \text{ and } c_0 \odot f(s_1, \dots, s_n) \in D.$$

The existing algorithms for different classes of substitutable languages (Clark and Eyraud, 2007; Yoshinaka, 2008; Yoshinaka, 2011a) are based on slight variants of $\mathrm{SUBST}_P$. This paper shows the correctness of the algorithm using $\mathrm{SUBST}_D$.

Lemma 2. Let $D$ be a finite subset of a $\Sigma$-substitutable language $L_*$ and $G$ the grammar output by $\mathrm{SUBST}_D(D)$. Then $\mathcal{L}(G) \subseteq L_*$.

Proof. One can show by induction on the derivation that if $s \in \mathcal{S}(G, [[c]])$ then $c \odot s \in L_*$. Suppose that $G$ has a rule $[[c]] \leftarrow f\langle [[c_1]], \dots, [[c_n]] \rangle$, $s_i \in \mathcal{S}(G, [[c_i]])$ and $s = f(s_1, \dots, s_n)$. The induction hypothesis says $c_i \odot s_i \in L_*$ for all $i$. By the rule construction, there are $t_i$ for $i = 1, \dots, n$ such that $c_i \odot t_i \in D \subseteq L_*$ and $c \odot f(t_1, \dots, t_n) \in D \subseteq L_*$. We have $s_i \equiv_{L_*} t_i$ since they occur in the same context $c_i$. By Lemma 1, $c \odot f(s_1, \dots, s_n) \in L_*$.

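Let $G_* = \langle N_*, \sigma_*, F_*, P_*, I_* \rangle$ be a $\Sigma$-grammar generating $L_*$. Fix $s_X \in \mathcal{S}(G_*, X)$ and $c_X \in \mathcal{C}(G_*, X)$ where $c_X = \Box_{O_*}$ for $X \in I_*$. Define $D_*$ by

$$D_* = \{\, c_X \odot s_X \mid X \in N_* \,\} \cup \{\, c_{X_0} \odot f(s_{X_1}, \dots, s_{X_n}) \mid X_0 \leftarrow f\langle X_1, \dots, X_n \rangle \in P_* \,\}.$$

Lemma 3. If $D_* \subseteq D$, then $\mathcal{S}(G_*, X) \subseteq \mathcal{S}(\mathrm{SUBST}_D(D), [[c_X]])$ for all $X$.

Proof. Let $G = \mathrm{SUBST}_D(D)$. If $G_*$ has a rule $X_0 \leftarrow f\langle X_1, \dots, X_n \rangle$ then $G$ has the corresponding rule $[[c_{X_0}]] \leftarrow f\langle [[c_{X_1}]], \dots, [[c_{X_n}]] \rangle$, since

$$c_{X_0} \odot f(s_{X_1}, \dots, s_{X_n}),\ c_{X_i} \odot s_{X_i} \in D.$$

In particular, since $c_X$ for $X \in I_*$ is the identity function $\Box_{O_*}$, the corresponding nonterminal $[[c_X]] = [[\Box_{O_*}]]$ is the initial symbol of $G$, too.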

This shows that we do not need too many data to achieve a right grammar, since $|D_*| \le |P_*| + |N_*|$, where $|\cdot|$ denotes the cardinality of a set. Moreover, it is easy to see that Algorithm 1 updates its conjecture in polynomial time in the total size of $D$.

6.2 Finite kernel property

Definition 2 (Clark et al. (2009), Yoshinaka (2011b)). A nonempty finite set $S \subseteq \mathbb{S}_{\sigma(X)}$ is called a $k$-kernel of a nonterminal $X$ if $|S| \le k$ and

$$\mathcal{S}(G, X) \equiv_{\mathcal{L}(G)} S.$$

A $\Sigma$-grammar $G$ is said to have the $k$-finite kernel property ($k$-FKP) if every nonterminal $X$ has a $k$-kernel $S_X$.

Theorem 2. Under Assumptions 1, 2 and 3, Algorithm 2 identifies $\Sigma$-grammars with the $k$-FKP in the limit from positive data and membership queries.

Algorithm 2 Learning $\Sigma$-grammars with $k$-FKP
Data: A positive presentation $d_1, d_2, \dots$ of $L_*$;
Result: A sequence of $\Sigma$-grammars $G_1, G_2, \dots$;
let $D := K := F := J := \emptyset$;
let $\hat{G} := \mathrm{PRIMAL}_k(K, F, J)$;
for $n = 1, 2, \dots$ do
    let $D := D \cup \{d_n\}$; $J := \mathrm{CON}(D)$;
    if $D \nsubseteq \mathcal{L}(\hat{G})$ then
        let $K := \mathrm{SUB}(D)$ and $F := \mathrm{FUN}(D)$;
    end if
    output $\hat{G} = \mathrm{PRIMAL}_k(K, F, J)$ as $G_n$;
end for

The conjecture grammar $\mathrm{PRIMAL}_k(K, F, J) = \langle N, \sigma, F, P, I \rangle$ of Algorithm 2 is defined from finite sets of substructures $K \subseteq \mathbb{S}$, functions $F \subseteq \mathbb{F}$ and contexts $J \subseteq \mathbb{C}$. The subsets of those sets corresponding to respective sorts are denoted as $K_O = K \cap \mathbb{S}_O$, $J_O = J \cap \mathbb{C}_O$ and $F_{\vec{O}} = F \cap \mathbb{F}_{\vec{O}}$.

• $N_O = \{\, [[S]] \mid S \subseteq K_O \text{ with } 1 \le |S| \le k \,\}$ for each $O \in \Omega|_D$,

• $I = \{\, [[S]] \in N_{O_*} \mid S \subseteq L_* \,\}$,

• $P$ consists of the rules of the form

$$[[S_0]] \leftarrow f\langle [[S_1]], \dots, [[S_n]] \rangle$$

where $f \in F_{O_0,O_1,\dots,O_n}$ for $[[S_i]] \in N_{O_i}$ if

$$(S_0^\ddagger \cap J_{O_0}) \odot f(S_1, \dots, S_n) \subseteq L_*. \tag{1}$$

The grammar is constructed with the aid of finitely many MQs. $\mathrm{PRIMAL}_k(K, F, J)$ can be computed in polynomial time by Assumptions 1, 2 and 3. A rule $[[S_0]] \leftarrow f\langle [[S_1]], \dots, [[S_n]] \rangle$ is compatible with the semantics of the nonterminals if $S_0^\dagger \supseteq f(S_1^\dagger, \dots, S_n^\dagger)$, which is equivalent to

$$S_0^\ddagger \odot f(S_1, \dots, S_n) \subseteq L_* \tag{2}$$

by Lemma 1. However, condition (2) cannot be checked by finitely many MQs. Condition (1), which is decidable by finitely many MQs, can be seen as an approximation of (2). Clearly (2) implies (1) but not vice versa. If a rule satisfies (1) but not (2), we call the rule incorrect. If a rule is incorrect, there is a witness $c \in \mathbb{C}_{O_0} - J_{O_0}$ such that $c \odot S_0 \subseteq L_*$ and $c \odot f(S_1, \dots, S_n) \nsubseteq L_*$.
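Condition (1) amounts to a finite number of membership queries, which the following Python sketch spells out for the string case. It is an illustration under assumptions: contexts are string pairs, the oracle is modelled by a hypothetical predicate `member` (here for $\{a^n b^n \mid n \ge 1\}$), and `f` is any Python callable on strings.

```python
from itertools import product


def in_anbn(w):
    """Stand-in membership oracle for {a^n b^n | n >= 1}."""
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n


def satisfies_condition_1(S0, rhs_sets, f, J, member):
    """Check that every context of J accepting all of S0 also accepts f(S1, ..., Sn)."""
    accepted = [(l, r) for (l, r) in J if all(member(l + s + r) for s in S0)]
    return all(member(l + f(*args) + r)
               for (l, r) in accepted
               for args in product(*rhs_sets))


def concat(x, y):
    return x + y


good = satisfies_condition_1({"ab"}, [{"a"}, {"b"}], concat, {("", ""), ("a", "b")}, in_anbn)
unchecked = satisfies_condition_1({"ab"}, [{"ab"}, {"ab"}], concat, set(), in_anbn)
refuted = satisfies_condition_1({"ab"}, [{"ab"}, {"ab"}], concat, {("", "")}, in_anbn)
```

The rule $[[ab]] \leftarrow +\langle [[ab]], [[ab]] \rangle$ passes the test while $J$ is empty and is rejected once the witness context $\langle \varepsilon, \varepsilon \rangle$ has been added to $J$, which mirrors how incorrect rules are eliminated as $J$ grows.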

Lemma 4. For every finite $K \subseteq \mathbb{S}$ and $F \subseteq \mathbb{F}$ there is $J \subseteq \mathbb{C}$ such that $\hat{G} = \mathrm{PRIMAL}_k(K, F, J)$ has no incorrect rules and $|J| \le |F||K|^{k(p+1)}$, in which case $\mathcal{L}(\hat{G}) \subseteq L_*$.

Let $S_X$ be a $k$-kernel of each nonterminal $X$ of a grammar $G_* = \langle N_*, \sigma_*, F_*, P_*, I_* \rangle$ generating $L_*$.

Lemma 5. There is a finite subset $D \subseteq L_*$ such that $S_X \subseteq \mathbb{S}|_D$ for all $X \in N_*$, $F_* \subseteq \mathbb{F}|_D$ and $|D| \le k|N_*| + |P_*|$. Moreover, if $S_X \subseteq K$ for all $X \in N_*$ and $F_* \subseteq F$, then $L_* \subseteq \mathcal{L}(\hat{G})$.

We prove Theorem 2 discussing the efficiency.

Proof of Theorem 2. Clearly Algorithm 2 updates its conjecture in polynomial time in the data size. Polynomially (in the size of $G_*$) many positive examples will stabilize $K$ and $F$ by Lemma 5. After $K$ and $F$ have stabilized, all the incorrect rules will be removed with at most polynomially (in $|K||F|$) many examples by Lemma 4. After that point Algorithm 2 never changes the conjecture, which generates the target language $L_*$.

6.3 Congruential grammars

Definition 3 (Clark (2010a)). A $\Sigma$-grammar $G$ is said to be congruential if, for every $X \in N$, every $s \in \mathcal{S}(G, X)$ is a 1-kernel of $X$.

Congruential $\Sigma$-grammars have the 1-FKP. Under the following additional assumption, this special case will be polynomial-time learnable with a minimally adequate teacher.

Assumption 4. For any derivation tree $\tau$, the size of its yield $\tilde{\tau}$ is polynomially bounded by that of $\tau$.

Theorem 3. Under Assumptions 1, 2, 3 and 4, Algorithm 3 learns any language $L_*$ generated by a congruential $\Sigma$-grammar $G_*$ with a minimally adequate teacher in time polynomial in $|N_*|$, $|F_*|$, $\ell$ where $\ell$ is the total size of counterexamples given to the learner.

Algorithm 3 Learning congruential $\Sigma$-grammars
let $K := F := J := \emptyset$;
let $\hat{G} := \mathrm{PRIMAL}_1(K, F, J)$;
for $n = 1, 2, \dots$ do
    if $\mathcal{L}(\hat{G}) = L_*$ (equivalence query) then
        output $\hat{G}$ and halt;
    else if the given counterexample $d$ is positive ($d \in L_* - \mathcal{L}(\hat{G})$) then
        let $K := K \cup \mathrm{SUB}(d)$ and $F := F \cup \mathrm{FUN}(d)$;
    else
        let $J := J \cup \mathrm{WITNESS}_P(\tau_d, \Box_{O_*})$ where $\tau_d$ is an (implicit) parse tree of $d$ by $\hat{G}$;
    end if
    let $\hat{G} = \mathrm{PRIMAL}_1(K, F, J)$;
end for

Algorithm 3 uses the same grammar construction PRIMAL as Algorithm 2, where the parameters $K$ and $F$ are calculated from positive counterexamples given by the oracle. On the other hand, $J$ is computed in a different way. By Lemma 4, when the oracle answers a negative counterexample $d$ towards an EQ, our conjecture $\hat{G}$ must use an incorrect rule to derive $d$. To find and remove such an incorrect rule, Algorithm 3 calls a subroutine $\mathrm{WITNESS}_P$ with input $(\tau_d, \Box)$, where $\tau_d$ is a derivation tree of $\hat{G}$ whose yield is $d$. To be precise, $\tau_d$ does not have to be a derivation tree. Rather what we require is that for each $s \in \mathbb{S}|_d$, one can compute at least one tuple of $s_1, \dots, s_n \in \mathbb{S}|_d$ and $f \in \mathbb{F}|_d$ such that $s = f(s_1, \dots, s_n)$ and the height of the lowest derivation tree of each $s_i$ is strictly lower than that of $s$. Indeed one can do this in polynomial time by a dynamic programming method from $\mathrm{SUB}(d)$ and $\mathrm{FUN}(d)$. Yet for ease of explanation, we treat such information as an (implicit) derivation tree $\tau_d$. The procedure $\mathrm{WITNESS}_P$ returns a context that witnesses an incorrect rule that contributes to generating $d$ by searching $\tau_d$, recursively calling itself. The procedure $\mathrm{WITNESS}_P$ in general takes a pair $(\tau, c)$ such that $\tau$ is an $[[s]]$-derivation tree of $\hat{G}$ and $c \in s^\ddagger - \tilde{\tau}^\ddagger$. Let $\tau = \rho[\tau_1, \dots, \tau_n]$ where $\rho = [[s]] \leftarrow f\langle [[s_1]], \dots, [[s_n]] \rangle$. If $c \odot f(s_1, \dots, s_n) \notin L_*$ then the rule $\rho$ is incorrect. So $\mathrm{WITNESS}_P$ returns $c$, which witnesses the incorrectness of the rule. Otherwise, we have

$$c \odot f(s_1, \dots, s_n) \in L_*, \qquad c \odot f(\tilde{\tau}_1, \dots, \tilde{\tau}_n) \notin L_*$$

for the yields $\tilde{\tau}_i$ of $\tau_i$. One can find $i$ such that

$$c \odot f(s_1, \dots, s_{i-1}, s_i, \tilde{\tau}_{i+1}, \dots, \tilde{\tau}_n) \in L_*,$$

$$c \odot f(s_1, \dots, s_{i-1}, \tilde{\tau}_i, \tilde{\tau}_{i+1}, \dots, \tilde{\tau}_n) \notin L_*.$$

This means an incorrect rule is in $\tau_i$. We call $\mathrm{WITNESS}_P(\tau_i, c \odot f(s_1, \dots, s_{i-1}, \Box, \tilde{\tau}_{i+1}, \dots, \tilde{\tau}_n))$.
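The recursive search can be sketched in Python for the string case. This is a hedged illustration, not the paper's implementation: it assumes the 1-kernel setting where each nonterminal $[[s]]$ is named by a single string, functions are given as string vectors $\langle u_0, \dots, u_n \rangle$ in the style of Example 2, an explicit derivation tree is available, and the oracle is a hypothetical predicate `member`. The names `Node`, `apply_vector` and `witness_p` are introduced here.

```python
from dataclasses import dataclass, field


def in_anbn(w):
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n


def apply_vector(u, args):
    """f(v1, ..., vn) = u0 v1 u1 ... vn un."""
    out = u[0]
    for v, u_i in zip(args, u[1:]):
        out += v + u_i
    return out


@dataclass
class Node:
    label: str                      # the string s naming the nonterminal [[s]]
    u: tuple                        # string vector representing the function f
    children: list = field(default_factory=list)


def yield_of(node):
    return apply_vector(node.u, [yield_of(c) for c in node.children])


def witness_p(node, context, member):
    """Precondition: context accepts node.label but rejects the yield of node."""
    l, r = context
    labels = [c.label for c in node.children]
    if not member(l + apply_vector(node.u, labels) + r):
        return context                       # the rule at this node is incorrect
    args = list(labels)
    for i in reversed(range(len(args))):     # replace s_i by the yield of tau_i, right to left
        args[i] = yield_of(node.children[i])
        if not member(l + apply_vector(node.u, args) + r):
            left = node.u[0] + "".join(a + x for a, x in zip(args[:i], node.u[1:i + 1]))
            right = node.u[i + 1] + "".join(a + x for a, x in zip(args[i + 1:], node.u[i + 2:]))
            return witness_p(node.children[i], (l + left, right + r), member)
    raise ValueError("precondition violated: the context accepts the yield")


# Conjecture containing the incorrect rule [[a]] <- +<[[a]], [[a]]>, deriving aab.
leaf_a1, leaf_a2, leaf_b = Node("a", ("a",)), Node("a", ("a",)), Node("b", ("b",))
bad = Node("a", ("", "", ""), [leaf_a1, leaf_a2])
root = Node("ab", ("", "", ""), [bad, leaf_b])
witness = witness_p(root, ("", ""), in_anbn)
```

In this run the root rule passes the test on the labels, the search descends into the left child with the context $\langle \varepsilon, b \rangle$, and that context is returned because it accepts $a$ but rejects $aa$.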

Lemma 6. The procedure $\mathrm{WITNESS}_P(\tau_d, \Box)$ runs in polynomial time in $\ell$ and $|d|$.

Proof. The number of recursive calls of $\mathrm{WITNESS}_P$ is no more than the height of $\tau_d$, which is at most $|\mathbb{S}|_d|$. Let the instance of the $j$-th recursive call be $(\tau_j, c_j)$ and $\chi_j$ the derivation context for $c_j = \tilde{\chi}_j$. $\chi_{j+1}$ is obtained from $\chi_j$ by replacing at most $p$ subtrees by a derivation tree whose yield is an element of $K$. By Assumption 4, the size of $c_j$ and thus the size of an instance of an MQ is polynomially bounded by $|d|\ell$. $\mathrm{WITNESS}_P$ runs in polynomial time.

Lemma 7. Each time Algorithm 3 receives a negative counterexample, at least one incorrect rule is removed.

Lemma 8. Let $G_* = \langle N_*, \sigma, F_*, P_*, I_* \rangle$ be a congruential grammar generating $L_*$. Each time Algorithm 3 receives a positive counterexample, the cardinality of the set $\{\, X \in N_* \mid K \cap \mathcal{S}(G_*, X) = \emptyset \,\} \cup (F_* - F)$ decreases strictly.

Proof of Theorem 3. Time between an EQ and another is polynomially bounded by Lemma 6. By Lemmas 5 and 8, Algorithm 3 gets at most $|N_*| + |F_*|$ positive counterexamples. The grammar $\hat{G} = \mathrm{PRIMAL}_1(K, F, J)$ is constructed from those positive

6.4 Finite context property

Definition 4 (Clark (2010b), Yoshinaka (2011b))$^1$. A nonempty finite set $C \subseteq \mathbb{C}$ is called a $k$-context of a nonterminal $X$ if $|C| \le k$ and

$$\mathcal{S}(G, X) \equiv_{\mathcal{L}(G)} C^\dagger.$$

A $\Sigma$-grammar $G$ is said to have the $k$-(weak) finite context property ($k$-FCP) if every nonterminal $X$ has a $k$-context $C_X$.

Theorem 4. Under Assumptions 1, 2 and 3, Algorithm 4 identifies $\Sigma$-grammars with the $k$-FCP in the limit from positive data and membership queries.

The theorem can be shown by an argument similar to the proof of Theorem 2 based on Lemmas 9 and 10 below. The discussion on the learning efficiency of Algorithm 2 applies to Algorithm 4 as well.

Algorithm 4 Learning $\Sigma$-grammars with $k$-FCP
Data: A positive presentation $d_1, d_2, \dots$ of $L_*$;
Result: A sequence of $\Sigma$-grammars $G_1, G_2, \dots$;
let $D := J := F := K := \emptyset$;
let $\hat{G} := \mathrm{DUAL}_k(J, F, K)$;
for $n = 1, 2, \dots$ do
    let $D := D \cup \{d_n\}$; $K := \mathrm{SUB}(D)$;
    if $D \nsubseteq \mathcal{L}(\hat{G})$ then
        let $J := \mathrm{CON}(D)$ and $F := \mathrm{FUN}(D)$;
    end if
    output $\hat{G} = \mathrm{DUAL}_k(J, F, K)$ as $G_n$;
end for

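The conjecture grammar $\mathrm{DUAL}_k(J, F, K) = \langle N, \sigma, F, P, I \rangle$ of Algorithm 4 is defined from finite sets of contexts $J \subseteq \mathbb{C}$, functions $F \subseteq \mathbb{F}$ and substructures $K \subseteq \mathbb{S}$. For each $C \subseteq J_O$, we write $C^{(K)}$ to mean $C^\dagger \cap K_O$. This set can be seen as a finite approximation of $C^\dagger$, which is computable with MQs.

• $N_O = \{\, [[C]] \mid C \subseteq J_O \text{ with } 1 \le |C| \le k \,\}$ for $O \in \Omega|_D$,

• $I = \{[[\{\Box_{O_*}\}]]\}$,

• $P$ consists of the rules of the form

$$[[C_0]] \leftarrow f\langle [[C_1]], \dots, [[C_n]] \rangle$$

where $f \in F_{O_0,\dots,O_n}$ for $[[C_i]] \in N_{O_i}$ if $C_0 \odot f(C_1^{(K)}, \dots, C_n^{(K)}) \subseteq L_*$.

$^1$ We adopt the definition by Yoshinaka, which is slightly weaker than Clark’s.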

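We say that a rule $[[C_0]] \leftarrow f\langle [[C_1]], \dots, [[C_n]] \rangle$ is incorrect if $C_0 \odot f(C_1^\dagger, \dots, C_n^\dagger) \nsubseteq L_*$. In that case, there are $s_i \in C_i^\dagger$ such that $C_0 \odot f(s_1, \dots, s_n) \nsubseteq L_*$.

Lemma 9. For every finite $J \subseteq \mathbb{C}$ and $F \subseteq \mathbb{F}$ there is $K \subseteq \mathbb{S}$ such that $\hat{G} = \mathrm{DUAL}_k(J, F, K)$ has no incorrect rules and $|K| \le p|F||J|^{k(p+1)}$, in which case $\mathcal{L}(\hat{G}) \subseteq L_*$.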
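The finite approximation $C^\dagger \cap K$ and the dual rule test can be sketched in Python for the string case. As before this is an illustration under assumptions: contexts are string pairs, the oracle is a hypothetical predicate `member` for $\{a^n b^n \mid n \ge 1\}$, and the function names are introduced here.

```python
from itertools import product


def in_anbn(w):
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n


def approx_dagger(C, K, member):
    """Substrings of K accepted by every context in C, found with |C| * |K| queries."""
    return {s for s in K if all(member(l + s + r) for (l, r) in C)}


def dual_rule_ok(C0, rhs_contexts, f, K, member):
    """Test C0 (.) f(C1 ∩ K-part, ..., Cn ∩ K-part) inside L by membership queries."""
    pools = [approx_dagger(C, K, member) for C in rhs_contexts]
    return all(member(l + f(*args) + r)
               for (l, r) in C0
               for args in product(*pools))


def concat(x, y):
    return x + y


C0, C1, C2 = {("", "")}, {("", "b")}, {("a", "")}
small_K = {"a", "b"}
large_K = {"a", "b", "ab", "aab", "abb"}

pool = approx_dagger(C1, large_K, in_anbn)
kept = dual_rule_ok(C0, [C1, C2], concat, small_K, in_anbn)
dropped = dual_rule_ok(C0, [C1, C2], concat, large_K, in_anbn)
```

The rule survives with the small $K$ and is discarded with the larger one, since $aab$ and $abb$ are accepted by $C_1$ and $C_2$ respectively while $aababb$ is not in the language. Enlarging $K$ can therefore only remove rules.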

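Let $G_* = \langle N_*, \sigma_*, F_*, P_*, I_* \rangle$ generate $L_*$ and $C_X$ be a $k$-context of each nonterminal $X \in N_*$.

Lemma 10. There is a finite subset $D \subseteq L_*$ such that $\mathbb{C}|_D \supseteq C_X$ for all $X \in N_*$, $\mathbb{F}|_D \supseteq F_*$ and $|D| \le k|N_*| + |P_*|$. Moreover, if $J \supseteq C_X$ for all $X \in N_*$ and $F \supseteq F_*$, then $L_* \subseteq \mathcal{L}(\hat{G})$.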

6.5 Context-deterministic grammars

Definition 5 (Shirakawa and Yokomori (1993), Yoshinaka (2012))$^2$. A $\Sigma$-grammar $G$ is said to be (weakly) context-deterministic if, for every $X \in N$, every $c \in \mathcal{C}(G, X)$ is a 1-context of $X$.

Unlike in Theorem 3, we do not need Assumption 4 for learning context-deterministic grammars with a minimally adequate teacher.

Theorem 5. Under Assumptions 1, 2 and 3, Algorithm 5 learns any language $L_*$ generated by a context-deterministic $\Sigma$-grammar $G_*$ with a minimally adequate teacher in time polynomial in $|N_*|$, $|F_*|$, $\ell$ where $\ell$ is the total size of counterexamples given to the learner.

Proof. By Lemmas 11, 12 and 13 below.

Algorithm 5 uses the same grammar construc-tion DUAL as Algorithm 4. By Lemma 9, when the oracle answers a negative counterexampled

to-wards an EQ, our conjecture Gˆ must use an

incor-rect rule to derive d. To find and remove such

an incorrect rule, Algorithm 5 calls a subroutine WITNESSD with a derivation tree τd of Gˆ whose

yield isd. The procedure WITNESSD returns a fi-nite set of substructures that witnesses an incorrect rule that contributes to generatingd. An input given

to WITNESSD is in general a [[c]]-derivation tree τ

such thatc_⊙τ /˜ _∈ L_∗. Letτ =ρ[τ1, . . . , τn]where

ρ= [[c]]← f⟨[[c1]], . . . ,[[cn]]⟩. If there isisuch that

ci ⊙τ˜i ∈/ L∗, we recursively call WITNESSD(τi). 2_{We adopt the definition by Yoshinaka, which is slightly}


Algorithm 5 Learning context-deterministic $\Sigma$-grammars
    let $J := F := K := \emptyset$; let $\hat{G} := \mathrm{DUAL}_1(J, F, K)$;
    for $n = 1, 2, \dots$ do
        if $\mathcal{L}(\hat{G}) = L_*$ (equivalence query) then
            output $\hat{G}$ and halt;
        else if the given counterexample $d$ is positive ($d \in L_* - \mathcal{L}(\hat{G})$) then
            let $J := J \cup \mathrm{CON}(d)$ and $F := F \cup \mathrm{FUN}(d)$;
        else
            let $K := K \cup \mathrm{WITNESSD}(\tau)$ where $\tau$ is an (implicit) parse tree of $d$ by $\hat{G}$
        end if
        let $\hat{G} := \mathrm{DUAL}_1(J, F, K)$;
    end for

Otherwise, $\tilde{\tau}_i \in c_i^\dagger$ for all $i$, which means the rule $\rho$ is incorrect. WITNESSD($\tau$) returns the set $\{\tilde{\tau}_1, \dots, \tilde{\tau}_n\}$. Differently from the case of WITNESSP, an instance of a recursive call is always an (implicit) derivation tree of some $s \in \mathcal{S}|_d$. This explains why we do not need Assumption 4 in this case.
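The recursion of WITNESSD can be sketched for the CFG case as follows. This is an illustrative sketch rather than the paper's procedure: the `Node` class, the explicit (not implicit) tree representation, and the names `plug` and `witness_d` are assumptions introduced here, and the membership oracle is passed as a Python predicate.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Assumption: CFG case, where a context is a (left, right) pair of strings.
Context = Tuple[str, str]


@dataclass
class Node:
    """Derivation tree node for a rule [[c]] <- f<[[c1]], ..., [[cn]]>."""
    context: Context              # the single context c naming the left-hand side
    func: Callable[..., str]      # the function f of the rule
    children: List["Node"] = field(default_factory=list)

    def value(self) -> str:
        """The derived object of this subtree (written with a tilde in the text)."""
        return self.func(*(child.value() for child in self.children))


def plug(c: Context, s: str) -> str:
    """c ⊙ s in the CFG case."""
    return c[0] + s + c[1]


def witness_d(tau: Node, member: Callable[[str], bool]) -> List[str]:
    """Precondition: plug(tau.context, tau.value()) is not in the target language."""
    for child in tau.children:
        # A child that already violates its own context hides the faulty rule below it.
        if not member(plug(child.context, child.value())):
            return witness_d(child, member)
    # Every child fits its context, so the rule at the root of tau is incorrect.
    return [child.value() for child in tau.children]
```

Each recursive call receives a subtree of the original derivation tree, so every derived object inspected is a substructure of the counterexample, which is the point made in the paragraph above.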

Lemma 11. The time between one EQ and the next is polynomially bounded.

Lemma 12. Each time Algorithm 5 receives a negative counterexample, at least one incorrect rule is removed.

Lemma 13. Let $G_* = \langle N_*, \sigma_*, F_*, P_*, I_* \rangle$ be a context-deterministic grammar for $L_*$. Each time Algorithm 5 receives a positive counterexample, the set $\{X \in N_* \mid J \cap \mathcal{C}(G_*, X) = \emptyset\} \cup (F_* - F)$ shrinks.

6.6 Combined approaches

By combining primal and dual approaches, one can obtain stronger approaches (Yoshinaka, 2012). The class of $\Sigma$-grammars whose nonterminals admit either a $k$-kernel or an $l$-context can be learned by combining the techniques presented in Sections 6.2 and 6.4 under Assumptions 1, 2 and 3. Also, $\Sigma$-grammars whose nonterminals satisfy either the requirement to be congruential or to be context-deterministic can be learned with a minimally adequate teacher under Assumptions 1, 2, 3 and 4 (Sections 6.3 and 6.5).

7 Restricted cases

In some grammar classes, it may be the case that only (supersets of) $\mathcal{C}|_d$ and $\mathcal{F}|_d$ are computable in polynomial time but $\mathcal{S}|_d$ is not, or the other way around: $\mathcal{S}|_d$ and $\mathcal{F}|_d$ are efficiently computable but $\mathcal{C}|_d$ is not. For example, in non-permuting parallel multiple CFGs (Seki et al., 1991), elements of $\mathcal{S}|_d$ for a string $d$ are tuples of strings of the form $\langle v_1, \dots, v_m \rangle$ for $d = u_0 v_1 u_1 \dots v_m u_m$, and such tuples are polynomially many if $m$ is fixed. However, $\mathcal{C}|_d$ contains exponentially many contexts. Clark and Yoshinaka (2014) showed that a dual approach still works for parallel multiple CFGs if nonterminals are known to have $k$-contexts belonging to a certain subset $\mathbb{C} \subseteq \mathcal{C}$ such that $\mathbb{C}|_d = \mathcal{C}|_d \cap \mathbb{C}$ is polynomial-time computable. A symmetric result of a primal approach has also been obtained by Kanazawa and Yoshinaka (2015) targeting a certain kind of tree grammars. This section does not postulate Assumption 3.
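The polynomial bound on the substructures of a string can be made concrete by enumerating the factorizations $d = u_0 v_1 u_1 \dots v_m u_m$ directly. The sketch below is an illustration only: the function name `substring_tuples` is hypothetical, and it assumes that empty components $v_i$ are allowed. It picks $2m$ non-decreasing cut positions in $d$, so it visits $\binom{|d| + 2m}{2m}$ factorizations, which is polynomial in $|d|$ for fixed $m$.

```python
from itertools import combinations_with_replacement
from typing import Iterator, Tuple


def substring_tuples(d: str, m: int) -> Iterator[Tuple[str, ...]]:
    """Yield <v1, ..., vm> for every factorization d = u0 v1 u1 ... vm um.

    The same tuple may be yielded more than once (for example when some vi
    is empty); wrap the result in set() to obtain distinct tuples.
    """
    n = len(d)
    # Choose cut points 0 <= i1 <= j1 <= i2 <= j2 <= ... <= im <= jm <= n.
    for cuts in combinations_with_replacement(range(n + 1), 2 * m):
        # vi is the slice between the (2i-1)-th and (2i)-th cut points.
        yield tuple(d[cuts[2 * t]:cuts[2 * t + 1]] for t in range(m))
```

Because the cut points are non-decreasing, the components always appear in their original left-to-right order, which matches the non-permuting restriction.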

Definition 6. A $\Sigma$-grammar $G$ is said to have the $(k, \mathbb{S})$-FKP if every nonterminal admits a $k$-kernel which is a subset of $\mathbb{S}$.

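Assumption 5. There are polynomial-time algorithms that compute SUB($d$), CON($d$) and FUN($d$) such that $\mathbb{S}|_d \subseteq \mathrm{SUB}(d) \subseteq \mathbb{S}$, $\mathcal{C}|_d \subseteq \mathrm{CON}(d) \subseteq \mathcal{C}$ and $\mathcal{F}|_d \subseteq \mathrm{FUN}(d) \subseteq \mathcal{F}$, where $\mathbb{S}|_d = \mathbb{S} \cap \mathcal{S}|_d$.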

It is not hard to see that Algorithm 2 works for learning $\Sigma$-grammars with the $(k, \mathbb{S})$-FKP under Assumptions 1, 2 and 5. All discussions in Section 6.2 hold for this restricted case.

The symmetric definition and assumption are as follows.

Definition 7. A $\Sigma$-grammar $G$ is said to have the $(k, \mathbb{C})$-FCP if every nonterminal admits a $k$-context which is a subset of $\mathbb{C}$.

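Assumption 6. There are polynomial-time algorithms that compute SUB($d$), CON($d$) and FUN($d$) such that $\mathcal{S}|_d \subseteq \mathrm{SUB}(d) \subseteq \mathcal{S}$, $\mathbb{C}|_d \subseteq \mathrm{CON}(d) \subseteq \mathbb{C}$ and $\mathcal{F}|_d \subseteq \mathrm{FUN}(d) \subseteq \mathcal{F}$.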

It is not hard to see that under Assumptions 1, 2 and 6, Algorithm 4 works for learning $\Sigma$-grammars with the $(k, \mathbb{C})$-FCP. All discussions in Section 6.4 hold for this restricted case.


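Assumption 7. There are sets $\mathbb{S} \subseteq \mathcal{S}$ and $\mathbb{C} \subseteq \mathcal{C}$ such that for every nonterminal $X$ of $G \in \mathcal{G}(\Sigma)$, we have $\mathcal{S}(G, X) \cap \mathbb{S} \ne \emptyset$ and $\mathcal{C}(G, X) \cap \mathbb{C} \ne \emptyset$. Moreover, there are polynomial-time algorithms that compute SUB($d$), CON($d$) and FUN($d$) such that $\mathbb{S}|_d \subseteq \mathrm{SUB}(d) \subseteq \mathbb{S}$, $\mathbb{C}|_d \subseteq \mathrm{CON}(d) \subseteq \mathbb{C}$ and $\mathcal{F}|_d \subseteq \mathrm{FUN}(d) \subseteq \mathcal{F}$.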

Under Assumptions 1, 2 and 7, Algorithm 1 works using either SUBSTP or SUBSTD.

On the other hand, the results on the polynomial-time MAT learnability of congruential and context-deterministic $\Sigma$-grammars no longer hold under any of Assumptions 5, 6 and 7.

8 Extending learnable classes

This section compares learnable classes of $\Sigma$-languages for different $\Sigma$ with the same special sort $O_*$. For $\Sigma_1$ and $\Sigma_2$ with $\Sigma_i = \langle \Omega_i, \mathcal{F}_i, O_* \rangle$, if $\Omega_1 \subseteq \Omega_2$ and $\mathcal{F}_1 \subseteq \mathcal{F}_2$, every $\Sigma_1$-grammar is a $\Sigma_2$-grammar, so $\mathcal{L}(\Sigma_1) \subseteq \mathcal{L}(\Sigma_2)$. However, since the distributional properties defined so far are relative to a signature, a $\Sigma_1$-grammar with a distributional property under $\Sigma_1$ does not necessarily have the corresponding property under $\Sigma_2$. Yet if $\mathcal{S}_O$ and $\mathcal{C}_O$ are preserved by moving from $\Sigma_1$ to $\Sigma_2$, the distributional properties other than substitutability are preserved.

Let us define the direct union $\Sigma_0 = \langle \Omega_0, \mathcal{F}_0, O_* \rangle$ of arbitrary signatures $\Sigma_1$ and $\Sigma_2$ by $\Omega_0 = \{O_*\} \cup \{(O, i) \mid O \in \Omega_i \text{ with } i \in \{1, 2\}\}$ where $O_{(O,i)} = \{(s, i) \mid s \in O\}$ and $\mathcal{F}_0 = \mathcal{G}_1 \cup \mathcal{G}_2 \cup \{\square_1, \square_2\}$, where $\mathcal{G}_i$ is a trivial variant of $\mathcal{F}_i$ working on the new domain and codomain of the form $(O, i)$ and $\square_i(s, i) = s$ for all $s \in O_*$. Then every $\Sigma_i$-grammar $G$ can be seen as a special type of $\Sigma_0$-grammar by adding a new initial symbol $Z$ and rules of the form $Z \leftarrow \square_i\langle X \rangle$ for all initial symbols $X$ of $G$. We have $\mathcal{L}(\Sigma_1) \cup \mathcal{L}(\Sigma_2) \subseteq \mathcal{L}(\Sigma_0)$. Every $\Sigma_i$-grammar that is congruential, context-deterministic, with the $k$-FKP or with the $k$-FCP for $i = 1, 2$ can be seen as a $\Sigma_0$-grammar with those properties. Note that $\mathcal{C}_{O_*}$ is the singleton of the identity function in $\Sigma_0$, which means any element of $\mathcal{L}(G)$ is a 1-kernel of the new initial symbol $Z$. In this way, from two signatures, one can obtain a richer learnable class of languages.

The above argument on signature generalization does not hold for the substitutable case. Rather, the opposite holds. If $\Omega_1 \subseteq \Omega_2$ and $\mathcal{F}_1 \subseteq \mathcal{F}_2$, then a language substitutable under $\Sigma_2$ is substitutable under $\Sigma_1$ but not vice versa.

Let us say that $\Sigma_2$ is finer than $\Sigma_1$ if every sort of $\Omega_1$ is partitioned into a finite number of sorts in $\Omega_2$ and every function of $\mathcal{F}_2$ is a subfunction of some function in $\mathcal{F}_1$ which accords with the partition. That is, every sort $O$ of $\Omega_1$ has a finite set $\Omega^2_O \subseteq \Omega_2$ such that $O = \bigcup \Omega^2_O$ and $\mathcal{F}^1_{O_0, \dots, O_n} = \bigcup \{\mathcal{F}^2_{O'_0, \dots, O'_n} \mid O'_i \in \Omega^2_{O_i}\}$. For instance, $\Sigma_{k,l}$ is finer than $\Sigma_{k',l'}$ for $k' \le k$ and $l' \le l$ in Example 4. If $\Sigma_2$ is finer than $\Sigma_1$, $\mathcal{L}(\Sigma_1) = \mathcal{L}(\Sigma_2)$ holds. Every language substitutable under $\Sigma_1$ is substitutable under $\Sigma_2$ but not vice versa. Moreover, every congruential (resp. context-deterministic) $\Sigma_1$-grammar has an equivalent congruential (resp. context-deterministic) $\Sigma_2$-grammar but not vice versa.

9 Grammars with partial functions

Yoshinaka (2015) showed that a dual approach can be applied to the learning of conjunctive grammars. Conjunctive grammars (Okhotin, 2001) are CFGs extended with the conjunctive operation $\&$ so that one can extract the intersection of the languages of nonterminals. For example, a conjunctive rule $A_0 \to A_1 \& A_2$ means that if both $A_1$ and $A_2$ generate the same string $u$ then so does $A_0$. Conjunctive grammars cannot be seen as $\Sigma$-grammars, since the conjunctive operation $\&$ is a partial function whose domain is not represented as the direct product of two sorts, which is not legitimate in the general framework of $\Sigma$-grammars.

A partial signature is a triple $\Pi = \langle \Omega, \mathcal{F}, O_* \rangle$ which is defined in a way similar to a (total) signature, but $\mathcal{F}$ may have partial functions. Accordingly, contexts in $\mathcal{C}$ will be partial functions. We do not have $\mathcal{C}(G, X) \odot \mathcal{S}(G, X) \subseteq \mathcal{L}(G)$ any more, since $c \odot s$ may not be defined for some elements $c \in \mathcal{C}(G, X)$ and $s \in \mathcal{S}(G, X)$. The correspondence between $O$-concept lattices and $\Sigma$-grammars collapses. This prevents the application of the theory of distributional learning developed in this paper to $\Pi$-grammars. Still, we can generalize the discussion on the learning of conjunctive grammars.


$C_X \subseteq \mathcal{C}_O$ with $|C_X| \le k$ such that

$\mathcal{S}(G, X) = \{s \mid c \odot s \in L \text{ for all } c \in C_X\}$.

Definition 8 requires every $c \in C_X$ to be total on $\mathcal{S}(G, X)$. One can learn $\Pi$-grammars with the strong $k$-FCP under Assumptions 1, 2 and 6, where $\mathbb{C}$ consists of total functions only. The grammar construction DUAL$_k$ should be modified so that we have a rule $[\![C_0]\!] \leftarrow f\langle [\![C_1]\!], \dots, [\![C_n]\!] \rangle$ if $c \odot f(s_1, \dots, s_n) \in L_*$ for any $c \in C_0$ and $s_i \in C_i^{(K)}$ such that $f(s_1, \dots, s_n)$ is defined. One might think that one can naturally define context-deterministic grammars accordingly: every $c \in \mathcal{C}(G, X)$ should be a 1-context of $X$. However, this means that functions in such a $\Pi$-grammar are essentially total.

Acknowledgments

The view presented in this paper has been sharpened through interactions and discussions with several researchers with whom I worked on the distributional learning of generalized CFGs. I would like to express my deepest gratitude to Alexander Clark, Makoto Kanazawa, Anna Kasprzik and Gregory Kobele. Without them this work would have been hard to accomplish. Any insufficiencies or errors in this paper are of course my own.

References

Alexander Clark and Rémi Eyraud. 2007. Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research, 8:1725–1745.

Alexander Clark and Ryo Yoshinaka. 2014. Distributional learning of parallel multiple context-free grammars. Machine Learning, 96(1-2):5–31.

Alexander Clark, Rémi Eyraud, and Amaury Habrard. 2009. A note on contextual binary feature grammars. In EACL 2009 workshop on Computational Linguistic Aspects of Grammatical Inference, pp. 33–40.

Alexander Clark. 2010a. Distributional learning of some context-free languages with a minimally adequate teacher. In J. Sempere and P. García, editors, ICGI, LNCS 6339, pp. 24–37. Springer.

Alexander Clark. 2010b. Learning context free grammars with the syntactic concept lattice. In J. Sempere and P. García, editors, ICGI, LNCS 6339, pp. 38–51. Springer.

Alexander Clark. 2010c. Towards general algorithms for grammatical inference. In M. Hutter, F. Stephan, V. Vovk, and T. Zeugmann, editors, ALT, LNCS 6331, pp. 11–30. Springer.

Yuichi Kaji, Ryuichi Nakanishi, Hiroyuki Seki, and Tadao Kasami. 1992. The universal recognition problems for parallel multiple context-free grammars and for their subclasses. IEICE Transaction on Information and Systems, E75-D(7):499–508.

Makoto Kanazawa and Ryo Yoshinaka. 2015. Distributional learning and context/substructure enumerability in non-linear tree grammars. In Formal Grammar - 20th International Conference, FG 2015, Barcelona, Spain, August 8-9, 2015. Proceedings. To appear.

Anna Kasprzik and Ryo Yoshinaka. 2011. Distributional learning of simple context-free tree grammars. In J. Kivinen, C. Szepesvári, E. Ukkonen, and T. Zeugmann, editors, ALT, LNCS 6925, pp. 398–412. Springer.

Alexander Okhotin. 2001. Conjunctive grammars. Journal of Automata, Languages and Combinatorics, 6(4):519–535.

Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. 1991. On multiple context-free grammars. Theoretical Computer Science, 88(2):191–229.

Hiromi Shirakawa and Takashi Yokomori. 1993. Polynomial-time MAT learning of c-deterministic context-free grammars. Transaction of Information Processing Society of Japan, 34:380–390.

Ryo Yoshinaka and Makoto Kanazawa. 2011. Distributional learning of abstract categorial grammars. In S. Pogodalla and J.-P. Prost, editors, LACL, LNCS 6736, pp. 251–266. Springer.

Ryo Yoshinaka. 2008. Identification in the limit of k, l-substitutable context-free languages. In A. Clark, F. Coste, and L. Miclet, editors, ICGI, LNCS 5278, pp. 266–279. Springer.

Ryo Yoshinaka. 2011a. Efficient learning of multiple context-free languages with multidimensional substitutability from positive data. Theoretical Computer Science, 412(19):1821–1831.

Ryo Yoshinaka. 2011b. Towards dual approaches for learning context-free grammars based on syntactic concept lattices. In G. Mauri and A. Leporati, editors, DLT, LNCS 6795, pp. 429–440. Springer.

Ryo Yoshinaka. 2012. Integration of the dual approaches in the distributional learning of context-free grammars. In A. H. Dediu and C. Martín-Vide, editors, LATA, LNCS 7183, pp. 538–550. Springer.