The neural network pushdown automaton: Architecture, dynamics and training


The Neural Network Pushdown Automaton: Model, Stack and Learning Simulations

UNIVERSITY OF MARYLAND TR NOs. UMIACS-TR-93-77 & CS-TR-3118

August 20, 1993 - Revised January, 1995

G.Z. Sun (a,b), C.L. Giles (b,c), H.H. Chen (a,b) and Y.C. Lee (a,b)

(a) Laboratory For Plasma Research, (b) Institute for Advanced Computer Studies

University of Maryland, College Park, MD 20742

and

(c) NEC Research Institute

4 Independence Way, Princeton, NJ 08540

Abstract

In order for neural networks to learn complex languages or grammars, they must have sufficient computational power or resources to recognize or generate such languages. Though many approaches to effectively utilizing the computational power of neural networks have been discussed, an obvious one is to couple a recurrent neural network with an external stack memory - in effect creating a neural network pushdown automaton (NNPDA). This NNPDA generalizes the concept of a recurrent network so that the network becomes a more complex computing structure. This paper discusses in detail a NNPDA - its construction, how it can be trained and how useful symbolic information can be extracted from the trained network.

To effectively couple the external stack to the neural network, an optimization method is developed which uses an error function that connects the learning of the state automaton of the neural network to the learning of the operation of the external stack: push, pop, and no-operation. To minimize the error function using gradient descent learning, an analog stack is designed such that the action and storage of information in the stack are continuous. One interpretation of a continuous stack is the probabilistic storage of and action on data. After training on sample strings of an unknown source grammar, a quantization procedure extracts from the analog stack and neural network a discrete pushdown automaton (PDA). Simulations show that in learning deterministic context-free grammars - the balanced parenthesis language, $1^n0^n$, and the deterministic palindrome - the extracted PDA is correct in the sense that it can correctly recognize unseen strings of arbitrary length. In addition, the extracted PDAs can be shown to be identical or equivalent to the PDAs of the source grammars which were used to generate the training strings.

I. INTRODUCTION

Recurrent neural networks are dynamical network structures which have the capabilities of processing and generating temporal information. To our knowledge the earliest neural network model that processed temporal information was that of McCulloch and Pitts [McCulloch43]. Kleene [Kleene56] extended this work to show the equivalence of finite automata and McCulloch and Pitts’ representation of nerve net activity. Minsky [Minsky67] showed that any hard-threshold neural network could represent a finite state automaton and developed a method for actually constructing a neural network finite state automaton. However, many different neural network models can be defined as recurrent; for example see [Grossberg82] and [Hopfield82]. Our focus is on discrete-time recurrent neural networks that dynamically process temporal information, following in the tradition of recurrent network models initially defined by [Jordan86] and more recently by [Elman90] and [Pollack91]. In particular this paper develops a neural network pushdown automaton (NNPDA), a hybrid system that couples a recurrent network to an external stack memory. More importantly, a NNPDA should be capable of learning and recognizing some class of context-free grammars. As such, this model is a significant extension of previous work where neural network finite state automata simulated and learned regular grammars. We explore the capabilities of such a model by inferring automata from sample strings - the problem of grammatical inference. It is important to note that our focus is only on inference, not on prediction or translation. We are concerned with the problem of inferring an unknown system model from observed sample strings, not with predicting the next string element in a sequence.

1.1 Motivation

To enhance the computational power of a recurrent neural network finite state automaton to that of an infinite machine [Minsky67] requires an expansion of resources. One way to achieve this goal is to introduce a potentially infinite number of neurons but a finite set of uniformly distributed local connection weights per neuron. [Sun91] is an example of this approach and shows the Turing equivalence by construction. Another way to construct a neural network infinite machine is to allow infinite precision of neuron units but keep a finite size network (finite number of neurons and connection weights) [Siegelmann91, Pollack87]. Doing so is equivalent to constructing a more general nonlinear dynamic system with a set of continuous, recurrent state variables. Such a system in general would have rich dynamical behavior: fixed points, limit cycles, strange attractors and chaos, etc. However, how easily is such a system trained? In general, without additional knowledge it is almost impossible to train an infinite neural system to learn a desired behavior. In effect, putting constraints and a priori knowledge in learning systems has been shown to significantly enhance the practical capabilities of those systems.

The model we introduce has this flavor. It enhances the neural network by giving it an infinite memory - a stack - and constrains the learning model by permitting the network to operate on the stack in the standard pre-specified way - push, pop or no-operation (no-op). As such, this model can be viewed as: (1) a neural network system with some special constraints on an infinite neural memory, or (2) a hybrid system which couples an external stack memory (conventionally a discrete memory, but here a continuous stack) with a finite size neural network state automaton. There are many issues in connecting and training an external computational structure such as a stack to a neural network. For example, what form does the objective function take; when and how are the push/pop/no-op operations of the stack incorporated into the neural net; and after training, how can the learned rules be extracted? We provide a complete procedure for training such a neural network pushdown automaton.

1.2 Grammars and Grammatical Inference

Because this paper is concerned with new models of neural networks, we give only a brief explanation of grammars and grammatical inference. For more details, please see the enclosed references. Grammatical inference is the problem of inferring an unknown grammar from only grammatical string samples [Angluin83, Fu82, Gold78, Miclet90]. In the Chomsky hierarchy of phrase structured grammars [Harrison78, Hopcroft79, Partee90], the simplest grammars and their associated automata are regular grammars and finite state automata (FSA). Moving up in complexity in the Chomsky hierarchy, the next class is the context-free grammars (CFGs) and their associated recognizer - the pushdown automaton (PDA), where a finite state automaton has to control an external stack memory in addition to its own state transition rules. For all classes of grammars, the grammatical inference problem is in the worst case at least NP [Angluin83]. Because of the difficulty of this problem, we feel that training a neural network to learn grammars is a good testbed for exploring the network's computational capabilities. However, comparison of a neural network pushdown automaton with other methods for grammatical inference is not discussed. Our concern has only been with how such an architecture can be constructed, how it is trained and how it learns grammars from grammatical strings.

1.3 Outline of Paper

In the next section, we review some of the previous work on recurrent neural network finite state automata and work that extends the power of recurrent neural networks beyond that of a finite state automaton. We show that from the standpoint of representation, it is more computationally efficient to use a “real” external stack instead of the neural network emulator of stack memory [Pollack90]. In Section III we systematically introduce the model of the Neural Network Pushdown Automaton (NNPDA): the structure, the dynamics and the optimization (learning) algorithms. This model is substantiated by means of theoretical analysis of many of the related issues regarding its construction. The attempt there is to give a rigorous mathematical description of the NNPDA structure. We then illustrate the model by correctly learning the context-free languages: balanced parentheses and $1^n0^n$. A modified version of the NNPDA is then introduced to learn the more difficult palindrome grammar. The conclusion covers enhancements and further directions. In the Appendices, a detailed mathematical derivation of the crucial formula necessary for the training equations of the NNPDA is discussed. The key point is that in order to use the real-time recurrent learning (RTRL) algorithm [Williams89], we have to assume a recursion relation for all variables, which means that the NNPDA model must be approximated by a finite state automaton. In the Appendices, we discuss this paradox and show one solution to this problem.

II. RELATED WORK

In this section we review previous work related to the NNPDA. However, the general area of grammatical inference and language processing will not be covered; see for example [Angluin83, Fu82, Miclet90] and more recently the proceedings of the workshop on grammatical inference [Lucas93]. We only focus on neural network related research and, even there, only on work directly related to our model.

2.1 Recurrent Neural Network - Connectionist State Machine

Recurrent neural networks have been explored as models for representing and learning formal and natural languages. The basic structure of the recurrent networks, shown in Fig. 1, is that of a neural network finite state automaton (NNFSA) [Allen90, Cleeremans89, Giles92a, Horne92, Liu90, Mozer90, Noda92, Pollack91, Sanfeliu92, Watrous92]. More recently, [Nerrand93] formalizes recurrent networks in a finite-state canonical form. We will not directly discuss neural network finite state machines, i.e. NNFSA which have additional output symbols; see for example [Das91, Chen92]. The computational capabilities of recurrent networks were discussed more recently by [Giles92a, Pollack91, Siegelmann92].

All of the recurrent network models discussed will be higher-order. We and others have found that these models can be extremely useful and more powerful for representing specific computational constructs in neural networks; for a discussion of their use see the following papers [Lee86, Goudreau94, Miller93, Pao89, Perantonis92, Pollack87, Psaltis88, Watrous92]. (It is easy to see that higher order terms are more general than sigma-pi [Rumelhart86a] or pi-sigma [Ghosh92] expressions.) Using second order connection weights, the recurrent dynamics of the state neurons can be given by

$$S_i^{t+1} = g\Big(\sum_{j,k} W_{ijk}\, S_j^t I_k^t + \theta_i\Big), \tag{1}$$

where $S_i^t$ is the activity of the $i$th state neuron at time step $t$, $I_k^t$ is the $k$th component of the input symbol at time step $t$, $g$ is the nonlinear operator, usually the sigmoid function $g(x) = 1/(1+\exp(-x))$, and $\theta_i$ is the bias term for the $i$th neuron. When a temporal sequence of length $T$, $\{I^1, I^2, I^3, ..., I^T\}$, is fed into the recurrent net, the input symbol $I^t$ at each time step together with the current state $S^t$ (the initial state is assigned) are the “input” to the network and the “output” is the next state $S^{t+1}$. The recurrent network therefore acts like a state automaton. At the end of an input string, an end symbol is given to the network and the output in the last state neuron is checked to determine the classification category of the input string. This neural network finite state automaton (NNFSA) can be used to recognize strings that belong to a regular grammar. The work of [Cleeremans89, Giles92a, Giles92b, Liu90, Omlin92, Pollack91, Watrous92, Zeng93] has shown the possibility of using neural networks to perform grammatical inference on regular grammars, i.e. to find a “useful set” of production rules P from only a finite set of sample training strings.

Fig. 1. A simple structure of a recurrent neural network, where $I^t$ and $S^t$ represent the current input and state, and $S^{t+1}$ is the next state.
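The second-order update of Eq. (1) can be illustrated with a minimal Python sketch. The parity automaton, the gain value `GAIN`, the one-hot symbol codes `ZERO` and `ONE`, and the function names are all assumptions chosen for illustration; they are not taken from the paper, and the weights are set by hand rather than learned. The end symbol mentioned above is omitted for brevity.

```python
import math


def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def nnfsa_step(weights, biases, state, symbol):
    """Apply Eq. (1) once: S_i(t+1) = g(sum_jk W_ijk S_j(t) I_k(t) + theta_i)."""
    next_state = []
    for i, bias in enumerate(biases):
        net = bias
        for j, s_j in enumerate(state):
            for k, i_k in enumerate(symbol):
                net += weights[i][j][k] * s_j * i_k
        next_state.append(sigmoid(net))
    return next_state


def run_string(weights, biases, initial_state, symbols):
    """Feed a sequence of one-hot input symbols through the recurrent net."""
    state = list(initial_state)
    for symbol in symbols:
        state = nnfsa_step(weights, biases, state, symbol)
    return state


# Hypothetical hand-set weights for a two-state parity automaton:
# state neuron 0 = even number of ONEs seen, state neuron 1 = odd.
GAIN = 20.0
ZERO, ONE = (1.0, 0.0), (0.0, 1.0)
weights = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
# (current state j, input symbol index k) -> next state i
for (j, k), i in {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}.items():
    weights[i][j][k] = GAIN
biases = [-GAIN / 2, -GAIN / 2]

final_state = run_string(weights, biases, [1.0, 0.0], [ONE, ONE, ZERO, ONE])
print(final_state)  # the second component dominates: an odd number of ONEs
```

Each nonzero weight here corresponds to exactly one transition of the automaton, which is the sense in which a second-order network can represent a finite state machine directly.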

One of the limitations of the NNFSA is its difficulty in processing higher level languages. A “brute-force” method to enhance the computational power of a NNFSA is to increase the size of the existing neural network structure (or increase the precision of the neuron units in the network) while training on a more complex language, say a context-free grammar [Allen90]. The assumption is that the size of the neural network has no bound, but the knowledge gained as the network grows gives clues to the representation of the underlying grammar and its associated machine ([Crutchfield91] uses this approach to show that context-free grammars are generated by a nonlinear system on the edge of chaos). But in practice gaining this knowledge is difficult. What usually happens is that the trained NNFSA will only recognize the language up to a certain string length (in effect, a regular grammar). For the NNFSA to generalize correctly on longer unseen strings, the NNFSA needs to be re-trained on those strings. Thus, we argue that this method of knowledge representation is in itself inefficient.

2.2 Recurrent Neural Network - Beyond the Finite State Automaton

There has been a great deal of effort to enhance the power of recurrent neural networks by increasing the precision or size of the network or by coupling it with an external, potentially infinite, memory. The work of [Williams89] coupled a recurrent neural network to a memory tape to emulate a Turing machine and to learn the state automaton controller for the balanced-parentheses grammar (a context-free grammar). More specifically, a recurrent network was trained to be the correct finite-state controller of a given Turing machine by supervising the input-output pairs, where the input is the tape reading from a target Turing machine and the output is the desired action of the finite controller. The important distinction between the NNPDA model and that of [Williams89] is in the training: in particular, the behavior of their target controller was known a priori and not learned. In the most general case of grammatical inference the transition rules of the target machine are not known beforehand; only the classification for each training sequence is known. The NNPDA model we describe allows the NNPDA itself to “figure out” how to construct a neural net controller that knows both the state transition rules and, in addition, how to use and manipulate the tape or stack.

Closely related work is the RAAM model of [Pollack90], which proposed an “internal” neural network model of stack memory as a plausible model for cognitive processing. Let us consider using this model to build a NNPDA. As shown in Fig. 2, the “push” and “pop” actions onto the stack are emulated by a coder and a decoder separately, where “STACK1”, “STACK2” and “STACK3” are neuron arrays of the same size and “TOP” represents the symbol(s) on the top of the stack. The training can be performed by concatenating the network in Fig. 2(b) with the network in Fig. 2(a) and using error back-propagation. The desired outcome requires “STACK3” to be identical to “STACK1”. This recursive distributed representation of a stack memory may be of particular interest to cognitive models of language processing.

Fig. 2. A neural network emulator of a stack proposed by [Pollack90]. (a) Coding process emulates a “push” action onto a stack. (b) Decoding process emulates a “pop” action from a stack.

However, as a computational model this structure has drawbacks. First, this recursive structure is identical to a NNFSA, where the “STACK” configurations correspond to internal neural states. In other words, this model transfers the complexity of stack manipulation to NNFSA state transitions. For a stack with limited length, this model is equivalent to training a FSA with a small number of states. But in general, such a model will be limited since, theoretically, the stack represents a potentially infinite number of states. Even for a limited length stack, this model is inefficient. To illustrate this, consider a stack with length L and number of symbols N. The total number of possible configurations of the stack is

$$N_s = \sum_{l=0}^{L} N^l = \frac{N^{L+1}-1}{N-1} \sim N^L. \tag{2}$$

If we wish to build a distributed memory of internal states that behaves like a stack, we need to construct (or learn) a NNFSA with $N^L$ internal states. The required memory size of neurons (or weights) will scale as $\sim N^L$, which severely limits the usefulness of the internal neural network stack.
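To get a feel for the growth in Eq. (2), the following short Python sketch evaluates the sum and its closed form. The function names and the sample values of N and L are illustrative assumptions.

```python
def stack_configurations(num_symbols, max_length):
    """Count stack contents of length 0..max_length, i.e. the sum in Eq. (2)."""
    return sum(num_symbols ** length for length in range(max_length + 1))


def closed_form(num_symbols, max_length):
    """Closed form (N^(L+1) - 1) / (N - 1) of Eq. (2), valid for N > 1."""
    return (num_symbols ** (max_length + 1) - 1) // (num_symbols - 1)


# A binary stack alphabet (N = 2) with increasing maximum depth L.
for depth in (5, 10, 20):
    print(depth, stack_configurations(2, depth), closed_form(2, depth))
```

Even with only two stack symbols, a depth of 20 already requires on the order of two million distinct internal states.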

Other closely related work is the connectionist Turing machine models of [Siegelmann92, Pollack87]. They showed that a stack can be simulated in terms of binary representations of a fractional number which are manipulated by neural network generated actions. The focus of this work was initially on “representational” issues and not on a “practical” learning system. Their proposed stacks use a fractional number represented in terms of a sequence of binary symbols “0” and “1”. A “pop” action removes the leading bit from the fraction and can be simulated by two consecutive numerical operations: multiplication by two and subtraction of the leading bit. A “push” is represented by adding “0” or “1” to the original stack and dividing the sum by two. This stack model is clearly as efficient as the conventional discrete stack. An additional feature is its simple representation -- a fractional number. However, for learning, these stack models have the problem that they are not easily coupled to gradient-based learning algorithms. This is because, although a fractional number is continuous, any small perturbation of the fraction can cause a discrete change of the stack content that this fraction represents.
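The push and pop arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the two operations as stated in the text, using exact rationals from the standard `fractions` module; the function names are assumptions, and the simplified encoding does not distinguish an empty stack from a stack of zeros, an issue the cited constructions have to handle separately.

```python
from fractions import Fraction


def push(stack, bit):
    """Push a bit: add it to the fraction and divide the sum by two."""
    return (stack + bit) / 2


def pop(stack):
    """Pop the leading bit: multiply by two, then subtract the leading bit."""
    doubled = 2 * stack
    top = 1 if doubled >= 1 else 0
    return top, doubled - top


# Push the bits 1, 1, 0; the fraction becomes 0.011 in binary (3/8).
stack = Fraction(0)
for bit in (1, 1, 0):
    stack = push(stack, bit)

# Popping returns the bits in last-in, first-out order.
popped = []
remaining = stack
for _ in range(3):
    top, remaining = pop(remaining)
    popped.append(top)
print(stack, popped, remaining)

# A tiny perturbation of the fraction flips the symbol read from the top.
top_exact, _ = pop(Fraction(1, 2))
top_perturbed, _ = pop(Fraction(1, 2) - Fraction(1, 1024))
print(top_exact, top_perturbed)
```

The last two lines show the difficulty for gradient-based learning: moving the fraction by 1/1024 changes the top symbol from 1 to 0, so the stack content is not a smooth function of the number that encodes it.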

Finally, an interesting model developed by [Lucas90] proposes an entirely different method for learning context-free grammars with a neural network. [Lucas90] maps the production rules of the CFG, both terminals and nonterminals, directly into neural networks and shows some preliminary results for character recognition. ([Frasconi93, Giles93, Sanfeliu92] illustrate similar techniques for mapping regular grammars into recurrent networks.)

The NNPDA model with an external continuous stack and its learning algorithm were originally proposed in short papers [Giles90, Sun90a, Sun90b]. Recently [Das92] showed benchmark experiments with different orders of connection weights for the NNPDA and pointed out that third order weights were better than first or second order. [Das93] showed the advantage of using hints in learning CFGs. Recent work of [Mozer93] also shows that the continuous stack can be used to manipulate the “continuous rewrite rules” necessary to parse context-free grammars. [Zeng94] showed that when a recurrent network controlling an external stack is trained by a pseudo-gradient method and discretized during training, the trained NNPDA can successfully classify strings of arbitrary length.

III. NEURAL NETWORK PUSHDOWN AUTOMATA

In this section, the NNPDA model is thoroughly described. The schematic diagram of the neural network pushdown automaton (NNPDA) is shown in Fig. 3. This NNPDA, after being trained, will hopefully be able to represent the underlying grammar of the given training set (we assume that for each of our training sets there is a unique underlying grammar) and be able to correctly classify all unseen input strings generated by an unknown CFG. To use the NNPDA as a classifier, input strings are fed into the NNPDA one character at a time, and the “error function” at the end of each string sequence decides the classification. It is important to note that all grammars and automata discussed in this paper are deterministic.

The proposed NNPDA consists of two major components: a recurrent neural network controller and an external continuous stack memory. The structure and working mechanism of these two components will be described in detail in subsections 3.1 and 3.2. A brief introduction of the NNPDA dynamics follows. The neural network controller consists of four types of neurons: input neurons, state neurons, action neurons and stack reading neurons; and the stack is simply a conventional stack with analog symbol “length”. At each time step, the recurrent neural network can be considered an input-output mapping. The inputs to the mapping are the current internal state $S^t$, the input symbol $I^t$ and the stack reading $R^t$. The outputs are the next internal state $S^{t+1}$ and the stack action $A^{t+1}$. This action is performed on the external stack, which in turn renews the next stack reading $R^{t+1}$. This new stack reading together with the new internal state $S^{t+1}$ and the new input symbol $I^{t+1}$ serves as a new input for another input-output mapping. At the end of the input sequence, the content of the internal state and stack determines whether or not the input string is legal.

During the training stage, the weights of the recurrent neural net will be modified to minimize the error function, which is fully discussed in subsections 3.4 and 3.5. In some sense the learning can be thought of as unsupervised or reinforcement style learning, because (a) no credit assignment is made before the end of input sequences and (b) the system can extract the classification rules automatically from the input examples.

3.1 Neural Network Controller

The neural network controller is an extended version of the neural network finite state automaton (NNFSA) previously described in [Giles92a, Liu90]. It is still a high order recurrent neural network (Fig. 3). The difference is that the NNPDA introduces additional input and output neurons (and, of course, the external stack). The “hidden” recurrent neurons $\{S_i, i=1,2,...,N_S\}$ represent the internal states of the system to be learned. The input neurons $\{I_i, i=1,2,...,N_I\}$ are each associated with a particular input symbol (a localist or one-hot encoding scheme). These two groups of neurons are the same as those of the NNFSA. The additional “nonrecurrent” input neurons $\{R_i, i=1,2,...,N_R\}$ represent the stack content read from the top of the stack memory. The additional “nonrecurrent” output neurons $\{A_i, i=1,2,...,N_A\}$ represent the action values that operate the stack (pushes, pops or no-operations). The state neurons are fed back into themselves after a one time step delay (Fig. 3).

Fig. 3. The schematic diagram of the Neural Network Pushdown Automaton (NNPDA), where a high-order recurrent network is coupled with an external continuous stack. The inputs to the neural net are the current internal states ($S^t$), input symbols ($I^t$) and the stack reading ($R^t$, read from the top of the stack with unit depth). The outputs from the neural net are the next internal state ($S^{t+1}$) and the stack action ($A^{t+1}$, a push or pop with depth $|A|$). This action is performed on the external stack, which in turn renews the next stack reading ($R^{t+1}$). The weights of the recurrent neural network controller are trained by minimizing the error function, which is a function of the final state and the stack length at the end of the input string.

The discrete time dynamics of the neural network controller can be written in general form as

$$S^{t+1} = G(S^t, R^t, I^t; W^s), \qquad A^{t+1} = F(S^t, R^t, I^t; W^a), \tag{3}$$

where $S^t$, $R^t$ and $I^t$ are vectors of internal state, stack reading and input symbol at time $t$, and $W^s$ and $W^a$ represent the weight matrices for the state dynamics and action mappings. It is seen from Eq. (3) that for a full description of the dynamics, we need another equation for the stack reading $R^t$. In general, this function could be written as

$$R^t = F(A^1, A^2, \ldots, A^t, I^1, I^2, \ldots, I^t). \tag{4}$$

The combination of Eqs. (3) and (4) describes a dynamical process for the system “state variables” $\{S^t, R^t, A^t\}$ that evolves in time as a function of an input sequence $\{I^1, I^2, I^3, ..., I^T\}$, given a set of initial values of $S^0$, $R^0$ and $A^0$. However, this is not a state machine, because Eq. (4) indicates that there does not exist a simple recursive function for the stack reading $R^t$. The value of $R^t$ depends on the entire history of input and actions (or equivalently, $R^t$ depends on weight matrices and input history). This mapping of $R^t$ is highly nonlinear and is determined by the definition of the stack mechanism, which will be discussed later in detail. To be exact, the so-called neural network controller is defined only by Eq. (3).

To decide the proper structure of the neural network controller, both the neural representations and the target mapping functions need to be known. For discrete pushdown automata, the mappings (or transition rules) are third-order in nature, by which we mean that each transition rule is a unique mapping from a third-order combination $\{S^t \times R^t \times I^t\}$ to its output, the next state $S^{t+1}$ and stack action $A^{t+1}$. Assume that unary representations of $I^t$, $R^t$ and $S^t$ are employed. For instance let $I^t$ = (1, 0, 0), (0, 1, 0) and (0, 0, 1) represent symbols a, b and c, and $S^t$ = (1, 0) and (0, 1) the two different states. It is easily seen that any transition rule $\{S_j^t, R_k^t, I_l^t\} \to S_i^{t+1}$ or $A_i^{t+1}$ could be coded into two four-dimensional matrices $W^s_{ijkl}$ and $W^a_{ijkl}$, each component being a binary value 0 or 1 (for $W^s_{ijkl}$), or a ternary value 1, 0, -1 (for $W^a_{ijkl}$). For example, the state transition rule $\{S(j), R(k), I(l)\} \to S(i)$ means that if the input symbol is the $l$th symbol, the stack reading is the $k$th symbol and the internal state is the $j$th state, then the next state will be the $i$th state. This rule would be coded as $W^s_{ijkl} = 1$ and $W^s_{mjkl} = 0$, $m \neq i$. Similarly, $W^a_{ijkl}$ = [1, 0, -1] implies a mapped action [push, no-op, pop] of $A_i^{t+1}$. In this way we show that any deterministic PDA could be implemented by a third order, one layer recurrent neural network with discrete neural activity function. Particularly, if the NNPDA’s neural network controller is represented by third-order nets of the form

$$S_i^{t+1} = g\Big(\sum_{j,k,l} W^s_{ijkl}\, S_j^t R_k^t I_l^t + \theta_i^s\Big), \qquad A_i^{t+1} = f\Big(\sum_{j,k,l} W^a_{ijkl}\, S_j^t R_k^t I_l^t + \theta_i^a\Big), \tag{5}$$

the existence of a solution to any given PDA would be guaranteed upon proper quantization of the nonlinear functions $g(x)$ and $f(x)$. During learning, the sigmoid function $g(x)$ is used and $f(x)$ is defined as $f(x) = 2g(x) - 1$.
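The encoding of transition rules into third-order weights can be made concrete with a small Python sketch. Everything specific here is an assumption for illustration: the rule table is a hypothetical fragment of a PDA for $1^n0^n$, a single action neuron is used (so the action weights carry only the indices j, k, l), and the hard thresholds at ±0.5 stand in for the "proper quantization" of $g$ and $f$, which the text does not specify.

```python
def encode_rules(rules, n_states, n_reads, n_inputs):
    """Build binary W^s[i][j][k][l] and ternary W^a[j][k][l] from rules.

    rules maps (state j, stack reading k, input l) -> (next state i, action),
    with action +1 (push), 0 (no-op) or -1 (pop).
    """
    w_state = [[[[0] * n_inputs for _ in range(n_reads)]
                for _ in range(n_states)] for _ in range(n_states)]
    w_action = [[[0] * n_inputs for _ in range(n_reads)]
                for _ in range(n_states)]
    for (j, k, l), (i, action) in rules.items():
        w_state[i][j][k][l] = 1
        w_action[j][k][l] = action
    return w_state, w_action


def third_order_sum(w, state, reading, symbol):
    """Compute sum_jkl w[j][k][l] * S_j * R_k * I_l."""
    return sum(w[j][k][l] * s * r * x
               for j, s in enumerate(state)
               for k, r in enumerate(reading)
               for l, x in enumerate(symbol))


def discrete_step(w_state, w_action, state, reading, symbol):
    """One step of Eq. (5) with hard-threshold (quantized) g and f."""
    next_state = [1 if third_order_sum(w_i, state, reading, symbol) > 0.5 else 0
                  for w_i in w_state]
    net = third_order_sum(w_action, state, reading, symbol)
    action = 1 if net > 0.5 else (-1 if net < -0.5 else 0)
    return next_state, action


# Hypothetical rule fragment: input index 0 is "1", input index 1 is "0";
# reading index 0 is an empty stack, reading index 1 is "1" on top.
rules = {
    (0, 0, 0): (0, 1),   # state 0, empty stack, input "1": stay, push
    (0, 1, 0): (0, 1),   # state 0, "1" on top, input "1": stay, push
    (0, 1, 1): (1, -1),  # state 0, "1" on top, input "0": go to state 1, pop
    (1, 1, 1): (1, -1),  # state 1, "1" on top, input "0": stay, pop
}
w_state, w_action = encode_rules(rules, n_states=2, n_reads=2, n_inputs=2)

# State 0, "1" on top of the stack, input "0": expect state 1 and a pop.
result = discrete_step(w_state, w_action, [1, 0], [0, 1], [0, 1])
print(result)
```

With unary vectors exactly one product $S_j R_k I_l$ is nonzero at each step, so the sum simply looks up the weight of the active rule.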

However, this proof does not exclude solutions with other neural net structures and does not necessarily guarantee the best learning behavior with third-order weights for all problems. In practice, second-order weights were used for some problems and good training results were achieved. The recurrent updating formula for second-order networks can be written as

$$S_i^{t+1} = g\Big(\sum_{j,k} W^s_{ijk}\, S_j^t (R^t \oplus I^t)_k + \theta_i^s\Big), \qquad A_i^{t+1} = f\Big(\sum_{j,k} W^a_{ijk}\, S_j^t (R^t \oplus I^t)_k + \theta_i^a\Big), \tag{6}$$

where $R^t \oplus I^t$ is the concatenation of the two vectors $R^t$ and $I^t$, whose components are given by

$$(R^t \oplus I^t)_k = \begin{cases} R_k^t & \text{if } 0 < k \le N_R, \\ I_{k-N_R}^t & \text{if } N_R < k \le N_I + N_R. \end{cases} \tag{7}$$
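A minimal Python sketch of the second-order update in Eqs. (6) and (7) follows. The function names, layer sizes and the all-zero example weights are assumptions made for illustration; indices are zero-based in the code, whereas Eq. (7) counts k from 1.

```python
import math


def g(x):
    """Sigmoid used for the state neurons."""
    return 1.0 / (1.0 + math.exp(-x))


def f(x):
    """Action nonlinearity f(x) = 2 g(x) - 1, with values in (-1, 1)."""
    return 2.0 * g(x) - 1.0


def concatenate(reading, symbol):
    """Eq. (7): the first N_R components come from R^t, the rest from I^t."""
    return list(reading) + list(symbol)


def second_order_layer(weights, biases, state, joint, activation):
    """Compute activation(sum_jk W_ijk S_j joint_k + theta_i) for every i."""
    return [activation(bias + sum(weights[i][j][k] * s * v
                                  for j, s in enumerate(state)
                                  for k, v in enumerate(joint)))
            for i, bias in enumerate(biases)]


def second_order_step(w_state, b_state, w_action, b_action,
                      state, reading, symbol):
    """One application of Eq. (6), returning (next state, action vector)."""
    joint = concatenate(reading, symbol)
    next_state = second_order_layer(w_state, b_state, state, joint, g)
    action = second_order_layer(w_action, b_action, state, joint, f)
    return next_state, action


# Hypothetical sizes: 2 state neurons, 2 reading neurons, 3 input neurons,
# 1 action neuron. All-zero weights are used only to show the shapes.
zero_state_w = [[[0.0] * 5 for _ in range(2)] for _ in range(2)]
zero_action_w = [[[0.0] * 5 for _ in range(2)]]
next_state, action = second_order_step(
    zero_state_w, [0.0, 0.0], zero_action_w, [0.0],
    [1.0, 0.0], [0.2, 0.8], [1.0, 0.0, 0.0])
print(next_state, action)
```

Compared with the third-order form, each weight tensor here has $N_S \times N_S \times (N_R + N_I)$ entries instead of $N_S \times N_S \times N_R \times N_I$, which is the saving the second-order variant buys.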

Experiments and comparisons between NNPDAs with different orders of connection weights were discussed in [Das92]. In most cases the third-order weights gave better learning results.

The existence proof of the NNPDA controller discussed above is based on the assumption of unary representations of internal states and symbols (both input and reading symbols). For the stack reading $R^t$ and input $I^t$, a unary representation (or linearly independent vector representation) is necessary. This will be discussed in the next subsection. However, a unary representation of internal states may not be necessary. Moreover, to extract a discrete PDA, the procedure of state quantization is performed after learning and the quantized state vectors (often expressed in a binary form) are neither unary nor linearly independent. But during learning (especially on hard problems), we often encounter cases where we need to adjust independently the transitions between these linearly dependent state vectors. With third order weights the degrees of freedom are limited and each weight parameter is not associated with only one particular state transition, as it is in the case of unary representations. Therefore, learning could often be trapped at a local minimum. To solve this problem, we propose a “full-order” connected network and find it very useful in learning some hard problems, like the palindrome grammar. A “full-order” network is defined as one in which the order of the correlation is the product of all independent state neurons. The “full-order” network we used for one action output is

$$A^{t+1} = f\Big(\sum_{\{j\},k,l} W^a_{\{j\}kl}\, S^t_{\{j\}} R_k^t I_l^t + \theta^a\Big), \tag{8}$$

where the subscript $\{j\} \equiv \{j_1, j_2, ..., j_n\}$ represents all $2^n$ possible $n$-bit binary numbers ($j_m = 0, 1$; $m = 1, 2, ..., n$), and $n$ is the number of state neurons. The state vector $S^t_{\{j\}}$ is an $n$th order product of the components of $S^t$ defined as

$$S^t_{\{j\}} = \prod_{m=1}^{n} \big(j_m S_m^t + (1-j_m)(1-S_m^t)\big). \tag{9}$$

For example $S^t_{\{1101\}} = S_1^t S_2^t (1-S_3^t) S_4^t$ for a 4-state neuron net. In learning the palindrome grammar, the combination of Eq. (8) and the third order state dynamics of Eq. (5) led to successful training.
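The product in Eq. (9) is easy to tabulate. The Python sketch below expands an n-component state vector into its $2^n$ full-order components; the function name and the sample state values are assumptions for illustration.

```python
from itertools import product


def full_order_state(state):
    """Return {bits: S_{bits}} for all 2^n bit patterns, following Eq. (9)."""
    expanded = {}
    for bits in product((0, 1), repeat=len(state)):
        value = 1.0
        for j_m, s_m in zip(bits, state):
            # Factor is S_m when j_m = 1 and (1 - S_m) when j_m = 0.
            value *= j_m * s_m + (1 - j_m) * (1 - s_m)
        expanded[bits] = value
    return expanded


# Analog state of a 4-state-neuron net: the (1, 1, 0, 1) component equals
# S_1 * S_2 * (1 - S_3) * S_4, as in the worked example in the text.
analog = full_order_state([0.9, 0.8, 0.3, 0.6])
print(analog[(1, 1, 0, 1)])

# Quantized (binary) state: exactly one of the 2^n components is nonzero.
binary = full_order_state([1.0, 0.0, 1.0])
print([bits for bits, value in binary.items() if value > 0.0])
```

For a binary state vector the expansion has a single component equal to 1 and all others equal to 0, so it behaves like a unary code over the $2^n$ quantized states. This gives each weight in Eq. (8) its own state transition to control, at the cost of a number of weights that grows exponentially with the number of state neurons.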

3.2 External Continuous Stack Memory

One of the novel features of the NNPDA is the continuous stack memory. The continuous (or analog) stack was motivated by a desire to manipulate a stack with a gradient descent training algorithm. In order to minimize the error function along the gradient descent direction, the weight modification is proportional to the gradient of the error function

$$\Delta W \propto \frac{\partial}{\partial W}(\text{Error Function}). \tag{10}$$

To couple the neural net with a stack memory, the stack variables must be included in the error function. One way of doing this is to make the stack variables a continuous function of the connection weights, so that an infinitesimal change of weights will cause an infinitesimal change of action values, which in turn causes an infinitesimal change of stack readings. Any discontinuity among these relations may cause the derivative to become infinite, thereby interfering with the learning process.

3.2.1 Continuous Stack Action

To fully describe the mechanism of the continuous stack, we discuss in detail: (1) the continuous stack action and stack operation; (2) how to read the stack and (3) the neural representation of the stack reading. Consider a conven-tional stack, as shown in Fig. 4(a), where there are stored a number of discrete symbols. The discrete stack actions include pop, push and no-op. Without affecting the generality of a stack function, it is assumed that each action only deals with one symbol. The pop simply removes the top symbol and the push places the symbol read from input string onto the top of stack. When the continuous stack is introduced, we have to replace both the discrete symbols in the stack by continuous symbols and the discrete pop and push actions by continuous actions. Therefore, we define the continuous length of every symbols. In Fig. 4(a), the stack is filled with discrete symbols and each symbol is

interpret-Rt⊕It ( )k R_kt Ik−N_R t î   =

if N

_R

< k

≤

N

_I

+N

_R

if 0 < k

≤

N

_R At+1 f W{ }j kl a S{ }j t Rk t Il t+_θa j { }

∑

, ,k l ( ) = St_{{ }}_j (j_mS_mt + (1−j_m) (1−S_mt)) m=1 n

∏

= ∆W W ∂∂ (ErrorFunction) ∝

(9)

ed as having equal length L=1. In the general case, as shown in Fig.4(b), the stack is filled with continuous symbols, each having a continuous length: 1≥L≥ 0. These continuous symbols are generated by the continuous stack actions. As described in the neural network controller in Eqs.(5), (6) and (8), the output of the action neurons A_it are calculated by the function f(x) with analog values distributed within the interval [-1, 1]. The value of Ait is interpreted as the

in-tensity of the actions to be taken on the conventional stack [Harrison78]. When A_it takes on continuous values, the natural generalization of the discrete dynamics is to interpret each continuous action A_it as an uncertainty about the action to be taken. We represent this uncertainty in terms of the length of the discrete symbols to be pushed or popped. Therefore, at each time step only part of a discrete symbol is pushed or popped onto the stack with length determined byA_it. Whether to push or pop is determined by the sign of A_it: push if A_it >ε and pop if A_it<−ε whereεis a small number close to zero; otherwise a no-operation (no-op) takes place. After such actions, the stack construction would appear as in Fig.4 (b).
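The push, pop and no-op rule of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `apply_action`, the list-of-pairs stack layout and the default value of `eps` are hypothetical, and a pop that exceeds the stored length simply stops at the empty stack.

```python
def apply_action(stack, action, symbol, eps=1e-6):
    """Apply one continuous action to `stack`, a list of [symbol, length]
    pairs ordered from bottom to top. Returns the operation that was taken."""
    if action > eps:                      # push a length of `action`
        stack.append([symbol, action])
        return "push"
    if action < -eps:                     # pop a total length of |action|
        remaining = -action
        while stack and remaining > 0.0:
            top_length = stack[-1][1]
            if top_length <= remaining:   # top symbol is removed entirely
                remaining -= top_length
                stack.pop()
            else:                         # only part of the top symbol goes
                stack[-1][1] = top_length - remaining
                remaining = 0.0
        return "pop"
    return "no-op"                        # |action| <= eps


stack = []
apply_action(stack, 0.6, "a")    # push 0.6 of 'a'
apply_action(stack, 0.5, "b")    # push 0.5 of 'b'
apply_action(stack, -0.7, "a")   # pop 0.7: all of 'b' and 0.2 of 'a'
print(stack)
```

A single pop can therefore consume several stored symbols, which is what distinguishes the continuous stack from the one-symbol-per-action discrete case.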

In the above description of the stack operation, only one component of the vector $A_i^t$ is used, and all three actions pop, push and no-op are represented by one variable. However, one could integrate continuous actions into a conventional discrete stack in many different ways. For instance, separate action neurons could be used to represent the different types of actions, i.e. one neuron with output $0 \le A_1^t \le 1$ to represent the value of the push action and another neuron with output $0 \le A_2^t \le 1$ to represent the value of the pop action. In this case both $A_1^t$ and $A_2^t$ could simultaneously have nonzero output, and the order in which the two actions (push and pop) are executed must be assigned in advance. If we first take a pop action and then push, we in effect introduce four types of actions in the discrete limit: (1) push ($A_1^t = 1$ and $A_2^t = 0$), (2) pop ($A_1^t = 0$ and $A_2^t = 1$), (3) no action ($A_1^t = 0$ and $A_2^t = 0$) and (4) replace ($A_1^t = 1$ and $A_2^t = 1$).

3.2.2 Reading the Stack

How to read from a continuous stack must be defined. For simplicity, we assume only one action neuron is used. In the conventional discrete stack a read operation only reads one symbol from the top of the stack and sees nothing below. This reading method is not suitable for the continuous stack, since there will be a discontinuity in the content of the stack reading. We treat the stack as a one-way tape, so that the reading can be performed without popping the stack. More specifically, a reading discontinuity may happen in either of the following two cases: (1) after performing the action $A^t$, a symbol with an infinitesimal length is left on the top of the stack; or (2) the top symbol has an infinitesimal (or zero) part being removed by the previous pop action $A^t$. In these two cases an infinitesimal perturbation to the action value $A^t$ could generate a discrete jump in the stack readings. See the example shown in Fig. 4(b). If $A^t = -0.9$, the symbol “a” will be popped entirely from the top of the stack, and the next reading $R^{t+1}$ would be the symbol “b” with length = 0.6. However, if there is a small perturbation to the connection weights such that the value of $A^t$ increases by only 0.001, then $A^t = -0.899$. A part of the top symbol “a” with length L = 0.899 will be popped and a small portion of “a” remains on the top of the stack. In that case the next reading $R^{t+1}$ would be the symbol “a” with length = 0.001. A similar discrete jump will happen for the case where $A^t \approx 0$. To avoid this discontinuity we impose the condition that the continuous stack is always read to a depth equal to 1 from the stack’s top.

Fig. 4 Stack symbols with continuous lengths. (a) A discrete stack is filled with discrete symbols, which can be viewed as all having length = 1. (b) A continuous stack is filled with discrete symbols having continuous length 0 ≤ L ≤ 1.

The advantages of this reading method are outlined below. First, the reading function is continuous with respect to the connection weights: any infinitesimal change of weights will cause an infinitesimal change of stack readings. In the example of Fig. 4(b), for $A^t = -0.9$ the symbol “a” on the top is popped. The next reading contains two parts: symbol “b” with length = 0.6 and symbol “c” with length = 0.4 (total length = 0.6 + 0.4 = 1.0). If the action value is changed to $A^t = -0.899$ by a small perturbation of the connection weights, the symbol “a” is not totally popped off and a small fraction is left. In this case the next reading would contain a small fraction of symbol “a” with length = 0.001, a part of symbol “b” with length = 0.6 and a part of symbol “c” with length = 0.399 (total length = 0.001 + 0.6 + 0.399 = 1.0). This example shows that the change of the next stack reading $R^{t+1}$ is proportional to the change of the previous action value $A^t$. When $\Delta A^t$ approaches zero, the change of readings $\Delta R^{t+1}$ also approaches zero. It should be noted that this continuity of the reading function does not automatically guarantee that it is differentiable; and, even if it is differentiable, its derivative may not be a function feasible for numerical implementation. The complication of the derivatives $\partial R^t/\partial W$ and $\partial R^t/\partial A^\tau$ will be discussed in Appendix A.
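The depth-1 reading and its continuity can be illustrated with a short Python sketch. The function name `read_stack` and the tuple layout are hypothetical, and the length 0.7 assigned to symbol “c” is an assumed value (the text only requires that at least 0.4 of “c” be available).

```python
def read_stack(stack, depth=1.0):
    """Read `depth` units of length downward from the top of `stack`
    (a list of (symbol, length) pairs ordered bottom to top) without popping.
    Returns a dict mapping each symbol to the total length read."""
    reading = {}
    remaining = depth
    for symbol, length in reversed(stack):
        if remaining <= 0.0:
            break
        taken = min(length, remaining)
        reading[symbol] = reading.get(symbol, 0.0) + taken
        remaining -= taken
    return reading


# Stack of the Fig. 4(b) example after the pop; 0.7 for "c" is assumed.
after_pop_0900 = [("c", 0.7), ("b", 0.6)]                 # "a" fully removed
after_pop_0899 = [("c", 0.7), ("b", 0.6), ("a", 0.001)]   # 0.001 of "a" left

print(read_stack(after_pop_0900))
print(read_stack(after_pop_0899))
```

The two readings differ by about 0.001 in the “a” and “c” components only, so a small change in the action produces a comparably small change in the reading.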

The other advantage of the proposed reading method is its correspondence with a probabilistic interpretation of the continuous action value, i.e. a stochastic machine. The continuous action values can be interpreted as a type of uncertainty compared to the deterministic discrete push and pop. If the maximum of the absolute action value is one, i.e. $|A^t| \le 1$, the length of a symbol to be pushed or popped can be interpreted as the probability of this discrete action. Consequently, reading the stack with a total length equal to one implies the normalization of the total probabilities, i.e. the probabilities of reading each discrete symbol sum to one. In other words, as in the previous example of Fig. 4(b), if the stack reading (with total length equal to one) contains ‘a’ with length = 0.001, ‘b’ with length = 0.6 and ‘c’ with length = 0.399, we can interpret the stack symbol as being read with uncertainty: the probability that the read symbol is ‘a’ is very small (0.001), the probability that it is ‘b’ is 0.6 and the probability that it is ‘c’ is 0.399. When the stack length is less than 1, the reading may be only an ‘a’ with length = 0.1; this could be interpreted to mean that the probability of reading ‘a’ is 0.1 and the probability of reading an empty stack is 0.9.

3.2.3 Neural Representation

In the preceding subsections, the stack reading $R^t$ and the input $I^t$ were often described as symbols. In this subsection, the actual neural representation of these two vectors is discussed.

The neural representations of the input string symbol $I^t$ and the stack reading $R^t$ are determined under the following considerations. First, in the discrete limit (by quantization of the analog neurons to discrete levels) the learned neural network pushdown automaton is required to behave in the same way as a conventional pushdown automaton. In this limit, since both sets {$I^t$} and {$R^t$} (each element of which corresponds to a symbol) represent the same set of discrete symbols, the neural representations of each $I^t$ and $R^t$ need to be identical. In this regard, there are no restrictions on their neural representations as long as they are the same. For instance, for the symbols ‘a’, ‘b’ and ‘e’, the set {$I^t$} or {$R^t$} can be represented either by two neurons as (0, 1), (1, 0) and (1, 1) if a binary code is used, or by three neurons as (1, 0, 0), (0, 1, 0) and (0, 0, 1) if an orthogonal code is used.

Second, during training, the stack reading should consist of continuous neuron values, and the reading neurons $R^t$ should be able to represent the contents inside a segment of the continuous stack with total length = 1. This is in general a distributed mixture of the three possible symbols, each with an analog length less than 1. For an effective neural information representation, it is important to require that there exist a unique one-to-one mapping between each vector $R^t$ and the stack symbol components it represents.

The general mapping from the three continuous lengths to $R^t$ can be written as

$$R^t = f(l_1, l_2, l_3, \mathbf{a}, \mathbf{b}, \mathbf{e}), \quad l_1 + l_2 + l_3 \le 1, \; l_1 \ge 0, \; l_2 \ge 0, \; l_3 \ge 0, \qquad (11)$$

where $l_1$, $l_2$ and $l_3$ are the three continuous lengths of the discrete symbols ‘a’, ‘b’ and ‘e’ contained in $R^t$, and $\mathbf{a}$, $\mathbf{b}$, $\mathbf{e}$ are the vector representations of ‘a’, ‘b’ and ‘e’ in neuron space. The condition $l_1 + l_2 + l_3 \le 1$ (not $l_1 + l_2 + l_3 = 1$) includes the case of a partially empty stack during training, where the total length of symbols stored in the stack is less than one.

The first requirement for the discrete limit can be stated as

$$R^t = \mathbf{a} \text{ if } l_1 = 1, l_2 = 0, l_3 = 0; \quad R^t = \mathbf{b} \text{ if } l_1 = 0, l_2 = 1, l_3 = 0; \quad R^t = \mathbf{e} \text{ if } l_1 = 0, l_2 = 0, l_3 = 1. \qquad (12)$$

One simple way to satisfy this condition is to write $R^t$ as a linear combination of the three basis vectors $\mathbf{a}$, $\mathbf{b}$, $\mathbf{e}$:

$$R^t = l_1 \mathbf{a} + l_2 \mathbf{b} + l_3 \mathbf{e}. \qquad (13)$$

For the second requirement, uniqueness, the necessary and sufficient condition for the mapping in Eq.(13) is that the three neural vectors $\mathbf{a}$, $\mathbf{b}$, $\mathbf{e}$ be linearly independent. (By uniqueness we mean that if there exists another set of coefficients $l'_1$, $l'_2$ and $l'_3$ such that $l'_1 \mathbf{a} + l'_2 \mathbf{b} + l'_3 \mathbf{e} = l_1 \mathbf{a} + l_2 \mathbf{b} + l_3 \mathbf{e}$, then $l'_1 = l_1$, $l'_2 = l_2$ and $l'_3 = l_3$.) If there are m symbols used in the input strings, then at least m analog neurons are needed to represent the input string symbol $I^t$ and the stack reading $R^t$, because any m vectors in a space of dimension less than m would be linearly dependent. In the three-symbol example, this excludes the use of the binary vectors (0, 1), (1, 0) and (1, 1) to represent the symbols ‘a’, ‘b’ and ‘e’. For simplicity the unary neural representation, i.e. $\mathbf{a} = (1, 0, 0)$, $\mathbf{b} = (0, 1, 0)$ and $\mathbf{e} = (0, 0, 1)$, is used for the three symbols ‘a’, ‘b’ and ‘e’. In this case the stack reading $R^t$ is represented by a three-dimensional vector $(l_1, l_2, l_3)$, indicating that in the current stack reading the lengths of the letters ‘a’, ‘b’ and ‘e’ are $l_1$, $l_2$ and $l_3$ respectively.
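The uniqueness requirement can be checked on a small example. The sketch below evaluates the linear combination of Eq.(13) for the two codes mentioned in the text; the function name `encode` and the chosen length triples are hypothetical and serve only to illustrate why the two-neuron binary code fails.

```python
def encode(lengths, basis):
    """Eq.(13): R = l1*a + l2*b + l3*e for symbol lengths and basis vectors."""
    dim = len(basis[0])
    return tuple(sum(l * vec[i] for l, vec in zip(lengths, basis))
                 for i in range(dim))


binary = [(0, 1), (1, 0), (1, 1)]            # 'a', 'b', 'e' on two neurons
unary = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]    # 'a', 'b', 'e' on three neurons

half_a_half_b = (0.5, 0.5, 0.0)   # 0.5 of 'a' and 0.5 of 'b'
half_e = (0.0, 0.0, 0.5)          # 0.5 of 'e' only

# The two-neuron code maps two different stack contents to the same reading.
print(encode(half_a_half_b, binary), encode(half_e, binary))
# The unary code returns the lengths themselves, so the mapping is one-to-one.
print(encode(half_a_half_b, unary), encode(half_e, unary))
```

With the binary code both contents give the same reading vector, so the lengths cannot be recovered from $R^t$; with the unary code the reading is the length triple itself.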

To conclude this section, a novel continuous stack is introduced. One interpretation of the continuous stack is the concept of a magnitude associated with a discrete symbol. This new concept stresses two aspects: (1) generalization of a discrete stack to a continuous stack and (2) identification of the stack readings and actions as neural network input and output with a probabilistic interpretation.

3.3 Dynamics of the Neural Network Pushdown Automata

For simplicity the following assumptions are made: (a) only deterministic pushdown automata are considered; (b) only one action neuron output $A^t$ is used; (c) the same set of symbols represents both the input and stack symbols, so that a push action only pushes the current input $I^t$ onto the stack. These assumptions will restrict the class of context-free languages that the NNPDA can learn and recognize.

We illustrate the NNPDA dynamics by examples. Consider strings of the two symbols ‘a’ and ‘b’. To mark the end of an input string the end symbol ‘e’ is introduced. A possible input string may be “aababbabe”. Each time a string symbol ‘a’ (or ‘b’) is fed into the neural network controller, this same symbol ‘a’ (or ‘b’) could be pushed onto the stack (or the stack could be popped from the top) with magnitude $|A^t|$, according to the sign of $A^t$. The last symbol ‘e’ indicates the end of the input string. Upon receiving the end symbol, the neural network pushdown automaton generates an output indicating whether the input string is legal or illegal.

Numerically, two arrays are used to represent the stack: an integer array stacksymbol[] to store the symbols {‘a’, ‘b’, ‘e’} and a real number array stacklength[] for their lengths. A record of the number of symbols stored on the stack is kept in an integer top. Assume that four state neurons are used such that $S^t = (s_1, s_2, s_3, s_4)$, where $0 \le s_1, s_2, s_3, s_4 \le 1$ are the outputs of the four neurons.

The NNPDA operations are outlined for successive time steps.

(1) t = 0.

Initially, the stack is empty, so that top = 0 and the stack reading at t = 0 is $R^0$ = (0, 0, 0). If the first symbol of the string is the letter ‘a’, the initial input neural vector would be $I^0$ = (1, 0, 0). Assume the initial state to be $S^0$ = (1, 0, 0, 0). The stack is shown in Fig. 5(a).

(2) t = 1.

Initialize the NNPDA with the values $S^0$, $I^0$ and $R^0$ (as shown in Fig. 3). After one iteration of Eq.(3), the new state $S^1$ and new action $A^1$ are obtained. Assume that the action output is $A^1$ = 0.6; then symbol ‘a’ is pushed onto the stack with length = 0.6. The new status of the stack can be represented as stacksymbol[1] = ‘a’, stacklength[1] = 0.6 and top = 1. The next reading $R^1$ would then be (0.6, 0, 0). The stack is shown in Fig. 5(b).

If the next symbol in the input string is ‘b’, then $I^1$ = (0, 1, 0). Substituting the new values $S^1$, $I^1$ and $R^1$ into Eq.(3) generates the values at the next time step. The procedure is then repeated.

(3) some later time t.

After several possible pushes, pops and no-ops, the current stack memory may have stored several continuous symbols as in Fig. 6(a): top = 4 (four symbols are stored), stacksymbol[] = (‘a’, ‘a’, ‘b’, ‘a’) and stacklength[] = (0.32, 0.2, 0.7, 0.4). Since the stack is read down from the top with depth = 1, the current stack reading would be $R^t$ = (0.4, 0.6, 0), as shown in Fig. 6(a). Assume the input symbol is ‘a’, so that $I^t$ = (1, 0, 0). The state vector can also be read from the state neuron output as $S^t$.

Fig. 5 Stack status at (a) t = 0 and (b) t = 1.

Fig. 6 Stack at (a) time t and (b) time t+1, with the readings $R^t$ and $R^{t+1}$ marked; the top portion of the stack is popped off by the action of value -0.86.

(4) time t+1.

Substitute $S^t$, $I^t$ and $R^t$ into Eq.(3) and the values at the next time step are obtained. If the action $A^{t+1}$ = -0.86, a segment of the stack with content of length = 0.86 is popped. This “popped segment” includes 0.4 of ‘a’ and 0.46 of ‘b’, and the stack now has top = 3 (three symbols are left), stacksymbol[] = (‘a’, ‘a’, ‘b’) and stacklength[] = (0.32, 0.2, 0.24). The next stack reading would be $R^{t+1}$ = (0.52, 0.24, 0) (formed by 0.32 of ‘a’ plus 0.2 of ‘a’ plus 0.24 of ‘b’).

This procedure is repeated until the end of the input string. The classification of an input string is determined by examining the final state neuron output and the stack length. The criterion for training and classification will be discussed in the next two sections.
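The walk-through above can be reproduced with a minimal Python sketch of the two-array stack. The class name `ContinuousStack`, its method names, the default `eps` and the rounding of the reading to six decimals are assumptions made for illustration; Python lists are 0-indexed, whereas the text numbers stack entries from 1. The neural network controller is not modeled, so the action values are supplied by hand.

```python
SYMBOLS = ("a", "b", "e")


class ContinuousStack:
    def __init__(self):
        self.stacksymbol = []   # symbol indices into SYMBOLS, bottom to top
        self.stacklength = []   # continuous length of each stored symbol
        self.top = 0            # number of symbols currently stored

    def act(self, action, symbol, eps=1e-6):
        """Push (action > eps) or pop (action < -eps) a length of |action|."""
        if action > eps:
            self.stacksymbol.append(SYMBOLS.index(symbol))
            self.stacklength.append(action)
        elif action < -eps:
            remaining = -action
            while self.stacklength and remaining > 0.0:
                if self.stacklength[-1] <= remaining:
                    remaining -= self.stacklength.pop()
                    self.stacksymbol.pop()
                else:
                    self.stacklength[-1] -= remaining
                    remaining = 0.0
        self.top = len(self.stacksymbol)

    def read(self, depth=1.0):
        """Return R = (l_a, l_b, l_e), read to `depth` from the top."""
        reading = [0.0] * len(SYMBOLS)
        remaining = depth
        for index, length in zip(reversed(self.stacksymbol),
                                 reversed(self.stacklength)):
            taken = min(length, remaining)
            reading[index] += taken
            remaining -= taken
            if remaining <= 0.0:
                break
        return tuple(round(x, 6) for x in reading)


stack = ContinuousStack()
print(stack.read())             # empty stack at t = 0
stack.act(0.6, "a")
print(stack.read())             # reading after pushing 0.6 of 'a'

# Stack at "some later time t" in the text.
stack.stacksymbol = [0, 0, 1, 0]
stack.stacklength = [0.32, 0.2, 0.7, 0.4]
stack.top = 4
print(stack.read())             # reading at time t
stack.act(-0.86, "a")
print(stack.top, stack.read())  # state after the pop at time t+1
```

The readings printed by this sketch can be compared directly with the values of $R^0$, $R^1$, $R^t$ and $R^{t+1}$ quoted in the four steps.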

3.4 Objective Function

The objective function to be minimized is defined as a scalar error measure which is a function of both the end state and the stack length. For a conventional pushdown automaton, either the end state or the stack length alone is a sufficient criterion to determine the acceptance of input strings [Harrison78]: if either the end state reaches a desired final state or the stack ends empty, the input string is legal; otherwise it is illegal. However, in training the NNPDA we find that a combination of these two criteria seems necessary. (Initially, we tried only one of these criteria in training, but training was unsuccessful. For the stack-empty only criterion, the stack actions always converged to pop. For the final-state only criterion, the stack actions were not affected.) We speculate that this is because of the existence of too many local minima in phase space. Thus, an objective function consisting of only one criterion, final state or stack length, will have a very complex phase space configuration, so that the local learning algorithm (gradient descent) would not be able to drive the system out of the local minima. Therefore, a legal string is required to satisfy both conditions: (1) at the end the NNPDA reaches a desired final state and (2) the stack is empty.

Define the stack length at time t to be $L^t$. Then $L^t$ can be evaluated recursively in terms of the action value $A^t$,

$$L^{t+1} = L^t + A^t, \qquad (14)$$

because only the push or pop actions can change the length of the stack. The initial condition is $L^0 = 0$ and the constraint $L^t \ge 0$ should be imposed at all times. Let T-1 be the final time at the end of the input string. For legal strings the straightforward error function E to be minimized could be

$$E = (S_f - S^T)^2 + (L^T)^2, \qquad (15)$$

where $S_f$ is the desired final state. However, this error function could not be used to train illegal strings. For illegal strings the desired value of the function E is not known. Maximizing the same error E as in Eq.(15) would not, in general, give a correct answer, because E is an unbounded function and an illegal string may not end with a long stack length. Likewise, replacing $S_f$ in Eq.(15) with a desired end state for illegal strings and then minimizing E presents a similar problem, since illegal strings would then be required to end with an empty stack (in effect, to avoid using the stack). The main difficulty is that there is not enough information to decide the desired value of the stack length for illegal strings.

In general, the following reasoning is applied. Since a legal string requires both (a) the desired final state $S^T = S_f$ and (b) an empty stack ($L^T = 0$), an illegal string should require the opposite: either (a) the final state be a large measurable distance from $S_f$, or (b) a non-empty stack ($L^T \ge 1$). Although other training requirements could be defined, in practice both of these conditions are successfully used.

One way to implement the above requirement is to introduce a unified error function E which can be used to train both legal and illegal strings. For simplicity we assign the final state(s) in such a way that only one neuron output $S_{N_S}$ needs to be checked at time T at the end of the input string. We require $S^T_{N_S} = 1$ and $L^T = 0$ for legal strings and $S^T_{N_S} = 0$ or $L^T \ge 1$ for illegal strings. In this case the unified error function to be minimized for both legal and illegal strings can be defined as

$$E = (v + L^T - S^T_{N_S})^2 \equiv e^2, \qquad (16)$$

where v is a parameter assigned as a target value for each training example. For legal strings v = 1 and for illegal strings $v = \min\{0, S^T_{N_S} - L^T\}$. The learning algorithm is derived by minimizing this error function with the proper value of v for each input string.

The correctness of the error function (16) can be checked separately for each string. If the input string is legal, v = 1. Then, minimizing E corresponds to the requirement that $S^T_{N_S} = 1$ and $L^T = 0$, i.e. the desired final state and an empty stack. If the input string is illegal, we require $v = \min\{0, S^T_{N_S} - L^T\}$. There are two possible cases. First, when $S^T_{N_S} > L^T$, v = 0, which implies that minimizing E corresponds to driving $L^T$ toward $S^T_{N_S}$. The minimum of E is reached if $S^T_{N_S} = L^T$. This means that for each input string (with the neuron activity $S^T_{N_S}$ discretized to 0 or 1) one of the following requirements is met: $S^T_{N_S} = 0$ or $L^T = 1$. Second, if $L^T$ is already greater than $S^T_{N_S}$, then $v = \min\{0, S^T_{N_S} - L^T\} = S^T_{N_S} - L^T$. This leads to E = 0, implying “do not care” or “no error”. Thus, in the discrete limit, the combination of the two cases corresponds to the requirement for illegal strings: either $S^T_{N_S} = 0$ (illegal state) or $L^T \ge 1$ (non-empty stack).

From the above analysis for analog values of $S^T_{N_S}$, the expression $H \equiv S^T_{N_S} - L^T$ could be considered a continuous measure of how well both of the two conditions $S^T_{N_S} = 1$ and $L^T = 0$ are satisfied. The desired value for legal strings is H = 1 and for illegal strings H ≤ 0. This H function also provides a simple test measure for new input strings. After training we will use the same measure $H \equiv S^T_{N_S} - L^T$ to test the generalization capability of the NNPDA on unseen input strings. The measure H will be evaluated for each input string. A string is classified as legal if H > 0.5, otherwise illegal.

Another criterion to assist learning is the “trap state,” one of the “hints” used by [Das93]. This “trap state” is used in training the non-trivial Palindrome grammar; details are discussed in Section IV.
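The unified error of Eq.(16), its target value v and the test measure H can be written out as a short Python sketch. The function names and the numerical inputs are hypothetical; the final-state output and stack length would in practice come from running the trained network on a string.

```python
def target_v(is_legal, s_final, l_final):
    """Target v of Eq.(16): 1 for legal strings, min{0, S - L} otherwise."""
    return 1.0 if is_legal else min(0.0, s_final - l_final)


def error(is_legal, s_final, l_final):
    """Unified error E = (v + L^T - S^T)^2 of Eq.(16)."""
    v = target_v(is_legal, s_final, l_final)
    return (v + l_final - s_final) ** 2


def classify(s_final, l_final):
    """A string is classified as legal if H = S^T - L^T exceeds 0.5."""
    return (s_final - l_final) > 0.5


print(error(True, 1.0, 0.0))    # legal string: desired state, empty stack
print(error(False, 0.8, 0.2))   # illegal string with S > L, so v = 0
print(error(False, 0.2, 1.5))   # illegal string with L > S: "do not care"
print(classify(0.9, 0.1), classify(0.9, 0.6))
```

For the illegal string with $S^T_{N_S} > L^T$ the error is the squared gap between the two quantities, while for $L^T > S^T_{N_S}$ the target v cancels the gap and the error vanishes up to rounding.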

3.5 Training Algorithm

The training algorithm is derived by minimizing the error function using a gradient descent optimization method. There are currently two ways to implement gradient descent optimization in recurrent neural networks: the chain-rule differentiation can be propagated forward or backward in time. The forward propagation method is also known as Real Time Recurrent Learning (RTRL) [Williams89], which propagates a sensitivity matrix forward in time until the end of an input sequence. Then, error correction is performed and the weights are modified according to the error message and the sensitivity matrix. Back-propagation-through-time [Rumelhart86b] can be applied to recurrent network training by unfolding the time sequence of mappings into a multilayer feed-forward net, each layer with identical weights. This method requires memorizing the state history of the input sequence and, whenever an error is found, the error must be propagated backward in time to the starting point. Due to the nature of the backward path, it is an off-line method. In principle, both methods can be generalized to couple the external stack memory with a recurrent neural network and train the NNPDA. RTRL is desirable for on-line training because the weights can be modified immediately after the error is detected, without waiting for back-propagation. But it has a complexity of $O(N^4)$ compared to the complexity of $O(N^3)$ for back-propagation through time (N is the number of neurons and first-order connection weights are assumed). For the task of grammatical inference, on-line training is not necessary because error messages are only given at the end of input strings. But, since the derivation of the forward propagation algorithm is more straightforward for the NNPDA, we first consider the generalization of RTRL for training the NNPDA.

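From Eqs.(10) and (16), the weight correction for gradient descent learning becomes

$$\Delta W = -\eta \, (v + L^T - S^T_{N_S}) \left( \frac{\partial L^T}{\partial W} - \frac{\partial S^T_{N_S}}{\partial W} \right), \qquad (17)$$

where $\eta$ is the learning rate and the partial derivatives of $L^T$ and $S^T_{N_S}$ with respect to the weight matrix W can be calculated recursively. The formula for $\partial L^t/\partial W$ is easily derived from Eq.(14):

$$\frac{\partial L^{t+1}}{\partial W} = \frac{\partial L^t}{\partial W} + \frac{\partial A^t}{\partial W}. \qquad (18)$$

The recursions for $\partial S^t/\partial W$ and $\partial A^t/\partial W$ are found by differentiating the controller dynamical equations. For example, the second-order connection weights of Eq.(5) yield

$$\frac{\partial S^{t+1}_{i'}}{\partial W_{ijk}} = h_{i'}(S^t_{i'}) \left[ \delta_{ii'} S^t_j (R^t \oplus I^t)_k + \sum_{j'=1}^{N_S} \sum_{k'=1}^{2N_I} W_{i'j'k'} (R^t \oplus I^t)_{k'} \frac{\partial S^t_{j'}}{\partial W_{ijk}} + \sum_{j'=1}^{N_S} \sum_{k'=1}^{N_I} W_{i'j'k'} S^t_{j'} \frac{\partial R^t_{k'}}{\partial W_{ijk}} \right]. \qquad (19)$$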

It should be noticed that Eq.(19) is an abbreviation of four equations for $\partial S^{t+1}_{i'}/\partial W^s_{ijk}$, $\partial S^{t+1}_{i'}/\partial W^a_{jk}$, $\partial A^{t+1}/\partial W^s_{ijk}$ and $\partial A^{t+1}/\partial W^a_{jk}$. For simplicity the notations of $S^t$ and $A^t$ are combined into one equation: the $(N_S+1)$th component of the vector $S^t$ is $A^t$. The function $h_i(x)$ represents the derivatives $g'(x)$ for i = 1 to $N_S$ and $f'(x)$ for i = $N_S+1$. $W^s$ and $W^a$ are similarly combined, such that $W_{ijk}$ represents $W^s_{ijk}$ for i = 1 to $N_S$ and $W^a_{jk}$ for i = $N_S+1$. (Note the assumption that $N_A = 1$ and $N_R = N_I$.) The learning algorithm formulas for the third-order state transition and “full order” action mapping are presented in Appendix B.

From these recursions and the initial conditions $\partial S^0/\partial W$ and $\partial A^0/\partial W$, their values at a later time can be evaluated by Eq.(19). But the recursion is not complete until $\partial R^{t+1}/\partial W$ is expressed in terms of $\partial S^t/\partial W$, $\partial A^t/\partial W$ and $\partial R^t/\partial W$. This relation may not be easy to find, since the stack reading is a highly nonlinear function of all the previous actions and input symbols, as shown in Eq.(4), $R^t = F(A^1, A^2, ..., A^t; I^1, I^2, ..., I^t)$. An approximate recursive relation for $\partial R^{t+1}/\partial W$ can be derived (for details see Appendix A). To the lowest order in its expansion, we have

$$\frac{\partial R^t_{k'}}{\partial W_{ijk}} \approx (\delta_{k' r_1^t} - \delta_{k' r_2^t}) \frac{\partial A^t}{\partial W_{ijk}}, \qquad (20)$$

where $r_1^t$ and $r_2^t$ are the ordinal numbers of the neurons that represent the top and the bottom symbols, respectively, in the reading $R^t$. Consider for example the case where, after the execution of the action $A^t$, the stack is (from bottom to top): (0, 0.9, 0), (0.2, 0, 0), (0, 0.7, 0) and (0, 0, 0.15). Then $r_1^t = 3$ and $r_2^t = 1$, because the symbol (0, 0, 0.15) on the top is the third symbol and the symbol (0.2, 0, 0) at the bottom of $R^t$ is the first one.
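The indices $r_1^t$ and $r_2^t$ of the example can be computed mechanically. The Python sketch below is an illustration only: the function names are hypothetical, the stack is assumed to be stored as unary-coded segments from bottom to top, and the reading depth is taken to be 1.

```python
def reading_extent(stack, depth=1.0):
    """Return (r1, r2): the 1-based neuron indices of the top symbol and of
    the lowest symbol reached when reading `depth` from the top of `stack`.
    `stack` is a list of unary-coded segments, ordered from bottom to top."""
    remaining = depth
    r1 = r2 = None
    for segment in reversed(stack):
        # The nonzero component of a unary-coded segment names its symbol.
        index = max(range(len(segment)), key=lambda k: segment[k]) + 1
        if r1 is None:
            r1 = index
        r2 = index
        remaining -= sum(segment)
        if remaining <= 0.0:
            break
    return r1, r2


def reading_sensitivity(r1, r2, n_symbols=3):
    """Factor (delta_{k r1} - delta_{k r2}) of Eq.(20) for k = 1..n_symbols."""
    return [int(k == r1) - int(k == r2) for k in range(1, n_symbols + 1)]


stack = [(0, 0.9, 0), (0.2, 0, 0), (0, 0.7, 0), (0, 0, 0.15)]
r1, r2 = reading_extent(stack)
print(r1, r2, reading_sensitivity(r1, r2))
```

The sensitivity factor is +1 for the neuron of the top symbol and -1 for the neuron of the lowest symbol in the reading, reflecting that a slightly larger push adds length at the top of the reading window and removes the same length at its bottom.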

The complete recursive equations, Eqs.(18), (19) and (20), together with the NNPDA dynamical equations, can be forward propagated with the initial conditions $\partial S^0/\partial W = 0$, $\partial A^0/\partial W = 0$ and $\partial R^0/\partial W = 0$. The initial values of $A^0$ and $R^0$ are zero and the initial state $S^0$ could be assigned any constant. At the end of the input string, the weight correction Eq.(17) is evaluated. The final weight correction can be performed using either batch or stochastic learning.
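The end-of-string weight correction can be sketched for a flattened weight vector. This is a simplified illustration, not the full algorithm: the function names and all numbers are hypothetical, the per-step gradients $\partial A^t/\partial W$ and the final-state sensitivity $\partial S^T_{N_S}/\partial W$ are assumed to be given (in the full method they come from the recursions of Eqs.(19) and (20)), and only Eqs.(17) and (18) are implemented.

```python
def accumulate_length_gradient(action_gradients):
    """Eq.(18): dL/dW is the running sum of dA^t/dW, starting from zero.
    `action_gradients` is a list over time of per-weight gradient lists."""
    n_weights = len(action_gradients[0])
    dL_dW = [0.0] * n_weights
    for dA_dW in action_gradients:
        dL_dW = [dl + da for dl, da in zip(dL_dW, dA_dW)]
    return dL_dW


def weight_correction(eta, v, l_final, s_final, dL_dW, dS_dW):
    """Eq.(17): dW = -eta * (v + L^T - S^T) * (dL^T/dW - dS^T/dW)."""
    e = v + l_final - s_final
    return [-eta * e * (dl - ds) for dl, ds in zip(dL_dW, dS_dW)]


# Hypothetical numbers for a network with two weights and three time steps.
dA_dW_history = [[0.1, 0.0], [0.2, -0.1], [-0.05, 0.3]]
dL_dW = accumulate_length_gradient(dA_dW_history)
dS_dW = [0.4, -0.2]     # assumed sensitivity of the final-state neuron
delta_W = weight_correction(eta=0.5, v=1.0, l_final=0.3, s_final=0.6,
                            dL_dW=dL_dW, dS_dW=dS_dW)
print(dL_dW, delta_W)
```

In batch learning the corrections of all training strings would be summed before the weights are changed, whereas in stochastic learning each `delta_W` would be applied immediately.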

However, there is the case of “pop empty stack.” If the total length of the remaining symbols in the stack is less than the value of a pop action ($L^{t-1} < |A^t|$), a “pop empty stack” occurs. For a well designed conventional pushdown automaton, “pop empty stack” never occurs. But in learning a PDA, whether with an NNPDA or another method, such an action seems almost inevitable. We devise two possible ways to deal with this case. First, the input sequence can be interrupted whenever a “pop empty stack” occurs, and weight corrections are made to increase the stack length ($\Delta W \sim \partial L^t/\partial W$). Second, when a “pop empty stack” occurs and the input string is illegal, no weight correction is made; conversely, weight corrections are made for legal input strings.

3.6 Extraction of PDA from a Trained NNPDA

After training with examples of a context-free grammar, the NNPDA in general could recognize correctly the training set up to a certain length of strings. But, because of the analog nature of the NNPDA, the recognition results are not “correct” in the discrete sense. The final state outputs are analog values between 0 and 1, which are usually reduced to the binary values of 0 and 1 by a threshold of 0.5. But analog errors from intermediate states still exist and could accumulate as the input strings become longer. To extract from the trained NNPDA a PDA which represents the underlying CFG, we devise a quantization procedure that converts an analog NNPDA to a discrete PDA. To simplify the state structure of the extracted discrete PDA, a minimization procedure for the PDA must also be devised.

The quantization can be performed as follows. First, the action neuron(s) is quantized into three discrete values, -1, 0 and 1, according to the rule

$$A = \begin{cases} 1, & \text{if } A > A^* \\ 0, & \text{if } |A| \le A^* \\ -1, & \text{if } A < -A^* \end{cases}, \qquad (21)$$

where the threshold A* was chosen to be 0.5 for most of our numerical simulations. (However, our experience indicates that the quantization results do not seem sensitive to the selection of A*, and values other than 0.5 could be used.) In this way the continuous stack will behave like a discrete stack and generate the discrete actions push, no-op and pop. Next we perform a cluster analysis of the internal states. All input strings that have been recognized correctly are fed into the trained NNPDA and a set of analog internal states is generated. This set is divided into several clusters using a standard K-means clustering algorithm [Duda73]. The number of clusters K is determined by minimizing the averaged distance from each state to its cluster center (in case the clusters are not well separated, more training with these strings may be needed). After the cluster analysis, the cluster centers are stored as the representative points of the quantized internal states; a PDA with discrete states is thereby created, with the number of states equal to the number of clusters. During further testing, each analog internal state is quantized to its nearest cluster representative point and the discrete transition rules can be extracted. A transition diagram constructed from these rules is the extracted PDA.

In some cases, instead of quantizing the whole state vectors, quantizing each of the state neurons is also useful. If the state neurons’ outputs are distributed near their saturation values (0 or 1), a binary quantization is natural, i.e. $S^t_i$ is quantized to one if $S^t_i > 0.5$ and zero otherwise. If the state neural activity is uniformly distributed, more quantization levels are needed. The quantized NNPDA is tested with training or test strings again. If the recognition is incorrect, a finer re-quantization is needed (see [Giles92a] for a discussion of a similar method for FSA extraction from trained NN-FSAs).
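The three quantization steps described above can be illustrated in Python. The function names and the cluster centers are hypothetical: the centers stand in for the output of a K-means run, which is not reproduced here, and the squared Euclidean distance is an assumed choice of metric.

```python
def quantize_action(a, a_star=0.5):
    """Eq.(21): map an analog action to push (1), no-op (0) or pop (-1)."""
    if a > a_star:
        return 1
    if a < -a_star:
        return -1
    return 0


def nearest_center(state, centers):
    """Index of the cluster center closest (squared Euclidean) to `state`."""
    def sq_dist(center):
        return sum((s - c) ** 2 for s, c in zip(state, center))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))


def binarize(state, threshold=0.5):
    """Per-neuron binary quantization: 1 if the output exceeds the threshold."""
    return tuple(1 if s > threshold else 0 for s in state)


# Hypothetical cluster centers, as a K-means run might return them.
centers = [(0.95, 0.05), (0.10, 0.90)]
print(quantize_action(0.83), quantize_action(-0.12), quantize_action(-0.71))
print(nearest_center((0.80, 0.20), centers), binarize((0.80, 0.20)))
```

Running every recognized string through the trained network and recording the triples (quantized state, input symbol and stack reading, quantized action, next quantized state) would then yield the discrete transition rules.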

When a linear “full order” mapping is used for the action output (the linear “full order” mapping is the linear form of Eq.(8)), the quantization rule of Eq.(21) can be replaced by quantizing the connection weights according to

$$W^a = \begin{cases} 1, & \text{if } W^a > W^* \\ 0, & \text{if } |W^a| \le W^* \\ -1, & \text{if } W^a < -W^* \end{cases}, \qquad (22)$$

where $W^a$ are the connection weights for the action output and W* is the threshold. For details, see the numerical simulation for learning the Palindrome grammar.

After extraction of the discrete PDA, we reduce the state structure by pruning equivalent states. It is known that, in general, there exists no minimization algorithm (as there is for FSAs) for obtaining the unique minimal PDA, and that there exists no algorithm to tell whether or not two context-free grammars, or the two PDAs which accept their languages, are equivalent [Hopcroft79]. But, for a given specific structure of a PDA, the minimal size can be obtained by exhaustive search. For instance, assume a specific structure of a deterministic PDA, which pushes and pops only one symbol per input and whose stack symbols are the same as the input symbols. For this type of PDA each state transition can be characterized by a three-tuple (α, β, γ), where α is the input symbol, β is the stack reading symbol and γ = 1, -1, 0 represents push, pop and no-op. If we consider each combination of (α, β, γ) as an equivalent input symbol of a regular grammar, the extracted PDA transition diagram is equivalent to a finite state automaton transition diagram in which a transition occurs each time a “symbol” (α, β, γ) is seen. Thus, the minimization algorithm for FSAs can also be effectively used to reduce the extracted PDA. For detailed examples, see the next section.
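Treating each three-tuple (α, β, γ) as one FSA input symbol, equivalent states can be merged by standard partition refinement. The sketch below uses Moore's algorithm as one possible choice of FSA minimization (the text does not specify which algorithm was used), and the three-state automaton, the state names and the symbol "-" for an empty stack reading are hypothetical.

```python
def minimize(states, transitions, accepting):
    """Merge equivalent states by partition refinement (Moore's algorithm).
    `transitions` maps (state, symbol) -> next state, where each symbol is a
    tuple (alpha, beta, gamma). Missing transitions count as an implicit
    reject. Returns a dict mapping each state to its merged class index."""
    symbols = sorted({sym for (_, sym) in transitions})
    # Start from the accepting / non-accepting split.
    block = {s: int(s in accepting) for s in states}
    while True:
        # Signature: own block plus the block reached on every symbol.
        signature = {
            s: (block[s],) + tuple(
                block.get(transitions.get((s, sym)), -1) for sym in symbols)
            for s in states
        }
        ids = {}
        new_block = {s: ids.setdefault(signature[s], len(ids)) for s in states}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block          # no block was split: partition stable
        block = new_block


# Hypothetical extracted PDA for balanced parentheses with a redundant state.
PUSH_EMPTY = ("(", "-", 1)     # read '(', empty stack reading, push
PUSH_OPEN = ("(", "(", 1)      # read '(', '(' on stack, push
POP_OPEN = (")", "(", -1)      # read ')', '(' on stack, pop
END_EMPTY = ("e", "-", 0)      # end symbol with empty stack, no-op

states = ["q0", "q1", "q2"]
transitions = {
    ("q0", PUSH_EMPTY): "q1", ("q0", PUSH_OPEN): "q1",
    ("q0", POP_OPEN): "q0", ("q0", END_EMPTY): "q2",
    ("q1", PUSH_EMPTY): "q0", ("q1", PUSH_OPEN): "q0",
    ("q1", POP_OPEN): "q1", ("q1", END_EMPTY): "q2",
}
classes = minimize(states, transitions, accepting={"q2"})
print(classes)
```

In this example the states q0 and q1 fall into the same class, so the extracted diagram reduces to one working state plus the accepting state.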

IV. NUMERICAL SIMULATIONS (learning grammars)

To illustrate the learning capabilities of the NNPDA, we train the NNPDA on a finite number of positive and negative strings of three context-free grammars. Different types of NNPDA and training procedures are discussed for each particular problem set. For all problems the external stack of the NNPDA is initially empty. All simulations were performed with 64-bit double precision. For training we started with short strings and gradually increased the string length [Elman91]. For some simulations only 5 significant figures are presented.

4.1 Balanced Parenthesis Grammar

We train a second-order NNPDA to correctly recognize a given sequence of “balanced” parentheses. Input se-