
Context Free Grammar Representation in Neural Networks

Whitney Tabor
Department of Psychology, Cornell University, Ithaca, NY 14853
Email: tabor@cs.cornell.edu
Category: Theory (Cognitive Science); Preference: Oral presentation

Abstract
Neural network learning of context free languages has been applied only to very simple languages and has often made use of an external stack. Learning complex context free languages with a homogeneous neural mechanism looks like a much harder problem. The current paper takes a step toward solving this problem by analyzing context free grammar computation (without addressing learning) in a class of analog computers called Dynamical Automata, which are naturally implemented in neural networks. The result is a widely applicable method of using fractal sets to organize infinite state computations in a bounded state space. This method leads to a map of the locations of various context free grammars in the parameter space of one dynamical automaton/neural net. The map provides a global view of the parameterization problem which complements the local view of gradient descent methods.

1. Introduction
A number of researchers have studied the induction of context free grammars by neural networks. Many have used an external stack and negative evidence ((Giles et al., 1990), (Sun et al., 1990), (Das et al., 1992), (Das et al., 1993), (Mozer and Das, 1993), (Zheng et al., 1994)). Some have used more standard architectures and only positive evidence ((Wiles and Elman, 1995), (Rodriguez et al., ta)). In all cases, only very simple context free languages have been learned.

Table 1: Grammar 1.

S → A B C D    S → ε
A → a S        A → a
B → b S        B → b
C → c S        C → c
D → d S        D → d

It is desirable to be able to handle more complex languages. It is desirable to avoid the problem of choosing ungrammatical examples (negative evidence) in an unbiased way (see (Mozer and Das, 1993)). And it is desirable not to use an external stack: such a stack is not biologically motivated; it makes it harder to see the relationship between language learning nets and other, more homogeneous neural architectures; and it absolves the neural network from responsibility for the challenging part of the task (keeping track of an unbounded memory), thus making its accomplishment fairly similar to another well-studied case, the learning of finite state languages (e.g., (Zheng et al., 1994)).

This paper takes a step toward addressing these issues by providing a representational analysis of neurally implementable devices called Dynamical Automata, or DAs, which can recognize all context free languages as well as many other languages. The approach is less ambitious in the sense that learning is not attempted. On the other hand, it reveals the structural principles governing the computations of DAs, and of the corresponding networks, for a wide range of languages. The essential principle, consistent with Pollack's (1991) experiments, is that fractal sets provide a method of organizing recursive computations in a bounded state space. The networks are recurrent, use linear and threshold activation functions and gating units, but have no external stack. An analysis of the parameter space of one simple DA shows a mingling of languages from different complexity classes which is unlike anything that arises by adjusting parameters in a symbolic model, and is more consistent with the observed range of complexities of human languages (Shieber, 1985). Moreover, this global view of the structure of the parameter space presents a contrast to the local view provided by gradient descent methods and may be useful in learning.

To be sure, previous analyses have shown how analog devices can simulate Turing machines ((Pollack, 1987), (Siegelmann and Sontag, 1991)) and even recognize non-recursively-enumerable languages ((Siegelmann, 1996), (Moore, 1996)); thus, mere proofs of computational capability at the lower, context free, level are not revealing. However, these prior analyses have focused on complexity classification and have not explored representational implications. The current results enhance these prior analyses by providing a parameter-space map and by probing the relevance to neural network learning.

2. An example dynamical automaton


A fractal is a set of points which is self-similar at arbitrarily small scales. Figure 1a shows a diagram of the fractal called the Sierpinski Triangle (the letter labels in the diagram will be explained presently). The Sierpinski triangle, a kind of Cantor set, is the limit of the process of successively removing the middle quarter of a triangle to produce three new triangles. The grammar shown in Table 1 is a context free grammar. This grammar generates strings in the standard manner ((Hopcroft and Ullman, 1979); ε denotes the empty string). Examples of strings generated by Grammar 1 are a b c d, a b c d a b c d, and a b c a a b c d b c d d. The last case illustrates center-embedding. A pushdown automaton for the language of Grammar 1 would need to keep track of each abcd string that has been started but not completed.

Figure 1: a. An indexing scheme for selected points on the Sierpinski triangle. The points are the analogues of stack states in a pushdown automaton. By convention, each label lists more-recently-added symbols to the left of less-recently-added symbols. b. A sample trajectory of the DA described in Table 2. [Both panels plot the unit square with axes running from 0.0 to 1.0; panel a shows the labeled midpoints up to stack depth 3, and panel b shows the numbered steps of the sample trajectory described in the text.]

For this purpose it could store a symbol corresponding to the last letter of any partially completed string on a pushdown stack. For example, if it stored the symbol A whenever an embedding occurred under a, B for an embedding under b, and C for an embedding under c, the stack states would be members of {A, B, C}*.[1] We can use the Sierpinski Triangle to keep track of the stack states for Grammar 1. Consider the labeled triangle in Figure 1a. Note that all the labels are at the midpoints of the hypotenuses of subtriangles (e.g., the label CB corresponds to the point (0.125, 0.625)). The labeling scheme is organized so that each member of {A, B, C}* is the label of some midpoint (only stacks of cardinality at most 3 are shown). We define a DA (called DA 1) that recognizes the language of Grammar 1 by the Input Map shown in Table 2. The essence of the DA is a two-element vector z corresponding to a position on the Sierpinski triangle. The DA functions as follows: when z is in the subset of the plane specified in the Compartment column, the possible inputs are those shown in the Input column. Given a compartment and a legal input for that compartment, the change in z that results from reading the input is shown in the State Change column.
[1] For a set of symbols Σ, Σ* denotes the set of all finite strings of symbols drawn from Σ.
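As a concrete check on this labeling scheme, the short sketch below (my own illustration, not code from the paper) encodes a stack as a point by composing one push map per symbol; the particular assignment of A, B, and C to the three subtriangles is my reading of Figure 1a and Table 2, not something the text states explicitly.

```python
# Push maps for the three stack symbols (each maps the triangle into one subtriangle).
PUSH = {
    'A': lambda z: ((z[0] + 1.0) / 2.0, z[1] / 2.0),  # toward the vertex (1, 0)
    'B': lambda z: (z[0] / 2.0, z[1] / 2.0),          # toward the origin
    'C': lambda z: (z[0] / 2.0, (z[1] + 1.0) / 2.0),  # toward the vertex (0, 1)
}

def encode(stack, start=(0.5, 0.5)):
    """Map a stack, written oldest symbol first, to a point; `start` encodes the empty stack."""
    z = start
    for sym in stack:
        z = PUSH[sym](z)
    return z

print(encode("BC"))   # (0.125, 0.625): the midpoint labeled CB in Figure 1a
```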

Table 2: Dynamical Automaton (DA 1).

Compartment               Input    State Change
z1 > 1/2 and z2 < 1/2     b        z → z − (1/2, 0)
z1 < 1/2 and z2 < 1/2     c        z → z + (0, 1/2)
z1 < 1/2 and z2 > 1/2     d        z → 2(z − (0, 1/2))
Any                       a        z → (1/2)(z + (1, 0))

If we specify that the DA must start with z = (1/2, 1/2), make state changes according to the rules in Table 2 as symbols are read from an input string, and return to z = (1/2, 1/2) (the Final Region) when the last symbol is read, then the computer functions as a recognizer for the language of Grammar 1. To see this intuitively, note that any subsequence of the form a b c d invokes the identity map on z. Thus DA 1 is equivalent to the nested finite-state machine version of Grammar 1. For illustration, the trajectory corresponding to the string a b c a a b c d b c d d is shown in Figure 1b (1. a is the position after the first symbol, an a, has been processed; 2. b is the position after the second symbol, a b, has been processed; etc.). One can construct a wide variety of computing devices which organize their computations around fractals. At the heart of each fractal computer is a set of iterating functions (Tabor, sub) which have associated stable states and can be analyzed using the tools of dynamical systems theory (Barnsley, 1993). Hence the name, Dynamical Automaton.
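To make the recognizer concrete, here is a minimal sketch of DA 1 written directly from Table 2 (the code is my own; the paper gives none). All quantities involved are dyadic rationals, so the floating-point arithmetic below is exact and the equality test against the Final Region is safe.

```python
def da1_accepts(string):
    """Recognizer for the language of Grammar 1, following Table 2."""
    z1, z2 = 0.5, 0.5                            # start at the midpoint of the hypotenuse
    for sym in string:
        if sym == 'a':                           # compartment: Any
            z1, z2 = (z1 + 1.0) / 2.0, z2 / 2.0  # z <- (1/2)(z + (1, 0))
        elif sym == 'b' and z1 > 0.5 and z2 < 0.5:
            z1 = z1 - 0.5                        # z <- z - (1/2, 0)
        elif sym == 'c' and z1 < 0.5 and z2 < 0.5:
            z2 = z2 + 0.5                        # z <- z + (0, 1/2)
        elif sym == 'd' and z1 < 0.5 and z2 > 0.5:
            z1, z2 = 2.0 * z1, 2.0 * (z2 - 0.5)  # z <- 2(z - (0, 1/2))
        else:
            return False                         # symbol not legal in this compartment
    return (z1, z2) == (0.5, 0.5)                # Final Region

assert da1_accepts("abcd")
assert da1_accepts("abcdabcd")
assert da1_accepts("abcaabcdbcdd")               # the center-embedded example
assert not da1_accepts("abcda")
```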

3. The general case and neural implementation


The method of Section 2 can be extended to languages requiring any finite number of stack alphabet symbols ((Moore, 1996), (Tabor, sub)). For an alphabet of N symbols, 1, 2, ..., N, consider the functions p_i : R^N → R^N defined by p_i(z) = (1/2)(z + e_i), where e_i is the vector with a 1 in the ith position and 0s elsewhere. Let the starting state of the automaton be the vector in R^N with every element equal to 1/2. Then an application of p_i to z corresponds to pushing symbol i onto the stack of a pushdown automaton, and an application of p_i^(-1) corresponds to popping i off the stack. The compartments used in the input map are the sets p_i(S), where S is the open polygon in R^N with vertices at {e_1, ..., e_N}. To make sure the DA never tries to pop a symbol it has not pushed, the input map must be defined so that all moves out of compartment p_i(S) always begin with an application of p_i^(-1). (Tabor, sub) shows that if the p_i are pooling functions on S (i.e., p_i(S) ∩ p_j(S) = ∅ for i ≠ j, and ∪_i p_i(S) ⊆ S), then every stack state corresponds to a unique point in S, provided the start state is outside of ∪_i p_i(S). Since the current example satisfies this condition, this fractal memory never confuses its histories. DAs that obey these conditions and thus emulate pushdown automata are called pushdown DAs (or PDDAs). The pooling functions for the previous example are z → (1/2)(z + e_1), z → (1/2)(z + e_2), and z → (1/2)z on the open triangle with vertices at the origin, e_1, and e_2 in R^2.
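The push and pop maps of this construction are easy to state in code. The sketch below (an illustration under the definitions just given, not the paper's code) checks that popping inverts pushing and that all depth-3 stacks over a three-symbol alphabet receive distinct codes.

```python
from itertools import product

def push(z, i):
    """p_i(z) = (z + e_i)/2: push stack symbol i (1-based) in R^N."""
    return tuple((zj + (1.0 if j == i - 1 else 0.0)) / 2.0 for j, zj in enumerate(z))

def pop(z, i):
    """p_i^(-1)(z) = 2z - e_i: pop stack symbol i."""
    return tuple(2.0 * zj - (1.0 if j == i - 1 else 0.0) for j, zj in enumerate(z))

N = 3
start = (0.5,) * N                       # every element equal to 1/2

# Popping undoes pushing ...
assert pop(push(start, 2), 2) == start

# ... and distinct depth-3 stacks are coded by distinct points.
codes = set()
for stack in product(range(1, N + 1), repeat=3):
    z = start
    for sym in stack:
        z = push(z, sym)
    codes.add(z)
print(len(codes))                        # 27
```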

Table 3: Parameterized Dynamical Automaton M(mL, mR).

Compartment        Input    State Change
(0, 1] × {0}       l        z → (mL z1, z2)
(0, 1) × {0}       r        z → (0, mR z1)
{0} × (0, 1)       r        z → (0, mR z2)

Figure 2: M(1/2, 17/8) accepting l^3 r^3. [The plot shows the unit square with axes z1 (horizontal) and z2 (vertical); the corners (0, 0), (1, 0), and (0, 1) and the l and r moves of the trajectory are marked.]

Dynamical Automata can be implemented in neural networks by using a combination of signaling units and gating units. By a signaling unit, I mean the standard sort of unit which sends out a signal reflecting its activation state to other units it is connected to. By a gating unit, I mean a unit which serves to block or allow transmission of a signal along a connection between two other units. All units (signaling and gating) compute a weighted sum of their inputs and pass this through an activation function, either identity or a threshold.
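To make this concrete, here is a minimal sketch under my own simplifying assumptions (it is not the paper's construction, and the compartment-detecting threshold units are omitted): each input symbol is represented by a 0/1 unit whose activation gates the recurrent connections that carry the linear part of that symbol's state change, while the constant part arrives over ordinary weighted connections from the same unit.

```python
# Affine state changes of Table 2, written as z -> Q z + r with diagonal Q.
Q = {'a': (0.5, 0.5), 'b': (1.0, 1.0), 'c': (1.0, 1.0), 'd': (2.0, 2.0)}
R = {'a': (0.5, 0.0), 'b': (-0.5, 0.0), 'c': (0.0, 0.5), 'd': (0.0, -1.0)}

def network_step(z, inputs):
    """One update of a two-unit state layer with identity activations.
    `inputs` holds 0/1 activations for the four symbol units; each symbol unit
    gates the recurrent (linear-term) connections and feeds the constant terms."""
    z1 = sum(inputs[s] * (Q[s][0] * z[0] + R[s][0]) for s in Q)
    z2 = sum(inputs[s] * (Q[s][1] * z[1] + R[s][1]) for s in Q)
    return (z1, z2)

z = (0.5, 0.5)                           # start state
for sym in "abcd":
    z = network_step(z, {s: 1.0 if s == sym else 0.0 for s in "abcd"})
print(z)                                 # (0.5, 0.5): back in the Final Region
```

With a one-hot symbol vector, the sums select exactly the affine map for the symbol being read, reproducing the state changes of Table 2.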

The use of simple affine functions (z → qz + r) to define the state changes in a DA makes for a simple translation into a network with signaling and gating units. The coefficients q and r determine weights on connections. The connections corresponding to linear terms (e.g., q) are gated connections. The connections corresponding to constant terms (e.g., r) are standard connections. When these affine functions can be interpreted as compositions of pooling functions and their inverses, it is easy to define a PDDA's compartments neurally: a conjunction of linear separators isolates each compartment.
4. Navigation in dynamical automaton space
As I suggested at the beginning, one incentive for studying neural networks in a dynamical automaton setting is that DA analysis provides a more global view of parameter space than the standard gradient descent procedures. A simple case illustrates this idea. Consider the parameterized dynamical automaton M(mL, mR), which operates on the two-symbol alphabet Σ = {l, r} and has the input map shown in Table 3.

The starting point for M is the point (1, 0) and the Final Region is the set {0} × [1, ∞). The scalars mL (Leftward move) and mR (Rightward move) are parameters which can be adjusted to change the language the DA recognizes. Figure 2 illustrates the operation of this dynamical automaton. When 0 < mL = mR^(-1) < 1, M recognizes the language l^n r^n.

When mL ≠ mR^(-1), a variety of interesting languages result. Under every parameterization, M recognizes strings of the form l^n r^k where k is the smallest integer satisfying mL^n mR^k ≥ 1. This implies that k = ⌈-n log_mR mL⌉, where ⌈x⌉ denotes the smallest integer greater than or equal to x.
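A small simulation can make this concrete. The sketch below is my own illustration: it runs M(mL, mR) using the entries of Table 3 as reconstructed above (so the compartment tests should be read as illustrative rather than authoritative) and checks the resulting judgments against the characterization of the accepted strings just given.

```python
def M_accepts(string, mL, mR):
    """Run the parameterized DA M(mL, mR); see Table 3."""
    z1, z2 = 1.0, 0.0                       # starting point (1, 0)
    for sym in string:
        if sym == 'l' and 0.0 < z1 <= 1.0 and z2 == 0.0:
            z1 = mL * z1                    # leftward move along the z1 axis
        elif sym == 'r' and 0.0 < z1 < 1.0 and z2 == 0.0:
            z1, z2 = 0.0, mR * z1           # first r: jump to the z2 axis
        elif sym == 'r' and z1 == 0.0 and 0.0 < z2 < 1.0:
            z2 = mR * z2                    # later r's: climb the z2 axis
        else:
            return False
    return z1 == 0.0 and z2 >= 1.0          # Final Region {0} x [1, infinity)

assert M_accepts('lll' + 'rrr', 0.5, 2.0)       # mL = 1/mR: l^n r^n
assert not M_accepts('lll' + 'rr', 0.5, 2.0)
assert M_accepts('ll' + 'rrrr', 0.25, 2.0)      # mL = mR^(-2): l^n r^(2n)
assert M_accepts('lllrrr', 0.5, 17 / 8)         # M(1/2, 17/8) accepts l^3 r^3 (Figure 2)
```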

Figure 3: The bands in the space mL × mR where the simplest (two-rule) context free languages reside. [The plot shows the mR (horizontal) versus mL (vertical) plane, each axis running from 0.0 to 3.0.]

If mL ≥ 1 or mR ≤ 1, then the language of M is a finite-state language. If mL < 1 and mL is a negative integer power of mR, then M generates a context free language which can be described with two rules. For example, if mL = 1/4 and mR = 2, then k = 2n and the language of M is l^n r^(2n). This language is generated by the context free grammar {S → l r r, S → l S r r}. Non-whole-number rational relationships between mL and mR produce more complex context free languages (i.e., languages requiring more rules). Not surprisingly, irrational relationships produce non-context-free languages (Tabor, sub). Figure 3 is a map of part of the parameter space mL × mR. The curves show the points at which the simplest (two-rule) context-free grammars reside.

Although this analysis considers only a very simple case, it is interesting because it suggests a new way of looking at learning. A map of the regions in parameter space where the simplest languages of a given complexity reside may be useful as a navigational tool in the process of identifying a good model for a data stream. For example, the map may provide insight into how to steer a gradient-descent mechanism away from local minima, or how to encourage it to focus on solutions which are not unduly complex.

DAs are appealingly compatible with gradient descent learning. Suppose we adopt the following simple method of generating strings with a DA: let the DA generate transitions at random according to its rules, with equal probabilities assigned to the transitions out of the same compartment. Define the behavioral distance between DA M (for Model) and DA T (for Target) as the expected value of the distance between the output probability distributions of M and T upon reading an arbitrary symbol. Then, if the DAs' transition functions are well behaved, the parameter space distance shrinks continuously with the behavioral distance between M and T. This situation permits application of gradient descent learning every time a symbol is read, with only positive evidence considered. As in (Wiles and Elman, 1995) and (Rodriguez et al., ta), this makes training more like human learning of natural language and avoids the problem of selecting negative examples.
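As a small illustration of this generation scheme (my own sketch, not a definition from the paper), the functions below compute the next-symbol distribution that DA 1 (Table 2) assigns at a given state, with equal probabilities over the inputs legal in the current compartment, together with an L1 distance that could serve as the per-symbol comparison inside the behavioral distance.

```python
def legal_inputs(z1, z2):
    """Inputs that DA 1 (Table 2) can read at state (z1, z2)."""
    legal = ['a']                            # 'a' is legal in any compartment
    if z1 > 0.5 and z2 < 0.5:
        legal.append('b')
    if z1 < 0.5 and z2 < 0.5:
        legal.append('c')
    if z1 < 0.5 and z2 > 0.5:
        legal.append('d')
    return legal

def next_symbol_distribution(z1, z2):
    """Equal probabilities over the transitions out of the current compartment."""
    legal = legal_inputs(z1, z2)
    return {sym: 1.0 / len(legal) for sym in legal}

def l1_distance(p, q):
    """Distance between two output distributions over symbols."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

print(next_symbol_distribution(0.75, 0.25))  # {'a': 0.5, 'b': 0.5}
```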

5. Conclusions
A general difficulty with applying neural networks to complex problems is that their learned representations are hard to interpret. Dynamical automaton analysis is a way of using notions from complexity theory to identify useful landmarks in the space of neural representations. Such a global perspective may be helpful in surmounting the challenges of non-toy problems.

References
Barnsley, M. ([1988] 1993). Fractals Everywhere, 2nd ed. Academic Press, Boston.

Das, S., Giles, C. L., and Sun, G. Z. (1992). Learning context-free grammars: Capabilities and limitations of neural networks with an external stack memory. In Proceedings of the 14th Annual Conference of the Cognitive Science Society, pages 791-795. Erlbaum, Hillsdale, NJ.

Das, S., Giles, C. L., and Sun, G. Z. (1993). Using prior knowledge in a NNPDA to learn context-free languages. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 65-72. Morgan Kaufmann, San Mateo, CA.

Giles, C., Sun, G., Chen, H., Lee, Y., and Chen, D. (1990). Higher order recurrent networks & grammatical inference. In Touretzky, D., editor, Advances in Neural Information Processing Systems 2, pages 380-387. Morgan Kaufmann Publishers, San Mateo, CA.

Hopcroft, J. E. and Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Menlo Park, California.

Moore, C. (1996). Dynamical recognizers: Real-time language recognition by analog computers. TR No. 96-05-023, Santa Fe Institute.

Mozer, M. C. and Das, S. (1993). A connectionist symbol manipulator that discovers the structure of context-free languages. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 863-870. Morgan Kaufmann, San Mateo, CA.

Pollack, J. B. (1987). On connectionist models of natural language processing. Ph.D. thesis, Department of Computer Science, University of Illinois.

Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7:227-252.

Rodriguez, P., Wiles, J., and Elman, J. (ta). How a recurrent neural network learns to count. Connection Science.

Shieber, S. M. (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8. Also in Savitch, W. J., et al., editors, The Formal Complexity of Natural Language, pages 320-334.

Siegelmann, H. (1996). The simple dynamics of super Turing theories. Theoretical Computer Science, 168:461-472.

Siegelmann, H. T. and Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6):77-80.

Sun, G. Z., Chen, H. H., Giles, C. L., Lee, Y. C., and Chen, D. (1990). Connectionist pushdown automata that learn context-free grammars. In Caudill, M., editor, Proceedings of the International Joint Conference on Neural Networks, pages 577-580. Lawrence Erlbaum, Hillsdale, NJ.

Tabor, W. (sub). Metrical relations among analog computers. Draft version available at http://www.cs.cornell.edu/home/tabor/tabor.html.

Wiles, J. and Elman, J. (1995). Landscapes in recurrent networks. In Moore, J. D. and Lehman, J. F., editors, Proceedings of the 17th Annual Cognitive Science Conference. Lawrence Erlbaum Associates.

Zheng, Z., Goodman, R. M., and Smyth, P. (1994). Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5(2):320-330.
