
Earley Parsing for Context-Sensitive Grammars

Daniel M. Roberts

May 4, 2009

1 Introduction

The field of parsing has received considerable attention over the last 40 years, largely because of its applicability to nearly any problem involving the conversion of data. It has proven to be a beautiful example of how theoretical research, in this case automaton theory, and clever algorithms can lead to vast improvements over naive or brute-force implementations. String parsing has applications in data retrieval, compilers, and many other fields. For compilers especially, format description languages such as YACC have been developed to define how code in a text file should be interpreted as a tree structure which the compiler can handle directly. While this is good for the specialized domain, there are cases when one might wish to use a lighter-weight, all-in-one approach that also includes a small amount of in-line programmability. On the flip side of the pursuit of expressiveness are efficiency constraints. If the language is quite restricted, the parsing algorithm can be more tightly tailored to fit the problem, and thus can be more efficient. In practice, a simple language that is a subset of a more complicated language often outperforms the more expressive language when the two are given the same simple-language input. For example, the regular expression engines of Java, Python, Ruby, Perl, and certainly others, which include expressive features such as look-aheads, back-references, and boundary conditions, have worst-case exponential performance even on strictly regular regular expressions [1], whereas an implementation specialized for strict regular expressions can achieve worst-case quadratic time, or linear time if we allow preprocessing. The goal of the current research is to explore how to go about creating a parsing language that, in addition to being generally efficient, is in fact maximally efficient on restricted sublanguages with known solutions.

This work is mainly concerned with three formats for specifying the structure of textual documents: (1) regular expressions, which describe the regular languages, (2) context-free grammars (CFGs) with regular right-hand sides, which describe the context-free languages, and (3) what we shall call the context-sensitive grammars (CSGs)¹, which will be the focus of this paper. Regular expressions and CFGs have long been staples of parsing and are used for everything from web apps and text search to compilers. Matching a string of length m to a regular expression of length n is known to be worst case O(nm) in time if we do not allow preprocessing, and this bound can be achieved with the Thompson algorithm, described below. Parsing a string of length m in accordance with a context-free grammar of size n is known to be worst case O(m²n) in time, and an algorithm that achieves this time bound is the Earley algorithm, which is based on the same principle as Thompson's. Earley can also be extended to parse strings according to a CSG, but because of the wide range of expressiveness allowed by this format, there is no good worst-case time bound. Nevertheless, one key property of the extension is that when the CSG describes a context-free language, it is no less efficient than traditional Earley parsing. Since both regular expressions and CFGs are subsets of CSGs, this augmented grammar specification format offers more flexibility than either, without sacrificing efficiency in the cases where special features are not used. I shall begin with an overview of the theory for parsing regular expressions and context-free grammars.

¹ CSGs as defined here are not to be confused with the related but different definition of a context-sensitive grammar used in relation to the Chomsky hierarchy.

2 Regular Expressions

Regular expressions have the form

    rexp ::= Eps | Char(char) | Alt(rexp,rexp) | Cat(rexp,rexp) | Star(rexp)

Basic regular expression syntax and matching conditions can be described as follows.

1. Eps. A string matches () if it is the empty string.
2. Char(c). A string matches c if it consists of the single character c.
3. Alt(re1, re2). A string matches re1|re2 if it matches either re1 or re2.
4. Cat(re1, re2). A string matches re1 re2 if it can be split into two strings s1 and s2 such that s1 matches re1 and s2 matches re2.
5. Star(re). A string matches re* if it can be split into k ≥ 0 strings, all of which match re.

In addition, it is conventional to give the operators |, concatenation, and * binding preferences that are analogous to those of +, ×, and x² in mathematical expressions, thus reducing the need for parentheses. It is also conventional to leave out the explicit concatenation symbol, just as the multiplication sign is left out of mathematical expressions. Thus, for example, the regular expression ab|cd* stands for the
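As a concrete illustration (not part of the original formulation), the rexp datatype above might be rendered in Python as follows; the class names simply mirror the constructors used throughout this section.

    # A minimal sketch of the rexp datatype, assuming Python dataclasses.
    from dataclasses import dataclass

    class Rexp: pass

    @dataclass
    class Eps(Rexp): pass              # matches the empty string

    @dataclass
    class Char(Rexp):
        c: str                         # matches the single character c

    @dataclass
    class Alt(Rexp):
        left: Rexp                     # matches left or right
        right: Rexp

    @dataclass
    class Cat(Rexp):
        first: Rexp                    # matches a split s1 s2
        second: Rexp

    @dataclass
    class Star(Rexp):
        inner: Rexp                    # matches k >= 0 copies of inner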

more explicit (a·b)|(c·(d*)), and matches either the string ab or any string of the form cddd..., i.e. a single c followed by zero or more ds.

The naive way to match a regular expression is to match its parts recursively, more or less as described above. Although this algorithm has been shown to be exponential in the degenerate case, it is by far the most widely used algorithm in practice, because it is conceptually the simplest to implement and it is trivially extensible to include more powerful features such as back-references, which are inherently non-regular in the strict sense. It would be desirable, however, for the mere existence of such language features not to interfere with the computational complexity of parsing against expressions that use only the operations given here. This is the motivation for the present research.

The polynomial time algorithm alluded to above is attributed to Ken Thompson, 1968. It works by reading in the characters of the string one at a time, keeping track of all possible parses at once. To assist in keeping track of the partial parses, the algorithm first builds a directed graph representation of the regular expression so that each node represents a parse state, and each transition is labeled with the character that needs to be parsed in order for the edge to be followed. Some edges can be unlabeled, in which case no character is needed to pass from one state to the other. This type of graph is often called a Finite State Automaton (FSA). Below is an example of a recursive function in pseudocode that returns the initial node I and the final node F. A parse that begins at the initial node and ends at the final node is a successful parse.

    FSA(re):
      I = fresh_state()
      F = fresh_state()
      switch(re):
        Eps:
          return (F, F)
        Char(c):
          build [I --c--> F]
          return (I, F)
        Alt(re1, re2):
          (I1, F1) = FSA(re1)
          (I2, F2) = FSA(re2)
          build [I ---> I1]
          build [I ---> I2]
          build [F1 ---> F]
          build [F2 ---> F]
          return (I, F)
        Cat(re1, re2):
          (I1, F1) = FSA(re1)
          (I2, F2) = FSA(re2)
          build [F1 ---> I2]
          return (I1, F2)
        Star(re1):
          (I1, F1) = FSA(re1)
          build [I ---> I1]
          build [F1 ---> I]
          build [I ---> F]
          return (I, F)

The Thompson algorithm to parse a string str against an FSA keeps track of Sk, the set of states that are reachable after parsing the first k characters, starting with k = 0 up to the length of the string. Specifically, for k ≥ 0, Sk+1 depends only on Sk and str[k]. Here are the rules for moving ahead with Thompson:

    Initialization:    I ∈ S0
    Consumption:       u ∈ Sk, u --str[k]--> u'  ⟹  u' ∈ Sk+1
    Null Propagation:  u ∈ Sk, u ---> u'         ⟹  u' ∈ Sk

When F ∈ Sk, the first k characters of the string match the regular expression.

There are a number of additions to the syntax of regular expressions that do not affect the overall complexity of the grammars they describe. In particular we shall use

1. re+, which is equivalent to re re*, and
2. re?, which is equivalent to re|().

It is also convenient, both for ease of expression and for efficiency of parsing, to have character classes. A character class matches any symbol in a given set of symbols: for example, we may wish to have \a match any alphabetic character, have \w match any alphanumeric character, have a period (.) match any character, etc. Since we can often represent a set of characters as a range in the ASCII character set, it is more efficient for a parser to see whether a character falls within the range than to see whether it is equal to one of some list of characters, because the latter approach requires a linear-time search.
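The set-based simulation is compact enough to sketch directly. The following Python fragment is an illustration, not the paper's implementation; the edge representation (a list of (source, label, target) triples, with label None for a null edge) is assumed.

    def thompson_match(edges, I, F, s):
        """Return True if the FSA with initial state I and final state F
        accepts the string s. edges: list of (u, label, v) triples,
        where label is None for a null (unlabeled) edge."""
        def null_close(states):
            # Null Propagation: follow unlabeled edges to a fixed point.
            closure = set(states)
            frontier = list(states)
            while frontier:
                u = frontier.pop()
                for (a, label, b) in edges:
                    if a == u and label is None and b not in closure:
                        closure.add(b)
                        frontier.append(b)
            return closure

        S = null_close({I})            # Initialization: I in S0
        for ch in s:
            # Consumption: follow edges labeled with the next character.
            S = null_close({b for (a, label, b) in edges
                            if a in S and label == ch})
        return F in S                  # acceptance: F in S_len(s)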

3 Context Free Grammars

The context-free grammars describe a class of languages that is in some sense infinitely more complex than regular expressions. Among the language features that CFGs can describe but that regular expressions cannot are

1. matching parentheses,
2. recursive algebraic structures, and
3. trees of arbitrary depth.

The idea of recursion is explicitly built into the definition of a CFG. This makes them perfect for describing many formal languages, such as programming language syntax and data formats. As we will see shortly, the regular expression syntax can be described as a CFG, and for that matter so can CFG syntax.

3.1 CFG Syntax

For the purposes of this paper, a CFG is a context-free grammar with regular right-hand sides. A CFG is a list of rules of the form

    nont = rexp;

where nont is the nonterminal symbol being defined and rexp is a regular expression over terminals and nonterminals. The first nonterminal in the list of definitions is taken as the start nonterminal, which is to say that a string matches the CFG if it matches the first nonterminal. To distinguish nonterminals from terminals, nonterminals are enclosed in curly braces: {nont}. Just as character classes can be simulated by the other operations, namely alternation, so too can regular expressions be simulated by the more basic CFG syntax, which allows only concatenation. Allowing regular right-hand sides not only simplifies the notation for the programmer, but also makes parsing more efficient.

In addition to this baseline language, there are two extra syntactic features that do not express conditions on whether a string is accepted, but instead affect how verbose the resulting parse tree is. By default, all nonterminals processed in the course of matching a string are represented as a node in the tree, and all terminals are not represented at all. To hide a nonterminal and let its subtree be subsumed by its parent, the syntax

    .nont = rexp;

is used. To express in the tree the characters that appear in a regular expression, use the syntax $(re). This is especially useful when you want the parse tree to keep track of the actual character used to match a word class: $(.), $(\a), etc.; or to get the result of an alternation: $(a|b|c), etc. For grouping that does not save, brackets are used. Below is an example CFG that defines regular expression syntax:

    rexp = {rexp1};
    alt = {rexp2} (\| {rexp2})+;
    cat = {rexp3} {rexp3}+;
    uny = {rexp4} $(\*|\+|\?);
    eps = \(\);
    c = $(\w|\.|\\.);
    .paren = \({rexp1}\);
    .rexp1 = {rexp2} | {alt};
    .rexp2 = {rexp3} | {cat};
    .rexp3 = {rexp4} | {uny};
    .rexp4 = {eps} | {c} | {paren};

Here the four nonterminals rexp1, rexp2, rexp3, and rexp4 represent four different levels of binding. This prevents a string such as ab|cd* from being interpreted as the regular expression (a(b|c)d)* or the like, and forces the intended reading (ab)|(c(d*)).

3.2 CFG Transducers

The transducers needed to represent a CFG will not be finite state automata in general, because the expressive power of FSAs is equal to that of regular expressions. We will compile one FSA for each regular right-hand side, and use a new type of graph edge to join them together: the call edge. If some node u in nonterminal A has a transition u --{B}--> u' (where A may or may not be the same as B), we build a call edge from u to the start state of B's FSA. To use this kind of transducer, we have to maintain a stack of return addresses: when we follow u --call--> I_B, we push u onto the stack. When we reach F_B, we pop the first item off the stack, u, and transition to some u' where u --{B}--> u'. [3] Once a transducer has been assembled from a CFG, it can be marshaled and stored as a preprocessed form of the CFG, to be used directly for parsing.
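To make the edge vocabulary concrete, here is one possible Python encoding (an illustrative sketch; the representation is assumed, not taken from the paper's implementation). Each nonterminal's FSA contributes char and null edges, and call edges tie the FSAs together; the {B} edge itself is kept so that returns know where to resume. The Earley sketch in the next subsection assumes this encoding.

    from dataclasses import dataclass

    # Edge kinds for a CFG transducer. States u, v are ids (ints).
    @dataclass
    class CharEdge:                 # u --c--> v : consume character c
        u: int
        c: str
        v: int

    @dataclass
    class NullEdge:                 # u ---> v : consume nothing
        u: int
        v: int

    @dataclass
    class NontEdge:                 # u --{B}--> v : resume here after B returns
        u: int
        nont: str
        v: int

    @dataclass
    class CallEdge:                 # u --call--> I_B : jump into B's FSA
        u: int
        target: int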

3.3 Earley Parsing

As with FSAs, these transducers are in general non-deterministic; that is, from any given state there may be null edges or multiple edges of the same type. Either of these means that during the parse, there will be moments where it is not clear which state we ought to move to next. As with regular expressions, this nondeterminism can in theory be dealt with by recursively trying every possible path until a match is found, but this kind of backtracking leads to poor performance that is worst-case exponential. The classic polynomial-time solution to this problem was proposed by Jay Earley [2] and has a similar flavor to the Thompson algorithm for regular expressions. The version described below has been modified to work with the CFG transducers described above, and is largely based on a version described by Trevor Jim and Yitzhak Mandelbaum [3].

For every position 0 ≤ j ≤ m in the string to be parsed, the algorithm constructs an Earley set. Just as in Thompson, an Earley set is a set of possible parse states, but in the case of CFGs, a transducer vertex does not fully describe a parse state, because we also need to be able to reconstruct the call stack. One way to do this would be to describe a parse state as a transducer vertex plus the call stack, but this has the downside of being very inefficient, because there is no bound on the length of the call stack, and thus these sets of possible states would be able to grow arbitrarily large. A far more compact way to represent the call stack is with a return address i, not to a vertex, but to an Earley set. What this means is that any vertex with a representative in the ith Earley set is a valid return address. Thus, if two parse states in the same Earley set call the same nonterminal, the sub-parse is only done once, rather than once per call. Below are the formal parsing semantics.

Rules carried over from Thompson:

    Initialization:      (I, 0) ∈ S0
    Consumption:         (u, i) ∈ Sj, u --str[j]--> v                        ⟹  (v, i) ∈ Sj+1
    Null Propagation:    (u, i) ∈ Sj, u ---> v                               ⟹  (v, i) ∈ Sj

New rules:

    Call Propagation:    (u, i) ∈ Sj, u --call--> v                          ⟹  (v, j) ∈ Sj
    Return Propagation:  (u, i) ∈ Sj, u = F_A, (u', i') ∈ Si, u' --{A}--> v  ⟹  (v, i') ∈ Sj

The general algorithm is to seed a set Sj with Initialization if j = 0 and Consumption otherwise; to repeat the three Propagation rules until Sj does not change; and to recursively apply this to Sj+1 if j < m. There are many ways to optimize propagation. One case where this can significantly increase efficiency is when a nonterminal is nullable, that is, when it matches the null string. Nullable nonterminals require applying the rules in several rounds until nothing changes, because if a nonterminal doesn't consume any characters, Return Propagation searches through the items in the same Earley set (i = j above), which is as yet unfinished. This issue, however, does not affect the overall time-complexity of the algorithm, which has worst case time O(m²n) in either case, and this paper does not deal with such optimizations.
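A direct transcription of these rules into Python might look like the following sketch (an illustration, not the paper's implementation). It assumes the edge classes from the previous subsection's sketch, plus a map final_of from each final state F_A to the nonterminal A it finishes.

    def earley_parse(edges, I, F, final_of, s):
        """Accept/reject s with a CFG transducer (illustrative sketch).
        I, F: initial and final states of the start nonterminal.
        final_of: dict mapping each final state F_A to its nonterminal A."""
        m = len(s)
        sets = [set() for _ in range(m + 1)]
        sets[0].add((I, 0))                              # Initialization
        for j in range(m + 1):
            changed = True                               # repeat the Propagation
            while changed:                               # rules until S_j is stable
                changed = False
                for (u, i) in list(sets[j]):
                    new = set()
                    for e in edges:
                        if isinstance(e, NullEdge) and e.u == u:
                            new.add((e.v, i))            # Null Propagation
                        elif isinstance(e, CallEdge) and e.u == u:
                            new.add((e.target, j))       # Call Propagation
                    if u in final_of:                    # Return Propagation
                        A = final_of[u]
                        for (u2, i2) in list(sets[i]):
                            for e2 in edges:
                                if (isinstance(e2, NontEdge)
                                        and e2.u == u2 and e2.nont == A):
                                    new.add((e2.v, i2))
                    if not new <= sets[j]:
                        sets[j] |= new
                        changed = True
            if j < m:                                    # Consumption seeds S_{j+1}
                for (u, i) in sets[j]:
                    for e in edges:
                        if isinstance(e, CharEdge) and e.u == u and e.c == s[j]:
                            sets[j + 1].add((e.v, i))
        return (F, 0) in sets[m]          # start nonterminal spans the whole string

The outer fixed-point loop is deliberately naive: it re-runs the rules over the whole set until nothing changes, which is exactly the multi-round behavior the nullable-nonterminal case requires.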

4 Context Sensitive Grammars

The purpose of the present paper is to explore a grammar that has context-sensitive features, but that looks formally quite similar to our formulation of CFGs with regular right-hand sides. In formal language theory, context-sensitivity is often formulated by loosening the restriction that rules have only a single nonterminal on the left-hand side. The present formulation is easier to reason with and incorporates some familiar features of imperative programming, such as the ability to pass arguments to subroutines, to store values in variables, and to reason with and operate on both values and variables.

4.1 CSG Syntax

The augmentation of CFG syntax to accommodate forms of context-sensitivity can be done entirely by adding new types of expressions to regular expressions, now renamed rhss due to their lack of regularity in the grammar-theory sense.

    var = string
    nont = string
    rhs ::= Eps | Char(char) | Alt(rhs,rhs) | Cat(rhs,rhs) | Star(rhs)
          | Nont(nont, exp) | Assert(exp) | Capture(rhs,var) | Set(var,exp)
    exp ::= ...

Here is a summary of the additions:

1. A call to a nonterminal may contain an argument, which the nonterminal may use to guide its parse.
2. An assert statement, for example [len(x) > 3], which matches the empty string if and only if the expression evaluates to true.
3. A capture statement, which after matching an rhs, stores the matched string in a variable, which may be referenced later. For example, (.. @ x) matches two characters and stores them in the variable x.
4. A set statement: we may set and reset variables at any time with a command like (x=len(y)).
The expression language exp may be as expressive as one wishes, so long as it does not modify the external environment, though a very simple language that includes integer calculation and comparison, strings, characters, atoi, string-length, variables, and equality testing is enough to allow this class of grammars to describe many practically applicable cases of context-sensitivity, such as the sketch below.
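The following one-rule grammar is a sketch in the syntax just defined (constructed here for illustration; it does not appear in the original examples). It accepts the classic non-context-free language aⁿbⁿcⁿ by capturing each block of letters and asserting that their lengths agree:

    abc = (a+ @ x) (b+ @ y) (c+ @ z) [len(x)=len(y)] [len(y)=len(z)];

The captures record the three blocks, and the two asserts, which match the empty string, enforce the length equalities that no CFG can express.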

4.2 CSG Transducers

As above, once we have reformulated how to build the right-hand sides, joining them to create the transducer is done as described for CFGs, with the small caveat that call edges are parameterized with the argument to be passed. To build a transducer fragment, we use the method described for regular expressions, also parameterizing nonterminal edges with their argument. An assert edge is labeled with the assertion, and a set edge is labeled with the assignment. Capture(rhs,x) requires its own mechanism, which is to surround the rhs's fragment with an incoming push arrow and an outgoing pop x arrow. The transducer semantics are described below.

4.3 Augmented Earley Parsing

This algorithm is a minimal modification of the Earley algorithm to include extra state information, namely variable contexts and the capture stack. Thus, an Earley item for CSG parsing is (u, i, E, κ), where E is the context, which has type [(var, exp)], and κ is the capture stack, which has type [int].

Rules carried over from Thompson:

    Initialization:    (I, 0, [], []) ∈ S0
    Consumption:       (u, i, E, κ) ∈ Sj, u --str[j]--> v  ⟹  (v, i, E, κ) ∈ Sj+1
    Null Propagation:  (u, i, E, κ) ∈ Sj, u ---> v         ⟹  (v, i, E, κ) ∈ Sj

Rules carried over from vanilla Earley:

    Call Propagation:  (u, i, E, κ) ∈ Sj, u --call(e)--> v  ⟹  (v, j, [(arg, e)], []) ∈ Sj

    Return Propagation:  (u, i, E, κ) ∈ Sj, u = F_A, (u', i', E', κ') ∈ Si, u' --{A}(E[arg])--> v
                                                            ⟹  (v, i', E', κ') ∈ Sj

New rules:

    Assert Propagation:  (u, i, E, κ) ∈ Sj, u --assert(e)--> v, eval(e, E) = Bool(true)
                                                            ⟹  (v, i, E, κ) ∈ Sj
    Set Propagation:     (u, i, E, κ) ∈ Sj, u --set(x,e)--> v    ⟹  (v, i, ((x, e) :: E), κ) ∈ Sj
    Push Propagation:    (u, i, E, κ) ∈ Sj, u --push--> v        ⟹  (v, i, E, (j :: κ)) ∈ Sj
    Pop Propagation:     (u, i, E, (k :: κ)) ∈ Sj, u --pop(x)--> v
                                                            ⟹  (v, i, ((x, str[k:j]) :: E), κ) ∈ Sj

Two details deserve attention. First, Call Propagation has been modified to pass on a context that contains the special variable arg set to the value of the parameter. Second, Return Propagation has been modified so that a nonterminal arc must match both in nonterminal and in parameter. The capture mechanism works as follows: to start a capture, the input position is pushed onto the capture stack; to finish a capture, pop the start position off the stack, extract the substring of the input that starts there and ends at the current position, and store this string in some variable.
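For concreteness, the capture mechanics alone might be sketched like this in Python (an illustration with assumed names, showing only the update of the item's data; the move from u to v along the edge is omitted). Contexts and stacks are immutable tuples so that items can be shared between parse states, mirroring the :: cons above.

    # Push Propagation: starting a capture records the current input position j.
    def push(item, j):
        (u, i, E, kappa) = item
        return (u, i, E, (j,) + kappa)            # j :: kappa

    # Pop Propagation: finishing a capture at position j binds the substring.
    def pop(item, x, s, j):
        (u, i, E, kappa) = item
        k, rest = kappa[0], kappa[1:]             # kappa = k :: rest
        return (u, i, ((x, s[k:j]),) + E, rest)   # (x, str[k:j]) :: E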

5 The Expression Language

In the current implementation, a simple, untyped expression language is used. Here is a BNF outline:

    type exp ::= Var(var) | Unit | Bool(bool) | Int(int) | Char(char) | Str(string)
               | Not(exp) | Equals(exp, exp) | Less(exp, exp) | Minus(exp)
               | Sum(exp, exp) | Prod(exp, exp)
               | GetChar(exp, exp) | Len(exp) | Atoi(exp) | Fail
    type var = string

In addition to the symbols ! for Not, = for Equals, < for Less, - for unary Minus, + for Sum, and * for Prod, the symbols <=, >, >=, and binary - for Sum(e1,Minus(e2)) are used for syntactic convenience. The syntax str[i] is used for GetChar(str,i), len(str) for Len(str), and int(str) for Atoi(str).
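An evaluator for such a language is only a few lines. The following Python sketch (assumed structure, using tuples like ('Sum', e1, e2) for constructors; it is not the paper's implementation) shows the flavor. E is the variable context from the parsing rules, searched front to back so that the most recent binding of a variable wins.

    def eval_exp(e, E):
        """Evaluate an expression against context E = [(var, value), ...]."""
        tag = e[0]
        if tag == 'Var':
            for (x, v) in E:                 # most recent binding first
                if x == e[1]:
                    return v
            raise KeyError(e[1])
        if tag in ('Bool', 'Int', 'Char', 'Str'):
            return e[1]                      # literals carry their value
        if tag == 'Not':     return not eval_exp(e[1], E)
        if tag == 'Equals':  return eval_exp(e[1], E) == eval_exp(e[2], E)
        if tag == 'Less':    return eval_exp(e[1], E) < eval_exp(e[2], E)
        if tag == 'Minus':   return -eval_exp(e[1], E)
        if tag == 'Sum':     return eval_exp(e[1], E) + eval_exp(e[2], E)
        if tag == 'Prod':    return eval_exp(e[1], E) * eval_exp(e[2], E)
        if tag == 'GetChar': return eval_exp(e[1], E)[eval_exp(e[2], E)]
        if tag == 'Len':     return len(eval_exp(e[1], E))
        if tag == 'Atoi':    return int(eval_exp(e[1], E))
        raise ValueError(tag)                # Unit, Fail, etc. omitted here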

6 Parse Tree Building

To return a parse tree from the algorithm outlined above, it suffices to store for each parse-state item a pointer to the item or items that participated in its creation. The scheme used in the current implementation is as follows. Every item stores one of the following parse annotations:

1. When an item is added by Call Propagation, it stores a PCall tag, with no pointers.
2. When an item is added by Return Propagation, it stores PReturn(u', u, A, E[arg], show), where u, u', A, and E correspond to the variables in Return Propagation as stated above, and show records whether the parse of the nonterminal should be given its own subtree, or whether it should be subsumed by the caller's parse; this is expressed syntactically by the omission or inclusion of a period (.) before the name of the nonterminal in the grammar file.
3. When an item is added by anything else, it stores a simple back pointer to the item: PTransEps(item).

Additionally, in order to have some control over what shows up in our parse tree and what is omitted, we can augment the CSG right-hand sides to include a Show(var) constructor. Syntactically this is written ($var). It generates an arc in the transducer that acts just like a null transition, except that if u --show(var)--> v, then v stores a special parse annotation: PShow(exp), where exp is the value of var, E[var]. This lets the tree generator know to include exp as one of the leaves of the tree. Note that a Leaf(exp) records the expression as an expression value, Unit, Bool(b), Int(i), Char(c), or Str(s), preserving the type.
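Rendered as data, the annotations might look like the following sketch (the names come from the description above; the exact fields in the implementation are assumptions of this illustration).

    from dataclasses import dataclass
    from typing import Any

    class ParseAnnot: pass

    @dataclass
    class PCall(ParseAnnot):        # item created by Call Propagation
        pass

    @dataclass
    class PReturn(ParseAnnot):      # item created by Return Propagation
        callee_final: Any           # u : final state of the callee
        caller: Any                 # u': state holding the {A} arc
        nont: str                   # A : nonterminal that was parsed
        arg: Any                    # E[arg] : parameter it was called with
        show: bool                  # own subtree, or subsumed by caller?

    @dataclass
    class PTransEps(ParseAnnot):    # any other rule: simple back pointer
        prev: Any

    @dataclass
    class PShow(ParseAnnot):        # show(var) arc: include E[var] as a leaf
        value: Any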

7 Examples

To test the efficacy of this system in practice, the following examples have been tested on the current implementation.

7.1 IP Addresses

This example matches an IP address, whose format is N.N.N.N, where 0 ≤ N ≤ 255.

    IP = {N255}\.{N255}\.{N255}\.{N255};
    .N255 = (\d+@x)(x=int(x))[x>=0][x<=255]($x);

Note that the nonterminal uses $x to show the value of x, which is an integer and not a string, in the parse.


7.2 Char-Terminated String

This example uses a parameterized nonterminal. The nonterminal until matches the shortest substring that is terminated by the character passed as an argument, i.e. any string that doesn't contain arg anywhere except as the last character.

    program = _ {until \;} _;
    until = $( ( (. @ x)[x[0]!=arg] )* ) (. @ x)[x[0]=arg];

Note that the $ directive is used to save the value of the string matched, without the terminating character.

7.3 XML

XML is often thought of as the all-purpose way to represent a tree structure. Ironically, a normal CFG, the all-purpose way to generate tree structures, is actually incapable of describing XML syntax. This is because the open and close tags in XML have to match, and testing for that match requires string comparison of two sections of the parse. CSGs are capable of doing this.

    program = _ {xml} _;
    xml = \< _ $({word} @ head) (\s+ {setting})* _ \>
          {text}?({xml}+ {text})*{xml}*
          \<\/ _ ({word} @ tail) [head=tail] _ \>;
    setting = $({word}) _ \= _ ($({word})|\"$(((.@cstr)[cstr!="\""])*)\");
    text = $(((. @ cstr)[cstr!="\<"][cstr!="\>"])*);
    .word = (\a | \_)(\w | \_)*;

This little hack represents the basic XML format, that is, a header and parameters in angle brackets (<head A=a b="b b">), followed by text and other XML tags, followed by a matching end tag (</head>).

7.4 Operator Binding Strength

In an earlier CFG example, we described regular expression syntax as a CFG, using multiple nonterminals to achieve order-of-operations rules, or binding strength. Parameterized nonterminals give us another, potentially neater way to express these rules. Here is a version of the previous example that uses a few of our simple programming features.

    program = _ {rexp} _;
    .rexp = [arg=()] {rexp 1}
          | [arg<=1] {alt}
          | [arg<=2] {cat}
          | [arg<=3] {uny}
          | {eps} | {c} | {paren};
    alt = {rexp 2} (\| {rexp 2})+;
    cat = {rexp 3} {rexp 3}+;
    uny = {rexp 4} $(\*|\+|\?);
    eps = \(\);
    c = $(\w|\.|\\.);
    .paren = \({rexp}\);

Note that this example would not work as written if the expressions were type-safe. That is, rexp accepts parameters of type Unit as well as those of type Int. Many successful scripting languages work this way, such as AWK and Python, but it is generally considered bad form in the programming languages community.

8 Performance

In his article concerning regular expressions [1], Russ Cox plots the performance of Perl against GNU grep on matching the string a^n against the regular expression a?^n a^n (e.g. aaaa against a?a?a?a?aaaa). Here is a comparison of Python versus the CSG implementation.

    n        25     26     27     28     29
    Python   7.30   15.0   30.5   61.9   126
    CSG      .016   .017   .018   .019   .019

Table 1: Time to match a^n against a?^n a^n, measured in seconds.

This shows that, as expected, the CSG implementation is equivalent to Thompson's algorithm in the degenerate case. An interesting case where Python significantly outperforms the CSG implementation is in finding two identical words in a string of words. For example, given a aa aaa a, we would like to find a because it appears twice. The Python re for this search is (\b\w+\b).*\b\1\b and the corresponding CSG is

    double = (.*\s)?(.*@x)\s(.*\s)?(.*@y)[x=y] (\s .*)($x);

When matched against strings of the form a a^2 a^3 ... a^k a, Python can instantly handle k in the hundreds. The CSG's performance, on the other hand, has the following behavior:

    k     8      16     17     18     19     20     21     22
    n     45     153    171    190    210    231    253    276
    CSG   .066   1.3    1.9    2.4    3.0    4.3    5.3    6.6

Table 2: Finding repeated words in a a^2 a^3 ... a^k a, measured in seconds. n is the length of such a string.

It seems that the slowness has nothing to do with the linear-time complexity of [x=y]; in fact, removing this assertion makes parsing much slower, not faster, so that it takes 16.5 seconds to parse when k = 16, instead of 1.3 seconds. A minimal example of how captures can slow a parse down is (.*@x).*(.*@y), which shows similar behavior on long strings; the analogous non-capturing regular expression, .*.*.*, matches long strings instantaneously. The exact reason for the CSG's poor performance in certain circumstances when compared with Python has not been shown. Although it may simply be a combination of the fact that Python's algorithm is optimized for word-boundary assertions and that it may involve preprocessing, it is quite possible that the backtracking algorithm is simply faster in this case. The Earley algorithm does, after all, do every parse simultaneously. When dealing with CFGs, the size of an Earley set is bounded. Now that we are dealing with Earley sets that can grow to arbitrary size, this approach may cause the algorithm to take a significant performance hit in some cases.

9 Further Work

This line of research needs to be more thoroughly explored and taken to its logical conclusion in a number of ways. What has been outlined here is a minimal example of how to efficiently incorporate programming features into the CFG and regular expressions paradigm. Here are a few directions that deserve attention in future research.


9.1 Determinization

It is often desirable to determinize a transducer in order to avoid repeatedly following the same unnecessary paths. Full determinization produces a transducer in which (1) no null edges exist and (2) no node has two identical outgoing edges. A determinized transducer thus has the advantage that there is no guesswork, and this results in linear-time parsing. In the case of CSG transducers, full determinization must be compromised because of the nature of the problem, but it is possible that some standard techniques for determinizing CFGs may be applicable or partially applicable to CSG parsing. Potential challenges include:

1. Preserving the behavior of context information and the capture stack
2. Determinizing calls
3. The essentially nondeterministic nature of situations such as (u --push--> v, u --assert(x=y)--> w), where u has two outgoing edges, neither of which eats a character. This can potentially be managed by combining certain null-consuming edges, such as push edges, with the edges they point to, so that (u --push--> v, v --c--> w) becomes (u --push; c--> w).

9.2 Compact Language Features

There are certain syntactic features that programmers are used to that could make CSGs easier to read, some of which would have the added benefit of speeding up parsing. Here are some ideas:

1. Include if p then re1 else re2 syntax, which could be shorthand for [p]re1 | [!p]re2.
2. Include match x with v1 -> re1 or ... or vn -> ren default -> re syntax, which could be shorthand for [x=v1]re1 | [x!=v1](... [x=vn]ren | ([x!=vn]re) ...), but would be more efficient if implemented separately.
3. Guards: [len(re) < 6] and [re = exp], shorthand for (re @ x)[len(x)<6] and (re @ x)[x=exp]. Note, however, that this allows a faster implementation than the shorthands imply. For example, instead of matching an entire string and then checking to see whether it matches some string found earlier, it is possible to match the ith character of the input against the ith character of the target string, as sketched below.
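A sketch of the incremental idea behind the [re = exp] guard (illustrative Python with assumed names, not the paper's implementation): the check is threaded through the parse one character at a time, so a mismatch is discovered at the first differing position instead of after the whole capture.

    def incremental_guard(target):
        """Return a per-character checker for the guard [re = target].
        Feed it (position, char) as the guarded region is consumed;
        it reports failure at the first mismatch rather than at the end."""
        def check(i, ch):
            return i < len(target) and target[i] == ch
        return check

    # Usage: while parsing the guarded rhs, reject a partial parse early.
    check = incremental_guard("abc")
    assert check(0, "a") and check(1, "b")
    assert not check(1, "x")      # fails immediately, no full capture needed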

9.3 Type Safety and Richer Functional Features

The current expression language does not ensure type safety, and doing so would require further integrating the expression language into the CSG syntax, namely by explicitly specifying the argument type of a nonterminal. Moreover, it may be desirable to allow nonterminals to take an arbitrary number of arguments. We may also want to have a richer expression language that includes ways of creating functions, making lists and tuples, etc. While this may be desirable, it may also unnecessarily complicate the language and lead to poor parsing performance.

9.4 Integrating CFGs into the Right Hand Sides

It is a bit of a theoretical eyesore that all of the rich language features are built directly into the right-hand sides except for the association of nonterminals with their rhss. This could be remedied by removing the top-level CFG structure and replacing it with a new rhs constructor:

    rhs ::= ... | WithNonts(rhs, [(nont,rhs)])

This constructor combines an rhs that has undefined nonterminals with a mapping of nonterminals to rhss. This changes the nature of the language in a number of ways. First, the start nonterminal does not need a name, which means that the language also accepts vanilla regular expressions. (In the present system, one must write S = rexp;.) Also, this allows for modularization, where nonterminals have a scope. For example, matching XML files as in the example above requires a number of nonterminals, but no other nonterminal would ever call them directly. Using this modularized CFG notation, the example for XML might look something like:

    _ {xml} _ : {
      xml = \< _ $({word} @ head) (\s+ {setting})* _ \>
            {inner}
            \<\/ _ ({word} @ tail) [head=tail] _ \> : {
        inner = {text}?({xml}+ {text})*{xml}* : {
          text = $(((. @ cstr)[cstr!="\<"][cstr!="\>"])*);
        };
        setting = $({word}) _ \= _ ($({word})|\"$(((.@cstr)[cstr!="\""])*)\");
        .word = (\a | \_)(\w | \_)*;
      };
    }

Unlike some other potential modifications, this would only have the effect of making code easier to read, write, and update. It would not have a significant impact on performance.

9.5 Explicit Tree Construction

In the present system, tree construction is automatic and based on the way the nonterminals are parsed. The only control we have over tree structure from within the language is the somewhat awkward dot-notation to suppress expression of a nonterminal. There may be cases when we want to set aside parts of an rhs as a subtree without explicitly making a nonterminal for it. To add such control over tree construction, we can add the following constructor:

    rhs ::= ... | Label(rhs, string)

This could be written {label: rhs}; nonterminal syntax could be replaced with <nont>; {nont} could become shorthand for {nont: <nont>}; and dot-notation could be eliminated. Thus there would be an explicit way to signal subtree creation, as well as a way to show or not show nonterminal subtrees on a per-case basis.

A second addition would change the language quite dramatically, but may be a good way to incorporate the equivalent of YACC's actions. Essentially, we could replace the constructor above with

    rhs ::= ... | Construct(exp)

So long as the expression language is rich enough, we can explicitly build an arbitrary structure and not rely on the parser's tree representation at all. If we are also allowed to include global type statements for the expression language, we can get something like the following code for building regular expressions:

    type rexp = Eps | Char(string) | Alt([rexp]) | Cat([rexp]) | Star(rexp);

    _ {rexp} _ :
    rexp = <rexp_b 1>;
    rexp_b = [arg<=1] <alt>
           | [arg<=2] <cat>
           | [arg<=3] <uny>
           | <eps> | <c> | <paren>;
    alt = (<rexp_b 2> @ r1) (rlist=[r1])
          (\| (<rexp_b 2> @ r)(rlist=r::rlist))+ {Alt(rlist)};
    cat = (<rexp_b 3> @ r1) (rlist=[r1])
          ((<rexp_b 3> @ r)(rlist=r::rlist))+ {Cat(rlist)};
    uny = (<rexp_b 4> @ r)(\*|\+|\? @ sym)
          ( [sym="*"] {Star(r)}
          | [sym="+"] {Cat([r,Star(r)])}
          | [sym="?"] {Alt([r,Eps])} );
    eps = \(\) {Eps};
    c = \w|\.|\\. @ c {Char c};
    paren = \(<rexp>\);

Within the curly braces, the object that a nonterminal is supposed to represent is explicitly built. Instead of arbitrary C code or arbitrary ML code, placed within the curly braces is an expression in the CSG's own expression language.

10 Comments

It is often the case that merely allowing the possibility of increased expressiveness in a language actually increases the complexity of the running time in practice, even for instances that only use a simpler subset of the language. This paper shows an instance of three languages in a hierarchy of increasing expressiveness, each of which can be used as a replacement for the languages below it without sacrificing speed. That is, if we use the CSG engine to match a regular expression, it will have a running time of O(nm), and if we use it to match a CFG, it will have a running time of O(m²n) worst case. Notably, the extensions are all entirely natural, in that the algorithm does not have to explicitly probe the complexity of the grammar in order to optimize efficiency. Rather, any subsection of a CFG that is regular will be parsed like a regular expression, and any subsection of a CSG that is context-free will be parsed like a context-free grammar, by mere virtue of not using certain parse features.

This minimal revision of CFGs to include context sensitivity shows some performance vulnerabilities that need to be explored further. It may improve performance, for example, to put a preliminary cap on the sizes of the Earley sets, and only do a full parse if the first pass does not bear fruit. Reasoning about the transducer graph and doing optimizations that way may also address some performance issues. The fact remains, however, that parsing the general CSG is an NP-hard problem [1], and that beyond a certain point, solutions will inevitably have to involve (1) optimizing for more special subproblems and (2) using heuristics to help determine parsing order. This framework has already shown how to integrate the solutions to two special subproblems, and in this particular respect, CSGs are superior to and outperform their counterparts in the regex libraries of many scripting languages. Whether the Earley approach can be modified to match Python and Perl in the general context-sensitive cases as well remains to be seen.

References

[1] Russ Cox. Regular expression matching can be simple and fast, January 2007. http://swtch.com/~rsc/regexp/regexp1.html.

[2] Jay Earley. An efficient context-free parsing algorithm. Commun. ACM, 13(2):94-102, 1970.

[3] Trevor Jim and Yitzhak Mandelbaum. Efficient Earley parsing with regular right-hand sides. Proceedings of LDTA 2009, March 2009.

