
Top down parsing

•  Types of parsers:
•  Top down:
–  repeatedly rewrite the start symbol;
–  find a left-most derivation of the input string;
–  easy to implement;
–  not all context-free grammars are suitable.
•  Bottom up:
–  start with tokens and combine them to form interior nodes of the parse tree;
–  find a right-most derivation of the input string;
–  accept when the start symbol is reached;
–  it is more prevalent.

Top-down parsing with backtracking

•  S → cAd
•  A → ab | a
•  w = cad
[Figure: successive parse trees for w = cad. The root S is expanded with S → cAd; A is first tried with A → ab, which is a dead end, and then with A → a, which matches the input cad.]


Parsing trace

•  S → cAd
•  A → ab | a

Expansion   Remaining input   Action
S           cad               Try S → cAd
cAd         cad               Match c
Ad          ad                Try A → ab
abd         ad                Match a
bd          d                 Dead end, backtrack
ad          ad                Try A → a
ad          ad                Match a
d           d                 Match d
                              Success


SAB
Another example AaA|ε Top down vs. bottom up parsing
BbB|b
•  Bottom up approach
Expansion Remaining input Action aabb
•  Given the rules
S aabb Try SAB SAB –  SAB a abb
aAB aa bb
aaAB –  AaA|ε
AB aabb Try AaA aaεB –  BbB|b
aaε bb
aaA bb
aAB aabb match a aabB aA bb
aabb
•  How to parse aabb ? A bb
AB abb AaA Ab b
•  Topdown approach
Abb
aAB abb match a SAB
AbB
aAB
AB bb Aepsilon aaAB
AB
S
B bb BbB aaεB
aabB •  If read backwards, the derivation is
bB bb match b aabb right most
•  In both topdown and bottom up
B b Bb Note that it is a left most approaches, the input is scanned
derivation from left to right
b b match


Recursive descent parsing

•  Each method corresponds to a non-terminal

// S → AB
static boolean checkS() {
  int savedPointer = pointer;
  if (checkA() && checkB())
    return true;
  pointer = savedPointer;
  return false;
}

// A → aA | ε
static boolean checkA() {
  int savedPointer = pointer;
  if (nextToken().equals('a') && checkA())
    return true;
  pointer = savedPointer;
  return true;
}

// B → bB | b
static boolean checkB() {
  int savedPointer = pointer;
  if (nextToken().equals('b') && checkB())
    return true;
  pointer = savedPointer;
  if (nextToken().equals('b')) return true;
  pointer = savedPointer;
  return false;
}
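For reference, here is a minimal, self-contained sketch of how these methods might be driven end to end. The class name, the char-array token stream, the nextToken() stub and the parse() wrapper are illustrative assumptions, not part of the original slides.

// Backtracking recursive descent for S → AB, A → aA | ε, B → bB | b (sketch only).
public class BacktrackingParser {
    static char[] tokens;        // the input, one terminal per character (assumption)
    static int pointer = -1;     // index of the most recently consumed token

    // Return the next token, or '\0' when the input is exhausted.
    static char nextToken() {
        pointer++;
        return pointer < tokens.length ? tokens[pointer] : '\0';
    }

    static boolean checkS() {                 // S → AB
        int savedPointer = pointer;
        if (checkA() && checkB()) return true;
        pointer = savedPointer;
        return false;
    }

    static boolean checkA() {                 // A → aA | ε
        int savedPointer = pointer;
        if (nextToken() == 'a' && checkA()) return true;
        pointer = savedPointer;               // backtrack, then succeed with A → ε
        return true;
    }

    static boolean checkB() {                 // B → bB | b
        int savedPointer = pointer;
        if (nextToken() == 'b' && checkB()) return true;
        pointer = savedPointer;
        if (nextToken() == 'b') return true;
        pointer = savedPointer;
        return false;
    }

    // Accept only if S derives the whole input (no tokens left over).
    static boolean parse(String input) {
        tokens = input.toCharArray();
        pointer = -1;
        return checkS() && pointer == tokens.length - 1;
    }

    public static void main(String[] args) {
        System.out.println(parse("aabb"));    // true
        System.out.println(parse("aaba"));    // false
    }
}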


Left recursion

•  What if the grammar is changed to
–  S → AB
–  A → Aa | ε
–  B → b | bB
•  The corresponding methods are

static boolean checkA() {
  int savedPointer = pointer;
  if (checkA() && nextToken().equals('a'))
    return true;
  pointer = savedPointer;
  return true;
}

static boolean checkB() {
  int savedPointer = pointer;
  if (nextToken().equals('b'))
    return true;
  pointer = savedPointer;
  if (nextToken().equals('b') && checkB()) return true;
  pointer = savedPointer;
  return false;
}

(checkA() now calls itself before consuming any input, so the parser never terminates; this is the left recursion problem discussed on the later slides.)

Recursive descent parsing (a complete example)

•  Grammar
–  program → statement program | statement
–  statement → assignment
–  assignment → ID EQUAL expr
–  ...
•  Task:
–  Write a Java program that can judge whether a program is syntactically correct.
–  This time we will write the parser manually.
–  We can use the scanner.
•  How to do it?


RecursiveDescent.java outline

static int pointer = -1;
static ArrayList tokens = new ArrayList();
static Symbol nextToken() { … }
public static void main(String[] args) {
  Calc3Scanner scanner = new
      Calc3Scanner(new FileReader("calc2.input"));
  Symbol token;
  while ((token = scanner.yylex()).sym != Calc2Symbol.EOF)
    tokens.add(token);
  boolean legal = program() && nextToken() == null;
  System.out.println(legal);
}
static boolean program() throws Exception {…}
static boolean statement() throws Exception {…}
static boolean assignment() throws Exception {…}
static boolean expr() {…}

One of the methods

/** program --> statement program
    program --> statement
*/
static boolean program() throws Exception {
  int savedPointer = pointer;
  if (statement() && program()) return true;
  pointer = savedPointer;
  if (statement()) return true;
  pointer = savedPointer;
  return false;
}

Recursive Descent parsing

•  Recursive descent parsing is an easy, natural way to code top-down parsers.
–  All non-terminals become procedure calls that return true or false;
–  all terminals become matches against the input stream.
•  Example:

/** assignment --> ID = expr **/
static boolean assignment() throws Exception {
  int savePointer = pointer;
  if ( nextToken().sym == Calc2Symbol.ID
    && nextToken().sym == Calc2Symbol.EQUAL
    && expr())
    return true;
  pointer = savePointer;
  return false;
}

Summary of recursive descent parser

•  Simple enough that it can easily be constructed by hand;
•  Not efficient;
•  Limitations:

/** E --> E+T | T **/
static boolean expr() throws Exception {
  int savePointer = pointer;
  if ( expr()
    && nextToken().sym == Calc2Symbol.PLUS
    && term())
    return true;
  pointer = savePointer;
  if (term()) return true;
  pointer = savePointer;
  return false;
}

•  A recursive descent parser can enter into an infinite loop.


Left recursion

•  Definition
–  A grammar is left recursive if it has a nonterminal A such that there is a derivation A ⇒+ Aα
–  There are two kinds of left recursion:
•  direct left recursion: A ⇒ Aα
•  indirect left recursion: A ⇒+ Aα, but not A ⇒ Aα
•  Example:
E → E+T | T   is direct left recursive
•  Indirect left recursion:
S → Aa | b
A → Ac | Sd | ε
Is S left recursive?
S ⇒ Aa ⇒ Sda
so S ⇒+ Sda

Left recursion has to be removed for a recursive descent parser

Look at the previous example that works:
E → T+E | T

static boolean expr() throws Exception {
  int savePointer = pointer;
  if (term() && nextToken().sym == Calc2Symbol.PLUS && expr()) return true;
  pointer = savePointer;
  if (term()) return true;
  pointer = savePointer;
  return false;
}

What if the grammar is left recursive?
E → E+T | T

static boolean expr() throws Exception {
  int savePointer = pointer;
  if (expr() && nextToken().sym == Calc2Symbol.PLUS && term()) return true;
  pointer = savePointer;
  if (term()) return true;
  pointer = savePointer;
  return false;
}

There will be an infinite loop!



Remove left recursion

•  Direct left recursion
A → Aα | β
expanded form: A → β α α ... α
Left recursion removed:
A → βZ
Z → αZ | ε
•  Example:
E → E+T | T
expanded form: E → T +T +T ... +T
E → TZ
Z → +TZ | ε
•  In general, for a production
A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
where no βi begins with A, it can be replaced by:
A → β1A’ | β2A’ | ... | βnA’
A’ → α1A’ | α2A’ | ... | αmA’ | ε


Predictive parsing

•  A predictive parser is a special case of top-down parsing where no backtracking is required;
•  At each non-terminal node, the action to undertake is unambiguous;
STAT → if ...
     | while ...
     | for ...
•  Not general enough to handle real programming languages;
•  Grammar must be left factored;
IFSTAT → if EXPR then STAT
       | if EXPR then STAT else STAT
–  A predictive parser must choose the correct version of the IFSTAT before seeing the entire input
–  The solution is to factor out common terms (a code sketch follows below):
IFSTAT → if EXPR then STAT IFREST
IFREST → else STAT | ε
•  Consider another familiar example:
E → T+E | T

Left factoring

•  General method
For a production A → αβ1 | αβ2 | ... | αβn | γ
where γ represents all alternatives that do not begin with α, it can be replaced by
A → αB | γ
B → β1 | β2 | ... | βn
•  Example
E → T+E | T
can be transformed into:
E → T E’
E’ → +E | ε


Predictive parsing

•  The recursive descent parser is not efficient because of the backtracking and recursive calls.
•  A predictive parser does not require backtracking.
–  It is able to choose the production to apply solely on the basis of the next input symbol and the current nonterminal being processed.
•  To enable this, the grammar must be LL(1).
–  The first “L” means we scan the input from left to right;
–  the second “L” means we create a leftmost derivation;
–  the 1 means one input symbol of lookahead.

More on LL(1)

•  An LL(1) grammar has no left-recursive productions and has been left factored.
–  A left factored grammar with no left recursion may not be LL(1).
•  There are grammars that cannot be modified to become LL(1).
•  In such cases, another parsing technique must be employed, or special rules must be embedded into the predictive parser.


First() set: motivation

•  Navigating through two choices seemed simple enough; however, what happens when we have many alternatives on the right side?
–  statement → assignment | returnStatement | ifStatement | whileStatement | blockStatement
•  When implementing the statement() method, how are we going to determine which of the 5 options to match for any given input?
•  Remember, we are trying to do this without backtracking and with just one token of lookahead, so we have to be able to make an immediate decision with minimal information. This can be a challenge!
•  Fortunately, many production rules start with terminals, which can help in deciding which rule to use.
–  For example, if the input token is ‘while’, the program should know that the whileStatement rule will be used.

First(): motivating example

•  In many cases, rules start with non-terminals:
S → Ab | Bc
A → Df | CA
B → gA | e
C → dC | c
D → h | i
•  How to parse “gchfc”?
S ⇒ Bc ⇒ gAc ⇒ gCAc ⇒ gcAc ⇒ gcDfc ⇒ gchfc
[Figure: the tree of possible expansions of S. The branch S ⇒ Ab leads to Dfb (then hfb or ifb) or to CAb (then dCAb, cAb, ...); the branch S ⇒ Bc leads to gAc ... or to ec.]
•  If the next token is h, i, d, or c, alternative Ab should be selected.
•  If the next token is g or e, alternative Bc should be selected.
•  In this way, by looking at the next token, the parser is able to decide which rule to use without exhaustive searching.


First(): Definition

•  The First set of a sequence of symbols α, written First(α), is the set of terminals which can start the sequences of symbols derivable from α.
–  If α ⇒* aβ, then a is in First(α).
–  If α ⇒* ε, then ε is in First(α).
•  Given a production with a number of alternatives:
–  A → α1 | α2 | ...,
–  we can write a predictive parser only if all the sets First(αi) are disjoint.

First() algorithm

•  First(): compute the set of terminals that can begin a rule
1.  If a is a terminal, then First(a) is {a}.
2.  If A is a non-terminal and A → aα is a production, then add a to First(A).
    If A → ε is a production, add ε to First(A).
3.  If A → α1 α2 ... αm is a production, add First(α1) - ε to First(A).
    If α1 can derive ε, add First(α2) - ε to First(A).
    If both α1 and α2 derive ε, add First(α3) - ε to First(A), and so on.
    If α1 α2 ... αm ⇒* ε, add ε to First(A).
•  Example
S → Aa | b
A → bdZ | eZ
Z → cZ | adZ | ε
First(A) = {First(b), First(e)} = {b, e}   (by rule 2, rule 1)
First(Z) = {a, c, ε}                       (by rule 2, rule 1)
First(S) = {First(A), First(b)}            (by rule 3)
         = {First(A), b}                   (by rule 1)
         = {b, e, b} = {b, e}              (by rule 2)

A slightly modified example

S → Aa | b
A → bdZ | eZ | ε
Z → cZ | adZ | ε

First(S) = {First(A), First(b)}   (by rule 3)
         = {First(A), b}          (by rule 1)
         = {b, e, b} = {b, e}     (by rule 2) ?
First(S) = {First(A), First(a), b} = {a, b, e, ε} ?
Answer: First(S) = {a, b, e}
(Since A can now derive ε, First(a) = {a} must be added; but ε itself is not in First(S), because Aa never derives the empty string.)

Follow(): motivation

•  Consider
–  S ⇒* aaAb
–  where A → ε | aA.
•  When can A → ε be used? What is the next token expected?
•  In general, when A is nullable, what is the next token we expect to see?
–  A non-terminal A is nullable if ε is in First(A), or
–  A ⇒* ε
•  The next token would be the first token of the symbol following A in the sentence being parsed.

Follow()

•  Follow(): find the set of terminals that can immediately follow a non-terminal
1.  $ (end of input) is in Follow(S), where S is the start symbol;
2.  For productions of the form A → αBβ, everything in First(β) but ε is in Follow(B).
3.  For productions of the form A → αB, or A → αBβ where First(β) contains ε, everything in Follow(A) is in Follow(B).
–  aAb ⇒ aαBb
•  Example
S → Aa | b
A → bdZ | eZ
Z → cZ | adZ | ε
Follow(S) = {$}                  (by rule 1)
Follow(A) = {a}                  (by rule 2)
Follow(Z) = {Follow(A)} = {a}    (by rule 3)

Compute First() and Follow()

1.  E → TE’
2.  E’ → +TE’ | ε
3.  T → FT’
4.  T’ → *FT’ | ε
5.  F → (E) | id

First(E) = First(T) = First(F) = { (, id }
First(E’) = { +, ε }
First(T’) = { *, ε }

Follow(E) = { ), $ } = Follow(E’)
Follow(T) = Follow(T’) = { +, ), $ }    (First(E’) except ε, plus Follow(E))
Follow(F) = { *, +, ), $ }              (First(T’) except ε, plus Follow(T’))
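For concreteness, here is a small self-contained sketch of the fixed-point computation of First and Follow for the expression grammar above. The encoding of the grammar (productions as string arrays, "eps" as the marker for ε) and all identifiers are assumptions made for this example, not taken from the slides.

import java.util.*;

// Iterative (fixed-point) computation of First and Follow for
// E → TE' ; E' → +TE' | ε ; T → FT' ; T' → *FT' | ε ; F → (E) | id
public class FirstFollow {
    // Each production is {lhs, rhs-symbol, rhs-symbol, ...}; a one-element array means lhs → ε.
    static String[][] prods = {
        {"E", "T", "E'"},
        {"E'", "+", "T", "E'"}, {"E'"},
        {"T", "F", "T'"},
        {"T'", "*", "F", "T'"}, {"T'"},
        {"F", "(", "E", ")"}, {"F", "id"}
    };
    static Set<String> nonterminals = new HashSet<>(Arrays.asList("E", "E'", "T", "T'", "F"));
    static Map<String, Set<String>> first = new HashMap<>(), follow = new HashMap<>();

    static boolean isTerminal(String s) { return !nonterminals.contains(s); }

    // First of a single symbol: {X} for a terminal X, the set computed so far for a non-terminal.
    static Set<String> firstOf(String s) {
        return isTerminal(s) ? new HashSet<>(Collections.singleton(s)) : first.get(s);
    }

    public static void main(String[] args) {
        for (String n : nonterminals) { first.put(n, new HashSet<>()); follow.put(n, new HashSet<>()); }
        follow.get("E").add("$");                         // rule 1: $ is in Follow(start symbol)

        boolean changed = true;
        while (changed) {                                 // repeat until no set grows any more
            changed = false;
            for (String[] p : prods) {
                String lhs = p[0];
                // First sets (rules 2 and 3 of the First() algorithm)
                boolean allNullable = true;
                for (int i = 1; i < p.length && allNullable; i++) {
                    Set<String> fi = new HashSet<>(firstOf(p[i]));
                    allNullable = fi.remove("eps");       // continue only while symbols are nullable
                    changed |= first.get(lhs).addAll(fi);
                }
                if (allNullable) changed |= first.get(lhs).add("eps");   // also covers lhs → ε
                // Follow sets (rules 2 and 3 of Follow())
                for (int i = 1; i < p.length; i++) {
                    if (isTerminal(p[i])) continue;
                    boolean restNullable = true;
                    for (int j = i + 1; j < p.length && restNullable; j++) {
                        Set<String> fj = new HashSet<>(firstOf(p[j]));
                        restNullable = fj.remove("eps");
                        changed |= follow.get(p[i]).addAll(fj);          // First(β) minus ε
                    }
                    if (restNullable)
                        changed |= follow.get(p[i]).addAll(follow.get(lhs));   // Follow(lhs)
                }
            }
        }
        System.out.println("First:  " + first);   // e.g. First(E) = [(, id], First(E') = [+, eps]
        System.out.println("Follow: " + follow);  // e.g. Follow(F) = [*, +, ), $]
    }
}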

The use of First() and Follow()

•  If we want to expand S in this grammar:
S → A ... | B ...
A → a ...
B → b ... | a ...
•  If the next input character is b, should we rewrite S with A ... or B ...?
–  Since First(B) = {a, b} and First(A) = {a}, we know to rewrite S with B;
–  First and Follow give us information about the next characters expected in the grammar.
•  If the next input character is a, how do we rewrite S?
–  a is in both First(A) and First(B);
–  the grammar is not suitable for predictive parsing.

LL(1) parse table construction

•  Construct a parse table (PT) with one axis the set of terminals, and the other the set of non-terminals.
•  For all productions of the form A → α
–  add A → α to entry PT[A,b] for each token b in First(α);
–  add A → α to entry PT[A,b] for each token b in Follow(A) if First(α) contains ε;
–  add A → α to entry PT[A,$] if First(α) contains ε and Follow(A) contains $.
•  Example (a code sketch of these rules follows below):
S → Aa | b
A → bdZ | eZ
Z → cZ | adZ | ε

Production    First     Follow(LHS)
S → Aa        b, e      $
S → b         b         $
A → bdZ       b         a
A → eZ        e         a
Z → cZ        c         a
Z → adZ       a         a
Z → ε         ε         a

          a              b             c        d    e        $
S                        S→Aa, S→b                   S→Aa
A                        A→bdZ                       A→eZ
Z         Z→adZ, Z→ε                   Z→cZ

Construct the parsing table

•  If A → α, in which column do we place A → α in row A?
–  In the column of t, if t can start a string derived from α, i.e., t is in First(α).
–  What if α is empty? Put A → α in the column of t if t can follow an A, i.e., t is in Follow(A).

S → Aa | b
A → bdZ | eZ
Z → cZ | adZ | ε

          a              b             c        d    e        $
S                        S→Aa, S→b                   S→Aa
A                        A→bdZ                       A→eZ
Z         Z→adZ, Z→ε                   Z→cZ

Stack     Remaining input     Action
S$        bda$                Predict S → Aa or S → b? Suppose S → Aa is used
Aa$       bda$                Predict A → bdZ
bdZa$     bda$                match
dZa$      da$                 match
Za$       a$                  Predict Z → ε
a$        a$                  match
$         $                   accept

–  Note that the grammar is not LL(1), because more than one rule can be selected in some table entries.
–  The corresponding (leftmost) derivation:
S ⇒ Aa ⇒ bdZa ⇒ bdεa (= bda)
–  Note when the Z → ε rule is used.

LL(1) grammar

•  If the table entries are unique, the grammar is said to be LL(1):
–  Scan the input from Left to right;
–  performing a Leftmost derivation.
•  LL(1) grammars can have all their parser decisions made using one token of lookahead.
•  In principle, we can have LL(k) parsers with k > 1.
•  Properties of LL(1)
–  An ambiguous grammar is never LL(1);
–  A grammar with left recursion is never LL(1);
•  A grammar G is LL(1) iff whenever A → α | β are two distinct productions of G, the following conditions hold:
–  For no terminal a do both α and β derive strings beginning with a (i.e., their First sets are disjoint);
–  At most one of α and β can derive the empty string;
–  If β ⇒* ε, then α does not derive any string beginning with a terminal in Follow(A).

A complete example for LL(1) parsing

S → P
P → { D; C }
D → d, D | d
C → c, C | c

•  The above grammar corresponds loosely to the structure of programs (a program is a sequence of declarations followed by a sequence of commands).
•  Need to left factor the grammar first:
S → P
P → { D; C }
D → d D2
D2 → , D | ε
C → c C2
C2 → , C | ε

        First      Follow
S       {          $
P       {          $
D       d          ;
D2      , ε        ;
C       c          }
C2      , ε        }

Construct LL(1) parse table

          {            }         ;         ,         c         d         $
S         S→P
P         P→{D;C}
D                                                              D→dD2
D2                               D2→ε      D2→,D
C                                                    C→cC2
C2                     C2→ε                C2→,C

(First and Follow are as computed on the previous slide.)

LL(1) parse program

[Figure: the parse program reads the input (terminated by $), maintains a stack (initially holding S above $), and consults the parse table.]

•  Stack: contains the current rewrite of the start symbol.
•  Input: left to right scan of the input.
•  Parse table: contains the LL(k) parse table.

SP
LL(1) parsing algorithm Running LL(1) parser P { D; C}
D d D2
Stack Remaining Input Action D2 , D | ε
•  Use the stack, input, and parse table with the following S {d,d;c}$ predict SP$ Cc C2
rules: P$ {d,d;c}$ predict P{D;C} C2  , C | ε
{D;C}$ {d,d;c}$ match {
–  Accept: if the symbol on the top of the stack is $ and the input D;C}$ d,d;c}$ predict Dd D2
symbol is $, successful parse d D2 ; C } $ d,d;c}$ match d Derivation
–  match: if the symbol on the top of the stack is the same as the D2 ; C } $ ,d;c}$ predict D2,D SP$
next input token, pop the stack and advance the input ,D;C}$ ,d;c}$ match , { D;C}$
D;C}$ d;c}$ predict Dd D2 {d D2 ; C } $
–  predict: if the top of the stack is a non-terminal M and the next d D2 ; C } $ d;c}$ match d {d, D;C}$
input token is a, remove M from the stack and push entry PT[M,a] D2 ; C } $ ;c}$ predict D2ε {d,d D2; C } $
on to the stack in reverse order ε;C}$ ;c} $ match ; {d,d; C} $
–  Error: Anything else is a syntax error C}$ c}$ predict Cc C2 {d,d; c C2}$
c C2 } $ c}$ match c {d,d; c} $
C2 } $ }$ predict C2ε
}$ }$ match }
$ $ accept Note that it is leftmost
derivation
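The algorithm above fits in a short table-driven loop. The sketch below is self-contained and runs on the {D;C} grammar with input "{d,d;c}", but its encoding is an assumption made for the example: single characters stand for both terminals and non-terminals, with '2' and '3' standing in for D2 and C2, and "" encoding ε.

import java.util.*;

// Table-driven LL(1) parser for S → P, P → {D;C}, D → dD2, D2 → ,D | ε, C → cC2, C2 → ,C | ε
public class LL1Driver {
    public static void main(String[] args) {
        Set<Character> nonterminals = Set.of('S', 'P', 'D', '2', 'C', '3');

        // Parse table: pt.get(nonterminal).get(lookahead) = right-hand side to push.
        Map<Character, Map<Character, String>> pt = new HashMap<>();
        pt.put('S', Map.of('{', "P"));
        pt.put('P', Map.of('{', "{D;C}"));
        pt.put('D', Map.of('d', "d2"));
        pt.put('2', Map.of(',', ",D", ';', ""));   // D2 → ,D | ε  (ε on Follow(D2) = {;})
        pt.put('C', Map.of('c', "c3"));
        pt.put('3', Map.of(',', ",C", '}', ""));   // C2 → ,C | ε  (ε on Follow(C2) = {}})

        String input = "{d,d;c}$";
        Deque<Character> stack = new ArrayDeque<>();
        stack.push('$');
        stack.push('S');

        int pos = 0;
        while (true) {
            char top = stack.peek(), look = input.charAt(pos);
            if (top == '$' && look == '$') { System.out.println("accept"); return; }
            if (!nonterminals.contains(top)) {                     // terminal on top: must match
                if (top != look) { System.out.println("syntax error at position " + pos); return; }
                stack.pop(); pos++;                                // match: pop and advance
            } else {                                               // non-terminal on top: predict
                String rhs = pt.getOrDefault(top, Map.of()).get(look);
                if (rhs == null) { System.out.println("syntax error at position " + pos); return; }
                stack.pop();
                for (int i = rhs.length() - 1; i >= 0; i--)        // push the RHS in reverse order
                    stack.push(rhs.charAt(i));
                System.out.println("predict " + top + " -> " + (rhs.isEmpty() ? "ε" : rhs));
            }
        }
    }
}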


The expression example

1.  E → TE’
2.  E’ → +TE’ | ε
3.  T → FT’
4.  T’ → *FT’ | ε
5.  F → (E) | int

       +            *            (          )          int        $
E                                E→TE’                 E→TE’
E’     E’→+TE’                              E’→ε                  E’→ε
T                                T→FT’                 T→FT’
T’     T’→ε         T’→*FT’                 T’→ε                  T’→ε
F                                F→(E)                 F→int

Parsing int*int

Stack          Remaining input     Action
E$             int*int$            predict E → TE’
TE’$           int*int$            predict T → FT’
FT’E’$         int*int$            predict F → int
int T’E’$      int*int$            match int
T’E’$          *int$               predict T’ → *FT’
*FT’E’$        *int$               match *
FT’E’$         int$                predict F → int
int T’E’$      int$                match int
T’E’$          $                   predict T’ → ε
E’$            $                   predict E’ → ε
$              $                   match $. Success.

Parsing the wrong expression int*]+int

Stack          Remaining input     Action
E$             int*]+int$          predict E → TE’
TE’$           int*]+int$          predict T → FT’
FT’E’$         int*]+int$          predict F → int
int T’E’$      int*]+int$          match int
T’E’$          *]+int$             predict T’ → *FT’
*FT’E’$        *]+int$             match *
FT’E’$         ]+int$              error, skip ]
FT’E’$         +int$               PT[F, +] is sync, pop F
T’E’$          +int$               predict T’ → ε
E’$            +int$               predict E’ → +TE’
...            ...

It is easy for LL(1) parsers to skip errors.

       +                  *         (        )        int      ]        $
E                                   E→TE’             E→TE’    error
E’     E’→+TE’                               E’→ε              error    E’→ε
T                                   T→FT’             T→FT’    error
T’     T’→ε               T’→*FT’            T’→ε              error    T’→ε
F      sync (Follow(F))   sync      F→(E)    sync     F→int    error    sync

Error handling

•  There are three types of error processing:
–  report, recovery, repair
•  General principles:
–  Try to determine that an error has occurred as soon as possible. Waiting too long before declaring an error can cause the parser to lose the actual location of the error.
–  Error report: a suitable and comprehensive message should be reported. “Missing semicolon on line 36” is helpful; “unable to shift in state 425” is not.
–  Error recovery: after an error has occurred, the parser must pick a likely place to resume the parse. Rather than giving up at the first problem, a parser should always try to parse as much of the code as possible in order to find as many real errors as possible during a single run.
–  A parser should avoid cascading errors, which is when one error generates a lengthy sequence of spurious error messages.

Error report

•  Report that an error occurred, and what and where the error probably is;
–  Report expected vs. found tokens by filling the holes of the parse table with error messages.

       +            *            (          )                                    int        $
E                                E→TE’      “Err: int or ( expected in line …”   E→TE’
E’     E’→+TE’                              E’→ε                                            E’→ε
T                                T→FT’                                           T→FT’
T’     T’→ε         T’→*FT’                 T’→ε                                            T’→ε
F                                F→(E)                                           F→int

Error Recovery

•  Error recovery: a single error won’t stop the whole parse. Instead, the parser is able to resume parsing at a certain place after the error;
–  Give up on the current construct and restart later:
–  Delimiters help the parser synch back up
–  Skip until a matching ), end, ], or whatever is found
–  Use First and Follow to choose good synchronizing tokens
–  Example:  D → TYPE ID SEMI
•  duoble d;          use Follow(TYPE)
•  junk double d;     use First(D)
•  Error repair: patch up simple errors and continue.
–  Insert a missing token (;)
–  Add a declaration for an undeclared name


Types of errors

•  Types of errors
–  Lexical: @+2
•  Captured by JLex
–  Syntactical: x=3+*4;
•  Captured by javacup
–  Semantic: boolean x; x = 3+4;
•  Captured by the type checker; not implemented in parser generators
–  Logical: infinite loop
•  Not implemented in compilers

Summarize LL(1) parsing

•  Massage the grammar
–  Remove ambiguity
–  Remove left recursion
–  Left factoring
•  Construct the LL(1) parse table
–  First(), Follow()
–  Fill in the table entries
•  Run the LL(1) parse program

