
The Chinese University of Hong Kong CSC 3130: Automata theory and formal languages

Fall 2008

Parsers for programming languages

Andrej Bogdanov
http://www.cse.cuhk.edu.hk/~andrejb/csc3130

CFG of the Java programming language


Identifier:
    IDENTIFIER

QualifiedIdentifier:
    Identifier { . Identifier }

Literal:
    IntegerLiteral
    FloatingPointLiteral
    CharacterLiteral
    StringLiteral
    BooleanLiteral
    NullLiteral

Expression:
    Expression1 [AssignmentOperator Expression1]

AssignmentOperator:
    =
    +=
    -=
    *=
    /=
    &=
    |=

from http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html#52996

Parsing Java programs


class Point2d {
    /* The X and Y coordinates of the point--instance variables */
    private double x;
    private double y;
    private boolean debug;    // A trick to help with debugging

    public Point2d (double px, double py) {    // Constructor
        x = px;
        y = py;
        debug = false;    // turn off debugging
    }

    public Point2d () {    // Default constructor
        this (0.0, 0.0);    // Invokes 2 parameter Point2D constructor
    }

    // Note that a this() invocation must be the BEGINNING of
    // statement body of constructor

    public Point2d (Point2d pt) {    // Another constructor
        x = pt.getX();
        y = pt.getY();
    }
    ...

Simple Java program: about 1000 symbols

Parsing algorithms
How long would it take to parse this?
exhaustive algorithm: about 10^80 years (longer than the life of the universe)
CYK algorithm: about 1 week!

Can we parse faster?


Not by much! CYK is essentially the fastest practical general-purpose parsing algorithm known

Another way of thinking

Scientist:
Find an algorithm that can parse strings in any grammar

Engineer:
Design your grammar so it has a very fast parsing algorithm

An example
Stack    Input      Action
         abaabbc    shift
a        baabbc     shift
ab       aabbc      reduce (5)
A        aabbc      reduce (3)
T        aabbc      shift
Ta       abbc       shift
Taa      bbc        shift
Taab     bc         reduce (5)
TaA      bc         reduce (3)
TaT      bc         shift
TaTb     c          reduce (4)
TA       c          reduce (2)
T        c          shift
Tc                  reduce (1)
S

S → Tc (1)
T → TA (2) | A (3)
A → aTb (4) | ab (5)

input: abaabbc

[Figure: parse tree of abaabbc]
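The trace in this example can be replayed mechanically. The following is a minimal Python sketch and not part of the original slides: the names PRODUCTIONS, replay, ACTIONS and TRACE are made up for illustration, and the stack is simply kept as a string. It applies the listed actions one by one and checks that every reduce really finds the right-hand side of its production on top of the stack.

```python
# Grammar of the example: production number -> (left-hand side, right-hand side)
PRODUCTIONS = {
    1: ("S", "Tc"),
    2: ("T", "TA"),
    3: ("T", "A"),
    4: ("A", "aTb"),
    5: ("A", "ab"),
}


def replay(word, actions):
    """Apply actions ("shift" or a production number) to the word and
    return the list of (stack, remaining input) pairs seen along the way."""
    stack, rest = "", word
    trace = [(stack, rest)]
    for action in actions:
        if action == "shift":
            if not rest:
                raise ValueError("nothing left to shift")
            stack, rest = stack + rest[0], rest[1:]
        else:
            lhs, rhs = PRODUCTIONS[action]
            if not stack.endswith(rhs):
                raise ValueError(f"{rhs} is not on top of the stack {stack}")
            # Replace the right-hand side on top of the stack by the variable.
            stack = stack[:len(stack) - len(rhs)] + lhs
        trace.append((stack, rest))
    return trace


# The action column of the example, with reduce (k) written as the number k.
ACTIONS = ["shift", "shift", 5, 3, "shift", "shift", "shift",
           5, 3, "shift", 4, 2, "shift", 1]
TRACE = replay("abaabbc", ACTIONS)

for stack, rest in TRACE:
    print(f"{stack:<6} {rest}")
```

The sketch only verifies a given sequence of actions; choosing the actions is the job of the parsing algorithm developed in the rest of the lecture.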

Items
S → Tc (1)     items: S → •Tc, S → T•c, S → Tc•
T → TA (2)     items: T → •TA, T → T•A, T → TA•
T → A (3)      items: T → •A, T → A•
A → aTb (4)    items: A → •aTb, A → a•Tb, A → aT•b, A → aTb•
A → ab (5)     items: A → •ab, A → a•b, A → ab•

Stack    Input      Action
         abaabbc    shift
a        baabbc     shift
ab       aabbc      reduce (5)
A        aabbc      reduce (3)
T        aabbc      shift
Ta       abbc       shift

Idea of parsing algorithm: Try to match complete items to top of stack
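To make the notion of an item concrete, here is a small Python sketch that lists all items of the example grammar. It is an illustration only: the representation of an item as a triple (lhs, rhs, dot) and the names GRAMMAR, items, show and is_complete are assumptions of this sketch, not notation from the slides.

```python
# Productions (1)-(5) of the example grammar as (left-hand side, right-hand side).
GRAMMAR = [("S", "Tc"), ("T", "TA"), ("T", "A"), ("A", "aTb"), ("A", "ab")]


def items(grammar):
    """Every production with a dot placed at each possible position."""
    return [(lhs, rhs, dot)
            for lhs, rhs in grammar
            for dot in range(len(rhs) + 1)]


def show(item):
    """Render the triple (lhs, rhs, dot) in the dotted notation."""
    lhs, rhs, dot = item
    return f"{lhs} → {rhs[:dot]}•{rhs[dot:]}"


def is_complete(item):
    """An item is complete when the dot is at the end of the right-hand side."""
    lhs, rhs, dot = item
    return dot == len(rhs)


ALL_ITEMS = items(GRAMMAR)
for item in ALL_ITEMS:
    print(show(item), "(complete)" if is_complete(item) else "")
```

A production with a right-hand side of length n contributes n + 1 items, exactly one of which is complete.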

Some terminology
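handle: the string on top of the stack that a reduce step replaces, e.g. ab just before reduce (5)
valid items after shifting a: A → a•Tb, A → a•b
valid items when the stack is T: T → T•A, S → T•c, A → •aTb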

Outline of LR(0) parsing algorithm


As the string is being read, it is pushed on a stack
Algorithm keeps track of all valid items
Algorithm can perform two actions:
    shift: no complete item is valid
    reduce: there is one valid item, and it is complete

Running the algorithm


Stack    Input    Action    Valid items
         aabb     S         A → •aAb, A → •ab
a        abb      S         A → a•Ab, A → a•b, A → •aAb, A → •ab
aa       bb       S         A → a•Ab, A → a•b, A → •aAb, A → •ab
aab      b        R         A → ab•
aA       b        S         A → aA•b
aAb               R         A → aAb•
A

A → aAb | ab

input: aabb


How to update viable items


Initial set of valid items
    S → •α   for every production S → α

Updating valid items on shift b
    A → α•bβ   is updated to   A → αb•β
    A → α•Xβ   disappears if X ≠ b

After these updates, for every valid item A → α•Cβ and production C → δ, we also add C → •δ as a valid item

notation
    a, b: terminals
    A, B: variables
    X, Y: mixed symbols
    α, β: mixed strings

How to update viable items


Updating valid items on reduce β to B
First, we backtrack to the set of valid items from before β was read
Then, we apply the same rules as for shift B (as if B were a terminal)
    A → α•Bβ   is updated to   A → αB•β
    A → α•Xβ   disappears if X ≠ B
    C → •δ   is added for every valid item A → α•Cβ and production C → δ
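The update rules for shift and reduce can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecture's own code: items are triples (lhs, rhs, dot), the grammar is A → aAb | ab from the running example, and the names closure, initial, advance and history are hypothetical. Backtracking on a reduce is done by keeping one set of valid items per stack position.

```python
GRAMMAR = [("A", "aAb"), ("A", "ab")]
START = "A"


def closure(item_set):
    """For every item A → α•Cβ, add C → •δ for each production C → δ."""
    result = set(item_set)
    pending = list(item_set)
    while pending:
        lhs, rhs, dot = pending.pop()
        if dot < len(rhs):
            for lhs2, rhs2 in GRAMMAR:
                new_item = (lhs2, rhs2, 0)
                if lhs2 == rhs[dot] and new_item not in result:
                    result.add(new_item)
                    pending.append(new_item)
    return frozenset(result)


def initial():
    """S → •α for every production S → α of the start variable."""
    return closure({(lhs, rhs, 0) for lhs, rhs in GRAMMAR if lhs == START})


def advance(item_set, symbol):
    """A → α•Xβ becomes A → αX•β if X is the symbol, and disappears otherwise."""
    moved = {(lhs, rhs, dot + 1) for lhs, rhs, dot in item_set
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)


# history[i] holds the valid items after i symbols have been pushed on the stack.
history = [initial()]
for symbol in "aab":                      # three shifts
    history.append(advance(history[-1], symbol))
before_reduce = history[-1]               # only A → ab• is valid here

# Reduce ab to A: backtrack over the two symbols of the handle,
# then treat A as if it were a shifted terminal.
del history[-2:]
history.append(advance(history[-1], "A"))
after_reduce = history[-1]                # only A → aA•b is valid here
```

Keeping the whole history is the simplest way to backtrack; the later slides replace it by DFA states stored on the stack.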

Viable item updates by NFA


States of NFA will be items (plus a start state q0)
For every item S → •α we have a transition
    q0 --ε--> S → •α
For every item A → α•Xβ we have a transition
    A → α•Xβ --X--> A → αX•β
For every item A → α•Cβ and production C → δ we have a transition
    A → α•Cβ --ε--> C → •δ

Example
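A → aAb | ab

q0 --ε--> A → •aAb
q0 --ε--> A → •ab
A → •aAb --a--> A → a•Ab
A → a•Ab --A--> A → aA•b
A → aA•b --b--> A → aAb•
A → •ab --a--> A → a•b
A → a•b --b--> A → ab•
A → a•Ab --ε--> A → •aAb
A → a•Ab --ε--> A → •ab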

Convert NFA to DFA


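State 1: A → •aAb, A → •ab
State 2: A → a•Ab, A → a•b, A → •aAb, A → •ab
State 3: A → ab•
State 4: A → aA•b
State 5: A → aAb•

Transitions: 1 --a--> 2, 2 --a--> 2, 2 --b--> 3, 2 --A--> 4, 4 --b--> 5
All other transitions lead to the state "die"

states correspond to sets of valid items
transitions are labeled by variables / terminals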
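The conversion can be reproduced with the standard subset construction. The Python sketch below is a hedged illustration: the function names nfa_moves, eps_closure and build_dfa are made up, None stands for an ε-transition, and terminals are explored before variables only so that the state numbers come out as in the example above. The start state of the resulting DFA also contains q0, which is not listed as an item.

```python
GRAMMAR = [("A", "aAb"), ("A", "ab")]
START = "A"
VARIABLES = {lhs for lhs, _ in GRAMMAR}


def nfa_moves(state):
    """Yield (label, next state) pairs of the item NFA; label None means ε."""
    if state == "q0":
        for lhs, rhs in GRAMMAR:
            if lhs == START:
                yield None, (lhs, rhs, 0)
        return
    lhs, rhs, dot = state
    if dot < len(rhs):
        yield rhs[dot], (lhs, rhs, dot + 1)       # move the dot over one symbol
        for lhs2, rhs2 in GRAMMAR:
            if lhs2 == rhs[dot]:
                yield None, (lhs2, rhs2, 0)       # ε-move into a production


def eps_closure(states):
    """All NFA states reachable from the given ones by ε-transitions."""
    seen, todo = set(states), list(states)
    while todo:
        for label, nxt in nfa_moves(todo.pop()):
            if label is None and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return frozenset(seen)


def build_dfa():
    """Subset construction; missing entries of delta mean the dead state."""
    start = eps_closure({"q0"})
    numbering = {start: 1}
    delta = {}
    queue = [start]
    while queue:
        current = queue.pop(0)
        labels = {lab for s in current for lab, _ in nfa_moves(s) if lab is not None}
        # Terminals first, then variables, to match the numbering in the example.
        for lab in sorted(labels, key=lambda x: (x in VARIABLES, x)):
            target = eps_closure({n for s in current
                                  for l, n in nfa_moves(s) if l == lab})
            if target not in numbering:
                numbering[target] = len(numbering) + 1
                queue.append(target)
            delta[numbering[current], lab] = numbering[target]
    return numbering, delta


DFA_STATES, DELTA = build_dfa()
for (state, label), target in sorted(DELTA.items()):
    print(f"{state} --{label}--> {target}")
```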

Attempt at parsing with DFA


Stack    Input    Action    DFA state
         aabb     S         1: A → •aAb, A → •ab
a        abb      S         2: A → a•Ab, A → a•b, A → •aAb, A → •ab
aa       bb       S         2: A → a•Ab, A → a•b, A → •aAb, A → •ab
aab      b        R         3: A → ab•
aA       b                  ?

A → aAb | ab

input: aabb

Remember the state in stack!


Stack      Input    Action    DFA state
1          aabb     S         1: A → •aAb, A → •ab
1a2        abb      S         2: A → a•Ab, A → a•b, A → •aAb, A → •ab
1a2a2      bb       S         2: A → a•Ab, A → a•b, A → •aAb, A → •ab
1a2a2b3    b        R         3: A → ab•
1a2A4      b        S         4: A → aA•b
1a2A4b5             R         5: A → aAb•
1A

A → aAb | ab

input: aabb
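Putting the pieces together, a parser for this one grammar can be sketched as follows. This is a minimal illustration, not a general LR(0) parser generator: the tables DELTA and REDUCE are copied by hand from the five-state DFA for A → aAb | ab, and the names parse and START_VARIABLE are assumptions of the sketch. The stack alternates states and symbols, as in the trace above.

```python
# DFA for A → aAb | ab; missing entries lead to the dead state.
DELTA = {(1, "a"): 2, (2, "a"): 2, (2, "b"): 3, (2, "A"): 4, (4, "b"): 5}
# States whose only valid item is complete, with the production to reduce by.
REDUCE = {3: ("A", "ab"), 5: ("A", "aAb")}
START_VARIABLE = "A"


def parse(word):
    """Return the list of actions ("S" or "R"), or None if the word is rejected."""
    stack = [1]              # alternates state, symbol, state, ...
    pos = 0                  # position of the next unread input symbol
    actions = []
    while True:
        state = stack[-1]
        if state in REDUCE:
            lhs, rhs = REDUCE[state]
            # Pop the handle together with the states remembered above it.
            del stack[len(stack) - 2 * len(rhs):]
            actions.append("R")
            if stack == [1] and lhs == START_VARIABLE and pos == len(word):
                return actions           # whole input reduced to the start variable
            target = DELTA.get((stack[-1], lhs))
            if target is None:
                return None
            stack += [lhs, target]
        else:
            if pos == len(word):
                return None              # input exhausted but nothing to reduce
            target = DELTA.get((state, word[pos]))
            if target is None:
                return None              # the DFA moved to the dead state
            stack += [word[pos], target]
            pos += 1
            actions.append("S")


print(parse("aabb"))
print(parse("aab"))
```

Because the state below the handle is still on the stack after popping, the DFA does not have to re-read the stack after a reduce, which is what went wrong in the first attempt.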

LR(0) grammars and deterministic PDAs


The parsing procedure can be implemented by a deterministic pushdown automaton
A PDA is deterministic if in every state there is at most one possible transition for every input symbol and pop symbol, including ε

Example: PDA for w#w^R is deterministic, but PDA for ww^R is not
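The deterministic PDA for w#w^R can be simulated directly, since at every step there is only one possible move. The following Python sketch assumes the alphabet {a, b} and uses a hypothetical function name accepts_w_hash_wr; it is meant as an illustration of the idea rather than a formal PDA definition.

```python
def accepts_w_hash_wr(word, alphabet="ab"):
    """Simulate a deterministic PDA for the language of strings w#w^R."""
    stack = []
    state = "push"
    for ch in word:
        if state == "push":
            if ch == "#":
                state = "pop"        # the marker tells us where the middle is
            elif ch in alphabet:
                stack.append(ch)
            else:
                return False         # symbol outside the alphabet
        else:
            # Second half: every symbol must match the symbol popped off the stack.
            if not stack or stack.pop() != ch:
                return False
    return state == "pop" and not stack


print(accepts_w_hash_wr("ab#ba"))
print(accepts_w_hash_wr("ab#ab"))
```

For ww^R there is no marker, so a PDA has to guess where the middle of the input is, and that guess is exactly the nondeterminism.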

LR(0) grammars and deterministic PDAs


Not every PDA can be made deterministic
Since PDAs recognize exactly the CFLs, the LR(0) parsing algorithm must fail for some CFLs!
When does the LR(0) parsing algorithm fail?

Outline of LR(0) parsing algorithm


Algorithm can perform two actions:
    shift (S): no complete item is valid
    reduce (R): there is one valid item, and it is complete

What if:
    some valid items complete, some not: S / R conflict
    more than one valid complete item: R / R conflict
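The case analysis can be written as a small classifier over a set of valid items. This Python sketch is an illustration under assumptions: items are triples (lhs, rhs, dot), the function name classify is hypothetical, and the two conflicting item sets in the usage lines come from made-up grammars (A → a | ab, and a grammar with productions A → a and B → a), not from the slides.

```python
def classify(item_set):
    """Decide the action for a set of valid items, following the rules above."""
    complete = [it for it in item_set if it[2] == len(it[1])]
    incomplete = [it for it in item_set if it[2] < len(it[1])]
    if not complete:
        return "shift"
    if len(complete) > 1:
        # Reported first; such a set may contain incomplete items as well.
        return "R/R conflict"
    if incomplete:
        return "S/R conflict"
    return "reduce"


# State 2 of the DFA for A → aAb | ab: no complete item, so shift.
print(classify({("A", "aAb", 1), ("A", "ab", 1), ("A", "aAb", 0), ("A", "ab", 0)}))
# State 3 of the same DFA: one valid item, and it is complete, so reduce.
print(classify({("A", "ab", 2)}))
# Hypothetical grammar A → a | ab after reading a: A → a• and A → a•b.
print(classify({("A", "a", 1), ("A", "ab", 1)}))
# Hypothetical grammar with A → a and B → a after reading a.
print(classify({("A", "a", 1), ("B", "a", 1)}))
```

A grammar can be parsed by the LR(0) algorithm only if no state of its DFA is classified as a conflict.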

Hierarchy of context-free grammars


[Figure: nested classes of grammars, outermost first]
context-free grammars: parse using CYK algorithm (slow)
    LR(∞) grammars
        LR(1) grammars
            LR(0) grammars: parse using LR(0) algorithm
to be continued
Languages marked in the figure: java, perl, python