
CSE309N

Lexical Analyzer

The lexical analyzer reads the source program character by character to produce tokens.

Normally a lexical analyzer does not return a list of tokens in one shot; it returns a token each time the parser asks for one.

[Figure: source program -> lexical analyzer -> (token) -> parser; the parser sends "get next token" requests to the lexical analyzer; both use the symbol table]
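The on-demand interaction with the parser can be sketched as follows. This is a minimal illustration, not the course's implementation: the class name `Lexer`, the method `get_next_token`, and the tiny token set (identifiers, integers, single-character operators) are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny token set: identifiers, integers, one-character operators.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<num>[0-9]+)|(?P<op>[-+*/=()]))"
)

class Lexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0  # how far the input has been consumed so far

    def get_next_token(self):
        """Scan just far enough to return one (token, lexeme) pair, or None at end of input."""
        if not self.source[self.pos:].strip():
            return None
        m = TOKEN_RE.match(self.source, self.pos)
        if m is None:
            raise SyntaxError(f"no token matches at position {self.pos}")
        self.pos = m.end()
        return (m.lastgroup, m.group(m.lastgroup))

lexer = Lexer("newval = oldval + 12")
first = lexer.get_next_token()   # the parser asks for one token ...
second = lexer.get_next_token()  # ... and later for the next one
```

Only the part of the input needed for the requested token is scanned on each call, so the rest of the source stays unread until the parser asks again.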

Chapter 1

LEXICAL ANALYZER

Scan Input

Remove white space (WS), newlines (NL), ...

Identify Tokens

Create Symbol Table

Insert Tokens into ST

Generate Errors

Send Tokens to Parser


Token

A token represents a set of strings described by a pattern.

For example, the token identifier represents the set of strings that start with a letter and continue with letters and digits.
The actual string (for example, newval) is called a lexeme.
Tokens: identifier, number, addop, ...

A) identifier
B) keywords
C) operators
D) special symbols
E) constants


Introducing Basic Terminology

What are Major Terms for Lexical Analysis?

TOKEN

A token represents a set of strings described by a pattern.
Examples include <identifier>, <number>, etc.

PATTERN
The set of strings is described by a rule called a
pattern associated with the token.

LEXEME
A lexeme is a sequence of characters in the source
program that matches the pattern for a token.
Example lexemes for identifiers: x, count, name, etc.

Introducing Basic Terminology

Token      Sample Lexemes         Informal Description of Pattern
const      const                  const
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or >= or >
id         pi, count, D2          letter followed by letters and digits
num        3.1416, 0, 6.02E23     any numeric constant
literal    "core dumped"          any characters between " and " except "

The token classifies the pattern.

Actual values are critical. This information is:
1. Stored in the symbol table
2. Returned to the parser


Attributes for Tokens


Each token is passed on as a pair <token, attribute value>.

Example: E = M * C ** 2

<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
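The pairs above can be reproduced with a short sketch. It assumes the token names from the slide; the symbol table is modelled as a plain list, and the index of an identifier in that list stands in for the "pointer to symbol-table entry". The function name `tokenize` is hypothetical.

```python
import re

# Token names follow the slide. The order matters: exp_op must come before mult_op,
# otherwise "**" would be read as two "*" tokens.
PATTERNS = [
    ("exp_op", r"\*\*"),
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("num", r"[0-9]+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in PATTERNS))

def tokenize(text):
    """Return (tokens, symbol_table); characters matching no pattern are silently skipped."""
    symbol_table, tokens = [], []
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue  # white space yields no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))  # index stands in for a pointer
        elif kind == "num":
            tokens.append((kind, int(lexeme)))  # attribute is the integer value
        else:
            tokens.append((kind, None))  # operators carry no attribute
    return tokens, symbol_table

tokens, symbol_table = tokenize("E = M * C ** 2")
```

Here `symbol_table` ends up holding the lexemes E, M and C, while the token stream only refers to them by position.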


Handling Lexical Errors

Error handling is very localized with respect to the input source.

For example: whil ( x = 0 ) do
generates no lexical errors in Pascal, because whil is a valid identifier; the misspelled keyword can only be caught later, by the parser.

In what situations do errors occur?

The lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input.

Panic-mode recovery

Delete successive characters from the remaining input until the analyzer can find a well-formed token.
This may confuse the parser.

Possible error-recovery actions:

Deleting an extra character
Inserting a missing character
Replacing an incorrect character by a correct character
Transposing two adjacent characters
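Panic-mode recovery can be sketched as below. The toy token set and the function name `scan_with_panic_mode` are assumptions for illustration; the point is only the inner loop that discards characters until some token pattern matches again.

```python
import re

# Assumed toy token set; any other character (for example '@' or '#') is a lexical error.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|[-+*/=()<>]")

def scan_with_panic_mode(text):
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Panic mode: delete characters until a well-formed token can start here.
            start = pos
            while (pos < len(text) and not text[pos].isspace()
                   and not TOKEN_RE.match(text, pos)):
                pos += 1
            errors.append((start, text[start:pos]))  # position and deleted characters
    return tokens, errors

tokens_ok, errors_ok = scan_with_panic_mode("whil ( x = 0 ) do")
tokens_bad, errors_bad = scan_with_panic_mode("x = @#1")
```

The first call reports no errors, since whil is scanned as an ordinary identifier; the second deletes the two offending characters and resumes at the digit.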

Regular Expressions

A Formal Specification for Tokens

We use regular expressions to describe the tokens of a programming language.

Each regular expression denotes a language.

A language denoted by a regular expression is called a regular set.

CS416 Compiler Design


Regular Expressions (Rules)


Regular expressions over alphabet Σ

Reg. Expr      Language it denotes
ε              {ε}
a ∈ Σ          {a}
(r1) | (r2)    L(r1) ∪ L(r2)
(r1) (r2)      L(r1) L(r2)
(r)*           (L(r))*
(r)            L(r)

(r)+ = (r)(r)*
(r)? = (r) | ε


Regular Expressions (cont.)

We may remove parentheses by using precedence rules.

*                highest
concatenation    next
|                lowest

Ex: ab*|c means (a(b)*)|(c)

Ex: Σ = {0,1}
0|1 => {0,1}
(0|1)(0|1) => {00,01,10,11}
0* => {ε,0,00,000,0000,....}
(0|1)* => all strings of 0s and 1s, including the empty string
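These examples can be checked mechanically. The sketch below uses Python's `re` module, whose operators *, concatenation and | follow the same precedence order; the helper name `language` is an assumption, and it only enumerates strings up to a given length, so infinite languages such as 0* are shown truncated.

```python
import re
from itertools import product

def language(regex, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len that the regex matches completely."""
    return [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if re.fullmatch(regex, "".join(chars))
    ]

# ab*|c groups as (a(b)*)|(c): both spellings accept the same strings.
unparenthesized = language(r"ab*|c", "abc", 3)
parenthesized = language(r"(a(b)*)|(c)", "abc", 3)

two_bits = language(r"(0|1)(0|1)", "01", 2)
zeros = language(r"0*", "01", 3)  # the empty string stands for ε
```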

Regular Definitions

Writing a regular expression for some languages can be difficult, because the expression can be quite complex. In those cases, we may use regular definitions.
We can give names to regular expressions, and we can use these names as symbols to define other regular expressions.
A regular definition is a sequence of definitions of the form:

d1 -> r1
d2 -> r2
...
dn -> rn

where each di is a definition name and each ri is a regular expression over the symbols in Σ ∪ {d1,d2,...,di-1}.


Regular Definitions (cont.)

Ex: Identifiers in Pascal

letter -> A | B | ... | Z | a | b | ... | z
digit -> 0 | 1 | ... | 9
id -> letter (letter | digit)*

If we try to write the regular expression for identifiers without using regular definitions, it becomes complex:
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*

Ex: Unsigned numbers in Pascal

digit -> 0 | 1 | ... | 9
digits -> digit digit*    (or: digits -> digit+)
opt-fraction -> ( . digits )?
opt-exponent -> ( E (+|-)? digits )?
unsigned-num -> digits opt-fraction opt-exponent
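The two regular definitions above translate almost line by line into named Python strings, each built only from names defined earlier. This is a sketch: the variable names mirror the definitions, and the helper functions `is_unsigned_num` and `is_identifier` are assumptions added for the example.

```python
import re

# Each name is defined only in terms of earlier names, as in a regular definition.
digit = r"[0-9]"
digits = rf"{digit}+"
opt_fraction = rf"(\.{digits})?"
opt_exponent = rf"(E[+-]?{digits})?"
unsigned_num = rf"{digits}{opt_fraction}{opt_exponent}"

letter = r"[A-Za-z]"
ident = rf"{letter}({letter}|{digit})*"

def is_unsigned_num(s):
    return re.fullmatch(unsigned_num, s) is not None

def is_identifier(s):
    return re.fullmatch(ident, s) is not None
```

For instance, `is_unsigned_num("6.02E23")` holds, while `is_unsigned_num("1.")` does not, because opt-fraction requires at least one digit after the point.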


Recognition of tokens

The next step is to formalize the patterns:

digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>

We also need to handle whitespace:

ws -> (blank | tab | newline)+
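A scanner built from these patterns might look like the sketch below. Two choices in it are assumptions rather than part of the slides: reserved words are matched as identifiers and then looked up in a keyword set (so that ifx stays an identifier), and two-character relational operators are tried before one-character ones. The function name `tokens` is hypothetical.

```python
import re

KEYWORDS = {"if", "then", "else"}

SPEC = [
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("relop", r"<=|>=|<>|<|>|="),  # longer operators first
    ("ws", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in SPEC))

def tokens(text):
    pos, out = 0, []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue  # white space yields no token
        if kind == "id" and lexeme in KEYWORDS:
            kind = lexeme  # reserved words are recognized by lookup
        out.append((kind, lexeme))
    return out

result = tokens("if count >= 10 then x = 6.02E23 else ifx <> y")
```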


CHAPTER 3 LEXICAL ANALYSIS


Section 3 Recognition of Tokens

1 Task of recognition of token in a lexical analyzer

Regular expression    Token    Attribute-value
if                    if       -
id                    id       Pointer to table entry
<                     relop    LT


Transition diagrams

Transition diagram for relop
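The diagram itself did not survive in this copy, so the following sketch assumes the usual textbook shape: a start state that branches on <, = and >, one character of lookahead after < and >, and a retract step when that lookahead belongs to the next token. Only the attribute name LT appears in the slides; LE, NE, EQ, GE and GT are assumed by analogy, and the function name `relop` is hypothetical.

```python
def relop(text, pos=0):
    """Return (attribute, next_pos) for the relational operator at text[pos], or None."""
    ch = text[pos] if pos < len(text) else ""
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if ch == "<":
        if nxt == "=":
            return ("LE", pos + 2)
        if nxt == ">":
            return ("NE", pos + 2)
        return ("LT", pos + 1)  # other character: retract, it starts the next token
    if ch == "=":
        return ("EQ", pos + 1)
    if ch == ">":
        if nxt == "=":
            return ("GE", pos + 2)
        return ("GT", pos + 1)  # other character: retract
    return None  # not a relational operator; another diagram must be tried
```

Returning `pos + 1` instead of `pos + 2` is how the sketch models the retract step: the lookahead character is examined but not consumed.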


Transition diagrams (cont.)

Transition diagram for reserved words and identifiers


Transition diagrams (cont.)

Transition diagram for unsigned numbers


Transition diagrams (cont.)

Transition diagram for whitespace


Buffer Pairs

The lexical analyzer needs to look ahead several characters beyond the lexeme for a pattern before a match can be announced.
Use a function such as ungetc to push lookahead characters back into the input stream.
A large amount of time can be consumed moving characters.

Special Buffering Technique
Use a buffer divided into two N-character halves.

N = number of characters in one disk block
One system command reads N characters.
Fewer than N characters remaining => an eof marker is placed after the last character read.

Buffer Pairs (2)


Two pointers to the input buffer are maintained.
The string of characters between the pointers is the current lexeme.
Once the next lexeme is determined, the forward pointer is set to the character at its right end.

[Figure: input buffer ending in eof, with two pointers: lexeme beginning, and forward (scans ahead to find a pattern match)]

Comments and white space can be treated as patterns that yield no token.

Code to advance forward pointer

if forward at end of first half then begin
    reload second half ;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half ;
    move forward to beginning of first half
end
else forward := forward + 1 ;

Pitfalls
1. This buffering scheme works quite well most of the time, but the amount of lookahead is limited.
2. Limited lookahead makes it impossible to recognize tokens in situations where the distance the forward pointer must travel is more than the length of the buffer.
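A runnable sketch of the two-half buffer is given below. It is an illustration under several assumptions: N is set to 4 only so that reloads are visible on a short input, an empty string plays the role of eof, and the class name `BufferPair`, its `reloads` counter and the helper `read_all` are hypothetical. The lexeme-beginning pointer is left out to keep the example short.

```python
import io

N = 4  # assumed tiny half size; a real scanner would use the size of one disk block

class BufferPair:
    def __init__(self, stream):
        self.stream = stream
        self.buf = [""] * (2 * N)
        self.reloads = 0
        self._reload(0)
        self.forward = 0  # index of the next character to examine

    def _reload(self, start):
        chunk = self.stream.read(N)  # one read command fetches up to N characters
        self.reloads += 1
        for i in range(N):
            self.buf[start + i] = chunk[i] if i < len(chunk) else ""  # "" marks eof

    def next_char(self):
        ch = self.buf[self.forward]
        if ch == "":
            return ""  # end of input
        if self.forward == N - 1:        # forward at end of first half
            self._reload(N)
            self.forward += 1
        elif self.forward == 2 * N - 1:  # forward at end of second half
            self._reload(0)
            self.forward = 0             # move forward to beginning of first half
        else:
            self.forward += 1
        return ch

def read_all(text):
    bp = BufferPair(io.StringIO(text))
    out = []
    ch = bp.next_char()
    while ch != "":
        out.append(ch)
        ch = bp.next_char()
    return "".join(out), bp.reloads

text_back, reload_count = read_all("E = M * C ** 2")
```

Every call performs two position tests before it can simply advance, which is the per-character cost that sentinels are meant to reduce.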

Algorithm: Buffered I/O with Sentinels


[Figure: buffer "E = M * eof | C * * 2 eof ... eof" with a sentinel eof at the end of each half; the pointers lexeme beginning and forward (scans ahead to find a pattern match) delimit the current token]

forward := forward + 1 ;
if forward = eof then begin
    if forward at end of first half then begin
        reload second half ;                          (block I/O)
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half ;                           (block I/O)
        move forward to beginning of first half
    end
    else /* eof within buffer signifying end of input */
        terminate lexical analysis                    (2nd eof: no more input!)
end

The algorithm performs the I/Os. We can still have getchar and ungetchar; now these work on real memory buffers!
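The sentinel scheme can be sketched as follows. The assumptions are stated in the code: N is kept at 4 for the example, the NUL character serves as the eof sentinel (so it must not occur in the source text), and the names `SentinelBuffer` and `read_all` are hypothetical.

```python
import io

N = 4        # assumed half size, kept tiny for the example
EOF = "\0"   # assumed sentinel: a character that cannot occur in the source text

class SentinelBuffer:
    def __init__(self, stream):
        self.stream = stream
        # Layout: [0..N-1] first half, [N] sentinel, [N+1..2N] second half, [2N+1] sentinel.
        self.buf = [EOF] * (2 * N + 2)
        self._reload(0)
        self.forward = -1

    def _reload(self, start):
        chunk = self.stream.read(N)  # block I/O
        for i in range(N):
            # A short read leaves EOF inside the half, marking the real end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        self.forward += 1
        if self.buf[self.forward] == EOF:      # the only test on the common path
            if self.forward == N:              # sentinel at end of first half
                self._reload(N + 1)
                self.forward += 1
            elif self.forward == 2 * N + 1:    # sentinel at end of second half
                self._reload(0)
                self.forward = 0
            else:                              # eof within buffer: end of input
                self.forward -= 1              # stay put so later calls also return EOF
                return EOF
        return self.buf[self.forward]

def read_all(text):
    sb = SentinelBuffer(io.StringIO(text))
    out = []
    ch = sb.next_char()
    while ch != EOF:
        out.append(ch)
        ch = sb.next_char()
    return "".join(out)

text_back = read_all("E = M * C ** 2")
```

For ordinary characters a single comparison against the sentinel is made per call; the end-of-half tests run only when that comparison succeeds.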


Formalizing Token Definition


EXAMPLES AND OTHER CONCEPTS:
Suppose: S is the string banana

Prefix: ban, banana
Suffix: ana, banana
Substring: nan, ban, ana, banana
Subsequence: bnan, nn

A proper prefix, suffix, or substring cannot be all of S.
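The four notions can be stated as small predicates, which makes the difference between a substring (contiguous) and a subsequence (order preserved, gaps allowed) concrete. The function names are assumptions for the example, and `is_proper` follows the slide's wording: the candidate must not be all of S.

```python
def is_prefix(x, s):
    return s.startswith(x)

def is_suffix(x, s):
    return s.endswith(x)

def is_substring(x, s):
    return x in s  # contiguous run of characters of s

def is_subsequence(x, s):
    # Consume s left to right; each character of x must be found after the previous one.
    remaining = iter(s)
    return all(ch in remaining for ch in x)

def is_proper(x, s):
    # Proper prefix, suffix, or substring: x is not all of S.
    return x != s

S = "banana"
```

With these definitions, bnan is a subsequence of banana but not a substring, and banana is a prefix of itself but not a proper one.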
