Lexical Analyzer
[Figure: the parser issues "get next token" to the lexical analyzer, which returns a token; both use the symbol table.]
Chapter 1
CSE309N
LEXICAL ANALYZER
Scan Input
Identify Tokens
Generate Errors
Token
A) identifiers
B) keywords
C) operators
D) special symbols
E) constants
TOKEN
PATTERN
The set of strings that match a token is described by a rule
called the pattern associated with that token.
LEXEME
A lexeme is a sequence of characters in the source
program that matches the pattern for a token.
Identifiers: x, count, name, etc.
[Figure: a token classifies a pattern and carries info; the examples shown are const and if.]
ATTRIBUTE VALUES
Statement: E = M * C ** 2
Tokens and attribute values, one per lexeme:
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
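As a rough illustration of how such <token, attribute> pairs could be produced, here is a minimal Python sketch. The token names follow the slide. The regular expressions, the `tokenize` function, and the use of a dictionary index in place of a symbol-table pointer are assumptions made for this example only.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("exp_op", r"\*\*"),      # listed before mult_op so "**" is not read as "*" "*"
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("skip", r"\s+"),         # white space yields no token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source, symbol_table):
    """Return (token, attribute) pairs; an id carries its symbol-table index."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue
        if kind == "id":
            # The index stands in for "pointer to symbol-table entry".
            attr = symbol_table.setdefault(lexeme, len(symbol_table))
        elif kind == "num":
            attr = int(lexeme)
        else:
            attr = None           # operators need no attribute value
        tokens.append((kind, attr))
    return tokens

table = {}
result = tokenize("E = M * C ** 2", table)
```

Here `result` holds seven pairs in the same order as the list above, and `table` maps E, M, and C to the indices 0, 1, and 2.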
Regular Expressions
Regular expression      Language it denotes
ε                       {ε}
a                       {a}
(r1) | (r2)             L(r1) ∪ L(r2)
(r1) (r2)               L(r1) L(r2)
(r)*                    (L(r))*
(r)                     L(r)

Shorthands:
(r)+ = (r)(r)*
(r)? = (r) | ε
Precedence of operators:
*               highest
concatenation   next
|               lowest
Ex: ab*|c means (a(b)*)|(c)
Examples with Σ = {0,1}:
0|1 => {0,1}
(0|1)(0|1) => {00,01,10,11}
0* => {ε,0,00,000,0000,....}
(0|1)* => all strings of 0s and 1s, including the empty string
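The examples above can be checked mechanically. The sketch below uses Python's `re` module, whose `|`, `*`, and grouping follow the same precedence for these patterns. The helper `language` is hypothetical: it enumerates every string up to a small length bound and keeps those that the whole pattern matches, so an infinite language such as 0* is only listed up to that bound. The empty string ε appears as "".

```python
import re
from itertools import product

def language(pattern, alphabet="01", max_len=3):
    """List the strings over `alphabet`, up to max_len long, matched in full by pattern."""
    matches = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                matches.append(s)
    return matches

one_symbol = language("0|1")
two_symbols = language("(0|1)(0|1)")
zeros = language("0*")
everything = language("(0|1)*", max_len=2)
# Precedence check: ab*|c is read as (a(b)*)|(c)
grouped = language("ab*|c", alphabet="abc")
```

With the default bound of 3, `zeros` contains "", "0", "00", and "000", and `grouped` contains "a", "c", "ab", and "abb" but not "ac", which is what the grouping (a(b)*)|(c) predicts.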
Regular Definitions
digit → 0 | 1 | ... | 9
digits → digit digit*   (or: digits → digit+)
opt-fraction → ( . digits )?
opt-exponent → ( E ( + | - )? digits )?
unsigned-num → digits opt-fraction opt-exponent
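A regular definition names sub-expressions and builds larger ones from them. The following sketch mirrors that layering with Python string variables, one per definition. The variable names and the `is_unsigned_num` helper are illustrative choices, not part of the slides.

```python
import re

# One variable per regular definition, built bottom-up.
digit = r"[0-9]"
digits = rf"{digit}+"
opt_fraction = rf"(\.{digits})?"
opt_exponent = rf"(E[+-]?{digits})?"
unsigned_num = re.compile(rf"{digits}{opt_fraction}{opt_exponent}")

def is_unsigned_num(text):
    """True when the whole text matches unsigned-num."""
    return unsigned_num.fullmatch(text) is not None
```

Under these definitions `is_unsigned_num("6.02E+23")` is true, while ".5" and "1." are rejected, because digits must appear on both sides of the decimal point.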
Recognition of tokens
Regular expression    Token    Attribute-value
if                    if       -
id                    id       Pointer to table entry
<                     relop    LT
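The three rows of this table can be recognized with a small hand-written scanner. The sketch below is one possible shape, under these assumptions: an id is a letter followed by letters or digits, the keyword if is separated from ids after the lexeme has been read, and a dictionary index stands in for the pointer to the table entry. The function names `next_token` and `scan` are hypothetical.

```python
def next_token(text, pos, symbol_table):
    """Return (token, attribute, new_pos) for the lexeme starting at pos."""
    while pos < len(text) and text[pos].isspace():
        pos += 1                          # white space yields no token
    if pos == len(text):
        return ("eof", None, pos)
    ch = text[pos]
    if ch == "<":
        return ("relop", "LT", pos + 1)
    if ch.isalpha():
        end = pos
        while end < len(text) and text[end].isalnum():
            end += 1                      # letter (letter | digit)*
        lexeme = text[pos:end]
        if lexeme == "if":
            return ("if", None, end)      # keyword: no attribute value
        index = symbol_table.setdefault(lexeme, len(symbol_table))
        return ("id", index, end)
    raise ValueError(f"unexpected character {ch!r} at position {pos}")

def scan(text):
    """Collect all (token, attribute) pairs and the symbol table built on the way."""
    symbol_table, tokens, pos = {}, [], 0
    while True:
        token, attr, pos = next_token(text, pos, symbol_table)
        if token == "eof":
            return tokens, symbol_table
        tokens.append((token, attr))
```

For example, `scan("if x < y")` yields the tokens if, id, relop with attribute LT, and id, and a lexeme such as iffy is classified as an id because the keyword test compares the whole lexeme.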
Transition diagrams
Buffer Pairs
Special buffering technique: use a buffer divided into two N-character halves.
[Figure: the buffer, with eof marking the end of the input.]
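[Figure: two pointers into the buffer. "lexeme beginning" marks the start of the current lexeme; "forward" scans ahead to find a pattern match; eof marks the end of the input.]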
Comments and white space can be treated as
patterns that yield no token
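A short sketch of this idea: white space and comments are matched like any other pattern, but the scanner emits nothing for them. The `/* ... */` comment syntax and the `WORD` pattern, which stands in for all real tokens, are assumptions chosen only to keep the example small.

```python
import re

# Assumed comment syntax: /* ... */ (the slides do not fix one).
SKIP = re.compile(r"\s+|/\*.*?\*/", re.DOTALL)
WORD = re.compile(r"[^\s/]+")   # hypothetical stand-in for every real token pattern

def lexemes(source):
    """Return the lexemes of source, dropping white space and comments."""
    out, pos = [], 0
    while pos < len(source):
        m = SKIP.match(source, pos)
        if m:
            pos = m.end()       # a pattern matched, but no token is produced
            continue
        m = WORD.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        out.append(m.group())
        pos = m.end()
    return out
```

The parser never sees the skipped text, so `lexemes("x /* note */  y")` and `lexemes("x y")` return the same list.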
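Advancing the forward pointer:
if forward at end of first half then begin
    reload second half ;
    forward : = forward + 1
end
else if forward at end of second half then begin
    reload first half ;
    move forward to beginning of first half
end
else forward : = forward + 1;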
Pitfalls
1. This buffering scheme works quite well most of the
time, but the amount of lookahead it allows is limited.
2. Limited lookahead makes it impossible to recognize
tokens when the distance the forward pointer must
travel is greater than the length of the buffer.
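[Figure: the buffer pair with an eof marker at the end of each half, holding ... M * eof | C * * 2 eof, and a further eof after the last input character. "lexeme beginning" marks the start of the current lexeme; "forward" scans ahead to find a pattern match.]

forward : = forward + 1 ;
if forward = eof then begin            / * forward points at an eof marker * /
    if forward at end of first half then begin
        reload second half ;           / * block I/O * /
        forward : = forward + 1
    end
    else if forward at end of second half then begin
        reload first half ;            / * block I/O * /
        move forward to beginning of first half
    end
    else / * eof within buffer signifying end of input * /
        terminate lexical analysis     / * 2nd eof: no more input ! * /
end

The algorithm performs the I/Os. We can still have get and ungetchar; now these work on real memory buffers!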
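The pseudocode above can be turned into a small runnable sketch. Everything named here is an assumption for illustration: the class `BufferPair`, the use of the NUL character as the eof marker (so the input must not contain NUL), the tiny default half size, and the `reloads` counter that makes the block I/O visible. It also assumes that a short read from the stream means the input has ended, which holds for in-memory streams such as `io.StringIO`.

```python
import io

EOF = "\0"  # assumed eof marker; the input must not contain this character

class BufferPair:
    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Layout: [first half: n chars][EOF][second half: n chars][EOF]
        self.buf = [EOF] * (2 * n + 2)
        self.reloads = 0
        self._reload(0)
        self.forward = 0                      # index of the next character to examine

    def _reload(self, start):
        """Fill the half that begins at index start with one block read."""
        block = self.stream.read(self.n)
        self.reloads += 1
        for i, ch in enumerate(block):
            self.buf[start + i] = ch
        if len(block) < self.n:
            self.buf[start + len(block)] = EOF    # eof inside a half: end of input

    def next_char(self):
        """Return the character at forward and advance; None at end of input."""
        ch = self.buf[self.forward]
        if ch == EOF:                             # the only test in the common case
            if self.forward == self.n:            # end of first half
                self._reload(self.n + 1)
                self.forward += 1
            elif self.forward == 2 * self.n + 1:  # end of second half
                self._reload(0)
                self.forward = 0
            else:
                return None                       # eof within buffer: no more input
            ch = self.buf[self.forward]
            if ch == EOF:
                return None                       # the reloaded half was empty
        self.forward += 1
        return ch

def read_all(text, n=4):
    """Drain text through a BufferPair; return the characters and the reload count."""
    bp = BufferPair(io.StringIO(text), n)
    chars = []
    while True:
        ch = bp.next_char()
        if ch is None:
            return "".join(chars), bp.reloads
        chars.append(ch)
```

The design point is the one the slide makes: for an ordinary character the scanner performs a single comparison against eof, and the checks for the end of a half run only when that comparison succeeds.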
S is the string banana
Prefix: ban, banana
Suffix: ana, banana
Substring: banana itself, or any contiguous part of it, such as nan
Subsequence: bnan, nn
A proper prefix, suffix, or substring cannot be all of S
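These terms are easy to check on the string banana. The helper functions below are illustrative, not part of the slides. "Proper" follows the slide's wording and excludes only S itself.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All contiguous pieces of s, as a set."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def is_subsequence(t, s):
    """True when t can be formed by deleting zero or more characters of s."""
    remaining = iter(s)
    return all(ch in remaining for ch in t)   # each search resumes where the last stopped

def proper(parts, s):
    """Drop s itself, following the slide's definition of 'proper'."""
    return [p for p in parts if p != s]

S = "banana"
```

Note that bnan is a subsequence of banana but not a substring, since its characters are not contiguous in S.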