Unit2-Compiler Design

CSE309N
Lexical Analyzer
The lexical analyzer reads the source program character by character to produce tokens.
Normally a lexical analyzer doesn't return a list of tokens in one shot; it returns a token each time the parser asks for one.
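[Figure: source program -> lexical analyzer -> token -> parser; the parser sends "get next token" requests back to the lexical analyzer, with the symbol table shown alongside]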
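The "get next token" interaction can be sketched as follows. This is a minimal illustration, not the course's implementation: the class name `Lexer`, the method `get_next_token`, the stand-in `parse` function, and the tiny token set are all assumptions made for the example.

```python
import re


class Lexer:
    """Hands out one token per call instead of tokenizing everything up front."""

    # Hypothetical, deliberately tiny token set used only for illustration.
    TOKEN_RE = re.compile(
        r"\s*(?:(?P<number>\d+)"
        r"|(?P<identifier>[A-Za-z][A-Za-z0-9]*)"
        r"|(?P<op>[-+*/=()]))"
    )

    def __init__(self, source):
        self.source = source
        self.pos = 0

    def get_next_token(self):
        """Return the next (token, lexeme) pair, or ("eof", "") at end of input."""
        if not self.source[self.pos:].strip():
            return ("eof", "")
        m = self.TOKEN_RE.match(self.source, self.pos)
        if m is None:
            raise SyntaxError(f"no token matches at position {self.pos}")
        self.pos = m.end()
        return (m.lastgroup, m.group(m.lastgroup))


def parse(lexer):
    """Stand-in for a parser: it pulls tokens one at a time until eof."""
    seen = []
    while True:
        token = lexer.get_next_token()
        if token[0] == "eof":
            return seen
        seen.append(token)
```

Calling `parse(Lexer("newval = 42"))` drives the lexer one request at a time; the lexer never builds the full token list on its own.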
Chapter 1
LEXICAL ANALYZER
Scan Input
Remove white space (WS), newlines (NL), ...
Identify Tokens
Create Symbol Table
Insert Tokens into ST
Generate Errors
Send Tokens to Parser
Token
A token represents a set of strings described by a pattern.
For example, the token identifier represents the set of strings that start with a letter and continue with letters and digits.
The actual string (e.g. newval) is called a lexeme.
Tokens: identifier, number, addop, ...
A) identifier
B) keywords
C) operators
D) special symbols
E) constants
Introducing Basic Terminology
What are Major Terms for Lexical Analysis?
TOKEN
A token represents a set of strings described by a pattern.
Examples include <identifier>, <number>, etc.
PATTERN
The set of strings is described by a rule, called a pattern, associated with the token.
LEXEME
A lexeme is a sequence of characters in the source program that matches the pattern for a token.
Example lexemes for the token identifier: x, count, name, etc.
Introducing Basic Terminology
Token      Sample Lexemes          Informal Description of Pattern
const      const                   const
if         if                      if
relation   <, <=, =, <>, >, >=     < or <= or = or <> or >= or >
id         pi, count, D2           letter followed by letters and digits
num        3.1416, 0, 6.02E23      any numeric constant
literal    "core dumped"           any characters between " and " except "

The token classifies the pattern.
Actual values are critical. This information is:
1. Stored in the symbol table
2. Returned to the parser
Attributes for Tokens

Example: E = M * C ** 2

LEXEME   <TOKEN, ATTRIBUTE value>
E        <id, pointer to symbol-table entry for E>
=        <assign_op, >
M        <id, pointer to symbol-table entry for M>
*        <mult_op, >
C        <id, pointer to symbol-table entry for C>
**       <exp_op, >
2        <num, integer value 2>
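The token and attribute pairs for E = M * C ** 2 can be reproduced with a short sketch. The names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical, and a list index stands in for the "pointer to symbol-table entry"; operators carry no attribute, which is shown here as `None`.

```python
import re

# Order matters: "**" must be tried before "*".
TOKEN_SPEC = [
    ("exp_op", r"\*\*"),
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("num", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table); characters no pattern matches are skipped."""
    symbol_table = []   # identifier lexemes; the list index plays the role of the pointer
    tokens = []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                              # white space yields no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))    # attribute is the integer value
        else:
            tokens.append((kind, None))           # operators need no attribute
    return tokens, symbol_table
```

For the example statement, the three identifiers end up at symbol-table positions 0, 1, and 2, and the `num` token carries the integer 2.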
Handling Lexical Errors
Error handling is very localized with respect to the input source.
For example: whil ( x = 0 ) do
generates no lexical errors in Pascal, because whil is a valid identifier; the misspelled keyword can only be detected later, by the parser.
In what situations do errors occur?
The lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input.
Panic-mode recovery
Delete successive characters from the remaining input until the analyzer can find a well-formed token.
This may confuse the parser.
Possible error-recovery actions:
Deleting an extra character
Inserting a missing character
Replacing an incorrect character by a correct character
Transposing two adjacent characters
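Panic-mode recovery can be sketched in a few lines. This is an illustration under assumptions: the token set in `TOKEN_RE` and the function name `scan_with_panic_mode` are invented for the example, and skipped characters are simply recorded with their starting position.

```python
import re

# Hypothetical token set: identifiers, unsigned integers, a few one-character operators.
TOKEN_RE = re.compile(r"(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<num>\d+)|(?P<op>[=()<>])")


def scan_with_panic_mode(source):
    """Return (tokens, errors); errors are (position, deleted_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(source, pos)
        if m:
            tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        else:
            # Panic mode: delete characters until some token pattern matches again.
            start = pos
            while (pos < len(source) and not source[pos].isspace()
                   and not TOKEN_RE.match(source, pos)):
                pos += 1
            errors.append((start, source[start:pos]))
    return tokens, errors
```

Running it on `whil ( x = 0 ) do` reports no errors, since `whil` matches the identifier pattern; running it on `x = @#5` deletes `@#` and resumes at `5`.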
Regular Expressions
A Formal Specification for Tokens
We use regular expressions to describe the tokens of a programming language.
Each regular expression denotes a language.
A language denoted by a regular expression is called a regular set.
CS416 Compiler Design
Regular Expressions (Rules)

Regular expressions over alphabet Σ

Reg. Expr      Language it denotes
ε              {ε}
a ∈ Σ          {a}
(r1) | (r2)    L(r1) ∪ L(r2)
(r1) (r2)      L(r1) L(r2)
(r)*           (L(r))*
(r)            L(r)

(r)+ = (r)(r)*
(r)? = (r) | ε
Regular Expressions (cont.)
We may remove parentheses by using precedence rules.
*               highest
concatenation   next
|               lowest
Ex: ab*|c means (a(b)*)|(c)
Ex: Σ = {0,1}
0|1 => {0,1}
(0|1)(0|1) => {00,01,10,11}
0* => {ε,0,00,000,0000,....}
(0|1)* => all strings of 0s and 1s, including the empty string
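These examples can be checked mechanically by enumerating short strings and testing each against the expression. The helper name `language` and its parameters are assumptions for this sketch; Python's `re` syntax happens to share the `|`, `*`, and concatenation precedence described above.

```python
import re
from itertools import product


def language(regex, alphabet="01", max_len=2):
    """All strings over `alphabet` of length <= max_len that `regex` matches in full."""
    pattern = re.compile(regex)
    return [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if pattern.fullmatch("".join(chars))
    ]
```

For instance, `language("ab*|c", alphabet="abc")` and `language("(a(b)*)|(c)", alphabet="abc")` return the same list, which is what the precedence rules predict. The empty string ε shows up as `""`.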
Regular Definitions
Writing a regular expression for some languages can be difficult, because their regular expressions can be quite complex. In those cases, we may use regular definitions.
We can give names to regular expressions, and we can use these names as symbols to define other regular expressions.
That is, we define regular expressions in terms of named regular expressions.
A regular definition is a sequence of definitions of the form:
d1 -> r1
d2 -> r2
...
dn -> rn
where each di is a definition name and each ri is a regular expression over symbols in Σ ∪ {d1,d2,...,di-1}.
Regular Definitions (cont.)
Ex: Identifiers in Pascal
letter -> A | B | ... | Z | a | b | ... | z
digit  -> 0 | 1 | ... | 9
id     -> letter (letter | digit)*
If we try to write the regular expression for identifiers without using regular definitions, that regular expression becomes complex:
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *
Ex: Unsigned numbers in Pascal
digit        -> 0 | 1 | ... | 9
digits       -> digit digit*    (or digits -> digit+)
opt-fraction -> ( . digits ) ?
opt-exponent -> ( E (+|-)? digits ) ?
unsigned-num -> digits opt-fraction opt-exponent
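A regular definition can be turned into a single regular expression by substituting each name with the definitions that precede it. The sketch below does this for the unsigned-number example; the `{name}` placeholder notation, the underscore spellings of the names, and the helper `expand` are assumptions made for the illustration.

```python
import re

# Each definition may refer only to names defined before it, written as {name}.
DEFINITIONS = [
    ("digit", r"[0-9]"),
    ("digits", r"{digit}+"),
    ("opt_fraction", r"(\.{digits})?"),
    ("opt_exponent", r"(E[+-]?{digits})?"),
    ("unsigned_num", r"{digits}{opt_fraction}{opt_exponent}"),
]


def expand(definitions):
    """Replace every {name} by the already expanded regex of that name."""
    expanded = {}
    for name, body in definitions:
        for earlier, regex in expanded.items():
            body = body.replace("{" + earlier + "}", "(?:" + regex + ")")
        expanded[name] = body
    return expanded


UNSIGNED_NUM = re.compile(expand(DEFINITIONS)["unsigned_num"])
```

`UNSIGNED_NUM.fullmatch("6.02E23")` succeeds, while `UNSIGNED_NUM.fullmatch("1.")` does not, because the optional fraction requires digits after the point.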
Recognition of tokens
The next step is to formalize the patterns:

digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>
We also need to handle white space:

ws -> (blank | tab | newline)+
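The patterns above translate almost directly into a small scanner. This is a sketch under assumptions: the names `PATTERNS`, `MASTER`, and `tokenize` are hypothetical, the keyword tokens are spelled in upper case to keep them apart from the lexemes, and a word boundary (`\b`) keeps a keyword from matching the front of a longer identifier.

```python
import re

PATTERNS = [
    ("ws", r"[ \t\n]+"),
    ("IF", r"if\b"),
    ("THEN", r"then\b"),
    ("ELSE", r"else\b"),
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("relop", r"<=|>=|<>|<|>|="),      # two-character operators are listed first
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))


def tokenize(source):
    """Return the (token, lexeme) pairs of `source`; ws matches are dropped."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"no pattern matches at position {pos}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With this ordering, `ifz` is scanned as a single `id`, while `if` followed by a blank is the keyword.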
CHAPTER 3 LEXICAL ANALYSIS

Section 3 Recognition of Tokens
1 Task of recognition of token in a lexical analyzer
Regular expression   Token   Attribute-value
if                   if
id                   id      Pointer to table entry
<                    relop   LT
Transition diagrams
Transition diagram for relop
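The diagram itself is not reproduced in this text, so the following is a sketch of how such a diagram is commonly simulated, not a transcription of the slide. The function name `relop`, the attribute names other than LT, and the convention of returning the next input position are assumptions.

```python
def relop(source, pos=0):
    """Simulate a relop transition diagram starting at source[pos].

    Returns (("relop", attribute), next_pos), or None if no relop starts here.
    """
    def peek(i):
        return source[i] if i < len(source) else ""   # "" plays the role of "other"

    c = peek(pos)
    if c == "<":
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2
        return ("relop", "LT"), pos + 1   # retract: the lookahead character is not consumed
    if c == "=":
        return ("relop", "EQ"), pos + 1
    if c == ">":
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2
        return ("relop", "GT"), pos + 1   # retract, as for "<"
    return None
```

The two "retract" returns correspond to accepting states that are reached on a character that does not belong to the lexeme, so the input position moves past only one character.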
Transition diagrams (cont.)
Transition diagram for reserved words and identifiers
Transition diagram for unsigned numbers
Transition diagram for whitespace
Buffer Pairs
The lexical analyzer may need to look ahead several characters beyond the lexeme for a pattern before a match can be announced.
A function such as ungetc can push lookahead characters back into the input stream.
A large amount of time can be consumed moving characters.
Special buffering technique
Use a buffer divided into two N-character halves.
N = number of characters in one disk block
One system read command reads N characters.
Fewer than N characters read => eof marks the end of the input.
Buffer Pairs (2)

Two pointers to the input buffer are maintained.
The string of characters between the pointers is the current lexeme.
Once the next lexeme is determined, the forward pointer is set to the character at its right end.
[Figure: input buffer "E ... eof" with two pointers: lexeme_beginning, and forward, which scans ahead to find a pattern match]
Comments and white space can be treated as patterns that yield no token.
Code to advance forward pointer

if forward at end of first half then begin
    reload second half ;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half ;
    move forward to beginning of first half
end
else forward := forward + 1;
Pitfalls
1. This buffering scheme works quite well most of the time, but the amount of lookahead is limited.
2. Limited lookahead makes it impossible to recognize tokens in situations where the distance the forward pointer must travel is more than the length of the buffer.
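The pointer-advancing code can be sketched in runnable form as follows. The class `BufferPair`, the helper `read_all`, the tiny default N = 4, and the use of an empty string as the eof marker are assumptions for the illustration; the lexeme_beginning pointer is left out to keep the sketch short.

```python
import io


class BufferPair:
    """Two N-character halves that are reloaded alternately from a stream."""

    def __init__(self, stream, n=4):
        self.stream, self.n = stream, n
        self.buf = [""] * (2 * n)
        self._reload(0)                  # fill the first half
        self.forward = 0

    def _reload(self, start):
        chunk = self.stream.read(self.n)             # one read of up to N characters
        for i in range(self.n):
            # Fewer than N characters read: the rest of the half is marked eof ("").
            self.buf[start + i] = chunk[i] if i < len(chunk) else ""

    def current(self):
        return self.buf[self.forward]

    def advance(self):
        if self.forward == self.n - 1:               # forward at end of first half
            self._reload(self.n)
            self.forward += 1
        elif self.forward == 2 * self.n - 1:         # forward at end of second half
            self._reload(0)
            self.forward = 0                         # move to beginning of first half
        else:
            self.forward += 1


def read_all(text, n=4):
    """Walk the forward pointer over the whole input and collect the characters."""
    bp = BufferPair(io.StringIO(text), n)
    out = []
    while bp.current() != "":
        out.append(bp.current())
        bp.advance()
    return "".join(out)
```

Note that every call to `advance` performs two position tests before the common case is reached.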
Algorithm: Buffered I/O with Sentinels

[Figure: two-half input buffer "E ... eof | C ... eof ... eof", with an eof sentinel closing each half; lexeme_beginning marks the start of the current token and forward scans ahead to find a pattern match]

forward := forward + 1 ;
if forward points at eof then begin
    if forward at end of first half then begin
        reload second half ;                        (block I/O)
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half ;                         (block I/O)
        move forward to beginning of first half
    end
    else /* eof within buffer signifying end of input */
        terminate lexical analysis                  (2nd eof: no more input!)
end

The algorithm performs block I/Os. We can still have get and ungetchar; now these work on real memory buffers!
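A runnable sketch of the sentinel scheme is given below. It assumes a sentinel character (`"\0"`) that never occurs in the source text; the names `SentinelBuffer`, `next_char`, and `read_all` and the small default N = 4 are hypothetical, and the lexeme_beginning pointer is omitted.

```python
import io

EOF = "\0"   # sentinel; assumed never to occur in the source text


class SentinelBuffer:
    """Halves occupy [0..n-1] and [n+1..2n]; slots n and 2n+1 always hold EOF."""

    def __init__(self, stream, n=4):
        self.stream, self.n = stream, n
        self.buf = [EOF] * (2 * n + 2)
        self._reload(0)
        self.forward = -1                # next_char() advances before it reads

    def _reload(self, start):
        chunk = self.stream.read(self.n)             # block I/O
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Return the next character, or None once the input is exhausted."""
        self.forward += 1
        if self.buf[self.forward] == EOF:            # the only test on the common path
            if self.forward == self.n:               # sentinel ending the first half
                self._reload(self.n + 1)
                self.forward += 1
            elif self.forward == 2 * self.n + 1:     # sentinel ending the second half
                self._reload(0)
                self.forward = 0
            if self.buf[self.forward] == EOF:        # eof within a half: end of input
                self.forward -= 1                    # stay put so later calls also return None
                return None
        return self.buf[self.forward]


def read_all(text, n=4):
    """Collect characters until next_char() reports the end of the input."""
    sb = SentinelBuffer(io.StringIO(text), n)
    return "".join(iter(sb.next_char, None))
```

Compared with the version without sentinels, the common case costs a single comparison per character; the position tests run only when an eof is actually seen.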
Formalizing Token Definition

EXAMPLES AND OTHER CONCEPTS:
Suppose S is the string banana.
Prefix: ban, banana
Suffix: ana, banana
Substring: nan, ban, ana, banana
Subsequence: bnan, nn
A proper prefix, suffix, or substring cannot be all of S.
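These notions are easy to check on banana with a few helper functions. The function names are invented for this sketch, and `is_proper` encodes only the condition stated above (the part must differ from S).

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]


def is_substring(t, s):
    """True if t occurs in s as a contiguous block."""
    return t in s


def is_subsequence(t, s):
    """True if t can be obtained from s by deleting zero or more characters."""
    remaining = iter(s)
    return all(ch in remaining for ch in t)   # each `in` consumes the iterator


def is_proper(t, s):
    """A proper prefix, suffix, or substring cannot be all of s."""
    return t != s
```

For example, `is_subsequence("bnan", "banana")` holds even though `is_substring("bnan", "banana")` does not, since a subsequence need not be contiguous.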
Algorithm: Buffered I/O with Sentinels (cont.)
[Figure: the input buffer with contents "E  M * eof | C * * 2 eof"; lexeme_beginning marks the current token and forward scans ahead to find a pattern match]
