
Lexical Analysis

Lexical analysis recognizes the vocabulary of the programming language and transforms a string of characters into a string of words, or tokens.
Lexical analysis discards white space and comments between the tokens.
A lexical analyzer (or scanner) is the program that performs lexical analysis.
Contents

Scanners

Tokens

Regular expressions
Finite automata
Flex - a scanner generator
Scanners

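[Diagram: characters -> Scanner -> token -> Parser. The Parser requests the next token from the Scanner. A Symbol Table sits beneath the Scanner and the Parser.]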
Tokens

A token is a sequence of characters that can be treated as a unit in the grammar of a programming language.
A programming language classifies tokens into a finite set of token types.

Type     Examples
ID       foo  i  n
NUM      73  13
IF       if
COMMA    ,
Semantic Values of Tokens

Semantic values are used to distinguish different tokens within a token type:
<ID, foo>, <ID, i>, <ID, n>
<NUM, 73>, <NUM, 13>
<IF, >
<COMMA, >
Token types affect syntax analysis, and semantic values affect semantic analysis.
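The <type, value> pairs above can be modelled directly as a small record. The following is a minimal sketch; the class name `Token` and its field names are illustrative assumptions, not part of the slides.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    """A token as a <type, semantic value> pair (names are hypothetical)."""
    type: str                 # token type, e.g. "ID" or "NUM"
    value: Optional[object]   # semantic value; None when the type alone suffices


tokens = [
    Token("ID", "foo"),
    Token("NUM", 73),
    Token("IF", None),
    Token("COMMA", None),
]

# Syntax analysis looks only at the token types ...
types = [t.type for t in tokens]
# ... while semantic analysis uses the values attached to them.
values = [t.value for t in tokens if t.value is not None]

print(types)
print(values)
```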
Scanner Generators

[Diagram: a scanner definition in a metalanguage -> Scanner Generator -> Scanner. A program in the programming language -> Scanner -> token types & semantic values.]
Languages

A language is a set of strings.
A string is a finite sequence of symbols taken from a finite alphabet.
The C language is the (infinite) set of all strings that constitute legal C programs.
The language of C reserved words is the (finite) set of all alphabetic strings that cannot be used as identifiers in C programs.
Each token type is a language.
Regular Expressions (RE)

Language allows us to use finite descriptions to specify (possibly infinite) sets.
RE is the metalanguage used to define the token types of a programming language.
Regular Expressions

ε is a RE denoting L = {ε}
If a ∈ alphabet, then a is a RE denoting L = {a}
Suppose r and s are REs denoting L(r) and L(s)
alternation: (r) | (s) is a RE denoting L(r) ∪ L(s)
concatenation: (r)(s) is a RE denoting L(r)L(s)
repetition: (r)* is a RE denoting (L(r))*
(r) is a RE denoting L(r)
Examples

a | b             {a, b}
(a | b)(a | b)    {aa, ab, ba, bb}
a*                {ε, a, aa, aaa, ...}
(a | b)*          the set of all strings of a's and b's
a | a*b           the set containing the string a and all strings consisting of zero or more a's followed by a b
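The example languages above can be checked mechanically. This sketch uses Python's `re` module as a stand-in for the formal notation (its syntax for `|`, `*` and parentheses agrees with the slides for these patterns); the helper name `matches` is an assumption made for illustration.

```python
import re


def matches(pattern: str, s: str) -> bool:
    """True if the whole string s belongs to the language denoted by pattern."""
    return re.fullmatch(pattern, s) is not None


# (a | b)(a | b) denotes exactly {aa, ab, ba, bb}
candidates = ["aa", "ab", "ba", "bb", "a", "abb"]
two = [s for s in candidates if matches(r"(a|b)(a|b)", s)]

# a* contains the empty string
star_has_empty = matches(r"a*", "")

# a | a*b: the string a, or zero or more a's followed by a single b
alt = {s: matches(r"a|a*b", s) for s in ["a", "b", "aab", "aa", "ba"]}

print(two)
print(star_has_empty)
print(alt)
```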
Regular Definitions

Names for regular expressions:
d1 → r1
d2 → r2
...
dn → rn
where each ri is a RE over the alphabet Σ ∪ {d1, d2, ..., di-1}
Examples:
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
identifier → letter ( letter | digit )*
Notational Abbreviations

One or more instances
(r)+ denoting (L(r))+
r* = r+ | ε        r+ = r r*
Zero or one instance
r? = r | ε
Character classes
[abc] = a | b | c        [a-z] = a | b | ... | z
[^abc] = any character except a, b, or c
Any character except newline
.
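The abbreviations above are pure shorthand: each one denotes the same language as its expansion. A small sketch, again using Python's `re` as a stand-in for the slide notation and comparing languages over a hand-picked sample set (the sample list and the helper `lang` are assumptions for illustration, not a proof of equivalence):

```python
import re

samples = ["", "a", "aa", "b", "ab", "x", "\n"]


def lang(pattern: str) -> set:
    """The subset of samples that belongs to the language of pattern."""
    return {s for s in samples if re.fullmatch(pattern, s)}


plus_ok = lang(r"a+") == lang(r"aa*")        # r+ = r r*
star_ok = lang(r"a*") == lang(r"a+|")        # r* = r+ | ε (empty alternative)
opt_ok = lang(r"a?") == lang(r"a|")          # r? = r | ε
cls_ok = lang(r"[abc]") == lang(r"a|b|c")    # [abc] = a | b | c

neg = lang(r"[^abc]")   # any single character except a, b, c (newline included)
dot = lang(r".")        # any single character except newline

print(plus_ok, star_ok, opt_ok, cls_ok)
```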
Examples

if                                     {return IF;}
[a-z][a-z0-9]*                         {return ID;}
[0-9]+                                 {return NUM;}
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)    {return REAL;}
("--"[a-z]*"\n")|(" "|"\n"|"\t")+      {/* do nothing for white space and comments */}
.                                      {error();}
Completeness and Disambiguation

A lexical specification should be complete; namely, it always matches some initial substring of the input.
Longest-match disambiguation rule: the longest initial substring of the input that can match any regular expression is taken as the next token.
Rule-priority disambiguation rule: for a particular longest initial substring, the first regular expression that can match determines its token type.
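The two disambiguation rules can be sketched as a loop that tries every rule at the current position, keeps the longest match, and breaks ties in favour of the earlier rule. This is an illustrative sketch, not how a generated scanner works internally: it uses Python's `re` per rule instead of a single automaton, and the rule list, `next_token` and `tokenize` are names assumed for the example.

```python
import re

# Rules in priority order: on equal length, the earlier rule wins.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
    ("REAL", re.compile(r"([0-9]+\.[0-9]*)|([0-9]*\.[0-9]+)")),
    ("NUM", re.compile(r"[0-9]+")),
    ("WS", re.compile(r"[ \n\t]+")),
    ("ERROR", re.compile(r".")),   # catch-all keeps the specification complete
]


def next_token(text, pos):
    """Return (token_type, end) of the next token starting at pos."""
    best_type, best_end = None, pos
    for token_type, pattern in RULES:
        m = pattern.match(text, pos)
        # Longest match: only a strictly longer match replaces the current best,
        # so an earlier rule keeps priority when lengths are equal.
        if m and m.end() > best_end:
            best_type, best_end = token_type, m.end()
    return best_type, best_end


def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        token_type, end = next_token(text, pos)
        if token_type != "WS":          # white space is discarded
            out.append((token_type, text[pos:end]))
        pos = end
    return out


print(tokenize("if iffy 0.9 42 x1"))
```

Here `if` is reported as IF because IF and ID match the same two characters and IF is listed first, while `iffy` is reported as ID because the ID match is longer.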
Examples

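.                                      /* match any */
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)    /* REAL */
Input 0.9: the longest match selects REAL.

if                                     /* IF */
[a-z][a-z0-9]*                         /* ID */
Input if: both rules match the same substring, so rule priority selects IF.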
Finite Automata

A finite automaton is a finite-state transition diagram that can be used to model the recognition of a token type specified by a regular expression.
A finite automaton can be a nondeterministic finite automaton or a deterministic finite automaton.
Nondeterministic Finite Automata (NFA)

An NFA consists of
A finite set of states
A finite set of input symbols
A transition function that maps (state, symbol) pairs to sets of states
A state distinguished as the start state
A set of states distinguished as final states
An Example

RE: (a | b)*abb
States: {1, 2, 3, 4}
Input symbols: {a, b}
Transition function:
(1,a) = {1,2}, (1,b) = {1}
(2,b) = {3}, (3,b) = {4}
Start state: 1
Final states: {4}
[Diagram: state 1 loops on a,b; 1 -a-> 2 -b-> 3 -b-> 4.]
Acceptance of NFA

An NFA accepts an input string s iff there is some path in the finite-state transition diagram from the start state to some final state such that the edge labels along this path spell out s.
The language recognized by an NFA is the set of strings it accepts.
An Example

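RE: (a | b)*abb    Input: aabb
[Diagram: NFA with start state 0 and states 1, 2, 3, connected by edges labelled a, b, b.]
aabb ends in abb, so it is accepted.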
An Example

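RE: (a | b)*abb    Input: aaba
[Diagram: the same NFA with states 0, 1, 2, 3, with additional edge labels a and b.]
aaba does not end in abb, so it is not accepted.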
Another Example

RE: aa* | bb*


States: {1, 2, 3, 4, 5}
Input symbols: {a, b}
Transition function:
(1, ε) = {2, 4}, (2, a) = {3}, (3, a) = {3},
(4, b) = {5}, (5, b) = {5}
Start state: 1
Final states: {3, 5}
Finite-State Transition Diagram

aa* | bb*
[Diagram: 1 -ε-> 2 -a-> 3, with a loop on a at 3; 1 -ε-> 4 -b-> 5, with a loop on b at 5.]
Deterministic Finite Automata (DFA)

A DFA is a special case of an NFA in which
no state has an ε-transition
for each state s and input symbol a, there is at most one edge labeled a leaving s
An Example

RE: (a | b)*abb
States: {1, 2, 3, 4}
Input symbols: {a, b}
Transition function:
(1,a) = {2}, (2,a) = {2}, (3,a) = {2}, (4,a) = {2}
(1,b) = {1}, (2,b) = {3}, (3,b) = {4}, (4,b) = {1}
Start state: 1
Final state: {4}

Finite-State Transition Diagram

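(a | b)*abb
[Diagram: DFA with states 1 (start), 2, 3, 4 (final); edges 1 -a-> 2, 1 -b-> 1, 2 -a-> 2, 2 -b-> 3, 3 -a-> 2, 3 -b-> 4, 4 -a-> 2, 4 -b-> 1.]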
Acceptance of DFA

A DFA accepts an input string s iff there is one path in the finite-state transition diagram from the start state to some final state such that the edge labels along this path spell out s.
The language recognized by a DFA is the set of strings it accepts.
An Example

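RE: (a | b)*abb    Input: aabb
[Diagram: the DFA for (a | b)*abb with states 1, 2, 3, 4.]
The path 1 -a-> 2 -a-> 2 -b-> 3 -b-> 4 ends in the final state 4, so aabb is accepted.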
An Example

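RE: (a | b)*abb    Input: aaba
[Diagram: the DFA for (a | b)*abb with states 1, 2, 3, 4.]
The path 1 -a-> 2 -a-> 2 -b-> 3 -a-> 2 ends in state 2, which is not final, so aaba is not accepted.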
Combined Finite Automata

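[Diagram: one automaton per token type.
if: 1 -i-> 2 -f-> 3 (final, IF).
[a-z][a-z0-9]*: 1 -a-z-> 2 (final, ID), with a loop on a-z,0-9 at 2.
([0-9]+"."[0-9]*) | ([0-9]*"."[0-9]+): states 1 to 5 with edges labelled 0-9 and .; final states labelled REAL.]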
Combined Finite Automata

[Diagram: the three automata combined into a single NFA with start state 1 and states 2 to 11; edges labelled i, f, a-z, a-z,0-9, 0-9 and .; final states labelled IF, ID and REAL.]
Combined Finite Automata

[Diagram: the equivalent DFA with start state 1 and states 2 to 8; edges labelled i, f, a-e, g-z, a-h, j-z, a-z,0-9, 0-9 and .; final states labelled IF, ID and REAL.]
Recognizing the Longest Match

The automaton must keep track of the longest match seen so far and the position of that match until a dead state is reached.
Use two variables, Last-Final (the state number of the most recent final state encountered) and Input-Position-at-Last-Final, to remember the last time the automaton was in a final state.
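The bookkeeping described above can be sketched as follows. The transition table is a hypothetical DFA for IF, ID and REAL written for this example (the state numbers and the helper `edges` are assumptions); the variables `last_final` and `pos_at_last_final` play the roles of Last-Final and Input-Position-at-Last-Final.

```python
import string

LETTERS = string.ascii_lowercase
DIGITS = string.digits


def edges(*pairs):
    """Build one state's transition row from (characters, target) pairs.

    Later pairs override earlier ones, so specific edges are listed last.
    """
    row = {}
    for chars, target in pairs:
        for ch in chars:
            row[ch] = target
    return row


# Hypothetical DFA (an assumption for this sketch).
DFA = {
    1: edges((LETTERS, 4), ("i", 2), (DIGITS, 5), (".", 7)),
    2: edges((LETTERS + DIGITS, 4), ("f", 3)),
    3: edges((LETTERS + DIGITS, 4)),
    4: edges((LETTERS + DIGITS, 4)),
    5: edges((DIGITS, 5), (".", 6)),
    6: edges((DIGITS, 6)),
    7: edges((DIGITS, 8)),
    8: edges((DIGITS, 8)),
}
FINAL = {2: "ID", 3: "IF", 4: "ID", 6: "REAL", 8: "REAL"}


def longest_match(text, start):
    """Return (token_type, end) of the longest token at start, or (None, start)."""
    state = 1
    last_final = None            # Last-Final
    pos_at_last_final = start    # Input-Position-at-Last-Final
    pos = start
    while pos < len(text):
        state = DFA[state].get(text[pos])
        if state is None:        # dead state: no token can be extended further
            break
        pos += 1
        if state in FINAL:
            last_final, pos_at_last_final = state, pos
    return FINAL.get(last_final), pos_at_last_final


print(longest_match("iffail+", 0))
```

When the dead state is reached, the scanner reports the token recorded at the last final state and resumes scanning from the recorded position, even if it had read further ahead.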
An Example

Input: iffail+
[Diagram: the combined DFA for IF, ID and REAL.]

S    C    L    P
     1    0    0
i    2    0    0
f    3    3    2
f    4    4    3
a    4    4    4
i    4    4    5
l    4    4    6
+    ?

(S: input symbol, C: current state, L: Last-Final, P: Input-Position-at-Last-Final)
Lexical Analyzer Generators

RE -> NFA -> DFA
From a RE to an NFA

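Thompson's construction algorithm

For ε, construct
[Diagram: start state i with an ε-edge to final state f.]

For a in the alphabet, construct
[Diagram: start state i with an edge labelled a to final state f.]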
From a RE to an NFA

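Suppose N(s) and N(t) are NFAs for REs s and t

for s | t, construct
[Diagram: a new start state i with ε-edges to is and it, the start states of N(s) and N(t); the final states fs and ft have ε-edges to a new final state f.]

for s t, construct
[Diagram: N(s) followed by N(t); i is the start state of N(s), the final state fs of N(s) coincides with the start state it of N(t), and f is the final state of N(t).]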
From a RE to an NFA

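for s*, construct
[Diagram: a new start state i and a new final state f around N(s); ε-edges from i to is, from fs to f, from fs back to is, and from i to f.]

for (s), use N(s)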
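The construction rules above can be sketched as a recursive function over a regular-expression syntax tree. Several things here are assumptions made for illustration: the tuple encoding of the tree, the `NFA` class, and the use of an ε-edge for concatenation instead of merging the two states (so the state count differs from the textbook figure). The ε-wiring follows the standard textbook form of Thompson's construction.

```python
import itertools


class NFA:
    """NFA under construction: eps[s] and sym[(s, c)] are sets of target states."""

    def __init__(self):
        self.eps = {}
        self.sym = {}
        self._ids = itertools.count(1)

    def new_state(self):
        return next(self._ids)

    def add_eps(self, s, t):
        self.eps.setdefault(s, set()).add(t)

    def add_sym(self, s, c, t):
        self.sym.setdefault((s, c), set()).add(t)


def build(nfa, node):
    """Return (start, final) of the NFA fragment for a syntax-tree node.

    Nodes: ("eps",), ("sym", c), ("alt", r, s), ("cat", r, s), ("star", r).
    """
    kind = node[0]
    if kind in ("eps", "sym"):
        i, f = nfa.new_state(), nfa.new_state()
        if kind == "eps":
            nfa.add_eps(i, f)
        else:
            nfa.add_sym(i, node[1], f)
        return i, f
    if kind == "alt":
        i = nfa.new_state()
        si, sf = build(nfa, node[1])
        ti, tf = build(nfa, node[2])
        f = nfa.new_state()
        for a, b in [(i, si), (i, ti), (sf, f), (tf, f)]:
            nfa.add_eps(a, b)
        return i, f
    if kind == "cat":
        si, sf = build(nfa, node[1])
        ti, tf = build(nfa, node[2])
        nfa.add_eps(sf, ti)   # simplification: link with ε instead of merging
        return si, tf
    if kind == "star":
        i = nfa.new_state()
        si, sf = build(nfa, node[1])
        f = nfa.new_state()
        for a, b in [(i, si), (sf, f), (sf, si), (i, f)]:
            nfa.add_eps(a, b)
        return i, f
    raise ValueError(f"unknown node kind: {kind}")


def closure(nfa, states):
    """All states reachable from states on ε-edges alone."""
    stack, seen = list(states), set(states)
    while stack:
        for u in nfa.eps.get(stack.pop(), ()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def accepts(nfa, start, final, text):
    current = closure(nfa, {start})
    for c in text:
        moved = set()
        for s in current:
            moved |= nfa.sym.get((s, c), set())
        current = closure(nfa, moved)
    return final in current


# Syntax tree for (a | b)*abb
a, b = ("sym", "a"), ("sym", "b")
tree = ("cat", ("cat", ("cat", ("star", ("alt", a, b)), a), b), b)
nfa = NFA()
start, final = build(nfa, tree)
print(accepts(nfa, start, final, "aabb"))
```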
An Example


(a | b)*abb
[Diagram: the NFA produced by Thompson's construction, with states 1 to 11 and edges labelled a, b, a, b, b.]
Simulating a DFA

Input: An input string terminated by eof and a DFA with start state s0 and final states F.
Output: The answer yes if the DFA accepts the string, no otherwise.

begin
  s := s0; c := nextchar;
  while c <> eof do begin
    s := move(s, c); c := nextchar
  end;
  if s is in F then return yes else return no
end.
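The simulation loop above translates almost line for line into code. This sketch uses the transition table of the DFA for (a | b)*abb given earlier in the deck; the names `MOVE`, `START`, `FINAL` and `dfa_accepts` are assumptions, and the end of the Python string stands in for eof.

```python
# Transition table of the DFA for (a | b)*abb.
MOVE = {
    (1, "a"): 2, (2, "a"): 2, (3, "a"): 2, (4, "a"): 2,
    (1, "b"): 1, (2, "b"): 3, (3, "b"): 4, (4, "b"): 1,
}
START = 1
FINAL = {4}


def dfa_accepts(text):
    """Run the DFA over text and report whether it ends in a final state."""
    s = START
    for c in text:            # reaching the end of text corresponds to eof
        s = MOVE[(s, c)]      # raises KeyError for symbols outside {a, b}
    return s in FINAL


print(dfa_accepts("bbababb"))
print(dfa_accepts("bbabab"))
```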
An Example

(a | b)*abb
[Diagram: the DFA for (a | b)*abb with states 1 (start), 2, 3, 4 (final).]
An Example

Input: bbababb                Input: bbabab

s = 1                         s = 1
s = move(1, b) = 1            s = move(1, b) = 1
s = move(1, b) = 1            s = move(1, b) = 1
s = move(1, a) = 2            s = move(1, a) = 2
s = move(2, b) = 3            s = move(2, b) = 3
s = move(3, a) = 2            s = move(3, a) = 2
s = move(2, b) = 3            s = move(2, b) = 3
s = move(3, b) = 4            s is not in {4}
s is in {4}
Simulating an NFA

Input: An input string terminated by eof and an NFA with start state s0 and final states F.
Output: The answer yes if the NFA accepts the string, no otherwise.

begin
  S := ε-closure({s0}); c := nextchar;
  while c <> eof do begin
    S := ε-closure(move(S, c)); c := nextchar
  end;
  if S ∩ F <> ∅ then return yes else return no
end.
Operations on NFA states

ε-closure(s): set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(S): set of NFA states reachable from some NFA state s in S on ε-transitions alone
move(S, c): set of NFA states to which there is a transition on input symbol c from some NFA state s in S
An Example


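(a | b)*abb
[Diagram: NFA for (a | b)*abb with states 1 to 11, start state 1 and final state 11; edges labelled a (3 to 4), b (5 to 6), then a, b, b along states 8, 9, 10, 11.]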
An Example

Input: bbabb

S = ε-closure({1}) = {1,2,3,5,8}
S = ε-closure(move({1,2,3,5,8}, b))
  = ε-closure({6}) = {2,3,5,6,7,8}
S = ε-closure(move({2,3,5,6,7,8}, b))
  = ε-closure({6}) = {2,3,5,6,7,8}
S = ε-closure(move({2,3,5,6,7,8}, a))
  = ε-closure({4,9}) = {2,3,4,5,7,8,9}
S = ε-closure(move({2,3,4,5,7,8,9}, b))
  = ε-closure({6,10}) = {2,3,5,6,7,8,10}
S = ε-closure(move({2,3,5,6,7,8,10}, b))
  = ε-closure({6,11}) = {2,3,5,6,7,8,11}
S ∩ {11} <> ∅
Computation of ε-closure

Input: An NFA and a set of NFA states S.
Output: T = ε-closure(S).
begin
  push all states in S onto stack; T := S;
  while stack is not empty do begin
    pop t, the top element, off of stack;
    for each state u with an edge from t to u labeled ε do
      if u is not in T then begin
        add u to T; push u onto stack
      end
  end;
  return T
end.
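The stack-based ε-closure above, combined with the NFA simulation loop, can be sketched as follows. The edge sets `EPS` and `SYM` are reconstructed from the closures and moves shown in the worked trace for (a | b)*abb, so they are an assumption about the lost diagram; the function names are likewise illustrative.

```python
# NFA for (a | b)*abb with states 1..11 (edges inferred from the worked trace).
EPS = {1: {2, 8}, 2: {3, 5}, 4: {7}, 6: {7}, 7: {2, 8}}
SYM = {(3, "a"): {4}, (8, "a"): {9}, (5, "b"): {6}, (9, "b"): {10}, (10, "b"): {11}}
START = 1
FINAL = {11}


def eps_closure(states):
    """States reachable from any state in states on ε-transitions alone."""
    stack = list(states)
    closure = set(states)
    while stack:
        t = stack.pop()
        for u in EPS.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure


def move(states, c):
    """States reachable from some state in states by one transition on c."""
    result = set()
    for s in states:
        result |= SYM.get((s, c), set())
    return result


def nfa_accepts(text):
    current = eps_closure({START})
    for c in text:
        current = eps_closure(move(current, c))
    return bool(current & FINAL)      # accept iff S ∩ F is non-empty


print(sorted(eps_closure({1})))
print(nfa_accepts("bbabb"))
```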
From an NFA to a DFA

Subset construction algorithm.
Input: An NFA N.
Output: A DFA D with states Dstates and transition table Dtran.
begin
  add ε-closure(s0) as an unmarked state to Dstates;
  while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
      U := ε-closure(move(T, a));
      if U is not in Dstates then
        add U as an unmarked state to Dstates;
      Dtran[T, a] := U
    end
  end
end.
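A sketch of the subset construction above, applied to an NFA for (a | b)*abb. The NFA edges are an assumption reconstructed from the closures listed in the worked example, and all names are illustrative. DFA states are represented as frozensets of NFA states so that they can be used as dictionary keys.

```python
# NFA for (a | b)*abb with states 1..11 (edges are an assumption, see lead-in).
EPS = {1: {2, 8}, 2: {3, 5}, 4: {7}, 6: {7}, 7: {2, 8}}
SYM = {(3, "a"): {4}, (8, "a"): {9}, (5, "b"): {6}, (9, "b"): {10}, (10, "b"): {11}}
ALPHABET = "ab"


def eps_closure(states):
    stack, closure = list(states), set(states)
    while stack:
        for u in EPS.get(stack.pop(), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure


def move(states, c):
    result = set()
    for s in states:
        result |= SYM.get((s, c), set())
    return result


def subset_construction(start):
    """Return (dstates, dtran); dstates lists DFA states in discovery order."""
    start_set = frozenset(eps_closure({start}))
    dstates = [start_set]
    dtran = {}
    unmarked = [start_set]
    while unmarked:
        T = unmarked.pop()                # mark T
        for a in ALPHABET:
            U = frozenset(eps_closure(move(T, a)))
            if U not in dstates:          # in general U may be the empty (dead) state
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran


dstates, dtran = subset_construction(1)
A = frozenset({1, 2, 3, 5, 8})
B = frozenset({2, 3, 4, 5, 7, 8, 9})
C = frozenset({2, 3, 5, 6, 7, 8})
D = frozenset({2, 3, 5, 6, 7, 8, 10})
E = frozenset({2, 3, 5, 6, 7, 8, 11})
print(len(dstates))
```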
An Example


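(a | b)*abb
[Diagram: NFA for (a | b)*abb with states 1 to 11, start state 1 and final state 11; edges labelled a (3 to 4), b (5 to 6), then a, b, b along states 8, 9, 10, 11.]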
An Example

ε-closure({1}) = {1,2,3,5,8} = A
ε-closure(move(A, a)) = ε-closure({4,9}) = {2,3,4,5,7,8,9} = B
ε-closure(move(A, b)) = ε-closure({6}) = {2,3,5,6,7,8} = C
ε-closure(move(B, a)) = ε-closure({4,9}) = B
ε-closure(move(B, b)) = ε-closure({6,10}) = {2,3,5,6,7,8,10} = D
ε-closure(move(C, a)) = ε-closure({4,9}) = B
ε-closure(move(C, b)) = ε-closure({6}) = C
ε-closure(move(D, a)) = ε-closure({4,9}) = B
ε-closure(move(D, b)) = ε-closure({6,11}) = {2,3,5,6,7,8,11} = E
ε-closure(move(E, a)) = ε-closure({4,9}) = B
ε-closure(move(E, b)) = ε-closure({6}) = C
An Example

State                       Input symbol
                            a    b
A = {1,2,3,5,8}             B    C
B = {2,3,4,5,7,8,9}         B    D
C = {2,3,5,6,7,8}           B    C
D = {2,3,5,6,7,8,10}        B    E
E = {2,3,5,6,7,8,11}        B    C
An Example

[Diagram: the resulting DFA with five states {1,2,3,5,8} (start), {2,3,4,5,7,8,9}, {2,3,5,6,7,8}, {2,3,5,6,7,8,10} and {2,3,5,6,7,8,11}, connected by edges labelled a and b.]
Flex Lexical Analyzer Generator

A language for specifying lexical analyzers

lang.l -> Flex compiler -> lex.yy.c
lex.yy.c -> C compiler (with -lfl) -> a.out
source code -> a.out -> tokens
Flex Programs

%{
auxiliary declarations
%}
regular definitions
%%
translation rules
%%
auxiliary procedures
Translation Rules

P1    action1
P2    action2
...
Pn    actionn

where Pi are regular expressions and actioni are C program segments
Example I

%%
username    printf( "%s", getlogin() );

By default, any text not matched by a flex lexical analyzer is copied to the output. This lexical analyzer copies its input file to its output with each occurrence of username being replaced with the user's login name.
Example II

%{
int lines = 0, chars = 0;
%}
%%
\n    ++lines; ++chars;
.     ++chars;    /* all characters except \n */
%%
int main(void) {
    yylex();
    printf("lines = %d, chars = %d\n", lines, chars);
    return 0;
}
Example III

%{
#define EOF 0
#define LE 25
...
%}
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
Example III

{ws}        { /* no action and no return */ }
if          {return (IF);}
else        {return (ELSE);}
{id}        {yylval=install_id(); return (ID);}
{number}    {yylval=install_num(); return (NUMBER);}
"<="        {yylval=LE; return (RELOP);}
"=="        {yylval=EQ; return (RELOP);}
...
<<EOF>>     {return(EOF);}
%%
install_id() { ... }
install_num() { ... }
Functions and Variables

yylex()
    a function implementing the lexical analyzer and returning the token matched

yytext
    a global pointer variable pointing to the lexeme matched

yyleng
    a global variable giving the length of the lexeme matched

yylval
    an external global variable storing the attribute of the token
NFA from Flex Programs

P1 | P2 | ... | Pn
[Diagram: a new start state s0 with ε-transitions to each of N(P1), N(P2), ..., N(Pn).]
Rules

Look for the longest lexeme
    example: number
Look for the first-listed pattern that matches the longest lexeme
    example: keywords and identifiers
List frequently occurring patterns first
    example: white space
Rules

View keywords as exceptions to the rule of identifiers
    construct a keyword table
Lookahead operator: r1/r2 - match a string in r1 only if followed by a string in r2
    DO 5 I = 1. 25
    DO 5 I = 1, 25
    DO/({letter}|{digit})* = ({letter}|{digit})*,
Rules

Start condition: <s>r matches r only in start condition s
    <str>[^"]*    {/* eat up string body */}
Start conditions are declared in the first section using either %s or %x
    %s str
A start condition is activated using the BEGIN action
    \"    BEGIN(str);
The default start condition is INITIAL
Lexical Error Recovery

Error: none of the patterns matches a prefix of the remaining input
Panic mode error recovery
    delete successive characters from the remaining input until the pattern-matching can continue
Error repair:
    delete an extraneous character
    insert a missing character
    replace an incorrect character
    transpose two adjacent characters
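Panic mode recovery as described above can be sketched as follows: when no pattern matches at the current position, characters are deleted one at a time until matching can resume. The token patterns and the function name are assumptions chosen for the example, and Python's `re` stands in for a generated scanner.

```python
import re

# Illustrative token patterns (hypothetical; any rule set would do).
TOKEN = re.compile(r"(?P<ID>[a-z][a-z0-9]*)|(?P<NUM>[0-9]+)|(?P<WS>[ \t\n]+)")


def tokenize_with_panic_mode(text):
    """Return (tokens, errors); errors holds (position, deleted_text) pairs."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m:
            if m.lastgroup != "WS":
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
            continue
        # Panic mode: delete successive characters until some pattern matches.
        bad_start = pos
        while pos < len(text) and not TOKEN.match(text, pos):
            pos += 1
        errors.append((bad_start, text[bad_start:pos]))
    return tokens, errors


tokens, errors = tokenize_with_panic_mode("x1 = 42 $# y")
print(tokens)
print(errors)
```

Recording the deleted text and its position lets the scanner report each lexical error once and still hand the parser the tokens that follow.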
Maintaining Line Number

Flex can maintain the number of the current line in the global variable yylineno when the following option is given in the first section:

%option yylineno
