Sie sind auf Seite 1von 39

Lecture 2: Lexical Analysis

CS 540 George Mason University

Lexical Analysis - Scanning


Source language Scanner (lexical analysis) tokens Parser (syntax analysis) Semantic Analysis (IC generator) Code Generator

Code Optimizer

Tokens described formally Breaks input into tokens White space

Symbol Table
CS 540 Spring 2010 GMU 2

Lexical Analysis
INPUT: sequence of characters OUTPUT: sequence of tokens
Next_char() Next_token()

Input

character

Scanner

token

Parser

Symbol Table

A lexical analyzer is generally a subroutine of parser: Simpler design Efficient Portable


CS 540 Spring 2010 GMU 3

Definitions
token set of strings defining an atomic element with a defined meaning pattern a rule describing a set of string lexeme a sequence of characters that match some pattern

CS 540 Spring 2010 GMU

Examples
Token
while

Pattern
while

Sample Lexeme
while < 42 hello

relation_op = | != | < | > integer string (0-9)* Characters between


CS 540 Spring 2010 GMU

Input string: size := r * 32 + c


<token,lexeme> pairs: <id, size> <assign, :=> <id, r> <arith_symbol, *> <integer, 32> <arith_symbol, +> <id, c>
CS 540 Spring 2010 GMU 6

Implementing a Lexical Analyzer


Practical Issues: Input buffering Translating RE into executable form Must be able to capture a large number of tokens with single machine Interface to parser Tools
CS 540 Spring 2010 GMU 7

Capturing Multiple Tokens


Capturing keyword begin b e g i n WS WS white space A alphabetic AN alphanumeric

Capturing variable names A WS AN

What if both need to happen at the same time?


CS 540 Spring 2010 GMU 8

Capturing Multiple Tokens


b A-b AN AN WS WS e g i n WS WS white space A alphabetic AN alphanumeric

Machine is much more complicated just for these two tokens!


CS 540 Spring 2010 GMU 9

Real lexer (handcoded)


http://cs.gmu.edu/~white/CS540/lexer.cpp Comes from C# compiler in Rotor >950 lines of C++

CS 540 Spring 2010 GMU

10

Lex Lexical Analyzer Generator


Lex specification Lex C/C++/Java

compiler

flex more modern version of lex jflex java version of flex

input

exec

tokens

CS 540 Spring 2010 GMU

11

Lex Specification (flex)


%{ int charCount=0, wordCount=0, lineCount=0; Definitions %} Code, RE word [^ \t\n]+ %% {word} {wordCount++; charCount += strlen(yytext); } Rules [\n] {charCount++; lineCount++;} . {charCount++;} RE/Action pairs %% main() { User Routines yylex(); printf(Characters %d, Words: %d, Lines: %d\n, charCount, wordCount, lineCount); }
CS 540 Spring 2010 GMU 12

Lex Specification (jflex)


import java.io.*; %% %class ex1 %unicode %line %column %standalone %{ static int charCount = 0, wordCount = 0, lineCount = 0; public static void main(String [] args) throws IOException { ex1 lexer = new ex1(new FileReader(args[0])); lexer.yylex(); System.out.println("Characters: " + charCount + " Words: " + wordCount +" Lines: " +lineCount); } %} %type Object //this line changes the return type of yylex into Object word = [^ \t\n]+ %% {word} {wordCount++; charCount += yytext().length(); } [\n] {charCount++; lineCount++; } . {charCount++; } CS 540 Spring 2009 GMU

Definitions Code, RE

Rules RE/Action pairs


13

Lex definitions section


C/C++/Java code:
%{ int charCount=0, wordCount=0, lineCount=0; %} word [^ \t\n]+

Surrounded by %{ %} delimiters Declare any variables used in actions

RE definitions:
Define shorthand for patterns: digit [0-9] letter [a-z] ident {letter}({letter}|{digit})* Use shorthand in RE section: {ident}
CS 540 Spring 2009 GMU 14

Lex Regular Expressions


{word} {wordCount++; charCount += strlen(yytext); } [\n] {charCount++; lineCount++;} . {charCount++;}

Match explicit character sequences


integer, +++, \<\>

Character classes
[abcd] [a-zA-Z] [^0-9] matches non-numeric
CS 540 Spring 2009 GMU 15

Alternation
twelve | 12

Closure
* - zero or more + - one or more ? zero or one {number}, {number,number}

CS 540 Spring 2010 GMU

16

Other operators
. matches any character except newline ^ - matches beginning of line $ - matches end of line / - trailing context () grouping {} RE definitions
CS 540 Spring 2010 GMU 17

Lex Operators
Highest: closure concatenation alternation Special lex characters: -\/*+>{}.$()|%[]^ Special lex characters inside [ ]: -\[]^
CS 540 Spring 2010 GMU 18

Examples
a.*z (ab)+ [0-9]{1,5} (ab|cd)?ef = abef,cdef,ef -?[0-9]\.[0-9]

CS 540 Spring 2010 GMU

19

Lex Actions
Lex actions are C (C++, Java) code to implement some required functionality Default action is to echo to output Can ignore input (empty action) ECHO macro that prints out matched string yytext matched string yyleng length of matched string (not all versions have this) In Java:
yytext() and yytext().length()
CS 540 Spring 2010 GMU 20

User Subroutines
main() { yylex(); printf(Characters %d, Words: %d, Lines: %d\n,charCount, wordCount, lineCount); }

C/C++/Java code Copied directly into the lexer code User can supply main or use default
CS 540 Spring 2010 GMU 21

How Lex works


Lex works by processing the file one character at a time, trying to match a string starting from that character.
1. Lex always attempts to match the longest possible string. 2. If two rules are matched (and match strings are same length), the first rule in the specification is used.

Once it matches a string, it starts from the character after the string
CS 540 Spring 2010 GMU 22

Lex Matching Rules


1. Lex always attempts to match the longest possible string.
beg begin in {} {} {}

Input begin can match either of the first two rules. The second rule will be chosen because of the length.

CS 540 Spring 2010 GMU

23

Lex Matching Rules


2. If two rules are matched (the matched strings are same length), the first rule in the specification is used.
begin [a-z]+ { } {}

Input begin can match both rules the first one will be chosen
CS 540 Spring 2010 GMU 24

Lex Example: Extracting white space


%{ #include <stdio.h> %} %% [ \t\n] ; . {ECHO;} %%

To compile and run above (simple.l): flex simple.l flex simple.l gcc lex.yy.c ll g++ -x c++ lex.yy.c ll a.out < input a.out < input
CS 540 Spring 2010 GMU

-lfl on some systems25

Lex Example: Extracting white space (Java)


%% %class ex0 %unicode %line %column %standalone %% [^ \t\n] {System.out.print(yytext());} . {} [\n] {}

name of class to build

To compile and run above (simple.l): java -jar ~cs540/JFlex.jar simple.l javac ex0.java java ex0 inputfile

CS 540 Spring 2010 GMU

26

Input: This is a file of stuff we want white space from to extract all

Output: Thisisafileofstuffwewantoextractallwhitespacefrom

CS 540 Spring 2010 GMU

27

Lex (C/C++)
Lex always creates a file lex.yy.c with a function yylex() -ll directs the compiler to link to the lex library (-lfl on some systems) The lex library supplies external symbols referenced by the generated code The lex library supplies a default main: main(int ac,char **av) {return yylex(); }
CS 540 Spring 2010 GMU 28

Lex Example 2: Unix wc


%{ int charCount=0, wordCount=0, lineCount=0; %} word [^ \t\n]+ %% {word} {wordCount++; charCount += strlen(yytext); } [\n] {charCount++; lineCount++;} . {charCount++;} %% main() { yylex(); printf(Characters %d, Words: %d, Lines: %d\n,charCount, wordCount, lineCount); }
CS 540 Spring 2010 GMU 29

Lex Example 3: Extracting tokens


%% and array begin
. . .

return(AND); return(ARRAY); return(BEGIN);

\[ := [a-zA-Z][a-zA-Z0-9_]* [+-]?[0-9]+ [ \t\n] %%

return([); return(ASSIGN); return(ID); return(NUM); ;

CS 540 Spring 2010 GMU

30

Uses for Lex


Transforming Input convert input from one
form to another (example 1). yylex() is called once; return is not used in specification Extracting Information scan the text and return some information (example 2). yylex() is called once; return is not used in specification. Extracting Tokens standard use with compiler (example 3). Uses return to give the next token to the caller.
CS 540 Spring 2010 GMU 31

Lex States
Regular expressions are compiled to state machines. Lex allows the user to explicitly declare multiple states.
%s COMMENT

Default initial state INITIAL (0) Actions for matched strings may be different for different states
CS 540 Spring 2010 GMU 32

Lex States
%{ int ctr = 0; int linect = 1; %} %s COMMENT %% <INITIAL>. <INITIAL>[\n] <INITIAL>/* <COMMENT>. <COMMENT>[\n] <COMMENT>*/

<COMMENT>/* %%

ECHO; {linect++; ECHO;} {BEGIN COMMENT; ctr = 1;} ; linect++; {if (ctr == 1) BEGIN INITIAL; else ctr--; } {ctr++;}

CS 540 Spring 2010 GMU

33

Project #1: Using Lex


You are going to use Lex to read in DFA descriptions (given as context free rules), along with one or more inputs to test on the given DFA. The output of the program you submit will information about whether each of the given inputs is accepted or not.
CS 540 Spring 2010 GMU 34

Any regular language can be expressed using a CFG


Starting with a NFA: For each state Si in the NFA
Create non-terminal Ai If transition (Si,a) = Sk, create production Ai a Ak If transition (SiI = Sk, create production Ai Ak If Si is a final state, create production Ai I If Si is the NFA start state, s = Ai

CS 540 Spring 2010 GMU

35

NFA to CFG Example


ab*a

2 b

A1 A2 A2 A3
CS 540 Spring 2010 GMU

a A2 b A2 a A3 I
36

Example Input/Output
A --> a B B --> a A A --> aaaaa aaaa aaaaaaaaaa aba String rejected String accepted String accepted String rejected

CS 540 Spring 2010 GMU

37

Using Lex
1. Create a lex specification that recognizes the tokens of the input terminals, nonterminals, --> 2. Use Lex states to get the DFA and the input (see next slide) 3. Add code to step 2 to capture the DFA as a table and to then run the DFA on each of the given inputs
CS 540 Spring 2009 GMU 38

DFA to accept input

CS 540 Spring 2009 GMU

39

Das könnte Ihnen auch gefallen