CS540 2 Lecture2

Lecture 2: Lexical Analysis
CS 540 George Mason University
Lexical Analysis - Scanning

Source language Scanner (lexical analysis) tokens Parser (syntax analysis) Semantic Analysis (IC generator) Code Generator
Code Optimizer
Tokens described formally Breaks input into tokens White space
Symbol Table
CS 540 Spring 2010 GMU 2
Lexical Analysis
INPUT: sequence of characters OUTPUT: sequence of tokens
Next_char() Next_token()
Input
character
Scanner
token
Parser
Symbol Table
A lexical analyzer is generally a subroutine of parser: Simpler design Efficient Portable

Definitions
token set of strings defining an atomic element with a defined meaning pattern a rule describing a set of string lexeme a sequence of characters that match some pattern
CS 540 Spring 2010 GMU
Examples
Token
while
Pattern
while
Sample Lexeme
while < 42 hello
relation_op = | != | < | > integer string (0-9)* Characters between

Input string: size := r * 32 + c

<token,lexeme> pairs: <id, size> <assign, :=> <id, r> <arith_symbol, *> <integer, 32> <arith_symbol, +> <id, c>
Implementing a Lexical Analyzer

Practical Issues: Input buffering Translating RE into executable form Must be able to capture a large number of tokens with single machine Interface to parser Tools
Capturing Multiple Tokens

Capturing keyword begin b e g i n WS WS white space A alphabetic AN alphanumeric
Capturing variable names A WS AN
What if both need to happen at the same time?

Capturing Multiple Tokens

b A-b AN AN WS WS e g i n WS WS white space A alphabetic AN alphanumeric
Machine is much more complicated just for these two tokens!

Real lexer (handcoded)

http://cs.gmu.edu/~white/CS540/lexer.cpp Comes from C# compiler in Rotor >950 lines of C++
10
Lex Lexical Analyzer Generator

Lex specification Lex C/C++/Java
compiler
flex more modern version of lex jflex java version of flex
input
exec
tokens
11
Lex Specification (flex)

%{ int charCount=0, wordCount=0, lineCount=0; Definitions %} Code, RE word [^ \t\n]+ %% {word} {wordCount++; charCount += strlen(yytext); } Rules [\n] {charCount++; lineCount++;} . {charCount++;} RE/Action pairs %% main() { User Routines yylex(); printf(Characters %d, Words: %d, Lines: %d\n, charCount, wordCount, lineCount); }
Lex Specification (jflex)

import java.io.*; %% %class ex1 %unicode %line %column %standalone %{ static int charCount = 0, wordCount = 0, lineCount = 0; public static void main(String [] args) throws IOException { ex1 lexer = new ex1(new FileReader(args[0])); lexer.yylex(); System.out.println("Characters: " + charCount + " Words: " + wordCount +" Lines: " +lineCount); } %} %type Object //this line changes the return type of yylex into Object word = [^ \t\n]+ %% {word} {wordCount++; charCount += yytext().length(); } [\n] {charCount++; lineCount++; } . {charCount++; } CS 540 Spring 2009 GMU
Definitions Code, RE
Rules RE/Action pairs

13
Lex definitions section

C/C++/Java code:
%{ int charCount=0, wordCount=0, lineCount=0; %} word [^ \t\n]+
Surrounded by %{ %} delimiters Declare any variables used in actions
RE definitions:
Define shorthand for patterns: digit [0-9] letter [a-z] ident {letter}({letter}|{digit})* Use shorthand in RE section: {ident}
Lex Regular Expressions

{word} {wordCount++; charCount += strlen(yytext); } [\n] {charCount++; lineCount++;} . {charCount++;}
Match explicit character sequences

integer, +++, \<\>
Character classes
[abcd] [a-zA-Z] [^0-9] matches non-numeric
Alternation
twelve | 12
Closure
* - zero or more + - one or more ? zero or one {number}, {number,number}
16
Other operators
. matches any character except newline ^ - matches beginning of line $ - matches end of line / - trailing context () grouping {} RE definitions
Lex Operators
Highest: closure concatenation alternation Special lex characters: -\/*+>{}.$()|%[]^ Special lex characters inside [ ]: -\[]^
Examples
a.*z (ab)+ [0-9]{1,5} (ab|cd)?ef = abef,cdef,ef -?[0-9]\.[0-9]
19
Lex Actions
Lex actions are C (C++, Java) code to implement some required functionality Default action is to echo to output Can ignore input (empty action) ECHO macro that prints out matched string yytext matched string yyleng length of matched string (not all versions have this) In Java:
yytext() and yytext().length()
User Subroutines
main() { yylex(); printf(Characters %d, Words: %d, Lines: %d\n,charCount, wordCount, lineCount); }
C/C++/Java code Copied directly into the lexer code User can supply main or use default
How Lex works

Lex works by processing the file one character at a time, trying to match a string starting from that character.
1. Lex always attempts to match the longest possible string. 2. If two rules are matched (and match strings are same length), the first rule in the specification is used.
Once it matches a string, it starts from the character after the string
Lex Matching Rules

1. Lex always attempts to match the longest possible string.
beg begin in {} {} {}
Input begin can match either of the first two rules. The second rule will be chosen because of the length.
23
Lex Matching Rules

2. If two rules are matched (the matched strings are same length), the first rule in the specification is used.
begin [a-z]+ { } {}
Input begin can match both rules the first one will be chosen
Lex Example: Extracting white space

%{ #include <stdio.h> %} %% [ \t\n] ; . {ECHO;} %%
To compile and run above (simple.l): flex simple.l flex simple.l gcc lex.yy.c ll g++ -x c++ lex.yy.c ll a.out < input a.out < input
-lfl on some systems25
Lex Example: Extracting white space (Java)

%% %class ex0 %unicode %line %column %standalone %% [^ \t\n] {System.out.print(yytext());} . {} [\n] {}
name of class to build
To compile and run above (simple.l): java -jar ~cs540/JFlex.jar simple.l javac ex0.java java ex0 inputfile
26
Input: This is a file of stuff we want white space from to extract all
Output: Thisisafileofstuffwewantoextractallwhitespacefrom
27
Lex (C/C++)
Lex always creates a file lex.yy.c with a function yylex() -ll directs the compiler to link to the lex library (-lfl on some systems) The lex library supplies external symbols referenced by the generated code The lex library supplies a default main: main(int ac,char **av) {return yylex(); }
Lex Example 2: Unix wc

%{ int charCount=0, wordCount=0, lineCount=0; %} word [^ \t\n]+ %% {word} {wordCount++; charCount += strlen(yytext); } [\n] {charCount++; lineCount++;} . {charCount++;} %% main() { yylex(); printf(Characters %d, Words: %d, Lines: %d\n,charCount, wordCount, lineCount); }
Lex Example 3: Extracting tokens

%% and array begin
. . .
return(AND); return(ARRAY); return(BEGIN);
\[ := [a-zA-Z][a-zA-Z0-9_]* [+-]?[0-9]+ [ \t\n] %%
return([); return(ASSIGN); return(ID); return(NUM); ;
30
Uses for Lex

Transforming Input convert input from one
form to another (example 1). yylex() is called once; return is not used in specification Extracting Information scan the text and return some information (example 2). yylex() is called once; return is not used in specification. Extracting Tokens standard use with compiler (example 3). Uses return to give the next token to the caller.
Lex States
Regular expressions are compiled to state machines. Lex allows the user to explicitly declare multiple states.
%s COMMENT
Default initial state INITIAL (0) Actions for matched strings may be different for different states
Lex States
%{ int ctr = 0; int linect = 1; %} %s COMMENT %% <INITIAL>. <INITIAL>[\n] <INITIAL>/* <COMMENT>. <COMMENT>[\n] <COMMENT>*/
<COMMENT>/* %%
ECHO; {linect++; ECHO;} {BEGIN COMMENT; ctr = 1;} ; linect++; {if (ctr == 1) BEGIN INITIAL; else ctr--; } {ctr++;}
33
Project #1: Using Lex

You are going to use Lex to read in DFA descriptions (given as context free rules), along with one or more inputs to test on the given DFA. The output of the program you submit will information about whether each of the given inputs is accepted or not.
Any regular language can be expressed using a CFG

Starting with a NFA: For each state Si in the NFA
Create non-terminal Ai If transition (Si,a) = Sk, create production Ai a Ak If transition (SiI = Sk, create production Ai Ak If Si is a final state, create production Ai I If Si is the NFA start state, s = Ai
35
NFA to CFG Example

ab*a
2 b
A1 A2 A2 A3
a A2 b A2 a A3 I
36
Example Input/Output
A --> a B B --> a A A --> aaaaa aaaa aaaaaaaaaa aba String rejected String accepted String accepted String rejected
37
Using Lex
1. Create a lex specification that recognizes the tokens of the input terminals, nonterminals, --> 2. Use Lex states to get the DFA and the input (see next slide) 3. Add code to step 2 to capture the DFA as a table and to then run the DFA on each of the given inputs
DFA to accept input
39

CS540 2 Lecture2

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

CS540 2 Lecture2

Hochgeladen von

Copyright:

Verfügbare Formate

Lecture 2: Lexical Analysis

CS 540 George Mason University

Lexical Analysis - Scanning

Tokens described formally Breaks input into tokens White space

A lexical analyzer is generally a subroutine of parser: Simpler design Efficient Portable

CS 540 Spring 2010 GMU

relation_op = | != | < | > integer string (0-9)* Characters between

Input string: size := r * 32 + c

Implementing a Lexical Analyzer

Capturing Multiple Tokens

Capturing variable names A WS AN

What if both need to happen at the same time?

Capturing Multiple Tokens

Machine is much more complicated just for these two tokens!

Real lexer (handcoded)

CS 540 Spring 2010 GMU

Lex Lexical Analyzer Generator

flex more modern version of lex jflex java version of flex

CS 540 Spring 2010 GMU

Lex Specification (flex)

Lex Specification (jflex)

Rules RE/Action pairs

Lex definitions section

Surrounded by %{ %} delimiters Declare any variables used in actions

Lex Regular Expressions

Match explicit character sequences

CS 540 Spring 2010 GMU

CS 540 Spring 2010 GMU

How Lex works

Lex Matching Rules

CS 540 Spring 2010 GMU

Lex Matching Rules

Lex Example: Extracting white space

-lfl on some systems25

Lex Example: Extracting white space (Java)

name of class to build

CS 540 Spring 2010 GMU

CS 540 Spring 2010 GMU

Lex Example 2: Unix wc

Lex Example 3: Extracting tokens

return(AND); return(ARRAY); return(BEGIN);

\[ := [a-zA-Z][a-zA-Z0-9_]* [+-]?[0-9]+ [ \t\n] %%

return([); return(ASSIGN); return(ID); return(NUM); ;

CS 540 Spring 2010 GMU

Uses for Lex

CS 540 Spring 2010 GMU

Project #1: Using Lex

Any regular language can be expressed using a CFG

CS 540 Spring 2010 GMU

NFA to CFG Example

CS 540 Spring 2010 GMU

DFA to accept input

CS 540 Spring 2009 GMU

Das könnte Ihnen auch gefallen