
UNIT-I LEXICAL ANALYSIS

TOPIC-I (INTRODUCTION TO COMPILING)


Definition of Compilers: Simply stated, a compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language); see Fig. 1. As an important part of this translation process, the compiler reports to its user the presence of errors in the source program.

(Fig. 1. A compiler: the source program goes into the compiler, the target program comes out, and error messages are reported to the user.)

Compilers are sometimes classified as single-pass, multi-pass, load-and-go, debugging, or optimizing, depending on how they have been constructed or on what function they are supposed to perform. Despite this apparent complexity, the basic tasks that any compiler must perform are essentially the same.

Analysis-Synthesis Model of Compilation: The process of compilation has two parts, namely analysis and synthesis.
Analysis: The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program. The analysis part is often called the front end of the compiler.
Synthesis: The synthesis part constructs the desired target program from the intermediate representation. The synthesis part is the back end of the compiler.

Software Tools: Many software tools that manipulate source programs first perform some kind of analysis. Examples of such tools include structure editors, pretty printers, static checkers and interpreters.
Structure Editors: A structure editor takes as input a sequence of commands to build a source program. The structure editor not only performs the text-creation and modification functions of an ordinary text editor, but it also analyzes the program text, putting an appropriate hierarchical structure on the source program. Example: matching while ... do and begin ... end constructs.
Pretty Printers: A pretty printer analyzes a program and prints it in such a way that the structure of the program becomes clearly visible.
Static Checkers: A static checker reads a program, analyzes it, and attempts to discover potential bugs without running the program.
Interpreters: An interpreter executes a high-level-language program (BASIC, command languages, etc.) directly, translating and carrying out each statement as it is encountered rather than producing a separate target program. Interpreters are frequently used to execute command languages.


Examples of Compilers: The following examples work much like a conventional compiler: text formatters, silicon compilers and query interpreters.
Text formatters: A text formatter takes as input a stream of characters that includes commands to indicate paragraphs, figures, etc.
Silicon compilers: Variables represent logical signals (0 and 1) and the output is a circuit design.
Query interpreters: A query interpreter translates a predicate containing relational and Boolean operators into commands to search a database.

TOPIC-2 ANALYSIS OF THE SOURCE PROGRAM

The analysis phase breaks up the source program into constituent pieces and creates an intermediate representation of the source program. Analysis consists of three phases: linear analysis, hierarchical analysis and semantic analysis.
Linear analysis (lexical analysis or scanning): The lexical analysis phase reads the characters in the source program and groups them into tokens, sequences of characters having a collective meaning.
Example: position := initial + rate * 60
Identifiers - position, initial, rate. Assignment symbol - :=. Operators - +, *. Number - 60. Blanks are eliminated.
Hierarchical analysis (syntax analysis or parsing): It involves grouping the tokens of the source program hierarchically into nested collections that are used by the compiler to synthesize output.

(Figure: syntax tree for the statement, with := at the root, id1 as its left child, and + as its right child; + has operands id2 and a * node whose operands are id3 and the constant, converted via inttoreal.)

Semantic analysis: This phase checks the source program for semantic errors and gathers type information for the subsequent code generation phase. An important component of semantic analysis is type checking. Example: int-to-real conversion.


TOPIC-3 THE DIFFERENT PHASES OF A COMPILER


Conceptually, a compiler operates in phases, each of which transforms the source program from one representation to another.

(Fig. 2. Phases of a compiler.)

The first four phases form the bulk of the analysis portion of a compiler: lexical analysis, syntax analysis, semantic analysis and intermediate code generation. The next two phases form the bulk of the synthesis portion: code optimization and code generation. Symbol table management and error handling interact with all six phases.

The Analysis Phase:


I) Lexical Analysis (or Scanner): Consider the statement position := initial + rate * 10. The lexical analysis phase reads the characters in the source program and groups them into a stream of tokens, in which each token represents a logically cohesive sequence of characters, such as an identifier or a keyword. The character sequence forming a token is called the lexeme for the token. Tokens identified during this phase are stored in the symbol table along with their properties, called attributes. The representation of the statement given above after lexical analysis would be: id1 := id2 + id3 * 10
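As a purely illustrative sketch, the token stream above might be held in memory roughly as follows; the token codes and the use of a symbol-table index as the attribute are assumptions, not a fixed layout:

/* One possible in-memory form for the token stream id1 := id2 + id3 * 10.
   Token codes and the symbol-table indices are illustrative assumptions. */
#include <stdio.h>

typedef enum { TOK_ID, TOK_ASSIGN, TOK_PLUS, TOK_STAR, TOK_NUM } TokenType;

typedef struct {
    TokenType type;
    int       attr;   /* symbol-table index for TOK_ID, value for TOK_NUM, else unused */
} Token;

int main(void) {
    /* id1 := id2 + id3 * 10, with ids 1..3 pointing into the symbol table */
    Token stream[] = { {TOK_ID,1}, {TOK_ASSIGN,0}, {TOK_ID,2}, {TOK_PLUS,0},
                       {TOK_ID,3}, {TOK_STAR,0},   {TOK_NUM,10} };
    for (int i = 0; i < 7; i++)
        printf("(%d, %d)\n", stream[i].type, stream[i].attr);
    return 0;
}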


II) Syntax Analysis (or Parser): The tokens from the lexical analyzer are grouped hierarchically into nested collections with collective meaning, producing a parse tree, followed by a syntax tree as output. A syntax tree is a compressed representation of the parse tree in which the operators appear as interior nodes and the operands as child nodes. (Tree: := at the root with children id1 and +; + has children id2 and *; * has children id3 and the number 10.)

III) Semantic Analysis: An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, a binary arithmetic operator may be applied either to a pair of integers or to a pair of floating-point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert the integer into a floating-point number. (Tree: the same structure as above, but with the integer operand converted by an inttoreal node under the * operator.)

IV) Intermediate Code Generation: The intermediate representation should have two important properties: it should be easy to produce, and it should be easy to translate into the target program. Some of the intermediate forms are three-address code, postfix notation, etc. We consider an intermediate form called three-address code, which consists of a sequence of assembly-like instructions with at most three operands per instruction.


Properties of three-address instructions: 1. Each three-address assignment instruction has at most one operator on the right side. 2. The compiler must generate a temporary name to hold the value computed by a three-address instruction. 3. Some three-address instructions may have fewer than three operands. In three-address code, the source program might look like this:
temp1 := inttoreal(10)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
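One plausible in-memory form for such instructions is sketched below; the operator names and the quadruple-style field layout are illustrative assumptions, not the only possible representation:

/* A possible record for one three-address instruction (a quadruple sketch). */
#include <stddef.h>

typedef enum { OP_ASSIGN, OP_ADD, OP_MUL, OP_INTTOREAL } Op;

typedef struct {
    Op          op;
    const char *arg1;    /* first operand, e.g. "id3" or a temporary        */
    const char *arg2;    /* second operand; NULL for unary ops (property 3) */
    const char *result;  /* compiler-generated temporary name (property 2)  */
} TAC;

/* the four instructions shown above, as data */
static const TAC code[] = {
    { OP_INTTOREAL, "10",    NULL,    "temp1" },
    { OP_MUL,       "id3",   "temp1", "temp2" },
    { OP_ADD,       "id2",   "temp2", "temp3" },
    { OP_ASSIGN,    "temp3", NULL,    "id1"   },
};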


The Synthesis Phase:
V) Code Optimization: The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result. Some optimizations are trivial. There is a great variation in the amount of code optimization different compilers perform. In those that do the most, called optimizing compilers, a significant fraction of the compiler's time is spent on this phase. Two optimization techniques:
i) Local optimization: elimination of common subexpressions, copy propagation.
ii) Loop optimization: finding loop-invariant computations and moving them out of the loop.
The output will look like this:
temp1 := id3 * 10.0
id1 := id2 + temp1
VI) Code Generation: The code generator takes as input an intermediate representation of the source program and maps it into the target language. If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then the intermediate instructions are translated into sequences of machine instructions that perform the same task.
MOVF id3, R2
MULF #10.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Symbol Table Management

A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly. When an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table.
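A minimal sketch of such a table, assuming a hashed, chained organization and illustrative field names (a real compiler stores many more attributes per record):

/* Symbol table sketch: one record per identifier, located by hashing the name. */
#include <string.h>
#include <stdlib.h>

#define TABLE_SIZE 211

typedef struct Symbol {
    char          *name;      /* the lexeme */
    int            type;      /* attribute: data type, filled in by later phases */
    struct Symbol *next;      /* chaining for hash collisions */
} Symbol;

static Symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* look the name up; insert it the first time the lexical analyzer sees it */
Symbol *lookup_or_insert(const char *name) {
    unsigned h = hash(name);
    for (Symbol *p = table[h]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    Symbol *p = malloc(sizeof *p);
    p->name = strdup(name);   /* strdup is POSIX; copy manually if unavailable */
    p->type = 0;
    p->next = table[h];
    table[h] = p;
    return p;
}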


Error Handler

Each phase can encounter errors, and a feature of the compiler is to detect and report them:
1. Lexical analysis --- characters may be misspelled.
2. Syntax analysis --- the structure of a statement violates the rules of the language.
3. Semantic analysis --- the operation involved has no meaning.
4. Intermediate code generation --- operands have incompatible data types.
5. Code optimization --- certain statements may never be reached.
6. Code generation --- a constant is too long.
7. Symbol table --- multiply declared variables.

Fig. 3 below shows the output of each phase for the statement position := initial + rate * 10.

Input: position := initial + rate * 10

Lexical analyzer output: id1 := id2 + id3 * 10

Syntax analyzer output: a tree with := at the root and children id1 and +; + has children id2 and *; * has children id3 and 10.

Semantic analyzer output: the same tree, with 10 converted by an inttoreal node under *.

Intermediate code generator output:
temp1 := inttoreal(10)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code optimizer output:
temp1 := id3 * 10.0
id1 := id2 + temp1

Code generator output:
MOVF id3, R2
MULF #10.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

Fig. 3. Output of the phases of the compiler.


TOPIC-4 COUSINS OF THE COMPILER


Definition: The cousins of the compiler are the programs that make up the context in which a compiler typically operates. The cousins of the compiler are: the preprocessor, the assembler, and the loader and link-editor.
I) Preprocessor: A preprocessor is a program that processes its input data to produce output that is used as input to another program. The preprocessor is executed before the actual compilation of code begins. Preprocessors may perform the following functions: 1. macro processing, 2. file inclusion, 3. "rational" preprocessing, 4. language extension.
1. Macro processing: A macro is a rule or pattern that specifies how a certain input sequence (often a sequence of characters) should be mapped to an output sequence (also often a sequence of characters) according to a defined procedure. Macro definitions (#define, #undef): to define preprocessor macros we can use #define. Its format is: #define identifier replacement

When the preprocessor encounters this directive, it replaces any occurrence of identifier in the rest of the code by replacement.
Example:
#define TABLE_SIZE 100
int table1[TABLE_SIZE];
After the preprocessor has replaced TABLE_SIZE, the code becomes equivalent to:
int table1[100];
2. File Inclusion:


The preprocessor includes header files into the program text. When the preprocessor finds an #include directive it replaces it by the entire content of the specified file. There are two ways to specify a file to be included: #include "file" and #include <file>. The only difference between the two forms is the places (directories) where the compiler is going to look for the file. In the first case, where the file name is specified between double quotes, the file is searched for first in the same directory as the file containing the directive. If it is not there, the compiler searches for the file in the default directories where it is configured to look for the standard header files. If the file name is enclosed in angle brackets <>, the file is searched for directly in the directories where the compiler is configured to look for the standard header files. Therefore, standard header files are usually included in angle brackets, while other specific header files are included using quotes.
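For instance (the user header name below is only illustrative):

#include <stdio.h>      /* angle brackets: searched only in the standard include directories */
#include "symbols.h"    /* quotes: searched first in the directory of the including file */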


3. "Rational" preprocessors: These processors augment older languages with more modern flow-of-control and data-structuring facilities. For example, such a preprocessor might provide the user with built-in macros for constructs like while-statements or if-statements, where none exist in the programming language itself.
4. Language extension: These processors attempt to add capabilities to the language by what amounts to built-in macros. For example, the language Equel is a database query language embedded in C; statements beginning with ## are taken by the preprocessor to be database-access statements.
II) Assembler: Typically a modern assembler creates object code by translating assembly instruction mnemonics into opcodes, and by resolving symbolic names for memory locations and other entities. There are two types of assemblers, based on how many passes through the source are needed to produce the executable program: one-pass and two-pass. A one-pass assembler goes through the source code once and assumes that all symbols will be defined before any instruction that references them. A two-pass assembler creates a table with all symbols and their values in the first pass, then uses the table in a second pass to generate code.
III) Linkers and Loaders: The process of loading consists of taking relocatable machine code, altering the relocatable addresses and placing the altered instructions and data in memory at the proper locations. A linker or link editor is a program that takes one or more objects generated by a compiler and combines them into a single executable program.

TOPIC-5 COMPILER-CONSTRUCTION TOOLS

Definition: These tools have been developed to help implement the various phases of a compiler. Such systems have often been referred to as compiler-compilers, compiler-generators or translator-writing systems. Some commonly used compiler-construction tools are: scanner generators, parser generators, syntax-directed translation engines, data-flow engines and automatic code generators.
I) Scanner generators: Input (regular expressions) ---> Output (lexical analyzer)


- Automatically generate lexical analyzers from a specification based on regular expressions.
- The basic organization of the resulting lexical analyzer is a finite automaton.
II) Parser generators: Input (context-free grammar) ---> Output (syntax analyzer)

- Produce syntax analyzers from input based on a context-free grammar.
- Many parser generators utilize powerful parsing algorithms that are too complex to be carried out by hand.
III) Syntax-directed translation engines: Input (parse or syntax tree) ---> Output (intermediate code)
- Produce collections of routines that walk a parse tree and generate intermediate code.
- The basic idea is that one or more translations are associated with each node of the parse tree.
- Each translation is defined in terms of the translations at its neighbouring nodes in the tree.
IV) Data-flow analysis engines: Input (intermediate code) ---> Output (optimized code)
- Gather information about how values are transmitted from one part of a program to each other part.
- Data-flow analysis is a key part of the code optimization performed on intermediate code.
V) Automatic code generators:

Input (optimized code) ---> Output (object code)
- The tool takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for a target machine.
- The rules must include sufficient detail to handle the different possible access methods for data.


TOPIC-6 GROUPING OF PHASES

Definition: Activities from more than one phase are often grouped together. The phases are collected into a front end and a back end.
Front and Back Ends:
Front End: The front end consists of those phases, or parts of phases, that depend primarily on the source language and are largely independent of the target machine. Lexical and syntactic analysis, the creation of the symbol table, semantic analysis and the generation of intermediate code are included. A certain amount of code optimization can also be done by the front end. The front end also includes the error handling that goes along with each of these phases.


Back End: The back end includes those portions of the compiler that depend on the target machine; these portions generally do not depend on the source language. Here we find aspects of the code optimization phase and code generation, along with the necessary error handling and symbol-table operations.
Passes:

Several phases of compilation are usually implemented in a single pass consisting of reading an input file and writing an output file. It is common for several phases to be grouped into one pass, and for the activity of these phases to be interleaved during the pass. For example, lexical analysis, syntax analysis, semantic analysis and intermediate code generation might be grouped into one pass. If so, the token stream after lexical analysis may be translated directly into intermediate code.

Reducing the number of passes: It is desirable to have relatively few passes, since it takes time to read and write intermediate files. However, if we group several phases into one pass, we may be forced to keep the entire program in memory, because one phase may need information in a different order than a previous phase produces it.


CHAPTER-2 TOPIC-I THE ROLE OF THE LEXICAL ANALYZER

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. As shown in the figure, upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.

(Fig 4. Interaction of lexical analyzer with parser: the lexical analyzer reads the source program and hands one token to the parser each time the parser issues a "get next token" request; both consult the symbol table.)
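A minimal sketch of this pull-style interaction is given below; the token codes, the name get_next_token and the canned token array are illustrative stand-ins for a real scanner, not part of any particular compiler:

/* The parser drives the lexer: each call to get_next_token() returns one token.
   A canned array stands in for real scanning, purely to make the control flow visible. */
#include <stdio.h>

enum { TOK_EOF, TOK_ID, TOK_ASSIGN, TOK_PLUS, TOK_STAR, TOK_NUM };

static int canned[] = { TOK_ID, TOK_ASSIGN, TOK_ID, TOK_PLUS,
                        TOK_ID, TOK_STAR, TOK_NUM, TOK_EOF };
static int pos = 0;

int get_next_token(void) { return canned[pos++]; }   /* stub lexical analyzer */

void parse(void) {
    int tok;
    while ((tok = get_next_token()) != TOK_EOF)
        printf("parser received token %d\n", tok);   /* syntax analysis would go here */
}

int main(void) { parse(); return 0; }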

Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab, and newline characters. Another is correlating error messages from the compiler with the source program. Sometimes lexical analyzers are divided into a cascade of two phases, the first called scanning and the second lexical analysis.
Issues in Lexical Analysis

There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing. 1) Simpler design is the most important consideration; the separation of lexical analysis from syntax analysis often allows us to simplify one or the other of these phases. 2) Compiler efficiency is improved. 3) Compiler portability is enhanced.
Tokens, Patterns and Lexemes: There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. The pattern is said to match each string in the set. A lexeme is a sequence of characters in the source program that is matched by the pattern for the token. For example, in the Pascal statement const pi = 3.1416; the substring pi is a lexeme for the token identifier. In most programming languages, the following constructs are treated as tokens: keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as parentheses, commas, and semicolons.


TOKEN      SAMPLE LEXEMES           INFORMAL DESCRIPTION OF PATTERN
const      const                    const
if         if                       if
relation   <, <=, =, <>, >, >=      < or <= or = or <> or >= or >
id         pi, count, D2            letter followed by letters and digits
num        3.1416, 0, 6.02E23       any numeric constant
literal    "core dumped"            any characters between " and " except "

In the example, when the character sequence pi appears in the source program, the token representing an identifier is returned to the parser. The returning of a token is often implemented by passing an integer corresponding to the token; it is this integer that is referred to as the boldface id in the above table. A pattern is a rule describing the set of lexemes that can represent a particular token in the source program. The pattern for the token const in the above table is just the single string const that spells out the keyword. Certain language conventions impact the difficulty of lexical analysis. Languages such as FORTRAN require certain constructs in fixed positions on the input line, so the alignment of a lexeme may be important in determining the correctness of a source program.

Attributes of Tokens: The lexical analyzer returns to the parser a representation of the token it has found. The representation is an integer code if the token is a simple construct such as a left parenthesis, comma, or colon. The representation is a pair consisting of an integer code and a pointer to a table if the token is a more complex element such as an identifier or a constant. The integer code gives the token type; the pointer points to the value of that token. Pairs are also returned whenever we wish to distinguish between instances of a token.

Input Buffering: The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure below shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered; a lookahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice, each buffering scheme adopts one convention: either a pointer is at the symbol last read, or at the symbol it is ready to read.


(Figure: the two-halves input buffer, with one pointer at the token beginning and a lookahead pointer scanning ahead.)
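A minimal sketch of such a two-halves scheme is given below; the half size N, the reload policy and the omission of end-of-file and long-lexeme handling are simplifying assumptions:

/* Two-halves input buffer: forward is the lookahead pointer, lexeme_beginning
   marks the start of the token being scanned (used to extract the lexeme). */
#include <stdio.h>

#define N 100                          /* size of each buffer half */

static char buf[2 * N];
static char *lexeme_beginning = buf;   /* start of the current token */
static char *forward = buf;            /* the lookahead pointer */
static FILE *src;                      /* the source file, assumed already opened */

void init_buffer(FILE *f) {            /* load the first half before scanning starts */
    src = f;
    fread(buf, 1, N, src);
}

/* advance the lookahead pointer, reloading the half we are about to enter */
int next_char(void) {
    if (forward == buf + N)            /* about to enter the second half */
        fread(buf + N, 1, N, src);
    else if (forward == buf + 2 * N) { /* about to wrap back to the first half */
        fread(buf, 1, N, src);
        forward = buf;
    }
    return *forward++;                 /* end-of-file handling omitted in this sketch */
}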

The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see: DECLARE (ARG1, ARG2, ..., ARGn)


without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the figure above is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead travelled to the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that the amount of lookahead is limited.

Specification of Tokens: Strings and Languages. The term alphabet or character class denotes any finite set of symbols. Typical examples of symbols are letters and characters; the set {0, 1} is the binary alphabet. A string over some alphabet is a finite sequence of symbols drawn from that alphabet. In language theory, the terms sentence and word are often used as synonyms for the term string. The term language denotes any set of strings over some fixed alphabet. This definition is very broad: abstract languages like the empty set, or the set containing only the empty string, are languages under this definition. Certain terms for parts of a string are prefix, suffix, substring and subsequence. There are several important operations, such as union, concatenation and closure, that can be applied to languages.

Regular Expressions

In Pascal, an identifier is a letter followed by zero or more letters or digits. Regular expressions allow us to define precisely sets such as this. With this notation, Pascal identifiers may be defined as letter (letter | digit)*. The vertical bar here means "or", the parentheses are used to group subexpressions, the star means "zero or more instances of" the parenthesized expression, and the juxtaposition of letter with the remainder of the expression means concatenation. A regular expression is built up out of simpler regular expressions using a set of defining rules. Each regular expression r denotes a language L(r). The defining rules specify how L(r) is formed by combining in various ways the languages denoted by the subexpressions of r. Unnecessary parentheses can be avoided in regular expressions if we adopt the conventions that:
1. the unary operator * has the highest precedence and is left associative,
2. concatenation has the second highest precedence and is left associative,
3. | has the lowest precedence and is left associative.


Regular Definitions: If S is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 -> r1
d2 -> r2
...
dn -> rn
where each di is a distinct name, and each ri is a regular expression over the symbols in S together with {d1, d2, ..., di-1}, i.e. the basic symbols and the previously defined names. Example: the set of Pascal identifiers is the set of strings of letters and digits beginning with a letter. The regular definition for the set is
letter -> A | B | ... | Z | a | b | ... | z
digit -> 0 | 1 | 2 | ... | 9


id -> letter ( letter | digit )*
Unsigned numbers in Pascal are strings such as 5280, 56.77, 6.25E4, etc. The following regular definition provides a precise specification for this class of strings:
digit -> 0 | 1 | 2 | ... | 9
digits -> digit digit*
This definition says that digit can be any numeral from 0 to 9, while digits is a digit followed by zero or more occurrences of a digit.
Notational Shorthands: Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them. 1. One or more instances: the unary postfix operator + means "one or more instances of". 2. Zero or one instance: the unary postfix operator ? means "zero or one instance of"; the notation r? is a shorthand for r | epsilon. 3. Character classes: the notation [abc], where a, b, and c are alphabet symbols, denotes the regular expression a|b|c. An abbreviated character class such as [a-z] denotes the regular expression a|b|c|...|z.
Recognition of Tokens: The question of how to recognize the tokens is handled in this section. The language generated by the following grammar is used as an example. Consider the following grammar fragment:
stmt -> if expr then stmt | if expr then stmt else stmt | epsilon
expr -> term relop term | term
term -> id | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions:
if -> if
then -> then
else -> else
relop -> < | <= | = | <> | > | >=
id -> letter ( letter | digit )*
num -> digit+ (. digit+)? (E (+ | -)? digit+)?
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as identifiers. Unsigned integer and real numbers of Pascal are represented by num.
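These definitions can be tried out directly: the sketch below, which assumes nothing beyond the POSIX regex library, transcribes the id pattern letter ( letter | digit )* into an extended regular expression and tests a few lexemes against it:

/* Testing the id regular definition with POSIX regcomp/regexec. */
#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    const char *tests[] = { "count", "D2", "2abc", "rate" };

    /* ^ and $ anchor the match so the whole string must fit the pattern */
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED) != 0)
        return 1;

    for (int i = 0; i < 4; i++)
        printf("%-5s : %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "id" : "not an id");

    regfree(&re);
    return 0;
}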


In addition, we assume lexemes are separated by white space, consisting of non-null sequences of blanks, tabs and newlines. Our lexical analyzer will strip out white space. It will do so by comparing a string against the regular definition ws below:
delim -> blank | tab | newline
ws -> delim+
If a match for ws is found, the lexical analyzer does not return a token to the parser. Rather, it proceeds to find a token following the white space and returns that to the parser. Our goal is to construct a lexical analyzer that will isolate the lexeme for the next token in the input buffer and produce as output a pair consisting of the appropriate token and attribute value, using the translation table given below. The attribute values for the relational operators are given by the symbolic constants LT, LE, EQ, NE, GT, GE.


REGULAR EXPRESSION   TOKEN    ATTRIBUTE VALUE
ws                   -        -
if                   if       -
then                 then     -
else                 else     -
id                   id       pointer to table entry
num                  num      pointer to table entry
<                    relop    LT
<=                   relop    LE
=                    relop    EQ
<>                   relop    NE
>                    relop    GT
>=                   relop    GE

Transition Diagrams: A transition diagram is a stylized flowchart. Transition diagrams are used to keep track of information about characters that are seen as the forward pointer scans the input. We do so by moving from position to position in the diagram as characters are read. Positions in a transition diagram are drawn as circles and are called states. The states are connected by arrows, called edges. Edges leaving state s have labels indicating the input characters that can next appear after the transition diagram has reached state s. The label "other" refers to any character that is not indicated by any of the other edges leaving s. One state is labeled the start state; it is the initial state of the transition diagram, where control resides when we begin to recognize a token. Certain states may have actions that are executed when the flow of control reaches that state. On entering a state we read the next input character; if there is an edge from the current state whose label matches this input character, we go to the state pointed to by the edge. Otherwise we indicate failure. A transition diagram for >= is shown in the figure.
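As an illustration of how such a diagram is turned into code, here is a minimal C sketch of the transition diagram for the relational operators (the branch reached after seeing > corresponds to the >= diagram mentioned above). The token codes, the relop()/len convention and the treatment of "retract" as simply consuming one character fewer are illustrative assumptions:

/* Transition-diagram recognizer for relop. States mirror the diagram:
   state 0 is the start state, state 1 has seen '<', state 2 has seen '>'. */
#include <stdio.h>

enum { LT, LE, EQ, NE, GT, GE, FAIL };

int relop(const char *s, int *len) {   /* s points at the start of the lexeme */
    int state = 0;
    for (int i = 0; ; i++) {
        char c = s[i];
        switch (state) {
        case 0:
            if (c == '<') state = 1;
            else if (c == '=') { *len = 1; return EQ; }
            else if (c == '>') state = 2;
            else return FAIL;
            break;
        case 1:                          /* have seen '<' */
            if (c == '=') { *len = 2; return LE; }
            if (c == '>') { *len = 2; return NE; }
            *len = 1; return LT;         /* "other": retract one character */
        case 2:                          /* have seen '>' */
            if (c == '=') { *len = 2; return GE; }
            *len = 1; return GT;         /* "other": retract one character */
        }
    }
}

int main(void) {
    int n;
    int code = relop(">=", &n);
    printf("token relop, attribute %d, lexeme length %d\n", code, n); /* GE, 2 */
    return 0;
}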

Lex: A Lexical Analyzer Generator. Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. Lex source is a table of regular expressions and corresponding program fragments. The table is translated into a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized, the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream. The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.
Introduction. Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string


matching, and produces a program in a general-purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source given to Lex. The Lex-written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed. Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called host languages. Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations. Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected.

Source -> Lex -> yylex
Input -> yylex -> Output
An overview of Lex

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines:
%%
[ \t]+$   ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate a character class made of blank and tab; the + indicates "one or more ..."; and the $ indicates "end of line", as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:


%%
[ \t]+$   ;
[ \t]+    printf(" ");
The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs. Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc. Lex programs


recognize only regular expressions; Yacc writes parsers that accept a large class of context-free grammars, but requires a lower-level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown below. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.

lexical rules -> Lex -> yylex
grammar rules -> Yacc -> yyparse
Input -> yylex -> (tokens) -> yyparse -> Parsed input
Lex with Yacc

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so the use of this name by Lex simplifies interfacing. Lex generates a deterministic finite automaton from the regular expressions in the source. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex. In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine. Lex is not limited to source which can be interpreted on the basis of one-character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd. Such backup is more costly than the processing of simpler languages.


Lex Source. The general format of Lex source is:
{definitions}
%%
{rules}
%%
{user subroutines}
where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is
%%
(no definitions, no rules), which translates into a program that copies the input to the output unchanged. In the outline of Lex programs shown above, the rules represent the user's control decisions; they are a table in which the left column contains regular expressions and the right column contains actions to be executed when the expressions are recognized. Thus an individual rule might appear as
integer   printf("found keyword INT");
to look for the string integer in the input stream and print the message "found keyword INT" whenever it appears. In this example the host procedural language


is C and the C library function printf is used to print the string. The end of the expression is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as
colour      printf("color");
mechanise   printf("mechanize");
petrol      printf("gas");
would be a start.

Lex Regular Expressions. A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression integer matches the string integer wherever it appears, and the expression a57D looks for the string a57D.

Metacharacter    Matches
.                any character except newline
\n               newline
*                zero or more copies of the preceding expression
+                one or more copies of the preceding expression
?                zero or one copy of the preceding expression
^                beginning of line
$                end of line
a|b              a or b
(ab)+            one or more copies of ab (grouping)
"a+b"            literal a+b (C escapes still work)
[]               character class

Expression       Matches
abc              abc
abc*             ab, abc, abcc, abccc, ...
abc+             abc, abcc, abccc, ...
a(bc)+           abc, abcbc, abcbcbc, ...
a(bc)?           a, abc
[abc]            a, b, c
[a-z]            any letter, a through z
[a\-z]           a, -, z
[-az]            -, a, z
[A-Za-z0-9]+     one or more alphanumeric characters
[ \t\n]+         whitespace
[^ab]            anything except: a, b
[a^b]            a, ^, b
[a|b]            a, |, b
a|b              a or b

Name             Function
int yylex(void)  call to invoke lexer, returns token
char *yytext     pointer to matched string
yyleng           length of matched string
yylval           value associated with token
int yywrap(void) wrap-up, return 1 if done, 0 if not done
FILE *yyout      output file
FILE *yyin       input file
INITIAL          initial start condition
BEGIN            condition switch start condition
ECHO             write matched string

A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise. We compile a regular expression into a recognizer by constructing a transition diagram called a finite automaton. A finite automaton can be deterministic or nondeterministic, where nondeterministic means that more than one transition out of a state may be possible on the same input symbol. DFAs are faster recognizers than NFAs, but can be much bigger than equivalent NFAs.
Nondeterministic finite automata: A mathematical model consisting of:
1) a set of states S
2) an input alphabet
3) a transition function
4) an initial (start) state
5) a set of final (accepting) states

Transition table:

STATE    INPUT SYMBOL a    INPUT SYMBOL b
0        {0,1}             {0}
1        -                 {2}
2        -                 {3}

Deterministic finite automata

A special case of an NFA in which:
1) no state has an epsilon-transition, and
2) for each state s and input symbol a, there is at most one edge labeled a leaving s.

Conversion of an NFA to a DFA: the subset construction algorithm.
Input: an NFA N.
Output: an equivalent DFA D.
Method:

Operations on NFA states:

OPERATION             DESCRIPTION
epsilon-closure(s)    set of NFA states reachable from NFA state s on epsilon-transitions alone
epsilon-closure(T)    set of NFA states reachable from some NFA state s in T on epsilon-transitions alone
move(T, a)            set of NFA states to which there is a transition on input symbol a from some NFA state s in T

Subset construction:
Initially, epsilon-closure(s0) is the only state in D-states, and it is unmarked.
while there is an unmarked state T in D-states do begin
    mark T;
    for each input symbol a do begin
        U := epsilon-closure(move(T, a));
        if U is not in D-states then
            add U as an unmarked state to D-states;
        Dtran[T, a] := U
    end
end
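A compact C sketch of this algorithm is given below. It assumes the NFA has at most 32 states, so a set of NFA states fits in one unsigned bitmask; the table names nfa, eps, Dstates and dtran mirror the pseudocode and are otherwise arbitrary. The main function fills in the small NFA from the transition table shown earlier (an assumption about that example, which has no epsilon edges).

/* Subset construction over bitmask state sets (at most 32 NFA states). */
#include <stdio.h>

#define NSTATES 32
#define NSYMS   2                      /* alphabet {a, b} for the example */

unsigned nfa[NSTATES][NSYMS];          /* nfa[s][c]: states reachable from s on symbol c */
unsigned eps[NSTATES];                 /* eps[s]: states reachable from s on epsilon */

/* epsilon-closure(T): repeatedly add states reachable on epsilon edges */
unsigned eps_closure(unsigned T) {
    unsigned closure = T, prev;
    do {
        prev = closure;
        for (int s = 0; s < NSTATES; s++)
            if (closure & (1u << s)) closure |= eps[s];
    } while (closure != prev);
    return closure;
}

/* move(T, c): union of nfa[s][c] over all s in T */
unsigned move_on(unsigned T, int c) {
    unsigned U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) U |= nfa[s][c];
    return U;
}

/* The while-loop of the algorithm: Dstates holds DFA states (each a set of NFA
   states), the index 'marked' separates processed from unmarked states, and
   dtran is the resulting DFA transition table. Returns the DFA state count. */
int subset_construction(unsigned start_closure,
                        unsigned Dstates[], int dtran[][NSYMS]) {
    int ndfa = 0, marked = 0;
    Dstates[ndfa++] = start_closure;
    while (marked < ndfa) {
        unsigned T = Dstates[marked];
        for (int c = 0; c < NSYMS; c++) {
            unsigned U = eps_closure(move_on(T, c));
            int j;
            for (j = 0; j < ndfa; j++)           /* is U already a DFA state? */
                if (Dstates[j] == U) break;
            if (j == ndfa) Dstates[ndfa++] = U;  /* add U as an unmarked state */
            dtran[marked][c] = j;
        }
        marked++;
    }
    return ndfa;
}

int main(void) {
    /* the NFA from the transition table above, symbol 0 = a, symbol 1 = b */
    nfa[0][0] = (1u << 0) | (1u << 1);   /* state 0 on a -> {0,1} */
    nfa[0][1] = 1u << 0;                 /* state 0 on b -> {0}   */
    nfa[1][1] = 1u << 2;                 /* state 1 on b -> {2}   */
    nfa[2][1] = 1u << 3;                 /* state 2 on b -> {3}   */

    unsigned D[64]; int dtran[64][NSYMS];
    int n = subset_construction(eps_closure(1u << 0), D, dtran);
    printf("DFA has %d states\n", n);    /* prints 4 for this NFA */
    return 0;
}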

CONSTRUCTION OF AN NFA FROM A REGULAR EXPRESSION

Thompson's Construction: To convert a regular expression r over an alphabet into an NFA N accepting L(r), parse r into its constituent sub-expressions and construct NFAs for each of the basic symbols in r. For epsilon, construct the NFA


Here i is a new start state and f is a new accepting state. This NFA recognizes {epsilon}. For a symbol a in the alphabet, construct the NFA


Again i is a new start state and f is a new accepting state. This NFA accepts {a}. If a occurs several times in r, then a separate NFA is constructed for each occurrence. Keeping the syntactic structure of the regular expression in mind, combine these NFAs inductively until the NFA for the entire expression is obtained. Each intermediate NFA produced during the course of the construction corresponds to a sub-expression of r and has several important properties: it has exactly one final state, no edge enters the start state, and no edge leaves the final state. Suppose N(s) and N(t) are NFAs for regular expressions s and t. (a) For the regular expression s|t, construct the following composite NFA N(s|t):

(b) For the regular expression st, construct the composite NFA N(st) :


(c) For the regular expression s* , construct the composite NFA N(s*) :

(d) For the parenthesized regular expression (s), use N(s) itself as the NFA.
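Putting the pieces together, the C sketch below builds these NFA fragments programmatically. The State/Frag types, the names symbol/concat/alternate/star, and the use of an extra epsilon edge (rather than merging states) in concatenation are implementation choices of this sketch, not prescribed by the construction itself:

/* Thompson-style NFA fragments: each state has at most two out-edges.
   EPS marks an epsilon edge; an unused edge has a NULL target. */
#include <stdlib.h>

#define EPS (-1)

typedef struct State {
    int label1, label2;          /* edge labels: EPS or a character code */
    struct State *out1, *out2;   /* edge targets (NULL if unused)        */
    int accepting;
} State;

typedef struct { State *start, *accept; } Frag;

static State *new_state(void) { return calloc(1, sizeof(State)); }

/* NFA for a single symbol a: start --a--> accept */
Frag symbol(int a) {
    Frag f = { new_state(), new_state() };
    f.start->label1 = a; f.start->out1 = f.accept;
    f.accept->accepting = 1;
    return f;
}

/* N(st): link the accept state of N(s) to the start of N(t) by an epsilon edge */
Frag concat(Frag s, Frag t) {
    s.accept->accepting = 0;
    s.accept->label1 = EPS; s.accept->out1 = t.start;
    return (Frag){ s.start, t.accept };
}

/* N(s|t): new start with epsilon edges to both fragments, both feed a new accept */
Frag alternate(Frag s, Frag t) {
    Frag f = { new_state(), new_state() };
    f.start->label1 = EPS; f.start->out1 = s.start;
    f.start->label2 = EPS; f.start->out2 = t.start;
    s.accept->accepting = t.accept->accepting = 0;
    s.accept->label1 = EPS; s.accept->out1 = f.accept;
    t.accept->label1 = EPS; t.accept->out1 = f.accept;
    f.accept->accepting = 1;
    return f;
}

/* N(s*): epsilon edges allow zero passes or looping back through N(s) */
Frag star(Frag s) {
    Frag f = { new_state(), new_state() };
    f.start->label1 = EPS; f.start->out1 = s.start;
    f.start->label2 = EPS; f.start->out2 = f.accept;   /* zero occurrences */
    s.accept->accepting = 0;
    s.accept->label1 = EPS; s.accept->out1 = s.start;  /* loop back */
    s.accept->label2 = EPS; s.accept->out2 = f.accept;
    f.accept->accepting = 1;
    return f;
}

int main(void) {
    /* build the NFA for (a|b)*abb bottom-up, as the text describes */
    Frag r = concat(concat(concat(star(alternate(symbol('a'), symbol('b'))),
                                  symbol('a')), symbol('b')), symbol('b'));
    return r.accept->accepting ? 0 : 1;
}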

