
UNIT 3

Compiler

3.1 Lexical analysis


The word lexical in the traditional sense means pertaining to words. In terms of
programming languages, words are entities like variable names, numbers, keywords etc. Such
words are traditionally called tokens (or lexemes).
A lexical analyser, or lexer for short, takes as its input a string of individual characters and divides
this string into tokens. Additionally, it filters out whatever separates the tokens (the so-called
white-space), i.e., layout characters (spaces, newlines etc.) and comments.
The main purpose of lexical analysis is to make the task of the subsequent syntax-analysis
phase easier. In theory, the work that is done during lexical analysis can be made an integral part of
syntax analysis, and in simple systems this is indeed often done. However, there are reasons for
keeping the phases separate:

Efficiency: A lexer may do the simple parts of the work faster than the more general parser can.
Furthermore, the size of a system that is split in two may be smaller than a combined system. It
is usually not terribly difficult to write a lexer by hand. However, a hand-written lexer may be
complex and difficult to maintain. Hence, lexers are normally constructed by lexer generators,
which transform human-readable specifications of tokens and white-space into efficient
programs. For lexical analysis, specifications are traditionally written using regular expressions,
and the generated lexers belong to a class of extremely simple programs called finite automata.
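
As a rough illustration (not part of the original text), the following Python sketch mimics what a
generated lexer does: it is driven by a small table of token names and regular expressions, and it
filters out white-space. The token names and patterns are assumptions chosen for this example,
and instead of building an explicit finite automaton the sketch leans on Python's re module.

import re

# Hypothetical token specification: (token name, regular expression).
# A real lexer generator would compile these into a single finite automaton.
TOKEN_SPEC = [
    ("NUM",    r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"[ \t\n]+"),   # white-space, filtered out below
]

MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Split the input string into (token name, lexeme) pairs."""
    for match in MASTER_RE.finditer(text):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(lex("position = initial + rate * 60")))
# [('ID', 'position'), ('ASSIGN', '='), ('ID', 'initial'), ('PLUS', '+'),
#  ('ID', 'rate'), ('TIMES', '*'), ('NUM', '60')]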

3.2 Syntax analysis


The syntax analysis phase of a compiler will take a string of tokens produced by the lexer, and
from this construct a syntax tree for the string by finding a derivation of the string from the start
symbol of the grammar. Where lexical analysis splits the input into tokens, the purpose of syntax
analysis (also known as parsing) is to recombine these tokens. Not back into a list of characters,
but into something that reflects the structure of the text. As the name indicates, this is a tree
structure. The leaves of this tree are the tokens found by the lexical analysis, and if the leaves are
read from left to right, the sequence is the same as in the input text. Hence, what is important in
the syntax tree is how these leaves are combined to form the structure of the tree and how the
interior nodes of the tree are labeled. In addition to finding the structure of the input text, the
syntax analysis must also reject invalid texts by reporting syntax errors.
This can be done by guessing derivations until the right one is found, but random guessing is
hardly an effective method. Even so, some parsing techniques are based on guessing
derivations. However, these make sure, by looking at the string, that they will always guess right.
These are called predictive parsing methods. Predictive parsers always build the syntax tree from
the root down to the leaves and are hence also called (deterministic) top-down parsers.
Other parsers go the other way: they search for parts of the input string that match the right-hand
sides of productions and rewrite these to the left-hand-side nonterminals, at the same time building
pieces of the syntax tree. The syntax tree is eventually completed when the string has been
rewritten (by inverse derivation) to the start symbol. Here, too, we wish to make sure that we
always pick the right rewrites, so that parsing is deterministic. Such methods are called
bottom-up parsing methods.
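
As a minimal sketch of predictive (top-down) parsing, the following Python fragment parses a tiny
assignment grammar by recursive descent. The grammar, the token format and the tuple-based tree
representation are assumptions made for this illustration; they are not the grammar used later in
this unit.

# Illustrative grammar (an assumption):
#   Stat   -> id = Exp
#   Exp    -> Term { + Term }
#   Term   -> Factor { * Factor }
#   Factor -> id | num
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def eat(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        pos += 1
        return tokens[pos - 1]

    def factor():
        if peek() in ("ID", "NUM"):
            return eat(peek())
        raise SyntaxError("expected identifier or number")

    def term():                          # Term -> Factor { * Factor }
        node = factor()
        while peek() == "TIMES":
            eat("TIMES")
            node = ("*", node, factor())
        return node

    def exp():                           # Exp -> Term { + Term }
        node = term()
        while peek() == "PLUS":
            eat("PLUS")
            node = ("+", node, term())
        return node

    target = eat("ID")                   # Stat -> id = Exp
    eat("ASSIGN")
    tree = ("=", target, exp())
    if peek() is not None:
        raise SyntaxError("trailing input")
    return tree

tokens = [("ID", "position"), ("ASSIGN", "="), ("ID", "initial"),
          ("PLUS", "+"), ("ID", "rate"), ("TIMES", "*"), ("NUM", "60")]
print(parse(tokens))
# ('=', ('ID', 'position'),
#       ('+', ('ID', 'initial'), ('*', ('ID', 'rate'), ('NUM', '60'))))

The parser always knows which production to use by looking only at the next token, which is what
makes it predictive; the tree is built from the root downwards as the recursive calls unfold.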

3.3 Semantic analysis:


The semantic analysis phase deals with the following tasks (a small illustrative sketch follows the list):

Type checking of expressions
Type checking of function declarations
Verification of assignment operations
Handling of language features such as polymorphism, type conversion etc.
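
The following Python sketch illustrates the flavour of these checks on a tuple-shaped expression
tree. The symbol table contents and the type rules are assumptions made up for this example; a
real semantic analyser would be driven by the actual language definition.

# Hypothetical symbol table: name -> declared type.
SYMBOL_TABLE = {"position": "float", "initial": "float", "rate": "float"}

def type_of(node):
    """Return the type of an expression node, raising an error on a mismatch."""
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "id":
        if node[1] not in SYMBOL_TABLE:
            raise TypeError(f"undeclared identifier {node[1]}")
        return SYMBOL_TABLE[node[1]]
    op, left, right = node
    lt, rt = type_of(left), type_of(right)
    if lt != rt:                     # type conversions (coercions) are illustrated later in the unit
        raise TypeError(f"operands of {op} have types {lt} and {rt}")
    return lt

def check_assignment(target, expr):
    """Verify that the expression's type matches the declared type of the target."""
    if type_of(("id", target)) != type_of(expr):
        raise TypeError(f"type mismatch in assignment to {target}")

check_assignment("position", ("+", ("id", "initial"), ("id", "rate")))   # passes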

3.4 Intermediate code generation:


The final goal of a compiler is to get programs written in a high-level language to run on a
computer. This means that, eventually, the program will have to be expressed as machine code
which can run on the computer. This doesn't mean that we need to translate directly from the
high-level abstract syntax to machine code. Many compilers use a medium-level language as a
stepping-stone between the high-level language and the very low-level machine code. Such
stepping-stone languages are called intermediate code.
Apart from structuring the compiler into smaller jobs, using an intermediate language has other
advantages:
If the compiler needs to generate code for several different machine-architectures, only one
translation to intermediate code is needed. Only the translation from intermediate code to
machine language (i.e., the back-end) needs to be written in several versions.
If several high-level languages need to be compiled, only the translation to intermediate code
needs to be written for each language. They can all share the back-end, i.e., the translation from
intermediate code to machine code.

Instead of translating the intermediate language to machine code, it can be interpreted by a
small program written in machine code or a language for which a compiler already exists.

The advantage of using an intermediate language is most obvious if many languages are to be
compiled to many machines. If translation is done directly, the number of compilers is equal to
the product of the number of languages and the number of machines. If a common intermediate
language is used, one front-end (i.e., compiler to intermediate code) is needed for every language
and one backend is needed for each machine, making the total equal to the sum of the number of
languages and the number of machines.
If an interpreter for the intermediate language is written in a language for which there already
exist compilers on the target machines, the interpreter can be compiled on each of these. This
way, there is no need to write a separate back-end for each machine.
The advantages of this approach are:
No actual back-end needs to be written for each new machine.
A compiled program can be distributed in a single intermediate form for all machines, as
opposed to shipping separate binaries for each machine.
The intermediate form may be more compact than machine code. This saves space both in
distribution and on the machine that executes the programs (though the latter is somewhat offset
by requiring the interpreter to be kept in memory during execution).
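
As a rough sketch of the interpreter idea (an illustration, not a real intermediate language), the
following Python fragment interprets a made-up three-address form directly, standing in for a
machine-specific back-end. The instruction format, operation names and variable names are
assumptions.

def run(program, env):
    """Execute a list of (dest, op, arg1, arg2) instructions; literals are numbers."""
    def value(x):
        return env[x] if isinstance(x, str) else x

    for dest, op, a, b in program:
        if op == "copy":
            env[dest] = value(a)
        elif op == "inttofloat":
            env[dest] = float(value(a))
        elif op == "+":
            env[dest] = value(a) + value(b)
        elif op == "*":
            env[dest] = value(a) * value(b)
    return env

env = {"id2": 1.0, "id3": 2.0}                 # hypothetical values for initial and rate
run([("t1", "inttofloat", 60, None),
     ("t2", "*", "id3", "t1"),
     ("t3", "+", "id2", "t2"),
     ("id1", "copy", "t3", None)], env)
print(env["id1"])                              # 121.0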

Example of source code translation:


Let us consider the following grammar and a string to implement all the phases of translation on
it.
Grammar:

Exp -> Exp + Exp
Exp -> Exp - Exp
Exp -> Exp * Exp
Exp -> Exp / Exp
Exp -> num

String:

position = initial + rate * 60

Lexical analysis:
The lexical analyzer reads the stream of input characters and groups them into meaningful
sequences called lexemes. For each lexeme the lexer produces a token.
Lexeme      Token
position    <id,1>
=           <=>
initial     <id,2>
+           <+>
rate        <id,3>
*           <*>
60          <60>

And the whole statement is then represented as the following token sequence:


<id,1><=><id,2><+><id,3><*><60>
Syntax analysis:
The second phase of the compiler is syntax analysis or parsing. The parser uses the first
components of the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A typical representation
is a syntax tree in which each interior node represents an operation and the children of the node
represent the arguments of the operation. The tree has an interior node labeled * with (id, 3) as
its left child and the integer 60 as its right child. The node (id, 3) represents the identifier rate.
The node labeled * makes it explicit that we must first multiply the value of rate by 60. The node
labeled + indicates that we must add the result of this multiplication to the value of initial. The
root of the tree, labeled =, indicates that we must store the result of this addition into the location
for the identifier position. This ordering of operations is consistent with the usual conventions of
arithmetic, which tell us that multiplication has higher precedence than addition, and hence that
the multiplication is to be performed before the addition.
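
The tree described above can be written down directly, for instance as nested Python tuples (the
representation is an illustrative assumption; the (id, n) pairs stand for symbol-table entries):

# The syntax tree for  position = initial + rate * 60
tree = ("=",
        ("id", 1),                       # position
        ("+",
         ("id", 2),                      # initial
         ("*",
          ("id", 3),                     # rate
          ("num", 60))))

def leaves(node):
    """Yield the leaves from left to right; they appear in source order."""
    if node[0] in ("id", "num"):
        yield node
    else:
        for child in node[1:]:
            yield from leaves(child)

print(list(leaves(tree)))
# [('id', 1), ('id', 2), ('id', 3), ('num', 60)]

Only the identifier and number tokens appear as leaves here; the operators label the interior
nodes, as described above.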

Semantic Analysis:

The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for subsequent use during
intermediate-code generation. An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands. For example, many programming
language definitions require an array index to be an integer; the compiler must report an error if a
floating-point number is used to index an array.
The language specification may permit some type conversions called coercions. For example, a
binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point
numbers. If the operator is applied to a floating-point number and an integer, the compiler may
convert or coerce the integer into a floating-point number. Suppose that position, initial, and rate
have been declared to be floating-point numbers, and that the lexeme 60 by itself forms an
integer. The type checker in the semantic analyzer discovers that the operator * is applied to a
floating-point number rate and an integer 60. In this case, the integer may be converted into a
floating-point number. Notice that the output of the semantic analyzer then has an
extra node for the operator inttofloat, which explicitly converts its integer argument into a
floating-point number.
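
A small sketch of how such a coercion might be inserted is shown below; the tree shape, the type
table and the rule of always coercing the int operand are assumptions made for this illustration.

TYPES = {1: "float", 2: "float", 3: "float"}     # position, initial and rate are floats

def annotate(node):
    """Return (possibly rewritten tree, its type), inserting inttofloat where needed."""
    kind = node[0]
    if kind == "num":
        return node, "int"
    if kind == "id":
        return node, TYPES[node[1]]
    op, left, right = node
    left, lt = annotate(left)
    right, rt = annotate(right)
    if lt != rt:                                  # mixed int/float: coerce the int side
        if lt == "int":
            left, lt = ("inttofloat", left), "float"
        else:
            right, rt = ("inttofloat", right), "float"
    return (op, left, right), lt

typed_tree, _ = annotate(
    ("=", ("id", 1), ("+", ("id", 2), ("*", ("id", 3), ("num", 60)))))
print(typed_tree)
# ('=', ('id', 1),
#       ('+', ('id', 2), ('*', ('id', 3), ('inttofloat', ('num', 60)))))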

Intermediate Code Generation:

In the process of translating a source program into target code, a compiler may construct one or
more intermediate representations, which can have a variety of forms. One such form is the syntax
tree, which is commonly used during syntax and
semantic analysis. After syntax and semantic analysis of the source program, many compilers
generate an explicit low-level or machine-like intermediate representation, which we can think of
as a program for an abstract machine. This intermediate representation should have two
important properties: it should be easy to produce and it should be easy to translate into the target
machine.
A second form is three-address code, which consists of a sequence of assembly-like
instructions with three operands per instruction. The output of the intermediate code generator
consists of the following three-address code sequence:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
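
One way to produce such a sequence is a post-order walk of the (type-annotated) syntax tree that
returns, for each node, the name of the place holding its value. The following Python sketch does
this for the example; the temporary-naming scheme and tree representation are assumptions carried
over from the earlier sketches.

def gen_code(tree):
    code, counter = [], [0]

    def new_temp():
        counter[0] += 1
        return f"t{counter[0]}"

    def gen(node):
        kind = node[0]
        if kind == "num":
            return str(node[1])
        if kind == "id":
            return f"id{node[1]}"
        if kind == "inttofloat":
            temp = new_temp()
            code.append(f"{temp} = inttofloat({gen(node[1])})")
            return temp
        op, left, right = node
        lhs, rhs = gen(left), gen(right)
        if op == "=":
            code.append(f"{lhs} = {rhs}")
            return lhs
        temp = new_temp()                     # each operator gets a fresh temporary
        code.append(f"{temp} = {lhs} {op} {rhs}")
        return temp

    gen(tree)
    return code

typed_tree = ("=", ("id", 1),
              ("+", ("id", 2), ("*", ("id", 3), ("inttofloat", ("num", 60)))))
print("\n".join(gen_code(typed_tree)))
# t1 = inttofloat(60)
# t2 = id3 * t1
# t3 = id2 + t2
# id1 = t3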

Code Optimization
The machine-independent code-optimization phase attempts to improve the intermediate code so
that better target code will result. Usually better means faster, but other objectives may be
desired, such as shorter code, or target code that consumes less power. For example, a
straightforward algorithm generates the intermediate code as in previous phase, using an
instruction for each operator in the tree representation that comes from the semantic analyzer.
A simple intermediate code generation algorithm followed by code optimization is a reasonable
way to generate good target code. The optimizer can deduce that the conversion of 60 from
integer to floating point can be done once and for all at compile time, so the inttofloat operation
can be eliminated by replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is
used only once, to transmit its value to id1, so the optimizer can transform the intermediate code
produced by the previous phase into the shorter sequence below:

t1 = id3 * 60.0
id1 = id2 + t1
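
The following Python sketch performs exactly these two improvements on a list of
(dest, op, arg1, arg2) instructions; the representation and the very limited scope of the
transformations are assumptions made for this example (the result matches the sequence above up to
the numbering of temporaries).

def optimize(code):
    # 1. Constant folding: an inttofloat applied to a literal is done at compile time.
    constants, folded = {}, []
    for dest, op, a, b in code:
        if op == "inttofloat" and isinstance(a, int):
            constants[dest] = float(a)              # remember t1 = 60.0, emit nothing
        else:
            folded.append((dest, op, constants.get(a, a), constants.get(b, b)))

    # 2. Copy elimination: a copy of a value computed by the previous instruction
    #    is folded into that instruction by renaming its destination.
    optimized = []
    for dest, op, a, b in folded:
        if op == "copy" and optimized and optimized[-1][0] == a:
            optimized[-1] = (dest,) + optimized[-1][1:]
        else:
            optimized.append((dest, op, a, b))
    return optimized

print(optimize([("t1", "inttofloat", 60, None),
                ("t2", "*", "id3", "t1"),
                ("t3", "+", "id2", "t2"),
                ("id1", "copy", "t3", None)]))
# [('t2', '*', 'id3', 60.0), ('id1', '+', 'id2', 't2')]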

Code Generation
The code generator takes as input an intermediate representation of the source program and maps
it into the target language. If the target language is machine code, registers or memory locations
are selected for each of the variables used by the program. Then, the intermediate instructions are
translated into sequences of machine instructions that perform the same task. A crucial aspect of
code generation is the judicious assignment of registers to hold variables.
For example, using registers R1 and R2, the intermediate code produced by the previous phase might
get translated into the following machine code:
LDF   R2, id3
MULF  R2, #60.0
LDF   R1, id2
ADDF  R1, R2
STF   id1, R1

The first operand of each instruction specifies a destination. The F in each instruction tells us that
it deals with floating-point numbers. The first instruction loads the contents of address
id3 into register R2; the second multiplies it by the floating-point constant 60.0. The # signifies that 60.0
is to be treated as an immediate constant. The third instruction moves id2 into register R1 and the
fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is
stored into the address of id1, so the code correctly implements the assignment statement that
we started with.
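
A very small code-emission sketch in the same spirit is shown below. The register choice is
hard-coded and the routine only handles this particular two-instruction input; a real code generator
would perform genuine register allocation. The instruction mnemonics are the ones used above.

def codegen(instrs):
    asm, where = [], {}                    # where: which register holds which value
    free = ["R2", "R1"]                    # a deliberately tiny register pool
    for dest, op, a, b in instrs:
        reg = free.pop(0)
        asm.append(f"LDF {reg}, {a}")      # load the first operand from memory
        operand = f"#{b}" if isinstance(b, float) else where.get(b, b)
        asm.append(("MULF" if op == "*" else "ADDF") + f" {reg}, {operand}")
        where[dest] = reg
    last = instrs[-1][0]                   # store the final result back to memory
    asm.append(f"STF {last}, {where[last]}")
    return asm

print("\n".join(codegen([("t2", "*", "id3", 60.0),
                         ("id1", "+", "id2", "t2")])))
# LDF R2, id3
# MULF R2, #60.0
# LDF R1, id2
# ADDF R1, R2
# STF id1, R1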

3.5 Compiler-Construction Tools

The compiler writer, like any software developer, can profitably use modern software
development environments containing tools such as language editors, debuggers, version
managers, profilers, test harnesses, and so on. In addition to these general software-development
tools, other more specialized tools have been created to help implement various phases of a
compiler.

These tools use specialized languages for specifying and implementing specific components, and
many use quite sophisticated algorithms. The most successful tools are those that hide the details
of the generation algorithm and produce components that can be easily integrated into the
remainder of the compiler.
Some commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a grammatical description
of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the
tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse
tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part of
code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing
various phases of a compiler.
