
Compiler design

A compiler is a computer program that translates a computer program written in one computer language (called the source language) into a program written in another computer language (called the output or the target language).

How it works

Usually the translation is from source code (generally a high-level language) to target code (generally a low-level object code or machine language) that may be directly executed by a computer or a virtual machine. However, a compiler from a low-level language to a high-level one is also possible; this is normally known as a decompiler if it is reconstructing a high-level language which could have generated the low-level language. Compilers also exist which translate from one high-level language to another, or sometimes to an intermediate language that still needs further processing; these are sometimes known as cascaders or, more commonly, source-to-source compilers.

Typical compilers output so-called object files, which basically contain machine code augmented by information about the name and location of entry points and external calls (to functions not contained in the object). A set of object files, which need not have all come from a single compiler provided that the compilers used share a common output format, may then be linked together to create the final executable, which can be run directly by a user.

Types of compilers

A compiler may produce code intended to run on the same platform as the compiler itself runs on. This is sometimes called a native-code compiler. Alternatively, it might produce code designed to run on a different platform. This is known as a "cross compiler". Cross compilers are very useful when bringing up a new hardware platform for the first time.

Compiler design

In the past, compilers were divided into many passes to save space. When each pass finished, the compiler could free the space needed during that pass.

Many modern compilers share a common 'two stage' design. The first stage, the 'compiler front end', translates the source language into an intermediate representation. The second stage, the 'compiler back end', works with the internal representation to produce code in the output language.

While compiler design is a complex task, this approach mitigates the complexity: a new back end can retarget the compiler to a different output language, and a new front end can add support for a different source language. This way, modern compilers are often portable and allow multiple dialects of a language to be compiled.
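As a toy illustration of the two-stage split, the sketch below shows a front end that parses a sum of integers into an intermediate representation and a back end that emits instructions for an imaginary stack machine. All names, the 'IR' shape, and the instruction set are invented for this example:

```python
def front_end(source):
    """Front end: understand the source language (here, sums of integers
    like '1+2+3') and translate it into an IR (a plain list of ints)."""
    return [int(token) for token in source.split("+")]

def back_end(ir):
    """Back end: translate the IR into 'target code' for an imaginary
    stack machine."""
    code = [f"PUSH {n}" for n in ir]
    code += ["ADD"] * (len(ir) - 1)  # fold the stack down to one value
    return code

def compile_expr(source):
    return back_end(front_end(source))

compile_expr("1+2+3")  # -> ['PUSH 1', 'PUSH 2', 'PUSH 3', 'ADD', 'ADD']
```

Retargeting here would mean writing a second back end against the same list-of-ints IR while leaving the front end untouched.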

Certain languages, due to the design of the language, the rules placed on the declaration of variables and other objects, and the requirement that executable procedures be declared prior to reference or use, are capable of being compiled in a single pass.

Compiler front end

The compiler front end itself consists of multiple phases, each informed by formal language theory:

1. Lexical analysis - breaking the source code text into small pieces ('tokens' or 'terminals'), each representing a single atomic unit of the language, for instance a keyword, an identifier or a symbol name. The token language is typically a regular language, so a finite state automaton constructed from a regular expression can be used to recognize it. This phase is also called lexing or scanning.

2. Syntax analysis - identifying the syntactic structure of the source code. This phase focuses only on structure: it checks the order of tokens and recovers the hierarchical structure of the code. This phase is also called parsing.

3. Semantic analysis - determining the meaning of the program code and starting to prepare for output. In this phase, type checking is done and most compiler errors show up.

4. Intermediate language generation - an equivalent of the original program is created in an intermediate language.
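The first two phases can be sketched for a toy expression language. The token names, the grammar (sums of numbers and variables), and the tuple-shaped syntax tree below are illustrative assumptions, not taken from any real compiler:

```python
import re

# Token specification as (name, regex) pairs; a real lexer would also
# report characters that match no rule instead of skipping them.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+*()=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    """Lexical analysis: break source text into (kind, text) tokens."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(source)
            if m.lastgroup != "SKIP"]

def parse_expr(tokens, pos=0):
    """Syntax analysis for 'term (+ term)*', building a nested-tuple tree."""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == ("OP", "+"):
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)
    return node, pos

def parse_term(tokens, pos):
    kind, text = tokens[pos]
    if kind == "NUMBER":
        return ("num", int(text)), pos + 1
    if kind == "IDENT":
        return ("var", text), pos + 1
    raise SyntaxError(f"unexpected token {text!r}")
```

For the input `"x + 42"`, lexing yields the tokens `IDENT`, `OP`, `NUMBER`, and parsing turns them into the tree `("+", ("var", "x"), ("num", 42))`.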

Compiler back end

While there are applications where only the compiler front end is necessary, such as static language verification tools, a real compiler hands the intermediate representation generated by the front end to the back end, which produces a functionally equivalent program in the output language. This is done in multiple steps:

1. Optimization - the intermediate language representation is transformed into functionally equivalent but faster (or smaller) forms.

2. Code generation - the transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory and the selection and scheduling of appropriate machine instructions.
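Both back-end steps can be sketched on a tuple-shaped intermediate representation. The constant-folding optimizer and the tiny stack-machine instruction set below are invented for illustration:

```python
def fold(node):
    """Optimization: replace constant subexpressions with their computed
    values (constant folding), a classic equivalent-but-cheaper rewrite."""
    if node[0] == "+":
        left, right = fold(node[1]), fold(node[2])
        if left[0] == "num" and right[0] == "num":
            return ("num", left[1] + right[1])
        return ("+", left, right)
    return node

def codegen(node, out):
    """Code generation: translate the IR into instructions for a toy
    stack machine (PUSH a constant, LOAD a variable, ADD the top two)."""
    if node[0] == "num":
        out.append(("PUSH", node[1]))
    elif node[0] == "var":
        out.append(("LOAD", node[1]))
    else:  # "+"
        codegen(node[1], out)
        codegen(node[2], out)
        out.append(("ADD", None))
    return out

# (1 + 2) + x folds to 3 + x, which compiles to three instructions.
ir = ("+", ("+", ("num", 1), ("num", 2)), ("var", "x"))
code = codegen(fold(ir), [])
```

A real back end would interleave many such rewrites with the register-allocation and instruction-scheduling decisions mentioned above; this sketch shows only the shape of the pipeline.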

Compiled vs. Interpreted languages

Many people divide higher level programming languages into two categories: compiled languages and interpreted languages. In fact, most of these languages can be implemented through either compilation or interpretation; the categorisation merely reflects which method is most commonly used to implement that language. (Some interpreted languages, however, cannot easily be implemented through compilation, especially those which allow self-modifying code.)

Parse tree

A parse tree is a grammatical structure represented as a tree data structure.

(Figure: a sentence structure represented as a parse tree.)

Semantic analysis

In computer science, semantic analysis is a pass by a compiler that adds semantic information to the parse tree and performs certain checks based on this information. It follows the parsing phase, in which the parse tree is generated, and precedes the code generation phase, in which executable code is generated. Typical examples of semantic information that is added and checked are typing information (type checking) and the binding of variables and function names to their definitions (object binding). Sometimes some early code optimization is also done in this phase.

For this phase the compiler usually maintains so-called symbol tables, in which it stores what each symbol (variable names, function names, etc.) refers to.

Code generation

In computer science, code generation is the process by which a compiler actually outputs machine code for the target environment.

Code generators usually take the intermediate form generated directly by semantic analysis or by intermediate code generators. Three-address code is a common intermediate format, described in the "dragon book". (The term "generation" also has an unrelated meaning in genetic programming, where it refers to one of the trial generations.)
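In three-address code, each instruction has at most one operator and assigns its result to a fresh temporary. The sketch below flattens a nested expression into that form; the tuple-shaped input and the `t0`, `t1` temporary naming are assumptions for illustration:

```python
import itertools

def to_tac(node, code, temps=None):
    """Flatten a nested expression tuple into three-address code lines,
    generating a fresh temporary name for every operator applied."""
    if temps is None:
        temps = itertools.count()
    if node[0] in ("num", "var"):
        return str(node[1])              # leaves are used as-is
    left = to_tac(node[1], code, temps)
    right = to_tac(node[2], code, temps)
    temp = f"t{next(temps)}"
    code.append(f"{temp} = {left} {node[0]} {right}")
    return temp

lines = []
to_tac(("+", ("+", ("var", "a"), ("var", "b")), ("num", 4)), lines)
# lines is now ['t0 = a + b', 't1 = t0 + 4']
```

Because every instruction names at most three addresses (two operands and a result), later passes can analyse and rearrange the code one simple line at a time.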

In a more general sense, code generation is the production of programs in some automatic manner. A compiler-compiler (a program that generates a compiler, such as yacc) is a very common instance.

Code generation can be done either at run time (including load time) or at compile time. Just-in-time compilers usually produce native or nearly native code from bytecode when programs are loaded. Compiler-compilers, on the other hand, almost always generate code at compile time.