DECOMPILER

(REVERSE ENGINEERING)

ABSTRACT

With major businesses focusing more and more on web enablement, the proliferation of
web-based applications, and the growth of many operating systems in the mainframe and
midrange marketplace, there is a growing demand for decompilers specific to these
systems. A compiler is system software that takes as input a program written in a
high-level language and produces as output an executable program for a target machine.
A decompiler, or reverse compiler, attempts to perform the inverse process: given an
executable program, the aim is to produce a high-level language program that performs
the same function as the executable program.

In general, decompilers are used to recover lost source code. They work by analyzing
the binary code (or bytecode) of the software and making educated guesses about the
source code that produced it. The input is machine dependent and the output is language
dependent: an intermediate representation is formed from the binary, and equivalent code
can then be generated in any suitable high-level language. Decompilation uses a set of
tools to load a binary program into memory, parse or disassemble it, and analyze it to
generate a high-level language program. The accuracy of the result benefits from
compiler and library signatures, which are used to recognize particular compilers and
library subroutines. In most cases, the generated high-level code needs small changes
before it can be compiled again.

INTRODUCTION

Java bytecode is the form of instructions that the Java virtual machine executes. Each
bytecode opcode is one byte in length, although some require parameters, resulting in
some multi-byte instructions. Not all of the possible 256 opcodes are used. In fact, Sun
Microsystems, the original creators of the Java programming language, the Java virtual
machine and other components of the Java Runtime Environment (JRE), have set aside 3
values to be permanently unimplemented.

The Java compiler compiles Java source files (*.java) into binary class files (*.class).
A Java decompiler performs the reverse step, converting class files back into source
files (*.java).

As each byte has 256 potential values, there are 256 possible opcodes. Of these, 0x00
through 0xca, 0xfe, and 0xff are assigned values. 0xba was long unused for historical
reasons; since Java SE 7 it encodes the invokedynamic instruction. 0xca is reserved as a
breakpoint instruction for debuggers and is not used by the language. Similarly, 0xfe
and 0xff are not used by the language, and are reserved for internal use by the virtual
machine.
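
The opcode assignments described above can be written down as a small lookup. The following is a minimal sketch in Python; the function name classify_opcode and the category labels are illustrative assumptions and are not part of any Java tool. It treats 0xba as invokedynamic, as in Java SE 7 and later.

```python
def classify_opcode(value):
    """Classify one byte value as a JVM opcode category (labels are illustrative)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("an opcode must fit in one byte")
    if value == 0xCA:
        return "reserved: breakpoint for debuggers"
    if value in (0xFE, 0xFF):
        return "reserved: internal use by the virtual machine"
    if value == 0xBA:
        return "invokedynamic (unused before Java SE 7)"
    if value <= 0xC9:
        return "standard instruction"
    # Everything from 0xcb through 0xfd has no assigned meaning.
    return "unassigned"


if __name__ == "__main__":
    for byte in (0x2A, 0xBA, 0xCA, 0xCB, 0xFF):
        print(f"0x{byte:02x}: {classify_opcode(byte)}")
```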

Instructions fall into a number of broad groups:

• Load and store (e.g. aload_0, istore)


• Arithmetic and logic (e.g. ladd, fcmpl)
• Type conversion (e.g. i2b, d2i)
• Object creation and manipulation (e.g. new, putfield)
• Operand stack management (e.g. swap, dup2)
• Control transfer (e.g. ifeq, goto)
• Method invocation and return (e.g. invokespecial, areturn)
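
To make the instruction groups concrete, here is a minimal disassembler sketch in Python. It covers only a handful of opcodes, one or two from each group listed above; the table name OPCODES and the function disassemble are assumptions made for this illustration. Operands are printed as unsigned big-endian integers, which is a simplification, since branch offsets (ifeq, goto) are signed in the real instruction set.

```python
# Mnemonic and number of operand bytes for a small subset of JVM opcodes.
OPCODES = {
    0x2A: ("aload_0", 0),        # load and store
    0x36: ("istore", 1),
    0x61: ("ladd", 0),           # arithmetic and logic
    0x91: ("i2b", 0),            # type conversion
    0xBB: ("new", 2),            # object creation and manipulation
    0xB5: ("putfield", 2),
    0x5F: ("swap", 0),           # operand stack management
    0x99: ("ifeq", 2),           # control transfer
    0xA7: ("goto", 2),
    0xB7: ("invokespecial", 2),  # method invocation and return
    0xB0: ("areturn", 0),
    0xB1: ("return", 0),
}


def disassemble(code):
    """Return one 'offset: mnemonic operand' line per instruction."""
    lines = []
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        if opcode not in OPCODES:
            raise ValueError(f"opcode 0x{opcode:02x} is not in this sketch's table")
        name, width = OPCODES[opcode]
        operand = code[pc + 1 : pc + 1 + width]
        if len(operand) != width:
            raise ValueError(f"truncated operand for {name} at offset {pc}")
        text = name if width == 0 else f"{name} {int.from_bytes(operand, 'big')}"
        lines.append(f"{pc}: {text}")
        pc += 1 + width
    return lines


if __name__ == "__main__":
    # Typical body of a default constructor: call the superclass constructor, return.
    constructor = bytes([0x2A, 0xB7, 0x00, 0x01, 0xB1])
    for line in disassemble(constructor):
        print(line)
```

A real disassembler would also resolve the operand of invokespecial against the constant pool of the class file, which this sketch does not attempt.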

A Java decompiler is especially useful if you have *.class files but no access to the
source code. Some vendors do not ship the source code for their Java class files, in
which case a decompiler lets you inspect a reconstruction of it.

Java Decompiler is a decompiler and disassembler for Java that runs on Windows 95,
Windows 98, Windows 2000, Windows XP, Windows 2003, Windows Vista and Windows 7. It
reconstructs the original source code from compiled binary CLASS files (for example
Java applets). It is able to decompile complex Java applets and binaries, producing
accurate source code, and lets you quickly obtain all essential information about the
class files. Java Decompiler is a stand-alone Windows application; it does not require
Java to be installed.

What is Decompilation?

In general, decompilers are used to recover lost source code. They work by analyzing
the binary code (or bytecode) of the software and making educated guesses about the
source code that produced it. The input is machine dependent and the output is language
dependent: an intermediate representation is formed from the binary, and equivalent code
can then be generated in any suitable high-level language. Decompilation uses a set of
tools to load a binary program into memory, parse or disassemble it, and analyze it to
generate a high-level language program. The accuracy of the result benefits from
compiler and library signatures, which are used to recognize particular compilers and
library subroutines. In most cases, the generated high-level code needs small changes
before it can be compiled again.

ETHICS OF DECOMPILATION

Is Decompilation Possible?

Yes and no. Fully automated decompilation is not possible: the problem is theoretically
equivalent to the Halting Problem, an undecidable problem in computer science. This
means that decompilation cannot be achieved for all possible programs, and that the
separation of data and code is hard to achieve. Further, even if a certain degree of
success is achieved, the generated program lacks meaningful variable and function
names, as these are not normally stored in an executable file except when kept for
debugging purposes.

Some people believe it is only possible to recover the assembly source, which is itself
not a trivial problem. In practice, however, there are more sophisticated techniques
that identify programming constructs, so that a high degree of understanding of the
data and control flow of the executable is possible. The more successful ones make use
of extra information (e.g. knowledge of the compiler used) or require human input at
the hard parts of the disassembly process.

Decompilers – Friend or Foe

When a programmer writes software, and releases it to the public, he (or she)

normally releases a compiled version of the application that users can run on their own

machine. Whether it is a commercial offering, or a free piece of software, the

programmer has put a considerable amount of time and effort into producing it. The

source code behind the software is something private, that the programmer has created.

Programmers don’t want people looking for flaws in their software, and they don’t want

people to change the title of the software and then redistribute it as someone else’s

product. It is for this reason that programmers don’t often release their source code – but

few realize that every time we release compiled software, we are also giving people the

opportunity to reconstruct the source code.

Decompilation and reverse engineering are often prohibited by software license

agreements – but this won’t always stop an unscrupulous competitor, or an enthusiastic

hacker from analyzing the code. While decompilers do represent a threat, they also can be

of great benefit to programmers. There are also many legitimate purposes for the use of

decompilers. Some companies might decompile software of a competitor, to establish the

structure of data files to include support for that file-type in their application. Whether or

not such actions are legal is a gray area, but including support for competing

spreadsheets, word processors, databases, etc is handy for end-users.

Decompilers aren’t necessarily evil, but they do pose an ethical dilemma for many

software developers. The programmers can protect their software against decompilation

or at least make the task harder, by using special software that protects them from prying

eyes. Decompilers can also be used to steal the source code of competitors, or by hackers

to determine weaknesses in the design of software. But just blaming the decompiler is

meaningless: it is the programmer who uses it for intellectual property theft, or the

hacker who decompiles the software to find security holes, who is at fault.

Legal Aspects

If decompilation is possible to a certain extent, is it then also allowed?

Throughout the world, copyright law protects most programs. Copyright protects the

expression of an idea in the form of a program, hence protecting the developer’s (or

company’s) intellectual property on the software. Copyright law provides a bundle of

exclusive rights to the software developer, among others, the right to reproduce and make

adaptations to the developed computer program. Making reproductions and adaptations

without permission of the copyright holder is a breach of these rights. Different

countries have different exceptions to the copyright owner’s rights, set out in law or

established as precedent in court proceedings. These uses are therefore allowed by law,

but they vary according to the country. The most common ones are:

• Decompilation for the purpose of interoperability (with another piece of software

or hardware) where the interface specification has not been made available

• Decompilation for the purpose of error correction where the owner of the

copyright is not available to make the correction, and

• Decompilation to determine parts of a program (algorithms) that are not protected

by copyright, without breach of other forms of protection (trade secrets).

SCOPE AND NEED FOR DECOMPILATION

Decompilation is a tool for a computer professional. There are two major areas

where decompilation is used: Software Maintenance and Security.

In the former area, it is used:

• To recover lost or inaccessible source code

• When third party vendors go out of business

• To translate code written in an obsolete language into a newer language

• To structure old code written in an unstructured way (i.e. spaghetti code) into

a structured program

• To migrate applications to a new hardware platform

• To debug binary programs that are known to have bugs but for which the

source code is unavailable

• When multiple versions of source have been created and creation dates are

destroyed making it impossible to tell which source matches the currently

running object/executable.

In the latter area, decompilation is used as a tool to verify the object code produced
by a compiler in safety-critical systems. The extensive use of computers and networks
worldwide has also raised awareness of the need for tools and techniques to aid the
security analysis of binary code. These help in understanding malware such as viruses,
Trojans, worms and backdoors, as well as general security flaws, in order to provide
immediate solutions with or without the aid of software vendors, whether the problems
are caused by intentionally introduced malicious code or by the malicious exploitation
of benign code to the detriment of the user.

The classical technique used to study malware is the use of a debugger to step the

executable program (containing thousands of lines of assembly code) one assembly line

at a time until the problem is found – it is then possible to reconstruct that part of the

traced program in order to provide a solution for it. This method requires an expert

engineer that understands assembly code – a skill that is disappearing as years go by, due

to the increasing use of higher-level languages such as C++ and Java. By decompilation,

we can reduce the amount of code that the engineer has to process, and present the

engineer with a higher level of abstraction, so that fewer man-hours are needed

to understand the program’s code. Further, these techniques reduce the

additional skills and training required of professionals, especially those working in

network security teams. Thus, decompilation would effectively help in reducing the amount of

time needed to trace a security flaw in an executable program, as well as reducing the

costs of acquiring or training skilled assembler engineers.

DECOMPILATION PROBLEMS

A decompiler writer has to face several theoretical and practical problems. Some

of these problems can be solved by heuristic methods; others cannot be solved

completely:

• Recursive undecidability (need for proof of abstract concept)

• The von Neumann Architecture (inseparable data and instructions)

• Self-modifying code (for best utilization of available memory)

• Idioms (unidentifiable sequence of instructions to form logical entity)

• Virus and Trojan tricks (hiding of malicious code)

• Architecture-dependent Restrictions

• Subroutines included by Compiler/Linker

RUN-TIME ENVIRONMENT

Before considering decompilation, the relations between the static binary code of

the program and the actions performed at run-time to implement the program should be

considered. The representation of objects in a binary program differs between compilers:

elementary data types such as integers, characters, and reals are usually represented by

an equivalent data object in the machine, whereas aggregate objects such as arrays,

strings, and structures are represented in various different ways. A high-level language program is

composed of one or more subroutines, called the user sub-routines. The corresponding

binary program is composed of user subroutines, library routines that were invoked by

the user program, and other subroutines linked in by the linker to provide support for the

compiler at run-time. The general format of the binary code consists of start-up code, the

user program including library routines, and exit code. In DOS and Windows

environments, when a program is loaded into memory, a Program Segment Prefix is built

in the first bytes of the allocated memory; it contains important information such

as parent information, interrupt details, etc.

Each subroutine is associated with a stack frame at run-time, containing its
parameters, local variables, and the return address into the calling subroutine.
Entering a subroutine, allocating local data, preserving register values, accessing
parameters, returning a value, exiting the subroutine and parameter passing are some of
the important tasks to be analyzed from the binary code. Meanwhile, a symbol table is
normally built to store information on the variables used throughout the program.
Variables are identified by their address: variables that have an absolute memory
address are global variables, those located at a negative offset from the frame (base)
pointer are local variables of the corresponding stack frame’s subroutine, and those at
positive offsets are actual arguments to the subroutine. Register variables need
special attention. The symbol table grows dynamically and is built for easy and quick
access.
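
The address-based classification of variables can be sketched as follows. This is a minimal Python illustration, assuming operands are written in the textual form used elsewhere in this report (for example [bp+4]) and that bp holds the frame pointer; the names classify_operand and build_symbol_table are hypothetical.

```python
import re

# [bp+N] or [bp-N]: an address relative to the frame pointer.
FRAME_RELATIVE = re.compile(r"^\[bp([+-])(\d+)\]$")
# [1234] or [0x1f40]: an absolute memory address.
ABSOLUTE = re.compile(r"^\[(0x[0-9a-f]+|\d+)\]$")


def classify_operand(operand):
    """Classify a memory operand as a local variable, an argument or a global."""
    text = operand.replace(" ", "").lower()
    match = FRAME_RELATIVE.match(text)
    if match:
        return "local variable" if match.group(1) == "-" else "argument"
    if ABSOLUTE.match(text):
        return "global variable"
    return "unknown"  # registers and immediates are not handled here


def build_symbol_table(operands):
    """Grow a symbol table as operands are seen, keyed by normalized address."""
    table = {}
    for operand in operands:
        key = operand.replace(" ", "").lower()
        kind = classify_operand(key)
        if kind != "unknown" and key not in table:
            table[key] = kind
    return table


if __name__ == "__main__":
    print(build_symbol_table(["[bp-2]", "[bp+4]", "[0x1F40]", "ax", "[bp + 4]"]))
```

A dictionary is used so that the table can grow dynamically and still be looked up quickly, which matches the requirement stated in the text.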

PHASES OF A DECOMPILER

Conceptually, a decompiler is structured in a similar way to a compiler, by a

series of phases that transform the source machine program from one representation to

another.

Syntax Analyzer: It groups bytes of the source program into grammatical phrases

(or sentences) of the source machine language, using a parse tree. Case tables are

used to distinguish data and instructions.

Semantic Analyzer: It checks the source program for the semantic meaning of

groups of instructions; gathers type information and propagates this type across

the subroutine.

Intermediate Code Generation: An explicit intermediate representation of the

source program is necessary for the decompiler to analyze the program. The

second pass would use this intermediate code to generate target language code.

Data Flow Analysis: This phase attempts to improve intermediate code, by

eliminating use of temporary registers and condition flags, thereby identifying

high-level language expressions.

Control Flow Analysis: This phase is useful to eliminate compiler-generated

intermediate jumps and to determine the high-level control structures used in the

program.

Target Code Generation: High-level language code is generated here after

selecting names for local and global variables. Subroutine names are selected

using library and signature bindings. Further, intermediate instructions and

control structures are translated to appropriate high-level statements.

DECOMPILATION SYSTEM

Many people mistakenly believe that decompilation is equivalent to disassembly, but a

disassembler is just one module of a decompilation system. The entire system involves

the following tools:

Loader: It loads the binary program into memory and, if the machine code is

relocatable, relocates it by altering the affected instructions.

Signature Generator: It delinks and generates patterns that uniquely identify

each compiler and library subroutine, to reduce arbitrary subroutine names.

Disassembler: It transforms machine language into assembly language and, in

some cases, into a higher-level representation.

Library Binder: It binds the subroutine names to the appropriate library routines.

Postprocessor: It converts a high-level program into a semantically equivalent

high-level program, for example by converting generic control structures (while loops)

into more appropriate control structures of the target language (for loops).
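
As an illustration of how the Signature Generator and the Library Binder might cooperate, the sketch below matches a byte pattern against loaded code. It is a simplified assumption of how such signatures could look, not the format of any particular tool: the routine name lib_routine_a is made up, and bytes that change with relocation (here the 16-bit operand of a call) are stored as wildcards.

```python
WILDCARD = None  # stands for a byte that may vary, e.g. a relocated address

# Hypothetical signature database: routine name -> byte pattern.
SIGNATURES = {
    "lib_routine_a": [
        0x55,                      # push bp
        0x8B, 0xEC,                # mov bp, sp
        0xE8, WILDCARD, WILDCARD,  # call rel16 (target differs per program)
        0x5D,                      # pop bp
        0xC3,                      # ret
    ],
}


def matches(signature, code, offset=0):
    """True if the signature matches the code bytes starting at offset."""
    if offset + len(signature) > len(code):
        return False
    return all(
        expected is WILDCARD or expected == code[offset + index]
        for index, expected in enumerate(signature)
    )


def identify(code, offset=0):
    """Return the name of the first matching library routine, or None."""
    for name, signature in SIGNATURES.items():
        if matches(signature, code, offset):
            return name
    return None


if __name__ == "__main__":
    routine = bytes([0x55, 0x8B, 0xEC, 0xE8, 0x34, 0x12, 0x5D, 0xC3])
    print(identify(routine))
```

A subroutine recognized this way can be given its library name and skipped during decompilation, since its source is already known.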

[Figure: A Decompilation System]

SIMPLE ILLUSTRATIVE APPROACH

A simple approach to decompiling a binary executable is to first parse it and separate
it into functions (C style). Once we know where the entry point of the program is
(“main” for a C program), we can start decompiling that function and any other function
it calls. After we have separated out the instructions for a function, we emulate the
processor and interpret each machine instruction, combining logical groups of
instructions into simple high-level language statements.

Understanding Assignment Statements and Expressions: Consider the code:

mov ax, [bp+4]

mov bx, 20

mul bx

add ax, 4

mov [bp+4], ax

Let there be two variables, wAX and wBX, standing for the registers ax and bx. Then:

i) wAX = [bp+4]

ii) wBX = 20

iii) wAX = wAX*wBX = [bp+4]*20

iv) wAX = wAX+4 = ([bp+4]*20) + 4

v) [bp+4] = wAX = ([bp+4]*20) + 4

And, if we substitute i for [bp+4], we get: i = (i*20) + 4;
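
The substitution steps above can be automated by emulating the registers symbolically, that is, by storing an expression string in each register instead of a number. The following Python sketch handles only the three mnemonics used in the example; the function name decompile_block and the tuple encoding of instructions are assumptions made for this illustration. It ignores the high word that mul leaves in dx.

```python
def decompile_block(instructions, names):
    """Turn a straight-line instruction list into high-level statements.

    instructions: tuples such as ("mov", "ax", "[bp+4]")
    names: maps memory operands to variable names, e.g. {"[bp+4]": "i"}
    """
    regs = {}        # register name -> expression currently held
    statements = []

    def value(operand):
        if operand in names:
            return names[operand]
        return regs.get(operand, operand)  # register contents, else a literal

    for mnemonic, *operands in instructions:
        if mnemonic == "mov":
            dst, src = operands
            if dst in names:
                # A store to memory becomes an assignment statement.
                statements.append(f"{names[dst]} = {value(src)};")
            else:
                regs[dst] = value(src)
        elif mnemonic == "mul":
            # mul implicitly multiplies ax by its single operand.
            regs["ax"] = f"({regs['ax']} * {value(operands[0])})"
        elif mnemonic == "add":
            dst, src = operands
            regs[dst] = f"({regs[dst]} + {value(src)})"
        else:
            raise ValueError(f"unsupported instruction: {mnemonic}")
    return statements


if __name__ == "__main__":
    block = [
        ("mov", "ax", "[bp+4]"),
        ("mov", "bx", "20"),
        ("mul", "bx"),
        ("add", "ax", "4"),
        ("mov", "[bp+4]", "ax"),
    ]
    print(decompile_block(block, {"[bp+4]": "i"}))
```

The sketch emits i = ((i * 20) + 4); with one redundant pair of parentheses, which a later pass could remove. The temporary registers ax and bx no longer appear in the output.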

Understanding Condition Evaluation and Branches: Consider:

mov ax, [bp+4]

cmp ax, 10

jnz lab1

mov bx, 15

mov [bp+2], bx

jmp lab2

lab1: mov bx, 20

mov [bp+2], bx

lab2:

Using the cmp and jnz instructions, and matching [bp+4] with i and [bp+2] with j, we
conclude: if (i != 10) j = 20; else j = 15; But we should remember that a condition
need not always be evaluated using a compare instruction; arithmetic instructions can
also set or reset the condition flags.
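
The mapping from a compare and a conditional jump to an if/else statement can be sketched as a table lookup. In this minimal Python illustration, the table TAKEN_WHEN and the function structure_if_else are hypothetical names; the jump semantics listed are those of the x86 signed conditional jumps following cmp left, right. The code at the jump target becomes the "then" branch and the fall-through code becomes the "else" branch.

```python
# Relation between the cmp operands under which each jump is taken.
TAKEN_WHEN = {
    "jz": "==", "je": "==",
    "jnz": "!=", "jne": "!=",
    "jl": "<", "jge": ">=",
    "jg": ">", "jle": "<=",
}


def structure_if_else(left, right, jump, fallthrough, target):
    """Build an if/else from 'cmp left, right' followed by a conditional jump.

    fallthrough: statement executed when the jump is not taken
    target: statement executed at the jump's destination label
    """
    if jump not in TAKEN_WHEN:
        raise ValueError(f"unsupported conditional jump: {jump}")
    return f"if ({left} {TAKEN_WHEN[jump]} {right}) {target} else {fallthrough}"


if __name__ == "__main__":
    print(structure_if_else("i", "10", "jnz", "j = 15;", "j = 20;"))
```

A fuller implementation would build a control flow graph first and would also track flags set by arithmetic instructions, as the text cautions.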

Interpreting function calls and return values: The usual convention for calling a
function is to push the parameters onto the stack and then call the function, so “push”
and “call” instructions indicate a function call. Consider the code:

mov ax, [bp+4]

push ax

mov ax, [bp+2]

push ax

call _func

mov [bp+4], ax

Matching [bp+4] with “i” and [bp+2] with “j”, and assuming the C convention in which
arguments are pushed from right to left and the result is returned in ax, we get:
i = func(j, i);
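
The same symbolic approach recovers the call. The Python sketch below assumes a C-style convention (arguments pushed right to left, result returned in ax) and assumes that every value pushed since the previous call is an argument of the next one; recover_calls and the tuple encoding of instructions are names invented for this illustration.

```python
def recover_calls(instructions, names):
    """Recover assignments and function calls from mov/push/call sequences."""
    regs = {}         # register name -> expression currently held
    pushed = []       # expressions pushed since the last call
    statements = []

    def value(operand):
        return names.get(operand, regs.get(operand, operand))

    for mnemonic, *operands in instructions:
        if mnemonic == "mov":
            dst, src = operands
            if dst in names:
                statements.append(f"{names[dst]} = {value(src)};")
            else:
                regs[dst] = value(src)
        elif mnemonic == "push":
            pushed.append(value(operands[0]))
        elif mnemonic == "call":
            # The last value pushed is the first argument in the source.
            arguments = ", ".join(reversed(pushed))
            pushed.clear()
            function = operands[0].lstrip("_")  # drop the leading underscore
            regs["ax"] = f"{function}({arguments})"
        else:
            raise ValueError(f"unsupported instruction: {mnemonic}")
    return statements


if __name__ == "__main__":
    code = [
        ("mov", "ax", "[bp+4]"),
        ("push", "ax"),
        ("mov", "ax", "[bp+2]"),
        ("push", "ax"),
        ("call", "_func"),
        ("mov", "[bp+4]", "ax"),
    ]
    print(recover_calls(code, {"[bp+4]": "i", "[bp+2]": "j"}))
```

Other calling conventions push arguments in a different order or pass them in registers, so a real decompiler has to identify the convention before applying a rule like this.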

CONCLUSION

Reverse engineering is a field of research that has attracted many hackers and
researchers. Though decompilation is undecidable in general, a great deal of research
is being conducted to improve the results. Decompilation should be thought of as a
means to retrieve source code in an emergency, without violating the law. Decompilers
remain challenging to build and maintain for several reasons: multiple versions of a
compiler exist for the same platform, and compilers continue to change, so a decompiler
must keep up with those changes. Nobody can stop unscrupulous persons and hackers from
decompiling illegally, but the technique should be directed toward valid purposes.
