DECOMPILER

(REVERSE ENGINEERING)
ABSTRACT
With major businesses focusing more and more on web enablement, the
proliferation of web-based applications, and the growth of operating systems in the
mainframe and midrange marketplace, there is a growing demand for decompilers
specific to these systems. A compiler is system software that takes as input a program
written in a high-level language and produces as output an executable program for a
target machine. A decompiler, or reverse compiler, attempts to perform the inverse
process: given an executable program, the aim is to produce a high-level language
program that performs the same function as the executable program.
In general, decompilers are used to recover lost source code. They work by
analyzing the byte code of the software and making educated guesses about the source
code that produced it. The input is machine dependent, and the output is language
dependent: an intermediate representation is built from the binary, and equivalent
code can then be generated from it in any high-level language. Decompilation uses
a set of tools to load the binary program into memory, parse or disassemble it,
and analyze it to generate a high-level language program. The accuracy of the
result improves when compiler and library signatures are available to recognize
particular compilers and library subroutines. In most cases, the generated high-level
code needs small changes before it can be compiled again.
INTRODUCTION
Java bytecode is the form of instructions that the Java virtual machine executes. Each
bytecode opcode is one byte in length, although some require parameters, resulting in
some multi-byte instructions. Not all of the possible 256 opcodes are used. In fact, Sun
Microsystems, the original creator of the Java programming language, the Java virtual
machine and other components of the Java Runtime Environment (JRE), set aside 3
values to be permanently unimplemented.
The Java compiler compiles Java source code files (*.java) into binary files (*.class).
A Java decompiler converts Java class files back into source code files
(*.java).
As each byte has 256 potential values, there are 256 possible opcodes. Of these, 0x00
through 0xca, 0xfe, and 0xff are assigned values. 0xba was left unused for historical
reasons until Java SE 7 assigned it to invokedynamic. 0xca is reserved as a breakpoint
instruction for debuggers and is not used by the language. Similarly, 0xfe and 0xff are
not used by the language, and are reserved for internal use by the virtual machine.
Instructions fall into a number of broad groups:
• Load and store (e.g. aload_0, istore)
• Arithmetic and logic (e.g. ladd, fcmpl)
• Type conversion (e.g. i2b, d2i)
• Object creation and manipulation (e.g. new, putfield)
• Operand stack management (e.g. swap, dup2)
• Control transfer (e.g. ifeq, goto)
• Method invocation and return (e.g. invokespecial, areturn)
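As a rough illustration of how a tool might walk over such instructions, the following
Python sketch decodes a short byte sequence using a table that covers only the mnemonics
listed above, plus return (0xb1), which is added here so that the example can end a
method. The names OPCODES and disassemble, the group labels, and the sample bytes are
assumptions made for this example and do not come from any particular tool.

```python
# Demo opcode table: opcode byte -> (mnemonic, operand byte count, group).
# Only the mnemonics named in the list above are included, plus `return`.
OPCODES = {
    0x2A: ("aload_0", 0, "load/store"),
    0x36: ("istore", 1, "load/store"),
    0x61: ("ladd", 0, "arithmetic/logic"),
    0x95: ("fcmpl", 0, "arithmetic/logic"),
    0x91: ("i2b", 0, "type conversion"),
    0x8E: ("d2i", 0, "type conversion"),
    0xBB: ("new", 2, "object"),
    0xB5: ("putfield", 2, "object"),
    0x5F: ("swap", 0, "stack"),
    0x5C: ("dup2", 0, "stack"),
    0x99: ("ifeq", 2, "control transfer"),
    0xA7: ("goto", 2, "control transfer"),
    0xB7: ("invokespecial", 2, "invocation/return"),
    0xB0: ("areturn", 0, "invocation/return"),
    0xB1: ("return", 0, "invocation/return"),  # added for the example
}


def disassemble(code: bytes):
    """Return a list of (offset, mnemonic, operand, group) tuples."""
    result = []
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        if opcode not in OPCODES:
            raise ValueError(f"opcode 0x{opcode:02x} is not in the demo table")
        mnemonic, operand_len, group = OPCODES[opcode]
        # Operands are read as unsigned big-endian integers; real branch
        # offsets (ifeq, goto) are signed, which this sketch ignores.
        operand = None
        if operand_len:
            operand = int.from_bytes(code[pc + 1:pc + 1 + operand_len], "big")
        result.append((pc, mnemonic, operand, group))
        pc += 1 + operand_len
    return result


# Sample bytes shaped like a default constructor body; the constant pool
# index 1 is a made-up value.
sample = bytes.fromhex("2ab70001b1")
for offset, mnemonic, operand, group in disassemble(sample):
    print(offset, mnemonic, "" if operand is None else f"#{operand}", f"[{group}]")
```

A real disassembler would cover the whole instruction set and also resolve constant pool
indexes to class and method names; the table-driven loop is the only part shown here.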
A Java decompiler is especially useful if you have *.class files but no access to the
source code. Some vendors do not ship the source code for their Java class files,
in which case you can use a Java decompiler to inspect it.
Java Decompiler is a decompiler and disassembler for Java that runs on Windows 95,
Windows 98, Windows 2000, Windows XP, Windows 2003, Windows Vista, and Windows 7. It
reconstructs the original source code from compiled binary CLASS files (for example,
Java applets). Java Decompiler is able to decompile complex Java applets and binaries,
producing accurate source code, and it lets you quickly obtain all essential information
about the class files. It is a stand-alone Windows application and does not require Java
to be installed.
What is Decompilation?
In general, decompilers are used to recover lost source code. They work by
analyzing the byte code of the software and making educated guesses about the source
code that produced it. The input is machine dependent, and the output is language
dependent: an intermediate representation is built from the binary, and equivalent
code can then be generated from it in any high-level language. Decompilation uses
a set of tools to load the binary program into memory, parse or disassemble it,
and analyze it to generate a high-level language program. The accuracy of the
result improves when compiler and library signatures are available to recognize
particular compilers and library subroutines. In most cases, the generated high-level
code needs small changes before it can be compiled again.
ETHICS OF DECOMPILATION
Is Decompilation Possible?
Yes and no. Fully automated decompilation is not possible in general: the problem is
theoretically equivalent to the Halting Problem, an undecidable problem in computer
science. This means that decompilation cannot be achieved for all possible
programs, and that the separation of data and code is hard to achieve.
Further, even if a certain degree of success is achieved, the generated program lacks
meaningful variable and function names, as these are not normally stored in an executable
file except when kept for debugging purposes.
Some people believe it is only possible to recover the assembly source; this in
itself is not a trivial problem. In practice, however, more sophisticated techniques
have been developed to identify programming constructs, so that a high degree of
understanding of the data and control flow of the executable is possible. The more
successful ones make use of extra information (e.g. knowledge of the compiler used) or
require human input at the hard parts of the disassembly process.
Decompilers – Friend or Foe
When a programmer writes software and releases it to the public, he (or she)
normally releases a compiled version of the application that users can run on their own
machine. Whether it is a commercial offering or a free piece of software, the
programmer has put a considerable amount of time and effort into producing it. The
source code behind the software is something private that the programmer has created.
Programmers don’t want people looking for flaws in their software, and they don’t want
people to change the title of the software and then redistribute it as someone else’s
product. It is for this reason that programmers don’t often release their source code, but
few realize that every time they release compiled software, they are also giving people
the opportunity to reconstruct the source code.
Decompilation and reverse engineering are often prohibited by software license
agreements, but this won’t always stop an unscrupulous competitor or an enthusiastic
hacker from analyzing the code. While decompilers do represent a threat, they can also be
of great benefit to programmers, and there are many legitimate purposes for their use.
Some companies might decompile a competitor’s software to establish the structure of
its data files, in order to include support for that file type in their own application.
Whether or not such actions are legal is a gray area, but support for competing
spreadsheets, word processors, databases, etc. is handy for end-users.
Decompilers aren’t necessarily evil, but they do pose an ethical dilemma for many
software developers. Programmers can protect their software against decompilation,
or at least make the task harder, by using special software that shields it from prying
eyes. Decompilers can also be used to steal the source code of competitors, or by hackers
to determine weaknesses in the design of software. But just blaming the decompiler is
meaningless: it is the programmer who uses it for intellectual property theft, or the
hacker who decompiles the software to find security holes, that is at fault.
Legal Aspects
If decompilation is possible to a certain extent, is it then also allowed?
Throughout the world, copyright law protects most programs. Copyright protects the
expression of an idea in the form of a program, hence protecting the developer’s (or
company’s) intellectual property in the software. Copyright law provides a bundle of
exclusive rights to the software developer, among others the right to reproduce and make
adaptations of the developed computer program. Making reproductions and adaptations
without permission of the copyright holder is a breach of these rights. Different
countries have different exceptions to the copyright owner’s rights, set out in law or
established by precedent in court proceedings. These are uses that are allowed by law,
but they vary from country to country. The most common ones are:
• Decompilation for the purpose of interoperability (with another piece of software
or hardware) where the interface specification has not been made available,
• Decompilation for the purpose of error correction where the owner of the
copyright is not available to make the correction, and
• Decompilation to determine parts of a program (algorithms) that are not protected
by copyright, without breach of other forms of protection (trade secrets).
SCOPE AND NEED FOR DECOMPILATION
Decompilation is a tool for a computer professional. There are two major areas
where decompilation is used: Software Maintenance and Security.
In the former area, it is used:
• To recover lost or inaccessible source code
• When third party vendors go out of business
• To translate code written in an obsolete language into a newer language
• To structure old code written in an unstructured way (i.e. spaghetti code) into
a structured program
• To migrate applications to a new hardware platform
• To debug binary programs that are known to have bugs but for which the
source code is unavailable
• When multiple versions of source have been created and creation dates are
destroyed making it impossible to tell which source matches the currently
running object/executable.
In the latter area, decompilation is used as a tool to verify the object code
produced by a compiler in safety-critical systems. The extensive use of computers and
networks worldwide has raised awareness of the need for tools and techniques to aid
in the security analysis of binary code, such as understanding malware (viruses,
Trojans, worms, backdoors) and general security flaws, in order to provide
immediate solutions with or without the aid of software vendors, whether the flaws are
caused by intentionally introduced malicious code or by the malicious exploitation of
benign code to the detriment of the user.
The classical technique used to study malware is to use a debugger to step through the
executable program (containing thousands of lines of assembly code) one assembly
instruction at a time until the problem is found; it is then possible to reconstruct that
part of the traced program in order to provide a solution for it. This method requires an
expert engineer who understands assembly code, a skill that is disappearing as years go
by due to the increasing use of higher-level languages such as C++ and Java.
Decompilation reduces the amount of code that the engineer has to process and presents
the engineer with a higher level of abstraction, so that fewer man-hours are needed
to understand the program’s code. Further, these techniques reduce the
additional skills and training required of professionals, especially those working in
network security teams. Thus, decompilation would effectively help in reducing the amount
of time needed to trace a security flaw in an executable program, as well as reducing the
costs of acquiring or training skilled assembler engineers.
DECOMPILATION PROBLEMS
A decompiler writer has to face several theoretical and practical problems. Some of
these problems can be solved by the use of heuristic methods; others cannot be solved
completely:
• Recursive undecidability (need for proof of abstract concept)
• The von Neumann Architecture (inseparable data and instructions)
• Self-modifying code (for best utilization of available memory)
• Idioms (instruction sequences that form a single logical entity but are hard to
identify as such)
• Virus and Trojan tricks (hiding of malicious code)
• Architecture-dependent Restrictions
• Subroutines included by Compiler/Linker
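To make the idiom problem from the list above more concrete, the following Python sketch
rewrites a few well-known x86 idioms into the statement they stand for. The function name
rewrite_idiom and the choice of idioms are assumptions made for this example; a real
decompiler would carry a much larger, compiler-specific idiom table.

```python
def rewrite_idiom(instruction: str) -> str:
    """Return a high-level reading of a known idiom, or the input unchanged."""
    parts = instruction.replace(",", " ").split()
    mnemonic = parts[0].lower()
    operands = [p.lower() for p in parts[1:]]

    if len(operands) == 2:
        dst, src = operands
        # xor r, r and sub r, r both clear the register.
        if mnemonic in ("xor", "sub") and dst == src:
            return f"{dst} = 0"
        # A left shift by a constant n is a multiplication by 2**n.
        if mnemonic == "shl" and src.isdigit():
            return f"{dst} = {dst} * {2 ** int(src)}"
        # Adding a register to itself doubles it.
        if mnemonic == "add" and dst == src:
            return f"{dst} = {dst} * 2"
    # Not a recognized idiom: leave the instruction as it is.
    return instruction


for line in ["xor ax, ax", "shl ax, 1", "add bx, bx", "mov ax, bx"]:
    print(f"{line:12} -> {rewrite_idiom(line)}")
```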
RUN-TIME ENVIRONMENT
Before considering decompilation, the relationship between the static binary code of
the program and the actions performed at run-time to implement the program should be
considered. The representation of objects in a binary program differs between compilers;
elementary data types such as integers, characters, and reals are often represented by
an equivalent data object in the machine, whereas aggregate objects such as arrays,
strings, and structures are represented in various different ways. A high-level language
program is composed of one or more subroutines, called the user subroutines. The
corresponding binary program is composed of user subroutines, library routines that were
invoked by the user program, and other subroutines linked in by the linker to provide
support for the compiler at run-time. The general format of the binary code contains
start-up code, the user program including library routines, and exit code. In DOS and
Windows environments, when a program is loaded into memory, a Program Segment Prefix is
built in the first bytes of the allocated memory; it contains important information such
as parent information, interrupt details, etc.
Each subroutine is associated with a stack frame at run-time, containing its set of
parameters, local variables, and the return address into the calling subroutine. Entering
a subroutine, allocating local data, preserving register values, accessing parameters,
returning a value, exiting the subroutine, and parameter passing are some of the important
tasks to be recognized in the byte code. Meanwhile, a symbol table is normally built to
store information on the variables used throughout the program. Variables are identified
by their address: variables that have a physical memory address are global variables,
those located at a negative offset from the frame pointer are local variables of the
corresponding stack frame’s subroutine, and those at positive offsets are actual
arguments to the subroutine. Register variables need special attention. The symbol table
grows dynamically and is built for easy and quick access.
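The address-based rule for classifying variables can be sketched in a few lines of
Python. The sketch assumes that frame offsets are written relative to the bp register,
as in the listings later in this document, and that a bracketed plain number is an
absolute address. The function names and the generated names (local1, arg1, global1) are
placeholders invented for this example.

```python
import re


def classify_operand(operand: str) -> str:
    """Classify a memory operand as 'local', 'arg', 'global' or 'unknown'."""
    text = operand.replace(" ", "").lower()
    frame = re.fullmatch(r"\[bp([+-])(\d+)\]", text)
    if frame:
        # Negative offsets are locals, positive offsets are arguments.
        return "local" if frame.group(1) == "-" else "arg"
    if re.fullmatch(r"\[(0x[0-9a-f]+|\d+)\]", text):
        # An absolute address is treated as a global variable.
        return "global"
    return "unknown"


def build_symbol_table(operands):
    """Map each distinct memory operand to (kind, generated name)."""
    table = {}
    counters = {"local": 0, "arg": 0, "global": 0}
    for operand in operands:
        key = operand.replace(" ", "").lower()
        kind = classify_operand(key)
        if key in table or kind == "unknown":
            continue
        counters[kind] += 1
        table[key] = (kind, f"{kind}{counters[kind]}")
    return table


print(build_symbol_table(["[bp-2]", "[bp+4]", "[0x1234]", "[bp-2]"]))
```

A dictionary is used here because it gives the quick lookup by address that the text
asks for and grows as new operands are seen. Register variables are not handled by this
sketch.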
PHASES OF A DECOMPILER
Conceptually, a decompiler is structured in a similar way to a compiler, by a
series of phases that transform the source machine program from one representation to
another.
Syntax Analyzer: It groups bytes of the source program into grammatical phrases
(or sentences) of the source machine language, using a parse tree. Case tables are
used to distinguish data and instructions.
Semantic Analyzer: It checks the source program for the semantic meaning of
groups of instructions; gathers type information and propagates this type across
the subroutine.
Intermediate Code Generation: An explicit intermediate representation of the
source program is necessary for the decompiler to analyze the program. The
second pass would use this intermediate code to generate target language code.
Data Flow Analysis: This phase attempts to improve intermediate code, by
eliminating use of temporary registers and condition flags, thereby identifying
high-level language expressions.
Control Flow Analysis: This phase is useful to eliminate compiler-generated
intermediate jumps and to determine the high-level control structures used in the
program.
Target Code Generation: High-level language code is generated here after
selecting names for local and global variables. Subroutine names are selected
using library and signature bindings. Further, intermediate instructions and
control structures are translated to appropriate high-level statements.
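One small part of control flow analysis, removing compiler-generated intermediate jumps,
can be sketched as follows. The sketch assumes an earlier pass has already found the
labels whose block consists of nothing but an unconditional jump, given here as a
dictionary from label to jump target. The name thread_jumps and the labels are invented
for this example.

```python
def thread_jumps(jump_only_blocks: dict, target: str) -> str:
    """Follow a chain of jump-only blocks and return the final destination."""
    visited = set()
    # Stop when the target is a real block, or when a cycle is detected.
    while target in jump_only_blocks and target not in visited:
        visited.add(target)
        target = jump_only_blocks[target]
    return target


# L1 only jumps to L2, and L2 only jumps to L5.
jump_only_blocks = {"L1": "L2", "L2": "L5"}
print(thread_jumps(jump_only_blocks, "L1"))  # destination after threading
print(thread_jumps(jump_only_blocks, "L7"))  # not a jump-only block
```

After every branch has been redirected to its final destination in this way, the
jump-only blocks become unreachable and can be dropped before the high-level control
structures are identified.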
DECOMPILATION SYSTEM
Many people mistakenly believe that decompilation is equivalent to disassembly, but
a disassembler is just one module of a decompilation system. The entire decompilation
system involves the following tools:
Loader: It loads the binary program into memory and, if the machine code is
relocatable, relocates it by altering the required instructions.
Signature Generator: It delinks and generates patterns that uniquely identify
each compiler and library subroutine, reducing the number of subroutines left with
arbitrary names.
Disassembler: It transforms machine language into assembly language and, in some
cases, into a higher-level representation.
Library Binder: It binds the subroutine names to the appropriate library routines.
Postprocessor: It converts the high-level program into a semantically equivalent
high-level program, for example by converting generic control structures (while loops)
into more appropriate control structures of the high-level language (for loops).
[Figure: A Decompilation System]
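The way signatures help the library binder can be sketched as byte-pattern matching with
wildcards. In the Python sketch below a signature is a list of byte values in which None
marks a byte that may vary, for example an address filled in by the linker. The routine
name _demo_routine and its pattern are made up for this example; they are not taken from
any real compiler library.

```python
def matches(signature, code: bytes, offset: int = 0) -> bool:
    """True if `code` at `offset` matches `signature`; None is a wildcard."""
    if offset + len(signature) > len(code):
        return False
    return all(
        expected is None or expected == code[offset + i]
        for i, expected in enumerate(signature)
    )


def identify(code: bytes, signatures: dict):
    """Return the name of the first signature matching the start of `code`."""
    for name, signature in signatures.items():
        if matches(signature, code):
            return name
    return None


# Hypothetical signature: push bp; mov bp, sp; call <any 16-bit target>.
SIGNATURES = {"_demo_routine": [0x55, 0x8B, 0xEC, 0xE8, None, None]}

subroutine = bytes([0x55, 0x8B, 0xEC, 0xE8, 0x10, 0x00, 0x5D, 0xC3])
print(identify(subroutine, SIGNATURES))
```

A subroutine recognized in this way can be given its library name and skipped by the
later analysis phases, since its source is not part of the user program.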
SIMPLE ILLUSTRATIVE APPROACH
A simple approach to decompiling a binary executable is to first parse it and
separate it into functions (C style). Once we know where the entry point of the program
is (“main” for a C program), we can start decompiling that function and any other
function it calls. After we have separated out the instructions for a function, we need
to emulate the processor and interpret each machine instruction, combining logical groups
of instructions into simple high-level language statements.
Understanding Assignment Statements and Expressions: Look at the code:
    mov ax, [bp+4]
    mov bx, 20
    mul bx
    add ax, 4
    mov [bp+4], ax
Let there be two variables, wAX and wBX, standing for the registers ax and bx; then:
i) wAX = [bp+4]
ii) wBX = 20
iii) wAX = wAX*wBX = [bp+4]*20
iv) wAX = wAX+4 = ([bp+4]*20) + 4
v) [bp+4] = wAX = ([bp+4]*20) + 4
And, if we substitute i for [bp+4], we get: i = (i*20) + 4;
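The substitution steps above can be mechanized by tracking a symbolic expression for each
register. The following Python sketch handles only the three mnemonics used in the
listing (mov, mul, add) and ignores the upper half of the mul result in dx. The function
name decompile_assignment and the mapping of [bp+4] to i are assumptions of this example.

```python
def decompile_assignment(listing, names):
    """Turn a straight-line mov/mul/add listing into C-like statements.

    `names` maps memory operands such as "[bp+4]" to variable names.
    """
    registers = {}   # register -> symbolic expression currently held
    statements = []

    def value_of(operand):
        if operand in names:
            return names[operand]
        # A register yields its expression; a constant yields itself.
        return registers.get(operand, operand)

    for line in listing:
        mnemonic, *operands = line.lower().replace(",", " ").split()
        if mnemonic == "mov":
            dst, src = operands
            if dst in names:
                # A store to a named memory location ends the expression.
                statements.append(f"{names[dst]} = {value_of(src)};")
            else:
                registers[dst] = value_of(src)
        elif mnemonic == "mul":
            # mul multiplies ax by its operand and leaves the result in ax.
            registers["ax"] = f"({registers['ax']} * {value_of(operands[0])})"
        elif mnemonic == "add":
            dst, src = operands
            registers[dst] = f"({registers[dst]} + {value_of(src)})"
        else:
            raise ValueError(f"unsupported instruction: {line}")
    return statements


listing = ["mov ax, [bp+4]", "mov bx, 20", "mul bx", "add ax, 4", "mov [bp+4], ax"]
print(decompile_assignment(listing, {"[bp+4]": "i"}))
```

The sketch keeps every pair of parentheses, so the statement it produces is equivalent
to, but slightly more verbose than, the hand-derived i = (i*20) + 4.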
Understanding Condition Evaluation and Branches: Let’s look at:
    mov ax, [bp+4]
    cmp ax, 10
    jnz lab1
    mov bx, 15
    mov [bp+2], bx
    jmp lab2
lab1: mov bx, 20
    mov [bp+2], bx
lab2:
Using the cmp and jnz instructions, and substituting i for [bp+4] and j for [bp+2], we
conclude: if (i != 10) j = 20; else j = 15; But we should remember that a condition need
not always be evaluated using a “compare” instruction; even arithmetic instructions can
set or reset the condition flags.
Interpreting function calls and return values: The usual convention for calling a
function is to push the parameters onto the stack and then call the function.
So, “push” and “call” instructions indicate a function call. Look at the code:
    mov ax, [bp+4]
    push ax
    mov ax, [bp+2]
    push ax
    call _func
    mov [bp+4], ax
Matching [bp+4] with “i” and [bp+2] with “j”, and assuming the C convention in which
arguments are pushed from right to left and the result is returned in ax, we get:
i = func(j, i);
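The same reasoning can be sketched in Python: collect the pushed values, reverse them at
the call to recover the source order of the arguments, and treat ax as holding the return
value. The sketch assumes the C calling convention (arguments pushed right to left,
result in ax) and handles only mov, push and call. The function name reconstruct_call is
invented for this example.

```python
def reconstruct_call(listing, names):
    """Rebuild a C-like call statement from a push/call/mov sequence."""
    registers = {}   # register -> symbolic value
    pushed = []      # values pushed since the last call, in push order
    for line in listing:
        mnemonic, *operands = line.lower().replace(",", " ").split()
        if mnemonic == "mov":
            dst, src = operands
            value = names.get(src, registers.get(src, src))
            if dst in names:
                # Storing into a named variable completes the statement.
                return f"{names[dst]} = {value};"
            registers[dst] = value
        elif mnemonic == "push":
            pushed.append(registers.get(operands[0], operands[0]))
        elif mnemonic == "call":
            callee = operands[0].lstrip("_")
            # The last value pushed is the first argument in the source.
            arguments = ", ".join(reversed(pushed))
            registers["ax"] = f"{callee}({arguments})"
            pushed.clear()
    return None


listing = [
    "mov ax, [bp+4]", "push ax",
    "mov ax, [bp+2]", "push ax",
    "call _func",
    "mov [bp+4], ax",
]
print(reconstruct_call(listing, {"[bp+4]": "i", "[bp+2]": "j"}))
```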
CONCLUSION
Reverse engineering is a field of research that has attracted many hackers and
researchers. Though decompilation is unsolvable in general, extensive research is being
conducted to improve the results. Decompilation should be thought of as a means to
retrieve the source code in case of emergency, without violating any laws. Decompilers
remain challenging to build and maintain for various reasons: multiple versions of the
compiler exist for the same platform, the compiler itself continues to change
and those changes must be kept up with, etc. Nobody can stop unscrupulous persons and
hackers from decompiling illegally, but the use of decompilers should be directed towards
valid purposes.