
THE ASSEMBLY LANGUAGE

Some knowledge of assembly is necessary in order to understand the operation of
buffer overflow exploits. There are essentially three kinds of languages:

Language: Machine language
Description: This is what the computer actually sees and deals with. Every command the
computer sees is given as a number or sequence of numbers. It is in binary, normally
presented in hex to make it more readable.
Example:
    83 ec 08        -> sub $0x8,%esp
    83 e4 f0        -> and $0xfffffff0,%esp
    b8 00 00 00 00  -> mov $0x0,%eax
    83 c0 0f        -> add $0xf,%eax

Language: Assembly language
Description: This is the same as machine language, except the command numbers have been
replaced by letter sequences which are more readable and easier to memorize.
Example (AT&T):
    push %ebp
    sub $0x8,%esp
    movb $0x41,0xffffffff(%ebp)
Example (Intel):
    push ebp
    mov ebp, esp
    sub esp, 0C0h
Example (HLA, High Level Assembly):
    program HelloWorld;
    #include( "stdlib.hhf" )
    begin HelloWorld;
        stdout.put( "Hello, World of Assembly Language", nl );
    end HelloWorld;

Language: High-level language
Description: High-level languages are there to make programming easier. Assembly language
requires you to work with the machine itself. High-level languages allow you to describe
the program in a more natural language. A single command in a high-level language is
usually equivalent to several commands in an assembly language. Readability is the best.
Example (C/C++):
    #include <stdio.h>

    int main()
    {
        char name[20];
        …
        return 0;
    }

Table 1: Kinds of languages.
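
To make the machine-language row of Table 1 concrete, the following is a minimal C sketch that decodes the three-byte instructions shown there (opcode 0x83, a ModRM byte and an 8-bit immediate) into AT&T text. The function name decode_op83 and its lookup tables are illustrative assumptions, not part of any real disassembler, and only the register-direct form is handled.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decodes a 3-byte x86 instruction of the form
 * 83 /r ib (32-bit register operand, sign-extended 8-bit immediate)
 * into AT&T syntax. Returns 0 on success, -1 if unsupported. */
int decode_op83(const uint8_t code[3], char *out, size_t out_len)
{
    /* The reg field of the ModRM byte selects the operation for 0x83. */
    static const char *ops[8]  = { "add", "or", "adc", "sbb",
                                   "and", "sub", "xor", "cmp" };
    /* The rm field selects the 32-bit register when mod == 3. */
    static const char *regs[8] = { "eax", "ecx", "edx", "ebx",
                                   "esp", "ebp", "esi", "edi" };

    if (code[0] != 0x83)
        return -1;

    uint8_t modrm = code[1];
    uint8_t mod = modrm >> 6;         /* top two bits    */
    uint8_t reg = (modrm >> 3) & 7;   /* middle three    */
    uint8_t rm  = modrm & 7;          /* low three bits  */
    if (mod != 3)                     /* memory operands not handled */
        return -1;

    /* Sign-extend the 8-bit immediate to 32 bits, as the CPU does. */
    uint32_t imm = (uint32_t)(int32_t)(int8_t)code[2];

    snprintf(out, out_len, "%s $0x%x,%%%s",
             ops[reg], (unsigned)imm, regs[rm]);
    return 0;
}
```

Passing the bytes 83 e4 f0 shows why the table lists the operand as $0xfffffff0: the single byte f0 is sign-extended to 32 bits before it is used.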

Assembly is a symbolic language that is translated into machine language by an
assembler. In other words, each assembly statement is a mnemonic that corresponds directly
to a processor-specific instruction. Each type of processor has its own instruction set
and thus its own assembly language. Assembly deals directly with the registers of the
processor and with memory locations. Some general rules that typically hold for most
assembly languages are listed below:
• Source can be memory, register or constant.
• Destination can be memory or non-segment register.
• Only one of source and destination can be memory.
• Source and destination must be the same size.
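
The four rules above can be sketched as a small C check. The type and function names (operand_kind, operand, operands_valid) are assumptions made for this illustration only, and a constant is assumed to be encoded at the width given in size_bytes.

```c
#include <stdbool.h>

/* Illustrative operand kinds; the names are assumptions for this sketch. */
enum operand_kind {
    OP_CONSTANT,
    OP_REGISTER,          /* general-purpose (non-segment) register */
    OP_SEGMENT_REGISTER,
    OP_MEMORY
};

struct operand {
    enum operand_kind kind;
    int size_bytes;       /* operand width, e.g. 1, 2 or 4 */
};

/* Returns true if the source/destination pair obeys the four rules. */
bool operands_valid(struct operand src, struct operand dst)
{
    /* Rule 1 holds by construction: every kind above may be a source. */

    /* Rule 2: destination must be memory or a non-segment register. */
    if (dst.kind != OP_MEMORY && dst.kind != OP_REGISTER)
        return false;

    /* Rule 3: at most one of the two operands may be memory. */
    if (src.kind == OP_MEMORY && dst.kind == OP_MEMORY)
        return false;

    /* Rule 4: both operands must have the same size. */
    return src.size_bytes == dst.size_bytes;
}
```

These are the general rules as stated in the text; a real instruction set has exceptions, so the check is a teaching aid rather than a validator for any particular assembler.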
Opcodes are the actual instructions that a program performs. Each instruction is
represented by one line of code, which contains the opcode and the operands that it
uses. The number of operands varies depending on the opcode. The entire suite of
opcodes available to a processor is called an instruction set. Depending on the
processor, OS, and assembler or disassembler used, the operands may appear in reverse
order. For example, on Windows:
MOV dst, src
is equivalent to:
MOV %src, %dst
on Linux.
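
The operand-order difference can be sketched with two small C formatting helpers. The names format_intel and format_att are hypothetical, only register-to-register instructions are covered, and the size suffix that AT&T mnemonics often carry (for example movl) is left out for simplicity.

```c
#include <stdio.h>

/* Hypothetical helpers: render one two-operand, register-to-register
 * instruction in each syntax. Register names are passed without the
 * '%' prefix, e.g. "eax". */
void format_intel(char *out, size_t n, const char *mnemonic,
                  const char *dst, const char *src)
{
    /* Intel syntax: destination first, no register prefix. */
    snprintf(out, n, "%s %s, %s", mnemonic, dst, src);
}

void format_att(char *out, size_t n, const char *mnemonic,
                const char *dst, const char *src)
{
    /* AT&T syntax: source first, registers prefixed with '%'. */
    snprintf(out, n, "%s %%%s, %%%s", mnemonic, src, dst);
}
```

Both helpers take the destination and source in the same argument order, so calling them with identical arguments makes the reversal visible in the output alone.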
Windows tools typically use Intel syntax, whereas GNU tools on Linux default to AT&T
syntax. Another instruction set you may encounter is PowerPC, used by older Mac OS
systems and developed by Apple, IBM and Motorola. High Level Assembly (HLA)
(http://www.plantation-productions.com/Webster/) is also quite popular among
programmers. This paper uses both Intel and AT&T syntax. Whatever assembly is used,
there are several common categories of instructions based on their usage, as listed in
the following table.

Instruction category: Data transfer
Meaning: move from source to destination
Example: mov, lea, les, push, pop, pushf, popf

Instruction category: Arithmetic
Meaning: arithmetic on integers
Example: add, adc, sub, sbb, mul, imul, div, idiv, cmp, neg, inc, dec, xadd, cmpxchg

Instruction category: Floating point
Meaning: arithmetic on floating point
Example: fadd, fsub, fmul, fdiv, fcom

Instruction category: Logical, shift, rotate and bit
Meaning: bitwise logic operations
Example: and, or, xor, not, shl/sal, shr, sar, shld, shrd, ror, rol, rcr, rcl

Instruction category: Control transfer
Meaning: conditional and unconditional jumps, procedure calls
Example: jmp, jcc, call, ret, int, into, bound

Instruction category: String
Meaning: move, compare, input and output
Example: movs, lods, stos, scas, cmps, outs, rep, repz, repe, repnz, repne, ins

Instruction category: I/O
Meaning: input and output
Example: in, out

Instruction category: Conversion
Meaning: provide assembly data type conversion
Example: movzx, movsx, cbw, cwd, cwde, cdq, bswap, xlat

Instruction category: Miscellaneous
Meaning: manipulate individual flags, provide special processor services, or handle
privileged mode operations
Example: clc, stc, cmc, cld, std, cli, sti

Table 2: Assembly instruction set categories.

The following shows a portion of C source code and an equivalent in assembly, using
AT&T syntax on Linux/Intel.

C code          Label   Mnemonic  Operands     Comment
if (a > b)              movl      a, %eax
                        cmpl      b, %eax      # compare, a - b
                        jle       L1           # jump to L1 if a <= b
    c = a;              movl      a, %eax      # a > b branch
                        movl      %eax, c
                        jmp       L2           # finish, jump to L2
else            L1:                            # a <= b branch
    c = b;              movl      b, %eax
                        movl      %eax, c
                L2:                            # finish

Figure 1: C and assembly codes.
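
The jump structure of Figure 1 can be mirrored in C itself by writing the branch with goto. This is only a sketch to show how the compiler inverts the condition and uses two labels; the function name select_larger is an assumption, and ordinary C code would simply use if/else.

```c
/* Mirrors the control flow of Figure 1: the condition is inverted
 * (jump when a <= b) and the labels stand in for L1 and L2. */
int select_larger(int a, int b)
{
    int c;

    if (a <= b)      /* cmpl b, %eax ; jle L1 */
        goto L1;
    c = a;           /* a > b branch */
    goto L2;         /* jmp L2: skip the else part */
L1:
    c = b;           /* a <= b branch */
L2:
    return c;        /* both paths meet here */
}
```

The unconditional goto L2 plays the role of jmp L2: without it, execution would fall through into the else branch and overwrite c.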

Assemblers available for assembly languages include the Macro Assembler
(MASM, http://www.masm32.com/), GNU's Assembler (GAS), Borland's Turbo Assembler
(TASM; the company has moved to microfocus.com and its IDE products to
embarcadero.com), the Netwide Assembler (NASM) and GoASM
(http://www.godevtool.com/). HLA is available from Webster at
http://www.plantation-productions.com/Webster/.
COMPILER, ASSEMBLER, LINKER AND LOADER
Normally, building a C program involves four stages and uses different tools: a
preprocessor, a compiler, an assembler, and a linker. The result should be a single
executable image that is ready to be loaded by the loader and run as a program. The
stages below happen in this order regardless of the operating system or compiler, and
are illustrated graphically in Figure 2.

1. Preprocessing is the first pass of any C compilation. It processes include-files,
conditional compilation instructions and macros.
2. Compilation is the second pass. It takes the output of the preprocessor (the
preprocessed source code) and generates assembler source code.
3. Assembly is the third stage of compilation. It takes the assembly source code
and produces an assembly listing with offsets. The assembler output is stored in
an object file.
4. Linking is the final stage of compilation. It takes one or more object files or
libraries as input and combines them to produce a single (usually executable)
file. In doing so, it resolves references to external symbols, assigns final
addresses to procedures/functions and variables, and revises code and data to
reflect new addresses (a process called relocation).
5. Loading places the executable image in memory so that the program can run.
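
As a minimal sketch of what the first stages act on, the following C file contains a macro and a conditional-compilation block, with the gcc command for each stage noted in the header comment. The file name stages.c, the macro names and the companion object file main.o are hypothetical; the -E, -S and -c options are the standard gcc options for stopping after each stage.

```c
/* stages.c (hypothetical file name)
 *
 * With gcc, each stage can be run separately:
 *   gcc -E stages.c -o stages.i     preprocess only
 *   gcc -S stages.i -o stages.s     compile to assembler source
 *   gcc -c stages.s -o stages.o     assemble into an object file
 *   gcc stages.o main.o -o app      link (main.o, which would hold
 *                                   main(), is assumed to exist)
 */
#define BUFFER_SIZE 20           /* macro: replaced by the preprocessor */

#ifdef VERBOSE                   /* conditional compilation */
#define GREETING "buffer size (verbose build)"
#else
#define GREETING "buffer size"
#endif

/* After preprocessing, the body reads: return 20 * 2; */
int doubled_buffer_size(void)
{
    return BUFFER_SIZE * 2;
}

/* Which string is returned was decided before compilation proper. */
const char *greeting(void)
{
    return GREETING;
}
```

Inspecting the .i file produced by the first command shows that the macros and the #ifdef block are gone before the compiler proper ever sees the code.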

Bear in mind that if you use a compiler built into an Integrated Development
Environment (IDE), these stages are largely hidden from you. Next we examine in more
detail what happens before and after the linking stage. For any given input file, the
file name suffix (file extension) determines what kind of compilation is done; the
suffixes recognized by gcc are listed in Table 3.

File extension    Description
file_name.c       C source code which must be preprocessed.
file_name.i       C source code which should not be preprocessed.
file_name.ii      C++ source code which should not be preprocessed.
file_name.h       C header file (not to be compiled or linked).
file_name.cc, file_name.cp, file_name.cxx, file_name.cpp, file_name.c++, file_name.C
                  C++ source code which must be preprocessed. For file_name.cxx, the xx
                  must both be literally the character x, and in file_name.C the C is a
                  capital C.
file_name.s       Assembler code.
file_name.S       Assembler code which must be preprocessed.
file_name.o       By default, the object file name for a source file is made by
                  replacing the extension .c, .i, .s etc. with .o

Table 3: File suffix.


The following Figure shows the steps involved in the process of building the C program
starting from the compilation until the loading of the executable image into the memory
for program running.
Figure 2: C program building process.

OBJECT FILES AND EXECUTABLE


After the source code has been assembled, it produces object files, which are then
linked to produce an executable file. Object and executable files come in several
formats, such as ELF (Executable and Linking Format) and COFF (Common Object File
Format). For example, ELF is used on Linux systems, while a COFF variant (PE) is used
on Windows systems. Other object file formats that you may come across are listed in
the following table.

Object file format: a.out
Description: The a.out format is the original file format for Unix. It consists of three
sections: text, data, and bss, which are for program code, initialized data, and
uninitialized data, respectively. This format is so simple that it doesn't have any
reserved place for debugging information. The only debugging format for a.out is stabs,
which is encoded as a set of normal symbols with distinctive attributes.

Object file format: COFF
Description: The COFF (Common Object File Format) format was introduced with System V
Release 3 (SVR3) Unix. COFF files may have multiple sections, each prefixed by a
header. The number of sections is limited. The COFF specification includes support
for debugging but the debugging information was limited.

Object file format: ECOFF
Description: A variant of COFF. ECOFF is an Extended COFF originally introduced for Mips
and Alpha workstations.

Object file format: XCOFF
Description: The IBM RS/6000 running AIX uses an object file format called XCOFF
(eXtended COFF). The COFF sections, symbols, and line numbers are used, but debugging
symbols are dbx-style stabs whose strings are located in the .debug section (rather than
the string table). The default name for an XCOFF executable file is a.out.

Object file format: PE
Description: Windows 9x and NT use the PE (Portable Executable) format for their
executables. PE is basically COFF with additional headers.

Object file format: ELF
Description: The ELF (Executable and Linking Format) format came with System V Release 4
(SVR4) Unix. ELF is similar to COFF in being organized into a number of sections,
but it removes many of COFF's limitations. ELF is used on most modern Unix systems,
including GNU/Linux, Solaris and Irix. It is also used on many embedded systems.

Object file format: SOM/ESOM
Description: SOM (System Object Module) and ESOM (Extended SOM) are HP's object file and
debug formats (not to be confused with IBM's SOM, which is a cross-language
Application Binary Interface - ABI).

Table 4: Object file formats.

When we examine the content of these object files, we find areas called
sections. Depending on the settings of the compilation and linking stages, sections can
hold:

1. Executable code.
2. Data.
3. Dynamic linking information.
4. Debugging data.
5. Symbol tables.
6. Relocation information.
7. Comments.
8. String tables, and
9. Notes.
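
As a rough illustration of how C source maps onto the code and data sections listed above, the sketch below annotates each definition with the section it usually lands in. The section names assume gcc on an ELF system; the exact placement depends on the compiler, its options and the object file format, and all identifiers are hypothetical.

```c
/* Typical placement with gcc on an ELF system (an assumption; the
 * exact sections depend on the compiler, options and target format). */

int initialized_counter = 42;     /* initialized data: usually .data   */
int uninitialized_counter;        /* uninitialized data: usually .bss  */
const char banner[] = "hello";    /* read-only data: usually .rodata   */

/* Executable code: usually .text */
int next_count(void)
{
    /* Objects without an initializer start at zero. */
    uninitialized_counter += 1;
    return initialized_counter + uninitialized_counter;
}
```

On a GNU/Linux system, running objdump -h or readelf -S on the resulting object file lists the section headers, so the placement can be checked directly.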

THE RELOCATION RECORDS


Because the various object files will include references to each other's code and/or
data, these references need to be resolved at link time. For example, in Figure 1, the
object file that contains main() includes calls to the funct() and printf() functions.
After combining all of the object files, the linker uses the relocation records to find
all of the addresses that need to be filled in.
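
The idea of filling in addresses from relocation records can be sketched with a toy model in C. This is not a real object file format: the structures reloc and symbol, the function apply_relocations and the use of absolute 32-bit little-endian addresses are all simplifying assumptions (real x86 call instructions, for instance, usually need a relative displacement instead).

```c
#include <stdint.h>
#include <string.h>

/* Toy model: a relocation record names a symbol and the offset in the
 * code where that symbol's 32-bit address has to be written. */
struct reloc  { uint32_t offset; const char *symbol; };
struct symbol { const char *name; uint32_t address; };

/* Patches every location listed in relocs with the final address of
 * the symbol it refers to. Returns the number of unresolved records. */
int apply_relocations(uint8_t *code, size_t code_len,
                      const struct reloc *relocs, size_t n_relocs,
                      const struct symbol *symbols, size_t n_symbols)
{
    int unresolved = 0;

    for (size_t i = 0; i < n_relocs; i++) {
        const struct symbol *found = NULL;

        for (size_t j = 0; j < n_symbols; j++) {
            if (strcmp(symbols[j].name, relocs[i].symbol) == 0) {
                found = &symbols[j];
                break;
            }
        }
        /* Count records with no matching symbol or no room to patch. */
        if (found == NULL || (size_t)relocs[i].offset + 4 > code_len) {
            unresolved++;
            continue;
        }
        /* Store the address byte by byte, little-endian as on x86. */
        for (int k = 0; k < 4; k++)
            code[relocs[i].offset + k] =
                (uint8_t)(found->address >> (8 * k));
    }
    return unresolved;
}
```

A non-zero return value corresponds to the familiar "undefined reference" situation: a record exists, but no object file or library supplied the symbol.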

THE SYMBOL TABLE


Since assembling to machine code removes all traces of labels from the code, the
object file format has to keep them in a different place. This is the job of the
symbol table, a list of names and their corresponding offsets in the text and data
segments. A disassembler can use it when translating an object file or executable
back into assembly.
Figure 1: The relocation record.
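
A symbol table as described above, a list of names and offsets, can be sketched in C as follows. The structure sym_entry and the function symbol_for_offset are hypothetical; the lookup shows how a disassembler might map an offset back to the nearest preceding label.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy symbol table entry: a name and its offset in the text segment. */
struct sym_entry { const char *name; uint32_t offset; };

/* Returns the name of the symbol with the greatest offset that is
 * still <= the given offset, or NULL if no symbol precedes it. */
const char *symbol_for_offset(const struct sym_entry *table, size_t n,
                              uint32_t offset)
{
    const struct sym_entry *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (table[i].offset <= offset &&
            (best == NULL || table[i].offset > best->offset))
            best = &table[i];
    }
    return best ? best->name : NULL;
}
```

When the symbol table has been stripped from an executable, a lookup like this has nothing to work with, which is why disassembly of stripped binaries shows bare addresses instead of names.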

LINKING (EXAMPLE IN LISTING – LINKING 2 OBJECT FILES)


The linker is what makes separate compilation possible. As shown in Figure 2, an
executable can be made up of a number of source files, each of which is compiled and
assembled independently into its own object file.
Figure 2: Linking process of object files
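
Separate compilation can be sketched with two hypothetical source files, shown here in a single listing for brevity. The file names funct.c and app.c and the function sum_of_squares are assumptions; funct() is the name used in the text.

```c
/* ---- funct.c (hypothetical): compiled on its own into funct.o ---- */
int funct(int x)
{
    return x * x;
}

/* ---- app.c (hypothetical): compiled on its own into app.o ----
 * Only the declaration of funct() is visible here; the call below is
 * left as an unresolved reference for the linker to fill in. */
int funct(int x);

int sum_of_squares(int n)
{
    int total = 0;

    for (int i = 1; i <= n; i++)
        total += funct(i);   /* resolved at link time */
    return total;
}
```

With gcc, each file would be built with the -c option (gcc -c funct.c and gcc -c app.c), and the resulting object files would then be linked together with the rest of the program. Changing funct.c requires recompiling only that file and relinking.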

SHARED OBJECTS
In a typical system, a number of programs will be running. Each program relies on a
number of functions, some of which will be standard C library functions, like printf(),
malloc(), strcpy(), etc. If every program uses the standard C library, each program
would normally carry its own copy of this library within it. Unfortunately, this wastes
resources and degrades efficiency and performance. Since the C library is common, it is
better to have each program reference a single shared instance of that library, instead
of having each program contain a copy of it. This is implemented in the linking process:
some objects are linked at link time, whereas others are linked at run time
(deferred/dynamic linking).

STATICALLY LINKED
The term ‘statically linked’ means that the program and the particular library that it’s
linked against are combined together by the linker at link time. This means that the
binding between the program and the particular library is fixed and known at link time,
before the program runs. It also means that we can't change this binding unless we
re-link the program with a new version of the library.
Programs that are linked statically are linked against archives of objects (libraries) that
typically have the extension .a. An example of such a collection of objects is the
standard C library, libc.a. You might consider linking a program statically, for example,
in cases where you aren't sure whether the correct version of a library will be
available at runtime, or if you are testing a new version of a library that you don't yet
want to install as shared. For gcc, the -static option is used during the
compilation/linking of the program.
ASCII CHARACTER SET
ASCII (1977/1986)

     _0   _1   _2   _3   _4   _5   _6   _7   _8   _9   _A   _B   _C   _D   _E   _F

0_   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
     0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 000A 000B 000C 000D 000E 000F
     0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15

1_   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
     0010 0011 0012 0013 0014 0015 0016 0017 0018 0019 001A 001B 001C 001D 001E 001F
     16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31

2_   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
     0020 0021 0022 0023 0024 0025 0026 0027 0028 0029 002A 002B 002C 002D 002E 002F
     32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47

3_   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
     0030 0031 0032 0033 0034 0035 0036 0037 0038 0039 003A 003B 003C 003D 003E 003F
     48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63

4_   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
     0040 0041 0042 0043 0044 0045 0046 0047 0048 0049 004A 004B 004C 004D 004E 004F
     64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79

5_   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
     0050 0051 0052 0053 0054 0055 0056 0057 0058 0059 005A 005B 005C 005D 005E 005F
     80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95

6_   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
     0060 0061 0062 0063 0064 0065 0066 0067 0068 0069 006A 006B 006C 006D 006E 006F
     96   97   98   99   100  101  102  103  104  105  106  107  108  109  110  111

7_   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
     0070 0071 0072 0073 0074 0075 0076 0077 0078 0079 007A 007B 007C 007D 007E 007F
     112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127
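
Reading the table amounts to splitting a character code into its two hex digits: the high digit selects the row and the low digit the column. The following C sketch shows this; the function name ascii_table_position is an assumption, and the comments assume an ASCII execution character set.

```c
/* Splits a 7-bit ASCII code into the row (high hex digit) and column
 * (low hex digit) used by the table. Returns -1 for non-ASCII input. */
int ascii_table_position(int ch, int *row, int *col)
{
    if (ch < 0 || ch > 0x7F)
        return -1;
    *row = ch >> 4;      /* 'A' (0x41) -> row 4    */
    *col = ch & 0x0F;    /* 'A' (0x41) -> column 1 */
    return 0;
}
```

This is also how to read an immediate value such as $0x41 in an assembly listing: row 4, column 1 of the table is the letter A.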
