Sie sind auf Seite 1von 5

How to crack a Binary File Format

By Frans Faase

Introduction
I often receive request from people to help them with cracking a particular binary file format.
It seems that many software vendors are not willing to document the format of the binary files
that they are using. Maybe there are good reasons for this. Personally, I feel that users have
the right to access the information that is stored in the document that they manipulate with a
program that they are legally allowed to use. That is why I do not consider reverse
engineering of a binary file format as an illegal act (opposite to what the use of the verb
"cracking" is suggesting).

The most likely reason why people ask me for help is because I already have reversed
engineerd some binary file formats. So far, I have worked on:

• AutoCAD DWG12 file format


• AutoCAD DWG13 file format
• MS Word 5.0 file format
• Quark Xpress file format

Requirements
Although, cracking a binary file format can be very rewarding, like a form of mental
mountain climbing, there are also some basic skills that are absolutely neccessary, if you ever
want to succeed. These are (in order of importance):

• A good reason why you want to crack the format at all.


• A lot of time.
• A lot of preseverence.
• A lot of creativity.
• Willingness to do repeative tasks for hours and hours.
• Good understanding of how numbers and strings are represented in binary manner.
• Excellent command of the C programming language.

Cracking a file format is not easy. It can drive you crazy. Try to avoid it if you can. And, of
course, you first search the Internet for at least one day seeing what is already known about
the format, or to find out if someone already offers some conversion service or tool to a
format that you can read. One place to look at, is Wotsit, a site dedicated to file formats. Also,
I think you should have read the stories of attempts to crack a binary file format, and studied
some programs that read binary files. (Follow the links mentioned in the introduction to start
with.)

Besides this, you also need:

• a computer (quite obvious),


• a simple text editor capable of handling long lines,
• (optional) a hex viewer or editor,
• a calculator with hexadecimal-to-decimal conversion and vice-versa functionality,
• a C or C++ compiler,
• enough sample files about which you know what data they contain, and
• (optional) the application which uses the files.

010 Editor - Hex Editor

Recently, a professional hex editor was brought under my attention with some nice features
for cracking binary files. 010 Editor allows you to write templates for describing the structure
of the binary file, and having it parse on the fly showing the results. The languages used has
some powerful features. It some what resembles my ideas for BFF: a grammar for Binary File
Formats.

What is the fun?


Indeed, cracking a format is like climbing a mountain, when you reach the top it gives you a
great reward. But it is also like entering the mind of someone else. With every step you come
further, you know something more about the program being used to read and write the file.
And by this, you also know more about the mind of the person (or persons) who designed that
program. However, sometimes this person becomes your enemy, when attempts have been
made to prevent total reverse engineering of the format. Yes, those people designing a format
sometimes are given the assignment to make reverse engineering difficult. There are many
ways to do it. Custom made compression and encrytion are two techniques. Luckily, both
these can reduce performance of loading and saving. But there are also small tricks you can
use, which make it a lot harder.

Programming style
Although, cracking a binary file format may tempt you to write instant code, you should
realize that it is a long and complicated task, which often requires the code to undergo major
revisions. Code revisions are much easier if your code is clean and well structured. (The
principles of eXtreme Programming do also apply to this kind of programming work.)

Some practical remarks:

• Document your code (either through comments or, better, by giving meaningful names
to variables and procedures).
• Make regular back-ups. Although this should be a standerd procedure, with cracking a
file format you will see yourself more often going back to an earlier version of the
program whenever you get stuck in a certain area.
• Build into your program switches (simple Boolean variables) which allow you to
dump any intermediate data during any stage of the parsing. You will find that you
often need to go back one step to make a step forward.

How to work
Where to start
The first thing you need to do is to understand the global structure of the binary files. Look at
the sample files with a hex-viewer, or write a small hex-dump program, which dumps each
next 32 bytes on a separate line, generating an empty line after each 256 bytes. (Alternatively,
you can also dump 256 bytes on each line.) You can use one of the following statements to
print a single byte:
printf("%02X", byte);
printf("%02X ", byte);
printf("%c%02X", (byte > ' ' && byte < 127)?byte:' ', byte);
printf((byte > ' ' && byte < 127)?" %c":"%02X", byte);
Take care that byte is defined as a unsigned char, otherwise you may get confusing effects.
(If you don't understand the above code fragments, you clearly do not meet the requirements,
so, don't ask me to explain it to you.) Once the global structure has been discovered, you can
start writing a C program, which simply dumps the file along this structure. A lot the work,
just consists of searching for structure in the hex-dumps and editing them in long lines
representing the various records on a single line, in order to discover the structure and
determine the meaning of the various parts.

For a primary investigation of a format, the strings program might be useful. Sometimes
binary format contain whole chunks of text.

Block structure

Some binary file formats use a block structure where each chunk of data is stored in a number
of equal sized blocks. A block size of 256 is often used. Examples of these are MS Word 5.0
and Quark Xpress. Whenever a block structure is used, data structures are often referenced by
the block number in which the data start.

Number representation

The most common used manner to represent numbers is to interpret a sequence of bytes as a
binary number. Numbers can also be stored as strings, where each character is representes one
digit. (These are relatively easy to recognize in a binary file.) Because only a limit set of
characters is needed to represent all kinds of numbers (including fractions and exponents),
two characters can be put in a single byte. In the past there were even processors that had
instructions to perform arithematic operations on such numbers. (The two parts of the byte
were often refered to as nibbles.) Nowadays this format is not in use any more (but it might
still occur in some binary file format, as these sometimes survive their original platform).

Negative integers

For simple positive numbers interpreting a sequence of bits as a binary number is the rather
obvious choice. But for numbers that can be both positive and negative this is not. One
interpretation is to consider one of the bits (the first usually) as the sign bit, and the rest of the
bits as a binary number. A problem with this interpretation is that there are two representations
for zero. Another solution is to interpret a sequence of bits as a binary number, and to subtract
a large value from it, in case the number is larger than some number. There are two options
here, with respect to which numbers is subtracted with a certain amouth of available bits.
These are called 1's compliment and 2's compliment number representation. Nowadays, the
2's compliment method is most often used. This means that for a byte (8 bits) the number 256
is subtracted, when the number is larger than 127. That means that the unsigned number 128
represent the signed number -128, and the unsigned number 255 represent the signed number
-1. (With the 1's compliment number representation, the number 255 is subtracted. Again, this
introduces two representations for zero, as the unsigned number 255 becomes the signed
number 0.) Note that with the 2' compliment the first bit can be seen as the sign bit. The only
disadvantage the 2' compliment has is that there is always a negative number that does not
have a positive counterpart, e.g., for the byte there is a representation for -128 but not for 128.

An important advantage of the 2-compliments notation is that the arithmetic operations for the
signed and unsigned representation are exactly the same. The only time the difference become
obvious is when a number is cast to a larger representation form. For example the
hexadecimal byte value 0xFF needs to be changed to 0xFFFF, when the signed number -1 is
cast from a byte to a word, and has to be changed to 0x00FF, when the unsigned number 255
is cast from a byte to a word. In a binary file format, it can occur that a unsigned byte value is
stored as a word with the signed expansion applied. As a result of this it can happen that 127
is stored hexadecimal as 0x007F, where 128 is stored hexadecimal as 0xFF80.

little and big endian

Processors can be divide into little and big endian. This determines the order of the bytes in
for short and long integer values. little endian processors start with the least significant byte.
Big endian processors start with the most significant byte. The reason why little endian
processors start with the least significant by first, is because it gives an advantage for 8 bits
processors when having to perform adding and subtraction operations. Although this plays no
role in current 32 bit and 64 bit processors, some processors are still little endian because of
their origin. The Pentium processor is a little endian. The PowerPC (used in Macs) is big
endian. This difference is especially important for exchanging binary data between different
processors. One should be aware that some applications running on a big endian processor,
may use the little endian byte order, because the application was originally developed to run
on a little endian processor. Sometimes, long values (32 bits) are stored as two big endian
words, with the least significant word first.

floating point formats

The possibilities for representing floating number far exceed those of integers. Luckily,
nowadays it has been pretty much standardized. Most of the time it matches the way common
processors are using. So, a simple cast to float or double will work, sometimes in
combination with a byte reordering with respect to little and big endian differences.

Strings

Usually, it is rather easy to recognize strings in a binary format. One should realize that each
character in a string can either use one or two bytes (unicode), and that little and big endian
could play a role. There is a difference between fixed and variable size strings. Fixed strings
take a fixed amouth of space in the file. They are often filled up with spaces or the null-
character. Variable sized strings, either can have the length at the start, or use a termination
character (most often the null-character) to indicate the end of the string. Sometimes, the next
piece of data does not directly start at the next byte following the string, but could be at a
word or long word boundary.

Packed format

To save space, some binary format may make use of bit packing, which means that the values
stored in the binary file do not necessary start at byte/word boundaries. (In C this can be
achieved by using the ":" followed by a number in the struct/class member definitions.) One
such format is the AutoCAD DWG13 file format. See the program for some suggestions.

More often it happens that a number of Boolean values are stored in a single byte or word.
Sometimes it can be helpful to print numbers in their binary representation.

Rubbish

Be aware that many binary files just contain chunks of rubbish, especially those that allocate
fixed blocks for storing data in the binary file. Also memory mapped files could contain large
sections of rubbish. Rubbish can be very misleading and a great trouble in reverse engineering
binary files, because sometimes it just seems to contain valid data. On the other hand,
something that looks like rubbish could actually contain some information.

Reading files

When reading a binary file, one often has to access the file in a random manner, or at least
often look ahead from the current reading position. Direct random access of a binary file is
slow. To avoid this, it is best (if size permits this) to store the whole file into a single byte
array. (For an example see my CBuf class, which I used for reading Quark Xpress files.) If
size does not permit it, you could try the Memory Mapped file option. For an example of this
under Windows see the class Mmfile in the Suneido sources that uses the functions
CreateFileMapping and MapViewOfFile.

The difference trategy

One way of cracking a binary file format is to make many, preferable small, files with the
application that contain small changes compared to each other. For example, you could start
to write a document with the application and each time save a document under a different
name.

You can use a Binary File Compare utility to compare the files and discover how certain
changes are reflected in the binary file, which gives you information about the where the
information is stored.

Das könnte Ihnen auch gefallen