
An analysis of the MD5 hashing algorithm

Brian Murray
ISSM 533
Table of Contents
Executive Summary
Summary of MD5
Definitions
Input data
Step 1: Data padding
Step 2: Appending length
Step 3: Initialize the Message Digest buffer
Step 4: Process input
Step 5: Output
T Value Table
Technical Vulnerabilities
Signature attack with a hidden message
X.509 Certificate Signature attack
Conclusions
References
Executive Summary
In this paper, the MD5 hashing algorithm is discussed. MD5 is a common hashing algorithm used in
many cryptographic schemes. It is also used to verify the integrity of downloaded files. Part one of this
paper outlines how MD5 works. There has been considerable recent discussion about whether MD5 is still
a sound algorithm, and some less informed commentators have stated that MD5 has been completely
broken. Part two of this paper investigates the current state of MD5 and its attack vectors. The final
part outlines conclusions based on part two.

Summary of MD5

Definitions
A word is a 32-bit group.
A byte is an 8-bit group.
'b' represents the length of the input data in bits.
'|' is a bitwise OR. '&' is a bitwise AND. '^' is a bitwise XOR. '~' is a bitwise NOT.
'+' is addition of words, modulo 2^32.
X <<< s is a circular left shift (rotation) of the 32-bit word X by s bit positions.
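
Of the operators defined above, only the circular shift has no direct Python operator. The following is a minimal sketch, assuming words are held as non-negative Python integers; the name `rotl32` is illustrative and does not come from RFC 1321.

```python
MASK32 = 0xFFFFFFFF  # keeps results inside one 32-bit word


def rotl32(x, s):
    """Rotate the 32-bit word x left by s bit positions (0 < s < 32)."""
    x &= MASK32
    # Bits shifted out on the left re-enter on the right.
    return ((x << s) | (x >> (32 - s))) & MASK32


print(hex(rotl32(0x12345678, 8)))  # 0x34567812
```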

Input data
The data input into MD5 can be any number of bits long. It does not need to fall on an 8-bit boundary,
and it may also be 0 bits long.

Step 1: Data padding


The input data is first padded so that it is 64 bits short of a 512-bit boundary. This means that the
length in bits should be congruent to 448, modulo 512. The data is padded first by a single binary one,
then by as many zeros as needed to reach a length 64 bits short of a 512-bit boundary. Padding is always
performed, even if the input length is already congruent to 448 modulo 512, so between 1 and 512 bits
are added.
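
The padding rule can be sketched as follows. This is a simplified illustration that assumes the input is a whole number of bytes (MD5 itself allows any bit length); the function name `pad_to_448` is hypothetical.

```python
def pad_to_448(message: bytes) -> bytes:
    """Pad so that the length in bits is congruent to 448 modulo 512.

    Assumption: byte-aligned input, so the single '1' bit followed by
    seven '0' bits is the byte 0x80.
    """
    padded = message + b"\x80"
    # 448 bits = 56 bytes, 512 bits = 64 bytes.
    zero_bytes = (56 - len(padded) % 64) % 64
    return padded + b"\x00" * zero_bytes


for n in (0, 55, 56):
    print(n, "->", len(pad_to_448(b"a" * n)))
```

A 56-byte input is already congruent to 448 bits modulo 512, yet it still grows by a full 64 bytes, because the padding step is never skipped.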

Step 2: Appending length


The length of the message 'b' is appended to the data from step 1 as a 64-bit value, low-order word
first. This is the number of input bits before padding. In the case that the length's representation is
larger than 64 bits, only the low-order 64 bits are used. The data will now be an exact multiple of 512
bits long. Equivalently, it can be broken down into a whole number of blocks of 16 32-bit words. The
words are denoted as M[0 .. N-1], where N is a multiple of 16.
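
As a rough sketch of steps 1 and 2 together, the snippet below pads a byte-aligned message, appends the 64-bit length, and splits the result into the words M[0 .. N-1]. The function name `message_words` is an assumption made for this example, and the low-order-byte-first word layout follows RFC 1321.

```python
import struct


def message_words(message: bytes) -> list:
    """Return the padded message as a list of 32-bit words M[0 .. N-1]."""
    bit_length = (len(message) * 8) & 0xFFFFFFFFFFFFFFFF  # low-order 64 bits only
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack("<Q", bit_length)  # low-order word first
    # Each word is read low-order byte first.
    return list(struct.unpack("<%dI" % (len(padded) // 4), padded))


M = message_words(b"abc")
print(len(M), hex(M[0]), M[14], M[15])
```

For the 3-byte input "abc", N is 16, M[0] holds the three message bytes plus the 0x80 padding byte, and M[14] holds the bit length 24.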

Step 3: Initialize the Message Digest buffer


Four one-word buffers (A, B, C, D) are used. They are shown here in hex form, with the low-order bytes
first. These values are loaded in as the default initialization vector. Note that C and D are just the
reverses of B and A respectively. Read as 32-bit words, the values are A = 0x67452301, B = 0xefcdab89,
C = 0x98badcfe and D = 0x10325476.

A: 01 23 45 67
B: 89 ab cd ef
C: fe dc ba 98
D: 76 54 32 10
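
Because the bytes above are listed low-order first, the word values look different when written as ordinary hex numbers. A small sketch using Python's `struct` module to decode them (the variable names are illustrative):

```python
import struct

# The 16 initialization bytes exactly as listed above, A through D.
IV_BYTES = bytes.fromhex("01234567" "89abcdef" "fedcba98" "76543210")

# "<4I" reads four 32-bit unsigned words, low-order byte first.
A, B, C, D = struct.unpack("<4I", IV_BYTES)

print(hex(A), hex(B), hex(C), hex(D))
# 0x67452301 0xefcdab89 0x98badcfe 0x10325476
```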

Step 4: Process input


To start this phase, four functions must be defined. Each of these functions takes three 32-bit words
as input and outputs one 32-bit word.
F(A,B,C) = (A & B) | (~A & C)
G(A,B,C) = (A & C) | (B & ~C)
H(A,B,C) = A ^ B ^ C
I(A,B,C) = B ^ (A | ~C)

F acts as a conditional at each bit position: if A, then B, else C.
G is very similar: if C, then A, else B.
H is effectively just a parity function of the three inputs.
In all four functions, each output bit depends only on the bits at the same position in the three inputs.
If those input bits are independent and unbiased, then each output bit is independent and unbiased as
well.
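
The four functions translate almost directly into code. The sketch below assumes words are stored as non-negative Python integers; since Python's `~` yields a negative number, each result is masked back to 32 bits.

```python
MASK32 = 0xFFFFFFFF


def F(a, b, c):
    return ((a & b) | (~a & c)) & MASK32  # if a then b else c


def G(a, b, c):
    return ((a & c) | (b & ~c)) & MASK32  # if c then a else b


def H(a, b, c):
    return (a ^ b ^ c) & MASK32           # bitwise parity


def I(a, b, c):
    return (b ^ (a | ~c)) & MASK32


# With a all ones, F selects b; with a all zeros, it selects c.
print(hex(F(0xFFFFFFFF, 0x12345678, 0x9ABCDEF0)))  # 0x12345678
print(hex(F(0x00000000, 0x12345678, 0x9ABCDEF0)))  # 0x9abcdef0
```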

Next, a table of 64 elements is constructed. The value of element i is the integer part of 4294967296
(2^32) times abs(sin(i)), where i is the number of the element, taken in radians. The table is numbered
T[1 .. 64]. So, T[1] = 0xD76AA478. These can be calculated on the fly, but since they never change, it is
most efficient to simply code them statically, as was done in the reference implementation in the
appendix of RFC 1321.
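
The table formula can be checked with a few lines of Python. This is a sketch that relies on double-precision `math.sin`; index 0 is left unused so that the list matches the 1-based numbering T[1 .. 64].

```python
import math

# T[0] is a placeholder so that T[i] matches the paper's 1-based numbering.
T = [0] + [int(abs(math.sin(i)) * 4294967296) for i in range(1, 65)]

print("T[1]  = 0x%08X" % T[1])   # 0xD76AA478
print("T[64] = 0x%08X" % T[64])  # 0xEB86D391
```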

Now we need to process every 16-word block.

For i = 0 .. N/16-1

We also want to load the current block into a temporary buffer X.

For j = 0 .. 15
X[j] = M[i*16+j]

Next, we need to save a copy of A, B, C and D, since these values are needed again at the end of the block.
AA = A, BB = B, CC = C, DD = D
Now we perform the calculations in 4 rounds of 16 operations each, changing the function used each time.
So round 1 uses function F, round 2 uses function G, and so on. Unfortunately, the word order, shift
amounts and table indices vary so much between operations that it is simply easier to code them
statically, instead of via looping mechanisms.
The following is taken from RFC 1321:
/* Round 1. */
/* Let [abcd k s i] denote the operation
a = b + ((a + F(b,c,d) + X[k] + T[i]) <<< s). */
/* Do the following 16 operations. */
[ABCD 0 7 1] [DABC 1 12 2] [CDAB 2 17 3] [BCDA 3 22 4]
[ABCD 4 7 5] [DABC 5 12 6] [CDAB 6 17 7] [BCDA 7 22 8]
[ABCD 8 7 9] [DABC 9 12 10] [CDAB 10 17 11] [BCDA 11 22 12]
[ABCD 12 7 13] [DABC 13 12 14] [CDAB 14 17 15] [BCDA 15 22 16]
/* Round 2. */
/* Let [abcd k s i] denote the operation
a = b + ((a + G(b,c,d) + X[k] + T[i]) <<< s). */
/* Do the following 16 operations. */
[ABCD 1 5 17] [DABC 6 9 18] [CDAB 11 14 19] [BCDA 0 20 20]
[ABCD 5 5 21] [DABC 10 9 22] [CDAB 15 14 23] [BCDA 4 20 24]
[ABCD 9 5 25] [DABC 14 9 26] [CDAB 3 14 27] [BCDA 8 20 28]
[ABCD 13 5 29] [DABC 2 9 30] [CDAB 7 14 31] [BCDA 12 20 32]
/* Round 3. */
/* Let [abcd k s i] denote the operation
a = b + ((a + H(b,c,d) + X[k] + T[i]) <<< s). */
/* Do the following 16 operations. */
[ABCD 5 4 33] [DABC 8 11 34] [CDAB 11 16 35] [BCDA 14 23 36]
[ABCD 1 4 37] [DABC 4 11 38] [CDAB 7 16 39] [BCDA 10 23 40]
[ABCD 13 4 41] [DABC 0 11 42] [CDAB 3 16 43] [BCDA 6 23 44]
[ABCD 9 4 45] [DABC 12 11 46] [CDAB 15 16 47] [BCDA 2 23 48]
/* Round 4. */
/* Let [abcd k s i] denote the operation
a = b + ((a + I(b,c,d) + X[k] + T[i]) <<< s). */
/* Do the following 16 operations. */
[ABCD 0 6 49] [DABC 7 10 50] [CDAB 14 15 51] [BCDA 5 21 52]
[ABCD 12 6 53] [DABC 3 10 54] [CDAB 10 15 55] [BCDA 1 21 56]
[ABCD 8 6 57] [DABC 15 10 58] [CDAB 6 15 59] [BCDA 13 21 60]
[ABCD 4 6 61] [DABC 11 10 62] [CDAB 2 15 63] [BCDA 9 21 64]
Finally, for each block, we add the word values saved at the start of the block to the resulting word
values, using addition modulo 2^32.
A = A + AA
B = B + BB
C = C + CC
D = D + DD
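
Putting steps 1 through 5 together, the whole algorithm fits in a short Python sketch. This is an illustration rather than the RFC 1321 reference code: it assumes byte-aligned input, all helper names (`md5_hex`, `K_INDEX`, `SHIFTS` and so on) are invented for this example, and the closed-form index formulas in `K_INDEX` are simply an observation that reproduces the k columns of the tables above. Python's built-in `hashlib` is used only to cross-check the result.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF
T = [0] + [int(abs(math.sin(i)) * 2**32) for i in range(1, 65)]
SHIFTS = [[7, 12, 17, 22], [5, 9, 14, 20], [4, 11, 16, 23], [6, 10, 15, 21]]
FUNCS = [
    lambda x, y, z: (x & y) | (~x & z),  # F, round 1
    lambda x, y, z: (x & z) | (y & ~z),  # G, round 2
    lambda x, y, z: x ^ y ^ z,           # H, round 3
    lambda x, y, z: y ^ (x | ~z),        # I, round 4
]
# Message word index k for operation j (0..15) of each round.
K_INDEX = [
    lambda j: j,
    lambda j: (1 + 5 * j) % 16,
    lambda j: (5 + 3 * j) % 16,
    lambda j: (7 * j) % 16,
]


def rotl(x, s):
    x &= MASK
    return ((x << s) | (x >> (32 - s))) & MASK


def md5_hex(message: bytes) -> str:
    # Steps 1 and 2: pad, then append the 64-bit length in bits.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack("<Q", (len(message) * 8) & 0xFFFFFFFFFFFFFFFF)
    # Step 3: initialize the buffer.
    a, b, c, d = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    # Step 4: process each 16-word block.
    for offset in range(0, len(padded), 64):
        x = struct.unpack("<16I", padded[offset:offset + 64])
        aa, bb, cc, dd = a, b, c, d
        for rnd in range(4):
            for j in range(16):
                k = K_INDEX[rnd](j)
                i = 16 * rnd + j + 1
                s = SHIFTS[rnd][j % 4]
                f = FUNCS[rnd](b, c, d) & MASK
                new_a = (b + rotl(a + f + x[k] + T[i], s)) & MASK
                # Rotating the roles gives [ABCD] -> [DABC] -> [CDAB] -> [BCDA].
                a, b, c, d = d, new_a, b, c
        a, b = (a + aa) & MASK, (b + bb) & MASK
        c, d = (c + cc) & MASK, (d + dd) & MASK
    # Step 5: output A, B, C, D, each low-order byte first.
    return struct.pack("<4I", a, b, c, d).hex()


print(md5_hex(b"abc"))
print(md5_hex(b"abc") == hashlib.md5(b"abc").hexdigest())
```

Using loops and index formulas trades the readability of the static tables for brevity; a production implementation would normally unroll the 64 operations as the RFC does.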

Step 5: Output
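We are finally left with an output of four words: A, B, C and D. The digest begins with the low-order
byte of A and ends with the high-order byte of D. From here, we can simply print the 16 bytes, which is
conventionally done as 32 hex digits.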
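
The byte order of the output is easy to get wrong, so here is a small sketch of the final formatting step. The function name `digest_hex` is illustrative, and the input words used in the example are simply the initial buffer values from step 3, not the digest of any particular message.

```python
import struct


def digest_hex(a, b, c, d):
    """Serialize four 32-bit words, each low-order byte first, as hex."""
    return struct.pack("<4I", a, b, c, d).hex()


print(digest_hex(0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476))
# 0123456789abcdeffedcba9876543210
```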

T Value Table
The following table lists the values of T used in step 4, in order from T[1] to T[64], reading left to
right and top to bottom. The formula for the values is: the integer part of 4294967296 times abs(sin(i)),
where i is the number of the element, in radians.

D76AA478 E8C7B756 242070DB C1BDCEEE F57C0FAF 4787C62A A8304613 FD469501
698098D8 8B44F7AF FFFF5BB1 895CD7BE 6B901122 FD987193 A679438E 49B40821
F61E2562 C040B340 265E5A51 E9B6C7AA D62F105D 02441453 D8A1E681 E7D3FBC8
21E1CDE6 C33707D6 F4D50D87 455A14ED A9E3E905 FCEFA3F8 676F02D9 8D2A4C8A
FFFA3942 8771F681 6D9D6122 FDE5380C A4BEEA44 4BDECFA9 F6BB4B60 BEBFBC70
289B7EC6 EAA127FA D4EF3085 04881D05 D9D4D039 E6DB99E5 1FA27CF8 C4AC5665
F4292244 432AFF97 AB9423A7 FC93A039 655B59C3 8F0CCC92 FFEFF47D 85845DD1
6FA87E4F FE2CE6E0 A3014314 4E0811A1 F7537E82 BD3AF235 2AD7D2BB EB86D391
Technical Vulnerabilities
Currently, there is only one practical class of vulnerability in MD5: collisions. Two different sets of
data that produce the same MD5 hash are called a 'collision'. The known attacks produce pairs of
seemingly random data that differ in just a few bits. These collisions can be used in a few different
ways, which are noted later.
There are a number of ways to generate these collisions. The most recent of them is called “Tunneling”,
by Vlastimil Klima. In his paper, he describes a method to find these collisions in under a minute. A
3.2 GHz Pentium 4 is capable of finding these collisions in 17 seconds on average. He presents tunneling
as a general idea that could also be applied to other hashing algorithms, including SHA-1.
To date, there have been no quick attacks against a specific hash. That is, one cannot construct an
input that produces a given hash value. However, collisions do allow for other attacks.

Signature attack with a hidden message


One specific attack is used against signed documents. The attack requires a document format that is
itself a programming language, such as PostScript. The attack starts out by finding a colliding pair of
prefixes, A and B, for the file. The two prefixes have the same length and leave MD5 in the same internal
state, so since MD5(A) == MD5(B), it follows that MD5(A + M) == MD5(B + M). This means that we can append
anything, and the MD5 will be the same, so long as the appended text is the same. In the prefix, we would
set a variable to one of the two colliding values. Later, in the appended text, we would check whether
the variable holds the first of the two colliding values. If so, the document displays one message.
Otherwise, it displays the second. For example, you would have 2 messages within the body of the
document. One would state 'thank you for your contributions', and the second would state that you should
be given full access to all resources. You would then have someone trusted, such as the security manager
or another authority, sign the document while it displays the first of the two messages. Then, you would
swap the colliding prefix to trigger the second message. The recipient would see the message stating
that you should be given full access, with a valid signature.
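
The suffix property behind this attack can be demonstrated without a real MD5 collision. The sketch below uses a deliberately weakened toy hash: the same iterated structure as MD5, but with 8-byte blocks and a 16-bit state built from truncated MD5, so that a collision can be found by brute force in well under a second. Everything here (`toy_hash`, `compress`, `render`, the block size) is an assumption made for illustration, and `render` is only a stand-in for the program embedded in the document.

```python
import hashlib

BLOCK = 8         # toy block size in bytes (MD5 itself uses 64)
IV = b"\x00\x00"  # toy 16-bit initial state


def compress(state, block):
    """Toy compression function: first 16 bits of MD5(state || block)."""
    return hashlib.md5(state + block).digest()[:2]


def toy_hash(message):
    """Iterated hash with MD5-style padding, shrunk to a 16-bit state."""
    padded = message + b"\x80"
    padded += b"\x00" * ((BLOCK - 4 - len(padded) % BLOCK) % BLOCK)
    padded += len(message).to_bytes(4, "little")
    state = IV
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state


def find_colliding_blocks():
    """Brute-force two different one-block prefixes with equal state."""
    seen = {}
    n = 0
    while True:
        block = n.to_bytes(BLOCK, "big")
        state = compress(IV, block)
        if state in seen:
            return seen[state], block
        seen[state] = block
        n += 1


def render(document, prefix_a):
    """Stand-in for the embedded program: branch on the prefix present."""
    if document[:BLOCK] == prefix_a:
        return "Thank you for your contributions."
    return "Give the bearer full access to all resources."


prefix_a, prefix_b = find_colliding_blocks()
body = b" (shared body containing both messages)"
doc_a, doc_b = prefix_a + body, prefix_b + body

print(toy_hash(doc_a) == toy_hash(doc_b))  # same hash, so same signature
print(render(doc_a, prefix_a))
print(render(doc_b, prefix_a))
```

Because both prefixes fill exactly one block and leave the hash in the same state, any shared suffix keeps the two documents colliding, while the displayed message differs.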

X.509 Certificate Signature attack


A second attack against MD5 has been with X.509 certificates. The method was described in 2005. First,
one starts out by creating all of the Certificate Signing Request data except the public key. The data
before the public key modulus must end on a 64-byte boundary. Adding some information after the
Distinguished Name will serve to pad out the data to ensure the 64-byte boundary. Also, the modulus and
the public exponent must each have a fixed byte length. The MD5 algorithm is run over this data, and we
are left with its output being the IV for the next section. Since the data ends exactly on a 512-bit MD5
block boundary, no padding is involved at this point, which allows us to use the output as an IV for the
next section. Remember, the certificate thus far must be X.509 compliant, otherwise it will be rejected
by the certificate authority. Next, we can use either the method of Xiaoyun Wang et al. or the tunneling
method to create 2 different, but similar, messages that produce the same MD5 when starting from that
initialization vector. In the demonstration, the public exponent was 65537. This number must be the same
for both certificates. Next, primes p1, q1, p2 and q2 are found, with the help of a single, common value
that is appended to both colliding values, such that each extended value is a valid RSA modulus (p1*q1
and p2*q2). This will yield 2 public keys with the same MD5, as well as 2 separate private keys. What
this means is that 2 certificates with different public keys can carry the same signature.
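
The 64-byte alignment requirement comes down to simple modular arithmetic. A minimal sketch, where `filler_needed` and the example length of 200 bytes are hypothetical and not taken from the published attack:

```python
BLOCK_BYTES = 64  # one MD5 block is 512 bits


def filler_needed(prefix_length):
    """Bytes of extra data needed so the prefix ends on a block boundary."""
    return (-prefix_length) % BLOCK_BYTES


# Hypothetical example: 200 bytes of certificate data precede the modulus.
print(filler_needed(200))  # 56, giving 256 bytes = 4 full blocks
```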
Conclusions

Although there have been 2 attacks against MD5, it is my belief that MD5 is still a completely valid
algorithm. Both of these attacks are very case specific.
For instance, the "Alice's boss" attack requires pre-existing malicious code to be inserted before the
signature is taken. Signing documents in other formats, such as PDF, makes it impossible to carry out
such an attack. In the case of signing code using the same method, the attacker would have to introduce
the malicious code upstream, which would require a code maintainer to sign off on the change. At that
point, the attacker may as well just add a separate command line switch, or check whether a file exists,
to trigger the attack, as it would be much simpler.
In the case of the X.509 certificates, the attack in fact has the opposite effect. First, the original
certificate owner must perform the attack before the certificate is signed, very much like the Alice's
boss type of attack. However, if a duplicate signature is ever found, one can use the same method as
described by Lenstra, A. et al. to reverse engineer the certificate and recover the original private key
and modulus. This, in effect, defeats the security behind the certificate in the first place, and leaves
the original attacker very open to an attack back on them.
To date, there has been no attack against MD5 for creating files that have identical MD5 hashes even
though their contents differ vastly. Alice's boss types of attacks require an attacker to prepare the
malicious code, and the signer to forgo due diligence in checking the contents. The X.509 certificate
attack simply allows 2 different, but similar, certificates to be created. However, if it is not known
that a second certificate exists, then it is impossible for a relying party to know whether the party
presenting a certificate is the intended one. Of course, the attacker would have needed to give the
certificate to the other party, and if that is the case, then the attacker may as well just give the
data to the other party freely as well.
In my opinion, both of these attacks require a breakdown in other systems to become feasible.
The only MD5 attack that is feasible is against passwords, where brute forcing becomes a possibility.
Without any current method to 'pick' a resultant MD5, whether based on an IV or not, MD5 is still a valid
method for verifying data.
References
Rivest, R., “The MD5 Message-Digest Algorithm”, RFC 1321, MIT and RSA Data Security, Inc., April 1992
Klima, V., “Tunnels in Hash Functions: MD5 Collisions Within a Minute”, April 2006
Daum, M. & Lucks, S., “Attacking Hash Functions by Poisoned Messages: The Story of Alice and her Boss”, http://www.cits.rub.de/MD5Collisions/, June 2005
Lenstra, A., Wang, X., de Weger, B., “Colliding X.509 Certificates”, March 2005