
Security Response

Portable Document Format Malware
Kazumasa Itabashi

Contents
Introduction
Background Information
Obfuscation Using Features of JavaScript
Obfuscation Using Features of PDF Format
Encryption
JavaScript Features Unique to PDF
Conclusion
Bibliography

Introduction
Approximately two years ago a vulnerability in Adobe Reader’s JavaScript API was discovered, and malware
authors continue to produce malicious PDF files that exploit this flaw. This vulnerability has been patched,
though a number of other vulnerabilities have been found and used in active exploits before being patched
themselves.

There are numerous reasons why malware authors might use vulnerabilities in Adobe Reader and Acrobat as an
attack vector. First, the PDF format is widely used throughout the world for sharing documents, and Adobe
Reader is the most popular PDF viewer; many OEMs ship PCs with the software preinstalled. Second, the PDF file
format specification and the properties of the viewer allow malware authors a significant degree of freedom
when designing and developing a threat. Third, the nature of the PDF format provides malware authors with some
useful tricks that help to avoid detection by AV scanners, and the support for JavaScript further extends this
capability. Obfuscation, encryption, and misdirection are techniques often employed in a similar manner to how
they may be seen in HTML and other environments that support JavaScript.

This paper aims to detail the different paths malware authors have taken and point out how attack techniques
via PDF have evolved. It is hoped that it will aid AV vendors and PC users alike in better understanding the
problems posed by malicious PDFs, as well as the importance of staying up-to-date with patches.

Background Information
The first JavaScript-based PDF malware came to light in February 2008. A vulnerability in one of Adobe’s
JavaScript API functions, collectEmailInfo(), was discovered and used in conjunction with a heap-spray attack.
The malicious code copied shellcode to heap memory and subsequently called the vulnerable API, thus exploiting
the vulnerability. This was fairly simple JavaScript in and of itself.

In November 2008 – nine months later – another vulnerability was found, this time in the printf() API. Several
further vulnerabilities emerged over time, and to date malware exploiting vulnerabilities in the following func-
tions has been found:
• collectEmailInfo()
• printf()
• getIcon()
• customDictionaryOpen()
• getAnnots()
• newPlayer()
The exploits are very similar to those that might be found in attacks on Web browsers. Unfortunately both Web-
and PDF-based attacks continue because of the myriad methods that may be used to obfuscate the code and
hence evade detection by security software.

Web-based JavaScript attacks commonly make use of HTML features to obfuscate the code effectively. By using
<iframe> or <script> tags, malware authors can make it more difficult for AV products to detect the malicious
code. Similar techniques can be used within PDF files, and as such the following sections detail some ways in
which malicious JavaScript may be obfuscated by malware authors.

Obfuscation Using Features of JavaScript


Most JavaScript can be easily obfuscated courtesy of features of the language. JavaScript-based malware is typi-
cally used to trigger drive-by downloads on the Web and cause further malware to be downloaded on to users’
computers. The downloaded malware may be categorized as a Trojan horse, Backdoor, or Infostealer, for exam-
ple. In order to perform these kinds of exploits, malware authors must periodically update the JavaScript used
or risk detection and hence the failure of the drive-by download. The aims of the exploits and the update-release
cycle are echoed in the PDF world, with AV vendors forced to be aware of the ever-changing face of PDF-based
malware.

Simple Obfuscation
Even beginner or unsophisticated JavaScript programmers can make use of simple string obfuscation tech-
niques. While there are legitimate uses of code obfuscation, the same techniques may be used to craft malicious
code that evades detection.

Split Strings
Dynamic string manipulation is as easy in JavaScript as it is in other interpreted script-based languages. Often,
strings (or strings and numbers) may be concatenated using the “+” character. While this is useful for the easy
manipulation of textual content, it also makes it easy for malware authors to create simply obfuscated code.

Figure 1 shows a shellcode block that has been split into many shorter strings, some of which are defined as
variables. The string literals and variables are concatenated and evaluated as arguments to the unescape()
function, which transforms the string into the final binary form. For an AV scanner to detect this shellcode it
must include both a lexical and a structural parser.
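
As a harmless illustration of the technique, the sketch below splits an escaped string that decodes to the text "Hello" rather than to shellcode. The variable names and the foldLiteralConcats() helper are hypothetical and are not taken from the sample in figure 1; the helper is only meant to suggest why lexical folding of adjacent literals is not sufficient once variables are involved.

```javascript
// Benign stand-in for a split string: the pieces only spell
// "%48%65%6c%6c%6f" once they are joined at run time.
var part1 = "%48" + "%6" + "5";
var part2 = "%6c%6" + "c";
var encoded = part1 + part2 + "%" + "6f";
var decoded = unescape(encoded);

// Hypothetical scanner-side helper: repeatedly folds "a" + "b" into "ab".
// It only handles adjacent double-quoted literals without escapes; joins that
// involve variables (part1 + part2) still require a structural parser.
function foldLiteralConcats(source) {
  var previous;
  do {
    previous = source;
    source = source.replace(/"([^"\\]*)"\s*\+\s*"([^"\\]*)"/g, '"$1$2"');
  } while (source !== previous);
  return source;
}

var folded = foldLiteralConcats('var part1 = "%48" + "%6" + "5";');
```

Folding recovers the literal assigned to part1, but the final value of encoded is only known after the variable references have been resolved.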

A string can also be used to call an object’s method through the bracket notation. Although the method name is
broken down into substring “components” it is evaluated as a single string. Figure 2 shows an example of this.
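
A minimal sketch of bracket-notation access, using a harmless string method in place of anything found in the sample (the variable names are illustrative only):

```javascript
var text = "pdf";

// Conventional dot notation: the method name is visible as an identifier.
var viaDot = text.toUpperCase();

// Bracket notation: the same method, but its name is assembled from fragments
// at run time and never appears as a single token in the source.
var viaBracket = text["to" + "Upper" + "Case"]();
```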

Regular Expressions
JavaScript supports regular expressions as a built-in language feature. They can be used for pattern-matching
and programmatic text manipulation, for example detecting line breaks or validating input characters. The
JavaScript object used to represent regular expressions is called RegExp.


Figure 1: Simple string concatenation

Figure 2: Property access using the bracket notation

Regular expressions are an effective method of string obfuscation used by malware authors. The characters
that comprise the string to be obfuscated can be “scattered” throughout a longer string and retrieved using a
regular expression when they are to be used. Figure 3 shows this technique in use. Each instance of l, k, u and
d in the obfuscated string is replaced with the % character, yielding %25%34%35%30%30%30%66. This string is then
evaluated using the unescape() function, giving the final result of %45000f.

Figure 3: Obfuscation using a regular expression

This is a simple example in which a single character stands in for the % sign in front of each two-digit
hexadecimal number. The technique can be made more complex, however, for example by adding more characters,
using a more complex sequence, or by replacing parts of each hexadecimal number, as in the expression
"%25%34%35%3Z%3Z%3Z%66".replace(new RegExp(/Z/g), "0").
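
The two substitutions described above can be reproduced as follows. The input string in the first step is a hypothetical reconstruction (the original sample text from figure 3 is not reproduced here); only the intermediate and final values come from the description above.

```javascript
// Hypothetical obfuscated input: l, k, u and d each stand in for "%".
var scattered = "l25k34u35d30l30k30u66";

// Step 1: the regular expression restores the percent signs.
var restored = scattered.replace(/[lkud]/g, "%");

// Step 2: unescape() decodes the percent-encoded characters.
var decodedOnce = unescape(restored);

// Variant from the text: part of each hexadecimal number is replaced instead.
var variant = "%25%34%35%3Z%3Z%3Z%66".replace(new RegExp(/Z/g), "0");
```

Here restored and variant both hold %25%34%35%30%30%30%66, and decodedOnce holds %45000f.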

The use of regular expressions can yield more complex obfuscation than simple split strings. It also tends to be
used to hide strings in conjunction with other techniques.

The eval Function


JavaScript provides a global function called eval() that may be used to evaluate a string as though it were an
expression. A legitimate use of this function is when dynamically generated code is to be used. The following two
JavaScript statements produce the same result: a message box that displays the text “Hello World”:
• app.alert ( "Hello World" );
• eval ( 'app.alert ( "Hello World" );' );
This function is one of the most effective ways through which malware authors can produce obfuscated code.
In conjunction with the use of split strings and regular expressions most recognizable JavaScript code can be
obfuscated, producing results similar to the example in figure 4.


Figure 4: How many eval()s?

One downside to using the eval() technique for code obfuscation is that most malcode researchers would likely
begin their search for malicious code by looking for this keyword. In the PDF format, however, alternatives to the
eval() function are available.

The function app.setTimeOut(statement, timeout) executes the statement given as its first argument after the
time (in milliseconds) given as its second. In Web-based JavaScript the first argument is normally a reference
to a function, whereas the PDF JavaScript API takes a string containing any code to be evaluated. This allows
for further obfuscation. Figure 5 shows an example of a split eval(). The string is the final element of the
array whose first element is “oibj”.

Figure 5: eval() in an array

Arrays are evaluated from left to right [2] so the array is evaluated last and the statement is equivalent to
qkgd=("yeid", … ,"ngir")["eval"], which can be evaluated as method calling using the bracket notation. This
means that the variable qkgd is equivalent to eval and can be called as such.

An alternative method is to use a numeric representation to produce the desired string. Figure 6 shows how this
can be achieved.

Figure 6: Numeric eval()

Following the addition, the variable ikhircrro has the value 693741. The next line converts this to a string by
treating it as a radix-36 (i.e. 7 + 29) representation:

$693741 = 14 \times 36^3 + 31 \times 36^2 + 10 \times 36 + 21$


If radix-36 is used to represent from 0 to 9 and A to Z inclusive, 14 is “e”, 31 is “v”, 10 is “a” and 21 is “l”, and thus
the variable lfbhmy represents “eval”.
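
Both name-hiding tricks can be checked with a few lines of harmless JavaScript. This is a sketch under assumptions: the decoy strings and the two addends are hypothetical (only the sum 693741 and the radix come from the description above), and a string method is looked up instead of eval so that nothing is executed dynamically.

```javascript
// A parenthesized, comma-separated list evaluates to its last operand,
// so only the final string is used as the property name.
var hiddenName = ("oibj", "qwzx", "toUpperCase"); // decoys are hypothetical
var method = "pdf"[hiddenName];                   // same as "pdf".toUpperCase
var shout = method.call("pdf");

// Numeric route: the name is stored as a number and rendered in radix 36.
var total = 693000 + 741;            // hypothetical addends; the sum is 693741
var name36 = total.toString(7 + 29); // radix 36 uses the digits 0-9 and a-z

// Cross-check against the digit values 14, 31, 10 and 21 given above.
var fromDigits = [14, 31, 10, 21].map(function (d) {
  return d.toString(36);
}).join("");
```

Both name36 and fromDigits evaluate to the string eval, which a script could then use as a property name in bracket notation.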

Obfuscation Using Packers


The eval() function is commonly used by packer designers; PDF-based malware makes use of many kinds of
packer. Following operations to inflate, deflate, encrypt, or multiplex a combination of different transformations,
the original JavaScript is represented in a different form and with a different code size.

The unescape() function


As previously mentioned, unescape() can decode from a hexadecimal representation to raw binary data. The
unescape() function is able to deal with strings that decode to non-ASCII values and therefore is commonly used
in heap-spray attacks, but it can also be used as a method of obfuscation when the decoded results are another
ASCII string. This mode of operation is unlikely to appear in non-malicious code.

Malware authors often use the unescape() function in conjunction with the replace() method to obfuscate code,
as in the example in figure 7.

Figure 7: eval(), unescape(), and replace()

Base64
Base64 encoding [3] is used in numerous places on the Internet to represent arbitrary binary data using only the
US-ASCII character set. Malware authors have produced a JavaScript Base64 decode implementation in order
to decode Base64 representations of malicious code on the fly, as seen in figure 8, an example discovered in
October 2008.

Figure 8: Implementation of Base64


The original Base64 specification has a fixed index table that includes the alphabet, numerical digits, and the
characters “+” and “/”, with “=” reserved for padding. The order of the index table is fixed as a result of the
need to be able to encode and decode consistently. Malware authors, however, often alter these standards in
their implementations, either reordering the index table or changing the characters. Packers also exist that use
Base64 in conjunction with XOR or ADD operations.
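
For analysis purposes, a decoder with a configurable index table is usually enough to handle reordered or substituted alphabets. The following is a minimal sketch, not the implementation shown in figure 8; the function name, parameters, and the reversed table used in the example are assumptions made for illustration.

```javascript
var STANDARD_TABLE =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decodes Base64 text using the given 64-character index table.
// Characters that are not in the table (such as line breaks) are skipped.
function decodeBase64(text, table, padChar) {
  table = table || STANDARD_TABLE;
  padChar = padChar || "=";
  var bits = 0, bitCount = 0, out = "";
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (ch === padChar) break;
    var value = table.indexOf(ch);
    if (value < 0) continue;
    bits = (bits << 6) | value;
    bitCount += 6;
    if (bitCount >= 8) {
      bitCount -= 8;
      out += String.fromCharCode((bits >> bitCount) & 0xff);
      bits &= (1 << bitCount) - 1; // keep only the bits not yet emitted
    }
  }
  return out;
}

// Standard table.
var plain = decodeBase64("SGVsbG8=");

// Hypothetical altered table: the standard table in reverse order.
var reversedTable = STANDARD_TABLE.split("").reverse().join("");
var reencoded = "SGVsbG8=".split("").map(function (c) {
  return c === "=" ? c : reversedTable.charAt(STANDARD_TABLE.indexOf(c));
}).join("");
var plainFromCustom = decodeBase64(reencoded, reversedTable);
```

Passing the table as a parameter means an analyst only has to recover the altered table from the sample, not rewrite the decoder.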

RC4 Encryption
RC4, a powerful stream cipher, is one of the methods of encryption that has been used within packers. Operating
on a previously encrypted block of ciphertext, the decryption code uses the RC4 decryption algorithm to decrypt
and subsequently execute the malicious code body. Note that the decryption key must appear in the decryption
code, a potential area of weakness.

An example implementation of RC4 appears in figure 9. This code was first discovered in January 2009.

Figure 9: JavaScript RC4 implementation
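
Because the key has to be present in the decryption code, an analyst who extracts it can decrypt the payload offline. The sketch below is a generic RC4 routine written for that purpose; it is not the code from figure 9, and the helper names are illustrative. RC4 is symmetric, so the same function both encrypts and decrypts.

```javascript
// Generic RC4: key scheduling followed by keystream generation and XOR.
function rc4(keyBytes, dataBytes) {
  var s = [], i, j = 0, tmp;
  for (i = 0; i < 256; i++) s[i] = i;
  for (i = 0; i < 256; i++) {
    j = (j + s[i] + keyBytes[i % keyBytes.length]) & 0xff;
    tmp = s[i]; s[i] = s[j]; s[j] = tmp;
  }
  var out = [];
  i = 0; j = 0;
  for (var n = 0; n < dataBytes.length; n++) {
    i = (i + 1) & 0xff;
    j = (j + s[i]) & 0xff;
    tmp = s[i]; s[i] = s[j]; s[j] = tmp;
    out.push(dataBytes[n] ^ s[(s[i] + s[j]) & 0xff]);
  }
  return out;
}

// Helpers for converting between strings, byte arrays and hex text.
function toBytes(text) {
  return text.split("").map(function (c) { return c.charCodeAt(0) & 0xff; });
}
function toText(bytes) {
  return bytes.map(function (b) { return String.fromCharCode(b); }).join("");
}
function toHex(bytes) {
  return bytes.map(function (b) { return (b < 16 ? "0" : "") + b.toString(16); }).join("");
}

var cipherHex = toHex(rc4(toBytes("Key"), toBytes("Plaintext")));
var roundTrip = toText(rc4(toBytes("Key"), rc4(toBytes("Key"), toBytes("Plaintext"))));
```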

An Example Packer from Neosploit


There are many toolkits available to perform Web browser exploits using JavaScript; the toolkits commonly use
packers to prevent their code from being detected by AV scanners. One such packer has been discovered in use
within a sample of PDF malware; this example was first discovered in March 2009.

The packer uses a fairly simple substitution encrypt/decrypt algorithm but uses a method of key generation that
had not previously been seen. While most packers include the decryption key within the code in plaintext, the
Neosploit packer generates the key from the decryption function itself using arguments.callee in order to com-
plicate the process of analysis. Example code from the Neosploit packer appears in figure 10.
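
The effect of deriving a key from the decoder's own source text can be shown with a harmless sketch. Everything here is an assumption made for illustration: the checksum, the XOR substitution, and the function names are not the Neosploit algorithm, and a named function reference is used in place of arguments.callee so that the sketch also runs in strict mode.

```javascript
// Hypothetical key schedule: sum of character codes reduced to one byte.
function checksum(text) {
  var sum = 0;
  for (var i = 0; i < text.length; i++) {
    sum = (sum + text.charCodeAt(i)) & 0xff;
  }
  return sum;
}

// The decoder derives its key from its own source text, so the key never
// appears in the file as a literal.
function decodeWithSelfKey(bytes) {
  var key = checksum(decodeWithSelfKey.toString());
  return bytes.map(function (b) {
    return String.fromCharCode(b ^ key);
  }).join("");
}

// Packing side: the same key is computed from the unmodified decoder.
var selfKey = checksum(decodeWithSelfKey.toString());
var packed = "Hi".split("").map(function (c) { return c.charCodeAt(0) ^ selfKey; });
var unpacked = decodeWithSelfKey(packed);

// Editing the decoder, for example to add a debugging statement, changes the key.
var tamperedKey = checksum(
  decodeWithSelfKey.toString().replace("{", "{ /* debug */")
);
```

This is why instrumenting such a decoder in place tends to break decryption: the analyst has to compute the key from an untouched copy of the function.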


Figure 10: Example code from Neosploit packer

Overpacking using Multiple Packers


There are currently over 30 known types of JavaScript packer. There are many examples of JavaScript code that
has been packed multiple times using different packers.

Obfuscation Using Features of the PDF Format


The previous section outlined various methods of obfuscating JavaScript using packers. Malware authors may
also take advantage of PDF file format features in order to obfuscate malicious code. These methods will be
outlined in this section.

Using the File Header


Many file formats make use of a file header or “magic number” to identify the file type, which usually is simply a
few bytes at the beginning of the file. Windows executables, for example, use “MZ”, bitmap files have “BM”, and
so on. Although PDF files commonly have “%PDF” at the beginning of the file, this need not always be the case.
The PDF specification contains the following description:

The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of
the form 1.N, where N is a digit between 0 and 7 [1].

This statement can be seen to be somewhat ambiguous. Some samples of PDF malware have been observed to
have malformed file headers in which “%PDF” is not at the beginning of the file, for example:
• “%PDF” appears on the third line, not the first.
• “%PDF” appears following “MZ”, similar to a Windows executable.

The “%” character delimits the beginning of a comment in a PDF file. Adobe Reader is able to load and parse the
contents of files even if the first four bytes are not “%PDF”, and additionally copes with files whose first byte is
not “%”.

This flexibility creates the possibility of a performance issue for AV vendors: merely scanning the first four bytes
is not sufficient to identify a file as a PDF that may be opened using Adobe Reader.
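
One way for a scanner to cope is to search a leading window of the file for the header instead of checking only offset zero. The sketch below assumes the file's leading bytes are available as a string; the function name and the 1024-character default window are assumptions for illustration and are not figures taken from this paper.

```javascript
// Returns the offset of "%PDF-" within the leading window, or -1 if absent.
function findPdfHeader(fileText, windowSize) {
  var limit = windowSize || 1024; // assumed search window
  return fileText.slice(0, limit).indexOf("%PDF-");
}

var wellFormed = findPdfHeader("%PDF-1.4\n1 0 obj\n");
var onThirdLine = findPdfHeader("junk\nmore junk\n%PDF-1.3\n");
var afterMz = findPdfHeader("MZ%PDF-1.6\n");
var notPdf = findPdfHeader("BM\u0000\u0000\u0000\u0000");
```

A larger window catches more malformed headers at the cost of scanning more bytes per file, which is the performance trade-off described above.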

Cross-reference Table
Many legitimate PDF files contain cross-references. The specification contains the following text:

The cross-reference table contains information that permits random access to indirect objects within the file so that
the entire file need not be read to locate any particular object.

It was thought that random access to PDF objects did not present a problem for AV scanners but the discovery
of malicious PDFs with invalid offsets changed this perception. While PDF readers can parse PDF files from the
beginning even if invalid offsets are present, this is a problem for those developing AV scanners, as it may be
necessary to parse an entire file to ascertain whether or not it contains malicious code.
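
A scanner can treat the cross-reference table as a hint and fall back to a linear scan when an offset does not point at the expected object. The following sketch assumes the file content is available as a string and uses hypothetical function names; it only recognizes the plain "N G obj" form of an object header.

```javascript
// Linear scan: finds every "<id> <generation> obj" header in the file text.
function scanForObjects(fileText) {
  var pattern = /(\d+)\s+(\d+)\s+obj\b/g;
  var found = [], match;
  while ((match = pattern.exec(fileText)) !== null) {
    found.push({
      id: Number(match[1]),
      generation: Number(match[2]),
      offset: match.index
    });
  }
  return found;
}

// Checks whether a cross-reference entry really points at the object it names.
function xrefOffsetIsValid(fileText, id, generation, claimedOffset) {
  return fileText.startsWith(id + " " + generation + " obj", claimedOffset);
}

var sampleFile = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n";
var objects = scanForObjects(sampleFile);
var goodEntry = xrefOffsetIsValid(sampleFile, 2, 0, 30);
var badEntry = xrefOffsetIsValid(sampleFile, 2, 0, 99);
```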

Stream Filters
In order to encapsulate large objects such as images, font data, and so on, the PDF format supports the inclusion
of stream data with encoding and/or compression, as seen in figure 11.

To date, PDF malware has been found using the following filters to hide malicious JavaScript:
• ASCIIHexDecode
• ASCII85Decode
• LZWDecode
• FlateDecode
• RunLengthDecode
• JBIG2Decode
Thus, six of the ten standard filters defined in the PDF specification have been used for malicious purposes.

Figure 11: PDF stream encoding and compression


Combining Filters
Initially only malware that made use of one filter type was found. The FlateDecode filter decompresses data
that has been compressed using the zlib/deflate algorithm. This method can both shorten file length and hide
JavaScript content, and minor changes to the code will result in major differences in the compressed version.

In March 2010, malware that made use of multiple PDF filters was discovered. As detailed in the PDF
specification, decompress and decode filters were used in conjunction. This raised the bar for AV vendors who
found themselves requiring scanners that could decompress/decode all filters in order to scan the content of
the streams.

Figure 12 shows an example of the use of multiple stream filters.

Figure 12: Multiple stream filters
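
Decoding a filter chain amounts to applying each decoder in the order the filters are listed. The sketch below implements only two of the simpler filters (ASCIIHexDecode and RunLengthDecode) in plain JavaScript to keep it self-contained; the function and table names are hypothetical, and a real scanner would need the remaining filters as well.

```javascript
// ASCIIHexDecode: pairs of hex digits, whitespace ignored, ">" ends the data.
function asciiHexDecode(bytes) {
  var text = bytes.map(function (b) { return String.fromCharCode(b); }).join("");
  var hex = text.split(">")[0].replace(/\s+/g, "");
  if (hex.length % 2 === 1) hex += "0"; // an odd final digit is padded with 0
  var out = [];
  for (var i = 0; i < hex.length; i += 2) {
    out.push(parseInt(hex.slice(i, i + 2), 16));
  }
  return out;
}

// RunLengthDecode: a length byte n of 0..127 copies the next n+1 bytes,
// 129..255 repeats the next byte 257-n times, and 128 ends the data.
function runLengthDecode(bytes) {
  var out = [], i = 0;
  while (i < bytes.length) {
    var n = bytes[i++];
    if (n === 128) break;
    if (n < 128) {
      for (var k = 0; k <= n && i < bytes.length; k++) out.push(bytes[i++]);
    } else {
      if (i >= bytes.length) break; // truncated run
      var value = bytes[i++];
      for (var r = 0; r < 257 - n; r++) out.push(value);
    }
  }
  return out;
}

var DECODERS = { ASCIIHexDecode: asciiHexDecode, RunLengthDecode: runLengthDecode };

// Applies the decoders in the order the filter names are given.
function applyFilters(filterNames, bytes) {
  return filterNames.reduce(function (data, name) {
    var decode = DECODERS[name];
    if (!decode) throw new Error("Unsupported filter: " + name);
    return decode(data);
  }, bytes);
}

// Harmless example: "aaaaab" run-length encoded, then hex encoded.
var rawStream = "FC61 0062 80>".split("").map(function (c) { return c.charCodeAt(0); });
var decodedStream = applyFilters(["ASCIIHexDecode", "RunLengthDecode"], rawStream)
  .map(function (b) { return String.fromCharCode(b); }).join("");
```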

Stream Length
Each stream includes a length field that holds the number of bytes that comprise the stream. PDF reader ap-
plications may read the stream contents based on this field, although if it is missing they may use the stream
and endstream keywords that mark the beginning and end of a stream respectively. Adobe Reader is able to read
stream content even if the length field is incorrect.

Malicious PDFs often have invalid length values, although this may not be intentional on the part of the
malware authors; many such files are dynamically produced using server-side polymorphic techniques when PDF
vulnerabilities are targeted for drive-by downloads. The malicious JavaScript is modified whereas the other PDF
content is not, which results in an invalid length field.

Figure 13 is an example taken from a malicious PDF discovered in April 2008. An invalid length value of 0000
can be seen.

Figure 13: Invalid “length” value

Endstream or Endstrebm?
As described above, stream and endstream respectively mark the beginning and end of a stream. In September
2008 a sample was found that used endstrebm instead of endstream. The stream contained malicious JavaScript
and was compressed using zlib. Perhaps surprisingly Adobe Reader was able to recognize the stream data and
perform the decompression before falling foul of the exploit contained within.
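
A tolerant scanner can mirror this behavior by trusting the length field only when it is confirmed by the end marker, and otherwise falling back to progressively looser boundaries. The sketch below is one possible strategy, not a description of how Adobe Reader works internally; the function name, the fallback order, and the use of endobj as a last-resort boundary are assumptions.

```javascript
// Extracts the first stream body from the text of a single PDF object.
function extractStream(objectText, declaredLength) {
  var keyword = objectText.indexOf("stream");
  if (keyword < 0) return null;
  var start = keyword + "stream".length;
  if (objectText.slice(start, start + 2) === "\r\n") start += 2;
  else if (objectText.charAt(start) === "\n") start += 1;

  // 1. Trust the declared length only if "endstream" follows the data.
  if (declaredLength > 0) {
    var tail = objectText.slice(start + declaredLength).replace(/^[\r\n]+/, "");
    if (tail.indexOf("endstream") === 0) {
      return { data: objectText.slice(start, start + declaredLength), boundary: "length" };
    }
  }
  // 2. Otherwise search for the end marker, then for the end of the object.
  //    Trailing line breaks are trimmed, which may drop genuine data bytes.
  var markers = ["endstream", "endobj"];
  for (var i = 0; i < markers.length; i++) {
    var end = objectText.indexOf(markers[i], start);
    if (end >= 0) {
      return {
        data: objectText.slice(start, end).replace(/[\r\n]+$/, ""),
        boundary: markers[i]
      };
    }
  }
  return { data: objectText.slice(start), boundary: "eof" };
}

var validObject = "5 0 obj\n<< /Length 6 >>\nstream\nABCDEF\nendstream\nendobj\n";
var misspelled = "5 0 obj\n<< /Length 0000 >>\nstream\nABCDEF\nendstrebm\nendobj\n";

var byLength = extractStream(validObject, 6);
var byKeyword = extractStream(validObject, 0);
var byObjectEnd = extractStream(misspelled, 0);
```

In the misspelled case the recovered data still carries the bogus keyword at its end, so a decompressor that ignores trailing bytes is needed downstream.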

Case Sensitivity
According to the PDF specification, all entries such as “/Type” or “/Action” should be identified in a case-sensi-
tive manner:

PDF is case-sensitive; corresponding uppercase and lowercase letters shall be considered distinct.


In January 2010, however, some samples were discovered that contained such entries in a seemingly random
mixture of upper and lower case characters. Any PDF parser that, in accordance with the specification,
identified these entries in a case-sensitive manner required an update to use case-insensitive identification.
Figure 14 shows an example taken from the sample.
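
A simple way to add such tolerance is to map each key onto a canonical spelling before it is interpreted. This is a minimal sketch; the function name is hypothetical and the list of keys is illustrative rather than a statement of which keys Adobe Reader accepts in mixed case.

```javascript
// Illustrative list of keys a scanner might care about.
var CANONICAL_KEYS = ["/Type", "/Action", "/JavaScript", "/JS", "/Filter", "/Length"];

// Returns the canonical spelling of a key regardless of letter case,
// or null when the key is not one the scanner tracks.
function canonicalKey(token) {
  var wanted = token.toLowerCase();
  for (var i = 0; i < CANONICAL_KEYS.length; i++) {
    if (CANONICAL_KEYS[i].toLowerCase() === wanted) return CANONICAL_KEYS[i];
  }
  return null;
}

var mixedCase = canonicalKey("/jAvAsCrIpT");
var exactCase = canonicalKey("/Type");
var unknownKey = canonicalKey("/Colour");
```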

Figure 14: Variations in case

Encryption
The PDF format has supported encryption since version 1.1. Once a PDF is encrypted, all strings and streams in
the file will be in ciphertext. Either the RC4 or AES algorithm may be used, and two forms of password are avail-
able: user password and owner password.

User passwords are mainly used as a way to prevent PDF content from being displayed, whereas owner pass-
words are used to prevent content modification. Users must enter the correct password to perform either opera-
tion, which provides document creators control over their work. Empty strings are acceptable passwords and as
such an encrypted PDF with an empty password string may be displayed in any reader.

Having the ability to encrypt PDFs would initially seem to be something that could be leveraged by malware
authors to evade detection by AV scanners, but most malicious PDFs are served via the Web and aim to down-
load files without users’ knowledge of what is happening. With stealth being of primary importance, malware
authors typically only have two options: a plain PDF file or an encrypted PDF file with no password. The latter
option leaves AV vendors at a disadvantage performance-wise as any such PDF files must be decrypted before
the content is scanned.

RC4 and AES


When a PDF is to be encrypted using RC4 or AES, the encryption key is constructed using the following param-
eters:
• 32-byte string based on user password
• 32-byte string based on owner password
• User access permission flag
• Document ID
• Object ID
• Generation number
Within a PDF, keys differ between objects because object ID and generation number are used for key generation.
All strings and/or streams in the same object are encrypted using the same key.
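
The per-object step can be sketched as follows, based on the general encryption algorithm in the PDF specification as understood here: the file-level key is extended with the low-order bytes of the object ID and generation number, hashed with MD5, and truncated. The function name is hypothetical, the file-level key is a placeholder (its derivation from the passwords, permission flag, and document ID is omitted), and the details should be checked against the specification before use. Node.js is assumed for the MD5 implementation.

```javascript
var crypto = require("crypto");

// Derives the key for one object from the file-level key.
function objectKey(fileKey, objectId, generation, useAes) {
  var suffix = [
    objectId & 0xff, (objectId >> 8) & 0xff, (objectId >> 16) & 0xff, // 3 bytes
    generation & 0xff, (generation >> 8) & 0xff                        // 2 bytes
  ];
  if (useAes) suffix.push(0x73, 0x41, 0x6c, 0x54); // the bytes of "sAlT"
  var input = Buffer.concat([Buffer.from(fileKey), Buffer.from(suffix)]);
  var digest = crypto.createHash("md5").update(input).digest();
  // The key length is the file key length plus 5, capped at 16 bytes.
  return digest.subarray(0, Math.min(fileKey.length + 5, 16));
}

// Placeholder 5-byte (40-bit) file key, for illustration only.
var placeholderFileKey = [0x01, 0x02, 0x03, 0x04, 0x05];
var keyForObject35 = objectKey(placeholderFileKey, 35, 0, false);
var keyForObject36 = objectKey(placeholderFileKey, 36, 0, false);
```

Because the object ID and generation number feed the hash, a scanner has to derive a separate key for every object it decrypts.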

Password Validation Errors


Both user and owner passwords are represented as 32-byte strings in the U(ser) and O(wner) entries of the
encryption dictionary respectively. Malicious PDFs tend to be produced with empty owner passwords, which does not affect the display
of a file or its potential ability to deliver an exploit. This encryption does, however, affect how a file may open in
an application that allows PDFs to be modified and also necessitates decryption when analysis is required. Most
PDF files have no user password so they can be opened and read by anyone but when a file that has a non-blank
user password is opened the PDF reader will display a dialog box to allow the password to be entered.


Fragmental JavaScript
A JavaScript object in a PDF file can be split up or fragmented using the name dictionary. The original function-
ality of the name dictionary was to allow an object to be referred to by name rather than by object reference. A
number of different object types can be referred to in this way. An entry can be set as follows:

/Names [ (name1) 35 0 R (name2) 36 0 R …]

The above text defines a reference to object 35, named as “name1”. The name dictionary can contain
multiple entries, each of which defines a similar relationship; the name “name2” is also associated with object
36 in the example above.

The name dictionary has the following functionality:

When the document is opened, all of the actions in this name tree shall be executed, defining JavaScript functions for
use by other scripts in the document.

This description from the PDF specification describes the mechanism through which frag-
mental JavaScript objects can be executed when a PDF file is opened. JavaScript frag-
ments must be gathered together and evaluated together. Each JavaScript fragment may ad-
ditionally be compressed or encrypted which means that AV scanners must perform the inverse
of these operations in order to check for malicious content.
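
The gathering step can be sketched as follows, assuming the referenced objects have already been located, decrypted, and decompressed. The function names and the scriptsByObjectId lookup table are hypothetical, and the two fragments are harmless placeholders.

```javascript
// Parses entries of the form "(name) <id> <generation> R" from a /Names array.
function parseNamesArray(namesText) {
  var pattern = /\(([^)]*)\)\s*(\d+)\s+(\d+)\s+R/g;
  var entries = [], match;
  while ((match = pattern.exec(namesText)) !== null) {
    entries.push({
      name: match[1],
      objectId: Number(match[2]),
      generation: Number(match[3])
    });
  }
  return entries;
}

// Joins the script fragments in the order in which the names are listed.
function gatherScripts(namesText, scriptsByObjectId) {
  return parseNamesArray(namesText).map(function (entry) {
    return scriptsByObjectId[entry.objectId] || "";
  }).join("\n");
}

var namesEntry = "/Names [ (name1) 35 0 R (name2) 36 0 R ]";
var fragments = {
  35: "var greeting = 'Hello';",
  36: "var message = greeting + ' World';"
};
var combinedScript = gatherScripts(namesEntry, fragments);
```

Scanning combinedScript rather than each fragment separately lets a detection see code whose meaning only emerges once the pieces are put together.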

The first sample found that used this fragmental JavaScript technique was discovered in August 2009. An
excerpt from the file appears in figure 15. Two JavaScript objects appear in the file, one to perform
heap-spraying and deliver the exploit (figure 16) and the other to decrypt the shellcode (figure 17).

Figure 15: Fragmental JavaScript


Figure 16: Heap spray and exploit

Figure 17: Shell Decryption

JavaScript Features Unique to PDF


JavaScript that may be used within a PDF file has a number of unique features, such as the ability to make use of
Acrobat forms for user input. This section details how such features may be used by a malware author.

Malicious JavaScript can be split up in a PDF file with the malicious code body being placed inside a PDF object
or objects. This may be further encrypted JavaScript, shellcode, or any other malicious code. The JavaScript that
makes use of this technique appears to be legitimate because the main malicious code body is not visible on the
surface; only PDF object references are evident. In order to scan for malicious code, AV vendors must develop
scanners that gather together all related objects and reconstruct them. This increases the time and amount of
memory it takes to scan a particular file.

Use of this.getField()
The PDF JavaScript API has a built-in function called getField(), the main purpose of which is to retrieve data
from the Field object of an individual widget, as in the following example:

var firstName = this.getField("Name.First").value;
app.alert ("Your first name is " + firstName);

This example shows how JavaScript can retrieve user input from a text entry widget.


In November 2008 a sample was discovered that hides a segment of code in a Field object and later uses
getField() to retrieve it. The hidden JavaScript code is packed as escaped characters but can be executed by
using unescape() and eval(), as detailed earlier in this document (see figure 7). The example in figure 18 shows
the use of getField() in the sample, in which getField() takes the string “data” as an argument.

Figure 18: Use of getField()

Figure 19 shows the target object referenced in the example above. The type of the object is “/Widget” and its
text label is “data”, which matches the string used in the JavaScript object. The malicious JavaScript content
exists in the “/DV” entry as a string, and clearly is a string of escaped characters.

Figure 19: Field widget

Use of app.doc.getAnnots()
The app.doc.getAnnots() function is built into the PDF JavaScript API and operates in a similar manner to
getField(), outlined above. This function allows data to be retrieved from a ScreenAnnot object. An example of
how this function may be used to hide malicious code appears in figure 20.

Figure 20: Use of app.doc.getAnnots()


The 6th object in the document, visible in figure 21, is a ScreenAnnot that contains a reference to a further
7th object.

Figure 21: ScreenAnnot object

Finally, the 7th object in the document, visible in figure 22, is a stream object that contains escaped
malicious JavaScript.

Figure 22: Stream data referred to by screenAnnot object

Use of this.info.Producer and this.info.Title
The PDF Info object contains document meta-data such as the title, producer, and so on. Malware authors can use
the document information dictionary to store hidden malicious JavaScript in a similar way to the methods
detailed above.

A sample making use of this technique was discovered in November 2009. A 70448-byte string masquerading
as the document title was present in the file; this string contained obfuscated and escaped JavaScript code. To
execute the hidden code the threat retrieves the title string, replaces all instances of “j866p886a39” with “%”
and then uses the now-familiar unescape() and eval() operations on the document “title”. An excerpt from the
code appears in figure 23.
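
For analysis, the same transformation can be applied to the title string without executing the result. The sketch below is illustrative: the function name is hypothetical and the sample title is a short, harmless stand-in that decodes to the text "Hi"; only the marker string comes from the description above.

```javascript
// Replaces every occurrence of the marker with "%" and decodes the result.
// The decoded text is returned for inspection and is never evaluated.
function revealHiddenScript(title, marker) {
  return unescape(title.split(marker).join("%"));
}

var marker = "j866p886a39";
var fakeTitle = marker + "48" + marker + "69"; // stands for "%48%69"
var revealed = revealHiddenScript(fakeTitle, marker);
```

Using split() and join() replaces every occurrence of the marker, whereas replace() with a plain string argument would only replace the first.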

Figure 23: Use of this.info.title to hold malicious JavaScript


Conclusion
PDF-based malware can harbor malicious JavaScript in a similar manner to how it may exist on the Web, but the
features and specification of the PDF file format mean that a number of additional tricks are available to the mal-
ware author. The cat-and-mouse game between AV vendors and malware authors continues; the complexity and
flexibility of the PDF file format mean that malware authors are continually pushing the envelope and as such AV
vendors must continue to improve and refine their PDF parsing technology.

The possibility of false positives exists as a result of toolkits that may be used to craft both legitimate and mali-
cious PDFs alike. It is crucial for AV vendors to exercise caution when adding definitions so as to avoid the disrup-
tion that may be caused when a legitimate file is falsely convicted.

Some good news is that Adobe introduced sandboxing functionality into Reader during 2010. This may help
to contain even malware that uses new or previously unknown techniques. Sandboxing technology is not the
perfect solution to all problems, however, and time will tell how successful such an approach may be; the
introduction of such sandbox technology may also bring with it new vulnerabilities to be exploited. For now it
is essential to keep software patches and virus definitions up-to-date and for antivirus vendors to strive to
keep pace with the tricks and techniques deployed by the malware authors.


Bibliography
1. Adobe, “PDF Reference and Adobe Extensions to the PDF Specification.” http://www.adobe.com/devnet/pdf/pdf_reference.html
2. Ecma International, “Standard ECMA-262 ECMAScript Language Specification.”
3. RFC 3548, “The Base16, Base32, and Base64 Data Encodings.” http://tools.ietf.org/html/rfc3548


Any technical information that is made available by Symantec Corporation is the copyrighted work of Symantec Corporation and is owned by Symantec
Corporation.

NO WARRANTY . The technical information is being delivered to you as is and Symantec Corporation makes no warranty as to its accuracy or use. Any use of the
technical documentation or the information contained herein is at the risk of the user. Documentation may include technical or other inaccuracies or typographical
errors. Symantec reserves the right to make changes without prior notice.

About the author
Kazumasa Itabashi is a Principal Software Engineer at Symantec Security Response in Tokyo specializing in PDF
malware.

About Symantec
Symantec is a global leader in providing security, storage and systems management solutions to help businesses
and consumers secure and manage their information. Headquartered in Mountain View, Calif., Symantec has
operations in more than 40 countries. More information is available at www.symantec.com.

For specific country offices and contact numbers, please visit our Web site. For product information in the
U.S., call toll-free 1 (800) 745 6054.

Symantec Corporation
World Headquarters
350 Ellis Street
Mountain View, CA 94043 USA
+1 (650) 527-8000
www.symantec.com

Copyright © 2010 Symantec Corporation. All rights reserved. Symantec and the Symantec logo are trademarks or
registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names
may be trademarks of their respective owners.
