Portable Document Format Malware
Kazumasa Itabashi
Contents
Introduction
Background Information
Obfuscation Using Features of JavaScript
Obfuscation Using Features of PDF Format
Encryption
JavaScript Features Unique to PDF
Conclusion
Bibliography
Introduction
Approximately two years ago a vulnerability in Adobe Reader’s JavaScript API was discovered, and malware
authors continue to produce malicious PDF files that exploit this flaw. This vulnerability has been patched,
though a number of other vulnerabilities have been found and used in active exploits before being patched
themselves.
There are numerous reasons why malware authors might use vulnerabilities in Adobe Reader and Acrobat as an
attack vector. First, the PDF format is widely used throughout the world for sharing documents, and
Adobe Reader is the most popular PDF viewer; many OEMs ship PCs with
the software preinstalled. Second, the PDF file format specification and
the properties of the viewer allow malware authors a significant degree
of freedom when designing and developing a threat. Third, the nature of
the PDF format provides malware authors with some useful tricks that
help to avoid detection by AV scanners, and the support for JavaScript
further extends this capability. Obfuscation, encryption, and misdirection
are often employed in much the same way as they are
in HTML and other environments that support JavaScript.
Background Information
The first JavaScript-based PDF malware came to light in February 2008.
A vulnerability in one of Adobe’s JavaScript API functions, collectEmailInfo(), was discovered and used in
conjunction with a heap-spray attack. The malicious code copied shellcode to heap memory and subsequently
called the vulnerable API, thus exploiting the vulnerability. This was fairly simple JavaScript in and of itself.
In November 2008 – nine months later – another vulnerability was found, this time in the printf() API. Several
further vulnerabilities emerged over time, and to date malware exploiting vulnerabilities in the following func-
tions has been found:
• collectEmailInfo()
• printf()
• getIcon()
• customDictionaryOpen()
• getAnnots()
• newPlayer()
The exploits are very similar to those that might be found in attacks on Web browsers. Unfortunately both Web-
and PDF-based attacks continue because of the myriad methods that may be used to obfuscate the code and
hence evade detection by security software.
Web-based JavaScript attacks commonly make use of HTML features to obfuscate the code effectively. By using
<iframe> or <script> tags, malware authors can make it more difficult for AV products to detect the malicious
code. Similar techniques can be used within PDF files, and as such the following sections detail some ways in
which malicious JavaScript may be obfuscated by malware authors.
Simple Obfuscation
Even beginner or unsophisticated JavaScript programmers can make use of simple string obfuscation tech-
niques. While there are legitimate uses of code obfuscation, the same techniques may be used to craft malicious
code that evades detection.
Split Strings
Dynamic string manipulation is as easy in JavaScript as it is in other interpreted script-based languages. Often,
strings (or strings and numbers) may be concatenated using the “+” character. While this is useful for the easy
manipulation of textual content, it also makes it easy for malware authors to create simply obfuscated code.
Figure 1 shows a shellcode block that has been split into many shorter strings, some of which are defined as
variables. The string literals and variables are concatenated and evaluated as arguments to the unescape()
function, which transforms the string into the final binary form. For an AV scanner to detect this shellcode it
must include both a lexical and a structural parser.
A string can also be used to call an object method through bracket notation. Although the method name is broken
down into substring “components”, it is evaluated as a single string. Figure 2 shows an example of this.
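The two ideas above can be sketched with a harmless stand-in. This is an illustration only, not the code from figures 1 and 2: the variable names are hypothetical and the escaped text decodes to a short greeting rather than shellcode.

```javascript
// Benign stand-in for the split-string technique: the escaped text below
// decodes to "Hi!", not shellcode. All variable names are hypothetical.
const partA = "%4";
const partB = "8%6";
const partC = 9; // numbers concatenate with strings too

// The pieces only form a recognisable "%48%69%21" once concatenated.
const escaped = partA + partB + partC + "%" + "2" + "1";
const decoded = unescape(escaped);

// Bracket notation lets a method name be assembled in the same way.
const methodName = "to" + "Upper" + "Case";
const shouted = decoded[methodName]();
```

A scanner that matches only on literal text sees neither the full escaped string nor the method name; both exist only after the expressions are evaluated.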
Regular Expressions
JavaScript supports regular expressions as a built-in language feature. They can be used for pattern-matching
and programmatic text manipulation, for example detecting line breaks or validating input characters. The
JavaScript object used to represent regular expressions is called RegExp.
Figure 1
Figure 2
Regular expressions are an effective method of string obfuscation used by malware authors. The characters
that comprise the string to be obfuscated can be “scattered” throughout a longer string and retrieved using a
regular expression when they are to be used. Figure 3 shows this technique in use. Each instance of l, k, u and
d in the obfuscated string is replaced with the % character, yielding %25%34%35%30%30%30%66. This string is
then evaluated using the unescape() function, giving the final result of %45000f.
Figure 3
Obfuscation using a regular expression
This is a simple example in which a single character was added to each two-digit hexadecimal number. The
technique can be made more complex, however, for example by adding more characters, using a more complex
sequence, or by replacing parts of each hexadecimal number, as in the expression
"%25%34%35%3Z%3Z%3Z%66".replace(new RegExp(/Z/g), "0").
The use of regular expressions can yield more complex obfuscation than simple split strings. It also tends to be
used to hide strings in conjunction with other techniques.
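As a minimal sketch of the regular-expression technique, the snippet below rebuilds the escaped string quoted above. The obfuscated input is an assumption: the exact string from figure 3 is not reproduced in the text, so one was constructed to yield the same result.

```javascript
// Hypothetical obfuscated input: the figure's exact string is not reproduced
// in the text, so this one is built to give the result the text describes.
const scattered = "l25k34u35d30l30k30u66";

// Replace every l, k, u or d with "%" to recover the escaped form.
const escaped = scattered.replace(/[lkud]/g, "%");

// One unescape() pass turns the escaped form into the final string.
const result = unescape(escaped);

// The variant from the text: part of each hex number is swapped out instead.
const variant = "%25%34%35%3Z%3Z%3Z%66".replace(new RegExp(/Z/g), "0");
```

Both routes arrive at the same escaped string, which is why a scanner may need to evaluate the replacement rather than match on the raw text.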
Figure 4
One downside to using the eval() technique for code obfuscation is that most malcode researchers would likely
begin their search for malicious code by looking for this keyword. In the PDF format, however, alternatives to the
eval() function are available.
The function app.setTimeOut(statement, timeout) executes the statement given as its first argument after the
time (in milliseconds) given as its second. In Web browsers the first argument is usually a reference to a
function, but the PDF JavaScript API accepts a string containing arbitrary code. This allows for further
obfuscation. Figure 5 shows an example of a split eval(). The string is the final element of the array whose
first element is “oibj”.
Figure 5
eval() in an array
An alternative method is to use a numeric representation to produce the desired string. Figure 6 shows how this
can be achieved.
Following the addition, the variable ikhircrro has the value 693741. The next line converts this to a string by
treating it as a radix-36 (i.e. 7 + 29) representation:
Radix-36 notation uses the digits 0 to 9 followed by the letters A to Z, so 14 is “e”, 31 is “v”, 10 is “a” and
21 is “l”; the variable lfbhmy therefore represents “eval”.
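The arithmetic can be checked directly: 14 × 36³ + 31 × 36² + 10 × 36 + 21 = 693741. The short sketch below performs the same conversion with the standard Number.prototype.toString(radix) method. The variable names are illustrative rather than those of the sample, and the resulting string is never executed.

```javascript
// Convert the number quoted in the text to its radix-36 representation.
const value = 693741;
const name = value.toString(36); // digits 0-9, then a-z

// Break the result back into its radix-36 digit values.
const digits = [...name].map((ch) => parseInt(ch, 36));

// The reverse direction, useful when hunting for such constants.
const back = parseInt("eval", 36);
```

An analyst can run the reverse conversion on other suspicious keywords to obtain numeric constants worth searching for.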
Base64
Base64 encoding [3] is used in numerous places on the Internet to represent arbitrary binary data using only the
US-ASCII character set. Malware authors have produced a JavaScript Base64 decode implementation in order
to decode base64 representations of malicious code on the fly, as seen in figure 8, an example discovered in
October 2008.
Figure 8
Implementation of Base64
The original Base64 specification has a fixed index table made up of the upper- and lower-case alphabet, the
numerical digits, and the characters “+” and “/”, with “=” used for padding. The order of the index table is
fixed so that data can be encoded and decoded consistently. Malware authors, however, often depart from the
standard in their implementations, either reordering the index table or changing its characters. Packers also
exist that use Base64 in conjunction with XOR or ADD operations.
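To show why an altered index table defeats a stock decoder, here is a minimal sketch of a decoder that takes the table as a parameter. It is not the implementation from figure 8. The function names and the reversed table are assumptions chosen for illustration, and the payload is the benign string "Hi".

```javascript
const STANDARD =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decode Base64 text using an arbitrary 64-character index table.
function decodeWithTable(text, table = STANDARD, pad = "=") {
  let bits = 0; // accumulator of bits not yet emitted
  let bitCount = 0; // number of bits currently held in the accumulator
  const bytes = [];
  for (const ch of text) {
    if (ch === pad) break; // padding ends the data
    const index = table.indexOf(ch);
    if (index === -1) continue; // skip characters outside the table
    bits = (bits << 6) | index;
    bitCount += 6;
    if (bitCount >= 8) {
      bitCount -= 8;
      bytes.push((bits >> bitCount) & 0xff);
      bits &= (1 << bitCount) - 1; // keep only the leftover bits
    }
  }
  return String.fromCharCode(...bytes);
}

// A reordered table (here simply reversed) as a hypothetical alteration.
const REVERSED = [...STANDARD].reverse().join("");

// Re-express standard Base64 text in terms of the reordered table.
const recode = (text) =>
  [...text]
    .map((ch) => (ch === "=" ? ch : REVERSED[STANDARD.indexOf(ch)]))
    .join("");

const standardText = "SGk="; // standard Base64 for "Hi"
const alteredText = recode(standardText);
```

Decoding alteredText with the standard table yields garbage, whereas passing REVERSED recovers the original bytes. A scanner therefore has to recover the table from the decoder before it can decode the payload.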
RC4 Encryption
RC4, a powerful stream cipher, is one of the methods of encryption that has been used within packers. Operating
on a previously encrypted block of ciphertext, the decryption code uses the RC4 decryption algorithm to decrypt
and subsequently execute the malicious code body. Note that the decryption key must appear in the decryption
code, a potential area of weakness.
An example implementation of RC4 appears in figure 9. This code was first discovered in January 2009.
Figure 9
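For analysts who need to reproduce the decryption step once a key has been recovered, a compact RC4 routine might look like the following. This is a generic textbook sketch, not the code from figure 9. The helper names are hypothetical, and the key and plaintext are the widely published "Key" / "Plaintext" test pair.

```javascript
// RC4 over byte arrays; the same routine encrypts and decrypts.
function rc4(keyBytes, dataBytes) {
  // Key-scheduling algorithm (KSA): permute S under control of the key.
  const S = Array.from({ length: 256 }, (_, index) => index);
  for (let i = 0, j = 0; i < 256; i++) {
    j = (j + S[i] + keyBytes[i % keyBytes.length]) & 0xff;
    [S[i], S[j]] = [S[j], S[i]];
  }
  // Pseudo-random generation algorithm (PRGA): XOR keystream into the data.
  const out = [];
  let i = 0;
  let j = 0;
  for (const byte of dataBytes) {
    i = (i + 1) & 0xff;
    j = (j + S[i]) & 0xff;
    [S[i], S[j]] = [S[j], S[i]];
    out.push(byte ^ S[(S[i] + S[j]) & 0xff]);
  }
  return out;
}

const toBytes = (text) => [...text].map((ch) => ch.charCodeAt(0));
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).padStart(2, "0")).join("");

const key = toBytes("Key");
const ciphertext = rc4(key, toBytes("Plaintext"));
const recovered = String.fromCharCode(...rc4(key, ciphertext));
```

Because RC4 is symmetric, running the routine a second time with the same key restores the plaintext, which is why possession of the embedded key is all a scanner needs.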
The packer uses a fairly simple substitution encrypt/decrypt algorithm but uses a method of key generation that
had not previously been seen. While most packers include the decryption key within the code in plaintext, the
Neosploit packer generates the key from the decryption function itself using arguments.callee in order to com-
plicate the process of analysis. Example code from the Neosploit packer appears in figure 10.
Figure 10
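The effect of deriving a key from the decryption function itself can be sketched as follows. This is a simplified illustration, not the Neosploit code. The checksum is an arbitrary stand-in for the real key schedule, and a named self-reference replaces arguments.callee because strict-mode code forbids the latter.

```javascript
// Arbitrary 32-bit checksum standing in for the packer's real key derivation.
function checksum(text) {
  let key = 0;
  for (let i = 0; i < text.length; i++) {
    key = (key * 31 + text.charCodeAt(i)) >>> 0;
  }
  return key;
}

// Hypothetical decoder: its key depends on its own source text. The sample
// described in the text obtains that source through arguments.callee.
function decoder() {
  return checksum(decoder.toString());
}

// What an analyst changes when instrumenting the function for debugging.
const original = decoder.toString();
const tampered = original.replace("{", "{ /* analyst's debug line */");
```

Any edit to the function body, such as the inserted comment in tampered, changes the text being hashed. An instrumented copy would therefore almost certainly derive a different key and fail to decrypt.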
The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of
the form 1.N, where N is a digit between 0 and 7 [1].
This statement can be seen to be somewhat ambiguous. Some samples of PDF malware have been observed to
have malformed file headers; for example, “%PDF” does not appear at the beginning of the file:
The “%” character delimits the beginning of a comment in a PDF file. Adobe Reader is able to load and parse the
contents of files even if the first four bytes are not “%PDF”, and additionally copes with files whose first byte is
not “%”.
This flexibility creates the possibility of a performance issue for AV vendors: merely scanning the first four bytes
is not sufficient to identify a file as a PDF that may be opened using Adobe Reader.
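A scanner can compensate by searching a leading window of the file for the header rather than testing only the first bytes. The sketch below assumes a Node.js Buffer as input. The function name and the 1024-byte window are assumptions made for illustration, not values taken from this paper.

```javascript
// Report where a "%PDF-" marker first appears within the leading bytes.
// The 1024-byte window is an assumption, not a figure taken from the text.
function findPdfHeader(bytes, windowSize = 1024) {
  return bytes.subarray(0, windowSize).indexOf("%PDF-");
}

// Hypothetical inputs: a well-formed file, one with leading junk, and a non-PDF.
const wellFormed = Buffer.from("%PDF-1.4\n1 0 obj\n", "latin1");
const shifted = Buffer.from("junk before header\n%PDF-1.4\n", "latin1");
const notPdf = Buffer.from("MZ not a PDF at all", "latin1");
```

A result of -1 means no header was found in the window. Any other value is the offset at which parsing should begin. The cost is the one the text describes: every candidate file needs a search instead of a four-byte comparison.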
Cross-reference Table
Many legitimate PDF files contain cross-references. The specification contains the following text:
The cross-reference table contains information that permits random access to indirect objects within the file so that
the entire file need not be read to locate any particular object.
It was thought that random access to PDF objects did not present a problem for AV scanners but the discovery
of malicious PDFs with invalid offsets changed this perception. While PDF readers can parse PDF files from the
beginning even if invalid offsets are present, this is a problem for those developing AV scanners, as it may be
necessary to parse an entire file to ascertain whether or not it contains malicious code.
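One way to cope with unreliable offsets is to ignore the cross-reference table and locate object markers by scanning the file from the beginning. The following is a rough sketch under that assumption. The function name and the sample file are hypothetical, and a production parser would need to handle many cases that this regular expression does not.

```javascript
// Locate every "N G obj" marker by scanning the whole file, ignoring the
// offsets recorded in the cross-reference table.
function scanObjects(pdfText) {
  const found = [];
  const marker = /(\d+)\s+(\d+)\s+obj\b/g;
  for (const match of pdfText.matchAll(marker)) {
    found.push({
      number: Number(match[1]),
      generation: Number(match[2]),
      offset: match.index,
    });
  }
  return found;
}

// Hypothetical file whose cross-reference entries point nowhere useful.
const sample = [
  "%PDF-1.4",
  "1 0 obj << /Type /Catalog >> endobj",
  "2 0 obj << /S /JavaScript >> endobj",
  "xref",
  "0 3",
  "0000000000 65535 f ",
  "0000009999 00000 n ", // deliberately wrong offset
  "0000009999 00000 n ", // deliberately wrong offset
  "trailer << /Root 1 0 R /Size 3 >>",
].join("\n");

const objects = scanObjects(sample);
```

The scan finds both objects despite the bogus offsets, but it has to read the entire file to do so, which is exactly the performance cost described above.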
Stream Filters
In order to encapsulate large objects such as images, font data, and so on, the PDF format supports the inclusion
of stream data with encoding and/or compression, as seen in figure 11.
To date, PDF malware has been found using the following
Figure 11
Combining Filters
Initially only malware that made use of one filter type was found. The FlateDecode filter decompresses data
that has been compressed using the zlib/deflate algorithm. This method can both shorten file length and hide
JavaScript content, and minor changes to the code will result in major differences in the compressed version.
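Both effects can be observed with Node.js's zlib module, which implements the same zlib/deflate format. The scripts below are harmless placeholders that differ by a single character, and the variable names are illustrative.

```javascript
const zlib = require("zlib");

// Benign placeholder scripts that differ by a single character.
const scriptA = "app.alert('hello world 1');\n".repeat(8);
const scriptB = "app.alert('hello world 2');\n".repeat(8);

// What a PDF writer does when it stores a stream with the FlateDecode filter.
const storedA = zlib.deflateSync(Buffer.from(scriptA, "latin1"));
const storedB = zlib.deflateSync(Buffer.from(scriptB, "latin1"));

// What a scanner must do before it can inspect the script text.
const recoveredA = zlib.inflateSync(storedA).toString("latin1");

// The plain text is not visible in the compressed bytes.
const visibleInStored = storedA.includes("app.alert");
const sameBytes = storedA.equals(storedB);
```

A signature written against the text of scriptA matches neither stored form, so decompression has to happen before any pattern matching.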
Stream Length
Each stream includes a length field that holds the number of bytes that comprise the stream. PDF reader ap-
plications may read the stream contents based on this field, although if it is missing they may use the stream
and endstream keywords that mark the beginning and end of a stream respectively. Adobe Reader is able to read
stream content even if the length field is incorrect.
Malicious PDFs often have invalid length values, although this may not be intentional on the malware authors’
parts; many such files are dynamically produced using server-side polymorphic techniques when PDF
vulnerabilities are targeted for drive-by downloads. The malicious JavaScript is modified whereas the other PDF
content is not, which results in an invalid length field.
Figure 13
Invalid “length” value
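A tolerant reader can treat the length field as a hint and fall back on the keywords. The sketch below illustrates that approach on a text-only object. The function name and sample object are hypothetical, and the simple keyword search would need hardening before use on binary stream data.

```javascript
// Extract a stream body by its keywords, using /Length only as a hint.
function readStream(objectText) {
  const start = objectText.indexOf("stream");
  if (start === -1) return null;
  // The body starts after the end-of-line that follows the "stream" keyword.
  let bodyStart = start + "stream".length;
  if (objectText[bodyStart] === "\r") bodyStart++;
  if (objectText[bodyStart] === "\n") bodyStart++;
  const end = objectText.indexOf("endstream", bodyStart);
  if (end === -1) return null;
  // Drop the end-of-line that precedes the "endstream" keyword.
  const body = objectText.slice(bodyStart, end).replace(/\r?\n$/, "");
  const declared = /\/Length\s+(\d+)/.exec(objectText);
  const declaredLength = declared ? Number(declared[1]) : null;
  return { body, declaredLength, lengthMatches: declaredLength === body.length };
}

// Hypothetical object whose declared length (3) does not match its body.
const object =
  "5 0 obj\n<< /Length 3 >>\nstream\napp.alert(1);\nendstream\nendobj";
const parsed = readStream(object);
```

The mismatch flag is worth recording, as a wrong length is common in the generated files described above, though it is not proof of malice on its own.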
Endstream or Endstrebm?
As described above, stream and endstream respectively mark the beginning and end of a stream. In September
2008 a sample was found that used endstrebm instead of endstream. The stream contained malicious JavaScript
and was compressed using zlib. Perhaps surprisingly Adobe Reader was able to recognize the stream data and
perform the decompression before falling foul of the exploit contained within.
Case Sensitivity
According to the PDF specification, all entries such as “/Type” or “/Action” should be identified in a case-sensi-
tive manner:
PDF is case-sensitive; corresponding uppercase and lowercase letters shall be considered distinct.
In January 2010, however, some samples were discovered in which such entries appeared in a seemingly random
mixture of upper- and lower-case characters. Any PDF parser that identified these entries in a case-sensitive
manner, in accordance with the specification, required an update to use case-insensitive identification.
Figure 14 shows an example taken from the sample.
Figure 14
Variations in case
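A scanner-side check along these lines might compare name tokens after folding case. The following sketch uses a hypothetical dictionary written in mixed case. It is not the content of figure 14, and the function name is illustrative.

```javascript
// Compare PDF name tokens without regard to case.
function hasName(dictionaryText, name) {
  const names = dictionaryText.match(/\/[^\s\/<>\[\]()]+/g) || [];
  return names.some((token) => token.toLowerCase() === name.toLowerCase());
}

// Hypothetical dictionary written in mixed case.
const dictionary = "<< /tYpE /aCtIoN /s /jAvAsCrIpT /JS 12 0 R >>";

const strictMatch = dictionary.includes("/JavaScript"); // case-sensitive
const tolerantMatch = hasName(dictionary, "/JavaScript"); // case-insensitive
```

The case-sensitive test misses the entry entirely, while the tolerant comparison finds it, which is the update the text says parsers required.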
Encryption
The PDF format has supported encryption since version 1.1. Once a PDF is encrypted, all strings and streams in
the file will be in ciphertext. Either the RC4 or AES algorithm may be used, and two forms of password are avail-
able: user password and owner password.
User passwords are mainly used as a way to prevent PDF content from being displayed, whereas owner pass-
words are used to prevent content modification. Users must enter the correct password to perform either opera-
tion, which provides document creators control over their work. Empty strings are acceptable passwords and as
such an encrypted PDF with an empty password string may be displayed in any reader.
Having the ability to encrypt PDFs would initially seem to be something that could be leveraged by malware
authors to evade detection by AV scanners, but most malicious PDFs are served via the Web and aim to down-
load files without users’ knowledge of what is happening. With stealth being of primary importance, malware
authors typically only have two options: a plain PDF file or an encrypted PDF file with no password. The latter
option leaves AV vendors at a disadvantage performance-wise as any such PDF files must be decrypted before
the content is scanned.
Fragmental JavaScript
A JavaScript object in a PDF file can be split up or fragmented using the name dictionary. The original function-
ality of the name dictionary was to allow an object to be referred to by name rather than by object reference. A
number of different object types can be referred to in this way. An entry can be set as follows:
The above text defines a reference to object 35, named “name1”. The name dictionary can contain
multiple entries, each of which defines a similar relationship; the name “name2” is associated with object
36 in the example above.
When the document is opened, all of the actions in this name tree shall be executed, defining JavaScript functions for
use by other scripts in the document.
This passage from the PDF specification describes the mechanism through which fragmental JavaScript objects
can be executed when a PDF file is opened. JavaScript fragments must be gathered and evaluated together. Each
JavaScript fragment may additionally be compressed or encrypted, which means that AV scanners must perform
the inverse of these operations in order to check for malicious content.
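The gathering step can be sketched as follows. The name-tree text mirrors the name1/35 and name2/36 pairing described above, but its exact syntax, the object bodies, and the function name are assumptions made for illustration. The fragments are assumed to be already decompressed and decrypted.

```javascript
// Hypothetical name-tree entries and object bodies; none come from a sample.
const nameTree = "/Names [(name1) 35 0 R (name2) 36 0 R]";
const objectTable = {
  35: "function greet(who) { return 'hello ' + who; }",
  36: "var message = greet('world');",
};

// Resolve each (name, reference) pair in the order the name tree lists them.
function gatherFragments(tree, table) {
  const entries = [...tree.matchAll(/\((\w+)\)\s+(\d+)\s+\d+\s+R/g)];
  return entries.map((entry) => ({
    name: entry[1],
    objectNumber: Number(entry[2]),
    script: table[Number(entry[2])] || "",
  }));
}

const fragments = gatherFragments(nameTree, objectTable);

// A scanner would inspect the fragments as one script, not one by one.
const combined = fragments.map((fragment) => fragment.script).join("\n");
```

Neither fragment is meaningful alone in this example: the second calls a function that only the first defines, so scanning them separately would miss the relationship.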
The first sample that used this fragmental JavaScript technique was discovered in August 2009. An excerpt
from the file appears in figure 15. Two JavaScript objects appear in the file, one to perform heap-spraying
(figure 16) and the other to deliver the exploit and decrypt the shellcode (figure 17).
Figure 15
Fragmental JavaScript
Figure 16
Figure 17
Shell Decryption
Malicious JavaScript can be split up in a PDF file with the malicious code body being placed inside a PDF object
or objects. This may be further encrypted JavaScript, shellcode, or any other malicious code. The JavaScript that
makes use of this technique appears to be legitimate because the main malicious code body is not visible on the
surface; only PDF object references are evident. In order to scan for malicious code, AV vendors must develop
scanners that gather together all related objects and reconstruct them. This increases the time and amount of
memory it takes to scan a particular file.
Use of this.getField()
The PDF JavaScript API has a built-in function called getField(), the main purpose of which is to retrieve data
from the Field object of an individual widget, as in the following example:
This example shows how JavaScript can retrieve user input from a text entry widget.
In November 2008 a sample was discovered that hides a segment of code in a Field object and later uses
getField() to retrieve it. The hidden JavaScript code is packed as escaped characters but can be executed by using
unescape() and eval(), as detailed earlier in this document (see figure 7). The example in figure 18 shows the use
of getField() in the sample, in which getField() takes the string “data” as an argument.
Figure 18
Use of getField()
Figure 19 shows the target object referenced in the example above. The type of the object is “/Widget” and its
text label is “data”, which matches the string used in the JavaScript object. The malicious JavaScript content ex-
ists in the “/DV” entry as a string, and clearly is a string of escaped characters.
Figure 19
Field widget
Use of app.doc.getAnnots()
The app.doc.getAnnots()
Figure 20
A sample making use of this technique was discovered in November 2009. A 70448-byte string masquerading
as the document title was present in the file; this string contained obfuscated and escaped JavaScript code. To
execute the hidden code the threat retrieves the title string, replaces all instances of “j866p886a39” with “%”
and then uses the now-familiar unescape() and eval() operations on the document “title”. An excerpt from the
code appears in figure 23.
Figure 23
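The recovery steps can be reproduced safely on a stand-in. In the sketch below the “title” is a hypothetical string built from a harmless one-line script, with the marker string quoted above taking the place of each “%”. The final eval() step is deliberately omitted.

```javascript
// Hypothetical "title": a benign script with every "%" swapped for the marker
// string quoted in the text. It decodes to app.alert(1), not to an exploit.
const marker = "j866p886a39";
const plain = "app.alert(1)";
const title = [...plain]
  .map((ch) => marker + ch.charCodeAt(0).toString(16).padStart(2, "0"))
  .join("");

// The recovery steps described above, stopping short of eval().
const escaped = title.split(marker).join("%");
const script = unescape(escaped);
```

The stored title contains no “%” characters and no readable JavaScript, so a scanner that inspects metadata strings as-is would find nothing to match.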
Conclusion
PDF-based malware can harbor malicious JavaScript in a similar manner to how it may exist on the Web, but the
features and specification of the PDF file format mean that a number of additional tricks are available to the mal-
ware author. The cat-and-mouse game between AV vendors and malware authors continues; the complexity and
flexibility of the PDF file format mean that malware authors are continually pushing the envelope and as such AV
vendors must continue to improve and refine their PDF parsing technology.
The possibility of false positives exists as a result of toolkits that may be used to craft both legitimate and mali-
cious PDFs alike. It is crucial for AV vendors to exercise caution when adding definitions so as to avoid the disrup-
tion that may be caused when a legitimate file is falsely convicted.
Some good news is that Adobe introduced sandboxing functionality into Reader during 2010. This may help
to contain even malware that uses new or previously unknown techniques. Sandboxing technology is not the per-
fect solution to all problems however, and time will tell how successful such an approach may be; the introduc-
tion of such sandbox technology may also bring with it new vulnerabilities to be exploited. For now it is essential
to keep software patches and virus definitions up-to-date and for antivirus vendors to strive to keep pace with
the tricks and techniques deployed by the malware authors.
Bibliography
1. Adobe, “PDF Reference and Adobe Extensions to the PDF Specification.” http://www.adobe.com/devnet/pdf/pdf_reference.html
2. Ecma International, “Standard ECMA-262: ECMAScript Language Specification.”
3. RFC 3548, “The Base16, Base32, and Base64 Data Encodings.” http://tools.ietf.org/html/rfc3548
Security Response
Any technical information that is made available by Symantec Corporation is the copyrighted work of Symantec
Corporation and is owned by Symantec Corporation.
NO WARRANTY. The technical information is being delivered to you as is and Symantec Corporation makes no
warranty as to its accuracy or use. Any use of the technical documentation or the information contained herein is
at the risk of the user. Documentation may include technical or other inaccuracies or typographical errors.
Symantec reserves the right to make changes without prior notice.
About the author
Kazumasa Itabashi is a Principal Software Engineer at Symantec Security Response in Tokyo specializing in PDF
malware.
About Symantec
Symantec is a global leader in providing security, storage and systems management solutions to help businesses
and consumers secure and manage their information. Headquartered in Mountain View, Calif., Symantec has
operations in more than 40 countries. More information is available at www.symantec.com.
For specific country offices and contact numbers, please visit our Web site. For product information in the U.S.,
call toll-free 1 (800) 745 6054.
Symantec Corporation
World Headquarters
350 Ellis Street
Mountain View, CA 94043 USA
+1 (650) 527-8000
www.symantec.com
Copyright © 2010 Symantec Corporation. All rights reserved. Symantec and the Symantec logo are trademarks or
registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names may
be trademarks of their respective owners.