Preface: I wish to acknowledge that this article was written with full reference to http://www.adobe.com/devnet/acrobat/pdfs/PDF32000
Surprisingly, it is easy and interesting to read! I am writing this tutorial out of my interest in knowing the PDF specification. My quest star
(steve@printmyfolders.com) if you find any errors. I have relied on the PDF specification (link on page top) to create this tutorial. This tut
Imagine you are alone on an island with no internet and no means of communication apart from a phone on which you can only make voice call
will arrive in a week's time to take you home. All of a sudden, you remember that an important document, currently with you, has to be s
your printer?
Here is one way of doing it. You would grab a ruler, measure the width and height of your page and note it down. You will measure the e
give her all these details. Provided you have given her enough details she should be able to replicate your document exactly as it is.
PDF works in a similar way. We give instructions so that a PDF viewing app, such as Adobe Reader, can understand our instructions an
Keep this imagery in mind and it will help you understand better.
PDF files are interesting. If you were to open a PDF file in a text editor like Notepad, the contents may look like junk and probably not ve
"At the core of PDF is an advanced imaging model derived from the PostScript® page description language. This PDF Imaging Model en
interactive viewing, PDF defines a more structured format than that used by most PostScript language programs. Unlike PostScript, whic
PDF also includes objects, such as annotations and hypertext links, that are not part of the page content itself but are useful for interactiv
The basic building blocks of a PDF file are objects. There are eight types of objects that are used in PDF files. Before we look at them we
White-space characters: Null, Horizontal tab, Line feed, Form feed, Carriage return and Space. Tip: If you need to remember them, imag
and other objects from each other. Interestingly, PDF treats all white-space characters outside a comment, string or stream the same. Ou
is that you may have 5 spaces but in reality it is considered as one. Note that this does not apply to white-space characters within strings
role in showing where a new line starts. Carriage return followed immediately by a Line feed is considered as one EOL marker.
Delimiter characters: (, ), <, >, [, ], {, }, / and % (4 pairs and 2 unique). These are used in the objects we will look at later. They basically
and then stretched [ ] and then bent again { } and then eventually made flat / %.
Regular characters: All characters other than White-space and Delimiter characters including those that are not part of the standard AS
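The three character classes can be sketched in a few lines of Python. This is only an illustration: the function name classify is my own invention, and the byte values come from the lists above (Null is 0, Horizontal tab 9, Line feed 10, Form feed 12, Carriage return 13, Space 32).

```python
# The six white-space characters and the ten delimiter characters listed above.
WHITESPACE = frozenset(b"\x00\t\n\x0c\r ")
DELIMITERS = frozenset(b"()<>[]{}/%")


def classify(byte: int) -> str:
    """Return the PDF character class of a single byte value (0-255)."""
    if byte in WHITESPACE:
        return "white-space"
    if byte in DELIMITERS:
        return "delimiter"
    # Everything else, including bytes above 127, is a regular character.
    return "regular"


for ch in b"/A \x00":
    print(ch, classify(ch))
```

Because anything that is not white-space or a delimiter is regular, the function needs no third table.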
An interesting fact to note is that a PDF may consist entirely of just ASCII characters or can consist of ASCII characters and Binary data. In
This allows a possibility of 128 unique characters for ASCII files and 256 unique characters for Binary files. Most PDF files that are encrypt
or even opened and saved in normal text editors like Notepad. It is critical that we understand the difference between ASCII and Binary f
files. http://www.cs.umd.edu/class/spring2003/cmsc311/Notes/BitOp/asciiBin.html
You may also wonder why you don't see any text (that can be seen when opened using a reader) or its equivalent when opening a PDF f
text is stored/kept) is encoded (transformed/changed) to conserve space. This is what happens with most files. The second reason could
Objects:
Here are the objects that make use of the characters we looked at above.
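Each of these objects has a corresponding class in PDFBox. Have a look at this link https://pdfbox.apache.org/1.8/architecture.htm
1. Boolean objects: These represent the logical values true and false and are written using the keywords true and false.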
2. Numeric objects: There are two types; integer & real. Integers are numbers without any decimal points and can have a + or – symbol
3. String Objects: Strings contain characters (can be zero characters as well). They can be literal characters within parentheses or hexadec
There are escape characters that can be used. Refer to the PDF specification for more details. The sequence \ddd where ddd is an octal c
especially the ones that cause characters to move, for instance \n (newline), did not have any visual effects when I added them to a string
We can also use octal characters (usually to represent characters outside the printable ASCII character set) when using parentheses. An oc
<48454C4C4F>
Each pair is taken as a value. In the above example, the hexadecimal value 48 is decimal 72 which is the ASCII equivalent of H. Likewise 4
<4845C>
will be considered as
<4845C0>
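Here is a small Python sketch of how a hexadecimal string could be decoded, including the rule that a missing final digit is taken to be 0. The function name decode_hex_string is hypothetical. Treat this as an illustration, not as how any particular PDF reader does it.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<48454C4C4F>'."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("a hexadecimal string must be wrapped in < and >")
    # White-space between the digits is ignored.
    digits = "".join(token[1:-1].split())
    # An odd number of digits means the final digit is assumed to be 0.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


print(decode_hex_string("<48454C4C4F>"))  # b'HELLO'
print(decode_hex_string("<4845C>"))       # b'HE\xc0'
```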
4. Name Objects: Names consist of a sequence of characters (except null). A forward slash / must be used to introduce the name. In case
followed by the hexadecimal value. All characters that are not regular characters have to be represented by the # followed by their hexad
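The # rule for names can be sketched as follows. The function name decode_name is my own, and the sample name /Lime#20Green is only an illustration (hexadecimal 20 is the space character).

```python
import re


def decode_name(token: bytes) -> bytes:
    """Turn a name token such as b'/Lime#20Green' into the bytes it stands for."""
    if not token.startswith(b"/"):
        raise ValueError("a name must be introduced by a forward slash")
    # Replace every #xx (two hexadecimal digits) with the byte it encodes.
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        token[1:],
    )


print(decode_name(b"/Lime#20Green"))  # b'Lime Green'
```

The leading slash is dropped because it only introduces the name and is not part of it.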
5. Array Objects: These are similar to the arrays found in computer languages but differ in that they can contain different object types (in
spaces.
Arrays are one-dimensional, but an array can contain other arrays, which can in turn contain arrays themselves.
6. Dictionary Objects:
These are similar to an actual dictionary, where a description follows a word. The description here can be any object (including another d
/Age 99
/PostCode (4321)
>>
>>
7. Stream Objects: Streams are similar to strings except that they can be of unlimited length. They are usually used to represent large data th
represented as a 'Contents' stream. It consists of a dictionary followed by the keyword ‘stream’, newline, the stream’s data and the k
dictionary
stream
……….
endstream
While the stream consists of the data, the dictionary contains information about the stream itself. Here are the keys that are common for
Length - Mandatory entry that contains the length (number of bytes) of the stream. An error occurs if the stream has more byte
Filter - The name(s) of the filter(s) used to decode the data on the stream. Can be a name or an array. The filters are used in the o
DecodeParms - Parameter dictionary used by the filters. Can be a dictionary or array. Parameters are values used by the filters. I
F - From PDF specification version 1.2, the stream contents can be stored in an external file. This shows the file where it is stored
FFilter - Similar to Filter entry but for the stream's external file.
DL - An approximate size of the contents (after decoding) in the stream. This will help determine if there is enough disk space fo
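To see how Length and Filter work together, here is a rough Python sketch that writes a stream and reads it back. It assumes the FlateDecode filter (which matches Python's zlib module) and assumes that Length is written directly as a number in the dictionary. In real files Length may instead be an indirect reference. The function names make_stream and read_stream are made up for this example.

```python
import re
import zlib


def make_stream(data: bytes) -> bytes:
    """Build the dictionary, the stream keyword, the data and endstream."""
    body = zlib.compress(data)  # encoded form, as FlateDecode expects
    header = b"<< /Length %d /Filter /FlateDecode >>" % len(body)
    return header + b"\nstream\n" + body + b"\nendstream"


def read_stream(obj: bytes) -> bytes:
    """Use /Length to cut out the encoded bytes, then decode them."""
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = obj.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(obj[start:start + length])


obj = make_stream(b"BT (Hello) Tj ET")
print(read_stream(obj))  # b'BT (Hello) Tj ET'
```

Note that Length counts the encoded bytes between stream and endstream, not the size of the decoded data.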
8. Null object - Refers to a nonexistent object and is denoted by the keyword null.
Comments: The comment is represented by the percentage sign i.e. %. This is commonly used to describe the version of PDF specificatio
%PDF-1.7
Indirect Objects:
An object (for example, a string) that has been given a unique object identifier, which other objects can use to refer to it, is called an ind
opening it in a text editor) you will notice lots of indirect objects.
The identifier has two parts. The first part is the object number that can be any positive integer. The second part is the generation numbe
and ’endobj’. Be aware that the combination of the object number and generation number has to be unique.
1 0 obj
(my biography)
endobj
The object number is 1 and the generation number is 0. The indirect object is the string object (my biography). If another object wanted
1 0 R
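The relationship between an indirect object and a reference to it can be sketched in Python. This is a simplification that only works on small, unencoded examples like the one above. The names index_objects and resolve are hypothetical.

```python
import re

# "<object number> <generation number> obj ... endobj"
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\s(.*?)\sendobj", re.S)
# "<object number> <generation number> R"
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def index_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the object's raw bytes."""
    return {
        (int(num), int(gen)): body.strip()
        for num, gen, body in OBJ_RE.findall(data)
    }


def resolve(reference: bytes, objects: dict) -> bytes:
    """Look up the object that a reference such as b'1 0 R' points to."""
    match = REF_RE.fullmatch(reference.strip())
    if match is None:
        raise ValueError("not an indirect reference")
    return objects[(int(match.group(1)), int(match.group(2)))]


objects = index_objects(b"1 0 obj\n(my biography)\nendobj\n")
print(resolve(b"1 0 R", objects))  # b'(my biography)'
```

The dictionary key is the pair of numbers, which is why the combination of object number and generation number has to be unique.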
File Structure: A PDF file will initially have the following structure. However, if the file is updated or edited, additional elements may be a
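1. A Header (not more than a line) showing the PDF specification this file follows.
2. A Body containing the objects that make up the document.
3. A Cross-Reference Table giving the location of each indirect object within the file.
4. A Trailer that shows the location of the cross-reference table and special objects within the body of the file.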
File Header: This denotes the PDF specification version of the PDF file. %PDF- followed by version number 1.N, where N is a digit betwee
%PDF-1.2
As mentioned earlier this is actually a comment that is used to specify the PDF version.
Beginning with version 1.4 the document's catalog dictionary is used instead of this. If the file has binary data, there will be at least four b
Again, when opening some PDF files in their raw form (as in a text editor) you may notice the four binary characters just after the first co
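As a quick illustration, the first two lines of a file could be checked like this in Python. The function names are my own. The binary check follows the specification's rule that the comment line after the header should contain at least four characters with codes of 128 or greater.

```python
import re


def pdf_version(first_line: bytes) -> tuple:
    """Read the version out of a header line such as b'%PDF-1.2'."""
    match = re.match(rb"%PDF-(\d)\.(\d)", first_line)
    if match is None:
        raise ValueError("not a PDF header")
    return int(match.group(1)), int(match.group(2))


def has_binary_marker(second_line: bytes) -> bool:
    """True if the line is a comment holding at least four bytes >= 128."""
    if not second_line.startswith(b"%"):
        return False
    return sum(byte >= 128 for byte in second_line[1:]) >= 4


print(pdf_version(b"%PDF-1.2\n"))               # (1, 2)
print(has_binary_marker(b"%\xe2\xe3\xcf\xd3\n"))  # True
```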
File Body: The File body consists of indirect objects (discussed earlier). These objects represent text and other details (like font type etc.
Cross-Reference Table: This table is similar to a directory. It contains the location of each object within the PDF file. By looking at the en
object is accessed in a random manner (rather than reading every line of the file). The cross-reference table can have one or more sectio
Note: From PDF version 1.5 onward the cross-reference table can be stored as a stream and if so you will not be able to view the table as
Each section begins with the word 'xref'. Following this line are two numbers separated by a single space. The first number is the object n
file that has been created for the first time or a PDF file that has not been incrementally updated, there shall be only one subsection and
xref
0 5
Following this are the entries for each object. Each entry shall be exactly 20 bytes long. The entries are of the format
nnnnnnnnnn ggggg m eol
nnnnnnnnnn - This is a ten digit value. This reveals how far the object is from the start of the file. For instance, the value
ggggg - This is a five digit value that holds the generation number of the object.
m - can be either 'n' or 'f'. 'n' denotes that the object is still in use and 'f' denotes that the object has been deleted and is
The ten digits, followed by space, followed by five digits, followed by space, followed by a single character and the eol make exactly 20 d
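The fixed 20-byte layout makes a single entry easy to take apart. Here is a hedged Python sketch. The function name parse_xref_entry is made up, and the eol is assumed to be two bytes long (a space followed by a line feed in the sample below).

```python
def parse_xref_entry(entry: bytes) -> tuple:
    """Split one 20-byte cross-reference entry into its three fields."""
    if len(entry) != 20:
        raise ValueError("a cross-reference entry must be exactly 20 bytes")
    offset = int(entry[0:10])        # ten digits
    generation = int(entry[11:16])   # five digits, after one space
    kind = entry[17:18].decode("ascii")  # single character, after one space
    if kind not in ("n", "f"):
        raise ValueError("the last field must be 'n' or 'f'")
    return offset, generation, kind


print(parse_xref_entry(b"0000000015 00000 n \n"))  # (15, 0, 'n')
print(parse_xref_entry(b"0000000000 65535 f \n"))  # (0, 65535, 'f')
```

Because every entry has the same length, a reader can jump straight to the entry for any object number without reading the entries before it.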
Let's come back to the 0 5 that we saw in the example earlier. The 0 denotes the object number of the first object in this subsection. The v
3 and 4. The first entry at the cross-reference table is for object 0. Object 0 will have 0000000000 as its first ten digits (if there are no othe
xref
0 2
0000000000 65535 f
...........
If there are object(s) that have been deleted and are free then the ten digit number will be changed to denote the nth entry of the next fr
xref
0 4
0000000003 65535 f
0000000015 00000 n
0000000075 00000 n
0000000000 00005 f
The 0 4 denotes that there are four entries - Entry for object 0 followed by entries for objects 1, 2 & 3. The first ten digits (0000000003) o
have the object number of the next free object. In this case, as there are no other free objects it points back to object 0. Objects 1 & 2 are
xref
0 4
0000000003 65535 f
0000000015 00000 n
0000000075 00000 n
0000000000 00005 f
9 2
0000000099 00000 n
0000000150 00000 n
In the above case, in addition to objects 0, 1, 2, 3 there are two other objects 9 & 10.
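The way subsections assign object numbers can be sketched in Python. The parser below is a simplification that works on text, not on raw 20-byte entries, and the name parse_xref is hypothetical. The sample is the table shown above, with its two subsection headers 0 4 and 9 2.

```python
def parse_xref(text: str) -> dict:
    """Map each object number to (offset or next free object, generation, n/f)."""
    lines = text.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("a section must begin with the keyword xref")
    table = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number, then number of entries.
        first, count = map(int, lines[i].split())
        for k in range(count):
            value, generation, kind = lines[i + 1 + k].split()
            table[first + k] = (int(value), int(generation), kind)
        i += 1 + count
    return table


SAMPLE = """xref
0 4
0000000003 65535 f
0000000015 00000 n
0000000075 00000 n
0000000000 00005 f
9 2
0000000099 00000 n
0000000150 00000 n
"""

table = parse_xref(SAMPLE)
print(sorted(table))  # [0, 1, 2, 3, 9, 10]
print(table[10])      # (150, 0, 'n')
```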
When an indirect object is deleted, its entry is marked as free (by changing the n to f) and added to the linked list of free objects. Its gen
maximum of 65,535. For instance, an indirect object that was referenced as 1 0 will become 1 1 when reused.
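The linked list of free objects can be followed with a short loop. In this Python sketch the table is written out by hand from the earlier example (object 0 points to object 3, which points back to object 0). The function name free_chain is my own.

```python
def free_chain(table: dict) -> list:
    """Follow the linked list of free entries, starting at object 0."""
    chain = []
    current = 0
    while True:
        chain.append(current)
        # For a free entry the first field is the next free object number.
        current = table[current][0]
        if current == 0:
            return chain  # the list has looped back to object 0
        if current in chain:
            raise ValueError("free list loops without returning to object 0")


# object number: (first ten digits, generation number, n or f)
table = {
    0: (3, 65535, "f"),
    1: (15, 0, "n"),
    2: (75, 0, "n"),
    3: (0, 5, "f"),
}
print(free_chain(table))  # [0, 3]
```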
Trailer: The end of a PDF file is read first by the PDF reading application. The trailer holds information about the location and details of th
certain fields.
The second part has the keyword startxref, and in the next line, a number. The number denotes how far (in bytes) the keyword xref (of th
A random PDF taken from my computer has this trailer. Looking at this trailer I can assume that the xref of the last section of the cross-re
trailer
/ID [(X$X@>66...)(X$X@>66...)]
>>
startxref
361441
%%EOF
Size - Total number of entries in the cross reference table (combination of original & update sections). The value has to be an integer. In
Root - Is an indirect reference to the PDF's catalog (which we will learn later). In the example above, I can assume that the indirect object
Some keys are mandatory when certain capabilities are used. We may look at the Info & ID keys later.
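Since the end of the file is read first, a reader's first step might look something like the Python sketch below. It assumes that startxref and %%EOF sit within the last 1024 bytes of the file, which is my own simplification, and the function name find_startxref is hypothetical.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last startxref keyword."""
    tail = data[-1024:]  # only look near the end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return int(matches[-1])


sample = b"...rest of the file...\ntrailer\n<< >>\nstartxref\n361441\n%%EOF\n"
print(find_startxref(sample))  # 361441
```

Taking the last match matters because an updated file can contain more than one startxref.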
Incremental Updates: The team that developed the PDF specification was smart enough to include a special feature. When a PDF gets u
need not be modified). However, this also makes me wonder what happens when many changes take place. Does the size of the PDF file
When a PDF gets an incremental update, in addition to the data being added, a new cross-reference section is created. This new section
of the cross-reference entry. This means that if, say, object 5 existed before and was deleted during the update, the new cross-reference section will
When the PDF file gets updated, along with a new cross-reference section a new trailer is added. This contains all the entries from the pr
cross-reference section.
%%EOF will continue to be the last line for the new trailer as well. Hopefully we will discuss this in detail later.
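One way to picture an incremental update is as a newer table laid over an older one. The Python sketch below assumes each cross-reference section has already been read into a dictionary. The numbers are invented for illustration, and effective_table is a hypothetical name.

```python
def effective_table(sections: list) -> dict:
    """Combine cross-reference sections, oldest first; newer entries win."""
    merged = {}
    for section in sections:
        merged.update(section)
    return merged


# object number: (offset or next free object, generation number, n or f)
original = {0: (0, 65535, "f"), 4: (120, 0, "n"), 5: (200, 0, "n")}
# The update deletes object 5: it is marked free and object 0 now points to it.
update = {0: (5, 65535, "f"), 5: (0, 1, "f")}

table = effective_table([original, update])
print(table[5])  # (0, 1, 'f')
print(table[4])  # (120, 0, 'n')
```

Object 4 was not touched by the update, so its original entry is still the one in force.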
The structure of a PDF file is like the different levels of hierarchy found in a typical company. Similar to the CEO, the Document Catalog d
As we saw earlier a PDF reading application will look at the trailer of the PDF first. The trailer will have a Root entry that has the location o
details via the contact section (Trailer) on the company's website (PDF file).
Document Catalog: The Document Catalog is a dictionary that refers to other objects that define the PDF file. Basically, the Document C
will for the time being only look at the mandatory keys.
Pages - An indirect reference to the object that is the root of the page tree (will look at this later)
A PDF file that I created using a free PDF creating software has this Catalog Dictionary
1 0 obj
<</Type /Catalog /Pages 3 0 R
>>
endobj
You will notice that each of these dictionaries always starts with a '/Type' entry that describes what type of dictionary it is. In this case, it is
An application that reads the above Catalog dictionary will know that it needs to read the 'Pages' dictionary (indirect object 3) to get info
Page Tree: Page Tree is the name of the structure used to describe the pages in a PDF file. It has two types of nodes - page tree nodes a
Parent - the page tree node which is this node's parent. Not allowed in root node.
Kids - an array referring to the children of this node. The children can only be page tree nodes or page objects
Count - the number of page objects that are descendants of this node
The PDF that I had created earlier has this page tree (remember that the Catalog Dictionary was pointing to indirect object 3).
3 0 obj
<< /Type /Pages /Kids [
4 0 R
] /Count 1
/Rotate 0>>
endobj
This Page tree node has only one kid which is object 4. The Parent key is missing and therefore this is the root node.
As the /Count is 1, we can safely assume that there is only 1 page under this Page tree (which based on the /Kids array is indirect object 4
As mentioned earlier, you will notice that this dictionary too has an entry '/Type' that reveals what type of dictionary it is.
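Walking the page tree is a natural fit for recursion. In the Python sketch below the two objects are written as plain dictionaries, with indirect references replaced by bare object numbers, so this is a model of the structure and not a parser. The name collect_pages is my own.

```python
def collect_pages(objects: dict, number: int) -> list:
    """Return the object numbers of all page objects under a node, in order."""
    node = objects[number]
    if node["Type"] == "Page":
        return [number]  # a leaf: the page object itself
    pages = []
    for kid in node["Kids"]:  # a page tree node: visit each child
        pages.extend(collect_pages(objects, kid))
    return pages


objects = {
    3: {"Type": "Pages", "Kids": [4], "Count": 1},
    4: {"Type": "Page", "Parent": 3},
}
pages = collect_pages(objects, 3)
print(pages)  # [4]
print(len(pages) == objects[3]["Count"])  # True
```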
Page Objects: This is a dictionary that describes the characteristics of the page itself. Some of the keys are
Note: Most of the keys are new to me. I have purposefully left out keys that make no sense to me at this moment. As I learn more about
LastModified - Date and time when this page was last modified
Resources - The resources required by this page. This usually refers to the font used on this page and other info.
MediaBox - A rectangle that defines the boundary inside which the page has to be displayed.
Rotate - In multiples of 90. Rotates the page by the number of degrees before displaying.
Thumb - A stream object that gives the thumbnail image for this page.
Dur - the number of seconds the page will be displayed in presentations before automatically moving on to the next page.
Trans - A dictionary advising what transition to use when displaying the page during presentation.
Annots - This is an array of dictionaries containing references to all the annotations for this page
AA - This is the short form for additional-actions. This dictionary defines the actions that need to be taken when the file is open or closed
Here is a grab from a sample PDF that I created using a free PDF creating software.
4 0 obj
/Rotate 0/Parent 3 0 R
/Resources<</ProcSet[/PDF /Text]
/ExtGState 10 0 R
/Font 11 0 R
>>
/Contents 5 0 R
>>
endobj
3 0 obj
<< /Type /Pages /Kids [
4 0 R
] /Count 1
/Rotate 0>>
endobj
1 0 obj
<</Type /Catalog /Pages 3 0 R
>>
endobj
As you can see Object 1 is the catalog that directs the PDF reading application to the root of the page tree (Object 3). Object 3, the root
rotated (Rotate 0) and has Object 3 as its parent. Its 'resources' as well as its contents (Object 5) are included. Here is Object 5 from my f
As we had discussed earlier, the stream in this object starts with a dictionary that shows the length of the stream (which is stored in Obje
5 0 obj
stream
S-lE,.C.W.󰇒YGKM\vjEG.'|F[j.:..2f2Ź^.uujNWnjY::si/.L9,ČGPY1k/.%'f!endstream
endobj
Page attributes are inherited: Here is an interesting fact. Certain attributes in a page can be inherited from its parent or any of its ances
value for an attribute, that value can be replaced or changed by the child.
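Inheritance amounts to climbing the Parent links until a value is found. Here is a small Python sketch using plain dictionaries and made-up object numbers. The name lookup is hypothetical, and the sketch does not check whether the attribute is one of those that are allowed to be inherited.

```python
def lookup(objects: dict, number, key: str, default=None):
    """Find key on the page itself or, failing that, on its nearest ancestor."""
    while number is not None:
        node = objects[number]
        if key in node:
            return node[key]  # the closest value wins
        number = node.get("Parent")  # the root node has no Parent
    return default


objects = {
    3: {"Type": "Pages", "Kids": [4, 6], "Count": 2, "Rotate": 90},
    4: {"Type": "Page", "Parent": 3},               # inherits Rotate
    6: {"Type": "Page", "Parent": 3, "Rotate": 0},  # replaces it
}
print(lookup(objects, 4, "Rotate"))  # 90
print(lookup(objects, 6, "Rotate"))  # 0
```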
Name Dictionary: Rather than referring to the objects by their references, some objects can be referred to by their names. The link betw
used to specify the Name Dictionary. Please refer to the PDF specification for more details.
Content Streams: This is a stream (an object in PDF, if you remember) that has instructions on how to display text & graphics on the cor
5 0 obj
stream
S-lE,.C.W.󰇒YGKM\vjEG.'|F[j.:..2f2Ź^.uujNWnjY::si/.L9,ČGPY1k/.%'f!endstream
endobj
The data in the stream makes no sense because the data has been encoded (converted from its original form to another). In the followin
Before proceeding further we will try to create a simple PDF file from what we have learnt so far.
Sample PDF file: Here is a sample PDF file that I created with help from the specification. You can copy this file from here and save it in a
inclusive). You can then view it with a PDF reader (for instance using Acrobat Reader). Note: Not all PDF files are as simple as this. This PD
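If you would rather generate a small file than type one, here is a rough Python sketch that puts together the pieces covered so far: header, body, cross-reference table and trailer. It is not the sample file mentioned above. The object numbers, page size, font and text are my own choices, and the text must be plain ASCII without parentheses or backslashes.

```python
def build_pdf(text: str = "Hello") -> bytes:
    """Assemble a one-page PDF and compute the cross-reference offsets."""
    content = b"BT /F1 24 Tf 72 720 Td (" + text.encode("ascii") + b") Tj ET"
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")  # header
    offsets = []
    for number, body in enumerate(objects, start=1):  # body
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_position = len(out)  # cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # object 0 heads the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # exactly 20 bytes per entry

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


if __name__ == "__main__":
    with open("sample.pdf", "wb") as handle:
        handle.write(build_pdf("Hello"))
```

The offsets are recorded while the file is being built, so the cross-reference entries and the startxref value always agree with the bytes that were actually written.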
I love your feedback and suggestions. Please leave a comment below or contact me at steve@printmyfolders.com.