Testing PDFs with Python

Mon 26 January 2015

Overview
When you generate PDFs, you need a way to test their integrity—not only must they be valid,
but they should behave correctly and display consistently, even on different platforms. This
article describes how you can use the PyPDF2 library to test your PDF files for broken links
(both internal and external), and how to find fonts that are not embedded in the PDF.

Note that some PDF readers are ‘smart’ and will create a live hyperlink from a string of text
that looks like a URL, even though the text is not coded as a URL in the PDF file. The
technique described in this article does not address this issue—it only tests the actual URLs
present in the PDF file.

Code from this article is available as a gist on GitHub. Get a copy and play along.

Testing Internal and External Links


To test the links, we’ll create a function to check the URLs found inside the PDF file. This
function uses the requests library, which you can install with pip.

There may be some FTP links, which are not directly supported by the requests library. You
can either use the urllib package as in the following code, or you can use the requests-ftp
package, available on PyPI.

from PyPDF2 import PdfFileReader
from pprint import pprint
import requests
import sys
import urllib

The check_ftp function checks a given url for a response. If it fails, or if the response is
empty, it returns False, along with the reason; otherwise it returns True.

def check_ftp(url):
    try:
        response = urllib.urlopen(url)
    except IOError as e:
        result, reason = False, e
    else:
        if response.read():
            result, reason = True, 'okay'
        else:
            result, reason = False, 'Empty Page'
    return result, reason

The check_url function is also simple: if the URL starts with ftp://, it delegates to the
check_ftp function. Otherwise, it attempts to get the URL with a timeout value and typical
header values. The function then returns the result along with the reason it succeeded
or failed.

def check_url(url, auth=None):
    headers = {'User-Agent': 'Mozilla/5.0', 'Accept': '*/*'}
    if url.startswith('ftp://'):
        result, reason = check_ftp(url)
    else:
        try:
            response = requests.get(url, timeout=6, auth=auth,
                                    headers=headers)
        except (requests.ConnectionError,
                requests.HTTPError,
                requests.Timeout) as e:
            result, reason = False, e
        else:
            if not response.ok:
                # requests does not raise for 4xx/5xx responses by default,
                # so report them as failures explicitly
                result = False
                reason = '%s %s' % (response.status_code, response.reason)
            elif response.text:
                result, reason = response.status_code, response.reason
            else:
                result, reason = False, 'Empty Page'

    return result, reason
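
The decision that check_url makes about a response can be tried out without a network
connection. The following is a minimal sketch rather than part of the original code:
classify_response is a hypothetical helper that takes a status code, a reason phrase, and the
body text, and returns a (result, reason) pair in the same style. It assumes that any status of
400 or above should count as a failure, which you may want to adjust for your own documents.

```python
def classify_response(status_code, reason, text):
    """Return a (result, reason) pair in the style of check_url.

    Hypothetical helper for illustration; it performs no network access.
    """
    if status_code >= 400:
        # Client and server errors count as failures (an assumption).
        return False, '%d %s' % (status_code, reason)
    if not text:
        # A successful status with an empty body is treated as broken.
        return False, 'Empty Page'
    return status_code, reason


# Made-up sample responses: (status code, reason phrase, body text).
samples = [
    (200, 'OK', '<html>hello</html>'),
    (200, 'OK', ''),
    (404, 'Not Found', '<html>missing</html>'),
]
results = [classify_response(*sample) for sample in samples]
```

Because the first element of the pair is falsy for every failure, a caller can test it with a
plain "if not result" check.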

Now that we have this utility, we can check the PDF file. We will create four lists:

• links: The internal PDF links in the file; for example, a reference to a section or figure.
• badlinks: Of the internal links in the file, the links that target a missing
  destination (broken links).
• urls: The links from the PDF to an external location; for example, a hyperlink to a
  web site.
• badurls: Of the external links in the file, the URLs that target a missing
  destination (broken URLs).

Now for the PyPDF2 goodies. The following check_pdf function loops over the pages in the
PDF file object. For each page, it walks through the annotations listed under the /Annots key.
If an annotation has an action (/A) with a key of /D (destination), it is an internal link, so
the function adds the destination to the links list.

If the annotation has an action with a key of /URI, it is an external link. The function checks
each external link with the check_url function and updates the urls and badurls lists.

After checking each page, get all the anchors in the PDF from the namedDestinations
property; compare those known anchors to the list of internal links we just created. If there
is a link with no matching anchor, that link belongs in the badlinks list.

def check_pdf(pdf):
    links = list()
    urls = list()
    badurls = list()

    for page in pdf.pages:
        obj = page.getObject()
        for annot in [x.getObject() for x in obj.get('/Annots', [])]:
            # Skip annotations that carry no action (for example, comments);
            # indexing '/A' on them would raise a KeyError
            if '/A' not in annot:
                continue
            dst = annot['/A'].get('/D')
            url = annot['/A'].get('/URI')
            if dst:
                links.append(dst)
            elif url:
                urls.append(url)
                result, reason = check_url(url)
                if not result:
                    badurls.append({'url': url, 'reason': '%r' % reason})

    # Internal links whose destination is not a known anchor are broken
    anchors = pdf.namedDestinations.keys()
    badlinks = [x for x in links if x not in anchors]
    return links, badlinks, urls, badurls
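
To see what the loop distinguishes, it can help to run the same classification on plain
dictionaries. This is a minimal sketch, not the PyPDF2 code itself: classify_annotations is a
hypothetical function, the annotation dictionaries and anchor names are made-up sample data,
and the URL check is left out so that the example runs offline.

```python
def classify_annotations(annotations, anchors):
    """Split annotation dictionaries into internal links and external URLs.

    `annotations` is a list of plain dicts standing in for PDF annotation
    objects; `anchors` is the set of named destinations the file defines.
    """
    links, urls = [], []
    for annot in annotations:
        action = annot.get('/A')
        if not action:
            continue  # no action, so there is nothing to test
        if action.get('/D'):
            links.append(action['/D'])
        elif action.get('/URI'):
            urls.append(action['/URI'])
    # An internal link is broken when its destination is not a known anchor.
    badlinks = [name for name in links if name not in anchors]
    return links, badlinks, urls


# Made-up sample data for illustration.
annotations = [
    {'/Subtype': '/Link', '/A': {'/S': '/GoTo', '/D': 'section-1'}},
    {'/Subtype': '/Link', '/A': {'/S': '/GoTo', '/D': 'figure-9'}},
    {'/Subtype': '/Link', '/A': {'/S': '/URI', '/URI': 'http://example.com/'}},
    {'/Subtype': '/Text'},
]
anchors = {'section-1', 'section-2'}
links, badlinks, urls = classify_annotations(annotations, anchors)
```

Here figure-9 ends up in badlinks because no anchor with that name exists, and the annotation
without an action is skipped.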

Finally, make the code into a callable script that takes a single argument, the path to the PDF
file. Then print the results of the check_pdf function on stdout.

if __name__ == '__main__':
    fname = sys.argv[1]
    print 'Checking %s' % fname
    pdf = PdfFileReader(fname)
    links, badlinks, urls, badurls = check_pdf(pdf)
    print 'urls: ', urls
    print
    print 'bad links: ', badlinks
    print
    print 'bad urls: ', badurls
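
If the script runs as part of an automated build, printed lists are easy to overlook. One
possible approach, sketched here under assumptions, is to turn the results into report lines
and an exit status. The summarize function is hypothetical and the sample results are made up;
they only mimic the shape of what check_pdf returns.

```python
def summarize(badlinks, badurls):
    """Build report lines and an exit status from link-check results.

    Hypothetical helper: returns (lines, status), where status is 0 when
    nothing is broken and 1 otherwise.
    """
    lines = []
    for name in badlinks:
        lines.append('broken internal link: %s' % name)
    for entry in badurls:
        lines.append('broken url: %s (%s)' % (entry['url'], entry['reason']))
    status = 1 if lines else 0
    return lines, status


# Made-up results in the shape that check_pdf returns.
lines, status = summarize(
    ['figure-9'],
    [{'url': 'http://example.com/gone', 'reason': "'Empty Page'"}],
)
```

A script could print each line and pass status to sys.exit so that the calling tool notices
the failure.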

Test for Embedded Fonts


Test to make sure that the fonts used in the PDF file are embedded. If a font is not embedded,
your PDF file may display differently on different machines, even if it is a font that is
putatively “standard”, like Times Roman or Helvetica. To ensure that your PDF displays as
intended on any machine, all fonts must be embedded.

In the following code, the walk function is a recursive function that takes a dictionary-like
object (obj) and two sets (fnt and emb). It walks the given dictionary object: for every key in
the given dictionary, the function calls itself on the corresponding value (if that value is a
nested dictionary).

If the dictionary has a key called BaseFont, the value corresponding to that key is the name of
a font used in the PDF; add that font name to the fnt set of fonts used.

If the dictionary has a key called FontName, the dictionary is a descriptor for that font, so
check for another key in the same font descriptor dictionary that begins with FontFile (the
key could be FontFile, FontFile2, or FontFile3). If that key exists, the font is embedded;
add that font name to the set of fonts embedded.

If the two sets are not identical, there are unembedded fonts in the PDF.

fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])

def walk(obj, fnt, emb):
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    elif '/FontName' in obj and fontkeys.intersection(set(obj)):
        emb.add(obj['/FontName'])

    for k in obj:
        if hasattr(obj[k], 'keys'):
            walk(obj[k], fnt, emb)

    return fnt, emb
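
The recursion is easier to follow on a small example. The sketch below repeats the walk
function so that it runs on its own and applies it to a nested dictionary that stands in for a
page's resources. The font names and the dictionary layout are made-up sample data; a real
PDF holds PyPDF2 objects rather than plain dictionaries and strings.

```python
FONT_FILE_KEYS = {'/FontFile', '/FontFile2', '/FontFile3'}


def walk(obj, fnt, emb):
    """Collect used font names in fnt and embedded font names in emb."""
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    elif '/FontName' in obj and FONT_FILE_KEYS.intersection(obj):
        emb.add(obj['/FontName'])
    # Recurse into every nested dictionary-like value.
    for key in obj:
        if hasattr(obj[key], 'keys'):
            walk(obj[key], fnt, emb)
    return fnt, emb


# Made-up resources: one font with an embedded font file, one without.
resources = {
    '/Font': {
        '/F1': {
            '/BaseFont': '/ABCDEF+DejaVuSans',
            '/FontDescriptor': {
                '/FontName': '/ABCDEF+DejaVuSans',
                '/FontFile2': 'stream placeholder',
            },
        },
        '/F2': {
            '/BaseFont': '/Helvetica',
            '/FontDescriptor': {'/FontName': '/Helvetica'},
        },
    },
}
fonts, embedded = walk(resources, set(), set())
unembedded = fonts - embedded
```

Both fonts land in the fonts set, but only the first has a descriptor with a /FontFile2 key, so
/Helvetica is the one left over in unembedded.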

Finally, make the code into a callable script that takes a single argument, the path to the
PDF file.

Start with two empty sets, fonts and embedded. Open the file with PyPDF2. The library
gives us access to the internal structure of the PDF. We loop over each page in the PDF,
passing the page’s Resources dictionary to the walk function, described above. Add the
corresponding results to the two sets and calculate the unembedded fonts by differencing
the sets.

Print the fonts used in the PDF file and, if there are unembedded fonts, print their names as
well. Of course, you can do anything you want with the information, such as saving it to a test
database or printing a report.

if __name__ == '__main__':
    fname = sys.argv[1]
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()

    for page in pdf.pages:
        obj = page.getObject()
        f, e = walk(obj['/Resources'], fonts, embedded)
        fonts = fonts.union(f)
        embedded = embedded.union(e)

    unembedded = fonts - embedded

    print 'Font List'
    pprint(sorted(list(fonts)))
    if unembedded:
        print '\nUnembedded Fonts'
        pprint(unembedded)

Using PyPDF2 Methods


Obviously, the more you can specify about the PDFs you produce, the more you can test. For
example, you may know that your PDF should have specific metadata, should be encrypted,
contain a certain number of pages, and so on.

You can test for those conditions with the built-in tools that the PdfFileReader class in
PyPDF2 provides. If you have a PdfFileReader instance, you can use the following properties
for testing:

documentInfo
returns the document metadata such as author, creator, producer, subject, and title.
isEncrypted
returns a boolean value specifying whether the document is encrypted.
numPages
returns the number of pages in the document.
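
These properties can be compared against what you expect of a document. The following is a
minimal sketch: check_expectations is a hypothetical helper, and FakeReader is a stand-in
object that lets the example run without PyPDF2 or a PDF file. The sketch assumes that the
metadata behaves like a dictionary keyed by PDF names such as /Title, and that documentInfo
may be None when the file carries no metadata.

```python
def check_expectations(reader, pages, encrypted, title):
    """Compare a reader's properties with expected values.

    `reader` is any object exposing numPages, isEncrypted and documentInfo.
    Returns a list of problem descriptions (empty when all checks pass).
    """
    problems = []
    if reader.numPages != pages:
        problems.append('expected %d pages, found %d' % (pages, reader.numPages))
    if reader.isEncrypted != encrypted:
        problems.append('expected isEncrypted to be %r' % encrypted)
    info = reader.documentInfo or {}  # assumed to be None without metadata
    if info.get('/Title') != title:
        problems.append('expected title %r, found %r' % (title, info.get('/Title')))
    return problems


class FakeReader(object):
    """Stand-in for a PdfFileReader so the sketch runs without a PDF file."""
    numPages = 5
    isEncrypted = False
    documentInfo = {'/Title': 'Testing PDFs with Python'}


problems = check_expectations(FakeReader(), 5, False, 'Testing PDFs with Python')
mismatch = check_expectations(FakeReader(), 4, False, 'Testing PDFs with Python')
```

With a real file you would pass a PdfFileReader instance instead of FakeReader and fail the
test when the returned list is not empty.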

Summary
If you produce PDF documents, you need to test them. The more you can specify about your
PDFs, the more you can test. This article describes how you can test that the links (internal
and external) are valid and that the fonts used in the document are embedded.