
How To Bookmark PDF Files Of Heritage Documents ©

By
Kedarnath Jonnalagadda, Vaidika Gramam, Hyderabad 2011
smartxpark@yahoo.com

A book without bookmarks is the same as bookmarks without a book.

1 Introduction

This document illustrates methods that were used to easily bookmark thousands of PDF pages.

A bookmark is a reference to the position of a desired detail in reading material.

Reading material comprises traditional "books" and "documents" printed on paper, electronic files with
extensions such as ".txt", ".doc", ".xls" and ".pdf", and web pages on the Internet.

Bookmarks help to "home in" on items of specific interest in a book or even in other books, documents, or
web pages.

Conceptually, a bookmark has two parts: (1) a "pointer", "link" or "hyperlink" in an electronic document
and (2) the "pointed" area.

To be of any use, "pointers" and the "pointed" need to be well structured and accurate. Familiar examples
of these are the "Contents", "Index" and "List of References" pages in books and documents. The "Contents"
page displays each topic and the page number where the topic starts. The "Index" page displays details of
topics and where they can be located. Entries in the "List of References" are "pointers" to other books,
documents or web sites.

Compiling a list of bookmarks in a book or document is time and labor intensive. Such compilations can
become another book. For example, in Sanskrit literature, the [anukramAnika] are huge and detailed indexes
of the [mantra] in the [veda] and [sutra] of [pANiNI].

Old documents that have heritage value are Heritage Documents. These are being made increasingly
available in PDF and other electronic forms. Among these, PDF files have greater appeal to readers
because they are true to life pictures of antique original documents. A PDF can be made with (1) pictures of
the original text only, or (2) pictures of the original text with additional processable text of the same.

Processable text from pictures of text is useful for search and scholarly analysis and, most importantly, to
assist preparation of Bookmarks! Processable text is obtained by manually keying in the text seen in pictures,
or by using optical character recognition (OCR) software.

In principle, using software to generate text from pictures is like employing a robot "typist" to see
the pictures of text, recognize the characters and type them in using a keyboard. Practically, this is fraught
with difficulties. Robot "typists" are notorious for the "mistakes" they make. And troubles are compounded by
the huge quantity of errors generated with great speed. Often you could spend more time correcting OCR
mistakes than the time you may take to manually type in the text.

Despite all the above, processable text that is reasonably readable is of very great use in preparing a list of
bookmarks in the document. And such bookmarks can be a great help in correcting OCR mistakes too.

2 Assumptions and Materials

2.1. PDF of a Heritage document is available or you can make it.

2.2. The PDF file has

(a) pictures of text

(b) analyzable text (OCR or manually input) true to the pictures

2.3. Software / facility to extract OCR text from a PDF file.

Some PDF readers allow a PDF file to be printed "as text", if there is any processable text in the file.

A-PDF Text Extractor© has simple but very useful additional features. (a) The serial number of each PDF page
in the file is prominently printed along with the extracted text. (b) The coordinates of each item of text on the
PDF pages are optionally printed along with the extracted text. http://www.a-pdf.com/

2.4. Text editors such as Microsoft Notepad©, supplied with the Windows operating system, and Notepad++ ©,
available from http://www.notepad-plus-plus.org

2.5. A combination of PDF software and tools

2.5.1 Primarily, PDFILL© Editor and PDFILL PDF Tools© available from http://www.pdfill.com
2.5.2 And support and testing with

eXPert PDF Reader from http://www.visagesoft.com/
Foxit© PDF Reader or Adobe Reader

2.6. On Screen Virtual Keyboard such as Click-N-Type Virtual© from http://www.cnt.lakefolks.com

2.7. Screen Magnifier such as Meazure© from http://www.cthing.com

The above two are necessary for ergonomic and/or accessibility reasons. Microsoft Windows© provides such
tools, which can be accessed via the menu route Start - All Programs - Accessories - Accessibility
- On-Screen Keyboard and Magnifier.

2.8. Spreadsheet software such as Calc©, a part of Openoffice.org from http://www.Openoffice.org

3 Definitions of Resources

3.1. PDF technology offers resources for easy casual reading or scholarly study of books and documents.

The main resources for these in PDF Readers are

3.1.1 Bookmarks displaying the contents of a book or document. These are similar to "Contents" pages in books
with "topics" and the "page number" where each topic starts. When a topic of interest in the Bookmarks area
is selected, the software displays the page or area of the page where the topic of interest is located.

Bookmarks can be viewed as a user-designed Contents area with a list of "hyperlinks" to different pages
and sections of pages.

3.1.2 Comments are user-made highlights, notes and drawing markups in a PDF file. PDF programs make a
list of hyperlinks to these and display them in a separate area, Comments. This can be viewed as a
facility for additional Bookmarks that can be modified, sorted and grouped.

Generally Bookmarks are designed to reference items as they are in the original book or document.
These do not leave any visible marks on the PDF pages.

Bookmarks and Comments are probably two imperatives for proper study of Heritage Documents. The
number of pages in such books can run into thousands. PDF file sizes can be 90 or 100 MB and more.

3.1.3 Pages is a software-generated list of hyperlinks to individual pages in the file.

3.1.4 Attachments is a facility to include other files in a PDF file.

4 Needs

4.1. Sizing things down - Large PDF files are unwieldy. Working with them can be frustratingly slow even on
the fastest of computers. The easiest workaround is to split a large PDF file into conveniently small files.
One way is to split them chapter-wise, the same as in the original. Names given to the files ought to be
self explanatory to the extent possible, and numbered such that the list can be sorted to display the same
order as the original single document.

For example,
01-of-48-sv01-a-01-OF-5-akAratRutva-MW 4938173 bytes
02-of-48-sv01-a-02-OF-5-anu-garjita.-MW 4911093 bytes
03-of-48-sv01-a-03-OF-5-abhi-grah-MW 4979304 bytes
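
The naming idea above can be sketched in Python. This is only an illustration under stated assumptions: the function name split_file_name, the ".pdf" extension and the simplified "NN-of-TOTAL-label" pattern are hypothetical and are not the exact scheme used for the files listed above. The point shown is that a zero-padded serial number makes a plain alphabetical sort reproduce the order of the original document.

```python
def split_file_name(index: int, total: int, label: str) -> str:
    """Build a name such as '03-of-48-abhi-grah.pdf' (hypothetical pattern)."""
    # Pad the serial number to the width of the total so that
    # '02' sorts before '10' in a plain alphabetical listing.
    width = len(str(total))
    return f"{index:0{width}d}-of-{total}-{label}.pdf"


# Names created in arbitrary order still sort back into document order.
names = [
    split_file_name(10, 48, "later-chapter"),
    split_file_name(2, 48, "anu-garjita"),
    split_file_name(1, 48, "akAratRutva"),
]
ordered = sorted(names)
```

Without the padding, a file numbered 10 would be listed before a file numbered 2.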

4.2. Structured Placement of Files in Structured Folders

All the split files ought to be in one suitably named folder.

A record of all the files is easily made with the Microsoft Windows© Command Prompt, accessed via the
menu route Start - All Programs - Accessories - Command Prompt. Executing the command
DIR > <filename>.txt generates a text file with a list of all files in a folder. A file name such as
Files_in_Heritage_3_folder.txt will be self explanatory and easily remembered and identified.

Powertoys is a set of additional Microsoft tools for XP computers. This includes the useful
Command Prompt Here tool that enables an easy switch to the DOS directory of a Windows folder location.
These tools are available from the http://www.microsoft.com downloads section.
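
Where the Command Prompt is not convenient, a similar record can be produced with a few lines of Python. This is a minimal sketch and not part of the procedure described here: the function name write_file_list and the "name, TAB, size" layout of each line are assumptions chosen for illustration.

```python
from pathlib import Path


def write_file_list(folder: str, list_name: str) -> list[str]:
    """Write one 'name<TAB>size bytes' line per file in folder; return the lines."""
    root = Path(folder)
    lines = [
        f"{p.name}\t{p.stat().st_size} bytes"
        for p in sorted(root.iterdir())
        # Skip sub-folders and the list file itself if it already exists.
        if p.is_file() and p.name != list_name
    ]
    (root / list_name).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Calling write_file_list with the folder of split files and a list name such as Files_in_Heritage_3_folder.txt writes the list into that same folder.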

5. Reasons And Choice Of PDF Software

5.1 Splitting PDF files is imperative for Heritage Documents because PDFs of original scanned files
can be unmanageably huge on most computers. Concomitant with splitting is the need to Merge PDF files
for a variety of reasons. These could be mistakes made while splitting, or the inclusion of a page or pages as
an enhancement for study. For example, a list of Abbreviations in every split file makes each file easier
to study.

PDFILL PDF Tools © is a very comprehensive collection of tools to process PDF files and the images
in them.

5.2 Accessing Split PDF files is most conveniently done by Bookmarks that can link to and open the split
PDF files of a large book. Such a facility is provided in Foxit Reader, but the software chosen here is
PDFILL PDF Editor© because of the highly innovative and easy method it provides for Export and Import
of Bookmarks as an xml file.

When you have 20 or 30 split files of a book, each of these must have a menu to access the other split files.
Bookmarks to access the different split files can be created once and exported as xml. This can then be
imported easily into any number of files of the same book or even another book. For example, references to
the set of split files of an entire Sanskrit dictionary could be imported into every split file of a Sanskrit
Grammar book.

5.3 Creating Bookmarks

Bookmarks enable access to the page with a selected item. Additionally, they can enable display of particular
areas of interest in a page with a desired zoom factor.

The following readers enable quick manual creation of bookmarks with display of the desired area and zoom
of PDF pages:

eXPert PDF Reader from http://www.visagesoft.com/
Foxit© PDF Reader or Adobe Reader

5.3.1 Create a few bookmarks using one of the above readers and save the PDF.

5.3.2 Use PDFILL PDF Editor to open the saved file and create a template by exporting the
Bookmarks as xml.

5.3.3 Use the template to create a full set of Bookmarks as xml for the document.

5.3.4 Import the completed set into PDFILL PDF Editor and create a complete PDF file with full
set of Bookmarks.

6. Creating Bookmarks as xml - step 1

6.1 The first requirement to create Bookmarks as xml is to have processable text of the PDF hidden in it.
Adobe Reader enables output of a PDF file as text, if the file has hidden or OCR text.

Our current interest in the processable text of the PDF is to locate the page number of items of interest.

6.2 Notepad++ enables a quick look at the processable text. It is a powerful text editor that can read and
write xml files too. Bookmarks exported as xml with PDFILL PDF Editor are also read into Notepad++ to
use as a template for a complete set of Bookmarks. Bookmarks exported as xml have this structure.

<?xml version="1.0" encoding="ISO8859-1"?>


<Bookmark>
<Title Action="GoTo" Page="1 XYZ 0.0 562.0 1.5" Color="0.0 0.0 0.0" > CHAPTER III. </Title>
<Title Action="GoTo" Page="1 XYZ 0.0 562.0 1.5" Color="0.0 0.0 0.0" > 3708. That which is
called an affix, has an acute accent on its </Title>
</Bookmark>

The areas of the code of functional interest to create our complete set of Bookmarks are

1. Page            The page number where the item of our interest is located
2. XYZ ... 1.5     The section of the page to display; the last value, 1.5, means 150% zoom
3. Color ...       The font style and color parameters of the bookmark
4. > CHAPTER III   The text appearing in the Bookmark as well as in the book
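
To check that an exported template really has this structure, the four areas can be read back with Python's standard xml.etree.ElementTree module. This is a hedged sketch: the function name read_bookmarks and the dictionary keys are hypothetical, the sample omits the xml declaration line for simplicity, and the meaning of the two numbers between XYZ and the zoom is not interpreted here.

```python
import xml.etree.ElementTree as ET

# A one-bookmark sample in the exported structure (declaration line omitted).
SAMPLE = """<Bookmark>
<Title Action="GoTo" Page="1 XYZ 0.0 562.0 1.5" Color="0.0 0.0 0.0" > CHAPTER III. </Title>
</Bookmark>"""


def read_bookmarks(xml_text: str) -> list[dict]:
    """Return page, zoom, color and text for every Title element."""
    items = []
    for title in ET.fromstring(xml_text).iter("Title"):
        # Page attribute layout assumed from the sample:
        # page number, the word XYZ, two position values, zoom factor.
        parts = title.get("Page", "").split()
        items.append({
            "page": int(parts[0]),
            "zoom_percent": float(parts[4]) * 100,
            "color": title.get("Color"),
            "text": (title.text or "").strip(),
        })
    return items
```

Running read_bookmarks(SAMPLE) gives one entry with page 1 and a zoom of 150 percent, matching the description of items 1 to 4.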

7. Creating Bookmarks as xml - step 2

7.1 The entire processable text from the PDF, read first in Notepad++, is copied and pasted into a spreadsheet
using the option Fixed Width. This pastes each line of text into a single cell in a column. The column is given
an appropriate header, for example, Text_fromPDF.
7.2 A control serial number is entered in the cells adjacent left to the column of lines of text. The
menu route in OpenOffice.org Calc© is Edit-Fill-Series. An appropriate header for the column, such as
Control_SNo, is entered.
7.3 A column is inserted between the above two for the text that appears in the Bookmark. This is item 4
in the structure of xml described above. The column header for this could be Bookmark_Text.
7.4 Next is locating page numbers in the book. Page numbers follow a pattern in books. They could be at
the top or the bottom of the page. If at the bottom, they are usually the sole occupants of a single line. If at
the top, they may be preceded by text on a right hand page. On a left hand page, text may follow the number.

Our objective is to recognize this pattern and extract these into a separate column having only numbers.
So depending on the pattern we need to insert one or two columns adjacent left to the column.

7.5 The full range of cells including headers is named in Openoffice.org via the menu route
Data-Define Range. The spreadsheet software used could be Microsoft Excel or Gnumeric where
this procedure might be slightly different.

8. Extracting Items For Bookmarks xml

8.1 Once the range has been defined, Automatic Filter is set for the range. In Openoffice.org the
menu route is Data - Filter

8.2 Standard Filter is selected on the column Text_fromPDF. This gives a variety of options such as
Contains, Does Not Contain, Begins With and Ends With. Our objective is to set a filter for page
numbers or numerals.

This is easily done by using the Regular Expression option, for example the Standard Filter condition
Begins With or Ends With and the regular expression [0-9]. The Regular Expression option is selected in the
More Options dialog box in Openoffice.org. The displayed rows will have the digits 0 thru 9. Entering [0-9][0-9]
will display double digit items and [0-9][0-9][0-9] will display three digit items.

Software such as A-PDF Text Extractor © from http://www.A-PDF.com makes things a lot better. It imprints
page-identifying text, " =Page 1=", " =Page 2=" ... etc. When text extracted with this
program is used, the Standard Filter can be a simple Contains "=Page".
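
The two kinds of filter described in 8.2 can be tried outside the spreadsheet with Python's re module. This is a sketch under stated assumptions: the function names are hypothetical, and the marker pattern tolerates optional spaces around the word Page because the exact spacing of the imprinted text may vary.

```python
import re

# Marker imprinted by the text extractor, e.g. " =Page 2=" or "= Page 1 =".
PAGE_MARKER = re.compile(r"=\s*Page\s+(\d+)\s*=")

# Plain page numbers: digits at the very start or the very end of a line.
DIGITS_AT_EDGE = re.compile(r"^\s*\d+|\d+\s*$")


def page_from_marker(line: str):
    """Return the page number in a '=Page N=' marker line, else None."""
    match = PAGE_MARKER.search(line)
    return int(match.group(1)) if match else None


def has_edge_number(line: str) -> bool:
    """True when the line begins or ends with digits (a candidate page number)."""
    return DIGITS_AT_EDGE.search(line) is not None
```

Note that has_edge_number also matches numbered headings such as "2. The Feminine (Gender).", so lines found this way still need a visual check, just as with the spreadsheet filter.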

8.3 After an appropriate filter has been set for lines with page numbers, the text from the PDF file in the
displayed lines needs to be copied into a blank column inserted adjacent left to Text_fromPDF and named
appropriately, for example Identified_page_Nos. When a filter is set in OpenOffice.org, the contents of cells
cannot be simply copied and pasted. A formula giving reference to the text can, however, be entered in the first
cell of the blank column and copy pasted into the other cells of the filtered range. For example, in the blank
cell F47 the formula =G47 can be entered, copied and pasted into all visible blank cells below it. Thus, when the
filter is removed, the first cell having any formula will be in the 47th row, the next in the appropriate place
in F143, and so on. The entire column, without any filter, then needs to be copied and pasted back special as
text in the same column.

8.4 Numbering lines above or below the identified rows with page numbers

This is needed because all lines from the beginning to line 47 in above example are all on page one and
we may be needing item of text on line 16 and 32 as Bookmarks.

This is most easily done by inserting a column adjacent left to the Identified_page_Nos column.
This can be given header Identified_lines_and_pgs.

A formula is then entered to do the job for us. To enter the correct formula we need to understand
what we want and state that explicitly. In this example we can see that a number of blank cells
precede the identified page number. In other words, all lines above an identified line with a page
number belong to that page. So, the logic in our formula would be: if the cell in Identified_page_Nos
has a number, then the cell in Identified_lines_and_pgs. should have that number; otherwise it should
have the same number as the cell below it.

Translated into a spreadsheet formula for cell E143: =IF(F143<>"";F143;E144)

Row  C            D              E                          F                    G
     Control_SNo  Text_Bookmark  Identified_lines_and_pgs.  Identified_page_Nos  Text_fromPDF

10   1                                                                           ON RULES OF, GENDERS
11   2                                                                           CHAPTER I.
12   3                                                                           FEMININE GENDER.
13   4                                                                           q i 'fag*' u
14   5                                                                           1. The Gender.
15   6                                                                           Note: — There are three
16   7                                                                           * i '13ft' i wfasmnri sh ii
17   8                                                                           2. The Feminine (Gender).
18   9                                                                           These two are A'dhikara

47                               	                    =G47                 = Page 1 =
143                              =IF(F143<>"";F143;E144)    =G143                = Page 2 =
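
The logic of the formula in 8.4 can be sketched in Python to make the direction of the fill explicit. This is an illustration only: the function name fill_pages_upward is hypothetical, and None stands for a blank cell in the Identified_page_Nos column.

```python
def fill_pages_upward(identified):
    """Give every line the page number of the nearest identified line at or below it.

    Mirrors the rule: if the Identified_page_Nos cell has a number, use it,
    otherwise use the value of the cell below.
    """
    filled = [None] * len(identified)
    below = None  # value carried up from the rows underneath
    # Work from the last row to the first, like a formula that refers
    # to the cell below it.
    for i in range(len(identified) - 1, -1, -1):
        if identified[i] is not None:
            below = identified[i]
        filled[i] = below
    return filled
```

For example, fill_pages_upward([None, None, 1, None, 2, None]) returns [1, 1, 1, 2, 2, None]. The trailing None shows that lines after the last identified page marker stay unnumbered and need manual attention.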
8.5 Having got the page nos for all lines with the formulas, the results are copied and pasted special as
numbers only.

8.6 Items such as chapters, sections and any others that occur with regularity, or even specially needed
items such as references to a particular author, can be identified by setting an appropriate filter.
Such identified items are tagged or marked in a separate, newly inserted column. The procedure is the same as
was followed first for the occurrence of page numbers.

8.7 Text_Bookmarks is the text displayed as the Bookmark. When a filter is set for the different marked items
described above, the related page numbers and the actual text in the PDF are displayed. This can be used
as the text that is displayed in Bookmarks. There could be OCR and other errors, but these can be corrected
later in the spreadsheet or xml file and re-imported into PDFILL PDF Editor. After all the work is done, a final
filter is set on this column to select non-blank items. Now the items are ready for copy and paste special
as text.

9 Getting the xml file ready in spreadsheet

Bookmarks exported as xml, described in section 6, Creating Bookmarks as xml - step 1,
are copied into the spreadsheet. This is a template that we may like to use for our Bookmarks.

9.1 A separate sheet renamed xml_prefinal is used for this.

This has 8 columns.

The first two lines are exact constants with no preceding space.
The third line is shown with its sections one below the other for reasons of truncation and space in this
document, which was prepared in a spreadsheet.

All items in red except Col4 are constants in Bookmarks exported / imported as xml.

All <Title Action="GoTo" Page=" sections have one preceding space.

Col4 has items giving the page location display parameters, zoom and colour display of the text
of the bookmark.

x is a marker that helps visually contain overflow text from preceding columns. This will be
removed in Notepad++.

Items in green are the Bookmark variables in any document:

Col3 has the Page number
Col6 is the displayed text in the Bookmark

Col1  Col2  Col3  Col4  Col5  Col6            Col7  Col8
            Page              Text_Bookmarks

<?xml version="1.0" encoding="ISO8859-1"?>
<Bookmark>
<Title Action="GoTo" Page="
x
1
XYZ 0.0 562.0 1.5" Color="0.0 0.0 0.0" >
x
CHAPTER III.
x
</Title>

<Title Action="GoTo" Page="
x
92
XYZ 0.0 562.0 1.5" Color="0.0 0.0 0.0" >
x
3976. A finite verb, along with £s prt ceding Gat
x
</Title>

<Title Action="GoTo" Page="
x
93
XYZ 0.0 562.0 1.5" Color="0.0 0.0 0.0" >
x
3977. A Gatijbecomes unaccented, when followed
x
</Title>

<Title Action="GoTo" Page="
x
93
XYZ 0.0 562.0 1.5" Color="0.0 0.0 0.0" >
x
3978. A Gati becomes anudatta, when followed
x
</Title>
</Bookmark>

Any "<" or ">" character should not be present in text in Col6.
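
As an alternative to assembling the columns by hand, the same lines can be produced from a list of page numbers and bookmark texts. The following is only a sketch, assuming the template values shown in this document (XYZ 0.0 562.0 1.5 and Color 0.0 0.0 0.0); the names TEMPLATE and bookmark_xml are hypothetical. It drops any "<" or ">" from the text, as required above. An "&" in the text is also not allowed in well-formed xml, so such characters may need the same treatment.

```python
# One Title line, with a preceding space, following the exported template.
TEMPLATE = (
    ' <Title Action="GoTo" Page="{page} XYZ 0.0 562.0 1.5" '
    'Color="0.0 0.0 0.0" > {text} </Title>'
)


def bookmark_xml(rows) -> str:
    """Build the bookmark xml text from (page, text) pairs."""
    lines = ['<?xml version="1.0" encoding="ISO8859-1"?>', "<Bookmark>"]
    for page, text in rows:
        # The bookmark text must not contain '<' or '>'.
        safe = text.replace("<", "").replace(">", "").strip()
        lines.append(TEMPLATE.format(page=page, text=safe))
    lines.append("</Bookmark>")
    return "\n".join(lines) + "\n"
```

The returned text can be saved as an xml file and checked in Notepad++ before it is imported.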


Each cell will be separated by a <TAB> character when this xml range is copied and pasted into Notepad++.
Every sequence <TAB>x<TAB> must be removed in Notepad++ using find and replace with a null string.
The remaining single <TAB> characters must be replaced with a space. After the needed changes the file should be
saved as an xml file, then imported into PDFILL PDF Editor, and the result saved as a PDF file.
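
The two find and replace steps can also be expressed in Python, which is handy for checking what a pasted row should look like after cleaning. This is a minimal sketch; the function name clean_pasted_row is hypothetical.

```python
def clean_pasted_row(row: str) -> str:
    """Apply the two replacements to one tab-separated row pasted from the sheet."""
    # Step 1: remove every <TAB>x<TAB> marker sequence.
    without_markers = row.replace("\tx\t", "")
    # Step 2: turn the remaining single tabs into spaces.
    return without_markers.replace("\t", " ")


pasted = (
    ' <Title Action="GoTo" Page="\tx\t1\t'
    'XYZ 0.0 562.0 1.5" Color="0.0 0.0 0.0" >\tx\tCHAPTER III.\tx\t</Title>'
)
cleaned = clean_pasted_row(pasted)
```

The order matters: if the single tabs were replaced first, the marker sequences could no longer be found.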
