Polish the EPUB

developerWorks

http://www.ibm.com/developerworks/opensource/library/x-po...

Find and correct problems in EPUB files
In EPUB documents, you cannot detect some problems with normal validation methods. As long as the document
validates as well-formed XML and follows the EPUB standard, it can appear to be correct but might not read correctly
in an e-Reader. Examples include broken paragraphs, bad page numbering, and spelling errors caused by OCR
scanning. But you can view and correct errors using two methods: with the EPUB editor Sigil and with PHP in
combination with SimpleXML and the Enchant libraries. Regular expressions provide the key to efficient processing.

Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's
University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing,
teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online
articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at
colbec@start.ca.

30 August 2011

The EPUB format is an efficient method of presenting documents. Its XML structure ensures that document
components are in their place and will be displayed reasonably well on a wide variety of devices. For an
introduction to EPUB, see Liza Daly's article in Resources.
These documents can fail at two levels:
At a fundamental level, where the XML markup or the content is broken
More subtly, at a level that checking the XML cannot detect
For the former problem, where the EPUB is broken internally, you can use the EpubCheck project (see
Resources for a link). The remainder of this article examines the second type of issue, which can be an
annoyance for readers.

Frequently used acronyms
GUI: Graphical user interface
HTML: Hypertext Markup Language
OCR: Optical character recognition
WYSIWYG: What you see is what you get
XML: Extensible Markup Language


The tight control that XML enforces goes only so far. XML happily permits a number of errors that,
although not sufficiently grave to cause a software fault, nevertheless impede smooth reading. It is easy
to see how these errors can happen: if a publisher uses OCR on a printed page to transfer it to text
format, then all of the oddities of the printed page are carried over, including errors resulting from font
incompatibility. In a commercial situation, editors review the result by hand to produce a polished edition,
but where the product is designed for free and open source distribution, publishers cannot absorb these
costs as easily. So what ends up in your e-Reader is good but not as good as it might be. Examples
include broken paragraphs, blank pages, odd page numbering, and spelling errors.
From a developer's point of view, the challenge is how you can tackle these issues by using the structure
of the EPUB. This article looks at how you can use the Sigil EPUB editor to address some of them and
employ PHP in combination with SimpleXML and spelling libraries to resolve many others.

Broken paragraphs and blank pages


Take the broken paragraph as an example of a secondary problem. In HTML markup, this problem
appears as:
<p>This is where my paragraph begins, hits the end of a physical page here</p>
<div class="newpage" id="page-12"></div>
<p>and then continues from the top of the next physical page,
finally coming to an end here.</p>

The scanner has read to the end of a page and closed the paragraph there, whether or not the paragraph
actually ends, so that the page is syntactically complete. It then starts at the top of the next page and
opens a new paragraph, again whether it is appropriate or not. The result is complete code but
incomplete paragraphs, because sections are orphaned. On the e-Reader, the user might see both
sections on the same device page with no page marker displayed but with the paragraph sections separated
as if they were independent paragraphs.
Similarly, consider blank pages:
<div class="newpage" id="page-128"></div>
<p></p>
<div class="newpage" id="page-129"></div>

Does page 129 in the snippet above really exist? It might be important to preserve it blank, but otherwise,
it is inconvenient to have to turn two pages when only one should be necessary.
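As a rough sketch of how such blank pages could be located outside an editor, the PHP fragment below looks for a page marker that is followed only by an empty paragraph before the next marker. The helper name find_blank_pages and the sample text are assumptions made for illustration; the markup shape is the one shown in the snippet above, and the helper reports the id of the marker that precedes the empty paragraph.

```php
<?php
// Hypothetical helper: return the ids of page markers that are followed
// only by an empty paragraph before the next page marker.
function find_blank_pages(string $html): array {
    $pattern = '~<div class="newpage" id="(page-[0-9]+)"></div>\s*'
             . '<p>\s*</p>\s*(?=<div class="newpage")~';
    preg_match_all($pattern, $html, $m);
    return $m[1];
}

$sample = <<<'HTML'
<div class="newpage" id="page-128"></div>
<p></p>
<div class="newpage" id="page-129"></div>
<p>Text resumes here.</p>
HTML;

print_r(find_blank_pages($sample));
```

Whether a reported page should be removed or preserved remains an editorial decision; the sketch only points at candidates.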
Spelling errors are a different kind of problem, where you compare two different lists of words rather than
look for complex patterns. You deal with this problem separately, using scripting methods.

Sigil
Sigil (see Resources for the website and support pages) is a WYSIWYG EPUB editor that can find the
pattern-matching types of errors and allow programmers to correct them. See the Regular expressions
sidebar for a quick introduction to regular expressions, and see Resources for more detailed information.
Regular expressions
Regular expressions provide a powerful way to search and replace text using pattern-matching
techniques. The syntax is concise, so you need to exercise care to avoid unwanted effects.
An example of a regular expression is [^.]</p>, which searches for an end of paragraph tag that is not
preceded by a period. This might or might not be a problem.
In this regular expression, the square brackets ([]) enclose a group of characters in which one only
might apply, the caret symbol (^) means not any of the following characters, and the period (.) inside
the group stands for itself, as do the rest of the symbols outside the brackets.
See Resources for a more in-depth discussion of this useful tool.

Sigil might not be available from your Linux repository, but it is available as a precompiled binary or as
source files. Once in the GUI, click File > Open to open your EPUB directly. Doing so extracts the EPUB
and displays a directory of the component files on the left; it reveals a browser pane on the right in which
you can display the contents of individual files either as you might view them in the e-Reader or as the
marked-up code. This latter point is an essential feature in finding and correcting problems.
Choose one of the HTML files that your EPUB contains, and double-click it to open it in the browser
window. Then, click View > Code View to display the code behind the file. All the tags should now be
visible.
Suppose that you want to find orphaned paragraph chunks. The criterion you are looking for is </p>
end-of-paragraph tags that are not preceded by a normal end-of-sentence character. The most common of
these characters is the period. Sigil provides a search function (Edit > Find), and the normal search mode
lets you find strings like .</p>, but it does not help you find the end of paragraph that does not have a
period before it. For this, you need the regular expression search mode, which appears when you click
More. Navigate to the top of the code in the browser window, then perform these steps:
1. Select Down for the direction.
2. Select Regular expression for the search mode.
3. Type [^.]</p> as your Find what string.
4. Click Find Next.
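The same search can be sketched in PHP for readers who prefer a script to the editor. This is only an illustration: the function name lines_with_suspect_paragraph_end and the sample text are assumptions, while the pattern is the Find what string from the steps above.

```php
<?php
// Report the lines that contain a </p> not preceded by a period.
// "~" is used as the delimiter so the slash in </p> needs no escaping.
function lines_with_suspect_paragraph_end(string $html, string $pattern = '~[^.]</p>~'): array {
    $lines = explode("\n", $html);
    // preg_grep keeps the original array keys, so key + 1 is the line number
    $hits = preg_grep($pattern, $lines);
    $result = [];
    foreach ($hits as $i => $line) {
        $result[$i + 1] = $line;
    }
    return $result;
}

$sample = <<<'HTML'
<p>This one ends with a period.</p>
<p>This one hits the end of a physical page here</p>
<p>"Is this one broken?"</p>
HTML;

foreach (lines_with_suspect_paragraph_end($sample) as $n => $line) {
    echo "line $n: $line\n";
}
```

With this sample, the second and third lines are reported; the third is a false alarm of the kind that a wider character group can suppress.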

This process should find what you are seeking, if it exists. If there are no hits, you might want to create
one temporarily just to check that the search function works.
After using this technique for a while, you soon find that paragraphs can legitimately end with characters
other than periods. You find that double quotation marks ("), exclamation marks (!), question marks (?),
and maybe some other characters fit the requirement of a complete sentence. Allowing for this is not a
problem with regular expressions. Because the square brackets indicate a group, if you change the Find
what to [^.?!"]</p>, the search accepts as normal anything that has a period, question mark,
exclamation mark, or double quotation mark at the end of a paragraph and flags as erroneous anything
else.
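To see which endings the wider group accepts, a small PHP check along the following lines may help. The helper name is_suspect_ending and the sample endings are made up for illustration; only the pattern comes from the text.

```php
<?php
// True when the fragment contains a </p> that is not preceded by
// a period, question mark, exclamation mark, or double quotation mark.
function is_suspect_ending(string $fragment): bool {
    return preg_match('~[^.?!"]</p>~', $fragment) === 1;
}

$endings = ['here.</p>', 'here?</p>', 'here!</p>', 'here"</p>', 'here</p>', 'here,</p>'];
foreach ($endings as $ending) {
    echo str_pad($ending, 12), is_suspect_ending($ending) ? "flagged" : "accepted", "\n";
}
```

Note that an empty paragraph, <p></p>, is also flagged, because the character before </p> is then the > of the opening tag.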
Another tell-tale sign of a broken paragraph is a <p> tag followed by a lowercase alphabetic
character. The regular expression version of this would be <p>[a-z].. Another useful one is
<p>[0-9]., which looks for paragraphs that begin with numbers. This pattern is useful where the
scanner has picked up a printed page number that is no longer relevant in an e-Reader context.
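Both paragraph-start patterns can be tried in a few lines of PHP. The sketch below is an assumption-laden illustration (the function name classify_paragraph_start and the labels it returns are invented here); the two patterns are taken from the text as written.

```php
<?php
// Classify how a line's paragraph begins, using the two patterns above.
function classify_paragraph_start(string $line): string {
    if (preg_match('~<p>[a-z].~', $line) === 1) {
        return 'lowercase';   // possibly the tail of a broken paragraph
    }
    if (preg_match('~<p>[0-9].~', $line) === 1) {
        return 'digit';       // possibly a stray printed page number
    }
    return 'ok';
}

echo classify_paragraph_start('<p>and then continues from the top</p>'), "\n";
echo classify_paragraph_start('<p>12</p>'), "\n";
echo classify_paragraph_start('<p>This is fine.</p>'), "\n";
```

Neither label proves an error; a paragraph can legitimately begin with a year or a lowercase word, so each hit still needs a look.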
How you decide to fix one of these errors is another matter. If a page marker separates the two pieces,
you might move the marker to before or after the true paragraph and rejoin the two pieces to make one
single paragraph. The page numbering is then approximately but not perfectly accurate.
Searching for page markers is a similar process. Again using the regular expression option, if the Find
what string is page-[0-9]+, the editor searches for any string that begins with the literal characters
p, a, g, e, and a dash, followed by one or more digits from the range zero to nine.
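In a script, the same expression can pull the page numbers out of the markup. The following PHP sketch assumes a helper named page_numbers and a made-up sample; parentheses are added around the digits so that only the number is captured.

```php
<?php
// Collect every page number that appears as "page-<digits>" in the markup.
function page_numbers(string $html): array {
    preg_match_all('/page-([0-9]+)/', $html, $m);
    return array_map('intval', $m[1]);
}

$sample = <<<'HTML'
<div class="newpage" id="page-12"></div>
<p>Some text.</p>
<div class="newpage" id="page-13"></div>
HTML;

print_r(page_numbers($sample));
```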
An interesting break that you can find easily is one where a word, paragraph, and page are all broken at
the same time. The print version indicates the break with a hyphen or dash, which is easily visible and
searched for in code view:
<p>This is where my paragraph begins, hits the end of a phys-</p>
<div class="newpage" id="page-12"></div>
<p>ical page and then continues from the top of the next physical page,
finally coming to an end here.</p>

In this case, a global normal search using the Find what string of -</p> should pick them out quite
quickly.
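One possible repair for this pattern, sketched in PHP, is to drop the hyphen, join the two halves, and move the page marker to after the rejoined paragraph. The function name rejoin_hyphenated_break is hypothetical, and the sketch assumes the exact marker markup shown above. It would also remove a hyphen that genuinely belongs to a compound word, so the result should be reviewed rather than applied blindly.

```php
<?php
// Join a word, paragraph, and page that were all broken at the same place.
// Group 1 is the page marker, group 2 the rest of the continued paragraph.
function rejoin_hyphenated_break(string $html): string {
    $pattern = '~-</p>\s*(<div class="newpage" id="page-[0-9]+"></div>)\s*<p>(.*?</p>)~s';
    // emit the continued text first, then the page marker on its own line
    return preg_replace($pattern, '${2}' . "\n" . '${1}', $html);
}

$sample = <<<'HTML'
<p>This is where my paragraph begins, hits the end of a phys-</p>
<div class="newpage" id="page-12"></div>
<p>ical page and then continues from the top of the next physical page,
finally coming to an end here.</p>
HTML;

echo rejoin_hyphenated_break($sample), "\n";
```

As noted earlier for moved markers, the page numbering is then approximately but not perfectly accurate.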

Review page numbers


Although you can use Sigil to find and review page breaks and numbering, in a document of more than
100 pages, doing so might be tedious. An easier way is to iterate through the documents with PHP and
review the numbering.
The script in Listing 1 finds and reviews the HTML pages and runs through the page breaks. It finds the
number for the first page, which is quite often different from page 1, and verifies that each subsequent
page is an increment from the first page. Although the page numbering test is fairly simple, it is an
example of how to use the OPF file to find and examine the component HTML.
Listing 1. Page checking the EPUB with PHP and SimpleXML
<?php
/* epub is a zipped package containing many files
   the file "content.opf" contains the pointers to the constituent files
   inside content.opf you have
   package (root)
     -> manifest
       -> item
   which we need to filter for media-type="application/xhtml+xml"
   and to check these are real text pages, not just full page images
   these are the text chapters which need to be checked one by one
*/
$firstpage = 0;
$oldpage = 0;
// look for the text to be checked
$opf_file = "./OEBPS/content.opf";
if (!file_exists($opf_file)) {
    //cleanup();
    die("Cannot find the OPF file\n");
} else {
    echo "Found it!\n";
    $xml = simplexml_load_file($opf_file);
    // get the manifest items
    foreach ($xml->manifest->item as $mi) {
        if ($mi['media-type']=='application/xhtml+xml') {
            echo "Found ".$mi['href']."\n";
            if (substr($mi['href'],0,4) == 'part') {
                echo "Page number check in document ".$mi['href']."\n";
                echo scan_chap("./OEBPS/".$mi['href']);
            }
        }
    }
}
function scan_chap($chap) {
    global $firstpage, $oldpage;
    echo "Trying to page num check section $chap \n";
    if (!file_exists($chap)) {
        echo "Cannot find the chapter $chap\n";
    } else {
        echo "Found it!\n";
        $xml = simplexml_load_file($chap);
        //$i = 0;
        foreach ($xml->body->div->div as $pagnumdiv) {
            if ($pagnumdiv["class"]=='newpage') {
                echo $pagnumdiv["id"]."\n";
                // the id looks like "page-12"; skip the 5-character prefix
                $page = (int) substr($pagnumdiv["id"],5);
                if ($firstpage == 0) {
                    $firstpage = $oldpage = $page;
                } else {
                    if ($page != $oldpage+1) echo "Problem at page after $oldpage\n";
                    $oldpage++;
                }
            }
        }
    }
    return "Done...\n";
}
?>

The code first sets up global variables for the number of the first logical page found (set once at the
beginning of the loop) and the number of the previous page checked (that changes with each iteration). It
then declares the name of the OPF file, looks for that file, and, if it cannot find it, ends with an error. If
the file is found, the script opens the file as an XML object and looks for the names of the files mentioned
in the manifest that appear to be HTML using the media-type attribute. In this particular EPUB
document, some HTML files contain only a full-page image and therefore can be ignored. The file names
of these pages contain the string leaf; the other files that contain extended text have a part label. The
code filters these out using substrings.
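The manifest filtering step can be tried in isolation on an inline string, which avoids needing an unpacked EPUB on disk. In this sketch the function name text_chapters, the sample manifest, and its file names are assumptions; the leaf and part naming follows the particular EPUB described above and will differ in other books.

```php
<?php
// Return the hrefs of XHTML manifest items whose file name starts with $prefix.
function text_chapters(string $opf_xml, string $prefix = 'part'): array {
    $xml = simplexml_load_string($opf_xml);
    $chapters = [];
    foreach ($xml->manifest->item as $item) {
        $href = (string) $item['href'];
        $type = (string) $item['media-type'];
        if ($type === 'application/xhtml+xml'
            && strncmp($href, $prefix, strlen($prefix)) === 0) {
            $chapters[] = $href;
        }
    }
    return $chapters;
}

// Made-up manifest for illustration only
$opf = <<<'XML'
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="leaf1" href="leaf0001.html" media-type="application/xhtml+xml"/>
    <item id="part1" href="part0001.html" media-type="application/xhtml+xml"/>
    <item id="part2" href="part0002.html" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
</package>
XML;

print_r(text_chapters($opf));
```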
Now that you know the name of the file, you can read this file into its own SimpleXML object. Iterating
through the <div> tags and filtering for those that have a class attribute of newpage, you can find the
value of the id attribute that contains the page number. You need to let the book tell you which number is
the first page because this is often not page 1, and after this value is stored in the global first page
variable, you can go on to predict what the number of the next page should be. If it happens not to be the
expected number, the script generates an error and continues checking.
This script does not attempt to make changes to the text. It merely flags what it thinks might need your
attention.
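The sequence check itself can be separated from the file handling and tested on a plain list of marker ids. The sketch below is a variation rather than the listing's exact logic: the function name find_page_gaps is hypothetical, and after a mismatch it carries on from the page actually found instead of incrementing the expected number, so a single skipped page is reported once.

```php
<?php
// Report every place where a page id is not one more than the previous id.
function find_page_gaps(array $ids): array {
    $problems = [];
    $previous = null;
    foreach ($ids as $id) {
        $page = (int) substr($id, 5);   // skip the "page-" prefix
        if ($previous !== null && $page !== $previous + 1) {
            $problems[] = "expected " . ($previous + 1) . " after $previous, found $page";
        }
        $previous = $page;              // resynchronize on the page actually found
    }
    return $problems;
}

print_r(find_page_gaps(['page-7', 'page-8', 'page-10', 'page-11']));
```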

Spell checking using PHP, XML, and Enchant


Spelling is a different problem. In this case, you are really after events such as Upon, which the OCR has
read as TJpon or IJpon, which is close but not correct. It might come in as a number of alternatives, and
the spelling routine sees it as so strange that the suggestions it offers are not close or helpful.
A spelling routine examines words one by one and compares them to a standard known list, pointing out
those that don't match, making suggestions, and allowing you to make changes. Sigil can make
replacements of specific strings across multiple documents in the EPUB package, but you need the
power of a scripting engine such as PHP, Perl, Python, and so on, together with specialist libraries, for
finer-grained control.
Newer versions of PHP now contain the hooks necessary not only for digging into XML and HTML files
using SimpleXML but also for using the Enchant spelling manager library. Enchant is capable of
managing multiple different base spelling lists. It helps to differentiate UK English from US English
spellings, for example.
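Because the installed dictionaries vary from system to system, it can be useful to check what is available before committing to one language tag. The following PHP sketch is one way this might be done; the helper pick_dictionary and the order of preference are assumptions, and the Enchant calls run only when the extension is loaded.

```php
<?php
// Hypothetical helper: return the first preferred tag that is available.
function pick_dictionary(array $available, array $preferred): ?string {
    foreach ($preferred as $tag) {
        if (in_array($tag, $available, true)) {
            return $tag;
        }
    }
    return null;
}

if (extension_loaded('enchant')) {
    $broker = enchant_broker_init();
    // each entry describes one installed dictionary, including its lang_tag
    $dicts = $broker ? (enchant_broker_list_dicts($broker) ?: []) : [];
    $tag = pick_dictionary(array_column($dicts, 'lang_tag'), ['en_CA', 'en_GB', 'en_US']);
    echo $tag === null ? "No English dictionary found\n" : "Using $tag\n";
} else {
    echo "Enchant extension is not loaded\n";
}
```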
The script in Listing 2 examines each of the manifest files separately using the same method as in Listing
1, this time going through paragraph by paragraph and word by word checking each against the known
spelling list. It uses the same method of iterating through the HTML component files as in Listing 1 and
adds the required instructions to access the dictionaries.
Listing 2. Spell checking the EPUB with PHP, SimpleXML, and Enchant
<?php
// spell check an epub
/* epub is a zipped package containing many files
   the file "content.opf" contains the pointers to the constituent files
   inside content.opf we have
   package (root)
     -> manifest
       -> item
   which we need to filter for media-type="application/xhtml+xml"
   and to check these are real text pages, not just full page images
   these are the text chapters that need to be checked one by one
   Acknowledgment: Some of the dictionary-related code
   was copied from the PHP Enchant manual page
*/
// set up console for input
$console = fopen("php://stdin","r");
// set up enchant (from PHP manual)
$tag = 'en_CA';
$d = null;
$r = enchant_broker_init();
$bprovides = enchant_broker_describe($r);
echo "Current broker provides the following backend(s):\n";
print_r($bprovides);
$dicts = enchant_broker_list_dicts($r);
print_r($dicts);
if (enchant_broker_dict_exists($r,$tag)) {
    $d = enchant_broker_request_dict($r, $tag);
    $dprovides = enchant_dict_describe($d);
    echo "dictionary $tag provides:\n";
    print_r($dprovides);
} else {
    cleanup();
    die ("Cannot set up the spell checker\n");
}
// look for the text to be checked
$opf_file = "./OEBPS/content.opf";
if (!file_exists($opf_file)) {
    cleanup();
    die("Cannot find the OPF file\n");
} else {
    echo "Found it!\n";
    $xml = simplexml_load_file($opf_file);
    foreach ($xml->manifest->item as $mi) {
        if ($mi['media-type']=='application/xhtml+xml') {
            echo "Found ".$mi['href']."\n";
            if (substr($mi['href'],0,4) == 'part') {
                echo "Need to spell check ".$mi['href']."\n";
                echo scan_chap("./OEBPS/".$mi['href']);
            }
        }
    }
}
function cleanup() {
    global $d, $r;
    // free only what was actually set up; PHP 8.0 deprecates both calls
    // in favor of simply unsetting the objects
    if ($d) enchant_broker_free_dict($d);
    if ($r) enchant_broker_free($r);
}
function scan_chap($chap) {
    echo "Trying to spell check section $chap \n";
    if (!file_exists($chap)) {
        echo "Cannot find the chapter $chap\n";
    } else {
        echo "Found it!\n";
        $xml = simplexml_load_file($chap);
        $i = 0;
        foreach ($xml->body->div->p as $para) {
            echo $para."\n";
            // need to spell check the contents of $para
            spell_check(trim((string) $para));
            $i++;
            // sample limit: stop after the first six paragraphs
            if ($i > 5) break;
        }
    }
    return "Done...\n";
}
function spell_check($para) {
    global $console, $d;
    // collapse runs of whitespace and drop periods so words split cleanly
    $para = preg_replace('/\s+/', ' ', $para);
    $para = str_replace(".","",$para);
    $para = $para." ";
    echo "Checking text : $para\n";
    $start = 0;
    // walk from space to space; strpos returns false when none is left
    while (($pos = strpos($para," ",$start)) !== false) {
        echo "Found $pos\n";
        $len = $pos-$start;
        $theword = substr($para,$start,$len);
        // tidy up theword which may contain punctuation
        $punc = array(':',';',',','"','?','!');
        $theword = str_replace($punc,"",$theword);
        if ((strlen($theword) > 0) and (!is_numeric($theword))) {
            if (enchant_dict_check($d, $theword)) {
                echo "$theword is OK!\n";
            } else {
                $suggs = enchant_dict_suggest($d, $theword);
                echo "Suggestions for <$theword>:\n";
                // show only the first five suggestions
                $max = 5;
                foreach ($suggs as $k=>$sugg) {
                    if ($k >= $max) break;
                    echo "$k => $sugg\n";
                }
                // wait for the user before moving on
                $inp = fgets($console,1024);
            }
        }
        $start += $len+1;
    }
}
?>

In this code, you start by declaring a file pointer to standard input so that you can get interactive
information from the keyboard during the spell-check process. The next section establishes the
connection to the dictionaries. Note that the tag variable indicates en_CA, which, in this instance, puts a
preference on Canadian English. The result is that the checker chooses colour over color,
acknowledgement over acknowledgment, and so on. A more standard setting for the tag is en_US. After
the dictionary is connected, it performs the same search for HTML text files as in Listing 1, but this time,
instead of looking for page number <div> tags, it looks for paragraphs with real text.
Before performing the actual spell check, the script cleans up the paragraph text to make it more
manageable by removing long spaces and removing periods and commas because the goal is to
examine word by word. Then, the actual spell checking starts by moving from word to word in the
paragraph, ignoring words that are numbers and comparing the word to the dictionary. Where the
dictionary does not contain the word, the script suggests words that might be a better substitute. In this
case, the script presents only the first five alternates. The script halts at each problem word and waits for
user input from the keyboard. At this point, you can add code to change, ignore once, ignore for the
session, and so on.
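The word-by-word walk can also be expressed with preg_split, which some readers may find easier to follow than tracking string positions. The sketch below is an alternative formulation, not the listing's method; the function name words_to_check is hypothetical, and it strips the same punctuation characters and skips numbers in the same way.

```php
<?php
// Split a paragraph into the words that would be sent to the spell checker.
function words_to_check(string $para): array {
    // split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces
    $tokens = preg_split('/\s+/', $para, -1, PREG_SPLIT_NO_EMPTY);
    $words = [];
    foreach ($tokens as $token) {
        // remove the punctuation that the listing removes
        $word = str_replace(['.', ':', ';', ',', '"', '?', '!'], '', $token);
        if ($word !== '' && !is_numeric($word)) {
            $words[] = $word;
        }
    }
    return $words;
}

print_r(words_to_check('TJpon  the 12 hills, he said: "Stop!"'));
```

Each returned word could then be passed to enchant_dict_check in the same way as in Listing 2.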

Conclusion
Both Sigil and PHP scripting with XML and spelling libraries are helpful tools in finding and fixing errors
that cannot be detected using normal EPUB checking routines. Whether these secondary errors are truly
errors or just minor cosmetic inconveniences depends on the context in which you are using the
document and the ability of the hardware reader and its own software to resolve these issues on the fly.

Resources
Learn
Build a digital book with EPUB (Liza Daly, developerWorks, updated January
2011, published November 2008): Read an introduction to EPUB and a list of
EPUB resources.
Know your regular expressions (Michael Stutz, developerWorks, June 2007):
Check out this introduction to regular expressions on UNIX systems.
Discover the available tools and techniques that can help you learn how to
construct regular expressions for various programs and languages.
More articles by this author (Colin Beckingham, developerWorks, March
2009-current): Read articles about XML, voice recognition, XHTML, PHP,
SMIL, and other technologies.
New to XML? Get the resources you need to learn XML.

XML area on developerWorks: Find the resources you need to advance your
skills in the XML arena, including DTDs, schemas, and XSLT. See the XML
technical library for a wide range of technical articles and tips, tutorials,
standards, and IBM Redbooks.
IBM XML certification: Find out how you can become an IBM-Certified
Developer in XML and related technologies.
developerWorks technical events and webcasts: Stay current with technology
in these sessions.
developerWorks on Twitter: Join today to follow developerWorks tweets.
developerWorks podcasts: Listen to interesting interviews and discussions for
software developers.
developerWorks on-demand demos: Watch demos ranging from product
installation and setup for beginners to advanced functionality for experienced
developers.
Get products and technologies
Sigil: Explore this multi-platform WYSIWYG ebook editor, designed to edit
books in EPUB format.
Enchant: Learn about spell checking with this wrapper that provides uniformity
and conformity on top of several libraries.
EpubCheck project: Check out this useful tool to validate IDPF EPUB files. It
can detect many types of errors in EPUB.
IBM product evaluation versions: Download or explore the online trials in the
IBM SOA Sandbox and get your hands on application development tools and
middleware products from DB2, Lotus, Rational, Tivoli, and
WebSphere.
Discuss
XML zone discussion forums: Participate in any of several XML-related
discussions.
The developerWorks community: Connect with other developerWorks users
while exploring the developer-driven blogs, forums, groups, and wikis.
