Java
How to use JSoup and XMLBeam in practice
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools
and many iterations to get reader feedback, pivot until you have the right book and build
traction once you do.
Preface
  LeanPub
  What took me the most time?
  Acknowledgement
Upgrade to Java 8
LeanPub
I publish my books on LeanPub because
So if someone does not want to pay a few dollars for fear that there won't be any updates and
that he or she will have to rely on some online errata, I can just say: do not worry, here you'll
get every update of the book, and they are already included in the price.
I may expand the book with some chapters of my own, or write another sample application, and
when that happens the new version of the book will be available here too. This is because
technology changes so fast that after some time a book stuffed with old knowledge won't
be useful.
As I developed the pieces, I came across more and better solutions and things I wanted to show
you, so I went on and refactored part of the code.
This meant that sometimes I had to redo runs with different configurations to get better results.
But I do not regret these errands, if I may call them that. I learned new things and I
have many ideas about where to go next. Perhaps this will result in another book or more blog articles.
Acknowledgement
I have to thank Sven Ewald, the creator of XMLBeam, for reviewing my book's chapters about
XMLBeam. Besides this, he found time to answer my questions and provide samples with his
answers.
Because LeanPub currently cannot display the whole book's Table of Contents in the sample,
I use this workaround to show you what you'll get if you buy this book. I know it is a bit
awkward (you get some empty pages in your sample PDF), but this way you can see what comes in
the full version.
http://xmlbeam.org
XML Processing and the Google App Engine
In this chapter I'll introduce you to XML processing and the Google App Engine (GAE).
Why GAE?
This is a good question. Mostly because I've worked with GAE and encountered some
problems with XML processing on it. So I thought I could share my problems and solutions
with you. Perhaps you are interested in them, or they may even help you solve some problems of your own.
Writing about development for a GAE environment is always kind of fun: you have your
solution, and at the end you get a punch in the face from GAE because some classes you
want to use are not permitted. Then you start looking for a solution inside the feasible area.
This was the case when we (a co-worker and I) had the task of rendering an XML document
(provided from somewhere, somehow; the source is not important in the current context) in
various formats: PDF and RTF. As a bonus (because rendering those documents was not the easiest
thing), I also implemented a web-based display to check whether we were getting the right data.
Visualizing XML as HTML is always the easiest thing. For me at least.
XML to HTML
As I mentioned, this was not a requirement, but I wanted to see results as soon as possible, so I
added an HTML display of the XML input.
Converting XML to HTML is easy: you only need to run an XSL Transformation (XSLT) and
you are done. The result is an HTML file (or XML or text, depending on your configuration).
But this is a no-go on GAE because you are not allowed to create files
dynamically from your application.
Nevertheless, you can still display your XML data as an HTML page: you only have to add the
stylesheet reference to your data, and most browsers will display it
correctly.
The stylesheet reference tells the browser to transform the XML to HTML with XSLT (the file
detailHtml.xsl contains the transformation information).
If you get your data from an interface (for example from a SOAP service), you have to be a bit
tricky to get your XSL reference into your XML, because you receive all of the data as one XML dataset.
However, if you think it through, you end up with this: replace the opening root node
with the stylesheet node followed by the root node itself. With this workaround you can alter
the XML dataset and display it with XSLT. And this works with GAE too.
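A minimal sketch of this workaround in plain Java, assuming the XML arrives as a string and the name of the root element is known. The class and method names are hypothetical, and detailHtml.xsl is the stylesheet mentioned earlier; the original listing is not part of this sample.

```java
/** Hypothetical helper: puts an xml-stylesheet instruction in front of the root element. */
final class StylesheetInjector {

    /**
     * Returns the XML with a stylesheet processing instruction inserted
     * directly before the opening tag of the given root element.
     */
    static String addStylesheet(String xml, String rootName, String xslHref) {
        String pi = "<?xml-stylesheet type=\"text/xsl\" href=\"" + xslHref + "\"?>";
        String open = "<" + rootName;
        int from = 0;
        while (true) {
            int idx = xml.indexOf(open, from);
            if (idx < 0) {
                throw new IllegalArgumentException("root element not found: " + rootName);
            }
            int end = idx + open.length();
            char next = end < xml.length() ? xml.charAt(end) : '\0';
            // Accept only a complete tag name, so "<cv" does not match "<cvList"
            if (next == '>' || next == '/' || Character.isWhitespace(next)) {
                return xml.substring(0, idx) + pi + xml.substring(idx);
            }
            from = end;
        }
    }
}
```

Calling `StylesheetInjector.addStylesheet(xml, "cv", "detailHtml.xsl")` on a dataset whose root element is `cv` (an assumed name) leaves the XML declaration in place and inserts the instruction right before `<cv>`.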
The example above is a little hack but you have to do this to add the stylesheet to the XML data.
XML to PDF
Converting XML to PDF is simple too: with XSLT you create an XSL-FO document (FO stands for
Formatting Objects) from your XML. An FO document is an XML document whose element names
(node names) come from the FO namespace. After this you can send the resulting FO document to a
rendering engine (for example Apache FOP) and you get your PDF.
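The XSLT step can be sketched with the standard JAXP API (javax.xml.transform). This is only an illustration under assumptions: the class and method names are hypothetical, the XML and the stylesheet are passed as strings, and the result is kept in memory instead of being written to a file. Whether these classes are permitted in a given GAE runtime is not verified here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: applies an XSLT stylesheet to an XML string in memory. */
final class XsltSketch {

    /** Returns the transformation result (for example an XSL-FO document) as a string. */
    static String transform(String xml, String xslt) {
        try {
            // Compile the stylesheet into a transformer
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            // Collect the output in memory; no file is created
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

The returned string would then be handed to the rendering engine. The stylesheet itself has to produce elements from the FO namespace (http://www.w3.org/1999/XSL/Format).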
This sounds simple, but GAE does not allow some of the classes used by Apache FOP
(for example AWT graphics). So another workaround is needed.
iText is a good alternative to FOP, but it does not handle FO documents. However, iText
has an XmlWorker project that is meant to render XML (XHTML) documents. This
sounded very good, so I gave it a try. To get XHTML from the XML I again used XSLT.
Unfortunately, I had some problems applying the required CSS to the XHTML output (some
rules worked, some did not), and as far as I can remember XmlWorker had some problems
displaying the required images too. Besides the images, specific fonts were required
for the texts, and this is hardly manageable either when it comes to XHTML
to PDF conversion (or at least I did not find a good-enough solution).
So I ended up creating the PDF manually with iText: I added each element on its own,
programmatically. To achieve this I created a custom XML extractor which split the provided XML
result document into some classes (grouped by coherence) and added display information to these
classes.
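To give an idea of what such an extractor can look like, here is a minimal sketch using the JDK's DOM parser. Everything in it is an assumption for illustration: the class names DisplayItem and XmlExtractor, the rule that an element named title is rendered as a heading, and the flat structure of the input. The real extractor and the iText rendering step are not shown in this sample.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical value class: one piece of extracted data plus a display hint. */
final class DisplayItem {
    final String label;     // element name, usable as a caption
    final String text;      // text content of the element
    final boolean heading;  // display information for the renderer

    DisplayItem(String label, String text, boolean heading) {
        this.label = label;
        this.text = text;
        this.heading = heading;
    }
}

/** Hypothetical extractor: turns the children of the root element into display items. */
final class XmlExtractor {

    static List<DisplayItem> extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<DisplayItem> items = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and comments
                }
                // Assumed rule: "title" elements are shown as headings
                boolean heading = "title".equals(node.getNodeName());
                items.add(new DisplayItem(node.getNodeName(),
                        node.getTextContent().trim(), heading));
            }
            return items;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

A renderer can then walk the list and decide per item how to add it to the document, for example with a larger font when `heading` is set.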
This was the least time-consuming solution. I could possibly have taken a look at Flying Saucer
(which has the same purpose as XmlWorker: to create PDF from XHTML), but as I mentioned,
we needed the data as quickly as possible. However, if I get some free time between my projects,
I'll take a look at Flying Saucer and try out how well it generates the required PDF from
XHTML.
XML to RTF
The second requirement was to create an RTF document. Why RTF? Because it can be displayed
on various platforms (Windows, Mac OS, Linux), there are some open source tools for
creating RTF documents, and it had already been used successfully in another project.
Well. What you have to know about RTF is this: it was a Microsoft standard until 2008. Since then
Microsoft has given it up and no longer improves it. They work on their new standard (.*X,
which is my shorthand for .docx, .xlsx and .pptx). Besides this, RTF was never fully supported
by any document editor other than MS Word. And the tool which had been used (namely
jRTF) does not implement all of the features, so creating documents with it can be a pain in the
neck.
iText had an RTF generator too (alongside the PDF generator), but it isn't developed anymore.
So we did not even try to use it for our purposes.
But we did it as far as we could. Some features (like embedding fonts) do not work, so we had
to loosen the requirements for the RTF documents. We used jRTF in the end because it was the
only tool which was available and fairly up to date. And as I mentioned previously, it was a pain
in the neck to create the RTF documents. For example, pictures in the document are displayed
as a single line in my Mac's OpenOffice. Not the best thing, is it?
You could ask why RTF documents are needed if you have a PDF. Well, the PDF can contain
too much information, or the display order of the data may not be ideal (good examples of
this are a CV or some management reports), and the users want to alter the document. An alternative
would be a customizable application where you could pick and order the data you
want to display. I suggested this feature, but it was rejected because it would have needed
more time to finish the task.
XML to .*X
Yes. Finally, here it comes. I mentioned Microsoft's new standard for documents above. And there
is a possibility that the management will decide to create a Word document from the provided
XML data.
Currently we're evaluating the possibilities and features of frameworks such as Apache POI
and docx4j, because we already have the data extracted from the XML to create the docx
programmatically. In parallel, we are evaluating a solution to transform the XML data into a docx
with XSLT. If we get any results, I'll publish a book update about the topic, but currently
there is nothing in sight.
@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException {
    OutputStream os = resp.getOutputStream();
    // set the content type of the result
    resp.setContentType("application/pdf");
    // set the name of the resulting PDF file
    resp.setHeader("Content-Disposition", "filename=Result.pdf");
    // createPDF() returns a ByteArrayOutputStream
    os.write(createPDF().toByteArray());
    os.flush();
    os.close();
}
If you want to export an RTF you have to alter the content type as follows:
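The listing for this step is not included in this sample, so the following is only a minimal sketch of the changed values, assuming the doGet method shown above. The helper class is hypothetical and has no servlet dependency; application/rtf is the registered media type for RTF.

```java
/** Hypothetical helper: header values for the RTF export. */
final class RtfExportHeaders {

    // Registered media type for Rich Text Format
    static final String CONTENT_TYPE = "application/rtf";

    // Same header shape as in the PDF listing, only the extension changes
    static String contentDisposition(String baseName) {
        return "filename=" + baseName + ".rtf";
    }
}

// Inside doGet(), replacing the two PDF-specific calls:
//   resp.setContentType(RtfExportHeaders.CONTENT_TYPE);
//   resp.setHeader("Content-Disposition",
//           RtfExportHeaders.contentDisposition("Result"));
```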
I included the setHeader call only to show that you should also change the file
extension from .pdf to .rtf.
With these settings RTF files will be downloaded, while PDF files are only displayed in the browser
(if you have this option enabled and your browser is capable of rendering the PDF file without
downloading it). However, if you do not want the browser to display the PDF file but just download
it, you can enhance the content disposition as follows:
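The listing for this step is missing from the sample as well. A minimal sketch, assuming the usual way to force a download, which is the attachment disposition type in the Content-Disposition header. The helper class is hypothetical and expects a simple ASCII file name without quotes.

```java
/** Hypothetical helper: builds a Content-Disposition value that forces a download. */
final class DownloadDisposition {

    // "attachment" asks the browser to save the file instead of displaying it
    static String attachment(String fileName) {
        return "attachment; filename=\"" + fileName + "\"";
    }
}

// Inside doGet(), instead of the plain "filename=Result.pdf" value:
//   resp.setHeader("Content-Disposition",
//           DownloadDisposition.attachment("Result.pdf"));
```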
XML processing when memory matters
Website scraping with JSoup and XMLBeam
Runtime comparison advanced
Upgrade to Java 8
Custom printing for HTML with JSoup
Printing XMLBeam projections