XML has two main advantages: first, it offers a standard way of structuring data, and, second, we can specify
the vocabulary the data uses. We can define the vocabulary (what elements and attributes an XML
document can use) using either a document type definition (DTD) or the XML Schema language.
DTDs were inherited from XML's origins as SGML (Standard Generalized Markup Language) and, as such,
are limited in their expressiveness. DTDs are for expressing a text document's structure, so all entities are
assumed to be text. The XML Schema language more closely resembles the way a database describes
data.
Schemas provide the ability to define an element's type (string, integer, etc.) and much finer constraints (a
positive integer, a string starting with an uppercase letter, etc.). DTDs enforce a strict ordering of elements;
schemas have a more flexible range of options (elements can be optional as a group, in any order, in strict
sequence, etc.). Finally schemas are written in XML, whereas DTDs have their own syntax.
As you'll see in this article, schemas themselves are quite straightforward—I find them easier than DTDs as
there is no extra syntax to remember. The difficulties arise in using XML Namespaces and in getting the
Java parsers to validate XML against a schema.
In this article, I first cover the basics of XML Schema, then validate XML against some schema using several
popular APIs, and finally cover some of the more powerful elements of the XML Schema language. But first,
a short detour.
Various experts and interested parties gather under the umbrella of the W3C and, after much deliberation,
issue a recommendation. Companies, individuals, or foundations such as Apache will then write
implementations of those recommendations. This article relies on three of them:
• XML 1.0
• XML Namespaces
• XML Schema
When producing XML, remember to escape text fields that might contain special characters such as &. This
is a common oversight.
A document that is not well formed is not really XML and doesn't conform to the W3C's stipulations for an
XML document. A parser will fail when given that document, even if validation is turned off.
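To make the point about escaping concrete, here is a minimal sketch that parses a string with validation turned off and reports whether the parser accepted it. The class and method names (WellFormedCheck, escapeText, isWellFormed) are made up for this illustration, and the escaping helper only covers element text, not attribute values.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Escapes the characters that are special inside element text.
    // The ampersand must be replaced first, or the other entities get mangled.
    static String escapeText(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Returns true if a non-validating parser accepts the string.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser gave up: the input is not well formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String name = "Smith & Sons";
        System.out.println(isWellFormed("<fullname>" + name + "</fullname>"));
        System.out.println(isWellFormed("<fullname>" + escapeText(name) + "</fullname>"));
    }
}
```

The first call fails on the bare ampersand even though no DTD or schema is involved; the second passes because the text was escaped before being placed in the element.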
To be valid, a document must be well formed, it must have an associated DTD or schema, and it must
comply with that DTD or schema. Ensuring a document is well formed is easy. In this article, we focus on
ensuring our documents are valid.
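Let's get right down to it. First, we're going to need an XML file to validate, for example:
<?xml version="1.0"?>
<order>
    <user>
        <fullname>Bob Jones</fullname>
        <deliveryAddress>123 This road, That town</deliveryAddress>
    </user>
    <products>
        <product id="12345" quantity="1" />
        <product id="3232" quantity="3" />
    </products>
</order>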
Save this document somewhere. We will use it later in this article to try validation and some interesting
schema rules.
The first line, <?xml version="1.0"?>, is the XML declaration, which opens the prolog. It is optional in XML 1.0
and compulsory in XML 1.1. If it is absent, parsers assume we're using XML 1.0, but we like to be thorough.
The schema
For the parser to validate our XML, we need a schema:
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified"
    xmlns="urn:nonstandard:test"
    targetNamespace="urn:nonstandard:test">
    <xsd:element name="order" type="Order" />
    <xsd:complexType name="Order">
        <xsd:all>
            <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
            <xsd:element name="products" type="Products" minOccurs="1" maxOccurs="1" />
        </xsd:all>
    </xsd:complexType>
    <xsd:complexType name="User">
        <xsd:all>
            <xsd:element name="deliveryAddress" type="xsd:string" />
            <xsd:element name="fullname">
                <xsd:simpleType>
                    <xsd:restriction base="xsd:string">
                        <xsd:maxLength value="30" />
                    </xsd:restriction>
                </xsd:simpleType>
            </xsd:element>
        </xsd:all>
    </xsd:complexType>
    <xsd:complexType name="Products">
        <xsd:sequence>
            <xsd:element name="product" type="Product" minOccurs="1" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Product">
        <xsd:attribute name="id" type="xsd:long" use="required" />
        <xsd:attribute name="quantity" type="xsd:positiveInteger" use="required" />
    </xsd:complexType>
</xsd:schema>
Save this schema as test.xsd in the same directory as the XML document. And, for the moment, ignore the
root node's attributes and the fact that everything is prefixed with xsd.
The first declaration, the element named order, says our document will have an element called order of type Order. This element is a global declaration
(with scope like a global variable). In fact, it is our only global element, so it will be the root element of any
document that conforms to this schema.
An element's type will be either built-in (such as string, long, or positiveInteger) or custom. Custom types
can be either a simpleType or a complexType. simpleType elements are variations on the built-in types:
either a restriction, a list, or a union. If the element has children or attributes, its type will always be a
complexType. For a full list of built-in types, see Resources.
Our Order is a complex type made up of two elements: user and products. These two elements are local.
We cannot refer to them anywhere outside the Order type. This distinction between global and local types
will prove important when we look at XML Namespaces.
The User type is again made up of two elements. The first, deliveryAddress, is of built-in type string. The
second, fullname, lacks a type in its element declaration. Instead, the type is given in-line. This is an
anonymous type in that we cannot refer to it anywhere else by name as it doesn't have a name. Anonymous
types prevent reuse, and I find them harder to read than named types. Unless a type is simple and unlikely
to be reused, avoiding anonymous types is best.
The type of fullname is the built-in string type, like deliveryAddress, but with the restriction that it has a
maximum length of 30 characters.
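The effect of the maxLength restriction can be checked in isolation. The sketch below uses the javax.xml.validation API (part of the JDK since J2SE 5) rather than the parser attributes used elsewhere in this article, and it declares fullname as a global element purely to keep the schema short; the FullnameRestriction class is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class FullnameRestriction {

    // Cut-down schema (assumption): fullname is global here so it can be the root element.
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:nonstandard:test' elementFormDefault='qualified'>"
        + "<xsd:element name='fullname'>"
        + "<xsd:simpleType>"
        + "<xsd:restriction base='xsd:string'>"
        + "<xsd:maxLength value='30'/>"
        + "</xsd:restriction>"
        + "</xsd:simpleType>"
        + "</xsd:element>"
        + "</xsd:schema>";

    // Returns true if a fullname element holding the given text validates.
    static boolean isValid(String fullname) {
        String xml = "<fullname xmlns='urn:nonstandard:test'>" + fullname + "</fullname>";
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema does not compile", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document broke a schema rule
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("Bob Jones"));
        System.out.println(isValid("A name that is well over thirty characters long"));
    }
}
```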
The Products type is simply a sequence of product entries. The sequence element allows its children to
appear multiple times (all does not).
Finally, the Product type has two attributes and no body. For an example of a type with both attributes and a
body, see the "Database Style Constraints: Primary Keys and Foreign Keys" section that appears later in
this article.
Add a schema
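We must link the document to the schema. To do this, we only need to change the root element. Thus, the
start of the document becomes:
<order xmlns="urn:nonstandard:test"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="urn:nonstandard:test file:./test.xsd">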
Edit the XML document you saved earlier and change the root element to match the entry above.
To understand what we have just added, we need to know about XML Namespaces, but first, let's review
URIs.
A Uniform Resource Name (URN) identifies a resource forever—a good example being a book's ISBN
number or a product's barcode.
A Uniform Resource Locator (URL) identifies a resource by its location. URIs, in the context of XML
Namespaces, are nearly always URLs. The URI identifying a namespace is not required to point to a
document, so, if the URI is pasted into a browser, it may not find anything. However, as the URI identifying
your namespace looks exactly like a URL, users will expect there to be something at that address, so it is
good practice to put something there. Sun and the W3C, for example, have pages at their namespace
URLs.
This article's example document does not have a URL as its namespace identifier; instead, it has a made-up
URN. Though unusual, it helps to show that the namespace identifier is just that: an identifier. In a real
application, our root element would probably read:
<order xmlns="http://www.mycompany.com/xml/myproject"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mycompany.com/xml/myproject file:./test.xsd">
Namespaces
An XML namespace is a collection of names, identified by a URI reference, which are used in XML
documents as element types and attribute names. A namespace in XML is a bit like a package in Java. It
groups a set of elements together. The type user in the urn:nonstandard:test namespace differs from a type
user in any other namespace.
Only one namespace can be the default—the others must be given a prefix. The xmlns attribute (which
comes from the XML Namespaces Recommendation) defines the default namespace—i.e., the namespace
for unprefixed elements. The form xmlns:xsd defines the namespace for entries prefixed with xsd (xsd is
commonly used for the schema prefix, but any prefix would do).
When defining a schema, we refer to our own types (Order, User, Product, etc.) and use types from the
schema namespace (element, complexType, string, etc.). For this reason, we usually prefix the schema
namespace. We could also prefix our types instead and use the schema namespace unprefixed. The first
part of our schema would then look like this:
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:nonstandard:test"
elementFormDefault="qualified" xmlns:ts="urn:nonstandard:test">
Prefixed names are called qualified names. They contain a single colon separating the name into a
namespace prefix and a local part. The prefix, which is mapped to a URI reference, selects a namespace.
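A quick way to see that the prefix is only a stand-in for the URI is to parse the same element written with different prefixes and compare the results. This is a small sketch; the QualifiedNames class is hypothetical, and the {uri}local notation it prints is a common convention, not something the parser produces itself.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class QualifiedNames {

    // Parses the XML and returns its root element as "{namespaceURI}localName".
    static String expandedRootName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, prefixes are not resolved
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            String uri = root.getNamespaceURI();
            return "{" + (uri == null ? "" : uri) + "}" + root.getLocalName();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedRootName("<ts:order xmlns:ts='urn:nonstandard:test'/>"));
        System.out.println(expandedRootName("<abc:order xmlns:abc='urn:nonstandard:test'/>"));
        System.out.println(expandedRootName("<order xmlns='urn:nonstandard:test'/>"));
        System.out.println(expandedRootName("<order/>"));
    }
}
```

The first three documents yield the same expanded name; only the last one, which declares no namespace, differs.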
In writing a schema, we define new elements and attributes. The targetNamespace attribute specifies the
namespace these new elements will be a part of. An XML document that conforms to this schema will import
that namespace (via an xmlns or xmlns:prefix attribute).
The xmlns:xsi attribute simply imports a namespace and maps it to the xsi prefix. The namespace here is a
special one: the XML Schema instance namespace. Every XML document that conforms to XML Schema
imports that namespace. The XMLSchema-instance schema declares only four attributes: type, nil,
schemaLocation, noNamespaceSchemaLocation.
The schemaLocation attribute indicates where to find the schema to validate each namespace. The format is
the namespace, a space, and the URL. Several namespace/URL pairs can be listed, separated by whitespace. Since we
are only interested in validating our namespace, we just declare the location of the schema for
urn:nonstandard:test—in this case, a file called test.xsd in the current directory (the schema we saved
earlier). In a real application, the location would usually be a publicly accessible URL. schemaLocation just
provides a hint to the parser; if the parser is given a different schema by the code invoking it, it will use that
schema, not schemaLocation's.
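The layout of a schemaLocation value is easy to misread, so here is a minimal sketch that splits one into its namespace/location pairs. In the XML Schema Recommendation the value is a whitespace-separated list, so pairs may be divided by spaces or line breaks. The SchemaLocationPairs class and the second namespace in main are invented for the example; a real parser does this work internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SchemaLocationPairs {

    // Splits a schemaLocation value into namespace -> schema location entries.
    static Map<String, String> parse(String schemaLocation) {
        String[] tokens = schemaLocation.trim().split("\\s+");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException("schemaLocation needs namespace/location pairs");
        }
        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i += 2) {
            pairs.put(tokens[i], tokens[i + 1]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The second pair is hypothetical, added only to show a multi-pair value
        String value = "urn:nonstandard:test file:./test.xsd\n"
                     + "urn:nonstandard:other file:./other.xsd";
        for (Map.Entry<String, String> pair : parse(value).entrySet()) {
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}
```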
If the XML document we want to validate comes in at the interface between our application and the external
world, we will probably want to use our own copy of the schema for validation. For internal documents,
trusting the document's header is probably okay.
The targetNamespace and the schemaLocation are attributes of a schema's root element. An XML Schema
document's root element (xsd:schema) must always declare at least the XML Schema namespace, as
xmlns:xsd="http://www.w3.org/2001/XMLSchema" does in our schema.
Unless all our types are anonymous, we must include the namespace of the entries we are defining for use
within the document. This namespace is usually unprefixed: xmlns="sameAsTargetNamespace".
The elementFormDefault attribute indicates whether locally declared elements should be qualified (prefixed)
or not. The following section describes that attribute.
(...)
<xsd:element name="order" type="Order" />
<xsd:element name="user" type="User" />
<xsd:element name="products" type="Products" />
<xsd:complexType name="Order">
    <xsd:all>
        <xsd:element ref="user" minOccurs="1" maxOccurs="1" />
        <xsd:element ref="products" minOccurs="1" maxOccurs="1" />
    </xsd:all>
</xsd:complexType>
(...)
In a DTD, all elements are global, which can make DTDs difficult to read. In a schema, declaring only the
root element as global makes it easier to read.
A schema's root element can take the elementFormDefault attribute, which indicates whether locally
declared elements should be qualified or unqualified. If elementFormDefault is unqualified (the default), our
XML document will need to specify which namespace the global elements are in (remember our only global
element is the root node order), but not where the local elements are located. If elementFormDefault is
unqualified, declaring a namespace for local elements will result in an error.
This document shows unqualified locally declared elements, which is how your documents will usually look:
<ts:order xmlns:ts="urn:nonstandard:test">
    <user>
        <!-- etc -->
    </user>
</ts:order>
This tells the parser that order is in the urn:nonstandard:test namespace and says nothing about user.
Internally, order turns into urn:nonstandard:test:order, but user remains as is. It is not qualified by a
namespace, but instead is assumed to be in the namespace of its first global parent, in this case order.
With elementFormDefault set to qualified, user must be qualified as well:
<ts:order xmlns:ts="urn:nonstandard:test">
    <ts:user>
        <!-- etc -->
    </ts:user>
</ts:order>
In qualified mode, the parser does not assume anything about local elements—we must specify their
namespaces too. Internally, order becomes urn:nonstandard:test:order, as before, and user now becomes
urn:nonstandard:test:user. The internal expansion of namespace prefixes is important, because it explains
why the example below will only work if the schema is set as elementFormDefault='qualified':
<order xmlns="urn:nonstandard:test">
    <user>
        <!-- etc -->
    </user>
</order>
Here, we declare the default namespace as urn:nonstandard:test, so all elements are assumed to be in that
namespace. It is an easier-to-read version of the example above, where we qualified everything.
If elementFormDefault had been left as unqualified (remember, that's the default) we would get the error:
error: cvc-complex-type.2.4.a: Invalid content was found starting with element 'user'. One of '{"":user,
"":products}' is expected.
This error indicates that the parser was looking for an unqualified local element ("":user), but instead found
an element qualified by the default namespace (urn:nonstandard:test:user).
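The interplay between elementFormDefault and the way a document declares its namespace can be tried out without any files. The sketch below is an illustration under assumptions: it uses javax.xml.validation instead of the parser attributes shown in this article, shrinks the schema to an order with a single string-typed user child, and the ElementFormDefaultDemo class is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class ElementFormDefaultDemo {

    // Everything is in the default namespace, including the local element user.
    static final String DEFAULT_NS_DOC =
        "<order xmlns='urn:nonstandard:test'><user>x</user></order>";
    // Only the root is qualified; user has no namespace.
    static final String PREFIXED_ROOT_DOC =
        "<ts:order xmlns:ts='urn:nonstandard:test'><user>x</user></ts:order>";

    // Builds a cut-down schema with the given elementFormDefault value.
    static String schema(String form) {
        return "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
             + " xmlns='urn:nonstandard:test' targetNamespace='urn:nonstandard:test'"
             + " elementFormDefault='" + form + "'>"
             + "<xsd:element name='order' type='Order'/>"
             + "<xsd:complexType name='Order'><xsd:all>"
             + "<xsd:element name='user' type='xsd:string'/>"
             + "</xsd:all></xsd:complexType>"
             + "</xsd:schema>";
    }

    static boolean isValid(String xsd, String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema does not compile", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation error
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(schema("qualified"), DEFAULT_NS_DOC));
        System.out.println(isValid(schema("unqualified"), DEFAULT_NS_DOC));
        System.out.println(isValid(schema("unqualified"), PREFIXED_ROOT_DOC));
        System.out.println(isValid(schema("qualified"), PREFIXED_ROOT_DOC));
    }
}
```

Each document is accepted by exactly one of the two settings, which is why a mismatch between the schema's elementFormDefault and the document's namespace declarations is such a common source of confusing errors.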
The schema element can also be given the attributeFormDefault attribute, which behaves exactly like
elementFormDefault, but for attributes.
Now that we have an XML file and a schema we understand, let's validate the first against the second.
If a different parser is on the classpath, JAXP will automatically use that parser. For example, if you include
Oracle's parser on the classpath, the DocumentBuilderFactory you get from
DocumentBuilderFactory.newInstance() will be an Oracle implementation, instead of one based on Xerces,
which is packaged into J2SE 5.
If you have J2SE 5, the code below should work as is. If you have 1.4, then download either Xerces or
Oracle XDK, and make sure it is on your classpath. Both those parsers are XML Namespaces and XML
Schema-aware. If you have an earlier version of Java, you'll also need to download JAXP.
This class takes an XML file as a command line argument, validates it using a parser obtained via JAXP, and
prints the name of the XML document's root node:
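import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

public class XmlTester {
    public static void main(String[] args) {
        String xmlFile = args[0];
        try {
            XmlTester xmlTester = new XmlTester(xmlFile);
        }
        catch (Exception e) {
            System.out.println( e.getClass().getName() +": "+ e.getMessage() );
        }
    }

    public XmlTester(String xmlFile) throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        System.out.println("DocumentBuilderFactory: " + factory.getClass().getName());
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        factory.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                "http://www.w3.org/2001/XMLSchema");
        // Specify our own schema - this overrides the schemaLocation in the xml file
        //factory.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaSource", "file:./test.xsd");
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new SimpleErrorHandler());
        Document document = builder.parse(xmlFile);
        Node rootNode = document.getDocumentElement();
        System.out.println("Root node: " + rootNode.getNodeName());
    }
}

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;
// Assumed behavior: print warnings, rethrow errors so that an invalid document fails.
public class SimpleErrorHandler implements ErrorHandler {
    public void warning(SAXParseException e) { System.out.println(e.getMessage()); }
    public void error(SAXParseException e) throws SAXParseException { throw e; }
    public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
}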
Here are the DocumentBuilderFactory methods to set those two standard features:
factory.setNamespaceAware(true);
factory.setValidating(true);
Unfortunately, the feature for turning on schema validation has not been standardized. With JAXP you do:
factory.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
"http://www.w3.org/2001/XMLSchema");
Save the XmlTester class given at the start of this section to XmlTester.java and the error handler that
follows it to SimpleErrorHandler.java, in the same directory as the document (XML file) and schema.
Compile them, then run using java XmlTester test.xml. The name of the class implementing
DocumentBuilderFactory and the document's root node should print. You should now be able to change the
XML document and/or the schema and check validation fails if they don't match.
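If you would rather experiment without files, the same JAXP attributes can be given an in-memory schema. This is a sketch under a few assumptions: it relies on the schemaSource property accepting an InputStream (which the Xerces-based parser bundled with the JDK does), it uses a cut-down schema, and the JaxpSchemaSourceDemo class is hypothetical. A fresh factory is created for each call because the stream can only be read once.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class JaxpSchemaSourceDemo {
    static final String SCHEMA_LANGUAGE = "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String SCHEMA_SOURCE = "http://java.sun.com/xml/jaxp/properties/schemaSource";

    // Cut-down schema (assumption): an order holding a single string-typed user.
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
        + " xmlns='urn:nonstandard:test' targetNamespace='urn:nonstandard:test'"
        + " elementFormDefault='qualified'>"
        + "<xsd:element name='order' type='Order'/>"
        + "<xsd:complexType name='Order'><xsd:all>"
        + "<xsd:element name='user' type='xsd:string'/>"
        + "</xsd:all></xsd:complexType>"
        + "</xsd:schema>";

    // Validates the XML against XSD and returns the error messages (empty if valid).
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            // schemaLanguage must be set before schemaSource
            factory.setAttribute(SCHEMA_LANGUAGE, "http://www.w3.org/2001/XMLSchema");
            factory.setAttribute(SCHEMA_SOURCE,
                    new ByteArrayInputStream(XSD.getBytes(StandardCharsets.UTF_8)));
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // collect validation errors and carry on
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            errors.add(e.getMessage()); // fatal error, e.g. not well formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate("<order xmlns='urn:nonstandard:test'><user>Bob</user></order>"));
        System.out.println(validate("<order xmlns='urn:nonstandard:test'><customer/></order>"));
    }
}
```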
Unless you are parsing big documents, you most likely won't use the Simple API for XML (SAX)
directly, as it is a cumbersome API. For most parsing, the W3C DOM included in J2SE is a good choice.
Learning and using the W3C DOM has a big benefit: it is standard, meaning that, in Python, Javascript, or
C#, you will use the same objects with the same methods. However, the W3C DOM can be verbose, so,
sometimes, you will want a more powerful API, or one that fits more naturally with Java.
Commons Digester
The best example of an API more powerful than DOM is Jakarta Commons Digester. This API turns XML
into Java objects on the fly. Commons Digester removes the need for manual XML parsing in cases where
you need to read the whole file and allows you to work with regular JavaBeans instead.
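Since Commons Digester turns XML into JavaBeans, we need a bean: a public class named Order with a public
no-argument constructor, saved alongside the tester class that follows:
import java.io.IOException;
import org.apache.commons.digester.Digester;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.SAXParseException;

public class DigesterTester {
    public static void main(String[] args) {
        if ( args.length != 1 ) {
            System.out.println("Usage: java DigesterTester myFile.xml");
            System.exit(-1);
        }
        String xmlFile = args[0];
        try {
            DigesterTester xmlTester = new DigesterTester(xmlFile);
        }
        catch (Exception e) {
            System.out.println( e.getClass().getName() +": "+ e.getMessage() );
        }
    }

    public DigesterTester(String xmlFile) throws IOException, SAXException {
        Digester digester = new Digester();
        digester.setValidating(true);
        digester.setNamespaceAware(true);
        digester.setProperty("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                "http://www.w3.org/2001/XMLSchema");
        // Assumed: reuse the error handler from the XmlTester example
        digester.setErrorHandler(new SimpleErrorHandler());
        // Create an Order bean when the order element is encountered
        digester.addObjectCreate("order", "Order");
        Object order = digester.parse(xmlFile);
        System.out.println("Order: " + order);
    }
}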
To compile this class, you need the Commons Digester jar on your classpath. To run it, you will need
Digester, Commons Collections, and Commons Logging on your classpath. These are all available from the
Jakarta Commons project.
Running the example on a valid XML file should print Order: order. An invalid file will produce an exception
stack trace.
The three important lines from our earlier XmlTester example resemble the ones in this DigesterTester class.
They are:
digester.setValidating(true);
digester.setNamespaceAware(true);
digester.setProperty("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
"http://www.w3.org/2001/XMLSchema");
JDOM
JDOM is an API similar to DOM, but fits more naturally with Java. For JDOM, the relevant sections are:
import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jdom.JDOMException;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
if ( args.length != 1 ) {
System.out.println("Usage: java JDOMTester file:./myFile.xml");
System.exit(-1);
}
String xmlFile = args[0];
try {
JDOMTester tester = new JDOMTester(xmlFile);
}
catch (Exception e) {
System.out.println( e.getClass().getName() +": "+ e.getMessage());
}
}
To compile and run this example, you need JDOM on the classpath. Also, note that JDOM takes its input as
a URL, so, for a local file, you need a URL like file:./test.xml.
One API that doesn't seem to provide a way for setting the schemaLanguage feature on the underlying
parser is dom4j. If you use dom4j, you will need to create your own SAXParser with validation turned on and
pass that to dom4j.
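If you use Xerces with dom4j and don't mind losing the ability to swap parsers, you can use the
Xerces-specific feature on a validating SAXReader (called reader here):
reader.setFeature("http://apache.org/xml/features/validation/schema", true);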
More schema
Now we're going to explore a range of things XML Schema can do for us:
• Schema validation
• Grouping
• Schema separation
• Adding uniqueness
Having the XML file, schema, and a program to validate one with the other will prove useful in this section. If
you haven't already, save them locally and try the validation.
Schema validation
A schema is itself an XML document, so it can be validated like any other. Change the root element of
test.xsd to:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2001/XMLSchema
        http://www.w3.org/2001/XMLSchema.xsd"
    xmlns="urn:nonstandard:test"
    elementFormDefault="qualified"
    targetNamespace="urn:nonstandard:test">
Note the two new attributes: one to link to the XML Schema instance namespace and a second to point to
the schema's schema. You should now be able to use the XmlTester class from earlier to validate this
schema by running: java XmlTester test.xsd. Validating this schema should take a little longer than with the
XML example earlier, as the parser has to fetch the schema from the W3C site.
Grouping
A complexType has three ways of grouping its child elements: all, sequence, or choice.
All indicates that each of the elements listed can appear zero or one time, in any order. In this example, user and
products must both appear once, in any order; otherwise, neither can appear:
<xsd:complexType name="Order">
    <xsd:all minOccurs="0">
        <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
        <xsd:element name="products" type="Products" minOccurs="1" maxOccurs="1" />
    </xsd:all>
</xsd:complexType>
In this example, user must appear once and products is optional:
<xsd:complexType name="Order">
    <xsd:all minOccurs="1">
        <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
        <xsd:element name="products" type="Products" minOccurs="0" maxOccurs="1" />
    </xsd:all>
</xsd:complexType>
In a sequence, the ordering of elements is strictly enforced, and each element can appear as many times as its minOccurs and maxOccurs allow.
This example says user must always come before products, and several products entries can appear:
<xsd:complexType name="Order">
    <xsd:sequence>
        <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
        <xsd:element name="products" type="Products" minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
</xsd:complexType>
The choice element only allows one of its children to be used, but the child element used can appear
multiple times (if its maxOccurs allows it). This example allows either one user element or one or more
products elements:
<xsd:complexType name="Order">
    <xsd:choice>
        <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
        <xsd:element name="products" type="Products" minOccurs="1" maxOccurs="unbounded" />
    </xsd:choice>
</xsd:complexType>
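The three grouping elements are easiest to compare by running the same documents against each. The following sketch swaps the grouping element into an otherwise identical schema. It is an illustration only: user and products are typed as plain strings to keep it short, each child uses the default minOccurs and maxOccurs of 1, validation goes through javax.xml.validation, and the GroupingDemo class is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class GroupingDemo {

    // Builds a schema whose Order type groups its children with all, sequence, or choice.
    static String schema(String group) {
        return "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
             + " xmlns='urn:nonstandard:test' targetNamespace='urn:nonstandard:test'"
             + " elementFormDefault='qualified'>"
             + "<xsd:element name='order' type='Order'/>"
             + "<xsd:complexType name='Order'><xsd:" + group + ">"
             + "<xsd:element name='user' type='xsd:string'/>"
             + "<xsd:element name='products' type='xsd:string'/>"
             + "</xsd:" + group + "></xsd:complexType>"
             + "</xsd:schema>";
    }

    // Builds an order document containing the named empty children, in the order given.
    static String order(String... children) {
        StringBuilder xml = new StringBuilder("<order xmlns='urn:nonstandard:test'>");
        for (String child : children) {
            xml.append('<').append(child).append("/>");
        }
        return xml.append("</order>").toString();
    }

    static boolean isValid(String group, String... children) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(schema(group))));
        } catch (SAXException e) {
            throw new IllegalStateException("schema does not compile", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(order(children))));
            return true;
        } catch (SAXException e) {
            return false; // validation error
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String group : new String[] {"all", "sequence", "choice"}) {
            System.out.println(group
                + ": user,products=" + isValid(group, "user", "products")
                + " products,user=" + isValid(group, "products", "user")
                + " user only=" + isValid(group, "user"));
        }
    }
}
```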
Schema separation
A schema can be split across several files. Here, the product types move to a file of their own:
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns="urn:nonstandard:test"
    elementFormDefault="qualified"
    targetNamespace="urn:nonstandard:test">
    <xsd:complexType name="Products">
        <xsd:sequence>
            <xsd:element name="product" type="Product" minOccurs="1" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Product">
        <xsd:attribute name="id" type="xsd:long" use="required" />
        <xsd:attribute name="quantity" type="xsd:positiveInteger" use="required" />
    </xsd:complexType>
</xsd:schema>
Save this code as products.xsd. Then remove the Products and Product types from test.xsd and insert this
line right after the root node and before the declaration of the order element:
<xsd:include schemaLocation="products.xsd" />
This tells our schema to include the types defined in products.xsd (which are the types we have just
removed from test.xsd).
In products.xsd, we use the same target namespace as the file it is included in, which is a requirement for
using the include tag. To use types from a different namespace, we would use the import tag. See
Resources for more information.
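The include mechanism can be tried end to end by writing both files to a temporary directory and compiling the main one; the included file is then resolved relative to it. This sketch makes several assumptions: it uses javax.xml.validation rather than the parser attributes used earlier, it declares products as the global root element to keep the main schema tiny, and the IncludeDemo class is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class IncludeDemo {
    static final String HEADER =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
        + " xmlns='urn:nonstandard:test' targetNamespace='urn:nonstandard:test'"
        + " elementFormDefault='qualified'>";

    static final String PRODUCTS_XSD = HEADER
        + "<xsd:complexType name='Products'><xsd:sequence>"
        + "<xsd:element name='product' type='Product' maxOccurs='unbounded'/>"
        + "</xsd:sequence></xsd:complexType>"
        + "<xsd:complexType name='Product'>"
        + "<xsd:attribute name='id' type='xsd:long' use='required'/>"
        + "<xsd:attribute name='quantity' type='xsd:positiveInteger' use='required'/>"
        + "</xsd:complexType></xsd:schema>";

    // The main schema only includes products.xsd and declares a root element (assumption).
    static final String TEST_XSD = HEADER
        + "<xsd:include schemaLocation='products.xsd'/>"
        + "<xsd:element name='products' type='Products'/>"
        + "</xsd:schema>";

    static boolean isValid(String xml) {
        try {
            Path dir = Files.createTempDirectory("xsd-include");
            Files.write(dir.resolve("products.xsd"), PRODUCTS_XSD.getBytes(StandardCharsets.UTF_8));
            Path main = dir.resolve("test.xsd");
            Files.write(main, TEST_XSD.getBytes(StandardCharsets.UTF_8));
            // Compiling from a File lets the include be resolved relative to test.xsd
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(main.toFile());
            try {
                schema.newValidator().validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException invalid) {
                return false; // validation error
            }
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<products xmlns='urn:nonstandard:test'><product id='1' quantity='2'/></products>"));
    }
}
```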
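Adding uniqueness
To require that each product id appears only once within an order, change the declaration of the order
element to:
<xsd:element name="order" type="Order">
    <xsd:unique name="productIdUnique">
        <xsd:selector xpath="test:products/test:product" />
        <xsd:field xpath="@id" />
    </xsd:unique>
</xsd:element>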
We have added a unique element, which uses XPath to find the elements that must be unique. As we are in
the order element, we can only declare a unique constraint on children of that node. The
test:products/test:product expression selects all the product nodes that are direct children of the products
node that are direct children of the current node (order). The @id selects the id attribute of the nodes we
have just selected (i.e., the product nodes).
See Resources for more details on the XPath syntax and for a good XPath tutorial.
In XPath, there is no way to select the default namespace. If our document had no namespace (and we
used a xsi:noNamespaceSchemaLocation), then we could select the product node simply by using the
XPath expression products/product.
We do have a namespace that we import as the default. XPath has no syntax for selecting that namespace,
so we must import the same namespace again with a prefix (here, by adding xmlns:test="urn:nonstandard:test"
to the schema's root element).
Neither Xerces nor Oracle XDK warns you if the XPath selector matches nothing, so, if you don't receive the
expected results, check the expression carefully and remember to take namespaces into account.
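The silent failure is easy to reproduce. The sketch below validates orders with duplicate product ids against two variants of a cut-down schema: one whose selector uses the test prefix and one whose selector leaves it out. It is an illustration under assumptions: validation goes through javax.xml.validation, the products carry only an id attribute, and the UniqueDemo class is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class UniqueDemo {

    // Builds a schema whose unique constraint uses the given selector XPath.
    static String schema(String selector) {
        return "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
             + " xmlns='urn:nonstandard:test' xmlns:test='urn:nonstandard:test'"
             + " targetNamespace='urn:nonstandard:test' elementFormDefault='qualified'>"
             + "<xsd:element name='order' type='Order'>"
             + "<xsd:unique name='productIdUnique'>"
             + "<xsd:selector xpath='" + selector + "'/>"
             + "<xsd:field xpath='@id'/>"
             + "</xsd:unique>"
             + "</xsd:element>"
             + "<xsd:complexType name='Order'><xsd:sequence>"
             + "<xsd:element name='products' type='Products'/>"
             + "</xsd:sequence></xsd:complexType>"
             + "<xsd:complexType name='Products'><xsd:sequence>"
             + "<xsd:element name='product' type='Product' maxOccurs='unbounded'/>"
             + "</xsd:sequence></xsd:complexType>"
             + "<xsd:complexType name='Product'>"
             + "<xsd:attribute name='id' type='xsd:long' use='required'/>"
             + "</xsd:complexType>"
             + "</xsd:schema>";
    }

    // Builds an order with one product per id.
    static String order(long... ids) {
        StringBuilder xml = new StringBuilder("<order xmlns='urn:nonstandard:test'><products>");
        for (long id : ids) {
            xml.append("<product id='").append(id).append("'/>");
        }
        return xml.append("</products></order>").toString();
    }

    static boolean isValid(String selector, long... ids) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(schema(selector))));
        } catch (SAXException e) {
            throw new IllegalStateException("schema does not compile", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(order(ids))));
            return true;
        } catch (SAXException e) {
            return false; // validation error
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("test:products/test:product", 12345, 3232));
        System.out.println(isValid("test:products/test:product", 12345, 12345));
        // Unprefixed selector matches nothing, so the duplicate goes unnoticed
        System.out.println(isValid("products/product", 12345, 12345));
    }
}
```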
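Database Style Constraints: Primary Keys and Foreign Keys
First, add a prices element to the XML document, after products:
<prices>
    <price productId="12345">$34.99</price>
    <price productId="3232">$4.99</price>
</prices>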
Next, add a prices element into the Order complexType declaration, after products:
<xsd:element name="prices" type="Prices" minOccurs="1" maxOccurs="1" />
Then, add the Prices and Price types at the bottom of the file, just before the closing </xsd:schema> tag:
<xsd:complexType name="Prices">
    <xsd:sequence>
        <xsd:element name="price" type="Price" minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Price">
    <xsd:simpleContent>
        <xsd:extension base="xsd:string">
            <xsd:attribute name="productId" type="xsd:long" use="required" />
        </xsd:extension>
    </xsd:simpleContent>
</xsd:complexType>
Note how the Price type is the string type extended by adding an attribute to it.
Now we need to define product's id attribute as a primary key. Simply change the unique element we added
in order earlier to read:
<xsd:key name="productIdKey">
    <xsd:selector xpath="test:products/test:product" />
    <xsd:field xpath="@id" />
</xsd:key>
The XPath expression is the same as before. The field(s) that form a key are always unique, just like a
unique element, but additionally they must be present.
Finally, we make the price's productId attribute a foreign key to the product's id attribute. Add a keyref so that
the order element becomes:
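<xsd:element name="order" type="Order">
    <xsd:key name="productIdKey">
        <xsd:selector xpath="test:products/test:product" />
        <xsd:field xpath="@id" />
    </xsd:key>
    <!-- The keyref's own name is illustrative; refer must match the key's name -->
    <xsd:keyref name="priceProductIdRef" refer="productIdKey">
        <xsd:selector xpath="test:prices/test:price" />
        <xsd:field xpath="@productId" />
    </xsd:keyref>
</xsd:element>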
The syntax should be familiar by now. These new constraints will produce a validation error if you put a price
in with a productId that does not match the id of any product in the order.
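To see the foreign key at work, the sketch below validates one order whose price points at an existing product and one whose price points nowhere. It rests on assumptions: the schema is a cut-down version with a single product and a single price, the keyref name priceProductIdRef is invented, validation goes through javax.xml.validation, and the KeyrefDemo class is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class KeyrefDemo {
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
        + " xmlns='urn:nonstandard:test' xmlns:test='urn:nonstandard:test'"
        + " targetNamespace='urn:nonstandard:test' elementFormDefault='qualified'>"
        + "<xsd:element name='order' type='Order'>"
        + "<xsd:key name='productIdKey'>"
        + "<xsd:selector xpath='test:products/test:product'/><xsd:field xpath='@id'/>"
        + "</xsd:key>"
        // priceProductIdRef is a made-up name; refer points at the key above
        + "<xsd:keyref name='priceProductIdRef' refer='productIdKey'>"
        + "<xsd:selector xpath='test:prices/test:price'/><xsd:field xpath='@productId'/>"
        + "</xsd:keyref>"
        + "</xsd:element>"
        + "<xsd:complexType name='Order'><xsd:sequence>"
        + "<xsd:element name='products' type='Products'/>"
        + "<xsd:element name='prices' type='Prices'/>"
        + "</xsd:sequence></xsd:complexType>"
        + "<xsd:complexType name='Products'><xsd:sequence>"
        + "<xsd:element name='product' type='Product' maxOccurs='unbounded'/>"
        + "</xsd:sequence></xsd:complexType>"
        + "<xsd:complexType name='Product'>"
        + "<xsd:attribute name='id' type='xsd:long' use='required'/>"
        + "</xsd:complexType>"
        + "<xsd:complexType name='Prices'><xsd:sequence>"
        + "<xsd:element name='price' type='Price' maxOccurs='unbounded'/>"
        + "</xsd:sequence></xsd:complexType>"
        + "<xsd:complexType name='Price'><xsd:simpleContent>"
        + "<xsd:extension base='xsd:string'>"
        + "<xsd:attribute name='productId' type='xsd:long' use='required'/>"
        + "</xsd:extension></xsd:simpleContent></xsd:complexType>"
        + "</xsd:schema>";

    // Validates an order with one product and one price that refers to priceProductId.
    static boolean isValid(long productId, long priceProductId) {
        String xml = "<order xmlns='urn:nonstandard:test'>"
                   + "<products><product id='" + productId + "'/></products>"
                   + "<prices><price productId='" + priceProductId + "'>$4.99</price></prices>"
                   + "</order>";
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema does not compile", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation error, e.g. a dangling keyref
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(12345, 12345));
        System.out.println(isValid(12345, 3232));
    }
}
```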
As we have seen, XML Schema provides a powerful way of putting constraints on data. However, there are
other ways of ensuring data integrity, and schema constraints are not always the best choice.
Conclusion
XML Schema is an easy-to-learn and powerful way of describing XML data. The next time you need an XML
document, sketch how it should appear, then write a schema, and get your code to validate it. If you
encounter difficulties working with XML Schema, they are more likely to do with XML Namespaces than with
the schema syntax itself.
To learn more about XML Schema, a good place to start for a complete yet digestible schema reference is
the XML Schema Recommendation.