Trinity College
Fundamentals of XML
Owen.Conlan@scss.tcd.ie
What is Markup
• <Lecture>
• Sequence of characters within a text or word
processing file to define
– Print properties
– Display properties
– Document's logical structure
• Markup indicators are often called "tags"
– Examples
• RTF: \ { }
• EDIFACT: + : '
• XML: < > </ > "
Mark Up: RTF
\li0\ri0\sb240\sa60\keepn\widctlpar\aspalpha\aspnum\faauto\outlinelevel2\a
djustright\rin0\lin0\itap0
\b\f1\fs26\lang2057\langfe1033\cgrid\langnp2057\langfenp1033
{\lang6153\langfe1033\langnp6153 Entity Relationship Diagram
\par }\pard\plain \s1\ql
\li0\ri0\sb240\sa60\keepn\widctlpar\aspalpha\aspnum\faauto\outlinelevel0\a
djustright\rin0\lin0\itap0 \cbpat17
\b\f1\fs24\lang2057\langfe1033\kerning32\cgrid\langnp2057\langfenp1033
{\lang6153\langfe1033\langnp6153 Entity Type
\par }\pard\plain \ql
\li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0
\fs24\lang2057\langfe1033\cgrid\langnp2057\langfenp1033
{\b\fs20\ul\lang6153\langfe1033\langnp6153
Def.:}{\b\fs20\lang6153\langfe1033\langnp6153 }{
\fs20\lang6153\langfe1033\langnp6153 An object or concept that is
identified by the enterprise as having an independent existence.
\par }\pard\plain \s1\ql
\li0\ri0\sb240\sa60\keepn\widctlpar\aspalpha\aspnum\faauto\outlinelevel0\a
djustright\rin0\lin0\itap0 \cbpat17
\b\f1\fs24\lang2057\langfe1033\kerning32\cgrid\langnp2057\langfenp1033
{\lang6153\langfe1033\langnp6153 Entity
\par }\pard\plain \ql
Mark Up: EDIFACT
'''ED2'''OPENET:1111111:OVT':003705655815:OVT'ABC1234567'0'TYP:ORDERS'N
RQ:1'''
UNA:+.?
'UNB+UNOC:2+003705655815:30+1111111:30+980729:2233+4++ORDERS911+++KKK
KATE+1'UNH+
1+ORDERS:001:911:UN:FI0030'BGM+640+1234567'DTM+4:19981201:102'DTM+2:199
90101:102'DTM+2:9901:616'RFF+BC:123'RFF+VN:123456'NAD+BY+003705655815:1
00'
NAD+SE+11111111::92'NAD+PL+53432::92++KAUPPA:KAUPUNKI+KATU
9+KAUPUNKI++00007'NAD+CN+-::ZZ++TERMINAALI+OVI 42+TOINEN
KAUPUNKI++00069'UNS+D'LIN+1++23442423234
:EN'PIA+5+3244:MF'PIA+5+2341234324:ZBU'PIA+5+234243:ZCG'IMD+F+8+-
::91:KUKKAPUR
KKI:SAVI'QTY+21:8:KPL'FTX+AAA+++T.HARMAA:V[RI'FTX+AAA+++10:KOKO'PRI+NTP
:7.23:+
RP:7.32:PE'TAX+7+VAT+++:::22.00'LIN+2++543434554345:EN'PIA+5+535:MF'PIA
+5+45:
PCE'UNT+38+2'UNZ+2+4'
'''EOF'''9'
Mark Up: XML
<fragment>
<section>
<title>Introduction</title>
<para>Since the emergence of <acronym refid="xml">XML</acronym> in
early 1998 and its subsequent adoption across diverse application
domains, one of the key benefits it enabled was the separation of
content and presentation <bibref refloc="Bos97"/>. <acronym
refid="xml">XML</acronym> borrowed this model (along with other
important concepts) from the <acronym.grp><acronym
refid="sgml">SGML</acronym><expansion id="sgml">Standard
Generalised Markup Language</expansion></acronym.grp>. An
<acronym refid="sgml">SGML</acronym> document consists of
logically structured content and uses a separate file (style
sheet) to specify how the content should be formatted for
[...]
<figure id="img1">
<title>ePublishing Components</title>
<graphic href="02-04-03-fig01.jpg" width="321" height="214"/>
</figure>
</section>
</fragment>
What is SGML?
• Standard Generalised Mark-Up Language
• ISO standard since 1986
• Meta-language for defining document mark-up vocabularies
<!DOCTYPE anthology [
  <!ELEMENT anthology - - (poem+) >
  <!ELEMENT poem - - (title?, stanza+)>
  <!ELEMENT title - O (#PCDATA) >
  <!ELEMENT stanza - O (line+) >
  <!ELEMENT line O O (#PCDATA) >
]>
Meta Languages
• SGML, XML
Vocabularies
• XSL, HTML, SMIL, XHTML, SVG, SynExML, XTM
• HL7 v3, CEN TC251, ASTM 31.25
• XML Namespaces
XML Declaration
• Placed at the start of an XML document
• Informs XML software of
– the version of XML the document conforms to
– the character encoding scheme used in the document
– whether or not a set of external declarations affect the interpretation of this document
<?xml version="1.0" ?>
<?xml version="1.0" encoding="UTF-8" ?>
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
Elements
• Define logical structure and sections of XML documents
• Four different content types:
– Data content
– Element content
– Mixed content
– Empty
• Each element must be completely enclosed by another element, except for the root
• Note
– Any XML name must start with a letter or underscore; after that it can also include digits, full stops and hyphens
– Don't start a name with a colon (colons are reserved for namespaces)
– Don't include spaces
<?xml version="1.0" ?>
<doc>
  <title>Java Gently</title>
  <author>Judy Bishop</author>
  <publisher name='HH' />
  <chapter>
    <thetext> this is <bold> bold </bold> text </thetext>
    <paragraph/>
  </chapter>
</doc>
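The four content types can be told apart programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `content_type` is made up for illustration, and it assumes that whitespace-only text between child elements does not count as data.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
  <title>Java Gently</title>
  <author>Judy Bishop</author>
  <publisher name='HH' />
  <chapter>
    <thetext> this is <bold> bold </bold> text </thetext>
    <paragraph/>
  </chapter>
</doc>"""


def content_type(elem):
    """Classify an element as 'empty', 'data', 'element' or 'mixed' (hypothetical helper)."""
    has_children = len(elem) > 0
    # Character data directly inside elem: before the first child, or after any child.
    texts = [elem.text] + [child.tail for child in elem]
    has_text = any(t and t.strip() for t in texts)  # assumption: ignore pure whitespace
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "data"
    return "empty"


root = ET.fromstring(DOC)
kinds = {elem.tag: content_type(elem) for elem in root.iter()}
for tag, kind in kinds.items():
    print(tag, kind)
```

Here `thetext` comes out as mixed content because it holds both character data and a `bold` child, while `publisher` and `paragraph` are empty.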
Attributes
• Provide additional information about an element
• Attributes are contained within the start-tag
• Consist of a name and an associated value separated by an equals sign
• The attribute value must always be enclosed by quotes
• The order of attributes is insignificant
<?xml version="1.0" ?>
<doc type="book" isbn="0-201-71050-1">
  <title>Java Gently</title>
  <author>Judy Bishop</author>
  <chapter>
    <paragraph type="abstract">
      In this book ...
    </paragraph>
  </chapter>
</doc>
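As an illustration of how an application sees attributes, here is a small sketch with Python's `xml.etree.ElementTree`. The `edition` attribute and its fallback value are assumptions added only to show a lookup that fails.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<doc type="book" isbn="0-201-71050-1">'
    '<chapter><paragraph type="abstract">In this book ...</paragraph></chapter>'
    '</doc>'
)

doc_type = root.get("type")             # attribute value as a string
isbn = root.attrib["isbn"]              # attributes are exposed as a dict
edition = root.get("edition", "unknown")  # hypothetical attribute, absent here

# Same attributes written in a different order compare equal.
reordered = ET.fromstring('<doc isbn="0-201-71050-1" type="book"/>')
same_attributes = reordered.attrib == root.attrib

print(doc_type, isbn, edition, same_attributes)
```

Because the attributes are held in a dictionary, the comparison at the end does not depend on the order in which they were written.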
ELEMENT vs. ATTRIBUTE
• Lexically little difference,
• application specific,
• no hard/fast rules available.
ELEMENT
• Constituent data,
• Used for content,
• White space can be ignored or preserved,
• Nesting allowed (child elements),
• Convenient for large values, or binary entities.
ATTRIBUTE
• Inherent data,
• Used for meta-data,
• No further nesting possible (atomic data),
• Default values,
• Minimal datatypes.
Entities
• Storage units for repeated text
– Defined in a DTD
• Character entities are used to insert characters that cannot be typed directly
• XML contains a number of 'built-in' entities
– &quot; for "
– &apos; for '
– &lt; for <
– &gt; for >
– &amp; for &
<math>
  5 &lt; 6 and 6 &gt; 5
</math>
<copyright>
  &copyright-notice;
</copyright>
<bullet>
  XML contains a number of 'built-in' entities
  <list>
    <item>&quot;</item>
    <item>&apos;</item>
    <item>&lt;</item>
    <item>&gt;</item>
    <item>&amp;</item>
  </list>
</bullet>
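The behaviour of the built-in entities can be observed with a standard parser. This sketch uses Python's `xml.etree.ElementTree` and `xml.sax.saxutils.escape`; it also shows that a custom entity such as `copyright-notice` is rejected when no DTD declares it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The five built-in entities are resolved by the parser without any DTD.
items = ET.fromstring(
    "<list><item>&quot;</item><item>&apos;</item><item>&lt;</item>"
    "<item>&gt;</item><item>&amp;</item></list>"
)
item_texts = [item.text for item in items]

# Going the other way: escape() replaces &, < and > before text is embedded in markup.
escaped = escape("5 < 6 and 6 > 5")
math = ET.fromstring("<math>" + escaped + "</math>")

# An entity that is not built in and not declared in a DTD is an error.
try:
    ET.fromstring("<copyright>&copyright-notice;</copyright>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(item_texts, escaped, math.text, undeclared_ok)
```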
Character Data Sections
• Data which is to be parsed is called PCDATA
• An XML parser will not treat the contents of a CDATA section as markup
– Used to simplify mark-up by escaping a selection of text
• Entity references are not resolved
• Useful for including source code in XML
<![CDATA[
You don't need to escape special characters in CDATA
sections, such as <, >, &, ' and ".
]]>
<![CDATA[<<< STOP now >>>]]>
<![CDATA[<?xml version='1.0'?>
<person>
  <name>Mike</name>
  <age>24</age>
</person>]]>
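A short sketch of how a parser hands CDATA content to the application, using Python's `xml.etree.ElementTree`. The wrapping `code` element is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# 'code' is a hypothetical wrapper element for the CDATA section.
source = "<code><![CDATA[<person><name>Mike</name></person> &amp; more]]></code>"
code = ET.fromstring(source)

# The markup inside the CDATA section arrives as plain text, not as child elements,
# and the entity reference &amp; is left unresolved.
cdata_text = code.text
child_count = len(code)

# ElementTree does not write CDATA sections back out; it escapes the text instead.
serialised = ET.tostring(code, encoding="unicode")
round_trip_text = ET.fromstring(serialised).text

print(cdata_text)
print(serialised)
```

The serialised form differs lexically from the input, but parsing it again yields the same character data.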
Processing Instructions
• Pass additional information to application (e.g. parser)
• Application-specific instructions
• Consists of a PI Target and PI Value
• Processed by applications that recognise the PI Target
<?xml-stylesheet type='text/css' href='style.css'?>
<?xml-stylesheet type='text/xsl' href='style.xsl'?>
<?myapp filename='test.txt'?>
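To illustrate the split into PI Target and PI Value, the sketch below reads processing instructions with Python's `xml.dom.minidom`. The empty `doc` root element is an assumption needed to make the sample a complete document.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<?xml version='1.0'?>"
    "<?xml-stylesheet type='text/css' href='style.css'?>"
    "<?myapp filename='test.txt'?>"
    "<doc/>"  # hypothetical root element
)

# Processing instructions before the root are children of the document node.
pis = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == minidom.Node.PROCESSING_INSTRUCTION_NODE
]

# An application only acts on the targets it recognises.
mine = [data for target, data in pis if target == "myapp"]

print(pis)
print(mine)
```

Note that the XML declaration does not show up in the list: it looks like a processing instruction but is not one.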
Comments
• Used to comment XML documents
• Not considered to be part of an XML document
• An XML parser is not required to pass comments to higher-level applications
<!-- one-line comment -->
<!--
This
is a
multi-line comment
-->
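Python's `xml.etree.ElementTree` is one example of a parser that drops comments by default. The sketch below shows this, and how comments can be kept on request; the `insert_comments` option assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

SOURCE = "<doc><!-- one-line comment --><title>Java Gently</title></doc>"

# Default behaviour: the comment never reaches the application.
default_root = ET.fromstring(SOURCE)
default_children = [child.tag for child in default_root]

# Opt-in behaviour (Python 3.8+): ask the tree builder to keep comments.
keeping_parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept_root = ET.fromstring(SOURCE, parser=keeping_parser)
comments = [child.text for child in kept_root if child.tag is ET.Comment]

print(default_children)
print(comments)
```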
Well formed XML
• XML Declaration recommended (optional in XML 1.0, required in XML 1.1)
• At least one element
– Exactly one root element
• Empty elements are written in one of two ways:
– Closing tag (e.g. "<br></br>")
– Special start tag (e.g. "<br />")
• For non-empty elements, closing tags are required
• Start tag must match closing tag (name & case)
• Correct nesting of elements
• Attribute values must always be quoted
• Attribute minimisation not allowed
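Several of the rules above can be tried out against a real parser. This is a rough sketch with Python's `xml.etree.ElementTree`, which raises `ParseError` on input that is not well formed; the helper name `is_well_formed` and the sample strings are illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


checks = {
    "both empty-element forms": is_well_formed("<doc><br/><br></br></doc>"),
    "two root elements": is_well_formed("<a/><b/>"),
    "case mismatch in tags": is_well_formed("<doc></Doc>"),
    "incorrect nesting": is_well_formed("<a><b></a></b>"),
    "unquoted attribute value": is_well_formed("<doc type=book/>"),
    "minimised attribute": is_well_formed("<input checked/>"),
}

for rule, accepted in checks.items():
    print(rule, "accepted" if accepted else "rejected")
```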
Document Type Declaration
• Internal/embedded DTD
<?xml version='1.0' standalone='yes'?>
<!DOCTYPE person [
  <!ELEMENT person (name, adult, nationality)>
  …
]>
• External DTD
<?xml version='1.0'?>
<!DOCTYPE person SYSTEM 'person.dtd'>
What are XML Namespaces?
• W3C recommendation (January 1999)
• Each XML vocabulary is considered to own a
namespace in which all elements (and attributes) are
unique
• A single document can use elements and attributes
from multiple namespaces
– A prefix is declared for each namespace used within a
document.
– The namespace is identified using a URI (Uniform Resource
Identifier)
• An element or attribute can be associated with a
namespace by placing the namespace prefix before its
name (i.e. 'prefix:name')
– Elements (and attributes) belonging to the default namespace
do not require a prefix
Example: XML Namespaces
<?xml version='1.0'?>
© 2003 B. Jung
Why Namespaces?
• Important for creating XML documents
containing different types of data
• An XML document can be assembled using
elements (and attributes) from different XML
vocabularies
• Must be able to
– avoid conflicts between names
– identify the vocabulary an element belongs to
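A minimal sketch of a document that mixes two vocabularies, read with Python's `xml.etree.ElementTree`. The `order` and `book` vocabularies and both `example.org` URIs are invented for illustration; they are not taken from the lecture.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabularies: orders (default namespace) and books (prefix bk).
SOURCE = """<order xmlns="http://example.org/orders"
       xmlns:bk="http://example.org/books">
  <title>Order 42</title>
  <bk:book><bk:title>Java Gently</bk:title></bk:book>
</order>"""

root = ET.fromstring(SOURCE)

# The parser identifies each element by namespace URI plus local name.
root_tag = root.tag

# Prefixes used for searching are local to this code; only the URIs must match.
NS = {"o": "http://example.org/orders", "b": "http://example.org/books"}
order_title = root.find("o:title", NS).text
book_title = root.find("b:book/b:title", NS).text

print(root_tag, order_title, book_title)
```

Both vocabularies define a `title` element, yet the two never collide because each belongs to a different namespace.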
XML Processing: DOM Processing
[Figure: two processing pipelines]
• DOM: XML Doc → Character Stream → Process into Tree → Navigation API → DOM Application
• SAX: XML Doc → Character Stream → Events API → SAX Application
COPYRIGHT © 2000-2003 ANDERS MØLLER & MICHAEL I. SCHWARTZBACH
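The two processing styles can be compared side by side with Python's standard library, using `xml.dom.minidom` for DOM and `xml.sax` for SAX. This is only a sketch; the handler class name `EventRecorder` is made up.

```python
import xml.sax
from xml.dom import minidom

SOURCE = "<doc><title>Java Gently</title><author>Judy Bishop</author></doc>"

# DOM: the whole document is processed into a tree, then navigated.
dom = minidom.parseString(SOURCE)
dom_title = dom.documentElement.getElementsByTagName("title")[0].firstChild.data


# SAX: the parser reports events while reading; no tree is built.
class EventRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records start and end events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


handler = EventRecorder()
xml.sax.parseString(SOURCE.encode("utf-8"), handler)

print(dom_title)
print(handler.events)
```

DOM suits applications that need random access to the document, while SAX keeps memory use low because nothing is retained unless the handler stores it.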
Summary
• XML = eXtensible Markup Language
• An XML document is a hierarchical data structure
using self-definable tags
• Physical parts of XML document
– XML Declaration
– Elements
– Attributes
– Document Type Declaration
– Entities
– Processing Instructions
– Comments
– Character Data Sections
– XML Namespaces
• Two types of APIs popular for XML Processing: DOM &
SAX
• </Lecture>
University of Dublin
Trinity College
What is an XML vocabulary?
• Synonyms
– ‘Application of XML’
– XML Language
• Set of elements and attributes for
representing domain-specific information
• “Instance” of a Mark Up Language
• Defined by DTD or XML Schema
• Some are approved by standards organisations
– E.g. ebXML, MathML, XSL etc.
• Define sequence of elements
– ",": followed-by (Sequence)
– "|": logical or (Choice)
<!ELEMENT author (name | synonym)>
<!ELEMENT image EMPTY>
<!ELEMENT paragraph (#PCDATA | bold | italic)*>
Element Type Declaration
• Define occurrences of elements
– ?: zero-or-one
– +: one-or-more
– *: zero-or-more
<!ELEMENT doc (title, author+, editor?, chapter+, appendix*)>
<!ELEMENT chapter (title, (section+ | paragraph+))>
<!ELEMENT list (item?, item?, item)>
<!ELEMENT paragraph (#PCDATA | %list;)*>
Attribute List Declaration
• Define type of attribute
– ID
– IDREF
– ENTITY
– NMTOKEN
– NOTATION
• Define default values of attributes
– #REQUIRED
– #IMPLIED
– #FIXED
– A list of values with default selection
<!ATTLIST person ssn ID #IMPLIED>
<!ATTLIST adult age CDATA #REQUIRED>
<!ATTLIST mml version CDATA #FIXED '1.0'>
<!ATTLIST person sex (m | f) #REQUIRED>
<!ATTLIST day temperature (l | m | h) "l">
Entity Declaration
• Internal entities
– Built-in
<!ENTITY author "Norman Walsh, Sun Corp.">
<firstname>Michael</firstname>
XML doc. Instance
Named Types – complex
XML Schema
<xsd:complexType name="namePerson">
  <xsd:sequence>
    <xsd:element name="firstname" type="xsd:string"/>
    [...]
XML doc. Instance
<name>
<firstname>Michael</firstname>
<lastname>Porter</lastname>
</name>
Primitive Datatypes
• string • gYearMonth
• boolean • gYear
• decimal • gMonthDay
• float • gDay
• double • gMonth
• duration • hexBinary
• dateTime • base64Binary
• time • anyURI
• date • QName
• NOTATION
http://www.w3.org/TR/xmlschema-2/
Simple Type - Restriction
XML Schema
<xsd:simpleType name='celsiusBodyTemp'>
  <xsd:restriction base='xsd:decimal'>
    <xsd:totalDigits value='4'/>
    <xsd:fractionDigits value='1'/>
    <xsd:minInclusive value='36.4'/>
    <xsd:maxInclusive value='40.5'/>
  </xsd:restriction>
</xsd:simpleType>
<xsd:element name="temp" type="celsiusBodyTemp"/>
XML doc. Instance
<temp>37.2</temp>
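To make the four facets concrete, here is a hand-written check in Python that mirrors the `celsiusBodyTemp` restriction. It is a sketch of what a schema validator would enforce, not a use of any validator: the function name and the regex for the decimal lexical form are assumptions, and digit counts are taken on the numeric value, so trailing zeros are ignored.

```python
import re
from decimal import Decimal

# Assumed lexical form of a decimal: optional sign, digits, optional fraction.
DECIMAL_LEXICAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def is_celsius_body_temp(text):
    """Check text against the celsiusBodyTemp facets (hypothetical helper)."""
    text = text.strip()
    if not DECIMAL_LEXICAL.fullmatch(text):
        return False
    value = Decimal(text)
    _sign, digits, exponent = value.normalize().as_tuple()
    fraction_digits = max(-exponent, 0)            # digits after the decimal point
    total_digits = len(digits) + max(exponent, 0)  # significant digits overall
    return (
        total_digits <= 4                                  # totalDigits
        and fraction_digits <= 1                           # fractionDigits
        and Decimal("36.4") <= value <= Decimal("40.5")    # minInclusive, maxInclusive
    )


for sample in ["37.2", "36.3", "40.5", "37.25", "warm"]:
    print(sample, is_celsius_body_temp(sample))
```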
Simple Type - Enumeration
XML Schema
<xsd:simpleType name="weekday">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="Sunday"/>
    <xsd:enumeration value="Monday"/>
    <xsd:enumeration value="Tuesday"/>
    [...]
  </xsd:restriction>
</xsd:simpleType>
<xsd:element name="delivery" type="weekday"/>
XML doc. Instance
<delivery>Tuesday</delivery>
Complex Type - Cardinalities
XML Schema
<xsd:complexType name="fullname">
  <xsd:sequence>
    <xsd:element name="title" minOccurs="0"/>
    [...]
XML doc. Instance
<name>
<firstname>Michael</firstname>
<firstname>Jason</firstname>
<lastname>Porter</lastname>
</name>
Complex Type – Derived Type by extension
XML Schema
<xsd:complexType name="fullnameExt">
  <xsd:complexContent>
    <xsd:extension base="fullname">
      <xsd:sequence>
        [...]
XML doc. Instance
<name>
<firstname>Jane</firstname>
<lastname>Porter</lastname>
<maidenname>Hughes</maidenname>
</name>
Complex Type – Derived Type by Restriction
XML Schema
<xsd:complexType name="simpleName">
  <xsd:complexContent>
    <xsd:restriction base="fullname">
      <xsd:sequence>
        [...]
XML doc. Instance
<name>
<firstname>Jane</firstname>
<lastname>Porter</lastname>
</name>
Structure - Sequence
XML Schema
<xsd:complexType name="fullname">
  <xsd:sequence>
    <xsd:element name="title" minOccurs="0"/>
    [...]
XML doc. Instance
<name>
<firstname>Michael</firstname>
<firstname>Jason</firstname>
<lastname>Porter</lastname>
</name>
Structure - Choice
XML Schema
<xsd:complexType name="payment">
  <xsd:sequence>
    <xsd:element ref="product"/>
    <xsd:element ref="number"/>
    <xsd:choice>
      <xsd:element ref="cash"/>
      <xsd:element ref="cheque"/>
    </xsd:choice>
  </xsd:sequence>
</xsd:complexType>
<xsd:element name="pay" type="payment"/>
XML doc. Instance
<pay>
[...]
XML Schema
<xsd:element name="greeting">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:string">
        <xsd:attribute name="language" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>
XML doc. Instance
<greeting language="German">Hello!</greeting>
Attribute Groups
DTD
<!ELEMENT img EMPTY>
<!ATTLIST img src CDATA #REQUIRED
[...]
XML Schema
<xsd:attributeGroup name="imgAttributes">
  <xsd:attribute name="src" type="xsd:string" use="required"/>
  <xsd:attribute name="width" type="xsd:integer"/>
  <xsd:attribute name="height" type="xsd:integer"/>
</xsd:attributeGroup>
<xsd:element name="img">
  <xsd:complexType>
    <xsd:attributeGroup ref="imgAttributes"/>
  </xsd:complexType>
</xsd:element>
<!ELEMENT b (#PCDATA)>
XML Schema
<xsd:element name="img">
  <xsd:complexType>
    <xsd:attribute name="src" type="xsd:string"/>
  </xsd:complexType>
</xsd:element>
XML doc. Instance
<img src="XMLmanager.gif"/>
XML Schema Example
<?xml version="1.0" encoding="utf-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="book">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="title" type="xsd:string"/>
        <xsd:element name="author" type="xsd:string"/>
        <xsd:element name="character" type="xsd:string"
                     minOccurs="0" maxOccurs="unbounded">
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
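Because a schema is itself an XML document, it can be read with an ordinary parser. The sketch below loads a copy of the book schema with Python's `xml.etree.ElementTree`, extracts the `minOccurs`/`maxOccurs` settings, and counts occurrences in an instance. It assumes the W3C Recommendation namespace `http://www.w3.org/2001/XMLSchema`, the function name is invented, and the check covers cardinalities only, not element order or datatypes, so it is no substitute for a real validator.

```python
import xml.etree.ElementTree as ET

NS = {"xsd": "http://www.w3.org/2001/XMLSchema"}  # assumed Recommendation namespace

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="book">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="title" type="xsd:string"/>
        <xsd:element name="author" type="xsd:string"/>
        <xsd:element name="character" type="xsd:string"
                     minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>"""

schema = ET.fromstring(SCHEMA)
sequence = schema.find("xsd:element/xsd:complexType/xsd:sequence", NS)

# minOccurs and maxOccurs both default to 1 when omitted.
cardinalities = {
    decl.get("name"): (decl.get("minOccurs", "1"), decl.get("maxOccurs", "1"))
    for decl in sequence.findall("xsd:element", NS)
}


def occurrences_ok(instance_xml):
    """Check child counts of a book instance against the schema (hypothetical helper)."""
    counts = {}
    for child in ET.fromstring(instance_xml):
        counts[child.tag] = counts.get(child.tag, 0) + 1
    for name, (low, high) in cardinalities.items():
        seen = counts.get(name, 0)
        if seen < int(low):
            return False
        if high != "unbounded" and seen > int(high):
            return False
    return True


print(cardinalities)
print(occurrences_ok("<book><title>t</title><author>a</author></book>"))
```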
Summary
• XML Vocabularies are defined using
– DTD
– XSD
• DTDs/XSDs used to validate XML documents
• XSD – more powerful than DTDs
– Supports simple and complex data-types such as
user-defined types
– Can validate documents containing multiple
namespaces