XML Parsing

Creating Markup with XML
Outline
Introduction
Introduction to XML Markup
Parsers and Well-formed XML Documents
Parsing an XML Document with msxml
Characters
Character Set
Characters vs. Markup
White Space, Entity References and Built-in Entities
Using Unicode in an XML Document
Markup
CDATA Sections
Introduction
XML
Technology for creating markup languages
Enables document authors to describe data of any type
Allows creating new tags
HTML limits document authors to fixed tag set
XML document (intro.xml)
Marks up message as XML
Commonly stored in text files
Extension .xml
1 <?xml version = "1.0"?>
2
3 <!-- ... -->
4 <!-- ... -->
5
6 <myMessage>
7    <message>Welcome to XML!</message>
8 </myMessage>
Simple XML document containing a message.
Line numbers are not part of XML document. We include them for clarity.
Line 1: Document begins with declaration that specifies XML version 1.0
Lines 3-4: Comments
Line 7: Element message is child element of root element myMessage
Introduction (cont.)
XML documents
Must contain exactly one root element
Attempting to create more than one root element is erroneous
Elements must be nested properly
Incorrect: <x><y>hello</x></y>
Correct: <x><y>hello</y></x>
Rules for Elements
XML document syntax
Considered well formed if syntactically correct
Single root element
Each element has start tag and end tag

Tags properly nested
Attribute (discussed later) values in quotes
Proper capitalization
Case sensitive
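The rules above can be tried against a real parser. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree rather than the msxml parser used in these slides; the helper name is_well_formed and the sample documents are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One sample per rule; only the first one follows every rule.
samples = {
    "properly nested": "<x><y>hello</y></x>",
    "improperly nested": "<x><y>hello</x></y>",
    "two root elements": "<x>hello</x><y>hello</y>",
    "missing end tag": "<x><y>hello</x>",
    "unquoted attribute value": "<car doors = 4/>",
    "mismatched capitalization": "<Message>hello</message>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample is reported as well formed; each of the others breaks exactly one rule from the list.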
Parsers and Well-formed XML Documents
XML parser
Processes XML document
Reads XML document
Checks syntax
Reports errors (if any)
Allows programmatic access to document’s contents
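As an illustration of these four tasks, here is a small sketch that assumes Python's xml.etree.ElementTree as the parser (the slides themselves use msxml). It reads the intro.xml markup, lets the parser check it, and then reaches into the content; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

INTRO_XML = """<?xml version = "1.0"?>
<myMessage>
   <message>Welcome to XML!</message>
</myMessage>"""

# Reading and syntax checking happen in one call.
root = ET.fromstring(INTRO_XML)

# Programmatic access to the document's contents.
message = root.find("message")
print(root.tag)       # prints myMessage
print(message.text)   # prints Welcome to XML!

# A document with a missing end tag makes the parser report an error.
try:
    ET.fromstring("<myMessage><message>Welcome to XML!</myMessage>")
except ET.ParseError as error:
    print("not well formed:", error)
```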

Parsers and Well-formed XML Documents (cont.)
XML parsers support
Document Object Model (DOM)
Builds tree structure containing document data in memory
Simple API for XML (SAX)
Generates events when tags, comments, etc. are encountered
(Events are notifications to the application)
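The difference between the two models is easiest to see side by side. The sketch below assumes Python's standard xml.dom.minidom and xml.sax modules as stand-ins for a DOM and a SAX parser; the class name EventRecorder is made up for this example.

```python
import xml.sax
from xml.dom import minidom

DOCUMENT = "<myMessage><message>Welcome to XML!</message></myMessage>"

# DOM: the whole document is built as a tree in memory, then navigated.
dom = minidom.parseString(DOCUMENT)
message_node = dom.getElementsByTagName("message")[0]
dom_text = message_node.firstChild.data


# SAX: the parser notifies the application while it reads the document.
class EventRecorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # A parser may deliver one run of text in several pieces.
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


recorder = EventRecorder()
xml.sax.parseString(DOCUMENT.encode("utf-8"), recorder)

print(dom_text)
for event in recorder.events:
    print(event)
```

With DOM the application asks the tree for what it needs after parsing. With SAX nothing is kept unless the handler stores it, which is why the recorder collects the events in a list.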
Parsing an XML Document with msxml
XML document
Contains data
Does not contain formatting information
Load XML document into Internet Explorer 5.0
Document is parsed by msxml.
Places plus (+) or minus (-) signs next to container elements
Plus sign indicates that all child elements are hidden
Clicking plus sign expands container element
Displays children
Minus sign indicates that all child elements are visible
Clicking minus sign collapses container element
Hides children
Error generated, if document is not well formed
XML document shown in IE5.
Error message for a missing end tag.
Characters
Character set
Characters that may be represented in XML document
e.g., ASCII character set
Letters of English alphabet
Digits (0-9)
Punctuation characters, such as !, - and ?
Character Set
XML documents may contain
Carriage returns
Line feeds
Unicode characters
Enables computers to process characters for several languages
Characters vs. Markup
XML must differentiate between
Markup text
Enclosed in angle brackets (< and >)
e.g., child elements
Character data
Text between start tag and end tag
e.g., line 7: Welcome to XML!
Element Naming
XML elements must follow these naming rules:
Names can contain letters, numbers, and other characters
Names must not start with a number or punctuation character
Names must not start with the letters xml (or XML or Xml ..)
Names cannot contain spaces
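The naming rules can be approximated with a short check. This is a simplified, ASCII-only sketch in Python, not the full definition from the XML specification, which permits many more Unicode characters. It also accepts a leading underscore, which XML allows. The function name is_valid_element_name is hypothetical.

```python
import re

# Simplified, ASCII-only approximation of the naming rules listed above.
# First character: a letter or underscore. Remaining characters: letters,
# digits, periods, hyphens and underscores. No spaces anywhere.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_element_name(name):
    if name.lower().startswith("xml"):
        return False  # names beginning with xml, in any capitalization
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["myMessage", "2ndItem", "-item", "xmlData", "my message"]:
    print(candidate, is_valid_element_name(candidate))
```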
White Space
Whitespace characters
Spaces, tabs, line feeds and carriage returns
Significant (preserved by application)
Insignificant (not preserved by application)
Normalization
Whitespace collapsed into single whitespace character
Sometimes whitespace removed entirely
<markup>This    is   character    data</markup>
after normalization, becomes
<markup>This is character data</markup>
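Normalization of this kind is simple to reproduce. The sketch below is one possible application-side implementation in Python, not something an XML parser is required to do; the function name normalize_whitespace is illustrative.

```python
def normalize_whitespace(text):
    """Collapse each run of whitespace into a single space.

    Leading and trailing whitespace is removed entirely.
    """
    return " ".join(text.split())


original = "This    is\n\tcharacter    data"
print(normalize_whitespace(original))  # prints This is character data
```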
Entity References
XML-reserved characters
Ampersand (&)
Left-angle bracket (<)
Right-angle bracket (>)
Apostrophe (')
Double quote (")
Entity references
Allow XML-reserved characters to be used as character data
Begin with ampersand (&) and end with semicolon (;)
Prevent character data from being misinterpreted as markup
Built-in Entities
Built-in entities
Ampersand (&amp;)
Left-angle bracket (&lt;)
Right-angle bracket (&gt;)
Apostrophe (&apos;)
Quotation mark (&quot;)
Mark up characters "<>&" in element message
<message>&lt;&gt;&amp;</message>
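Replacing reserved characters by hand is error-prone, so most libraries offer a helper for it. The sketch below assumes Python's xml.sax.saxutils and xml.etree.ElementTree. It escapes the three characters from the example, embeds them in a message element, and shows that a parser turns the references back into the original characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "<>&"

# escape() replaces &, < and > with their built-in entity references.
escaped = escape(raw)
print(escaped)  # prints &lt;&gt;&amp;

# Quotes are only replaced when asked for explicitly.
quoted = escape("\"It's\"", {'"': "&quot;", "'": "&apos;"})
print(quoted)  # prints &quot;It&apos;s&quot;

# The parser resolves the references back into character data.
message = ET.fromstring(f"<message>{escaped}</message>")
print(message.text)  # prints <>&
print(unescape(escaped) == raw)  # prints True
```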
Using Unicode in an XML Document
XML Unicode support
e.g., displays Arabic words
Arabic characters represented by entity references for Unicode characters
<?xml version = "1.0"?>

<!DOCTYPE welcome SYSTEM "lang.dtd">

<welcome>
   <from>
      دايتَل
      أند

      &assoc;
   </from>

   <subject>
      أهلاً
      بكم
      فيِ
      عالم

      &text;
   </subject>
</welcome>
XML document that contains Arabic words.
Document type definition (DTD) defines document structure and entities
Root element welcome contains child elements from and subject
Arabic words: in the source file, each character is written as an entity reference for a Unicode character
lang.dtd defines entities assoc and text
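To see how a character becomes a reference, the sketch below writes one of the Arabic words from the listing as numeric character references of the form &#N;, where N is the character's Unicode code point, and lets a parser decode them again. It assumes Python's xml.etree.ElementTree. The entities assoc and text are left out because resolving them needs lang.dtd, which is not shown here.

```python
import xml.etree.ElementTree as ET

word = "أهلاً"  # first word of the subject element in the listing

# Write every character as a numeric character reference: &#<code point>;
references = "".join(f"&#{ord(character)};" for character in word)
document = f"<subject>{references}</subject>"

# The document is now plain ASCII, yet it still carries the Arabic text.
print(document)

parsed = ET.fromstring(document)
print(parsed.text == word)  # prints True
```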
Markup
XML element markup
Consists of
Start tag
Content
End tag
All elements must have corresponding end tag
<img src = "img.gif">
is correct in HTML, but not XML
XML requires end tag or forward slash (/) for termination
<img src = "img.gif"></img>
or
<img src = "img.gif"/>
is correct XML syntax
Markup (cont.)
Elements
Define structure
May (or may not) contain content
Child elements, character data, etc.
Attributes
Describe elements
Elements may have associated attributes
Placed within element’s start tag
Values are enclosed in quotes
Element car contains attribute doors, which has value "4"
<car doors = "4"/>
Markup (cont.)
Processing instruction (PI)
Passed to application using XML document
Provides application-specific document information
Delimited by <? and ?>
<?xml:stylesheet type = "text/xsl" href = "usage.xsl"?>

<book isbn = "999-99999-9-X">
   <title>Deitel&apos;s XML Primer</title>

   <author>
      <firstName>Paul</firstName>
      <lastName>Deitel</lastName>
   </author>

   <chapters>
      <preface num = "1" pages = "2">Welcome</preface>
      <chapter num = "1" pages = "4">Easy XML</chapter>
      <chapter num = "2" pages = "2">XML Elements?</chapter>
      <appendix num = "1" pages = "9">Entities</appendix>
   </chapters>

   <media type = "CD"/>
</book>
XML document that marks up information about a fictitious book.
Processing instruction specifies stylesheet (discussed in Chapter 12)
Root element book contains child elements title, author, chapters and media
Element book contains attribute isbn, which has value of 999-99999-9-X
Element chapters contains four child elements, each of which contains two attributes
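Elements and attributes of the book document can be read back with a few calls. This is a sketch that assumes Python's xml.etree.ElementTree; the processing instruction is omitted, and the apostrophe in the title is written as the built-in entity &apos;.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """
<book isbn = "999-99999-9-X">
   <title>Deitel&apos;s XML Primer</title>
   <author>
      <firstName>Paul</firstName>
      <lastName>Deitel</lastName>
   </author>
   <chapters>
      <preface num = "1" pages = "2">Welcome</preface>
      <chapter num = "1" pages = "4">Easy XML</chapter>
      <chapter num = "2" pages = "2">XML Elements?</chapter>
      <appendix num = "1" pages = "9">Entities</appendix>
   </chapters>
   <media type = "CD"/>
</book>
"""

book = ET.fromstring(BOOK_XML)

# Attributes live in the start tag and are read by name.
print(book.get("isbn"))
print(book.findtext("title"))

# Iterating over an element yields its child elements in document order.
for part in book.find("chapters"):
    print(part.tag, part.get("num"), part.get("pages"), part.text)

total_pages = sum(int(part.get("pages")) for part in book.find("chapters"))
print(total_pages)

# An empty element has attributes but neither text nor children.
media = book.find("media")
print(media.get("type"), media.text, len(media))
```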
<letter>

   <contact type = "from">
      <name>Jane Doe</name>
      <address1>Box 12345</address1>
      <address2>15 Any Ave.</address2>
      <city>Othertown</city>
      <state>Otherstate</state>
      <zip>67890</zip>
      <phone>555-4321</phone>
      <flag gender = "F"/>
   </contact>

   <contact type = "to">
      <name>Jane Doe</name>
      <address1>123 Main St.</address1>
      <address2></address2>
      <city>Anytown</city>
      <state>Anystate</state>
      <zip>12345</zip>
      <phone>555-1234</phone>
      <flag gender = "M"/>
   </contact>

   <salutation>Dear Sir:</salutation>

   <paragraph>It is our privilege to inform you about our new
      database managed with <bold>XML</bold>. This new system
      allows you to reduce the load on your inventory list
      server by having the client machine perform the work of
      sorting and filtering the data.</paragraph>

   <paragraph>The data in an XML element is normalized, so
      plain-text diagrams such as
         /---\
         |   |
         \---/
      will become gibberish.</paragraph>

   <closing>Sincerely</closing>
   <signature>Ms. Doe</signature>

</letter>
XML document that marks up a letter.
CDATA Sections
CDATA sections
May contain text, reserved characters and whitespace
Reserved characters need not be replaced by entity references
Not processed by XML parser
Commonly used for scripting code (e.g., JavaScript)
Begin with <![CDATA[
Terminate with ]]>
<?xml version = "1.0"?>

<book title = "C++ How to Program" edition = "3">

   <sample>
      // C++ comment
      if ( this-&gt;getX() &lt; 5 &amp;&amp; value[ 0 ] != 3 )
         cerr &lt;&lt; this-&gt;displayError();
   </sample>

   <sample>
      <![CDATA[

         // C++ comment
         if ( this->getX() < 5 && value[ 0 ] != 3 )
            cerr << this->displayError();
      ]]>
   </sample>

   C++ How to Program by Deitel &amp; Deitel
</book>
Using a CDATA section.
Entity references required if not in CDATA section
XML parser does not process CDATA section
Note the simplicity offered by CDATA section
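Both ways of writing the C++ fragment give the application the same character data. The sketch below assumes Python's xml.etree.ElementTree and compares the two sample elements after collapsing whitespace; the comment line and the trailing book text are left out to keep it short.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<book title = "C++ How to Program" edition = "3">
   <sample>
      if ( this-&gt;getX() &lt; 5 &amp;&amp; value[ 0 ] != 3 )
         cerr &lt;&lt; this-&gt;displayError();
   </sample>
   <sample>
      <![CDATA[
      if ( this->getX() < 5 && value[ 0 ] != 3 )
         cerr << this->displayError();
      ]]>
   </sample>
</book>"""

book = ET.fromstring(DOCUMENT)

# Collapse whitespace so that only the character data is compared.
with_entities, with_cdata = (
    " ".join(sample.text.split()) for sample in book.findall("sample")
)

print(with_entities)
print(with_entities == with_cdata)  # prints True
```

The CDATA version is easier to write and read, while the entity version works in places where a CDATA section is not allowed, such as attribute values.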
XML Namespaces
Naming collisions
Two different elements have same name
<subject>Math</subject>
<subject>Thrombosis</subject>
Namespaces
Differentiate elements that have same name
<school:subject>Math</school:subject>
<medical:subject>Thrombosis</medical:subject>
school and medical are namespace prefixes
Prefixed to element and attribute names
Tied to uniform resource identifier (URI)
Series of characters for differentiating names
XML Namespaces (cont.)
Creating namespaces
Use xmlns keyword
xmlns:text = "urn:deitel:textInfo"
xmlns:image = "urn:deitel:imageInfo"
Creates two namespace prefixes text and image
urn:deitel:textInfo is URI for prefix text
urn:deitel:imageInfo is URI for prefix image
Default namespaces
Child elements of this namespace do not need prefix
xmlns = "urn:deitel:textInfo"
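A prefix and a default namespace are two spellings of the same thing: what identifies an element is its URI plus its local name. The sketch below assumes Python's xml.etree.ElementTree, which exposes that pair as {URI}name; the two small documents are made up for the comparison.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the prefix text.
prefixed = ET.fromstring(
    '<text:file xmlns:text="urn:deitel:textInfo">'
    "<text:description>A book list</text:description>"
    "</text:file>"
)

# Same namespace declared as the default, so no prefix is needed.
defaulted = ET.fromstring(
    '<file xmlns="urn:deitel:textInfo">'
    "<description>A book list</description>"
    "</file>"
)

print(prefixed.tag)   # prints {urn:deitel:textInfo}file
print(defaulted.tag)  # prints {urn:deitel:textInfo}file
print(prefixed[0].tag == defaulted[0].tag)  # prints True
```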
<directory xmlns:text = "urn:deitel:textInfo"
           xmlns:image = "urn:deitel:imageInfo">

   <text:file filename = "book.xml">
      <text:description>A book list</text:description>
   </text:file>

   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>

</directory>
Listing for namespace.xml.
Element directory contains two namespace prefixes
Use prefix text to describe elements file and description
Apply prefix image to describe elements file, description and size
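When reading such a document, the application chooses between the two file elements by namespace. This sketch assumes Python's xml.etree.ElementTree; the dictionary NS maps prefixes to URIs for the query only and does not have to use the same prefixes as the document.

```python
import xml.etree.ElementTree as ET

NAMESPACE_XML = """<directory xmlns:text = "urn:deitel:textInfo"
           xmlns:image = "urn:deitel:imageInfo">
   <text:file filename = "book.xml">
      <text:description>A book list</text:description>
   </text:file>
   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>
</directory>"""

# Prefix-to-URI map used by the queries below.
NS = {"text": "urn:deitel:textInfo", "image": "urn:deitel:imageInfo"}

directory = ET.fromstring(NAMESPACE_XML)

for prefix in NS:
    for file_element in directory.findall(f"{prefix}:file", NS):
        description = file_element.findtext(f"{prefix}:description", namespaces=NS)
        print(prefix, file_element.get("filename"), description)

# Only the image namespace defines a size element.
size = directory.find("image:file/image:size", NS)
print(size.get("width"), size.get("height"))
```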
