Sie sind auf Seite 1von 36

XML Parsing

Designed By:
Creating Markup with
XML
Outline
Introduction
Introduction to XML Markup
Parsers and Well-formed XML Documents
Parsing an XML Document with msxml
Characters
Character Set
Characters vs. Markup
While Space, Entity References and Built-in Entities
Using Unicode in an XML Document
Markup
CDATA Sections
Introduction
„ XML
„ Technology for creating markup languages
„ Enables document authors to describe data of any
type
„ Allows creating new tags
„ HTML limits document authors to fixed tag set
Introduction to XML Markup
„ XML document (intro.xml)
„ Marks up message as XML
„ Commonly stored in text files
„ Extension .xml
1 <?xml version = "1.0"?>
Document begins with declaration
2 that specifies XML version 1.0
3 <!-- Fig. 5.1 : intro.xml --> „ Simple XML
4 <!-- Simple introduction to XML markup --> document containing a
5 message.
6 <myMessage>
„ Line numbers are
7 <message>Welcome to XML!</message>
Element message is not part of XML
8 </myMessage>
child element of root document. We
element myMessage include them for
clarity.
Line numbers are not part
of XML document. We Document begins
include them for clarity. with declaration that
specifies XML
version 1.0

Comments

Element message is
child element of root
element myMessage
Introduction to XML Markup
(cont.)
„ XML documents
„ Must contain exactly one root element
„ Attempting to create more than one root element is
erroneous
„ Elements must be nested properly
„ Incorrect: <x><y>hello</x></y>
„ Correct: <x><y>hello</y></x>
Rules for Elements
„ XML document syntax
„ Considered well formed if syntactically correct
„ Single root element

„ Each element has start tag and end tag


„ Tags properly nested

„ Attribute (discussed later) values in quotes

„ Proper capitalization
„ Case sensitive
Parsers and Well-formed XML
Documents
„ XML parser
„ Processes XML document
„ Reads XML document

„ Checks syntax

„ Reports errors (if any)

„ Allows programmatic access to document’s contents


Parsers and Well-formed XML
Documents (cont.)
„ XML parsers support
„ Document Object Model (DOM)
„ Builds tree structure containing document data in memory

„ Simple API for XML (SAX)


„ Generates events when tags, comments, etc. are
encountered
„ (Events are notifications to the application)
Parsing an XML Document with
msxml
„ XML document
„ Contains data
„ Does not contain formatting information
„ Load XML document into Internet Explorer 5.0
„ Document is parsed by msxml.
„ Places plus (+) or minus (-) signs next to container elements
„ Plus sign indicates that all child elements are hidden
„ Clicking plus sign expands container element
„ Displays children
„ Minus sign indicates that all child elements are visible
„ Clicking minus sign collapses container element
„ Hides children
„ Error generated, if document is not well formed
XML document shown in IE5.
Error message for a missing end
tag.
Characters
„ Character set
„ Characters that may be represented in XML
document
„ e.g., ASCII character set
„ Letters of English alphabet
„ Digits (0-9)
„ Punctuation characters, such as !, - and ?
Character Set
„ XML documents may contain
„ Carriage returns
„ Line feeds

„ Unicode characters
„ Enables computers to process characters for several
languages
Characters vs. Markup
„ XML must differentiate between
„ Markup text
„ Enclosed in angle brackets (< and >)
„ e.g,. Child elements
„ Character data
„ Text between start tag and end tag
„ e.g., line 7: Welcome to XML!
Element Naming
XML elements must follow these naming rules:
„ Names can contain letters, numbers, and other
characters
„ Names must not start with a number or punctuation
character
„ Names must not start with the letters xml (or XML or
Xml ..)
„ Names cannot contain spaces
White Space
„ Whitespace characters
„ Spaces, tabs, line feeds and carriage returns
„ Significant (preserved by application)
„ Insignificant (not preserved by application)
„ Normalization
„ Whitespace collapsed into single whitespace character
„ Sometimes whitespace removed entirely

<markup>This is character data</markup>


after normalization, becomes
<markup>This is character data</markup>
Entity References
„ XML-reserved characters
„ Ampersand (&)
„ Left-angle bracket (<)
„ Right-angle bracket (>)
„ Apostrophe (’)
„ Double quote (”)
„ Entity references
„ Allow to use XML-reserved characters
„ Begin with ampersand (&) and end with semicolon (;)
„ Prevents from misinterpreting character data as markup
Built-in Entities
„ Build-in entities
„ Ampersand (&amp;)
„ Left-angle bracket (&lt;)
„ Right-angle bracket (&gt;)
„ Apostrophe (&apos;)
„ Quotation mark (&quot;)
„ Mark up characters “<>&” in element message
<message>&lt;&gt;&amp;</message>
Using Unicode in an XML
Document
„ XML Unicode support
„ e.g., displays Arabic words
„ Arabic characters
„ represented by entity references for Unicode characters
1 <?xml version = "1.0"?> Root
2 element
Document type definition welcome
3 <!-- Fig. 5.4 : lang.xml -->
4 <!-- Demonstrating Unicode -->
(DTD) defines document contains
5 structure and entities child
6 <!DOCTYPE welcome SYSTEM "lang.dtd"> elements
7 from and
8 <welcome> „ XML document subject
that
9 <from> contains Arabic words
10
11 <!-- Deitel and Associates --> „ Document type
12 &#1583;&#1575;&#1610;&#1578;&#1614;&#1604; definition (DTD)
13 &#1571;&#1606;&#1583; defines document
14 structure and entities
15 <!-- entity -->
16 &assoc;
Root element welcome
contains child elements
17 </from>
from and subject
18
19 <subject> Sequence of entityfor
20 Sequence of entity references
references for Unicode
21 <!-- Welcome to the world of Unicode --> Unicode characters in Arabic
characters in Arabic
22 &#1571;&#1607;&#1604;&#1575;&#1611; alphabet
alphabet
23 &#1576;&#1603;&#1605;
24 &#1601;&#1610;&#1616; lang.dtd defines
25 &#1593;&#1575;&#1604;&#1605; lang.dtd defines entitiesassoc and
entities
26
text
assoc and text
27 <!-- entity -->
28 &text;
29 </subject>
30 </welcome>
XML document that contains
Arabic words.
Markup
„ XML element markup
„ Consists of
„ Start tag
„ Content
„ End tag
„ All elements must have corresponding end tag
<img src = “img.gif”>
is correct in HTML, but not XML
„ XML requires end tag or forward slash (/) for termination
<img src = “img.gif”></img>
or
<img src = “img.gif”/>
is correct XML syntax
Markup (cont.)
„ Elements
„ Define structure
„ May (or may not) contain content
„ Child elements, character data, etc.
„ Attributes
„ Describe elements
„ Elements may have associated attributes
„ Placed within element’s start tag
„ Values are enclosed in quotes
„ Element car contains attribute doors, which has value “4”
<car doors = “4”/>
Markup (cont.)
„ Processing instruction (PI)
„ Passed to application using XML document
„ Provides application-specific document information
„ Delimited by <? and ?>
1 <?xml version = "1.0"?>
Processing instruction specifies
2 stylesheet (discussed in Chapter 12)
3 <!-- Fig. 5.5 : usage.xml --> „ XML document that
4 <!-- Usage of elements and attributes --> Element chapters
marks up information
5 aboutfour
contains a fictitious
child
6 <?xml:stylesheet type = "text/xsl" href = "usage.xsl"?> book.each which
elements,
7 contain two attributes
8 <book isbn = "999-99999-9-X"> „ Processing
9 <title>Deitel&amp;s XML Primer</title> instruction specifies
10 stylesheet
11 <author> Root element book contains(discussed
child in
12 <firstName>Paul</firstName> elements title, author,Chapter 12)
13 <lastName>Deitel</lastName> chapters and media
Root element book
14 </author> contains child
15 Element book contains elements title,
16 <chapters> attribute isbn, which hasauthor,
17 <preface num = "1" pages = "2">Welcome</preface>
value of 999-99999-9-X chapters and
18 <chapter num = "1" pages = "4">Easy XML</chapter> media
19 <chapter num = "2" pages = "2">XML Elements?</chapter>
Element book
20 <appendix num = "1" pages = "9">Entities</appendix>
contains attribute
21 </chapters> isbn, which has
22 value of 999-
23 <media type = "CD"/> 9999-9-X
24 </book>
Element chapters
contains four child
elements, each
which contain two
attributes
XML document that marks up
information about a fictitious book.
1 <?xml version = "1.0"?>
2
3 <!-- Fig. 5.6: letter.xml -->
„ XML document
4 <!-- Business letter formatted with XML -->
that marks up a
5
6 <letter>
letter.
7
8 <contact type = "from">
9 <name>Jane Doe</name>
10 <address1>Box 12345</address1>
11 <address2>15 Any Ave.</address2>
12 <city>Othertown</city>
13 <state>Otherstate</state>
14 <zip>67890</zip>
15 <phone>555-4321</phone>
16 <flag gender = "F"/>
17 </contact>
18
19 <contact type = "to">
20 <name>Jane Doe</name>
21 <address1>123 Main St.</address1>
22 <address2></address2>
23 <city>Anytown</city>
24 <state>Anystate</state>
25 <zip>12345</zip>
26 <phone>555-1234</phone>
27 <flag gender = "M"/>
28 </contact>
29
30 <salutation>Dear Sir:</salutation>
31
32 <paragraph>It is our privilege to inform you about our new
33 database managed with <bold>XML</bold>. This new system
34 allows you to reduce the load on your inventory list
35 server by having the client machine perform the work of
36 sorting and filtering the data.</paragraph>
37
38 <paragraph>The data in an XML element is normalized, so XML document
39 plain-text diagrams such as
that marks up a
40 /---\
41 | |
letter. (Part 2)
42 \---/
43 will become gibberish.</paragraph>
44
45 <closing>Sincerely</closing>
46 <signature>Ms. Doe</signature>
47
48 </letter>
XML document that marks up a
letter.
CDATA Sections
„ CDATA sections
„ May contain text, reserved characters and whitespace
„ Reserved characters need not be replaced by entity references
„ Not processed by XML parser
„ Commonly used for scripting code (e.g., JavaScript)
„ Begin with <![CDATA[
„ Terminate with ]]>
1 <?xml version = "1.0"?> XML does not process
2
CDATA section
3 <!-- Fig. 5.7 : cdata.xml --> „ Using a CDATA
4 <!-- CDATA section containing C++ code --> section.
5
Entity references required if
6 <book title = "C++ How to Program" edition = "3"> not in CDATA section
„ Entity references
7
required if not in
8 <sample>
CDATA section
9 // C++ comment
10 if ( this-&gt;getX() &lt; 5 &amp;&amp; value[ 0 ] != 3 )
XML does not
11 cerr &lt;&lt; this-&gt;displayError(); process CDATA
12 </sample> section
13
14 <sample> Note the
15 <![CDATA[ simplicity offered
16 by CDATA section
17 // C++ comment
18 if ( this->getX() < 5 && value[ 0 ] != 3 )
19 cerr << this->displayError(); Note the simplicity offered by
20 ]]> CDATA section
21 </sample>
22
23 C++ How to Program by Deitel &amp; Deitel
24 </book>
Using a CDATA section
XML Namespaces
„ Naming collisions
„ Two different elements have same name
<subject>Math</subject>
<subject>Thrombosis</subject>
„ Namespaces
„ Differentiate elements that have same name
<school:subject>Math</school:subject>
<medical:subject>Thrombosis</medical:subject>
„ school and medical are namespace prefixes
„ Prefixed to elements and attribute names
„ Tied to uniform resource identifier (URI)
„ Series of characters for differentiating names
XML Namespaces (cont.)
„ Creating namespaces
„ Use xmlns keyword
xmlns:text = “urn:deitel:textInfo”
xmlns:image = “urn:deitel:imageInfo”
„ Creates two namespace prefixes text and image
„ urn:deitel:textInfo is URI for prefix text
„ urn:deitel:imageInfo is URI for prefix image
„ Default namespaces
„ Child elements of this namespace do not need prefix
xmlns = “urn:deitel:textInfo”
1 <?xml version = "1.0"?>
2
Element directory contains two
3 <!-- Fig. 5.8 : namespace.xml -->
namespace prefixes
4 <!-- Namespaces -->
5
6 <directory xmlns:text = "urn:deitel:textInfo"
7 xmlns:image = "urn:deitel:imageInfo"> „ Listing for
8 namespace.xml.
9 <text:file filename = "book.xml">
10 <text:description>A book list</text:description> „ Element
11 </text:file> directory
contains two
12
namespace
13 <image:file filename = "funny.jpg"> prefixes
14 <image:description>A funny picture</image:description>
15 <image:size width = "200" height = "100"/> Use prefix text to
16 </image:file>
describe elements
file and
17 description
18 </directory>
Apply prefix text
to describe
elements file,
description
and size

Das könnte Ihnen auch gefallen