Outline
Background: documents (SGML/HTML) and databases (structured and semistructured data)
XML Basics and Document Type Descriptors
XML query languages: XPath, XQuery
Part I: Background
What's the difference between the world of documents and information retrieval, and the world of databases and query interfaces?
Documents vs Databases
Document world
> plenty of small documents
> usually static
> implicit structure: section, paragraph, toc, ...
> tagging
> paradigms
> meta-data

Database world
> a few large databases
> usually dynamic
> explicit structure (schema)
> records
> machine friendly
> content: schema, data, methods
> paradigms
> meta-data: schema description

querying
composing/transforming
Lingua franca for publishing hypertext on the World Wide Web.
HTML is widely used for formatting and structuring Web documents.
Designed to describe how a Web browser should arrange text, images and push-buttons on a page.
Easy to learn, but does not convey the structure and meaning of the data in Web pages.
Fixed tag set.
HTML
<HTML>
<HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
<BODY>
<H1>Introduction</H1>
<IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
</BODY>
</HTML>
(Slide callouts: opening tag, text (PCDATA), closing tag.)
Semistructured data
1. Information integration: an important new application that motivates what follows.
2. Semistructured data: a new data model designed to cope with problems of information integration.
3. XML: a new Web standard that is essentially semistructured data.
4. XQUERY: an emerging standard query language for XML data.
Information Integration
Problem: related data exists in many places. The sources talk about the same things, but differ in model, schema, and conventions (e.g., terminology).
Example: in the real world, every bar has its own database.
Some may have relations like beer-price; others have a Microsoft Word file from which the menu is printed.
Some keep phones of manufacturers but not addresses.
Some distinguish beers and ales; others do not.
Two approaches
1. Warehousing: make copies of the information at each data source centrally. Reconstruct the data daily/weekly/monthly, but do not try to keep it up-to-date.
2. Mediation: create a view of all the information, but do not make copies. Answer queries by sending appropriate queries to the sources.
[Figure: warehousing. Wrappers extract data from DB1 and DB2, a combiner merges it into the warehouse, and the warehouse answers the user query with a result.]
Mediation
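[Figure: mediation. The user sends a query to the mediator; the mediator sends queries through one wrapper per source to DB1 and DB2, and the results flow back through the wrappers and the mediator to the user.]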
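The mediation approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory sources, their field names, and the functions wrapper_db1, wrapper_db2 and mediator are all hypothetical, chosen to mirror the bar example above where sources differ in schema and conventions.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Two hypothetical sources that describe the same things differently.
DB1 = [{"beer": "Bud", "price": 2.50}]            # price in dollars
DB2 = [{"name": "Miller", "cost_cents": 300}]     # other field names, price in cents

def wrapper_db1(beer: str) -> List[Row]:
    """Query DB1 in its own terms and return rows in the mediator's schema."""
    return [{"beer": r["beer"], "price": r["price"]}
            for r in DB1 if r["beer"] == beer]

def wrapper_db2(beer: str) -> List[Row]:
    """Query DB2 in its own terms; rename fields and convert cents to dollars."""
    return [{"beer": r["name"], "price": r["cost_cents"] / 100}
            for r in DB2 if r["name"] == beer]

def mediator(beer: str, wrappers: List[Callable[[str], List[Row]]]) -> List[Row]:
    """Answer a user query by asking every source at query time (no copies kept)."""
    results: List[Row] = []
    for wrapper in wrappers:
        results.extend(wrapper(beer))
    return results

wrappers = [wrapper_db1, wrapper_db2]
print(mediator("Miller", wrappers))
```

A warehouse would instead run the wrappers periodically, store the combined rows, and answer queries from that stored copy.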
Semistructured Data
A different kind of data model, more suited to information-integration applications than either relational or OO.
Think of objects, but with the type of each object its own business rather than the business of the class to which it belongs.
Allows information from several sources, with related but different properties, to be fit together in one whole.
Major application: XML documents.
Well-Formed XML
1. Declaration = <? ... ?>.
Normal declaration is
<?xml version="1.0" standalone="yes"?>
standalone="yes" means that the document does not depend on an external DTD.
2. Root tag surrounds the entire balance of the document. <FOO> is balanced by </FOO>, as in HTML.
3. Any balanced structure of tags is OK. A tag with no closing partner, like <P> in HTML, must be written as an empty-element tag, e.g. <FOO/>.
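The balance rules above can be checked mechanically. The sketch below uses Python's standard xml.etree.ElementTree parser, which rejects any document that is not well-formed; the helper name is_well_formed and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

balanced = '<?xml version="1.0" standalone="yes"?><FOO><BAR>text</BAR></FOO>'
unbalanced = "<FOO><P>text</FOO>"      # <P> is never closed
empty_tag = "<FOO><P/>text</FOO>"      # empty-element tag needs no partner
two_roots = "<FOO/><BAR/>"             # no single root tag

for sample in (balanced, unbalanced, empty_tag, two_roots):
    print(is_well_formed(sample), sample)
```

Note that this checks well-formedness only; it says nothing about whether a document obeys a DTD.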
XML text
XML has only one basic type -- text. It is bounded by tags, e.g.
<title> The Big Sleep </title>
<year> 1935 </year> --- 1935 is still text
XML text is called PCDATA (for parsed character data). It uses Unicode, encoded as UTF-8 by default or as UTF-16.
XML structure
Nesting tags can be used to express various structures. E.g., a tuple (record):
<person>
<name> Malcolm Atchison </name>
<tel> (215) 898 4321 </tel>
<email> mp@dcs.gla.ac.sc </email>
</person>
Terminology
The segment of an XML document between an opening and a corresponding closing tag is called an element.
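<person>
<name> Malcolm Atchison </name>
<tel> (215) 898 4321 </tel>
<email> mp@dcs.gla.ac.sc </email>
</person>
(Slide callouts: the whole <person> ... </person> segment is an element; <name> ... </name> is an element, a sub-element of person; a fragment that is not delimited by an opening tag and its corresponding closing tag is not an element.)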
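To make the terminology concrete, here is a small sketch that parses the person record with Python's standard xml.etree.ElementTree and lists its sub-elements. The variable names doc, person and record are illustrative choices, not part of the slides.

```python
import xml.etree.ElementTree as ET

doc = """<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <email> mp@dcs.gla.ac.sc </email>
</person>"""

# The root element is everything between <person> and </person>.
person = ET.fromstring(doc)

# Iterating over an element yields its direct sub-elements in document order.
record = {child.tag: child.text.strip() for child in person}

print(person.tag)
print(record)
```

Reading the nested tags into a dictionary shows why such a record plays the role of a tuple: each sub-element name acts as a field name and its text as the field value.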
Example
[Figure: graph of semistructured data. Edge labels: bar, beer, name, manf, servedAt, addr, prize, year, award. Leaf values: Bud, A.B., Mlob, 1995, Gold, Maple.]
Example
<?xml version="1.0" standalone="yes"?>
<BARS>
<BAR><NAME>Joe's Bar</NAME>
<BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
<BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
</BAR>
<BAR> ...
</BARS>
[Figure: the employees relation, with attributes name, ssn, age.]
Attributes
An (opening) tag may contain attributes. These are typically used to describe the content of an element.
<entry>
<word language="en"> cheese </word>
<word language="fr"> fromage </word>
<word language="ro"> branza </word>
<meaning> A food made </meaning>
</entry>
Attributes (cont'd)
Another common use for attributes is to express dimension or type.
<picture>
<height dim="cm"> 2400 </height>
<width dim="in"> 96 </width>
<data encoding="gif" compression="zip"> M05-.+C$@02!G96YE<FEC ... </data>
</picture>
Using IDs
<family>
<person id="jane" mother="mary" father="john">
<name> Jane Doe </name>
</person>
<person id="john" children="jane jack">
<name> John Doe </name>
</person>
<person id="mary" children="jane jack">
<name> Mary Doe </name>
</person>
<person id="jack" mother="mary" father="john">
<name> Jack Doe </name>
</person>
</family>
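The id attributes above act as keys, and attributes such as mother or children hold references to them. The following sketch shows one way an application might follow those references using Python's standard xml.etree.ElementTree; the helper names by_id, name_of and children_of are hypothetical.

```python
import xml.etree.ElementTree as ET

family_xml = """<family>
  <person id="jane" mother="mary" father="john"><name> Jane Doe </name></person>
  <person id="john" children="jane jack"><name> John Doe </name></person>
  <person id="mary" children="jane jack"><name> Mary Doe </name></person>
  <person id="jack" mother="mary" father="john"><name> Jack Doe </name></person>
</family>"""

family = ET.fromstring(family_xml)

# Index every person by its id so that references can be resolved.
by_id = {p.get("id"): p for p in family.iter("person")}

def name_of(person_id):
    """Follow a single reference and return the referenced person's name."""
    return by_id[person_id].findtext("name").strip()

def children_of(person_id):
    """Follow a space-separated list of references (an IDREFS-style value)."""
    refs = by_id[person_id].get("children", "").split()
    return [name_of(ref) for ref in refs]

print(name_of(by_id["jane"].get("mother")))
print(children_of("john"))
```

The parser itself does not know that these attributes are identifiers; without a DTD declaring them as ID and IDREF, resolving them is up to the application, as done here with a dictionary.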
An object-oriented schema
class Movie ( extent Movies, key title )
{
attribute string title;
attribute string director;
relationship set<Actor> casts inverse Actor::acted_In;
attribute int budget;
};

class Actor
{
attribute string name;
relationship set<Movie> acted_In inverse Movie::casts;
attribute int age;
attribute set<string> directed;
};
An example
<db>
<movie id="m1">
<title>Waking Ned Divine</title>
<director>Kirk Jones III</director>
<cast idrefs="a1 a3"></cast>
<budget>100,000</budget>
</movie>
<movie id="m2">
<title>Dragonheart</title>
<director>Rob Cohen</director>
<cast idrefs="a2 a9 a21"></cast>
<budget>110,000</budget>
</movie>
<movie id="m3">
<title>Moondance</title>
<director>Dagmar Hirtz</director>
<cast idrefs="a1 a8"></cast>
<budget>90,000</budget>
</movie>
:
<actor id="a1">
<name>David Kelly</name>
<acted_In idrefs="m1 m3 m78"> </acted_In>
</actor>
<actor id="a2">
<name>Sean Connery</name>
<acted_In idrefs="m2 m9 m11"> </acted_In>
<age>68</age>
</actor>
<actor id="a3">
<name>Ian Bannen</name>
<acted_In idrefs="m1 m35"> </acted_In>
</actor>
:
</db>
<addr> Rome, OH 98765 </addr>
(as many address lines, in order, as needed)
<tel> (321) 786 2543 </tel>
<fax> (321) 786 2543 </fax>
<tel> (321) 786 2543 </tel>
<email> jm@abc.com </email>
</person>
Elements of a DTD
An element is a name (its tag) and a parenthesized description of the tags within the element.
Special case: (#PCDATA) after an element name means it is text.
Example
<!DOCTYPE BARS [
<!ELEMENT BARS (BAR*)>
<!ELEMENT BAR (NAME, BEER+)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT BEER (NAME, PRICE)>
<!ELEMENT PRICE (#PCDATA)>
]>
Example of (a)
<?xml version="1.0" standalone="no"?>
<!DOCTYPE BARS [
<!ELEMENT BARS (BAR*)>
<!ELEMENT BAR (NAME, BEER+)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT BEER (NAME, PRICE)>
<!ELEMENT PRICE (#PCDATA)>
]>
<BARS>
<BAR><NAME>Joe's Bar</NAME>
<BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
<BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
</BAR>
<BAR> ...
</BARS>
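A DTD content model is essentially a regular expression over child tags, and that idea can be sketched directly. Python's standard library does not validate against DTDs, so the code below is not a DTD validator: the table CONTENT_MODELS is a hand translation of the element declarations above into Python regular expressions, and the function conforms is a hypothetical helper that checks child-element order only.

```python
import re
import xml.etree.ElementTree as ET

# Each content model is written as a regex over the sequence of child tag
# names, with every name followed by one space.
CONTENT_MODELS = {
    "BARS": r"(BAR )*",        # <!ELEMENT BARS (BAR*)>
    "BAR": r"NAME (BEER )+",   # <!ELEMENT BAR (NAME, BEER+)>
    "BEER": r"NAME PRICE ",    # <!ELEMENT BEER (NAME, PRICE)>
    "NAME": r"",               # (#PCDATA): no child elements allowed
    "PRICE": r"",              # (#PCDATA): no child elements allowed
}

def conforms(element) -> bool:
    """Check an element and, recursively, its descendants against the models."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # undeclared element
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in element)

good = ET.fromstring(
    "<BARS><BAR><NAME>Joe's Bar</NAME>"
    "<BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>"
    "<BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>"
    "</BAR></BARS>"
)
bad = ET.fromstring("<BARS><BAR><NAME>Joe's Bar</NAME></BAR></BARS>")  # no BEER

print(conforms(good), conforms(bad))
```

The second document is well-formed but violates BEER+, which is the distinction between well-formed and valid XML.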
Example of (b)
Suppose our bars DTD is in file bar.dtd:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE BARS SYSTEM "bar.dtd">
<BARS>
<BAR><NAME>Joe's Bar</NAME>
<BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
<BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
</BAR>
<BAR> ...
</BARS>
Attribute Lists
Opening tags can have arguments that appear within the tag, in analogy to constructs like <A HREF = ...> in HTML.
Keyword !ATTLIST introduces a list of attributes and their types for a given element.
Example:
<!ELEMENT BAR (NAME, BEER*)>
<!ATTLIST BAR type (sushi|sports|other) #IMPLIED>
Bar objects can have a type, and the value of that type is limited to the three strings shown.
Example of use:
<BAR type = "sushi"> . . . </BAR>
Example
Let us include in our Bars document type elements that are the manufacturers of beers, and have each beer object link, with an IDREF, to the proper manufacturer object.
<!DOCTYPE BARS [
<!ELEMENT BARS (BAR*)>
<!ELEMENT BAR (NAME, BEER+)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT MANF (ADDR)>
<!ATTLIST MANF name ID #REQUIRED>
<!ELEMENT ADDR (#PCDATA)>
<!ELEMENT BEER (NAME, PRICE)>
<!ATTLIST BEER manf IDREF #REQUIRED>
<!ELEMENT PRICE (#PCDATA)>
]>
Another file:
<!DOCTYPE db SYSTEM "schema.dtd">
A URL:
<!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">
Some of the XML extensions impose something like a schema or type on an XML document. We'll see these later.
Some tools
XML Authority: http://www.extensibility.com/tibco/solutions/xml_authority/index.htm
XML Spy: http://www.xmlspy.com/download.html
Summary
XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema).
DTDs provide some useful syntactic constraints on documents. As schemas they are weak.
XPath
Reasonably widely adopted -- in XML-Schema and query languages.
Neither more expressive nor less expressive than regular path expressions.
Primary goal = to permit access to some nodes from a given document.
XPath main construct: axis navigation.
An XPath path consists of one or more navigation steps, separated by /.
A navigation step is a triplet: axis + node-test + list of predicates.
Examples
/descendant::node()/child::author
/descendant::node()/child::author[parent/attribute::booktitle = "XML"][2]
[Figure: example document tree with element nodes aaa, bbb and ccc.]
Predicates
[2] -- the second child node of the context node
chapter[5] -- the fifth chapter child of the context node
[last()] -- the last child node of the context node
chapter[title="introduction"] -- the chapter children of the context node that have one or more title children whose string-value is introduction
person[.//firstname = "joe"] -- the person children of the context node that have in their descendants a firstname element with string-value joe
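Some of these predicates can be tried out with Python's standard xml.etree.ElementTree, which implements only a limited subset of XPath (positional predicates, last(), and child-text equality are among the supported forms). The small book document and the helper titles are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# A hypothetical document; the book element plays the role of the context node.
book = ET.fromstring("""<book>
  <chapter><title>introduction</title></chapter>
  <chapter><title>syntax</title></chapter>
  <chapter><title>queries</title></chapter>
</book>""")

def titles(path):
    """Evaluate a path relative to the context node and return chapter titles."""
    return [chapter.findtext("title") for chapter in book.findall(path)]

print(titles("chapter[2]"))                      # second chapter child
print(titles("chapter[last()]"))                 # last chapter child
print(titles("chapter[title='introduction']"))   # chapters with a matching title child
print(len(book.findall(".//title")))             # all title descendants of the context node
```

Positions count from 1, not 0, which is a common source of off-by-one mistakes when moving between XPath and host-language indexing.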
Axis navigation
So far, nearly all our expressions have moved us down the tree by moving to child nodes. Exceptions were
. -- stay where you are
/ -- go to the root
// -- all descendants of the root
.// -- all descendants of the context node
All other expressions have been abbreviations for child::, e.g. para is short for child::para. child:: is an example of an axis.
XPath has several axes: ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self
Some of these (self, parent) describe single nodes, others describe sequences of nodes.
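[Figure: XPath axes drawn relative to the context node (self): preceding-sibling, following-sibling, child, preceding, following, descendant, attribute, namespace.]
Abbreviations:
(nothing) -- child::
@ -- attribute::
// -- /descendant-or-self::node()/
. -- self::node()
.// -- self::node()/descendant-or-self::node()/
.. -- parent::node()
/ -- the root of the document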
/company -- returns the company root node and all its descendant nodes, that is, the whole XML document.
/company/department
//employee [employeeSalary gt 70000]/employeeName -- returns all employeeName nodes that are direct children of an employee node, such that the employee node has another child element employeeSalary whose value is greater than 70000.
/company/employee [employeeSalary gt 70000]/employeeName
/company/project/projectWorker [hours ge 20.0] -- returns the projectWorker nodes that have a child node hours with a value of at least 20.0 hours.
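The difference between //employee and /company/employee can be seen on a small document. Python's standard xml.etree.ElementTree does not support the gt comparison, so this sketch selects the employee nodes by path and applies the salary condition in Python instead. The company document below, including its names and salaries, is invented for illustration, as is the helper high_earners.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: two employees directly under company, one inside a department.
company = ET.fromstring("""<company>
  <employee><employeeName>Ann</employeeName><employeeSalary>82000</employeeSalary></employee>
  <employee><employeeName>Bob</employeeName><employeeSalary>55000</employeeSalary></employee>
  <department>
    <employee><employeeName>Cy</employeeName><employeeSalary>91000</employeeSalary></employee>
  </department>
</company>""")

def high_earners(employees, threshold=70000):
    """Emulate [employeeSalary gt threshold]/employeeName over the given nodes."""
    return [e.findtext("employeeName") for e in employees
            if float(e.findtext("employeeSalary")) > threshold]

# //employee[...]: employee nodes at any depth, in document order.
anywhere = high_earners(company.iter("employee"))

# /company/employee[...]: only employee nodes that are direct children of company.
direct = high_earners(company.findall("employee"))

print(anywhere)
print(direct)
```

On this data the two paths give different answers because one employee is nested inside a department.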
XQuery
XPath allows us to write expressions that select nodes from a tree-structured XML document.
XQuery permits the specification of more general queries on one or more XML documents.
The typical form of a query in XQuery is known as a FLWR expression.
FOR <variable bindings to individual nodes (elements)>
LET <variable bindings to collection of nodes (elements)>
WHERE <qualifier conditions>
RETURN <query result specification>
XQuery
Emerging standard for querying XML documents.
Basic form:
FOR <variables ranging over sets of elements>
WHERE <condition>
RETURN <set of elements>;
Sets of elements described by paths, consisting of:
1. URL, if necessary.
2. Element names forming a path in the semistructured data graph, e.g., //BAR/NAME = start at any BAR node and go to a NAME child.
3. Ending condition of the form
[<condition about subelements, @attributes, and values>]
Example
The file http://www.cse.ucsc.edu/bars.xml:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE BARS SYSTEM "bar.dtd">
<BARS>
<BAR type = "sports">
<NAME>Joe's Bar</NAME>
<BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
<BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
</BAR>
<BAR type = "sushi">
<NAME>Homma's</NAME>
<BEER><NAME>Sapporo</NAME> <PRICE>4.00</PRICE></BEER>
</BAR>
...
</BARS>
XQUERY Query
Query: Find the prices charged for Bud by sports bars that serve Miller.
FOR $ba IN document("http://www.cse.ucsc.edu/bars.xml")//BAR[@type = "sports"],
    $be IN $ba/BEER[NAME = "Bud"]
WHERE $ba/BEER[NAME = "Miller"]
RETURN $be/PRICE;
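The FOR/WHERE/RETURN structure of this query maps closely onto nested loops. The sketch below is not XQuery; it is a Python rendering using the standard xml.etree.ElementTree, with the document inlined as a string rather than fetched from the URL, and with the elided part of the file left out.

```python
import xml.etree.ElementTree as ET

bars = ET.fromstring("""<BARS>
  <BAR type="sports"><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR type="sushi"><NAME>Homma's</NAME>
    <BEER><NAME>Sapporo</NAME><PRICE>4.00</PRICE></BEER>
  </BAR>
</BARS>""")

prices = []
# FOR $ba IN ...//BAR[@type = "sports"]
for ba in bars.findall(".//BAR[@type='sports']"):
    # $be IN $ba/BEER[NAME = "Bud"]
    for be in ba.findall("BEER[NAME='Bud']"):
        # WHERE $ba/BEER[NAME = "Miller"]: true if at least one such BEER exists
        if ba.find("BEER[NAME='Miller']") is not None:
            # RETURN $be/PRICE
            prices.append(be.findtext("PRICE"))

print(prices)
```

Each FOR variable becomes a loop, the WHERE clause becomes an existence test on a path, and RETURN collects one result per surviving binding.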
Conclusions
XML is a data format for which there are an increasing number of useful tools for
Constructing schemas
Programming
Querying
Although it is likely that a query language will soon emerge as a standard, there is less agreement or understanding on how to store XML data efficiently.
Many other database issues remain to make it useful for manipulating large amounts of data.