
Nested Mappings:

Schema Mapping Reloaded

P. Papotti (Università Roma Tre)
M.A. Hernandez, H. Ho, L. Popa (IBM Almaden Research Center)
A. Fuxman, R.J. Miller (University of Toronto)
The Problem of Mapping Generation
 Schemas can be arbitrarily different
 E.g., different normalization & naming, missing/extra elements
 Input: correspondences between atomic schema elements
 (Automatic discovery)
 Schema mappings: logical and declarative expressions of relationships between schemas
 Abstraction for data interoperability tasks
 Simpler than actual implementations of data exchange (SQL/XQuery/XSLT)
 Must generate a transformation that:
 Preserves data relationships: pname-dname, pname-ename, etc.
 Creates new target values (pid)
 Produces “correct” groupings

03/22/09 Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti 2


Outline
 Schema mapping generation
 [VLDB’02] Fagin, Hernandez, Popa (IBM Almaden),
Miller, Velegrakis (Univ. of Toronto)
 From basic to nested:
 Issues with basic mappings
 Nested mappings and their advantages
 Generation algorithm
 Performance impact
 Conclusion
 Related work
 Future directions



Schema Mapping Generation
[Figure: source schema S and target schema T are connected by schema correspondences; the source concepts and target concepts extracted from them (relational views) are connected by mappings.]

 Step 1. Extraction of “concepts” (in each schema).
 Concept = one category of data that can exist in the schema
 Step 2. Mapping generation
 Enumerate all non-redundant maps between pairs of concepts



Example

Source schema:
proj: Set [
  dname
  pname
  emps: Set [
    ename
    salary
  ]
]

Target schema:
dept: Set [
  dname
  budget
  emps: Set [
    ename
    salary
    worksOn: Set [
      pid
    ]
  ]
  projects: Set [
    pid
    pname
  ]
]

 m1 maps proj to dept-projects (the concept of “project of a department”)

m1: ∀(p0 in proj)
    ∃(d in dept) ∃(p in d.projects)
    p0.dname = d.dname ∧ p0.pname = p.pname

 m2 maps proj-emps to dept-emps-worksOn-projects (the concept of “project of an employee of a department”)

m2: ∀(p0 in proj) ∀(e0 in p0.emps)
    ∃(d in dept) ∃(p in d.projects) ∃(e in d.emps) ∃(w in e.worksOn)
    w.pid = p.pid
    ∧ p0.dname = d.dname ∧ p0.pname = p.pname
    ∧ e0.ename = e.ename ∧ e0.salary = e.salary

(The ∃ part of m2, joined on w.pid = p.pid, is the expression for dept-emps-worksOn-projects.)

 Two ‘basic’ mappings (or source-to-target tgds or GLAV formulas)





Issue 1: Many Small Uncorrelated Formulas
[Figure: the source schema (proj) and target schema (dept) from the Example slide, connected by mappings m1 and m2.]

 m1: “for every proj tuple there must be dept and project tuples such that …”
 m2: “for every emp of a proj tuple there must be: dept, emp, worksOn, project …”
 If we also had dependents under employees, then: “for every dependent of an emp of a proj …” and so on …
 There is a lot of common mapping behavior that is repeated
 E.g., m2 repeats the mapping behavior of m1 (although for a “subconcept”)



Issue 2: Redundancy in the Generated Data
Input (proj):
  (dname = CS, pname = uSearch, emps = { (Alice, 120K), (John, 90K) })

Possible output (dept):
  dname  budget  emps                                  projects           required to exist based on
  CS     B1      { }                                   { (X1, uSearch) }  m1
  CS     B2      { (Alice, 120K, worksOn = { X2 }) }   { (X2, uSearch) }  m2
  CS     B3      { (John, 90K, worksOn = { X3 }) }     { (X3, uSearch) }  m2

 m2 repeats the mapping behavior of m1:
 “duplicate” dept and project tuples
 “duplicate” nulls (pid values: X2 and X3, and budget values)
 Moreover, this duplication happens for each joining emp tuple in the source
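
The duplication can be reproduced with a small Python sketch that materializes each basic mapping independently, inventing a fresh null for every existentially quantified value. This is only an illustration under assumptions: the list-of-dicts encoding, the helper `make_null_factory`, and the function `apply_basic` are hypothetical names, and the strategy is a naive one-tuple-per-match evaluation rather than the actual transformation engine.

```python
from collections import defaultdict
from itertools import count


def make_null_factory():
    """Return fresh(prefix), which yields B1, B2, ... and X1, X2, ... independently."""
    counters = defaultdict(lambda: count(1))
    return lambda prefix: f"{prefix}{next(counters[prefix])}"


# Assumed encoding of the slide's input: nested sets as lists of dicts.
source = {"proj": [
    {"dname": "CS", "pname": "uSearch",
     "emps": [{"ename": "Alice", "salary": "120K"},
              {"ename": "John", "salary": "90K"}]},
]}


def apply_basic(src):
    """Evaluate m1 and m2 separately; each match creates its own dept tuple."""
    fresh = make_null_factory()
    dept = []
    # m1: one dept tuple (with one project) per proj tuple.
    for p0 in src["proj"]:
        dept.append({
            "dname": p0["dname"], "budget": fresh("B"), "emps": [],
            "projects": [{"pid": fresh("X"), "pname": p0["pname"]}],
        })
    # m2: one more dept tuple per (proj, emp) pair, repeating the work of m1.
    for p0 in src["proj"]:
        for e0 in p0["emps"]:
            pid = fresh("X")
            dept.append({
                "dname": p0["dname"], "budget": fresh("B"),
                "emps": [{"ename": e0["ename"], "salary": e0["salary"],
                          "worksOn": [{"pid": pid}]}],
                "projects": [{"pid": pid, "pname": p0["pname"]}],
            })
    return dept


output = apply_basic(source)
for d in output:
    print(d["dname"], d["budget"], [e["ename"] for e in d["emps"]], d["projects"])
```

With one proj tuple and two employees, the sketch yields three dept tuples for the same department, each with its own budget null and its own copy of the uSearch project.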



Issue 3: No Grouping in the Target
[Figure: the same input and possible output as in Issue 2, with the tuples required by m1 and by m2 marked; Alice and John end up in the emps sets of different dept tuples.]

 Alice and John are in different singleton sets (E and E’)
 There can be as many singleton sets as emp tuples in the source nested set
 It is desirable to enforce the grouping on the target data



Summary of issues
 Fragmentation of the specification
 (Too) many small tgds
 Fragmentation of the data
 Generate redundant data (which later needs to be removed or fused)
 No grouping enforced on the target data (need additional phase to enforce any grouping)



Idea

[Figure: the source schema (proj) and target schema (dept) from the Example slide, connected by mappings m1 and m2.]

 We would like to reuse (in m2) the “dept” and “project” tuples that the simpler mapping m1 asserts.
 Make m2 assert only the “extra” information
 Also accumulate the corresponding employees into one set
 Idea: Correlate the mapping formulas based on their common part



Correlating Mapping Formulas
m1: ∀(p0 in proj)
    ∃(d in dept) ∃(p in d.projects)
    p0.dname = d.dname ∧ p0.pname = p.pname

m2: ∀(p0 in proj) ∀(e0 in p0.emps)
    ∃(d in dept) ∃(p in d.projects) ∃(e in d.emps) ∃(w in e.worksOn)
    w.pid = p.pid
    ∧ p0.dname = d.dname ∧ p0.pname = p.pname
    ∧ e0.ename = e.ename ∧ e0.salary = e.salary

Replace with

n: ∀(p0 in proj)
   ∃(d in dept) ∃(p in d.projects)
   p0.dname = d.dname ∧ p0.pname = p.pname
   ∧ [ ∀(e0 in p0.emps)
       ∃(e in d.emps) ∃(w in e.worksOn)
       w.pid = p.pid
       ∧ e0.ename = e.ename ∧ e0.salary = e.salary ]

 proj tuples are mapped only once
 The bracketed part is a submapping, correlated to the parent mapping
 For every proj tuple, we map all employees, as a group. (Source grouping is preserved)
 This is a nested mapping
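
One way to read n operationally is as a nested loop: the outer loop creates the dept and project tuples once per proj tuple, and the inner loop (the submapping) reuses them while adding employees. The Python sketch below illustrates this reading; the list-of-dicts encoding and the names `make_null_factory` and `apply_nested` are assumptions for illustration, not the generated transformation itself.

```python
from collections import defaultdict
from itertools import count


def make_null_factory():
    """Return fresh(prefix), which yields B1, B2, ... and X1, X2, ... independently."""
    counters = defaultdict(lambda: count(1))
    return lambda prefix: f"{prefix}{next(counters[prefix])}"


# Assumed encoding: nested sets as lists of dicts; values are illustrative.
source = {"proj": [
    {"dname": "CS", "pname": "uSearch",
     "emps": [{"ename": "Alice", "salary": "120K"},
              {"ename": "John", "salary": "90K"}]},
]}


def apply_nested(src):
    """Evaluate n: the parent part runs once per proj tuple, the submapping
    runs per employee and shares the parent's d and p."""
    fresh = make_null_factory()
    dept = []
    for p0 in src["proj"]:
        # Parent mapping: assert d and p exactly once.
        p = {"pid": fresh("X"), "pname": p0["pname"]}
        d = {"dname": p0["dname"], "budget": fresh("B"),
             "emps": [], "projects": [p]}
        # Submapping: correlated to the parent through d and p.
        for e0 in p0["emps"]:
            d["emps"].append({"ename": e0["ename"], "salary": e0["salary"],
                              "worksOn": [{"pid": p["pid"]}]})
        dept.append(d)
    return dept


output = apply_nested(source)
print(output)
```

For the sample input this produces a single dept tuple whose emps set contains both employees, all pointing at the same project null.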



Advantages of Nested Mappings
 Nested tgds can exploit the natural hierarchy that exists on the concepts of a schema
 e.g., proj-emps is a “subconcept” of proj, in the source schema
 Map higher concept only once; use submappings for subconcepts
 Nested mappings are strictly more expressive: there is no set of source-to-target tgds that is equivalent to n.

[Figure: source schema proj: Set [ dname, pname, emps: Set [ ename, salary ] ]]



Nesting Algorithm: Sketch

 Step 1. Discovery: construct a DAG of basic mappings based on the concept hierarchy
 Step 2. Correlation: construct nested mappings by traversing the DAG, starting from each root, and repeatedly applying the nesting step hinted at earlier.
 We get a forest of nested mappings
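
The two steps can be sketched in Python as follows. This is a simplified illustration, not the algorithm from the talk: it assumes a concept can be modeled as the set of set-valued paths it ranges over, and that one basic mapping sits below another when both its source and target concepts contain the other's. The class `BasicMapping` and the functions `is_submapping`, `build_dag` and `nest` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicMapping:
    name: str
    source: frozenset  # assumed model: paths the source concept ranges over
    target: frozenset  # assumed model: paths the target concept ranges over


def is_submapping(child, parent):
    """Assumed test: child extends parent on both the source and target side."""
    return (child != parent
            and parent.source <= child.source
            and parent.target <= child.target)


def build_dag(mappings):
    """Step 1 (Discovery): parent -> children edges, direct edges only."""
    children = {m.name: [] for m in mappings}
    for child in mappings:
        parents = [p for p in mappings if is_submapping(child, p)]
        for p in parents:
            # Drop the edge if another parent lies between p and child.
            if not any(is_submapping(q, p) for q in parents):
                children[p.name].append(child.name)
    return children


def nest(children):
    """Step 2 (Correlation): walk the DAG from each root, nesting children."""
    has_parent = {c for cs in children.values() for c in cs}
    roots = [m for m in children if m not in has_parent]

    def expand(name):
        return {name: [expand(c) for c in children[name]]}

    return [expand(r) for r in roots]


m1 = BasicMapping("m1", frozenset({"proj"}),
                  frozenset({"dept", "dept.projects"}))
m2 = BasicMapping("m2", frozenset({"proj", "proj.emps"}),
                  frozenset({"dept", "dept.projects", "dept.emps",
                             "dept.emps.worksOn"}))

forest = nest(build_dag([m1, m2]))
print(forest)
```

For the running example the forest has one root, m1, with m2 nested under it, which matches the shape of the nested mapping n.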



Nesting Algorithm: Example
[Figure: the source schema (proj) and target schema (dept) from the Example slide. Source concepts P and PE are paired (×) with target concepts D, DE, DP and DEPW, giving a DAG of basic mappings with nodes PDP and PEDEPW.]

Resulting nested mapping (PDP with PEDEPW nested inside it):

for p in proj
exists d’ in dept, p’ in d’.projects
where d’.dname=p.dname and p’.pname=p.pname and
  ( for e in p.emps
    exists e’ in d’.emps, w in e’.worksOn
    where w.pid=p’.pid and
          e’.ename=e.ename and e’.salary=e.salary
  )



Experimental evaluation
 Goal: show empirically that nested mappings can dramatically:
 reduce the cost of producing a target instance
 improve the quality of the generated data
 DBLP-like schema, on both source and target, with four levels of nesting/grouping:
 authors – level 1
 conferences – level 2
 years – level 3
 publications – level 4
 Mappings are implemented by generating queries (in XQuery)
 Qbasic based on basic mappings
 Qnested based on nested mappings



Example Queries – 2 Levels Only
Qbasic (multiple query terms, one per basic mapping):

    let $doc0 := fn:doc("instance.xml") return
    <authorDB>
    { for $x0 in $doc0/authorDB/author,
          $x1 in $x0/conf
      return
      <author>
        <name> { $x0/name/text() } </name>
        { for $x0L1 in $doc0/authorDB/author,
              $x1L1 in $x0L1/conf
          where $x0/name/text()=$x0L1/name/text()
          return
          <conf>
            <name> { $x1L1/name/text() } </name>
          </conf> }
      </author>
    }
    { for $x0 in $doc0/authorDB/author
      return
      <author>
        <name> { $x0/name/text() } </name>
      </author>
    }
    </authorDB>

 Need re-grouping (over entire data)
 Generate duplicates

Qnested:

    let $doc0 := fn:doc("instance.xml") return
    <authorDB>
    { for $x0 in $doc0/authorDB/author
      return
      <author>
        <name> { $x0/name/text() } </name>
        { for $x1 in $x0/conf
          return
          <conf>
            <name>{ $x1/name/text() }</name>
          </conf> }
      </author> }
    </authorDB>

 Single pass over the data
 No duplicates



Execution time comparison
• Qbasic execution time / Qnested execution time
• Logarithmic scale

[Figure: basic query execution time / nested query execution time (logarithmic axis, 1 to 10000) against nesting level (1 to 4), with curves labeled 312 KB, 514 KB and 1020 KB.]

• Execution time for basic: 22 minutes
• Execution time for nested: 1.1 seconds



Output file size comparison
• Qbasic output file size / Qnested output file size
• Logarithmic scale

[Figure: basic query output size / nested query output size (logarithmic axis, 1 to 100) against nesting level (1 to 4), with curves labeled 312 KB, 514 KB and 2111 KB.]

• Size of generated data for basic (including duplicates): 45MB
• Size of generated data for nested: 552KB

The nested mapping results in much more efficient execution with less redundant data
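
As a quick sanity check, the ratios implied by the quoted figures (22 minutes versus 1.1 seconds, 45MB versus 552KB) can be computed directly. The sketch assumes that each pair of figures comes from the same experiment and that 1 MB = 1024 KB; neither is stated on the slides.

```python
# Figures quoted on the two comparison slides.
basic_seconds = 22 * 60      # 22 minutes
nested_seconds = 1.1
time_ratio = basic_seconds / nested_seconds

basic_kb = 45 * 1024         # assumption: 1 MB = 1024 KB
nested_kb = 552
size_ratio = basic_kb / nested_kb

print(f"time ratio ~{time_ratio:.0f}x, size ratio ~{size_ratio:.0f}x")
```

Under these assumptions the basic query is roughly three orders of magnitude slower and produces close to two orders of magnitude more data, which is consistent with the ranges of the two logarithmic axes.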



Related work
 Both embedded mappings [Melnik et al. SIGMOD’05] and HePToX [Bonifati et al. VLDB’05] support nested data, but do not support nesting of mappings.
 Nested mappings are less general than the languages used for composition [Fagin et al. PODS’04, Nash et al. PODS’05], but are more compact and easier to understand/program.
 The generation algorithm identifies common expressions within mappings, in the same spirit as work in query optimization [e.g., Roy et al. SIGMOD’00].
 But query optimization preserves query equivalence, while our techniques lead to mappings with better semantics (they do not preserve query equivalence).
 There are already commercial tools that use similar paradigms (e.g., IBM Ascential DataStage TX), but most of the mapping generation work is manual.



Conclusion

 Nested tgds: better specification language for transformation
 Use correlation (hierarchy) between concepts
 Less redundancy in the output, more efficient
 Naturally preserve source grouping
 For more complex mappings we expose Skolem functions to let users alter the default grouping behavior
 Nested tgds are more compact and easier to understand/program
 Humans think top-down: map top concepts, then submappings, etc.
 Can be generated too!



Future Directions
 Extend existing solutions to use nested mappings
 Data integration, mapping analysis and reasoning, schema evolution, etc.
 Nested tgds are more complex as a logic formalism!
 Study the formal foundation of nested mappings
 More generally, develop methods for deciding when and why one schema mapping specification is “better” than another
 Need to look at issues such as:
 preservation of the source data (associations, correlations, etc.)
 minimization of incompleteness

