Panos Vassiliadis
University of Ioannina
(joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)
Outline
PrOPr 2007
Extract-Transform-Load (ETL)
[Figure: ETL architecture. Labels: Sources, Extract, DSA, Load, DW]
ETL: importance
30% of effort and expenses in the budget of the DW
55% of the total costs of DW runtime
80% of the development time in a DW project
most vendors implement a core set of operators and provide GUI to create a data flow
Now: currently, ETL designers work directly at the physical level (typically, via libraries of physical-level templates)
Challenge: can we design ETL flows as declaratively as possible?
Detail independence:
no care for the algorithmic choices
no care about the order of the transformations
(hopefully) no care for the details of the inter-attribute mappings
Now:
[Figure. Labels: DW, Physical templates, Physical scenario, Engine]
Vision:
[Figure. Labels: ETL tool, DW, Logical scenario, Optimizer, Physical templates, Physical scenario, Engine]
Detail independence
Automate (as much as possible)
Conceptual: the details of the inter-attribute mappings
Logical: the order of the transformations
Physical: the algorithmic choices
[Figure. Labels: ETL tool, DW, Logical scenario, Optimizer, Physical templates, Physical scenario, Engine]
[Figure. Labels: S1.PARTSUPP, PS1, PS2, DW.PARTSUPP; attributes PKey, SuppKey, Qty; transformations PK, SK, Date = SysDate()]
Application vocabulary
Datastore mappings
VC = {product, store}
VPproduct = {pid, pName, quantity, price, type, storage}
VPstore = {sid, sName, city, street}
VFpid = {source_pid, dw_pid}
VFsid = {source_sid, dw_sid}
VFprice = {dollars, euros}
VTtype = {software, hardware}
VTcity = {paris, rome, athens}
Datastore annotation
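The application vocabulary above can be written down as plain dictionaries, which makes it easy to check a datastore annotation against it. The sketch below is only an illustration: reading VC as the set of concepts, VP as the properties of each concept, VF as the representation formats of a property, and VT as the allowed values of a property is an assumption, and `check_annotation` is a hypothetical helper, not something taken from the slides.

```python
# Application vocabulary from the slide, as Python sets and dicts.
# The interpretation of VC/VP/VF/VT is an assumption (see the lead-in).
VC = {"product", "store"}
VP = {
    "product": {"pid", "pName", "quantity", "price", "type", "storage"},
    "store": {"sid", "sName", "city", "street"},
}
VF = {
    "pid": {"source_pid", "dw_pid"},
    "sid": {"source_sid", "dw_sid"},
    "price": {"dollars", "euros"},
}
VT = {
    "type": {"software", "hardware"},
    "city": {"paris", "rome", "athens"},
}


def check_annotation(concept, prop, fmt=None, value=None):
    """Hypothetical helper: list the problems in one annotation of a datastore attribute."""
    if concept not in VC:
        return [f"unknown concept: {concept}"]
    problems = []
    if prop not in VP[concept]:
        problems.append(f"{prop} is not a property of {concept}")
    if fmt is not None and fmt not in VF.get(prop, set()):
        problems.append(f"{fmt} is not a format of {prop}")
    if value is not None and prop in VT and value not in VT[prop]:
        problems.append(f"{value} is not an allowed value of {prop}")
    return problems


ok = check_annotation("product", "price", fmt="euros")
bad_property = check_annotation("product", "city")
bad_value = check_annotation("store", "city", value="london")
```

An empty list means the annotation uses only terms the vocabulary defines; anything else names the offending term.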
Logical Model
[Figure: example ETL scenario spanning Sources, DSA and DW. Recordsets: S1.PARTS, DS.PSNEW1, DS.PSOLD1, DS.PS1, DS.PSNEW2, DS.PSOLD2, DS.PS2, DW.PARTSUPP, LOOKUP_PS, V2. Activities: FTP1, DIFF1, DIFF2, NotNULL, AddAttr2, AddDate, SK1, SK2, $2, U, PK, Aggregate2; rejected rows go to a Log]
Logical Model
Main question:
What information should we put inside a metadata repository to be able to answer questions like:
what is the architecture of my DW back stage?
which attributes/tables are involved in the population of an attribute?
what part of the scenario is affected if we delete an attribute?
Architecture Graph
[Figure: example ETL scenario spanning Sources, DSA and DW. Recordsets: S1.PARTS, DS.PSNEW1, DS.PSOLD1, DS.PS1, DS.PSNEW2, DS.PSOLD2, DS.PS2, DW.PARTSUPP, LOOKUP_PS, V2. Activities: FTP1, DIFF1, DIFF2, NotNULL, AddAttr2, AddDate, SK1, SK2, $2, U, PK, Aggregate2; rejected rows go to a Log]
Architecture Graph
Example
[Figure: attribute-level graph. Nodes: DS.PS2, Add_Attr2, AddConst2, SK2, LOOKUP2, TMP_STOR.PARTSUPP, each with IN, OUT and PAR schemata. Attributes: PKEY, SUPPKEY, QTY, COST, DATE, SOURCE, SKEY]
Architecture Graph
Example
[Figure: the same graph with the input schema and output schema of SK2 highlighted. Nodes: DS.PS2, Add_Attr2, AddConst2, SK2, LOOKUP2, TMP_STOR.PARTSUPP. Highlighted attributes: PKEY, SOURCE]
Optimization
Execution order
[Figure. Labels: S2.PARTSUPP, DW.PARTSUPP, PK, SK, f1, f2; attributes S2.PKey, S2.SuppKey, S2.Date, SUM(S2.Qty), SUM(S2.Cost)]
Optimization
Execution order
[Figure. Labels, in order: S2.PARTSUPP, SK, f1, f2, PK, DW.PARTSUPP]
Logical Optimization
[Figure: initial workflow. Labels: PARTS1, NN (COST), PARTS2, $2 ($COST), A2E (DATE), (COST), (DATE), PARTS; activities numbered 1, 3, 7, 8, 9]
Can we push selection early enough?
Can we aggregate before $2 takes place?
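Both questions can be checked on a toy input. The sketch below is an illustration only, under stated assumptions: `$2` is read as a dollars-to-euros conversion of COST, the rate `RATE` and the sample rows are invented, and exact fractions are used so that equality is not blurred by rounding. It shows that a selection on COST can be moved before the conversion once its threshold is rewritten, and that SUM can be computed before the conversion because the conversion is linear.

```python
from fractions import Fraction

RATE = Fraction(9, 10)  # hypothetical dollars-to-euros rate


def to_euros(rows):
    """Assumed meaning of $2: convert COST from dollars to euros."""
    return [{**r, "cost": r["cost"] * RATE} for r in rows]


def select_cost_at_least(rows, threshold):
    """Selection on COST."""
    return [r for r in rows if r["cost"] >= threshold]


def sum_cost_by_pkey(rows):
    """Aggregation: SUM(COST) grouped by PKEY."""
    totals = {}
    for r in rows:
        totals[r["pkey"]] = totals.get(r["pkey"], 0) + r["cost"]
    return totals


rows = [
    {"pkey": 1, "cost": Fraction(50)},
    {"pkey": 1, "cost": Fraction(200)},
    {"pkey": 2, "cost": Fraction(120)},
]

# Selection late (on euros) versus early (threshold rewritten to dollars).
late = select_cost_at_least(to_euros(rows), Fraction(90))
early = to_euros(select_cost_at_least(rows, Fraction(90) / RATE))

# Aggregation after the conversion versus before it.
agg_late = sum_cost_by_pkey(to_euros(rows))
agg_early = {k: v * RATE for k, v in sum_cost_by_pkey(rows).items()}
```

In both cases the early variant feeds fewer rows into the conversion, which is where a gain could come from; whether such a swap is legal in general depends on the semantics of the activities involved.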
[Figure: rewritten workflow. Labels: PARTS1, NN (COST), PARTS2, $2 ($COST), A2E (DATE), (COST), (DATE), U, PARTS; activities numbered 2, 4, 5, 6, 8_1, 8_2]
Logical to Physical
Identify the best possible physical implementation for a given logical ETL workflow
[Figure. Labels: ETL tool, Conceptual to logical mapper, Logical templates, DW, Logical scenario, Optimizer, Physical templates, Physical scenario, Engine]
Problem formulation
Given a logical-level ETL workflow GL
Compute a physical-level ETL workflow GP
Such that
the semantics of the workflow do not change
all constraints are met
the cost is minimal
Solution
We model the problem of finding the physical implementation of an ETL process as a state-space search problem.
States. A state is a graph GP that represents a physical-level ETL workflow.
The initial state G0P is produced after the random assignment of physical implementations to logical activities w.r.t. preconditions and constraints.
Transitions. Given a state GP, a new state GP' is generated by replacing the implementation of a physical activity aP of GP with another valid implementation for the same activity.
Extension: introduction of a sorter activity (at the physical level) as a new node in the graph.
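The states and transitions described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the search procedure of the talk: the catalogue `IMPLEMENTATIONS`, its cost figures, the additive cost model, and the greedy hill-climbing strategy are all invented for the example, and preconditions, constraints and sorter nodes are left out.

```python
import random

# Hypothetical catalogue: logical activity -> {physical implementation: cost}.
IMPLEMENTATIONS = {
    "join": {"nested_loops": 900, "sort_merge": 400, "hash": 250},
    "aggregate": {"sort_based": 300, "hash_based": 180},
    "filter": {"scan": 100},
}


def cost(state):
    """Assumed cost model: the sum of the costs of the chosen implementations."""
    return sum(IMPLEMENTATIONS[a][impl] for a, impl in state.items())


def initial_state(rng):
    """Random assignment of one physical implementation per logical activity."""
    return {a: rng.choice(sorted(impls)) for a, impls in IMPLEMENTATIONS.items()}


def transitions(state):
    """Every state reachable by replacing the implementation of one activity."""
    for activity, impls in IMPLEMENTATIONS.items():
        for impl in impls:
            if impl != state[activity]:
                yield {**state, activity: impl}


def search(rng):
    """Greedy hill climbing: move to the cheapest neighbour until none is cheaper."""
    current = initial_state(rng)
    while True:
        best = min(transitions(current), key=cost, default=current)
        if cost(best) >= cost(current):
            return current
        current = best


best_plan = search(random.Random(0))
```

With an additive cost model the greedy walk always ends in the cheapest assignment. Once costs depend on orderings shared between activities, a local search like this can get stuck, which is one reason to treat the choice of search strategy as open here.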
Sorter introduction
Sorters: impact
We intentionally introduce orderings (via appropriate physical-level sorter activities) towards obtaining physical plans of lower cost.
Semantics: unaffected
Price to pay:
cost of sorting the stream of processed data
Gain:
it is possible to employ order-aware algorithms that significantly reduce processing cost
it is possible to amortize the cost over activities that utilize common useful orderings
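The trade-off between the price of a sorter and its gain can be made concrete with a toy cost model. Everything in the sketch below is an assumption made for illustration: sorting n rows is charged n·log2(n), an order-aware activity is charged one pass over its input, an activity that receives unordered input is charged a sort plus one pass, and the stream size stays at n between activities.

```python
import math


def sort_cost(n):
    """Assumed cost of sorting a stream of n rows."""
    return n * math.log2(n)


def order_aware_cost(n):
    """Assumed cost of an activity that exploits sorted input: one pass."""
    return n


def order_unaware_cost(n):
    """Assumed cost of the same activity on unordered input: it sorts internally."""
    return sort_cost(n) + n


def plan_without_sorter(n, activities):
    """Each activity pays for its own ordering."""
    return activities * order_unaware_cost(n)


def plan_with_sorter(n, activities):
    """One sorter up front; its cost is amortized over all activities."""
    return sort_cost(n) + activities * order_aware_cost(n)


n = 100_000
single = (plan_without_sorter(n, 1), plan_with_sorter(n, 1))
shared = (plan_without_sorter(n, 2), plan_with_sorter(n, 2))
```

Under these assumptions a sorter that serves a single activity changes nothing, while a sorter whose ordering is useful to two or more activities saves one full sort per additional activity.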
Sorter gains
[Figure: example workflow without order. Recoverable labels: Z, V, W, A, B; conditions A<600 (sel1=0.1) and A>300 (sel2=0.5); sel3=0.1, sel4=0.2, sel5=0.2; cardinalities 100000, 10000, 5000, 1000, 1000, 500]
Interesting orders
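[Figure: the same workflow annotated with interesting orders. Recoverable labels: Z, V, W, A, B; conditions A<600 (sel1=0.1) and A>300 (sel2=0.5); sel3=0.1, sel4=0.2, sel5=0.2; cardinalities 100000, 10000, 5000, 1000, 1000, 500; orders A asc, A desc, {A,B, [A,B]}]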
[Figure. Labels, in order: DW, WHY, Logical scenario, WHAT, Optimizer, Physical templates, Physical scenario, HOW, Engine]
what is the architecture of my DW back stage?
it is described as the Architecture Graph
which attributes/tables are involved in the population of an attribute?
what part of the scenario is affected if we delete an attribute?
follow the appropriate path in the Architecture Graph
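Following a path in the Architecture Graph amounts to a reachability query. The sketch below is a hedged illustration: the edge list `EDGES` is a hypothetical fragment loosely named after the nodes of the example slides, not the graph of the talk, and the provider-to-consumer direction of the edges is an assumption.

```python
from collections import deque

# Hypothetical attribute-level edges, provider -> consumers.
EDGES = {
    "DS.PS2.PKEY": ["Add_Attr2.IN.PKEY"],
    "Add_Attr2.IN.PKEY": ["Add_Attr2.OUT.PKEY"],
    "Add_Attr2.OUT.PKEY": ["SK2.IN.PKEY"],
    "Add_Attr2.OUT.SOURCE": ["SK2.IN.SOURCE"],
    "SK2.IN.PKEY": ["SK2.OUT.SKEY"],
    "SK2.IN.SOURCE": ["SK2.OUT.SKEY"],
    "SK2.OUT.SKEY": ["DW.PARTSUPP.PKEY"],
    "DS.PS2.QTY": ["DW.PARTSUPP.QTY"],
}


def reachable(start, edges):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(edges):
    """Flip every edge so that the search walks from consumers to providers."""
    rev = {}
    for src, dsts in edges.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


# What is affected if DS.PS2.PKEY is deleted? Follow the edges forwards.
affected = reachable("DS.PS2.PKEY", EDGES)

# Which attributes populate DW.PARTSUPP.PKEY? Follow the edges backwards.
providers = reachable("DW.PARTSUPP.PKEY", reverse(EDGES))
```

The forward search answers the impact question and the backward search answers the population question, so one graph stored in the metadata repository covers both.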
Because there is a process (described by the Architecture Graph at the logical level + the conceptual model) that produces this kind of tuples
Hard! If there is a way to derive an inverse workflow that links the DW tuples to their sources, you can answer it.
Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data.
See Widom's work on record lineage.
(update takes place at the source, DW+data marts must be updated) Done, although in a tedious way: log sniffing, mainly. Also, diff comparison of extracted snapshots
When errors are discovered during the ETL process, how are they handled?
(update takes place at the data staging area, sources must be updated) Too hard to back-fuse data into the sources, both for political and workload issues. Currently, this is not automated.
What happens if there are updates to the schema of the involved data sources?
Currently this is not automated, although the automation of the task is part of the detail independence vision
Nothing is versioned back. Still, there are not really any user requests for this to be supported.
Nothing really.
Thank you!