
Data Provenance in ETL Scenarios

Panos Vassiliadis
University of Ioannina

(joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)

Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL

PrOPr 2007


Data Warehouse Environment


Extract-Transform-Load (ETL)

[Figure: the ETL pipeline — Extract from the Sources, Transform & Clean in the Data Staging Area (DSA), Load into the DW]

ETL: importance

ETL and Data Cleaning tools cost:
- 30% of the effort and expenses in the budget of the DW
- 55% of the total costs of DW runtime
- 80% of the development time in a DW project

ETL market: a multi-million market
- IBM paid $1.1 billion for Ascential

ETL tools in the market


- software packages
- in-house development

No standard, no common model:
- most vendors implement a core set of operators and provide a GUI to create a data flow

Fundamental research question

Now: ETL designers work directly at the physical level (typically via libraries of physical-level templates).
Challenge: can we design ETL flows as declaratively as possible?
Detail independence:
- no care for the algorithmic choices
- no care about the order of the transformations (hopefully)
- no care for the details of the inter-attribute mappings


Now:
[Figure: current practice — the involved data stores plus physical templates yield a physical scenario, executed by the engine against the DW]

Vision:
[Figure: an ETL tool — schema mappings and a conceptual-to-logical mapping, together with logical templates, feed a conceptual-to-logical mapper that produces a logical scenario; an optimizer combines it with physical templates into a physical scenario, executed by the engine against the DW]

Detail independence

Automate (as much as possible):
- Conceptual: the details of the inter-attribute mappings
- Logical: the order of the transformations
- Physical: the algorithmic choices

[Figure: the ETL-tool pipeline again, with each automation target placed at its level]

Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Conceptual Model: first attempts


Due to accuracy and small size (< update window), necessary providers: S1 and S2. {Duration < 4h}

[Figure: early conceptual notation — S1.PARTSUPP (PKey, SuppKey, Qty) and S2.PARTSUPP (PKey, SuppKey, Qty, Date, Cost) populate DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost) through surrogate-key assignments (SK), a not-null check (NN), an American-to-European date conversion, Date = SysDate(), aggregations SUM(S2.Qty) and SUM(S2.Cost), and a {XOR} choice between Annual PartSupps and Recent PartSupps; attribute mappings include PS1.PKey += PS2.PKey, PS1.SuppKey += PS2.SuppKey, PS1.Dept += PS2.Dept]


Conceptual Model: The Data Mapping Diagram

Extension of UML to handle inter-attribute mappings


Conceptual Model: The Data Mapping Diagram

Aggregating computes the quarterly sales for each product.


Conceptual Model: Skoutas annotations

Application vocabulary

Datastore mappings

VC = {product, store}
VP_product = {pid, pName, quantity, price, type, storage}
VP_store = {sid, sName, city, street}
VF_pid = {source_pid, dw_pid}
VF_sid = {source_sid, dw_sid}
VF_price = {dollars, euros}
VT_type = {software, hardware}
VT_city = {paris, rome, athens}

Datastore annotation

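The annotation idea can be sketched in code. This is a toy encoding, not the ontology machinery of the actual approach: the dictionaries and the derivation rule below are illustrative assumptions.

```python
# Toy encoding of part of the application vocabulary (illustrative, not the
# actual OWL-based representation used in the approach).
VF = {
    "pid":   {"source_pid", "dw_pid"},   # VF_pid: alternative formats
    "price": {"dollars", "euros"},       # VF_price
}

# Each datastore annotates its properties with one term of the vocabulary.
DS1_Products = {"pid": "source_pid", "price": "dollars"}
DW_Products  = {"pid": "dw_pid",     "price": "euros"}

def derive_transformations(src, dw):
    """Properties annotated differently at source and DW need a transformation."""
    return sorted(p for p in src if p in dw and src[p] != dw[p])

print(derive_transformations(DS1_Products, DW_Products))   # → ['pid', 'price']
```

The point of the annotations is exactly this kind of automation: comparing the source and DW annotations reveals that a surrogate-key assignment (pid) and a currency conversion (price) must appear in the flow.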

Conceptual Model: Skoutas annotations

The class hierarchy

Definition for class DS1_Products


Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Logical Model
[Figure: logical-level scenario — S1.PARTS and S2.PARTS are brought to the DSA via FTP1/FTP2; the DS.PS1 branch passes DIFF1 (against DS.PSOLD1), NotNULL(COST) and AddDate (DATE=SYSDATE); the DS.PS2 branch passes DIFF2 (against DS.PSOLD2), AddAttr2, $2E(COST) and A2EDate(DATE); both obtain surrogate keys via SK1/SK2 (LOOKUP_PS.SKEY, SOURCE), are unioned (U) and checked (PK) into DW.PARTSUPP, with rejected rows sent to Log files; views V1 = Aggregate1(PKEY, DAY, MIN(COST)) and V2 = Aggregate2(PKEY, MONTH, AVG(COST)) are derived from DW.PARTSUPP and TIME]

Logical Model

Main question: what information should we put inside a metadata repository to be able to answer questions like:
- what is the architecture of my DW back stage?
- which attributes/tables are involved in the population of an attribute?
- what part of the scenario is affected if we delete an attribute?


Architecture Graph
[Figure: the same logical scenario, now annotated as an Architecture Graph]

Architecture Graph
Example
[Figure: zoom-in on the Architecture Graph — DS.PS2 feeds Add_Attr2/AddConst2 (adding SOURCE), then SK2 (adding SKEY via the LOOKUP2 table with LPKEY, LSOURCE, LSKEY), then TMP_STOR.PARTSUPP; each activity carries IN/OUT/PAR schemata over PKEY, SUPPKEY, QTY, COST, DATE, SOURCE]

Architecture Graph
Example
[Figure: the same zoom-in, with each activity's schemata classified as input schema, output schema, projected-out schema, generated schema, and functionality schema]

Optimization

Execution order

[Figure: S2.PARTSUPP (PKey, SuppKey, Qty, Date, Cost) is mapped to DW.PARTSUPP via SK, functions f1 and f2, aggregations SUM(S2.Qty) and SUM(S2.Cost), and a PK check]

Which is the proper execution order?

Optimization

Execution order

[Figure: linear flow S2.PARTSUPP → SK → f1 → f2 → PK → DW.PARTSUPP]

Order equivalence? SK,f1,f2 or SK,f2,f1 or ... ?
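One pragmatic way to see why the question is non-trivial: two selections always commute, but a selection and an aggregation generally do not. A small sketch (the filters, the aggregation, and the sample data are invented for illustration):

```python
def commutes(op1, op2, rows):
    """Do two activities yield the same result in either order, on this sample?"""
    return sorted(op1(op2(rows))) == sorted(op2(op1(rows)))

# rows are (pkey, qty, cost) tuples
f1 = lambda rows: [r for r in rows if r[1] > 10]   # selection on qty
f2 = lambda rows: [r for r in rows if r[2] < 50]   # selection on cost

def agg(rows):                                     # sum qty per pkey
    totals = {}
    for pkey, qty, _ in rows:
        totals[pkey] = totals.get(pkey, 0) + qty
    return [(k, v, 0) for k, v in sorted(totals.items())]

rows = [(1, 20, 30), (1, 5, 70), (2, 15, 40)]
print(commutes(f1, f2, rows))    # True: two selections commute
print(commutes(f1, agg, rows))   # False: selection vs. aggregation
```

So swap legality depends on the pair of activities involved, which is exactly what a logical optimizer must reason about before reordering a flow.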


Logical Optimization
[Figure: initial flow (activities numbered 1-9) — the PARTS1 branch applies NN(COST) and σ(COST); the PARTS2 branch applies $2E($COST), A2EDate(DATE) and σ(DATE); both branches feed PARTS]
Can we push selection early enough? Can we aggregate before $2 takes place?

[Figure: rewritten flow — the selections are pushed earlier: σ(COST) before NN(COST) on the PARTS1 branch, and σ(COST), σ(DATE) around $2E($COST) and A2EDate(DATE) on the PARTS2 branch, followed by the union U into PARTS]

Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Logical to Physical
[Figure: the ETL-tool pipeline — the conceptual-to-logical mapper produces a logical scenario; the optimizer, using physical templates, turns it into a physical scenario for the engine]

Goal: identify the best possible physical implementation for a given logical ETL workflow.

Problem formulation

Given: a logical-level ETL workflow GL.
Compute: a physical-level ETL workflow GP, such that:
- the semantics of the workflow do not change
- all constraints are met
- the cost is minimal


Solution

We model the problem of finding the physical implementation of an ETL process as a state-space search problem.

States. A state is a graph GP that represents a physical-level ETL workflow. The initial state G0P is produced by a random assignment of physical implementations to logical activities, respecting preconditions and constraints.

Transitions. Given a state GP, a new state GP' is generated by replacing the implementation of a physical activity aP of GP with another valid implementation for the same activity.

Extension: introduction of a sorter activity (at the physical level) as a new node in the graph.

Sorter introduction

Intentionally introduce sorters to reduce execution & resumption costs


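The state space can be sketched with a toy cost model. This exhaustive enumeration stands in for the actual search strategy; the activity names, candidate implementations, and costs are invented:

```python
import itertools

# invented cost model: logical activity -> available physical implementations
IMPLS = {
    "SK2":     {"hash-lookup": 40, "merge-lookup": 25},
    "NotNull": {"scan": 10},
    "Join":    {"nested-loops": 120, "sort-merge": 60, "hash-join": 45},
}

def best_physical_plan(impls):
    """Enumerate states: each state fixes one physical implementation per activity."""
    acts = list(impls)
    best_plan, best_cost = None, float("inf")
    for combo in itertools.product(*(impls[a].items() for a in acts)):
        cost = sum(c for _, c in combo)
        if cost < best_cost:
            best_plan = dict(zip(acts, (name for name, _ in combo)))
            best_cost = cost
    return best_plan, best_cost

plan, cost = best_physical_plan(IMPLS)
print(plan, cost)   # cheapest implementation per activity; total cost 80
```

In a realistic setting the cost of an implementation also depends on its neighbors (e.g., on available orderings), which is what makes the transition-based search, rather than per-activity local choice, necessary.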

Sorters: impact

We intentionally introduce orderings (via appropriate physical-level sorter activities) towards obtaining physical plans of lower cost.
Semantics: unaffected.
Price to pay:
- the cost of sorting the stream of processed data
Gain:
- it is possible to employ order-aware algorithms that significantly reduce processing cost
- it is possible to amortize the sorting cost over activities that utilize common useful orderings


Sorter gains
[Figure: example flow — a 100,000-tuple source passes σ(A<600) (sel1=0.1, 10,000 tuples) and σ(A>300) (sel2=0.5), leaving 5,000 tuples at V; three order-sensitive activities follow: Z on A (sel3=0.1, 500 tuples), W (sel4=0.2, 1,000 tuples), and one on B (sel5=0.2, 1,000 tuples)]

Without order:
- cost of a selection = n; cost of a sort-based activity = n*log2(n) + n
- Cost(G) = 100,000 + 10,000 + 3*[5,000*log2(5,000) + 5,000] = 309,316

With appropriate order:
- cost of a selection = seli * n; cost of an order-aware activity = n
- If sorter S_{A,B} is added to V: Cost(G) = 100,000 + 10,000 + 2*[5,000*log2(5,000) + 5,000] + 5,000 = 247,877
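Both totals can be reproduced from the two cost formulas. The decomposition of the second total below (the sorter pays a sort cost, one activity still sorts on its own, one reuses the shared order at linear cost) is one reading consistent with the slide's numbers, not a claim about the original cost model:

```python
import math

def cost_sort_based(n):    # an activity that must sort its input: n*log2(n) + n
    return n * math.log2(n) + n

def cost_order_aware(n):   # an order-aware algorithm over pre-sorted input: n
    return n

upstream = 100_000 + 10_000                      # the two selections
no_order   = upstream + 3 * cost_sort_based(5_000)
with_order = upstream + 2 * cost_sort_based(5_000) + cost_order_aware(5_000)
print(round(no_order), round(with_order))        # 309316 247877
```

The roughly 20% saving comes entirely from replacing repeated n*log2(n) sorting with linear, order-aware processing on the shared 5,000-tuple stream.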


Interesting orders
[Figure: the same flow, annotated with the interesting orders useful to the downstream activities — A asc, A desc, and {A, B, [A,B]}]

Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


A principled architecture for ETL


[Figure: the ETL-tool pipeline annotated by level — WHY (schema mappings, conceptual-to-logical mapping), WHAT (logical scenario), HOW (physical scenario and engine)]

Logical Model: Questions revisited


What information should we put inside a metadata repository to be able to answer questions like:
- what is the architecture of my DW back stage?
  → it is described as the Architecture Graph
- which attributes/tables are involved in the population of an attribute? what part of the scenario is affected if we delete an attribute?
  → follow the appropriate path in the Architecture Graph
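"Following the appropriate path" amounts to reverse reachability over the provider edges of the Architecture Graph. A sketch with invented edges (the attribute names echo the earlier example, but the exact edge set is illustrative):

```python
from collections import deque

# provider edges: target attribute -> attributes that directly populate it (invented)
PROVIDERS = {
    "DW.PARTSUPP.COST":  ["SK2.OUT.COST"],
    "SK2.OUT.COST":      ["SK2.IN.COST"],
    "SK2.IN.COST":       ["AddAttr2.OUT.COST"],
    "AddAttr2.OUT.COST": ["DS.PS2.COST"],
    "DS.PS2.COST":       ["S2.PARTSUPP.COST"],
}

def involved_in(attr):
    """All attributes involved in the population of attr (reverse reachability)."""
    seen, todo = set(), deque([attr])
    while todo:
        for p in PROVIDERS.get(todo.popleft(), []):
            if p not in seen:
                seen.add(p)
                todo.append(p)
    return seen

print(involved_in("DW.PARTSUPP.COST"))   # the provider chain back to S2.PARTSUPP.COST
```

The impact-of-deletion question is the same traversal run in the opposite direction, over the inverted edge set.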


Fundamental questions on provenance & ETL

Why do we have a certain record in the DW?
- Because there is a process (described by the Architecture Graph at the logical level, plus the conceptual model) that produces this kind of tuples.

Where did this record come from in my DW?
- Hard! If there is a way to derive an inverse workflow that links the DW tuples to their sources, you can answer it.
- Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data.
- See Widom's work on record lineage.
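In the spirit of the lineage work cited above, where-provenance can be carried forward through the flow instead of inverting it: annotate each tuple with the set of source row ids that produced it. A toy sketch (the operators and data are invented):

```python
def src(rows):                          # wrap each source row with its own lineage
    return [(row, {i}) for i, row in enumerate(rows)]

def select(pred, annotated):            # a selection keeps lineage as-is
    return [(r, lin) for r, lin in annotated if pred(r)]

def sum_by_key(annotated):              # group (key, val) rows; union lineages
    groups = {}
    for (key, val), lin in annotated:
        total, glin = groups.get(key, (0, set()))
        groups[key] = (total + val, glin | lin)
    return [((k, t), lin) for k, (t, lin) in sorted(groups.items())]

rows = [(1, 5), (1, 7), (2, 3)]         # (pkey, qty) source tuples
out = sum_by_key(select(lambda r: r[1] > 2, src(rows)))
print(out)   # [((1, 12), {0, 1}), ((2, 3), {2})]
```

Even though the aggregate itself is not invertible, the propagated annotations still identify exactly which source rows contributed to each DW tuple.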


Fundamental questions on provenance & ETL

How are updates to the sources managed?
(the update takes place at the source; the DW and data marts must be updated)
- Done, although in a tedious way: mainly log sniffing; also, diff comparison of extracted snapshots.

When errors are discovered during the ETL process, how are they handled?
(the update takes place at the data staging area; the sources must be updated)
- Too hard to back-fuse data into the sources, for both political and workload reasons. Currently, this is not automated.

Fundamental questions on provenance & ETL

What happens if there are updates to the schema of the involved data sources?
- Currently this is not automated, although automating the task is part of the detail-independence vision.

What happens if we must update the workflow structure and semantics?
- Nothing is versioned yet; still, there have not really been any user requests for this to be supported.

What is the equivalent of citations in ETL?


Thank you!

