
Data Provenance in ETL Scenarios

Panos Vassiliadis
University of Ioannina

(joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)

Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL

PrOPr 2007


Data Warehouse Environment


Extract-Transform-Load (ETL)

[Figure: the ETL pipeline — Extract from the Sources, Transform & Clean in the Data Staging Area (DSA), Load into the DW]

ETL: importance

ETL and Data Cleaning tools cost:
- 30% of the effort and expenses in the budget of the DW
- 55% of the total costs of DW runtime
- 80% of the development time in a DW project

ETL market: a multi-million market
- IBM paid $1.1 billion for Ascential

ETL tools in the market


- software packages
- in-house development

No standard, no common model:
- most vendors implement a core set of operators and provide a GUI to create a data flow

Fundamental research question

Now: ETL designers work directly at the physical level (typically via libraries of physical-level templates).
Challenge: can we design ETL flows as declaratively as possible?
Detail independence:
- no care for the algorithmic choices
- no care about the order of the transformations (hopefully)
- no care for the details of the inter-attribute mappings


Now:
[Figure: current practice — the involved data stores plus physical templates yield a physical scenario, executed by the engine against the DW]

Vision:
[Figure: an ETL tool — schema mappings and a conceptual-to-logical mapping, together with logical templates, feed a conceptual-to-logical mapper that produces a logical scenario; an optimizer combines it with physical templates into a physical scenario, executed by the engine against the DW]

Detail independence

Automate (as much as possible):
- Conceptual: the details of the inter-attribute mappings
- Logical: the order of the transformations
- Physical: the algorithmic choices

[Figure: the ETL-tool pipeline again, with each automation target placed at its level]

Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Conceptual Model: first attempts


Due to accuracy and small size (< update window), necessary providers: S1 and S2. {Duration < 4h}

[Figure: early conceptual notation — S1.PARTSUPP (PKey, SuppKey, Qty) and S2.PARTSUPP (PKey, SuppKey, Qty, Date, Cost) populate DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost) through surrogate-key assignments (SK), a not-null check (NN), an American-to-European date conversion, Date = SysDate(), aggregations SUM(S2.Qty) and SUM(S2.Cost), and a {XOR} choice between Annual PartSupps and Recent PartSupps; attribute mappings include PS1.PKey += PS2.PKey, PS1.SuppKey += PS2.SuppKey, PS1.Dept += PS2.Dept]


Conceptual Model: The Data Mapping Diagram

Extension of UML to handle inter-attribute mappings


Conceptual Model: The Data Mapping Diagram

Aggregating computes the quarterly sales for each product.


Conceptual Model: Skoutas annotations

Application vocabulary

Datastore mappings

VC = {product, store}
VP_product = {pid, pName, quantity, price, type, storage}
VP_store = {sid, sName, city, street}
VF_pid = {source_pid, dw_pid}
VF_sid = {source_sid, dw_sid}
VF_price = {dollars, euros}
VT_type = {software, hardware}
VT_city = {paris, rome, athens}

Datastore annotation

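The annotation idea can be sketched in code. This is a toy encoding, not the ontology machinery of the actual approach: the dictionaries and the derivation rule below are illustrative assumptions.

```python
# Toy encoding of part of the application vocabulary (illustrative, not the
# actual OWL-based representation used in the approach).
VF = {
    "pid":   {"source_pid", "dw_pid"},   # VF_pid: alternative formats
    "price": {"dollars", "euros"},       # VF_price
}

# Each datastore annotates its properties with one term of the vocabulary.
DS1_Products = {"pid": "source_pid", "price": "dollars"}
DW_Products  = {"pid": "dw_pid",     "price": "euros"}

def derive_transformations(src, dw):
    """Properties annotated differently at source and DW need a transformation."""
    return sorted(p for p in src if p in dw and src[p] != dw[p])

print(derive_transformations(DS1_Products, DW_Products))   # → ['pid', 'price']
```

The point of the annotations is exactly this kind of automation: comparing the source and DW annotations reveals that a surrogate-key assignment (pid) and a currency conversion (price) must appear in the flow.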

Conceptual Model: Skoutas annotations

The class hierarchy

Definition for class DS1_Products


Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Logical Model
[Figure: logical-level scenario — S1.PARTS and S2.PARTS are brought to the DSA via FTP1/FTP2; the DS.PS1 branch passes DIFF1 (against DS.PSOLD1), NotNULL(COST) and AddDate (DATE=SYSDATE); the DS.PS2 branch passes DIFF2 (against DS.PSOLD2), AddAttr2, $2E(COST) and A2EDate(DATE); both obtain surrogate keys via SK1/SK2 (LOOKUP_PS.SKEY, SOURCE), are unioned (U) and checked (PK) into DW.PARTSUPP, with rejected rows sent to Log files; views V1 = Aggregate1(PKEY, DAY, MIN(COST)) and V2 = Aggregate2(PKEY, MONTH, AVG(COST)) are derived from DW.PARTSUPP and TIME]

Logical Model

Main question: what information should we put inside a metadata repository to be able to answer questions like:
- what is the architecture of my DW back stage?
- which attributes/tables are involved in the population of an attribute?
- what part of the scenario is affected if we delete an attribute?


Architecture Graph
[Figure: the same logical scenario, now annotated as an Architecture Graph]

Architecture Graph
Example
[Figure: zoom-in on the Architecture Graph — DS.PS2 feeds Add_Attr2/AddConst2 (adding SOURCE), then SK2 (adding SKEY via the LOOKUP2 table with LPKEY, LSOURCE, LSKEY), then TMP_STOR.PARTSUPP; each activity carries IN/OUT/PAR schemata over PKEY, SUPPKEY, QTY, COST, DATE, SOURCE]

Architecture Graph
Example
[Figure: the same zoom-in, with each activity's schemata classified as input schema, output schema, projected-out schema, generated schema, and functionality schema]

Optimization

Execution order

[Figure: S2.PARTSUPP (PKey, SuppKey, Qty, Date, Cost) is mapped to DW.PARTSUPP via SK, functions f1 and f2, aggregations SUM(S2.Qty) and SUM(S2.Cost), and a PK check]

Which is the proper execution order?

Optimization

Execution order

[Figure: linear flow S2.PARTSUPP → SK → f1 → f2 → PK → DW.PARTSUPP]

Order equivalence? SK,f1,f2 or SK,f2,f1 or ... ?
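One pragmatic way to see why the question is non-trivial: two selections always commute, but a selection and an aggregation generally do not. A small sketch (the filters, the aggregation, and the sample data are invented for illustration):

```python
def commutes(op1, op2, rows):
    """Do two activities yield the same result in either order, on this sample?"""
    return sorted(op1(op2(rows))) == sorted(op2(op1(rows)))

# rows are (pkey, qty, cost) tuples
f1 = lambda rows: [r for r in rows if r[1] > 10]   # selection on qty
f2 = lambda rows: [r for r in rows if r[2] < 50]   # selection on cost

def agg(rows):                                     # sum qty per pkey
    totals = {}
    for pkey, qty, _ in rows:
        totals[pkey] = totals.get(pkey, 0) + qty
    return [(k, v, 0) for k, v in sorted(totals.items())]

rows = [(1, 20, 30), (1, 5, 70), (2, 15, 40)]
print(commutes(f1, f2, rows))    # True: two selections commute
print(commutes(f1, agg, rows))   # False: selection vs. aggregation
```

So swap legality depends on the pair of activities involved, which is exactly what a logical optimizer must reason about before reordering a flow.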


Logical Optimization
[Figure: initial flow (activities numbered 1-9) — the PARTS1 branch applies NN(COST) and σ(COST); the PARTS2 branch applies $2E($COST), A2EDate(DATE) and σ(DATE); both branches feed PARTS]
Can we push selection early enough? Can we aggregate before $2 takes place?

[Figure: rewritten flow — the selections are pushed earlier: σ(COST) before NN(COST) on the PARTS1 branch, and σ(COST), σ(DATE) around $2E($COST) and A2EDate(DATE) on the PARTS2 branch, followed by the union U into PARTS]

Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Logical to Physical
[Figure: the ETL-tool pipeline — the conceptual-to-logical mapper produces a logical scenario; the optimizer, using physical templates, turns it into a physical scenario for the engine]

Goal: identify the best possible physical implementation for a given logical ETL workflow.

Problem formulation

Given: a logical-level ETL workflow GL.
Compute: a physical-level ETL workflow GP, such that:
- the semantics of the workflow do not change
- all constraints are met
- the cost is minimal


Solution

We model the problem of finding the physical implementation of an ETL process as a state-space search problem.

States. A state is a graph GP that represents a physical-level ETL workflow. The initial state G0P is produced by a random assignment of physical implementations to logical activities, respecting preconditions and constraints.

Transitions. Given a state GP, a new state GP' is generated by replacing the implementation of a physical activity aP of GP with another valid implementation for the same activity.

Extension: introduction of a sorter activity (at the physical level) as a new node in the graph.

Sorter introduction

Intentionally introduce sorters to reduce execution & resumption costs


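The state space can be sketched with a toy cost model. This exhaustive enumeration stands in for the actual search strategy; the activity names, candidate implementations, and costs are invented:

```python
import itertools

# invented cost model: logical activity -> available physical implementations
IMPLS = {
    "SK2":     {"hash-lookup": 40, "merge-lookup": 25},
    "NotNull": {"scan": 10},
    "Join":    {"nested-loops": 120, "sort-merge": 60, "hash-join": 45},
}

def best_physical_plan(impls):
    """Enumerate states: each state fixes one physical implementation per activity."""
    acts = list(impls)
    best_plan, best_cost = None, float("inf")
    for combo in itertools.product(*(impls[a].items() for a in acts)):
        cost = sum(c for _, c in combo)
        if cost < best_cost:
            best_plan = dict(zip(acts, (name for name, _ in combo)))
            best_cost = cost
    return best_plan, best_cost

plan, cost = best_physical_plan(IMPLS)
print(plan, cost)   # cheapest implementation per activity; total cost 80
```

In a realistic setting the cost of an implementation also depends on its neighbors (e.g., on available orderings), which is what makes the transition-based search, rather than per-activity local choice, necessary.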

Sorters: impact

We intentionally introduce orderings (via appropriate physical-level sorter activities) towards obtaining physical plans of lower cost.
Semantics: unaffected.
Price to pay:
- the cost of sorting the stream of processed data
Gain:
- it is possible to employ order-aware algorithms that significantly reduce processing cost
- it is possible to amortize the sorting cost over activities that utilize common useful orderings


Sorter gains
[Figure: example flow — a 100,000-tuple source passes σ(A<600) (sel1=0.1, 10,000 tuples) and σ(A>300) (sel2=0.5), leaving 5,000 tuples at V; three order-sensitive activities follow: Z on A (sel3=0.1, 500 tuples), W (sel4=0.2, 1,000 tuples), and one on B (sel5=0.2, 1,000 tuples)]

Without order:
- cost of a selection = n; cost of a sort-based activity = n*log2(n) + n
- Cost(G) = 100,000 + 10,000 + 3*[5,000*log2(5,000) + 5,000] = 309,316

With appropriate order:
- cost of a selection = seli * n; cost of an order-aware activity = n
- If sorter S_{A,B} is added to V: Cost(G) = 100,000 + 10,000 + 2*[5,000*log2(5,000) + 5,000] + 5,000 = 247,877
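Both totals can be reproduced from the two cost formulas. The decomposition of the second total below (the sorter pays a sort cost, one activity still sorts on its own, one reuses the shared order at linear cost) is one reading consistent with the slide's numbers, not a claim about the original cost model:

```python
import math

def cost_sort_based(n):    # an activity that must sort its input: n*log2(n) + n
    return n * math.log2(n) + n

def cost_order_aware(n):   # an order-aware algorithm over pre-sorted input: n
    return n

upstream = 100_000 + 10_000                      # the two selections
no_order   = upstream + 3 * cost_sort_based(5_000)
with_order = upstream + 2 * cost_sort_based(5_000) + cost_order_aware(5_000)
print(round(no_order), round(with_order))        # 309316 247877
```

The roughly 20% saving comes entirely from replacing repeated n*log2(n) sorting with linear, order-aware processing on the shared 5,000-tuple stream.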


Interesting orders
[Figure: the same flow, annotated with the interesting orders useful to the downstream activities — A asc, A desc, and {A, B, [A,B]}]

Outline

Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


A principled architecture for ETL


[Figure: the ETL-tool pipeline annotated by level — WHY (schema mappings, conceptual-to-logical mapping), WHAT (logical scenario), HOW (physical scenario and engine)]

Logical Model: Questions revisited


What information should we put inside a metadata repository to be able to answer questions like:
- what is the architecture of my DW back stage?
  → it is described as the Architecture Graph
- which attributes/tables are involved in the population of an attribute? what part of the scenario is affected if we delete an attribute?
  → follow the appropriate path in the Architecture Graph
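"Following the appropriate path" amounts to reverse reachability over the provider edges of the Architecture Graph. A sketch with invented edges (the attribute names echo the earlier example, but the exact edge set is illustrative):

```python
from collections import deque

# provider edges: target attribute -> attributes that directly populate it (invented)
PROVIDERS = {
    "DW.PARTSUPP.COST":  ["SK2.OUT.COST"],
    "SK2.OUT.COST":      ["SK2.IN.COST"],
    "SK2.IN.COST":       ["AddAttr2.OUT.COST"],
    "AddAttr2.OUT.COST": ["DS.PS2.COST"],
    "DS.PS2.COST":       ["S2.PARTSUPP.COST"],
}

def involved_in(attr):
    """All attributes involved in the population of attr (reverse reachability)."""
    seen, todo = set(), deque([attr])
    while todo:
        for p in PROVIDERS.get(todo.popleft(), []):
            if p not in seen:
                seen.add(p)
                todo.append(p)
    return seen

print(involved_in("DW.PARTSUPP.COST"))   # the provider chain back to S2.PARTSUPP.COST
```

The impact-of-deletion question is the same traversal run in the opposite direction, over the inverted edge set.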


Fundamental questions on provenance & ETL

Why do we have a certain record in the DW?
- Because there is a process (described by the Architecture Graph at the logical level, plus the conceptual model) that produces this kind of tuples.

Where did this record come from in my DW?
- Hard! If there is a way to derive an inverse workflow that links the DW tuples to their sources, you can answer it.
- Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data.
- See Widom's work on record lineage.
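In the spirit of the lineage work cited above, where-provenance can be carried forward through the flow instead of inverting it: annotate each tuple with the set of source row ids that produced it. A toy sketch (the operators and data are invented):

```python
def src(rows):                          # wrap each source row with its own lineage
    return [(row, {i}) for i, row in enumerate(rows)]

def select(pred, annotated):            # a selection keeps lineage as-is
    return [(r, lin) for r, lin in annotated if pred(r)]

def sum_by_key(annotated):              # group (key, val) rows; union lineages
    groups = {}
    for (key, val), lin in annotated:
        total, glin = groups.get(key, (0, set()))
        groups[key] = (total + val, glin | lin)
    return [((k, t), lin) for k, (t, lin) in sorted(groups.items())]

rows = [(1, 5), (1, 7), (2, 3)]         # (pkey, qty) source tuples
out = sum_by_key(select(lambda r: r[1] > 2, src(rows)))
print(out)   # [((1, 12), {0, 1}), ((2, 3), {2})]
```

Even though the aggregate itself is not invertible, the propagated annotations still identify exactly which source rows contributed to each DW tuple.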


Fundamental questions on provenance & ETL

How are updates to the sources managed?
(the update takes place at the source; the DW and data marts must be updated)
- Done, although in a tedious way: mainly log sniffing; also, diff comparison of extracted snapshots.

When errors are discovered during the ETL process, how are they handled?
(the update takes place at the data staging area; the sources must be updated)
- Too hard to back-fuse data into the sources, for both political and workload reasons. Currently, this is not automated.

Fundamental questions on provenance & ETL

What happens if there are updates to the schema of the involved data sources?
- Currently this is not automated, although automating the task is part of the detail-independence vision.

What happens if we must update the workflow structure and semantics?
- Nothing is versioned yet; still, there have not really been any user requests for this to be supported.

What is the equivalent of citations in ETL?


Thank you!

