
Data Vault for EDW

By Raphael Klebanov, WhereScape, Inc.

Agenda
Review of various Data Warehouse models and their place in modern data warehousing methods.
The Data Vault as a preferred flavor of the Enterprise Data Warehouse for different businesses.
Overview of the Data Vault concepts and objects.
Real-world example where the Data Vault was chosen to replace a more traditional architecture for the EDW.

Principal Data Flow (simplified)

Different Flavors of an EDW


Dimensional Model (DM)
Ralph Kimball, 1996. The Data Warehouse Toolkit.

Third Normal Form (3NF)


Bill Inmon, 1981. Effective Data Base Design.

Data Vault (DV)


Dan Linstedt, 2000. Data Vault Series.

Dimensional Model (DM)


Collection of all data marts within the enterprise.
Information is always stored in the DM.
In a DM, transaction data is separated into either:
"facts": numeric transaction data
"dimensions": reference information that gives meaning to the facts.
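The fact/dimension split can be illustrated with a minimal sketch. The product dimension, the sales facts, and the helper `total_amount_by` are all hypothetical names invented for this illustration; they are not taken from the presentation.

```python
from collections import defaultdict

# Hypothetical dimension: reference information keyed by a surrogate key.
product_dim = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
    3: {"name": "Manual", "category": "Books"},
}

# Hypothetical fact rows: numeric measures plus a key into the dimension.
sales_facts = [
    {"product_key": 1, "quantity": 2, "amount": 20.0},
    {"product_key": 2, "quantity": 1, "amount": 15.0},
    {"product_key": 3, "quantity": 4, "amount": 40.0},
    {"product_key": 1, "quantity": 1, "amount": 10.0},
]


def total_amount_by(attribute):
    """Sum the fact measure 'amount', grouped by one dimension attribute."""
    totals = defaultdict(float)
    for fact in sales_facts:
        dim_row = product_dim[fact["product_key"]]  # look up the meaning of the fact
        totals[dim_row[attribute]] += fact["amount"]
    return dict(totals)
```

The facts carry only numbers and keys; the dimension supplies the attribute (here `category`) that gives those numbers a business meaning.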

Key Aspects of a DM Approach


Clarity of design to both developers and business.
Optimum design for most BI tools.
Ease of use when querying the database directly.

Main Drawbacks of the DM


Complicated to maintain the integrity of facts and dimensions.
Expensive to modify the DW structure when business rules or requirements change.
Data scrubbing for conformed dimensions is challenging.
Difficult to load type 2 dimensions in real time.

Conclusion on DM
All of these characteristics boil down to a main usage of the DM in data marts and access/presentation layers. The DW can be created using this approach for small data volumes and stable business structures.

Third Normal Form (3NF)


Central DW, referred to as the Corporate Information Factory (CIF).
An enterprise has one centralized EDW, and data marts obtain their information from the EDW.
In the EDW, information is usually stored in 3NF (Codd's third normal form).
In 3NF, the data in the DW is stored according to database normalization rules.
Tables are grouped together in subject areas that reflect general data categories.

Key Aspects of a 3NF Approach


It is more straightforward to add information to a database containing full historical data from the operational systems.
The data structures are more resilient to change, since data should appear in only one table (i.e., the data is normalized).
Due to the optimized, normalized structure, near-real-time/real-time (NRT/RT) and VLDB loading are supported in most cases.


The Main Drawbacks of the 3NF


The disadvantages of this approach stem from the number of tables involved.
It is difficult for users to join data from different sources into meaningful information.
Consequently, it is difficult to access the information without an exact understanding of the data sources and of the data structure of the data warehouse.
Inflexibility (brittleness) of the 3NF data model.


Conclusion on 3NF Model


All of these characteristics lead to the conclusion that the main usage of the 3NF model is in Operational Data Stores rather than in the EDW.


Data Vault (DV)


Designed to avoid or minimize the disadvantages of both the DM and 3NF methods.
DV modeling is a method of designing an EDW that provides historical storage of data coming in from many operational systems, with complete tracing of the origin of all data coming into the database.
This method has proved to be highly adaptable to change in the business environment.
The Data Vault is organized around Business Keys.


Key Aspects of a DV Approach


Less complicated EDW loads, resulting in greater stability and performance.
Improved flexibility, allowing the EDW to adapt more easily to changes in the business.
Better suited to incremental implementation (Agile DW), ensuring quicker delivery of business value.
Due to the highly granular nature of the DV model, it sustains Very Large Database (VLDB) capability, so no redesign is needed as the EDW matures.


Main Drawbacks of the DV


A large number of joins, which makes maintenance of the database more demanding.
The modeling rules must be followed strictly, because a small deviation from the DV modeling rules might cause serious damage to the whole structure.
Like 3NF, impractical for direct querying.


Conclusion on DV Model
[Figure: placement of the EDW. Recoverable labels: Analytic data feed back to source systems; EDW; Pre-aggregates; 3NF: tendency]


A Bit of Chemistry

Clear Definition, Removed Ambiguity

Efficient Loading, All the Data all the Time

Common Building Blocks for BI

Business Context, Agile Re-assembly

Atom = Clear definitions of the data (usually 3NF)
Water Molecule = 2-1/2 normalized DV: Hubs/Links/Sats
Sugar Molecule = Tables/Views with pre-aggregated data
Sugar Cube = Rapid BI product (usually DM)


More on DV Core Concepts


A HUB table contains a list of uniquely identified business keys that have a very low tendency to change.

A LINK is either a transaction, a hierarchy, or an association/relationship between the business keys (HUB keys).

A SATELLITE holds any data with a tendency to change over time, any descriptive data about a business key (HUB key).
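As a rough sketch of how the three core structures relate, the tables can be laid out in an in-memory SQLite database. Every table and column name below (`hub_customer`, `link_customer_product`, `sat_customer`, `record_source`, and so on) is an assumption chosen for illustration, and the helper `create_core_tables` is hypothetical; the presentation itself does not prescribe a physical schema.

```python
import sqlite3

# Illustrative DDL only: names and column choices are assumptions.
CORE_DDL = """
CREATE TABLE hub_customer (
    customer_key  INTEGER PRIMARY KEY,      -- hub surrogate key
    customer_bk   TEXT NOT NULL UNIQUE,     -- business key
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_key   INTEGER PRIMARY KEY,
    product_bk    TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- A link relates two or more hubs through their keys.
CREATE TABLE link_customer_product (
    link_key      INTEGER PRIMARY KEY,
    customer_key  INTEGER NOT NULL REFERENCES hub_customer(customer_key),
    product_key   INTEGER NOT NULL REFERENCES hub_product(product_key),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (customer_key, product_key)
);
-- A satellite holds descriptive, changeable data about one parent.
CREATE TABLE sat_customer (
    customer_key  INTEGER NOT NULL REFERENCES hub_customer(customer_key),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_key, load_date)
);
"""


def create_core_tables(conn):
    """Create the sketch tables and return their names in sorted order."""
    conn.executescript(CORE_DDL)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `create_core_tables(sqlite3.connect(":memory:"))` builds the four tables. The point of the layout is that the hub carries only the business key, the link carries only hub keys, and everything descriptive lives in the satellite.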

Hubs
Identifiable business element.
Very low chance of changing (generally, not editable in source systems).
Same semantic meaning and granularity across the enterprise.

Hubs Examples
Key: Nissan-ABC/123-456
Line of Business: NAICS 2007 45A
Organization: Empire State College
Model Number: 33777185JN
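Because a hub is a list of unique business keys, loading one amounts to adding only the keys not seen before. The following is a minimal sketch under that assumption; the function `load_hub`, the dictionary layout, and the `load_date` and `record_source` fields are hypothetical and not part of the presentation.

```python
def load_hub(hub, business_keys, load_date, record_source):
    """Add unseen business keys to the hub and return how many were added.

    `hub` maps business key -> row. Existing keys are left untouched,
    which reflects the idea that hub entries rarely change.
    """
    added = 0
    for bk in business_keys:
        if bk not in hub:
            hub[bk] = {
                "hub_key": len(hub) + 1,        # simple surrogate key
                "load_date": load_date,
                "record_source": record_source,
            }
            added += 1
    return added
```

Running the same load twice adds nothing the second time, so the load can be repeated safely when a feed is re-delivered.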

Hubs Quiz
A HUB represents an Event or Transaction (True or False)
HUB may contain record source as part of business key (True or False)
HUB always has an end-date (True or False)
HUB business key can be comprised of multiple columns (True or False)
HUB can be dependent on another HUB (True or False)


Links
Intersection of two or more Business Keys (Hubs)
A Unit of Work (e.g. Product by Supplier Link, Customer by Category Link)
Identifiable business element relationships
Business event
Transaction between business keys (Hubs)
Hierarchy
Same As (data cleansing)
Includes Hub keys as foreign keys

Links Examples:
Invoice Header (Buyer, Seller, Invoice Date, Receive Date)
Orders (Employee, Shipper, Customer, Order Date)
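A link row can be thought of as the hub keys of the participating business keys, collected into one record. The sketch below resolves business keys to hub keys for an Orders-style link; the function `build_link_row`, the hub names, and the sample keys are all hypothetical.

```python
def build_link_row(hubs, business_keys):
    """Translate {hub_name: business_key} into {hub_name + '_key': hub_key}.

    `hubs` maps hub name -> {business_key: hub_key}. A missing business key
    raises KeyError, since a link may only reference existing hub rows.
    """
    row = {}
    for hub_name, bk in business_keys.items():
        if bk not in hubs[hub_name]:
            raise KeyError(f"{bk!r} is not loaded in hub {hub_name!r}")
        row[hub_name + "_key"] = hubs[hub_name][bk]
    return row
```

For example, with hubs for employee, shipper and customer already loaded, `build_link_row(hubs, {"employee": "E7", "shipper": "S2", "customer": "C9"})` yields one row holding three foreign keys and no descriptive data.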

Links Quiz
A transaction is always represented by a Link (True or False)
A Link can contain business keys (True or False)
A unit of work is always represented by a Link (True or False)
A link must contain a unit of work (True or False)


Satellite
Time dimensional table about Hub or Link
Has one migrated foreign key (either from Hub or Link)
At least one satellite row for each Hub Key
Primary Key is the Hub Surrogate Key (Hub_key) and Load Date

Satellite Notes
Non-identifying business elements
Descriptive of Business Key from Hub or Link
Dependent on either Hub or Link as Parent
Never dependent on more than one parent table
Never parent table to any other table (no snowflaking!)
Generally, has beginning and ending dates
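The notes above suggest a simple loading pattern for a satellite: add a new row only when the descriptive data has changed, and close the previous row with an ending date. This is a minimal sketch of that idea, not the presenter's implementation; `load_satellite`, the `attributes` dictionary, and the use of `None` as an open end date are assumptions.

```python
def load_satellite(sat_rows, hub_key, attributes, load_date):
    """Append a satellite row only if the attributes changed.

    Returns True when a new row was added. The previously open row for the
    same hub key (end_date is None) is end-dated with the new load date.
    """
    current = None
    for row in sat_rows:
        if row["hub_key"] == hub_key and row["end_date"] is None:
            current = row
    if current is not None and current["attributes"] == attributes:
        return False                      # no delta, nothing to store
    if current is not None:
        current["end_date"] = load_date   # close the superseded row
    sat_rows.append({
        "hub_key": hub_key,               # the single migrated parent key
        "load_date": load_date,           # hub key + load date identify the row
        "end_date": None,
        "attributes": dict(attributes),
    })
    return True
```

Each row depends on exactly one parent key, and history accumulates as a chain of rows whose beginning and ending dates do not overlap.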



Satellites Quiz
Can Satellite be dependent on 1 or more parent tables (True or False)
Satellite Primary Key is which of the following:
A) Hubs PK
B) Sat Load Date
C) Sequence Number
D) Sub-totals
Satellite can export its Key (True or False)
Satellite can be snowflaked (True or False)
Satellite is not impacted by Delta Processing (True or False)


NON-Core DV Structures
A PIT (Point-in-Time) is a specialized SATELLITE derivative that is used to get the latest row AS OF a specific date WITHOUT the use of nested sub-queries in the main satellite query.
A MEASURE SATELLITE is a specific SATELLITE dedicated to holding particular descriptive data on which calculations or aggregations can be performed for analytical purposes.
A REFERENCE is a specific hybrid (a flat table instead of Hub/Sat) in which the decoding info is truly static, usually de-normalized, with no history.
A BRIDGE is similar to a PIT and is designed for performance, but it is created from many Hubs and Links; it allows computing by columns.
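The PIT idea can be sketched as precomputing, for each hub key and snapshot date, which satellite load date was in effect, so that later queries become a direct lookup. The function `build_pit` and its input layout are hypothetical, and dates are assumed to be ISO-formatted strings so that they sort chronologically.

```python
from bisect import bisect_right


def build_pit(sat_load_dates, snapshot_dates):
    """Precompute the satellite row in effect for each snapshot date.

    sat_load_dates: {hub_key: sorted list of ISO load-date strings}
    Returns {(hub_key, snapshot_date): load_date in effect, or None when
    no satellite row had been loaded yet}.
    """
    pit = {}
    for hub_key, load_dates in sat_load_dates.items():
        for snap in snapshot_dates:
            # Count the load dates that are <= the snapshot date.
            idx = bisect_right(load_dates, snap)
            pit[(hub_key, snap)] = load_dates[idx - 1] if idx else None
    return pit
```

Once the table is built, finding the satellite row as of a date is a single equality join on hub key and load date rather than a nested "latest row before this date" sub-query.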

Few Lines
In 2008, W.H. (Bill) Inmon stated that the Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework (DW2.0).
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. (http://www.tdan.com/view-articles/5054/)
The number of Data Vault users surpassed 500 in 2010 and is growing rapidly (http://danlinstedt.com/about/dvcustomers/).


The Story


About the IPC Environment



DS Feeds: Daily, weekly, monthly and ad-hoc from RDBMSs and flat files, some UD
EDW Platform: SQL Server 2005+. Projected size of the EDW for 2010 is 45TB, growing 10-15% annually
Data Warehouse Builder: WhereScape RED 6
BI: Balanced Insight Consensus/MicroStrategy 9
Phase I of the Data Vault EDW is completed (approx. 500 objects), along with the Data Mart and BI reports (6 weeks).
The subsequent phases are being developed now.
The re-platforming of the Data Vault to Teradata 13 is also underway.

Conclusion

Every Data Warehousing flavor is applicable, depending on the phase and purpose of the DW:
Third Normal Form: Normalization Rules
Data Vault Structure: Golden Copy
Tables/Views with pre-aggregated data: Reusable Components
Dimensional Model: Interpretation of Data by Users

Specifically, the Data Vault model is currently an optimal approach for building an Enterprise Data Warehouse.


For More Info Contact

Questions?

Raphael Klebanov, MCS, PSM, CDVDM, TCP

Lead DW/BI Analyst


rklebanov@wherescape.com 303.968.0703

raphael_ws learndatavault.com

New Business Supermodel by Dan Linstedt
