Data Vault Series 1 - Data Vault Overview
by Dan E. Linstedt
Published in TDAN.com, July 1, 2002 (Article URL: http://www.tdan.com/view-articles/5054)
There is a new home for the Data Vault: http://www.danlinstedt.com
Articles from this series: Series 1 - Series 2 - Series 3 - Series 4 - Series 5
1.0 Introduction
The purpose of this paper is to present and discuss a patent-pending technique called the Data Vault, the next evolution in data modeling for enterprise data warehousing. This is a highly technical paper and is meant for an audience of data modelers, data architects and database administrators. It is not meant for business analysts, project managers, or mainframe programmers. A base level of knowledge of common data modeling terms is recommended, such as table, relationship, parent, child, key (primary/foreign), dimension and fact.

The topics in this paper are as follows:

- Defining a Data Vault.
- A brief history of data modeling for data warehousing.
- The problems of existing data warehouse data modeling architectures.
- The importance of architecture and design for enterprise data warehousing.
- The components of a Data Vault.
- Solving the pain of data warehouse architectures.
- The foundations of the Data Vault architecture.
- Possible applications / implications of the Data Vault.

Several of the objectives that you may learn from this paper are:

- What a Data Vault is and why it makes sense.
- How to build a small Data Vault of your own.
- What doesn't work from an enterprise data warehousing perspective.

For too long we have waited for data structures to finally catch up with artificial intelligence and data mining applications. Most data mining technology has to import flat file information in order to join the form with the function. Unfortunately, volumes in data warehouses are growing rapidly, and exporting this information for data mining purposes is becoming increasingly difficult. It simply doesn't make sense to have this discontinuity between form (structure), function (artificial intelligence), and execution (the act of data mining).
http://www.tdan.com/print/5054
6/13/2010
Marrying form, function and execution holds tremendous power for the artificial intelligence (AI) and data mining communities. Having data structures that are mathematically sound increases the ability to bring these technologies back into the database. The Data Vault is based on mathematical principles that allow it to be extensible and capable of handling massive volumes of information. The architecture and structure are designed to handle dynamic changes to relationships between information. A stretch of the imagination might be to one day encapsulate the data with the functions of data mining, hopefully moving towards a self-aware, independent piece of information; but that's just a dream for now. It is possible to form, drop, and evaluate relationships between data sets dynamically, thus changing the landscape of what is possible with a data model and essentially bringing the data model into a dynamic state of flux (through the use of data mining / artificial intelligence). By implementing reference architectures on top of a Data Vault structure, the functions that access the content may begin to execute in parallel and in an automated, dynamic fashion. The Data Vault solves some of the enterprise data warehousing structural and storage problems from a normalized, best-of-breed perspective. The concepts provide a whole host of opportunities for applying this unique technology. "You must strive to do that which you think you cannot do." - Eleanor Roosevelt
growing needs of enterprise data warehousing. The term for this is Conformed Data Marts.
Performance and other weaknesses of 3NF and star schema (when used within an enterprise data warehouse) began to show in the '90s as the volume of data increased. The Data Vault is architected to overcome these shortcomings while retaining the strengths of 3NF and star schema architectures. Within the past year (of the date of this article), this technique has been favorably received by industry experts. The Data Vault is the next evolution in data modeling because it is architected specifically for enterprise data warehouses.
Figure 2. Date-Time Stamped 3NF

The conformed data mart also has trouble. It is a collection of fact tables that are linked together via primary/foreign keys; in other words, a linked set of related star schemas. The problems this creates are numerous: isolated subject-oriented information, possible data redundancy, inconsistent query structuring, aggravated scalability issues, difficulties with fact table linkages (incompatible grain), synchronization issues in near-real-time loading, limited enterprise views and troublesome data mining. While the star schema is typically bottom-up architecture with bottom-up implementation, the conformed data mart should be top-down architecture with bottom-up implementation. However, informal polling has shown that bottom-up architecture and bottom-up implementation appear to be the standard.

One of the most difficult issues of a conformed data mart (or conformed fact tables) is getting the grain right. That means understanding the data as it is aggregated for each fact table, and assuring both that the aggregation will stay consistent for all time (during the life of the relationship) and that the structure of each fact table will not change (i.e., no new dimensions will be added to either fact table). This limits the design, scalability and flexibility of the data model. Another issue is the helper table. This table is defined to be a dimension-to-dimension relationship Link. Granularity is very important, as is the stability of the design of the dimension. This too limits the design, scalability and flexibility of the data model.
Figure 3. Conformed Data Mart

If the granularity of the Revenue fact is altered, then it is no longer the same (duplicate) fact table. Adding a dimension to one of the fact tables frequently changes the granularity. It has also been suggested that fact tables can be linked together simply because they carry the same dimension keys. This is only true if the facts are aggregated to the same granularity, which is an extremely difficult state to maintain as the system grows and matures.
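The grain problem described above can be shown with a minimal sketch. All table names and data here are hypothetical, invented for illustration: a revenue fact at (customer, month) grain and a shipments fact at the finer (customer, month, product) grain share dimension keys, yet joining them directly double-counts the coarser fact.

```python
# Hypothetical revenue fact at (customer, month) grain.
revenue = {
    ("cust1", "2002-01"): 100.0,
    ("cust1", "2002-02"): 200.0,
}

# Hypothetical shipments fact at (customer, month, product) grain -- finer.
shipments = [
    {"customer": "cust1", "month": "2002-01", "product": "A", "units": 3},
    {"customer": "cust1", "month": "2002-01", "product": "B", "units": 5},
]

# Naive link on the shared dimension keys: each revenue row repeats once
# per shipment row at the finer grain, so revenue is double-counted.
joined = [
    {**s, "revenue": revenue[(s["customer"], s["month"])]}
    for s in shipments
]
naive_total = sum(row["revenue"] for row in joined)   # 200.0, not 100.0

# Safe approach: aggregate the finer fact up to the coarser grain first,
# so both facts meet at one row per (customer, month).
units_by_month = {}
for s in shipments:
    key = (s["customer"], s["month"])
    units_by_month[key] = units_by_month.get(key, 0) + s["units"]

safe_total = sum(revenue[k] for k in units_by_month)  # 100.0
```

Maintaining that same-grain guarantee across every pair of linked fact tables, for the life of the warehouse, is exactly the maintenance burden the text describes.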
2.3 The Importance of Architecture and Design for Enterprise Data Warehousing
A data warehouse should be top down architecture and bottom up implementation. This allows the
architecture to reach the maximum known knowledge boundaries, while the implementation can be scope-controlled to facilitate fast delivery times. The implementation should therefore be designed as a plug-and-play set of tables without becoming a stovepipe upon delivery. The design and architecture of a data warehouse must be flexible enough to grow and change with the business needs, because the needs of today are not necessarily the needs of tomorrow.

Our industry needs a formalized data modeling architecture and design that is capable of accurately representing data warehouses. The architecture must be a defined normalization for data warehousing, as distinct from the defined normalization for OLTP systems. For example, the defined normalization for OLTP is 1st, 2nd and 3rd normal form (extending, of course, to 4th, 5th and perhaps 6th normal form). Data warehousing today has no such structured or pre-defined normalization for data modeling. It is also apparent that a haphazard normalization effort is no longer sufficient for an enterprise data warehousing architecture; inconsistencies in modeling techniques lead to maintenance-intensive implementations.

The Data Vault is a defined normalization of data modeling for data warehouses. Its strength lies in the structure and usage from which the model is built. It utilizes some of the following data modeling techniques: many-to-many relationships, referential integrity, minimally redundant data sets and business-function-keyed information Hubs. These techniques make the Data Vault data model flexible, expandable and consistent. The approach to building a Data Vault data model is iterative, which provides a platform for data architects and business users to construct enterprise data warehouses in a component-based fashion (see Bill Inmon's article "Data Mart Does Not Equal Data Warehouse" on DMReview.com).
have their own record source: one from accounting and one from contracts. The obvious preference is to perform cleansing and matching on these numbers to integrate them together; however, that topic is out of scope for this paper. The Hub's primary key always migrates outward from the Hub. Once the business is correctly identified through keys (say, customer and account), the Link entities can be constructed.
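The Hub and Link structures described above can be sketched minimally as follows. This is an illustrative sketch only: the class names, field names and sample keys (CUST-0042, ACCT-9001) are hypothetical, not taken from the article, but the shape follows the text, where a Hub carries only a business key plus load metadata, and a Link associates migrated Hub keys.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Hub:
    """A Hub holds only the business key plus load metadata."""
    surrogate_key: int    # sequence; this is the key that migrates outward
    business_key: str     # e.g. a customer number or account number
    load_dts: datetime    # load date-time stamp
    record_source: str    # source system, recorded for traceability

@dataclass(frozen=True)
class Link:
    """A Link is a many-to-many association of migrated Hub keys."""
    surrogate_key: int
    hub_keys: tuple       # the Hub primary keys that migrated into the Link
    load_dts: datetime
    record_source: str

now = datetime(2002, 7, 1)
customer = Hub(1, "CUST-0042", now, "accounting")
account = Hub(1, "ACCT-9001", now, "contracts")

# The customer/account relationship lives only in the Link, so it can later
# be end-dated or rebuilt without touching either Hub.
cust_acct = Link(1, (customer.surrogate_key, account.surrogate_key), now, "contracts")
```

Keeping relationships out of the Hubs is what lets the model absorb relationship changes without restructuring the key tables.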
Record Source: a recording of the source system, utilized for data traceability.

The Satellite is most closely related to a Type 2 dimension as defined by Ralph Kimball. It stores deltas at a granular level; its function is to provide context around the Hub key: for example, the fact that VIN 1234567 represents a blue Toyota truck today and a red Toyota truck tomorrow. Color may be a Satellite for automobile. Its design relies on the mathematical principles surrounding reduction of data redundancy and rate of change. For instance, if the automobile is a rental, the dates of availability / rented might change daily, which is much faster than the rate of change for color, tires or owner. The issue that the Satellite solves is defined as follows: an Automobile dimension may contain 160+ attributes; if the color or the tires change, then all 160+ attributes must be replicated into a new row (if utilizing a Type 2 dimension). Why replicate data when the rest of the attributes are changing at slower rates? If utilizing a Type 1 or Type 3 dimension, it is possible to lose a partial or complete historical trail. In this case the data modeler should construct, at a minimum, two Satellites: dates of availability and maintenance/parts. If the customer who rents the auto the first day is Dan and the second day is Jane, then it is the Link's responsibility to represent the relationship. The data modeler might attach one or more Satellites to the Link, representing dates rented (from/to), condition of vehicle and comments made by the renter.
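The rate-of-change split above can be sketched with the article's rental-automobile example. The Satellite names and attribute sets here (sat_auto_desc, sat_auto_avail) are hypothetical, chosen to illustrate the principle: slow-changing descriptive attributes live in one Satellite, the daily-changing availability flag in another, so an availability change writes one narrow row instead of replicating the whole wide dimension row.

```python
from datetime import date

# Slow-changing Satellite: descriptive attributes around the automobile Hub key.
sat_auto_desc = [
    {"vin": "1234567", "load_date": date(2002, 7, 1),
     "color": "blue", "make": "Toyota", "body": "truck"},
]

# Fast-changing Satellite: rental availability, which may change daily.
sat_auto_avail = [
    {"vin": "1234567", "load_date": date(2002, 7, 1), "available": True},
]

def record_availability(vin, load_date, available):
    """A daily availability change appends one narrow row; the
    descriptive Satellite is untouched."""
    sat_auto_avail.append(
        {"vin": vin, "load_date": load_date, "available": available})

record_availability("1234567", date(2002, 7, 2), False)

# After the change: descriptive history still has one row (nothing was
# replicated), while the availability Satellite carries the new delta.
```

In a single wide Type 2 dimension, the same daily change would have forced a full copy of every descriptive attribute into a new row.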
11. Satellites always contain either a load date-time stamp, or a numeric reference to a stand-alone load date-time stamp sequence table.
12. Stand-alone tables such as calendar, time, code and description tables may be utilized.
13. Links may have a surrogate key.
14. If a Hub has two or more Satellites, a point-in-time table may be constructed for ease of joins.
15. Satellites are always delta driven; duplicate rows should not appear.
16. Data is separated into Satellite structures based on: 1) type of information, and 2) rate of change.

These simple components (Hub, Link and Satellite) combine to form a Data Vault. A Data Vault can be as small as a single Hub with one Satellite, or as large as the scope permits. The scope can always be modified at a later date, and scalability is not an issue (nor is granularity of the information). A data modeler can convert small components of an existing data warehouse model to a Data Vault architecture one piece at a time, because the changes are isolated to the Hub and Satellites. The business (how functional areas of business interact) is represented by the Links. In this manner the Links can be end-dated, rebuilt, revised and so on.
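Rule 15 above (Satellites are delta driven, with no duplicate rows) can be sketched as a load routine. The function name and row layout are hypothetical; the logic is the stated rule: a new Satellite row is inserted only when the descriptive attributes actually differ from the most recent row for that Hub key.

```python
from datetime import date

satellite = []  # Satellite rows, appended in load order per Hub key

def load_delta(hub_key, load_date, attrs):
    """Insert a row only if attrs differ from the most recent row for
    hub_key; otherwise skip, so duplicates never appear (rule 15)."""
    current = None
    for row in satellite:
        if row["hub_key"] == hub_key:
            current = row  # rows arrive in load order, so last match wins
    if current is not None and current["attrs"] == attrs:
        return False  # no change -> no row (delta driven)
    satellite.append(
        {"hub_key": hub_key, "load_date": load_date, "attrs": attrs})
    return True

load_delta("VIN1234567", date(2002, 7, 1), {"color": "blue"})
load_delta("VIN1234567", date(2002, 7, 2), {"color": "blue"})  # duplicate, skipped
load_delta("VIN1234567", date(2002, 7, 3), {"color": "red"})   # delta, stored
```

After the three loads the Satellite holds exactly two rows: the original blue row and the red delta, which is the history a Type 2 dimension would keep, without the unchanged day ever being written.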
come from the source system. Then they proceed to provide a spreadsheet with their daily maintained interpretation of the information. In other words: "I (the customer) want to see all VIN numbers that start with X rolled up under the label BIG TRUCKS." What the Data Vault provides for this is called a User Grouping Set. It's another Hub (the label "Big Trucks") with a Satellite describing which VIN numbers roll up under this label, and a Link to the VIN numbers themselves. In this manner, the original data from the source system are preserved, while the query tools can view the information in a manner appropriate to the users' needs. When all is said and done, a data warehouse is successful if it meets the users' needs.
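The User Grouping Set can be sketched as follows; the VIN values and variable names are hypothetical. The user's label becomes a new Hub, and a Link ties that label to the existing VIN Hub keys, so the rollup is answered from the Link while the source-loaded VIN Hub is never modified.

```python
# Hypothetical VIN Hub business keys, loaded from the source system.
vin_hub = ["X100", "X200", "Y300"]

# User Grouping Set: a new Hub for the user-maintained label, and a Link
# associating that label with the VINs the user says belong under it
# (here, per the example, every VIN starting with "X").
group_hub = ["BIG TRUCKS"]
group_link = [("BIG TRUCKS", vin) for vin in vin_hub if vin.startswith("X")]

# Query-time rollup goes through the Link; vin_hub itself is untouched.
big_trucks = [vin for (label, vin) in group_link if label == "BIG TRUCKS"]
```

If the user later changes their interpretation, only the grouping Hub, its Satellite and the Link are rebuilt; the original source data stays intact.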
- Dynamic Data Warehousing: based on dynamic, automated changes made to both process and structure within the warehouse.
- Exploration Warehousing: allowing users to play with the structures of the data warehouse without losing the content.
- In-Database Data Mining: allowing the data mining tools to make use of the historical data, and to better fit the form (structure) with the function of data mining / artificial intelligence.
- Rapid Linking of External Information: an ability to rapidly link and adapt structures to bring in external information and make sense of it within the data warehouse without destroying existing content.

The business of data warehousing is evolving; it must move in order to survive. The architecture and foundations behind what data warehousing means will continue to change. The Data Vault overcomes most of the problems and limitations of the past and stands ready to meet the challenges of the future.

If you are interested in more information, please contact me at dlinstedt@coreintegration.com, or through regular mail at: Core Integration Partners, 455 Sherman St, Suite 207, Denver, CO USA, 80203. You may also check out our web site at www.coreintegration.com. Thank you.

Copyright 2002-2003, Core Integration Partners, All Rights Reserved. Unless otherwise indicated, all materials are the property of Core Integration Partners, Inc. No part of this document may be reproduced in any form, or by any means, without written permission from Core Integration Partners, Inc.
Recent articles by Dan E. Linstedt: Data Vault Series 5 - Loading Practices; Convergence: The Freight Train's Coming; Data Warehousing Ethical Concerns: Security, Access and Control.

Dan E. Linstedt: Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor at The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, XAware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, is the inventor of The Matrix Methodology and the Data Vault data modeling architecture, has built expert training courses and trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

Quality Content for Data Management Professionals Since 1997. Copyright 1997-2010, The Data Administration Newsletter, LLC -- www.TDAN.com. TDAN.com is an affiliate of the BeyeNETWORK.