There are two leading approaches to storing data in a data warehouse — the
dimensional approach and the normalized approach.
In the normalized approach, the data in the data warehouse are stored following, to
a degree, database normalization rules. Tables are grouped together by subject
areas that reflect general data categories (e.g., data on customers, products,
finance, etc.). The main advantage of this approach is that it is straightforward to
add information into the database. A disadvantage of this approach is that, because
of the number of tables involved, it can be difficult for users both to:
Join data from different sources into meaningful information and then
Access the information without a precise understanding of the sources of
data and of the data structure of the data warehouse.
Methodologies:
Bottom-up design:
In the so-called bottom-up approach, data marts are first created to provide
reporting and analytical capabilities for specific business processes. These data
marts can eventually be integrated to create a comprehensive data warehouse.
Top-down design:
A data warehouse is a centralized repository for the entire enterprise. The data
warehouse is designed using a normalized enterprise data model. "Atomic" data,
that is, data at the lowest level of detail, are stored in the data warehouse.
Dimensional data marts containing data needed for specific business processes or
specific departments are created from the data warehouse.
Redundant or De-Normalized Data:
Duplication of data.
Has more data than needed.
Data is expressed in more than one place.
Data mart:
A data mart is a subset of an organizational data store, usually oriented to a specific
purpose or major data subject, which may be distributed to support business needs.
Data marts are analytical data stores designed to focus on specific business
functions for a specific community within an organization. Data marts are often
derived from subsets of data in a data warehouse, though in the bottom-up data
warehouse design methodology the data warehouse is created from the union of
organizational data marts.
Design Schemas:
Star Schema or Dimensional model
Snowflake Schema
Star Schema:
The star schema (sometimes referenced as star join schema) is the simplest style
of data warehouse schema. The star schema consists of a few fact tables (possibly
only one, justifying the name) referencing any number of dimension tables. The star
schema is considered an important special case of the snowflake schema.
The facts that a data warehouse helps analyze are classified along different
dimensions: the fact tables hold the main data, while the usually smaller dimension
tables describe each value of a dimension and can be joined to fact tables as
needed.
Dimension tables have a simple primary key, while fact tables have a set of foreign
keys which make up a compound primary key consisting of a combination of
relevant dimension keys.
The main reason for using a star schema is its simplicity from the users' point of
view: queries stay simple because the only joins and conditions involve a fact
table and a single level of dimension tables, without the indirect dependencies to
other tables that are possible in a more normalized snowflake schema.
E.g.:
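To illustrate, here is a minimal Python sketch of a star schema (all table, column, and key names are hypothetical): dimension tables keyed by a simple primary key, and a fact table whose rows are identified by a combination of dimension keys.

```python
# Hypothetical star schema: one fact table referencing two dimension tables.
# Dimension tables: simple primary key -> descriptive attributes.
dim_date = {
    20240101: {"year": 2024, "month": 1, "day": 1},
    20240102: {"year": 2024, "month": 1, "day": 2},
}
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
}

# Fact table: each row's identity is the combination of dimension keys
# (a compound primary key of foreign keys) plus numeric measures.
fact_sales = [
    {"date_key": 20240101, "product_key": 1, "qty": 3, "amount": 30.0},
    {"date_key": 20240101, "product_key": 2, "qty": 1, "amount": 15.0},
    {"date_key": 20240102, "product_key": 1, "qty": 2, "amount": 20.0},
]

# A star-schema query joins the fact table to one level of dimensions:
# total sales amount per product name.
totals = {}
for row in fact_sales:
    name = dim_product[row["product_key"]]["name"]
    totals[name] = totals.get(name, 0.0) + row["amount"]

print(totals)  # {'Widget': 50.0, 'Gadget': 15.0}
```

Note that every query follows the same shape: one hop from the fact table to each dimension it needs.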
Snowflake Schema:
A snowflake schema is a logical arrangement of tables in a multidimensional
database such that the entity relationship diagram resembles a snowflake in shape.
Closely related to the star schema, the snowflake schema is represented by
centralized fact tables which are connected to multiple dimensions. In the snowflake
schema, however, dimensions are normalized into multiple related tables whereas
the star schema's dimensions are denormalized with each dimension being
represented by a single table. When the dimensions of a snowflake schema are
elaborate, having multiple levels of relationships, and where child tables have
multiple parent tables ("forks in the road"), a complex snowflake shape starts to
emerge. The "snowflaking" effect only affects the dimension tables and not the fact
tables.
E.g.:
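As an illustration in the same style (all names hypothetical), the product dimension below is normalized so category attributes live in their own related table, which adds an extra join hop compared with a star schema:

```python
# Hypothetical snowflake schema: the product dimension is normalized so that
# category data lives in its own related table instead of being repeated
# in every product row (as it would be in a star schema).
dim_category = {
    10: {"category": "Hardware", "department": "Tools"},
}
dim_product = {
    1: {"name": "Widget", "category_key": 10},
    2: {"name": "Gadget", "category_key": 10},
}
fact_sales = [
    {"product_key": 1, "amount": 30.0},
    {"product_key": 2, "amount": 15.0},
]

# Queries now need an extra join hop: fact -> product -> category.
by_department = {}
for row in fact_sales:
    product = dim_product[row["product_key"]]
    dept = dim_category[product["category_key"]]["department"]
    by_department[dept] = by_department.get(dept, 0.0) + row["amount"]

print(by_department)  # {'Tools': 45.0}
```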
Extract:
The first part of an ETL process involves extracting the data from the source
systems. Most data warehousing projects consolidate data from different source
systems. Each separate system may also use a different data organization/format.
Common data source formats are relational databases and flat files, but sources
may also include non-relational database structures such as Information
Management System (IMS), other data structures such as Virtual Storage Access
Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched
from outside sources through web spidering or screen scraping. Extraction
converts the data into a format suitable for transformation processing.
An intrinsic part of the extraction involves parsing the extracted data to check
whether it meets an expected pattern or structure. If not, the data may be
rejected entirely or in part.
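The parse-and-validate step can be sketched as follows (the comma-separated record layout and field names are assumed for illustration only):

```python
import re

# Sketch of the parse-and-validate step of extraction. Assumed record layout:
# flat-file lines of "roll_no,name,salary". Records that do not match the
# expected pattern are rejected rather than passed on to transformation.
LINE_PATTERN = re.compile(r"^(\d+),([^,]+),(\d+(?:\.\d+)?)$")

def extract(lines):
    accepted, rejected = [], []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            roll_no, name, salary = match.groups()
            accepted.append({"roll_no": int(roll_no), "name": name,
                             "salary": float(salary)})
        else:
            rejected.append(line)  # partial rejection: only bad records dropped
    return accepted, rejected

source = ["1,Alice,50000", "2,Bob,", "3,Carol,62000.50"]
accepted, rejected = extract(source)
print(len(accepted), len(rejected))  # 2 1
```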
Transform:
The transform stage applies a series of rules or functions to the extracted data from
the source to derive the data for loading into the end target. Some data sources will
require very little or even no manipulation of data. In other cases, one or more of
the following transformation types may be required to meet the business and
technical needs of the target database:
Selecting only certain columns to load (or selecting null columns not to load).
For example, if source data has three columns (also called attributes) say
roll_no, age and salary then the extraction may take only roll_no and salary.
Similarly, the extraction mechanism may ignore all those records where salary is
not present (salary = null).
Translating coded values (e.g., if the source system stores 1 for male and 2
for female, but the warehouse stores M for male and F for female), this calls
for automated data cleansing; no manual cleansing occurs during ETL
Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to "M")
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Filtering
Sorting
Joining data from multiple sources (e.g., lookup, merge)
Aggregation (for example, rollup — summarizing multiple rows of data —
total sales for each store, and for each region, etc.)
Generating surrogate-key values
Transposing or pivoting (turning multiple columns into multiple rows or vice
versa)
Splitting a column into multiple columns (e.g., putting a comma-separated list
specified as a string in one column as individual values in different columns)
Disaggregation of repeating columns into a separate detail table (e.g.,
moving a series of addresses in one record into single addresses in a set of
records in a linked address table)
Lookup and validate the relevant data from tables or referential files for
slowly changing dimensions.
Applying any form of simple or complex data validation. If validation fails, it
may result in a full, partial or no rejection of the data, and thus none, some or
all the data are handed over to the next step, depending on the rule design
and exception handling. Many of the above transformations may result in
exceptions, for example, when a code translation parses an unknown code in
the extracted data.
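Several of the transformation types above can be sketched together in Python (the column names follow the roll_no/salary and 1/2 gender-code examples from the text; all data values are hypothetical):

```python
# Sketch of several transformation types from the list above, applied to the
# roll_no/age/salary and 1 -> M / 2 -> F examples used in the text.
GENDER_CODES = {1: "M", 2: "F"}  # translating coded values

source_rows = [
    {"roll_no": 101, "age": 30, "salary": 50000, "gender": 1,
     "qty": 2, "unit_price": 10.0},
    {"roll_no": 102, "age": 25, "salary": None, "gender": 2,
     "qty": 1, "unit_price": 99.0},
]

transformed = []
for row in source_rows:
    if row["salary"] is None:        # filtering: ignore records with no salary
        continue
    transformed.append({
        "roll_no": row["roll_no"],   # column selection: age is not loaded
        "salary": row["salary"],
        "gender": GENDER_CODES[row["gender"]],
        # deriving a new calculated value: sale_amount = qty * unit_price
        "sale_amount": row["qty"] * row["unit_price"],
    })

print(transformed)
# [{'roll_no': 101, 'salary': 50000, 'gender': 'M', 'sale_amount': 20.0}]
```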
Load:
The load phase loads the data into the end target, usually the data warehouse
(DW). Depending on the requirements of the organization, this process varies
widely. Some data warehouses may overwrite existing information with cumulative
information, with updates of the extracted data done on a daily, weekly or
monthly basis, while other data warehouses (or even other parts of the same DW)
may add new data in a historicized form.
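The two load styles can be sketched as follows (the keys, values, and dates are hypothetical):

```python
from datetime import date

# Sketch of the two load styles described above: a full overwrite versus a
# historicized append that keeps every delivered version with its load date.
warehouse_overwrite = {}   # key -> latest row only
warehouse_history = []     # every load is kept

def load(rows, load_date):
    for key, value in rows.items():
        warehouse_overwrite[key] = value               # overwrite style
        warehouse_history.append(
            {"key": key, "value": value, "loaded_on": load_date}
        )                                              # historicized style

load({"cust_1": "Springfield"}, date(2024, 1, 1))
load({"cust_1": "Shelbyville"}, date(2024, 2, 1))

print(warehouse_overwrite["cust_1"])  # Shelbyville
print(len(warehouse_history))         # 2
```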
Types of Dimensions:
Conformed Dimension
Junk Dimension
Degenerate Dimension
Role-Playing Dimension
Conformed Dimension:
Dimensions are conformed when they are either exactly the same (including keys)
or one is a perfect subset of the other. Most important, the row headers produced in
the answer sets from two different conformed dimensions must be able to match
perfectly.
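As an illustration (all names and values hypothetical), two marts rolling up by the same conformed Date dimension produce matching row headers:

```python
# Sketch of a conformed Date dimension shared by two marts: because both use
# the same keys and descriptions, their answer sets match row-for-row.
dim_date = {20240101: "January 2024", 20240102: "January 2024"}

sales_fact = [{"date_key": 20240101, "amount": 100.0},
              {"date_key": 20240102, "amount": 50.0}]
inventory_fact = [{"date_key": 20240101, "on_hand": 7}]

def rollup(fact, measure):
    out = {}
    for row in fact:
        header = dim_date[row["date_key"]]  # identical row headers in both marts
        out[header] = out.get(header, 0) + row[measure]
    return out

sales = rollup(sales_fact, "amount")
stock = rollup(inventory_fact, "on_hand")
print(sales)  # {'January 2024': 150.0}
print(stock)  # {'January 2024': 7}
```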
Junk Dimension:
A junk dimension combines a number of low-cardinality flags and indicators (for
example, yes/no fields or status codes) that do not fit in any other dimension
into a single dimension table, keeping them out of the fact table.
Degenerate Dimension:
A degenerate dimension is a dimension key, such as a transaction or invoice
number, that is stored in the fact table itself but has no corresponding
dimension table of its own.
Role-Playing Dimension:
Dimensions are often recycled for multiple applications within the same database.
For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of
Delivery", or "Date of Hire". This is often referred to as a "role-playing dimension".
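A role-playing date dimension can be sketched as follows (keys and dates are hypothetical): the same dimension table is joined once per role.

```python
# One "Date" dimension reused in two roles: the fact row carries a separate
# foreign key for each role, but both keys resolve against the same table.
dim_date = {
    20240105: "2024-01-05",
    20240109: "2024-01-09",
}
fact_orders = [
    {"order_id": 1, "sale_date_key": 20240105, "delivery_date_key": 20240109},
]

order = fact_orders[0]
sale_date = dim_date[order["sale_date_key"]]          # role: Date of Sale
delivery_date = dim_date[order["delivery_date_key"]]  # role: Date of Delivery
print(sale_date, delivery_date)  # 2024-01-05 2024-01-09
```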
Slowly Changing Dimensions (SCD):
Slowly Changing Dimensions are dimensions containing data that changes slowly
over time.
For example, you may have a Dimension in your database that tracks the sales
records of your company's salespeople. Creating sales reports seems simple
enough, until a salesperson is transferred from one regional office to another. How
do you record such a change in your sales dimension? You could create a second
salesperson record and treat the transferred person as a new salesperson, but
that creates problems of its own.
SCD Methodologies:
Type 0:
The Type 0 method is passive: the attribute value is never updated, so the
dimension record always retains its original value.
Type 1:
The Type 1 methodology overwrites old data with new data, and therefore does not
track historical data at all. The obvious disadvantage to this method of managing
SCDs is that there is no historical record kept in the data warehouse. An
advantage, however, is that Type 1 dimensions are very easy to maintain.
Type 2:
The Type 2 method tracks historical data by creating multiple records in the
dimensional tables with separate keys. With Type 2, we have unlimited history
preservation as a new record is inserted each time a change is made.
Another popular method for tuple versioning is to add effective date columns.
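The Type 2 approach with effective date columns can be sketched as follows (the Supplier attributes and dates are hypothetical):

```python
from datetime import date

# Sketch of SCD Type 2 with effective-date columns: a change closes the
# current row (sets its end date) and inserts a new versioned row with a
# new surrogate key, so every historical state is preserved.
supplier_dim = [
    {"supplier_key": 1, "supplier_id": "ABC", "state": "CA",
     "start_date": date(2020, 1, 1), "end_date": None},
]

def apply_type2_change(rows, supplier_id, new_state, change_date):
    for row in rows:
        if row["supplier_id"] == supplier_id and row["end_date"] is None:
            row["end_date"] = change_date            # close current version
    rows.append({
        "supplier_key": len(rows) + 1,               # new surrogate key
        "supplier_id": supplier_id, "state": new_state,
        "start_date": change_date, "end_date": None, # open-ended current row
    })

apply_type2_change(supplier_dim, "ABC", "IL", date(2024, 6, 1))
print(len(supplier_dim))            # 2
print(supplier_dim[0]["end_date"])  # 2024-06-01
```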
Type 3:
The Type 3 method tracks changes using separate columns. Whereas Type 2 had
unlimited history preservation, Type 3 has limited history preservation, as it's
limited to the number of columns we designate for storing historical data.
Whereas the original table structure in Types 1 and 2 was very similar, Type 3
adds additional columns to the table:
Note that this record cannot track all historical changes, such as when a supplier
moves twice.
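A minimal Type 3 sketch (hypothetical Supplier attributes), showing how a second move loses the original value:

```python
# Sketch of SCD Type 3: one extra column holds the single previous value,
# so only the most recent change survives.
supplier = {"supplier_id": "ABC", "state": "CA", "previous_state": None}

def apply_type3_change(row, new_state):
    row["previous_state"] = row["state"]  # overwrites any older history
    row["state"] = new_state

apply_type3_change(supplier, "IL")
apply_type3_change(supplier, "NY")  # second move: the original CA is lost

print(supplier)
# {'supplier_id': 'ABC', 'state': 'NY', 'previous_state': 'IL'}
```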
Type 4:
The Type 4 method is usually just referred to as using "history tables", where one
table keeps the current data and an additional table is used to keep a record of
some or all changes.
Following the example above, the original table might be called Supplier and the
history table might be called Supplier_History.
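A minimal Type 4 sketch using the Supplier and Supplier_History names from the text (the columns and dates are hypothetical):

```python
from datetime import date

# Sketch of SCD Type 4: the Supplier table holds only current data, while
# Supplier_History records each superseded version at change time.
supplier = {"supplier_id": "ABC", "state": "CA"}
supplier_history = []

def apply_type4_change(current, history, new_state, change_date):
    history.append({**current, "create_date": change_date})  # archive old row
    current["state"] = new_state                             # overwrite current

apply_type4_change(supplier, supplier_history, "IL", date(2024, 6, 1))
print(supplier["state"])             # IL
print(supplier_history[0]["state"])  # CA
```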
Type 6/Hybrid:
The Type 6 method is one that combines the approaches of types 1, 2 and 3 (1 + 2
+ 3 = 6).
This is how the Supplier table would look using Type 6 Slowly Changing
Dimensions:
Alternate Implementation: