
Data warehouse:

A data warehouse is a repository of an organization's electronically stored data,
designed to facilitate reporting and analysis. It is:
 Subject-oriented: The data in the data warehouse is organized so that all
the data elements relating to the same real-world event or object are linked
together.
 Non-Volatile: Data in the data warehouse is never over-written or deleted;
once committed, the data is static, read-only, and retained for future
reporting.
 Integrated: The data warehouse contains data from most or all of an
organization’s operational systems, and this data is made consistent.
 Time-Variant: Changes to the data in the data warehouse are tracked and
recorded over time, so that reports can show how the data has changed.

There are two leading approaches to storing data in a data warehouse — the
dimensional approach and the normalized approach.

In a dimensional approach, transaction data are partitioned into either "facts",
which are generally numeric transaction data, or "dimensions", which are the
reference information that gives context to the facts. For example, a sales
transaction can be broken up into facts such as the number of products ordered and
the price paid for the products, and into dimensions such as order date, customer
name, product number, order ship-to and bill-to locations, and salesperson
responsible for receiving the order. A key advantage of a dimensional approach is
that the data warehouse is easier for the user to understand and to use.

In the normalized approach, the data in the data warehouse are stored following, to
a degree, database normalization rules. Tables are grouped together by subject
areas that reflect general data categories (e.g., data on customers, products,
finance, etc.). The main advantage of this approach is that it is straightforward to
add information into the database. A disadvantage of this approach is that, because
of the number of tables involved, it can be difficult for users both to:

 Join data from different sources into meaningful information and then
 Access the information without a precise understanding of the sources of
data and of the data structure of the data warehouse.

Methodologies:
Bottom-up design:
In the so-called bottom-up approach, data marts are first created to provide
reporting and analytical capabilities for specific business processes. These data
marts can eventually be integrated to create a comprehensive data warehouse.

Top-Down design:
A data warehouse is a centralized repository for the entire enterprise. The data
warehouse is designed using a normalized enterprise data model. "Atomic" data,
that is, data at the lowest level of detail, are stored in the data warehouse.
Dimensional data marts containing data needed for specific business processes or
specific departments are created from the data warehouse.

Redundant or De-Normalized Data:
 Duplication of data.
 Has more data than needed.
 Data is expressed in more than one place.

Data mart:
A data mart is a subset of an organizational data store, usually oriented to a specific
purpose or major data subject, which may be distributed to support business needs.
Data marts are analytical data stores designed to focus on specific business
functions for a specific community within an organization. Data marts are often
derived from subsets of data in a data warehouse, though in the bottom-up data
warehouse design methodology the data warehouse is created from the union of
organizational data marts.

Design Schemas:
 Star Schema or Dimensional model
 Snowflake Schema

Star Schema:

The star schema (sometimes referenced as star join schema) is the simplest style
of data warehouse schema. The star schema consists of a few fact tables (possibly
only one, justifying the name) referencing any number of dimension tables. The star
schema is considered an important special case of the snowflake schema.

The facts that the data warehouse helps analyze are classified along different
dimensions: the fact tables hold the main data, while the usually smaller dimension
tables describe each value of a dimension and can be joined to fact tables as
needed.
Dimension tables have a simple primary key, while fact tables have a set of foreign
keys which make up a compound primary key consisting of a combination of
relevant dimension keys.

A reason for using a star schema is its simplicity from the users' point of
view: queries are never complex, because the only joins and conditions involve
a fact table and a single level of dimension tables, without the indirect
dependencies on other tables that are possible in a better-normalized
snowflake schema.

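As a minimal sketch (all table and column names here are illustrative
assumptions, not from any particular system), the sales example above could be
laid out as a star schema, expressed as SQLite DDL issued from Python:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    conn.executescript("""
    -- Dimension tables: each has a simple primary key and descriptive attributes.
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);

    -- Fact table: numeric measures plus one foreign key per dimension;
    -- the dimension keys together form the compound primary key.
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        quantity     INTEGER,
        amount       REAL,
        PRIMARY KEY (date_key, product_key, customer_key)
    );
    """)

Every query then joins the central fact table directly to whichever dimension
tables it needs, keeping the join paths short.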

Snowflake Schema:
A snowflake schema is a logical arrangement of tables in a multidimensional
database such that the entity relationship diagram resembles a snowflake in shape.
Closely related to the star schema, the snowflake schema is represented by
centralized fact tables which are connected to multiple dimensions. In the snowflake
schema, however, dimensions are normalized into multiple related tables whereas
the star schema's dimensions are denormalized with each dimension being
represented by a single table. When the dimensions of a snowflake schema are
elaborate, having multiple levels of relationships, and where child tables have
multiple parent tables ("forks in the road"), a complex snowflake shape starts to
emerge. The "snowflaking" effect only affects the dimension tables and not the fact
tables.

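Continuing the star-schema sketch above (names again assumed), snowflaking the
product dimension normalizes its category attribute into a separate table:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    conn.executescript("""
    -- The category attribute moves out of dim_product into its own table;
    -- dim_product now references it, adding one more level of joins.
    -- The fact table is unchanged: snowflaking affects only dimensions.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);

    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    """)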

Reasons for creating a Data mart:

 Easy access to frequently needed data
 Creates a collective view for a group of users
 Improves end-user response time
 Ease of creation
 Lower cost than implementing a full Data warehouse
 Potential users are more clearly defined than in a full Data warehouse

Extract, Transform, Load (ETL):
Extract, transform, and load (ETL) is a process in database usage and especially
in data warehousing that involves:

 Extracting data from outside sources
 Transforming it to fit operational needs (which can include quality levels)
 Loading it into the end target (database or data warehouse)
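A minimal end-to-end sketch of the three steps in Python; the inline sample
data, column names, and target table are assumptions made for illustration:

    import csv
    import io
    import sqlite3

    # Inline sample standing in for an extracted flat file (an assumption).
    SOURCE = io.StringIO("sale_date,amount\n2004-12-22,19.99\n2004-12-23,\n")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact_sales (sale_date TEXT, amount REAL)")

    # Extract: read rows from the outside source.
    rows = list(csv.DictReader(SOURCE))

    # Transform: fit the data to operational needs (typing, dropping bad rows).
    clean = [(r["sale_date"], float(r["amount"])) for r in rows if r["amount"]]

    # Load: write the transformed rows into the end target.
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", clean)
    conn.commit()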

Extract:

The first part of an ETL process involves extracting the data from the source
systems. Most data warehousing projects consolidate data from different source
systems. Each separate system may also use a different data organization/format.
Common data-source formats are relational databases and flat files, but sources
may also include non-relational database structures such as Information
Management System (IMS), other data structures such as Virtual Storage Access
Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched
from outside sources through web spidering or screen-scraping. Extraction
converts the data into a format suitable for transformation processing.

An intrinsic part of the extraction is parsing the extracted data to check
whether it meets an expected pattern or structure; if not, the data may be
rejected entirely or in part.
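For instance, a small sketch of such a parsing check in Python (the expected
date pattern and field name are assumptions); rows that fail the check are
rejected in part rather than failing the whole batch:

    import re

    DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # the expected structure

    def parse_extracted(rows):
        """Split extracted rows into those matching the pattern and the rest."""
        accepted, rejected = [], []
        for row in rows:
            if DATE_PATTERN.match(row.get("sale_date", "")):
                accepted.append(row)
            else:
                rejected.append(row)  # rejected in part, not the whole batch
        return accepted, rejected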

Transform:

The transform stage applies a series of rules or functions to the extracted data from
the source to derive the data for loading into the end target. Some data sources will
require very little or even no manipulation of data. In other cases, one or more of
the following transformation types may be required to meet the business and
technical needs of the target database:

 Selecting only certain columns to load (or selecting null columns not to load).
For example, if source data has three columns (also called attributes) say
roll_no, age and salary then the extraction may take only roll_no and salary.
Similarly, extraction mechanism may ignore all those records where salary is
not present (salary = null).
 Translating coded values (e.g., if the source system stores 1 for male and 2
for female, but the warehouse stores M for male and F for female), this calls
for automated data cleansing; no manual cleansing occurs during ETL
 Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to "M")
 Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
 Filtering
 Sorting
 Joining data from multiple sources (e.g., lookup, merge)
 Aggregation (for example, rollup — summarizing multiple rows of data —
total sales for each store, and for each region, etc.)
 Generating surrogate-key values
 Transposing or pivoting (turning multiple columns into multiple rows or vice
versa)
 Splitting a column into multiple columns (e.g., putting a comma-separated list
specified as a string in one column as individual values in different columns)
 Disaggregation of repeating columns into a separate detail table (e.g.,
moving a series of addresses in one record into single addresses in a set of
records in a linked address table)
 Lookup and validate the relevant data from tables or referential files for
slowly changing dimensions.
 Applying any form of simple or complex data validation. If validation fails, it
may result in a full, partial or no rejection of the data, and thus none, some or
all the data are handed over to the next step, depending on the rule design
and exception handling. Many of the above transformations may result in
exceptions, for example, when a code translation parses an unknown code in
the extracted data.
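To make several of these transformation types concrete, here is a small Python
sketch combining column selection, null filtering, code translation, a derived
value, surrogate-key generation, and sorting (the column names and code
mappings are assumptions for illustration):

    GENDER_CODES = {"1": "M", "2": "F"}  # translating coded values

    def transform(records, next_key=1):
        out = []
        for r in records:
            if r.get("salary") is None:   # filtering: ignore records where salary = null
                continue
            out.append({
                "surrogate_key": next_key,                     # generated surrogate key
                "roll_no": r["roll_no"],                       # selecting only certain columns
                "gender": GENDER_CODES.get(r["gender"], "U"),  # code translation ("U" for unknown codes)
                "sale_amount": r["qty"] * r["unit_price"],     # derived calculated value
            })
            next_key += 1
        return sorted(out, key=lambda x: x["roll_no"])         # sorting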

Load:

The load phase loads the data into the end target, usually the data warehouse
(DW). Depending on the requirements of the organization, this process varies
widely. Some data warehouses may overwrite existing information with
cumulative information; updating of the extracted data is frequently done on a
daily, weekly, or monthly basis. Other DWs (or even other parts of the same
DW) may instead add new data in a historicized form.
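A brief sketch of the two loading styles (the target table names are
assumptions): overwriting in place versus appending in historicized form with
a load date:

    import sqlite3
    from datetime import date

    conn = sqlite3.connect("warehouse.db")

    def load_overwrite(rows):
        # Overwrite style: existing contents are replaced by the new extract.
        conn.execute("DELETE FROM current_sales")
        conn.executemany("INSERT INTO current_sales VALUES (?, ?)", rows)
        conn.commit()

    def load_historicized(rows):
        # Historicized style: new rows are appended, stamped with the load date.
        stamped = [(a, b, date.today().isoformat()) for (a, b) in rows]
        conn.executemany("INSERT INTO sales_history VALUES (?, ?, ?)", stamped)
        conn.commit()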

Dimensions in Data warehouse:


In a data warehouse, a dimension is a data element that categorizes each item in
a data set into non-overlapping regions. A data warehouse dimension provides the
means to "slice and dice" data in a data warehouse. Dimensions provide structured
labeling information to otherwise unordered numeric measures. For example,
"Customer", "Date", and "Product" are all dimensions that could be applied
meaningfully to a sales receipt. A dimensional data element is similar to a
categorical variable in statistics.
The primary function of dimensions is threefold: to provide filtering, grouping and
labeling. For example, in a data warehouse where each person is categorized as
having a gender of male, female or unknown, a user of the data warehouse would
then be able to filter or categorize each presentation or report by either filtering
based on the gender dimension or displaying results broken out by gender.
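For example, a minimal sketch (table and column names are assumptions) of the
two query shapes this enables, written as SQL strings:

    # Filtering: restrict a report to a single value of the gender dimension.
    FILTER_QUERY = "SELECT COUNT(*) FROM person_facts WHERE gender = 'F'"

    # Grouping and labeling: break results out by each value of the dimension.
    GROUP_QUERY = "SELECT gender, COUNT(*) FROM person_facts GROUP BY gender"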

Types of Dimensions:

 Conformed Dimension
 Junk Dimension
 Degenerate Dimension
 Role-Playing Dimension

Conformed Dimension:

Dimensions are conformed when they are either exactly the same (including keys)
or one is a perfect subset of the other. Most important, the row headers produced in
the answer sets from two different conformed dimensions must be able to match
perfectly.

Conformed dimensions are either identical or strict mathematical subsets of the
most granular, detailed dimension. Dimension tables are not conformed if the
attributes are labeled differently or contain different values. Conformed dimensions
come in several different flavors. At the most basic level, conformed dimensions
mean the exact same thing with every possible fact table to which they are joined.
The date dimension table connected to the sales facts is identical to the date
dimension connected to the inventory facts.

Junk Dimension:

A junk dimension is a convenient grouping of typically low-cardinality flags
and indicators. By creating an abstract dimension, these flags and indicators are
removed from the fact table while placing them into a useful dimensional
framework.
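As an illustration (the flag names are assumptions), a junk dimension for two
yes/no indicators would hold all four combinations, so the fact table carries
a single key instead of two flag columns:

    Junk_Key | Rush_Order_Flag | Gift_Wrap_Flag
    1        | N               | N
    2        | N               | Y
    3        | Y               | N
    4        | Y               | Y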

Degenerate Dimension:

A degenerate dimension is a dimension key, such as a transaction number,
invoice number, ticket number, or bill-of-lading number, that has no
attributes and hence does not join to an actual dimension table. Degenerate
dimensions are very common when the grain of a fact
table represents a single transaction item or line item because the degenerate
dimension represents the unique identifier of the parent. Degenerate dimensions
often play an integral role in the fact table's primary key.
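For example (columns assumed), an order-line fact table where the invoice
number sits directly in the fact table, with no dimension table to join to:

    Invoice_Number | Date_Key | Product_Key | Quantity
    INV-1001       | 20041222 | 55          | 3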

Role-Playing Dimension:
Dimensions are often recycled for multiple applications within the same database.
For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of
Delivery", or "Date of Hire". This is often referred to as a "role-playing dimension".

Slowly Changing Dimension (SCD):

Slowly Changing Dimensions (SCD) are dimensions that have data that slowly
changes.
For example, you may have a Dimension in your database that tracks the sales
records of your company's salespeople. Creating sales reports seems simple
enough, until a salesperson is transferred from one regional office to another. How
do you record such a change in your sales Dimension? You could create a second
salesperson record and treat the transferred person as a new sales person, but that
creates problems also.

Dealing with these issues involves SCD management methodologies referred to as
Type 0 through Type 6. Type 6 SCDs are also sometimes called Hybrid SCDs.

SCD Methodologies:

Type 0:

The Type 0 method is a passive approach to managing dimension value changes,
in which no action is taken. Values remain as they were at the time the
dimension record was first entered. In certain circumstances history is
preserved with a Type 0 SCD, but higher-order SCD types are often employed to
guarantee history preservation, whereas Type 0 provides the least (or no)
control over managing a slowly changing dimension.

Type 1:

The Type 1 methodology overwrites old data with new data, and therefore does not
track historical data at all. The obvious disadvantage to this method of managing
SCDs is that there is no historical record kept in the data warehouse. An
advantage, however, is that Type 1 dimensions are very easy to maintain.

Type 2:

The Type 2 method tracks historical data by creating multiple records for a
given natural key in the dimensional tables, each with a separate surrogate
key. With Type 2, we have unlimited history
preservation as a new record is inserted each time a change is made.

E.g., a table that keeps supplier information:
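A minimal starting point, reusing the assumed supplier columns from the Type 1
illustration:

    Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State
    123          | ABC           | Acme Supply Co | CA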


If the supplier moves to Illinois, the table would look like this:
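With an assumed version column, the old row is retained and a new row with a
new surrogate key is inserted:

    Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Version
    123          | ABC           | Acme Supply Co | CA             | 0
    124          | ABC           | Acme Supply Co | IL             | 1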

Another popular method for tuple versioning is to add effective date columns.
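For example, with assumed Start_Date and End_Date columns, where a NULL end
date (or a far-future sentinel date) marks the current row:

    Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Start_Date | End_Date
    123          | ABC           | Acme Supply Co | CA             | 2000-01-01 | 2004-12-22
    124          | ABC           | Acme Supply Co | IL             | 2004-12-22 | NULL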

Type 3:

The Type 3 method tracks changes using separate columns. Whereas Type 2 had
unlimited history preservation, Type 3 has limited history preservation, as it's
limited to the number of columns we designate for storing historical data. Where
the original table structure in Type 1 and Type 2 was very similar, Type 3 will add
additional columns to the tables:
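A sketch with assumed column names, keeping the prior state and the date of
the change alongside the current value:

    Supplier_Key | Supplier_Code | Supplier_Name  | Original_Supplier_State | Effective_Date | Current_Supplier_State
    123          | ABC           | Acme Supply Co | CA                      | 2004-12-22     | IL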

Note that this record cannot track all historical changes, such as when a supplier
moves twice.

Type 4:

The Type 4 method is usually just referred to as using "history tables", where one
table keeps the current data and an additional table is used to keep a record of
some or all changes.

Following the example above, the original table might be called Supplier and the
history table might be called Supplier_History.
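A sketch with assumed column names, continuing the supplier example:

    Supplier (current data only):
    Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State
    123          | ABC           | Acme Supply Co | IL

    Supplier_History:
    Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Create_Date
    123          | ABC           | Acme Supply Co | CA             | 2004-12-22
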
Type 6/Hybrid:

The Type 6 method is one that combines the approaches of types 1, 2 and 3 (1 + 2
+ 3 = 6).

The approach is to use a Type 1 slowly changing dimension, but to add an
additional pair of date columns indicating the date range for which a
particular row in the dimension applies, plus a flag indicating whether the
record is the current record.

This approach has a number of advantages:

 The user can choose to query using the current values of the dimensional
table by restricting the rows in the Dimension table using a filter to only
select current values
 Alternatively the user can use the "as at the time of the transaction" values
by using one of the date fields on the transaction as a constraint on the
dimension table.
 If there are a number of date columns on the transaction (e.g., Order Date,
Shipping Date, Confirmation Date), the user can choose which date to analyze
the fact data by, something not possible using other approaches.

This is how the Supplier table would look using Type 6 Slowly Changing
Dimensions:
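A sketch with assumed column names: the historical state is tracked Type 2
style with date ranges and a current-row flag, while the current-state column
is overwritten Type 1 style across all rows:

    Supplier_Key | Supplier_Code | Supplier_Name  | Current_State | Historical_State | Start_Date | End_Date   | Current_Flag
    123          | ABC           | Acme Supply Co | IL            | CA               | 2000-01-01 | 2004-12-22 | N
    124          | ABC           | Acme Supply Co | IL            | IL               | 2004-12-22 | NULL       | Y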

