
Chapter 7: Data Marts and Star Schema Design

Need for Data Marts

End users will get much better query performance from a data mart than from a data
warehouse.
End users will have a much easier time navigating through data marts.

Star Schema
A star schema is composed of two basic kinds of tables:
One fact table and
Multiple dimension tables.
The star schema is called a star because of its appearance, with the fact table as the
actual star and the dimension tables as the light rays emanating from it.
Figure. A simple star schema modeling sales information

PRODUCT
product_key
prod_nm
prod_line_nm
div_nm
flavor_nm
units_of_meas_nm
units_per_pkg
intro_date
add_date
update_date
active_fg

SALES
date_key (FK)
customer_key (FK)
product_key (FK)
invoice_number
units_sold
unit_price
unit_cost

CUSTOMER
customer_key
customer_name
street_addr
city
state
zip
add_dt
update_dt
active_fg

DT
date_key
dt
day_of_wk
month_cd
holiday_name
add_dt
update_dt
active_fg
Fact Table
The fact table contains the actual transactions or values being analyzed.

Dimension Table
The dimension table contains descriptive information about the transactions or values.
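As an illustration of how the two kinds of tables work together, here is a minimal sketch that builds a cut-down version of the star from the figure in an in-memory SQLite database. The column subset and the sample rows are assumptions made for the example, not data from the original design.

```python
import sqlite3

# In-memory SQLite database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (product_key  INTEGER PRIMARY KEY, prod_nm TEXT, prod_line_nm TEXT);
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, state TEXT);
CREATE TABLE dt       (date_key     INTEGER PRIMARY KEY, dt TEXT, month_cd TEXT);

-- The fact table points at every dimension through a foreign key.
CREATE TABLE sales (
    date_key     INTEGER REFERENCES dt(date_key),
    customer_key INTEGER REFERENCES customer(customer_key),
    product_key  INTEGER REFERENCES product(product_key),
    units_sold   INTEGER,
    unit_price   REAL
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO product VALUES (1, 'Cola 12oz', 'Soft Drinks')")
conn.execute("INSERT INTO product VALUES (2, 'Lemon Tea', 'Teas')")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Grocers', 'OH')")
conn.execute("INSERT INTO dt VALUES (1, '2000-07-01', '2000-07')")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 10, 1.5), (1, 1, 1, 6, 1.5), (1, 1, 2, 4, 2.0)],
)

# Measures come from the fact table, descriptive labels from the dimensions.
rows = conn.execute("""
    SELECT p.prod_line_nm, d.month_cd, SUM(s.units_sold)
    FROM sales s
    JOIN product p ON p.product_key = s.product_key
    JOIN dt d      ON d.date_key = s.date_key
    GROUP BY p.prod_line_nm, d.month_cd
    ORDER BY p.prod_line_nm
""").fetchall()
print(rows)  # [('Soft Drinks', '2000-07', 16), ('Teas', '2000-07', 4)]
```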
The Design Process
Fact Table Design
Each record in a fact table contains a primary key made up of a concatenation of foreign
keys to dimension tables and the facts or measures uniquely identified by that key.
People frequently (always?) refer to star schemas as denormalized, meaning that the
rules of normalization are intentionally ignored.
A denormalized structure is essentially a flat file. Is a fact table really denormalized? No. In fact, the
fact table is a highly normalized structure.

Each row consists of a number of attributes (that is, the measures) that are all
attributable to only one primary key. It's certainly in first normal form, as it has
no repeating groups.

It satisfies the conditions for second normal form in that all the attributes are
dependent on the whole primary key.

Finally, none of the attributes depend on non-key attributes. Thus, the
fact table is even in third normal form.

So, where is the denormalization? It's actually the star's dimension tables that are
denormalized, not the fact table.
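The idea of a primary key built from the concatenated foreign keys can be sketched with SQLite. This is a simplified assumption: the key here uses only the three dimension keys, whereas the figure's SALES table also carries an invoice number, and the integer key values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sales (
    date_key     INTEGER NOT NULL,
    customer_key INTEGER NOT NULL,
    product_key  INTEGER NOT NULL,
    units_sold   INTEGER,
    -- The primary key is the concatenation of the dimension foreign keys.
    PRIMARY KEY (date_key, customer_key, product_key)
)""")

conn.execute("INSERT INTO sales VALUES (701, 1714, 14, 16)")

# A second row with the same key combination violates the primary key,
# so every measure stays attributable to exactly one key.
try:
    conn.execute("INSERT INTO sales VALUES (701, 1714, 14, 3)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(duplicate_rejected, row_count)  # True 1
```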
The level of detail captured in a fact table is sometimes called the level of granularity, or
grain, of that table. It is recommended that you store data at the most detailed level
possible. While this can result in much higher disk requirements, capturing the finest
level of detail has a significant advantage.
It's always possible to re-create aggregate summary data from detail data, but you
can't create details from summaries.
The finest level of detail is frequently called the atomic level of detail. Just as you
can't split an atom into its component parts and still recognize the element they came
from, you can't break an atomic-level transaction apart any further and have it
maintain its identity as a transaction.
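A small sketch makes the point about grain concrete. The detail rows below are hypothetical; the helper `roll_up` is an illustrative name, not part of any library.

```python
from collections import defaultdict

# Hypothetical atomic-level rows: (month_cd, product_key, units_sold)
detail = [
    ("2000-07", 14, 16),
    ("2000-07", 14, 4),
    ("2000-07", 9, 5),
    ("2000-08", 14, 7),
]

def roll_up(rows, key_fn):
    """Sum units_sold for each group produced by key_fn."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += row[2]
    return dict(totals)

# Any summary can be derived from the detail rows...
by_month = roll_up(detail, lambda r: r[0])
by_product = roll_up(detail, lambda r: r[1])

print(by_month)    # {'2000-07': 25, '2000-08': 7}
print(by_product)  # {14: 27, 9: 5}
```

Starting from `by_month` alone, there is no way to recover `by_product`, which is why storing only the monthly totals would lose information.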
Dimension Table

Alone, a fact table is pretty useless. Dimension tables provide meaning to each fact.
What does it mean that we sold 16 units of product_key 14 to customer_key 1714 on
date_key 000701? It means nothing until we decode those foreign keys.
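Decoding those keys amounts to joining the fact row to its dimensions. In the sketch below the key values come from the text, while the product name, customer name, and calendar date behind them are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, prod_nm TEXT);
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dt       (date_key TEXT PRIMARY KEY, dt TEXT);
CREATE TABLE sales    (date_key TEXT, customer_key INTEGER,
                       product_key INTEGER, units_sold INTEGER);

-- Descriptive values are hypothetical; only the keys appear in the text.
INSERT INTO product  VALUES (14, 'Cherry Soda');
INSERT INTO customer VALUES (1714, 'Corner Market');
INSERT INTO dt       VALUES ('000701', '2000-07-01');
INSERT INTO sales    VALUES ('000701', 1714, 14, 16);
""")

# Join each foreign key to its dimension to give the fact a meaning.
decoded = conn.execute("""
    SELECT s.units_sold, p.prod_nm, c.customer_name, d.dt
    FROM sales s
    JOIN product p  ON p.product_key  = s.product_key
    JOIN customer c ON c.customer_key = s.customer_key
    JOIN dt d       ON d.date_key     = s.date_key
""").fetchone()
print(decoded)  # (16, 'Cherry Soda', 'Corner Market', '2000-07-01')
```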
Dimension Table Features
Some of the features of the dimension tables are:

Denormalized
Wide
Short
Use Surrogate Keys
Contains Links to Corresponding Records in Source Tables
Contains Additional Date and Active Flag Fields

Denormalized
Dimension tables are usually highly denormalized. Although people refer to star schema
as being denormalized, they are actually referring to the dimension tables, not to the fact
table.
Thus, all the information regarding each dimension element appears on a single record in
the dimension table. It's as though you took a normalized design that describes a product
and joined all the tables together to produce this single, denormalized dimension table. In
fact, this is generally how you build a dimension table.
When loading the dimension tables, we join many source tables together, to put the
results into one, flat, denormalized table. Each record in this table fully describes a
dimension element.
In the end, this helps end users' query performance because when users run their queries,
the work needed to join these tables together has already been done. In essence, we are
pre-joining tables, paying for query-time resource requirements with load-time
resources.
Remember that loading is usually done at night when no one needs access anyway.
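The pre-joining step can be sketched as a single load statement. Table and column names loosely follow the product figure; `src_product` and `product_dim` are hypothetical names chosen to keep the source and dimension tables apart, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized source tables (sample rows are made up).
CREATE TABLE division     (div_cd TEXT PRIMARY KEY, div_nm TEXT);
CREATE TABLE product_line (product_line_cd TEXT PRIMARY KEY, div_cd TEXT, prod_line_nm TEXT);
CREATE TABLE flavor       (flavor_cd TEXT PRIMARY KEY, flavor_nm TEXT);
CREATE TABLE src_product  (product_code TEXT PRIMARY KEY, prod_line_cd TEXT,
                           flavor_cd TEXT, product_name TEXT);

INSERT INTO division     VALUES ('BEV', 'Beverages');
INSERT INTO product_line VALUES ('SODA', 'BEV', 'Soft Drinks');
INSERT INTO flavor       VALUES ('CH', 'Cherry');
INSERT INTO src_product  VALUES ('P-100', 'SODA', 'CH', 'Cherry Soda');

-- Flat dimension table; product_key is assigned automatically by SQLite.
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    prod_cd      TEXT,
    prod_nm      TEXT,
    prod_line_nm TEXT,
    div_nm       TEXT,
    flavor_nm    TEXT
);

-- The load step performs the joins once, so user queries do not have to.
INSERT INTO product_dim (prod_cd, prod_nm, prod_line_nm, div_nm, flavor_nm)
SELECT p.product_code, p.product_name, pl.prod_line_nm, d.div_nm, f.flavor_nm
FROM src_product p
JOIN product_line pl ON pl.product_line_cd = p.prod_line_cd
JOIN division d      ON d.div_cd = pl.div_cd
JOIN flavor f        ON f.flavor_cd = p.flavor_cd;
""")

dim_rows = conn.execute("SELECT * FROM product_dim").fetchall()
print(dim_rows)
```

After the load, a query that needs the division name reads one row of `product_dim` instead of joining four tables.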
Wide
The dimension table is wider than most tables in traditional database applications. By
wide, we mean that it has a lot of columns. The more columns you put into your
dimension tables, the more descriptive they will be.
Short
Dimension tables are generally far shorter than fact tables; that is, they contain far fewer rows.

Figure. Dimension data in normalized (OLTP) and denormalized (dimension table) form

Normalized (OLTP)

PRODUCT
product_code
flavor_cd (FK)
units_of_measure (FK)
prod_line_cd (FK)
product_name
prod_line_nm
units_per_pkg
intro_date
add_date
update_date
active_fg

PRODUCT_LINE
product_line_cd
div_cd (FK)
prod_line_nm

DIVISION
div_cd
div_nm

FLAVOR
flavor_cd
flavor_nm

STD_UNITS
units_of_measure_cd
units_of_measure_nm

Denormalized (dimension table)

PRODUCT
product_code
prod_cd
prod_nm
prod_line_nm
div_nm
flavor_nm
units_of_meas_nm
units_per_pkg
intro_date
add_dt
update_dt
active_fg

Use Surrogate Keys


A surrogate key is a key that has no independent business meaning. Surrogate keys are
usually just a series of sequential numbers assigned to be the primary key for a table.
Surrogate keys are meaningless to end users and are never queried by them. In the case of fact
table-dimension table relationships, surrogate keys are used to provide efficient
joins between these tables.
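One possible way to hand out surrogate keys during a load is sketched below. `SurrogateKeyMap` and the source codes are hypothetical; real loads often rely on a database sequence instead.

```python
from itertools import count

class SurrogateKeyMap:
    """Assigns sequential integers to source-system codes on first sight."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}

    def key_for(self, source_code):
        # Reuse the existing surrogate key, or allocate the next number.
        if source_code not in self._keys:
            self._keys[source_code] = next(self._next)
        return self._keys[source_code]

product_keys = SurrogateKeyMap()

# Hypothetical source rows: (source product code, units sold)
source_sales = [("P-100", 16), ("P-205", 3), ("P-100", 8)]

# Fact rows carry only the compact surrogate key, not the business code.
fact_rows = [(product_keys.key_for(code), units) for code, units in source_sales]
print(fact_rows)  # [(1, 16), (2, 3), (1, 8)]
```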
Contains Links to Corresponding Records in Source Tables
Often overlooked in dimension table design is the fact that dimension tables contain links
to corresponding records in source tables. Notice that the dimension tables contain
references to the keys necessary to get to the source records in the source databases.
This serves a couple of purposes.

For one, it allows analysts to audit dimension tables to understand where their
data came from.

More importantly, these links aid the dimension table update process. When the
refresh process looks at the source tables for changes, having this link information
makes it quite easy to tell which source system records generated which
dimension table records.
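A rough sketch of how the source link helps a refresh: because each dimension record keeps its source code, new and changed source records can be found with a simple lookup. The data and the function name `diff_against_source` are assumptions for illustration.

```python
# Dimension records indexed by their source-system code (the stored link).
dimension = {
    "P-100": {"product_key": 1, "prod_nm": "Cherry Soda"},
    "P-205": {"product_key": 2, "prod_nm": "Lemon Tea"},
}

# Current state of the hypothetical source table: code -> product name.
source = {
    "P-100": "Cherry Soda",
    "P-205": "Lemon Iced Tea",
    "P-300": "Ginger Ale",
}

def diff_against_source(dimension, source):
    """Return (new codes, changed codes) relative to the dimension table."""
    new = sorted(code for code in source if code not in dimension)
    changed = sorted(
        code
        for code, name in source.items()
        if code in dimension and dimension[code]["prod_nm"] != name
    )
    return new, changed

new_codes, changed_codes = diff_against_source(dimension, source)
print(new_codes, changed_codes)  # ['P-300'] ['P-205']
```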

Contains Additional Date and Active Flag Fields

It's frequently useful to have some administrative fields on your records. These include
fields indicating when each record was added (the add date field) and when it was last
updated (the update date field).
The active flag field can be quite useful in the dimension table.
Storing dimension history makes it possible for a given dimension element to have
multiple rows in a dimension table. The active flag, in this case, indicates which row holds the
current definition of that dimension element.
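The following sketch shows one way the administrative fields could be maintained when a dimension element changes. The `customer_cd` source link, the `record_change` helper, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key
    customer_cd  TEXT,                 -- hypothetical link to the source record
    city         TEXT,
    add_dt       TEXT,
    update_dt    TEXT,
    active_fg    TEXT
)""")
conn.execute(
    "INSERT INTO customer (customer_cd, city, add_dt, update_dt, active_fg) "
    "VALUES ('C-17', 'Dayton', '2000-01-15', '2000-01-15', 'Y')"
)

def record_change(conn, customer_cd, new_city, change_dt):
    """Keep history: retire the current row, then add a new active row."""
    conn.execute(
        "UPDATE customer SET active_fg = 'N', update_dt = ? "
        "WHERE customer_cd = ? AND active_fg = 'Y'",
        (change_dt, customer_cd),
    )
    conn.execute(
        "INSERT INTO customer (customer_cd, city, add_dt, update_dt, active_fg) "
        "VALUES (?, ?, ?, ?, 'Y')",
        (customer_cd, new_city, change_dt, change_dt),
    )

record_change(conn, "C-17", "Columbus", "2000-07-01")

history = conn.execute(
    "SELECT customer_key, city, active_fg FROM customer ORDER BY customer_key"
).fetchall()
current = conn.execute(
    "SELECT city FROM customer WHERE customer_cd = 'C-17' AND active_fg = 'Y'"
).fetchall()
print(history)  # [(1, 'Dayton', 'N'), (2, 'Columbus', 'Y')]
print(current)  # [('Columbus',)]
```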
Identifying Dimension Elements and Fact Table Elements
By plugging the names of data elements into the appropriate places in the following sentence,
we can identify whether each is a dimension element or a fact table element:
My user needs to see <data element>, <data element>, and <> broken down by
<data element>, <data element>, and <>.
If the data element in question falls before the words "broken down by", then it is most likely a
measure. Otherwise, it is likely an element of a dimension.
Snowflake Schema: A Variation on the Star Schema
A snowflake schema is a variation on the star schema. The snowflake schema is just like
a star with normalized dimension tables.
Summary Tables
In a data mart environment, summary tables are aggregated versions of atomic-level fact
tables.
They are used to improve response times for end-user queries. While executive-level users
could run their queries against the atomic-level fact table, the queries will run much
faster if they can run against already aggregated data.

For example, we might build a table that contains summaries of the sales fact records but
only at month level granularity. The SQL to create this summary table might look
something like the following:
-- One row per customer, product, month, price, and cost combination
create table sales_month_sum as
select sales.customer_key, sales.product_key, dt.month_cd,
       sum(sales.units_sold) units_sold, sales.unit_price, sales.unit_cost
from sales, dt
where sales.date_key = dt.date_key
group by sales.customer_key, sales.product_key, dt.month_cd,
         sales.unit_price, sales.unit_cost;

Materialized Views
You can create summary tables either by creating the tables manually or by using Oracle's
materialized view functionality.
Materialized view is another term for Oracle's snapshot feature. Implementing your
aggregates as materialized views has a few advantages.

First, it allows you to use Oracle8i's query rewrite capabilities.


Second, it allows you to push the work of maintaining them onto Oracle's
shoulders and off of yours.

Two Rules for Determining Which Summaries to Create


There are two key rules you should consider when determining what summaries to create:
The summaries you create must be used, and
They must be dense.
In most data marts, there are a huge number of potential summary tables that could be
created. But keep in mind that there is no value in creating most of them. Don't create
summary tables if you can't envision which queries will access those tables.
The idea that many records can be collapsed into one summary record is sometimes
referred to as density. We generally adhere to a 25 percent rule: summary tables that are
25 percent (or less) of the size of the original table are probably worth creating. If not,
you should consider other performance-enhancing techniques.
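The 25 percent rule reduces to a simple ratio check. The function names and row counts below are illustrative assumptions; size is approximated here by row count.

```python
def density_ratio(detail_rows, summary_rows):
    """Size of the summary relative to the table it summarizes."""
    if detail_rows <= 0:
        raise ValueError("detail_rows must be positive")
    return summary_rows / detail_rows

def worth_creating(detail_rows, summary_rows, threshold=0.25):
    """Apply the 25 percent rule of thumb described above."""
    return density_ratio(detail_rows, summary_rows) <= threshold

# Hypothetical row counts for a 1,000,000-row fact table.
print(worth_creating(1_000_000, 200_000))  # True: summary is 20% of the detail
print(worth_creating(1_000_000, 600_000))  # False: summary is 60% of the detail
```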
Common Design Complexities
Slowly Changing Dimensions: In a slowly changing dimension, rather than
changing the data in the dimension record, we add another record to hold the changed
information.
Nonadditive Facts: Nonadditive facts are those facts or measures that can't be
summarized with a simple sum aggregate statement.
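As an illustration of a nonadditive fact, consider a unit price: adding prices across rows gives a meaningless number, so an average price has to be derived from additive quantities instead. The rows below are hypothetical.

```python
# Hypothetical fact rows: (units_sold, unit_price)
sales = [(10, 2.00), (30, 1.00)]

# Summing the price column produces a number with no business meaning.
naive_price_sum = sum(price for _, price in sales)

# Units and revenue are additive, so aggregate those and derive the price.
total_units = sum(units for units, _ in sales)
total_revenue = sum(units * price for units, price in sales)
average_price = total_revenue / total_units

print(naive_price_sum)  # 3.0
print(average_price)    # 1.25
```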

Dimensional Warehouses and Marts Containing Multiple Fact Tables


When stars within a data mart need access to common dimension tables, these tables
should be shared between these stars. These are frequently referred to as conformed
dimensions.
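A minimal sketch of two stars sharing one dimension follows. The `shipments` fact table and all sample rows are hypothetical; only `product` and `sales` correspond to tables named earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared product dimension serves both fact tables.
CREATE TABLE product   (product_key INTEGER PRIMARY KEY, prod_nm TEXT);
CREATE TABLE sales     (product_key INTEGER, units_sold INTEGER);
CREATE TABLE shipments (product_key INTEGER, units_shipped INTEGER);

INSERT INTO product   VALUES (14, 'Cherry Soda');
INSERT INTO sales     VALUES (14, 16), (14, 4);
INSERT INTO shipments VALUES (14, 18);
""")

# Aggregate each fact table separately, then line the results up
# through the shared dimension key.
combined = conn.execute("""
    SELECT p.prod_nm, s.units_sold, sh.units_shipped
    FROM product p
    JOIN (SELECT product_key, SUM(units_sold) AS units_sold
          FROM sales GROUP BY product_key) s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(units_shipped) AS units_shipped
          FROM shipments GROUP BY product_key) sh
      ON sh.product_key = p.product_key
""").fetchall()
print(combined)  # [('Cherry Soda', 20, 18)]
```

Because both stars use the same `product_key` values, results from the two fact tables can be compared side by side without any key translation.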
In some organizations, all the data from the normalized warehouses flows into one
monolithic data mart. They then refer to this mart as their dimensional data warehouse
and frequently refer to their normalized warehouses as staging areas.
In this arrangement, users may query directly against this dimensional data warehouse or
against limited-focus data marts that are sourced from this dimensional warehouse.
