
Introduction to Data Warehousing

Topics Covered

OLTP (Online Transaction Processing) System


OLTP Data Nature
OLTP Shortcomings
DWH Emergence
What is Data Warehouse?
Data Warehouse vs. Operational Systems
DWH Characteristics and Attributes
Features of Data Warehouse
Elements of Data Warehouse
Method of Development - Operational Systems
Method of Development - Data Warehouse
DWH - Architecture

Topics Covered

Data Modeling Techniques


Dimension Modeling
Star Schema
Snowflake Schema
Design Principle
Data Marts
Metadata
Surrogate Keys
Types of Facts
Slowly Changing Dimension
Conformed Dimension
Factless Fact
Data Warehouse (Do's & Don'ts)
On Line Analytical Processing
ROLAP/MOLAP/HOLAP
Data Mining
ETL/OLAP/Data Mining Tools

OLTP (Online Transaction Processing) System:


Improves operational efficiency
Produces daily or monthly reports used by middle and lower management
Keeps detailed information

OLTP Data Nature

High volume
Changes with time
Only current data available
Answers simple queries
Of little help to decision makers

OLTP Shortcomings:
Focuses on transactions
Holds a large amount of data, but all of it tied to individual transactions
Does not maintain historical data
Does not maintain summarized data
Does not support analytical reporting

DWH Emergence:

Management is more information conscious
Desktop computing power is increasing
Hardware prices are decreasing
Increasing power of server software
Explosion of the Internet
End users are more technology savvy

What is a Data Warehouse?


A Data Warehouse is a structured, extensible environment designed
for the analysis of non-volatile data, logically and physically
transformed from multiple source applications to align with the
business structure, updated and maintained for a long period,
expressed in simple business terms, and summarized for quick
analysis.
A collection of corporate information, derived directly from
operational systems and some external data.
Also defined as a subject-oriented, integrated, non-volatile, time-variant
(historical) collection of data designed to address DSS needs.
The purpose of a Data Warehouse is to support business decisions,
not business operations.

Data Warehouse vs. Operational Systems

A Data Warehouse is developed incrementally (the time taken to deliver
the benefits is long).
Operational Systems are primarily concerned with the handling of a
single transaction. They deal with pre-defined events and hence
require fast access.
Each transaction deals with a small amount of data.
A Data Warehouse deals with large amounts of data, which are
aggregate in nature.
Contd

Since the usage patterns of the Data Warehouse and the Operational
System are different, data from the Operational System and the Data
Warehouse should not be mixed.
Time sensitivity of data:
Operational Systems require current data
Data Warehouses require historical data

DWH Characteristics and Attributes


A Data Warehouse is
Subject Oriented
Integrated
Time-variant
Non-volatile
Summarized

10

DWH Characteristics and Attributes


Subject Oriented
The Data Warehouse world is oriented around the major subjects of
the enterprise, such as Customer, Vendor, Product. On the contrary,
the Operational world is designed around applications
such as loans, savings, pensions, insurance etc.
Integrated
Data found in the Warehouse is integrated, always and with no
exception.
Data is integrated in terms of:
Naming conventions
Consistent measurement of variables
Encoding structures etc.

11

DWH Characteristics and Attributes


Time-variant
Data found in the warehouse is time variant (time series). In other
words, the warehouse holds data over a long time horizon, typically
five to ten years. The time horizon for an Operational system is 60-90 days.
The DWH may not have the most current information.
Nonvolatile
Data is loaded into the warehouse and after that the data in the
warehouse does not change. The Data Warehouse is updated as a
batch process and no online updates are allowed.
The following kinds of operations occur in data warehousing:
Initial loading of data
Access of data
Periodic addition of data

12

DWH Characteristics and Attributes


Summarized
In a Data Warehouse, summary views and aggregates
of the operational data are kept so as to provide faster retrieval of
aggregated information.

13

Features of Data Warehouse

Repository of information
Improved access to integrated data
Provides historical perspective
Variety of end-users use it for different purposes
Requires a major system integration effort
Reduces the reporting and analysis impact on operational systems

14

Elements of Data Warehouse


Source:
Flat files
Source Database
Any other form
Data Staging Area: Intermediate Area
Target:
Database which holds the Data Warehouse or Data Mart

15

Method of Development - Operational Systems

Define requirements
Analysis and Planning based on requirements
Model (E-R Model)
Physical Design
Development (Coding)
Quality Assurance and User Acceptance
Implementation

16

Method of Development - Data Warehouse (uses an iterative development methodology)

Subject Definition
Data Identification or Data Discovery
Data Acquisition
Data Cleansing
Data Transformation
Data Loading
Exploitation

17

Method of Development - Data Warehouse


Subject Definition
What do I want to analyze ?
What would be the Dimensions ?
Steps

Logical Concept
Build logical data model
Develop transformational model
Translate logical model to physical model

Data Identification or Data Discovery


How can I get what I want to analyze?
Where is the needed information/data stored?

18

Method of Development - Data Warehouse


Data Acquisition
Extracting data from RDBMS/DBMS/flat files

Data Cleansing
Removal of inconsistent data
Removal of Unwanted Data
Removing Extreme Cases (data-mining)

Data Transformation
Convert to a consistent, business-oriented format.
Generate derived information not stored in OLTP systems.
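
A minimal ETL sketch in Python of the three steps above, assuming a hypothetical flat file daily_sales.csv with customer_id, gender, amount and discount_pct columns (none of these names come from the slides):

import csv

def extract(path):
    # Acquisition: pull raw rows from a flat-file source (could equally be an RDBMS query).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def cleanse(rows):
    # Cleansing: drop rows with missing keys or non-numeric / negative amounts.
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue
        if row.get("customer_id") and amount >= 0:
            clean.append(row)
    return clean

def transform(rows):
    # Transformation: consistent encoding plus a derived measure not stored in the OLTP system.
    for row in rows:
        row["gender"] = {"m": "MALE", "f": "FEMALE"}.get(row.get("gender", "").lower(), "UNKNOWN")
        row["net_amount"] = float(row["amount"]) * (1 - float(row.get("discount_pct") or 0) / 100)
    return rows

if __name__ == "__main__":
    staged = transform(cleanse(extract("daily_sales.csv")))
    print(len(staged), "rows ready for loading")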

19

Method of Development - Data Warehouse


Data Transformation
Two steps are involved in this process.
Integration and Conversion
Consistent Naming Conventions
Consistent Encoding Structures
Summarization
Keeps summarized Information
Reduces the volume of data to be processed

20

Method of Development - Data Warehouse


Loading the Warehouse
Periodic loading from OLTP environment.
Provides a time variant attribute to data.

Exploitation
Enables the users to view, analyze and report on data
Simple query and reporting
Multidimensional analysis
OLAP using Slice and Dice, Drilling

21

DWH - Architecture
Major Components
Data identification
Cleanup
Extraction, Transformation and loading tools
Metadata repository
Data Marts
Data query, reporting, analysis and mining tools
Data Warehouse administration and management

22

Data Warehouse
Advantages of DWH:
There are many advantages to using a data
warehouse, some of which are:
Data warehouses enhance end-user access
to a wide variety of data.
Decision support system users can obtain
specified trend reports, e.g. the item with
the most sales in a particular area within
the last two years.
Data warehouses can be a significant
enabler of commercial business
applications, particularly customer
relationship management (CRM) systems.
23

Business Intelligence:
Business intelligence (BI) is a broad category of applications and
technologies for gathering, storing, analyzing, and providing access to
data to help enterprise users make better business decisions. BI
applications include the activities of decision support systems, query
and reporting, online analytical processing (OLAP), statistical analysis,
forecasting, and data mining. Business intelligence applications can be:

Mission-critical and integral to an enterprise's operations, or occasional to
meet a special requirement
Enterprise-wide or local to one division, department, or project
Centrally initiated or driven by user demand

The term business intelligence (BI) dates to 1958.[1] It refers to
technologies, applications, and practices for the collection, integration,
analysis, and presentation of business information, and also sometimes
to the information itself.
BI systems provide historical, current, and predictive views of business
operations, most often using data that has been gathered into a data
warehouse or a data mart and occasionally working from operational
data. Software elements support reporting, interactive "slice-and-dice"
pivot-table analyses, visualization, and statistical data mining.
Applications tackle sales, production, financial, and many other sources
of business data for purposes that include, notably, business
performance management.

24

DWH - Architecture
[Architecture diagram: feeder systems FS1 ... FSn and legacy systems transmit data across the network into a staging area, where extraction, cleansing, transformation, aggregation and summarization take place; the data then flows into the ODS and the Data Warehouse, populates data marts DM1, DM2 ... DMn, and is consumed by OLAP analysis and knowledge-discovery tools. A metadata layer spans all components.]
25

Data Modeling Techniques


ER Modeling is based on the entities and the relationships between
those entities. The ER model is an abstraction tool because it can be
used to understand and simplify the ambiguous data relationships in the
business world.
Dimension Modeling uses three basic concepts:
Measures
Facts
Dimensions
Dimension Modeling is powerful in representing the requirements of the
business user in the context of database tables.

26

Dimension Modeling

A Measure is a numeric attribute of a fact, representing the performance
or behavior of the business relative to the dimensions.
Dimensions are the parameters over which we want to perform Online
Analytical Processing.
Dimension hierarchies enable dimensions to be arranged into one or
more hierarchies. Each hierarchy can also have multiple hierarchy
levels.
A Fact is a logical collection of related measures and their associated
dimensions. In a Data Warehouse, facts are implemented in
the core tables in which all the numeric data is stored.

27

Dimension Modeling
The business model translates into a specific design called the
DIMENSIONAL MODEL (also called the STAR MODEL).
The outcome of the DIMENSIONAL MODEL is the STAR SCHEMA
or the SNOWFLAKE SCHEMA.

28

Star Schema
Attributes
A single fact table, with detail and summary data
Fact table primary key has only one key column per dimension
Each dimension is a single table, highly de-normalized
Benefits
Easy to understand, easy to define hierarchies, reduces number
of physical joins, low maintenance, very simple metadata.
Drawbacks
Summary data in the fact table yields poorer performance for
summary levels; huge dimension tables can be a problem.
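
As a sketch, the star schema above can be expressed as plain DDL; here it is created in an in-memory SQLite database from Python, with illustrative table and column names for a retail sales subject:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

-- A single fact table; its composite key has exactly one key column per dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL,
    units_sold   INTEGER,
    PRIMARY KEY (date_key, store_key, product_key)
);
""")
print("star schema created")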

29

Star Schema
A Group of Facts connected to Multiple Dimensions

[Diagram: a Financial Transactions fact table connected to the Channel, Time, Customer, Organization, and Product dimensions]

30

Snowflake Schema

The snowflake schema is an extension of the star schema, where


each point of the star explodes into more points.
Snowflake schemas normalize dimensions to eliminate redundancy.
That is, the dimension data has been grouped into multiple tables
instead of one large table.
While this saves space, it increases the number of dimension tables
and requires more foreign key joins. The result is more complex
queries and reduced query performance.
Usage: Whether one uses a star or a snowflake largely depends on
personal preference and business needs.
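
A sketch of the snowflaked version of a product dimension (illustrative names): the category attributes move into their own table, so a report by category needs one extra join compared with the flat star-schema dimension.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT,
                           category_key INTEGER REFERENCES dim_category(category_key));
CREATE TABLE fact_sales   (date_key INTEGER, product_key INTEGER REFERENCES dim_product(product_key),
                           sales_amount REAL);
""")
# Reporting by category now requires fact_sales -> dim_product -> dim_category.
print("snowflaked product dimension created")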

31

Snow-Flake Schema
Snow-flake Schema (= Extended Star Schema)
A Group of Facts connected to Dimensions, which are split across
multiple hierarchies and attributes
[Diagram: a Financial Transactions fact table connected to the Time, Product, Channel, Organization, and Customer dimensions, with snowflaked tables such as Customer Segment and Geography]

32

Design Principle

The first step in design is to decide what business process(es) to


model, by combining an understanding of the business with an
understanding of what data is available.
The second step in the design is to decide on the grain of the fact
table in each business process.

33

Design Principle
Designing a Fact Table.
The first step in designing a fact table is to determine the granularity of
the fact table. By granularity, we mean the lowest level of information that
will be stored in the fact table. This constitutes two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information
will be kept.
Which Dimensions To Include
Determining which dimensions to include is usually a straightforward
process, because business processes will often dictate clearly what the
relevant dimensions are. The determining factors usually go back to the
requirements.
For example, in an off-line retail world, the dimensions for a sales fact
table are usually time, geography, and product. This list, however, is by
no means a complete list for all off-line retailers.

34

What Level Within Each Dimension To Include


Determining at which level of the hierarchy the information is stored along
each dimension is a bit trickier. This is where user requirements
(both stated and possibly future) play a major role.
In the above example, will the supermarket want to do analysis
at the hourly level (i.e., looking at how certain products may
sell by different hours of the day)? If so, it makes sense to use 'hour' as
the lowest level of granularity in the time dimension.
If daily analysis is sufficient, then 'day' can be used as the lowest level
of granularity.
Note that sometimes the users will not specify certain requirements,
but based on industry knowledge, the data warehousing team may
foresee that certain requirements will be forthcoming that may result in
the need for additional details. In such cases, it is prudent for the data
warehousing team to design the fact table such that lower-level
information is included. This will avoid possibly needing to re-design
the fact table in the future. On the other hand, trying to anticipate all
future requirements is an impossible task.

35

Design Principle

A Data Warehouse almost always demands data expressed at the
lowest possible grain of each dimension, not because queries need
to cut through the database in very precise ways, but...
Efforts to normalize any of the tables in a dimensional database
solely in order to save disk space are a waste of time.
Dimension tables must not be normalized but should remain as
flat tables. Normalized dimension tables destroy the ability to
browse, and the disk space savings gained by normalizing the
dimension tables are typically a tiny fraction of the total disk
space needed for the overall schema.

36

Design Principle

The use of pre-stored summaries (aggregates) is the single most


effective tool the data warehousing designer has to control
performance.
Each type (grain) of aggregate should occupy its own fact table,
and should be supported by the proper set of dimension tables
containing only those dimensional attributes that are defined for that
grain of aggregate.
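
A small sketch of a pre-stored aggregate, using a few illustrative daily sales rows: the daily grain is rolled up to a (month, store, product) grain that would occupy its own aggregate fact table.

from collections import defaultdict

daily_fact = [  # (date, store, product, sales_amount)
    ("2023-01-05", "S1", "P1", 100.0),
    ("2023-01-17", "S1", "P1", 150.0),
    ("2023-02-03", "S1", "P1", 200.0),
]

monthly_fact = defaultdict(float)
for date, store, product, amount in daily_fact:
    month = date[:7]                      # the grain of this aggregate is the month
    monthly_fact[(month, store, product)] += amount

for key, total in sorted(monthly_fact.items()):
    print(key, total)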

37

Data Marts
What is a Data Mart?
It is a subset of Data Warehouse with a specific purpose in mind.
The key to a successful Data Warehouse lies in getting a data mart in place
as soon as possible rather than implementing the entire Data Warehouse
initiative in one go.

38

Metadata
What is Metadata?
Data about data.
Metadata repository / document gives detailed description of the
source, structure, content and attributes of the data warehouse.
Metadata is created using data modeling tools, ETL tools (e.g. Oracle
Warehouse Builder, Informatica) or manually.

39

Surrogate Keys
A surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used
for the primary key to the table. The only requirement for a surrogate
primary key is that it is unique for each row in the table.
Some tables have columns such as AIRPORT_NAME or
CITY_NAME which are stated as the primary keys (according to the
business users), but not only can these change, indexing on a
numerical value is probably better, and you could consider creating a
surrogate key called, say, AIRPORT_KEY. This would be internal to
the system and, as far as the client is concerned, you may display only
the AIRPORT_NAME.
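
A minimal sketch of surrogate key assignment for the AIRPORT_KEY example above; the in-memory counter and lookup map stand in for whatever key-generation mechanism the warehouse actually uses.

from itertools import count

_next_key = count(1)
_key_map = {}                     # natural key (AIRPORT_NAME) -> surrogate key (AIRPORT_KEY)

def airport_key(airport_name):
    # Return a stable surrogate key for the natural key, generating a new one if unseen.
    if airport_name not in _key_map:
        _key_map[airport_name] = next(_next_key)
    return _key_map[airport_name]

print(airport_key("Heathrow"))    # 1
print(airport_key("Changi"))      # 2
print(airport_key("Heathrow"))    # 1 again -- the surrogate key never changes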

40

Surrogate Key
Pros
Surrogate Keys never need changing
Save space
Improve query performance
Cons
Overhead in the key generation process
The user cannot understand the key, and therefore the table, on its own.
If new developers take over, they will also have to figure out the keys.

41

Types of Facts
There are three types of facts:
Additive: Additive facts are facts that can be summed up through all of
the dimensions in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up
for some of the dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up
for any of the dimensions present in the fact table.

42

Example Additive:
Fact table (Retailer) with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product
in each store on a daily basis. Sales_Amount is the fact. In this case,
Sales_Amount is an additive fact, because you can sum up this fact
along any of the three dimensions present in the fact table -- date,
store, and product. For example, the sum of Sales_Amount for all 7
days in a week represents the total sales amount for that week.
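
A sketch of that additive behaviour with a few illustrative rows of the (Date, Store, Product, Sales_Amount) fact table: the same Sales_Amount column can be summed along any of the three dimensions.

from collections import defaultdict

fact_rows = [  # (date, store, product, sales_amount)
    ("2023-01-02", "Store A", "Widget", 100.0),
    ("2023-01-02", "Store B", "Widget", 80.0),
    ("2023-01-03", "Store A", "Gadget", 50.0),
]

def total_by(dimension_index):
    # Summing Sales_Amount is valid no matter which dimension we group by.
    totals = defaultdict(float)
    for row in fact_rows:
        totals[row[dimension_index]] += row[3]
    return dict(totals)

print(total_by(0))   # by date
print(total_by(1))   # by store
print(total_by(2))   # by product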

43

Example Semi-Additive/Non-Additive:
Fact table (bank) with the following columns:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each
account at the end of each day, as well as the profit margin for each
account for each day. Current_Balance and Profit_Margin are the
facts. Current_Balance is a semi-additive fact, as it makes sense to add
balances up across all accounts (what's the total current balance for all
accounts in the bank?), but it does not make sense to add them up
through time (adding up all current balances for a given account for
each day of the month does not give us any useful information).
Profit_Margin is a non-additive fact, for it does not make sense to add
it up at either the account level or the day level.
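
A sketch of the same bank example with illustrative rows: summing Current_Balance across accounts for a single day is meaningful, but summing one account across days is not, so the latest snapshot is used instead.

fact_rows = [  # (date, account, current_balance)
    ("2023-01-01", "A1", 500.0), ("2023-01-01", "A2", 300.0),
    ("2023-01-02", "A1", 450.0), ("2023-01-02", "A2", 350.0),
]

# Valid: total balance across all accounts on a given day.
total_jan_02 = sum(balance for date, account, balance in fact_rows if date == "2023-01-02")
print("Total balance on 2023-01-02:", total_jan_02)

# Not valid: adding A1's balances over time; take the latest snapshot instead.
a1_latest = max((row for row in fact_rows if row[1] == "A1"), key=lambda row: row[0])[2]
print("A1 end-of-period balance:", a1_latest)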

44

Types of Fact Tables


Based on the classifications, there are two types of fact tables:
Cumulative: This type of fact table describes what has happened over
a period of time. For example, this fact table may describe the total
sales by product by store by day. The facts for this type of fact tables
are mostly additive facts. The first example presented here is a
cumulative fact table.
Snapshot: This type of fact table describes the state of things in a
particular instance of time, and usually includes more semi-additive
and non-additive facts. The second example presented here is a
snapshot fact table.

45

Slowly Changing Dimension


The "Slowly Changing Dimension" problem is a common one particular to
data warehousing. In a nutshell, this applies to cases where the attribute for
a record varies over time
Example:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois,
so the original entry in the customer lookup table has the following record:

Cust_Key   Name        State
1001       Christina   Illinois

At a later date, in January 2003, she moved to Los Angeles, California.
How should ABC Inc. now modify its customer table to reflect this change?
This is the "Slowly Changing Dimension" problem.

46

Solving a Slowly Changing Dimension


There are in general three ways to solve this type of problem, and they
are categorized as follows:
Type 1: The new record replaces the original record. No trace of the
old record exists.
Type 2: A new record is added into the customer dimension table.
Therefore, the customer is treated essentially as two people.
Type 3: The original record is modified to reflect the change.

47

Type 1

Cust_Key   Name        State
1001       Christina   California

In Type 1 Slowly Changing Dimension, the new information simply
overwrites the original information, so the table holds only the record shown above.
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem,
since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace
back in history. For example, in this case, the company would not be able to
know that Christina lived in Illinois before.
When to use Type 1:
Type 1 Slowly Changing Dimension should be used when it is not necessary for
the data warehouse to keep track of historical changes.

48
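
A sketch of Type 1 handling on an illustrative in-memory customer table: the row is overwritten in place, so no history survives.

customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}

def scd_type1_update(cust_key, **new_values):
    # The new attribute values simply replace the old ones for the same key.
    customer_dim[cust_key].update(new_values)

scd_type1_update(1001, state="California")
print(customer_dim)   # {1001: {'name': 'Christina', 'state': 'California'}}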

Type 2
Cust_Key   Name        State
1001       Christina   Illinois
1010       Christina   California

In Type 2 Slowly Changing Dimension, a new record is added to the table to
represent the new information. Therefore, both the original and the new record
will be present. The new record gets its own primary key.
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of
rows for the table is very high to start with, storage and performance can
become a concern.
- This necessarily complicates the ETL process.
When to use Type 2:
- Type 2 Slowly Changing Dimension should be used when it is necessary for the
data warehouse to track historical changes.
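
A sketch of Type 2 handling on an illustrative in-memory table: a new row with its own surrogate key is appended; the current-row flag used here is a common variant that is not shown on the slide.

from itertools import count

_next_key = count(1010)            # next available surrogate key (illustrative)
customer_dim = [
    {"cust_key": 1001, "name": "Christina", "state": "Illinois", "current": True},
]

def scd_type2_update(name, **new_values):
    # Close off the existing current row and append a new one with a new surrogate key.
    for row in customer_dim:
        if row["name"] == name and row["current"]:
            row["current"] = False
    customer_dim.append({"cust_key": next(_next_key), "name": name,
                         "current": True, **new_values})

scd_type2_update("Christina", state="California")
for row in customer_dim:
    print(row)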

49

Type 3

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular
attribute of interest, one indicating the original value, and one indicating the current value. There
will also be a column that indicates when the current value becomes active.
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key, Name, Original State, Current State, Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we
have the following table (assuming the effective date of change is January 15, 2003):

Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003

Advantages:
- This does not increase the size of the table, since the new information is updated in place.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information will
be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 Slowly Changing Dimension should only be used when it is necessary for the data
warehouse to track historical changes, and when such changes will only occur a finite number
of times.
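
A sketch of Type 3 handling on an illustrative in-memory table: the row keeps an original and a current value plus an effective date, so only one previous value survives.

import datetime

customer_dim = {1001: {"name": "Christina",
                       "original_state": "Illinois",
                       "current_state": "Illinois",
                       "effective_date": None}}

def scd_type3_update(cust_key, new_state, effective_date):
    # Only the current value and effective date change; the original value is kept,
    # but any intermediate value is lost on the next update.
    row = customer_dim[cust_key]
    row["current_state"] = new_state
    row["effective_date"] = effective_date

scd_type3_update(1001, "California", datetime.date(2003, 1, 15))
print(customer_dim[1001])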

50

Conformed Dimension
A conformed dimension is a single, coherent view of the same piece of
data throughout the organization. The same dimension is used in all
subsequent star schemas defined. This enables reporting across the
complete data warehouse in a simple format.
A conformed dimension is a set of data attributes that have been
physically implemented in multiple data marts using the same
structure, attributes, domain values, definitions and concepts in each
implementation.

51

Conformed Dimension

Unlike in operational systems where data redundancy is normally


avoided, data replication is expected in the Data Warehouse world. To
provide fast access and intuitive "drill down" capabilities of data
originating from multiple operational systems, it is often necessary to
replicate dimensional data in Data Warehouses and in Data Marts.

For example, the Calendar dimension is commonly needed in most data
marts. By making this Calendar dimension adhere to a single structure,
regardless of which data mart it is used in within your organization, you can
query by date/time from one data mart to another. Conformed
dimensions promote flexibility in your querying while supporting the
benefits of ease of query and departmental subject areas that the star-schema
approach affords.

52

Factless Fact
A factless fact table is a fact table that does not contain numeric additive
values; it is composed exclusively of keys.
There are two types of factless fact tables:
Event-tracking
Coverage.

53

Factless Fact Event Tracking


Event tracking records and tracks events that have occurred, such as
college student class attendance.

Consider a factless fact table for recording student attendance on a daily basis at
a college. The five dimension tables contain rich descriptions of dates,
students, courses, teachers, and facilities. There are no additive, numeric
facts.
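
A sketch of such an event-tracking factless fact table with illustrative keys: each attendance row is nothing but dimension keys, and analysis is done by counting rows.

from collections import Counter

# (date_key, student_key, course_key, teacher_key, facility_key) -- keys only, no measures
attendance_fact = [
    (20230901, 1, 10, 100, 5),
    (20230901, 2, 10, 100, 5),
    (20230902, 1, 11, 101, 6),
]

# "How many attendances per course?" is answered simply by counting rows.
attendance_per_course = Counter(row[2] for row in attendance_fact)
print(attendance_per_course)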

54

Factless Fact - Coverage


Coverage factless fact tables support the dimensional model when the primary
fact table is sparse, for example, a sales promotion factless table.

A factless coverage table is used in conjunction with an ordinary sales fact
table to answer the question, "Which products were on promotion that did
not sell?"
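
A sketch of that coverage question with illustrative keys: the coverage table lists every product placed on promotion, the sales fact lists what actually sold, and a set difference yields the products on promotion that did not sell.

# Coverage factless table: (date_key, store_key, product_key, promotion_key)
promotion_coverage = {(20230901, 1, "P1", 7), (20230901, 1, "P2", 7), (20230901, 1, "P3", 7)}

# Ordinary sales fact: (date_key, store_key, product_key, sales_amount)
sales_fact = [(20230901, 1, "P1", 120.0), (20230901, 1, "P3", 40.0)]

promoted = {(d, s, p) for d, s, p, _promo in promotion_coverage}
sold = {(d, s, p) for d, s, p, _amount in sales_fact}

print("On promotion but did not sell:", promoted - sold)   # {(20230901, 1, 'P2')}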

55

Data Warehouse (Do's & Don'ts)


Do's

Do understand how the data warehouse will support the


strategic goals of the company.

Place data warehouses as close to the user as practical to ensure


convenience of access and to lower network costs.

Emphasize data cleansing

Plan for huge storage and performance-related issues in advance

Do think that there are patterns in your company's data waiting to be
detected. Detection of patterns in your data is called Data Mining.

De-normalize data

Defer functionality (Think incremental)

56

Data Warehouse (Do's & Don'ts)


Don't

Don't think Normalization. ( Storage is cheap )

Don't go for Big-Bang approach (Iterative approach is the


right way)

Don't think it as a product (It is a process rather)

Don't think for current when planning for infrastructure like


storage or speed (Think future)

Don't emphasize tools (emphasize technology).

Don't think reports; think analysis and Data Mining

Don't ignore user Training & Maintenance. They are as important


as the design itself.

57

On Line Analytical Processing


OnLine - Connected to; actively working with
Analytical - of or relating to analysis
Processing - to move from one state to another
leading to a specific result
OLAP - Connecting to a data source to analyze
information for a specific purpose

58

On Line Analytical Processing


OLAP enables analysts, managers and executives to gain insight
into data through fast, consistent, interactive access to a wide
variety of possible views of information that has been
transformed from raw data to reflect the real dimensionality of
the enterprise as understood by the user.
OLAP Activities
Slice & Dice - Slice and Dice is an ability to move between different
combinations of dimensions to see different slices of the information.
Drill-down - Drilling down or up is a specific analytical technique
whereby the user navigates among levels of data ranging from the
most summarized (up) to the most detailed (down). The drilling paths
may be defined by the hierarchies within dimensions or other
relationships that may be dynamic within or between dimensions
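
A small sketch of slice-and-dice and drill-down over an illustrative sales cube held as plain tuples: slicing fixes one dimension member, and drilling down moves from yearly totals to monthly totals.

from collections import defaultdict

cube = [  # (year, month, region, product, sales)
    (2023, 1, "North", "Widget", 100.0),
    (2023, 2, "North", "Widget", 120.0),
    (2023, 1, "South", "Gadget", 80.0),
]

# Slice: fix the region dimension to "North".
north_slice = [row for row in cube if row[2] == "North"]

# Drill-down: from totals by year (most summarized) to totals by year and month (more detailed).
by_year, by_month = defaultdict(float), defaultdict(float)
for year, month, region, product, sales in cube:
    by_year[year] += sales
    by_month[(year, month)] += sales

print(dict(by_year))     # summarized (up)
print(dict(by_month))    # detailed (down)
print(north_slice)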

59

On Line Analytical Processing - FASMI test


Fast means fast response times
Analysis means that the system can cope with any
business logic and statistical analysis
Shared means multiple-user access
Multidimensional means a multidimensional view of data, including full
support for hierarchies
Information means all of the data and derived information
needed

60

ROLAP/MOLAP/HOLAP
ROLAP stands for Relational OLAP. Users see their data organized in
cubes with dimensions, but the data is really stored in a relational
database (RDBMS) like Oracle. Because the RDBMS stores data at a fine
grain level, response times are usually slower.
MOLAP stands for Multidimensional OLAP. Users see their data
organized in cubes with dimensions, but the data is stored in a
multidimensional database (MDBMS) like Oracle Express Server. In a
MOLAP system many queries have a finite answer, and performance is
usually critical and fast.
HOLAP stands for Hybrid OLAP; it is a combination of both worlds.
Seagate Software's Holos is an example of a HOLAP environment. In a
HOLAP system one will find queries on aggregated data as well as on
detailed data.

61

Data Mining
Given a database of sufficient size and quality, data mining technology
can generate new business opportunities by providing these
capabilities:
Automated prediction of trends and behaviors
Automated discovery of previously unknown patterns

62

Data Mining
Commonly used data mining techniques
Decision trees
Rule induction
Artificial Neural Networks
Clustering
Market Basket Analysis
Link Analysis
Applications
Forecasting
Risk Management
Market Management

63

ETL Tools
Informatica
DataStage
Oracle Warehouse Builder (OWB)

64

OLAP Tools
Cognos Products
Impromptu
Transformer
PowerPlay
Visualizer
Oracle Products
Oracle Discoverer Administrator
Discoverer Plus
Discoverer Desktop

65

OLAP Tools
Primary Business Objects products

Business Objects - Full client reporting tool.


WebIntelligence - Thin client reporting tool.
Designer - Interface to design universes.
Supervisor - User administration and metadata
management.
Broadcast Agent - Scheduling and distribution
tool.

66

Data Mining Tools

BusinessObject Miner
Cognos 4Thought
Cognos Scenarios
Oracle Data Miner

67

Useful Web Sites/Books


BusinessObjects
www.businessobjects.com
COGNOS
www.cognos.com

Seagate
www.seagate.com

Data Warehousing
www.dw-institute.com

www.dwinfocenter.org
SAS Institute
www.sas.com

Others Links:
http://www.kimballgroup.com/html/designtips.html
http://www.learndatamodeling.com
http://www.1keydata.com/datawarehousing/concepts.html

The Data Warehouse Toolkit - Ralph Kimball


Publisher - John Wiley & Sons Inc.

68

Thank You

69

How would a grain change impact the universe?

Is it the report grain or the DB-level grain? There are two aspects:
1. If you move to a coarser grain, for example from day to
month: if there is no change in the fact table (DB), we may not
have to do anything; at the report level we just pull the month
portion from the time dimension. This applies to additive facts.
For non-additive facts we need to consider the last-day flag of the time
dimension.
If your fact table is changing, i.e. instead of having a date_key you have a
month_key (this is possible when you maintain a day dimension as well
as a month dimension), then your universe join conditions
will change.
2. If you are moving to a lower grain, then your DB is changing and the universe
will also be impacted. You will probably need to create a
dimension, the fact table will change, so the join conditions will
change, and the reports will change as well.

70

Which index do you use in DWH?

Bitmap indexes are widely used in data warehousing


environments. The environments typically have large
amounts of data and ad hoc queries, but a low level of
concurrent DML transactions. For such applications,
bitmap indexing provides:

Reduced response time for large classes of ad hoc queries.


Reduced storage requirements compared to other indexing
techniques.
Dramatic performance gains even on hardware with a relatively
small number of CPUs or a small amount of memory.
Efficient maintenance during parallel DML and loads.
Bitmap indexes are most effective for queries that contain multiple
conditions in the WHERE clause. Rows that satisfy some, but not
all, conditions are filtered out before the table itself is accessed.
This improves response time, often dramatically.
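
A toy sketch of the bitmap idea in Python: each distinct value of a low-cardinality column gets one bit per row, and a multi-condition WHERE clause becomes a bitwise AND of bitmaps before the table itself is touched. This only illustrates the principle, not how Oracle actually implements bitmap indexes.

rows = [
    {"gender": "F", "region": "EAST"},
    {"gender": "M", "region": "EAST"},
    {"gender": "F", "region": "WEST"},
    {"gender": "F", "region": "EAST"},
]

def build_bitmaps(table, column):
    # One integer bitmap per distinct value; bit i is set if row i holds that value.
    bitmaps = {}
    for i, row in enumerate(table):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

gender_index = build_bitmaps(rows, "gender")
region_index = build_bitmaps(rows, "region")

# WHERE gender = 'F' AND region = 'EAST': AND the two bitmaps, then fetch only matching rows.
match = gender_index["F"] & region_index["EAST"]
print([rows[i] for i in range(len(rows)) if match & (1 << i)])   # rows 0 and 3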

71

Which index do you use in DWH?


Bitmap index

Null values
Unlike most other types of indexes, bitmap indexes include rows that have NULL values. Indexing of nulls
can be useful for some types of SQL statements, such as queries with the aggregate function COUNT.

Cardinality
The advantages of using bitmap indexes are greatest for columns in which the ratio of the number of distinct
values to the number of rows in the table is small. We refer to this ratio as the degree of cardinality. A gender
column, which has only two distinct values (male and female), is optimal for a bitmap index. However, data
warehouse administrators also build bitmap indexes on columns with higher cardinalities. For example, on a
table with one million rows, a column with 10,000 distinct values is a candidate for a bitmap index. A bitmap
index on this column can outperform a B-tree index, particularly when this column is often queried in
conjunction with other indexed columns. In fact, in typical data warehouse environments, a bitmap index can
be considered for any non-unique column.

Using Bitmap Join Indexes in Data Warehouses


In addition to a bitmap index on a single table, you can create a bitmap join index, which is a bitmap index for
the join of two or more tables. In a bitmap join index, the bitmap for the table to be indexed is built for values
coming from the joined tables. In a data warehousing environment, the join condition is an equi-inner join
between the primary key column or columns of the dimension tables and the foreign key column or columns in
the fact table.
A bitmap join index can improve the performance by an order of magnitude. By storing the result of a join, the
join can be avoided completely for SQL statements using a bitmap join index. Furthermore, since there are most
likely far fewer distinct values for a bitmap join index compared to a regular bitmap
index on the join column, the bitmaps compress better, yielding less space consumption than a regular
bitmap index on the join column. Refer to the links below for the bitmap join index.
You can also refer to the "bitmap index in DWH.doc" document.
Link: http://download.oracle.com/docs/cd/B14117_01/server.101/b10736/indexes.htm
http://download.oracle.com/docs/cd/B14117_01/server.101/b10736/schemas.htm#g1008401

72

Bitmap index in Star Schema:


To get the best possible performance for star
queries, it is important to follow some basic
guidelines:
A bitmap index should be built on each of the
foreign key columns of the fact table or tables.
The initialization parameter
STAR_TRANSFORMATION_ENABLED
should be set to TRUE. This enables an
important optimizer feature for star queries. It
is set to FALSE by default for backward compatibility.
73

What is a degenerate dimension?


In a data warehouse, a degenerate dimension is a
dimension which is derived from the fact table and
doesn't have its own dimension table. Degenerate
dimensions are often used when a fact table's grain
represents transactional level data and one wishes to
maintain system specific identifiers such as order
numbers, invoice numbers and the like without forcing
their inclusion in their own dimension. The decision to
use degenerate dimensions is often based on the desire
to provide a direct reference back to a transactional
system without the overhead of maintaining a separate
dimension table.
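
A sketch of a degenerate dimension with illustrative names: the order number is carried directly in the fact row, with no dimension table of its own, and can still be used for grouping or drill-through to the source system.

# Fact rows: dimension keys, a measure, and ORDER_NUMBER kept as a degenerate dimension.
sales_fact = [
    {"date_key": 20230901, "product_key": 10, "order_number": "SO-48213", "amount": 120.0},
    {"date_key": 20230901, "product_key": 11, "order_number": "SO-48214", "amount": 35.0},
]

# No separate order dimension table is needed to reference the transaction.
print({row["order_number"] for row in sales_fact})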

74
