Sie sind auf Seite 1von 33

Designing a Data

Warehouse
Issues in DW design

Data Warehouse
A read-only database for decision
analysis
Subject Oriented

Integrated
Time variant

Nonvolatile
consisting of time stamped
operational and external data.

Data Warehouse vs
Operational Databases

Highly tuned
Real time Data
Detailed records
Current values
Accesses small
amounts of data
in a predictable
manner

Flexible access
Consistent timing
Summarized as
appropriate
Historical
Access large
amounts of data in
unexpected ways

Data Warehouse Purpose

Identify problems in time to avoid


them
Locate opportunities you might
otherwise miss

Data Warehouse:
New Approach
An old idea with a new interest
because of:
Cheap Computing Power
Special Purpose Hardware
New Data Structures
Intelligent Software

Warehousing Problems
Business Issues
Data Quantity
Data Accuracy
Maintenance
Ownership
Cost

Warehousing Problems
Business Issues
Database Issues
DBMS Software
Technology
Complexity

Warehousing Problems
Business Issues
Data Issues
Analysis Issues
User Interface
Intelligent Processing

Three Approaches

Classical Enterprise Database


Contains operational data from all areas of
the organization.

Data Mart
Extracted and managerial support data
designed for departmental or EUC
applications

Data Package
Data required for a specific application

Classical Warehouse
Source

Archived data

Extraction

Batch extraction programs

Data

Atomic transaction data

Tool

VLDB technology

Analysis

IT driven software

Mart
Source

Deposit or External sources

Extraction

Batch summary

Data

Designed departmental database

Tool

OLAP, ROLAP, MDBMS

Analysis

IT driven or trained user

Package
Source

Mart

Extraction

Sample and summary

Data

Problem specific dataset

Tool

PC tools

Analysis

Trained user

Three Fundamental
Processes

Data Acquisition
Data Storage
Data a
Access

Data Acquisition

Handles acquisition of data from


legacy
systems and outside
sources.
Data is identified, copied,
formatted and prepared for loading
into the warehouse.

Acquisition steps

Catalog the data

Clean and prepare the data.

Develop an inventory of where it


is and what it means.
Extract from legacy files and
reformat to make it usable.

Transport data from one location to


another.

Storage
The storage component holds the
data so that the many
different data mining, executive
information
and
decision support systems can
make use of it effectively.

The Storage Area


Managed by
Relational databases

like those from Oracle Corp. or


Informix Software Inc.

Specialized hardware

symmetric multiprocessor (SMP)


or massively parallel processor
(MPP) machines

Storage

The majority of warehouse storage


today is being managed by
relational databases running on
Unix platforms.
Oracle, Sybase Inc., IBM Corp. and
Informix control 65 percent of the
warehouse storage market. Meta
Group Inc. (1996)

Access

Different end-user PCs and workstations


draw data from the warehouse with the
help of multidimensional analysis
products, neural networks, data
discovery tools or analysis tools.
These powerful, "smart" software
products are the real driving force
behind the viability of data
warehousing.

Access Tools

Intelligent Agents and Agencies


Query Facilities and Managed Query
Environments
Statistical Analysis
Data Discovery.
(decision support, artificial intelligence
and expert systems)

OLAP
Data Visualization

Hardware Budget

A typical startup warehouse


project allocates more than 60
percent of its budget for hardware
and software to the creation of a
powerful storage component,
spending just 30 percent on data
mining and user access
technologies.

Systems Analysis Budget

Budgeting for systems analysis and


development, however, follows a very
different pattern.
More than 50 percent of development
dollars are spent on building acquisition
capabilities,
30 percent fund the development of user
solutions and
20 percent are dedicated to the creation
of databases in the storage component.

Design Issues
Relational and Multidimensional
Models
Denormalized and indexed
relational models more flexible
Multidimensional models simpler to
use and more efficient

Star Schemas in a RDBMS


In most companies doing ROLAP, the DBAs
have created countless indexes and summary
tables in order to avoid I/O-intensive table
scans against large fact tables. As the indexes
and summary tables proliferate in order to
optimize performance for the known queries
and aggregations that the users perform, the
build times and disk space needed to create
them has grown enormously, often requiring
more time than is allotted and more space
than the original data!

Building a Data Warehouse


from a Normalized
The steps
Database

Develop a normalized entity-relationship


business model of the data warehouse.
Translate this into a dimensional model.
This step reflects the information and
analytical characteristics of the data
warehouse.
Translate this into the physical model. This
reflects the changes necessary to reach
the stated performance objectives.

The Business Model

Identify the data structure,


attributes and constraints for the
clients data warehousing
environment.
Stable
Optimized for update
Flexible

Business Model
As always in life, there are some
disadvantages to 3NF:
Performance can be truly awful. Most of
the work that is performed on
denormalizing a
data model is an
attempt to reach performance objectives.
The structure can be overwhelmingly
complex. We may wind up creating many
small relations which the user might think
of as a single relation or group of data.

Structural Dimensions

The first step is the development of the


structural dimensions. This step
corresponds very closely to what we
normally do in a relational database.
The star architecture that we will
develop here depends upon taking the
central intersection entities as the fact
tables and building the foreign key =>
primary key relations as dimensions.

Simple DW pattern.

Other Dimensions

Categorical dimensions: generated


groups (additional key components)
Partitioning dimensions: subtypes
(planned vs. actual)
Informational dimensions:
generate different types of data
(messy).

Das könnte Ihnen auch gefallen