
Customer Relationship Management

Unit - IV : Lesson 8

Data Warehousing & OLAP Technology

Definition of Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. (W. H. Inmon)

Data Warehouse Definition


A data warehouse is a structured repository of historic data.
It is developed in an evolutionary process by integrating data from non-integrated Legacy systems.

What is Data Warehouse?


A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

Data Warehouse Used for?


Knowledge discovery
  Making consolidated reports
  Finding relationships and correlations
  Data mining
Examples
  Banks identifying credit risks
  Insurance companies searching for fraud
  Medical research

Data Warehouse Usage


Three kinds of data warehouse applications:
Information processing
  supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
Analytical processing
  multidimensional analysis of data warehouse data
  supports basic OLAP operations: slice-dice, drilling, pivoting
Data mining
  knowledge discovery from hidden patterns
  supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

Organizational Information
Levels, formats, and granularities of organizational information

DATA WAREHOUSE FUNDAMENTALS

Data Warehouse Architecture

Very Large Data Bases


Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehouse Concepts

Levels of Granularity of Data Warehouse Data

Atomic (Transaction)

Lightly Summarized

Highly Summarized
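
The three levels of granularity can be illustrated with a small sketch. The transaction rows, the field names, and the choice of "per day" and "per month" as the two summary levels are assumptions made for illustration only.

```python
from collections import defaultdict

# Atomic (transaction-level) data: one row per sale. Values are made up.
transactions = [
    {"date": "2024-01-05", "product": "cola", "amount": 12.0},
    {"date": "2024-01-05", "product": "cola", "amount": 8.0},
    {"date": "2024-01-17", "product": "milk", "amount": 5.0},
    {"date": "2024-02-02", "product": "cola", "amount": 10.0},
]

def summarize(rows, key_func):
    """Sum 'amount' over the groups defined by key_func."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_func(row)] += row["amount"]
    return dict(totals)

# Lightly summarized: total per product per day.
daily = summarize(transactions, lambda r: (r["date"], r["product"]))

# Highly summarized: total per month across all products.
monthly = summarize(transactions, lambda r: r["date"][:7])
```

Each summary level answers coarser questions faster, but the detail needed to answer finer questions is only available at the atomic level.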

Scrubbing Data
Sophisticated transformation tools are used to clean the data and improve its quality.
Clean data is vital for the success of the warehouse.
Example: Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person.
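
One possible way to detect such spelling variants is approximate string matching. The sketch below uses Python's standard `difflib`; the reference spelling, the 0.85 threshold, and the helper names are assumptions, and a similar spelling alone does not prove that two records describe the same person.

```python
import difflib
import string

CANONICAL = "seshadri"  # assumed reference spelling

def tokens(name):
    """Lowercase the name and split it into words without punctuation."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def refers_to_canonical(name, threshold=0.85):
    """True if any word in the name is close enough to the reference spelling."""
    return any(
        difflib.SequenceMatcher(None, tok, CANONICAL).ratio() >= threshold
        for tok in tokens(name)
    )

variants = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.", "Srinivasan Seshadri"]
matched = [v for v in variants if refers_to_canonical(v)]
```

In practice such candidate matches would usually be confirmed against other fields (address, account number) before records are merged.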

Extraction, Transformation, and Loading (ETL)


Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views
Refresh: propagate the updates from the data sources to the warehouse
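
A minimal sketch of the extract, clean, transform, and load steps, assuming two hypothetical in-memory sources and an in-memory SQLite database standing in for the warehouse. All source names, field names, and values are invented for illustration.

```python
import sqlite3

# Extract: rows from two hypothetical, heterogeneous sources.
orders_system = [{"cust": "joe", "amt": "12.0"}, {"cust": "sally", "amt": ""}]
web_shop = [{"customer_name": "Fred", "total": 11.0}]

def extract():
    """Read every source and map it onto one common record layout."""
    for row in orders_system:
        yield {"customer": row["cust"], "amount": row["amt"]}
    for row in web_shop:
        yield {"customer": row["customer_name"], "amount": row["total"]}

def clean(rows, rejected):
    """Detect rows with a missing amount and set them aside instead of loading."""
    for row in rows:
        if row["amount"] in ("", None):
            rejected.append(row)
        else:
            yield row

def transform(rows):
    """Convert to the warehouse format: lowercase names, numeric amounts."""
    for row in rows:
        yield (row["customer"].lower(), float(row["amount"]))

def load(conn, rows):
    """Write the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
rejected = []
load(conn, transform(clean(extract(), rejected)))
loaded = conn.execute("SELECT customer, amount FROM sales ORDER BY customer").fetchall()
```

Rejected rows are kept in a separate list rather than silently dropped, so they can be logged or corrected later.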

The ETL Process


Capture
Scrub or data cleansing
Transform
Load and Index
ETL = Extract, transform, and load

Figure 11-10: Steps in data reconciliation

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
  Static extract = capturing a snapshot of the source data at a point in time
  Incremental extract = capturing changes that have occurred since the last static extract

Figure 11-10: Steps in data reconciliation (continued)

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality
  Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
  Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Figure 11-10: Steps in data reconciliation (continued)

Transform = convert data from format of operational system to format of data warehouse
Record-level:
  Selection: data partitioning
  Joining: data combining
  Aggregation: data summarization
Field-level:
  single-field: from one field to one field
  multi-field: from many fields to one, or one field to many

Figure 11-10: Steps in data reconciliation (continued)

Load/Index = place transformed data into the warehouse and create indexes
  Refresh mode: bulk rewriting of target data at periodic intervals
  Update mode: only changes in source data are written to data warehouse
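
The difference between refresh mode and update mode, together with an incremental extract, can be sketched as follows. The `modified` timestamp field, the row layout, and the function names are assumptions; deletions in the source are not handled in this sketch.

```python
def refresh_load(source_rows):
    """Refresh mode: rebuild the target from scratch from the full source."""
    return {row["id"]: row for row in source_rows}

def update_load(target, changed_rows):
    """Update mode: apply only the changed rows to the existing target."""
    for row in changed_rows:
        target[row["id"]] = row
    return target

def incremental_extract(source_rows, last_extract_time):
    """Capture only rows modified since the previous extract."""
    return [r for r in source_rows if r["modified"] > last_extract_time]

source = [
    {"id": 1, "value": "a", "modified": 10},
    {"id": 2, "value": "b", "modified": 10},
]
warehouse = refresh_load(source)  # initial load from a static extract at time 10

# The source changes afterwards: one row is updated and one is added.
source[1] = {"id": 2, "value": "b2", "modified": 20}
source.append({"id": 3, "value": "c", "modified": 25})

changes = incremental_extract(source, last_extract_time=10)
warehouse = update_load(warehouse, changes)
```

Here both modes end in the same warehouse state; update mode simply writes fewer rows to get there.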

Data Warehouse - Characteristics


A data warehouse is
  subject-oriented
  integrated
  time-variant
  non-volatile

Data Warehouse: Subject-Oriented

Organized around major subjects, such as customer, product, sales
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
  relational databases, on-line transaction records
Data cleaning and data integration techniques are applied.
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  When data is moved to the warehouse, it is converted. E.g.: male, female (0,1)

Data Transformation Example


Source   Gender encoding   Pipeline unit   Balance field name
appl A   m, f              cm              balance
appl B   1, 0              in              bal
appl C   x, y              feet            currbal
appl D   male, female      yds             balcurr

All four applications feed the data warehouse, where each item is converted to a single representation.
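
A sketch of how the four applications' conventions might be converted on the way into the warehouse. The target conventions chosen here (gender as "m"/"f", lengths in cm, the field name `balance`) and the assignment of 1/0 and x/y to male/female are assumptions made for illustration.

```python
# Conversion rules per source application. The target conventions and the
# meaning of the codes 1/0 and x/y are assumptions, not given by the slide.
GENDER_TO_MF = {
    "A": {"m": "m", "f": "f"},
    "B": {1: "m", 0: "f"},
    "C": {"x": "m", "y": "f"},
    "D": {"male": "m", "female": "f"},
}
CM_PER_UNIT = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, feet, yds
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}

def to_warehouse(app, record):
    """Convert one source record into the common warehouse representation."""
    return {
        "gender": GENDER_TO_MF[app][record["gender"]],
        "pipeline_cm": round(record["pipeline"] * CM_PER_UNIT[app], 2),
        "balance": record[BALANCE_FIELD[app]],
    }

row_b = to_warehouse("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
row_d = to_warehouse("D", {"gender": "female", "pipeline": 2, "balcurr": 75.0})
```

Keeping the rules in lookup tables makes it easy to add a fifth source application without touching the conversion function.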

Data Integrity Problems


Same person, different spellings: Agarwal, Agrawal, Aggarwal, etc.
Multiple ways to denote company name: Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names: mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
  manual entry leads to mistakes
  "in case of a problem use 9999999"

Data Warehouse: Time Variant

The time horizon for the data warehouse is significantly longer than that of operational systems
  Operational database: current value data
  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
  Contains an element of time, explicitly or implicitly
  But the key of operational data may or may not contain a time element

Data Warehouse: Nonvolatile
A physically separate store of data transformed from the operational environment
Operational update of data does not occur in the data warehouse environment
  Does not require transaction processing, recovery, and concurrency control mechanisms
  Requires only two operations in data accessing: initial loading of data and access of data

Building a Data Warehouse


Data Warehouse Lifecycle

Analysis
Design
Import data
Install front-end tools
Test and deploy

Stage 1: Analysis
Identify:
  Target questions
  Data needs
  Timeliness of data
  Granularity
Create an enterprise-level data dictionary
Dimensional analysis
  Identify facts and dimensions

Stage 2: Design
Dimensional modeling
Star schema
Data transformation
Aggregates
Pre-calculated values
HW/SW architecture

Stage 3: Import Data


Identify data sources
Extract the needed data from existing systems to a data staging area
Transform and clean the data
  Resolve data type conflicts
  Resolve naming and key conflicts
  Remove, correct, or flag bad data
  Conform dimensions
Load the data into the warehouse

Stage 4: Install Front-end Tools


Reporting tools
Data mining tools
GIS
Etc.

Stage 5: Test and Deploy


Usability tests
Software installation
User training
Performance tweaking based on usage

Data Warehouse vs. Operational DBMS


OLTP (on-line transaction processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
  Major task of data warehouse system
  Data analysis and decision making
Distinct features (OLTP vs. OLAP):
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries

The Value of Transactional and Analytical Information

OLTP vs. OLAP


Feature              OLTP                                   OLAP
users                clerk, IT professional                 knowledge worker
function             day to day operations                  decision support
DB design            application-oriented                   subject-oriented
data                 current, up-to-date, detailed,         historical, summarized, multidimensional,
                     flat relational, isolated              integrated, consolidated
usage                repetitive                             ad-hoc
access               read/write, index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
# users              thousands                              hundreds
DB size              100MB-GB                               100GB-TB
metric               transaction throughput                 query throughput, response

Why Separate Data Warehouse?


High performance for both systems
  DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
  Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
  missing data: decision support requires historical data which operational DBs do not typically maintain
  data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Multi-dimensional Data Model

Data Warehouse: A Multi-Tiered Architecture


[Figure: multi-tiered architecture, from sources to front-end tools]
Data Sources: operational DBs and other sources
Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
Data Storage: data warehouse and data marts
Serve
OLAP Engine: OLAP server
Front-End Tools: analysis, query, reports, data mining

From Tables and Spreadsheets to Data Cubes


A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
  Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
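
The lattice can be enumerated mechanically: every subset of the dimensions is one cuboid, so n dimensions without concept hierarchies give 2^n cuboids. The short sketch below assumes exactly the four dimensions named above; the function name is invented.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

def cuboid_lattice(dims):
    """Return {k: list of k-D cuboids}, each cuboid a tuple of dimension names."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboid_lattice(dimensions)
for k, cuboids in lattice.items():
    print(f"{k}-D:", cuboids)

# Total number of cuboids across all levels of the lattice.
total = sum(len(c) for c in lattice.values())
```

The empty tuple at level 0 is the apex cuboid and the single tuple at level 4 is the base cuboid.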

Multi-dimensional Data

Ram sold Rs. 1000 worth of goods


Dimensions: Product, Region, Time
Hierarchical summarization paths:
  Product: Industry > Category > Product
  Region: Country > Region > City > Office
  Time: Year > Quarter > Month or Week > Day
[Figure: cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N), and Month (1 to 7)]

A Sample Data Cube


[Figure: sales data cube with three dimensions]
Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
Product: TV, PC, VCR, sum
Country: U.S.A, Canada, Mexico, sum
Example cell: total annual sales of TV in U.S.A.

Multidimensional Analysis
Cube: common term for the representation of multidimensional information

3-D Cube
Fact table view (sale):

prodId   storeId   date   amt
p1       c1        1      12
p2       c1        1      11
p1       c3        1      50
p2       c2        1      8
p1       c1        2      44
p1       c2        2      4

Multi-dimensional cube (rows: prodId, columns: storeId, one plane per day):

day 1    c1    c2    c3
p1       12          50
p2       11    8

day 2    c1    c2    c3
p1       44    4
p2

dimensions = 3
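
As an illustration, the fact table above can be turned into cube cells and aggregated along any subset of its dimensions. The dictionary representation and the `aggregate` helper are assumptions chosen for brevity, not how an OLAP engine necessarily stores a cube.

```python
from collections import defaultdict

# Fact table rows from the example above: (prodId, storeId, date, amt).
sale = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# Cube cells keyed by the three dimension values; absent keys are empty cells.
cube = {(prod, store, day): amt for prod, store, day, amt in sale}

def aggregate(cube, keep):
    """Sum amt over the dimensions not listed in keep (key positions 0, 1, 2)."""
    totals = defaultdict(int)
    for key, amt in cube.items():
        totals[tuple(key[i] for i in keep)] += amt
    return dict(totals)

by_product = aggregate(cube, keep=[0])
by_day = aggregate(cube, keep=[2])
grand_total = aggregate(cube, keep=[])
```

Each choice of `keep` corresponds to one cuboid: keeping all three positions gives the base cuboid, keeping none gives the apex.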

Conceptual Modeling of Data Warehouses


Modeling data warehouses: dimensions & measures
  Star schema: a fact table in the middle connected to a set of dimension tables
  Snowflake schema: a refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  Fact constellations: multiple fact tables share dimension tables, viewed as a group of stars forming a recognized pattern, therefore called galaxy schema or fact constellation

Star Schema
Creates non-normalized data structures
Easier for users to understand
Optimized for OLAP
Uses fact tables (facts or measures in the business) and dimension tables (which establish the context of the facts)

Star Schema
A single fact table and, for each dimension, one dimension table
Does not capture hierarchies directly
[Figure: fact table (date, custno, prodno, cityname, ...) linked to dimension tables time, prod, cust, and city]

Star
product
prodId   name   price
p1       bolt   10
p2       nut    5

store
storeId   city
c1        nyc
c2        sfo
c3        la

sale
orderId   date     custId   prodId   storeId   qty   amt
o100      1/7/97   53       p1       c1        1     12
o102      2/7/97   53       p2       c1        2     11
105       3/8/97   111      p1       c3        5     50

customer
custId   name    address     city
53       joe     10 main     sfo
81       fred    12 main     sfo
111      sally   80 willow   la
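
A typical star-schema query joins the fact table to its dimension tables and aggregates a measure. The sketch below loads the product, store, and sale rows of the example into an in-memory SQLite database; the customer table is left out for brevity, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER,
                   prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale VALUES
  ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
  ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
  ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Join the fact table to two dimensions and total amt per product and city.
rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.amt)
    FROM sale AS f
    JOIN product AS p ON p.prodId = f.prodId
    JOIN store AS s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
```

Because every dimension is one join away from the fact table, queries of this shape stay simple regardless of which dimensions are involved.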

Example of Star Schema


Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_street, country

Snowflake schema
Represents dimensional hierarchy directly by normalizing tables.
Easy to maintain and saves storage.
[Figure: fact table (date, custno, prodno, cityname, ...) linked to dimension tables time, prod, cust, and city, with an additional region table]

Example of Snowflake Schema


Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city: city_key, city, province_or_street, country

Dimension Table Examples


Retail -- store name, zip code, product name, product category, day of week
Telecommunications -- call origin, call destination
Banking -- customer name, account number, branch, account officer
Insurance -- policy type, insured party

A Concept Hierarchy: Dimension (location)


Level     Example values
all       all
region    Europe, ..., North_America
country   Germany, ..., Spain, Canada, ..., Mexico
city      Frankfurt, ..., Vancouver, ..., Toronto
office    L. Chan, ..., M. Wind
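
A concept hierarchy like this one is what makes roll-up possible: values at one level are summed into their parent at the next level. In the sketch below the child-to-parent maps follow the hierarchy's city, country, and region names, while the sales figures and the function name are invented for illustration.

```python
from collections import defaultdict

# Part of the location hierarchy: city -> country -> region.
COUNTRY_OF = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}
REGION_OF = {"Germany": "Europe", "Spain": "Europe",
             "Canada": "North_America", "Mexico": "North_America"}

# Hypothetical sales figures per city.
sales_by_city = {"Frankfurt": 120, "Vancouver": 80, "Toronto": 100}

def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy using a child -> parent map."""
    totals = defaultdict(int)
    for child, value in values.items():
        totals[parent_of[child]] += value
    return dict(totals)

sales_by_country = roll_up(sales_by_city, COUNTRY_OF)
sales_by_region = roll_up(sales_by_country, REGION_OF)
sales_all = sum(sales_by_region.values())
```

Drill-down is the reverse direction and needs the finer-grained data to be stored, since it cannot be reconstructed from the totals.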

Fact Table Examples


Retail -- number of units sold, sales amount
Telecommunications -- length of call in minutes, average number of calls
Banking -- average monthly balance
Insurance -- claims amount

Example of Fact Constellation


Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  Keys: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_street, country
  shipper: shipper_key, shipper_name, location_key, shipper_type

II. On-Line Analytical Processing (OLAP)

Making Decision Support Possible

What Is OLAP?
Online Analytical Processing: term coined by E. F. Codd in a 1994 paper contracted by Arbor Software
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)

OLAP Client/Server Architecture

OLAP tool Vendors


IBM
Informix
Cartelon
NCR
Oracle (Oracle Warehouse Builder, Oracle OLAP)
Red Brick
Sybase
SAS
Microsoft (SQL Server OLAP)
Microstrategy Corporation

Typical OLAP Operations


Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up; from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
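
Slice, dice, and pivot can be sketched on a cube held as a dictionary of cells. The dimension names, the cell values, and the three helper functions are assumptions for illustration; real OLAP servers expose these operations through their own query interfaces.

```python
# Hypothetical cube cells: (product, region, channel) -> sales.
cube = {
    ("Telecomm", "Europe", "Retail"): 10,
    ("Telecomm", "India", "Direct"): 7,
    ("Video", "Europe", "Retail"): 5,
    ("Audio", "Far East", "Special"): 3,
}
DIMS = ("product", "region", "channel")

def slice_(cube, dim, value):
    """Slice: fix one dimension to a single value and drop it from the key."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}

def dice(cube, **allowed):
    """Dice: keep cells whose value on each named dimension is in the given set."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

def pivot(table_2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in table_2d.items()}

telecomm = slice_(cube, "product", "Telecomm")
europe_retail = dice(cube, region={"Europe"}, channel={"Retail"})
rotated = pivot(telecomm)
```

A slice reduces the number of dimensions by one, whereas a dice returns a smaller sub-cube with the same dimensions.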

Roll-up and Drill Down


Higher Level of Aggregation
  Sales Channel
  Region
  Country
  State
  Location Address
  Sales Representative
Low-level Details

Slicing and Dicing


[Figure: cube with axes Product (Household, Telecomm, Video, Audio), regions (Europe, Far East, India), and Sales Channel (Retail, Direct, Special), with the Telecomm slice highlighted]

A Visual Operation: Pivot (Rotate)

[Figure: 2-D view of Product (Juice, Cola, Milk, Cream) against Date (3/1, 3/2, 3/3, 3/4), with sample cell values 10, 47, 30, 12]

Data warehouse architecture

Design of a Data Warehouse: A Business Analysis Framework


Four views regarding the design of a data warehouse:
  Top-down view: allows selection of the relevant information necessary for the data warehouse
  Data source view: exposes the information being captured, stored, and managed by operational systems
  Data warehouse view: consists of fact tables and dimension tables
  Business query view: sees the perspectives of data in the warehouse from the view of end-user

OLAP Server Architectures


Relational OLAP (ROLAP)
  Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces
  Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  Greater scalability
Multidimensional OLAP (MOLAP)
  Array-based multidimensional storage engine (sparse matrix techniques)
  Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
  User flexibility, e.g., low level: relational, high-level: array

ROLAP
Relational OLAP
Uses an RDBMS to implement an OLAP environment
Typically involves a star schema to provide the multidimensional capabilities
OLAP tool manipulates RDBMS star schema data
Called "slowlap" by MOLAP vendors

MOLAP
Multidimensional OLAP
Uses a MDDBS (e.g., Essbase) to store and access data
Usually requires proprietary (non-SQL) data access tools
Provides exceptionally fast response times

Data Warehouse vs. Data Marts


What comes first

Data Warehouse Models


Enterprise warehouse
  collects all of the information about subjects spanning the entire organization
Data mart
  a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
  independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
  a set of views over operational databases
  only some of the possible summary views may be materialized

Data Mart
A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications. An independent data mart is created directly from source systems. A dependent data mart is populated from a data warehouse.

Techniques for Creating Departmental Data Mart


[Figure: departmental data marts (Sales, Finance, Mktg.) and OLAP; techniques shown: Subset, Summarized, Superset, Indexed, Arrayed]

From the Data Warehouse to Data Marts


[Figure labels: Data Warehouse; Data; Information; Organizationally Structured; Departmentally Structured; Individually Structured; More/Less History; Normalized; Detailed]

Reporting Tools
Andyne Computing -- GQL
Brio -- BrioQuery
Business Objects -- Business Objects
Cognos -- Impromptu
Information Builders Inc. -- Focus for Windows
Oracle -- Discoverer2000
Platinum Technology -- SQL*Assist, ProReports
PowerSoft -- InfoMaker
SAS Institute -- SAS/Assist
Software AG -- Esperant
Sterling Software -- VISION:Data

Warehouse Server Products


Oracle 8
Informix
  Online Dynamic Server
  XPS -- Extended Parallel Server
  Universal Server for object relational applications
Sybase
  Adaptive Server 11.5
  Sybase MPP
  Sybase IQ

Warehouse Server Products


Red Brick Warehouse
Tandem Nonstop
IBM
  DB2 MVS
  Universal Server
  DB2 400
Teradata
