
An Introduction to Data Warehousing

Anand Deshpande Persistent Systems Pvt. Ltd. http://www.pspl.co.in

Introduction and Motivation


- What is a Warehouse?
- Data Warehouse Architecture
- Implementing the Data Warehouse
- An Introduction to Decision Support
- References
CSI'99 2

What is a Data Warehouse?

What are the users saying...


- Data should be integrated across the enterprise
- Summary data had a real value to the organization
- Historical data held the key to understanding data over time
- What-if capabilities are required

What is Data Warehousing?


A process of transforming data into information and making it available to users in a timely enough manner to make a difference.
[Forrester Research, April 1996]

[Figure: Data is transformed into Information.]

What is a Data Warehouse?


A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]

Data Warehousing Market


- In 1996, close to 90% of IT professionals had either created a data warehouse or were planning to create one
- Average 3 year ROI of 400%
- Average payback was 2.3 years on costs averaging $2.2 million

The Data Warehouse Market


[Figure: bar chart comparing 1995 and 1998 market size for Hardware, DBMS, Tools and Total Market, on an axis from 0 to 8000. Values in millions of US$. Source: Meta Group.]

95% of Fortune 1000 Companies are creating Warehouses


Segment | 1994 | 1999 | Compound Annual Growth Rate
Data Extraction/Movement | $65.0 | $210.0 | 26.4%
Administration | $10.0 | $450.0 | 114.1%
RDBMS | $288.0 | $1,100.0 | 30.7%
Hardware | $1,075.0 | $3,950.0 | 29.7%
Consulting Services | $130.0 | $1,250.0 | 57.3%
Total Market Size | $1,568.0 | $6,960.0 | 34.7%

All revenues are in millions of U.S. dollars and are Gartner Group estimates. Source: Gartner Group, Inc.
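
The growth rates quoted for 1994 to 1999 are consistent with the usual compound annual growth rate formula, (end / start)^(1 / years) - 1, taken over five years. The sketch below checks this for the total market figures ($1,568.0 million growing to $6,960.0 million); the function name `cagr` is an illustrative choice, not something from the slides.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.347 means 34.7%)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Total market size in millions of US$; 1994 -> 1999 spans 5 years.
total_growth = cagr(1568.0, 6960.0, 5)
print(f"{total_growth:.1%}")  # prints 34.7%
```

The same call with 10.0 and 450.0 gives roughly 114.1%, the highest rate on the slide.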

Warehouses are Very Large Databases


[Figure: histogram of the share of respondents (0% to 35%) by warehouse size, in the bands 5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB and 500GB-1TB, with two series, Initial and Projected, 2Q96. Source: META Group, Inc.]

Very Large Data Bases


- Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
- Petabytes -- 10^15 bytes: Geographic Information Systems
- Exabytes -- 10^18 bytes: National Medical Records
- Zettabytes -- 10^21 bytes: Weather images
- Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehousing -- It is a process


- A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
- A decision support database maintained separately from the organization's operational database

Data Warehouse
- A data warehouse is a
  - subject-oriented
  - integrated
  - time-varying
  - non-volatile

  collection of data that is used primarily in organizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996

Explorers, Farmers and Tourists


- Tourists: Browse information harvested by farmers
- Farmers: Harvest information from known access paths
- Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

Data Warehouse Architecture


[Figure: architecture diagram. Sources: Relational Databases, ERP Systems, Purchased Data, Legacy Data. Stages: Extraction and Cleansing, Optimized Loader, Data Warehouse Engine, Analyze and Query. A Metadata Repository sits alongside.]

Data Warehouse for Decision Support


- Putting information technology to work to help the knowledge worker make faster and better decisions
  - Which of my customers are most likely to go to the competition?
  - What product promotions have the biggest impact on revenue?
  - How did the share price of software companies correlate with profits over the last 10 years?

Decision Support

- Used to manage and control business
- Data is historical or point-in-time
- Optimized for inquiry rather than update
- Use of the system is loosely defined and can be ad-hoc
- Used by managers and end-users to understand the business and make judgements

What are Operational Systems?

- They are OLTP systems
- Run mission critical applications
- Need to work with stringent performance requirements for routine tasks
- Used to run a business!

RDBMS is used for OLTP

- Database Systems have been used traditionally for OLTP
  - clerical data processing tasks
  - detailed, up to date data
  - structured repetitive tasks
  - read/update a few records
  - isolation, recovery and integrity are critical

More about Operational Systems

- Run the business in real time
- Based on up-to-the-second data
- Optimized to handle large numbers of simple read/write transactions
- Optimized for fast response to predefined transactions
- Used by people who deal with customers, products -- clerks, salespeople etc.
- They are increasingly used by customers

Examples of Operational Data


Data | Industry | Usage | Technology | Volumes
Customer File | All | Track Customer Details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, Client/Server, relational databases | Very Large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very Large
Production Record | Manufacturing | Control Production | ERP, relational databases, AS/400 | Medium

So, what's different?

Application-Orientation vs. Subject-Orientation


- Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
- Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity

OLTP vs. Data Warehouse


- OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
- Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  - e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December

OLTP vs. Data Warehouse

- OLTP
  - Application Oriented
  - Used to run business
  - Detailed data
  - Current up to date
  - Isolated Data
  - Repetitive access
  - Clerical User
- Warehouse (DSS)
  - Subject Oriented
  - Used to analyze business
  - Summarized and refined
  - Snapshot data
  - Integrated Data
  - Ad-hoc access
  - Knowledge User (Manager)

OLTP vs. Data Warehouse

- OLTP
  - Performance Sensitive
  - Few Records accessed at a time (tens)
  - Read/Update Access
  - No data redundancy
  - Database Size 100 MB - 100 GB
- Data Warehouse
  - Performance relaxed
  - Large volumes accessed at a time (millions)
  - Mostly Read (Batch Update)
  - Redundancy present
  - Database Size 100 GB - few terabytes

OLTP vs. Data Warehouse

- OLTP
  - Transaction throughput is the performance metric
  - Thousands of users
  - Managed in entirety
- Data Warehouse
  - Query throughput is the performance metric
  - Hundreds of users
  - Managed by subsets

Why Now?
- Data is being produced
- ERP provides clean data
- The computing power is available
- The computing power is affordable
- The competitive pressures are strong
- Commercial products are available

To summarize ...

- OLTP Systems are used to run a business
- The Data Warehouse helps to optimize the business

Wal*Mart Case Study


- Founded by Sam Walton
- One of the largest Super Market Chains in the US
- Wal*Mart: 2000+ Retail Stores
- SAM's Clubs: 100+ Wholesalers Stores
  - This case study is from Felipe Carino's (NCR Teradata) presentation made at Stanford Database

Old Retail Paradigm


- Wal*Mart
  - Inventory Management
  - Merchandise Accounts Payable
  - Purchasing
  - Supplier Promotions: National, Region, Store Level
- Suppliers
  - Accept Orders
  - Promote Products
  - Provide special Incentives
  - Monitor and Track The Incentives
  - Bill and Collect Receivables
  - Estimate Retailer Demands

New (Just-In-Time) Retail Paradigm


- No more deals
- Shelf-Pass Through (POS Application)
  - One Unit Price
  - Wal*Mart Manager
    - Suppliers paid once a week on ACTUAL items sold
    - Daily Inventory Restock
    - Suppliers (sometimes SameDay) ship to Wal*Mart
- Warehouse-Pass Through
  - Stock some Large Items
  - Distribution Center
    - Delivery may come from supplier
    - Supplier's merchandise unloaded directly onto Wal*Mart Trucks

Wal*Mart System
- NCR 5100M, 96 Nodes; 24 TB Raw Disk; 700-1000 Pentium CPUs
- Number of Rows: > 5 Billion
- Historical Data: 65 weeks (5 Quarters)
- New Daily Volume: Current Apps: 75 Million; New Apps: 100 Million+
- Number of Users: Thousands
- Number of Queries: 60,000 per week

Data Warehouse Architecture


[Figure: architecture diagram. Sources: Relational Databases, ERP Systems, Purchased Data, Legacy Data. Stages: Extraction and Cleansing, Optimized Loader, Data Warehouse Engine, Analyze and Query. A Metadata Repository sits alongside.]

Components of the Warehouse


- Data Extraction and Loading
- The Warehouse
- Analyze and Query -- OLAP Tools
- Metadata
- Data Mining

Loading the Warehouse

Cleaning the data before it is loaded

Source Data
[Figure: kinds of operational/source data: Sequential, Legacy, Relational, External.]

- Typically host based, legacy applications
  - Customized applications, COBOL, 3GL, 4GL
- Point of Contact Devices
  - POS, ATM, Call switches
- External Sources
  - Nielsen's, IMRA, Vendors, Partners

Data Quality - The Reality


- It is tempting to think that all there is to creating a data warehouse is extracting operational data and entering it into a data warehouse
- Nothing could be farther from the truth
- Warehouse data comes from disparate, questionable sources

Data Quality - The Reality


- Legacy systems no longer documented
- Outside sources with questionable quality procedures
- Production systems with no built-in integrity checks and no integration
  - Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan
    - And get it done quickly, we do not have

Data Integration Across Sources


[Figure: four source systems (Savings, Loans, Trust, Credit card) annotated with the integration problems below.]

- Same data, different name
- Different data, same name
- Data found here, nowhere else
- Different keys, same data

Data Transformation Example


[Figure: four applications feed the Data Warehouse, each with its own convention for the same attribute.]

Attribute | appl A | appl B | appl C | appl D
encoding | m, f | 1, 0 | x, y | male, female
unit | pipeline - cm | pipeline - in | pipeline - feet | pipeline - yds
field | balance | bal | currbal | balcurr
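
The three kinds of mismatch on this slide (encoding, unit, field name) can each be handled by a small per-application lookup. The following is a minimal sketch, not a description of any particular tool. Only the codes, units and field names come from the slide; the mapping of `1`/`x` to male and `0`/`y` to female, the choice of centimetres as the warehouse unit, and all Python names are assumptions made for illustration.

```python
# Assumed meaning of each application's gender codes (positional guess from the slide).
GENDER_CODES = {
    "A": {"m": "male", "f": "female"},
    "B": {"1": "male", "0": "female"},
    "C": {"x": "male", "y": "female"},
    "D": {"male": "male", "female": "female"},
}
# Factor converting each application's pipeline unit (cm, in, feet, yds) to cm.
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}
# Name of the balance field in each application.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def to_warehouse(app: str, record: dict) -> dict:
    """Convert one source record into the single warehouse convention."""
    return {
        "gender": GENDER_CODES[app][str(record["gender"])],
        "pipeline_cm": record["pipeline"] * TO_CM[app],
        "balance": record[BALANCE_FIELD[app]],
    }


row_b = to_warehouse("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
row_c = to_warehouse("C", {"gender": "y", "pipeline": 2, "currbal": 90.0})
```

Keeping the rules in tables rather than in code makes it easier to add a fifth application without touching `to_warehouse`.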

Data Integrity Problems


- Same person, different spellings
  - Agarwal, Agrawal, Aggarwal etc...
- Multiple ways to denote company name
  - Persistent Systems, PSPL, Persistent Pvt. LTD.
- Use of different names
  - mumbai, bombay
- Different account numbers generated by different applications for the same customer
- Required fields left blank
- Invalid product codes collected at point of sale
  - manual entry leads to mistakes
  - in case of a problem use 9999999

Data Transformation Terms


- Extracting
- Conditioning
- Scrubbing
- Merging
- Householding
- Enrichment
- Scoring
- Loading
- Validating
- Delta Updating

Data Transformation Terms


- Extracting
  - Capture of data from operational source in "as is" status
  - Sources for data generally in legacy mainframes in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unix
- Conditioning
  - The conversion of data types from the source to the target data store (warehouse) -- always a relational database

Data Transformation Terms


- Householding
  - Identifying all members of a household (living at the same address)
  - Ensures only one mailing is sent to a household
  - Can result in substantial savings: 1 million catalogues at Rs. 50 each cost Rs. 50 million. A 2% saving would save Rs. 1 million
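
As a rough illustration of householding, the sketch below groups customers by a normalized address so that each household receives one catalogue, then reproduces the slide's savings arithmetic. The normalization rule (lower-case, commas and extra spaces removed) and the sample customers are assumptions; real householding tools use far more elaborate address matching.

```python
from collections import defaultdict


def households(customers):
    """Group customer names by a crudely normalized address string."""
    groups = defaultdict(list)
    for name, address in customers:
        key = " ".join(address.lower().replace(",", " ").split())
        groups[key].append(name)
    return dict(groups)


# Hypothetical sample data.
customers = [
    ("A. Rao", "12 MG Road, Pune"),
    ("S. Rao", "12 mg road  pune"),
    ("K. Mehta", "7 Park Street, Mumbai"),
]
mailing_list = households(customers)  # one catalogue per key

# The slide's arithmetic: 1 million catalogues at Rs. 50 each, 2% saved.
total_cost = 1_000_000 * 50      # Rs. 50 million
savings = total_cost * 2 // 100  # Rs. 1 million
```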

Data Transformation Terms


- Enrichment
  - Bring data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, Nielsen, IMRA etc...
- Scoring
  - Computation of the probability of an event, e.g., the chance that a customer will defect to AT&T from MCI, or the chance that a customer is likely to buy a new product

Loads
- After extracting, scrubbing, cleaning, validating etc., the data needs to be loaded into the warehouse
- Issues
  - huge volumes of data to be loaded
  - small time window available when warehouse can be taken off line (usually nights)
  - when to build index and summary tables
  - allow system administrators to monitor, cancel, resume, change load rates
  - recover gracefully -- restart after failure from where you were and without loss of data integrity

Load Techniques

- Use SQL to append or insert new data
  - record at a time interface
  - will lead to random disk I/Os
- Use batch load utility

Load Taxonomy

- Incremental versus Full loads
- Online versus Offline loads

Refresh

- Propagate updates on source data to the warehouse
- Issues:
  - when to refresh
  - how to refresh -- refresh techniques

When to Refresh?

- periodically (e.g., every night, every week) or after significant events
- on every update: not warranted unless the warehouse requires current data (e.g., up-to-the-minute stock quotes)
- refresh policy set by administrator based on user needs and traffic
- possibly different policies for

Refresh Techniques

- Full Extract from base tables
  - read entire source table: too expensive
  - maybe the only choice for legacy systems

How To Detect Changes


- Create a snapshot log table to record ids of updated rows of source data and timestamp
- Detect changes by:
  - Defining after row triggers to update snapshot log when source table changes
  - Using regular transaction log to detect changes to source data
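
The trigger-based variant can be sketched with SQLite through Python's standard `sqlite3` module. This is an illustration under assumptions, not the mechanism of any product named in this tutorial: the table names `account` and `snapshot_log` and the trigger name are hypothetical, and only updates are logged here (a fuller version would add similar triggers for inserts and deletes).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (acctno INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE snapshot_log (
    acctno INTEGER,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- After-row trigger: record the id of every updated source row.
CREATE TRIGGER account_after_update AFTER UPDATE ON account
FOR EACH ROW
BEGIN
    INSERT INTO snapshot_log (acctno) VALUES (NEW.acctno);
END;
""")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [(1, 100.0), (2, 50.0), (3, 75.0)])
conn.execute("UPDATE account SET balance = balance + 10 WHERE acctno = 2")

# The refresh job reads only the rows named in the snapshot log.
changed = [row[0] for row in
           conn.execute("SELECT DISTINCT acctno FROM snapshot_log")]
delta = conn.execute(
    "SELECT acctno, balance FROM account "
    "WHERE acctno IN (SELECT acctno FROM snapshot_log)"
).fetchall()
```

After the refresh has propagated `delta` to the warehouse, the log would be cleared so the next run sees only newer changes.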

Optimizing the Warehouse for Decision Support

Data -- Heart of the Data Warehouse


- Heart of the data warehouse is the data itself!
- Single version of the truth
- Corporate memory
- Data is organized in a way that represents business -- subject orientation

Data Warehouse Structure


- Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables

Data Warehouse Structure


- base customer (1985-87)
  - custid, from date, to date, name, phone, dob
- base customer (1988-90)
  - custid, from date, to date, name, credit rating, employer
- customer activity (1986-89) -- monthly summary
- customer activity detail (1987-89)
  - custid, activity date, amount, clerk id, order no
- customer activity detail (1990-91)
  - custid, activity date, amount, line item no, order no

Time is part of the key of each table.

Data Granularity in Warehouse


- Summarized data stored
  - reduces storage costs
  - reduces CPU usage
  - increases performance, since a smaller number of records has to be processed
  - design around traditional high level reporting needs
  - tradeoff with volume of data to be stored and detailed usage of data

Granularity in Warehouse

- Cannot answer some questions with summarized data
  - Did Anand call Seshadri last month? Not possible to answer if only the total duration of calls by Anand over a month is maintained and individual call details are not.
- Detailed data too voluminous

Granularity in Warehouse

- Tradeoff is to have dual level of granularity
  - Store summary data on disks
    - 95% of DSS processing done against this data
  - Store detail on tapes
    - 5% of DSS processing against this data

Vertical Partitioning
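[Figure: a wide account table (acctno, balance, address, date opened, ...) is split vertically into a frequently accessed table (acctno, balance) and a rarely accessed table (acctno, address, date opened, ...). The frequently accessed table is smaller and so needs less I/O.]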

Derived Data
- Introduction of derived (calculated) data may often help
- Have seen this in the context of dual levels of granularity
- Can keep auxiliary views and indexes to speed up query processing

Schema Design

- Database organization
  - must look like business
  - must be recognizable by business user
  - approachable by business user
  - must be simple
- Schema Types
  - Star Schema
  - Fact Constellation Schema
  - Snowflake schema

Dimension Tables

- Dimension tables
  - Define business in terms already familiar to users
  - Wide rows with lots of descriptive text
  - Small tables (about a million rows)
  - Joined to fact table by a foreign key
  - heavily indexed
  - typical dimensions
    - time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Fact Table

- Central table
  - mostly raw numeric items
  - narrow rows, a few columns at most
  - large number of rows (millions to a billion)
  - Access via dimensions

Star Schema
- A single fact table and for each dimension one dimension table
- Does not capture hierarchies directly

[Figure: star schema. A central fact table (date, custno, prodno, cityname, ...) is linked to the dimension tables time, cust, prod and city.]
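
A star schema of this shape can be tried out with SQLite via Python's standard `sqlite3` module. The sketch below is only an illustration: the key columns (date, custno, prodno, cityname) follow the slide, while the `amount` measure, the descriptive columns (`name`, `category`, `region`) and all sample rows are assumptions, and the time dimension is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust (custno INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE prod (prodno INTEGER PRIMARY KEY, category TEXT);
-- region is kept as a plain column: the hierarchy is not modelled directly.
CREATE TABLE city (cityname TEXT PRIMARY KEY, region TEXT);
CREATE TABLE fact (
    date TEXT,
    custno INTEGER REFERENCES cust(custno),
    prodno INTEGER REFERENCES prod(prodno),
    cityname TEXT REFERENCES city(cityname),
    amount REAL
);
INSERT INTO cust VALUES (1, 'Anand'), (2, 'Seshadri');
INSERT INTO prod VALUES (10, 'Beverage'), (11, 'Household');
INSERT INTO city VALUES ('Pune', 'West'), ('Chennai', 'South');
INSERT INTO fact VALUES
    ('1998-12-01', 1, 10, 'Pune', 100.0),
    ('1998-12-02', 2, 10, 'Chennai', 40.0),
    ('1998-12-03', 1, 11, 'Pune', 60.0);
""")

# A typical star join: constrain and group by dimension attributes,
# aggregate the numeric measure from the fact table.
totals = conn.execute("""
    SELECT city.region, prod.category, SUM(fact.amount)
    FROM fact
    JOIN city ON fact.cityname = city.cityname
    JOIN prod ON fact.prodno = prod.prodno
    GROUP BY city.region, prod.category
    ORDER BY city.region, prod.category
""").fetchall()
```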

Snowflake schema
- Represent dimensional hierarchy directly by normalizing tables.
- Easy to maintain and saves storage

[Figure: snowflake schema. The fact table (date, custno, prodno, cityname, ...) is linked to the dimension tables time, cust, prod and city; city is further linked to a region table.]

Fact Constellation
- Fact Constellation
  - Multiple fact tables that share many dimension tables
  - Booking and Checkout may share many dimension tables in the hotel industry

[Figure: the fact tables Booking and Checkout share the dimension tables Hotels, Customer, Promotion, Travel Agents and Room Type.]

Denormalization
- Normalization in a data warehouse may lead to lots of small tables
- Can lead to excessive I/Os since many tables have to be accessed
- Denormalization is the answer, especially since updates are rare

Creating Arrays

- Many times, each occurrence of a sequence of data is in a different physical location
- Beneficial to collect all occurrences together and store as an array in a single row
- Makes sense only if there are a stable number of occurrences which are accessed together
- In a data warehouse, such situations arise naturally due to time based orientation
  - can create an array by month

Selective Redundancy

- Description of an item can be stored redundantly with order table -- most often item description is also accessed with order table
- Updates have to be careful

Data Extraction and Cleansing


- Extract data from existing operational and legacy data
- Issues:
  - Sources of data for the warehouse
  - Data quality at the sources
  - Merging different data sources
  - Data Transformation
  - How to propagate updates (on the sources) to the warehouse
  - Terabytes of data to be loaded

Scrubbing Data
- Sophisticated transformation tools.
- Used for improving the quality of data
- Clean data is vital for the success of the warehouse
- Example
  - Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
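
One simple way to flag spelling variants like these is approximate string matching. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; it is not how the commercial scrubbing tools work, and the 0.85 threshold and the non-matching name are assumptions chosen for the example. Variants such as "Seshadri S." or "Srinivasan Seshadri" would additionally need token-level handling, which is not shown.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when two names are close enough to be reviewed as duplicates."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


canonical = "Seshadri"
candidates = ["Sheshadri", "Sesadri", "Deshpande"]
flagged = [name for name in candidates if similar(canonical, name)]
```

Here `flagged` keeps the two misspellings and drops the unrelated name; in practice a flagged pair would go to a review step rather than being merged automatically.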

Scrubbing Tools
- Apertus -- Enterprise/Integrator
- Vality -- IPE
- Postal Soft

Partitioning

- Breaking data into several physical units that can be handled separately
- Not a question of whether to do it in data warehouses but how to do it
- Granularity and partitioning are key to effective implementation of a warehouse

Why Partitioning?

- Flexibility in managing data
- Smaller physical units allow
  - easy restructuring
  - free indexing
  - sequential scans if needed
  - easy reorganization
  - easy recovery
  - easy monitoring

Criterion for Partitioning


- Typically partitioned by
  - date
  - line of business
  - geography
  - organizational unit
  - any combination of above

Where to Partition?

- Application level or DBMS level
- Makes sense to partition at application level
  - Allows different definition for each year
    - Important since warehouse spans many years and as business evolves definition changes
  - Allows data to be moved between processing complexes easily


Indexing Techniques
- Bitmap index:
  - A collection of bitmaps -- one for each distinct value of the column
  - Each bitmap has N bits where N is the number of rows in the table
  - The bit for a value v and a row r is set if and only if r has the value v for the indexed attribute

Bitmap Index

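Customer table and its bitmaps:

gender | vote | bitmap gender = F | bitmap vote = Y | AND
M | Y | 0 | 1 | 0
F | Y | 1 | 1 | 1
F | N | 1 | 0 | 0
M | N | 0 | 0 | 0
F | Y | 1 | 1 | 1
F | N | 1 | 0 | 0

Query: select * from customer where gender = 'F' and vote = 'Y'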
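
The slide's example can be reproduced in a few lines. The sketch below stores each bitmap as a Python integer, with bit i standing for row i (rows counted from 0); that representation and the helper names are choices made for this illustration, not how any particular DBMS lays out its bitmaps.

```python
def build_bitmaps(column):
    """Return {value: int}; bit i of the int is set when row i has that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Column values from the slide's Customer table.
gender = ["M", "F", "F", "M", "F", "F"]
vote = ["Y", "Y", "N", "N", "Y", "N"]

gender_index = build_bitmaps(gender)
vote_index = build_bitmaps(vote)

# gender = 'F' AND vote = 'Y' becomes a single bitwise AND.
matches = rows_of(gender_index["F"] & vote_index["Y"])  # rows 1 and 4
```

The conjunction never touches the table itself; only the qualifying rows are fetched afterwards.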

Join Indexes
- Pre-computed joins
- A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
  - e.g., a join index on city dimension of calls fact table
  - correlates for each city the calls (in the calls table) that originated from that city
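
In its simplest form, a join index is a precomputed map from each dimension key to the positions of the matching fact rows. The sketch below illustrates the idea with an in-memory dictionary; the table contents, column layout and the helper `minutes_from` are hypothetical.

```python
from collections import defaultdict

# Hypothetical city dimension: city_id -> city name.
city_dim = {1: "Pune", 2: "Mumbai"}
# Hypothetical calls fact table: (call_id, city_id, minutes).
calls = [(100, 1, 5), (101, 2, 3), (102, 1, 12), (103, 2, 7)]

# Build the join index once: city_id -> positions of calls from that city.
join_index = defaultdict(list)
for position, call in enumerate(calls):
    join_index[call[1]].append(position)


def minutes_from(city_name: str) -> int:
    """Total call minutes for a city, found through the join index."""
    city_ids = [cid for cid, name in city_dim.items() if name == city_name]
    return sum(calls[pos][2]
               for cid in city_ids
               for pos in join_index.get(cid, []))
```

A query for one city reads only the listed fact rows instead of scanning and joining the whole calls table.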

Join Indexes
- Join indexes can also span multiple dimension tables
  - e.g., a join index on city and time dimension of calls fact table

Star Join Processing


- Use join indexes to join dimension and fact table

[Figure: the Calls fact table is joined with the dimensions one at a time: Time (C+T), then Location (C+T+L), then Plan (C+T+L+P).]

Optimized Star Join Processing

[Figure: selections are applied to Time, Location and Plan; the virtual cross product of T, L and P is then joined with Calls.]

Bitmapped Join Processing

[Figure: Time, Location and Plan each produce a bitmap over the Calls table; the bitmaps are combined with AND.]

Intelligent Scan

- Piggyback multiple scans of a relation (Redbrick)
  - piggybacking also done if second scan starts a little while after the first scan

Parallel Query Processing


- Three forms of parallelism
  - Independent
  - Pipelined
  - Partitioned and partition and replicate
- Deterrents to parallelism
  - startup
  - communication

Parallel Query Processing


- Partitioned Data
  - Parallel scans
  - Yields I/O parallelism
- Parallel algorithms for relational operators
  - Joins, Aggregates, Sort
- Parallel Utilities
  - Load, Archive, Update, Parse, Checkpoint, Recovery
- Parallel Query Optimization

Pre-computed Aggregates
- Keep aggregated data for efficiency (pre-computed queries)
- Questions
  - Which aggregates to compute?
  - How to update aggregates?
  - How to use pre-computed aggregates in queries?
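
Two of these questions can be made concrete with a small example: a stored (month, city) aggregate is updated incrementally as new rows arrive, and a coarser per-month query is answered from the aggregate instead of from the detail rows. This is a sketch under assumptions; the data and function names are invented for illustration.

```python
from collections import defaultdict

# Pre-computed aggregate: total sales per (month, city).
by_month_city = defaultdict(int)


def apply_sale(month: str, city: str, amount: int) -> None:
    """Incremental maintenance: fold one new detail row into the aggregate."""
    by_month_city[(month, city)] += amount


for sale in [("Jan", "Pune", 100), ("Jan", "Mumbai", 50),
             ("Feb", "Pune", 70), ("Feb", "Pune", 30)]:
    apply_sale(*sale)


def total_by_month(aggregate) -> dict:
    """Answer a coarser query from the aggregate, without the detail rows."""
    totals = defaultdict(int)
    for (month, _city), amount in aggregate.items():
        totals[month] += amount
    return dict(totals)


monthly = total_by_month(by_month_city)
```

This works because SUM can be re-aggregated; a measure such as a median could not be derived from the stored totals in the same way.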

Pre-computed Aggregates
- Aggregated table can be maintained by the
  - warehouse server
  - middle tier
  - client applications
- Pre-computed aggregates -- special case of materialized views -- same questions and issues remain

SQL Extensions

- Extended family of aggregate functions
  - rank (top 10 customers)
  - percentile (top 30% of customers)
  - median, mode
  - Object Relational Systems allow addition of new aggregate functions

SQL Extensions
- Reporting features
  - running total, cumulative totals
- Cube operator
  - group by on all subsets of a set of attributes (month, city)
  - redundant scan and sorting of data can be avoided
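
What the cube operator computes can be shown with a deliberately naive sketch: one group-by per subset of the attributes, with the placeholder `"ALL"` standing for a rolled-up attribute. The function name, the `"ALL"` marker and the sample rows are assumptions. Unlike a real implementation, this version rescans the rows for every subset, which is exactly the redundancy the slide says a cube operator can avoid.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`; 'ALL' marks a rolled-up attribute."""
    result = defaultdict(float)
    for size in range(len(dims) + 1):
        for kept in combinations(dims, size):
            for row in rows:
                key = tuple(row[d] if d in kept else "ALL" for d in dims)
                result[key] += row[measure]
    return dict(result)


rows = [
    {"month": "Jan", "city": "Pune", "sales": 10.0},
    {"month": "Jan", "city": "Mumbai", "sales": 20.0},
    {"month": "Feb", "city": "Pune", "sales": 5.0},
]
totals = cube(rows, ["month", "city"], "sales")
```

For the two attributes (month, city) this yields four groupings: by month and city, by month, by city, and the grand total.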

Technological Requirements
- Managing Large amounts of data
- Managing multiple media -- storage hierarchy
  - cache (L1 and L2)
  - main memory
  - disks
  - optical disks
  - tapes
  - fiche

Technological Requirements

- Ability to index data at will
  - temporary indices, sparse indices
- Ability to monitor data freely and easily
  - to determine whether reorganization is required
  - to determine if index is poorly structured
  - to determine statistical composition of data
- Need to interface to many technologies
  - for both receiving and passing data

Technological Requirements

- Programmer/Designer control of data
- Parallel Storage/Management of data
- Good Metadata management
- Load the warehouse efficiently
- Use indexes efficiently
- Compaction of data

Technological Requirements

- Compound Keys
- Variable Length data
- Lock Management
  - Need to be able to turn the lock manager on and off
- Index Only processing

Warehouse Server Products


- Oracle 8
- Informix
  - Online Dynamic Server
  - XPS -- Extended Parallel Server
  - Universal Server for object relational applications
- Sybase
  - Adaptive Server 11.5
  - Sybase MPP
  - Sybase IQ

Warehouse Server Products


- Red Brick Warehouse
- Tandem Nonstop
- IBM
  - DB2 MVS
  - Universal Server
  - DB2 400
- Teradata

Server Scalability

- Scalability is the #1 IT requirement for Data Warehousing
- Hardware Platform options
  - SMP
  - Clusters (shared disk)
  - MPP
    - Loosely coupled (shared nothing)
    - Hybrid

SMP Characteristics

- SMP -- Symmetric multi processing -- shared everything
- Multiple CPUs share same memory
- Workload is balanced across CPUs by OS
- Scalability is limited to bandwidth of internal bus and OS architecture
- Not tolerant to failure in processing node
- Architecture is mostly invisible to applications

SMP Benefits

- Lower entry point -- can start with SMP
- Mature technology

MPP Characteristics

- Each node owns a portion of the database
- Nodes are connected via an interconnection network
- Each node can be a single CPU or SMP
- Load balancing done by application
- High scalability due to local processing isolation

MPP Benefits

- High availability
- High scalability

Other Warehouse Related Products


- Connectivity to Sources
  - Apertus
  - Information Builders EDA/SQL
  - Platinum Infohub
  - SAS Connect
  - IBM Data Joiner
  - Oracle Open Connect
  - Informix Express Gateway

Other Warehouse Related Products


- Data extract, clean, transform, refresh
  - CA-Ingres replicator
  - Carleton Passport
  - Prism Warehouse Manager
  - SAS Access
  - Sybase Replication Server
  - Platinum Inforefiner, Infopump

Other Warehouse Related Products


- Query/Reporting Environments
  - Brio/Query
  - Cognos Impromptu
  - Informix Viewpoint
  - CA Visual Express
  - Business Objects
  - Platinum Forest and Trees

Data Warehouse vs. Data Marts

What comes first

From the Data Warehouse to Data Marts


[Figure: layers running from Data (bottom) to Information (top): Organizationally Structured (the Data Warehouse), Departmentally Structured, Individually Structured. A side scale labelled Less/More refers to History, Normalized and Detailed.]

Data Warehouse and Data Marts


[Figure: the Data Warehouse holds organizationally structured, atomic, detailed data; the OLAP Data Mart is lightly summarized and departmentally structured.]

Characteristics of the Departmental Data Mart


- OLAP
- Small
- Flexible
- Customized by Department
- Source is departmentally structured data warehouse

Techniques for Creating Departmental Data Mart


- OLAP
- Subset
- Summarized
- Superset
- Indexed
- Arrayed

Data Mart Centric


[Figure: Data Sources feed the Data Marts (Sales, Finance, Mktg.), which feed the Data Warehouse.]

Problems with Data Mart Centric Solution

If you end up creating multiple warehouses, integrating them is a problem.

True Warehouse

[Figure: Data Sources feed the Data Warehouse, which feeds the Data Marts.]

Myths surrounding OLAP Servers and Data Marts


- Data marts and OLAP servers are departmental solutions supporting a handful of users
- Million dollar massively parallel hardware is needed to deliver fast time for complex queries
- OLAP servers require massive and unwieldy indices
- Complex OLAP queries clog the network with data
- Data warehouses must be at least 100 GB to be effective

Source -- Arbor Software Home Page

Viewing the Data with OLAP

Making Decision Support Possible

Limitations of SQL

"A Freshman in Business needs a Ph.D. in SQL"

-- Ralph Kimball

Typical OLAP Queries


- Write a multi-table join to compare sales for each product line YTD this year vs. last year.
- Repeat the above process to find the top 5 product contributors to margin.
- Repeat the above process to find the sales of a product line to new vs. existing customers.
- Repeat the above process to find the customers that have had negative sales growth.

What Is OLAP?

- Online Analytical Processing -- coined by E.F. Codd in a 1994 paper contracted by Arbor Software*
- Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
- OLAP = Multidimensional Database
- MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
- ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)

* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

The OLAP Market


- Rapid growth in the enterprise market
  - 1995: $700 Million
  - 1997: $2.1 Billion
- Significant consolidation activity among major DBMS vendors
  - 10/94: Sybase acquires ExpressWay
  - 7/95: Oracle acquires Express
  - 11/95: Informix acquires Metacube
  - 1/97: Arbor partners up with IBM
  - 10/96: Microsoft acquires Panorama
- Result: OLAP shifted from small vertical niche to mainstream DBMS category

Strengths of OLAP

- It is a powerful visualization paradigm
- It provides fast, interactive response times
- It is good for analyzing time series
- It can be useful to find some clusters and outliers
- Many vendors offer OLAP tools

OLAP Is FASMI

- Fast
- Analysis
- Shared
- Multidimensional
- Information

Nigel Pendse, Richard Creath - The OLAP Report

Multi-dimensional Data
- "Hey... I sold $100M worth of goods"

[Figure: a data cube with the axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N) and Month (1 to 7).]

Dimensions: Product, Region, Time

Hierarchical summarization paths:

- Product: Industry > Category > Product
- Region: Country > Region > City > Office
- Time: Year > Quarter > Month > Day, with Week as an alternative level above Day

Visualizing Neighbors is simpler


[Figure: the same sales data shown two ways: as a relational table with the columns Month, Store and Sales (Apr for stores 1 to 8, May for stores 1 to 8, Jun, ...) and as a grid of months (Apr to Mar) by store number.]

Slicing and Dicing


[Figure: a cube with the axes Product (Household, Telecomm, Video, Audio), Regions (Europe, Far East, India) and Sales Channel (Retail, Direct, Special), with the Telecomm slice highlighted.]

Roll-up and Drill Down


Levels, from higher level of aggregation to low-level details:

- Sales Channel
- Region
- Country
- State
- Location Address
- Sales Representative

Roll Up moves toward the higher level of aggregation; Drill-Down moves toward the low-level details.
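
Roll-up and drill-down amount to grouping the same detail rows at different depths of a hierarchy. The sketch below uses three of the levels from this slide (region, country, state); the sample sales rows and the function name `aggregate` are assumptions made for illustration.

```python
from collections import defaultdict

# Part of the slide's hierarchy, from coarse to fine.
LEVELS = ["region", "country", "state"]

# Hypothetical detail rows.
sales = [
    {"region": "Asia", "country": "India", "state": "Maharashtra", "amount": 120},
    {"region": "Asia", "country": "India", "state": "Karnataka", "amount": 80},
    {"region": "Asia", "country": "Japan", "state": "Tokyo", "amount": 50},
]


def aggregate(rows, depth):
    """Total `amount` grouped by the first `depth` levels of the hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[level] for level in LEVELS[:depth])] += row["amount"]
    return dict(totals)


by_state = aggregate(sales, 3)    # most detailed view
by_country = aggregate(sales, 2)  # roll up one level
by_region = aggregate(sales, 1)   # roll up again
```

Drilling down is the reverse step: going from `by_country` back to `by_state` requires the detail rows (or a stored finer aggregate), since totals cannot be split after the fact.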

Nature of OLAP Analysis


- Aggregation -- (total sales, percent-to-total)
- Comparison -- Budget vs. Expenses
- Ranking -- Top 10, quartile analysis
- Access to detailed and aggregate data
- Complex criteria specification
- Visualization

Organizationally Structured Data


- Different Departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data

[Figure: marketing, sales, finance and manufacturing each draw on the same detailed data.]

Multidimensional Spreadsheets

- Analysts need spreadsheets that support
  - pivot tables (cross-tabs)
  - drill-down and roll-up
  - slice and dice
  - sort
  - selections
  - derived attributes
- Popular in retail domain

SQL Extensions

- Front-end tools require
  - Extended Family of Aggregate Functions
    - rank, median, mode
  - Reporting Features
    - running totals, cumulative totals
  - Results of multiple group by
    - total sales by month and total sales by product
  - Data Cube

Relational OLAP: 3 Tier DSS


Layer | Component | Role
Database Layer | Data Warehouse | Store atomic data in industry standard RDBMS.
Application Logic Layer | ROLAP Engine | Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
Presentation Layer | Decision Support Client | Obtain multidimensional reports from the DSS Client.

MD-OLAP: 2 Tier DSS


Layer | Component | Role
Database Layer and Application Logic Layer | MDDB Engine | Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
Presentation Layer | Decision Support Client | Obtain multidimensional reports from the DSS Client.

Typical OLAP Problems


Data Explosion Syndrome

[Figure: number of aggregations (axis 0 to 70000) against number of dimensions, with 4 levels in each dimension. Plotted values: 2 dimensions: 16; 3: 81; 4: 256; 5: 1024; 6: 4096; 7: 16384; 8: 65536. Source: Microsoft TechEd98]
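
The growth in this chart follows from simple counting, on the assumption that every combination of one level per dimension is a distinct aggregation: with 4 levels in each of d dimensions there are 4^d combinations. The short sketch below computes that count; the function name is illustrative. Note that the value read from the chart at three dimensions is 81, whereas this formula gives 4^3 = 64; the other plotted values match.

```python
def aggregation_count(levels_per_dimension: int, dimensions: int) -> int:
    """Number of level combinations, assuming one level is chosen per dimension."""
    return levels_per_dimension ** dimensions


# 4 levels in each dimension, 2 to 8 dimensions, as on the chart's axis.
counts = {d: aggregation_count(4, d) for d in range(2, 9)}
```

Each added dimension multiplies the count by four, which is why pre-calculating every aggregate quickly becomes impractical.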

Reporting Tools
- Andyne Computing -- GQL
- Brio -- BrioQuery
- Business Objects -- Business Objects
- Cognos -- Impromptu
- Information Builders Inc. -- Focus for Windows
- Oracle -- Discoverer2000
- Platinum Technology -- SQL*Assist, ProReports
- PowerSoft -- InfoMaker
- SAS Institute -- SAS/Assist
- Software AG -- Esperant
- Sterling Software -- VISION:Data

OLAP and Executive Information Systems


- Andyne Computing -- Pablo
- Arbor Software -- Essbase
- Cognos -- PowerPlay
- Comshare -- Commander OLAP
- Holistic Systems -- Holos
- Information Advantage -- AXSYS, WebOLAP
- Informix -- Metacube
- Microstrategy -- DSS Agent
- Microsoft -- Plato
- Oracle -- Express
- Pilot -- LightShip
- Planning Sciences -- Gentium
- Platinum Technology -- ProdeaBeacon, Forest & Trees
- SAS Institute -- SAS/EIS, OLAP++
- Speedware -- Media

Extraction and Transformation Tools


- Carleton Corporation -- Passport
- Evolutionary Technologies Inc. -- Extract
- Informatica -- OpenBridge
- Information Builders Inc. -- EDA Copy Manager
- Platinum Technology -- InfoRefiner
- Prism Solutions -- Prism Warehouse Manager
- Red Brick Systems -- DecisionScape Formation

Scrubbing Tools

- Apertus -- Enterprise/Integrator
- Vality -- IPE
- Postal Soft

Warehouse Products

- Computer Associates -- CA-Ingres
- Hewlett-Packard -- Allbase/SQL
- Informix -- Informix, Informix XPS
- Microsoft -- SQL Server
- Oracle -- Oracle7, Oracle Parallel Server
- Red Brick -- Red Brick Warehouse
- SAS Institute -- SAS
- Software AG -- ADABAS
- Sybase -- SQL Server, IQ, MPP

4GL's, GUI Builders, and PC Databases


- Information Builders -- Focus
- Lotus -- Approach
- Microsoft -- Access, Visual Basic
- MITI -- SQR/Workbench
- PowerSoft -- PowerBuilder
- SAS Institute -- SAS/AF

Data Mining Products


- DataMind -- neurOagent
- Information Discovery -- IDIS
- SAS Institute -- SAS/Neuronets

Data Warehouse
- W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996
- W.H. Inmon, J. D. Welch, Katherine L. Glassey, Managing the Data Warehouse, John Wiley and Sons, 1997
- Barry Devlin, Data Warehouse from Architecture to Implementation, Addison Wesley Longman, Inc., 1997
- W.H. Inmon, John A. Zachman, Jonathan G. Geiger, Data Stores Data Warehousing and the Zachman Framework, McGraw Hill Series on Data Warehousing and Data Management, 1997
- Ralph Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996

OLAP and DSS


- Erik Thomsen, OLAP Solutions, John Wiley and Sons, 1997
- Microsoft TechEd Transparencies from Microsoft TechEd 98
- Essbase Product Literature
- Oracle Express Product Literature
- Microsoft Plato Web Site
- Microstrategy Web Site

Data Mining

- Michael J.A. Berry and Gordon Linoff, Data Mining Techniques, John Wiley and Sons, 1997
- Peter Adriaans and Dolf Zantinge, Data Mining, Addison Wesley Longman Ltd., 1996
- KDD Conferences

Other Tutorials

- Donovan Schneider, Data Warehousing Tutorial, Tutorial at International Conference for Management of Data (SIGMOD 1996) and International Conference on Very Large Data Bases 97
- Umeshwar Dayal and Surajit Chaudhuri, Data Warehousing Tutorial at International Conference on Very Large Data Bases 1996
- Anand Deshpande and S. Seshadri, Tutorial on Datawarehousing and Data Mining, CSI-97

Useful URLs

- Ralph Kimball's home page
  - http://www.rkimball.com
- Larry Greenfield's Data Warehouse Information Center
  - http://pwp.starnetinc.com/larryg/
- Data Warehousing Institute
  - http://www.dw-institute.com/
- OLAP Council
  - http://www.olapcouncil.com/
