Data
CSI'99
[Chart: market size by segment (Hardware, DBMS, Tools, Total Market), 1995 vs. 1998; vertical axis 0 to 8000. Source: Meta Group]
                1994        1999        Growth
(label lost)    $1,568.0    $6,960.0    34.7%
(label lost)    $65.0       $210.0      26.4%
All revenues are in millions of U.S. dollars and are Gartner Group estimates. Source: Gartner Group, Inc.
[Chart: respondents by size range (5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB), projected 2Q96. Source: META Group, Inc.]
Data Warehouse
- A data warehouse is a
  - subject-oriented
  - integrated
  - time-varying
  - non-volatile

Farmers: Harvest information from known access paths
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
[Diagram: labels recovered: Purchased Data, Legacy Data, Metadata Repository]
Decision Support
- Used to manage and control business
- Data is historical or point-in-time
- Optimized for inquiry rather than update
- Use of the system is loosely defined and can be ad-hoc
- Used by managers and end-users to understand the business and make judgements
Technology                                               Volumes
Legacy application, flat files, mainframes               Small-medium
Legacy applications, hierarchical databases, mainframe   Large
ERP, Client/Server, relational databases                 Very Large
Legacy application, hierarchical database, mainframe     Very Large
ERP, relational databases, AS/400                        Medium
Operational Database: Loans, Credit Card, Trust, Savings
Data Warehouse: Customer, Vendor, Product, Activity
- Warehouse (DSS)
  - Subject Oriented
  - Used to analyze business
  - Summarized and refined
  - Snapshot data
  - Integrated Data
  - Ad-hoc access
  - Knowledge User (Manager)
- Data Warehouse
  - Performance relaxed
  - Large volumes accessed at a time (millions)
  - Mostly Read (Batch Update)
  - Redundancy present
  - Database Size 100 GB - few terabytes
- Data Warehouse
  - Query throughput is the performance metric
  - Hundreds of users
  - Managed by subsets
Why Now?
- Data is being produced
- ERP provides clean data
- The computing power is available
- The computing power is affordable
- The competitive pressures are strong
- Commercial products are available
To summarize ...
- OLTP Systems are used to run a business
    - This case study is from Felipe Carino's (NCR Teradata) presentation made at Stanford Database

  - Inventory Management
  - Merchandise Accounts Payable
  - Purchasing
  - Supplier Promotions: National, Region, Store Level
- Warehouse-Pass Through
  - Stock some Large Items
  - Distribution Center
    - Delivery may come from supplier
    - Suppliers' merchandise unloaded directly onto Wal*Mart Trucks
Wal*Mart System
- NCR 5100M 96 Nodes; 24 TB Raw Disk; 700-1000 Pentium CPUs
- Number of Rows: > 5 Billion
- Historical Data: 65 weeks (5 Quarters)
- New Daily Volume: Current Apps: 75 Million; New Apps: 100 Million +
- Number of Users: Thousands
- Number of Queries: 60,000 per week
[Diagram: labels recovered: ERP Systems, Purchased Data, Legacy Data, Metadata Repository]
Source Data
[Diagram: Operational/Source Data of four kinds: Sequential, Legacy, Relational, External]
- appl A - m,f; appl B - 1,0; appl C - x,y; appl D - male, female
- appl A - pipeline - cm; appl B - pipeline - in; appl C - pipeline - feet; appl D - pipeline - yds
- appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr
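The per-application differences listed above (gender codes, pipeline units, balance field names) could be reconciled with a small mapping layer at integration time. The following is only a sketch under stated assumptions: the warehouse encoding ("m"/"f", centimetres, a field called `balance`) is chosen for illustration, and which of 1/0 or x/y stands for which gender is not given in the slides, so that mapping is hypothetical.

```python
# Hypothetical per-application code tables. Which of 1/0 or x/y means "male"
# is NOT stated in the source; the assignment below is an assumption.
GENDER_CODES = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},
    "C": {"x": "m", "y": "f"},
    "D": {"male": "m", "female": "f"},
}

# Exact conversion factors to centimetres.
CM_PER_UNIT = {"cm": 1.0, "in": 2.54, "feet": 30.48, "yds": 91.44}
PIPELINE_UNIT = {"A": "cm", "B": "in", "C": "feet", "D": "yds"}

# Name of the field that holds the balance in each application.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def integrate(app, record):
    """Map one source record to the single (assumed) warehouse encoding."""
    unit = PIPELINE_UNIT[app]
    return {
        "gender": GENDER_CODES[app][str(record["gender"]).lower()],
        "pipeline_cm": round(record["pipeline"] * CM_PER_UNIT[unit], 2),
        "balance": record[BALANCE_FIELD[app]],
    }


row_b = integrate("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
row_d = integrate("D", {"gender": "Female", "pipeline": 2, "balcurr": 75.5})
print(row_b)
print(row_d)
```

Keeping the mappings in tables rather than in code branches means a fifth application only adds entries; the transformation function itself does not change.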
  - Capture of data from operational source in "as is" status
  - Sources for data generally in legacy mainframes in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unix
- Conditioning
  - The conversion of data types from the source to the target data store (warehouse) -- always a relational
  - Bring data from external sources to augment/enrich operational data. Data sources include Dun and Bradstreet, Nielsen, IMRA, etc.
- Scoring
  - Computation of a probability of an event, e.g., chance that a customer will defect to AT&T from MCI, chance that a customer is likely to buy a new product
Loads
- After extracting, scrubbing, cleaning, validating etc., need to load the data into the warehouse
- Issues
  - huge volumes of data to be loaded
  - small time window available when warehouse can be taken off line (usually nights)
  - when to build index and summary tables
  - allow system administrators to monitor, cancel, resume, change load rates
  - recover gracefully -- restart after failure from where you were and without loss of data integrity
Load Techniques
- Use SQL to append or insert new data
  - record at a time interface
  - will lead to random disk I/Os
Load Taxonomy
- Incremental versus Full loads
- Online versus Offline loads
Refresh
- Propagate updates on source data to the warehouse
- Issues:
  - when to refresh
  - how to refresh -- refresh techniques
When to Refresh?
- periodically (e.g., every night, every week) or after significant events
- on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
- refresh policy set by administrator based on user needs and traffic
- possibly different policies for
Refresh Techniques
- Full Extract from base tables
  - read entire source table: too expensive
  - maybe the only choice for legacy systems
  - customer activity (1986-89) -- monthly summary
  - customer activity detail (1987-89)
  - customer activity detail (1990-91)
    - custid, activity date, amount, clerk id, order no
    - custid, activity date, amount, line item no, order no
Granularity in Warehouse
- Cannot answer some questions with summarized data
  - Did Anand call Seshadri last month? Not possible to answer if only the total duration of Anand's calls over a month is maintained and individual call details are not.
Granularity in Warehouse
- Tradeoff is to have dual level of granularity
  - Store summary data on disks
    - 95% of DSS processing done against this data
Vertical Partitioning
Original table: acctno, balance, address, date opened, ...
Frequently accessed: acctno, balance (smaller table and so less I/O)
Rarely accessed: acctno, address, date opened, ...
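The split described on this slide can be tried out with an in-memory SQLite database. This is a minimal sketch, not a recommended physical design: the table names `account_hot` and `account_cold` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (acctno INTEGER PRIMARY KEY, balance REAL,
                          address TEXT, date_opened TEXT);
    INSERT INTO account VALUES (1, 500.0, '12 Hill Rd', '1994-03-01'),
                               (2, 75.25, '9 Lake St', '1997-11-15');
    -- Frequently accessed columns go into a narrow table.
    CREATE TABLE account_hot AS SELECT acctno, balance FROM account;
    -- Rarely accessed columns go into a second table with the same key.
    CREATE TABLE account_cold AS
        SELECT acctno, address, date_opened FROM account;
""")

# Balance queries now read only the narrow table.
total = conn.execute("SELECT SUM(balance) FROM account_hot").fetchone()[0]

# The full row is still recoverable by joining the two parts on acctno.
full = conn.execute("""
    SELECT h.acctno, h.balance, c.address, c.date_opened
    FROM account_hot h JOIN account_cold c ON h.acctno = c.acctno
    ORDER BY h.acctno
""").fetchall()
print(total)
print(full)
```

The key column is repeated in both parts; that small redundancy is the price of being able to reassemble the row.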
Derived Data
- Introduction of derived (calculated) data may often help
- Have seen this in the context of dual levels of granularity
- Can keep auxiliary views and indexes to speed up query processing
Schema Design
- Database organization
  - must look like business
  - must be recognizable by business user
  - approachable by business user
  - must be simple
- Schema Types
  - Star Schema
  - Fact Constellation Schema
  - Snowflake schema
Dimension Tables
- Dimension tables
  - Define business in terms already familiar to users
  - Wide rows with lots of descriptive text
  - Small tables (about a million rows)
  - Joined to fact table by a foreign key
  - heavily indexed
  - typical dimensions
    - time periods, geographic region (markets, cities), products, customers, salesperson, etc.
Fact Table
- Central table
  - mostly raw numeric items
  - narrow rows, a few columns at most
  - large number of rows (millions to a billion)
  - Access via dimensions
Star Schema
- A single fact table and for each dimension one dimension table
- Does not capture hierarchies directly
[Diagram: fact table (date, custno, prodno, cityname, ...) at the center, joined to dimension tables time, cust, prod, city]
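A star schema along the lines of this slide can be sketched in SQLite. The fact columns date, custno, prodno and cityname come from the slide; everything else (the `amount` measure, the `month` and `region` attributes, the table name `time_dim`, and all sample rows) is assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per dimension.
    CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT);
    CREATE TABLE cust (custno INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prod (prodno INTEGER PRIMARY KEY, prodname TEXT);
    -- region is kept inside city: the hierarchy is not modelled separately.
    CREATE TABLE city (cityname TEXT PRIMARY KEY, region TEXT);

    -- Single fact table: one foreign key per dimension plus a measure.
    CREATE TABLE fact (
        date TEXT REFERENCES time_dim,
        custno INTEGER REFERENCES cust,
        prodno INTEGER REFERENCES prod,
        cityname TEXT REFERENCES city,
        amount REAL
    );

    INSERT INTO time_dim VALUES ('1999-01-05', '1999-01'), ('1999-02-10', '1999-02');
    INSERT INTO cust VALUES (1, 'Anand'), (2, 'Seshadri');
    INSERT INTO prod VALUES (10, 'Cola'), (11, 'Milk');
    INSERT INTO city VALUES ('Pune', 'West'), ('Mumbai', 'West'), ('Chennai', 'South');
    INSERT INTO fact VALUES ('1999-01-05', 1, 10, 'Pune', 100.0),
                            ('1999-01-05', 2, 11, 'Mumbai', 50.0),
                            ('1999-02-10', 1, 11, 'Chennai', 30.0);
""")

# Typical star query: constrain/group by dimension attributes, sum the fact.
by_region_month = conn.execute("""
    SELECT c.region, t.month, SUM(f.amount)
    FROM fact f
    JOIN city c ON f.cityname = c.cityname
    JOIN time_dim t ON f.date = t.date
    GROUP BY c.region, t.month
    ORDER BY c.region, t.month
""").fetchall()
print(by_region_month)
```

Because `region` is stored as a plain column of `city`, the city-to-region hierarchy is only implicit, which is the limitation the slide points out.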
Snowflake schema
- Represent dimensional hierarchy directly by normalizing tables.
- Easy to maintain and saves storage
[Diagram: fact table (date, custno, prodno, cityname, ...) joined to dimension tables time, cust, prod, city; city is further linked to a region table]
Fact Constellation
- Fact Constellation
  - Multiple fact tables that share many dimension tables
  - Booking and Checkout may share many dimension tables in the hotel industry
[Diagram: fact tables Booking and Checkout sharing dimension tables Hotels, Customer, Promotion, Travel Agents, Room Type]
Denormalization
- Normalization in a data warehouse may lead to lots of small tables
- Can lead to excessive I/Os since many tables have to be accessed
- Denormalization is the answer especially since updates are rare
Creating Arrays
- Many times each occurrence of a sequence of data is in a different physical location
- Beneficial to collect all occurrences together and store as an array in a single row
- Makes sense only if there are a stable number of occurrences which are accessed together
- In a data warehouse, such situations arise naturally due to time based orientation
  - can create an array by month
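The "array by month" idea can be illustrated in a few lines. This is a sketch only: the account identifiers and amounts are made up, and filling months that have no occurrence with 0.0 is an assumption.

```python
from collections import defaultdict

# One physical record per (account, month) occurrence -- hypothetical data.
monthly_rows = [
    ("A1", 1, 100.0),
    ("A1", 2, 120.0),
    ("A1", 12, 90.0),
    ("B7", 3, 40.0),
]


def to_year_arrays(rows):
    """Collect the 12 monthly occurrences of each account into one row."""
    arrays = defaultdict(lambda: [0.0] * 12)  # assumed default for missing months
    for account, month, amount in rows:
        arrays[account][month - 1] = amount
    return dict(arrays)


arrays = to_year_arrays(monthly_rows)
print(arrays["A1"])
print(sum(arrays["A1"]))  # a yearly total now reads a single row
```

The number of occurrences is stable (always twelve), which is the condition the slide gives for this layout to make sense.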
Selective Redundancy
- Description of an item can be stored redundantly with order table -- most often item description is also accessed with order table
- Updates have to be handled carefully
  - Sources of data for the warehouse
  - Data quality at the sources
  - Merging different data sources
  - Data Transformation
  - How to propagate updates (on the sources) to the warehouse
  - Terabytes of data to be loaded
Scrubbing Data
- Sophisticated transformation tools.
- Used for improving the quality of data
- Clean data is vital for the success of the warehouse
- Example
  - Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
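The name-variant example above can be approximated with fuzzy string matching from Python's standard library. This is a rough sketch and not how any commercial scrubbing tool works: the canonical spelling, the 0.85 threshold and the rule "any token close to the canonical surname" are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def same_person(name, canonical="seshadri", threshold=0.85):
    """True if any word in `name` is close to the canonical surname."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return any(
        SequenceMatcher(None, token, canonical).ratio() >= threshold
        for token in tokens
    )


variants = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.", "Srinivasan Seshadri"]
matches = [same_person(v) for v in variants]
print(matches)
print(same_person("Deshpande"))
```

A real cleaning step would also need to avoid merging distinct people who share a surname, which a similarity threshold alone cannot do.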
Scrubbing Tools
- Apertus -- Enterprise/Integrator
- Vality -- IPE
- Postal Soft
Partitioning
- Breaking data into several physical units that can be handled separately
- Not a question of whether to do it in data warehouses but how to do it
- Granularity and partitioning are key to effective implementation of a warehouse
Why Partitioning?
- Flexibility in managing data
- Smaller physical units allow
  - easy restructuring
  - free indexing
  - sequential scans if needed
  - easy reorganization
  - easy recovery
  - easy monitoring
Where to Partition?
- Application level or DBMS level
- Makes sense to partition at application level
  - Allows different definition for each year
    - Important since warehouse spans many years and as business evolves definition changes
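Application-level partitioning by year might look like the sketch below, where the application, not the DBMS, decides which yearly table a row belongs to. The table names, the extra `channel` column in the 1999 definition and the sample rows are hypothetical; they only illustrate that each year's table can carry its own definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity_1998 (custid INTEGER, activity_date TEXT, amount REAL);
    -- Hypothetical: the 1999 definition gained a column as the business evolved.
    CREATE TABLE activity_1999 (custid INTEGER, activity_date TEXT, amount REAL,
                                channel TEXT);
""")


def insert_activity(conn, row):
    """Route a row to the table for its year (the application does the routing)."""
    year = int(row["activity_date"][:4])  # int() also guards the table name
    columns = list(row)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO activity_{year} ({', '.join(columns)}) VALUES ({placeholders})"
    conn.execute(sql, [row[c] for c in columns])


insert_activity(conn, {"custid": 1, "activity_date": "1998-06-01", "amount": 40.0})
insert_activity(conn, {"custid": 1, "activity_date": "1999-02-14", "amount": 25.0,
                       "channel": "web"})

# Queries that span years combine the partitions over their common columns.
total = conn.execute("""
    SELECT SUM(amount) FROM (
        SELECT amount FROM activity_1998
        UNION ALL
        SELECT amount FROM activity_1999
    )
""").fetchone()[0]
print(total)
```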
Indexing Techniques
- Bitmap index:
  - A collection of bitmaps -- one for each distinct value of the column
  - Each bitmap has N bits where N is the number of rows in the table
  - The bit for row r in the bitmap of value v is set if and only if r has the value v for the indexed attribute
Bitmap Index
Customer
Column 1:               M F F M F F
Column 2:               Y Y N N Y N
Bitmap (column 1 = F):  0 1 1 0 1 1
Bitmap (column 2 = Y):  1 1 0 0 1 0
AND of the two bitmaps: 0 1 0 0 1 0
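The six-row Customer example above can be reproduced in code. This is a minimal sketch that stores each bitmap as a Python list of 0/1 values; real systems pack and compress the bits. The function name is illustrative.

```python
def build_bitmap_index(column):
    """Return {value: bitmap}, one N-bit bitmap per distinct column value."""
    index = {}
    for row, value in enumerate(column):
        index.setdefault(value, [0] * len(column))[row] = 1
    return index


col1 = ["M", "F", "F", "M", "F", "F"]
col2 = ["Y", "Y", "N", "N", "Y", "N"]

idx1 = build_bitmap_index(col1)
idx2 = build_bitmap_index(col2)

# Rows with col1 = F AND col2 = Y: a bitwise AND of two bitmaps,
# evaluated without touching the table itself.
both = [a & b for a, b in zip(idx1["F"], idx2["Y"])]
matching_rows = [row for row, bit in enumerate(both) if bit]
print(idx1["F"], idx2["Y"], both, matching_rows)
```

Row positions in `matching_rows` are zero-based, so the two hits are the second and fifth customers.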
Join Indexes
- Pre-computed joins
- A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
  - e.g., a join index on city dimension of calls fact table
  - correlates for each city the calls (in the calls table) that originated from that city
Join Indexes
- Join indexes can also span multiple dimension tables
  - e.g., a join index on city and time dimension of calls fact table
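A join index between a city dimension and a calls fact table can be sketched as a precomputed map from each city key to the positions of its fact rows. The sample data and names below are hypothetical, and an in-memory dictionary stands in for whatever structure a real server would use.

```python
from collections import defaultdict

city_dim = {1: "Pune", 2: "Mumbai"}  # city_id -> city name (hypothetical)
calls_fact = [  # (call_id, origin_city_id, minutes) -- hypothetical rows
    (100, 1, 3),
    (101, 2, 7),
    (102, 1, 12),
    (103, 2, 1),
]


def build_join_index(fact_rows, key_pos):
    """Precompute, for each dimension key, the positions of matching fact rows."""
    index = defaultdict(list)
    for position, row in enumerate(fact_rows):
        index[row[key_pos]].append(position)
    return index


join_index = build_join_index(calls_fact, key_pos=1)

# "Calls that originated from Pune" is now a lookup, not a join at query time.
pune_calls = [calls_fact[p] for p in join_index[1]]
pune_minutes = sum(minutes for _, _, minutes in pune_calls)
print(city_dim[1], pune_calls, pune_minutes)
```

An index spanning two dimensions, as in the city and time example, would use a composite key such as (city_id, time_id) in place of the single key.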
C+T+L C+T+L +P
0 0 1  AND  1 1 0
Intelligent Scan
- Piggyback multiple scans of a relation (Redbrick)
  - piggybacking also done if second scan starts a little while after the first scan
- Deterrents to parallelism
  - startup
  - communication
Pre-computed Aggregates
- Keep aggregated data for efficiency (pre-computed queries)
- Questions
  - Which aggregates to compute?
  - How to update aggregates?
  - How to use pre-computed aggregates in queries?
Pre-computed Aggregates
- Aggregated table can be maintained by the
  - warehouse server
  - middle tier
  - client applications
- Pre-computed aggregates -- special case of materialized views -- same questions and issues remain
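Two of the questions raised about pre-computed aggregates, how to update them and how to use them in queries, can be illustrated with a small in-memory summary. This is a sketch under assumptions: the class name, the (city, month) aggregation level and the sample rows are invented, and it handles only additive measures (sums).

```python
from collections import defaultdict


class SalesSummary:
    """Pre-computed SUM(amount) per (city, month), maintained at load time."""

    def __init__(self):
        self.by_city_month = defaultdict(float)

    def load(self, rows):
        # Update: each newly loaded detail row is folded into the aggregate,
        # so the summary never has to be recomputed from scratch.
        for city, month, amount in rows:
            self.by_city_month[(city, month)] += amount

    def total_for_city(self, city):
        # Use: a coarser query (per city) is answered from the finer
        # aggregate without scanning the detail rows.
        return sum(v for (c, _), v in self.by_city_month.items() if c == city)


summary = SalesSummary()
summary.load([("Pune", "Apr", 100.0), ("Pune", "Apr", 50.0), ("Mumbai", "Apr", 70.0)])
summary.load([("Pune", "May", 30.0)])  # a later incremental load
print(summary.by_city_month[("Pune", "Apr")], summary.total_for_city("Pune"))
```

Non-additive functions such as median cannot be maintained this way, which is one reason the choice of which aggregates to keep is a real design question.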
SQL Extensions
- Extended family of aggregate functions
  - rank (top 10 customers)
  - percentile (top 30% of customers)
  - median, mode
  - Object Relational Systems allow addition of new aggregate functions
SQL Extensions
- Reporting features
  - running total, cumulative totals
- Cube operator
  - group by on all subsets of a set of attributes (month, city)
  - redundant scan and sorting of data can be avoided
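The cube operator's "group by on all subsets" can be sketched in plain Python. This is an illustration of the idea rather than any vendor's implementation; the function name, the "ALL" marker for a rolled-up attribute and the sample rows are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dims, measure):
    """SUM(measure) grouped by every subset of dims, in one pass over rows."""
    subsets = [s for n in range(len(dims) + 1) for s in combinations(dims, n)]
    result = defaultdict(float)
    for row in rows:  # the data is scanned once, not once per grouping
        for subset in subsets:
            key = tuple(row[d] if d in subset else "ALL" for d in dims)
            result[key] += row[measure]
    return dict(result)


sales = [
    {"month": "Apr", "city": "Pune", "amount": 10.0},
    {"month": "Apr", "city": "Mumbai", "amount": 20.0},
    {"month": "May", "city": "Pune", "amount": 5.0},
]

totals = cube(sales, ("month", "city"), "amount")
print(totals[("Apr", "ALL")], totals[("ALL", "Pune")], totals[("ALL", "ALL")])
```

For the two attributes (month, city) this produces four groupings: by both, by month only, by city only, and the grand total.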
Technological Requirements
- Managing Large amounts of data
- Managing multiple media -- storage hierarchy
  - cache (L1 and L2)
  - main memory
  - disks
  - optical disks
  - tapes
  - fiche
Technological Requirements
- Ability to index data at will
  - temporary indices, sparse indices

  - to determine whether reorganization is required
  - to determine if index is poorly structured
  - to determine statistical composition of data
Technological Requirements
- Programmer/Designer control of data
- Parallel Storage/Management of data
- Good Metadata management
- Load the warehouse efficiently
- Use indexes efficiently
- Compaction of data
Technological Requirements
- Compound Keys
- Variable Length data
- Lock Management
  - Need to be able to turn the lock manager on and off
  - Online Dynamic Server
  - XPS -- Extended Parallel Server
  - Universal Server for object relational applications
- Sybase
  - Adaptive Server 11.5
  - Sybase MPP
  - Sybase IQ
- Teradata
Server Scalability
- Scalability is the #1 IT requirement for Data Warehousing
- Hardware Platform options
  - SMP
  - Clusters (shared disk)
  - MPP
    - Loosely coupled (shared nothing)
    - Hybrid
SMP Characteristics
- SMP -- Symmetric multi processing -- shared everything
- Multiple CPUs share same memory
- Workload is balanced across CPUs by OS
- Scalability is limited to bandwidth of internal bus and OS architecture
- Not tolerant to failure in processing node
- Architecture is mostly invisible to applications
SMP Benefits
- Lower entry point -- can start with SMP
- Mature technology
MPP Characteristics
- Each node owns a portion of the database
- Nodes are connected via an interconnection network
- Each node can be a single CPU or SMP
- Load balancing done by application
- High scalability due to local processing isolation
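"Each node owns a portion of the database" can be illustrated with hash partitioning, one common way of assigning rows to nodes. The slide does not say how rows are assigned, so the scheme, the node count and the sample rows below are assumptions; plain Python lists stand in for nodes.

```python
import zlib

NUM_NODES = 4  # assumed cluster size


def owner_node(key, num_nodes=NUM_NODES):
    """Deterministically map a row key to the node that owns it."""
    return zlib.crc32(str(key).encode()) % num_nodes


# Each "node" holds only its own portion of the rows (shared nothing).
nodes = [[] for _ in range(NUM_NODES)]
for custno, amount in [(1, 10.0), (2, 20.0), (3, 5.0), (4, 7.5), (5, 2.5)]:
    nodes[owner_node(custno)].append((custno, amount))

# Every node aggregates its local rows; only the small partial results
# would need to cross the interconnect.
partials = [sum(amount for _, amount in part) for part in nodes]
total = sum(partials)
print(partials, total)
```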
MPP benefits
- High availability
- High scalability
Organizationally Structured
Data Warehouse
Data
[Diagram: Data Marts (Sales, Finance, Mktg.) and Data Warehouse]
True Warehouse
[Diagram: labels recovered: Data Sources, Data Warehouse, Data Marts]
-- Ralph Kimball
What Is OLAP?
- Online Analytical Processing -- coined by E.F. Codd in a 1994 paper contracted by Arbor Software*
- Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
- OLAP = Multidimensional Database
- MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
- ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
- Result: OLAP shifted from small vertical niche to mainstream DBMS category
Strengths of OLAP
- It is a powerful visualization paradigm
- It provides fast, interactive response times
- It is good for analyzing time series
- It can be useful to find some clusters and outliers
- Many vendors offer OLAP tools
OLAP Is FASMI
- Fast
- Analysis
- Shared
- Multidimensional
- Information
Multi-dimensional Data
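- "Hey... I sold $100M worth of goods"
[Figure: sales cube with dimensions Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N), and Month (1-7)]
[Figure: dimension hierarchies; level names recovered: Category, Product; Region, City, Office; Quarter, Month, Week, Day]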
[Table: Sales by Month (Apr, May, Jun, ...) and Store (1-8); sales values not recovered]
[Figure: cube with Regions (Europe, Far East, India) and Sales Channel (Retail, Direct, Special)]
Drill-Down
Roll Up
Low-level Details
manufacturing
Multidimensional Spreadsheets
- Analysts need spreadsheets that support
  - pivot tables (cross-tabs)
  - drill-down and roll-up
  - slice and dice
  - sort
  - selections
  - derived attributes
SQL Extensions
- Front-end tools require
  - Extended Family of Aggregate Functions
    - rank, median, mode
  - Reporting Features
    - running totals, cumulative totals
  - Data Cube
Database Layer
Presentation Layer
Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
Database Layer
Presentation Layer
Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
Number of Dimensions
Microsoft TechEd98
Reporting Tools
- Andyne Computing -- GQL
- Brio -- BrioQuery
- Business Objects -- Business Objects
- Cognos -- Impromptu
- Information Builders Inc. -- Focus for Windows
- Oracle -- Discoverer2000
- Platinum Technology -- SQL*Assist, ProReports
- PowerSoft -- InfoMaker
- SAS Institute -- SAS/Assist
- Software AG -- Esperant
- Sterling Software -- VISION:Data

- Microsoft -- Plato
- Oracle -- Express
- Pilot -- LightShip
- Planning Sciences -- Gentium
- Platinum Technology -- ProdeaBeacon, Forest & Trees
- SAS Institute -- SAS/EIS, OLAP++
- Speedware -- Media
Warehouse Products
- Computer Associates -- CA-Ingres
- Hewlett-Packard -- Allbase/SQL
- Informix -- Informix, Informix XPS
- Microsoft -- SQL Server
- Oracle -- Oracle7, Oracle Parallel Server
- Red Brick -- Red Brick Warehouse
- SAS Institute -- SAS
- Software AG -- ADABAS
- Sybase -- SQL Server, IQ, MPP

- Microsoft -- Access, Visual Basic
- MITI -- SQR/Workbench
- PowerSoft -- PowerBuilder
- SAS Institute -- SAS/AF
Data Warehouse
- W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996
- W.H. Inmon, J. D. Welch, Katherine L. Glassey, Managing the Data Warehouse, John Wiley and Sons, 1997
- Barry Devlin, Data Warehouse from Architecture to Implementation, Addison Wesley Longman, Inc., 1997
Data Warehouse
- W.H. Inmon, John A. Zachman, Jonathan G. Geiger, Data Stores Data Warehousing and the Zachman Framework, McGraw Hill Series on Data Warehousing and Data Management, 1997
- Ralph Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996
Data Mining
- Michael J.A. Berry and Gordon Linoff, Data Mining Techniques, John Wiley and Sons, 1997
- Peter Adriaans and Dolf Zantinge, Data Mining, Addison Wesley Longman Ltd., 1996
- KDD Conferences
Other Tutorials
- Donovan Schneider, Data Warehousing Tutorial, Tutorial at International Conference for Management of Data (SIGMOD 1996) and International Conference on Very Large Data Bases 97
- Umeshwar Dayal and Surajit Chaudhuri, Data Warehousing Tutorial at International Conference on Very Large Data Bases 1996
- Anand Deshpande and S. Seshadri, Tutorial on Datawarehousing and Data Mining, CSI-97
Useful URLs
- Ralph Kimball's home page
  - http://www.rkimball.com
- OLAP Council
  - http://www.olapcouncil.com/