
Chapter 17

Data Integration Practices and Relational DBMS Extensions for Data Warehouses

Database Design, Application Development, and Administration, 5th Edition
Copyright 2011 by Michael V. Mannino. All rights reserved.
Outline
Data integration concepts
SQL extensions for multidimensional data
Summary data storage and optimization

Chapter 17: Data Integration Practices and Relational DBMS Extensions Slide 2
Data Integration Concepts

Data sources
Workflow representation
Data cleaning techniques
Data integration architectures and tools
Managing the refresh process

Data Sources
Cooperative
- Notification using triggers
- Requires source system changes
Logged
- Readily available
- Extraneous data in logs
Queryable
- Queries using timestamps
- Requires timestamps in source data
Snapshot
- Periodic dumps of source data
- Significant processing for difference operations
Maintenance Workflow
[Figure: maintenance workflow in three phases. Preparation phase: extraction, transportation, cleaning, auditing. Integration phase: merging, auditing. Update phase: propagation, notification.]

Initial Data Warehouse Load
Major development activity
Different characteristics than refresh
Discover many data quality problems
Difficult to estimate time requirements
Part of data warehouse extensions

Data Quality Measures
Completeness
Lack of ambiguity
Timeliness
Correctness
Consistency

Data Quality Problems
Multiple identifiers
Multiple field names
Different units
Missing values
Orphaned values
Multipurpose fields
Conflicting data
Different update times

Data Cleaning Tasks
Parsing
Correcting
Standardizing
Matching
Consolidating

Parsing
Locates and identifies individual data elements in the source files
Context-free and context-sensitive parsing
Requires pattern specification
Regular expressions (regex) for pattern specification
Isolates parsed data elements in the target record

Regular Expression Basics
Search expression: combination of meta characters, literals, and escape sequences that defines a search pattern
Meta character: character with a special meaning within a search expression
Literal: character to match exactly
Escape sequence: removes the special meaning of a meta character

Simple Examples
Target strings
String1 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
String2 Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)
Regular expressions
m match in String1 but not String2
a/4 match in both String1 and String2
5 [ no match in String1 but match in String2
le match in String1 but no match in String2
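The four outcomes listed above can be checked with a short script. This is a minimal sketch that assumes Python's `re` module as the regex engine (the slides do not name one). In that engine `[` is a meta character, so the third pattern is written with an escape sequence.

```python
import re

string1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
string2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

# The literal "[" must be escaped because "[" opens a character class in Python's re.
patterns = ["m", "a/4", r"5 \[", "le"]

def matches(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None

# Map each pattern to (matches String1, matches String2).
results = {p: (matches(p, string1), matches(p, string2)) for p in patterns}

for pattern, (in_string1, in_string2) in results.items():
    print(f"{pattern!r}: String1={in_string1}, String2={in_string2}")
```

Matching is case sensitive by default, which is why the lowercase `m` is found in "compatible" but not in the capital "M" of "Mozilla".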

Enumeration, Range and Position Meta Characters
[] match any enclosed character once
[Ff] matches F or f
[0123456789] matches any digit
[0-9] matches any digit
[^0-9] matches anything other than a digit
Position characters
^ string beginning: ^win matches window but not erwin
$ string ending: win$ matches erwin but not window
. any character in specified position: win. matches window but not erwin
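The enumeration, range, and position examples can be exercised the same way. The sketch below again assumes Python's `re` module; the test words other than window and erwin are made up for illustration.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, text, expected outcome according to the rules above)
examples = [
    ("[Ff]", "file", True),
    ("[Ff]", "File", True),
    ("[0-9]", "abc7", True),
    ("[^0-9]", "12345", False),  # only digits, so there is no non-digit to match
    ("^win", "window", True),
    ("^win", "erwin", False),
    ("win$", "erwin", True),
    ("win$", "window", False),
    ("win.", "window", True),
    ("win.", "erwin", False),    # no character follows "win" in erwin
]

outcomes = [found(pattern, text) for pattern, text, _ in examples]
for (pattern, text, _), outcome in zip(examples, outcomes):
    print(f"{pattern!r} on {text!r}: {outcome}")
```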

Iteration Meta Characters
?: matches preceding character 0 or 1 times
*: matches preceding character 0 or more times
+: matches preceding character 1 or more times
{n}: matches preceding character exactly n times
{n,m}: matches preceding character a minimum of n and maximum of m times

Iteration Meta Character Examples
colou?r matches both color and colour
tre* matches tree, tread, and trough
tre+ matches tree and tread but not trough
[0-9]{3}-[0-9]{4} matches 123-4567 but not 1234-567
ba{2,3}b matches baab and baaab but not bab or baaaab
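A quick way to confirm these iteration examples is to run each pattern against its words. This sketch assumes Python's `re` module and searches for the pattern anywhere in the word, which is why `tre*` matches trough (the `tr` is followed by zero occurrences of `e`).

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

cases = {
    "colou?r": ["color", "colour"],
    "tre*": ["tree", "tread", "trough"],
    "tre+": ["tree", "tread", "trough"],
    "[0-9]{3}-[0-9]{4}": ["123-4567", "1234-567"],
    "ba{2,3}b": ["baab", "baaab", "bab", "baaaab"],
}

# For each pattern, record which words it matches.
iteration_results = {
    pattern: [found(pattern, word) for word in words]
    for pattern, words in cases.items()
}

for pattern, flags in iteration_results.items():
    print(pattern, dict(zip(cases[pattern], flags)))
```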

Groups in Regular Expressions
Pair of parentheses used to group subpatterns
Usage
Reuse patterns in an expression
Parsing subpatterns
Example
Input: abbc
Pattern: a(b*)c
Group 0: abbc, Group 1: bb
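Both usages of groups can be sketched in a few lines, assuming Python's `re` module. The first part reproduces the `a(b*)c` example; the second shows pattern reuse through a backreference (`\1`), applied to a made-up sentence.

```python
import re

# Parsing subpatterns: group 0 is the whole match, group 1 is the parenthesized part.
match = re.fullmatch(r"a(b*)c", "abbc")
group0 = match.group(0)
group1 = match.group(1)
print("Group 0:", group0, "Group 1:", group1)

# Reusing a pattern: \1 must repeat exactly what group 1 matched,
# so this expression finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it is is here")
doubled_word = doubled.group(1)
print("Doubled word:", doubled_word)
```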

Parsing Example

Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
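One way to picture the parsing step is a set of regular expressions with named groups, one per line of the input record in this example. The sketch below is an illustration only: it assumes a fixed five-line layout and three name parts, and the pattern names and dictionary keys are hypothetical rather than taken from any tool.

```python
import re

source_lines = [
    "Beth Christine Parker, SLS MGR",
    "Regional Port Authority",
    "Federal Building",
    "12800 Lake Calumet",
    "Hedgewisch, IL",
]

# Hypothetical patterns; each named group becomes a field in the target record.
NAME_TITLE = re.compile(
    r"^(?P<first>\w+) (?P<middle>\w+) (?P<last>[\w-]+), (?P<title>.+)$"
)
STREET = re.compile(r"^(?P<number>[0-9]+) (?P<street>.+)$")
CITY_STATE = re.compile(r"^(?P<city>[^,]+), (?P<state>[A-Z]{2})$")

def parse_record(lines):
    """Split a five-line source record into individual data elements."""
    name_line, firm, location, street_line, city_line = lines
    record = {}
    record.update(NAME_TITLE.match(name_line).groupdict())
    record["firm"] = firm
    record["location"] = location
    record.update(STREET.match(street_line).groupdict())
    record.update(CITY_STATE.match(city_line).groupdict())
    return record

parsed = parse_record(source_lines)
for field, value in parsed.items():
    print(f"{field}: {value}")
```

Relying on the line position to decide which pattern applies is a simple form of context-sensitive parsing; real source files usually need more tolerant patterns.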

Correcting Values
Missing values
- Default value
- Typical value: median or average
- Relationship to other attributes
Conflicting values
- More recent value
- More credible source
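The correction rules above can be sketched as small helper functions. Everything here is hypothetical (function names, the tuple layout, and the sample values); it only illustrates filling a missing value with a default or typical value and resolving a conflict by recency or by source credibility.

```python
from statistics import median

def fill_missing(values, strategy="median", default=None):
    """Replace None entries with a default or a typical (median/average) value."""
    known = [v for v in values if v is not None]
    if strategy == "default":
        fill = default
    elif strategy == "median":
        fill = median(known)
    elif strategy == "average":
        fill = sum(known) / len(known)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

def most_recent(candidates):
    """candidates: (value, updated_on, credibility); ISO dates sort chronologically."""
    return max(candidates, key=lambda c: c[1])[0]

def most_credible(candidates):
    """Pick the value whose source has the highest credibility score."""
    return max(candidates, key=lambda c: c[2])[0]

filled = fill_missing([10, None, 30, 20])
conflict = [("Lake Calumet", "2009-03-01", 0.6), ("South Butler Drive", "2010-06-15", 0.9)]
print(filled, most_recent(conflict), most_credible(conflict))
```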
Correcting Example

Parsed Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL

Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Standardization
Applies conversion routines to transform data into its preferred (and consistent) format
Uses both standard and custom business rules
Common standardizations:
- Unit of measure transformations
- Standard abbreviations (state names, titles, street types)
Standardization Example

Corrected Data (before standardization)
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (after standardization)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Matching
Identify duplicates
Difficult matching process: no common identifier
Data mining problem
Known as the record linkage problem or entity identification problem
Many approaches
Entity Identification Applications
Marketing
Law enforcement
Fraud detection
Government

Entity Identification Outcomes
Predicted \ Actual   Match            Non Match
Match                True match       False match
Possible Match       Investigation    Investigation
Non Match            False non match  True non match

Distance Measures for Comparison
Edit distance: number of deletions, insertions, or substitutions required to transform a source string into a target string
N-gram distance: breaks text into subsequences of length N and compares the resulting sets of subsequences
Phonetic distance: codes words into standard consonant sounds and compares the codes
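The first two measures are easy to sketch in code. The edit distance below is the standard dynamic-programming (Levenshtein) formulation. The n-gram distance is one common variant, assumed here for illustration: one minus the Jaccard similarity of the two n-gram sets. Phonetic coding is not shown.

```python
def edit_distance(source, target):
    """Minimum number of deletions, insertions, or substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]

def ngrams(text, n=2):
    """Set of all subsequences of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_distance(a, b, n=2):
    """Assumed variant: 1 - |shared n-grams| / |all n-grams|."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return 1 - len(grams_a & grams_b) / len(grams_a | grams_b)

print(edit_distance("Parker", "Parker-Lewis"))
print(ngram_distance("Beth", "Bethany"))
```

A matching process would typically compare such distances against thresholds to decide between match, possible match, and non match.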

Matching Example

Corrected Data (Data Source #1)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (Data Source #2)
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678

Consolidation
Analyzing and identifying relationships between matched records
Consolidating/merging them into one representation

Consolidation Example

[Figure: Corrected Data (Data Source #1) and Corrected Data (Data Source #2) merged into one consolidated record.]

Consolidated Data
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2
Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678

Household Consolidation

[Figure: records for William Jones, Janet Jones, Karen Jones, and William Jones Jr. grouped into one household.]

Legacy Systems View (3 Clients)

[Figure: three apparently separate clients, identified by Account No. 83451234, Policy No. ME309451-2, and Transaction B498/97.]
The Reality: ONE Client

[Figure: Account No. 83451234, Policy No. ME309451-2, and Transaction B498/97 all belong to the same client.]
Data Integration Tools
Specification based for workflow and transformations
Minimize custom coding
Open source, third party, and DBMS based tools
Evolution from disjoint, specialized tools to integrated development environments
Sometimes called ETL tools

Data Integration Tool Features
Workflow specification
Transformation specification
Job management
Data profiling
Change data capture
Integrated development environment
Repository
Connectivity for databases and files
Data Integration Marketplace
Balanced with proprietary and open-source products from DBMS vendors and third party vendors
Base products and subscription services for open source products
Developing marketplace with acquisitions and new product development

Data Integration Architectures
a) ETL Architecture
[Figure: data sources -> Extract, Transform, Load (ETL engine) -> DW tables]

b) ELT Architecture
[Figure: data sources -> Extract, Load -> Transform (relational DBMS) -> DW tables]

Talend Open Data Solutions
Base product: Talend Open Studio
Extended product and services: Talend Integration Suite and MPx/RTx options
Repository service: Talend On Demand
Data quality/profiling product: Talend Data Quality

Talend Open Studio
Business modeling
Graphical job design using components
Metadata repository
Database connectivity
Job execution

Talend Job Design
Graphical notation for transformations
Palette of transformation components
Data quality components
Database and file components
ELT components: transformations in target database
Processing components
XML components

Excel Job Design Example

Talend Component Details

Excel Job Execution Example

Oracle Integration Tools
Oracle Warehouse Builder
Oracle Data Integrator
INSERT statement extensions
Change data capture
Data pump

Oracle Warehouse Builder
Extraction, Transformation, and Load (ETL)
Data modeling
Data profiling and data quality
Metadata management
Business-level integration of ERP application data
Integration with Oracle business intelligence tools for reporting purposes
Advanced data lineage and impact analysis

INSERT Statement Extensions
MERGE: the ability to update or insert a row conditionally into a table
- Conditions are specified in the ON clause
- Combines insert and update statements
Multiple table INSERT
- External data sources have to be segregated based on logical attributes for insertion into different target objects
- Data can end up in several or exactly one target, depending on the business transformation rules
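The update-or-insert behavior of MERGE can be illustrated with a small runnable sketch. SQLite (used here through Python's `sqlite3` module) has no MERGE statement, so the sketch emulates it with an UPDATE followed by a conditional INSERT. The `Product` table, its columns, and the change rows are hypothetical.

```python
import sqlite3

# Equivalent Oracle statement, for a hypothetical ProductChanges source table:
#   MERGE INTO Product P USING ProductChanges C ON (P.ProdNo = C.ProdNo)
#   WHEN MATCHED THEN UPDATE SET P.ProdName = C.ProdName, P.Price = C.Price
#   WHEN NOT MATCHED THEN INSERT (ProdNo, ProdName, Price)
#        VALUES (C.ProdNo, C.ProdName, C.Price)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (ProdNo TEXT PRIMARY KEY, ProdName TEXT, Price REAL)")
conn.executemany("INSERT INTO Product VALUES (?, ?, ?)",
                 [("P1", "Printer", 150.0), ("P2", "Scanner", 90.0)])

# Hypothetical change rows arriving from a source system.
changes = [("P2", "Scanner", 95.0), ("P3", "Monitor", 210.0)]

def merge_rows(connection, rows):
    """Update a row when the ON condition (same ProdNo) holds; insert it otherwise."""
    for prod_no, name, price in rows:
        cursor = connection.execute(
            "UPDATE Product SET ProdName = ?, Price = ? WHERE ProdNo = ?",
            (name, price, prod_no))
        if cursor.rowcount == 0:  # no existing row matched the condition
            connection.execute("INSERT INTO Product VALUES (?, ?, ?)",
                               (prod_no, name, price))
    connection.commit()

merge_rows(conn, changes)
final_rows = conn.execute(
    "SELECT ProdNo, ProdName, Price FROM Product ORDER BY ProdNo").fetchall()
print(final_rows)
```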

Change Data Capture
Capture and publish committed change data
- Synchronous using triggers
- Asynchronous using log files
Publisher: a DBA who creates and maintains schema objects
- Determines source databases and tables
- Allows subscribers to have controlled access to change data
Subscriber: consumers of the published change data
- Create subscriptions using subscriber views
- Notify when ready to receive a set of change data

Data Pump
Enables very high-speed movement of data and metadata from one database to another
Import: a utility for loading an export dump file set into a target system
Export: a utility for unloading data and metadata into a set of operating system files called a dump file set

Refresh Processing
[Figure: refresh processing. External data sources (primarily dimension changes) and internal data sources (fact and dimension changes) pass through data integration tools and a staging area into the data warehouse. Other labels in the figure: unknown processes, accounting, valid time lag, load time lag.]

Determining the Refresh Frequency
Maximize net refresh benefit
- Value of data timeliness
- Cost of refresh
Satisfy data warehouse and source system constraints

Refresh Constraints
Source access: restrictions on time and frequency
Integration: restrictions that require concurrent reconciliation
Completeness/consistency: loading in the same refresh period
Availability: load scheduling restrictions due to storage capacity, online availability, and server usage
SQL Extensions
Review of GROUP BY clause
Subtotal operators
Analytic functions

GROUP BY Review
Provides summary data, not row data
One row per combination of grouping columns
Aggregate function: statistical function that returns one value for a set of values (min, max, average, sum, count, ...)
Row pattern: grouping columns, aggregate functions
Syntax rule: all non-aggregate columns in SELECT must be in GROUP BY
HAVING clause for conditions involving aggregate functions

GROUP BY Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM SSSales, SSStore, SSTimeDim
WHERE SSSales.StoreId = SSStore.StoreId
AND SSSales.TimeNo = SSTimeDim.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2009
GROUP BY StoreZip, TimeMonth

Motivation for Subtotal Extensions
Lack of subtotals in GROUP BY result
Show subtotals in a data cube
Provide control over subtotals in GROUP BY result
Provide a bridge between relational database representation and data cubes

CUBE/GROUP BY Comparison

SELECT State, Month, SUM(Sales)
GROUP BY State, Month

State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75

Data cube (State by Month)
State  Dec  Jan  Feb  Total
CA     100  -    75   175
CO     150  100  200  450
CN     50   75   -    125
Total  300  175  275  750

GROUP BY Extensions
ROLLUP operator
CUBE operator
GROUPING SETS operator
Other extensions
- Ranking
- Ratios
- Moving summary values

CUBE Operator
Complete set of subtotals
Appropriate for independent dimensions
Not order dependent: specify columns in any order

CUBE Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2009
GROUP BY CUBE (StoreZip, TimeMonth)

CUBE/GROUP BY Comparison
SELECT State, Month, SUM(Sales)
GROUP BY CUBE(State, Month)

State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75
CA     -      175
CO     -      450
CN     -      125
-      Dec    300
-      Jan    175
-      Feb    275
-      -      750

SELECT State, Month, SUM(Sales)
GROUP BY State, Month

State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75
CUBE Operator Details
Two grouping columns (M and N distinct values)
- Maximum of M * N rows
- Maximum subtotal rows: M + N + 1 (not in GROUP BY but in data cube)
- 3 additional SELECT statements with GROUP BY clauses: one SELECT statement per set of subtotals
Three grouping columns (M, N, and P distinct values)
- Maximum of M * N * P rows
- Maximum subtotal rows: M + N + P + M*N + M*P + N*P + 1
- Number of additional SELECT blocks: 7
SELECT Statements without CUBE
SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS
SumSales

GROUP BY StoreZip, TimeMonth
UNION
SELECT StoreZip, 0, SUM(SalesDollar) AS SumSales

GROUP BY StoreZip
UNION
SELECT '', TimeMonth, SUM(SalesDollar) AS SumSales

GROUP BY TimeMonth
UNION
SELECT '', 0, SUM(SalesDollar) AS SumSales
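The equivalence between CUBE and a union of four SELECT blocks can be tried on the State/Month sample data from the comparison slides. The sketch below is an assumption-laden illustration: it uses SQLite through Python's `sqlite3` module (SQLite has no CUBE operator), a hypothetical `SalesSummary` table, NULL as the subtotal placeholder instead of `0` or `''`, and UNION ALL because the blocks cannot produce duplicate rows here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SalesSummary (State TEXT, Month TEXT, Sales INTEGER)")
conn.executemany("INSERT INTO SalesSummary VALUES (?, ?, ?)", [
    ("CA", "Dec", 100), ("CA", "Feb", 75),
    ("CO", "Dec", 150), ("CO", "Jan", 100), ("CO", "Feb", 200),
    ("CN", "Dec", 50), ("CN", "Jan", 75),
])

# One block for the normal GROUP BY result plus one block per set of subtotals.
cube_query = """
SELECT State, Month, SUM(Sales) FROM SalesSummary GROUP BY State, Month
UNION ALL
SELECT State, NULL, SUM(Sales) FROM SalesSummary GROUP BY State
UNION ALL
SELECT NULL, Month, SUM(Sales) FROM SalesSummary GROUP BY Month
UNION ALL
SELECT NULL, NULL, SUM(Sales) FROM SalesSummary
"""
cube_rows = conn.execute(cube_query).fetchall()

# Keep only the subtotal rows, keyed by (State, Month) with None as placeholder.
subtotals = {(state, month): total
             for state, month, total in cube_rows
             if state is None or month is None}

print(len(cube_rows), "rows")
for key, total in subtotals.items():
    print(key, total)
```

The result has 7 detail rows and 7 subtotal rows, matching the CUBE output shown for this data.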

ROLLUP Operator

Appropriate for hierarchical dimensions
Partial set of subtotals
Order dependent: broad (general) to narrow (specific) order

ROLLUP/GROUP BY Comparison
SELECT State, Month, SUM(Sales)
GROUP BY ROLLUP(State, Month)

State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75
CA     -      175
CO     -      450
CN     -      125
-      -      750

SELECT State, Month, SUM(Sales)
GROUP BY State, Month

State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75
ROLLUP Example
SELECT TimeYear, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear BETWEEN 2009 AND 2010
GROUP BY ROLLUP (TimeYear, TimeMonth);

ROLLUP Details
Two grouping columns
- N distinct values in outermost column
- N + 1 subtotal rows
Three grouping columns
- ROLLUP (Col1, Col2, Col3) where Col1 has M distinct values, Col2 has N distinct values, and Col3 has P distinct values
- Maximum subtotal rows: M * N + M + 1
N additional SELECT statements for N rollup columns

SELECT Statements without ROLLUP
SELECT TimeYear, TimeMonth, SUM(SalesDollar)
AS SumSales

GROUP BY TimeYear, TimeMonth
UNION
SELECT TimeYear, 0, SUM(SalesDollar) AS
SumSales

GROUP BY TimeYear
UNION
SELECT 0, 0, SUM(SalesDollar) AS SumSales
GROUPING SETS Operator
Precise control of subtotals
Explicit specification of columns in which totals are produced
Normal grouping columns must be specified if desired
Can produce subtotals only
Specification is similar to explicit specification of SELECT statements

GROUPING SETS Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2009
GROUP BY GROUPING SETS((StoreZip, TimeMonth),
StoreZip, TimeMonth, ());

ROLLUP/GROUPING SETS Comparison
SELECT TimeYear, TimeMonth, SUM(Sales)
GROUP BY ROLLUP(TimeYear, TimeMonth)
SELECT TimeYear, TimeMonth, SUM(Sales)
GROUP BY GROUPING SETS ((TimeYear, TimeMonth),
TimeYear, ())

SELECT TimeYear, TimeMonth, TimeDay, SUM(Sales)
GROUP BY ROLLUP(TimeYear, TimeMonth, TimeDay)
SELECT TimeYear, TimeMonth, TimeDay, SUM(Sales)
GROUP BY GROUPING SETS ((TimeYear, TimeMonth,
TimeDay), (TimeYear, TimeMonth), TimeYear, ())

CUBE/GROUPING SETS Comparison
SELECT State, Month, SUM(Sales)
GROUP BY CUBE(State, Month)
SELECT State, Month, SUM(Sales)
GROUP BY GROUPING SETS ((State, Month), State,
Month,())
-- (State, Month): normal GROUP BY result
SELECT State, Month, Product, SUM(Sales)
GROUP BY CUBE(State, Month, Product)
SELECT State, Month, Product, SUM(Sales)
GROUP BY GROUPING SETS ((State,Month,Product),
(State,Month), (State,Product), (Month,Product),
State, Month, Product,())
-- (State,Month,Product): normal GROUP BY result

Variations of the Grouping Operators
Partial cube
Partial rollup
Composite columns
CUBE and ROLLUP inside a GROUPING SETS operation

Variation Examples
Partial CUBE
- GROUP BY Month, CUBE(Product, State)
- Generates totals on <Month, Product, State>, <Month, Product>, <Month, State>, <Month>
Partial ROLLUP
- GROUP BY State, ROLLUP(Year, Month)
- Generates totals on <State, Year, Month>, <State, Year>, <State>
Composite columns
- GROUP BY ROLLUP(Nation, Region, (State, City))
- Generates totals on <Nation, Region, State, City>, <Nation, Region>, <Nation>, and <>
- Composite column (State, City) is treated as a single column
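The grouping sets generated by ROLLUP, CUBE, and their partial and composite variations can be enumerated mechanically. The helper functions below are hypothetical illustrations, not part of any DBMS; a composite column is written as a Python tuple, and the fixed columns of a partial operator are passed as a prefix.

```python
from itertools import combinations

def _flatten(items):
    """Expand composite columns (tuples) into their member columns."""
    flat = []
    for item in items:
        if isinstance(item, tuple):
            flat.extend(item)
        else:
            flat.append(item)
    return tuple(flat)

def rollup_sets(columns):
    """ROLLUP: every prefix of the column list, longest first, ending with ()."""
    return [_flatten(columns[:k]) for k in range(len(columns), -1, -1)]

def cube_sets(columns):
    """CUBE: every subset of the column list, largest first, ending with ()."""
    return [_flatten(subset)
            for size in range(len(columns), -1, -1)
            for subset in combinations(columns, size)]

def with_prefix(prefix, sets):
    """Partial CUBE/ROLLUP: the fixed columns appear in every grouping set."""
    return [tuple(prefix) + grouping_set for grouping_set in sets]

partial_cube = with_prefix(["Month"], cube_sets(["Product", "State"]))
partial_rollup = with_prefix(["State"], rollup_sets(["Year", "Month"]))
composite = rollup_sets(["Nation", "Region", ("State", "City")])

print(partial_cube)
print(partial_rollup)
print(composite)
```

Each tuple in the output corresponds to one entry of an equivalent GROUPING SETS specification.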

Materialized Views
Stored view
Periodically refreshed with source data
Usually contain summary data
Fast query response for summary data
Appropriate in query dominant environments

SQL Analysis Functions
Ranking
Ratio
Moving totals and averages using the WINDOW clause
Oracle ranking functions
- RANK
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
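The difference between RANK and DENSE_RANK is easiest to see with tied values. The sketch below mimics the two functions in plain Python for a descending ordering; it is an illustration of their semantics with made-up sales totals, not how a DBMS evaluates them.

```python
def rank(values):
    """RANK: tied values share a rank and the following rank is skipped."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def dense_rank(values):
    """DENSE_RANK: tied values share a rank and no rank is skipped."""
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in values]

sales_totals = [450, 175, 175, 125]
ranks = rank(sales_totals)
dense_ranks = dense_rank(sales_totals)
print(ranks, dense_ranks)
```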
Materialized View Example
CREATE MATERIALIZED VIEW MV1
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE AS
SELECT StoreState, TimeYear,
SUM(SalesDollar) AS SumDollar1
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND TimeYear > 2007
GROUP BY StoreState, TimeYear;

Query Rewriting
Substitution process
Materialized view replaces references to fact and dimension tables in a query
Query optimizer must evaluate whether the substitution will improve performance over the original query
More complex than query modification process for traditional views

Query Rewriting Process

[Figure: QueryFD -> Rewrite -> QueryMV -> SQL Engine -> Results]

QueryFD: query that references fact and dimension tables
QueryMV: rewrite of QueryFD such that materialized views are substituted for fact and dimension tables whenever justified by expected performance improvements

Query Modification Process

[Figure: QueryV -> Modify -> QueryB -> SQL Engine -> Results]

QueryV: query that references a view
QueryB: modification of QueryV such that references to the view are replaced by references to base tables

Query Rewriting Principles
Row conditions: query conditions at least as restrictive as MV conditions
Grouping detail: query grouping columns at least as general as MV grouping columns
Grouping dependencies: query columns must match or be derivable by functional dependencies
Aggregate functions: query aggregate functions must match or be derivable from MV aggregate functions
Query Rewriting Example
-- Data warehouse query
SELECT StoreState, TimeYear, SUM(SalesDollar)
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND StoreNation IN ('USA','Canada')
AND TimeYear = 2009
GROUP BY StoreState, TimeYear;
-- Query Rewrite
-- Replace Sales and Time tables with MV1
SELECT DISTINCT MV1.StoreState,
TimeYear, SumDollar1
FROM MV1, Store
WHERE MV1.StoreState = Store.StoreState
AND TimeYear = 2009
AND StoreNation IN ('USA','Canada');
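That the rewritten query returns the same rows as the original can be checked on toy data. The sketch below makes several assumptions: it uses SQLite through Python's `sqlite3` module, stands in for the materialized view with an ordinary table built by CREATE TABLE AS, renames the Time table to a hypothetical `TimeDim`, and invents the sample rows. The rewrite is only valid because each state belongs to one nation (StoreState determines StoreNation).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store (StoreId INTEGER PRIMARY KEY, StoreState TEXT, StoreNation TEXT);
CREATE TABLE TimeDim (TimeNo INTEGER PRIMARY KEY, TimeYear INTEGER);
CREATE TABLE Sales (StoreId INTEGER, TimeNo INTEGER, SalesDollar REAL);
INSERT INTO Store VALUES (1,'CO','USA'), (2,'CO','USA'), (3,'BC','Canada'), (4,'JAL','Mexico');
INSERT INTO TimeDim VALUES (1,2007), (2,2008), (3,2009), (4,2009);
INSERT INTO Sales VALUES (1,3,100), (2,4,50), (3,3,75), (4,3,20), (1,2,10), (3,1,5);
-- Plain table standing in for materialized view MV1.
CREATE TABLE MV1 AS
  SELECT StoreState, TimeYear, SUM(SalesDollar) AS SumDollar1
  FROM Sales, Store, TimeDim
  WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = TimeDim.TimeNo
    AND TimeYear > 2007
  GROUP BY StoreState, TimeYear;
""")

original = conn.execute("""
SELECT StoreState, TimeYear, SUM(SalesDollar)
FROM Sales, Store, TimeDim
WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = TimeDim.TimeNo
  AND StoreNation IN ('USA','Canada') AND TimeYear = 2009
GROUP BY StoreState, TimeYear
""").fetchall()

# DISTINCT removes the duplicates created by joining MV1 to several stores per state.
rewritten = conn.execute("""
SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1
FROM MV1, Store
WHERE MV1.StoreState = Store.StoreState
  AND TimeYear = 2009 AND StoreNation IN ('USA','Canada')
""").fetchall()

print(sorted(original))
print(sorted(rewritten))
```

The query's row conditions (TimeYear = 2009) are more restrictive than the view's (TimeYear > 2007) and the grouping columns match, which is what makes the substitution possible.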
Storage and Optimization Technologies
MOLAP: direct storage and manipulation of data cubes
ROLAP: relational extensions to support multidimensional data
HOLAP: combine MOLAP and ROLAP storage engines

ROLAP Techniques
Bitmap join indexes
Star join optimization
Query rewriting
Summary storage advisors
Parallel query execution

Summary
Maintaining a data warehouse is an important operational problem
Increasing usage of data integration tools
SQL extensions for subtotals and analysis functions
Substantial product extensions for efficient management of summary data and optimization techniques

