Data Integration Architectures and Tools

Chapter 17
Data Integration Practices and Relational DBMS Extensions for Data Warehouses
Database Design, Application Development, and Administration, 5th Edition
Copyright 2011 by Michael V. Mannino. All rights reserved.
Outline
Data integration concepts
SQL extensions for multidimensional data
Summary data storage and optimization
Data Integration Concepts
Data sources
Workflow representation
Data cleaning techniques
Data integration architectures and tools
Managing the refresh process
Data Sources
Cooperative
- Notification using triggers
- Requires source system changes
Logged
- Readily available
- Extraneous data in logs
Queryable
- Queries using timestamps
- Requires timestamps in source data
Snapshot
- Periodic dumps of source data
- Significant processing for difference operations
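The difference operation needed for snapshot sources can be sketched as a comparison of two dumps keyed by primary key. This is a minimal sketch: the store rows and the function name `snapshot_difference` are hypothetical, not taken from the chapter.

```python
def snapshot_difference(old_rows, new_rows):
    """Compare two snapshots keyed by primary key and classify the changes."""
    inserted = {k: v for k, v in new_rows.items() if k not in old_rows}
    deleted = {k: v for k, v in old_rows.items() if k not in new_rows}
    updated = {k: v for k, v in new_rows.items()
               if k in old_rows and old_rows[k] != v}
    return inserted, updated, deleted

# Hypothetical store rows: key is a store id, value is (city, manager).
yesterday = {1: ("Denver", "Lee"), 2: ("Seattle", "Kim"), 3: ("Boise", "Ray")}
today = {1: ("Denver", "Lee"), 2: ("Seattle", "Cho"), 4: ("Austin", "Paz")}

inserted, updated, deleted = snapshot_difference(yesterday, today)
print("inserted:", inserted)
print("updated:", updated)
print("deleted:", deleted)
```

Every row of both snapshots is examined, which is why difference operations become expensive for large sources.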
Maintenance Workflow
Preparation phase
- Extraction
- Transportation
- Cleaning
- Auditing
Integration phase
- Merging
- Auditing
Update phase
- Propagation
- Notification
Initial Data Warehouse Load
Major development activity
Different characteristics than refresh
Discover many data quality problems
Difficult to estimate time requirements
Part of data warehouse extensions
Data Quality Measures
Completeness
Lack of ambiguity
Timeliness
Correctness
Consistency
Data Quality Problems
Multiple identifiers
Multiple field names
Different units
Missing values
Orphaned values
Multipurpose fields
Conflicting data
Different update times
Data Cleaning Tasks
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing
Locates and identifies individual data
elements in the source files
Context free and context sensitive parsing
Requires pattern specification
Regular expressions (regex) for pattern
specification
Isolates parsed data elements in the target
record
Regular Expression Basics
Search expression: combination of meta
characters, literals, and escape characters
that define a search pattern
Meta character: special meaning within a
search expression
Literal: character to match exactly
Escape sequence: removes special
meaning of meta character
Simple Examples
Target strings
String1 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
String2 Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)
Regular expressions
m match in String1 but not String2
a/4 match in both String1 and String2
5 \[ no match in String1 but match in String2 ([ is a meta character, so it is escaped)
le match in String1 but no match in String2
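The four searches above can be checked with Python's `re` module. This is a sketch only; the helper name `found` is an assumption, and the bracket is written as `\[` because an unescaped `[` opens a character class.

```python
import re

string1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
string2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def found(pattern, text):
    """True when the pattern occurs anywhere in the text (case sensitive)."""
    return re.search(pattern, text) is not None

patterns = ["m", "a/4", r"5 \[", "le"]
results = {p: (found(p, string1), found(p, string2)) for p in patterns}

for pattern, (in_string1, in_string2) in results.items():
    print(f"{pattern!r}: String1={in_string1}, String2={in_string2}")
```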
Enumeration, Range and Position Meta Characters
[] match any enclosed character once
[Ff] matches F or f
[0123456789] matches any digit
[0-9] matches any digit
[^0-9] matches anything other than a digit
Position characters
^ string beginning: ^win matches window but not erwin
$ string ending: win$ matches erwin but not window
. any character in specified position: win. matches window but
not erwin
Iteration Meta Characters
?: matches preceding character 0 or 1 times
*: matches preceding character 0 or more times
+: matches preceding character 1 or more times
{n} matches preceding character exactly n times
{n,m} matches preceding character a minimum of n and
maximum of m times
Iteration Meta Character Examples
colou?r matches both color and colour
tre* matches tree, tread, and trough
tre+ matches tree and tread but not trough
[0-9]{3}-[0-9]{4} matches 123-4567 but not
1234-567
ba{2,3}b matches baab and baaab but not
bab or baaaab
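The position and iteration examples can be verified the same way. The sketch below uses `re.search`, which looks for the pattern anywhere in the string; the `checks` table simply restates the claims from the slides.

```python
import re

def matches(pattern, text):
    """True when the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, target string, expected result as stated on the slides)
checks = [
    (r"^win", "window", True), (r"^win", "erwin", False),
    (r"win$", "erwin", True), (r"win$", "window", False),
    (r"win.", "window", True), (r"win.", "erwin", False),
    (r"colou?r", "color", True), (r"colou?r", "colour", True),
    (r"tre*", "trough", True), (r"tre+", "trough", False),
    (r"tre+", "tread", True),
    (r"[0-9]{3}-[0-9]{4}", "123-4567", True),
    (r"[0-9]{3}-[0-9]{4}", "1234-567", False),
    (r"ba{2,3}b", "baab", True), (r"ba{2,3}b", "baaab", True),
    (r"ba{2,3}b", "bab", False), (r"ba{2,3}b", "baaaab", False),
]

for pattern, text, expected in checks:
    print(f"{pattern!r} on {text!r}: {matches(pattern, text)} (slide says {expected})")
```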
Groups in Regular Expressions
Pair of parentheses used to group subpatterns
Usage
Reuse patterns in an expression
Parsing subpatterns
Example
Input: abbc
Pattern: a(b*)c
Group 0: abbc, Group 1: bb
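In Python the group example looks like this (a short sketch using the standard `re` module):

```python
import re

match = re.search(r"a(b*)c", "abbc")
group0 = match.group(0)  # the whole match
group1 = match.group(1)  # the text captured by the first pair of parentheses
print(group0, group1)
```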
Parsing Example
Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
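One way to parse the source record above is with named groups. This is a sketch under strong assumptions: the three patterns and the fixed five-line layout are hypothetical and fit only this input, whereas commercial parsers rely on much richer pattern libraries.

```python
import re

source_lines = [
    "Beth Christine Parker, SLS MGR",
    "Regional Port Authority",
    "Federal Building",
    "12800 Lake Calumet",
    "Hedgewisch, IL",
]

# Hypothetical patterns for this record layout only.
name_re = re.compile(r"^(?P<first>\w+) (?P<middle>\w+) (?P<last>[\w-]+), (?P<title>.+)$")
street_re = re.compile(r"^(?P<number>[0-9]+) (?P<street>.+)$")
city_re = re.compile(r"^(?P<city>[^,]+), (?P<state>[A-Z]{2})$")

def parse_record(lines):
    """Isolate the individual data elements of one five-line source record."""
    name = name_re.match(lines[0]).groupdict()
    street = street_re.match(lines[3]).groupdict()
    city = city_re.match(lines[4]).groupdict()
    return {
        "First Name": name["first"], "Middle Name": name["middle"],
        "Last Name": name["last"], "Title": name["title"],
        "Firm": lines[1], "Location": lines[2],
        "Number": street["number"], "Street": street["street"],
        "City": city["city"], "State": city["state"],
    }

parsed = parse_record(source_lines)
for field, value in parsed.items():
    print(f"{field}: {value}")
```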
Correcting Values
Missing values
- Default value
- Typical value: median or average
- Relationship to other attributes
Conflicting values
- More recent value
- More credible source
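Two of these correction rules can be sketched briefly: filling a missing value with the median, and resolving a conflict by taking the more recent value. The function names, sample values, and dates are hypothetical.

```python
from statistics import median

def fill_missing(values, default=None):
    """Replace None with the median of the known values (or a default)."""
    known = [v for v in values if v is not None]
    fill = median(known) if known else default
    return [fill if v is None else v for v in values]

def resolve_conflict(candidates):
    """Pick the most recently updated value from (value, ISO date) pairs."""
    return max(candidates, key=lambda candidate: candidate[1])[0]

filled = fill_missing([10, None, 30, 20])
street = resolve_conflict([("Lake Calumet", "2009-01-15"),
                           ("South Butler Drive", "2010-06-01")])
print(filled, street)
```

ISO-formatted date strings compare correctly as text, which keeps the sketch free of date parsing.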
Correcting Example
Parsed Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL

Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Standardization
Applies conversion routines to transform
data into its preferred (and consistent)
format
Uses both standard and custom business
rules
Common standardizations:
Unit of measure transformations
Standard abbreviations (state names, titles,
street types)
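Standard abbreviations can be applied with a lookup table. The sketch below is an assumption about how such a rule might look; the `ABBREVIATIONS` table holds only the entries needed for the running example.

```python
# Hypothetical abbreviation table; a real rule set would be far larger.
ABBREVIATIONS = {"South": "S.", "Drive": "Dr.", "SLS": "Sales", "MGR": "Mgr."}

def standardize(text):
    """Replace each word that has a standard form; keep other words as is."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())

street = standardize("South Butler Drive")
title = standardize("SLS MGR")
print(street, "|", title)
```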
Standardization Example
Corrected Data (before standardization)
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (after standardization)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Matching
Identify duplicates
Difficult matching process: no common
identifier
Data mining problem
Known as the record linkage problem or entity
identification problem
Many approaches
Entity Identification Applications
Marketing
Law enforcement
Fraud detection
Government
Entity Identification Outcomes
Predicted \ Actual Match Non Match
Match True match False match
Possible Match Investigation Investigation
Non Match False non match True non match
Distance Measures for Comparison
Edit distance: number of deletions,
insertions, or substitutions required to
transform a source string into a target
string.
N-gram distance: breaks text into
subsequences of length N
Phonetic distance: codes words into
standard consonant sounds
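The first two measures can be sketched as follows. The edit distance is the standard dynamic-programming (Levenshtein) computation. The n-gram comparison uses a Jaccard ratio over bigram sets, which is one common choice and an assumption here. The thresholds in `classify` are hypothetical and only show how a score could map to the match, possible match, and non match outcomes. Phonetic coding is omitted.

```python
def edit_distance(source, target):
    """Minimum number of deletions, insertions, or substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def ngrams(text, n=2):
    """Set of all subsequences of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Shared n-grams divided by all n-grams (Jaccard ratio, an assumption)."""
    grams_a, grams_b = ngrams(a.lower(), n), ngrams(b.lower(), n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def classify(similarity, match_at=0.8, non_match_below=0.4):
    """Map a similarity score to an outcome using hypothetical thresholds."""
    if similarity >= match_at:
        return "match"
    if similarity < non_match_below:
        return "non match"
    return "possible match"

print(edit_distance("Parker", "Parker-Lewis"))
print(ngram_similarity("Beth", "Bethany"), classify(ngram_similarity("Beth", "Bethany")))
```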
Matching Example
Corrected Data (Data Source #1)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (Data Source #2)
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
Consolidation
Analyzing and identifying relationships
between matched records
Consolidating/merging them into one
representation.
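A very simple consolidation rule is sketched below: for each field, keep the longer non-empty value from the two matched records. This rule is an assumption chosen for illustration; real tools apply configurable survivorship rules.

```python
def consolidate(record_a, record_b):
    """Merge two matched records, keeping the longer value for each field."""
    fields = list(record_a) + [f for f in record_b if f not in record_a]
    merged = {}
    for field in fields:
        a, b = record_a.get(field, ""), record_b.get(field, "")
        merged[field] = b if len(b) > len(a) else a
    return merged

source1 = {"Last Name": "Parker", "Title": "Sales Mgr.", "Street": "S. Butler Dr."}
source2 = {"Last Name": "Parker-Lewis", "Title": "",
           "Street": "S. Butler Dr., Suite 2", "Phone": "708-555-1234"}

merged = consolidate(source1, source2)
for field, value in merged.items():
    print(f"{field}: {value}")
```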
Consolidation Example
Inputs: Corrected Data (Data Source #1) and Corrected Data (Data Source #2)

Consolidated Data
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2, Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
Household Consolidation
[Figure: household members William Jones, Janet Jones, Karen Jones, and William Jones Jr.]
Legacy Systems View (3 Clients): Account No. 83451234, Policy No. ME309451-2, and Transaction B498/97 appear as separate clients.
The Reality (ONE Client): Account No. 83451234, Policy No. ME309451-2, and Transaction B498/97 belong to one client.
Data Integration Tools
Specification based for workflow and
transformations
Minimize custom coding
Open source, third party, and DBMS
based tools
Evolution from disjoint, specialized tools to
integrated development environments
Sometimes called ETL tools
Data Integration Tool Features
Workflow specification
Transformation specification
Job management
Data profiling
Change data capture
Integrated development environment
Repository
Connectivity for databases and files
Data Integration Marketplace
Balanced with proprietary and open-
source products from DBMS vendors and
third party vendors
Base products and subscription services
for open source products
Developing marketplace with acquisitions
and new product development
Data Integration Architectures
a) ETL Architecture
Data sources -> Extract, Transform, Load (ETL engine) -> DW tables
b) ELT Architecture
Data sources -> Extract, Load -> Transform (relational DBMS) -> DW tables
Talend Open Data Solutions
Base product: Talend Open Studio
Extended product and services: Talend
Integration Suite and MPx/RTx options
Repository service: Talend On Demand
Data quality/profiling product: Talend Data
Quality
Talend Open Studio
Business modeling
Graphical job design using components
Metadata repository
Database connectivity
Job execution
Talend Job Design
Graphical notation for transformations
Palette of transformation components
Data quality components
Database and file components
ELT components: transformations in target
database
Processing components
XML components
Excel Job Design Example
Talend Component Details
Excel Job Execution Example
Oracle Integration Tools
Oracle Warehouse Builder
Oracle Data Integrator
INSERT statement extensions
Change data capture
Data pump
Oracle Warehouse Builder
Extraction, Transformation, and Load (ETL)
Data modeling
Data profiling and data quality
Metadata management
Business-level integration of ERP application data
Integration with Oracle business intelligence tools for
reporting purposes
Advanced data lineage and impact analysis
INSERT Statement Extensions
MERGE: the ability to update or insert a row
conditionally into a table.
Conditions are specified in the ON clause
Combines insert and update statements
Multiple table INSERT
External data sources have to be segregated based on logical
attributes for insertion into different target objects
Data can end up in several or exactly one target, depending on
the business transformation rules
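The conditional update-or-insert behavior of MERGE can be sketched in Python. This is not Oracle syntax: the function `merge` and the store rows are hypothetical, and the ON condition is assumed to be equality on a single key column.

```python
def merge(target, source, key):
    """Update target rows whose key matches a source row; insert the rest."""
    index = {row[key]: row for row in target}
    counts = {"updated": 0, "inserted": 0}
    for row in source:
        match = index.get(row[key])
        if match is not None:      # ON condition satisfied: update branch
            match.update(row)
            counts["updated"] += 1
        else:                      # no matching row: insert branch
            new_row = dict(row)
            target.append(new_row)
            index[row[key]] = new_row
            counts["inserted"] += 1
    return counts

target = [{"StoreId": 1, "City": "Denver"}]
source = [{"StoreId": 1, "City": "Boulder"}, {"StoreId": 2, "City": "Austin"}]
counts = merge(target, source, key="StoreId")
print(counts, target)
```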
Change Data Capture
Capture and publish committed change data
Synchronous using triggers
Asynchronous using log files
Publisher: a DBA who creates and maintains schema objects
- Determines source databases and tables
- Allows subscribers to have controlled access to change data
Subscriber: consumer of the published change data
- Creates subscriptions using subscriber views
- Notifies when ready to receive a set of change data
Data Pump
Enables very high-speed movement of
data and metadata from one database to
another
Import: a utility for loading an export dump
file set into a target system
Export: utility for unloading data and
metadata into a set of operating system
files called a dump file set
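Refresh Processing
[Figure: external data sources (unknown processes) produce primarily dimension changes; internal data sources (accounting) produce fact and dimension changes. Changes pass through the data integration tools and the staging area into the data warehouse. The figure marks a valid time lag and a load time lag along this path.]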
Determining the Refresh Frequency
Maximize net refresh benefit
- Value of data timeliness
- Cost of refresh
Satisfy data warehouse and source system constraints
Refresh Constraints
Source access: restrictions on time and
frequency
Integration: restrictions that require
concurrent reconciliation
Completeness/consistency: loading in the
same refresh period
Availability: load scheduling restrictions
due to storage capacity, online availability,
and server usage
SQL Extensions
Review of GROUP BY clause
Subtotal operators
Analytic functions
GROUP BY Review
Provides summary data not row data
One row per combination of grouping columns
Aggregate function: statistical function that returns one
value for a set of values (min, max, average, sum, count, ...)
Row pattern: grouping columns, aggregate functions
Syntax rule: all non aggregate columns in SELECT
must be in GROUP BY
HAVING clause for conditions involving aggregate
functions
GROUP BY Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM SSSales, SSStore, SSTimeDim
WHERE SSSales.StoreId = SSStore.StoreId
AND SSSales.TimeNo = SSTimeDim.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2009
GROUP BY StoreZip, TimeMonth
Motivation for Subtotal Extensions
Lack of subtotals in GROUP BY result
Show subtotals in a data cube
Provide control over subtotals in GROUP
BY result
Provide a bridge between relational
database representation and data cubes
CUBE/GROUP BY Comparison
SELECT State, Month, SUM(Sales)
...
GROUP BY State, Month

GROUP BY result:
State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75

Data cube (State by Month):
State  Dec  Jan  Feb  Total
CA     100  -    75   175
CO     150  100  200  450
CN     50   75   -    125
Total  300  175  275  750
GROUP BY Extensions
ROLLUP operator
CUBE operator
GROUPING SETS operator
Other extensions
Ranking
Ratios
Moving summary values
CUBE Operator
Complete set of subtotals
Appropriate for independent dimensions
Not order dependent: specify columns in
any order
CUBE Example
SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND TimeYear = 2009
GROUP BY CUBE (StoreZip, TimeMonth)
CUBE/GROUP BY Comparison
GROUP BY State, Month result:
State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75

GROUP BY CUBE(State, Month) result:
State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75
CA     -      175
CO     -      450
CN     -      125
-      Dec    300
-      Jan    175
-      Feb    275
-      -      750
CUBE Operator Details
Two grouping columns with M and N distinct values
- Maximum of M * N rows in the GROUP BY result
- Maximum subtotal rows: M + N + 1 (not in GROUP BY result but in data cube)
- 3 additional SELECT statements, one per set of subtotals (the grand total needs no GROUP BY clause)
Three grouping columns with M, N, and P distinct values
- Maximum of M * N * P rows in the GROUP BY result
- Maximum subtotal rows: M + N + P + M*N + M*P + N*P + 1
- Number of additional SELECT blocks: 7
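The subtotal counts can be checked against the State/Month data from the comparison slides. The sketch below emulates CUBE in Python by aggregating once per subset of the grouping columns. The helper names are hypothetical, and `None` stands in for the "-" shown in subtotal rows.

```python
from collections import defaultdict
from itertools import combinations

# State, Month, Sales rows from the comparison slides.
sales = [("CA", "Dec", 100), ("CA", "Feb", 75), ("CO", "Dec", 150),
         ("CO", "Jan", 100), ("CO", "Feb", 200), ("CN", "Dec", 50),
         ("CN", "Jan", 75)]
COLUMNS = ("State", "Month")

def cube_sets(columns):
    """All subsets of the grouping columns, most detailed first."""
    return [combo for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]

def aggregate(rows, grouping_set):
    """SUM(Sales) per group; columns outside the grouping set become None."""
    totals = defaultdict(int)
    for state, month, amount in rows:
        record = {"State": state, "Month": month}
        key = tuple(record[c] if c in grouping_set else None for c in COLUMNS)
        totals[key] += amount
    return dict(totals)

cube_result = {}
for grouping_set in cube_sets(COLUMNS):
    cube_result.update(aggregate(sales, grouping_set))

detail_rows = len(aggregate(sales, COLUMNS))
subtotal_rows = len(cube_result) - detail_rows
print(len(cube_result), "rows,", subtotal_rows, "subtotal rows")
```

With M = 3 states and N = 3 months, the subtotal count equals M + N + 1 = 7.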
SELECT Statements without CUBE
SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
...
GROUP BY StoreZip, TimeMonth
UNION
SELECT StoreZip, 0, SUM(SalesDollar) AS SumSales
...
GROUP BY StoreZip
UNION
SELECT '', TimeMonth, SUM(SalesDollar) AS SumSales
...
GROUP BY TimeMonth
UNION
SELECT '', 0, SUM(SalesDollar) AS SumSales
...
ROLLUP Operator
Appropriate for hierarchical dimensions

Partial set of subtotals
Order dependent: broad (general) to
narrow (specific) order
ROLLUP/GROUP BY Comparison
SELECT State, Month, SUM(Sales) with GROUP BY State, Month:
State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75

GROUP BY ROLLUP(State, Month) result:
State  Month  SUM(Sales)
CA     Dec    100
CA     Feb    75
CO     Dec    150
CO     Jan    100
CO     Feb    200
CN     Dec    50
CN     Jan    75
CA     -      175
CO     -      450
CN     -      125
-      -      750
ROLLUP Example
SELECT TimeYear, TimeMonth, SUM(SalesDollar) AS SumSales
...
AND TimeYear BETWEEN 2009 AND 2010
GROUP BY ROLLUP (TimeYear, TimeMonth);
ROLLUP Details
Two grouping columns
- N distinct values in the outermost column
- Maximum of N + 1 subtotal rows
Three grouping columns
- ROLLUP (Col1, Col2, Col3) where Col1 has M distinct values, Col2 has N distinct values, and Col3 has P distinct values
- Maximum subtotal rows: M * N + M + 1
N additional SELECT statements for N rollup columns
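A ROLLUP list expands into one grouping set per prefix of the column list, which is where these counts come from. The sketch below is illustrative; the function names and the distinct-value counts in the usage line are hypothetical.

```python
from math import prod

def rollup_sets(columns):
    """Grouping sets produced by ROLLUP: every prefix, longest first."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

def max_rollup_subtotal_rows(distinct_counts):
    """Upper bound on subtotal rows: one product per proper prefix."""
    return sum(prod(distinct_counts[:size])
               for size in range(len(distinct_counts)))

sets = rollup_sets(("TimeYear", "TimeMonth", "TimeDay"))
print(sets)
# Hypothetical counts: M = 2 years, N = 12 months, P = 31 days.
print(max_rollup_subtotal_rows([2, 12, 31]))
```

The proper prefixes contribute 1, M, and M * N rows, matching the bound M * N + M + 1. The number of proper prefixes also equals the number of rollup columns, hence N additional SELECT statements for N columns.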
SELECT Statements without ROLLUP
SELECT TimeYear, TimeMonth, SUM(SalesDollar) AS SumSales
...
GROUP BY TimeYear, TimeMonth
UNION
SELECT TimeYear, 0, SUM(SalesDollar) AS SumSales
...
GROUP BY TimeYear
UNION
SELECT 0, 0, SUM(SalesDollar) AS SumSales
...
GROUPING SETS Operator
Precise control of subtotals
Explicit specification of columns in which
totals are produced
Normal grouping columns must be
specified if desired
Can produce subtotals only
Specification is similar to explicit
specification of SELECT statements
GROUPING SETS Example
SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
...
AND TimeYear = 2009
GROUP BY GROUPING SETS((StoreZip, TimeMonth), StoreZip, TimeMonth, ());
ROLLUP/GROUPING SETS Comparison
SELECT TimeYear, TimeMonth, SUM(Sales)
...
GROUP BY ROLLUP(TimeYear, TimeMonth)
SELECT TimeYear, TimeMonth, SUM(Sales)
...
GROUP BY GROUPING SETS ((TimeYear, TimeMonth), TimeYear, ())
SELECT TimeYear, TimeMonth, TimeDay, SUM(Sales)
...
GROUP BY ROLLUP(TimeYear, TimeMonth, TimeDay)
SELECT TimeYear, TimeMonth, TimeDay, SUM(Sales)
...
GROUP BY GROUPING SETS ((TimeYear, TimeMonth, TimeDay),
(TimeYear, TimeMonth), TimeYear, ())
CUBE/GROUPING SETS Comparison
SELECT State, Month, SUM(Sales)
...
GROUP BY CUBE(State, Month)
SELECT State, Month, SUM(Sales)
...
GROUP BY GROUPING SETS ((State, Month), State, Month, ())
-- (State, Month): normal GROUP BY result
SELECT State, Month, Product, SUM(Sales)
...
GROUP BY CUBE(State, Month, Product)
SELECT State, Month, Product, SUM(Sales)
...
GROUP BY GROUPING SETS ((State, Month, Product),
(State, Month), (State, Product), (Month, Product),
State, Month, Product, ())
-- (State, Month, Product): normal GROUP BY result
Variations of the Grouping Operators
Partial cube
Partial rollup
Composite columns
CUBE and ROLLUP inside a GROUPING SETS operation
Variation Examples
Partial CUBE
GROUP BY Month, CUBE(Product, State)
Generates totals on <Month, Product, State>, <Month,
Product>, <Month, State>, <Month>
Partial ROLLUP
GROUP BY State, ROLLUP(Year, Month)
Generates totals on <State, Year, Month>, <State, Year>,
<State>
Composite columns
GROUP BY ROLLUP(Nation, Region, (State, City))
Generates totals on <Nation, Region, State, City>, <Nation,
Region>, <Nation>, and <>.
Composite column (State, City) is treated as a single column.
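The three variations can be expanded into explicit grouping sets with a few helper functions. This is a sketch; `cube`, `rollup`, `with_prefix`, and `flatten` are hypothetical helpers, and a composite column is written as a Python tuple.

```python
from itertools import combinations

def flatten(items):
    """Expand composite columns (tuples) into their member columns."""
    out = []
    for item in items:
        out.extend(item if isinstance(item, tuple) else (item,))
    return tuple(out)

def cube(*items):
    """Every subset of the items, most detailed first."""
    return [flatten(combo) for size in range(len(items), -1, -1)
            for combo in combinations(items, size)]

def rollup(*items):
    """Every prefix of the items, longest first."""
    return [flatten(items[:size]) for size in range(len(items), -1, -1)]

def with_prefix(prefix, sets):
    """Columns listed outside CUBE or ROLLUP appear in every grouping set."""
    return [tuple(prefix) + grouping_set for grouping_set in sets]

partial_cube = with_prefix(["Month"], cube("Product", "State"))
partial_rollup = with_prefix(["State"], rollup("Year", "Month"))
composite = rollup("Nation", "Region", ("State", "City"))

for name, sets in [("partial cube", partial_cube),
                   ("partial rollup", partial_rollup),
                   ("composite", composite)]:
    print(name, sets)
```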
Materialized Views
Stored view
Periodically refreshed with source data
Usually contain summary data
Fast query response for summary data
Appropriate in query dominant
environments
SQL Analysis Functions
Ranking
Ratio
Moving totals and averages using the WINDOW
clause
Oracle ranking functions
RANK
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE
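The difference between three of these functions can be shown on a small list, ranked from largest to smallest. The sketch follows the usual SQL definitions: RANK leaves gaps after ties, DENSE_RANK does not, and PERCENT_RANK is (rank - 1) / (rows - 1). The sales values are hypothetical.

```python
def rank(values):
    """Position of each value in descending order; ties share a rank, gaps follow."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def dense_rank(values):
    """Like rank, but without gaps after ties."""
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in values]

def percent_rank(values):
    """(rank - 1) / (number of rows - 1); 0.0 for a single row."""
    n = len(values)
    return [0.0 if n == 1 else (r - 1) / (n - 1) for r in rank(values)]

sales = [300, 200, 200, 100]
print(rank(sales), dense_rank(sales), percent_rank(sales))
```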
Materialized View Example
CREATE MATERIALIZED VIEW MV1
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE AS
SELECT StoreState, TimeYear, SUM(SalesDollar) AS SUMDollar1
...
AND TimeYear > 2007
GROUP BY StoreState, TimeYear;
Query Rewriting
Substitution process
Materialized view replaces references to
fact and dimension tables in a query
Query optimizer must evaluate whether
the substitution will improve performance
over the original query
More complex than query modification
process for traditional views
Query Rewriting Process
QueryFD -> Rewrite -> QueryMV -> SQL Engine -> Results
QueryFD: query that references fact and dimension tables
QueryMV: rewrite of QueryFD such that materialized views are
substituted for fact and dimension tables whenever justified by
expected performance improvements.
Query Modification Process
QueryV -> Modify -> QueryB -> SQL Engine -> Results
QueryV: query that references a view
QueryB: modification of QueryV such that references to the view are
replaced by references to base tables.
Query Rewriting Principles
Row conditions: query conditions at least as
restrictive as MV conditions
Grouping detail: query grouping columns at least
as general as MV grouping columns
Grouping dependencies: query columns must
match or be derivable by functional
dependencies
Aggregate functions: query aggregate functions
must match or be derivable from MV aggregate
functions
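The row condition, grouping detail, and aggregate function principles can be illustrated with a small numeric check. The fact rows below are hypothetical. The materialized view keeps rows with TimeYear > 2007 grouped by state and year, and the query asks for 2009 sales grouped by year only, so it is both more restrictive and more general than the view.

```python
from collections import defaultdict

# Hypothetical fact rows: (StoreState, TimeYear, TimeMonth, SalesDollar).
fact_rows = [("CO", 2008, 1, 100), ("CO", 2009, 2, 150), ("CO", 2009, 3, 50),
             ("WA", 2009, 1, 200), ("WA", 2007, 5, 75)]

def summarize(rows, key_of, value_of):
    """SUM of value_of(row) for each distinct key_of(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_of(row)] += value_of(row)
    return dict(totals)

# Materialized view: TimeYear > 2007, grouped by (StoreState, TimeYear).
mv1 = summarize([r for r in fact_rows if r[1] > 2007],
                lambda r: (r[0], r[1]), lambda r: r[3])

# Query answered from the fact rows: TimeYear = 2009, grouped by TimeYear.
from_facts = summarize([r for r in fact_rows if r[1] == 2009],
                       lambda r: r[1], lambda r: r[3])

# Same query answered from the view: a SUM of the stored SUM values.
from_mv = summarize([item for item in mv1.items() if item[0][1] == 2009],
                    lambda item: item[0][1], lambda item: item[1])

print(from_facts, from_mv)
```

The two results agree because SUM can be derived from partial SUM values. A query with a less restrictive row condition, such as TimeYear > 2005, could not be answered from this view.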
Query Rewriting Example
-- Data warehouse query
SELECT StoreState, TimeYear, SUM(SalesDollar)
...
AND StoreNation IN ('USA','Canada')
AND TimeYear = 2009
GROUP BY StoreState, TimeYear;
-- Query rewrite: replace Sales and Time tables with MV1
SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1
FROM MV1, Store
WHERE MV1.StoreState = Store.StoreState
AND TimeYear = 2009
AND StoreNation IN ('USA','Canada');
Storage and Optimization Technologies
MOLAP: direct storage and manipulation of data cubes
ROLAP: relational extensions to support multidimensional data
HOLAP: combine MOLAP and ROLAP storage engines
ROLAP Techniques
Bitmap join indexes
Star join optimization
Query rewriting
Summary storage advisors
Parallel query execution
Summary
Maintaining a data warehouse is an
important operational problem.
Increasing usage of data integration tools
SQL extensions for subtotals and analysis
functions
Substantial product extensions for efficient
management of summary data and
optimization techniques
