Content
1. An Overview of Data Warehouse
2. Data Warehouse Architecture
3. Data Modeling for Data Warehouse
4. Overview of Data Cleansing
5. Data Extraction, Transformation, Load
Content [contd]
6. Metadata Management
7. OLAP
8. Data Warehouse Testing
An Overview
Components of Warehouse
Source Tables: real-time, volatile data held in relational databases used for transaction processing (OLTP). These can be any relational databases or flat files.
ETL Tools: extract, cleanse, transform (aggregate, join), and load the data from the sources to the target.
Maintenance and Administration Tools: authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
Modeling Tools: used to design the data warehouse for high performance with dimensional data modeling techniques and to map source files to target files.
Databases: the target databases and data marts that make up the data warehouse, structured for analysis and reporting purposes.
End-user tools for analysis and reporting: retrieve reports and analyze the data from the target tables. Various querying, data mining, and OLAP tools are used for this purpose.
The architecture includes a staging area where the data is loaded and tested after cleansing and transformation. From there it is loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.
E-R (Entity-Relationship) Data Model:
- Entity: an object that can be observed and classified based on its properties and characteristics, e.g. employee, book, student.
- Relationship: relates entities to other entities.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
Star Schema
Dimension Table: product
  prodId  name  price
  p1      bolt  10
  p2      nut   5

Dimension Table: store
  storeId  city
  c1       nyc
  c2       sfo
  c3       la

Fact Table: sale
  orderId  date    custId  prodId  storeId  qty  amt
  o100     1/7/97  53      p1      c1       1    12
  o102     2/7/97  53      p2      c1       2    11
  105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
  custId  name   address    city
  53      joe    10 main    sfo
  81      fred   12 main    sfo
  111     sally  80 willow  la
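To make the star layout concrete, here is a minimal sketch using Python's built-in sqlite3 module, loading the example rows above and running a typical star query that joins the fact table to a dimension table. The table and column names follow the example data; this is an illustration, not a production schema.

```python
import sqlite3

# Build the star schema in an in-memory database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer(custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale    (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                      storeId TEXT, qty INTEGER, amt INTEGER);
""")
con.executemany("INSERT INTO product VALUES (?,?,?)",
                [("p1", "bolt", 10), ("p2", "nut", 5)])
con.executemany("INSERT INTO store VALUES (?,?)",
                [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
con.executemany("INSERT INTO customer VALUES (?,?,?,?)",
                [(53, "joe", "10 main", "sfo"), (81, "fred", "12 main", "sfo"),
                 (111, "sally", "80 willow", "la")])
con.executemany("INSERT INTO sale VALUES (?,?,?,?,?,?,?)",
                [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                 ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                 ("105",  "3/8/97", 111, "p1", "c3", 5, 50)])

# Typical star query: total sales amount per store city —
# the fact table joined to one dimension, then aggregated.
rows = con.execute("""
    SELECT s.city, SUM(f.amt)
    FROM sale f JOIN store s ON f.storeId = s.storeId
    GROUP BY s.city ORDER BY s.city
""").fetchall()
print(rows)   # [('la', 50), ('nyc', 23)]
```

Every analytical question here is one join away from the fact table, which is what makes the star shape fast to query.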
Snowflake Schema
Dimension Tables (the store dimension normalized):

store
  storeId  cityId  tId  mgr
  s5       sfo     t1   joe
  s7       sfo     t2   fred
  s9       la      t1   nancy

sType
  tId  size   location
  t1   small  downtown
  t2   large  suburbs

city
  cityId  pop  regId
  sfo     1M   north
  la      5M   south

(The city table links to a region dimension through regId.)

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.
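The cost of snowflaking shows up at query time: resolving a store's attributes takes one extra join per normalized level. A minimal sqlite3 sketch with the example rows (illustrative only):

```python
import sqlite3

# The normalized store dimension from the snowflake example.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE store(storeId TEXT PRIMARY KEY, cityId TEXT, tId TEXT, mgr TEXT);
CREATE TABLE sType(tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE city (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT);
""")
con.executemany("INSERT INTO store VALUES (?,?,?,?)",
    [("s5", "sfo", "t1", "joe"), ("s7", "sfo", "t2", "fred"),
     ("s9", "la",  "t1", "nancy")])
con.executemany("INSERT INTO sType VALUES (?,?,?)",
    [("t1", "small", "downtown"), ("t2", "large", "suburbs")])
con.executemany("INSERT INTO city VALUES (?,?,?)",
    [("sfo", "1M", "north"), ("la", "5M", "south")])

# Two joins to recover what a star-schema store table would hold directly.
rows = con.execute("""
    SELECT st.storeId, t.size, c.regId
    FROM store st
    JOIN sType t ON st.tId = t.tId
    JOIN city  c ON st.cityId = c.cityId
    ORDER BY st.storeId
""").fetchall()
print(rows)
# [('s5', 'small', 'north'), ('s7', 'large', 'north'), ('s9', 'small', 'south')]
```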
- Identify authoritative data sources
- Interview employees & customers
- Data entry points
- Cost of bad data
- Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate, or incorrect values
- Use data cleansing tools to clean data at the source
- Load only clean data into the data warehouse
- Schedule periodic cleansing of source data
- Identify areas of improvement
- Identify & correct causes of defects
- Refine data capture mechanisms at the source
- Educate users on the importance of DQ
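The rule-based checks listed above can be sketched in a few lines: flag missing values, duplicate keys, and business-rule violations in source rows, and load only the clean rows. The field names and the age rule are hypothetical examples, not part of the course material.

```python
# Hypothetical source rows with typical quality defects.
records = [
    {"id": 1, "name": "joe",   "age": 34},
    {"id": 2, "name": None,    "age": 29},   # missing value
    {"id": 2, "name": "fred",  "age": 29},   # duplicate key
    {"id": 3, "name": "sally", "age": -5},   # breaks a discovered rule
]

def quality_issues(rows):
    issues, seen = [], set()
    for r in rows:
        if any(v is None for v in r.values()):
            issues.append((r["id"], "missing"))
        if r["id"] in seen:
            issues.append((r["id"], "duplicate"))
        seen.add(r["id"])
        if not (0 <= r["age"] <= 120):        # e.g. a discovered business rule
            issues.append((r["id"], "out_of_range"))
    return issues

issues = quality_issues(records)
bad_ids = {rid for rid, _ in issues}
clean = [r for r in records if r["id"] not in bad_ids]   # load only clean data
print(issues)                    # [(2, 'missing'), (2, 'duplicate'), (3, 'out_of_range')]
print([r["id"] for r in clean])  # [1]
```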
2009 Wipro Ltd - Confidential
Customized Programs
- Strengths: address specific needs; no bulky one-time investment
- Limitations: large numbers of custom programs in different environments are difficult to manage; minor alterations demand coding effort

Data Quality Assessment Tools
- Strength: provide automated assessment
- Limitation: no measure of data accuracy
Business Rule Discovery Tools
- Strengths: detect correlations in data values; can detect patterns of behavior that indicate fraud
- Limitations: not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths: usually integrated packages with cleansing features as an add-on
- Limitations: error prevention at the source is usually absent; the ETL tools have limited cleansing facilities
Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic
ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Loading:
- Initial and incremental loading
- Updating of metadata
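The three steps above can be sketched as a minimal pipeline. The source rows, selection criterion, and derived value are hypothetical stand-ins for real databases and business rules:

```python
# Hypothetical source extract: strings, as they might arrive from a flat file.
source = [
    {"orderId": "o100", "qty": "1", "amt": "12"},
    {"orderId": "o102", "qty": "2", "amt": "11"},
]

def extract(rows, criterion):
    # Select only qualified data.
    return [r for r in rows if criterion(r)]

def transform(rows):
    # Convert types and calculate a derived value (unit price).
    return [{"orderId": r["orderId"],
             "qty": int(r["qty"]),
             "amt": int(r["amt"]),
             "unit_price": int(r["amt"]) / int(r["qty"])} for r in rows]

target = []
def load(rows, initial=False):
    # Initial load replaces the target; incremental load appends.
    if initial:
        target.clear()
    target.extend(rows)

load(transform(extract(source, lambda r: int(r["qty"]) > 0)), initial=True)
print(len(target), target[0]["unit_price"])  # 2 12.0
```

Real ETL tools add scheduling, logging, and error recovery around exactly this extract/transform/load core.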
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs.
Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes.
Extract: the process of reading data from a database.
Transform: the process of converting the extracted data.
Load: the process of writing the data into the target database.
Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
Administration and operation: ETL utilities let administrators schedule, run, and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.
ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

Second-generation ETL tools:
- PowerCentre/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is information that describes the what, when, who, where, and how of the data warehouse. It is information about the data being captured and loaded into the warehouse, documented in IT tools to improve both business and technical understanding of data and data-related processes.
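As a small illustration, the what/when/who/where of one warehouse table can be captured as a simple record that both technical and business users can query. All names and values here are hypothetical:

```python
# Hypothetical technical + business metadata for one warehouse table.
metadata = {
    "table": "sale",                          # what
    "loaded_at": "2009-06-01T02:00:00",       # when
    "loaded_by": "etl_nightly_job",           # who / how
    "source": "orders_oltp.sales",            # where it came from
    "columns": {
        "amt": {"type": "integer", "meaning": "sale amount in USD"},
        "qty": {"type": "integer", "meaning": "units sold"},
    },
}

def describe(md, column):
    # Business users look up meaning; technical users look up lineage.
    c = md["columns"][column]
    return f'{md["table"]}.{column}: {c["type"]}, {c["meaning"]} (from {md["source"]})'

print(describe(metadata, "amt"))
# sale.amt: integer, sale amount in USD (from orders_oltp.sales)
```

A real metadata repository holds the same kinds of facts, centrally and for every object in the warehouse.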
Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much misinterpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?
Consumers of Metadata
Technical users: warehouse administrator, application developer
Business users: business metadata — meanings, definitions, business rules
Software tools used in DW life-cycle development:
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web, employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners, including Object Management Group (OMG), Oracle, Carleton Group, CAPLATINUM Technology (founding member), Viasoft
OLAP
Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools
09/02/2012
OLTP (operational data) vs OLAP (consolidation data):
- Source: OLTP systems are the original source of the data; OLAP data comes from the various OLTP databases.
- Purpose: OLTP controls and runs fundamental business tasks; OLAP supports decision making.
- View: OLTP offers a snapshot of ongoing business processes; OLAP offers multi-dimensional views of various kinds of business activities.
- Updates: OLTP sees short, fast inserts and updates initiated by end users; OLAP data is refreshed by periodic long-running batch jobs.
MDDB Concepts
A multidimensional database is a computer software system designed for efficient and convenient storage and retrieval of data that is intimately related and can be stored, viewed, and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions; the individual items within each dimension are called members.
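A toy version of the Sales Volumes hypercube can make the dimension/member vocabulary concrete: cells keyed by one member per dimension, with aggregation over any dimensions the user does not fix. The sales figures are made up for illustration:

```python
# Dimensions: model, color, dealership — members as in the course example.
# Only a few cells are filled; figures are illustrative.
cube = {("sedan", "blue", "carr"): 4,
        ("sedan", "red",  "carr"): 3,
        ("coupe", "blue", "clyde"): 6}

def total(cube, **fixed):
    # Sum over all dimensions the caller has not fixed to a member.
    order = ("model", "color", "dealership")
    return sum(v for k, v in cube.items()
               if all(k[order.index(d)] == m for d, m in fixed.items()))

print(total(cube, model="sedan"))       # 7: all sedan cells
print(total(cube, dealership="clyde"))  # 6
```

MDDB engines do exactly this kind of member-indexed lookup and aggregation, only with optimized storage.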
Increased Complexity...
[Figure: The Sales Volumes cube — dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White), DEALERSHIP (Carr, Gleason, Clyde). A relational DBMS needs 27 x 4 = 108 cells to hold the same data that an MDDB holds in 3 x 3 x 3 = 27 cells.]
Sparsity
If members of different dimensions do not interact, a blank cell is left behind.
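Sparsity can be quantified as the share of blank cells in the full cube. A sketch with illustrative numbers, storing only the filled cells as a mapping:

```python
# Only a few member combinations interact; everything else is blank.
filled = {("sedan", "blue", "carr"): 4,
          ("coupe", "red", "clyde"): 6}
possible = 3 * 3 * 3          # members per dimension, as in the example cube

sparsity = 1 - len(filled) / possible
print(f"{sparsity:.0%} of cells are blank")  # 93% of cells are blank
```

Storing cells as a mapping keyed by member tuples means the blank majority costs nothing, which is the usual answer to sparse cubes.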
[Figure: Sparsity example — an Employee table (LAST NAME, EMP#, AGE) shown alongside the Sales Volumes cube; cells where employee members and sales-dimension members do not intersect remain blank.]
OLAP Features
- Calculations applied across dimensions, through hierarchies, and/or across members
- Trend analysis over sequential time periods; what-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data
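Two of these operations, slicing and rotation, can be sketched on the tiny MODEL x COLOR grid from the example figures (Van 6 5 4, Coupe 3 5 5, Sedan 4 3 2 across Blue, Red, White):

```python
models = ["mini van", "coupe", "sedan"]
colors = ["blue", "red", "white"]
sales = [[6, 5, 4],    # mini van
         [3, 5, 5],    # coupe
         [4, 3, 2]]    # sedan

# Slice: keep only the "blue" column for on-screen viewing.
blue = [row[colors.index("blue")] for row in sales]
print(blue)            # [6, 3, 4]

# Rotate 90 degrees: colors become rows, models become columns.
rotated = [list(col) for col in zip(*sales)]
print(rotated[0])      # [6, 3, 4] -> blue across all models
```

Drill-down and reach-through work the same way conceptually, but against the hierarchy and the underlying detail data rather than a single grid.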
[Figure: Rotation — the Sales Volumes grid rotated 90° repeatedly to produce Views #1-#6, each pairing a different two of the MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White), and DEALERSHIP (Carr, Gleason, Clyde) dimensions in the viewing area.]
[Figure: Drill-down — the Sales Volumes cube with the COLOR member Blue split into finer members (Normal Blue, Metal Blue) across MODEL (Mini Van, Coupe) and DEALERSHIP (Carr, Clyde).]
[Figure: Organization dimension hierarchy — REGION (Midwest) → DISTRICT (Chicago, St. Louis, Gary) → DEALERSHIP (Clyde, Gleason; Carr, Levi; Lucas, Bolton).]
Moving up and moving down in a hierarchy is referred to as drill-up (roll-up) and drill-down.
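Roll-up along the organization hierarchy can be sketched as grouped sums. The dealership-to-district grouping below is one plausible reading of the hierarchy figure, and the sales numbers are made up:

```python
# dealership -> (district, region); an assumed reading of the figure.
hierarchy = {
    "clyde": ("chicago", "midwest"),  "gleason": ("chicago", "midwest"),
    "carr": ("st. louis", "midwest"), "levi": ("st. louis", "midwest"),
    "lucas": ("gary", "midwest"),     "bolton": ("gary", "midwest"),
}
sales = {"clyde": 10, "gleason": 5, "carr": 7, "levi": 2, "lucas": 4, "bolton": 1}

def roll_up(level):
    # level 0 rolls dealerships up to districts, level 1 up to regions.
    out = {}
    for dealer, amt in sales.items():
        key = hierarchy[dealer][level]
        out[key] = out.get(key, 0) + amt
    return out

print(roll_up(0))  # {'chicago': 15, 'st. louis': 9, 'gary': 5}
print(roll_up(1))  # {'midwest': 29}
```

Drill-down is simply the reverse: expanding a district total back into its dealership-level cells.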
[Figure: A cube sliced by REGION (East, West, Central) and TIME (1st Qtr through 4th Qtr, Year 1999).]
[Figure: MOLAP architecture — web browsers, OLAP tools, and OLAP applications access an OLAP calculation engine backed by a multidimensional OLAP cube.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
[Figure: ROLAP architecture — web browsers, OLAP tools, and OLAP applications access an OLAP calculation engine that issues SQL against a relational data warehouse.]
ROLAP - Features
Three-tier hardware/software architecture:
- GUI on the client; multidimensional processing on a mid-tier server; target database on the database server
- Processing split between the mid-tier and database servers
[Figure: HOLAP architecture — any client (web browser, OLAP tools, OLAP applications) accesses an OLAP calculation engine that combines an MDDB with SQL access to a relational data warehouse.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS combined with MDDB performance
- The calculation engine provides full analysis features
- The source of the data is transparent to the end user
Architecture Comparison
Definition:
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity:
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization:
- MOLAP: With good design, 3-10 times
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed:
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost:
- MOLAP: Medium (MDDB server + large disk space)
- ROLAP: Low (only RDBMS + disk space)
- HOLAP: High (RDBMS + disk space + MDDB server)

Where to apply:
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed/sorted
- HOLAP: Large transactional data + frequent summary analysis
- Oracle Express products
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS
- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Applications
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
Programming as a testing challenge: for transaction systems, users/business analysts typically test the output of the system. For a data warehouse, most data quality testing and ETL testing is done at the back end by running separate stand-alone scripts that compare the data before and after transformation.
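Such a stand-alone check often reduces to reconciling row counts and control totals across the transformation. A sketch with hypothetical pre- and post-transformation rows:

```python
# Hypothetical rows before (raw strings) and after (typed) transformation.
pre  = [{"id": 1, "amt": "12"}, {"id": 2, "amt": "11"}]
post = [{"id": 1, "amt": 12},   {"id": 2, "amt": 11}]

def reconcile(pre_rows, post_rows):
    # Row counts must match and the control total must survive the transform.
    return {
        "row_count": len(pre_rows) == len(post_rows),
        "control_total": sum(int(r["amt"]) for r in pre_rows)
                         == sum(r["amt"] for r in post_rows),
    }

print(reconcile(pre, post))  # {'row_count': True, 'control_total': True}
```

Real test scripts run the same comparisons against the actual source and target databases, usually with tolerances for legitimately rejected rows.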
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements ambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures covers:
- Whether the ETLs access and pick up the right data from the right source.
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that don't fulfil the transformation rules.
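A unit test for one such transformation rule can be sketched directly: rows that fail the rule must land in the reject stream, not the warehouse. The rule and rows are hypothetical:

```python
def transform_with_rejects(rows):
    # Hypothetical rule: qty must be present and positive; amt is derived.
    loaded, rejected = [], []
    for r in rows:
        if r.get("qty") is not None and r["qty"] > 0:
            loaded.append({**r, "amt": r["qty"] * r["price"]})
        else:
            rejected.append(r)
    return loaded, rejected

loaded, rejected = transform_with_rejects(
    [{"qty": 2, "price": 5},      # valid
     {"qty": None, "price": 5},   # missing qty -> reject
     {"qty": -1, "price": 3}])    # negative qty -> reject

# White-box assertions: derived value correct, bad rows rejected.
assert [r["amt"] for r in loaded] == [10]
assert len(rejected) == 2
```

The same pattern scales up: feed known good and bad rows through each mapping and assert on both output streams.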
Unit Testing
Unit Testing the Report data:
- Verify report data with the source: data in a data warehouse is stored at an aggregate level compared to the source systems; the QA team should verify the granular data stored in the warehouse against the available source data.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.
Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify the newly inserted or updated data
- Testing the rejected records that don't fulfil transformation rules
- Error log generation
Performance Testing
Performance testing should check for:
- ETL processes completing within the time window
- Monitoring and measuring data quality issues
- Refresh times for standard/complex reports
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.
Questions
Thank You
92
93
Content
1 2 3 4 5 An Overview of Data Warehouse Data Warehouse Architecture Data Modeling for Data Warehouse Overview of Data Cleansing Data Extraction , Transformation , Load
94
Content [contd]
6 7 8 Metadata Management OLAP Data Warehouse Testing
95
An Overview
96
97
98
99
100
Content
1 2 3 4 5 An Overview of Data Warehouse Data Warehouse Architecture Data Modeling for Data Warehouse Overview of Data Cleansing Data Extraction , Transformation , Load
101
Content [contd]
6 7 8 Metadata Management OLAP Data Warehouse Testing
102
An Overview
103
104
105
106
107
Content
1 2 3 4 5 An Overview of Data Warehouse Data Warehouse Architecture Data Modeling for Data Warehouse Overview of Data Cleansing Data Extraction , Transformation , Load
108
Content [contd]
6 7 8 Metadata Management OLAP Data Warehouse Testing
109
An Overview
110
111
112
113
Components of Warehouse
Source Tables : These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End - user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
114
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
115 222
Data Modeling
116
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity : Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship : relating entities to other entities.
117
118 222
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by 2009 Wipro Ltd Confidential lookup tables. Attributes --are the non-key columns in the lookup tables. 2222 Wipro Ltd Confidential
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId city c1 nyc c2 sfo c3 la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
119 222
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 cityId sfo la size small large pop 1M 5M location downtown suburbs regId north south
Dimension Table
city
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
region
120
121
122
Identify authoritative data sources Identify authoritative data sources Interview Employees & Customers Interview Employees & Customers Data Entry Points Data Entry Points Cost of bad data Cost of bad data
Use business rule discovery tools to identify data with Use business rule discovery tools to identify data with
inconsistent, missing, incomplete, duplicate or incorrect inconsistent, missing, incomplete, duplicate or incorrect values values
Use data cleansing tools to clean data at the source Use data cleansing tools to clean data at the source Load only clean data into the data warehouse Load only clean data into the data warehouse
Schedule Periodic Cleansing of Source Data Schedule Periodic Cleansing of Source Data
Id e n tify A re a s o f Id e n tify A re a s o f Im p ro v e m e n t Im p ro v e m e n t
Identify & Correct Cause of Defects Identify & Correct Cause of Defects Refine data capture mechanisms at source Refine data capture mechanisms at source Educate users on importance of DQ Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
123
Customized Programs Strengths: Addresses specific needs No bulky one time investment Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools Strength Provide automated assessment Limitation No measure of data accuracy
124
Business Rule Discovery tools Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields.
Data Reengineering & Cleansing tools Strengths Usually are integrated packages with cleansing features as Add-on Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
125
Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic
126
127
ETL Architecture
128
ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database
D a ta tra n sfo rm a ti n o
I te g ra ti g d i m i a r d a ta typ e s n n ssi l C h a n g i g co d e s n A d d i g a ti e a ttri u te n m b S u m m a ri n g d a ta zi C a l l ti g d e ri d va l e s cu a n ve u R e n o rm a l zi g d a ta i n
Data loading
Initial and incremental loading Updation of metadata
129
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.
130
131
Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
132
ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment
ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
133
Metadata Management
134
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
135
Importance Of Metadata
Locating Information Time spent in looking for information . How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products? What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?
136
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
138
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
140
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CAPLATINUM Technology (Founding Member), Viasoft
141
OLAP
142
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
09/02/2012
143
143
09/02/2012
144
144
OLTP: operational data; OLTPs are the original source of the data. Purpose: to control and run fundamental business tasks; provides a snapshot of ongoing business processes. Workload: short and fast inserts and updates initiated by end users.
OLAP: consolidation data; OLAP data comes from the various OLTP databases. Purpose: decision support; provides multi-dimensional views of various kinds of business activities. Workload: periodic long-running batch jobs refresh the data.
145
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions).
A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions; the individual items within each dimension are called members.
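As a concrete toy illustration, a hypercube can be sketched in Python as a mapping from one member per dimension to a cell value. The dimension and member names below come from the car-sales example used in these slides; the code is only a sketch, not a real MDDB engine.

```python
# Toy hypercube: each dimension has an ordered list of members.
DIMENSIONS = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Carr", "Gleason", "Clyde"],
}

# One cell per combination of members: 3 x 3 x 3 = 27 cells.
cube = {
    (model, color, dealer): 0
    for model in DIMENSIONS["MODEL"]
    for color in DIMENSIONS["COLOR"]
    for dealer in DIMENSIONS["DEALERSHIP"]
}

cube[("Sedan", "Blue", "Carr")] = 4  # store one sales-volume measure

print(len(cube))  # 27
```

Each cell is addressed by one member from every dimension, which is exactly what "viewed and analyzed from different perspectives" means in practice.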
146
2009 Wipro Ltd - Confidential
Increased Complexity...
[Figure: the same Sales Volumes data stored two ways. In a relational DBMS, a table with columns MODEL, COLOR, DEALERSHIP and volume takes 27 rows x 4 columns = 108 cells; in an MDDB, a cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde) takes 3 x 3 x 3 = 27 cells.]
147
148
Sparsity
149
If members of different dimensions do not interact, a blank cell is left behind.
[Figure: sparsity example. An Employee Age cube over LAST NAME x EMP# (SMITH 01 = 21, REGAN 12 = 19, FOX 31 = 63, WELD 14 = 31, KELLY 54 = 27, LINK 03 = 56, KRANZ 41 = 45, LUCUS 33 = 41, WEISS 23 = 19): each employee has exactly one employee number, so only one cell per row/column pair is populated and the rest of the cube is blank.]
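The employee-age example can be restated numerically: a dense cube allocates every LAST NAME x EMP# combination even though only one cell per employee is ever filled. A sketch of the sparse alternative, a plain dict holding only populated cells (the figure's data abbreviated to three employees):

```python
# Only populated cells are stored; blank cells simply have no key.
ages = {
    ("SMITH", "01"): 21,
    ("REGAN", "12"): 19,
    ("FOX", "31"): 63,
}

dense_cells = 9 * 9        # a dense 9-name x 9-number cube allocates 81 cells
sparse_cells = len(ages)   # the sparse dict stores only the 3 real cells
print(dense_cells, sparse_cells)  # 81 3
```

This is why MDDB engines need sparsity handling: without it, storage grows with the product of dimension sizes, not with the number of real facts.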
150
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members
Trend analysis over sequential time periods
What-if scenarios
Slicing / dicing subsets for on-screen viewing
Rotation to new dimensional comparisons in the viewing area
Drill-down/up along the hierarchy
Reach-through / drill-through to underlying detail data
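Slicing and rotation from the list above can be sketched against a toy cube keyed by (model, color, dealership) tuples. The helper names are illustrative, not from any OLAP product:

```python
cube = {
    ("Sedan", "Blue", "Carr"): 4,
    ("Sedan", "Red", "Carr"): 3,
    ("Coupe", "Blue", "Clyde"): 5,
}

def slice_cube(cube, axis, member):
    """Fix one dimension to a single member, dropping that axis from the keys."""
    return {k[:axis] + k[axis + 1:]: v
            for k, v in cube.items() if k[axis] == member}

def rotate(cube):
    """Swap the first two dimensions: a 90-degree rotation of the view."""
    return {(k[1], k[0]) + k[2:]: v for k, v in cube.items()}

blue = slice_cube(cube, 1, "Blue")  # keys are now (model, dealership)
print(blue)                         # {('Sedan', 'Carr'): 4, ('Coupe', 'Clyde'): 5}
print(rotate(cube)[("Blue", "Sedan", "Carr")])  # 4
```

Dicing is just repeated slicing on several dimensions, and drill-down/up replaces members with their children or parents in a hierarchy.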
151
[Figure: rotation. View #1 shows the Sales Volumes slice with MODEL rows and COLOR columns; after a 90-degree rotation, View #2 shows COLOR rows and MODEL columns with the same cell values.]
152
[Figure: the six possible two-dimensional views of the MODEL x COLOR x DEALERSHIP cube. Successive 90-degree rotations pair each dimension with each axis of the viewing area, giving View #1 through View #6.]
153
Sales Volumes
[Figure: drill-down. The Blue member of COLOR is expanded into Normal Blue and Metal Blue, shown for models Mini Van and Coupe across dealerships Carr and Clyde.]
154
ORGANIZATION DIMENSION
[Figure: the ORGANIZATION dimension hierarchy: REGION (Midwest) contains DISTRICTs (Chicago, St. Louis, Gary), which contain DEALERSHIPs (Clyde and Gleason; Carr and Levi; Lucas and Bolton).]
Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
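Roll-up along such a hierarchy is just aggregation to the parent level. A sketch using the dealership-to-district mapping from the figure (the sales figures themselves are made up for illustration):

```python
# Dealership -> district parent mapping, taken from the hierarchy figure.
DISTRICT = {"Clyde": "Chicago", "Gleason": "Chicago",
            "Carr": "St. Louis", "Levi": "St. Louis",
            "Lucas": "Gary", "Bolton": "Gary"}

sales = {"Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 2, "Lucas": 4, "Bolton": 1}

def roll_up(sales, parent_of):
    """Sum leaf values up to their parents; drill-down walks the other way."""
    totals = {}
    for leaf, value in sales.items():
        parent = parent_of[leaf]
        totals[parent] = totals.get(parent, 0) + value
    return totals

print(roll_up(sales, DISTRICT))  # {'Chicago': 17, 'St. Louis': 7, 'Gary': 5}
```

Rolling the district totals up once more with a district-to-region mapping would give the Midwest regional total.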
155
156
[Figure: a slice of the cube for Year 1999, viewed by region (East, West, Central) and quarter (1st Qtr to 4th Qtr).]
157
158
159
[Figure: MOLAP architecture. An OLAP cube plus the OLAP calculation engine serve OLAP tools, OLAP applications and web browsers.]
160
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for maximum query performance and optimum space utilization
161
[Figure: ROLAP architecture. The OLAP calculation engine issues SQL against a relational data warehouse and serves OLAP tools, OLAP applications and web browsers.]
162
ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on mid-tier server; target database on database server
Processing split between mid-tier & database servers
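The mid-tier's job in ROLAP is to translate a multidimensional request into SQL against the warehouse. A minimal sketch of that translation, reusing column names from the star-schema example earlier in the deck (the function itself is illustrative, not any vendor's API):

```python
def rolap_sql(measure, fact, dims):
    """Build a GROUP BY query aggregating a measure over chosen dimensions."""
    cols = ", ".join(dims)
    return (f"SELECT {cols}, SUM({measure}) AS total "
            f"FROM {fact} GROUP BY {cols}")

sql = rolap_sql("qty", "sale", ["prodId", "storeId"])
print(sql)
# SELECT prodId, storeId, SUM(qty) AS total FROM sale GROUP BY prodId, storeId
```

Because each request becomes a fresh query, ROLAP scales with the RDBMS but pays the aggregation cost at query time, which is why the comparison table below rates its query speed as slow.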
163
[Figure: HOLAP architecture. The OLAP calculation engine issues SQL against a relational data warehouse and serves any client: OLAP tools, OLAP applications and web browsers.]
164
HOLAP - Features
RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of RDBMS and MDDB performance
Calculation engine provides full analysis features
Source of data transparent to end user
165
Architecture Comparison

MOLAP
Definition: MDDB OLAP = transaction-level data + summary in MDDB
Data explosion due to sparsity: High (may go beyond control; estimation is very important)
Data explosion due to summarization: 3 to 10 times with a good design
Query execution speed: Fast (depends upon the size of the MDDB)
Cost: Medium (MDDB server + large disk space)
Where to apply: Small transactional data + complex model + frequent summary analysis

ROLAP
Definition: Relational OLAP = transaction-level data + summary in RDBMS
Data explosion due to sparsity: No sparsity
Data explosion due to summarization: To the necessary extent
Query execution speed: Slow
Cost: Low (only RDBMS + disk space)
Where to apply: Very large transactional data that needs to be viewed / sorted

HOLAP
Definition: Hybrid OLAP = ROLAP + summary in MDDB
Data explosion due to sparsity: Sparsity exists only in the MDDB part
Data explosion due to summarization: To the necessary extent
Query execution speed: Optimum (like ROLAP when the data is fetched from the RDBMS, otherwise like MOLAP)
Cost: High (RDBMS + disk space + MDDB server)
Where to apply: Large transactional data + frequent summary analysis
166
Representative Tools
Oracle Express products
Hyperion Essbase
Cognos PowerPlay
Seagate Holos
SAS
MicroStrategy DSS Agent
Informix MetaCube
Brio Query
Business Objects / Web Intelligence
167
Sales Analysis
Financial Analysis
Profitability Analysis
Performance Analysis
Risk Management
Profiling & Segmentation
Scorecard Application
NPA Management
Strategic Planning
Customer Relationship Management (CRM)
168
169
170
171
172
173
Programming for the testing challenge
In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data-quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare pre-transformation data to post-transformation data.
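Such a stand-alone script typically checks at least row counts and column totals between the pre-transformation (source) and post-transformation (target) data. A minimal sketch, with in-memory rows standing in for the two databases:

```python
# Toy stand-ins for the source extract and the loaded target table.
source = [{"amt": 12}, {"amt": 11}, {"amt": 50}]
target = [{"amt": 12}, {"amt": 11}, {"amt": 50}]

def reconcile(source, target, column):
    """Compare row counts and a column total; return a list of findings."""
    findings = []
    if len(source) != len(target):
        findings.append(f"row count mismatch: {len(source)} vs {len(target)}")
    if sum(r[column] for r in source) != sum(r[column] for r in target):
        findings.append(f"sum({column}) mismatch")
    return findings

print(reconcile(source, target, "amt"))  # [] means the load reconciles
```

Real scripts would run the same kind of comparison through database queries and extend it with checksums per key, but the shape of the check is the same.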
174
175
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
Are the requirements Complete?
Are the requirements Singular?
Are the requirements Unambiguous?
Are the requirements Developable?
Are the requirements Testable?
176
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures:
Whether ETLs are accessing and picking up the right data from the right source.
All the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
Testing the rejected records that don't fulfil transformation rules.
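Testing rejected records amounts to asserting that every row violating a transformation rule lands in the reject set, with no row lost or duplicated. A sketch with one made-up rule (quantity must be present and positive):

```python
def transform(rows):
    """Apply one illustrative business rule; route failures to rejects."""
    loaded, rejected = [], []
    for row in rows:
        if row.get("qty") is None or row["qty"] <= 0:
            rejected.append(row)
        else:
            loaded.append(row)
    return loaded, rejected

rows = [{"qty": 1}, {"qty": None}, {"qty": -2}, {"qty": 5}]
loaded, rejected = transform(rows)

# The unit test's core assertions: correct routing, nothing lost or duplicated.
assert len(loaded) == 2 and len(rejected) == 2
assert len(loaded) + len(rejected) == len(rows)
print("reject routing OK")
```

The same pattern scales to real mappings: run the rule, then assert the loaded and rejected partitions together account for exactly the input.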
177
Unit Testing
Unit Testing the Report data:
Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
Derivation formulae/calculation rules should be verified.
178
Integration Testing
Integration testing will involve the following:
Sequence of ETL jobs in batch.
Initial loading of records into the data warehouse.
Incremental loading of records at a later date to verify the newly inserted or updated data.
Testing the rejected records that don't fulfil transformation rules.
Error log generation.
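The incremental-load check reduces to comparing the sets of business keys before and after the incremental run: exactly the new or changed keys should appear, and nothing should vanish. A sketch with toy order keys:

```python
after_initial = {"o100", "o102"}               # keys present after the initial load
after_incremental = {"o100", "o102", "o105"}   # keys after the incremental run

inserted = after_incremental - after_initial   # newly loaded keys
deleted = after_initial - after_incremental    # should be empty for an insert-only load

print(sorted(inserted), sorted(deleted))  # ['o105'] []
```

In practice the key sets come from queries against the warehouse, and updated rows are detected the same way by comparing checksums per key across the two snapshots.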
179
Performance Testing
Performance testing should check for:
ETL processes completing within the time window.
Monitoring and measuring of data quality issues.
Refresh times for standard/complex reports.
180
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality and reporting.
181
Questions
182
Thank You
183
Locating Information Time spent in looking for information . How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products? What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?
241
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
243
244
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
246
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CAPLATINUM Technology (Founding Member), Viasoft
247
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
249
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CAPLATINUM Technology (Founding Member), Viasoft
250
OLAP
251
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
09/02/2012
252
252
09/02/2012
253
253
Operational data; OLTPs Consolidation data; OLAP are the original source data comes from the of the data various OLTP databases To control and run fundamental business tasks A snapshot of ongoing business processes Decision support Multi-dimensional views of various kinds of business activities
Short and fast inserts Periodic long-running and updates initiated by batch jobs refresh the end users data
254
09/02/2012
254
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions).
A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
255
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
Increased Complexity...
COLOR DEALER
Relational DBMS
MDDB
Sales Volumes
M O D E L
Mini Van Coupe Sedan Blue Red White Carr Gleason Clyde
DEALERSHIP
COLOR
27 x 4 = 108 cells
256
3 x 3 x 3 = 27 cells
257
Sparsity
09/02/2012
258
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
258
If dimension members of different dimensions do not interact , then blank cell is left behind.
Employee Age
21 19 63 31 27 56 45 41 19
31 41 23 01 14 54 03 12 33
LAST NAME EMP# AGE SMITH 01 21 REGAN 12 19 Sales Volumes FOX 31 63 WELD 14 Van 6 5 4 Miini 31 M O KELLY 54 Coupe 27 D E LINK 03 56 3 5 5 L KRANZ 41 Sedan 4 3 2 45 LUCUS 33 41 Blue Red White WEISS 23 19 COLOR
L A S T N A M E
EMPLOYEE #
09/02/2012
259
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
259
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
09/02/2012
260
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
260
M O D E L
6 3 4
Blue
5 5 3
Red
4 5 2
White
C O L O R ( ROTATE 90 )
o
6 5 4
3 5 5
MODEL
4 3 2
Sedan
Sedan
COLOR
View #1
View #2
09/02/2012
261
261
M O D E L
Mini Van Coupe Sedan Blue Red White Carr Gleason Clyde
C O L O R
Blue
C O L O R
Blue
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
( ROTATE 90 )
DEALERSHIP
DEALERSHIP
MODEL
View #1
D E A L E R S H I P D E A L E R S H I P
View #2
View #3
Carr Gleason Clyde White Red Blue Mini Van Coupe Sedan
M O D E L
Mini Van Coupe Sedan Clyde Gleason Carr Blue Red White
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
MODEL
COLOR
COLOR
View #4
View #5
View #6
09/02/2012
262
262
Sales Volumes
M O D E L
Mini Van
Coupe
Carr Clyde
Carr Clyde
Normal Blue
Metal Blue
DEALERSHIP
COLOR
09/02/2012
263
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
263
ORGANIZATION DIMENSION
REGION Chicago Clyde Gleason Midwest St. Louis Carr Levi Gary Lucas Bolton
DISTRICT DEALERSHIP
Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down
09/02/2012
264
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
264
09/02/2012
265
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
265
East West Central 1st Qtr 2nd Qtr 3rd Qtr 4th Qtr Year 1999
266
267
268
OLAP
Cube
OLAP Calculation Engine
Web Browser
OLAP Tools
OLAP Applications
09/02/2012
269
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
269
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
09/02/2012
270
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
270
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
09/02/2012
271
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
271
ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on mid-tier server; target database on database server Processing split between mid-tier & database servers
09/02/2012
272
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
272
Any Client
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
09/02/2012
273
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
273
HOLAP - Features
RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user
09/02/2012
274
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
274
Architecture Comparison

Definition:
MOLAP: MDDB OLAP = transaction-level data + summary in an MDDB
ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS
HOLAP: Hybrid OLAP = ROLAP + summary in an MDDB

Data explosion due to sparsity:
MOLAP: 3 to 10 times, even with good design
ROLAP: no sparsity
HOLAP: sparsity exists only in the MDDB part

Data explosion due to summarization:
MOLAP: high (may go beyond control; estimation is very important)
ROLAP: to the necessary extent
HOLAP: to the necessary extent

Query execution speed:
MOLAP: fast (depends upon the size of the MDDB)
ROLAP: slow
HOLAP: optimum (like ROLAP if the data is fetched from the RDBMS, otherwise like MOLAP)

Cost:
MOLAP: medium (MDDB server + large disk space)
ROLAP: low (only RDBMS + disk space)
HOLAP: high (RDBMS + disk space + MDDB server)

Where to apply:
MOLAP: small transactional data + complex model + frequent summary analysis
ROLAP: very large transactional data that needs to be viewed/sorted
HOLAP: large transactional data + frequent summary analysis
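The speed and cost trade-off in this comparison can be illustrated in miniature: ROLAP aggregates at query time with SQL against the relational store, while MOLAP answers from a precomputed summary at the price of extra storage. A toy sketch using Python's sqlite3; the table, columns, and figures are invented for illustration:

```python
import sqlite3

# Invented transaction-level fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (region TEXT, qtr TEXT, amt REAL)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?)",
    [("East", "Q1", 10.0), ("East", "Q2", 20.0),
     ("West", "Q1", 5.0), ("West", "Q2", 15.0)],
)

def rolap_total(region):
    """ROLAP style: summarize on the fly with SQL against the relational store."""
    return conn.execute(
        "SELECT SUM(amt) FROM sale WHERE region = ?", (region,)
    ).fetchone()[0]

# MOLAP style: precompute the summaries once into a cube-like lookup
# structure; queries then become fast reads, at the cost of storing them.
cube = dict(conn.execute("SELECT region, SUM(amt) FROM sale GROUP BY region"))

def molap_total(region):
    return cube[region]
```

Either path returns the same totals; the comparison rows above are about where the aggregation work and the storage cost land.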
Representative Tools
Oracle Express, Hyperion Essbase, Cognos PowerPlay, Seagate Holos, SAS
MicroStrategy DSS Agent, Informix MetaCube, Brio Query, Business Objects / Web Intelligence
OLAP application areas: Sales Analysis, Financial Analysis, Profitability Analysis, Performance Analysis, Risk Management, Profiling & Segmentation, Scorecard Applications, NPA Management, Strategic Planning, Customer Relationship Management (CRM)
Data Warehouse Testing
Programming as a testing challenge: in transaction systems, users and business analysts typically test the output of the system. In a data warehouse, most data-quality and ETL testing is done at the back end by running separate stand-alone scripts that compare the data before and after transformation.
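A stand-alone comparison script of this kind can be as small as a row-count and checksum check between the pre-transformation and post-transformation tables. A minimal sketch with sqlite3; the table names and the doubling rule are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, amt REAL)")   # pre-transformation
conn.execute("CREATE TABLE tgt (id INTEGER, amt REAL)")   # post-transformation
conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])

# Toy transformation under test: amounts are doubled on the way to the target.
rows = conn.execute("SELECT id, amt FROM src").fetchall()
conn.executemany("INSERT INTO tgt VALUES (?, ?)", [(i, a * 2) for i, a in rows])

def compare_pre_post(connection):
    """Compare pre- and post-transformation data: row counts and rule effect."""
    src_count, src_sum = connection.execute(
        "SELECT COUNT(*), SUM(amt) FROM src").fetchone()
    tgt_count, tgt_sum = connection.execute(
        "SELECT COUNT(*), SUM(amt) FROM tgt").fetchone()
    return {"counts_match": src_count == tgt_count,
            "rule_holds": tgt_sum == src_sum * 2}
```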
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors: Are the requirements complete? Are the requirements singular? Are the requirements unambiguous? Are the requirements developable? Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box: it should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures covers:
Whether the ETLs are accessing and picking up the right data from the right sources. Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data. Testing the rejected records that don't fulfil the transformation rules.
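The rejected-records check can be unit tested by pushing sample records through the transformation and asserting that bad rows land in the reject set and nothing is lost. The validation rule here (a non-empty id and a non-negative amount) is invented for illustration:

```python
def transform(records):
    """Split records into (loaded, rejected) by a toy validation rule."""
    loaded, rejected = [], []
    for rec in records:
        if rec.get("id") and rec.get("amt", -1.0) >= 0:
            loaded.append(rec)       # fulfils the rule: goes to the warehouse
        else:
            rejected.append(rec)     # fails the rule: goes to the reject file
    return loaded, rejected

sample = [
    {"id": "A1", "amt": 10.0},   # valid
    {"id": "",   "amt": 5.0},    # missing id: should be rejected
    {"id": "A2", "amt": -1.0},   # negative amount: should be rejected
]
loaded, rejected = transform(sample)
```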
Unit Testing the Report data:
Verify report data against the source: data in a data warehouse is stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the data warehouse against the available source data. Field-level data verification: the QA team must understand the lineage of the fields displayed in the report, trace them back, and compare them with the source systems. Derivation formulae and calculation rules should also be verified.
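Verifying report data against the source amounts to re-deriving each aggregated figure from the granular rows and flagging any mismatch. A sketch; the field names, figures, and tolerance are invented for illustration:

```python
def verify_report(report_rows, source_rows, tol=1e-9):
    """Re-derive each report total from granular source rows; return mismatches."""
    mismatches = []
    for rep in report_rows:
        expected = sum(s["amt"] for s in source_rows
                       if s["region"] == rep["region"])
        if abs(expected - rep["total"]) > tol:
            mismatches.append({"region": rep["region"],
                               "report": rep["total"], "source": expected})
    return mismatches

source = [{"region": "East", "amt": 10.0},
          {"region": "East", "amt": 20.0},
          {"region": "West", "amt": 5.0}]
report = [{"region": "East", "total": 30.0},
          {"region": "West", "total": 6.0}]   # deliberately wrong
```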
Integration Testing
Integration testing involves the following: the sequence of ETL jobs in a batch; the initial load of records into the data warehouse; an incremental load of records at a later date, to verify newly inserted or updated data; testing the rejected records that don't fulfil the transformation rules; and error-log generation.
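The initial-versus-incremental load checks can be sketched by running the same upsert twice and asserting what the second run inserted or updated. The key-value target and the record shape are invented for illustration:

```python
def load(target, batch):
    """Upsert a batch into target (keyed by id); return (inserted, updated)."""
    inserted = updated = 0
    for rec in batch:
        if rec["id"] not in target:
            target[rec["id"]] = rec["amt"]
            inserted += 1
        elif target[rec["id"]] != rec["amt"]:
            target[rec["id"]] = rec["amt"]
            updated += 1
    return inserted, updated

warehouse = {}
initial = load(warehouse, [{"id": 1, "amt": 10.0}, {"id": 2, "amt": 20.0}])
# Later incremental run: one changed record, one brand-new record.
incremental = load(warehouse, [{"id": 2, "amt": 25.0}, {"id": 3, "amt": 30.0}])
```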
Performance Testing
Performance testing should check for:
ETL processes completing within the agreed time window. Monitoring and measuring of data-quality issues. Refresh times for standard and complex reports.
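The time-window check is simply a timed run compared against an agreed budget. A sketch; the two-second window and the dummy job are invented for illustration:

```python
import time

def runs_within_window(job, window_seconds):
    """Run an ETL job callable; report whether it finished within the window."""
    start = time.monotonic()
    job()
    elapsed = time.monotonic() - start
    return elapsed <= window_seconds, elapsed

ok, elapsed = runs_within_window(lambda: sum(range(100_000)), window_seconds=2.0)
```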
Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of UAT, the system should be acceptable to the client for use, in terms of ETL process integrity, business functionality, and reporting.
Questions
Thank You
Components of Warehouse
Source Tables : These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End - user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
384
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
385 222
Data Modeling
386
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity : Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship : relating entities to other entities.
387
388 222
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by 2009 Wipro Ltd Confidential lookup tables. Attributes --are the non-key columns in the lookup tables. 2222 Wipro Ltd Confidential
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId city c1 nyc c2 sfo c3 la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
389 222
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 cityId sfo la size small large pop 1M 5M location downtown suburbs regId north south
Dimension Table
city
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
region
390
391
392
Identify authoritative data sources Identify authoritative data sources Interview Employees & Customers Interview Employees & Customers Data Entry Points Data Entry Points Cost of bad data Cost of bad data
Use business rule discovery tools to identify data with Use business rule discovery tools to identify data with
inconsistent, missing, incomplete, duplicate or incorrect inconsistent, missing, incomplete, duplicate or incorrect values values
Use data cleansing tools to clean data at the source Use data cleansing tools to clean data at the source Load only clean data into the data warehouse Load only clean data into the data warehouse
Schedule Periodic Cleansing of Source Data Schedule Periodic Cleansing of Source Data
Id e n tify A re a s o f Id e n tify A re a s o f Im p ro v e m e n t Im p ro v e m e n t
Identify & Correct Cause of Defects Identify & Correct Cause of Defects Refine data capture mechanisms at source Refine data capture mechanisms at source Educate users on importance of DQ Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
393
Customized Programs Strengths: Addresses specific needs No bulky one time investment Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools Strength Provide automated assessment Limitation No measure of data accuracy
394
Business Rule Discovery tools Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields.
Data Reengineering & Cleansing tools Strengths Usually are integrated packages with cleansing features as Add-on Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
395
Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic
396
397
ETL Architecture
398
ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database
D a ta tra n sfo rm a ti n o
I te g ra ti g d i m i a r d a ta typ e s n n ssi l C h a n g i g co d e s n A d d i g a ti e a ttri u te n m b S u m m a ri n g d a ta zi C a l l ti g d e ri d va l e s cu a n ve u R e n o rm a l zi g d a ta i n
Data loading
Initial and incremental loading Updation of metadata
399
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.
400
401
Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
402
ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment
ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
403
Metadata Management
404
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
405
Importance Of Metadata
Locating Information Time spent in looking for information . How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products? What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?
Consumers of Metadata
Technical users: warehouse administrator, application developer.
Business users (business metadata): meanings, definitions, business rules.
Software tools: used in DW life-cycle development. Metadata requirements for each tool must be identified; the tool-specific metadata should be analysed for inclusion in the enterprise metadata repository; previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool.
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
Content
1. An Overview of Data Warehouse
2. Data Warehouse Architecture
3. Data Modeling for Data Warehouse
4. Overview of Data Cleansing
5. Data Extraction, Transformation, Load
Content [contd]
6. Metadata Management
7. OLAP
8. Data Warehouse Testing
An Overview
Components of Warehouse
Source tables: real-time, volatile data in relational databases used for transaction processing (OLTP); these can be any relational databases or flat files.
ETL tools: extract, cleanse, transform (aggregates, joins), and load the data from the sources to the target.
Maintenance and administration tools: authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
Modeling tools: used to design the data warehouse for high performance with the dimensional data modeling technique, and to map the source files to the target files.
Databases: the target databases and data marts that make up the data warehouse, structured for analysis and reporting purposes.
End-user tools for analysis and reporting: produce reports and analyze the data in the target tables; various querying, data mining, and OLAP tools are used for this purpose.
The warehouse has a staging area, where the data is loaded and tested after cleansing and transformation. From there it is loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
Data Modeling
The E-R data model is commonly used in OLTP; in OLAP, the dimensional data model is the common choice. E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified based on its properties and characteristics, such as an employee, a book, or a student. Relationship: relates entities to other entities.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
Star Schema
Dimension Table: product
prodId | name | price
p1     | bolt | 10
p2     | nut  | 5

Dimension Table: store
storeId | city
c1      | nyc
c2      | sfo
c3      | la

Fact Table: sale
orderId | date   | custId | prodId | storeId | qty | amt
o100    | 1/7/97 | 53     | p1     | c1      | 1   | 12
o102    | 2/7/97 | 53     | p2     | c1      | 2   | 11
o105    | 3/8/97 | 111    | p1     | c3      | 5   | 50

Dimension Table: customer
custId | name  | address   | city
53     | joe   | 10 main   | sfo
81     | fred  | 12 main   | sfo
111    | sally | 80 willow | la
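A typical star-schema query joins the central fact table to its dimension tables and aggregates a measure. The sketch below loads the slide's sample rows into an in-memory SQLite database and totals the sales amount per city; the schema and data follow the example above, while the use of SQLite itself is just a convenient stand-in for any relational database.

```python
import sqlite3

# Build the star schema from the example above in an in-memory database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product(prodId TEXT, name TEXT, price INT);
CREATE TABLE store(storeId TEXT, city TEXT);
CREATE TABLE sale(orderId TEXT, date TEXT, custId INT, prodId TEXT,
                  storeId TEXT, qty INT, amt INT);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                        ('o102','2/7/97',53,'p2','c1',2,11),
                        ('o105','3/8/97',111,'p1','c3',5,50);
""")

# Total sales amount per city: the fact table joined to the store dimension.
rows = con.execute("""
    SELECT s.city, SUM(f.amt)
    FROM sale f JOIN store s ON f.storeId = s.storeId
    GROUP BY s.city ORDER BY s.city
""").fetchall()
print(rows)  # [('la', 50), ('nyc', 23)]
```

Note that the query never joins one dimension to another; every join goes through the fact table, which is exactly the star shape.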
Snowflake Schema
Dimension Table: store (referenced by the fact table)
storeId | cityId | tId | mgr
s5      | sfo    | t1  | joe
s7      | sfo    | t2  | fred
s9      | la     | t1  | nancy

Dimension Table: sType
tId | size  | location
t1  | small | downtown
t2  | large | suburbs

Dimension Table: city (which in turn references a region table)
cityId | pop | regId
sfo    | 1M  | north
la     | 5M  | south

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.
1. Identify authoritative data sources: interview employees and customers; examine data entry points; estimate the cost of bad data.
2. Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate, or incorrect values.
3. Use data cleansing tools to clean data at the source; load only clean data into the data warehouse.
4. Schedule periodic cleansing of source data.
5. Identify areas of improvement: identify and correct the causes of defects; refine data capture mechanisms at the source; educate users on the importance of data quality (DQ).
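The rule checks in step 2 can be sketched in a few lines. This is a minimal illustration of what a business-rule discovery or cleansing tool automates, run over a hypothetical customer extract; the field names and rules are assumptions, not any vendor's API.

```python
# Hypothetical customer extract with the defect types named above.
records = [
    {"id": 1, "name": "Joe",   "email": "joe@example.com"},
    {"id": 2, "name": "",      "email": "fred@example.com"},  # missing value
    {"id": 3, "name": "Sally", "email": "not-an-email"},      # incorrect value
    {"id": 1, "name": "Joe",   "email": "joe@example.com"},   # duplicate key
]

def find_problems(records):
    """Flag missing, incorrect, or duplicate values before loading."""
    problems, seen = [], set()
    for r in records:
        if r["id"] in seen:
            problems.append((r["id"], "duplicate key"))
        seen.add(r["id"])
        if not r["name"]:
            problems.append((r["id"], "missing name"))
        if "@" not in r["email"]:
            problems.append((r["id"], "incorrect email"))
    return problems

problems = find_problems(records)
print(problems)  # [(2, 'missing name'), (3, 'incorrect email'), (1, 'duplicate key')]
```

Records flagged here would be rejected or repaired at the source, so that only clean data reaches the warehouse.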
Customized programs
  Strengths: address specific needs; no bulky one-time investment.
  Limitations: tons of custom programs in different environments are difficult to manage; minor alterations demand coding effort.
Data quality assessment tools
  Strength: provide automated assessment.
  Limitation: no measure of data accuracy.
Business rule discovery tools
  Strengths: detect correlations in data values; can detect patterns of behavior that indicate fraud.
  Limitations: not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields.
Data reengineering and cleansing tools
  Strengths: usually integrated packages with cleansing features as add-ons.
  Limitations: error prevention at the source is usually absent; the ETL tools have limited cleansing facilities.
Business Rule Discovery Tools: Integrity Data Reengineering Tool from Vality Technology; Trillium Software System from Harte-Hanks Data Technologies; Migration Architect from DB Star.
Data Reengineering & Cleansing Tools: Carleton Pureview from Oracle; ETI-Extract from Evolutionary Technologies; PowerMart from Informatica Corp; Sagent Data Mart from Sagent Technology.
Data Quality Assessment Tools: Migration Architect, Evoke Axio from Evoke Software; Wizrule from Wizsoft.
Name & Address Cleansing Tools: Centrus Suite from Sagent; i.d.Centric from First Logic.
ETL Architecture
Data extraction:
  Rummages through a file or database, uses some criteria for selection, identifies qualified data, and transports the data over onto another file or database.
Data transformation:
  Integrating dissimilar data types; changing codes; adding a time attribute; summarizing data; calculating derived values; renormalizing data.
Data loading:
  Initial and incremental loading; updating of metadata.
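The transformation steps above can be sketched on a handful of rows. This is an illustrative example, not any ETL tool's syntax: the region codes, amounts, and load date are invented, and each comment names the step from the list it demonstrates.

```python
from datetime import date

# Hypothetical extracted rows: the amount arrives as text (dissimilar type).
raw = [
    {"region": "N", "amount": "125.50"},
    {"region": "S", "amount": "80.00"},
    {"region": "N", "amount": "24.50"},
]
CODE_MAP = {"N": "North", "S": "South"}   # changing codes

transformed = [
    {
        "region": CODE_MAP[r["region"]],   # changing codes
        "amount": float(r["amount"]),      # integrating dissimilar data types
        "load_date": date(2009, 1, 1),     # adding a time attribute
    }
    for r in raw
]

# Summarizing data / calculating derived values: total amount per region.
summary = {}
for r in transformed:
    summary[r["region"]] = summary.get(r["region"], 0.0) + r["amount"]
print(summary)  # {'North': 150.0, 'South': 80.0}
```

In a real pipeline the `transformed` rows would feed the initial or incremental load, and the run would be recorded in the metadata.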
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
Most frequently used interchange standard, but addresses only a limited subset of metadata artifacts. XML addresses context and data meaning, not presentation. Can enable exchange over the web, employing industry standards for storing and sharing programming data. Will allow sharing of UML and MOF objects between various development tools and repositories. Based on XML/UML standards. Promoted by Microsoft along with 20 partners, including the Object Management Group (OMG), Oracle, Carleton Group, CA/PLATINUM Technology (founding member), and Viasoft.
OLAP
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
09/02/2012
OLTP vs. OLAP:
- Source of data: OLTP holds operational data; the OLTP systems are the original source of the data. OLAP holds consolidation data; OLAP data comes from the various OLTP databases.
- Purpose: OLTP controls and runs fundamental business tasks. OLAP supports decision making.
- What the data reveals: OLTP gives a snapshot of ongoing business processes. OLAP gives multi-dimensional views of various kinds of business activities.
- Inserts and updates: in OLTP, short and fast inserts and updates initiated by end users. In OLAP, periodic long-running batch jobs refresh the data.
MDDB Concepts
A multidimensional database (MDDB) is a computer software system designed for efficient and convenient storage and retrieval of data that is closely related and is stored, viewed, and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions; the individual items within each dimension are called members.
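A hypercube can be sketched as a mapping from one member per dimension to a cell value. The dimension and member names below follow the deck's car-sales example, but the cell values are made up for illustration; storing only occupied cells is one simple way such a structure can be represented.

```python
# The edges of the cube (dimensions) and the items on each edge (members).
dimensions = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Carr", "Gleason", "Clyde"],
}

# Only cells that hold data are stored; empty member combinations are simply
# absent from the dict. (Cell values here are illustrative.)
cells = {
    ("Mini Van", "Blue", "Carr"): 6,
    ("Sedan", "Red", "Clyde"): 3,
}

def measure(model, color, dealership):
    """Look up one cell of the cube; blank cells read as 0."""
    return cells.get((model, color, dealership), 0)

print(measure("Mini Van", "Blue", "Carr"))   # 6
print(measure("Coupe", "White", "Gleason"))  # 0 (blank cell)
```

Leaving empty combinations unstored is also the natural way to cope with the sparsity discussed below: cells whose members never interact simply never appear.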
Increased Complexity...
[Figure: the same Sales Volumes data in a relational DBMS versus an MDDB. The relational table (MODEL, COLOR, DEALERSHIP, volume) needs 27 rows x 4 columns = 108 cells; the MDDB cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White), and DEALERSHIP (Carr, Gleason, Clyde) needs 3 x 3 x 3 = 27 cells.]
Sparsity
If dimension members of different dimensions do not interact, then a blank cell is left behind.
[Figure: sparsity example - an employee table (LAST NAME, EMP#, AGE) recast as a LAST NAME x EMPLOYEE # grid. Because each employee number pairs with only one last name, almost every cell in the grid is blank.]
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
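Two of the features above, slicing and rotation, can be shown on a small two-dimensional view of the cube. The sketch is illustrative only: the cube is a plain dict keyed by (model, color), and the quantities are invented, not taken from any real dataset.

```python
# A two-dimensional view of the Sales Volumes cube, {(model, color): qty}.
# Quantities are illustrative.
cube = {
    ("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5, ("Mini Van", "White"): 4,
    ("Coupe", "Blue"): 3,    ("Coupe", "Red"): 5,    ("Coupe", "White"): 5,
    ("Sedan", "Blue"): 4,    ("Sedan", "Red"): 3,    ("Sedan", "White"): 2,
}

def slice_(cube, axis, member):
    """Fix one dimension to a single member (a slice of the cube)."""
    return {k: v for k, v in cube.items() if k[axis] == member}

def rotate(cube):
    """Swap the two visible dimensions (the on-screen 'rotate 90 degrees')."""
    return {(color, model): v for (model, color), v in cube.items()}

blue = slice_(cube, 1, "Blue")           # all models, Blue only
print(blue)
print(rotate(cube)[("Blue", "Sedan")])   # 4: same cell, axes swapped
```

Rotation changes nothing in the data; it only changes which pair of dimensions faces the viewer, which is why the same cell is reachable under either key order.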
[Figure: rotating the Sales Volumes cube. Each 90-degree rotation of the MODEL x COLOR x DEALERSHIP cube exposes a different pair of dimensions in the viewing area, giving six possible two-dimensional views (View #1 through View #6).]
[Figure: slicing/dicing the Sales Volumes cube - a sub-cube restricted to models Mini Van and Coupe, dealerships Carr and Clyde, and colors Normal Blue and Metal Blue.]
ORGANIZATION dimension hierarchy: REGION > DISTRICT > DEALERSHIP (the Midwest region contains the Chicago, St. Louis, and Gary districts; Chicago contains the Clyde and Gleason dealerships, St. Louis contains Carr and Levi, and Gary contains Lucas and Bolton).
Moving up and moving down in a hierarchy is referred to as drill-up (roll-up) and drill-down.
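A roll-up is just an aggregation one level up the dimension hierarchy. The sketch below rolls dealership-level sales up to districts; the hierarchy follows the organization example above, while the sales figures are invented for illustration.

```python
# Dealership -> district level of the ORGANIZATION hierarchy.
HIERARCHY = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}

# Illustrative leaf-level measure.
sales_by_dealership = {"Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 2}

def roll_up(sales, hierarchy):
    """Aggregate a measure one level up the hierarchy (drill-up)."""
    totals = {}
    for dealership, amount in sales.items():
        district = hierarchy[dealership]
        totals[district] = totals.get(district, 0) + amount
    return totals

print(roll_up(sales_by_dealership, HIERARCHY))  # {'Chicago': 17, 'St. Louis': 7}
```

Drill-down is the inverse navigation: starting from the district totals, the tool re-expands each district into its constituent dealership rows.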
[Figure: a cube of Region (East, West, Central) by Time (1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr, Year 1999).]
[Figure: MOLAP architecture - web browsers, OLAP tools, and OLAP applications query an OLAP calculation engine that reads a pre-built cube.]
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
[Figure: ROLAP architecture - web browsers, OLAP tools, and OLAP applications query an OLAP calculation engine that issues SQL against a relational data warehouse.]
ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on mid-tier server; target database on database server Processing split between mid-tier & database servers
[Figure: HOLAP architecture - any client (web browser, OLAP tools, OLAP applications) queries an OLAP calculation engine that combines an MDDB with SQL access to a relational data warehouse.]
HOLAP - Features
RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user
Architecture Comparison

MOLAP (MDDB OLAP = transaction-level data + summary in MDDB)
- Data explosion due to sparsity: with good design, 3-10 times
- Data explosion due to summarization: high (may go beyond control; estimation is very important)
- Query execution speed: fast (depends upon the size of the MDDB)
- Cost: medium (MDDB server + large disk space)
- Where to apply: small transactional data + complex model + frequent summary analysis

ROLAP (Relational OLAP = transaction-level data + summary in RDBMS)
- Data explosion due to sparsity: none
- Data explosion due to summarization: to the necessary extent
- Query execution speed: slow
- Cost: low (only RDBMS + disk space)
- Where to apply: very large transactional data that needs to be viewed/sorted

HOLAP (Hybrid OLAP = ROLAP + summary in MDDB)
- Data explosion due to sparsity: exists only in the MDDB part
- Data explosion due to summarization: to the necessary extent
- Query execution speed: optimum (if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP)
- Cost: high (RDBMS + disk space + MDDB server)
- Where to apply: large transactional data + frequent summary analysis
Oracle Express, Hyperion Essbase, Cognos PowerPlay, Seagate Holos, SAS, MicroStrategy DSS Agent, Informix MetaCube, Brio Query, Business Objects / WebIntelligence
Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
Programming for the testing challenge: in transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most data-quality and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare pre-transformation data with post-transformation data.
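A stand-alone comparison script of the kind described above can be sketched by re-deriving the expected target from the source and diffing it against what was actually loaded. Everything here is a hypothetical miniature: the code map, rows, and aggregation stand in for whatever the real transformation does.

```python
# Hypothetical pre-transformation (source) and post-transformation (target).
source_rows = [("A", 10), ("B", 20), ("A", 5)]   # (code, amount)
target_rows = [("Alpha", 15), ("Beta", 20)]      # decoded + aggregated
CODE_MAP = {"A": "Alpha", "B": "Beta"}

def reconcile(source, target, code_map):
    """Re-derive the expected target from the source and diff the two."""
    expected = {}
    for code, amount in source:
        name = code_map[code]
        expected[name] = expected.get(name, 0) + amount
    actual = dict(target)
    return {k: (expected.get(k, 0), actual.get(k, 0))
            for k in set(expected) | set(actual)
            if expected.get(k, 0) != actual.get(k, 0)}

print(reconcile(source_rows, target_rows, CODE_MAP))  # {} means no mismatches
```

An empty result means the load reconciles; any entry names a target key together with its (expected, actual) pair for investigation.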
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors: Are the requirements complete? Are the requirements singular? Are the requirements unambiguous? Are the requirements developable? Are the requirements testable?
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures covers:
Whether the ETLs access and pick up the right data from the right source. Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data. Testing the rejected records that don't fulfil the transformation rules.
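A unit test for one transformation rule, including its rejection path, can be sketched as follows. The rule, field names, and expected outputs are hypothetical, invented only to show the shape of such a test.

```python
def transform(record):
    """Hypothetical rule: quantity must be a positive integer, else reject."""
    qty = record.get("qty")
    if not isinstance(qty, int) or qty <= 0:
        return None  # rejected record
    return {"qty": qty, "doubled": qty * 2}

# Unit tests: the rule is applied correctly AND bad records are rejected.
accepted = transform({"qty": 3})
rejected = transform({"qty": -1})

assert accepted == {"qty": 3, "doubled": 6}
assert rejected is None
print("unit tests passed")
```

Testing the rejected records is as important as testing the happy path: a rule that silently accepts bad data corrupts the warehouse without failing any job.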
Unit Testing
Unit Testing the Report data:
Verify report data with the source: data in a data warehouse is stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the data warehouse against the available source data. Field-level data verification: the QA team must understand the linkages for the fields displayed in a report, and should trace them back and compare them with the source systems. Derivation formulae/calculation rules should also be verified.
Integration Testing
Integration testing will involve the following: the sequence of ETL jobs in a batch; initial loading of records into the data warehouse; incremental loading of records at a later date, to verify the newly inserted or updated data; testing the rejected records that don't fulfil the transformation rules; error log generation.
Performance Testing
Performance testing should check that: ETL processes complete within the time window; data quality issues are monitored and measured; refresh times for standard and complex reports are acceptable.
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality, and reporting.
Questions
Thank You
504
505
Content
1 2 3 4 5 An Overview of Data Warehouse Data Warehouse Architecture Data Modeling for Data Warehouse Overview of Data Cleansing Data Extraction , Transformation , Load
506
Content [contd]
6 7 8 Metadata Management OLAP Data Warehouse Testing
507
An Overview
508
509
510
511
512
Content
1 2 3 4 5 An Overview of Data Warehouse Data Warehouse Architecture Data Modeling for Data Warehouse Overview of Data Cleansing Data Extraction , Transformation , Load
513
Content [contd]
6 7 8 Metadata Management OLAP Data Warehouse Testing
514
An Overview
515
516
517
518
519
Content
1 2 3 4 5 An Overview of Data Warehouse Data Warehouse Architecture Data Modeling for Data Warehouse Overview of Data Cleansing Data Extraction , Transformation , Load
520
Content [contd]
6 7 8 Metadata Management OLAP Data Warehouse Testing
521
An Overview
522
523
524
525
Components of Warehouse
Source Tables : These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End - user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
526
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
527 222
Data Modeling
528
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity : Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship : relating entities to other entities.
529
530 222
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by 2009 Wipro Ltd Confidential lookup tables. Attributes --are the non-key columns in the lookup tables. 2222 Wipro Ltd Confidential
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId city c1 nyc c2 sfo c3 la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
531 222
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 cityId sfo la size small large pop 1M 5M location downtown suburbs regId north south
Dimension Table
city
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
region
532
533
534
Identify authoritative data sources Identify authoritative data sources Interview Employees & Customers Interview Employees & Customers Data Entry Points Data Entry Points Cost of bad data Cost of bad data
Use business rule discovery tools to identify data with Use business rule discovery tools to identify data with
inconsistent, missing, incomplete, duplicate or incorrect inconsistent, missing, incomplete, duplicate or incorrect values values
Use data cleansing tools to clean data at the source Use data cleansing tools to clean data at the source Load only clean data into the data warehouse Load only clean data into the data warehouse
Schedule Periodic Cleansing of Source Data Schedule Periodic Cleansing of Source Data
Id e n tify A re a s o f Id e n tify A re a s o f Im p ro v e m e n t Im p ro v e m e n t
Identify & Correct Cause of Defects Identify & Correct Cause of Defects Refine data capture mechanisms at source Refine data capture mechanisms at source Educate users on importance of DQ Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
535
Customized Programs Strengths: Addresses specific needs No bulky one time investment Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools Strength Provide automated assessment Limitation No measure of data accuracy
536
Business Rule Discovery tools Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields.
Data Reengineering & Cleansing tools Strengths Usually are integrated packages with cleansing features as Add-on Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
537
Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic
538
539
ETL Architecture
540
ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database
D a ta tra n sfo rm a ti n o
I te g ra ti g d i m i a r d a ta typ e s n n ssi l C h a n g i g co d e s n A d d i g a ti e a ttri u te n m b S u m m a ri n g d a ta zi C a l l ti g d e ri d va l e s cu a n ve u R e n o rm a l zi g d a ta i n
Data loading
Initial and incremental loading Updation of metadata
541
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.
542
543
Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
544
ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment
ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
545
Metadata Management
546
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
547
Importance Of Metadata
Locating Information Time spent in looking for information . How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products? What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?
548
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
550
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
552
Most frequently used interchange standard; addresses only a limited subset of metadata artifacts
XML addresses context and data meaning, not presentation; can enable exchange over the web employing industry standards for storing and sharing programming data
Will allow sharing of UML and MOF objects b/w various development tools and repositories
Based on XML/UML standards; promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle, Carleton Group, CA/PLATINUM Technology (Founding Member), Viasoft
553
OLAP
554
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
09/02/2012
555
556
OLTP - Operational data: OLTP systems are the original source of the data; used to control and run fundamental business tasks; a snapshot of ongoing business processes; short and fast inserts and updates initiated by end users.
OLAP - Consolidation data: OLAP data comes from the various OLTP databases; used for decision support; multi-dimensional views of various kinds of business activities; periodic long-running batch jobs refresh the data.
557
MDDB Concepts
A multidimensional database is a software system designed for efficient and convenient storage and retrieval of data that is intimately related and that can be stored, viewed and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data.
The edges of the cube are called dimensions.
Individual items within each dimension are called members.
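The dimension/member vocabulary above can be sketched in code: a cube stored as a mapping from member tuples to cell values, with each tuple position corresponding to one dimension. The member names reuse the MODEL/COLOR/DEALERSHIP example from the slides; the cell values are made up for illustration.

```python
# A hypercube sketched as a mapping from member tuples to cell values.
# Dimensions (the cube's edges): MODEL, COLOR, DEALERSHIP.
dimensions = ("MODEL", "COLOR", "DEALERSHIP")
cube = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Coupe", "Blue", "Carr"): 3,
    ("Sedan", "White", "Gleason"): 4,
}

def members(cube, dim_index):
    """Distinct members along one edge (dimension) of the cube."""
    return sorted({key[dim_index] for key in cube})

print(members(cube, 0))  # ['Coupe', 'Mini Van', 'Sedan']
```

Real MDDB servers use specialized storage rather than a hash map, but the logical model — cells addressed by one member per dimension — is the same.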
558
2009 Wipro Ltd - Confidential
Increased Complexity...
[Figure: the same Sales Volumes data (Model x Color x Dealership) stored in a relational DBMS occupies 27 x 4 = 108 cells, while the MDDB cube needs only 3 x 3 x 3 = 27 cells.]
560
Sparsity
561
If dimension members of different dimensions do not interact, then a blank cell is left behind.
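Sparsity can be quantified with a small sketch: count the cells a dense cube would need versus the combinations that actually interact. The employee ids and fact values below are hypothetical, chosen to echo the slide's Employee Age example.

```python
from itertools import product

# Sparsity sketch: only member combinations that actually interact hold data.
# Adding a 9-member employee dimension to a 3x3 MODEL x COLOR cube gives
# 3 * 3 * 9 = 81 dense cells, of which only a handful are non-blank.
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]
employees = [f"E{i}" for i in range(9)]   # hypothetical employee ids

facts = {                                  # the only real interactions
    ("Mini Van", "Blue", "E0"): 6,
    ("Sedan", "White", "E3"): 2,
}

dense_cells = len(list(product(models, colors, employees)))
sparsity = 1 - len(facts) / dense_cells
print(dense_cells, round(sparsity, 3))     # 81 0.975
```

Storing only the populated cells (as the `facts` mapping does) is exactly how sparse cube storage avoids the data explosion discussed later in the architecture comparison.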
[Figure: adding an Employee Age dimension to the Sales Volumes cube leaves most cells blank, because most employee / model / color combinations do not interact.]
562
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members
Trend analysis over sequential time periods; what-if scenarios
Slicing / dicing subsets for on-screen viewing
Rotation to new dimensional comparisons in the viewing area
Drill-down/up along the hierarchy
Reach-through / drill-through to underlying detail data
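Two of these features, slicing and rotation, can be sketched over a tuple-keyed cube: slicing fixes one member of a dimension and drops that axis, while rotation reorders the axes for viewing. The cell values are illustrative.

```python
# OLAP feature sketches over a 2-D MODEL x COLOR cube.
cube = {
    ("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5,
    ("Sedan", "Blue"): 4, ("Sedan", "Red"): 3,
}

def slice_cube(cube, dim_index, member):
    """Slice: keep cells where the given dimension equals member, drop that axis."""
    return {key[:dim_index] + key[dim_index + 1:]: value
            for key, value in cube.items() if key[dim_index] == member}

def rotate(cube):
    """Rotate 90 degrees: swap the two viewing axes without changing any value."""
    return {(color, model): value for (model, color), value in cube.items()}

print(slice_cube(cube, 1, "Blue"))       # {('Mini Van',): 6, ('Sedan',): 4}
print(rotate(cube)[("Blue", "Sedan")])   # 4
```

Dicing is the same idea applied with a set of members per dimension instead of a single one.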
563
[Figure: rotating the two-dimensional Sales Volumes view 90 degrees exchanges the MODEL and COLOR axes, turning View #1 into View #2 with the same cell values.]
564
[Figure: successive 90-degree rotations of the three-dimensional MODEL x COLOR x DEALERSHIP cube produce six possible two-dimensional views (Views #1-#6).]
565
[Figure: Sales Volumes cube in which the Blue member of the COLOR dimension is split into Normal Blue and Metal Blue, shown against MODEL (Mini Van, Coupe) and DEALERSHIP (Carr, Clyde).]
566
ORGANIZATION DIMENSION
[Figure: ORGANIZATION dimension hierarchy REGION -> DISTRICT -> DEALERSHIP, e.g. Midwest -> Chicago (Clyde, Gleason), St. Louis (Carr, Levi), Gary (Lucas, Bolton).]
Moving up and moving down in a hierarchy is referred to as drill-up (roll-up) and drill-down respectively.
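Roll-up along such a hierarchy is just aggregation into the parent level. The sketch below uses the dealership-to-district mapping from the organization hierarchy above; the sales figures are made up for illustration.

```python
# Roll-up sketch: aggregate dealership-level sales up to the district level
# using the ORGANIZATION hierarchy (dealership -> district -> region).
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
sales_by_dealership = {"Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 3, "Lucas": 8}

def roll_up(sales, parent_of):
    """Drill-up: sum child-level cells into their parent members."""
    totals = {}
    for child, value in sales.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals

print(roll_up(sales_by_dealership, district_of))
# {'Chicago': 17, 'St. Louis': 8, 'Gary': 8}
```

Drill-down is the inverse: starting from a district total, the tool re-queries the child (dealership) level of the same hierarchy.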
567
568
[Figure: a cube of sales by region (East, West, Central) and quarter (1st-4th Qtr) for year 1999.]
569
570
571
[Figure: MOLAP architecture - an OLAP calculation engine stores data in a cube and serves web browsers, OLAP tools, and OLAP applications.]
572
MOLAP - Features
Powerful analytical capabilities (e.g. financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for maximum query performance and optimum space utilization
573
[Figure: ROLAP architecture - the OLAP calculation engine issues SQL against a relational DW and serves web browsers, OLAP tools, and OLAP applications.]
574
ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on mid-tier server; target database on database server Processing split between mid-tier & database servers
575
[Figure: HOLAP architecture - any client (web browser, OLAP tools, OLAP applications) uses an OLAP calculation engine that combines MDDB storage with SQL access to a relational DW.]
576
HOLAP - Features
RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user
577
Architecture Comparison
MOLAP - MDDB OLAP = transaction-level data + summary in MDDB
  Data explosion due to sparsity: high (may go beyond control; estimation is very important)
  Data explosion due to summarization: with good design, 3-10 times
  Query execution speed: fast (depends upon the size of the MDDB)
  Cost: medium - MDDB server + large disk space
  Where to apply: small transactional data + complex model + frequent summary analysis
ROLAP - Relational OLAP = transaction-level data + summary in RDBMS
  Data explosion due to sparsity: none
  Data explosion due to summarization: to the necessary extent
  Query execution speed: slow
  Cost: low - only RDBMS + disk space
  Where to apply: very large transactional data that needs to be viewed / sorted
HOLAP - Hybrid OLAP = ROLAP + summary in MDDB
  Data explosion due to sparsity: exists only in the MDDB part
  Data explosion due to summarization: to the necessary extent
  Query execution speed: optimum - ROLAP-like if the data is fetched from the RDBMS, otherwise MOLAP-like
  Cost: high - RDBMS + disk space + MDDB server
  Where to apply: large transactional data + frequent summary analysis
578
Oracle Express products
Hyperion Essbase
Cognos PowerPlay
Seagate Holos
SAS
MicroStrategy DSS Agent
Informix MetaCube
Brio Query
Business Objects / WebIntelligence
579
Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
580
581
582
583
584
585
Programming for the testing challenge: in transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the backend by running separate stand-alone scripts that compare pre-transformation data to post-transformation data.
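Such a stand-alone backend script can be sketched as follows. The transformation assumed here is hypothetical (summing order amounts per customer); the check reconciles the aggregated target against totals recomputed from the source.

```python
# Sketch of a backend ETL check: recompute the expected aggregate from the
# source rows and compare it with what the transformation loaded into the
# target. Customer ids and amounts are illustrative.
source_rows = [("c1", 10.0), ("c1", 2.5), ("c2", 7.0)]   # (custId, amt) pre-transformation
target_rows = [("c1", 12.5), ("c2", 7.0)]                # post-transformation aggregate

def reconcile(source, target):
    """Assert that per-customer totals survived the transformation intact."""
    expected = {}
    for cust, amt in source:
        expected[cust] = expected.get(cust, 0.0) + amt
    actual = dict(target)
    assert set(expected) == set(actual), "customer sets differ"
    for cust in expected:
        assert abs(expected[cust] - actual[cust]) < 1e-9, f"amount mismatch for {cust}"
    return True

print(reconcile(source_rows, target_rows))  # True
```

In practice the two row sets would be fetched from the source system and the warehouse with SQL, but the comparison logic is the same.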
586
587
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors: Are the requirements complete? Are the requirements singular? Are the requirements unambiguous? Are the requirements developable? Are the requirements testable?
588
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures:
Whether the ETLs access and pick up the right data from the right source
All data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
Testing the rejected records that don't fulfil the transformation rules
589
Unit Testing
Unit Testing the Report data:
Verify report data with the source: data in a data warehouse is stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the data warehouse against the available source data
Field-level data verification: the QA team must understand the linkages for the fields displayed in the report and should trace back and compare them with the source systems
Derivation formulae / calculation rules should be verified
590
Integration Testing
Integration testing will involve the following: sequence of ETL jobs in a batch; initial loading of records into the data warehouse; incremental loading of records at a later date to verify the newly inserted or updated data; testing the rejected records that don't fulfil the transformation rules; error log generation.
591
Performance Testing
Performance Testing should check for :
ETL processes completing within the time window
Monitoring and measuring data quality issues
Refresh times for standard/complex reports
592
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality and reporting.
593
Questions
594
Thank You
595
Components of Warehouse
Source tables: real-time, volatile data in relational databases used for transaction processing (OLTP); these can be any relational databases or flat files.
ETL tools: to extract, cleanse, transform (aggregates, joins) and load the data from sources to the target.
Maintenance and administration tools: to authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
Modeling tools: used to design the data warehouse for high performance using dimensional data modeling techniques and to map the source and target files.
Databases: target databases and data marts, which are part of the data warehouse; these are structured for analysis and reporting purposes.
End-user tools for analysis and reporting: to get the reports and analyze the data from the target tables; different querying, data mining and OLAP tools are used for this purpose.
596
This architecture has a staging area, where the data is loaded and tested after cleansing and transformation. It is then loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
597
Data Modeling
598
Data Modeling
The E-R data model is commonly used in OLTP systems, while the dimensional data model is commonly used in OLAP. E-R (Entity-Relationship) Data Model:
Entity: an object that can be observed and classified based on its properties and characteristics, e.g. employee, book, student. Relationship: an association relating entities to other entities.
599
600
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
Star Schema
Dimension Table
product
prodId  name  price
p1      bolt  10
p2      nut   5
Dimension Table
store
storeId  city
c1       nyc
c2       sfo
c3       la
Fact Table
sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
Dimension Table
customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
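A typical star-schema query joins the fact table to one dimension and aggregates a measure. The sketch below runs such a query in plain Python over the sample tables above, totalling the sales amount per city via the store dimension.

```python
# Star-schema query sketch over the sample tables: join the sale fact table
# to the store dimension and total the sales amount per city.
store = {"c1": "nyc", "c2": "sfo", "c3": "la"}            # storeId -> city
sales = [  # (orderId, date, custId, prodId, storeId, qty, amt)
    ("o100", "1/7/97", "53", "p1", "c1", 1, 12),
    ("o102", "2/7/97", "53", "p2", "c1", 2, 11),
    ("105", "3/8/97", "111", "p1", "c3", 5, 50),
]

amt_by_city = {}
for _oid, _date, _cust, _prod, store_id, _qty, amt in sales:
    city = store[store_id]                   # fact -> dimension key lookup
    amt_by_city[city] = amt_by_city.get(city, 0) + amt

print(amt_by_city)  # {'nyc': 23, 'la': 50}
```

In SQL this is the classic star join: `SELECT city, SUM(amt) FROM sale JOIN store USING (storeId) GROUP BY city`.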
601
Snowflake Schema
Fact and dimension tables:

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not highly normalized and are frequently designed at a level of normalization short of third normal form.
602
603
604
Identify authoritative data sources: interview employees & customers; identify data entry points; assess the cost of bad data.
Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values.
Use data cleansing tools to clean data at the source; load only clean data into the data warehouse.
Schedule periodic cleansing of source data.
Identify areas of improvement: identify & correct the cause of defects; refine data capture mechanisms at the source; educate users on the importance of DQ.
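The "identify data with inconsistent, missing, incomplete, duplicate or incorrect values" step can be sketched as a rule-based audit run against source rows before anything is loaded. The rows and the age range rule are hypothetical illustrations.

```python
# Sketch of rule-based cleansing checks at the source: flag missing keys,
# duplicate keys, and out-of-range values before loading the warehouse.
rows = [
    {"custId": "53", "age": 21},
    {"custId": "53", "age": 21},      # duplicate key
    {"custId": None, "age": 19},      # missing key
    {"custId": "81", "age": 163},     # value out of range
]

def audit(rows):
    """Return (row index, issue) pairs for every rule violation found."""
    issues, seen = [], set()
    for i, row in enumerate(rows):
        if row["custId"] is None:
            issues.append((i, "missing custId"))
        elif row["custId"] in seen:
            issues.append((i, "duplicate custId"))
        else:
            seen.add(row["custId"])
        if not 0 <= row["age"] <= 120:    # hypothetical business rule
            issues.append((i, "age out of range"))
    return issues

print(audit(rows))
# [(1, 'duplicate custId'), (2, 'missing custId'), (3, 'age out of range')]
```

Commercial rule discovery tools generalize this idea by inferring the rules from the data instead of hard-coding them.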
605
Customized programs
  Strengths: address specific needs; no bulky one-time investment
  Limitations: tons of custom programs in different environments are difficult to manage; minor alterations demand coding effort
Data quality assessment tools
  Strength: provide automated assessment
  Limitation: no measure of data accuracy
606
Business rule discovery tools
  Strengths: detect correlation in data values; can detect patterns of behavior that indicate fraud
  Limitations: not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields
Data reengineering & cleansing tools
  Strengths: usually integrated packages with cleansing features as add-ons
  Limitations: error prevention at the source is usually absent; the ETL tools have limited cleansing facilities
607
Business rule discovery tools: Integrity Data Reengineering Tool from Vality Technology; Trillium Software System from Harte-Hanks Data Technologies; Migration Architect from DB Star
Data reengineering & cleansing tools: Carlton Pureview from Oracle; ETI-Extract from Evolutionary Technologies; PowerMart from Informatica Corp; Sagent Data Mart from Sagent Technology
Data quality assessment tools: Migration Architect, Evoke Axio from Evoke Software; Wizrule from Wizsoft
Name & address cleansing tools: Centrus Suite from Sagent; I.d.centric from First Logic
608
609
ETL Architecture
610
Data Extraction:
Rummages through a file or database
Uses some criteria for selection
Identifies qualified data
Transports the data over onto another file or database
Data transformation:
Integrating dissimilar data types
Changing codes
Adding a time attribute
Summarizing data
Calculating derived values
Renormalizing data
Data loading
Initial and incremental loading; updating of metadata
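The initial versus incremental loading distinction can be sketched as follows: an initial load rebuilds the target from scratch, while an incremental load upserts only new or changed rows. Key and value names are illustrative.

```python
# Loading sketch: an initial full load followed by an incremental load that
# inserts new keys and updates changed ones. Keys/values are illustrative.
target = {}

def load(target, batch, incremental=False):
    """Apply (key, value) rows to the target; full loads start from empty."""
    if not incremental:
        target.clear()                 # initial load replaces everything
    for key, value in batch:
        target[key] = value            # insert new or update existing rows
    return target

load(target, [("o100", 12), ("o102", 11)])                    # initial load
load(target, [("o102", 15), ("o105", 50)], incremental=True)  # later delta
print(target)  # {'o100': 12, 'o102': 15, 'o105': 50}
```

Real ETL tools add change-data-capture to identify the delta and update the metadata repository after each run, but the load semantics are the same.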
611
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
612
613
614
615
616
Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs.
Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes.
Extract: the process of reading data from a database.
Transform: the process of converting the extracted data.
Load: the process of writing the data into the target database.
Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL runtime components.
Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.
649
ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment
ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
650
Metadata Management
651
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
652
Importance Of Metadata
Locating Information Time spent in looking for information . How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products? What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?
653
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
655
656
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
658
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CAPLATINUM Technology (Founding Member), Viasoft
659
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
661
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CAPLATINUM Technology (Founding Member), Viasoft
662
OLAP
663
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
09/02/2012
664
664
09/02/2012
665
665
Operational data; OLTPs Consolidation data; OLAP are the original source data comes from the of the data various OLTP databases To control and run fundamental business tasks A snapshot of ongoing business processes Decision support Multi-dimensional views of various kinds of business activities
Short and fast inserts Periodic long-running and updates initiated by batch jobs refresh the end users data
666
09/02/2012
666
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions).
A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
667
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
Increased Complexity...
COLOR DEALER
Relational DBMS
MDDB
Sales Volumes
M O D E L
Mini Van Coupe Sedan Blue Red White Carr Gleason Clyde
DEALERSHIP
COLOR
27 x 4 = 108 cells
668
3 x 3 x 3 = 27 cells
669
Sparsity
09/02/2012
670
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
670
If dimension members of different dimensions do not interact , then blank cell is left behind.
Employee Age
21 19 63 31 27 56 45 41 19
31 41 23 01 14 54 03 12 33
LAST NAME EMP# AGE SMITH 01 21 REGAN 12 19 Sales Volumes FOX 31 63 WELD 14 Van 6 5 4 Miini 31 M O KELLY 54 Coupe 27 D E LINK 03 56 3 5 5 L KRANZ 41 Sedan 4 3 2 45 LUCUS 33 41 Blue Red White WEISS 23 19 COLOR
L A S T N A M E
EMPLOYEE #
09/02/2012
671
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
671
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
09/02/2012
672
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
672
M O D E L
6 3 4
Blue
5 5 3
Red
4 5 2
White
C O L O R ( ROTATE 90 )
o
6 5 4
3 5 5
MODEL
4 3 2
Sedan
Sedan
COLOR
View #1
View #2
09/02/2012
673
673
M O D E L
Mini Van Coupe Sedan Blue Red White Carr Gleason Clyde
C O L O R
Blue
C O L O R
Blue
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
( ROTATE 90 )
DEALERSHIP
DEALERSHIP
MODEL
View #1
D E A L E R S H I P D E A L E R S H I P
View #2
View #3
Carr Gleason Clyde White Red Blue Mini Van Coupe Sedan
M O D E L
Mini Van Coupe Sedan Clyde Gleason Carr Blue Red White
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
MODEL
COLOR
COLOR
View #4
View #5
View #6
09/02/2012
674
674
Sales Volumes
M O D E L
Mini Van
Coupe
Carr Clyde
Carr Clyde
Normal Blue
Metal Blue
DEALERSHIP
COLOR
09/02/2012
675
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
675
ORGANIZATION DIMENSION
REGION Chicago Clyde Gleason Midwest St. Louis Carr Levi Gary Lucas Bolton
DISTRICT DEALERSHIP
Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down
09/02/2012
676
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
676
09/02/2012
677
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
677
East West Central 1st Qtr 2nd Qtr 3rd Qtr 4th Qtr Year 1999
678
679
680
OLAP
Cube
OLAP Calculation Engine
Web Browser
OLAP Tools
OLAP Applications
09/02/2012
681
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
681
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
(Figure: ROLAP architecture: a web browser, OLAP tools, and OLAP applications query an OLAP calculation engine that issues SQL against a relational data warehouse.)
ROLAP - Features
Three-tier hardware/software architecture:
GUI on the client; multidimensional processing on a mid-tier server; target database on a database server
Processing split between the mid-tier and database servers
(Figure: HOLAP architecture: any client, a web browser, OLAP tools, or OLAP applications query an OLAP calculation engine that issues SQL against a relational data warehouse.)
HOLAP - Features
RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of RDBMS and MDDB performance
Calculation engine provides full analysis features
Source of the data is transparent to the end user
Architecture Comparison
MOLAP
Definition: MDDB OLAP = transaction-level data + summary in MDDB
Data explosion due to sparsity: High (may go beyond control; estimation is very important)
Data explosion due to summarization: Good design, 3 to 10 times
Query execution speed: Fast (depends upon the size of the MDDB)
Cost: Medium (MDDB server + large disk space)
Where to apply: Small transactional data + complex model + frequent summary analysis

ROLAP
Definition: Relational OLAP = transaction-level data + summary in RDBMS
Data explosion due to sparsity: No sparsity
Data explosion due to summarization: To the necessary extent
Query execution speed: Slow
Cost: Low (only RDBMS + disk space)
Where to apply: Very large transactional data that needs to be viewed/sorted

HOLAP
Definition: Hybrid OLAP = ROLAP + summary in MDDB
Data explosion due to sparsity: Exists only in the MDDB part
Data explosion due to summarization: To the necessary extent
Query execution speed: Optimum (if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP)
Cost: High (RDBMS + disk space + MDDB server)
Where to apply: Large transactional data + frequent summary analysis
Representative tools: Oracle Express, Hyperion Essbase, Cognos PowerPlay, Seagate Holos, SAS, MicroStrategy DSS Agent, Informix MetaCube, Brio Query, Business Objects / Web Intelligence
Typical OLAP applications: Sales Analysis, Financial Analysis, Profitability Analysis, Performance Analysis, Risk Management, Profiling & Segmentation, Scorecard Applications, NPA Management, Strategic Planning, Customer Relationship Management (CRM)
Programming as a testing challenge: in transaction systems, users and business analysts typically test the output of the system through the application itself. In a data warehouse, most data quality testing and ETL testing is done at the back end by running separate stand-alone scripts that compare the data before and after transformation.
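A stand-alone reconciliation script of the kind described above can be sketched as follows; the table and column names are illustrative, and the target table is populated in-line here to stand in for the ETL load under test.

```python
# Sketch of a stand-alone back-end test script: reconcile the
# post-transformation (target) table against the pre-transformation
# (source) table. Table and column names are illustrative.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_sales (order_id INTEGER, amount REAL);
    CREATE TABLE tgt_sales (order_id INTEGER, amount REAL);
    INSERT INTO src_sales VALUES (1, 10.0), (2, 25.5), (3, 4.5);
    INSERT INTO tgt_sales SELECT * FROM src_sales;  -- stands in for the ETL load
""")

def reconcile(con, src, tgt, measure):
    """Compare row counts and a summed measure between source and target."""
    q = "SELECT COUNT(*), SUM({m}) FROM {t}"
    src_cnt, src_sum = con.execute(q.format(m=measure, t=src)).fetchone()
    tgt_cnt, tgt_sum = con.execute(q.format(m=measure, t=tgt)).fetchone()
    return {"count_match": src_cnt == tgt_cnt, "sum_match": src_sum == tgt_sum}

result = reconcile(con, "src_sales", "tgt_sales", "amount")
```

Real scripts add per-rule checks (code mappings applied, rejects excluded), but count-and-sum reconciliation is the usual first gate.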
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
Are the requirements complete?
Are the requirements singular?
Are the requirements unambiguous?
Are the requirements developable?
Are the requirements testable?
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures verifies:
Whether the ETLs access and pick up the right data from the right source.
That all data transformations are correct according to the business rules, and that the data warehouse is correctly populated with the transformed data.
The rejected records that don't fulfil the transformation rules.
Unit Testing
Unit Testing the Report data:
Verify report data with the source: data in a data warehouse is stored at an aggregate level compared to the source systems; the QA team should verify the granular data stored in the warehouse against the available source data.
Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
Derivation formulae and calculation rules should be verified.
Integration Testing
Integration testing will involve the following:
Sequence of ETL jobs in the batch.
Initial loading of records into the data warehouse.
Incremental loading of records at a later date, to verify the newly inserted or updated data.
Testing the rejected records that don't fulfil the transformation rules.
Error log generation.
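The incremental-load step above can be checked with a small script: after a delta run, new rows should appear and changed rows should be overwritten. The table layout and the delta itself are assumed for illustration (the upsert syntax needs SQLite 3.24 or later).

```python
# Sketch of an integration check for incremental loading: verify that a
# delta run inserts new rows and updates existing ones. Layout assumed.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dw_customer (cust_id INTEGER PRIMARY KEY, city TEXT)")

# Initial load.
con.executemany("INSERT INTO dw_customer VALUES (?, ?)",
                [(53, "sfo"), (81, "sfo")])

# Incremental load: customer 53 moved, customer 111 is new.
delta = [(53, "la"), (111, "la")]
for cust_id, city in delta:
    con.execute(
        "INSERT INTO dw_customer VALUES (?, ?) "
        "ON CONFLICT(cust_id) DO UPDATE SET city = excluded.city",
        (cust_id, city),
    )

rows = dict(con.execute("SELECT cust_id, city FROM dw_customer"))
```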
Performance Testing
Performance Testing should check for :
ETL processes completing within the time window.
Monitoring and measuring data quality issues.
Refresh times for standard and complex reports.
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality, and reporting.
Questions
Thank You
Data Modeling
Commonly, the E-R data model is used in OLTP systems, while the dimensional data model is used in OLAP.
E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified based on its properties and characteristics, such as an employee, a book, or a student.
Relationship: an association relating entities to other entities.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
Star Schema
Dimension table product:
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension table store:
storeId  city
c1       nyc
c2       sfo
c3       la

Fact table sale:
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension table customer:
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
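The sample star schema above can be rebuilt in SQLite to show the characteristic star join: the fact table joins directly to each dimension. Dates are written in ISO form here, and the third order id is written as 'o105' by assumption (the slide shows it truncated).

```python
# The sample star schema in SQLite: one fact table joined directly to
# each dimension table in a single star join.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE sale    (orderId TEXT, saleDate TEXT, custId INTEGER,
                          prodId TEXT, storeId TEXT, qty INTEGER, amt REAL);
    INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
    INSERT INTO store   VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
    INSERT INTO sale    VALUES
        ('o100', '1997-07-01', 53,  'p1', 'c1', 1, 12),
        ('o102', '1997-07-02', 53,  'p2', 'c1', 2, 11),
        ('o105', '1997-08-03', 111, 'p1', 'c3', 5, 50);
""")

# Star join: total revenue per product name and store city.
revenue = con.execute("""
    SELECT p.name, s.city, SUM(f.amt)
    FROM sale AS f
    JOIN product AS p ON p.prodId = f.prodId
    JOIN store   AS s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
""").fetchall()
```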
Snowflake Schema
Dimension tables (normalized):

store:
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType:
tId  size   location
t1   small  downtown
t2   large  suburbs

city:
cityId  pop  regId
sfo     1M   north
la      5M   south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.
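The cost of snowflaking is visible in the queries: because city is normalized out of store (and region out of city), reaching region-level attributes takes extra join hops. A sketch using the sample store dimension:

```python
# The snowflaked store dimension in SQLite: city is normalized out of
# store, and region out of city, so region queries need extra joins.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE region (regId TEXT PRIMARY KEY);
    CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT);
    CREATE TABLE store  (storeId TEXT PRIMARY KEY, cityId TEXT, mgr TEXT);
    INSERT INTO region VALUES ('north'), ('south');
    INSERT INTO city   VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
    INSERT INTO store  VALUES ('s5', 'sfo', 'joe'), ('s7', 'sfo', 'fred'),
                              ('s9', 'la', 'nancy');
""")

# Counting stores per region requires a hop through city first.
stores_per_region = dict(con.execute("""
    SELECT c.regId, COUNT(*)
    FROM store AS s
    JOIN city  AS c ON c.cityId = s.cityId   -- first snowflake hop
    GROUP BY c.regId
""").fetchall())
```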
Identify authoritative data sources; interview employees and customers; identify data entry points and the cost of bad data.
Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate, or incorrect values.
Use data cleansing tools to clean data at the source, and load only clean data into the data warehouse.
Schedule periodic cleansing of source data.
Identify areas of improvement: identify and correct the causes of defects, refine data capture mechanisms at the source, and educate users on the importance of data quality.
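The kind of checks a rule-discovery or cleansing pass applies before loading can be sketched as follows; the records and the rules themselves are invented for illustration.

```python
# Sketch of pre-load cleansing checks: flag records with missing,
# duplicate, or rule-inconsistent values. Records and rules are invented.

records = [
    {"cust_id": 1, "state": "CA", "zip": "94105"},
    {"cust_id": 2, "state": "",   "zip": "60601"},   # missing state
    {"cust_id": 1, "state": "CA", "zip": "94105"},   # duplicate cust_id
    {"cust_id": 3, "state": "CA", "zip": "10001"},   # CA zips start with 9
]

def audit(records):
    """Return (index, problem) pairs for records that break a rule."""
    issues, seen = [], set()
    for i, rec in enumerate(records):
        if not rec["state"]:
            issues.append((i, "missing state"))
        if rec["cust_id"] in seen:
            issues.append((i, "duplicate cust_id"))
        seen.add(rec["cust_id"])
        if rec["state"] == "CA" and not rec["zip"].startswith("9"):
            issues.append((i, "inconsistent state/zip"))
    return issues

problems = audit(records)
bad_rows = {i for i, _ in problems}
clean = [rec for i, rec in enumerate(records) if i not in bad_rows]
```

Only the clean rows would be loaded; the flagged ones go back for correction at the source.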
Customized Programs
Strengths: address specific needs; no bulky one-time investment.
Limitations: many custom programs in different environments are difficult to manage; minor alterations demand coding effort.
Data Quality Assessment Tools
Strength: provide automated assessment.
Limitation: no measure of data accuracy.
Business Rule Discovery Tools
Strengths: detect correlations in data values; can detect patterns of behavior that indicate fraud.
Limitations: not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields.
Data Reengineering & Cleansing Tools
Strengths: usually integrated packages with cleansing features as add-ons.
Limitations: error prevention at the source is usually absent; the ETL tools have limited cleansing facilities.
Business Rule Discovery Tools: Integrity Data Reengineering Tool from Vality Technology; Trillium Software System from Harte-Hanks Data Technologies; Migration Architect from DB Star.
Data Reengineering & Cleansing Tools: Carlton Pureview from Oracle; ETI-Extract from Evolutionary Technologies; PowerMart from Informatica Corp; Sagent Data Mart from Sagent Technology.
Data Quality Assessment Tools: Migration Architect, Evoke Axio from Evoke Software; Wizrule from Wizsoft.
Name & Address Cleansing Tools: Centrus Suite from Sagent; I.d.centric from First Logic.
ETL Architecture
Data Extraction:
Rummages through a file or database, uses some criteria for selection, identifies qualified data, and transports the data over onto another file or database.
Data Transformation:
Integrating dissimilar data types, changing codes, adding a time attribute, summarizing data, calculating derived values, renormalizing data.
Data Loading:
Initial and incremental loading; updating of metadata.
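The three steps above can be run end to end in a few lines. The sketch below extracts from an in-memory CSV, transforms (a code mapping plus a derived value), and loads into a SQLite target; the file layout and code values are assumed for illustration.

```python
# Minimal extract / transform / load pass: CSV source, in-line
# transformations, SQLite target. Layout and code values are assumed.

import csv
import io
import sqlite3

raw = io.StringIO("order_id,status,amount\n1,A,10.0\n2,C,25.5\n3,A,4.5\n")

# Extract: read and select qualified rows from the source.
rows = list(csv.DictReader(raw))

# Transform: change codes and calculate a derived value.
status_map = {"A": "ACTIVE", "C": "CANCELLED"}
transformed = [
    (int(r["order_id"]), status_map[r["status"]],
     float(r["amount"]), round(float(r["amount"]) * 0.1, 2))  # derived tax
    for r in rows if r["status"] in status_map
]

# Load: write the transformed rows into the target table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, status TEXT,"
            " amount REAL, tax REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", transformed)
loaded = con.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
```

An incremental run would apply the same transform to only the new or changed source rows before loading.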
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform, and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs.
Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes.
Extract: the process of reading data from a database.
Transform: the process of converting the extracted data.
Load: the process of writing the data into the target database.
Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL runtime components.
Administration and operation: ETL utilities let administrators schedule, run, and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.
ETL Tools
Provide a facility to specify a large number of transformation rules with a GUI; generate programs to transform data; handle multiple data sources; handle data redundancy; generate metadata as output. Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment.
ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
About the data being captured and loaded into the warehouse
Documented in IT tools that improve both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating information
How much time is spent looking for information? How often is the information found? What poor decisions were made based on incomplete information? How much money was lost or earned as a result?
Interpreting information
How many times have businesses needed to rework or recall products? What impact does it have on the bottom line? How many mistakes were due to misinterpretation of existing documentation? How much misinterpretation results from too much metadata? How much time is spent trying to determine whether any of the metadata is accurate?
Integrating information
How do the various data perspectives connect together? How much time is spent trying to figure that out? How much does the inefficiency and lack of metadata affect decision making?
Consumers of Metadata
Technical users: warehouse administrators, application developers.
Business users: business metadata, i.e. meanings, definitions, business rules.
Software tools: used in DW life-cycle development. Metadata requirements for each tool must be identified; tool-specific metadata should be analysed for inclusion in the enterprise metadata repository; previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool.
Reischmann-Informatik-Toolbus
Features include selective bridging of metadata.
Most frequently used interchange standard; addresses only a limited subset of metadata artifacts.
XML addresses context and data meaning, not presentation; can enable exchange over the web employing industry standards for storing and sharing programming data.
Will allow sharing of UML and MOF objects between various development tools and repositories; based on XML/UML standards.
Promoted by Microsoft along with 20 partners, including the Object Management Group (OMG), Oracle, Carleton Group, CAPLATINUM Technology (founding member), and Viasoft.
OLAP
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
OLTP vs. OLAP:
Source of data: OLTP holds operational data; the OLTP systems are the original source of the data. OLAP holds consolidated data, which comes from the various OLTP databases.
Purpose: OLTP controls and runs fundamental business tasks. OLAP supports decision making.
What the data shows: OLTP gives a snapshot of ongoing business processes. OLAP gives multi-dimensional views of various kinds of business activities.
Inserts and updates: in OLTP, short and fast inserts and updates initiated by end users. In OLAP, periodic long-running batch jobs refresh the data.
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions).
A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions; the individual items within each dimension are called members.
Increased Complexity...
(Figure: the same Sales Volumes data stored in a relational DBMS takes 27 rows x 4 columns = 108 cells, while a 3 x 3 x 3 MDDB cube over MODEL, COLOR, and DEALERSHIP takes 27 cells.)
Sparsity
If dimension members of different dimensions do not interact, the cell at their intersection is left blank.
(Figure: an employee table with LAST NAME, EMP#, and AGE columns recast as an EMPLOYEE# by AGE cube; since each employee has exactly one age, almost every cell in the cube is blank.)
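The effect is easy to quantify: a cube allocates a cell for every member combination, but only a few combinations carry data. A sketch with assumed dimension members:

```python
# Sparsity made concrete: cells exist for every member combination, but
# only some combinations carry data. Dimension members are assumed.

employees = ["SMITH", "REGAN", "FOX"]   # LAST NAME dimension
ages = [19, 21, 63]                     # AGE dimension

# Each employee has exactly one age, so most (name, age) cells stay blank.
facts = {("SMITH", 21): 1, ("REGAN", 19): 1, ("FOX", 63): 1}

total_cells = len(employees) * len(ages)
filled = sum(1 for e in employees for a in ages if (e, a) in facts)
sparsity = 1 - filled / total_cells     # fraction of blank cells
```

Here two thirds of the cube is empty, and the blank fraction grows rapidly as dimensions are added; this is the data explosion the architecture comparison attributes to sparsity.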
Components of Warehouse
Source Tables : These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End - user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
796
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
797 222
Data Modeling
798
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity : Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship : relating entities to other entities.
799
800 222
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by 2009 Wipro Ltd Confidential lookup tables. Attributes --are the non-key columns in the lookup tables. 2222 Wipro Ltd Confidential
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId city c1 nyc c2 sfo c3 la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
801 222
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 cityId sfo la size small large pop 1M 5M location downtown suburbs regId north south
Dimension Table
city
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
region
802
803
804
Identify authoritative data sources Identify authoritative data sources Interview Employees & Customers Interview Employees & Customers Data Entry Points Data Entry Points Cost of bad data Cost of bad data
Use business rule discovery tools to identify data with Use business rule discovery tools to identify data with
inconsistent, missing, incomplete, duplicate or incorrect inconsistent, missing, incomplete, duplicate or incorrect values values
Use data cleansing tools to clean data at the source Use data cleansing tools to clean data at the source Load only clean data into the data warehouse Load only clean data into the data warehouse
Schedule Periodic Cleansing of Source Data Schedule Periodic Cleansing of Source Data
Id e n tify A re a s o f Id e n tify A re a s o f Im p ro v e m e n t Im p ro v e m e n t
Identify & Correct Cause of Defects Identify & Correct Cause of Defects Refine data capture mechanisms at source Refine data capture mechanisms at source Educate users on importance of DQ Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
805
Customized Programs Strengths: Addresses specific needs No bulky one time investment Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools Strength Provide automated assessment Limitation No measure of data accuracy
806
Business Rule Discovery tools Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields.
Data Reengineering & Cleansing tools Strengths Usually are integrated packages with cleansing features as Add-on Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
807
Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic
808
809
ETL Architecture
810
ETL Architecture
Content
1. An Overview of Data Warehouse
2. Data Warehouse Architecture
3. Data Modeling for Data Warehouse
4. Overview of Data Cleansing
5. Data Extraction, Transformation, Load
824
825
Content [contd]
6. Metadata Management
7. OLAP
8. Data Warehouse Testing
827
An Overview
828
829
830
831
Components of Warehouse
Source tables: real-time, volatile data in relational databases used for transaction processing (OLTP); these can be any relational databases or flat files.
ETL tools: extract, cleanse, transform (aggregates, joins), and load the data from sources to the target.
Maintenance and administration tools: authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
Modeling tools: used to design the data warehouse for high performance with dimensional data modeling techniques, and to map source to target files.
Databases: the target databases and data marts that are part of the data warehouse, structured for analysis and reporting purposes.
End-user tools for analysis and reporting: retrieve reports and analyze the data in the target tables; different querying, data mining, and OLAP tools are used for this purpose.
832
This architecture has a staging area, where data is loaded and tested after cleansing and transformation. It is then loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
833
Data Modeling
834
Data Modeling
The E-R data model is commonly used in OLTP; in OLAP, the dimensional data model is commonly used.
E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified based on its properties and characteristics, such as an employee, book, or student.
Relationship: an association relating entities to other entities.
835
836
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
Star Schema
Dimension Table
product
prodId  name  price
p1      bolt  10
p2      nut   5
Dimension Table
store
storeId  city
c1       nyc
c2       sfo
c3       la
Fact Table
sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50
Dimension Table
customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
837
Snowflake Schema
Fact Table

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Tables

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized, and are frequently designed at a level of normalization short of third normal form.
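As a sketch, the star schema above can be built in an in-memory SQLite database and queried with a typical star join. The table and column names follow the slides' example (product, store, sale); this is illustrative, not a production design:

```python
import sqlite3

# Build the star schema from the slides: one fact table (sale)
# joined to dimension tables (product, store).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale    (orderId TEXT, date TEXT, custId INTEGER,
                      prodId TEXT REFERENCES product(prodId),
                      storeId TEXT REFERENCES store(storeId),
                      qty INTEGER, amt INTEGER);
""")
conn.executemany("INSERT INTO product VALUES (?,?,?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?,?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO sale VALUES (?,?,?,?,?,?,?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("o105", "3/8/97", 111, "p1", "c3", 5, 50)])

# A typical star join: total sales amount per city.
rows = conn.execute("""
    SELECT s.city, SUM(f.amt)
    FROM sale f JOIN store s ON f.storeId = s.storeId
    GROUP BY s.city ORDER BY s.city
""").fetchall()
print(rows)  # [('la', 50), ('nyc', 23)]
```

Note how the fact table joins only to its dimension tables, never to another fact table.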
838
839
840
Identify authoritative data sources: interview employees and customers; find the data entry points; establish the cost of bad data.
Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate, or incorrect values.
Use data cleansing tools to clean data at the source; load only clean data into the data warehouse.
Schedule periodic cleansing of source data.
Identify areas of improvement: identify and correct causes of defects; refine data capture mechanisms at the source; educate users on the importance of DQ.
2009 Wipro Ltd - Confidential
841
Customized programs. Strengths: address specific needs; no bulky one-time investment. Limitations: large numbers of custom programs across different environments are difficult to manage; minor alterations demand coding effort.
Data quality assessment tools. Strength: provide automated assessment. Limitation: no measure of data accuracy.
842
Business rule discovery tools. Strengths: detect correlations in data values; can detect patterns of behavior that indicate fraud. Limitations: not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields.
Data reengineering & cleansing tools. Strengths: usually integrated packages with cleansing features as add-ons. Limitations: error prevention at the source is usually absent; the ETL tools have limited cleansing facilities.
843
Business rule discovery tools: Integrity Data Reengineering Tool from Vality Technology; Trillium Software System from Harte-Hanks Data Technologies; Migration Architect from DB Star.
Data reengineering & cleansing tools: Carlton Pureview from Oracle; ETI-Extract from Evolutionary Technologies; PowerMart from Informatica Corp; Sagent Data Mart from Sagent Technology.
Data quality assessment tools: Migration Architect / Evoke Axio from Evoke Software; Wizrule from Wizsoft.
Name & address cleansing tools: Centrus Suite from Sagent; I.d.centric from First Logic.
844
845
ETL Architecture
846
ETL Architecture
Data Extraction:
Rummages through a file or database
Uses some criteria for selection
Identifies qualified data
Transports the data over onto another file or database
Data transformation:
Integrating dissimilar data types
Changing codes
Adding a time attribute
Summarizing data
Calculating derived values
Renormalizing data
Data loading
Initial and incremental loading
Updating of metadata
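The extract, transform, load steps listed above can be sketched end to end in Python. The flat-file layout and the status-code mapping below are hypothetical, chosen only to show one of each transformation kind (changing codes, calculating a derived value):

```python
import csv, io, sqlite3

# Hypothetical source: a flat file of orders with legacy status codes.
source = io.StringIO("orderId,status,qty,price\n"
                     "o1,A,2,10\no2,X,1,5\no3,A,3,4\n")

# Extract: rummage through the file, selecting only qualified rows.
rows = [r for r in csv.DictReader(source) if r["status"] in ("A", "X")]

# Transform: change codes and calculate a derived value (amt = qty * price).
status_map = {"A": "ACTIVE", "X": "CANCELLED"}   # assumed code mapping
for r in rows:
    r["status"] = status_map[r["status"]]
    r["amt"] = int(r["qty"]) * int(r["price"])

# Load: initial load into the target table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (orderId TEXT, status TEXT, amt INTEGER)")
db.executemany("INSERT INTO orders VALUES (?,?,?)",
               [(r["orderId"], r["status"], r["amt"]) for r in rows])
print(db.execute("SELECT COUNT(*), SUM(amt) FROM orders").fetchone())  # (3, 37)
```

An incremental load would run the same pipeline on only the new or changed source rows.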
847
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lives in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
848
849
Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs.
Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes.
Extract: the process of reading data from a database.
Transform: the process of converting the extracted data.
Load: the process of writing the data into the target database.
Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
Administration and operation: ETL utilities let administrators schedule, run, and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.
850
ETL Tools
Provide a facility to specify a large number of transformation rules with a GUI
Generate programs to transform data
Handle multiple data sources
Handle data redundancy
Generate metadata as output
Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
851
Metadata Management
852
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
About the data being captured and loaded into the warehouse
Documented in IT tools that improve both business and technical understanding of data and data-related processes
853
Importance Of Metadata
Locating information: How much time is spent looking for information? How often is information found? What poor decisions were made based on incomplete information? How much money was lost or earned as a result?
Interpreting information: How many times have businesses needed to rework or recall products? What impact does that have on the bottom line? How many mistakes were due to misinterpretation of existing documentation? How much misinterpretation results from too much metadata? How much time is spent trying to determine whether any of the metadata is accurate?
Integrating information: How do the various data perspectives connect together? How much time is spent trying to figure that out? How much does the inefficiency and lack of metadata affect decision making?
854
Consumers of Metadata
Technical users: warehouse administrator, application developer.
Business users: business metadata (meanings, definitions, business rules).
Software tools used in DW life-cycle development: metadata requirements for each tool must be identified; the tool-specific metadata should be analysed for inclusion in the enterprise metadata repository; previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool.
856
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
858
Most frequently used interchange standard; addresses only a limited subset of metadata artifacts. XML addresses context and data meaning, not presentation, and can enable exchange over the web. Employs industry standards for storing and sharing programming data; will allow sharing of UML and MOF objects between various development tools and repositories; based on XML/UML standards. Promoted by Microsoft along with 20 partners, including the Object Management Group (OMG), Oracle, Carleton Group, CA/PLATINUM Technology (a founding member), and Viasoft.
859
OLAP
860
Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques
Architectures
Features
Representative Tools
09/02/2012
861
862
OLTP: operational data; OLTP systems are the original source of the data. Purpose: to control and run fundamental business tasks. The data reveals a snapshot of ongoing business processes. Inserts and updates are short and fast, initiated by end users.
OLAP: consolidation data; OLAP data comes from the various OLTP databases. Purpose: decision support. The data gives multi-dimensional views of various kinds of business activities. Periodic long-running batch jobs refresh the data.
863
MDDB Concepts
A multidimensional database is a software system designed for efficient and convenient storage and retrieval of data that is intimately related and can be stored, viewed, and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions; individual items within each dimension are called members.
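A minimal sketch of a hypercube as a Python mapping from member coordinates to cell values, reusing the deck's car-sales dimensions (the cell values are made up):

```python
# Dimensions (edges of the cube) and their members.
models  = ["Mini Van", "Coupe", "Sedan"]
colors  = ["Blue", "Red", "White"]
dealers = ["Carr", "Gleason", "Clyde"]

# A cell is addressed by one member from each dimension.
cube = {}
cube[("Mini Van", "Blue", "Carr")] = 6   # illustrative sales volumes
cube[("Coupe", "Red", "Clyde")] = 5

# Aggregate along a dimension: total Mini Van sales across colors/dealers.
minivan_total = sum(v for (model, _, _), v in cube.items()
                    if model == "Mini Van")
print(minivan_total)  # 6
```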
864
Increased Complexity...
[Figure: the same Sales Volumes data stored in a relational DBMS vs. an MDDB. The relational table (MODEL, COLOR, DEALER, volume) needs 27 x 4 = 108 cells; the MDDB cube over MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White), and DEALERSHIP (Carr, Gleason, Clyde) needs 3 x 3 x 3 = 27 cells.]
866
Sparsity
867
If members of different dimensions do not interact, a blank cell is left behind.
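A sketch of why such sparsity matters for storage: a dense array must allocate every member combination, while a sparse map stores only the cells whose members actually interact. The 10 x 10 employee/last-name matrix below is illustrative:

```python
# 10 employees x 10 last names, but each employee has exactly one last
# name, so only 10 of the 100 possible cells hold a value.
employees = [f"E{i:02d}" for i in range(10)]
ages = {(emp, f"NAME{i}"): 20 + i for i, emp in enumerate(employees)}

dense_cells = len(employees) * 10   # what a dense array would allocate
sparse_cells = len(ages)            # what the sparse map actually stores
print(dense_cells, sparse_cells)    # 100 10
```

MDDB engines face exactly this trade-off when most dimension combinations are empty.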
[Figure: an Employee Age matrix (LAST NAME x EMPLOYEE #) shown next to the dense Sales Volumes cube. Each employee has only one last name, so most cells of the matrix are blank; the data is sparse.]
868
OLAP Features
Calculations applied across dimensions, through hierarchies, and/or across members
Trend analysis over sequential time periods
What-if scenarios
Slicing/dicing subsets for on-screen viewing
Rotation to new dimensional comparisons in the viewing area
Drill-down/up along the hierarchy
Reach-through/drill-through to underlying detail data
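Several of these operations reduce to simple manipulations of a cube keyed by (model, color, dealer) tuples. The members and values below are illustrative:

```python
from itertools import product

models, colors, dealers = ["Van", "Coupe"], ["Blue", "Red"], ["Carr", "Clyde"]
# Assign arbitrary sales volumes to every cell of the cube.
cube = {cell: i + 1 for i, cell in enumerate(product(models, colors, dealers))}

# Slice: fix one member of one dimension (color = "Blue").
blue = {(m, d): v for (m, c, d), v in cube.items() if c == "Blue"}

# Rotate: reorder the axes (view COLOR x MODEL instead of MODEL x COLOR).
rotated = {(c, m, d): v for (m, c, d), v in cube.items()}

# Drill-up: aggregate away the DEALERSHIP dimension.
rollup = {}
for (m, c, d), v in cube.items():
    rollup[(m, c)] = rollup.get((m, c), 0) + v

print(len(blue), rotated[("Blue", "Van", "Carr")], rollup[("Van", "Blue")])  # 4 1 3
```

Dicing is the same idea as slicing but fixes a sub-range of members on several dimensions at once.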
869
[Figure: rotating the Sales Volumes cube 90 degrees exchanges the axes. View #1 shows the MODEL x COLOR face; View #2 shows the COLOR x MODEL face, with the same cell values.]
870
[Figure: successive 90-degree rotations of the MODEL x COLOR x DEALERSHIP cube expose six distinct two-dimensional views (View #1 through View #6).]
871
Sales Volumes
[Figure: drilling into the COLOR dimension, where Blue splits into Normal Blue and Metal Blue, shown for models Mini Van and Coupe at dealerships Carr and Clyde.]
872
ORGANIZATION DIMENSION
[Figure: a REGION > DISTRICT > DEALERSHIP hierarchy. The Midwest region contains the Chicago district (dealerships Clyde, Gleason) and the St. Louis district (dealerships Carr, Levi, Gary, Lucas, Bolton).]
Moving up and down a hierarchy is referred to as drill-up (roll-up) and drill-down.
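Drill-up along this organization dimension is just aggregation through member-to-parent mappings. A sketch using the slide's Chicago and St. Louis districts, with made-up sales figures:

```python
# REGION > DISTRICT > DEALERSHIP hierarchy from the slide.
district_of = {"Clyde": "Chicago", "Gleason": "Chicago",
               "Carr": "St. Louis", "Levi": "St. Louis"}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest"}

sales = {"Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 3}  # illustrative

# Drill-up: dealership -> district -> region.
by_district = {}
for dealer, amt in sales.items():
    d = district_of[dealer]
    by_district[d] = by_district.get(d, 0) + amt

by_region = {}
for district, amt in by_district.items():
    r = region_of[district]
    by_region[r] = by_region.get(r, 0) + amt

print(by_district, by_region)  # {'Chicago': 17, 'St. Louis': 8} {'Midwest': 25}
```

Drill-down is the reverse: starting from the Midwest total and expanding back to districts or dealerships.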
873
874
[Figure: a cube with REGION (East, West, Central) and TIME (1st Qtr through 4th Qtr, Year 1999) dimensions.]
875
876
877
[Figure: MOLAP architecture. An OLAP cube with its calculation engine is accessed by OLAP tools, OLAP applications, and web browsers.]
878
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for maximum query performance and optimum space utilization
879
[Figure: ROLAP architecture. An OLAP calculation engine issues SQL against the relational data warehouse on behalf of OLAP tools, OLAP applications, and web browsers.]
880
ROLAP - Features
Three-tier hardware/software architecture: GUI on the client, multidimensional processing on a mid-tier server, and the target database on a database server. Processing is split between the mid-tier and database servers.
881
[Figure: HOLAP architecture. Any client (OLAP tools, OLAP applications, web browsers) uses an OLAP calculation engine that combines an MDDB with SQL access to the relational data warehouse.]
882
HOLAP - Features
RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of RDBMS and MDDB performance
Calculation engine provides full analysis features
Source of data is transparent to the end user
883
Architecture Comparison
MOLAP: MDDB OLAP = transaction-level data + summary in the MDDB. Data explosion due to sparsity: high (may go beyond control; estimation is very important). Data explosion due to summarization: with good design, 3-10 times. Query execution speed: fast (depends on the size of the MDDB). Cost: medium (MDDB server plus large disk space). Where to apply: small transactional data with a complex model and frequent summary analysis.

ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS. Data explosion due to sparsity: none. Data explosion due to summarization: to the necessary extent. Query execution speed: slow. Cost: low (only RDBMS plus disk space). Where to apply: very large transactional data that needs to be viewed/sorted.

HOLAP: Hybrid OLAP = ROLAP + summary in an MDDB. Data explosion due to sparsity: only in the MDDB part. Data explosion due to summarization: to the necessary extent. Query execution speed: optimum (if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP). Cost: high (RDBMS plus disk space plus MDDB server). Where to apply: large transactional data with frequent summary analysis.
884
Oracle Express products
Hyperion Essbase
Cognos PowerPlay
Seagate Holos
SAS
MicroStrategy DSS Agent
Informix MetaCube
Brio Query
Business Objects / Web Intelligence
885
Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
886
887
888
889
890
891
Programming for the testing challenge: for transaction systems, users/business analysts typically test the output of the system. For a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare data pre-transformation to post-transformation.
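A stand-alone back-end script of this kind typically reconciles row counts and control totals between the pre-transformation (staging) and post-transformation (warehouse) tables. A minimal sketch with hypothetical table names and data:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE stg_sales (id INTEGER, amt INTEGER);   -- pre-transformation
CREATE TABLE dw_sales  (id INTEGER, amt INTEGER);   -- post-transformation
INSERT INTO stg_sales VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO dw_sales  VALUES (1, 10), (2, 20), (3, 30);
""")

def reconcile(db, src, tgt):
    """Compare row counts and control totals between two tables."""
    s_cnt, s_sum = db.execute(f"SELECT COUNT(*), SUM(amt) FROM {src}").fetchone()
    t_cnt, t_sum = db.execute(f"SELECT COUNT(*), SUM(amt) FROM {tgt}").fetchone()
    return {"rows_match": s_cnt == t_cnt, "totals_match": s_sum == t_sum}

print(reconcile(db, "stg_sales", "dw_sales"))  # {'rows_match': True, 'totals_match': True}
```

A real script would also check key-level differences and write mismatches to an error log for review.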
892
893
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
Are the requirements complete?
Are the requirements singular?
Are the requirements unambiguous?
Are the requirements developable?
Are the requirements testable?
894
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures covers:
Whether the ETLs access and pick up the right data from the right source.
That all data transformations are correct according to the business rules, and that the data warehouse is correctly populated with the transformed data.
Testing the rejected records that don't fulfil the transformation rules.
895
Unit Testing
Unit Testing the Report data:
Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems Derivation formulae/calculation rules should be verified
896
Integration Testing
Integration testing will involve the following:
Sequence of ETL jobs in batch.
Initial loading of records into the data warehouse.
Incremental loading of records at a later date, to verify newly inserted or updated data.
Testing the rejected records that don't fulfil the transformation rules.
Error log generation.
897
Performance Testing
Performance Testing should check for :
ETL processes completing within the time window.
Monitoring and measuring data quality issues.
Refresh times for standard/complex reports.
898
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use, in terms of ETL process integrity, business functionality, and reporting.
899
Questions
900
Thank You
901
Content [contd]
6 7 8 Metadata Management OLAP Data Warehouse Testing
902
An Overview
903
904
905
906
Components of Warehouse
Source Tables : These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End - user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
907
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
908 222
Data Modeling
909
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity : Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship : relating entities to other entities.
910
911 222
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by 2009 Wipro Ltd Confidential lookup tables. Attributes --are the non-key columns in the lookup tables. 2222 Wipro Ltd Confidential
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId city c1 nyc c2 sfo c3 la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
912 222
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 cityId sfo la size small large pop 1M 5M location downtown suburbs regId north south
Dimension Table
city
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
region
913
914
915
Identify authoritative data sources Identify authoritative data sources Interview Employees & Customers Interview Employees & Customers Data Entry Points Data Entry Points Cost of bad data Cost of bad data
Use business rule discovery tools to identify data with Use business rule discovery tools to identify data with
inconsistent, missing, incomplete, duplicate or incorrect inconsistent, missing, incomplete, duplicate or incorrect values values
Use data cleansing tools to clean data at the source Use data cleansing tools to clean data at the source Load only clean data into the data warehouse Load only clean data into the data warehouse
Schedule Periodic Cleansing of Source Data Schedule Periodic Cleansing of Source Data
Id e n tify A re a s o f Id e n tify A re a s o f Im p ro v e m e n t Im p ro v e m e n t
Identify & Correct Cause of Defects Identify & Correct Cause of Defects Refine data capture mechanisms at source Refine data capture mechanisms at source Educate users on importance of DQ Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
916
Customized Programs Strengths: Addresses specific needs No bulky one time investment Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools Strength Provide automated assessment Limitation No measure of data accuracy
917
Business Rule Discovery tools Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields.
Data Reengineering & Cleansing tools Strengths Usually are integrated packages with cleansing features as Add-on Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
918
Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic
919
920
ETL Architecture
921
ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database
D a ta tra n sfo rm a ti n o
I te g ra ti g d i m i a r d a ta typ e s n n ssi l C h a n g i g co d e s n A d d i g a ti e a ttri u te n m b S u m m a ri n g d a ta zi C a l l ti g d e ri d va l e s cu a n ve u R e n o rm a l zi g d a ta i n
Data loading
Initial and incremental loading Updation of metadata
922
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.
923
924
Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
925
ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment
ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
926
Metadata Management
927
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
928
Importance Of Metadata
Locating Information Time spent in looking for information . How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products? What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?
929
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
931
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
933
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CAPLATINUM Technology (Founding Member), Viasoft
934
OLAP
935
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
09/02/2012
936
936
09/02/2012
937
937
Operational data; OLTPs Consolidation data; OLAP are the original source data comes from the of the data various OLTP databases To control and run fundamental business tasks A snapshot of ongoing business processes Decision support Multi-dimensional views of various kinds of business activities
Short and fast inserts Periodic long-running and updates initiated by batch jobs refresh the end users data
938
09/02/2012
938
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions).
A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
939
2009 Wipro Ltd - Confidential
Increased Complexity...
[Figure: Sales Volumes by MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde). Stored in a relational DBMS, the 27 facts need a 27-row x 4-column table (108 cells); stored in an MDDB, they fit a 3 x 3 x 3 cube (27 cells).]
Sparsity
If members of different dimensions do not interact, the corresponding cell is left blank.
[Figure: Sparsity example. An EMPLOYEE# x AGE cube built from a nine-row employee table (last name, employee number, age) is mostly blank: each employee number intersects exactly one age, so every other cell in the cube stays empty.]
OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data
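Two of these operations, slicing and rotation, can be sketched on a tiny dict-based cube (the cube and its values are illustrative, not from a real OLAP engine):

```python
# Sketch of OLAP slicing and rotation on a MODEL x COLOR cube.
cube = {
    ("Sedan", "Blue"): 6, ("Sedan", "Red"): 5, ("Sedan", "White"): 4,
    ("Coupe", "Blue"): 3, ("Coupe", "Red"): 5, ("Coupe", "White"): 5,
}

# Slice: fix one dimension member (COLOR == "Blue") and view the rest.
blue_slice = {model: v for (model, color), v in cube.items() if color == "Blue"}

# Rotate: swap the MODEL and COLOR axes; the same cells appear under new axes.
rotated = {(color, model): v for (model, color), v in cube.items()}

print(blue_slice)                   # {'Sedan': 6, 'Coupe': 3}
print(rotated[("Blue", "Sedan")])   # 6, same cell seen from the rotated view
```

Rotation never changes cell values, only which dimension pair faces the viewer, which is exactly what the rotation figures in this deck show.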
[Figure: Rotation. Successive 90-degree rotations of the Sales Volumes cube (MODEL x COLOR x DEALERSHIP) bring different dimension pairs into the viewing area, producing six distinct views (View #1 through View #6), e.g. MODEL by COLOR rotated into COLOR by MODEL.]
[Figure: Drill-down. In the Sales Volumes cube, the COLOR member Blue is broken down into its children Normal Blue and Metal Blue across the MODEL and DEALERSHIP dimensions.]
ORGANIZATION DIMENSION
[Figure: The ORGANIZATION dimension hierarchy: the Midwest REGION contains the Chicago and St. Louis DISTRICTs, whose DEALERSHIPs include Clyde, Gleason, Carr, Levi, Gary, Lucas and Bolton.]
Moving up and moving down a hierarchy are referred to as drill-up (roll-up) and drill-down, respectively.
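Roll-up along such a hierarchy is just aggregation through each parent level. A minimal sketch, using a subset of the dealerships from the figure and made-up sales figures:

```python
# Drill-up / roll-up sketch along dealership -> district -> region.
from collections import defaultdict

district_of = {"Clyde": "Chicago", "Gleason": "Chicago",
               "Carr": "St. Louis", "Levi": "St. Louis"}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest"}

dealer_sales = {"Clyde": 120, "Gleason": 80, "Carr": 95, "Levi": 45}

district_sales = defaultdict(int)
for dealer, amount in dealer_sales.items():
    district_sales[district_of[dealer]] += amount   # roll up one level

region_sales = defaultdict(int)
for district, amount in district_sales.items():
    region_sales[region_of[district]] += amount     # roll up again

print(dict(district_sales))   # {'Chicago': 200, 'St. Louis': 140}
print(dict(region_sales))     # {'Midwest': 340}
```

Drill-down is the inverse navigation: starting from the Midwest total and expanding back to districts and dealerships.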
[Figure: A Year 1999 cube with Region (East, West, Central) and Quarter (1st Qtr through 4th Qtr) dimensions.]
MOLAP
[Figure: MOLAP architecture. The OLAP calculation engine works directly against a multidimensional Cube; web browsers, OLAP tools and OLAP applications connect to the engine.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
ROLAP
[Figure: ROLAP architecture. The OLAP calculation engine generates SQL against the relational data warehouse; web browsers, OLAP tools and OLAP applications connect to the engine.]
ROLAP - Features
Three-tier hardware/software architecture:
- GUI on the client; multidimensional processing on a mid-tier server; target database on the database server
- Processing split between the mid-tier and database servers
HOLAP
[Figure: HOLAP architecture. Any client (web browser, OLAP tools, OLAP applications) connects to the OLAP calculation engine, which issues SQL against the relational data warehouse and keeps summaries in an MDDB.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS combined with MDDB performance
- Calculation engine provides full analysis features
- Source of data is transparent to the end user
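The transparency point can be sketched as a tiny routing function: summaries are served MOLAP-style from a fast store, detail questions fall through to a ROLAP-style scan, and the caller never sees which path answered. All names and data here are illustrative.

```python
# Toy HOLAP router: summaries in an MDDB-like dict, detail in relational rows.
mddb_summaries = {("Sedan", "ALL"): 11}                    # precomputed summary
detail_rows = [("Sedan", "Blue", 6), ("Sedan", "Red", 5)]  # relational detail

def query(model, color):
    if (model, color) in mddb_summaries:      # MOLAP-style lookup
        return mddb_summaries[(model, color)]
    return sum(v for m, c, v in detail_rows   # ROLAP-style scan
               if m == model and (color == "ALL" or c == color))

print(query("Sedan", "ALL"))    # 11, served from the MDDB
print(query("Sedan", "Blue"))   # 6, served from the relational detail
```

This is why the comparison below rates HOLAP query speed as "optimum": it behaves like MOLAP when the summary exists and like ROLAP otherwise.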
Architecture Comparison
MOLAP
- Definition: MDDB OLAP = transaction-level data + summary in the MDDB
- Data explosion due to sparsity: high (may go beyond control; estimation is very important)
- Data explosion due to summarization: with good design, 3 to 10 times
- Query execution speed: fast (depends upon the size of the MDDB)
- Cost: medium (MDDB server + large disk space)
- Where to apply: small transactional data + complex model + frequent summary analysis

ROLAP
- Definition: Relational OLAP = transaction-level data + summary in the RDBMS
- Data explosion due to sparsity: no sparsity
- Data explosion due to summarization: to the necessary extent
- Query execution speed: slow
- Cost: low (only RDBMS + disk space)
- Where to apply: very large transactional data that needs to be viewed/sorted

HOLAP
- Definition: Hybrid OLAP = ROLAP + summary in an MDDB
- Data explosion due to sparsity: sparsity exists only in the MDDB part
- Data explosion due to summarization: to the necessary extent
- Query execution speed: optimum (if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP)
- Cost: high (RDBMS + disk space + MDDB server)
- Where to apply: large transactional data + frequent summary analysis
Representative Tools
- Oracle Express Products
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS
- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / WebIntelligence
OLAP Applications
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Applications
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
Data Warehouse Testing
Programming as a testing challenge: in transaction systems, users and business analysts typically test the output of the system through its front end. In a data warehouse, most data-quality and ETL testing is done at the back end by running separate stand-alone scripts that compare pre-transformation data with post-transformation data.
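A stand-alone back-end script of the kind described above usually reconciles counts and totals between source and target. A minimal sketch, where the assumed transformation rule (amounts summed per customer) and all data are illustrative:

```python
# ETL reconciliation sketch: compare pre- and post-transformation data.
source = [("C1", 100), ("C1", 50), ("C2", 70)]   # pre-transformation rows
target = {"C1": 150, "C2": 70}                   # post-transformation rows

# Recompute the expected per-customer aggregates from the source.
expected = {}
for cust, amount in source:
    expected[cust] = expected.get(cust, 0) + amount

# Reconcile: grand totals and per-key aggregates must match exactly.
assert sum(amount for _, amount in source) == sum(target.values()), "total mismatch"
assert expected == target, "per-customer aggregate mismatch"
print("ETL reconciliation passed")
```

Real ETL test suites add further checks on the same pattern, e.g. row counts, null handling, rejected-record counts, and referential integrity between dimension and fact tables.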
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?