
Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential


Content
1. An Overview of Data Warehouse
2. Data Warehouse Architecture
3. Data Modeling for Data Warehouse
4. Overview of Data Cleansing
5. Data Extraction, Transformation, Load


Content [contd]
6. Metadata Management
7. OLAP
8. Data Warehouse Testing


An Overview

Understanding What is a Data Warehouse


What is Data Warehouse?


Definitions of Data Warehouse
o A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. (W. H. Inmon)
o A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support. (Babcock)
o A data warehouse is a relational database: a copy of transaction data specifically structured for query and analysis. (Ralph Kimball)

In simple terms: data warehousing is the collection of data from different systems to support business decisions, analysis and reporting.


Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
o Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
o Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
o Nonvolatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
o Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse


Components of Warehouse
o Source Tables: Real-time, volatile data in relational databases used for transaction processing (OLTP). The sources can be any relational databases or flat files.
o ETL Tools: Extract, cleanse, transform (aggregate, join) and load the data from sources to target.
o Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during offshore periods.
o Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
o Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
o End-user tools for analysis and reporting: Get reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


This is a basic design, in which source files are loaded into a warehouse and users query the data for different purposes.

This design has a staging area, where the data is loaded and tested after cleansing and transformation. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
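The staged design can be sketched in a few lines of Python. This is only an illustrative sketch: the row fields, the single staging test and the "one data mart per city" split are assumptions made for the example, not part of the original design.

```python
# Hypothetical source rows; the field names are illustrative only.
source_rows = [
    {"order_id": "o100", "city": " NYC ", "amount": "12"},
    {"order_id": "o102", "city": "sfo", "amount": "11"},
    {"order_id": "", "city": "la", "amount": "50"},  # fails the staging test
]

def cleanse(row):
    """Trim and normalise text fields and cast the amount to a number."""
    return {
        "order_id": row["order_id"].strip(),
        "city": row["city"].strip().lower(),
        "amount": float(row["amount"]),
    }

def passes_staging_tests(row):
    """Staging-area test (assumed rule): every row needs a non-empty key."""
    return bool(row["order_id"])

# 1. Load cleansed, transformed data into the staging area and test it there.
staging = [cleanse(r) for r in source_rows]
tested = [r for r in staging if passes_staging_tests(r)]

# 2. Load the tested rows into the target warehouse.
warehouse = list(tested)

# 3. Divide the warehouse into data marts (here: one mart per city).
data_marts = {}
for row in warehouse:
    data_marts.setdefault(row["city"], []).append(row)
```

Only rows that pass the staging test reach the warehouse, so the data marts never see the row with the missing key.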


Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
o Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
o Relationship: Relates entities to other entities.

Different Perspectives of Data Modeling
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of Dimensional Data Models most commonly used:
o Star Schema
o Snowflake Schema


Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
o Dimension: A category of information. For example, the Time dimension.
o Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
o Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year -> Quarter -> Month -> Day.
o Fact Table: A table that contains the measures of interest.
o Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
o Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
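To make the role of surrogate keys more concrete, here is a minimal Python sketch of a lookup table that assigns its own keys. The class name, the column names and the "keep the old row as history" versioning policy are assumptions chosen for illustration; the slides only state that surrogate keys help with slowly changing dimensions.

```python
import itertools

class CustomerDimension:
    """Hypothetical lookup table keyed by a surrogate key, not the source key."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self.rows = []  # each row: surrogate_key, cust_id, city, current

    def apply(self, cust_id, city):
        """Insert a new version if the customer is new or its city changed."""
        current = next(
            (r for r in self.rows if r["cust_id"] == cust_id and r["current"]),
            None,
        )
        if current and current["city"] == city:
            return current["surrogate_key"]  # nothing changed
        if current:
            current["current"] = False  # keep the old version as history
        row = {
            "surrogate_key": next(self._next_key),
            "cust_id": cust_id,
            "city": city,
            "current": True,
        }
        self.rows.append(row)
        return row["surrogate_key"]

dim = CustomerDimension()
k1 = dim.apply(53, "sfo")
k2 = dim.apply(53, "sfo")  # unchanged: same surrogate key
k3 = dim.apply(53, "la")   # changed: new surrogate key, old row kept
```

Because fact rows would reference the surrogate key, facts recorded before the change keep pointing at the old version of the customer.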


Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
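The star schema above can be tried out with Python's built-in sqlite3 module. This is a sketch only: the column types and the choice of an in-memory SQLite database are assumptions, and the query is just one example of joining the fact table to its dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT    REFERENCES product(prodId),
    storeId TEXT    REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Total amount per store city and product name: one join per dimension.
rows = conn.execute("""
    SELECT store.city, product.name, SUM(sale.amt)
    FROM sale
    JOIN store   ON sale.storeId = store.storeId
    JOIN product ON sale.prodId  = product.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()
```

Every dimension is reached from the fact table with a single join, which is the property that gives the star schema its name.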


Snowflake Schema
Dimension Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south

Dimension Table: region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.
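The snowflaked store dimension above can be sketched with sqlite3 as well. The column types are assumptions; the point of the example is that reaching the region name takes a chain of joins (store, then city, then region) rather than a single one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT,
                     regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY,
                     cityId TEXT REFERENCES city(cityId),
                     tId TEXT REFERENCES sType(tId),
                     mgr TEXT);
""")
conn.executemany("INSERT INTO region VALUES (?, ?)",
                 [("north", "cold region"), ("south", "warm region")])
conn.executemany("INSERT INTO city VALUES (?, ?, ?)",
                 [("sfo", "1M", "north"), ("la", "5M", "south")])
conn.executemany("INSERT INTO sType VALUES (?, ?, ?)",
                 [("t1", "small", "downtown"), ("t2", "large", "suburbs")])
conn.executemany("INSERT INTO store VALUES (?, ?, ?, ?)",
                 [("s5", "sfo", "t1", "joe"),
                  ("s7", "sfo", "t2", "fred"),
                  ("s9", "la", "t1", "nancy")])

# Resolve each store's size and region by walking the normalized dimension.
rows = conn.execute("""
    SELECT store.storeId, store.mgr, sType.size, region.name
    FROM store
    JOIN sType  ON store.tId    = sType.tId
    JOIN city   ON store.cityId = city.cityId
    JOIN region ON city.regId   = region.regId
    ORDER BY store.storeId
""").fetchall()
```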


Overview of Data Cleansing


The Need For Data Quality


o Difficulty in decision making
o Time delays in operation
o Organizational mistrust
o Data ownership conflicts
o Customer attrition
o Costs associated with:
  o error detection
  o error rework
  o customer service
  o fixing customer problems


Understand Information Flow In Organization

o Identify authoritative data sources
o Interview Employees & Customers
o Data Entry Points
o Cost of bad data

Identify Potential Problem Areas & Assess Impact

Measure Quality Of Data
o Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
o Use data cleansing tools to clean data at the source
o Load only clean data into the data warehouse

Continuous Monitoring
o Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement
o Identify & Correct Cause of Defects
o Refine data capture mechanisms at source
o Educate users on importance of DQ
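The "measure quality" and "clean and load" steps above can be illustrated with a small Python sketch. The records, the field names and the three rules (non-empty name, unique key, age between 0 and 120) are hypothetical and stand in for whatever rules a real discovery or cleansing tool would apply.

```python
from collections import Counter

# Hypothetical customer records; field names and rules are illustrative.
records = [
    {"cust_id": 53, "name": "joe", "age": 34},
    {"cust_id": 81, "name": "", "age": 41},         # missing name
    {"cust_id": 53, "name": "joe", "age": 34},      # duplicate key
    {"cust_id": 111, "name": "sally", "age": 212},  # out-of-range age
]

def measure_quality(rows):
    """Count rows with missing, duplicate or out-of-range values."""
    key_counts = Counter(r["cust_id"] for r in rows)
    return {
        "missing_name": sum(1 for r in rows if not r["name"]),
        "duplicate_key": sum(1 for r in rows if key_counts[r["cust_id"]] > 1),
        "age_out_of_range": sum(1 for r in rows if not 0 <= r["age"] <= 120),
    }

def clean(rows):
    """Keep only the first occurrence of each key that passes every check."""
    seen, result = set(), []
    for r in rows:
        ok = r["name"] and 0 <= r["age"] <= 120 and r["cust_id"] not in seen
        if ok:
            seen.add(r["cust_id"])
            result.append(r)
    return result

report = measure_quality(records)  # quality measurement before loading
clean_rows = clean(records)        # only these rows would be loaded
```

Running the measurement on a schedule and comparing reports over time is one simple way to approach the continuous monitoring step.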

Customized Programs
o Strengths:
  o Address specific needs
  o No bulky one-time investment
o Limitations:
  o Tons of custom programs in different environments are difficult to manage
  o Minor alterations demand coding effort

Data Quality Assessment Tools
o Strength: Provide automated assessment
o Limitation: No measure of data accuracy


Business Rule Discovery Tools
o Strengths:
  o Detect correlation in data values
  o Can detect patterns of behavior that indicate fraud
o Limitations:
  o Not all variables can be discovered
  o Some discovered rules might not be pertinent
  o There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
o Strengths:
  o Usually are integrated packages with cleansing features as add-ons
o Limitations:
  o Error prevention at source is usually absent
  o The ETL tools have limited cleansing facilities


Business Rule Discovery Tools
o Integrity Data Reengineering Tool from Vality Technology
o Trillium Software System from Harte-Hanks Data Technologies
o Migration Architect from DB Star

Data Reengineering & Cleansing Tools
o Carlton Pureview from Oracle
o ETI-Extract from Evolutionary Technologies
o PowerMart from Informatica Corp
o Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
o Migration Architect, Evoke Axio from Evoke Software
o Wizrule from Wizsoft

Name & Address Cleansing Tools
o Centrus Suite from Sagent
o I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture


Data Extraction:
o Rummages through a file or database
o Uses some criteria for selection
o Identifies qualified data
o Transports the data over onto another file or database

Data Transformation:
o Integrating dissimilar data types
o Changing codes
o Adding a time attribute
o Summarizing data
o Calculating derived values
o Renormalizing data

Data Extraction Cleanup:
o Restructuring of records or fields
o Removal of operational-only data
o Supply of missing field values
o Data integrity checks
o Data consistency and range checks, etc.

Data Loading:
o Initial and incremental loading
o Updating of metadata
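The extract, transform, cleanup and load steps listed above can be sketched as three small Python functions. Everything concrete here is an assumption made for illustration: the CSV layout, the status codes, the default of 0 for a missing quantity, and the rule that an incremental load skips keys already present in the target.

```python
import csv
import io

# Hypothetical operational extract; the columns are illustrative.
SOURCE = """order_id,status,qty,unit_price,internal_flag
o100,S,1,12.0,x
o102,C,2,5.5,y
o103,S,,4.0,z
"""

STATUS_CODES = {"S": "shipped", "C": "cancelled"}

def extract(text, wanted_status):
    """Select qualifying rows from the source using a selection criterion."""
    return [r for r in csv.DictReader(io.StringIO(text))
            if r["status"] == wanted_status]

def transform(rows, load_date):
    """Change codes, supply missing values, derive values, add a time attribute."""
    out = []
    for r in rows:
        qty = int(r["qty"]) if r["qty"] else 0       # supply missing value
        out.append({
            "order_id": r["order_id"],
            "status": STATUS_CODES[r["status"]],     # change codes
            "qty": qty,
            "amount": qty * float(r["unit_price"]),  # derived value
            "load_date": load_date,                  # time attribute
        })                                           # internal_flag is dropped
    return out

def load(target, rows):
    """Incremental load: append only rows whose key is not in the target yet."""
    known = {r["order_id"] for r in target}
    new_rows = [r for r in rows if r["order_id"] not in known]
    target.extend(new_rows)
    return len(new_rows)

warehouse = []
first = load(warehouse, transform(extract(SOURCE, "S"), "2009-01-01"))
second = load(warehouse, transform(extract(SOURCE, "S"), "2009-01-02"))
```

The first call acts as the initial load; the second finds nothing new, which is the behaviour expected of an incremental load over an unchanged source.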


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Major components involved in ETL Processing


o Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
o Metadata management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
o Extract: The process of reading data from a database.
o Transform: The process of converting the extracted data.
o Load: The process of writing the data into the target database.
o Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
o Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
o Provide a facility to specify a large number of transformation rules with a GUI
o Generate programs to transform data
o Handle multiple data sources
o Handle data redundancy
o Generate metadata as output
o Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
o PowerCentre/Mart from Informatica
o Data Mart Solution from Sagent Technology
o DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is information:
o That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
o About the data being captured and loaded into the warehouse
o Documented in IT tools, that improves both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
o How much time is spent looking for information?
o How often is information found?
o What poor decisions were made based on incomplete information?
o How much money was lost or earned as a result?

Interpreting information
o How many times have businesses needed to rework or recall products?
o What impact does it have on the bottom line?
o How many mistakes were due to misinterpretation of existing documentation?
o How much interpretation results from too much metadata?
o How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
o How do the various data perspectives connect together?
o How much time is spent trying to figure that out?
o How much do the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


o Provide a simple catalogue of business metadata descriptions and views
o Document/manage metadata descriptions from an integrated development environment
o Enable DW users to identify and invoke pre-built queries against the data stores
o Design and enhance new data models and schemas for the data warehouse
o Capture data transformation rules between the operational and data warehousing databases
o Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
o Warehouse administrator
o Application developer

Business Users (business metadata)
o Meanings
o Definitions
o Business rules

Software Tools
o Used in DW life-cycle development
o Metadata requirements for each tool must be identified
o The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
o Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third Party Bridging Tools

Oracle Exchange
o Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik-Toolbus
o Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
o Hub and Spoke solution for enabling metadata interoperability
o Ardent focussing on own engagements, not selling it as independent product

Informix's Metadata Plug-ins
o Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
o IBM, Oracle and Microsoft to offer free or near-free basic repository services
o Enable organisations to reuse metadata across technologies
o Integrate DB design, data transformation and BI tools from different vendors
o Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
o Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them

Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
o Most frequently used interchange standard
o Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
o XML: addresses context and data meaning, not presentation
o Can enable exchange over the web employing industry standards for storing and sharing programming data
o Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
o Based on XML/UML standards
o Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA PLATINUM Technology (Founding Member), Viasoft


OLAP


Agenda
o OLAP Definition
o Distinction between OLTP and OLAP
o MDDB Concepts
o Implementation Techniques
o Architectures
o Features
o Representative Tools

OLAP: On-Line Analytical Processing


o OLAP can be defined as a technology that allows users to view aggregate data across measurements (such as Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (such as Product, Organization, Customer, etc.)
o Used interchangeably with BI
o A multidimensional view of data is the foundation of OLAP
o Users: analysts, decision makers

Distinction between OLTP and OLAP


Source of data
o OLTP System: Operational data; OLTPs are the original source of the data
o OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
o OLTP System: To control and run fundamental business tasks
o OLAP System: Decision support

What the data reveals
o OLTP System: A snapshot of ongoing business processes
o OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
o OLTP System: Short and fast inserts and updates initiated by end users
o OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for the efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
o The edges of the cube are called dimensions
o Individual items within each dimension are called members

RDBMS v/s MDDB:


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

27 rows x 4 columns = 108 cells

MDDB

Sales Volumes cube with three dimensions:
o MODEL: Mini Van, Coupe, Sedan
o COLOR: Blue, Red, White
o DEALERSHIP: Carr, Gleason, Clyde

3 x 3 x 3 = 27 cells
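The cell counts above can be reproduced with a short Python sketch that holds the same sales figures once as relational rows and once as a cube keyed by its three dimensions. Using a dictionary keyed by (model, color, dealer) tuples is an assumption made for illustration; it is not how an MDDB product stores its data.

```python
from itertools import product

models = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
colors = ["BLUE", "RED", "WHITE"]
dealers = ["Clyde", "Gleason", "Carr"]

# Relational form: one row of (MODEL, COLOR, DEALER, VOL.) per combination.
# Only the Mini Van rows are copied here to keep the sketch short.
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6), ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2), ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3), ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3), ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]

# Multidimensional form: the three dimensions address a cell directly.
cube = {(model, color, dealer): vol for model, color, dealer, vol in rows}

full_table_rows = len(list(product(models, colors, dealers)))  # one row per combination
relational_cells = full_table_rows * 4   # 4 columns per row
cube_cells = len(models) * len(colors) * len(dealers)
```

A lookup such as `cube[("MINI VAN", "RED", "Carr")]` addresses one cell directly, whereas the relational form repeats the model, color and dealer values in every row.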

Benefits of MDDB over RDBMS


o Ease of Data Presentation & Navigation: A great deal of information is gleaned immediately upon direct inspection of the array. The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table.
o Storage Space: Very low space consumption compared to a relational DB.
o Performance: Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.
o Ease of Maintenance: No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.

Issues with MDDB

Sparsity
o Input data in applications are typically sparse
o Increases with increased dimensions

Data Explosion
o Due to sparsity
o Due to summarization

Performance
o Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, a blank cell is left behind.

Dense example: Sales Volumes (MODEL x COLOR), every cell populated

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

Sparse example: Employee Age (LAST NAME x EMPLOYEE #)

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

Arranged as a LAST NAME x EMPLOYEE # array, these 9 employees fill only 9 of the 9 x 9 = 81 cells.
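The sparsity of the employee example can be computed directly. The sketch below assumes the array has one axis for last names and one for employee numbers, as in the example; the variable names are illustrative.

```python
# (last name, employee number, age) rows from the employee example.
employees = [
    ("Smith", "01", 21), ("Regan", "12", 19), ("Fox", "31", 63),
    ("Weld", "14", 31), ("Kelly", "54", 27), ("Link", "03", 56),
    ("Kranz", "41", 45), ("Lucas", "33", 41), ("Weiss", "23", 19),
]

last_names = sorted({name for name, _, _ in employees})
emp_numbers = sorted({emp for _, emp, _ in employees})

# Only the cells where a last name and an employee number "interact" are filled.
cells = {(name, emp): age for name, emp, age in employees}

total_cells = len(last_names) * len(emp_numbers)
filled_cells = len(cells)
sparsity = 1 - filled_cells / total_cells  # fraction of blank cells
```

Each last name pairs with exactly one employee number, so almost all of the array is blank, while the 3 x 3 sales array in the same example has no blank cells at all.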

OLAP Features
o Calculations applied across dimensions, through hierarchies and/or across members
o Trend analysis over sequential time periods
o What-if scenarios
o Slicing / dicing subsets for on-screen viewing
o Rotation to new dimensional comparisons in the viewing area
o Drill-down/up along the hierarchy
o Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation


Complex queries and sorts in a relational environment are translated to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

View #2 (rotate 90 degrees): Sales Volumes, COLOR by MODEL

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.
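For a two-dimensional array, the rotation amounts to swapping the two axes, which can be sketched in Python as a transposition. Representing the view as a list of rows is an assumption made for illustration.

```python
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]

# View #1: one row per model, one column per color.
view1 = [
    [6, 5, 4],  # Mini Van
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]

# View #2: swap the axes, giving one row per color and one column per model.
view2 = [list(column) for column in zip(*view1)]

# The same cell is reachable in both views with the indices swapped.
coupe_red_in_view1 = view1[models.index("Coupe")][colors.index("Red")]
coupe_red_in_view2 = view2[colors.index("Red")][models.index("Coupe")]
```

No data changes during the rotation; only the order in which the two dimensions are presented does.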

Features of OLAP - Rotation


[Figure: the Sales Volumes cube, with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown in six orientations, View #1 to View #6, each obtained by a 90-degree rotation.]

A 3-dimensional array has 6 views.
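The view counts (2 views for 2 dimensions, 6 views for 3) match the number of ways the dimensions can be ordered. The sketch below assumes a view is identified by the order of its dimensions; the `rotate` helper and the tuple-keyed cube are hypothetical illustrations.

```python
from itertools import permutations

DIMENSIONS = ("MODEL", "COLOR", "DEALERSHIP")

# Every ordering of the dimensions is one way to orient the cube.
views = list(permutations(DIMENSIONS))

def rotate(cube, order):
    """Re-key each cell so its coordinates follow the requested dimension order."""
    return {tuple(key[i] for i in order): value for key, value in cube.items()}

# A single cell from the sales cube, keyed as (model, color, dealership).
cube = {("Mini Van", "Blue", "Clyde"): 6}
rotated = rotate(cube, (1, 2, 0))  # now keyed as (color, dealership, model)
```

The cell values are untouched by a rotation; only the order of the coordinates used to address them changes.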

Features of OLAP - Slicing / Filtering


MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL (Mini Van, Coupe), COLOR (Normal Blue, Metal Blue) and DEALERSHIP (Carr, Clyde).]
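Slicing can be sketched as keeping only the cells whose members fall inside the requested subset of each dimension. The `slice_cube` helper is a hypothetical illustration; the cell values are taken from the earlier Sales Volumes table, and the slice uses the plain Blue color because the figure's Normal Blue and Metal Blue members come without values.

```python
DIMS = ("model", "color", "dealership")

# A few cells from the Sales Volumes data, keyed as (model, color, dealership).
cube = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Blue", "Gleason"): 3,
    ("Mini Van", "Blue", "Carr"): 2,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Coupe", "Blue", "Clyde"): 3,
    ("Coupe", "Blue", "Carr"): 3,
}

def slice_cube(cells, **allowed):
    """Keep cells whose member is in the allowed set for every named dimension."""
    def keep(key):
        return all(key[DIMS.index(dim)] in members
                   for dim, members in allowed.items())
    return {key: value for key, value in cells.items() if keep(key)}

# Slice: Blue vehicles sold by Carr or Clyde, for any model.
blue_slice = slice_cube(cube, color={"Blue"}, dealership={"Carr", "Clyde"})
slice_total = sum(blue_slice.values())
```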

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION   DISTRICT   DEALERSHIP
Midwest  Chicago    Clyde
                    Gleason
         St. Louis  Carr
                    Levi
         Gary       Lucas
                    Bolton

Sales can be viewed at the Region, District or Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down, respectively.
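Roll-up and drill-down along the organization hierarchy can be sketched as follows. The sales figures are invented for illustration, and the assignment of dealerships to districts is read from the order of the labels in the hierarchy, so it should be treated as an assumption.

```python
# Organization hierarchy: dealership -> district -> region (assumed mapping).
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales per dealership (not taken from the slides).
sales = {"Clyde": 10, "Gleason": 20, "Carr": 5, "Levi": 15, "Lucas": 8, "Bolton": 12}

def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = {}
    for member, value in values.items():
        parent = parent_of[member]
        totals[parent] = totals.get(parent, 0) + value
    return totals

def drill_down(parent, values, parent_of):
    """Return the detail values that sit below one parent member."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}

by_district = roll_up(sales, DISTRICT_OF)     # dealership -> district
by_region = roll_up(by_district, REGION_OF)   # district -> region
chicago_detail = drill_down("Chicago", sales, DISTRICT_OF)
```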

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M), scale 0 to 200, by region (East, West, Central) for Year 1999 and Year 2000.]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M), scale 0 to 90, by region (East, West, Central) for the 1st to 4th quarters of Year 1999.]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M), scale 0 to 20, by region (East, West, Central) for January, February and March of Year 1999.]

Drill-down from Quarter to Month

Implementation Techniques -OLAP Architectures

o MOLAP - Multidimensional OLAP: Multidimensional databases for the database and application logic layer
o ROLAP - Relational OLAP: Accesses data stored in a relational data warehouse for OLAP analysis. Database and application logic are provided as separate layers
o HOLAP - Hybrid OLAP: The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server
o DOLAP - Desktop OLAP: Personal MDDB server and application on the desktop
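The HOLAP routing described above (MDDB first, then RDBMS) can be sketched with a small Python class. The class name, the shape of the summary and detail data, and the rule that month-level questions go to the relational detail are all assumptions made for illustration.

```python
class HolapServer:
    """Hypothetical server: answer from MDDB summaries, else from relational detail."""

    def __init__(self, mddb_summaries, rdbms_rows):
        self.mddb = mddb_summaries  # {(region, year): total}
        self.rdbms = rdbms_rows     # [(region, year, month, amount), ...]

    def total(self, region, year, month=None):
        # Route to the MDDB first when a stored summary can answer the query.
        if month is None and (region, year) in self.mddb:
            return self.mddb[(region, year)], "MDDB"
        # Otherwise compute the result on the fly from the relational detail.
        amount = sum(a for r, y, m, a in self.rdbms
                     if r == region and y == year and (month is None or m == month))
        return amount, "RDBMS"

# Hypothetical detail rows and one pre-computed summary.
detail = [("East", 1999, 1, 5), ("East", 1999, 2, 7), ("West", 1999, 1, 4)]
server = HolapServer({("East", 1999): 12}, detail)

yearly = server.total("East", 1999)            # answered from the summary
monthly = server.total("East", 1999, month=2)  # needs detail data
uncached = server.total("West", 1999)          # no summary stored
```

The caller gets the same kind of answer either way, which mirrors the idea that the source of the data stays transparent to the end user.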

MOLAP - MDDB storage

[Figure: MOLAP components: OLAP Cube, OLAP Calculation Engine, and the clients Web Browser, OLAP Tools and OLAP Applications.]

MOLAP - Features

o Powerful analytical capabilities (e.g., financial, forecasting, statistical)
o Aggregation and calculation capabilities
o Read/write analytic applications
o Specialized data structures for
  - maximum query performance
  - optimum space utilization


ROLAP - Standard SQL storage

[Diagram components: Relational DW, MDDB - Relational Mapping, SQL, OLAP Calculation Engine, Web Browser, OLAP Tools, OLAP Applications]


ROLAP - Features
o Three-tier hardware/software architecture:
  - GUI on the client; multidimensional processing on a mid-tier server; target database on a database server
  - Processing split between the mid-tier and database servers
o Ad hoc query capabilities to very large databases
o DW integration
o Data scalability


HOLAP - Combination of RDBMS and MDDB


[Diagram components: OLAP Cube, Relational DW, SQL, OLAP Calculation Engine, Any Client, Web Browser, OLAP Tools, OLAP Applications]


HOLAP - Features

o RDBMS used for detailed data stored in large databases
o MDDB used for fast, read/write OLAP analysis and calculations
o Combines the scalability of the RDBMS with the performance of the MDDB
o Calculation engine provides full analysis features
o Source of data is transparent to the end user
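The routing idea behind HOLAP can be sketched as follows. This is an illustrative toy, not how any particular OLAP server is implemented: a Python dictionary stands in for the MDDB summary store, an in-memory SQLite table stands in for the relational warehouse, and the table name, column names and figures are all hypothetical.

```python
import sqlite3

# Detailed data lives in a relational table (the RDBMS part).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 1999, 40.0), ("East", 1999, 35.0),
     ("West", 1999, 50.0), ("West", 2000, 60.0)],
)

# Pre-computed summaries stand in for the MDDB part; only some cells are stored.
SUMMARY_CUBE = {("East", 1999): 75.0}


def total_sales(region, year):
    """Return (total, source): try the summary store first, then the RDBMS."""
    key = (region, year)
    if key in SUMMARY_CUBE:
        return SUMMARY_CUBE[key], "MDDB"
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE region = ? AND year = ?",
        key,
    ).fetchone()
    return row[0], "RDBMS"
```

The caller only sees `total_sales`; whether the figure came from the summary store or from the detailed table is hidden, which is the sense in which the source of data is transparent to the end user.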


Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
  MOLAP: Good design: 3 - 10 times
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum - if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

o Oracle Express Products
o Hyperion Essbase
o Cognos - PowerPlay
o Seagate - Holos
o SAS
o MicroStrategy - DSS Agent
o Informix MetaCube
o Brio Query
o Business Objects / Web Intelligence


Sample OLAP Applications

o Sales Analysis
o Financial Analysis
o Profitability Analysis
o Performance Analysis
o Risk Management
o Profiling & Segmentation
o Scorecard Application
o NPA Management
o Strategic Planning
o Customer Relationship Management (CRM)


Data Warehouse Testing


Data Warehouse Testing Overview


The cost of a software defect increases exponentially the later in the development lifecycle it is found. In data warehousing this is compounded by the additional business cost of using incorrect data to make critical business decisions.

The methodology required for testing a data warehouse is different from that for testing a typical transaction system.


Difference in Testing a Data Warehouse and a Transaction System

Data warehouse testing differs on the following counts:
o User-triggered vs. system-triggered
o Volume of test data
o Possible scenarios / test cases
o Programming for testing challenge


User-triggered vs. system-triggered
In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).


Volume of test data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, because one tries to fill up the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases
For a data warehouse, the number of permutations and combinations one can test is virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.


Programming for testing challenge
For transaction systems, users/business analysts typically test the output of the system. For a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
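A stand-alone comparison script of this kind might look like the sketch below. It is only an illustration under stated assumptions: the tables `src_orders` and `dw_orders`, their columns, and the transformation rule (cents converted to currency units) are hypothetical, and an in-memory SQLite database stands in for both the source system and the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_orders (order_id TEXT, amount_cents INTEGER);  -- pre-transformation
CREATE TABLE dw_orders  (order_id TEXT, amount REAL);           -- post-transformation
INSERT INTO src_orders VALUES ('o1', 1250), ('o2', 990), ('o3', 4000);
INSERT INTO dw_orders  VALUES ('o1', 12.50), ('o2', 9.90);
""")


def reconcile(conn):
    """Compare source data with loaded data: row counts, totals, missing keys."""
    src_count = conn.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
    dw_count = conn.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
    # Apply the assumed transformation rule (cents -> units) to the source total.
    src_total = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM src_orders").fetchone()[0] / 100
    dw_total = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM dw_orders").fetchone()[0]
    missing = [row[0] for row in conn.execute(
        "SELECT s.order_id FROM src_orders s "
        "LEFT JOIN dw_orders d ON d.order_id = s.order_id "
        "WHERE d.order_id IS NULL ORDER BY s.order_id")]
    return {
        "count_match": src_count == dw_count,
        "total_match": abs(src_total - dw_total) < 0.005,
        "missing_in_dw": missing,
    }


report = reconcile(conn)
```

With the sample rows above, the report flags order `o3` as present in the source but missing from the warehouse.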


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
o 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
o 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP

The testing phases consist of:
o Requirements testing
o Unit testing
o Integration testing
o Performance testing
o Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
o Are the requirements complete?
o Are the requirements singular?
o Are the requirements unambiguous?
o Are the requirements developable?
o Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
o Check whether the ETLs are accessing and picking up the right data from the right source.
o Check that all data transformations are correct according to the business rules and that the data warehouse is correctly populated with the transformed data.
o Test the rejected records that do not fulfil the transformation rules.


Unit testing the report data:
o Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
o Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, trace them back, and compare them with the source systems.
o Derivation formulae/calculation rules should be verified.


Integration Testing
Integration testing will involve the following:
o Sequence of ETL jobs in a batch
o Initial loading of records into the data warehouse
o Incremental loading of records at a later date, to verify the newly inserted or updated data
o Testing the rejected records that do not fulfil the transformation rules
o Error log generation


Performance Testing
Performance testing should check for:
o ETL processes completing within the time window
o Monitoring and measuring of data quality issues
o Refresh times for standard/complex reports


Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.


Questions


Thank You


Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder


Content
1 An Overview of Data Warehouse
2 Data Warehouse Architecture
3 Data Modeling for Data Warehouse
4 Overview of Data Cleansing
5 Data Extraction, Transformation, Load
6 Metadata Management
7 OLAP
8 Data Warehouse Testing


An Overview

Understanding What is a Data Warehouse


What is Data Warehouse?


Definitions of Data Warehouse
o A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. (W. H. Inmon)
o A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support. (Babcock)
o A data warehouse is a relational database: a copy of transaction data specifically structured for query and analysis. (Ralph Kimball)

In simple terms: data warehousing is the collection of data from different systems, which helps in business decisions, analysis and reporting.


Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
o Subject Oriented: data that gives information about a particular subject instead of about a company's ongoing operations.
o Integrated: data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
o Nonvolatile: data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
o Time Variant: in order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse

Components of Warehouse
o Source tables: real-time, volatile data in relational databases used for transaction processing (OLTP). The sources can be any relational databases or flat files.
o ETL tools: extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
o Maintenance and administration tools: authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
o Modeling tools: used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
o Databases: target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
o End-user tools for analysis and reporting: get reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: the data is cleansed and transformed, then loaded into the staging area and tested there. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.


Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
o Entity: an object that can be observed and classified based on its properties and characteristics, such as employee, book or student
o Relationship: relates entities to other entities

Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of dimensional data models most commonly used:
o Star Schema
o Snowflake Schema


Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
o Dimension: a category of information. For example, the Time dimension.
o Attribute: a unique level within a dimension. For example, Month is an attribute in the Time dimension.
o Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
o Fact table: a table that contains the measures of interest.
o Lookup table: provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
o Surrogate keys: used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
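To show how a star schema is queried, the sketch below loads the sample rows from the slide into an in-memory SQLite database and joins the fact table to two of its dimension tables. The column types and the choice of SQLite are assumptions made for illustration; the slide does not specify a database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                       storeId TEXT, qty INTEGER, amt INTEGER);

INSERT INTO product  VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store    VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                            (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO sale     VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                            ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                            ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Star join: the fact table joins directly to each dimension it needs.
rows = conn.execute("""
    SELECT st.city, p.name, SUM(s.qty) AS qty, SUM(s.amt) AS amt
    FROM sale s
    JOIN product p ON p.prodId = s.prodId
    JOIN store st  ON st.storeId = s.storeId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()
```

Every dimension is one join away from the fact table, which is what keeps typical star-schema queries simple.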


Snowflake Schema
Tables (store references sType and city; city references region):

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized, and are frequently designed at a level of normalization short of third normal form.
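The practical difference from a star schema is that reaching an outer table takes a chain of joins. The sketch below loads the sample rows from the slide into an in-memory SQLite database (an assumption made for illustration, as are the column types) and walks from store through city to region.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT,
                     regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY,
                     cityId TEXT REFERENCES city(cityId),
                     tId TEXT REFERENCES sType(tId),
                     mgr TEXT);

INSERT INTO region VALUES ('north', 'cold region'), ('south', 'warm region');
INSERT INTO city   VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
INSERT INTO sType  VALUES ('t1', 'small', 'downtown'), ('t2', 'large', 'suburbs');
INSERT INTO store  VALUES ('s5', 'sfo', 't1', 'joe'),
                          ('s7', 'sfo', 't2', 'fred'),
                          ('s9', 'la',  't1', 'nancy');
""")

# Snowflake join: region is reached through city, not directly from store.
rows = conn.execute("""
    SELECT s.storeId, s.mgr, t.size, r.name
    FROM store s
    JOIN sType t  ON t.tId = s.tId
    JOIN city c   ON c.cityId = s.cityId
    JOIN region r ON r.regId = c.regId
    ORDER BY s.storeId
""").fetchall()
```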


Overview of Data Cleansing


The Need For Data Quality


o Difficulty in decision making
o Time delays in operation
o Organizational mistrust
o Data ownership conflicts
o Customer attrition
o Costs associated with
  - error detection
  - error rework
  - customer service
  - fixing customer problems


Understand Information Flow in Organization
o Identify authoritative data sources
o Interview employees & customers

Identify Potential Problem Areas & Assess Impact
o Data entry points
o Cost of bad data

Measure Quality of Data
o Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
o Use data cleansing tools to clean data at the source
o Load only clean data into the data warehouse

Continuous Monitoring
o Schedule periodic cleansing of source data

Identify Areas of Improvement
o Identify & correct cause of defects
o Refine data capture mechanisms at source
o Educate users on importance of DQ


Customized Programs
o Strengths:
  - Address specific needs
  - No bulky one-time investment
o Limitations:
  - Large numbers of custom programs in different environments are difficult to manage
  - Minor alterations demand coding effort

Data Quality Assessment Tools
o Strength: provide automated assessment
o Limitation: no measure of data accuracy

Business Rule Discovery Tools
o Strengths:
  - Detect correlation in data values
  - Can detect patterns of behavior that indicate fraud
o Limitations:
  - Not all variables can be discovered
  - Some discovered rules might not be pertinent
  - There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
o Strengths:
  - Usually are integrated packages with cleansing features as add-ons
o Limitations:
  - Error prevention at source is usually absent
  - The ETL tools have limited cleansing facilities

Business Rule Discovery Tools
o Integrity Data Reengineering Tool from Vality Technology
o Trillium Software System from Harte-Hanks Data Technologies
o Migration Architect from DB Star

Data Reengineering & Cleansing Tools
o Carlton Pureview from Oracle
o ETI-Extract from Evolutionary Technologies
o PowerMart from Informatica Corp
o Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
o Migration Architect, Evoke Axio from Evoke Software
o Wizrule from Wizsoft

Name & Address Cleansing Tools
o Centrus Suite from Sagent
o I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

Data Extraction:
o Rummages through a file or database
o Uses some criteria for selection
o Identifies qualified data
o Transports the data over onto another file or database

Data Transformation:
o Integrating dissimilar data types
o Changing codes
o Adding a time attribute
o Summarizing data
o Calculating derived values
o Renormalizing data

Data Extraction Cleanup:
o Restructuring of records or fields
o Removal of operational-only data
o Supply of missing field values
o Data integrity checks
o Data consistency and range checks, etc.

Data Loading:
o Initial and incremental loading
o Updating of metadata
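The steps above can be sketched as a tiny extract-transform-load pipeline. Everything specific here is hypothetical and chosen only for illustration: the source records, the status code table, the `fact_orders` target table, the load date, and the use of SQLite as the target.

```python
import sqlite3

# Hypothetical source records, e.g. rows read from a flat file.
SOURCE_ROWS = [
    {"order_id": "o1", "status": "S", "amount": "12.50", "internal_flag": "x"},
    {"order_id": "o2", "status": "C", "amount": "", "internal_flag": "y"},
    {"order_id": "o3", "status": "S", "amount": "-4.00", "internal_flag": "z"},
]
STATUS_CODES = {"S": "SHIPPED", "C": "CANCELLED"}  # assumed code mapping


def extract(rows):
    """Select qualified data: keep only rows that carry an order id."""
    return [r for r in rows if r.get("order_id")]


def transform(rows, load_date):
    """Clean and transform rows; return (clean_rows, rejected_ids)."""
    clean, rejected = [], []
    for r in rows:
        amount = float(r["amount"]) if r["amount"] else 0.0  # supply missing value
        if amount < 0:                                        # range check
            rejected.append(r["order_id"])
            continue
        # Change codes, add a time attribute, drop the operational-only field.
        clean.append((r["order_id"], STATUS_CODES[r["status"]], amount, load_date))
    return clean, rejected


def load(conn, rows):
    """Initial or incremental load: existing keys are replaced, new keys added."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                 "(order_id TEXT PRIMARY KEY, status TEXT, amount REAL, load_date TEXT)")
    conn.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
clean_rows, rejected_ids = transform(extract(SOURCE_ROWS), "2024-01-01")
load(conn, clean_rows)
```

Rejected records are returned rather than silently dropped, so they can be logged and tested separately.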


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve this problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Major components involved in ETL Processing

o Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs.
o Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes.
o Extract: the process of reading data from a database.
o Transform: the process of converting the extracted data.
o Load: the process of writing the data into the target database.
o Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
o Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.


ETL Tools
o Provide facilities to specify a large number of transformation rules with a GUI
o Generate programs to transform data
o Handle multiple data sources
o Handle data redundancy
o Generate metadata as output
o Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
o PowerCenter/PowerMart from Informatica
o Data Mart Solution from Sagent Technology
o DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is information...
o That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
o About the data being captured and loaded into the warehouse
o Documented in IT tools, that improves both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
o Time spent looking for information
o How often is information found?
o What poor decisions were made based on incomplete information?
o How much money was lost or earned as a result?

Interpreting information
o How many times have businesses needed to rework or recall products?
o What impact does it have on the bottom line?
o How many mistakes were due to misinterpretation of existing documentation?
o How much interpretation results from too much metadata?
o How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
o How do various data perspectives connect together?
o How much time is spent trying to figure that out?
o How much do the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


o Provide a simple catalogue of business metadata descriptions and views
o Document/manage metadata descriptions from an integrated development environment
o Enable DW users to identify and invoke pre-built queries against the data stores
o Design and enhance new data models and schemas for the data warehouse
o Capture data transformation rules between the operational and data warehousing databases
o Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
o Warehouse administrator
o Application developer

Business Users - Business metadata
o Meanings
o Definitions
o Business rules

Software Tools
o Used in DW life-cycle development
o Metadata requirements for each tool must be identified
o The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
o Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third-Party Bridging Tools

Oracle Exchange
o Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik - Toolbus
o Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
o Hub-and-spoke solution for enabling metadata interoperability
o Ardent is focusing on its own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
o Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools


Metadata Repositories
o IBM, Oracle and Microsoft to offer free or near-free basic repository services
o Enable organisations to reuse metadata across technologies
o Integrate DB design, data transformation and BI tools from different vendors
o Multi-tool vendors are taking a bridged or federated rather than integrated approach to sharing metadata
o Both IBM and Oracle have multiple repositories for different lines of products (e.g., one for AD and one for DW), with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
o Most frequently used interchange standard
o Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
o XML addresses context and data meaning, not presentation
o Can enable exchange over the web, employing industry standards for storing and sharing programming data
o Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
o Based on XML/UML standards
o Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle, Carleton Group, CA-PLATINUM Technology (founding member), Viasoft


OLAP


Agenda
o OLAP Definition
o Distinction between OLTP and OLAP
o MDDB Concepts
o Implementation Techniques
o Architectures
o Features
o Representative Tools


OLAP: On-Line Analytical Processing


OLAP can be defined as a technology which allows the users to view the aggregate data across measurements (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.) Used interchangeably with BI Multidimensional view of data is the foundation of OLAP Users :Analysts, Decision makers


Distinction between OLTP and OLAP


Source of data
- OLTP System: Operational data; OLTPs are the original source of the data
- OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: To control and run fundamental business tasks
- OLAP System: Decision support

What the data reveals
- OLTP System: A snapshot of ongoing business processes
- OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
- OLTP System: Short and fast inserts and updates initiated by end users
- OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed for efficient and convenient storage and retrieval of data that is closely related and is stored, viewed and analyzed from different perspectives (dimensions).

- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.


RDBMS v/s MDDB:


Relational DBMS: one row per combination of MODEL, COLOR and DEALER

MODEL          COLOR   DEALER    VOL.
MINI VAN       BLUE    Clyde     6
MINI VAN       BLUE    Gleason   3
MINI VAN       BLUE    Carr      2
MINI VAN       RED     Clyde     5
MINI VAN       RED     Gleason   3
MINI VAN       RED     Carr      1
MINI VAN       WHITE   Clyde     3
MINI VAN       WHITE   Gleason   1
MINI VAN       WHITE   Carr      4
SPORTS COUPE   BLUE    Clyde     3
SPORTS COUPE   BLUE    Gleason   (missing)
SPORTS COUPE   BLUE    Carr      3
SPORTS COUPE   RED     Clyde     4
SPORTS COUPE   RED     Gleason   3
SPORTS COUPE   RED     Carr      6
SPORTS COUPE   WHITE   Clyde     2
SPORTS COUPE   WHITE   Gleason   3
SPORTS COUPE   WHITE   Carr      5
SEDAN          BLUE    Clyde     4
SEDAN          BLUE    Gleason   3
SEDAN          BLUE    Carr      2
...

27 x 4 = 108 cells

Increased Complexity...

MDDB: the same Sales Volumes held in a cube with three dimensions
- MODEL: Mini Van, Coupe, Sedan
- COLOR: Blue, Red, White
- DEALERSHIP: Carr, Gleason, Clyde

3 x 3 x 3 = 27 cells
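
The contrast between the two storage forms can be sketched in a few lines of Python. This is only an illustration, not how any particular MDDB product stores data: the function name `build_cube` and the dictionary layout are assumptions, and only the nine Mini Van rows from the table are used.

```python
# Relational form: one row per (model, color, dealer) combination.
# Only the Mini Van rows from the slide are used here.
rows = [
    ("Mini Van", "Blue", "Clyde", 6),
    ("Mini Van", "Blue", "Gleason", 3),
    ("Mini Van", "Blue", "Carr", 2),
    ("Mini Van", "Red", "Clyde", 5),
    ("Mini Van", "Red", "Gleason", 3),
    ("Mini Van", "Red", "Carr", 1),
    ("Mini Van", "White", "Clyde", 3),
    ("Mini Van", "White", "Gleason", 1),
    ("Mini Van", "White", "Carr", 4),
]


def build_cube(rows):
    """Collect the members of each dimension and map each coordinate to its volume."""
    dimensions = {
        "MODEL": sorted({r[0] for r in rows}),
        "COLOR": sorted({r[1] for r in rows}),
        "DEALER": sorted({r[2] for r in rows}),
    }
    cells = {(model, color, dealer): vol for model, color, dealer, vol in rows}
    return dimensions, cells


dimensions, cells = build_cube(rows)
relational_cells = len(rows) * 4  # every row repeats the three labels plus the value
cube_cells = len(cells)           # the cube holds one value per coordinate
```

The labels are stored once per dimension instead of once per row, which is where the smaller cell count of the cube comes from.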


Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array.
- The user views data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than a relational table offers.

Storage Space
- Very low space consumption compared to a relational DB

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. A relational DB uses indexes, sophisticated joins, etc., which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to Sparsity
- Due to Summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example

If members of different dimensions do not interact, a blank cell is left behind.

Sales Volumes (MODEL x COLOR): every cell is filled

            Blue  Red  White
Mini Van    6     5    4
Coupe       3     5    5
Sedan       4     3    2

Employee Age (LAST NAME x EMPLOYEE #): only 9 of the 81 cells are filled

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19
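
To make the sparsity point concrete, the short sketch below computes how densely an array is filled when the employee table is laid out along two dimensions. The helper name `density` is an assumption introduced for illustration.

```python
# Employee table from the slide: (last name, employee number, age).
employees = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]


def density(filled_cells, dimension_sizes):
    """Fraction of the cells in a multidimensional array that hold a value."""
    total = 1
    for size in dimension_sizes:
        total *= size
    return filled_cells / total


last_names = {e[0] for e in employees}
emp_numbers = {e[1] for e in employees}

# LAST NAME x EMPLOYEE # gives a 9 x 9 array with only 9 filled cells.
age_density = density(len(employees), [len(last_names), len(emp_numbers)])

# The 3 x 3 Sales Volumes array (MODEL x COLOR) has a value in every cell.
sales_density = density(9, [3, 3])
```

Each employee number belongs to exactly one last name, so the two dimensions do not interact and most of the array stays empty.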


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / Dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / Drill-through to underlying detail data


Features of OLAP - Rotation


Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL (rows) by COLOR (columns)

            Blue  Red  White
Mini Van    6     5    4
Coupe       3     5    5
Sedan       4     3    2

( ROTATE 90° )

View #2: Sales Volumes, COLOR (rows) by MODEL (columns)

         Mini Van  Coupe  Sedan
Blue     6         3      4
Red      5         5      3
White    4         5      2

A 2-dimensional array has 2 views.
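
For a two-dimensional array, the rotation amounts to swapping rows and columns. A minimal sketch, using the Sales Volumes figures from the slide and a hypothetical helper named `rotate`:

```python
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]

# View #1: one row per model, one column per color.
view1 = [
    [6, 5, 4],  # Mini Van
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]


def rotate(grid):
    """Swap the two dimensions so that columns become rows."""
    return [list(column) for column in zip(*grid)]


# View #2: one row per color, one column per model.
view2 = rotate(view1)
```

Rotating a second time returns the original view, which is why a two-dimensional array has exactly two views.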



Features of OLAP - Rotation


Figure: the Sales Volumes cube, with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown in six orientations (View #1 to View #6). Each view is reached from the previous one by a 90° rotation.

A 3-dimensional array has 6 views.
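
The view counts follow from the number of ways the dimensions can be ordered. The sketch below, which assumes that a view is simply an ordering of the dimension names, enumerates them with the standard library:

```python
from itertools import permutations


def views(dimensions):
    """List every ordering of the dimensions; each ordering is one view."""
    return list(permutations(dimensions))


views_2d = views(["MODEL", "COLOR"])
views_3d = views(["MODEL", "COLOR", "DEALERSHIP"])
```

With n dimensions there are n! orderings, so two dimensions give 2 views and three dimensions give 6.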



Features of OLAP - Slicing / Filtering


MDDB allows the end user to quickly slice in on the exact view of the data required.

Figure: a slice of the Sales Volumes cube restricted to MODEL (Mini Van, Coupe), COLOR (Normal Blue, Metal Blue) and DEALERSHIP (Carr, Clyde).
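
Slicing can be sketched as keeping only the cells whose members fall inside the requested subsets. The helper `slice_cube` and the dictionary-based cube are assumptions made for illustration; the volumes are taken from the relational table shown earlier, which uses a single Blue member rather than Normal Blue and Metal Blue.

```python
DIMS = ("model", "color", "dealer")

# A few cells of the Sales Volumes cube, keyed by (model, color, dealer).
cells = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Blue", "Carr"): 2,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Coupe", "Blue", "Clyde"): 3,
    ("Coupe", "Blue", "Carr"): 3,
    ("Sedan", "Blue", "Clyde"): 4,
}


def slice_cube(cells, **keep):
    """Return the cells whose member on each named dimension is in the given set."""
    result = {}
    for coord, value in cells.items():
        named = dict(zip(DIMS, coord))
        if all(named[dim] in members for dim, members in keep.items()):
            result[coord] = value
    return result


sliced = slice_cube(
    cells,
    model={"Mini Van", "Coupe"},
    color={"Blue"},
    dealer={"Carr", "Clyde"},
)
```

Dimensions that are not named in the call are left unrestricted, so the same helper covers both slicing (one dimension fixed) and dicing (several dimensions restricted).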

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION: Midwest
- DISTRICT: Chicago
  - DEALERSHIP: Clyde, Gleason
- DISTRICT: St. Louis
  - DEALERSHIP: Carr, Levi
- DISTRICT: Gary
  - DEALERSHIP: Lucas, Bolton

Sales at Region / District / Dealership level

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
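
A roll-up sums the values of the members below each parent, and a drill-down shows the children of one parent. In the sketch below, the grouping of dealerships into districts is read from the slide's hierarchy, while the sales figures and the helper names `roll_up` and `drill_down` are hypothetical.

```python
# Parent of each member, one mapping per hierarchy level.
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales per dealership (the slide gives no figures).
dealership_sales = {
    "Clyde": 10, "Gleason": 7, "Carr": 12, "Levi": 5, "Lucas": 4, "Bolton": 9,
}


def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = {}
    for member, value in values.items():
        parent = parent_of[member]
        totals[parent] = totals.get(parent, 0) + value
    return totals


def drill_down(values, parent_of, parent):
    """Show the child members that make up one parent's total."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


district_sales = roll_up(dealership_sales, district_of)
region_sales = roll_up(district_sales, region_of)
chicago_detail = drill_down(dealership_sales, district_of, "Chicago")
```

Applying `roll_up` twice moves from dealership to district to region, mirroring the three levels of the organization dimension.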


OLAP Reporting - Drill Down

Figure: Inflows (Region, Year). Bar chart of Inflows ($M) for the East, West and Central regions, by year (1999, 2000).

Figure: Inflows (Region, Year - Year 1999). Bar chart of Inflows ($M) for the East, West and Central regions, by quarter (1st Qtr to 4th Qtr) of 1999. Drill-down from Year to Quarter.

Figure: Inflows (Region, Year - Year 1999 - 1st Qtr). Bar chart of Inflows ($M) for the East, West and Central regions, by month (January, February, March) of 1999. Drill-down from Quarter to Month.

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases provide the database and application logic layers

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic provided as separate layers

HOLAP - Hybrid OLAP
- OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop
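
The HOLAP routing described in the list can be sketched as a lookup that tries precomputed summaries first and falls back to detail rows. Everything here is an assumption for illustration: the class name `HolapServer`, the dictionary standing in for the MDDB and the list standing in for the relational warehouse.

```python
class HolapServer:
    """Toy query router: summaries live in a 'cube', detail rows in a 'warehouse'."""

    def __init__(self, cube_summaries, detail_rows):
        self.cube = cube_summaries  # stands in for the MDDB
        self.rows = detail_rows     # stands in for the relational warehouse

    def total(self, model, color=None):
        """Return (value, source) for the requested total sales volume."""
        key = (model, color)
        if key in self.cube:
            # Summary already stored in the MDDB part.
            return self.cube[key], "MDDB"
        # Otherwise compute the answer on the fly from the detail rows.
        value = sum(
            vol for m, c, vol in self.rows
            if m == model and (color is None or c == color)
        )
        return value, "RDBMS"


detail = [("Mini Van", "Blue", 6), ("Mini Van", "Blue", 3), ("Mini Van", "Red", 5)]
server = HolapServer({("Mini Van", None): 14}, detail)

summary_answer = server.total("Mini Van")
detail_answer = server.total("Mini Van", "Blue")
```

The caller receives an answer either way, which matches the idea that the source of the data stays transparent to the end user.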

MOLAP - MDDB storage

Diagram: an OLAP Cube (MDDB storage) feeds the OLAP Calculation Engine, which serves Web Browser, OLAP Tools and OLAP Applications clients.


MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  - Maximum query performance
  - Optimum space utilization


ROLAP - Standard SQL storage

Diagram: the Relational DW is queried through SQL by the OLAP Calculation Engine, which uses an MDDB - Relational Mapping and serves Web Browser, OLAP Tools and OLAP Applications clients.

ROLAP - Features
- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier and database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability
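
In a ROLAP setup the mid-tier turns a multidimensional request into SQL against the relational warehouse. A minimal sketch using Python's built-in `sqlite3` module follows; the `sales` table, its columns and the helper `rolap_query` are assumptions for illustration, not the interface of any OLAP product.

```python
import sqlite3

ALLOWED_DIMENSIONS = {"model", "color", "dealer"}


def rolap_query(conn, group_by):
    """Translate a list of dimensions into a GROUP BY query and run it."""
    if not group_by or not set(group_by) <= ALLOWED_DIMENSIONS:
        raise ValueError("unknown or missing dimension")
    cols = ", ".join(group_by)
    sql = f"SELECT {cols}, SUM(vol) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, color TEXT, dealer TEXT, vol INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Mini Van", "Blue", "Clyde", 6),
        ("Mini Van", "Blue", "Gleason", 3),
        ("Mini Van", "Blue", "Carr", 2),
        ("Mini Van", "Red", "Clyde", 5),
    ],
)

by_color = rolap_query(conn, ["color"])
by_color_and_dealer = rolap_query(conn, ["color", "dealer"])
```

Dimension names are checked against a fixed set before being placed in the SQL text, since column names cannot be passed as bound parameters.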


HOLAP - Combination of RDBMS and MDDB


Diagram: the OLAP Calculation Engine reads from both an OLAP Cube and, through SQL, the Relational DW, and serves any client (Web Browser, OLAP Tools, OLAP Applications).

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user


Architecture Comparison

Definition
- MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
- ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to Sparsity
- MOLAP: Good design: 3-10 times
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in MDDB part

Data explosion due to Summarization
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query Execution Speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
- MOLAP: Medium: MDDB Server + large disk space cost
- ROLAP: Low: Only RDBMS + disk space cost
- HOLAP: High: RDBMS + disk space + MDDB Server cost

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence


Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


- There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system.


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge


Difference In Testing Data warehouse and Transaction System

User-triggered vs. system-triggered
- In a data warehouse, most of the testing is system-triggered.
- Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.).
- In those systems, very few test cycles cover the system-triggered scenarios (such as billing or valuation).


Difference In Testing Data warehouse and Transaction System


Volume of Test Data
- The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.

Possible scenarios / Test Cases
- In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.


Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
- In transaction systems, users/business analysts typically test the output of the system.
- In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
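
A stand-alone comparison script of the kind mentioned above might look like the following sketch. The transformation rule, the field names and the helper `compare` are hypothetical; a real script would read both sides from the source system and the warehouse.

```python
def transform(record):
    """Hypothetical business rule: normalise the region code, convert cents to units."""
    return {
        "id": record["id"],
        "region": record["region"].strip().upper(),
        "amount": record["amount_cents"] / 100,
    }


def compare(source, target):
    """Re-apply the rule to the source and report differences against the target."""
    expected = {r["id"]: transform(r) for r in source}
    actual = {r["id"]: r for r in target}
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
        "mismatched": sorted(
            k for k in expected.keys() & actual.keys() if expected[k] != actual[k]
        ),
    }


source_rows = [
    {"id": 1, "region": " east ", "amount_cents": 1250},
    {"id": 2, "region": "west", "amount_cents": 300},
    {"id": 3, "region": "central", "amount_cents": 990},
]
target_rows = [
    {"id": 1, "region": "EAST", "amount": 12.5},
    {"id": 2, "region": "WEST", "amount": 30.0},  # wrong amount loaded
    {"id": 4, "region": "EAST", "amount": 1.0},   # row with no source record
]

report = compare(source_rows, target_rows)
```

The report separates records that never arrived, records that appeared from nowhere and records whose transformed values differ, which are three distinct kinds of ETL defect.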


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP

Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?


Unit Testing

Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs access and pick up the right data from the right source.
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that don't fulfil the transformation rules.
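
As a sketch of unit testing a single ETL rule, including its rejected records, consider the hypothetical transformation below. The rule (a row needs a product id and a positive integer quantity) and the name `apply_rule` are assumptions for illustration.

```python
def apply_rule(rows):
    """Split rows into loaded rows (with a derived amount) and rejected rows."""
    loaded, rejected = [], []
    for row in rows:
        qty = row.get("qty")
        if row.get("prod_id") and isinstance(qty, int) and qty > 0:
            loaded.append({**row, "amount": qty * row["unit_price"]})
        else:
            rejected.append(row)
    return loaded, rejected


test_rows = [
    {"prod_id": "p1", "qty": 2, "unit_price": 10},  # valid
    {"prod_id": "p2", "qty": 0, "unit_price": 5},   # rejected: quantity not positive
    {"prod_id": "", "qty": 1, "unit_price": 5},     # rejected: missing product id
]

loaded, rejected = apply_rule(test_rows)
```

A unit test then asserts both sides: that valid rows carry the correctly derived value, and that every invalid row ends up in the reject list instead of being dropped silently.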


Unit Testing
Unit Testing the Report data:

- Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, trace them back and compare them with the source systems.
- Derivation formulae / calculation rules should be verified.


Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date, to verify the newly inserted or updated data.
- Testing the rejected records that don't fulfil the transformation rules.
- Error log generation.
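
The initial and incremental load checks can be sketched with a toy in-memory warehouse. The `load` helper, the `id` key and the sample rows are hypothetical; the point is only to show what an integration test would verify after each batch.

```python
def load(warehouse, batch):
    """Upsert a batch keyed by 'id'; return how many rows were inserted and updated."""
    inserted = updated = 0
    for row in batch:
        if row["id"] not in warehouse:
            inserted += 1
        elif warehouse[row["id"]] != row:
            updated += 1
        warehouse[row["id"]] = row
    return inserted, updated


warehouse = {}

# Initial load: every record is new.
initial_counts = load(warehouse, [{"id": 1, "qty": 1}, {"id": 2, "qty": 2}])

# Incremental load at a later date: one changed record and one new record.
incremental_counts = load(warehouse, [{"id": 2, "qty": 5}, {"id": 3, "qty": 7}])
```

After the incremental run, the test checks the counts as well as the stored values, so that an update that silently failed would be noticed.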


Performance Testing
Performance Testing should check for :

- ETL processes completing within the time window.
- Monitoring and measuring the data quality issues.
- Refresh times for standard/complex reports.


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.


Questions


Thank You


Components of Warehouse
- Source Tables: Real-time, volatile data in relational databases used for transaction processing (OLTP). The sources can be any relational databases or flat files.
- ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Get reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


This is a basic design, in which source files are loaded into a warehouse and users query the data for different purposes.

This design adds a staging area, where the data is loaded and tested after cleansing and transformation. It is then loaded into the target database/warehouse, which is divided into data marts that different users access for their reporting and analysis purposes.


Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
o Entity: an object that can be observed and classified based on its properties and characteristics, such as employee, book, student
o Relationship: relates entities to other entities

Different Perspectives of Data Modeling
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of Dimensional Data Models most commonly used:
o Star Schema
o Snowflake Schema


Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: A category of information. For example, the time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
- Fact Table: A table that contains the measures of interest.
- Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Fact Table: sale

orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: product

prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store

storeId  city
c1       nyc
c2       sfo
c3       la

Dimension Table: customer

custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
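
A typical star-schema query joins the fact table to its dimension tables and aggregates a measure. The sketch below loads the sample rows into an in-memory SQLite database through Python's built-in `sqlite3` module; the column types and the query itself are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER,
                   prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale VALUES
    ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
    ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
    ('105', '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Total sale amount per store city and product name:
# the fact table is joined to two of its dimension tables.
amount_by_city_and_product = conn.execute("""
    SELECT s.city, p.name, SUM(f.amt)
    FROM sale AS f
    JOIN store AS s ON s.storeId = f.storeId
    JOIN product AS p ON p.prodId = f.prodId
    GROUP BY s.city, p.name
    ORDER BY s.city, p.name
""").fetchall()
```

Every join goes directly from the fact table to a dimension table, which is the single-hop shape that gives the star schema its name.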


Snowflake Schema
store

storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType

tId  size   location
t1   small  downtown
t2   large  suburbs

city

cityId  pop  regId
sfo     1M   north
la      5M   south

region

regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.


Overview of Data Cleansing


The Need For Data Quality


- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with
  - error detection
  - error rework
  - customer service
  - fixing customer problems


Understand Information Flow In Organization
- Identify authoritative data sources
- Interview employees & customers

Identify Potential Problem Areas & Assess Impact
- Data entry points
- Cost of bad data

Measure Quality Of Data
- Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
- Use data cleansing tools to clean data at the source
- Load only clean data into the data warehouse

Continuous Monitoring
- Schedule periodic cleansing of source data

Identify Areas of Improvement
- Identify & correct cause of defects
- Refine data capture mechanisms at source
- Educate users on importance of DQ

Customized Programs
- Strengths:
  - Addresses specific needs
  - No bulky one-time investment
- Limitations:
  - Tons of custom programs in different environments are difficult to manage
  - Minor alterations demand coding efforts

Data Quality Assessment tools
- Strength: Provide automated assessment
- Limitation: No measure of data accuracy


Business Rule Discovery tools
- Strengths:
  - Detect correlation in data values
  - Can detect patterns of behavior that indicate fraud
- Limitations:
  - Not all variables can be discovered
  - Some discovered rules might not be pertinent
  - There may be performance problems with large files or with many fields

Data Reengineering & Cleansing tools
- Strengths:
  - Usually are integrated packages with cleansing features as add-on
- Limitations:
  - Error prevention at source is usually absent
  - The ETL tools have limited cleansing facilities


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data loading:
- Initial and incremental loading
- Updating of metadata
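
The stages above can be strung together in a few lines. This is a deliberately small sketch: the record layout, the status code map and the function names `extract`, `transform`, `load` and `run_etl` are hypothetical, and a real ETL tool would add cleanup, error handling and scheduling.

```python
def extract(records, predicate):
    """Select the qualified records from the source."""
    return [r for r in records if predicate(r)]


def transform(record, code_map, load_date):
    """Change codes, calculate a derived value and add a time attribute."""
    return {
        "id": record["id"],
        "status": code_map[record["status"]],
        "amount": record["qty"] * record["price"],
        "load_date": load_date,
    }


def load(target, rows, metadata):
    """Append the rows to the target and update the load metadata."""
    target.extend(rows)
    metadata["rows_loaded"] = metadata.get("rows_loaded", 0) + len(rows)


def run_etl(source, target, metadata, code_map, load_date):
    selected = extract(source, lambda r: r["active"])
    load(target, [transform(r, code_map, load_date) for r in selected], metadata)


source = [
    {"id": 1, "status": "A", "qty": 2, "price": 10, "active": True},
    {"id": 2, "status": "C", "qty": 1, "price": 5, "active": False},
]
target, metadata = [], {}
run_etl(source, target, metadata, {"A": "ACTIVE", "C": "CLOSED"}, "2009-01-01")
```

Keeping each stage as its own function makes it possible to test extraction criteria, transformation rules and load bookkeeping separately.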


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
- Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse

213

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Components of Warehouse
Source Tables : These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End - user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

214

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

215 222

2009 Wipro Ltd - Confidential 2222 Wipro Ltd - Confidential

216 222

2009 Wipro Ltd - Confidential Wipro 2222 Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

217 222

2009 Wipro Ltd - Confidential Wipro 2222 Ltd - Confidential

Data Modeling

Effective way of using a Data Warehouse

218

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model

Entity : Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship : relating entities to other entities.

Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model o

Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema

219

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
o Dimension: A category of information. For example, the Time dimension.
o Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
o Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year -> Quarter -> Month -> Day.
o Fact Table: A table that contains the measures of interest.
o Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
o Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.


Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
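A star schema is queried by joining the fact table to the dimension tables it references. The sketch below is only an illustration: it loads the slide's sample rows into an in-memory SQLite database (the choice of SQLite and the column types are assumptions, not part of the original material) and totals the sale amount by store city and product name.

```python
import sqlite3

# In-memory database holding the star schema from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Fact table joined to two dimensions, aggregated by their attributes.
rows = conn.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale s
    JOIN store st  ON s.storeId = st.storeId
    JOIN product p ON s.prodId = p.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()

for city, product, total in rows:
    print(city, product, total)
```

Each dimension is one join away from the fact table, which is what keeps star-schema queries short.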


Snowflake Schema
Tables (the slide labels them as dimension tables and a fact table):

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized, and are frequently designed at a level of normalization short of third normal form.
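In a snowflake schema a dimension is split into further lookup tables, so reaching a higher level of the hierarchy takes a chain of joins. The following sketch, again using an in-memory SQLite database as an assumed stand-in for the warehouse, walks store -> city -> region with the slide's rows and counts stores per region.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT,
                     regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY,
                     cityId TEXT REFERENCES city(cityId),
                     tId TEXT REFERENCES sType(tId),
                     mgr TEXT);
""")
db.executemany("INSERT INTO region VALUES (?, ?)",
               [("north", "cold region"), ("south", "warm region")])
db.executemany("INSERT INTO city VALUES (?, ?, ?)",
               [("sfo", "1M", "north"), ("la", "5M", "south")])
db.executemany("INSERT INTO sType VALUES (?, ?, ?)",
               [("t1", "small", "downtown"), ("t2", "large", "suburbs")])
db.executemany("INSERT INTO store VALUES (?, ?, ?, ?)",
               [("s5", "sfo", "t1", "joe"),
                ("s7", "sfo", "t2", "fred"),
                ("s9", "la", "t1", "nancy")])

# Two joins are needed to reach the region level from a store.
stores_per_region = db.execute("""
    SELECT r.name, COUNT(*)
    FROM store s
    JOIN city c   ON s.cityId = c.cityId
    JOIN region r ON c.regId = r.regId
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()

print(stores_per_region)
```

The extra joins are the price of the additional normalization compared with a star schema.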


Overview of Data Cleansing


The Need For Data Quality


o Difficulty in decision making
o Time delays in operation
o Organizational mistrust
o Data ownership conflicts
o Customer attrition
o Costs associated with
  - error detection
  - error rework
  - customer service
  - fixing customer problems

Understand Information Flow In Organization

o Identify authoritative data sources
o Interview Employees & Customers
o Data Entry Points
o Cost of bad data

Identify Potential Problem Areas & Assess Impact

Measure Quality Of Data
o Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
o Use data cleansing tools to clean data at the source
o Load only clean data into the data warehouse

Continuous Monitoring
o Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement
o Identify & Correct Cause of Defects
o Refine data capture mechanisms at source
o Educate users on importance of DQ
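The "measure quality" and "load only clean data" steps can be sketched with a small profiling routine. Everything below is hypothetical: the records, the `profile` helper and the age range are made up for illustration and do not represent any of the tools named in this deck.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"custId": 53, "name": "joe", "city": "sfo", "age": 34},
    {"custId": 81, "name": "fred", "city": None, "age": 41},
    {"custId": 53, "name": "joe", "city": "sfo", "age": 34},
    {"custId": 111, "name": "sally", "city": "la", "age": 212},
]


def profile(rows, key, ranges):
    """Report duplicate keys, rows with missing fields and out-of-range values."""
    counts = Counter(r[key] for r in rows)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    missing = [r[key] for r in rows if any(v is None for v in r.values())]
    out_of_range = [
        r[key]
        for r in rows
        for field, (lo, hi) in ranges.items()
        if r[field] is not None and not lo <= r[field] <= hi
    ]
    return {"duplicates": duplicates, "missing": missing,
            "out_of_range": out_of_range}


report = profile(records, "custId", {"age": (0, 120)})

# Load only clean data: skip flagged rows and keep one copy of each key.
bad = set(report["missing"]) | set(report["out_of_range"])
seen = set()
clean = []
for r in records:
    if r["custId"] not in bad and r["custId"] not in seen:
        seen.add(r["custId"])
        clean.append(r)

print(report)
print(clean)
```

In practice the flagged rows would be routed back to the source owners rather than silently dropped, in line with the "clean data at the source" step.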

Customized Programs
o Strengths
  - Address specific needs
  - No bulky one-time investment
o Limitations
  - Tons of custom programs in different environments are difficult to manage
  - Minor alterations demand coding effort

Data Quality Assessment tools
o Strength
  - Provide automated assessment
o Limitation
  - No measure of data accuracy

Business Rule Discovery tools
o Strengths
  - Detect correlation in data values
  - Can detect patterns of behavior that indicate fraud
o Limitations
  - Not all variables can be discovered
  - Some discovered rules might not be pertinent
  - There may be performance problems with large files or with many fields

Data Reengineering & Cleansing tools
o Strengths
  - Usually are integrated packages with cleansing features as add-on
o Limitations
  - Error prevention at source is usually absent
  - The ETL tools have limited cleansing facilities

Business Rule Discovery Tools
o Integrity Data Reengineering Tool from Vality Technology
o Trillium Software System from Harte-Hanks Data Technologies
o Migration Architect from DB Star

Data Reengineering & Cleansing Tools
o Carlton Pureview from Oracle
o ETI-Extract from Evolutionary Technologies
o PowerMart from Informatica Corp
o Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
o Migration Architect, Evoke Axio from Evoke Software
o Wizrule from Wizsoft

Name & Address Cleansing Tools
o Centrus Suite from Sagent
o I.d.centric from First Logic

Data Extraction, Transformation, Load


ETL Architecture


Data Extraction
o Rummages through a file or database
o Uses some criteria for selection
o Identifies qualified data
o Transports the data over onto another file or database

Data Transformation
o Integrating dissimilar data types
o Changing codes
o Adding a time attribute
o Summarizing data
o Calculating derived values
o Renormalizing data

Data Extraction Cleanup
o Restructuring of records or fields
o Removal of operational-only data
o Supply of missing field values
o Data integrity checks
o Data consistency and range checks, etc.

Data Loading
o Initial and incremental loading
o Updating of metadata
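The four stages above can be sketched as plain functions chained together. This is a minimal illustration under assumed data: the source rows, the gender code mapping and the function names are hypothetical, and a real ETL tool would read from and write to actual databases.

```python
from datetime import date

# Hypothetical source rows; gender codes differ between source systems.
source = [
    {"id": 1, "gender": "M", "amount": 120.0, "status": "posted"},
    {"id": 2, "gender": "1", "amount": None, "status": "posted"},
    {"id": 3, "gender": "F", "amount": 80.0, "status": "draft"},
    {"id": 4, "gender": "0", "amount": 45.5, "status": "posted"},
]
GENDER_CODES = {"M": "male", "1": "male", "F": "female", "0": "female"}


def extract(rows):
    """Select qualifying rows by a criterion (here: posted transactions only)."""
    return [dict(r) for r in rows if r["status"] == "posted"]


def cleanup(rows, default_amount=0.0):
    """Remove the operational-only field and supply missing field values."""
    cleaned = []
    for r in rows:
        r = {k: v for k, v in r.items() if k != "status"}
        if r["amount"] is None:
            r["amount"] = default_amount
        cleaned.append(r)
    return cleaned


def transform(rows, load_date):
    """Change codes to one convention and add a time attribute."""
    return [{**r, "gender": GENDER_CODES[r["gender"]], "load_date": load_date}
            for r in rows]


def load(rows, target):
    """Append rows to the target and return the row count for the metadata."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(cleanup(extract(source)), date(2009, 1, 1)), warehouse)

# Summarizing data: total amount per unified gender code.
total_by_gender = {}
for r in warehouse:
    total_by_gender[r["gender"]] = total_by_gender.get(r["gender"], 0.0) + r["amount"]

print(loaded, total_by_gender)
```

Calling `load` again with a later batch appends to the same target, which corresponds to incremental loading after the initial load.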


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.

Major components involved in ETL Processing


o Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
o Meta data management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
o Extract: The process of reading data from a database.
o Transform: The process of converting the extracted data.
o Load: The process of writing the data into the target database.
o Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
o Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
o Provide a facility to specify a large number of transformation rules with a GUI
o Generate programs to transform data
o Handle multiple data sources
o Handle data redundancy
o Generate metadata as output
o Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
o PowerCentre/Mart from Informatica
o Data Mart Solution from Sagent Technology
o DataStage from Ascential

Metadata Management


What Is Metadata?

Metadata is information:
o That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
o About the data being captured and loaded into the warehouse
o Documented in IT tools that improve both business and technical understanding of data and data-related processes

Importance Of Metadata
Locating information
o How much time is spent looking for information?
o How often is the information found?
o What poor decisions were made based on incomplete information?
o How much money was lost or earned as a result?

Interpreting information
o How many times have businesses needed to rework or recall products?
o What impact does it have on the bottom line?
o How many mistakes were due to misinterpretation of existing documentation?
o How much interpretation results from too much metadata?
o How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
o How do various data perspectives connect together?
o How much time is spent trying to figure that out?
o How much does the inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management


o Provide a simple catalogue of business metadata descriptions and views
o Document/manage metadata descriptions from an integrated development environment
o Enable DW users to identify and invoke pre-built queries against the data stores
o Design and enhance new data models and schemas for the data warehouse
o Capture data transformation rules between the operational and data warehousing databases
o Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
o Warehouse administrator
o Application developer

Business Users - Business metadata
o Meanings
o Definitions
o Business Rules

Software Tools
o Used in DW life-cycle development
o Metadata requirements for each tool must be identified
o The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
o Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools


Third Party Bridging Tools

Oracle Exchange
o Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik - Toolbus
o Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
o Hub and Spoke solution for enabling metadata interoperability
o Ardent focussing on own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
o Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
o IBM, Oracle and Microsoft to offer free or near-free basic repository services
o Enable organisations to reuse metadata across technologies
o Integrate DB design, data transformation and BI tools from different vendors
o Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
o Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them

Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
o Most frequently used interchange standard
o Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
o XML: addresses context and data meaning, not presentation
o Can enable exchange over the web employing industry standards for storing and sharing programming data
o Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
o Based on XML/UML standards
o Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle, Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft


OLAP


Agenda
o OLAP Definition
o Distinction between OLTP and OLAP
o MDDB Concepts
o Implementation Techniques
o Architectures
o Features
o Representative Tools

09/02/2012


OLAP: On-Line Analytical Processing


o OLAP can be defined as a technology which allows the users to view the aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
o Used interchangeably with BI
o Multidimensional view of data is the foundation of OLAP
o Users: Analysts, Decision makers

Distinction between OLTP and OLAP


Source of data
o OLTP System: Operational data; OLTPs are the original source of the data
o OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
o OLTP System: To control and run fundamental business tasks
o OLAP System: Decision support

What the data reveals
o OLTP System: A snapshot of ongoing business processes
o OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
o OLTP System: Short and fast inserts and updates initiated by end users
o OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions).

A hypercube represents a collection of multidimensional data.
o The edges of the cube are called dimensions.
o Individual items within each dimension are called members.

RDBMS v/s MDDB:


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

27 x 4 = 108 cells

Increased Complexity...

MDDB

Sales Volumes cube with three dimensions:
o MODEL: Mini Van, Coupe, Sedan
o COLOR: Blue, Red, White
o DEALERSHIP: Carr, Gleason, Clyde

3 x 3 x 3 = 27 cells
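The cell counts on this slide follow from how each form stores a fact. As a rough sketch, the nine Mini Van rows of the relational table can be turned into a cube keyed by dimension members; the dictionary-based cube is an illustrative assumption, not how an MDDB product stores data internally.

```python
# Mini Van rows from the relational table: (MODEL, COLOR, DEALER, VOL.)
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3),
    ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3),
    ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]

# In the cube, the dimension members are coordinates, not stored columns.
cube = {(model, color, dealer): vol for model, color, dealer, vol in rows}

models = sorted({r[0] for r in rows})
colors = sorted({r[1] for r in rows})
dealers = sorted({r[2] for r in rows})

relational_cells = len(rows) * 4  # every row repeats all four columns
cube_cells = len(models) * len(colors) * len(dealers)

print(relational_cells, cube_cells)
print(cube[("MINI VAN", "RED", "Clyde")])
```

With all three models present, the same arithmetic gives the slide's figures of 27 x 4 = 108 relational cells against 3 x 3 x 3 = 27 cube cells.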


Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
o A great deal of information is gleaned immediately upon direct inspection of the array
o The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table

Storage Space
o Very low space consumption compared to a relational DB

Performance
o Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries

Ease of Maintenance
o No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to Sparsity
- Due to Summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, a blank cell is left behind.

Relational table:

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

Stored as an array of LAST NAME by EMPLOYEE #, only the 9 cells where a last name meets its own employee number hold an Employee Age; every other cell of the 9 x 9 array is blank.

For comparison, the Sales Volumes array (MODEL by COLOR) is fully populated:

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2
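How sparse the employee array is can be worked out directly from the slide's table. The snippet below is a small sketch of that arithmetic; the dictionary-of-cells representation is an assumption made for illustration.

```python
# (LAST NAME, EMP#, AGE) rows from the slide.
employees = [
    ("Smith", "01", 21), ("Regan", "12", 19), ("Fox", "31", 63),
    ("Weld", "14", 31), ("Kelly", "54", 27), ("Link", "03", 56),
    ("Kranz", "41", 45), ("Lucas", "33", 41), ("Weiss", "23", 19),
]

# Two dimensions: last name and employee number. Only the cells where a
# name meets its own employee number carry a value (the age).
names = [name for name, _, _ in employees]
numbers = [emp for _, emp, _ in employees]
array = {(name, emp): age for name, emp, age in employees}

total_cells = len(names) * len(numbers)
filled_cells = len(array)
density = filled_cells / total_cells

print(f"{filled_cells} of {total_cells} cells filled ({density:.1%})")
```

Adding a third dimension that is also one-to-one with the employee would multiply the number of cells again while the number of filled cells stayed the same, which is why sparsity increases with the number of dimensions.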


OLAP Features
o Calculations applied across dimensions, through hierarchies and/or across members
o Trend analysis over sequential time periods
o What-if scenarios
o Slicing / Dicing subsets for on-screen viewing
o Rotation to new dimensional comparisons in the viewing area
o Drill-down/up along the hierarchy
o Reach-through / Drill-through to underlying detail data

Features of OLAP - Rotation


Complex queries and sorts in a relational environment are translated to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

(ROTATE 90 degrees)

View #2: Sales Volumes, COLOR by MODEL

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

2 dimensional array has 2 views.



Features of OLAP - Rotation


[Figure: the Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown as View #1 to View #6. Each view is obtained from the previous one by a 90 degree rotation that changes which dimensions lie along the visible edges.]

3 dimensional array has 6 views.
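One way to read the view counts is that a view corresponds to an ordering of the dimensions along the axes, which gives 2 orderings for two dimensions and 6 for three. That reading is an interpretation, not something the slides state. The sketch below rotates the two-dimensional Sales Volumes array by transposing it and enumerates the orderings.

```python
from itertools import permutations

# View #1: rows are MODEL (Mini Van, Coupe, Sedan), columns are COLOR
# (Blue, Red, White).
view1 = [
    [6, 5, 4],
    [3, 5, 5],
    [4, 3, 2],
]

# Rotating a 2-dimensional array swaps its axes: rows become COLOR and
# columns become MODEL.
view2 = [list(column) for column in zip(*view1)]

# Each ordering of the dimensions along the axes is counted as one view.
views_2d = list(permutations(("MODEL", "COLOR")))
views_3d = list(permutations(("MODEL", "COLOR", "DEALERSHIP")))

print(view2)
print(len(views_2d), len(views_3d))
```

No data is re-queried or re-sorted in the rotation; only the assignment of dimensions to axes changes.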



Features of OLAP - Slicing / Filtering


An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to a sub-cube with MODEL members Mini Van and Coupe, COLOR members Normal Blue and Metal Blue, and DEALERSHIP members Carr and Clyde.]
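Slicing amounts to keeping only the cells whose members fall inside the chosen subset on each dimension. In the sketch below the member names come from the slide, while the volumes and the `slice_cube` helper are hypothetical, since the figure shows labels only.

```python
DIMS = ("model", "color", "dealership")

# Hypothetical volumes keyed by (model, color, dealership).
cube = {
    ("Mini Van", "Normal Blue", "Carr"): 2,
    ("Mini Van", "Metal Blue", "Clyde"): 4,
    ("Coupe", "Normal Blue", "Clyde"): 3,
    ("Coupe", "Red", "Carr"): 6,
    ("Sedan", "Metal Blue", "Gleason"): 5,
    ("Sedan", "White", "Clyde"): 2,
}


def slice_cube(cells, **members):
    """Keep cells whose member on each named dimension is in the allowed set."""
    index = {name: DIMS.index(name) for name in members}
    return {
        coord: value
        for coord, value in cells.items()
        if all(coord[index[name]] in allowed
               for name, allowed in members.items())
    }


view = slice_cube(
    cube,
    model={"Mini Van", "Coupe"},
    color={"Normal Blue", "Metal Blue"},
    dealership={"Carr", "Clyde"},
)

print(view)
```

Restricting a single dimension to one member gives a plain slice; restricting several dimensions at once, as here, is usually called dicing.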

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
REGION: Midwest
  DISTRICT: Chicago
    DEALERSHIP: Clyde, Gleason
  DISTRICT: St. Louis
    DEALERSHIP: Carr, Levi
  DISTRICT: Gary
    DEALERSHIP: Lucas, Bolton

Sales can be viewed at Region / District / Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
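Roll-up and drill-down over the organization hierarchy can be sketched with two small helpers. The hierarchy follows the slide; the sales figures and the helper names `roll_up` and `drill_down` are hypothetical.

```python
# Hierarchy from the slide: dealership -> district -> region.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales volumes at the lowest (dealership) level.
sales = {"Clyde": 6, "Gleason": 3, "Carr": 2, "Levi": 4, "Lucas": 5, "Bolton": 1}


def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = {}
    for member, value in values.items():
        parent = parent_of[member]
        totals[parent] = totals.get(parent, 0) + value
    return totals


def drill_down(parent, values, parent_of):
    """Show the child-level values that make up one parent member."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


by_district = roll_up(sales, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
chicago_detail = drill_down("Chicago", sales, DISTRICT_OF)

print(by_district)
print(by_region)
print(chicago_detail)
```

The same two operations apply to any hierarchy, for example Year -> Quarter -> Month in the time dimension.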


OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M), scale 0 to 200, by year (Year 1999, Year 2000) for the East, West and Central regions.]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M), scale 0 to 90, by quarter (1st Qtr to 4th Qtr) of Year 1999 for the East, West and Central regions.]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M), scale 0 to 20, by month (January, February, March) of Year 1999 for the East, West and Central regions.]

Drill-down from Quarter to Month

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
o Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
o Accesses data stored in a relational data warehouse for OLAP analysis
o Database and application logic provided as separate layers

HOLAP - Hybrid OLAP
o OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

DOLAP - Desktop OLAP
o Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Figure: MOLAP architecture. Components shown: OLAP Cube, OLAP Calculation Engine, Web Browser, OLAP Tools, OLAP Applications.]

MOLAP - Features

o Powerful analytical capabilities (e.g., financial, forecasting, statistical)
o Aggregation and calculation capabilities
o Read/write analytic applications
o Specialized data structures for
  - Maximum query performance
  - Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture. Components shown: Relational DW, MDDB - Relational Mapping, OLAP Calculation Engine (accessing the DW through SQL), Web Browser, OLAP Tools, OLAP Applications.]

ROLAP - Features
o Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier and database servers
o Ad hoc query capabilities to very large databases
o DW integration
o Data scalability

HOLAP - Combination of RDBMS and MDDB


[Figure: HOLAP architecture. Components shown: OLAP Cube, Relational DW (accessed through SQL), OLAP Calculation Engine, and clients of any kind: Web Browser, OLAP Tools, OLAP Applications.]

HOLAP - Features

o RDBMS used for detailed data stored in large databases
o MDDB used for fast, read/write OLAP analysis and calculations
o Scalability of RDBMS and MDDB performance
o Calculation engine provides full analysis features
o Source of data transparent to end user

Architecture Comparison

Definition
o MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
o ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
o HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to Sparsity
o MOLAP: Good design: 3 - 10 times
o ROLAP: No sparsity
o HOLAP: Sparsity exists only in MDDB part

Data explosion due to Summarization
o MOLAP: High (may go beyond control; estimation is very important)
o ROLAP: To the necessary extent
o HOLAP: To the necessary extent

Query Execution Speed
o MOLAP: Fast (depends upon the size of the MDDB)
o ROLAP: Slow
o HOLAP: Optimum. If the data is fetched from the RDBMS then it is like ROLAP, otherwise like MOLAP

Cost
o MOLAP: Medium: MDDB server + large disk space cost
o ROLAP: Low: only RDBMS + disk space cost
o HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
o MOLAP: Small transactional data + complex model + frequent summary analysis
o ROLAP: Very large transactional data that needs to be viewed / sorted
o HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

o Oracle Express Products
o Hyperion Essbase
o Cognos - PowerPlay
o Seagate - Holos
o SAS
o Micro Strategy - DSS Agent
o Informix MetaCube
o Brio Query
o Business Objects / Web Intelligence

Sample OLAP Applications

o Sales Analysis
o Financial Analysis
o Profitability Analysis
o Performance Analysis
o Risk Management
o Profiling & Segmentation
o Scorecard Application
o NPA Management
o Strategic Planning
o Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


o There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business costs of using incorrect data to make critical business decisions.
o The methodology required for testing a data warehouse is different from testing a typical transaction system.

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
o User-Triggered vs. System-Triggered
o Volume of Test Data
o Possible Scenarios / Test Cases
o Programming for testing challenge

Difference In Testing Data warehouse and Transaction System.


User-Triggered vs. System-Triggered
o In a data warehouse, most of the testing is system triggered.
o Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request).
o In such systems very few test cycles cover the system-triggered scenarios (like billing, valuation).

Difference In Testing Data warehouse and Transaction System


Volume of Test Data
o The test data in a transaction system is a very small sample of the overall production data.
o A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.

Possible Scenarios / Test Cases
o In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.

Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
o In transaction systems, users/business analysts typically test the output of the system.
o In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts.
o These scripts compare the data before transformation with the data after transformation.
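A stand-alone comparison script of the kind described here can be quite small. The sketch below assumes hypothetical source and target rows and a hypothetical transformation rule (parse the amount, upper-case the country code); it applies the rule independently of the ETL job and reports any difference.

```python
# Hypothetical pre-transformation (source) and post-transformation (target) rows.
source_rows = [
    {"id": 1, "amount": "120.50", "country": "us"},
    {"id": 2, "amount": "80.00", "country": "in"},
    {"id": 3, "amount": "45.25", "country": "us"},
]
target_rows = [
    {"id": 1, "amount": 120.50, "country": "US"},
    {"id": 2, "amount": 80.00, "country": "IN"},
    {"id": 3, "amount": 45.25, "country": "US"},
]


def expected(row):
    """Apply the documented transformation rule independently of the ETL job."""
    return {"id": row["id"],
            "amount": float(row["amount"]),
            "country": row["country"].upper()}


def compare(source, target):
    """Return (id, problem) pairs for every difference found."""
    target_by_id = {r["id"]: r for r in target}
    issues = []
    for src in source:
        got = target_by_id.get(src["id"])
        if got is None:
            issues.append((src["id"], "missing in target"))
        elif got != expected(src):
            issues.append((src["id"], "mismatch"))
    extra = set(target_by_id) - {s["id"] for s in source}
    issues.extend((i, "unexpected in target") for i in sorted(extra))
    return issues


issues = compare(source_rows, target_rows)

# Simulate a load that lost the last row.
issues_after_loss = compare(source_rows, target_rows[:2])

print(issues)
print(issues_after_loss)
```

Re-implementing the rule inside the test, rather than reusing the ETL code, is what lets the script catch a wrong transformation as well as a lost row.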


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
o 'Back-end' testing, where the source systems data is compared to the end-result data in the loaded area
o 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP

Testing phases consist of:
o Requirements testing
o Unit testing
o Integration testing
o Performance testing
o Acceptance testing

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
o Are the requirements Complete?
o Are the requirements Singular?
o Are the requirements Ambiguous?
o Are the requirements Developable?
o Are the requirements Testable?

Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
o Whether ETLs are accessing and picking up the right data from the right source.
o All the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
o Testing the rejected records that don't fulfil transformation rules.

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
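As a rough illustration of unit testing an ETL transformation together with its rejected records, the sketch below uses a made-up `transform` function. The two rules it enforces (non-empty product code, positive integer quantity) are assumptions, not rules from the deck.

```python
def transform(rows):
    """Split rows into (loaded, rejected) under two assumed business rules:
    the product code must be non-empty and the quantity a positive integer."""
    loaded, rejected = [], []
    for row in rows:
        prod_id = (row.get("prod_id") or "").strip().upper()
        qty = row.get("qty")
        if prod_id and isinstance(qty, int) and qty > 0:
            loaded.append({"prod_id": prod_id, "qty": qty})
        else:
            rejected.append(row)  # kept unchanged so the reject can be inspected
    return loaded, rejected


sample = [
    {"prod_id": " p1 ", "qty": 2},   # valid after trimming
    {"prod_id": "", "qty": 1},       # rejected: empty product code
    {"prod_id": "p2", "qty": 0},     # rejected: non-positive quantity
    {"prod_id": "p3", "qty": "5"},   # rejected: quantity is not an integer
]

loaded, rejected = transform(sample)
print(len(loaded), "loaded;", len(rejected), "rejected")
```

A useful invariant to check in such tests is that every input row ends up either loaded or rejected, so that no record is silently dropped.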

Unit Testing
Unit Testing the Report data:

- Verify report data against the source: Data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the source data available.
- Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae and calculation rules should be verified.


Integration Testing
Integration testing involves the following:
- Sequence of ETL jobs in the batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date, to verify the newly inserted or updated data.
- Testing the rejected records that don't fulfil the transformation rules.
- Error log generation.
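The initial-load and incremental-load checks listed above can be sketched as follows. The warehouse is modelled as a plain dictionary keyed by a hypothetical `cust_id`, and `apply_load` is an illustrative helper, not part of any ETL tool.

```python
def apply_load(warehouse, batch):
    """Upsert a batch into the warehouse dict; return (inserted, updated) counts."""
    inserted = updated = 0
    for row in batch:
        key = row["cust_id"]
        if key not in warehouse:
            inserted += 1
        elif warehouse[key] != row:
            updated += 1
        warehouse[key] = row
    return inserted, updated


warehouse = {}
initial = [{"cust_id": 53, "city": "sfo"}, {"cust_id": 81, "city": "sfo"}]
increment = [{"cust_id": 81, "city": "la"}, {"cust_id": 111, "city": "la"}]

initial_counts = apply_load(warehouse, initial)        # initial load
incremental_counts = apply_load(warehouse, increment)  # later incremental load
print(initial_counts, incremental_counts, len(warehouse))
```

An integration test would compare these counts with the expected numbers of new and changed source records for each batch run.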


Performance Testing
Performance testing should check for:
- ETL processes completing within the time window.
- Monitoring and measuring of data quality issues.
- Refresh times for standard and complex reports.


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality, and reporting.


Questions


Thank You



Data Warehouse Definition by W. H. Inmon

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
- Time variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse


Components of Warehouse
- Source tables: Real-time, volatile data in relational databases used for transaction processing (OLTP). Sources can be any relational databases or flat files.
- ETL tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and administration tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
- Modeling tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Get reports and analyze the data from the target tables. Different types of querying, data mining, and OLAP tools are used for this purpose.


Data Warehouse Architecture


Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: after cleansing and transformation, the data is loaded into the staging area and tested there. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.


Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
o Entity: An object that can be observed and classified based on its properties and characteristics, such as employee, book, or student.
o Relationship: Relates entities to other entities.

Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of dimensional data models most commonly used:
o Star Schema
o Snowflake Schema


Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: A category of information. For example, the time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
- Fact table: A table that contains the measures of interest.
- Lookup table: Provides detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate keys: Used to avoid data integrity problems. They are helpful for slowly changing dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
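The star schema above can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the column types and the choice of query (total sale amount by store city and product) are assumptions added for illustration, while the table names, column names, and rows are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
-- The fact table holds the measures (qty, amt) plus a key into each dimension.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Typical star-join: the fact table joined to two dimensions, then aggregated.
rows = conn.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale s
    JOIN store st  ON st.storeId = s.storeId
    JOIN product p ON p.prodId = s.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()
print(rows)
```

Every analytical query follows the same shape: the fact table sits in the middle and each dimension is reached with a single join.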

Snowflake Schema
Dimension tables:

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.

Overview of Data Cleansing


The Need For Data Quality


- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with error detection, error rework, customer service, and fixing customer problems


Understand Information Flow In Organization
- Identify authoritative data sources
- Interview employees and customers
- Data entry points
- Cost of bad data

Identify Potential Problem Areas & Assess Impact

Measure Quality Of Data
- Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
- Use data cleansing tools to clean data at the source
- Load only clean data into the data warehouse

Continuous Monitoring
- Schedule periodic cleansing of source data

Identify Areas of Improvement
- Identify and correct cause of defects
- Refine data capture mechanisms at source
- Educate users on importance of DQ

Customized Programs
Strengths:
- Address specific needs
- No bulky one-time investment
Limitations:
- Large numbers of custom programs in different environments are difficult to manage
- Minor alterations demand coding effort

Data Quality Assessment Tools
Strength:
- Provide automated assessment
Limitation:
- No measure of data accuracy


Business Rule Discovery Tools
Strengths:
- Detect correlation in data values
- Can detect patterns of behavior that indicate fraud
Limitations:
- Not all variables can be discovered
- Some discovered rules might not be pertinent
- There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
Strengths:
- Usually are integrated packages with cleansing features as an add-on
Limitations:
- Error prevention at source is usually absent
- The ETL tools have limited cleansing facilities


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture


ETL Architecture

Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata
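The extract, transform, cleanup, and load steps above can be sketched end to end in a few lines. Everything specific here is an assumption made for illustration: the CSV layout, the `COUNTRY_CODES` mapping, the `UNKNOWN` default for missing values, and the in-memory list standing in for the target table.

```python
import csv
import io

RAW = """cust_id,name,country,amount
53,joe,us,12
81,fred,,11
111,sally,in,abc
"""
COUNTRY_CODES = {"us": "USA", "in": "IND"}  # assumed code mapping


def extract(text):
    """Read qualifying records from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, load_date):
    """Clean and convert rows; rows that fail the type check are rejected."""
    clean, rejected = [], []
    for row in rows:
        try:
            amount = float(row["amount"])          # data consistency check
        except ValueError:
            rejected.append(row)
            continue
        clean.append({
            "cust_id": int(row["cust_id"]),        # integrate data types
            "name": row["name"].title(),
            # change codes, and supply a default for a missing value
            "country": COUNTRY_CODES.get(row["country"], "UNKNOWN"),
            "amount": amount,
            "load_date": load_date,                # add a time attribute
        })
    return clean, rejected


def load(target, rows):
    """Append the rows to the target table and report how many were written."""
    target.extend(rows)
    return len(rows)


warehouse_table = []
clean_rows, rejected_rows = transform(extract(RAW), load_date="2009-01-01")
loaded_count = load(warehouse_table, clean_rows)
print(loaded_count, "loaded;", len(rejected_rows), "rejected")
```

Keeping the three stages as separate functions makes each one testable on its own, which matches the unit-testing approach described earlier in the deck.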


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Major components involved in ETL Processing


Major components involved in ETL Processing

- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Metadata management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: The process of reading data from a database.
- Transform: The process of converting the extracted data.
- Load: The process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run, and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules through a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
- PowerCentre/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is information found?
- What poor decisions were made based on the incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document and manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical users
- Warehouse administrator
- Application developer

Business users (business metadata)
- Meanings
- Definitions
- Business rules

Software tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third Party Bridging Tools
- Oracle Exchange: Technology of choice for a long list of repository, enterprise and workgroup vendors
- Reischmann-Informatik-Toolbus: Features include facilitation of selective bridging of metadata
- Ardent Software / Dovetail Software - Interplay: Hub-and-spoke solution for enabling metadata interoperability. Ardent is focusing on its own engagements, not selling it as an independent product
- Informix's Metadata Plug-ins: Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated, rather than integrated, approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle, Carleton Group, CA-PLATINUM Technology (founding member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

OLAP: On-Line Analytical Processing


OLAP can be defined as a technology that allows users to view aggregate data across measurements (such as Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (such as Product, Organization, Customer, etc.).
- Used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers

Distinction between OLTP and OLAP


Source of data
  OLTP system: Operational data; OLTPs are the original source of the data
  OLAP system: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
  OLTP system: To control and run fundamental business tasks
  OLAP system: Decision support

What the data reveals
  OLTP system: A snapshot of ongoing business processes
  OLAP system: Multi-dimensional views of various kinds of business activities

Inserts and updates
  OLTP system: Short and fast inserts and updates initiated by end users
  OLAP system: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.

RDBMS v/s MDDB:


Relational DBMS (increased complexity): one row per combination, four columns per row.

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

27 x 4 = 108 cells

MDDB: [Figure: Sales Volumes cube with axes MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde)]

3 x 3 x 3 = 27 cells
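The cell counts quoted above (108 for the relational table, 27 for the cube) can be checked with a small sketch. Nested Python lists stand in for the multidimensional array here, which is an assumption for illustration; a real MDDB uses its own storage structures. Only the first few rows of the relational table are loaded.

```python
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]
dealers = ["Clyde", "Gleason", "Carr"]

# Relational form: every row repeats the model, color and dealer values.
rows = [
    ("Mini Van", "Blue", "Clyde", 6),
    ("Mini Van", "Blue", "Gleason", 3),
    ("Mini Van", "Blue", "Carr", 2),
    ("Mini Van", "Red", "Clyde", 5),
]

# Multidimensional form: the position in the array encodes the three members,
# so only the sales volume itself is stored in each cell.
cube = [[[None for _ in dealers] for _ in colors] for _ in models]
for model, color, dealer, volume in rows:
    cube[models.index(model)][colors.index(color)][dealers.index(dealer)] = volume

combinations = len(models) * len(colors) * len(dealers)
relational_cells = combinations * 4   # 4 columns per row
cube_cells = combinations             # 1 value per cell

print(relational_cells, cube_cells)
print(cube[0][0][0])  # Mini Van, Blue, Clyde
```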

Benefits of MDDB over RDBMS


- Ease of data presentation and navigation: A great deal of information is gleaned immediately upon direct inspection of the array. The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than that offered by the relational table.
- Storage space: Very low space consumption compared to a relational DB.
- Performance: Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.
- Ease of maintenance: No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, then a blank cell is left behind.

Employee Age (sparse when arranged as a LAST NAME x EMPLOYEE # array):

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

Sales Volumes (dense MODEL x COLOR array, shown for contrast):

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2
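The sparsity of the employee example can be quantified with a short sketch. The definition used here (fraction of empty cells in the LAST NAME x EMPLOYEE # grid) is an assumption chosen for illustration; the slide itself gives no formula.

```python
# Rows from the Employee Age table: (last name, employee number, age).
employees = [
    ("Smith", "01", 21), ("Regan", "12", 19), ("Fox", "31", 63),
    ("Weld", "14", 31), ("Kelly", "54", 27), ("Link", "03", 56),
    ("Kranz", "41", 45), ("Lucas", "33", 41), ("Weiss", "23", 19),
]

names = [name for name, _, _ in employees]
numbers = [number for _, number, _ in employees]

# A two-dimensional array needs one cell for every (name, number) pair,
# but each employee occupies exactly one of them.
grid = {(name, number): None for name in names for number in numbers}
for name, number, age in employees:
    grid[(name, number)] = age

total_cells = len(grid)
filled_cells = sum(value is not None for value in grid.values())
sparsity = 1 - filled_cells / total_cells

print(filled_cells, "of", total_cells, "cells used; sparsity", round(sparsity, 3))
```

Because last name and employee number identify the same thing, the two dimensions do not interact, and the array is almost entirely empty. A relational table with nine rows is the better fit for this data.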

OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation


Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes (rows: MODEL, columns: COLOR)

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

(rotate 90 degrees)

View #2: Sales Volumes (rows: COLOR, columns: MODEL)

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation


[Figure: six rotations of the Sales Volumes cube (MODEL x COLOR x DEALERSHIP). Each 90-degree rotation exchanges which dimensions lie along which edges, giving View #1 through View #6.]

A 3-dimensional array has 6 views.
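The view counts follow from the number of ways the dimensions can be ordered: 2! = 2 for two dimensions and 3! = 6 for three. The sketch below checks this with `itertools.permutations` and also performs the two-dimensional rotation as a transpose. The sales figures are the MODEL x COLOR values from the 2-dimensional rotation slide; treating a rotation as a reordering of axes is a simplifying assumption.

```python
from itertools import permutations

# Rows are MODEL (Mini Van, Coupe, Sedan); columns are COLOR (Blue, Red, White).
sales = [
    [6, 5, 4],
    [3, 5, 5],
    [4, 3, 2],
]

# Rotating a 2-dimensional view swaps rows and columns (a transpose).
rotated = [list(column) for column in zip(*sales)]

# Each distinct ordering of the dimensions is one possible view.
views_2d = list(permutations(["MODEL", "COLOR"]))
views_3d = list(permutations(["MODEL", "COLOR", "DEALERSHIP"]))

print(rotated)
print(len(views_2d), len(views_3d))
```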

Features of OLAP - Slicing / Filtering


MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: Sales Volumes cube sliced to MODEL = Mini Van, Coupe; COLOR = Normal Blue, Metal Blue; DEALERSHIP = Carr, Clyde]
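Slicing can be sketched as a filter over the cells of the cube. The dictionary-based cube, the `slice_cube` helper, and all sales volumes below are hypothetical; only the member names come from the slide.

```python
# Cells are keyed by (model, color, dealership); the volumes are made up.
cube = {
    ("Mini Van", "Normal Blue", "Carr"): 2,
    ("Mini Van", "Metal Blue", "Clyde"): 4,
    ("Coupe", "Normal Blue", "Clyde"): 3,
    ("Coupe", "Red", "Carr"): 6,
    ("Sedan", "Metal Blue", "Gleason"): 1,
}


def slice_cube(cube, models=None, colors=None, dealers=None):
    """Keep only the cells whose members fall in the requested sets.
    A dimension left as None is not filtered."""
    wanted = (models, colors, dealers)
    return {
        key: volume
        for key, volume in cube.items()
        if all(allowed is None or member in allowed
               for member, allowed in zip(key, wanted))
    }


blue_slice = slice_cube(
    cube,
    models={"Mini Van", "Coupe"},
    colors={"Normal Blue", "Metal Blue"},
    dealers={"Carr", "Clyde"},
)
print(blue_slice)
```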

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION   DISTRICT   DEALERSHIP
Midwest  Chicago    Clyde, Gleason
         St. Louis  Carr, Levi
         Gary       Lucas, Bolton

Sales can be viewed at the region, district, or dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
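Roll-up along the organization hierarchy can be sketched as repeated aggregation to the parent level. The dealership-to-district assignment below is read off the flattened slide and may not match the original figure exactly, and the sales numbers are hypothetical.

```python
# Assumed hierarchy: dealership -> district -> region.
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales at the lowest (dealership) level.
dealership_sales = {
    "Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 3, "Lucas": 4, "Bolton": 6,
}


def roll_up(values, parent_of):
    """Aggregate each child's value into its parent one level up."""
    totals = {}
    for child, value in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals


district_sales = roll_up(dealership_sales, district_of)  # drill-up one level
region_sales = roll_up(district_sales, region_of)        # drill-up again
print(district_sales)
print(region_sales)
```

Drill-down is the reverse navigation: starting from a region total, the user asks for the district or dealership values that make it up.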

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) for the East, West and Central regions, by year (1999, 2000)]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) for the East, West and Central regions in 1999, by quarter (1st Qtr to 4th Qtr)]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M), on an axis from 0 to 20, for the East, West and Central regions in January, February and March of Year 1999.]

Drill-down from Quarter to Month


Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
Multidimensional databases provide both the database layer and the application logic layer.

ROLAP - Relational OLAP
Accesses data stored in a relational data warehouse for OLAP analysis. The database and the application logic are provided as separate layers.

HOLAP - Hybrid OLAP
The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server.

DOLAP - Desktop OLAP
Personal MDDB server and application on the desktop.
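The HOLAP routing idea can be sketched as "answer from the pre-aggregated cube when possible, otherwise fall back to the relational detail". The code below is only an illustration: the two in-memory structures stand in for an MDDB and an RDBMS, and all names and figures are hypothetical.

```python
# Pre-aggregated summaries, standing in for the MDDB cube (hypothetical data).
cube_summaries = {("Midwest", 1999): 120, ("Midwest", 2000): 150}

# Detail rows, standing in for the relational warehouse (hypothetical data).
detail_rows = [
    {"region": "Midwest", "year": 1999, "dealership": "Carr", "sales": 70},
    {"region": "Midwest", "year": 1999, "dealership": "Clyde", "sales": 50},
    {"region": "Midwest", "year": 2000, "dealership": "Carr", "sales": 150},
]


def query_sales(region, year, dealership=None):
    """Answer from the cube when a summary suffices, else fall back to detail rows."""
    if dealership is None and (region, year) in cube_summaries:
        return cube_summaries[(region, year)], "MDDB"
    total = sum(
        row["sales"]
        for row in detail_rows
        if row["region"] == region
        and row["year"] == year
        and (dealership is None or row["dealership"] == dealership)
    )
    return total, "RDBMS"


print(query_sales("Midwest", 1999))          # summary question, served by the cube
print(query_sales("Midwest", 1999, "Carr"))  # detail question, served by the relational rows
```

The caller sees one interface in both cases, which is the sense in which the source of the data is transparent to the end user.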

MOLAP - MDDB storage

[Diagram: an OLAP cube and an OLAP calculation engine serving a Web browser, OLAP tools and OLAP applications.]


MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for
    Maximum query performance
    Optimum space utilization


ROLAP - Standard SQL storage

[Diagram: a relational DW queried through SQL by an OLAP calculation engine that uses an MDDB-to-relational mapping, serving a Web browser, OLAP tools and OLAP applications.]

ROLAP - Features
Three-tier hardware/software architecture:
    GUI on client; multidimensional processing on mid-tier server; target database on database server
    Processing split between mid-tier & database servers
Ad hoc query capabilities to very large databases
DW integration
Data scalability


HOLAP - Combination of RDBMS and MDDB


[Diagram: an OLAP cube and a relational DW (queried through SQL) behind one OLAP calculation engine, serving any client: Web browser, OLAP tools and OLAP applications.]

HOLAP - Features

RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of RDBMS and MDDB performance
Calculation engine provides full analysis features
Source of data transparent to end user


Architecture Comparison

Definition
    MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
    ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
    HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
    MOLAP: With good design, 3 to 10 times
    ROLAP: No sparsity
    HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
    MOLAP: High (may go beyond control; estimation is very important)
    ROLAP: To the necessary extent
    HOLAP: To the necessary extent

Query execution speed
    MOLAP: Fast (depends upon the size of the MDDB)
    ROLAP: Slow
    HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
    MOLAP: Medium: MDDB server + large disk space cost
    ROLAP: Low: only RDBMS + disk space cost
    HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
    MOLAP: Small transactional data + complex model + frequent summary analysis
    ROLAP: Very large transactional data that needs to be viewed / sorted
    HOLAP: Large transactional data + frequent summary analysis
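Sparsity in an MDDB can be made concrete by comparing the number of populated cells with the number of cells the dimensions allow. The helper below is a small sketch; the function name and the example sizes are assumptions made for the illustration.

```python
def cube_density(populated_cells, dimension_sizes):
    """Fraction of the cube's possible cells that actually hold a value."""
    possible = 1
    for size in dimension_sizes:
        possible *= size
    return populated_cells / possible


# 3 models x 3 colors x 3 dealerships allow 27 cells; suppose only 9 hold sales.
density = cube_density(9, (3, 3, 3))
sparsity = 1 - density
print(f"density={density:.2f}, sparsity={sparsity:.2f}")
```

A relational fact table stores only the rows that exist, which is why the comparison lists no sparsity for ROLAP.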


Representative OLAP Tools:

Oracle Express Products
Hyperion Essbase
Cognos - PowerPlay
Seagate - Holos
SAS
MicroStrategy - DSS Agent
Informix MetaCube
Brio Query
Business Objects / Web Intelligence

Sample OLAP Applications

Sales Analysis
Financial Analysis
Profitability Analysis
Performance Analysis
Risk Management
Profiling & Segmentation
Scorecard Application
NPA Management
Strategic Planning
Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


The cost of fixing software defects rises exponentially the later they are found in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.

The methodology required for testing a data warehouse is different from that for testing a typical transaction system.


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
User-triggered vs. system-triggered
Volume of test data
Possible scenarios / test cases
Programming for testing challenge


Difference In Testing Data warehouse and Transaction System

User-triggered vs. system-triggered
In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).


Difference In Testing Data warehouse and Transaction System


Volume of test data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, because one tries to cover the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases
In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.


Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
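A stand-alone comparison script of this kind might look like the sketch below. The rows, the column names and the business rules in `expected_after_transform` are all assumptions made for the illustration; a real script would read both sides from the source system and the warehouse.

```python
from decimal import Decimal

# Hypothetical source rows (pre-transformation) and warehouse rows (post-transformation).
source_rows = [
    {"order_id": 1, "amount": "12.50", "country": "us"},
    {"order_id": 2, "amount": "7.25", "country": "in"},
    {"order_id": 3, "amount": "30.00", "country": "us"},
]
warehouse_rows = [
    {"order_id": 1, "amount_cents": 1250, "country": "US"},
    {"order_id": 2, "amount_cents": 725, "country": "IN"},
    {"order_id": 3, "amount_cents": 3000, "country": "US"},
]


def expected_after_transform(row):
    """Re-apply the assumed business rules independently of the ETL job."""
    return {
        "order_id": row["order_id"],
        "amount_cents": int(Decimal(row["amount"]) * 100),
        "country": row["country"].upper(),
    }


def reconcile(source, target, key="order_id"):
    """Return (key, problem) pairs; an empty list means the two data sets agree."""
    problems = []
    target_by_key = {row[key]: row for row in target}
    for row in source:
        actual = target_by_key.pop(row[key], None)
        if actual is None:
            problems.append((row[key], "missing in warehouse"))
        elif actual != expected_after_transform(row):
            problems.append((row[key], "values differ"))
    # Anything left over was loaded without a matching source row.
    problems.extend((k, "unexpected in warehouse") for k in target_by_key)
    return problems


print(reconcile(source_rows, warehouse_rows))
```

Re-deriving the expected values in the test script, rather than reusing the ETL code, is what makes the comparison an independent check.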


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
'Back-end' testing, where the source systems' data is compared with the end-result data in the loaded area.
'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP.

Testing phases consist of:
Requirements testing
Unit testing
Integration testing
Performance testing
Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
Are the requirements complete?
Are the requirements singular?
Are the requirements unambiguous?
Are the requirements developable?
Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:

Whether the ETLs access and pick up the right data from the right source.
Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
Testing the rejected records that don't fulfil the transformation rules.
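The last point, rejected records, can be sketched as a transformation that routes rule violations to a reject list instead of failing the whole job. The two business rules and the field names below are hypothetical; they only show the shape of such a unit test.

```python
def transform(record):
    """Apply the assumed business rules; raise ValueError for records that break them."""
    quantity = int(record["qty"])          # rule 1: quantity must be an integer
    if quantity <= 0:                      # rule 2: quantity must be positive
        raise ValueError("qty must be positive")
    return {"product": record["product"].strip().upper(), "qty": quantity}


def run_etl(records):
    """Return (loaded, rejected); rejected keeps the original record and the reason."""
    loaded, rejected = [], []
    for record in records:
        try:
            loaded.append(transform(record))
        except (KeyError, ValueError) as error:
            rejected.append((record, str(error)))
    return loaded, rejected


loaded, rejected = run_etl([
    {"product": " bolt ", "qty": "2"},   # valid
    {"product": "nut", "qty": "0"},      # breaks rule 2
    {"product": "nut", "qty": "x"},      # breaks rule 1
])
```

A unit test then asserts both sides: the valid record is loaded in its transformed form, and each invalid record lands in the reject list with a reason.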


Unit Testing
Unit Testing the Report data:

Verify report data with source: Data in a data warehouse is stored at an aggregate level compared with the source systems. The QA team should verify the granular data stored in the data warehouse against the source data available.
Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
Derivation formulae / calculation rules should be verified.


Integration Testing
Integration testing will involve the following:
Sequence of ETL jobs in the batch.
Initial loading of records into the data warehouse.
Incremental loading of records at a later date, to verify the newly inserted or updated data.
Testing the rejected records that don't fulfil the transformation rules.
Error log generation.
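The initial-load and incremental-load checks can be sketched with a dictionary standing in for the warehouse table. The `incremental_load` helper and the row layout are hypothetical; the point is that a test can assert how many rows were inserted and how many were updated by each run.

```python
def incremental_load(warehouse, changes, key="id"):
    """Upsert `changes` into `warehouse` (a dict keyed by `key`); return (inserted, updated)."""
    inserted = updated = 0
    for row in changes:
        if row[key] in warehouse:
            updated += 1
        else:
            inserted += 1
        warehouse[row[key]] = row
    return inserted, updated


warehouse = {}
# Initial load: every row is new.
print(incremental_load(warehouse, [{"id": 1, "qty": 1}, {"id": 2, "qty": 2}]))
# Later incremental load: one changed row and one new row.
print(incremental_load(warehouse, [{"id": 2, "qty": 5}, {"id": 3, "qty": 1}]))
```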


Performance Testing
Performance testing should check for:

ETL processes completing within the time window.
Monitoring and measuring the data quality issues.
Refresh times for standard/complex reports.


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.


Questions


Thank You


Components of Warehouse
Source Tables: Real-time, volatile data in relational databases for transaction processing (OLTP). The sources can be any relational databases or flat files.
ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
End-user tools for analysis and reporting: Get reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: after cleansing and transformation, the data is loaded into the staging area and tested there. It is then loaded directly into the target database/warehouse, which is divided into data marts that different users access for their reporting and analysis.


Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model

Entity: An object that can be observed and classified based on its properties and characteristics, such as employee, book or student.
Relationship: Relates entities to other entities.

Different Perspectives of Data Modeling

o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of Dimensional Data Models most commonly used:

o Star Schema
o Snowflake Schema


Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year -> Quarter -> Month -> Day.
Fact Table: A table that contains the measures of interest.
Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
Surrogate Keys: Used to avoid data integrity problems. They are helpful for slowly changing dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
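One common way surrogate keys help with slowly changing dimensions is the "Type 2" approach, in which every change to a dimension member adds a new row with a new surrogate key and closes the previous row. The slides do not say which variant is meant, so the sketch below is an assumption, and all names in it are hypothetical.

```python
from itertools import count

_next_key = count(1)  # surrogate key generator, independent of source-system ids


def apply_change(dimension, natural_key, attributes, effective_date):
    """Type 2 change: close the current row (if any) and add a row with a new surrogate key."""
    for row in dimension:
        if row["natural_key"] == natural_key and row["end_date"] is None:
            if row["attributes"] == attributes:
                return row["surrogate_key"]       # nothing changed, reuse the current row
            row["end_date"] = effective_date      # expire the old version
    new_row = {
        "surrogate_key": next(_next_key),
        "natural_key": natural_key,
        "attributes": attributes,
        "start_date": effective_date,
        "end_date": None,
    }
    dimension.append(new_row)
    return new_row["surrogate_key"]


customer_dim = []
first = apply_change(customer_dim, 53, {"city": "sfo"}, "1997-01-01")
second = apply_change(customer_dim, 53, {"city": "la"}, "1997-06-01")
```

Fact rows loaded before the change keep pointing at the first surrogate key, so history is preserved even though the customer's natural key (53) never changes.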

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
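A typical query against this star schema joins the fact table to the dimensions it needs and aggregates the measures. The sketch below loads the slide's rows into an in-memory SQLite database using Python's standard `sqlite3` module; the column types and the `orderId` spelling are assumptions, since the slide gives only names and values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId INTEGER REFERENCES customer(custId),
    prodId TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO customer VALUES
    (53,'joe','10 main','sfo'), (81,'fred','12 main','sfo'), (111,'sally','80 willow','la');
INSERT INTO sale VALUES
    ('o100','1/7/97',53,'p1','c1',1,12),
    ('o102','2/7/97',53,'p2','c1',2,11),
    ('105','3/8/97',111,'p1','c3',5,50);
""")

# Total quantity and amount per product and store city: fact joined to two dimensions.
rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.qty) AS qty, SUM(f.amt) AS amt
    FROM sale AS f
    JOIN product AS p ON p.prodId = f.prodId
    JOIN store AS s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()

for row in rows:
    print(row)
```

Every dimension is exactly one join away from the fact table, which is what gives the schema its star shape.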


Snowflake Schema
store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.
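In a snowflake schema an attribute such as the region name is reached by following a chain of lookups rather than a single join. The sketch below walks that chain with plain dictionaries holding the slide's rows; the `describe_store` helper is a hypothetical name used only for the illustration.

```python
# Rows copied from the slide's snowflake example.
store = {
    "s5": {"cityId": "sfo", "tId": "t1", "mgr": "joe"},
    "s7": {"cityId": "sfo", "tId": "t2", "mgr": "fred"},
    "s9": {"cityId": "la", "tId": "t1", "mgr": "nancy"},
}
s_type = {
    "t1": {"size": "small", "location": "downtown"},
    "t2": {"size": "large", "location": "suburbs"},
}
city = {"sfo": {"pop": "1M", "regId": "north"}, "la": {"pop": "5M", "regId": "south"}}
region = {"north": "cold region", "south": "warm region"}


def describe_store(store_id):
    """Follow store -> city -> region and store -> sType for one store."""
    row = store[store_id]
    city_row = city[row["cityId"]]
    return {
        "mgr": row["mgr"],
        "size": s_type[row["tId"]]["size"],
        "pop": city_row["pop"],
        "region": region[city_row["regId"]],
    }


print(describe_store("s9"))
```

The extra hop from city to region is the price of normalizing the dimension; a star schema would carry the region name directly on the store row.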


Overview of Data Cleansing


The Need For Data Quality


Difficulty in decision making
Time delays in operation
Organizational mistrust
Data ownership conflicts
Customer attrition
Costs associated with
    error detection
    error rework
    customer service
    fixing customer problems


Understand Information Flow In Organization

    Identify authoritative data sources
    Interview employees & customers
    Data entry points
    Cost of bad data

Identify Potential Problem Areas & Assess Impact

Measure Quality Of Data
    Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
    Use data cleansing tools to clean data at the source
    Load only clean data into the data warehouse

Continuous Monitoring
    Schedule periodic cleansing of source data

Identify Areas of Improvement
    Identify & correct cause of defects
    Refine data capture mechanisms at source
    Educate users on importance of DQ
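The "measure quality of data" step can be sketched as a few simple counts over the source records. The function below covers only two of the problems listed, missing values and duplicate keys; its name, parameters and sample records are assumptions made for the illustration, not a substitute for a rule discovery tool.

```python
from collections import Counter


def measure_quality(records, key, required):
    """Count missing required fields and duplicate keys in a list of dict records."""
    missing = sum(1 for r in records for f in required if r.get(f) in (None, ""))
    key_counts = Counter(r.get(key) for r in records)
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    return {"records": len(records), "missing_values": missing, "duplicate_keys": duplicates}


# Hypothetical customer records with one duplicate id and two missing names.
report = measure_quality(
    [{"id": 1, "name": "joe"}, {"id": 1, "name": ""}, {"id": 2, "name": None}],
    key="id",
    required=["name"],
)
print(report)
```

Running the same measurement on a schedule gives the continuous monitoring step a number to track over time.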

Customized Programs
    Strengths:
        Addresses specific needs
        No bulky one-time investment
    Limitations:
        Tons of custom programs in different environments are difficult to manage
        Minor alterations demand coding efforts

Data Quality Assessment Tools
    Strength:
        Provide automated assessment
    Limitation:
        No measure of data accuracy


Business Rule Discovery Tools
    Strengths:
        Detect correlation in data values
        Can detect patterns of behavior that indicate fraud
    Limitations:
        Not all variables can be discovered
        Some discovered rules might not be pertinent
        There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
    Strengths:
        Usually are integrated packages with cleansing features as add-on
    Limitations:
        Error prevention at source is usually absent
        The ETL tools have limited cleansing facilities


Business Rule Discovery Tools
    Integrity Data Reengineering Tool from Vality Technology
    Trillium Software System from Harte-Hanks Data Technologies
    Migration Architect from DB Star

Data Reengineering & Cleansing Tools
    Carlton Pureview from Oracle
    ETI-Extract from Evolutionary Technologies
    PowerMart from Informatica Corp
    Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
    Migration Architect, Evoke Axio from Evoke Software
    Wizrule from Wizsoft

Name & Address Cleansing Tools
    Centrus Suite from Sagent
    I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture


Data Extraction:
    Rummages through a file or database
    Uses some criteria for selection
    Identifies qualified data and
    Transports the data over onto another file or database

Data Transformation:
    Integrating dissimilar data types
    Changing codes
    Adding a time attribute
    Summarizing data
    Calculating derived values
    Renormalizing data

Data Extraction Cleanup:
    Restructuring of records or fields
    Removal of operational-only data
    Supply of missing field values
    Data integrity checks
    Data consistency and range checks, etc.

Data Loading:
    Initial and incremental loading
    Updating of metadata
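The extract, transform and load steps above can be sketched end to end in a few lines. Everything in the example is hypothetical: the flat-file layout, the selection criterion, the code table and the list that stands in for the target table are chosen only to show one instance of each step.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Hypothetical flat-file extract from an operational system.
SOURCE = """order_id,status,region_code,amount
1,SHIPPED,E,10.50
2,CANCELLED,W,99.00
3,SHIPPED,W,4.25
4,SHIPPED,E,
"""
REGION_NAMES = {"E": "East", "W": "West"}  # code table used for "changing codes"


def extract(text):
    """Select only the rows that meet the selection criterion (shipped orders)."""
    return [row for row in csv.DictReader(io.StringIO(text)) if row["status"] == "SHIPPED"]


def transform(rows, load_date):
    """Supply missing values, change codes, summarize, and add a time attribute."""
    totals = defaultdict(float)
    for row in rows:
        amount = float(row["amount"] or 0)  # supply a missing field value
        totals[REGION_NAMES[row["region_code"]]] += amount
    return [
        {"region": region, "total_amount": total, "load_date": load_date.isoformat()}
        for region, total in sorted(totals.items())
    ]


def load(target, rows):
    """Append the transformed rows to the target table (a list stands in for it)."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(warehouse, transform(extract(SOURCE), date(2009, 1, 1)))
print(loaded, warehouse)
```

Keeping the three steps as separate functions mirrors the slide's split and lets each step be unit tested on its own.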


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.


Major components involved in ETL Processing


Design manager: Lets developers define source-to-target mappings, transformations, process flows and jobs.
Metadata management: Provides a repository to define, document and manage information about the ETL design and runtime processes.
Extract: The process of reading data from a database.
Transform: The process of converting the extracted data.
Load: The process of writing the data into the target database.
Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures and reconcile outputs with source systems.

ETL Tools
Provide a facility to specify a large number of transformation rules with a GUI
Generate programs to transform data
Handle multiple data sources
Handle data redundancy
Generate metadata as output
Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
PowerCenter/Mart from Informatica
Data Mart Solution from Sagent Technology
DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is Information...

That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
About the data being captured and loaded into the warehouse
Documented in IT tools, that improves both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
    Time spent looking for information.
    How often is the information found?
    What poor decisions were made based on incomplete information?
    How much money was lost or earned as a result?

Interpreting information
    How many times have businesses needed to rework or recall products?
    What impact does it have on the bottom line?
    How many mistakes were due to misinterpretation of existing documentation?
    How much interpretation results from too much metadata?
    How much time is spent trying to determine whether any of the metadata is accurate?

Integrating information
    How do the various data perspectives connect together?
    How much time is spent trying to figure that out?
    How much do the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


Provide a simple catalogue of business metadata descriptions and views
Document/manage metadata descriptions from an integrated development environment
Enable DW users to identify and invoke pre-built queries against the data stores
Design and enhance new data models and schemas for the data warehouse
Capture data transformation rules between the operational and data warehousing databases
Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
    Warehouse administrator
    Application developer

Business Users - Business metadata
    Meanings
    Definitions
    Business rules

Software Tools
    Used in DW life-cycle development
    Metadata requirements for each tool must be identified
    The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
    Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third-Party Bridging Tools

Oracle Exchange
    Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik - Toolbus
    Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
    Hub-and-spoke solution for enabling metadata interoperability
    Ardent focussing on own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
    Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools


Metadata Repositories
    IBM, Oracle and Microsoft to offer free or near-free basic repository services
    Enable organisations to reuse metadata across technologies
    Integrate DB design, data transformation and BI tools from different vendors
    Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
    Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them



Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Nonvolatile: Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse


Data Warehouse Architecture


Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: data is cleansed, transformed and tested in the staging area, and is then loaded into the target database/warehouse. The warehouse is divided into data marts, which different users access for reporting and analysis.



Star Schema

Dimension table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
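The star schema above can be tried out with a small in-memory database. The sketch below uses Python's built-in sqlite3 module and the sample rows from the slide; the function name `amount_by_store_city` and the column types are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
-- The fact table holds the measures (qty, amt) plus one key per dimension.
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO product  VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store    VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO customer VALUES (53,'joe','10 main','sfo'),
                            (81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                        ('o102','2/7/97',53,'p2','c1',2,11),
                        ('105','3/8/97',111,'p1','c3',5,50);
""")


def amount_by_store_city(connection):
    """Join the fact table to one dimension and aggregate a measure."""
    return connection.execute("""
        SELECT st.city, SUM(s.amt)
        FROM sale AS s
        JOIN store AS st ON st.storeId = s.storeId
        GROUP BY st.city
        ORDER BY st.city
    """).fetchall()


totals = amount_by_store_city(conn)
```

Each analytical query follows the same shape: start from the fact table, join only the dimensions needed, and group by dimension attributes.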


Snowflake Schema

Tables (labelled "Dimension Table" and "Fact Table" on the slide):

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.


Overview of Data Cleansing


The Need For Data Quality


- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with
  - error detection
  - error rework
  - customer service
  - fixing customer problems


Understand Information Flow In Organization
- Identify authoritative data sources
- Interview employees & customers
- Data entry points
- Cost of bad data

Identify Potential Problem Areas & Assess Impact

Measure Quality Of Data
- Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
- Use data cleansing tools to clean data at the source
- Load only clean data into the data warehouse

Continuous Monitoring
- Schedule periodic cleansing of source data

Identify Areas of Improvement
- Identify & correct cause of defects
- Refine data capture mechanisms at source
- Educate users on importance of DQ
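The "measure quality" and "load only clean data" steps can be sketched as a small profiling pass that flags missing, duplicate and out-of-range values. The function name `profile`, the record layout and the `age` field are hypothetical and chosen only for illustration; a real project would use the rules discovered for its own sources.

```python
def profile(records, required, key, ranges):
    """Return the positions of records that break simple quality rules."""
    issues = {"missing": [], "duplicate": [], "out_of_range": []}
    seen = set()
    for i, rec in enumerate(records):
        for field in required:
            if rec.get(field) in (None, ""):
                issues["missing"].append((i, field))
        k = rec.get(key)
        if k in seen:                       # same business key seen before
            issues["duplicate"].append((i, k))
        seen.add(k)
        for field, (low, high) in ranges.items():
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                issues["out_of_range"].append((i, field))
    return issues


records = [
    {"custId": 53, "name": "joe", "age": 34},
    {"custId": 81, "name": "", "age": 29},        # missing name
    {"custId": 53, "name": "joe", "age": 34},     # duplicate key
    {"custId": 111, "name": "sally", "age": 212}, # implausible age
]
issues = profile(records, required=["name"], key="custId",
                 ranges={"age": (0, 120)})

# Load only the records that raised no issue at all.
bad = {i for found in issues.values() for i, _ in found}
clean = [rec for i, rec in enumerate(records) if i not in bad]
```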

Customized Programs
Strengths:
- Addresses specific needs
- No bulky one-time investment
Limitations:
- Tons of custom programs in different environments are difficult to manage
- Minor alterations demand coding efforts

Data Quality Assessment tools
Strength:
- Provide automated assessment
Limitation:
- No measure of data accuracy


Business Rule Discovery tools
Strengths:
- Detect correlation in data values
- Can detect patterns of behavior that indicate fraud
Limitations:
- Not all variables can be discovered
- Some discovered rules might not be pertinent
- There may be performance problems with large files or with many fields

Data Reengineering & Cleansing tools
Strengths:
- Usually are integrated packages with cleansing features as add-on
Limitations:
- Error prevention at source is usually absent
- The ETL tools have limited cleansing facilities


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

Data Extraction:
- Searches through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata
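The four stages listed above can be sketched end to end with plain Python functions. Everything here is illustrative: the field names (`gender`, `qty`, `unit_price`, `status`, `session`), the code mapping and the in-memory "warehouse" list are assumptions, not part of any particular ETL tool.

```python
from datetime import date


def extract(rows, predicate):
    """Select qualifying rows and copy them out of the source."""
    return [dict(r) for r in rows if predicate(r)]


def cleanup(rows, defaults, operational_only):
    """Drop operational-only fields and supply missing field values."""
    cleaned = []
    for r in rows:
        r = {k: v for k, v in r.items() if k not in operational_only}
        for field, default in defaults.items():
            if r.get(field) in (None, ""):
                r[field] = default
        cleaned.append(r)
    return cleaned


def transform(rows, code_map, load_date):
    """Change codes, calculate a derived value, add a time attribute."""
    out = []
    for r in rows:
        r = dict(r)
        r["gender"] = code_map.get(r["gender"], "Unknown")
        r["amount"] = r["qty"] * r["unit_price"]
        r["load_date"] = load_date.isoformat()
        out.append(r)
    return out


def load(target, rows):
    """Append rows to the target and report how many were written."""
    target.extend(rows)
    return len(rows)


source = [
    {"id": 1, "gender": "M", "qty": 2, "unit_price": 5,
     "status": "posted", "session": "a1"},
    {"id": 2, "gender": "1", "qty": 1, "unit_price": 10,
     "status": "draft", "session": "b2"},
    {"id": 3, "gender": "", "qty": 5, "unit_price": 10,
     "status": "posted", "session": "c3"},
]
code_map = {"M": "Male", "F": "Female", "1": "Male", "2": "Female"}

warehouse = []
staged = extract(source, lambda r: r["status"] == "posted")
staged = cleanup(staged, defaults={"gender": "U"}, operational_only={"session"})
loaded = load(warehouse, transform(staged, code_map, date(2009, 1, 1)))
```

Keeping each stage as a separate function mirrors the slide's breakdown and makes it easy to test one stage at a time.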


Companies have valuable data scattered throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.


Major components involved in ETL Processing

- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Meta data management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: The process of reading data from a database.
- Transform: The process of converting the extracted data.
- Load: The process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine whether any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users (business metadata)
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third Party Bridging Tools

Oracle Exchange
- Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik - Toolbus
- Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
- Hub-and-spoke solution for enabling metadata interoperability
- Ardent is focusing on its own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
- Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web, employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member), Viasoft


OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools


OLAP: On-Line Analytical Processing


- OLAP can be defined as a technology that allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
- Used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers


Distinction between OLTP and OLAP


Source of data
- OLTP System: Operational data; OLTPs are the original source of the data
- OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: To control and run fundamental business tasks
- OLAP System: Decision support

What the data reveals
- OLTP System: A snapshot of ongoing business processes
- OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
- OLTP System: Short and fast inserts and updates initiated by end users
- OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database (MDDB) is a software system designed for efficient and convenient storage and retrieval of closely related data that is stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions
- Individual items within each dimension are called members


RDBMS v/s MDDB: Increased Complexity...

Relational DBMS (sales volumes stored as rows):

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

27 x 4 = 108 cells

MDDB (sales volumes stored as a cube):

[Figure: Sales Volumes cube with axes MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde)]

3 x 3 x 3 = 27 cells
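The cell counts on this slide can be reproduced with a short sketch that loads relational rows into a three-dimensional array. Only the nine Mini Van rows from the slide are loaded here; the helper name `build_cube` and the nested-list representation are assumptions made for illustration, not how any particular MDDB stores its data.

```python
models = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
colors = ["BLUE", "RED", "WHITE"]
dealers = ["Clyde", "Gleason", "Carr"]

# (model, color, dealer, volume): the Mini Van rows from the slide.
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6), ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2), ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3), ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3), ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]


def build_cube(rows, models, colors, dealers):
    """Place each row's volume at the cell addressed by its three members."""
    cube = [[[None for _ in dealers] for _ in colors] for _ in models]
    for model, color, dealer, volume in rows:
        cube[models.index(model)][colors.index(color)][dealers.index(dealer)] = volume
    return cube


cube = build_cube(rows, models, colors, dealers)

# A full relational table needs one row of 4 columns per combination;
# the cube needs one cell per combination.
combinations = len(models) * len(colors) * len(dealers)
relational_cells = combinations * 4
cube_cells = combinations

red_mini_van_at_clyde = cube[0][1][0]
```

In the cube the dimension members are stored once as axis labels, which is where the difference between 108 and 27 cells comes from.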


Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array
- The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table

Storage Space
- Very low space consumption compared to a relational DB

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, then a blank cell is left behind.

Relational table:

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: Employee Age array with axes LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) and EMPLOYEE # (31, 41, 23, 01, 14, 54, 03, 12, 33); only one cell per employee holds an age, the rest are blank]

For comparison, the Sales Volumes array (MODEL by COLOR) has a value in every cell:

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2
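The degree of sparsity in the employee example can be put into numbers. The sketch below computes the fraction of populated cells for the two arrays on this slide; the function name `density` is an assumption made for illustration.

```python
def density(populated_cells, *dimension_sizes):
    """Fraction of cells in a multidimensional array that hold a value."""
    total = 1
    for size in dimension_sizes:
        total *= size
    return populated_cells / total


# Employee Age: 9 last names x 9 employee numbers, one age per employee.
employee_density = density(9, 9, 9)

# Sales Volumes: 3 models x 3 colors, every combination has a value.
sales_density = density(9, 3, 3)
```

With 9 of 81 cells filled, roughly 89% of the employee array is blank, while the sales array is fully populated.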


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data


Features of OLAP - Rotation

Complex queries & sorts in a relational environment are translated to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

View #2 (rotated 90 degrees): Sales Volumes, COLOR by MODEL

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.



Features of OLAP - Rotation

[Figure: six views (View #1 to View #6) of the Sales Volumes cube, each obtained by a 90-degree rotation that rearranges the MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde) axes]

A 3-dimensional array has 6 views.
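The view counts (2 views for two dimensions, 6 for three) are the number of ways to order the axes. The sketch below treats a rotation as a reordering of each cell's coordinates, using the MODEL by COLOR sales volumes from the earlier rotation slide; the helper names `rotate` and `count_views` are hypothetical.

```python
from itertools import permutations


def rotate(cells, order):
    """Re-key every cell so that input axis order[i] becomes axis i."""
    return {tuple(key[i] for i in order): value for key, value in cells.items()}


def count_views(number_of_dimensions):
    """Each ordering of the axes gives one view of the same data."""
    return len(list(permutations(range(number_of_dimensions))))


# View #1: MODEL by COLOR.
sales = {
    ("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5, ("Mini Van", "White"): 4,
    ("Coupe", "Blue"): 3, ("Coupe", "Red"): 5, ("Coupe", "White"): 5,
    ("Sedan", "Blue"): 4, ("Sedan", "Red"): 3, ("Sedan", "White"): 2,
}

# View #2: COLOR by MODEL. No query or sort is needed, only a re-keying.
view2 = rotate(sales, (1, 0))

views_2d = count_views(2)
views_3d = count_views(3)
```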



Features of OLAP - Slicing / Filtering

An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: Sales Volumes cube sliced down to MODEL (Mini Van, Coupe), COLOR (Normal Blue, Metal Blue) and DEALERSHIP (Carr, Clyde)]
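Slicing can be sketched as keeping only the cells whose members fall in a chosen subset on each axis. The cube contents below are placeholder numbers, and the member lists beyond those named on the slide (Sedan, Red, Gleason) as well as the function name `slice_cube` are assumptions made for illustration.

```python
from itertools import product

axes = ("MODEL", "COLOR", "DEALERSHIP")
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Normal Blue", "Metal Blue", "Red"]
dealers = ["Carr", "Clyde", "Gleason"]

# Placeholder volumes: 1, 2, 3, ... for the 27 combinations.
cells = {key: n for n, key in enumerate(product(models, colors, dealers), start=1)}


def slice_cube(cells, axes, wanted):
    """Keep cells whose member on every named axis is in the wanted set."""
    position = {name: i for i, name in enumerate(axes)}
    return {
        key: value
        for key, value in cells.items()
        if all(key[position[axis]] in members for axis, members in wanted.items())
    }


blue_slice = slice_cube(cells, axes, {
    "MODEL": {"Mini Van", "Coupe"},
    "COLOR": {"Normal Blue", "Metal Blue"},
    "DEALERSHIP": {"Carr", "Clyde"},
})
```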

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION: Midwest
  DISTRICT: Chicago
    DEALERSHIP: Clyde, Gleason
  DISTRICT: St. Louis
    DEALERSHIP: Carr, Levi
  DISTRICT: Gary
    DEALERSHIP: Lucas, Bolton

Sales can be viewed at Region / District / Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down respectively.
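Roll-up and drill-down along the organization hierarchy can be sketched with a lookup from dealership to its district and region. The hierarchy follows the slide; the sales figures and the function names `roll_up` and `drill_down` are made up for illustration.

```python
from collections import defaultdict

# dealership -> (district, region), as in the organization dimension.
HIERARCHY = {
    "Clyde": ("Chicago", "Midwest"), "Gleason": ("Chicago", "Midwest"),
    "Carr": ("St. Louis", "Midwest"), "Levi": ("St. Louis", "Midwest"),
    "Lucas": ("Gary", "Midwest"), "Bolton": ("Gary", "Midwest"),
}

# Hypothetical sales per dealership.
sales = {"Clyde": 6, "Gleason": 3, "Carr": 2, "Levi": 4, "Lucas": 1, "Bolton": 5}


def roll_up(sales_by_dealer, level):
    """Aggregate dealership sales to the 'district' or 'region' level."""
    index = {"district": 0, "region": 1}[level]
    totals = defaultdict(int)
    for dealer, amount in sales_by_dealer.items():
        totals[HIERARCHY[dealer][index]] += amount
    return dict(totals)


def drill_down(sales_by_dealer, district):
    """Show the dealership-level detail behind one district total."""
    return {d: a for d, a in sales_by_dealer.items()
            if HIERARCHY[d][0] == district}


by_district = roll_up(sales, "district")
by_region = roll_up(sales, "region")
chicago_detail = drill_down(sales, "Chicago")
```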


OLAP Reporting - Drill Down

Inflows (Region, Year)
[Figure: bar chart of Inflows ($M) by region (East, West, Central) for Year 1999 and Year 2000]

Inflows (Region, Year - Year 1999)
[Figure: bar chart of Inflows ($M) by region (East, West, Central) for the 1st to 4th quarters of Year 1999]
Drill-down from Year to Quarter

Inflows (Region, Year - Year 1999 - 1st Qtr)
[Figure: bar chart of Inflows ($M) by region (East, West, Central) for January, February and March of Year 1999]
Drill-down from Quarter to Month


Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic provided as separate layers

HOLAP - Hybrid OLAP
- OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Figure: MOLAP architecture. An OLAP cube feeds the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]


MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  - maximum query performance
  - optimum space utilization


ROLAP - Standard SQL storage

[Figure: ROLAP architecture. A relational DW is accessed through SQL via an MDDB-relational mapping by the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]

ROLAP - Features
- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier & database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability


HOLAP - Combination of RDBMS and MDDB

[Figure: HOLAP architecture. The OLAP calculation engine reads from both an OLAP cube and a relational DW (via SQL) and serves any client: web browsers, OLAP tools and OLAP applications]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user


Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: With good design, 3 to 10 times
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
- MOLAP: Medium: MDDB server + large disk space cost
- ROLAP: Low: only RDBMS + disk space cost
- HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


- The cost of fixing a software defect increases exponentially the later in the development lifecycle it is found.
- In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system.


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge


Difference In Testing Data warehouse and Transaction System.


User-triggered vs. system-triggered
- In a data warehouse, most of the testing is system-triggered.
- Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.).
- In those systems very few test cycles cover system-triggered scenarios (like billing or valuation).


Difference In Testing Data warehouse and Transaction System


Volume of test data
- The test data in a transaction system is a very small sample of the overall production data.
- A data warehouse typically has large test data, as one tries to cover the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases
- In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.


Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
- In transaction systems, users/business analysts typically test the output of the system.
- In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts.
- These scripts compare the data before transformation with the data after transformation.
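A stand-alone back-end script of the kind described here can be sketched as a reconciliation between source rows and loaded rows. The function name `reconcile`, the record layout and the `amount = qty * price` rule are assumptions made for illustration; a real script would apply the project's own transformation rules and read from the actual source and target stores.

```python
def reconcile(source_rows, target_rows, key, transform):
    """Compare expected (transformed source) rows with loaded target rows."""
    expected = {row[key]: transform(row) for row in source_rows}
    actual = {row[key]: row for row in target_rows}
    return {
        "missing_in_target": sorted(expected.keys() - actual.keys()),
        "unexpected_in_target": sorted(actual.keys() - expected.keys()),
        "mismatched": sorted(
            k for k in expected.keys() & actual.keys() if expected[k] != actual[k]
        ),
    }


def expected_row(row):
    """Hypothetical business rule: amount is quantity times price."""
    return {"id": row["id"], "amount": row["qty"] * row["price"]}


source = [
    {"id": 1, "qty": 2, "price": 5},
    {"id": 2, "qty": 1, "price": 10},
    {"id": 3, "qty": 5, "price": 10},
]
target = [
    {"id": 1, "amount": 10},   # correct
    {"id": 2, "amount": 11},   # wrong amount
    {"id": 4, "amount": 7},    # no matching source row
]

report = reconcile(source, target, key="id", transform=expected_row)
```

The same comparison also helps with testing rejected records: a source key that is missing in the target should appear in the reject log instead.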


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP

Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements ambiguous?
- Are the requirements developable?
- Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Check whether the ETLs are accessing and picking up the right data from the right source.
- Check that all data transformations are correct according to the business rules and that the data warehouse is correctly populated with the transformed data.
- Test the rejected records that don't fulfil the transformation rules.

Unit Testing
Unit testing the report data:
- Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the source data available.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.

499

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
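The report-versus-source check can be sketched as re-deriving the aggregate from granular source rows and comparing it with the figure shown in the report. The field names `store` and `amt` and the sample figures below are assumptions made for illustration only.

```python
from collections import defaultdict


def aggregate_source(rows, key, measure):
    """Sum a measure from granular source rows, grouped by a key column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


def verify_report(source_rows, report_totals, key="store", measure="amt"):
    """Return {key: (expected, reported)} for every total that disagrees."""
    expected = aggregate_source(source_rows, key, measure)
    return {
        k: (expected.get(k), report_totals.get(k))
        for k in expected.keys() | report_totals.keys()
        if expected.get(k) != report_totals.get(k)
    }


# Hypothetical granular source rows and the totals displayed in a report.
source_sales = [
    {"store": "c1", "amt": 12},
    {"store": "c1", "amt": 11},
    {"store": "c3", "amt": 50},
]
report_totals = {"c1": 23, "c3": 55}  # c3 is deliberately wrong

differences = verify_report(source_sales, report_totals)
print(differences)
```

An empty result would mean that every reported aggregate can be traced back to the source rows.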

Integration Testing
Integration testing involves the following:
- Sequence of ETL jobs in a batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date, to verify the newly inserted or updated data.
- Testing the rejected records that don't fulfil the transformation rules.
- Error log generation.
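The initial load, incremental load, rejected-record and error-log checks listed above can be exercised together in a small test harness. The sketch below assumes a warehouse modelled as a plain dictionary keyed by `id`, and a hypothetical rule that rejects rows whose `qty` is missing or negative.

```python
def load(warehouse, batch, error_log):
    """Upsert a batch into the warehouse; reject rows that break the (hypothetical) qty rule."""
    stats = {"inserted": 0, "updated": 0, "rejected": 0}
    for row in batch:
        if row.get("qty") is None or row["qty"] < 0:
            error_log.append(f"rejected id={row.get('id')}: invalid qty")
            stats["rejected"] += 1
        elif row["id"] in warehouse:
            warehouse[row["id"]] = row
            stats["updated"] += 1
        else:
            warehouse[row["id"]] = row
            stats["inserted"] += 1
    return stats


warehouse, error_log = {}, []

# Initial load.
initial_stats = load(warehouse, [{"id": 1, "qty": 5}, {"id": 2, "qty": 3}], error_log)

# Incremental load at a later date: one update, one rejected row, one new row.
incremental_stats = load(
    warehouse,
    [{"id": 2, "qty": 4}, {"id": 3, "qty": -1}, {"id": 4, "qty": 7}],
    error_log,
)

print(initial_stats, incremental_stats, error_log)
```

A test would then assert on the counts, on the final warehouse contents, and on the presence of an error-log entry for each rejected record.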


Performance Testing
Performance testing should check for:
- ETL processes completing within the time window.
- Monitoring and measuring of data quality issues.
- Refresh times for standard and complex reports.


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.


Questions


Thank You


An Overview

Understanding What is a Data Warehouse


What is Data Warehouse?


Definitions of Data Warehouse
- A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. (WH Inmon)
- A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support. (Babcock)
- A data warehouse is a relational database, a copy of transaction data specifically structured for query and analysis. (Ralph Kimball)

In simple terms: data warehousing is the collection of data from different systems, which helps in business decisions, analysis and reporting.


Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject Oriented: data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
- Time Variant: in order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse


Components of Warehouse
- Source Tables: real-time, volatile data in relational databases for transaction processing (OLTP). The sources can be any relational databases or flat files.
- ETL Tools: extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and Administration Tools: authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
- Modeling Tools: used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: get the reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


This is a basic design, in which source files are loaded into a warehouse and users query the data for different purposes.

This design has a staging area, where the data is loaded and tested after cleansing and transformation. The data is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.


Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
- Entity: an object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
- Relationship: relates entities to other entities.

Different perspectives of data modeling:
- Conceptual Data Model
- Logical Data Model
- Physical Data Model

Types of dimensional data models most commonly used:
- Star Schema
- Snowflake Schema


Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: a category of information. For example, the time dimension.
- Attribute: a unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
- Fact Table: a table that contains the measures of interest.
- Lookup Table: provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: used to avoid data integrity problems. They are helpful for slowly changing dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
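The star schema above can be tried out directly in an in-memory SQLite database. The sketch below loads the sample rows from the slide and joins the fact table to two of its dimension tables; the column types and the choice of query are assumptions, since the slide only gives the table contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
-- The fact table holds the measures (qty, amt) and one key per dimension.
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Typical star join: measures from the fact table, labels from the dimensions.
rows = conn.execute("""
    SELECT s.city, p.name, SUM(f.qty) AS qty, SUM(f.amt) AS amt
    FROM sale f
    JOIN store   s ON f.storeId = s.storeId
    JOIN product p ON f.prodId  = p.prodId
    GROUP BY s.city, p.name
    ORDER BY s.city, p.name
""").fetchall()
print(rows)
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.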

Snowflake Schema
Dimension tables (store is referenced from the fact table):

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.

Overview of Data Cleansing


The Need For Data Quality


- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with:
  - error detection
  - error rework
  - customer service
  - fixing customer problems


Understand Information Flow In Organization

- Identify authoritative data sources
- Interview employees & customers

Identify Potential Problem Areas & Assess Impact
- Data entry points
- Cost of bad data

Measure Quality Of Data
- Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
- Use data cleansing tools to clean data at the source
- Load only clean data into the data warehouse

Continuous Monitoring
- Schedule periodic cleansing of source data

Identify Areas of Improvement
- Identify & correct cause of defects
- Refine data capture mechanisms at source
- Educate users on importance of DQ
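The "measure" and "clean and load" steps can be illustrated with a few lines of code. This is only a sketch of the idea, not a substitute for the discovery and cleansing tools the slide refers to; the customer records, the key column and the list of required fields are hypothetical.

```python
from collections import Counter


def profile(records, key, required):
    """Measure quality: find duplicate keys and count missing required values."""
    key_counts = Counter(r.get(key) for r in records)
    duplicates = sorted(k for k, n in key_counts.items() if k is not None and n > 1)
    missing = {f: sum(1 for r in records if r.get(f) in (None, "")) for f in required}
    return {"duplicates": duplicates, "missing": missing}


def clean(records, key, required):
    """Keep only complete records, and only the first record for each key."""
    seen, clean_rows = set(), []
    for r in records:
        complete = all(r.get(f) not in (None, "") for f in required)
        if complete and r.get(key) not in seen:
            seen.add(r.get(key))
            clean_rows.append(r)
    return clean_rows


# Hypothetical source data with one duplicate and one missing value.
customers = [
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 81, "name": "fred", "city": ""},
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 111, "name": "sally", "city": "la"},
]
required_fields = ["custId", "name", "city"]

quality = profile(customers, "custId", required_fields)
clean_customers = clean(customers, "custId", required_fields)
print(quality)
print(clean_customers)
```

Only the rows returned by `clean` would be loaded into the warehouse; the profile output is the kind of measurement that feeds continuous monitoring.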

Customized Programs
Strengths:
- Address specific needs
- No bulky one-time investment
Limitations:
- Large numbers of custom programs in different environments are difficult to manage
- Minor alterations demand coding effort

Data Quality Assessment Tools
Strength:
- Provide automated assessment
Limitation:
- No measure of data accuracy


Business Rule Discovery Tools
Strengths:
- Detect correlation in data values
- Can detect patterns of behavior that indicate fraud
Limitations:
- Not all variables can be discovered
- Some discovered rules might not be pertinent
- There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
Strengths:
- Usually are integrated packages with cleansing features as an add-on
Limitations:
- Error prevention at source is usually absent
- The ETL tools have limited cleansing facilities


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture


ETL Architecture

Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata
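The extract, transform, cleanup and load stages listed above can be sketched end to end with the standard library. Everything specific here is an assumption made for illustration: the CSV layout, the column names, the rule that a row without `qty` is rejected, and the `sale` target table.

```python
import csv
import io
import sqlite3

# Hypothetical extract file from an operational system.
RAW = """order_id,order_date,store,qty,unit_price,terminal_id
o100,1997-07-01,NYC,1,12,t9
o102,1997-07-02,nyc,2,5.5,t9
o103,1997-07-03,la,,10,t4
"""


def extract(text):
    """Read the qualifying records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Clean up and transform; rows that fail the integrity check are rejected."""
    clean, rejected = [], []
    for row in rows:
        if not row["qty"]:  # integrity check: qty must be present
            rejected.append(row)
            continue
        qty = int(row["qty"])
        clean.append({
            "order_id": row["order_id"],
            "order_date": row["order_date"],
            "store": row["store"].lower(),          # changing codes to one convention
            "qty": qty,
            "amt": qty * float(row["unit_price"]),  # calculating a derived value
        })  # terminal_id is operational-only data and is dropped
    return clean, rejected


def load(conn, rows):
    """Write the transformed rows into the target table (safe to re-run)."""
    conn.execute("""CREATE TABLE IF NOT EXISTS sale (
        order_id TEXT PRIMARY KEY, order_date TEXT, store TEXT, qty INTEGER, amt REAL)""")
    conn.executemany(
        "INSERT OR REPLACE INTO sale VALUES (:order_id, :order_date, :store, :qty, :amt)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
clean_rows, rejected_rows = transform(extract(RAW))
load(conn, clean_rows)
loaded = conn.execute(
    "SELECT order_id, store, qty, amt FROM sale ORDER BY order_id"
).fetchall()
print(loaded, len(rejected_rows))
```

Using `INSERT OR REPLACE` keyed on `order_id` is one simple way to let the same function serve both the initial and the incremental load; real ETL tools offer richer strategies.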

Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Major components involved in ETL Processing


Major components involved in ETL Processing

- Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: the process of reading data from a database.
- Transform: the process of converting the extracted data.
- Load: the process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
- PowerCentre/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
- Time spent looking for information.
- How often is information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document and manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users (business metadata)
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third-Party Bridging Tools
- Oracle Exchange: technology of choice for a long list of repository, enterprise and workgroup vendors.
- Reischmann-Informatik-Toolbus: features include facilitation of selective bridging of metadata.
- Ardent Software / Dovetail Software - Interplay: hub-and-spoke solution for enabling metadata interoperability. Ardent is focussing on its own engagements and is not selling it as an independent product.
- Informix's Metadata Plug-ins: available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio and Microstrategy.

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated, rather than integrated, approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web, employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners, including Object Management Group (OMG), Oracle Carleton Group, CA PLATINUM Technology (founding member) and Viasoft


OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

OLAP: On-Line Analytical Processing


- OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
- Used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers

Distinction between OLTP and OLAP


Source of data
- OLTP System: operational data; OLTPs are the original source of the data
- OLAP System: consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: to control and run fundamental business tasks
- OLAP System: decision support

What the data reveals
- OLTP System: a snapshot of ongoing business processes
- OLAP System: multi-dimensional views of various kinds of business activities

Inserts and updates
- OLTP System: short and fast inserts and updates initiated by end users
- OLAP System: periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow efficient and convenient storage and retrieval of data that is intimately related and that is stored, viewed and analyzed from different perspectives (dimensions).

- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.

RDBMS v/s MDDB:


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

Increased complexity: 27 x 4 = 108 cells

MDDB

[Figure: Sales Volumes cube with the dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde).]

3 x 3 x 3 = 27 cells
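The cell counts above can be reproduced with a small sketch that stores the same sales volumes once as relational rows and once as a three-dimensional array. The helper names (`set_volume`, `get_volume`, `slice_by_model`) are hypothetical, and only the three Mini Van/Blue values from the table are filled in.

```python
from itertools import product

MODELS = ["Mini Van", "Coupe", "Sedan"]
COLORS = ["Blue", "Red", "White"]
DEALERS = ["Clyde", "Gleason", "Carr"]

# Relational form: one row per (model, color, dealer) combination, 4 columns each.
relational_rows = [(m, c, d, 0) for m, c, d in product(MODELS, COLORS, DEALERS)]
relational_cells = len(relational_rows) * 4

# Multidimensional form: the position in the array encodes the three keys,
# so only the volume itself has to be stored in each cell.
cube = [[[0 for _ in DEALERS] for _ in COLORS] for _ in MODELS]
cube_cells = len(MODELS) * len(COLORS) * len(DEALERS)


def set_volume(model, color, dealer, volume):
    cube[MODELS.index(model)][COLORS.index(color)][DEALERS.index(dealer)] = volume


def get_volume(model, color, dealer):
    return cube[MODELS.index(model)][COLORS.index(color)][DEALERS.index(dealer)]


def slice_by_model(model):
    """Return the COLOR x DEALER plane of the cube for one model."""
    return cube[MODELS.index(model)]


set_volume("Mini Van", "Blue", "Clyde", 6)
set_volume("Mini Van", "Blue", "Gleason", 3)
set_volume("Mini Van", "Blue", "Carr", 2)

print(relational_cells, cube_cells)
print(get_volume("Mini Van", "Blue", "Clyde"))
print(slice_by_model("Mini Van"))
```

The `slice_by_model` helper also hints at why slicing is cheap in this layout: a slice is just one index into the outermost dimension.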

Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array.
- The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table.

Storage Space
- Very low space consumption compared to a relational DB.

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, then a blank cell is left behind.

Relational table:

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: Employee Age array with the dimensions LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) and EMPLOYEE # (31, 41, 23, 01, 14, 54, 03, 12, 33).]

Sales Volumes (MODEL by COLOR):

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2
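The sparsity of the employee example can be quantified with a short calculation. The sketch below builds the LAST NAME by EMPLOYEE # array from the nine employee rows and counts how many cells actually hold an age; representing the array as a dictionary is an implementation choice made here for brevity.

```python
employees = [
    ("Smith", "01", 21), ("Regan", "12", 19), ("Fox", "31", 63),
    ("Weld", "14", 31), ("Kelly", "54", 27), ("Link", "03", 56),
    ("Kranz", "41", 45), ("Lucas", "33", 41), ("Weiss", "23", 19),
]

last_names = [name for name, _, _ in employees]
emp_numbers = [emp for _, emp, _ in employees]

# One cell for every (last name, employee number) combination.
array = {(name, emp): None for name in last_names for emp in emp_numbers}
for name, emp, age in employees:
    array[(name, emp)] = age

filled = sum(1 for value in array.values() if value is not None)
density = filled / len(array)
print(f"{filled} of {len(array)} cells filled ({density:.1%})")
```

Because each last name interacts with exactly one employee number, only the nine "diagonal" cells carry data, while the Sales Volumes array for the same number of facts is completely filled.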

OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation


Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

(Rotate 90°)

View #2: Sales Volumes, COLOR by MODEL

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation


[Figure: six views (View #1 to View #6) of the Sales Volumes cube, each obtained from the previous one by a 90° rotation of the MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde) axes.]

A 3-dimensional array has 6 views.
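The view counts follow from the number of ways the dimensions can be ordered: n dimensions give n! arrangements, so 2 dimensions give 2 views and 3 dimensions give 6. The sketch below enumerates them and also shows the two-dimensional rotation as a transpose; the function names `views` and `rotate` are illustrative only.

```python
from itertools import permutations


def views(dimensions):
    """Every ordering of the dimensions corresponds to one view of the array."""
    return list(permutations(dimensions))


def rotate(table):
    """Rotate a 2-D view: rows become columns (a transpose)."""
    return [list(column) for column in zip(*table)]


views_2d = views(["MODEL", "COLOR"])
views_3d = views(["MODEL", "COLOR", "DEALERSHIP"])

# Sales Volumes, MODEL (rows) by COLOR (columns), as in View #1 of the 2-D example.
sales = [
    [6, 5, 4],  # Mini Van
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]
rotated = rotate(sales)  # COLOR (rows) by MODEL (columns)

print(len(views_2d), len(views_3d))
print(rotated)
```

No data is recomputed by a rotation; only the order in which the existing cells are presented changes.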

Features of OLAP - Slicing / Filtering


An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to the MODEL members Mini Van and Coupe, the COLOR members Normal Blue and Metal Blue, and the DEALERSHIP members Carr and Clyde.]

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION   DISTRICT   DEALERSHIP
Midwest  Chicago    Clyde, Gleason
         St. Louis  Carr, Levi
         Gary       Lucas, Bolton

Sales at Region / District / Dealership level

Moving up and moving down in a hierarchy is referred to as drill-up (roll-up) and drill-down respectively.
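Roll-up and drill-down over the organization hierarchy can be sketched as follows. The hierarchy comes from the slide, while the sales figures per dealership are invented purely so that the example has something to aggregate.

```python
from collections import defaultdict

# dealership -> (district, region), following the ORGANIZATION dimension.
ORGANIZATION = {
    "Clyde": ("Chicago", "Midwest"),
    "Gleason": ("Chicago", "Midwest"),
    "Carr": ("St. Louis", "Midwest"),
    "Levi": ("St. Louis", "Midwest"),
    "Lucas": ("Gary", "Midwest"),
    "Bolton": ("Gary", "Midwest"),
}

# Hypothetical sales per dealership (the lowest level of the hierarchy).
sales = {"Clyde": 10, "Gleason": 20, "Carr": 5, "Levi": 15, "Lucas": 8, "Bolton": 12}


def roll_up(sales, level):
    """Aggregate dealership sales to 'dealership', 'district' or 'region' level."""
    totals = defaultdict(int)
    for dealership, amount in sales.items():
        district, region = ORGANIZATION[dealership]
        key = {"dealership": dealership, "district": district, "region": region}[level]
        totals[key] += amount
    return dict(totals)


def drill_down(sales, district):
    """Show the dealership-level detail behind one district total."""
    return {d: v for d, v in sales.items() if ORGANIZATION[d][0] == district}


by_district = roll_up(sales, "district")
by_region = roll_up(sales, "region")
chicago_detail = drill_down(sales, "Chicago")
print(by_region, by_district, chicago_detail)
```

Each drill-down step replaces one total with the members one level below it, and the members always sum back to the total they replaced.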

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by region (East, West, Central) for the years 1999 and 2000; vertical axis from 0 to 200.]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) by region (East, West, Central) for the 1st to 4th quarters of 1999; vertical axis from 0 to 90.]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by region (East, West, Central) for January, February and March 1999; vertical axis from 0 to 20.]

Drill-down from Quarter to Month

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis. Database and application logic are provided as separate layers.

HOLAP - Hybrid OLAP
- The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

DOLAP - Desk OLAP
- Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Figure: MOLAP architecture with an OLAP cube, an OLAP calculation engine, and clients: web browser, OLAP tools, OLAP applications]

MOLAP - Features

o Powerful analytical capabilities (e.g., financial, forecasting, statistical)
o Aggregation and calculation capabilities
o Read/write analytic applications
o Specialized data structures for
    o Maximum query performance
    o Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture with a relational DW accessed through SQL, an MDDB-to-relational mapping, an OLAP calculation engine, and clients: web browser, OLAP tools, OLAP applications]

ROLAP - Features
o Three-tier hardware/software architecture:
    o GUI on client; multidimensional processing on mid-tier server; target database on database server
    o Processing split between mid-tier and database servers
o Ad hoc query capabilities to very large databases
o DW integration
o Data scalability
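The idea of mapping a multidimensional request onto standard SQL can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `sales` table, its rows and the helper `sales_by` are hypothetical, and a real ROLAP engine does far more than build a GROUP BY clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (model TEXT, color TEXT, dealership TEXT, volume INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Mini Van", "Blue", "Clyde", 6),
        ("Mini Van", "Red", "Clyde", 5),
        ("Coupe", "Blue", "Carr", 3),
        ("Coupe", "Blue", "Clyde", 4),
    ],
)


def sales_by(conn, *dimensions):
    """Translate a requested set of dimensions into a GROUP BY query."""
    allowed = {"model", "color", "dealership"}
    if not dimensions or not set(dimensions) <= allowed:
        raise ValueError(f"dimensions must be drawn from {sorted(allowed)}")
    columns = ", ".join(dimensions)
    sql = (
        f"SELECT {columns}, SUM(volume) FROM sales "
        f"GROUP BY {columns} ORDER BY {columns}"
    )
    return conn.execute(sql).fetchall()


by_model = sales_by(conn, "model")
by_model_color = sales_by(conn, "model", "color")
```

Column names are checked against a fixed whitelist before being placed in the SQL text, because identifiers cannot be passed as bound parameters.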


HOLAP - Combination of RDBMS and MDDB


[Figure: HOLAP architecture with an OLAP cube, a relational DW accessed through SQL, an OLAP calculation engine, and any client: web browser, OLAP tools, OLAP applications]

HOLAP - Features

o RDBMS used for detailed data stored in large databases
o MDDB used for fast, read/write OLAP analysis and calculations
o Scalability of RDBMS and MDDB performance
o Calculation engine provides full analysis features
o Source of data transparent to end user
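The HOLAP routing idea, summaries from the MDDB and detail from the RDBMS with the source hidden from the user, might be sketched as below. Both stores are stand-ins (a dictionary and a list), the figures are made up, and the function `query` is a hypothetical name, not part of any OLAP product.

```python
# Pre-computed summaries held in the "MDDB" (here just a dict); figures are made up.
mddb_summaries = {("Midwest", 1999): 120, ("Midwest", 2000): 150}

# Detailed rows held in the "RDBMS" (here just a list of tuples).
rdbms_rows = [
    ("Midwest", 1999, "Clyde", 70),
    ("Midwest", 1999, "Carr", 50),
    ("Midwest", 2000, "Clyde", 90),
    ("Midwest", 2000, "Carr", 60),
]


def query(region, year, dealership=None):
    """Answer from the MDDB when a summary exists, else fall back to detail rows.

    Returns (value, source) so the routing decision is visible to the caller.
    """
    if dealership is None and (region, year) in mddb_summaries:
        return mddb_summaries[(region, year)], "MDDB"
    total = sum(
        amount
        for r, y, d, amount in rdbms_rows
        if r == region and y == year and (dealership is None or d == dealership)
    )
    return total, "RDBMS"


summary_answer = query("Midwest", 1999)
detail_answer = query("Midwest", 1999, dealership="Carr")
```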


Architecture Comparison

Definition
    MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
    ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
    HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
    MOLAP: Good design: 3 - 10 times
    ROLAP: No sparsity
    HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
    MOLAP: High (may go beyond control; estimation is very important)
    ROLAP: To the necessary extent
    HOLAP: To the necessary extent

Query execution speed
    MOLAP: Fast (depends upon the size of the MDDB)
    ROLAP: Slow
    HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
    MOLAP: Medium: MDDB server + large disk space cost
    ROLAP: Low: only RDBMS + disk space cost
    HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
    MOLAP: Small transactional data + complex model + frequent summary analysis
    ROLAP: Very large transactional data that needs to be viewed / sorted
    HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

o Oracle Express Products
o Hyperion Essbase
o Cognos - PowerPlay
o Seagate - Holos
o SAS
o MicroStrategy - DSS Agent
o Informix MetaCube
o Brio Query
o Business Objects / Web Intelligence

Sample OLAP Applications

o Sales Analysis
o Financial Analysis
o Profitability Analysis
o Performance Analysis
o Risk Management
o Profiling & Segmentation
o Scorecard Application
o NPA Management
o Strategic Planning
o Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


The cost of a software defect rises exponentially the later in the development lifecycle it is found. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.

The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

Difference in Testing Data Warehouse and Transaction System

Data warehouse testing is different on the following counts:
o User-triggered vs. system-triggered
o Volume of test data
o Possible scenarios / test cases
o Programming for testing challenge

Difference in Testing Data Warehouse and Transaction System

User-Triggered vs. System-Triggered
In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles cover the system-triggered scenarios (such as billing or valuation).

Difference In Testing Data warehouse and Transaction System


Volume of Test Data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, because one tries to cover the maximum possible combinations of dimensions and facts.

Possible Scenarios / Test Cases
In a data warehouse, the number of permutations and combinations one can possibly test is virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

Difference In Testing Data warehouse and Transaction System


Programming for Testing Challenge
In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
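A stand-alone comparison script of this kind might look like the sketch below. The business rule in `transform`, the field names and the sample rows are all hypothetical; the point is only that the script recomputes the expected result from the source rows and compares it with what was loaded.

```python
def transform(row):
    """Hypothetical business rule: tidy the code and convert cents to dollars."""
    return {
        "id": row["id"],
        "code": row["code"].strip().upper(),
        "amount": row["cents"] / 100,
    }


def compare(source_rows, target_rows):
    """Return a list of human-readable mismatches between expected and loaded rows."""
    expected = {row["id"]: transform(row) for row in source_rows}
    loaded = {row["id"]: row for row in target_rows}
    problems = []
    for key in sorted(expected.keys() - loaded.keys()):
        problems.append(f"missing in target: id={key}")
    for key in sorted(loaded.keys() - expected.keys()):
        problems.append(f"unexpected in target: id={key}")
    for key in sorted(expected.keys() & loaded.keys()):
        if expected[key] != loaded[key]:
            problems.append(
                f"mismatch for id={key}: expected {expected[key]}, got {loaded[key]}"
            )
    return problems


source = [{"id": 1, "code": " ab ", "cents": 1250}, {"id": 2, "code": "cd", "cents": 300}]
target = [{"id": 1, "code": "AB", "amount": 12.5}, {"id": 2, "code": "CD", "amount": 3.0}]

print(compare(source, target))      # no differences expected
print(compare(source, target[:1]))  # the row with id=2 was not loaded
```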


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
o 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
o 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP

Testing phases consist of:
o Requirements testing
o Unit testing
o Integration testing
o Performance testing
o Acceptance testing

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
o Are the requirements Complete?
o Are the requirements Singular?
o Are the requirements Ambiguous?
o Are the requirements Developable?
o Are the requirements Testable?

Unit Testing

Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
o Whether the ETLs access and pick up the right data from the right source
o Whether all data transformations are correct according to the business rules and the data warehouse is correctly populated with the transformed data
o Testing the rejected records that don't fulfil the transformation rules
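A unit test for an ETL mapping, including its rejected records, can be sketched as follows. The mapping `load_customers` and its two rules (non-empty name, non-negative integer age) are hypothetical stand-ins for real business rules.

```python
def load_customers(rows):
    """Hypothetical mapping: accept rows with a non-empty name and a valid age."""
    accepted, rejected = [], []
    for row in rows:
        name = (row.get("name") or "").strip()
        age = row.get("age")
        if not name:
            rejected.append((row, "missing name"))
        elif not isinstance(age, int) or age < 0:
            rejected.append((row, "invalid age"))
        else:
            accepted.append({"name": name.title(), "age": age})
    return accepted, rejected


rows = [
    {"name": " joe ", "age": 34},
    {"name": "", "age": 20},
    {"name": "sally", "age": -1},
]
accepted, rejected = load_customers(rows)

# Every input row must end up either loaded or rejected, never silently dropped.
assert len(accepted) + len(rejected) == len(rows)
```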

Unit Testing
Unit testing the report data:
o Verify report data with source: Data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
o Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
o Derivation formulae/calculation rules should be verified.

Integration Testing
Integration testing will involve the following:
o Sequence of ETL jobs in batch
o Initial loading of records into the data warehouse
o Incremental loading of records at a later date, to verify the newly inserted or updated data
o Testing the rejected records that don't fulfil the transformation rules
o Error log generation
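The initial and incremental load checks can be sketched with a small upsert routine. The warehouse is modeled as a dictionary keyed by `id`; the function `apply_load` and the sample batches are hypothetical.

```python
def apply_load(warehouse, batch):
    """Upsert a batch keyed by 'id' and report how many rows were inserted or updated."""
    inserted = updated = 0
    for row in batch:
        key = row["id"]
        if key not in warehouse:
            inserted += 1
        elif warehouse[key] != row:
            updated += 1
        warehouse[key] = row
    return inserted, updated


warehouse = {}
initial = [{"id": 1, "city": "sfo"}, {"id": 2, "city": "la"}]
increment = [{"id": 2, "city": "nyc"}, {"id": 3, "city": "sfo"}]

initial_counts = apply_load(warehouse, initial)      # initial load
increment_counts = apply_load(warehouse, increment)  # later incremental load
rerun_counts = apply_load(warehouse, increment)      # re-running should change nothing
```

Checking that a re-run reports no inserts or updates is one simple way to confirm that the incremental job can be restarted safely.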


Performance Testing
Performance testing should check for:
o ETL processes completing within the time window
o Monitoring and measuring the data quality issues
o Refresh times for standard/complex reports

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.

Questions


Thank You


Components of Warehouse
Source Tables: Real-time, volatile data in relational databases used for transaction processing (OLTP). Sources can be any relational databases or flat files.
ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources into the target.
Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
End-user tools for analysis and reporting: Produce reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture


This is a basic design, in which source files are loaded into a warehouse and users query the data for different purposes.

This design has a staging area, where the data is loaded and tested after cleansing and transformation. It is then loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.

Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
Entity: An object that can be observed and classified based on its properties and characteristics, such as employee, book or student.
Relationship: Relates entities to other entities.

Different Perspectives of Data Modeling
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of Dimensional Data Models most commonly used:
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
Fact Table: A table that contains the measures of interest.
Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
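One way surrogate keys can support a slowly changing dimension is to give every distinct version of a dimension row its own key. The sketch below assumes that approach (often called a Type 2 change, which the slides do not name); the class `SurrogateKeyGenerator` and the sample customer are hypothetical.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hand out a new integer key for each distinct version of a dimension row."""

    def __init__(self):
        self._next_key = count(1)
        self._keys = {}  # (natural_key, attributes) -> surrogate key

    def key_for(self, natural_key, attributes):
        version = (natural_key, tuple(sorted(attributes.items())))
        if version not in self._keys:
            self._keys[version] = next(self._next_key)
        return self._keys[version]


keys = SurrogateKeyGenerator()
first = keys.key_for(53, {"name": "joe", "city": "sfo"})
same = keys.key_for(53, {"name": "joe", "city": "sfo"})  # unchanged row, same key
moved = keys.key_for(53, {"name": "joe", "city": "la"})  # changed attribute, new key
```

Facts recorded before the move keep pointing at the old key, so history is preserved even though the natural key (53) is the same.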

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
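The star schema above can be tried out with Python's built-in `sqlite3` module. The table and column names and the rows follow the slide (with `oderId` read as `orderId`); the column types and the query are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'), (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO sale VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                        ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                        ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Total amount per product name and store city:
# the fact table joins directly to each dimension table.
rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.amt)
    FROM sale AS f
    JOIN product AS p ON p.prodId = f.prodId
    JOIN store   AS s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
```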


Snowflake Schema
store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized, and are frequently designed at a level of normalization short of third normal form.
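In a snowflake schema, a query has to walk the chain of lookup tables to reach higher-level attributes. The sketch below loads the four tables from the slide into an in-memory SQLite database; the column types and the query are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT,
                     regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY,
                     cityId TEXT REFERENCES city(cityId),
                     tId TEXT REFERENCES sType(tId), mgr TEXT);
INSERT INTO region VALUES ('north', 'cold region'), ('south', 'warm region');
INSERT INTO city VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
INSERT INTO sType VALUES ('t1', 'small', 'downtown'), ('t2', 'large', 'suburbs');
INSERT INTO store VALUES ('s5', 'sfo', 't1', 'joe'), ('s7', 'sfo', 't2', 'fred'),
                         ('s9', 'la', 't1', 'nancy');
""")

# Reaching the region name from a store takes two hops: store -> city -> region.
stores = conn.execute("""
    SELECT st.storeId, st.mgr, t.size, c.cityId, r.name
    FROM store AS st
    JOIN sType  AS t ON t.tId = st.tId
    JOIN city   AS c ON c.cityId = st.cityId
    JOIN region AS r ON r.regId = c.regId
    ORDER BY st.storeId
""").fetchall()
```

The extra joins are the price of the additional normalization compared with a star schema, where these attributes would sit in a single dimension table.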


Overview of Data Cleansing


The Need For Data Quality


o Difficulty in decision making
o Time delays in operation
o Organizational mistrust
o Data ownership conflicts
o Customer attrition
o Costs associated with
    o error detection
    o error rework
    o customer service
    o fixing customer problems

Understand Information Flow In Organization
o Identify authoritative data sources
o Interview employees & customers

Identify Potential Problem Areas & Assess Impact
o Data entry points
o Cost of bad data

Measure Quality Of Data
o Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
o Use data cleansing tools to clean data at the source
o Load only clean data into the data warehouse

Continuous Monitoring
o Schedule periodic cleansing of source data

Identify Areas of Improvement
o Identify & correct cause of defects
o Refine data capture mechanisms at source
o Educate users on importance of DQ
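Measuring data quality for missing and duplicate values can be sketched with a small profiling routine. This is a hand-written illustration, not one of the commercial tools discussed in this section; the function `profile`, the field names and the sample rows are hypothetical.

```python
from collections import Counter


def profile(rows, key, required):
    """Count rows with missing required fields and list keys that occur more than once."""
    missing = [
        row for row in rows
        if any(row.get(field) in (None, "") for field in required)
    ]
    key_counts = Counter(row.get(key) for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return {"rows": len(rows), "missing": len(missing), "duplicate_keys": duplicates}


customers = [
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 81, "name": "", "city": "sfo"},
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 111, "name": "sally", "city": None},
]
report = profile(customers, key="custId", required=("name", "city"))
print(report)
```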

Customized Programs
Strengths:
o Addresses specific needs
o No bulky one-time investment
Limitations:
o Tons of custom programs in different environments are difficult to manage
o Minor alterations demand coding efforts

Data Quality Assessment Tools
Strength:
o Provide automated assessment
Limitation:
o No measure of data accuracy

Business Rule Discovery Tools
Strengths:
o Detect correlation in data values
o Can detect patterns of behavior that indicate fraud
Limitations:
o Not all variables can be discovered
o Some discovered rules might not be pertinent
o There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
Strengths:
o Usually are integrated packages with cleansing features as add-ons
Limitations:
o Error prevention at source is usually absent
o The ETL tools have limited cleansing facilities

Business Rule Discovery Tools
o Integrity Data Reengineering Tool from Vality Technology
o Trillium Software System from Harte-Hanks Data Technologies
o Migration Architect from DB Star

Data Reengineering & Cleansing Tools
o Carlton Pureview from Oracle
o ETI-Extract from Evolutionary Technologies
o PowerMart from Informatica Corp
o Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
o Migration Architect, Evoke Axio from Evoke Software
o Wizrule from Wizsoft

Name & Address Cleansing Tools
o Centrus Suite from Sagent
o I.d.centric from First Logic

Data Extraction, Transformation, Load


ETL Architecture

Data Extraction:
o Rummages through a file or database
o Uses some criteria for selection
o Identifies qualified data
o Transports the data onto another file or database

Data Transformation:
o Integrating dissimilar data types
o Changing codes
o Adding a time attribute
o Summarizing data
o Calculating derived values
o Renormalizing data

Data Extraction Cleanup:
o Restructuring of records or fields
o Removal of operational-only data
o Supply of missing field values
o Data integrity checks
o Data consistency and range checks, etc.

Data Loading:
o Initial and incremental loading
o Updating of metadata
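The four stages can be strung together in a few lines. The sketch below is a toy pipeline under stated assumptions: the source is a CSV string with made-up rows, the cleanup step simply rejects records with missing fields, the transformation summarizes by store and month, and the target is a dictionary. All function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

RAW = """order_id,date,store,amount
o100,1997-07-01,c1,12
o102,1997-07-02,c1,11
o103,1997-07-02,,7
o105,1997-08-03,c3,50
"""


def extract(text):
    """Extraction: read records from the source (here a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleanup: reject records with missing fields; convert amount to a number."""
    good = []
    for row in rows:
        if all(row.values()):
            good.append({**row, "amount": int(row["amount"])})
    return good


def transform(rows):
    """Transformation: summarize amount per (store, month)."""
    totals = defaultdict(int)
    for row in rows:
        month = row["date"][:7]
        totals[(row["store"], month)] += row["amount"]
    return dict(totals)


def load(target, summary):
    """Loading: add the new summary to the target, keyed by (store, month)."""
    for key, amount in summary.items():
        target[key] = target.get(key, 0) + amount
    return target


warehouse = load({}, transform(clean(extract(RAW))))
print(warehouse)
```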


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Data Warehouse Definition by W. H. Inmon

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Nonvolatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Components of Warehouse
Source Tables : These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End - user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

626

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

627 222

2009 Wipro Ltd - Confidential 2222 Wipro Ltd - Confidential

628 222

2009 Wipro Ltd - Confidential Wipro 2222 Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

629 222

2009 Wipro Ltd - Confidential Wipro 2222 Ltd - Confidential

Data Modeling

Effective way of using a Data Warehouse

630

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model

Entity : Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship : relating entities to other entities.

Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model o

Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema

631

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling: Dimension: A category of information. For example, the time dimension. Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension. Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day. Fact Table: A table that contains the measures of interest. Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.

632 222

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by 2009 Wipro Ltd Confidential lookup tables. Attributes --are the non-key columns in the lookup tables. 2222 Wipro Ltd Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

633 222

2009 Wipro Ltd - Confidential Wipro 2222 Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model

Entity : Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship : relating entities to other entities.

Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model o

Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema

634 222

2009 Wipro Ltd - Confidential Wipro 2222 Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling: Dimension: A category of information. For example, the time dimension. Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension. Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day. Fact Table: A table that contains the measures of interest. Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.

635 222

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by 2009 Wipro Ltd Confidential Wipro lookup tables. Attributes --are the non-key columns in the lookup tables. 2222 Ltd Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId city c1 nyc c2 sfo c3 la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

636 22 2

2009 Wipro Ltd - Confidential Wipro 2222 Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 cityId sfo la size small large pop 1M 5M location downtown suburbs regId north south

Dimension Table
city

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.

region

regId north south

name cold region warm region

637

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

638

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

The Need For Data Quality


Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with error detection error rework customer service fixing customer problems

639

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Understand Information Flow In Organization

Identify authoritative data sources Identify authoritative data sources Interview Employees & Customers Interview Employees & Customers Data Entry Points Data Entry Points Cost of bad data Cost of bad data

Id e n tify P o te n tia l Id e n tify P o te n tia l P ro b le m A re a s & A s s e s P ro b le m A re a s & A ss e s Im p a ct Im p a ct

M e a s u re Q u a lity O f M e a s u re Q u a lity O f D a ta D a ta C le a n & L o a d C le a n & L o a d D a ta D a ta

Use business rule discovery tools to identify data with Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect inconsistent, missing, incomplete, duplicate or incorrect values values
Use data cleansing tools to clean data at the source Use data cleansing tools to clean data at the source Load only clean data into the data warehouse Load only clean data into the data warehouse

C o n tin u o u s M o n ito rin g C o n tin u o u s M o n ito rin g

Schedule Periodic Cleansing of Source Data Schedule Periodic Cleansing of Source Data

Id e n tify A re a s o f Id e n tify A re a s o f Im p ro v e m e n t Im p ro v e m e n t

Identify & Correct Cause of Defects Identify & Correct Cause of Defects Refine data capture mechanisms at source Refine data capture mechanisms at source Educate users on importance of DQ Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

640

Customized Programs Strengths: Addresses specific needs No bulky one time investment Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools Strength Provide automated assessment Limitation No measure of data accuracy

641

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Business Rule Discovery Tools
Strengths:
- Detect correlation in data values
- Can detect patterns of behavior that indicate fraud
Limitations:
- Not all variables can be discovered
- Some discovered rules might not be pertinent
- There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
Strengths:
- Usually are integrated packages with cleansing features as add-on
Limitations:
- Error prevention at source is usually absent
- The ETL tools have limited cleansing facilities


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture


Data extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data extraction cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data loading:
- Initial and incremental loading
- Updating of metadata
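
The extraction, transformation, cleanup and loading steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the field names, the gender code table and the selection criterion (active accounts only) are all hypothetical and do not come from any particular ETL tool.

```python
from datetime import date

# Hypothetical source rows; field names and code values are illustrative only.
SOURCE_ROWS = [
    {"cust_id": 1, "gender": "M", "amount": "120.50", "status": "ACTIVE"},
    {"cust_id": 2, "gender": "f", "amount": "80.00", "status": "CLOSED"},
    {"cust_id": 3, "gender": None, "amount": "15.25", "status": "ACTIVE"},
]

# Hypothetical code table used for the "changing codes" step.
GENDER_CODES = {"M": "Male", "F": "Female"}


def extract(rows):
    """Select qualifying rows using a criterion (here: active accounts only)."""
    return [r for r in rows if r["status"] == "ACTIVE"]


def transform(rows, load_date):
    """Change codes, convert types, supply missing values, add a time attribute."""
    out = []
    for r in rows:
        code = (r["gender"] or "").upper()
        out.append({
            "cust_id": r["cust_id"],
            "gender": GENDER_CODES.get(code, "Unknown"),  # supply missing value
            "amount": float(r["amount"]),                 # integrate data types
            "load_date": load_date,                       # time attribute
        })
    return out


def load(target, rows):
    """Append rows to the target (an incremental load) and return the row count."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(warehouse, transform(extract(SOURCE_ROWS), date(2009, 1, 31)))
```

Operational-only data (the `status` field) is dropped during transformation, which mirrors the cleanup step of removing fields the warehouse does not need.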


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Major components involved in ETL Processing


- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Meta data management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: The process of reading data from a database.
- Transform: The process of converting the extracted data.
- Load: The process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes


Importance Of Metadata

Locating information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much misinterpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management

- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies
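
Two of these requirements, capturing transformation rules and providing change impact analysis, can be illustrated with a small sketch. It assumes a hypothetical `MappingRule` record and made-up source and target column names; real metadata repositories store far richer structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRule:
    """One captured transformation rule (all names below are hypothetical)."""
    source: str   # operational column, written as system.table.column
    target: str   # warehouse column, written as schema.table.column
    rule: str     # plain-language description of the transformation


RULES = [
    MappingRule("crm.customer.gender_cd", "dw.dim_customer.gender",
                "decode M/F to Male/Female"),
    MappingRule("erp.orders.amount", "dw.fact_sales.amount",
                "convert to reporting currency"),
    MappingRule("erp.orders.order_dt", "dw.fact_sales.date_key",
                "look up surrogate key in the date dimension"),
]


def impacted_targets(rules, source_prefix):
    """Change impact analysis: which warehouse columns depend on a source?"""
    return sorted(r.target for r in rules if r.source.startswith(source_prefix))


# If the ERP orders table changes, these warehouse columns need review.
affected = impacted_targets(RULES, "erp.orders")
```

Keeping the rules as data rather than burying them in ETL code is what makes this kind of lookup possible.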

Consumers of Metadata

Technical users
- Warehouse administrator
- Application developer

Business users (business metadata)
- Meanings
- Definitions
- Business rules

Software tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools

Third Party Bridging Tools

Oracle Exchange
- Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik-Toolbus
- Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
- Hub and spoke solution for enabling metadata interoperability
- Ardent focussing on own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
- Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools

Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools

Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle, Carleton Group, CA-PLATINUM Technology (founding member), Viasoft


OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools


OLAP: On-Line Analytical Processing

- OLAP is a technology that lets users view aggregated measures (such as Maturity Amount or Interest Rate) along a set of related parameters called dimensions (such as Product, Organization or Customer)
- The term is used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers


Distinction between OLTP and OLAP

Source of data
- OLTP system: Operational data; OLTPs are the original source of the data
- OLAP system: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP system: To control and run fundamental business tasks
- OLAP system: Decision support

What the data reveals
- OLTP system: A snapshot of ongoing business processes
- OLAP system: Multi-dimensional views of various kinds of business activities

Inserts and updates
- OLTP system: Short and fast inserts and updates initiated by end users
- OLAP system: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions).

- A hypercube represents a collection of multidimensional data
- The edges of the cube are called dimensions
- Individual items within each dimension are called members
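
As a rough illustration of these terms, a hypercube can be modelled as a mapping from one member per dimension to a measure value. The dimension names, members and sales figures below are illustrative assumptions, and a real MDDB uses specialised array storage rather than a Python dictionary.

```python
# Dimension names, in the order used for cell coordinates (illustrative).
AXES = ("model", "color", "dealer")

# Each key holds one member per dimension; the value is the measure
# (sales volume). Only a few cells are filled in for this sketch.
cube = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Blue", "Gleason"): 3,
    ("Mini Van", "Blue", "Carr"): 2,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Mini Van", "Red", "Gleason"): 3,
    ("Mini Van", "Red", "Carr"): 1,
}


def total(cells, **fixed):
    """Sum the measure over all cells whose coordinates match the fixed members."""
    result = 0
    for coords, value in cells.items():
        cell = dict(zip(AXES, coords))
        if all(cell[dim] == member for dim, member in fixed.items()):
            result += value
    return result


blue_mini_vans = total(cube, model="Mini Van", color="Blue")
clyde_sales = total(cube, dealer="Clyde")
all_sales = total(cube)
```

Fixing members on some dimensions and aggregating over the rest is the basic operation behind viewing the same data from different perspectives.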


RDBMS v/s MDDB:

Relational DBMS (increased complexity): sales volumes stored as rows

MODEL          COLOR   DEALER    VOL.
MINI VAN       BLUE    Clyde     6
MINI VAN       BLUE    Gleason   3
MINI VAN       BLUE    Carr      2
MINI VAN       RED     Clyde     5
MINI VAN       RED     Gleason   3
MINI VAN       RED     Carr      1
MINI VAN       WHITE   Clyde     3
MINI VAN       WHITE   Gleason   1
MINI VAN       WHITE   Carr      4
SPORTS COUPE   BLUE    Clyde     3
SPORTS COUPE   BLUE    Gleason
SPORTS COUPE   BLUE    Carr      3
SPORTS COUPE   RED     Clyde     4
SPORTS COUPE   RED     Gleason   3
SPORTS COUPE   RED     Carr      6
SPORTS COUPE   WHITE   Clyde     2
SPORTS COUPE   WHITE   Gleason   3
SPORTS COUPE   WHITE   Carr      5
SEDAN          BLUE    Clyde     4
SEDAN          BLUE    Gleason   3
SEDAN          BLUE    Carr      2
...

27 x 4 = 108 cells

MDDB: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde)

3 x 3 x 3 = 27 cells


Benefits of MDDB over RDBMS

Ease of data presentation & navigation
- A great deal of information is gleaned immediately upon direct inspection of the array
- User is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table

Storage space
- Very low space consumption compared to relational DB

Performance
- Gives much better performance. Relational DB may give comparable results only through database tuning (indexing, keys etc.), which may not be possible for ad-hoc queries

Ease of maintenance
- No overhead, as data is stored in the same way it is viewed. In relational DB, indexes, sophisticated joins etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, then a blank cell is left behind.

Relational table:

LAST NAME   EMP#   AGE
SMITH       01     21
REGAN       12     19
FOX         31     63
WELD        14     31
KELLY       54     27
LINK        03     56
KRANZ       41     45
LUCAS       33     41
WEISS       23     19

[Figure: Employee Age array with LAST NAME on one axis and EMPLOYEE # on the other; each employee fills a single cell and every other cell is blank]

For contrast, the Sales Volumes array (MODEL x COLOR) is fully populated:

MODEL      Blue   Red   White
Mini Van   6      5     4
Coupe      3      5     5
Sedan      4      3     2
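
The sparsity of the employee example can be quantified with a short calculation. The sketch below takes the nine employees from the slide and assumes each last name and employee number is a separate member on its own axis, as the figure suggests.

```python
# (last name, employee number, age) taken from the employee table above.
EMPLOYEES = [
    ("Smith", "01", 21), ("Regan", "12", 19), ("Fox", "31", 63),
    ("Weld", "14", 31), ("Kelly", "54", 27), ("Link", "03", 56),
    ("Kranz", "41", 45), ("Lucas", "33", 41), ("Weiss", "23", 19),
]

# Members on each axis of the LAST NAME x EMPLOYEE # array.
last_names = {name for name, _, _ in EMPLOYEES}
emp_numbers = {emp for _, emp, _ in EMPLOYEES}

total_cells = len(last_names) * len(emp_numbers)   # size of the full array
populated_cells = len(EMPLOYEES)                   # one age value per employee
density = populated_cells / total_cells            # fraction of cells in use

print(f"{populated_cells} of {total_cells} cells populated ({density:.0%})")
```

Because last name and employee number identify the same thing, adding either one as a dimension multiplies the number of cells without adding any information.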


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data


Features of OLAP - Rotation

Complex queries & sorts in a relational environment are translated to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

MODEL      Blue   Red   White
Mini Van   6      5     4
Coupe      3      5     5
Sedan      4      3     2

View #2 (rotate 90 degrees): Sales Volumes, COLOR by MODEL

COLOR   Mini Van   Coupe   Sedan
Blue    6          3       4
Red     5          5       3
White   4          5       2

A 2-dimensional array has 2 views.



Features of OLAP - Rotation

[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown in Views #1 to #6; each view is obtained by rotating the previous one by 90 degrees so that a different pair of dimensions faces the viewer]

A 3-dimensional array has 6 views.
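
The view counts follow from the number of ways the dimensions can be ordered: n dimensions give n! orderings, so 2 views for a 2-dimensional array and 6 for a 3-dimensional one. The sketch below counts these orderings and rotates the 2-dimensional Sales Volumes table from the earlier rotation slide; the helper names `views` and `rotate` are made up for this illustration.

```python
from itertools import permutations


def views(dimensions):
    """Every ordering of the dimensions corresponds to one view of the array."""
    return list(permutations(dimensions))


# Sales volumes by MODEL (rows) and COLOR (columns), from the 2-D rotation slide.
sales = {
    "Mini Van": {"Blue": 6, "Red": 5, "White": 4},
    "Coupe": {"Blue": 3, "Red": 5, "White": 5},
    "Sedan": {"Blue": 4, "Red": 3, "White": 2},
}


def rotate(table):
    """Swap the row and column dimensions of a 2-D table of nested dicts."""
    rotated = {}
    for row, columns in table.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated


two_d_views = views(("model", "color"))
three_d_views = views(("model", "color", "dealership"))
by_color = rotate(sales)   # COLOR becomes the row dimension
```

No data is re-queried or re-sorted during rotation; only the assignment of dimensions to rows and columns changes.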



Features of OLAP - Slicing / Filtering

MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: Sales Volumes cube sliced down to MODEL (Mini Van, Coupe), COLOR (Normal Blue, Metal Blue) and DEALERSHIP (Carr, Clyde)]
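
A slice of this kind can be sketched as a filter that keeps only the cells whose members fall within the requested subsets. The cube contents below are invented for illustration; only the dimension and member names come from the slide.

```python
AXES = ("model", "color", "dealership")

# Hypothetical sales volumes keyed by (model, color, dealership).
cube = {
    ("Mini Van", "Normal Blue", "Carr"): 2,
    ("Mini Van", "Metal Blue", "Clyde"): 4,
    ("Coupe", "Normal Blue", "Clyde"): 3,
    ("Sedan", "Normal Blue", "Carr"): 1,
    ("Coupe", "Red", "Gleason"): 5,
}


def slice_cube(cells, **keep):
    """Return the sub-cube whose member on each named dimension is allowed."""
    result = {}
    for coords, value in cells.items():
        cell = dict(zip(AXES, coords))
        if all(cell[dim] in allowed for dim, allowed in keep.items()):
            result[coords] = value
    return result


subset = slice_cube(
    cube,
    model={"Mini Van", "Coupe"},
    color={"Normal Blue", "Metal Blue"},
    dealership={"Carr", "Clyde"},
)
```

Dimensions that are not named in the call are left unrestricted, so the same helper covers both a single-dimension slice and a multi-dimension dice.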

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION     DISTRICT    DEALERSHIP
Midwest    Chicago     Clyde, Gleason
           St. Louis   Carr, Levi
           Gary        Lucas, Bolton

Sales at Region / District / Dealership level

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
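
Roll-up and drill-down over the organization dimension can be sketched as aggregation at different levels of the hierarchy. The nesting of districts and dealerships below is one reading of the slide, and the sales figures per dealership are hypothetical.

```python
# Region -> district -> dealerships (assumed reading of the hierarchy slide).
HIERARCHY = {
    "Midwest": {
        "Chicago": ["Clyde", "Gleason"],
        "St. Louis": ["Carr", "Levi"],
        "Gary": ["Lucas", "Bolton"],
    }
}

# Hypothetical sales volume per dealership (the lowest level of detail).
SALES = {"Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 3, "Lucas": 4, "Bolton": 6}


def district_totals(region):
    """Roll dealership sales up to district level."""
    return {
        district: sum(SALES[dealer] for dealer in dealers)
        for district, dealers in HIERARCHY[region].items()
    }


def region_total(region):
    """Roll district totals up to region level."""
    return sum(district_totals(region).values())


def drill_down(region, district):
    """Drill down from a district to the dealerships beneath it."""
    return {dealer: SALES[dealer] for dealer in HIERARCHY[region][district]}


midwest = region_total("Midwest")
by_district = district_totals("Midwest")
chicago_detail = drill_down("Midwest", "Chicago")
```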


OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M), scale 0 to 200, by region (East, West, Central) for Year 1999 and Year 2000]


OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M), scale 0 to 90, by region (East, West, Central) for the 1st, 2nd, 3rd and 4th quarters of Year 1999]

Drill-down from Year to Quarter



OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M), scale 0 to 20, by region (East, West, Central) for January, February and March of Year 1999]

Drill-down from Quarter to Month


Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic provided as separate layers

HOLAP - Hybrid OLAP
- OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop
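
The HOLAP routing described above (try the MDDB first, fall back to the RDBMS) can be sketched as follows. Both stores are plain Python structures with hypothetical contents; an actual OLAP server would issue SQL against the warehouse instead of scanning a list.

```python
# Pre-computed summaries held in the multidimensional store (hypothetical).
MDDB_SUMMARY = {("East", 1999): 150.0, ("West", 1999): 120.0}

# Detail rows held in the relational warehouse (hypothetical).
RDBMS_ROWS = [
    {"region": "Central", "year": 1999, "inflow": 40.0},
    {"region": "Central", "year": 1999, "inflow": 35.0},
    {"region": "East", "year": 1999, "inflow": 150.0},
]


def query_inflow(region, year):
    """Answer from the MDDB when a summary exists, otherwise aggregate detail rows."""
    key = (region, year)
    if key in MDDB_SUMMARY:
        return MDDB_SUMMARY[key], "MDDB"
    total = sum(
        row["inflow"]
        for row in RDBMS_ROWS
        if row["region"] == region and row["year"] == year
    )
    return total, "RDBMS"


east = query_inflow("East", 1999)        # served from the summary store
central = query_inflow("Central", 1999)  # computed on the fly from detail rows
```

The caller receives the same kind of answer either way, which is the sense in which the source of data stays transparent to the end user.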

MOLAP - MDDB storage

[Diagram: OLAP cube feeding the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]


MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  - Maximum query performance
  - Optimum space utilization


ROLAP - Standard SQL storage

[Diagram: relational DW queried through SQL via an MDDB-relational mapping by the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]

ROLAP - Features
- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier & database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability


HOLAP - Combination of RDBMS and MDDB

[Diagram: OLAP cube and relational DW (queried through SQL) both feeding the OLAP calculation engine, which serves any client: web browsers, OLAP tools and OLAP applications]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user


Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction level data + summary in MDDB
- ROLAP: Relational OLAP = transaction level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: With good design, 3-10 times
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum; if the data is fetched from the RDBMS then it is like ROLAP, otherwise like MOLAP

Cost
- MOLAP: Medium: MDDB server + large disk space cost
- ROLAP: Low: only RDBMS + disk space cost
- HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview

- There is an exponentially increasing cost associated with finding software defects later in the development lifecycle.
- In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system.


Difference In Testing Data Warehouse and Transaction System

Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge


Difference In Testing Data Warehouse and Transaction System

User-triggered vs. system-triggered
- In a data warehouse, most of the testing is system-triggered.
- In production/source systems, most testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request).
- Very few test cycles there cover system-triggered scenarios (such as billing or valuation).


Difference In Testing Data Warehouse and Transaction System

Volume of test data
- The test data in a transaction system is a very small sample of the overall production data.
- A data warehouse typically has a large volume of test data, because one tries to cover the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases
- In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.


Difference In Testing Data Warehouse and Transaction System

Programming for testing challenge
- In transaction systems, users/business analysts typically test the output of the system.
- In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts.
- These scripts compare the data before transformation with the data after transformation.
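
A stand-alone comparison script of this kind might look like the sketch below. It is an assumption about what such a script checks (row counts, missing keys and a measure total); the function name `reconcile` and the sample rows are hypothetical.

```python
def reconcile(source_rows, target_rows, key, measure):
    """Compare pre-transformation (source) and post-transformation (target) data."""
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}
    source_total = sum(row[measure] for row in source_rows)
    target_total = sum(row[measure] for row in target_rows)
    return {
        "row_count_matches": len(source_rows) == len(target_rows),
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
        "measure_total_matches": abs(source_total - target_total) < 1e-6,
    }


# Hypothetical sample: the load silently dropped the row with id 3.
source = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": 50.0},
    {"id": 3, "amount": 25.0},
]
target = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": 50.0},
]

report = reconcile(source, target, key="id", measure="amount")
```

In practice the two row sets would come from queries against the source system and the loaded area rather than from in-memory lists.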


Data Warehouse Testing Process

Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP

Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?


Unit Testing

Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether ETLs are accessing and picking up the right data from the right source
- Whether all the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that don't fulfil transformation rules
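
A unit test for rejected records can be sketched as below. The business rule (a non-negative numeric amount and a known country code), the field names and the helper `apply_rule` are all hypothetical; the point is only that every input row must end up either loaded or rejected with a reason.

```python
# Hypothetical reference data for the business rule.
KNOWN_COUNTRIES = {"IN", "US", "DE"}


def apply_rule(rows):
    """Split rows into loaded and rejected according to the business rule."""
    loaded, rejected = [], []
    for row in rows:
        reasons = []
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            reasons.append("invalid amount")
        if row.get("country") not in KNOWN_COUNTRIES:
            reasons.append("unknown country")
        if reasons:
            rejected.append({**row, "reasons": reasons})
        else:
            loaded.append(row)
    return loaded, rejected


rows = [
    {"id": 1, "amount": 10.0, "country": "IN"},
    {"id": 2, "amount": -5.0, "country": "US"},
    {"id": 3, "amount": None, "country": "XX"},
]
loaded, rejected = apply_rule(rows)

# No record may be lost: loaded plus rejected must account for every input row.
assert len(loaded) + len(rejected) == len(rows)
```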


Unit testing the report data:
- Verify report data with source: data present in a data warehouse will be stored at an aggregate level compared to source systems. The QA team should verify the granular data stored in the data warehouse against the source data available.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace back and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.


Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in batch
- Initial loading of records on the data warehouse
- Incremental loading of records at a later date to verify the newly inserted or updated data
- Testing the rejected records that don't fulfil transformation rules
- Error log generation


Performance Testing
Performance testing should check for:
- ETL processes completing within the time window
- Monitoring and measuring the data quality issues
- Refresh times for standard/complex reports


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.


Questions


Thank You


Data Warehouse def. by WH Inmon

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
- Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse

711

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

712

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content
1 2 3 4 5 An Overview of Data Warehouse Data Warehouse Architecture Data Modeling for Data Warehouse Overview of Data Cleansing Data Extraction , Transformation , Load

713

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 7 8 Metadata Management OLAP Data Warehouse Testing

714

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview

Understanding What is a Data Warehouse

715

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

716

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing ( OLTP ) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse


Components of Warehouse
- Source Tables: Real-time, volatile data held for transaction processing (OLTP). Sources can be any relational databases or flat files.
- ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from sources to target.
- Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance with the dimensional data modeling technique, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Produce reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


In the basic design, source files are loaded into a warehouse and users query the data for different purposes.

A fuller design adds a staging area, where the data is loaded and tested after cleansing and transformation. From there it is loaded directly into the target database/warehouse, which is divided into data marts that different users access for their reporting and analysis.

Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
- Entity: An object that can be observed and classified based on its properties and characteristics, such as employee, book or student.
- Relationship: An association relating entities to other entities.

Different Perspectives of Data Modeling
- Conceptual Data Model
- Logical Data Model
- Physical Data Model

Types of Dimensional Data Models most commonly used
- Star Schema
- Snowflake Schema


Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

- Dimension: A category of information. For example, the time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
- Fact Table: A table that contains the measures of interest.
- Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Dimension Table: product

prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store

storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale

orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50

Dimension Table: customer

custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
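
The star schema on this slide can be tried out with a small in-memory database. The following is a minimal sketch, assuming SQLite as a stand-in for the warehouse database; the function name `sales_by_city_and_product` is hypothetical, and the third order id is written as o105 on the assumption that the slide's bare "105" lost its prefix. It shows the typical query shape: the fact table joined to its dimension tables, then aggregated.

```python
import sqlite3

# In-memory database holding the star schema from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
-- Fact table: one row per sale, with foreign keys to each dimension.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                            (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO sale VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                        ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                        ('o105', '3/8/97', 111, 'p1', 'c3', 5, 50);
""")


def sales_by_city_and_product(connection):
    """Join the fact table to two dimensions and total qty and amt."""
    return connection.execute("""
        SELECT st.city, p.name, SUM(s.qty), SUM(s.amt)
        FROM sale s
        JOIN store st  ON st.storeId = s.storeId
        JOIN product p ON p.prodId = s.prodId
        GROUP BY st.city, p.name
        ORDER BY st.city, p.name
    """).fetchall()


for row in sales_by_city_and_product(conn):
    print(row)
```

Every query against a star schema follows this pattern: filters and grouping columns come from the dimension tables, while the summed measures come from the fact table.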


Snowflake Schema
store

storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType

tId  size   location
t1   small  downtown
t2   large  suburbs

city

cityId  pop  regId
sfo     1M   north
la      5M   south

region

regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.


Overview of Data Cleansing


The Need For Data Quality


- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with:
  - error detection
  - error rework
  - customer service
  - fixing customer problems


Understand Information Flow In Organization
- Identify authoritative data sources
- Interview employees & customers
- Data entry points
- Cost of bad data

Identify Potential Problem Areas & Assess Impact

Measure Quality Of Data

Clean & Load Data
- Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
- Use data cleansing tools to clean data at the source
- Load only clean data into the data warehouse

Continuous Monitoring
- Schedule periodic cleansing of source data

Identify Areas of Improvement
- Identify & correct cause of defects
- Refine data capture mechanisms at source
- Educate users on importance of DQ
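
The "identify missing, duplicate or incorrect values, then load only clean data" step can be illustrated with a few lines of code. This is a minimal sketch, not how any particular data quality tool works; the function names `profile_records` and `clean_records`, the sample records and the list of valid city codes are all assumptions made for the illustration.

```python
from collections import Counter


def profile_records(records, key, required, valid_values):
    """Return (row index, field, problem) for every defect found."""
    key_counts = Counter(r.get(key) for r in records)
    issues = []
    for i, rec in enumerate(records):
        for field in required:
            if rec.get(field) in (None, ""):
                issues.append((i, field, "missing"))
        for field, allowed in valid_values.items():
            value = rec.get(field)
            if value not in (None, "") and value not in allowed:
                issues.append((i, field, "invalid"))
        if key_counts[rec.get(key)] > 1:
            issues.append((i, key, "duplicate"))
    return issues


def clean_records(records, issues):
    """Keep only the rows that have no reported defect."""
    bad_rows = {i for i, _, _ in issues}
    return [r for i, r in enumerate(records) if i not in bad_rows]


# Hypothetical source rows with one defect of each kind.
records = [
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 81, "name": "fred", "city": ""},
    {"custId": 111, "name": "sally", "city": "los angeles"},
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 200, "name": "ann", "city": "nyc"},
]
issues = profile_records(
    records,
    key="custId",
    required=["name", "city"],
    valid_values={"city": {"nyc", "sfo", "la"}},
)
clean = clean_records(records, issues)
print(issues)
print(clean)
```

In practice the rejected rows would be sent back to the source for correction rather than silently dropped, which matches the slide's emphasis on cleaning data at the source.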

Customized Programs
- Strengths: address specific needs; no bulky one-time investment.
- Limitations: large numbers of custom programs in different environments are difficult to manage; minor alterations demand coding effort.

Data Quality Assessment Tools
- Strength: provide automated assessment.
- Limitation: no measure of data accuracy.


Business Rule Discovery Tools
- Strengths: detect correlation in data values; can detect patterns of behavior that indicate fraud.
- Limitations: not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields.

Data Reengineering & Cleansing Tools
- Strengths: usually integrated packages with cleansing features as an add-on.
- Limitations: error prevention at source is usually absent; the ETL tools have limited cleansing facilities.


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture


Data Extraction
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data onto another file or database

Data Transformation
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading
- Initial and incremental loading
- Updating of metadata
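
The extraction, cleanup, transformation and loading steps listed above can be sketched end to end in a few functions. This is an illustrative sketch only, assuming a CSV extract as the source and SQLite as the target; the source rows, the `CODE_MAP` lookup, the default quantity of 1 for missing values and the `city_sales` table are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: one row has a missing qty, one a bad qty.
SOURCE_CSV = """orderId,storeCode,qty,amt
o100,NY,1,12
o102,NY,2,11
o105,LA,5,50
o106,SF,,7
o107,LA,-3,9
"""

# Changing codes: source store codes mapped to warehouse city codes.
CODE_MAP = {"NY": "nyc", "SF": "sfo", "LA": "la"}


def extract(text):
    """Read the source rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanup(rows):
    """Supply missing field values and apply a range check."""
    cleaned = []
    for row in rows:
        qty = int(row["qty"]) if row["qty"] else 1  # assumed default
        if qty <= 0:
            continue  # reject rows that fail the range check
        cleaned.append({**row, "qty": qty, "amt": int(row["amt"])})
    return cleaned


def transform(rows, load_date):
    """Change codes, summarize per city and add a time attribute."""
    totals = {}
    for row in rows:
        city = CODE_MAP[row["storeCode"]]
        qty, amt = totals.get(city, (0, 0))
        totals[city] = (qty + row["qty"], amt + row["amt"])
    return [(city, load_date, q, a) for city, (q, a) in sorted(totals.items())]


def load(connection, rows):
    """Write the transformed rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS city_sales "
        "(city TEXT, load_date TEXT, qty INTEGER, amt INTEGER)"
    )
    connection.executemany("INSERT INTO city_sales VALUES (?, ?, ?, ?)", rows)
    connection.commit()


target = sqlite3.connect(":memory:")
summary = transform(cleanup(extract(SOURCE_CSV)), load_date="2009-01-01")
load(target, summary)
loaded = target.execute("SELECT * FROM city_sales ORDER BY city").fetchall()
print(loaded)
```

Keeping each stage as a separate function mirrors how ETL tools expose extraction, cleanup, transformation and loading as separate, individually testable steps.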


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.


Major components involved in ETL Processing


- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Metadata management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: The process of reading data from a database.
- Transform: The process of converting the extracted data.
- Load: The process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCentre/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes

Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine whether any of the metadata is accurate?

Integrating information
- How do various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do the inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users (business metadata)
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools


Third-Party Bridging Tools

Oracle Exchange
- Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik Toolbus
- Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
- Hub-and-spoke solution for enabling metadata interoperability
- Ardent is focusing on its own engagements, not selling it as an independent product

Informix's Metadata Plug-ins
- Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them

Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web, employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

OLAP: On-Line Analytical Processing


- OLAP can be defined as a technology that allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.).
- The term is used interchangeably with BI.
- A multidimensional view of data is the foundation of OLAP.
- Users: analysts, decision makers.

Distinction between OLTP and OLAP


Source of data
- OLTP System: Operational data; OLTPs are the original source of the data.
- OLAP System: Consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data
- OLTP System: To control and run fundamental business tasks.
- OLAP System: Decision support.

What the data reveals
- OLTP System: A snapshot of ongoing business processes.
- OLAP System: Multi-dimensional views of various kinds of business activities.

Inserts and Updates
- OLTP System: Short and fast inserts and updates initiated by end users.
- OLAP System: Periodic long-running batch jobs refresh the data.

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions).

- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.

RDBMS v/s MDDB:


Relational DBMS: 27 x 4 = 108 cells

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  [missing]
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

MDDB: 3 x 3 x 3 = 27 cells

[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde). Caption: Increased Complexity...]
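
The difference between the two layouts can be made concrete in code. The sketch below, using only the Mini Van rows from the slide, loads relational rows into a three-dimensional array addressed by position; the helper names `build_cube` and `cell` are hypothetical, and a nested Python list stands in for a real multidimensional database.

```python
MODELS = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
COLORS = ["BLUE", "RED", "WHITE"]
DEALERS = ["Clyde", "Gleason", "Carr"]

# Relational form: every row repeats the dimension values (4 columns).
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3),
    ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3),
    ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]


def build_cube(relational_rows):
    """Place each sales volume at its (model, color, dealer) position."""
    cube = [[[None for _ in DEALERS] for _ in COLORS] for _ in MODELS]
    for model, color, dealer, volume in relational_rows:
        m = MODELS.index(model)
        c = COLORS.index(color)
        d = DEALERS.index(dealer)
        cube[m][c][d] = volume
    return cube


def cell(cube, model, color, dealer):
    """Look up one cell by its dimension members."""
    return cube[MODELS.index(model)][COLORS.index(color)][DEALERS.index(dealer)]


cube = build_cube(rows)
cube_cells = len(MODELS) * len(COLORS) * len(DEALERS)  # 3 x 3 x 3
relational_cells = cube_cells * 4                      # 27 rows x 4 columns
print(cell(cube, "MINI VAN", "RED", "Clyde"), cube_cells, relational_cells)
```

In the array, the dimension members are stored once as the edges of the cube, so only the 27 measure values occupy cells; the relational table repeats the member names on every row.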

Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array.
- The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table.

Storage Space
- Very low space consumption compared to a relational DB.

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, a blank cell is left behind.

Employee Age

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: the same data as a LAST NAME by EMPLOYEE # array. Each employee number belongs to exactly one last name, so only 9 of the 81 cells hold an age and the rest stay blank.]
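
The sparsity of the employee example can be quantified as the share of cells that actually hold a value. The following is a minimal sketch; the function name `density` is hypothetical, and the last name is spelled LUCAS on the assumption that "LUCUS" on the slide is a typo.

```python
from math import prod

# (last name, employee number, age) rows from the Employee Age table.
employees = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]


def density(populated_cells, dimension_sizes):
    """Fraction of the cells in the array that hold a value."""
    return populated_cells / prod(dimension_sizes)


last_names = {name for name, _, _ in employees}
emp_numbers = {number for _, number, _ in employees}

# A LAST NAME x EMPLOYEE # array has 9 x 9 cells but only 9 ages.
total_cells = len(last_names) * len(emp_numbers)
cell_density = density(len(employees), [len(last_names), len(emp_numbers)])
print(total_cells, round(cell_density, 3), round(1 - cell_density, 3))
```

Adding a further dimension multiplies the number of cells without adding any values, which is why sparsity increases with the number of dimensions.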

OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation


Complex queries and sorts in a relational environment translate to a simple rotation.

Sales Volumes, View #1 (MODEL by COLOR)

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

Sales Volumes, View #2 (COLOR by MODEL, after a 90 degree rotation)

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation


[Figure: the three-dimensional Sales Volumes cube, with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown in six orientations, View #1 to View #6, each reached by a 90 degree rotation.]

A 3-dimensional array has 6 views.

Features of OLAP - Slicing / Filtering


MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL (Mini Van, Coupe), DEALERSHIP (Carr, Clyde) and COLOR (Normal Blue, Metal Blue).]

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION   DISTRICT   DEALERSHIP
Midwest  Chicago    Clyde, Gleason
         St. Louis  Carr, Levi
         Gary       Lucas, Bolton

Sales can be viewed at Region / District / Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
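
Rotation, slicing and roll-up are simple operations once the data is held by dimension member. The sketch below uses nested dictionaries in place of a real OLAP engine. The MODEL by COLOR values come from the two-dimensional rotation slide; the function names, the per-dealership sales figures and the assignment of dealerships to districts (read off the order of names on the slide) are assumptions.

```python
# View #1 from the rotation slide: MODEL rows, COLOR columns.
view1 = {
    "Mini Van": {"Blue": 6, "Red": 5, "White": 4},
    "Coupe": {"Blue": 3, "Red": 5, "White": 5},
    "Sedan": {"Blue": 4, "Red": 3, "White": 2},
}


def rotate(view):
    """Swap the row and column dimensions of a two-dimensional view."""
    rotated = {}
    for row, columns in view.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated


def slice_view(view, rows, columns):
    """Keep only the requested members of each dimension."""
    return {r: {c: view[r][c] for c in columns} for r in rows}


def roll_up(values, parent_of):
    """Sum values from one hierarchy level into the level above."""
    totals = {}
    for member, value in values.items():
        parent = parent_of[member]
        totals[parent] = totals.get(parent, 0) + value
    return totals


# Assumed hierarchy: dealership -> district -> region.
DISTRICT_OF = {"Clyde": "Chicago", "Gleason": "Chicago",
               "Carr": "St. Louis", "Levi": "St. Louis",
               "Lucas": "Gary", "Bolton": "Gary"}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales per dealership, for illustration only.
dealership_sales = {"Clyde": 10, "Gleason": 20, "Carr": 5,
                    "Levi": 15, "Lucas": 7, "Bolton": 3}

view2 = rotate(view1)
blue_red_vans = slice_view(view1, ["Mini Van"], ["Blue", "Red"])
district_sales = roll_up(dealership_sales, DISTRICT_OF)
region_sales = roll_up(district_sales, REGION_OF)
print(view2["Blue"], blue_red_vans, district_sales, region_sales)
```

Drill-down is the reverse lookup: starting from a region total, the same hierarchy tables identify the districts and dealerships whose values make it up.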

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by Region (East, West, Central) for Year 1999 and Year 2000; vertical axis from 0 to 200.]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) by Region (East, West, Central) for the 1st to 4th quarters of Year 1999; vertical axis from 0 to 90.]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by Region (East, West, Central) for January, February and March of Year 1999; vertical axis from 0 to 20.]

Drill-down from Quarter to Month

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic provided as separate layers

HOLAP - Hybrid OLAP
- OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Figure: MOLAP architecture, showing the OLAP cube, the OLAP calculation engine, and the clients: web browser, OLAP tools and OLAP applications.]

MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  - Maximum query performance
  - Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture, showing the relational DW accessed through SQL, the MDDB-relational mapping, the OLAP calculation engine, and the clients: web browser, OLAP tools and OLAP applications.]

ROLAP - Features
- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier and database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability

HOLAP - Combination of RDBMS and MDDB


[Figure: HOLAP architecture, showing the OLAP cube and the relational DW (accessed through SQL), the OLAP calculation engine, and the clients: any client, web browser, OLAP tools and OLAP applications.]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user

Architecture Comparison

MOLAP
- Definition: MDDB OLAP = transaction-level data + summary in MDDB
- Data explosion due to sparsity: with good design, 3-10 times
- Data explosion due to summarization: high (may go beyond control; estimation is very important)
- Query execution speed: fast (depends upon the size of the MDDB)
- Cost: medium (MDDB server + large disk space cost)
- Where to apply: small transactional data + complex model + frequent summary analysis

ROLAP
- Definition: Relational OLAP = transaction-level data + summary in RDBMS
- Data explosion due to sparsity: no sparsity
- Data explosion due to summarization: to the necessary extent
- Query execution speed: slow
- Cost: low (only RDBMS + disk space cost)
- Where to apply: very large transactional data that needs to be viewed / sorted

HOLAP
- Definition: Hybrid OLAP = ROLAP + summary in MDDB
- Data explosion due to sparsity: sparsity exists only in the MDDB part
- Data explosion due to summarization: to the necessary extent
- Query execution speed: optimum (if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP)
- Cost: high (RDBMS + disk space + MDDB server cost)
- Where to apply: large transactional data + frequent summary analysis

Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- Micro Strategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


- The cost of finding software defects increases exponentially the later they are found in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge

Difference In Testing Data warehouse and Transaction System

User-Triggered vs. System-Triggered

Most testing of a production/source system covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles cover system-triggered scenarios (like billing or valuation). In a data warehouse, by contrast, most of the testing is system-triggered.

Difference In Testing Data warehouse and Transaction System


Volume of Test Data

The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, because one tries to fill up the maximum possible combinations of dimensions and facts.

Possible Scenarios / Test Cases

In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.

Difference In Testing Data warehouse and Transaction System


Programming for Testing Challenge

In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
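
A stand-alone back-end test script of the kind described here can be as simple as recomputing an aggregate from the source rows and comparing it with what was loaded. The sketch below is an assumption about what such a script might look like, not a description of any specific tool; the function name `reconcile` and the sample rows are hypothetical.

```python
def reconcile(source_rows, target_rows, key, measure):
    """Compare per-key totals of the source with the loaded target.

    Returns {key: (expected, actual)} for every key that differs.
    """
    expected = {}
    for row in source_rows:
        expected[row[key]] = expected.get(row[key], 0) + row[measure]
    actual = {row[key]: row[measure] for row in target_rows}

    mismatches = {}
    for k in sorted(set(expected) | set(actual)):
        if expected.get(k) != actual.get(k):
            mismatches[k] = (expected.get(k), actual.get(k))
    return mismatches


# Pre-transformation data: transaction-level rows from the source.
source = [
    {"city": "nyc", "amt": 12},
    {"city": "nyc", "amt": 11},
    {"city": "la", "amt": 50},
]
# Post-transformation data: summary rows in the warehouse (la is wrong).
target = [
    {"city": "nyc", "amt": 23},
    {"city": "la", "amt": 45},
]

mismatches = reconcile(source, target, key="city", measure="amt")
print(mismatches)
```

An empty result means the loaded summary agrees with the source; any entry points the tester at the key whose transformation or load needs investigation.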


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
o 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area.
o 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP.

Testing phases consist of:
o Requirements testing
o Unit testing
o Integration testing
o Performance testing
o Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
o Are the requirements complete?
o Are the requirements singular?
o Are the requirements unambiguous?
o Are the requirements developable?
o Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
o Whether the ETLs access and pick up the right data from the right source.
o Whether all the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
o Testing the rejected records that don't fulfil the transformation rules.
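The checks on transformation rules and rejected records can be sketched as a small unit test. The business rule used here (amount = qty * unit_price, with qty required to be positive) and the field names are hypothetical; they stand in for whatever rules the ETL mapping under test implements.

```python
def transform_row(row):
    """Hypothetical business rule: amount = qty * unit_price, qty must be positive."""
    if row.get("qty") is None or row["qty"] <= 0:
        raise ValueError("qty must be a positive number")
    return {"order_id": row["order_id"], "amount": row["qty"] * row["unit_price"]}

def run_mapping(rows):
    """Apply the rule to every row, routing failures to a reject list."""
    loaded, rejected = [], []
    for row in rows:
        try:
            loaded.append(transform_row(row))
        except ValueError as exc:
            rejected.append({"row": row, "reason": str(exc)})
    return loaded, rejected

source_rows = [
    {"order_id": "o100", "qty": 2, "unit_price": 10},
    {"order_id": "o101", "qty": 0, "unit_price": 5},      # violates the rule
    {"order_id": "o102", "qty": None, "unit_price": 5},   # missing value
]
loaded, rejected = run_mapping(source_rows)

# Unit-test style checks: correct transformation, and rejects captured with a reason.
assert loaded == [{"order_id": "o100", "amount": 20}]
assert [r["row"]["order_id"] for r in rejected] == ["o101", "o102"]
```

Keeping the reject reason next to the rejected row makes it possible to test the error log as well as the loaded data.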

Unit Testing
Unit testing the report data:
o Verify report data against the source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
o Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
o Derivation formulae/calculation rules should be verified.
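As a rough illustration of verifying aggregated report data against the source, the sketch below re-aggregates granular source rows and lists every key where the report disagrees. The field names (store, amt) and the sample figures are assumptions made for the example.

```python
from collections import defaultdict

def aggregate_source(rows, key, measure):
    """Roll granular source rows up to the level the report is stored at."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)

def compare_report(source_totals, report_totals):
    """Return {key: (source_value, report_value)} for every key that disagrees."""
    keys = set(source_totals) | set(report_totals)
    return {
        k: (source_totals.get(k), report_totals.get(k))
        for k in sorted(keys)
        if source_totals.get(k) != report_totals.get(k)
    }

source_rows = [
    {"store": "c1", "amt": 12},
    {"store": "c1", "amt": 11},
    {"store": "c3", "amt": 50},
]
report_totals = {"c1": 23, "c3": 55}   # hypothetical figures read from a report

mismatches = compare_report(aggregate_source(source_rows, "store", "amt"), report_totals)
print(mismatches)
```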


Integration Testing
Integration testing will involve the following:
o Sequence of ETL jobs in the batch.
o Initial loading of records into the data warehouse.
o Incremental loading of records at a later date, to verify the newly inserted or updated data.
o Testing the rejected records that don't fulfil the transformation rules.
o Error log generation.
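The initial-load and incremental-load checks can be sketched with an in-memory SQLite table. The dw_customer table and its columns are assumptions for illustration, and the simple overwrite-on-key strategy shown here is only one of several ways an incremental load may be designed.

```python
import sqlite3

def load(conn, rows):
    """Insert new customers and overwrite changed ones (keyed on cust_id)."""
    conn.executemany(
        "INSERT OR REPLACE INTO dw_customer (cust_id, name, city) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dw_customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

load(conn, [(53, "joe", "sfo"), (81, "fred", "sfo")])     # initial load
load(conn, [(81, "fred", "la"), (111, "sally", "la")])    # incremental load, later date

final_rows = conn.execute(
    "SELECT cust_id, name, city FROM dw_customer ORDER BY cust_id"
).fetchall()

# Untouched row kept, changed row updated, new row inserted.
assert final_rows == [(53, "joe", "sfo"), (81, "fred", "la"), (111, "sally", "la")]
```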


Performance Testing
Performance testing should check for:
o ETL processes completing within the time window.
o Monitoring and measuring the data quality issues.
o Refresh times for standard/complex reports.


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.



Components of Warehouse
o Source Tables: Real-time, volatile data in relational databases for transaction processing (OLTP). The sources can be any relational databases or flat files.
o ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
o Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
o Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
o Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
o End-user tools for analysis and reporting: Get the reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: the data is loaded into the staging area after cleansing and transformation, and is tested there. It is then loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis.


Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
o Entity: An object that can be observed and classified based on its properties and characteristics, such as employee, book or student.
o Relationship: Relates entities to other entities.

Different Perspectives of Data Modeling
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of Dimensional Data Models most commonly used:
o Star Schema
o Snowflake Schema


Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
o Dimension: A category of information. For example, the time dimension.
o Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
o Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
o Fact Table: A table that contains the measures of interest.
o Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
o Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
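To illustrate how surrogate keys help with slowly changing dimensions, here is a minimal sketch in which each new version of a customer gets a fresh warehouse-generated key while older versions are kept. Keeping history this way is one common approach and is an assumption of this example, as are the class, field and method names.

```python
from itertools import count

class CustomerDimension:
    """Toy dimension table that versions rows using surrogate keys."""

    def __init__(self):
        self._next_key = count(1)   # surrogate key generator
        self.rows = []              # every version ever loaded
        self._current = {}          # natural key -> current row

    def upsert(self, cust_id, city):
        current = self._current.get(cust_id)
        if current is not None and current["city"] == city:
            return current["customer_key"]      # nothing changed
        if current is not None:
            current["is_current"] = False       # retire the old version
        row = {
            "customer_key": next(self._next_key),
            "cust_id": cust_id,
            "city": city,
            "is_current": True,
        }
        self.rows.append(row)
        self._current[cust_id] = row
        return row["customer_key"]

dim = CustomerDimension()
k1 = dim.upsert(53, "sfo")   # first load
k2 = dim.upsert(53, "sfo")   # unchanged, same surrogate key
k3 = dim.upsert(53, "la")    # customer moved, new surrogate key
```

Fact rows loaded before the move keep pointing at the old surrogate key, so historical reports still show the old city.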

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
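The star schema above can be tried out with an in-memory SQLite database. This is only a sketch: the column types are assumptions, the fact table's first column is spelled orderId, and the query shows one typical analysis (sales by store city and product).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
    CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
    -- The fact table holds the measures (qty, amt) plus one key per dimension.
    CREATE TABLE sale (
        orderId TEXT, date TEXT,
        custId  INTEGER REFERENCES customer(custId),
        prodId  TEXT REFERENCES product(prodId),
        storeId TEXT REFERENCES store(storeId),
        qty INTEGER, amt INTEGER
    );
    INSERT INTO product  VALUES ('p1','bolt',10), ('p2','nut',5);
    INSERT INTO store    VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
    INSERT INTO customer VALUES (53,'joe','10 main','sfo'),
                                (81,'fred','12 main','sfo'),
                                (111,'sally','80 willow','la');
    INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                            ('o102','2/7/97',53,'p2','c1',2,11),
                            ('105','3/8/97',111,'p1','c3',5,50);
""")

# Each dimension is exactly one join away from the fact table.
sales_by_city_product = conn.execute("""
    SELECT st.city, p.name, SUM(s.qty), SUM(s.amt)
    FROM sale s
    JOIN store st  ON st.storeId = s.storeId
    JOIN product p ON p.prodId = s.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()
print(sales_by_city_product)
```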


Snowflake Schema
Dimension tables (the store dimension is normalized into sType, city and region):

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.
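A snowflaked dimension needs a chain of joins to answer questions that a star schema answers with one. The sketch below assumes the flattened slide data reads as four tables, store(storeId, cityId, tId, mgr), sType(tId, size, location), city(cityId, pop, regId) and region(regId, name); the column types are assumptions as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT,
                         regId TEXT REFERENCES region(regId));
    CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
    CREATE TABLE store  (storeId TEXT PRIMARY KEY,
                         cityId TEXT REFERENCES city(cityId),
                         tId TEXT REFERENCES sType(tId), mgr TEXT);
    INSERT INTO region VALUES ('north','cold region'), ('south','warm region');
    INSERT INTO city   VALUES ('sfo','1M','north'), ('la','5M','south');
    INSERT INTO sType  VALUES ('t1','small','downtown'), ('t2','large','suburbs');
    INSERT INTO store  VALUES ('s5','sfo','t1','joe'),
                              ('s7','sfo','t2','fred'),
                              ('s9','la','t1','nancy');
""")

# Reaching the region name takes two hops: store -> city -> region.
stores = conn.execute("""
    SELECT s.storeId, s.mgr, t.size, c.pop, r.name
    FROM store s
    JOIN sType t  ON t.tId = s.tId
    JOIN city c   ON c.cityId = s.cityId
    JOIN region r ON r.regId = c.regId
    ORDER BY s.storeId
""").fetchall()
print(stores)
```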


Overview of Data Cleansing


The Need For Data Quality


o Difficulty in decision making
o Time delays in operation
o Organizational mistrust
o Data ownership conflicts
o Customer attrition
o Costs associated with:
  o error detection
  o error rework
  o customer service
  o fixing customer problems


Understand Information Flow In Organization
o Identify authoritative data sources
o Interview employees & customers

Identify Potential Problem Areas & Assess Impact
o Data entry points
o Cost of bad data

Measure Quality Of Data
o Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
o Use data cleansing tools to clean data at the source
o Load only clean data into the data warehouse

Continuous Monitoring
o Schedule periodic cleansing of source data

Identify Areas of Improvement
o Identify & correct cause of defects
o Refine data capture mechanisms at source
o Educate users on importance of DQ
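The "measure quality of data" step can be approximated with a very small profiling routine. This is a sketch under simple assumptions (records as dictionaries, one key field, a list of required fields); dedicated discovery tools do far more than count missing values and duplicate keys.

```python
from collections import Counter

def profile(rows, key_field, required_fields):
    """Count missing required values and find duplicated key values."""
    missing = Counter()
    for row in rows:
        for field in required_fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
    key_counts = Counter(row.get(key_field) for row in rows)
    duplicates = sorted((k for k, n in key_counts.items() if n > 1), key=str)
    return {"rows": len(rows), "missing": dict(missing), "duplicate_keys": duplicates}

# Hypothetical customer records with one blank name, one missing city
# and one duplicated custId.
customers = [
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 81, "name": "", "city": "sfo"},
    {"custId": 53, "name": "joe", "city": None},
]
report = profile(customers, "custId", ["name", "city"])
print(report)
```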

Customized Programs
o Strengths: address specific needs; no bulky one-time investment
o Limitations: large numbers of custom programs in different environments are difficult to manage; minor alterations demand coding effort

Data Quality Assessment Tools
o Strength: provide automated assessment
o Limitation: no measure of data accuracy


Business Rule Discovery Tools
o Strengths: detect correlation in data values; can detect patterns of behavior that indicate fraud
o Limitations: not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
o Strengths: usually integrated packages with cleansing features as an add-on
o Limitations: error prevention at source is usually absent; the ETL tools have limited cleansing facilities


Business Rule Discovery Tools
o Integrity Data Reengineering Tool from Vality Technology
o Trillium Software System from Harte-Hanks Data Technologies
o Migration Architect from DB Star

Data Reengineering & Cleansing Tools
o Carlton Pureview from Oracle
o ETI-Extract from Evolutionary Technologies
o PowerMart from Informatica Corp
o Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
o Migration Architect, Evoke Axio from Evoke Software
o Wizrule from Wizsoft

Name & Address Cleansing Tools
o Centrus Suite from Sagent
o I.d.centric from First Logic


Data Extraction, Transformation, Load



ETL Architecture

Data Extraction
o Rummages through a file or database
o Uses some criteria for selection
o Identifies qualified data
o Transports the data over onto another file or database

Data Transformation
o Integrating dissimilar data types
o Changing codes
o Adding a time attribute
o Summarizing data
o Calculating derived values
o Renormalizing data

Data Extraction Cleanup
o Restructuring of records or fields
o Removal of operational-only data
o Supply of missing field values
o Data integrity checks
o Data consistency and range checks, etc.

Data Loading
o Initial and incremental loading
o Updating of metadata
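A few of the cleanup and transformation steps listed above can be sketched in plain Python. Everything specific here is a hypothetical assumption made for illustration: the field names, the gender code table, the terminal_id operational field and the 1 to 1000 quantity range.

```python
from datetime import date

GENDER_CODES = {"M": "Male", "F": "Female"}   # hypothetical code table
OPERATIONAL_ONLY = {"terminal_id"}            # hypothetical operational-only field

def clean(record):
    """Cleanup: drop operational-only data, supply missing values, range-check qty."""
    cleaned = {k: v for k, v in record.items() if k not in OPERATIONAL_ONLY}
    if not cleaned.get("city"):
        cleaned["city"] = "UNKNOWN"
    if not 0 < cleaned["qty"] <= 1000:
        raise ValueError(f"qty out of range: {cleaned['qty']}")
    return cleaned

def transform(record, load_date):
    out = dict(record)
    out["gender"] = GENDER_CODES.get(record["gender"], "Unknown")  # changing codes
    out["load_date"] = load_date.isoformat()                       # adding a time attribute
    out["amount"] = record["qty"] * record["unit_price"]           # calculating a derived value
    return out

def summarize(records):
    """Summarizing data: total amount per city."""
    totals = {}
    for r in records:
        totals[r["city"]] = totals.get(r["city"], 0) + r["amount"]
    return totals

raw = [
    {"order_id": "o100", "gender": "M", "city": "sfo", "qty": 1, "unit_price": 12, "terminal_id": "T7"},
    {"order_id": "o102", "gender": "F", "city": "", "qty": 2, "unit_price": 5, "terminal_id": "T7"},
]
rows = [transform(clean(r), date(1997, 7, 1)) for r in raw]
city_totals = summarize(rows)
print(city_totals)
```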


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.



Major components involved in ETL Processing

o Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
o Metadata management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
o Extract: The process of reading data from a database.
o Transform: The process of converting the extracted data.
o Load: The process of writing the data into the target database.
o Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
o Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.
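The relationship between a source-to-target mapping and the extract, transform and load steps can be sketched as follows. The mapping format, the column names and the in-memory lists standing in for the source and target databases are all assumptions; commercial ETL tools define mappings through their own design interfaces.

```python
# Source-to-target mapping: (target column, source column, transformation).
MAPPING = [
    ("customer_id", "custId", int),
    ("customer_name", "name", str.title),
    ("city_code", "city", str.upper),
]

def extract(source):
    """Extract: read rows from the source (a list stands in for a database)."""
    yield from source

def transform(row, mapping):
    """Transform: convert one extracted row according to the mapping."""
    return {target: fn(row[src]) for target, src, fn in mapping}

def load(rows, target):
    """Load: write rows into the target and return its new size."""
    target.extend(rows)
    return len(target)

source = [
    {"custId": "53", "name": "joe", "city": "sfo"},
    {"custId": "111", "name": "sally", "city": "la"},
]
warehouse = []
loaded = load((transform(r, MAPPING) for r in extract(source)), warehouse)
print(loaded, warehouse)
```

Because the mapping is plain data, the same structure could also be stored in a metadata repository and documented alongside the job.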

ETL Tools
o Provide a facility to specify a large number of transformation rules with a GUI
o Generate programs to transform data
o Handle multiple data sources
o Handle data redundancy
o Generate metadata as output
o Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
o PowerCenter/PowerMart from Informatica
o Data Mart Solution from Sagent Technology
o DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is information:
o That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
o About the data being captured and loaded into the warehouse
o Documented in IT tools, that improves both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
o How much time is spent looking for information?
o How often is the information found?
o What poor decisions were made based on incomplete information?
o How much money was lost or earned as a result?

Interpreting information
o How many times have businesses needed to rework or recall products?
o What impact does it have on the bottom line?
o How many mistakes were due to misinterpretation of existing documentation?
o How much interpretation results from too much metadata?
o How much time is spent trying to determine whether any of the metadata is accurate?

Integrating information
o How do the various data perspectives connect together?
o How much time is spent trying to figure that out?
o How much does the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


o Provide a simple catalogue of business metadata descriptions and views
o Document/manage metadata descriptions from an integrated development environment
o Enable DW users to identify and invoke pre-built queries against the data stores
o Design and enhance new data models and schemas for the data warehouse
o Capture data transformation rules between the operational and data warehousing databases
o Provide change impact analysis, and update across these technologies
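Captured transformation rules make change impact analysis possible. The sketch below stores a few rules as plain records and walks them to find everything downstream of a source column. The rule format and all table and column names are hypothetical; a real metadata repository would hold this in its own model.

```python
# Hypothetical transformation-rule metadata: each target lists the columns it reads.
TRANSFORMATION_RULES = [
    {"target": "dw_sale.amt",
     "sources": ["oltp_order.qty", "oltp_order.unit_price"],
     "rule": "qty * unit_price"},
    {"target": "dw_customer.city",
     "sources": ["oltp_customer.city"],
     "rule": "upper-case city code"},
    {"target": "rpt_city_sales.total",
     "sources": ["dw_sale.amt", "dw_customer.city"],
     "rule": "sum of amt by city"},
]

def impacted_by(column, rules):
    """Return every target that depends, directly or indirectly, on `column`."""
    impacted, frontier = set(), [column]
    while frontier:
        current = frontier.pop()
        for rule in rules:
            if current in rule["sources"] and rule["target"] not in impacted:
                impacted.add(rule["target"])
                frontier.append(rule["target"])
    return sorted(impacted)

# A change to the operational qty column reaches the fact column and the report.
impact = impacted_by("oltp_order.qty", TRANSFORMATION_RULES)
print(impact)
```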

Consumers of Metadata
Technical Users
o Warehouse administrator
o Application developer

Business Users (business metadata)
o Meanings
o Definitions
o Business rules

Software Tools
o Used in DW life-cycle development
o Metadata requirements for each tool must be identified
o The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
o Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third Party Bridging Tools
o Oracle Exchange: technology of choice for a long list of repository, enterprise and workgroup vendors
o Reischmann-Informatik-Toolbus: features include facilitation of selective bridging of metadata
o Ardent Software/Dovetail Software Interplay: hub-and-spoke solution for enabling metadata interoperability; Ardent is focusing on its own engagements, not selling it as an independent product
o Informix's Metadata Plug-ins: available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
o IBM, Oracle and Microsoft to offer free or near-free basic repository services
o Enable organisations to reuse metadata across technologies
o Integrate DB design, data transformation and BI tools from different vendors
o Multi-tool vendors are taking a bridged or federated, rather than integrated, approach to sharing metadata
o Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them



Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
o Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
o Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
o Nonvolatile: Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
o Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse


Components of Warehouse
Source Tables : These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End - user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

832

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

833 222

2009 Wipro Ltd - Confidential 2222 Wipro Ltd - Confidential

Data Modeling

Effective way of using a Data Warehouse

834

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model

Entity : Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship : relating entities to other entities.

Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model o

Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema

835

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling: Dimension: A category of information. For example, the time dimension. Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension. Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day. Fact Table: A table that contains the measures of interest. Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.

836 222

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by 2009 Wipro Ltd Confidential lookup tables. Attributes --are the non-key columns in the lookup tables. 2222 Wipro Ltd Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId city c1 nyc c2 sfo c3 la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

837 222

2009 Wipro Ltd - Confidential Wipro 2222 Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 cityId sfo la size small large pop 1M 5M location downtown suburbs regId north south

Dimension Table
city

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.

region

regId north south

name cold region warm region

838

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

839

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

The Need For Data Quality


Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with error detection error rework customer service fixing customer problems

840

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Understand Information Flow In Organization

Identify authoritative data sources Identify authoritative data sources Interview Employees & Customers Interview Employees & Customers Data Entry Points Data Entry Points Cost of bad data Cost of bad data

Id e n tify P o te n tia l Id e n tify P o te n tia l P ro b le m A re a s & A s s e s P ro b le m A re a s & A ss e s Im p a ct Im p a ct

M e a s u re Q u a lity O f M e a s u re Q u a lity O f D a ta D a ta C le a n & L o a d C le a n & L o a d D a ta D a ta

Use business rule discovery tools to identify data with Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect inconsistent, missing, incomplete, duplicate or incorrect values values
Use data cleansing tools to clean data at the source Use data cleansing tools to clean data at the source Load only clean data into the data warehouse Load only clean data into the data warehouse

C o n tin u o u s M o n ito rin g C o n tin u o u s M o n ito rin g

Schedule Periodic Cleansing of Source Data Schedule Periodic Cleansing of Source Data

Id e n tify A re a s o f Id e n tify A re a s o f Im p ro v e m e n t Im p ro v e m e n t

Identify & Correct Cause of Defects Identify & Correct Cause of Defects Refine data capture mechanisms at source Refine data capture mechanisms at source Educate users on importance of DQ Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

841

Customized Programs Strengths: Addresses specific needs No bulky one time investment Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools Strength Provide automated assessment Limitation No measure of data accuracy

842

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Business Rule Discovery tools Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields.

Data Reengineering & Cleansing tools Strengths Usually are integrated packages with cleansing features as Add-on Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities

843

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic

844

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

845

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture

846

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture

Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database

D a ta tra n sfo rm a ti n o
I te g ra ti g d i m i a r d a ta typ e s n n ssi l C h a n g i g co d e s n A d d i g a ti e a ttri u te n m b S u m m a ri n g d a ta zi C a l l ti g d e ri d va l e s cu a n ve u R e n o rm a l zi g d a ta i n

Data Extraction Cleanup


Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, etc...

Data loading
Initial and incremental loading Updation of metadata

847

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.

848

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

849

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
850

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment

ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential

851

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Metadata Management

852

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

What Is Metadata?

Metadata is Information...

That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does that have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine whether any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users (business metadata)
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third Party Bridging Tools
- Oracle Exchange: Technology of choice for a long list of repository, enterprise and workgroup vendors
- Reischmann-Informatik Toolbus: Features include facilitation of selective bridging of metadata
- Ardent Software / Dovetail Software Interplay: Hub-and-spoke solution for enabling metadata interoperability. Ardent is focussing on its own engagements, not selling it as an independent product
- Informix's Metadata Plug-ins: Available with Ardent DataStage version 3.6.2 free of cost for ERwin, Oracle Designer, Sybase PowerDesigner, Brio and MicroStrategy


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated, rather than integrated, approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them



Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web, employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners, including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member) and Viasoft


OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

09/02/2012


OLAP: On-Line Analytical Processing


- OLAP can be defined as a technology that lets users view aggregate data across measurements (such as Maturity Amount or Interest Rate) along a set of related parameters called dimensions (such as Product, Organization or Customer)
- The term is used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers
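As a small sketch of what viewing a measure along dimensions means, the function below sums one measure for any chosen combination of dimensions. The fact rows, member names and amounts are invented for illustration; only the dimension and measure names echo the examples above.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, organization, customer, maturity_amount)
facts = [
    ("Fixed Deposit", "North", "C1", 100.0),
    ("Fixed Deposit", "South", "C2", 250.0),
    ("Bond", "North", "C1", 400.0),
    ("Bond", "North", "C3", 50.0),
]
DIMENSIONS = ("product", "organization", "customer")

def aggregate(rows, by):
    """Sum the measure for every combination of the requested dimensions."""
    positions = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the measure is the last field of each row
    return dict(totals)

by_product = aggregate(facts, ["product"])
by_product_org = aggregate(facts, ["product", "organization"])
print(by_product)
print(by_product_org)
```

The same fact rows answer both questions; only the list of dimensions passed to `aggregate` changes.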


Distinction between OLTP and OLAP


Source of data
- OLTP System: Operational data; OLTPs are the original source of the data
- OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: To control and run fundamental business tasks
- OLAP System: Decision support

What the data reveals
- OLTP System: A snapshot of ongoing business processes
- OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
- OLTP System: Short and fast inserts and updates initiated by end users
- OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions).

A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions. Individual items within each dimension are called members.


RDBMS v/s MDDB: Increased Complexity...

Relational DBMS: 27 x 4 = 108 cells

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

MDDB: 3 x 3 x 3 = 27 cells

[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde)]
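The two cell counts can be reproduced with a short sketch. The nested-list cube is only one possible in-memory layout, chosen here for illustration, and all volumes except the single Mini Van / Blue / Clyde value from the table are zero placeholders.

```python
from itertools import product

models = ["Mini Van", "Sports Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]
dealers = ["Clyde", "Gleason", "Carr"]

# Relational form: one row per combination, four columns per row
# (MODEL, COLOR, DEALER, VOL). Volumes are placeholders here.
rows = [(m, c, d, 0) for m, c, d in product(models, colors, dealers)]
relational_cells = len(rows) * len(rows[0])

# Multidimensional form: one cell per combination, addressed by position,
# so the member names are stored once per dimension rather than once per row.
cube = [[[0 for _ in dealers] for _ in colors] for _ in models]
cube_cells = sum(len(line) for plane in cube for line in plane)

def set_volume(model, color, dealer, volume):
    """Store a sales volume at the intersection of three members."""
    cube[models.index(model)][colors.index(color)][dealers.index(dealer)] = volume

set_volume("Mini Van", "Blue", "Clyde", 6)
print(relational_cells, cube_cells)
```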


Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array
- The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table

Storage Space
- Very low space consumption compared to a relational DB

Performance
- Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example

If members of different dimensions do not interact, a blank cell is left behind.

Dense array: Sales Volumes, MODEL x COLOR

            Blue  Red  White
Mini Van    6     5    4
Coupe       3     5    5
Sedan       4     3    2

Sparse array: Employee Age, LAST NAME x EMPLOYEE #

LAST NAME  EMP#  AGE
Smith      01    21
Regan      12    19
Fox        31    63
Weld       14    31
Kelly      54    27
Link       03    56
Kranz      41    45
Lucas      33    41
Weiss      23    19

[Figure: the Employee Age data plotted as a LAST NAME x EMPLOYEE # array; each last name matches a single employee number, so most cells are blank]
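To put a number on the sparsity in the Employee Age example, the sketch below builds the LAST NAME x EMPLOYEE # array from the nine rows of the table and counts blank cells. The dictionary-based layout is an illustrative assumption, not how an MDDB stores its data.

```python
employees = [
    ("Smith", "01", 21), ("Regan", "12", 19), ("Fox", "31", 63),
    ("Weld", "14", 31), ("Kelly", "54", 27), ("Link", "03", 56),
    ("Kranz", "41", 45), ("Lucas", "33", 41), ("Weiss", "23", 19),
]
names = [name for name, _, _ in employees]
numbers = [emp_no for _, emp_no, _ in employees]

# One cell for every (last name, employee number) pair; None marks a blank.
grid = {(name, emp_no): None for name in names for emp_no in numbers}
for name, emp_no, age in employees:
    grid[(name, emp_no)] = age

total_cells = len(grid)
filled_cells = sum(1 for value in grid.values() if value is not None)
sparsity = 1 - filled_cells / total_cells
print(total_cells, filled_cells, round(sparsity, 3))
```

Only 9 of the 81 cells hold an age, so choosing two dimensions that identify the same thing leaves roughly 89% of the array empty.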


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data


Features of OLAP - Rotation


Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL (rows) x COLOR (columns)

            Blue  Red  White
Mini Van    6     5    4
Coupe       3     5    5
Sedan       4     3    2

( ROTATE 90 degrees )

View #2: Sales Volumes, COLOR (rows) x MODEL (columns)

         Mini Van  Coupe  Sedan
Blue     6         3      4
Red      5         5      3
White    4         5      2

A 2-dimensional array has 2 views.




[Figure: the Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown in six orientations (View #1 to View #6), each reached by a 90-degree rotation]

A 3-dimensional array has 6 views.
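Rotation and the view counts can be sketched as follows. The sketch treats a rotation of the 2-dimensional Sales Volumes array as a transpose and counts one view per ordering of the dimensions on the axes, which gives n! views for n dimensions; that counting rule is an interpretation of the slides, which state only the results for 2 and 3 dimensions.

```python
from itertools import permutations

# 2-D Sales Volumes array: rows are MODEL, columns are COLOR.
view1 = [
    [6, 5, 4],  # Mini Van: Blue, Red, White
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]

def rotate(matrix):
    """Swap the two axes (a transpose), so COLOR becomes the row dimension."""
    return [list(column) for column in zip(*matrix)]

view2 = rotate(view1)

def count_views(dimension_names):
    """Count the orderings of the dimensions; each ordering is one view."""
    return len(list(permutations(dimension_names)))

views_2d = count_views(["MODEL", "COLOR"])
views_3d = count_views(["MODEL", "COLOR", "DEALERSHIP"])
print(view2)
print(views_2d, views_3d)
```

No data is re-queried or re-sorted; the rotated view reads the same cells in a different axis order.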



Features of OLAP - Slicing / Filtering


An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL (Mini Van, Coupe), COLOR (Normal Blue, Metal Blue) and DEALERSHIP (Carr, Clyde)]
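A slice like the one in the figure can be sketched as a filter on each dimension. The cube contents and volumes below are hypothetical, and the dictionary keyed by (model, color, dealership) is just one convenient way to hold the cells.

```python
# Hypothetical cube: (model, color, dealership) -> sales volume.
cube = {
    ("Mini Van", "Normal Blue", "Carr"): 2,
    ("Mini Van", "Metal Blue", "Clyde"): 6,
    ("Coupe", "Normal Blue", "Clyde"): 3,
    ("Coupe", "Red", "Carr"): 6,
    ("Sedan", "Metal Blue", "Gleason"): 3,
}

def slice_cube(cells, models=None, colors=None, dealerships=None):
    """Keep only the cells whose members are in the requested subsets.

    A filter left as None places no restriction on that dimension.
    """
    def keep(value, allowed):
        return allowed is None or value in allowed

    return {
        key: volume
        for key, volume in cells.items()
        if keep(key[0], models)
        and keep(key[1], colors)
        and keep(key[2], dealerships)
    }

subset = slice_cube(
    cube,
    models={"Mini Van", "Coupe"},
    colors={"Normal Blue", "Metal Blue"},
    dealerships={"Carr", "Clyde"},
)
print(subset)
```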

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION: Midwest
- DISTRICT: Chicago
  - DEALERSHIP: Clyde, Gleason
- DISTRICT: St. Louis
  - DEALERSHIP: Carr, Levi
- DISTRICT: Gary
  - DEALERSHIP: Lucas, Bolton

Sales can be viewed at Region / District / Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
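Roll-up and drill-down along the Region / District / Dealership hierarchy can be sketched with two lookup maps. The district assignments follow the slide as best it can be read, and the sales figures are hypothetical.

```python
from collections import defaultdict

# Organization hierarchy: dealership -> district -> region.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales at the lowest (dealership) level.
sales_by_dealership = {
    "Clyde": 6, "Gleason": 3, "Carr": 2, "Levi": 4, "Lucas": 1, "Bolton": 5,
}

def roll_up(values, parent_of):
    """Drill-up: add each member's value into its parent's total."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)

def drill_down(parent, parent_of, values):
    """Drill-down: show the child members that make up one parent."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}

sales_by_district = roll_up(sales_by_dealership, DISTRICT_OF)
sales_by_region = roll_up(sales_by_district, REGION_OF)
chicago_detail = drill_down("Chicago", DISTRICT_OF, sales_by_dealership)
print(sales_by_region, sales_by_district, chicago_detail)
```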


OLAP Reporting - Drill Down

Inflows (Region, Year)
[Chart: Inflows ($M) by region (East, West, Central) for Year 1999 and Year 2000]

Inflows (Region, Year - Year 1999)
[Chart: Inflows ($M) by region (East, West, Central) for the 1st to 4th quarters of Year 1999]
Drill-down from Year to Quarter

Inflows (Region, Year - Year 1999 - 1st Qtr)
[Chart: Inflows ($M) by region (East, West, Central) for January, February and March of Year 1999]
Drill-down from Quarter to Month


Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases serve as both the database and the application logic layer

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic are provided as separate layers

HOLAP - Hybrid OLAP
- The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Diagram: an OLAP cube (MDDB) feeds the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]


MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  - Maximum query performance
  - Optimum space utilization


ROLAP - Standard SQL storage

[Diagram: a relational DW is accessed with SQL through an MDDB-relational mapping; the OLAP calculation engine serves web browsers, OLAP tools and OLAP applications]

ROLAP - Features
- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier and database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability


HOLAP - Combination of RDBMS and MDDB


[Diagram: an OLAP cube and a relational DW (accessed with SQL) both feed the OLAP calculation engine, which serves any client: web browsers, OLAP tools and OLAP applications]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user


Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: With good design, 3 to 10 times
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
- MOLAP: Medium: MDDB server + large disk space cost
- ROLAP: Low: only RDBMS + disk space cost
- HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS
- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


- There is an exponentially increasing cost associated with finding software defects later in the development lifecycle.
- In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
- The methodology required for testing a data warehouse is different from testing a typical transaction system.


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge



User-triggered vs. system-triggered
In a data warehouse, most of the testing is system-triggered. In contrast, most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few of those test cycles cover system-triggered scenarios (such as billing or valuation).



Volume of test data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, because one tries to fill in the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases
In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.



Programming for testing challenge
In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
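A stand-alone comparison script of the kind described above might look like the following sketch. The `reconcile` function, the record layout and the `to_target` transformation rule are all hypothetical; a real script would read both sides from the source system and the warehouse.

```python
def reconcile(source_rows, target_rows, transform, key):
    """Compare transformed source rows with the rows actually loaded."""
    expected = {key(row): row for row in map(transform, source_rows)}
    actual = {key(row): row for row in target_rows}
    return {
        "missing_in_target": sorted(expected.keys() - actual.keys()),
        "unexpected_in_target": sorted(actual.keys() - expected.keys()),
        "mismatched": sorted(
            k for k in expected.keys() & actual.keys()
            if expected[k] != actual[k]
        ),
    }

def to_target(row):
    """Hypothetical transformation rule: trim and upper-case the name,
    convert the amount from text to a number."""
    return {"id": row["id"],
            "name": row["name"].strip().upper(),
            "amt": float(row["amt"])}

# Pre-transformation data (source) and post-transformation data (target).
source = [
    {"id": 1, "name": " joe ", "amt": "12"},
    {"id": 2, "name": "fred", "amt": "11"},
    {"id": 3, "name": "sally", "amt": "50"},
]
target = [
    {"id": 1, "name": "JOE", "amt": 12.0},
    {"id": 2, "name": "FRED", "amt": 99.0},  # wrong amount was loaded
]

report = reconcile(source, target, to_target, key=lambda row: row["id"])
print(report)
```

The report separates rows that never arrived from rows that arrived with different values, which usually point to different kinds of defect.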


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP

The testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source.
- Whether all the data transformations are correct according to the business rules and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that don't fulfil the transformation rules.

Unit testing the report data:
- Verify report data with source: Data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the source data available.
- Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.


Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify the newly inserted or updated data
- Testing the rejected records that don't fulfil the transformation rules
- Error log generation


Performance Testing
Performance testing should check for:
- ETL processes completing within the time window
- Monitoring and measuring the data quality issues
- Refresh times for standard/complex reports


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.



Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
- Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture

What makes a Data Warehouse


Components of Warehouse
- Source Tables: Real-time, volatile data held for transaction processing (OLTP). The sources can be any relational databases or flat files.
- ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Produce reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


This is a basic design: source files are loaded into a warehouse, and users query the data for different purposes.

This design adds a staging area, where the data is loaded and tested after cleansing and transformation. It is then loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis.


Data Modeling

Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
- Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
- Relationship: Relates entities to other entities.

Different Perspectives of Data Modeling
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of Dimensional Data Models most commonly used:
o Star Schema
o Snowflake Schema


Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: A category of information. For example, the Time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year -> Quarter -> Month -> Day.
- Fact Table: A table that contains the measures of interest.
- Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: Used to avoid data integrity problems. They are helpful for slowly changing dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Dimension Table: product

prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store

storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale

orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer

custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
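The star schema above can be tried out with a short sketch that creates the four tables in an in-memory SQLite database and joins the fact table to two of its dimensions. The choice of SQLite, the column types and the particular query are assumptions made for illustration; the rows are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
-- The fact table holds the measures (qty, amt) plus one key per dimension.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    "date"  TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                            (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO sale VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                        ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                        ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Total quantity and amount per product name and store city.
totals = conn.execute("""
    SELECT p.name, s.city, SUM(f.qty), SUM(f.amt)
    FROM sale AS f
    JOIN product AS p ON p.prodId = f.prodId
    JOIN store   AS s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
print(totals)
```

Each dimension is reached from the fact table with a single join, which is the property that gives the star schema its name.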


Snowflake Schema
store

storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType

tId  size   location
t1   small  downtown
t2   large  suburbs

city

cityId  pop  regId
sfo     1M   north
la      5M   south

region

regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.


Overview of Data Cleansing


The Need For Data Quality


- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with:
  - error detection
  - error rework
  - customer service
  - fixing customer problems


Understand Information Flow In Organization
- Identify authoritative data sources
- Interview employees & customers

Identify Potential Problem Areas & Assess Impact
- Data entry points
- Cost of bad data

Measure Quality Of Data
- Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

Clean & Load Data
- Use data cleansing tools to clean data at the source
- Load only clean data into the data warehouse

Continuous Monitoring
- Schedule periodic cleansing of source data

Identify Areas of Improvement
- Identify & correct cause of defects
- Refine data capture mechanisms at source
- Educate users on importance of DQ

Customized Programs
- Strengths: Address specific needs; no bulky one-time investment
- Limitations: Tons of custom programs in different environments are difficult to manage; minor alterations demand coding effort

Data Quality Assessment Tools
- Strength: Provide automated assessment
- Limitation: No measure of data accuracy


Business Rule Discovery Tools
- Strengths: Detect correlation in data values; can detect patterns of behavior that indicate fraud
- Limitations: Not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths: Usually integrated packages with cleansing features as an add-on
- Limitations: Error prevention at source is usually absent; the ETL tools have limited cleansing facilities


Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carleton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- WizRule from WizSoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- i.d.Centric from Firstlogic


Data Extraction, Transformation, Load


ETL Architecture


Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata
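
The four stages above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular ETL tool works: the source rows, field names (`cust_id`, `gender`, `amount`, `batch_flag`), the gender code table, and the in-memory "warehouse" list are all assumptions made for the example.

```python
from datetime import date

# Hypothetical source rows; the field names and codes are illustrative assumptions.
SOURCE_ROWS = [
    {"cust_id": "1", "gender": "M", "amount": "120.50", "batch_flag": "X"},
    {"cust_id": "2", "gender": "F", "amount": "", "batch_flag": "Y"},
    {"cust_id": "3", "gender": "M", "amount": "-5", "batch_flag": "X"},
]

GENDER_CODES = {"M": "Male", "F": "Female"}  # used for "changing codes"


def extract(rows):
    """Select qualifying rows using a simple criterion (a customer id is present)."""
    return [row for row in rows if row.get("cust_id")]


def clean(row):
    """Cleanup: drop operational-only fields, supply missing values, range-check."""
    cleaned = {k: v for k, v in row.items() if k != "batch_flag"}
    cleaned["amount"] = float(cleaned["amount"] or 0.0)  # supply missing field value
    if cleaned["amount"] < 0:
        return None  # fails the range check
    return cleaned


def transform(row, load_date):
    """Transformation: integrate types, change codes, add a time attribute."""
    return {
        "cust_id": int(row["cust_id"]),
        "gender": GENDER_CODES[row["gender"]],
        "amount": row["amount"],
        "load_date": load_date,
    }


def load(rows, target, metadata):
    """Loading: append to the target and update the load metadata."""
    target.extend(rows)
    metadata["rows_loaded"] = metadata.get("rows_loaded", 0) + len(rows)


def run_etl(source, target, metadata, load_date):
    cleaned = [c for c in (clean(r) for r in extract(source)) if c is not None]
    load([transform(r, load_date) for r in cleaned], target, metadata)


warehouse, meta = [], {}
run_etl(SOURCE_ROWS, warehouse, meta, date(2009, 1, 1))
```

Calling `run_etl` again with a new batch of rows appends to the same target and increments `rows_loaded`, which is one simple way to picture incremental loading.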


Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.



Major components involved in ETL Processing

- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs
- Metadata management: Provides a repository to define, document, and manage information about the ETL design and runtime processes
- Extract: The process of reading data from a database
- Transform: The process of converting the extracted data
- Load: The process of writing the data into the target database
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components
- Administration and operation: ETL utilities let administrators schedule, run, and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential


Metadata Management


What Is Metadata?

Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users - Business metadata
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third Party Bridging Tools

Oracle Exchange
- Technology of choice for a long list of repository, enterprise and workgroup vendors

Reischmann-Informatik-Toolbus
- Features include facilitation of selective bridging of metadata

Ardent Software / Dovetail Software - Interplay
- Hub and Spoke solution for enabling metadata interoperability
- Ardent focussing on own engagements, not selling it as independent product

Informix's Metadata Plug-ins
- Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards

CDIF (CASE Data Interchange Format)
- Most frequently used interchange standard
- Addresses only a limited subset of metadata artifacts

OMG (Object Management Group) - CWM
- XML addresses context and data meaning, not presentation
- Can enable exchange over the web employing industry standards for storing and sharing programming data
- Will allow sharing of UML and MOF objects between various development tools and repositories

MDC (Metadata Coalition)
- Based on XML/UML standards
- Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CAPLATINUM Technology (Founding Member), Viasoft


OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools


OLAP: On-Line Analytical Processing


- OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
- Used interchangeably with BI
- Multidimensional view of data is the foundation of OLAP
- Users: Analysts, Decision makers
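
A small Python sketch of viewing a measure aggregated along chosen dimensions. The fact records, the member names ("Term Deposit", "Retail", and so on) and the amounts are made up for illustration; only the measure and dimension names echo the slide.

```python
from collections import defaultdict

# Hypothetical fact records: two dimensions (product, customer) and one measure.
FACTS = [
    {"product": "Term Deposit", "customer": "Retail", "maturity_amount": 100.0},
    {"product": "Term Deposit", "customer": "Corporate", "maturity_amount": 250.0},
    {"product": "Bond", "customer": "Retail", "maturity_amount": 80.0},
    {"product": "Bond", "customer": "Retail", "maturity_amount": 20.0},
]


def aggregate(facts, dimensions, measure):
    """Sum a measure for every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


by_product = aggregate(FACTS, ["product"], "maturity_amount")
by_product_customer = aggregate(FACTS, ["product", "customer"], "maturity_amount")
```

The same facts give different aggregate views depending on which dimensions are passed in, which is the multidimensional view the slide refers to.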


Distinction between OLTP and OLAP


Source of data
- OLTP System: Operational data; OLTPs are the original source of the data
- OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: To control and run fundamental business tasks
- OLAP System: Decision support

What the data reveals
- OLTP System: A snapshot of ongoing business processes
- OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
- OLTP System: Short and fast inserts and updates initiated by end users
- OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions).

- A hypercube represents a collection of multidimensional data
- The edges of the cube are called dimensions
- Individual items within each dimension are called members


RDBMS v/s MDDB:


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

Increased Complexity...

27 x 4 = 108 cells

MDDB

[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde)]

3 x 3 x 3 = 27 cells
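
The cell counts can be checked with a short Python sketch. It places a few of the slide's relational rows into a 3 x 3 x 3 nested list; the nested-list layout is an assumption chosen for illustration and is not how a real MDDB stores data.

```python
MODELS = ["Mini Van", "Coupe", "Sedan"]
COLORS = ["Blue", "Red", "White"]
DEALERS = ["Clyde", "Gleason", "Carr"]

# A few relational rows (MODEL, COLOR, DEALER, VOL) from the slide's sample data.
ROWS = [
    ("Mini Van", "Blue", "Clyde", 6),
    ("Mini Van", "Blue", "Gleason", 3),
    ("Mini Van", "Blue", "Carr", 2),
    ("Mini Van", "Red", "Clyde", 5),
]


def build_cube(rows):
    """Create a 3 x 3 x 3 nested list and place each volume at its coordinates."""
    cube = [[[None for _ in DEALERS] for _ in COLORS] for _ in MODELS]
    for model, color, dealer, vol in rows:
        cube[MODELS.index(model)][COLORS.index(color)][DEALERS.index(dealer)] = vol
    return cube


cube = build_cube(ROWS)

# Relational form: 27 rows, each repeating MODEL, COLOR and DEALER next to VOL.
relational_cells = 27 * 4
# Cube form: one cell per (model, color, dealer) combination.
cube_cells = len(MODELS) * len(COLORS) * len(DEALERS)
```

In the cube, the dimension members are stored once as axis labels, so only the 27 volume cells remain.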


Benefits of MDDB over RDBMS


Ease of Data Presentation & Navigation
- A great deal of information is gleaned immediately upon direct inspection of the array
- User is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table

Storage Space
- Very low space consumption compared to relational DB

Performance
- Gives much better performance
- Relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries

Ease of Maintenance
- No overhead, as data is stored in the same way it is viewed
- In relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, then a blank cell is left behind.

Employee Age

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: Employee Age plotted as a LAST NAME x EMPLOYEE # array, with one filled cell per employee]

Sales Volumes

MODEL     Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2
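
The difference between the two arrays can be expressed as a fill ratio. The sketch below is an illustration only; the `density` helper is a hypothetical name, and the figures come from the Employee Age and Sales Volumes examples on this slide.

```python
# (last name, employee number, age) rows from the Employee Age example.
EMPLOYEES = [
    ("Smith", "01", 21), ("Regan", "12", 19), ("Fox", "31", 63),
    ("Weld", "14", 31), ("Kelly", "54", 27), ("Link", "03", 56),
    ("Kranz", "41", 45), ("Lucas", "33", 41), ("Weiss", "23", 19),
]


def density(filled_cells, *dimension_sizes):
    """Fraction of array cells that actually hold a value."""
    total = 1
    for size in dimension_sizes:
        total *= size
    return filled_cells / total


names = {name for name, _, _ in EMPLOYEES}
numbers = {number for _, number, _ in EMPLOYEES}

# LAST NAME x EMPLOYEE # array: 9 x 9 = 81 cells, only 9 of them filled.
employee_density = density(len(EMPLOYEES), len(names), len(numbers))
# MODEL x COLOR sales array: 3 x 3 = 9 cells, all 9 filled.
sales_density = density(9, 3, 3)
```

Each employee has exactly one employee number, so most name/number combinations never occur, whereas every model is sold in every color.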


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / Dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / Drill-through to underlying detail data


Features of OLAP - Rotation


Complex queries and sorts in a relational environment translate to a simple rotation.

Sales Volumes

View #1

MODEL     Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

( ROTATE 90° )

View #2

COLOR  Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.



Features of OLAP - Rotation


[Figure: the Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown in six orientations (View #1 to View #6), each reached by a 90-degree rotation]

A 3-dimensional array has 6 views.
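
One way to read these counts is that each view corresponds to an ordering of the axes, so an n-dimensional array has n! views (2! = 2, 3! = 6). The Python sketch below checks this and rotates the 2-dimensional Sales Volumes table; the helper names `views` and `rotate_2d` are hypothetical.

```python
from itertools import permutations


def views(dimensions):
    """Every ordering of the axes is one way to orient (rotate) the array."""
    return list(permutations(dimensions))


def rotate_2d(table):
    """Swap rows and columns of a 2-dimensional array (a transpose)."""
    return [list(column) for column in zip(*table)]


# MODEL x COLOR sales volumes: rows Mini Van, Coupe, Sedan; columns Blue, Red, White.
by_model = [[6, 5, 4], [3, 5, 5], [4, 3, 2]]
by_color = rotate_2d(by_model)

two_d = views(["MODEL", "COLOR"])
three_d = views(["MODEL", "COLOR", "DEALERSHIP"])
```

No data is re-queried or re-sorted in either case; only the order in which the axes are presented changes.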



Features of OLAP - Slicing / Filtering


MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL (Mini Van, Coupe), DEALERSHIP (Carr, Clyde) and COLOR (Normal Blue, Metal Blue)]
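
A slice can be sketched as a filter over the cube's coordinates. In this Python illustration the cube is a plain dictionary keyed by (model, dealership, color); that layout, the `slice_cube` helper and the volumes are assumptions made for the example.

```python
# Hypothetical cube stored as {(model, dealership, color): volume}; volumes are made up.
CUBE = {
    ("Mini Van", "Carr", "Normal Blue"): 2,
    ("Mini Van", "Clyde", "Metal Blue"): 6,
    ("Coupe", "Carr", "Normal Blue"): 3,
    ("Coupe", "Clyde", "Red"): 4,
    ("Sedan", "Gleason", "White"): 2,
}
AXES = ("model", "dealership", "color")


def slice_cube(cube, **wanted):
    """Keep only cells whose members fall in the requested sets, axis by axis."""
    result = {}
    for key, value in cube.items():
        members = dict(zip(AXES, key))
        if all(members[axis] in allowed for axis, allowed in wanted.items()):
            result[key] = value
    return result


blue_slice = slice_cube(
    CUBE,
    model={"Mini Van", "Coupe"},
    dealership={"Carr", "Clyde"},
    color={"Normal Blue", "Metal Blue"},
)
```

Omitting an axis from the call leaves that dimension unrestricted, so the same helper covers slicing on one, two or all three dimensions.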

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION   DISTRICT   DEALERSHIP
Midwest  Chicago    Clyde, Gleason
         St. Louis  Carr, Levi
         Gary       Lucas, Bolton

Sales at Region/District/Dealership level

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down, respectively.
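
Roll-up and drill-down along the organization hierarchy can be sketched in Python as follows. The assignment of dealerships to districts is read from the slide's hierarchy diagram and should be treated as an assumption; the sales figures and the helper names are made up for illustration.

```python
from collections import defaultdict

# Organization hierarchy: dealership -> district -> region (assignment assumed).
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical dealership-level sales; the numbers are made up for illustration.
DEALERSHIP_SALES = {"Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 3, "Lucas": 4, "Bolton": 1}


def roll_up(sales, parent_of):
    """Aggregate one level of the hierarchy into its parent level."""
    totals = defaultdict(int)
    for member, value in sales.items():
        totals[parent_of[member]] += value
    return dict(totals)


def drill_down(sales, parent_of, parent):
    """Return the child-level figures that make up one parent member."""
    return {m: v for m, v in sales.items() if parent_of[m] == parent}


district_sales = roll_up(DEALERSHIP_SALES, DISTRICT_OF)
region_sales = roll_up(district_sales, REGION_OF)
chicago_detail = drill_down(DEALERSHIP_SALES, DISTRICT_OF, "Chicago")
```

Applying `roll_up` twice moves from dealership to district to region; `drill_down` goes the other way for a single member.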


OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by region (East, West, Central) for Year 1999 and Year 2000]


OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) by region (East, West, Central) for the 1st to 4th quarters of Year 1999]

Drill-down from Year to Quarter



OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by region (East, West, Central) for January, February and March of Year 1999]

Drill-down from Quarter to Month


Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases for the database and application logic layers

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic provided as separate layers

HOLAP - Hybrid OLAP
- OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop
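
The HOLAP routing described above can be sketched in Python: look in the MDDB summaries first, and fall back to the relational detail rows otherwise. The two in-memory stores, the `query_total` function and all figures are hypothetical stand-ins, not the interface of any OLAP server.

```python
# Hypothetical stores: the MDDB holds summaries, the RDBMS holds detail rows.
MDDB_SUMMARY = {("Midwest", 1999): 120.0}
RDBMS_DETAIL = [
    {"region": "Midwest", "year": 1999, "amount": 70.0},
    {"region": "Midwest", "year": 1999, "amount": 50.0},
    {"region": "East", "year": 1999, "amount": 40.0},
]


def query_total(region, year):
    """Try the MDDB first; fall back to the RDBMS and aggregate on the fly."""
    key = (region, year)
    if key in MDDB_SUMMARY:
        return MDDB_SUMMARY[key], "MDDB"
    total = sum(
        row["amount"]
        for row in RDBMS_DETAIL
        if row["region"] == region and row["year"] == year
    )
    return total, "RDBMS"
```

The caller receives a total either way, so the source of the data stays transparent to the end user.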

MOLAP - MDDB storage

[Diagram: OLAP cube, OLAP calculation engine, and clients (web browser, OLAP tools, OLAP applications)]


MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  - Maximum query performance
  - Optimum space utilization


ROLAP - Standard SQL storage

[Diagram: relational DW, MDDB-relational mapping, SQL, OLAP calculation engine, and clients (web browser, OLAP tools, OLAP applications)]

ROLAP - Features
- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier & database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability


HOLAP - Combination of RDBMS and MDDB


[Diagram: OLAP cube and relational DW (accessed via SQL), OLAP calculation engine, and any client (web browser, OLAP tools, OLAP applications)]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user


Architecture Comparison

Definition
- MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
- ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to Sparsity
- MOLAP: Good design: 3-10 times
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in MDDB part

Data explosion due to Summarization
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query Execution Speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from RDBMS then it's like ROLAP, otherwise like MOLAP

Cost
- MOLAP: Medium: MDDB server + large disk space cost
- ROLAP: Low: only RDBMS + disk space cost
- HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data & it needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


- There is an exponentially increasing cost associated with finding software defects later in the development lifecycle.
- In data warehousing, this is compounded by the additional business costs of using incorrect data to make critical business decisions.
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system.


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge


Difference In Testing Data warehouse and Transaction System.


User-Triggered vs. System-Triggered
- In a data warehouse, most of the testing is system-triggered.
- Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.).
- There are very few test cycles that cover the system-triggered scenarios (like billing, valuation, etc.).


Difference In Testing Data warehouse and Transaction System


Volume of Test Data
- The test data in a transaction system is a very small sample of the overall production data.
- A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.

Possible Scenarios / Test Cases
- In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.


Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
- In transaction systems, users/business analysts typically test the output of the system.
- In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts.
- These scripts compare the data before transformation with the data after transformation.
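
A stand-alone comparison script of this kind might look like the Python sketch below. It is an assumption about what such a script could check (row counts, a measure total after applying the expected transformation, and missing or unexpected keys); the `reconcile` function, field names and sample rows are hypothetical.

```python
def reconcile(source_rows, target_rows, key, measure, transform=lambda v: v):
    """Compare row counts, measure totals and keys before and after transformation."""
    expected_total = sum(transform(row[measure]) for row in source_rows)
    actual_total = sum(row[measure] for row in target_rows)
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}
    return {
        "row_count_match": len(source_rows) == len(target_rows),
        "total_match": abs(expected_total - actual_total) < 1e-9,
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
    }


# Hypothetical data: the source stores amounts in cents, the warehouse in whole units.
SOURCE = [{"id": 1, "amount": 1050}, {"id": 2, "amount": 200}, {"id": 3, "amount": 75}]
TARGET = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 2.0}]

report = reconcile(SOURCE, TARGET, key="id", measure="amount", transform=lambda c: c / 100)
```

Here the report flags that row 3 never reached the target, which is the kind of defect back-end testing is meant to catch before users see the data.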


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP

Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements ambiguous?
- Are the requirements developable?
- Are the requirements testable?
