
Data Warehouse Concepts

Data Warehouse (DW): A subject-oriented, integrated, non-volatile, time-variant collection of data organized to support management needs.
Also referred to as the Central Data Warehouse (hub).
Data Warehouse serves as a single-source hub of integrated data upon which all
downstream data stores are dependent. The Data Warehouse has roles of intake,
integration, and distribution.
The EDW is also referred to as the Atomic Data Store. If present, it serves as the single source of consistent and accurate information.
The EDW is an optional component, especially if the DW environment is built the Ralph Kimball way. In this model, data from staging is integrated and loaded directly into Data Marts. We will discuss these in detail in Module 5.
An EDW is defined as a data store containing Subject-Oriented, Integrated, Non-Volatile, and Time-Variant data.
The size of an EDW is extremely large. Data in an EDW is not deleted under normal circumstances; in other words, there can be only soft deletes.
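As a rough sketch of the soft-delete principle just described, a row is flagged inactive rather than physically removed, so history is preserved. The table, column names, and data below are hypothetical, chosen only for illustration:

```python
import sqlite3

# In-memory database standing in for an EDW table
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    name TEXT,
    is_deleted INTEGER DEFAULT 0,  -- soft-delete flag
    deleted_at TEXT)""")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Acme'), (2, 'Globex')")

# Soft delete: flag the row instead of issuing a physical DELETE
conn.execute(
    "UPDATE customer SET is_deleted = 1, deleted_at = datetime('now') WHERE id = 2")

# Current queries filter on the flag; the history is still in the table
active = conn.execute("SELECT name FROM customer WHERE is_deleted = 0").fetchall()
total = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(active)  # only the active row is visible
print(total)   # but both rows are still stored
```

Downstream loads can then select on the flag (or the delete timestamp) without ever losing the historical record.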
The data model is most often an E-R (Entity Relationship) model, normalized to some extent.

One of the key features of an EDW is that it stores historical data, and data at the most granular level. It is a truly corporate representation of data; however, access to it is limited, as it is not meant for reporting purposes.

Data Mart

One of the important features of a Data Mart is that its data model is customized for a business process or a department. It does not contain all corporate-level data, as an EDW does, and hence takes less time to build and maintain.
Data is represented in an elegant manner; a manner in which the business can understand the structure and contents. The data model is also denormalized.
The data model consists of a large centralized table called the 'FACT' table (which consists of the measures or values that the business is looking for) and a set of small descriptive entity tables called the 'DIMENSION' tables.
If a dimension is a 'Conformed Dimension', then it can be shared across different Data Marts, thus minimizing the design time. We will talk about these in detail in the subsequent modules.
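A minimal sketch of the FACT/DIMENSION structure described above can be built with SQLite. The table names, columns, and data are hypothetical; the point is only the shape of a star schema and the typical join-and-aggregate query against it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- two small descriptive DIMENSION tables
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, month TEXT);
-- one central FACT table holding the measure the business asks for
CREATE TABLE fact_sales  (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    amount      REAL);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date VALUES (202401, 'Jan'), (202402, 'Feb');
INSERT INTO fact_sales VALUES (1, 202401, 100.0), (2, 202401, 50.0), (1, 202402, 75.0);
""")

# A typical star-schema query: join the fact to a dimension, aggregate the measure
rows = db.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)
```

The business asks questions in terms of the dimension attributes (product, month) and gets back aggregated measures from the fact table.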

A Data Mart represents data that is business-process specific and/or department specific. (A business process can consist of multiple departments.) The feature of a Data Mart is that it stores data in a business-friendly representation, also called a dimensional model or a star schema. The data stored in a Data Mart may not be at the most granular level; quite often it is aggregated and summarized.

Its usage is more for analytical and reporting purposes. BI and DSS tools make use of this data structure for OLAP, data visualization, query, search and analysis, and BI reporting.

Analytics
Analytics: Analytics is the science of analysis. It defines how a business or an entity arrives at an optimal or realistic decision based on existing data.
Applications of Analytics include the study of business data using statistical analysis in order to discover and understand historical patterns, with an eye to predicting and improving business performance.
In other words, Applied Business Analytics is Business Intelligence. BI Analytics consists of the following:

Query, Reporting and Search Tools
OLAP, Visualization and Data Mining Tools
Executive Dashboards and Scorecards
Predictive Analysis Tools

As can be seen in the diagram on the left-hand side, the X axis denotes business value (going from low to high) and the Y axis denotes complexity in terms of Analytics (from bottom to top).

In terms of business queries:

1. What happened? - We get the answer to this using Reporting and Query tools
2. Why did it happen? - We get the answer to this using OLAP and Visualization tools
3. What's happening now? - We get the answer to this using Dashboards and Scorecards
4. What might happen? - We get the answer to this using Predictive Analysis tools

Dashboards, Scorecards, and Predictive Analysis are used by executives and will be covered in subsequent modules.

Metadata
Metadata: Two contractors are assigned the task of building a bridge. One is to start building from the east end and the other from the west end. Both have to meet in the center and then merge.
When they arrived at the center point, one end of the bridge was higher than the other by a few inches. This was because one group of contractors and their engineers used kilograms and meters, while the other used pounds and feet. It caused the parent company losses in billions!
Reason - It wasn't the data that was faulty; it was the Metadata!
Metadata is 'Data about Data'. It refers to data that tries to describe a data set in terms of its value, content, quality, and significance.
It provides insight into data for information like:
1. What kind of data?
2. Who is the owner of this data?
3. How was the data created?
4. What are the attributes and significance of the data created or collected?

Inmon's Central Data Warehouse - Hub and Spoke architecture: Inmon defines a Data Warehouse as "a subject oriented, integrated, non-volatile, time-variant, collection of data organized to support management needs." (W. H. Inmon, Database Newsletter, July/August 1992)

The intent of this definition is that the Data Warehouse serves as a single-source Hub of
integrated data upon which all downstream data stores are dependent. The Inmon Data
Warehouse has roles of intake, integration, and distribution.
Kimball's definition: Bus Architecture: Kimball defines the warehouse as "nothing more than the union of all the constituent data marts." (Ralph Kimball, et al., The Data Warehouse Lifecycle Toolkit, Wiley Computer Publishing, 1998)
This definition contradicts the concept of the Data Warehouse as a single-source Hub.
The Kimball Data Warehouse assumes all data store roles -- intake, integration,
distribution, access, and delivery.
Inmon's Data Warehouse approach includes the following:

Inmon's approach is to have a single, consistent, and accurate store of data; this he termed the EDW, or Enterprise Data Warehouse.

Data Marts would then be built as subsets of the Data Warehouse. The data marts would be department or business-process specific, and BI reporting could be done from them.

The advantage of this approach, as per Inmon, is that there would be a single, consistent, accurate source of corporate data, thus reducing data redundancy. Data design, consistency, and change can be handled much better.

The disadvantage of this approach is that the time required to build an EDW is quite large; it may take years for an EDW to be fully functional, and the cost of building the EDW is huge. Moreover, getting buy-in from the business stakeholders becomes difficult, as the return on investment (ROI) is not realized early.

Hub and Spoke architecture


The Hub-and-Spoke architecture provides a single integrated and consistent source of data from which data marts are populated. The warehouse structure is defined through enterprise modeling (a top-down methodology).
The ETL processes acquire the data from the sources, transform the data in accordance with established enterprise-wide business rules, and load the Hub data store (central Data Warehouse or persistent staging area). The strength of this architecture is enforced integration of data.

As can be seen in the diagram, we are assuming that there is no Integration Hub currently in place. The source (OLTP) systems are on the left and the data marts are on the right side of the diagram. Different source systems may feed a single data mart, as seen in the diagram.
Hence there would be a need to create a great many interfaces, and consequently a lot of hardware, software, and maintenance would be required, which would add to the overall cost. The bottom line is that if there are m applications and n data marts, then m x n interfaces would need to be built, maintained, and executed for the Data Warehouse.
Data redundancy is another factor, as it is quite possible that a given application feeds more than one data mart and that each of these data marts stores the same data. Lack of synchronization between these data marts may result in data inconsistency and data quality issues, leading to the business losing faith in the data.

In the diagram, we have the source systems on the left and the data marts on the right-hand side. As can be seen, the source systems are named App1, App2, App3, and so on, and each application feeds more than one data mart. We have 4 data marts for separate business processes, namely Finance, Sales, Marketing, and Accounting.
Each mart consists of data unique to it and also data which is common across the other data marts, as multiple data marts can receive the same set of data from the same application. What this means is that there is no demarcation between Common Corporate Data and Unique Business-Process-Specific or Departmental Data.
Maintenance of interface applications for each source-to-data-mart connection, data consistency, and reliability can be an issue.
The cost of maintenance of interfaces increases geometrically with every increase in either the source system applications feeding the data marts and/or the addition of a new data mart.
Continuing with the example on the previous slide, we look at what other problems can arise in the absence of the Integration Hub.


As we can see in the diagram, each data mart returns a different figure. This is because there are some customers who are unique to a given data mart, while there are others who are common across data marts.
Thus it becomes very difficult to say which customers are common across which data marts, as there is no way to get this information. Moreover, since each data mart returns a different figure for the number of customers, there is no way to tell which figure is correct.
The business folks would thus lose all faith in the data they are seeing, as this data is inconsistent across the different data marts.


As can be seen in the diagram, having an Integration Hub adds value because:
1. There is an orderly approach to building interfaces; if there are m applications and n data marts, then we would need only m + n interfaces.
2. Data consistency, accuracy, and reliability are increased, as there is now a single integration point for all kinds of source data; data redundancy is also reduced to a great extent.
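The interface-count arithmetic behind these two topologies can be stated directly; the example figures below (10 applications, 4 data marts) are illustrative:

```python
# Point-to-point wiring: every source application connects to every data mart
def point_to_point(m, n):
    return m * n

# Hub-and-spoke wiring: m feeds into the hub, n feeds out to the marts
def hub_and_spoke(m, n):
    return m + n

# e.g. 10 source applications and 4 data marts
print(point_to_point(10, 4))  # interfaces without an Integration Hub
print(hub_and_spoke(10, 4))   # interfaces with an Integration Hub
```

The gap widens quickly: adding one more source application adds n new interfaces in the point-to-point design, but only one in the hub design.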

In continuation of the previous screen, refer to the contents on this screen for additional information on an Integration Hub. The diagram shows one more advantage of having an Integration Hub. As can be seen, common corporate data and business-specific data are clearly demarcated. Current-level, integrated, and detailed data is maintained only in the Integration Hub.


Now, if we ask the same question we asked a few screens back, that is, "How many customers are there?", what would be the answer?
The Integration Hub maintains a unique definition for each type of customer present. When we look into each data mart for "How many customers are there?", the numbers in each data mart may still vary; however, each mart also gives the type of customers that are present, rather than just the number of customers. Thus each data mart can now answer exactly how many customers, and of what type, it contains.
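A small sketch of this idea: with a hub-maintained definition of customer types, each mart can qualify its count by type instead of returning a bare, unexplainable number. The customers, types, and mart scopes below are hypothetical:

```python
# Hub-maintained master list: (customer_id, customer_type)
hub_customers = [
    ("C1", "retail"), ("C2", "retail"),
    ("C3", "corporate"), ("C4", "corporate"),
]

# Which customer types each mart carries (illustrative scopes)
mart_scope = {"Sales": {"retail", "corporate"}, "Finance": {"corporate"}}

def mart_answer(mart):
    """Return (total customers, breakdown by type) for one mart."""
    rows = [c for c in hub_customers if c[1] in mart_scope[mart]]
    by_type = {}
    for _, t in rows:
        by_type[t] = by_type.get(t, 0) + 1
    return len(rows), by_type

print(mart_answer("Sales"))    # counts still differ between marts...
print(mart_answer("Finance"))  # ...but each count is qualified by customer type
```

The figures still differ, but now the difference is explainable: each mart's answer carries the type mix that produced it.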

The Kimball Data Warehouse approach includes:

Kimball's prime objective was to get the Data Warehouse up and running as quickly as possible.

He proposed that the data marts could be built directly from the source systems, instead of having a centralized repository like an EDW, as proposed by Inmon.

The basic requirement to build such a Data Warehouse is to have a set of conformed dimensions and facts across the different business departments.

The advantage of the Kimball model is that multiple data marts serving the different business units can be built in parallel, with each data mart having only its departmental data. The time required to build the EDW is thus eliminated, and the return on investment (ROI) is realized early, as the marts are up and running in quick time and the business can see the value from the reports generated from these marts.

The disadvantage of this approach is that data redundancy would still persist, as it is quite possible that each of the built marts may have some common set of entities; for example, the sales mart would need the product data, and the inventory mart would also need the product data. Integration of data across the marts over the years would be another challenge.

Kimball Data Warehouse: Bus integration


Bus Integration Architecture
The Bus Architecture relies on the development of conformed data marts populated directly from the operational sources or through a transient staging area. Data consistency from source-to-mart and mart-to-mart is achieved by applying conventions and standards (conformed facts and dimensions) as the data marts are populated.


The strength of this architecture is consistency without the overhead of the central
Data Warehouse.

The basis of this approach is to have the data marts up and running as quickly as possible so that the business can see the benefits of these marts, to allow parallel development of different data marts, and to avoid the cost and time required to build an enterprise Data Warehouse. Data redundancy is not really a criterion for this approach.
As can be seen in the diagram on the right-hand side, data from disparate data sources is fed directly into the conformed data marts through the integration bus.
A conformed data mart is one which consists of conformed dimensions (and facts). A conformed dimension is one which holds the same business meaning and significance across the multiple data marts of which it can be a part.
The conformance of the dimension (and the data mart) is built using the bus architecture framework; we will look at this in the subsequent slides.
As stated by Kimball, the strength of this architecture is consistency without the overhead of the Integration Hub or the central Data Warehouse.
Conformed dimension:
A dimension which retains the same business and technical nomenclature even if shared across business processes.
Shared dimensions should conform.
Identical dimensions should have the same definitions, keys, labels, and values.

Business Analysts and Architects from different business streams arrive at a single description of a dimension and its attributes. This results in a conformed dimension.
Conformed Dimensions are listed on the 'X' axis; Business Processes are listed on the 'Y' axis.
The matrix is completed by filling in an 'X' at the intersection of a Business Process and a Dimension, implying 'this dimension is required for this business process'.
Once finalized, parallel development of Data Marts can begin, with each business process corresponding to a Data Mart.
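The bus matrix can be sketched as a mapping from business process to the set of dimensions it requires; an 'X' in the matrix becomes set membership. The Sales row follows the text; the Orders and Inventory rows below are assumptions for illustration:

```python
from collections import Counter

# Business process -> dimensions it requires (the bus matrix rows)
bus_matrix = {
    "Sales":     {"Product", "Customer", "Store", "Date", "Promotion"},
    "Orders":    {"Product", "Customer", "Vendor", "Date"},      # assumed row
    "Inventory": {"Product", "Date", "Store", "Distributor"},    # assumed row
}

# A dimension that appears in more than one process must be conformed,
# i.e. agreed on once and shared across those marts
usage = Counter(d for dims in bus_matrix.values() for d in dims)
conformed = sorted(d for d, n in usage.items() if n > 1)
print(conformed)
```

Reading the matrix this way makes the planning use obvious: the shared dimensions are exactly the ones the Business Analysts must agree on before parallel mart development begins.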


As can be seen in the diagram, the business processes or departments of a corporation are represented on the Y axis, and the conformed dimensions are represented on the X axis.
We have Sales, Orders, and Inventory on the Y axis as the business processes, and we have Product, Customer, Vendor, Date, Store, Distributor, and Promotion as the conformed dimensions.
The intersection of which dimension is part of which business process (or department) is noted with a cross (X). For example, the Sales business process consists of the Product, Customer, Store, Date, and Promotion dimensions. This matrix then helps in the parallel development of each business process or data mart.
This is another way of representing the bus architecture of conformed dimensions and business processes.


Data Warehouse architecture is of three types:

Type 1: ER Model for EDW and Star Schema for Data Marts - Inmon

Type 2: Dimensional Model all through - Kimball

Type 3: ER Model for DW - Teradata


Refer to the diagram on the screen to understand the Inmon Data Warehouse architecture. It is also termed the Corporate Information Factory. One of the salient features of this architecture is that it consists of an Integration Hub, or the EDW (Enterprise Data Warehouse), which stores all of the corporate integrated and detailed data.
In continuation of the previous screen, this screen illustrates another way of representing the Inmon architecture.


As can be seen in the diagram, there are 5 verticals, namely: Source Systems or Operational Systems, Data Preparation Area, ER Model or the Detail Data, Dimensional Model or the Data Marts, and Access and Delivery.
Please note that the ER vertical consists of the Enterprise Data Warehouse, which holds all corporate information.

Refer to the contents on the screen to understand the Kimball Architecture diagram.


There are four verticals in the Kimball Architecture diagram, namely: Source Systems or Operational Systems, Data Preparation, Dimensional Model, and Access and Delivery.
The difference is that there is no Integration Hub here; instead, data is loaded directly into data marts from the source systems. This kind of approach is faster to build and uses the bus architecture approach; parallel data mart development can take place, and Return on Investment (ROI) is visible early.


There are five different roles in the Data Warehouse environment from a Data Store perspective.
The five roles in a DW environment are as follows:
1. Intake - Intake, Integration, Distribution, Delivery, and Access are the five primary responsibilities of a Data Store.
2. Integration - Integration describes how the data fits together. The challenge for the warehousing architect is to design and implement consistent and interconnected data that provides readily accessible, meaningful business information. Integration occurs at many levels: the key level, the attribute level, the definition level, the structural level, and so forth (Data Warehouse Types, www.billinmon.com). Additional data cleansing processes, beyond those performed at intake, may be required to achieve desired levels of data integration.
3. Distribution - Data stores with distribution responsibility serve as long-term information assets with broad scope. Distribution is the progression of consistent data from such a data store to those data stores designed to address specific business needs for decision support and analysis.
4. Delivery - Data stores with delivery responsibility combine data into 'in business context' information structures to present to the business units that need them. Delivery is facilitated by a host of technologies and related tools - data marts, data views, multidimensional cubes, web reports, spreadsheets, queries, and so on.


5. Access - Data stores with access responsibility are those that provide business retrieval of integrated data, typically the targets of a distribution process. Access-optimized data stores are biased toward ease of understanding and navigation by business users.


We start with the Inmon Data Warehouse and see which of these 5 roles suit what purpose.
We can see 3 roles being defined here, in the center left of the diagram. These are Intake, Integration, and Distribution.
What this means is that the Inmon Data Warehouse is responsible for the intake, integration, and distribution of data as part of its Data Warehouse architecture.
It treats delivery and access as outside the prerogative of its Data Warehouse environment. Essentially, this means that Inmon's DW environment is limited to the creation of the EDW. Building marts and then generating reports out of these marts is considered an external act.
Going by this definition, an Inmon Data Warehouse would:
1. Intake data from various sources
2. Integrate it to form a complete and conformed record, and finally
3. Distribute the data to various data stores (business-unit specific), mainly from the EDW to the Data Marts

As can be seen in the diagram, all 5 defined roles serve the Kimball Data Warehouse. What this means is that all 5 roles, from Intake to Integration to Distribution to Delivery to Access, are part of the Kimball Data Warehouse. The Kimball Data Warehouse involves all 5 roles, from the intake of data to, finally, accessing the data marts for generating reports.

ETL is classified into the following five categories:

1. Data Profiling
2. Data Cleansing
3. Data Integration, Consolidation, and Population
4. Data Replication
5. Data Federation

Each of the five categories is described in detail in the subsequent screens.
ETL classification: Data Profiling
Features include:

Analysis of metadata and data values; detection of differences between defined and inferred properties
Discovery of dependencies within source tables (functional dependencies, primary key) and across them (detecting common domains: redundancy, foreign-key relationships)
Recommendations for the target data model (for example, primary key, foreign key, normalized design)

Benefits include:

Data quality through understanding the metadata of your data sources (their structure and the relationships within and among them), supported through efficient tooling

Data profiling is basically the analysis of data and metadata values for their correctness; in other words, the detection of differences between defined and inferred properties.
This step is carried out in the initial stages of the Data Warehouse, even before data is loaded from the source systems into the Data Warehouse. It is carried out so as to ascertain the quality of the data that would be loaded into the Data Warehouse; in case the quality is not up to the mark, it leads to a full-fledged Data Quality initiative.
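A minimal profiling pass can be sketched in a few lines: for each column, infer a type from the values, count nulls, and test whether the column could serve as a primary key. The sample rows and column names are hypothetical; real profiling tools go much further:

```python
def profile(rows, columns):
    """Build a small profiling report: inferred type, null count, key candidacy."""
    report = {}
    for i, col in enumerate(columns):
        values = [r[i] for r in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            # type inferred from the data, to compare against the defined type
            "inferred_type": type(non_null[0]).__name__ if non_null else "unknown",
            "null_count": len(values) - len(non_null),
            # a candidate key must be unique and have no nulls
            "candidate_key": len(set(non_null)) == len(values),
        }
    return report

rows = [(1, "Acme", None), (2, "Globex", 9.5), (3, "Acme", 7.0)]
rep = profile(rows, ["id", "name", "score"])
print(rep["id"]["candidate_key"])    # unique, no nulls
print(rep["name"]["candidate_key"])  # 'Acme' repeats
```

Comparing such inferred properties against the defined schema is exactly the "defined versus inferred" check described above.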
ETL classification: Data Cleansing
Features include:

Data standardization: transforms different input formats into a consolidated output format
Creating single-domain fields
Incorporating business and industry standards
Data matching
Data enrichment
Data survivorship

Benefits include:

Reduced costs through improved data quality and consistency
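As a sketch of the standardization feature listed above, the function below folds several input date formats into one consolidated output format. The list of accepted formats is an assumption for illustration; unparseable values fall through for data-quality handling:

```python
from datetime import datetime

# Input formats this sketch accepts (an illustrative, not exhaustive, list)
INPUT_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"]

def standardize_date(raw):
    """Transform different input date formats into one output format."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable: route to a data-quality exception queue

print(standardize_date("31/01/2024"))
print(standardize_date("2024-01-31"))
print(standardize_date("31 Jan 2024"))
```

The same pattern generalizes to names, addresses, and codes: try the known input shapes, emit one consolidated shape, and sideline whatever cannot be matched.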


ETL classification: Data Integration, Consolidation, and Population

Features include:

Complex transformations
High data volumes (billions of records)
Performance and scalability of target access more important than data concurrency in the target
De-coupled model: minimal impact on source systems due to target access
Target may collect historical snapshots of integrated information

Benefits include:

Gaining insight through a single version of the truth in a distributed, heterogeneous, and possibly low-quality data environment

This includes integrating data across data sources, transforming the data, and storing it as single, consistent, detailed, and accurate data. The other aspect is the consolidation and loading of data into the Data Warehouse (EDW and Data Marts). The emphasis here is more on performance and scalability while accessing the data from the target data marts. Data consolidation provides a single version of the truth and data integration from different heterogeneous sources.
ETL classification: Data Replication
Features include:

Unidirectional data distribution or bidirectional data synchronization
Increased performance and scalability through distribution of load to multiple specialized copies of information
Increased availability and reliability for failover scenarios
Low-latency, high-throughput data movement with queue-based replication
Automated and system-supported conflict resolution for bi-directional replication

Benefits include:

Improved performance, scalability, reliability, and availability while guaranteeing consistency

ETL classification: Data Federation

Data Federation is also known as 'On-Demand Integration'.
Features include:

On-demand integration instead of copy management and data redundancy
Real-time access to distributed information as if from a single source
A flexible and extensible integration approach for dynamically changing environments
Query optimization
Integration of structured and unstructured information

Benefits include:

Reduced time to market and controlled costs when joining distributed (rather homogeneous) information

ETL and related technologies

Extract Transform and Load (ETL) is not the only technology used in DWBI. There are, in all, four technologies that are used variously in DWBI. These are as follows:

Extract Transform Load (ETL)
Enterprise Information Integration (EII)
Enterprise Application Integration (EAI)
Extract Load Transform (ELT)

EII - Enterprise Information Integration

An optimized and transparent data access and transformation layer providing a single relational interface across all enterprise data
Allows users to easily combine data warehousing reports with newly acquired real-time analytic information through transparent queries - without caring where the data lives

Examples: Ipedo, DataMirror

EAI - Enterprise Application Integration

Message-based, transaction-oriented, point-to-point (or point-to-hub) brokering and transformation for application-to-application integration
Enables data sharing among partners in a supply chain, or brings transactional applications together after an acquisition

Examples: BizTalk, Tibco

ELT - Extract Load Transform

Set-oriented, point-in-time loading for migration and transformation for data warehousing
Supports the large-scale loading of a DW or the migration of vast quantities of data between systems

Example: Sunopsis

ETL and Related Technologies - ETL versus ELT

In ETL, the data is first manipulated outside the database to cleanse and sort it, and only then is the result loaded into the database, as illustrated in the following diagram.

In ELT, the raw data from the source system is first loaded into staging tables in a database. Only then is it cleansed, transformed, and loaded into the target tables, as illustrated in the following diagram.
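The order-of-operations contrast can be sketched with plain Python lists standing in for the data stores; the "source" rows and the cleansing rule are purely illustrative:

```python
source = ["  alice ", "BOB", "  alice "]  # raw rows with noise and duplicates

def transform(rows):
    """Stand-in for cleansing: trim, lowercase, dedupe, sort."""
    return sorted({r.strip().lower() for r in rows})

# ETL: transform outside the database, then load only the finished result
etl_warehouse = transform(source)

# ELT: load the raw rows into staging first, transform inside afterwards
staging = list(source)           # raw load, rows arrive untouched
elt_warehouse = transform(staging)

print(etl_warehouse)
print(elt_warehouse)
```

Both paths end with the same cleansed result; the practical difference is where the transformation work runs, which is why ELT leans on the database engine for large-scale loads.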


The similarities between ETL, EII, and EAI


Analytics and BI

Analytics leverages data in a particular functional process (or application) to enable 'context-specific insight that is actionable'. It can be used in many industries in real-time data-processing situations to allow for faster business decisions.


Analytics is different from BI, although BI products play a role in analytics.

Applications of Analytics include the study of business data using statistical analysis in order to discover and understand historical patterns, with an eye to predicting and improving business performance.

In short, Applied Business Analytics is Business Intelligence.

BI is an architecture and a collection of integrated operational as well as decision-support applications and databases that provide the business community easy access to business data.

BI is all about how to capture, access, understand, analyze, and turn one of the most valuable assets of an enterprise - raw data - into actionable information in order to improve business performance.

Business Analytics is the analytical process of Reasoning, Forecasting, and Measuring business actions and processes based on patterns extracted from collected business data and business plans.

Ad-Hoc Query and Reporting Tools

Reporting Tools are a category of data access solution in which information is presented in the form of reports. Reporting Tools are also referred to as Query, Search and Reporting tools. They present the state of data and information at a point in time in a report format. The two types of Query and Reporting Tools are described in the tabs alongside.
Managed Query and Canned Reporting Tool:
This handles canned reports, which are essentially reports that are pre-formulated and stored as a query plan. They can either be scheduled to run or run on-demand. These reports can, however, accept runtime parameters as inputs at any given point in time.


Metadata is 'data about data'. It provides a basis for trust in information, providing
visibility into lineage, relationships to other systems, and business definitions.
It refers to data that tries to describe a data set in terms of its value, content, quality, and significance. It also provides insight into data for information like:

What kind of data?


Who is the owner of the data?
How was the data created?
What are the attributes and significance of the data created or collected?


Need for Metadata

Faster development, faster maintenance: Helps accelerate development by actively sharing knowledge through the analysis, design, and build process, even with external technologies. Also serves as an automatic form of documentation to make maintenance easier, and provides the ability to assess the impact of changes prior to making them.
Better business and IT collaboration: Aligns business and IT understanding by linking business terms, rules, and taxonomies to technical artifacts. Also allows business and IT resources to collaborate while using tools tailored to their roles.
Trust: Supports a higher degree of trust in information by keeping a record of collaboration and the ability to see where information comes from.
More consistency: Improves the consistency, accuracy, and speed of data for the DW by providing business-specific and technical information about the available DW data.
Reduced time: Reduces development time by integrating and merging data from disparate sources into the DW.
Reliability: Increases data reliability by having consistent definitions and nomenclatures for data within the DW.
Integration of data: Helps in the integration of data across the enterprise, especially when acquisitions and mergers are the order of the day.


Characteristics of DW Metadata
DW Metadata typically helps in tracking the following:
Extract Information: Last Refresh / Load - Date / Time
Historical Information about data and Metadata: Versioning and data access patterns over a period of time
Data Mapping Information: Source-to-target and transformation rules
Summarization: Aggregation algorithm
Archiving: Period of data purging
Reference and Standardization: Aliases and lookups
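The items tracked above can be sketched as one metadata record per DW table. Every field name and sample value below is illustrative, mirroring the list rather than any real metadata repository:

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    name: str
    last_refresh: str            # extract information: last load date/time
    source_mapping: dict         # source-to-target and transformation rules
    aggregation: str = "none"    # summarization: aggregation algorithm
    purge_after_days: int = 0    # archiving: period of data purging
    aliases: list = field(default_factory=list)  # reference/standardization

md = TableMetadata(
    name="fact_sales",
    last_refresh="2024-01-31T02:00:00",
    source_mapping={"amount": "orders.total * fx_rate"},
    aggregation="daily SUM",
    purge_after_days=2555,       # e.g. a seven-year retention policy
    aliases=["sales_f"],
)
print(md.name, md.aggregation)
```

Keeping such records alongside the data is what lets a user ask where a column came from and how fresh it is without reading the ETL code.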


Business drivers for MDM (Master Data Management)

"Unless enterprises figure out how to synchronize Master Data among departments, divisions and enterprises, the value promised from business process fusion will be much less than expected."
- Gartner, October 2003


Challenges faced with Master Data

The challenges faced in implementing MDM solutions are:
Duplicates: Distinct supplier records for 'IBM', 'I.B.M', and 'International Business Machines'
Multiple conflicting views: Material Master Data is out of sync between ERP systems
Data quality issues: Customer movement (clean data erodes quickly)
Fragmentation: Product cost and specification managed by discrete, out-of-sync ERP systems
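The duplicates challenge above can be sketched directly: collapse the variant supplier spellings onto one master record. The matching rule (strip punctuation and case, then look up a maintained alias set) is a simplifying assumption; real MDM matching is far more sophisticated:

```python
import re

# The duplicate supplier records from the text
records = ["IBM", "I.B.M", "International Business Machines"]

# Alias keys an MDM steward would maintain (illustrative)
ALIASES = {"ibm", "internationalbusinessmachines"}

def master_key(name):
    """Reduce a raw name to its master-record key."""
    key = re.sub(r"[^a-z]", "", name.lower())  # drop dots, spaces, case
    return "IBM" if key in ALIASES else key

masters = {master_key(r) for r in records}
print(masters)  # all three spellings collapse to one master record
```

Once every variant resolves to the same key, downstream systems can share a single version of the supplier rather than three conflicting ones.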


The example shows three types of environments, which deal with all kinds of corporate data. However, they still lack the consistency required to deliver real-time and accurate information, as these environments were designed to serve a specific function for a specific set of corporate data.
In the diagram, on the left is the Data Warehouse system, in the middle is the EAI (Enterprise Application Integration) environment, and on the right is the ERP system.
The Data Warehouse is unidirectional; that is, data flows from source to target, and reverse synchronization is not possible. It also works on the principle of batch updates, hence real-time synchronization of data is not possible.
The Enterprise Application Integration (EAI) environment does not preserve history, it is event driven, and moreover the investment in data synchronization is huge.
In the case of ERP systems, where data proliferation is huge, there is no synchronization of data between the different ERPs, and high investment is required for the consolidation of data.


Main purpose of Master Data


Decouple Master Data from individual applications and provide a single version of
truth for Master Data (analytical, operational, reference, and so on).
What is Master Data?


Master Data describes core business entities: customers, suppliers, partners,
products, materials, chart of accounts, locations, and employees.
o It is high-value information used repeatedly across business processes
o It is generally used across multiple LOBs (Lines of Business)
It gives business context by providing concrete data models and processes for a
particular domain.

What are the benefits of Master Data?


Common authoritative source of accurate, consistent, and comprehensive master
information for business services to access critical business information
Common business services supporting consistent information-centric procedures
across all applications within the enterprise and extended enterprise
Business process support to integrate with or drive business processes across
heterogeneous applications by making data actionable


Continuing with the bank example, refer to the diagram below, where you will see
the Master Record for a given customer, presented on the right-hand side of the
diagram.

This would become the Master Data repository for the customer.
The approach for the Master Data repository is as follows:
1. Core Master Data resides in the Master Repository and is published out to the
dependent applications. This means that all the attributes that are common
across the three applications now reside in the Master Data repository. Only
those attributes that are specific to a type of account reside in that specific
application. For example, 'Loan A/C No' would reside only in the 'Home Loan
Customer Application', whereas 'Max Credit Limit' would reside only in the
'Credit Card Customer Application'.
2. The applications also store the master attributes, but they share a global
primary key; the individual (primary) keys of the individual applications are
copied into the Master Data repository.
3. Changes can be introduced in each application, but these need to be
synchronized with the central system.
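The key-mapping approach in steps 1 to 3 can be sketched as a toy structure; every key, attribute, and application name here is invented for illustration:

```python
# Toy master record: common attributes live centrally; each application's
# primary key is copied into the record under a shared global key.
master_customer = {
    "global_key": "CUST-0001",
    "name": "John Doe",              # common attribute, mastered centrally
    "address": "12 Main St",
    "local_keys": {                  # application PKs copied into the repository
        "savings_app": "SAV-77",
        "home_loan_app": "HL-42",    # 'Loan A/C No' itself stays in the loan app
        "credit_card_app": "CC-913", # 'Max Credit Limit' stays in the card app
    },
}

def publish_change(master: dict, field: str, value: str) -> dict:
    """Synchronize a centrally made change out to every dependent application,
    addressed via its copied local key."""
    master[field] = value
    return {app: {field: value} for app in master["local_keys"]}

updates = publish_change(master_customer, "address", "99 High St")
```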
In the next screen, you will see the CDI MDM service.


There are two types of approaches to an MDM initiative: a distinct MDM per master
entity, and a platform-centric approach. In the first approach, for example, a
Customer MDM initiative would be separate from a Product MDM initiative. This is
easy to build and maintain, and it is cost effective. However, this type of
approach lacks enterprise scalability.


The other approach to MDM is the platform-centric approach. All the master
entities, such as customer, product, and material, reside in a single repository.
It has enterprise scalability, but it can be very complex and can take much
longer to build.

Introduction to Data Mining


The following concepts will be described at length in Data Mining:
1. Definition of Data Mining
2. Need for Data Mining
3. Advantages of Data Mining


4. Examples of Data Mining


5. Processes of Data Mining
6. Applications and Tools of Data Mining
Definition of Data Mining
Data Mining is the detection of unknown, valuable, and nontrivial information in
large volumes of data using automated statistical analysis.
Data Mining helps predict future trends and discover patterns and behavior that
have previously gone unnoticed.
It leads to simplification and automation of the statistical process of deriving
information from huge volumes of data.
Data Mining is a process that uses a variety of data analysis tools to discover
patterns and relationships in data that may be used to make valid predictions.
Simply stated, it is mining for gold! Data is invaluable, and for sure it is
gold!
Need for Data Mining
Who are our best customers?
How can we detect fraud?
What do we need to know in order to predict and prevent losses?
In a competitive market, the answers to the business questions above can make all
the difference between profitability and loss of market share. Data Mining
provides IT with the tools to answer these questions, thus producing and
discovering new information and knowledge that decision makers can act upon.
It does this by using sophisticated techniques such as artificial intelligence
to build a model of the real world based on data collected from a variety of
sources, including corporate transactions, customer histories, and
demographics, and from external sources such as credit bureaus.
This model reveals patterns in the information that can support decision
making and predict new business opportunities.
Need for Data Mining - Continued
Can we replace skilled business analysts with Data Mining?
How is Data Mining related to DWBI?
Can Data Mining replace OLAP and reporting applications?


The answers to these questions are as follows:

Data Mining does not replace business analysts and managers. It complements
these users, confirming their empirical observations and finding new patterns
that yield steady incremental improvement and breakthrough insight.

Data Mining follows Data Warehousing. The data to be mined is extracted from the
EDW. The need for data cleansing, integration, and consolidation is thus
eliminated. With the EDW as the foundation, 70% of the Data Mining effort is
eliminated, thereby saving time and money, increasing reliability, and
delivering faster results.

Data Mining cannot replace OLAP and reporting tools; they complement each
other. The outcome of the patterns discovered using Data Mining needs to be
analyzed before being put into action, in order to know the implications of such
patterns. OLAP tools allow the analysts to get answers to these queries.

Advantages of Data Mining


The advantages of Data Mining are as follows:

Adds value to data holdings: data collected as part of DW-BI initiatives
Competitive advantage
More efficient and effective decision making
Data volumes are ever increasing: the business feels there is value in
historical data, and automated knowledge discovery is the only way to explore it
Supports high-level and long-term decision making
Allows the business to be proactive and prospective


Data Mining process: Data preparation


Data preparation consists of the following:
Collection
Assessment
Consolidation and Cleaning
Data Selection
Cross Validation includes the following:
o Break up the data into groups of small size
o Use one group for testing and the remaining groups for building the mining
model
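The cross-validation split described above can be sketched in a few lines; this is an illustrative stdlib-only version, not any particular mining tool's implementation:

```python
# Break the data into k small groups (folds); each round holds one group out
# for testing and builds the mining model on the rest.
def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]  # k roughly equal-sized groups
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(k_fold_splits(list(range(10)), 5))  # 5 (train, test) pairs
```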

Types of Data Mining models


Are we trying to predict the future, or trying to describe the current state of
the world?
Descriptive Models involve algorithms like:
Clustering
Associations
Sequence discovery
Examples:
1. Clustering algorithms - K-means, Kohonen
2. Association algorithms - Apriori and GRI
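As a toy illustration of the K-means clustering algorithm named above, here is a dependency-free one-dimensional version; real projects would use a library implementation:

```python
# Toy 1-D K-means: alternate between assigning points to their nearest center
# and moving each center to the mean of its assigned cluster.
def kmeans_1d(points, centers, iters=20):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # assign each point to its nearest current center
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # move each center to the mean of its cluster (keep it if empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0])
```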

Predictive Models involve algorithms like:


Classification
Regression
Time series




Classification:

Build structures from examples of past decisions that can be used to make
decisions for unknown cases
Predict the class into which a new case fits

Regression:

Forecast the future values based on the current values


If the type is 'Simple': One independent variable
If the type is 'Multiple': More than one independent variable
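Simple regression, as described above (one independent variable), can be illustrated with a small ordinary-least-squares sketch; the sales-versus-period data is invented:

```python
# Fit y = slope * x + intercept by ordinary least squares.
def fit_simple_regression(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Forecast a future value from current values, e.g. sales by period index.
slope, intercept = fit_simple_regression([1, 2, 3, 4], [10, 12, 14, 16])
forecast = slope * 5 + intercept  # predicted value for the next period
```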


Time series:

Forecasts future trends; the model includes a time hierarchy such as year,
quarter, month, week, and so on
Considers the impact of seasonality and calendar effects

Data Mining: Applications and Tools


Applications used in Data Mining include:

Target Marketing
Churn Analysis
Customer Profiling
Bioinformatics
Fraud Detection
Medical Diagnostics

Tools and Vendors used in Data Mining include:

IBM: Intelligent Miner and SPSS


SAS: Enterprise Miner
SGI: Mine Set

Introduction to Data Governance


Introduction to Data Governance includes:

Data Governance: Definition
Need for Data Governance
Advantages of Data Governance
Implementation Approach
Data Stewardship
Characteristics of a Governed Organization

Data Governance: Definition


Data Governance refers to the overall management of availability, usability, and
security of data employed in an enterprise.
It includes a governing body, a defined set of procedures, and a plan to
execute these procedures.
It involves stewardship and data security.
It also helps in adhering to regulatory compliances.


"Dirty Data is a Business Problem, Not an IT Problem"
- Gartner, March 2007

"Over the next two years, more than 25 percent of critical data in Fortune 1000
companies will continue to be flawed; that is, the information will be
inaccurate, incomplete, or duplicated."
- Gartner
Businesses are discovering that their success is increasingly tied to the quality of their
information. Organizations rely on this data to make significant decisions that
can affect customer retention, supply chain efficiency and regulatory
compliance. As companies collect more and more information about their
customers, products, suppliers, inventory and finances, it becomes more
difficult to accurately maintain that information in a usable, logical
framework.
Data Governance is nothing but the management of data, which involves the
creation, availability, usability, security, and dissemination of all kinds of
data.
The need for Data Governance is as follows:

The amount of data is increasing every year; IDC estimates that the world will
reach a zettabyte of data (1,000 exabytes, or 1 million petabytes) in 2010.
A significant portion of all corporate data is flawed.
Process failure and information scrap and rework caused by defective information
cost the United States alone $1.5 trillion or more.


The amount of data - and the prevalence of bad data - is growing steadily.

The advantages of enterprise-wide Data Governance are as follows:

Enterprise data is frequently held in disparate applications across multiple
departments and geographies.

The confusion caused by this disjointed network of applications leads to poor
customer service, redundant marketing campaigns, inaccurate product shipments,
and, ultimately, a higher cost of doing business.

To address the spread of data and eliminate silos of corporate information, many
corporations implement enterprise-wide Data Governance programs, which attempt
to codify and enforce best practices for data management across the organization.

Data Governance employs a 'holistic' approach to the management of the People,
Policies, and Technology that manage enterprise data, thereby providing the
following benefits:

Effective decisions: Better data drives more effective decisions across every
level of the organization.

Better strategies: With a more unified view of the enterprise, managers and
executives are able to devise strategies that make the company more profitable.

Increase in consistency and confidence: A consistent enterprise view of the
organization's data leads to an increase in consistency and confidence in
decision making.

Reduction in risk: Decreases the risk of regulatory fines by adhering to rules,
processes, and standards for the creation, acquisition, usage, dissemination,
security, maintenance, and availability of data.

Consistent information quality: Data Governance is a quality-control discipline
for assessing, managing, using, improving, monitoring, maintaining, and
protecting organizational information. Data Governance initiatives improve data
quality by assigning a team responsible for the data's accuracy, accessibility,
consistency, and completeness, among other metrics.


Accountability for information creation, usage, and dissemination: Defines roles
and responsibilities for data quality that ensure accountability, authority, and
supervision.

Implementation of Data Governance is a multi-faceted process, which includes the
following:

Data Governance is an evolutionary process

Set up a Data Resource Management team, supervised by business data stewards

Define and maintain data strategy and policies, manage data issues, estimate
data value and data management costs, and justify the budget for data
management programs

Enforce data management policies and run programs to promote them

Make users aware of these policies and programs

Have Data Stewardship, Strategy, and Governance in place

It first requires buy-in from the top executives of the organization; this is
followed by setting up a Data Management Team consisting of data stewards and
other data stakeholders.
A charter and plan is prepared, which lays down the rules and policies for
data management.
Allocation of an appropriate budget for the data management program is an
important step here.
This is followed by enforcing the data management programs and promoting them.
Finally, users are made aware of the policies through training and are
encouraged to adhere to these guidelines.
Data Stewardship


It is a role assigned to a person responsible for maintaining data elements in a
metadata registry. Its main objective is to manage an organization's data assets
in order to improve their integrity, usability, accessibility, and quality. A
Data Steward ensures that each data element:

Has a clear and unambiguous data definition
Does not conflict with other data elements in the metadata registry
Has its origin and source documented
Has adequate documentation on appropriate usage
Has data security specifications and retention criteria
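The steward checklist above could be partially automated. The sketch below runs two of the checks (missing usage documentation and conflicting definitions) against a small, entirely hypothetical registry:

```python
# Hypothetical metadata registry with one well-documented element and one
# conflicting, under-documented duplicate that a steward should flag.
registry = {
    "customer_id": {
        "definition": "Unique identifier assigned to each customer",
        "origin": "CRM system",
        "usage_notes": "Join key across customer-facing data marts",
    },
    "cust_id": {
        "definition": "Unique identifier assigned to each customer",
        "origin": "CRM system",
        "usage_notes": "",
    },
}

def steward_issues(registry):
    issues, seen = [], {}
    for name, meta in registry.items():
        if not meta.get("definition"):
            issues.append(f"{name}: missing definition")
        if not meta.get("usage_notes"):
            issues.append(f"{name}: no usage documentation")
        # two elements with identical definitions conflict in the registry
        definition = meta.get("definition")
        if definition in seen:
            issues.append(f"{name}: conflicts with {seen[definition]}")
        seen[definition] = name
    return issues

issues = steward_issues(registry)  # flags cust_id's gap and its conflict
```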

Characteristics of a Governed Organization


At the Governed stage, an organization has a unified Data Governance strategy
throughout the enterprise.
Data quality, data integration, and data synchronization are integral parts
of all business processes, and the organization achieves impressive results
from a single and unified view of the enterprise.
PEOPLE
Characteristics exhibited by the people of a Governed Organization include the
following:

Data Governance has executive-level sponsorship with direct CEO support
Business users take an active role in data strategy and delivery
A data quality or Data Governance group works directly with data stewards,
application developers, and database administrators
The organization has 'zero defect' policies for data collection and management

POLICIES
Features of the policies implemented by a Governed Organization include the
following:

New initiatives are only approved after careful consideration of how the
initiatives will impact the existing data infrastructure
Automated policies are in place to ensure that data remains consistent, accurate,
and reliable throughout the enterprise
A service oriented architecture (SOA) encapsulates business rules for data quality
and identity management

TECHNOLOGY
Technologies and tools that are in place in a Governed Organization are as follows:


Data quality and data integration tools are standardized across the organization
All aspects of the organization use standard business rules created and maintained
by designated data stewards
Data is continuously inspected and any deviations from standards are resolved
immediately
Data models capture the business meaning and technical details of all corporate
data elements

Risks and Rewards


The risks and rewards associated with a Governed Organization include:

Risk: Low. Master Data is tightly controlled across the enterprise, allowing the
organization to maintain high-quality information about its customers,
prospects, inventory, and products.
Rewards: High. Corporate data practices can lead to a better understanding of
an organization's current business landscape, allowing management to have full
confidence in all data-based decisions.
