
Data Warehouse Concepts

Data Warehouse (DW): A subject-oriented, integrated, non-volatile, time-variant collection of data organized to support management needs.
Also referred to as Central Data Warehouse (hub)
Data Warehouse serves as a single-source hub of integrated data upon which all
downstream data stores are dependent. The Data Warehouse has roles of intake,
integration, and distribution.
The EDW is also referred to as the Atomic Data Store. If it is present, it serves as the single source of consistent and accurate information.
The EDW is an optional component, especially if the DW environment is built the Ralph Kimball way. In this model, data from staging is integrated and loaded directly into Data Marts. We will discuss these in detail in Module 5.
In terms of definition, an EDW is a data store containing Subject-Oriented, Non-Volatile, and Time-Variant data.
The size of an EDW is extremely large. Data in an EDW is not deleted under normal circumstances; in other words, there can be only soft deletes.
The data model is most often an E-R model (Entity-Relationship model), normalized to some extent.

One of the key features of an EDW is that it stores historical data and data at the most granular level. It is a truly corporate representation of data; however, access to it is limited, as it is not meant for reporting purposes.

Data Mart

One of the important features of a Data Mart is that its data model is customized for a business process or a department; it does not contain all corporate-level data as in the case of an EDW, and hence takes less time to build and maintain.
Data is represented in an elegant manner; a manner in which the business can understand the structure and contents. The data model is also denormalized.
The data model consists of a large centralized table called the 'FACT' table (which holds the measures or values that the business is looking for) and a set of small descriptive entity tables called the 'DIMENSION' tables.
If a dimension is a 'Conformed Dimension', it can be shared across different Data Marts, thus minimizing the design time. We will talk about these in detail in the subsequent modules.
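The fact/dimension layout described above can be sketched as a minimal star schema. This is an illustrative example only; the table and column names (`fact_sales`, `dim_product`, and so on) are assumptions, not from the course material:

```python
import sqlite3

# In-memory database for a tiny, hypothetical star schema:
# one FACT table of sales measures surrounded by small DIMENSION tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE fact_sales  (
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    sales_amount REAL            -- the measure the business is looking for
);
""")
con.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
con.execute("INSERT INTO dim_date VALUES (20240101, '2024-01-01')")
con.execute("INSERT INTO fact_sales VALUES (1, 20240101, 99.50)")

# A typical mart query: join the fact table to a dimension and aggregate.
row = con.execute("""
    SELECT p.product_name, SUM(f.sales_amount)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.product_name
""").fetchone()
print(row)  # ('Widget', 99.5)
```

Note how the business question ("sales by product") maps directly onto a join between the central fact table and one descriptive dimension table.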

Data Marts represent data which is either business-process specific and/or department specific. (A business process can consist of multiple departments.) A key feature of a Data Mart is that it stores data in a business-friendly representation, also called a dimensional model or a star schema. The data stored in a Data Mart may not be at the most granular level; quite often it is aggregated and summarized.

Its usage is mainly for analytical and reporting purposes. BI and DSS tools make use of this data structure for OLAP, data visualization, query, search and analysis, and BI reporting.

Analytics: Analytics is the science of analysis. It defines how a business or an entity arrives at an optimal or realistic decision based on existing data.
Applications of Analytics include the study of business data using statistical analysis in order to discover and understand historical patterns with an eye to predicting and improving business performance.
In other words, Applied Business Analytics is Business Intelligence. BI Analytics consists of the following:
Query, Reporting and Search Tools
OLAP, Visualization and Data Mining Tools
Executive Dashboards and Scorecards
Predictive Analysis Tools

As can be seen in the diagram on the left-hand side, the X axis denotes business value (going from low to high) and the Y axis denotes complexity in terms of Analytics (from bottom to top).

In terms of business queries:

1. What happened? - We get the answer to this using Reporting and Query tools
2. Why did it happen? - We get the answer to this using OLAP and Visualization tools
3. What's happening now? - We get the answer to this using Dashboards and Scorecards
4. What might happen? - We get the answers to this using Predictive analysis tools

Dashboards, Scorecards, and Predictive Analysis are used by executives and will be covered in subsequent modules.

Metadata: Two contractors are assigned the task of building a bridge. One is to start building from the east end and the other from the west end. Both have to meet in the center and then merge.
When they arrived at the center point, one end of the bridge was higher than the other by a few inches. This was because one group of contractors and their engineers used kilograms and meters, while the other used pounds and feet. It caused the parent company losses in billions!
The reason: it wasn't the data that was faulty; it was the Metadata!
Metadata is 'Data about Data'. It refers to data that tries to describe a data set in terms of
its Value, Content, Quality, Significance.
It provides insight into data for information like:
1. What kind of Data ?
2. Who is the owner of this data ?
3. How was the data created ?
4. What are the attributes and significance of the data created or collected ?

Inmon's Central Data Warehouse - Hub and Spoke architecture: Inmon defines a Data Warehouse as "A subject oriented, integrated, non-volatile, time-variant, collection of data organized to support management needs." (W. H. Inmon, Database Newsletter, July/August 1992)

The intent of this definition is that the Data Warehouse serves as a single-source Hub of
integrated data upon which all downstream data stores are dependent. The Inmon Data
Warehouse has roles of intake, integration, and distribution.
Kimball's definition: Bus Architecture: Kimball defines the warehouse as "nothing more than the union of all the constituent data marts." (Ralph Kimball, et al., The Data Warehouse Lifecycle Toolkit, Wiley Computer Publishing, 1998)
This definition contradicts the concept of the Data Warehouse as a single-source Hub.
The Kimball Data Warehouse assumes all data store roles -- intake, integration,
distribution, access, and delivery.
Inmon's Data Warehouse approach includes the following:

Inmon's approach is to have a single, consistent, and accurate store of data, which he termed the EDW or Enterprise Data Warehouse.

Data Marts are then built as subsets of the Data Warehouse; data marts are department or business-process specific, and BI reporting is done from them.

The advantage of this approach, as per Inmon, is that there is a single, consistent, and accurate source of corporate data, thus reducing data redundancy. Data design, consistency, and change can be handled much better.

The disadvantage of this approach is that the time required to build an EDW is quite large; it may take years for an EDW to be fully functional, and the cost of building it is huge. Moreover, getting buy-in from business stakeholders becomes difficult, as the return on investment (ROI) is not realized early.

Hub and Spoke architecture

The Hub-and-spoke architecture provides a single integrated and consistent source of
data from which data marts are populated. The warehouse structure is defined through
enterprise modeling (top down methodology).
The ETL processes acquire the data from the sources, transform the data in accordance with established enterprise-wide business rules, and load the Hub data store (central Data Warehouse or persistent staging area). The strength of this architecture is enforced integration of data.

As can be seen in the diagram, we are assuming that there is no Integration Hub currently in place. The source or OLTP systems are on the left and the data marts are on the right side of the diagram. Different source systems may feed a single data mart, as seen in the diagram.

Hence there would be a need to create a great many interfaces, and consequently a lot of hardware, software, and maintenance would be required, which would add to the cost. The bottom line is that if there are m applications and n data marts, then m x n interfaces would be needed to build, maintain, and execute the Data Warehouse.
Data redundancy is another factor, as it is quite possible that a given application feeds more than one data mart and that each of these data marts stores the same data. Lack of synchronization between these data marts may result in data inconsistency and data quality issues, leading to the business losing faith in the data.

In the diagram, we have the source systems on the left and the data marts on the right-hand side. The source systems are named App1, App2, App3, and so on; each application feeds more than one data mart. We have 4 data marts for separate business processes, namely Finance, Sales, Marketing, and Accounting.
Each mart consists of data unique to it and also data which is common across the other data marts, as multiple data marts can receive the same set of data from the same application. What this means is that there is no demarcation between Common Corporate Data and Unique Business-Process-Specific or Departmental Data.
Maintenance of interface applications for each source-to-data-mart feed, data consistency, and reliability can be an issue.
The cost of maintenance of interfaces increases geometrically with every increase in either the source system applications feeding the data marts and/or the addition of a new data mart.
Continuing with the example of the previous slide, we look at what other problems can arise in the absence of the Integration Hub.


As we can see in the diagram, each data mart returns a different figure. This is because there are some customers who are unique to a given data mart, while there are others who are common across data marts.
Thus it becomes very difficult to say which customers are common across which data marts, as there is no way to get this information. Moreover, since each data mart returns a different figure for the number of customers, there is no way to tell which figure is correct.
The business folks would thus lose all faith in the data they are seeing, as this data is inconsistent across the different data marts.


As can be seen in the diagram, having an Integration Hub adds value because:
1. There is an orderly approach to building interfaces; if there are m applications and n data marts, then we need only m + n interfaces.
2. Data consistency, accuracy, and reliability are increased, as there is now a single integration point for all kinds of source data; data redundancy is also reduced to a great extent.
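The interface-count argument above can be checked with a line of arithmetic (the application and mart counts used in the example call are illustrative):

```python
# Point-to-point: every source application feeds every mart directly.
# Hub-and-spoke: each application feeds the hub once; each mart reads from it once.
def interfaces(m_apps: int, n_marts: int) -> tuple[int, int]:
    point_to_point = m_apps * n_marts  # m x n interfaces without a hub
    hub_and_spoke = m_apps + n_marts   # m + n interfaces with an Integration Hub
    return point_to_point, hub_and_spoke

print(interfaces(10, 4))  # (40, 14): the hub needs far fewer interfaces
```

The gap widens as either side grows, which is why the text says maintenance cost increases geometrically without a hub.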

In continuation of the previous screen, refer to the contents on this screen for additional information on an Integration Hub. The diagram shows one more advantage of having an Integration Hub. As can be seen, common corporate data and business-specific data are clearly demarcated. Current-level, integrated, and detailed data is maintained only in the Integration Hub.


Now, if we ask the same question we asked a few screens back, that is, "How many customers are there?", what would be the answer?
The Integration Hub maintains a unique definition for each type of customer present. When we look into each data mart for "How many customers are there?", the numbers in each data mart may still vary; however, each mart also gives the type of customers that are present, rather than just the number of customers. Thus each data mart can now answer exactly how many customers, and of what type, it contains.

The Kimball Data Warehouse approach includes:

Kimball's prime objective was to get the Data Warehouse up and running as quickly as possible.

He proposed that the data marts could be built directly from the source systems, instead of having a centralized repository like an EDW as proposed by Inmon.

The basic requirement to build such a Data Warehouse is to have a set of conformed dimensions and facts across the different business departments.

The advantage of the Kimball model is that multiple data marts serving the different business units can be built in parallel, with each data mart holding only its departmental data. The time required to build the EDW is thus eliminated, and the return on investment (ROI) is realized early, as the marts are up and running quickly and the business can see the value from the reports generated from them.

The disadvantage of this approach is that data redundancy persists, as it is quite possible that the built marts have some common set of entities; for example, the sales mart and the inventory mart would both need the product data. Integration of data across the marts over the years would be another challenge.

Kimball Data Warehouse: Bus integration

Bus Integration Architecture
The Bus Architecture relies on the development of conformed data marts populated directly from the operational sources or through a transient staging area. Data consistency from source-to-mart and mart-to-mart is achieved by applying conventions and standards (conformed facts and dimensions) as the data marts are built.
The strength of this architecture is consistency without the overhead of the central Data Warehouse.

The basis of this approach is to have the data marts up and running as quickly as possible so that the business can see their benefits, to allow parallel development of different data marts, and to avoid the cost and time required to build an enterprise Data Warehouse. Data redundancy is not really a criterion for this approach.
As can be seen in the diagram on the right-hand side, data from disparate data sources is fed directly into the conformed data marts through the integration bus.
A conformed data mart is one which consists of conformed dimensions (and facts). A conformed dimension is one which holds the same business meaning and significance across all the data marts of which it is a part.
The conformance of the dimension (and the data mart) is built using the bus framework; we will look at this in the subsequent slides.
As stated by Kimball, the strength of this architecture is consistency without the overhead of the Integration Hub or the central Data Warehouse.
Conformed dimension:
A dimension which retains the same business and technical nomenclature even if shared across business processes.
Shared dimensions should conform.
Identical dimensions should have the same definitions, keys, labels, and values.

Business Analysts and Architects from different business streams arrive at a single description of a dimension and its attributes. This results in a conformed dimension.
Conformed Dimensions are listed on the 'X' axis; Business Processes are listed on the 'Y' axis.
The matrix is completed by filling in an 'X' at the intersection of a Business Process and a Dimension, implying 'this dimension is required for this business process'.
Once finalized, parallel development of Data Marts can begin, each business process corresponding to a Data Mart.

As can be seen in the diagram, the business processes or departments are represented on the Y axis and the conformed dimensions are represented on the X axis. We have Sales, Orders, and Inventory on the Y axis as the business processes, and Product, Customer, Vendor, Date, Store, Distributor, and Promotion as the Conformed Dimensions.
The intersection of which dimension is part of which business process (or department) is noted with a cross (X). For example, the Sales business process consists of the Product, Customer, Store, Date, and Promotion dimensions. This matrix then helps in the parallel development of each business process or data mart.
This is another way of representing the bus architecture of conformed dimensions and business processes.
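The bus matrix just described can be represented as a simple mapping. Only the Sales row is spelled out in the text; the Orders and Inventory rows below are assumed for illustration:

```python
# Bus matrix: business processes (rows) x conformed dimensions (columns).
# Each set marks the dimensions crossed with an 'X' for that process.
bus_matrix = {
    "Sales":     {"Product", "Customer", "Store", "Date", "Promotion"},
    "Orders":    {"Product", "Customer", "Vendor", "Date"},    # assumed row
    "Inventory": {"Product", "Date", "Store", "Distributor"},  # assumed row
}

# Dimensions shared by every process are the ones that must be conformed
# first, since all the parallel data mart teams will depend on them.
shared = set.intersection(*bus_matrix.values())
print(sorted(shared))  # ['Date', 'Product']
```

Reading the matrix this way makes the planning use explicit: each row can be handed to a separate team, while the shared columns are agreed upon up front.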


Data Warehouse architecture is of three types:

Type 1: ER Model for the EDW and Star Schema for Data Marts - Inmon

Type 2: Dimensional Model all through - Kimball

Type 3: ER Model for the DW - Teradata


Refer to the diagram on the screen to understand the Inmon Data Warehouse architecture. It is also termed the Corporate Information Factory. One of the salient features of this architecture is that it consists of an Integration Hub, the EDW or Enterprise Data Warehouse, which stores all of the corporate integrated and detailed data.
In continuation of the previous screen, this screen illustrates another way of representing the Inmon architecture.


As can be seen in the diagram, there are 5 verticals namely, Source systems
or Operational Systems, Data preparation area, ER Model or the Detail data,
Dimensional Model or the Data Marts, and Access and Delivery.
Please note that the ER vertical consists of the Enterprise Data Warehouse,
which holds all corporate information.

Refer to the contents on the screen to understand the Kimball Architecture



There are four verticals in the Kimball Architecture diagram, namely, Source Systems or Operational Systems, Data Preparation, Dimensional Model, and Access and Delivery. The difference is that there is no Integration Hub here; instead, data is loaded directly into the data marts from the source systems. This kind of approach is faster to build; with the bus architecture approach, parallel data mart development can take place, and Return on Investment (ROI) is visible early.


There are five different roles in the Data Warehouse environment from a Data Store perspective. Intake, Integration, Distribution, Delivery, and Access are the five primary responsibilities of a Data Store:
1. Intake - The intake of data from the various sources into the Data Warehouse environment.
2. Integration - Integration describes how the data fits together. The challenge for the warehousing architect is to design and implement consistent and interconnected data that provides readily accessible, meaningful business information. Integration occurs at many levels: the key level, the attribute level, the definition level, the structural level, and so forth. Additional data cleansing processes, beyond those performed at intake, may be required to achieve desired levels of data integration.
3. Distribution - Data stores with distribution responsibility serve as long-term information assets with broad scope. Distribution is the progression of consistent data from such a data store to those data stores designed to address specific business needs for decision support and analysis.
4. Delivery - Data stores with delivery responsibility combine data as 'in business context' information structures to present to the business units that need them. Delivery is facilitated by a host of technologies and related tools - data marts, data views, multidimensional cubes, web reports, spreadsheets, queries, and so on.
5. Access - Data stores with access responsibility are those that provide business retrieval of integrated data; they are typically the targets of a distribution process. Access-optimized data stores are biased toward ease of understanding and navigation by business users.


We start with the Inmon Data Warehouse and see which of these 5 roles it assumes.
We can see 3 roles being defined here, in the center left of the diagram. These are Intake, Integration, and Distribution.
What this means is that the Inmon Data Warehouse is responsible for the intake, integration, and distribution of data as part of its Data Warehouse architecture.
It treats delivery and access as outside the prerogative of its Data Warehouse environment. Essentially, this means that Inmon's DW environment is limited to the creation of the EDW; building marts and then generating reports out of these marts is considered an external activity.
Going by this definition, an Inmon Data Warehouse would:
1. Intake data from various sources
2. Integrate it to form a complete and conformed record, and finally
3. Distribute the data to various data stores (business-unit specific), mainly from the EDW to the Data Marts

As can be seen in the diagram, all the 5 defined roles serve the Kimball Data Warehouse. What this means is that all 5 roles, from Intake to Integration to Distribution to Delivery to Access, are part of the Kimball Data Warehouse. The Kimball Data Warehouse thus involves all 5 roles, from the intake of data to, finally, accessing the data marts for generating reports.

ETL is classified into the following five categories:


Data Profiling
Data Cleansing
Data Integration, Consolidation, and Population
Data Replication
Data Federation

Each of the five categories is described in detail in the subsequent screens.
ETL classification: Data Profiling
Features include:

Analysis of metadata and data values; detection of differences between defined and inferred properties
Discovery of dependencies within source tables (functional dependencies, primary keys) and across tables (detection of common domains: redundancy, foreign-key relationships)
Recommendation for the target data model (for example, primary keys, foreign keys, normalized design)
Benefits include:

Improved data quality through an understanding of the metadata of your data sources (their structure and the relationships within and among them), supported through efficient tooling


Data profiling is basically the analysis of data and metadata values for their correctness; in other words, the detection of differences between defined and inferred properties. This step is carried out in the initial stages of the Data Warehouse, even before data is loaded from the source systems into the Data Warehouse. It is carried out to ascertain the quality of the data that will be loaded into the Data Warehouse; in case the quality is not up to the mark, it leads to a full-fledged Data Quality initiative.
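A toy illustration of profiling, comparing the defined properties of a column against what the actual values infer (the sample data and the checks chosen are made up for the sketch):

```python
# Profile a column: infer its type(s) and null rate, and check whether it
# could serve as a primary key (all values unique and none missing).
def profile(values):
    non_null = [v for v in values if v is not None]
    inferred_types = {type(v).__name__ for v in non_null} or {"unknown"}
    return {
        "null_fraction": 1 - len(non_null) / len(values),
        "inferred_types": sorted(inferred_types),
        "candidate_key": len(set(non_null)) == len(values),
    }

# Suppose the source metadata defines this column as NOT NULL INTEGER.
# The inferred profile disagrees: it has a null and a stray string value.
print(profile([101, 102, None, "104"]))
```

The mismatch between defined and inferred properties is exactly what a profiling tool reports before any load is attempted.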
ETL classification: Data Cleansing
Features include:

Data standardization transforms different input formats into a consolidated output

Creating single domain fields
Incorporating business and industry standards
Data matching
Data enrichment
Data survivorship

Benefits include:

Reduce costs by improving data quality and consistency
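Standardization of the kind listed above can be sketched as a small normalization step. The cleaning rules and alias table here are illustrative assumptions, not a real tool's behavior:

```python
import re

# Assumed alias rules mapping cleaned variants onto one standard value.
ALIASES = {"I.B.M": "IBM", "INTERNATIONAL BUSINESS MACHINES": "IBM"}

def standardize(name: str) -> str:
    # Collapse whitespace, normalize case, and drop a trailing period,
    # then apply the alias table to reach a single consolidated output.
    cleaned = re.sub(r"\s+", " ", name).strip().upper().rstrip(".")
    return ALIASES.get(cleaned, cleaned)

records = ["ibm", " I.B.M. ", "International  Business Machines"]
print({standardize(r) for r in records})  # {'IBM'}
```

Three differently formatted inputs survive as one standard value; this is the "single domain field" idea in miniature.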


ETL classification: Data Integration, Consolidation, and Population

Features include:

Complex transformations
High data volume (billions of records)
Performance and scalability of target access more important than data
concurrency in target
De-coupled model: Minimal impact on source systems due to target access
Target may collect historical snapshots of integrated information

Benefits include:

Gain insight through a single version of the truth in a distributed, heterogeneous, and possibly low-quality data environment

This includes integrating data across data sources, transforming the data, and storing it as single, consistent, detailed, and accurate data. The other aspect is the consolidation and loading of data into the Data Warehouse (EDW and Data Marts). The emphasis here is more on performance and scalability while accessing the data from the target data marts. Data consolidation provides a single version of the truth and integration across different heterogeneous sources.
ETL classification: Data Replication
Features include:

Unidirectional data distribution or Bidirectional data synchronization

Increased performance and scalability through distribution of load to multiple
specialized copies of information
Increased availability and reliability for failover scenarios
Low-latency, high-throughput data movement with queue-based replication
Automated and system-supported conflict resolution for bi-directional replication

Benefits include:

Improved performance, scalability, reliability, and availability while guaranteeing data consistency

ETL classification: Data Federation

Data Federation is also known as 'On Demand Integration'.
Features include:

On-demand integration instead of copy management and data redundancy
Real-time access to distributed information as if from a single source
Flexible and extensible integration approach for dynamically changing environments
Query optimization
Integration of structured and unstructured information

Benefits include:

Reduced time to market and controlled costs when joining distributed data sources

ETL and related technologies

Extract Transform and Load (ETL) is not the only technology that is used in DWBI.
There are in all, four technologies that are used variously in DWBI. These are as follows:

Extract Transform Load (ETL)

Enterprise Information Integration (EII)
Enterprise Application Integration (EAI)
Extract Load Transform (ELT)

EII - Enterprise Information Integration

An optimized and transparent data access and transformation layer providing a single relational interface across all enterprise data
Allows users to easily combine data warehousing reports with newly acquired real-time analytic information through transparent queries - without caring where the data resides
Example: Ipedo, Data Mirror

EAI - Enterprise Application Integration

Message-based, transaction-oriented, point-to-point (or point-to-hub) brokering
and transformation for application-to-application integration


Enables data sharing among partners in a supply chain, or brings transactional

applications together after acquisition

Example: BizTalk, Tibco

ELT - Extract Load Transform

Set-oriented, point-in-time loading for the migration and transformation of data
Supports the large-scale loading of a DW or the migration of vast quantities of data between systems

Example: Sunopsis

ETL and Related Technologies - ETL versus ELT

In ETL the data is at first manipulated outside the database to cleanse and
sort, and only then is the result loaded into the database, as illustrated in
the following diagram:

ETL and Related Technologies - ETL versus ELT

In ELT the raw data from the source system is first loaded into staging tables in a database. Only then is it cleansed and transformed, as illustrated in the following diagram:
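The ordering difference between the two approaches can be sketched in a few lines; the transform step here is a deliberately trivial placeholder:

```python
# A placeholder transform: uppercase a name field.
def transform(row):
    return {**row, "name": row["name"].upper()}

source = [{"name": "alice"}, {"name": "bob"}]

# ETL: transform outside the target database, then load the finished rows.
etl_target = [transform(r) for r in source]

# ELT: load the raw rows into staging first, with no manipulation, then
# transform them inside the target (in practice, via SQL in the database).
elt_staging = list(source)
elt_target = [transform(r) for r in elt_staging]

print(etl_target == elt_target)  # True: same result, different ordering
```

The end state is identical; what differs is where the transformation work runs, which is why ELT favors databases with strong set-processing engines.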


The similarities between ETL, EII, and EAI


Analytics and BI

Analytics leverages data in a particular functional process (or application) to enable 'context-specific insight that is actionable'. It can be used in many industries in real-time data-processing situations to allow for faster business decisions.

Analytics is different from BI, although BI products play a role in analytics.

Applications of Analytics include the study of business data using statistical analysis in order to discover and understand historical patterns with an eye to predicting and improving business performance.

In short, Applied Business Analytics is Business Intelligence.

BI is an architecture and a collection of integrated operational as well as decision-support applications and databases that provide the business community easy access to business data.

BI is all about how to capture, access, understand, analyze and turn one of the
most valuable assets of an enterprise - raw data - into actionable information in
order to improve business performance.

Business Analytics is the analytical process of Reasoning, Forecasting, and

Measuring Business Actions and Processes based on extracted patterns in
collected business data and business plans.

Ad-Hoc query and Reporting Tool

Reporting Tools are a category of data access solution in which information is represented in the form of reports. Reporting Tools are also referred to as Query, Search, and Reporting tools. They present the state of data and information at a point in time in a report format. The two types of Query and Reporting Tools are described in the tabs alongside.
Managed Query and Canned Reporting Tool:
This handles canned reports, which are essentially reports that are pre-formulated and stored as a query plan. They can either be scheduled to run or run on-demand. These reports can, however, accept runtime parameters as inputs at any given point in time.










Metadata is 'data about data'. It provides a basis for trust in information, providing
visibility into lineage, relationships to other systems, and business definitions.
It refers to data that tries to describe a data set in terms of its value, content, quality,
significance. It also provides an insight into data for information like:

What kind of data?

Who is the owner of the data?
How was the data created?
What are the attributes and significance of the data created or collected?
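Answers to the four questions above are often captured as a structured record stored alongside the data itself. A sketch, where the field names and sample values are illustrative assumptions:

```python
import json

# A minimal metadata record answering: what kind of data, who owns it,
# how it was created, and what its attributes signify.
metadata = {
    "dataset": "fact_sales",
    "kind": "transactional fact data",
    "owner": "Finance department",
    "created_by": "nightly ETL load from the order-entry system",
    "attributes": {
        "sales_amount": {"unit": "USD", "significance": "revenue measure"},
    },
}
print(json.dumps(metadata, indent=2))
```

Keeping such a record machine-readable is what lets metadata tools trace lineage and definitions automatically, rather than relying on tribal knowledge.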


Need for Metadata

Faster development, faster maintenance: Helps accelerate development by actively sharing knowledge through the analysis, design, and build process, even with external technologies. It also serves as an automatic form of documentation to make maintenance easier, and provides the ability to assess the impact of changes prior to making them.
Better business and IT collaboration: Aligns business and IT understanding by linking business terms, rules, and taxonomies to technical artifacts. It also allows business and IT resources to collaborate while using tools tailored to their roles.
Trust: Supports a higher degree of trust in information by keeping a record of collaboration and providing the ability to see where information comes from.
More consistent: Improves the consistency, accuracy, and speed of data for the DW by providing business-specific and technical information on the available DW data.
Reduced time: Reduces development time by integrating and merging data from disparate sources into the DW.
Reliability: Increases data reliability by having consistent definitions and nomenclatures for data within the DW.
Integration of data: Helps in the integration of data across the enterprise, especially when acquisitions and mergers are the order of the day.




Characteristics of DW Metadata
DW Metadata typically helps in tracking the following:
Extract Information: Last Refresh / Load - Date / Time
Historical Information about data and Metadata: Versioning and Data Access
Patterns over a period of time
Data Mapping Information: Source to Target and Transformation Rules
Summarization: Aggregation Algorithm
Archiving: Period of Data Purging
Reference and Standardization: Aliases and Lookups


Business drivers for MDM(Master Data Management)

"Unless enterprises figure out how to synchronize Master Data among departments,
divisions and enterprises, the value promised from business process fusion
will be much less than expected."
Gartner, October


Challenges faced with Master Data

The challenges faced in implementing MDM solutions are:
Duplicates: Distinct supplier records for 'IBM', 'I.B.M', and 'International Business Machines'
Multiple conflicting views: Material Master Data is out of sync between ERP systems
Data quality issues: Customer movement (clean data erodes quickly)
Fragmentation: Product cost and specification managed by discrete, out-of-sync ERP systems
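The duplicate-supplier problem above is what data-matching tools attack. A crude sketch using plain string similarity; the normalization rule and the 0.6 threshold are arbitrary choices for illustration:

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    # Compare case-insensitively, ignoring punctuation and spacing.
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

print(similar("IBM", "I.B.M"))                            # True
print(similar("IBM", "International Business Machines"))  # False
```

Note the second pair is a genuine duplicate that pure string similarity cannot catch; real MDM matching adds alias tables and business rules on top of fuzzy comparison.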


The example shows three types of environments, which deal with all kinds of data. However, they still lack the consistency required to deliver real-time and accurate information, as these environments were designed to serve a specific function for a specific set of corporate data.
In the diagram, on the left is the Data Warehouse system, in the middle is EAI (Enterprise Application Integration), and on the right are the ERP systems.
The Data Warehouse is unidirectional, that is, data flows from source to target; reverse synchronization is not possible. It also works on the principle of batch updates, hence real-time synchronization of data is not possible.
The Enterprise Application Integration (EAI) environment does not preserve history; it is event driven, and moreover the investment in data synchronization is huge.
In the case of the ERP systems, where data proliferation is huge, there is no synchronization of data between the different ERPs, and a high investment is required for the consolidation of data.


Main purpose of Master Data

Decouple Master Data from individual applications and provide a single version
of truth for Master Data (analytical, operational, reference, and so on)
What is Master Data?


Describe core business entities: Customers, suppliers, partners, products,
materials, chart of accounts, locations, and employees
o High-value information used repeatedly across business processes
o Generally used across multiple LOBs (Lines of Business)
Gives business context by providing concrete data models and processes for a
particular domain

What are the benefits of Master Data?

Common authoritative source of accurate, consistent, and comprehensive master
information for business services to access critical business information
Common business services supporting consistent information-centric procedures
across all applications within the enterprise and extended enterprise
Business process support to integrate with or drive business processes across
heterogeneous applications by making data actionable




Continuing with the bank example, refer to the diagram on the screen below,
where you will see the Master Record for a given customer presented on the
right-hand side of the diagram.

This would become the Master Data repository for the customer.
The approach for the Master Data repository is as follows:
1. Core Master Data resides in the Master Repository and is published out to the
dependent applications. This means that all the attributes that are common
across the three applications now reside in the Master Data repository. Only those
attributes that are specific to a type of account reside in that specific account
application. For example, 'Loan A/C No' would reside only in the 'Home Loan
Customer Application', whereas 'Max Credit Limit' would reside only in the
'Credit Card Customer Application'.
2. Applications also store the master attributes, but they share a global primary
key; the individual primary keys of the individual applications are copied into
the Master Data repository.
3. Any changes can be introduced in each application, but these need to be
synchronized with the central system.
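The key-sharing scheme in the steps above can be sketched as follows. All names, keys, and attribute values here are hypothetical illustrations, not a real repository design: the master repository holds the common attributes under a global key, while each application keeps only its specific attributes plus a cross-reference to that global key.

```python
# Master repository: global key -> common attributes shared by all applications
master_repository = {
    "GK-1001": {"name": "A. Kumar", "dob": "1980-04-12", "address": "12 Park St"},
}

# Application-local stores: local primary key -> global key + app-specific attributes
home_loan_app = {
    "HL-77": {"global_key": "GK-1001", "loan_ac_no": "LN-556677"},
}
credit_card_app = {
    "CC-42": {"global_key": "GK-1001", "max_credit_limit": 250000},
}

def full_customer_view(local_store, local_pk):
    """Join an application's record with the published master attributes."""
    local = local_store[local_pk]
    merged = dict(master_repository[local["global_key"]])
    merged.update({k: v for k, v in local.items() if k != "global_key"})
    return merged

view = full_customer_view(home_loan_app, "HL-77")
```

Note how 'Loan A/C No' lives only in the home-loan store, yet the joined view still carries the common master attributes, because both applications share the global key.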
In the next screen, you will see the CDI MDM service.


There are two types of approaches to an MDM initiative: a distinct MDM per
master entity, or a platform-centric approach. With a distinct MDM per master
entity, a Customer MDM initiative would be different from a Product MDM
initiative. This is easy to build and maintain, and it is cost effective.
However, this type of approach lacks enterprise scalability.


The other approach to MDM is the platform-centric approach. All the master
entities, like customer, product, and material, reside in a single repository.
It has enterprise scalability, but it can be very complex and can take much
longer to build.

Introduction to Data Mining

The following concepts will be described at length in Data Mining:
1. Definition of Data Mining
2. Need for Data Mining
3. Advantages of Data Mining


4. Examples of Data Mining

5. Processes of Data Mining
6. Applications and Tools of Data Mining
Definition of Data Mining
Data Mining is the detection of unknown, valuable, and nontrivial information in
large volumes of data using automated statistical analysis.
Data Mining helps predict future trends and discover patterns and behavior
that have previously gone unnoticed.
It simplifies and automates the statistical process of deriving information
from huge volumes of data.
Data Mining is a process that uses a variety of data analysis tools to discover
patterns and relationships in data that may be used to make valid predictions.
Simply stated, it is mining for gold: the data itself is invaluable.
Need for Data Mining
Who are our best Customers?
How can we detect fraud?
What do we need to know in order to predict and prevent losses?
In a competitive market, answers to the business questions above can make all
the difference between profitability and loss of market share. Data Mining
provides IT with the tools to answer these questions, thus producing and
discovering new information and knowledge that decision makers can act upon.
It does this by using sophisticated techniques such as artificial intelligence
to build a model of the real world based on data collected from a variety of
sources including corporate transactions, customer histories, and
demographics, and from external sources such as credit bureaus.
This model produces patterns in the information that can support decision
making and predict new business opportunities.
Need for Data Mining - Continued
Can we replace skilled business analysts with Data Mining?
How is Data Mining related to DWBI?
Can Data Mining replace OLAP and reporting applications?


The answers to these questions are as follows:

Data Mining does not replace business analysts and managers. It complements
these users by confirming their empirical observations and finding new patterns
that yield steady incremental improvement and breakthrough insight.

Data Mining follows Data Warehousing. The data to be mined is extracted from the
EDW, so the need for data cleansing, integration, and consolidation is
eliminated. With the EDW as a foundation, 70% of the Data Mining effort is
eliminated, thereby saving time and money, increasing reliability, and
delivering faster results.

Data Mining cannot replace OLAP and reporting tools; they complement each
other. The patterns discovered using Data Mining need to be analyzed before
being put into action, in order to know the implications of such patterns. An
OLAP tool allows analysts to get answers to these queries.

Advantages of Data Mining

Advantages of Data Mining are as follows:

Add value to data holdings: Data collected as part of DW-BI initiatives

Competitive advantage
More efficient and effective decision making
Data volumes are ever increasing: Business feels there is value in historical data.
Automated knowledge discovery is the only way to explore this data
Supports high level and long term decision making
Allows business to be proactive and prospective


Data Mining process: Data preparation

Data preparation consists of the following:
Consolidation and Cleaning
Data Selection
Cross Validation includes the following:
o Break the data into several small groups (folds)
o In turn, hold out one group for testing and use the remaining groups to build
the mining model
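The cross-validation steps above can be sketched as a plain k-fold split. This is a minimal illustration; data mining toolkits provide this out of the box.

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs: each fold is held out once for testing
    while the remaining folds are used to build the mining model."""
    folds = [data[i::k] for i in range(k)]   # round-robin split into k groups
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
splits = list(k_fold_splits(data, 5))   # 5 train/test pairs of sizes 8 and 2
```

Every record is used for testing exactly once across the k folds, which gives a more reliable estimate of model quality than a single train/test split.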

Types of Data Mining models

Is the model trying to predict the future, or to describe the current state of
the world?
Descriptive Models involve algorithms like:
1. Sequence discovery
2. Clustering algorithms - K-means, Kohonen
3. Association algorithms - Apriori and GRI
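As a minimal sketch of the association idea, here are the first two passes of an Apriori-style count over hypothetical market baskets: items below the support threshold are pruned before any pair is counted, which is the key Apriori optimization. Real implementations generalize this to larger itemsets.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Count single items, keep those meeting min_support, then count
    only candidate pairs built from frequent items (Apriori pruning)."""
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in item_counts.items() if c >= min_support}
    pair_counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t) & frequent), 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

baskets = [["bread", "milk"], ["bread", "milk", "eggs"],
           ["milk", "eggs"], ["bread", "milk"]]
result = frequent_pairs(baskets, min_support=3)   # {('bread', 'milk'): 3}
```

Here 'eggs' appears only twice, so no pair containing it is ever counted; only the ('bread', 'milk') association survives the support threshold.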

Predictive Models involve algorithms like:
1. Classification: Build structures from examples of past decisions that can be
used to make decisions for unknown cases
2. Cluster assignment: Predict the cluster in which a new case fits
3. Regression: Forecast future values based on current values
If the type is 'Simple': One independent variable
If the type is 'Multiple': More than one independent variable
4. Time series: Forecasts future trends; the model includes a time hierarchy
such as year, quarter, month, and week, and considers the impact of seasonality
and calendar effects
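A 'Simple' regression with one independent variable can be sketched as an ordinary least-squares fit; the monthly sales figures here are hypothetical. The 'Multiple' case would add further independent variables.

```python
def simple_linear_fit(xs, ys):
    """Least-squares fit of y = a + b*x with a single independent variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Forecast the next period from a short monthly sales series
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]
a, b = simple_linear_fit(months, sales)
forecast = a + b * 5   # projected sales for month 5
```

A real time-series model would also account for the seasonality and calendar effects mentioned above, which a plain trend line like this cannot capture.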

Data Mining: Applications and Tools

Applications used in Data Mining include:

Target Marketing
Churn Analysis
Customer Profiling
Fraud Detection
Medical Diagnostics

Tools and Vendors used in Data Mining include:

IBM: Intelligent Miner and SPSS

SAS: Enterprise Miner
SGI: MineSet

Introduction to Data Governance

Introduction to Data Governance includes:

Data Governance: Definition

Need for Data Governance
Advantages of Data Governance
Implementation Approach
Data Stewardship
Characteristics of a Governed Organization

Data Governance: Definition

Data Governance refers to the overall management of availability, usability, and
security of data employed in an enterprise.
It includes a governing body, a defined set of procedures, and a plan to
execute these procedures.
It involves stewardship and data security.
It also helps in adhering to regulatory compliance requirements.


"Dirty Data is a Business Problem, Not an IT Problem"

Gartner March 2007

Over the next two years, more than 25 percent of critical data in Fortune 1000 companies will
continue to be flawed; that is, the information will be inaccurate, incomplete, or duplicated.

Businesses are discovering that their success is increasingly tied to the quality of their
information. Organizations rely on this data to make significant decisions that
can affect customer retention, supply chain efficiency, and regulatory
compliance. As companies collect more and more information about their
customers, products, suppliers, inventory, and finances, it becomes more
difficult to accurately maintain that information in a usable, logical format.
Data Governance is nothing but the management of data, which involves the creation,
availability, usability, security, and dissemination of all kinds of data.
The need for Data Governance is as follows:

The amount of data is increasing every year. IDC estimates that the world will
reach a zettabyte of data (1,000 exabytes or 1 million petabytes) in 2010.
A significant portion of all corporate data is flawed.
Process failure and information scrap and rework caused by defective information
costs the United States alone $1.5 trillion or more.


The amount of data - and the prevalence of bad data - is growing steadily.

Advantages of enterprise-wide Data Governance are as follows:

Enterprise data is frequently held in disparate applications across multiple
departments and geographies.

The confusion caused by this disjointed network of applications leads to poor
customer service, redundant marketing campaigns, inaccurate product shipments
and, ultimately, a higher cost of doing business.

To address the spread of data and eliminate silos of corporate information, many
corporates implement enterprise-wide Data Governance programs, which attempt
to codify and enforce best practices for data management across the organization.

Data Governance employs a 'Holistic' approach to the management of the People,
Policies, and Technology that manage enterprise data, thereby providing the
following benefits:

Effective decisions: Better data drives more effective decisions across every level
of the organization.

Better strategies: With more unified view of the enterprise, managers and
executives are able to devise strategies that make the company more profitable.

Increase in consistency and confidence: A consistent enterprise view of the
organization's data leads to increased consistency and confidence in decision
making

Reduction in risk: Decrease the risk of regulatory fines by adhering to rules,
processes, and standards for the creation, acquisition, usage, dissemination,
security, maintenance, and availability of data.

Consistent information quality: Data governance is a quality control discipline
for assessing, managing, using, improving, monitoring, maintaining, and
protecting organizational information. Data governance initiatives improve data
quality by assigning a team responsible for data's accuracy, accessibility,
consistency, and completeness, among other metrics.


Accountability for information creation, usage, and dissemination: Define roles
and responsibilities for data quality that ensure accountability and authority.

Implementation of Data Governance is a multi-faceted process, which includes the
following:

Data Governance is an evolutionary process

Set up a Data Resource Management team, supervised by business data stewards


Define and maintain data strategy and policies, manage data issues, estimate data
value and data management costs, and justify the budget for data management

Enforce Data management policies and programs to promote them

Make users aware of these policies and programs

Have Data Stewardship, Strategy, and Governance in place

It requires, at first, a buy-in from the top executives of the organization.
This is followed by setting up a Data Management Team consisting of data
stewards and other data management professionals.
A charter and plan is prepared, which lays down the rules and policies for
data management.
Allocation of an appropriate budget for the data management program is an
important step here.
This is followed by enforcing the data management programs and
promoting them.
Finally, users are made aware of the policies by conducting trainings, and they
are encouraged to adhere to these guidelines.
Data Stewardship


It is a role assigned to a person responsible for maintaining data elements in a
metadata registry. Its main objective is to manage an organization's data assets
in order to improve their integrity, usability, accessibility, and quality. A
Data Steward ensures that each data element has or does the following:

Has a clear and unambiguous data definition

Does not conflict with other data elements in the metadata registry
Documents the origin and source of each metadata element
Has adequate documentation on appropriate usage
Has data security specification and retention criteria
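The steward checks listed above could be automated along these lines. This is a hypothetical sketch: the registry layout and field names are illustrative, not a standard metadata-registry schema.

```python
# Fields a steward verifies for every data element (per the list above):
# definition, documented origin, usage notes, security spec, retention criteria.
REQUIRED = ("definition", "origin", "usage", "security", "retention")

def steward_issues(element_name, element, registry):
    """Return the list of problems a Data Steward would flag for one element."""
    issues = [f"missing {field}" for field in REQUIRED if not element.get(field)]
    # flag name conflicts with other elements in the metadata registry
    for other in registry:
        if other != element_name and other.lower() == element_name.lower():
            issues.append(f"conflicts with '{other}'")
    return issues

registry = {
    "cust_id": {"definition": "Unique customer identifier", "origin": "CRM",
                "usage": "Join key across marts", "security": "internal",
                "retention": "7 years"},
    "Cust_ID": {"definition": "", "origin": "", "usage": "", "security": "",
                "retention": ""},
}
issues = steward_issues("Cust_ID", registry["Cust_ID"], registry)
```

The undocumented 'Cust_ID' element is flagged both for its missing documentation and for conflicting with the existing 'cust_id' element.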

Characteristics of a Governed Organization

At the Governed stage, an organization has a unified Data Governance strategy
throughout the enterprise.
Data quality, data integration, and data synchronization are integral parts
of all business processes, and the organization achieves impressive results
from a single and unified view of the enterprise.
Characteristics exhibited by the people of a Governed Organization include the
following:
Data Governance has executive-level sponsorship with direct CEO support

Business users take an active role in data strategy and delivery
A data quality or Data Governance group works directly with data stewards,
application developers, and database administrators
Organization has 'zero defect' policies for data collection and management

Features of the policies implemented by Governed Organizations include the
following:
New initiatives are only approved after careful consideration of how the
initiatives will impact the existing data infrastructure
Automated policies are in place to ensure that data remains consistent, accurate,
and reliable throughout the enterprise
A service oriented architecture (SOA) encapsulates business rules for data quality
and identity management

Technologies and tools that are in place in a Governed Organization are as follows:


Data quality and data integration tools are standardized across the organization
All aspects of the organization use standard business rules created and maintained
by designated data stewards
Data is continuously inspected and any deviations from standards are resolved
Data models capture the business meaning and technical details of all corporate
data elements
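Continuous inspection against standard business rules might be sketched like this; the rules and field names here are hypothetical, standing in for the rules maintained by designated data stewards.

```python
# Standard business rules, one predicate per governed field.
rules = {
    "country": lambda v: v in {"IN", "US", "UK"},               # standardized codes
    "credit_limit": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def inspect(record):
    """Return the fields that deviate from the standards, for resolution."""
    return [field for field, ok in rules.items()
            if field in record and not ok(record[field])]

deviations = inspect({"country": "India", "credit_limit": -5})
```

A record using the free-text value 'India' instead of the standardized code 'IN', or carrying a negative credit limit, is reported for resolution rather than silently loaded.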

Risk and Rewards

Risks and rewards associated with Governed Organization include:

Risk: Low. Master Data is tightly controlled across the enterprise, allowing the
organization to maintain high-quality information about its customers, prospects,
inventory, and products.
Rewards: High. Corporate data practices can lead to a better understanding of an
organization's current business landscape, allowing management to have full
confidence in all data-based decisions.