Data Warehousing

Dr DVLN Somayajulu
Professor
Department of Computer Science and Engineering
National Institute of Technology
Warangal

Outline
 Data Warehouse Evolution
 What is Data Warehouse
 Uses of Data Warehouse
 Advantages of Data Warehouse
 Types of Warehouses
 Roadmap to Data Warehousing
 Conclusion

What is Data Warehousing?
A process of transforming data into information and
making it available to users in a timely enough
manner to make a difference

[Forrester Research, April 1996]

[Figure: Data transformed into Information]
What are Data Warehouses?
 Data warehouses store large volumes of data
which are frequently used by decision support systems (DSS)
 They are maintained separately from the
organization’s operational databases
 Data warehouses are relatively static with
only infrequent updates
 A data warehouse is a stand-alone repository
of information, integrated from several,
possibly heterogeneous operational
databases
Data Warehousing
 Is the enabling technology that
facilitates improved business decision-
making
 It’s a process, not a product
 A technique for assembling and
managing a wide variety of data from
multiple operational systems for
decision support and analytical
processing
It’s a journey ... not a destination
Data Warehouse Evolution
Data Collection and Database Creation (1960s and earlier)
- Primitive file processing
Database Management Systems (1970s – early 1980s)
Advanced Database Systems (mid 1980s)
Data Warehousing & Data Mining (late 1980s)
- Data Warehousing and OLAP Technology
- Data Mining and Knowledge Discovery
Web-based DBMS (late 1990s)
- XML DBMS
- Web Mining
New Generation of Integrated Information Systems (2000 ...)
Data Collection and Database
Creation
File Systems (late 1960s)
– Applications:
» Payrolls
» Utility bills
» Tenth class and intermediate examination
results in Andhra Pradesh
– Characterizations
» Batch requirements
» No queries
» Formatted reports
» One user / one application
» Typically prepared in COBOL
File systems
 Reporting system of primitive type

 Fixed reports and formats

 Unable to integrate different files

 Non-aggregated, rigid in their approach

Database Management Systems
 Applications:
– Airlines reservations
– Bank accounts
– Hotel room booking
– Sales in a super market
 Characterizations
– Ad hoc queries
– User friendly
– Several users per application
– Online
– Prepared using database packages like Oracle,
DB2, SQL Server, Sybase etc

OLTP vs. Warehousing
 Organized by transactions vs. organized
by particular subject
 Many users vs. fewer users
 Accesses a few records vs. an entire table
 Small databases vs. large databases
 Normalized data structure vs. un-normalized
 Continuous update vs. periodic update

Normal Reporting Architecture

[Figure: a single Source feeding several separate Reports]
Examples of OLTP Systems
 General ledger
 Accounts payable
 Financial management
 Order processing
 Order entry
 Inventory

Problems with current reporting
structures
 Accessibility
 Timeliness
 Format
 Integration

What Is a Data Warehouse?
– A blend of processes
» Hardware and software
» Business knowledge
» Systems integration skills
» Incremental and evolutionary
– Environment and infrastructure
– Information provider
– Not an off-the-shelf product
– Not a single project
[Figure: Production Databases → Staging Area → Data Warehouse → Users]

© Prentice Hall
The Difference Between Data
and Information
 Data
– What are the total sales for region A?
– Which salesperson earned the highest
commission this month?
 Information
– How have the sales for region A changed
over the past five years?
– Which products should sell best next year?
– Tell me something I did not know.
Data Warehouse Properties

– Subject Oriented
– Integrated
– Non Volatile
– Time Variant
Subject-oriented
 Organized around major subjects
such as customer, supplier, product,
time and sales
 Focuses on the modeling and
analysis of data for decision makers.
 Provides simple and concise view on
a particular subject issue by
excluding unwanted data for support
of making decisions.

Covers subjects of interest rather
than application areas
OLTP: Retail Sales System, Outlet Sales System, Catalog Sales System
Warehouse: Sales Subject Area
[Figure: Subject Oriented Sales Information]
Subject Oriented
Data is categorized and stored by business subject rather than
by application.

[Figure: Operational Systems (Equity Plans, Shares, Loans, Insurance, Savings) feed the Data Warehouse Subject Area: Customer Financial Information]
Subject Areas
– Business area organization
– Typical subject areas
» Customer accounts
» Product sales
» Customer savings
» Toll call usage
» Passenger booking
» Insurance claims
– Model contains measures and analysis
criteria
[Figure: Data Warehouse Subject Area: Customer Financial Information]
Subject Oriented

Process Oriented (Transactional Storage)
– Entry: Sales Rep, Quantity Sold, Prod Number, Date, Customer Name, Product Description, Unit Price, Mail Address
Subject Oriented (Data Warehouse Storage)
– Sales
– Customers
– Products
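The move from one process-oriented entry record to separate subject areas can be sketched in a few lines of Python. This is only an illustration: the field names are hypothetical spellings of the labels on the slide, and using the customer name and product number as keys is an assumption.

```python
def split_by_subject(entries):
    """Split flat order-entry records into Customers, Products, and Sales."""
    customers = {}  # keyed by customer name (assumed unique here)
    products = {}   # keyed by product number
    sales = []      # one row per entry, referring to the two subjects above
    for e in entries:
        customers[e["customer_name"]] = {"mail_address": e["mail_address"]}
        products[e["prod_number"]] = {
            "description": e["product_description"],
            "unit_price": e["unit_price"],
        }
        sales.append({
            "date": e["date"],
            "sales_rep": e["sales_rep"],
            "customer_name": e["customer_name"],
            "prod_number": e["prod_number"],
            "quantity_sold": e["quantity_sold"],
        })
    return customers, products, sales


# Hypothetical sample entries: one customer buying two products.
entries = [
    {"sales_rep": "R1", "quantity_sold": 2, "prod_number": "P10",
     "date": "2002-12-01", "customer_name": "Rao", "product_description": "Radio",
     "unit_price": 500.0, "mail_address": "Warangal"},
    {"sales_rep": "R1", "quantity_sold": 1, "prod_number": "P20",
     "date": "2002-12-02", "customer_name": "Rao", "product_description": "TV",
     "unit_price": 9000.0, "mail_address": "Warangal"},
]
customers, products, sales = split_by_subject(entries)
```

Each customer and product is stored once, while the sales rows keep only the references and the quantity.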
Data Sources → Data Warehouse → Reports & Analyses
Internal Data
• Sales data
• Accounting
• HR
• Other
External Data
• Demographics
• Purchased customer lists
• Other
Data Warehouse
• Integrated Data
Reports & Analyses
• Report
• Graph
• Pie chart
• Histogram

A data warehouse brings together data from various sources and makes
it available to users eager to create their own reports.
Integrated
Data on a given subject is defined and stored once.

[Figure: in the Operational Environment, the Savings Application, Current Accounts Application, and Loans Application each hold their own data; in the Data Warehouse there is No Application Flavor, only Subject = Customer]
Integration of Data
Transactional Storage (Appl. A, B, C) vs. Data Warehouse Storage

Attribute type | Appl. A | Appl. B | Appl. C | Data Warehouse
Encoding | M, F | 1, 0 | X, Y | M, F
Unit of Attributes | pipeline cm. | pipeline inches | pipeline mcf | pipeline cm
Physical Attributes | balance dec(13,2) | balance PIC 9(9)V99 | balance float | balance dec(13,2)
Naming Conventions | bal-on-hand | current_balance | balance | balance
Data Consistency | date (Julian) | date (yymmdd) | date (absolute) | date (Julian)
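The kind of transformation shown on this slide can be sketched as a small Python function. It is a minimal illustration, not a full integration layer: the slide does not say which source code corresponds to M or F, so the mappings for Appl. B and Appl. C below are assumptions, as are the record layouts. The mcf unit of Appl. C is left out because the slide gives no rule for converting it to cm.

```python
# Assumed code mappings: the slide lists the codes but not how they pair up.
GENDER_CODES = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},  # assumption
    "C": {"X": "M", "Y": "F"},  # assumption
}
# Each application names the balance field differently.
BALANCE_FIELDS = {"A": "bal-on-hand", "B": "current_balance", "C": "balance"}
CM_PER_INCH = 2.54


def to_warehouse(app, record):
    """Convert one source record to the warehouse encoding, names, and units."""
    out = {
        "gender": GENDER_CODES[app][str(record["gender"])],
        "balance": round(float(record[BALANCE_FIELDS[app]]), 2),
    }
    # Only Appl. A (cm) and Appl. B (inches) lengths are converted here.
    if "pipeline" in record and app in ("A", "B"):
        factor = CM_PER_INCH if app == "B" else 1.0
        out["pipeline_cm"] = round(record["pipeline"] * factor, 2)
    return out


row_b = to_warehouse("B", {"gender": 1, "current_balance": "100.50", "pipeline": 10})
row_a = to_warehouse("A", {"gender": "F", "bal-on-hand": 20, "pipeline": 30})
```

Whatever the source application, the resulting records share one encoding, one field name, and one unit.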


Integration
– Data from diverse operational systems
– One set of consistent, accurate, quality
information
– Standardization
» Naming conventions
» Coding structures
» Data attributes
» Measures
– Cleaning and integration process
» Fixes inconsistencies
» Consumes resources
» Requires commitment
Data Integrity Problems
 Same person, different spellings
– Agarwal, Agrawal, Aggarwal etc...
 Multiple ways to denote company name
– Persistent Systems, PSPL, Persistent Pvt. LTD.
 Use of different names
– Mumbai, Bombay
 Different account numbers generated by different
applications for the same customer
 Required fields left blank
 Invalid product codes collected at point of sale
– manual entry leads to mistakes
– “in case of a problem use 9999999”
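A few of these problems can be detected or repaired mechanically during cleaning. The sketch below is illustrative only: the field names, the alias table, and the list of required fields are hypothetical, and real spelling variants (Agarwal, Agrawal, Aggarwal) usually need fuzzier matching than a lookup table.

```python
CITY_ALIASES = {"bombay": "Mumbai", "mumbai": "Mumbai"}  # different names, same city
PLACEHOLDER_CODES = {"9999999"}                          # "in case of a problem" code
REQUIRED_FIELDS = ("customer", "city", "product_code")   # hypothetical field names


def clean_record(record):
    """Return (cleaned_record, problems) for one source record."""
    problems = []
    cleaned = dict(record)
    # Required fields left blank.
    for field in REQUIRED_FIELDS:
        if not str(cleaned.get(field) or "").strip():
            problems.append(f"missing {field}")
    # Use of different names for the same thing.
    city = str(cleaned.get("city") or "").strip()
    if city:
        cleaned["city"] = CITY_ALIASES.get(city.lower(), city)
    # Invalid product codes entered at the point of sale.
    if str(cleaned.get("product_code")) in PLACEHOLDER_CODES:
        problems.append("placeholder product code")
    return cleaned, problems


fixed, issues = clean_record(
    {"customer": "Agarwal", "city": "bombay", "product_code": "9999999"}
)
```

Records with a non-empty problem list would typically be routed for review rather than loaded as-is.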
Integrated
 Constructed by integrating multiple
heterogeneous sources, such as
relational databases, flat files, and
online transaction records
 Data cleaning and data integration
techniques are used
– To ensure consistency in naming
conventions, encoding structures,
attribute measures and so on

Provides common coding of data
Both within and across subject areas
OLTP: each system has its own Product Code format
– Retail Sales System: 9999999
– Outlet Sales System: xxxxxxx
– Catalog Sales System: xxxx99.99
Warehouse (Sales Subject Area):
– Product Code: common code or a mapping of the various source codes
[Figure: Integration of Coding Schemes]
Time Variant
Data is stored as a series of snapshots, each representing a
period of time.

Time | Data
01/97 | Data for January
02/97 | Data for February
03/97 | Data for March

[Figure: monthly snapshots accumulating in the Data Warehouse]
Time Variant
– Historical data
» Trend analysis
» Forecasting
» What-if
– Time element in database columns
– Refresh cycle defined for new snapshots
– Refresh frequency determined by users
– Grain must be determined
– Grain may not be the same as refresh
frequency
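The snapshot idea can be sketched with a plain dictionary keyed by period. This is a minimal illustration assuming a monthly grain; the period format, the region names, and the figures are made up for the example.

```python
snapshots = {}  # period ("YYYY-MM") -> {region: sales in lakhs}


def load_snapshot(period, sales_by_region):
    """Add one period's snapshot; earlier snapshots are never overwritten."""
    if period in snapshots:
        raise ValueError(f"snapshot for {period} already loaded")
    snapshots[period] = dict(sales_by_region)


def trend(region):
    """Return the (period, sales) series for one region, oldest first."""
    return [(p, snapshots[p].get(region, 0.0)) for p in sorted(snapshots)]


# One refresh cycle per month, each adding a new snapshot.
load_snapshot("1997-01", {"East": 10.0, "West": 7.0})
load_snapshot("1997-02", {"East": 12.0, "West": 6.0})
load_snapshot("1997-03", {"East": 15.0, "West": 9.0})
east_trend = trend("East")
```

Because every row carries its period, trend analysis is a simple read over the stored history.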
Time Variant Data Analysis

[Figure: bar chart "Sales (Region, Year - Year 97 - 1st Qtr)" showing sales in lakhs (axis 0 to 20) for East, West, and North in January, February, and March of Year 97. Transactional Storage holds current data; Data Warehouse Storage holds historical data.]
Time variant
 Data are stored to provide
information from a historical
perspective
– Example: the past 5 – 15 years

 Every key structure contains an
element of time, either implicitly or
explicitly
The time dimension is the key to most
management decisions
OLTP: [Figure: calendar for a single month, December 2002]
Warehouse: [Figure: calendars for the years 2000, 2001, and 2002]
Time Dimension in OLTP vs. the data warehouse
Non Volatile
Typically data in the data warehouse is not updated or deleted.

[Figure: Operational Databases accept INSERT, UPDATE, DELETE, and Read; data is moved by Load into the Warehouse Database, which is accessed by Read only]
Volatility of Data

Volatile (Transactional Storage): Insert, Change, Delete, Access; record-by-record data manipulation
Non-Volatile (Data Warehouse Storage): Load, Access; mass load / access of data
Non-volatile
 Warehouse is physically a separate
store of data transformed from the
application data found in the
operational environment
 No need for transaction
processing, recovery and
concurrency control methods due
to separation
 Needs only two operations – initial
loading of data and access of data
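The two operations, load and access, can be illustrated with Python's built-in sqlite3 module. This is a sketch under stated assumptions: SQLite stands in for the warehouse database, the `sales` table and its rows are hypothetical, and the read-only behaviour comes from SQLite's `mode=ro` URI option rather than from anything specific to warehouses.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def load_warehouse(path, rows):
    """Bulk-load rows into a new warehouse file (the 'load' operation)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE sales (month TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()


def open_for_access(path):
    """Open the warehouse read-only (the 'access' operation)."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def attempt_update(conn):
    """Return True if an UPDATE succeeds, False if the store rejects it."""
    try:
        conn.execute("UPDATE sales SET amount = 0")
        return True
    except sqlite3.OperationalError:
        return False


db_path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
load_warehouse(db_path, [("1997-01", "East", 10.0), ("1997-02", "East", 12.0)])
reader = open_for_access(db_path)
total = reader.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
update_allowed = attempt_update(reader)
```

Users of the `reader` connection can query freely, but any attempt to change the data is refused.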
Allowing users to write would
compromise the integrity of the data
OLTP: user can Read and Write
Warehouse: user can Read only
[Figure: OLTP read/write vs. data warehouse read-only]
Changing Data
[Figure: Operational Databases populate the Warehouse Database with a first-time load followed by periodic Refresh cycles; old data is Purged or Archived]
Characteristics of data in DW
A DW can be viewed as an informational
system with the following attributes:
 It is a database designed for analytical
tasks using data from multiple applications.
 Supports a small number of users with long
interactions.
 Usage is read-intensive.
 Content is periodically updated (mostly
additions)
 Contains a few large tables

Benefits of DW
 Access to a wide variety of data
 Results can be presented in a variety of formats
(reports, graphs)
 Enhances the value of operational business
applications.
 Cost of product introduction comes down with
targeted marketing campaigns.
 Better decisions at low cost.
 Clear picture of asset and liability management,
enterprise-wide purchasing, and inventory
patterns.
 Maintain good relations with customers by
knowing their requirements.
Limitations of DW
 Cannot create additional data.
 If data quality is poor, then decisions will
be inaccurate.
Risks of DW
 Organizational: risks related to the project team.
 Technological: selection of technology, poor
scalability of architecture.
 Project management: scale and scope of projects
are ill-defined.
 Data and design: poor quality of data,
unreliable data, improper collection of data.
Why Separate Data
Warehouse?
 Performance
– Operational databases are designed and tuned for known
transactions and workloads.
– Complex OLAP queries would degrade performance for
operational transactions.
– Special data organization, access, and
implementation methods are needed for
multidimensional views and queries.
Why Separate Data
Warehouse?
 Function
– Missing data: decision support requires historical data, which
operational databases do not typically maintain.
– Data consolidation: decision support requires consolidation
(aggregation, summarization) of data from many
heterogeneous sources: operational databases and external sources.
– Data quality: different sources typically use inconsistent data
representations, codes, and formats, which have to be
reconciled.
Data Warehouse Evolution
Stage | Focus | Question | Characteristic | Workload
Stage 1 | Reporting | WHAT happened? | Primarily batch | Batch
Stage 2 | Analyzing | WHY did it happen? | Increase in ad hoc queries | Ad hoc
Stage 3 | Predicting | WHAT will happen? | Analytical modeling grows | Analytics
Stage 4 | Operationalizing | WHAT IS happening? | Continuous update and time-sensitive queries become important | Continuous update / short queries
Stage 5 | Active warehousing | MAKING it happen! | Event-initiated actions take hold | Event-initiated actions
Data Warehouse Pitfalls
 You are going to spend much time extracting,
cleaning, and loading data
 Despite best efforts at project management, data
warehousing project scope will increase
 You are going to find problems with systems feeding
the data warehouse
 You will find the need to store data not being
captured by any existing system
 You will need to validate data not being validated by
transaction processing systems

Data Warehouse Pitfalls
 Some transaction processing systems feeding the
warehousing system will not contain detail
 Many warehouse end users will be trained and never or
seldom apply their training
 After end users receive query and report tools, requests
for IS-written reports may increase
 Your warehouse users will develop conflicting business
rules
 Large scale data warehousing can become an exercise
in data homogenizing

Data Warehouse Pitfalls
 'Overhead' can eat up great amounts of disk
space
 The time it takes to load the warehouse will
expand to fill the time in the
available window... and then some
 Assigning security cannot be done with a
transaction processing system mindset
 You are building a HIGH maintenance system
 You will fail if you concentrate on resource
optimization to the neglect of project, data, and
customer management issues and an
understanding of what adds value to the
customer
Warehouse Architecture
[Figure: warehouse architecture diagram; its callouts describe the components as follows]
– Any electronic repository of information that contains data of interest for management use or analytics
– Extracts data from the source locations and transforms it to the target format and structure
– Normally a relational database designed to hold large amounts of information for analysis
– Business intelligence tools, executive information systems, OLAP, and data mining
– Operations requiring data loading, processing, and applications manipulating data from the DW. Covers user management, security, and capacity restrictions as well.
– Informs operators about Data Warehouse system status and contains information about the data stored within, such as table & column names, descriptions, etc.
http://en.wikipedia.org/wiki/Data_warehousing
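Information about the data stored within, such as table and column names, is usually available from the database's own catalog. The sketch below uses SQLite as a stand-in warehouse and reads its catalog through `sqlite_master` and `PRAGMA table_info`; the two tables are hypothetical examples.

```python
import sqlite3


def describe(conn):
    """Return {table_name: [(column_name, declared_type), ...]} from the catalog."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    catalog = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


# Hypothetical warehouse tables, created only so there is something to describe.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, region TEXT, amount REAL)")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
catalog = describe(conn)
```

A real metadata layer would add descriptions, load times, and ownership on top of these names and types.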
Data Warehousing
 Applications:
– Sales per item per branch of a retail chain store
– Sales per item per month
– Rainfall in a particular month spanning over a
period in Andhra Pradesh
– Revenue collection of a particular category
spanning over a period (e.g., Electronics)

 Characterizations
– Aggregate queries on certain specified attributes
(called dimensions)
– Data Organized around major subjects such as
supplier, stores etc.
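Queries such as "sales per item per branch" and "sales per item per month" are aggregates grouped by one or more dimensions. A minimal sketch with Python's sqlite3 module follows; the `sales` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, branch TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("TV", "Warangal", "2002-11", 300.0),
        ("TV", "Warangal", "2002-12", 200.0),
        ("TV", "Hyderabad", "2002-12", 400.0),
        ("Radio", "Hyderabad", "2002-12", 50.0),
    ],
)

ALLOWED_DIMENSIONS = {"item", "branch", "month"}


def sales_by(*dimensions):
    """Total sales grouped by the given dimensions."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    for dim in dimensions:
        # Column names cannot be bound as parameters, so check them explicitly.
        if dim not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim}")
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


per_item_branch = sales_by("item", "branch")
per_month = sales_by("month")
```

The same fact table answers both questions; only the grouping dimensions change.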
Business Intelligence vs
Transaction Processing
 Operational systems are designed to work
with small pieces of information
 Operational data must frequently be updated
in real time
 Operational system schemas are designed
for rapid data input.
 Operational users need immediate response
 Operational system usage patterns are
relatively predictable
 Design of operational systems is complex.
Uses of Data Warehouses

 Presentation of standard reports and graphs

 For dimensional analysis

 Data mining
Advantages of Warehousing
 Lowers cost of information access

 Improves customer responsiveness

 Identifies hidden business opportunities

 Helps to make strategic decisions
Types of Data Warehouses
 Operational data store: Operational
data mirror.
– Example: item in stock
 Enterprise data warehouse:
Historical analysis, Complex
pattern analysis
 Data marts

Return on Investment
A data warehouse helps the organization serve
its customer market better. The return depends
on:
– More rapid access to data
– More reliable reporting
– More flexible data presentation

Return on Investment ( contd)
Warehousing go/no go decision depends on:
– Does it give competitive advantage?
– Does it improve the bottom line?
– Will it deliver on all its promises?
– Will it be delivered on time?
– What is the risk, if you don’t do it?
– What is the risk, if you do it?
– Will it be delivered on budget?

[Figure: Source Systems (Client Data, Custom Applications, ERP, Packaged Application) on one side; Executives, Managers, and Business Analysts on the other, asking "How business is really doing?"]
Data Warehousing and OLAP
[Figure: Source Systems (Client Data, Custom Applications, ERP, Packaged Application) feed the Data Warehouse, which feeds OLAP Cubes used by Executives, Managers, and Business Analysts]
Key Benefits
• Very fast
• Very flexible
• All potential queries available
• Offloads queries from production system
• Provides consistent data model
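One way to read "very fast" and "all potential queries available" is that a cube holds aggregates precomputed for every combination of dimensions. The sketch below builds such a structure in plain Python. It is an illustration of the idea only, not how any particular OLAP product stores its cubes, and the fact rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(facts, dimensions, measure):
    """Precompute SUM(measure) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group in combinations(dimensions, size):
            totals = defaultdict(float)
            for fact in facts:
                key = tuple(fact[d] for d in group)
                totals[key] += fact[measure]
            cube[group] = dict(totals)
    return cube


facts = [
    {"region": "East", "month": "Jan", "sales": 10.0},
    {"region": "East", "month": "Feb", "sales": 12.0},
    {"region": "West", "month": "Jan", "sales": 5.0},
]
cube = build_cube(facts, ("region", "month"), "sales")

grand_total = cube[()][()]                      # no dimensions: overall total
east_total = cube[("region",)][("East",)]       # rolled up over months
east_feb = cube[("region", "month")][("East", "Feb")]
```

Answering a query then becomes a dictionary lookup, at the cost of storing one table of totals per combination of dimensions.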
Roadmap to Data warehousing
 Data extracted, transformed and
cleaned
 Stored in a database – RDBMS or
multidimensional database (MDD)
 Query and Reporting Systems
 Executive Information System and
Decision Support System

Conclusion
 File systems are primitive reporting
structures, generally targeted towards a
single user. They are rigid in format and
generate periodic reports.

 Databases (OLTP) hold the operational
data of an organization. They provide
online access and flexible queries, and
can serve multiple users. They are
complex in nature and generally suitable
for small volumes of data.
Conclusion (contd)
 Data warehouses are sophisticated
reporting structures for an
organization. They work on archival
data and provide flexible reporting for
the organization. They are
organized along some identified
attributes (dimensions), generally
time. The motivation for a data
warehouse comes from the needs
of top management.
Online Transaction Processing
Systems
A transaction processing system (TPS) is an
organized collection of people, procedures, software,
databases, and devices used to record completed
business transactions.

 Process business exchanges


 Maintain records about the exchanges
 Handle routine, yet critical, tasks
 Perform simple calculations
