
Data Warehousing

Md Tabrez Nafis
Department of Computer Science & Engineering
JAMIA HAMDARD, New Delhi

The Need for Data Analysis
• Managers track daily transactions to evaluate
how the business is performing
• Strategies should be developed to meet
organizational goals using operational
databases
• Data analysis provides information about
short-term tactical evaluations and strategies

Need to Separate Operational and
Informational Systems
• Operational systems are used to run a business in real
time based on current data.
– E.g., sales order processing, reservation systems, patient
registration.
– Process large volumes of relatively simple read/write
transactions, while providing fast response.
• Informational systems are designed to support decision
making based on historical data.
– Designed for complex, read-only queries or data mining
applications.
– E.g., sales trend analysis, customer segmentation, and human
resource planning.
Need to Separate Operational and
Informational Systems (2)
• It is essential to separate informational processing
from operational processing by creating a data
warehouse.
– A DW centralizes data (at least logically) that are scattered
throughout disparate operational systems and makes them
readily available for decision support.
– A properly designed DW adds value to data by improving
their quality and consistency.
– A separate data warehouse eliminates much of the
contention for resources that results when informational
applications are confounded with operational processing.
How to Analyze Data?
• Comprehensive, cohesive, integrated tools and
processes
– Capture, collect, integrate, store, and analyze data
– Generate information to support business decision
making
• Framework that allows a business to transform:
– Data into information
– Information into knowledge
– Knowledge into wisdom

Goal
• Main goal: improved decision making
• Other benefits
– Integrating architecture
– Common user interface for data reporting and
analysis
– Common data repository encourages a single version
of company data
– Improved organizational performance

Operational Data Vs Decision Support
Data
• Data analysis effectiveness depends on quality
of data gathered at operational level
• Operational data (day to day use) are seldom
well-suited for decision support tasks
• Data need to be reformatted in order to be useful for
decision making

Operational Data vs.
Decision Support Data
• Operational data
– Mostly stored in relational database
– Optimized to support transactions representing
daily operations
• Decision support data differs from operational
data in three main areas:
– Time span
– Granularity
– Dimensionality
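
The difference in granularity and dimensionality can be illustrated with a small sketch. The rows, field names, and the `summarize` helper below are hypothetical and not taken from the slides; the sketch only shows transaction-level rows being rolled up to a coarser grain (month) along a chosen dimension (region).

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction-level (operational) rows: one row per sale.
transactions = [
    {"day": date(2023, 1, 5), "region": "North", "product": "A", "amount": 120.0},
    {"day": date(2023, 1, 17), "region": "North", "product": "B", "amount": 80.0},
    {"day": date(2023, 2, 2), "region": "South", "product": "A", "amount": 50.0},
    {"day": date(2023, 2, 9), "region": "North", "product": "A", "amount": 30.0},
]


def by_month(row):
    """Coarser time granularity: (year, month) instead of a single day."""
    return (row["day"].year, row["day"].month)


def by_region(row):
    """One analysis dimension; product could be used the same way."""
    return row["region"]


def summarize(rows, dimensions):
    """Total the amounts for every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(dim(row) for dim in dimensions)
        totals[key] += row["amount"]
    return dict(totals)


# Decision support view: monthly totals per region.
monthly_by_region = summarize(transactions, [by_month, by_region])
```

Passing a different list of dimension functions (for example only `by_month`) gives a view at another level of detail from the same base rows.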

Problem: Heterogeneous
Information Sources
“Heterogeneities are everywhere”
[Figure: Personal Databases, Scientific Databases, World Wide Web, Digital Libraries]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
Problem: Data Management in
Large Enterprises
• Vertical fragmentation of informational systems
(vertical stove pipes)
• Result of application (user)-driven development
of operational systems
[Figure: vertical stove pipes. Applications (Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, In. Control, Inventory, ...) are grouped into separate departmental columns: Sales, Administration, Finance, Manufacturing, ...]

Goal: Unified Access to Data
[Figure: an Integration System sits above the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing

Why Separate Data Warehouse?
• High performance for both systems:
– DBMS — tuned for OLTP: access methods, indexing,
concurrency control, recovery
– Warehouse — tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
• Different functions and different data:
– missing data: Decision support requires historical data which
operational DBs do not typically maintain
– data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
– data quality: different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled.
Why a Warehouse?
• Two approaches:
– Query-driven (lazy)
– Warehouse (eager)

The Traditional Research Approach
• Query-driven (lazy, on-demand)
[Figure: Clients query an Integration System, which uses Metadata and reaches each Source through its own Wrapper]

Disadvantages of Query-Driven
Approach
• Delay in query processing
• Slow or unavailable information sources
• Complex filtering and integration
• Inefficient and potentially expensive for
frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry

The Warehousing Approach
• Information integrated in advance
• Stored in the warehouse for direct querying and analysis
[Figure: Clients query the Data Warehouse; an Integration System, using Metadata, populates it from each Source through an Extractor/Monitor]

Advantages of Warehousing Approach
• High query performance
– But not necessarily most current information
• Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry
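
The trade-off between the two approaches can be sketched in a few lines. The classes `Source`, `QueryDrivenMediator`, and `Warehouse` below are hypothetical stand-ins, not components named in the slides; the sketch only counts how often each source is contacted and shows that the warehouse answer may lag behind the sources.

```python
class Source:
    """Hypothetical information source that counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.hits = 0

    def fetch(self):
        self.hits += 1
        return list(self.rows)


class QueryDrivenMediator:
    """Lazy approach: contact every source each time a query arrives."""

    def __init__(self, sources):
        self.sources = sources

    def total(self):
        return sum(value for s in self.sources for value in s.fetch())


class Warehouse:
    """Eager approach: copy the data in advance, then answer locally."""

    def __init__(self, sources):
        self.sources = sources
        self.store = []

    def load(self):
        self.store = [value for s in self.sources for value in s.fetch()]

    def total(self):
        return sum(self.store)


lazy_sources = [Source([10, 20]), Source([30])]
mediator = QueryDrivenMediator(lazy_sources)
lazy_answers = [mediator.total() for _ in range(3)]    # each query hits each source

eager_sources = [Source([10, 20]), Source([30])]
warehouse = Warehouse(eager_sources)
warehouse.load()                                        # one extraction pass
eager_answers = [warehouse.total() for _ in range(3)]   # answered from the copy

eager_sources[1].rows.append(40)   # a new operational transaction arrives
stale_total = warehouse.total()    # still the old answer until the next load
```

After three queries each lazy source has been contacted three times, while each warehoused source was contacted once; the price is that `stale_total` does not yet reflect the newest transaction.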
Why Data Warehouse?
Why The Hype?
Data Warehousing and Industry
• One of the hottest topics in IS.
– Over 90% of larger companies either have a DW or are starting one.
• Warehousing is big business
– Old statistics from Metagroup.
• $3.5 billion in early 1997
• $8 billion in 1998 [Metagroup]
• over $200 billion over next 5 years.
– Latest by IDC on DW tools.
• $5 billion in 1999.
• $16 billion in 2004.
– Latest by IDC on CRM applications
• $61 billion in 2001
• $148 billion in 2005

Data Warehousing and Industry (2)
• A 1996 study of 62 data warehousing projects
showed an average return on investment of
321%, with an average payback period of 2.73
years.
• In 2003, some people are skeptical.
• WalMart has the largest warehouse
– 900-CPU, 2,700 disk, 23 TB Teradata system
– ~7TB in warehouse
– 40-50GB per day

What is Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the
organization’s operational database
– Support information processing by providing a solid platform of
consolidated, historical data for analysis.

• “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of
management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
– The process of constructing and using data warehouses
William H. Inmon
• Father of the data warehouse
• Co-creator of the Corporate
Information Factory.
• He has 35 years of
experience in database
technology management
and data warehouse design.
Characteristics of Data Warehouse

• Subject oriented. Data are organized based on how
the users refer to them.
• Integrated. All inconsistencies regarding naming
convention and value representations are removed.
• Nonvolatile. Data are stored in read-only format and
do not change over time.
• Time variant. Data are not current but normally time
series.
Characteristics of Data Warehouse

• Summarized. Operational data are mapped into a
decision-usable format.
• Large volume. Time series data sets are normally
quite large.
• Not normalized. DW data can be, and often are,
redundant.
• Metadata. Data about data are stored.
• Data sources. Data come from internal and external
unintegrated operational systems.
Data Warehouse

• Subject oriented
• Data integrated
• Time variant
• Nonvolatile
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product,
sales
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing
• Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process

A Data Warehouse is Subject Oriented
• Data is stored by business subject rather than by
application
[Figure: two columns.
Operational Applications/Databases: Order Billing, Accounts Receivable, Accounts Payable, Loans, Savings, Life Insurance, Auto Insurance, Claims Processing.
Data Warehouse Subjects: Customer, Claims, Sales, Product.]

Subject Orientation
• Application environment: design activities must be equally
focused on both process and database design.
• Data warehouse environment: the DW world is primarily void of process
design and tends to focus exclusively on issues of data modeling
and database design.
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data
sources
– relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.
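
The hotel price example can be sketched as a conversion step applied when data is moved to the warehouse. Everything specific here is an assumption made for illustration: the `SourcePrice` fields, the exchange rates, and the warehouse convention (USD, tax included, room only) are not given in the slides.

```python
from dataclasses import dataclass

# Illustrative constants only; real rates would come from a reference table.
RATE_TO_USD = {"USD": 1.0, "EUR": 1.25}


@dataclass
class SourcePrice:
    amount: float                 # nightly price as quoted by the source
    currency: str
    tax_included: bool
    tax_rate: float               # e.g. 0.25 for 25 %
    breakfast_included: bool
    breakfast_price: float = 0.0  # same currency and tax basis as amount


def to_warehouse_price(p: SourcePrice) -> float:
    """Convert to the assumed convention: USD, tax included, room only."""
    amount = p.amount
    if p.breakfast_included:
        amount -= p.breakfast_price      # strip the breakfast component
    if not p.tax_included:
        amount *= 1 + p.tax_rate         # add tax where the source omits it
    return round(amount * RATE_TO_USD[p.currency], 2)


# Two sources quoting the same room in different ways.
source_a = SourcePrice(100.0, "EUR", True, 0.25, True, 20.0)
source_b = SourcePrice(80.0, "USD", False, 0.25, False)
```

Under these assumed figures both quotes convert to the same warehouse price, which is the point of the cleaning step: values become comparable only after currency, tax, and inclusions are reconciled.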

Data Integrated
• Integration: consistent naming conventions
and measurement attributes, accuracy, and
common aggregation.
• Establishment of a common unit of measure
for all synonymous data elements from
dissimilar databases.
• The data must be stored in the DW in an
integrated, globally acceptable manner
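
A minimal sketch of a common naming convention and a common unit of measure, assuming two hypothetical source applications (`app_a`, `app_b`) whose field names and units are invented for illustration. Each source field is mapped to one warehouse field and, where needed, converted to metres.

```python
# Hypothetical mapping: source field -> (warehouse field, factor to metres).
# A factor of None means the value is copied unchanged.
MAPPINGS = {
    "app_a": {"cust_nm": ("customer_name", None), "len_cm": ("length_m", 0.01)},
    "app_b": {"CustomerName": ("customer_name", None), "length_in": ("length_m", 0.0254)},
}


def integrate(source, record):
    """Rename fields and convert units so records from all sources line up."""
    out = {}
    for field, value in record.items():
        target, factor = MAPPINGS[source][field]
        out[target] = value if factor is None else round(value * factor, 4)
    return out


row_a = integrate("app_a", {"cust_nm": "Ann", "len_cm": 250})
row_b = integrate("app_b", {"CustomerName": "Bob", "length_in": 100})
```

Both results use the same field names and the same unit, so synonymous data elements from dissimilar databases can be stored side by side.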
Data Integrated
• Data is stored once in a single integrated location
[Figure: in the operational environment, customer data is stored in several databases, one each for the Savings, Current Accounts, and Personal Loans applications. In the decision support environment, the data warehouse stores it once, with no application flavor, under Subject = Customer.]

Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer
than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”

Time-variant
• Data is stored as a series of snapshots or views that
record data content and context across time.
[Figure: Data Warehouse Data, showing Time and Data; Key, Version and Date timestamp]
• Data is tagged with some element of time: creation date,
as of date/to, etc.
• Data is available for long periods of time, for example
five or more years.

Time Variant
• In an operational application system, the expectation
is that all data within the database are accurate as of
the moment of access. In the DW data are simply
assumed to be accurate as of some moment in time
and not necessarily right now.
• One of the places where DW data display time
variance is in the structure of the record key. Every
primary key contained within the DW must contain,
either implicitly or explicitly, an element of time (day,
week, month, etc.).
Time Variant
• Every piece of data contained within the
warehouse must be associated with a
particular point in time if any useful analysis is
to be conducted with it.
• Another aspect of time variance in DW data is
that, once recorded, data within the
warehouse cannot be updated or changed.
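
One way to picture a key that carries an element of time is a table keyed by (business key, snapshot date). The `SnapshotTable` class and the customer data below are hypothetical; this is a sketch of the idea, not a description of any particular warehouse product.

```python
import bisect
from datetime import date


class SnapshotTable:
    """Rows keyed by (business key, snapshot date); old snapshots are kept."""

    def __init__(self):
        self._dates = {}    # business key -> sorted list of snapshot dates
        self._values = {}   # (business key, snapshot date) -> value

    def load(self, key, snapshot_date, value):
        if (key, snapshot_date) in self._values:
            raise ValueError("snapshot already recorded and cannot be changed")
        bisect.insort(self._dates.setdefault(key, []), snapshot_date)
        self._values[(key, snapshot_date)] = value

    def as_of(self, key, when):
        """Value from the latest snapshot taken on or before `when`."""
        dates = self._dates.get(key, [])
        i = bisect.bisect_right(dates, when)
        if i == 0:
            return None     # nothing was known about this key at that time
        return self._values[(key, dates[i - 1])]


table = SnapshotTable()
table.load("cust-1", date(2020, 1, 1), "Delhi")
table.load("cust-1", date(2022, 6, 1), "Mumbai")   # a change is a new snapshot

city_2021 = table.as_of("cust-1", date(2021, 3, 15))
city_2023 = table.as_of("cust-1", date(2023, 1, 1))
```

Because the snapshot date is part of the key, a change in the source adds a row instead of overwriting one, and a query can ask what was true as of any earlier moment.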
Nonvolatility
• Typical activities such as deletes, inserts, and
changes that are performed in an operational
application environment are completely
nonexistent in a DW environment.
• Only two data operations are ever performed
in the DW: data loading and data access
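
The two permitted operations can be sketched as an interface that simply has no update or delete method. `NonvolatileStore` and its rows are hypothetical, chosen only to illustrate the idea.

```python
class NonvolatileStore:
    """Offers only the two operations named above: loading and access."""

    def __init__(self):
        self._rows = []

    def load(self, batch):
        # Rows are stored as tuples so a loaded row cannot be edited in place.
        self._rows.extend(tuple(row) for row in batch)

    def access(self, predicate=lambda row: True):
        # A new list is returned, so callers cannot alter the stored rows.
        return [row for row in self._rows if predicate(row)]


store = NonvolatileStore()
store.load([("2023-01", "North", 200.0), ("2023-02", "North", 30.0)])
store.load([("2023-03", "North", 75.0)])   # later loads only append

north_rows = store.access(lambda row: row[1] == "North")
```

Each load appends to what is already there, and readers receive copies of the results, so earlier data stays as it was loaded.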
Non-volatile
• Existing data in the warehouse is not overwritten or
updated.
[Figure: External and Internal Source Systems run create, update, and delete transactions; their data is loaded into the Data Warehouse, which is READ-ONLY for Business Users & Applications]

Nonvolatility
• Application: the design issues must focus on data
integrity and update anomalies. Complex processes must be coded
to ensure that the data update processes allow for high
integrity of the final product.
– DW: such issues are of no concern in a DW
environment because data update is never performed.
• Application: data is placed in normalized form to
ensure minimal redundancy (totals that could be calculated
would never be stored).
– DW: designers find it useful to store many
such calculations or summarizations.
• Application: the technologies necessary to support
transaction and data recovery, rollback, and detection and
remedy of deadlock are quite complex.
– DW: relative simplicity in technology.
Data Warehousing --
It is a process
• Technique for assembling and
managing data from various
sources for the purpose of
answering business questions,
thus making possible decisions that were
not previously possible
• A decision support database
maintained separately from the
organization’s operational
database
