Md Tabrez Nafis
Department of Computer Science & Engineering
JAMIA HAMDARD, New Delhi
The Need for Data Analysis
• Managers track daily transactions to evaluate
how the business is performing
• Strategies should be developed to meet
organizational goals using operational
databases
• Data analysis provides information about
short-term tactical evaluations and strategies
Need to Separate Operational and
Informational Systems
• Operational systems are used to run a business in real
time, based on current data.
– E.g. sales order processing, reservation systems, patient
registration.
– They process large volumes of relatively simple read/write
transactions while providing fast response.
• Informational systems are designed to support decision
making based on historical data.
– Designed for complex, read-only queries or data mining
applications.
– E.g. sales trend analysis, customer segmentation, and human
resource planning.
Need to Separate Operational and
Informational Systems (2)
• It is essential to separate informational processing
from operational processing by creating a data
warehouse.
– A DW centralizes data (at least logically) that are scattered
throughout disparate operational systems and makes them
readily available for decision support.
– A properly designed DW adds value to data by improving
their quality and consistency.
– A separate data warehouse eliminates much of the
contention for resources that results when informational
applications are confounded with operational processing.
How to Analyze Data?
• Comprehensive, cohesive, integrated tools and
processes
– Capture, collect, integrate, store, and analyze data
– Generate information to support business decision
making
• Framework that allows a business to transform:
– Data into information
– Information into knowledge
– Knowledge into wisdom
Goal
• Main goal: improved decision making
• Other benefits
– Integrating architecture
– Common user interface for data reporting and
analysis
– Common data repository encourages a single version
of company data
– Improved organizational performance
Operational Data vs. Decision Support
Data
• Data analysis effectiveness depends on quality
of data gathered at operational level
• Operational data (day to day use) are seldom
well-suited for decision support tasks
• Operational data need to be reformatted to be useful for
decision making
Operational Data vs.
Decision Support Data
• Operational data
– Mostly stored in relational database
– Optimized to support transactions representing
daily operations
• Decision support data differs from operational
data in three main areas:
– Time span
– Granularity
– Dimensionality
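To make these three differences concrete, the sketch below rolls transaction-level rows (operational granularity) up to monthly totals, optionally broken down by a dimension such as region. The row layout, the sample values, and the `summarize` function are illustrative assumptions, not something defined in the lecture.

```python
from collections import defaultdict
from datetime import date

# Hypothetical operational rows: one row per sale (finest granularity).
SALES = [
    {"day": date(2023, 1, 5), "region": "North", "product": "A", "amount": 100.0},
    {"day": date(2023, 1, 20), "region": "North", "product": "B", "amount": 50.0},
    {"day": date(2023, 2, 3), "region": "North", "product": "A", "amount": 70.0},
    {"day": date(2023, 2, 9), "region": "South", "product": "A", "amount": 30.0},
]

def summarize(rows, dimensions):
    """Sum amounts per (year, month) plus the requested dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["day"].year, row["day"].month) + tuple(row[d] for d in dimensions)
        totals[key] += row["amount"]
    return dict(totals)

# Coarser granularity over a longer time span, viewed along one dimension.
by_month_region = summarize(SALES, ["region"])
# The same data with no extra dimension: monthly totals only.
by_month = summarize(SALES, [])
```

Here the time span is the set of months covered, the granularity is the monthly roll-up, and the dimensionality is the list of attributes passed in `dimensions`.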
Problem: Heterogeneous
Information Sources
“Heterogeneities are everywhere”
[Figure: heterogeneous sources: personal databases, scientific databases, the World Wide Web, digital libraries]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
Problem: Data Management in
Large Enterprises
• Vertical fragmentation of informational systems
(vertical stove pipes)
• Result of application (user)-driven development
of operational systems
[Figure: stove-pipe operational applications: Sales, Planning, Suppliers, In. Control, Stock Mngmt, Debt Mngmt, Inventory, ...]
[Figure: an integration system on top of heterogeneous sources: World Wide Web, personal databases, digital libraries, scientific databases]
The Traditional Research Approach
• Query-driven (lazy, on-demand)
[Figure: Clients → Wrappers → Sources]
Disadvantages of Query-Driven
Approach
• Delay in query processing
– Slow or unavailable information sources
– Complex filtering and integration
• Inefficient and potentially expensive for
frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry
The Warehousing Approach
• Information integrated in advance
• Stored in warehouse for direct querying and analysis
[Figure: Clients → Data Warehouse (with Metadata) → Integration System → Extractors/Monitors → Sources]
Advantages of Warehousing Approach
• High query performance
– But not necessarily most current information
• Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry
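The trade-off between the two approaches can be sketched in a few lines. This is a toy model under stated assumptions: `Source`, `Warehouse`, and `query_driven_total` are hypothetical names, and each source is just a list of numbers with a hit counter.

```python
class Source:
    """Hypothetical information source that counts how often it is hit."""
    def __init__(self, rows):
        self.rows = list(rows)
        self.hits = 0

    def fetch(self):
        self.hits += 1
        return list(self.rows)

def query_driven_total(sources):
    """Lazy approach: contact every source at query time."""
    return sum(sum(s.fetch()) for s in sources)

class Warehouse:
    """Eager approach: integrate in advance, then answer locally."""
    def __init__(self):
        self.rows = []

    def load(self, sources):
        self.rows = [value for s in sources for value in s.fetch()]

    def total(self):
        return sum(self.rows)

sources = [Source([10, 20]), Source([5])]
warehouse = Warehouse()
warehouse.load(sources)          # one extraction pass per source
sources[0].rows.append(100)      # a later operational change

# Repeated warehouse queries never touch the sources again.
warehouse_answers = [warehouse.total() for _ in range(3)]
# The query-driven answer is current, but it hits every source.
lazy_answer = query_driven_total(sources)
```

The warehouse answers stay at the value captured at load time, which illustrates "not necessarily most current information", while the sources are contacted only once for loading and once for the lazy query.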
Why Data Warehouse?
Why The Hype?
Data Warehousing and Industry
• One of the hottest topics in IS.
– Over 90% of larger companies either have a DW or are starting one.
• Warehousing is big business
– Old statistics from Meta Group.
• $3.5 billion in early 1997
• $8 billion in 1998 [Meta Group]
• over $200 billion over next 5 years.
– Latest by IDC on DW tools.
• $5 billion in 1999.
• $16 billion in 2004.
– Latest by IDC on CRM applications
• $61 billion in 2001
• $148 billion in 2005
Data Warehousing and Industry (2)
• A 1996 study of 62 data warehousing projects
showed an average return on investment of
321%, with an average payback period of 2.73
years.
• In 2003, some people are skeptical.
• WalMart has the largest warehouse
– 900-CPU, 2,700 disk, 23 TB Teradata system
– ~7TB in warehouse
– 40-50GB per day
What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the
organization’s operational database
– Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
• Subject oriented
• Data integrated
• Time variant
• Nonvolatile
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product,
sales
• Focuses on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing
• Provides a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process
A Data Warehouse is Subject Oriented
[Figure: operational applications/databases mapped to data warehouse subjects]
Data Integrated
• Integration: consistent naming conventions
and measurement attributes, accuracy, and
common aggregation.
• Establishment of a common unit of measure
for all synonymous data elements from
dissimilar databases.
• The data must be stored in the DW in an
integrated, globally acceptable manner.
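As a minimal sketch of these points, the example below maps two source extracts with different field names and units onto one naming convention and one unit of measure. The source layouts, the field names, and the choice of paise as the common unit are assumptions made for illustration.

```python
# Hypothetical extracts: the same customer, described differently per source.
SAVINGS_ROWS = [{"cust_no": "C1", "bal_paise": 1_250_000}]
LOANS_ROWS = [{"customer_id": "C1", "outstanding_rupees": 40_000}]

def integrate(savings_rows, loans_rows):
    """Map both sources onto one naming convention and one unit (paise)."""
    warehouse_rows = []
    for row in savings_rows:
        warehouse_rows.append({
            "customer_id": row["cust_no"],
            "account_type": "savings",
            "amount_paise": row["bal_paise"],
        })
    for row in loans_rows:
        warehouse_rows.append({
            "customer_id": row["customer_id"],
            "account_type": "personal_loan",
            "amount_paise": row["outstanding_rupees"] * 100,  # rupees -> paise
        })
    return warehouse_rows

customer_rows = integrate(SAVINGS_ROWS, LOANS_ROWS)
```

After integration every row has the same keys and the same unit, so amounts from dissimilar databases can be compared or aggregated directly.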
Data Integrated
• Data is stored once in a single integrated location
[Figure: customer data stored in several databases (Savings, Current Accounts, Personal Loans, each with its own application) is integrated into the data warehouse under Subject = Customer, with no application flavor]
Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer
than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”
Time-variant
[Figure: warehouse data carries a time element: key, version, and date timestamp]
Time Variant
• In an operational application system, the expectation
is that all data within the database are accurate as of
the moment of access. In the DW, data are simply
assumed to be accurate as of some moment in time
and not necessarily right now.
• One of the places where DW data display time
variance is in the structure of the record key. Every
primary key contained within the DW must contain,
either implicitly or explicitly, an element of time (day,
week, month, etc.).
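A key that carries an explicit time element can be sketched as follows. The table contents, the `(customer_id, snapshot_date)` key layout, and the `balance_as_of` helper are hypothetical and only illustrate the idea.

```python
from datetime import date

# Hypothetical warehouse table: the key is (customer_id, snapshot_date).
balances = {
    ("C1", date(2023, 1, 31)): 500,
    ("C1", date(2023, 2, 28)): 650,
    ("C1", date(2023, 3, 31)): 400,
}

def balance_as_of(table, customer_id, when):
    """Return the latest snapshot taken on or before `when`, or None."""
    candidates = [snap for (cust, snap) in table
                  if cust == customer_id and snap <= when]
    if not candidates:
        return None
    return table[(customer_id, max(candidates))]

# The answer is accurate as of the end-of-February snapshot, not "right now".
mid_march = balance_as_of(balances, "C1", date(2023, 3, 15))
before_history = balance_as_of(balances, "C1", date(2022, 12, 31))
```

Because the date is part of the key, several values for the same customer coexist, and a query picks the one valid for the moment of interest.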
Time Variant
• Every piece of data contained within the
warehouse must be associated with a
particular point in time if any useful analysis is
to be conducted with it.
• Another aspect of time variance in DW data is
that, once recorded, data within the
warehouse cannot be updated or changed.
Nonvolatility
• Typical activities such as deletes, inserts, and
changes that are performed in an operational
application environment are completely
nonexistent in a DW environment.
• Only two data operations are ever performed
in the DW: data loading and data access
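The two permitted operations can be sketched as a small store that exposes loading and access and nothing else. `ReadOnlyWarehouse` and its method names are hypothetical; a real warehouse would enforce this at the database level.

```python
class ReadOnlyWarehouse:
    """Hypothetical store exposing only data loading and data access."""
    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Data loading: append new rows; existing rows are never modified."""
        self._rows.extend(dict(row) for row in rows)

    def access(self, predicate=lambda row: True):
        """Data access: return copies so callers cannot alter stored rows."""
        return [dict(row) for row in self._rows if predicate(row)]

dw = ReadOnlyWarehouse()
dw.load([{"customer_id": "C1", "month": "2023-01", "balance": 500}])
dw.load([{"customer_id": "C1", "month": "2023-02", "balance": 650}])

result = dw.access(lambda row: row["customer_id"] == "C1")
result[0]["balance"] = 0  # changes only the caller's copy
stored = dw.access()
```

A new balance arrives as a new row for a later month; the earlier row stays as it was loaded.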
Non-volatile
• Existing data in the warehouse is not overwritten or
updated.
[Figure: internal and external source systems run create, update, and delete transactions; their data is loaded into the data warehouse, which is read-only for business users and applications]
Nonvolatility
• Application: the design issues must focus on data integrity and update anomalies. Complex processes must be coded to ensure that the data update processes allow for high integrity of the final product.
• DW: such issues are of no concern in a DW environment because data update is never performed.