
February 22, 2009 ICS426: Introduction 1

DATA WAREHOUSING
AND
DATA MINING
Course Overview
Introduction
Data Preprocessing
DW and OLAP
Data Mining
Motivation
Data flood
Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
There is a tremendous increase in the amount of data recorded
and stored on digital media
We are producing over two exabytes (1 exabyte = 10^18 bytes) of data per year
Storage capacity, for a fixed price, appears to be doubling
approximately every 9 months
Data stored in the world's databases doubles every 20 months
Other growth rate estimates even higher
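The doubling times quoted above can be converted into yearly growth factors. The following is a minimal sketch, assuming steady exponential growth; the function name `annual_growth_factor` is illustrative and not part of the course material.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Yearly multiplication factor for a quantity that doubles every `doubling_months` months."""
    return 2 ** (12 / doubling_months)

# Doubling times are taken from the slide; steady exponential growth is an assumption.
storage = annual_growth_factor(9)    # storage capacity for a fixed price
stored = annual_growth_factor(20)    # data stored in the world's databases

print(f"storage capacity: x{storage:.2f} per year")
print(f"stored data:      x{stored:.2f} per year")
```

Under these assumptions, a 9-month doubling time corresponds to roughly a 2.5-fold increase per year, and a 20-month doubling time to roughly a 1.5-fold increase.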
Data, Data everywhere - yet ...
I can't find the data I need
data is scattered over the network
many versions, subtle differences
I can't get the data I need
need an expert to get the data
I can't understand the data I found
available data poorly documented
I can't use the data I found
results are unexpected
data needs to be transformed from one
form to another
Motivation
Very little data will ever be looked at by a human. We are drowning
in data, but starving for knowledge! The greatest problem of today
is how to teach people to ignore the irrelevant, how to refuse to
know things, before they are suffocated. For too many facts are as
bad as none at all.
Knowledge Discovery is NEEDED to make sense and use of data.
Solution: Data warehousing and data mining
Data warehousing and On-Line Analytical Processing (OLAP)
Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Knowledge Discovery (KDD) Process
KDD Process: Several Key Steps
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Data mining
summarization, classification, regression, association, clustering
Pattern evaluation and knowledge presentation
Use of discovered knowledge
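The key steps above can be sketched as a chain of small functions. This is only an illustration on made-up records: the function names (`select`, `clean`, `transform`, `mine`), the field names, and the threshold of 100 are all assumptions, and the mining step is reduced to a simple summarization.

```python
from collections import Counter

# Toy records standing in for a raw source; all values are invented.
raw = [
    {"customer": "a", "region": "north", "amount": "120"},
    {"customer": "b", "region": "south", "amount": ""},      # missing value
    {"customer": "c", "region": "north", "amount": "80"},
    {"customer": "d", "region": "east", "amount": "200"},
    {"customer": "e", "region": "north", "amount": "abc"},   # noisy value
]

def select(records, region):
    """Data selection: keep only the target subset."""
    return [r for r in records if r["region"] == region]

def clean(records):
    """Data cleaning: drop records whose amount is missing or not numeric."""
    return [r for r in records if r["amount"].isdigit()]

def transform(records):
    """Transformation: convert amounts to integers and bucket them."""
    return [
        {**r, "amount": int(r["amount"]),
         "bucket": "high" if int(r["amount"]) >= 100 else "low"}
        for r in records
    ]

def mine(records):
    """Data mining (here only summarization): count records per bucket."""
    return Counter(r["bucket"] for r in records)

summary = mine(transform(clean(select(raw, "north"))))
print(dict(summary))
```

Pattern evaluation and use of the discovered knowledge are left out; in practice they depend on the application domain identified in the first step.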
What is a Data Warehouse?
A single, complete and consistent
store of data obtained from a variety
of different sources made available to
end users in a way they can
understand and use in a business
context.
[Barry Devlin]
What are the users saying...
Data should be integrated across the
enterprise
Summary data has a real value to the
organization
Historical data holds the key to
understanding data over time
What-if capabilities are required
What is Data Warehousing?
A process of transforming data into
information and making it available
to users in a timely enough manner
to make a difference
[Forrester Research, April 1996]
[Figure: Data -> Information]
Evolution
60s: Batch reports
hard to find and analyze information
inflexible and expensive, reprogram every new request
70s: Terminal-based DSS and EIS (executive information systems)
still inflexible, not integrated with desktop tools
80s: Desktop data access and analysis tools
query tools, spreadsheets, GUIs
easier to use, but only access operational databases
90s: Data warehousing with integrated OLAP engines and tools
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information
systems
Very Large Data Bases
Walmart -- 24 Terabytes
Geographic Information Systems
National Medical Records
Weather images
Intelligence Agency Videos
Terabytes -- 10^12 bytes:
Petabytes -- 10^15 bytes:
Exabytes -- 10^18 bytes:
Zettabytes -- 10^21 bytes:
Yottabytes -- 10^24 bytes:
Data Warehousing -- It is a process
Technique for assembling and managing
data from various sources for the purpose
of answering business questions, thus
enabling decisions that were not previously
possible
A decision support database maintained
separately from the organization's
operational database
February 22, 2009 ICS426: Introduction 14
Data Warehouse
A data warehouse is a
subject-oriented
integrated
time-varying
non-volatile
collection of data that is used primarily in organizational decision
making.
-- Bill Inmon, Building the Data Warehouse, 1996
Data Warehouse Architecture
[Figure: data warehouse architecture. Sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems. Components: Extraction, Cleansing, Optimized Loader, Data Warehouse Engine, Metadata Repository. Uses: Analyze, Query.]
Data Warehouse for Decision Support & OLAP
Putting information technology to work to help the knowledge worker
make faster and better decisions
Which of my customers are most likely to go to the
competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with
profits over the last 10 years?
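Questions like these are typically answered by aggregating facts along a chosen dimension. The sketch below is a rough illustration on invented rows: the `sales` data and the helper `roll_up` are hypothetical, and summing revenue per promotion only approximates "impact", since it ignores what would have been sold without the promotion.

```python
from collections import defaultdict

# Hypothetical fact rows: (promotion, region, revenue). Values are invented.
sales = [
    ("coupon", "north", 500),
    ("coupon", "south", 300),
    ("bundle", "north", 900),
    ("none",   "north", 400),
    ("bundle", "south", 200),
]

def roll_up(rows, key_index):
    """Sum revenue grouped by one dimension (a simple roll-up)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)

by_promotion = roll_up(sales, 0)   # revenue per promotion
by_region = roll_up(sales, 1)      # revenue per region
best = max(by_promotion, key=by_promotion.get)
print(by_promotion, by_region, best)
```

The same rows answer different questions depending on the dimension chosen for grouping, which is the idea behind OLAP-style analysis.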
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and
make judgements
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise
with a memory
Data Mining provides the Enterprise
with intelligence
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection
Which types of transactions are likely to be fraudulent, given
the demographics and transactional history of a particular
customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and
which are most likely to leave for a competitor?
Data Mining helps extract such information
Why DM: A producer wants to know.
Which are our
lowest/highest margin
customers?
Who are my customers
and what products
are they buying?
Which customers
are most likely to go
to the competition?
What impact will
new products/services
have on revenue
and margins?
What product promotions
have the biggest
impact on revenue?
What is the most
effective distribution
channel?
What is Data Mining?
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc
Many Definitions
Non-trivial extraction of implicit, previously unknown and
potentially useful information from huge amounts of data
Exploration & analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover meaningful
patterns
Data Mining: Confluence of Multiple Disciplines
20x20 ~ 2^400 ≈ 10^120 patterns
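The estimate above can be checked with a few lines of arithmetic. This sketch assumes the slide refers to a 20x20 grid in which each cell takes one of two values; that reading is an interpretation, since the accompanying figure is not available.

```python
import math

cells = 20 * 20                      # assumed: a 20x20 grid of binary cells
patterns = 2 ** cells                # each cell is either on or off
exponent = cells * math.log10(2)     # log10 of the number of patterns

print(f"2^{cells} is about 10^{exponent:.1f}")
print(len(str(patterns)), "decimal digits")
```

Since log10(2) is about 0.301, 400 binary cells give an exponent slightly above 120, which matches the order of magnitude on the slide.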
February 22, 2009 ICS426: Introduction 23
Some basic operations
Predictive:
Regression
Classification
Collaborative Filtering
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
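As one concrete instance of the descriptive operations listed above, the sketch below computes the support and confidence of a single association rule. The transactions and item names are invented, and the helper names `support` and `confidence` are illustrative rather than taken from the course.

```python
# Toy market-basket transactions; the items are invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here the rule bread -> milk holds in three of the five baskets, and in three of the four baskets that contain bread.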
Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
identify fraudulent events in an online stream of events
Manufacturing and production:
automatically adjust knobs when process parameters change
Applications
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationships between
diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
identify new galaxies by searching for sub clusters
Web site/store design and promotion:
find affinity of visitor to pages and modify layout
The course
[Figure: course road map, parts numbered (2) to (7). Labels: DS (three sources), DW, DP, OLAP, DM; under DM: Association, Classification, Clustering]
DS = Data source
DW = Data warehouse
DM = Data Mining
DP = Data processing
END
