
LESSON 1

LESSON MAP
• Learning outcome
• Introduction
– Why Data Warehouse?
– Definition
• Characteristics
– Subject oriented
– Integrated
– Time-variant
– Nonvolatile
• Architecture
• OLAP
• Data mining
• Summary
• Next lesson

Learning outcome
• Describe the purpose of a data warehouse.
• Describe the characteristics of a data warehouse.
• Explain the relationship between a data warehouse
and an operational database.
• Explain the architecture of a data warehouse.
• Explain the technologies related to a data warehouse.



Introduction
Why Data Warehouse?

Operational database
• Supports day-to-day business operations
• Stores operational data
• Competitive advantage through efficient and cost-effective services to the customer
• Typical question: How much revenue was generated by each customer during this year?

Data warehouse
• Stores business data
• Data are structured to help decision makers in making strategic decisions
• Typical questions: What was the revenue for each quarter of this year by geographic region and customer? What are expected sales by region next year?
Introduction
Definition
A Data Warehouse is ...
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decisions."
Inmon, W.H., Building the Data Warehouse. Wellesley, MA: QED Tech. Pub. Group, 1992.

Data → prepared; organized; presented



CHARACTERISTICS

… subject-oriented ...
• The data in the warehouse is defined and
organized in business terms, and is grouped
under business-oriented subject headings,
such as
– customers
– products
– sales
rather than application-oriented data, such as
– Savings Account System
– Investment Account System
– Checking Account System
CHARACTERISTICS
… integrated ...
• The data warehouse contents are defined such that
they are valid across the enterprise and its operational
and external data sources

[Figure: operational systems and the data warehouse]
• The data in the warehouse should be
– clean
– validated
– properly integrated
CHARACTERISTICS
… time-variant ...
• All data in the data warehouse is time-stamped at time of entry into the warehouse or when it is summarized within the warehouse.
• This chronological recording of data makes historical and trend analysis possible.
• In contrast, operational data is overwritten, since past values are not of interest.
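
The contrast between overwritten operational data and time-stamped warehouse data can be sketched in a few lines of Python. This is only an illustration: the `record_balance` helper, the customer balances, and the dates are all made up for the example and do not come from the slides.

```python
from datetime import date

# Operational store: one current value per key; updates overwrite history.
operational_balance = {}

# Warehouse store: append-only rows, each stamped with its load date.
warehouse_balance = []


def record_balance(customer, balance, load_date):
    """Apply one change to both stores (hypothetical helper)."""
    operational_balance[customer] = balance  # past value is lost
    warehouse_balance.append((customer, balance, load_date))  # history kept


# Made-up values for illustration only.
record_balance("Jane Doe", 500, date(1995, 1, 1))
record_balance("Jane Doe", 350, date(1995, 2, 1))
record_balance("Jane Doe", 420, date(1995, 3, 1))

# The operational view only knows the latest value.
current = operational_balance["Jane Doe"]

# The warehouse can answer trend questions such as month-over-month change.
history = [b for c, b, _ in warehouse_balance if c == "Jane Doe"]
changes = [later - earlier for earlier, later in zip(history, history[1:])]
```

Because rows are only appended to the warehouse list and never changed, the same sketch also hints at why a warehouse is a stable base for comparative analysis.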



CHARACTERISTICS
… nonvolatile ...
• Once loaded into the data warehouse, the
data is not updated.
• Data acts as a stable resource for
consistent reporting and comparative
analysis.
• In contrast, operational data is updated (inserted, deleted, modified).
[Figure: operational data is changed through insert, change, and replace; the data warehouse is only loaded and accessed]



An Example of Data Integration

Operational data
• Checking Account System
– Jane Doe (name)
– Female (gender)
– Bounced check #145 on 1/5/95
– Opened account 1994
• Savings Account System
– Jane Doe
– F (gender)
– Opened account 1992
• Investment Account System
– Jane Doe
– Owns 25 Shares Exxon
– Opened account 1995

Data warehouse
• Customer
– Jane Doe
– Female
– Bounced check #145
– Married
– Owns 25 Shares Exxon
– Customer since 1992
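
A minimal sketch of how the three account records might be merged into one customer view. The dictionary layout, the `GENDER_CODES` mapping, and the `integrate` function are assumptions made for illustration; the slide does not prescribe how integration is implemented.

```python
# Records as they might appear in three separate operational systems.
checking = {"name": "Jane Doe", "gender": "Female", "opened": 1994,
            "events": ["Bounced check #145"]}
savings = {"name": "Jane Doe", "gender": "F", "opened": 1992}
investment = {"name": "Jane Doe", "opened": 1995,
              "holdings": ["Owns 25 Shares Exxon"]}

# Hypothetical mapping that brings gender codes to one representation.
GENDER_CODES = {"F": "Female", "Female": "Female", "M": "Male", "Male": "Male"}


def integrate(records):
    """Merge per-system records for one customer into a single view."""
    customer = {"name": records[0]["name"], "events": [], "holdings": []}
    for rec in records:
        if "gender" in rec:
            customer["gender"] = GENDER_CODES[rec["gender"]]
        customer["events"].extend(rec.get("events", []))
        customer["holdings"].extend(rec.get("holdings", []))
    # "Customer since" is the earliest account opening across systems.
    customer["customer_since"] = min(rec["opened"] for rec in records)
    return customer


jane = integrate([checking, savings, investment])
```

The point of the sketch is that "Female" and "F" end up as one value and that "Customer since 1992" is derived from the earliest account, not copied from any single system.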
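An Architecture for Data Warehousing
[Figure: operational databases and external sources are loaded through extraction, cleaning, validation, and summarization into the data warehouse, alongside metadata and a data mart. The warehouse data is used by OLAP, data mining, and query tools on behalf of users (USER1, USER2, USER3).]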
On-Line Analytical Processing
(OLAP)
• Term introduced by E.F. Codd (1993) in
contrast to On-Line Transaction
Processing (OLTP)
• The OLAP Council’s definition:
“A category of software technology that
enables analysts, managers and executives
to gain insight into data through fast,
consistent, interactive access to a wide
variety of possible views of information that
have been transformed from raw data to
reflect the real dimensionality of the
enterprise as understood by the user”
On-Line Analytical Processing
(OLAP)
• Basic idea: users should be able to
manipulate enterprise data models
across many dimensions to understand
changes that are occurring.
• Data used in OLAP should be in the
form of a multi-dimensional cube.
[Figure: multi-dimensional cube with Market and Product among its dimensions]
Dimensional Hierarchies
• Each dimension can be hierarchically structured
– Product: Type of product → Product → Item
– Time: Year → Month → Week → Day
– Market: Country → State → City → Store



OLAP Operations
• Rollup: decreasing the level of detail
• Drill-down: increasing the level of detail
[time: year → month → week → day]
• Slice-and-dice: selection and projection
sales → [product, market, time]
• Pivot: re-orienting the multidimensional view of data
sales → [market, product, time]
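
The four operations can be sketched on a tiny in-memory cube. This is an illustration in plain Python, not how an OLAP server works internally; the `facts` rows, the `aggregate` helper, and all sales figures are made up.

```python
from collections import defaultdict

# Each fact: (product, market, (year, month), sales). Values are made up.
facts = [
    ("shirt", "Ohio", (1995, 1), 100),
    ("shirt", "Ohio", (1995, 2), 120),
    ("shirt", "Texas", (1995, 1), 80),
    ("shoes", "Ohio", (1995, 1), 200),
    ("shoes", "Texas", (1995, 2), 150),
]


def aggregate(rows, key):
    """Sum sales grouped by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Drill-down level: sales by product and month.
by_product_month = aggregate(facts, lambda r: (r[0], r[2]))

# Rollup: less detail, months collapse into years.
by_product_year = aggregate(facts, lambda r: (r[0], r[2][0]))

# Slice: select a single market; projection: keep product totals only.
ohio = [r for r in facts if r[1] == "Ohio"]
ohio_by_product = aggregate(ohio, lambda r: r[0])

# Pivot: same cells, with market as the leading axis instead of product.
by_product_market = aggregate(facts, lambda r: (r[0], r[1]))
by_market_product = {(m, p): v for (p, m), v in by_product_market.items()}
```

Note that the pivot changes no numbers: it only swaps the order of the axes used to address each cell.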
Implementing Multidimensionality
• Multi-dimensional databases (MDDB)
• To make relational databases handle multidimensionality, two kinds of tables are introduced:
– Fact table: contains the numerical facts, together with a key for each dimension. It is long and thin.
– Dimension tables: describe the dimensions and are linked to the fact table through keys. They show where the information can be found. A separate table is provided for each dimension. Dimension tables are small, short, and wide.
Star Schema
• Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
• Market Dimension: STORE KEY, Store Desc., City, State, District ID, District Desc., Region ID, Region Desc., Regional Mgr., Level
• Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
• Time Dimension: PERIOD KEY, Period Desc., Year, Quarter, Month, Day
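
A cut-down version of this star schema can be tried with Python's built-in `sqlite3` module. The sketch below is an assumption-laden illustration: the table names (`market`, `product`, `period`, `sales_fact`), the reduced column set, and every inserted row are invented for the example. The query answers a warehouse-style question, revenue by state and quarter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: short and wide (only a few columns kept here).
CREATE TABLE market  (store_key INTEGER PRIMARY KEY, store_desc TEXT, state TEXT);
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT);
CREATE TABLE period  (period_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

-- Fact table: long and thin, one key per dimension plus numeric facts.
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES market(store_key),
    product_key INTEGER REFERENCES product(product_key),
    period_key  INTEGER REFERENCES period(period_key),
    dollars REAL, units INTEGER
);

-- Made-up sample rows.
INSERT INTO market  VALUES (1, 'Store A', 'Ohio'), (2, 'Store B', 'Texas');
INSERT INTO product VALUES (1, 'Shirt', 'Acme'), (2, 'Shoes', 'Acme');
INSERT INTO period  VALUES (1, 1995, 1), (2, 1995, 2);
INSERT INTO sales_fact VALUES
    (1, 1, 1, 100.0, 10), (1, 2, 1, 200.0, 5),
    (2, 1, 1,  80.0,  8), (2, 2, 2, 150.0, 3);
""")

# Join the fact table to two dimensions and summarize along them.
rows = conn.execute("""
    SELECT m.state, t.quarter, SUM(f.dollars)
    FROM sales_fact f
    JOIN market m ON m.store_key  = f.store_key
    JOIN period t ON t.period_key = f.period_key
    GROUP BY m.state, t.quarter
    ORDER BY m.state, t.quarter
""").fetchall()
```

Every analytical query follows the same shape: start from the fact table, join the dimensions of interest, and group by their attributes.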
MOLAP, ROLAP, DSS
• The OLAP technology is considered an
extension of the original DSS technology.
• DSS applications are tools that access and
analyze data in relational database (RDB)
tables.
• OLAP tools access and analyze
multidimensional data (typically three, up to
ten-dimensional data).
• OLAP technology is called MOLAP/ROLAP
(multidimensional/relational OLAP) if it uses an
MDDB/RDB.
OLAP/DSS
• OLAP tools focus on providing multi-dimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions.
• OLAP tools require strong interaction
from the users to identify interesting
patterns in data.
• An OLAP tool evaluates a precise query
that the user formulates.
• OLAP users are “farmers”.
OLAP Tools
ROLAP
Microsoft Analysis Services (Microsoft), MicroStrategy 8 (MicroStrategy), BusinessObjects XI (Business Objects, a French company), and Oracle BI (formerly Siebel Analytics). There is also an open source ROLAP server, Mondrian.
MOLAP
Essbase, MIS Alea (Systems Union), and TM1 (Applix Inc.). There is also an open source MOLAP server, Palo.
HOLAP
Microsoft Analysis Services, MicroStrategy, and SAP AG BI Accelerator.
Mondrian
• Mondrian is an open source OLAP (online analytical processing) server written in the Java programming language. It supports the MDX (multidimensional expressions) query language and the XML for Analysis and JOLAP specifications. It reads from SQL and other data sources and aggregates data in a memory cache.

• Mondrian is used for:
– High performance, interactive analysis of large or small volumes of information
– "Dimensional" exploration of data, for example analyzing sales by product line, by region, by time period
– Translating Multi-Dimensional eXpression (MDX) queries into Structured Query Language (SQL) to retrieve answers to dimensional queries
– High-speed queries through the use of aggregate tables in the RDBMS
– Advanced calculations using the calculation expressions of the MDX language

http://www.cs.brown.edu/courses/cs227/Papers/Visualization/Choong.pdf
DBMS for Warehouse
• Multidimensional DBMS: Essbase, UniVerse
• Relational DBMS: Oracle, SQL Server, DB2, MySQL, PostgreSQL, Firebird
DATA MINING
Definition: The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.

Data mining assists business analysts with finding patterns and relationships in the data; it does not tell you the value of the patterns to the organization.

Data mining applications


• Marketing
– Identifying buying patterns of customers
– Predicting response to mailing campaigns
• Banking
– Identifying loyal customers
– Determining credit card spending by customer groups
• Insurance
– Claims analysis
– Predicting which customers will buy new policies
Data mining operations and associated techniques

Operation: data mining techniques
• Predictive modeling: classification, value prediction
• Database segmentation: clustering
• Link analysis: association discovery, sequential pattern discovery
• Deviation detection: statistics, visualization
Classification
Classification is a data mining (machine learning) technique
used to predict group membership for data instances. For example,
you may wish to use classification to predict whether the weather
on a particular day will be “sunny”, “rainy” or “cloudy”.
Popular classification techniques include decision trees and neural
networks.
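
As a small illustration of predicting group membership, the sketch below uses a 1-nearest-neighbour rule. This is a deliberately simpler technique than the decision trees and neural networks named above, chosen only because it fits in a few lines. The features (humidity and cloud cover), the training readings, and the function name are all made up.

```python
def nearest_neighbor(training, sample):
    """Predict the label of the closest training example (1-NN)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda ex: squared_distance(ex[0], sample))
    return label


# Made-up (humidity %, cloud cover %) readings with known weather.
training = [
    ((30, 10), "sunny"),
    ((35, 20), "sunny"),
    ((60, 80), "cloudy"),
    ((65, 70), "cloudy"),
    ((90, 95), "rainy"),
    ((85, 90), "rainy"),
]

prediction = nearest_neighbor(training, (87, 92))
```

The labels ("sunny", "cloudy", "rainy") are known in advance, which is what separates classification from clustering.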
Clustering
• Clustering is a data mining (machine learning) technique
used to place data elements into related groups without
advance knowledge of the group definitions.
• Clustering divides a database into different groups. The goal
of clustering is to find groups that are very different from
each other, and whose members are very similar to each
other.
Example: splitting the database by age groupings
customer → old age group, young age group
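
The age-grouping example can be sketched with a plain k-means loop on one-dimensional data. This is one possible clustering method among many, and the `kmeans_1d` function, its deterministic starting centers, and the list of ages are assumptions for illustration.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Plain k-means on numbers (requires k >= 2); returns (centers, groups)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


ages = [22, 25, 27, 31, 58, 61, 64, 70]  # made-up customer ages
centers, groups = kmeans_1d(ages)
```

No "young" or "old" label is supplied; the two groups emerge from the data alone.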

Association discovery
Associations are occurrences that are linked to a single event.
Example: in a supermarket, customers who purchase beer also buy peanuts 55% of the time.
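
A percentage like the one in the beer and peanuts example is usually called the confidence of a rule. The sketch below computes it from a handful of made-up shopping baskets; the `confidence` function and the basket contents are illustrative assumptions, and the result is not meant to reproduce the 55% figure.

```python
# Made-up shopping baskets, one set of items per transaction.
baskets = [
    {"beer", "peanuts"},
    {"beer", "peanuts", "chips"},
    {"beer", "bread"},
    {"beer", "milk"},
    {"milk", "bread"},
]


def confidence(transactions, antecedent, consequent):
    """Share of transactions containing antecedent that also hold consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = [t for t in with_antecedent if consequent <= t]
    return len(both) / len(with_antecedent)


beer_to_peanuts = confidence(baskets, {"beer"}, {"peanuts"})
```

Here four baskets contain beer and two of those also contain peanuts, so the rule "beer, then peanuts" holds half of the time.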

Sequential pattern discovery
Sequential patterns occur where events are linked over time, that is, one event leads to another later event.
Example: 65% of the time, the purchase of a house is followed by a purchase of curtains within two months.
Value prediction - Forecasting
• Used to discover patterns in the data that can lead to predictions about the future. Example: projection of sales in the next 12 months
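
One simple way to project sales forward is to fit a straight line to past values and extend it. This is only a sketch of that idea under the assumption of a linear trend; the `fit_line` function and the monthly sales history are made up, and real forecasting tools use richer models.

```python
def fit_line(ys):
    """Least-squares line through points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


monthly_sales = [100, 110, 120, 130, 140, 150]  # made-up history
slope, intercept = fit_line(monthly_sales)

# Project the next 12 months by extending the fitted line.
forecast = [intercept + slope * m
            for m in range(len(monthly_sales), len(monthly_sales) + 12)]
```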
Deviation Detection
Deviation detection is often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation or norm.
Example: quality control
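
For the quality control example, a basic statistical check flags measurements that lie far from the mean. The sketch below uses a threshold of two sample standard deviations, which is an arbitrary choice for illustration; the `outliers` function and the part diameters are made up.

```python
from statistics import mean, stdev


def outliers(values, threshold=2.0):
    """Return values more than `threshold` sample std devs from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]


# Made-up part diameters (mm) from a production line.
diameters = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 12.5]
flagged = outliers(diameters)
```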
Data mining tools
• Enterprise Miner (SAS Institute): decision trees, association, linear model, time series
• Intelligent Miner (IBM): decision trees, association, linear model, time series, sequential
• Scenario (Cognos): decision trees
Data Stream Mining
Characteristics of Data Streams
Data Streams vs DBMS
• Data streams: continuous, ordered, changing, fast, huge amount
• Traditional DBMS: data stored in finite, persistent data sets

Characteristics
• Huge volumes of continuous data, possibly infinite
• Fast changing, requiring fast, real-time response
• Data streams capture many of today's data processing needs
• Random access is expensive
– Single-scan algorithms (can only have one look)
– Store only the summary of the data seen thus far
• Most stream data are low-level or multi-dimensional in nature and need multi-level and multi-dimensional processing
Goal: Mine patterns, process queries, and compute statistics on data streams in real time
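
The single-scan, summary-only constraint can be illustrated with a running mean and variance that looks at each value exactly once and keeps constant memory. The sketch uses Welford's method; the `RunningStats` class, the `sensor_stream` generator, and the readings are assumptions made for the example.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method), O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def sensor_stream():
    """Stand-in for an unbounded source; each reading is seen only once."""
    for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield reading


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)
```

Only three numbers are stored no matter how long the stream runs, which is the kind of summary the characteristics above call for.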
Applications of Stream Data Mining

What are the Applications?

• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams, RFIDs
• Security monitoring
• Web logs and Web page click stream
Useful Websites
• Data warehouse
– http://en.wikipedia.org/wiki/Data_warehouse
– http://www.dwinfocenter.org/
– http://www.datawarehousingonline.com/
– http://www.1keydata.com/datawarehousing/concepts.html
– http://www.1keydata.com/datawarehousing/datawarehouse.html
– http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96520/concept.htm
Useful Websites
• Data mining
– http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
– http://www.autonlab.org/tutorials/
– http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
– http://www.thearling.com/text/dmwhite/dmwhite.htm
– http://www.the-data-mine.com/bin/view/Misc/DataMiningTutorials
Useful Websites
• Data stream mining
– http://www.csse.monash.edu.au/~mgaber/WResources.htm
– http://www.csse.monash.edu.au/~mgaber/CameraReadyPAKDD.pdf
– http://www.sigmod.org/sigmod/record/issues/0506/p18-survey-gaber.pdf
– http://en.wikipedia.org/wiki/Data_stream_mining
– http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.innovation.html
– http://www.public.asu.edu/~huanliu/CFP/CFPMiningStreamData.html
– http://citeseer.ist.psu.edu/640620.html
Summary
• Introduction to data warehouse
• Characteristics of a data warehouse
• Data warehouse architecture
• OLAP and data mining
Next lesson

Data warehouse and data mining tools



Words of encouragement
