DATA WAREHOUSING AND DATA MINING

M. Mageshwari, Lecturer, M.S.P.V.L Polytechnic College

Course Overview
The course: what and how
0. Introduction
I. Data Warehousing
II. Decision Support and OLAP
III. Data Mining
IV. Looking Ahead

Demos and Labs

0. Introduction
Data Warehousing, OLAP and data mining: what and why (now)?
Relation to OLTP
A case study
Demos, labs

A producer wants to know...

Which are our lowest/highest margin customers?
What is the most effective distribution channel?
Who are my customers and what products are they buying?
What product promotions have the biggest impact on revenue?
What impact will new products/services have on revenue and margins?
Which customers are most likely to go to the competition?

Data, Data everywhere yet ...
I can't find the data I need
  data is scattered over the network
  many versions, subtle differences
I can't get the data I need
  need an expert to get the data
I can't understand the data I found
  available data poorly documented
I can't use the data I found
  results are unexpected
  data needs to be transformed from one form to another

What is a Data Warehouse?


A single, complete and consistent store of data obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.

What are the users saying...


Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required

What is Data Warehousing?


A process of transforming data into information and making it available to users in a timely enough manner to make a difference.

Data -> Information

Evolution
60s: Batch reports
  hard to find and analyze information
  inflexible and expensive, reprogram every new request

70s: Terminal-based DSS (Decision Support System) and EIS (Executive Information System)
  still inflexible, not integrated with desktop tools

Data Warehouse Structure


base customer (1985-87)
  custid, from date, to date, name, phone, dob
base customer (1988-90)
  custid, from date, to date, name, credit rating, employer
customer activity (1986-89) -- monthly summary
customer activity detail (1987-89)
  custid, activity date, amount, clerk id, order no
customer activity detail (1990-91)
  custid, activity date, amount, line item no, order no

Time is part of the key of each table.

Definition of DSS
A decision support system (DSS) is a system that helps decision makers at various levels to make decisions. It uses data, analytical models and user-friendly software to support decision making.

Definition of EIS
An executive information system (EIS) is a system that helps high-level executives to make policy decisions. It uses higher-level data, analytical models and user-friendly software to support those decisions.

Evolution
80s: Desktop data access and analysis tools
  query tools, spreadsheets, GUIs
  easier to use, but only access operational databases

90s: Data warehousing with integrated OLAP (online analytical processing) engines and tools

Data Warehousing - It is a process

A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible.
A decision support database maintained separately from the organization's operational database.

Characteristics of Data Warehouse


A data warehouse is a
  subject-oriented
  integrated
  time-varying
  non-volatile
collection of data that is used primarily in organizational decision making.

Subject-oriented
A data warehouse is organized around the major subjects of the organization, such as customer, supplier, product and sales. It provides a simple and concise view of a particular subject by excluding data that are not useful to the decision support process.

Integrated
A data warehouse is constructed by integrating multiple sources of data, such as relational databases, flat files and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attributes, etc.

Time Variant
A data warehouse maintains records of both historical and current data, so it can provide information from a historical perspective.

Non Volatile

Once the data warehouse is loaded with data, the stored data is not modified in normal operation. The warehouse is used for loading and for read access, not for transactional updates.

Explorers, Farmers and Tourists


Tourists: Browse information harvested by farmers
Farmers: Harvest information from known access paths
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

Application-Orientation vs. Subject-Orientation


Application-Orientation (Operational Database)
  Loans, Credit Card, Trust, Savings

Subject-Orientation (Data Warehouse)
  Customer, Vendor, Product, Activity

Functioning of Data warehousing

Data Source -> Cleaning -> Transformation -> Data Warehouse
New Update -> Data Warehouse

Collection of data
Data warehousing collects data from various data sources such as relational databases, flat files and on-line records. The collected data are stored in a database inside the warehouse. The type of data collection used depends on the architecture of the warehouse.

Integration
Each data source uses a different schema. The data warehouse gets data from different sources with different schemas and converts the data into a common integrated schema.
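The conversion into a common schema can be sketched as a field-renaming step. This is only an illustration: the source names, field names and the FIELD_MAPS table are hypothetical and do not come from the course material.

```python
# Hypothetical rows from two sources; each source names its fields differently.
savings_rows = [{"cust_no": 101, "open_dt": "1988-04-01"}]
loans_rows = [{"customer_id": 101, "start_date": "1990-07-15"}]

# Assumed mapping from each source's field names to the common warehouse schema.
FIELD_MAPS = {
    "savings": {"cust_no": "custid", "open_dt": "from_date"},
    "loans": {"customer_id": "custid", "start_date": "from_date"},
}

def to_common_schema(source, row):
    """Rename one source row's fields to the common schema's field names."""
    mapping = FIELD_MAPS[source]
    common = {mapping[field]: value for field, value in row.items()}
    common["source"] = source  # record which source the row came from
    return common

integrated = [to_common_schema("savings", r) for r in savings_rows]
integrated += [to_common_schema("loans", r) for r in loans_rows]
```

After this step every row uses the same field names (custid, from_date), whichever source it came from.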

Star Schema
A single fact table and, for each dimension, one dimension table
Does not capture hierarchies directly

Fact table: date, custno, prodno, cityname, ...
Dimension tables: time, cust, prod, city
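A star schema along these lines can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the non-key columns, the sample rows and the table name timedim (used in place of time) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust (custno INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE prod (prodno INTEGER PRIMARY KEY, prodname TEXT);
CREATE TABLE city (cityname TEXT PRIMARY KEY, state TEXT);
CREATE TABLE timedim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
-- One fact table whose keys point at the four dimension tables.
CREATE TABLE fact (
    date TEXT REFERENCES timedim(date),
    custno INTEGER REFERENCES cust(custno),
    prodno INTEGER REFERENCES prod(prodno),
    cityname TEXT REFERENCES city(cityname),
    sales REAL
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO cust VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
conn.executemany("INSERT INTO prod VALUES (?, ?)", [(10, "Juice"), (11, "Cola")])
conn.executemany("INSERT INTO city VALUES (?, ?)", [("Chennai", "TN"), ("Pune", "MH")])
conn.executemany("INSERT INTO timedim VALUES (?, ?, ?)",
                 [("2024-03-01", "2024-03", 2024), ("2024-04-01", "2024-04", 2024)])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?, ?)", [
    ("2024-03-01", 1, 10, "Chennai", 100.0),
    ("2024-03-01", 2, 11, "Pune", 50.0),
    ("2024-04-01", 1, 11, "Chennai", 70.0),
])

# A typical star query: join the fact table to one dimension and aggregate.
sales_by_product = conn.execute("""
    SELECT p.prodname, SUM(f.sales)
    FROM fact f JOIN prod p ON f.prodno = p.prodno
    GROUP BY p.prodname
    ORDER BY p.prodname
""").fetchall()
```

Each analytical query needs only one join per dimension it uses, which is the main attraction of the star layout.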

Snowflake schema
Represents the dimensional hierarchy directly by normalizing tables
Easy to maintain and saves storage

Fact table: date, custno, prodno, cityname, ...
Dimension tables: time, cust, prod, city
Normalized hierarchy: city -> region
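The city -> region normalization can be sketched with sqlite3 as well. The column names beyond cityname, the country column and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The region attributes live in their own table instead of being repeated per city.
CREATE TABLE region (regionname TEXT PRIMARY KEY, country TEXT);
CREATE TABLE city (
    cityname TEXT PRIMARY KEY,
    regionname TEXT REFERENCES region(regionname)
);
CREATE TABLE fact (cityname TEXT REFERENCES city(cityname), sales REAL);

INSERT INTO region VALUES ('South', 'India'), ('West', 'India');
INSERT INTO city VALUES ('Chennai', 'South'), ('Madurai', 'South'), ('Pune', 'West');
INSERT INTO fact VALUES ('Chennai', 100.0), ('Madurai', 40.0), ('Pune', 50.0);
""")

# Summarizing by region follows the hierarchy through the city table.
sales_by_region = conn.execute("""
    SELECT c.regionname, SUM(f.sales)
    FROM fact f JOIN city c ON f.cityname = c.cityname
    GROUP BY c.regionname
    ORDER BY c.regionname
""").fetchall()
```

Compared with a star schema, the hierarchy is stored once, at the cost of extra joins when a query needs attributes from the outer tables.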

Data transformation and cleaning


The task of correcting and preparing the data is called data cleaning.

Data delivered by a data source should be corrected before it is stored in the database of the data warehouse.

Update of data
Updates on tables at the data sources must be sent to the data warehouse.

If the tables in the data warehouse are the same as those at the sources, the update is easy.

Summarizing data
The raw data generated by transactions may be too large to store online. Therefore, we can store summaries of the transactions for easy querying.
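As a rough illustration of summarizing, detailed activity rows can be reduced to one total per customer per month. The row layout and sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical detailed activity rows: (custid, activity date, amount).
activity_detail = [
    (1, "1989-01-05", 200.0),
    (1, "1989-01-20", 50.0),
    (1, "1989-02-02", 75.0),
    (2, "1989-01-11", 300.0),
]

def monthly_summary(rows):
    """Reduce detail rows to one total per (custid, month)."""
    totals = defaultdict(float)
    for custid, activity_date, amount in rows:
        month = activity_date[:7]  # "YYYY-MM" prefix of an ISO date string
        totals[(custid, month)] += amount
    return dict(totals)

summary = monthly_summary(activity_detail)
```

Four detail rows become three summary rows here; on real transaction volumes the reduction is much larger.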

Data Warehouse for Decision Support & OLAP


Putting information technology to work to help the knowledge worker make faster and better decisions
  Which of my customers are most likely to go to the competition?
  What product promotions have the biggest impact on revenue?
  How did the share price of software companies correlate with profits over the last 10 years?

Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgments

OLAP (Online Analytical Processing)

A data warehouse stores data, but OLAP transforms the data warehouse data into specific, meaningful information. OLAP therefore provides a user-friendly environment for interactive data analysis.

OLAP
User -> Front-end tool -> (request) -> OLAP server -> (SQL) -> Data warehouse
Data warehouse -> (result) -> OLAP server -> (result set) -> Front-end tool -> User

OLAP operations on multidimensional data

Roll-up (group: aggregate to a higher level)
Drill-down (less aggregation: move to more detail)
Slice and dice (select a piece of the data)
Pivot (rotate)
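The four operations can be sketched on a tiny in-memory cube. The cell values and the helper names (roll_up, slice_cube, dice, pivot) are illustrative assumptions, not the API of any OLAP product.

```python
from collections import defaultdict

# Hypothetical cube cells: one row per (product, region, month) combination.
cells = [
    {"product": "Juice", "region": "N", "month": 1, "sales": 10},
    {"product": "Juice", "region": "S", "month": 1, "sales": 20},
    {"product": "Cola", "region": "N", "month": 1, "sales": 30},
    {"product": "Cola", "region": "N", "month": 2, "sales": 40},
]

def roll_up(rows, dims):
    """Aggregate sales over every dimension that is not listed in dims."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)

def slice_cube(rows, dim, value):
    """Slice: fix a single dimension to a single value."""
    return [r for r in rows if r[dim] == value]

def dice(rows, **allowed):
    """Dice: keep cells whose value is in the allowed set for each named dimension."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

def pivot(table):
    """Pivot: swap the two axes of a two-dimensional roll-up result."""
    return {(b, a): total for (a, b), total in table.items()}

by_product = roll_up(cells, ["product"])                 # rolled up to product
by_product_month = roll_up(cells, ["product", "month"])  # drilled down to month
north = slice_cube(cells, "region", "N")
subcube = dice(cells, product={"Cola"}, month={1, 2})
rotated = pivot(by_product_month)
```

Drill-down is simply a roll-up on a longer list of dimensions, which is why one function serves both directions.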

TYPES OF OLAP
MOLAP (Multidimensional OLAP)
ROLAP (Relational OLAP)

Multi-dimensional Data
"Hey... I sold $100M worth of goods"
Dimensions: Product, Region, Time
Hierarchical summarization paths
  Product: Industry -> Category -> Product
  Region: Country -> Region -> City -> Office
  Time: Year -> Quarter -> Month or Week -> Day

[Figure: sales cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N) and Month (1 to 7)]

Data Warehouse Architecture


Sources: Relational Databases, ERP Systems, Purchased Data, Legacy Data
Extraction and Cleansing -> Optimized Loader -> Data Warehouse Engine
Data Warehouse Engine -> Analyze, Query
Metadata Repository

Architecture of data warehousing


Components shown in the architecture:
  External data
  Data Acquisition
  Data Manager
  Warehouse data
  Data Dictionary
  Data Access
  Information Directory
  Middleware
  Design
  Management


Design Component
The data warehouse designer designs the database of the data warehouse, and the warehouse administrator manages the data warehouse. Both use the design component to design and store data.

Types of design
Bottom-up design: business value can be returned as quickly as the first data marts can be created.
Top-down design: atomic data, that is, data at the lowest level of detail, are stored in the data warehouse.
Hybrid design: hybrid methodologies have evolved to take advantage of the fast turnaround time of bottom-up design and the enterprise-wide data consistency of top-down design.

Data Manager Component


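The database in the data warehouse uses the data manager component for managing and accessing the data stored in the data warehouse. The data manager may be:
  RDBMS (relational database management system)
  MDBMS (multidimensional database management system)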

Management Component
Administering data acquisition operations
Managing backup copies of the data
Recovering lost data
Providing security for the data stored in the data warehouse
Authorizing access to the data stored in the data warehouse

Data Acquisition Component


This component acquires data from various sources by using data acquisition applications. The data acquisition applications are based on rules that are defined by the data warehouse developers.

The operations performed during data clean-up

Restructuring the records and fields of the database tables
Removing irrelevant and redundant data
Obtaining and adding missing data
Verifying the integrity and consistency of the data
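A small sketch of these clean-up operations on customer rows follows. The field names, the default value for a missing phone number and the integrity rule (custid must be present) are assumptions chosen for illustration.

```python
# Hypothetical raw rows as delivered by a source.
raw_rows = [
    {"custid": 1, "name": " Asha ", "phone": "555-0101"},
    {"custid": 1, "name": " Asha ", "phone": "555-0101"},      # redundant duplicate
    {"custid": 2, "name": "Ravi", "phone": None},               # missing phone
    {"custid": None, "name": "Unknown", "phone": "555-0199"},   # fails integrity check
]

def clean(rows, default_phone="unknown"):
    """Return (cleaned rows, rejected rows)."""
    seen = set()
    cleaned, rejected = [], []
    for row in rows:
        if row["custid"] is None:      # verify integrity: the key must be present
            rejected.append(row)
            continue
        if row["custid"] in seen:      # remove redundant data
            continue
        seen.add(row["custid"])
        cleaned.append({               # restructure into a tidy record
            "custid": row["custid"],
            "name": row["name"].strip(),
            "phone": row["phone"] or default_phone,  # add missing data
        })
    return cleaned, rejected

cleaned_rows, rejected_rows = clean(raw_rows)
```

Keeping the rejected rows, rather than silently dropping them, lets someone go back to the source and obtain the missing data.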

The operations performed on the data for enhancement

Decoding and translating the values in fields
Summarizing data
Calculating derived values
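Decoding a field and calculating a derived value might look like the sketch below. The status codes and the margin formula (revenue minus cost) are assumptions for illustration.

```python
# Assumed code table for a coded status field.
STATUS_CODES = {"A": "active", "C": "closed"}

def enhance(row):
    """Decode the status code and add a derived margin value."""
    out = dict(row)
    out["status"] = STATUS_CODES.get(row["status"], "unknown")  # decode / translate
    out["margin"] = row["revenue"] - row["cost"]                # derived value
    return out

enhanced = enhance({"status": "A", "revenue": 120.0, "cost": 90.0})
```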

Information directory Component


This component helps end users to know the details of the data stored in the data warehouse. This is done with the help of metadata, that is, data about the data.
  Technical data
  Business data

Middleware Component
This component connects to the local databases.
An analytical server is used to analyze multidimensional data.
Intelligent data warehousing middleware controls access to the warehouse database.

Data mart
A data mart is a database that contains the data needed by a small group of users for their own department's needs.

Dependent data mart
Independent data mart

Difference between data warehouse and data mart

Data warehouse
  Supports the entire information requirement of an organization.
  Has a large data model, wider implementation, large data volume and a larger number of users.
  Suitable to support an entire corporate environment.

Data mart
  Supports the information requirement of a department in an organization.
  Has a small data model, shorter implementation, less data and fewer users.
  Useful for small organizations with very few departments.

If you listen to some vendors, you may be left thinking that building data warehouses is a waste of time. Data mart vendors that tell you this are looking out for their own best interests.

Advantages of data mart
Since each department has its own data mart, the department can summarize, sort, select, structure, etc. its own data. The data will not be confused with that of any other department.
The department can do whatever DSS processing it wants.
The processing and storage costs are less than those of the data warehouse.
The department can select software for its data mart that is powerful enough to fit its needs.

Data warehousing life cycle


Design -> Prototype -> Deploy -> Operate -> Enhance -> (back to Design)

Data Modeling(Multi-dimensional Database)


"Hey... I sold $100M worth of goods"
Dimensions: Product, Region, Period
Hierarchical summarization paths
  Product: Industry -> Category -> Product
  Region: Country -> Region -> City -> Office
  Period: Year -> Quarter -> Month or Week -> Day

[Figure: sales cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N) and Month (1 to 7)]

Building of data warehouse

The builder must forecast how the users will use the warehouse.
The design should support accessing data with any meaningful values of the attributes.
To build a good data warehouse, the data acquisition process must follow the steps given below:
  1. Extract the data from multiple heterogeneous sources.
  2. Format the data for consistency within the warehouse.
  3. Clean the data to ensure validity.
  4. Convert the data from the relational, object-oriented or hierarchical model to a multidimensional model.
  5. Load the data into the warehouse. Good monitoring tools are necessary to recover from an incorrect load.
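The acquisition steps can be sketched as a chain of small functions. Everything here is hypothetical: the two sources, the field names, the validity rule (amounts must not be negative) and the use of a plain dictionary as the "warehouse".

```python
def extract(sources):
    """Step 1: gather rows from several heterogeneous sources."""
    return [row for rows in sources for row in rows]

def format_rows(rows):
    """Step 2: make spelling, case and types consistent."""
    return [
        {
            "product": r["product"].strip().lower(),
            "city": r["city"].strip().title(),
            "amount": float(r["amount"]),
        }
        for r in rows
    ]

def clean(rows):
    """Step 3: keep only valid rows (assumed rule: amount is not negative)."""
    return [r for r in rows if r["amount"] >= 0]

def to_multidimensional(rows):
    """Step 4: convert flat rows into cells keyed by (product, city)."""
    cube = {}
    for r in rows:
        key = (r["product"], r["city"])
        cube[key] = cube.get(key, 0.0) + r["amount"]
    return cube

def load(warehouse, cube):
    """Step 5: load the cells and report how many were loaded, for monitoring."""
    warehouse.update(cube)
    return len(cube)

relational_rows = [{"product": "Juice ", "city": "chennai", "amount": "10"}]
flat_file_rows = [
    {"product": "juice", "city": "Chennai", "amount": 5},
    {"product": "Cola", "city": "Pune", "amount": -1},  # invalid, removed in step 3
]

warehouse = {}
rows = clean(format_rows(extract([relational_rows, flat_file_rows])))
loaded = load(warehouse, to_multidimensional(rows))
```

The count returned by load is the kind of figure a monitoring tool would compare against the expected number of rows to detect an incorrect load.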

Data warehouse and views


A data warehouse is permanent storage of data in multidimensional tables. Views are created temporarily, when needed, from the data warehouse. They are used for decision support.

Difference between data warehouse and views

Data warehouse
  A data warehouse is permanent storage of data.
  Data warehouses are multidimensional.
  A data warehouse can be indexed to maximize performance.
  A data warehouse provides specific support for a functionality.
  A data warehouse provides a large amount of data.

Views
  Views are created from warehouse data when needed and are not permanent.
  Views are relational.
  Views cannot be indexed.
  Views cannot give specific support for a functionality.
  Views are created by extracting minimal data from the data warehouse.

Data warehouse Future


New techniques must be introduced in data cleaning, indexing and partitioning.
The manual operations involved in data acquisition, management, data quality and performance maximization must be automated.
Proper business rules must be developed and incorporated into the warehouse creation and maintenance process.

Data Mining

Data mining is sorting through data to identify patterns and establish relationships.


Data Mining works with Warehouse Data


Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence



Data Mining Motivation


"The key in business is to know something that nobody else knows." (Aristotle Onassis)

"To understand is to perceive patterns." (Sir Isaiah Berlin)

Application Areas
Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Consumer goods           Promotion analysis
Data service providers   Value added data
Utilities                Power usage analysis

Data Mining in Use


The US Government uses data mining to track fraud
A supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross selling
Warranty claims routing
Holding on to good customers
Weeding out bad customers

What is data mining technology


The process of extracting or finding hidden knowledge from a large database is called data mining.
Example: from the data "Age 21" we can infer the information that the person is a major (an adult).

Data -> Information

Data Mining Technology


Databases, Flat Files
  -> Cleaning and Integration -> Data Warehouse
  -> Selection and Transformation -> Data Mining
  -> Patterns -> Knowledge

The various steps

1. Data cleaning: remove noise and inconsistent data.
2. Data integration: data from multiple sources are combined.
3. Data selection: relevant data are retrieved from the database for analysis.
4. Data transformation: the selected data are prepared for mining by performing aggregation operations.
5. Data mining: intelligent methods are applied to extract data patterns.
6. Pattern evaluation: identify the needed patterns.
7. Knowledge presentation: present the mined knowledge to the user.

Loading the Warehouse

Cleaning the data before it is loaded

Data Integration Across Sources


Sources: Savings, Loans, Trust, Credit card

Same data, different name
Different data, same name
Data found here, nowhere else
Different keys, same data

Data Transformation Example


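Each application (appl A to D) represents the same data differently; the data warehouse stores one consistent form.

Encoding
  appl A - m, f
  appl B - 1, 0
  appl C - x, y
  appl D - male, female
Unit of measure
  appl A - pipeline - cm
  appl B - pipeline - in
  appl C - pipeline - feet
  appl D - pipeline - yds
Field name
  appl A - balance
  appl B - bal
  appl C - currbal
  appl D - balcurr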
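A sketch of this transformation follows. The slide does not say which code means which gender, so the positional pairing below (first code = male) is an assumption, as is the choice of centimetres and the name balance as the warehouse's target form.

```python
import math

# Assumed pairing of each application's codes with the warehouse values.
GENDER = {
    "A": {"m": "male", "f": "female"},
    "B": {1: "male", 0: "female"},
    "C": {"x": "male", "y": "female"},
    "D": {"male": "male", "female": "female"},
}

# Conversion factors to centimetres (cm, inches, feet, yards).
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}

# Each application's own name for the balance field.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}

def transform(appl, row):
    """Convert one application's row into the warehouse's consistent form."""
    return {
        "gender": GENDER[appl][row["gender"]],
        "pipeline_cm": row["pipeline"] * TO_CM[appl],
        "balance": row[BALANCE_FIELD[appl]],
    }

from_b = transform("B", {"gender": 1, "pipeline": 10, "bal": 500.0})
from_c = transform("C", {"gender": "y", "pipeline": 2, "currbal": 75.0})
```

Both rows end up with the same three fields, so they can be stored side by side in one warehouse table.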

Structuring/Modeling Issues

Data Warehouse vs. Data Marts

From the Data Warehouse to Data Marts


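[Figure: the path from data (bottom) to information (top)]
Individually structured
Departmentally structured
Organizationally structured (Data Warehouse)

History, normalization and detail range from less (individually structured) to more (organizationally structured).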

Data Warehouse and Data Marts


OLAP Data Mart
  Lightly summarized
  Departmentally structured

Data Warehouse
  Organizationally structured
  Atomic
  Detailed data

Characteristics of the Departmental Data Mart


OLAP
Small
Flexible
Customized by department
Source is departmentally structured data warehouse

Techniques for Creating Departmental Data Mart


OLAP data marts: Sales, Finance, Mktg.
Techniques: Subset, Summarized, Superset, Indexed, Arrayed

Data Mart Centric


Data Sources -> Data Marts -> Data Warehouse

True Warehouse
Data Sources -> Data Warehouse -> Data Marts

II. On-Line Analytical Processing (OLAP)

Making Decision Support Possible

What Is OLAP?
Online Analytical Processing - coined by E.F. Codd in a 1994 paper contracted by Arbor Software
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)

The OLAP Market


Rapid growth in the enterprise market
  1995: $700 Million
  1997: $2.1 Billion
Significant consolidation activity among major DBMS vendors
  10/94: Sybase acquires ExpressWay
  7/95: Oracle acquires Express
  11/95: Informix acquires Metacube
  1/97: Arbor partners up with IBM
  10/96: Microsoft acquires Panorama
Result: OLAP shifted from small vertical niche to mainstream DBMS category

Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools

OLAP Is FASMI
Fast Analysis of Shared Multidimensional Information

Data Cube Lattice


Cube lattice
        ABC
   AB   AC   BC
   A    B    C
       none
Can materialize some group-bys, compute others on demand
Question: which group-bys to materialize?
Question: what indices to create?
Question: how to organize data (chunks, etc.)?
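The lattice and the idea of computing one group-by from another that is already materialized can be sketched as follows. The dimension values and totals are made up, and the helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def lattice(dims):
    """List every group-by in the lattice, from all dimensions down to none."""
    return [combo for k in range(len(dims), -1, -1) for combo in combinations(dims, k)]

def rollup_from(materialized, parent_dims, child_dims):
    """Compute a coarser group-by from a finer one that is already materialized."""
    idx = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, total in materialized.items():
        out[tuple(key[i] for i in idx)] += total
    return dict(out)

nodes = lattice(("A", "B", "C"))

# Hypothetical base group-by ABC, keyed by (A value, B value, C value).
abc = {("a1", "b1", "c1"): 5, ("a1", "b2", "c1"): 7, ("a2", "b1", "c2"): 3}

ab = rollup_from(abc, ("A", "B", "C"), ("A", "B"))  # AB computed from ABC
a = rollup_from(ab, ("A", "B"), ("A",))             # A computed from AB, not from ABC
none = rollup_from(a, ("A",), ())                   # grand total
```

Computing A from AB rather than from ABC reads fewer cells, which is why the choice of group-bys to materialize matters.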

Visualizing Neighbors is simpler


[Figure: the same sales data in two layouts. One is a cross-tab of months (Apr to Mar) against stores (1 to 8). The other is a relational table with columns Month, Store and Sales, one row per month and store.]

A Visual Operation: Pivot (Rotate)

[Figure: a table of Product (Juice, Cola, Milk, Cream) against Date (3/1 to 3/4), with sample values 10, 47, 30 and 12, rotated to swap its axes]

Slicing and Dicing


[Figure: a cube with a Product axis (Household, Telecomm, Video, Audio), a region axis (Europe, Far East, India) and a Sales Channel axis (Retail, Direct, Special), highlighting "The Telecomm Slice"]

Roll-up and Drill Down


Higher Level of Aggregation
  Sales Channel
  Region
  Country
  State
  Location Address
  Sales Representative
Low-level Details

Nature of OLAP Analysis


Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization
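Aggregation, percent-to-total and ranking can be illustrated on a handful of product totals. The sales figures below are sample values chosen for illustration.

```python
# Hypothetical sales totals per product.
sales = {"Juice": 10, "Cola": 47, "Milk": 30, "Cream": 12}

# Aggregation: total sales across all products.
total = sum(sales.values())

# Percent-to-total: each product's share of the total.
percent_to_total = {product: 100 * value / total for product, value in sales.items()}

# Ranking: the top two products by sales.
top_2 = sorted(sales.items(), key=lambda item: item[1], reverse=True)[:2]
```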

Organizationally Structured Data


Different departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data.
Departments: marketing, sales, finance, manufacturing

Multidimensional Spreadsheets
Analysts need spreadsheets that support
  pivot tables (cross-tabs)
  drill-down and roll-up
  slice and dice
  sort
  selections
  derived attributes

Popular in retail domain

OLAP Operations

[Figure: OLAP operations on a cube: single cell, multiple cells, slice, dice, roll up, drill down (figure credit: Prentice Hall)]

Relational OLAP: 3 Tier DSS


Database Layer: Data Warehouse
  Store atomic data in industry standard RDBMS.
Application Logic Layer: ROLAP Engine
  Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
Presentation Layer: Decision Support Client
  Obtain multidimensional reports from the DSS Client.
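The middle tier's job of turning a multidimensional request into SQL can be sketched as below. This is not how any particular ROLAP engine works: the build_query helper, the single fact table and its columns are assumptions, and sqlite3 stands in for the RDBMS.

```python
import sqlite3

DIMENSIONS = {"product", "region", "month"}  # columns of the hypothetical fact table

def build_query(dims, slice_on=None):
    """Translate a roll-up request (with an optional slice) into SQL and parameters."""
    if not dims or not set(dims) <= DIMENSIONS:
        raise ValueError("dims must be a non-empty list of known dimensions")
    cols = ", ".join(dims)
    sql = f"SELECT {cols}, SUM(sales) FROM fact"
    params = []
    if slice_on is not None:
        dim, value = slice_on
        if dim not in DIMENSIONS:
            raise ValueError("unknown slice dimension")
        sql += f" WHERE {dim} = ?"  # the value is bound, never pasted into the SQL
        params.append(value)
    sql += f" GROUP BY {cols} ORDER BY {cols}"
    return sql, params

# Database layer: atomic data in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (product TEXT, region TEXT, month INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", [
    ("Juice", "N", 1, 10),
    ("Juice", "S", 1, 20),
    ("Cola", "N", 1, 30),
    ("Cola", "N", 2, 40),
])

# Application logic layer: sales by product for the region "N" slice.
sql, params = build_query(["product"], slice_on=("region", "N"))
report = conn.execute(sql, params).fetchall()
```

The client only names dimensions and a slice; all SQL is produced in the middle tier, which mirrors the three-layer split.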

MD-OLAP: 2 Tier DSS


Database Layer and Application Logic Layer: MDDB Engine
  Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
Presentation Layer: Decision Support Client
  Obtain multidimensional reports from the DSS Client.

MSPVL Polytechnic College Pavoorchatram
