
Data Warehouse

Main repository of the organization's historical data, its corporate memory.

The data warehouse contains the raw material for management's decision support system. The data warehouse is optimized for reporting and analysis (online analytical processing, or OLAP).

Data in data warehouses is frequently heavily denormalised, summarised and/or stored in a dimension-based model, but this is not always required to achieve acceptable query response times.

Features of Data warehouse

Subject-oriented - Data in the database is organized so that all the data elements relating to the same real-world event or object are linked together.
Time-variant - The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time.
Non-volatile - Data in the database is never over-written or deleted, but retained for future reporting.
Integrated - The database contains data from most or all of an organization's operational applications, and this data is made consistent.

Data Warehouse - Non Volatile

A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms.
Requires only two operations in data accessing: initial loading of data and access of data.
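
The two-operation model can be illustrated with a minimal sketch. The `SalesFact` schema and the `AppendOnlyStore` class are hypothetical names invented for this example; the point is only that the store exposes a load and a read path, and nothing that updates or deletes a row.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SalesFact:
    """One immutable warehouse row (hypothetical schema)."""
    sale_date: str
    product: str
    amount: float


class AppendOnlyStore:
    """Supports only loading of data and access of data."""

    def __init__(self) -> None:
        self._rows = []

    def load(self, rows: Iterable[SalesFact]) -> int:
        """Append a batch of rows; existing rows are never modified."""
        batch = list(rows)
        self._rows.extend(batch)
        return len(batch)

    def query(self, product: str) -> tuple:
        """Read-only access: return matching rows as an immutable tuple."""
        return tuple(r for r in self._rows if r.product == product)


store = AppendOnlyStore()
store.load([
    SalesFact("2023-01-05", "widget", 120.0),
    SalesFact("2023-01-06", "gadget", 75.5),
])
store.load([SalesFact("2023-02-01", "widget", 130.0)])
widget_rows = store.query("widget")
```

Because the rows are frozen dataclasses and the store has no update or delete method, a later load can only add history, never rewrite it.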

Data Warehouse - Time Variant

The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).

Every key structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a time element.
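
As a rough illustration of a time element in the key, the sketch below contrasts an operational record keyed only by customer with warehouse records keyed by customer and effective date. The customer data, the `valid_from` convention, and the `city_as_of` helper are assumptions made up for this example.

```python
from bisect import bisect_right
from datetime import date

# Operational view: one current value per customer, no time in the key.
operational = {"C1": "Boston"}

# Warehouse view: the key is (customer_id, valid_from), so history is kept.
warehouse = {
    ("C1", date(2019, 3, 1)): "Chicago",
    ("C1", date(2021, 7, 15)): "Denver",
    ("C1", date(2023, 1, 10)): "Boston",
}


def city_as_of(customer_id: str, when: date):
    """Return the city recorded for the customer on the given date, or None."""
    starts = sorted(d for (cid, d) in warehouse if cid == customer_id)
    pos = bisect_right(starts, when)
    if pos == 0:
        return None  # no warehouse record exists that early
    return warehouse[(customer_id, starts[pos - 1])]
```

The operational dictionary can only answer "where is the customer now", while `city_as_of("C1", date(2020, 1, 1))` can answer a historical question because the date is part of the key.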

Data Warehouse - Integrated

Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.

Data cleaning and data integration techniques are applied.

Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted.
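
The hotel price example can be sketched as a conversion step applied on load. Everything numeric here is a hypothetical assumption (the exchange rate, the tax rate, the breakfast surcharge, and the choice of "USD, tax included, breakfast included" as the warehouse convention); only the idea of converting each source to one consistent measure comes from the text above.

```python
# Hypothetical conversion constants, chosen only for illustration.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.10}
TAX_RATE = 0.10
BREAKFAST_USD = 15.0


def to_warehouse_price(record: dict) -> float:
    """Convert a source record to USD with tax and breakfast included."""
    price = record["price"] * USD_PER_UNIT[record["currency"]]
    if not record["tax_included"]:
        price *= 1 + TAX_RATE
    if not record["breakfast_included"]:
        price += BREAKFAST_USD
    return round(price, 2)


# Two sources quoting the "same" attribute under different conventions.
source_a = {"price": 100.0, "currency": "USD",
            "tax_included": False, "breakfast_included": False}
source_b = {"price": 100.0, "currency": "EUR",
            "tax_included": True, "breakfast_included": True}

price_a = to_warehouse_price(source_a)
price_b = to_warehouse_price(source_b)
```

Both sources report a price of 100, but only after conversion are the two values comparable.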


Data Warehouse - Subject-Oriented

Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Architecture for a Data warehouse.

Components of a Data Warehouse

Database Servers: Operational data accumulated during standard business must be extracted and stored into a database.
- Most companies use a relational database stored on a mainframe server.
- Oracle, Sybase, SQL Server and DB2 are just a few of the database systems available.

Queries/Reports: Querying is a broad term that encompasses all the activities of requesting data from a data warehouse for analysis. Reports are then generated to display the results for the specified query. Querying is the whole point of using the data warehouse.

OLAP/Multi-dimensional analysis: Relational databases store data in a two-dimensional format: tables of data represented by rows and columns. Multi-dimensional analysis, commonly referred to as On-Line Analytical Processing (OLAP), offers an extension to the relational model to provide a multi-dimensional view of the data. These tools allow users to drill down from summary data sets into the specific data underlying the summaries.

Data Mining: The process of analyzing business data in the data warehouse to find unknown patterns or rules of information that you can use to tailor business operations. Data mining predicts future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.

What is OLAP
Basic idea: converting data into information that decision makers need
Concept: analyzing data by multiple dimensions in a structure called a data cube
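
A minimal sketch of analyzing the same facts along different dimensions follows. The fact rows and the `aggregate` helper are hypothetical; a real OLAP tool would precompute or index these aggregates rather than rescan the rows.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales amount).
facts = [
    (2022, "East", "widget", 100),
    (2022, "West", "widget", 150),
    (2022, "East", "gadget", 80),
    (2023, "East", "widget", 120),
    (2023, "West", "gadget", 90),
]
DIMENSIONS = ("year", "region", "product")


def aggregate(rows, by):
    """Sum sales grouped by the named dimensions."""
    idx = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


# Summary view, then the same data one level of detail lower.
by_year = aggregate(facts, ["year"])
by_year_region = aggregate(facts, ["year", "region"])
```

Going from `by_year` to `by_year_region` is a drill-down: the 2022 total of 330 is broken into its East and West components.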

On-Line Analytical Processing (OLAP)

OLAP is the use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques.

Relational OLAP (ROLAP)

OLAP tools that view the database as a traditional relational database in either a star schema or other normalized or denormalized set of tables.
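
A small star schema can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions for illustration, not a schema taken from these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the subjects; the fact table holds measures.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'manual', 'books');
    INSERT INTO dim_date VALUES (1, 2022, 4), (2, 2023, 1);
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 120.0), (2, 2, 30.0);
""")

# A typical ROLAP-style query: join the fact table to its dimensions
# and aggregate the measure by dimension attributes.
rows = conn.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
""").fetchall()
```

The multidimensional view is produced at query time by SQL joins and `GROUP BY`, with the data staying in ordinary relational tables.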

Multidimensional OLAP (MOLAP)

OLAP tools that load data into an intermediate structure, usually a three or higher dimensional array.
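
The intermediate array structure can be sketched with plain nested lists. The dimension members and sales figures are hypothetical, and production MOLAP engines use specialised sparse storage rather than a dense Python list.

```python
# Hypothetical dimension members; each list maps to one array axis.
years = [2022, 2023]
regions = ["East", "West"]
products = ["widget", "gadget"]

# cube[y][r][p] holds the sales total for that combination of members.
cube = [[[0 for _ in products] for _ in regions] for _ in years]


def add_sale(year, region, product, amount):
    """Load one fact into its cell of the three-dimensional array."""
    cube[years.index(year)][regions.index(region)][products.index(product)] += amount


for sale in [(2022, "East", "widget", 100), (2022, "West", "widget", 150),
             (2023, "East", "gadget", 80), (2023, "East", "widget", 120)]:
    add_sale(*sale)


def slice_by_year(year):
    """Fix one year and return the remaining region x product table."""
    return cube[years.index(year)]


def total_for_product(product):
    """Aggregate over the year and region axes for one product."""
    p = products.index(product)
    return sum(cube[y][r][p] for y in range(len(years)) for r in range(len(regions)))
```

Once the data is loaded, slicing and aggregating are index operations on the array, with no joins needed at query time.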

The data warehouse addresses these factors and provides many advantages including:
Improved end-user access to a wide variety of data
Increased data consistency
Additional documentation of the data
Potentially lower computing costs and increased productivity
Providing a place to combine related data from separate sources
Creation of a computing infrastructure that can support changes in computer systems and business structures
Empowering end-users to perform any level of ad-hoc queries or reports without impacting the performance of the operational systems

Why Data Mining

Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions

Fraud detection
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

Customer relationship management:

Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data Mining helps extract such information

Data mining
Process of semi-automatically analyzing large databases to find patterns that are:
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to interpret the pattern

Also known as Knowledge Discovery in Databases (KDD)
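
The "valid" criterion can be illustrated by checking a candidate pattern against data that was not used to find it. The records, the threshold of two late payments, and the `rule` function are all made up for this sketch; real mining tools would learn the rule rather than have it written by hand.

```python
# Hypothetical labelled records: (late_payments_last_year, defaulted).
history = [(0, False), (3, True), (1, False), (4, True), (0, False),
           (2, True), (5, True), (1, False), (0, False), (3, False)]
new_data = [(0, False), (4, True), (2, False), (1, False), (3, True)]


def rule(late_payments: int) -> bool:
    """Candidate pattern: two or more late payments predicts default."""
    return late_payments >= 2


def accuracy(records) -> float:
    """Fraction of records where the rule's prediction matches the outcome."""
    hits = sum(1 for late, defaulted in records if rule(late) == defaulted)
    return hits / len(records)


train_acc = accuracy(history)     # how well the pattern fits the old data
holdout_acc = accuracy(new_data)  # whether it still holds on new data
```

A pattern that scores well on `history` but poorly on `new_data` would fail the validity test, however interesting it looks.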

Applications

Banking: loan/credit card approval
predict good customers based on old customers

Customer relationship management:

identify those who are likely to leave for a competitor.

Targeted marketing:
identify likely responders to promotions

Fraud detection: telecommunications, financial transactions

from an online stream of events, identify fraudulent events

Manufacturing and production:

automatically adjust knobs when process parameters change
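
For the fraud detection item above, one simple way to flag suspicious events in an online stream is to compare each amount with running statistics. The sketch below uses Welford's online mean/variance update with a three-standard-deviation threshold; this is a stand-in technique chosen for illustration, not a method prescribed by these notes, and the class name, threshold, warm-up length, and amounts are hypothetical.

```python
import math


class StreamingOutlierFlagger:
    """Flags amounts far from the running mean of the stream seen so far."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5) -> None:
        self.threshold = threshold  # number of standard deviations
        self.warmup = warmup        # events to observe before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, amount: float) -> bool:
        """Return True if the amount looks anomalous, then update the statistics."""
        flagged = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            flagged = std > 0 and abs(amount - self.mean) > self.threshold * std
        # Welford update; flagged amounts are still folded into the statistics.
        self.n += 1
        delta = amount - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (amount - self.mean)
        return flagged


flagger = StreamingOutlierFlagger()
stream = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 950.0, 23.0]
flags = [flagger.observe(x) for x in stream]
```

Each event is examined once as it arrives, so the whole history never has to be stored. A real system would also use customer demographics and transaction history, not just the amount.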

Applications (continued)
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationship between diseases

Molecular/Pharmaceutical: identify new drugs

Scientific data analysis:

identify new galaxies by searching for sub clusters

Web site/store design and promotion:

find affinity of visitor to pages and modify layout
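
Page affinity can be approximated by counting how often two pages are viewed in the same session. The session data below is invented for this sketch, and pair counting is only one simple measure of affinity.

```python
from collections import Counter
from itertools import combinations

# Hypothetical visitor sessions: the set of pages each visitor viewed.
sessions = [
    {"home", "laptops", "laptop-bags"},
    {"home", "laptops", "laptop-bags", "checkout"},
    {"home", "phones"},
    {"laptops", "laptop-bags"},
    {"home", "phones", "checkout"},
]

# Count every unordered pair of pages that appears in the same session.
pair_counts = Counter()
for pages in sessions:
    for pair in combinations(sorted(pages), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
```

Here the most frequent pair is the laptops and laptop-bags pages, which suggests linking them more prominently in the layout.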