
Data Warehousing

By: Mr. Nilesh Magar, Lecturer in Computer Science, MIT-MACS College, Kothrud, Pune-38.

Introduction

Definition:

Simple perception: no more than a collection of the key pieces of information used to manage & direct the business for the most profitable outcome.

Precise definition: concentrates on the data. Data should be subject oriented, consistent across sources & so on.

Pearson's definition: it is more than vast data; it is also the process involved in getting that data from source to table & from table to analysts.

In other words: a DWH is the data (meta/fact/dimension/aggregate) and the process managers (load/warehouse/query) that make information available, enabling people to make informed decisions.

Data Warehousing Architecture: a DWH must be architected to support three major driving factors.

1) Populating the DWH.
2) Day-to-day management of the DWH.
3) The ability to cope with requirement evolution.

Typical process flow within the DWH

[Figure: process flow within the DWH. Labels: Source, Extract & load, Data transformation and movement, Warehouse, Query, User, Archive data.]

Processes:

1. Extract & load the data.
2. Clean & transform the data into a form that can cope with large data volumes & provide good query performance.
3. Back up & archive data.
4. Manage queries & direct them to the appropriate data sources.

[Figure: extract & load process. Operational data (suitable for the operational system; may have been modified & extended over the years to support performance) is reconstructed into the DWH.]

1) Extract & load process:

a. Controlling the process: determine when to start extracting the data, when to run the transformations, consistency checks & so on. E.g. retail sales analysis.
b. When to initiate the extract: the data should be in a consistent state, with all sources at the same instant of time. E.g. telecom.
c. Loading the data: load into a temporary data store, then clean up & run consistency checks. E.g. current subscriber & current event DB.
d. Copy management tools & data clean-up: coding.
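As a minimal sketch of point b, the controlling process could refuse to start the extract until every source reports a consistent state at the chosen instant. The class `SourceStatus`, the field `consistent_as_of`, and the function `can_start_extract` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SourceStatus:
    name: str                   # hypothetical source system name
    consistent_as_of: datetime  # latest instant the source is known to be consistent


def can_start_extract(sources, snapshot_time):
    """Return True only if every source is consistent up to snapshot_time."""
    # An empty source list is treated as "not ready" rather than trivially ready.
    return bool(sources) and all(
        s.consistent_as_of >= snapshot_time for s in sources
    )
```

In the telecom example, the subscriber and event sources would both need to reach the same snapshot instant before the extract is initiated.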

2) Clean & transformation

a. Clean & transform the data into a structure that speeds up queries:
   - Make sure the data is consistent within itself. E.g. within a row.
   - Make sure the data is consistent with other data within the same source.
   - Make sure the data is consistent with data in the other source systems.
   - Make sure the data is consistent with the information already in the warehouse.

b. Partition the data in order to speed up queries, optimize h/w performance & simplify the management of the DWH.
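Two of these consistency checks can be sketched as follows. The column names (`quantity`, `unit_price_cents`, `total_cents`, `product_id`) and the rule that a line total equals quantity times unit price are assumptions made for the example, not part of the original material.

```python
def row_is_consistent(row):
    """Check a row against itself: the line total must equal quantity * unit price."""
    return row["quantity"] * row["unit_price_cents"] == row["total_cents"]


def unknown_keys(rows, key, known_values):
    """Return the key values found in rows that are missing from known_values,
    for example product ids that are not yet present in the warehouse."""
    return sorted({row[key] for row in rows} - set(known_values))
```

Rows that fail `row_is_consistent`, or whose keys show up in `unknown_keys`, would be held back for clean-up rather than loaded.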

3) Back-up & archive process:

- Back up regularly, so that the DWH can be recovered after data loss or failure.
- In archiving, older data is removed from the system.
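The archiving step can be pictured as splitting rows by age. This is only a sketch: the field name `sale_date` and the cutoff-date rule are assumptions, and a real archive process would also write the archived rows to long-term storage.

```python
from datetime import date


def split_for_archive(rows, cutoff):
    """Split rows into (current, to_archive) using a cutoff date.

    Rows dated on or after the cutoff stay in the system;
    older rows are returned separately so they can be archived and removed.
    """
    current = [r for r in rows if r["sale_date"] >= cutoff]
    to_archive = [r for r in rows if r["sale_date"] < cutoff]
    return current, to_archive
```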

4) Query management process:

Direct each query to the most effective data source.

Process Architecture

Process                  Function                                                                              System manager
Extract & load           Extract & load the data, performing simple transformations before & during load      Load manager
Clean & transform data   Transforms & manages data                                                             Warehouse manager
Backup & archive         Backs up & archives the data warehouse                                                Warehouse manager
Query management         Directs & manages queries                                                             Query manager

[Figure: architecture of the data warehouse. Labels: Operational data, Load manager, Warehouse manager, Query manager, Detailed information, Summary info, Meta data, Data dipper, OLAP tools. Flow: Data, Information, Decision.]

Load Manager

The system component that performs all the operations necessary to support the extract and load process.

Built from off-the-shelf tools, bespoke coding, C programs & shell scripts. Size & complexity vary from one DWH to another: the larger the degree of overlap between source systems, the larger the load manager will be.

Third-party tools cover at most 20 to 25% of the total system functionality.

Load Manager Architecture

1) Extract the data from the source systems.
2) Fast load the extracted data into a temporary data store.
3) Perform simple transformations into a structure similar to the one in the data warehouse.

Each of these functions has to operate automatically & recover from any error it encounters, to a very large extent with no human intervention.
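One simple way to picture "recover from errors with no human intervention" is a retry wrapper around each load step. This is a sketch under stated assumptions: `run_with_retry`, its parameters, and the retry-on-any-exception policy are illustrative choices, not something the original material prescribes.

```python
import time


def run_with_retry(step, max_attempts=3, delay_seconds=0.0):
    """Run a zero-argument step, retrying on failure up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:  # assumption: every error is treated as retryable
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # All attempts failed: surface the last error so an operator can be alerted.
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error
```

A real load manager would distinguish transient errors (a dropped connection) from permanent ones (a malformed file) and would log each attempt.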

Extract data from source system

To get hold of the source data, it has to be transferred from the source systems & made available to the DWH. Typically, ASCII files are sent by FTP across the LAN. Current gateway technology operates too slowly to compete with FTP.

Fast Load

- Data should be loaded into the warehouse in the fastest possible time, in order to minimize the total load window.
- This becomes critical as the number of data sources increases & the time window shrinks.
- In practice it is more effective to load the (ASCII) data into a relational database prior to applying transformations & checks.

Simple Transformation

- Before or during the load there will be an opportunity to perform simple transformations on the data.
- Here we perform those transformations that do not require complex logic or the use of relational set operators.

E.g. retail management system:
1) Strip out all the columns that are not required in the DWH.
2) Convert all the values to the required data types.
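The two simple transformations in the retail example can be sketched with the Python standard library. The column names (`store_id`, `sale_date`, `amount`) and the CSV input format are assumptions made for the example.

```python
import csv
import io
from datetime import date

# Columns required in the DWH, each mapped to the converter for its target type.
# These names and types are hypothetical.
WANTED = {
    "store_id": int,
    "sale_date": date.fromisoformat,
    "amount": float,
}


def simple_transform(text):
    """Keep only the wanted columns and convert each value to its required type."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {col: convert(row[col]) for col, convert in WANTED.items()}
        for row in reader
    ]
```

Any column not listed in `WANTED` is dropped, and a value that cannot be converted raises an error instead of being loaded silently.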

[Figure: load manager architecture. Labels: Load manager, Controlling process, Stored procedure, Copy management tools, Temporary data store, File structure, Fast loader, Warehouse structure.]

Warehouse Manager

The system component that performs all the operations necessary to support the warehouse management process.

Built from third-party system management tools, bespoke coding, C programs & shell scripts.

As with the load manager, the size & complexity of the warehouse manager vary between specific solutions. Unlike the load manager, the complexity of the warehouse manager is driven by the extent to which the operational management of the DWH has been automated.

Third-party tools cover at most 40% of the total system functionality.

Warehouse Manager Architecture

1) Analyze the data to perform consistency & referential integrity checks.
2) Transform & merge the source data in the temporary data store into the published DWH.
3) Create indexes, business views, partition views & so on.
4) Generate denormalizations if appropriate.

[Figure: warehouse manager architecture. Labels: Warehouse manager, Controlling process, Temporary data store, Stored procedure, Star flake schema, Backup/recovery tool, SQL scripts, Summary tables.]

Using temporary destination tables:

Once the data is in the temporary store, the next step is to create a set of tables identical to the destination tables in the DWH.

Example: the data in the DWH is highly partitioned. As we are about to execute substantial consistency checks, data should not be loaded into the DWH until it has been cleaned up.

If a consistency check fails: although relational databases provide some form of rollback, in practice it is easier to load the data into a temporary area, clean it up & then publish it to the DWH.
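The load, check, then publish pattern can be sketched with SQLite standing in for the warehouse database. The table names (`product`, `sales`, `staging_sales`) and the two checks (unknown product, negative amount) are hypothetical and chosen only for illustration.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a dimension table, a published fact
    table, and a staging table identical in shape to the fact table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product (product_id INTEGER PRIMARY KEY);
        CREATE TABLE sales (product_id INTEGER, amount REAL);
        CREATE TABLE staging_sales (product_id INTEGER, amount REAL);
        INSERT INTO product VALUES (1), (2);
    """)
    return conn


def stage(conn, rows):
    """Load raw rows into the temporary (staging) table."""
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", rows)


def publish_if_clean(conn):
    """Publish staged rows to the fact table only if every row passes the checks."""
    bad = conn.execute(
        "SELECT COUNT(*) FROM staging_sales s "
        "LEFT JOIN product p ON p.product_id = s.product_id "
        "WHERE p.product_id IS NULL OR s.amount < 0"
    ).fetchone()[0]
    if bad:
        return False  # leave the staged rows in place for clean-up
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute("INSERT INTO sales SELECT * FROM staging_sales")
        conn.execute("DELETE FROM staging_sales")
    return True
```

Because failed rows never reach `sales`, the published table does not need to be rolled back; only the staging table needs cleaning.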

Complex Transformation

- Reconcile the data.

Transform into a star flake schema:

- Transform the data into a form suitable for decision support queries.
- Transform it into a form in which the bulk of the factual data lies in the center.
- Options: star schema, snowflake schema, star flake schema.

Create indexes & views:

- One would expect the index creation time to be significant, even if we only need to create indexes against a fact table partition.
- Because of this, most relational technologies have facilities to create indexes in parallel, distributing the load across the h/w & significantly reducing the elapsed time.
- Indexes also add to the overhead of inserting a row into a table.

Generate the summaries:

- The warehouse manager has to create a set of aggregations to speed up query performance.
- The summaries are generated automatically.
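A summary table is a precomputed aggregation of the fact data. The sketch below builds one with SQLite; the table names `sales` and `sales_by_month`, the columns, and the choice of a per-store monthly total are assumptions for the example.

```python
import sqlite3


def build_monthly_summary(conn):
    """Rebuild a summary table holding total sales per store per month.

    Assumes a table sales(store_id INTEGER, sale_date TEXT 'YYYY-MM-DD', amount INTEGER).
    """
    with conn:  # commit the rebuild as one unit
        conn.execute("DROP TABLE IF EXISTS sales_by_month")
        conn.execute(
            "CREATE TABLE sales_by_month AS "
            "SELECT store_id, substr(sale_date, 1, 7) AS month, SUM(amount) AS total "
            "FROM sales "
            "GROUP BY store_id, substr(sale_date, 1, 7)"
        )
```

Queries that only need monthly totals can then read the much smaller `sales_by_month` table instead of scanning `sales`.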

Query Manager:

The system component that performs all the operations necessary to support the query management process.

Built from user access tools, specialist data warehousing monitoring tools, native database facilities, bespoke coding, C programs & shell scripts.

Size & complexity vary between specific solutions. Unlike the load manager, the complexity of the query manager is driven by the extent to which the facilities are provided by user access tools or native database facilities.

Query Manager Architecture

1. Direct queries to the appropriate tables.
2. Schedule the execution of the user queries.
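Both functions can be sketched in a few lines. The table names, the mapping from grouping columns to summary tables, and the priority-based scheduling policy are all assumptions for illustration; a real query manager would rely on the facilities of the user access tools or the database.

```python
import heapq

# Hypothetical catalogue: which summary table answers which grouping.
SUMMARY_TABLES = {
    frozenset({"store", "month"}): "sales_by_store_month",
}
FACT_TABLE = "sales_fact"  # hypothetical detailed fact table


def choose_table(group_by):
    """Direct a query to a matching summary table, else to the detailed fact table."""
    return SUMMARY_TABLES.get(frozenset(group_by), FACT_TABLE)


class QueryScheduler:
    """Run lower priority numbers first; ties run in submission order."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker that preserves submission order

    def submit(self, priority, query):
        heapq.heappush(self._heap, (priority, self._counter, query))
        self._counter += 1

    def next_query(self):
        """Return the next query to execute, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A query grouped by store and month is sent to the summary table, while any other grouping falls back to the fact table.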