
Data Warehousing

A data warehouse is a database that is managed separately from an organization's operational databases. It is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

Why Data Warehousing Is Needed

- Strategic information
- Need for analysis of outcomes
- KDD (knowledge discovery in databases)

Strategic Information
Who needs strategic information?
Executives and managers who are responsible for keeping the enterprise competitive need information to make proper decisions. They need it to establish goals and set objectives.

Operational vs. Informational Systems

Aspect             Operational                  Informational
Data content       Current values               Archived, derived, summarized
Data structure     Optimized for transactions   Optimized for complex queries
Access frequency   High                         Low
Access type        Read, update, delete         Read
Usage              Predictable, repetitive      Ad hoc, random
Number of users    Large                        Relatively small

DSS for Strategic Information

We need to build a new environment that keeps informational systems separate from operational systems. It includes:
- A database designed for analytical tasks
- Data from multiple applications
- Read-intensive data usage
- Content that includes current and historical data
- The ability to run queries and get results online
- The ability for users to initiate reports

Features of a Data Warehouse

Subject-Oriented Data

In operational systems, data is stored by individual application. These data sets serve the application's own tasks, such as backups, updates, and stock verification.
Examples: a banking application for debits and credits, or a railway reservation system.

In a data warehouse, data is stored by subject, not by application.

- Subjects differ from enterprise to enterprise.
- For a manufacturing company, sales, shipments, and so on are subjects.
- For a retailer, sales and checkout are subjects.
- In insurance, claims are a subject; claims processing is an application.

Integrated Data
For decision making we need to pull together all the relevant data from the various applications. The sources differ: different databases, different data segments, different formats. During integration we need to remove inconsistencies and standardize the various data elements.
For example, savings accounts, loan accounts, and checking accounts together make up bank accounts. Standardization covers naming conventions, codes, and data attributes.
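The standardization step above can be illustrated with a minimal Python sketch. It assumes two hypothetical source applications (savings and loans) that name and code the same facts differently; all field names, codes, and values are invented for illustration and do not come from any real system.

```python
# Hypothetical source records: each application uses its own field names,
# type codes, and units (all values here are illustrative assumptions).
SAVINGS = [{"cust_nm": "A. Rao", "acct_type": "SAV", "bal": "1200.50"}]
LOANS = [{"customer": "A. Rao", "type": "LN", "balance_cents": 5000000}]

# One agreed code set for the warehouse.
ACCOUNT_TYPE_CODES = {"SAV": "SAVINGS", "LN": "LOAN", "CHK": "CHECKING"}


def from_savings(rec):
    """Map a savings-application record to the warehouse convention."""
    return {
        "customer_name": rec["cust_nm"],
        "account_type": ACCOUNT_TYPE_CODES[rec["acct_type"]],
        "balance": float(rec["bal"]),  # text amount -> number
    }


def from_loans(rec):
    """Map a loan-application record to the warehouse convention."""
    return {
        "customer_name": rec["customer"],
        "account_type": ACCOUNT_TYPE_CODES[rec["type"]],
        "balance": rec["balance_cents"] / 100,  # cents -> currency units
    }


# After standardization, both sources share one structure: bank accounts.
bank_accounts = [from_savings(r) for r in SAVINGS] + [from_loans(r) for r in LOANS]
```

Each source gets its own small mapping function, so adding a third source (say, checking accounts) does not disturb the existing ones.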

Time-Variant Data

An OLTP system always stores current data. It accepts updates, deletions, and modifications. A data warehouse is meant for analysis and decision making, so historical data of all kinds must be supported. Data is stored as snapshots over past and current periods, using a time element. The time-variant nature of the data warehouse:
- Allows analysis of the past
- Relates information to the present
- Enables forecasts of the future
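As a rough illustration of storing snapshots with a time element, the Python sketch below appends one row per item per period instead of overwriting the current value. The function names `take_snapshot` and `quantity_as_of` and the sample data are assumptions made for this example.

```python
from datetime import date

# Each row is a snapshot; the same item can appear once per period.
stock_snapshots = []


def take_snapshot(snapshot_date, stock_levels):
    """Append (never overwrite) the stock levels as of snapshot_date."""
    for item, quantity in stock_levels.items():
        stock_snapshots.append(
            {"snapshot_date": snapshot_date, "item": item, "quantity": quantity}
        )


def quantity_as_of(item, on_date):
    """Return the most recent snapshot quantity on or before on_date."""
    rows = [
        r for r in stock_snapshots
        if r["item"] == item and r["snapshot_date"] <= on_date
    ]
    if not rows:
        return None  # no snapshot exists that early
    return max(rows, key=lambda r: r["snapshot_date"])["quantity"]


take_snapshot(date(2024, 1, 31), {"widget": 120})
take_snapshot(date(2024, 2, 29), {"widget": 95})
```

Because old rows are kept, a question about the past (`quantity_as_of("widget", date(2024, 2, 15))`) is answered from the January snapshot rather than from today's value.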

Non-Volatile Data


[Figure: OLTP database: read, add, change, delete. Data warehouse: read.]

- Data from the various operational systems, together with pertinent data from outside sources, is transformed, integrated, and stored in the data warehouse.
- The OLTP system is used for current data, such as current stock; the data warehouse is used for snapshots.
- Data is moved from the operational systems to the data warehouse at different frequencies.
- Business transactions do not update the data in the data warehouse. Updates happen in the operational database; once data has been moved to the data warehouse, it remains unchanged.

Data Granularity
In operational systems, data is usually kept at the lowest level of detail.
For example, an order records quantity and price at the unit level, and these are summed at the end to get the total sales and purchases for the month. A user analyzing data in the data warehouse usually looks at summary data first and, if needed, drills down to a further breakdown. In a data warehouse it is therefore efficient to keep summary data at different levels.
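A small Python sketch of keeping summaries at more than one level: unit-level order lines are rolled up into daily and monthly totals. The sample rows and the helper name `summarize` are illustrative assumptions.

```python
from collections import defaultdict

# Lowest level of detail: one row per order line (illustrative values).
order_lines = [
    {"date": "2024-01-05", "product": "pen", "quantity": 10, "unit_price": 2.0},
    {"date": "2024-01-05", "product": "ink", "quantity": 1, "unit_price": 5.0},
    {"date": "2024-01-20", "product": "pen", "quantity": 5, "unit_price": 2.0},
    {"date": "2024-02-02", "product": "pen", "quantity": 4, "unit_price": 2.0},
]


def summarize(lines, key_func):
    """Total sales amount per group, where key_func picks the group."""
    totals = defaultdict(float)
    for line in lines:
        totals[key_func(line)] += line["quantity"] * line["unit_price"]
    return dict(totals)


# The same detail data summarized at two levels of granularity.
daily_sales = summarize(order_lines, lambda line: line["date"])
monthly_sales = summarize(order_lines, lambda line: line["date"][:7])
```

A user would typically start from `monthly_sales` and move to `daily_sales` or the order lines themselves only when a further breakdown is needed.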

Data Warehouse and Data Mart

Data warehouse:
- Corporate and enterprise-wide
- Union of all data marts
- Organized on the E-R model
- Queries on presentation resources

Data mart:
- Departmental
- A single business process
- Structured to suit the departmental view of data
- Technology optimal for data access and analysis

Top-Down Approach
Advantages:
- An enterprise view of data
- Single, central storage of data content
- Centralized rules and control
- May show quick results if implemented in iterations

Disadvantages:
- Takes longer to build, even with an iterative method
- High risk of failure
- High outlay without proof of concept

Bottom-Up Approach
Advantages:
- Faster and easier implementation of manageable pieces
- Less risk of failure
- Serves as a proof of concept
- Allows the project team to learn and grow

Disadvantages:
- Each data mart has its own narrow view of data
- Permeates redundant data in every data mart
- Perpetuates inconsistent and irreconcilable data
- Proliferates unmanageable interfaces

Components of a Data Warehouse

When we build a data warehouse, we need a strategy based on the requirements and benefits of the company. The basic components of a data warehouse are:
- Source data
- Data staging
- Data storage
- Management and control
- Information delivery
- Metadata

Source Data Component

Source data coming into the data warehouse falls broadly into four categories:
- Production data
- Internal data
- Archived data
- External data

Production Data
This category of data comes from the various operational systems of the enterprise and is selected according to the information requirements. Operational systems do not handle broad queries; their queries are narrow and predictable. The production data runs on different platforms, so the challenge is to standardize and transform it.

Internal Data
Every organization has its own internal data that can be useful for the data warehouse. Internal data adds complexity to the process of transformation and integration. We need to evaluate it strategically after collecting data from the various sources.

Archived Data
In operational systems we periodically take the old data and store it in archive files. Some operational systems archive daily, some monthly, and some yearly. Because the data warehouse keeps historical snapshots of data, the archive files are a necessary source.

External Data
External data is equally important for a data warehouse. Sources within the organization are not sufficient on their own, so external sources are needed as well. Because external data does not conform to our formats, we need to transform and standardize it.

Data Staging Component

After data is extracted from the various operational systems and from external sources, it must be prepared for the data warehouse. Data staging provides the functions to clean, change, combine, and convert the source data for storage in the data warehouse. Data staging is divided into three functions:
- Data extraction
- Data transformation
- Data loading

Data Extraction
Data extraction is quite complex because the source data is so diverse. Extraction tools are available that extract the data into a separate environment, from which moving it into the database is easier. Extract from a data source only when its data represents the same snapshot in time as the other data sources. Do not run consistency checks until data from all sources has been stored in the temporary data store.

Functions and Services of Data Extraction

- Select data sources and determine the types of filters to be applied to individual sources
- Generate automatic extract files from operational systems using replication and other techniques
- Create intermediary files to store selected data and merge them later
- Transport extracted files from multiple platforms
- Provide automated job control services for creating extract files
- Reformat input from outside sources
- Reformat input from departmental data files, databases, and spreadsheets
- Generate common application code for data extraction
- Resolve inconsistencies for common data elements from multiple sources

Data Transformation
Data conversion is an important function, since we may be moving from file-based storage to a database. The tasks to perform for data conversion are:
- Data cleansing
- Data standardization
- Purging of unnecessary data
- Sorting and merging

Clean and Transform the Data

Data needs to be cleaned and checked in the following ways:
- Make sure that the data is consistent within itself (e.g. phone number, address).
- Make sure that the data is consistent with other data within the same source (e.g. SKUs and customers in transactions refer to valid SKUs and customers).
- Make sure that the data is consistent with data in other source systems (e.g. a customer record in the customer database matches the customer events).
- Make sure that the data is consistent with information already in the data warehouse (e.g. the incoming customer list matches the version already stored).
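The four consistency checks above might be sketched in Python as follows. The record layout, the ten-digit phone format, and the function name `check_record` are assumptions chosen for illustration, not rules taken from the source.

```python
import re

# Assumed phone format for this sketch: exactly ten digits.
PHONE_PATTERN = re.compile(r"^\d{10}$")


def check_record(txn, valid_skus, source_customers, warehouse_customers):
    """Return a list of consistency problems found in one transaction."""
    problems = []
    # 1. Consistent within itself.
    if not PHONE_PATTERN.match(txn["phone"]):
        problems.append("malformed phone number")
    # 2. Consistent with other data in the same source.
    if txn["sku"] not in valid_skus:
        problems.append("unknown SKU")
    # 3. Consistent with data in other source systems.
    if txn["customer_id"] not in source_customers:
        problems.append("customer missing from customer database")
    # 4. Consistent with information already in the warehouse.
    known_name = warehouse_customers.get(txn["customer_id"])
    if known_name is not None and known_name != txn["customer_name"]:
        problems.append("customer name differs from warehouse")
    return problems
```

Returning a list of problems, rather than stopping at the first one, lets a cleansing job report everything wrong with a record in a single pass.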

Data Transformation
- Transform extracted data into the appropriate formats and data structures.
- Provide default values as specified.
- Major functions are splitting, consolidation, standardization, and deduplication.

Summary of Data Transformation

- Map input data to data for the data warehouse repository
- Clean data, deduplicate, and merge/purge
- Convert data types
- Calculate and derive attribute values
- Check for referential integrity
- Aggregate data as needed
- Resolve missing values
- Consolidate and integrate data
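Several of the transformation tasks listed above (deduplication, type conversion, deriving attributes, and resolving missing values) can be shown together in one short Python sketch. The row layout, the business key `order_id`, and the default value `"UNKNOWN"` are assumptions for this example.

```python
def transform(rows, default_region="UNKNOWN"):
    """Deduplicate, convert types, fill missing values, derive an amount."""
    seen = set()
    result = []
    for row in rows:
        key = row["order_id"]
        if key in seen:  # deduplicate on the business key, keep first copy
            continue
        seen.add(key)
        quantity = int(row["quantity"])  # convert text to integer
        unit_price = float(row["unit_price"])  # convert text to float
        result.append({
            "order_id": key,
            # Resolve a missing or empty value with the specified default.
            "region": row.get("region") or default_region,
            "quantity": quantity,
            "unit_price": unit_price,
            "amount": quantity * unit_price,  # derived attribute
        })
    return result
```

For example, passing two copies of order `A1` and an order `A2` with an empty region yields two rows, with `A2` assigned the region `"UNKNOWN"`.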

Data Loading
Data loading takes place for:
- Initial data
- Incremental data and revisions on an ongoing basis

Data Loading
- Load the transformed and consolidated data, in the form of load images, into the data warehouse repository.
- Some loaders generate primary keys for the tables being loaded.
- For load images available on the same RDBMS engine as the data warehouse, precoded stored procedures may be used for loading.
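A minimal sketch of an initial load followed by an incremental load, using Python's built-in `sqlite3` module as a stand-in for the warehouse RDBMS. The table `customer_dim`, its columns, and the sample rows are hypothetical; the database generates the primary key for each newly loaded row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY,   -- generated during loading
        source_id    TEXT UNIQUE NOT NULL,  -- key from the source system
        name         TEXT NOT NULL
    )
    """
)


def load(connection, load_image):
    """Insert rows not yet present; rows already loaded stay unchanged."""
    with connection:  # commit on success, roll back on error
        connection.executemany(
            "INSERT OR IGNORE INTO customer_dim (source_id, name) VALUES (?, ?)",
            load_image,
        )


load(conn, [("C1", "Asha"), ("C2", "Ben")])   # initial load
load(conn, [("C2", "Ben"), ("C3", "Chen")])   # incremental load

rows = conn.execute(
    "SELECT customer_key, source_id FROM customer_dim ORDER BY customer_key"
).fetchall()
```

`INSERT OR IGNORE` is one simple way to make the incremental load skip rows that are already in the warehouse; a real loader might instead compare timestamps or use change data capture.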

Load Manager Architecture

The load manager performs the following operations:
- Extract the data from the source systems
- Fast-load the extracted data into a temporary data store
- Perform simple transformations into a structure similar to the one in the data warehouse

Data Storage Component

The data storage for the data warehouse is a separate repository. The repository of an operational system holds current data in normalized form. In the data warehouse, analysts need data that is stable and that represents snapshots at specified periods.

Information Delivery Component
Information should be delivered to all types of data warehouse users. They may be novices, casual users, or business analysts. Delivery can take the form of ad hoc reports, complex queries, multidimensional (MD) analysis, statistical analysis, EIS feeds, and data mining.

Metadata Component
Metadata in a data warehouse is like the data dictionary in a DBMS. The data dictionary keeps information about the logical data structures and about files and addresses.

Types of Metadata
- Operational metadata: Data for the data warehouse comes from several operational systems of the enterprise, and the selected data elements come from different fields. When delivering information, we must be able to tie it back to the original source.
- Extraction metadata: Extraction and transformation metadata describes extraction frequencies, extraction methods, and business rules.
- End-user metadata: Helps end users find information and allows them to use their own business terminology.

Why Metadata
- It opens the door to end users and makes the content recognizable to them.
- It describes the content and structure of the data to users.
- It connects all parts of the data warehouse.

Management and Control Component

The management and control component coordinates the services and activities of the data warehouse. It controls the data transformation and the data transfer into data warehouse storage. Metadata is the source of information for the management module.

Warehouse Manager Architecture

The warehouse manager performs the following operations:
- Analyze the data to check consistency and referential integrity
- Transform and merge the source data in the temporary data store into the published data in the warehouse
- Create indexes and business views against the base data
- Update all existing aggregations
- Back up the data within the data warehouse, incrementally or in full
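Three of these warehouse manager operations (a referential integrity check, a merge from the temporary store, and an aggregation refresh) are sketched below with Python's built-in `sqlite3` module. All table names and sample rows are hypothetical, and the sketch rebuilds the aggregate from scratch rather than updating it incrementally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE staging_sales (product_key INTEGER, amount REAL);
    CREATE TABLE sales_fact (product_key INTEGER, amount REAL);
    CREATE INDEX idx_sales_fact_product ON sales_fact (product_key);
    CREATE TABLE sales_by_product (product_key INTEGER PRIMARY KEY, total REAL);
    INSERT INTO product_dim VALUES (1, 'pen'), (2, 'ink');
    INSERT INTO staging_sales VALUES (1, 20.0), (2, 5.0), (9, 7.0), (1, 10.0);
    """
)

# 1. Referential integrity: staged rows whose product is not in the dimension.
orphans = conn.execute(
    """
    SELECT s.product_key, s.amount
    FROM staging_sales AS s
    LEFT JOIN product_dim AS p ON p.product_key = s.product_key
    WHERE p.product_key IS NULL
    """
).fetchall()

with conn:  # one transaction: merge, then refresh the aggregation
    # 2. Merge only the valid staged rows into the published fact table.
    conn.execute(
        """
        INSERT INTO sales_fact
        SELECT s.product_key, s.amount
        FROM staging_sales AS s
        JOIN product_dim AS p ON p.product_key = s.product_key
        """
    )
    # 3. Rebuild the aggregation from the base data.
    conn.execute("DELETE FROM sales_by_product")
    conn.execute(
        """
        INSERT INTO sales_by_product
        SELECT product_key, SUM(amount) FROM sales_fact GROUP BY product_key
        """
    )

totals = dict(conn.execute("SELECT product_key, total FROM sales_by_product"))
```

The orphan rows are reported rather than loaded, so the published data never refers to a product that the warehouse does not know.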

Query Manager
The query manager is the system component that performs all the operations necessary to support the query management process. Its functions are:
- Direct queries to the appropriate tables
- Schedule the execution of user queries
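Both query manager functions can be sketched in a few lines of Python. The sketch assumes a hypothetical catalogue of summary tables at day, month, and year grain, and a simple priority queue for scheduling; the table names, the function `choose_table`, and the class `QueryScheduler` are all invented for this example.

```python
import heapq
import itertools

# Hypothetical catalogue: which table holds data at which grain.
TABLES_BY_GRAIN = {
    "day": "sales_fact",
    "month": "sales_by_month",
    "year": "sales_by_year",
}
GRAIN_ORDER = ["day", "month", "year"]  # finest to coarsest


def choose_table(requested_grain, available=None):
    """Pick the coarsest available table that can still answer the query."""
    available = TABLES_BY_GRAIN if available is None else available
    limit = GRAIN_ORDER.index(requested_grain)
    # Only tables at the requested grain or finer can answer the query.
    for grain in reversed(GRAIN_ORDER[: limit + 1]):
        if grain in available:
            return available[grain]
    raise LookupError(f"no table can answer grain {requested_grain!r}")


class QueryScheduler:
    """Run lower priority numbers first; ties run in submission order."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()

    def submit(self, priority, sql):
        heapq.heappush(self._heap, (priority, next(self._order), sql))

    def next_query(self):
        return heapq.heappop(self._heap)[2]
```

Directing a monthly query to `sales_by_month` instead of the detailed fact table is the kind of routing that keeps summary queries cheap.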

Benefits of a Data Warehouse

- A data warehouse provides a common data model for all data of interest, regardless of the data's source.
- Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
- Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
- Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
- Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems.
- Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.

Disadvantages of a Data Warehouse

- Data warehouses are not the optimal environment for unstructured data.
- Because data must be extracted, transformed, and loaded into the warehouse, there is an element of latency in data warehouse data.
- Over their life, data warehouses can have high costs. The data warehouse is usually not static, and maintenance costs are high.
- Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.
- There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems, and vice versa.

End-User Access Tools

End users interact with the warehouse using end-user access tools, which can be categorized into five main groups:
- Data reporting and query tools (e.g. Query by Example in the MS Access DBMS)
- Application development tools (applications used to access major database systems such as Oracle and Sybase)
- Executive information system (EIS) tools (for sales, marketing, and finance)
- Online analytical processing (OLAP) tools (allow users to analyze the data using complex, multidimensional views from multiple databases)
- Data mining tools (allow the discovery of new patterns and trends by mining a large amount of data using statistical and mathematical tools)

Data Warehousing: Data flows


Inflow, Upflow, Downflow, Outflow and Metaflow

Upflow
The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distributing the data.
- Summarizing the data works by choosing, projecting, joining, and grouping relational data into views that are more convenient and useful to the end users. Summarizing goes beyond simple relational operations to involve sophisticated statistical analysis, including identifying trends, clustering, and sampling the data.
- Packaging the data involves converting the detailed or summarized information into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation.
- Distributing the data to appropriate groups increases its availability and accessibility.

Downflow
The processes associated with archiving and backing up the data in the warehouse.
- Archiving: the effectiveness and performance of the warehouse are maintained by transferring older data of limited value to archive storage such as magnetic tape, optical disk, or other digital storage devices.
- If the databases in a warehouse are very big, partitioning is a useful design option. It fragments a table storing an enormous number of records into smaller tables, thus preserving data warehouse performance.
- The downflow of data includes the processes that ensure the current state of the data warehouse can be rebuilt following data loss or software/hardware failures. Archived data should be stored in a way that allows the data to be re-established in the warehouse when required.
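The partitioning and archiving ideas described for the archiving and backup flow can be illustrated with a small Python sketch. It assumes rows that carry an ISO-formatted `sale_date` string and partitions them by year; the function names and sample layout are hypothetical.

```python
from collections import defaultdict


def partition_by_year(rows):
    """Split one large table into smaller per-year tables."""
    partitions = defaultdict(list)
    for row in rows:
        year = row["sale_date"][:4]  # "YYYY-MM-DD" -> "YYYY"
        partitions[year].append(row)
    return dict(partitions)


def archive_older_than(partitions, cutoff_year):
    """Move partitions older than cutoff_year out of the active set.

    The archived partitions are returned intact so they can be restored
    into the warehouse later if required.
    """
    old_years = [year for year in partitions if int(year) < cutoff_year]
    return {year: partitions.pop(year) for year in old_years}
```

Partitioning by period makes archiving a matter of moving whole partitions, which is usually simpler than deleting individual old rows from one very large table.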

Outflow
The processes associated with making the data available to the end users. This involves two activities: accessing and delivering.
- Accessing is concerned with satisfying the end users' requests for the data they need. The main problem here is creating an environment in which users can effectively use the query tools to access the most appropriate data source.
- Delivering makes it possible to deliver information to the users' systems and workstations. This activity is referred to as a type of publish-and-subscribe process. The data warehouse publishes several business objects that are revised periodically by monitoring usage patterns. Users subscribe to the set of business objects that best meets their needs.
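The publish-and-subscribe style of delivery can be sketched in Python as below. The class name `DeliveryService`, the business object name `monthly_sales`, and the use of plain callbacks are assumptions made for illustration.

```python
from collections import defaultdict


class DeliveryService:
    """Users subscribe to business objects; the warehouse publishes them."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, business_object, callback):
        """Register a callback to receive each new version of an object."""
        self._subscribers[business_object].append(callback)

    def publish(self, business_object, payload):
        """Deliver payload to every subscriber; return how many received it."""
        callbacks = self._subscribers.get(business_object, [])
        for callback in callbacks:
            callback(payload)
        return len(callbacks)
```

A workstation interested in monthly sales would call `subscribe("monthly_sales", handler)` once and then receive every revised version that the warehouse publishes, without having to poll for it.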

Metaflow
The processes associated with managing the metadata (data about the data). The metadata is a description of the data contents of the data warehouse: what is in it, where it came from originally, and what has been done to it by way of cleansing, integrating, and summarizing.

Inflow
The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
- Extract the relevant data from multiple, heterogeneous, and external sources (commercial tools are used).
- Cleansing includes removing inconsistencies, adding missing fields, and cross-checking for data integrity.
- Transformation includes adding date/time stamp fields, summarizing detailed data, and deriving new fields to store calculated data.
- The data is then mapped and loaded into the warehouse.
