
Extract Transform and Load

Introduction to Data Warehousing

What is a data warehouse?


A data warehouse is a relational database designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: subject oriented, integrated, nonvolatile, and time variant.

Subject Oriented: Data warehouses are designed to help you analyze data.

Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve problems such as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile: Once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant: To discover trends in business, analysts need large amounts of historical data. This is in sharp contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
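The nonvolatile and time-variant properties can be illustrated with a minimal sketch, not tied to any specific product: instead of updating a record in place, each load appends a dated snapshot, so history is preserved for analysis. The store, the customer ID, and the balance figures below are invented for illustration.

```python
from datetime import date

warehouse = []  # append-only fact store: loaded rows are never modified

def load_snapshot(snapshot_date, rows):
    """Append a dated snapshot; existing history is never changed (nonvolatile)."""
    for row in rows:
        warehouse.append({"snapshot_date": snapshot_date, **row})

# Two monthly loads for the same customer: both versions are kept,
# so an analyst can see the balance as it was at each point in time.
load_snapshot(date(2023, 1, 31), [{"customer": "C001", "balance": 1200}])
load_snapshot(date(2023, 2, 28), [{"customer": "C001", "balance": 950}])

history = [r["balance"] for r in warehouse if r["customer"] == "C001"]
print(history)  # [1200, 950] - change over time is preserved
```

An OLTP system would typically hold only the latest balance; the warehouse keeps every snapshot, which is exactly what "time variant" means.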

Contrasting OLTP and Data Warehousing Environments

Data Warehouse Architecture

What is data warehousing?


Data warehousing concepts are used to design, create, and manage a data warehouse that provides a centralized company database. Data warehouses were first developed in the late 1980s and early 1990s in response to a need for business analysis that could not be met effectively by the operational database systems of the time. To meet this need, the process of recording, collecting, filtering, and loading data into a database was revised, streamlined, and customized to support analysis and decision-making. This differentiates these data repositories from the regular transactional systems that are central to operations.

Distinguishing Characteristics

Purpose, not form, is the distinguishing characteristic of these specialized data repositories. The form of the stored data and the type of database used can vary widely: data can be normalized or denormalized, and the database itself can take a number of forms, from an object database to a hierarchical, relational, flat-file, or multidimensional one. The data itself can change a number of times, and the database should be designed to accommodate this, but the most important basis for design is the way it is set up to support decision-making for a specific action or entity.

Design and Implementation Challenges

The implementation challenge starts with the collection of disparate data from several sources, including but not limited to the transactional and operational databases. The database structure must be able to merge old or existing data with new data and transform it into a standard format compatible with the data warehouse platform. Integrating disparate data requires resolving conflicts in areas such as naming and grouping conventions, units of measure, and possibly even time zones.

Benefits and Rewards

Because an efficient database needs to be both customized to meet a specific need and flexible enough to handle disparate and changing data, database design can be complicated. The rewards for successfully putting together and running a good database far outweigh the challenges, though. Some of these benefits include:

- providing a standardized data format for different data sources and interest areas, which not only streamlines analysis and reporting but also makes the data reusable across departments, interest groups, and levels
- allowing for more user control over data, paving the way for necessary purges and safer storage
- faster data retrieval that does not impede or slow down operations
- streamlined data processing for performance assessment, trend analysis, and forecasting reports
- stronger and faster decision-making processes for both core business operations and customer relationship management

In essence, data warehousing solutions are meant to enhance data collection and integration to enable accurate and timely reporting. Since good design translates to improved information handling and management, it supports quick, efficient, and informed business analysis and decision-making, which are essential to staying competitive and profitable. With such clear benefits, companies should commit resources and develop a strong enterprise vision to ensure that a workable data warehouse is put in place and maintained.
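The integration challenge described above can be sketched in a few lines. All names here (the two source row sets, the unified column names, the exchange rate) are hypothetical, chosen only to show naming-conflict and unit-of-measure resolution.

```python
# Two source systems describe the same kind of customer record with
# different column names and different currencies.
source_a = [{"cust_id": "C001", "revenue_usd": 1500.0}]
source_b = [{"CUSTOMER_NO": "C002", "REV_EUR": 2000.0}]

EUR_TO_USD = 1.1  # assumed fixed rate, purely for demonstration

def integrate():
    """Map both sources onto one consistent schema and one unit of measure."""
    unified = []
    for row in source_a:
        unified.append({"customer_id": row["cust_id"],
                        "revenue_usd": row["revenue_usd"]})
    for row in source_b:
        unified.append({"customer_id": row["CUSTOMER_NO"],
                        "revenue_usd": row["REV_EUR"] * EUR_TO_USD})
    return unified

rows = integrate()
print(rows)
```

Once every source is mapped to the same names and units, the repository is "integrated" in the sense used above, and downstream reports can treat all rows uniformly.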

Overview of ETL (Extract Transform and Load)


You need to load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems must be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. Note that ETL refers to a broad process, not to three well-defined steps.

What happens during the ETL process?


During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often it is not possible to identify the specific subset of interest, so more data than necessary is extracted and the relevant data is identified at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary from days or hours down to minutes or near real time. Web server log files, for example, can easily grow to hundreds of megabytes in a very short period of time.

After extraction, the data has to be physically transported to the target system or to an intermediate system for further processing. Depending on the chosen mode of transportation, some transformations can be done during this process, too. For example, a SQL statement that directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement.

During loading, you physically insert the new, clean data into the production data warehouse schema and take all of the other steps necessary (such as building indexes, validating constraints, and taking backups) to make this new data available to end users. Once all of the data has been loaded into the data warehouse, the materialized views have to be updated to reflect the latest data.
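The three phases described above can be condensed into a generic sketch. The source rows, the filter rule, and the target list are all assumptions for illustration and stand in for real source systems and warehouse tables, not for any specific ETL product.

```python
def extract():
    """Pull raw rows; more data than necessary may be extracted here."""
    return [
        {"order_id": 1, "amount": "250.00", "status": "complete"},
        {"order_id": 2, "amount": "99.50", "status": "cancelled"},
    ]

def transform(rows):
    """Identify the relevant subset and standardize types."""
    return [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows
        if r["status"] == "complete"
    ]

def load(rows, target):
    """Insert clean rows into the warehouse schema. A real system would
    also rebuild indexes, validate constraints, take backups, and refresh
    materialized views at this point."""
    target.extend(rows)
    return len(rows)

warehouse_fact_orders = []
loaded = load(transform(extract()), warehouse_fact_orders)
print(loaded)  # 1 - only the completed order survives the transform
```

Even in this toy form, the phases are not rigid steps: a filter could equally run during extraction, which is why ETL is best thought of as one broad process.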

Tools used for data warehousing


- IBM DataStage v7.5.3 (Windows platform)
- IBM DataStage v8.1 (Windows platform)
- iWay (UNIX & Windows platform)

Jobs
Jobs can be a UNIX script, a Java program, or any other program that can be invoked from a shell.

Basel II is a set of banking regulations put forth by the Basel Committee on Banking Supervision, which regulates finance and banking internationally. Basel II attempts to integrate the Basel capital standard with national regulations by setting the minimum capital requirements of financial institutions, with the goal of ensuring institution liquidity. There are three pillars of Basel II:

- Pillar 1: Minimum capital requirements
- Pillar 2: Supervisory review process
- Pillar 3: Market discipline requirements

As per Gartner's survey, compliance with Basel II requires software packages that can maximize a bank's ability to identify and measure risk and allocate capital to specific risks. Banks need a risk analytics solution to collate the data, analyze it, and report the findings (the measurement of risk). As a result, these functionalities have been addressed by the proposed solution in various phases.

There are two types of jobs:

- Basel jobs, which are critical for portfolio generation for the end user
- Non-Basel jobs, which are not critical for portfolio generation for the end user

Under Basel jobs there are three modules:

- Operational Data Store (ODS): a centralized data store containing current customer relationships and demographics as well as enterprise reference data that will be used by multiple future channel applications.
- Risk Data Repository (RDR): contains transactional details of the customer. RDR contains Retail and Wholesale applications. Retail applications are used for monthly and month-end portfolio generation, whereas Wholesale applications are used for daily portfolio generation.
- ODS to RDR (O2R): the data processed by the ODS jobs is merged with the RDR data in the staging area. The ODS data holds only customer-related information, whereas the RDR data is the customers' transactional data; the two are merged for portfolio generation.
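Since a job is simply any program invocable from a shell, a scheduler-style wrapper can be sketched as below. The wrapper and the example command are hypothetical; real schedulers decide whether to trigger dependent jobs (such as O2R after ODS) based on the same exit-code convention.

```python
import subprocess

def run_job(command):
    """Run one job via the shell; return True if it exited with status 0,
    which is the conventional signal that downstream jobs may proceed."""
    result = subprocess.run(command, shell=True,
                            capture_output=True, text=True)
    return result.returncode == 0

# A trivial shell command standing in for an ODS load script.
ok = run_job("echo ODS load complete")
print(ok)  # True
```

A failing job (non-zero exit code) would return False, and a scheduler would typically hold the dependent jobs and raise an alert instead of continuing the chain.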

Scheduling Tools
They are used for defining, scheduling, and monitoring IBM DataStage jobs for various frequencies of report generation. The frequency of portfolio generation varies widely depending on the nature and requirements of the project, for instance:

- Daily
- Weekly
- Monthly
- Month End
- 3rd Friday (dependent on the country-specific time zone)
- 3rd Business Day
- 5th Business Day

Scheduling tools currently used are:

- CA Workload Automation AE (Autosys Edition) (UNIX-AIX & Windows platform)
- CA Workload Automation CA 7 Edition (Mainframe platform)
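Two of the calendar-driven frequencies above can be computed with hypothetical helpers like the following. "Business day" is simplified here to Monday through Friday; a production schedule would also consult a holiday calendar, possibly per country and time zone, as the 3rd-Friday note implies.

```python
import calendar
from datetime import date

def third_friday(year, month):
    """Date of the third Friday of the given month."""
    fridays = [d for d in range(1, calendar.monthrange(year, month)[1] + 1)
               if date(year, month, d).weekday() == 4]  # 4 = Friday
    return date(year, month, fridays[2])

def nth_business_day(year, month, n):
    """Date of the nth weekday (Mon-Fri) of the given month."""
    count = 0
    for d in range(1, calendar.monthrange(year, month)[1] + 1):
        if date(year, month, d).weekday() < 5:  # Mon-Fri only
            count += 1
            if count == n:
                return date(year, month, d)

print(third_friday(2024, 3))         # 2024-03-15
print(nth_business_day(2024, 3, 3))  # 2024-03-05
```

A scheduler would evaluate such rules each day and trigger the corresponding portfolio-generation jobs only when the current date matches.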
