

Data Warehousing
Database: A database is an organized collection of related data.
Database Management System (DBMS): A collection of programs for creating, maintaining and manipulating data in a database. It allows an organization to store data in one location from which multiple users can access it.
Data Model: A data model is a theory or specification describing how a database is structured and used. The main models are:
I. Hierarchical Data Model
II. Network Data Model
III. Relational Data Model
Data Modeling: The process of designing the database is called data modeling or dimensional modeling. A database architect or data modeler designs the warehouse as a set of tables. Ex: Erwin, Oracle Designer, Power Designer.
RDBMS: A database based on the relational model; a collection of programs for maintaining data in a relational database. An RDBMS can be optimized for data warehouse, data mart and OLAP applications. Ex: Oracle, Microsoft SQL Server, IBM DB2.
Multidimensional Database: Designed for analyzing large groups of records. Ex: Hyperion Essbase, Oracle Express, Crystal Holos.
Table: A collection of rows and columns; the primary storage unit for data.

Data Warehouse
A data warehouse is an RDBMS specifically designed for business analysis and for making decisions that achieve business goals. Because it is designed to support the decision-making process, it is also called a Decision Support System (DSS). A data warehouse is a historical database, storing the historical business information required for analysis, and an integrated database, storing data in an integrated format collected from multiple OLTP source systems.
According to W. H. Inmon, a data warehouse is:
1. Time Variant (the business can be analyzed over different time periods)
2. Non-Volatile (data is static; it is not changed once loaded)
3. Subject Oriented (a subject is derived from multiple OLTP systems. Ex: Sales, Accounts, HR)
4. Integrated (collects data from multiple databases)
Data Warehousing: Data warehousing is the process of building a data warehouse: designing, building and maintaining a data warehouse system.

Data Warehousing Basics

Data Mart: A data mart is a subset of a data warehouse covering one business area or subject area. Ex: HR, Finance, Accounts, Sales. An enterprise is an integration of various departments (HR, Sales, Marketing), each served by a data mart; the integration of multiple data marts is an Enterprise Data Warehouse (EDW).
Data marts are divided into 2 types:
I. Dependent Data Mart: In the top-down approach a data mart's development depends on the EDW, so such data marts are known as dependent data marts.
II. Independent Data Mart: In the bottom-up approach a data mart is independent of the EDW, so such data marts are known as independent data marts.
Data Warehousing Approaches:
I. Top-Down approach: According to Inmon, first build the EDW; data marts are then created from the EDW design.
II. Bottom-Up approach: According to Kimball, data marts are created first; the EDW is defined by integrating the data marts.
Reasons to build a data warehouse:
- Data is scattered in different places
- Data inconsistency
- Source systems are volatile, while analysis needs non-volatile data
ODS (Operational Data Store): An operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data.
OLAP (Online Analytical Processing): OLAP is a set of specifications or technologies that allows client applications to retrieve data from a data warehouse. OLAP is the interface or gateway between the user and the database.
- ROLAP (Relational OLAP): Queries data from relational sources such as SQL Server, Oracle, Sybase and Teradata.
- MOLAP (Multidimensional OLAP): Queries data from a multidimensional data cube.
- HOLAP (Hybrid OLAP): A combination of ROLAP and MOLAP.
- DOLAP (Desktop OLAP or Database OLAP): Queries databases built with desktop databases such as dBASE or FoxPro, or with XML and TXT files.
OLTP (Online Transactional Processing) and OLAP (Online Analytical Processing) are the two major ways to organize data, each optimized for a different use.
A database is used in one of two modes: OLTP or OLAP.

OLTP:
- Stores transactional data: detailed, current data
- The transactional system is used to run the business
- Data is stored permanently and can be changed (volatile)
- Used by a large number of users
- Supports CRUD (Create, Read, Update, Delete)
- Transactional schema optimized for read/write; multiple joins
- Used for data entry and data retrieval. Ex: PeopleSoft database
- Application-oriented data; E-R modeling

OLAP:
- Stores analytical data: historical, summarized data
- The analytical system is used to analyze the business
- Data is non-volatile
- Used by a smaller number of users
- Supports read only
- Analytic schema optimized for querying large datasets; few joins
- Used for reports, charts and pivot tables. Ex: OBIEE
- Subject-oriented data; dimensional modeling
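To make the contrast concrete, here is a minimal sketch using Python's sqlite3; the `sales` table and its rows are purely illustrative, not from any real system. An OLTP-style statement touches one current row, while an OLAP-style query reads and summarizes many rows:

```python
import sqlite3

# Hypothetical sales table used only to contrast the two query styles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, product TEXT, qty INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "Widget", 2, 20.0),
    (2, "Gadget", 1, 15.0),
    (3, "Widget", 5, 50.0),
])

# OLTP-style: read/write access to one detailed, current row.
conn.execute("UPDATE sales SET qty = qty + 1 WHERE order_id = 1")
row = conn.execute("SELECT qty FROM sales WHERE order_id = 1").fetchone()
print(row[0])  # 3

# OLAP-style: read-only aggregation over many rows (summarized data).
total = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(total)  # [('Gadget', 15.0), ('Widget', 70.0)]
```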

Columns: In a data warehouse, any table has 3 types of columns: 1. Key columns 2. Descriptive columns 3. Measure columns. Ex: in an EMP table, EmpNo and DeptNo are key columns; Ename and JobDesc are descriptive columns; Sal and Comm are measure columns.
Dimension Table: A table that contains only key columns and descriptive columns (the data is often descriptive/alphanumeric). It qualifies the fact data. Ex: product line, address, month, year. A dimension table has a primary key column, and data always flows from the dimension table to the fact table. Note: typical data types are char, varchar and date.
Fact Table: A fact table typically has two types of columns: foreign keys to dimension tables (key columns) and measures (facts) containing numeric data. It is the central table in a star schema. Ex: sales revenue, units sold, units shipped.
What is the relationship between dimension tables and fact tables? The relationship from dimension to fact table is 1:M, with the "many" side on the fact table.
Dimension: Dimensions are categories of attributes by which the business is defined. A dimension table typically has two types of columns: primary keys referenced by fact tables, and textual or descriptive data. Common dimensions are time periods, products, markets and customers.
i) Slowly Changing Dimension (SCD): A dimension that changes over a period of time. Ex: customer, country and state names may change over time. SCDs are categorized into three types:
Type 1: Overwrite the old values (stores only current data in the target; no history is maintained)
Type 2: Insert an additional record (a new record for each update, preserving full history)
Type 3: Add new fields (maintains partial history: current and previous values)
ii) Rapidly Changing Dimension: A dimension attribute that changes frequently is a rapidly changing attribute. A dimension is considered rapidly changing if one or more of its attributes changes frequently in many rows. For a rapidly changing dimension, the dimension table can grow very large from the application of numerous Type 2 changes. The terms "rapid" and "large" are relative.
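A minimal SCD Type 2 sketch in Python with sqlite3, assuming a hypothetical `dim_customer` table (the table, column names and sample data are illustrative): on a changed attribute, the current row is expired and a new current row is inserted, so history is preserved.

```python
import sqlite3

# Illustrative SCD Type 2 dimension: surrogate key plus effective-date columns.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    cust_key   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    cust_id    INTEGER,                            -- natural (source) key
    city       TEXT,
    eff_date   TEXT,
    end_date   TEXT,
    is_current INTEGER)""")

def scd2_update(conn, cust_id, new_city, today):
    cur = conn.execute(
        "SELECT cust_key, city FROM dim_customer WHERE cust_id = ? AND is_current = 1",
        (cust_id,)).fetchone()
    if cur and cur[1] != new_city:
        # Type 2: expire the old row, keeping it as history...
        conn.execute("UPDATE dim_customer SET end_date = ?, is_current = 0 WHERE cust_key = ?",
                     (today, cur[0]))
    if cur is None or cur[1] != new_city:
        # ...and insert a fresh current row with a new surrogate key.
        conn.execute("INSERT INTO dim_customer (cust_id, city, eff_date, end_date, is_current) "
                     "VALUES (?, ?, ?, '9999-12-31', 1)", (cust_id, new_city, today))

scd2_update(conn, 101, "Dallas", "2020-01-01")
scd2_update(conn, 101, "Austin", "2021-06-15")   # customer moved: history preserved
rows = conn.execute("SELECT city, is_current FROM dim_customer ORDER BY cust_key").fetchall()
print(rows)  # [('Dallas', 0), ('Austin', 1)]
```

A Type 1 change would simply UPDATE the city in place; a Type 3 change would store it in an extra "previous_city" column.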


One solution is to move the attribute into its own dimension, with a separate foreign key in the fact table. This new dimension is called a rapidly changing dimension.
iii) Conformed Dimension: A dimension table that is connected to more than one fact table is a conformed dimension for those fact tables. In a schema, if a dimension table acts as a source for more than one fact table, it is called a conformed dimension.
iv) Junk Dimension: A dimension that cannot be used to describe key performance indicators (facts). Ex: phone number, fax number, customer address.
v) Degenerate Dimension: Data that is dimensional in nature but stored in the fact table; the table contains both descriptive columns and measure columns. Ex: order number, invoice date.
vi) Role-Playing Dimension: A dimension that can play different roles in a fact table depending on the context. For example, a "Date" dimension can be used for "Date of Sale" as well as "Date of Delivery" or "Date of Hire".
vii) Mini Dimension: A rapidly changing dimension can grow very large, perhaps too large. To control its size, split some attributes off into mini dimensions.
viii) Subset Dimension: A performance tuning technique used in the Siebel Analytics Data Warehouse. Subset dimension tables are extracted from dimension tables using a filter on a particular attribute of the parent dimension table; they contain all the columns of the parent dimension table. Subset dimension tables are primarily used in conjunction with aggregate tables.
ix) Shrunken Dimension: A subset of another dimension.
x) Monster Dimension: A very large dimension.
There are 3 types of facts/measures:
i) Additive Facts: Facts that can be summed up across all of the dimensions in the fact table. A sales amount is a good example of an additive fact.
ii) Semi-Additive Facts: Facts that can be summed up across some of the dimensions in the fact table, but not the others. Ex: a daily balances fact can be summed across the customer dimension but not across the time dimension.
iii) Non-Additive Facts: Facts that cannot be summed up across any of the dimensions in the fact table. Ex: facts holding percentages, ratios, pressure or temperature.
Factless Fact: A factless fact is a fact table that contains no measures or facts. Such tables are also called "junction tables". Ex: analyzing male and female employees in an industry, or attendance (present or absent) in a classroom.
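The balance example can be sketched in a few lines of Python; the accounts, days and amounts are invented for illustration. Summing balances across accounts is meaningful, summing one account's balance across days is not, so across time you take e.g. the closing balance:

```python
# Semi-additive sketch: (day, account) -> end-of-day balance; numbers illustrative.
balances = {
    ("Mon", "A"): 100, ("Mon", "B"): 50,
    ("Tue", "A"): 120, ("Tue", "B"): 30,
}

# Additive across the account dimension: total balance held on Tuesday.
tue_total = sum(v for (day, _), v in balances.items() if day == "Tue")
print(tue_total)  # 150

# Not additive across time: summing account A over days double-counts money.
wrong = sum(v for (_, acct), v in balances.items() if acct == "A")
print(wrong)  # 220 -- not a meaningful balance

# Meaningful across time: take the latest (closing) balance instead.
closing_a = balances[("Tue", "A")]
print(closing_a)  # 120
```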


Cube: A cube is a logical organization of measures with identical dimensions. The edges of the cube contain dimension members and the body of the cube contains data values. For example, sales data can be organized into a cube whose edges contain values from the time, product and customer dimensions, and whose body contains Volume Sales and Dollar Sales data. In a star schema, a cube is represented by a fact table.
Schema: The arrangement of tables is called a schema. Based on the arrangement of tables, schemas are divided into 3 types:
1. Star Schema:
- Organizes data into a central fact table with surrounding dimension tables
- Dimension tables do not relate directly to each other
- Each dimension row has many associated fact rows
- Fact and dimension tables are connected with only one join, so schema performance is good
2. Snowflake Schema: One big dimension table is split into n dimension tables.
- In a snowflake schema, a fact table and its dimension tables are connected through n joins, so performance degrades.
3. Mixed Schema (or Constellation Schema or Galaxy Schema):
- A combination of star and snowflake schemas. A constellation schema can join 2 fact tables (also called a fact constellation).
Dimensional Modeling: Dimensional modeling is a technique for logically organizing business data in a way that helps end users understand it.
- Data is separated into facts and dimensions
- Users view facts in any combination of the dimensions
Federated Query: A query fired against two different sources, e.g. retrieving data from both Oracle and XML. Note: reporting tools generally avoid federated queries because of their poor performance.
Have you worked on federated queries? No; we use a warehouse, which is a single source, so there is no occasion to use a federated query.
Surrogate Key: A surrogate key is an artificial, system-generated sequential number that is treated as a primary key.
Metadata: Data about data. Metadata tools are used for gathering, storing, updating and retrieving the business and technical metadata of an organization.
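A tiny star-schema sketch with sqlite3, assuming hypothetical `dim_product` and `fact_sales` tables: the dimension holds descriptive data keyed by a surrogate key, the fact table holds measures plus a foreign key, and a report needs only one join per dimension.

```python
import sqlite3

# Illustrative star schema: one dimension, one fact table, surrogate key join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (prod_key INTEGER PRIMARY KEY, prod_name TEXT)")
conn.execute("CREATE TABLE fact_sales (prod_key INTEGER, units INTEGER, revenue REAL)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 2, 20.0), (1, 3, 30.0), (2, 1, 15.0)])

# Dimension-to-fact is 1:M, so each dimension row matches many fact rows,
# and the report query uses a single join per dimension.
rows = conn.execute("""
    SELECT d.prod_name, SUM(f.units), SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product d ON d.prod_key = f.prod_key
    GROUP BY d.prod_name ORDER BY d.prod_name""").fetchall()
print(rows)  # [('Gadget', 1, 15.0), ('Widget', 5, 50.0)]
```

A snowflake version would split `dim_product` further (e.g. into a product-category table), adding a second join to the same query.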


ETL
ETL (Extract, Transform and Load): ETL is a process in database usage, and especially in data warehousing, that involves:
- Extracting data from outside sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database or data warehouse)
Code-Based ETL: An ETL application developed using programming languages such as SQL or PL/SQL. Ex: Teradata and SAS ETL utilities.
GUI-Based ETL: The ETL application is designed with a graphical user interface using point-and-click techniques. Ex: Informatica, DataStage, Ab Initio, ODI.
Staging: A temporary storage area where data transformation activities take place.
Data Acquisition: The process of extracting the relevant business information, transforming it into the required business format and loading it into the target system. The process of data acquisition:
1) Data Extraction: Reading the data from various source systems.
2) Data Transformation: Transforming the data into the required business format.
a. Data Cleansing: Removing unwanted data in staging. Ex: removing spaces, removing duplicates, null handling.
b. Data Scrubbing: Deriving new attributes (new table columns).
c. Data Aggregation: Calculating summaries from detailed data.
d. Data Merging: Integrating data from multiple source systems.
i. Horizontal Merging: Merging records horizontally using joins.
ii. Vertical Merging: Merging records vertically, using a union, when the two sources have the same metadata.
3) Data Loading: Inserting the data into the target system.
a. Initial Load (or Full Load): The first time the data is loaded.
b. Incremental Load (or Delta Load): Loading new or updated data on subsequent runs.
ETL Plan: An ETL plan defines extraction, transformation and loading. In Informatica, an ETL plan is called a mapping.
ETL Client: A graphical user component in which an ETL developer designs the ETL plan.
ETL Repository: The place where metadata, such as ETL plans, is stored. The ETL repository is the brain of the ETL system.
ETL Server: The ETL engine that performs extraction, transformation and loading.
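The data acquisition steps above can be sketched in plain Python; the source rows, the bonus derivation and the department totals are all invented for illustration:

```python
# Minimal ETL sketch: extract -> cleanse -> scrub -> aggregate -> load.
source = [  # extracted rows: (name, dept, salary) -- note the dirty values
    (" alice ", "HR", 1000),
    ("bob", "HR", 2000),
    ("bob", "HR", 2000),        # duplicate row
    ("carol", "Sales", None),   # null to handle
]

# Cleansing: trim spaces, handle nulls, drop duplicates.
cleansed, seen = [], set()
for name, dept, sal in source:
    row = (name.strip(), dept, sal if sal is not None else 0)
    if row not in seen:
        seen.add(row)
        cleansed.append(row)

# Scrubbing: derive a new attribute (a hypothetical 10% bonus column).
scrubbed = [(n, d, s, s * 0.1) for n, d, s in cleansed]

# Aggregation: summaries (salary total per department) from the detailed data.
totals = {}
for _, dept, sal, _ in scrubbed:
    totals[dept] = totals.get(dept, 0) + sal

# Loading would insert `totals` into the target; here we just show the result.
print(sorted(totals.items()))  # [('HR', 3000), ('Sales', 0)]
```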


Informatica Power Center 8.6.0


Informatica PowerCenter is a GUI-based ETL tool from Informatica, a company established in 1993. Up to Informatica 7 it used a client-server architecture; from 8.6.0 it uses a service-oriented architecture (SOA).
Mapping: A graphical representation of the data flow from source to target; a mapping defines the ETL process. A mapping is designed with the following components:
I. Source Definition: The structure of the source table or file from which data will be extracted.
II. Target Definition: The structure of the target tables into which data is loaded.
III. Transformation Logic: Defines the data transformation.

Components of Informatica
When Informatica PowerCenter is installed, the following components are installed.
1. PowerCenter Domain: The primary unit for management and administration. The PowerCenter domain is the collection of all the servers required to support PowerCenter functionality. Each domain has gateway hosts (called domain servers). Whenever you want to use a PowerCenter service, you send a request to the domain server; based on the request type, it redirects your request to one of the PowerCenter services.
2. PowerCenter Client: The PowerCenter Client consists of multiple tools used to code and give instructions to the PowerCenter servers. They are used to manage users, define sources and targets, build mappings and mapplets with the transformation logic, and create workflows to run the mapping logic. The PowerCenter Client connects to the repository through the Repository Service to fetch details, and connects to the Integration Service to start workflows.
3. PowerCenter Repository: The repository is a relational database that stores all the metadata created in PowerCenter. Whenever you develop a mapping, session or workflow, execute them, or do anything meaningful, entries are made in the repository.
4. Integration Service: Extracts data from sources, processes it per the business logic and loads data into the target database. It is the ETL engine.
5. Repository Service: The service that understands the content of the repository, fetches data from the repository and sends it back to the requesting components (mostly the client tools and the Integration Service).
6. Web Services Hub: Exposes PowerCenter functionality to external clients through web services.
7. PowerCenter Administration Console: A web-based component used for all administration activities.

Power Center Client Applications (GUI based Client components)


1. PowerCenter Designer: Create ETL mappings
2. PowerCenter Workflow Manager: Create and start workflows
3. PowerCenter Workflow Monitor: Monitor and control workflows
4. PowerCenter Repository Manager: Manage repository contents


PowerCenter Designer: Used to create the ETL process, known as a mapping. Steps:
i. Create an ODBC connection
ii. Import the source
iii. Import the target
iv. Connect source and target in a mapping
v. Create transformations
PowerCenter Workflow Manager: Allows you to create PowerCenter run-time objects:
1. Create a session for each mapping
2. Create a workflow to start sessions
Session: A session is a PowerCenter object that runs a mapping; a session is a pointer to a mapping. A session must be inside a workflow and cannot execute on its own.
Workflow: A workflow can run one or more sessions, either sequentially or in parallel. Steps:
i. Create the workflow
ii. Create a task
iii. Create database connections
iv. Link the session to the workflow and run the workflow
PowerCenter Workflow Monitor: Monitors the sessions and workflows running on the Integration Service.
PowerCenter Repository Manager: Manages repository connections, folders, objects, users and groups.

Transformation:
Transformations transform the source data according to the requirements of the target system, using transformation (business) logic to process the data.
Passive and Active Transformations: There are 2 types of transformations.
- Passive Transformation: The number of records in the source and the number of records in the target are equal.
- Active Transformation: The number of records in the source and the number of records in the target are not equal.
Connected and Unconnected Transformations: A transformation that is part of the mapping data flow is a connected transformation: it is connected to the source and to the target. A transformation that is not part of the mapping data flow is an unconnected transformation: it is connected to neither the source nor the target.
Transformation types (some examples):
Active Transformations:
- Rank Transformation
- Aggregator Transformation
- Sorter Transformation
- Router Transformation
- Joiner Transformation
- Union Transformation
- Update Strategy Transformation
- Normalizer Transformation
- Transaction Control Transformation
- Filter Transformation
- XML Source Qualifier Transformation
Passive Transformations:
- Lookup Transformation
- Expression Transformation
- Stored Procedure Transformation
- Sequence Generator Transformation
Expression Transformation: Used to perform non-aggregate functions, i.e. to calculate values one row at a time. Example: calculating the discount for each product, concatenating first and last names, or converting a date to a string field.
Aggregator Transformation: Calculates values on groups of rows rather than row by row. The Aggregator transformation performs aggregate functions such as average, sum and count on multiple rows or groups. The Integration Service performs these calculations as it reads the data, storing group and row data in an aggregate cache.
Difference between Aggregator and Expression Transformations: The Expression transformation only permits calculations on a row-by-row basis; the Aggregator can perform calculations on groups.
Filter Transformation: Passes rows that meet the specified filter condition and removes rows that do not. For example, finding all employees working in New York, or all faculty members teaching chemistry in a state. The input ports for the filter must come from a single transformation; you cannot concatenate ports from more than one transformation into the Filter transformation. Components: Transformation, Ports, Properties, Metadata Extensions.
Joiner Transformation: An active and connected transformation used to join data from two related heterogeneous sources residing in different locations, or from the same source. To join two sources there must be at least one pair of matching columns between them, and one source must be specified as master and the other as detail. For example: joining a flat file and a relational source, two flat files, or a relational source and an XML source. The Joiner transformation supports the following join types: Normal, Master Outer, Detail Outer, Full Outer.
Rank Transformation: Ranks the data and creates a rank index.
Sorter Transformation: Sorts the data.
Router Transformation: Filters the data based on multiple conditions, unlike the Filter transformation, which supports a single condition.
Lookup Transformation: Used to look up data in either a relational database or a flat file. The Lookup transformation uses 4 types of caches:
- Static: read-only; the cache does not change during the session
- Dynamic: rows can be inserted into and updated in the cache during the session
- Persistent: saved and reused between session runs
- Shared: can be used across different sessions
Union Transformation: Merges data from heterogeneous or homogeneous sources.
Update Strategy Transformation: Used to flag rows for insert, update, delete or reject on the target table. The 4 flag constants are:
DD_INSERT = 0
DD_UPDATE = 1
DD_DELETE = 2
DD_REJECT = 3
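The flag constants above drive row-level behavior during an incremental (delta) load. A sketch of that flagging logic in plain Python, with invented target and incoming data and a hypothetical "None means invalid" reject rule:

```python
# Sketch of update-strategy flagging for a delta load: compare incoming rows
# to the target and tag each row with one of the DD_* codes from the text.
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3

target = {101: "Dallas", 102: "Boston"}               # key -> current target value
incoming = {101: "Austin", 103: "Denver", 104: None}  # None = invalid row, reject

flags = {}
for key, value in incoming.items():
    if value is None:
        flags[key] = DD_REJECT          # invalid row: reject
    elif key not in target:
        flags[key] = DD_INSERT          # new key: insert
    elif target[key] != value:
        flags[key] = DD_UPDATE          # changed value: update

print(sorted(flags.items()))  # [(101, 1), (103, 0), (104, 3)]
```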

