
What is Data Warehousing?

A data warehouse is a repository of integrated information, available for queries and analysis.

Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. Typical relational databases are designed for on-line transaction processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.

What are Data Marts?

A data mart is a segment of a data warehouse that provides data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses, and are usually smaller than the corporate data warehouse.

What is an ER Diagram?

ER stands for entity-relationship diagram. It is the first step in the design of a data model, which will later lead to the physical database design of an OLTP or OLAP database.

What is a Star Schema?

A relational database schema organized around a central table (the fact table) joined to a few smaller tables (dimension tables) using foreign key references. The fact table contains raw numeric items that represent relevant business facts (price, discount values, number of units sold, dollar value, etc.).

What is Dimensional Modeling?

Dimensional modeling is a design concept used by many data warehouse designers to build their data warehouses. In this design model all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions along which the facts are calculated.

What is a Snowflake Schema?

Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data is grouped into multiple tables instead of one large table.
For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.

What are the different methods of loading dimension tables?

The data in dimension tables may change over time. Depending on how you want to treat the historical data in dimension tables, there are three different ways of loading the slowly changing dimensions: 1. Type One Dimension: do not keep history. Update the record if found, else insert the data. 2. Type Two Dimension: do not update the existing record. Create a new record (with a version number or change date as part of the key) for the dimension, while retaining the old one.

3. Type Three Dimension: keep more than one column for each changing attribute. The new value of the attribute is recorded in the existing record, in a previously empty column. Type 2 dimensions are the most commonly used.

What are Aggregate tables?

An aggregate table contains a summary of existing warehouse data, grouped to certain levels of the dimensions. Retrieving the required data from the actual table, which may have millions of records, takes more time and also affects server performance. To avoid this we can aggregate the table to a certain required level and use that. These tables reduce the load on the database server, increase query performance, and return results very fast.

What is the difference between OLTP and OLAP?
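A minimal sketch of the Type 1 vs. Type 2 loading logic above, in Python with an in-memory SQLite table (the table and column names here are illustrative, not from any specific warehouse):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Illustrative customer dimension with a surrogate key and a current-row flag.
cur.execute("""CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id TEXT, city TEXT, is_current INTEGER)""")
cur.execute("INSERT INTO dim_customer (customer_id, city, is_current) "
            "VALUES ('C1', 'Dallas', 1)")

def load_type1(customer_id, city):
    # Type 1: overwrite in place; history is lost.
    cur.execute("UPDATE dim_customer SET city = ? WHERE customer_id = ?",
                (city, customer_id))

def load_type2(customer_id, city):
    # Type 2: expire the current row and insert a new version.
    cur.execute("UPDATE dim_customer SET is_current = 0 "
                "WHERE customer_id = ? AND is_current = 1", (customer_id,))
    cur.execute("INSERT INTO dim_customer (customer_id, city, is_current) "
                "VALUES (?, ?, 1)", (customer_id, city))

load_type2("C1", "Austin")  # customer moved; the old Dallas row is retained
rows = cur.execute("SELECT city, is_current FROM dim_customer "
                   "WHERE customer_id = 'C1'").fetchall()
print(rows)  # [('Dallas', 0), ('Austin', 1)]
```

A Type 1 load would instead call `load_type1`, leaving a single row with the new city and no trace of the old value.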

OLTP: current data; short database transactions; online update/insert/delete; normalization is promoted; high-volume transactions; transaction recovery is necessary.

OLAP: current and historical data; long database transactions; batch update/insert/delete; denormalization is promoted; low-volume transactions; transaction recovery is not necessary.

What is ETL?

ETL stands for extraction, transformation and loading. ETL technology is used for extracting information from the source database and loading it into the target database, with the necessary transformations done in between.

What is a Fact table?

A fact table is the central table of a star schema. It contains the numeric measurements of the business (the facts) and foreign keys to the dimension tables.

What is a dimension table?

A dimension table is a table in a data warehouse whose entries describe data in a fact table; dimension tables contain the data from which dimensions are created. A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up, and it contains mostly textual attributes.

What is a lookup table?

A lookup table is, as the name suggests, a 'lookup': it supplies values to a referencing table. It is used at run time, and it saves joins and space in transformations. For example, a lookup table called STATES can provide the actual state name ('Texas') in place of the code 'TX' in the output.

What is real-time data warehousing?

In real-time data warehousing, your warehouse contains completely up-to-date data and is synchronized with the source systems that provide the source data. In near-real-time data warehousing, there is a minimal delay between source data being generated and becoming available in the data warehouse. Therefore, if you want to achieve real-time or near-real-time updates to your data warehouse, you'll need to do three things: 1. Reduce or eliminate the time taken to get new and changed data out of your source systems. 2. Eliminate, or reduce as much as possible, the time required to cleanse, transform and load your data. 3. Reduce as much as possible the time required to update your aggregates. Starting with version 9i, and continuing with the latest 10g release, Oracle has gradually introduced features into the database to support real-time and near-real-time data warehousing. These features include: Change Data Capture; external tables, table functions, pipelining, and the MERGE command; and fast-refresh materialized views.

What is data mining?

Data mining is the process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be used to mine data for the highest-risk people to insure in a certain geographical area.

What are Normalization, First Normal Form, Second Normal Form, and Third Normal Form?

Normalization: the process of decomposing tables to eliminate data redundancy is called normalization. 1NF: the table should contain only scalar or atomic values. 2NF: the table should be in 1NF and have no partial functional dependencies. 3NF: the table should be in 2NF and have no transitive dependencies.

What type of indexing mechanism do we need to use for a typical data warehouse?

Space requirements for indexes in a warehouse are often significantly larger than the space needed to store the data, especially for the fact table and particularly if the indexes are B*-trees. Hence, you may want to keep indexing on the fact table to a minimum.
Typically, you may have one or two concatenated B*-tree indexes on the fact table; however, most of your indexes should be bitmap indexes. Bitmap indexes on the foreign key columns of the fact table are often useful for star query transformation. The maintenance window will also dictate whether you use partitioned indexes, which can be faster and easier to maintain. Bitmap indexes also take up much less space than B*-tree indexes and so should be preferred. On the other hand, dimension tables are much smaller than the fact table and can be indexed much more extensively. Any column of the dimension table that is frequently used in selections, or that is a level in a dimension object, is a good candidate for indexing.
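To illustrate why bitmap indexes suit low-cardinality foreign key columns, here is a toy sketch of the underlying idea in Python (a conceptual model, not Oracle's actual implementation): each distinct value keeps one bit per row, and predicates combine with cheap bitwise operations.

```python
# Toy bitmap index over a low-cardinality column of a fact table.
# Each distinct value maps to an integer used as a bit vector (bit i = row i).
def build_bitmap_index(column):
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps

region = ["N", "S", "N", "E", "S", "N"]    # illustrative FK column values
product = ["A", "A", "B", "A", "B", "A"]

region_idx = build_bitmap_index(region)
product_idx = build_bitmap_index(product)

# WHERE region = 'N' AND product = 'A'  ->  bitwise AND of two bitmaps
match = region_idx["N"] & product_idx["A"]
matching_rows = [i for i in range(len(region)) if match & (1 << i)]
print(matching_rows)  # [0, 5]
```

With only a handful of distinct values per column, the whole index is a few bit vectors, which is why bitmap indexes are so much smaller than B*-tree indexes on the same columns.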

Which columns go to the fact table and which columns go to the dimension table?

The aggregation or calculated-value columns go to the fact table, and descriptive detail information goes to the dimension tables.

What is the level of granularity of a fact table?

Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each individual transaction, or to aggregate it up to, say, the minute and store that. The level of granularity is the amount of detail you are willing to keep for each transactional fact.

What are non-additive facts?

A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts at the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition; a common example is sales. Non-additive facts cannot be added at all; an example is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others; an example is inventory levels, where you cannot tell what a level means simply by looking at it.
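A small, hypothetical illustration of additive vs. semi-additive behavior (the store names and figures are invented for the example):

```python
# Daily snapshots for two stores: (store, day) -> (inventory_level, units_sold).
# Illustrative numbers only.
snapshots = {
    ("S1", "Mon"): (100, 5), ("S2", "Mon"): (80, 3),
    ("S1", "Tue"): (95, 7),  ("S2", "Tue"): (77, 4),
}

# units_sold is additive: summing across BOTH stores and days is meaningful.
total_sold = sum(v[1] for v in snapshots.values())
print(total_sold)  # 19 units sold over all stores and days

# inventory_level is semi-additive: summing across stores on one day is fine...
mon_inventory = sum(v[0] for (store, day), v in snapshots.items() if day == "Mon")
print(mon_inventory)  # 180 units in stock on Monday

# ...but summing one store's level across days double-counts the stock;
# an average (or the last snapshot) is used along the time dimension instead.
s1_levels = [v[0] for (store, day), v in snapshots.items() if store == "S1"]
avg_s1 = sum(s1_levels) / len(s1_levels)
print(avg_s1)  # 97.5
```

The asymmetry in the last two queries is exactly what "can be aggregated along some dimensions but not others" means.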

What are semi-additive and factless facts, and in which scenarios would you use such fact tables?

Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not for others. For example, suppose Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact: it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information).

How do you load the time dimension?

Every data warehouse maintains a time dimension. It is kept at the most granular level at which the business runs (e.g. week, day of the month, and so on). Depending on the data loads, these time dimensions are updated: a weekly process gets updated every week, and a monthly process every month.

Why are OLTP database designs not generally a good idea for a data warehouse?

OLTP systems do not store historical information about the organization. They are used for storing the details of daily transactions, while a data warehouse is a huge store of historical information, obtained from different data marts, for making intelligent decisions about the organization.
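Loading a date-grained time dimension usually just means generating one row per calendar day with the attributes users drill on. A minimal sketch (the column names are illustrative):

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """One row per day between start and end, inclusive."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # surrogate-style key, e.g. 20050325
            "day_of_week": d.strftime("%A"),
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

dim = build_time_dimension(date(2005, 3, 25), date(2005, 3, 28))
print(len(dim))            # 4 days -> 4 rows
print(dim[0]["date_key"])  # 20050325
```

In practice the dimension is generated years ahead in one batch, so the periodic "update" is only extending the calendar, not reloading it.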

Why should you put your data warehouse on a different system than your OLTP system?

A data warehouse is part of OLAP (On-Line Analytical Processing). It is the source from which BI tools fetch data for analytical, reporting or data mining purposes. It generally contains data spanning the whole life cycle of the company/product. A DWH contains historical, integrated, denormalized, subject-oriented data. The OLTP system, on the other hand, contains data that is generally limited to the last couple of months or a year at most. The nature of data in OLTP is current, volatile and highly normalized. Since the two systems are different in nature and functionality, we should always keep them on different systems.

What are the differences between the snowflake and star schemas? In what situations is the snowflake schema better than the star schema, and when is the opposite true?

A star schema contains the dimension tables mapped around one or more fact tables. It is a denormalized model: there is no need for complicated joins, and queries return results fast. A snowflake schema is the normalized form of a star schema. It contains deeper joins, because the tables are split into many pieces. We can easily make modifications directly in the tables, but we have to use complicated joins, since we have more tables, and there will be some delay in processing queries.

What is the main difference between a schema in an RDBMS and schemas in a data warehouse?

RDBMS schema: used for OLTP systems; the traditional, older schema style; normalized; difficult to understand and navigate; cannot easily solve complex extract problems; poorly modeled for analysis. DWH schema: used for OLAP systems; a newer-generation schema style; denormalized; easy to understand and navigate; complex extract problems can be easily solved; a very good model for analysis.

What is a degenerate dimension?

In simple terms, it is a column in the fact table that does not map to any dimension, and is not a measure column either.

For example, Invoice_No and Invoice_Line_No in the fact table are degenerate dimension columns, provided you don't have a dimension called Invoice.

What are the possible data marts in retail sales?

Product information, store, and time.

What is meant by metadata in the context of a data warehouse?

Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems. Metadata synchronization is the process of consolidating, relating and synchronizing data elements with the same or similar meaning from different systems. Metadata synchronization joins these differing elements together in the data warehouse to allow for easier access.

What is a surrogate key? Where do we use it?

A surrogate key is a substitution for the natural primary key. It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key. It is useful because the natural primary key (e.g. Customer Number in the Customer table) can change, and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system; as far as the client is concerned, you may display only the AIRPORT_NAME. (Adapted from a response by Vincent, Thursday, March 13, 2003.) Another benefit you get from surrogate keys (SIDs) is tracking slowly changing dimensions (SCDs). A simple, classical example: on the 1st of January 2002, employee 'E1' belongs to business unit 'BU1' (that is what would be in your Employee dimension), and this employee has turnover allocated to him on business unit 'BU1'. But on the 2nd of June, employee 'E1' is moved from business unit 'BU1' to

business unit 'BU2'. All new turnover has to belong to the new business unit 'BU2', but the old turnover should still belong to business unit 'BU1'. If you used the natural business key 'E1' for the employee within your data warehouse, everything would be allocated to business unit 'BU2', even what actually belongs to 'BU1'. If you use surrogate keys, you can create, on the 2nd of June, a new record for employee 'E1' in your Employee dimension with a new surrogate key. This way, in your fact table, your old data (before the 2nd of June) carries the SID of employee 'E1' + 'BU1', and all new data (after the 2nd of June) takes the SID of employee 'E1' + 'BU2'. You could consider a slowly changing dimension as an enlargement of your natural key: the natural key of the employee was employee code 'E1', but it effectively becomes employee code + business unit, i.e. 'E1' + 'BU1' or 'E1' + 'BU2'. The difference from a natural-key enlargement, however, is that you might not have every part of the new key in your fact table, so you might not be able to join on the enlarged key; that is why you need another id.

What is a data warehousing hierarchy?

Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure. Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension there might be two hierarchies: one for product categories and one for product suppliers. Dimension hierarchies also group levels from general to granular.
Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse. When designing hierarchies, you must consider the relationships in business structures: for example, a divisional multilevel sales organization. Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly. Levels: a level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies. Level relationships: level relationships specify the top-to-bottom ordering of levels, from the most general (the root) to the most specific information. They define the parent-child relationships between the levels in a hierarchy.

Hierarchies are also essential components in enabling more complex query rewrites. For example, the database can aggregate existing sales revenue on a quarterly basis up to a yearly aggregation when the dimensional dependencies between quarter and year are known.

What is a conformed fact?

When the relationships between the facts and dimensions are in 3NF and can work with any type of join, the schema is called a conformed schema, and the members of that schema are called conformed facts.

What is the need for a surrogate key? Why is the primary key not used as the surrogate key?

A surrogate key is a system-generated unique number: an identity column in SQL Server, a sequence in Oracle, a sequence generator in Informatica. For example, suppose you are extracting customer information from an OLTP source and, after the ETL process, loading the customer information into a dimension table in the DW. With SCD Type 1 you can use the primary key of the source, CustomerID, as the primary key of the dimension table. But if you would like to preserve the history of the customer in the dimension table (Type 2), then you need another unique number apart from CustomerID; there you have to use a surrogate key. Another reason: if you have an alphanumeric CustomerID, you have to use a surrogate key in the dimension table. It is advisable to have a system-generated small integer as the surrogate key in the dimension table, so that indexing and retrieval are much faster.

What is meant by an aggregate fact table?

A fact table holding aggregated calculations such as sum, avg, or sum(sal) + sum(comm) is an aggregate fact table.

What is the purpose of a factless fact table? How is it involved in a many-to-many relationship?

A factless fact table contains only the key fields and any related dimensional bridge attributes; it doesn't contain actual measure fields used for aggregates or summaries. Factless fact tables can be used as bridge tables, or to store measures that are used directly without aggregation.

What are late-arriving facts and late-arriving dimensions? How do they impact the DW?
Late-arriving fact records: this rarely happens in practice. For example, an HDFC credit card transaction happened on 25th March 2005, but we received the record on 14th August 2007. During this period there is a possibility that the dimensional data changed. Suppose there are two tables: a transaction fact table and a customer dimension table. The transaction fact table has two timestamps: 1. transaction date, 2. load date. The customer dimension table also has two timestamps: 1. enter date (when the customer provided the details), 2. load date. In the case of late-arriving fact records, we have to get the proper key from the dimension tables, as these tables maintain the history.

In the above scenario we have to extract the right key from the customer dimension table for the particular customer, using the dimension row whose enter date is on or before the transaction date. Late-arriving dimension records: this is very common in consumer finance projects. Sometimes we get transactional data without the corresponding dimensional data. In this case we follow one of two methods: 1. Create a dummy record in the dimension table and join it with the fact table; once we receive the real data, the dummy record in the dimension table is updated using SCD Type 1. 2. Place the transactional data in a staging table until we receive the dimension records; a re-keying process is then used to get the key from the dimension table. Re-keying process: generally in an ETL process we first load dimension records and then the facts. At the time of loading fact records, keying is done against the dimension tables. The unkeyed records are kept in a staging table, and the remaining data is stored in the target table. The next day, during the keying process, the unkeyed data is also used for keying; this is called re-keying. The re-keying process continues until we get the key from the dimension tables.
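A rough sketch of the key lookup for a late-arriving fact in Python (the table layout, names and dates are invented for the example): pick the dimension version whose effective (enter) date is the latest one on or before the transaction date.

```python
from datetime import date

# Versions of one customer in a Type 2 dimension, each with its surrogate key
# and the date from which that version is effective (illustrative data).
customer_versions = [
    {"customer_sk": 101, "customer_id": "C1",
     "segment": "Silver", "enter_date": date(2004, 1, 1)},
    {"customer_sk": 205, "customer_id": "C1",
     "segment": "Gold", "enter_date": date(2006, 5, 1)},
]

def lookup_surrogate_key(versions, customer_id, transaction_date):
    """Return the surrogate key effective at the transaction date."""
    candidates = [v for v in versions
                  if v["customer_id"] == customer_id
                  and v["enter_date"] <= transaction_date]
    # Latest version that was already effective when the transaction occurred.
    return max(candidates, key=lambda v: v["enter_date"])["customer_sk"]

# A transaction from 25 Mar 2005 that arrives in Aug 2007 must key to the
# Silver-era version of the customer, not the current Gold one.
sk = lookup_surrogate_key(customer_versions, "C1", date(2005, 3, 25))
print(sk)  # 101
```

Keying on the load date instead of the transaction date would silently attach the old fact to the current dimension version, which is exactly the error the history lookup avoids.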

Explain the situations where a snowflake schema is better than a star schema.

A snowflake schema is a way to handle problems that do not fit within the star schema. It consists of outrigger tables which relate to dimensions rather than to the fact table. The amount of space taken up by dimensions is so small compared to the space required for a fact table as to be insignificant; therefore, tablespace or disk space is not considered a reason to create a snowflake schema. The main reason for creating a snowflake is to make it simpler and faster for a report writer to create drop-down boxes: rather than having to write a select distinct statement, they can simply select * from the code table. Junk dimensions and mini dimensions are another reason to add outriggers. A junk dimension contains data from a normal dimension that you wish to separate out, such as fields that change quickly; updates can be so slow that they add hours to the load process, and with a junk dimension it is possible to drop and add records rather than update them. Mini dimensions contain data that is so dissimilar between two or more source systems that it would cause a very sparse main dimension; the conformed data that can be obtained from all source systems is contained in the parent dimension, and the data from each source system that does not match is contained in the child dimension. Finally, if you are unlucky enough to have end users actually adding or updating data in the data warehouse, rather than just batch loads, it may be necessary to add these outriggers to maintain referential integrity in the data being loaded.