
What is a Data Warehouse?

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. - Bill Inmon

Subject Oriented: All relevant data about a subject is gathered and stored as a single set in a useful format. A data warehouse is therefore organized around the major subjects of the enterprise (customers, products, sales) rather than its major application areas (customer invoicing, stock control, product sales), and it excludes data that is not useful in the decision support process.

Integrated: Data is gathered into the data warehouse from a variety of sources (relational databases, flat files, legacy systems). The data warehouse provides a mechanism to store this data in a globally accepted fashion, with consistent naming conventions, measurements, encoding structures and physical attributes, even when the underlying operational systems store the data differently.
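As a minimal sketch of what a consistent encoding structure means in practice, the snippet below maps gender codes from three source systems onto a single warehouse convention. The source names, the codes and the to_warehouse_gender helper are all hypothetical and exist only for illustration.

```python
# Hypothetical per-source encodings; the warehouse stores one convention ("M"/"F").
SOURCE_GENDER_CODES = {
    "crm": {"m": "M", "f": "F"},
    "billing": {"1": "M", "0": "F"},
    "legacy": {"male": "M", "female": "F"},
}


def to_warehouse_gender(source: str, raw_value: str) -> str:
    """Translate a source-specific gender code into the warehouse encoding."""
    mapping = SOURCE_GENDER_CODES[source]
    key = raw_value.strip().lower()
    # Unknown codes are flagged as "U" rather than guessed.
    return mapping.get(key, "U")


rows = [("crm", "F"), ("billing", "1"), ("legacy", "Female"), ("billing", "x")]
integrated = [to_warehouse_gender(src, val) for src, val in rows]
print(integrated)
```

Whatever the operational systems store, every row that reaches the warehouse carries the same code set, which is what makes cross-source analysis possible.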

Time-variant: All data in the data warehouse is identified with a particular time period. Data is stored to provide information from a historical perspective.

Non-volatile: The data warehouse is read-only, so the data in it is stable. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.

Overview of Data warehouse:

The whole process shown above is known as the data warehousing process. Data from many sources, possibly of several types, is first loaded as-is into staging tables, filtered down to what is required (extraction), validated, transformed and loaded into the data warehouse (facts and dimensions), where it is later used for reporting and analysis. The main process in a data warehouse is the ETL (extract, transform and load) process, which is carried out by ETL tools. The next stage is commonly reporting, which is carried out by reporting tools.

Data Warehouse versus Data Mart

                        Data Warehouse                         Data Mart

Scope                   Application independent                Specific application
                        Centralized                            Decentralized by user area

Data                    Historical, detailed and summarized    Some history, detailed and summarized

Subjects                Multiple subjects                      One central subject of interest to users

Sources                 Many internal and external sources     Few internal and external sources

Other characteristics   Flexible                               Restrictive
                        Data oriented                          Project oriented
                        Large                                  Start small, becomes large
                        Single complex structure               Multiple semi-complex structures

Common words in data warehouse:

We shall now go through the common terms used in data warehousing, from a practical perspective.

Dimensions: Dimension tables hold the descriptive data that the fact tables refer to. Dimensions are loaded from multiple sources, so surrogate keys have to be generated for them (these are independent of any changes made to the business logic). Once loaded, the dimensions act as the source of the keys for the fact tables.

Facts: Fact tables refer to the dimensions through primary key-foreign key relationships. A fact table refers to multiple dimensions.

A fact table contains only the foreign keys that refer to the dimension tables, some amount fields and aggregated fields.

Going a level deeper, the following terms are also used in data warehousing:

Factless Fact: A fact table that contains just the foreign keys from the dimensions and no amount or aggregated fields.

Dirty Dimensions: Dimension tables whose values change very frequently, so that the dimension table needs to be updated frequently.

Junk Dimensions: A junk dimension groups miscellaneous low-cardinality flags and indicators, which do not belong to any other dimension, into a single dimension table so that they do not clutter the fact table.

In a data warehouse there may be a single fact table or multiple fact tables, depending on the requirements of the business. Generally there are multiple fact tables.

Conformed Dimensions: Dimensions that are used by more than one fact table, with the same keys and the same meaning in each, are called conformed dimensions.
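The following sketch ties several of the terms above together: a regular fact table, a factless fact table, and dimensions shared by more than one fact table. It uses Python's built-in sqlite3 module; all table and column names (dim_student, fact_fee_payment, fact_attendance and so on) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_course  (course_key  INTEGER PRIMARY KEY, title TEXT);

-- Regular fact: foreign keys plus a numeric measure.
CREATE TABLE fact_fee_payment (
    student_key INTEGER REFERENCES dim_student(student_key),
    course_key  INTEGER REFERENCES dim_course(course_key),
    amount      REAL
);

-- Factless fact: foreign keys only; a row just records that an event happened.
CREATE TABLE fact_attendance (
    student_key INTEGER REFERENCES dim_student(student_key),
    course_key  INTEGER REFERENCES dim_course(course_key)
);

INSERT INTO dim_student VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO dim_course  VALUES (10, 'SQL Basics');
INSERT INTO fact_fee_payment VALUES (1, 10, 200.0), (2, 10, 150.0);
INSERT INTO fact_attendance  VALUES (1, 10), (1, 10), (2, 10);
""")

# A measure is summed on the regular fact table.
total_fees = conn.execute("SELECT SUM(amount) FROM fact_fee_payment").fetchone()[0]

# A factless fact is analysed by counting rows per dimension member.
attendance = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM fact_attendance a
    JOIN dim_student s ON s.student_key = a.student_key
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(total_fees, attendance)
```

Both fact tables point at the same dim_student and dim_course rows, so results from the two can be compared by student or by course without any translation of keys.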

Degenerate Dimensions: A degenerate dimension is data that is dimensional in nature but is stored in the fact table rather than in a dimension table of its own, which eliminates the need to join to a dimension table. For example, a dimension that holds only Order Number and Order Line Number would have a 1:1 relationship with the fact table. Do you want two tables with a billion rows each, or one table with a billion rows? Order Number and Order Line Number are therefore stored directly in the fact table as degenerate dimensions, which also saves space.
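A small sketch of the order example, again with Python's sqlite3 module. The fact_order_line and dim_product tables and their columns are hypothetical; the point is only that the order number lives on the fact row and no order dimension table exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);

CREATE TABLE fact_order_line (
    product_key       INTEGER REFERENCES dim_product(product_key),
    order_number      TEXT,     -- degenerate dimension: there is no dim_order table
    order_line_number INTEGER,  -- degenerate dimension
    quantity          INTEGER,
    amount            REAL
);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_order_line VALUES
    (1, 'SO-1001', 1, 2, 20.0),
    (2, 'SO-1001', 2, 1, 35.0),
    (1, 'SO-1002', 1, 5, 50.0);
""")

# Group by the order number directly on the fact table; no join is needed.
order_totals = conn.execute("""
    SELECT order_number, SUM(amount)
    FROM fact_order_line
    GROUP BY order_number
    ORDER BY order_number
""").fetchall()
print(order_totals)
```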

With these terms in place, we now look at the dimensional modeling techniques used.

Dimensional Modeling: In a dimensional model, the fact table is always surrounded by its corresponding dimensions.

Star Schema: A model in which the dimensions are not subdivided is called a star schema.

Snowflake Schema: A model in which the dimensions are subdivided into a hierarchy is called a snowflake schema.
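As a rough illustration of the two layouts, the sketch below stores the same product data once as a single star dimension and once as a snowflaked product/category pair, then runs the same report against each. It uses Python's sqlite3 module, and every table and column name is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the category name is repeated on every product row.
CREATE TABLE star_dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

-- Snowflake: the category is split out into its own table.
CREATE TABLE snow_dim_category (
    category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE snow_dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT,
    category_key INTEGER REFERENCES snow_dim_category(category_key));

CREATE TABLE fact_sales (product_key INTEGER, amount REAL);

INSERT INTO star_dim_product VALUES
    (1, 'Pen', 'Stationery'), (2, 'Ink', 'Stationery'), (3, 'Mug', 'Kitchen');
INSERT INTO snow_dim_category VALUES (100, 'Stationery'), (200, 'Kitchen');
INSERT INTO snow_dim_product VALUES (1, 'Pen', 100), (2, 'Ink', 100), (3, 'Mug', 200);
INSERT INTO fact_sales VALUES (1, 5.0), (2, 7.5), (3, 12.0), (1, 2.5);
""")

# Star schema: one join from the fact table reaches the category name.
star_sql = """
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN star_dim_product p ON p.product_key = f.product_key
    GROUP BY p.category_name ORDER BY p.category_name
"""
# Snowflake schema: the same report needs a second join through the hierarchy.
snowflake_sql = """
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN snow_dim_product p ON p.product_key = f.product_key
    JOIN snow_dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name ORDER BY c.category_name
"""
star_result = conn.execute(star_sql).fetchall()
snowflake_result = conn.execute(snowflake_sql).fetchall()
print(star_result == snowflake_result, star_result)
```

Both queries return the same totals; the difference is that 'Stationery' is stored twice in the star dimension but only once in the snowflaked category table, at the cost of an extra join.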

The star schema is de-normalized, while the snowflake schema is normalized. Most projects use the star schema, because its queries need fewer and simpler joins and therefore return results faster. In a snowflake schema the joins are more complex, since more tables are involved. Where a snowflake schema can be used it makes the structure of the data easier to understand, but query performance goes down.

Using a snowflake schema, redundancy of data can be avoided and maintenance of the data becomes easier.

Slowly Changing Dimension: Dimensions whose values change over time are called slowly changing dimensions. The changing dimension problem is that the description of a client as it was at the time must still be used with the old data. To distinguish multiple snapshots of a client over a period of time, the data warehouse usually assigns a generalized (surrogate) key to these important dimensions. Slowly changing dimensions are often categorized into three types: Type 1, Type 2 and Type 3.

Type 1: Overwriting the old values. The old values are overwritten with the new ones, so no history is kept.

Type 2: Creating an additional record. The old values are not replaced; instead, a new row containing the new values is added to the table.

Type 3: Creating new fields. A new field is added so that the row shows the latest value alongside the value it replaced.

Surrogate Keys: A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key. Surrogate keys are maintained within the data warehouse, instead of the natural keys taken from the source data systems. They serve to join the dimension tables to the fact table and to identify each instance or entity inside a dimension table.
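The three types can be sketched in a few lines of plain Python. This is an illustrative model only, not how any particular ETL tool implements them: the row layout (customer_key, valid_from, valid_to, is_current, previous_city) and all function names are assumptions.

```python
from datetime import date
from itertools import count

next_key = count(1)  # surrogate keys are generated by the warehouse, not by the source


def new_row(customer_id, city, valid_from):
    """Build a dimension row; customer_id is the natural key from the source."""
    return {"customer_key": next(next_key), "customer_id": customer_id,
            "city": city, "previous_city": None,
            "valid_from": valid_from, "valid_to": None, "is_current": True}


def current_row(dim, customer_id):
    return next(r for r in dim if r["customer_id"] == customer_id and r["is_current"])


def scd_type1(dim, customer_id, city):
    """Type 1: overwrite in place; the old value is lost."""
    current_row(dim, customer_id)["city"] = city


def scd_type2(dim, customer_id, city, changed_on):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    old = current_row(dim, customer_id)
    old["valid_to"], old["is_current"] = changed_on, False
    dim.append(new_row(customer_id, city, changed_on))


def scd_type3(dim, customer_id, city):
    """Type 3: keep the prior value in an extra field on the same row."""
    row = current_row(dim, customer_id)
    row["previous_city"], row["city"] = row["city"], city


t1 = [new_row("C-7", "Pune", date(2020, 1, 1))]
scd_type1(t1, "C-7", "Delhi")

t2 = [new_row("C-7", "Pune", date(2020, 1, 1))]
scd_type2(t2, "C-7", "Delhi", date(2021, 6, 1))

t3 = [new_row("C-7", "Pune", date(2020, 1, 1))]
scd_type3(t3, "C-7", "Delhi")

print(len(t1), len(t2), t3[0]["previous_city"])
```

Only Type 2 ends up with two rows for the same natural key, which is why a surrogate key is needed: facts loaded before the change keep pointing at the old row's key, and later facts point at the new one.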

Types of Systems:

OLTP: Online Transaction Processing
OLAP: Online Analytical Processing

Difference between OLTP and OLAP:

On-Line Transaction Processing                        On-Line Analytical Processing

Continuously updated data                             Read-only snapshot
Fully normalized data model to ensure consistency     Normalization is not required for consistency
Complex data model                                    Simplified data model
Focus on single record access                         Focus on multiple record analysis
Emphasis on update                                    Emphasis on search speed
Record replication is difficult                       Replication is easy
